Data and Vectorization

Author

Phil Chodrow

Introduction

So far in this course, we’ve considered the general supervised learning scenario, in which we are given a feature matrix \(\mathbf{X}\in \mathbb{R}^{n\times p}\) and a target vector \(\mathbf{y}\in \mathbb{R}^n\). We then solve the empirical risk minimization problem in order to choose model parameters that minimize a loss function on the training data. The exact structure of this loss function depends on things like whether we are doing classification or regression, what our computational resources are, and other considerations.

But feature matrices \(\mathbf{X}\) and target vectors \(\mathbf{y}\) don’t just exist in the world: they are collected and measured. We can think of data collection and measurement as posing three fundamental questions:

  • Data collection: Which rows (observations) exist in \(\mathbf{X}\) and \(\mathbf{y}\)?
  • Measurement: Which columns (features) exist in \(\mathbf{X}\)?
  • Measurement: What is the target \(\mathbf{y}\), and how is it measured?

Broadly, we can think of the complete machine learning workflow as having phases corresponding to problem definition, data collection + measurement, modeling, and evaluation. Here’s roughly how this looks:

flowchart TB

    subgraph problem[problem definition]
        need[identify need]-->design_collection[design data collection]
    end
    subgraph measurement[data collection + measurement]
        training[training data] 
        testing[testing data]
    end
    subgraph modeling
        explore[explore data] --> engineer[engineer features]
        engineer --> design[design model]
    end
    subgraph assessment
        test --> audit
        audit --> deploy
        deploy-->evaluate
    end
    design_collection-->measurement
    training --vectorization--> modeling
    design --> assessment
    testing --vectorization--> assessment
    need-->assessment

So far, we’ve spent most of our time in the “modeling” module, especially the last two steps. We’ve also studied some of the ways to test and audit algorithms. Today we’re going to discuss vectorization. We can think of vectorization as what happens between the collection of raw data and the use of that data as input for models.

Definition 1 (Vectorization) Vectorization is the act of assigning to each data observation a vector \(\mathbf{x}\), thus forming a feature matrix \(\mathbf{X}\). Formally, a vectorization map is a function \(v:\mathcal{D}\rightarrow \mathbb{R}^p\) such that, if \(d \in \mathcal{D}\) is a data observation, then \(\mathbf{x}= v(d)\) is the vector of features corresponding to \(d\).

The reason that vectorization is necessary is that machine learning models only understand numbers. So, if our data isn’t numbers, we need to convert it into numbers in order to use it for modeling.
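As a toy illustration of Definition 1, here is a minimal, hand-rolled vectorization map \(v:\mathcal{D}\rightarrow \mathbb{R}^3\) for text documents. The three features are arbitrary choices made purely for illustration:

import numpy as np

# a toy vectorization map v : D -> R^3; the three features are arbitrary illustrative choices
def v(doc):
    words = doc.lower().split()
    return np.array([
        len(words),                               # number of words
        sum(len(w) for w in words) / len(words),  # average word length
        doc.count("!")                            # number of exclamation marks
    ])

v("Machine learning models only understand numbers!")
# roughly array([6., 7.17, 1.])

Real vectorization schemes are far more sophisticated than this sketch, but they all share its basic shape: raw data in, vector of numbers out.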

What Data Needs Vectorization?

Most of it!

  • If your data comes to you as a table or matrix containing only numbers, in which each row corresponds to exactly one observation, then you may not need to vectorize.
  • If your data comes to you in any other form, then you need to vectorize.

Some data that usually require vectorization:

  • Images
  • Text
  • Audio files
  • Most genomic data
  • Etc. etc.

There are tons of ways of vectorizing different kinds of data, and we’re not going to cover all of them. Instead, we’re going to go a little more in depth on text vectorization. We’ll discuss image vectorization much more when we get to convolutional neural networks.

For your projects, depending on the data you want to work with, you may need to research vectorization schemes appropriate to your data.

Case Study: Sentiment Analysis of COVID-19 Tweets

Instead of discussing text vectorization in the abstract, let’s jump straight into an example. Sentiment analysis refers to modeling techniques that aim to capture the emotional valence of text. For example, sentiment analysis is often used to automatically label text as “positive”/“happy” or “negative”/“sad”. The function below will download and return a set of training data used for sentiment analysis of tweets related to the COVID-19 pandemic.

I retrieved this data from its original posting on Kaggle.
import pandas as pd

def grab_tweets(data_set = "train"):
    # download the labeled tweets (train or test split) from GitHub
    url = f"https://raw.githubusercontent.com/PhilChodrow/PIC16A/master/datasets/Corona_NLP_{data_set}.csv"
    df = pd.read_csv(url, encoding='iso-8859-1') 
    # keep only the tweet text and its sentiment label
    df = df[["OriginalTweet", "Sentiment"]]
    return df
    
df_train = grab_tweets()

Let’s take a look at our training data:

df_train.head()
OriginalTweet Sentiment
0 @MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i... Neutral
1 advice Talk to your neighbours family to excha... Positive
2 Coronavirus Australia: Woolworths to give elde... Positive
3 My food stock is not the only one which is emp... Positive
4 Me, ready to go at supermarket during the #COV... Extremely Negative
Activity

Chat with your group. What are three questions you have about how the data was collected?

Sketchy Labels

These tweets were labeled manually by the original collector of the data. As with any setting in which humans need to make subjective decisions, there is considerable possibility for debate. For example, here is one tweet that was labeled “Extremely Positive”:

print(df_train["OriginalTweet"].iloc[40338])
WE NEED COVID-19 TESTING FOR EVERYONE TODAY!
I have never been afraid to leave my house for a trip to the grocery store in my life. Now I am. I don't want to bring home a virus to my loved ones. It's not me, it's them.
#StayHomeSaveLives

Challenges that can cause sketchy labels include:

  • Speed of labeling (it takes a LONG time to make high-quality labels)
  • Language familiarity
  • Ambiguity in the target language
  • Lots more!

Almost always, when working with real-world data sets, we need to keep in mind that not only is our model approximate and our data incomplete, but the data may also be contaminated with errors that we aren’t really able to control.

See Northcutt, Athalye, and Mueller (2021) for much more on label errors in common machine learning benchmarks.

Target Vectorization

Our aim is to predict the Sentiment from the text of the OriginalTweet. However, neither the predictor OriginalTweet nor the target Sentiment is numeric. So, we need to vectorize.

The possible values of the Sentiment column are

import numpy as np
np.unique(df_train["Sentiment"])
array(['Extremely Negative', 'Extremely Positive', 'Negative', 'Neutral',
       'Positive'], dtype=object)

Vectorizing the target Sentiment is simple, although there are multiple ways (one alternative is sketched below). We’ll construct a new target vector which is 1 if the sentiment is Positive or Extremely Positive and 0 otherwise:

target = 1*df_train["Sentiment"].str.contains("Positive")
target.head()
0    0
1    1
2    1
3    1
4    0
Name: Sentiment, dtype: int64
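If we instead wanted to keep all five classes (say, for a multiclass model), one alternative, sketched here with sklearn’s LabelEncoder, is to encode each label as an integer:

from sklearn.preprocessing import LabelEncoder

# alternative target vectorization: all five sentiment classes as integers 0 through 4
le = LabelEncoder()
y_multi = le.fit_transform(df_train["Sentiment"])
le.classes_  # the five label strings, in alphabetical order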

Vectorizing the predictor OriginalTweet is much more complicated, and here we face a number of choices.

Term Frequency (TF) Vectorization

In natural language processing (NLP), a data set of text is often called a corpus, and each observation is often called a document. Here, each document is a tweet.

One standard vectorization technique is to construct a term-document matrix. In a term-document matrix, each row corresponds to a document and each column corresponds to a “term” (usually a word) that appears somewhere in the corpus. The entry \(x_{ij}\) of this matrix is the number of times that term \(j\) appears in document \(i\), which we’ll call \(\mathrm{tf}_{ij}\). To construct a term-document matrix, we can use the CountVectorizer from sklearn.

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_df = 0.2, min_df = 0.001, stop_words = 'english')

Here, max_df and min_df specify a range of frequencies to include. If a term is present in almost all documents (like “the” or “of”), then this term may not be a good indication of sentiment. On the other hand, if a term appears in only one or two documents, we probably don’t have enough data to figure out whether it matters. Finally, the choice of stop_words tells our vectorizer to ignore common English words that are unlikely to carry much emotional meaning, like “and” or “if”.
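To get a feel for what the vectorizer produces, here’s a quick sketch on a made-up three-document corpus (using default settings rather than the parameters above):

from sklearn.feature_extraction.text import CountVectorizer

# hypothetical mini-corpus: three very short "documents"
toy_corpus = ["the cat sat", "the dog sat", "the dog barked"]
toy_cv = CountVectorizer()
toy_tdm = toy_cv.fit_transform(toy_corpus)
print(toy_cv.get_feature_names_out())  # ['barked' 'cat' 'dog' 'sat' 'the']
print(toy_tdm.toarray())               # one row per document, one column per term

Now let’s fit the real vectorizer on the tweets: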

cv.fit(df_train["OriginalTweet"])
counts = cv.transform(df_train["OriginalTweet"])
tdm = pd.DataFrame(counts.toarray(), columns = cv.get_feature_names_out())

Here’s our term-document matrix. Note that most of the entries are 0 because tweets are so short!

tdm
00 000 10 100 11 12 13 14 15 16 ... year years yes yesterday york young youtube youâ zero
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
41152 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
41153 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
41154 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
41155 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
41156 2 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

41157 rows × 2322 columns

The function below summarizes our entire data prep pipeline, which we’ll need for when we get to the test set.

def prep_tweets(df, vectorizer, train = True):
    if train: 
        # fit the vectorizer on the training text only
        vectorizer.fit(df["OriginalTweet"])
    X = vectorizer.transform(df["OriginalTweet"]) # term-document matrix
    y = 1*df["Sentiment"].str.contains("Positive") # binary target vector

    return X, y
X_train_cv, y_train = prep_tweets(df_train, cv, train = True)

First Model

Let’s check on the base rate:

y_train.mean()
0.4384673324100396
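As a quick sanity check, we can make this baseline concrete with sklearn’s DummyClassifier (a minimal sketch):

from sklearn.dummy import DummyClassifier

# trivial baseline: always predict the most common class ("not positive")
dummy = DummyClassifier(strategy = "most_frequent")
dummy.fit(X_train_cv, y_train)
dummy.score(X_train_cv, y_train)  # the majority-class rate, about 0.56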

So, always guessing that a tweet is not positive would be correct 56% of the time. Let’s see if we can beat this using logistic regression.

from sklearn.linear_model import LogisticRegression
LR_cv = LogisticRegression()
LR_cv.fit(X_train_cv, y_train)
LR_cv.score(X_train_cv, y_train)
0.8755011298199578

This model achieves about 88% accuracy on the training data, comfortably above the 56% baseline.

Inverse Document Frequency Weighting

Simple term-document matrices are good for some tasks, but in other cases it is useful to downweight terms according to their frequency in the overall training corpus. This allows our models to place greater emphasis on rarer terms, which might be more expressive of strong emotions.

In term-frequency-inverse-document-frequency (TF-IDF) weighting, the entry for term \(j\) in document \(i\) is

Exact details of TF-IDF weightings differ; this is the one implemented by default in sklearn.

\[ \tilde{x}_{ij} = \overbrace{\mathrm{tf}_{ij}}^{\text{term frequency}}\times \underbrace{\mathrm{idf}_j}_{\text{inverse document frequency}}\;. \]

Here, the term frequency \(\mathrm{tf}_{ij}\) is again the number of times that term \(j\) appears in document \(i\), while the inverse document frequency \(\mathrm{idf}_j\) is computed with the formula

\[ \mathrm{idf}_j = \log \frac{1+n}{1+\mathrm{df}_j} + 1\;, \] where \(n\) is the number of documents and \(\mathrm{df}_j\) is the total number of documents in which term \(j\) appears. Finally, each row of the weighted matrix is normalized to have unit length:

\[ x_{ij} = \frac{\tilde{x}_{ij}}{\sqrt{\sum_{j}\tilde{x}_{ij}^2}}\;. \]
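To check that these formulas match what sklearn actually computes, here’s a small sketch that reproduces the default TF-IDF weighting by hand on a made-up mini-corpus and compares it against TfidfVectorizer:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# made-up mini-corpus, just for checking the formulas
toy = ["apples are good", "apples are bad", "pears are good"]

tf = CountVectorizer().fit_transform(toy).toarray()      # tf_ij: count of term j in document i
n = tf.shape[0]                                          # number of documents
doc_freq = (tf > 0).sum(axis = 0)                        # df_j: number of documents containing term j
idf = np.log((1 + n) / (1 + doc_freq)) + 1               # sklearn's default smoothed idf
x = tf * idf
x = x / np.sqrt((x**2).sum(axis = 1, keepdims = True))   # normalize each row to unit length

X = TfidfVectorizer().fit_transform(toy).toarray()
print(np.allclose(x, X))                                 # True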

Collecting the \(x_{ij}\) yields the feature matrix \(\mathbf{X}\). Let’s try constructing a model using TF-IDF vectorization:

from sklearn.feature_extraction.text import TfidfVectorizer
tfidfv = TfidfVectorizer(max_df = 0.2, min_df = 0.001, stop_words = 'english')
X_train_tfidf, y_train = prep_tweets(df_train, tfidfv, train = True)
LR_tfidf = LogisticRegression()
LR_tfidf.fit(X_train_tfidf, y_train)
LR_tfidf.score(X_train_tfidf, y_train)
0.8623077483781617

Our TF-IDF model got a lower training score. At this stage, one good approach would be to choose which vectorization to use (as well as the vectorization parameters) using cross-validation. For now, we’ll just go ahead and grab the test set:

df_test = grab_tweets(data_set = "test")
X_test_cv, y_test = prep_tweets(df_test, vectorizer = cv, train = False)
X_test_tfidf, y_test = prep_tweets(df_test, vectorizer = tfidfv, train = False)

And evaluate!

print("Term-Document Frequency")
print(LR_cv.score(X_test_cv, y_test))
print("TF-IDF")
print(LR_tfidf.score(X_test_tfidf, y_test))
Term-Document Frequency
0.8412322274881516
TF-IDF
0.8370194839389152

In this case, TF-IDF did a little worse than term-document frequency vectorization on the test set.

Model Inspection

Let’s take a moment to learn more about how our term-document frequency-based model looks at the data. One good way to do this is by looking at the confusion matrix:

from sklearn.metrics import confusion_matrix
y_pred = LR_cv.predict(X_test_cv)
confusion_matrix(y_test, y_pred, normalize = "true")
array([[0.89120782, 0.10879218],
       [0.23156533, 0.76843467]])

The false negative rate (about 23%) is much higher than the false positive rate (about 11%), suggesting that our model tends to tilt negative. Let’s take a look at some tweets that our model labeled as negative even though the label was positive:

false_negs = df_test[(y_pred == 0) & (y_test == 1)]["OriginalTweet"]

for t in false_negs.iloc[:5]: 
    print("\n-------------------\n")
    print(t)

-------------------

That's about a week from now. A bit optimistic.  Probably it will take another month.  Supply chain may be recovering, demand chain will be non-existent in US and Europe for the next month or two.
$spx $qqq $es $nq https://t.co/yXcOfL0BnI

-------------------

Control over stocks and gold is lost...gold coming back very nicely! Loves wallbridge and Balmoral and warns listeners about #coronavirus Sprott Money Ltd. recently put in money to $OCG $GENM $MMG and many more... https://t.co/3aURZ2e4Sj

-------------------

#Coronavirus is "an exposure of all the holes in the social safety net," says NELP Government Affairs Director Judy Conti

#UI #Unemployment #PaidLeaveForAll
https://t.co/BrCY9IJWSv

-------------------

If you have booked a ticket to an event as part of a package holiday you will be offered an alternative or a refund by your travel provider, if it has been cancelled due to #Coronavirus.

Check ABTA's consumer Q&A at: https://t.co/oUB4MNmrNA

#COVID19 https://t.co/kMHJehS2JH

-------------------

Ok if #COVID2019 is nothing to panic about why is Italy imposing the biggest restrictions on the civilian population since WW2? 
How will the supermarkets be able to provide food if all the workers are told to stay at home? 
Same with any other Bussiness.

At this point we might have some further questions for the producer of this data set about how the labeling was done: don’t some of these tweets look like they “really” should be negative?

Word-Based Sentiment Analysis

A nice feature of linear models like logistic regression is that we can actually check the coefficient for each word in the model. This coefficient can give us important information about which words the model believes are most positive or most negative. One easy way to get at this information is to construct a data frame with the coefficients and the words:

coef_df = pd.DataFrame({"coef" : LR_cv.coef_[0], "word" : cv.get_feature_names_out()})

Now we can obtain positive and negative words by sorting. Here are some of the good ones:

coef_df.sort_values('coef', ascending = False).head(10)
coef word
223 3.832914 best
2280 3.470211 won
849 3.327824 friend
945 3.304033 hand
241 3.265563 bonus
916 3.261603 great
1541 3.153919 positive
701 3.145394 enjoy
565 3.088850 dedicated
2303 3.066329 wow

On the other hand, here are some of the negative ones:

coef_df.sort_values('coef', ascending = True).head(10)
coef word
975 -3.691077 hell
1791 -3.417773 scams
519 -3.254192 crisis
588 -2.997627 died
1134 -2.988952 kill
1136 -2.915697 killing
2228 -2.910622 war
1536 -2.847513 poor
526 -2.832577 crude
1789 -2.780784 scam

A common use for these coefficients is to assign sentiment scores to sentences. Here’s a function that does this. It works by first stripping the punctuation and capitalization from a string, and then looking up each of its individual words in a dictionary.

from string import punctuation 

# dictionary mapping each word in the model's vocabulary to its coefficient
d = {coef_df["word"].loc[i] : coef_df["coef"].loc[i] for i in coef_df.index}

def sentiment_of_string(s):
    # strip punctuation
    no_punc = s
    for punc in punctuation:
        no_punc = no_punc.replace(punc, "")
    
    # lowercase, split into words, and average the coefficients of known words
    words = no_punc.lower().split()
    return np.mean([d[word] for word in words if word in d])
s1 = "I love apples."
s2 = "I don't like this pandemic; it's too sad."

print(sentiment_of_string(s1))
print(sentiment_of_string(s2))
2.8312000405630475
0.27823576549007195

This approach is the basis of The Hedonometer, a large-scale Twitter sentiment analysis tool from our friends at the University of Vermont.

Activity

There is a very important kind of information that is not captured by term-document matrices, even with inverse-document-frequency weighting. Consider the following two sentences:

  1. “I like pears, not apples.”
  2. “I like apples, not pears.”

Would these sentences have different representations in a term-document matrix?
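If you’d like to check your answer empirically, here is a short sketch that vectorizes both sentences with CountVectorizer’s default settings:

from sklearn.feature_extraction.text import CountVectorizer

# vectorize both sentences and compare their rows
sentences = ["I like pears, not apples.", "I like apples, not pears."]
cv_check = CountVectorizer()
X_check = cv_check.fit_transform(sentences).toarray()
print(cv_check.get_feature_names_out())
print(X_check)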



© Phil Chodrow, 2023

References

Northcutt, Curtis G., Anish Athalye, and Jonas Mueller. 2021. “Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks.” In Proceedings of the 35th Conference on Neural Information Processing Systems Track on Datasets and Benchmarks.