Data and Vectorization
Introduction
So far in this course, we’ve considered the general supervised learning scenario, in which we are given a feature matrix \(\mathbf{X}\in \mathbb{R}^{n\times p}\) and a target vector \(\mathbf{y}\in \mathbb{R}^n\). We then solve the empirical risk minimization problem in order to choose model parameters that minimize a loss function on the training data. The exact structure of this loss function depends on things like whether we are doing classification or regression, what our computational resources are, and other considerations.
But feature matrices \(\mathbf{X}\) and target vectors \(\mathbf{y}\) don’t just exist in the world: they are collected and measured. We can think of data collection and measurement as posing three fundamental questions:
- Data collection: which rows (observations) exist in \(\mathbf{X}\) and \(\mathbf{y}\)?
- Measurement: which columns (features) exist in \(\mathbf{X}\)?
- Measurement: what is the target \(\mathbf{y}\), and how is it measured?
Broadly, we can think of the complete machine learning workflow as having phases corresponding to problem definition, data collection + measurement, modeling, and evaluation. Here’s roughly how this looks:
```mermaid
flowchart TB
    subgraph problem[problem definition]
        need[identify need] --> design_collection[design data collection]
    end
    subgraph measurement[data collection + measurement]
        training[training data]
        testing[testing data]
    end
    subgraph modeling
        explore[explore data] --> engineer[engineer features]
        engineer --> design[design model]
    end
    subgraph assessment
        test --> audit
        audit --> deploy
        deploy --> evaluate
    end
    design_collection --> measurement
    training --vectorization--> modeling
    design --> assessment
    testing --vectorization--> assessment
    need --> assessment
```
So far, we’ve spent most of our time in the “modeling” module, especially the last two steps. We’ve also studied some of the ways to test and audit algorithms. Today we’re going to discuss vectorization. We can think of vectorization as what happens between the collection of raw data and the use of that data as input for models.
The reason that vectorization is necessary is that machine learning models only understand numbers. So, if our data isn’t numbers, we need to convert it into numbers in order to use it for modeling.
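For example, here is a minimal sketch, with made-up data, of one of the simplest vectorizations: one-hot encoding, in which each category of a non-numeric column becomes its own 0/1 column. The data frame and variable names (`df_demo`, `X_demo`) are hypothetical, purely for illustration.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# made-up data: a single categorical feature and a binary target
df_demo = pd.DataFrame({"color" : ["red", "blue", "red", "green"],
                        "label" : [1, 0, 1, 0]})

# fitting on the raw strings fails: sklearn can't convert "red" to a float
# LogisticRegression().fit(df_demo[["color"]], df_demo["label"])  # ValueError

# one-hot encoding turns each category into its own 0/1 column
X_demo = pd.get_dummies(df_demo[["color"]])
print(X_demo)

LogisticRegression().fit(X_demo, df_demo["label"])  # now the model can be fit
```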
What Data Needs Vectorization?
Most of it!
- If your data comes to you as a table or matrix containing only numbers, in which each row corresponds to exactly one observation, then you may not need to vectorize.
- If your data comes to you in any other form, then you need to vectorize.
Some data that usually require vectorization:
- Images
- Text
- Audio files
- Most genomic data
- Etc. etc.
There are tons of ways of vectorizing different kinds of data, and we’re not going to cover all of them. Instead, we’re going to go a little more in depth on text vectorization. We’ll discuss image vectorization much more when we get to convolutional neural networks.
Case Study: Sentiment Analysis of COVID-19 Tweets
Instead of discussing text vectorization in the abstract, let’s jump straight into an example. Sentiment analysis refers to modeling techniques that aim to quantify the emotional valence of text. For example, sentiment analysis is often used to automatically label text as “positive”/“happy” or “negative”/“sad”. The function below will download and return a set of training data used for sentiment analysis of tweets related to the COVID-19 pandemic.
```python
import pandas as pd

def grab_tweets(data_set = "train"):
    url = f"https://raw.githubusercontent.com/PhilChodrow/PIC16A/master/datasets/Corona_NLP_{data_set}.csv"
    df = pd.read_csv(url, encoding='iso-8859-1')
    df = df[["OriginalTweet", "Sentiment"]]
    return df

df_train = grab_tweets()
```
Let’s take a look at our training data:
```python
df_train.head()
```
| | OriginalTweet | Sentiment |
|---|---|---|
| 0 | @MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i... | Neutral |
| 1 | advice Talk to your neighbours family to excha... | Positive |
| 2 | Coronavirus Australia: Woolworths to give elde... | Positive |
| 3 | My food stock is not the only one which is emp... | Positive |
| 4 | Me, ready to go at supermarket during the #COV... | Extremely Negative |
Sketchy Labels
These tweets were labeled manually by the original collector of the data. As with any setting in which humans need to make subjective decisions, there is considerable room for debate. For example, here is one tweet that was labeled “Extremely Positive”:
```python
print(df_train.iloc[[40338]]["OriginalTweet"].iloc[0])
```
```
WE NEED COVID-19 TESTING FOR EVERYONE TODAY!

I have never been afraid to leave my house for a trip to the grocery store in my life. Now I am. I don't want to bring home a virus to my loved ones. It's not me, it's them.

#StayHomeSaveLives
```
Challenges that can cause sketchy labels include:
- Speed of labeling (it takes a LONG time to make high-quality labels)
- Language familiarity
- Ambiguity in the target language
- Lots more!
Almost always, when working with real-world data sets, we need to keep in mind that not only is our model approximate and our data incomplete, but the data may also be contaminated with errors that we aren’t really able to control.
Target Vectorization
Our aim is to predict the `Sentiment` of a tweet in terms of the text of the `OriginalTweet`. However, neither the text in `OriginalTweet` nor the target `Sentiment` is a number. So, we need to vectorize both.
The possible values of the `Sentiment` column are:
```python
import numpy as np
np.unique(df_train["Sentiment"])
```
```
array(['Extremely Negative', 'Extremely Positive', 'Negative', 'Neutral',
       'Positive'], dtype=object)
```
Vectorizing the target `Sentiment` is simple (although there are multiple ways). We’ll construct a new target vector which is `1` if the sentiment is `Positive` or `Extremely Positive` and `0` otherwise:
```python
target = 1*df_train["Sentiment"].str.contains("Positive")
target.head()
```
```
0    0
1    1
2    1
3    1
4    0
Name: Sentiment, dtype: int64
```
Vectorizing the predictor `OriginalTweet` is much more complicated, and here we face a number of choices.
Term Frequency (TF) Vectorization
In natural language processing (NLP), a data set of text is often called a corpus, and each observation is often called a document. Here, each document is a tweet.
One standard vectorization technique is to construct a term-document matrix. In a term-document matrix, each row corresponds to a document and each column corresponds to a “term” (usually a word) that is present in the document. The entry \(x_{ij}\) of this matrix is the number of times that term \(j\) appears in document \(i\), which we’ll call \(\mathrm{tf}_{ij}\). To construct a term-document matrix, we can use the `CountVectorizer` from `sklearn`.
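Before applying this to the tweets, here is a minimal illustration of a term-document matrix on a made-up three-document corpus (the corpus and the `_demo` variable names are hypothetical, purely for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# a tiny made-up corpus: three "documents"
corpus = ["the cat sat", "the cat sat on the mat", "the dog barked"]

cv_demo = CountVectorizer()
counts_demo = cv_demo.fit_transform(corpus)  # sparse matrix of term counts

# rows are documents, columns are terms, entries are the counts tf_ij
print(pd.DataFrame(counts_demo.toarray(), columns = cv_demo.get_feature_names_out()))
```

Each document becomes a row of counts over a shared vocabulary; the order of words within a document is discarded.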
```python
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_df = 0.2, min_df = 0.001, stop_words = 'english')
```
Here, `max_df` and `min_df` specify a range of document frequencies to include. If a term is present in almost all documents (like “the” or “of”), then this term may not be a good indication of sentiment. On the other hand, if a term appears in only one or two documents, we probably don’t have enough data to figure out whether it matters. Finally, the choice of `stop_words` tells our vectorizer to ignore common English words that are unlikely to carry much emotional meaning, like “and” or “if”.
```python
f = cv.fit(df_train["OriginalTweet"])
```
```python
counts = cv.transform(df_train["OriginalTweet"])
tdm = pd.DataFrame(counts.toarray(), columns = cv.get_feature_names_out())
```
Here’s our term-document matrix. Note that most of the entries are 0 because tweets are so short!
```python
tdm
```
| | 00 | 000 | 10 | 100 | 11 | 12 | 13 | 14 | 15 | 16 | ... | year | years | yes | yesterday | york | young | youtube | youâ | yâ | zero |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 41152 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 41153 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 41154 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 41155 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 41156 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

41157 rows × 2322 columns
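Before moving on, we can sanity-check the effect of our `max_df`, `min_df`, and `stop_words` choices by inspecting the vocabulary that the vectorizer kept. This is a quick sketch of ours, not part of the pipeline:

```python
# inspect the vocabulary retained after frequency filtering and stop word removal
vocab = cv.get_feature_names_out()
print(len(vocab))        # 2322 terms survive, matching the matrix above
print("the" in vocab)    # False: "the" is filtered as an English stop word
```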
The function below summarizes our entire data prep pipeline; we’ll need it when we get to the test set.
```python
def prep_tweets(df, vectorizer, train = True):
    if train:
        vectorizer.fit(df["OriginalTweet"])  # fit the vocabulary only when training
    X = vectorizer.transform(df["OriginalTweet"]) # term-document matrix
    y = 1*df["Sentiment"].str.contains("Positive")
    return X, y
```
```python
X_train_cv, y_train = prep_tweets(df_train, cv, train = True)
```
First Model
Let’s check on the base rate:
```python
y_train.mean()
```

```
0.4384673324100396
```
So, always guessing that a tweet is not positive would be correct about 56% of the time.
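One way to formalize this baseline is with `sklearn`’s `DummyClassifier`; the quick sketch below (the `dummy` variable is our own addition) just predicts the most frequent class:

```python
# baseline sketch: always predict the most frequent class in the training data
from sklearn.dummy import DummyClassifier

dummy = DummyClassifier(strategy = "most_frequent")
dummy.fit(X_train_cv, y_train)
dummy.score(X_train_cv, y_train)  # ≈ 0.56, i.e. 1 - y_train.mean()
```

Let’s see if we can beat this baseline using logistic regression.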
```python
from sklearn.linear_model import LogisticRegression

LR_cv = LogisticRegression()
LR_cv.fit(X_train_cv, y_train)
LR_cv.score(X_train_cv, y_train)
```

```
0.8755011298199578
```
This model achieves roughly 88% accuracy on the training data.
Inverse Document Frequency Weighting
Simple term-document matrices are good for some tasks, but in other cases it is useful to downweight terms according to their frequency in the overall training corpus. This allows our models to place greater emphasis on rarer terms, which might be more expressive of strong emotions.
In term-frequency-inverse-document-frequency (TF-IDF) weighting, the entry for term \(j\) in document \(i\) is (following the convention implemented in `sklearn`)

\[ \tilde{x}_{ij} = \overbrace{\mathrm{tf}_{ij}}^{\text{term frequency}}\times \underbrace{\mathrm{idf}_j}_{\text{inverse document frequency}}\;. \]
Here, the term frequency \(\mathrm{tf}_{ij}\) is again the number of times that term \(j\) appears in document \(i\), while the inverse document frequency \(\mathrm{idf}_j\) is computed with the formula

\[ \mathrm{idf}_j = \log \frac{1+n}{1+\mathrm{df}_j} + 1\;, \]

where \(n\) is the number of documents in the corpus and \(\mathrm{df}_j\) is the total number of documents in which term \(j\) appears. Finally, each row of the matrix \(\tilde{\mathbf{X}}\) is normalized to have unit length:

\[ x_{ij} = \frac{\tilde{x}_{ij}}{\sqrt{\sum_{j}\tilde{x}_{ij}^2}}\;. \]
These \(x_{ij}\) are then collected to form the feature matrix \(\mathbf{X}\).
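To double-check these formulas, here is a small sketch that recomputes `TfidfVectorizer`’s output by hand on a hypothetical two-document corpus (the corpus and the `_demo`, `X_sk`, `X_hand` names are made up for illustration):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus_demo = ["the cat sat", "the cat sat on the mat"]  # hypothetical corpus

# sklearn's TF-IDF matrix
X_sk = TfidfVectorizer().fit_transform(corpus_demo).toarray()

# by hand: term frequencies, then idf, then row normalization
tf  = CountVectorizer().fit_transform(corpus_demo).toarray()
n   = tf.shape[0]                          # number of documents
df  = (tf > 0).sum(axis = 0)               # df_j: documents containing term j
idf = np.log((1 + n) / (1 + df)) + 1       # smoothed inverse document frequency
X_hand = tf * idf
X_hand = X_hand / np.sqrt((X_hand**2).sum(axis = 1, keepdims = True))

print(np.allclose(X_sk, X_hand))           # True: matches the formulas above
```

Let’s try constructing a model using TF-IDF vectorization: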
```python
from sklearn.feature_extraction.text import TfidfVectorizer

tfidfv = TfidfVectorizer(max_df = 0.2, min_df = 0.001, stop_words = 'english')
X_train_tfidf, y_train = prep_tweets(df_train, tfidfv, train = True)
```
```python
LR_tfidf = LogisticRegression()
LR_tfidf.fit(X_train_tfidf, y_train)
LR_tfidf.score(X_train_tfidf, y_train)
```

```
0.8623077483781617
```
Our TF-IDF model got a slightly lower training score. At this stage, one good approach would be to choose which vectorization to use (as well as the vectorization parameters) using cross-validation.
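A minimal sketch of what that comparison might look like (illustrative only; we haven’t run a careful parameter search here):

```python
# sketch: compare the two vectorizations by 5-fold cross-validated accuracy
from sklearn.model_selection import cross_val_score

print(cross_val_score(LogisticRegression(), X_train_cv, y_train, cv = 5).mean())
print(cross_val_score(LogisticRegression(), X_train_tfidf, y_train, cv = 5).mean())
```

For now, we’ll just go ahead and grab the test set: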
```python
df_test = grab_tweets(data_set = "test")

X_test_cv, y_test = prep_tweets(df_test, vectorizer = cv, train = False)
X_test_tfidf, y_test = prep_tweets(df_test, vectorizer = tfidfv, train = False)
```
And evaluate!
print("Term-Document Frequency")
print(LR_cv.score(X_test_cv, y_test))
print("TF-IDF")
print(LR_tfidf.score(X_test_tfidf, y_test))
Term-Document Frequency
0.8412322274881516
TF-IDF
0.8370194839389152
In this case, TF-IDF did a little worse than plain term-frequency vectorization on the test set.
Model Inspection
Let’s take a moment to learn more about how our term-frequency-based model looks at the data. One good way to do this is by looking at the confusion matrix:
```python
from sklearn.metrics import confusion_matrix

y_pred = LR_cv.predict(X_test_cv)
confusion_matrix(y_test, y_pred, normalize = "true")
```

```
array([[0.89120782, 0.10879218],
       [0.23156533, 0.76843467]])
```
The false negative rate (about 23%) is noticeably higher than the false positive rate (about 11%), suggesting that our model tends to tilt negative. Let’s take a look at some tweets that our model labeled as negative even though the label was positive:
```python
false_negs = df_test[(y_pred == 0) & (y_test == 1)]["OriginalTweet"]

for t in false_negs.iloc[:5]:
    print("\n-------------------\n")
    print(t)
```
```
-------------------

That's about a week from now. A bit optimistic. Probably it will take another month. Supply chain may be recovering, demand chain will be non-existent in US and Europe for the next month or two.
$spx $qqq $es $nq https://t.co/yXcOfL0BnI

-------------------

Control over stocks and gold is lost...gold coming back very nicely! Loves wallbridge and Balmoral and warns listeners about #coronavirus Sprott Money Ltd. recently put in money to $OCG $GENM $MMG and many more... https://t.co/3aURZ2e4Sj

-------------------

#Coronavirus is "an exposure of all the holes in the social safety net," says NELP Government Affairs Director Judy Conti
#UI #Unemployment #PaidLeaveForAll
https://t.co/BrCY9IJWSv

-------------------

If you have booked a ticket to an event as part of a package holiday you will be offered an alternative or a refund by your travel provider, if it has been cancelled due to #Coronavirus.
Check ABTA's consumer Q&A at: https://t.co/oUB4MNmrNA
#COVID19 https://t.co/kMHJehS2JH

-------------------

Ok if #COVID2019 is nothing to panic about why is Italy imposing the biggest restrictions on the civilian population since WW2?
How will the supermarkets be able to provide food if all the workers are told to stay at home?
Same with any other Bussiness.
```
At this point we might have some further questions for the producer of this data set about how the labeling was done: don’t some of these tweets look like they “really” should be negative?
Word-Based Sentiment Analysis
A nice feature of linear models like logistic regression is that we can actually check the coefficient for each word in the model. This coefficient can give us important information about which words the model believes are most positive or most negative. One easy way to get at this information is to construct a data frame with the coefficients and the words:
= pd.DataFrame({"coef" : LR_cv.coef_[0], "word" : cv.get_feature_names()}) coef_df
/Users/philchodrow/opt/anaconda3/envs/ml-0451/lib/python3.9/site-packages/sklearn/utils/deprecation.py:87: FutureWarning: Function get_feature_names is deprecated; get_feature_names is deprecated in 1.0 and will be removed in 1.2. Please use get_feature_names_out instead.
warnings.warn(msg, category=FutureWarning)
Now we can obtain positive and negative words by sorting. Here are some of the most positive:
```python
coef_df.sort_values('coef', ascending = False).head(10)
```
| | coef | word |
|---|---|---|
| 223 | 3.832914 | best |
| 2280 | 3.470211 | won |
| 849 | 3.327824 | friend |
| 945 | 3.304033 | hand |
| 241 | 3.265563 | bonus |
| 916 | 3.261603 | great |
| 1541 | 3.153919 | positive |
| 701 | 3.145394 | enjoy |
| 565 | 3.088850 | dedicated |
| 2303 | 3.066329 | wow |
On the other hand, here are some of the most negative:
```python
coef_df.sort_values('coef', ascending = True).head(10)
```
| | coef | word |
|---|---|---|
| 975 | -3.691077 | hell |
| 1791 | -3.417773 | scams |
| 519 | -3.254192 | crisis |
| 588 | -2.997627 | died |
| 1134 | -2.988952 | kill |
| 1136 | -2.915697 | killing |
| 2228 | -2.910622 | war |
| 1536 | -2.847513 | poor |
| 526 | -2.832577 | crude |
| 1789 | -2.780784 | scam |
A common use for these coefficients is to assign sentiment scores to sentences. Here’s a function that does this. It works by stripping the punctuation and capitalization from a string, looking up each of its words in the coefficient dictionary, and averaging the coefficients of the words it finds.
```python
from string import punctuation

# dictionary mapping each word to its logistic regression coefficient
d = {coef_df["word"].loc[i] : coef_df["coef"].loc[i] for i in coef_df.index}

def sentiment_of_string(s):
    # remove punctuation
    no_punc = s
    for punc in punctuation:
        no_punc = no_punc.replace(punc, "")

    # lowercase, split into words, and average the coefficients of known words
    words = no_punc.lower().split()
    return np.mean([d[word] for word in words if word in d])
```
= "I love apples."
s1 = "I don't like this pandemic; it's too sad."
s2
print(sentiment_of_string(s1))
print(sentiment_of_string(s2))
```
2.8312000405630475
0.27823576549007195
```
This approach is the basis of The Hedonometer, a large-scale Twitter sentiment analysis tool from our friends at the University of Vermont.
Activity
There is a very important kind of information that is not captured by term-document matrices, even with inverse-document-frequency weighting. Consider the following two sentences:
- “I like pears, not apples.”
- “I like apples, not pears.”
Would these sentences have different representations in a term-document matrix?
© Phil Chodrow, 2023