Blog Post: Classifying Palmer Penguins
In this blog post, you’ll work through a complete example of the standard machine learning workflow. Your primary goal is to determine the smallest number of measurements necessary to confidently determine the species of a penguin.
This is one of two suggested options for a blog post this week. You might want to pick this option if some of the following bullet points describe you:
- You are interested in practicing with machine learning models on a real data set.
- You are less interested in learning new machine learning theory this week.
- You are willing to read a little more about how to perform operations on data sets with Pandas and how to use other models implemented in scikit-learn.
The alternative has a more theoretical flavor.
The Palmer Penguins data set was collected by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network. The data contains physiological measurements for a number of individuals from each of three species of penguin: Adélie, Chinstrap, and Gentoo.
You can access the (training) data like this:

```python
import pandas as pd

train_url = "https://raw.githubusercontent.com/middlebury-csci-0451/CSCI-0451/main/data/palmer-penguins/train.csv"
train = pd.read_csv(train_url)
```
Here’s how the data looks:
```python
train.head()
```
| | studyName | Sample Number | Species | Region | Island | Stage | Individual ID | Clutch Completion | Date Egg | Culmen Length (mm) | Culmen Depth (mm) | Flipper Length (mm) | Body Mass (g) | Sex | Delta 15 N (o/oo) | Delta 13 C (o/oo) | Comments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PAL0708 | 27 | Gentoo penguin (Pygoscelis papua) | Anvers | Biscoe | Adult, 1 Egg Stage | N46A1 | Yes | 11/29/07 | 44.5 | 14.3 | 216.0 | 4100.0 | NaN | 7.96621 | -25.69327 | NaN |
| 1 | PAL0708 | 22 | Gentoo penguin (Pygoscelis papua) | Anvers | Biscoe | Adult, 1 Egg Stage | N41A2 | Yes | 11/27/07 | 45.1 | 14.5 | 215.0 | 5000.0 | FEMALE | 7.63220 | -25.46569 | NaN |
| 2 | PAL0910 | 124 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N67A2 | Yes | 11/16/09 | 41.4 | 18.5 | 202.0 | 3875.0 | MALE | 9.59462 | -25.42621 | NaN |
| 3 | PAL0910 | 146 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Dream | Adult, 1 Egg Stage | N82A2 | Yes | 11/16/09 | 39.0 | 18.7 | 185.0 | 3650.0 | MALE | 9.22033 | -26.03442 | NaN |
| 4 | PAL0708 | 24 | Chinstrap penguin (Pygoscelis antarctica) | Anvers | Dream | Adult, 1 Egg Stage | N85A2 | No | 11/28/07 | 50.6 | 19.4 | 193.0 | 3800.0 | MALE | 9.28153 | -24.97134 | NaN |
The “culmen” refers to a ridge on the bill of the penguin.
Artwork by @allison_horst
Your Challenge
We are going to consider the problem of predicting the species of a penguin based on its measurements.
- Explore: Construct at least one interesting displayed figure (e.g. using seaborn) and at least one interesting displayed table (e.g. using `pandas.groupby().aggregate`). Make sure to include a helpful discussion of both the figure and the table. Don't just show the result: explain what you learned about the data from these products. (A starter sketch follows this list.)
- Model: Find three features of the data and a model trained on those features which achieves 100% testing accuracy. You must obtain your three features through a reproducible process. That is, you can't just pick them: you need to code up some kind of search in order to obtain them.
    - One feature must be qualitative (like `Island` or `Clutch Completion`).
    - The other two features must be quantitative (like `Body Mass (g)` or `Culmen Depth (mm)`).
- Evaluate: Show the decision regions of your finished model, split out by the qualitative feature.
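To make the Explore step concrete, here is a minimal sketch of the kind of starting point you might build on. The specific columns grouped and plotted here are just placeholder choices; part of the assignment is finding a figure and table that are actually interesting to discuss.

```python
import seaborn as sns

# a candidate table: average culmen measurements per species and island
summary = train.groupby(["Species", "Island"])[["Culmen Length (mm)", "Culmen Depth (mm)"]].aggregate("mean").round(2)
print(summary)

# a candidate figure: do two quantitative measurements separate the species?
sns.scatterplot(data = train, x = "Culmen Length (mm)", y = "Culmen Depth (mm)", hue = "Species")
```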
I’ve supplied code to help you with several parts of this task.
Resources and Hints
This Webpage Runs

If you run all the code on this assignment page in order, you'll produce the result at the bottom of the page (possibly after installing some more packages in your `ml-0451` Anaconda environment). So, one good way to approach this assignment is to take this code into a Jupyter notebook and start tweaking.
Data Preparation
You will need to prepare the qualitative columns in the data. Feature columns like `Sex` and `Island` should be coded using `pd.get_dummies` (as illustrated in lecture). The label column `Species` should be coded differently, using a `LabelEncoder`. The following function handles this work for you.
```python
from sklearn.preprocessing import LabelEncoder

# encode the species names as integer labels 0, 1, 2
le = LabelEncoder()
le.fit(train["Species"])

def prepare_data(df):
    # drop columns that we won't use as features
    df = df.drop(["studyName", "Sample Number", "Individual ID", "Date Egg", "Comments", "Region"], axis = 1)
    # remove rows with a malformed Sex entry, then rows with missing values
    df = df[df["Sex"] != "."]
    df = df.dropna()
    # encode the label column, then one-hot encode the qualitative features
    y = le.transform(df["Species"])
    df = df.drop(["Species"], axis = 1)
    df = pd.get_dummies(df)
    return df, y

X_train, y_train = prepare_data(train)
```
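One convenient detail: the fitted `LabelEncoder` remembers the mapping between integer labels and species names, which is handy when you want to interpret predictions later. For example:

```python
# the integer label i corresponds to the species name le.classes_[i]
print(le.classes_)

# decode integer labels back into species names
print(le.inverse_transform([0, 1, 2]))
```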
Choosing Features

This is where much of the work for this blog post lies. You need to choose 3 good features! One possibility is to use some of the tools described on this page. Another approach, which is OK to use on this data set, is exhaustive search of all the features contained in the data set. For this, the `combinations` function from the `itertools` package might be helpful.

USE CROSS-VALIDATION! This is your simplest way to guard against overfitting issues and get a good feeling for how your model might do on unseen data.

If you use the `Island` feature, you are allowed to use all of the columns that correspond to `Island`.
```python
from itertools import combinations

# these are not actually all the columns: you'll
# need to add any of the other ones you want to search for
all_qual_cols = ["Clutch Completion", "Sex"]
all_quant_cols = ['Culmen Length (mm)', 'Culmen Depth (mm)', 'Flipper Length (mm)']

for qual in all_qual_cols:
    qual_cols = [col for col in X_train.columns if qual in col]
    for pair in combinations(all_quant_cols, 2):
        cols = qual_cols + list(pair)
        print(cols)
        # you could train models and score them here, keeping the list of
        # columns for the model that has the best score.
```
```
['Clutch Completion_No', 'Clutch Completion_Yes', 'Culmen Length (mm)', 'Culmen Depth (mm)']
['Clutch Completion_No', 'Clutch Completion_Yes', 'Culmen Length (mm)', 'Flipper Length (mm)']
['Clutch Completion_No', 'Clutch Completion_Yes', 'Culmen Depth (mm)', 'Flipper Length (mm)']
['Sex_FEMALE', 'Sex_MALE', 'Culmen Length (mm)', 'Culmen Depth (mm)']
['Sex_FEMALE', 'Sex_MALE', 'Culmen Length (mm)', 'Flipper Length (mm)']
['Sex_FEMALE', 'Sex_MALE', 'Culmen Depth (mm)', 'Flipper Length (mm)']
```
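Here is a minimal sketch of one way to fill in the commented-out step in the loop above: score each combination of columns with 5-fold cross-validation and remember the best one. The choice of `LogisticRegression` is just an example; you can swap in any classifier.

```python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

best_score, best_cols = -1.0, None

for qual in all_qual_cols:
    qual_cols = [col for col in X_train.columns if qual in col]
    for pair in combinations(all_quant_cols, 2):
        cols = qual_cols + list(pair)
        # mean accuracy across 5 cross-validation folds
        score = cross_val_score(LogisticRegression(), X_train[cols], y_train, cv = 5).mean()
        if score > best_score:
            best_score, best_cols = score, cols

print(best_cols, best_score)
```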
Model Choices
There are three species of penguin in the data. Most classifiers in `scikit-learn` will handle multi-way classification without issue. For example:
```python
from sklearn.linear_model import LogisticRegression

# this counts as 3 features because the two Clutch Completion
# columns are transformations of a single original measurement.
# you should find a way to automatically select some better columns
# as suggested in the code block above
cols = ["Flipper Length (mm)", "Body Mass (g)", "Clutch Completion_No", "Clutch Completion_Yes"]

LR = LogisticRegression()
LR.fit(X_train[cols], y_train)
LR.score(X_train[cols], y_train)
```
```
0.6640625
```
Even though `y_train` contains three categories (labeled `0`, `1`, and `2`), we're able to fit a `LogisticRegression()` with no problem.
Since `scikit-learn` makes it so easy to experiment, this blog post is a great opportunity to explore some out-of-the-box models that we haven't discussed in class. I'd suggest:

- `from sklearn.tree import DecisionTreeClassifier`. This one has a `max_depth` parameter that controls the complexity of the model. Use cross-validation to find a good value of the parameter (see the sketch after this list).
- `from sklearn.ensemble import RandomForestClassifier`. State-of-the-art before the rise of neural networks.
- `from sklearn.svm import SVC`. Another state-of-the-art algorithm before the rise of neural networks. Has a parameter `gamma` that controls the complexity of the model. Again, use cross-validation to select `gamma`. It's important to let `gamma` cover a wide range of values, e.g. `gamma = 10.0**np.arange(-5, 5)` (note the float base: NumPy raises an error for negative integer powers of integer arrays).
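As a sketch of that kind of parameter search (assuming you've already settled on a list `cols` of feature columns, as above), you might tune `max_depth` for a decision tree like this:

```python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# estimate accuracy at each candidate depth with 5-fold cross-validation
for depth in range(1, 11):
    T = DecisionTreeClassifier(max_depth = depth)
    score = cross_val_score(T, X_train[cols], y_train, cv = 5).mean()
    print(depth, score)
```

The same pattern works for selecting `gamma` in `SVC`.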
You can find a more thorough listing of models on this page.
Testing
To test your model, you should download the test data set and prepare it using the `prepare_data` function. You'll need to make sure that you subset it using only the features you choose. Here's code that does this:
= "https://raw.githubusercontent.com/middlebury-csci-0451/CSCI-0451/main/data/palmer-penguins/test.csv"
test_url = pd.read_csv(test_url)
test
= prepare_data(test)
X_test, y_test LR.score(X_test[cols], y_test)
0.6617647058823529
Plotting Decision Regions
To plot decision regions for your model, you can use the `plot_regions` function defined below.
```python
from matplotlib import pyplot as plt
import numpy as np

X_train[cols]
```
| | Flipper Length (mm) | Body Mass (g) | Clutch Completion_No | Clutch Completion_Yes |
|---|---|---|---|---|
| 1 | 215.0 | 5000.0 | 0 | 1 |
| 2 | 202.0 | 3875.0 | 0 | 1 |
| 3 | 185.0 | 3650.0 | 0 | 1 |
| 4 | 193.0 | 3800.0 | 1 | 0 |
| 5 | 178.0 | 2900.0 | 0 | 1 |
| ... | ... | ... | ... | ... |
| 269 | 190.0 | 3900.0 | 0 | 1 |
| 270 | 211.0 | 4800.0 | 0 | 1 |
| 271 | 187.0 | 3150.0 | 1 | 0 |
| 272 | 224.0 | 5350.0 | 0 | 1 |
| 273 | 210.0 | 4600.0 | 0 | 1 |

256 rows × 4 columns
```python
from matplotlib.patches import Patch

def plot_regions(model, X, y):

    x0 = X[X.columns[0]]
    x1 = X[X.columns[1]]
    qual_features = X.columns[2:]

    fig, axarr = plt.subplots(1, len(qual_features), figsize = (7, 3))

    # create a grid
    grid_x = np.linspace(x0.min(), x0.max(), 501)
    grid_y = np.linspace(x1.min(), x1.max(), 501)
    xx, yy = np.meshgrid(grid_x, grid_y)

    XX = xx.ravel()
    YY = yy.ravel()

    for i in range(len(qual_features)):
        XY = pd.DataFrame({
            X.columns[0] : XX,
            X.columns[1] : YY
        })

        # one-hot encode the qualitative feature for this panel
        for j in qual_features:
            XY[j] = 0
        XY[qual_features[i]] = 1

        p = model.predict(XY)
        p = p.reshape(xx.shape)

        # use contour plot to visualize the predictions
        axarr[i].contourf(xx, yy, p, cmap = "jet", alpha = 0.2, vmin = 0, vmax = 2)

        # plot the data points that belong to this panel
        ix = X[qual_features[i]] == 1
        axarr[i].scatter(x0[ix], x1[ix], c = y[ix], cmap = "jet", vmin = 0, vmax = 2)

        axarr[i].set(xlabel = X.columns[0],
                     ylabel = X.columns[1])

        # build a legend matching colors to species names
        patches = []
        for color, spec in zip(["red", "green", "blue"], ["Adelie", "Chinstrap", "Gentoo"]):
            patches.append(Patch(color = color, label = spec))

        plt.legend(title = "Species", handles = patches, loc = "best")

    plt.tight_layout()
```
Here it is in action; it's safe to ignore these particular warnings. Note that the function assumes that the first two columns of the data you pass it are quantitative and that the ones after that are qualitative.

```python
plot_regions(LR, X_train[cols], y_train)
```
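It can also be instructive to draw the same panels for the test data, to see how your decision regions fare on unseen points:

```python
plot_regions(LR, X_test[cols], y_test)
```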
Explore!
Please feel encouraged to be creative in your choices of data visualization, predictive model, and algorithm to compute your features. I also like pictures of penguins. =)
Useful Resources
- An introduction to seaborn, a convenient package for data visualization with data frames.
- Data Manipulation with Pandas from the Python Data Science Handbook
- You might be interested in the explanations of how some other classifiers work, including decision trees and support vector machines, also from the Python Data Science Handbook.
© Phil Chodrow, 2023