Blog Post: Classifying Palmer Penguins

In this blog post, you’ll work through a complete example of the standard machine learning workflow. Your primary goal is to identify the smallest set of measurements necessary to confidently determine the species of a penguin.

Published

March 8, 2023

This is one of two suggested options for a blog post this week. You might want to pick this option if some of the following bullet points describe you:

  1. You are interested in practicing with machine learning models on a real data set.
  2. You are less interested in learning new machine learning theory this week.
  3. You are willing to read a little more about how to perform operations on data sets with Pandas and how to use other models implemented in scikit-learn.

The alternative has a more theoretical flavor.

The Palmer Penguins data set was collected by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network. It contains physiological measurements for a number of individuals from each of three species of penguin:

Artwork by @allison_horst

You can access the (training) data like this:

import pandas as pd

train_url = "https://raw.githubusercontent.com/middlebury-csci-0451/CSCI-0451/main/data/palmer-penguins/train.csv"
train = pd.read_csv(train_url)

Here’s how the data looks:

train.head()
| | studyName | Sample Number | Species | Region | Island | Stage | Individual ID | Clutch Completion | Date Egg | Culmen Length (mm) | Culmen Depth (mm) | Flipper Length (mm) | Body Mass (g) | Sex | Delta 15 N (o/oo) | Delta 13 C (o/oo) | Comments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PAL0708 | 27 | Gentoo penguin (Pygoscelis papua) | Anvers | Biscoe | Adult, 1 Egg Stage | N46A1 | Yes | 11/29/07 | 44.5 | 14.3 | 216.0 | 4100.0 | NaN | 7.96621 | -25.69327 | NaN |
| 1 | PAL0708 | 22 | Gentoo penguin (Pygoscelis papua) | Anvers | Biscoe | Adult, 1 Egg Stage | N41A2 | Yes | 11/27/07 | 45.1 | 14.5 | 215.0 | 5000.0 | FEMALE | 7.63220 | -25.46569 | NaN |
| 2 | PAL0910 | 124 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N67A2 | Yes | 11/16/09 | 41.4 | 18.5 | 202.0 | 3875.0 | MALE | 9.59462 | -25.42621 | NaN |
| 3 | PAL0910 | 146 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Dream | Adult, 1 Egg Stage | N82A2 | Yes | 11/16/09 | 39.0 | 18.7 | 185.0 | 3650.0 | MALE | 9.22033 | -26.03442 | NaN |
| 4 | PAL0708 | 24 | Chinstrap penguin (Pygoscelis antarctica) | Anvers | Dream | Adult, 1 Egg Stage | N85A2 | No | 11/28/07 | 50.6 | 19.4 | 193.0 | 3800.0 | MALE | 9.28153 | -24.97134 | NaN |

The “culmen” refers to a ridge on the bill of the penguin.

Artwork by @allison_horst

Your Challenge

We are going to consider the problem of predicting the species of a penguin based on its measurements.

  1. Explore: Construct at least one interesting displayed figure (e.g. using seaborn) and at least one interesting displayed table (e.g. using .groupby().aggregate(); see the sketch below). Make sure to include a helpful discussion of both the figure and the table. Don’t just show the result: explain what you learned about the data from these products.
  2. Model: Find three features of the data and a model trained on those features which achieves 100% testing accuracy. You must obtain your three features through a reproducible process. That is, you can’t just pick them: you need to code up some kind of search in order to obtain them.
    • One feature must be qualitative (like Island or Clutch Completion).
    • The other two features must be quantitative (like Body Mass (g) or Culmen Depth (mm)).
  3. Evaluate: Show the decision regions of your finished model, split out by the qualitative feature.

I’ve supplied code to help you with several parts of this task.
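To get you started on the Explore part, here is a minimal sketch of one possible table and one possible figure. The column choices are illustrations only; pick ones that let you say something interesting about the data.

import seaborn as sns

# one possible table: mean and standard deviation of the culmen
# measurements within each species
train.groupby("Species")[["Culmen Length (mm)", "Culmen Depth (mm)"]].aggregate(["mean", "std"])

# one possible figure: culmen length against culmen depth, colored by species
sns.scatterplot(data = train, x = "Culmen Length (mm)", y = "Culmen Depth (mm)", hue = "Species")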

Resources and Hints

This Webpage Runs

If you run all the code on this assignment page in order, you’ll produce the result at the bottom of the page (possibly after installing some more packages in your ml-0451 Anaconda environment). So, one good way to approach this assignment is to copy the code from this page into a Jupyter notebook and start tweaking.

Data Preparation

You will need to prepare the qualitative columns in the data. Feature columns like Sex and Island should be coded using pd.get_dummies (as illustrated in lecture). The label column Species should be coded differently, using a LabelEncoder. The following function handles this work for you.

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(train["Species"])

def prepare_data(df):
    # drop identifier and bookkeeping columns that shouldn't be used as features
    df = df.drop(["studyName", "Sample Number", "Individual ID", "Date Egg", "Comments", "Region"], axis = 1)
    # remove rows with an invalid Sex entry or with missing values
    df = df[df["Sex"] != "."]
    df = df.dropna()
    # encode the species labels as integers
    y = le.transform(df["Species"])
    df = df.drop(["Species"], axis = 1)
    # one-hot encode the remaining qualitative columns
    df = pd.get_dummies(df)
    return df, y

X_train, y_train = prepare_data(train)
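A detail that will matter later: LabelEncoder sorts class labels alphabetically, so in y_train, 0 corresponds to Adelie, 1 to Chinstrap, and 2 to Gentoo. You can always check the mapping yourself:

le.classes_ # index i is the species encoded as label i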

Choosing Features

This is where much of the work for this blog post lies. You need to choose 3 good features! One possibility is to use some of the tools described on this page. Another approach, which is ok to use on this data set, is exhaustive search of all the features contained in the data set. For this, the combinations function from the itertools package might be helpful.

USE CROSS-VALIDATION! This is your simplest way to guard against overfitting issues and get a good feeling for how your model might do on unseen data.

If you use the Island feature, you are allowed to use all of the columns that correspond to Island.

from itertools import combinations

# these are not actually all the columns: you'll 
# need to add any of the other ones you want to search for
all_qual_cols = ["Clutch Completion", "Sex"]
all_quant_cols = ['Culmen Length (mm)', 'Culmen Depth (mm)', 'Flipper Length (mm)']

for qual in all_qual_cols: 
    qual_cols = [col for col in X_train.columns if qual in col]
    for pair in combinations(all_quant_cols, 2):
        cols = qual_cols + list(pair) 
        print(cols)
        # you could train models and score them here, keeping the list of 
        # columns for the model that has the best score
['Clutch Completion_No', 'Clutch Completion_Yes', 'Culmen Length (mm)', 'Culmen Depth (mm)']
['Clutch Completion_No', 'Clutch Completion_Yes', 'Culmen Length (mm)', 'Flipper Length (mm)']
['Clutch Completion_No', 'Clutch Completion_Yes', 'Culmen Depth (mm)', 'Flipper Length (mm)']
['Sex_FEMALE', 'Sex_MALE', 'Culmen Length (mm)', 'Culmen Depth (mm)']
['Sex_FEMALE', 'Sex_MALE', 'Culmen Length (mm)', 'Flipper Length (mm)']
['Sex_FEMALE', 'Sex_MALE', 'Culmen Depth (mm)', 'Flipper Length (mm)']
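For example, here's a minimal sketch of how the scoring step inside that loop might look, using LogisticRegression and 5-fold cross_val_score. Any scikit-learn classifier could be swapped in; max_iter = 1000 is just there to head off convergence warnings on unscaled features.

import numpy as np
from itertools import combinations
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

best_score = -np.inf
best_cols = None

for qual in all_qual_cols: 
    qual_cols = [col for col in X_train.columns if qual in col]
    for pair in combinations(all_quant_cols, 2):
        cols = qual_cols + list(pair)
        # mean accuracy across 5 cross-validation folds on the training data
        score = cross_val_score(LogisticRegression(max_iter = 1000), X_train[cols], y_train, cv = 5).mean()
        if score > best_score:
            best_score = score
            best_cols = cols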

Model Choices

There are three species of penguin in the data. Most classifiers in scikit-learn will handle multi-way classification without issue. For example:

from sklearn.linear_model import LogisticRegression

# this counts as 3 features because the two Clutch Completion 
# columns are transformations of a single original measurement. 
# you should find a way to automatically select some better columns
# as suggested in the code block above
cols = ["Flipper Length (mm)", "Body Mass (g)", "Clutch Completion_No", "Clutch Completion_Yes"]

LR = LogisticRegression()
LR.fit(X_train[cols], y_train)
LR.score(X_train[cols], y_train)
0.6640625

Even though y_train contains three categories (labeled 0, 1, and 2), we’re able to fit a LogisticRegression() no problem.
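You can see the multiclass behavior directly in the predicted probabilities, which have one column per class:

LR.predict_proba(X_train[cols]).shape # (n_samples, 3): one column per species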

Since scikit-learn makes it so easy to experiment, this blog post is a great opportunity to explore some out-of-the-box models that we haven’t discussed in class. I’d suggest:

  • from sklearn.tree import DecisionTreeClassifier. This one has a max_depth parameter that controls the complexity of the model. Use cross-validation to find a good value of the parameter; see the sketch after this list.
  • from sklearn.ensemble import RandomForestClassifier. State-of-the-art before the rise of neural networks.
  • from sklearn.svm import SVC. Another state-of-the-art algorithm from before the rise of neural networks. Has a parameter gamma that controls the complexity of the model. Again, use cross-validation to select gamma. It’s important to let gamma cover a wide range of values, e.g. gamma = 10.0**np.arange(-5, 5). (The 10.0 matters: NumPy raises an error when an integer base is raised to a negative integer power.)
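Here is the promised sketch of tuning max_depth by cross-validation. The depth range 1 through 10 is an illustrative guess, not a recommendation:

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

best_depth, best_cv_score = None, -np.inf
for depth in range(1, 11):
    T = DecisionTreeClassifier(max_depth = depth)
    # mean accuracy across 5 cross-validation folds
    cv_score = cross_val_score(T, X_train[cols], y_train, cv = 5).mean()
    if cv_score > best_cv_score:
        best_depth, best_cv_score = depth, cv_score

The same pattern works for tuning gamma in SVC.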

You can find a more thorough listing of models on this page.

Testing

To test your model, you should download the test data set and prepare it using the prepare_data function. You’ll need to make sure that you subset it using only the features you choose. Here’s code that does this:

test_url = "https://raw.githubusercontent.com/middlebury-csci-0451/CSCI-0451/main/data/palmer-penguins/test.csv"
test = pd.read_csv(test_url)

X_test, y_test = prepare_data(test)
LR.score(X_test[cols], y_test)
0.6617647058823529
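If you chose your columns with a search like the one sketched earlier, the same pattern applies: fit the model on the training data using your chosen columns (best_cols below is the hypothetical variable from that sketch), then score it once on the test set.

model = LogisticRegression(max_iter = 1000)
model.fit(X_train[best_cols], y_train)
model.score(X_test[best_cols], y_test)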

Plotting Decision Regions

To plot the decision regions of your model, you can use the plot_regions function defined below. It relies only on matplotlib, numpy, and pandas, all of which you have already used on this page.

from matplotlib import pyplot as plt
import numpy as np
X_train[cols]
| | Flipper Length (mm) | Body Mass (g) | Clutch Completion_No | Clutch Completion_Yes |
|---|---|---|---|---|
| 1 | 215.0 | 5000.0 | 0 | 1 |
| 2 | 202.0 | 3875.0 | 0 | 1 |
| 3 | 185.0 | 3650.0 | 0 | 1 |
| 4 | 193.0 | 3800.0 | 1 | 0 |
| 5 | 178.0 | 2900.0 | 0 | 1 |
| ... | ... | ... | ... | ... |
| 269 | 190.0 | 3900.0 | 0 | 1 |
| 270 | 211.0 | 4800.0 | 0 | 1 |
| 271 | 187.0 | 3150.0 | 1 | 0 |
| 272 | 224.0 | 5350.0 | 0 | 1 |
| 273 | 210.0 | 4600.0 | 0 | 1 |

256 rows × 4 columns

from matplotlib.patches import Patch

def plot_regions(model, X, y):

    # the first two columns of X are the quantitative features; the
    # remaining columns are the one-hot columns of the qualitative feature
    x0 = X[X.columns[0]]
    x1 = X[X.columns[1]]
    qual_features = X.columns[2:]

    fig, axarr = plt.subplots(1, len(qual_features), figsize = (7, 3))

    # create a grid covering the observed range of the quantitative features
    grid_x = np.linspace(x0.min(), x0.max(), 501)
    grid_y = np.linspace(x1.min(), x1.max(), 501)
    xx, yy = np.meshgrid(grid_x, grid_y)

    XX = xx.ravel()
    YY = yy.ravel()

    # one panel per column of the qualitative feature
    for i in range(len(qual_features)):

        # assemble a data frame of grid points, with the ith qualitative
        # column set to 1 and all other qualitative columns set to 0
        XY = pd.DataFrame({
            X.columns[0] : XX,
            X.columns[1] : YY
        })

        for j in qual_features:
            XY[j] = 0

        XY[qual_features[i]] = 1

        p = model.predict(XY)
        p = p.reshape(xx.shape)

        # use a contour plot to visualize the predictions
        axarr[i].contourf(xx, yy, p, cmap = "jet", alpha = 0.2, vmin = 0, vmax = 2)

        # plot the data points for which the ith qualitative column is 1
        ix = X[qual_features[i]] == 1
        axarr[i].scatter(x0[ix], x1[ix], c = y[ix], cmap = "jet", vmin = 0, vmax = 2)

        axarr[i].set(xlabel = X.columns[0],
                     ylabel = X.columns[1])

    # legend mapping the label colors to species names
    patches = []
    for color, spec in zip(["red", "green", "blue"], ["Adelie", "Chinstrap", "Gentoo"]):
        patches.append(Patch(color = color, label = spec))
    plt.legend(title = "Species", handles = patches, loc = "best")

    plt.tight_layout()

Here it is in action. Depending on your package versions, this call may produce some warnings; it's safe to ignore them.

Note: this function assumes that the first two columns of the data frame you pass in are quantitative and that the columns after that are qualitative.

plot_regions(LR, X_train[cols], y_train)
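Each panel shows the model's predictions over a grid of the two quantitative features, with one panel per level of the qualitative feature; the scatter points are the training observations belonging to that panel.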

Explore!

Please feel encouraged to be creative in your choices of data visualization, predictive model, and algorithm to compute your features. I also like pictures of penguins. =)


© Phil Chodrow, 2023