PalmerPenguins.jl

Are you looking for a dataset for data exploration and visualization? Maybe you should consider the Palmer penguins dataset, which was recently published as an R package (Horst, Hill, & Gorman, 2020). I created the Julia package PalmerPenguins.jl to simplify its use with the Julia programming language and increase its adoption within the Julia community.

TL;DR

The Palmer penguins dataset is an alternative to the controversial iris dataset for data exploration and visualization (but, of course, not the only one). The Julia package PalmerPenguins.jl provides access to the raw and simplified versions of this dataset, similar to the original R package, without having to download and parse the raw data manually.

Palmer penguins dataset

The Palmer penguins dataset was proposed as an alternative to Fisher's (1936) iris dataset for data exploration and visualization.

Fisher was a vocal proponent of eugenics and published the iris dataset in the Annals of Eugenics in 1936 (!). Hence there is growing sentiment in the scientific community that the use of the iris dataset is inappropriate.

One does not publish in the Annals of Eugenics in 1936 on a misunderstanding.

By using this dataset in 2020, we are sending a very strong message.

Timothée Poisot

Many people using iris will be unaware that it was first published in work by R A Fisher, a eugenicist with vile and harmful views on race. In fact, the iris dataset was originally published in the Annals of Eugenics. It is clear to me that knowingly using work that was itself used in pursuit of racist ideals is totally unacceptable.

Megan Stodel

I’ve long known about Ronald Fisher’s eugenicist past, but I admit that I have often thoughtlessly turned to iris when needing a small, boring data set to demonstrate a coding or data principle.

But Daniella and Timothée Poisot are right: it’s time to retire iris.

Garrick Aden-Buie

Apart from that, the iris dataset is quite boring. It contains no missing values, and:

With the exception of one or two points, the classes are linearly separable, and so classification algorithms reach almost perfect accuracy.

Timothée Poisot

The Palmer penguins dataset consists of measurements of 344 penguins from three islands in the Palmer Archipelago, Antarctica, that were collected by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER (Gorman, Williams, & Fraser, 2014). The simplified version of the dataset contains at most seven measurements for each penguin: the species (Adelie, Chinstrap, or Gentoo), the island (Torgersen, Biscoe, or Dream), the bill length (in mm), the bill depth (in mm), the flipper length (in mm), the body mass (in g), and the sex (male or female). In total, 19 measurements are missing.

Palmer penguins. Artwork by @allison_horst.

Julia package

The Julia package PalmerPenguins.jl is available in the standard Julia package registry, so you can install it and load it in the usual way by running

julia> import Pkg; Pkg.add("PalmerPenguins")

julia> using PalmerPenguins

in the Julia REPL. The package uses DataDeps.jl to download a fixed (and hence reproducible) version of the dataset once instead of including a copy of the original dataset.
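In non-interactive settings such as scripts or CI jobs, the first download can stall on the confirmation prompt that DataDeps.jl shows by default. The following is a minimal sketch, assuming that the general DataDeps.jl environment variable DATADEPS_ALWAYS_ACCEPT applies to this package's data dependency in the same way as to any other; it uses PalmerPenguins.load(), which is introduced in the next paragraph.

```julia
# Accept DataDeps.jl download prompts automatically.
# This has to be set before the dataset is requested for the first time.
ENV["DATADEPS_ALWAYS_ACCEPT"] = "true"

using PalmerPenguins

# The dataset is fetched the first time it is needed;
# later calls reuse the local copy instead of downloading again.
table = PalmerPenguins.load()
```

In an interactive REPL session this is not needed, since the prompt can simply be answered by hand.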

As explained in the package's README, the simplified and the raw version of the Palmer penguins dataset can be loaded in a Tables.jl-compatible format. We can inspect the names and types of the features in the simplified and the raw version by running

using PalmerPenguins
using Tables

const TABLE = PalmerPenguins.load()

Tables.schema(TABLE)
Tables.Schema:
 :species            CSV.PooledString
 :island             CSV.PooledString
 :bill_length_mm     Union{Missing, Float64}
 :bill_depth_mm      Union{Missing, Float64}
 :flipper_length_mm  Union{Missing, Int64}
 :body_mass_g        Union{Missing, Int64}
 :sex                Union{Missing, CSV.PooledString}

and

const TABLE_RAW = PalmerPenguins.load(; raw = true)

Tables.schema(TABLE_RAW)
Tables.Schema:
 :studyName                     CSV.PooledString
 Symbol("Sample Number")        Int64
 :Species                       CSV.PooledString
 :Region                        CSV.PooledString
 :Island                        CSV.PooledString
 :Stage                         CSV.PooledString
 Symbol("Individual ID")        String
 Symbol("Clutch Completion")    Bool
 Symbol("Date Egg")             Dates.Date
 Symbol("Culmen Length (mm)")   Union{Missing, Float64}
 Symbol("Culmen Depth (mm)")    Union{Missing, Float64}
 Symbol("Flipper Length (mm)")  Union{Missing, Int64}
 Symbol("Body Mass (g)")        Union{Missing, Int64}
 :Sex                           Union{Missing, CSV.PooledString}
 Symbol("Delta 15 N (o/oo)")    Union{Missing, Float64}
 Symbol("Delta 13 C (o/oo)")    Union{Missing, Float64}
 :Comments                      Union{Missing, CSV.PooledString}

The two schemas also show that the feature names in the simplified dataset are normalized to lowercase characters without whitespace or brackets.
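The difference in naming matters when columns are accessed by name: names such as "Culmen Length (mm)" are not valid Julia identifiers, so they cannot be written as a plain :symbol literal. The sketch below illustrates this on a tiny hand-written stand-in for the raw table (the name raw_excerpt and its three rows are made up for illustration; a NamedTuple of vectors is a valid Tables.jl column table).

```julia
using Tables

# Hypothetical stand-in for the raw table with two of its column names.
raw_excerpt = NamedTuple{(:Island, Symbol("Culmen Length (mm)"))}((
    ["Torgersen", "Torgersen", "Torgersen"],
    [39.1, 39.5, 40.3],
))

# Names that are valid identifiers can be written as plain symbols.
islands = Tables.getcolumn(raw_excerpt, :Island)

# Names with spaces or brackets need an explicit Symbol("...") ...
culmen = Tables.getcolumn(raw_excerpt, Symbol("Culmen Length (mm)"))

# ... or the var"..." syntax for property access (Julia 1.3 and later).
culmen_again = raw_excerpt.var"Culmen Length (mm)"
```

The same two access patterns should carry over to TABLE_RAW, whereas every column of the simplified TABLE can be reached with a plain symbol such as :bill_length_mm.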

You might want to convert the tables to a DataFrame object for downstream analyses. The following code extracts the first five rows of the simplified dataset:

using DataFrames

first(DataFrame(TABLE), 5)
5×7 DataFrame
│ Row │ species │ island    │ bill_length_mm │ bill_depth_mm │ flipper_length_mm │ body_mass_g │ sex     │
│     │ String  │ String    │ Float64?       │ Float64?      │ Int64?            │ Int64?      │ String? │
├─────┼─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼─────────┤
│ 1   │ Adelie  │ Torgersen │ 39.1           │ 18.7          │ 181               │ 3750        │ male    │
│ 2   │ Adelie  │ Torgersen │ 39.5           │ 17.4          │ 186               │ 3800        │ female  │
│ 3   │ Adelie  │ Torgersen │ 40.3           │ 18.0          │ 195               │ 3250        │ female  │
│ 4   │ Adelie  │ Torgersen │ missing        │ missing       │ missing           │ missing     │ missing │
│ 5   │ Adelie  │ Torgersen │ 36.7           │ 19.3          │ 193               │ 3450        │ female  │
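Once the data is in a DataFrame, the usual DataFrames.jl tools apply. As a rough sketch of a typical next step, the code below drops incomplete rows and computes a grouped mean. To stay runnable offline it re-types the five rows printed above by hand; the names excerpt, complete, and mass_by_sex are chosen for illustration only.

```julia
using DataFrames
using Statistics

# The five rows printed above, re-typed by hand.
excerpt = DataFrame(
    species = fill("Adelie", 5),
    island = fill("Torgersen", 5),
    bill_length_mm = [39.1, 39.5, 40.3, missing, 36.7],
    bill_depth_mm = [18.7, 17.4, 18.0, missing, 19.3],
    flipper_length_mm = [181, 186, 195, missing, 193],
    body_mass_g = [3750, 3800, 3250, missing, 3450],
    sex = ["male", "female", "female", missing, "female"],
)

# Keep only the rows in which every column is observed.
complete = dropmissing(excerpt)

# Mean body mass per sex among the complete rows;
# sort = true makes the order of the groups deterministic.
mass_by_sex = combine(
    groupby(complete, :sex; sort = true),
    :body_mass_g => mean => :mean_body_mass_g,
)
```

Replacing excerpt with DataFrame(TABLE) should apply the same steps to all 344 penguins.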

Data can also be extracted through the Tables.jl interface without creating a DataFrame object, as shown in the following visualizations of the Palmer penguins dataset. The plots replicate the official examples (even interactively!).

using ColorBrewer
using PlotlyJS

# One color per species, taken from the ColorBrewer "Dark2" palette
const COLORS = palette("Dark2", 3)

# Scatter plot of body mass against flipper length; the "groupby" transform
# splits the points by species and assigns each group its own color
trace = scatter(
    mode = "markers",
    x = Tables.getcolumn(TABLE, :flipper_length_mm),
    y = Tables.getcolumn(TABLE, :body_mass_g),
    transforms = [
        attr(
            type = "groupby",
            groups = Tables.getcolumn(TABLE, :species),
            styles = [
                attr(target = "Gentoo", value_marker_color = COLORS[1]),
                attr(target = "Adelie", value_marker_color = COLORS[2]),
                attr(target = "Chinstrap", value_marker_color = COLORS[3]),
            ],
        ),
    ],
)

layout = Layout(;
    title = "Flipper length and body mass",
    xaxis = attr(title = "Flipper length (mm)"),
    yaxis = attr(title = "Body mass (g)"),
)
p = PlotlyJS.plot([trace], layout)

# Overlaid histograms of the flipper length, again grouped and colored by species
trace = histogram(
    x = Tables.getcolumn(TABLE, :flipper_length_mm),
    opacity = 0.75,
    transforms = [
        attr(
            type = "groupby",
            groups = Tables.getcolumn(TABLE, :species),
            styles = [
                attr(target = "Gentoo", value_marker_color = COLORS[1]),
                attr(target = "Adelie", value_marker_color = COLORS[2]),
                attr(target = "Chinstrap", value_marker_color = COLORS[3]),
            ],
        ),
    ],
)

layout = Layout(;
    title = "Flipper length",
    xaxis = attr(title = "Flipper length (mm)"),
    yaxis = attr(title = "Frequency"),
    # Draw the per-species histograms on top of each other instead of stacking them
    barmode = "overlay",
)
p = PlotlyJS.plot([trace], layout)
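The column access used for the plots also works for quick numeric checks without a DataFrame. As a minimal sketch, the hypothetical helper count_missing below counts the missing entries per column of any Tables.jl-compatible table. It is demonstrated on the first five rows of the simplified dataset, written as a NamedTuple of vectors so that the example runs offline.

```julia
using Tables

# Count the missing entries in every column of a Tables.jl-compatible table.
# (count_missing is an illustrative helper, not part of PalmerPenguins.jl.)
function count_missing(table)
    cols = Tables.columns(table)
    return Dict(
        name => count(ismissing, Tables.getcolumn(cols, name))
        for name in Tables.columnnames(cols)
    )
end

# First five rows of the simplified dataset as a column table.
excerpt_table = (
    species = fill("Adelie", 5),
    island = fill("Torgersen", 5),
    bill_length_mm = [39.1, 39.5, 40.3, missing, 36.7],
    bill_depth_mm = [18.7, 17.4, 18.0, missing, 19.3],
    flipper_length_mm = [181, 186, 195, missing, 193],
    body_mass_g = [3750, 3800, 3250, missing, 3450],
    sex = ["male", "female", "female", missing, "female"],
)

missing_counts = count_missing(excerpt_table)
total_missing = sum(values(missing_counts))
```

For the excerpt, every measurement column has exactly one missing entry. Calling count_missing(TABLE) on the full simplified dataset should add up to the 19 missing measurements mentioned earlier.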

References

  • Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2), 179–188. doi:10.1111/j.1469-1809.1936.tb02137.x

  • Gorman, K. B., Williams, T. D., & Fraser, W. R. (2014). Ecological sexual dimorphism and environmental variability within a community of Antarctic penguins (genus Pygoscelis). PLoS ONE, 9(3), e90081. doi:10.1371/journal.pone.0090081

  • Horst, A. M., Hill, A. P., & Gorman, K. B. (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. doi:10.5281/zenodo.3960218