# Data visualisation and hawks with ggplot2

Conferences and meetups have been important throughout my data science career - especially when I was starting out and learning so much, but still critical as ways of keeping on top of new developments and making new connections. It’s been a big change having these in-person events stop throughout lockdown. I was particularly sad to hear that the in-person elements of useR!2020, which I was going to attend as a diversity scholar, were necessarily cancelled.

Yet it’s also been impressive how much has remained, moving online. In the past couple of months, I’ve attended the 2020 Women in Data Science London conference and the N8 CIR ReproHack. Women Driven Development held a remote hackathon focused on products to support people impacted by coronavirus. And I regularly receive alerts of online sessions going ahead from groups I’ve attended in the past.

I’ve also been involved in this new wave of online-only sessions. I’m currently in the middle of delivering some introductory sessions on R for members of the AI Club for Gender Minorities; next week I’ll be talking about my great love for `data.table`. My most recent session was on `ggplot2`, the widely-used package for data visualisation in R, which made me realise I haven’t posted much about it before. So to remedy that, here’s a blog post version of the training, split into two. Today I’ll be focusing on getting a basic plot out the door, and next time I’ll look at themes and making it look pretty.

I’m using a dataset from this great selection; it’s the results of measurements of hawks. I’ll start off by reading in the data and giving it a bit of a clean: making some of the naming clearer and removing a few rows with missing data.

``````library(ggplot2)
library(data.table)

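# Read in only the columns needed for this post.
# "Hawks.csv" is a placeholder: point it at wherever your copy of the dataset lives.
hawks <- fread("Hawks.csv",
select = c('Species', 'Age', 'Sex', 'Wing', 'Weight', 'Tail', 'Year'))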

# Remove some rows with missing weight or wing data
hawks <- hawks[!is.na(Weight) & !is.na(Wing)]

# Label unknown values for sex and age
hawks[Sex == "", Sex := "Unknown"]
hawks[Age == "", Age := "Unknown"]

# Label hawk species
hawks[Species == "RT", Species := "Red-tailed"]
hawks[Species == "SS", Species := "Sharp-shinned"]
hawks[Species == "CH", Species := "Cooper's"]

# Show the first 5 rows of the dataset
hawks[1:5]
##          Species Age     Sex Wing Weight Tail Year
## 1:    Red-tailed   I Unknown  385    920  219 1992
## 2:    Red-tailed   I Unknown  376    930  221 1992
## 3:    Red-tailed   I Unknown  381    990  235 1992
## 4:      Cooper's   I       F  265    470  220 1992
## 5: Sharp-shinned   I       F  205    170  157 1992``````

There are a few simple things to know to get started working with `ggplot2`.

• All plots produced with `ggplot2` use the function `ggplot()` to start with.

• Additional layers are added, and existing settings adjusted, with new lines of code joined on using the `+` symbol.

• For the most basic graph, you will need to provide the dataset in the `data` argument, the features you want plotted (at least for the x axis, and sometimes for the y depending on the type of graph) in the `mapping` argument wrapped in `aes()`, and the graph type as an additional line as a `geom_*()` function.

Let’s look at that in practice with a bar chart. Say we just want to know how many hawks of each species are in the dataset.

``````ggplot(data = hawks, mapping = aes(x = Species)) +
geom_bar()``````

You’ve got the `ggplot()` function with your data defined, as well as the aesthetic you want to map - species along the x axis. You don’t have to map anything to the y axis because the default for `geom_bar()` is to perform a count function and display the number of cases in the given category. Without that, you would have to do some work creating a separate dataframe with a count variable, so this is pretty handy.
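For comparison, here is a rough sketch of the manual route that `geom_bar()` saves you from: count the rows per species yourself, then plot the counts with `geom_col()`, which draws bar heights exactly as given. The handful of rows below are a made-up stand-in for the hawks data, and `species_counts` is just an illustrative name.

```r
library(ggplot2)
library(data.table)

# Toy stand-in for the hawks data (made-up rows, not the real dataset)
hawks <- data.table(
  Species = c("Red-tailed", "Red-tailed", "Red-tailed",
              "Cooper's", "Cooper's", "Sharp-shinned")
)

# Count the rows per species by hand; .N is data.table's row counter
species_counts <- hawks[, .(n = .N), by = Species]

# geom_col() plots y as given, so the pre-computed count is mapped to y
p_manual <- ggplot(data = species_counts, mapping = aes(x = Species, y = n)) +
  geom_col()

# geom_bar() does the counting itself, so no y mapping is needed
p_auto <- ggplot(data = hawks, mapping = aes(x = Species)) +
  geom_bar()
```

Both plot objects should draw the same bars; the second simply skips the intermediate table.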

It’s also worth noting that you can put the mapping (and indeed the data) in the `geom_*()` function too. See how it looks exactly the same if I write the code like that in this instance:

``````ggplot(data = hawks) +
geom_bar(mapping = aes(x = Species))``````

The difference is that the mapping and data in the `ggplot()` function are applied to all geom layers, whereas if they are defined in the `geom_*()` function, they apply to that layer only. That sometimes matters if you are layering two plots on top of each other (which you can do - ggplot objects are just made up of layers). However, much of the time, if you have one geom layer, it doesn’t matter. It is worth knowing about as you’ll see both versions used.
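As a small sketch of where the distinction shows up, the example below draws two bar layers. The first inherits the data and mapping from `ggplot()`, while the second is given its own data: a subset containing only the immature hawks. The rows are a made-up stand-in for the hawks data, and the fill colours are arbitrary choices.

```r
library(ggplot2)
library(data.table)

# Toy stand-in for the hawks data (made-up rows, not the real dataset)
hawks <- data.table(
  Species = c("Red-tailed", "Red-tailed", "Red-tailed",
              "Cooper's", "Cooper's", "Sharp-shinned"),
  Age     = c("I", "A", "I", "A", "I", "I")
)

p_layers <- ggplot(data = hawks, mapping = aes(x = Species)) +
  # Layer 1: uses the data and mapping set in ggplot()
  geom_bar(fill = "grey80") +
  # Layer 2: same inherited mapping, but its own data (immature hawks only)
  geom_bar(data = hawks[Age == "I"], fill = "steelblue")

p_layers
```

The grey bars count every row, and the blue bars drawn over them count only the rows passed to the second layer.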

Now imagine we want to know a bit more about these hawks - just how many there are is not enough. We can add other features to the aesthetic mappings. Let’s add in the Age variable as a colour feature (note that A means adult and I means immature). That is just a case of adding the `fill` argument within the `aes()` function.

``````ggplot(data = hawks, mapping = aes(x = Species, fill = Age)) +
geom_bar()``````

If we want to compare the age breakdown across species, it might be more useful to look at proportions. For that, you can change the `position` argument in `geom_bar()`.

``````ggplot(data = hawks, mapping = aes(x = Species, fill = Age)) +
geom_bar(position = "fill")``````
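To make the arithmetic behind `position = "fill"` concrete, this sketch works out the same within-species shares by hand with `data.table`. The rows are a made-up stand-in for the hawks data, and `age_share` and `share` are illustrative names.

```r
library(ggplot2)
library(data.table)

# Toy stand-in for the hawks data (made-up rows, not the real dataset)
hawks <- data.table(
  Species = c("Red-tailed", "Red-tailed", "Red-tailed",
              "Cooper's", "Cooper's", "Sharp-shinned"),
  Age     = c("I", "A", "I", "A", "I", "I")
)

# Count hawks in each Species/Age combination
age_share <- hawks[, .N, by = .(Species, Age)]

# Divide each count by its species total, so shares sum to 1 within a species
age_share[, share := N / sum(N), by = Species]
age_share

# The same proportions, computed by ggplot2 and stacked to a height of 1
p_fill <- ggplot(data = hawks, mapping = aes(x = Species, fill = Age)) +
  geom_bar(position = "fill")
```

Each bar in `p_fill` is rescaled to the same total height, so only the split within a species is comparable, not the number of hawks.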

This makes it clearer that a higher proportion of the Cooper’s hawks identified were adults than in the other species.

Another look you might want is to have the bars next to each other rather than stacked. For that use `position = "dodge"`.

``````ggplot(data = hawks, mapping = aes(x = Species, fill = Age)) +
geom_bar(position = "dodge")``````

But there are other aesthetics you can change too. Let’s explore some more with a scatterplot. Because we are changing the type of graph, we have a different `geom_*()` function: `geom_point()`. Here we use that to look at tail and wing measurements. Note that now we have a y aesthetic mapped.

``````ggplot(data = hawks, mapping = aes(x = Wing, y = Tail)) +
geom_point()``````

This is already useful for understanding the data. There is obviously a correlation between wing and tail measurements, but there also seem to be some clusters here. Suspecting there might be differences by species, we can add that into the `colour` argument. Note that we use `colour` for dots and lines, whereas `fill` fills in an object like a bar. If we used `colour` with `geom_bar()`, it would colour the outlines of the bars.

``````ggplot(data = hawks, mapping = aes(x = Wing, y = Tail, colour = Species)) +
geom_point()``````

This allows us to see that the clusters in the data are likely due to the different species included, with the red-tailed species generally on the bigger side of hawks.
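To see the difference between `colour` and `fill` on a bar chart, here is a minimal sketch that maps Age to each in turn. The rows are a made-up stand-in for the hawks data, and the white fill in the first plot is only there to make the outlines easier to spot.

```r
library(ggplot2)
library(data.table)

# Toy stand-in for the hawks data (made-up rows, not the real dataset)
hawks <- data.table(
  Species = c("Red-tailed", "Red-tailed", "Red-tailed",
              "Cooper's", "Cooper's", "Sharp-shinned"),
  Age     = c("I", "A", "I", "A", "I", "I")
)

# colour = Age changes the bar outlines; the inside is set to plain white
p_outline <- ggplot(data = hawks, mapping = aes(x = Species, colour = Age)) +
  geom_bar(fill = "white")

# fill = Age changes the inside of the bars
p_filled <- ggplot(data = hawks, mapping = aes(x = Species, fill = Age)) +
  geom_bar()
```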

Other aesthetics you can map include the shape of the point (`shape`), size (`size`) and transparency (`alpha`). Here’s an example with Weight added via transparency.

``````ggplot(data = hawks, mapping = aes(x = Wing, y = Tail, alpha = Weight)) +
geom_point()``````

Note that we are using a continuous variable here - it wouldn’t make sense to map a categorical variable to transparency or size. You can also map a continuous variable to colour, which gives a gradient.

``````ggplot(data = hawks, mapping = aes(x = Wing, y = Tail, colour = Weight)) +
geom_point()``````

Those might both be a bit difficult to read, though, so you could also simplify matters by using a logical expression - which of these hawks weigh over 1.5 kg?

``````ggplot(data = hawks, mapping = aes(x = Wing, y = Tail, colour = Weight > 1500)) +
geom_point()``````
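If you would rather the legend showed readable labels than `TRUE` and `FALSE`, one option is to build a labelled column first and map that instead. This is only a sketch: the rows are made up, and `WeightGroup` and its label text are names invented for the example.

```r
library(ggplot2)
library(data.table)

# Toy stand-in for the hawks data (made-up rows, not the real dataset)
hawks <- data.table(
  Wing   = c(385, 376, 381, 265, 240, 205),
  Tail   = c(219, 221, 235, 220, 200, 157),
  Weight = c(920, 1600, 990, 470, 380, 170)
)

# fifelse() is data.table's vectorised if/else; it labels each row by weight
hawks[, WeightGroup := fifelse(Weight > 1500, "Over 1.5 kg", "1.5 kg or under")]

p_weight <- ggplot(data = hawks,
                   mapping = aes(x = Wing, y = Tail, colour = WeightGroup)) +
  geom_point()

p_weight
```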

You can also use the `facet_wrap()` function to break a categorical variable into separate graphs. This might help pick out specific categories worth exploring further.

``````ggplot(data = hawks, mapping = aes(x = Wing, y = Tail, colour = Species)) +
geom_point() +
facet_wrap(~ Sex, nrow = 2)``````

Here, for example, picking out sex produces three graphs, including one where sex is unknown. Interestingly, the hawks of unknown sex correspond to one of the species clusters, which might lead to some questions about the data. Is that hawk species harder to sex? Or was the data collected differently in different studies?
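A quick way to follow up on a pattern like this is to count the rows behind each panel. The sketch below does that with `data.table`; the rows are made up and deliberately arranged so that one species has no recorded sex, so it only shows the shape of the check, not what the real data contains.

```r
library(data.table)

# Toy stand-in for the hawks data (made-up rows, not the real dataset)
hawks <- data.table(
  Species = c("Red-tailed", "Red-tailed", "Red-tailed",
              "Cooper's", "Cooper's", "Sharp-shinned"),
  Sex     = c("Unknown", "Unknown", "Unknown", "F", "M", "F")
)

# Count hawks for every Sex/Species combination; keyby also sorts the result
sex_by_species <- hawks[, .N, keyby = .(Sex, Species)]
sex_by_species
```

If one species dominates the "Unknown" rows in a table like this, that supports what the facets suggest.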

We’ve looked at bar charts and scatter plots. If you type `ggplot2::geom_` into RStudio and start tabbing for suggestions, you’ll see there are dozens of different chart types you can use. Here are a boxplot and a histogram as additional examples. With `geom_histogram()`, the default number of bins is 30; the `bins` argument changes that.

``````ggplot(data = hawks, mapping = aes(x = Sex, y = Tail)) +
geom_boxplot()``````

``````ggplot(data = hawks, mapping = aes(x = Wing)) +
geom_histogram(bins = 15)``````
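As an alternative to choosing the number of bins, `geom_histogram()` also accepts a `binwidth` argument, which fixes how wide each bin is in the units of the x variable. The sketch below compares the two; the rows are made up, and a width of 25 is an arbitrary choice for illustration.

```r
library(ggplot2)
library(data.table)

# Toy stand-in for the hawks data (made-up rows, not the real dataset)
hawks <- data.table(Wing = c(385, 376, 381, 265, 240, 205))

# Fix the number of bins and let ggplot2 work out their width
p_bins <- ggplot(data = hawks, mapping = aes(x = Wing)) +
  geom_histogram(bins = 15)

# Fix the width of each bin and let ggplot2 work out how many are needed
p_width <- ggplot(data = hawks, mapping = aes(x = Wing)) +
  geom_histogram(binwidth = 25)
```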

That’s an introduction to constructing a basic graph with `ggplot2`! You might think these colours could be improved, or wish the bars were in a different order on the bar chart, or feel there are too many gridlines - but don’t worry, we’ll look at adapting the look of these plots to your specification in the next post.

But I’ll leave you with some other very important visualisations - photos of the actual hawk species the dataset concerns.

Cooper’s hawk, photo by Mike Baird

Red-tailed hawk, photo by Don Sniegowski

Sharp-shinned hawk, photo by Tod Petit