Mapping the SO Developer Survey

Jun 9, 2019 in PROJECT • R • TOOLS
r data-visualisation leaflet mapping

I love getting the chance to plot data onto a map. Maps look so much more engaging than a bar chart, and what’s more, clients love them too. My first foray into using R was when I needed to produce a map showing variation in demand for GP out-of-hours services; it was a super effective way of getting the information across so that non-data people actually wanted to look at it!

Since then, I’ve had a soft spot for maps, but my next opportunity to use them at work only arose a few weeks ago. This time, I used the package leaflet. The results were great, so I’ve done a mini project for the blog to keep practising!

The StackOverflow Developer Survey

I’ve been keen to take a closer look at the StackOverflow Developer Survey 2019. It asks people using the site about their background, work and techy opinions, and was completed by almost 90,000 people this year! In the report on the site, many questions split out results for the USA, but as a UK data scientist, I knew I’d have to dig into the full results for bespoke information. And then I realised that a map would let me visualise results for every country separately, which makes it a perfect project for practising choropleth maps.

Before I go much further, here are a few caveats on the survey:

The survey respondents are people visiting StackOverflow. They probably aren’t totally representative of everybody using programming tools, or even of visitors to the site. The survey analytics team did weight the USA results by gender, because they had US Bureau of Labor Statistics estimates relating to women’s participation in the software developer workforce. This showed there was quite a big difference: the proportion of survey respondents who were women was half the proportion of the US developer workforce estimated to be women. In some areas, weighting the results by gender did have an impact. Given that the results for the world as a whole are unweighted (by gender or any other factor that might be relevant to consider), we definitely should be cautious about overinterpreting things.
Although there are tens of thousands of respondents, these are highly concentrated in a handful of countries. Actually, it is pretty impressive how widely SO is used! But we need to be aware that some countries won’t have enough responses to give us meaningful insights.
The results are based on people’s responses. People can be inaccurate, they can miss out questions, and they can give an answer they think is funny or impressive rather than true. We just have to take their word for it here!

For my purposes, that stuff isn’t super important because I mainly want to try out leaflet…but keep that in mind if you are using the data.

Getting the data ready

Data cleaning

My first step was to download the data and then tidy it up so it contained the information I needed. Primarily, this was making sure countries were named consistently with my map data. I’m not going into detail here, but you can see my code here if you are interested or want to replicate this.
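
As a rough illustration of the kind of check involved (a minimal sketch, not the cleaning code linked above), setdiff() will list the survey countries that have no exact match in the map data, and a small lookup vector can then recode them. The vectors and the recode table here are made-up stand-ins, not the real survey or map values.

```r
# Toy vectors standing in for the survey's country column and the map's names
survey_countries <- c("United States", "United Kingdom", "France", "Singapore")
map_countries    <- c("USA", "UK", "France", "Singapore", "Germany")

# Survey countries with no exact match in the map data
unmatched <- setdiff(survey_countries, map_countries)
print(unmatched)

# Hypothetical lookup table: survey spelling -> map spelling
recode <- c("United States" = "USA", "United Kingdom" = "UK")

# Replace only the names that appear in the lookup, leave the rest untouched
needs_recoding <- survey_countries %in% names(recode)
cleaned <- survey_countries
cleaned[needs_recoding] <- unname(recode[survey_countries[needs_recoding]])

# After recoding, every survey country should be found in the map data
stopifnot(all(cleaned %in% map_countries))
```

Running the setdiff() step again after each round of recoding is a cheap way to confirm nothing has been missed.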

Creating the variable to plot

For my first map, I decided to look at gender. How many people responding to the survey are women in each country?

You can see that there are a number of options people had:

library(data.table)
library(feather)

# Load survey data

dev_survey_19 <- as.data.table(read_feather("dev_survey_19.feather"))
dev_survey_19[, unique(Gender)]
## [1] "Man"                                                        
## [2] NA                                                           
## [3] "Woman"                                                      
## [4] "Non-binary, genderqueer, or gender non-conforming"          
## [5] "Woman;Non-binary, genderqueer, or gender non-conforming"    
## [6] "Woman;Man;Non-binary, genderqueer, or gender non-conforming"
## [7] "Woman;Man"                                                  
## [8] "Man;Non-binary, genderqueer, or gender non-conforming"

I’ve focused on women for this example. As it is a multi-choice category, I’m looking at the proportion of people who chose “Woman”, so they might also have chosen other categories. Obviously, it does not mean that people who didn’t select “Woman” are men, or that no men are gender minorities in tech.
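
To make the matching concrete, here is a small sketch on made-up answers. It uses base R’s grepl(), which is what data.table’s %like% wraps, and the toy vector is an assumption for illustration only.

```r
# Multi-select answers are stored as one ";"-separated string per person
gender <- c("Man", NA, "Woman",
            "Woman;Non-binary, genderqueer, or gender non-conforming",
            "Non-binary, genderqueer, or gender non-conforming")

# TRUE wherever the answer contains "Woman" (matching is case-sensitive,
# so "Man" does not match); missing answers come back as FALSE
chose_woman <- grepl("Woman", gender, fixed = TRUE)
print(chose_woman)

# Share of people who answered the question at all and chose "Woman"
answered <- !is.na(gender)
share <- 100 * sum(chose_woman) / sum(answered)
print(share)
```

Note that the denominator only counts people who gave an answer, which is the same choice the calculation in this post makes.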


# Calculate the number of respondents who chose "Woman"
women_by_country <- dev_survey_19[Gender %like% "Woman", 
                                  .(total_women = .N), 
                                  by = country]

# Calculate total number of respondents who provided a gender and combine
total_gender_by_country <- dev_survey_19[!is.na(Gender), 
                                         .(total_people = .N), 
                                         by = country]
women_by_country <- merge(women_by_country, 
                          total_gender_by_country, 
                          by = "country", all = TRUE)

# Countries where nobody chose "Woman" only appear (as NA) after the merge,
# so the zeros have to be filled in at this point
women_by_country[is.na(total_women), total_women := 0]

# Proportion of people who gave a response who chose "Woman"
women_by_country[, proportion_women := total_women/total_people*100]
women_by_country <- women_by_country[, .(country, proportion_women)]

Getting a world map

I was a bit surprised that finding a world map wasn’t more straightforward. Perhaps the problem was that there are plenty of options out there. The trouble is that they aren’t all fit for purpose. I got quite far with one before realising that it was missing quite a lot of countries, including Singapore! The one I ended up with is from the maps package (as suggested in this SO answer), and while it is probably more granular than I need, and in some cases I’ve had to attribute countries in ways that aren’t totally ideal, it does at least mean that all the data from the countries represented in the survey can be shown.

library(maps)
library(maptools)
## Loading required package: sp
## Checking rgeos availability: TRUE

world <- map("world", fill = TRUE, plot = FALSE)
world <- map2SpatialPolygons(world, sub(":.*$", "", world$names))
world <- SpatialPolygonsDataFrame(world, 
                                  data.table(country = names(world)), 
                                  FALSE)

# Combine the data
world <- merge(world, women_by_country)

And now, leaflet

Leaflet is an R package based on the open-source JavaScript library for interactive maps of the same name. Leaflet maps are great for interactivity - users can zoom, pan, hover and click, for example, and there’s a lot of flexibility for the developer to adjust the look and feel. I found this tutorial super helpful.

I start off by creating a colour palette and filling in the basic arguments.
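
As a quick aside, it helps to see what a leaflet palette actually is: colorNumeric() returns a function that turns numbers into hex colour strings. The sketch below assumes a domain of 0 to 15 per cent purely for illustration, and demo_palette is a made-up name.

```r
library(leaflet)

# Assumed domain of 0 to 15 per cent, purely for illustration
demo_palette <- colorNumeric("BuPu", domain = c(0, 15))

# The palette is a function: numbers in, hex colour strings out.
# Missing values get the palette's NA colour, which is grey by default.
demo_colours <- demo_palette(c(0, 7.5, 15, NA))
print(demo_colours)
```

That default grey for missing values is why countries without data show up greyed out on the maps. With that in mind, here is the palette and first map: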

library(leaflet)

# Create colour palette
mypalette <- colorNumeric("BuPu", domain = world$proportion_women)

# Basic map
leaflet(data = world)  %>% 
  addTiles() %>%
  setView(lat = 10, lng = 0, zoom = 2) %>%
  addPolygons(fillColor = ~mypalette(proportion_women))

Ok. Well. That is certainly a map. You can interact with it. But it’s pretty ugly and not at all clear.

The next step is to change the weight and colour of those very bold border lines, and adjust the opacity of the fill colour to make it clearer.

leaflet(data = world)  %>% 
  addTiles() %>%
  setView(lat = 10, lng = 0, zoom = 2) %>%
  addPolygons(fillColor = ~mypalette(proportion_women), 
              weight = 1,
              opacity = 1,
              color = "white",
              dashArray = "1", 
              fillOpacity = 0.8)

So! It is definitely clearer, but still kind of strange. Why are only a small number of countries picked out in a colour? When I looked at the data again, I realised that some countries only had a very small number of responses, so their results were probably less reliable. If a few women responded, that artificially inflated the proportion for that country, and then the scale was extended in a way that didn’t really reflect the world.

In this instance, I decided to remove results for countries with under 100 responses. This unfortunately leads to more countries that are greyed out due to lack of data, but is probably a fairer interpretation of the data.
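
The filtering step isn’t shown in this post, so here is a minimal sketch of one way to do it with data.table. It assumes the per-country totals are still available (in the code above, that means before total_people is dropped), and the toy table and the name min_responses are made up for illustration.

```r
library(data.table)

# Toy data standing in for the per-country totals (values are made up)
toy_counts <- data.table(
  country      = c("A", "B", "C"),
  total_women  = c(2, 90, 40),
  total_people = c(10, 1000, 500)
)

# Threshold described in the text
min_responses <- 100

toy_counts[, proportion_women := total_women / total_people * 100]

# Blank out the proportion where the sample is too small to trust;
# a leaflet palette then draws these countries in its NA colour
toy_counts[total_people < min_responses, proportion_women := NA_real_]
print(toy_counts)
```

Country A would have shown 20 per cent on the strength of two responses, which is exactly the kind of value that stretches the colour scale. The palette needs recreating after this step so that its domain reflects the filtered values.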

Right! So although there are more countries without data, the remaining countries definitely show more of a spread. But at the moment you can’t really see the scale.

The next steps are therefore to allow people to hover over countries to highlight them and see the actual data, as well as showing a legend.

# Create labels for map
labels <- sprintf(
  "<strong>%s</strong><br/>%g per cent women",
  world$country, world$proportion_women
) %>% lapply(htmltools::HTML)

# Map
leaflet(data = world)  %>% 
  addTiles() %>%
  setView(lat = 10, lng = 0, zoom = 2) %>%
  addPolygons(fillColor = ~mypalette(proportion_women), 
              weight = 1,
              opacity = 1,
              color = "white",
              dashArray = "1", 
              fillOpacity = 0.8,
              
              # Add a line around countries that are hovered over
              highlightOptions = highlightOptions(
                weight = 3,
                color = "#666",
                dashArray = "",
                fillOpacity = 0.8,
                bringToFront = TRUE),
              
              # Add the labels previously defined and adjust the settings
              label = labels,
              labelOptions = labelOptions(
                style = list("font-weight" = "normal", padding = "3px 8px"),
                textsize = "15px",
                direction = "auto")) %>%
  
  # Add a legend
  
  addLegend(pal = mypalette, values = ~proportion_women, opacity = 0.8, title = "% women", 
            position = "bottomleft")

This is basically there! The one thing I want to change is that the number on the label has a lot of decimal places. It’s not as simple as rounding, because as long as proportion_women is a numeric value, the label will show all the decimal places it knows about. But I need it to be a numeric value for the choropleth! Therefore, I create a character variable that is based on the rounded number and point the label at that, while the map keeps using the numeric variable.
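
The code for this step isn’t shown in the post, so the following is only a sketch of the idea, using sprintf() to build a character version rounded to one decimal place. The toy vectors stand in for world$country and world$proportion_women, and proportion_label is a made-up name.

```r
# Toy values standing in for world$country and world$proportion_women
country          <- c("Country A", "Country B")
proportion_women <- c(8.123456, 12.96)

# Character version rounded to one decimal place, used only for display;
# the numeric column is left alone for the colour scale
proportion_label <- sprintf("%.1f", proportion_women)

# Same label template as before, with %s because the value is now text
label_text <- sprintf("<strong>%s</strong><br/>%s per cent women",
                      country, proportion_label)
print(label_text)
```

The resulting strings can then be passed through lapply(htmltools::HTML) and handed to the label argument exactly as in the map code above.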

Wonderful. So having a look at the map, we can see from the scale that the proportion of respondents who are women never gets above 15%, so nowhere has really high representation. The highest proportions are in a few countries in South East Asia, Eastern Europe and North Africa, with North America not too far behind.

In the UK, where I’m based, 8% of respondents to the survey were women. I’m sure I’ll have thoughts on that in a future post!

This was a lot of fun. I’ve got some more maps based on the survey that I’m going to post in a few days, plus some thoughts on the limitations of using maps in data visualisation. Until then, enjoy exploring leaflet!

Bonus tips

You can choose your colour palette from lots of options outlined here, or create your own.
The argument dashArray lets you choose whether your outline of the country polygons is dashed and in what style. Look here for options.