Mapping the SO Developer Survey

I love getting the chance to plot data onto a map. Maps look so much more engaging than a bar chart, and what’s more, clients love them too. My first foray into using R was when I needed to produce a map showing variation in demand for GP out-of-hours services; it was a super effective way of getting the information across so that non-data people actually wanted to look at it!

Since then, I’ve had a soft spot for maps, but my next opportunity to use them at work only arose a few weeks ago. This time, I used the package leaflet. The results were great, so I’ve done a mini project for the blog to keep practising!

The StackOverflow Developer Survey

I’ve been keen to take a closer look at the StackOverflow Developer Survey 2019. It asks people using the site about their background, work and techy opinions, and was completed by almost 90,000 people this year! In the report on the site, many questions split out results for the USA, but as a UK data scientist, I knew I’d have to dig into the full results for bespoke information. And then I realised that using a map would allow me to visualise results for all countries separately – a perfect project for practising choropleth maps.

Before I go much further, here are a few caveats on the survey:

  • The survey respondents are people visiting StackOverflow. They probably aren’t totally representative of everybody using programming tools, or even visitors to the site. The survey analytics team did weight the USA results by gender, because they had US Bureau of Labor Statistics estimates relating to women’s participation in the software developer workforce. This showed there was quite a big difference: the proportion of people answering the survey who were women was half that of the proportion of the developer workforce estimated to be women in the USA. In some areas, weighting the results by gender did have an impact. Given that the results for the world as a whole are unweighted (by gender or any other factor that might be relevant to consider), we definitely should be cautious about overinterpreting things.

  • Although there are tens of thousands of respondents, these are highly concentrated in a handful of countries. That said, it is pretty impressive how widely SO is used! But we need to be aware that some countries won’t have enough responses to give us meaningful insights.

  • The results are based on people’s responses. People can be inaccurate, they can miss out questions, and they can give an answer they think is funny or impressive rather than true. We just have to take their word for it here!

For my purposes, that stuff isn’t super important because I mainly want to try out leaflet…but keep that in mind if you are using the data.

Getting the data ready

Data cleaning

My first step was to download the data and then tidy it up so it contained the information I needed. Primarily, this was making sure countries were named consistently with my map data. I’m not going into detail here, but you can see my code here if you are interested or want to replicate this.
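
A minimal sketch of that kind of name check, using base R's setdiff(). The country vectors and the recoding lookup below are made up for illustration; they are not the actual cleaning code or the real survey and map names.

```r
# Hypothetical country names, for illustration only
survey_countries <- c("United Kingdom", "United States", "France", "Singapore")
map_countries    <- c("UK", "USA", "France", "Singapore", "Germany")

# Survey names that have no exact match in the map data
unmatched <- setdiff(survey_countries, map_countries)
unmatched

# One way to fix them: a named lookup vector from survey name to map name
recode <- c("United Kingdom" = "UK", "United States" = "USA")
cleaned <- ifelse(survey_countries %in% names(recode),
                  recode[survey_countries],
                  survey_countries)
cleaned <- unname(cleaned)

# After recoding, every survey name should be found in the map data
setdiff(cleaned, map_countries)
```

Anything still returned by the final setdiff() would be dropped from the map after merging, so it is worth running a check like this before plotting.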

Creating the variable to plot

For my first map, I decided to look at gender. How many people responding to the survey are women in each country?

You can see that there are a number of options people had:


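# Load survey data
library(data.table)
library(feather)

# Assumes the cleaned feather file is in the working directory; adjust the path to suit
dev_survey_19 <- setDT(read_feather("dev_survey_19.feather"))
dev_survey_19[, unique(Gender)]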
## [1] "Man"                                                        
## [2] NA                                                           
## [3] "Woman"                                                      
## [4] "Non-binary, genderqueer, or gender non-conforming"          
## [5] "Woman;Non-binary, genderqueer, or gender non-conforming"    
## [6] "Woman;Man;Non-binary, genderqueer, or gender non-conforming"
## [7] "Woman;Man"                                                  
## [8] "Man;Non-binary, genderqueer, or gender non-conforming"

I’ve focused on women for this example. As it is a multi-choice category, I’m looking at the proportion of people who chose “Woman”, so they might also have chosen other categories. Obviously, it does not mean that people who didn’t select “Woman” are men, or that no men are gender minorities in tech.

# Calculate the number of respondents who chose "Woman"
women_by_country <- dev_survey_19[Gender %like% "Woman", 
                                  .(total_women = .N), 
                                  by = country]

# Calculate total number of respondents who provided a gender and combine
total_gender_by_country <- dev_survey_19[!is.na(Gender), 
                                         .(total_people = .N), 
                                         by = country]
women_by_country <- merge(women_by_country, total_gender_by_country,
                          by = "country", all = TRUE)

# Countries with no "Woman" responses come out of the merge as NA; count them as zero
women_by_country[is.na(total_women), total_women := 0L]

# Proportion of people who gave a response who chose "Woman"
women_by_country[, proportion_women := total_women/total_people*100]
women_by_country <- women_by_country[, .(country, proportion_women)]
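
To make the logic concrete, here is the same calculation sketched on a tiny made-up table. The countries "A", "B" and "C" and their answers are invented for illustration and the toy_ names are hypothetical. Country "B" has no "Woman" responses and one missing answer, so it shows why the full merge and the NA handling matter.

```r
library(data.table)

# Toy survey data (made up for illustration; not from the real survey)
toy_survey <- data.table(
  country = c("A", "A", "A", "B", "B", "C"),
  Gender  = c("Woman", "Man", "Woman;Man", "Man", NA, "Man")
)

# Respondents whose multi-select answer includes "Woman"
toy_women <- toy_survey[Gender %like% "Woman", .(total_women = .N), by = country]

# Respondents who gave any gender answer (missing answers are excluded)
toy_total <- toy_survey[!is.na(Gender), .(total_people = .N), by = country]

# Full join keeps countries with no "Woman" responses; fill their count with zero
toy_result <- merge(toy_women, toy_total, by = "country", all = TRUE)
toy_result[is.na(total_women), total_women := 0L]

# Percentage of respondents with a gender answer who chose "Woman"
toy_result[, proportion_women := total_women / total_people * 100]
toy_result
```

Country "A" ends up with two of three respondents counted, because "Woman;Man" matches the pattern too, while "B" and "C" come out as 0 rather than dropping out of the table.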

Getting a world map

I was a bit surprised that finding a world map wasn’t more straightforward. There are plenty of options out there, but they aren’t all fit for purpose. I got quite far with one before realising that it was missing quite a lot of countries, including Singapore! The one I ended up with is from the maps package (as suggested in this SO answer), and while it is probably more granular than I need, and in some cases I’ve had to attribute countries in ways that aren’t totally ideal, it does at least mean that all the data from the countries represented in the survey can be shown.

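library(maps)      # map() and the "world" database
library(maptools)  # map2SpatialPolygons(); also loads sp
## Loading required package: sp
## Checking rgeos availability: TRUE

# Build one polygon set per country (the text before ":" in each map name)
world <- map("world", fill = TRUE, plot = FALSE)
world <- map2SpatialPolygons(world, sub(":.*$", "", world$names))
world <- SpatialPolygonsDataFrame(world, 
                                  data.table(country = names(world)), 
                                  match.ID = FALSE)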

# Combine the data
world <- merge(world, women_by_country)

And now, leaflet

Leaflet is an R package based on the open-source JavaScript library for interactive maps of the same name. Leaflet maps are great for interactivity - users can zoom, pan, hover and click, for example, and there’s a lot of flexibility for the developer to adjust the look and feel. I found this tutorial super helpful.

I start off by creating a colour palette and filling in the basic arguments.


library(leaflet)

# Create colour palette, scaled to the variable being plotted
mypalette <- colorNumeric("BuPu", domain = world$proportion_women)

# Basic map
leaflet(data = world)  %>% 
  addTiles() %>%
  setView(lat = 10, lng = 0, zoom = 2) %>%
  addPolygons(fillColor = ~mypalette(proportion_women))
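
To see what the palette is doing, here is a small sketch that calls a colorNumeric() palette directly. The 0 to 50 domain and the input values are made up for illustration and demo_palette is a hypothetical name. Countries on the map with no survey data have NA for proportion_women after the merge, and the palette returns its na.color for those, which defaults to a mid grey.

```r
library(leaflet)

# Illustrative domain only; not the real range of the survey data
demo_palette <- colorNumeric("BuPu", domain = c(0, 50))

# The palette is just a function from numbers to hex colour strings
demo_colours <- demo_palette(c(0, 25, 50))
demo_colours

# Missing values get the default na.color ("#808080")
demo_missing <- demo_palette(c(10, NA))
demo_missing
```

This is why the formula ~mypalette(proportion_women) works inside addPolygons(): leaflet evaluates it against the map data and gets back one colour per polygon.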