When I shifted career into data science, I never thought that one of the first things that would really hook me in would be discovering new, elegant ways to structure and manipulate data. That sounds kind of boring, right? Especially compared to super sexy artificial intelligence. But as you probably know, the reality of a data-driven field means, um, working with data - so you need to clean it, turn it upside down and inside out, rearrange everything to suit your purpose, spread it all out and then tidy it up before you can even get started on your fancy machine learning. Having good tools for the everyday tasks therefore makes a huge difference.
Somewhere along the way, I’ve become a total convert to the R package
data.table, to the extent that yesterday, I gave a presentation about it at an R-Ladies London event. This post is a little slice of that talk, so you’ll understand why I advocate for
data.table. But first…
What is data.table?
A datatable is a pretty generic term for data structures, but in the context of the R package, we are talking about something more specific. It is both a type of data structure (which in this post I will call a data.table, with the dot) and a means to edit and use that data through a syntax that differs from base R (as well as differing from
dplyr, which is probably the package more frequently taught), accessed by downloading and using the package
Actually, a data.table is also a type of data.frame, which means that code you would use to interact with a data.frame will also work with a data.table. So even if you are a little anxious about trying a new structure, you won’t be too far from familiar territory!
Why I use data.table
1) It’s so fast
Speed is important in data science. This is obviously true when you are running complex models and using exceptionally large datasets, where there is a real likelihood of things slowing right down. But it’s also true for simpler tasks that get done over and over again. For example, users of your
shiny app are not going to be very tolerant of waits of more than a few seconds for things to load.
data.table is fast. Not just compared to other options in R, but tools in other languages too. This benchmarking of database-like tools from R, Python and Julia was last updated two weeks ago (19th May 2019) at the time of writing this post, and it shows
data.table as substantially faster across a number of aggregation tests using datasets of varying sizes. With a dataset of 50GB,
data.table takes 112 seconds to process the five tasks, compared with 2,539 seconds for
pandas is not able to complete the tasks at this data size due to insufficient memory in the test environment. If you click through the tests and data sizes, you will see on occasion a different tool will beat
data.table - but in those instances,
data.table will be second, with no other tool coming consistently in the top two for every task.
fwrite() functions are quick ways to read in and write data (here is an example of a recent benchmark comparing
read.csv()). Lots of people use these functions even if they don’t use the wider
There are a number of reasons why
data.table is fast, but a key one is that unlike many other tools, it allows you to modify things in your table by reference, so it is changed in-situ rather than requiring the object to be recreated with your modifications. That means that when I’m using
data.table, I need to use
<- far less frequently. If it isn’t obvious why that’s quicker, imagine you’ve written a long email to somebody, and then notice a typo you want to change. A tool like
data.table means you can just alter that word, rather than rewriting the whole thing.
2) It makes sense
Once I learnt the basic syntactical structure of data.tables, everything totally clicked. I know that many people find
dplyr easy to read, but for me, the consistency and simplicity of
data.table make it super easy for me to look at pretty much any
data.table syntax and just get it.
This image from a wiki provided by the creators of
data.table is a good introduction.
Within the square brackets that signify you are doing something to your data.table, there are three sections to consider. You can do a lot with these building blocks and simply by understanding these you will be able to interpret most
- i: The first section is where you specify the rows that you want considered. It’s a place to filter or subset.
- j: Stuff that’s going on in columns - you might be making a new one or modifying an existing one, for example. Generally where the action is happening.
- by: Doing stuff by group - which is ridiculously handy to be able to do within this same line of code.
If you aren’t used to
data.table, then it can look intimidating simply because it is using syntax that isn’t the same as other R code. There are distinct ways of doing things (such as
:= for indicating a change to a column, or
.N to count rows). However, these elements are often what make
data.table so handy, and you quickly get a hang of them.
3) It’s the hipster choice
Using a tool other than the one most commonly taught could be a sign that you don’t know what you’re doing. Conversely, I say to you, it could be a sign that you really do know what you’re doing. Doesn’t bucking the trend make you a data science badass? I say yes.
But seriously, it isn’t like
data.table is at all fringe. It’s been around since 2006. It has around three quarters of a million downloads each month. And both
data.table have similar usage within other packages, each powering about 10-12% of CRAN.
So really it’s the best of both worlds. It’s totally got the backing of the R community and has proved itself an excellent resource, but gives you that little bit of a cool, free-thinking vibe.
Now you have been 100% convinced to use
data.table, here is some more to look at:
- The data.table site
- The data.table FAQ
- A data.table and dplyr tour (includes comparison of operations in both packages)
- Advanced tips and tricks with data.table (this is so good! Really helps you go to the next level)
You can also find my presentation slides here.
I hope that helps and you join me on Team