Feature engineering with dates

Remember my last post? I was talking about my 2020 resolution to eat less meat, and how I’d tracked it in my bullet journal. The data that I have is…well, at first glance there’s not much to it.

##       Month Day Diet
## 1:  January   1    V
## 2: February   1    M
## 3:    March   1    M
## 4:    April   1    V
## 5:      May   1    M
## 6:     June   1    V

There is one row for each day, with a column for the month, the day of the month, and whether it was a vegetarian (V), pescetarian (P) or meat-eating (M) day.

Initially it feels like there is very little to work with! But dates are a rich source of features, because so much can be derived from that information alone. Here, I’ll extract some features from this data that I’ll later be able to use when I try to draw insights from my records.

I’m going to use the lubridate package; it’s a useful R package for dealing with dates. My first step is just to take the Month and Day columns and combine them to get a Date column.

library(lubridate)

food_by_day[, Date := lubridate::make_date(2020, as.integer(Month), Day)]
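One thing worth checking at this step: as.integer(Month) only returns month numbers when Month is a factor whose levels are in calendar order. On plain strings it returns NA, and on a factor with default (alphabetical) levels it returns the wrong numbers. Here is a minimal sketch of a more defensive conversion, assuming the months are spelled out in English. The toy vector and the names months_chr and month_number are made up for illustration, and I’ve used base R’s as.Date() rather than make_date() to keep the sketch free of dependencies.

```r
# Toy month names, for illustration only
months_chr <- c("January", "February", "June")

# Look each name up in R's built-in month.name vector to get its number
month_number <- match(months_chr, month.name)
month_number

# For comparison: a factor with default (alphabetical) levels gives
# codes that are not month numbers
as.integer(factor(months_chr))

# Build first-of-month dates from the looked-up month numbers
dates <- as.Date(sprintf("2020-%02d-%02d", month_number, 1L))
dates
```

The same match() call can be dropped into make_date() in place of as.integer(Month) if the column turns out to be character.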

Day of the week

If you know the date, then you can work out the day of the week. This might be relevant to what I ate, if I have a weekly routine. For example, I almost always get a food delivery on a Tuesday, which might affect my choices depending on whether I am making dinner soon after a delivery or at the tail end of the week on a Monday.

The weekdays() function is a simple way to get the day of the week from a date. Once I have a Weekday column, I can also get a binary TRUE/FALSE column for each weekday, which can be helpful in modelling when you are dealing with categorical data.

food_by_day[, Weekday := weekdays(Date)]

food_by_day[, `:=` (Monday = Weekday %in% 'Monday', 
                    Tuesday = Weekday %in% 'Tuesday', 
                    Wednesday = Weekday %in% 'Wednesday', 
                    Thursday = Weekday %in% 'Thursday', 
                    Friday = Weekday %in% 'Friday', 
                    Saturday = Weekday %in% 'Saturday', 
                    Sunday = Weekday %in% 'Sunday'
                    )
                    ]

head(food_by_day)
##       Month Day Diet       Date   Weekday Monday Tuesday Wednesday Thursday
## 1:  January   1    V 2020-01-01 Wednesday  FALSE   FALSE      TRUE    FALSE
## 2: February   1    M 2020-02-01  Saturday  FALSE   FALSE     FALSE    FALSE
## 3:    March   1    M 2020-03-01    Sunday  FALSE   FALSE     FALSE    FALSE
## 4:    April   1    V 2020-04-01 Wednesday  FALSE   FALSE      TRUE    FALSE
## 5:      May   1    M 2020-05-01    Friday  FALSE   FALSE     FALSE    FALSE
## 6:     June   1    V 2020-06-01    Monday   TRUE   FALSE     FALSE    FALSE
##    Friday Saturday Sunday
## 1:  FALSE    FALSE  FALSE
## 2:  FALSE     TRUE  FALSE
## 3:  FALSE    FALSE   TRUE
## 4:  FALSE    FALSE  FALSE
## 5:   TRUE    FALSE  FALSE
## 6:  FALSE    FALSE  FALSE
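Writing out seven nearly identical lines is easy to get wrong, so the indicator columns could also be built in one go. This is a rough sketch rather than what I actually ran: toy is a made-up three-row stand-in for food_by_day, and day_names is a helper vector I’ve introduced for the example.

```r
library(data.table)

# Toy table standing in for food_by_day (three dates from the output above)
toy <- data.table(Date = as.Date(c("2020-01-01", "2020-02-01", "2020-06-01")))
toy[, Weekday := weekdays(Date)]

# Day names in the session's locale, Monday first (2020-01-06 was a Monday)
day_names <- weekdays(as.Date("2020-01-06") + 0:6)

# One TRUE/FALSE column per day name, named after the day
toy[, (day_names) := lapply(day_names, function(d) Weekday == d)]
toy
```

Because day_names is generated with weekdays() too, the column names follow whatever language the R session uses, and each row ends up with exactly one TRUE.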

Working days

Because the weekend is quite different from the working week for me, I’m also bringing out a feature for whether the day falls on a weekend or not.

food_by_day[, Weekend := Weekday %in% c("Saturday", "Sunday")]
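A caveat: weekdays() returns day names in the language of the R session, so matching on "Saturday" and "Sunday" only works in an English locale. A sketch of a locale-independent alternative, using lubridate’s wday() with its week_start argument; the dates and the names day_number and is_weekend are made up for the example.

```r
library(lubridate)

# Made-up dates: a Friday, a Saturday, a Sunday and a Monday in May 2020
dates <- as.Date(c("2020-05-01", "2020-05-02", "2020-05-03", "2020-05-04"))

# With week_start = 1, wday() numbers Monday as 1 and Sunday as 7
day_number <- lubridate::wday(dates, week_start = 1)

# Saturday (6) and Sunday (7) make up the weekend
is_weekend <- day_number >= 6
is_weekend
```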

This doesn’t quite go far enough, because I also had some holidays in 2020, and even if most of them were spent at home, I might have been eating quite differently in holiday mode. I was able to access bank holidays in England using the timeDate package, and then manually added other holiday periods. I considered whether a weekend could also count as a holiday, and decided that a weekend in the middle of a holiday period would have felt no different from the days around it, so some days are both weekends and holidays in the data.

bank_holidays <- as.Date(timeDate::holidayLONDON(year = 2020))
holidays <- unique(c(
  bank_holidays,
  seq(as.Date("2020-01-30"), as.Date("2020-02-04"), by = "day"), 
  seq(as.Date("2020-08-13"), as.Date("2020-08-19"), by = "day"), 
  seq(as.Date("2020-08-31"), as.Date("2020-09-05"), by = "day"), 
  seq(as.Date("2020-10-12"), as.Date("2020-10-19"), by = "day"), 
  as.Date("2020-12-24"), 
  as.Date("2020-12-29")
))

food_by_day[, Holiday := Date %in% holidays]
head(food_by_day)
##       Month Day Diet       Date   Weekday Monday Tuesday Wednesday Thursday
## 1:  January   1    V 2020-01-01 Wednesday  FALSE   FALSE      TRUE    FALSE
## 2: February   1    M 2020-02-01  Saturday  FALSE   FALSE     FALSE    FALSE
## 3:    March   1    M 2020-03-01    Sunday  FALSE   FALSE     FALSE    FALSE
## 4:    April   1    V 2020-04-01 Wednesday  FALSE   FALSE      TRUE    FALSE
## 5:      May   1    M 2020-05-01    Friday  FALSE   FALSE     FALSE    FALSE
## 6:     June   1    V 2020-06-01    Monday   TRUE   FALSE     FALSE    FALSE
##    Friday Saturday Sunday Weekend Holiday
## 1:  FALSE    FALSE  FALSE   FALSE    TRUE
## 2:  FALSE     TRUE  FALSE    TRUE    TRUE
## 3:  FALSE    FALSE   TRUE    TRUE   FALSE
## 4:  FALSE    FALSE  FALSE   FALSE   FALSE
## 5:   TRUE    FALSE  FALSE   FALSE   FALSE
## 6:  FALSE    FALSE  FALSE   FALSE   FALSE
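Since weekends and holidays can overlap, the two flags could be combined into a single "was I actually working?" feature. This is only a sketch: WorkingDay is a hypothetical column that isn’t in my dataset, and the toy table just copies the Weekend and Holiday values from the first four rows of the output above.

```r
library(data.table)

# First four rows of the output above, keeping only the relevant columns
toy <- data.table(
  Date    = as.Date(c("2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01")),
  Weekend = c(FALSE, TRUE, TRUE, FALSE),
  Holiday = c(TRUE, TRUE, FALSE, FALSE)
)

# Hypothetical feature: a working day is neither a weekend nor a holiday
toy[, WorkingDay := !Weekend & !Holiday]

# Count how many days fall into each weekend/holiday combination
toy[, .N, by = .(Weekend, Holiday)]
```

The count by combination is a quick way to see how many days are both weekend and holiday before deciding whether the overlap matters.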

Monthly routine

As well as weekly routines, I thought it was possible I would have a monthly routine. This was less likely as a normal part of life, but because I was tracking things, it seemed feasible to me that towards the start of the month or end of the month, I might be paying particular attention to the tracker. For example, if I was doing really well or really poorly on hitting my monthly target, I would probably make an extra effort at the end of the month to avoid meat.

The first week of the month is pretty simple: all days between the 1st and the 7th will be in it. The last week is a bit more complicated because months differ in length, so I wrote a function to deal with that, which uses lubridate’s days_in_month() function.

# First 7 days of month

food_by_day[, FirstWeek := Day %in% seq(1, 7)]

# Last 7 days of month

is_last_week_of_month <- function(date, day) {
  days_in_given_month <- lubridate::days_in_month(date)
  # Subtract 6, not 7: seq() includes both endpoints, so this spans exactly 7 days
  last_week <- seq(days_in_given_month - 6, days_in_given_month)
  return(day %in% last_week)
}

food_by_day[, LastWeek := mapply(is_last_week_of_month, Date, Day)]
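Since days_in_month() works on a whole vector of dates at once, the last seven days can also be flagged without mapply(). A minimal sketch of that idea, with made-up dates chosen to sit on either side of the cut-off; month_length and in_last_week are names I’ve introduced for the example.

```r
library(lubridate)

# Made-up dates around the cut-off in February (29 days in 2020) and March (31)
dates <- as.Date(c("2020-02-22", "2020-02-23", "2020-02-29",
                   "2020-03-24", "2020-03-25"))

# Number of days in each date's month
month_length <- lubridate::days_in_month(dates)

# The last 7 days are those with day-of-month greater than month_length - 7;
# unname() drops the month-name labels that days_in_month() attaches
in_last_week <- unname(lubridate::mday(dates) > month_length - 7)
in_last_week
```

For February 2020 this flags the 23rd to the 29th, which is seven days, and for March the 25th to the 31st.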

Other possibilities

There are other things I could do! For example, I thought about adding the weather to this dataset, but I couldn’t find a source that was sufficiently up-to-date to cover 2020. But for now, I’ve already added a lot of extra information to my dataset. My columns are now:

##  [1] "Month"     "Day"       "Diet"      "Date"      "Weekday"   "Monday"   
##  [7] "Tuesday"   "Wednesday" "Thursday"  "Friday"    "Saturday"  "Sunday"   
## [13] "Weekend"   "Holiday"   "FirstWeek" "LastWeek"

I’ll see what I can do with these, and maybe there’ll be more I add along the way.
