In my most recent post, I introduced a project I’m working on to draw structured data out of decisions published online relating to asylum cases heard in the Upper Tribunal. One of the challenges is to identify the country of origin (COI) of the appellant. After exploring the problem one evening, I managed to associate appellants with countries in 87.6% of cases, up from our previous level of 68.4%! How? Not by a complex algorithm, nor a carefully developed machine learning solution. In fact, in an endorsement of keeping things simple, the improvement was all down to enhancing the regular expressions used to spot countries in the text.
Regular expressions, otherwise known as regex, are a way to identify patterns in text. They have a mixed reputation. A widely used quotation from Jamie Zawinski goes:
Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.
And that can be true. People have gleefully recorded the difficulties in using rule-based systems like regex in identifying addresses, names and a large number of other things. The trouble is that rules have clear lines, but humans often create things organically, with exceptions and alternatives abounding. So we can try to make our rules allow for unusual cases…but risk being too permissive and identifying things that definitely don’t fit the brief.
But now that I’ve told you the risks, I have to say that I often find an artfully crafted regex neatly deals with problems that could otherwise be quite intractable. Let’s consider this country identification in more detail.
We had some limitations. The COI is not included in a consistent place in the document, so it’s not something we can pick up and identify as part of the web scraping process. It also isn’t sufficient to look for a country in the text: many decisions reference other cases that mention different countries, or detail countries that might be relevant to the appellants’ case without relating to their nationality.
Our original approach was to review a small sample of the documents and pull out some of the phrases that precede the COI. In the regex, we can use something called a lookbehind, which matches only where the bit you’re actually interested in comes directly after a pattern you’ve specified, without including that pattern in the result. For example, a lot of the documents will use a phrase like “The appellant is a national of France”. By using a regex that looks for the phrase “national of”, you can pull out the following word with some confidence that it’ll be the country you want. Using this approach, the regex looked something like this:
"(?<=phrase one |phrase two )\\b[A-Za-z]+\\b"
Breaking it down:
(?<=phrase one |phrase two ): This is the lookbehind. Everything is contained within brackets to show it is grouped together. The | is an OR operator, so the regex is looking for a pattern that matches any of the options – in this case, several phrases that might precede the relevant country. There were ultimately eight phrases we included here, compiled over time as we checked what was working and what was missing. If we hadn’t kept checking and improving this phrase list, our accuracy would have been even lower; I think it started out around 62%.
\\b: Indicates a word boundary. In R, you need to use two backslashes, though in languages with regex literals or raw strings you only need one. This is because the backslash is already an escape character in R strings, so the backslash itself has to be escaped for a single literal backslash to reach the regex engine.
[A-Za-z]+: The part in square brackets means any character that is an uppercase or lowercase letter. Therefore only letters will match – not spaces, punctuation or numbers, for example. The plus sign is a quantifier meaning one or more: the regex needs at least one of these characters and will match as many in a row as it can. Without the plus sign or another quantifier, the regex would match exactly one character.
So essentially, this regex is looking for a word made up of letters that follows one of the phrases provided. This is what gave us our starting point, with 68.4% of cases having a country matched in the sample of 250 that I used to quickly test the approach.
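As a rough sketch of this first version in action, the snippet below runs it with base R’s regexpr() and regmatches(), using perl = TRUE. That choice of functions is an assumption for illustration, as any engine that supports lookbehinds would do. The two trigger phrases and the sample sentences are made up; they are not the eight phrases from the project.

```r
# Hypothetical trigger phrases standing in for the project's real list.
pattern <- "(?<=national of |citizen of )\\b[A-Za-z]+\\b"

# Return the first match in each string, or NA where nothing matches.
extract_first <- function(text, pattern) {
  m <- regexpr(pattern, text, perl = TRUE)  # perl = TRUE enables lookbehind
  out <- rep(NA_character_, length(text))
  out[m != -1] <- regmatches(text, m)
  out
}

# Made-up sentences in the style of a decision document.
texts <- c(
  "The appellant is a national of France.",
  "The appellant is a citizen of Sri Lanka.",
  "The appeal was dismissed."
)

first_pass <- extract_first(texts, pattern)
print(first_pass)
```

The first sentence gives “France”, the second gives only “Sri”, and the third has no trigger phrase so it comes back as NA.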
Identifying countries that have multiple words
Reviewing the data, it became obvious that some countries were only partially captured. One reason was that they are made up of multiple words, whereas the original regex looked for a single word following the key phrases. This meant we had outputs such as “Sri” instead of “Sri Lanka”. (For the record, I haven’t included these partial results as successes when calculating the accuracy of the original method.)
"(?<=phrase one |phrase two )([A-Z][A-Za-z-]*)([\\s|-][A-Z][A-Za-z-]*)*"
This starts off as before, but after the lookbehind, there is some new stuff going on.
([A-Z][A-Za-z-]*): This time, I’m looking for words that start with a capital letter, on the assumption that countries will be written like that. So I don’t have a quantifier for [A-Z], because I only want one capital, but I use the * to tell the regex to look for additional letters or hyphens. This quantifier means zero or more: if there are none there will still be a match, but it will keep going to match as many as there are. Capital letters are allowed after the first in case the country is an abbreviation.
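[\\s|-]: \\s is regex code for a whitespace character, and the square brackets make a character class that matches any one of the characters listed. Inside a character class the | is a literal pipe rather than an OR operator, so strictly [\\s-] would do the same job. Either way, the class matches a space or a hyphen, so we capture multiple-word countries that are hyphenated (such as Guinea-Bissau) as well as those separated by spaces. This is then followed by the same pattern as before to indicate a capitalized word.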
()*: The space/hyphen and the second word pattern are all enclosed in brackets, grouped together to then be quantified. This means they can be repeated – if there is a country with three or more words, we’ll still get it.
Running this regex upped the cases with matches to 75.2%.
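To see what the multi-word version does and doesn’t pick up, here is a small sketch using the same base R functions as an assumed way of running it. The trigger phrases and sentences are again invented for illustration.

```r
# Hypothetical trigger phrases; the word pattern is the multi-word version.
pattern <- "(?<=national of |citizen of )([A-Z][A-Za-z-]*)([\\s|-][A-Z][A-Za-z-]*)*"

# Made-up sentences in the style of a decision document.
texts <- c(
  "The appellant is a national of Sri Lanka.",
  "He is a citizen of Guinea-Bissau and arrived in 2015.",
  "She is a national of the Czech Republic."
)

# First match per string, with NA where the pattern finds nothing.
m <- regexpr(pattern, texts, perl = TRUE)
multi_word <- rep(NA_character_, length(texts))
multi_word[m != -1] <- regmatches(texts, m)
print(multi_word)
```

“Sri Lanka” and “Guinea-Bissau” now come through whole, but the third sentence returns NA because the word straight after the trigger phrase starts with a lowercase letter.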
Include countries starting with “the”
The trouble with the above approach is that now we are cutting out countries that begin with the lowercase word “the”. These were being captured by the first approach, but only the first word…so only “the”, which is pretty unhelpful, as each one could have been “the Czech Republic”, “the Ivory Coast”, or anything else with that pattern.
"(?<=phrase one |phrase two )(the )*([A-Z][ A-Z a-z-]*)([\\s|-][A-Z][ A-Z a-z-]*)*"
This is almost the same as the last version, except it adds in (the )*, so if there is a “the ” after the key phrases, it will be included, though if there isn’t one, that doesn’t present a problem. This brings the matches to 81.6%.
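A quick sketch of the optional “the ” in practice, once more with invented trigger phrases and sentences and with base R assumed as the way to run it. The character classes here contain only letters and a hyphen.

```r
# Hypothetical trigger phrases, plus an optional leading "the ".
pattern <- "(?<=national of |citizen of )(the )*([A-Z][A-Za-z-]*)([\\s|-][A-Z][A-Za-z-]*)*"

# Made-up sentences: one country with "the", one without.
texts <- c(
  "She is a national of the Czech Republic.",
  "The appellant is a national of France."
)

# Both sentences match, so regmatches() returns one value per sentence.
with_the <- regmatches(texts, regexpr(pattern, texts, perl = TRUE))
print(with_the)
```

The first result keeps its leading “the”, which is something to tidy up later if the countries need to be compared with a standard list.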
Include countries that have lowercase words
It feels like we’re getting close! But even though we’ve included countries that begin with “the”, there are still some countries that have lowercase words in the middle, such as “the United States of America” or “Trinidad and Tobago”.
"(?<=phrase one |phrase two )(the )*([A-Z][A-Za-z-]*)+([\\s|-][A-Z][A-Za-z]*)*( of)?( the)?( and)?([\\s|-][A-Z][A-Za-z]*)*"
There’s a whole new chunk with this. The primary change, however, is the inclusion of elements like ( of)?. These are the lowercase words we’re looking for, and they are followed by a question mark as a quantifier to show they will be there zero or one times. We also have to repeat the pattern that indicates a new word after these new patterns, so that words on either side of the lowercase words get captured.
You can see that it’s getting quite long, which is one of the problems with regex. Overcomplicating things can lead to creating barely comprehensible monsters. It’s important to make sure you keep notes if you write long regexes, so you or someone else doesn’t have to laboriously unravel them in the future. Additionally, make sure you check your results carefully, to make sure you aren’t accidentally excluding things you need to find with your extended rule base.
Nonetheless, this now matches 84.0% of cases with countries.
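The sketch below tries this longer pattern on three invented sentences, using made-up trigger phrases and base R as an assumed way of running it. The third sentence is there to show a side effect of making “and” optional.

```r
# Hypothetical trigger phrases with the of/the/and version of the pattern.
pattern <- paste0(
  "(?<=national of |citizen of )(the )*([A-Z][A-Za-z-]*)+",
  "([\\s|-][A-Z][A-Za-z]*)*( of)?( the)?( and)?([\\s|-][A-Z][A-Za-z]*)*"
)

# Made-up sentences in the style of a decision document.
texts <- c(
  "The appellant is a national of the United States of America.",
  "The appellant is a citizen of Trinidad and Tobago.",
  "He is a national of Iran and claimed asylum on arrival."
)

# All three sentences match, so there is one result per sentence.
with_lowercase <- regmatches(texts, regexpr(pattern, texts, perl = TRUE))
print(with_lowercase)
```

The first two countries are captured in full. The third comes back as “Iran and”, because the optional “and” matches even when no capitalized word follows it.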
When I looked through the results to try to identify what I was missing, I found that by this point, it wasn’t much. Unfortunately, a sizable wedge of decision documents don’t actually explicitly say what the COI is (although presumably it is recorded somewhere, as official statistics include this). These accounted for the majority of the missing data.
However, there was one thing that I was consistently missing. Occasionally, the text would use a different turn of phrase, and use the country as an adjective rather than a noun; for example, it might talk about an Australian national. Capturing this is less neat than what we’ve done so far, because the country isn’t a noun, so the data will need some cleaning after extraction, but I decided it was still worth getting what data there was. This is structured differently, so needs a different approach.
This is worth remembering: you don’t have to stick everything in the same regex. In the spirit of avoiding mistakes, and also prioritising what is matched, you can use multiple regexes. I ran this one on countries that didn’t have a match following the previous attempt.
"(?<=a |an )[A-Z][A-Za-z-]*(?= national)"
Here we have our old friend the lookbehind, and the same regex to look for a word. This time, we also have a lookahead, which, as you might suspect, does the opposite of a lookbehind and anchors the pattern in relation to what comes after it. Here, therefore, the regex pulls out the capitalized word between “a” or “an” and “national”.
Running this in conjunction with the other regex gives us a match rate of 87.6%.
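A minimal sketch of this two-pass idea follows: run the main regex first, then try the adjective regex only where nothing was found. The helper name extract_first, the trigger phrases and the sentences are all made up for illustration, and base R is assumed as the way to run the patterns.

```r
# Main pattern (hypothetical trigger phrases) and the adjective fallback.
primary <- paste0(
  "(?<=national of |citizen of )(the )*([A-Z][A-Za-z-]*)+",
  "([\\s|-][A-Z][A-Za-z]*)*( of)?( the)?( and)?([\\s|-][A-Z][A-Za-z]*)*"
)
fallback <- "(?<=a |an )[A-Z][A-Za-z-]*(?= national)"

# Return the first match in each string, or NA where nothing matches.
extract_first <- function(text, pattern) {
  m <- regexpr(pattern, text, perl = TRUE)
  out <- rep(NA_character_, length(text))
  out[m != -1] <- regmatches(text, m)
  out
}

# Made-up sentences in the style of a decision document.
texts <- c(
  "The appellant is a national of Pakistan.",
  "The appellant is an Australian national.",
  "The appellant appeals against the decision."
)

# First pass with the main pattern, second pass only on the gaps.
country <- extract_first(texts, primary)
missing <- is.na(country)
country[missing] <- extract_first(texts[missing], fallback)
print(country)
```

Running the fallback only on the gaps means a result from the main pattern is never overwritten by the looser one.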
This regex isn’t perfect. The data still needs some cleaning – for example, we have the mix of nouns and adjectives, and we also have some trailing ‘and’s from adding in the and/of/the patterns. There is also the fact that countries are sometimes labelled differently (think even “the UK” and “the United Kingdom”). Some of this is quite easy to deal with; some of it is a bit more complicated. Ultimately we will need the data to be consistent in order to analyse it. But at this point, I’m pleased to be able to extract countries for the majority of cases with a fairly simple approach.
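For the easier end of that cleaning, something like the following might be a starting point. It is only a sketch: the raw values are invented, and the lookup table is a hypothetical stand-in for whatever mapping the real data ends up needing.

```r
# Invented examples of raw extracted values.
raw <- c("Iran and", "the UK", "the United Kingdom", "Australian")

# Drop a trailing "and", "of" or "the" left over from the optional groups.
cleaned <- sub("\\s+(and|of|the)$", "", raw, perl = TRUE)

# Drop a leading "the" so variants of the same country line up.
cleaned <- sub("^the\\s+", "", cleaned, perl = TRUE)

# Hypothetical lookup for abbreviations and adjectives.
lookup <- c("UK" = "United Kingdom", "Australian" = "Australia")
hit <- cleaned %in% names(lookup)
cleaned[hit] <- unname(lookup[cleaned[hit]])
print(cleaned)
```

Stripping the leading “the” is a design choice rather than a rule; it makes “the United Kingdom” and “the UK” collapse to one label, at the cost of changing names that are usually written with the article.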