Learning Regular Expressions

- 5 mins

Man, was it a day! I have to be honest, my brain still feels like soup after finishing up the assignment from today. For this assignment, we were given a test suite and told to make the tests pass. That’s it! “I can do that!” I thought. “ That sounds easy!” I assured myself.

What I neglected to remember in that moment, was that regular expressions have a reputation for being dense and difficult to work with, and I soon fully understood why.

My first problem was this: regular expressions are incredibly unreadable and just look complex. You can’t (or at least I can’t) glance at an expression and immediately tell what it’s doing. You really have to pick it apart, literally character by character.

One of the primary reasons that I like Python is that it’s SO readable. I can easily dissect my Python code and see where a problem might lie or easily scan to see if I made a small syntax error that’s jamming up the works.

This homework assignment took me a long time to do. I usually work pretty quickly and get started right at 1pm after lecture finishes, but today I didn’t finish until close to 9:30! I really felt like I was slogging through the assignment. I spent most of the time poring over single lines of code, looking for the one out of place character or the unintentional space that was gumming up the whole process.

Regular expressions are difficult because they contain an incredible amount of logic in a single line of code. This makes them very powerful, as you can imagine, but also difficult to get a handle on. It’s very easy to mess them up, because they are just SO dense.

That being said, I learned SO MUCH today. I very possibly learned more about a single topic today than any other day so far. It really is true that the more you struggle learning about a topic, the better it sticks. I’m not naive enough to say that I’m a regex master after practicing for a day, but I kind of get the idea! What first looked like someone banging their head on a keyboard is mostly intelligible to me now. I’m taking that as massive progress and I’m thrilled with my work for the day. I’m aware I have a lot more to learn about these complex little lines of code, but it’s been very rewarding so far.

Let’s break down a fairly simple regex!


def date(x):
    return re.match(r'^[0-9]{1,4}[/|-][0-9]{1,4}[/|-]?[0-9]{1,4}$', x)

This is a sample from what I wrote for today’s assignment, and as far as regular expressions go, it’s fairly straightforward. It’s a function that, given a string, return whether or not the string is a valid date. The basic theory behind regexes is you give it a pattern, and it finds it! Or verifies it, whatever you want it to do with said pattern. In this example, we’re just seeing if a string fits this pattern. It should be a simple True or None (none meaning that there was no incidence of the pattern in the string given).

Okay so a quick break to talk about the kind of dates we want to find. Basically anything that looks like these:

1995-08-08

07/01/1992

8-1-96

08/0115

Or any reasonable variation thereof. Our regex is only setup for numeric dates, so that’s certainly a shortcoming, but otherwise it should be able to handle most of our needs.

Let’s break it down.

re is the module we’re importing in order to use regexes.

.match() takes two parameters, a regex and a string, x in this case. It checks the string to see if it ‘matches’ whatever pattern the regex is looking for. It returns a boolean.

Okay now for the doozy! a regular expression is denoted by the r we see just inside the parens. In this example r’^[0-9]{1,4}[/|-][0-9]{1,4}[/|-][0-9]{1,4}$’ is the pattern.

The r and quotes, just denote the boundaries of the regex. No biggie.

r’insert_regex_here’

The carrot symbol ^ denotes the start of a pattern. Meaning “look for something that starts with this”.

The next part, the [0-9] just means what you might expect, we want the pattern to start with a number between 0-9. Makes sense, right?

At this point we have this funny curly bracket concept going on, {1,4}. What this denotes is that whatever is before it, we want to find items that have between 1 and 4 instances of that. So in our example, we’re saying that the first part of our pattern should be between 1 and 4 numbers. Super simple!

The next part is this [/|-]. This is just the next part of our pattern. The pipe | means ‘or’. So we expect to see either a ‘/’ OR a ‘-‘ as the next part. If you want to allow a space or a comma as your next separator, we can make that work too!


[/|-| |,]

It may look like hieroglyphics, but this is the real deal.

Next you’ll notice that we repeat this structure twice. Every date that we consider viable has three parts that can contain between 1-4 numbers separated by either a slash or a hyphen.

Finally the $ sign denotes the end of our pattern. Meaning “the pattern I chose should end like this”. In this case, with 1-4 numbers.

So maybe it turns out that regular expressions aren’t terrible? This is an intentionally ‘straightforward’ example, but still, all we have to do is break them down into little pieces and they make sense!

Andrew Pierce

Andrew Pierce

Software Engineer based in Durham, NC

comments powered by Disqus
rss facebook twitter github youtube mail spotify instagram linkedin google google-plus pinterest medium vimeo stackoverflow reddit quora