Become a Master Text Wrangler with Regular Expressions

Become a Master Text Wrangler with Regular Expressions

We interact with text everyday. Emails, text messages, sales proposals, spreadsheets - it’s all just characters strung together at the end of the day. If you work with text on a daily basis, you're likely always looking for a way to make changes faster, to stop repeating tasks over and over again, and to feel productive as you write or make changes to bulk blocks of text. 

What is a Regular Expression?

Regular expressions are tools built into almost every text processing application to help you find, replace, and manipulate text in bulk.

Here at Speak, we use regular expressions often when dealing with large spreadsheets, database exports, content imports - you name it. If it’s longer than a 140 character tweet, someone on our Technology team is going to be reaching for a regular expression from the tool belt to help expedite the change.

Let’s start small. You’re working on a spreadsheet that looks like this.

The Users column is in the format first.last, but we actually want those to be email addresses. The first part of the email address can contain what’s already there, but we want to add to the end of each User so that it forms a valid email address. If this spreadsheet had over one thousand rows, making these updates manually would take a very long time. We wouldn’t be able to use a traditional find and replace operation because each cell contains something different. However, because each cell is in the same format (first.last), we can find the last character in the cell and append it with In Google Sheets, in the Find and Replace tool, there is an option to enable Regular Expressions. To update all the users to email addresses, we would:

  1. Select the cells we want to update
  2. Find (.*)$
  3. Replace with $
  4. “Search using regular expressions” and “Match entire cell contents”
  5. Replace All

Let’s break down what’s happening here. The $ matches the end of a line of text. We know we want to put at the end of each cell, so we’ll need that.

.* matches any character (.) zero or more times (*). We could have also matched for .+, which would find any character (.) one or more times (+). The ( ) creates a group from this match that we can use later in our replace expression. In the "Replace with" expression, we start with $1, which is a reference to the group created by ( ) in the "Find" expression. We want to replace the whole cell with the content of the original "Find" result plus, so the group $1 that matched (.*) (all characters) will give us the original content, and then we just have to populate what we want to come after the original content - resulting in $

What Can Regular Expressions Help Find?

So we've learned ., *, +, (), $, and how to refer back to a group with $1. What else can regular expressions find? There's a long list of special characters that come in handy. Below is a short list of some of the most useful expressions, as well as some examples to get you started using them.

  • [ ] matches a single character from a list of characters, or a range of characters.
    • [24680] would match any single digit even number.
    • [a-z] would match any lowercase letter between a and z (the lowercase alphabet).
    • [0-9] would match any single digit number between 0 and 9.
    • [^abc] would match any letter except for a, b, or c.
  • ? optionally matches the previous expression, which means mooses? would match moose and mooses since the final s was made optional by the ?.
  • ^ matches the beginning of the string (opposite of $). So ^you would match the you in you're awesome, but not hey you.

For a full list of regular expressions that most applications support, a great resource is the cheatsheet on Regexr. Regexr is a web app that you can use to visually test regular expressions. Here's an example of matching all words that start with an uppercase letter, and uses another useful expression \w which matches any "word" character (lowercase and uppercase alphabet).

Matching regular expressions | Speak Creative

Ok one final example, I promise. Consider a blog export from Wordpress.

Wordpress blog export | Speak Creative

So, let's say we want to:

  1. Extract the images contained in each post body
  2. Change the root domain of each link to a new domain.

To do that, we should use a text editor like Sublime Text or Atom that supports multiple cursors at once. The steps we would take in Sublime Text would be:

  1. Search for [">]https?://[^"<]+(\.png|\.jpg|\.gif)
  2. "Find All" occurrences of that expression
  3. Copy all those occurrences to the clipboard
  4. Paste into a new file
  5. On the new file, remove the leading character that is either " or > depending on what the initial [">] captured.
  6. "Replace" https?://[^\/]+ with

Let's break down the two regular expressions here.

In [">]https?://[^"<]+(\.png|\.jpg|\.gif), the [">] will match a " or a > which is what starts all of the links in our XML. After that we want to match any URL that ends with .png, .jpg, or .gif. We do that by matching http:// and https:// with the expression https?://, then [^"<]+(\.png|\.jpg|\.gif) any character that is not a " or > on or more times, until it matches (\.png|\.jpg|\.gif). We have to escape the . for each filetype because . is the special character that matches any character. However, we can match a period by escaping the . with \.

In https?://[^\/]+, we're simply finding all domains by matching http:// and https:// again with the expression https?://, and then matching every character that's not a / by escaping it with \ and matching it one or more times with [^\/]+. Then we just populate in the "Replace with" field the domain that we want to use for all the links.

Regular expressions may seem daunting at first, but once you start using them little by little, they will become more useful and you will find more applications for them. Next time you're deep in a repetitive manual text editing task, try to find tools that can automate that task for you. More often than not, you'll find that regular expressions will come out of the toolbox first.

Do you need help migrating content from an old site or have questions on how our web application and mobile developers can help you? We can help.

Questions? Comments? Feel free to drop us a line in the comment section or contact us to talk shop.
Posted by Kevin Reed at 15:07