Ruby Regular Expressions

A regular expression (often referred to as regex or regexp) is a way of specifying a sequence of characters that is used to match patterns in text.

In Ruby, regexes are first-class citizens. They are built into the core of the language and, as such, integrate well with the rest of it. We’ll take a dive into the world of Ruby’s regular expressions and look at some of the lesser-known and lesser-used features.

Plenty of people look at regex as though it were some sort of black magic. If you have never worked with regex before, it can look anything but regular. Although it may not look the part, it is easy to learn and very expressive. It is arguably the most useful text-related skill any programmer can pick up.

If you’re just starting out with regular expressions, I suggest reading beginner’s guides to understand the basics. Once you know what’s going on, come back and see just how useful Ruby makes it.

Creating a Regexp object

Most Ruby programmers create a Regexp object by writing a pattern between forward slashes: /.../. You can also create one with the %r literal and a delimiter of your choice, such as %r{...}, or with the Regexp.new('...') constructor.

irb(main):001:0> /.../.class
=> Regexp
irb(main):002:0> %r{...}.class
=> Regexp
irb(main):003:0> %r|...|.class
=> Regexp
irb(main):004:0> %r(...).class
=> Regexp
irb(main):005:0> %r=...=.class
=> Regexp
irb(main):006:0> %r@...@.class
=> Regexp
irb(main):007:0> %r!...!.class
=> Regexp
irb(main):008:0> %r/.../.class
=> Regexp
irb(main):009:0> %r"...".class
=> Regexp
irb(main):010:0> Regexp.new('...').class
=> Regexp
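
As a rough illustration of when each form is handy, the sketch below builds the same pattern three ways, uses %r to avoid escaping slashes, and uses Regexp.new for a pattern that is only known at runtime. The sample path and the user_input value are made up for the example.

```ruby
# The three spellings produce patterns with the same source.
literal     = /\d+ apples/
percent_r   = %r{\d+ apples}
constructed = Regexp.new('\d+ apples')
same_source = [literal, percent_r, constructed].map(&:source).uniq.size == 1

# %r with non-slash delimiters avoids escaping every forward slash.
with_slashes = /\/usr\/local\/(\w+)/
with_percent = %r{/usr/local/(\w+)}
dir_a = "/usr/local/bin/ruby"[with_slashes, 1]
dir_b = "/usr/local/bin/ruby"[with_percent, 1]

# Regexp.new builds a pattern from a runtime string. Regexp.escape
# turns metacharacters such as "+" into literals first.
user_input = "1+1"                                 # hypothetical input
naive      = Regexp.new(user_input)                # "+" still means "one or more"
escaped    = Regexp.new(Regexp.escape(user_input)) # "+" is a plain plus sign

puts same_source
puts dir_a, dir_b
puts naive.match?("11"), escaped.match?("11")
```

The naive pattern matches "11" because the plus sign is still a quantifier, which is usually not what you want when the string comes from a user.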

Regexp object as an argument

Regexes are most commonly passed to methods of the String class, such as #=~, #match, #scan, #sub and #gsub. Did you know that the #[] method accepts a Regexp object too?

irb(main):001:0> names = "adam ben charles"
=> "adam ben charles"
irb(main):002:0> names[/a\w*/]
=> "adam"
irb(main):003:0> names[/b\w*/]
=> "ben"
irb(main):004:0> names[/c\w*/]
=> "charles"
irb(main):005:0> names[/a(\w*)/, 1] # We can use capture group references
=> "dam"
irb(main):006:0> names[/b(\w*)/, 1]
=> "en"
irb(main):007:0> names[/z(\w*)/, 1]
=> nil
irb(main):008:0> # ^ is a more concise expression compared to:
irb(main):009:0* names =~ /a(\w*)/ and $1
=> "dam"
irb(main):010:0> names =~ /b(\w*)/ and $1
=> "en"
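
For comparison, here is a small sketch of the other String methods mentioned above taking the same kind of Regexp argument. It reuses the names string from the session; the replacement values are arbitrary.

```ruby
names = "adam ben charles"

first_a     = names[/a\w*/]                       # first match only
all_names   = names.scan(/\w+/)                   # every match, as an array
initials    = names.scan(/\b(\w)/).flatten        # with a group, scan yields the captures
capitalised = names.gsub(/\b\w/) { |ch| ch.upcase } # replace every match
first_only  = names.sub(/\w+/, "ADAM")            # replace the first match
position    = names =~ /ben/                      # index of the match, or nil

p first_a, all_names, initials, capitalised, first_only, position
```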

Capture groups

We have seen earlier that we can use parentheses () to capture groups. These groups are assigned to the global variables $1, $2, ... in order of occurrence. The first group is assigned to $1, the second to $2, and so on. It is also possible to refer to them within the regex itself using the backreferences \1, \2, ....

irb(main):001:0> string = "how now brown cow"
=> "how now brown cow"
irb(main):002:0> string[/[hn](..)\s+[hn]\1/]
=> "how now"
irb(main):003:0> $1
=> "ow"
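
A common use of backreferences is spotting accidentally repeated words. The following is a minimal sketch with a made-up sentence; note that the same \1 syntax also works inside a single-quoted replacement string for #gsub.

```ruby
text = "this is is a test of the the parser"   # hypothetical sample text

# \1 must match exactly what the first group captured.
doubled_word = /\b(\w+)\s+\1\b/

repeats = text.scan(doubled_word).flatten
cleaned = text.gsub(doubled_word, '\1')         # keep one copy of each repeated word

p repeats
puts cleaned
```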

Capture groups can also be given a name to improve code clarity. A named group is written with the (?<name>...) or (?'name'...) construct.

irb(main):001:0> pattern = /(?<day>\d{2})-(?<month>\d{2})-(?<year>\d{4})/
=> /(?<day>\d{2})-(?<month>\d{2})-(?<year>\d{4})/
irb(main):002:0> data = "24-02-2014"
=> "24-02-2014"
irb(main):003:0> data[pattern, :day]
=> "24"
irb(main):004:0> data[pattern, :month]
=> "02"
irb(main):005:0> data[pattern, :year]
=> "2014"
irb(main):006:0> if pattern =~ data
irb(main):007:1>   $~[:year]
irb(main):008:1> end
=> "2014"

When a regex literal appears on the left-hand side of the =~ operator, followed by the string, the named capture groups are assigned to local variables with the corresponding names.

irb(main):009:0> /(?<day>\d{2})-(?<month>\d{2})-(?<year>\d{4})/ =~ data
=> 0
irb(main):010:0> day
=> "24"
irb(main):011:0> month
=> "02"
irb(main):012:0> year
=> "2014"
irb(main):013:0> if /(?<day>\d{2})-(?<month>\d{2})-(?<year>\d{4})/ =~ data
irb(main):014:1>   puts "Day: #{day}"
irb(main):015:1>   puts "Month: #{month}"
irb(main):016:1>   puts "Year: #{year}"
irb(main):017:1> end
Day: 24
Month: 02
Year: 2014
=> nil
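
The local variable assignment only happens for a regex literal, without interpolation, on the left-hand side of =~. When the pattern lives in a variable or constant, the MatchData object gives access to the same names. A minimal sketch, reusing the date pattern from above:

```ruby
pattern = /(?<day>\d{2})-(?<month>\d{2})-(?<year>\d{4})/
data    = "24-02-2014"

# pattern is not a literal at this point, so no locals are created.
match = pattern.match(data)

year  = match[:year]                            # look up a group by symbol
month = match["month"]                          # or by string
parts = match.names.zip(match.captures).to_h    # all groups as a hash

p year, month, parts
```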

Named capture groups can also be backreferenced with \k<name>.

irb(main):001:0> pattern = /(?<day>\d{2})-\k<day>-(?<year>\d{4})/
=> /(?<day>\d{2})-\k<day>-(?<year>\d{4})/
irb(main):002:0> data = "01-02-2014\n02-02-2014\n03-02-2014\n04-02-2014\n"
=> "01-02-2014\n02-02-2014\n03-02-2014\n04-02-2014\n"
irb(main):003:0> data[pattern]
=> "02-02-2014"

Take note that numbered and named backreferences cannot be mixed: once a pattern contains a named group, numbered backreferences such as \1 are not allowed in it.
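
A short sketch of what that restriction looks like in practice. The strings are made up, and Regexp.new is used so that the invalid pattern fails at runtime, where it can be rescued, rather than when the file is parsed.

```ruby
# Once a named group is present, plain parentheses only group; they do not capture.
mixed    = /(?<word>\w+)-(\d+)/.match("item-42")
captured = mixed.captures

# A numbered backreference next to a named group is rejected.
error_message =
  begin
    Regexp.new('(?<word>\w+)-\1')
    nil
  rescue RegexpError => e
    e.message
  end

# The named form of the same backreference is fine.
named_ok = Regexp.new('(?<word>\w+)-\k<word>').match?("abc-abc")

p captured
puts error_message
puts named_ok
```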

Regex with inline comments

Complex expressions are hard to read. Thankfully, options placed after the closing delimiter control how a pattern is interpreted, and the x option, /.../x, lets us add whitespace and comments to the pattern for clarity.

# Take a look at this example,

pattern = /\A&[#](0[0-7]+|[0-9]+|x[0-9a-fA-F]+);\Z/

# versus this.

pattern = / \A                 # Start of string
           &[#]                # Start of numeric character reference
           (
               0[0-7]+         # Octal form
             | [0-9]+          # Decimal form
             | x[0-9a-fA-F]+   # Hexadecimal form
           )
           ;                   # Trailing semicolon
           \Z                  # End of string, or just before a final newline
          /x                   # Option

It is a contrived example, yet it shows how much clearer a complex expression can become. Since literal whitespace is ignored when the x option is active, use an escape such as \s or \p{Space} to match it.
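
To make the whitespace rule concrete, here is a minimal sketch with made-up strings. Besides \s, an escaped space or a space inside a character class should also be treated as a literal in extended mode.

```ruby
loose     = /hello world/x      # the space is ignored, so this means /helloworld/
escaped   = /hello\ world/x     # an escaped space is a literal space
bracketed = /hello[ ]world/x    # so is a space inside a character class
any_space = /hello\s+world/x    # \s matches spaces, tabs and newlines

results = {
  loose_joined:    loose.match?("helloworld"),
  loose_spaced:    loose.match?("hello world"),
  escaped_spaced:  escaped.match?("hello world"),
  bracketed:       bracketed.match?("hello world"),
  any_space_tab:   any_space.match?("hello\tworld")
}

p results
```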

We can also use (?#comment) to add comments to expressions without the x option. I find this approach to be less useful. It is not consistent with Ruby’s style and further complicates the expression.

# Inline comments without the 'x' option

pattern = /\A&[#](0[0-7]+(?#octal)|[0-9]+|x[0-9a-fA-F]+);\Z/

I have found that these little tricks help me write more maintainable code. As always, be careful when using regular expressions. Many security issues in Ruby code stem from oversights made when writing them.
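
One frequently cited oversight of this kind is anchoring. In Ruby, ^ and $ match at the start and end of every line, while \A and \z match only at the very ends of the string, and \Z additionally tolerates a final newline. The sketch below uses a made-up hostile input and a deliberately simplistic URL check to show the difference; it is not a complete validation routine.

```ruby
input = "javascript:alert(1)\nhttp://example.com"   # hypothetical hostile input

line_anchored   = %r{^https?://\S+$}     # passes if ANY line looks like a URL
string_anchored = %r{\Ahttps?://\S+\z}   # the WHOLE string must look like a URL

line_check   = line_anchored.match?(input)
string_check = string_anchored.match?(input)

# \Z accepts one trailing newline, \z does not.
upper_z = /\Aabc\Z/.match?("abc\n")
lower_z = /\Aabc\z/.match?("abc\n")

puts line_check, string_check, upper_z, lower_z
```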

Everything mentioned here can be found in the Ruby documentation. Have a look at it, as it covers every regex feature Ruby implements.

Did you find anything cool you could do with regular expressions in Ruby? Let me know!