Difficulty: Beginner
Estimated Time: 10 minutes

Let's dive into regular expressions and use them for pattern matching.

Regular expressions

Step 1 of 2

RegExp and grep

Let's prepare our file for the search. Create a file named fruits.txt and fill it with the following content (yes, there are typos, on purpose):

apples
oranges
limes
grape
watermelons
peeeeaars
limes
peaches
oranges
grapes
peaches
pineapppppples
oranges
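
One way to create the file from the shell is a here-document. This is only a sketch and any text editor works just as well; it assumes you are happy to write fruits.txt into the current directory.

```shell
# Write the sample list to fruits.txt in the current directory.
# Quoting 'EOF' keeps the shell from expanding anything inside the block.
cat > fruits.txt <<'EOF'
apples
oranges
limes
grape
watermelons
peeeeaars
limes
peaches
oranges
grapes
peaches
pineapppppples
oranges
EOF

# Count the lines to confirm the file was written completely.
line_count=$(wc -l < fruits.txt | tr -d ' ')
echo "fruits.txt has $line_count lines"
```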

Remember that grep searches plain-text data, such as files, for lines that match a pattern. The name comes from g/re/p (globally search for a regular expression and print), which is exactly what it does with the text. In some cases we will use egrep, the extended version that supports extended regular expressions, some of which we will define here. egrep behaves the same as grep -E, and recent GNU grep releases print a warning that egrep is obsolescent.

Let's go over the regular expressions:

  • . (dot) - matches any single character.

grep peach. fruits.txt

How does that differ from

grep peach fruits.txt

Can you spot the difference?
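
In fruits.txt every "peach" is followed by another character, so both commands select the same lines and the difference is easy to miss. The sketch below uses two made-up sample lines (not part of fruits.txt) where the dot changes the result.

```shell
# Two illustrative lines: one is exactly "peach", the other continues after it.
sample=$(printf 'peach\npeaches\n')

# "peach" matches both lines.
plain=$(printf '%s\n' "$sample" | grep -c 'peach')

# "peach." needs one more character after "peach",
# so the bare "peach" line no longer matches.
dotted=$(printf '%s\n' "$sample" | grep -c 'peach.')

echo "peach: $plain line(s), peach.: $dotted line(s)"
```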

  • ? - the preceding character matches 0 or 1 times only.

Try the same search with a question mark: egrep 'peach?' fruits.txt (the quotes keep the shell from treating ? as a filename wildcard).

Let's try egrep 'peach???' fruits.txt

This works. Why? Think about it for a moment.
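
To see what the question mark actually consumes, it can help to print only the matched part of the line. The sketch below assumes a grep that supports -o (GNU and BSD grep both do) and uses the made-up word "peace" alongside "peaches".

```shell
# -o prints only the text that matched, -E enables extended syntax.
# With the h present, the match includes it.
with_h=$(echo 'peaches' | grep -oE 'peach?')

# With no h after "peac", the pattern still matches (h occurs 0 times).
without_h=$(echo 'peace' | grep -oE 'peach?')

echo "$with_h $without_h"
```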

  • * - the preceding character matches 0 or more times.

Let's see the meaning of this in two examples:

grep p fruits.txt

and

grep 'p*' fruits.txt

(The quotes keep the shell from expanding p* as a filename pattern.)

Why are they different? Read the definition again and it should be obvious.
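
Counting the matching lines makes the contrast concrete. This sketch feeds the same thirteen names through a pipe instead of reading fruits.txt, so it runs anywhere; the variable names are illustrative.

```shell
# The sample list, one name per line.
fruits=$(printf '%s\n' apples oranges limes grape watermelons peeeeaars \
  limes peaches oranges grapes peaches pineapppppples oranges)

# Lines that contain at least one "p".
with_p=$(printf '%s\n' "$fruits" | grep -c 'p')

# "p*" also matches zero occurrences of "p", so every line matches.
with_p_star=$(printf '%s\n' "$fruits" | grep -c 'p*')

echo "p: $with_p lines, p*: $with_p_star lines"
```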

  • + - the preceding character matches 1 or more times.

  • {n} - the preceding character matches exactly n times.

  • {n,m} - the preceding character matches at least n times and not more than m times.

  • [agd] - the character is one of those included within the square brackets.

  • [^agd] - the character is not one of those included within the square brackets.

  • [c-f] - the dash within the square brackets operates as a range. In this case it means either the letters c, d, e or f.

Let's see some examples

Try egrep 'p{3}' fruits.txt to find the lines that contain three p characters in a row. Because grep matches anywhere in the line, a longer run of p also matches.
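
The three counting operators (+, {n} and {n,m}) can be compared side by side. The following is a small sketch on three names from the list plus the made-up string 'p pp ppp'; it assumes a grep with -o support.

```shell
sample=$(printf '%s\n' apples peeeeaars pineapppppples)

# p{3}: three consecutive p's. "apples" has only two, so it is skipped.
three_p=$(printf '%s\n' "$sample" | grep -E 'p{3}')

# e{2,4}: a run of two to four e's.
e_run=$(printf '%s\n' "$sample" | grep -E 'e{2,4}')

# p+: each run of one or more p's, printed separately by -o
# and joined onto one line with paste.
p_runs=$(echo 'p pp ppp' | grep -oE 'p+' | paste -sd ' ' -)

echo "p{3}: $three_p"
echo "e{2,4}: $e_run"
echo "p+: $p_runs"
```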

Or search for a, b, or c in the file: egrep '[a-c]' fruits.txt

Note that [c-f1-9] matches any one of the characters in the ranges c to f and 1 to 9 (it takes the union). For instance, [a-z0-9] matches any lowercase letter or any digit.

Combined sequences of bracketed characters match common word patterns. [Hh][Ee][Yy] matches hey, Hey, HEY, and so on. (Q: How does that differ from [HhEeYy] ?)
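
A quick way to explore the question is to count matches for both patterns on a few test words. The words below are made up for illustration.

```shell
words=$(printf '%s\n' hey HEY hEy h ye)

# Three bracket expressions in a row: an h, then an e, then a y, in any case.
as_sequence=$(printf '%s\n' "$words" | grep -c '[Hh][Ee][Yy]')

# One bracket expression: any single one of the six characters is enough.
as_set=$(printf '%s\n' "$words" | grep -c '[HhEeYy]')

echo "sequence: $as_sequence lines, single set: $as_set lines"
```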

Let's try it: egrep '[ao][pr]' fruits.txt
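
As a rough check of the two bracket searches, the sketch below runs them on four names from the list and joins the matching lines onto one line. The variable names are illustrative, and the range [a-c] is assumed to behave as plain a, b, c in your locale.

```shell
sample=$(printf '%s\n' apples oranges limes peaches)

# [a-c]: the line contains an a, b or c somewhere.
range_hits=$(printf '%s\n' "$sample" | grep '[a-c]' | paste -sd ' ' -)

# [ao][pr]: an a or o immediately followed by a p or r ("ap", "or", ...).
pair_hits=$(printf '%s\n' "$sample" | grep '[ao][pr]' | paste -sd ' ' -)

echo "[a-c]: $range_hits"
echo "[ao][pr]: $pair_hits"
```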

  • ^ - matches the beginning of the line (as the first character inside square brackets it negates the set instead, see above).
  • $ - matches the end of the line.
  • \x - matches the character x, where the character's special meaning is stripped by the backslash.
  • \\ - matches a backslash (the first backslash strips the special meaning of the second).

Try egrep 'es$' fruits.txt to find the lines that end with "es".
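
The anchors and the backslash escape can be tried together. This sketch uses a few illustrative inputs, including the made-up word "esplanade" and the strings "a.b" and "axb", none of which come from fruits.txt.

```shell
sample=$(printf '%s\n' apples grape peaches esplanade)

# es$ : "es" must be the last two characters of the line.
ends_es=$(printf '%s\n' "$sample" | grep 'es$' | paste -sd ' ' -)

# ^es : "es" must be the first two characters of the line.
starts_es=$(printf '%s\n' "$sample" | grep '^es')

# \. : a literal dot, so "axb" is not counted.
literal_dot=$(printf 'a.b\naxb\n' | grep -c 'a\.b')

echo "ends with es: $ends_es"
echo "starts with es: $starts_es"
echo "lines with a literal a.b: $literal_dot"
```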

Some more extended patterns (these may not be available on systems that are not POSIX compliant)

[[:class:]] matches any character in the named character class. POSIX defines alnum, alpha, blank, cntrl, digit, graph, lower, print, punct, space, upper and xdigit. ascii and word are extensions offered by some other regular expression engines, and grep may reject them.

grep '[[:alnum:]]' fruits.txt

This finds the lines that contain at least one alphanumeric character (here, every line does). Let's search for digits:

grep '[[:digit:]]' fruits.txt
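
Since fruits.txt contains no digits, the digit search prints nothing there. The sketch below uses three made-up lines so that both classes have something to match and something to skip.

```shell
# Illustrative lines: letters only, letters with digits, punctuation only.
sample=$(printf '%s\n' apples 'route 66' '---')

# [[:alnum:]]: the line contains at least one letter or digit.
alnum_lines=$(printf '%s\n' "$sample" | grep -c '[[:alnum:]]')

# [[:digit:]]: the line contains at least one digit.
digit_lines=$(printf '%s\n' "$sample" | grep '[[:digit:]]')

echo "alnum: $alnum_lines line(s)"
echo "digit: $digit_lines"
```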