Mastering LPeg
by Roberto Ierusalimschy
(Version 1.0)
LPeg is a pattern-matching library for Lua, based on Parsing Expression Grammars. LPeg performs all tasks of a typical regex system, but it goes well beyond
that. Among other tasks, we can write entire parsers with LPeg, with scanners
For us, pattern matching is a system for finding and extracting pieces of information from a text. For instance, we may want to find a line starting with
“From:” in an email message and extract the rest of the line; we may have an
XML document and want to extract all emphasized text, that is, text written
between <em> and </em>; we may have a list of license plates and want to find
all plates that are palindromes or that start with a given prefix. We may want
to determine whether a sequence of characters is a valid identifier in C, that is,
a letter or underscore followed by zero or more letters or underscores or digits;
moreover, the sequence cannot be equal to a reserved word.
Most pattern-matching systems are based on regexes, also called regular expressions. (Most regex systems are extensions of the original regular-expression
definition that break the nice properties of the original. For that reason, I prefer
to save the name “regular expression” for the original definition and use the
term “regex” for those extensions.) A regex is a string that specifies a pattern
and occasionally what to extract from a match—an occurrence of that pattern
in a text. As a simple example, consider the following Lua code:
subject = "birth date: 12/03/1980"
pattern = "(%d%d)/(%d%d)/(%d%d%d%d)"
d, m, y = string.match(subject, pattern)
print(d, m, y) --> 12 03 1980
(Remember that, in Lua, a function can return multiple values.) In the pattern,
"%d" represents any digit, "/" represents itself, and the parentheses delimit
the captures, that is, the parts to extract from the match. So, in this example,
pattern means any two digits followed by a slash followed by two digits followed
by another slash followed by four digits, capturing the three groups of digits.
The function string.match searches for that pattern in the subject; if it finds
a match, it returns the captured values, that is, the parts of the subject that
matched the parenthesized parts of the pattern.
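As a further sketch, the same function can handle two of the tasks mentioned earlier: extracting the rest of a line that starts with “From:” and checking whether a string is a valid C identifier. The sample message, the deliberately partial table of reserved words, and the helper is_identifier are assumptions made for illustration; they are not part of any library.

```lua
-- Hypothetical sample data, for illustration only.
local message = "To: bob@example.org\nFrom: alice@example.org\nSubject: hi\n"

-- Find a line starting with "From:" and capture the rest of that line.
-- Prepending "\n" to the subject lets a single pattern also cover the
-- case where "From:" is the very first line.
local sender = string.match("\n" .. message, "\nFrom: *([^\n]*)")
print(sender)  --> alice@example.org

-- A deliberately partial list of C reserved words (an assumption;
-- a real checker would list all of them).
local reserved = { ["if"] = true, ["while"] = true, ["return"] = true }

-- A valid identifier is a letter or underscore followed by zero or more
-- letters, digits, or underscores, and it must not be a reserved word.
-- The anchors "^" and "$" force the pattern to cover the whole string.
local function is_identifier(s)
  return string.match(s, "^[%a_][%w_]*$") ~= nil and not reserved[s]
end

print(is_identifier("_count1"))  --> true
print(is_identifier("1count"))   --> false
print(is_identifier("while"))    --> false
```

Note that the reserved-word test had to be written in plain Lua, outside the pattern: excluding specific words is awkward to express in the pattern string itself.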
Unlike most other pattern-matching systems, LPeg is not based on regexes.
Following the Snobol tradition, LPeg defines patterns as first-class objects. This
means that patterns are handled like regular Lua values. The LPeg library offers
several functions to create and compose patterns; with the use of metamethods,
several of these functions are provided as infix or prefix operators. On the
one hand, the result is usually much more verbose than the typical encoding
of patterns using regexes. On the other hand, first-class patterns allow us to
create patterns piecemeal; it is easy to test each piece independently, to properly
document them with good names and comments, to reuse those pieces, and to
compose them to create more complex patterns. In other words, we can create


