Two-Pass Greedy Regular Expression Parsing
Niels Bjørn Bugge Grathwohl, Fritz Henglein, Lasse Nielsen, Ulrik Terp Rasmussen
July, 2013
Proceedings 18th International Conference on Implementation and Application of Automata (CIAA)
Abstract
We present new algorithms for producing greedy parses for regular expressions (REs) in a semi-streaming fashion. Our lean-log algorithm executes in time O(mn) for REs of size m and input strings of size n and outputs a compact bit-coded parse tree representation. It improves on previous algorithms by: operating in only 2 passes; using only O(m) words of random-access memory (independent of n); requiring only kn bits of sequentially written and read log storage, where k < 1/3 m is the number of alternatives and Kleene stars in the RE; processing the input string as a symbol stream and not requiring it to be stored at all. Previous RE parsing algorithms do not scale linearly with input size, or require substantially more log storage and employ 3 passes where the first consists of reversing the input, or do not or are not known to produce a greedy parse. The performance of our unoptimized C-based prototype indicates that the superior performance of our lean-log algorithm can also be observed in practice; it is also surprisingly competitive with RE tools not performing full parsing, such as Grep.
@inproceedings{grathwohl2013,
author = {Niels Bjørn Bugge Grathwohl, Fritz Henglein, Lasse Nielsen, Ulrik Terp
Rasmussen},
title = {Two-Pass Greedy Regular Expression Parsing},
abstract = {We present new algorithms for producing greedy parses for regular
expressions (REs) in a semi-streaming fashion. Our lean-log
algorithm executes in time O(mn) for REs of size m and input
strings of size n and outputs a compact bit-coded parse tree
representation. It improves on previous algorithms by:
operating in only 2 passes; using only O(m) words of
random-access memory (independent of n); requiring only kn
bits of sequentially written and read log storage, where k <
1/3 m is the number of alternatives and Kleene stars in the
RE; processing the input string as a symbol stream and not
requiring it to be stored at all. Previous RE parsing
algorithms do not scale linearly with input size, or require
substantially more log storage and employ 3 passes where the
first consists of reversing the input, or do not or are not
known to produce a greedy parse. The performance of our
unoptimized C-based prototype indicates that the superior
performance of our lean-log algorithm can also be observed in
practice; it is also surprisingly competitive with RE tools
not performing full parsing, such as Grep.},
year = {2013},
month = {July},
booktitle = {Proceedings 18th International Conference on Implementation and
Application of Automata (CIAA)},
volume = {7982},
series = {Lecture Notes in Computer Science},
editor = {Konstantinidis, Stavros},
publisher = {Springer},
pages = {60-71},
doi = {10.1007/978-3-642-39274-0_7},
}