PEG Parsing in Less Space Using Progressive Tabling and Dynamic Analysis
Fritz Henglein, Ulrik Terp Rasmussen
January, 2017
Proceedings 2017 ACM SIGPLAN Workshop on Partial Evaluation and Program Manipulation
Abstract
Tabular top-down parsing and its lazy variant, Packrat, are linear-time execution models for the TDPL family of recursive descent parsers with limited backtracking. Exponential work due to backtracking is avoided by tabulating the result of each (nonterminal, offset)-pair at the expense of always using space proportional to the product of the input length and grammar size. Current methods for limiting the space usage rely either on manual annotations or on static analyses that are sensitive to the syntactic structure of the grammar. We present progressive tabular parsing (PTP), a new execution model which progressively computes parse tables for longer prefixes of the input and simultaneously generates a leftmost expansion of the parts of the parse tree that can be resolved. Table columns can be discarded on-the-fly as the expansion progresses through the input string, providing best-case constant and worst-case linear memory use. Furthermore, semantic actions are scheduled before the parser has seen the end of the input. The scheduling is conservative in the sense that no action has to be "undone" in the case of backtracking. The time complexity is O(dmn) where m is the size of the parser specification, n is the size of the input string, and d is either a configured constant or the maximum parser stack depth. For common data exchange formats such as JSON, we demonstrate practically constant space usage.
@inproceedings{henglein2017,
author = {Fritz Henglein, Ulrik Terp Rasmussen},
title = {{PEG} Parsing in Less Space Using Progressive Tabling and Dynamic Analysis},
booktitle = {Proceedings 2017 ACM SIGPLAN Workshop on Partial Evaluation and Program Manipulation},
doi = {10.1145/3018882.3018889},
year = {2017},
month = {January},
abstract = {Tabular top-down parsing and its lazy variant, Packrat, are
linear-time execution models for the TDPL family of recursive
descent parsers with limited backtracking. Exponential work
due to backtracking is avoided by tabulating the result of
each (nonterminal, offset)-pair at the expense of always using
space proportional to the product of the input length and
grammar size. Current methods for limiting the space usage
rely either on manual annotations or on static analyses that
are sensitive to the syntactic structure of the grammar. We
present progressive tabular parsing (PTP), a new execution
model which progressively computes parse tables for longer
prefixes of the input and simultaneously generates a leftmost
expansion of the parts of the parse tree that can be
resolved. Table columns can be discarded on-the-fly as the
expansion progresses through the input string, providing
best-case constant and worst-case linear memory
use. Furthermore, semantic actions are scheduled before the
parser has seen the end of the input. The scheduling is
conservative in the sense that no action has to be "undone" in
the case of backtracking. The time complexity is O(dmn) where
m is the size of the parser specification, n is the size of
the input string, and d is either a configured constant or the
maximum parser stack depth. For common data exchange formats
such as JSON, we demonstrate practically constant space
usage.},
}