Kleenex: Compiling Nondeterministic Transducers to Deterministic Streaming Transducers
Bjørn Bugge Grathwohl, Fritz Henglein, Ulrik Terp Rasmussen, Kristoffer Aalund Søholm, Sebastian Paaske Tørholm
January, 2016
Proceedings 43rd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL'2016)
Abstract
We present and illustrate Kleenex, a language for expressing general nondeterministic finite transducers, and its novel compilation to streaming string transducers with worst-case linear-time performance and sustained high throughput. Its underlying theory is based on transducer decomposition into oracle and action machines: the oracle machine performs streaming greedy disambiguation of the input; the action machine performs the output actions. In use cases Kleenex achieves consistently high throughput rates around the 1 Gbps range on stock hardware. It performs well, especially in complex use cases, in comparison to both specialized and related tools such as awk, sed, RE2, Ragel and regular-expression libraries.
@inproceedings{grathwohl2016,
author = {Bj{\o}rn Bugge Grathwohl, Fritz Henglein, Ulrik Terp Rasmussen,
Kristoffer Aalund S{\o}holm, Sebastian Paaske T{\o}rholm},
title = {{K}leenex: Compiling Nondeterministic Transducers to Deterministic
Streaming Transducers},
booktitle = {Proceedings 43rd ACM SIGPLAN-SIGACT Symposium on Principles of
Programming Languages (POPL'2016)},
year = {2016},
month = {January},
location = {St. Petersburg, Florida, USA},
doi = {10.1145/2837614.2837647},
abstract = {We present and illustrate Kleenex, a language for expressing
general nondeterministic finite transducers, and its novel
compilation to streaming string transducers with worst-case
linear-time performance and sustained high throughput. Its
underlying theory is based on transducer decomposition into
oracle and action machines: the oracle machine performs
streaming greedy disambiguation of the input; the action
machine performs the output actions. In use cases Kleenex
achieves consistently high throughput rates around the 1 Gbps
range on stock hardware. It performs well, especially in
complex use cases, in comparison to both specialized and
related tools such as awk, sed, RE2, Ragel and
regular-expression libraries.},
}