(packrat) -- Packrat parser library

Library (packrat)

This library is ported from Chicken Scheme packrat. The documentation is from the PDF file located on the website and formatted Sagittarius document format.

Packrat parsing is a memorizing, backtracking recursive-descent parsing technique that runs in time and space linear in the size of the input test. The technique was originally discovered by Alexander Birman in 1970 [1], and Bryan Ford took up the idea for his master's thesis in 2002 [4, 3, 2]. For detailed information on the technique, please see Bryan Ford's web pate at

"http://pdos.csail.mit.edu/~baford/packrat/"This document describes an R5RS Scheme library of parsing combinators implemented using the packrat parsing algorithm. The main interfaces are the packrat-parse macro and the combinators into into which it expands, the base-generator->results function, and the accessors for parse-result records.

Data Structures

This section describes the data structures that make up the core of the packrat parsing algorithm, and some of the low-level procedures that operate on them.

parse-result

A parse-result record describes the results of an attempt at a parse at a particular position in the input stream. It can either record a successful parse, in which case it contains an associated semantic-value, or a failed parse, in which case it contains a parse-error structure.

This is a predicate which answers #t if and only if its argument is a parse-result record.

This predicate returns #t if its argument represents a successful parse, or #f if it represents a failed parse.

If the argument represents a successful parse, this function returns the associated semantic-value; otherwise, it will return #f.

If the argument represents a successful parse, this function returns a parse-results record representing the parsed input stream starting immediately after the parse this parse-results represents. For instance, given an input stream [a, b, c, d, e], if the parse-result given to parse-result-nexthad completed successfully, consuming the [a, b, c] prefix of the input stream and producing some semantic value, then the parse-result returned from parse-result-next would represent all possible parses starting from the [d, e] suffix of the input stream.

If the argument represents a failed parse, this function returns a parse-error structure; otherwise, it may return a parse-error structure for internal implementation reasons (to do with propagating errors upwards for improved error-reporting), or it may return #f

This function constructs an instance of parse-result representing a successful parse. The first argument is used as the semantic value to include with the new parse-result, and the second argument should be a parse-results structure representing the location in the input stream from which continue parsing.

This function constructs an instance of parse-result representing a failed parse. The parse-position in the first argument and the value in the second argument are used to construct a variant of a parse-error record for inclusion in the parse-result that reports that a particular kind of value was expected at the given parse-position.

This function constructs an instance of parse-result representing a failed parse. The parse-position in the first argument and the string in the second argument are used to construct a variant of a parse-error record for inclusion in the parse-result that reports a general error message at the given parse position.

This function propagates error information through a particular parse result. The parse-error contained in the first argument is combined with the parse-error from the second argument, and the resulting parse-result structure is returned embedded in the error field of a copy of the first argument.

parse-results

A parse-results record notionally describes all possible parses that can be attempted from a particular point in an input stream, and the results of those parses. It contains a parse-position record, which corresponds to the position in the input stream that this parse-results represents, and a map associating "key objects" with instance of parse-result.

Atomic input objects (known as "base values"; usually either characters or token / semantic-value pairs) are represented specially in the parse-results data structure, as an optimisation: the two fields base and code{next} represent the implicit successful parse of a base value at the current position. The base field contains a pair of a toke-class-identifier and a semantic value unless the parse-results data structure as a whole is representing the of the input stream, in which case it will contain #f.

This is a predicate which answer #t if and only if its argument is a parse-results record.

Returns the parse-position corresponding to the argument. An unknown position is represented by #f.

If the argument corresponds to the end of the input stream, this function returns #f; otherwise, it returns a pair, where the car is to be interpreted as a base lexical token class identifier (for instance, "symbol", "string", "number") and the cdr is to be interpreted as the semantic value of the data.

This function returns the car (the token class identifier) of the result of parse-results-base, if that result is a pair; otherwise it returns #f.

This function returns the car (the token class identifier) of the result of parse-results-base, if that result is a pair; otherwise it returns #f.

This function returns the cdr (the token value) of the result of parse-results-base, if that result is a pair; otherwise it returns #f.

This function returns the parse-results record representing the position in the input stream immediately after the argument's base token. For instance, if the base tokens used represented characters, then this function would return the parse-results representing the next character position; or, if the base tokens represented lexemes, then this function would return a representation of the results obtainable starting from the next lexeme position. The value #f is returned if there is no next position (that is, if the argument represents the final possible position before the end-of-stream).

This function is used to set up an initial input stream of base tokens. The argument is to be nullary function returning multiple-values, the first of which is to be a parse-position record or #f, and the second of which is to be a base token, that is a pair of a token class identifier and a semantic value. The argument is called every time the parser needs to read a fresh base token from the input stream.

This function effectively prepends a base token to particular parse-results. This can be useful when implementing extensible parsers: using this function in a suitable loop, it is possible to splice together two streams of input.

For instance, if r is a parse-results representing parse over the input token stream '((b . 2) (c . 3)), then the result of the call

(prepend-base #f '(a . 1) r)

is a new parse-results representing parse over the input stream '((a . 1) (b . 2) (c . 3)).

The first argument to prepend-base, the parse-position, should be either a parse-position representing the location the base token being prepended, or #f if the input position of the base token is unknown.

This function is similar to prepend-base, but prepends an already-computed semantic value to a parse-results, again primarily for use in implementing extensible parsers. The resulting parse-results is assigned the given parse-position, and has an entry in its result map associating the given key-object with the given semantic-value and input parse-results.

This function is the central function that drives the parsing process. It examines the result in the parse-results given to it, searching for an entry matching the given key-object. If such an entry is found, the parse-result structure associated with the key is returned; otherwise, the nullary result-thunk is called, and the resulting parse-result is both stored into the result map and returned to the caller of results->result.

parse-error

Parse-error structure represent collected error information from attempted parses. They contain two kinds of error report, following [3]: a collection of "expected token" messages, and a collection of free-format message strings.

This is a predicate which answers #t if and only if its argument is a parse-error record.

Retrieves the parse-position in the input stream that this parse-error is describing. A #f result indicates an unknown position.

Retrieves the set (represented as a list) of token class identifiers that could have allowed the parse to continue from this point.

Retrieves the list of error messages associated with this parser-error.

Constructs an "expected token" parse-error record from its arguments. Called by make-expected-result.

Constructs an "general error message" parse-error record from its arguments. Called by make-message-result.

Returns #f if its argument contains no expected tokens, and no general error messages; otherwise returns #f. Used internally by merge-result-errors.

Merges two parse-error records, following [3]. If one record represents a position earlier in the input stream than the other, then that record is returned; if they both represent the same position, the "expected token" sets are unioned and the general message lists are appended to form a new parse-error record at the same position. The standard parsing combinators call this function as appropriate to propagate error information through the parse.

parse-position

A parse-position record represents a character location in an input stream.

Constructs a parse-position record from its arguments. The given filename may be #f if the filename is unknown or not appropriate for the input stream the parse-position is indexing into.

This is a predicate which answer #t if any only if its argument is parse-position record.

Retrieves the file name associated with a parse-position record. Returns #f if the filename is absent or not appropriate for this input stream.

Retrieves the line number this parse-position represents. Line numbers begin at 1; that is all characters on the very first line in a file will have line number 1.

Retrieves the column number within a line that parse-position represents. Column numbers begin at 0; that is, the very first character of the very first line in a file will have line number 1 and column number 0.

Constructs a parse-position representing the very beginning of an input stream. The argument is passed into make-parse-position as the "filename" parameter, and so may be either a string or #f.

Given a position, and the character occurring at that position, returns the position of the next character in the input stream. Most characters simply increment the column number. Exceptions to this rule are: #\return, which resets the column number to zero; #\newline, which both resets the column number to zero and increments the line number; and #\tab, which increments the column number to the nearest multiple of eight, just as terminal with an eight-column tab stop setting might do.

Converts a parse-position record into an emacs-compatible display format. If the filename in the parse-position is unknown, the string "<??>" is used in its place. The result is of the form

filename:linenumber:columnnumber

for example,

main.c:33:7

Returns #t if the first parse-position is more than advanced in the input stream than the second parse-position. Either or both positions may be #f, representing unknown positions; an unknown position is considered to be less advanced in the input stream than any known position. Note that the filename associated with each parse-position is completely ignored. It is the caller's responsibility to ensure the two positions are associated with the same input stream.

Parsing Combinators

Parsing combinators are functions taking a parse-results structure and retruning a parse-result structure. Each combinator attempts to parse the input stream in some manner, and the result of the combinator is either a successful parse with an associated semantic value, or a failed parse with an associated error record.

This section describes the procedures that produce the mid-level parsing combinators provided as part of the library.

The type of a parser combinator, written in ML-like notation, would be

parse-results -> parse-result

Returns a combinator which, if the next base token has token class identifier equal to the first argument ("kind-object"), calls the second argument ("semantic-value-acceptor") with the semantic value of the next base token. The result of this cal should be another parser combinator, which is applied to the parse-results representing the remainder of the input stream.

The type of the semantic value acceptor, written in ML-like notation, would be

semanticValue -> parserCombinator

or more fully expanded,

semanticValue -> parse-results -> parse-result

These types recall the types of functions that work with monads.

Returns a combinator which attempts to parse using the first argument, and if the parse is successful, hands the resulting semantic value to the semantic-value-acceptor (which has the same type as the semantic-value-acceptor passed to packrat-check-base ) and continues parsing using the resulting combinator.

Returns a combinator which attempts to parse using the first argument, only trying the second argument if the first argument fails to parse the input. This is the basic combinator used to implement a choice among several alternative means of parsing an input stream.

The combinator returned from this function first tries the first combinator given. If it fails, the second is tried; otherwise, an error message containing the given string is returned as the result. This can be used to assert that a particular sequence of tokens does not occur at the current position before continuing on. (This is the "not-followed-by" matcher).

The parckrat-parser macro

The packrat-parse macro provides syntactic sugar for building complex parser combinators from simpler combinators. The general form of the macro, in an EBNF-like language, is:

(packrat-parser <result-expr> <nonterminal-definition>*)

where

<nonterminal-definition> :==
  (<nonterminal-id> (<sequence> <body-expr>+)*)
<sequence> :== (<part>*)
<part> :== (! <part>*)
       |   (/ <sequence>*)
       |   <var> <- '<kind-object>
       |   <var> <- 

       |   <var> <- <nonterminal-id>
       |   '<kind-object>
       |   <nonterminal-id>

Each nonterminal-definition expands into a parser-combinator. The collection of defined nonterminal parser-combinators expands to a (begin) containing an internal definition for each nonterminal.

The result of the whole packrat-parser form is the <result-expr>immediately following the packrat-parser keyword. Since (begin)within (begin) forms are flattened out in Scheme, the <result-expr> can be used to introduce handwritten parser combinators which can call, and can be called by, the nonterminal definitions built in the rest of the parser definition.

Each nonterminal definition expands into:

(define (<nonterminal-id> results)
  (results->result results 'nonterminal-id
    (lambda ()
      (<...> results))))

where <...> is the expanded definition-of-sequences combinator formed form the body of the nonterminal definition.

An alternation (either implicit in the main body of a nonterminal definition, or introduced via a <part> of the form (/ <sequence> ...)) expands to

(packrat-or <expansion-of-first-alternative> (packrat-or <expansion-of-second-alternative> ...)) This causes each alternative to be tried in turn, in left-to-right order of occurrence.

Wherever a <part> of the form "<var> <- ..." occurs, a variable binding for <var> is made available in the <body-expr>s that make up each arm of a nonterminal definition. The variable will be bound to the semantic value resulting form parsing according to the parser definition to the right of the arrow (the "..." above).

The (! <part> ...) syntax expands into an invocation of packrat-unless.

The "@" syntax in "<var> <- @" causes <var>to be bound to the parse-position at that point in the input stream. This can be used for annotating abstract syntax trees with location information.

<part>s of the form '<kind-object> expand into invocations of packrat-check-base; those of the form <nonterminal-id> expand into invocations of packrat-check, with the procedure associated with the named nonterminal passed in as the combinator argument.

References

[1] Alexander Birman and Jeffrey D. Ullman. Parsing algorithms with backtrack. Information and Control, 23(1):1 34, August 1973

[2] Bryan Ford. Parsing expression grammars: A recognition-based syntactic foundation.

[3] Bryan Ford. Packrat parsing: a practical linear-time algorithm with backtracking. Master's thesis. Massachusetts Institute of Technology, Sep 2002.

[4] Bryan Ford. Packrat parsing: Simple, powerful, lazy, linear time. In Proceedings of the 2002 International Conference on Functional Programming. Oct 2002.