Pratt’s parser API
The TDOP (Top Down Operator Precedence) parser implemented in this library is a variant of Pratt’s original parser, built on a class for the parser and metaclasses for tokens.
The parser base class includes helper functions for registering token classes, Pratt’s methods and a regexp-based tokenizer builder. Additional methods and attributes help with developing new parsers. A parser is defined by deriving a class from the base class and then following a token registration procedure.
Token base class
- class elementpath.Token(parser, value=None)
Token base class for defining a parser based on Pratt’s method.
Each token instance is a list-like object. The items of a token are the operands of the operator it represents, so the number of items is the operator’s arity. Nullary operators are used for symbols, names and literals. Tokens with items represent the other operators (unary, binary and so on).
Each token class has a symbol, an lbp (left binding power) value and an rbp (right binding power) value, which are used in the sense described by Pratt’s method. This implementation of Pratt tokens adds two extra attributes, pattern and label, that can be used to simplify the parsing of symbols in a concrete parser.
- Parameters
parser – The parser instance that creates the token instance.
value – The token value. If not provided, defaults to the token symbol.
- Variables
symbol – the symbol of the token class.
lbp – Pratt’s left binding power, defaults to 0.
rbp – Pratt’s right binding power, defaults to 0.
pattern – the regex pattern used for the token class. Defaults to the escaped symbol. Can be customized to match more detailed conditions (e.g. a function with its left round bracket), in order to simplify the related code.
label – defines the kind of the token class. Its value is used in representations of the token instance and can be used to restrict code choices without more complicated analysis. The label value can be set as needed by the parser implementation (e.g. ‘function’, ‘axis’ and ‘constructor’ are used by the XPath parsers). In the base parser class it defaults to ‘symbol’, with ‘literal’ and ‘operator’ as possible alternatives. If set to a tuple of values, the token class label becomes a multi-value label, meaning that the token class can cover multiple roles (e.g. as an XPath function or axis). In those cases the definitive role is decided at parse time (in the nud and/or led methods), after the token instance is created.
- arity
- tree
Returns a tree representation string.
- source
Returns the source representation string.
- nud()
Pratt’s null denotation method.
- led(left)
Pratt’s left denotation method.
- evaluate(*args, **kwargs)
Evaluation method.
- iter()
Returns a generator for iterating the token’s tree.
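The list-like token model described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the library's implementation: the `SketchToken` class, its constructor signature, the string format produced by `tree` and the pre-order walk in `iter` are all hypothetical choices made for the example.

```python
class SketchToken(list):
    """Hypothetical list-like token: its items are the operands."""

    symbol = "(name)"  # class-level symbol, as in a Pratt token class
    lbp = 0            # left binding power
    rbp = 0            # right binding power

    def __init__(self, value=None, operands=()):
        super().__init__(operands)
        # The value defaults to the token symbol when it is not provided.
        self.value = value if value is not None else self.symbol

    @property
    def arity(self):
        # The arity is the number of items (operands) held by the token.
        return len(self)

    @property
    def tree(self):
        # Assumed format for this sketch: "(value operand operand ...)".
        if not self:
            return f"({self.value})"
        return "(%s %s)" % (self.value, " ".join(item.tree for item in self))

    def iter(self):
        # Pre-order walk: the token itself, then each operand's subtree.
        yield self
        for item in self:
            yield from item.iter()


one, two = SketchToken(1), SketchToken(2)
plus = SketchToken("+", [one, two])
print(plus.arity, plus.tree)
```

A nullary token is simply an empty list, so literals and names need no special casing when the tree is walked.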
Helper methods for checking symbols and for error raising:
- expected(*symbols)
- unexpected(*symbols)
- wrong_syntax(message=None)
- wrong_value(message='unknown error')
- wrong_type(message='unknown error')
Parser base class
- class elementpath.Parser
Parser class for implementing a Top Down Operator Precedence parser.
- Variables
SYMBOLS – the symbols of the definable tokens for the parser. In the base class it’s an immutable set that contains the symbols for special tokens (literals, names and end-token). A concrete parser has to extend it with all the symbols of its language.
symbol_table – a dictionary that stores the token classes defined for the language.
token_base_class – the base class for creating the language’s token classes.
tokenizer – the compiled regexp used as the language tokenizer.
- position
Property that returns the current line and column indexes.
Parsing methods:
- parse(source)
Parses source code of the formal language. This is the main method to call on a parser instance.
- Parameters
source – The source string.
- Returns
The root of the token tree that results from parsing the source.
- advance(*symbols)
Pratt’s function for advancing to the next token.
- Parameters
symbols – Optional arguments tuple. If not empty, one of the provided symbols is expected. If the next token’s symbol differs, the parser raises a parse error.
- Returns
The next token instance.
- raw_advance(*stop_symbols)
Advances until one of the symbols is found or the end of the source is reached, returning the raw source string that precedes the stopping point. Useful for raw parsing of comments and references enclosed between specific symbols. This is an extension provided by this implementation.
- Parameters
stop_symbols – The symbols that stop the advance when found.
- Returns
The source string chunk enclosed between the initial position and the first stop symbol.
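The raw advance described above can be approximated by a stand-alone function. This is a sketch under assumptions, not the library's code: the function takes the source and the position explicitly and returns the new position alongside the chunk, and it leaves the position on the stop symbol itself, which may differ from what the real method does.

```python
def raw_advance(source, position, *stop_symbols):
    """Hypothetical stand-alone variant: returns (chunk, new_position)."""
    # Collect the first occurrence of each stop symbol at or after `position`.
    hits = [
        index
        for index in (source.find(symbol, position) for symbol in stop_symbols)
        if index != -1
    ]
    # Stop at the earliest hit, or at the end of the source if none is found.
    end = min(hits) if hits else len(source)
    return source[position:end], end


source = "(: a comment :) 1 + 2"
chunk, end = raw_advance(source, 2, ":)")
print(repr(chunk))
```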
- expression(rbp=0)
Pratt’s function for parsing an expression. It calls token.nud() and then keeps advancing while the right binding power is less than the left binding power of the next token, invoking the led() method on each following token.
- Parameters
rbp – right binding power for the expression.
- Returns
left token.
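The interplay of nud(), led() and expression() described above can be shown with a minimal, self-contained Pratt parser for integer arithmetic. It is a sketch of the general technique, not of this library's classes: `MiniParser`, the binding power values, the tuple-based tokens and the tuple-based result tree are all assumptions made for the example.

```python
import re

# Integers or operators, with leading whitespace skipped; "**" is tried before "*".
TOKEN_RE = re.compile(r"\s*(?:(\d+)|(\*\*|[-+*/]))")

# Hypothetical binding powers, chosen only for this sketch.
LBP = {"+": 10, "-": 10, "*": 20, "/": 20, "**": 30}
RIGHT_ASSOCIATIVE = {"**"}
END = ("(end)", None)


def tokenize(source):
    source = source.rstrip()
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at index {pos}")
        pos = match.end()
        number, operator = match.groups()
        yield ("(integer)", int(number)) if number is not None else (operator, operator)


class MiniParser:
    def __init__(self, source):
        self.tokens = tokenize(source)
        self.next_token = next(self.tokens, END)

    def advance(self):
        # Return the current token and move the lookahead forward.
        token, self.next_token = self.next_token, next(self.tokens, END)
        return token

    def nud(self, token):
        # Null denotation: only integer literals can start an expression here.
        symbol, value = token
        if symbol != "(integer)":
            raise SyntaxError(f"unexpected symbol {symbol!r}")
        return value

    def led(self, token, left):
        # Left denotation: parse the right operand with the operator's power,
        # lowered by one for right-associative operators.
        symbol = token[0]
        bp = LBP[symbol]
        right = self.expression(bp - 1 if symbol in RIGHT_ASSOCIATIVE else bp)
        return (symbol, left, right)

    def expression(self, rbp=0):
        left = self.nud(self.advance())
        # Keep going while the next token binds tighter than the caller.
        while rbp < LBP.get(self.next_token[0], 0):
            left = self.led(self.advance(), left)
        return left


def parse(source):
    parser = MiniParser(source)
    tree = parser.expression()
    if parser.next_token is not END:
        raise SyntaxError(f"unexpected symbol {parser.next_token[0]!r}")
    return tree


print(parse("1 + 2 * 3"))
```

Passing `bp - 1` as the right binding power is a common way to make an operator right-associative, which is why `2 ** 3 ** 2` groups to the right while `1 - 2 - 3` groups to the left.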
Helper methods for checking parser status:
- is_source_start()
Returns True if the parser is positioned at the start of the source, ignoring the spaces.
- is_line_start()
Returns True if the parser is positioned at the start of a source line, ignoring the spaces.
- is_spaced(before=True, after=True)
Returns True if the source has an extra space (whitespace, tab or newline) immediately before or after the current position of the parser.
- Parameters
before – if True, also considers the extra spaces before the current token symbol.
after – if True, also considers the extra spaces after the current token symbol.
Helper methods for building new parsers:
- classmethod register(symbol, **kwargs)
Registers or updates a token class in the symbol table.
- Parameters
symbol – The identifier symbol for a new class or an existing token class.
kwargs – Optional attributes/methods for the token class.
- Returns
A token class.
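The register-or-update behaviour described above can be sketched with a module-level symbol table and the three-argument form of `type()`. This is an assumption-laden illustration rather than the library's code: `BaseToken`, the generated class names and the free function `register` (instead of a classmethod) are hypothetical.

```python
class BaseToken(list):
    """Hypothetical token base class with Pratt-style class attributes."""
    symbol = None
    lbp = 0
    rbp = 0
    label = "symbol"


symbol_table = {}


def register(symbol, **kwargs):
    """Create a token class for `symbol`, or update the existing one."""
    try:
        token_class = symbol_table[symbol]
    except KeyError:
        # New symbol: build a subclass whose attributes come from kwargs.
        name = f"_Token{len(symbol_table)}"  # naming scheme assumed for the sketch
        token_class = type(name, (BaseToken,), {"symbol": symbol, **kwargs})
        symbol_table[symbol] = token_class
    else:
        # Known symbol: update attributes or methods in place.
        for key, value in kwargs.items():
            setattr(token_class, key, value)
    return token_class


plus = register("+", lbp=40, rbp=40, label="operator")
same = register("+", lbp=50)  # updates the class registered above
print(plus.symbol, plus.lbp, plus.label)
```

Updating in place means that token classes already referenced elsewhere see the new attributes, which fits a registration procedure that redefines methods for existing tokens.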
- classmethod unregister(symbol)
Unregister a token class from the symbol table.
- classmethod duplicate(symbol, new_symbol, **kwargs)
Duplicate a token class with a new symbol.
- classmethod literal(symbol, bp=0)
Register a token for a symbol that represents a literal.
- classmethod nullary(symbol, bp=0)
Register a token for a symbol that represents a nullary operator.
- classmethod prefix(symbol, bp=0)
Register a token for a symbol that represents a prefix unary operator.
- classmethod postfix(symbol, bp=0)
Register a token for a symbol that represents a postfix unary operator.
- classmethod infix(symbol, bp=0)
Register a token for a symbol that represents an infix binary operator.
- classmethod infixr(symbol, bp=0)
Register a token for a symbol that represents an infixr (right-associative infix) binary operator.
- classmethod method(symbol, bp=0)
Register a token for a symbol that represents a custom operator or redefine a method for an existing token.
- classmethod build()
Builds the parser class. Checks that all declared symbols are defined and builds the regex tokenizer from the patterns of the symbols.
- static create_tokenizer(symbol_table, name_pattern='[A-Za-z0-9_]+')
Returns a regex-based tokenizer built from a symbol table of token classes. The returned tokenizer skips extra spaces between symbols.
A regular expression is created from the symbol table of the parser using a template. The symbols are inserted into the template with the longer symbols first, so that a symbol is never shadowed by a shorter symbol that is its prefix. Symbols and their patterns can’t contain spaces.
- Parameters
symbol_table – a dictionary containing the token classes of the formal language.
name_pattern – pattern to use to match names.
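The longest-symbol-first rule can be illustrated with a small tokenizer builder based on the standard `re` module. It is a sketch under assumptions, not the template used by the library: `build_tokenizer`, `tokenize`, the group names and the number pattern are hypothetical, and the builder takes a plain set of symbols rather than a symbol table of token classes.

```python
import re


def build_tokenizer(symbols, name_pattern="[A-Za-z0-9_]+"):
    """Hypothetical helper: compile a tokenizer regex from a set of symbols."""
    # Longer symbols first, so that "<=" is tried before "<".
    ordered = sorted(symbols, key=len, reverse=True)
    symbol_pattern = "|".join(re.escape(symbol) for symbol in ordered)
    # One named alternative per token kind; leading whitespace is skipped.
    return re.compile(
        rf"\s*(?:(?P<number>\d+(?:\.\d+)?)"
        rf"|(?P<symbol>{symbol_pattern})"
        rf"|(?P<name>{name_pattern}))"
    )


def tokenize(tokenizer, source):
    pos, end = 0, len(source.rstrip())
    while pos < end:
        match = tokenizer.match(source, pos)
        if match is None:
            raise SyntaxError(f"unknown symbol at index {pos}")
        pos = match.end()
        # lastgroup names the alternative that matched.
        yield match.lastgroup, match.group(match.lastgroup)


tokenizer = build_tokenizer({"<", "<=", "=", "(", ")"})
print(list(tokenize(tokenizer, "a <= 10")))
```

Without the sort, an alternation that listed `<` before `<=` would split `<=` into two tokens, because regex alternation takes the first alternative that matches.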