*This page is no longer maintained, please go to ParserCombinators.*

-- DoaitseSwierstra - 30 Aug 2005

Fast, Error Repairing Parser Combinators

Why These Combinators?

Maybe you have always been intrigued by the use of Combinator Parsers because they allow you to:

  • use the abstraction, typing and naming mechanism of Haskell
  • create parsers dynamically
  • keep life simple by not having to run a separate program in order to generate a parser
  • work with (limited forms of) infinite grammars
  • write your grmmars the way you think they should look like, withpout having to pay attention to things like LALR(1) restrictions

Maybe you have thus far refrained form using them because of:
  • expensive, backtracking implementations
  • implementations that just try all alternatives sequentially
  • bad error reporting and no error recovery properties
  • combinator libraries that put strong restrictions on your grammar, e.g. requiring the grammar to be LL(1)
  • combinator libraries that force you to control the backtracking process by explicitly annotation your parsers
  • error messages that are only produced only after the parsing process has fully completed
  • space leaks
  • non-online parsers that only return something after the whole input has been consumed

So why not use the Utrecht Parser Combinators, that have all the afore mentioned advantages, but not the mentioned disadvantages! We furthermore provide you with:
  • the possibility to control your own error-reporting
  • the possibility to add state to the parsing process, and to define the way your result is constructed by giving appropriate class instances
  • a "range" construct that allows you to define parsers that accept any element in a specific range
  • combinators for recognising "woven lists"
  • combinators allowing you to define parsers that recognise a permutation of elements
  • a default implementation of all classes, that makes your life simple in case you are not interested in fiddling the inner workings
  • an error printing mechanism that prints interleaved error messages whenever repairs are necessary
  • an implementation that is generally faster than Parsec

How to get software

  • The files are all located in cvs.cs.uu.nl. The best thing to do is to check out the complete uust tree, and follow the install instructions.
    • cvs -d:pserver:anonymous@cvs.cs.uu.nl:/data/cvs-rep login
    • cvs -d:pserver:anonymous@cvs.cs.uu.nl:/data/cvs-rep checkout uust

  • Examples of uses of the parsing library can be found in the AG (attribute grammar) system and the UHC (Utrecht Haskell Compiler, that is under construction)
  • The manual of the previous version, which is actually a chapter from our lecture notes on the Implementation of Programming Languages. We are working on a new version of this manual for the current version. The general idea did not change, but many functions have been renamed, reimplemented or removed (e.g. <<|>).

How to use the software

  • in Hugs: use the "-98" option
  • in Ghc: use the follwing options
    • -fglasgow-exts -fallow-undecidable-instances -fallow-overlapping-instances -fwarn-missing-signatures

Benchmarks

Daan Leijen has produced a parser for Oberon, and has provided a set of benchmarks programs. We show you some statistics.
  • a comparison between several different parser machineries, including the time it takes to merely convert the input into a sequence of tokens. This part is included in all the parsing times mentioned (courtesy of Daan Leijen)
  • a comparison between different implementations of UU_Parsing (on a different machine and a different Ghc compiler)

Note that (as with all benchmarks) these numbers are highly disputable. In general parsing will be so fast compared to what you want to do with the result that it will not make a great difference for the overall performance of your final product which alternative you take. It is however remarkable that all the extra facilities you get actually do not seem to cost you much.

Related SoftwareDistributions

You may be interested also in:
  • SyntaxMacros: a macro-like mechanism that allows the programmer to extend both the context-free syntax and the associated attribute computations on the abstrcat syntax trees used by the language on a per program basis
  • a Tiger Compiler that describes the implementation of the famous Tiger langauge as used in the compiler construction books written by Andrew Appel, but in this case making use of our parser combinators and [[Attribute Grammar System.

Things to come

We are about to release:
  • a new version that will allow you to switch from scanner
  • a version that enables you to efficiently handle ambiguous grammars, as described in our ICFP03 paper
  • a way of getting access to the generated error messages as described in a MessageFromMartijnSchrage

Feedback

I would be happy it you would let me know:
  • whether you are using the package and what for
  • what modifications you are interested in
  • what modifications you have made yourself
  • whether you think the package is useful (visiting committees have called the construction of such packages "not useful"; once they show up again it would be nice if I could present proof of the contrary)

If you have any comments do not hesitate to contact me: Doaitse Swierstra

Version History

If you want to stay informed about changes and corrections you should send me an email..The August 2003 is basically a refinement, that has been brought in line with the other tools we provide.

Version August 23, 2003:

  • contains suppor for the offside rule
  • slightly changed class structure
  • better integeration with the AG system
  • made part of the lib section of the uust tree at http://cvs.cs.uu.nl
  • examples can be found in the examples section
  • systems that make use of the parser combinators can be found in the tools section

Version of March 6, 2002

  • added `seq` construct to force error messages stemming from not-used part of the inpute

Version of March 5, 2002

  • great speed improvement
  • faster repairing process
  • merged all files into one
  • separate recognisers
  • offside rule parser added
  • parseIO for printing error messages during parsing
  • many other small improvements

Version of September 25, 2001

  • small change in selection of best alternative to speed up most common cases. Ok steps have been replaced by Cost 0 steps. Only UU_Parsing_Core was changed.

Version of September 24, 2001

  • fixed bug causing infinite sequences of insertions. This bug was introduced when making error correction faster ;-{{
  • added th functions anaMap and libMap that may be used to "peek/poke" the internal state of the parser. It makes it possible to give the parsers a statelike behaviour, a la Monads. It remains to be seen how this can be made working with an integrated scanner

Version of August 8, 2001

  • a completely different, much simpler underlying parsing machine, which:
    • is inherently faster than the earlier version (also than the March version)
    • provides error messages "on the go", i.e. you do not have to wait until parsing has been completed before error messages come out
    • better error recovery by using a longer look-ahead, so more information is taken into account when repairing
  • a limitation of the look-ahead computation to a single symbol (the reasons for this are explained later)
  • the internal use of ranges, instead of single symbols, which eases the recognition of e.g. character based sequences, and facilitates the description of lexers
  • the possibility to specify separate repair (insertion and deleetion) costs for individual symbols. Be careful when using this facility. It may easily lead to "infinite" insertions, something we have gone far in order to prevent this from happening.
  • the introduction of a "low priority" alternative, that allows you to give preference to the non-empty alternative of a choice combinator,  based on a one-symbol look-ahead
  • the introduction of a failsafe insertion strategy when repairs are needed and the end of the file has been reached. This was a problem in the earlier version. By the introduction of the computation of the length of the shortest sequence of tokens that can be derived from a non-terminal we now always take "the shortest route out".
  • (2.1) the possibility to define the wayb in which corrections are being reported
  • (2.1) the permutation combinators
  • (2.1) the weaving combinators
  • the Bibtex example was added

Version of October 9, 2001

  • added the functions libWrap and anaWrap, that may be used to manipulate the internal machine.
Topic revision: r6 - 27 Nov 2007, UnknownUser
 

This site is powered by FoswikiCopyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding UUCS? Send feedback