Haskell, Interesting

Unordered Parsing in Parsec

Follow me in my little foray into creating unordered parsers in parsec, somebody said that it was not well documented so I decided to do it for them…

The Problem

When dealing with certain file formats, or things to parse, you will inevitably come across some part of the format where the order of some your data is not important; but all of the data is expected to be present (you should be able to parse the data in whatever order it comes in). I recently came accross this problem when trying to parse an iCalendar file. Now the problem is that something like parsec (normally) will only parse things one after another. Most of the common examples show you something like: expect a character, then expect this string, then expect another X of anything that is not an element of “abcde”. But they are sequential steps; which I believe makes sense considering that Parsec is a Monadic parser.

So the real question is “How do we create something that is ordering invariant (ordering does not matter) in a land where ordering seems to just be a natural part of what is happening?” Before we answer that question lets look at some sample data so that we know what we are dealing with, what is a Real World example of what we might like to parse?

Example Data

Lets take two different lines from two different files, they may look familiar to you (at least the first line should):

<name key1="value1" key2="value2" key3="value3'>inside value</name>
name;key3=value3:key1=value1;key2=value2:inside value

Now the top one should look familiar to every web developer in the world, that is the type of format that makes up the common HTML and XML markup. The second one is actually an equivalent format to XML (it has a 1-1 mapping) and it is the format that iCalendars use. Now I am guessing that you can see the order independant stuff here; that the key/value pairs can be in any order and they will still be the same set of key/value pairs. For example, we would all agree that:

<name key1="value1" key2="value2" key3="value3'>inside value</name>

is equivalent to:

<name key1="value1" key3="value3" key2="value2'>inside value</name>

Now the problem goes like this: in my head I was thinking “Okay, parsing this is easy, I just expect an element with the right name and now I expect key1…oh wait, It could be key3 that comes next…or even key2…son of a…now what on earth am I supposed to do? How do I write this parser?” This was when I decided to ask Stack Overflow and see what responses I got.

The Solution

The solution is to use the Text.Parsec.Perm library because it implements exactly these features; it allows you to parse a set of objects that all exist on the same level in whatever order that you like. Lets just start with an example of how you would parse XML key/value pairs in Haskell code, its not even that long:

Download This File

-- Simple example of Parsec Perm
-- Author: Robert Massaioli (2010) - http://robertmassaioli.wordpress.com/
-- Requirements: A version of parsec with these imports in it. (cabal install 'parsec > 3')
-- Other Requirements: Ability to realise that this is why Haskell rocks.
import Text.Parsec.Char
import Text.Parsec.Combinator
import Text.Parsec.Perm
import Text.Parsec.Prim
import Text.Parsec.String
import Control.Monad

import Prelude

-- these are just some example properties from the HTML IMG tag
-- As you can see from this example the src is required and the width and hight are optional
data ImageTagProperties = ImageTagProperties {
 src :: String,
 width :: Maybe Integer,
 height :: Maybe Integer
-- Begin Parser implementation
-- xmlKeywords contains the the permutation grunt

xmlKeywords :: Parser ImageTagProperties
xmlKeywords = permute (ImageTagProperties
 <$$> (try srcParser)
 <|?> (Nothing, Just `liftM` (try $ pixelParser "width"))
 <|?> (Nothing, Just `liftM` (try $ pixelParser "height"))

-- just some simple parser functions to get the actual data
srcParser :: Parser String
srcParser = do
 many1 space
 string "src=""
 value <- many $ noneOf """
 char '"'
 return value

pixelParser :: String -> Parser Integer
pixelParser attributeName = do
 many1 space
 string (attributeName ++ "="")
 value <- many digit
 string "px""
 return $ read value
-- End Parser implementation

-- Testing code for your convinience follows
main = do
 mapM_ putStrLn . map (show . parse xmlKeywords "") $ goodTests

-- here are some examples of some tests that pass, notice that the width, height
-- and src come in different orders yet they all have the same result
goodTests =
 " src="www.example.com/example.jpg" width="2px" height="5px""
 : " src="www.example.com/example.jpg" width="2px" height="5px""
 : " width="2px" src="www.example.com/example.jpg" height="5px""
 : " height="5px" src="www.example.com/example.jpg" width="2px""
 : " src="www.example.com/example.jpg""
 : []
-- these are examples that will fail on purpose
fail1 = " height="5px" width="2px"" -- missing the src attribute, we stated that it was required

Basically everything else in that code is white noise except the xmlKeywords function. It is where the permutation code happens and it could not be more simple. You write a function that is capable of constucting your target data structure (in my case it is ‘createImageProperties’) then you apply that to each of your different elements that you expect to appear in an unordered way. Once that is done you mere call permute on the whole lot and you are done. It does not get more simple than that and all that you have to know is that the order that you add your permutations to the parser will be the order that your constructor function will expect them in. Now if you paid attention to the last sentence then you would have realised that we can make the xmlKeywords function even more compact like this:

xmlKeywords :: Parser ImageTagProperties
xmlKeywords = permute (ImageTagProperties                            (try srcParser)
                                              (Nothing, Just `liftM` (try $ pixelParser "width"))
                                              (Nothing, Just `liftM` (try $ pixelParser "height"))

Just pass in the constructor for the data structure that you wanted to create and order the permutations correctly and you still get the same result. What could be more easy?

I hope that you can see that with parsec you can create really powerful order independent parsers as well as ordered parsers. In my opinion this is something that is amazingly powerful that I did not expect Parsec to be able to do as easily as it does. Kudos to the Parsec developers and the people that wrote the initial paper.