Building an Assembler in Haskell: Implementation
In the previous post we wrote a grammar for a simple assembly language, wrote the outline of our parser, derived some properties from the grammar for a simple parser
byte and implemented
byte. We also saw that there are a few deficiencies in our grammar. In this post we'll implement
labelAssign. For each parser I'll start with some QuickCheck properties then use those as the spec to implement the parser. Let's get to it!
First, the imports for modules we use, that should make it clearer where functions are coming from.
import Control.Monad (void) -- from the "text" package import qualified Data.Text as T -- from the "megaparsec" package import Text.Megaparsec hiding (Label, label) -- from the "megaparsec" package import qualified Text.Megaparsec.Lexer as L
Text.Megaparsec contains a type
Label and a function
label, we use both these as names for our own type and function for label parsing, so we hide them when importing.
Lexemes And Space
The lexemes of a language are the smallest syntactic unit. Tokens are categories of lexemes. In our case, the "STORE" string is an example of a lexeme in the category of label tokens. Let's assume we can safely eat any whitespace proceeding lexemes. With this in mind, and before we continue implementing the parsers for our language, let's create convenience functions for parsing trailing space after our lexemes.
spaceEater :: Parser () spaceEater = L.space (void spaceChar) (L.skipLineComment ";") (L.skipBlockComment "/*" "*/") lexeme :: Parser a -> Parser a lexeme = L.lexeme spaceEater
spaceEater uses megaparsec's space function to build a parser that consumes and discards whitespace and comments. Note that it is prefixed with
L. here (
Text.Megaparsec.Lexer is imported qualified as
L, see here. Using
spaceEater we create a function called
lexeme which we will use to wrap parsers so they also consume trailing whitespace. This uses lexeme from megaparsec which takes a space parser, in our case
spaceEater, and a parser for a lexeme.
Megaparsec Space Function
space comes from
space :: MonadParsec e s m => m () -- ^ A parser for a space character (e.g. @'void' 'C.spaceChar'@) -> m () -- ^ A parser for a line comment (e.g. 'skipLineComment') -> m () -- ^ A parser for a block comment (e.g. 'skipBlockComment') -> m ()
The type of
space corresponds to the following:
- The first argument is a parser for space, we use
void spaceCharhere. spaceChar parses a space character, and void discards the parsed character.
- The second argument is a line comment parser. We use a function from megaparsec called skipLineComment, which does what it says - skips line comments starting with the provided character, ";" in our case.
- The last argument is a block comment parser, here we use skipBlockComment, which parses and discards data between "/" and "/".
I left out the description of the MonadParsec typeclass, I'll leave this until a future post where I'll dive into the megaparsec types in more depth.
First, let's write some properties!
newtype TwoCharHexString = TwoCharHexString T.Text deriving Show instance Arbitrary TwoCharHexString where arbitrary = do upper <- choose ('A', 'F') lower <- choose ('a', 'f') num <- choose ('0', '9') let vals = [upper, lower, num] x <- elements vals y <- elements vals pure $ TwoCharHexString (T.pack (x:[y])) -- Should parse valid two char hexstring. prop_byte_parseValidData (TwoCharHexString s) = parse byte "" s `shouldParse` s -- When successful should not consume more input. prop_byte_parseSuccessShouldNotConsume (TwoCharHexString s) extra = runParser' byte (initialState (T.append s extra)) `succeedsLeaving` extra
byte should parse a two character hex string, representing a byte, so we create a newtype
TwoCharHexString to represent this. The
Arbitrary instance is made up of a random selection of two characters from the set of upper case
F, lower case
f letters and any single digit. With this arbitrary instance QuickCheck can now generate random instances of our
Note the use of:
- shouldParse -
shouldParsetakes a parse result, the expected value and returns an Expectation. The
Expectationsucceeds if the result is equal to the expected value, and fails otherwise.
- succeedsLeaving -
succeedsLeavingchecks whether a parser has succeeded leaving the specified input untouched.
Simple! We now have a property which checks that the parser parses what we expect and another which checks that it does not consume more input than it should. As of now I have nothing in mind for custom error messages, the ones megaparsec spits out are generally useful enough for parsers this small, so for now I don't think there is a need for a property which checks the error case.
Ok, now that's done, we have a spec for our implementation to follow!
bytes parser parses two bytes, the second being optional. We can use our single byte parser (which we defined in the previous post) to parse each one, along with megaparsec's option function to optionally parse the second byte.
option x p tries to apply parser
p fails without consuming input it will return the value
x. In our case this is exactly what we want, if
byte fails to parse when parsing the second byte we can just return the empty string. As we are building a
Text value with this parser we can then append the first byte onto the second and if the second is empty, we just get the first.
bytes :: Parser T.Text bytes = do char '$' firstByte <- byte anotherByte <- option T.empty byte pure $ T.append firstByte anotherByte
Run the properties and all should be green!
newtype ValidMnemonic = ValidMnemonic T.Text deriving Show instance Arbitrary ValidMnemonic where arbitrary = do upper <- choose ('A', 'Z') pure $ ValidMnemonic (T.pack [upper, upper, upper]) prop_mnemonic_parseValidMnemString (ValidMnemonic s) = parse mnemonic "" s `shouldParse` (Mnemonic s)
Really simple, a valid mnemonic is any upper case three letter string, so that's exactly what our
Arbitrary instance specifies.
count n pruns the parser
upperCharis a parser for upper case Unicode characters.
Following is the implementation.
mnemonic :: Parser Mnemonic mnemonic = lexeme $ Mnemonic . T.pack <$> mnem where mnem = count 3 upperChar
So what does this actually do?
- We create a function called
mnem, defined as
count 3 upperChar, this parses three upper case characters,
Data.Text, it packs a
Mnemonicis the constructor for our
newtypewhich we defined in the last post -
newtype Mnemonic = Mnemonic T.Text deriving Show.
lexemewe defined above, it eats trailing whitespace and comments.
<$>is the infix synonym for
fmap, which lifts a single argument function into a Functor.
Putting it all together...
Putting it all together, what we get is a function which parses three upper case characters as a
String, we then map
Mnemonic . T.pack over the value that
mnem parses which packs it into a
Text value and builds a
Mnemonic from that value, finally it consumes whitespace or comments after the three characters with
newtype LabelWithLetter = LabelWithLetter T.Text deriving Show newtype LabelWithNonLetter = LabelWithNonLetter T.Text deriving Show instance Arbitrary LabelWithLetter where arbitrary = do lbl <- genAlphaNum lowerLetter <- choose ('a', 'z') upperLetter <- choose ('A', 'Z') start <- elements [lowerLetter, upperLetter] pure . LabelWithLetter $ T.pack (start:lbl) instance Arbitrary LabelWithNonLetter where arbitrary = do (LabelWithLetter lbl) <- arbitrary nonAlphaChar <- suchThat (arbitrary :: Gen Char) (\s -> not $ isAlpha s) pure . LabelWithNonLetter $ T.append (T.pack [nonAlphaChar]) lbl prop_label_validLabelString (LabelWithLetter lbl) = parse label "" lbl `shouldParse` (Label lbl) prop_label_invalidLabelString (LabelWithNonLetter lbl) = parse label "" lbl `shouldFailWith` err posI (utok (T.head lbl) <> elabel "letter")
label is a little more involved than the last few parsers. In this case we want to verify that it fails when trying to parse a string that does not start with a letter. So we need two
Arbitrary instances - the first,
LabelWithLetter, is for all letter strings which start with a letter and the second,
LabelWithNonLetter, is for strings which start with a non letter character.
shouldParse, it was used in the
bytes parser, however we haven't seen shouldFailWith yet and there seems to be quite a bit to it! Let's break it down.
Sometimes you want to verify that a parser fails on a given input. Not only that, but you want to verify that the error which is given on that failure contains the right message, position information, etc...
shouldFailWith allows you to do this. Let's have a look at its type.
shouldFailWith :: (Ord t, ShowToken t, ShowErrorComponent e, Show a) => Either (ParseError t e) a -> ParseError t e -> Expectation
So it takes:
Either (ParseError t e) a, which is the return type of
Text.Megaparsec's parse function.
ParseError t e, which we can build with
Test.Hspec.Megaparsec's err function.
err functions have the following types:
err :: NonEmpty SourcePos -- ^ 'ParseError' position -> EC t e -- ^ Error components -> ParseError t e -- ^ Resulting 'ParseError' parse :: Parsec e s a -- ^ Parser to run -> String -- ^ Name of source file -> s -- ^ Input for parser
Looking at our
prop_label_invalidLabelString (LabelWithNonLetter lbl) = parse label "" lbl `shouldFailWith` err posI (utok (T.head lbl) <> elabel "letter")
We run the parser
label on the string
lbl which is generated from the
Arbitrary instance of
LabelWithNonLetter. We then assert that it fails with the following information in the error.
- The initial character of
lblis the cause, this is specified in the
errcall with posI.
- It was an unexpected token,
utok (T.head lbl).
- It was expected to be a letter,
- Finally we combine the unexpected token and expected token into one EC using <> which is an infix synonymn for mappend (this is from
EC's Monoid instance) - i.e.
utok (T.head lbl) <> elabel "letter".
In English, we expect the
label parser to fail on any string which does not start with a letter and the error information should say that it failed at the initial position because of an unexpected token (giving the actual unexpected character) and finally that it expected to see a letter here.
label :: Parser Label label = lexeme $ Label . T.pack <$> ((:) <$> letterChar <*> many alphaNumChar)
Let's forget about everything outside of the brackets for now, and only focus on the following.
(:) <$> letterChar <*> many alphaNumChar
To parse a label we first need to parse some string of any length which starts with a letter, according to our grammar.
letterChar and alphaNumChar are functions from megaparsec that parse a single letter and a single alpha-numeric character, respectively. many parses zero or more occurences of the given parser, so
many alphaNumChar parses zero or more alpha-numeric characters.
Functor/Applicative Quick Description
A Functor is something you can map over (a list is a Functor) and an Applicative (or Appplicative Functor) is something that you can sequence functions through. For a more in depth description see Learn You A Haskell: Functors, Applicative Functors and Monoids. Lists give a nice example comparing the two. Let's say I have a list of
[1,2] and want to add
1 to each element. List is a
Functor so I can use
> fmap (+1) [1,2] [2,3] -- Or the inline version > (+1) <$> [1,2] [2,3]
Now, lets say I have a list
x = [1,2] and a function inside a list,
[(+1)], that I want to apply to each element of
x. I can't use
fmap here as its type is
(a -> b) -> f a -> f b - meaning it lifts the function
(a -> b) into the
f a which gives an
f b. What we want is a function, like
fmap, but where the function to apply is within the
Functor already - and this is precisely what
Applicative gives us with the function
<*>. The type of
(<*>) :: f (a -> b) -> f a -> f b
Excellent! Turns out list also has an
Applicative instance, so lets use it in our problem:
> [(+1)] <*> [1,2] [2,3]
Hopefully that gives some intuition as to what Functors and Applicative Functors are and how they can be used. It may not be clear yet as to why we use them in our parsers, but I'll leave that discussion for another post, as it's not really needed for the rest of this post, and there is quite a bit more to both that I have not mentioned.
Breaking it down
Now the interesting bit.
letterChar will parse a single
many alphaNumChar will parse a
String, which is a
[Char], we need a way of combining the values produced from these parsers so we just get a
To do this we map
: (the list constructor) over
letterChar which has type
Parser Char. What we end up creating is a function of type
Parser ([Char] -> [Char]). Let's assume
letterChar parses a 'c':
-- In case you forgot, here is the type of (:) > (:) :: a -> [a] -> [a] -- So, given > (:) <$> letterChar <*> many alphaNumChar -- After mapping (:) over the value letterChar parsed ('c' in this case) we get > p ((:) 'c') <*> many alphaNumChar
<*> we then apply
p ((:) 'c') over the value of
many alphaNumChar, which itself has type
Parser [Char]. Let's assume
many alphaNumChar parses "abc123":
-- From here > p ((:) 'c') <*> many alphaNumChar -- we get > p ((:) 'c') <*> p "abc123" -- giving > p ((:) 'c' "abc123") -- which gives > p "cabc123"
Now we have the value we want, our letter char combined with many alpha-numeric chars, with the type
Parser [Char] (or
Packing It Up
Following the example above we now have:
label = lexeme $ Label . T.pack <$> (p "cabc123")
With our parsed string, we pack it into a
Text value, wrap it up in a
Label and parse (and discard) possible whitespace with
lexeme. We've seen this above in other parsers, no need to repeat.
Phew! There was quite a lot to implementing
label but we're done now, and can use this info later in other parsers.
label is complete,
labelAssign is simple. We will build it from
label so really when creating a property all we need to check is that it parses and discards a ':'.
prop_labelAssign_shouldDiscardColon (LabelWithLetter lbl) = parse labelAssign "" (T.snoc lbl ':') `shouldParse` Label lbl
labelAssign :: Parser Label labelAssign = lexeme $ label <* char ':'
<* is similar to
<*>, the difference is
<* discards the value of the second argument. In our case
label <* char ':' says parse a label, then parse a ':' but discard it, so it's not part of the
Now we have ~80% of our parser completed, properties for each of the parsers, and some understanding (I hope) of one way to test megaparsec parsers. In the next post I'll dive deeper into megaparsec and implement the remaining parsers. You can view the code up to this point at https://github.com/wayofthepie/emu-mos6502-asm-blog/tree/hasm-blog02.