Building An Assembler In Haskell: Implementation
- Lexemes And Space
we wrote a grammar for a simple assembly language, wrote the outline of our parser,
derived some properties from the grammar for a simple parser
byte and implemented
We also saw that there are a few deficiencies in our grammar. In this post we'll implement
labelAssign. For each parser I'll start with some QuickCheck
properties then use those as the spec to implement the parser. Let's get to it!
First, the imports for modules we use, that should make it clearer where functions are coming from.
import Control.Monad (void) -- from the "text" package import qualified Data.Text as T -- from the "megaparsec" package import Text.Megaparsec hiding (Label, label) -- from the "megaparsec" package import qualified Text.Megaparsec.Lexer as L
Text.Megaparsec contains a type
Label and a function
label, we use both
these as names for our own type and function for label parsing, so we hide them when
Lexemes And Space
The lexemes of a language are the smallest syntactic unit. Tokens are categories of lexemes. In our case, the "STORE" string is an example of a lexeme in the category of label tokens. Let's assume we can safely eat any whitespace proceeding lexemes. With this in mind, and before we continue implementing the parsers for our language, let's create convenience functions for parsing trailing space after our lexemes.
spaceEater :: Parser () spaceEater = L.space (void spaceChar) (L.skipLineComment ";") (L.skipBlockComment "/*" "*/") lexeme :: Parser a -> Parser a lexeme = L.lexeme spaceEater
spaceEater uses megaparsec's
function to build a parser that consumes and discards whitespace and comments. Note that it is
L. here (
Text.Megaparsec.Lexer is imported qualified as
L, see here.
spaceEater we create a function called
lexeme which we will use to wrap parsers
so they also consume trailing whitespace. This uses
from megaparsec which takes a space parser, in our case
spaceEater, and a parser for a
Megaparsec Space Function
space comes from
space :: MonadParsec e s m => m () -- ^ A parser for a space character (e.g. @'void' 'C.spaceChar'@) -> m () -- ^ A parser for a line comment (e.g. 'skipLineComment') -> m () -- ^ A parser for a block comment (e.g. 'skipBlockComment') -> m ()
The type of
space corresponds to the following:
- The first argument is a parser for space, we use
void spaceCharhere. spaceChar parses a space character, and void discards the parsed character.
- The second argument is a line comment parser. We use a function from megaparsec called skipLineComment, which does what it says - skips line comments starting with the provided character, ";" in our case.
- The last argument is a block comment parser, here we use skipBlockComment, which parses and discards data between "/" and "/".
I left out the description of the MonadParsec typeclass, I'll leave this until a future post where I'll dive into the megaparsec types in more depth.
First, let's write some properties!
newtype TwoCharHexString = TwoCharHexString T.Text deriving Show instance Arbitrary TwoCharHexString where arbitrary = do upper <- choose ('A', 'F') lower <- choose ('a', 'f') num <- choose ('0', '9') let vals = [upper, lower, num] x <- elements vals y <- elements vals pure $ TwoCharHexString (T.pack (x:[y])) -- Should parse valid two char hexstring. prop_byte_parseValidData (TwoCharHexString s) = parse byte "" s `shouldParse` s -- When successful should not consume more input. prop_byte_parseSuccessShouldNotConsume (TwoCharHexString s) extra = runParser' byte (initialState (T.append s extra)) `succeedsLeaving` extra
byte should parse a two character hex string, representing a byte, so we create a newtype
TwoCharHexString to represent this. The
Arbitrary instance is made up of a random
selection of two characters from the set of upper case
F, lower case
letters and any single digit. With this arbitrary instance QuickCheck can now generate
random instances of our
Note the use of:
- shouldParse -
shouldParsetakes a parse result, the expected value and returns an Expectation. The
Expectationsucceeds if the result is equal to the expected value, and fails otherwise.
- succeedsLeaving -
succeedsLeavingchecks whether a parser has succeeded leaving the specified input untouched.
Simple! We now have a property which checks that the parser parses what we expect and another which checks that it does not consume more input than it should. As of now I have nothing in mind for custom error messages, the ones megaparsec spits out are generally useful enough for parsers this small, so for now I don't think there is a need for a property which checks the error case.
Ok, now that's done, we have a spec for our implementation to follow!
bytes parser parses two bytes, the second being optional. We can use our single
(which we defined in the previous post) to parse each one, along with megaparsec's
to optionally parse the second byte.
option x p tries to apply parser
p fails without consuming input it will return
x. In our case this is exactly what we want, if
byte fails to parse when parsing
the second byte we can just return the empty string. As we are building a
Text value with
this parser we can then append the first byte onto the second and if the second is empty, we
just get the first.
bytes :: Parser T.Text bytes = do char '$' firstByte <- byte anotherByte <- option T.empty byte pure $ T.append firstByte anotherByte
Run the properties and all should be green!
newtype ValidMnemonic = ValidMnemonic T.Text deriving Show instance Arbitrary ValidMnemonic where arbitrary = do upper <- choose ('A', 'Z') pure $ ValidMnemonic (T.pack [upper, upper, upper]) prop_mnemonic_parseValidMnemString (ValidMnemonic s) = parse mnemonic "" s `shouldParse` (Mnemonic s)
Really simple, a valid mnemonic is any upper case three letter string, so that's exactly
Arbitrary instance specifies.
count n pruns the parser
upperCharis a parser for upper case Unicode characters.
Following is the implementation.
mnemonic :: Parser Mnemonic mnemonic = lexeme $ Mnemonic . T.pack <$> mnem where mnem = count 3 upperChar
So what does this actually do?
- We create a function called
mnem, defined as
count 3 upperChar, this parses three upper case characters,
Data.Text, it packs a
Mnemonicis the constructor for our
newtypewhich we defined in the last post -
newtype Mnemonic = Mnemonic T.Text deriving Show.
lexemewe defined above, it eats trailing whitespace and comments.
<$>is the infix synonym for
fmap, which lifts a single argument function into a Functor.
Putting it all together...
Putting it all together, what we get is a function which parses three upper case characters as a
we then map
Mnemonic . T.pack over the
mnem parses which packs it into a
Text value and builds a
Mnemonic from that value, finally it consumes whitespace or comments after the three characters
newtype LabelWithLetter = LabelWithLetter T.Text deriving Show newtype LabelWithNonLetter = LabelWithNonLetter T.Text deriving Show instance Arbitrary LabelWithLetter where arbitrary = do lbl <- genAlphaNum lowerLetter <- choose ('a', 'z') upperLetter <- choose ('A', 'Z') start <- elements [lowerLetter, upperLetter] pure . LabelWithLetter $ T.pack (start:lbl) instance Arbitrary LabelWithNonLetter where arbitrary = do (LabelWithLetter lbl) <- arbitrary nonAlphaChar <- suchThat (arbitrary :: Gen Char) (\s -> not $ isAlpha s) pure . LabelWithNonLetter $ T.append (T.pack [nonAlphaChar]) lbl prop_label_validLabelString (LabelWithLetter lbl) = parse label "" lbl `shouldParse` (Label lbl) prop_label_invalidLabelString (LabelWithNonLetter lbl) = parse label "" lbl `shouldFailWith` err posI (utok (T.head lbl) <> elabel "letter")
label is a little more involved than the last few parsers. In this case we want to verify
that it fails when trying to parse a string that does not start with a letter. So we
Arbitrary instances - the first,
LabelWithLetter, is for all letter strings which
start with a letter and the second,
LabelWithNonLetter, is for strings which start with a
non letter character.
shouldParse, it was used in the
bytes parser, however we haven't seen
yet and there seems to be quite a bit to it! Let's break it down.
Sometimes you want to verify that a parser fails on a given input. Not only that, but you
want to verify that the error which is given on that failure contains the right message,
position information, etc...
shouldFailWith allows you to do this. Let's have a look at
shouldFailWith :: (Ord t, ShowToken t, ShowErrorComponent e, Show a) => Either (ParseError t e) a -> ParseError t e -> Expectation
So it takes:
Either (ParseError t e) a, which is the return type of
Text.Megaparsec's parse function.
ParseError t e, which we can build with
Test.Hspec.Megaparsec's err function.
err functions have the following types:
err :: NonEmpty SourcePos -- ^ 'ParseError' position -> EC t e -- ^ Error components -> ParseError t e -- ^ Resulting 'ParseError' parse :: Parsec e s a -- ^ Parser to run -> String -- ^ Name of source file -> s -- ^ Input for parser
Looking at our
prop_label_invalidLabelString (LabelWithNonLetter lbl) = parse label "" lbl `shouldFailWith` err posI (utok (T.head lbl) <> elabel "letter")
We run the parser
label on the string
lbl which is generated from the
LabelWithNonLetter. We then assert that it fails with the following
information in the error.
- The initial character of
lblis the cause, this is specified in the
errcall with posI.
- It was an unexpected token,
utok (T.head lbl).
- It was expected to be a letter,
- Finally we combine the unexpected token and expected token into one
which is an infix synonymn for
(this is from
EC's Monoid instance) - i.e.
utok (T.head lbl) <> elabel "letter".
In English, we expect the
label parser to fail on any string which does not start with a
letter and the error information should say that it failed at the initial position because
of an unexpected token (giving the actual unexpected character) and finally that it expected
to see a letter here.
label :: Parser Label label = lexeme $ Label . T.pack <$> ((:) <$> letterChar <*> many alphaNumChar)
Let's forget about everything outside of the brackets for now, and only focus on the following.
(:) <$> letterChar <*> many alphaNumChar
To parse a label we first need to parse some string of any length which starts with a letter, according to our grammar.
are functions from megaparsec that parse a single letter and a single alpha-numeric
parses zero or more occurences of the given parser, so
many alphaNumChar parses zero or more alpha-numeric
Functor/Applicative Quick Description
something you can map over (a list is a Functor) and an
(or Appplicative Functor) is something
that you can sequence functions through. For a more in depth description see
Learn You A Haskell: Functors, Applicative Functors and Monoids.
Lists give a nice example comparing the two. Let's
say I have a list of
[1,2] and want to add
1 to each element. List is a
Functor so I can
> fmap (+1) [1,2] [2,3] -- Or the inline version > (+1) <$> [1,2] [2,3]
Now, lets say I have a list
x = [1,2] and a function inside a list,
[(+1)], that I want to
apply to each element of
x. I can't use
fmap here as its type is
(a -> b) -> f a -> f b -
meaning it lifts the function
(a -> b) into the
f a which gives an
f b. What we want is a
fmap, but where the function to apply is within the
Functor already -
and this is precisely what
Applicative gives us with the function
<*>. The type of
(<*>) :: f (a -> b) -> f a -> f b
Excellent! Turns out list also has an
Applicative instance, so lets use it in our problem:
> [(+1)] <*> [1,2] [2,3]
Hopefully that gives some intuition as to what Functors and Applicative Functors are and how they can be used. It may not be clear yet as to why we use them in our parsers, but I'll leave that discussion for another post, as it's not really needed for the rest of this post, and there is quite a bit more to both that I have not mentioned.
Breaking it down
Now the interesting bit.
letterChar will parse a single
many alphaNumChar will parse a
[Char], we need a way of combining the values produced from these parsers so we just
To do this we map
: (the list constructor) over
letterChar which has
Parser Char. What we end up creating is a function of type
Parser ([Char] -> [Char]). Let's assume
letterChar parses a 'c':
-- In case you forgot, here is the type of (:) > (:) :: a -> [a] -> [a] -- So, given > (:) <$> letterChar <*> many alphaNumChar -- After mapping (:) over the value letterChar parsed ('c' in this case) we get > p ((:) 'c') <*> many alphaNumChar
<*> we then apply
p ((:) 'c')
over the value of
many alphaNumChar, which itself has type
Parser [Char]. Let's assume
many alphaNumChar parses "abc123":
-- From here > p ((:) 'c') <*> many alphaNumChar -- we get > p ((:) 'c') <*> p "abc123" -- giving > p ((:) 'c' "abc123") -- which gives > p "cabc123"
Now we have the value we want, our letter char combined with many alpha-numeric chars,
with the type
Parser [Char] (or
Packing It Up
Following the example above we now have:
label = lexeme $ Label . T.pack <$> (p "cabc123")
With our parsed string, we pack it into a
Text value, wrap it up in a
and parse (and discard) possible whitespace with
lexeme. We've seen this above in other
parsers, no need to repeat.
Phew! There was quite a lot to implementing
label but we're done now, and can use this
info later in other parsers.
label is complete,
labelAssign is simple. We will build it from
really when creating a property all we need to check is that it parses and discards a ':'.
prop_labelAssign_shouldDiscardColon (LabelWithLetter lbl) = parse labelAssign "" (T.snoc lbl ':') `shouldParse` Label lbl
labelAssign :: Parser Label labelAssign = lexeme $ label <* char ':'
<* is similar to
<*>, the difference is
<* discards the value of the second argument.
In our case
label <* char ':' says parse a label, then parse a ':' but discard it, so it's
not part of the
Now we have ~80% of our parser completed, properties for each of the parsers, and some understanding (I hope) of one way to test megaparsec parsers. In the next post I'll dive deeper into megaparsec and implement the remaining parsers. You can view the code up to this point at https://github.com/wayofthepie/emu-mos6502-asm-blog/tree/hasm-blog02.