Building an Assembler in Haskell: Implementation
In the previous post we wrote a grammar for a simple assembly language, wrote the outline of our parser, derived some properties from the grammar for a simple parser byte
and implemented byte
. We also saw that there are a few deficiencies in our grammar. In this post we'll implement bytes
, menmonic
, label
and labelAssign
. For each parser I'll start with some QuickCheck properties then use those as the spec to implement the parser. Let's get to it!
Imports
First, the imports for modules we use, that should make it clearer where functions are coming from.
import Control.Monad (void)
-- from the "text" package
import qualified Data.Text as T
-- from the "megaparsec" package
import Text.Megaparsec hiding (Label, label)
-- from the "megaparsec" package
import qualified Text.Megaparsec.Lexer as L
Note that Text.Megaparsec
contains a type Label
and a function label
, we use both these as names for our own type and function for label parsing, so we hide them when importing.
Lexemes And Space
The lexemes of a language are the smallest syntactic unit. Tokens are categories of lexemes. In our case, the "STORE" string is an example of a lexeme in the category of label tokens. Let's assume we can safely eat any whitespace proceeding lexemes. With this in mind, and before we continue implementing the parsers for our language, let's create convenience functions for parsing trailing space after our lexemes.
spaceEater :: Parser ()
spaceEater = L.space
(void spaceChar)
(L.skipLineComment ";")
(L.skipBlockComment "/*" "*/")
lexeme :: Parser a -> Parser a
lexeme = L.lexeme spaceEater
spaceEater
uses megaparsec's space function to build a parser that consumes and discards whitespace and comments. Note that it is prefixed with L.
here (L.space
) because Text.Megaparsec.Lexer
is imported qualified as L
, see here. Using spaceEater
we create a function called lexeme
which we will use to wrap parsers so they also consume trailing whitespace. This uses lexeme from megaparsec which takes a space parser, in our case spaceEater
, and a parser for a lexeme.
Megaparsec Space Function
space
comes from Text.Megaparsec.Lexer
.
space :: MonadParsec e s m
=> m () -- ^ A parser for a space character (e.g. @'void' 'C.spaceChar'@)
-> m () -- ^ A parser for a line comment (e.g. 'skipLineComment')
-> m () -- ^ A parser for a block comment (e.g. 'skipBlockComment')
-> m ()
The type of space
corresponds to the following:
- The first argument is a parser for space, we use
void spaceChar
here. spaceChar parses a space character, and void discards the parsed character. - The second argument is a line comment parser. We use a function from megaparsec called skipLineComment, which does what it says - skips line comments starting with the provided character, ";" in our case.
- The last argument is a block comment parser, here we use skipBlockComment, which parses and discards data between "/" and "/".
I left out the description of the MonadParsec typeclass, I'll leave this until a future post where I'll dive into the megaparsec types in more depth.
bytes
Properties
First, let's write some properties!
newtype TwoCharHexString = TwoCharHexString T.Text deriving Show
instance Arbitrary TwoCharHexString where
arbitrary = do
upper <- choose ('A', 'F')
lower <- choose ('a', 'f')
num <- choose ('0', '9')
let vals = [upper, lower, num]
x <- elements vals
y <- elements vals
pure $ TwoCharHexString (T.pack (x:[y]))
-- Should parse valid two char hexstring.
prop_byte_parseValidData (TwoCharHexString s) = parse byte "" s `shouldParse` s
-- When successful should not consume more input.
prop_byte_parseSuccessShouldNotConsume (TwoCharHexString s) extra =
runParser' byte (initialState (T.append s extra)) `succeedsLeaving` extra
byte
should parse a two character hex string, representing a byte, so we create a newtype TwoCharHexString
to represent this. The Arbitrary
instance is made up of a random selection of two characters from the set of upper case A
to F
, lower case a
to f
letters and any single digit. With this arbitrary instance QuickCheck can now generate random instances of our TwoCharHexString
type.
Note the use of:
- shouldParse -
shouldParse
takes a parse result, the expected value and returns an Expectation. TheExpectation
succeeds if the result is equal to the expected value, and fails otherwise. - succeedsLeaving -
succeedsLeaving
checks whether a parser has succeeded leaving the specified input untouched.
Simple! We now have a property which checks that the parser parses what we expect and another which checks that it does not consume more input than it should. As of now I have nothing in mind for custom error messages, the ones megaparsec spits out are generally useful enough for parsers this small, so for now I don't think there is a need for a property which checks the error case.
Ok, now that's done, we have a spec for our implementation to follow!
Implementation
The bytes
parser parses two bytes, the second being optional. We can use our single byte parser (which we defined in the previous post) to parse each one, along with megaparsec's option function to optionally parse the second byte.
option x p
tries to apply parser p
, if p
fails without consuming input it will return the value x
. In our case this is exactly what we want, if byte
fails to parse when parsing the second byte we can just return the empty string. As we are building a Text
value with this parser we can then append the first byte onto the second and if the second is empty, we just get the first.
bytes :: Parser T.Text
bytes = do
char '$'
firstByte <- byte
anotherByte <- option T.empty byte
pure $ T.append firstByte anotherByte
Run the properties and all should be green!
mnemonic
Properties
newtype ValidMnemonic = ValidMnemonic T.Text deriving Show
instance Arbitrary ValidMnemonic where
arbitrary = do
upper <- choose ('A', 'Z')
pure $ ValidMnemonic (T.pack [upper, upper, upper])
prop_mnemonic_parseValidMnemString (ValidMnemonic s) =
parse mnemonic "" s `shouldParse` (Mnemonic s)
Really simple, a valid mnemonic is any upper case three letter string, so that's exactly what our Arbitrary
instance specifies.
Implementation
Here we use megaparsec's count and upperChar functions.
count n p
runs the parserp
n
times.upperChar
is a parser for upper case Unicode characters.
Following is the implementation.
mnemonic :: Parser Mnemonic
mnemonic = lexeme $ Mnemonic . T.pack <$> mnem
where
mnem = count 3 upperChar
So what does this actually do?
- We create a function called
mnem
, defined ascount 3 upperChar
, this parses three upper case characters, T.pack
ispack
fromData.Text
, it packs aString
into aText
value.Mnemonic
is the constructor for ournewtype
which we defined in the last post -newtype Mnemonic = Mnemonic T.Text deriving Show
.lexeme
we defined above, it eats trailing whitespace and comments.- Finally
<$>
is the infix synonym forfmap
, which lifts a single argument function into a Functor.
Putting it all together...
Putting it all together, what we get is a function which parses three upper case characters as a String
, we then map Mnemonic . T.pack
over the value that mnem
parses which packs it into a Text
value and builds a Mnemonic
from that value, finally it consumes whitespace or comments after the three characters with lexeme
.
label
Properties
newtype LabelWithLetter = LabelWithLetter T.Text deriving Show
newtype LabelWithNonLetter = LabelWithNonLetter T.Text deriving Show
instance Arbitrary LabelWithLetter where
arbitrary = do
lbl <- genAlphaNum
lowerLetter <- choose ('a', 'z')
upperLetter <- choose ('A', 'Z')
start <- elements [lowerLetter, upperLetter]
pure . LabelWithLetter $ T.pack (start:lbl)
instance Arbitrary LabelWithNonLetter where
arbitrary = do
(LabelWithLetter lbl) <- arbitrary
nonAlphaChar <- suchThat (arbitrary :: Gen Char) (\s -> not $ isAlpha s)
pure . LabelWithNonLetter $ T.append (T.pack [nonAlphaChar]) lbl
prop_label_validLabelString (LabelWithLetter lbl) =
parse label "" lbl `shouldParse` (Label lbl)
prop_label_invalidLabelString (LabelWithNonLetter lbl) =
parse label "" lbl `shouldFailWith` err posI (utok (T.head lbl) <> elabel "letter")
label
is a little more involved than the last few parsers. In this case we want to verify that it fails when trying to parse a string that does not start with a letter. So we need two Arbitrary
instances - the first, LabelWithLetter
, is for all letter strings which start with a letter and the second, LabelWithNonLetter
, is for strings which start with a non letter character.
We've seen shouldParse
, it was used in the bytes
parser, however we haven't seen shouldFailWith yet and there seems to be quite a bit to it! Let's break it down.
shouldFailWith
Sometimes you want to verify that a parser fails on a given input. Not only that, but you want to verify that the error which is given on that failure contains the right message, position information, etc... shouldFailWith
allows you to do this. Let's have a look at its type.
shouldFailWith :: (Ord t, ShowToken t, ShowErrorComponent e, Show a)
=> Either (ParseError t e) a
-> ParseError t e
-> Expectation
So it takes:
- An
Either (ParseError t e) a
, which is the return type ofText.Megaparsec
's parse function. - A
ParseError t e
, which we can build withTest.Hspec.Megaparsec
's err function.
The parse
and err
functions have the following types:
err :: NonEmpty SourcePos -- ^ 'ParseError' position
-> EC t e -- ^ Error components
-> ParseError t e -- ^ Resulting 'ParseError'
parse
:: Parsec e s a -- ^ Parser to run
-> String -- ^ Name of source file
-> s -- ^ Input for parser
Looking at our prop_label_invalidLabelString
property:
prop_label_invalidLabelString (LabelWithNonLetter lbl) =
parse label "" lbl `shouldFailWith` err posI (utok (T.head lbl) <> elabel "letter")
We run the parser label
on the string lbl
which is generated from the Arbitrary
instance of LabelWithNonLetter
. We then assert that it fails with the following information in the error.
- The initial character of
lbl
is the cause, this is specified in theerr
call with posI. - It was an unexpected token,
utok (T.head lbl)
. - It was expected to be a letter,
elabel letter
. - Finally we combine the unexpected token and expected token into one EC using <> which is an infix synonymn for mappend (this is from
EC
's Monoid instance) - i.e.utok (T.head lbl) <> elabel "letter"
.
In English, we expect the label
parser to fail on any string which does not start with a letter and the error information should say that it failed at the initial position because of an unexpected token (giving the actual unexpected character) and finally that it expected to see a letter here.
Implementation
label :: Parser Label
label = lexeme $ Label . T.pack <$> ((:) <$> letterChar <*> many alphaNumChar)
Let's forget about everything outside of the brackets for now, and only focus on the following.
(:) <$> letterChar <*> many alphaNumChar
To parse a label we first need to parse some string of any length which starts with a letter, according to our grammar.
letterChar and alphaNumChar are functions from megaparsec that parse a single letter and a single alpha-numeric character, respectively. many parses zero or more occurences of the given parser, so many alphaNumChar
parses zero or more alpha-numeric characters.
We saw <$>
(fmap
) when building mnemonic
, however we haven't used <*> yet. <*>
("apply") is from the Applicative typeclass, it is just function application for Applicative Functors.
Functor/Applicative Quick Description
A Functor is something you can map over (a list is a Functor) and an Applicative (or Appplicative Functor) is something that you can sequence functions through. For a more in depth description see Learn You A Haskell: Functors, Applicative Functors and Monoids. Lists give a nice example comparing the two. Let's say I have a list of [1,2]
and want to add 1
to each element. List is a Functor
so I can use fmap
:
> fmap (+1) [1,2]
[2,3]
-- Or the inline version
> (+1) <$> [1,2]
[2,3]
Now, lets say I have a list x = [1,2]
and a function inside a list, [(+1)]
, that I want to apply to each element of x
. I can't use fmap
here as its type is (a -> b) -> f a -> f b
- meaning it lifts the function (a -> b)
into the Functor
f a
which gives an f b
. What we want is a function, like fmap
, but where the function to apply is within the Functor
already - and this is precisely what Applicative
gives us with the function <*>
. The type of <*>
is:
(<*>) :: f (a -> b) -> f a -> f b
Excellent! Turns out list also has an Applicative
instance, so lets use it in our problem:
> [(+1)] <*> [1,2]
[2,3]
Hopefully that gives some intuition as to what Functors and Applicative Functors are and how they can be used. It may not be clear yet as to why we use them in our parsers, but I'll leave that discussion for another post, as it's not really needed for the rest of this post, and there is quite a bit more to both that I have not mentioned.
Breaking it down
Now the interesting bit. letterChar
will parse a single Char
, and many alphaNumChar
will parse a String
, which is a [Char]
, we need a way of combining the values produced from these parsers so we just get a [Char]
.
To do this we map :
(the list constructor) over letterChar
which has type Parser Char
. What we end up creating is a function of type Parser ([Char] -> [Char])
. Let's assume letterChar
parses a 'c':
-- In case you forgot, here is the type of (:)
> (:) :: a -> [a] -> [a]
-- So, given
> (:) <$> letterChar <*> many alphaNumChar
-- After mapping (:) over the value letterChar parsed ('c' in this case) we get
> p ((:) 'c') <*> many alphaNumChar
Using <*>
we then apply p ((:) 'c')
over the value of many alphaNumChar
, which itself has type Parser [Char]
. Let's assume many alphaNumChar
parses "abc123":
-- From here
> p ((:) 'c') <*> many alphaNumChar
-- we get
> p ((:) 'c') <*> p "abc123"
-- giving
> p ((:) 'c' "abc123")
-- which gives
> p "cabc123"
Now we have the value we want, our letter char combined with many alpha-numeric chars, with the type Parser [Char]
(or Parser String
).
Packing It Up
Following the example above we now have:
label = lexeme $ Label . T.pack <$> (p "cabc123")
With our parsed string, we pack it into a Text
value, wrap it up in a Label
and parse (and discard) possible whitespace with lexeme
. We've seen this above in other parsers, no need to repeat.
Phew! There was quite a lot to implementing label
but we're done now, and can use this info later in other parsers.
labelAssign
Now that label
is complete, labelAssign
is simple. We will build it from label
so really when creating a property all we need to check is that it parses and discards a ':'.
prop_labelAssign_shouldDiscardColon (LabelWithLetter lbl) =
parse labelAssign "" (T.snoc lbl ':') `shouldParse` Label lbl
The implementation:
labelAssign :: Parser Label
labelAssign = lexeme $ label <* char ':'
<*
is similar to <*>
, the difference is <*
discards the value of the second argument. In our case label <* char ':'
says parse a label, then parse a ':' but discard it, so it's not part of the Label
which labelAssign
builds.
Conclusion
Now we have ~80% of our parser completed, properties for each of the parsers, and some understanding (I hope) of one way to test megaparsec parsers. In the next post I'll dive deeper into megaparsec and implement the remaining parsers. You can view the code up to this point at https://github.com/wayofthepie/emu-mos6502-asm-blog/tree/hasm-blog02.