Building An Assembler In Haskell
- The Language
- Building Our Parser
Recently I started building an emulator for the MosTech 6502 Cpu, this post is about the initial stages of building an assembler for a simple assembly language that compiles to runnable 6502 machine code. I've created a repo and updated it as I wrote this post, so at the end of most sections that introduce new code I'll link to a commit which has the code up to that point.
You can see the repo here:
https://github.com/wayofthepie/emu-mos6502-asm-blog/tree/hasm-blog01. The only pre-requisite is installing
stack, once the project is
cloned you can use
stack install to install dependencies and build the project.
First, let's define what we want our assembly language to be able to do. To keep it simple, we only want to allow assignment and the definition of instructions for now:
- We should be able to define instructions and their operands, e.g.
LDA #$20. This says load the value $20 into the accumulator.
LDAmeans load date into the accumulator
#means this is immediate addressing so use the operand as a value not as an address.
- We should be able to assign values to variables, e.g.
LOCATION = $2020.
- Now anywhere we see
LOCATIONwe can replace with
- Now anywhere we see
Now for a quick and dirty grammar for our simple language:
<expr> := <instruction> | <assignment> <assignment> := <label> "=" <bytes> <instruction> := [<label> ":"] <menmonic> [<label> | <operand>] <operand> := ["#"] <bytes> <bytes> := "$" <byte> [<byte>] <label> := ([A-Za-z]*[A-Za-z0-9]*) <mnemonic> := 3 * ([A-Z]) <byte> := 2 * ([A-Fa-f0-9])
The above is a variation of EBNF (Extended Backus Naur Form) notation, we allow regular
expressions (denoted by brackets
([A-Z]) denotes a single upper case letter)
Here's a breakdown, from the bottom up:
<byte>is defined as
2 * ([A-Fa-f0-9]), this means two consecutive characters that are upper or lower case letters betwen
For digits, i.e. a two digit hexadecimal value.
<mnemonic>is a three letter string - all upper case.
<label>is an alphanumeric string of any size, which must start with a letter.
- Note that we should limit it's size, but lets leave it infinite for now!
<bytes>is a string which starts with a
$and must contain at least one
<byte>, at most two.
<operand>starts with an optional
<instruction>starts with an optional
<label>which must be followed by a
:if it exists, this is followed by a
<mnemonic>and finally an optional
<operand>. Some instructions use implicit addressing and require no operand or label, hence why this is optional.
LABEL LDA #$20
- Note that the possible label at the end is only for certain instructions, for example
BNE LABELwill jump to the address corresponding to
LABELif the zero flag is not set.
<label>on the left hand side and a
<bytes>on the right.
STORE = $2020
<expr>can be either an
An example program would look as follows.
LDA #$01 CMP #$02 BNE notequal STA $22 notequal: BRK
And the machine code this compiles to 1:
a9 01 c9 02 d0 02 85 22 00
Now that the language is somewhat spec'd out, we have an nice overview of how we can start
building a parser for it. There are a lot of rules not defined in the above spec, for example a
<label> cannot match a mnemonic - e.g.
LDA cannot be a
<label> - let's not worry about
these for now.
Building Our Parser
Now that we have our grammar, we can start thinking about how we want to build our parser. Normally I would work out some types first along with some top-level functions and go from there, so lets do that.
module Assembler where import qualified Data.Text as T -- from the "text" package import Text.Megaparsec -- from the "megaparsec" package -- | Create a custom parser type. This is megaparsec specific, we will gloss over this in -- this post. type Parser = Parsec Dec T.Text -- | A Label is just a Text value. newtype Label = Label T.Text deriving Show -- | Indicates whether an address/ value is preceeded by a "#". newtype IsImmediate = IsImmediate Bool deriving Show -- | An address/value of one or two bytes which may have a "#", meaning -- immediate, before it. data Operand = Operand IsImmediate T.Text deriving Show -- | A three letter upper case string. newtype Mnemonic = Mnemonic T.Text deriving Show -- | A label, which should be assigned a value. newtype Var = Var Label deriving Show -- | A value, which should be assigned to a Var. newtype Val = Val T.Text deriving Show -- | Either a label or an operand. data LabelOrOperand = Lbl Label | Op Operand deriving Show -- | Either an instruction or an assignment. data Expr = Instruction (Maybe Label) Mnemonic (Maybe LabelOrOperand) | Assignment Var Val deriving Show
Most of the values will just be strings (
Text types) so to distinguish between them we
Text in a
newtype wrapper for each type we care about.
For now we're not going to worry about constructing anything other than strings (I'll be using the
Text to denote strings instead of the built-in
String in this post 2).
Looking back at our grammar we have eight symbols, each one can be represented as a
function which is itself a parser for some subset of the grammar. So the top-level
functions in this case would be the symbols in our grammar -
We'll also add an extra function here for parsing label assignment - labels with a ":"
after them as in the first part of
<instruction> - let's call it
expression :: Parser Expr expression = undefined assignment :: Parser Expr assignment = undefined instruction :: Parser Expr instruction = undefined operand :: Parser Operand operand = undefined bytes :: Parser T.Text bytes = undefined labelAssign :: Parser Label labelAssign = undefined label :: Parser Label label = undefined mnemonic :: Parser Mnemonic mnemonic = undefined byte :: Parser T.Text byte = undefined
All functions are
undefined so the type checker will pass before we
begin to implement the logic. With our grammar, we know what each symbol corresponds
to, so we can use QuickCheck
to write properties for each function that adhere to its specification in the
Property Driven Parser Development
To build the parser I'm going to use a parser combinator library called megaparsec. I won't go into much detail on megaparsec or parser combinators in this post, simply put, parser combinators are a way of building more complex parsers by combining parsers.
The simplest parser above would be
byte, from our grammar this is just a two character
hexadecimal string. Before we start implementing it, let's write a property which encodes
what we expect it to do.
Generating Our Data - QuickCheck Arbitrary
Using QuickCheck to test parsers is really simple and quite powerful. It involves writing properties which encode expectations about the ouput of a function given some input.
To build a property for
byte first we need to create an
instance 3 for
the data it expects - two character hexadecimal strings. Creating an instance of
for a type allows random values of that type to be generated, by default
QuickCheck will generate 100 random values of the type each test run. For
byte this might look
-- | Wrapper for our two character hexadecimal strings. newtype TwoCharHexString = TwoCharHexString T.Text deriving Show instance Arbitrary TwoCharHexString where arbitrary = do upper <- choose ('A', 'F') lower <- choose ('a', 'f') num <- choose ('0', '9') let vals = [upper, lower, num] x <- elements vals y <- elements vals pure $ TwoCharHexString (T.pack (x:[y]))
Here we define a
TwoCharHexString, which is just a wrapper for
Then we create an
Arbitrary instance for this type which builds two character hex string
Text values. Let's run through the instance:
generates a random element in the given range,
choose (1, 4)generates integers between 1 and 4 inclusive.
- elements generates a single value from the given list.
- With these functions we can generate
yand build our two character string by building a two element list of characters made up of
x:[y]above, this is just a
Stringwhich is a list of
Char- we then pack this with
T.packto get our
Building Our Property
Next we need to write a property that defines what should happen when
byte parses these
prop_byte_parse (TwoCharHexString s) = parse byte "" s `shouldParse` s
This is simply a function called
prop_byte_parse which takes a value of type
TwoCharHexString runs the parser
byte with the megaparsec
parse function 4 and
checks that the result is as expected, in this case parsing a string
s should return that
parse is a function from the megaparsec package which runs our parser on the
shouldParse is a function from
hspec-megaparsec - a library
containing utility functions for testing parsers built with megaparsec. Here we are using it to
parse byte "" s should parse to the string
s - meaning the
byte parser run on string
just give us back
Let's add this to our spec so the property check gets run when we launch
asmSpec = do describe "byte" $ it "should parse two consecutive characters in the hex range into a two character string" $ property prop_byte_parse
Running the tests with
stack test will run this spec and check that the property
holds when parsing the random values of
TwoCharHexString that quickcheck produces - which we
defined in our
But wait! The
byte function was
undefined so the test should fail! Yip, it should give
output similar to the following.
... Assembler byte should parse two consecutive characters in the hex range into a two character string FAILED  Failures: test/AssemblerSpec.hs:19: 1) Assembler.byte should parse two consecutive characters in the hex range into a two character string uncaught exception: ErrorCall (Prelude.undefined CallStack (from HasCallStack): error, called at libraries/base/GHC/Err.hs:79:14 in base:GHC.Err undefined, called at src/Assembler.hs:52:8 in emu-mos6502-asm-blog-0.1.0.0-J3FxRJn8XZkGwqJZfEo09O:Assembler) (after 1 test) *** Failed! (after 1 test): Exception: Prelude.undefined CallStack (from HasCallStack): error, called at libraries/base/GHC/Err.hs:79:14 in base:GHC.Err undefined, called at src/Assembler.hs:52:8 in emu-mos6502-asm-blog-0.1.0.0-J3FxRJn8XZkGwqJZfEo09O:Assembler TwoCharHexString "fF" Randomized with seed 1913513661 Finished in 0.0023 seconds 1 example, 1 failure ...
Good, now we can implement
byte. I've deliberatly left out some boiler plate such as dependencies
and test setup, but you can view the full code up to this point, see
for the types and functions and
for the the property and
So now that we have a property which defines what our
byte function should do, we can
implement it. hexDigitChar
is a parser from megaparsec which parses a a hexadecimal digit. A byte is made up of two
such digits so
byte is simply a parser which tries to parse two hexadecimal chars.
byte :: Parser T.Text byte = do high <- hexDigitChar low <- hexDigitChar pure $ T.pack [high,low]
Nice! We read a hex char and call it
high and another called
low and build a
The rest of the parsers can be implemented in a similar way - define
for the data they should take, define properties for the expected output and implement!
I'll leave the rest of the implementation for another post. Running
stack test now
should give the following output.
... Assembler byte should parse two consecutive characters in the hex range into a two character string Finished in 0.0016 seconds 1 example, 0 failures ...
Excellent, our implementation passed the property check! You can check out the code up to this point here.
This is where I'll leave this post. I think it's long enough! In the next post we'll create the rest of the parsers and their properties, and also run through megaparsec in some more detail. There are definitely quite a few improvements that can be added to the language, and plenty more features that would be useful to have which we can implement in the future.
There are also some limitations in the grammar, for example it currently does not allow
Y indexed addressing 5 - e.g.
LDA ($2020,X) - we can address these too.
This post outlines what I have done so far when building the assembler, and really just shows my own thought process around designing and implementing. Im actively working on the 6502 emulator, so I hope to do a post every week. The main goal is to outline my development process in implementing the project, hopefully I'll introduce some bugs or have some interesting issues along the way!
The reference I'm using for the emulator is mainly http://obelisk.me.uk/6502/reference.html.
Haskell String Types is a good post detailing the different string types.
See the documentation for parse.
See Indexed Indirect, and Indirect Indexed addressing modes here.