Building an Assembler in Haskell
Recently I started building an emulator for the MosTech 6502 Cpu, this post is about the initial stages of building an assembler for a simple assembly language that compiles to runnable 6502 machine code. I've created a repo and updated it as I wrote this post, so at the end of most sections that introduce new code I'll link to a commit which has the code up to that point.
You can see the repo here: https://github.com/wayofthepie/emu-mos6502-asm-blog/tree/hasm-blog01. The only pre-requisite is installing stack, once the project is cloned you can use stack install
to install dependencies and build the project.
The Language
First, let's define what we want our assembly language to be able to do. To keep it simple, we only want to allow assignment and the definition of instructions for now:
- We should be able to define instructions and their operands, e.g.
LDA #$20
. This says load the value $20 into the accumulator.LDA
means load date into the accumulator#
means this is immediate addressing so use the operand as a value not as an address.$20
stands for20
in hexadecimal
- We should be able to assign values to variables, e.g.
LOCATION = $2020
.- Now anywhere we see
LOCATION
we can replace with$2020
.
- Now anywhere we see
Now for a quick and dirty grammar for our simple language:
<expr> := <instruction> | <assignment>
<assignment> := <label> "=" <bytes>
<instruction> := [<label> ":"] <menmonic> [<label> | <operand>]
<operand> := ["#"] <bytes>
<bytes> := "$" <byte> [<byte>]
<label> := ([A-Za-z]*[A-Za-z0-9]*)
<mnemonic> := 3 * ([A-Z])
<byte> := 2 * ([A-Fa-f0-9])
The above is a variation of EBNF (Extended Backus Naur Form) notation, we allow regular expressions (denoted by brackets ()
e.g. ([A-Z])
denotes a single upper case letter) for simplicity.
Here's a breakdown, from the bottom up:
<byte>
is defined as2 * ([A-Fa-f0-9])
, this means two consecutive characters that are upper or lower case letters betwenA
andF
or digits, i.e. a two digit hexadecimal value.- e.g.
2F
- e.g.
<mnemonic>
is a three letter string - all upper case.- e.g.
LDA
- e.g.
<label>
is an alphanumeric string of any size, which must start with a letter.- e.g.
Stor3
- Note that we should limit it's size, but lets leave it infinite for now!
- e.g.
<bytes>
is a string which starts with a$
and must contain at least one<byte>
, at most two.- e.g.
$2F
- e.g.
<operand>
starts with an optional#
followed by<bytes>
.- e.g.
#$2f
- e.g.
<instruction>
starts with an optional<label>
which must be followed by a:
if it exists, this is followed by a<mnemonic>
and finally an optional<label>
xor an<operand>
. Some instructions use implicit addressing and require no operand or label, hence why this is optional.- e.g.
LABEL LDA #$20
- Note that the possible label at the end is only for certain instructions, for example
BNE LABEL
will jump to the address corresponding toLABEL
if the zero flag is not set.
- e.g.
<assignment>
simply an=
with a<label>
on the left hand side and a<bytes>
on the right.- e.g.
STORE = $2020
- e.g.
Finally an
<expr>
can be either an<instruction>
or an<assignment>
.
An example program would look as follows.
LDA #$01
CMP #$02
BNE notequal
STA $22
notequal: BRK
And the machine code this compiles to 1:
a9 01 c9 02 d0 02 85 22 00
Now that the language is somewhat spec'd out, we have an nice overview of how we can start building a parser for it. There are a lot of rules not defined in the above spec, for example a <label>
cannot match a mnemonic - e.g. LDA
cannot be a <label>
- let's not worry about these for now.
Building Our Parser
Now that we have our grammar, we can start thinking about how we want to build our parser. Normally I would work out some types first along with some top-level functions and go from there, so lets do that.
module Assembler where
import qualified Data.Text as T -- from the "text" package
import Text.Megaparsec -- from the "megaparsec" package
-- | Create a custom parser type. This is megaparsec specific, we will gloss over this in
-- this post.
type Parser = Parsec Dec T.Text
-- | A Label is just a Text value.
newtype Label = Label T.Text deriving Show
-- | Indicates whether an address/ value is preceeded by a "#".
newtype IsImmediate = IsImmediate Bool deriving Show
-- | An address/value of one or two bytes which may have a "#", meaning
-- immediate, before it.
data Operand = Operand IsImmediate T.Text deriving Show
-- | A three letter upper case string.
newtype Mnemonic = Mnemonic T.Text deriving Show
-- | A label, which should be assigned a value.
newtype Var = Var Label deriving Show
-- | A value, which should be assigned to a Var.
newtype Val = Val T.Text deriving Show
-- | Either a label or an operand.
data LabelOrOperand = Lbl Label | Op Operand deriving Show
-- | Either an instruction or an assignment.
data Expr
= Instruction (Maybe Label) Mnemonic (Maybe LabelOrOperand)
| Assignment Var Val
deriving Show
Most of the values will just be strings (Text
types) so to distinguish between them we wrap Text
in a newtype
wrapper for each type we care about. For now we're not going to worry about constructing anything other than strings (I'll be using the type Text
to denote strings instead of the built-in String
in this post 2). Looking back at our grammar we have eight symbols, each one can be represented as a function which is itself a parser for some subset of the grammar. So the top-level functions in this case would be the symbols in our grammar - <expression>
, <label>
etc... We'll also add an extra function here for parsing label assignment - labels with a ":" after them as in the first part of <instruction>
- let's call it labelAssign
.
expression :: Parser Expr
expression = undefined
assignment :: Parser Expr
assignment = undefined
instruction :: Parser Expr
instruction = undefined
operand :: Parser Operand
operand = undefined
bytes :: Parser T.Text
bytes = undefined
labelAssign :: Parser Label
labelAssign = undefined
label :: Parser Label
label = undefined
mnemonic :: Parser Mnemonic
mnemonic = undefined
byte :: Parser T.Text
byte = undefined
All functions are undefined
so the type checker will pass before we begin to implement the logic. With our grammar, we know what each symbol corresponds to, so we can use QuickCheck to write properties for each function that adhere to its specification in the grammar.
Property Driven Parser Development
To build the parser I'm going to use a parser combinator library called megaparsec. I won't go into much detail on megaparsec or parser combinators in this post, simply put, parser combinators are a way of building more complex parsers by combining parsers.
The simplest parser above would be byte
, from our grammar this is just a two character hexadecimal string. Before we start implementing it, let's write a property which encodes what we expect it to do.
Generating Our Data - QuickCheck Arbitrary
Using QuickCheck to test parsers is really simple and quite powerful. It involves writing properties which encode expectations about the ouput of a function given some input.
To build a property for byte
first we need to create an Arbitrary instance 3 for the data it expects - two character hexadecimal strings. Creating an instance of Arbitrary
for a type allows random values of that type to be generated, by default QuickCheck will generate 100 random values of the type each test run. For byte
this might look as follows.
-- | Wrapper for our two character hexadecimal strings.
newtype TwoCharHexString = TwoCharHexString T.Text deriving Show
instance Arbitrary TwoCharHexString where
arbitrary = do
upper <- choose ('A', 'F')
lower <- choose ('a', 'f')
num <- choose ('0', '9')
let vals = [upper, lower, num]
x <- elements vals
y <- elements vals
pure $ TwoCharHexString (T.pack (x:[y]))
Here we define a newtype
called TwoCharHexString
, which is just a wrapper for Text
. Then we create an Arbitrary
instance for this type which builds two character hex string Text
values. Let's run through the instance:
- choose generates a random element in the given range,
choose (1, 4)
generates integers between 1 and 4 inclusive. - elements generates a single value from the given list.
- With these functions we can generate
x
andy
and build our two character string by building a two element list of characters made up ofx
andy
- seex:[y]
above, this is just aString
which is a list ofChar
- we then pack this withT.pack
to get ourText
value.
Building Our Property
Next we need to write a property that defines what should happen when byte
parses these string values.
prop_byte_parse (TwoCharHexString s) = parse byte "" s `shouldParse` s
This is simply a function called prop_byte_parse
which takes a value of type TwoCharHexString
runs the parser byte
with the megaparsec parse
function 4 and checks that the result is as expected, in this case parsing a string s
should return that same string. parse
is a function from the megaparsec package which runs our parser on the supplied string.
Finally, shouldParse
is a function from hspec-megaparsec - a library containing utility functions for testing parsers built with megaparsec. Here we are using it to say parse byte "" s
should parse to the string s
- meaning the byte
parser run on string s
should just give us back s
.
Let's add this to our spec so the property check gets run when we launch stack test
.
asmSpec = do
describe "byte" $
it "should parse two consecutive characters in the hex range into a two character string" $
property prop_byte_parse
Running the tests with stack test
will run this spec and check that the property prop_byte_parse
holds when parsing the random values of TwoCharHexString
that quickcheck produces - which we defined in our Arbitrary
instance.
But wait! The byte
function was undefined
so the test should fail! Yip, it should give output similar to the following.
...
Assembler
byte
should parse two consecutive characters in the hex range into a two character string FAILED [1]
Failures:
test/AssemblerSpec.hs:19:
1) Assembler.byte should parse two consecutive characters in the hex range into a two character string
uncaught exception: ErrorCall (Prelude.undefined
CallStack (from HasCallStack):
error, called at libraries/base/GHC/Err.hs:79:14 in base:GHC.Err
undefined, called at src/Assembler.hs:52:8 in emu-mos6502-asm-blog-0.1.0.0-J3FxRJn8XZkGwqJZfEo09O:Assembler) (after 1 test)
*** Failed! (after 1 test):
Exception:
Prelude.undefined
CallStack (from HasCallStack):
error, called at libraries/base/GHC/Err.hs:79:14 in base:GHC.Err
undefined, called at src/Assembler.hs:52:8 in emu-mos6502-asm-blog-0.1.0.0-J3FxRJn8XZkGwqJZfEo09O:Assembler
TwoCharHexString "fF"
Randomized with seed 1913513661
Finished in 0.0023 seconds
1 example, 1 failure
...
Good, now we can implement byte
. I've deliberatly left out some boiler plate such as dependencies and test setup, but you can view the full code up to this point, see Assembler.hs for the types and functions and AssemblerSpec.hs for the the property and Arbitrary
instance.
Implementation
So now that we have a property which defines what our byte
function should do, we can implement it. hexDigitChar is a parser from megaparsec which parses a a hexadecimal digit. A byte is made up of two such digits so byte
is simply a parser which tries to parse two hexadecimal chars.
byte :: Parser T.Text
byte = do
high <- hexDigitChar
low <- hexDigitChar
pure $ T.pack [high,low]
Nice! We read a hex char and call it high
and another called low
and build a Text
value.
The rest of the parsers can be implemented in a similar way - define Arbitrary
instances for the data they should take, define properties for the expected output and implement!
I'll leave the rest of the implementation for another post. Running stack test
now should give the following output.
...
Assembler
byte
should parse two consecutive characters in the hex range into a two character string
Finished in 0.0016 seconds
1 example, 0 failures
...
Excellent, our implementation passed the property check! You can check out the code up to this point here.
Conclusion
This is where I'll leave this post. I think it's long enough! In the next post we'll create the rest of the parsers and their properties, and also run through megaparsec in some more detail. There are definitely quite a few improvements that can be added to the language, and plenty more features that would be useful to have which we can implement in the future.
There are also some limitations in the grammar, for example it currently does not allow X
or Y
indexed addressing 5 - e.g. LDA ($2020,X)
- we can address these too.
This post outlines what I have done so far when building the assembler, and really just shows my own thought process around designing and implementing. Im actively working on the 6502 emulator, so I hope to do a post every week. The main goal is to outline my development process in implementing the project, hopefully I'll introduce some bugs or have some interesting issues along the way!
The reference I'm using for the emulator is mainly http://obelisk.me.uk/6502/reference.html.
Haskell String Types is a good post detailing the different string types.
See the documentation for Arbitrary, this StackOverflow answer is also good.
See the documentation for parse.
See Indexed Indirect, and Indirect Indexed addressing modes here.