6502 Emulator


TL;DR: The how and why of the new 6502 emulator in the chips project.

I wrote a new version of my 6502/6510 emulator in the last weeks which can be stepped forward in clock cycles instead of full instructions.

The actual work didn’t take a couple of weeks, more like a few evenings and a weekend, because the new emulator is more or less just a different output of the code-generation Python script which I keep mutating for each new emulator version.

But while working on the new emulator I got a bit side-tracked by another project:

But this is (maybe) the topic of another blog post.

Before I’m getting to the cycle-stepped 6502 emulator, a little detour into 8-bit CPUs and CPU emulators in general:

What CPUs actually do

The general job of a CPU (no matter if old or new) is quite simple: given some memory filled with instructions and data, process instructions one after another to change some of the data in memory into some other data, and repeat this process forever. Everything else in a typical computer just exists to turn some of the data in memory into pretty pixels and sounds, or send the data to another computer.

The processing of a single instruction can be broken down into several steps:

  • fetch the instruction opcode from memory (on 8-bit CPUs this is the first - and sometimes only - byte of an instruction)
  • ‘decode’ this opcode to decide what actions must be taken to ‘execute’ the instruction, and then step through those actions, like:
  • load additional data from memory into the CPU
  • change the data loaded into the CPU in some way (arithmetic, bit twiddling, etc)
  • store data from the CPU back to memory
  • repeat those steps for the next instruction

Apart from simple data manipulation, this basic ‘core execution loop’ is also used for more interesting control-flow actions:

  • branches: Jump to a different memory location and continue executing instructions from there; this is like a goto in higher level languages
  • conditional branches: Based on the result of a previous computation, either jump to a different memory location, or continue with the instruction directly following the branch instruction. Conditional branches are the assembly-level building blocks of higher-level language constructs like if(), for() and while()
  • subroutine calls: Store the current execution location on the ‘stack’ (a special area in memory set aside for storing temporary values), jump to a different memory location (the subroutine), and at the end of that subroutine (indicated by a special ‘return’ instruction), load the stored execution location from the stack and continue execution right after the original subroutine call. This is like a function call in a high level language.
  • interrupts: Interrupts are like subroutine calls except they are initiated by an external hardware event (such as a hardware timer/counter reaching zero, a key being pressed, or the video signal reaching a specific raster line). The CPU stops whatever it is currently doing, jumps into a special ‘interrupt service routine’, and at the end of the service routine, continues with whatever it did before.

CPUs usually have a few special registers to deal with control flow:

  • The current execution location (where the next instruction is loaded from) is stored in the Program Counter (PC) register.
  • The current stack location is stored in the Stack Pointer (S or SP) register.
  • A Flags or Status Register (for some mysterious reason called ‘P’ on the 6502) which stores (mostly) information about the result of previous arithmetic operations to be used for conditional branching. For instance: if the last instruction subtracted two numbers and the result is 0, the Flags Register will have the Zero Flag bit set. The conditional branch instructions BEQ (branch when equal) and BNE (branch when not equal) look at the current state of the Zero Flag to decide whether to branch or not (there are other Flag bits and associated branch instructions too, but you get the idea).

It’s interesting to note that the branch instructions of a CPU are essentially normal load instructions into the PC register, since that’s what they do: loading a new location value into the PC register so that the next instruction is fetched from a different memory location.

The components of an 8-bit CPU

8-bit CPUs like the Z80 or 6502 are fairly simple from today’s point of view: a few thousand transistors arranged in a few logical blocks:

A Register Bank with somewhere between a handful and a dozen 8- and 16-bit registers (the 6502 has about 56 bits worth of ‘register storage’, while the Z80 has over 200 bits worth of registers)

An ALU (Arithmetic Logic Unit) which implements integer operations like addition, subtraction, comparison (usually done with a subtraction and dropping the result), bit shifting/rotation, and the bitwise logic operations AND, OR and XOR.

The Instruction Decoder. This is where all the ‘magic’ happens. The instruction decoder takes an instruction opcode as input, and in the following cycles runs through a little instruction-specific hardwired program where each program step describes how the other components of the CPU (mainly the register bank and ALU) need to be wired together and configured to execute the current instruction substep.

How exactly the instruction decoder unit is implemented differs quite a lot between CPUs, but IMHO the general mental model that each instruction is a small hardwired ‘micro-program’ of substeps which reconfigures the data flow within the CPU and the current action of the ALU is valid for all popular 8-bit CPUs, no matter how the decoding process actually happens in detail on a specific CPU.

And finally there’s the ‘public API’ of the CPU, the input/output pins which connect the CPU to the outside world.

Most popular 8-bit CPUs look fairly similar from the outside:

  • 16 address bus output pins, for 64 KBytes of directly addressable memory
  • 8 data bus in/out pins, for reading or writing one data byte at a time
  • a handful of control- and status-pins, these differ between CPUs, but the most common and important pins are:
    • one or two RW (Read/Write) output pins to indicate to the outside world whether the CPU wants to read data from memory (with the data bus pins acting as inputs), or write data into memory (data pins as outputs) at the location indicated by the address bus pins
    • IRQ and NMI: These are input pins to request a “maskable” (IRQ) or non-maskable (NMI) interrupt.
    • RES: an input pin used to reset the CPU (usually together with the whole computer system); this puts the CPU back into a defined starting state

Of course the CPU is only one part of a computer system. At least some memory is needed to store instructions and data. And to be of any use there must be a way to get data into and out of the computer (keyboard, joystick, display, loudspeakers, and a tape- or floppy-drive). These devices are usually not controlled directly by the CPU, but by additional helper chips, for instance in the C64:

  • 2 ‘CIA’ chips to control the keyboard, joysticks, and tape drive
  • the ‘SID’ chip to generate audio
  • and the ‘VIC-II’ chip to generate the video output

The C64 is definitely on the extreme side for 8-bit computers when it comes to hardware complexity. Most other 8-bit home computers had a much simpler hardware configuration and were made of fewer and simpler chips.

All these chips and the CPU are connected to the same shared address- and data-bus, and some additional ‘address decoder logic’ (and clever engineering in general) was needed so that all those chips only use the shared address and data bus when it’s their turn, but diving in there goes a bit too far for this blog post :)

Back to emulators:

How CPU emulators work

While all CPU emulators must somehow implement the tasks of real CPUs outlined above, there are quite a few different approaches to how they reach that goal.

The range basically goes from “fast but sloppy” to “accurate but slow”. Where on this range a specific CPU emulator lives depends mainly on the computer system the emulator is used for, and the software that needs to run:

  • Complex general-purpose home computers like the C64 and Amstrad CPC with an active and recent demo scene need a very high level of accuracy down to the clock cycle, because a lot of frighteningly smart demo-sceners are exploring (and exploiting) every little quirk of the hardware, far beyond what the original hardware designers could have imagined or what’s written down in the original system specifications and chip manuals.

  • The other extreme are purpose-built arcade machines which only need to run a single game. In this case, the emulation only needs to be ‘good enough’ to run one specific game on one specific hardware configuration, and a lot of shortcuts can be taken as long as the game looks and feels correct.

Here’s the evolution from “fast but sloppy” to “slow but accurate” CPU emulators that I’ve gone through. I’m using the words “stepped” and “ticked” in a very specific way:

  • “stepped” means how the CPU emulator can be “stepped forward” from the outside by calling a function to bring the emulator from a “before state” into an “after state”
  • “ticked” means how the entire emulated system is brought from a “before state” into an “after state”

Instruction-Stepped and Instruction-Ticked

This was the first naive implementation when I started writing emulators, a Z80 emulator which could only be stepped forward and inspected for complete instructions.

The emulator had specialized callback functions for memory accesses, and the Z80’s special IO operations, but everything else happening “inside” an instruction was completely opaque from the outside.

This means that an emulated computer system using such a CPU emulator could only “catch up” with the CPU after a full instruction was executed, which on the Z80 can take anywhere between 4 and 23 clock cycles (or even more when an interrupt is handled).

This worked fine for simple computer systems which didn’t need cycle-accurate emulation, like the East German Z- and KC-computers. But 23..30 clock cycles at 1 MHz is almost half of a video scanline, and once I moved to more complex systems like the Amstrad CPC it became necessary to know when exactly within an instruction a memory read or write access happens. The Amstrad CPC’s video system can be reprogrammed at any time, for instance in the middle of a scanline, and when such a change happens at the wrong clock cycle within a scanline, things start to look wrong, from subtle errors like a few missing pixels or wrong colors here and there to a completely garbled image.

Instruction-Stepped and Cycle-Ticked

The next step was to replace the specialized IO and memory access callbacks with a single universal ‘tick callback’, and to call this from within the CPU emulation for each clock cycle of an emulated instruction. I moved to this emulation model for two reasons: (a) because I had to improve the Amstrad CPC video emulation, and (b) because I started with a 6502 CPU emulator, where a memory access needs to happen in each clock cycle anyway.

From the outside, it’s still only possible to execute instructions as a whole. So calling the emulator’s ‘execute’ function once will run through the entire next instruction, calling the tick callback several times in turn, once for each clock cycle. It’s not possible to tell the CPU to only execute one clock cycle, or to suspend the CPU in the middle of an instruction.

With this approach, the CPU takes a special, central place in an emulated system. The CPU is essentially the controller which ticks the system forward by calling the tick callback function, and the entire remaining system is emulated in this tick callback. This allows a perfectly cycle-accurate system emulation, and as far as I’m aware, this approach is used in most ‘serious’ emulators. To be honest, the reasons to use an even finer-grained approach are a bit esoteric:

Cycle-Stepped and Cycle-Ticked

This is where I’m currently at with the 6502 emulation (but not yet with theZ80 emulation).

Instead of an “exec” function which executes an entire instruction, there’s now a “tick” function which only executes one clock cycle of an instruction and returns to the caller. With this approach a ‘tick callback’ is no longer needed, and the CPU has lost its special ‘controller status’ in an emulated system.

Instead the CPU is now just one chip amongst many, ticked forward like all the other chips in the system. The ‘system tick’ function has inverted its role:

Instead of the system tick function being called from inside the CPU emulation, the CPU emulation is now called from inside the system tick function. This makes the entire emulator code more straightforward and flexible. It’s now ‘natural’ and trivial to tick the entire system forward in single clock cycles (yes, the C64 emulation now has a c64_tick() function).

Being able to tick the entire system forward in clock cycles is very useful for debugging. So far, the step-debugger could only step one complete CPU instruction at a time.

That’s good enough for debugging CPU code, but not so great for debugging the other chips in the system. Each debug step would skip over several clock cycles depending on what CPU instruction is currently executed. But now that the CPU can be cycle-stepped, implementing a proper ‘cycle-step-debugger’ is fairly trivial.

Another situation where a cycle-stepped CPU is useful is multi-processor systems, which actually weren’t that unusual in the 80’s: the Commodore 128 had both a 6502 and a Z80 CPU (although they couldn’t run at the same time), some arcade machines had sound and gameplay logic running on different CPUs, and even attaching a floppy drive to a C64 turns this into a multiprocessor system, because the floppy drive was its own computer with its own 6502 CPU.

With a cycle-stepped CPU, it’s now much more straightforward to write emulators for such multi-CPU systems. If required, completely unrelated systems can now run cycle-synchronized without weird hacks in the tick callbacks or variable-length ‘time slices’.

The new 6502 emulator

The new emulator only has two ‘interesting’ functions, for initializing a new 6502 instance and for ticking it:
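(Reconstructed sketch; the two function names and the 64-bit pin mask are described in this post, the m6502_desc_t setup-descriptor struct follows the naming convention of the chips headers and its contents are not shown here.)

    uint64_t m6502_init(m6502_t* cpu, const m6502_desc_t* desc);
    uint64_t m6502_tick(m6502_t* cpu, uint64_t pins);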

Like my older emulators, a 64-bit integer is used to communicate the pin status in and out of the CPU. What’s new is that the m6502_init() function returns an initial pin mask to ‘ignite’ the CPU startup process, which must be passed into the first call to m6502_tick().
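In rough sketch form this looks like the following (the empty m6502_desc_t initialization is a placeholder; a real descriptor may carry optional configuration):

    m6502_t cpu;
    uint64_t pins = m6502_init(&cpu, &(m6502_desc_t){0});
    for (;;) {
        pins = m6502_tick(&cpu, pins);
    }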

The first 7 calls to m6502_tick() will then run through the 7-cycle reset sequence; only then will the CPU emulation start to run ‘regular’ instructions.

The code example above still doesn’t do anything useful yet, because we haven’t connected the CPU to memory. It’s a bit similar to connecting a real 6502 to a clock signal, but leaving all the other pins floating in the air.

Let’s fix that:
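Here’s a sketch with a flat 64-KByte memory attached. The pin-accessor helpers M6502_GET_ADDR(), M6502_GET_DATA(), M6502_SET_DATA() and the M6502_RW pin bit are assumed to follow the naming conventions of the chips headers:

    // 64 KBytes of zero-initialized memory
    static uint8_t mem[1<<16] = { 0 };

    m6502_t cpu;
    uint64_t pins = m6502_init(&cpu, &(m6502_desc_t){0});
    for (;;) {
        // tick the CPU forward by one clock cycle
        pins = m6502_tick(&cpu, pins);
        // get the memory address from the address bus pins
        const uint16_t addr = M6502_GET_ADDR(pins);
        if (pins & M6502_RW) {
            // a memory read: put the byte at addr into the data bus pins
            M6502_SET_DATA(pins, mem[addr]);
        }
        else {
            // a memory write: store the data bus value into memory
            mem[addr] = M6502_GET_DATA(pins);
        }
    }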

…and that’s basically it. Let’s try a real C program which loads a specific value into the A register:
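A sketch of such a program (same assumptions about the helper macros as above; reading the accumulator via cpu.A at the end assumes the register is exposed as a public struct member):

    #include <stdint.h>
    #include <stdio.h>
    #define CHIPS_IMPL
    #include "m6502.h"

    int main(void) {
        // 64 KBytes of zero-initialized memory with an LDA #$33 at address 0
        // (the reset vector at $FFFC/$FFFD reads as $0000, so execution
        // starts at address 0 after the reset sequence)
        static uint8_t mem[1<<16] = { 0 };
        mem[0] = 0xA9;      // LDA #
        mem[1] = 0x33;      // ...the immediate operand $33

        m6502_t cpu;
        uint64_t pins = m6502_init(&cpu, &(m6502_desc_t){0});

        // 7 ticks for the reset sequence, plus 2 ticks for LDA #
        for (int i = 0; i < 9; i++) {
            pins = m6502_tick(&cpu, pins);
            const uint16_t addr = M6502_GET_ADDR(pins);
            if (pins & M6502_RW) {
                M6502_SET_DATA(pins, mem[addr]);
            }
            else {
                mem[addr] = M6502_GET_DATA(pins);
            }
        }
        // the accumulator should now contain the value $33
        printf("A: %02X\n", cpu.A);
        return 0;
    }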

Running this program should print A: 33 to the terminal.

Creating a complete home computer emulator from this is now “just” a matter of making the part after m6502_tick() where the pin mask is inspected and modified more interesting. Instead of just reading and writing memory, the other system chip emulations are ticked, and some sort of ‘address decoding’ needs to take place so that memory accesses to specific memory regions are rerouted to access custom chip registers instead of memory.
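Purely as an illustration (the sound_chip_read()/sound_chip_write() helpers are made up, and a real system emulation would tick the actual chip emulations instead), such address decoding in the per-tick loop could look roughly like this for a register area at $D400:

    const uint16_t addr = M6502_GET_ADDR(pins);
    if ((addr >= 0xD400) && (addr < 0xD800)) {
        // accesses to $D400..$D7FF are routed to a (hypothetical) sound chip
        if (pins & M6502_RW) {
            M6502_SET_DATA(pins, sound_chip_read(addr));
        }
        else {
            sound_chip_write(addr, M6502_GET_DATA(pins));
        }
    }
    else if (pins & M6502_RW) {
        // everything else is a plain memory access
        M6502_SET_DATA(pins, mem[addr]);
    }
    else {
        mem[addr] = M6502_GET_DATA(pins);
    }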

Implementation Details

The new emulator uses a “giant switch case”, just like the old emulator. But instead of one case branch per instruction, the giant switch-case has been “unfolded” to handle one clock cycle of a specific instruction per case branch.

What looked like this before to handle the complete LDA # instruction:
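The old code isn’t inlined here, but it was essentially one case branch per opcode, roughly along these lines (the macro names are illustrative; _RD() stands for ‘perform a read cycle through the tick callback’):

    case 0xA9: /* LDA # */
        _SA(c->PC++);   /* put PC on the address bus for the immediate operand */
        _RD();          /* call the tick callback to perform the memory read */
        c->A = _GD();   /* move the data bus byte into the accumulator */
        _NZ(c->A);      /* update the N and Z flags */
        break;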

…looks like this now:
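(Reconstructed sketch; the case label values follow the ‘opcode shifted left by 3 bits plus cycle counter’ encoding explained further down.)

    /* LDA # */
    case (0xA9<<3)|0: _SA(c->PC++); break;
    case (0xA9<<3)|1: c->A = _GD(); _NZ(c->A); _FETCH(); break;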

The LDA # instruction costs 2 clock cycles, so there are two case branches.

The first case:
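(the same first case branch from the sketch above)

    case (0xA9<<3)|0: _SA(c->PC++); break;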

…puts the PC register into the address bus pins (with the macro _SA(), ‘set address’), increments the PC, and returns to the caller.

The caller inspects the pins, sees that it needs to do a read access from the address at PC - which in case of the LDA # instruction would be the immediate byte value to load into the A register - puts this value from memory into the data bus pins, and calls m6502_tick() again with the new pin mask.

Inside m6502_tick(), the next case branch is executed:
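(the second case branch from the sketch above)

    case (0xA9<<3)|1: c->A = _GD(); _NZ(c->A); _FETCH(); break;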

…this takes the data bus value (the _GD() macro, or ‘get data’) and puts it into the A register (c->A). Next it updates the Negative and Zero flags with the _NZ() macro, and finally ‘calls’ the _FETCH() macro to initiate the next instruction fetch.

How instruction fetch works

Note how we conveniently skipped the whole process of fetching the opcode byte for the LDA # instruction above. This is because the instruction fetching process is like the snake which bites its own tail. A new instruction cannot be fetched without finishing a previous instruction:

The _FETCH macro at the end of each instruction decoder switch-case block doesn’t actually load the next opcode byte, it only manipulates the pin mask to let the ‘user code’ do this.

All that the _FETCH macro actually does is put the current PC into the address bus pins, and set the SYNC pin to active (and importantly: the RW pin is set to signal a memory read; this happens automatically for each tick, unless it’s a special write cycle).
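Conceptually (this is not the literal macro definition from m6502.h), _FETCH() boils down to something like:

    _SA(c->PC);             /* put the current PC into the address bus pins */
    pins |= M6502_SYNC;     /* mark the next tick as the start of a new instruction */
    /* RW already defaults to 'read' for each tick */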

After the m6502_tick() function returns with the _FETCH pin mask active, the caller’s memory access code ‘sees’ a normal READ access with the current PC as address; this loads the next instruction byte into the data bus pins.

The SYNC pin isn’t interesting for the caller, but it signals the start of a new instruction to the next call of m6502_tick():

At the start of the m6502_tick() function, and before the giant instruction-decoder switch-case statement, the incoming pin mask is checked for specific control pins, among others the SYNC pin. An active SYNC pin marks the start of a new instruction, with the instruction’s opcode already loaded into the data bus pins.

When the SYNC pin is active, the current data bus value is loaded into an internal IR (instruction) register, much like in a real 6502, but with a little twist:

The actual opcode value is shifted left by 3 bits to ‘make room’ for a 3-bit cycle counter (3 bits because a 6502 instruction can be at most 7 cycles long). This merged ‘opcode plus cycle counter’ is all the instruction decoder state needed to find the proper handler code (case branch) for the current cycle of the current instruction.

The instruction fetch and decoding process looks like this inside the m6502_tick() function:
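Heavily simplified (interrupt and other control-pin handling is omitted, and the macro bodies are not shown), the structure is something like this:

    uint64_t m6502_tick(m6502_t* c, uint64_t pins) {
        if (pins & M6502_SYNC) {
            // a new instruction starts: load the opcode from the data bus into
            // IR, shifted left by 3 bits to make room for the 3-bit cycle counter
            c->IR = _GD() << 3;
            pins &= ~M6502_SYNC;
        }
        // each tick defaults to a memory read, write cycles override this
        _RD();
        // jump to the case branch for the current cycle of the current
        // instruction, and advance the cycle counter in the lower 3 bits
        switch (c->IR++) {
            /* ... */
            /* LDA # */
            case (0xA9<<3)|0: _SA(c->PC++); break;
            case (0xA9<<3)|1: c->A = _GD(); _NZ(c->A); _FETCH(); break;
            /* ... */
        }
        return pins;
    }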

It is interesting to note that this somewhat weird ‘overlapped’ instruction fetch process, where the last processing cycle of an instruction overlaps with fetching the next instruction opcode, is exactly how it works on a real 6502 CPU. In the cycle-stepped emulator this overlapping happened ‘naturally’; there’s no other way to make this work while at the same time having correct instruction cycle counts :)

How interrupt handling works

…also much like in a real 6502: somewhat simplified, at the start of the m6502_tick() function where the SYNC pin is tested, it is also checked whether an interrupt needs to be handled (interrupts are handled only at the end of an instruction, and before the next instruction starts).

If the ‘interrupt condition’ is true, the next regular opcode byte (which is already loaded into the data bus pins at this point) is discarded, and instead the IR register is loaded with the BRK instruction opcode, together with an internal flag that this isn’t a regular BRK, but a special ‘interrupt BRK’.
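As a rough sketch (the interrupt-pending condition and the internal flag name are made up for illustration), the SYNC handling then looks something like this:

    if (pins & M6502_SYNC) {
        if (irq_pending) {
            // discard the opcode already on the data bus, run BRK (opcode 0x00)
            // instead, and remember that this is an 'interrupt BRK'
            c->IR = 0x00 << 3;
            c->brk_is_irq = true;
        }
        else {
            // regular case: decode the opcode that was just fetched
            c->IR = _GD() << 3;
        }
        pins &= ~M6502_SYNC;
    }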

From here on, the normal instruction decoding process continues. The giant switch-case is called and ends up in the handling code for the BRK instruction, and this has some special-case handling for a ‘BRK running as interrupt handler’.

This is also pretty much identical to how it works in a real 6502: both maskable and non-maskable interrupts, and the reset sequence, actually run through the BRK instruction decoding process, with some ‘special case tweaks’.

Of course it can’t ever be quite as simple as that: the 6502 has quite a few interrupt-related quirks. For instance, the interrupt condition is only detected up to two cycles before the end of an instruction, any later and the interrupt is delayed until the end of the next instruction. But even this isn’t consistent, as there is a ‘branch quirk’ in conditional branches where this ‘2 cycle delay rule’ isn’t true but is reduced to a 1-cycle delay.

Another interesting ‘feature’ is interrupt hijacking. If a maskable interrupt is detected, but in the few cycles between the detection and the middle of the BRK instruction which handles this interrupt a higher-priority non-maskable interrupt is detected too, then the maskable interrupt is ‘hijacked’ and finishes as a non-maskable interrupt.

The whole thing gets more complicated because similar exceptions also exist in the external chips which trigger interrupts (in a C64: the two CIAs and the VIC-II), so getting the interrupt handling in a complete emulated system cycle-accurate under all conditions can be a bit challenging (and it’s also not completely accurate in my emulators yet, although I’m somewhat convinced that at least the CPU side is correct).

Conclusion: Links, Demos and Test Coverage

The new 6502 emulator can be seen in action in the C64 and Acorn Atom emulators here:

Currently there’s no way to actually do cycle-stepping in the debugger though; this will be added at a later time.

The 6502 source code is here:

…which is code-generated from these two files:

The C64 system emulator source code is here:

There’s also quite a few new C64 demo-scene demos on the Tiny Emulators main page. Those demos give a good impression of the overall emulation quality, since they often use the hardware in interesting ways and have strict timing requirements.

The C64 emulation is now “pretty good but still far from perfect”: while the CPU and CIA accuracy are much better now, the VIC-II emulation leaves a lot to be desired (this will be my next focus for the C64 emulation, but I’ll most likely spend a bit of time with something else first).

Here are some C64 demos which still show various rendering artefacts:

Some other demos even get stuck, which is most likely related to the VIC-II raster interrupt firing at the wrong time, but as I said, the VIC-II emulation quality will be my next focus on the C64.

The CPU and CIA emulation are nearly (but not quite) perfect now; there’s one problem related to the CIA-2 and non-maskable interrupts which I haven’t been able to track down yet.

Here’s an overview of the conformance tests I’m currently using, and their results:

All NEStest CPU tests are succeeding; these are fairly “forgiving” high level tests which only test the correct behaviour and cycle duration of documented instructions, without the BCD mode.

All Wolfgang Lorenz tests are succeeding, except:

  • the nmi test has one failed subtest
  • the four CIA realtime clock related tests are failing because my current CIA emulation doesn’t implement the realtime clock

The Wolfgang Lorenz test suite covers things like:

  • all instructions doing the “right thing” and taking the right number of clock cycles, including all undocumented and unintended instructions, and the BCD arithmetic mode
  • various ‘branch quirks’ when branches go to the same or a different 256 byte page
  • correct timing and behaviour of interrupts, including various interrupt related quirks both in the CPU and CIAs (like NMI hijacking, both CIAs requesting interrupts at the same time, various delays when reading and writing CIA registers etc)
  • various behaviour tests of the 6510’s IO port at address 0 and 1
  • and some other even more obscure stuff

The big area that’s not covered by the Wolfgang Lorenz test suite is the VIC-II; for this I’ve started to use tests from the VICE-emulator testbench now. But these tests are “red” all over the place at the moment, and it’s not yet worth writing about them :)

A new automated testing method I’m using for the cycle-stepped 6502 emulation is perfect6502, a C port of the transistor-level simulation from visual6502.

Perfect6502 allows me to compare the state of my 6502 emulation against the ‘real thing’ after each clock cycle, instead of only checking the result of complete instructions. The current test isn’t complete, it only checks the documented instructions and some specific interrupt request situations. But the point of this test is that it is trivial now to create automated tests which allow checking of specific situations against the real transistor-level simulation.