I’ve taken what I’ve learned writing a Chip8 interpreter and applied that to writing an emulator for the original Game Boy.
Like the Chip8 interpreter, this project isn’t very polished. There is one major piece of functionality still missing: audio. But there’s enough done that it can successfully play quite a few ROMs.
The source code for this is on sourcehut: https://git.sr.ht/~jos/gb-rs
Emulating the Game Boy turned out to be quite a bit more complex than the Chip8 interpreter in some surprising ways. Now I can see why folks in this community tend to draw the distinction that the Chip8 is an interpreter rather than an emulator.
In this post I’ll call out a few notable things about implementing the main loop: the complexity of instruction timing, and some difficulties I ran into when implementing drawing. I’ll cover more of the project in future posts.
Game Boy Instruction Timing
With the Chip8, timing is simply a matter of controlling how often one “ticks” to the next instruction. All instructions take the same amount of time, so the main loop looks just like what one would see when implementing a video game. In this example, we’re executing the next instruction every 2ms (or at 500Hz).
loop {
    let now = std::time::Instant::now();
    // <update timers here>
    if now - last_update > std::time::Duration::from_millis(2) {
        last_update = now;
        chip8.tick();
    }
    canvas.present();
    std::thread::sleep(std::time::Duration::from_millis(2));
}
When emulating a Game Boy it’s not so simple. Like most (all?) CPUs, the Game Boy has an internal clock that determines how quickly instructions execute, and each instruction takes some number of ticks of that clock.
With the exception of the Super Game Boy, Game Boys have a base clock speed of 4194304 Hz (that’s 2^22 Hz), often shortened to 4.194 MHz. (Anecdotally, the Super Game Boy’s slightly different clock speed of 4.295 MHz causes a few games to work incorrectly.)
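For reference, that clock rate works out to roughly 238 nanoseconds per tick, which is where the figure in the loop further down comes from. A trivial sketch of the arithmetic (the constant names here are mine, not from the emulator):

const CLOCK_HZ: u64 = 4_194_304; // 2^22 ticks per second
const NANOS_PER_SEC: u64 = 1_000_000_000;

fn cycle_duration() -> std::time::Duration {
    // 1_000_000_000 / 4_194_304 ≈ 238 ns per clock tick
    std::time::Duration::from_nanos(NANOS_PER_SEC / CLOCK_HZ)
}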
On the Game Boy, in almost all cases, instructions execute in multiples of 4 ticks of the system clock (4.194 MHz). The shortest instruction, NOP, takes 4 clock ticks. To account for this, the step() function executes a single instruction and returns how many emulated clock cycles that instruction is supposed to take, which allows the main loop to work out how much real time it should take.
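As a rough sketch of that shape (the struct fields and the handful of opcodes here are placeholders I picked for illustration, not the emulator’s actual decoder):

struct Gameboy {
    pc: u16,
    a: u8,
    b: u8,
    memory: [u8; 0x10000],
}

impl Gameboy {
    fn read_byte(&self, addr: u16) -> u8 {
        self.memory[addr as usize]
    }

    /// Execute one instruction and return how many emulated clock cycles it takes.
    fn step(&mut self) -> u32 {
        let opcode = self.read_byte(self.pc);
        self.pc = self.pc.wrapping_add(1);
        match opcode {
            0x00 => 4,                                     // NOP: the 4-cycle minimum
            0x04 => { self.b = self.b.wrapping_add(1); 4 } // INC B (flag updates omitted)
            0x3E => {                                      // LD A, d8: reads an immediate byte
                self.a = self.read_byte(self.pc);
                self.pc = self.pc.wrapping_add(1);
                8
            }
            _ => todo!("remaining opcodes"),
        }
    }
}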
One could then adapt the main loop structure from the Chip8 by simply multiplying the base cycle duration by the number of cycles the last instruction used.
loop {
    let now = std::time::Instant::now();
    // <update timers here>
    if now - last_update > std::time::Duration::from_nanos(238) * cycles {
        last_update = now;
        cycles = gameboy.tick();
    }
    std::thread::sleep(std::time::Duration::from_nanos(238) * cycles);
}
But by doing this we run into the problem that 238 nanoseconds isn’t very much time. So if gameboy.tick() tries to draw to the screen as well, we start to run into performance problems that blow the loop’s time budget.
Drawing Difficulties
Initially, I did what I thought was the simplest thing that could possibly work and implemented the “PPU” (Pixel Processing Unit) as a direct model of the real PPU’s behavior: draw directly to the screen every time a pixel needs to be updated. But this was a performance nightmare, because drawing even a single pixel takes a relatively long time.
On the Game Boy hardware, the PPU runs at the speed of the system clock, fetching sprites and drawing their pixels line by line, concurrently with normal CPU execution. However, once a line is drawn on the screen, the code cannot change what is currently displayed; it can only change what will be displayed on the next pass. So the insight is that even though a real Game Boy works line by line, we can treat it as if it works frame by frame, which is great because that’s how today’s drawing APIs expect to be used.
It’s not quite as simple as converting the PPU to behave frame-by-frame, though, because the PPU has observable behavior as it does its line-by-line drawing: at the end of each line there’s an “HBLANK” period during which the software can make changes to the video memory, and at the end of the frame there’s a similar (much longer) “VBLANK” period used for the same purpose. So the emulated PPU still works line-by-line, but instead of drawing to the screen, it’s modifying a buffer in memory.
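Sketched very roughly (the buffer layout and the placeholder rendering are assumptions for illustration, not the real tile and sprite fetching), the line-by-line-into-a-buffer idea looks something like this:

const WIDTH: usize = 160;
const HEIGHT: usize = 144;

struct Ppu {
    framebuffer: [u8; WIDTH * HEIGHT], // one palette index per pixel
    line: usize,                       // current scanline (LY)
}

impl Ppu {
    /// Render the current scanline into the in-memory buffer, then advance.
    /// Returns true once all 144 visible lines are done and VBLANK begins.
    fn render_line(&mut self) -> bool {
        for x in 0..WIDTH {
            // Real code would fetch background tiles and sprites from VRAM here;
            // this placeholder just clears the line.
            self.framebuffer[self.line * WIDTH + x] = 0;
        }
        self.line += 1;
        if self.line == HEIGHT {
            self.line = 0;
            true // frame complete: the main loop can now copy the buffer out
        } else {
            false
        }
    }
}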
So then, how does this frame-by-frame treatment fit within the main loop? Even though the drawing is line-by-line, the CPU can’t change the contents of a line once it’s been drawn until the next time that line is drawn. So letting the CPU produce a whole frame and then presenting it works just as well as drawing each pixel one at a time, assuming the difference is imperceptible (with pixel graphics at 60 fps, it’s probably true that one could not see the difference).
So we run the CPU for a budget of 70224 cycles: at 4.194 MHz, that’s how long it takes the Game Boy to display one frame (59.7 fps). We then pause execution to allow real time to catch up with emulated time (since modern CPUs run at GHz speeds, the emulated Game Boy CPU runs much faster than 4.194 MHz). At that point we draw the contents of the emulated screen.
loop {
    let mut cycle_budget = 70224;
    let frame_start = Instant::now();
    while cycle_budget >= 0 {
        let next_cycles = gb.step(&mut display) as i32;
        cycle_budget -= next_cycles;
    }
    while frame_start.elapsed().as_millis() < 16 {
        // Wait for real time to catch up with emulated time.
    }
    display.draw(); // move the emulated screen into a GPU texture
    canvas.copy(&display.texture, None, None)?; // texture to canvas
    canvas.present(); // canvas to screen.
}
One change I might make is to have the main loop detect when the VBLANK has occurred and use that as a trigger for drawing to the screen. That way, if the CPU is emulated at the appropriate speed, no additional time synchronization is needed between the display and the CPU.
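A sketch of what that might look like, assuming step() were extended to also report whether the PPU just entered VBLANK; the tuple return and the timing constant are placeholders, not the current code:

loop {
    let frame_start = std::time::Instant::now();

    // Run the CPU until the PPU reports the start of VBLANK (one full frame).
    loop {
        let (_cycles, vblank_started) = gb.step(&mut display);
        if vblank_started {
            break;
        }
    }

    // Let real time catch up with one frame of emulated time (70224 cycles ≈ 16.74 ms).
    while frame_start.elapsed() < std::time::Duration::from_micros(16_742) {
        std::thread::yield_now();
    }

    display.draw();
    canvas.copy(&display.texture, None, None)?;
    canvas.present();
}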