Ways to improve performance

Ok, you’ve got a pretty good game going, but it’s slowing down when you play it; what do you do!? Or, perhaps you want to plan ahead for a more involved game. Either way, there are a number of things we can do to improve performance. The key to this is to decide what tradeoffs you can afford.

Option 1: Making existing code more efficient

C lets you do a lot of cool things with a lot less code than assembly language. It’s very powerful, but also makes it very easy to write inefficient things without noticing. This section covers some code that produces inefficient assembly code, and how to replace it. (Note: this is very specific to 6502 assembly, and specifically our compiler - it will not translate to other languages.)

There are also guidelines published by cc65’s authors and by Shiru (neslib). These are less detailed, but great if you want the raw information.

Work in multiples of 8 and 16 where possible

In the normal world, we often group things into sizes of 5 or 10 naturally, because they make sense to us. Most people have five fingers on each hand, so we see this number a lot. In coding for low-powered systems though, this can hurt us. The console is much better at doing math based on powers of 2 (2, 4, 8, 16, 32, 64, etc…) - this can actually make a huge difference in performance.

If you find yourself doing multiplication or division by numbers that aren’t powers of two, see if you can find ways to change that. For example, if you have data stored for multiple objects in an array, try storing more (or less) data per object so that the size of each object's data is a power of two. That way your index becomes myArray[index*8] instead of myArray[index*5].

It may help your understanding and performance to use bit shifting instead of multiplication and division. This can help you force yourself to use powers of two. Here is a simple table that gives the equivalents for multiplication and division when doing a bit shift.

Bit Shift    Multiplication/Division equivalent
var << 1     var * 2
var << 2     var * 4
var << 3     var * 8
var << 4     var * 16
var << 5     var * 32
var << 6     var * 64
var >> 1     var / 2
var >> 2     var / 4
var >> 3     var / 8
var >> 4     var / 16
var >> 5     var / 32
var >> 6     var / 64

So, if you wanted to switch myArray[index*8] to bit shifting, you would use myArray[index<<3] instead.
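As a rough sketch of how this can look in practice, the snippet below stores 8 bytes of data per object and finds each object's starting offset with a shift. The names objectData, OBJECT_COUNT, OBJECT_DATA_SHIFT and object_offset are made up for this example and are not part of the engine.

```c
/* Hypothetical layout: 4 objects, each with 8 bytes of data. */
#define OBJECT_COUNT 4
#define OBJECT_DATA_SHIFT 3 /* 1 << 3 == 8 bytes per object */

unsigned char objectData[OBJECT_COUNT << OBJECT_DATA_SHIFT];

/* Returns the offset of the first byte that belongs to the given object. */
/* This gives the same result as index * 8, written as a shift instead. */
unsigned char object_offset(unsigned char index) {
    return (unsigned char)(index << OBJECT_DATA_SHIFT);
}
```

Keeping the shift amount in a single #define means that changing the amount of data per object later only requires changing one number, as long as the new size is still a power of two.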

Use unsigned types whenever possible

In short, this will generate faster code. Unsigned types start at 0 and go up; they have no concept of negative values. As a result, the underlying assembly code around them is much simpler and faster. We do this whenever possible in the nes-starter-kit engine. This obviously makes math a little confusing, since using an unsigned char will make 5-7 result in 254 rather than -2 - but if you can work within the limitations the performance benefit is worth it.
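To make the wraparound concrete, here is a small sketch. Subtracting 7 from 5 in an unsigned char wraps around to 254. The second function shows one possible way to stop at 0 instead. Both function names are hypothetical, and they take parameters only to keep the example short.

```c
/* Returns a - b as an unsigned char. Wraps around if b is larger than a. */
unsigned char subtract_unsigned(unsigned char a, unsigned char b) {
    return (unsigned char)(a - b);
}

/* Hypothetical helper: subtracts, but stops at 0 instead of wrapping around. */
unsigned char subtract_clamped(unsigned char a, unsigned char b) {
    if (b > a) {
        return 0;
    }
    return (unsigned char)(a - b);
}
```

A check like the one in subtract_clamped is handy for things like health or timers, where going below zero should never happen.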

Use char instead of int whenever possible

A char data type takes up 1 byte of RAM, whereas an int data type will take up two by default. This means an int uses more RAM, and it also makes any operation using the variable take longer.

To understand this, think of a simple operation like (variable == 25). If variable is represented by a one-byte char, the underlying code just has to load the one byte of variable, and see if it equals the byte value of 25.

If we used an int in the example instead, this same comparison would involve loading the first byte, and checking that the byte is equal to 25, like we did above. After this, the code has to load the second byte, and make sure this byte is equal to 0, since if it is non-zero, the number must be much larger than 25.
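The extra work can be pictured by writing the two-byte comparison out by hand. This is only an illustration of the idea, assuming a two-byte int as on cc65; the assembly the compiler actually generates may look different. The function names are made up for this example.

```c
/* One-byte comparison: a single byte is checked against 25. */
unsigned char char_equals_25(unsigned char value) {
    return value == 25;
}

/* Two-byte comparison, spelled out to show both checks. */
/* Assumes a 16-bit int (as on cc65), so only the low two bytes are examined. */
unsigned char int_equals_25(unsigned int value) {
    unsigned char lowByte = (unsigned char)(value & 0xff);
    unsigned char highByte = (unsigned char)((value >> 8) & 0xff);
    return lowByte == 25 && highByte == 0;
}
```

Note that 281 (256 + 25) has a low byte of 25, which is why the second check on the high byte cannot be skipped.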

Prefer the preincrement operator over the postincrement operator

This may sound confusing, but basically prefer ++i over i++ unless you are actively using the value of the expression and need the value from before the increment. The code generated by the second option can be significantly slower in some cases.

This also applies to the decrement operators - use --i instead of i-- whenever possible.

You may be surprised how much of a difference this can make - both to program size and application performance.
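Here is a short sketch of both cases. The loop uses preincrement because nothing reads the old value of i. The second function really does need the old value, so postincrement is the right tool there. All names below are hypothetical.

```c
unsigned char i;
unsigned char total;
unsigned char readIndex = 0;
const unsigned char buffer[4] = {10, 20, 30, 40};

/* Adds up the numbers 0 through count-1. The old value of i is never */
/* needed by the increment itself, so preincrement is used. */
unsigned char sum_below(unsigned char count) {
    total = 0;
    for (i = 0; i != count; ++i) {
        total += i;
    }
    return total;
}

/* Reads the current entry, then moves on to the next one. Here the value */
/* from before the increment is needed, so postincrement is used. */
unsigned char read_next(void) {
    return buffer[readIndex++];
}
```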

Use global variables instead of local ones where possible

This one goes directly against most good C practice, but the NES has extremely limited RAM, and C does not make the best use of it. The best way to work within this is to declare variables as global wherever possible. This means adding them at the file level, rather than inside individual functions. (The reasons for this are beyond the scope of the guide.)
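A minimal sketch of the pattern, using made-up names (playerHealth, damageTaken and apply_damage are not from the engine). Everything the function touches is declared at file level instead of inside the function.

```c
/* Declared at file level rather than inside a function. */
unsigned char playerHealth = 10;
unsigned char damageTaken;

/* Uses only file-level variables: no locals and no parameters. */
void apply_damage(void) {
    if (damageTaken > playerHealth) {
        playerHealth = 0;
    } else {
        playerHealth -= damageTaken;
    }
}
```

The caller sets damageTaken first and then calls apply_damage(). The tradeoff is that any function can now change these variables, so it pays to name them carefully.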

Use ZEROPAGE variables wisely

There is a special section of RAM called ZEROPAGE that works slightly more quickly than other sections. It has 256 bytes available in it, and many of these are in use by the engine. All of the tempInt and tempChar variables are located in ZEROPAGE, alongside a few other common variables. (Such as i and j.)

Whenever possible prefer to use these for calculations; especially repeated ones. Reading these variables works slightly differently, and the console can do it faster. You can also give them nicknames/aliases by using #define. The FAQ chapter has more detail on this.
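The aliasing idea might look something like the sketch below. It assumes the shared temporary variables are named tempChar1 and tempChar2; check the engine source for the exact names. They are declared here as plain globals only so the snippet stands on its own, whereas in the engine they already exist in ZEROPAGE. The alias names and the function are hypothetical.

```c
/* Stand-ins for the engine's shared temporary variables (names assumed). */
unsigned char tempChar1;
unsigned char tempChar2;

/* Aliases: readable names that refer to the same variables. */
#define enemyDistanceX tempChar1
#define enemyDistanceY tempChar2

/* Returns 1 if the enemy is within 16 pixels on both axes, 0 otherwise. */
unsigned char enemy_is_close(void) {
    return enemyDistanceX < 16 && enemyDistanceY < 16;
}
```

Because these temporary variables are shared, any other code that uses tempChar1 or tempChar2 will overwrite the values, so aliases like these are best kept to short calculations.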

Avoid passing parameters to functions if not needed

Passing parameters is surprisingly slow on the NES due to how variables are allocated. This gets worse as you add more parameters to a function. In many cases, it may be possible to use a global variable that both functions can access instead. Prefer to do that whenever possible - the code will not look as nice, however the result will be much faster.

// This demonstrates our ideal - myNumber is a global variable shared with the code calling this function.
// This should be fast.
unsigned char multiply_myNumber_by_two(void) {
    return myNumber << 1;
}

// This will also work - myNumber is a variable that is passed into this function - it would be called like:
// multiply_by_two(32)
// This will be slower than the function above because of the parameter passing.
unsigned char multiply_by_two(unsigned char myNumber) {
    return myNumber << 1;
}

Prefer separate arrays to creating arrays of structs

This one may be slightly counter-intuitive. If you have worked in C before, you may be tempted to create structs to represent things like enemies, then create an array of these structures. It makes logical sense and is easy to write code around.

For the NES, you will want to avoid this practice. Behind the scenes, your game will have to calculate the index of each element in the array, and depending on the size of the struct, this can be slow. (For the same reason we want to avoid multiplication/division and use powers of two.) Worse yet, if your array ends up having more than 255 bytes total, this can make accesses even slower since we need to use an int.

It is much more efficient to create one array for each element you would put in the struct. In our sprite example, it would be better to have an array for X positions, and a second array for Y positions. (There are parts of nes-starter-kit that do not do this, and they might benefit from a refactor like this.)
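As a sketch of the separate-array layout, assuming a hypothetical set of 8 enemies that each have an X position, a Y position and a health value:

```c
#define MAX_ENEMIES 8

/* Instead of an array of structs like this:                            */
/*   struct Enemy { unsigned char x; unsigned char y; unsigned char health; }; */
/*   struct Enemy enemies[MAX_ENEMIES];                                 */
/* keep one array per field. Each array is indexed directly by enemy number. */
unsigned char enemyX[MAX_ENEMIES];
unsigned char enemyY[MAX_ENEMIES];
unsigned char enemyHealth[MAX_ENEMIES];

unsigned char i;

/* Moves every living enemy one pixel to the right. */
void move_enemies_right(void) {
    for (i = 0; i != MAX_ENEMIES; ++i) {
        if (enemyHealth[i] != 0) {
            ++enemyX[i];
        }
    }
}
```

With this layout the enemy number is the index into every array, so no multiplication by the struct size is needed to find a field.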

Don’t forget to mark variables as const if they are constant

This is easy to forget if you mostly work in languages with fewer memory constraints. In our game, variables are generally stored in our very limited RAM space (2K) when you declare them. This allows you to change the value of those variables while the program is running.

If you have data that will not change during gameplay, you definitely want to declare it with const to make sure we do not waste RAM space on it. The data will instead be stored with the game code, and cannot be changed. This has a positive impact on performance and also reduces your risk of running out of RAM.

// This will be stored with your code, and cannot be changed by the code. It is faster
const unsigned char characterWidth = 32;
// This will be stored in memory, and can be changed by code. It is slower
unsigned char characterWidth = 32;

Option 2: Breaking logic up to run on different frames

Most of the time when we write game logic, we expect this logic to run every frame. It’s simple and it works. That said, one often-overlooked option for improving performance is to break this habit. We can make some of our logic run on every other frame, or with a little more work, even less often. If you can find pieces of your game that can do this and feel natural, it can get you back a lot of time. That said, this has to be applied carefully, or it can result in weird bugs.

The built-in engine actually does this for sprite collisions - we test for collision with half of our sprites on every even frame, then test the other half every odd frame. If you have slow code that could work well every other frame, or even less often, this is a good option.

Here’s how we do it for sprite collisions - you can follow along in source/sprites/map_sprites.c in the update_map_sprites() function. We’ve omitted a bunch of the action code to make this understandable.

    for (i = 0; i < MAP_MAX_SPRITES; ++i) {
        currentMapSpriteIndex = i << MAP_SPRITE_DATA_SHIFT;
        // ... Code to draw the current sprite skipped here...

        // ... Code to animate the current sprite skipped here...

        // We only want to do movement once every other frame, to save some cpu time. 
        // So, split this to update even sprites on even frames, odd sprites on odd frames
        if ((i & 0x01) == everyOtherCycle) {

            // Movement code omitted - this is the part that takes a long time, and we want to skip sometimes.
        }
    }

The key here is the line that reads if ((i & 0x01) == everyOtherCycle) - the expression i & 0x01 is 0 when i is even and 1 when i is odd. The variable everyOtherCycle is updated every time our main loop runs, by doing everyOtherCycle = !everyOtherCycle. This makes it jump between 1 and 0. As such, the slow code only runs when i is even on one loop, then only when i is odd on the next loop.
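To see the alternation on its own, here is a stripped-down sketch. It is not the engine's code: run_frame and moveCount are made up for this example, and the counter simply stands in for the slow movement code.

```c
#define MAP_MAX_SPRITES 8

unsigned char i;
unsigned char everyOtherCycle = 0;

/* Counts how many times each sprite ran its (pretend) movement code. */
unsigned char moveCount[MAP_MAX_SPRITES];

/* Stand-in for one pass through the main loop. */
void run_frame(void) {
    for (i = 0; i < MAP_MAX_SPRITES; ++i) {
        /* Even sprites match when everyOtherCycle is 0, odd sprites when it is 1. */
        if ((i & 0x01) == everyOtherCycle) {
            ++moveCount[i]; /* The slow movement code would go here. */
        }
    }
    /* Flip between 0 and 1 for the next frame. */
    everyOtherCycle = !everyOtherCycle;
}
```

After two calls to run_frame(), every sprite has been updated exactly once, which means each frame only did about half of the slow work.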

Option 3: Use a less resource-hungry music engine

The music engine our game uses is pretty resource-intensive. The reason for this is that it supports all of Famitracker’s features. The goal was to allow new developers to create music without any restrictions, so they could focus more on making good music. That said, this engine takes up a lot of cpu time, as well as a lot of ram.

Switching to a new engine will require modifying some assembly language, a bit of neslib itself, and also likely reworking much of your music to fit that engine’s requirements.

This option is not for the faint of heart!

As of now, there are no plans of including multiple music/sound libraries in the base nes-starter-kit repository/engine. (If a PR is made that does this in a sane way, it will be considered.)

The first thing you’ll have to do is pick an engine. The path of least resistance is probably famitone2. You can find the library alongside C bindings for it on the author’s website. If you grab the copy of neslib that doesn’t use the Famitracker library, there is a version of neslib.asm that supports famitone2 natively. The problem is, we’ve made changes to neslib.asm in nes-starter-kit, so you will have to merge the two files. You will also likely have to merge some changes from crt0.asm (mainly variable definitions).

If you are using another engine, you will need to figure out what the engine requires, and replace all famitracker code with the code for that library. This is unfortunately very complex, so it will not be detailed here.

Most NES music engines require you to structure your songs in a certain way, so that they are compatible with the engine. You may need to go through your music to make it compatible. You also may need to run the output of Famitracker through a converter program, or even save things from Famitracker differently.
