← JoysticksFlood Fill →
  Efficient rasterization
Wed 2nd February 2022   
I was writing an article about how to create an efficient Flood Fill routine on the Oric, but then I realized that many of the optimizations were generic and could be part of a dedicated article about how to write efficient rasterization code.

Rasterization

When doing computer graphics, on modern machines as well as on retro computers, it is important to exploit the architecture of the hardware to our advantage.

Generally speaking this involves understanding how the hardware is structured, and then write code which will benefit from this structure, like how the pixels are organized, the location of the video memory, but also the features of the processing units available1
Writing code at a high level of abstraction will tend to be quite inefficient compared to code designed to closely match the hardware.

Cost of Abstraction

A typical example of abstraction is the "Put Pixel" function.

In a perfect mathematical world, you can just have very high level functions (like "draw a pixel") and glue them together to create more complex shapes like "lines" and "circles".

In practice, this means passing parameters, calling functions, iterating, etc... and ignoring all the possible optimizations coming from knowing that pixels are topologically set up in patterns we can exploit:
  • If the next pixel is on the same byte, can we write them together?
  • If the next pixel is on the next scanline, can't we just do "+40" on the pointer instead of recomputing the address?
  • If we draw a large fill rectangle which fits neatly on multiples of 6, why not just "memset" the whole area?

This simple program should illustrate the point easily:
10 HIRES:PAPER 4:INK 3
20 ' Draw rectangle using FILL command
30 CURSET 30,75,3:FILL 50,10,64+1+4+16
40 ' Draw rectangle using CURSET loop
50 FOR Y=75 TO 125:FOR X=150 TO 210 STEP 2:CURSET X,Y,1:NEXT:NEXT
On the left side, the rectangle is drawn using the FILL command of the Oric BASIC ROM2, which simply fills with a byte value a rectangular area defined by a number of scanlines and columns of bytes.

It is not very flexible, but it is very fast at doing anything involving changing the content of rectangles on the screen, as long as you can express that in a single byte value to replicate, which in this case is made of 64 (which defines "a byte with pixel content") plus 1+4+16 which draws one pixel every two.

On the right side, a similar looking rectangle is drawn, but this time done instead by calling the CURSET function3 for each of the pixels we want to change.

FILL vs looping CURSET
FILL vs looping CURSET

As you can see, it is considerably slower, but on the other hand we could have done a circle or triangle using this method, while this would have been impossible using FILL.

Why is it slow?

There are multiple reasons for this horrible performance, the first being that in is written in BASIC, an interpreted language where each variable is actually a floating point value which needs to be converted to integer screen coordinates for every single pixel we draw.

By the way, I gave the BASIC its best chances by having all the code on the same line - did not help much, did it?

So, how better would it be if we replaced BASIC by C, and kept the rest of the code identical?

Let's try that!

#include <lib.h>

void main()
{
int x, y;

hires();
paper(4);
ink(3);
// Draw rectangle using FILL command
curset(30, 75, 3);
fill(50, 10, 64 + 1 + 4 + 16);
// Draw rectangle using CURSET loop
for (y = 75; y <= 125; y++)
{
for (x = 150; x <= 210; x+=2)
{
curset(x, y, 1);
}
}
}

Here is what the result looks like:

FILL vs looping CURSET
FILL vs looping CURSET

It's definitely faster than the BASIC, but still nothing too get too impressed with, and if we were to convert the code to assembler, the speed up would be even more negligible, because the problem is the curset function itself.

Let's look at CURSET

The simplest way to look at why CURSET is so slow, is to just look at how it is implemented.

You could just press F2 in Oricutron and disassemble from the address $F0C8, or you could just look in the existing commented disassembly listings of the ROM, such as the one in the Oric Advanced User Guide, l'Oric a Nu ou Au Coeur de l'Oric Atmos4.

CURSET disassembly in Oricutron's debugger
CURSET disassembly in Oricutron's debugger

The first issue comes from the way parameters are passed, which we can here see in the C OSDK library calls:

_curset
jmp _curset
ldx #3 ;Get three parms
jsr getXparm
jsr $f0c8 ;curset
jmp grexit ;common exit point

getXparm ; Get X params (16bit) from stack
ldy #0 ; X is the number of params
sty $2e0 ; Zero error indicator.
stx tmp ; store X in storage byte
ldx #0
getXloop
lda (sp),y
sta $2e1,x
inx
iny
lda (sp),y
sta $2e1,x
inx
iny
dec tmp ; decrement pointer
bne getXloop
rts
Basically, the graphic functions are using some variables in page 2 (in the $2e0 area), so before calling the CURSET function we need first to extract the coordinates from the callee language, and write them down in this buffer.

Or as the Oric user manual says:

ROM Routines parameters conventions
ROM Routines parameters conventions

The CURSET function has three parameters: X, Y and FB. That last one is used to define if the pixel should be SET, UNSET, INVERTED or not drawn at all5.

All these parameters would naturally fit on an unsigned 8 bit value, but because of the parameter passing conventions they have to be passed as 16 bit values, which is never going to win any speed contest!

ROM Routines and Addresses
ROM Routines and Addresses

Here in summary is what the function does:

We start in F0C8
  • $F0C8 Check each of the parameters (x,y,fb) by calling $F2F8 to see if they are in the accepted ranges (less than 240, 200 and 4 respectively), and if not sets the error code in the system variable $2E0, else it continues to $EEE86
  • $EEE8 This function saves the registers A and Y on the stack and then sucessively call $EEDE, $F049 and finally $F024 before restoring the registers and returning to the caller
  • $EEDE All it does is to shift and rotate the content of $212 before returning
  • $F049 Calls $F731 to compute the screen offset from Y coordinates, adds $A000 and stores the final results in $10/$11, then calls $EFC8 to compute X divided by 6 and X modulo 6 and uses that to compute both the bit mask (stored in $215) and the offset in the screen scanline (added to $10/$11)
  • $F024 Loads the byte at the address pointed in $10, tests if it's an attribute by checking the bit 6, and then based on FB's value will perform a OR, AND or EOR before writing back the value.
  • $F731 Multiply the content of A by 40, the result is stored in A and Y
  • $EFC8 Divides the content of $0C/0D by the content of $200 and stores the quotient in $0C/$0D and the remainder in $0E/$0F
I'm not going into the implementations details of the entire routine, but you can download the Oric Advanced User Guide - ROM Disassembly section and open it at the pages 96, 92, 94, 108 and 93 where you will find the various subroutines involved.

That's a lot of code, many function calls, many pushes and pops, many loops to compute values.

If you are curious about how long exactly this routines takes to execute, you can look at the previous article about Performance Profiling, here are the results I found.

CURSET profiling in Oricutron's debugger
CURSET profiling in Oricutron's debugger

When calling curset(120, 100, 1);, the JSR CURSET takes exactly 4662 clock cycles to execute, and just so you realize how bad that is, it's about a quarter of the total number of clock cycles you have during a 50hz screen refresh.

Of course, since it's all in the 16KB of ROM, it makes sense that the code tries to be as compact as possible, but the cost is real, and in our own routines we can gain a humongous amount of time by using tables instead of computations.

A C equivalent

An actual equivalent in C language would be the following:

char CURSET(unsigned char x, unsigned char y, unsigned char fb)
{
if (fb >= 4) return 1;
if (x >= 240) return 1;
if (y >= 200) return 1;

{
int div6 = (x / 6);
int mod6 = x-(div6*6);
unsigned char bit_mask = (32 >> mod6);
int address = 0xa000 + div6 + (y*40);

unsigned char value = peek(address);
if (fb == 0) value &= ~bit_mask;
else
if (fb == 1) value |= bit_mask;
else
if (fb == 2) value ^= bit_mask;
poke(address, value);
}
return 0;
}

Let's test this C version against the BASIC ROM routine!

BASIC curset vs C curset
BASIC curset vs C curset

Ouch, 304 hundreds of a second vs 612, this means our C routine is about two times slower than the ROM routine!

Let's optimize!

Maybe we can improve the code a bit, like for example by removing the things we don't need: We are drawing pixels, so FB is always set to 1, and we don't use the return value anyway.

Here is our new version:

void CURSET(unsigned char x, unsigned char y)
{
if (x >= 240) return;
if (y >= 200) return;

{
int div6 = (x / 6);
int mod6 = x-(div6*6);
unsigned char bit_mask = (32 >> mod6);
int address = 0xa000 + div6 + (y*40);
unsigned char value = peek(address);
poke(address, value | bit_mask);
}
}

The results?

Well, we still have 304 for the BASIC obviously, but the C version is now down to 576.

Still some way to go.

We don't need to check if the coordinates are valid, and instead of peek and poke we can just use pointer arithmetic directly.

void CURSET(unsigned char x, unsigned char y)
{
int div6 = (x / 6);
int mod6 = x-(div6*6);
unsigned char bit_mask = (32 >> mod6);
unsigned char* address = (unsigned char*)0xa000 + div6 + (y*40);
*address |= bit_mask;
}

The C version is now down to 548.

We do not actually need to have a function, so instead we can make it a macro, which will remove the parameters passing.

#define CURSET(x,y) \
{ \
int div6 = (x / 6); \
int mod6 = x-(div6*6); \
unsigned char bit_mask = (32 >> mod6); \
unsigned char* address = (unsigned char*)0xa000 + div6 + (y*40); \
*address |= bit_mask; \
}

This brings us down to 447.

At this point, we can look at the generated code (in TMP/linked.s) where most of the time7 is spent in the divide and multiply routines:

// div6 = (x / 6)
lda reg0 : sta tmp0 : lda reg0+1 : sta tmp0+1
lda tmp0 : sta op1 : lda tmp0+1 : sta op1+1
lda #<6 : sta op2 : lda #>6 : sta op2+1
jsr div16i : stx tmp0 : sta tmp0+1

// mod6 = x-(div6*6)
lda tmp0 : sta reg2 : lda tmp0+1 : sta reg2+1
lda reg0 : sta tmp0 : lda reg0+1 : sta tmp0+1
lda reg2 : sta tmp1 : lda reg2+1 : sta tmp1+1
lda #<6 : sta op1 : lda #>6 : sta op1+1
lda tmp1 : sta op2 : lda tmp1+1 : sta op2+1 :
jsr mul16i : stx tmp1 : sta tmp1+1

// bit_mask = (32 >> mod6)
sec : lda tmp0 : sbc tmp1 : sta tmp0
lda tmp0+1 : sbc tmp1+1 : sta tmp0+1
lda tmp0 : sta reg3 : lda tmp0+1 : sta reg3+1
lda reg3 : sta tmp0 : lda reg3+1 : sta tmp0+1
lda #<32 : sta tmp : lda #>32 : ldx tmp0 : beq *+8 : lsr : ror tmp : dex : bne *-4
ldx tmp : stx tmp0 : sta tmp0+1

// address = (unsigned char*)0xa000 + div6 + (y*40)
lda tmp0 : sta reg4
lda reg1 : sta tmp0 : lda reg1+1 : sta tmp0+1
lda #<40 : sta op1 : lda #>40 : sta op1+1 : lda tmp0 : sta op2 : lda tmp0+1 : sta op2+1
jsr mul16i : stx tmp0 : sta tmp0+1
lda reg2 : sta tmp1 : lda reg2+1 : sta tmp1+1
clc : lda #<$a000 : adc tmp1 : sta tmp1 : lda #>$a000 : adc tmp1+1 : sta tmp1+1
clc : lda tmp0 : adc tmp1 : sta tmp0 : lda tmp0+1 : adc tmp1+1 : sta tmp0+1
lda tmp0 : sta reg5 : lda tmp0+1 : sta reg5+1
lda reg5 : sta tmp0 : lda reg5+1 : sta tmp0+1

// *address |= bit_mask
ldy #0 : lda (tmp0),y : sta tmp1
lda tmp1 : sta tmp1 : lda #0 : sta tmp1+1
lda reg4 : sta tmp2
lda tmp2 : sta tmp2 : lda #0 : sta tmp2+1
lda tmp1 : ora tmp2 : sta tmp1 : lda tmp1+1 : ora tmp2+1 : sta tmp1+1
ldy #0 : lda tmp1 : sta (tmp0),y

So, what can we do to improve this mess?

Well, considering the Oric does not have any faster way to compute a divide or a multiply, the usual solution is to use tables, in this case we need three:
  • A table for Y multiplied by 40 (to access the right scanline)
  • A table for X divided by 6 (to find the right byte on the scanline)
  • A table for X modulo 6 (to get the bit position in the byte)
Let's do that.

My bytes have 6 bits!

One of the peculiarities of the Oric, is that there are only 6 pixels in each of the bytes of the screen8, and a consequence of that is that we end-up having to deal with the number six, which is not a power of two, so you can't just shift around to find where to draw a pixel.

Given an horizontal coordinate between 0 and 240, we need to divide the value by 6 to find which byte on the scanline contains the pixel we are interested in.

Let's start with the div 6 table and change the code to the following:

unsigned char Div6[240];   // Values divided by 6

void GenerateTables()
{
int x;
for (x = 0; x < 240; x++)
{
Div6[x] = x / 6;
}
}

#define CURSET(x,y) \
{ \
int div6 = Div6[x]; \ x/6
int mod6 = x-(div6*6); \
unsigned char bit_mask = (32 >> mod6); \
unsigned char* address = (unsigned char*)0xa000 + div6 + (y*40); \
*address |= bit_mask; \
}

We were down to 447, and now... 292!

We've now beaten the ROM routines which were at 304.

Let's modulate

Obviously, if we had to divide by six to find the proper byte, it means we can use the remainder of the operation to get a value between 0 and 5 which will indicate the bit we are interested in.

Let's continue with the modulo 6 table.

unsigned char Mod6[240];   // Values modulo 6

void GenerateTables()
{
int x;
for (x = 0; x < 240; x++)
{
Mod6[x] = x % 6;
}
}

#define CURSET(x,y) \
{ \
int div6 = Div6[x]; \
int mod6 = Mod6[x]; \ x%6
unsigned char bit_mask = (32 >> mod6); \
unsigned char* address = (unsigned char*)0xa000 + div6 + (y*40); \
*address |= bit_mask; \
}

New record: 186

Finding the scanline

Next on the list, the multiply by 40 table. That one will have to be slightly different since the values will not fit on a 8bit result.

unsigned int  Mul40[200];  // Values multiplied by 40

void GenerateTables()
{
int y;
for (y = 0; y < 200; y++)
{
Mul40[y] = y * 40;
}
}

#define CURSET(x,y) \
{ \
int div6 = Div6[x]; \
int mod6 = Mod6[x]; \
unsigned char bit_mask = (32 >> mod6); \
unsigned char* address = (unsigned char*)0xa000 + div6 + Mul40[y]; \ // y*40
*address |= bit_mask; \
}

Our final result is 90, so about 3.3 times faster than the ROM routine.

Are we done yet?

We could totally stop there, enjoying the snow falling outside, until we start having a nagging feeling that maybe there are more optimizations we could unleash.

Up to this point we had some relatively generic tables (multiply by 6, modulo 6, multiply by 40), but if these routines are ever going to be used for graphical routines, why not just push even farther?

Let's look at this line for example:

unsigned char* address = (unsigned char*)0xa000 + div6 + Mul40[y];

We compute the final address by adding together the base address of the screen, the offset to reach the proper scanline, and the offset to access the proper byte, but as it happens, the base address never change so we could just put it in the table as well!

Here is what the slightly reorganized code looks like:

unsigned char* Scanlines[200];  // Values multiplied by 40 + a000

void GenerateTables()
{
int y;
for (y = 0; y < 200; y++)
{
Scanlines[y] = (unsigned char*)0xa000 + y*40;
}
}

#define CURSET(x,y) \
{ \
unsigned char* address = Scanlines[y] + Div6[x]; \
*address |= (32 >> Mod6[x]); \
}

Which brings us down to 73 (4.16 times faster that the ROM)

But we can still do better, like by realizing that the bit mask is no stored in the same order as the modulo value, but instead of shifting we could just put that in the table instead.

Let's do that.

unsigned char Mod6[240];   // Values modulo 6 (inverted)

void GenerateTables()
{
int x;
for (x = 0; x < 240; x++)
{
Mod6[x] = 32 >> (x % 6);
}
}

#define CURSET(x,y) \
{ \
unsigned char* address = Scanlines[y] + Div6[x]; \
*address |= Mod6[x]; \
}

This small change brings us down to 64 (4.75 faster).

A little bit more?

At this point we are limited by the C compiler inefficient code, and the fact we have no control on the memory location of our tables.

The 6502 typically will spend more time accessing data when it has to cross a 256 bytes page boundary, so it is quite important to try to align things in memory.

Another important thing is that it is a 8 bit processor, it does not have any native way to handle 16 bit values, so things like our integer arrays of screen addresses would be better replaced by two 8 bit arrays, one for the high byte of the address and one for the lower byte.

Unfortunately at this optimization level it become very difficult to fight against the C language rules such as "implicit integer promotion", so we have to turn to assembler.

And intermediate step is to use inline assembler!

unsigned char ScanlinesHigh[200];  // High byte Values multiplied by 40 + a000 
unsigned char ScanlinesLow[200]; // Low byte Values multiplied by 40 + a000

unsigned char PosX; // Temporary: Can't access locals from inline ASM
unsigned char PosY; // Temporary: Can't access locals from inline ASM

void GenerateTables()
{
int y;
for (y = 0; y < 200; y++)
{
unsigned int address = 0xa000 + y*40;
ScanlinesHigh[y] = (address>>8);
ScanlinesLow[y] = (address & 255);
}
}

#define CURSET(x,y) \
{ \
PosX = x; \
PosY = y; \
asm( \
"ldy _PosY;" \
"lda _ScanlinesLow,y;" \
"sta tmp0+0;" \
"lda _ScanlinesHigh,y;" \
"sta tmp0+1;" \
"ldx _PosX;" \
"ldy _Div6,x;" \
"lda (tmp0),y;" \
"ora _Mod6,x;" \
"sta (tmp0),y;" \
); \
}

This change from C to assembler bring us down to 29 (10.4 times faster than the ROM).

Just to get more accurate numbers, let's look at exactly how many clock cycles we are talking about.

Our CURSET profiling in Oricutron's debugger
Our CURSET profiling in Oricutron's debugger

The original ROM CURSET took 4662 clock cycles to execute, and our improved one only uses 85 clock cycles, which is actually 54 times faster9, which means we could draw 235 pixels per frame instead of 4.

Normally we would not need the PosX and PosY variables, but the inline assembler syntax of the compiler is quite limited and it does not have access to the local variables, so we had to do a temporary copy.

The assembler itself is quite simple:
  • Use the Y value to get a byte from _ScanlinesLow and _ScanlinesHigh that we store in the 16 bit zero page pointer tmp0
  • Use the X value to get from the Div6 table the scanline offset into the Y register
  • Load the byte from the video memory using the screen address stored in tmp0 and the column offset in the Y register
  • Mask in the bit pattern we get from the Mod6 table for this X coordinate
  • Write back the modified byte at the same video memory location
And that's it really, if you have everything properly laid-out in tables, and the right values in your registers, you can do the entire Curset in just eight 6502 assembler instructions.

Final notes

Something important to consider, is that a pixel drawing routine is rarely used to just draw a single dot. Most of the time it's part of a more complex shape like a line, circle, polygon, etc...

In such a context, there are many optimizations you can apply, which will make the pixel drawing even faster, like for example:
  • If all the pixels are on the same vertical line, the Div6 and Mod6 values stay the same, so you can just fetch it once and reuse it all over
  • If all the pixels are on the same horizontal scanline, the pointer value does not change, so you can skip the ScanLineHigh and Low table accesses
  • In this situation, you can also just draw the left and right extremities using some precomputed masks for each of the 6 possible lengths, and fill in between all the bytes with the value 127 (64+1+2+4+8+16+32)

I'm probably forgetting things, so feel free to ask question or point out inefficiencies!

Anyway, I hope that was interesting and that you learnt a thing or two :)


1. On the Oric we only have a 6502, but some machines have graphic cards, Blitting chips, coppers, etc...
2. See page 129 of the Oric Atmos User Manual
3. See page 120 of the Oric Atmos User Manual
4. All these books are available for download in the Oric Library.
5. This last case is useful for when you need to move the current cursor position before drawing something else like a CIRCLE
6. Also the code does JSR $EEE8 followed by RTS instead of doing a JMP which would be faster
7. Yes, the OSDK compiler generates horrible code, but here it's the div16i and mul16i which dwarf the rest
8. Which is why we get 240 pixels horizontaly using 40 bytes per scanline
9. The discrepancy between the high level counter and the cycle profiler is caused by the time spent in the two for loops
comments powered by Disqus

Coverity Scan Build Status