Theoric MagazineOptimization strategiesOriginal implementationHorizontal drawingSpan initializationIteration loopDepth computationToo Smart???Keep It Simple Stupid

Now that the final scroller is finally running at an **acceptable speed**, the second most pressing issue is the

**Mandelbrot fractal**part.

Optimizing a scroller is just a matter of finding a fast way to copy things around, but optimizing a fractal is much more complicated due to the

**limitations**of the 6502 processor regarding doing

**mathematics**.

## Theoric Magazine

The idea of having a fractal came from an old Oric Magazine called**Theoric**.

One of the issues contained a small

**BASIC program**which could render arbitrary parts of the Mandelbrot set, in

**full color**.

It was obviously

**terribly slow**, but you could just change the parameters and let it run during the night, and on the

**next morning**you would have your beautiful fractal on the screen ready to be saved to tape :)

Just for the sake of completeness here is the complete

**original**

^{1}source code:

The program is

0 PAPER 0:INK 4

10 HIRES:DOKE #306,#FFFF

11 PAPER 0:INK 3

15 PR=19:D1=64:D2=32:D3=512:D4=15

20 FOR I=1 TO 199 STEP 2

30 CURSET 0,I,3:FILL 1,1,16+4

40 CURSET 6,I-1,3:FILL 1,1,1

50 NEXT

60 FOR X=11 TO 237 STEP 2

70 FOR Y=0 TO 199 STEP 2

79 N=PR:X1=X-100:Y1=Y-100:XN=X1:YN=Y1

80 REPEAT

81 D=XN+YN:B=XN-YN:C=XN*YN

82 XN=D*B/D1-X1:YN=C/D2-Y1:N=N-1

83 UNTIL N=0 OR (ABS(XN)+ABS(YN)>D3)

90 IF N AND 1 THEN CURSET X,Y,1

91 IF N AND 2 THEN CURSET X+1,Y,1

92 IF N AND 4 THEN CURSET X,Y+1,1

93 IF N AND 8 THEN CURSET X+1,Y+1,1

120 NEXT

125 NEXT

130 DOKE #306,10000:END

**not**very complicated:

- Line 15 defines the
**Mandelbrot set**to compute - Lines 20-50 sets some attributes on the screen to achieve the
**color effects** - Line 60 is the outer loop iterating
**horizontally**on the screen - Line 70 is the second loop iterating
**vertically**on the screen - Lines 79-83
**compute**the actual color of the point at this location - Lines 90-93 draw some points to achieve the dithering effect based on the
**point color**

**REPEAT-UNTIL section**

^{2}!

An important element to notice is the STEP 2 on each of the FOR loops: We are actually computing only

**a quarter**of the total number of points, because each of the point is then rendered on screen using a 2 by 2 pixel matrix to achieve the nice colored

**dithering**effect.

Something else to notice is that in this particular case the image is symmetrical along the horizontal axis, so a

**simple symmetry**could have achieved a near doubling in performance... but that's definitely cheating:

**Let's keep that as an option for later if we really need to**.

If you are interested by the topic, you can read the very long wikipedia page about the Mandelbrot set (in this particular case we are particularly interested in the section called

Computer drawings), and there are also many different implementations in various languages on the Rosetta Code site.

## Optimization strategies

Until we**profile**the code we will not actually know where most of the time is spent, but we can already notice a few things:

- The BASIC variable format uses
**floating point**representation, do we need that? - What is the range of the variables, do they fit on one
**byte**, do they get**negative**? - There are two
**multiplications**and two**divisions**in the inner loop... - The loop test performs
**absolute value**computations - Drawing 2x2 sized pixels requires additional
**masking**when blitting individually

**recursively drawn**squares:

This is a

**common optimization**seen in fractal rendering code, it's based on the observation that if all the points on the border of a section of the Mandelbrot

^{3}set have the same color (depth) then all the points inside

**ALSO**have the same depth, thus do not require additional calculation.

I did the same thing as I did with the scroller, and modified the demo so it jumped directly to the fractal part, without playing the music.

Something

**worth noting**is that without the music playing, the fractal finishes just a fraction of time after displaying the

**WRATHDESIGNS**group name, skipping the

**WWW.C64.ORG**and

**WWW.ORIC.ORG**names.

Indeed, comparing the computation time with and without the music gives us a running time of

**1:30 without**and

**1:40 with**the music.

Not a huge difference, but worth keeping in mind when doing adjustments at the end if we want the music to be in

**sync**with the effects for example.

So let see

**how**the demo code is written so we can see

**where**to optimize.

## Original implementation

The mandelbrot computation is called from**main.c**:

The file

//

// Mandelbrot

//

Mandelbrot:

SwitchToHires();

ClearHiresScreen();

Mandel_InitDisplay();

Mandel_DrawFractal();

**mandel.s**contains the actual assembler code:

What this routine does is to first compute the

_Mandel_DrawFractal

lda #0

sta _x

lda #0

sta _y

lda #128

sta _w

jsr _Mandel_HLine

lda #0

sta _x

lda #64

sta _y

lda #128

sta _w

jsr _Mandel_HLine

lda #0

sta _x

lda #0

sta _y

lda #64

sta _h

jsr _Mandel_VLine

lda #128

sta _x

lda #0

sta _y

lda #64

sta _h

jsr _Mandel_VLine

jsr _Mandle_BlitBigBuffer

lda #0

sta _VblCounter

lda #0

sta _x

lda #0

sta _y

lda #128

sta _w

lda #64

sta _h

jsr _Mandel_Divide

jsr _Mandle_BlitBigBuffer

rts

**top**,

**bottom**,

**left**and

**right**side values, and then call a

**recursive routine**.

While writing this article I noticed that some of the values seem to be computed

**more than once**: When drawing a rectangle, the corner points are

**shared**, and in this case it looks like the point (0,0) at least is computed more than once. If this happens on each recursive level, that may end up being a

**significant amount of time**.

Another obvious thing is that despite being computed, we do

**not see**the left vertical column being displayed!

So maybe we are actually computing

**more pixels**than we are actually able to draw on screen...

Something else to check is if intermediate variables are using

**zero page**locations or

**normal**memory:

Everything seems

.zero

_x .dsb 1

_y .dsb 1

_w .dsb 1

_h .dsb 1

_x1 .dsb 2

_y1 .dsb 2

_xn .dsb 2

_yn .dsb 2

**alright**, we are indeed using zero page for all these variables.

## Horizontal drawing

So, let see how these**subroutines**are implemented:

The

_Mandel_HLine

jsr _Mandel_ComputeParameters

lda _w

pha

loop_x

clc

lda _x1+0

sta _xn+0

adc #2

sta _x1+0

lda _x1+1

sta _xn+1

adc #0

sta _x1+1

lda _y1+0

sta _yn+0

lda _y1+1

sta _yn+1

jsr _compute_outer

lda _n

ldy _x

sta (big_buff),y ; Store pixel in buffer

inc _x

dec _w

bne loop_x

pla

sta _w

rts

**lda _w/pha/pla/sta _w**seem peculiar, but then we have to remember that the code is doing

**recursive calls**, so it makes sense to

**store**the original Width and Height values, else

**Bad Things Would Happen**.

The rest of the code seems reasonable, first calling

**_Mandel_ComputeParameters**to initialize some parameters, then iterate for each of the values along the requested

**_w**parameters, for each calling

**_compute_outer**and then storing the result in some

**intermediate buffer**.

Let see how these two sub-routines looks like...

**dig down**into the rabbit hole.

## Span initialization

What this routine does is not very complicated, it takes the

#define MANDEL_X_POS 32

#define MANDEL_Y_POS 92

_Mandel_ComputeParameters

ldy _y

lda _BufferAddrLow,y

sta big_buff+0

lda _BufferAddrHigh,y

sta big_buff+1

clc ; Compute X1=-64+(x*2)

lda _x

rol

sta _x1+0

lda #0

rol

sta _x1+1

sec

lda _x1+0

sbc #<MANDEL_X_POS

sta _x1+0

lda _x1+1

sbc #>MANDEL_X_POS

sta _x1+1

clc ; Compute y1=-64+(y*2)

lda _y

rol

sta _y1+0

lda #0

rol

sta _y1+1

sec

lda _y1+0

sbc #<MANDEL_Y_POS

sta _y1+0

lda _y1+1

sbc #>MANDEL_Y_POS

sta _y1+1

rts

**vertical position**(stored passed in the

**_y**variable) and use it to compute which

**scanline**in the buffer will be used to store the result of the computations, as well as computing the

**X1=X-100:Y1=Y-100**part of the BASIC program, except here

**100**was replaced by

**64**because we prefer

**power of two**values when doing assembler code.

This routine can most probably be improved considering all the initial values are 8 bit, but not being part of the inner loop this would seem to be a

**premature optimization**to me.

## Iteration loop

This code is pretty much the straight conversion from the BASIC original code

#define MANDEL_ITER_BIS 19

#define MAX_MANDEL_ITER 512

_outer_exit

rts

_compute_outer

// n=pr

lda #MANDEL_ITER_BIS

sta _n

compute_outer_loop

jsr _compute_inner

dec _n

beq _outer_exit

lda _xn+1 ; Compute _axn=ABS(xn)

bmi xn_neg

sta _axn+1

lda _xn

sta _axn

jmp xn_abs

xn_neg

sec

lda #0

sbc _xn

sta _axn

lda #0

sbc _xn+1

sta _axn+1

xn_abs

lda _yn+1 ; Compute _ayn=ABS(yn)

bmi yn_neg

sta _ayn+1

lda _yn

sta _ayn

jmp yn_abs

yn_neg

sec

lda #0

sbc _yn

sta _ayn

lda #0

sbc _yn+1

sta _ayn+1

yn_abs

// sum axn+ayn => axn

// if axn+ayn>d3 exit loop

clc

lda _axn

adc _ayn

sta _axn

lda _axn+1

adc _ayn+1

cmp #>MAX_MANDEL_ITER

bcc compute_outer_loop

lda _axn

cmp #<MAX_MANDEL_ITER

bcc compute_outer_loop

end_outer

rts

**UNTIL N=0 OR (ABS(XN)+ABS(YN)>D3)**.

First it calls

**_compute_inner**then it checks if we have reached

**MANDEL_ITER_BIS**(the end of iteration), or if the computed

**_xn**and

**_yn**values gets larger than

**MAX_MANDEL_ITER**.

The reason why the code is

**so long**compared to what it actually does, is that the 6502 only have 8 bit registers, and does not have any way to compute an

**absolute value**, which means we manually need to check the

**sign**of the most significant byte, eventually perform a

**subtraction**from 0 to get the positive value.

Let's continue digging...

## Depth computation

Now we are talking.

_compute_inner

clc ; Compute d=xn+yn

lda _xn

adc _yn

sta _d

lda _xn+1

adc _yn+1

sta _d+1

sec ; Compute b=xn-yn

lda _xn

sbc _yn

sta _b

lda _xn+1

sbc _yn+1

sta _b+1

lda _xn ; Compute c=xn*yn

sta op1

lda _xn+1

sta op1+1

lda _yn

sta op2

lda _yn+1

sta op2+1

jsr mul16i

stx _c

sta _c+1

ldx _c+1 ; Compute yn=c/d2-y1

ldy _c

lda _TableDiv32_low,x

and #%11111000

sta _tempfastdiv

lda _TableDiv32_low,y

and #%00000111

ora _tempfastdiv

sec

sta _c

sbc _y1

sta _yn

lda _TableDiv32_high,x

sbc _y1+1

sta _yn+1

lda _d ; Compute xn=d*b/d1-x1

sta op1

lda _d+1

sta op1+1

lda _b

sta op2

lda _b+1

sta op2+1

jsr mul16i

stx op1

sta op1+1

ldx op1+1

ldy op1

lda _TableDiv64_low,x

and #%11111100

sta _tempfastdiv

lda _TableDiv64_low,y

and #%00000011

ora _tempfastdiv

sec

sta op1

sbc _x1

sta _xn

lda _TableDiv64_high,x

sbc _x1+1

sta _xn+1

rts

This routine is both

**long**, and full of

**slow**code (Like calling

**mul16i**twice).

This routine is called for every single point, up to 20 times, so any improvement done there will have a

**huge impact**on the total running time.

## Too Smart???

So we know that this particular implementation is using a recursive divide and conquer in order to speed up calculations by not having to recompute pixels belonging to the same depth level, but do we**actually**know if this

**improved**the computation time?

Sure, from a

**mathematical**point of view we can expect better results, but in the real world it means we may have to compute any particular value of any particular point on the Mandelbrot set, which means we can't really use any of the usual optimizations:

If you iterate

**horizontally**or

**vertically**, you can just move a pointer forward instead of having to recompute the position in the buffer.

So what about trying first to

**hack around**a function that just try to fill the screen from top to bottom and see if there is any

**speed difference**?

As you can see this is

_Mandel_DrawFractalNaive

lda #0

sta _VblCounter

lda #0

sta _y

loop

lda #0

sta _x

lda #128

sta _w

jsr _Mandel_HLine ; h_line(0,0,largeur,XO,YO,PAS,buf,screen,pitch);

jsr _Mandle_BlitBigBuffer

inc _y

ldy _y

cpy #64

bne loop

rts

**ultra basic code**: I did not change anything else, all I did was to

**call the horizontal drawing loop code 64 times**.

As expected the performance is

**not awesome**(1 minute and 47 seconds), specially toward the bottom part (which was also slow on the original demo), but it

**could**have been much

**worse**, specially considering that we call the fullscreen buffer blitting after each scanline!

## Keep It Simple Stupid

There are some clear benefits in using a**simpler**code: It is generally much

**easier**to optimize and debug.

So I continued on the KISS

^{4}road, and I modified the blit routine to only

**display the scanline**we just computed (instead of blitting the entire buffer), and then

**adjusted**the computed width from 128 to 114 pixels (the actually visible part).

This made the whole

**computation time sink**to 1 minute and 20 seconds.

Basically what that mean is that all the

**convoluted recursive**cha-bang is not actually faster on this particular set, or more exactly, the

**overhead**of doing complicated recursive code is larger than the performance boost given by not doing some of the computations.

Another possibility is that there's a

**bug**in the recursive call, but this code is quite complex, so I think changing the code to just draw horizontal lines is probably

**the way to go**!

I guess I'll have to dig a bit more, and who knows, maybe there will be a part III to this article.

1. I added spaces and tabs for readability.↩

2. The inner-most loop is where most of the time is spent, so always check there first because any optimisation done there is multiplicated by the number of iteration made by all the outer calls, which can be very many↩

3. On the borders of the set, this is not valid for the bounding box that contains the set itself.↩

4. See the title, it's a common idiom in programming, opposing complexity to simplicity↩