Crafty Optimizations

What do you do if the usual optimization recommendations, like using the optimisers, choosing the smallest feasible data types, choosing unsigned types over signed, etc., just aren't quite enough? The optimizations discussed in this article take advantage of how statements must be encoded by your compiler and they might allow you to squeeze a little extra into a device or allow your code to run just a little bit faster. As always, it is important to avoid degrading the readability of code for the sake of saving a few bytes, as difficult-to-read source code can be difficult to debug; however, the changes described in this article are trivial and should impact little on your code's understandability.

To help assess the impact of these suggestions, check your project’s list file to monitor the size of the assembly code sequences produced by the compiler for statements in your program.

Accessing Structure Members Via Pointers

If you are accessing structure members via a pointer, changing the order of the members might improve the code that accesses them.

Consider the following structure with three members and a pointer that could be used to access these.

struct MYSTRUCT {
    int a;
    int b;
    int c;
} myStruct;

struct MYSTRUCT * stp = &myStruct;

When you access a structure member indirectly, the compiler has to produce assembly instructions that take the address contained in the structure pointer add the offset of the member in the structure to form the final address, then read the memory located at that address. As the content of the pointer is not known by the compiler and must be obtained at runtime, the addition must also be performed at runtime.

Some device instruction sets cannot perform the addition of a member's offset to the address when the indirect access is made, so the compiler might need to insert one or more extra instructions to perform this addition. However, if you are accessing the first member inside the structure, it will be located at an offset of zero and so, this addition is not required. Thus, if you place the most commonly accessed member first in the structure definition, your code might gain a little extra performance.

The following shows the assembly code that might be required to access the first and last members of the above structure via a pointer for a 16-bit PIC® microcontroller.

28:                    stp->a = 0x0000;
000300  804042     MOV stp, W2
000302  EB0900     CLR [W2]
29:                    stp->c = 0x0000;
000304  804042     MOV stp, W2
000306  EB0180     CLR W3
000308  980123     MOV W3, [W2+4]

Note that this situation is not present with direct access of structure members. In this case, the addition is performed between two constant operands that are known at compile time, hence the addition can also be performed at compile time, and no extra code is required.

Note also that the compilers for 16- and 32-bit devices might pad structures so that members are word aligned. This might improve the code required to access the members, but it increases the memory required for the structure. When writing the definition for a structure and if possible, try to arrange the members so that they are grouped by type size as well as having the most commonly accessed first.

struct MYSTRUCT {
    int a;   // I need to be accessed fast
    int c;   // both ints together
    char b;  // both chars together
    char d;
} myStruct;

Access of Elements in Awkwardly Sized Arrays

If you have an array, consideration of the array dimensions or element size might improve the code required to access the array elements.

Consider a multidimensional array, which is really just an array of subarrays, defined as:

char myArray[2][5];

If we consider only arrays of single-byte char types for the moment, the expression to access an array element, such as myArray[x][y], requires the compiler to take the address of the array and add to this an offset which is calculated as the second index, y, added to the result of the first index, x, multiplied by the size of the subarray, 5. As a C expression, that operation could be written as:

*(&myArray + y + x * 5)

If x is a variable, and hence not known at compile time, the multiplication must be performed at runtime. When using devices with instruction sets that do not readily allow for multiplication, consideration of this expression can improve the generated code.

In the above case, simply swapping the dimensions of the array, as shown below, change the constant multiplication operand from a not-friendly 5 to a more-friendly 2.

char myArray[5][2];

Multiplication by a value that is a power of 2 can often be performed as a logical shift, which will be much faster. The following MPLAB® XC8 output snippets show a call to a multiplication routine being used to obtain the address of an array defined as myArray[2][5], followed by code using a shift to access an array defined as myArray[5][2].

35:                result = myArray25[x][y];
075F  3005     MOVLW 0x5
0760  00F0     MOVWF ___bmul@multiplicand
0761  087D     MOVF x, W
0762  27F4     CALL ___bmul      ;call to multiplication routine
36:                result = myArray52[x][y];
076C  087D     MOVF x, W
076D  00F6     MOVWF ??_main
076E  302A     MOVLW low (_myArray52)
076F  35F6     LSLF ??_main, F   ;logical shift

Even with a single-dimension array, this consideration can still be made, although in this case, the only option available to you is to adjust the size of the array elements. If you had an array of structures and each structure was 3 bytes long, adding an extra byte to the structure would increase the size of the memory required to store the array, but accessing elements that were 4 bytes long might be faster.

Note that if you only use constant array indices, both operands to the multiplication will be known, and the operation will be performed at compile time with no runtime penalty.

Multiplication By Big Numbers

The order of the operands of multiplication can sometimes affect how fast the multiplication is performed.

If a hardware multiply instruction is not available in the instruction set of the device you are using, then a routine might be called to perform this operation. With the MPLAB XC8 compiler, this routine can use a loop, which runs for as long as there are non-zero bits remaining in the first multiplication operand. Thus, if you can arrange for the first operand to have the smaller value, the multiplication might be performed faster.

In the following example, the first multiplication expression takes roughly 50 instruction cycles less time to execute than the second expression for a mid-range PIC device.

char a = 10;
char b = 200;
int result;

result = a * b;  // faster
result = b * a;  // slower
© 2024 Microchip Technology, Inc.
Notice: ARM and Cortex are the registered trademarks of ARM Limited in the EU and other countries.
Information contained on this site regarding device applications and the like is provided only for your convenience and may be superseded by updates. It is your responsibility to ensure that your application meets with your specifications. MICROCHIP MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND WHETHER EXPRESS OR IMPLIED, WRITTEN OR ORAL, STATUTORY OR OTHERWISE, RELATED TO THE INFORMATION, INCLUDING BUT NOT LIMITED TO ITS CONDITION, QUALITY, PERFORMANCE, MERCHANTABILITY OR FITNESS FOR PURPOSE. Microchip disclaims all liability arising from this information and its use. Use of Microchip devices in life support and/or safety applications is entirely at the buyer's risk, and the buyer agrees to defend, indemnify and hold harmless Microchip from any and all damages, claims, suits, or expenses resulting from such use. No licenses are conveyed, implicitly or otherwise, under any Microchip intellectual property rights.