Floating Point Operations on the ARM Cortex-M4F

My mantra is *not* to use any floating point data types in embedded applications, or at least to avoid them whenever possible: for most applications they are not necessary and can be replaced by fixed point operations. Not only do floating point operations have numerical problems, they can also lead to performance problems, as in the following (simplified) example:

    #define NOF 64
    static uint32_t samples[NOF];
    static float Fsamples[NOF];
    float fZeroCurrent = 8.0;

    static void ProcessSamples(void) {
      int i;

      for (i=0; i < NOF; i++) {
        Fsamples[i] = samples[i]*3.3/4096.0 - fZeroCurrent;
      }
    }

ARM designed the Cortex-M4 architecture in a way that an FPU can be added as an option. For example, the NXP ARM Cortex-M4 on the FRDM-K64F board has an FPU present.

MK64FN1M0VLL12 on FRDM-K64F

The question is: how long does that function take to perform the operations?

Looking at the loop, it executes

    Fsamples[i] = samples[i]*3.3/4096.0 - fZeroCurrent;

which loads a 32-bit value, performs a floating point multiplication, followed by a floating point division and a floating point subtraction, and then stores the result back into the result array.

The NXP MCUXpresso IDE has a cool feature showing the number of CPU cycles spent (see Measuring ARM Cortex-M CPU Cycles Spent with the MCUXpresso Eclipse Registers View). Running that function (without any special optimization settings in the compiler) takes:

Cycle Delta

0x4b9d or 19'357 CPU cycles for the whole loop. Measuring just one iteration of the loop takes 0x12f or 303 cycles. One might wonder why it takes such a long time, as we do have an FPU?

The answer is in the assembly code:

This clearly shows that it does not use the FPU, but instead uses software floating point operations from the standard library.

The reason is the way the operation is written in C:

              
Fsamples[i] = samples[i]*3.3/4096.0 - fZeroCurrent;              

We have here a uint32_t multiplied with a floating point number:

                
samples[i]*3.3                

The thing is that a constant like '3.3' in C is of type *double*. As such, the operation will first convert the uint32_t to a double, and then perform the multiplication as a double operation.
The same holds for the division: it will be performed as a double operation:

                  
samples[i]*3.3/4096.0

The same goes for the subtraction with the float variable: because the left operand is a double, the subtraction has to be performed as a double operation.

                    
samples[i]*3.3/4096.0 - fZeroCurrent

Finally, the result is converted from a double to a float to store it in the array:

                      
Fsamples[i] = samples[i]*3.3/4096.0 - fZeroCurrent;

Now the library routines called in the above assembly code should be clear:

  • __aeabi_ui2d: convert unsigned int to double
  • __aeabi_dmul: double multiplication
  • __aeabi_ddiv: double division
  • __aeabi_f2d: float to double conversion
  • __aeabi_dsub: double subtraction
  • __aeabi_d2f: double to float conversion

But why is this done in software and not in hardware, as we have an FPU?

The answer is that the ARM Cortex-M4F has but a *unmarried precision* (float) FPU, and not a double precision (double) FPU. Equally such information technology just can practice bladder operations in hardware but non for double type.

The solution in this case is to use float (and not double) constants. In C, the 'f' suffix can be used to mark constants as float:

                        
Fsamples[i] = samples[i]*3.3f/4096.0f - fZeroCurrent;                        

With this, the code changes to this:

Using Single Precision FPU Instructions

So now it is using single precision instructions of the FPU :-). That only takes 0x30 (48) cycles for a single iteration, or 0xc5a (3162) for the whole thing: 6 times faster :-).

The example can be optimized even further with:

                          
Fsamples[i] = samples[i]*(3.3f/4096.0f) - fZeroCurrent;                          

Other Considerations

Using float or double is not bad per se: it all depends on how they are used and whether they are really necessary. Using fixed-point arithmetic is not without issues, and the standard sin/cos functions use double, so you don't want to re-invent the wheel.

Centivalues

One way is to use a float type, say for a temperature value:

                            
float temperature; /* e.g. -37.512 */                            

Instead, it might be a better idea to use a 'centi-temperature' or 'milli' integer variable:

                              
int32_t centiTemperature; /* -3751 corresponds to -37.51 */                              

That way, normal integer operations can be used.

GCC Single Precision Constants

The GNU gcc compiler offers to treat double constants like 3.0 as single precision constants (3.0f) using the following option:

-fsingle-precision-constant causes floating-point constants to be loaded in single precision even when this is not exact. This avoids promoting operations on single precision variables to double precision like in x + 1.0/3.0. Note that this also uses single precision constants in operations on double precision variables. This can improve performance due to less memory traffic.

See https://gcc.gnu.org/wiki/FloatingPointMath

RTOS

The other consideration: using the FPU potentially means stacking more registers. This is a possible performance problem for an RTOS like FreeRTOS (see https://www.freertos.org/Using-FreeRTOS-on-Cortex-A-Embedded-Processors.html). The ARM Cortex-M4 supports 'lazy stacking' (see https://stackoverflow.com/questions/38614776/cortex-m4f-lazy-fpu-stacking), but if the FPU is used, it still means more stacked registers. If no FPU is used, then it is better to select the M4 port in FreeRTOS:

M4 and M4F in FreeRTOS

Summary

I recommend not using any float or double data types unless necessary. And if you have an FPU, pay attention to whether it is a single precision FPU only, or whether the hardware supports both single and double precision. With a single precision FPU, using the 'f' suffix for constants and casting things to (float) can make a big difference. But keep in mind that float and double have different precision, so this might not solve every problem.

Happy Floating 🙂

PS: if in need of a double precision FPU, have a look at the ARM Cortex-M7 (e.g. First Steps: ARM Cortex-M7 and FreeRTOS on NXP TWR-KV58F220M or First Steps with the NXP i.MX RT1064-EVK Board)

Links

  • Measuring ARM Cortex-M CPU Cycles Spent with the MCUXpresso Eclipse Registers View
  • Cycle Counting on ARM Cortex-M with DWT
  • MCUXpresso IDE: http://mcuxpresso.nxp.com/ide/
  • DWT Registers: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0439b/BABJFFGJ.html
  • DWT Control Register: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0337e/ch11s05s01.html
