In this post we will discuss the pros and cons of fixed- and floating-point arithmetic in FPGAs and provide links to several useful resources on this topic.

*TLDR*

- *The FPGA Audio Processor design uses floating-point processing because it helps speed up the development and keep the project moving quickly*
- *The negative impact of floating-point processing in resource utilization and performance, as well as the reliance on vendor-specific IP cores, are acknowledged and deemed worth it for this project, but this won’t be the case for most designs*
- *Converting a floating-point algorithm to fixed-point and evaluating its performance is a non-trivial process that will be explored on its own in this Blog*

### Background

One of the most rewarding things about writing this Blog over the past months has been sharing it with the FPGA community and getting feedback from FPGA enthusiasts all over the world. This has played a huge role in my staying motivated to keep putting time and effort into further developing the FPGA Audio Processor Project and documenting its progress in the Blog.

One topic that keeps coming up as part of that feedback is a criticism of my using floating-point numeric representation for the FPGA Audio Processor. Though I briefly presented my rationale for using floating-point processing in a previous post, I decided it would make sense to have a post dedicated to it, so that I can reference it whenever this comes up in a discussion.

### Advantages of Floating-Point Representation

Using floating-point representation makes it easy to develop and test a design that involves a lot of math, which is the case with most DSP-centric projects. This is because the algorithms are usually developed on a PC, using (often double-precision) floating-point representation in languages and environments that are removed from the limitations of the platform that will actually run those algorithms. In many cases the algorithms don’t even consider basic characteristics of real-time processing, like streaming architectures or time deadlines.

**By mapping the floating-point operations of the high-level algorithm directly to floating-point operator IP cores in our FPGA, we can substantially reduce the time and effort required to implement and verify a processing module in our RTL-based design.** **This is why I decided to use floating-point processing in the FPGA Audio Processor design. This speed of development helps me keep the project moving forward. However, this factor alone will not justify using floating-point processing in most designs.**

Another advantage of using floating-point numbers is that they allow us to work with a much larger dynamic range, that is, to represent very large and very small numbers. The dynamic range of audio signals is usually limited by the bit depth of the AD and DA converters, which in most cases goes up to 24 bits (though some recording devices in the past few years have started incorporating 32-bit converters). This is low when compared to what can be represented with single-precision floating-point numbers, but floating-point numbers still help us do math between those audio sample values and, say, filter coefficients, which can be very small values and often require many fractional digits.
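To put rough numbers on that comparison, here is a minimal Python sketch. It assumes the usual convention of measuring dynamic range as the ratio between the largest and smallest representable magnitudes, expressed in dB. The function names are made up for this illustration and are not part of the FPGA Audio Processor.

```python
import math
import struct


def integer_dynamic_range_db(bits: int) -> float:
    """Dynamic range of a `bits`-wide integer sample, as the ratio of
    the number of quantization levels to one level, in dB."""
    return 20.0 * math.log10(2 ** bits)


def float32_dynamic_range_db() -> float:
    """Ratio between the largest finite and the smallest positive normal
    single-precision values, in dB (subnormals are ignored here)."""
    largest = struct.unpack("<f", struct.pack("<I", 0x7F7FFFFF))[0]
    smallest_normal = struct.unpack("<f", struct.pack("<I", 0x00800000))[0]
    return 20.0 * math.log10(largest / smallest_normal)


if __name__ == "__main__":
    print(f"24-bit converter: {integer_dynamic_range_db(24):.1f} dB")
    print(f"float32:          {float32_dynamic_range_db():.1f} dB")
```

Under these assumptions a 24-bit converter lands at roughly 144 dB, while single-precision floating-point spans more than 1500 dB, which is why intermediate results rarely overflow or underflow.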

### Disadvantages of Floating-Point Representation

The main disadvantage of using floating-point processing is that it consumes more resources (in some cases *a lot* more) than the equivalent operations using fixed-point representation. Higher resource utilization will require a larger FPGA, which has a higher power consumption, larger footprint and, most importantly, higher cost. In most projects the cost of the device is the driving factor, and it makes floating-point processing prohibitively expensive. This is why fixed-point is the default for most designs.

Another disadvantage of floating-point processing is that, all else being equal, the performance of the system will be worse than a fixed-point equivalent. Floating-point arithmetic requires more logic than its fixed-point equivalent, and this will be reflected in the minimum latency and maximum throughput that can be achieved by the design.

A third disadvantage of floating-point processing is the byproduct of one of its advantages. As we mentioned earlier, floating-point processing offers a large dynamic range, so we can easily do math with very large and very small numbers. However, this comes at the expense of the precision and accuracy of intermediate values: for a given word width, the bits used for the exponent are not available to represent the value itself. Single-precision floating-point representation will still be good enough for many, perhaps most applications, but there are some instances in which it might not be (for instance, when using recursive filters).
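A small Python sketch can make this trade-off concrete. It compares two 32-bit words: a single-precision float and a Q1.31 fixed-point number (1 sign bit, 31 fractional bits). The Q1.31 format and the helper names are assumptions chosen for this illustration, not something taken from the FPGA Audio Processor design.

```python
import struct

FRAC_BITS = 31  # assumed Q1.31 format: values in [-1, 1) with 31 fractional bits


def to_float32(x: float) -> float:
    """Round a Python float (double precision) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_q31(x: float) -> int:
    """Convert a real value in [-1, 1) to its Q1.31 integer representation."""
    return round(x * (1 << FRAC_BITS))


def from_q31(n: int) -> float:
    """Convert a Q1.31 integer back to a real value."""
    return n / (1 << FRAC_BITS)


big = 0.5
small = 2.0 ** -26  # far smaller than `big`, but well within Q1.31 resolution

# Single precision: the sum is rounded back to a 24-bit significand.
float_sum = to_float32(to_float32(big) + to_float32(small))

# Q1.31: plain integer addition, every one of the 31 fractional bits is kept.
fixed_sum = from_q31(to_q31(big) + to_q31(small))

print("float32 lost the increment:", float_sum == big)
print("Q1.31 kept the increment:  ", fixed_sum == big + small)
```

Both words are 32 bits wide, but the float only has 24 bits of significand, so the small increment is rounded away. In a recursive filter this kind of rounding happens on every sample and feeds back into the next one, which is where it can start to matter.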

Finally, floating-point processing has the potential to create a strong dependency on vendor or third-party IP. Translating algorithms from their high-level floating-point representation into an RTL description is only straightforward if we already have IP cores that can perform the floating-point operations for us. Otherwise we would have to put in the up-front effort to develop and validate them *before* we start working on our domain-specific algorithms. This dependency is not exclusive to floating-point processing: many designs rely on CORDIC and Direct Digital Synthesis (DDS) IP cores for doing fixed-point math. However, it is more prominent in floating-point representation, where even the simplest arithmetic and logical operations rely on IP cores.

### Conversion to Fixed-Point Representation

Ok, so now we have established the advantages and disadvantages of floating-point processing when compared to fixed-point. We also clarified why floating-point representation is well-suited for our FPGA Audio Processor. But what can we do if floating-point processing is not appropriate (which is the case for most designs)?

The answer is simple, but the actual process is not: we need to convert the floating-point algorithm to fixed-point representation and make sure that it still meets the performance required by our system. This is an important topic that we will address later. In the meantime, here are some useful resources you can check out:

- This paper presents a methodology for converting floating-point audio algorithms to fixed-point. It includes an example conversion of an IIR filter, as well as references to other sources exploring this topic.
- The ZipCPU Blog has three great articles focused on fixed-point arithmetic with FPGAs with in-depth discussions of bit growth and rounding, as well as a methodology for debugging DSP algorithms.
- This paper explores the conversion of floating-point C algorithms to fixed-point representation to run on an integer processing pipeline.
- This MathWorks webinar explores the conversion of floating-point algorithms to fixed-point representation for FPGAs using MATLAB and Simulink.

Of course, once the RTL Audio Lab series on fixed-point conversion is out, I’ll link to it here as well.
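To give a flavor of what the first step of such a conversion looks like, here is a minimal Python sketch that quantizes a few coefficients to a signed fixed-point format and measures the resulting error. The coefficient values, the word lengths and the function names are all hypothetical, picked only for illustration. A real conversion also has to deal with bit growth, rounding of intermediate results and overflow, which the resources above cover in depth.

```python
def quantize(value: float, frac_bits: int, word_bits: int) -> int:
    """Convert a real value to a signed fixed-point integer,
    rounding to the nearest step and saturating at the word limits."""
    scaled = round(value * (1 << frac_bits))
    max_int = (1 << (word_bits - 1)) - 1
    min_int = -(1 << (word_bits - 1))
    return max(min_int, min(max_int, scaled))


def dequantize(n: int, frac_bits: int) -> float:
    """Convert a fixed-point integer back to the real value it represents."""
    return n / (1 << frac_bits)


def max_quantization_error(values, frac_bits: int, word_bits: int) -> float:
    """Largest absolute difference between each value and its quantized version."""
    return max(
        abs(v - dequantize(quantize(v, frac_bits, word_bits), frac_bits))
        for v in values
    )


# Hypothetical filter coefficients, all inside the [-1, 1) range.
coeffs = [0.0625, -0.3, 0.75, 0.001]

if __name__ == "__main__":
    for word_bits in (8, 16, 24):
        frac_bits = word_bits - 1  # 1 sign bit, the rest fractional
        err = max_quantization_error(coeffs, frac_bits, word_bits)
        print(f"{word_bits:2d}-bit word: max coefficient error = {err:.3e}")
```

For values inside the representable range the error is bounded by half a step, so each extra bit halves it. Whether a given word length is good enough depends on how those errors propagate through the algorithm, and that is the part that has to be evaluated case by case.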

It is also worth mentioning that some High-Level Synthesis tools (like Vitis HLS) provide assistance in the implementation of fixed-point algorithms. HLS is a major topic for FPGA development moving forward, and we will address it here in the Blog.

That’s it for this discussion on fixed- and floating-point representation. Next week we will be back to a more hands-on topic with part two of our floating-point FIR filter. See you then!

Cheers,

Isaac

*If you would like to support RTL Audio Lab, you can make a one-time donation **here**.*