018 – FPGA Stereo Delay


In this post we will go over the implementation of a stereo delay effect for our FPGA Audio Processor.

In the previous post we took our first steps into the world of time-based processing with a mono delay. Expanding that design to support stereo audio will open new possibilities for creating interesting effects. Let’s get started.

From Mono to Stereo

As a quick reminder, here’s what a mono feedback delay looks like:

Feedback Delay. Source Hack Audio by Eric Tarr

Adding stereo support to our mono Delay is relatively straightforward. The core idea is that, instead of converting the incoming samples to mono and storing them in a single circular buffer, we create one buffer per channel. The way in which we keep track of the write and read pointers, apply the delay gain and implement a feedforward or a feedback architecture remains unchanged. We’ll just have two of everything now.

Expanding our mono Delay to stereo does, however, change the architecture and resource usage of our logic. Because we no longer convert the incoming stereo data to mono, we can get rid of the floating-point division logic altogether. On the other hand, we must now perform the delay-line multiplication and addition for both channels. The biggest difference in resource utilization, though, comes from doubling the memory needed for the delay lines. There is no way around this: we need a delay line for each channel if we want a stereo Delay effect in the first place. We’ll come back to this later.
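
To put a number on the doubling, here is a small Python sketch. It assumes each buffer uses the full range of the 14-bit address seen in the listing further down (16384 words) and 32-bit samples; the actual depth configured in the `delay_circular_buffer` IP is not shown in this post, so treat the figures as an upper bound rather than a measurement.

```python
def delay_line_bits(depth_words: int, word_bits: int = 32) -> int:
    """Storage needed by one delay line, in bits."""
    return depth_words * word_bits


# Assumption: the buffer depth equals the full 14-bit address range.
ASSUMED_DEPTH_WORDS = 1 << 14

mono_bits = delay_line_bits(ASSUMED_DEPTH_WORDS)
stereo_bits = 2 * mono_bits  # one delay line per channel

print(f"mono:   {mono_bits // 1024} Kibit")
print(f"stereo: {stereo_bits // 1024} Kibit")
```

Under these assumptions the mono design needs 512 Kibit and the stereo design 1024 Kibit for the delay lines alone.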

Because the mono conversion is no longer needed, we can simplify the logic in the Main FSM so that it better follows the logical flow of the stereo Delay effect. The Main FSM will now:

  • Fetch the audio samples for the left and right channels. In a feedforward architecture, they are also written to the circular buffers here.
  • Read the next sample from the left buffer and apply the delay gain to it.
  • Read the next sample from the right buffer and apply the delay gain to it.
  • Add the left delayed sample to the real-time left sample and assign the sum to the left output. In a feedback architecture, this outgoing left sample is also written to the left circular buffer here.
  • Add the right delayed sample to the real-time right sample and assign the sum to the right output. In a feedback architecture, this outgoing right sample is also written to the right circular buffer here.
  • Mark the outputs as valid and go back to waiting for the next samples.
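
The steps above can also be modelled in software, which is handy as a reference when checking simulation results. The following is a minimal Python sketch, not the RTL itself: the class name `StereoDelayModel` and its parameters are made up for illustration, it works on Python floats rather than 32-bit floating-point words, and it is not cycle-accurate.

```python
from typing import Tuple


class StereoDelayModel:
    """Behavioural (not cycle-accurate) model of a stereo delay."""

    def __init__(self, delay_left: int, delay_right: int,
                 gain: float = 0.75, feed_type: str = "FEEDBACK") -> None:
        if feed_type not in ("FEEDBACK", "FEEDFORWARD"):
            raise ValueError("feed_type must be FEEDBACK or FEEDFORWARD")
        self.gain = gain
        self.feed_type = feed_type
        # One circular buffer and one pointer per channel.
        self.buffers = [[0.0] * delay_left, [0.0] * delay_right]
        self.positions = [0, 0]

    def process(self, left: float, right: float) -> Tuple[float, float]:
        """Consume one stereo sample and return one stereo output sample."""
        outputs = []
        for channel, sample in enumerate((left, right)):
            buffer = self.buffers[channel]
            position = self.positions[channel]
            # Oldest stored sample, scaled by the delay gain.
            delayed = self.gain * buffer[position]
            out = sample + delayed
            # Feedforward stores the incoming sample, feedback stores the output.
            buffer[position] = sample if self.feed_type == "FEEDFORWARD" else out
            self.positions[channel] = (position + 1) % len(buffer)
            outputs.append(out)
        return outputs[0], outputs[1]


# Usage: feed an impulse through a short feedback delay.
model = StereoDelayModel(delay_left=2, delay_right=3)
impulse = [1.0] + [0.0] * 6
left_out = [model.process(x, x)[0] for x in impulse]
print(left_out)
```

With a feedback architecture the left channel repeats the impulse every two samples, each time scaled by the gain again. With `feed_type="FEEDFORWARD"` only a single repeat appears.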

As a potential optimization, performing the addition for each channel directly after applying the delay gain would probably allow us to shave off one signal (for each channel) that is currently needed for intermediate storage. The RTL description of our stereo Delay is shown below.

module delay #(
    parameter string DELAY_TYPE = "STEREO",         // "STEREO"
    parameter string FEED_TYPE = "FEEDBACK"         // "FEEDBACK", "FEEDFORWARD"
    )(
    input   logic           i_clock,
    input   logic [31 : 0]  i_data_left,
    input   logic [31 : 0]  i_data_right,
    input   logic           i_data_valid,
    output  logic [31 : 0]  o_data_left,
    output  logic [31 : 0]  o_data_right,
    output  logic           o_data_valid
);

    // Floating-point Multiplier
    logic           fp_mult_valid_out;
    logic [31 : 0]  fp_mult_data_out;
    logic [31 : 0]  fp_multiplier_data_a_in;
    parameter logic [31 : 0]  feedback_gain = 32'b00111111010000000000000000000000;     // 0.75 in IEEE 754 single precision
    logic           fp_multiplier_data_valid;
    fp_multiplier fp_multiplier_inst(
        .aclk                   (i_clock),
        .s_axis_a_tvalid        (fp_multiplier_data_valid),
        .s_axis_a_tdata         (fp_multiplier_data_a_in),
        .s_axis_b_tvalid        (fp_multiplier_data_valid),
        .s_axis_b_tdata         (feedback_gain),
        .m_axis_result_tvalid   (fp_mult_valid_out),
        .m_axis_result_tdata    (fp_mult_data_out)
    );

    // Floating-point Adder
    logic           fp_adder_valid_out;
    logic [31 : 0]  fp_adder_data_out;
    logic [31 : 0]  fp_adder_data_a_in;
    logic [31 : 0]  fp_adder_data_b_in;
    logic           fp_adder_data_valid;
    fp_adder fp_adder_inst(
        .aclk                   (i_clock),
        .s_axis_a_tvalid        (fp_adder_data_valid),
        .s_axis_a_tdata         (fp_adder_data_a_in),
        .s_axis_b_tvalid        (fp_adder_data_valid),
        .s_axis_b_tdata         (fp_adder_data_b_in),
        .m_axis_result_tvalid   (fp_adder_valid_out),
        .m_axis_result_tdata    (fp_adder_data_out)
    );

    // Circular Buffer for the Left Channel
    logic           delay_buffer_left_wr_en;
    logic [13 : 0]  delay_buffer_left_addra = 'b0;
    logic [31 : 0]  delay_buffer_left_dina;
    logic [13 : 0]  delay_buffer_left_addrb = 14'd1;
    logic [31 : 0]  delay_buffer_left_doutb;
    delay_circular_buffer delay_circular_buffer_left_inst(
        .clka   (i_clock),
        .wea    (delay_buffer_left_wr_en),
        .addra  (delay_buffer_left_addra),
        .dina   (delay_buffer_left_dina),
        .clkb   (i_clock),
        .addrb  (delay_buffer_left_addrb),
        .doutb  (delay_buffer_left_doutb)
    );

    // Circular Buffer for the Right Channel
    logic           delay_buffer_right_wr_en;
    logic [13 : 0]  delay_buffer_right_addra = 'b0;
    logic [31 : 0]  delay_buffer_right_dina;
    logic [13 : 0]  delay_buffer_right_addrb = 14'd1;
    logic [31 : 0]  delay_buffer_right_doutb;
    delay_circular_buffer delay_circular_buffer_right_inst(
        .clka   (i_clock),
        .wea    (delay_buffer_right_wr_en),
        .addra  (delay_buffer_right_addra),
        .dina   (delay_buffer_right_dina),
        .clkb   (i_clock),
        .addrb  (delay_buffer_right_addrb),
        .doutb  (delay_buffer_right_doutb)
    );

    // Main FSM
    enum logic [2 : 0]  {IDLE,
                        DELAY_GAIN_LEFT,
                        DELAY_GAIN_RIGHT,
                        ADD_OUTPUT_LEFT,
                        ADD_OUTPUT_RIGHT,
                        GENERATE_OUTPUT} fsm_state = IDLE;
    logic [31 : 0]  current_sample_left;
    logic [31 : 0]  current_sample_right;
    logic [31 : 0]  sample_left_aux;
    logic [31 : 0]  sample_right_aux;
    logic           ping_pong_left_buffer_active = 1'b1;

    always_ff @(posedge i_clock) begin
        case (fsm_state)
            IDLE : begin
                fp_adder_data_valid <= 1'b0;
                delay_buffer_left_wr_en <= 1'b0;
                delay_buffer_right_wr_en <= 1'b0;
                o_data_valid <= 1'b0;
                if (i_data_valid == 1'b1) begin
                    current_sample_left <= i_data_left;
                    current_sample_right <= i_data_right;
                    if (FEED_TYPE == "FEEDFORWARD") begin
                        delay_buffer_left_dina <= i_data_left;
                        delay_buffer_left_wr_en <= 1'b1;
                        delay_buffer_right_dina <= i_data_right;
                        delay_buffer_right_wr_en <= 1'b1;
                    end
                    fp_multiplier_data_a_in <= delay_buffer_left_doutb;         // Start the delay gain on the left channel
                    fp_multiplier_data_valid <= 1'b1;
                    if (delay_buffer_left_addra == 12000) begin                 // Wrap the pointers around if they reach the end of the buffer
                        delay_buffer_left_addra <= 0;
                        delay_buffer_left_addrb <= 1;
                    end
                    if (delay_buffer_right_addra == 16381) begin                // Wrap the pointers around if they reach the end of the buffer
                        delay_buffer_right_addra <= 0;
                        delay_buffer_right_addrb <= 1;
                    end
                    fsm_state <= DELAY_GAIN_LEFT;
                end
            end

            DELAY_GAIN_LEFT : begin
                delay_buffer_left_wr_en <= 1'b0;
                delay_buffer_right_wr_en <= 1'b0;
                fp_multiplier_data_valid <= 1'b0;
                if (fp_mult_valid_out == 1'b1) begin
                    sample_left_aux <= fp_mult_data_out;
                    fp_multiplier_data_a_in <= delay_buffer_right_doutb;        // Start the delay gain on the right channel
                    fp_multiplier_data_valid <= 1'b1;
                    fsm_state <= DELAY_GAIN_RIGHT;
                end
            end

            DELAY_GAIN_RIGHT : begin
                fp_multiplier_data_valid <= 1'b0;
                if (fp_mult_valid_out == 1'b1) begin
                    sample_right_aux <= fp_mult_data_out;
                    fp_adder_data_a_in <= current_sample_left;                  // Start adding the current left sample with the delayed one
                    fp_adder_data_b_in <= sample_left_aux;
                    fp_adder_data_valid <= 1'b1;
                    fsm_state <= ADD_OUTPUT_LEFT;
                end
            end

            ADD_OUTPUT_LEFT : begin
                fp_adder_data_valid <= 1'b0;
                if (fp_adder_valid_out == 1'b1) begin
                    o_data_left <= fp_adder_data_out;
                    if (FEED_TYPE == "FEEDBACK") begin
                        delay_buffer_left_dina <= fp_adder_data_out;
                        delay_buffer_left_wr_en <= 1'b1;
                    end
                    fp_adder_data_a_in <= current_sample_right;                 // Start adding the current right sample with the delayed one
                    fp_adder_data_b_in <= sample_right_aux;
                    fp_adder_data_valid <= 1'b1;
                    fsm_state <= ADD_OUTPUT_RIGHT;
                end
            end

            ADD_OUTPUT_RIGHT : begin
                fp_adder_data_valid <= 1'b0;
                delay_buffer_left_wr_en <= 1'b0;
                if (fp_adder_valid_out == 1'b1) begin
                    o_data_right <= fp_adder_data_out;
                    if (FEED_TYPE == "FEEDBACK") begin
                        delay_buffer_right_dina <= fp_adder_data_out;
                        delay_buffer_right_wr_en <= 1'b1;
                    end
                    o_data_valid <= 1'b1;
                    delay_buffer_left_addra <= delay_buffer_left_addra + 1;     // Increment the write and read pointers
                    delay_buffer_left_addrb <= delay_buffer_left_addrb + 1;
                    delay_buffer_right_addra <= delay_buffer_right_addra + 1;
                    delay_buffer_right_addrb <= delay_buffer_right_addrb + 1;
                    fsm_state <= IDLE;
                end
            end

            default : begin
                fsm_state <= IDLE;
            end
        endcase
    end

endmodule
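
A quick way to double-check the `feedback_gain` constant in the listing is to compute the IEEE 754 single-precision bit pattern of 0.75 in software. The Python sketch below uses only the standard `struct` module; the helper names are illustrative.

```python
import struct


def float_to_bits(value: float) -> str:
    """Return the 32-bit IEEE 754 single-precision pattern of value."""
    (raw,) = struct.unpack(">I", struct.pack(">f", value))
    return format(raw, "032b")


def bits_to_float(bits: str) -> float:
    """Interpret a 32-character bit string as a single-precision float."""
    (value,) = struct.unpack(">f", struct.pack(">I", int(bits, 2)))
    return value


pattern = float_to_bits(0.75)
print(pattern, len(pattern))
print(bits_to_float(pattern))
```

The pattern is 32 bits long (sign, 8 exponent bits, 23 fraction bits), so the literal needs a 32-bit width. The same helpers make it easy to generate the constant for a different gain.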

Simulation and Implementation

We are now ready to simulate our new stereo Delay. For this test we set the left buffer pointers to wrap around after 2000 samples (~45 milliseconds) and the right buffer pointers to wrap around after 6381 samples (~145 milliseconds), rather than the 12000 and 16381 used in the listing above. Thus, we expect to see about three taps of the left delay for every tap of the right channel. We’ll use the same mono snare hit that we used in the previous post for easier visualization. The result of the simulation is shown in the figure below.
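
The conversion from buffer length to delay time is simple enough to verify by hand, but a short Python sketch makes it easy to try other lengths. The 44.1 kHz sample rate is an assumption: it is not stated in this post, although it is consistent with the millisecond figures quoted above.

```python
SAMPLE_RATE_HZ = 44_100  # assumption, not stated in the post


def samples_to_ms(num_samples: int, sample_rate_hz: int = SAMPLE_RATE_HZ) -> float:
    """Delay time in milliseconds for a delay line of num_samples."""
    return 1000.0 * num_samples / sample_rate_hz


left_ms = samples_to_ms(2000)
right_ms = samples_to_ms(6381)
taps_ratio = 6381 / 2000  # left taps per right tap

print(f"left:  {left_ms:.1f} ms")
print(f"right: {right_ms:.1f} ms")
print(f"ratio: {taps_ratio:.2f}")
```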

Simulation of the Feedback Stereo Delay

We can see that the taps in the left channel decay more rapidly than in the right channel. This makes sense: both channels are generated from the same input signal, but the feedback gain is applied about three times as often in the left channel.
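
The difference in decay can be estimated with a few lines of Python. This sketch assumes an ideal feedback delay in which every pass through the delay line multiplies the signal by the gain once, so the k-th tap has amplitude gain^k relative to the original hit. The function names are illustrative.

```python
import math


def tap_level(elapsed_samples: int, delay_samples: int, gain: float = 0.75) -> float:
    """Relative amplitude of the most recent tap after elapsed_samples."""
    taps_so_far = elapsed_samples // delay_samples
    return gain ** taps_so_far


def to_db(amplitude: float) -> float:
    """Convert a linear amplitude ratio to decibels."""
    return 20.0 * math.log10(amplitude)


# Compare both channels at the moment the first right tap appears.
elapsed = 6381
left_level = tap_level(elapsed, 2000)
right_level = tap_level(elapsed, 6381)

print(f"left:  {left_level:.3f} ({to_db(left_level):.1f} dB)")
print(f"right: {right_level:.3f} ({to_db(right_level):.1f} dB)")
```

By the time the right channel produces its first tap, the left channel has already been through the gain three times, so it sits at 0.75^3 (about 0.42) while the right channel is still at 0.75.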

Let’s turn our attention to the resource utilization of our stereo Delay. The figure below shows how much of the on-chip Block RAM is used by the stereo Delay: more than 20%! This is clearly not a scalable solution for longer delay lines, which can easily require storing several seconds’ worth of audio samples. The solution is to move most of the delayed samples to off-chip memory and keep only the last few hundred in Block RAM. Off-chip memory is usually plentiful compared with on-chip Block RAM (for our FPGA/board pair we are talking gigabytes vs. megabits), but it comes at the expense of more complex logic and higher memory bandwidth for reading from and writing to external memory. Having said that, once we have managed the memory accesses, memory bandwidth in an audio application with low channel counts will rarely be a bottleneck.
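
A rough estimate shows why storage, rather than bandwidth, is the limiting factor. The Python sketch below assumes a 44.1 kHz sample rate, 32-bit samples, and one read plus one write per channel for every sample period; none of these are measured values from the design.

```python
def delay_storage_bits(seconds: float, sample_rate_hz: int = 44_100,
                       channels: int = 2, word_bits: int = 32) -> int:
    """Bits needed to hold `seconds` of audio for all channels."""
    return int(seconds * sample_rate_hz) * channels * word_bits


def delay_bandwidth_bits_per_s(sample_rate_hz: int = 44_100,
                               channels: int = 2, word_bits: int = 32) -> int:
    """Memory traffic, assuming one read and one write per channel per sample."""
    accesses_per_sample = 2
    return accesses_per_sample * channels * word_bits * sample_rate_hz


storage = delay_storage_bits(3.0)
bandwidth = delay_bandwidth_bits_per_s()

print(f"3 s of stereo delay: {storage / 1e6:.2f} Mbit")
print(f"memory traffic:      {bandwidth / 1e6:.2f} Mbit/s")
```

Three seconds of stereo delay already needs about 8.5 Mbit of storage, while the memory traffic stays below 6 Mbit/s, which is small by the standards of external memory interfaces.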

BRAM Utilization in the Implemented Design

Multi-tap Delay

Before we wrap up our discussion of delays, there is one more type I’d like to mention: the multi-tap delay.

In a feedforward delay we have exactly one tap per channel. In a feedback delay we have several (in theory infinitely many) taps, but we can’t control exactly how many of them we’ll hear: they decay on their own once we’ve set the feedback gain. If we want to specify the exact number of taps we hear, we need a multi-tap delay, as shown in the figure below.

Multi-tap Delay. Source Hack Audio by Eric Tarr

A multi-tap delay consists of several feedforward delay lines, each with its own delay time and, if so desired, its own feedback gain. This architecture is fairly straightforward but, as you can probably guess, it requires several times more storage than a single delay line.
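
To make the idea concrete, here is a minimal Python sketch of a mono multi-tap delay with purely feedforward taps. The function name and the `(delay_samples, gain)` tap format are made up for illustration. For brevity the model indexes a single input history instead of allocating one buffer per tap; the output is the same as that of separate feedforward lines fed by the same input.

```python
from typing import List, Sequence, Tuple


def multi_tap_delay(samples: Sequence[float],
                    taps: Sequence[Tuple[int, float]]) -> List[float]:
    """Mix the dry signal with one delayed, scaled copy per tap."""
    output = []
    for n, dry in enumerate(samples):
        # Sum every tap whose delayed sample already exists.
        wet = sum(gain * samples[n - delay]
                  for delay, gain in taps
                  if n - delay >= 0)
        output.append(dry + wet)
    return output


# Usage: an impulse with two taps yields exactly two repeats.
impulse = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
result = multi_tap_delay(impulse, taps=[(2, 0.5), (4, 0.25)])
print(result)
```

Unlike the feedback delay, the number of repeats here is fixed by the number of taps, and each repeat has its own independent level.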

That’s it for our exploration of delays. In the next post we will update our LED Meter to support stereo display.

Cheers,

Isaac

