Intel® Hyperflex™ Architecture High-Performance Design Handbook

ID 683353
Date 12/08/2023
Public

A newer version of this document is available. Customers should click here to go to the newest version.

Document Table of Contents

6.1. Round Robin Scheduler

The round robin scheduler is a basic functional block. The following example uses a modulus operator to determine the next client for service. The modulus operator is relatively slow and area inefficient because the modulus operator performs division.

Source Code for Round Robin Scheduler

module round_robin_modulo (last, requests, next);

parameter CLIENTS       = 7;
parameter LOG2_CLIENTS  = 3;

// previous client to be serviced
input wire [LOG2_CLIENTS -1:0]  last;
// Client requests:- 
input wire [CLIENTS -1:0] requests;
// Next client to be serviced: -
output reg [LOG2_CLIENTS -1:0] next;


//Schedule the next client in a round robin fashion, based on the previous
always @*
begin
   integer J, K;

   begin : find_next
   next = last; // Default to staying with the previous
   for (J = 1; J < CLIENTS; J=J+1)
      begin
      K = (last + J) % CLIENTS;
      if (requests[K] == 1'b1)
         begin
         	 next = K[0 +: LOG2_CLIENTS];
         	 disable find_next;
       	 end
		end // of the for-loop
    end   // of 'find_next'
end 

 endmodule
Figure 125. Fast Forward Compile Report for Round Robin Scheduler

The Retiming Summary report identifies insufficient registers limiting Hyper-Retiming on the critical chain. The chain starts from the register that connects to the last input, through the modulus operator implemented using a divider, and continuing to the register that connects to the next output.

Figure 126. Critical Chain for Base Performance for Round Robin Scheduler

The 44 elements in the critical chain above correspond to the circuit diagram below that has 10 levels of logic. The modulus operator contributes significantly to the low performance. Seven of the 10 levels of logic are part of the implementation for the modulus operator.

Figure 127. Schematic for Critical Chain

As Figure 125 shows, Fast Forward compilation estimates a 140% performance improvement from adding two pipeline stages at the module inputs, for retiming through the logic cloud. At this point, the critical chain is a short path/long path and the chain involves the modulus operator.

Figure 128. Critical Chain Fast Forward Compile for Round Robin Scheduler

The divider in the modulus operation is the bottleneck that requires RTL modification. Paths through the divider exist in the critical chain for all steps in the Fast Forward compile. Consider alternate implementations to calculate the next client to service, and avoid the modulus operator. If you switch to an implementation that specifies the number of clients as a power of two, determining the next client to service does not require a modulus operator. When you instantiate the module with fewer than 2n clients, tie the unused request inputs to logic 0.

Source Code for Round Robin Scheduler with Performance Improvement with 2n Client Inputs

module round_robin_modulo (last, requests, next);

parameter LOG2_CLIENTS  = 3;
parameter CLIENTS       = 2**LOG2_CLIENTS;

// previous client to be serviced
input wire [LOG2_CLIENTS -1:0]  last;
// Client requests:- 
input wire [CLIENTS -1:0] requests;
// Next client to be serviced: -
output reg [LOG2_CLIENTS -1:0] next;


//Schedule the next client in a round robin fashion, based on the previous
always @(next or last or requests)
begin
   integer J, K;

   begin : find_next
   next = last; // Default to staying with the previous
   for (J = 1; J < CLIENTS; J=J+1)
      begin
      K = last + J;
      if (requests[K[0 +: LOG2_CLIENTS]] == 1'b1)
         begin
         	 next = K[0 +: LOG2_CLIENTS];
         	 disable find_next;
       	 end
	end // of the for-loop
  end   // of 'find_next'
end

 endmodule
Figure 129. Fast Forward Summary Report for Round Robin Scheduler with Performance Improvement with 2n Client Inputs

Even without any Fast Forward optimizations (the Base Performance step), this round robin implementation runs at double the frequency compared with the version without the performance improvement in Source Code for Round Robin Scheduler. Although critical chains in both versions contain only two registers, the critical chain for the 2n version contains only 26 elements, compared to 44 elements in the modulus version.

Figure 130. Critical Chain for Round Robin Scheduler with Performance Improvement

The 26 elements in the critical chain correspond to the following circuit diagram with only four levels of logic.

Figure 131. Schematic for Critical Chain with Performance Improvement

By adding three register stages at the input, for retiming through the logic cloud, Fast Forward Compile takes the circuit performance to 1 GHz. Similar to the modulus version, the final critical chain after Fast Forward optimization has a limiting reason of short path/long path, as Figure 132 shows, but the performance is 1.6 times the performance of the modulus version

Figure 132. Critical Chain for Round Robin Scheduler with Best Performance

Removing the modulus operator, and switching to a power-of-two implementation, are small design changes that provide a dramatic performance increase.

  • Use natural powers of two for math operations whenever possible
  • Explore alternative implementations for seemingly basic functions.

In this example, changing the implementation of the round robin logic provides more performance increase than adding pipeline stages.