Flyslice Technologies Accelerates Low-Latency Trading in Fintech

Overview

The FA728Q, a high-performance PCIe*-based FPGA acceleration card equipped with the Intel® Stratix® 10 FPGA, is shipping today.
Flyslice uses the Open FPGA Stack (OFS) base infrastructure to expedite the development of their custom FPGA Interface Manager (FIM), which incorporates an integrated TCP/IP offload engine.
In speedup mode, the TCP transmit latency of the FA728Q accelerator card is below 100 ns, which suits time-critical network applications such as low-latency trading (LLT).

Chen Ailian

System Architect, Hangzhou Flyslice Technologies, Ltd

Tamara Lin

Product Marketing Specialist, Intel Programmable Solutions Group

Executive Summary

Low-latency trading (LLT) and other time-sensitive applications are ideal use cases for FPGA acceleration. To address this market, Flyslice Technologies developed the FA728Q FPGA-based acceleration card. The FA728Q acceleration platform provides powerful FPGA resources, abundant storage capacity, and easy-to-use interfaces for end users. To expedite, simplify, and standardize the development of their acceleration board, Flyslice Technologies used the Open FPGA Stack (OFS) infrastructure, which provides a powerful methodology for the rapid development of FPGA solutions using a ‘take and tailor’ approach. Using the OFS infrastructure, Flyslice Technologies integrated its TCP/IP offload engine into the open-source base FPGA Interface Manager (FIM), commonly called an FPGA ‘shell’.

Background and Challenge

LLT is the modern practice of electronically executing trades of financial securities with minimal time delay between order entry and order execution. Large investment banks, hedge funds, and other financial institutions commonly use this method. In the past, trades were executed manually, and execution took anywhere from seconds to minutes. With advances in hardware and the corresponding software, systems could be programmed to make buy or sell decisions automatically based on market signals and movements, reducing trade execution times to milliseconds. With the broader availability of FPGA-based acceleration products in recent years, transaction times have fallen further, to microseconds or less.

At the same time, LLT systems increasingly rely on complex trading algorithm models that are unique to each trading firm’s strategy for order book interaction. To meet the trading firms’ power and performance requirements, solutions combine general-purpose processors with special-purpose co-processors, as in heterogeneous computing. FPGAs are ideal for implementing tailored trading algorithms; however, programming these hardware acceleration devices can be time-consuming, and designs can be difficult to migrate as FPGA families improve and evolve.

Flyslice Technologies, a company headquartered in China, is actively addressing the demand for data center heterogeneous acceleration and high-performance computing, including the LLT segment. They bring FPGA-based hardware accelerator platforms, FPGA acceleration intellectual property (IP) functions, and FPGA-based platform design services to market.

Solution

To meet the low-latency, standardization, and portability requirements of LLT applications, Flyslice Technologies developed their FA728Q acceleration card, which instantiates an integrated TCP/IP offload engine. To do this, Flyslice Technologies modified the base FIM provided in the open-source release of OFS. Because of its composable architecture and ‘take and tailor’ approach, OFS enabled them to port their algorithm to the FA728Q acceleration card with only minimal modifications while leveraging the rest of the provided infrastructure, including the OFS software drivers and libraries.

OFS is an open-source hardware and software infrastructure that provides all the key design, software, and infrastructure components needed to jump-start custom FPGA-based board or workload development. The OFS infrastructure consists of the FIM, commonly called a ‘shell,’ and an Accelerator Functional Unit (AFU) region, a designated region for workload development. Using OFS, FPGA board – or FIM – developers can leverage the open-source infrastructure – or base FIM – to quickly develop a tailored, customized FIM for their board based on the target application or industry. OFS also ships with a oneAPI Accelerator Support Package (ASP), which can be leveraged to abstract the FPGA hardware and design flow. OFS saves developers time, increases portability across FPGA generations, uses industry-standard interfaces, and provides an optional high-level design flow using oneAPI.

The FA728Q acceleration card, available today, is a high-end PCIe-based FPGA acceleration board with 32 GB of onboard DDR4 memory and three QSFP28 sockets, each supporting up to 100 GbE. The FA728Q acceleration card is also enabled with oneAPI through the OFS infrastructure, so customers can implement their kernels in RTL or migrate algorithms from CPU/GPU to high-level design languages, including C/C++. The Intel oneAPI Base Toolkit also helps synthesize and optimize the kernels to FPGA resources, further improving time to market.

Flyslice Technologies has also begun development on Intel Agilex® FPGA-based boards, including the FA927S card using the Intel Agilex 7 FPGA I-Series and the FA925E card using the Intel Agilex 7 FPGA F-Series.

The FA927S card features high transceiver rates of up to 116 Gbps, PCIe 5.0 x16, and Compute Express Link (CXL) support. It targets bandwidth-intensive applications and is available now for RTL-based development. The FA927S card will support OFS in the first quarter of 2024.

The FA925E card, by contrast, offers four banks of 8 GB and four banks of 4 GB DDR4, totaling 48 GB of onboard memory. It is designed for applications with high external memory capacity and bandwidth requirements. The card provides complete support for OFS and will be available by the end of 2023. See Table 1 to compare the three acceleration cards.

Table 1. Comparison Table

Specification	FA728Q	FA927S	FA925E
Power	215 W	200 W	150 W
Cooling Requirement	Active/passive (optional)	Active/passive (optional)	Active/passive (optional)
Form Factor	3/4 length, full-height, dual-slot PCIe	Half-length, full-height, dual-slot PCIe	3/4 length, full-height, dual-slot PCIe
Networking Interfaces	Triple QSFP28 ports: 3 x 100 GbE / 40 GbE	Dual QSFP28 ports: 2 x 100 GbE / 40 GbE	Dual QSFP28 ports: 2 x 100 GbE / 40 GbE
Memory Interfaces	4 x 8 GB DDR4, 2,400 MHz with ECC	4 x 8 GB DDR4, 2,400 MHz with ECC	4 x 8 GB and 4 x 4 GB DDR4, 2,400 MHz with ECC
PCIe Interfaces	-	5.0 x16	-
Extension Interfaces	-	2 x8 slim SAS connectors for PCIe 4.0 extension	-
Management Port	Micro-USB	Micro-USB	Micro-USB
FPGA Device	1SX280HN2F43E2VG	AGIB027R29A1E2VR3	AGFB027R25A2E2V
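
As a quick cross-check of the capacity figures quoted in the text, the memory and networking rows of Table 1 can be totaled per card. The sketch below is illustrative only: the dictionary layout and function names are assumptions made for this example, and the aggregate network figure assumes every port runs at its 100 GbE maximum at the same time.

```python
# Memory and network configuration per card, copied from Table 1.
# Each memory entry is (bank count, GB per bank); "ports" is
# (port count, maximum GbE per port). The layout is hypothetical.
CARDS = {
    "FA728Q": {"memory": [(4, 8)], "ports": (3, 100)},
    "FA927S": {"memory": [(4, 8)], "ports": (2, 100)},
    "FA925E": {"memory": [(4, 8), (4, 4)], "ports": (2, 100)},
}


def total_memory_gb(card):
    """Sum the onboard DDR4 capacity of a card in GB."""
    return sum(banks * gb for banks, gb in CARDS[card]["memory"])


def aggregate_network_gbe(card):
    """Sum the maximum line rate of all QSFP28 ports in GbE."""
    count, rate = CARDS[card]["ports"]
    return count * rate


if __name__ == "__main__":
    for name in CARDS:
        print(name, total_memory_gb(name), "GB,",
              aggregate_network_gbe(name), "GbE aggregate")
```

The totals agree with the text: 32 GB for the FA728Q and 48 GB for the FA925E.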

Results

The offload engine IP function implemented by Flyslice Technologies on the FA728Q card is optimized for latency and performance to meet LLT requirements. In speedup mode, the TCP transmit latency is less than 100 ns, ensuring stable and low-latency connections for time-critical network applications. Table 2 lists the connection limits and measured latencies of the offload engine. Table 3 lists the bandwidth measured on the PCIe 3.0 x16 and DDR interfaces.

Specification	Value
Maximum TCP/UDP connections	63 for TCP, 63 for UDP
TCP TX latency (speedup mode)	15 clocks
TCP TX latency (non-speedup mode)	46 clocks
TCP RX latency	32 clocks
UDP TX latency	42 clocks for a 512-byte packet; 18 clocks for a 128-byte packet
UDP RX latency	23 clocks
Loopback latency for oneAPI kernels	18 clocks

Table 2. TCP/IP Offload Engine (TOE) specification

Notes:

1. One clock period is 6.4 ns.

2. TX latency is counted from the falling edge of packet EOP to valid data in XGMII TXC.

3. RX latency is counted from packet SOP to valid data in XGMII RXC.
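
Table 2 reports latency in clock cycles, so the nanosecond figures follow from the 6.4 ns clock period in Note 1. The short sketch below performs that conversion for each row; the dictionary keys and function name are chosen for this example and are not part of any Flyslice or OFS API.

```python
# Clock period from Note 1 of Table 2.
CLOCK_PERIOD_NS = 6.4

# Latencies in clock cycles, copied from Table 2.
LATENCY_CLOCKS = {
    "TCP TX (speedup mode)": 15,
    "TCP TX (non-speedup mode)": 46,
    "TCP RX": 32,
    "UDP TX (512-byte packet)": 42,
    "UDP TX (128-byte packet)": 18,
    "UDP RX": 23,
    "oneAPI kernel loopback": 18,
}


def clocks_to_ns(clocks, period_ns=CLOCK_PERIOD_NS):
    """Convert a latency in clock cycles to nanoseconds."""
    return clocks * period_ns


# Latency of every Table 2 entry in nanoseconds.
LATENCY_NS = {name: clocks_to_ns(c) for name, c in LATENCY_CLOCKS.items()}

if __name__ == "__main__":
    for name, ns in LATENCY_NS.items():
        print(f"{name:28s} {ns:6.1f} ns")
```

The speedup-mode TCP transmit path works out to 15 x 6.4 ns = 96 ns, which is the basis for the "less than 100 ns" figure; the non-speedup path takes 46 x 6.4 ns = 294.4 ns.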

Data Path	Bandwidths
Host write memory	8,287.68 MBps for 8,192-KB block
Host read memory	8,241.19 MBps for 8,192-KB block
Kernel write memory	16,909.6 MBps for 4,096-MB block
Kernel read memory	17,340.3 MBps for 4,096-MB block

Table 3. Bandwidth provided by each interface
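
To put the Table 3 figures in perspective, the time to move one test block at the measured bandwidth can be estimated as block size divided by bandwidth. This is a rough sketch under stated assumptions: the source does not define its units, so the code assumes 1 MBps = 10^6 bytes per second, 1 KB = 1,024 bytes, and 1 MB = 1,024^2 bytes, and the names used are illustrative.

```python
# Unit assumptions (not stated in the source): binary block sizes,
# decimal bandwidth.
KB = 1024
MB = 1024 ** 2
BYTES_PER_SECOND_PER_MBPS = 1e6

# (measured bandwidth in MBps, block size in bytes), copied from Table 3.
DATA_PATHS = {
    "Host write memory": (8287.68, 8192 * KB),
    "Host read memory": (8241.19, 8192 * KB),
    "Kernel write memory": (16909.6, 4096 * MB),
    "Kernel read memory": (17340.3, 4096 * MB),
}


def transfer_time_s(bandwidth_mbps, block_bytes):
    """Estimate the seconds needed to move one block at a given bandwidth."""
    return block_bytes / (bandwidth_mbps * BYTES_PER_SECOND_PER_MBPS)


# Estimated transfer time in seconds for each data path.
TRANSFER_TIME_S = {
    name: transfer_time_s(bw, size) for name, (bw, size) in DATA_PATHS.items()
}

if __name__ == "__main__":
    for name, seconds in TRANSFER_TIME_S.items():
        print(f"{name:20s} {seconds * 1e3:10.3f} ms")
```

Under these assumptions, an 8,192-KB host transfer takes roughly 1 ms and a 4,096-MB kernel transfer takes roughly a quarter of a second. A different unit convention shifts the results by a few percent.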

Figure 2. High-bandwidth data paths in the OFS platform

OFS helped us build the acceleration platform our customers require more easily and quickly, as a complete package from software APIs and drivers down to the underlying hardware.

Cheng Ailian, Flyslice Technologies, Ltd.

How to Get Started with FPGA Acceleration Using OFS

FPGA developers can use the FA728Q accelerator card, an OFS-enabled board, together with the open-source documentation and source code to start building their custom workload.

The following table outlines how a developer can start FPGA-based workload development using the Flyslice Technologies acceleration board.

Leverage FPGA Acceleration for Your Workload
Step 1: Choose a board	View Flyslice Technologies' OFS-enabled board, the FA728Q accelerator card
Step 2: Evaluate OFS open-source resources	Flyslice Technologies will provide the corresponding version of the OFS technical documentation.
Step 3: Access open-source hardware and software code	Flyslice Technologies will provide the corresponding OFS software and hardware code. This is their specific distribution of the OFS base code provided by Intel.
Step 4: Develop workload using RTL or C/C++ (using oneAPI)	Follow the OFS RTL flow, or use the oneAPI development flow: OFS enables the compilation of oneAPI kernels, so FPGA workloads can be built in C/C++.
