Developer Guide

  • 2021.1
  • 11/03/2021
  • Public

PCIe-from-Memory Sample Demo

This demo is intended to show you how to use the data streams optimizer, as well as provide an example of the performance benefits that the tool offers.
You will use the data streams optimizer to improve the latency of a sample workload by tuning the PCIe-from-memory stream (PCIe reads).
After trying this demo, you will have a better sense of how the tool works. You can then start to use the tool with your real-time workload instead of the sample workload and assess how the tool can help you achieve your requirements.

Scenario

As context for this demo, consider the following scenario:
Imagine a high-speed motion control application that begins and ends its control loop with I/O from a PCIe-attached industrial Ethernet controller. The motion controller must operate at 8 kHz, which corresponds to a 125 microsecond (µs) cycle time.
The sensor and actuator latency overhead, including the industrial Ethernet connection to/from the integrated controller, is budgeted at about 5 µs each way, leaving a target of 115 µs maximum latency from sensor packet coming in (receive, Rx) to actuator packet going out (transmit, Tx). In the diagram above, the overall packet latency is t₃’ - t₀’ = 115 µs.
The target requirements can be decomposed even further if the core must read the network data in 50 µs (compute latency). This means there is 65 µs allowed for the Rx and Tx latency, including static software overhead.
You decide to use the data streams optimizer to try to meet the Rx and Tx latency requirement of 65 µs.
The RTCP Sample Workload is a proxy for such a scenario.
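The budget arithmetic in this scenario can be sketched in a few lines of Python. The figures come from the scenario above; the function and constant names are illustrative only and are not part of the data streams optimizer.

```python
# Latency budget for the scenario. All figures are taken from the text above;
# the names below are hypothetical and exist only for this illustration.
CONTROL_LOOP_HZ = 8_000       # 8 kHz control loop
NETWORK_OVERHEAD_US = 5       # sensor/actuator overhead, each way
COMPUTE_LATENCY_US = 50       # compute latency on the core


def cycle_time_us(frequency_hz):
    """Convert a loop frequency in Hz to a cycle time in microseconds."""
    return 1_000_000 / frequency_hz


def packet_latency_us(frequency_hz, overhead_each_way_us):
    """Budget from sensor packet in (Rx) to actuator packet out (Tx)."""
    return cycle_time_us(frequency_hz) - 2 * overhead_each_way_us


def rx_tx_budget_us(frequency_hz, overhead_each_way_us, compute_us):
    """Budget left for Rx plus Tx, including static software overhead."""
    return packet_latency_us(frequency_hz, overhead_each_way_us) - compute_us


budget = rx_tx_budget_us(CONTROL_LOOP_HZ, NETWORK_OVERHEAD_US, COMPUTE_LATENCY_US)
print(f"Cycle time:     {cycle_time_us(CONTROL_LOOP_HZ):.0f} us")
print(f"Packet latency: {packet_latency_us(CONTROL_LOOP_HZ, NETWORK_OVERHEAD_US):.0f} us")
print(f"Rx + Tx budget: {budget:.0f} us")
```

Changing any of the three constants shows how quickly the Rx and Tx budget shrinks when the loop rate or the compute time increases.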

About Real-Time Compute Performance (RTCP)

RTCP is a sample application that simulates an industrial control loop with input (packet reception, Rx), compute, and output (packet transmission, Tx) segments. The receive and transmit portions work off a user space poll mode driver. The compute portion is a random pointer-chase to mimic a worst-case memory access pattern.
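To illustrate what a random pointer-chase looks like, the following Python sketch builds a random ring of indices and then follows it. It is not the RTCP implementation, and the function names are hypothetical. Python will not reproduce realistic memory latencies; the sketch only shows the access pattern, in which every load depends on the result of the previous one and the next address is unpredictable.

```python
import random


def build_chase_ring(n, seed=0):
    """Return a random permutation that forms one single cycle of length n.

    ring[i] is the index to visit after i. Sattolo's algorithm guarantees a
    single cycle, so a chase touches every element before it repeats.
    """
    rng = random.Random(seed)
    ring = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)          # 0 <= j < i, never i itself
        ring[i], ring[j] = ring[j], ring[i]
    return ring


def chase(ring, steps, start=0):
    """Follow the ring for `steps` hops and return the final index."""
    index = start
    for _ in range(steps):
        index = ring[index]           # next address depends on this load
    return index


ring = build_chase_ring(1024)
print(chase(ring, steps=1024))        # a full lap returns to the start index
```

Because the ring is a single cycle, a buffer larger than the cache forces most hops to miss, which is why this pattern is used as a worst-case proxy.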
RTCP has two modes because the data streams optimizer optimizes only the receive and transmit portions. The compute portion can be optimized by other Intel® TCC Tools, such as the cache allocation library.
  • The first mode runs the full RTCP workload, including receive, compute, and transmit. Use this mode when combining the data streams optimizer with the cache allocation library, or when experimenting with a real-time proxy workload.
  • The second mode runs only the receive and transmit portions of RTCP. This mode is referred to as the “empty buffer,” because there is no random pointer-chase compute.
For both RTCP modes, the workload can be broken into data streams with individual latency requirements. This demonstration measures the RTCP empty buffer mode.
Streams exercised in the full RTCP workload are:
Packet Rx:
  • PCIe to memory (PCIe writes)
Compute:
  • Core from memory (Core reads)
  • Core to memory (Core writes)
Packet Tx:
  • PCIe from memory (PCIe reads)
  • Core to PCIe (MMIO writes)
Streams exercised in the RTCP empty buffer mode are:
Packet Rx:
  • PCIe to memory (PCIe writes)
Packet Tx:
  • PCIe from memory (PCIe reads) – this demo
  • Core to PCIe (MMIO writes)
For more details about deconstructing RTCP into streams and calculating RTCP stream latency requirements, see Deconstructing RTCP into Streams.
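The two stream lists above can also be written as a small lookup table, which makes the difference between the modes explicit. This is only an illustrative data structure; the dictionary and function names are assumptions and do not correspond to anything in the tool.

```python
# Streams per RTCP mode, copied from the lists above.
RTCP_STREAMS = {
    "full": {
        "Packet Rx": ["PCIe to memory (PCIe writes)"],
        "Compute": [
            "Core from memory (Core reads)",
            "Core to memory (Core writes)",
        ],
        "Packet Tx": [
            "PCIe from memory (PCIe reads)",
            "Core to PCIe (MMIO writes)",
        ],
    },
    "empty_buffer": {
        "Packet Rx": ["PCIe to memory (PCIe writes)"],
        "Packet Tx": [
            "PCIe from memory (PCIe reads)",
            "Core to PCIe (MMIO writes)",
        ],
    },
}


def streams_for(mode):
    """Flatten the segment-to-streams mapping of one mode into a list."""
    return [s for segment in RTCP_STREAMS[mode].values() for s in segment]


def streams_only_in_full():
    """Streams that disappear when the compute segment is disabled."""
    empty = set(streams_for("empty_buffer"))
    return [s for s in streams_for("full") if s not in empty]


print(streams_only_in_full())
```

The only streams that drop out in empty buffer mode are the two compute streams, which the data streams optimizer does not tune.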
RTCP requires two physical systems:
  • Target system: Runs RTCP. The data streams optimizer will tune this system.
  • Packet generator: Sends packets to the target system.

Target User for RTCP

For real-time workloads consisting of end-to-end control loops, like RTCP, a system integrator may be tasked with tuning a system to meet use-case-driven cycle times. The system integrator has a global view of the underlying hardware and software, and should understand the functionality of, and interactions between, the system components in the critical path from data reception, through the compute workload, to data transmission. This individual should have the knowledge (or have gathered the data from other roles) and the ability to:
  • Measure the RTCP workload
  • Identify the system components and their interactions:
    • Intel® Ethernet Controller I210
    • DPDK Driver
    • Packet receive flow
    • Compute application
    • Packet transmission flow
    • Safety margins
    • Network overhead
  • Deconstruct RTCP into individual streams
  • Measure the individual stream latencies
  • Calculate the individual stream latency targets

Overview of the Sample RTCP Requirements File

For the demo, you will use a sample requirements file specific to the RTCP workload, with predefined values. The sample RTCP requirements file is shown below:
{
  "workload": {
    "command": "python3 /usr/share/tcc_tools/tools/demo/workloads/bin/rtcp_validation_script.py",
    "arguments": [
      "--latency_us 65",
      "--ssh root@192.168.0.2",
      "--pci_rtcp 0000:01:00.0",
      "--pci_pglm 0000:01:00.0",
      "--no-compute",
      "--cpuid 3"
    ]
  },
  "requirements": [
    {
      "producer": "Memory",
      "consumer": "01:00.0",
      "traffic_class": 0,
      "latency": 10,
      "bytes_per_transfer": 64,
      "relative_priority": 0
    }
  ]
}
In the requirements file, the “workload” fields specify the sample workload validation script command and arguments. After generating a tuning configuration, the tool will run the workload validation script.
  • “command” shows the sample script that will be used to validate RTCP latency during the tool flow. The script must exit with code 0 if the requirements are met and 1 if they are not. Any other exit code means the validation failed to complete, and the tuning process will stop. The command can be any type of program, as long as it follows this exit code convention.
  • In the --latency_us argument, 65 is the latency requirement in microseconds, as defined earlier in this scenario.
  • In the --ssh argument, root@192.168.0.2 is the SSH user and address of the packet generator.
  • In the --pci_rtcp argument, 0000:01:00.0 is the address of the PCIe device on the target system.
  • In the --pci_pglm argument, 0000:01:00.0 is the address of the PCIe device on the packet generator.
  • The --no-compute argument indicates that the compute part of RTCP will be disabled. Only data transmission will be measured.
  • In the --cpuid argument, 3 indicates that Core 3 is where this sample script will run.
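If you later replace the sample script with your own validation program, the exit code convention described above is the part it must honor. The following Python sketch shows one way to map a measurement to an exit code. It is not the contents of rtcp_validation_script.py: the function names, the two-argument command line, and the choice of 2 for "failed to complete" are assumptions made for this illustration, as is treating a measurement equal to the requirement as a pass.

```python
EXIT_REQUIREMENT_MET = 0       # requirements are met
EXIT_REQUIREMENT_NOT_MET = 1   # requirements are not met
EXIT_VALIDATION_FAILED = 2     # assumption: any value other than 0 or 1


def exit_code_for(measured_us, required_us):
    """Map a measured maximum latency to the exit code the tool expects."""
    if measured_us is None:
        # No usable measurement: the validation did not complete.
        return EXIT_VALIDATION_FAILED
    if measured_us <= required_us:
        return EXIT_REQUIREMENT_MET
    return EXIT_REQUIREMENT_NOT_MET


def main(argv):
    """Hypothetical command line: <script> <required_us> <measured_us>."""
    try:
        required_us = float(argv[1])
        measured_us = float(argv[2])
    except (IndexError, ValueError):
        return EXIT_VALIDATION_FAILED
    return exit_code_for(measured_us, required_us)


# A real script would end with: sys.exit(main(sys.argv))
print(main(["validate", "65", "58.2"]))
print(main(["validate", "65", "71.0"]))
print(main(["validate", "65"]))
```

Keeping "requirement not met" (1) distinct from "validation failed" (any other value) matters, because the text above states that the tuning process stops in the second case.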
The “requirements” fields specify the data stream requirements that the tool will use to generate the aforementioned tuning configuration.
  • The data stream is specified in the form of a producer/consumer pair. For the PCIe-from-memory stream, the “producer” value is always “Memory”, and the “consumer” value is the PCIe device in Bus:Device.Function (BDF) notation with additional Traffic Class (TC) specified. This value varies by PCIe device, but this sample demo uses BDF 01:00.0 and TC 0.
  • The “latency” and “bytes_per_transfer” values are described in Sample Demo Requirements Calculation.
  • The “relative_priority” field is not used in this release.
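The sample file writes the same device in two ways: 0000:01:00.0 in the workload arguments and 01:00.0 in the “consumer” field. The following Python sketch parses both forms. It assumes the common Linux convention that the leading four hexadecimal digits are the PCI domain and that a missing domain means 0000; the function name is hypothetical and the tool may apply its own rules.

```python
import re

# [domain:]bus:device.function, all fields hexadecimal.
_BDF_RE = re.compile(
    r"^(?:(?P<domain>[0-9a-fA-F]{4}):)?"
    r"(?P<bus>[0-9a-fA-F]{2}):"
    r"(?P<device>[0-9a-fA-F]{2})\."
    r"(?P<function>[0-7])$"
)


def parse_bdf(text):
    """Split a BDF string into integer domain, bus, device, and function."""
    match = _BDF_RE.match(text)
    if match is None:
        raise ValueError(f"not a valid BDF address: {text!r}")
    device = int(match.group("device"), 16)
    if device > 0x1F:
        # A PCI device number is 5 bits wide.
        raise ValueError(f"device number out of range: {text!r}")
    return {
        "domain": int(match.group("domain") or "0", 16),  # assumed default 0
        "bus": int(match.group("bus"), 16),
        "device": device,
        "function": int(match.group("function"), 16),
    }


print(parse_bdf("01:00.0"))
print(parse_bdf("0000:01:00.0") == parse_bdf("01:00.0"))
```

On the target system, you would typically confirm the actual address of your device with a tool such as lspci before editing the requirements file.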
See Create a Requirements File for general specifications for all requirements files.
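Before you hand a requirements file to the tool, a quick sanity check can catch typos. The Python sketch below loads the sample file and verifies that the fields discussed in this section are present. The list of expected keys is taken only from the sample above and is an assumption; it is not the tool's own validation, and Create a Requirements File remains the authoritative specification.

```python
import json

SAMPLE = """
{
  "workload": {
    "command": "python3 /usr/share/tcc_tools/tools/demo/workloads/bin/rtcp_validation_script.py",
    "arguments": ["--latency_us 65", "--ssh root@192.168.0.2",
                  "--pci_rtcp 0000:01:00.0", "--pci_pglm 0000:01:00.0",
                  "--no-compute", "--cpuid 3"]
  },
  "requirements": [
    {"producer": "Memory", "consumer": "01:00.0", "traffic_class": 0,
     "latency": 10, "bytes_per_transfer": 64, "relative_priority": 0}
  ]
}
"""

# Assumption: keys observed in the sample file, not an official schema.
EXPECTED_STREAM_KEYS = {
    "producer", "consumer", "traffic_class", "latency", "bytes_per_transfer",
}


def check_requirements(doc):
    """Return a list of human-readable problems; empty if none were found."""
    problems = []
    if not doc.get("workload", {}).get("command"):
        problems.append("workload.command is missing or empty")
    streams = doc.get("requirements", [])
    if not streams:
        problems.append("requirements list is missing or empty")
    for i, stream in enumerate(streams):
        missing = sorted(EXPECTED_STREAM_KEYS - stream.keys())
        if missing:
            problems.append(f"requirements[{i}] lacks: {', '.join(missing)}")
    return problems


print(check_requirements(json.loads(SAMPLE)))
```

An empty list means that the file has the same shape as the sample; it does not mean that the values are appropriate for your workload.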

Sample Demo Requirements Calculation

For the sample RTCP requirements file, Intel generated the latency and bytes_per_transfer values using the approach described in Generate Requirements.
For the PCIe-from-memory stream, the bytes per transfer requirement is defined as the burst size of the buffer that is part of the real-time control loop, in other words, the number of bytes read. The stream latency is defined as the maximum acceptable time to access the buffer, that is, to transfer those bytes.
The PCIe-from-memory bytes per transfer is 64 bytes since RTCP operates on 64-byte packets.
The maximum acceptable time to perform PCIe reads for the 64-byte buffer is some portion of the overall cycle time, or overall RTCP latency. For this demo, the maximum RTCP latency is 65 µs and we can budget 10 µs for the maximum PCIe-from-memory read latency. The Intel® Ethernet Controller I210 needs to retrieve the requested data from main memory in less than 10 µs. Therefore, the PCIe-from-memory latency is 10 µs.
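As a rough check on this budgeting, the Python sketch below subtracts the stream budget from the overall RTCP latency. The 65 µs and 10 µs values come from the text above; the function name is hypothetical, and the sketch makes no claim about how the remaining time is divided among the other streams and the static software overhead.

```python
OVERALL_RTCP_LATENCY_US = 65       # maximum Rx + Tx latency for this demo
STREAM_BUDGETS_US = {
    "PCIe from memory (PCIe reads)": 10,   # the stream tuned in this demo
}


def remaining_budget_us(overall_us, stream_budgets_us):
    """Return the time left after the listed stream budgets are allocated."""
    allocated = sum(stream_budgets_us.values())
    if allocated > overall_us:
        raise ValueError(
            f"stream budgets ({allocated} us) exceed the overall "
            f"latency ({overall_us} us)"
        )
    return overall_us - allocated


remaining = remaining_budget_us(OVERALL_RTCP_LATENCY_US, STREAM_BUDGETS_US)
print(f"Left for other streams and software overhead: {remaining} us")
```

With the values from this demo, 55 µs remains for everything other than the PCIe reads.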
For more details on how to deconstruct a complex workload into stream latency requirements, see Data Stream Latency Requirements Examples and the Real-Time Compute Performance (RTCP) Example.

Next Steps

Now that you have read the scenario for this sample demo, walk through the steps of this demo to see the value of tuning.
  1. Step 1: RTCP Setup: Set up the hardware and network for RTCP. This setup task is specific to the workload itself and is not related to the data streams optimizer steps. If you were to use a different workload, the setup steps would likely be different from those of RTCP, but the steps for using the data streams optimizer would be the same regardless of workload.
  2. Step 2: Run RTCP on Untuned System: Run the RTCP workload on the untuned system to get the baseline latency measurement.
  3. Step 3: Preproduction: Generate a Tuning Config: Walk through the data streams optimizer preproduction steps. The tool will tune the system and it is expected that the latency measurement will improve.
  4. Step 4: Production: Apply Tuning Configuration: Walk through the data streams optimizer production steps.

Product and Performance Information

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.