Autonomous driving system is a very complicated system. It consists of multiple modules to process the data over the perception, localization, planning and control stages. The performance, or the module latency, is the key fact which could impact the hardware selection, functional safety, user experience and also the total cost of the solution. Modern CPUs have lots of new features including wide vectors data processing units, high efficiency instruction execution pipeline and fast memory access etc. In this article, we select the open source Apollo (http://apollo.auto) project as the reference project, use the EM motion planning module as the reference module, and run the benchmark and performance profiling on one of Intel Atom Architecture based low cost hardware platform. We introduce the general methodologies for running the profiling inside Apollo framework and the detailed analysis based on the profiling data such as hotspot, CPU pipeline execution efficiency, parallelization etc. Based on the performance profiling data, we are trying to extract the best performance on the selected hardware by using different optimization technologies including the compiler optimization, high performance libraries, vectorization and parallelization to optimize the planning module. With these optimizations applied, we could receive close to 2x performance speedup compared to the initial setup. The optimization work approves that even for the low cost, power efficiency hardware platform, as long as the workload is highly optimized, we can still get the expected performance, which breaks the door for the autonomous driving solution vendor to select the cost saving hardware platforms.
Project Apollo is an open autonomous driving platform which includes open software platform, hardware development platform and also cloud service platform. In this article, we focus on planning module from realtime module pipeline, and use Apollo version 2.0 release to apply our optimizations.
Apollo version 2.0 supports vehicles autonomously driving on the simple urban roads. Vehciles are able to cruise on reads safely, avoid collisions with obstacles, stop at traffic lights and change lanes if needed to reach their destinations. ref [ 1 ]
In Figure 1 below, the HD map module provides a high-definition map that can be accessed by every on-line module. Perception and Localization modules provide the necessary dynamic environment information, which can be further used to predict future environment status in the prediction module. the planning module considers all information to generate a safe and smooth trajectory to feed into the vehicle control module.
the execution time of the motion planning module is important for driving safety. In the case of an emergency, the system including planning module should react within 100ms, compared with a 300 ms reaction time for a normal human driver.
Fig. 1 Apollo modules
In this article, we used EM planner as the planner of the planning module inside Project Apollo. ref 
EM planner is based on expectation maximum type iterative algorithm. This planner targets safety and ride experience with a multilane, path-speed iterative, traffic rule and decision combined design.
Figure 2 presents an overview of EM planner. The reference line generator will produce some candidate lane-level reference lines along with information about traffic regulations and obstacles. the process is based on the HD-map and navigation information from the routing module.
During lane-level motion planning, it first constructs a frenet frame based on a specified reference line. the relation between the ego car and its surrounding environment is evaluated in the Frenet frame constructed by the reference line, as well as traffic regulations. Furthermore, restructured information passes to the lane-level optimizer.
the lane-level optimizer module performs path optimization and speed optimization. During path optimization, information about the surroundings is projected on the Frenet frame. Based on the information projected in the Frenet frame, a smooth path is generated by a dynamic-programming-search-based method. Similarly, during speed optimization, once a smooth path is generated by the path optimizer, obstacles are projected on the station -time graph. Then, the speed optimizer will generate a smooth speed profile by a dynamic-programming-search-based method and a spline quadratic programming solver. Combining path and speed profiles, it will obtain a smooth trajectory for the specified lane.
In the last step, all lane-level best trajectories are sent to the reference line trajectory decider. Based on current car status, regulations and the cost of each trajectory, the trajectory decider will decide a best trajectory for the ego car maneuver.
Fig. 2 overview of EM planner
Hardware and Software Configuration
All the data and results presented in this paper were obtained on Intel® Atom™ CPU based platform, code name Intel Harcuvar Platform. The platform details listed in Table 1.
Table 1. System configuration for performance evaluation
|Specification||Intel Harcuvar Platform|
|CPU||Intel® Atom™ CPU C3955@ 2.10GHz |
|Cores per Socket||Ubuntu 14.04.516|
|OS Version||Ubuntu 14.04.5|
Base Line Performance Measurement
For base line performance measurement, the tests were conducted for the same driving path as below fig. 3. The path is marked as the orange line. The starting point is marked as blue diamond and the destination is marked as red diamond. Along the path, planning module with debug level 4 will receive ~300 ROS message as input data from other modules like prediction, localization and routing. We will get ~300 different processing latencies in sequence along the path.
Fig. 3 the driving path for the performance measurement
In Apollo2.0 planning module, the planning process is triggered by the timer events. The performance of EM planning module is correlated to the latency of the function RunOnce() which calculates the best trajectory according to the received ROS message. In the function Planning::RunOnce(), the latency is calculated by the timestamp difference before and after the processing for each received ROS message, and Fig. 4 illustrates the executed functions within every run .
Fig. 4 function tree graph view of function Planning::RunOnce()
Along the path, planning module receives ~300 ROS messages and we record the planning latency for these~ 300 processing. In order to measure the performance more accurately, we calculate 4 statistics based on the recorded latency data in Table 2.
We collect and calculate the base line performance for planning module as shown in Fig. 5, and listed latency number of different runs with different received ROS message in Table 3. Please note that the latency number is collected based on debug level 4 which including the process time of the debugging. It is not the latency in the production environment. The latency in the lower debug level could be further reduced.
|Mean||The mean of all latency for 300 runs/input|
The maximum among 300 runs/input
|Mean of top 10%||Mean of top 10% of sorted latency data from low to high|
|Mean of bottom 10%||Mean of bottom 10% of sorted latency data from low to high|
Table 3. The baseline statistics performance
|Statistics||Original Latency (ms)|
|Mean of Top 10%||19|
|Mean of bottom 10%||64|
Table 3. The baseline statistics performance
Fig. 5 The baseline performance over ROS input
The hotspot is the first thing we try to identify. We mainly use Intel® VTune™ Amplifier (VTune)  to collect the profiling data inside Apollo framework. We do the detailed analysis based on the hotspot data. For some of the hotspots, we could make the code changes to reduce the cost of the hot functions. For some of the hotspot, we use VTune microarchitecture analysis to identify the CPU execution bottleneck. According to the analysis for the hotspots and CPU microarchitecture metrics, we identify the potential optimization to eliminate the hotspots, make the hot functions to run more effectively. We also identify the potential candidates for parallelization so that to utilize the hardware resources. We develop the scripts to run the profiling data collection in the command line with the follow steps:
- Launch the planning module without the ROS package processing
- Use VTune to attach to the running planning process
- Send the ROS packages so the planning module is triggered for package processing
- Detach the VTune from planning process once the ROS package is finished
- Finalize the profiling result based on the planning module
A. Hotspot Analysis
From the hotspot analysis, we observed the default planning module is running in single thread mode for most of CPU time, as shown in below effective CPU utilization histogram in Fig. 6. Only one CPU is utilized while the whole system contains 16 CPU cores. The parallelization for the hotspot function could be the potential optimization opportunities.
Fig. 6 CPU Utilization Histogram
By grouping the CPU Time based on Process/Module, we can figure out the overall CPU time percentage consumed by each module, see Fig. 7. The module “planning” is the expected hot module which we will focus for the analysis and optimization. The other significant CPU time is spent on the system library libc.so and libm.so.
By expanding the functions from these system modules, we can find the hot functions from system library. In libc.so module, the top 3 functions are related to memory allocation and deallocation. That means the dynamic memory allocation during the EM planning processing causes the relatively big overhead. A better memory management might helpful for the performance. By furtherly expanding those libc functions through call stacks, we can locate the application level memory operations from caller sites and exploit a better dynamic memory management to reduce the overhead. Fig. 8 shows an example after expanding libc_malloc function call stack in VTune, which indicates overhead is actually caused by new operator when adding new element to a C++ STL vector. Using pre-reserved vector size whenever possible provides a better practice in handling such dynamic operations.
From the libm.so module, the math functions such as hypot, exp, atan2, sin/cos etc are intensively used in planning module. These functions are optimized already with SSE2. However if there is further optimized library version, it should help to improve the performance from the system level. Intel® C++ Compiler (icc)  has an optimized version of the math library libimf.so and a vectorized version libsvml.so. Switch to Intel® C++ Compiler to build may have good performance benefits.
The function level hotspot from planning module is the most important thing we are trying to identify. According to the latency calculation from the planning module Planning::RunOnce() function, we did the data filtering in the Intel VTune Amplifier profiling result to figure out the hotspot functions which we need to focus, see Fig. 9.
Fig. 7 CPU Time module breakdown from planning process
Fig. 8 Call stacks from libc module
Fig. 9 CPU time distribution based on source function
Fig. 9 CPU time distribution based on source function
From the top-down tree of the source function stack, we know the tasks “Apollo::planning::SpeedOptimizer” (SpeedOptimizer) and Apollo::planning::PathOptimizer” (PathOptimizer) are the hotspots we need to analyze and optimize. For the other functions, they either take relatively small percentage of CPU time or the function is for debugging only. These functions are not our focus for the optimizations.
We also measured the processing time for each major tasks in planning module, and we concluded that most of processing time are from SpeedOptimizer and PathOptimizer. We noticed that SpeedOptimizer has heavier computation around first 0-50 ROS messages, and PathOptimizer has heavier computation around first 130-230 ROS messages, as indicated in figure 10.
Fig. 10 latency from optimizer over ROS input
B. Hotspots in PathOptimizer
By filter-in the function “Apollo::planning::PathOptimizer” in VTune Amplifier, we get the bottom-up hot function list from this optimizer task, as shown in Fig. 11.
Fig. 11 Hot functions from PathOptimizer
From the Process/Module/Function/Call Stack grouping list, we can easily identify the hot functions which consumes big percentage of the CPU time. The function “apollo::common::match::CrossProd” causes the performance bottleneck here. This function itself consumes 4.9% of the CPU time in PathOptimizer task. It calls two functions “Apollo::common::match::Vec2d::CrossProd” and “Apollo::common::math::Vec2d::operator-“ which consumes 6.8% and 20.6% CPU time respectively.
Look at the correlated source code and assembly code for function “apollo::common::match::CrossProd”in Fig. 12, we can identify several potential optimization opportunities to reduce the overhead of this function thus to reduce the latency and improve the performance.
Base on correlated source and assembly code for function CrossProd in file module/common/math/math_utils.cc, we can conclude:
Fig. 12 correlated source and assembly code for function CrossProd
- This function itself is not inlined. If this function can be inlined by the compiler, the stack push/pop instructions could be eliminated.
- The operator “-“ is generated as a function call to “Apollo::common::math::Vec2d::operator-“ which is defined in another file module/common/math/vec2d.cc. Usually such operators should be inlined if it is called frequently. However since this operator is defined in another .cc file, by default the compiler does not have chance to inline such functions. The cross file optimizations such as link time optimization from GNU gcc compiler , the interprocedural optimization from Intel C++ compiler  could be applied for the optimizations.
- The function Apollo::common::match::Vec2d::CrossProd is triggered as a function call and also not inlined.
In these hot functions, the function “Apollo::common::math::Vec2d::operator-“ consumes 20.6% of total CPU time. So it is worth to spend some time to investigate further why this function takes so much CPU time. From VTune Amplifier sampled data, by checking the cycle per instruction (CPI) of the function “Apollo::common::math::Vec2d::operator-“, it’s 1.05 cycles per instruction, see Table 4. Based on the understanding of what this function does, this CPI seems higher than expected which means there may have CPU execution efficiency issues.
Table 4, 16 function CPI from sample profiling
We use the “Microarchitecture Exploration” analysis from VTune Amplifier to run the profiling again to check the CPU microarchitecture execution issues. As shown in below Fig. 13, the function “Apollo::common::math::Vec2d::operator-“ has the big performance penalty because of the “Loads Blocked by Store Forwarding”. .
“Loads Blocked by Store Forwarding” issue usually caused by a write instruction to memory follows by a different size of read instruction. On Intel architecture, to streamline memory operations in the pipeline, a load can avoid waiting for memory if a prior store, still in flight, is writing the data that the load wants to read (a 'store forwarding' process). However, in some cases, generally when the prior store is writing a smaller region than the load is reading, the load is blocked for a significant time pending the store forward. This metric measures the performance penalty of such blocked loads.
In this case, if we look at the correlated source code and assembly code, as shown in below Fig. 14, the write instruction at address 0x782cbe is writing 128 bit size while the read instruction at address 0x782cc3 is reading for 64 bit size. This causes the “Loads Blocked by Store Forwarding” issue.
Fig. 14 Source/assembly correlation for function “Vec2d::operator-“
The further analysis shows that for this Vec2d::operator- function, the GNU gcc compiler generated the vectorized code. The data is moved from stack memory to vector registers, do the vectorized sub (subpd), and then store the result into the stack memory. The result is loaded into the general register later and then moved to the vector registers.
Memory->Vector register->Vectorized sub -> Memory -> General register -> Vector register
Here it seems the compiler didn’t generate the smart instructions for better performance. A simple 4 instructions would be good enough to run the none-vectorized sub operation, see Fig. 15.
Fig. 15 optimized instructions for function “Vec2d::operator-“
To solve this issue, potentially there are several solutions:
- Change the code to prevent the compiler from vectorizing this function.
- Inline this function so the compiler properly can do better job to generate the better instructions.
- Switch to Intel® C++ compiler for better optimization.
C. Hotspots in SpeedOptimizer
For the SpeedOptimizer task, here is the list of the functions from the planning module in Fig. 16. There are around other 50% of CPU time which consumed by the system library libm.so and libc.so. Here we would only focus on the functions from planning module.
Fig. 16 hot functions from SpeedOptimizer
By checking the CPU Top down microarchitecture analysis metrics, a significant portion of the cycle cost is from BackEnd-Bound > Divider. Fig. 17 below shows the cycles fraction for each hot function where the CPU Divider unit is active.
Fig.17 CPU cycles fraction because of Divider unit
To reduce the unnecessary divider usage is the basic idea for the optimization. Use the multiplication instead of the division if possible can help to reduce the instruction latency. We check the source code and find several opportunities to reduce the divider usage. Take the function “DpStCost::GetSpeedCost” as an example in Fig. 18, it is define in file dp_st_cost.cc. The “speed” is calculated by dividing “unit_t” where “unit_t” is initialized in the constructor by “config_.total_time()/config_.matrix_dimension_t()”.
Fig. 18 divider in modules/planning/tasks/dp_st_speed/dp_st_cost.cc
We could eliminate this divide operation with the multiplication. That means, for speed calculation,
“Speed = A/B where B=C/D”, replace it with “Speed = A*_B where _B=D/C”.
For example, for the function “DpStCost::GetSpeedCost” in file modules/planning/tasks/dp_st_speed/dp_st_cost.cc in Fig. 19, we can make the changes as shown below figure. Updated code shows on right side.
Fig. 19 replace divide with multiplication
By replacing the div instruction with other faster instruction, it could remove the usage of the CPU divider unit, reduce the function latency and thus to improve the overall processing performance.
D. Multithreading Parallelization
As indicated in Fig. 6, the default planning module is running in a single thread mode. The hardware contains 16 CPU cores which are not fully utilized. In order to improve the performance, multithreading parallelization is a good point to try. We try to parallelize the loops from planning module by using OpenMP*.
In order to identify the loop candidates for parallelization, we are considering several correlated criteria.
- The CPU Time. The loops which spend higher CPU time get higher priority for parallelization.
- The loop trip count. The loops with bigger average loop trip count get more chance for parallelization.
- The loop dependence. The code may need to be changed to eliminate the loop carried dependence before the parallelization. Some loops can’t be parallelized if there are dependence among the loop iterations.
Based on these criteria, we use the VTune Amplifier profiling with the loop count enabled, list the loops ordered by the CPU time and correlated with loop trip count. By investigating the source code to identify the loop carried dependence, we identify several loop candidates as highlighted below in Fig. 20 for the parallelization.
Fig. 20 Loop parallelization candidates
After applying multithreading on SpeedOptimizer, the long processing among first 0-50 ROS messages dropped significantly as below diagram in Fig.21.
Fig. 21 latency optimized in SpeedOptimizer
We also applied multithreading on PathOptimizer, the long processing among first 130-230 ROS messages dropped significantly as below diagram[TL1] Fig. 22.
Figure 22 latency optimized in PathOptimizer
When the multithreading is applied, we evaluated the performance with different CPU core number to measure the core number sensitivity for planning module. As shown from below Fig. 23, the multithreading improve performance ~1.4X with 8 OpenMP* threads, but there almost no CPU core scaling gain after 8 OpenMP* threads.
Fig. 23 CPU core number sensitivity
Performance Measurement with Optimizations
Based on the previous analysis, we applied several optimizations to reduce the planning module process latency.
- Enable the Intel C++ Compiler build. We use Intel C++ Compiler 18.0 to build the Apollo EM planning module with optimization level O3. The option –xSSE4.2_ATOM option is enabled to optimize for Intel Atom processors.
- Apply the optimizations as analyzed in previous sessions which include the Intel C++ Compiler inter-procedural optimization (-ipo) , manual code changes for divider and several others code changes which help to optimize for single thread performance.
- Enable OpenMP* parallelization. We use OpenMP* to enable the parallelization for the loops we identified with 8 CPU cores.
With all these optimization applied, the overall average latency from planning module reduced to 20.6ms, 1.84x speedup compared to the original implementation, see Fig. 24.
Fig. 24 performance speedup from optimizations
Table 5 and Fig. 25 shows the performance improvements based on the different statistics metrics.
|Statistics||Original Latency (ms)||Optimized Latency (ms)|
|Mean of Top 10%||19||16|
|Mean of bottom 10%||64||30|
Table 5. performance statistics metrics with optimized
Fig. 25 improvement ratio for each statistics metrics
The latency for each ROS message are reduced significantly as shown in below Fig. 26.
Fig. 26 latency over ROS messages
Based on the detailed analysis for Apollo EM planning module, we identified the performance bottleneck and the potential optimization opportunities. With these optimizations implemented, we get 1.84x speedup compared to the original implementation. We reduced the processing latency significantly for the Apollo EM planning module on Intel Atom® processor based platform. This concludes that the Intel Atom® has enough process capability for automated driving planning module.
Use the right software tools for analysis and optimization can help a lot. We use Intel® VTune Amplifier intensively during this optimization work to find the performance bottleneck, execution efficiency and pinpoint the optimization opportunities. We also use Intel® C++ compiler to rebuild the application to get the out of box performance. All these tools we used are from Intel® System Studio  product suite.
-  Project Apollo Github : https://github.com/ApolloAuto/apollo
-  Haoyang Fan, Fan Zhu, Changchun Liu, Liangliang Zhang, Li Zhuang, Dong Li, Weicheng Zhu, Jiangtao Hu, Hongye Li, Qi Kong, “Baidu Apollo EM Motion Planner” - https://arxiv.org/abs/1807.08048
-  Intel® Software Optimization Reference Manual https://software.intel.com/en-us/articles/intel-sdm, Chapter 16.2.3 for Atom Architecture
-  Intel® C++ Compiler optimization and reference guide, Interprocedural Optimization (IPO) https://software.intel.com/en-us/cpp-compiler-developer-guide-and-reference-interprocedural-optimization-ipo
-  GCC Compiler link time optimization https://gcc.gnu.org/onlinedocs/gccint/LTO-Overview.html
-  Intel® VTune™ Amplifier user guide https://software.intel.com/en-us/vtune-amplifier-help
-  Intel® System Studio https://software.intel.com/en-us/system-studio
-  Intel® Atom® C3000 Processor https://www.intel.com/content/www/us/en/processors/atom/atom-c3000-family-brief.html
Product and Performance Information
Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.