Intel® VTune™ Profiler Case Study with Intel® Distribution for Python for Kernel Density Estimation from scikit-learn*

ID 769194

Updated 12/8/2020

Background

Kernel Density Estimation (KDE) from scikit-learn* can be used to measure the difference between data sets (e.g., 5G/LTE cell behavior over time) and to detect anomalies.
With Normal Python*, KDE shows a high Front-End Bound metric.
Scikit-learn* KDE uses the KD-tree algorithm, which introduces high L3 latency.
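
To make the use case concrete, here is a minimal sketch of KDE-based anomaly detection with scikit-learn's KD-tree backend. The data, the bandwidth, and the 1st-percentile threshold rule are all hypothetical choices for illustration; the case study does not show its data set or its detection rule.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(42)

# Hypothetical reference behavior and new observations (1-D for simplicity).
reference = rng.normal(loc=0.0, scale=1.0, size=2_000).reshape(-1, 1)
observed = np.array([[0.1], [-0.5], [8.0]])  # the last value lies far from the reference

# Fit a Gaussian KDE that uses a KD-tree for neighbor lookups.
kde = KernelDensity(kernel="gaussian", bandwidth=0.3, algorithm="kd_tree").fit(reference)

# score_samples returns the log-density of each observation under the fitted model.
log_density = kde.score_samples(observed)

# Hypothetical rule: flag anything less likely than the 1st percentile of the reference.
threshold = np.percentile(kde.score_samples(reference), 1)
is_anomaly = log_density < threshold
print(is_anomaly)
```

In this sketch only the observation far outside the reference distribution should fall below the threshold.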

Approach

Adopt Intel® Distribution for Python* from Intel® oneAPI Base Toolkit.
Sort the input data before running KDE.
Measure the performance data with Intel® VTune™ Profiler.
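
The sorting step can be sketched as follows. This is an illustration under assumptions, not the study's actual script: the input is taken to be 1-D, the data is synthetic, and the bandwidth and query grid are arbitrary. It also checks that sorting leaves the density estimate unchanged, since only the order of the training samples differs.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Hypothetical 1-D input; the case study's real data set is not shown.
rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=5_000)


def fit_kde(values, bandwidth=0.2):
    """Fit a Gaussian KDE backed by a KD-tree on 1-D values."""
    kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth, algorithm="kd_tree")
    # scikit-learn expects a 2-D array of shape (n_samples, n_features).
    return kde.fit(values.reshape(-1, 1))


# Sort the input with NumPy quicksort before building the tree.
sorted_samples = np.sort(samples, kind="quicksort")

query = np.linspace(-3.0, 3.0, 50).reshape(-1, 1)
log_density_unsorted = fit_kde(samples).score_samples(query)
log_density_sorted = fit_kde(sorted_samples).score_samples(query)

# Sorting changes the order of the training samples, not the estimate itself.
print(np.allclose(log_density_unsorted, log_density_sorted))
```

Because the estimate is the same either way, any difference between the two runs in a profiler comes from memory access behavior rather than from the result being computed.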

Environment

Intel® Xeon® Platinum 8280 CPU @ 2.7 GHz with 28 cores per socket
112 logical cores (hyperthreading enabled)
128 GB DRAM
Ubuntu* 18.04.3 LTS, 64-bit
Normal Python*: conda create -n no_mkl_python nomkl python=3.6 scipy numpy pandas numexpr scikit-learn=0.21.3
Intel® Distribution for Python*: conda create -n intel_python -c intel python=3.6 scipy numpy pandas numexpr scikit-learn=0.21.3

Performance Results

Normal Python* vs. Intel® Distribution for Python*

With 500,000 input data points, Intel® Distribution for Python* is 38% faster.
Intel® Distribution for Python* shows improved instruction decoding, which improves the Front-End Bound metrics.
Judging by the metrics that improve the most, DSB (Decoded Stream Buffer) caching appears to be better, likely due to better control flow.

Figure 1. Normal Python* vs. Intel® Distribution for Python* performance comparison summary

Intel® Distribution for Python* without sorting - VTune™ Profiler Summary

L3 Latency is 100% under Back-End Bound. The workload gets no benefit from the L3 cache in this case.
Memory Bandwidth and Memory Latency also show non-trivial values.

Figure 2. Intel® Distribution for Python* without sorting performance summary

Intel® Distribution for Python* with sorting - VTune™ Profiler Summary

L3 Latency decreases to 0%.
Memory Bound drops from 38.6% to 1.6%.
CPI Rate also decreases.
DRAM bandwidth drops from 59.1% to 0.4%.
DRAM memory latency drops from 18.8% to 1.0%.

Figure 3. Intel® Distribution for Python* with sorting performance summary

Intel® Distribution for Python* without sorting - Bottleneck

The recursive KD_Tree function shows a CPU time of 277.3.
L3 Latency is 100%.

Figure 4. Intel® Distribution for Python* without sorting bottom-up

Intel® Distribution for Python* with sorting - Solving the bottleneck

The recursive KD_Tree function shows a CPU time of 112.3.
L3 Latency is 0%.
The L3 cache is clearly effective in this case: the function speeds up by more than 2x (CPU time 277.3 without sorting vs. 112.3 with sorting).
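
The speedup follows directly from the two CPU times reported for the recursive KD_Tree function. A short worked calculation, using only those two figures (the variable names are illustrative):

```python
# CPU times reported for the recursive KD_Tree function in the bottom-up views.
# The article does not state the unit, so none is assumed here.
cpu_time_unsorted = 277.3
cpu_time_sorted = 112.3

# Ratio of the two times, and the relative reduction in CPU time.
speedup = cpu_time_unsorted / cpu_time_sorted
reduction_pct = (1.0 - cpu_time_sorted / cpu_time_unsorted) * 100.0

print(f"speedup: {speedup:.2f}x")                    # speedup: 2.47x
print(f"CPU time reduction: {reduction_pct:.1f}%")   # CPU time reduction: 59.5%
```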

Figure 5. Intel® Distribution for Python* with sorting bottom-up

Kernel Density Estimation results

Estimated time for one KDE run in four different cases.
Intel® Distribution for Python* with sorted input shows the best performance.

Figure 6. KDE performance comparison

NumPy QuickSort time results

Estimated time for the sort step.
Normal Python* shows a stable sort overhead, but in every case the sort overhead is significantly smaller than the time it saves in the KDE calculation.
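
The sort overhead can be measured on its own, separately from the KDE run. The sketch below times NumPy quicksort at the data sizes mentioned in this article. The synthetic data and the `time_call` helper are assumptions for illustration, and absolute timings will differ from the study's platform.

```python
import time

import numpy as np


def time_call(func, *args, **kwargs):
    """Return (result, elapsed_seconds) for a single call to func."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# Synthetic 1-D data; the study's real input is not shown.
rng = np.random.default_rng(0)
sort_times = {}
for n in (15_000, 100_000, 500_000):
    data = rng.normal(size=n)
    sorted_data, elapsed = time_call(np.sort, data, kind="quicksort")
    sort_times[n] = elapsed
    print(f"n={n:>7}: sort took {elapsed * 1e3:.3f} ms")
```

The same helper can wrap the KDE fit and scoring calls, which puts the sort cost and the KDE time on the same footing for comparison.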

Figure 7. NumPy QuickSort comparison

Conclusion

Normal Python* tends to run faster with small data sizes (up to 15,000 data points).
Intel® Distribution for Python* outperforms Normal Python* once the data size grows above 100,000.
The performance gap widens as the data set grows.

Sorting starts to become beneficial for Intel® Distribution for Python* at data sizes of about 100,000 and above.
Intel® Distribution for Python* shows lower sorting time results, but the sorting overhead is far smaller than the performance benefit of adopting it, which makes it a good tradeoff.

Intel software tools can improve your solution and productivity. Download Intel® oneAPI Base Toolkit for Intel® VTune™ Profiler and Intel® Distribution for Python* today.

Product and Performance Information

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.
