Solving Latency Challenges in End-to-End Deep Learning Applications

Published: 05/10/2018  

Last Updated: 05/10/2018

Intel® Student Ambassador David Ojika Uses Intel® Movidius™ Myriad™ 2 Technology for Specialized Vision Processing at the Edge


The Intel® Student Ambassador Program for Artificial Intelligence, part of the Intel® AI Developer Program, collaborates with universities around the globe. The program offers key resources to artificial intelligence (AI) students, data scientists and developers, including education, access to newly optimized frameworks and technologies, hands-on training, and workshops. This paper details the decoupling of cloud-based deep learning training from accelerated inference at the edge.

While the cloud can greatly speed up the compute-intensive process of training convolutional neural networks (CNNs), communicating with the cloud introduces latency, which can degrade inference performance on edge devices and in mission-critical applications.

Intel fellowship recipient David Ojika and graduate research assistant Vahid Daneshmand set out to resolve the problem using specialized vision processors and a distributed computing architecture. Their technique, conclusions and future work as they explored end-to-end image analytics with the Intel® Movidius™ Myriad™ 2 vision processing unit (VPU) are examined here.

The compute-intensive process of training machine learning models is being accelerated by cloud computing. Cloud communications, however, introduce the problem of latency during model inference, leading to lagging performance for edge applications.

Solve Deep Learning Challenges with Intel® Technology

Ojika is a recipient of the Intel fellowship for Code Modernization and a recent doctoral graduate in computer engineering at the University of Florida. He has completed several internships at Intel, where he worked on near-memory accelerators and heterogeneous platforms including Intel® Xeon® processors and FPGAs. Ojika’s research interests span systems research, with a focus on machine learning platforms and architectures for large-scale, distributed data analytics.

Ojika’s Intel internship exposed him to a broad range of hardware and software systems from the company that enabled him to advance his Ph.D. studies. That exposure prompted him to continue his collaboration with Intel as an Intel® Student Ambassador, helping build an AI community at the University of Florida.

Training CNNs is computationally expensive, often requiring several hours or days on moderate hardware. Deploying the trained model for inference can present its own challenges depending on the application's requirements, such as real-time response, low power utilization, reduced form factor, and ease of updating and managing trained models. Intel Movidius Myriad 2 technology was chosen as a development platform to address some of these challenges.

Accelerating CNN Architectures with VPUs at the Edge

Much research has gone into using GPUs to train CNNs, which are commonly used in image recognition. Researchers have paid less attention, however, to the real-time performance of CNNs in resource-constrained environments where low latency or low power is of utmost importance.

This project used a specialized, low-power VPU at the edge to accelerate CNN inference. The researchers presented a method that simplifies the integration of CNNs with end applications through a microservices approach: a loosely coupled architecture that allows CNN “services” to scale elastically with request volume. The services that process inference requests consist of a lightweight front end (for request admission) and a load-sensitive back end (for request processing), and they expose simplified web interfaces and language-independent APIs for serving CNN models to end applications.
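The split between request admission and request processing can be illustrated with a minimal sketch. This is not the researchers' implementation: the class and function names (`AdmissionFrontEnd`, `backend_worker`, `fake_infer`) are hypothetical, the bounded in-process queue is an assumed admission policy, and inference is replaced by a placeholder.

```python
import queue
import threading


class AdmissionFrontEnd:
    """Lightweight front end: admits a request only if the bounded queue has room."""

    def __init__(self, capacity):
        self.pending = queue.Queue(maxsize=capacity)

    def submit(self, image_id):
        """Return a reply queue if the request is admitted, or None if rejected."""
        reply = queue.Queue(maxsize=1)
        try:
            self.pending.put_nowait((image_id, reply))
        except queue.Full:
            return None
        return reply


def fake_infer(image_id):
    # Placeholder for CNN inference on the VPU (assumption: returns a made-up label).
    return "label-for-" + str(image_id)


def backend_worker(front_end, infer, stop):
    """Back end: drain admitted requests in order and send each result back."""
    while not stop.is_set():
        try:
            image_id, reply = front_end.pending.get(timeout=0.05)
        except queue.Empty:
            continue
        reply.put(infer(image_id))


def demo():
    front_end = AdmissionFrontEnd(capacity=2)
    # No worker is running yet, so the third request exceeds the queue capacity.
    replies = [front_end.submit(i) for i in range(3)]
    stop = threading.Event()
    worker = threading.Thread(target=backend_worker, args=(front_end, fake_infer, stop))
    worker.start()
    results = [r.get(timeout=2) for r in replies if r is not None]
    stop.set()
    worker.join()
    return replies[2] is None, results
```

Rejecting work at the front end when the queue is full keeps the admission path cheap and lets the back end run at its own pace; a real deployment would likely place the two components in separate processes behind a network interface.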

Figure 1. Software architecture

Key to the success of their research was the Intel Movidius Myriad 2 VPU, the industry’s first always-on vision processor. Offering high performance at low power, this family of vision processors gives developers immediate access to the vision processing core, enabling them to differentiate their products with proprietary capabilities. The Intel Movidius Myriad 2 VPU also delivers its dedicated vision processing platform in a small footprint.

Figure 2. System Overview

The first step in their development was to integrate trained CNN models into the Intel Movidius technology tool chain. For demonstration purposes, the team obtained publicly available, pre-trained models, including GoogLeNet and ResNet-50*, trained on the ImageNet dataset with Caffe* and TensorFlow*. Next, they compiled each of the Caffe and TensorFlow models into Movidius-specific file formats using the provided Intel® Movidius™ Neural Compute Stick (NCS) toolkit. This toolkit also supports other advanced features, such as checking and profiling compiled models.

Next, the team designed and implemented two microservices, a Java-based front end and a Python*-based back end (Figure 1), which were then deployed on an Intel Atom® processor-based platform as shown in Figure 2. The Intel Atom processor-based platform received requests on behalf of the Intel Movidius Myriad 2 VPU, which then processed those requests.
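To make the idea of serving inference over a web interface concrete, here is a minimal sketch of a Python back end using only the standard library. It is an illustration under stated assumptions, not the team's code: the `/infer` path, the JSON response shape, and the `classify` function are all hypothetical, and `classify` is a stand-in that does not touch the VPU.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def classify(image_bytes):
    # Hypothetical stand-in for inference on the VPU: reports only the payload size.
    return {"label": "placeholder", "bytes_received": len(image_bytes)}


class InferenceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/infer":  # assumed endpoint name
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        body = json.dumps(classify(self.rfile.read(length))).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # silence per-request logging


def start_server():
    # Port 0 asks the operating system for any free port.
    server = HTTPServer(("127.0.0.1", 0), InferenceHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def post_image(port, image_bytes):
    """Send raw image bytes to the back end and return the decoded JSON reply."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request("POST", "/infer", body=image_bytes)
        return json.loads(conn.getresponse().read().decode("utf-8"))
    finally:
        conn.close()
```

Because the interface is plain HTTP with a JSON reply, a client written in any language (such as a Java front end) can call it without sharing code with the back end.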

Finding Workarounds for Virtualization Support

A major issue Ojika encountered involved virtualization support for the Intel Movidius NCS. Although his team managed to find a workaround, they have alerted the Intel Movidius NCS team to the challenge and hope to integrate a solution in their future development efforts.

The Intel Movidius NCS toolkit, it should be noted, provides an important tool for dealing with trained CNNs in end-to-end deployment scenarios such as Ojika’s use case. The toolkit is Python based, with intuitive APIs that allowed the team to easily integrate the Intel Movidius NCS tool chain into custom applications.

A Simpler Way to Deploy Deep Neural Networks

Ojika’s solution is intended to significantly reduce the management complexity of deploying CNNs at scale in resource-constrained environments. It should also help maximize the utilization of resources, including energy and network bandwidth, as well as the return on hardware investment. Currently, the solution is useful for real-time video analytics in applications such as drones, surveillance and facial recognition.

At present, performance is limited by the number of clients and back-end components. In the future, the researchers plan to implement an automated, elastic scaling mechanism for handling requests within a set of defined service-level agreements. They also plan to design an efficient resource utilization scheme based on network traffic and power constraints, and to explore the use of overlay networks for a larger-scale deployment of their proposed architecture.
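As a rough illustration of what an elastic scaling decision might involve, the sketch below sizes the back-end pool from the offered load. This is an assumed sizing rule chosen for simplicity, not the mechanism the researchers plan to build; the function name, parameters and the default utilization target are hypothetical.

```python
import math


def workers_needed(arrival_rate_rps, service_time_s, target_utilization=0.7):
    """Estimate how many back-end workers keep utilization at or below a target.

    arrival_rate_rps: average inference requests per second.
    service_time_s: average seconds one worker spends on one request.
    target_utilization: fraction of each worker's time allowed to be busy.
    """
    if arrival_rate_rps < 0 or service_time_s <= 0:
        raise ValueError("rate must be >= 0 and service time must be > 0")
    if not 0 < target_utilization <= 1:
        raise ValueError("target utilization must be in (0, 1]")
    # Offered load: average number of requests in service at any moment.
    offered_load = arrival_rate_rps * service_time_s
    # Always keep at least one worker so the service stays reachable.
    return max(1, math.ceil(offered_load / target_utilization))
```

For example, 10 requests per second at 0.25 seconds each is an offered load of 2.5 workers; holding utilization to 50 percent gives `workers_needed(10, 0.25, 0.5) == 5`. A scheme that honors service-level agreements would also need to account for queueing delay, network traffic and power limits, which this sketch ignores.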

The Intel Movidius Myriad 2 VPU was found to achieve real-time performance for CNN inference on embedded devices. Ojika and Daneshmand proposed a software architecture that presents inference as a web service, enabling a shared platform for image analytics on embedded devices and latency-sensitive applications.

Check out David's Intel® Developer Mesh project for more details and updates.

Join the Intel® AI Developer Program

Sign up for the Intel® AI Developer Program and access essential learning materials, community, tools and technology to boost your AI development. Apply to become an Intel AI Student Ambassador and share your expertise with other student data scientists and developers.

