Developers all over the world competed to build and tune high-performance embedded deep neural networks (DNNs) using the Intel® Movidius™ Neural Compute Stick in a Topcoder* Challenge. Read the stories from the challenge winners for instruction and inspiration, and then go bring visual intelligence to the network edge in your own project.
The future of AI development is being democratized. No longer restricted to specialists, challenges such as computer-based perception and learning can now be tackled by the general developer and maker community. Hardware and software tools have become viable for mainstream use, in terms of both cost and complexity, and the knowledge to guide new participants in this field has become widely available.
The context for this shift is compelling: market estimates suggest that as many as 20 billion devices will be connected to the Internet by 2020 [1]. The massive volume of data generated by these devices provides opportunities that have scarcely yet been imagined, in terms of both improved human experience and emerging business possibilities.
In particular, devices such as smart cameras, drones, and robots will generate growing amounts of visual data. Establishing intelligence at the network edge is a key requirement for delivering value from that data, interacting with the environment in real time. This approach eliminates the inefficiencies in power consumption, bandwidth, and latency associated with sending data back to the network core for decision making. Here are a few simple examples, among an infinite set of possibilities:
- Surveillance systems analyzing video in real time to make decisions about whether intruders or other dangerous conditions exist, to support alerting response teams when needed.
- Agricultural drones processing visible-spectrum and infrared images to identify areas of a field where additional data is required to fine-tune the application of water, herbicides, and pesticides.
- Consumer or healthcare robots avoiding obstacles as they navigate through a home or other location, allowing them to move autonomously in real-world situations.
Intel Movidius, a unit within Intel’s Artificial Intelligence Product Group (AIPG), develops embedded hardware solutions for AI and computer vision that provide processing to edge devices, without requiring cloud or network connectivity. The Intel® Movidius™ vision processing unit (VPU) is integrated into millions of devices already in deployment and is also the basis for the Intel® Movidius™ Neural Compute Stick (NCS). A tiny, fanless deep learning device, the NCS is available globally for a retail price of $79 and well-supported by free open-source software for developers.
To advance the state of development for the NCS and to stimulate its adoption, Intel Movidius held a competition in partnership with Topcoder, with prizes totaling U.S. $20,000. Competitors fine-tuned convolutional neural networks (CNNs) by analyzing bandwidth, execution time, and complexity at each layer, to optimize accuracy and execution time for edge-based applications. This article introduces the NCS and related tools and then reports on the contest itself. Developers already familiar with NCS technologies can skip ahead to the discussion of contest winners’ approaches, problem resolutions, and lessons learned.
Hardware to Accelerate Deep Learning at the Edge
The NCS, illustrated in Figure 1, is housed in a form factor that resembles a USB flash drive and is based on the Intel® Movidius™ Myriad™ 2 vision processing unit (VPU). All power and data connectivity is provided through a single USB Type-A port, which supports either USB 2.0 or USB 3.0. Solutions can be scaled out by running multiple NCS devices on a single host system. Supported hosts include desktop or laptop machines running 64-bit Ubuntu* Linux*, the Raspberry Pi* 3 Model B running Raspbian Stretch* desktop, and an Ubuntu 16.04 VirtualBox* instance.
Figure 1. The Intel Movidius NCS, powered by the Intel Movidius Myriad 2 VPU. (Image labels: Neural Compute Stick (NCS); Myriad™ 2 Vision Processing Unit (VPU).)
The VPU that constitutes the main contents of the NCS package is based on a system-on-chip (SoC) design, built to deliver optimized data flows for high throughput while running at low power. Hardware accelerators within the NCS are purpose-built for imaging and computer vision, complementing an array of 12 vector processors called SHAVE (Streaming Hybrid Architecture Vector Engine) processors, two RISC CPUs, and a software-controlled memory subsystem based on 4 Gb (512 MB) of LPDDR3 RAM and configurable caches.
The SHAVE processors support 8-, 16-, and 32-bit integer arithmetic and 16- and 32-bit floating-point arithmetic, along with features such as hardware support for sparse data structures. They use a very long instruction word (VLIW) architecture, which packs several basic operations into each instruction to exploit instruction-level parallelism, together with single-instruction, multiple-data (SIMD) processing to accelerate neural networks.
The RISC CPUs run custom firmware, loaded from the host machine through an API, that allows the NCS to accept data and commands and to execute inferences using the neural network. The firmware also parses work and schedules it on the SHAVE execution engines, and it handles functions such as temperature monitoring, alerts, and processor throttling.
Models and Tools for Developing Visual Intelligence
The algorithms behind deep neural networks (DNNs) are inspired by the connections among neurons in the human brain. Likewise, image classification based on visual deep learning mimics and builds on human processes of making guesses based on known information. CNNs are a type of DNN that is particularly well suited to analyzing visual imagery, with design characteristics such as minimal pre-processing requirements and the ability to exploit the relative locations of pixels. The process of building and deploying a neural-network-based image classifier involves the following phases:
- Train, including selection of an appropriate base neural network, preparing a dataset of images, and running the actual training process to create a model.
- Profile, consisting of testing whether and how well the model runs on the NCS, generating an analysis of bandwidth, complexity, and execution time for each layer.
- Fine tune, which involves customizations of the neural network such as altering or removing layers to affect factors such as speed and accuracy.
- Deploy the neural network on an edge device and test it by classification of a set of target images.
Deep Learning Frameworks and CNN Architectures
The capabilities of the NCS hardware for deep learning are abstracted by frameworks that offer standardized building blocks that help streamline the process of designing, training, and validating CNNs. Two open-source deep-learning frameworks are validated for use with NCS:
- Caffe* is a deep learning framework from Berkeley AI Research (formerly the Berkeley Vision and Learning Center), provided under a BSD license. It is written in C++, with a Python* API. Caffe provides a large variety of pre-trained models and allows models to be trained without writing any code; it appears to be in the process of being replaced by its successor, Caffe2, which is backed by Facebook* among others.
- TensorFlow* is a deep learning framework from Google, provided under the Apache* 2.0 license. It supports writing applications in Python and provides Java*, C, and Go APIs for deploying TensorFlow models in existing applications. Google and Intel both offer free online training to familiarize developers with machine learning concepts based on TensorFlow.
For either of the frameworks, a variety of CNN architectures can be used. Common choices for Caffe are GoogLeNet, AlexNet, and SqueezeNet. TensorFlow developers often use various versions of Google’s Inception CNN architecture or alternatives such as MobileNet or DeepNet.
Development Tools for CNNs and AI Applications
Developers profile, tune, and compile hardware-accelerated CNNs using the NCS hardware, a host computer, and a set of command-line development tools provided by the NCS software development kit (NCSDK). The tools include the following, as illustrated in Figure 2:
- mvNCCompile compiles the Caffe or TensorFlow network and weights (which govern the significance of specific inputs) into the Movidius Graph format, which can then be loaded onto the NCS at runtime to execute inferences.
- mvNCProfile compiles and runs the network on the NCS hardware and outputs performance profiles based on that execution. The profiles contain layer-by-layer statistical data that developers use to identify bottlenecks and improve overall inference time.
- mvNCCheck validates the functioning of a Caffe or TensorFlow neural network on the NCS by running inferences on both the NCS and the host hardware. Comparing the results of both sets of computations will indicate whether the network passes or fails the validation process.
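The pass/fail comparison described for mvNCCheck can be pictured with a small sketch. This is not the tool's implementation: the function name `compare_outputs`, the tolerance value, and the sample vectors are assumptions chosen only to illustrate comparing device results against host results.

```python
def compare_outputs(device_out, host_out, tolerance=0.02):
    """Compare two class-score vectors of equal length.

    `tolerance` is an illustrative threshold, not a value taken from the NCSDK.
    """
    if len(device_out) != len(host_out):
        raise ValueError("output vectors must have the same length")
    # Index of the highest-scoring class on each side.
    top_device = max(range(len(device_out)), key=device_out.__getitem__)
    top_host = max(range(len(host_out)), key=host_out.__getitem__)
    # Largest element-wise absolute difference between the two vectors.
    max_abs_error = max(abs(d - h) for d, h in zip(device_out, host_out))
    top1_match = top_device == top_host
    return {
        "top1_match": top1_match,
        "max_abs_error": max_abs_error,
        "passed": top1_match and max_abs_error <= tolerance,
    }


# Made-up scores for a three-class problem.
host_scores = [0.10, 0.70, 0.20]
device_scores = [0.11, 0.69, 0.20]
report = compare_outputs(device_scores, host_scores)
```

Here the two vectors agree on the top class and differ by roughly 0.01 per element, so the sketch reports a pass. A real validation would likely use more than one metric and a tolerance suited to half-precision arithmetic.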
In addition to the CNN itself, developers create application software (in either Python or C) to facilitate its use in real-world tasks. The NCSDK provides an API for offloading neural network computations onto the NCS, allowing software to programmatically open and close the NCS, load graphs, and run inferences on the NCS. This offload process accelerates computations and inferences.
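The open, load, infer, and close sequence described above can be sketched as follows. The class `StandInStick` and the helper `opened` are hypothetical stand-ins written for this illustration; they are not the NCSDK API, and the "inference" is a placeholder calculation. The point of the sketch is only the lifecycle: the device is always closed, even if loading the graph or running an inference fails.

```python
from contextlib import contextmanager


class StandInStick:
    """Hypothetical stand-in for an accelerator handle; not the NCSDK API."""

    def __init__(self):
        self.is_open = False
        self.graph = None

    def open(self):
        self.is_open = True

    def load_graph(self, graph_bytes):
        if not self.is_open:
            raise RuntimeError("device must be opened before loading a graph")
        self.graph = graph_bytes

    def infer(self, values):
        if self.graph is None:
            raise RuntimeError("no graph loaded")
        # Placeholder "inference": scale the inputs so they sum to 1.
        total = sum(values)
        return [v / total for v in values]

    def close(self):
        self.graph = None
        self.is_open = False


@contextmanager
def opened(stick, graph_bytes):
    """Open the device and load a graph; always close, even on error."""
    stick.open()
    try:
        stick.load_graph(graph_bytes)
        yield stick
    finally:
        stick.close()


stick = StandInStick()
with opened(stick, b"compiled-graph-bytes") as dev:
    scores = dev.infer([1.0, 3.0])
```

An application built on the actual NCSDK would replace the stand-in class with the SDK's own device and graph objects while keeping the same try/finally structure.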
Figure 2. Toolkit and API framework provided by the NCS software development kit (NCSDK).
Meet the Winners, Their Motivations, and Their Methods
The competition held in conjunction with Topcoder set about its goal of advancing AI development with an emphasis on the accessibility of programming for the NCS by the general community. The pool of contest entrants reflects the diversity of people who are getting involved in AI development: participants ran the gamut from students and academics, to industry-based developers, to hobbyists and makers. A common refrain among all of these groups was that they decided to enter the competition as an exploratory exercise to become familiar with the technology, as shown in Figure 3. The fact that many of the winners made such statements is a direct reflection of how accessible development of deep-learning solutions has become.
Figure 3. Sample motivations for competing and developing for the Intel® Movidius™ Neural Compute Stick (NCS).
Over the course of the competition, which lasted from December 2017 until March 2018, all of the winners reported experimenting with a number of factors, such as various frameworks, networks, and data models, as discussed below. The scoring methodology [2] used for the contest is designed to challenge participants to strike the best possible balance between accuracy and execution time. This requirement reflects the reality that image recognition in real-world scenarios typically must be accomplished in near-real time with a high degree of accuracy, and neither can be compromised in favor of the other.
For example, a solution that recognizes license plates or faces from a police vehicle must place equal importance on being both fast and accurate. The same is true for countless other applications, from an order-picking robot that chooses items on a shelf to a next-generation self-propelled vacuum cleaner that is programmed to go right up to furniture and walls but steer clear of a sleeping cat. The five top winners are shown in Table 1, along with the amount of the prize each won and their relative scores, which reflect their success in tuning their networks for the best possible accuracy and speed at the image-recognition task.
Table 1. Winners of the Embedded Image Classification Challenge. (Table contents are not reproduced here; the only surviving entry is Mauricio Pamplona Segundo.)
Choice of framework and network
At the time of this writing, both Caffe and TensorFlow are supported by the NCS, and each offers a range of choices for the network (e.g., Inception, MobileNet). These factors mean that there are many possible pairings, and each of the contest winners experimented with several, weighing the relative advantages of each to determine which framework and network selection would produce the best combination of speed and accuracy. In some (but not all) cases, using more recent versions of an architecture (e.g., MobileNet v2 as opposed to MobileNet v1) can improve accuracy. Variations in that general rule make it worthwhile to test multiple versions against each other.
“I used a stock pre-trained Inception V3 model using Tensorflow ... I tried other more both more simple and more complex networks. My initial submission was on Mobilenet but seems the network was too simple. I also tried Inception V4 but it was too slow.”
‒ Cloves Almeida, fourth place winner
Notably, all five top winners used TensorFlow for their final implementations. Many developers will benefit from the optimizations of TensorFlow for various types of Intel architecture, ranging from the NCS to Intel® Xeon® and Intel® Xeon Phi™ processor-based platforms. These optimizations are the fruit of a close collaboration between Intel and Google engineers that focuses on a range of opportunities for performance improvement, including the following:
- Vectorization of key primitives such as convolution, matrix multiplication, and batch normalization takes advantage of SIMD instructions.
- Parallelization helps ensure efficient use of the full range of processing cores available, within a given layer of operation as well as across layers.
- Data locality makes data available where and when execution units need it by taking advantage of prefetching, cache blocking techniques and data formats that promote spatial and temporal locality.
TensorFlow optimizations for modern Intel architectures include the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN).
Model size and training
Larger model sizes tend to improve accuracy, with the tradeoff of taking more time to complete the categorization task. Because this contest (like most real-world implementations) required both accuracy and speed, the model size must be adjusted to achieve that balance.
“My submission consisted of simply fine-tuning the largest MobileNet architecture known to be supported by the NCSDK.”
‒ Paul Froissart, first place winner
Several participants built models from scratch in an effort to increase speed or accuracy beyond what was possible with pre-trained models, with varying success. Because this approach may increase the complexity of implementing the solution as a whole, developers should weigh that complexity against potential benefits.
Enabling data augmentation methods such as cropping, rotating, and flipping input images during training enhances the effective richness of information available from a given set of source data, increasing the accuracy of the DNN. First-place winner Paul Froissart suggests that enabling augmentation “may require changing hyperparameters for best performance (for example, training for more epochs and decreasing the learning rate more slowly).”
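A minimal sketch of the three augmentations named above, written in plain Python on images stored as nested lists so that it has no dependencies. The function names and the 50 percent probabilities are assumptions for illustration; the winners' actual training pipelines are not described in enough detail to reproduce, and a real pipeline would normally use the augmentation utilities of the chosen framework.

```python
import random


def flip_horizontal(image):
    """Mirror each row left to right."""
    return [row[::-1] for row in image]


def rotate_90_clockwise(image):
    """Rotate a rectangular image by 90 degrees clockwise."""
    return [list(column) for column in zip(*image[::-1])]


def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window at a random position."""
    top = rng.randint(0, len(image) - crop_h)
    left = rng.randint(0, len(image[0]) - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


def augment(image, crop_h, crop_w, rng):
    """Apply a random crop, then optionally a flip and a rotation."""
    out = random_crop(image, crop_h, crop_w, rng)
    if rng.random() < 0.5:  # illustrative probability
        out = flip_horizontal(out)
    if rng.random() < 0.5:  # illustrative probability
        out = rotate_90_clockwise(out)
    return out


# A 3x3 single-channel "image" with made-up pixel values.
sample = [[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]]
variant = augment(sample, 2, 2, random.Random(0))
```

Each call to `augment` yields a slightly different view of the same source image, which is how augmentation increases the effective variety of a fixed training set.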
Community resources: NCS App Zoo, blog, and forum
As with many open-source projects, official documentation of the tools and technologies used with NCS is very valuable, but not enough on its own. The winners of the competition universally reported getting help from the Intel Movidius community as they worked through issues and improved their solutions. Key examples include the following:
- App Zoo: the Intel Movidius Neural Compute App Zoo (NC App Zoo) is a GitHub* repository that provides a large body of examples, including user-submitted networks and applications.
- Blog: The NCS blog provides a body of entries that explore capabilities, issues, and resolutions that are valuable to the community.
- Forum: The official NCS forum provides notifications of new capabilities and tools, technical discussions about NCS and NCSDK, and the latest documentation and downloads for developers.
“Some important information could only be found in the forum.”
‒ Mauricio Pamplona Segundo, second place winner
A few more key challenges that participants overcame
During the course of the competition, participants encountered challenges that ranged from bugs in the code to arcane procedural issues. Some of those challenges are described above; here are a few others, selected because of their importance to the outcome of participants’ entries in the competition:
Imbalance between speed and accuracy
It is critical to select a network with an appropriate depth. For example, one participant attempted to use Inception-ResNet-V2, an extremely deep network, in an effort to maximize accuracy. The resulting model was both very expensive to train and slow to run inference on an image. By switching to MobileNet_01_224, which has about one-tenth as many parameters, the participant reduced training time by 3x to 5x and inference time by approximately 10x, with a penalty in accuracy of only about 13 percent.
Difficulty exporting the network.meta file
One area where the learning curve proved steep for at least one participant was exporting the network and corresponding weights. In addition to using materials provided in the NCS App Zoo, third-place finisher Fernando Sadu resolved this issue using the Technica-Corporation/TF-Movidius-Finetune GitHub* page, an end-to-end solution for fine-tuning a TensorFlow application and exporting the network.meta file.
Mismatched color formats for Caffe versus TensorFlow
One participant spent several days unable to generate non-zero scores when performing inference on sample data. Looking deeper, he recognized that the compiled.graph file he was using was implemented for a Caffe solution, meaning that it was reading images in BGR format instead of the RGB format expected by TensorFlow. Once the source of this issue was identified, a simple conversion from BGR to RGB format resolved it.
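The conversion itself amounts to reversing the channel order of every pixel. The sketch below shows this on an image stored as nested lists of (B, G, R) tuples; the function name `bgr_to_rgb` and the sample pixels are assumptions for illustration, not code from the participant's entry.

```python
def bgr_to_rgb(image):
    """Return a copy of the image with each pixel reordered from BGR to RGB."""
    return [[(r, g, b) for (b, g, r) in row] for row in image]


# One row with two made-up pixels in BGR order: pure blue, then pure red.
bgr_image = [[(255, 0, 0), (0, 0, 255)]]
rgb_image = bgr_to_rgb(bgr_image)
```

With images held in NumPy arrays, the same reordering is commonly written as a reversed slice over the last axis, `image[..., ::-1]`.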
Potential for future improvements
Beyond the challenges that the winning participants in the competition encountered and resolved, they identified many areas for future refinement and ongoing improvement to their models. Key opportunities for experimentation that participants are considering include the following:
Investigate additional architectures
The broad (and growing) array of architectures available offers the potential to pursue better balances and tradeoffs between speed and accuracy. Multiple factors such as the number of parameters and the architecture’s memory footprint play roles in this selection.
Refine the scope of training
Several participants reported interest in experimentation with a larger data set when training their models, as well as potentially extending the training to cover more layers or improving the tuning of training parameters (e.g., learning rate and decay).
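To make the learning rate and decay parameters concrete, the sketch below implements one common schedule, exponential decay, in which the rate is multiplied by `decay_rate` once every `decay_steps` training steps. The choice of schedule and all numeric values are assumptions for illustration; the article does not state which schedules the participants used.

```python
def exponential_decay(initial_rate, decay_rate, decay_steps, step):
    """Learning rate after `step` training steps under exponential decay."""
    return initial_rate * decay_rate ** (step / decay_steps)


# Illustrative values only: halve the rate every 1,000 steps...
fast = exponential_decay(0.01, 0.5, 1000, step=1000)
# ...or every 2,000 steps, which decreases the rate more slowly.
slow = exponential_decay(0.01, 0.5, 2000, step=1000)
```

Raising `decay_steps` (or moving `decay_rate` closer to 1) keeps the learning rate higher for longer, which is one way to decrease the learning rate more slowly when training for more epochs.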
Apply additional data augmentation techniques
While some participants augmented data using techniques such as cropping, rotating, and flipping input images during training, the range of other possibilities is extensive. For example, additional techniques such as color jittering (applying changes to individual R, G, or B values) and altering saturation or value may be applied, together with various combinations of techniques.
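The two additional techniques mentioned above can be sketched with the Python standard library. Pixels are assumed to be (R, G, B) tuples of floats in the range 0.0 to 1.0, and the function names, shift size, and sample pixel are illustrative assumptions rather than anything taken from the contest entries.

```python
import colorsys
import random


def clamp(value, low=0.0, high=1.0):
    """Keep a channel value inside the valid range."""
    return max(low, min(high, value))


def jitter_color(pixel, max_shift, rng):
    """Shift R, G, and B independently by up to +/- max_shift."""
    return tuple(clamp(c + rng.uniform(-max_shift, max_shift)) for c in pixel)


def scale_saturation(pixel, factor):
    """Multiply the HSV saturation of an RGB pixel by `factor`."""
    h, s, v = colorsys.rgb_to_hsv(*pixel)
    return colorsys.hsv_to_rgb(h, clamp(s * factor), v)


rng = random.Random(0)
original = (0.8, 0.4, 0.2)  # made-up pixel
jittered = jitter_color(original, 0.1, rng)
washed_out = scale_saturation(original, 0.5)
```

Scaling the value (brightness) channel works the same way as `scale_saturation`, by adjusting `v` instead of `s` before converting back to RGB.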
Combine resources of multiple NCS devices
Experimentation with dividing up inference tasks within multi-NCS arrays would theoretically accelerate operation of the image-classification algorithm.
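One way to picture such an experiment is to split the input images round-robin across devices and run each share in its own thread. In the sketch below, each device is represented by a plain Python callable; `split_round_robin`, `classify_on_devices`, and the stand-in classifier are hypothetical names for illustration, not NCSDK functions. Whether this yields a speedup in practice would depend on factors such as host CPU load and USB bandwidth.

```python
from concurrent.futures import ThreadPoolExecutor


def split_round_robin(items, num_devices):
    """Assign item i to device i % num_devices."""
    return [items[d::num_devices] for d in range(num_devices)]


def classify_on_devices(images, device_fns):
    """Run each shard on its own device callable, then restore input order."""
    n = len(device_fns)
    shards = split_round_robin(images, n)

    def run_shard(fn, shard):
        return [fn(image) for image in shard]

    # One worker thread per device, each processing its own shard.
    with ThreadPoolExecutor(max_workers=n) as pool:
        shard_results = list(pool.map(run_shard, device_fns, shards))

    # Interleave shard outputs back into the original image order.
    results = [None] * len(images)
    for d, outputs in enumerate(shard_results):
        results[d::n] = outputs
    return results


def fake_classifier(label):
    """Stand-in for a per-device inference call (hypothetical)."""
    return lambda image: (label, image)


outputs = classify_on_devices(list(range(5)),
                              [fake_classifier("ncs0"), fake_classifier("ncs1")])
```

In this run, even-numbered inputs are handled by the first stand-in device and odd-numbered inputs by the second, and the results come back in the original order.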
In the months and years to come, AI will continue to advance as a common technology that will power an ongoing emergence of applications as diverse as smart cameras, drones, robots, and augmented reality. The Intel Movidius NCS is helping make machine intelligence ubiquitous, taking it out of data centers and university labs and onto consumer devices of every description. This competition and the lessons learned during it provide a window into the growing AI expertise in the mainstream developer community. As intelligence continues to develop at the network edge, a universe of opportunities is waiting to be imagined.
1. Source: Gartner Says 8.4 Billion Connected "Things" Will Be in Use in 2017, Up 31 Percent From 2016.
2. For details of the scoring mechanism, see the “Scoring” section of the Challenge Problem Statement.