Achieve Cost-Effective, Real-Time
Machine Translation at Lowest Latency

Surprise, It’s Done with CPUs

Authors: Sidharth Kashyap, Manos Farsarakis, Nikhil Deshpande 

Machine translation [2] is the task of translating text from one language to another using artificial intelligence techniques. This capability is crucial for any business serving global customers whose content must be translated across multiple languages. Neural machine translation (NMT) uses artificial neural networks to predict the likelihood of a sequence of words in the target language. NMT has changed the way we read, write and travel, achieving high-quality results that ease operation in an increasingly multilingual business landscape. For example, post translation on Facebook and the Microsoft Translator service use machine translation [3] to provide the best possible user experience across the globe.

NMT has continually pushed the boundaries of deep learning through neural network techniques such as Transformers [4], which have sustained its progress. These techniques also make machine translation computationally expensive [5]. Given the number of parameters in current state-of-the-art neural network architectures, NMT strains compute resources in both training and inference. This demand calls for efficient inference that meets strict latency requirements, ideally with minimal investment in specialized hardware, so that these innovations can be used in real-world deployments with the best user experience.

The Workshop on Neural Generation and Translation (WNGT) [6], organized by the NMT community, runs a recurring efficiency shared task with rolling submissions. The task assesses the throughput, cost and latency (latency was added after the initial submissions) of NMT models relative to the quality they achieve, across a set of pre-determined hardware platforms on Amazon Web Services (AWS) EC2 instances. This year the efficiency task was organized by Dr. Kenneth Heafield [7] from the University of Edinburgh.

The hardware platforms chosen were the 2nd gen Intel Xeon Platinum 8270 processor (c5.metal [8] instance on AWS) and the NVIDIA T4 [9] GPU. The evaluation metrics were the quality, latency, throughput and size of the models. At the time of this evaluation, the c5.metal instance for the Intel Xeon Scalable system was billed at $4.08/hr and the g4dn.xlarge instance with one NVIDIA T4 GPU was billed at $0.526/hr.

Below we summarize the efficiency task results on latency and throughput, which are critical for a seamless end-user experience. As an operational aspect, we also discuss the deployment costs involved and compare CPU and GPU. Our analysis uses Pareto optimality across these parameters. A state is Pareto efficient when no parameter can be improved without making another parameter worse. For NMT efficiency, we measure quality against each of the other parameters at the Pareto frontier, which is indicated by the line in the graphs shown in the results. A solution that lies on the Pareto frontier makes a case for deployment at that quality and the corresponding metric (speed, cost or latency).
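To make the Pareto frontier concrete, here is a minimal Python sketch that reduces a set of (BLEU, latency) points to the non-dominated ones. The system names and numbers are hypothetical placeholders chosen for illustration; they are not results from the efficiency task.

```python
from typing import List, Tuple

# (name, BLEU score, latency in milliseconds); all values are hypothetical.
Point = Tuple[str, float, float]


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Return the points that no other point dominates.

    One point dominates another when its BLEU is at least as high and its
    latency at least as low, with a strict improvement in one of the two.
    """
    frontier = []
    for name, bleu, latency in points:
        dominated = any(
            other_bleu >= bleu
            and other_latency <= latency
            and (other_bleu > bleu or other_latency < latency)
            for _, other_bleu, other_latency in points
        )
        if not dominated:
            frontier.append((name, bleu, latency))
    # Sort by latency so the frontier reads left to right like the graphs.
    return sorted(frontier, key=lambda p: p[2])


systems = [
    ("A", 32.0, 9.0),
    ("B", 34.0, 14.0),
    ("C", 32.0, 24.0),
    ("D", 34.0, 32.0),
    ("E", 35.5, 60.0),
    ("F", 33.0, 40.0),
]

frontier_names = [name for name, _, _ in pareto_frontier(systems)]
print(frontier_names)
```

In this toy data set, systems C, D and F are each beaten on latency by a system of equal or better quality, so only A, B and E remain on the frontier. The same filter applies to cost or throughput once the comparison direction is adjusted for that metric.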

NMT Latency: Intel Xeon Platinum delivers as much as 2.64x lower latency than NVIDIA T4 at 32-34 BLEU

Latency determines whether a model can be deployed in real-world translation scenarios. Translation systems must achieve as low a latency as possible to satisfy users in scenarios such as e-commerce, productivity solutions in office suites, and translation systems in the cloud and the browser.

Figure 1 shows the results obtained by the University of Edinburgh team using the Marian translation toolkit on 2nd gen Intel Xeon Platinum; the triangles depict the CPU models and the crosses depict the GPU models. The BiLingual Evaluation Understudy (BLEU) [10] score measures machine translation accuracy. Details of the model configurations corresponding to the BLEU scores can be found in the paper [11] that describes the University of Edinburgh submissions. The BLEU score a model needs to be usable in the real world varies by use case: for example, the score required for conversation translation can differ significantly from that required for a medical translation service.

The lowest latency, under 10 milliseconds, was achieved at a BLEU score of 32 by University of Edinburgh submissions running on a single core of the Intel Xeon Platinum-based system, which occupy the Pareto frontier at that quality. The two overlapping triangles depict models with slightly different pruning configurations that achieve this result. Similarly, a 4-thread model at a BLEU score of 34 is Pareto optimal at a latency close to 14 milliseconds, as shown in the graph below.

The key takeaway from this graph is that submissions on Intel Xeon Platinum systems are nearly 2.5x faster than the GPU submissions (2.64x near BLEU 32 and 2.26x at BLEU 34.06), which strengthens the case for running latency-sensitive deployments on Intel Xeon processors.

Note that the organizers added latency measurements to the competition after the initial round of submissions, and the efficiency task itself now accepts rolling submissions.

Figure 1: Accuracy (BLEU) vs Latency: Submissions by the University of Edinburgh team using the Marian toolkit on Intel Xeon Scalable systems achieve <10 ms latency at a BLEU score of 32 and are 2.64x faster than competitor solutions near the same accuracy. The CPU submissions are also 2.26x faster than the GPU submissions at a BLEU score of 34.06. The “smiley face” on the charts from the University of Edinburgh indicates the ideal outcome, such as perfect quality with zero latency. Results that move in the direction of the ideal outcome are better.
Picture source: https://neural.mt/speed/#latency

Throughput: Xeon Platinum delivers 80,000 words per second at 32 BLEU

The competition also measured the throughput of the submitted systems. Submissions from the University of Edinburgh using the Marian translation toolkit achieved 80,000 words per second on 48 cores of the Intel Xeon Platinum system (two 24-core CPUs). These results showcase a methodology for high-quality translation at a very high rate in bulk translation scenarios such as document translation.

Figure 2: Accuracy (BLEU) vs Throughput: Submissions by the University of Edinburgh team using the Marian toolkit on Intel Xeon Scalable systems achieve up to 80k words per second at BLEU scores of ~32.
Picture Source: https://neural.mt/speed/#speed

Cost: Intel Xeon Platinum stays competitive on the cost frontier

The cost of deployment is paramount in any real-world scenario, and the money spent per translation dictates the platform of choice in most cases. The organizers measured this balance on a Pareto curve of translation quality against translation cost. Any model that lies on the Pareto line makes the case for deployment at that quality point. The submissions from the University of Edinburgh show that Intel Xeon Platinum systems are competitive with accelerator solutions, with many models close to the Pareto line at BLEU scores above 32. Hardware prices in this analysis are based on AWS instance prices at the time of the test. Prices are not fixed technical quantities, so customers should base their evaluation on pricing from their system manufacturer or cloud provider.
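As a rough illustration of how instance price and throughput combine into a cost per translated word, the sketch below uses the c5.metal and g4dn.xlarge prices and the 80,000 words-per-second CPU figure quoted in this article. It assumes the instance is fully utilized and billed at the quoted hourly price, which real deployments may not match, and the function names are illustrative. This article states no GPU throughput, so the second function only computes the throughput another instance would need to match the CPU cost.

```python
SECONDS_PER_HOUR = 3600


def cost_per_million_words(usd_per_hour: float, words_per_second: float) -> float:
    """USD needed to translate one million words at full utilization."""
    return usd_per_hour / (words_per_second * SECONDS_PER_HOUR) * 1_000_000


def break_even_words_per_second(usd_per_hour: float, target_usd_per_million: float) -> float:
    """Throughput an instance needs to reach a target cost per million words."""
    return usd_per_hour / (target_usd_per_million * SECONDS_PER_HOUR) * 1_000_000


# Figures quoted in this article (AWS prices at the time of the evaluation).
C5_METAL_USD_PER_HOUR = 4.08
G4DN_XLARGE_USD_PER_HOUR = 0.526
CPU_WORDS_PER_SECOND = 80_000  # reported at roughly 32 BLEU on 48 cores

cpu_cost = cost_per_million_words(C5_METAL_USD_PER_HOUR, CPU_WORDS_PER_SECOND)
gpu_break_even = break_even_words_per_second(G4DN_XLARGE_USD_PER_HOUR, cpu_cost)

print(f"c5.metal: ${cpu_cost:.4f} per million words")
print(f"g4dn.xlarge break-even throughput: {gpu_break_even:,.0f} words/s")
```

Under these assumptions the CPU instance translates a million words for about $0.014, and the lower-priced GPU instance would need roughly 10,300 words per second to match that cost. This is a back-of-the-envelope comparison at a single quality point, not a substitute for the Pareto analysis in the task results.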

Figure 3: Accuracy (BLEU) vs Throughput/$: Submissions from the University of Edinburgh at a BLEU score of 32 using the Marian toolkit on Intel Xeon Scalable systems achieve competitive performance, with performance per dollar close to the Pareto frontier.
Picture source: https://neural.mt/speed/

Summary:

Results from WNGT 2020 reaffirm that Intel Xeon Platinum is effective in latency-sensitive inference scenarios for machine translation. The winning submissions from the University of Edinburgh used hardware features of the Intel Xeon processor, such as Intel Deep Learning Boost and Intel AVX-512, as well as software optimized for the CPU, such as the Intel Math Kernel Library, to achieve these results. Real-world deployments like these can continue to benefit from running business logic and inference logic close together, and get the most out of Intel Xeon Scalable systems when serving customers under real-world conditions and constraints.