A Tutorial Series for Software Developers, Data Scientists, and Data Center Managers
There are situations when we need to bring human intelligence (HI) to an artificial intelligence (AI) project, that is, to have actual humans do intelligent work such as annotating images or transcribing text. For example, we might need it during the dataset search stage, the data annotation stage, or the AI evaluation stage. When the project is small, you can do this work in a lab with a small team. However, the majority of AI projects require high throughput (for example, annotating a large image collection) or labels from a representative user base rather than a small research team (for example, a large-scale survey or highly personal emotional tags). That's why many people must be involved at the same time. Crowdsourcing serves as a scalable solution to satisfy the need to add HI to AI.
In this article, we introduce the Amazon Mechanical Turk* (MTurk*) crowdsourcing marketplace and explain its key terminology. We also share tips on how to use crowdsourcing through quality control, budget optimization, and more.
Introduction to Amazon Mechanical Turk
Amazon Mechanical Turk is a crowdsourcing Internet marketplace for work, which functions at a high level as follows:
- Employers post jobs or (micro) tasks and specify the payouts, number of tasks, and so on.
- Workers (called providers in MTurk’s Terms of Service or, more colloquially, turkers) can then browse among existing jobs using a search interface and complete tasks in exchange for a monetary payment set by the employer.
Figure 1. Main page of the official Amazon Mechanical Turk* site.
Let’s first define the relevant terminology and describe the Mechanical Turk ecosystem, which is a set of websites that you must be familiar with in order to use Mechanical Turk effectively.
- Human Intelligence Task (HIT) is an atomic (micro) task that an individual worker completes on the Mechanical Turk platform. Some example HITs are:
- Choosing the best among several photographs of a storefront
- Writing product descriptions
- Identifying performers on music CDs
- HIT Group is a set of related HITs. Each HIT is an instance of a HIT Group parameterized by a few fields serving as inputs to the workers. One can think of HIT Groups as task templates. For example, a HIT Group could define an image tagging task, while each HIT is a specific image that needs to be tagged.
- Worker Qualification is a screening requirement, typically granted through a dedicated HIT Group, that requesters use to select diligent workers or only those workers who fit the study, such as in a demographic survey.
- HIT Rating is a score that each worker accumulates by working on HITs. When a HIT is submitted to the requester for review, the requester can accept or reject it. This is typically done automatically, but some requesters review high-paying HITs manually to avoid unnecessary spending. If the work is rejected, the worker's rating drops; the more HITs a worker has submitted over their lifetime, the smaller the impact of a single rejection. For example, if a worker submitted 1,000 HITs and the work was accepted 950 times, the HIT rating is 95 percent. Workers with low ratings cannot work on some HIT Groups, such as those with high qualification requirements, and therefore they try to avoid rejections.
Related Mechanical Turk Sites
There are also unofficial sites where Mechanical Turk workers can share their experience working on the HITs of various requesters. Sharing experiences helps hold requesters accountable for their actions and helps prevent low payments, high rejection rates, ill-defined HIT Groups, and so on.
Here workers can rate requesters, rate HITs, and search for well-paid HITs and generous requesters.
This is a forum where workers socialize with each other and discuss best practices for earning money online, specifically on the Mechanical Turk platform.
To place jobs, employers use the Mechanical Turk API or the more limited Mechanical Turk requester site with its HIT Creation Wizard. Employers can request multiple workers per HIT to achieve more accurate results. This is a common quality-control technique that is covered in more detail below. The following settings are available to requesters:
- Number of workers per HIT
- Number of different HITs within a HIT Group
- Reward per HIT per worker
- Time interval to submit the work once the HIT has been accepted and the worker starts working on the task. This helps avoid dangling HITs that are accepted but never completed.
- Time to pay the worker for the completed work
- HIT name and tags
- HIT user interface
- Worker qualifications
- Worker location
- Worker rating
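Many of the settings above map directly onto parameters of the `create_hit` call in the MTurk API. The sketch below builds an illustrative parameter set using boto3 (the AWS SDK for Python, assumed installed); the title, reward, and durations are example values, not recommendations, and the `Question` payload is only a placeholder.

```python
# Illustrative create_hit parameters; values are examples, not recommendations.
hit_params = {
    "Title": "Tag objects in an image",          # HIT name
    "Keywords": "image, tagging, labeling",      # tags for keyword search
    "Description": "Choose tags that describe the image.",
    "Reward": "0.05",                            # reward per HIT per worker, USD
    "MaxAssignments": 3,                         # number of workers per HIT
    "AssignmentDurationInSeconds": 600,          # time to submit after accepting
    "AutoApprovalDelayInSeconds": 86400,         # time until the worker is auto-paid
    "LifetimeInSeconds": 7 * 24 * 3600,          # how long the HIT stays visible
    "Question": "<ExternalQuestion>...</ExternalQuestion>",  # placeholder UI payload
}

# To actually post the HIT (requires AWS credentials and a requester account):
# import boto3
# mturk = boto3.client("mturk", region_name="us-east-1")
# response = mturk.create_hit(**hit_params)
```

Posting one such call per row of your input data produces the individual HITs of a HIT Group.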
Getting the Most from Mechanical Turk
Mechanical Turk provides human intelligence at scale. This is achieved by tapping into a distributed network of remote workers, which in turn creates organizational challenges. Workers might disappear or drop tasks, or cheat by trying to find shortcuts to complete the tasks faster without contributing actual value with their work. Also, since many requesters post tasks, there is no guarantee that a task will be completed. The requesters compete for workers’ attention and task discoverability is an important issue.
The following is a set of techniques for using Mechanical Turk effectively despite these challenges. They address the quality of the results, the monetary cost of completing a HIT Group, and the time to complete a HIT Group; different techniques improve different variables.
Each individual worker might submit an incorrect answer or do low-quality work. Therefore, a popular technique is to “hire” multiple workers per HIT and use the redundancy in answers to eliminate errors. Theoretically, with many workers the error will become negligible. For example, one can ask nine workers to type the text from an image, and find the most popular answer.
This technique is easy to implement using the settings that MTurk provides because you can specify how many redundant workers to assign per HIT. The disadvantage is that you must pay many workers for the same work.
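Aggregating redundant answers can be as simple as taking the most popular response. A minimal sketch, assuming each worker's answer is a plain string:

```python
from collections import Counter

def aggregate_answers(answers):
    """Return the most common answer among redundant workers' submissions."""
    most_common, count = Counter(answers).most_common(1)[0]
    return most_common

# Nine workers transcribe the same image; the majority answer wins
# despite three noisy submissions.
transcriptions = ["hello", "hello", "he1lo", "hello", "hello",
                  "hallo", "hello", "hello", "hell0"]
print(aggregate_answers(transcriptions))  # → hello
```

For free-text answers, you may want fuzzier matching (for example, clustering by edit distance) rather than exact-string voting.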
Diligent workers spend a reasonable amount of time on a HIT, while spammers and cheaters accept a HIT, do a minimal amount of meaningless work, and then submit it, expecting approval if the requester does no quality control. Therefore, it makes sense to filter out submissions that are completed suspiciously quickly. Typically, requesters fit a Gaussian distribution to the completion times and reject submissions faster than a cutoff, such as 1.5 standard deviations below the mean. Alternatively, one can simply drop the quickest five percent of submissions.
This approach is easy to implement and cost effective because only “good” submissions are compensated. However, it might be unfair to some workers who simply work more quickly than others, and there is no guarantee that there are no “slow-spammers.”
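Both cutoff rules are straightforward to implement once you have the completion times (in seconds) of each submission. A minimal sketch of the two variants:

```python
import statistics

def filter_fast_submissions(times, k=1.5):
    """Keep submissions no faster than k standard deviations below the mean
    completion time (Gaussian-cutoff variant)."""
    mean = statistics.mean(times)
    stdev = statistics.stdev(times)
    cutoff = mean - k * stdev
    return [t for t in times if t >= cutoff]

def drop_fastest_fraction(times, frac=0.05):
    """Alternative: drop the quickest `frac` of submissions outright."""
    n_drop = int(len(times) * frac)
    return sorted(times)[n_drop:]

# Five plausible completion times and one suspiciously fast one.
times = [30, 32, 31, 29, 33, 2]
print(filter_fast_submissions(times))  # → [30, 32, 31, 29, 33]
```

In practice you would carry worker and assignment IDs alongside the times so that the rejected submissions can be identified.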
Based on the assumption that spammers don't read the instructions carefully and therefore cannot answer even basic questions about a HIT, requesters incorporate questions, or even entire HITs, with known answers into their HIT Groups. Workers' answers can then be automatically compared with the known answers, and all HITs from workers who made a mistake can be rejected.
Similar to the previous technique used for time cutoff, this approach is cost effective, but it might also be unfair to some workers. It might also be more complex to implement because the requester must prepare the verification questions.
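Checking known-answer ("gold") questions reduces to a dictionary comparison once the answers are collected. A minimal sketch; the question-ID and answer formats here are illustrative assumptions:

```python
def passes_gold_checks(worker_answers, gold_answers):
    """Accept a submission only if every known-answer question is correct.

    worker_answers: dict mapping question id -> the worker's answer
    gold_answers:   dict mapping question id -> the known correct answer
    """
    return all(worker_answers.get(qid) == expected
               for qid, expected in gold_answers.items())

# One gold question ("q2") hidden among the regular questions.
submission = {"q1": "cat", "q2": "dog", "q3": "tree"}
print(passes_gold_checks(submission, {"q2": "dog"}))  # → True
```

A softer policy, such as requiring a minimum fraction of gold questions correct, is fairer to honest workers who make an occasional slip.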
Mechanical Turk provides an opportunity to test workers before they can work on your HITs. Qualification HITs are just like regular HITs, but completing one unlocks access to other (typically well-paid) HITs from the same requester. One can think of qualification HITs as (micro)internships.
This is an effective technique if you plan to post thousands of tasks or have very specific restrictions on the workers’ demographics, such as workers from a specific country.
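In the MTurk API, an auto-graded qualification test is defined with `create_qualification_type` (available through boto3). The sketch below builds an illustrative parameter set; the name, durations, and the placeholder `Test`/`AnswerKey` XML payloads are assumptions, not working question forms.

```python
# Illustrative parameters for an auto-graded qualification test.
qualification_params = {
    "Name": "Image tagging screening",           # assumed example name
    "Description": "Short screening test; passing unlocks the paid tagging HITs.",
    "QualificationTypeStatus": "Active",
    "Test": "<QuestionForm>...</QuestionForm>",  # placeholder question payload
    "AnswerKey": "<AnswerKey>...</AnswerKey>",   # placeholder auto-grading key
    "TestDurationInSeconds": 900,                # time allowed for the test
}

# To register the qualification type (requires credentials):
# import boto3
# mturk = boto3.client("mturk", region_name="us-east-1")
# qual = mturk.create_qualification_type(**qualification_params)
```

The returned qualification type ID can then be attached to later HIT Groups as a `QualificationRequirement`, so only workers who passed the test see them.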
Zero Reward, in which no payment is made, creates a separating equilibrium in the marketplace and segments workers into two groups: those deeply interested in the HIT Group and the task (intrinsic motivation) and those who aren't interested. Spammers chasing money aren't interested in working on such HITs. As a result, the answers that unpaid workers submit are typically of high quality.
Rather than posting an isolated HIT Group, you can construct a more complex workflow in which HIT Groups placed later in the workflow verify earlier work (workers review the work of other workers). If a worker in the verification HIT accepts the result, the worker who made the original submission is paid. Workers on verification HITs are always paid, or an additional layer of quality control can be applied to them.
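The payment logic of such a two-stage workflow can be sketched in a few lines. This is a simplified model under assumed data structures (worker/assignment ID pairs and a verdict map), not an MTurk API call:

```python
def settle_payments(originals, verdicts):
    """Return the original workers to pay in a two-stage workflow.

    originals: list of (worker_id, assignment_id) pairs from the first stage
    verdicts:  dict mapping assignment_id -> "accepted" or "rejected",
               produced by the verification HITs of the second stage
    Verification workers themselves are paid unconditionally and are not
    modeled here.
    """
    return [worker_id for worker_id, assignment_id in originals
            if verdicts.get(assignment_id) == "accepted"]

payable = settle_payments([("w1", "a1"), ("w2", "a2")],
                          {"a1": "accepted", "a2": "rejected"})
print(payable)  # → ['w1']
```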
Workers who successfully complete many HITs have high ratings (the number of completed HITs and approval rate). You can specify the acceptable rating during the HIT Group creation time and Mechanical Turk won’t show your HITs to workers with lower than allowed ratings.
As in any labor market, higher rewards to workers lead to better quality and faster completion times. However, nobody wants to spend money unnecessarily. There are several ways you can reduce the cost of an HI project.
Find the Optimal Reward with Binary Search
At the beginning of a large HI project, you can search for the optimal reward, given your quality and time requirements, by posting a series of similar HITs with different reward levels. Start from a high reward and halve it at each step. At some point, the quality or the time to complete a HIT Group will become substandard; the reward from the previous step is then considered optimal. Surprisingly, workers provide valuable contributions even for $0.01, the minimum reward on Mechanical Turk.
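The halving procedure can be sketched as a small loop. Here `is_acceptable(reward)` is a hypothetical stand-in for posting a pilot HIT Group at that reward and checking whether quality and completion time met your bar:

```python
def find_optimal_reward(initial_reward, is_acceptable, min_reward=0.01):
    """Halve the reward until results become substandard; return the last
    reward that still met the bar, or None if even the initial one failed."""
    reward = initial_reward
    best = None
    while reward >= min_reward:
        if not is_acceptable(reward):   # pilot HIT Group fell below the bar
            break
        best = reward                   # this reward still worked
        reward = round(reward / 2, 2)   # try half as much next
    return best

# Toy oracle: suppose quality holds down to $0.10 per HIT.
print(find_optimal_reward(0.80, lambda r: r >= 0.10))  # → 0.1
```

Each `is_acceptable` call costs a pilot batch, so in practice the search terminates after only a handful of reward levels.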
Batch Multiple HITs into One HIT
Each HIT is typically made of two parts: the instructions for the HIT and the HIT itself. Workers spend time reading instructions for each HIT, which is a disadvantage. It would be better to have workers read the instructions once for the entire HIT Group and do many HITs at the same time. For example, rather than having an image-tagging HIT with just one image to tag, you can ask workers to tag 5 or 10 images in the same HIT.
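Batching is a one-line grouping step over your input items before the HITs are posted. A minimal sketch, assuming the items are image URLs:

```python
def batch_hits(items, batch_size=10):
    """Group individual items (e.g., image URLs) into multi-item HITs so
    workers read the instructions once and process several items per HIT."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

# 25 images become three HITs of 10, 10, and 5 images.
batches = batch_hits([f"img_{i}.jpg" for i in range(25)], batch_size=10)
print([len(b) for b in batches])  # → [10, 10, 5]
```

Remember to scale the per-HIT reward with the batch size so the effective pay per item stays constant.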
Speed Up Task Completion
Workers react to incentives. The higher the reward, the more interested workers are in working on a HIT.
Improve the discoverability of HITs with sorting in Mechanical Turk:
- By reward amount
- Alphabetically by title
- By recency
- By allotted completion time
- By the number of HITs
Some of these sorting criteria are hard to manipulate, while others, like HIT Group Title, are flexible and tweakable:
- You can name your HIT Groups with leading exclamation marks, such as “!!!Tag Images,” or with the letter Z, such as “Z Tag Images.”
- You can post and repost HITs from the same HIT Group periodically, thus achieving higher ranks in the recency-based lists.
It may be helpful to add meaningful tags or keywords to the HIT Group description during the setup to make sure that workers using keyword search can easily find it.
Develop relationships with workers and advertise your HITs widely. After you’ve used Mechanical Turk for a while, you’ll most likely receive emails from workers asking you about your HITs. You can use these contacts to reach out to workers when you post a new HIT Group. You can also post about your new HIT Groups on related forums, like TurkerNation.
In this article, we introduced crowdsourcing as a scalable and effective way for data annotation. We covered the basics of the Amazon Mechanical Turk crowdsourcing marketplace and discussed numerous techniques on how to get the most out of it. In the next article, we will demonstrate how to use this marketplace in a real crowdsourcing pipeline.