Intel® DevCloud Published Datasets

Published: 04/26/2019  

Last Updated: 04/26/2019

Summary

The AI datasets described in this document were cleaned and preprocessed for use on the Intel® DevCloud. Descriptions, usage examples, keywords, and other information are included in this document. For details on the steps that were followed and the techniques used to preprocess and clean the datasets, see the respective Read Me files at the dataset location.

Disclaimer

All datasets provided for use on the Intel® DevCloud are subject to their relevant licensing terms, and therefore all users must adhere to the licensing requirements prior to use.

BoxCar21k

Description

BoxCar21k is an image dataset that contains 21,250 vehicles (63,750 images) of 27 different makes. The vehicles are divided into classes at three levels of detail:

  • 102 make and model classes
  • 126 make and model + sub-model classes
  • 148 make and model + sub-model + model year classes

A mask of each image is also available in the respective folders.

The dataset provides a bounding box around each vehicle; use it to train models that recognize and verify the make and model of a vehicle.

Dataset Details

Original Dataset: BoxCar21k Direct Download
Domain: Automobile
Dataset Estimate Size: 3 GB
Type of Dataset: Images
Cleaned Dataset Location on the Intel® DevCloud: /data/BoxCarDataset
Keywords: Vehicle recognition; classification
Usage Examples:
  • Vehicle model classification
  • Vehicle recognition
Licensing Information: Creative Commons* Attribution-NonCommercial-ShareAlike 4.0 International Public License

CreditCard

Description

This dataset contains transactions made by European cardholders in September 2013. It covers 284,807 credit card transactions that occurred over two days, of which 492 were fraudulent. The dataset is highly unbalanced: the positive class (fraud) accounts for 0.172 percent of all transactions. Among the 31 features of the dataset, V1 to V28 are principal components obtained with Principal Component Analysis (PCA). The Time feature contains the seconds elapsed between each transaction and the first transaction in the dataset. The Amount feature is the transaction amount. The Class feature is the response variable; it takes the value 1 for a fraudulent transaction and 0 otherwise.
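
The class imbalance quoted in the description follows directly from the two transaction counts. The following is a minimal sketch of that arithmetic, using only the figures stated in this section; the function name `class_share_percent` is illustrative and is not part of any Intel® DevCloud tooling.

```python
def class_share_percent(class_count: int, total_count: int) -> float:
    """Return the share of one class as a percentage of all rows."""
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    return 100.0 * class_count / total_count


# Counts taken from the dataset description in this section.
TOTAL_TRANSACTIONS = 284_807
FRAUD_TRANSACTIONS = 492

fraud_percent = class_share_percent(FRAUD_TRANSACTIONS, TOTAL_TRANSACTIONS)
non_fraud_percent = 100.0 - fraud_percent

print(f"fraud: {fraud_percent:.4f}%  non-fraud: {non_fraud_percent:.4f}%")
```

Because fewer than 2 in every 1,000 rows are fraudulent, a model that always predicts class 0 would still be right for more than 99.8 percent of rows, so plain accuracy is likely to be a misleading metric for this dataset.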

Dataset Details

Original Dataset: CreditCard Data
Domain: Banking / Finance
Dataset Estimate Size: 143 MB
Type of Dataset: Text / Finance
Cleaned Dataset Location on the Intel® DevCloud: /data/CreditCardDataset
Keywords: CreditCard; fraud
Usage Examples: Detect fraudulent transactions
Licensing Information: Creative Commons* Public Domain Mark 1.0

Telco Customer Churn

Description

This dataset contains customer data for telecom users. Each row represents a customer, and each column contains one customer attribute. The raw data contains 7,043 rows (customers) and 21 columns (features). The attributes include:

  1. Demographic information about customers: gender, age range, and whether they have partners and dependents.
  2. Services that each customer has signed up for: phone, multiple lines, internet, online security, online backup, device protection, tech support, streaming TV and movies.
  3. Customer account information: tenure (number of months the customer has been with the company), the customer’s contract term (month-to-month, one year, two years), payment method, whether or not the customer opted for paperless billing, monthly charges, and total charges.
  4. Customers who left within the last month (Churn column).

The dataset is unbalanced: 73 percent of the customers belong to the non-churn class.
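
The class balance can be checked by counting the values in the Churn column. The sketch below uses only the Python standard library and a tiny in-memory sample. The `Churn` column name comes from the description above, but the other column names, the `Yes`/`No` labels, and the helper name `churn_distribution` are assumptions made for illustration; check the Read Me file at the dataset location for the actual file name and schema.

```python
import csv
import io
from collections import Counter


def churn_distribution(csv_file, column="Churn"):
    """Return {label: fraction of rows} for the given column."""
    counts = Counter(row[column] for row in csv.DictReader(csv_file))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


# Tiny in-memory stand-in for the real file. The column names and the
# "Yes"/"No" labels are assumptions, not confirmed by this document.
sample = io.StringIO(
    "customerID,tenure,Churn\n"
    "A1,1,No\n"
    "A2,34,No\n"
    "A3,2,Yes\n"
    "A4,45,No\n"
)

distribution = churn_distribution(sample)
print(distribution)
```

To run the same check on the cleaned data, pass a file opened with `open(path, newline="")` instead of the sample, where `path` points to the CSV file under /data/TelcoCustomerChurnDataset.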

Dataset Details

Original Dataset: Telco Customer Churn Data
Domain: Telecom
Dataset Estimate Size: 172 KB
Type of Dataset: Text / Numeric
Cleaned Dataset Location on the Intel® DevCloud: /data/TelcoCustomerChurnDataset
Keywords: Telecom; churn
Usage Examples: Predict customer churn
Licensing Information: Creative Commons* Attribution 4.0 International

Chest X-ray Pneumonia

Description

This dataset contains 5,863 chest X-ray images (JPEG) in two image categories: Pneumonia and Normal. The dataset is organized into three folders (Train, Test, and Val), and each folder contains a subfolder for each image category (pneumonia/normal). The chest X-ray images (anterior-posterior) were selected from retrospective cohorts of pediatric patients aged one to five years from Guangzhou Women and Children’s Medical Center in Guangzhou, China.

All chest X-ray imaging was performed as part of the patients’ routine clinical care. For quality control, all low-quality or unreadable scans were removed. Before the images were cleared for training the AI system, their diagnoses were graded by two expert physicians. The evaluation set was also checked by a third expert to ensure there were no grading errors.
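
The folder layout described above (one folder per split, one subfolder per category) can be inspected with a short script. This is a minimal sketch that assumes the split folders sit directly under /data/ChestXrayDataset and that images use a `.jpeg` or `.jpg` extension; neither detail is confirmed here, and the helper name `count_images` is illustrative.

```python
from pathlib import Path

# Assumed file extensions for the JPEG images.
IMAGE_SUFFIXES = {".jpeg", ".jpg"}


def count_images(root):
    """Map (split, category) folder names to the number of JPEG files."""
    counts = {}
    for split_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        for class_dir in sorted(p for p in split_dir.iterdir() if p.is_dir()):
            n = sum(
                1
                for f in class_dir.iterdir()
                if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
            )
            counts[(split_dir.name, class_dir.name)] = n
    return counts


if __name__ == "__main__":
    # Assumed root; adjust if the splits live in a nested folder.
    root = Path("/data/ChestXrayDataset")
    if root.is_dir():
        for (split, category), n in count_images(root).items():
            print(f"{split}/{category}: {n}")
```

The function reads folder names as they appear on disk rather than hard-coding them, so it works whether the folders are named `Train` or `train`. Comparing the per-category counts within each split is a quick way to see how balanced the two categories are before training.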

Dataset Details

Original Dataset: Chest X-ray Pneumonia Data
Domain: Healthcare
Dataset Estimate Size: 1 GB
Type of Dataset: Images
Cleaned Dataset Location on the Intel® DevCloud: /data/ChestXrayDataset
Keywords: Chest X-ray; pneumonia
Usage Examples: Create two classifications:
  • Chest X-ray: normal or pneumonia
  • Chest X-ray: normal, viral pneumonia, or bacterial pneumonia
Licensing Information: Creative Commons* Attribution 4.0 International

BOSCH

Description

The Bosch Production Line Performance (PLP) dataset represents measurements of parts as they move through production lines. The data consists of a large number of anonymized features for each part. Each part has a unique ID. The value of the Response variable indicates the quality-control outcome for the part.

Dataset Details

Original Dataset: Bosch Production Line Performance Data
Domain: Manufacture / Production
Dataset Estimate Size: 820 MB
Type of Dataset: Text / Numeric
Cleaned Dataset Location on the Intel® DevCloud: /data/Bosch
Keywords: Prediction; production line
Usage Examples: Predict internal failures using thousands of measurements and tests made for each component moving along an assembly line.
Licensing Information: Creative Commons* Attribution 4.0 International

UK Smart Meter

Description

This open public data was taken from two datasets.

The first dataset contains public SmartMeter energy consumption data for households in London, England. The energy consumption readings came from a sample of 5,567 London households that participated in the Low Carbon London project, led by UK Power Networks, between November 2011 and February 2014. The energy consumption dataset contains energy consumed in kWh (per half hour), unique household identifiers, dates and times, and CACI Acorn group information. The unzipped CSV file is about 10 GB and contains about 167 million rows.

The second dataset, taken from Kaggle*, is a modified version of the data from the London Datastore. It also includes weather data for the London area from the Dark Sky API and ACORN classification details from the CACI website.
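
Because the readings are recorded per half hour, forecasting monthly consumption usually starts by summing them per household and month. The sketch below shows one way to do that with the Python standard library, reading one row at a time so that a file with about 167 million rows does not need to fit in memory. The column names (`household_id`, `timestamp`, `kwh`), the timestamp format, and the helper name `monthly_kwh` are all assumptions for illustration; check the Read Me file at the dataset location for the actual schema.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime


def monthly_kwh(csv_file, id_col="household_id", time_col="timestamp",
                kwh_col="kwh", time_format="%Y-%m-%d %H:%M:%S"):
    """Sum half-hourly kWh readings into {(household, "YYYY-MM"): kWh}."""
    totals = defaultdict(float)
    for row in csv.DictReader(csv_file):
        try:
            kwh = float(row[kwh_col])
        except ValueError:
            continue  # skip blank or non-numeric readings
        month = datetime.strptime(row[time_col], time_format).strftime("%Y-%m")
        totals[(row[id_col], month)] += kwh
    return dict(totals)


# Tiny in-memory stand-in for the real file; the schema is hypothetical.
sample = io.StringIO(
    "household_id,timestamp,kwh\n"
    "H1,2013-01-31 23:00:00,0.25\n"
    "H1,2013-01-31 23:30:00,0.25\n"
    "H1,2013-02-01 00:00:00,0.5\n"
    "H2,2013-01-31 23:30:00,\n"
    "H2,2013-02-01 00:00:00,1.0\n"
)

totals = monthly_kwh(sample)
print(totals)
```

Memory use grows with the number of household and month pairs rather than with the number of readings, which keeps the aggregation practical for the full file.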

Dataset Details

Original Datasets:
  • SmartMeter Energy Consumption Data in London Households
  • Weather Actual/Forecast Data
Domain: Household
Dataset Estimate Size: 540 MB
Type of Dataset: Text / Numeric
Cleaned Dataset Location on the Intel® DevCloud:
  • /data/UKSmartMeter
  • /data/smartmeter
Keywords: Forecasting; prediction; energy consumption
Usage Examples:
  • Forecast the monthly electricity consumption of a household
  • Recommend electricity plans for households
  • Segment the daily consumption pattern
Licensing Information:
  • SmartMeter Energy Consumption Data in London Households on UK Power Networks
  • Smart meters in London on Kaggle*
