Intel® DevCloud Published Datasets
Published: 04/26/2019
Last Updated: 04/26/2019
Summary
The AI datasets described in this document were cleaned and preprocessed for use on the Intel® DevCloud. This document provides a description, usage examples, keywords, and licensing information for each dataset. For details on the preprocessing and cleaning steps and techniques, see the README file at each dataset's location.
Disclaimer
All datasets provided for use on the Intel® DevCloud are subject to their relevant licensing terms, and therefore all users must adhere to the licensing requirements prior to use.
BoxCar21k
Description
BoxCar21k is an image dataset that contains 21,250 vehicles (63,750 images) of 27 different makes. The vehicles are divided into classes at three levels of granularity:
- 102 make and model classes
- 126 make and model + submodel classes
- 148 make and model + submodel + model year classes
A mask of each image is also available in the respective folders.
The dataset provides a bounding box around each vehicle; use it to help recognize and verify a vehicle's make and model.
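The three class granularities listed above can be read as progressively longer label tuples. The following is a minimal sketch of that idea, assuming a hypothetical `VehicleLabel` record. The field names, the `LEVELS` keys, and the example values are illustrative and are not taken from the dataset's annotation files.

```python
from typing import NamedTuple


class VehicleLabel(NamedTuple):
    """Hypothetical annotation record; the field names are assumptions."""
    make: str
    model: str
    submodel: str
    model_year: str


# Number of leading fields that define a class at each granularity.
LEVELS = {
    "make_model": 2,
    "make_model_submodel": 3,
    "make_model_submodel_year": 4,
}


def class_name(label: VehicleLabel, level: str) -> str:
    """Join the leading fields of the label into a single class string."""
    return " ".join(label[:LEVELS[level]])


# Made-up example values, used only to show the three granularities.
example = VehicleLabel("skoda", "octavia", "combi", "mk2")
for level in LEVELS:
    print(level, "->", class_name(example, level))
```

A coarser level merges vehicles that differ only in the trailing fields, which is consistent with the class count growing from 102 to 148 as fields are added.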
Dataset Details
Original Dataset | BoxCar21k Direct Download |
Domain | Automobile |
Dataset Estimate Size | 3 GB |
Type of Dataset | Images |
Cleaned Dataset Location on the Intel® DevCloud | /data/BoxCarDataset |
Keywords | Vehicle recognition; classification |
Usage Examples | Vehicle model classification; vehicle recognition |
Licensing Information | Creative Commons* Attribution-NonCommercial-ShareAlike 4.0 International Public License |
CreditCard
Description
This dataset contains transactions made by European cardholders in September 2013. It presents 284,807 credit card transactions that occurred over two days, of which 492 were fraudulent. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172 percent of all transactions. Among the 31 features of the dataset, V1 to V28 are principal components obtained with Principal Component Analysis (PCA). The Time feature contains the seconds elapsed between each transaction and the first transaction in the dataset. The Amount feature is the transaction amount. The Class feature is the response variable; it takes the value 1 for a fraudulent transaction and 0 otherwise.
Dataset Details
Original Dataset | CreditCard Data |
Domain | Banking / Finance |
Dataset Estimate Size | 143 MB |
Type of Dataset | Text / Numeric |
Cleaned Dataset Location on the Intel® DevCloud | /data/CreditCardDataset |
Keywords | CreditCard; fraud |
Usage Examples | Detect fraudulent transactions |
Licensing Information | Creative Commons* Public Domain Mark 1.0 |
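With 492 frauds among 284,807 transactions, training usually benefits from some form of class re-weighting. The sketch below recomputes the fraud share from the figures in the description and derives inverse-frequency ("balanced") class weights. Re-weighting is one common option for unbalanced data, not a step prescribed by the dataset, and the helper name `balanced_class_weights` is hypothetical.

```python
TOTAL = 284_807  # transactions, from the description above
FRAUDS = 492     # positive class (Class == 1), from the description above


def balanced_class_weights(counts: dict) -> dict:
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * c) for label, c in counts.items()}


counts = {0: TOTAL - FRAUDS, 1: FRAUDS}
fraud_share = 100 * FRAUDS / TOTAL
weights = balanced_class_weights(counts)

print(f"fraud share: {fraud_share:.4f} percent")
print(f"weight for class 0: {weights[0]:.3f}")
print(f"weight for class 1: {weights[1]:.1f}")
```

With these weights, each class contributes the same total weight to a loss function, so the rare fraud class is not drowned out by the majority class.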
Telco Customer Churn
Description
This dataset contains customer data for telecom users. Each row represents a customer and each column represents a customer attribute. The raw data contains 7,043 rows (customers) and 21 columns (features). The attributes include:
- Demographic information about customers: gender, age range, and whether they have partners and dependents.
- Services that each customer has signed up for: phone, multiple lines, internet, online security, online backup, device protection, tech support, streaming TV and movies.
- Customer account information: tenure (number of months the customer has been with the company), the customer’s contract term (month-to-month, one year, two years), payment method, whether or not the customer opted for paperless billing, monthly charges, and total charges.
- Customers who left within the last month (Churn column).
The dataset is unbalanced, with 73 percent of the customers belonging to the non-churn class.
Dataset Details
Original Dataset | Telco Customer Churn Data |
Domain | Telecom |
Dataset Estimate Size | 172 KB |
Type of Dataset | Text / Numeric |
Cleaned Dataset Location on the Intel® DevCloud | /data/TelcoCustomerChurnDataset |
Keywords | Telecom; churn |
Usage Examples | Predict customer churn |
Licensing Information | Creative Commons* Attribution 4.0 International |
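Because about 73 percent of customers belong to the non-churn class, a model that always predicts "no churn" already reaches roughly 73 percent accuracy. The sketch below computes that majority-class baseline. It assumes the Churn column holds the strings "Yes" and "No", and it uses a made-up sample that mirrors the stated split rather than the real data.

```python
from collections import Counter


def majority_baseline(labels):
    """Return (most common label, accuracy of always predicting it)."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    label, hits = counts.most_common(1)[0]
    return label, hits / len(labels)


# Made-up sample mirroring the 73/27 split described above; not real data.
sample = ["No"] * 73 + ["Yes"] * 27
label, accuracy = majority_baseline(sample)
print(f"always predicting {label!r} gives accuracy {accuracy:.2f}")
```

A churn model is only useful if it beats this baseline, so metrics that focus on the churn class, such as recall or precision, tend to be more informative than raw accuracy here.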
Chest X-ray Pneumonia
Description
This dataset contains 5,863 chest X-ray images (JPEG) in two image categories: Pneumonia and Normal. The dataset is organized into three folders (Train, Test, and Val), and each folder contains a subfolder for each image category (pneumonia/normal). Chest X-ray images (anterior-posterior) were selected from retrospective cohorts of pediatric patients aged one to five years from Guangzhou Women and Children’s Medical Center in Guangzhou, China.
All chest X-ray imaging was performed as part of the patients’ routine clinical care. For quality control, all low-quality or unreadable scans were removed. Two expert physicians then graded the diagnoses for the images before they were cleared for training the AI system. A third expert also checked the evaluation set to ensure there were no grading errors.
Dataset Details
Original Dataset | Chest X-ray Pneumonia Data |
Domain | Healthcare |
Dataset Estimate Size | 1 GB |
Type of Dataset | Images |
Cleaned Dataset Location on the Intel® DevCloud | /data/ChestXrayDataset |
Keywords | Chest X-ray; pneumonia |
Usage Examples | Create two classifications: (1) chest X-ray: normal or pneumonia; (2) chest X-ray: normal, viral pneumonia, or bacterial pneumonia |
Licensing Information | Creative Commons* Attribution 4.0 International |
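Since the images are arranged as one folder per split with one subfolder per category, a quick count per (split, category) pair is a useful first check of class balance. The following is a minimal sketch using only the standard library. The exact folder-name casing and file extensions in the cleaned copy are assumptions, so the function discovers folder names instead of hard-coding them.

```python
from pathlib import Path


def count_images(root, suffixes=(".jpeg", ".jpg")):
    """Count image files per (split, category) under root/<split>/<category>/."""
    counts = {}
    for split_dir in sorted(Path(root).iterdir()):
        if not split_dir.is_dir():
            continue
        for class_dir in sorted(split_dir.iterdir()):
            if not class_dir.is_dir():
                continue
            n = sum(
                1
                for p in class_dir.iterdir()
                if p.is_file() and p.suffix.lower() in suffixes
            )
            counts[(split_dir.name, class_dir.name)] = n
    return counts


root = Path("/data/ChestXrayDataset")  # location from the table above
if root.is_dir():  # skipped when not running on the DevCloud
    for key, n in count_images(root).items():
        print(key, n)
```

If the cleaned copy nests the split folders one level deeper, point `root` at that inner folder instead.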
BOSCH
Description
The Bosch Production Line Performance (PLP) dataset represents measurements of parts moving through production lines. The data consists of a large number of anonymized features for each part. Each part has a unique ID. The value of the Response variable indicates the quality-control outcome of the part.
Dataset Details
Original Dataset | Bosch Production Line Performance Data |
Domain | Manufacturing / Production |
Dataset Estimate Size | 820 MB |
Type of Dataset | Text / Numeric |
Cleaned Dataset Location on the Intel® DevCloud | /data/Bosch |
Keywords | Prediction; production line |
Usage Examples | Predict internal failures using thousands of measurements and tests made for each component moving along an assembly line. |
Licensing Information | Creative Commons* Attribution 4.0 International |
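At roughly 820 MB with a large number of feature columns, the data is easier to explore by streaming rows than by loading everything at once. The sketch below tallies the Response values row by row with the standard `csv` module. The sample rows and the column names `Id` and `L0_feature` are made up for illustration; only the `Response` name comes from the description above.

```python
import csv
from collections import Counter


def tally_response(lines, column="Response"):
    """Stream CSV rows and count the values found in one column.

    `lines` is any iterable of text lines (for example an open file),
    so the whole file never has to be held in memory.
    """
    reader = csv.DictReader(lines)
    return Counter(row[column] for row in reader)


# Tiny made-up sample; the Id and L0_feature column names are assumptions.
sample = [
    "Id,L0_feature,Response",
    "1,0.03,0",
    "2,,0",
    "3,-0.12,1",
]
print(tally_response(sample))
```

For a real file, pass a file object opened with `open(path, newline="")` in place of `sample`; the file names under /data/Bosch are not listed in this document.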
UK Smart Meter
Description
This open public data was taken from two datasets.
The first dataset contains public SmartMeter energy consumption data for households in London, England. The readings came from a sample of 5,567 London households that participated in the Low Carbon London project, led by UK Power Networks, between November 2011 and February 2014. The dataset contains energy consumed in kWh (per half hour), unique household identifiers, dates and times, and CACI Acorn group information. The unzipped CSV file is about 10 GB and contains about 167 million rows.
The second dataset, taken from Kaggle*, is a modified version of the data from the London Datastore. It also includes weather data for the London area from the Dark Sky API and Acorn classification details from the CACI website.
Dataset Details
Original Datasets | SmartMeter Energy Consumption Data in London Households; Weather Actual/Forecast Data |
Domain | Household |
Dataset Estimate Size | 540 MB |
Type of Dataset | Text / Numeric |
Cleaned Dataset Location on the Intel® DevCloud | /data/UKSmartMeter; /data/smartmeter |
Keywords | Forecasting; prediction; energy consumption |
Usage Examples | Forecast the monthly electricity consumption of a household; recommend electricity plans for households; segment the daily consumption pattern |
Licensing Information | SmartMeter Energy Consumption Data in London Households on UK Power Networks; Smart meters in London on Kaggle* |
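Forecasting monthly consumption starts with rolling the half-hourly kWh readings up into monthly totals per household. The following is a minimal sketch of that aggregation using only the standard library. The tuple layout, the timestamp format, and the household identifiers in the sample are assumptions for illustration and may not match the cleaned files.

```python
from collections import defaultdict
from datetime import datetime


def monthly_totals(readings):
    """Sum half-hourly kWh readings into per-household monthly totals.

    `readings` yields (household_id, timestamp, kwh) tuples, where the
    timestamp is a string such as "2013-01-01 00:30:00" (assumed format).
    """
    totals = defaultdict(float)
    for household, stamp, kwh in readings:
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        totals[(household, when.year, when.month)] += kwh
    return dict(totals)


# Made-up readings; the household identifiers are hypothetical.
sample = [
    ("MAC000002", "2013-01-01 00:00:00", 0.25),
    ("MAC000002", "2013-01-01 00:30:00", 0.50),
    ("MAC000002", "2013-02-01 00:00:00", 0.125),
    ("MAC000003", "2013-01-15 12:00:00", 1.0),
]
for key, kwh in sorted(monthly_totals(sample).items()):
    print(key, kwh)
```

Because the function consumes any iterable, it can be fed rows streamed from a CSV reader, which matters for a file with about 167 million rows.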