Background
Challenges associated with data privacy, limited data availability, data labeling, ineffective data governance, high costs, and the need for a high volume of data are driving the use of synthetic data to fulfill the high demand for AI solutions across industries.
Specifically, data is often unavailable for entirely new products, and human annotation of data is expensive and time-consuming. Businesses can avoid both problems by investing in synthetic data, which can be generated quickly and helps in developing robust machine learning models.
Solution
In collaboration with Accenture*, Intel developed a structured data generation AI reference kit to help you generate structured synthetic data.
Synthetic data, a new class of data used for AI development and applications, is increasingly contributing to the global AI training dataset market. Using synthetic data increases accessibility and drives innovation toward newer and better AI solutions. As synthetic data becomes a growing focus for AI solutions, it will become increasingly important to optimize the AI pipeline wherever possible. Intel strives to bring those optimizations to all parts of the data science pipeline, from data generation through model inference.
Python* offers multiple capabilities for generating and handling synthetic data depending on the use case. For example, time-series data is a subclass of structured data collected across numerous industries. TimeSynth, an open source library that provides classes of signals and noise, can be used alongside existing Python libraries such as SciPy and NumPy to mimic real time-series data.
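As a rough illustration of the signal-plus-noise idea, the sketch below builds a noisy sinusoid on irregular timestamps using NumPy only. It does not use the TimeSynth API or the reference kit's code; the function name `synth_sensor_series` and all of its parameters are hypothetical and chosen only for this example.

```python
import numpy as np


def synth_sensor_series(num_points=200, stop_time=20.0, frequency=0.25,
                        amplitude=1.0, noise_std=0.3, seed=0):
    """Return (timestamps, values) for a noisy sinusoid on irregular timestamps.

    Hypothetical helper for illustration; not part of TimeSynth or the kit.
    """
    rng = np.random.default_rng(seed)
    # Irregular sampling: draw timestamps uniformly over [0, stop_time) and sort them.
    timestamps = np.sort(rng.uniform(0.0, stop_time, size=num_points))
    # Deterministic signal component.
    signal = amplitude * np.sin(2.0 * np.pi * frequency * timestamps)
    # Additive Gaussian noise component.
    noise = rng.normal(0.0, noise_std, size=num_points)
    return timestamps, signal + noise


timestamps, values = synth_sensor_series()
```

Fixing the seed makes the output reproducible, which is convenient when the synthetic series feeds a benchmark or a test.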
End-to-End Flows Using Intel® AI Software Products
Industry-specific datasets were created for utilities and e-commerce, while the industry-agnostic dataset lets you customize details of the synthetic data, such as the number of columns, data types, distributions, and weights. Each dataset has additional techniques and metadata that apply to its context, based on the values provided in the configuration file.
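A minimal sketch of configuration-driven generation might look like the following. The configuration layout, the column names, and the `generate_table` function are assumptions made for this example; the kit's actual configuration file format and code may differ.

```python
import numpy as np

# Hypothetical configuration; the kit's real configuration file may look different.
CONFIG = {
    "num_rows": 1000,
    "seed": 42,
    "columns": [
        {"name": "voltage", "type": "numeric", "distribution": "normal",
         "params": {"loc": 230.0, "scale": 5.0}},
        {"name": "load_factor", "type": "numeric", "distribution": "uniform",
         "params": {"low": 0.0, "high": 1.0}},
        {"name": "status", "type": "categorical",
         "categories": ["ok", "warning", "fault"],
         "weights": [0.90, 0.08, 0.02]},
    ],
}


def generate_table(config):
    """Build a dict of column name -> NumPy array from the configuration."""
    rng = np.random.default_rng(config["seed"])
    n = config["num_rows"]
    table = {}
    for col in config["columns"]:
        name = col["name"]
        if col["type"] == "numeric":
            if col["distribution"] == "normal":
                table[name] = rng.normal(size=n, **col["params"])
            elif col["distribution"] == "uniform":
                table[name] = rng.uniform(size=n, **col["params"])
            else:
                raise ValueError(f"unsupported distribution: {col['distribution']}")
        elif col["type"] == "categorical":
            # Weights act as sampling probabilities and must sum to 1.
            table[name] = rng.choice(col["categories"], size=n, p=col["weights"])
        else:
            raise ValueError(f"unsupported column type: {col['type']}")
    return table


table = generate_table(CONFIG)
```

Keeping the column definitions in data rather than in code means a new industry profile only needs a new configuration, not a new generator.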
This reference kit includes:
- Training data
- An open source, trained model
- Libraries
- User guides
- Intel® AI software products
At a Glance
- Industry: Utilities, healthcare, e-commerce, environmental studies, finance, cross-industry
- Task: Generate structured synthetic data (time-series signals using TimeSynth, plus numeric and categorical data), all using NumPy and SciPy for problem context and data manipulation
- Dataset: Synthetically generated timestamp and sensor data values
- Output: Structured synthetic data: time-series data, numeric data, and categorical data
- Intel® AI Software Products:
  - Intel® Distribution of Modin*
  - Intel® Distribution for Python* (specifically the optimizations for NumPy and SciPy)
Technology
Optimized with Intel AI Software Products for Better Performance
The structured data generation models were optimized with Intel Distribution of Modin and Intel Distribution for Python (specifically the optimizations for NumPy and SciPy) for better performance across heterogeneous XPU and FPGA architectures. Intel Distribution of Modin and Intel Distribution for Python allow you to reuse your model development code with minimal code changes for training and inferencing. To optimize the solution, performance benchmark tests were run on Microsoft Azure* Standard_D8_v5 instances using 3rd generation Intel® Xeon® processors.
Benefits
Generating structured synthetic data can be a powerful tool for organizations that need to work with data but face limitations such as data privacy concerns, limited data availability, or biased datasets.
Note: The industry choices were largely arbitrary and are not meant to be exhaustive. You can update the configuration file (and the code) to include specific industry-relevant parameters based on your use case.
With Intel® oneAPI toolkits, little to no code change is required to attain the performance boost.