Business Results

  • Generate structured synthetic data for a variety of industry-specific AI applications.

  • Lower total cost of ownership (TCO) compared to acquiring data.

  • Process terabytes of data on a single workstation, scale from that workstation to the cloud using the same code, and spend more time on data analysis and less on learning new APIs.

Background

Challenges associated with data privacy, limited data availability, data labeling, ineffective data governance, high costs, and the need for a high volume of data are driving the use of synthetic data to fulfill the high demand for AI solutions across industries.

Specifically, data is frequently unavailable for completely new products, and human annotation of data is expensive and time-consuming. Businesses can avoid these obstacles by investing in synthetic data, which can be created quickly and aids in the development of robust machine learning models.

Solution

In collaboration with Accenture*, Intel developed a structured data generation AI reference kit to help you generate structured synthetic data.

Synthetic data, a new class of data used for AI development and applications, is increasingly contributing to the global AI training dataset market. Using synthetic data increases accessibility and drives innovation toward newer and better AI solutions. As synthetic data becomes a growing focus for AI solutions, it will become increasingly important to optimize the AI pipeline wherever possible. Intel strives to bring those optimizations to all parts of the data science pipeline, from data generation through model inference.

Python* offers multiple capabilities for generating and handling synthetic data, depending on the use case. For example, time-series data is a subclass of structured data collected across numerous industries. TimeSynth, an open source library that provides classes of signals and noise, can be used alongside existing Python libraries such as SciPy and NumPy to mimic real time-series data.
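The signal-plus-noise idea can be sketched with NumPy alone. This is an illustrative stand-in, not the kit's code and not the TimeSynth API: the function name `synth_sensor_series` and all of its parameters are assumptions chosen for the example.

```python
import numpy as np

def synth_sensor_series(n_points=200, stop_time=20.0, frequency=0.25,
                        amplitude=1.0, noise_std=0.3, seed=0):
    """Return (times, clean, noisy) for a sinusoid with additive Gaussian noise.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    rng = np.random.default_rng(seed)                 # seeded for reproducibility
    times = np.linspace(0.0, stop_time, n_points)     # regularly spaced timestamps
    clean = amplitude * np.sin(2.0 * np.pi * frequency * times)
    noisy = clean + rng.normal(0.0, noise_std, size=n_points)
    return times, clean, noisy

times, clean, noisy = synth_sensor_series()
print(times.shape, noisy.shape)
```

Fixing the seed makes each run reproducible, which helps when comparing downstream models trained on the same synthetic series.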

End-to-End Flows Using Intel® AI Software Products

Figure: Data generation and data analysis capabilities of the Intel® Distribution for Python*

Industry-specific datasets were created for utilities and e-commerce, while the industry-agnostic dataset lets you customize details of the synthetic data, such as the number of columns, data types, distributions, and weights. Each dataset applies additional techniques and metadata specific to its context, based on the values provided in the configuration file.
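As a rough illustration of configuration-driven generation, the sketch below builds numeric and categorical columns from a small dictionary. The configuration keys, column names, and the `generate_table` helper are all hypothetical; the kit's actual configuration file may be structured differently.

```python
import numpy as np

# Hypothetical configuration: keys and column names are assumptions for this sketch.
CONFIG = {
    "n_rows": 1000,
    "seed": 42,
    "columns": [
        {"name": "temperature", "dtype": "float", "distribution": "normal",
         "params": {"loc": 20.0, "scale": 5.0}},
        {"name": "units_sold", "dtype": "int", "distribution": "poisson",
         "params": {"lam": 3.0}},
        {"name": "segment", "dtype": "category",
         "values": ["retail", "wholesale", "online"],
         "weights": [0.5, 0.2, 0.3]},
    ],
}

def generate_table(config):
    """Return a dict mapping column name to a NumPy array of synthetic values."""
    rng = np.random.default_rng(config["seed"])
    n = config["n_rows"]
    table = {}
    for col in config["columns"]:
        if col["dtype"] == "category":
            # Weighted sampling of categorical labels
            table[col["name"]] = rng.choice(col["values"], size=n, p=col["weights"])
        else:
            # Look up the named distribution on the NumPy random generator
            sampler = getattr(rng, col["distribution"])
            values = sampler(size=n, **col["params"])
            table[col["name"]] = values.astype(float if col["dtype"] == "float" else int)
    return table

table = generate_table(CONFIG)
print({name: values[:3] for name, values in table.items()})
```

Keeping the column definitions in data rather than code means a new industry context only needs a new configuration, not a new generator.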

This reference kit includes:
 

  • Training data
  • An open source, trained model
  • Libraries
  • User guides
  • Intel® AI software products

At a Glance

  • Industry: Utilities, healthcare, e-commerce, environmental studies, finance, cross-industry
  • Task: Generate structured synthetic data: time-series signals using TimeSynth, plus numeric and categorical data, with NumPy and SciPy used for problem context and data manipulation
  • Dataset: Synthetically generated timestamps and sensor data values
  • Output: Structured synthetic data: time-series data, numeric data, and categorical data
  • Intel AI Software Products:
    • Intel® Distribution of Modin*
    • Intel® Distribution for Python* (specifically the optimizations for NumPy and SciPy)

Technology

Optimized with Intel AI Software Products for Better Performance

The structured data generation models were optimized with Intel Distribution of Modin and Intel Distribution for Python (specifically the optimizations for NumPy and SciPy) for better performance across heterogeneous XPU and FPGA architectures. Both products allow you to reuse your model development code with minimal code changes for training and inferencing. Performance benchmarks for the optimized solution were run on Microsoft Azure* Standard_D8_v5 using 3rd generation Intel® Xeon® processors.
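The "minimal code changes" point is usually realized by swapping a single import, since Modin exposes a pandas-compatible API. The sketch below is an assumption about typical usage rather than the kit's code: it falls back to stock pandas when Modin is not installed, and it assumes a Modin execution engine (such as Ray or Dask) is available when Modin is used. The column names are made up for the example.

```python
# Use Modin when it is installed; otherwise fall back to stock pandas.
try:
    import modin.pandas as pd  # assumes an execution engine such as Ray or Dask
except ImportError:
    import pandas as pd

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic sensor table; the rest of the code is identical
# whichever library was imported above.
df = pd.DataFrame({
    "sensor_id": rng.integers(0, 4, size=1_000),
    "reading": rng.normal(50.0, 10.0, size=1_000),
})
summary = df.groupby("sensor_id")["reading"].mean()
print(summary)
```

Only the import line differs between the two paths, so the same analysis code can move from a laptop to a larger machine without being rewritten.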

Benefits

Generating structured synthetic data can be a powerful tool for organizations that need to work with data but face limitations such as data privacy concerns, limited data availability, or biased datasets.

Note: The industry choices were largely arbitrary and were not meant to be exhaustive. You can update the configuration file (and the code) to include specific industry-relevant parameters based on your use case.

With Intel® oneAPI toolkits, little to no code change is required to attain the performance boost.
