Synthetic data

The concept of synthetic data generation is the following: take an original dataset based on actual events and create from it a new, artificial dataset with similar statistical properties. These similar properties allow the same statistical conclusions to be drawn as if the original dataset had been used.

Generating synthetic data increases the amount of available data, either by adding slightly modified copies of existing records or by creating entirely new records derived from the existing data. The result is new, representative data that plausibly could have been drawn from the original dataset.

Synthetic data is typically created with generative models: unsupervised machine learning that automatically discovers and learns the regularities and patterns in the original data, and then uses them to generate new samples.
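As a concrete illustration, the sketch below fits a simple generative model (a Gaussian mixture, via scikit-learn) to a small, made-up numeric dataset and samples new, artificial records from it. This is a minimal sketch under those assumptions only; production synthetization tools use far richer models (GANs, variational autoencoders, copulas) and also handle categorical and relational data.

```python
# Minimal sketch: fit a simple generative model on numeric tabular data
# and sample a synthetic dataset with similar statistical properties.
# Assumes scikit-learn and numpy; the "original" data here is invented.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)

# Stand-in for an original dataset based on actual events:
# two correlated numeric columns, e.g. age and yearly spend.
age = rng.normal(45, 12, size=1_000)
spend = 50 * age + rng.normal(0, 300, size=1_000)
original = np.column_stack([age, spend])

# Learn the joint distribution (unsupervised) ...
model = GaussianMixture(n_components=5, random_state=0).fit(original)

# ... and draw new, artificial records from it.
synthetic, _ = model.sample(1_000)

# The synthetic set should support similar statistical conclusions.
print("original means :", original.mean(axis=0).round(1))
print("synthetic means:", synthetic.mean(axis=0).round(1))
print("original corr  :", np.corrcoef(original.T)[0, 1].round(2))
print("synthetic corr :", np.corrcoef(synthetic.T)[0, 1].round(2))
```

On data like this, the means and correlation of the synthetic sample should land close to the originals, which is exactly the "similar statistical properties" described above.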

Why is synthetic data important now?

With the rise of Artificial Intelligence (AI) and Machine Learning, the need for large and rich (test and training) datasets is increasing rapidly. AI and Machine Learning models are trained on enormous amounts of data, which are often difficult to obtain or generate without synthetic data. In most sectors, large datasets are not yet available at scale: think of health records, autonomous-vehicle sensor data, image recognition, and financial services data. By generating synthetic data, more and more data will become available. At the same time, the consistency and availability of large datasets form a solid foundation for a mature Development/Test/Acceptance/Production (DTAP) process, which is becoming a standard approach for data products and outputs.

Existing initiatives on federated AI (where data availability is increased by keeping the data at the source and sending the AI model to the source to run the algorithms there) have proven to be complex due to differences between these data sources and their quality. Data synthetization, by contrast, can offer more reliability and consistency.

An additional benefit of generating synthetic data is compliance with privacy legislation. Synthesized data is less directly traceable (though not untraceable) to an identified or identifiable person. This increases the opportunities to use data: enabling data transfers to cross-border cloud servers, extending data sharing with trusted third parties, and selling data to customers and partners.

Relevant considerations

Privacy

Synthetization increases data privacy but does not by itself guarantee compliance with privacy regulations.

A good synthetization solution will:

  • include multiple data transformation techniques (e.g., data aggregation);

  • remove potential sensitive data;

  • add ‘noise’ (randomization) to the datasets (see the sketch below);

  • perform manual stress testing.

Companies must realize that even with these techniques, additional measures such as anonymization can still be relevant.
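To make these techniques concrete, here is a minimal, hypothetical sketch (using pandas and numpy; the column names and thresholds are invented) that drops a directly identifying column, aggregates a fine-grained attribute into bands, and adds calibrated noise to a numeric column.

```python
# Minimal sketch of transformation, removal of sensitive fields, and noise
# injection on a tabular dataset. Assumes pandas and numpy; the dataset and
# column names are hypothetical examples, not a prescribed recipe.
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

df = pd.DataFrame({
    "postcode": ["1011AB", "2513AA", "3511CX", "1011AB"],
    "age": [34, 51, 29, 34],
    "yearly_spend": [1800.0, 950.0, 2300.0, 1750.0],
})

# 1) Remove a potentially sensitive (directly identifying) column.
safe = df.drop(columns=["postcode"])

# 2) Aggregate fine-grained values into coarser bands.
safe["age_band"] = pd.cut(safe["age"], bins=[0, 30, 50, 120],
                          labels=["<30", "30-50", "50+"])
safe = safe.drop(columns=["age"])

# 3) Add calibrated noise to the remaining numeric column.
noise = rng.normal(0, 0.05 * safe["yearly_spend"].std(), size=len(safe))
safe["yearly_spend"] = (safe["yearly_spend"] + noise).round(2)

print(safe)
```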

Outliers

Outliers may be missing: synthetic data mimics real-world data; it is not an exact replica of it. As a result, synthetic data may fail to reproduce some of the original data's outliers. Yet outliers are often important for training and test data.
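One simple, hedged check is to compare how many synthetic values fall beyond an extreme quantile of the original data. The sketch below uses made-up numpy data and a hypothetical `tail_coverage` helper to illustrate the idea.

```python
# Minimal sketch of a tail check, assuming numpy and one numeric column
# per dataset. The data and the helper name are hypothetical.
import numpy as np

def tail_coverage(original, synthetic, q=0.99):
    """Fraction of synthetic values beyond the original data's q-quantile."""
    threshold = np.quantile(original, q)
    return float(np.mean(synthetic > threshold))

rng = np.random.default_rng(0)
original = rng.lognormal(mean=3.0, sigma=1.0, size=10_000)        # heavy-tailed
synthetic = rng.normal(original.mean(), original.std(), size=10_000)  # smoothed

# If the generator smoothed the tails away, this lands far below 1 - q.
print("expected beyond 99th pct :", 0.01)
print("synthetic beyond 99th pct:", round(tail_coverage(original, synthetic), 4))
```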

Quality

The quality of synthetic data depends on the quality of the source data: errors and biases in the original dataset will be learned and reproduced. This should be taken into account when working with synthetic data.

Black-box

Although data synthetization is taking centre stage in the current hype cycle, it is still in the pioneering phase for most companies. This means that at this stage, the full effect of unsupervised data generation is unclear: it is data generated by machine learning for machine learning, a potential double black box. Companies therefore need to build evaluation systems for the quality of synthetic datasets, and as the use of synthetic data methods increases, an assessment of the quality of their output will be required. A trusted synthetization solution must always include:

  • good information on the origin of the dataset;

  • its potential purposes and requirements for usage;

  • a data quality indication;

  • a data diversity indication;

  • a description of (potential) bias;

  • risk descriptions, including mitigating measures based on a risk evaluation framework.
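As a simple illustration of such an evaluation step, the sketch below compares the per-column distributions of an original and a synthetic dataset with a two-sample Kolmogorov-Smirnov test (scipy). The data, column names, and the verdict threshold are hypothetical; a real quality framework would also cover correlations, diversity, bias, and privacy metrics.

```python
# Minimal sketch of a per-column quality indication for synthetic data,
# assuming scipy and numpy. The datasets and the 0.05 threshold are
# illustrative assumptions, not a standard.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
original = {"age": rng.normal(45, 12, 5_000),
            "yearly_spend": rng.lognormal(7.0, 0.6, 5_000)}
synthetic = {"age": rng.normal(44, 13, 5_000),
             "yearly_spend": rng.lognormal(7.1, 0.5, 5_000)}

for column in original:
    result = ks_2samp(original[column], synthetic[column])
    verdict = "similar" if result.statistic < 0.05 else "check generator"
    print(f"{column:>12}: KS statistic={result.statistic:.3f} "
          f"(p={result.pvalue:.3f}, {verdict})")
```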

Synthetic data is a new phenomenon for most digital companies. Understanding its potential and risks will allow you to keep up with the latest developments and stay ahead of your competition, or even your clients!
