Not Real, but Useful: The Actuary’s Guide to Synthetic Data

Synthetic data is becoming a relevant topic for actuaries in today’s data-driven landscape. The growing demand for rich, granular data to build more accurate models, fuelled by ongoing advances in technology and computing power, pulls against the constraints on data privacy, regulatory compliance and data availability in this highly regulated industry. Synthetic data offers a promising way to bridge this gap: it can protect policyholder privacy by shielding sensitive information while also allowing actuaries to generate larger datasets.

What is synthetic data and why is it important?

“Synthetic data is annotated information that computer simulations or algorithms generate as an alternative to real-world data. It may be artificial, but synthetic data reflects real-world data, mathematically or statistically.”

This article explores the transformative potential of AI-generated synthetic data, the risks associated with using synthetic data, practical applications of synthetic data and how to create synthetic data in an actuarial context. As our industry embraces digital transformation, understanding and leveraging synthetic data can be highly valuable for actuarial professionals seeking to enhance their models.

Privacy considerations

Some of the biggest drivers for using synthetic data are privacy-law considerations, which manifest in restrictions or limitations on how and where data can be used. For example, historical data – previously collected personal or sensitive information – may have been collected under consent language that didn’t explicitly or implicitly authorize a use-case like model training, or privacy law may explicitly limit how and where data can be transferred. 

Synthetic datasets could serve as a solution in cases like this, where value can be extracted from data without the need for the data itself to be used or transferred. Well-constructed synthetic datasets should retain nearly all the statistical relationships of the underlying dataset while removing connections to individuals. Under certain privacy schemes, this data may no longer be considered personal information, leaving fewer restrictions on use and transfer. 

What are (re)insurance companies doing in the synthetic data space?

Synthetic data is a new approach to a problem that insurers commonly face: “data can be expensive, imbalanced, unavailable, or unusable due to privacy regulations.” Traditionally, insurance companies have relied on older techniques to address this problem. For example, given a dataset that contains direct identifiers (e.g., names, addresses, contact information), companies could simply remove those identifiers. The problem with this, and with other older approaches, is that the remaining fields may still contain quasi-identifiers: attributes that can single out an individual when combined.
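The quasi-identifier problem can be illustrated with a small sketch. The records, field names and the choice of quasi-identifiers below are all hypothetical, and the check shown (counting records that are unique on a combination of fields) is only one simple way to flag re-identification risk.

```python
from collections import Counter

# Hypothetical policyholder records; every value here is made up.
records = [
    {"name": "A. Smith", "email": "a.smith@example.com", "postal_code": "M5V", "birth_year": 1980, "sex": "F", "claim": 1200},
    {"name": "B. Jones", "email": "b.jones@example.com", "postal_code": "M5V", "birth_year": 1980, "sex": "F", "claim": 300},
    {"name": "C. Wong", "email": "c.wong@example.com", "postal_code": "M5V", "birth_year": 1975, "sex": "M", "claim": 0},
    {"name": "D. Roy", "email": "d.roy@example.com", "postal_code": "H3A", "birth_year": 1990, "sex": "F", "claim": 850},
    {"name": "E. Khan", "email": "e.khan@example.com", "postal_code": "K1A", "birth_year": 1962, "sex": "M", "claim": 4100},
]

DIRECT_IDENTIFIERS = {"name", "email"}                     # assumed list
QUASI_IDENTIFIERS = ("postal_code", "birth_year", "sex")   # assumed list


def drop_direct_identifiers(rows):
    """Remove the direct identifiers and keep everything else."""
    return [{k: v for k, v in row.items() if k not in DIRECT_IDENTIFIERS} for row in rows]


def unique_on_quasi_identifiers(rows):
    """Return rows whose quasi-identifier combination appears exactly once."""
    keys = [tuple(row[q] for q in QUASI_IDENTIFIERS) for row in rows]
    counts = Counter(keys)
    return [row for row, key in zip(rows, keys) if counts[key] == 1]


deidentified = drop_direct_identifiers(records)
at_risk = unique_on_quasi_identifiers(deidentified)
print(f"{len(at_risk)} of {len(deidentified)} records are unique on {QUASI_IDENTIFIERS}")
```

Even with names and emails removed, any record that is unique on these fields could potentially be linked back to a person by someone holding an outside dataset with the same attributes.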

Now, more technically capable insurers are experimenting with synthetic data techniques. In the same example, where a dataset has direct identifiers, some companies would instead use a deep-learning synthetic data generator to build a new dataset that offers a high degree of privacy.

There are several reasons why not every company does this:

  1. The synthetic data generation process can be complex to execute. It requires a team with an advanced understanding of programming and data management.
  2. There is some degree of information lost in the process.
  3. A very large source of data is typically required to generate a viable synthetic dataset.

In addition to (re)insurers using synthetic data techniques on their own, there are companies that now specialize in this space. For example, MOSTLY AI is a company that provides a proprietary platform for generating synthetic data while also providing open-source code for users to build their own custom solutions.

Companies are also exploring another class of synthetic data aimed at facilitating testing of software or data pipelines, where synthetic data is used in place of sensitive data. For use cases like these, software packages can be used to generate a variety of realistic data types, including names, email addresses and phone numbers.
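As a rough illustration of this second class of synthetic data, the sketch below builds placeholder policyholder records using only the Python standard library. The name lists, field names and formats are assumptions made for the example; dedicated packages such as Faker offer far richer and more realistic generators.

```python
import random

# Small hypothetical name pools; a real test-data package would offer many more.
FIRST_NAMES = ["Alex", "Jordan", "Morgan", "Riley", "Taylor"]
LAST_NAMES = ["Chen", "Dubois", "Singh", "Tremblay", "Wilson"]


def fake_policyholder(rng):
    """Build one placeholder record that contains no real personal data."""
    first = rng.choice(FIRST_NAMES)
    last = rng.choice(LAST_NAMES)
    # example.com is reserved for documentation, so these addresses cannot be real.
    email = f"{first}.{last}{rng.randint(1, 999)}@example.com".lower()
    # 555-01xx numbers are conventionally reserved for fictional use.
    phone = f"555-555-01{rng.randint(0, 99):02d}"
    return {"name": f"{first} {last}", "email": email, "phone": phone}


rng = random.Random(42)  # fixed seed so the test fixtures are reproducible
test_rows = [fake_policyholder(rng) for _ in range(5)]
for row in test_rows:
    print(row)
```

Data like this is meant for exercising software and pipelines. It carries no statistical relationship to any real portfolio, so it is not suitable for modelling.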

Creating synthetic data

While there are numerous approaches to generating synthetic data, nearly all involve the use of machine-learning models. In the simplest cases for structured input (i.e., tabular data), these models learn the statistical properties of the dataset and generate records which mimic them in aggregate. With more complex data types, these models aim to understand deeper relationships and structures that exist within the data. If we think about images, for example, a model used to generate synthetic portraits of human beings would need to understand the structure, intricacy and variety of human facial features.

While synthetically generated images and videos may be “cool”, most actuaries will find that synthetically generated tabular datasets are more directly applicable to use cases in their day-to-day work.

For actuaries looking to generate synthetic tabular datasets, there are a few options at their disposal. Within Python and R, for example, there are several packages available:

  1. Synthetic Data Vault (Python)
  2. YData SDK (Python)  
  3. Lace (Python)
  4. Synthpop (R)

The Synthetic Data Vault package provides several methods for generating synthetic tabular data. The table below summarizes the common methods, including their pros and cons:

Synthetic data model: Gaussian copula
Description: Uses a Gaussian copula to model joint distributions and generate realistic synthetic data
Pros:
– Fast and intuitive
– Less computationally intensive
Cons:
– Difficulty capturing tail dependence
– Difficulty capturing non-linear relationships

Synthetic data model: CTGAN (conditional tabular generative adversarial network)
Description: Makes use of a generative adversarial network (GAN), a deep-learning technique, to generate synthetic data
Pros:
– Able to capture more complex interactions (non-linear, tail)
– Excellent at handling mixed data types (numeric, categorical)
Cons:
– Slower and more computationally intensive
– More opaque
– Potentially less stable
– Typically requires larger datasets to effectively generate synthetic data

Synthetic data model: CopulaGAN
Description: A hybrid approach that uses both copula- and GAN-based methods
Pros:
– More robust than GANs
– Less prone to instability in training than GANs
Cons:
– More complex to implement and tune than standard GAN-based approaches
– Computationally complex and demanding

Ultimately, the choice of algorithm comes down to the specific use case at hand and any constraints around computation, time, complexity and data size. Simpler approaches that rely on classical statistical techniques, such as Gaussian copulas, will likely be the right choice where time and computation are limited, or where greater interpretability of the underlying assumptions is desired. Where there are fewer constraints around computation, where more data is available, and where there is a desire to model more complex interdependencies, GAN-based or other deep-learning-based methods may be preferred.

In practice, the constraints outlined above will usually be less cut and dried, and experimenting with multiple techniques will often be the best approach.
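To make the simplest of these approaches more concrete, the following is a minimal from-scratch sketch of the Gaussian copula idea using NumPy. It is not the Synthetic Data Vault implementation; the function names and the toy age and claim data are assumptions made for illustration, and it handles only continuous numeric columns.

```python
from statistics import NormalDist

import numpy as np

_STD_NORMAL = NormalDist()
_inv_cdf = np.vectorize(_STD_NORMAL.inv_cdf)  # uniform -> standard normal score
_cdf = np.vectorize(_STD_NORMAL.cdf)          # standard normal score -> uniform


def fit_gaussian_copula(data):
    """Learn the dependence structure and margins of a numeric (rows, cols) array."""
    n_rows = data.shape[0]
    # Rank each column (1..n), then scale into the open interval (0, 1).
    ranks = data.argsort(axis=0).argsort(axis=0) + 1
    uniforms = ranks / (n_rows + 1)
    # Correlation of the normal scores captures the dependence between columns.
    corr = np.corrcoef(_inv_cdf(uniforms), rowvar=False)
    return {"corr": corr, "columns": np.sort(data, axis=0)}


def sample_gaussian_copula(model, n_samples, rng):
    """Draw synthetic rows that follow the fitted correlation and margins."""
    corr = model["corr"]
    z = rng.multivariate_normal(np.zeros(corr.shape[0]), corr, size=n_samples)
    u = _cdf(z)
    cols = model["columns"]
    # Map each uniform back through the empirical quantiles of the real column.
    return np.column_stack(
        [np.quantile(cols[:, j], u[:, j]) for j in range(cols.shape[1])]
    )


# Toy "real" data: claim size is skewed and tends to rise with age.
rng = np.random.default_rng(0)
age = rng.uniform(20, 80, size=2000)
claim = np.exp(0.03 * age + rng.normal(0, 0.5, size=2000))
real = np.column_stack([age, claim])

model = fit_gaussian_copula(real)
synthetic = sample_gaussian_copula(model, 2000, rng)

real_corr = np.corrcoef(real, rowvar=False)[0, 1]
synth_corr = np.corrcoef(synthetic, rowvar=False)[0, 1]
print(f"age/claim correlation: real {real_corr:.2f}, synthetic {synth_corr:.2f}")
```

Because the dependence is summarized by a single correlation matrix, this approach is fast and easy to inspect, which is also why it can miss tail dependence and non-linear relationships.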

Risks of using synthetic data

While using synthetic data offers significant advantages, it also comes with important risks that actuaries must carefully consider.

One of the key concerns is the potential misrepresentation of real-world data, where the synthetic data fails to accurately capture the underlying relationships, dependencies or variability present in the real-world data. This can happen if the model oversimplifies patterns, relies on limited or poor-quality input data, or is subject to significant computational constraints, leading to synthetic data that does not generalize well or produces misleading results. Another form of misrepresentation arises when synthetic data inadvertently introduces biases into a dataset, which can have ethical and regulatory implications. Biases can also arise if the original data is itself biased, causing the synthetically generated data to perpetuate these underlying patterns.

Despite the anonymization benefits that come with synthetic data, there is industry concern about leakage of sensitive information. This can occur if the data-generation process is not properly designed, reviewed and validated. Insurance regulators may also apply additional scrutiny to models developed on synthetic data, in particular by requesting evidence that the synthetic data is free of unfair discrimination and unintended bias. Depending on the generative techniques employed, synthetic data may also add a layer of explainability and transparency challenges, further complicating AI models that might already be difficult to explain.
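One simple way to screen for leakage is to check whether synthetic records duplicate real ones or sit unusually close to them. The sketch below is illustrative only: the function names and toy data are assumptions, it handles numeric columns only, and what counts as "too close" would need to be set for the dataset at hand.

```python
import numpy as np


def exact_match_rate(real, synthetic):
    """Share of synthetic rows that are exact copies of some real row."""
    real_rows = {tuple(row) for row in real.tolist()}
    matches = sum(tuple(row) in real_rows for row in synthetic.tolist())
    return matches / len(synthetic)


def distance_to_closest_record(real, synthetic):
    """Euclidean distance from each synthetic row to its nearest real row."""
    # Standardize with the real data's statistics so columns are comparable.
    mean, std = real.mean(axis=0), real.std(axis=0)
    r = (real - mean) / std
    s = (synthetic - mean) / std
    diffs = s[:, None, :] - r[None, :, :]  # shape: (n_synth, n_real, n_cols)
    return np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)


# Toy data: the first 5 "synthetic" rows are deliberately copied from the real data.
rng = np.random.default_rng(7)
real = rng.normal(size=(200, 3))
synthetic = np.vstack([real[:5], rng.normal(size=(195, 3))])

rate = exact_match_rate(real, synthetic)
dcr = distance_to_closest_record(real, synthetic)
print(f"exact matches: {rate:.1%}, smallest distance to a real record: {dcr.min():.3f}")
```

A non-zero match rate, or many distances near zero, suggests the generator may be memorizing individual records instead of learning aggregate patterns.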

Therefore, actuaries looking to employ these cutting-edge techniques in their actuarial work should adopt robust validation frameworks, be mindful of the limitations of synthetic data and take actions to effectively mitigate these risks. Some mitigation strategies include performing statistical comparisons and model performance testing to ensure synthetic data accurately reflects real-world patterns, conducting bias and fairness audits in synthetic data and downstream models, and maintaining transparency about data-generation techniques.
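The statistical comparisons mentioned above can start very simply. The sketch below compares real and synthetic data on two fronts: the marginal distribution of each column (via a two-sample Kolmogorov-Smirnov statistic) and the correlation structure. The function names and toy data are assumptions for illustration, and a full validation framework would go well beyond these two checks.

```python
import numpy as np


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.abs(cdf_a - cdf_b).max())


def max_correlation_gap(real, synthetic):
    """Largest absolute difference between the two correlation matrices."""
    gap = np.corrcoef(real, rowvar=False) - np.corrcoef(synthetic, rowvar=False)
    return float(np.abs(gap).max())


# Toy data: two correlated columns standing in for real portfolio fields.
rng = np.random.default_rng(1)
cov = [[1.0, 0.6], [0.6, 1.0]]
real = rng.multivariate_normal([0, 0], cov, size=1000)

# "Good" synthetic data: drawn from the same joint distribution.
good_synth = rng.multivariate_normal([0, 0], cov, size=1000)

# "Bad" synthetic data: each margin is perfect, but the link between columns is destroyed.
bad_synth = real.copy()
bad_synth[:, 1] = rng.permutation(bad_synth[:, 1])

for label, synth in [("good", good_synth), ("bad", bad_synth)]:
    ks = [ks_statistic(real[:, j], synth[:, j]) for j in range(real.shape[1])]
    print(label, "KS per column:", np.round(ks, 3),
          "max correlation gap:", round(max_correlation_gap(real, synth), 3))
```

The "bad" dataset passes the column-by-column check perfectly yet fails on correlation, which is why marginal comparisons alone are not enough to establish that synthetic data reflects real-world patterns.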

What is the future of synthetic data?

It is difficult to predict where the future of synthetic data will take us. With the progress made in the fields of AI and big data, there could be major progress on this topic in the short-to-medium term. That said, there are broad predictions that can be reasonably made:

1) Credible synthetic data will continue to be difficult to produce without some base of existing data.

In essence, something cannot be produced from nothing. It is tough to envision a future where even the most cutting-edge AI is able to produce synthetic data without existing information.

2) Synthetic data could be used to circumvent (either legally or maliciously) strict data regulations and restrictions.

This could allow for the transfer of data to other parties or locales and facilitate global collaboration. This is not a stretch to envision, as this application has been discussed by companies specializing in synthetic data[1], and legacy anonymization techniques are already being used for this purpose. The big question is likely not “will insurers use synthetic data” but “when” and “how much.” Further, data regulations and restrictions are likely to become stricter. Currently, most privacy laws (e.g., the EU’s General Data Protection Regulation, or GDPR) allow for the use of synthetic data if re-identification of individuals is truly impossible. Will this continue to be the case?

3) Synthetic data will be considered as an option for sharing data among insurance companies.

Synthetic data can allow for sharing or combining data sources where it was otherwise not possible (e.g., data sharing between organizations for collaboration on projects). Several attempts have been made in the insurance industry to share data: some successful, others not. Synthetic data would allow for the anonymization of insureds’ personal information, which could help alleviate privacy concerns. However, there are still factors that might work against these initiatives. For example, large insurers might see their data (anonymized or not) as a competitive advantage and be unwilling to share.

Conclusion

Synthetic data is an exciting innovation opportunity for actuaries and insurers, enabling the creation of rich and privacy-preserving datasets to address existing challenges related to data limitations and privacy-law constraints. While adopting these new techniques comes with risks and complexities, it is a promising area worth exploring, provided robust validation and risk governance frameworks are in place.

Looking ahead, we believe synthetic data will play an important role in fostering data cooperation and driving innovation in the insurance industry. However, it is important to recognize that the quality and credibility of synthetic data still depend on the underlying datasets, and its use will continue to be shaped by the evolving regulatory landscape. Thoughtful integration of synthetic data promises significant benefits for actuaries and the broader insurance industry.

About the authors

Harrison Jones, ASA, is a Director of Portfolio Management at Ecclesiastical Insurance, based in Toronto, Ontario. He has held various actuarial and data science roles over the last decade.

Bernice Lim, FCIA, FSA, is a Principal at the Actuarial Practice of Oliver Wyman with over 10 years of experience in the life insurance and annuity space, working with insurers in areas such as actuarial modelling and data‑driven analytics.

Tristan Walsh is a Staff Data Scientist within the Munich Re North American Integrated Analytics team, where he applies data science techniques to drive innovation in the life insurance industry. He has a BSc. in physics from McGill University and is an Associate of the Society of Actuaries.

This article reflects the opinion of the authors and does not represent an official statement of the CIA.


[1] https://mostly.ai/use-case/data-sharing