2020 Young Statisticians' Prize Award winner - Kenza Sallier

Official Statistics data

Congratulations to the 2020 IAOS Young Statisticians' Prize Award winner Kenza Sallier!

 

Toward More User-Centric Data Access Solutions: Producing Synthetic Data of High Analytical Value by Data Synthesis

About the Author, Kenza Sallier - by Nancy Torrieri

Kenza Sallier was born and raised in France, moving to Canada at the age of seventeen. She earned a bachelor's degree in mathematics in 2013 and a master's degree in statistics in 2016 from the Université de Montréal. As part of a competition with two other students, she won an award for a data analysis case study organized by the Statistical Society of Canada. The study explored the spatial and temporal heterogeneity of environmental noise in Toronto.

Kenza has been a survey statistician at Statistics Canada since completing her master's degree. She has been working on innovative projects related to data access and confidentiality.
 
Kenza recently joined the Census Operations Section at Statistics Canada. While building expertise in the implementation of data synthesis for the creation of public synthetic data files of high analytical value, she has had the opportunity to present her work at various forums. Engaged in the promotion of innovative methods, she served for over a year as head of Statistics Canada’s Machine Learning Community of Practice. Currently, Kenza is part of a United Nations Economic Commission for Europe (UNECE) working group on synthetic data.
 
In her free time, Kenza attends Latin social dancing events, takes pictures and tries to cook dishes from various parts of the world.

 

Toward More User-Centric Data Access Solutions: Producing Synthetic Data of High Analytical Value by Data Synthesis

Summary of a Paper by Kenza Sallier, Statistics Canada - Submitted for the 2020 IAOS Young Statisticians’ Prize Competition

Statistics Canada has undertaken a modernization program that includes an emphasis on producing data with better analytical value while maintaining its core value of protecting the confidentiality of respondents’ information. One avenue currently being explored as a means of delivering data of high analytical value to users is data synthesis. This paper describes the use of data synthesis as a proof of concept for modernizing Statistics Canada’s data access solutions.
The modernization program underway is described at: https://www.canada.ca/en/government/system/digitalgovernment.html
It calls for more innovation and responsiveness to expand the usefulness of current data access solutions and to develop new data access solutions so that “users have the information and data they need, when they need it, in the way they want to access it with the tools and knowledge to make full use of it.” Data access solutions have traditionally been developed with confidentiality protection as the main objective. With the launch of the modernization program, analytical utility and flexibility now seem just as important.

A graphic used in the introduction of the paper shows, from an advanced user’s perspective, Statistics Canada’s current social data access solutions along the axes of utility and accessibility. None of these solutions scores high on both dimensions at once: each represents a different balance between accessibility and analytical utility. The graphic helps in understanding the relationships among these dimensions in view of the challenges and opportunities of applying them.

Kenza proposes that data synthesis is a leading-edge avenue worth considering to disseminate more microdata in a suitably confidentialized form. Its appeal centers on the fact that users would, ideally, draw the same statistical conclusions from a synthetic file as they would from the original one. Data synthesis extracts and preserves the general characteristics of the population dataset, those that reveal overall patterns, rather than the characteristics of any specific individual. In her paper, Kenza writes, “Given their resemblance to the original data, not only in terms of structure, but also in terms of analytical value, synthetic data are useful in many situations. For example, synthetic data of high analytical value will be most useful to researchers seeking to create statistical models that describe the many complex relationships existing in the original data set. The advantage over the original data is such that synthetic data can freely be accessed outside of a secure environment.”
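
To make the idea concrete, here is a minimal sketch in Python of one simple form of data synthesis. It is an illustration only and not the method described in the paper: the variables (age, income), the made-up "original" data, the normal and linear models, and the function names fit_line and synthesize are all assumptions introduced for this example. Models are fitted to the original records, new records are drawn from those models, and the same analysis (a regression slope) is then run on both files.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least-squares intercept, slope and residual standard deviation of ys on xs."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    return intercept, slope, statistics.pstdev(residuals)


def synthesize(xs, ys, n, rng):
    """Draw n synthetic (x, y) records from models fitted to the original data."""
    # Step 1: model the first variable on its own (assumed normal here).
    mean_x, sd_x = statistics.fmean(xs), statistics.stdev(xs)
    # Step 2: model the second variable conditional on the first (assumed linear).
    intercept, slope, resid_sd = fit_line(xs, ys)
    # Step 3: sample new records from the fitted models, not from the original rows.
    synth_x = [rng.gauss(mean_x, sd_x) for _ in range(n)]
    synth_y = [intercept + slope * x + rng.gauss(0, resid_sd) for x in synth_x]
    return synth_x, synth_y


rng = random.Random(2020)

# Made-up "original" microdata, standing in for a confidential file.
orig_age = [rng.gauss(45, 12) for _ in range(5000)]
orig_income = [20000 + 800 * a + rng.gauss(0, 5000) for a in orig_age]

synth_age, synth_income = synthesize(orig_age, orig_income, 5000, rng)

# Run the same analysis on both files and compare the conclusions.
_, orig_slope, _ = fit_line(orig_age, orig_income)
_, synth_slope, _ = fit_line(synth_age, synth_income)
print(f"slope on original data:  {orig_slope:.1f}")
print(f"slope on synthetic data: {synth_slope:.1f}")
```

In this toy setting the two slopes should come out close to each other even though no synthetic record is copied from the original file. A real application would need far richer models and a formal assessment of both analytical value and disclosure risk.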

The main body of the paper provides an introduction to the theoretical concepts underlying data synthesis and how they can be implemented in practice. Although Donald B. Rubin proposed data synthesis over 25 years ago, it has only recently become more popular in official statistics. Statistics Canada’s first experience with data synthesis was a hackathon held as part of the 5th International Population Data Linkage Network conference in Banff, Canada. The second, in 2019, involved the Canadian Partnership Against Cancer in collaboration with Statistics Canada’s Center for Population Health Data.

At Statistics Canada, the intent with data synthesis is not to replace any of the existing data access solutions but to complement them. Requests for estimates of simple finite population parameters of interest will likely remain better served by data access solutions such as Research Data Centers or real-time remote access, which provide data sets close to the original ones. Data synthesis fits perfectly with a user-centric and modern view intended to provide users with high-quality data, which involves concepts such as timeliness and accessibility in addition to accuracy.