See How It’s Done

Conducting a Disclosure Risk Assessment requires you to use statistical methods to estimate the likelihood of a disclosure taking place. The following instructional videos and guidance explain these methods and how they can be applied to humanitarian microdata.

Prepare the Disclosure Risk Assessment

Before you start the risk assessment, it is important to explore your data. This could involve reviewing the original questionnaire and the sample methodology, assessing the data environment and conducting exploratory analysis to understand the relationships between variables.

Start by reviewing the questionnaire.

In case of survey data, you should review the questionnaire before starting the assessment. This will help you to understand the different variables represented in the dataset.

Have the sampling weights on hand.

Sampling weights are used to correct for the systematic differences in the selection probabilities of different respondents. If you are working with data collected through sampling, you will need the sample weights to perform a disclosure risk assessment.

Explore your data.

The first step in the risk assessment is to get to know the data you have. Applying Statistical Disclosure Control requires you to understand relationships between variables. Before jumping into the assessment, take the time to dig into those relationships.

Remove all direct identifiers from the dataset.

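Direct identifiers such as full names, addresses, phone numbers and GPS coordinates should always be removed from the microdata before starting the risk assessment. It is also important to gather information about the survey methodology, such as strata, sampling methods, survey design and sample weights. This will be important throughout the statistical disclosure control process.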

Set up your tool of choice.

At the Centre for Humanitarian Data, we use sdcMicro to perform the disclosure risk assessment. This is one of a few open source tools that can be used to apply Statistical Disclosure Control. If this is your first time using sdcMicro, you can download the package from the Comprehensive R Archive Network.

General Questions

sdcMicro is an open source add-on package in R. It was developed with support from the World Bank and is one tool that can be used to assess the risk of re-identification of your data. Learn more about why we choose sdcMicro here. sdcMicro requires an understanding of the R programming language. We have developed a step-by-step tutorial that takes you through the steps required to conduct a disclosure risk assessment using sdcMicro.

Both R and sdcMicro are freely available from the Comprehensive R Archive Network (CRAN) for Mac, Windows and Linux. Applying Statistical Disclosure Control using sdcMicro requires some basic knowledge of statistics as well as of the R programming language.

Selecting Your Key Variables

The first step in a disclosure risk assessment is the selection of key variables. These are the variables, or the columns in your dataset, that are most likely to lead to the disclosure of confidential information, including an individual’s identity. Watch this video to learn more about different types of variables and how to select your key variables.

Start by classifying the variables in your microdata as identifying and non-identifying.

Identifying variables contain information that can lead to the identification of respondents in the dataset. These can be further categorised as either direct identifiers or indirect identifiers (also referred to as quasi-identifiers). Remember, direct identifiers such as full names, addresses, phone numbers and GPS coordinates should always be removed from the microdata before starting the risk assessment. Non-identifying variables cannot be used to re-identify individuals but could lead to the disclosure of confidential information.

Select your key variables.

A key variable is typically an indirect identifier that could be used to re-identify individuals within a dataset or to link records between different datasets. Common examples of key variables are age, marital status, geographical variables, gender and religion. Removing all indirect identifiers from a dataset is likely to severely limit the analytical value of the dataset. The SDC process is intended to assess the disclosure risk presented by the indirect identifiers and to take steps to limit that risk, when possible, while maintaining the analytic power of the data.

Remember that the sensitivity of indirect identifiers depends on the context.

Direct identifiers are always considered sensitive while the sensitivity of indirect identifiers is often context specific. This is why it is important to understand both the data environment and the real life situation when selecting your key variables. Keep in mind that even when indirect identifiers are not themselves sensitive, it may be possible to combine them with other variables to lead to the disclosure of sensitive information.

Note whether your key variables are continuous or categorical.

You will use different techniques to assess the disclosure risk of continuous and categorical variables. Categorical variables take values from a finite set (e.g. gender) whereas continuous variables are numeric variables that can take an infinite number of values (e.g. income). Continuous variables can be transformed into categorical variables by creating intervals (e.g. income brackets).
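
As a minimal sketch of turning a continuous variable into a categorical one, the Python snippet below maps income values to brackets. The cut points and labels are hypothetical and chosen only for illustration; in practice they should reflect your data and context.

```python
from bisect import bisect_right

# Hypothetical bracket boundaries and labels (assumptions for illustration).
BOUNDS = [100, 500, 1000]
LABELS = ["<100", "100 to <500", "500 to <1000", "1000+"]

def to_bracket(value, bounds=BOUNDS, labels=LABELS):
    """Map a continuous value to the label of the interval it falls in."""
    # bisect_right returns how many boundaries are <= value,
    # which is exactly the index of the matching label.
    return labels[bisect_right(bounds, value)]

incomes = [42.5, 100, 730.0, 2500]
brackets = [to_bracket(v) for v in incomes]
print(brackets)
```

Wider intervals place more respondents in each category, which generally lowers disclosure risk at the cost of detail.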

Pay close attention to exclusive or partial variables.

While you do not want to remove all indirect identifiers, it may be important to remove some. For example, you may want to consider removing variables with many missing values, such as a variable recorded only for a select group.

General Questions

Key variables are your indirect identifiers that are most likely to lead to a disclosure whereas keys are all the unique combinations of values those indirect identifiers take. For the key variables ‘Marital Status’ and ‘Gender’ you could have keys such as ‘Married, Female’, ‘Married, Male’ and ‘Single, Female’. The number of times, or the frequency, a given key appears in a dataset is the basis for many disclosure risk measures.
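
To make the distinction concrete, here is a small Python sketch that builds the key for each record and counts how often each key appears. The records and variable names are made up for illustration.

```python
from collections import Counter

# Hypothetical records; only the key variables are shown.
records = [
    {"marital_status": "Married", "gender": "Female"},
    {"marital_status": "Married", "gender": "Male"},
    {"marital_status": "Married", "gender": "Female"},
    {"marital_status": "Single", "gender": "Female"},
]
key_vars = ["marital_status", "gender"]

def key_of(record, key_vars):
    """Return the record's key: its combination of key variable values."""
    return tuple(record[v] for v in key_vars)

# Frequency of each key in the dataset.
key_counts = Counter(key_of(r, key_vars) for r in records)
for key, count in key_counts.items():
    print(key, count)
```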

Selecting key variables does take some practice. When in doubt, we recommend working with a few colleagues to do the selection. You can also select different sets of key variables and run a disclosure risk assessment on each. Finally, remember that it is important for you to have an understanding of the data environment before selecting the key variables. Selecting key variables correctly requires you to make assumptions about the data that others are likely to have access to as well as whether specific data is sensitive in your context (even if it might not be considered sensitive in another context).

Run the Assessment

There are a number of different methods that can be used to evaluate the probability of individuals within a dataset being correctly re-identified. Watch the video to learn more about these different methods and how they are applied.

Use different risk assessment methods for continuous and categorical variables.

There are different disclosure risk assessment methods for continuous and categorical key variables. Assessing the disclosure risk for categorical key variables is based on the concept of uniqueness: rarer combinations of key variable values (e.g. 15, female, widowed) carry a higher risk of disclosure. For continuous variables, which can take an infinite number of values, the concept of uniqueness of a key is not helpful because every respondent could have a unique value for these variables. Most disclosure risk measures for continuous variables are a posteriori measures. For this reason, they are not useful for assessing the initial disclosure risk but can instead be used to evaluate disclosure risk after the data has been treated.

Don’t forget continuous key variables.

We focus on categorical variables because they are more prevalent in humanitarian datasets but that doesn’t mean that you should ignore continuous key variables. One way to work with these variables in a disclosure risk assessment is to transform your continuous variables into categorical variables by creating intervals (income brackets, age ranges, etc.). If you don’t want to do this, outlier detection is one way to assess the disclosure risk of continuous key variables. You can also apply Statistical Disclosure Control methods for continuous variables and then use risk assessment techniques like record linkage to evaluate the difference between the original and treated data.
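
One simple way to look for outliers in a continuous key variable is the interquartile range (IQR) rule, sketched below in Python. The 1.5 multiplier is a common convention rather than a requirement, and the income values are hypothetical; other outlier detection methods may suit your data better.

```python
from statistics import quantiles

def iqr_outliers(values, multiplier=1.5):
    """Return values lying more than `multiplier` IQRs outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    low, high = q1 - multiplier * iqr, q3 + multiplier * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical monthly incomes; one respondent stands out.
incomes = [120, 150, 160, 170, 180, 200, 5000]
outliers = iqr_outliers(incomes)
print(outliers)
```

Respondents with extreme values are easier to single out, so flagged records deserve a closer look before sharing.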

Calculate the sample and population frequency of keys.

The unique combinations of key variable values are called keys. One way to assess disclosure risk is to calculate the frequency of different keys within the dataset and, if working with a sample, within the population. As a general rule, the more individual respondents that share a key, the lower the risk of a disclosure taking place.
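
The sketch below, in Python, computes both frequencies for each key. It assumes that the population frequency of a key can be estimated by summing the sampling weights of the records that share it; the records, weights and variable names are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample: each record carries a sampling weight.
sample = [
    {"gender": "Female", "marital_status": "Widowed", "weight": 40.0},
    {"gender": "Female", "marital_status": "Married", "weight": 250.0},
    {"gender": "Female", "marital_status": "Married", "weight": 300.0},
    {"gender": "Male", "marital_status": "Single", "weight": 120.0},
]
key_vars = ["gender", "marital_status"]

def key_frequencies(records, key_vars, weight_var="weight"):
    """Return {key: (sample frequency, estimated population frequency)}."""
    sample_freq = defaultdict(int)
    pop_freq = defaultdict(float)
    for r in records:
        key = tuple(r[v] for v in key_vars)
        sample_freq[key] += 1            # records sharing the key in the sample
        pop_freq[key] += r[weight_var]   # assumed estimator: sum of weights
    return {k: (sample_freq[k], pop_freq[k]) for k in sample_freq}

freqs = key_frequencies(sample, key_vars)
for key, (fk, Fk) in freqs.items():
    print(key, fk, Fk)
```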

Review k-anonymity, a common risk measure for categorical data.

To achieve k-anonymity there need to be at least k individuals in the dataset that share a combination of values for the selected key variables. A record that has the same key as two other individuals in the dataset would satisfy 3-anonymity because there are at least three (k) individuals in the dataset with that key. A record that violates 2-anonymity is said to be a unique record because it is the only record in the dataset with that specific key. Remember that k-anonymity does not take into account sample weights. While there may only be three individuals in the sample that share a key, depending on the sample weights, this may correspond to many thousands of people in the population.
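
As an illustration, the following Python sketch lists the records that violate k-anonymity for a given k, using unweighted sample counts only. The dataset and variable names are hypothetical.

```python
from collections import Counter

def violations(records, key_vars, k):
    """Return the records whose key is shared by fewer than k records."""
    counts = Counter(tuple(r[v] for v in key_vars) for r in records)
    return [r for r in records if counts[tuple(r[v] for v in key_vars)] < k]

# Hypothetical dataset with two key variables.
records = [
    {"age_group": "10-19", "gender": "Female"},
    {"age_group": "10-19", "gender": "Female"},
    {"age_group": "10-19", "gender": "Female"},
    {"age_group": "30-39", "gender": "Male"},
    {"age_group": "30-39", "gender": "Male"},
    {"age_group": "80-89", "gender": "Female"},
]
key_vars = ["age_group", "gender"]

for k in (2, 3):
    print(f"records violating {k}-anonymity:", len(violations(records, key_vars, k)))
```

Here the single record with the key (80-89, Female) is unique and violates 2-anonymity, while the two (30-39, Male) records satisfy 2-anonymity but violate 3-anonymity.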

Calculate the Individual Disclosure Risk.

The Individual Disclosure Risk is the probability that an individual within a dataset could be correctly re-identified. The main factors influencing the individual risk calculation are the sample frequencies (the number of individuals that share a combination of key variables in the sample) and the sample weights. When individuals with rare combinations of key variables also have small sample weights, they will have a relatively high individual disclosure risk. In other words, if the number of individuals with this specific combination of key variables is expected to be low in the population, this increases the risk that they can be correctly re-identified.
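
The intuition can be sketched in Python with a deliberately naive estimator: each record's risk is taken as one divided by the estimated population frequency of its key, where that frequency is assumed to be the sum of the sampling weights of the records sharing the key. This is a simplification for illustration only; dedicated SDC tools may use more refined, model-based estimators. The data is hypothetical.

```python
from collections import defaultdict

def naive_individual_risk(records, key_vars, weight_var="weight"):
    """Approximate each record's risk as 1 / estimated population frequency of its key."""
    pop_freq = defaultdict(float)
    for r in records:
        pop_freq[tuple(r[v] for v in key_vars)] += r[weight_var]
    return [1.0 / pop_freq[tuple(r[v] for v in key_vars)] for r in records]

# Hypothetical sample: a rare key with a small weight, and a common key.
sample = [
    {"gender": "Female", "marital_status": "Widowed", "weight": 4.0},
    {"gender": "Female", "marital_status": "Married", "weight": 250.0},
    {"gender": "Female", "marital_status": "Married", "weight": 250.0},
]
risks = naive_individual_risk(sample, ["gender", "marital_status"])
for record, risk in zip(sample, risks):
    print(record["marital_status"], round(risk, 4))
```

The record with the rare key and small weight receives a much higher risk than the two records that share a common, heavily weighted key.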

Calculate the Global Disclosure Risk.

Individual disclosure risk measures are useful for identifying high-risk records. These individual risk measures can also be aggregated to obtain a global disclosure risk measure for the entire file. A straightforward way of calculating global risk is to take the average (mean) of the individual risks.

General Questions

Given long computation times for some methods, it is recommended, where possible, to first test the SDC methods on a subset or sample of the microdata, and then choose the appropriate SDC methods.

We recommend that you develop more than one disclosure risk scenario and conduct the assessment on each. To develop a disclosure scenario, you will need to think through the motivations of malicious actors, describe the data that they may have access to, and articulate how this and other publicly available data could be linked to your data and lead to disclosure.

Read the Assessment Results

Once you have run the assessment, it is important to understand how to interpret the results. Because the disclosure risk measures discussed above provide you with a probability of a disclosure taking place, your own judgement remains important when deciding how to proceed. Watch this video to learn more about what the risk probability means and the actions you might take to lower the risk.

Consider different risk measures when interpreting the results of the assessment.

Calculating these different disclosure risk measures helps you decide whether and how to share your data. On HDX, we calculate the global risk and review k-anonymity for all microdata shared on the platform.

Set a risk threshold that is right for your organization and context.

The risk threshold will vary according to who you are sharing the data with and the sensitivity of data in your context. When setting your risk threshold, consider existing institutional policies, guidelines, and applicable regulations in the country of operation. On HDX, our threshold for Global Risk of microdata is 3%.

Approach the global risk measure with caution.

A common way to calculate the Global Risk is to take the average of the individual disclosure risk scores. Even if the global risk is below the agreed risk threshold, there could still be a small number of individuals in the dataset with high individual risk.
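
The Python sketch below shows why the average can be misleading. It uses the 3% global threshold mentioned above together with a hypothetical per-record cut-off of 10%; the individual risk scores are invented for illustration.

```python
from statistics import mean

GLOBAL_THRESHOLD = 0.03      # 3% global risk threshold
INDIVIDUAL_THRESHOLD = 0.10  # hypothetical per-record cut-off (an assumption)

# Hypothetical individual risk scores: 97 low-risk records and 3 high-risk ones.
individual_risks = [0.001] * 97 + [0.5, 0.4, 0.3]

# Global risk as the mean of the individual risks.
global_risk = mean(individual_risks)

# Positions of records whose individual risk exceeds the per-record cut-off.
high_risk = [i for i, r in enumerate(individual_risks) if r > INDIVIDUAL_THRESHOLD]

print(f"global risk: {global_risk:.2%}")
print("below global threshold:", global_risk < GLOBAL_THRESHOLD)
print("high-risk record positions:", high_risk)
```

The file as a whole passes the global threshold, yet three records remain at high individual risk and would need attention before sharing.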

General Questions

There is no one size fits all approach to managing this disclosure risk. You will need to determine if it is possible to apply statistical disclosure control techniques to reduce this risk while maintaining the analytical value of the data. If this is not possible, you should explore safe ways of sharing the data with trusted partners under strict terms and conditions. Continue to the next step to learn more about your options if you conduct the risk assessment and determine that the data has a high risk of disclosure.

You can and you should! If you run the assessment and identify that the disclosure risk is too high, you can decide to apply SDC techniques such as recoding and local suppression to reduce that risk. Following the application of disclosure control techniques, you will need to run the assessment again to make sure that the actions you took were sufficient in limiting the risk of disclosure.

Manage Data Responsibly

Knowing the disclosure risk helps you make informed decisions about whether and how to share the data. Because we want to bias toward sharing data responsibly, it is important to consider options that will allow you to share the data in a way that protects the individuals in the dataset as opposed to simply not sharing the data at all. Watch the video to learn more about your options for managing microdata responsibly.

Use disclosure control techniques to reduce the risk of disclosure.

Disclosure control techniques are either non-perturbative or perturbative. Non-perturbative methods preserve the integrity of the data but limit the disclosure risk by reducing the detail in the microdata. These methods include local suppression, recoding, and eliminating variables. Through local suppression individual values are suppressed and replaced with missing values (NA) whereas, with global recoding, the number of distinct values for a given variable is reduced by creating intervals. Perturbative methods, on the other hand, alter values and as a result limit disclosure risk by creating uncertainty around what the true values are.
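
The two non-perturbative methods named above can be sketched in a few lines of Python. This is a naive illustration under stated assumptions: ages are recoded into fixed 10-year intervals, and local suppression simply blanks one chosen variable in records whose key is shared by fewer than k records. Real tools typically choose which values to suppress so as to minimise information loss. The function names and data are hypothetical.

```python
from collections import Counter

def recode_age(age, width=10):
    """Global recoding: replace an exact age with an interval label."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def suppress_unsafe(records, key_vars, target, k=2):
    """Naive local suppression: set `target` to None in records whose key
    is shared by fewer than k records. Returns new records; input is unchanged."""
    counts = Counter(tuple(r[v] for v in key_vars) for r in records)
    treated = []
    for r in records:
        copy = dict(r)
        if counts[tuple(r[v] for v in key_vars)] < k:
            copy[target] = None
        treated.append(copy)
    return treated

# Hypothetical records: every (age, gender) key is unique before treatment.
records = [
    {"age": 15, "gender": "Female"},
    {"age": 17, "gender": "Female"},
    {"age": 34, "gender": "Male"},
    {"age": 36, "gender": "Male"},
    {"age": 81, "gender": "Female"},
]
key_vars = ["age", "gender"]

recoded = [{**r, "age": recode_age(r["age"])} for r in records]
treated = suppress_unsafe(recoded, key_vars, target="age", k=2)
for r in treated:
    print(r)
```

Recoding alone brings four of the five records to 2-anonymity; the remaining unique record has its age suppressed.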

Navigate the trade-off between disclosure risk and data utility.

The optimum trade-off between risk and utility in the statistical disclosure process depends greatly on who the users are and the conditions under which the microdata is shared. The application of disclosure control techniques will always result in the loss of information. After applying SDC, you need to quantify the information loss in order to determine if there is still value in sharing the data. Otherwise, it may be necessary to reverse course and find other methods for sharing the data.
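
One very simple way to put a number on information loss is the share of values that were suppressed during treatment, sketched below in Python. This is only one possible measure and an assumption of this example; it does not capture the loss of detail caused by recoding. The data is hypothetical.

```python
def suppression_rate(original, treated, variables):
    """Share of originally present values in `variables` that are None after treatment."""
    total = suppressed = 0
    for before, after in zip(original, treated):
        for v in variables:
            if before[v] is not None:
                total += 1
                if after[v] is None:
                    suppressed += 1
    return suppressed / total if total else 0.0

# Hypothetical data before and after local suppression.
original = [
    {"age_group": "10-19", "gender": "Female"},
    {"age_group": "30-39", "gender": "Male"},
    {"age_group": "80-89", "gender": "Female"},
    {"age_group": "30-39", "gender": "Male"},
]
treated = [
    {"age_group": "10-19", "gender": "Female"},
    {"age_group": "30-39", "gender": "Male"},
    {"age_group": None, "gender": "Female"},
    {"age_group": "30-39", "gender": "Male"},
]

rate = suppression_rate(original, treated, ["age_group", "gender"])
print(f"suppressed values: {rate:.1%}")
```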

Find other ways to share your data responsibly.

If the disclosure risk or information loss after applying SDC is too high, there are still options for sharing the data. For example, you can share only the metadata on HDX via HDXConnect. This option allows you to let users know that the data exists and is available ‘by request’. Once users request access, you decide whether and how to share it. Alternatively, you could decide to share the data with trusted partners under strict terms and conditions defined in a data sharing agreement or information sharing protocol.

General Questions

There is a trade-off between risk of disclosure and data utility. Through the SDC process, the goal is to minimise the disclosure risk and maximise the data utility. Data utility is a measure of how useful and valid your data is following Statistical Disclosure Control. The reduction in data utility should be evaluated, whenever possible, with respect to the intended uses of the data. However, because it is not possible to imagine all possible uses of the data, you can also quantify the information loss following the application of SDC.

HDX allows microdata to be shared publicly through the site. However, to protect individuals and vulnerable groups, our team runs a disclosure risk assessment on any resource containing microdata. Once you have successfully uploaded the resource, our team will review it to better understand the likelihood of a disclosure taking place. We notify the contributor within 24 hours if any risk has been detected and then we work with them to make a decision together about whether and how it should be shared on HDX.

Conduct a disclosure risk assessment yourself

We developed a follow-along tutorial to demonstrate how to assess the disclosure risk of a dataset and apply statistical disclosure control using sdcMicro, an add-on package in R.

Try It On Your Own