As part of its role in managing HDX, the Centre is aware of various types of sensitive data collected and used by our partners to meet needs in humanitarian operations. While organisations are not allowed to share personally identifiable information on HDX, they can share survey or needs assessment data which may (or may not) be sensitive due to the risk of re-identifying people and their locations. Below, we share the findings from user research on this issue and the changes we are making to prevent exposure of high-risk data on HDX.

Sensitive data may contain aggregate information that, if disclosed or accessed without proper authorisation, could cause negative impacts on affected people, humanitarian actors and/or a response. This data is often challenging to identify without deeper analysis. Microdata from surveys and assessments can contain non-personal information on a range of topics, including disabilities, exposure to gender-based violence and other issues that may be recorded in free-text fields. 

Humanitarian organisations routinely anonymise personal data such as names, biometrics or ID numbers. However, even after such anonymisation, it is often still possible to re-identify individual respondents or organisations by combining answers to different questions, either in isolation or with basic contextual understanding. The risk of re-identification in a dataset can range from 1% to 100% depending on the presence of key variables such as age, marital status, gender and location, which, when combined, can point to a specific individual (e.g., a 24-year-old widow in a particular camp).
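
To make this concrete, here is a minimal sketch (in Python, using hypothetical column names and values, not any actual HDX dataset) that counts how many respondents share each combination of quasi-identifiers. Any combination that occurs only once singles out one person, even though the file contains no names or ID numbers.

```python
import pandas as pd

# Hypothetical survey extract with no direct identifiers.
survey = pd.DataFrame({
    "age":            [24, 24, 31, 31, 45],
    "marital_status": ["widowed", "married", "married", "married", "married"],
    "gender":         ["F", "F", "F", "F", "M"],
    "camp":           ["Camp A", "Camp A", "Camp B", "Camp B", "Camp A"],
})

quasi_identifiers = ["age", "marital_status", "gender", "camp"]

# Count how many respondents share each combination of quasi-identifiers.
group_sizes = survey.groupby(quasi_identifiers).size()

# Any combination that appears only once singles out one respondent: here the
# 24-year-old widow in Camp A is unique, so her record could be re-identified
# even though the file contains no name or ID number.
print(group_sizes[group_sizes == 1])
```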

The HDX team currently applies statistical disclosure control (SDC) to all survey data shared on the platform to determine the risk of re-identification of individuals and groups. When requested by our partners, we go further by applying measures to reduce this risk to an acceptable level. Since the beginning of 2018, we have conducted a risk assessment on some 60 datasets shared on HDX with varying results (see more detail in this guidance note on SDC). 

With support from the Directorate-General for European Civil Protection and Humanitarian Aid Operations (ECHO), we are developing improved processes so that data shared on HDX is automatically assessed during the upload process rather than after it is made public. We are also integrating the SDC review into our technical environment. These changes are based on what we learned from user research and our team’s direct experience in managing this type of data. 

Insights from User Research 

In 2019, we worked with Oblo Design to conduct user research with a range of data collection partners. The research focused on understanding how these organisations manage potentially sensitive survey data, both within their internal workflows and when sharing data, including via HDX. 

Through a series of structured interviews and risk mapping exercises, the research team found that the responsible handling of sensitive humanitarian data requires improvements in two areas: competencies and technology. 

  • In terms of competencies, the researchers found that, at times, organisations had insufficient internal knowledge to detect potential data risks. Not all organisations have data scientists or other staff with specialised skills needed to manage sensitive data. 
  • In terms of technology, the researchers found that organisations may lack some of the tools and infrastructure (such as a secure cloud hosting environment for performing SDC) required to assess sensitive data before sharing it. 

Organisations also noted the need for contextual knowledge in understanding the level of risk; for example, data related to a conflict environment would present more risk than data related to a weather event. As such, organisations aim to de-risk data soon after it is collected and within the operational context.

The Oblo team examined two components of the de-risking process: the performance of a sensitivity check and measures for anonymising data. Research participants expressed interest in a partially automated process for both of these components. While automating a sensitivity check was perceived as a relatively straightforward task, automated anonymisation was understood to be much more complex. 

Planned improvements for HDX 

Drawing on the research insights and recommendations from Oblo, we identified three areas for improving how sensitive data is managed on HDX:

  1. Enhanced quality assurance process
  2. Integrated statistical disclosure control 
  3. Automated screening for all new data 

We plan to release these improvements in a phased manner over the coming months, with an initial focus on an enhanced quality assurance process and more robust tools and infrastructure for SDC by the HDX team. 

  1. Enhanced quality assurance process  

The HDX team manually reviews every dataset uploaded to the platform as part of a standard quality assurance (QA) process. This process exists to ensure compliance with the HDX Terms of Service, which prohibit the sharing of personal data. It also serves as a means to check different quality criteria, including the completeness of metadata, the relevance of the data to humanitarian action, and the integrity of the data file(s). 

To improve this process, we have created an internal QA dashboard which enables the HDX team to more easily identify and prioritise potentially sensitive datasets for review. If an issue is found, the dataset will appear ‘under review’ in the HDX public interface until it is resolved.

  2. Integrated statistical disclosure control 

We use an open-source software package for SDC called sdcMicro. We have integrated the SDC package and dataset review process within the HDX infrastructure. This means that our team will no longer have to download microdata from HDX to perform SDC on their local machines, decreasing the risk of this data getting into the wrong hands.  
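
sdcMicro itself is an R package, so the Python sketch below is only an illustration of one measure that an SDC review can apply: local suppression, i.e. blanking quasi-identifier values for records whose combination of quasi-identifiers is rare. The function name, threshold and column handling are assumptions for illustration, not the HDX implementation.

```python
import pandas as pd

def suppress_rare_combinations(df: pd.DataFrame, quasi_identifiers: list[str], k: int = 3) -> pd.DataFrame:
    """Blank out quasi-identifier values for records whose combination of
    quasi-identifiers is shared by fewer than k respondents."""
    # Size of the group each record belongs to, aligned to the original rows.
    sizes = df.groupby(quasi_identifiers)[quasi_identifiers[0]].transform("size")
    out = df.copy()
    out[quasi_identifiers] = out[quasi_identifiers].astype("object")
    out.loc[sizes < k, quasi_identifiers] = None  # suppress the identifying values
    return out

# With k=2, the single 24-year-old widow in a given camp would have her age,
# marital status, gender and camp blanked before the file is shared.
```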

  3. Automated screening for all new data

Later this year, our goal is to deploy an algorithm to screen all data that is being uploaded to HDX for personally identifiable information as well as other forms of sensitive data. This will be done using a script that scans the entire data file for different attributes (e.g. column headers that may indicate the presence of sensitive data). 
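
As a rough illustration of what such a screening step could look like, the sketch below flags column headers that match a keyword list. The keywords, function name and flagging logic are assumptions for illustration only, not the actual HDX screening algorithm.

```python
import csv

# Hypothetical keyword list; the real screening attributes would be broader
# and refined over time based on performance.
SENSITIVE_KEYWORDS = [
    "name", "phone", "email", "gps", "latitude", "longitude",
    "national_id", "disability", "gbv",
]

def flag_sensitive_columns(csv_path: str) -> list[str]:
    """Return the column headers that match a sensitive-attribute keyword."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        headers = next(csv.reader(f), [])
    return [h for h in headers if any(kw in h.lower() for kw in SENSITIVE_KEYWORDS)]

# A dataset with a non-empty flagged list would be marked 'under review' and
# quarantined pending a manual check by the HDX team.
```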

Datasets flagged by the script as containing sensitive data will be (a) automatically marked as ‘under review’ in the public interface of HDX and (b) quarantined until the HDX team completes a manual review of the data. During this time, users will not be able to download the file.

Over time, we will refine the script based on its performance, adding or removing key attributes to improve the detection of different forms of sensitive data.


Alongside these improvements, we will continue to work closely with our partners to respond to the challenges they face within their data management workflows. Helping partners to de-risk data before it is ever shared is the best way to prevent risky data from being exposed publicly, on HDX or elsewhere.

For more information, see our guidance on different topics related to data responsibility here. Let us know if we are on the right track by contacting us at centrehumdata@un.org.