I began my fellowship by reading ‘An Overview of Education Data on HDX.’ The blog includes a visualisation showing what education data is available and missing on the platform. The analysis highlighted a major problem: the data that is critical for understanding education in emergencies is either missing or unidentified.

The main objective of my fellowship was to discover and improve access to education data across organizations and crises, and reshape it into a structured format. My idea was to develop a ‘meta-dataset’ on education in emergencies that could be used to analyze completeness for specific indicators and locations.

Understanding the Problem

To understand which types of education data are most commonly collected and used in different crises, I narrowed my focus to four countries with complex emergencies: Iraq, South Sudan, Syria and Yemen. In total, I analyzed around 23 education datasets or reports that I found both on and off HDX. I compiled these datasets into a table comparing the indicator, the country and the date of each dataset.

I came to five conclusions about education in emergencies data:

  1. Datasets express the same indicator in different languages or using different terms. Take a look at the two images of reports below: both reports are attempting to express the indicator on the percentage of affected schools/learning spaces in affected areas. However, the South Sudan report shows a percentage of non-functional schools while the Yemen report shows a number of schools unfit for use due to conflict.

 

Source: South Sudan Education Cluster (2017), and Yemen Education Cluster (2017)

 

  2. Datasets have different levels of granularity and disaggregation. For example, dataset A might provide attendance rate by gender, while dataset B provides the aggregated attendance rate without showing the data by gender.
  3. Some countries had missing education indicators. I discovered data on the number of children reached with education support in Iraq and Syria but not in South Sudan or Yemen. I developed a table showing which indicators are available by country, which the HDX team can use immediately in its data outreach efforts.
  4. Some datasets provide an appropriate amount of detail, but are outdated. The data collected for many education indicators is from several years ago. HDX could add a notification feature so that partners can find out when a dataset is updated.
  5. Some education indicators and datasets are available on other sites or from other sources but are not in HDX. Humanitarian Response Plans, such as South Sudan’s, show funding requirements for the Education Cluster (see image). This data is included in larger financial tables but not as a separate education funding requirement indicator.

 

Source: South Sudan Humanitarian Response Plan 2018

Across the four crises that I examined, education crisis data was inconsistent in terms of its representation, availability, timeliness, and quality. My project was designed to reshape education data into a structured format to improve its comparability and to provide an overview of data completeness.

The Outcome

To develop the meta-dataset, I created a form to analyse each education dataset. The form included the following questions:

  • Who provided the dataset?
  • What education indicator did they provide?
  • What country does the data cover?
  • When was the dataset last updated?
  • Which attributes (i.e. gender, age) does the dataset include?

The meta-dataset followed the design of a star schema (see image below), an approach to developing a scalable data warehouse. This design makes it easy to answer high-level management questions such as: How many HDX partners collect data about attendance rates in Yemen?

The star schema architecture that I built for the education meta-dataset. See more here at GitHub.
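To illustrate how a star schema answers a question like the one above, here is a minimal sketch using Python's built-in `sqlite3` module. The table names, column names, and sample rows are all assumptions made for illustration; they are not taken from the actual meta-dataset, whose design is in the GitHub repository.

```python
import sqlite3

# Hypothetical star schema: one fact table surrounded by dimension tables.
# All table names, column names, and rows are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_provider  (provider_id  INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_indicator (indicator_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_country   (country_id   INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_dataset (
    dataset_id   INTEGER PRIMARY KEY,
    provider_id  INTEGER REFERENCES dim_provider(provider_id),
    indicator_id INTEGER REFERENCES dim_indicator(indicator_id),
    country_id   INTEGER REFERENCES dim_country(country_id),
    last_updated TEXT
);
""")

conn.executemany("INSERT INTO dim_provider VALUES (?, ?)",
                 [(1, "Provider A"), (2, "Provider B")])
conn.executemany("INSERT INTO dim_indicator VALUES (?, ?)",
                 [(1, "attendance rate"), (2, "people reached")])
conn.executemany("INSERT INTO dim_country VALUES (?, ?)",
                 [(1, "Yemen"), (2, "Iraq")])
conn.executemany("INSERT INTO fact_dataset VALUES (?, ?, ?, ?, ?)", [
    (1, 1, 1, 1, "2018-01-15"),
    (2, 2, 1, 1, "2017-06-01"),
    (3, 1, 1, 1, "2016-03-10"),  # a second attendance dataset from Provider A
    (4, 2, 2, 2, "2018-02-20"),
])


def count_providers(conn, indicator, country):
    """Count distinct providers with a dataset for this indicator and country."""
    row = conn.execute("""
        SELECT COUNT(DISTINCT f.provider_id)
        FROM fact_dataset AS f
        JOIN dim_indicator AS i ON i.indicator_id = f.indicator_id
        JOIN dim_country   AS c ON c.country_id   = f.country_id
        WHERE i.name = ? AND c.name = ?
    """, (indicator, country)).fetchone()
    return row[0]


partners = count_providers(conn, "attendance rate", "Yemen")
```

Because every descriptive attribute lives in its own dimension table, a new question (a different indicator, a different country) only changes the filter values, not the shape of the query.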

More importantly, a data grid specific to education in emergencies can be derived from the meta-dataset. A data grid assesses the attributes of the available and unavailable information. It communicates:

    • Freshness: up-to-date, over-due, and delinquent datasets.
    • Reliability: knowing which organization provided the dataset.
    • Coverage: linking the dataset to the location it covers.
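As a rough illustration of the freshness attribute, the sketch below sorts a dataset into one of the three categories based on its last update date. The thresholds are assumptions chosen for the example (up-to-date within the expected update interval, over-due up to twice that interval, delinquent beyond it) and are not HDX's actual rules; the function name `classify_freshness` is also hypothetical.

```python
from datetime import date


def classify_freshness(last_updated, expected_interval_days, today):
    """Return a freshness label for a dataset.

    Assumed rule (illustrative only, not the HDX definition):
      - age <= interval          -> "up-to-date"
      - age <= 2 * interval      -> "over-due"
      - otherwise                -> "delinquent"
    """
    age_days = (today - last_updated).days
    if age_days <= expected_interval_days:
        return "up-to-date"
    if age_days <= 2 * expected_interval_days:
        return "over-due"
    return "delinquent"


# Example with made-up dates and a yearly expected update interval.
today = date(2018, 6, 1)
statuses = {
    "recent": classify_freshness(date(2018, 3, 1), 365, today),
    "late": classify_freshness(date(2017, 1, 1), 365, today),
    "stale": classify_freshness(date(2015, 1, 1), 365, today),
}
```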

The HDX team was working on a data grid concept during my fellowship. The grid provides an at-a-glance look at the core data that is available for a crisis. The visual below is an example of how the grid would be represented on an HDX crisis page. The education data category is showing that there is no data available.

 

An early stage mock-up of a data grid on HDX showing missing education data for Uganda.

I developed a web crawler that systematically browses education datasets in HDX and populates the meta-dataset. The crawler scrapes all information required to fill out the rows in the meta-dataset: file type, dataset title, provider, coverage country, publishing and update dates, HDX tags, indicators, and qualities.

The crawler checks whether any word in a dataset appears in the list of synonyms for an indicator. If so, the crawler tags the dataset with that indicator. For example, suppose that an HXLated education dataset includes the word #reached. Because #reached is in the synonym list for the number of people reached, the crawler can tell that the dataset mentions that indicator. The synonym list is re-read every time the crawler runs, so any edit to the list takes effect on the next run.
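The matching step can be sketched roughly as follows. This is not the crawler's actual code: the CSV layout, the synonym entries, and the function names `load_synonyms` and `tag_indicators` are assumptions made for illustration. The synonym list is kept as CSV text here so that, as in the original design, it could be maintained in a spreadsheet.

```python
import csv
import io
import re

# Hypothetical synonym list; in practice this would be a spreadsheet export.
SYNONYMS_CSV = """indicator,synonym
people reached,#reached
people reached,reached
attendance rate,attendance
attendance rate,attending
"""


def load_synonyms(csv_text):
    """Build a mapping of indicator name -> set of lower-cased synonyms."""
    synonyms = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        term = row["synonym"].strip().lower()
        synonyms.setdefault(row["indicator"], set()).add(term)
    return synonyms


def tag_indicators(dataset_text, synonyms):
    """Return the sorted indicators whose synonyms appear as words in the text."""
    # Tokens are words, optionally prefixed with '#' so HXL hashtags survive.
    words = set(re.findall(r"#?\w+", dataset_text.lower()))
    return sorted(ind for ind, terms in synonyms.items() if words & terms)


synonyms = load_synonyms(SYNONYMS_CSV)
sample = "#country+name,#adm1+name,#reached\nYemen,Aden,1200"
tags = tag_indicators(sample, synonyms)
```

Reloading the list on each run is what lets edits to the spreadsheet change the tagging without touching the code.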

HDX can scale the output of the crawler, the meta-dataset, to other sectors, since the synonym list can be edited in any spreadsheet application. It is a simple matter of adding, removing, and changing cells in the list. This information can then be used to populate the data grid, giving users a quick sense of what data is available in real time as data is added to or removed from the site.

Read the rest of the blogs by the 2018 Data Fellows, including their work on data storytelling, predictive analytics and user experience research. Learn more about the programme from Senior Data Fellow Stuart Campo in Part 1 and Part 2 of his summary posts. Plus, watch the Data Fellows summary video here.