Welcome to our glossary of data terms.

Here you will find definitions of terms that we commonly use, but may not always understand. The definitions are culled from trusted sources. Where a source is cited, that is the single source of the definition. Where a source is not cited, we have rephrased excerpts from multiple sources to provide the most clear definition possible.

Are there terms we should add? Let us know by emailing us at centrehumdata@un.org.


A

Algorithm is a set of instructions to be followed in computations or other problem-solving procedures. Source TechTerms.

Anonymisation is a process by which personal data is irreversibly altered, either by removing or modifying the identifying variables, in such a way that a data subject can no longer be identified directly or indirectly.
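To make the idea concrete, here is a minimal sketch in Python of removing direct identifiers and coarsening a quasi-identifier. The field names and age bands are illustrative, and real anonymisation also requires assessing the residual risk of re-identification:

```python
# Illustrative anonymisation sketch: drop direct identifiers and
# generalize quasi-identifiers so records are harder to single out.
# Field names here are hypothetical examples, not a prescribed schema.

def anonymise(record):
    """Return a copy of the record with identifying variables removed
    or coarsened."""
    anonymised = dict(record)
    # Remove direct identifiers entirely.
    for field in ("name", "phone"):
        anonymised.pop(field, None)
    # Generalize exact age into a 10-year band.
    if "age" in anonymised:
        low = (anonymised["age"] // 10) * 10
        anonymised["age"] = f"{low}-{low + 9}"
    return anonymised

record = {"name": "A. Example", "phone": "555-0100", "age": 34, "district": "North"}
print(anonymise(record))  # {'age': '30-39', 'district': 'North'}
```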

Anticipatory action is action taken in anticipation of a crisis (either before the shock or before substantial humanitarian needs have manifested themselves), with the intention to mitigate the impact of the crisis or improve the response. It is a proactive intervention that takes place upon issuance of a warning or activation of a trigger.

Application programming interface (API) is a set of commands, functions, protocols, and objects that programmers can use to create software or to interact with external systems. It provides standard commands for performing common operations so that code does not have to be written from scratch. Source TechTerms.
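As a concrete illustration, Python's built-in json module is an API: it provides standard commands (functions such as dumps and loads) for a common operation, so a serializer does not have to be written from scratch:

```python
import json

# The json module's API exposes ready-made functions for converting
# between Python objects and JSON text, a common operation.
record = {"country": "Chad", "people_in_need": 120000}

text = json.dumps(record)        # object -> JSON string
restored = json.loads(text)      # JSON string -> object

print(restored["people_in_need"])  # 120000
```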

Artificial intelligence (AI) is a field of science concerned with building computers and machines that can reason, learn, and act in such a way that would normally require human intelligence or that involves data whose scale exceeds what humans can analyze. Source Google Cloud.

B

Big data are high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation. Source NTNU.

C

Climate data is information about weather patterns, such as temperature or precipitation, and forecasts, which offer predictions about expected weather conditions over a period of time in the future.

Cyber threat is an activity that occurs at least in part within the cyber realm, utilizing and/or targeting information communications technologies (ICTs) to achieve an effect that is not authorized by the legitimate user of the data or the ICT, and/or has a harmful intent or effect on the victim(s). Source Virtual Risk, Tangible Harm: The Humanitarian Implications of Cyber Threats.

D

Data cleaning is the process of correcting and/or standardizing data from a record set, table, or database. 
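A minimal sketch in Python of common cleaning steps on a record set; the field names and the alias table are illustrative:

```python
# Illustrative data-cleaning sketch: trim whitespace, standardize
# inconsistent spellings, and normalize missing values.

COUNTRY_ALIASES = {
    "drc": "Democratic Republic of the Congo",
    "democratic republic of the congo": "Democratic Republic of the Congo",
}

def clean_record(record):
    cleaned = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = value.strip()
            if value in ("", "n/a", "N/A"):
                value = None  # normalize missing values
        cleaned[key] = value
    country = cleaned.get("country")
    if country:
        # Standardize country names against a lookup table.
        cleaned["country"] = COUNTRY_ALIASES.get(country.lower(), country)
    return cleaned

raw = {"country": "  DRC ", "partner": "n/a"}
print(clean_record(raw))
# {'country': 'Democratic Republic of the Congo', 'partner': None}
```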

Data culture is the environment within an organization or other entity in which people value, practice and encourage the use of data to improve their work and decision-making. Source Tableau.

Data incidents are events involving the management of data that have caused harm or have the potential to cause harm. Source Guidance Note Series, Data Responsibility in Humanitarian Action Note #1: Data Incident Management.

Data literacy includes the ability to read, work with, analyze, and converse with data.

Data mining is the practice of searching through large amounts of computerized data to find useful correlations, patterns or trends.

Data responsibility in humanitarian action is the safe, ethical and effective management of personal and non-personal data for operational response, in accordance with established frameworks for personal data protection. Source IASC Operational Guidance on Data Responsibility in Humanitarian Action.

Data science is the study of data that involves developing methods of recording, storing, and analyzing data to effectively extract useful information. The goal of data science is to gain insights and knowledge from any type of data — both structured and unstructured. Source TechTerms.

Data security is a set of physical, technological and procedural measures that safeguard the confidentiality, integrity and availability of data and prevent its accidental or intentional, unlawful or otherwise unauthorized loss, destruction, alteration, acquisition, or disclosure. Source OCHA Data Responsibility Guidelines.

Data sensitivity is determined by the likelihood and severity of the potential harm that may materialize as a result of the data’s exposure in a particular context.

Data standard is a published specification, e.g. the structure of a particular file format, recommended nomenclature to use in a particular domain, a common set of metadata fields, etc. Conforming to relevant standards greatly increases the value of published data by improving machine readability and easing data integration. Source Open Data Handbook. The Humanitarian Exchange Language (HXL) is a lightweight data standard for exchanging humanitarian data in an interoperable way based on spreadsheet formats such as CSV or Excel.
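In HXL, a row of hashtags is placed directly beneath the human-readable column headers so that tools can interpret the columns regardless of how the headers are worded. A minimal sketch in Python; the tags shown (#adm1+name and #affected) follow the HXL convention, but treat the exact schema as illustrative:

```python
import csv, io

# Build a tiny HXL-style CSV in memory: a human-readable header row,
# then a hashtag row that machines can rely on, then the data rows.
rows = [
    ["Province", "People affected"],   # human-readable headers
    ["#adm1+name", "#affected"],       # HXL hashtag row
    ["North", "1200"],
    ["South", "3400"],
]

buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
hxl_csv = buffer.getvalue()
print(hxl_csv)
```

Because the hashtag row is standardized, a tool can find the "affected people" column by its #affected tag even if the header above it is worded differently in every dataset.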

Data storytelling is a structured approach for communicating data insights that combines data, visuals, and narrative.

Data visualization is the graphical representation of information and data. Visual elements like charts, graphs, and maps provide an accessible way to see and understand trends, outliers, and patterns in data. Source Tableau.

Demographically identifiable information (DII) is made up of data points that enable the identification, classification, and tracking of individuals, groups, or multiple groups of individuals by demographically defining factors. These may include ethnicity, gender, age, occupation, and religion. DII may also be referred to as Community Identifiable Information. Source OCHA Data Responsibility Guidelines.

F

False negative is when a model output fails to predict a condition or attribute that is present. For example, failing to predict a shock that does occur.

False positive is when a model output predicts a condition or attribute that is not present. For example, predicting a shock that does not manifest.
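Given paired predictions and observed outcomes, both error types can be counted directly. A minimal sketch in Python:

```python
def count_errors(predicted, observed):
    """Count false positives and false negatives from paired
    boolean predictions and observed outcomes."""
    false_pos = sum(1 for p, o in zip(predicted, observed) if p and not o)
    false_neg = sum(1 for p, o in zip(predicted, observed) if not p and o)
    return false_pos, false_neg

# A model predicts a shock (True) in four periods; shocks occur in two.
predicted = [True, False, True, False]
observed  = [True, True, False, False]
print(count_errors(predicted, observed))  # (1, 1)
```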

Forecast is a prediction or estimate of future events and their expected impacts and consequences. It is the output of a predictive model.

Foresight is an organized and systemic process to engage with uncertainty regarding the future. Source The SOIF Primer on Strategic Foresight.

Foundation models (sometimes called ‘general-purpose AI’ or ‘GPAI’ systems) are AI models trained on broad data at scale such that they can be adapted to a wide range of downstream tasks, such as text synthesis, image manipulation and audio generation. Source Stanford University Center for Research on Foundation Models + Ada Lovelace Institute.

G

Generative artificial intelligence (GenAI) is a category of artificial intelligence model that learns patterns from input data in order to generate new data, such as text, images, audio, or video. Source TechTerms.

Geodata is information about geographic locations that is stored in a format that can be used with a geographic information system (GIS). Geodata can be stored in a database, geodatabase, shapefile, coverage, raster image, or even a dbf table or Microsoft Excel spreadsheet. Source ArcGIS.

H

Human-centred design is a philosophy that empowers an individual or team to design products, services, systems, and experiences that address the core needs of those who experience a problem. Source DC Design.

Humanitarian data is data about the context in which a humanitarian crisis is occurring (e.g. baseline/development data, damage assessments, geospatial data); about the people affected by the crisis and their needs; or about the response by organisations and people seeking to help those who need assistance. Source OCHA Data Responsibility Guidelines.

Humanitarian microdata is data on the characteristics of a population that is gathered through exercises such as household surveys, needs assessment or monitoring activities. Source Guidance Note Series, Data Responsibility in Humanitarian Action Note #1: Statistical Disclosure Control.

K

Key variables, also called “quasi-identifiers”, are a set of variables that, in combination, can be linked to external information to re-identify respondents in the released dataset.

L

Large language models (LLMs) are a narrow type of foundation model that works with language. This provides the basis for a wide range of natural language processing (NLP) tasks, such as generating blocks of text based on a user prompt. Source TechTerms.

Lead time is the length of time between the forecast publication and the forecast or target period. For example, rainfall forecasts for the month of June published in April would have a two-month lead time.

M

Machine learning is a set of techniques used to automatically find valuable underlying patterns within complex data that we would otherwise struggle to discover. These hidden patterns and knowledge about a problem can be used to predict future events and perform different kinds of complex decision making. Source towardsdatascience.com.

Metadata is data about data, or data that defines or describes other data. Metadata is additional information or documentation about your dataset that makes it easier for others to understand and put your data into context. Source HDX.

Model is a representation of a system which is used to study the system itself and to make predictions about the expected behaviour of the system. Source Stanford Encyclopedia of Philosophy.

Mosaic effect is when disparate pieces of data or information—although individually of limited utility—become significant when combined with other types of information. Applied to public use data, the concept of a mosaic effect suggests that even anonymized data, which may seem innocuous in isolation, may become vulnerable to re-identification if enough datasets containing similar or complementary information are released. Source U.S. Department of Health and Human Services.

O

Open data is data that can be freely used, re-used and redistributed by anyone – subject only, at most, to the requirement to attribute and sharealike. Source Open Data Handbook.

Open data portal is a web-based interface designed to make it easier to find re-usable information. Like library catalogues, it contains metadata records of datasets published for re-use, i.e. mostly relating to information in the form of raw, numerical data and not to textual documents. Source European Commission.

P

Personal data is any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person. Source OCHA Data Responsibility Guidelines.

Personally identifiable information (PII), also called “direct identifiers”, consists of variables that directly and unambiguously reveal the identity of a respondent (e.g. names, social identity numbers). Source SDC Practice Guide.

Predictive analytics involves the analysis of current and historical data to anticipate an event or some characteristic of an event (the probability, severity, magnitude, or duration). Predictive analytics can also support decision making by examining data or content to answer the question “What should be done?” or “What can we do to make ___ happen?”. It is characterized by techniques such as graph analysis, simulation, complex event processing, neural networks, recommendation engines, heuristics, and machine learning.

Programming language is a set of commands, instructions, and other syntax used to create a software program. Languages that programmers use to write code are called “high-level languages.” This code can be compiled into a “low-level language,” which is recognized directly by the computer hardware. Source TechTerms.

R

Re-identification is a process by which de-identified (anonymised) data becomes re-identifiable again and thus can be traced back or linked to an individual(s) or group(s) of individuals through reasonably available means at the time of data re-identification. Source OCHA Data Responsibility Guidelines.

Risk analysis is the assessment of the combination of the probability of an event and its negative consequences. Source United Nations International Strategy for Disaster Reduction.

S

Sensitive data is data that, if disclosed or accessed without proper authorization, is likely to cause harm to any person, including the source of the data or other identifiable persons or groups, or a negative impact on an organization’s capacity to carry out its activities or on public perceptions of that organization. Source OCHA Data Responsibility Guidelines.

Statistical bias occurs when a model or statistic is not representative of the underlying population.

Statistical disclosure control (SDC) is a technique used in statistics to assess and lower the risk of a person or organisation being re-identified from the results of an analysis of survey or administrative data, or in the release of microdata. Source Guidance Note Series, Data Responsibility in Humanitarian Action Note #1: Statistical Disclosure Control.
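One common SDC check is k-anonymity over the quasi-identifiers: any combination of key-variable values shared by fewer than k records is flagged, since those respondents are easiest to re-identify. A minimal sketch in Python; the field names and the choice of k are illustrative:

```python
from collections import Counter

def risky_combinations(records, key_vars, k=3):
    """Return key-variable combinations shared by fewer than k records.
    Such small groups carry a higher re-identification risk."""
    counts = Counter(tuple(r[v] for v in key_vars) for r in records)
    return {combo for combo, n in counts.items() if n < k}

records = [
    {"district": "North", "age_band": "30-39"},
    {"district": "North", "age_band": "30-39"},
    {"district": "North", "age_band": "30-39"},
    {"district": "South", "age_band": "60-69"},  # unique combination -> risky
]
print(risky_combinations(records, ["district", "age_band"]))
# {('South', '60-69')}
```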

Statistics is a branch of mathematics dealing with the collection, classification, analysis, interpretation and presentation of numerical data. Statistics can interpret aggregates of data too large to be intelligible by ordinary observation because such data (unlike individual quantities) tend to behave in a regular, predictable manner. It is subdivided into descriptive statistics and inferential statistics.

T

Threshold is a predetermined value that must be reached or exceeded for a certain result or outcome to occur.

Trigger mechanism is a predetermined criterion that, when met, is used to initiate actions. Source Risk-informed Early Action Partnership.
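Thresholds and trigger mechanisms combine naturally: a forecast value is compared against a predetermined threshold, and action is initiated when the threshold is reached or exceeded. A minimal sketch in Python; the variable names and threshold value are illustrative:

```python
def trigger_activated(forecast_value, threshold):
    """A simple trigger mechanism: the criterion is met when the
    forecast reaches or exceeds the predetermined threshold."""
    return forecast_value >= threshold

RAINFALL_THRESHOLD_MM = 150  # hypothetical monthly rainfall threshold

print(trigger_activated(180, RAINFALL_THRESHOLD_MM))  # True
print(trigger_activated(120, RAINFALL_THRESHOLD_MM))  # False
```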

U

User experience (UX) is the overall experience of a person using a product such as a website or application.

User interface (UI) is the means by which a user interacts with a computer system or hardware device to complete tasks.