User Guide

Overview

intake_sdmx is a plugin for intake which leverages pandaSDMX to make data and metadata from the SDMX ecosystem accessible via intake. To achieve this, intake_sdmx provides three intake drivers:

  • intake_sdmx.SDMXSources: a catalog of the SDMX data sources (agencies) supported by pandaSDMX,

  • intake_sdmx.SDMXDataflows: a catalog of the dataflows provided by a given data source,

  • intake_sdmx.SDMXData: a driver representing the data sets of a chosen dataflow, exportable to pandas.

Whether you come from intake or from pandaSDMX, the above concepts should ring a bell. If you are new to both ecosystems, just read on, follow the code examples, and dig deeper by skimming the intake or pandaSDMX docs as needed.

The following sections expand on the introductory code example from the pandaSDMX documentation.

Exploring the available data sources

You can instantiate the catalog of SDMX data sources in one of two ways:

# firstly, via intake_sdmx:
In [1]: from intake_sdmx import *

In [2]: src = SDMXSources()

In [3]: type(src)
Out[3]: intake_sdmx.SDMXSources

# secondly, via the intake API:
In [4]: import intake

In [5]: src2 = intake.open_sdmx_sources()

# YAML representation
In [6]: print(src.yaml())
sources:
  SDMX data sources:
    args: {}
    description: SDMX data sources (a.k.a. agencies / data providers) supported
      by pandaSDMX
    driver: intake_sdmx.SDMXSources
    metadata: {}


In [7]: src.yaml() == src2.yaml()
Out[7]: True

# Available data sources
In [8]: list(src)
Out[8]: 
['ABS_XML',
 'Australian Bureau of Statistics - XML',
 'BBK',
 'Deutsche Bundesbank (German Central Bank)',
 'BIS',
 'Bank for International Settlements',
 'ECB',
 'European Central Bank',
 'ESTAT',
 'Eurostat',
 'ILO',
 'International Labor Organization',
 'IMF',
 'International Monetary Fund',
 'INEGI',
 'Instituto Nacional de Estadística y Geografía (MX)',
 'INSEE',
 'Institut national de la statistique et des études économiques (FR)',
 'ISTAT',
 'Instituto Nationale di Statistica (IT)',
 'NB',
 'Norges Bank (NO)',
 'SGR',
 'SDMX Global Registry',
 'UNICEF',
 "UN International Children's Emergency Fund",
 'CD2030',
 'COUNTDOWN 2030',
 'SPC',
 'Pacific Data Hub',
 'UNSD',
 'United Nations Statistics Division',
 'WB',
 'World Bank World Integrated Trade Solution',
 'WB_WDI',
 'World Bank World Development Indicators',
 'LSD',
 'Statistics Lithuania']

Two observations:

  • For intake novices: each driver instance can create a declarative YAML description of itself which suffices to re-generate clones by calling intake.open_yaml().

  • For pandaSDMX novices: the catalog contains two entries per data provider, keyed by its ID and its name (in SDMX terminology) respectively. These duplicate keys are a pragmatic workaround: catalog entries are expensive to instantiate, as each one requires an HTTP request to a different SDMX web service, and plain dict keys are a cheap way to show human-readable descriptions alongside the IDs.
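The interleaved ID/name layout is easy to handle with plain Python slicing. The following sketch uses a hypothetical excerpt of the key list shown above:

```python
# Catalog keys alternate between provider IDs and human-readable names.
# Hypothetical excerpt of list(src):
keys = ['ECB', 'European Central Bank',
        'ESTAT', 'Eurostat',
        'ILO', 'International Labor Organization']

# Even positions are IDs, odd positions are names:
ids, names = keys[::2], keys[1::2]
id_to_name = dict(zip(ids, names))

print(id_to_name['ESTAT'])  # Eurostat
```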

As in pandaSDMX, you can configure your HTTP connections:

In [9]: src_via_proxy = SDMXSources(
   ...:     storage_options={'proxies': {'http': 'http://1.1.1.1:4567'}})
   ...: 

The storage_options argument is an intake feature. Options will be propagated to any HTTP connection established by instances derived from src_via_proxy. Note that no HTTP connection is made upon instantiation of intake_sdmx.SDMXSources.

Exploring the dataflows of a given data source

Suppose we want to analyze annual unemployment data for some EU countries. We assume such data to be available from Eurostat.

In [10]: estat_flows = src.ESTAT

In [11]: type(estat_flows)
Out[11]: intake_sdmx.SDMXDataflows

In [12]: print(estat_flows.yaml())
sources:
  ESTAT:
    args:
      storage_options: {}
    description: Eurostat
    driver: intake_sdmx.SDMXDataflows
    metadata:
      catalog_dir: ''
      source_id: ESTAT


In [13]: len(estat_flows)
Out[13]: 13448

# Wow!
In [14]: list(estat_flows)[:20]
Out[14]: 
['DS-018995',
 'EU trade since 1999 by SITC',
 'DS-022469',
 'Extra-EU trade since 2000 by mode of transport, by NST/R',
 'DS-032655',
 'EU trade since 1988 by BEC/rev.4',
 'DS-043227',
 'EFTA trade since 1995 by SITC',
 'DS-066341',
 'Sold production, exports and imports',
 'DS-066342',
 'Total production',
 'DS-1060915',
 'EU trade since 1988 by CPA 2008',
 'DS-1062396',
 'EU trade since 2002 by CPA 2.1',
 'DS-1165455',
 'UK trade Feb-Oct 2020 by HS2-4-6 and CN8',
 'DS-1180622',
 'EU trade since 2015 of COVID-19 medical supplies by categories']

Luckily, this class provides a rudimentary intake_sdmx.SDMXDataflows.search() method that generates a shorter subcatalog:

In [15]: unemployment_flows = estat_flows.search("unemployment")

In [16]: len(unemployment_flows)
Out[16]: 186

# This is still too large...
# So let's refine our search.
In [17]: unemployment_flows = estat_flows.search("annual unemployment", operator="&")

In [18]: list(unemployment_flows)
Out[18]: 
['lfsi_long_e01',
 'Transition from unemployment to employment by sex, age and duration of unemployment - annual averages of quarterly transitions, estimated probabilities',
 'lfsi_long_e02',
 'Transition from unemployment to employment by sex, age and previous work experience - annual averages of quarterly transitions, estimated probabilities',
 'lfsi_long_e03',
 'Transition from unemployment to employment by sex, age and degree of urbanisation - annual averages of quarterly transitions, estimated probabilities',
 'lfsi_long_e04',
 'Transition from employment to unemployment by sex, age and degree of urbanisation - annual averages of quarterly transitions, estimated probabilities',
 'lfsi_long_e05',
 'Transition from employment to unemployment by sex, age and type of contract - annual averages of quarterly transitions, estimated probabilities',
 'lfsi_sup_a',
 'Supplementary indicators to unemployment - annual data',
 'lfsi_sup_a_h',
 'Supplementary indicators to unemployment (1992-2020) - annual data',
 'tipsun20',
 'Unemployment rate - annual data',
 'une_educ_a',
 'Unemployment by sex, age and educational attainment - annual data',
 'une_educ_a_h',
 'Unemployment by sex, age and educational attainment (1998-2020) - annual data',
 'une_ltu_a',
 'Long-term unemployment by sex - annual data',
 'une_ltu_a_h',
 'Long-term unemployment by sex (1996-2020) - annual data',
 'une_rt_a',
 'Unemployment by sex and age – annual data',
 'une_rt_a_h',
 'Unemployment by sex and age (1992-2020) - annual data']
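Conceptually, search() is a word filter over the catalog keys. The sketch below is an illustrative re-implementation, not the actual intake_sdmx code; in particular, the operator="&" semantics (all words must match, case-insensitively) are an assumption based on the output above:

```python
def search(keys, text, operator="|"):
    """Filter interleaved (ID, name) catalog keys by search words.

    operator "|": keep entries whose name contains any word;
    operator "&": require all words (case-insensitive).
    """
    words = text.lower().split()
    combine = all if operator == "&" else any
    result = []
    for id_, name in zip(keys[::2], keys[1::2]):
        if combine(w in name.lower() for w in words):
            result.extend([id_, name])
    return result

# Hypothetical excerpt of list(estat_flows):
flows = ['tipsun20', 'Unemployment rate - annual data',
         'DS-018995', 'EU trade since 1999 by SITC']
print(search(flows, "annual unemployment", operator="&"))
# only the unemployment entry survives
```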

Note that an intake catalog is essentially a dict. In our case, while the keys of the above catalog are already populated with the IDs and names of the dataflow definitions, the corresponding values are None. This is for performance: instantiating a catalog entry and populating it with all the metadata associated with an SDMX dataflow is expensive. Therefore, intake_sdmx uses an intake_sdmx.LazyDict under the hood: each value remains None until it is first accessed.

Caution

Avoid iterating over all values of a large catalog of dataflows as this could take forever.
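The lazy-values idea can be pictured with a minimal sketch. This is not the actual intake_sdmx.LazyDict implementation; the factory-based design and names are illustrative assumptions:

```python
class LazyDict(dict):
    """Minimal sketch of a dict whose values are computed on first access."""

    def __init__(self, factory, keys):
        # Every value starts out as None ...
        super().__init__((k, None) for k in keys)
        self._factory = factory

    def __getitem__(self, key):
        value = super().__getitem__(key)
        if value is None:
            # ... and is built (think: an expensive HTTP request)
            # only when it is actually accessed, then cached.
            value = self._factory(key)
            super().__setitem__(key, value)
        return value

d = LazyDict(lambda k: f"<entry for {k}>", ["une_rt_a", "une_ltu_a"])
print(dict.__getitem__(d, "une_rt_a"))  # None: not built yet
print(d["une_rt_a"])                    # <entry for une_rt_a>
```

This also explains the caution above: iterating over all values forces every entry to be built, one HTTP request at a time.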

With pandaSDMX, you would have performed these searches on a pandas DataFrame; an intake catalog cannot be exported to a DataFrame directly. However, you can convert the list of dataflow IDs and names to a DataFrame in a single line and do more sophisticated filtering there. In any case, we choose une_rt_a for further analysis.
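Such a conversion might look as follows. This is a plain-pandas sketch over a hypothetical excerpt of the key list; the column labels are my own, not part of any API:

```python
import pandas as pd

# Hypothetical excerpt of list(unemployment_flows): IDs and names interleaved.
keys = ['tipsun20', 'Unemployment rate - annual data',
        'une_ltu_a', 'Long-term unemployment by sex - annual data',
        'une_rt_a', 'Unemployment by sex and age - annual data']

# One row per dataflow: even positions are IDs, odd positions are names.
df = pd.DataFrame({'id': keys[::2], 'name': keys[1::2]})

# Now pandas string methods are available for finer filtering:
print(df[df['name'].str.contains('sex and age')])
```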

Exploring the data structure

As most pandaSDMX users will know, each dataflow references a data structure definition (DSD), which contains descriptions of dimensions, codelists etc. One of the most powerful features of SDMX and pandaSDMX is the ability to select subsets of the available data by specifying a so-called key: a mapping from dimension names to codes selected from the codelist referenced by each dimension. intake_sdmx translates dimensions and codelists to user-parameters of a catalog entry for a chosen dataflow. The allowed values of these parameters are populated with the allowed codes. intake thus gives you argument validation for free.

# Download the complete structural metadata on our
# 'une_rt_a' dataflow
In [19]: une = unemployment_flows.une_rt_a

In [20]: type(une)
Out[20]: intake_sdmx.SDMXData

In [21]: print(une.yaml())
sources:
  une_rt_a:
    args:
      AGE:
      - '*'
      FREQ:
      - '*'
      GEO:
      - '*'
      SEX:
      - '*'
      UNIT:
      - '*'
      attributes: ''
      dtype: ''
      endPeriod: '2021'
      freq_dim: FREQ
      index_type: object
      startPeriod: '2020'
      storage_options: {}
      time_dim: TIME_PERIOD
    description: "Unemployment by sex and age \u2013 annual data"
    driver: intake_sdmx.SDMXData
    metadata:
      catalog_dir: ''
      dataflow_id: une_rt_a
      search:
        text: annual unemployment
        upstream: ESTAT
      source_id: ESTAT
      structure_id: DSD_une_rt_a

Two observations:

  • The intake_sdmx.SDMXData instance knows about the dimensions of the dataflow on annual unemployment data. This information has been extracted from the referenced DataStructureDefinition - a core concept of SDMX.

  • All dimensions are wildcarded (“*”). Thus, if we asked the server to send us the corresponding dataset, we would probably exceed the server limits, or at least obtain a bunch of data we are not interested in. So let’s try to select some interesting columns for our data query.
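Under the hood, such selections end up in an SDMX REST key: within a dimension, codes are joined by '+'; between dimensions, fields are separated by '.'; wildcarded dimensions become empty fields. The following is a simplified sketch of that convention (the real request building is done by pandaSDMX, and the function name is my own):

```python
def build_key(dim_order, selection):
    """Build a simplified SDMX REST key from per-dimension code lists.

    Wildcarded dimensions ('*' or absent) become empty fields.
    """
    parts = []
    for dim in dim_order:
        codes = selection.get(dim, ['*'])
        parts.append('' if codes == ['*'] else '+'.join(codes))
    return '.'.join(parts)

# Dimension order as shown in the YAML above:
dims = ['FREQ', 'AGE', 'UNIT', 'SEX', 'GEO']
key = build_key(dims, {'FREQ': ['A'], 'GEO': ['IE', 'ES', 'EL']})
print(key)  # A....IE+ES+EL
```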

We not only have the dimension names; we also have all the allowed codes, stored in the catalog entry “une_rt_a” from which we created our instance:

In [22]: print(str(une.entry))
name: une_rt_a
container: dataframe
plugin: ['sdmx_dataset']
driver: ['sdmx_dataset']
description: Unemployment by sex and age – annual data
direct_access: True
user_parameters: [{'name': 'FREQ', 'description': 'FREQ', 'type': 'mlist', 'allowed': ['D', 'Daily', 'W', 'Weekly', 'Q', 'Quarterly', 'A', 'Annual', 'S', 'Semi-annual', 'M', 'Monthly', 'H', 'Half-year', '*'], 'default': ['*']}, {'name': 'AGE', 'description': 'AGE', 'type': 'mlist', 'allowed': ['Y15-24', 'From 15 to 24 years', 'Y15-29', 'From 15 to 29 years', 'Y15-74', 'From 15 to 74 years', 'Y20-64', 'From 20 to 64 years', 'Y25-54', 'From 25 to 54 years', 'Y25-74', 'From 25 to 74 years', 'Y55-74', 'From 55 to 74 years', '*'], 'default': ['*']}, {'name': 'UNIT', 'description': 'UNIT', 'type': 'mlist', 'allowed': ['THS_PER', 'Thousand persons', 'PC_POP', 'Percentage of total population', 'PC_ACT', 'Percentage of population in the labour force', '*'], 'default': ['*']}, {'name': 'SEX', 'description': 'SEX', 'type': 'mlist', 'allowed': ['T', 'Total', 'M', 'Males', 'F', 'Females', '*'], 'default': ['*']}, {'name': 'GEO', 'description': 'GEO', 'type': 'mlist', 'allowed': ['EU27_2020', 'European Union - 27 countries (from 2020)', 'EA19', 'Euro area - 19 countries  (from 2015)', 'BE', 'Belgium', 'BG', 'Bulgaria', 'CZ', 'Czechia', 'DK', 'Denmark', 'DE', 'Germany (until 1990 former territory of the FRG)', 'EE', 'Estonia', 'IE', 'Ireland', 'EL', 'Greece', 'ES', 'Spain', 'FR', 'France', 'HR', 'Croatia', 'IT', 'Italy', 'CY', 'Cyprus', 'LV', 'Latvia', 'LT', 'Lithuania', 'LU', 'Luxembourg', 'HU', 'Hungary', 'MT', 'Malta', 'NL', 'Netherlands', 'AT', 'Austria', 'PL', 'Poland', 'PT', 'Portugal', 'RO', 'Romania', 'SI', 'Slovenia', 'SK', 'Slovakia', 'FI', 'Finland', 'SE', 'Sweden', 'IS', 'Iceland', 'NO', 'Norway', 'CH', 'Switzerland', 'ME', 'Montenegro', 'MK', 'North Macedonia', 'RS', 'Serbia', 'TR', 'Turkey', '*'], 'default': ['*']}, {'name': 'startPeriod', 'description': 'startPeriod', 'type': 'str', 'default': '2020'}, {'name': 'endPeriod', 'description': 'endPeriod', 'type': 'str', 'default': '2021'}, {'name': 'dtype', 'description': "data type for pandas.DataFrame. 
See pandas docs\n                        for      allowed values.\n                        Default is '' which translates to 'float64'.", 'type': 'str', 'default': ''}, {'name': 'attributes', 'description': "Include any attributes alongside observations\n                    in the DataFrame. See pandasdmx docx for details.\n                    Examples: 'osgd' for all attributes, or\n                    'os': only attributes at observation and series level.", 'type': 'str', 'default': ''}, {'name': 'index_type', 'description': 'Type of pandas Series/DataFrame index', 'type': 'str', 'allowed': ['object', 'datetime', 'period'], 'default': 'object'}, {'name': 'freq_dim', 'description': "To generate PeriodIndex (index_type='period')\n                    Default is set based on heuristics.", 'type': 'str', 'default': 'FREQ'}, {'name': 'time_dim', 'description': "To generate datetime or period index.\n                        Ignored if index_type='object'.", 'type': 'str', 'default': 'TIME_PERIOD'}]
metadata: 
  source_id: ESTAT
  catalog_dir: 
  search: 
    text: annual unemployment
    upstream: ESTAT
  dataflow_id: une_rt_a
  structure_id: DSD_une_rt_a
args: 
  FREQ: {{FREQ}}
  AGE: {{AGE}}
  UNIT: {{UNIT}}
  SEX: {{SEX}}
  GEO: {{GEO}}
  startPeriod: {{startPeriod}}
  endPeriod: {{endPeriod}}
  dtype: {{dtype}}
  attributes: {{attributes}}
  index_type: {{index_type}}
  freq_dim: {{freq_dim}}
  time_dim: {{time_dim}}
  storage_options: 

# select some countries
# and the startPeriod to restrict our query
In [23]: une = une(GEO=['IE', 'ES', 'EL'], startPeriod="2007")

# Note the new config values
In [24]: print(une.yaml())
sources:
  une_rt_a:
    args:
      AGE:
      - '*'
      FREQ:
      - '*'
      GEO:
      - IE
      - ES
      - EL
      SEX:
      - '*'
      UNIT:
      - '*'
      attributes: ''
      dtype: ''
      endPeriod: '2021'
      freq_dim: FREQ
      index_type: object
      startPeriod: '2007'
      storage_options: {}
      time_dim: TIME_PERIOD
    description: "Unemployment by sex and age \u2013 annual data"
    driver: intake_sdmx.SDMXData
    metadata:
      catalog_dir: ''
      dataflow_id: une_rt_a
      search:
        text: annual unemployment
        upstream: ESTAT
      source_id: ESTAT
      structure_id: DSD_une_rt_a


# Passed codes are validated against the codelists:
In [25]: try:
   ....:     invalid = une(FREQ=['XXX'])
   ....: except ValueError as e:
   ....:     print(e)
   ....: 
('Item %s not in allowed list', 'XXX')

Note that when deriving a new instance from an existing one, the entire configuration is propagated, except for those values we overwrite by passing new arguments.
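Conceptually, this propagation behaves like a dict merge: the parent's arguments serve as defaults and the call's keyword arguments override them. An illustrative sketch, not the actual intake internals:

```python
def derive(config, **overrides):
    """Return a new configuration: parent values plus overrides."""
    return {**config, **overrides}

parent = {'GEO': ['IE', 'ES', 'EL'], 'startPeriod': '2007', 'index_type': 'object'}
child = derive(parent, index_type='period')

print(child['GEO'])         # ['IE', 'ES', 'EL'] -- propagated unchanged
print(child['index_type'])  # period -- overridden; parent is untouched
```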

Downloading and analyzing data

intake_sdmx can export datasets as pandas Series (default) or DataFrames. A Series is preferable, in particular, when you aren’t sure about the periodicity of the data, as a DataFrame requires columns to have consistent datetime indices. We shall export our annual unemployment data as a DataFrame. To do this, we configure our intake_sdmx.SDMXData instance as follows:

# configure for DataFrame with PeriodIndex
In [26]: une = une(index_type='period')

# Now download the dataset and export it as DataFrame:
In [27]: df = une.read()

In [28]: df.loc[:, ('Y15-74', 'PC_ACT', 'T')]
Out[28]: 
GEO            EL    ES    IE
TIME_PERIOD                  
2007          NaN   NaN   NaN
2008          NaN   NaN   NaN
2009          9.8  17.9  12.6
2010         12.9  19.9  14.6
2011         18.1  21.4  15.4
2012         24.8  24.8  15.5
2013         27.8  26.1  13.8
2014         26.6  24.5  11.9
2015         25.0  22.1   9.9
2016         23.9  19.6   8.4
2017         21.8  17.2   6.7
2018         19.7  15.3   5.8
2019         17.9  14.1   5.0
2020         17.6  15.5   5.9
2021         14.7  14.8   6.2
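For completeness, the annual PeriodIndex produced by index_type='period' can be reproduced with plain pandas; the values below are made up for illustration, not data from the service:

```python
import pandas as pd

# Annual TIME_PERIOD values as reported by the web service ...
periods = ['2007', '2008', '2009']

# ... become a PeriodIndex with annual frequency:
idx = pd.PeriodIndex(periods, freq='Y')

# A PeriodIndex supports label-based slicing by year:
s = pd.Series([12.6, 14.6, 15.4], index=idx)
print(s['2008':'2009'])
```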