User Guide¶
Overview¶
intake_sdmx is a plugin for intake which leverages pandaSDMX to make data and metadata from the SDMX ecosystem accessible via intake. To achieve this, intake_sdmx provides three intake drivers:
intake_sdmx.SDMXSources
: a catalog of SDMX data sources (a.k.a. agencies or data providers) such as national statistics offices, central banks and international institutions
intake_sdmx.SDMXDataflows
: a catalog of the dataflows provided by a given SDMX data source
intake_sdmx.SDMXData
: a data source that can download data-sets of a specified dataflow and convert them to a pandas.DataFrame.
If you are familiar with intake or pandaSDMX, the above concepts should ring a bell. However, if you are new to both ecosystems, just read on, follow the code examples, and dig deeper by skimming the docs of either intake or pandaSDMX as needed.
The following sections expand on the introductory code example from the pandaSDMX documentation.
Exploring the available data sources¶
You can instantiate the catalog of SDMX data sources in one of two ways:
# firstly, via intake_sdmx:
In [1]: from intake_sdmx import *
In [2]: src = SDMXSources()
In [3]: type(src)
Out[3]: intake_sdmx.SDMXSources
# secondly, via the intake API:
In [4]: import intake
In [5]: src2 = intake.open_sdmx_sources()
# YAML representation
In [6]: print(src.yaml())
sources:
SDMX data sources:
args: {}
description: SDMX data sources (a.k.a. agencies / data providers) supported
by pandaSDMX
driver: intake_sdmx.SDMXSources
metadata: {}
In [7]: src.yaml() == src2.yaml()
Out[7]: True
# Available data sources
In [8]: list(src)
Out[8]:
['ABS_XML',
'Australian Bureau of Statistics - XML',
'BBK',
'Deutsche Bundesbank (German Central Bank)',
'BIS',
'Bank for International Settlements',
'ECB',
'European Central Bank',
'ESTAT',
'Eurostat',
'ILO',
'International Labor Organization',
'IMF',
'International Monetary Fund',
'INEGI',
'Instituto Nacional de Estadística y Geografía (MX)',
'INSEE',
'Institut national de la statistique et des études économiques (FR)',
'ISTAT',
'Instituto Nationale di Statistica (IT)',
'NB',
'Norges Bank (NO)',
'SGR',
'SDMX Global Registry',
'UNICEF',
"UN International Children's Emergency Fund",
'CD2030',
'COUNTDOWN 2030',
'SPC',
'Pacific Data Hub',
'UNSD',
'United Nations Statistics Division',
'WB',
'World Bank World Integrated Trade Solution',
'WB_WDI',
'World Bank World Development Indicators',
'LSD',
'Statistics Lithuania']
Two observations:
- For intake novices: each driver instance can create a declarative YAML description of itself which suffices to re-generate a clone by calling intake.open_yaml().
- For pandaSDMX novices: the catalog contains two entries for each data provider, accessible via its ID and its name (in SDMX terminology) respectively. The duplicate entries are a pragmatic response to the fact that catalog entries are expensive to instantiate, as each one requires an HTTP request to a different SDMX web service. And dict keys are just fine to show human-readable descriptions alongside the IDs.
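Since the keys alternate between IDs and names, ordinary slicing is enough to pair them up. A minimal sketch, where the hard-coded `entries` list stands in for the real `list(src)`:

```python
# The catalog keys alternate between provider IDs and human-readable
# names; slicing rebuilds the (ID, name) pairs.
entries = ["ECB", "European Central Bank", "ESTAT", "Eurostat"]
pairs = dict(zip(entries[::2], entries[1::2]))
print(pairs)
# {'ECB': 'European Central Bank', 'ESTAT': 'Eurostat'}
```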
As in pandaSDMX, you can configure your HTTP connections:
In [9]: src_via_proxy = SDMXSources(
...: storage_options={'proxies': {'http': 'http://1.1.1.1:4567'}})
...:
The storage_options argument is an intake feature. Options will be propagated to any HTTP connection established by instances derived from src_via_proxy. Note that no HTTP connection is made upon instantiation of SDMXSources.
Exploring the dataflows of a given data source¶
Suppose we want to analyze annual unemployment data for some EU countries. We assume such data to be available from Eurostat.
In [10]: estat_flows = src.ESTAT
In [11]: type(estat_flows)
Out[11]: intake_sdmx.SDMXDataflows
In [12]: print(estat_flows.yaml())
sources:
ESTAT:
args:
storage_options: {}
description: Eurostat
driver: intake_sdmx.SDMXDataflows
metadata:
catalog_dir: ''
source_id: ESTAT
In [13]: len(estat_flows)
Out[13]: 13448
# Wow!
In [14]: list(estat_flows)[:20]
Out[14]:
['DS-018995',
'EU trade since 1999 by SITC',
'DS-022469',
'Extra-EU trade since 2000 by mode of transport, by NST/R',
'DS-032655',
'EU trade since 1988 by BEC/rev.4',
'DS-043227',
'EFTA trade since 1995 by SITC',
'DS-066341',
'Sold production, exports and imports',
'DS-066342',
'Total production',
'DS-1060915',
'EU trade since 1988 by CPA 2008',
'DS-1062396',
'EU trade since 2002 by CPA 2.1',
'DS-1165455',
'UK trade Feb-Oct 2020 by HS2-4-6 and CN8',
'DS-1180622',
'EU trade since 2015 of COVID-19 medical supplies by categories']
Luckily, this class has a rudimentary intake_sdmx.SDMXDataflows.search() method generating a shorter subcatalog:
In [15]: unemployment_flows = estat_flows.search("unemployment")
In [16]: len(unemployment_flows)
Out[16]: 186
# This is still too large...
# So let's refine our search.
In [17]: unemployment_flows = estat_flows.search("annual unemployment", operator="&")
In [18]: list(unemployment_flows)
Out[18]:
['lfsi_long_e01',
'Transition from unemployment to employment by sex, age and duration of unemployment - annual averages of quarterly transitions, estimated probabilities',
'lfsi_long_e02',
'Transition from unemployment to employment by sex, age and previous work experience - annual averages of quarterly transitions, estimated probabilities',
'lfsi_long_e03',
'Transition from unemployment to employment by sex, age and degree of urbanisation - annual averages of quarterly transitions, estimated probabilities',
'lfsi_long_e04',
'Transition from employment to unemployment by sex, age and degree of urbanisation - annual averages of quarterly transitions, estimated probabilities',
'lfsi_long_e05',
'Transition from employment to unemployment by sex, age and type of contract - annual averages of quarterly transitions, estimated probabilities',
'lfsi_sup_a',
'Supplementary indicators to unemployment - annual data',
'lfsi_sup_a_h',
'Supplementary indicators to unemployment (1992-2020) - annual data',
'tipsun20',
'Unemployment rate - annual data',
'une_educ_a',
'Unemployment by sex, age and educational attainment - annual data',
'une_educ_a_h',
'Unemployment by sex, age and educational attainment (1998-2020) - annual data',
'une_ltu_a',
'Long-term unemployment by sex - annual data',
'une_ltu_a_h',
'Long-term unemployment by sex (1996-2020) - annual data',
'une_rt_a',
'Unemployment by sex and age – annual data',
'une_rt_a_h',
'Unemployment by sex and age (1992-2020) - annual data']
Note that an intake catalog is essentially a dict. In our case, it is noteworthy that while the keys of the above catalog are already populated with the IDs and names of the dataflow definitions, the corresponding values are None. This is for performance: instantiating a catalog entry and populating it with all the metadata associated with an SDMX dataflow is expensive. Therefore, intake_sdmx uses an intake_sdmx.LazyDict under the hood. Each value is None until it is accessed.
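The lazy-loading idea can be sketched in a few lines. This is a hypothetical simplification for illustration, not the actual intake_sdmx.LazyDict implementation: each value starts out as None and is computed by a factory on first access.

```python
class LazySketch(dict):
    """Hypothetical sketch of a lazily populated dict (illustration only)."""

    def __init__(self, factory, keys):
        # All values start as None; nothing expensive happens yet.
        super().__init__((k, None) for k in keys)
        self._factory = factory  # expensive per-key computation

    def __getitem__(self, key):
        if super().__getitem__(key) is None:
            # Populate the entry only when it is first accessed.
            super().__setitem__(key, self._factory(key))
        return super().__getitem__(key)

d = LazySketch(str.upper, ["une_rt_a"])
print(dict.__getitem__(d, "une_rt_a"))  # None until accessed
print(d["une_rt_a"])                    # computed on first access
```

Iterating over all values of such a dict would trigger the expensive factory call for every key, which is exactly why the caution below applies.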
Caution
Avoid iterating over all values of a large catalog of dataflows as this could take forever.
While with pandaSDMX you would have performed these searches on a pandas DataFrame, a catalog cannot be exported to a DataFrame. However, you can convert the list of dataflow IDs and names to a DataFrame in a single line and do more sophisticated filtering there. Anyway, we choose une_rt_a for further analysis.
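For illustration, here is one way such filtering could look on a small sample of the catalog keys. The hard-coded `keys` list stands in for the real `list(unemployment_flows)`, and the column name `name` is our own choice:

```python
import pandas as pd

# The catalog keys alternate between dataflow IDs and names, so we can
# build a DataFrame with IDs as the index and names as a column.
keys = [
    "une_rt_a", "Unemployment by sex and age - annual data",
    "une_ltu_a", "Long-term unemployment by sex - annual data",
]
df = pd.DataFrame({"name": keys[1::2]}, index=keys[::2])

# Now pandas string methods are available for richer filtering:
long_term = df[df["name"].str.contains("Long-term")]
print(long_term.index.tolist())  # ['une_ltu_a']
```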
Exploring the data structure¶
As most pandaSDMX users will know, each dataflow references a data structure definition (DSD). It contains descriptions of dimensions, codelists etc. One of the most powerful features of SDMX and pandaSDMX is the ability to select subsets of the available data by specifying a so-called key, mapping dimension names to codes selected from the codelist referenced by the given dimension. intake_sdmx translates dimensions and codelists to user-parameters of a catalog entry for a chosen dataflow. The allowed values of these parameters are populated with the allowed codes. intake thus gives you argument validation for free.
# Download the complete structural metadata on our
# 'une_rt_a' dataflow
In [19]: une = unemployment_flows.une_rt_a
In [20]: type(une)
Out[20]: intake_sdmx.SDMXData
In [21]: print( une.yaml())
sources:
une_rt_a:
args:
AGE:
- '*'
FREQ:
- '*'
GEO:
- '*'
SEX:
- '*'
UNIT:
- '*'
attributes: ''
dtype: ''
endPeriod: '2021'
freq_dim: FREQ
index_type: object
startPeriod: '2020'
storage_options: {}
time_dim: TIME_PERIOD
description: "Unemployment by sex and age \u2013 annual data"
driver: intake_sdmx.SDMXData
metadata:
catalog_dir: ''
dataflow_id: une_rt_a
search:
text: annual unemployment
upstream: ESTAT
source_id: ESTAT
structure_id: DSD_une_rt_a
Two observations:
- The intake_sdmx.SDMXData instance knows about the dimensions of the dataflow on annual unemployment data. This information has been extracted from the referenced DataStructureDefinition, a core concept of SDMX.
- All dimensions are wildcarded (“*”). Thus, if we asked the server to send us the corresponding dataset, we would probably exceed the server limits, or at least obtain a bunch of data we are not interested in. So let’s select some interesting columns for our data query.
Not only do we have the dimension names, we also have all the allowed codes, namely in the catalog entry “une_rt_a” from which we have created our instance:
In [22]: print(str(une.entry))
name: une_rt_a
container: dataframe
plugin: ['sdmx_dataset']
driver: ['sdmx_dataset']
description: Unemployment by sex and age – annual data
direct_access: True
user_parameters: [{'name': 'FREQ', 'description': 'FREQ', 'type': 'mlist', 'allowed': ['D', 'Daily', 'W', 'Weekly', 'Q', 'Quarterly', 'A', 'Annual', 'S', 'Semi-annual', 'M', 'Monthly', 'H', 'Half-year', '*'], 'default': ['*']}, {'name': 'AGE', 'description': 'AGE', 'type': 'mlist', 'allowed': ['Y15-24', 'From 15 to 24 years', 'Y15-29', 'From 15 to 29 years', 'Y15-74', 'From 15 to 74 years', 'Y20-64', 'From 20 to 64 years', 'Y25-54', 'From 25 to 54 years', 'Y25-74', 'From 25 to 74 years', 'Y55-74', 'From 55 to 74 years', '*'], 'default': ['*']}, {'name': 'UNIT', 'description': 'UNIT', 'type': 'mlist', 'allowed': ['THS_PER', 'Thousand persons', 'PC_POP', 'Percentage of total population', 'PC_ACT', 'Percentage of population in the labour force', '*'], 'default': ['*']}, {'name': 'SEX', 'description': 'SEX', 'type': 'mlist', 'allowed': ['T', 'Total', 'M', 'Males', 'F', 'Females', '*'], 'default': ['*']}, {'name': 'GEO', 'description': 'GEO', 'type': 'mlist', 'allowed': ['EU27_2020', 'European Union - 27 countries (from 2020)', 'EA19', 'Euro area - 19 countries (from 2015)', 'BE', 'Belgium', 'BG', 'Bulgaria', 'CZ', 'Czechia', 'DK', 'Denmark', 'DE', 'Germany (until 1990 former territory of the FRG)', 'EE', 'Estonia', 'IE', 'Ireland', 'EL', 'Greece', 'ES', 'Spain', 'FR', 'France', 'HR', 'Croatia', 'IT', 'Italy', 'CY', 'Cyprus', 'LV', 'Latvia', 'LT', 'Lithuania', 'LU', 'Luxembourg', 'HU', 'Hungary', 'MT', 'Malta', 'NL', 'Netherlands', 'AT', 'Austria', 'PL', 'Poland', 'PT', 'Portugal', 'RO', 'Romania', 'SI', 'Slovenia', 'SK', 'Slovakia', 'FI', 'Finland', 'SE', 'Sweden', 'IS', 'Iceland', 'NO', 'Norway', 'CH', 'Switzerland', 'ME', 'Montenegro', 'MK', 'North Macedonia', 'RS', 'Serbia', 'TR', 'Turkey', '*'], 'default': ['*']}, {'name': 'startPeriod', 'description': 'startPeriod', 'type': 'str', 'default': '2020'}, {'name': 'endPeriod', 'description': 'endPeriod', 'type': 'str', 'default': '2021'}, {'name': 'dtype', 'description': "data type for pandas.DataFrame. 
See pandas docs\n for allowed values.\n Default is '' which translates to 'float64'.", 'type': 'str', 'default': ''}, {'name': 'attributes', 'description': "Include any attributes alongside observations\n in the DataFrame. See pandasdmx docx for details.\n Examples: 'osgd' for all attributes, or\n 'os': only attributes at observation and series level.", 'type': 'str', 'default': ''}, {'name': 'index_type', 'description': 'Type of pandas Series/DataFrame index', 'type': 'str', 'allowed': ['object', 'datetime', 'period'], 'default': 'object'}, {'name': 'freq_dim', 'description': "To generate PeriodIndex (index_type='period')\n Default is set based on heuristics.", 'type': 'str', 'default': 'FREQ'}, {'name': 'time_dim', 'description': "To generate datetime or period index.\n Ignored if index_type='object'.", 'type': 'str', 'default': 'TIME_PERIOD'}]
metadata:
source_id: ESTAT
catalog_dir:
search:
text: annual unemployment
upstream: ESTAT
dataflow_id: une_rt_a
structure_id: DSD_une_rt_a
args:
FREQ: {{FREQ}}
AGE: {{AGE}}
UNIT: {{UNIT}}
SEX: {{SEX}}
GEO: {{GEO}}
startPeriod: {{startPeriod}}
endPeriod: {{endPeriod}}
dtype: {{dtype}}
attributes: {{attributes}}
index_type: {{index_type}}
freq_dim: {{freq_dim}}
time_dim: {{time_dim}}
storage_options:
# select some countries
# and the startPeriod to restrict our query
In [23]: une = une(GEO=['IE', 'ES', 'EL'], startPeriod="2007")
# Note the new config values
In [24]: print(une.yaml())
sources:
une_rt_a:
args:
AGE:
- '*'
FREQ:
- '*'
GEO:
- IE
- ES
- EL
SEX:
- '*'
UNIT:
- '*'
attributes: ''
dtype: ''
endPeriod: '2021'
freq_dim: FREQ
index_type: object
startPeriod: '2007'
storage_options: {}
time_dim: TIME_PERIOD
description: "Unemployment by sex and age \u2013 annual data"
driver: intake_sdmx.SDMXData
metadata:
catalog_dir: ''
dataflow_id: une_rt_a
search:
text: annual unemployment
upstream: ESTAT
source_id: ESTAT
structure_id: DSD_une_rt_a
# Codes passed are validated against the codelists:
In [25]: try:
....: invalid = une(FREQ=['XXX'])
....: except ValueError as e:
....: print(e)
....:
('Item %s not in allowed list', 'XXX')
Note that when deriving a new instance from an existing one, the entire configuration is propagated, except for those values we overwrite by passing new arguments.
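Conceptually, this propagation behaves like a dict merge where the newly passed arguments win. A rough sketch (the dicts below mimic the YAML args shown above, not the real internal state):

```python
# Configuration of the existing instance (compare the YAML args above).
current = {"GEO": ["*"], "startPeriod": "2020", "endPeriod": "2021"}

# Arguments passed when deriving a new instance.
overrides = {"GEO": ["IE", "ES", "EL"], "startPeriod": "2007"}

# Everything is inherited except the overwritten values.
derived = {**current, **overrides}
print(derived)
# {'GEO': ['IE', 'ES', 'EL'], 'startPeriod': '2007', 'endPeriod': '2021'}
```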
Downloading and analyzing data¶
intake_sdmx can export datasets as pandas Series (the default) or DataFrames. A Series is preferable, in particular, when you aren’t sure about the periodicity of the data, as a DataFrame requires its columns to have consistent datetime indices. We shall export our annual unemployment data as a DataFrame. To do this, we configure our intake_sdmx.SDMXData instance as follows:
# configure for DataFrame with PeriodIndex
In [26]: une = une(index_type='period')
# Now download the dataset and export it as DataFrame:
In [27]: df = une.read()
In [28]: df.loc[:, ('Y15-74', 'PC_ACT', 'T')]
Out[28]:
GEO EL ES IE
TIME_PERIOD
2007 NaN NaN NaN
2008 NaN NaN NaN
2009 9.8 17.9 12.6
2010 12.9 19.9 14.6
2011 18.1 21.4 15.4
2012 24.8 24.8 15.5
2013 27.8 26.1 13.8
2014 26.6 24.5 11.9
2015 25.0 22.1 9.9
2016 23.9 19.6 8.4
2017 21.8 17.2 6.7
2018 19.7 15.3 5.8
2019 17.9 14.1 5.0
2020 17.6 15.5 5.9
2021 14.7 14.8 6.2