User Guide
**********

.. contents::
   :local:
   :backlinks: none

Overview
========

**intake_sdmx** is a plugin for **intake** which leverages **pandaSDMX** to
make data and metadata from the SDMX ecosystem accessible via **intake**. To
achieve this, **intake_sdmx** provides three intake drivers:

* :class:`intake_sdmx.SDMXSources`: a catalog of SDMX data sources
  (a.k.a. agencies or data providers) such as national statistics offices,
  central banks and international institutions
* :class:`intake_sdmx.SDMXDataflows`: a catalog of the dataflows provided by
  a given SDMX data source
* :class:`intake_sdmx.SDMXData`: a data source that can download data-sets of
  a specified dataflow and convert them to a :class:`pandas.DataFrame`.

If you are already familiar with **intake** or **pandaSDMX**, the above
concepts should ring a bell. If you are new to both ecosystems, just read on,
follow the code examples, and dig deeper by skimming the **intake** or
**pandaSDMX** docs as needed.

The following sections expand on the introductory code example from the
pandaSDMX documentation.

Exploring the available data sources
====================================

You can instantiate the catalog of SDMX data sources in one of two ways:

.. ipython:: python

    # firstly, via intake_sdmx:
    from intake_sdmx import *
    src = SDMXSources()
    type(src)
    # secondly, via the intake API:
    import intake
    src2 = intake.open_sdmx_sources()
    # YAML representation
    print(src.yaml())
    src.yaml() == src2.yaml()
    # Available data sources
    list(src)

Two observations:

* For intake novices: each driver instance can create a declarative YAML
  description of itself which suffices to re-generate clones by calling
  :func:`intake.open_yaml`.
* For pandaSDMX novices: the catalog contains two copies of each data
  provider entry, accessible via its ID and its name (in SDMX terminology)
  respectively. The duplicate entries are a pragmatic response to the fact
  that catalog entries are expensive to instantiate, as each one requires an
  HTTP request to a different SDMX web service. Dict keys are a convenient
  way to show human-readable names alongside the IDs.

As in pandaSDMX, you can configure your HTTP connections:

.. ipython:: python

    src_via_proxy = SDMXSources(
        storage_options={'proxies': {'http': 'http://1.1.1.1:4567'}})

The ``storage_options`` argument is an **intake** feature. Options will be
propagated to any HTTP connection established by instances derived from
``src_via_proxy``. Note that upon instantiation of :class:`SDMXSources` no
HTTP connection is made.

Exploring the dataflows of a given data source
==============================================

Suppose we want to analyze annual unemployment data for some EU countries.
We assume such data to be available from Eurostat.

.. ipython:: python

    estat_flows = src.ESTAT
    type(estat_flows)
    print(estat_flows.yaml())
    len(estat_flows)
    # Wow!
    list(estat_flows)[:20]

Luckily, this class has a rudimentary
:meth:`intake_sdmx.SDMXDataflows.search` method generating a shorter
sub-catalog:

.. ipython:: python

    unemployment_flows = estat_flows.search("unemployment")
    len(unemployment_flows)
    # This is still too large...
    # So let's refine our search.
    unemployment_flows = estat_flows.search(
        "annual unemployment", operator="&")
    list(unemployment_flows)

Note that an intake catalog is essentially a dict. In our case, it is
noteworthy that while the keys of the above catalog are already populated
with the IDs and names of the dataflow definitions, the corresponding values
are None.
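A quick sketch illustrates this lazy behaviour (this block is not part of the
executed examples above; it assumes dict-style access, which intake catalogs
support alongside the attribute access used below):

.. code-block:: python

    # keys of the sub-catalog are available immediately ...
    first_names = list(unemployment_flows)[:3]
    print(first_names)

    # ... but each value stays None until its entry is first accessed;
    # the access below triggers the expensive instantiation, i.e. the
    # download of the dataflow's structural metadata
    une = unemployment_flows["une_rt_a"]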
The None values are a performance measure: instantiating a catalog entry and
populating it with all the metadata associated with an SDMX dataflow is
expensive. Therefore, **intake_sdmx** uses a :class:`intake_sdmx.LazyDict`
under the hood. Each value is None until it is accessed.

.. caution::

    Avoid iterating over all values of a large catalog of dataflows, as this
    could take forever.

With pandaSDMX, you would have performed these searches in a pandas
DataFrame; an intake catalog cannot be exported to a DataFrame. However, you
can convert the list of dataflow names to a DataFrame in a single line and do
more sophisticated filtering there. Anyway, we choose ``une_rt_a`` for
further analysis.

Exploring the data structure
============================

As most pandaSDMX users will know, each dataflow references a data structure
definition (DSD). It contains descriptions of dimensions, codelists etc. One
of the most powerful features of SDMX and pandaSDMX is the ability to select
subsets of the available data by specifying a so-called key, i.e. a mapping
from dimension names to codes selected from the codelist referenced by a
given dimension. **intake_sdmx** translates dimensions and codelists to
user-parameters of the catalog entry for a chosen dataflow. The allowed
values of these parameters are populated from the corresponding codelists.
**intake** thus gives you argument validation for free.

.. ipython:: python

    # Download the complete structural metadata on our
    # 'une_rt_a' dataflow
    une = unemployment_flows.une_rt_a
    type(une)
    print(une.yaml())

Two observations:

* The :class:`intake_sdmx.SDMXData` instance knows about the dimensions of
  the dataflow on annual unemployment data. This information has been
  extracted from the referenced DataStructureDefinition - a core concept of
  SDMX.
* All dimensions are wildcarded ("*"). Thus, if we asked the server to send
  us the corresponding data-set, we would probably exceed the server limits,
  or at least obtain a lot of data we are not interested in.

So let's select some interesting subsets for our data query. We have not only
the dimension names but also all the allowed codes; both are stored in the
catalog entry ``une_rt_a`` from which we created our instance:

.. ipython:: python

    print(str(une.entry))
    # select some countries
    # and the startPeriod to restrict our query
    une = une(GEO=['IE', 'ES', 'EL'], startPeriod="2007")
    # Note the new config values
    print(une.yaml())
    # Passed codes are validated against the codelists:
    try:
        invalid = une(FREQ=['XXX'])
    except ValueError as e:
        print(e)

Note that when deriving a new instance from an existing one, the entire
configuration is propagated, except for those values we overwrite by passing
new arguments.

Downloading and analyzing data
==============================

**intake_sdmx** can export data-sets as pandas Series (the default) or
DataFrames. A Series is preferable in particular when you are not sure about
the periodicity of the data, as a DataFrame requires its columns to have
consistent datetime indices. We shall export our annual unemployment data as
a DataFrame. To do this, we configure our :class:`intake_sdmx.SDMXData`
instance as follows:

.. ipython:: python

    # configure for DataFrame with PeriodIndex
    une = une(index_type='period')
    # Now download the dataset and export it as a DataFrame:
    df = une.read()
    df.loc[:, ('Y15-74', 'PC_ACT', 'T')]
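From here on, plain pandas tooling applies. As a closing sketch (assuming
matplotlib is installed and ``df`` has been read as above), the selected
columns can be plotted directly:

.. code-block:: python

    # a minimal sketch, assuming `df` as read above:
    # pick the columns selected in the previous step
    # (age group 'Y15-74', unit 'PC_ACT', sex 'T')
    rates = df.loc[:, ('Y15-74', 'PC_ACT', 'T')]

    # one line per selected country; pandas delegates to matplotlib
    rates.plot(title="Annual unemployment rate")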