Typically an Open SDG implementation is split into a site repository and a data repository. For this reason the Open SDG configuration is split into site configuration and data configuration. This document details the available settings for data configuration.
These settings are mostly related to the conversion/alteration of data, metadata, translations, and schema. Also, many of these settings (those that start with "docs_") affect the construction of the "data documentation" mini-website that is automatically generated to document your particular data service. This website includes examples of each type of output, as well as a useful disaggregation report.
To see many of these options in action, the data starter repository contains an example config file.
Optional: If specified, then your data will be converted into CSVW format. The available parameters correspond to the parameters available in the OutputCsvw class.
common_properties: Key/value pairs showing common properties to add to the CSVW metadata. Here is a list of supported properties. Note that this can also be set per-indicator in a "csvw" metadata property.
at_properties: Key/value pairs showing "at" properties (those starting with @). Note that this can also be set per-indicator in a "csvw" metadata property.
table_schema_properties: Key/value pairs showing properties to add to the CSVW table schema. Supported properties include (but may not be limited to) "aboutUrl". Note that this can also be set per-indicator in a "csvw" metadata property.
column_properties: Key/value pairs (where each value is itself key/value pairs) showing properties to add to the CSVW columns, keyed by column name. Supported properties include (but may not be limited to) "propertyUrl" and "valueUrl". Note that this can also be set per-indicator in a "csvw" metadata property.
sorting: This works the same as the "sorting" parameter described for datapackages below.
Optional: Your data will automatically be converted into a machine-readable standard known as datapackages. This setting can be used to affect the way that these datapackages are constructed.
NOTE: These datapackages are actually used in Open SDG to determine the order in which the disaggregations and their values are displayed to the end-user. So, the "sorting" option described below has a direct effect on your Open SDG site.
The available parameters correspond to the parameters available in the OutputDataPackage class:
field_properties: Key/value pairs (where each value is itself a set of key/value pairs) showing properties to add to specific fields, keyed by field name. Note that this can also be set per-indicator in a "datapackage" metadata property.
package_properties: Key/value pairs showing common properties to add to all the data packages. Note that this can also be set per-indicator in a "datapackage" metadata property.
resource_properties: Key/value pairs showing common properties to add to the resource in all data packages. Note that this can also be set per-indicator in a "datapackage" metadata property.
sorting: Which strategy to use when sorting the columns and values. The available options are:
alphabetical: Sort columns/values in alphabetical order. For example, assuming the data came from the following CSV file, "Age" would be before "Sex", and "Female" would be before "Male":
    Year,Sex,Age,Value
    2021,Male,5 years,43
    2021,Female,5 years,53
Note that this alphabetizing happens before the columns/values are translated.
default: Sort columns/values according to their position in the source data. For example, assuming the data came from the same CSV shown above, "Sex" would be before "Age", and "Male" would be before "Female".
If sorting is not specified, Open SDG assumes that you want "default".
If you require more direct control over the sorting of your data columns/values, see the data_schema setting below.
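The two sorting strategies described above can be illustrated with a short Python sketch. This is not the sdg-build implementation: the function name `disaggregation_order` and the decision to skip the Year and Value columns are assumptions made only for illustration.

```python
import csv
import io

# The example CSV from the description of the "alphabetical" option.
CSV_TEXT = """Year,Sex,Age,Value
2021,Male,5 years,43
2021,Female,5 years,53
"""

# Assumption: Year and Value are not disaggregations, so they are skipped.
NON_DISAGGREGATIONS = {"Year", "Value"}


def disaggregation_order(csv_text, sorting="default"):
    """Return {column: [values]} in the order implied by the strategy."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    columns = [c for c in reader.fieldnames if c not in NON_DISAGGREGATIONS]
    if sorting == "alphabetical":
        columns = sorted(columns)
    order = {}
    for column in columns:
        values = []
        for row in rows:
            # Keep each non-empty value once, in order of first appearance.
            if row[column] and row[column] not in values:
                values.append(row[column])
        order[column] = sorted(values) if sorting == "alphabetical" else values
    return order


default_order = disaggregation_order(CSV_TEXT)
alphabetical_order = disaggregation_order(CSV_TEXT, sorting="alphabetical")
print(list(default_order), default_order["Sex"])
print(list(alphabetical_order), alphabetical_order["Sex"])
```

With the example data, the "default" strategy keeps "Sex" before "Age" and "Male" before "Female", while "alphabetical" puts "Age" before "Sex" and "Female" before "Male".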
Optional: If you need direct control of the sorting and/or validation of your data column/values, you can maintain individual "data schema" for selected indicators. Note that if this is omitted, all indicators will have an "inferred" data schema. So this setting is rarely used -- typically only in cases where you are not happy with the inferred data schema (such as if you are not happy with the order of the disaggregation controls in Open SDG).
The specifics of these "data schema" depend on which of the data schema classes you use, which you can specify with the "class" option. The available classes are:
DataSchemaInputSdmxDsd: Use an SDMX DSD to import the data schema. Note that this single schema will apply to all indicators. The available parameters are:
source: The path or remote URL to the SDMX DSD. For example, this would use the global DSD:
    data_schema:
      class: DataSchemaInputSdmxDsd
      source: 'https://registry.sdmx.org/ws/public/sdmxapi/rest/datastructure/IAEG-SDGs/SDG/latest/?format=sdmx-2.1&detail=full&references=children'
DataSchemaInputTableSchemaYaml: Import data schema from a folder of YAML files following the Table Schema spec and named according to their indicator (e.g., "1-1-1.yml"). Note that each of these files applies to only one particular indicator. Indicators that do not have a data schema file will get an "inferred" schema based on their data. The available parameters are:
source: The folder/file pattern to use to load the files. For example, this would use a local folder called "data-schema":
    data_schema:
      class: DataSchemaInputTableSchemaYaml
      source: data-schema/*.yml
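The per-indicator fallback described above (a schema file where one exists, an "inferred" schema otherwise) can be sketched in Python. This is a hypothetical illustration, not sdg-build code: the function `schema_sources` and the list of file paths are assumptions.

```python
import fnmatch
from pathlib import PurePosixPath


def schema_sources(indicator_ids, available_files, source="data-schema/*.yml"):
    """Map each indicator id to its schema file, or to "inferred"."""
    # Files are matched against the "source" pattern and keyed by their
    # name without the extension, e.g. "1-1-1.yml" becomes "1-1-1".
    by_indicator = {
        PurePosixPath(path).stem: path
        for path in available_files
        if fnmatch.fnmatch(path, source)
    }
    return {i: by_indicator.get(i, "inferred") for i in indicator_ids}


# Hypothetical folder contents: only indicator 1-1-1 has a schema file.
files = ["data-schema/1-1-1.yml", "data-schema/README.md"]
sources = schema_sources(["1-1-1", "1-2-1"], files)
print(sources)
```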
Required: A baseurl to put at the beginning of all absolute links in the "data documentation" website. If this is not set then there may be some incorrect links in the "data documentation". This is usually the name of your data repository, after a slash. For example, if your data repository is "data", then this should be "/data".
Optional: This setting controls the title which displays at the top of the data documentation website. The default if omitted is shown below:
docs_branding: Build docs
Optional: An optional list of extra columns that would not otherwise be included in the data documentation website's "disaggregation report". Common columns included here are the Series and/or Units columns (SERIES and UNIT_MEASURE, if using SDMX column names) since they would not normally be considered "disaggregations", but are still useful to include in this report. For example:
    docs_extra_disaggregations:
      - Series
      - Units
Optional: This adds an introductory paragraph on the homepage of the data documentation website. If omitted, no introductory paragraph will appear. Here is an example:
docs_intro: This is a list of examples of endpoints and output that are available on this service. Click each of the links below for more information on the available output.
Optional: This can be used to convert any indicator IDs in the data documentation website into actual links to your implementation's indicator pages. If omitted, the indicator IDs will not be hyperlinked. Here is an example:
Optional: This can be used to configure the metadata report on the SDG data documentation site.
To add metadata fields to the report, you must add the key and label for each field you wish to display. Here is an example:
    docs_metadata_fields:
      - key: reporting_status
        label: Reporting status
      - key: graph_type
        label: Graph type
      - key: data_non_statistical
        label: Non-statistical
Optional: This can be used to put your documentation website into a subfolder. This is rarely used. Typically it is only used if you have combined your data repository and site repository into one single repository. Eg:
Optional: If set to true, then the documentation website's "disaggregation report" will include extra columns showing the translations of each disaggregation into the languages you specified in the languages setting. For example:
Optional: This creates additional "download" buttons on each indicator page of your Open SDG implementation. Use this if there are additional per-indicator files (such as SDMX files) that you would like to make available for download.
This should be a list of objects, each having certain parameters. The available parameters are:
button_label: The label of the button to display. This can be a translation key.
source_pattern: A wildcard pattern used to identify the files you would like to make available for download.
output_folder: A folder to create for the files, where they will be available for download.
indicator_id_pattern: A regular expression to convert filenames into indicator IDs. The default is indicator_(.*), which would convert "indicator_1-1-1" into "1-1-1". For more help with regular expressions, look for online tools such as Regex 101.
The following example would ensure that all files matching tests/data/indicator_*.csv will be available for download in the build, in the folder named by output_folder:
    indicator_downloads:
      - button_label: csv
        source_pattern: tests/data/indicator_*.csv
        output_folder: data-csv
        indicator_id_pattern: indicator_(.*)
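How source_pattern and indicator_id_pattern work together can be sketched in Python. This is only an illustration under stated assumptions: the function `downloads_by_indicator` is hypothetical, the file list is invented, and applying the regular expression to the filename without its extension is an assumption, not confirmed sdg-build behavior.

```python
import fnmatch
import re
from pathlib import PurePosixPath


def downloads_by_indicator(paths, source_pattern,
                           indicator_id_pattern=r"indicator_(.*)"):
    """Map indicator ids to the files that match source_pattern."""
    downloads = {}
    for path in paths:
        # The wildcard pattern selects the files to offer for download.
        if not fnmatch.fnmatch(path, source_pattern):
            continue
        # Assumption: the regex is applied to the name without extension.
        match = re.search(indicator_id_pattern, PurePosixPath(path).stem)
        if match:
            downloads[match.group(1)] = path
    return downloads


# Hypothetical files in the data repository.
paths = [
    "tests/data/indicator_1-1-1.csv",
    "tests/data/indicator_1-2-1.csv",
    "tests/data/notes.txt",
]
downloads = downloads_by_indicator(paths, "tests/data/indicator_*.csv")
print(downloads)
```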
Open SDG will automatically convert all data to CSV and provide a zip file. This setting controls the name of that file. Note that the ".zip" extension should not be added here. The following example shows the default, if omitted:
Optional: This controls how your indicators are loaded. The available parameters are:
non_disaggregation_columns: This specifies a list of columns that should not be considered disaggregations. Adding a column here has several effects:
- Prevents the column from being considered as an "edge" (parent/child column)
- Prevents the column from being used in the "data package" output and CSVW output
- Affects which data rows are displayed when the chart first loads (the "headline" rows). Normally, if a row has content under a disaggregation column, it cannot be considered a headline row. But if that column is in this list, then the row can still be considered a headline row.
- Keeps the column out of the disaggregation report
- Keeps the column out of the disaggregation status
NOTE: This parameter does not prevent columns from appearing as dropdowns in the left sidebar on Open SDG indicator pages. In order to prevent columns from appearing as dropdowns, you need to use the ignored_disaggregations site configuration.
series_column: The name of the data column that should be considered the series. Historically this has been "Series", but if your data source is SDMX then it may be "SERIES".
unit_column: The name of the data column that should be considered the unit of measurement. Historically this has been "Units", but if your data source is SDMX then it may be "UNIT_MEASURE".
Here are the defaults that are assumed if this is omitted:
    indicator_options:
      non_disaggregation_columns:
        - Year
        - Units
        - Series
        - Value
        - GeoCode
        - Observation status
        - Unit multiplier
        - Unit measure
      series_column: Series
      unit_column: Units
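The effect of non_disaggregation_columns on "headline" rows can be sketched in Python. This is a simplified illustration of the rule stated above, not the sdg-build logic; the function `is_headline` and the sample rows are assumptions.

```python
# The default list of non-disaggregation columns shown above.
NON_DISAGGREGATION_COLUMNS = [
    "Year", "Units", "Series", "Value", "GeoCode",
    "Observation status", "Unit multiplier", "Unit measure",
]


def is_headline(row, non_disaggregation_columns=NON_DISAGGREGATION_COLUMNS):
    """A row is a headline row if every disaggregation column is empty."""
    return all(
        not value
        for column, value in row.items()
        if column not in non_disaggregation_columns
    )


# Hypothetical data rows: the second has content under "Sex".
rows = [
    {"Year": "2021", "Units": "Percent", "Sex": "", "Value": "48"},
    {"Year": "2021", "Units": "Percent", "Sex": "Female", "Value": "53"},
]
headline_rows = [row for row in rows if is_headline(row)]
print(headline_rows)
```

In this sketch, adding "Sex" to the list would let the second row count as a headline row too, which mirrors the third effect listed above.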
Optional: This setting identifies the source (or sources) of your data and metadata. This can be omitted if you are using the legacy Open SDG approach of CSV data and YAML/markdown metadata, but it is strongly recommended to specify your inputs using this setting. This is required if you are using any other input, such as SDMX or Word templates.
Each item must have a "class" which corresponds to classes in the /sdg/inputs folder of the sdg-build library. Further, each item can have any/all of the parameters that class uses. Below are full descriptions of all the possible inputs and their corresponding parameters.
The following parameters can be used for any input:
column_map: The path to a local CSV file which contains two columns: "Text" and "Value". This will be used to change the names of any data columns. For example, the following will change any data column named "Foo" to "Bar":
code_map: The path to a local CSV file which contains three columns: "Text", "Dimension", and "Value". This will be used to change the names of any data cells, where "Dimension" is the column name. For example, the following will change any data cell named "Foo" into "Bar" if it is found in the "Baz" column:
request_params: Options to apply to any remote HTTP requests that may happen during the input's execution. These options are detailed in the urllib.request.Request documentation. For example, the following could give a custom HTTP header:
    request_params:
      headers:
        My-Custom-Header: my-value
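The renaming performed by column_map and code_map, as described above, can be sketched in Python using the "Foo", "Bar", and "Baz" names from those descriptions. This is an illustration only: the helper names are hypothetical, and matching "Dimension" against the original (pre-rename) column name is an assumption.

```python
import csv
import io

# Contents of a hypothetical column_map file: rename column "Foo" to "Bar".
COLUMN_MAP_CSV = "Text,Value\nFoo,Bar\n"
# Contents of a hypothetical code_map file: in column "Baz", "Foo" -> "Bar".
CODE_MAP_CSV = "Text,Dimension,Value\nFoo,Baz,Bar\n"


def load_column_map(text):
    """Read a Text/Value CSV into {old column name: new column name}."""
    return {r["Text"]: r["Value"] for r in csv.DictReader(io.StringIO(text))}


def load_code_map(text):
    """Read a Text/Dimension/Value CSV into {(column, old cell): new cell}."""
    reader = csv.DictReader(io.StringIO(text))
    return {(r["Dimension"], r["Text"]): r["Value"] for r in reader}


def apply_maps(rows, column_map, code_map):
    """Rename cells first, then columns, leaving unmapped items unchanged."""
    mapped = []
    for row in rows:
        new_row = {}
        for column, cell in row.items():
            cell = code_map.get((column, cell), cell)
            new_row[column_map.get(column, column)] = cell
        mapped.append(new_row)
    return mapped


rows = [{"Year": "2021", "Foo": "1", "Baz": "Foo", "Qux": "Foo"}]
result = apply_maps(rows, load_column_map(COLUMN_MAP_CSV),
                    load_code_map(CODE_MAP_CSV))
print(result)
```

Note that the "Foo" cell in the "Qux" column is left alone, because the code map only applies within the column named in "Dimension".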
Now here are specific descriptions and parameters available for each class:
InputCkan: Input data from a CKAN service. The available parameters are:
endpoint: The remote URL of the endpoint for fetching indicators.
indicator_id_map: Map of API ids (such as "resource ids") to indicator ids.
post_data: Key/value pairs which will be passed as a payload in a POST request, rather than the usual GET request.
year_column: The name of the column which will be changed to "Year".
value_column: The name of the column which will be changed to "Value".
sleep: Number of seconds to wait in between each request.
For more technical information see the InputCkan class definition and an example of using InputCkan in Python code.
InputCsvData: Input data from a folder of CSV files. The available parameters are:
path_pattern: A wildcard pattern used for identifying the source files.
To see this in practice see this example of InputCsvData configuration. For more technical information see the InputCsvData class definition.
InputCsvMeta: Input metadata from a folder of CSV files. The available parameters are:
path_pattern: Same as described above in other inputs.
metadata_mapping: A map of human-readable labels to machine keys or a path to a CSV file containing that mapping. This allows the CSV metadata files to use human-readable labels instead of machine keys, which makes management easier.
git: Whether to use Git (version control) information to populate "last updated" dates in the metadata. This is a convenience feature to save you from the manual work of keeping the "last updated" dates accurate.
git_data_dir: Only used if you are using the "git" option described above. Location of folder containing the data files.
git_data_filemask: Only used if you are using the "git" option described above. A pattern for data filenames, where "*" is the indicator id. Any indicator can override this setting by having a metadata field called "data_filename" with the name of the data file for that indicator.
For more technical information see the InputCsvMeta class definition.
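The way git_data_filemask and the "data_filename" metadata field interact, as described above, can be sketched in Python. The function name and the example filemask "indicator_*.csv" are assumptions for illustration, not values taken from sdg-build.

```python
def data_filename(indicator_id, git_data_filemask, metadata=None):
    """Return the data file name used to look up "last updated" dates."""
    metadata = metadata or {}
    # A per-indicator "data_filename" metadata field overrides the mask.
    if metadata.get("data_filename"):
        return metadata["data_filename"]
    # Otherwise "*" in the mask stands for the indicator id.
    return git_data_filemask.replace("*", indicator_id)


# Hypothetical filemask and metadata.
from_mask = data_filename("1-1-1", "indicator_*.csv")
overridden = data_filename("1-2-1", "indicator_*.csv",
                           {"data_filename": "custom-1-2-1.csv"})
print(from_mask, overridden)
```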
InputExcelMeta: Input metadata from a folder of Excel files. The available parameters are the same as in InputCsvMeta.
For more technical information see the InputExcelMeta class definition.
InputSdgMetadata: Input metadata from a folder or repository of subfolders that follow the same pattern as the SDG Metadata project. Specifically, each subfolder must be a language code (like en), and each language folder contains one YAML file per indicator, named like
source: The local path or remote Git repository.
tag: If using a Git repository, the Git tag to use.
branch: If using a Git repository, the Git branch to use.
repo_subfolder: If using a Git repository, the name of the folder within the repo to use as the main folder.
default_language: Which language is your site's main language.
For more technical information see the InputSdgMetadata class definition.
InputSdmxJson: Input data from an SDMX-JSON file or endpoint. The available parameters are:
source: Remote URL of the SDMX source, or path to local SDMX file.
drop_dimensions: List of SDMX dimensions/attributes to ignore.
drop_singleton_dimensions: Whether to drop dimensions/attributes with only 1 variation.
dimension_map: Map of SDMX ids to human-readable names. For dimension names, the key is simply the dimension id. For dimension value names, the key is the dimension id and value id, separated by a pipe (|). This also includes attributes.
indicator_id_map: A map of SDMX series codes to indicator ids. Normally this is not needed, but sometimes the DSD may contain typos or mistakes, or the DSD may not contain any reference to the indicator ID numbers. This need not contain all indicator ids, only those that need it. If a particular series should be mapped to multiple indicators, then they can be a list of strings. Otherwise each indicator is a string.
import_names: Whether to import names. Set to false to rely on global names.
import_codes: Whether to import codes instead of text values. If left false, text values are imported instead, taken from the first language in the DSD. This is strongly recommended to be set to true.
import_series_attributes: Recommended to be set to true.
import_observation_attributes: Recommended to be set to true.
dsd: Remote URL of the SDMX DSD (data structure definition) or path to local file.
indicator_id_xpath: An xpath query to find the indicator id within each Series code.
indicator_name_xpath: An xpath query to find the indicator name within each Series code.
For more technical information see the InputSdmxJson class definition, an example of InputSdmxJson configuration, and an example of using InputSdmxJson in Python code.
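The key formats described above for dimension_map and indicator_id_map can be sketched in Python. This is a hypothetical illustration: the helper functions, the series codes "SERIES_A" and "SERIES_B", and the mapped names are all invented for the example.

```python
# Dimension names are keyed by dimension id; dimension value names are
# keyed by "dimension id|value id" (separated by a pipe).
DIMENSION_MAP = {
    "SEX": "Sex",
    "SEX|F": "Female",
    "SEX|M": "Male",
}

# A series maps to one indicator (a string) or several (a list of strings).
INDICATOR_ID_MAP = {
    "SERIES_A": "1-1-1",
    "SERIES_B": ["1-2-1", "1-2-2"],
}


def translate(dimension_id, value_id, dimension_map=DIMENSION_MAP):
    """Return (name, value name), falling back to the ids if unmapped."""
    name = dimension_map.get(dimension_id, dimension_id)
    value = dimension_map.get(f"{dimension_id}|{value_id}", value_id)
    return name, value


def indicator_ids(series_code, indicator_id_map=INDICATOR_ID_MAP):
    """Always return a list of indicator ids for a series code."""
    ids = indicator_id_map.get(series_code, [])
    return [ids] if isinstance(ids, str) else list(ids)


print(translate("SEX", "F"))
print(indicator_ids("SERIES_B"))
```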
InputSdmxMeta: Input metadata from SDMX, either remote inputs or a local file. The available parameters are the same as in InputSdmxJson.
For more technical information see the InputSdmxMeta class definition.
InputSdmxMl_Multiple: Input data from multiple SDMX-ML files (which can be a mix of either "Structure" or "Structure Specific"). The available parameters are the same as in InputSdmxJson, along with these additional parameters:
path_pattern: Same as described above in other inputs.
For more technical information see the InputSdmxMl_Multiple class definition.
InputSdmxMl_Structure: Input data from an SDMX-ML Structure file. The available parameters are the same as in InputSdmxJson.
For more technical information see the InputSdmxMl_Structure class definition, and an example of using InputSdmxMl_Structure in Python code.
InputSdmxMl_StructureSpecific: Input data from an SDMX-ML Structure Specific (also known as "Compact") file. The available parameters are the same as in InputSdmxJson.
For more technical information see the InputSdmxMl_StructureSpecific class definition.
InputSdmxMl_UnitedNationsApi: Input data from the United Nations Global SDG Database. The available parameters are the same as in InputSdmxJson, plus the following:
reference_area: The SDMX code to use in the REF_AREA dimension. Defaults to '1' (world).
dimension_query: Key/value pairs for SDMX dimensions to use in generating the query. For details see the UN SDG API manual.
For more technical information see the InputSdmxMl_UnitedNationsApi class definition.
InputWordMeta: Input metadata from the Microsoft Word templates popular for SDG metadata. The available parameters are the same as in InputCsvMeta.
For more technical information see the InputWordMeta class definition.
InputYamlMeta: Input metadata from YAML files. The available parameters are the same as InputCsvMeta.
For more technical information see the InputYamlMeta class definition.
InputYamlMdMeta: Input metadata from a folder of YAML/Markdown files. The available parameters are the same as in InputCsvMeta.
Note that YAML/Markdown files should have a --- at the bottom. Any Markdown text below that line will be used as the page_content metadata field.
For more technical information see the InputYamlMdMeta class definition and an example of InputYamlMdMeta configuration.
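The split between the YAML part and the page_content part of a YAML/Markdown file can be sketched in Python. This is a minimal illustration, not the sdg-build parser: the function name is hypothetical, the sample file is invented, and the optional opening --- line is an assumption based on common front matter conventions.

```python
def split_yaml_markdown(text):
    """Split a YAML/Markdown file into (yaml_text, page_content)."""
    lines = text.splitlines()
    # Assumption: an opening "---" line, if present, is skipped.
    start = 1 if lines and lines[0].strip() == "---" else 0
    for index in range(start, len(lines)):
        if lines[index].strip() == "---":
            yaml_text = "\n".join(lines[start:index])
            # Everything below the closing "---" becomes page_content.
            page_content = "\n".join(lines[index + 1:]).strip()
            return yaml_text, page_content
    # No closing "---": treat the whole file as YAML.
    return "\n".join(lines[start:]), ""


EXAMPLE = """---
indicator_name: Example indicator
reporting_status: complete
---
Extra **Markdown** text for the page.
"""

yaml_text, page_content = split_yaml_markdown(EXAMPLE)
print(page_content)
```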
Defaults: As mentioned above, this inputs setting is optional. The defaults below show what is assumed if inputs is omitted entirely. Note that these defaults should not be considered the recommended approach -- they are left only for backwards compatibility.
    inputs:
      - class: InputCsvData
        path_pattern: data/*-*.csv
      - class: InputYamlMdMeta
        path_pattern: meta/*-*.md
        git: true
        git_data_dir: data
Optional: This setting corresponds exactly to the language setting in the site configuration. It is technically optional, but strongly recommended, and will be required in future releases. If you use this setting, your data will be translated and placed in language subfolders. For more information on how this translation works, see documentation on translating metadata and translating data.
    languages:
      - es
      - en
Optional: This determines the types of logs that will appear when the Python code is running to perform these "builds". The available log types are:
warn: Show warnings - ie, things that may be problems but that are not so bad that they halt the build.
debug: Show more details on every step of the build - usually used to help with development of sdg-build.
The default is just "warn". So if omitted, the following is assumed:
    logging:
      - warn
Optional: This allows the build to generate one or more GeoJSON files to be used by Open SDG maps. This should be a list of layers, each one containing certain parameters. The parameters available correspond to the sdg-build library's OutputGeoJson class and are described below:
geojson_file: A path to a GeoJSON file (remote or local) which contains all of the "geometries" for the regions to include. Each region should have an id and a name, in properties (see name_property and id_property).
name_property: The property in the geometry file which contains the region's name.
id_property: The property in the geometry file which contains the region's id.
id_column: The name of a column in the indicator data which corresponds to the id that is in the "id_property" of the geometry file. This serves to "join" the indicator data with the geometry file.
output_subfolder: A folder beneath 'geojson' to put the files. The full path will be:
filename_prefix: A prefix added before the indicator id to construct a filename for each GeoJSON file.
exclude_columns: A list of strings, each a column name in the indicator data that should not be included in the disaggregation. This is typically for any columns that mirror the region referenced by the id column.
id_replacements: An optional set of key/value replacements to apply to the values in the id_column. This is typically used if another column exists which "mirrors" what would be in an id column, to avoid duplicate work. For example, maybe a "Region" column exists with the names of the regions as values. This can be used to "map" those region names to geocodes, and save you the work of maintaining a separate id column.
Below is an example of a possible configuration which includes one layer:
    map_layers:
      - geojson_file: https://geoportal1-ons.opendata.arcgis.com/datasets/4fcca2a47fed4bfaa1793015a18537ac_4.geojson
        name_property: rgn17nm
        id_property: rgn17cd
        output_subfolder: regions
        filename_prefix: indicator_
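The id_replacements idea described above (mapping region names in an existing column to geocodes) can be sketched in Python. This is an illustration under assumptions: the function `apply_id_replacements`, the "Region" column, and the name-to-code pair are all hypothetical example values.

```python
def apply_id_replacements(rows, id_column, id_replacements):
    """Replace values in id_column, leaving unmapped values unchanged."""
    replaced = []
    for row in rows:
        new_row = dict(row)
        value = new_row.get(id_column)
        new_row[id_column] = id_replacements.get(value, value)
        replaced.append(new_row)
    return replaced


# Hypothetical data: the first row holds a region name, the second a code.
rows = [
    {"Year": "2021", "Region": "North East", "Value": "10"},
    {"Year": "2021", "Region": "E12000002", "Value": "12"},
]
# Hypothetical mapping of region names to geocodes.
id_replacements = {"North East": "E12000001"}

joined = apply_id_replacements(rows, "Region", id_replacements)
print([row["Region"] for row in joined])
```

After the replacement, every value in the id column can be matched against the id_property of the geometry file.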
Optional: This allows the build to generate stats for reporting status by additional fields, beyond the default "status by goal" report. This is optional, but the example below shows how to generate reporting status by the "un_custodian_agency" metadata field:
    reporting_status_extra_fields:
      - un_custodian_agency
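What "reporting status by an additional field" means can be sketched in Python as a simple grouped count. This is only a conceptual illustration, not the sdg-build report format: the function, the agency names, and the status values are hypothetical.

```python
from collections import Counter, defaultdict


def status_by_field(indicators, field):
    """Count reporting statuses, grouped by the value of a metadata field."""
    report = defaultdict(Counter)
    for metadata in indicators:
        group = metadata.get(field, "(none)")
        status = metadata.get("reporting_status", "(none)")
        report[group][status] += 1
    return {group: dict(counts) for group, counts in report.items()}


# Hypothetical indicator metadata.
INDICATORS = [
    {"un_custodian_agency": "Agency A", "reporting_status": "complete"},
    {"un_custodian_agency": "Agency A", "reporting_status": "notstarted"},
    {"un_custodian_agency": "Agency B", "reporting_status": "complete"},
]

report = status_by_field(INDICATORS, "un_custodian_agency")
print(report)
```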
Optional: This can be used as an alternative to the schema_file setting. Whereas schema_file needs to point to a particular type of schema (the _prose.yml style) the schema setting can point to any of the sdg-build schema inputs (described below).
This setting can even point to multiple schemas. Each item must have a "class" which corresponds to classes in the /sdg/schemas folder of the sdg-build library. Further, each item can have any/all of the parameters that class uses. Below are full descriptions of all the possible schemas and their corresponding parameters:
SchemaInputOpenSdg: Input a metadata schema from the Prose.io-style schema, historically called _prose.yml. The available parameters are:
schema_path: A path (remote or local) to the schema file or endpoint
scope: Which metadata scope the fields should apply to. Not usually used here, since "scope" can be assigned in the Prose.io schema per field.
request_params: Only used in the case of a remote schema_path, to control the behavior of the HTTP request. Not usually used since the Prose.io schema is typically a local file.
For more technical information see the SchemaInputOpenSdg class definition.
SchemaInputSdmxMsd: Input a metadata schema from an SDMX metadata structure definition. The available parameters are:
schema_path: A path (remote or local) to the schema file or endpoint
scope: Which metadata scope the fields should apply to. This is necessary here, since the MSD has no idea about the Open SDG concept of "scope".
request_params: Only used in the case of a remote schema_path, to control the behavior of the HTTP request.
For more technical information see the SchemaInputSdmxMsd class definition.
    schema:
      - class: SchemaInputOpenSdg
        schema_path: _prose.yml
      - class: SchemaInputSdmxMsd
        schema_path: https://example.com/my-msd-file.xml
        scope: national
        request_params:
          headers:
            My-Custom-Header: my-value
Optional: This identifies a file containing the schema (possible fields) for metadata. Currently this needs to be a prose.io config, and defaults to '_prose.yml'. Note that if you are using the schema setting described above, you do not need to use schema_file (and vice versa).
Optional: If specified, then SDMX-ML will be outputted to your data documentation website. However, your data must also be compliant with the DSD specified in this configuration.
dsd: Remote URL of the SDMX DSD (data structure definition) or path to local file. If omitted, the global DSD will be assumed.
msd: Remote URL of the SDMX MSD (metadata structure definition) or path to local file. If omitted, the global MSD will be assumed.
default_values: Since SDMX output is required to have a value for every dimension/attribute, you may need to specify defaults here. If not specified here, defaults for attributes will be '' and defaults for dimensions will be '_T'.
header_id: Optional identifying string to put in the "ID" element in the header of the XML. If not specified, it will be "IREF" and a timestamp.
sender_id: Optional identifying string to put in the "id" attribute of the "Sender" element in the header of the XML. If not specified, it will be the current version of this library.
structure_specific: Whether to output as StructureSpecific instead of Generic data. Defaults to true.
column_map: Remote URL of CSV column mapping or path to local CSV column mapping file. Expects columns 'Text' (data CSV column name e.g. Sex) and 'Value' (SDMX concept which data CSV column name maps to e.g. SEX).
code_map: Remote URL of CSV code mapping or path to local CSV code mapping file. Expects columns 'Text' (item within data CSV column e.g. Female), 'Dimension' (SDMX concept that item belongs to e.g. SEX), and 'Value' (SDMX concept code which item maps to e.g. F).
constrain_data: Whether to use the DSD to remove any rows of data that are not compliant. Defaults to false.
constrain_meta: Whether to use the MSD to remove any metadata fields that are not compliant. Defaults to true.
meta_ref_area: REF_AREA code to use in the metadata output. If omitted, will use the first available in a REF_AREA data column.
meta_reporting_type: REPORTING_TYPE code to use in the metadata output. If omitted, will use the first available in a REPORTING_TYPE data column.
global_content_constraints: Whether to enforce the global content constraints, which are in a draft state.
output_subfolder: A subfolder in which to place this output. Defaults to 'sdmx'.
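The fallback behavior described for default_values (an empty string for attributes, '_T' for dimensions) can be sketched in Python. This is an illustration of the stated rule, not the sdg-build output code: the function name, the sample row, and the dimension and attribute lists are assumptions.

```python
def fill_defaults(row, dimensions, attributes, default_values=None):
    """Give every dimension and attribute a value, using defaults if empty."""
    default_values = default_values or {}
    filled = dict(row)
    for dimension in dimensions:
        if not filled.get(dimension):
            # Dimensions without a configured default fall back to "_T".
            filled[dimension] = default_values.get(dimension, "_T")
    for attribute in attributes:
        if not filled.get(attribute):
            # Attributes without a configured default fall back to "".
            filled[attribute] = default_values.get(attribute, "")
    return filled


# Hypothetical row and structure; default_values mirrors the setting above.
row = {"SEX": "F", "AGE": "", "Value": "53"}
filled = fill_defaults(
    row,
    dimensions=["SEX", "AGE", "REF_AREA"],
    attributes=["UNIT_MULT", "OBS_STATUS"],
    default_values={"REF_AREA": "KG", "UNIT_MULT": "0"},
)
print(filled)
```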
Optional: Can be used alone or with sdmx_output (above), allowing for multiple SDMX outputs (for example, one national and one global).
This works exactly the same as sdmx_output, except the following parameters are automatically set:
    dsd: (the global DSD)
    msd: (the global MSD)
    structure_specific: true
    constrain_data: true
    constrain_meta: true
    global_content_constraints: true
    output_subfolder: sdmx-global
If no other customizations are needed beyond these, then the sdmx_output_global can simply be set to true. Example:
Otherwise it can have all the same parameters as the existing sdmx_output option. Example:
    sdmx_output_global:
      meta_reporting_type: N
      meta_ref_area: KG
      etc...
Optional: This identifies a directory to hold the "built" files. The default is '_site' and you usually do not need to change this.
Optional: This setting controls the directory in which scripts should find source files. In most cases this can be left at the default ('') which points to the root of the data repository. However this is available in case you need to place your source files in a subfolder.
Optional: This setting identifies the source (or sources) of your translations. This can be omitted if your languages are already included in sdg-translations and you do not need any custom translations. But if you are using other languages or need custom translations, then you can use this as needed.
Each item must have a "class" which corresponds to classes in the /sdg/translations folder of the sdg-build library. Further, each item can have any/all of the parameters that class uses. Below are full descriptions of all the possible translations and their corresponding parameters:
TranslationInputCsv: Input translations from a folder of local CSV files. The available parameters are:
source: The folder containing the translation files. Defaults to "translations".
For more technical information see the TranslationInputCsv class definition.
TranslationInputSdgTranslations: Input translations from a Git repository structured like the sdg-translations project. The available parameters are:
tag: Specifies a particular tag (or branch or commit) to use in the Git repository.
branch: Specifies a particular branch (or tag or commit) to use in the Git repository. Alias for "tag".
source: Specifies the endpoint for the Git repository. Defaults to the sdg-translations project: 'https://github.com/open-sdg/sdg-translations.git'
For more technical information see the TranslationInputSdgTranslations class definition and an example of TranslationInputSdgTranslations configuration.
TranslationInputSdmx: Input translations from an SDMX DSD file. The available parameters are:
source: The location of the SDMX DSD file (either local or remote).
For more technical information see the TranslationInputSdmx class definition and an example of TranslationInputSdmx configuration.
TranslationInputSdmxMsd: Input translations from an SDMX MSD (metadata structure definition) file. The available parameters are:
source: The location of the SDMX MSD file (either local or remote).
For more technical information see the TranslationInputSdmxMsd class definition.
TranslationInputYaml: Input translations from a folder of local YAML files. The available parameters are the same as in TranslationInputCsv above.
For more technical information see the TranslationInputYaml class definition and an example of TranslationInputYaml configuration.
Defaults: As mentioned above, this translations setting is optional. The defaults below show what is assumed if translations is omitted entirely.
    translations:
      - class: TranslationInputSdgTranslations
        source: https://github.com/open-sdg/sdg-translations.git
        branch: master
      - class: TranslationInputYaml
        source: translations