SOFT7 Data Source

This notebook contains examples related to the SOFT7 data source. It shows how to create, manage, resolve, and use SOFT7 data sources.

Generate a SOFT7 Data Source

A SOFT7 Data Source instance is generated from the following four parts:

  1. Data source (DB, File, Webpage, …).

  2. Generic data source parser.

  3. Data source parser configuration.

  4. SOFT7 entity (data model).

Together, parts 2 and 3 make up the “specific parser”. Parts 1 through 3 are supplied collectively, based on the ResourceConfig from OTEAPI Core.

Resource configuration

The resource configuration, originally based on the Data Catalog Vocabulary (DCAT), is in this case a small set of data catalog fields mapped to resource-specific values:

  • downloadUrl or accessUrl:

    • downloadUrl: The URL of the downloadable file in a given format. E.g. CSV file or RDF file.

      Usage: downloadURL SHOULD be used for the URL at which this distribution is available directly, typically through an HTTPS GET request or SFTP.

    • accessUrl: A URL of the resource that gives access to a distribution of the dataset. E.g. landing page, feed, SPARQL endpoint.

      Usage: accessURL SHOULD be used for the URL of a service or location that can provide access to this distribution, typically through a Web form, query or API call.
      downloadURL is preferred for direct links to downloadable resources.

  • mediaType: The media type of the distribution as defined by IANA [IANA-MEDIA-TYPES].

    Usage: This property SHOULD be used when the media type of the distribution is defined in IANA [IANA-MEDIA-TYPES].

  • accessService: A data service that gives access to the distribution of the dataset.

The resource configuration MUST contain either downloadUrl and mediaType, or accessUrl and accessService. Beyond that, the fields may be combined freely, as long as at least one of these two pairs is present.
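The requirement above can be sketched as a small check in plain Python. This is only an illustration of the rule as stated here, not the validation performed by OTEAPI Core; the function name has_required_fields and the sample dictionaries are made up for this sketch.

```python
def has_required_fields(config: dict) -> bool:
    """Return True if at least one of the two required field pairs is present.

    Hypothetical helper: it only mirrors the rule stated in the text and is
    not part of OTEAPI Core or the s7 package.
    """
    download_pair = bool(config.get("downloadUrl")) and bool(config.get("mediaType"))
    access_pair = bool(config.get("accessUrl")) and bool(config.get("accessService"))
    return download_pair or access_pair


# A downloadUrl together with a mediaType satisfies the rule.
download_config = {
    "downloadUrl": "https://optimade.materialsproject.org/v1/structures/mp-1228448",
    "mediaType": "application/json",
}

# A downloadUrl on its own does not.
incomplete_config = {
    "downloadUrl": "https://optimade.materialsproject.org/v1/structures/mp-1228448",
}

# Mixing one field from each pair does not satisfy the rule either.
mixed_config = {
    "downloadUrl": "https://optimade.materialsproject.org/v1/structures/mp-1228448",
    "accessService": "some-service",  # placeholder value
}

print(has_required_fields(download_config))
print(has_required_fields(incomplete_config))
print(has_required_fields(mixed_config))
```

The check treats empty strings and missing keys the same way, which seems a reasonable reading of "MUST contain" but is an assumption of this sketch.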

The fields described so far define the “data source” part of the SOFT7 data source. The “generic parser” part is then determined from either mediaType or accessService.
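As a rough illustration of how a generic parser could be chosen from these fields, the sketch below uses two lookup tables. The tables, the function name select_generic_parser, and the access-service entry are hypothetical; only the pairing of application/json with parser/json mirrors the example used in this notebook. How OTEAPI Core actually resolves a parser is not shown here.

```python
# Hypothetical lookup tables for illustration only.
PARSER_BY_MEDIA_TYPE = {
    "application/json": "parser/json",  # pairing taken from this notebook's example
}
PARSER_BY_ACCESS_SERVICE = {
    "example-service": "parser/example",  # made-up entry
}


def select_generic_parser(config: dict) -> str:
    """Pick a parser type from mediaType first, then from accessService."""
    media_type = config.get("mediaType")
    if media_type in PARSER_BY_MEDIA_TYPE:
        return PARSER_BY_MEDIA_TYPE[media_type]

    access_service = config.get("accessService")
    if access_service in PARSER_BY_ACCESS_SERVICE:
        return PARSER_BY_ACCESS_SERVICE[access_service]

    raise LookupError("No generic parser known for this resource configuration")


print(select_generic_parser({"mediaType": "application/json"}))
```

Checking mediaType before accessService is a choice made for this sketch; the text only states that one of the two determines the parser.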

To supply a parser configuration, one needs to know the generic parser that will be used, as well as the specific data to be retrieved. Under “normal” circumstances, the parser configuration is stored alongside a reference to the data source for easy reusability.

Finally, a SOFT7 entity (or data model) is required to base the generated SOFT7 Data Source instance on. Again, under “normal” circumstances, the SOFT7 entity is stored alongside a reference to the data source for easy reusability.
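To make the idea of storing the parser configuration and the entity alongside a reference to the data source more concrete, here is a minimal sketch of such a record serialized as JSON. The record layout is an assumption made for illustration; the key names dataresource, parser, and entity are borrowed from the example in this notebook, but nothing here prescribes how or where such a record is actually stored.

```python
import json

# Hypothetical record bundling the data source reference, the parser
# configuration, and the entity identifier for later reuse.
record = {
    "dataresource": {
        "downloadUrl": "https://optimade.materialsproject.org/v1/structures/mp-1228448",
        "mediaType": "application/json",
    },
    "parser": {
        "parserType": "parser/json",
        "entity": "http://onto-ns.com/meta/1.0/OPTIMADEStructure",
    },
    "entity": "http://onto-ns.com/meta/1.0/OPTIMADEStructure",
}

# Serialize the record for storage, then load it back to reuse it.
serialized = json.dumps(record, indent=2)
restored = json.loads(serialized)

print(restored["parser"]["parserType"])
```

Keeping the three pieces in one record means a data source can be regenerated later without having to rediscover which parser settings and which entity belong to it.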

Example of generating a SOFT7 Data Source

The following example shows how to generate a SOFT7 Data Source instance based on the parts described above.

import logging

# Send log records from the "s7" logger to stderr.
logging.getLogger("s7").addHandler(logging.StreamHandler())

from s7.factories import create_entity

OPTIMADEStructure = create_entity("http://onto-ns.com/meta/1.0/OPTIMADEStructure#")
# OPTIMADEStructure.model_fields["properties"].annotation.model_fields[
#   "attributes"
# ].annotation.model_fields["properties"].annotation.model_fields

optimade_structure_data_as_soft = {
    "properties": {
        "id": "mp-1228448",
        "type": "structures",
        "attributes": {
            "dimensions": {
                "nsites": 5,
                "nelements": 2,
                "dimensionality": 3,
                "nspecies": 2,
                "nstructure_features": 0,
            },
            "properties": {
                "immutable_id": "645d307dbcd30f748b48eefb",
                "last_modified": "2021-02-10T01:38:17Z",
                "elements": ["Al", "O"],
                "elements_ratios": [0.4, 0.6],
                "chemical_formula_descriptive": "Al2O3",
                "chemical_formula_reduced": "Al2O3",
                "chemical_formula_hill": "Al2O3",
                "chemical_formula_anonymous": "A3B2",
                "dimension_types": [1, 1, 1],
                "nperiodic_dimensions": 3,
                "lattice_vectors": [
                    [
                        -1.508747,
                        -2.6132,
                        0.000129
                    ],
                    [
                        -1.508747,
                        2.6132,
                        -0.000129
                    ],
                    [
                        0,
                        0.000184,
                        -7.574636
                    ]
                ],
                "cartesian_site_positions": [
                    [
                        -1.508747,
                        2.6132805930160004,
                        -5.469387346584
                    ],
                    [
                        -1.508747,
                        -2.613096593016,
                        -2.105248653416
                    ],
                    [
                        -1.508747,
                        -0.8710047307360002,
                        -6.063801547042
                    ],
                    [
                        -1.508747,
                        0.8711887307360001,
                        -1.510834452958
                    ],
                    [
                        0,
                        0.000092,
                        -3.787318
                    ]
                ],
                "species": [
                    {
                        "dimensions": {
                            "nelements": 1,
                            "nattached_elements": 0,
                        },
                        "properties": {
                            "name": "Al",
                            "chemical_symbols": ["Al"],
                            "concentration": [1],
                            "mass": None,
                            "original_name": None,
                            "attached": None,
                            "nattached": None,
                        }
                    },
                    {
                        "dimensions": {
                            "nelements": 1,
                            "nattached_elements": 0,
                        },
                        "properties": {
                            "name": "O",
                            "chemical_symbols": ["O"],
                            "concentration": [1],
                            "mass": None,
                            "original_name": None,
                            "attached": None,
                            "nattached": None,
                        }
                    },
                ],
                "species_at_sites": ["Al", "Al", "O", "O", "O"],
                "assemblies": None,
                "structure_features": [],
            }
        }
    }
}

# print(OPTIMADEStructure.model_fields["properties"].annotation.model_fields[
#   "attributes"
# ].annotation.__args__[0].__doc__)
OPTIMADEStructure(**optimade_structure_data_as_soft)

import os

from s7.factories import create_datasource

os.environ["OTELIB_DEBUG"] = "true"

datasource = create_datasource(
    oteapi_url="python",
    entity="http://onto-ns.com/meta/1.0/OPTIMADEStructure",
    configs={
        "dataresource": {
            "resourceType": "resource/url",
            "downloadUrl": "https://optimade.materialsproject.org/v1/structures/mp-1228448",
            "mediaType": "application/json",
        },
        "parser": {
            "parserType": "parser/json",
            "entity": "http://onto-ns.com/meta/1.0/OPTIMADEStructure",
        },
        "mapping": {
            "mappingType": "triples",
            "prefixes": {
                "optimade": "https://optimade.materialsproject.org/v1/structures/mp-1228448#",
                "s7_top": "http://onto-ns.com/meta/1.0/OPTIMADEStructure#",
                "s7_attr": "http://onto-ns.com/meta/1.0/OPTIMADEStructureAttributes#",
                "s7_species": "http://onto-ns.com/meta/1.0/OPTIMADEStructureSpecies#",
            },
            "triples": {
                # top
                ("optimade:data.id", "", "s7_top:properties.id"),
                ("optimade:data.type", "", "s7_top:properties.type"),
                ("optimade:data.attributes", "", "s7_top:properties.attributes"),

                # attributes - dimensions
                ("optimade:data.attributes.nsites", "", "s7_attr:dimensions.nsites"),
                (
                    "optimade:data.attributes.nelements",
                    "",
                    "s7_attr:dimensions.nelements",
                ),
                (
                    "optimade:data.attributes.dimension_types",
                    "",
                    "s7_attr:dimensions.dimensionality",
                ),
                ("optimade:data.attributes.species", "", "s7_attr:dimensions.nspecies"),
                (
                    "optimade:data.attributes.structure_features",
                    "",
                    "s7_attr:dimensions.nstructure_features",
                ),
                # attributes - properties
                (
                    "optimade:data.attributes.immutable_id",
                    "",
                    "s7_attr:properties.immutable_id",
                ),
                (
                    "optimade:data.attributes.last_modified",
                    "",
                    "s7_attr:properties.last_modified",
                ),
                (
                    "optimade:data.attributes.elements",
                    "",
                    "s7_attr:properties.elements",
                ),
                (
                    "optimade:data.attributes.elements_ratios",
                    "",
                    "s7_attr:properties.elements_ratios",
                ),
                (
                    "optimade:data.attributes.chemical_formula_descriptive",
                    "",
                    "s7_attr:properties.chemical_formula_descriptive",
                ),
                (
                    "optimade:data.attributes.chemical_formula_reduced",
                    "",
                    "s7_attr:properties.chemical_formula_reduced",
                ),
                (
                    "optimade:data.attributes.chemical_formula_hill",
                    "",
                    "s7_attr:properties.chemical_formula_hill",
                ),
                (
                    "optimade:data.attributes.chemical_formula_anonymous",
                    "",
                    "s7_attr:properties.chemical_formula_anonymous",
                ),
                (
                    "optimade:data.attributes.dimension_types",
                    "",
                    "s7_attr:properties.dimension_types",
                ),
                (
                    "optimade:data.attributes.nperiodic_dimensions",
                    "",
                    "s7_attr:properties.nperiodic_dimensions",
                ),
                (
                    "optimade:data.attributes.lattice_vectors",
                    "",
                    "s7_attr:properties.lattice_vectors",
                ),
                (
                    "optimade:data.attributes.cartesian_site_positions",
                    "",
                    "s7_attr:properties.cartesian_site_positions",
                ),
                (
                    "optimade:data.attributes.species_at_sites",
                    "",
                    "s7_attr:properties.species_at_sites",
                ),
                (
                    "optimade:data.attributes.structure_features",
                    "",
                    "s7_attr:properties.structure_features",
                ),
                ("optimade:data.attributes.species", "", "s7_attr:properties.species"),

                # attributes.species - properties
                (
                    "optimade:data.attributes.species.name",
                    "",
                    "s7_species:properties.name",
                ),
                (
                    "optimade:data.attributes.species.chemical_symbols",
                    "",
                    "s7_species:properties.chemical_symbols",
                ),
                (
                    "optimade:data.attributes.species.concentration",
                    "",
                    "s7_species:properties.concentration",
                ),
                (
                    "optimade:data.attributes.species.mass",
                    "",
                    "s7_species:properties.mass",
                ),
                (
                    "optimade:data.attributes.species.original_name",
                    "",
                    "s7_species:properties.original_name",
                ),
                (
                    "optimade:data.attributes.species.attached",
                    "",
                    "s7_species:properties.attached",
                ),
                (
                    "optimade:data.attributes.species.nattached",
                    "",
                    "s7_species:properties.nattached",
                ),
                # attributes.species - dimensions
                (
                    "optimade:data.attributes.species.chemical_symbols",
                    "",
                    "s7_species:dimensions.nelements",
                ),
                (
                    "optimade:data.attributes.species.attached",
                    "",
                    "s7_species:dimensions.nattached_elements",
                ),
            },
        },
    },
)

datasource.attributes.properties.species[0].properties.name

from s7.factories.datasource_factory import CACHE

print(CACHE)
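The mapping configuration above writes each triple with prefixed names such as optimade:data.id and s7_top:properties.id. The sketch below shows one way such names could be expanded into full identifiers, assuming the usual convention of concatenating the prefix value with the local name. The helpers expand_name and expand_triple are hypothetical and are not part of s7 or OTEAPI; how the mapping strategy resolves prefixes internally is not shown in this notebook.

```python
# A subset of the prefixes used in the mapping configuration above.
prefixes = {
    "optimade": "https://optimade.materialsproject.org/v1/structures/mp-1228448#",
    "s7_top": "http://onto-ns.com/meta/1.0/OPTIMADEStructure#",
}


def expand_name(prefixed_name: str, prefixes: dict) -> str:
    """Replace a known 'prefix:' with its full value; leave other names unchanged."""
    prefix, separator, local_name = prefixed_name.partition(":")
    if not separator or prefix not in prefixes:
        return prefixed_name
    return prefixes[prefix] + local_name


def expand_triple(triple: tuple, prefixes: dict) -> tuple:
    """Expand every element of a (subject, predicate, object) triple."""
    return tuple(expand_name(part, prefixes) for part in triple)


expanded = expand_triple(("optimade:data.id", "", "s7_top:properties.id"), prefixes)
print(expanded)
```

In this sketch the empty predicate is passed through untouched, and names whose prefix is not listed (including full URLs) are returned as they are.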