Developer Interface

Main Interface

soso.main.convert(file: str, strategy: str, **kwargs: dict) str[source]

Return SOSO markup for a metadata file and specified strategy.

Parameters:
  • file – The path to the metadata file. Refer to the strategy’s documentation for a list of supported file types.

  • strategy – The conversion strategy to use. Available strategies include: EML and SPASE.

  • kwargs – Additional keyword arguments for passing information to the chosen strategy. This can help in the case of unmappable properties. See the Notes section in the strategy’s documentation for more information.

Returns:

The SOSO graph in JSON-LD format.
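
A minimal usage sketch, assuming the soso package is installed. The helper name convert_example is hypothetical, and the sketch assumes that the string returned by convert can be parsed with json.loads.

```python
import json


def convert_example(strategy: str = "EML") -> dict:
    """Convert the example file bundled for `strategy` and parse the result."""
    # Imported lazily so this sketch can be loaded even where soso is absent.
    from soso.main import convert
    from soso.utilities import get_example_metadata_file_path

    # get_example_metadata_file_path returns a Path; convert expects a str.
    example_file = str(get_example_metadata_file_path(strategy))
    markup = convert(file=example_file, strategy=strategy)

    # convert returns the JSON-LD graph as a string, so parse it to a dict.
    return json.loads(markup)
```

Calling convert_example("SPASE") would run the same steps against the SPASE example file.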

Strategy Interface

class soso.interface.StrategyInterface(metadata: Any = None, file: str = None, schema_version: str = None, **kwargs: dict)[source]

Define the interface that each conversion strategy must implement.

Attributes:
metadata:

The metadata object, which could be an XML tree, a JSON object, or another suitable representation. This object is utilized by strategy methods to generate SOSO properties.

file:

The path to the metadata file.

schema_version:

The version of the metadata schema.

kwargs:

Additional keyword arguments for passing information to the chosen strategy. This can help in the case of unmappable properties. See the Notes section in the strategy’s documentation for more information.

get_citation()[source]
Returns:

The citation for the dataset.

get_contributor()[source]
Returns:

The contributor(s) of a dataset.

get_creator()[source]
Returns:

The creator(s) of a dataset.

get_date_created()[source]
Returns:

The date the dataset was initially generated.

get_date_modified()[source]
Returns:

The date the dataset was most recently updated or changed.

get_date_published()[source]
Returns:

The date when a dataset was made available to the public through a publication process.

get_description()[source]
Returns:

A short summary describing a dataset.

get_distribution()[source]
Returns:

Where to get the data and in what format.

get_expires()[source]
Returns:

The date when the dataset expires and is no longer useful or available.

get_funding()[source]
Returns:

The funding for a dataset.

get_id()[source]
Returns:

The @id property for the dataset.

get_identifier()[source]
Returns:

The identifier for the dataset, such as a DOI.

get_included_in_data_catalog()[source]
Returns:

The data catalog that the dataset is included in.

get_is_accessible_for_free()[source]
Returns:

Whether the dataset is accessible for free.

get_is_based_on()[source]
Returns:

Links to source datasets.

get_keywords()[source]
Returns:

Keywords summarizing the dataset.

get_license()[source]
Returns:

The license of a dataset.

get_name()[source]
Returns:

A descriptive name of a dataset.

get_potential_action()[source]
Returns:

The query parameters and methods to actuate a data request.

get_provider()[source]
Returns:

The provider of a dataset.

get_publisher()[source]
Returns:

The publisher of a dataset.

get_same_as()[source]
Returns:

Other URLs that can be used to access the dataset page, usually in a different repository.

get_spatial_coverage()[source]
Returns:

The location on Earth that is the focus of the dataset content.

get_subject_of()[source]
Returns:

The metadata record for the dataset.

get_temporal_coverage()[source]
Returns:

The time period(s) that the content applies to.

get_url()[source]
Returns:

The location of a page describing the dataset.

get_variable_measured()[source]
Returns:

The measurement variables of the dataset.

get_version()[source]
Returns:

The version number or identifier for the dataset.

get_was_derived_from()[source]
Returns:

Links to source datasets.

get_was_generated_by()[source]
Returns:

An execution linking a program to source and derived products.

get_was_revision_of()[source]
Returns:

A link to the prior version of the dataset.

EML Strategy

class soso.strategies.eml.eml.EML(file: str, **kwargs: dict)[source]

Define the conversion strategy for EML (Ecological Metadata Language).

Attributes:
file:

The path to the metadata file. This should be an XML file in EML format.

schema_version:

The version of the EML schema used in the metadata file.

kwargs:

Additional keyword arguments for handling unmappable properties. See the Notes section below for details.

Notes:

Some properties of this metadata standard don’t directly map to SOSO. However, these properties can still be included by passing the information as kwargs. Each key should match the property name, and each value should be the value to assign to that property. For a deeper understanding of each SOSO property, refer to the SOSO guidelines.

Below are unmappable properties for this strategy:
  • @id of the Dataset

  • url

  • sameAs

  • version

  • isAccessibleForFree

  • citation

  • includedInDataCatalog

  • subjectOf

  • potentialAction

  • dateCreated

  • expires

  • provider

  • publisher

  • prov:wasRevisionOf

  • prov:wasGeneratedBy
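
As a sketch of the kwargs mechanism described in the Notes, the following passes two of the unmappable properties to convert. The property values and the helper name convert_with_unmappable_properties are hypothetical, and the soso package is assumed to be installed.

```python
# Hypothetical values for properties that the EML strategy cannot supply.
# Keys match the property names from the list of unmappable properties.
unmappable_properties = {
    "url": "https://example.org/dataset/1",
    "version": "1.0.0",
}


def convert_with_unmappable_properties(file: str) -> str:
    """Convert an EML file, supplying unmappable properties as kwargs."""
    # Imported lazily so this sketch can be loaded even where soso is absent.
    from soso.main import convert

    return convert(file=file, strategy="EML", **unmappable_properties)
```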

get_citation() None[source]
Returns:

The citation for the dataset.

get_contributor() list | None[source]
Returns:

The contributor(s) of a dataset.

get_creator() list | None[source]
Returns:

The creator(s) of a dataset.

get_date_created() None[source]
Returns:

The date the dataset was initially generated.

get_date_modified() str | None[source]
Returns:

The date the dataset was most recently updated or changed.

get_date_published() str | None[source]
Returns:

The date when a dataset was made available to the public through a publication process.

get_description() str | None[source]
Returns:

A short summary describing a dataset.

get_distribution() list | None[source]
Returns:

Where to get the data and in what format.

get_expires() None[source]
Returns:

The date when the dataset expires and is no longer useful or available.

get_funding() list | None[source]
Returns:

The funding for a dataset.

get_id() None[source]
Returns:

The @id property for the dataset.

get_identifier() str | None[source]
Returns:

The identifier for the dataset, such as a DOI.

get_included_in_data_catalog() None[source]
Returns:

The data catalog that the dataset is included in.

get_is_accessible_for_free() None[source]
Returns:

Whether the dataset is accessible for free.

get_is_based_on() list | None[source]
Returns:

Links to source datasets.

get_keywords() list | None[source]
Returns:

Keywords summarizing the dataset.

get_license() str | None[source]
Returns:

The license of a dataset.

get_name() str | None[source]
Returns:

A descriptive name of a dataset.

get_potential_action() None[source]
Returns:

The query parameters and methods to actuate a data request.

get_provider() None[source]
Returns:

The provider of a dataset.

get_publisher() None[source]
Returns:

The publisher of a dataset.

get_same_as() None[source]
Returns:

Other URLs that can be used to access the dataset page, usually in a different repository.

get_spatial_coverage() list | None[source]
Returns:

The location on Earth that is the focus of the dataset content.

get_subject_of() dict | None[source]
Returns:

The metadata record for the dataset.

get_temporal_coverage() str | dict | None[source]
Returns:

The time period(s) that the content applies to.

get_url() None[source]
Returns:

The location of a page describing the dataset.

get_variable_measured() list | None[source]
Returns:

The measurement variables of the dataset.

get_version() None[source]
Returns:

The version number or identifier for the dataset.

get_was_derived_from() list | None[source]
Returns:

Links to source datasets.

get_was_generated_by() None[source]
Returns:

An execution linking a program to source and derived products.

get_was_revision_of() None[source]
Returns:

A link to the prior version of the dataset.

SPASE Strategy

class soso.strategies.spase.spase.SPASE(file: str, **kwargs: dict)[source]

Define the conversion strategy for SPASE (Space Physics Archive Search and Extract).

Attributes:
file:

The path to the metadata file. This should be an XML file in SPASE format.

schema_version:

The version of the SPASE schema used in the metadata file.

kwargs:

Additional keyword arguments for handling unmappable properties. See the Notes section below for details.

Notes:

Some properties of this metadata standard don’t directly map to SOSO. However, these properties can still be included by passing the information as kwargs. Each key should match the property name, and each value should be the value to assign to that property. For a deeper understanding of each SOSO property, refer to the SOSO guidelines.

Below are unmappable properties for this strategy:
  • includedInDataCatalog

  • is_accessible_for_free

  • version

  • expires

  • provider

A shared conversion script is available for this standard. It is designed for repositories that supplement SPASE metadata with shared infrastructure, using the ancillary information to generate a richer SOSO record.

get_citation() List[Dict] | None[source]
Returns:

The citation for the dataset.

get_contributor() List[Dict] | None[source]
Returns:

The contributor(s) of a dataset.

get_creator() List[Dict] | None[source]
Returns:

The creator(s) of a dataset.

get_date_created() str | None[source]
Returns:

The date the dataset was initially generated.

get_date_modified() str | None[source]
Returns:

The date the dataset was most recently updated or changed.

get_date_published() str | None[source]
Returns:

The date when a dataset was made available to the public through a publication process.

get_description() List | str[source]
Returns:

A short summary describing a dataset.

get_distribution() List[Dict] | None[source]
Returns:

Where to get the data and in what format.

get_expires() None[source]
Returns:

The date when the dataset expires and is no longer useful or available.

get_funding() List[Dict] | None[source]
Returns:

The funding for a dataset.

get_id() str[source]
Returns:

The @id property for the dataset.

get_identifier() Dict | List[Dict] | None[source]
Returns:

The identifier for the dataset, such as a DOI.

get_included_in_data_catalog() None[source]
Returns:

The data catalog that the dataset is included in.

get_is_accessible_for_free() None[source]
Returns:

Whether the dataset is accessible for free.

get_is_based_on() List[Dict] | Dict | None[source]
Returns:

Links to source datasets.

get_keywords() List | None[source]
Returns:

Keywords summarizing the dataset.

get_license() List | None[source]
Returns:

The license of a dataset.

get_name() str[source]
Returns:

A descriptive name of a dataset.

get_potential_action() List[Dict] | None[source]
Returns:

The query parameters and methods to actuate a data request.

get_provider() None[source]
Returns:

The provider of a dataset.

get_publisher() Dict | None[source]
Returns:

The publisher of a dataset.

get_same_as() List | None[source]
Returns:

Other URLs that can be used to access the dataset page, usually in a different repository.

get_spatial_coverage() List[Dict] | None[source]
Returns:

The location on Earth that is the focus of the dataset content.

get_subject_of(*moreLicenseInfo) Dict | None[source]
Returns:

The metadata record for the dataset.

get_temporal_coverage() str | Dict | None[source]
Returns:

The time period(s) that the content applies to.

get_url() str[source]
Returns:

The location of a page describing the dataset.

get_variable_measured() List[Dict] | None[source]
Returns:

The measurement variables of the dataset.

get_version() None[source]
Returns:

The version number or identifier for the dataset.

get_was_derived_from() Dict | None[source]
Returns:

Links to source datasets.

get_was_generated_by() List[Dict] | None[source]
Returns:

An execution linking a program to source and derived products.

get_was_revision_of() List[Dict] | Dict | None[source]
Returns:

A link to the prior version of the dataset.

Utilities


soso.utilities.as_numeric(value: Any) None | int | float[source]
Parameters:

value – The value to convert to a numeric value.

Returns:

A numeric value.
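
The return type None | int | float suggests a conversion along the following lines. This is a sketch of one plausible behavior, not the library's implementation; the name to_number and the exact parsing rules are assumptions.

```python
from typing import Any, Union


def to_number(value: Any) -> Union[None, int, float]:
    """Return an int or float parsed from `value`, or None if not numeric."""
    # Work on the text form so that a float such as 3.7 is not truncated
    # by int(); int("3.7") raises ValueError and float("3.7") then succeeds.
    text = str(value).strip()
    for parse in (int, float):
        try:
            return parse(text)
        except ValueError:
            continue
    return None
```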

soso.utilities.delete_null_values(res: Any) Any[source]

Remove null values from results returned by strategy methods.

Parameters:

res – The results to clean.

Returns:

The results with all null values removed. None is returned if all values are null.

Notes:

This function helps developers of strategy methods clean their results before returning them to the user, so that the results are free of meaningless values.

Null values are defined as follows:
  • None

  • An empty string

  • An empty list

  • An empty dictionary

  • A dictionary with only one key, “@type”
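
The null-value rules listed above can be sketched as a recursive cleaner. This is an illustration of the rules only, not the library's implementation; the names is_null and strip_nulls are hypothetical.

```python
from typing import Any


def is_null(value: Any) -> bool:
    """Apply the null definitions: None, empty str/list/dict, or @type-only dict."""
    if value is None:
        return True
    if isinstance(value, (str, list, dict)) and len(value) == 0:
        return True
    if isinstance(value, dict) and set(value) == {"@type"}:
        return True
    return False


def strip_nulls(res: Any) -> Any:
    """Remove null values at every nesting level; return None if nothing remains."""
    if isinstance(res, dict):
        # Clean the children first, then drop the entries that became null.
        cleaned = {key: strip_nulls(value) for key, value in res.items()}
        cleaned = {key: value for key, value in cleaned.items() if not is_null(value)}
        return None if is_null(cleaned) else cleaned
    if isinstance(res, list):
        cleaned = [strip_nulls(item) for item in res]
        cleaned = [item for item in cleaned if not is_null(item)]
        return cleaned or None
    return None if is_null(res) else res
```

Cleaning children before parents matters here: a creator entry whose only remaining key is “@type” becomes null only after its empty name has been removed.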

soso.utilities.delete_unused_vocabularies(graph: dict) dict[source]

Delete unused vocabularies from the top-level JSON-LD @context. This function helps clean the graph created by main.convert before it is returned to the user.

Parameters:

graph – The JSON-LD graph.

Returns:

The JSON-LD graph, with unused vocabularies removed from the top level @context.
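
One way to picture this cleanup is sketched below, assuming the @context is a dictionary of prefixes and that a vocabulary counts as used when some key or string value in the graph starts with its prefix followed by a colon. The name prune_context and this definition of "used" are assumptions, not the library's implementation.

```python
def prune_context(graph: dict) -> dict:
    """Return a copy of `graph` whose @context keeps only prefixes in use."""
    context = graph.get("@context")
    if not isinstance(context, dict):
        return graph

    used = set()

    def note(text: str) -> None:
        # "prov:wasDerivedFrom" -> prefix "prov"; plain terms have no colon.
        prefix, separator, _ = text.partition(":")
        if separator and prefix in context:
            used.add(prefix)

    def walk(node) -> None:
        if isinstance(node, dict):
            for key, value in node.items():
                note(key)
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)
        elif isinstance(node, str):
            note(node)

    # Inspect everything except the @context itself.
    walk({key: value for key, value in graph.items() if key != "@context"})

    # Keep JSON-LD keywords such as @vocab, plus every prefix that was seen.
    kept = {k: v for k, v in context.items() if k.startswith("@") or k in used}
    return {**graph, "@context": kept}
```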

soso.utilities.generate_citation_from_doi(url: str, style: str, locale: str) str | None[source]
Parameters:
  • url – The URL prefixed DOI.

  • style – The citation style. For example, “apa”. Options are listed here.

  • locale – The locale. For example, “en-US”. Options are listed here.

Returns:

The citation in the specified style and locale. None is returned if the DOI is invalid or if the citation could not be generated.

Notes:

This function supports the DOI registration agencies and methods listed here.

soso.utilities.get_empty_metadata_file_path(strategy: str) Path[source]
Parameters:

strategy – Metadata strategy. Can be: EML.

Returns:

File path of an empty metadata file.

soso.utilities.get_example_metadata_file_path(strategy: str) Path[source]

Return the file path of an example metadata file.

Parameters:

strategy – Metadata strategy. Can be: EML, SPASE.

Returns:

File path.

soso.utilities.get_sssom_file_path(strategy: str) Path[source]

Return the SSSOM file path for the specified strategy.

Parameters:

strategy – Metadata strategy. Can be: EML.

Returns:

File path.

soso.utilities.guess_mime_type_with_fallback(filename: str) str | None[source]

Guess a MIME type by first checking the package’s bundled database. If no match is found, fall back to the operating system’s default.

Parameters:

filename – The file name or path to guess the MIME type for.

Returns:

The guessed MIME type as a string, or None if no type could be determined.
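
The lookup order described above can be sketched with the standard library. The BUNDLED_TYPES table is a hypothetical stand-in for the bundled database, and guess_mime_type is an illustrative name; the library's actual data and code may differ.

```python
import mimetypes
from pathlib import PurePath
from typing import Optional

# Hypothetical stand-in for a bundled extension-to-type table.
BUNDLED_TYPES = {
    ".csv": "text/csv",
    ".json": "application/json",
}


def guess_mime_type(filename: str) -> Optional[str]:
    """Check the bundled table first, then fall back to the system default."""
    suffix = PurePath(filename).suffix.lower()
    if suffix in BUNDLED_TYPES:
        return BUNDLED_TYPES[suffix]

    # mimetypes consults the platform's type maps, so results can vary by system.
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed
```

Checking a fixed table first keeps results the same across machines for the types that matter most, which is the benefit the description attributes to the bundled database.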

soso.utilities.is_html(text: str) bool[source]
Parameters:

text – The string to be checked.

Returns:

True if the string is likely an HTML document, False otherwise.

soso.utilities.is_url(text: str) bool[source]
Parameters:

text – The string to be checked.

Returns:

True if the string is likely a URL, False otherwise.

Note:

A string is considered a URL if it has scheme and network location values.
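
The rule in the note can be sketched with urllib.parse from the standard library. The name looks_like_url is hypothetical, and this is an illustration of the stated rule rather than the library's code.

```python
from urllib.parse import urlparse


def looks_like_url(text: str) -> bool:
    """Return True when `text` has both a scheme and a network location."""
    try:
        parts = urlparse(text)
    except ValueError:
        # urlparse raises ValueError for some malformed inputs.
        return False
    return bool(parts.scheme and parts.netloc)
```

Under this rule, a string such as "doi:10.1234/abc" has a scheme but no network location, so it is not treated as a URL.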

soso.utilities.limit_to_5000_characters(text: str) str[source]
Parameters:

text – The text to limit to 5000 characters.

Returns:

The text limited to 5000 characters as per Google recommendations for textual properties.

soso.utilities.setup_logging(level: str = 'INFO', log_file: str = None)[source]

Set up global Daiquiri logging for the application.

Configures logging to output to the console (with color formatting) and optionally to a file. Should be called once at application startup (e.g., in main.py or CLI entry point).

Parameters:
  • level – Logging level to use (e.g., “DEBUG”, “INFO”, “WARNING”, “ERROR”).

  • log_file – If provided, log output will also be written to this file.

Validation

The validation module.

soso.validation.validate(data_graph: str, shacl_graph: str = None) dict[source]

Validate a data graph against a SHACL shape graph.

This is a simple wrapper around pyshacl.validate.

Parameters:
  • data_graph – The path to the data graph file in JSON-LD format.

  • shacl_graph – The path to the SHACL shape graph file in Turtle format. If shacl_graph is a valid file path, that file is used. If it matches the name of a known package resource, the resource is resolved from the package. If None, a default SOSO SHACL shape is used. Available package resources include: soso_common_v1.2.3.ttl.

Returns:

A dictionary with validation results, including:
  • data_graph – The input data graph path.

  • shacl_graph – The resolved SHACL shape graph path.

  • conforms – Boolean indicating if the data graph conforms to the SHACL shape.

  • report – Full SHACL validation report as text.
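
A short usage sketch, assuming the soso package is installed and that the result keys are named as described in the Returns section. The helper name check_conformance is hypothetical.

```python
def check_conformance(data_graph_path: str) -> bool:
    """Validate a JSON-LD file against the default SOSO shape."""
    # Imported lazily so this sketch can be loaded even where soso is absent.
    from soso.validation import validate

    # With shacl_graph left as None, the default SOSO SHACL shape is used.
    results = validate(data_graph=data_graph_path)

    if not results["conforms"]:
        # The report is the full SHACL validation report as text.
        print(results["report"])
    return results["conforms"]
```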