Data Concept#

We like to see ORCESTRA as our common field campaign. Everyone should be able to use the gathered data, together and for mutual benefit.

This page first focuses on the general goals we want to achieve; they reflect our ideas of what a useful, practical data system should be capable of. The next section details the requirements we derive from these goals in order to find a good implementation, which is described in the third part.

Goals#

The purpose of these goals is to have a well-working environment for data dissemination (both during and post campaign) and to learn from what worked and what didn’t work during the EUREC4A field campaign and other previous projects. The goals are sorted in decreasing priority (i.e. 1 is the most important). We aim for all of them, but if we have to cut, we should cut at the end.

  1. a single list of existing datasets
    We want a common data collection of our field campaign. Everyone interested in ORCESTRA should be able to find available datasets. For clarity and consistency, there must be exactly one list.

  2. the datasets in the list are accessible
    Once someone has found a dataset in the list, the dataset should be usable. That is, the information in the list must be sufficient for anyone to open the dataset with common tools and little effort.

  3. datasets are well-formed and analysis-ready
    Useful datasets are typically written once and read often. The overall effort can be reduced by spending a bit more time on creating a dataset whenever that facilitates its later use.

  4. incremental backups are possible
    We expect that the ORCESTRA data collection is a valuable contribution to our scientific field. We should be able to have a backup of this collection. Realistically, the list will evolve over time, thus we will have to update any backups incrementally.

  5. datasets are on a shared, distributed system
    We want the data system to be used in actual scientific work (not only for “data publication”). Traditional systems are often too complicated or slow for day-to-day usage. A distributed system increases availability and performance (e.g. through local caches, redundant servers…), which makes actually working with published data convenient, fast and fun.

Requirements#

About the wording

The use of the key words must, must not, should, should not and may in bold follows the definitions of RFC 2119.

1. a single list of existing datasets#

There must be one definitive list of all ORCESTRA datasets. There must not be more than one definitive list. Any datasets which are not part of the list must not be called “an ORCESTRA dataset”.

  • likely not all datasets will be hosted on a single server (this didn’t work at all in the past), thus the list must be kept independent of the data storage locations

    • list must be updated when data is added / changed / removed

  • likely, the list will not be complete at one point in time

    • we must expect that the list changes over time

  • in the past, we had problems answering the question “what is all the campaign data?”. We aim to sidestep this problem by requiring every dataset that should be associated with ORCESTRA to be in this list. The effort to add datasets to the list should be kept low in order not to exclude anyone from making “ORCESTRA datasets”.


2. the datasets in the list are accessible#

dataset = get(dataset_list, dataset_identifier)

☝️ any dataset can be opened directly by anyone

  • The access must be testable (e.g. weekly CI).

  • The dataset should be accessible without credentials.

  • The dataset should be available publicly as soon as possible (ideally immediately after acquisition).

  • The dataset should be available publicly not later than one year after the campaign finished. (TODO: this point might be better placed in the data policy; we really want something like a “must” here, but in this section that would technically mean no data can be added after a year, which we also don’t want.)

  • The dataset must be stored in one of the following data formats:

    • NetCDF

    • Zarr

  • The list of accepted data formats can be extended if the data format is well standardized and readable in several common programming languages.
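
The accessor contract at the top of this section (`dataset = get(dataset_list, dataset_identifier)`) can be sketched in plain Python. This is a hedged illustration only: the dict-based catalog, the entry name, and the URL are hypothetical stand-ins, and in practice the entry would be handed to a tool such as xarray.

```python
def get(dataset_list, dataset_identifier):
    """Resolve an identifier to an entry sufficient to open the dataset."""
    entry = dataset_list[dataset_identifier]
    # Goal 2 requires that this entry alone is enough to open the data:
    # a storage URL plus one of the accepted formats (NetCDF or Zarr).
    assert {"url", "format"} <= entry.keys()
    return entry


# Hypothetical stand-in catalog with a single entry.
dataset_list = {
    "halo-radar-moments": {
        "url": "https://example.org/halo-radar-moments.zarr",
        "format": "zarr",
    }
}

dataset = get(dataset_list, "halo-radar-moments")
```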

3. datasets are well-formed and analysis-ready#

  • You should follow standard metadata schemes (CF-Conventions)

  • You may adhere to the GEOM metadata standard. In the event of a conflict, the CF-Conventions must take precedence.

  • You may designate data processing levels to help users understand the quality of the provided data. If you provide processing levels, the levels should follow the EOSDIS data processing levels scheme.

  • You should work with your own published datasets.

  • Datasets should be reviewed across teams.
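
As a sketch of what “following standard metadata schemes” means in practice, CF-style attributes for a hypothetical variable could look as follows. The `standard_name` comes from the CF standard name table; the variable itself and all values are illustrative, not a prescribed schema.

```python
# CF-Conventions-style metadata for a hypothetical air temperature variable.
air_temperature_attrs = {
    "standard_name": "air_temperature",  # from the CF standard name table
    "long_name": "air temperature",
    "units": "K",                        # UDUNITS-compatible unit string
}

# Global attributes identifying the conventions the dataset follows.
global_attrs = {
    "Conventions": "CF-1.8",
    "title": "Example ORCESTRA dataset",
}
```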

4. incremental backups are possible#

  • You must provide a version number for your dataset in the corresponding catalog metadata.

  • You should provide the version number of your dataset in the dataset attributes

  • You must not provide any version information in the dataset name

  • You should add a content identifier for your dataset (TODO: specify hash algorithm?)

  • The catalog entry must point to the most recent version of the dataset by default.

  • The catalog entry should point to previous versions if explicitly requested.

  • The storage location should provide a method to efficiently check if something changed (e.g. HTTP ETag, If-Modified-Since…)
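
A minimal sketch of the versioning rules above: the version lives in the catalog metadata (never in the dataset name), and a content identifier is attached. The requirements leave the hash algorithm open; sha256 and the whole catalog entry are assumptions for illustration.

```python
import hashlib


def content_identifier(payload: bytes, algorithm: str = "sha256") -> str:
    """Compute a content identifier; the algorithm choice is an assumption."""
    digest = hashlib.new(algorithm, payload).hexdigest()
    return f"{algorithm}:{digest}"


# Hypothetical catalog entry following the rules above.
catalog_entry = {
    "name": "halo-radar-moments",             # no version info in the name
    "version": "1.2.0",                       # most recent version is the default
    "previous_versions": ["1.1.0", "1.0.0"],  # available on explicit request
    "cid": content_identifier(b"example dataset bytes"),
}
```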

5. datasets are on a shared, distributed system#

Use a distributed storage protocol (e.g. IPFS, ONEDATA) to make datasets accessible.

Closing remarks#

  • One may write a data paper to describe datasets that are part of the ORCESTRA data collection. The ORCESTRA data collection and a data paper may benefit mutually:

    • Preparing data for the ORCESTRA data collection may help writing a data paper.

    • Writing a data paper may help preparing data for the ORCESTRA data collection.

Implementation#

Caution

This section is currently in an exploratory stage. We aim to show options and their respective advantages and disadvantages. We should try to converge on a more concrete implementation plan before the start of the campaign.

Catalog#

We aim to implement the dataset list from the requirements in the form of a data catalog. A data catalog in our sense is a somewhat formalized way of listing datasets together with a method to access those datasets. A catalog is machine-readable and supports the goal of dataset accessibility.

Note

The current decision is to aim for a STAC catalog to benefit from the more robust format and the more language-agnostic feature set. However, if the creation of the catalog proves to be too complicated in real-world applications, Intake catalogs seem to be an acceptable fallback.

Intake

Tool for reading data from the Python ecosystem.

  • ✅ known and tested in EUREC4A and AC3

  • ✅ easy to create

  • ✅ compatible with any kind of data

  • ❌ limited to Python

  • ❌ unstable format (Intake 2 broke a lot of things)

  • 🤔 has room for creative hacks

STAC

SpatioTemporal Asset Catalogs: the STAC specification is a common language to describe geospatial information.

  • ✅ stable format

  • ✅ integrations for multiple languages exist

  • ✅ can be used with Intake

  • ✅ common set of earth observation related metadata is defined

  • ❌ more complicated to create (but tools exist)

  • ❌ can only be used for spatio-temporal datasets

In any case, the catalog should be accessible through a well-known public URL, such that users always know where to start. We suggest either https://data.orcestra-campaign.org/catalog.yaml or https://data.orcestra-campaign.org/catalog.json, depending on whether Intake or STAC will be chosen.

We may want to use Continuous Integration tools to automatically build the actual catalog from simpler source input files. This might be particularly relevant if we opt for STAC catalogs, as they require providing the spatial and temporal extent of every dataset. We might want to extract this information automatically from the actual datasets if they follow e.g. the CF-Conventions, thus simplifying catalog creation and improving consistency.
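
For illustration, a minimal STAC Item for a hypothetical dataset could look as follows. The field names (`type`, `stac_version`, `geometry`, `bbox`, `properties`, `assets`) follow the STAC Item specification, while the id, spatio-temporal extent, and asset URL are invented; the Zarr media type shown is commonly used but not formally standardized.

```python
import json

# Minimal STAC Item for a hypothetical ORCESTRA dataset.
item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "halo-radar-moments",
    "geometry": {
        "type": "Polygon",
        "coordinates": [[[-61, 10], [-20, 10], [-20, 20], [-61, 20], [-61, 10]]],
    },
    "bbox": [-61, 10, -20, 20],
    "properties": {
        "start_datetime": "2024-08-10T00:00:00Z",
        "end_datetime": "2024-09-30T00:00:00Z",
        "datetime": None,  # null is allowed when a start/end range is given
    },
    "assets": {
        "data": {
            "href": "https://example.org/halo-radar-moments.zarr",
            "type": "application/vnd+zarr",  # community convention, not standardized
        }
    },
    "links": [],
}

# The item is plain JSON, so any language with a JSON parser can read it.
item_json = json.dumps(item)
```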

Storage and Access#

While the Catalog provides a unified access method to all the datasets, mostly independent of the underlying storage and access methods, the particular choices in this section will have an influence on practical data accessibility and maintenance effort. We can (and likely will have to) support multiple underlying storage and access methods. This section tries to briefly cover the advantages and disadvantages of those methods.

HTTP / Object Store

E.g. Swift, S3 or just a static HTTP server.

  • ✅ HTTP as access protocol

  • ✅ compatible with about everything

  • ❌ prone to link rot

  • ❌ single point of failure

  • ❌ in general: no guarantees about data integrity or persistence

DOI repo

This includes Pangaea, Aeris, Zenodo, etc.

  • ✅ DOI providers must commit to keeping data available for an extended period

  • ❌ providing all required information may be a burden

  • ❌ a DOI does not provide direct access to the data, thus we must fall back to direct HTTP links, effectively bypassing the DOI

  • ❌ single point of failure

NextCloud / OwnCloud

  • ✅ easy to upload

  • ✅ can be installed on-site at the campaign

  • ❌ access performance may be sub-optimal

  • ❌ single point of failure

  • 🤔 easy way to create user accounts

IPFS

  • ✅ can be installed on-site at the campaign

  • ✅ distributed, i.e. we can have multiple copies at different location (including local)

  • ✅ very fast access if data is cached locally

  • ✅ tracking data changes is easy due to the Merkle Tree structure

  • ❌ requires setting up an IPFS node on (or close to) the accessing machine for reasonable performance

OPeNDAP

  • ✅ minimizes data transfer

  • ❌ very hard to cache

  • ❌ very hard to track changes

  • ❌ often poor performance due to high server load

  • ❌ many problematic data types
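
The Merkle-tree argument from the IPFS bullets above can be illustrated with a toy example: each node’s hash depends on its children, so any change in a leaf changes the root, while unchanged subtrees keep their hashes (and can stay cached). This is a simplification; real IPFS hashes a DAG of blocks into CIDs, not a binary tree.

```python
import hashlib


def merkle_root(chunks: list) -> str:
    """Toy Merkle root over a list of byte chunks (sha256, binary tree)."""
    level = [hashlib.sha256(c).digest() for c in chunks]
    while len(level) > 1:
        if len(level) % 2:  # duplicate the last node on odd-sized levels
            level.append(level[-1])
        level = [hashlib.sha256(a + b).digest()
                 for a, b in zip(level[::2], level[1::2])]
    return level[0].hex()


# Editing a single chunk changes the root, which makes changes easy to spot.
root_a = merkle_root([b"part-0", b"part-1", b"part-2"])
root_b = merkle_root([b"part-0", b"part-1", b"part-2 (edited)"])
```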

Tracking Progress#

Some form of a second list of to-be-created datasets, prepared in advance, is likely helpful to track progress (e.g. as done for the (AC)3 campaign).

Good Datasets#

The implementation builds on top of HowTo EUREC4A and (AC)3 Airborne.

  • few large datasets are better than many small datasets

    • daily datasets are nice during creation, but
      full-campaign datasets are easier afterwards

    • (large) datasets benefit from good chunking

  • Some datatypes are problematic; you should not use them.
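
The chunking advice can be quantified with rough arithmetic: pick the number of time steps per chunk so that one chunk lands near a target size. The 32 MB target, the grid size, and the function name below are assumptions for illustration, not recommendations.

```python
def time_steps_per_chunk(n_lat, n_lon, itemsize=4, target_bytes=32 * 2**20):
    """Time-chunk length so that one chunk is roughly target_bytes large."""
    bytes_per_step = n_lat * n_lon * itemsize
    return max(1, target_bytes // bytes_per_step)


# A hypothetical full-campaign variable on a 600 x 800 float32 grid:
# 600 * 800 * 4 B = 1.92 MB per time step, so 17 steps fit in ~32 MB.
steps = time_steps_per_chunk(600, 800, itemsize=4)
```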

Special considerations on-site during the field campaign#

It would be great if we could fill and access the data catalog already during the field campaign. However, this might require some special considerations, as we generally can’t rely on a fast internet connection.

  • We may use links to local copies of the datasets in the catalog

  • If using IPFS, things might work out automatically as IPFS would discover and access local copies.