Step-by-step Guide to Indexing Data

Introduction

Once you have the Data Cube software installed and connected to a database, you can start to load in some data. This step is performed using the datacube command line tool.

Note

When you load data into the Data Cube using indexing, all you are doing is recording the existence of the data, together with detailed metadata about it, in the index. None of the data itself is copied, moved or transformed, so indexing is a relatively safe and fast process.

Steps Overview

Create a new product definition
Before the data itself can be added, a product describing the data must be created. This requires writing a Product Definition document (YAML).

Ensure the data is prepared
Each dataset to be indexed needs a datacube-friendly Dataset Document.

Index the data
Run the actual indexing process

Step 1. Creating a Product Definition

The ODC can handle many different types of data, and requires a bit of information up front to know what to do with them. This is the task of a Product Definition.

More detailed information on the structure of a product definition document can be found here.

Some example Product definitions are supplied here. Other examples include the Digital Earth Africa product definitions.
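As a rough illustration of the kind of information a Product Definition carries, the sketch below holds one as a Python dictionary and checks for a few top-level keys. All values are invented, and the list of expected keys is an assumption based on commonly seen ODC product definitions rather than the authoritative schema. `missing_keys` is a hypothetical helper, not part of the datacube library.

```python
import json

# Assumption: keys commonly present in ODC product definitions.
# The product definition documentation is the authoritative source.
EXPECTED_KEYS = ("name", "description", "metadata_type", "measurements")

# Invented example content; a real definition would live in a YAML file.
product_definition = {
    "name": "example_product",
    "description": "Example surface reflectance product",
    "metadata_type": "eo3",
    "metadata": {"product": {"name": "example_product"}},
    "measurements": [
        {"name": "red", "dtype": "int16", "nodata": -999, "units": "1"},
    ],
}


def missing_keys(definition, expected=EXPECTED_KEYS):
    """Return the expected top-level keys that the definition lacks."""
    return [key for key in expected if key not in definition]


if __name__ == "__main__":
    # Print the definition for inspection and report any gaps.
    print(json.dumps(product_definition, indent=2))
    print("missing keys:", missing_keys(product_definition))
```

A check like this only catches obviously incomplete documents before they are handed to the datacube tool; it does not replace the validation the tool performs itself.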

Loading Product Definitions

To load Products into your ODC run:

datacube product add <path-to-product-definition-yml>

If you made a mistake in a product definition, you can correct the file and update the product with:

datacube product update <path-to-product-definition-yml>
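When products are loaded from a script rather than typed by hand, the two commands above can be wrapped as follows. This is a minimal sketch: `product_command` and `run` are hypothetical helper names, and only the `datacube product add` and `datacube product update` subcommands shown above are used. The `dry_run` switch is a feature of this sketch, not of the datacube tool.

```python
import subprocess


def product_command(definition_path, update=False):
    """Build the argument list for adding or updating a product."""
    action = "update" if update else "add"
    return ["datacube", "product", action, definition_path]


def run(command, dry_run=True):
    """Print the command when dry_run is set, otherwise execute it.

    Returns 0 for a dry run, or the exit code of the command.
    """
    if dry_run:
        print(" ".join(command))
        return 0
    return subprocess.run(command, check=False).returncode


if __name__ == "__main__":
    # Hypothetical file name, used only for illustration.
    run(product_command("example_product.yaml"))
    run(product_command("example_product.yaml", update=True))
```

Defaulting to a dry run makes it easy to review what would be executed before touching the index.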

Step 2. Ensure Dataset Documents are complete

Every dataset that you intend to index requires a metadata document describing what the data represents and where it has come from, as well as what format it is stored in. At a minimum, you need the dimensions or fields you want to search by, such as lat, lon and time, but you can include any information you deem useful.

This metadata is typically stored in YAML documents, but JSON is also supported. It is recorded in the index and used for searching, querying and accessing the data.
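To make the shape of such a document more concrete, here is a hedged sketch that assembles a dataset document as a Python dictionary and serialises it as JSON. The field names follow the EO3 layout as commonly documented, but should be treated as assumptions and checked against the dataset documentation. The CRS, grid shape, transform and file paths are invented placeholders, and `make_dataset_document` is a hypothetical helper.

```python
import json
import uuid


def make_dataset_document(product_name, datetime_iso, band_paths):
    """Assemble a minimal EO3-style dataset document (illustrative only)."""
    return {
        "$schema": "https://schemas.opendatacube.org/dataset",
        "id": str(uuid.uuid4()),  # every dataset needs a unique identifier
        "product": {"name": product_name},
        "crs": "EPSG:4326",  # placeholder coordinate reference system
        "grids": {
            "default": {
                "shape": [100, 100],  # placeholder raster size (rows, cols)
                # Placeholder affine transform, row-major 3x3.
                "transform": [0.01, 0, 140.0, 0, -0.01, -30.0, 0, 0, 1],
            }
        },
        "properties": {"datetime": datetime_iso},  # time to search by
        "measurements": {
            band: {"path": path} for band, path in band_paths.items()
        },
        "lineage": {},
    }


if __name__ == "__main__":
    doc = make_dataset_document(
        "example_product", "2020-01-01T00:00:00Z", {"red": "red.tif"}
    )
    print(json.dumps(doc, indent=2))
```

The spatial fields (`crs`, `grids`) and `properties.datetime` are what make the dataset searchable by location and time; anything else useful can be added under `properties`.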

The data from Geoscience Australia already comes with relevant files (named ga-metadata.yaml), so no further steps are required for indexing them.

For third party datasets, see the examples detailed here. For common distribution formats, data can be indexed using one of the tools from odc-apps-dc-tools. In other cases, the metadata may need to be mapped to an ODC-compatible format. You can find examples of data preparation scripts here.

Step 3. Run the Indexing process

Everything is now ready, and we can use the datacube tool to add one or more datasets to our Cube:

datacube dataset add <path-to-dataset-document-yaml>

Note that this path can be a URI, such as the path to a document on S3.
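When many dataset documents need to be indexed, the command can be generated once per document. The sketch below collects documents under a local directory and also accepts URIs given as plain strings. `find_dataset_documents` and `dataset_add_commands` are hypothetical helpers, and the `*.yaml` pattern is an assumption about how the documents are named.

```python
from pathlib import Path


def find_dataset_documents(root, pattern="*.yaml"):
    """Recursively collect dataset documents under a local directory."""
    return sorted(Path(root).rglob(pattern))


def dataset_add_commands(locations):
    """Build one 'datacube dataset add' command per document.

    Locations may be local paths or URI strings. URIs should be passed
    as strings, since converting them to Path would mangle the scheme.
    """
    return [["datacube", "dataset", "add", str(loc)] for loc in locations]


if __name__ == "__main__":
    # Hypothetical directory and bucket names, for illustration only.
    local_docs = find_dataset_documents("datasets")
    remote_docs = ["s3://example-bucket/scene/dataset.yaml"]
    for command in dataset_add_commands([*local_docs, *remote_docs]):
        print(" ".join(command))
```

The generated lists can be passed to `subprocess.run` or simply printed and reviewed first.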

The Cube currently supports the indexing of STAC datasets but does not support them natively. This means that the dataset add command accepts both EO3 and STAC datasets, but STAC datasets are converted to and stored as EO3. Likewise, dataset update accepts both EO3 and STAC datasets, along with the allowed update keys, but only changes pertaining to the EO3 metadata document are registered. Other operations on the dataset work with the EO3 version of the dataset definition.