Dataset Documents

Dataset metadata documents define critical metadata about a dataset including:

  • available data measurements

  • platform and sensor names

  • geospatial extents and projection

  • acquisition time

  • provenance information

Traditionally, the EO format was used to capture information about individual datasets. However, that format has a number of issues, so it is now deprecated (see EO (deprecated)), and we recommend that everyone move to the EO3 Format.

The Open Data Cube determines the format from the $schema field in the document. Include an eo3 $schema for eo3 documents (the examples on this page use https://schemas.opendatacube.org/dataset). If no $schema field exists, the document is treated as the older eo format.
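The rule above can be sketched in Python. This is only an illustration, not the Open Data Cube's actual implementation: the function name `detect_format` is hypothetical, the schema URL is the one used in the examples on this page, and returning "unknown" for an unrecognised $schema value is an assumption, since that case is not described here.

```python
from typing import Any, Mapping

# Schema URL taken from the EO3 examples on this page.
EO3_SCHEMA = "https://schemas.opendatacube.org/dataset"


def detect_format(doc: Mapping[str, Any]) -> str:
    """Classify a parsed dataset document by its $schema field (hypothetical helper)."""
    schema = doc.get("$schema")
    if schema is None:
        # No $schema field: treated as the older eo format.
        return "eo"
    if schema == EO3_SCHEMA:
        return "eo3"
    # Assumption: any other value is reported as unknown rather than guessed.
    return "unknown"


eo3_doc = {"id": "f884df9b-4458-47fd-a9d2-1a52a2db8a1a", "$schema": EO3_SCHEMA}
legacy_doc = {"id": "a066a2ab-42f7-4e72-bc6d-a47a558b8172", "product_type": "DEM"}

print(detect_format(eo3_doc))
print(detect_format(legacy_doc))
```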

EO3 Format

EO3 is an intermediate format before we move to something more standard like STAC. Primary drivers for the development:

  1. Avoid duplication of spatial information, by storing only native projection information

  2. Capture geo-registration information per band, not per entire dataset

  3. Capture image size/resolution per band

  4. Lightweight lineage representation

# UUID of the dataset
id: f884df9b-4458-47fd-a9d2-1a52a2db8a1a
$schema: 'https://schemas.opendatacube.org/dataset'

# Product name
product:
  name: landsat8_example_product

# Native CRS, assumed to be the same across all bands
crs: "epsg:32660"

# Optional GeoJSON object in the units of native CRS.
# Defines a polygon such that all valid pixels across all bands
# are inside this polygon.
geometry:
  type: Polygon
  coordinates: [[..]]

# Mapping name:str -> { shape:     Tuple[ny: int, nx: int]
#                       transform: Tuple[float x 9]}
# Captures image size, and geo-registration
grids:
    default:  # "default" grid must be present
       shape: [7811, 7691]
       transform: [30, 0, 618285, 0, -30, -1642485, 0, 0, 1]
    pan:  # Landsat Panchromatic band is higher res image than other bands
       shape: [15621, 15381]
       transform: [15, 0, 618292.5, 0, -15, -1642492.5, 0, 0, 1]

# Per band storage information and references into `grids`
# Bands using the "default" grid should not need to reference it
measurements:
   pan:               # Band using non-default "pan" grid
     grid: "pan"      # should match the name used in `grids` mapping above
     path: "pan.tif"
   red:               # Band using "default" grid should omit `grid` key
     path: red.tif    # Path relative to the dataset location
   blue:
     path: blue.tif
   multiband_example:
     path: multi_band.tif
     band: 2          # int: 1-based index into multi-band file
   netcdf_example:    # just an example; mixing TIFF and netcdf in one product is not recommended
     path: some.nc
     layer: some_var  # str: netcdf variable to read

# optional dataset location (useful for public datasets)
location: https://landsatonaws.com/L8/099/072/LC08_L1GT_099072_20200523_20200523_01_RT/metadata.yaml

# Dataset properties, prefer STAC standard names here
# Timestamp is the only compulsory field here
properties:
  eo:platform: landsat-8
  eo:instrument: OLI_TIRS

  # If it's a single time instance use datetime
  datetime: 2020-01-01T07:02:54.188Z  # Use UTC

  # When recording time range use dtr:{start,end}_datetime
  dtr:start_datetime: 2020-01-01T07:02:02.233Z
  dtr:end_datetime:   2020-01-01T07:03:04.397Z

  # ODC specific "extensions"
  odc:processing_datetime: 2020-02-02T08:10:00.000Z

  odc:file_format: GeoTIFF
  odc:region_code: "074071"   # provider specific unique identifier for the same location
                              # for Landsat '{:03d}{:03d}'.format(path, row)

  dea:dataset_maturity: final # one of: final| interim| nrt (near real time)
  odc:product_family: ard     # can be useful for larger installations

# Lineage only references UUIDs of direct source datasets
# Mapping name:str -> [UUID]
lineage: {}  # set to empty object if no lineage is defined
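
As a rough illustration of how measurements refer to grids in the example above, the Python sketch below looks up the grid for a band and falls back to "default" when the `grid` key is omitted. The function name `grid_for_measurement` is hypothetical, and the document is written as a plain dict trimmed from the example; this is not how the Open Data Cube itself resolves grids.

```python
from typing import Any, Dict

# Trimmed copy of the example document above, as it might look after YAML parsing.
doc: Dict[str, Any] = {
    "grids": {
        "default": {
            "shape": [7811, 7691],
            "transform": [30, 0, 618285, 0, -30, -1642485, 0, 0, 1],
        },
        "pan": {
            "shape": [15621, 15381],
            "transform": [15, 0, 618292.5, 0, -15, -1642492.5, 0, 0, 1],
        },
    },
    "measurements": {
        "pan": {"grid": "pan", "path": "pan.tif"},
        "red": {"path": "red.tif"},
    },
}


def grid_for_measurement(doc: Dict[str, Any], band: str) -> Dict[str, Any]:
    """Return the grid entry a band uses (hypothetical helper)."""
    measurement = doc["measurements"][band]
    # Bands that omit `grid` use the mandatory "default" grid.
    grid_name = measurement.get("grid", "default")
    return doc["grids"][grid_name]


for band in doc["measurements"]:
    print(band, grid_for_measurement(doc, band)["shape"])
```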

A command-line tool to validate eo3 documents called eo3-validate is available in the eodatasets3 library, as well as optional tools to write these files more easily.

The shape and transform elements can be obtained from the output of rio info <image-file>. shape is a (height, width) tuple, and transform captures a linear mapping from pixel space to projected space, encoded in row-major order:

# transform [a0, a1, a2, a3, a4, a5, 0, 0, 1]

[X]   [a0, a1, a2] [ Pixel]
[Y] = [a3, a4, a5] [ Line ]
[1]   [ 0,  0,  1] [  1   ]
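
The matrix form of the transform can be applied directly. The Python sketch below is a minimal illustration using the "default" grid from the EO3 example; the helper names `pixel_to_world` and `grid_bounds` are hypothetical and are not part of any Open Data Cube API.

```python
from typing import Sequence, Tuple


def pixel_to_world(transform: Sequence[float], pixel: float, line: float) -> Tuple[float, float]:
    """Apply the row-major 3x3 transform to a (pixel, line) position."""
    a0, a1, a2, a3, a4, a5 = transform[:6]
    x = a0 * pixel + a1 * line + a2
    y = a3 * pixel + a4 * line + a5
    return x, y


def grid_bounds(shape: Sequence[int], transform: Sequence[float]) -> Tuple[float, float, float, float]:
    """Return (xmin, ymin, xmax, ymax) of a grid in its native CRS."""
    ny, nx = shape  # shape is (height, width)
    corners = [pixel_to_world(transform, px, ln) for px in (0, nx) for ln in (0, ny)]
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    return min(xs), min(ys), max(xs), max(ys)


# Values of the "default" grid in the EO3 example.
shape = [7811, 7691]
transform = [30, 0, 618285, 0, -30, -1642485, 0, 0, 1]

print(pixel_to_world(transform, 0, 0))  # top-left corner: (618285, -1642485)
print(grid_bounds(shape, transform))    # (618285, -1876815, 849015, -1642485)
```

Because a4 is negative, the line index increases downwards while Y decreases, which is why the top-left corner carries the largest Y value.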

3D dataset metadata

Dataset metadata documents for 3D measurements conform to the same EO3 schema as above. The example below is for a GEDI L2B cover_z dataset.

id: 9361f681-6e92-4b82-a3ca-3ed799df1116
$schema: https://schemas.opendatacube.org/dataset
product:
  name: gedi_l2b_cover_z
crs: epsg:4326
grids:
  default:
    shape:
      - 420
      - 551
    transform:
      - 0.00027778
      - 0.0
      - 149.03966950632497
      - 0.0
      - -0.00027778
      - -35.30265746130061
      - 0.0
      - 0.0
      - 1.0
measurements:
  "cover_z":
    layer: array
    path: ./GEDI02_B_2019294155401_O04856_T03859_02_001_01_cover_z.xarray_3d
properties:
  datetime: 2019-10-21 15:54:01+00:00
  dtr:end_datetime: 2019-10-21 15:54:01+00:00
  dtr:start_datetime: 2019-10-21 15:54:01+00:00
  eo:instrument: GEDI
  eo:platform: ISS
  odc:file_format: xarray_3d
  odc:processing_datetime: 2021-04-15 04:43:01.926659
lineage: {}
Note that this dataset metadata document:
  • references a 3D product definition which includes an extra_dimensions specification and an extra_dim name for the cover_z measurement (see: 3D product definition)

  • specifies a file_format which supports 3D data and for which there is a 3D enabled driver (see 3D Data Read Plug-ins).

  • describes storage properties specific to that format/driver. E.g. layer indicates the xarray variable for the driver to load (similar to the netcdf example above).

Time-stacked NetCDF files

It is possible to add NetCDF files with multiple time slices to the Open Data Cube index. The time slice index, which starts from 0, is specified by adding a #part=<int> fragment to the path of a band.

Example:

measurements:
   time_stacked_netcdf_example:
     path: file://some.nc#part=0
     layer: some_var
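
To make the fragment convention concrete, the Python sketch below splits a band path into the file location and the time slice index using only the standard library. The function name `split_time_slice` is hypothetical, and returning None when no part fragment is present is an assumption made for this illustration; it does not describe how the Open Data Cube parses paths internally.

```python
from typing import Optional, Tuple
from urllib.parse import parse_qs, urlparse


def split_time_slice(path: str) -> Tuple[str, Optional[int]]:
    """Split 'file.nc#part=N' into ('file.nc', N) (hypothetical helper)."""
    fragment = urlparse(path).fragment
    params = parse_qs(fragment)
    # Assumption: a path without a part fragment has no explicit slice index.
    part = int(params["part"][0]) if "part" in params else None
    base = path.split("#", 1)[0]
    return base, part


print(split_time_slice("file://some.nc#part=0"))
print(split_time_slice("some.nc"))
```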

EO (deprecated)

The majority of prepare scripts still generate this format, so this section is maintained for historical context.

id: a066a2ab-42f7-4e72-bc6d-a47a558b8172
creation_dt: '2016-05-04T09:06:54'
product_type: DEM
platform: {code: SRTM}
instrument: {name: SIR}
format: {name: ENVI}
extent:
  coord:
    ll: {lat: -44.000138890272005, lon: 112.99986111}
    lr: {lat: -44.000138890272005, lon: 153.99986111032797}
    ul: {lat: -10.00013889, lon: 112.99986111}
    ur: {lat: -10.00013889, lon: 153.99986111032797}
  from_dt: '2000-02-11T17:43:00'
  center_dt: '2000-02-21T11:54:00'
  to_dt: '2000-02-22T23:23:00'
grid_spatial:
  projection:
    geo_ref_points:
      ll: {x: 112.99986111, y: -44.000138890272005}
      lr: {x: 153.999861110328, y: -44.000138890272005}
      ul: {x: 112.99986111, y: -10.00013889}
      ur: {x: 153.999861110328, y: -10.00013889}
    spatial_reference: GEOGCS["GCS_WGS_1984",DATUM["WGS_1984",SPHEROID["WGS_84",6378137.0,298.257223563]],PRIMEM["Greenwich",0.0],UNIT["degree",0.0174532925199433],AUTHORITY["EPSG","4326"]]
image:
  bands:
    elevation: {path: dsm1sv1_0_Clean.img}
lineage:
  source_datasets: {}
id

UUID of the dataset

creation_dt

Creation datetime

product_type, platform/code, instrument/name

Metadata fields supported by default.

format

Format the data is stored in. For NetCDF and HDF formats it must be ‘NetCDF’ and ‘HDF’ respectively.

extent

Spatio-temporal extents of the data in EPSG:4326 (lat/lon) coordinates. Used for search in the database. Note: Take care when reprojecting the geo_ref_points bounding box to the new coordinate system. The extent should be the bounding box of the data in EPSG:4326. (Don’t just re-project the four points; the result is likely wrong.)

grid_spatial/projection
spatial_reference

Coordinate reference system the data is stored in. ‘EPSG:<code>’ or WKT string.

geo_ref_points

Spatial extents of the data in the CRS of the data.

valid_data (optional)

GeoJSON Geometry Object for the ‘data-full’ (non no-data) region of the data. Coordinates are assumed to be in the CRS of the data. Used to avoid loading useless parts of the dataset into memory. Only needs to be roughly correct. Prefer simpler geometry over accuracy.

image/bands

Dictionary of band names to band definitions.

path

Path to the file containing band data. Can be absolute or relative to the folder containing this document.

layer (optional)

Variable name if format is ‘NetCDF’ or ‘HDF’. Band number otherwise. Default is 1.

lineage

Dataset lineage metadata.

source_datasets

Dictionary of source classifier to dataset documents like this one (yay recursion!).

source_datasets:
    level1:
        id: b7d01e8c-1cd2-11e6-b546-a0000100fe80
        product_type: level1
        creation_dt: 2016-05-18 08:09:34
        platform: { code: LANDSAT_5 }
        instrument: { name: TM }
        format: { name: GeoTIFF }
        ...
algorithm (optional)

Algorithm used to generate this dataset.

algorithm:
    name: brdf
    version: '2.0'
    doi: http://dx.doi.org/10.1109/JSTARS.2010.2042281
    parameters:
        aerosol: 0.078565
machine (optional)

Machine and software used to generate this dataset.

machine:
    hostname: r2200
    uname: 'Linux r2200 2.6.32-573.22.1.el6.x86_64 #1 SMP Wed Mar 23 03:35:39 UTC 2016 x86_64'
    runtime_id: d052fcb0-1ccb-11e6-b546-a0000100fe80
    software_versions:
        eodatasets:
            repo_url: https://github.com/GeoscienceAustralia/eo-datasets.git
            version: '0.4'
ancillary (optional)

Additional data used to generate this dataset.

ancillary:
    ephemeris:
        name: L52011318DEFEPH.S00
        uri: /g/data/v10/eoancillarydata/sensor-specific/LANDSAT5/DefinitiveEphemeris/LS5_YEAR/2011/L52011318DEFEPH.S00
        access_dt: 2016-05-18 18:30:03
        modification_dt: 2011-11-15 02:10:26
        checksum_sha1: f66265314fc12e005deb356b69721a7031a71374

Issues with the EO format:

  1. Duplication of spatial information

    Extent is stored in native projection grid_spatial->projection->geo_ref_points->{ll,lr,ul,ur}->{x,y}, and then again in lon/lat: extent->coord->{ll,lr,ul,ur}->{lat,lon}

  2. Extent in lon/lat uses 4 points to encode a bounding box

    This format strongly suggests an incorrect implementation in the prepare script: simply projecting the four image corners into lon/lat.

  3. Costly lineage representation

    To record lineage one has to recursively include the entire dataset document for every input dataset. This gets expensive for summary products with thousands of input datasets.

  4. Format does not capture per band resolution/image size
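
The second issue can be illustrated with a small Python sketch. Everything in it is an assumption made for the illustration: `toy_polar_to_lonlat` is a made-up, roughly polar projection standing in for a real CRS transformation (in practice a projection library would do this step), and the box coordinates are arbitrary. The sketch compares a lon/lat bounding box built from the four corners only with one built from points sampled along the edges.

```python
import math
from typing import Callable, List, Tuple

Point = Tuple[float, float]


def toy_polar_to_lonlat(x: float, y: float) -> Point:
    """Hypothetical stand-in projection: metres from the pole to (lon, lat)."""
    lon = math.degrees(math.atan2(x, -y))
    lat = 90.0 - math.hypot(x, y) / 111_000.0  # roughly 111 km per degree
    return lon, lat


def edge_points(xmin: float, ymin: float, xmax: float, ymax: float, n: int) -> List[Point]:
    """Sample n points along each edge of the box; n=2 yields only the corners."""
    points: List[Point] = []
    for i in range(n):
        t = i / (n - 1)
        x = xmin + t * (xmax - xmin)
        y = ymin + t * (ymax - ymin)
        points += [(x, ymin), (x, ymax), (xmin, y), (xmax, y)]
    return points


def lonlat_bbox(points: List[Point], to_lonlat: Callable[[float, float], Point]) -> Tuple[float, float, float, float]:
    """Return (lon_min, lat_min, lon_max, lat_max) of the projected points."""
    lonlat = [to_lonlat(x, y) for x, y in points]
    lons = [lon for lon, _ in lonlat]
    lats = [lat for _, lat in lonlat]
    return min(lons), min(lats), max(lons), max(lats)


box = (-1e6, -2e6, 1e6, -1e6)  # arbitrary box in the toy projection, in metres
corners_only = lonlat_bbox(edge_points(*box, n=2), toy_polar_to_lonlat)
densified = lonlat_bbox(edge_points(*box, n=21), toy_polar_to_lonlat)

print("corners only:", corners_only)
print("densified:   ", densified)
```

In this toy case the edge nearest the pole bulges to a higher latitude than either of its end points, so the corners-only box misses part of the data. The sketch ignores other complications, such as boxes that cross the antimeridian.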
