Datacube Class#

class datacube.Datacube(index=None, config=None, app=None, env=None, validate_connection=True)[source]#

Interface to search, read and write a datacube.

Create the interface for the query and storage access.

If no index or config is given, the default configuration is used for the database connection.

Parameters
  • index (datacube.index.Index or None) – The database index to use.

  • config (Union[LocalConfig, str]) –

    A config object or a path to a config file that defines the connection.

    If an index is supplied, config is ignored.

  • app (str) –

    A short, alphanumeric name to identify this application.

    The application name is used to track down problems with database queries, so it is strongly advised that it be used. Required if an index is not supplied, otherwise ignored.

  • env (str) –

    Name of the datacube environment to use, i.e. the section name in any config files. Defaults to ‘datacube’ for backwards compatibility with old config files.

    Allows you to have multiple datacube instances in one configuration, specified on load, e.g. ‘dev’, ‘test’ or ‘landsat’, ‘modis’ etc.

  • validate_connection (bool) – Whether to check that the database connection is available and valid.

Returns

Datacube object
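
For example, a typical session might begin as follows (a hedged sketch: it assumes an installed `datacube` package and a reachable, configured index; the guard keeps the snippet importable where neither is present):

```python
# Sketch only: requires the `datacube` package and a configured database.
try:
    import datacube

    # `app` names this application for database query tracking.
    dc = datacube.Datacube(app="example-analysis", env="datacube")
except Exception:  # ImportError without ODC; connection errors without a DB
    dc = None

if dc is not None:
    products = dc.list_products()
```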

Methods:

close()

Close any open connections

create_storage(coords, geobox, measurements)

Create an xarray.Dataset and (optionally) fill it with data.

find_datasets(**search_terms)

Search the index and return all datasets for a product matching the search terms.

find_datasets_lazy([limit, ensure_location, ...])

Find datasets matching query.

group_datasets(datasets, group_by)

Group datasets along defined non-spatial dimensions (e.g. time).

list_measurements([show_archived, with_pandas])

List measurements for each product

list_products([with_pandas, dataset_count])

List all products in the datacube.

load([product, measurements, output_crs, ...])

Load data as an xarray.Dataset object.

load_data(sources, geobox, measurements[, ...])

Load data from group_datasets() into an xarray.Dataset.

close()[source]#

Close any open connections

static create_storage(coords, geobox, measurements, data_func=None, extra_dims=None)[source]#

Create an xarray.Dataset and (optionally) fill it with data.

This function creates the in-memory storage structure to hold datacube data.

Parameters
  • coords (dict) – OrderedDict holding DataArray objects defining the dimensions not specified by geobox

  • geobox (GeoBox) – A GeoBox defining the output spatial projection and resolution

  • measurements – list of datacube.model.Measurement

  • data_func – Callable Measurement -> np.ndarray function to fill the storage with data. It is called once for each measurement, with the measurement as an argument, and should return an appropriately shaped numpy array. If not provided, memory is allocated and filled with the nodata value defined on each Measurement.

  • extra_dims (ExtraDimensions) – An ExtraDimensions object describing any additional dimensions on top of (t, y, x)

Return type

xarray.Dataset
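
The per-measurement fill behaviour described above can be sketched in plain NumPy (FakeMeasurement and fill_measurement are illustrative stand-ins, not ODC code; real Measurement objects carry name, dtype and nodata among other fields):

```python
import numpy as np

# Illustrative stand-in for datacube.model.Measurement (assumed fields only).
class FakeMeasurement:
    def __init__(self, name, dtype, nodata):
        self.name = name
        self.dtype = dtype
        self.nodata = nodata

def fill_measurement(measurement, shape, data_func=None):
    """Sketch of create_storage's per-measurement fill logic:
    call data_func if given, otherwise allocate nodata-filled storage."""
    if data_func is not None:
        return data_func(measurement)
    return np.full(shape, measurement.nodata, dtype=measurement.dtype)

red = FakeMeasurement("red", "int16", -999)
arr = fill_measurement(red, (1, 4, 4))
```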

find_datasets(**search_terms)[source]#

Search the index and return all datasets for a product matching the search terms.

Parameters

search_terms – see datacube.api.query.Query

Returns

list of datasets

Return type

list[datacube.model.Dataset]

find_datasets_lazy(limit=None, ensure_location=False, dataset_predicate=None, **kwargs)[source]#

Find datasets matching query.

Parameters
  • kwargs – see datacube.api.query.Query

  • ensure_location – only return datasets that have locations

  • limit – if provided, limit the maximum number of datasets returned

  • dataset_predicate – an optional predicate to filter datasets

Returns

iterator of datasets

Return type

generator[datacube.model.Dataset]
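
Usage might look like this (a hedged sketch; the product name is a placeholder and a configured datacube is assumed):

```python
try:
    import datacube
    dc = datacube.Datacube(app="lazy-search-example")
except Exception:  # no ODC install or no reachable database
    dc = None

if dc is not None:
    # Stream matching datasets without materialising the full list;
    # the predicate keeps only January acquisitions.
    lazy = dc.find_datasets_lazy(
        product="ls5_nbar_albers",  # hypothetical product name
        time=("2000", "2001"),
        limit=10,
        dataset_predicate=lambda ds: ds.time.begin.month == 1,
    )
    for dataset in lazy:
        print(dataset.id)
```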

static group_datasets(datasets, group_by)[source]#

Group datasets along defined non-spatial dimensions (e.g. time).

Parameters
  • datasets – a list of datasets, typically from find_datasets()

  • group_by (GroupBy) – Contains: a function that returns a label for a dataset; the name of the new dimension; the unit for the new dimension; and a function to sort by before grouping.

Return type

xarray.DataArray

See also

find_datasets(), load_data(), query_group_by()

list_measurements(show_archived=False, with_pandas=True)[source]#

List measurements for each product

Parameters
  • show_archived – include products that have been archived.

  • with_pandas – return the list as a Pandas DataFrame, otherwise as a list of dict.

Return type

pandas.DataFrame or list(dict)

list_products(with_pandas=True, dataset_count=False)[source]#

List all products in the datacube. This will produce a pandas.DataFrame or list of dicts containing useful information about each product, including:

‘name’, ‘description’, ‘license’, ‘default_crs’, ‘default_resolution’, ‘dataset_count’ (optional)

Parameters
  • with_pandas (bool) – Return the list as a Pandas DataFrame. If False, return a list of dicts.

  • dataset_count (bool) – Return a “dataset_count” column containing the number of datasets for each product. This can take several minutes on large datacubes. Defaults to False.

Returns

A table or list of every product in the datacube.

Return type

pandas.DataFrame or list(dict)

load(product=None, measurements=None, output_crs=None, resolution=None, resampling=None, skip_broken_datasets=False, dask_chunks=None, like=None, fuse_func=None, align=None, datasets=None, dataset_predicate=None, progress_cbk=None, **query)[source]#

Load data as an xarray.Dataset object. Each measurement will be a data variable in the xarray.Dataset.

See the xarray documentation for usage of the xarray.Dataset and xarray.DataArray objects.

Product and Measurements

A product can be specified using the product name, or by search fields that uniquely describe a single product.

product='ls5_ndvi_albers'

See list_products() for the list of products with their names and properties.

A product can also be selected by searching using fields, but must only match one product. For example:

platform='LANDSAT_5',
product_type='ndvi'

The measurements argument is a list of measurement names, as listed in list_measurements(). If not provided, all measurements for the product will be returned.

measurements=['red', 'nir', 'swir2']
Dimensions

Spatial dimensions can be specified using the longitude/latitude and x/y fields.

The CRS of this query is assumed to be WGS84/EPSG:4326 unless the crs field is supplied, even if the stored data is in another projection or the output_crs is specified. The dimensions longitude/latitude and x/y can be used interchangeably.

latitude=(-34.5, -35.2), longitude=(148.3, 148.7)

or

x=(1516200, 1541300), y=(-3867375, -3867350), crs='EPSG:3577'

The time dimension can be specified using a tuple of datetime objects or strings with YYYY-MM-DD hh:mm:ss format. Data will be loaded inclusive of the start and finish times. E.g:

time=('2000-01-01', '2001-12-31')
time=('2000-01', '2001-12')
time=('2000', '2001')

For 3D datasets, where the product definition contains an extra_dimension specification, these dimensions can be queried using that dimension’s name. E.g.:

z=(10, 30)

or

z=5

or

wvl=(560.3, 820.5)

For EO-specific datasets that are based around scenes, the time dimension can be reduced to the day level, using solar day to keep scenes together.

group_by='solar_day'

For data that has different values in the scene overlap and requires more complex rules for combining, a function can be provided to perform the merging into a single time slice.

See datacube.helpers.ga_pq_fuser() for an example implementation.

Output

To reproject or resample data, supply the output_crs, resolution, resampling and align fields.

By default, the resampling method is ‘nearest’. However, any stored overview layers may be used when down-sampling, which may override (or hybridise) the choice of resampling method.

To reproject data to 30 m resolution for EPSG:3577:

dc.load(product='ls5_nbar_albers',
        x=(148.15, 148.2),
        y=(-35.15, -35.2),
        time=('1990', '1991'),
        output_crs='EPSG:3577',
        resolution=(-30, 30),
        resampling='cubic'
)
Parameters
  • product (str) – The product to be loaded.

  • measurements (list(str)) –

    Measurement name or list of names to be included, as listed in list_measurements(). These will be loaded as individual xr.DataArray variables in the output xarray.Dataset object.

    If a list is specified, the measurements will be returned in the order requested. By default all available measurements are included.

  • **query – Search parameters for products and dimension ranges as described above. For example: 'x', 'y', 'time', 'crs'.

  • output_crs (str) –

    The CRS of the returned data, for example EPSG:3577. If no CRS is supplied, the CRS of the stored data is used if available.

    This differs from the crs parameter described above, which is used to define the CRS of the coordinates in the query itself.

  • resolution ((float,float)) –

    A tuple of the spatial resolution of the returned data. Units are in the coordinate space of output_crs.

    This includes the direction (as indicated by a positive or negative number). For most CRSs, the first number will be negative, e.g. (-30, 30).

  • resampling (str|dict) –

    The resampling method to use if re-projection is required. This could be a string or a dictionary mapping band name to resampling mode. When using a dict use '*' to indicate “apply to all other bands”, for example {'*': 'cubic', 'fmask': 'nearest'} would use cubic for all bands except fmask for which nearest will be used.

    Valid values are:

    'nearest', 'average', 'bilinear', 'cubic', 'cubic_spline',
    'lanczos', 'mode', 'gauss',  'max', 'min', 'med', 'q1', 'q3'
    

    Default is to use nearest for all bands.

    See also

    load_data()

  • align ((float,float)) –

    Load data such that point ‘align’ lies on the pixel boundary. Units are in the coordinate space of output_crs.

    Default is (0, 0)

  • dask_chunks (dict) –

    If the data should be lazily loaded using dask.array.Array, specify the chunking size in each output dimension.

    See the documentation on using xarray with dask for more information.

  • like (xarray.Dataset) –

    Use the output of a previous load() to load data into the same spatial grid and resolution (i.e. datacube.utils.geometry.GeoBox). E.g.:

    pq = dc.load(product='ls5_pq_albers', like=nbar_dataset)
    

  • group_by (str) – When specified, perform basic combining/reducing of the data. For example, group_by='solar_day' can be used to combine consecutive observations along a single satellite overpass into a single time slice.

  • fuse_func – Function used to fuse/combine/reduce data with the group_by parameter. By default, data is simply copied over the top of each other in a relatively undefined manner. This function can perform a specific combining step. This can be a dictionary if different fusers are needed per band.

  • datasets – Optional. If this is a non-empty list of datacube.model.Dataset objects, these will be loaded instead of performing a database lookup.

  • skip_broken_datasets (bool) – Optional. If this is True, then don’t break when failing to load a broken dataset. Default is False.

  • dataset_predicate (function) –

    Optional. A function that can be passed to restrict loaded datasets. A predicate function should take a datacube.model.Dataset object (e.g. as returned from find_datasets()) and return a boolean. For example, loaded data could be filtered to January observations only by passing the following predicate function that returns True for datasets acquired in January:

    def filter_jan(dataset):
        return dataset.time.begin.month == 1
    

  • limit (int) – Optional. If provided, limit the maximum number of datasets returned. Useful for testing and debugging.

  • progress_cbk (Int, Int -> None) – If supplied, will be called for every file read with files_processed_so_far, total_files. This is only applicable to non-lazy loads and is ignored when using dask.
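
The '*' wildcard semantics for per-band resampling described above can be sketched as follows (a hypothetical helper illustrating the lookup rule, not ODC's implementation):

```python
def resolve_resampling(resampling, band, default="nearest"):
    """Pick the resampling mode for one band: an explicit entry wins,
    then the '*' wildcard, then the 'nearest' default."""
    if resampling is None:
        return default
    if isinstance(resampling, str):
        return resampling
    return resampling.get(band, resampling.get("*", default))

modes = {"*": "cubic", "fmask": "nearest"}
```

Here resolve_resampling(modes, "red") yields "cubic", while "fmask" keeps "nearest".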

Returns

Requested data in a xarray.Dataset

Return type

xarray.Dataset
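
To load lazily with dask, the same call gains a dask_chunks argument (a hedged sketch; the product and measurement names are placeholders and a configured datacube is assumed):

```python
try:
    import datacube
    dc = datacube.Datacube(app="lazy-load-example")
except Exception:  # no ODC install or no reachable database
    dc = None

if dc is not None:
    # Returns dask-backed variables; nothing is read until .compute().
    lazy = dc.load(
        product="ls5_nbar_albers",        # hypothetical product
        measurements=["red", "nir"],
        x=(148.15, 148.2), y=(-35.15, -35.2),
        time=("1990", "1991"),
        output_crs="EPSG:3577",
        resolution=(-30, 30),
        dask_chunks={"time": 1, "x": 2048, "y": 2048},
    )
    data = lazy.compute()  # materialise into memory
```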

static load_data(sources, geobox, measurements, resampling=None, fuse_func=None, dask_chunks=None, skip_broken_datasets=False, progress_cbk=None, extra_dims=None, **extra)[source]#

Load data from group_datasets() into an xarray.Dataset.

Parameters
  • sources (xarray.DataArray) – DataArray holding a list of datacube.model.Dataset, grouped along the time dimension

  • geobox (GeoBox) – A GeoBox defining the output spatial projection and resolution

  • measurements – list of Measurement objects

  • resampling (str|dict) –

    The resampling method to use if re-projection is required. This could be a string or a dictionary mapping band name to resampling mode. When using a dict use '*' to indicate “apply to all other bands”, for example {'*': 'cubic', 'fmask': 'nearest'} would use cubic for all bands except fmask for which nearest will be used.

    Valid values are: 'nearest', 'cubic', 'bilinear', 'cubic_spline', 'lanczos', 'average', 'mode', 'gauss',  'max', 'min', 'med', 'q1', 'q3'

    Default is to use nearest for all bands.

  • fuse_func – function used to merge successive arrays into a single output. Can be a dictionary keyed by band name, just like resampling.

  • dask_chunks (dict) –

    If provided, the data will be loaded on demand using dask.array.Array. Should be a dictionary specifying the chunking size for each output dimension. Unspecified dimensions will be auto-guessed: currently this means a chunk size of 1 for non-spatial dimensions and the whole dimension (no chunking unless specified) for spatial dimensions.

    See the documentation on using xarray with dask for more information.

  • progress_cbk – Int, Int -> None; if supplied, will be called for every file read with files_processed_so_far, total_files. This is only applicable to non-lazy loads and is ignored when using dask.

  • extra_dims (ExtraDimensions) – An ExtraDimensions object describing any additional dimensions on top of (t, y, x)

Return type

xarray.Dataset
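
The static methods compose into roughly the pipeline that load() runs internally (a hedged sketch: the product name is a placeholder, the GeoBox construction assumes the pre-odc-geo datacube.utils.geometry module, and a configured datacube is required):

```python
try:
    import datacube
    from datacube.api.query import query_group_by
    from datacube.utils.geometry import CRS, GeoBox
    dc = datacube.Datacube(app="lowlevel-example")
except Exception:  # no ODC install or no reachable database
    dc = None

if dc is not None:
    product = dc.index.products.get_by_name("ls5_nbar_albers")  # hypothetical
    datasets = dc.find_datasets(product=product.name, time=("2000", "2001"))

    # Group along time, combining scenes by solar day.
    grouped = dc.group_datasets(datasets, query_group_by(group_by="solar_day"))

    # Build an output GeoBox from the first dataset's footprint (illustrative;
    # load() normally derives this from the query and product definition).
    footprint = datasets[0].extent.to_crs(CRS("EPSG:3577"))
    geobox = GeoBox.from_geopolygon(footprint, resolution=(-30, 30))

    data = dc.load_data(grouped, geobox, product.measurements.values())
```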