Content from Package management


Last updated on 2024-07-25

Estimated time: 30 minutes

Overview

Questions

  • What are the main Python libraries used in atmosphere and ocean science?
  • How do I install and manage all the Python libraries that I want to use?
  • How do I interact with Python?

Objectives

  • Identify the main Python libraries used in atmosphere and ocean science and the relationships between them.
  • Explain the advantages of Anaconda over other Python distributions.
  • Extend the number of packages available via conda using conda-forge.
  • Create a conda environment with the libraries needed for these lessons.
  • Open a Jupyter Notebook ready for use in these lessons.

The PyAOS stack


Before we jump in and start analysing our netCDF precipitation data files, we need to consider what Python libraries are best suited to the task.

For reading, writing and analysing data stored in the netCDF file format, atmosphere and ocean scientists will typically do most of their work with either the xarray or iris libraries. These libraries are built on top of more generic data science libraries like numpy and matplotlib, to make the types of analysis we do faster and more efficient. To learn more about the PyAOS “stack” shown in the diagram below (i.e. the collection of libraries that are typically used for data analysis and visualisation in the atmosphere and ocean sciences), check out the overview of the PyAOS stack at the PyAOS community site.

PyAOS stack

Python distributions for data science


Now that we’ve identified the Python libraries we might want to use, how do we go about installing them?

Our first impulse might be to use the Python package installer (pip), but it really only works for libraries written in pure Python. This is a major limitation for the data science community, because many scientific Python libraries have C and/or Fortran dependencies. To spare people the pain of installing these dependencies, a number of scientific Python “distributions” have been released over the years. These come with the most popular data science libraries and their dependencies pre-installed, and some also come with a package manager to assist with installing additional libraries that weren’t pre-installed. Today the most popular distribution for data science is Anaconda, which comes with a package (and environment) manager called conda.

Introducing conda


According to the latest documentation, Anaconda comes with over 300 of the most widely used data science libraries (and their dependencies) pre-installed. In addition, there are several thousand more libraries available via the Anaconda Public Repository, which can be installed by running the conda install command at the Bash Shell or Anaconda Prompt (Windows only). It is also possible to install packages using the Anaconda Navigator graphical user interface.

conda in the shell on Windows

If you’re on a Windows machine and the conda command isn’t available at the Bash Shell, you’ll need to open the Anaconda Prompt program (via the Windows start menu) and run the command conda init bash (this only needs to be done once). After that, your Bash Shell will be configured to use conda going forward.
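
For example, at the Anaconda Prompt:

BASH

$ conda init bash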

For instance, the popular xarray library could be installed using the following command,

BASH

$ conda install xarray

(Use conda search -f {package_name} to find out if a package you want is available.)

OR using Anaconda Navigator:

Anaconda Navigator xarray search

Miniconda

If you don’t want to install the entire Anaconda distribution, you can install Miniconda instead. It essentially comes with conda and nothing else.

Advanced conda


For a relatively small/niche field of research like atmosphere and ocean science, one of the most important features that Anaconda provides is the Anaconda Cloud website, where the community can contribute conda installation packages. This is critical because many of our libraries have a small user base, which means they’ll never make it into the Anaconda Public Repository.

You can search Anaconda Cloud to find the command needed to install the package. For instance, here is the search result for the iris package:

Iris search on Anaconda Cloud

As you can see, there are often multiple versions of the same package up on Anaconda Cloud. To address this duplication problem, conda-forge was launched; it aims to be a central repository that contains just a single (working) version of each package on Anaconda Cloud. You can therefore expand the selection of packages available via conda install beyond the chosen few thousand by adding the conda-forge channel:

BASH

$ conda config --add channels conda-forge

OR

Anaconda Navigator conda-forge

We recommend not adding any other third-party channels unless absolutely necessary, because mixing packages from multiple channels can cause headaches like binary incompatibilities.
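
If you're ever unsure which channels are currently configured, you can list them (a quick sanity check rather than a required step):

BASH

$ conda config --show channels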

Software installation for these lessons


For these particular lessons we will use xarray, but all the same tasks could be performed with iris. We’ll also install dask (xarray uses this for parallel processing), netCDF4 (xarray requires this to read netCDF files), cartopy (to help with geographic plot projections), cmocean (for nice color palettes) and cmdline_provenance (to keep track of our data processing steps). We don’t need to worry about installing jupyter (we will be using the Jupyter Notebook) because it already comes pre-installed with Anaconda.

We could install these libraries from Anaconda Navigator (not shown) or using the Bash Shell or Anaconda Prompt (Windows):

BASH

$ conda install xarray dask netCDF4 cartopy cmocean cmdline_provenance

If we then list all the libraries that we’ve got installed, we can see that jupyter, dask, xarray, netCDF4, cartopy, cmocean, cmdline_provenance and their dependencies are now there:

BASH

$ conda list

(This list can also be viewed in the environments tab of the Navigator.)

Creating separate environments

If you’ve got multiple data science projects on the go, installing all your packages in the same conda environment can get a little messy. (By default they are installed in the root/base environment.) It’s therefore common practice to create separate conda environments for the various projects you’re working on.

For instance, we could create an environment called pyaos-lesson for this lesson. The process of creating a new environment can be managed in the environments tab of the Anaconda Navigator or via the following Bash Shell / Anaconda Prompt commands:

BASH

$ conda create -n pyaos-lesson jupyter xarray dask netCDF4 cartopy cmocean cmdline_provenance
$ conda activate pyaos-lesson

(it’s conda deactivate to exit)

Notice that in this case we had to include jupyter in the list of packages to install. When you create a brand new conda environment, it doesn’t automatically come with the pre-installed packages that are in the base environment.
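
If you do forget a package when creating an environment, you can always add it later. For instance, something like the following would install jupyter into the existing pyaos-lesson environment:

BASH

$ conda install -n pyaos-lesson jupyter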

You can have lots of different environments,

BASH

$ conda info --envs

OUTPUT

# conda environments:
#
base                  *  /anaconda3
pyaos-lesson             /anaconda3/envs/pyaos-lesson
test                     /anaconda3/envs/test

the details of which can be exported to a YAML configuration file:

BASH

$ conda env export -n pyaos-lesson -f pyaos-lesson.yml
$ cat pyaos-lesson.yml

OUTPUT

name: pyaos-lesson
channels:
  - conda-forge
  - defaults
dependencies:
  - cartopy=0.16.0=py36h81b52dc_1
  - certifi=2018.4.16=py36_0
  - cftime=1.0.1=py36h7eb728f_0
  - ...

Other people (or you on a different computer) can then re-create that exact environment using the YAML file:

BASH

$ conda env create -f pyaos-lesson.yml

The ease with which others can recreate your environment (on any operating system) is a huge breakthrough for reproducible research.

To delete the environment:

BASH

$ conda env remove -n pyaos-lesson

Interacting with Python


Now that we know which Python libraries we want to use and how to install them, we need to decide how we want to interact with Python.

The simplest way to use Python is to type code directly into the interpreter, which can be accessed from the Bash Shell:

$ python
Python 3.7.1 (default, Dec 14 2018, 13:28:58)
[Clang 4.0.1 (tags/RELEASE_401/final)] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> print("hello world")
hello world
>>> exit()
$

The >>> prompt indicates that you are now talking to the Python interpreter.

A more powerful alternative to the default Python interpreter is IPython (Interactive Python). The online documentation outlines all the special features that come with IPython, but as an example, it lets you execute bash shell commands without having to exit the IPython interpreter:

$ ipython
Python 3.7.1 (default, Dec 14 2018, 13:28:58)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.2.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: print("hello world")
hello world

In [2]: ls
data/                              script_template.py
plot_precipitation_climatology.py

In [3]: exit
$ 

(The IPython interpreter can also be accessed via the Anaconda Navigator by running the QtConsole.)

While entering commands to the Python or IPython interpreter line-by-line is great for quickly testing something, it’s clearly impractical for developing longer bodies of code and/or interactively exploring data. As such, Python users tend to do most of their code development and data exploration using either an Integrated Development Environment (IDE) or Jupyter Notebook:

  • Two of the most common IDEs are Spyder and PyCharm (the former comes with Anaconda); both will look very familiar to anyone who has used MATLAB or RStudio.
  • Jupyter Notebooks run in your web browser and allow users to create and share documents that contain live code, equations, visualizations and narrative text.

We are going to use the Jupyter Notebook to explore our precipitation data (and the plotting functionality of xarray) in the next few lessons. A notebook can be launched from our data-carpentry directory using the Bash Shell:

BASH

$ cd ~/Desktop/data-carpentry
$ jupyter notebook &

(The & allows you to come back and use the bash shell without closing your notebook first.)

Alternatively, you can launch Jupyter Notebook from the Anaconda Navigator and navigate to the data-carpentry directory before creating a new Python 3 notebook:

Launch Jupyter notebook

JupyterLab

If you like Jupyter Notebooks you might want to try JupyterLab, which combines the Jupyter Notebook with many of the features common to an IDE.

Install the Python libraries required for this lesson

If you haven’t already done so, go ahead and install the xarray, dask, netCDF4, cartopy, cmocean and cmdline_provenance packages using either the Anaconda Navigator or Bash Shell.

Remember that you’ll need to add the conda-forge channel first.

(You may like to create a separate pyaos-lesson conda environment, but this is not necessary to complete the lessons. If you create a new environment rather than using the base environment, you’ll need to install jupyter too.)

The software installation instructions explain how to install the Python libraries using the Bash Shell or Anaconda Navigator.

Launch a Jupyter Notebook

In preparation for the next lesson, open a new Jupyter Notebook in your data-carpentry directory by entering jupyter notebook & at the Bash Shell or by clicking the Jupyter Notebook launch button in the Anaconda Navigator.

If you use the Navigator, the Jupyter interface will open in a new tab of your default web browser. Use that interface to navigate to the data-carpentry directory that you created specifically for these lessons before clicking to create a new Python 3 notebook:

Launch Jupyter notebook

Once your notebook is open, import xarray, cartopy, matplotlib and numpy using the following Python commands:

PYTHON

import xarray as xr
import cartopy.crs as ccrs
import matplotlib.pyplot as plt
import numpy as np

(Hint: Hold down the shift and return keys to execute a code cell in a Jupyter Notebook.)

Key Points

  • xarray and iris are the core Python libraries used in the atmosphere and ocean sciences.
  • Use conda to install and manage your Python environments.

Content from Data processing and visualisation


Last updated on 2024-07-25

Estimated time: 60 minutes

Overview

Questions

  • How can I create a quick plot of my CMIP data?

Objectives

  • Import the xarray library and use the functions it contains.
  • Convert precipitation units to mm/day.
  • Calculate and plot the precipitation climatology.
  • Use the cmocean library to find colormaps designed for ocean science.

As a first step towards making a visual comparison of the ACCESS-CM2 and ACCESS-ESM1-5 historical precipitation climatology, we are going to create a quick plot of the ACCESS-CM2 data.

PYTHON

accesscm2_pr_file = "data/pr_Amon_ACCESS-CM2_historical_r1i1p1f1_gn_201001-201412.nc"

We will need a number of the libraries introduced in the previous lesson.

PYTHON

import xarray as xr
import cartopy.crs as ccrs
import matplotlib.pyplot as plt
import numpy as np

Since geographic data files can often be very large, when we first open our data file in xarray it simply loads the metadata associated with the file (this is known as “lazy loading”). We can then view summary information about the contents of the file before deciding whether we’d like to load some or all of the data into memory.

PYTHON

ds = xr.open_dataset(accesscm2_pr_file)
print(ds)

OUTPUT

<xarray.Dataset>
Dimensions:    (bnds: 2, lat: 144, lon: 192, time: 60)
Coordinates:
  * time       (time) datetime64[ns] 2010-01-16T12:00:00 ... 2014-12-16T12:00:00
  * lon        (lon) float64 0.9375 2.812 4.688 6.562 ... 355.3 357.2 359.1
  * lat        (lat) float64 -89.38 -88.12 -86.88 -85.62 ... 86.88 88.12 89.38
Dimensions without coordinates: bnds
Data variables:
    time_bnds  (time, bnds) datetime64[ns] ...
    lon_bnds   (lon, bnds) float64 ...
    lat_bnds   (lat, bnds) float64 ...
    pr         (time, lat, lon) float32 ...
Attributes:
    CDI:                    Climate Data Interface version 1.9.8 (https://mpi...
    source:                 ACCESS-CM2 (2019): \naerosol: UKCA-GLOMAP-mode\na...
    institution:            CSIRO (Commonwealth Scientific and Industrial Res...
    Conventions:            CF-1.7 CMIP-6.2
    activity_id:            CMIP
    branch_method:          standard
    branch_time_in_child:   0.0
    branch_time_in_parent:  0.0
    creation_date:          2019-11-08T08:26:37Z
    data_specs_version:     01.00.30
    experiment:             all-forcing simulation of the recent past
    experiment_id:          historical
    external_variables:     areacella
    forcing_index:          1
    frequency:              mon
    further_info_url:       https://furtherinfo.es-doc.org/CMIP6.CSIRO-ARCCSS...
    grid:                   native atmosphere N96 grid (144x192 latxlon)
    grid_label:             gn
    initialization_index:   1
    institution_id:         CSIRO-ARCCSS
    mip_era:                CMIP6
    nominal_resolution:     250 km
    notes:                  Exp: CM2-historical; Local ID: bj594; Variable: p...
    parent_activity_id:     CMIP
    parent_experiment_id:   piControl
    parent_mip_era:         CMIP6
    parent_source_id:       ACCESS-CM2
    parent_time_units:      days since 0950-01-01
    parent_variant_label:   r1i1p1f1
    physics_index:          1
    product:                model-output
    realization_index:      1
    realm:                  atmos
    run_variant:            forcing: GHG, Oz, SA, Sl, Vl, BC, OC, (GHG = CO2,...
    source_id:              ACCESS-CM2
    source_type:            AOGCM
    sub_experiment:         none
    sub_experiment_id:      none
    table_id:               Amon
    table_info:             Creation Date:(30 April 2019) MD5:e14f55f257cceaf...
    title:                  ACCESS-CM2 output prepared for CMIP6
    variable_id:            pr
    variant_label:          r1i1p1f1
    version:                v20191108
    cmor_version:           3.4.0
    tracking_id:            hdl:21.14100/b4dd0f13-6073-4d10-b4e6-7d7a4401e37d
    license:                CMIP6 model data produced by CSIRO is licensed un...
    CDO:                    Climate Data Operators version 1.9.8 (https://mpi...
    history:                Tue Jan 12 14:50:25 2021: ncatted -O -a history,p...
    NCO:                    netCDF Operators version 4.9.2 (Homepage = http:/...

We can see that our ds object is an xarray.Dataset, which when printed shows all the metadata associated with our netCDF data file.

In this case, we are interested in the precipitation variable contained within that xarray Dataset:

PYTHON

print(ds["pr"])

OUTPUT

<xarray.DataArray 'pr' (time: 60, lat: 144, lon: 192)>
[1658880 values with dtype=float32]
Coordinates:
  * time     (time) datetime64[ns] 2010-01-16T12:00:00 ... 2014-12-16T12:00:00
  * lon      (lon) float64 0.9375 2.812 4.688 6.562 ... 353.4 355.3 357.2 359.1
  * lat      (lat) float64 -89.38 -88.12 -86.88 -85.62 ... 86.88 88.12 89.38
Attributes:
    standard_name:  precipitation_flux
    long_name:      Precipitation
    units:          kg m-2 s-1
    comment:        includes both liquid and solid phases
    cell_methods:   area: time: mean
    cell_measures:  area: areacella

We can actually use either the ds["pr"] or ds.pr syntax to access the precipitation xarray.DataArray.
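
For example, the variable attributes listed in the summary above can be accessed directly from the DataArray using either syntax (assuming the ds object from the previous code block):

PYTHON

print(ds["pr"].units)
print(ds.pr.long_name)

OUTPUT

kg m-2 s-1
Precipitation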

To calculate the precipitation climatology, we can make use of the fact that xarray DataArrays have built-in functionality for averaging over their dimensions.

PYTHON

clim = ds["pr"].mean("time", keep_attrs=True)
print(clim)

OUTPUT

<xarray.DataArray 'pr' (lat: 144, lon: 192)>
array([[1.8461452e-06, 1.9054805e-06, 1.9228980e-06, ..., 1.9869783e-06,
        2.0026005e-06, 1.9683730e-06],
       [1.9064508e-06, 1.9021350e-06, 1.8931637e-06, ..., 1.9433096e-06,
        1.9182237e-06, 1.9072245e-06],
       [2.1003202e-06, 2.0477617e-06, 2.0348527e-06, ..., 2.2391034e-06,
        2.1970161e-06, 2.1641599e-06],
       ...,
       [7.5109556e-06, 7.4777777e-06, 7.4689174e-06, ..., 7.3359679e-06,
        7.3987890e-06, 7.3978440e-06],
       [7.1837171e-06, 7.1722038e-06, 7.1926393e-06, ..., 7.1552149e-06,
        7.1576678e-06, 7.1592167e-06],
       [7.0353467e-06, 7.0403985e-06, 7.0326828e-06, ..., 7.0392648e-06,
        7.0387587e-06, 7.0304386e-06]], dtype=float32)
Coordinates:
  * lon      (lon) float64 0.9375 2.812 4.688 6.562 ... 353.4 355.3 357.2 359.1
  * lat      (lat) float64 -89.38 -88.12 -86.88 -85.62 ... 86.88 88.12 89.38
Attributes:
    standard_name:  precipitation_flux
    long_name:      Precipitation
    units:          kg m-2 s-1
    comment:        includes both liquid and solid phases
    cell_methods:   area: time: mean
    cell_measures:  area: areacella

Now that we’ve calculated the climatology, we want to convert the units from kg m-2 s-1 to something that we are a little more familiar with like mm day-1.

To do this, consider that 1 kg of rain water spread over 1 m2 of surface is 1 mm in thickness and that there are 86400 seconds in one day. Therefore, 1 kg m-2 s-1 = 86400 mm day-1.
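
In other words, the conversion factor is simply the number of seconds in a day:

PYTHON

seconds_per_day = 60 * 60 * 24  # 86400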

The data associated with our xarray DataArray is simply a numpy array,

PYTHON

type(clim.data)

OUTPUT

numpy.ndarray

so we can go ahead and multiply that array by 86400 and update the units attribute accordingly:

PYTHON

clim.data = clim.data * 86400
clim.attrs["units"] = "mm/day"

print(clim)

OUTPUT

<xarray.DataArray 'pr' (lat: 144, lon: 192)>
array([[0.15950695, 0.16463352, 0.16613839, ..., 0.17167493, 0.17302468,
        0.17006743],
       [0.16471735, 0.16434446, 0.16356934, ..., 0.16790195, 0.16573453,
        0.1647842 ],
       [0.18146767, 0.17692661, 0.17581128, ..., 0.19345854, 0.18982219,
        0.18698342],
       ...,
       [0.64894656, 0.64607999, 0.64531446, ..., 0.63382763, 0.63925537,
        0.63917372],
       [0.62067316, 0.61967841, 0.62144403, ..., 0.61821057, 0.6184225 ,
        0.61855632],
       [0.60785395, 0.60829043, 0.60762379, ..., 0.60819248, 0.60814875,
        0.6074299 ]])
Coordinates:
  * lon      (lon) float64 0.9375 2.812 4.688 6.562 ... 353.4 355.3 357.2 359.1
  * lat      (lat) float64 -89.38 -88.12 -86.88 -85.62 ... 86.88 88.12 89.38
Attributes:
    standard_name:  precipitation_flux
    long_name:      Precipitation
    units:          mm/day
    comment:        includes both liquid and solid phases
    cell_methods:   area: time: mean
    cell_measures:  area: areacella

We could now go ahead and plot our climatology using matplotlib, but it would take many lines of code to extract all the latitude and longitude information and to set up all the plot characteristics. Recognising this burden, the xarray developers have built on top of matplotlib.pyplot to make the visualisation of xarray DataArrays much easier.

PYTHON

fig = plt.figure(figsize=[12,5])

ax = fig.add_subplot(111, projection=ccrs.PlateCarree(central_longitude=180))

clim.plot.contourf(
    ax=ax,
    levels=np.arange(0, 13.5, 1.5),
    extend="max",
    transform=ccrs.PlateCarree(),
    cbar_kwargs={"label": clim.units}
)
ax.coastlines()

plt.show()
Precipitation climatology

The default colormap used by matplotlib is viridis. It used to be jet, but that was changed in response to the #endtherainbow campaign.

Putting all the code together (and reversing viridis so that wet is purple and dry is yellow)…

PYTHON

import xarray as xr
import cartopy.crs as ccrs
import matplotlib.pyplot as plt
import numpy as np


accesscm2_pr_file = "data/pr_Amon_ACCESS-CM2_historical_r1i1p1f1_gn_201001-201412.nc"

ds = xr.open_dataset(accesscm2_pr_file)

clim = ds["pr"].mean("time", keep_attrs=True)

clim.data = clim.data * 86400
clim.attrs["units"] = "mm/day"

fig = plt.figure(figsize=[12,5])
ax = fig.add_subplot(111, projection=ccrs.PlateCarree(central_longitude=180))
clim.plot.contourf(
    ax=ax,
    levels=np.arange(0, 13.5, 1.5),
    extend="max",
    transform=ccrs.PlateCarree(),
    cbar_kwargs={"label": clim.units},
    cmap="viridis_r",
)
ax.coastlines()
plt.show()
Precipitation climatology

Color palette

Copy and paste the final slab of code above into your own Jupyter notebook.

The viridis color palette doesn’t seem quite right for rainfall. Change it to the “haline” cmocean palette used for ocean salinity data.

PYTHON

import cmocean
 
...
clim.plot.contourf(
    ax=ax,
    ...
    cmap=cmocean.cm.haline_r,
)

Season selection

Rather than plot the annual climatology, edit the code so that it plots the June-August (JJA) season.

(Hint: the groupby functionality can be used to group all the data into seasons prior to averaging over the time axis)

PYTHON

clim = ds["pr"].groupby("time.season").mean("time", keep_attrs=True) 
clim.data = clim.data * 86400
clim.attrs['units'] = "mm/day"

clim.sel(season="JJA").plot.contourf(
    ax=ax,
    ...
)

Add a title

Add a title to the plot which gives the name of the model (taken from the ds attributes) followed by the words “precipitation climatology (JJA)”

PYTHON

model = ds.attrs["source_id"]
title = f"{model} precipitation climatology (JJA)"
plt.title(title)

Key Points

  • Libraries such as xarray can make loading, processing and visualising netCDF data much easier.
  • The cmocean library contains colormaps custom made for the ocean sciences.

Content from Functions


Last updated on 2024-07-25

Estimated time: 40 minutes

Overview

Questions

  • How can I define my own functions?

Objectives

  • Define a function that takes parameters.
  • Use docstrings to document functions.
  • Break our existing plotting code into a series of small, single-purpose functions.

In the previous lesson we created a plot of the ACCESS-CM2 historical precipitation climatology using the following commands:

PYTHON

import xarray as xr
import cartopy.crs as ccrs
import matplotlib.pyplot as plt
import numpy as np
import cmocean


accesscm2_pr_file = "data/pr_Amon_ACCESS-CM2_historical_r1i1p1f1_gn_201001-201412.nc"

ds = xr.open_dataset(accesscm2_pr_file)

clim = ds["pr"].groupby("time.season").mean("time", keep_attrs=True)

clim.data = clim.data * 86400
clim.attrs['units'] = "mm/day"

fig = plt.figure(figsize=[12,5])
ax = fig.add_subplot(111, projection=ccrs.PlateCarree(central_longitude=180))
clim.sel(season="JJA").plot.contourf(
    ax=ax,
    levels=np.arange(0, 13.5, 1.5),
    extend="max",
    transform=ccrs.PlateCarree(),
    cbar_kwargs={"label": clim.units},
    cmap=cmocean.cm.haline_r,
)
ax.coastlines()

model = ds.attrs["source_id"]
title = f"{model} precipitation climatology (JJA)"
plt.title(title)

plt.show()
Precipitation climatology

If we wanted to create a similar plot for a different model and/or different season, we could cut and paste the code and edit accordingly. The problem with that (common) approach is that it increases the chances of making a mistake. If we manually updated the season to ‘DJF’ for the clim.sel(season= command but forgot to update it when calling plt.title, for instance, we’d have a mismatch between the data and title.

The cut and paste approach is also much more time consuming. If we think of a better way to create this plot in future (e.g. we might want to add gridlines using plt.gca().gridlines()), then we have to find and update every copy and pasted instance of the code.

A better approach is to put the code in a function. The code itself then remains untouched, and we simply call the function with different input arguments.

PYTHON

def plot_pr_climatology(pr_file, season, gridlines=False):
    """Plot the precipitation climatology.
    
    Args:
      pr_file (str): Precipitation data file
      season (str): Season (3 letter abbreviation, e.g. JJA)
      gridlines (bool): Select whether to plot gridlines
    
    """

    ds = xr.open_dataset(pr_file)

    clim = ds["pr"].groupby("time.season").mean("time", keep_attrs=True)

    clim.data = clim.data * 86400
    clim.attrs["units"] = "mm/day"

    fig = plt.figure(figsize=[12,5])
    ax = fig.add_subplot(111, projection=ccrs.PlateCarree(central_longitude=180))
    clim.sel(season=season).plot.contourf(
        ax=ax,
        levels=np.arange(0, 13.5, 1.5),
        extend="max",
        transform=ccrs.PlateCarree(),
        cbar_kwargs={"label": clim.units},
        cmap=cmocean.cm.haline_r,
    )
    ax.coastlines()
    if gridlines:
        plt.gca().gridlines()
    
    model = ds.attrs["source_id"]
    title = f"{model} precipitation climatology ({season})"
    plt.title(title)

The docstring allows us to have good documentation for our function:

PYTHON

help(plot_pr_climatology)

OUTPUT

Help on function plot_pr_climatology in module __main__:

plot_pr_climatology(pr_file, season, gridlines=False)
    Plot the precipitation climatology.

    Args:
      pr_file (str): Precipitation data file
      season (str): Season (3 letter abbreviation, e.g. JJA)
      gridlines (bool): Select whether to plot gridlines

We can now use this function to create exactly the same plot as before:

PYTHON

plot_pr_climatology("data/pr_Amon_ACCESS-CM2_historical_r1i1p1f1_gn_201001-201412.nc", "JJA")
plt.show()
Precipitation climatology

Plot a different model and season:

PYTHON

plot_pr_climatology("data/pr_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_201001-201412.nc", "DJF")
plt.show()
Precipitation climatology

Or use the optional gridlines input argument to change the default behaviour of the function (keyword arguments are usually used for options that the user will only want to change occasionally):

PYTHON

plot_pr_climatology(
    "data/pr_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_201001-201412.nc",
    "DJF",
    gridlines=True,
)
plt.show()
Precipitation climatology

Short functions

Our plot_pr_climatology function works, but at 16 lines of code it’s starting to get a little long. In general, people can only fit around 7-12 pieces of information in their short term memory. The readability of your code can therefore be greatly enhanced by keeping your functions short and sharp. The speed at which people can analyse their data is usually limited by the time it takes to read/understand/edit their code (as opposed to the time it takes the code to actually run), so the frequent use of short, well documented functions can dramatically speed up your data science.

  1. Cut and paste the plot_pr_climatology function (defined in the notes above) into your own notebook and try running it with different input arguments.
  2. Break the contents of plot_pr_climatology down into a series of smaller functions, such that it reads as follows:

PYTHON

def plot_pr_climatology(pr_file, season, gridlines=False):
    """Plot the precipitation climatology.

    Args:
      pr_file (str): Precipitation data file
      season (str): Season (3 letter abbreviation, e.g. JJA)
      gridlines (bool): Select whether to plot gridlines

    """

    ds = xr.open_dataset(pr_file)
    clim = ds["pr"].groupby("time.season").mean("time", keep_attrs=True)
    clim = convert_pr_units(clim)
    create_plot(clim, ds.attrs["source_id"], season, gridlines=gridlines)
    plt.show()

In other words, you’ll need to define new convert_pr_units and create_plot functions using code from the existing plot_pr_climatology function.

PYTHON

def convert_pr_units(da):
    """Convert kg m-2 s-1 to mm day-1.
   
    Args:
      da (xarray.DataArray): Precipitation data
   
    """
   
    da.data = da.data * 86400
    da.attrs['units'] = "mm/day"
   
    return da


def create_plot(clim, model, season, gridlines=False):
    """Plot the precipitation climatology.
   
    Args:
      clim (xarray.DataArray): Precipitation climatology data
      model (str) : Name of the climate model
      season (str): Season 
   
    Kwargs:  
      gridlines (bool): Select whether to plot gridlines    
   
    """
       
    fig = plt.figure(figsize=[12,5])
    ax = fig.add_subplot(111, projection=ccrs.PlateCarree(central_longitude=180))
    clim.sel(season=season).plot.contourf(
        ax=ax,
        levels=np.arange(0, 13.5, 1.5),
        extend="max",
        transform=ccrs.PlateCarree(),
        cbar_kwargs={"label": clim.units},
        cmap=cmocean.cm.haline_r,
    )
    ax.coastlines()
    if gridlines:
        plt.gca().gridlines()
   
    title = f"{model} precipitation climatology ({season})"
    plt.title(title)


def plot_pr_climatology(pr_file, season, gridlines=False):
    """Plot the precipitation climatology.

    Args:
      pr_file (str): Precipitation data file
      season (str): Season (3 letter abbreviation, e.g. JJA)
      gridlines (bool): Select whether to plot gridlines

    """

    ds = xr.open_dataset(pr_file)
    clim = ds["pr"].groupby("time.season").mean("time", keep_attrs=True)
    clim = convert_pr_units(clim)
    create_plot(clim, ds.attrs["source_id"], season, gridlines=gridlines)
    plt.show()

Writing your own modules

We’ve used functions to avoid code duplication in this particular notebook/script, but what if we wanted to convert precipitation units from kg m-2 s-1 to mm/day in a different notebook/script?

To avoid cutting and pasting from this notebook/script to another, the solution would be to place the convert_pr_units function in a separate script full of similar functions.

For instance, we could put all our unit conversion functions in a script called unit_conversion.py. When we want to convert precipitation units (in any script or notebook we’re working on), we can simply import that “module” and use the convert_pr_units function:

PYTHON

import unit_conversion
clim = unit_conversion.convert_pr_units(clim)

No copy and paste required!
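
As a rough sketch, unit_conversion.py could contain little more than the function we've already written (plus any other conversion functions you accumulate over time):

PYTHON

"""unit_conversion.py: a collection of unit conversion functions."""


def convert_pr_units(da):
    """Convert precipitation units from kg m-2 s-1 to mm day-1.

    Args:
      da (xarray.DataArray): Precipitation data

    """

    da.data = da.data * 86400
    da.attrs["units"] = "mm/day"

    return da

For the import to work, unit_conversion.py needs to be in the same directory as the notebook or script doing the importing (or somewhere else that Python searches for modules).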

Key Points

  • Define a function using def name(...params...).
  • The body of a function must be indented.
  • Call a function using name(...values...).
  • Use help(thing) to view help for something.
  • Put docstrings in functions to provide help for that function.
  • Specify default values for parameters when defining a function using name=value in the parameter list.
  • The readability of your code can be greatly enhanced by using numerous short functions.
  • Write (and import) modules to avoid code duplication.

Content from Command line programs


Last updated on 2024-07-25

Estimated time: 50 minutes

Overview

Questions

  • How can I write my own command line programs?

Objectives

  • Use the argparse library to manage command-line arguments in a program.
  • Structure Python scripts according to a simple template.

We’ve arrived at the point where we have successfully defined the functions required to plot the precipitation data.

We could continue to execute these functions from the Jupyter notebook, but in most cases notebooks are simply used to try things out and/or take notes on a new data analysis task. Once you’ve scoped out the task (as we have for plotting the precipitation climatology), that code can be transferred to a Python script so that it can be executed at the command line. It’s likely that your data processing workflows will include command line utilities from the CDO and NCO projects in addition to Python code, so the command line is the natural place to manage your workflows (e.g. using shell scripts or make files).

In general, the first thing that gets added to any Python script is the following:

PYTHON

if __name__ == "__main__":
    main()

The reason we need these two lines of code is that running a Python script in bash is very similar to importing that file in Python. The biggest difference is that we don’t expect anything to happen when we import a file, whereas when running a script we expect to see some output (e.g. an output file, figure and/or some text printed to the screen).

The __name__ variable exists to handle these two situations. When you import a Python file __name__ is set to the name of that file (e.g. when importing script.py, __name__ is script), but when running a script in bash __name__ is always set to __main__. The convention is to call the function that produces the output main(), but you can call it whatever you like.
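
To see this for yourself, you could create a tiny file (called, say, whereami.py, a made-up name for this example) containing a single line:

PYTHON

# whereami.py (hypothetical example file)
print(f"__name__ is set to {__name__}")

Running python whereami.py at the command line prints __name__ is set to __main__, whereas running import whereami at the Python interpreter prints __name__ is set to whereami.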

The next thing you’ll need is a library to parse the command line for input arguments. The most widely used option is argparse.

Putting those together, here’s a template for what most Python command line programs look like:

BASH

$ cat script_template.py

PYTHON

import argparse

#
# All your functions (that will be called by main()) go here.
#

def main(inargs):
    """Run the program."""

    print("Input file: ", inargs.infile)
    print("Output file: ", inargs.outfile)


if __name__ == "__main__":

    description = "Print the input arguments to the screen."
    parser = argparse.ArgumentParser(description=description)
    
    parser.add_argument("infile", type=str, help="Input file name")
    parser.add_argument("outfile", type=str, help="Output file name")

    args = parser.parse_args()            
    main(args)

By running script_template.py at the command line we’ll see that argparse handles all the input arguments:

BASH

$ python script_template.py in.nc out.nc

OUTPUT

Input file:  in.nc
Output file:  out.nc

It also generates help information for the user:

BASH

$ python script_template.py -h

OUTPUT

usage: script_template.py [-h] infile outfile

Print the input arguments to the screen.

positional arguments:
  infile      Input file name
  outfile     Output file name

optional arguments:
  -h, --help  show this help message and exit

and issues errors when users give the program invalid arguments:

BASH

$ python script_template.py in.nc

OUTPUT

usage: script_template.py [-h] infile outfile
script_template.py: error: the following arguments are required: outfile

Using this template as a starting point, we can add the functions we developed previously to a script called plot_precipitation_climatology.py.

BASH

$ cat plot_precipitation_climatology.py

PYTHON

import xarray as xr
import cartopy.crs as ccrs
import matplotlib.pyplot as plt
import numpy as np
import cmocean
import argparse


def convert_pr_units(da):
    """Convert kg m-2 s-1 to mm day-1.
    
    Args:
      da (xarray.DataArray): Precipitation data
    
    """
    
    da.data = da.data * 86400
    da.attrs["units"] = "mm/day"
    
    return da


def create_plot(clim, model, season, gridlines=False):
    """Plot the precipitation climatology.
    
    Args:
      clim (xarray.DataArray): Precipitation climatology data
      model (str): Name of the climate model
      season (str): Season
      
    Kwargs:  
      gridlines (bool): Select whether to plot gridlines    
    
    """
        
    fig = plt.figure(figsize=[12,5])
    ax = fig.add_subplot(111, projection=ccrs.PlateCarree(central_longitude=180))
    clim.sel(season=season).plot.contourf(
        ax=ax,
        levels=np.arange(0, 13.5, 1.5),
        extend="max",
        transform=ccrs.PlateCarree(),
        cbar_kwargs={"label": clim.units},
        cmap=cmocean.cm.haline_r,
    )
    ax.coastlines()
    if gridlines:
        plt.gca().gridlines()
    
    title = f"{model} precipitation climatology ({season})"
    plt.title(title)


def main(inargs):
    """Run the program."""

    ds = xr.open_dataset(inargs.pr_file)
    
    clim = ds["pr"].groupby("time.season").mean("time", keep_attrs=True)
    clim = convert_pr_units(clim)

    create_plot(clim, ds.attrs["source_id"], inargs.season)
    plt.savefig(
        inargs.output_file,
        dpi=200,
        bbox_inches="tight",
        facecolor="white",
    )


if __name__ == "__main__":
    description = "Plot the precipitation climatology."
    parser = argparse.ArgumentParser(description=description)
    
    parser.add_argument("pr_file", type=str, help="Precipitation data file")
    parser.add_argument("season", type=str, help="Season to plot")
    parser.add_argument("output_file", type=str, help="Output file name")

    args = parser.parse_args()
    
    main(args)

… and then run it at the command line:

BASH

$ python plot_precipitation_climatology.py data/pr_Amon_ACCESS-CM2_historical_r1i1p1f1_gn_201001-201412.nc MAM pr_Amon_ACCESS-CM2_historical_r1i1p1f1_gn_201001-201412-MAM-clim.png

Choices

For this series of challenges, you are required to make improvements to the plot_precipitation_climatology.py script that you downloaded earlier from the setup tab at the top of the page.

For the first improvement, edit the line of code that defines the season command line argument (parser.add_argument("season", type=str, help="Season to plot")) so that it only allows the user to input a valid three letter abbreviation (i.e. ["DJF", "MAM", "JJA", "SON"]).

(Hint: Read about the choices keyword argument at the argparse tutorial.)

PYTHON

parser.add_argument(
    "season",
    type=str,
    choices=["DJF", "MAM", "JJA", "SON"], 
    help="Season to plot",
)

Gridlines

Add an optional command line argument that allows the user to add gridlines to the plot.

(Hint: Read about the action="store_true" keyword argument at the argparse tutorial.)

Make the following additions to plot_precipitation_climatology.py (code omitted from this abbreviated version of the script is denoted ...):

PYTHON

...

def main(inargs):

   ... 

   create_plot(
       clim,
       ds.attrs["source_id"],
       inargs.season,
       gridlines=inargs.gridlines,
    )

...

if __name__ == "__main__":

    ... 

    parser.add_argument(
        "--gridlines",
        action="store_true",
        default=False,
        help="Include gridlines on the plot",
    )

... 

Colorbar levels

Add an optional command line argument that allows the user to specify the tick levels used in the colorbar.

(Hint: You’ll need to use the nargs='*' keyword argument.)

Make the following additions to plot_precipitation_climatology.py (code omitted from this abbreviated version of the script is denoted ...):

PYTHON

...

def create_plot(clim, model_name, season, gridlines=False, levels=None):
    """Plot the precipitation climatology.
      ...
      Kwargs:
        gridlines (bool): Select whether to plot gridlines
        levels (list): Tick marks on the colorbar      
   
    """

    if not levels:
        levels = np.arange(0, 13.5, 1.5)

    ...

    clim.sel(season=season).plot.contourf(
        ax=ax,
        ...,
    )

...

def main(inargs):

    ... 

    create_plot(
        clim,
        ds.attrs["source_id"],
        inargs.season,
        gridlines=inargs.gridlines,
        levels=inargs.cbar_levels,
    )

...

if __name__ == "__main__":

    ... 

    parser.add_argument(
        "--cbar_levels",
        type=float,
        nargs="*",
        default=None,
        help="list of levels / tick marks to appear on the colorbar",
    )

... 

Free time

Add any other options you’d like for customising the plot (e.g. title, axis labels, figure size).

plot_precipitation_climatology.py

At the conclusion of this lesson your plot_precipitation_climatology.py script should look something like the following:

PYTHON

import xarray as xr
import cartopy.crs as ccrs
import matplotlib.pyplot as plt
import numpy as np
import cmocean
import argparse


def convert_pr_units(da):
    """Convert kg m-2 s-1 to mm day-1.
   
    Args:
      da (xarray.DataArray): Precipitation data
    
    """
    
    da.data = da.data * 86400
    da.attrs["units"] = "mm/day"
   
    return da


def create_plot(clim, model, season, gridlines=False, levels=None):
    """Plot the precipitation climatology.
   
    Args:
      clim (xarray.DataArray): Precipitation climatology data
      model (str): Name of the climate model
      season (str): Season
      
    Kwargs:
      gridlines (bool): Select whether to plot gridlines
      levels (list): Tick marks on the colorbar    
    
    """

    if not levels:
        levels = np.arange(0, 13.5, 1.5)
        
    fig = plt.figure(figsize=[12,5])
    ax = fig.add_subplot(111, projection=ccrs.PlateCarree(central_longitude=180))
    clim.sel(season=season).plot.contourf(
        ax=ax,
        levels=levels,
        extend="max",
        transform=ccrs.PlateCarree(),
        cbar_kwargs={"label": clim.units},
        cmap=cmocean.cm.haline_r,
    )
    ax.coastlines()
    if gridlines:
        plt.gca().gridlines()
    
    title = f"{model} precipitation climatology ({season})"
    plt.title(title)


def main(inargs):
    """Run the program."""

    ds = xr.open_dataset(inargs.pr_file)
    
    clim = ds["pr"].groupby("time.season").mean("time", keep_attrs=True)
    clim = convert_pr_units(clim)

    create_plot(
        clim,
        ds.attrs["source_id"],
        inargs.season,
        gridlines=inargs.gridlines,
        levels=inargs.cbar_levels,
    )
    plt.savefig(
        inargs.output_file,
        dpi=200,
        bbox_inches="tight",
        facecolor="white",
    )


if __name__ == "__main__":
    description = "Plot the precipitation climatology."
    parser = argparse.ArgumentParser(description=description)
   
    parser.add_argument("pr_file", type=str, help="Precipitation data file")
    parser.add_argument("season", type=str, help="Season to plot")
    parser.add_argument("output_file", type=str, help="Output file name")

    parser.add_argument(
        "--gridlines",
        action="store_true",
        default=False,
        help="Include gridlines on the plot",
    )
    parser.add_argument(
        "--cbar_levels",
        type=float,
        nargs="*",
        default=None,
        help="list of levels / tick marks to appear on the colorbar",
    )

    args = parser.parse_args()
    main(args)

Key Points

  • Libraries such as argparse can be used to efficiently handle command line arguments.
  • Most Python scripts have a similar structure that can be used as a template.

Content from Version control


Last updated on 2024-07-25

Estimated time: 35 minutes

Overview

Questions

  • How can I record the revision history of my code?

Objectives

  • Configure git the first time it is used on a computer.
  • Create a local Git repository.
  • Go through the modify-add-commit cycle for one or more files.
  • Explain what the HEAD of a repository is and how to use it.
  • Identify and use Git commit numbers.
  • Compare various versions of tracked files.
  • Restore old versions of files.

Follow along

For this lesson participants follow along command by command, rather than observing and then completing challenges afterwards.

A version control system stores an authoritative copy of your code in a repository, which you can’t edit directly.

Instead, you checkout a working copy of the code, edit that code, then commit changes back to the repository.

In this way, the system records a complete revision history (i.e. of every commit), so that you can retrieve and compare previous versions at any time.

This is useful from an individual viewpoint, because you don’t need to store multiple (but slightly different) copies of the same script.

File mess

It’s also useful from a collaboration viewpoint (including collaborating with yourself across different computers) because the system keeps a record of who made what changes and when.

Setup


When we use Git on a new computer for the first time, we need to configure a few things.

BASH

$ git config --global user.name "Your Name"
$ git config --global user.email "you@email.com"

This user name and email will be associated with your subsequent Git activity, which means that any changes pushed to GitHub, BitBucket, GitLab or another Git host server later on in this lesson will include this information.

You only need to run these configuration commands once; Git will remember them for next time.
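
You can check your current settings at any time:

BASH

$ git config --list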

We then need to navigate to our data-carpentry directory and tell Git to initialise that directory as a Git repository.

BASH

$ cd ~/Desktop/data-carpentry
$ git init

If we use ls to show the directory’s contents, it appears that nothing has changed:

BASH

$ ls -F

OUTPUT

data/          script_template.py
plot_precipitation_climatology.py

But if we add the -a flag to show everything, we can see that Git has created a hidden directory within data-carpentry called .git:

BASH

$ ls -F -a

OUTPUT

./                  data/
../                 plot_precipitation_climatology.py
.git/               script_template.py

Git stores information about the project in this special sub-directory. If we ever delete it, we will lose the project’s history.

We can check that everything is set up correctly by asking Git to tell us the status of our project:

BASH

$ git status

OUTPUT

On branch main

Initial commit

Untracked files:
  (use "git add <file>..." to include in what will be committed)

	data/
	plot_precipitation_climatology.py
	script_template.py

nothing added to commit but untracked files present (use "git add" to track)

Branch naming

If you’re running an older version of Git, you may see On branch master instead of On branch main at the top of the git status output. Since 2021, Git has followed a move in the developer community to change the default branch name from “master” to “main” for cultural sensitivity reasons, avoiding “master/slave” terminology.
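
If you'd like new repositories on your computer to use main by default, Git 2.28 and later let you set that once (this step is optional and not required for these lessons):

BASH

$ git config --global init.defaultBranch main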

Tracking changes


The “untracked files” message means that there are files in the directory that Git isn’t keeping track of. We can tell Git to track a file using git add:

BASH

$ git add plot_precipitation_climatology.py

and then check that the right thing happened:

BASH

$ git status

OUTPUT

On branch main

Initial commit

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)

	new file:   plot_precipitation_climatology.py

Untracked files:
  (use "git add <file>..." to include in what will be committed)

	data/
	script_template.py

Git now knows that it’s supposed to keep track of plot_precipitation_climatology.py, but it hasn’t recorded these changes as a commit yet. To get it to do that, we need to run one more command:

BASH

$ git commit -m "Initial commit of precip climatology script"

OUTPUT

[main (root-commit) 8e69d70] Initial commit of precip climatology script
 1 file changed, 75 insertions(+)
 create mode 100644 plot_precipitation_climatology.py

When we run git commit, Git takes everything we have told it to save by using git add and stores a copy permanently inside the special .git directory. This permanent copy is called a commit (or revision) and its short identifier is 8e69d70. (Your commit may have another identifier.)

We use the -m flag (for “message”) to record a short, descriptive, and specific comment that will help us remember later on what we did and why. If we just run git commit without the -m option, Git will launch nano (or whatever other editor we configured as core.editor) so that we can write a longer message.
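
If you'd prefer a particular editor for those longer commit messages, you can set it once. For example, to use nano:

BASH

$ git config --global core.editor "nano -w"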

If we run git status now:

BASH

$ git status

OUTPUT

On branch main
Untracked files:
  (use "git add <file>..." to include in what will be committed)

	data/
	script_template.py

nothing added to commit but untracked files present (use "git add" to track)

it tells us everything is up to date. If we want to know what we’ve done recently, we can ask Git to show us the project’s history using git log:

BASH

$ git log

OUTPUT

commit 8e69d7086cb7c44a48a096122e5324ad91b8a439
Author: Damien Irving <my@email.com>
Date:   Wed Mar 3 15:46:48 2021 +1100

    Initial commit of precip climatology script

git log lists all commits made to a repository in reverse chronological order. The listing for each commit includes the commit’s full identifier (which starts with the same characters as the short identifier printed by the git commit command earlier), the commit’s author, when it was created, and the log message Git was given when the commit was created.

Let’s go ahead and open our favourite text editor and make a small change to plot_precipitation_climatology.py by editing the description variable (which is used by argparse in the help information it displays at the command line).

PYTHON

description = "Plot the precipitation climatology for a given season."

When we run git status now, it tells us that a file it already knows about has been modified:

BASH

$ git status

OUTPUT

On branch main
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)

	modified:   plot_precipitation_climatology.py

Untracked files:
  (use "git add <file>..." to include in what will be committed)

	data/
	script_template.py

no changes added to commit (use "git add" and/or "git commit -a")

The last line is the key phrase: “no changes added to commit”. We have changed this file, but we haven’t told Git we will want to save those changes (which we do with git add) nor have we saved them (which we do with git commit). So let’s do that now. It is good practice to always review our changes before saving them. We do this using git diff. This shows us the differences between the current state of the file and the most recently saved version:

BASH

$ git diff

OUTPUT

diff --git a/plot_precipitation_climatology.py b/plot_precipitation_climatology.py
index 58903f5..6c12b29 100644
--- a/plot_precipitation_climatology.py
+++ b/plot_precipitation_climatology.py
@@ -62,7 +62,7 @@ def main(inargs):


 if __name__ == '__main__':
-    description = "Plot the precipitation climatology."
+    description = "Plot the precipitation climatology for a given season."
     parser = argparse.ArgumentParser(description=description)

     parser.add_argument("pr_file", type=str, help="Precipitation data file")

The output is a little cryptic because it’s actually a series of commands for tools like editors and patch telling them how to reconstruct one file given the other, but the + and - markers clearly show what has been changed.

After reviewing our change, it’s time to commit it:

BASH

$ git commit -m "Small improvement to help information"

OUTPUT

On branch main
Changes not staged for commit:
	modified:   plot_precipitation_climatology.py

Untracked files:
	data/
	script_template.py

no changes added to commit

Whoops: Git won’t commit because we didn’t use git add first. Let’s fix that:

BASH

$ git add plot_precipitation_climatology.py
$ git commit -m "Small improvement to help information"

OUTPUT

[main 35f22b7] Small improvement to help information
 1 file changed, 1 insertion(+), 1 deletion(-)

Git insists that we add files to the set we want to commit before actually committing anything. This allows us to commit our changes in stages and capture changes in logical portions rather than only large batches. For example, suppose we’re writing our thesis using LaTeX (the plain text .tex files can be tracked using Git) and we add a few citations to the introduction chapter. We might want to commit those additions to our introduction.tex file but not commit the work we’re doing on the conclusion.tex file (which we haven’t finished yet).

To allow for this, Git has a special staging area where it keeps track of things that have been added to the current changeset but not yet committed.

The Git Staging Area

Let’s do the whole edit-add-commit process one more time to watch as our changes to a file move from our editor to the staging area and into long-term storage. First, we’ll tweak the section of the script that imports all the libraries we need, by putting them in the order suggested by the PEP 8 - Style Guide for Python Code. The convention is to import packages from the Python Standard Library first, then other external packages, then your own modules (with a blank line between each grouping).

PYTHON

import argparse

import xarray as xr
import cartopy.crs as ccrs
import matplotlib.pyplot as plt
import numpy as np
import cmocean

BASH

$ git diff

OUTPUT

diff --git a/plot_precipitation_climatology.py b/plot_precipitation_climatology.py
index 6c12b29..c6beb12 100644
--- a/plot_precipitation_climatology.py
+++ b/plot_precipitation_climatology.py
@@ -1,9 +1,10 @@
+import argparse
+
 import xarray as xr
 import cartopy.crs as ccrs
 import matplotlib.pyplot as plt
 import numpy as np
 import cmocean
-import argparse


 def convert_pr_units(da):

Let’s save our changes:

BASH

$ git add plot_precipitation_climatology.py
$ git commit -m "Ordered imports according to PEP 8"

OUTPUT

[main a6cea2c] Ordered imports according to PEP 8
 1 file changed, 2 insertions(+), 1 deletion(-)

check our status:

BASH

$ git status

OUTPUT

On branch main
Untracked files:
  (use "git add <file>..." to include in what will be committed)

	data/
	script_template.py

nothing added to commit but untracked files present (use "git add" to track)

and look at the history of what we’ve done so far:

BASH

$ git log

OUTPUT

commit a6cea2cd4facde6adfdde3a08ff9413b45479623 (HEAD -> main)
Author: Damien Irving <my@email.com>
Date:   Wed Mar 3 16:01:45 2021 +1100

    Ordered imports according to PEP 8

commit 35f22b74b11ed7993b23f9b4554b03ffc295e823
Author: Damien Irving <my@email.com>
Date:   Wed Mar 3 15:55:18 2021 +1100

    Small improvement to help information

commit 8e69d7086cb7c44a48a096122e5324ad91b8a439
Author: Damien Irving <my@email.com>
Date:   Wed Mar 3 15:46:48 2021 +1100

    Initial commit of precip climatology script

Exploring history


As we saw earlier, we can refer to commits by their identifiers. You can refer to the most recent commit of the working directory by using the identifier HEAD.

To demonstrate how to use HEAD, let’s make a trivial change to plot_precipitation_climatology.py by inserting a comment.

PYTHON

# A random comment

Now, let’s see what we get.

BASH

$ git diff HEAD plot_precipitation_climatology.py

OUTPUT

diff --git a/plot_precipitation_climatology.py b/plot_precipitation_climatology.py
index c6beb12..c11707c 100644
--- a/plot_precipitation_climatology.py
+++ b/plot_precipitation_climatology.py
@@ -6,6 +6,7 @@ import matplotlib.pyplot as plt
 import numpy as np
 import cmocean

+# A random comment

 def convert_pr_units(da):
     """Convert kg m-2 s-1 to mm day-1.

which is the same as what you would get if you leave out HEAD (try it). The real benefit of using the HEAD notation is the ease with which you can refer to previous commits. We do that by adding ~1 to refer to the commit one before HEAD.

BASH

$ git diff HEAD~1 plot_precipitation_climatology.py

OUTPUT

diff --git a/plot_precipitation_climatology.py b/plot_precipitation_climatology.py
index 6c12b29..c11707c 100644
--- a/plot_precipitation_climatology.py
+++ b/plot_precipitation_climatology.py
@@ -1,10 +1,12 @@
+import argparse
+
 import xarray as xr
 import cartopy.crs as ccrs
 import matplotlib.pyplot as plt
 import numpy as np
 import cmocean
-import argparse

+# A random comment

 def convert_pr_units(da):
     """Convert kg m-2 s-1 to mm day-1.

If we want to see the differences between older commits we can use git diff again, but with the notation HEAD~2, HEAD~3, and so on, to refer to them.

We can also refer to commits using those long strings of digits and letters that git log displays. These are unique IDs for the changes, and “unique” really does mean unique: every change to any set of files on any computer has a unique 40-character identifier. Our first commit (HEAD~2) was given the ID 8e69d7086cb7c44a48a096122e5324ad91b8a439, but you only have to use the first seven characters for git to know what you mean:

BASH

$ git diff 8e69d70 plot_precipitation_climatology.py

OUTPUT

diff --git a/plot_precipitation_climatology.py b/plot_precipitation_climatology.py
index 58903f5..c11707c 100644
--- a/plot_precipitation_climatology.py
+++ b/plot_precipitation_climatology.py
@@ -1,10 +1,12 @@
+import argparse
+
 import xarray as xr
 import cartopy.crs as ccrs
 import matplotlib.pyplot as plt
 import numpy as np
 import cmocean
-import argparse

+# A random comment

 def convert_pr_units(da):
     """Convert kg m-2 s-1 to mm day-1.
@@ -62,7 +64,7 @@ def main(inargs):


 if __name__ == '__main__':
-    description = "Plot the precipitation climatology."
+    description = "Plot the precipitation climatology for a given season."
     parser = argparse.ArgumentParser(description=description)

     parser.add_argument("pr_file", type=str, help="Precipitation data file")

Now that we can save changes to files and see what we’ve changed, how can we restore older versions of things? Let’s suppose we accidentally overwrite our file:

BASH

$ echo "whoops" > plot_precipitation_climatology.py
$ cat plot_precipitation_climatology.py

OUTPUT

whoops

git status now tells us that the file has been changed, but those changes haven’t been staged:

BASH

$ git status

OUTPUT

On branch main
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)

	modified:   plot_precipitation_climatology.py

Untracked files:
  (use "git add <file>..." to include in what will be committed)

	data/
	script_template.py

no changes added to commit (use "git add" and/or "git commit -a")

We can put things back the way they were at the time of our last commit by using git restore:

BASH

$ git restore plot_precipitation_climatology.py
$ cat plot_precipitation_climatology.py

OUTPUT

import argparse

import xarray as xr
import cartopy.crs as ccrs
...

The random comment that we inserted has been lost (that change hadn’t been committed) but everything else that was in our last commit is there.

Checking out with Git

If you’re running a different version of Git, you may see a suggestion for git checkout instead of git restore. As of Git version 2.29, git restore is still an experimental command and operates as a specialized form of git checkout.

git checkout HEAD plot_precipitation_climatology.py is the equivalent command.

plot_precipitation_climatology.py

At the conclusion of this lesson your plot_precipitation_climatology.py script should look something like the following:

PYTHON

import argparse

import xarray as xr
import cartopy.crs as ccrs
import matplotlib.pyplot as plt
import numpy as np
import cmocean


def convert_pr_units(da):
    """Convert kg m-2 s-1 to mm day-1.
   
    Args:
      da (xarray.DataArray): Precipitation data
    
    """
    
    da.data = da.data * 86400
    da.attrs["units"] = "mm/day"
   
    return da


def create_plot(clim, model, season, gridlines=False, levels=None):
    """Plot the precipitation climatology.
   
    Args:
      clim (xarray.DataArray): Precipitation climatology data
      model (str): Name of the climate model
      season (str): Season
      
    Kwargs:
      gridlines (bool): Select whether to plot gridlines
      levels (list): Tick marks on the colorbar    
    
    """

    if not levels:
        levels = np.arange(0, 13.5, 1.5)
        
    fig = plt.figure(figsize=[12,5])
    ax = fig.add_subplot(111, projection=ccrs.PlateCarree(central_longitude=180))
    clim.sel(season=season).plot.contourf(
        ax=ax,
        levels=levels,
        extend="max",
        transform=ccrs.PlateCarree(),
        cbar_kwargs={"label": clim.units},
        cmap=cmocean.cm.haline_r
    )
    ax.coastlines()
    if gridlines:
        plt.gca().gridlines()
    
    title = f"{model} precipitation climatology ({season})"
    plt.title(title)


def main(inargs):
    """Run the program."""

    ds = xr.open_dataset(inargs.pr_file)
    
    clim = ds["pr"].groupby("time.season").mean("time")
    clim = convert_pr_units(clim)

    create_plot(
        clim,
        ds.attrs["source_id"],
        inargs.season,
        gridlines=inargs.gridlines,
        levels=inargs.cbar_levels
    )
    plt.savefig(
        inargs.output_file,
        dpi=200,
        bbox_inches="tight",
        facecolor="white",
    )


if __name__ == "__main__":
    description = "Plot the precipitation climatology for a given season."
    parser = argparse.ArgumentParser(description=description)
   
    parser.add_argument("pr_file", type=str, help="Precipitation data file")
    parser.add_argument("season", type=str, help="Season to plot")
    parser.add_argument("output_file", type=str, help="Output file name")

    parser.add_argument(
        "--gridlines",
        action="store_true",
        default=False,
        help="Include gridlines on the plot",
    )
    parser.add_argument(
        "--cbar_levels",
        type=float,
        nargs="*",
        default=None,
        help="list of levels / tick marks to appear on the colorbar",
    )

    args = parser.parse_args()
    main(args)

Key Points

  • Use git config to configure a user name, email address, editor, and other preferences once per machine.
  • git init initializes a repository.
  • git status shows the status of a repository.
  • Files can be stored in a project’s working directory (which users see), the staging area (where the next commit is being built up) and the local repository (where commits are permanently recorded).
  • git add puts files in the staging area.
  • git commit saves the staged content as a new commit in the local repository.
  • Always write a log message when committing changes.
  • git diff displays differences between commits.
  • git restore recovers old versions of files.

Content from GitHub


Last updated on 2024-07-25 | Edit this page

Estimated time: 25 minutes

Overview

Questions

  • How can I make my code available on GitHub?

Objectives

  • Explain what remote repositories are and why they are useful.
  • Push to or pull from a remote repository.

Follow along

For this lesson participants follow along command by command, rather than observing and then completing challenges afterwards.

Creating a remote repository


Version control really comes into its own when we begin to collaborate with other people (including ourselves for those who work on multiple computers). We already have most of the machinery we need to do this; the only thing missing is to copy changes from one repository to another.

Systems like Git allow us to move work between any two repositories. In practice, though, it’s easiest to use one copy as a central hub, and to keep it on the web rather than on someone’s laptop. Most programmers use hosting services like GitHub, BitBucket or GitLab to hold those master copies.

Let’s start by sharing the changes we’ve made to our current project with the world. Log in to GitHub, then click on the icon in the top right corner to create a new repository called data-carpentry:

Creating a Repository on GitHub (Step 1)

Name your repository “data-carpentry” and then click “Create Repository”:

Creating a Repository on GitHub (Step 2)

As soon as the repository is created, GitHub displays a page with a URL and some information on how to configure your local repository:

Creating a Repository on GitHub (Step 3)

This effectively does the following on GitHub’s servers:

BASH

$ mkdir data-carpentry
$ cd data-carpentry
$ git init

Our local repository still contains our earlier work on plot_precipitation_climatology.py, but the remote repository on GitHub doesn’t contain any files yet.

The next step is to connect the two repositories. We do this by making the GitHub repository a “remote” for the local repository. The home page of the repository on GitHub includes the string we need to identify it:

Where to Find Repository URL on GitHub

Click on the ‘SSH’ link to change the protocol from HTTPS to SSH if SSH isn’t already selected.

HTTPS vs. SSH

We use SSH here because, while it requires some additional configuration, it is a security protocol widely used by many applications. If you want to use HTTPS instead, you’ll need to create a Personal Access Token.

Copy that location from the browser, go into the local data-carpentry repository, and run this command:

BASH

$ git remote add origin git@github.com:DamienIrving/data-carpentry.git

Make sure to use the location for your repository rather than Damien’s: the only difference should be your username instead of DamienIrving.

We can check that the command has worked by running git remote -v:

BASH

$ git remote -v

OUTPUT

origin   git@github.com:DamienIrving/data-carpentry.git (fetch)
origin   git@github.com:DamienIrving/data-carpentry.git (push)

The name origin is a local nickname for your remote repository. We could use something else if we wanted to, but origin is by far the most common choice.

Connecting to GitHub with SSH


Before we can connect to our remote repository, we need to set up a way for our computer to authenticate with GitHub so it knows it’s us trying to connect to our remote repository.

We are going to set up the method that is commonly used by many different services to authenticate access on the command line. This method is called Secure Shell Protocol (SSH). SSH is a cryptographic network protocol that allows secure communication between computers using an otherwise insecure network.

SSH uses what is called a key pair. This is two keys that work together to validate access. One key is publicly known and called the public key, and the other key called the private key is kept private. Very descriptive names.

You can think of the public key as a padlock, and only you have the key (the private key) to open it. You use the public key where you want a secure method of communication, such as your GitHub account. You give this padlock, or public key, to GitHub and say “lock the communications to my account with this so that only computers that have my private key can unlock communications and send git commands as my GitHub account.”

What we will do now is the minimum required to set up the SSH keys and add the public key to a GitHub account.

Keeping your keys secure

You shouldn’t really forget about your SSH keys, since they keep your account secure. It’s good practice to audit your secure shell keys every so often. Especially if you are using multiple computers to access your account.

The first thing we are going to do is check whether this has already been done on the computer we’re using, because generally speaking this setup only needs to happen once and then you can forget about it. We can run the list command to check what key pairs already exist on our computer.

BASH

ls -al ~/.ssh

(Your output is going to look a little different depending on whether or not SSH has ever been set up on the computer you are using.)

I have not set up SSH on my computer, so my output is

OUTPUT

ls: cannot access '/home/damien/.ssh': No such file or directory

If SSH has been set up on the computer you’re using, the public and private key pairs will be listed. The file names are either id_ed25519/id_ed25519.pub or id_rsa/id_rsa.pub depending on how the key pairs were set up.

If they don’t exist on your computer, you can use this command to create them.

BASH

$ ssh-keygen -t ed25519 -C "you@email.com"

The -t option specifies which type of algorithm to use and -C attaches a comment to the key (here, your email).

If you are using a legacy system that doesn’t support the Ed25519 algorithm, use:

BASH

$ ssh-keygen -t rsa -b 4096 -C "you@email.com"

OUTPUT

Generating public/private ed25519 key pair.
Enter file in which to save the key (/home/damien/.ssh/id_ed25519):

We want to use the default file, so just press Enter.

OUTPUT

Created directory '/home/damien/.ssh'.
Enter passphrase (empty for no passphrase):

Now, it is prompting us for a passphrase. If you’re using a laptop that other people sometimes have access to, you might want to create a passphrase. Be sure to use something memorable or save your passphrase somewhere, as there is no “reset my password” option.

OUTPUT

Enter same passphrase again:

After entering the same passphrase a second time, we receive the confirmation.

OUTPUT

Your identification has been saved in /home/damien/.ssh/id_ed25519
Your public key has been saved in /home/damien/.ssh/id_ed25519.pub
The key fingerprint is:
SHA256:SMSPIStNyA00KPxuYu94KpZgRAYjgt9g4BA4kFy3g1o you@email.com
The key's randomart image is:
+--[ED25519 256]--+
|^B== o.          |
|%*=.*.+          |
|+=.E =.+         |
| .=.+.o..        |
|....  . S        |
|.+ o             |
|+ =              |
|.o.o             |
|oo+.             |
+----[SHA256]-----+

The “identification” is actually the private key. You should never share it. The public key is appropriately named. The “key fingerprint” is a shorter version of a public key.

Now that we have generated the SSH keys, we will find the SSH files when we check.

BASH

ls -al ~/.ssh

OUTPUT

drwxr-xr-x 1 Damien  staff  197121   0 Jul 16 14:48 ./
drwxr-xr-x 1 Damien  staff  197121   0 Jul 16 14:48 ../
-rw-r--r-- 1 Damien  staff  197121 419 Jul 16 14:48 id_ed25519
-rw-r--r-- 1 Damien  staff  197121 106 Jul 16 14:48 id_ed25519.pub

Now we have a SSH key pair and we can run this command to check if GitHub can read our authentication.

BASH

ssh -T git@github.com

OUTPUT

The authenticity of host 'github.com (192.30.255.112)' can't be established.
RSA key fingerprint is SHA256:nThbg6kXUpJWGl7E1IGOCspRomTxdCARLviKw6E5SY8.
This key is not known by any other names
Are you sure you want to continue connecting (yes/no/[fingerprint])? y
Please type 'yes', 'no' or the fingerprint: yes
Warning: Permanently added 'github.com' (RSA) to the list of known hosts.
git@github.com: Permission denied (publickey).

Right, we forgot that we need to give GitHub our public key!

First, we need to copy the public key. Be sure to include the .pub at the end, otherwise you’re looking at the private key.

BASH

cat ~/.ssh/id_ed25519.pub

OUTPUT

ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIDmRA3d51X0uu9wXek559gfn6UFNF69yZjChyBIU2qKI you@email.com

Now, going to GitHub.com, click on your profile icon in the top right corner to get the drop-down menu. Click “Settings,” then on the settings page, click “SSH and GPG keys,” on the left side “Account settings” menu. Click the “New SSH key” button on the right side. Now, you can add a title (e.g. “work laptop” so you can remember where the original key pair files are located), paste your SSH key into the field, and click the “Add SSH key” to complete the setup.

Adding SSH keys on GitHub

Now that we’ve set that up, let’s check our authentication again from the command line.

BASH

$ ssh -T git@github.com

OUTPUT

Hi Damien! You've successfully authenticated, but GitHub does not provide shell access.

Good! This output confirms that the SSH key works as intended. We are now ready to push our work to the remote repository.

BASH

$ git push origin main

OUTPUT

Counting objects: 9, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (6/6), done.
Writing objects: 100% (9/9), 821 bytes, done.
Total 9 (delta 2), reused 0 (delta 0)
To github.com:DamienIrving/data-carpentry.git
 * [new branch]      main -> main
Branch main set up to track remote branch main from origin.

We can pull changes from the remote repository to the local one as well:

BASH

$ git pull origin main

OUTPUT

From github.com:DamienIrving/data-carpentry.git
 * branch            main       -> FETCH_HEAD
Already up-to-date.

Pulling has no effect in this case because the two repositories are already synchronised. If someone else had pushed some changes to the repository on GitHub, though, this command would download them to our local repository.

Sharing code with yourself or others


If we logged onto a different computer (e.g. a supercomputing facility or our desktop computer at home) we could access a copy of our code by “cloning” it.

BASH

$ git clone git@github.com:DamienIrving/data-carpentry.git

Since our repository is public, anyone (e.g. research collaborators) could clone the repository by getting the location from the corresponding page on GitHub:

Cloning a repository on GitHub

Working with others

Someone who clones your repository can’t push changes directly to it (unless you add them as a collaborator). They could, however, “fork” your repository and submit suggested changes via a “pull request”. Collaborators and pull requests are beyond the scope of this lesson, but you may come across them as you get more experienced with using git and GitHub.

Key Points

  • A local Git repository can be connected to one or more remote repositories.
  • You can use the SSH protocol to connect to remote repositories.
  • git push copies changes from a local repository to a remote repository.
  • git pull copies changes from a remote repository to a local repository.

Content from Vectorisation


Last updated on 2024-07-25 | Edit this page

Estimated time: 30 minutes

Overview

Questions

  • How can I avoid looping over each element of large data arrays?

Objectives

  • Use surface land fraction data to mask the land or ocean.
  • Use the vectorised operations available in the numpy library to avoid looping over array elements.

A useful addition to our plot_precipitation_climatology.py script would be the option to apply a land or ocean mask. To do this, we need to use the land area fraction file.

PYTHON

import numpy as np
import xarray as xr


accesscm2_sftlf_file = "data/sftlf_fx_ACCESS-CM2_historical_r1i1p1f1_gn.nc"

ds = xr.open_dataset(accesscm2_sftlf_file)
sftlf = ds["sftlf"]
print(sftlf)

OUTPUT

<xarray.DataArray 'sftlf' (lat: 144, lon: 192)>
array([[100., 100., 100., ..., 100., 100., 100.],
       [100., 100., 100., ..., 100., 100., 100.],
       [100., 100., 100., ..., 100., 100., 100.],
       ...,
       [  0.,   0.,   0., ...,   0.,   0.,   0.],
       [  0.,   0.,   0., ...,   0.,   0.,   0.],
       [  0.,   0.,   0., ...,   0.,   0.,   0.]], dtype=float32)
Coordinates:
  * lat      (lat) float64 -89.38 -88.12 -86.88 -85.62 ... 86.88 88.12 89.38
  * lon      (lon) float64 0.9375 2.812 4.688 6.562 ... 353.4 355.3 357.2 359.1
Attributes:
    standard_name:   land_area_fraction
    long_name:       Percentage of the grid  cell occupied by land (including...
    comment:         Percentage of horizontal area occupied by land.
    units:           %
    original_units:  1
    history:         2019-11-09T02:47:20Z altered by CMOR: Converted units fr...
    cell_methods:    area: mean
    cell_measures:   area: areacella

The data in a sftlf file assigns each grid cell a percentage value from 0% (no land) to 100% (all land).

PYTHON

print(sftlf.data.max())
print(sftlf.data.min())

OUTPUT

100.0
0.0

To apply a mask to our plot, the value of all the data points we’d like to mask needs to be set to np.nan. The most obvious solution to applying a land mask, for example, might therefore be to loop over each cell in our data array and decide whether it is a land point (and thus needs to be set to np.nan).

(For this example, we are going to define land as any grid point that is more than 50% land.)

PYTHON

ds = xr.open_dataset("data/pr_Amon_ACCESS-CM2_historical_r1i1p1f1_gn_201001-201412.nc")
clim = ds['pr'].mean("time", keep_attrs=True)

nlats, nlons = clim.data.shape
for y in range(nlats):
    for x in range(nlons):
        if sftlf.data[y, x] > 50:
            clim.data[y, x] = np.nan

While this approach technically works, the problem is that (a) the code is hard to read, and (b) in contrast to low-level languages like Fortran and C, high-level languages like Python and Matlab are built for usability (i.e. they make it easy to write concise, readable code) as opposed to speed. This particular array is so small that the looping isn’t noticeably slow, but in general looping over every data point in an array should be avoided.

Fortunately, there are lots of numpy functions (which are written in C under the hood) that allow you to get around this problem by applying a particular operation to an entire array at once (which is known as a vectorised operation). The np.where function, for instance, allows you to make a true/false decision at each data point in the array and then perform a different action depending on the answer.
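For example, a single np.where call can do the same job as the nested loops above (shown here as a rough sketch, assuming a fresh clim array; note that np.where returns a plain numpy array rather than an xarray DataArray):

PYTHON

# Insert NaN where the land fraction exceeds 50%, otherwise keep the original value
clim_ocean_data = np.where(sftlf.data > 50, np.nan, clim.data)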

The developers of xarray have built this np.where functionality into the library, so creating a new DataArray with the land masked becomes a one-line command:

PYTHON

clim_ocean = clim.where(sftlf.data < 50)
print(clim_ocean)

OUTPUT

<xarray.DataArray 'pr' (lat: 144, lon: 192)>
array([[          nan,           nan,           nan, ...,           nan,
                  nan,           nan],
       [          nan,           nan,           nan, ...,           nan,
                  nan,           nan],
       [          nan,           nan,           nan, ...,           nan,
                  nan,           nan],
       ...,
       [7.5109556e-06, 7.4777777e-06, 7.4689174e-06, ..., 7.3359679e-06,
        7.3987890e-06, 7.3978440e-06],
       [7.1837171e-06, 7.1722038e-06, 7.1926393e-06, ..., 7.1552149e-06,
        7.1576678e-06, 7.1592167e-06],
       [7.0353467e-06, 7.0403985e-06, 7.0326828e-06, ..., 7.0392648e-06,
        7.0387587e-06, 7.0304386e-06]], dtype=float32)
Coordinates:
  * lon      (lon) float64 0.9375 2.812 4.688 6.562 ... 353.4 355.3 357.2 359.1
  * lat      (lat) float64 -89.38 -88.12 -86.88 -85.62 ... 86.88 88.12 89.38
Attributes:
    standard_name:  precipitation_flux
    long_name:      Precipitation
    units:          kg m-2 s-1
    comment:        includes both liquid and solid phases
    cell_methods:   area: time: mean
    cell_measures:  area: areacella

Mask option

Modify plot_precipitation_climatology.py so that the user can choose to apply a mask via the following argparse option:

PYTHON

parser.add_argument(
    "--mask",
    type=str,
    nargs=2,
    metavar=("SFTLF_FILE", "REALM"),
    default=None,
    help="""Provide sftlf file and realm to mask ('land' or 'ocean')""",
)

This should involve defining a new function called apply_mask(), in order to keep main() short and readable.

Test to see if your mask worked by plotting the ACCESS-CM2 climatology for JJA:

BASH

$ python plot_precipitation_climatology.py data/pr_Amon_ACCESS-CM2_historical_r1i1p1f1_gn_201001-201412.nc JJA pr_Amon_ACCESS-CM2_historical_r1i1p1f1_gn_201001-201412-JJA-clim_land-mask.png --mask data/sftlf_fx_ACCESS-CM2_historical_r1i1p1f1_gn.nc ocean
Ocean masked rainfall plot

Commit the changes to git and then push to GitHub.

Make the following additions to plot_precipitation_climatology.py (code omitted from this abbreviated version of the script is denoted ...):

PYTHON

def apply_mask(da, sftlf_file, realm):
    """Mask ocean or land using a sftlf (land surface fraction) file.
   
    Args:
     da (xarray.DataArray): Data to mask
     sftlf_file (str): Land surface fraction file
     realm (str): Realm to mask
   
    """
  
    ds = xr.open_dataset(sftlf_file)
  
    if realm == 'land':
        masked_da = da.where(ds['sftlf'].data < 50)
    else:
        masked_da = da.where(ds['sftlf'].data > 50)   
  
    return masked_da

...

def main(inargs):
    """Run the program."""

    ds = xr.open_dataset(inargs.pr_file)
   
    clim = ds["pr"].groupby("time.season").mean("time", keep_attrs=True)
    clim = convert_pr_units(clim)

    if inargs.mask:
        sftlf_file, realm = inargs.mask
        clim = apply_mask(clim, sftlf_file, realm)

    ...

if __name__ == "__main__":

    description = "Plot the precipitation climatology for a given season."
    parser = argparse.ArgumentParser(description=description)

    ...
   
    parser.add_argument(
        "--mask",
        type=str,
        nargs=2,
        metavar=("SFTLF_FILE", "REALM"),
        default=None,
        help="""Provide sftlf file and realm to mask ('land' or 'ocean')""",
    )

    args = parser.parse_args()            
    main(args)

plot_precipitation_climatology.py

At the conclusion of this lesson your plot_precipitation_climatology.py script should look something like the following:

PYTHON

import argparse

import xarray as xr
import cartopy.crs as ccrs
import matplotlib.pyplot as plt
import numpy as np
import cmocean


def convert_pr_units(da):
    """Convert kg m-2 s-1 to mm day-1.
    
    Args:
      da (xarray.DataArray): Precipitation data
   
    """
   
    da.data = da.data * 86400
    da.attrs["units"] = "mm/day"
   
    return da


def apply_mask(da, sftlf_file, realm):
    """Mask ocean or land using a sftlf (land surface fraction) file.
   
    Args:
      da (xarray.DataArray): Data to mask
      sftlf_file (str): Land surface fraction file
      realm (str): Realm to mask
   
    """
  
    ds = xr.open_dataset(sftlf_file)
  
    if realm == 'land':
        masked_da = da.where(ds['sftlf'].data < 50)
    else:
        masked_da = da.where(ds['sftlf'].data > 50)   
   
    return masked_da


def create_plot(clim, model, season, gridlines=False, levels=None):
    """Plot the precipitation climatology.
    
    Args:
      clim (xarray.DataArray): Precipitation climatology data
      model (str): Name of the climate model
      season (str): Season
     
    Kwargs:
      gridlines (bool): Select whether to plot gridlines
      levels (list): Tick marks on the colorbar    
    
    """

    if not levels:
        levels = np.arange(0, 13.5, 1.5)
       
    fig = plt.figure(figsize=[12,5])
    ax = fig.add_subplot(111, projection=ccrs.PlateCarree(central_longitude=180))
    clim.sel(season=season).plot.contourf(
        ax=ax,
        levels=levels,
        extend="max",
        transform=ccrs.PlateCarree(),
        cbar_kwargs={"label": clim.units},
        cmap=cmocean.cm.haline_r
    )
    ax.coastlines()
    if gridlines:
        plt.gca().gridlines()
    
    title = f"{model} precipitation climatology ({season})"
    plt.title(title)


def main(inargs):
    """Run the program."""

    ds = xr.open_dataset(inargs.pr_file)
    
    clim = ds["pr"].groupby("time.season").mean("time", keep_attrs=True)
    clim = convert_pr_units(clim)

    if inargs.mask:
        sftlf_file, realm = inargs.mask
        clim = apply_mask(clim, sftlf_file, realm)

    create_plot(
        clim,
        ds.attrs["source_id"],
        inargs.season,
        gridlines=inargs.gridlines,
        levels=inargs.cbar_levels
    )
    plt.savefig(
        inargs.output_file,
        dpi=200,
        bbox_inches="tight",
        facecolor="white",
    )


if __name__ == "__main__":
    description = "Plot the precipitation climatology for a given season."
    parser = argparse.ArgumentParser(description=description)
   
    parser.add_argument("pr_file", type=str, help="Precipitation data file")
    parser.add_argument("season", type=str, help="Season to plot")
    parser.add_argument("output_file", type=str, help="Output file name")

    parser.add_argument(
        "--gridlines",
        action="store_true",
        default=False,
        help="Include gridlines on the plot",
    )
    parser.add_argument(
        "--cbar_levels",
        type=float,
        nargs="*",
        default=None,
        help="list of levels / tick marks to appear on the colorbar",
    )
    parser.add_argument(
        "--mask",
        type=str,
        nargs=2,
        metavar=("SFTLF_FILE", "REALM"),
        default=None,
        help="""Provide sftlf file and realm to mask ('land' or 'ocean')""",
    )

    args = parser.parse_args()
    main(args)

Key Points

  • For large arrays, looping over each element can be slow in high-level languages like Python.
  • Vectorised operations can be used to avoid looping over array elements.

Content from Defensive programming


Last updated on 2024-07-25 | Edit this page

Estimated time: 30 minutes

Overview

Questions

  • How can I make my programs more reliable?

Objectives

  • Signal errors by raising exceptions.
  • Use try-except blocks to catch and handle exceptions.
  • Explain what an assertion is.
  • Add assertions that check the program’s state is correct.
  • Use a logging framework to report on program activity.
  • Identify sources of more advanced lessons on code testing.

Scientist’s nightmare

If you needed any motivation to learn and employ the principles of defensive programming, look no further than this article. It documents the experience of a researcher who had to retract five published papers - three of which were in Science - because his code had inadvertently switched the rows and columns of a data table.

Now that we’ve written plot_precipitation_climatology.py, how can we be sure that it’s producing reliable results?

The first step toward getting the right answers from our programs is to assume that mistakes will happen and to guard against them. This is called defensive programming, and there are a number of tools and approaches at our disposal for doing this. Broadly speaking, we can raise and handle errors to check and respond to program inputs, use assertions to make sure nothing crazy or unexpected has happened, write unit tests to make sure each component of our program produces expected outputs, and use a logging framework to report on program activity. In this lesson, we’ll look at how error handling, assertions and logging can make the unit conversion in our program more reliable, and we’ll provide links to further information on unit testing.

Types of errors


There are essentially two kinds of errors that can arise in Python: syntax errors and exceptions. You’re probably familiar with the former:

PYTHON

rainfall = 5
if rainfall > 10
    print("heavy rainfall")

ERROR

    if rainfall > 10
                    ^
SyntaxError: expected ':'

Once a statement or expression is syntactically correct, it may cause an error when an attempt is made to execute it. Errors detected during execution are called exceptions (i.e. an exception from normal behaviour) and there are lots of different types of exceptions:

PYTHON

10 * (1/0)

ERROR

ZeroDivisionError: division by zero

PYTHON

4 + spam*3

ERROR

NameError: name 'spam' is not defined

PYTHON

"2" + 2

ERROR

TypeError: can only concatenate str (not "int") to str

Raising errors


With respect to defensive programming, it can sometimes be useful to raise your own exceptions (using the raise keyword).

PYTHON

infile = "temperature_data.txt"
file_format = infile.split(".")[-1]
if file_format != "nc":
    raise ValueError(f"{infile} does not have the netCDF file extension .nc")

ERROR

ValueError                                Traceback (most recent call last)
/var/folders/6v/vrpsky6j509dff7250jyg8240000gp/T/ipykernel_12425/3612736507.py in <module>
      2 file_format = infile.split(".")[-1]
      3 if file_format != "nc":
----> 4     raise ValueError(f"{infile} does not have the netCDF file extension .nc")

ValueError: temperature_data.txt does not have the netCDF file extension .nc

In this case we’ve chosen to raise a ValueError, but we could pick any of the builtin exception types (including just a generic Exception) or define our own custom exception class.
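Defining a custom exception class only requires subclassing Exception. As a rough sketch (the FileFormatError name is purely illustrative and isn’t something we’ll add to our script):

PYTHON

class FileFormatError(Exception):
    """Raised when an input file does not have the expected format."""


infile = "temperature_data.txt"
if not infile.endswith(".nc"):
    raise FileFormatError(f"{infile} does not have the netCDF file extension .nc")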

In the context of our plot_precipitation_climatology.py script, we currently multiply our data by 86400 regardless of what the input units are. It would be better if we modified the main function so that the program multiplied by 86400 if the input units are kg m-2 s-1, performed no unit conversion if the input units are mm/day, or halted with an informative error message if the input data have some other units.

PYTHON

input_units = clim.attrs["units"]
if input_units == "kg m-2 s-1":
    clim = convert_pr_units(clim)
elif input_units == "mm/day":
    pass
else:
    raise ValueError("""Input units are not 'kg m-2 s-1' or 'mm/day'""")

Handling errors


As we’ve seen in the examples above, if exceptions aren’t dealt with the program crashes. The error message upon crashing is sometimes easy to understand (particularly if you wrote the raise statement yourself) but can often be cryptic.

If we’d rather the program didn’t crash when a particular exception occurs, we can use a try-except block to catch and handle the exception. The syntax of the try-except block is:

PYTHON

try:
    <do something>
except Exception:
    <handle the error>

The code in the except block is only executed if an exception occurred in the try block. The except block is required with a try block, even if it contains only the pass statement (i.e. ignore the exception and carry on). For example, let’s say there’s a calculation in our program where we need to divide by the number of available weather stations. If there were no weather stations available, by default the program would crash.

PYTHON

quantity = 500
n_stations = 0

scaled_quantity = quantity / n_stations
print(scaled_quantity)

ERROR

ZeroDivisionError                         Traceback (most recent call last)
/var/folders/6v/vrpsky6j509dff7250jyg8240000gp/T/ipykernel_12425/3927438267.py in <module>
      2 n_stations = 0
      3
----> 4 scaled_quantity = quantity / n_stations
      5 print(scaled_quantity)

ZeroDivisionError: division by zero

If we’d prefer the program simply continue with a scaled_quantity value of NaN, we could catch and handle the ZeroDivisionError.

PYTHON

import numpy as np


quantity = 500
n_stations = 0

try:
    scaled_quantity = quantity / n_stations
except ZeroDivisionError:
    scaled_quantity = np.nan

print(scaled_quantity)

OUTPUT

nan

In the context of our plot_precipitation_climatology.py script, the variable attributes from the input netCDF file are stored in a dictionary that we access via clim.attrs. The dictionary keys are the names of the variable attributes (e.g. standard_name, long_name, units) and the dictionary values are the values corresponding to those keys (e.g. precipitation_flux, Precipitation, kg m-2 s-1). If the input data file didn’t have a units attribute associated with the precipitation variable, our program would currently fail with a KeyError (which Python raises when you ask for a key that isn’t in a dictionary).

PYTHON

example_dict = {
    "standard_name": "precipitation_flux",
    "long_name": "Precipitation",
}
units = example_dict["units"]

ERROR

KeyError                                  Traceback (most recent call last)
/var/folders/6v/vrpsky6j509dff7250jyg8240000gp/T/ipykernel_12425/2679443625.py in <module>
      1 example_dict = {
      2     "standard_name": "precipitation_flux",
      3     "long_name": "Precipitation",
      4 }
----> 5 units = example_dict["units"]

KeyError: 'units'

It’s fine that our program would crash in this situation (we can’t continue if we don’t know the units of the input data), but the error message doesn’t explicitly tell the user that the input file requires a units attribute. To make this crystal clear, we could use a try-except block in the main function to catch the KeyError and re-define a better error message.

PYTHON

try:
    input_units = clim.attrs["units"]
except KeyError:
    raise KeyError(f"Precipitation variable in {inargs.pr_file} does not have a units attribute")

Assertions


Unexpected behaviour in a program can sometimes propagate a long way before triggering an exception or producing a perplexing result. For instance, if a calculation produces a non-physical value for precipitation (e.g. a negative value) that value could be used in various downstream calculations of drought and fire risk (combined with values for temperature, wind, humidity, etc). The final plot of the forest fire danger index might look wrong (or not, which would be even worse) to the scientist who wrote and executed the code, but it wouldn’t be immediately obvious that the precipitation calculation was the source of the problem.

In order to avoid propagation, it’s best to nip unexpected behaviour in the bud right when it occurs. One way to do this is to add assertions to your code. An assertion is simply a statement that something must be true at a certain point in a program. When Python sees one, it evaluates the assertion’s condition. If it’s true, Python does nothing, but if it’s false, Python halts the program immediately and raises an AssertionError with a custom error message.

To demonstrate an assertion in action, consider this piece of code that halts if any precipitation data is negative:

PYTHON

import numpy as np

pr_data = np.array([1.5, 2.3, 0.7, -0.2, 4.4])
assert pr_data.min() >= 0.0, "There is at least one negative precipitation value"

ERROR

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-19-33d87ea29ae4> in <module>()
----> 1 assert pr_data.min() >= 0.0, "There is at least one negative precipitation value"

AssertionError: There is at least one negative precipitation value

With respect to our command line program, one thing to check would be that the climatological precipitation values lie within a sensible range after the unit conversion. The world record highest daily rainfall total is 1825mm (at Reunion Island in 1966), so climatological values across the globe should be more than 0 but less than 2000 mm/day. We could add the following assertions to our convert_pr_units function to catch unexpected precipitation values.

PYTHON

assert da.data.min() >= 0.0, "There is at least one negative precipitation value"
assert da.data.max() < 2000, "There is a precipitation value/s > 2000 mm/day"

Assertions are also used in unit testing (each test culminates in an assertion), but that topic is beyond the scope of this lesson.
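(To give a very small taste, a unit test for our convert_pr_units function written with the popular pytest framework might look something like the sketch below; the test file name and the example value are purely illustrative.)

PYTHON

# Save as test_plot_precipitation_climatology.py alongside the script and run `pytest`
import numpy as np
import xarray as xr

from plot_precipitation_climatology import convert_pr_units


def test_convert_pr_units():
    """2.5e-5 kg m-2 s-1 should convert to 2.16 mm/day."""
    da = xr.DataArray(np.array([2.5e-5]), attrs={"units": "kg m-2 s-1"})
    converted = convert_pr_units(da)
    assert np.allclose(converted.data, 2.16)
    assert converted.attrs["units"] == "mm/day"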

Testing and continuous integration

An assertion checks that something is true at a particular point in the program. For programs that are more complex (or research critical) than plot_precipitation_climatology.py, it’s a good idea to take the next step and check the overall behavior of entire pieces (or units) of code. Related concepts like unit testing and continuous integration are beyond the scope of this lesson, but Research Software Engineering With Python has a chapter on testing that is well worth a read.

Logging


So far we’ve considered how to make our programs halt or handle the situation when things go wrong. Another option in our defensive programming toolkit is to have our programs report their own activity.

For example, let’s say we’re working with relative humidity data. We wouldn’t typically expect to encounter any values over 100%, but it is physically possible. Rather than halt the program if a value over 100% occurs, we might therefore want our program to simply report the maximum relative humidity. We can then decide whether to trust the output or not (e.g. a value of 100.1% might be ok but not 150%).

To do this reporting, our first instinct might be to add a print statement to the program.

PYTHON

rh_data = np.array([1.5, 20.4, 100.1, 76.3, 54.4])
rh_max = rh_data.max()
print(f"The maximum relative humidity was {rh_max}%")

OUTPUT

The maximum relative humidity was 100.1%

The problem with this approach is that information printed to the screen is lost once we close our command line session. Constantly adding, removing or commenting out print statements is also tedious and error-prone.

A better approach is to use a logging framework, such as Python’s logging library. This lets us leave debugging statements in our code and turn them on or off at will. Let’s start by replacing our print statement with a logging command.

PYTHON

import logging

rh_data = np.array([1.5, 20.4, 100.1, 76.3, 54.4])
rh_max = rh_data.max()
logging.debug(f"The maximum relative humidity was {rh_max}%")

Whoops! There’s no output because by default the logging library only reports information at the “warning” level and above. In order of increasing severity, the available levels are:

  • debug: very detailed information used for localizing errors.
  • info: confirmation that things are working as expected.
  • warning: something unexpected happened, but the program will keep going.
  • error: something has gone badly wrong, but the program hasn’t hurt anything.
  • critical: potential loss of data, security breach, etc.

If we want to see the output from less severe levels (i.e. turn our debugging statements on), we’d need to change the minimum level in the logging configuration. We can also provide the name of a file to write the logging information to, so that it isn’t lost when we finish our command line session.

PYTHON

# for loop only required in notebooks
for handler in logging.root.handlers[:]:
    logging.root.removeHandler(handler)
    
logging.basicConfig(level=logging.DEBUG, filename="log.txt")

rh_data = np.array([1.5, 20.4, 100.1, 76.3, 54.4])
rh_max = rh_data.max()
logging.debug(f"The maximum relative humidity was {rh_max}%")

(The for loop is needed to turn off the background logging the notebook does itself. It’s not needed in a Python script.)

Notice that we’ve used capital logging.DEBUG (which is an integer value) to set the logging level, as opposed to the logging.debug function that is used for logging a message.
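(If you’d like to see the difference for yourself, this quick check, which isn’t part of our script, shows that the level constants are just integers while logging.debug actually logs a message:)

PYTHON

import logging

print(logging.DEBUG, logging.INFO, logging.WARNING)  # 10 20 30
logging.debug("this logs a message at the DEBUG level")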

By setting the logging level to “DEBUG”, our output “log.txt” file will now capture all logging information at the DEBUG level or higher, meaning all logging outputs will be written to our log file.

BASH

$ cat log.txt

OUTPUT

DEBUG:root:The maximum relative humidity was 100.1%

In the context of the plot_precipitation_climatology.py script, it would be nice to know whether or not unit conversion was performed. To do this we just need a few small changes to the script. We need to import the logging library at the top of the script,

PYTHON

import logging

and set the logging configuration and add a logging.info command in the main function.

PYTHON

def main(inargs):
    """Run the program."""
    
    logging.basicConfig(level=logging.DEBUG, filename="log.txt") 
    
    ...
    
    if input_units == 'kg m-2 s-1':
        clim = convert_pr_units(clim)
        logging.info("Units converted from kg m-2 s-1 to mm/day")
    elif input_units == "mm/day":
        pass
    else:
        raise ValueError("""Input units are not 'kg m-2 s-1' or 'mm/day'""")
    
    ...

Update your script

Update your working copy of plot_precipitation_climatology.py with the changes from this lesson.

This will mean your main function will now read as follows,

PYTHON

def main(inargs):
    """Run the program."""

    logging.basicConfig(level=logging.DEBUG, filename="log.txt") 

    ds = xr.open_dataset(inargs.pr_file)
   
    clim = ds["pr"].groupby("time.season").mean("time", keep_attrs=True)

    try:
        input_units = clim.attrs["units"]
    except KeyError:
        raise KeyError(f"Precipitation variable in {inargs.pr_file} must have a units attribute")

    if input_units == "kg m-2 s-1":
        clim = convert_pr_units(clim)
        logging.info("Units converted from kg m-2 s-1 to mm/day")
    elif input_units == "mm/day":
        pass
    else:
        raise ValueError("""Input units are not 'kg m-2 s-1' or 'mm/day'""")

    if inargs.mask:
        sftlf_file, realm = inargs.mask
        clim = apply_mask(clim, sftlf_file, realm)

    create_plot(
        clim,
        dset.attrs["source_id"],
        inargs.season,
        gridlines=inargs.gridlines,
        levels=inargs.cbar_levels
    )    
    plt.savefig(
        inargs.output_file,
        dpi=200,
        bbox_inches="tight",
        facecolor="white",
    )

and your convert_pr_units as:

PYTHON

def convert_pr_units(da):
    """Convert kg m-2 s-1 to mm day-1.
   
    Args:
      da (xarray.DataArray): Precipitation data
    
    """
    
    da.data = da.data * 86400
    da.attrs["units"] = "mm/day"
   
    assert da.data.min() >= 0.0, "There is at least one negative precipitation value"
    assert da.data.max() < 2000, "There is a precipitation value/s > 2000 mm/day"
    
    return da

Verbose reporting

Add two new command line options to plot_precipitation_climatology.py.

The first should change the logging level so reporting from the program is more verbose.

PYTHON

parser.add_argument(
    "-v",
    "--verbose",
    action="store_true",
    default=False,
    help="Change the minimum logging reporting level from WARNING (default) to INFO",
)

The second should allow the user to specify the name of the log file. (If they don’t specify a name then the logging information is printed to the screen.)

PYTHON

parser.add_argument(
    "--logfile",
    type=str,
    default=None,
    help="Name of log file (by default logging information is printed to the screen)",
)

The basic configuration command at the top of the main function (logging.basicConfig(level=logging.DEBUG, filename="log.txt")) needs to be replaced with the following:

PYTHON

log_lev = logging.DEBUG if inargs.verbose else logging.WARNING
logging.basicConfig(level=log_lev, filename=inargs.logfile) 

Error handling for land/ocean masks

In the previous lesson we added an apply_mask function to our script. Update the following if statement in that function so that it raises a ValueError if the realm is not ocean or land.

PYTHON

if realm == "land":
    masked_da = da.where(ds["sftlf"].data < 50)
else:
    masked_da = da.where(ds["sftlf"].data > 50)

The updated if statement could read as follows:

PYTHON

if realm.lower() == 'land':
    masked_da = da.where(ds["sftlf"].data < 50)
elif realm.lower() == 'ocean':
    masked_da = da.where(ds["sftlf"].data > 50)
else:
    raise ValueError("""Mask realm is not 'ocean' or 'land'""")

plot_precipitation_climatology.py

At the conclusion of this lesson your plot_precipitation_climatology.py script should look something like the following:

PYTHON

import logging
import argparse

import xarray as xr
import cartopy.crs as ccrs
import matplotlib.pyplot as plt
import numpy as np
import cmocean


def convert_pr_units(da):
    """Convert kg m-2 s-1 to mm day-1.
    
    Args:
      da (xarray.DataArray): Precipitation data
   
    """

    da.data = da.data * 86400
    da.attrs["units"] = "mm/day"
   
    assert da.data.min() >= 0.0, "There is at least one negative precipitation value"
    assert da.data.max() < 2000, "There is a precipitation value/s > 2000 mm/day"

    return da


def apply_mask(da, sftlf_file, realm):
    """Mask ocean or land using a sftlf (land surface fraction) file.
   
    Args:
      da (xarray.DataArray): Data to mask
      sftlf_file (str): Land surface fraction file
      realm (str): Realm to mask
   
    """
  
    ds = xr.open_dataset(sftlf_file)
    if realm.lower() == "land":
        masked_da = da.where(ds["sftlf"].data < 50)
    elif realm.lower() == 'ocean':
        masked_da = da.where(ds["sftlf"].data > 50)   
    else:
        raise ValueError("""Mask realm is not 'ocean' or 'land'""")    

    return masked_da


def create_plot(clim, model, season, gridlines=False, levels=None):
    """Plot the precipitation climatology.
    
    Args:
      clim (xarray.DataArray): Precipitation climatology data
      model (str): Name of the climate model
      season (str): Season
     
    Kwargs:
      gridlines (bool): Select whether to plot gridlines
      levels (list): Tick marks on the colorbar    
    
    """

    if not levels:
        levels = np.arange(0, 13.5, 1.5)
       
    fig = plt.figure(figsize=[12,5])
    ax = fig.add_subplot(111, projection=ccrs.PlateCarree(central_longitude=180))
    clim.sel(season=season).plot.contourf(
        ax=ax,
        levels=levels,
        extend='max',
        transform=ccrs.PlateCarree(),
        cbar_kwargs={"label": clim.units},
        cmap=cmocean.cm.haline_r,
    )
    ax.coastlines()
    if gridlines:
        plt.gca().gridlines()
    
    title = f"{model} precipitation climatology ({season})"
    plt.title(title)


def main(inargs):
    """Run the program."""

    log_lev = logging.DEBUG if inargs.verbose else logging.WARNING
    logging.basicConfig(level=log_lev, filename=inargs.logfile) 

    ds = xr.open_dataset(inargs.pr_file)
    
    clim = ds['pr'].groupby("time.season").mean("time", keep_attrs=True)

    try:
        input_units = clim.attrs["units"]
    except KeyError:
        raise KeyError(f"Precipitation variable in {inargs.pr_file} does not have a units attribute")

    if input_units == "kg m-2 s-1":
        clim = convert_pr_units(clim)
        logging.info("Units converted from kg m-2 s-1 to mm/day")
    elif input_units == "mm/day":
        pass
    else:
        raise ValueError("""Input units are not 'kg m-2 s-1' or 'mm/day'""")

    if inargs.mask:
        sftlf_file, realm = inargs.mask
        clim = apply_mask(clim, sftlf_file, realm)

    create_plot(
        clim,
        ds.attrs["source_id"],
        inargs.season,
        gridlines=inargs.gridlines,
        levels=inargs.cbar_levels
    )
    plt.savefig(
        inargs.output_file,
        dpi=200,
        bbox_inches="tight",
        facecolor="white",
    )


if __name__ == "__main__":
    description = "Plot the precipitation climatology for a given season."
    parser = argparse.ArgumentParser(description=description)
   
    parser.add_argument("pr_file", type=str, help="Precipitation data file")
    parser.add_argument("season", type=str, help="Season to plot")
    parser.add_argument("output_file", type=str, help="Output file name")

    parser.add_argument(
        "--gridlines",
        action="store_true",
        default=False,
        help="Include gridlines on the plot",
    )
    parser.add_argument(
        "--cbar_levels",
        type=float,
        nargs="*",
        default=None,
        help="list of levels / tick marks to appear on the colorbar",
    )
    parser.add_argument(
        "--mask",
        type=str,
        nargs=2,
        metavar=("SFTLF_FILE", "REALM"),
        default=None,
        help="""Provide sftlf file and realm to mask ('land' or 'ocean')""",
    )
    parser.add_argument(
        "-v",
        "--verbose",
        action="store_true",
        default=False,
        help="Change the minimum logging reporting level from WARNING (default) to DEBUG",
    )
    parser.add_argument(
        "--logfile",
        type=str,
        default=None,
        help="Name of log file (by default logging information is printed to the screen)",
    )

    args = parser.parse_args()
    main(args)

Key Points

  • Program defensively, i.e., assume that errors are going to arise, and write code to detect them when they do.
  • You can raise exceptions in your own code.
  • Put try-except blocks in programs to catch and handle exceptions.
  • Put assertions in programs to check their state as they run.
  • Use a logging framework instead of print statements to report program activity.
  • There are more advanced lessons you can read to learn about code testing.

Content from Data provenance


Last updated on 2024-07-25 | Edit this page

Estimated time: 30 minutes

Overview

Questions

  • How can I keep track of my data processing steps?

Objectives

  • Automate the process of recording the history of what was entered at the command line to produce a given data file or image.

We’ve now successfully created a command line program - plot_precipitation_climatology.py - that calculates and plots the precipitation climatology for a given season. The last step is to capture the provenance of that plot. In other words, we need a log of all the data processing steps that were taken from the initial download of the data file to the end result (i.e. the PNG image).

The simplest way to do this is to follow the lead of the NCO and CDO command line tools, which insert a record of what was executed at the command line into the history attribute of the output netCDF file. In fact, they were both used in the pre-processing of the data files used in these lessons:

PYTHON

import xarray as xr


accessesm15_pr_file = "data/pr_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_201001-201412.nc"
ds = xr.open_dataset(accessesm15_pr_file)

print(ds.attrs['history'])

OUTPUT

Tue Jan 12 14:50:35 2021: ncatted -O -a history,pr,d,, pr_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_201001-201412.nc
Tue Jan 12 14:48:10 2021: cdo seldate,2010-01-01,2014-12-31 /g/data/fs38/publications/CMIP6/CMIP/CSIRO/ACCESS-ESM1-5/historical/r1i1p1f1/Amon/pr/gn/latest/pr_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_185001-201412.nc pr_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_201001-201412.nc
2019-11-15T04:32:57Z ; CMOR rewrote data to be consistent with CMIP6, CF-1.7 CMIP-6.2 and CF standards.

Fortunately, there is a Python package called cmdline-provenance that creates NCO/CDO-style records of what was executed at the command line. We can use it to add to the command log, before inserting the updated log into the metadata associated with our output image file.

To start, let’s import the cmdline_provenance library with the other imports at the top of our plot_precipitation_climatology.py script,

PYTHON

import cmdline_provenance as cmdprov

and then make the following updates to the original line of code responsible for saving the image to file:

PYTHON

new_log = cmdprov.new_log(infile_logs={inargs.pr_file: ds.attrs["history"]})
plt.savefig(
    inargs.output_file,
    metadata={"History": new_log},
    dpi=200,
    bbox_inches="tight",
    facecolor="white",
)

When we execute plot_precipitation_climatology.py, cmdprov.new_log will create a record of what was entered at the command line. The name of the input precipitation file and its history attribute have been provided using the infile_logs argument, so that cmdprov.new_log can append the history of that file to the new command log. The metadata argument for plt.savefig has then been used to save the new command log to the image metadata.

To see this in action, we can run the program,

BASH

$ python plot_precipitation_climatology.py data/pr_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_201001-201412.nc SON pr_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_201001-201412-SON-clim.png

and then view the metadata of the new PNG image file. There are a number of different programs available to do this, but they can often be tricky to install. Fortunately, conda can be used to install programs written in many different languages, not just Python. There are installation recipes for a command line program called exiftool on anaconda.org, so let’s go ahead and install that:

BASH

$ conda install exiftool

Once installed, we can use it to view the metadata associated with our image file:

BASH

$ exiftool pr_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_201001-201412-SON-clim.png

OUTPUT

ExifTool Version Number         : 12.17
File Name                       : pr_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_201001-201412-SON-clim.png
Directory                       : .
File Size                       : 315 KiB
File Modification Date/Time     : 2021:02:08 10:16:56+11:00
File Access Date/Time           : 2021:02:08 10:16:58+11:00
File Inode Change Date/Time     : 2021:02:08 10:16:56+11:00
File Permissions                : rw-r--r--
File Type                       : PNG
File Type Extension             : png
MIME Type                       : image/png
Image Width                     : 2400
Image Height                    : 1000
Bit Depth                       : 8
Color Type                      : RGB with Alpha
Compression                     : Deflate/Inflate
Filter                          : Adaptive
Interlace                       : Noninterlaced
Software                        : Matplotlib version3.3.3, https://matplotlib.org/
History                         : Mon Feb 08 09:45:17 2021: /Users/damien/opt/anaconda3/envs/pyaos-lesson/bin/python code/plot_precipitation_climatology.py data/pr_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_201001-201412.nc SON pr_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_201001-201412-SON-clim.png.Tue Jan 12 14:50:35 2021: ncatted -O -a history,pr,d,, pr_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_201001-201412.nc.Tue Jan 12 14:48:10 2021: cdo seldate,2010-01-01,2014-12-31 /g/data/fs38/publications/CMIP6/CMIP/CSIRO/ACCESS-ESM1-5/historical/r1i1p1f1/Amon/pr/gn/latest/pr_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_185001-201412.nc pr_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_201001-201412.nc.2019-11-15T04:32:57Z ; CMOR rewrote data to be consistent with CMIP6, CF-1.7 CMIP-6.2 and CF standards. .
Pixels Per Unit X               : 7874
Pixels Per Unit Y               : 7874
Pixel Units                     : meters
Image Size                      : 2400x1000
Megapixels                      : 2.4

Now that we’ve successfully added a command log to our PNG image, we might want to think about how our script should handle other image formats (e.g. PDF, EPS). For PNG files you can pick whatever metadata keys you like (hence we picked “History”), but other formats only allow specific keys. For now we can raise an error so that the program halts if someone tries to generate a format that isn’t PNG, and in the exercises we’ll add more valid formats to the script.

PYTHON

image_format = inargs.output_file.split(".")[-1]
if image_format != "png":
    raise ValueError("Output file must have the PNG image file extension .png")

Putting this all together, here’s what the main function in plot_precipitation_climatology.py looks like:

PYTHON

def main(inargs):
    """Run the program."""

    log_lev = logging.DEBUG if inargs.verbose else logging.WARNING
    logging.basicConfig(level=log_lev, filename=inargs.logfile) 

    ds = xr.open_dataset(inargs.pr_file)
    
    clim = ds["pr"].groupby("time.season").mean("time", keep_attrs=True)

    try:
        input_units = clim.attrs["units"]
    except KeyError:
        raise KeyError(f"Precipitation variable in {inargs.pr_file} does not have a units attribute")

    if input_units == "kg m-2 s-1":
        clim = convert_pr_units(clim)
        logging.info("Units converted from kg m-2 s-1 to mm/day")
    elif input_units == "mm/day":
        pass
    else:
        raise ValueError("""Input units are not 'kg m-2 s-1' or 'mm/day'""")

    if inargs.mask:
        sftlf_file, realm = inargs.mask
        clim = apply_mask(clim, sftlf_file, realm)

    create_plot(
        clim,
        ds.attrs["source_id"],
        inargs.season,
        gridlines=inargs.gridlines,
        levels=inargs.cbar_levels,
    )
    
    image_format = inargs.output_file.split(".")[-1]
    if image_format != "png":
        raise ValueError("Output file must have the PNG image file extension .png")
    new_log = cmdprov.new_log(infile_logs={inargs.pr_file: ds.attrs["history"]})
    plt.savefig(
        inargs.output_file,
        metadata={"History": new_log},
        dpi=200,
        bbox_inches="tight",
        facecolor="white",
    )

Handling different image formats

The plt.savefig documentation provides information on the metadata keys accepted by PNG, PDF, EPS and PS image formats.

Using that information as a guide, add a new function called get_log_and_key to the plot_precipitation_climatology.py script. It should take the precipitation file name, history attribute, and plot type as arguments and return the appropriate metadata key (for PNG, PDF, EPS or PS image formats) together with the updated command log.

When you’re done, the final lines of the main function should read as follows:

PYTHON

log_key, new_log = get_log_and_key(
    inargs.pr_file,
    ds.attrs["history"],
    inargs.output_file.split(".")[-1],
)
plt.savefig(
    inargs.output_file,
    metadata={log_key: new_log},
    dpi=200,
    bbox_inches="tight",
    facecolor="white",
)

The new function could read as follows:

PYTHON

def get_log_and_key(pr_file, history_attr, plot_type):
    """Get key and command line log for image metadata.
   
    Different image formats allow different metadata keys.
   
    Args:
      pr_file (str): Input precipitation file
      history_attr (str): History attribute from pr_file
      plot_type (str): File format for output image
   
    """
    
    valid_keys = {"png": "History",
                  "pdf": "Title",
                  "eps": "Creator",
                  "ps" : "Creator",
    }
    if plot_type in valid_keys.keys():
        log_key = valid_keys[plot_type]
    else:
        raise ValueError(f"Image format not one of: {*[*valid_keys],}")
    new_log = cmdprov.new_log(infile_logs={pr_file: history_attr})
   
    return log_key, new_log

Writing the command log to netCDF files

Instead of calculating the seasonal climatology and creating a plot all in the one script, we could have decided to make it a two step process. The first script in the process could take the original precipitation data file and output a new netCDF file containing the seasonal climatology, and the second script could take the seasonal climatology file and create a plot from that.

If the first script reads as follows, how would you edit it so that an updated command log is included in the history attribute of the output netCDF file?

PYTHON

import logging
import argparse

import numpy as np
import xarray as xr


def convert_pr_units(da):
    """Convert kg m-2 s-1 to mm day-1.
    
    Args:
      da (xarray.DataArray): Precipitation data
    """

    da.data = da.data * 86400
    da.attrs["units"] = "mm/day"

    assert da.data.min() >= 0.0, "There is at least one negative precipitation value"
    assert da.data.max() < 2000, "There is a precipitation value/s > 2000 mm/day"

    return da


def main(inargs):
    """Run the program."""

    log_lev = logging.DEBUG if inargs.verbose else logging.WARNING
    logging.basicConfig(level=log_lev, filename=inargs.logfile)

    in_ds = xr.open_dataset(inargs.pr_file)
    clim = in_ds["pr"].groupby("time.season").mean("time", keep_attrs=True)

    try:
        input_units = clim.attrs["units"]
    except KeyError:
        raise KeyError(f"Precipitation variable in {inargs.pr_file} does not have a units attribute")

    if input_units == "kg m-2 s-1":
        clim = convert_pr_units(clim)
        logging.info("Units converted from kg m-2 s-1 to mm/day")
    elif input_units == "mm/day":
        pass
    else:
        raise ValueError("""Input units are not 'kg m-2 s-1' or 'mm/day'""")

    out_ds = clim.to_dataset()
    out_ds.attrs = in_ds.attrs
    out_ds.to_netcdf(inargs.output_file)
   

if __name__ == "__main__":
    description = "Calculate the seasonal precipitation climatology."
    parser = argparse.ArgumentParser(description=description)
    parser.add_argument("pr_file", type=str, help="Precipitation data file")
    parser.add_argument("output_file", type=str, help="Output file name")
    parser.add_argument(
        "-v",
        "--verbose",
        action="store_true",
        default=False,
        help="Change the minimum logging reporting level from WARNING (default) to DEBUG",
    )
    parser.add_argument(
        "--logfile",
        type=str,
        default=None,
        help="Name of log file (by default logging information is printed to the screen)",
    )
    args = parser.parse_args()  
    main(args)

The beginning of the script would need to be updated to import the cmdline_provenance library:

PYTHON

import cmdline_provenance as cmdprov

The body of the main function would then need to be updated to write the new command log to the history attribute of the output dataset:

PYTHON

out_ds.attrs = in_ds.attrs
new_log = cmdprov.new_log(infile_logs={inargs.pr_file: in_ds.attrs["history"]})
out_ds.attrs["history"] = new_log
out_ds.to_netcdf(inargs.output_file)

plot_precipitation_climatology.py

At the conclusion of this lesson your plot_precipitation_climatology.py script should look something like the following:

PYTHON

import logging
import argparse

import numpy as np
import matplotlib.pyplot as plt
import xarray as xr
import cartopy.crs as ccrs
import cmocean
import cmdline_provenance as cmdprov


def convert_pr_units(da):
    """Convert kg m-2 s-1 to mm day-1.
    
    Args:
      da (xarray.DataArray): Precipitation data
   
    """
    
    da.data = da.data * 86400
    da.attrs["units"] = "mm/day"
    
    assert da.data.min() >= 0.0, "There is at least one negative precipitation value"
    assert da.data.max() < 2000, "There is a precipitation value/s > 2000 mm/day"
    
    return da


def apply_mask(da, sftlf_file, realm):
    """Mask ocean or land using a sftlf (land surface fraction) file.
   
    Args:
      da (xarray.DataArray): Data to mask
      sftlf_file (str): Land surface fraction file
      realm (str): Realm to mask
   
    """
  
    ds = xr.open_dataset(sftlf_file)
    if realm.lower() == 'land':
        masked_da = da.where(ds["sftlf"].data < 50)
    elif realm.lower() == 'ocean':
        masked_da = da.where(ds["sftlf"].data > 50)   
    else:
        raise ValueError("""Mask realm is not 'ocean' or 'land'""")

    return masked_da


def create_plot(clim, model, season, gridlines=False, levels=None):
    """Plot the precipitation climatology.
    
    Args:
      clim (xarray.DataArray): Precipitation climatology data
      model (str): Name of the climate model
      season (str): Season
      
    Kwargs:
      gridlines (bool): Select whether to plot gridlines
      levels (list): Tick marks on the colorbar    
    
    """

    if not levels:
        levels = np.arange(0, 13.5, 1.5)
        
    fig = plt.figure(figsize=[12,5])
    ax = fig.add_subplot(111, projection=ccrs.PlateCarree(central_longitude=180))
    clim.sel(season=season).plot.contourf(
        ax=ax,
        levels=levels,
        extend="max",
        transform=ccrs.PlateCarree(),
        cbar_kwargs={"label": clim.units},
        cmap=cmocean.cm.haline_r
    )
    ax.coastlines()
    if gridlines:
        plt.gca().gridlines()
    
    title = f"{model} precipitation climatology ({season})"
    plt.title(title)


def get_log_and_key(pr_file, history_attr, plot_type):
    """Get key and command line log for image metadata.
   
    Different image formats allow different metadata keys.
   
    Args:
      pr_file (str): Input precipitation file
      history_attr (str): History attribute from pr_file
      plot_type (str): File format for output image
   
    """
    
    valid_keys = {"png": "History",
                  "pdf": "Title",
                  "svg": "Title",
                  "eps": "Creator",
                  "ps" : "Creator",
    }    

    assert plot_type in valid_keys.keys(), f"Image format not one of: {*[*valid_keys],}"
    log_key = valid_keys[plot_type]
    new_log = cmdprov.new_log(infile_logs={pr_file: history_attr})
    
    return log_key, new_log
   

def main(inargs):
    """Run the program."""

    log_lev = logging.DEBUG if inargs.verbose else logging.WARNING
    logging.basicConfig(level=log_lev, filename=inargs.logfile)

    ds = xr.open_dataset(inargs.pr_file)
    
    clim = ds["pr"].groupby("time.season").mean("time", keep_attrs=True)

    try:
        input_units = clim.attrs["units"]
    except KeyError:
        raise KeyError("Precipitation variable in {inargs.pr_file} must have a units attribute")

    if input_units == "kg m-2 s-1":
        clim = convert_pr_units(clim)
        logging.info("Units converted from kg m-2 s-1 to mm/day")
    elif input_units == "mm/day":
        pass
    else:
        raise ValueError("""Input units are not 'kg m-2 s-1' or 'mm/day'""")

    if inargs.mask:
        sftlf_file, realm = inargs.mask
        clim = apply_mask(clim, sftlf_file, realm)

    create_plot(
        clim,
        ds.attrs["source_id"],
        inargs.season,
        gridlines=inargs.gridlines,
        levels=inargs.cbar_levels,
    )
                
    log_key, new_log = get_log_and_key(
        inargs.pr_file,
        ds.attrs["history"],
        inargs.output_file.split('.')[-1],
    )
    plt.savefig(
        inargs.output_file,
        metadata={log_key: new_log},
        dpi=200,
        bbox_inches="tight",
        facecolor="white",
    )


if __name__ == "__main__":
    description = "Plot the precipitation climatology for a given season."
    parser = argparse.ArgumentParser(description=description)
    
    parser.add_argument("pr_file", type=str, help="Precipitation data file")
    parser.add_argument("season", type=str, help="Season to plot")
    parser.add_argument("output_file", type=str, help="Output file name")

    parser.add_argument(
        "--gridlines",
        action="store_true",
        default=False,
        help="Include gridlines on the plot",
    )
    parser.add_argument(
        "--cbar_levels",
        type=float,
        nargs="*",
        default=None,
        help="list of levels / tick marks to appear on the colorbar",
    )
    parser.add_argument(
        "--mask",
        type=str,
        nargs=2,
        metavar=("SFTLF_FILE", "REALM"),
        default=None,
        help="""Provide sftlf file and realm to mask ('land' or 'ocean')""",
    )
    parser.add_argument(
        "-v",
        "--verbose",
        action="store_true",
        default=False,
        help="Change the minimum logging reporting level from WARNING (default) to DEBUG",
    )
    parser.add_argument(
        "--logfile",
        type=str,
        default=None,
        help="Name of log file (by default logging information is printed to the screen)",
    )

    args = parser.parse_args()
    main(args)

Key Points

  • It is possible (in only a few lines of code) to record the provenance of a data file or image.

Content from Large data


Last updated on 2024-07-25 | Edit this page

Estimated time: 60 minutes

Overview

Questions

  • How do I work with multiple CMIP files that won’t fit in memory?

Objectives

  • Inspect netCDF chunking.
  • Import the dask library and start a client with parallel workers.
  • Calculate and plot the maximum daily precipitation for a high resolution model.

So far we’ve been working with small, individual data files that can be comfortably read into memory on a modern laptop. What if we wanted to process a larger dataset that consists of many files and/or much larger file sizes? For instance, let’s say the next step in our global precipitation analysis is to plot the daily maximum precipitation over the 1850-2014 period for the high resolution CNRM-CM6-1-HR model.

Data download

Instructors teaching this lesson can download the CNRM-CM6-1-HR daily precipitation data from the Earth System Grid Federation (ESGF). See the instructor notes for details. Since it is a very large download (45 GB), learners are not expected to download the data. (None of the exercises at the end of the lesson require downloading the data.)

At the Unix shell, we can inspect the dataset and see that the daily maximum precipitation data for the 1850-2014 period has been broken up into 25-year chunks and spread across seven netCDF files, most of which are 6.7 GB in size.

BASH

$ ls -lh data/pr_day*.nc

OUTPUT

-rw-r-----  1 irving  staff   6.7G 26 Jan 12:09 pr_day_CNRM-CM6-1-HR_historical_r1i1p1f2_gr_18500101-18741231.nc
-rw-r-----  1 irving  staff   6.7G 26 Jan 13:14 pr_day_CNRM-CM6-1-HR_historical_r1i1p1f2_gr_18750101-18991231.nc
-rw-r-----  1 irving  staff   6.7G 26 Jan 14:17 pr_day_CNRM-CM6-1-HR_historical_r1i1p1f2_gr_19000101-19241231.nc
-rw-r-----  1 irving  staff   6.7G 26 Jan 15:20 pr_day_CNRM-CM6-1-HR_historical_r1i1p1f2_gr_19250101-19491231.nc
-rw-r-----  1 irving  staff   6.7G 26 Jan 16:24 pr_day_CNRM-CM6-1-HR_historical_r1i1p1f2_gr_19500101-19741231.nc
-rw-r-----  1 irving  staff   6.7G 26 Jan 17:27 pr_day_CNRM-CM6-1-HR_historical_r1i1p1f2_gr_19750101-19991231.nc
-rw-r-----  1 irving  staff   4.0G 26 Jan 18:05 pr_day_CNRM-CM6-1-HR_historical_r1i1p1f2_gr_20000101-20141231.nc

In order to work with these files, we can use xarray to open a “multifile” dataset as though it were a single file. Our first step is to open a new Jupyter notebook and import a library with a rather unfortunate name.

PYTHON

import glob

The glob library contains a function, also called glob, that finds files whose names match a pattern. We provide those patterns as strings: the character * matches zero or more characters, while ? matches any one character, just like at the Unix shell.

PYTHON

pr_files = glob.glob("data/pr_day*.nc")
pr_files.sort()
print(pr_files)

OUTPUT

['/Users/irving/Desktop/data-carpentry/data/pr_day_CNRM-CM6-1-HR_historical_r1i1p1f2_gr_18500101-18741231.nc',
 '/Users/irving/Desktop/data-carpentry/data/pr_day_CNRM-CM6-1-HR_historical_r1i1p1f2_gr_18750101-18991231.nc',
 '/Users/irving/Desktop/data-carpentry/data/pr_day_CNRM-CM6-1-HR_historical_r1i1p1f2_gr_19000101-19241231.nc',
 '/Users/irving/Desktop/data-carpentry/data/pr_day_CNRM-CM6-1-HR_historical_r1i1p1f2_gr_19250101-19491231.nc',
 '/Users/irving/Desktop/data-carpentry/data/pr_day_CNRM-CM6-1-HR_historical_r1i1p1f2_gr_19500101-19741231.nc',
 '/Users/irving/Desktop/data-carpentry/data/pr_day_CNRM-CM6-1-HR_historical_r1i1p1f2_gr_19750101-19991231.nc',
 '/Users/irving/Desktop/data-carpentry/data/pr_day_CNRM-CM6-1-HR_historical_r1i1p1f2_gr_20000101-20141231.nc']

Recall that when we first open data in xarray it simply (“lazily”) loads the metadata associated with the data and shows summary information about the contents of the dataset (i.e. it doesn’t read the actual data into memory). This may take a little time for a large multifile dataset.

PYTHON

import xarray as xr

ds = xr.open_mfdataset(pr_files, chunks={"time": "500MB"})

print(ds)

OUTPUT

<xarray.Dataset>
Dimensions:      (axis_nbounds: 2, lat: 360, lon: 720, time: 60265)
Coordinates:
  * lat          (lat) float64 -89.62 -89.12 -88.62 -88.13 ... 88.62 89.12 89.62
  * lon          (lon) float64 0.0 0.5 1.0 1.5 2.0 ... 358.0 358.5 359.0 359.5
  * time         (time) datetime64[ns] 1850-01-01T12:00:00 ... 2014-12-31T12:...
Dimensions without coordinates: axis_nbounds
Data variables:
    time_bounds  (time, axis_nbounds) datetime64[ns] dask.array<chunksize=(9131, 2), meta=np.ndarray>
    pr           (time, lat, lon) float32 dask.array<chunksize=(397, 360, 720), meta=np.ndarray>
Attributes:
    Conventions:            CF-1.7 CMIP-6.2
    creation_date:          2019-05-23T12:33:53Z
    description:            CMIP6 historical
    title:                  CNRM-CM6-1-HR model output prepared for CMIP6 and...
    activity_id:            CMIP
    contact:                contact.cmip@meteo.fr
    data_specs_version:     01.00.21
    dr2xml_version:         1.16
    experiment_id:          historical
    experiment:             all-forcing simulation of the recent past
    external_variables:     areacella
    forcing_index:          2
    frequency:              day
    further_info_url:       https://furtherinfo.es-doc.org/CMIP6.CNRM-CERFACS...
    grid:                   data regridded to a 359 gaussian grid (360x720 la...
    grid_label:             gr
    nominal_resolution:     50 km
    initialization_index:   1
    institution_id:         CNRM-CERFACS
    institution:            CNRM (Centre National de Recherches Meteorologiqu...
    license:                CMIP6 model data produced by CNRM-CERFACS is lice...
    mip_era:                CMIP6
    parent_experiment_id:   p i C o n t r o l
    parent_mip_era:         CMIP6
    parent_activity_id:     C M I P
    parent_source_id:       CNRM-CM6-1-HR
    parent_time_units:      days since 1850-01-01 00:00:00
    parent_variant_label:   r1i1p1f2
    branch_method:          standard
    branch_time_in_parent:  0.0
    branch_time_in_child:   0.0
    physics_index:          1
    product:                model-output
    realization_index:      1
    realm:                  atmos
    references:             http://www.umr-cnrm.fr/cmip6/references
    source:                 CNRM-CM6-1-HR (2017):  aerosol: prescribed monthl...
    source_id:              CNRM-CM6-1-HR
    source_type:            AOGCM
    sub_experiment_id:      none
    sub_experiment:         none
    table_id:               day
    variable_id:            pr
    variant_label:          r1i1p1f2
    EXPID:                  CNRM-CM6-1-HR_historical_r1i1p1f2
    CMIP6_CV_version:       cv=6.2.3.0-7-g2019642
    dr2xml_md5sum:          45d4369d889ddfb8149d771d8625e9ec
    xios_commit:            1442-shuffle
    nemo_gelato_commit:     84a9e3f161dade7_8250e198106a168
    arpege_minor_version:   6.3.3
    history:                none
    tracking_id:            hdl:21.14100/223fa794-73fe-4bb5-9209-8ff910f7dc40

We can see that our ds object is an xarray.Dataset, but notice now that each variable has type dask.array with a chunksize attribute. Dask will access the data chunk-by-chunk (rather than all at once), which is fortunate because at 45GB the full size of our dataset is much larger than the available RAM on our laptop (17GB in this example). Dask can also distribute chunks across multiple cores if we ask it to (i.e. parallel processing).

So how big should our chunks be? As a general rule they need to be small enough to fit comfortably in memory (because multiple chunks can be in memory at once), but large enough to avoid the time cost associated with asking Dask to manage/schedule lots of little chunks. The Dask documentation suggests that chunk sizes between 10MB and 1GB are common, so we’ve set the chunk size to 500MB in this example. Since our netCDF files are chunked by time, we’ve specified that the 500MB Dask chunks should also be along that axis. Performance would suffer dramatically if our Dask chunks weren’t aligned with our netCDF chunks.
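
If you want to confirm how the netCDF files themselves are chunked, one option (a minimal sketch, reusing the pr_files list created above) is to open a single file and look at the encoding of the pr variable, which should include a chunksizes entry if the file was written with internal chunking. (The same information is available at the Unix shell via ncdump -hs.)

PYTHON

# Open one of the files on its own and inspect how the pr variable is
# chunked on disk (None means the variable was written contiguously)
single_file = xr.open_dataset(pr_files[0])
print(single_file["pr"].encoding.get("chunksizes"))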

We can have the Jupyter notebook display the size and shape of our chunks, just to make sure they are indeed 500MB.

PYTHON

ds['pr'].data

OUTPUT

          Array                 Chunk
Bytes     62.48 GB              499.74 MB
Shape     (60265, 360, 720)     (482, 360, 720)
Count     307 Tasks             150 Chunks
Type      float32               numpy.ndarray

Now that we understand the chunking information contained in the metadata, let’s go ahead and calculate the daily maximum precipitation.

PYTHON

pr_max = ds["pr"].max("time", keep_attrs=True)
print(pr_max)

OUTPUT

<xarray.DataArray 'pr' (lat: 360, lon: 720)>
dask.array<nanmax-aggregate, shape=(360, 720), dtype=float32, chunksize=(360, 720), chunktype=numpy.ndarray>
Coordinates:
  * lat      (lat) float64 -89.62 -89.12 -88.62 -88.13 ... 88.62 89.12 89.62
  * lon      (lon) float64 0.0 0.5 1.0 1.5 2.0 ... 357.5 358.0 358.5 359.0 359.5
Attributes:
    long_name:           Precipitation
    units:               kg m-2 s-1
    online_operation:    average
    cell_methods:        area: time: mean
    interval_operation:  900 s
    interval_write:      1 d
    standard_name:       precipitation_flux
    description:         at surface; includes both liquid and solid phases fr...
    history:             none
    cell_measures:       area: areacella

It seems like the calculation happened instantaneously, but it’s actually just another “lazy” feature of xarray. It’s showing us what the output of the calculation would look like (i.e. a 360 by 720 array), but xarray won’t actually do the computation until the data is needed (e.g. to create a plot or write to a netCDF file).

To force xarray to do the computation we can use .compute() with the %%time Jupyter notebook magic command to record how long it takes:

PYTHON

%%time
pr_max_done = pr_max.compute()

OUTPUT

CPU times: user 4min 1s, sys: 54.2 s, total: 4min 55s
Wall time: 3min 44s

By processing the data chunk-by-chunk, we’ve avoided the memory error we would have generated had we tried to handle the whole dataset at once. A completion time of 3 minutes and 44 seconds isn’t too bad, but that was only using one core. We can try and speed things up by using a dask “client” to run the calculation in parallel across multiple cores:

PYTHON

from dask.distributed import Client

client = Client()

client

OUTPUT

Client                                    Cluster
Scheduler: tcp://127.0.0.1:59152          Workers: 4
Dashboard: http://127.0.0.1:8787/status   Cores: 4
                                          Memory: 17.18 GB

(Click on the dashboard link to watch what’s happening on each core.)

PYTHON

%%time
pr_max_done = pr_max.compute()

OUTPUT

CPU times: user 10.2 s, sys: 1.12 s, total: 11.3 s
Wall time: 2min 33s

By distributing the calculation across all four cores the processing time has dropped to 2 minutes and 33 seconds. That’s faster than the original 3 minutes and 44 seconds, but not four times faster, because there’s a time cost associated with setting up and coordinating jobs across all the cores.

Parallel isn’t always best

For smaller or very complex data processing tasks, the time associated with setting up and coordinating jobs across multiple cores can mean it takes longer to run in parallel than on just one core.

Now that we’ve computed the daily maximum precipitation, we can go ahead and plot it:

PYTHON

import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import numpy as np
import cmocean


pr_max_done.data = pr_max_done.data * 86400
pr_max_done.attrs["units"] = "mm/day"

fig = plt.figure(figsize=[12,5])
ax = fig.add_subplot(111, projection=ccrs.PlateCarree(central_longitude=180))
pr_max_done.plot.contourf(
    ax=ax,
    levels=np.arange(0, 450, 50),
    extend="max",
    transform=ccrs.PlateCarree(),
    cbar_kwargs={"label": pr_max.units},
    cmap=cmocean.cm.haline_r,
)
ax.coastlines()

model = ds.attrs["source_id"]
title = f"Daily maximum precipitation, 1850-2014 ({model})"
plt.title(title)

plt.show()
Daily maximum precipitation

Writing your own Dask-aware functions

So far leveraging Dask for our large data computations has been relatively simple, because almost all of xarray’s built-in operations (like max) work on Dask arrays.

If we wanted to chunk and parallelise code involving operations that aren’t built into xarray (e.g. an interpolation routine from the SciPy library) we’d first need to use the apply_ufunc or map_blocks function to make those operations “Dask-aware”. There’s an xarray tutorial that explains how to do this.
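
As a rough illustration of the apply_ufunc approach (a minimal sketch only; the replace_negatives function is made up for this example, and ds is the multifile dataset opened earlier), an element-wise NumPy function can be applied lazily, chunk-by-chunk, like this:

PYTHON

import numpy as np
import xarray as xr


def replace_negatives(array):
    """Set negative values to zero (operates on a plain NumPy array)."""
    return np.where(array < 0, 0, array)


# dask="parallelized" tells xarray to apply the function to each Dask
# chunk separately, keeping the computation lazy until .compute() is called
pr_clean = xr.apply_ufunc(
    replace_negatives,
    ds["pr"],
    dask="parallelized",
    output_dtypes=[ds["pr"].dtype],
)

The xarray tutorial linked above covers more complex cases (e.g. functions that reduce or add dimensions), which require specifying input and output core dimensions.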

Alternatives to Dask

While dask can be a good option for working with large data, it’s not always the best choice. For instance, in this lesson we’ve talked about the time cost associated with setting up and coordinating jobs across multiple cores, and the fact that it can take a bit of time and effort to make a function Dask-aware.

What other options/tactics do you have when working with a large, multi-file dataset? What are the pros and cons of these options?

Other options include writing a loop to process each netCDF file one at a time, or to process a series of smaller sub-regions one at a time.

By breaking the data processing down like this (as sketched below) you might avoid the need to make your function/s Dask-aware. If the loop only involves tens (as opposed to hundreds) of files/regions, it might also not be too slow. However, using Dask is neater and more scalable, because the code you write looks essentially the same regardless of whether you’re running it on a powerful supercomputer or a personal laptop.
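
For instance, a file-by-file version of the daily maximum calculation might look something like the following (a minimal sketch only, reusing the pr_files list from earlier; it assumes the per-file results fit comfortably in memory):

PYTHON

import glob

import xarray as xr

pr_files = sorted(glob.glob("data/pr_day*.nc"))

# Calculate the daily maximum for each file separately ...
max_per_file = []
for pr_file in pr_files:
    ds = xr.open_dataset(pr_file)
    max_per_file.append(ds["pr"].max("time", keep_attrs=True))
    ds.close()

# ... then take the maximum across the per-file maxima
pr_max = xr.concat(max_per_file, dim="file").max("file", keep_attrs=True)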

Find out what you’re working with

Set up a Dask client in your Jupyter notebook. How many cores do you have available and how much memory?

Run the following two commands in your notebook to set up the client and then type client to view the display of information:

PYTHON

from dask.distributed import Client
client = Client()
client

Applying this to your own work

In your own research, are there any data processing tasks that could benefit from chunking and/or parallelisation? If so, how would you implement it? What size would your chunks be and along what axis? Are all the operations involved already Dask-aware?

Key Points

  • Libraries such as dask and xarray can make loading, processing and visualising netCDF data much easier.
  • Dask can speed up processing through parallelism but care may be needed particularly with data chunking.