Quick Start Guide
This document will guide you through the basic steps to get started with GT4Py.
Installation
GT4Py contains a setup.py
installation script and can be installed as usual with pip.
Additional commands are provided to install and remove the GridTools C++ sources, which are not contained in the package.
We strongly recommend creating a virtual environment for any new project:
python -m venv path_for_the_new_venv
source path_for_the_new_venv/bin/activate
pip install --upgrade pip wheel setuptools
Then clone the GT4Py repository and install the local copy, or install it directly from PyPI with `pip install gt4py`. For use with NVIDIA GPUs, add the `[cudaXXX]` optional dependency, where `XXX` matches the CUDA toolkit version installed on your system (e.g. `[cuda11x]` for CUDA 11.x).
git clone https://github.com/GridTools/gt4py.git
pip install ./gt4py
# pip install ./gt4py[cuda11x]
Or install directly from PyPI:
pip install gt4py
# pip install gt4py[cuda11x]
Introduction
In GT4Py, grid computations such as stencils are defined in a domain-specific language (DSL) called GTScript. GTScript is syntactically a subset of Python, but has different semantics. Computations defined in this DSL are translated by the GT4Py toolchain either into Python code based on NumPy or into C++/CUDA code based on the GridTools library. To achieve full performance with GridTools, data has to adhere to certain layout requirements, which are taken care of by storing the data in GT4Py storage containers.
A Simple Example
Suppose we want to write a stencil computing a parameterized linear combination of multiple 3D fields. The stencil has two parameters (`alpha` and `weight`) to change the relative weights of the input fields (`field_a`, `field_b`, and `field_c`) to be combined:
import numpy as np

# assuming fields of size nx, ny, nz
result = np.zeros((nx, ny, nz))

def stencil_example(field_a, field_b, field_c, result, alpha, weight):
    nx, ny, nz = result.shape
    for i in range(nx):
        for j in range(ny):
            for k in range(nz):
                result[i, j, k] = field_a[i, j, k] - (1 - alpha) * (
                    field_b[i, j, k] - weight * field_c[i, j, k]
                )
To express this calculation using GTScript, we create a function and use the DSL syntax to represent the loop over 3 dimensions:
import numpy as np
import gt4py.cartesian.gtscript as gtscript

backend = "numpy"

@gtscript.stencil(backend=backend)
def stencil_example(
    field_a: gtscript.Field[np.float64],
    field_b: gtscript.Field[np.float64],
    field_c: gtscript.Field[np.float64],
    result: gtscript.Field[np.float64],
    *,
    alpha: np.float64,
    weight: np.float64 = 2.0,
):
    with computation(PARALLEL), interval(...):
        result = field_a[0, 0, 0] - (1 - alpha) * (
            field_b[0, 0, 0] - weight * field_c[0, 0, 0]
        )

assert callable(stencil_example)
print(type(stencil_example), "\n", stencil_example)
# --> <class '_GT_.__main__.stencil_example.m_stencil_example__numpy'...
# --> <StencilObject: __main__.stencil_example> [backend="numpy"] ...
This definition expresses the operations (or kernel) performed at every point of the computation domain to generate the output values.
The indices inside the brackets are interpreted as offsets relative to the current point in the iteration, and not as absolute positions in the data fields.
For an explanation of the line `with computation(PARALLEL), interval(...):`, please refer to the section Computations and Intervals.
Note
While not required, it is recommended to specify fields as standard arguments and run-time parameters as keyword-only arguments.
Once the stencil kernel has been defined, we use GT4Py to generate an implementation of this high-level definition as a callable object that we can use to apply the operations to data. This can be done by just decorating the definition function with the `stencil` decorator provided by GT4Py. The `stencil` decorator generates code in Python or C++ depending on the `backend` specified by name.
Currently, the following backends are available:

- `"numpy"`: a vectorized Python backend
- `"gt:cpu_kfirst"`: a backend based on GridTools code, performance-optimized for x86 architectures
- `"gt:cpu_ifirst"`: a GridTools backend targeting many-core architectures
- `"gt:gpu"`: a GridTools backend targeting GPUs
- `"cuda"`: a backend targeting NVIDIA GPUs using CUDA and some GridTools
The decorator further replaces the stencil definition function (here `stencil_example`) by a callable object that can be used as a function to call the generated code, which modifies the passed data in place. Instead of using the `stencil` decorator, it is also possible to compile the stencil with a regular function call that receives the definition function:
import numpy as np
import gt4py.cartesian.gtscript as gtscript

def stencil_example(
    field_a: gtscript.Field[np.float64],
    field_b: gtscript.Field[np.float64],
    field_c: gtscript.Field[np.float64],
    result: gtscript.Field[np.float64],
    *,
    alpha: np.float64,
    weight: np.float64 = 2.0,
):
    with computation(PARALLEL), interval(...):
        result = field_a[0, 0, 0] - (1.0 - alpha) * (
            field_b[0, 0, 0] - weight * field_c[0, 0, 0]
        )

stencil_example_numpy = gtscript.stencil(backend="numpy", definition=stencil_example)
another_example_gt = gtscript.stencil(backend="gt:cpu_ifirst", definition=stencil_example)
The generated code is written to and compiled in a local `.gt_cache` folder. Subsequent invocations will check whether a recent version of the stencil already exists in the cache.
Allocation
Since some backends require data to be in a certain layout in memory, GT4Py provides special NumPy-like allocators. They work like the familiar NumPy allocation routines: `ones`, `zeros`, `full`, and `empty`. There is also `from_array`, which initializes the array from a provided array value. The result of these routines is either a `numpy.ndarray` (for CPU backends) or a `cupy.ndarray` (for GPU backends).
import numpy as np
import gt4py.storage as gt_storage

backend = "numpy"

field_a = gt_storage.from_array(
    np.random.randn(10, 10, 10),
    np.float64,
    backend=backend,
    aligned_index=(0, 0, 0),
)
field_b = gt_storage.ones(
    (10, 10, 10), np.float64, backend=backend, aligned_index=(0, 0, 0)
)
field_c = gt_storage.zeros(
    (10, 10, 10), np.float64, backend=backend, aligned_index=(0, 0, 0)
)
result = gt_storage.empty(
    (10, 10, 10), np.float64, backend=backend, aligned_index=(0, 0, 0)
)

stencil_example(field_a, field_b, field_c, result, alpha=0.5)
The `aligned_index` argument specifies that the array is allocated such that the memory address of the point given by `aligned_index` is aligned to a backend-dependent value. For optimal performance, set the `aligned_index` to the point at the lower-left corner of the iteration domain most frequently used for this field.
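What "aligned" means here can be illustrated with a toy pure-NumPy allocator that over-allocates a raw byte buffer and offsets into it until the first element lands on an alignment boundary. This is a simplified sketch of the general technique: `aligned_empty` is not a GT4Py function, and the real allocators additionally control strides and memory layout.

```python
import numpy as np

def aligned_empty(shape, dtype=np.float64, alignment=64):
    # Over-allocate a byte buffer, then slice into it so that the first
    # element sits on an `alignment`-byte boundary (toy version of what
    # aligned allocators do internally, for aligned_index=(0, 0, 0)).
    itemsize = np.dtype(dtype).itemsize
    nbytes = int(np.prod(shape)) * itemsize
    raw = np.empty(nbytes + alignment, dtype=np.uint8)
    start = raw.__array_interface__["data"][0]
    offset = (-start) % alignment
    return raw[offset : offset + nbytes].view(dtype).reshape(shape)

field = aligned_empty((10, 10, 10))
# The address of element (0, 0, 0) is now a multiple of 64:
print(field.__array_interface__["data"][0] % 64)  # -> 0
```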
Array Interfaces
When passing buffers to stencils, they can be in any form compatible with `np.asarray` or `cp.asarray`, respectively. Some meta information can be provided to describe the correspondence between array dimensions and their semantic meaning (e.g. IJK). Also, an index can be designated as the origin of the array to denote the start of the index range considered to be the iteration domain. Specifically, the behavior is as follows:
- Dimensions can be denoted by adding a `__gt_dims__` attribute to the buffer object. It should be a tuple of strings, where currently valid dimensions are `"I"`, `"J"`, `"K"`, as well as string representations of integers to represent data dimensions, i.e. the dimensions of vectors, matrices, or higher tensors per grid point. If `__gt_dims__` is not present, the dimensions specified in the `Field` annotation of the definition function serve as a default.
- The origin can be specified with the `__gt_origin__` attribute, which is a tuple of `int`s. If no other origin is specified when calling the stencil, this value is where the iteration domain begins, i.e. the grid point with the lowest index where a value is written. The explicit `origin` keyword when calling a stencil takes priority over this.
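As a sketch of how these attributes might be attached, consider a thin wrapper around a NumPy array. The class name `AnnotatedField` is purely illustrative and not part of GT4Py; the only contract assumed here is that the object works with `np.asarray` and carries the two attributes described above.

```python
import numpy as np

class AnnotatedField:
    # Hypothetical wrapper: exposes the buffer via np.asarray and attaches
    # the optional __gt_dims__ / __gt_origin__ metadata attributes.
    def __init__(self, data, dims, origin):
        self._data = np.asarray(data)
        self.__gt_dims__ = dims      # semantic meaning of each axis
        self.__gt_origin__ = origin  # default start of the iteration domain

    def __array__(self, dtype=None, copy=None):
        # `copy` accepted for NumPy 2.x compatibility; ignored in this sketch.
        return self._data if dtype is None else self._data.astype(dtype)

field = AnnotatedField(np.zeros((8, 8, 4)), dims=("I", "J", "K"), origin=(1, 1, 0))
```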
Computations and Intervals
Computations and intervals determine the iteration space and schedule in the vertical direction. The computation context determines the order in which the vertical dimension is iterated over. `FORWARD` specifies an iteration from low to high index, while `BACKWARD` is an iteration from high to low index. For contexts declared `PARALLEL`, no order is assumed, and only computations for which the result is the same irrespective of iteration order are allowed.
Intervals declare the range of indices for which the statements of the respective context are applied. For example, `interval(0, 1)` declares that the following context is executed for indices in [0, 1), i.e. only for K=0. `interval(1, None)` represents indices in [1, ∞), and `interval(0, -1)` all indices except the last.
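The bounds follow the same half-open convention as Python slices, including `None` and negative bounds, which the following plain-Python sketch (not GT4Py code) illustrates:

```python
nz = 6
levels = list(range(nz))  # vertical indices K = 0 .. 5

# interval(0, 1)    -> only K = 0
print(levels[0:1])     # -> [0]
# interval(1, None) -> K = 1 .. nz-1
print(levels[1:None])  # -> [1, 2, 3, 4, 5]
# interval(0, -1)   -> all levels except the last
print(levels[0:-1])    # -> [0, 1, 2, 3, 4]
```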
For example, the Thomas algorithm for solving a linear tridiagonal system of equations can be implemented using a forward and a backward loop, with specialized computations at the beginning of each iteration:
import numpy as np
import gt4py.cartesian.gtscript as gtscript

@gtscript.stencil(backend="numpy")
def tridiagonal_solver(
    inf: gtscript.Field[np.float64],
    diag: gtscript.Field[np.float64],
    sup: gtscript.Field[np.float64],
    rhs: gtscript.Field[np.float64],
    out: gtscript.Field[np.float64],
):
    with computation(FORWARD):
        with interval(0, 1):
            sup = sup / diag
            rhs = rhs / diag
        with interval(1, None):
            sup = sup / (diag - sup[0, 0, -1] * inf)
            rhs = (rhs - inf * rhs[0, 0, -1]) / (diag - sup[0, 0, -1] * inf)
    with computation(BACKWARD):
        with interval(0, -1):
            out = rhs - sup * out[0, 0, 1]
        with interval(-1, None):
            out = rhs
However, `PARALLEL` also differs from `FORWARD` and `BACKWARD` in another way: for `PARALLEL`, we can assume that each statement (i.e. each assignment within the context) is applied to the full vertical domain before the next one starts. If `FORWARD` or `BACKWARD` is specified, however, all statements are applied to each slice with the same K, one after the other, before moving to K+1 or K-1, respectively.
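The difference can be emulated in plain NumPy (a sketch of the scheduling semantics, not GT4Py code). Consider the two statements `tmp = a[0, 0, -1]` and `a = tmp + 1.0` over `interval(1, None)`; note GT4Py would reject the `PARALLEL` variant of this pattern precisely because its result depends on the schedule:

```python
import numpy as np

nz = 5
a0 = np.zeros(nz)  # initial column, collapsed to 1D for simplicity

# FORWARD-like schedule: both statements run at level k before moving to
# k+1, so level k sees the value just written at level k-1.
a_fwd = a0.copy()
for k in range(1, nz):
    tmp = a_fwd[k - 1]
    a_fwd[k] = tmp + 1.0

# PARALLEL-like schedule: each statement is applied over the whole column
# before the next statement starts, using values from before that statement.
tmp = np.empty(nz)
tmp[1:] = a0[:-1]
a_par = a0.copy()
a_par[1:] = tmp[1:] + 1.0

print(a_fwd)  # -> [0. 1. 2. 3. 4.]
print(a_par)  # -> [0. 1. 1. 1. 1.]
```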
Functions
Functions allow you to reuse code elements and to structure your code. They are decorated with `@gtscript.function`.
@gtscript.function
def ddx(v, h=0.1):
    v2 = v[-1, 0, 0] + v[1, 0, 0] - 2.0 * v[0, 0, 0]
    return v2 / (h * h)

@gtscript.function
def ddy(v, h=0.1):
    v2 = v[0, -1, 0] + v[0, 1, 0] - 2.0 * v[0, 0, 0]
    return v2 / (h * h)

@gtscript.function
def ddz(v, h=0.1):
    v2 = v[0, 0, -1] + v[0, 0, 1] - 2.0 * v[0, 0, 0]
    return v2 / (h * h)

@gtscript.stencil(backend=backend)
def laplacian(
    v: gtscript.Field[np.float64], lap: gtscript.Field[np.float64], *, h: np.float64 = 0.1
):
    with computation(PARALLEL), interval(1, -1):
        lap = ddx(v, h) + ddy(v, h) + ddz(v, h)
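As a quick plain-NumPy sanity check (not GT4Py code), the centered second-difference formula used in `ddx` is exact for a quadratic, whose second derivative is constant:

```python
import numpy as np

h = 0.1
x = np.arange(0.0, 1.0, h)
v = x**2  # d²/dx² (x²) = 2 everywhere

# Same centered second difference as ddx, written over a 1D NumPy array:
# (v[i-1] + v[i+1] - 2*v[i]) / h²  for interior points.
d2 = (v[:-2] + v[2:] - 2.0 * v[1:-1]) / (h * h)
print(np.allclose(d2, 2.0))  # -> True
```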
Functions are pure, that is, none of the passed fields are modified, and the results must be passed back using the `return` statement. In the above example, `v` is not modified. However, multiple return values are allowed:
@gtscript.function
def ddxyz(v, h=0.1):
    x = v[-1, 0, 0] + v[1, 0, 0] - 2.0 * v[0, 0, 0]
    y = v[0, -1, 0] + v[0, 1, 0] - 2.0 * v[0, 0, 0]
    z = v[0, 0, -1] + v[0, 0, 1] - 2.0 * v[0, 0, 0]
    return x / (h * h), y / (h * h), z / (h * h)

@gtscript.stencil(backend=backend)
def laplace(
    v: gtscript.Field[np.float64], lap: gtscript.Field[np.float64], *, h: np.float64 = 0.1
):
    with computation(PARALLEL), interval(1, -1):
        x, y, z = ddxyz(v, h)
        lap = x + y + z
Functions can only be used inside computation/interval blocks. There is no overhead attached to function calls, since they are inlined in the generated code.
Offset syntax
Offsets can be specified either as a list of offsets on all spatial axes, e.g. `field[0, 0, 1]`, or as offsets on the axes present, by specifying the axis: `field[K+1]`.
System Setup
Compilation settings for GT4Py backends generating C++ or CUDA code can be modified by updating the default values in the `gt4py.cartesian.config` module. Note that most of the system-dependent settings can also be modified using the following environment variables:
- `BOOST_ROOT` or `BOOST_HOME`: root of the Boost library headers.
- `CUDA_ROOT` or `CUDA_HOME`: installation prefix of the CUDA toolkit.
- `GT_INCLUDE_PATH`: path prefix to an alternative installation of GridTools header files.
- `OPENMP_CPPFLAGS`: preprocessor arguments for OpenMP support.
- `OPENMP_LDFLAGS`: arguments when linking executables with OpenMP support.
MacOS
The clang compiler supplied with the MacOS Command Line Tools does not support the `-fopenmp` flag, but it does have support for OpenMP in the C preprocessor and can link with OpenMP support if the libomp package is installed using homebrew (https://brew.sh/). Then set the following environment variables:
export OPENMP_CPPFLAGS="-Xpreprocessor -fopenmp"
export OPENMP_LDFLAGS="$(brew --prefix libomp)/lib/libomp.a"
Similarly, Boost headers are most easily installed using homebrew. Then set the corresponding environment variable:
export BOOST_ROOT=/usr/local/opt/boost