=================
Quick Start Guide
=================

This document will guide you through the basic steps to get started with GT4Py.

------------
Installation
------------

GT4Py contains a ``setup.py`` installation script and can be installed as usual with `pip`.
Additional commands are provided to install and remove the GridTools C++ sources, which are
not contained in the package. We strongly recommend creating a virtual environment for any
new project:

.. code:: bash

    python -m venv path_for_the_new_venv
    source path_for_the_new_venv/bin/activate
    pip install --upgrade pip wheel setuptools

Then clone the GT4Py repository and install the local copy, or install it directly from PyPI
with ``pip install gt4py``. For use with NVIDIA GPUs, add the ``[cudaXXX]`` optional
dependency, where ``XXX`` matches the CUDA version installed on your system (for example
``cuda11x`` for CUDA 11.x, as in the commented lines below):

.. code:: bash

    git clone https://github.com/GridTools/gt4py.git
    pip install ./gt4py
    # pip install ./gt4py[cuda11x]

Or

.. code:: bash

    pip install gt4py
    # pip install gt4py[cuda11x]

------------
Introduction
------------

In GT4Py, grid computations such as stencils are defined in a domain-specific language (DSL)
called GTScript. GTScript is syntactically a subset of Python, but has different semantics.
Computations defined in this DSL are translated by the GT4Py toolchain into Python code based
on `NumPy <https://numpy.org/>`_ or into C++/CUDA code based on the
`GridTools <https://github.com/GridTools/gridtools>`_ library. To be able to achieve full
performance with GridTools, data has to adhere to certain layout requirements, which are
taken care of by storing the data in GT4Py storage containers.

A Simple Example
----------------

Suppose we want to write a stencil computing a parameterized linear combination of multiple
3D fields. The stencil has two parameters to change the relative weights (``alpha`` and
``weight``) of the input fields (``field_a``, ``field_b``, and ``field_c``) to be combined:

.. code:: python

    import numpy as np

    # assuming fields of size nx, ny, nz
    result = np.zeros((nx, ny, nz))

    def stencil_example(field_a, field_b, field_c, result, alpha, weight):
        nx, ny, nz = result.shape
        for i in range(nx):
            for j in range(ny):
                for k in range(nz):
                    result[i, j, k] = field_a[i, j, k] - (1 - alpha) * (
                        field_b[i, j, k] - weight * field_c[i, j, k]
                    )

To express this calculation using GTScript, we create a function and use the DSL syntax to
represent the loop over the 3 dimensions:

.. code:: python

    import numpy as np
    import gt4py.cartesian.gtscript as gtscript

    backend = "numpy"

    @gtscript.stencil(backend=backend)
    def stencil_example(
        field_a: gtscript.Field[np.float64],
        field_b: gtscript.Field[np.float64],
        field_c: gtscript.Field[np.float64],
        result: gtscript.Field[np.float64],
        *,
        alpha: np.float64,
        weight: np.float64 = 2.0,
    ):
        with computation(PARALLEL), interval(...):
            result = field_a[0, 0, 0] - (1 - alpha) * (
                field_b[0, 0, 0] - weight * field_c[0, 0, 0]
            )

    assert callable(stencil_example) is True
    print(type(stencil_example), "\n", stencil_example)  # --> [backend="numpy"] ...

This definition expresses the operations (or *kernel*) performed at every point of the
computation domain to generate the output values. The indices inside the brackets are
interpreted as offsets relative to the current point in the iteration, not as absolute
positions in the data fields. For an explanation of the line
``with computation(PARALLEL), interval(...):``, please refer to the section
:ref:`Computations and Intervals`.
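For instance, non-zero offsets read values from neighboring grid points. The following
sketch reuses the ``backend`` and imports defined above; the stencil name ``smooth_i`` is
only for illustration and not part of GT4Py:

.. code:: python

    @gtscript.stencil(backend=backend)
    def smooth_i(
        field: gtscript.Field[np.float64],
        out: gtscript.Field[np.float64],
    ):
        with computation(PARALLEL), interval(...):
            # field[-1, 0, 0] and field[1, 0, 0] are the left and right
            # neighbors of the current point along the I axis
            out = (field[-1, 0, 0] + field[0, 0, 0] + field[1, 0, 0]) / 3.0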
.. note::

    While not required, it is recommended to specify *fields* as standard arguments and
    run-time *parameters* as *keyword-only* arguments.

Once the stencil kernel has been defined, we use GT4Py to generate an *implementation* of
this high-level definition as a callable object that we can use to apply the operations to
data. This can be done by just decorating the definition function with the ``stencil``
decorator provided by GT4Py. The ``stencil`` decorator generates code in Python or C++
depending on the ``backend`` specified by name. Currently, the following backends are
available:

* ``"numpy"``: a vectorized Python backend
* ``"gt:cpu_kfirst"``: a backend based on GridTools code, performance-optimized for x86 architectures
* ``"gt:cpu_ifirst"``: a GridTools backend targeting many-core architectures
* ``"gt:gpu"``: a GridTools backend targeting GPUs
* ``"cuda"``: a backend targeting NVIDIA GPUs using CUDA and some GridTools

The decorator further replaces the stencil definition function (here ``stencil_example``)
by a callable object that can be used as a function to call the generated code, which
modifies the passed data in place. Instead of using the ``stencil`` decorator, it is also
possible to compile the stencil using a regular function call receiving the definition
function:

.. code:: python

    import numpy as np
    import gt4py.cartesian.gtscript as gtscript

    def stencil_example(
        field_a: gtscript.Field[np.float64],
        field_b: gtscript.Field[np.float64],
        field_c: gtscript.Field[np.float64],
        result: gtscript.Field[np.float64],
        *,
        alpha: np.float64,
        weight: np.float64 = 2.0,
    ):
        with computation(PARALLEL), interval(...):
            result = field_a[0, 0, 0] - (1. - alpha) * (
                field_b[0, 0, 0] - weight * field_c[0, 0, 0]
            )

    stencil_example_numpy = gtscript.stencil(backend="numpy", definition=stencil_example)
    another_example_gt = gtscript.stencil(backend="gt:cpu_ifirst", definition=stencil_example)

The generated code is written to and compiled in a local ``.gt_cache`` folder. Subsequent
invocations will check whether a recent version of the stencil already exists in the cache.

----------
Allocation
----------

Since some backends require data to be in a certain layout in memory, GT4Py provides special
`NumPy`-like allocators. They work like the familiar set of routines used in `NumPy` for
allocation: ``ones``, ``zeros``, ``full`` and ``empty``. There is also ``from_array``, which
initializes the array to a provided array value. The result of these routines is either a
``numpy.ndarray`` (for CPU backends) or a ``cupy.ndarray`` (for GPU backends).

.. code:: python

    import numpy as np
    import gt4py.storage as gt_storage

    backend = "numpy"

    field_a = gt_storage.from_array(
        np.random.randn(10, 10, 10), np.float64, backend=backend, aligned_index=(0, 0, 0)
    )
    field_b = gt_storage.ones(
        (10, 10, 10), np.float64, backend=backend, aligned_index=(0, 0, 0)
    )
    field_c = gt_storage.zeros(
        (10, 10, 10), np.float64, backend=backend, aligned_index=(0, 0, 0)
    )
    result = gt_storage.empty(
        (10, 10, 10), np.float64, backend=backend, aligned_index=(0, 0, 0)
    )

    stencil_example(field_a, field_b, field_c, result, alpha=0.5)

The ``aligned_index`` specifies that the array is to be allocated such that the memory
address of the point given by ``aligned_index`` is `aligned` to a backend-dependent value.
For optimal performance, set the ``aligned_index`` to the point which is most frequently
used as the lower-left corner of the iteration domain for this field.
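As a quick sanity check (a sketch, assuming the ``"numpy"`` backend, the ``alpha=0.5`` call
above, and the default ``weight=2.0`` from the stencil definition), the storages behave like
NumPy arrays and can be compared against a reference computed directly with NumPy:

.. code:: python

    import numpy as np

    # reference: result = field_a - (1 - alpha) * (field_b - weight * field_c)
    # with alpha = 0.5 and the default weight = 2.0
    expected = np.asarray(field_a) - (1 - 0.5) * (
        np.asarray(field_b) - 2.0 * np.asarray(field_c)
    )
    np.testing.assert_allclose(np.asarray(result), expected)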
----------------
Array Interfaces
----------------

When passing buffers to stencils, they can be in any form that is compatible with
``np.asarray`` or ``cp.asarray``, respectively. Some meta information can be provided to
describe the correspondence between array dimensions and their semantic meaning (e.g. IJK).
Also, an index can be designated as the `origin` of the array to denote the start of the
index range considered to be the `iteration domain`. Specifically, the behavior is as
follows:

#. Dimensions can be denoted by adding a ``__gt_dims__`` attribute to the buffer object. It
   should be a tuple of strings, where the currently valid dimensions are ``"I"``, ``"J"``
   and ``"K"``, as well as string representations of integers to denote data dimensions,
   i.e. the dimensions of vectors, matrices or higher-order tensors stored per grid point.
   If ``__gt_dims__`` is not present, the dimensions specified in the ``Field`` annotation
   of the definition function serve as a default.

#. The origin can be specified with the ``__gt_origin__`` attribute, which is a tuple of
   ``int`` s. If no other `origin` is specified when calling the stencil, this value is
   where the `iteration domain` begins, i.e. the grid point with the lowest index where a
   value is written. The explicit ``origin`` keyword argument of the stencil call takes
   priority over this.

.. _Computations and Intervals:

--------------------------
Computations and Intervals
--------------------------

`Computations` and `intervals` determine the iteration space and schedule in the vertical
direction. The `computation` context determines the order in which the vertical dimension is
iterated over: ``FORWARD`` specifies an iteration from low to high index, while ``BACKWARD``
specifies an iteration from high to low index. For contexts declared ``PARALLEL``, no order
is assumed and only computations for which the result is the same irrespective of the
iteration order are allowed.

`Intervals` declare the range of indices for which the statements of the respective context
are applied. For example, ``interval(0, 1)`` declares that the following context is executed
for indices in [0, 1), i.e. only for `K=0`. ``interval(1, None)`` represents the indices in
[1, ∞), and ``interval(0, -1)`` all indices except the last.

For example, the Thomas algorithm to solve a linear, tridiagonal system of equations can be
implemented using a forward and a backward loop with specialized computations at the
beginning of each iteration:

.. code:: python

    import numpy as np
    import gt4py.cartesian.gtscript as gtscript

    @gtscript.stencil(backend=backend)
    def tridiagonal_solver(
        inf: gtscript.Field[np.float64],
        diag: gtscript.Field[np.float64],
        sup: gtscript.Field[np.float64],
        rhs: gtscript.Field[np.float64],
        out: gtscript.Field[np.float64],
    ):
        with computation(FORWARD):
            with interval(0, 1):
                sup = sup / diag
                rhs = rhs / diag
            with interval(1, None):
                sup = sup / (diag - sup[0, 0, -1] * inf)
                rhs = (rhs - inf * rhs[0, 0, -1]) / (diag - sup[0, 0, -1] * inf)
        with computation(BACKWARD):
            with interval(0, -1):
                out = rhs - sup * out[0, 0, 1]
            with interval(-1, None):
                out = rhs

However, ``PARALLEL`` also differs from ``FORWARD`` and ``BACKWARD`` in another way: for
``PARALLEL``, we can assume that each statement (i.e. each assignment within the context) is
applied to the full vertical domain before the next one starts. If ``FORWARD`` or
``BACKWARD`` is specified, however, all statements are applied to each slice with the same
``K``, one after the other, before moving to ``K+1`` or ``K-1``, respectively.
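To illustrate the sequential semantics of ``FORWARD``, consider a vertical running sum: at
each level it reads the value that was written at ``K-1`` in the previous step of the
iteration, which would not be well defined under ``PARALLEL``. This is a sketch reusing the
``backend`` defined earlier; the stencil name is only for illustration:

.. code:: python

    @gtscript.stencil(backend=backend)
    def cumulative_sum_k(
        inp: gtscript.Field[np.float64],
        out: gtscript.Field[np.float64],
    ):
        with computation(FORWARD):
            with interval(0, 1):
                out = inp
            with interval(1, None):
                # out[0, 0, -1] was written when the iteration visited level K-1
                out = out[0, 0, -1] + inp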
---------
Functions
---------

Functions allow you to reuse code elements and to structure your code. They are decorated
with ``@gtscript.function``.

.. code:: python

    @gtscript.function
    def ddx(v, h=0.1):
        v2 = v[-1, 0, 0] + v[1, 0, 0] - 2. * v[0, 0, 0]
        return v2 / (h * h)

    @gtscript.function
    def ddy(v, h=0.1):
        v2 = v[0, -1, 0] + v[0, 1, 0] - 2. * v[0, 0, 0]
        return v2 / (h * h)

    @gtscript.function
    def ddz(v, h=0.1):
        v2 = v[0, 0, -1] + v[0, 0, 1] - 2. * v[0, 0, 0]
        return v2 / (h * h)

    @gtscript.stencil(backend=backend)
    def laplacian(
        v: gtscript.Field[np.float64],
        lap: gtscript.Field[np.float64],
        *,
        h: np.float64 = 0.1
    ):
        with computation(PARALLEL), interval(1, -1):
            lap = ddx(v, h) + ddy(v, h) + ddz(v, h)

Functions are pure, that is, none of the passed fields are modified and the results must be
passed back using the ``return`` statement. In the above example, ``v`` is not modified.
Multiple return values are allowed:

.. code:: python

    @gtscript.function
    def ddxyz(v, h=0.1):
        x = v[-1, 0, 0] + v[1, 0, 0] - 2. * v[0, 0, 0]
        y = v[0, -1, 0] + v[0, 1, 0] - 2. * v[0, 0, 0]
        z = v[0, 0, -1] + v[0, 0, 1] - 2. * v[0, 0, 0]
        return x / (h * h), y / (h * h), z / (h * h)

    @gtscript.stencil(backend=backend)
    def laplace(
        v: gtscript.Field[np.float64],
        lap: gtscript.Field[np.float64],
        *,
        h: np.float64 = 0.1
    ):
        with computation(PARALLEL), interval(1, -1):
            x, y, z = ddxyz(v, h)
            lap = x + y + z

Functions can only be used for code inside of computation/interval blocks. There is no
overhead attached to function calls since they are inlined in the generated code.

-------------
Offset syntax
-------------

Offsets can be specified either as a list of offsets on all spatial axes, e.g.
``field[0, 0, 1]``, or only on the axes present in the field by naming the axis explicitly,
e.g. ``field[K+1]``.

------------
System Setup
------------

Compilation settings for the GT4Py backends generating C++ or CUDA code can be modified by
updating the default values in the ``gt4py.cartesian.config`` module. Note that most of the
system-dependent settings can also be modified using the following environment variables:

* ``BOOST_ROOT`` or ``BOOST_HOME``: root of the Boost library headers.
* ``CUDA_ROOT`` or ``CUDA_HOME``: installation prefix of the CUDA toolkit.
* ``GT_INCLUDE_PATH``: path prefix to an alternative installation of the GridTools header files.
* ``OPENMP_CPPFLAGS``: preprocessor arguments for OpenMP support.
* ``OPENMP_LDFLAGS``: arguments when linking executables with OpenMP support.

macOS
-----

The clang compiler supplied with the macOS Command Line Tools does not support the
``-fopenmp`` flag, but it does have support for OpenMP in the C preprocessor and can link
with OpenMP support if the ``libomp`` package is installed using ``homebrew``
(https://brew.sh/). Then set the following environment variables:

.. code:: bash

    export OPENMP_CPPFLAGS="-Xpreprocessor -fopenmp"
    export OPENMP_LDFLAGS="$(brew --prefix libomp)/lib/libomp.a"

Similarly, the Boost headers are most easily installed using ``homebrew``. Then set the
corresponding environment variable:

.. code:: bash

    export BOOST_ROOT=/usr/local/opt/boost
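For reference, both dependencies mentioned above can typically be installed in one step
before setting the environment variables (a sketch; ``libomp`` and ``boost`` are the formula
names currently provided by ``homebrew``):

.. code:: bash

    # install OpenMP runtime and Boost headers via homebrew
    brew install libomp boost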