Multiplatform programming with xobjects
Xobjects provides low-level functionalities to the other Xsuite packages, allowing the same source code to be used on different platforms (CPUs and GPUs).
Data management
The following example shows how to use Xobjects to allocate and manipulate data in the memory of the computing platform (CPU or GPU).
Definition of a simple data structure class
An Xobjects data structure can be defined as follows:
import xobjects as xo
class DataStructure(xo.Struct):
    a = xo.Float64[:]
    b = xo.Float64[:]
    c = xo.Float64[:]
    s = xo.Float64
The structure has four fields: three of them (a, b and c) are arrays and one (s) is a scalar.
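Fields are not restricted to 64-bit floats. As a minimal sketch (the class MixedStructure is made up for illustration, and we assume the other standard Xobjects scalar types, e.g. xo.Int64, and fixed-size arrays declared with an explicit length), a structure mixing different kinds of fields could be defined as:
class MixedStructure(xo.Struct):
    n = xo.Int64       # integer scalar
    w = xo.Float64[3]  # fixed-size array (length fixed at class definition)
    v = xo.Float64[:]  # variable-size array (length chosen at instantiation)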
Allocation of a data object on CPU or GPU
The memory on which the data is stored is chosen by defining a context object, which is passed to the class constructor when the object is instantiated. For example, we can allocate an object of our data structure as follows:
ctx = xo.ContextCpu()
# ctx = xo.ContextCupy() # for NVIDIA GPUs
obj = DataStructure(_context=ctx,
                    a=[1,2,3], b=[4,5,6],
                    c=[0,0,0], s=0)
If ctx = xo.ContextCpu() the object is allocated on the CPU; if ctx = xo.ContextCupy() the object is allocated on the GPU.
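The context on which an object lives can be checked from the object itself. A minimal sketch, assuming the instance exposes a _context attribute matching the _context argument passed at construction:
print(isinstance(obj._context, xo.ContextCpu)) # gives: True (for the CPU context above)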
Access to the data
Independently of the context, the object is accessible in read/write mode directly from Python. For example:
print(obj.a[2]) # gives: 3
obj.a[2] = 10
print(obj.a[2]) # gives: 10
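Scalar fields can be read and written in the same way; for example, for the field s initialized to zero in the constructor above:
print(obj.s) # gives: 0.0
obj.s = 7.0
print(obj.s) # gives: 7.0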
Numpy-like access
Xobjects arrays can be viewed as numpy arrays (or numpy-like arrays on GPUs). For example, for our object:
arr = obj.a.to_nplike()
type(arr) # gives: numpy.ndarray
This object is a view and not a copy, which means that when we operate on the numpy array the underlying xobject is also modified. For example:
print(obj.a[0], obj.b[0]) # gives: 1.0 4.0
a_nplike = obj.a.to_nplike()
b_nplike = obj.b.to_nplike()
# We use np array algebra
a_nplike[:] = b_nplike - 1
print(obj.a[0], obj.b[0]) # gives: 3.0 4.0
# Usage of the .sum method of the numpy array
obj.s = a_nplike.sum()
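We can verify that the scalar field has been updated (a_nplike contains [3., 4., 5.] after the operation above, so the sum is 12):
print(obj.s) # gives: 12.0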
Kernel functions in C
Autogenerated C API
The definition of an Xobject in Python automatically triggers the generation of a set of functions (C API) that can be used in C code to access the data allocated from Python. The available functions for a given Xobject can be inspected using the method _gen_c_decl(). For our example structure this can be done by:
print(DataStructure._gen_c_decl())
which provides the following set of C functions:
typedef /*gpuglmem*/ struct DataStructure_s * DataStructure;
/*gpufun*/ DataStructure DataStructure_getp(DataStructure/*restrict*/ obj);
/*gpufun*/ ArrNFloat64 DataStructure_getp_a(DataStructure/*restrict*/ obj);
/*gpufun*/ int64_t DataStructure_len_a(DataStructure/*restrict*/ obj);
/*gpufun*/ double DataStructure_get_a(const DataStructure/*restrict*/ obj, int64_t i0);
/*gpufun*/ void DataStructure_set_a(DataStructure/*restrict*/ obj, int64_t i0, double value);
/*gpufun*/ /*gpuglmem*/double* DataStructure_getp1_a(DataStructure/*restrict*/ obj, int64_t i0);
/*gpufun*/ ArrNFloat64 DataStructure_getp_b(DataStructure/*restrict*/ obj);
/*gpufun*/ int64_t DataStructure_len_b(DataStructure/*restrict*/ obj);
/*gpufun*/ double DataStructure_get_b(const DataStructure/*restrict*/ obj, int64_t i0);
/*gpufun*/ void DataStructure_set_b(DataStructure/*restrict*/ obj, int64_t i0, double value);
/*gpufun*/ /*gpuglmem*/double* DataStructure_getp1_b(DataStructure/*restrict*/ obj, int64_t i0);
/*gpufun*/ ArrNFloat64 DataStructure_getp_c(DataStructure/*restrict*/ obj);
/*gpufun*/ int64_t DataStructure_len_c(DataStructure/*restrict*/ obj);
/*gpufun*/ double DataStructure_get_c(const DataStructure/*restrict*/ obj, int64_t i0);
/*gpufun*/ void DataStructure_set_c(DataStructure/*restrict*/ obj, int64_t i0, double value);
/*gpufun*/ /*gpuglmem*/double* DataStructure_getp1_c(DataStructure/*restrict*/ obj, int64_t i0);
/*gpufun*/ double DataStructure_get_s(const DataStructure/*restrict*/ obj);
/*gpufun*/ void DataStructure_set_s(DataStructure/*restrict*/ obj, double value);
/*gpufun*/ /*gpuglmem*/double* DataStructure_getp_s(DataStructure/*restrict*/ obj);
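For instance, DataStructure_get_a(ob, i) returns the value of obj.a[i] and DataStructure_set_c(ob, i, v) performs the equivalent of obj.c[i] = v. As a minimal sketch of how these functions combine in C code (the helper name DataStructure_sum_ab is made up for illustration and is not part of the generated API):
/*gpufun*/
double DataStructure_sum_ab(DataStructure ob, int64_t i){
    // sum of the i-th elements of the fields a and b
    return DataStructure_get_a(ob, i) + DataStructure_get_b(ob, i);
}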
Writing a C kernel
A C function that can be parallelized when running on a GPU is called a kernel.
As an example, using our data structure, we write a C kernel function (running on both CPU and GPU) that performs the element-by-element product of the arrays obj.a and obj.b and writes the result in obj.c. In the kernel code we use the functions of the autogenerated C API to access the data of our example DataStructure.
src = '''
/*gpukern*/
void myprod(DataStructure ob, int nelem){
    for (int ii=0; ii<nelem; ii++){ //vectorize_over ii nelem
        double a_ii = DataStructure_get_a(ob, ii);
        double b_ii = DataStructure_get_b(ob, ii);
        double c_ii = a_ii * b_ii;
        DataStructure_set_c(ob, ii, c_ii);
    } //end_vectorize
}
'''
Note the xobjects annotation /*gpukern*/, which specifies that the function is a kernel, as well as the annotations //vectorize_over ii nelem and //end_vectorize, which identify the variable over which the calculation can be parallelized and the corresponding range (i.e. 0 <= ii < nelem).
Compiling the kernel
The Xobjects context that we have already created to allocate the object in memory can also be used to compile the C code and access it from Python. This is done with the method add_kernels, providing the source code and a description of the kernels in the source code that we would like to call from Python:
ctx.add_kernels(
    sources=[src],
    kernels={'myprod': xo.Kernel(
        args=[xo.Arg(DataStructure, name='ob'),
              xo.Arg(xo.Int32, name='nelem')],
        n_threads='nelem')
    })
The argument n_threads specifies the name of an argument of the C function from which the number of threads to be used on the GPU is inferred.
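To illustrate, the following sketch defines a second kernel whose thread count is taken from a different argument. It is hypothetical (the kernel myscale and its arguments arr, n and factor are made up) and assumes that xo.Arg accepts pointer=True for bare array arguments:
src_scale = '''
/*gpukern*/
void myscale(/*gpuglmem*/ double* arr, int n, double factor){
    for (int ii=0; ii<n; ii++){ //vectorize_over ii n
        arr[ii] = arr[ii] * factor; // scale each element in place
    } //end_vectorize
}
'''
ctx.add_kernels(
    sources=[src_scale],
    kernels={'myscale': xo.Kernel(
        args=[xo.Arg(xo.Float64, pointer=True, name='arr'),
              xo.Arg(xo.Int32, name='n'),
              xo.Arg(xo.Float64, name='factor')],
        n_threads='n')  # one thread per element; range inferred from 'n'
    })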
Calling the kernel
The kernel can be called from Python as follows:
# obj.a contains [3., 4., 5.]
# obj.b contains [4., 5., 6.]
# obj.c contains [0., 0., 0.]
ctx.kernels.myprod(ob=obj, nelem=len(obj.a))
# obj.a contains [3., 4., 5.]
# obj.b contains [4., 5., 6.]
# obj.c contains [12., 20., 30.]
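The result can also be inspected through a numpy-like view, as introduced above:
print(obj.c.to_nplike()) # gives: [12. 20. 30.]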
Inspecting the source code
Before compiling, the context specializes our source code for the chosen platform. The autogenerated specialized code can be inspected from Python as follows:
print(ctx.kernels.myprod.specialized_source)
For our example kernel myprod, if the chosen context is a ContextCpu the generated specialized source is:
void myprod(DataStructure ob, int nelem){
    for (int ii=0; ii<nelem; ii++){ //autovectorized
        double a_ii = DataStructure_get_a(ob, ii);
        double b_ii = DataStructure_get_b(ob, ii);
        double c_ii = a_ii * b_ii;
        DataStructure_set_c(ob, ii, c_ii);
    }//end autovectorized
}
If the chosen context is a ContextCupy the generated specialized source is:
__global__
void myprod(DataStructure ob, int nelem){
    int ii; //autovectorized
    ii = blockDim.x * blockIdx.x + threadIdx.x; //autovectorized
    if (ii<nelem){ //autovectorized
        double a_ii = DataStructure_get_a(ob, ii);
        double b_ii = DataStructure_get_b(ob, ii);
        double c_ii = a_ii * b_ii;
        DataStructure_set_c(ob, ii, c_ii);
    }//end autovectorized
}
If the chosen context is a ContextPyopencl the generated specialized source is:
__kernel
void myprod(DataStructure ob, int nelem){
    int ii; //autovectorized
    ii = get_global_id(0); //autovectorized
    double a_ii = DataStructure_get_a(ob, ii);
    double b_ii = DataStructure_get_b(ob, ii);
    double c_ii = a_ii * b_ii;
    DataStructure_set_c(ob, ii, c_ii);
    //end autovectorized
}