Multiplatform programming with xobjects

Xobjects provides low-level functionalities to the other Xsuite packages, allowing to use the same code on different platforms (CPU and GPU).

Data management

The following example shows how to use Xobjects to allocate and manipulate date in the memory of the computing platform (CPU or GPU)

Definition of a simple data structure class

A Xobjects data structure can be defined as follows:

import xobjects as xo

class DataStructure(xo.Struct):
    a = xo.Float64[:]
    b = xo.Float64[:]
    c = xo.Float64[:]
    s = xo.Float64

The structure has four fields, three of them (a, b and c) are arrays and one of them (s) is a scalar.

Allocation of a data object on CPU or GPU

The memory on which data is stored is chosen by defining a context object, which is passed to the class constructor when the objects are instantiated. For example we can allocate an object of our data structure as follows:

ctx = xo.ContextCpu()
# ctx = xo.ContextCupy() # for NVIDIA GPUs

obj = DataStructure(_context=ctx,
                    a=[1,2,3], b=[4,5,6],
                    c=[0,0,0], s=0)

If ctx = xo.ContextCpu() the object is allocated on CPU, if ctx = xo.ContextCupy() the object is allocated in GPU.

Access to the data

Independently on the context, the object is accessible in read/write directly from Python. For example:

print(obj.a[2]) # gives: 3
obj.a[2] = 10
print(obj.a[2]) # gives: 10

Numpy-like access

Xobjects arrays can be viewed as numpy arrays (or numpy-like on GPUs). For example for our object:

arr = obj.a.to_nplike()
type(arr) # gives: numpy.ndarray

This object is a view and not a copy, which means that when we operate on the numpy arrays the underlying xobject is also modified. For example:

print(obj.a[0], obj.b[0]) # gives: 1.0 4.0
a_nplike = obj.a.to_nplike()
b_nplike = obj.b.to_nplike()

# We use np array algebra
a_nplike[:] = b_nplike - 1

print(obj.a[0], obj.b[0]) # gives: 3.0 4.0

# Usage of the .sum method of the numpy array
obj.s = a_nplike.sum()

Kernel functions in C

Autogenerated C API

The definition of a Xobject in Python, automatically triggers the generation of a set of functions (C-API) that can be used in C code to access the data allocated in Python. The available functions for a given Xobject can be inspected using the method _gen_c_decl(). For our example structure this can be done by:

print(DataStructure._gen_c_decl())

which provides the following set of C functions:

typedef /*gpuglmem*/ struct DataStructure_s * DataStructure;
/*gpufun*/ DataStructure DataStructure_getp(DataStructure/*restrict*/ obj);
/*gpufun*/ ArrNFloat64 DataStructure_getp_a(DataStructure/*restrict*/ obj);
/*gpufun*/ int64_t DataStructure_len_a(DataStructure/*restrict*/ obj);
/*gpufun*/ double DataStructure_get_a(const DataStructure/*restrict*/ obj, int64_t i0);
/*gpufun*/ void DataStructure_set_a(DataStructure/*restrict*/ obj, int64_t i0, double value);
/*gpufun*/ /*gpuglmem*/double* DataStructure_getp1_a(DataStructure/*restrict*/ obj, int64_t i0);
/*gpufun*/ ArrNFloat64 DataStructure_getp_b(DataStructure/*restrict*/ obj);
/*gpufun*/ int64_t DataStructure_len_b(DataStructure/*restrict*/ obj);
/*gpufun*/ double DataStructure_get_b(const DataStructure/*restrict*/ obj, int64_t i0);
/*gpufun*/ void DataStructure_set_b(DataStructure/*restrict*/ obj, int64_t i0, double value);
/*gpufun*/ /*gpuglmem*/double* DataStructure_getp1_b(DataStructure/*restrict*/ obj, int64_t i0);
/*gpufun*/ ArrNFloat64 DataStructure_getp_c(DataStructure/*restrict*/ obj);
/*gpufun*/ int64_t DataStructure_len_c(DataStructure/*restrict*/ obj);
/*gpufun*/ double DataStructure_get_c(const DataStructure/*restrict*/ obj, int64_t i0);
/*gpufun*/ void DataStructure_set_c(DataStructure/*restrict*/ obj, int64_t i0, double value);
/*gpufun*/ /*gpuglmem*/double* DataStructure_getp1_c(DataStructure/*restrict*/ obj, int64_t i0);
/*gpufun*/ double DataStructure_get_s(const DataStructure/*restrict*/ obj);
/*gpufun*/ void DataStructure_set_s(DataStructure/*restrict*/ obj, double value);
/*gpufun*/ /*gpuglmem*/double* DataStructure_getp_s(DataStructure/*restrict*/ obj);

Writing a C kernel

A C function that can be parallelized when running on GPU is called Kernel. As an example, using our example data structure, we write a C kernel function (running on CPU and GPU) that performs the element-by-element product between the arrays obj.a and obj.b and writes it in obj.c. In the kernel code we use methods of the autogenerated C API to access the data in our example DataStructure.

src = '''

/*gpukern*/
void myprod(DataStructure ob, int nelem){

    for (int ii=0; ii<nelem; ii++){ //vectorize_over ii nelem
        double a_ii = DataStructure_get_a(ob, ii);
        double b_ii = DataStructure_get_b(ob, ii);

        double c_ii = a_ii * b_ii;
        DataStructure_set_c(ob, ii, c_ii);
    } //end_vectorize

}
'''

Note the xobject annotation /*gpukern*/ that specifies that the function is a kernel, as well as annotations //vectorize_over ii nelem and //end_vectorize which identifies the variable on which the calculation can be performed in parallel and the corresponding range (i.e. 0 <= ii < nelem).

Compiling the kernel

The Xobject contex that we have alredy created to allocate the object in memory can also be used to compile the C code and access it from Python. This can be done with the method add_kernels by providing the source code and the description of the kernels from the source code that we would like to access from Python:

ctx.add_kernels(
    sources=[src],
    kernels={'myprod': xo.Kernel(
                args = [xo.Arg(DataStructure, name='ob'),
                        xo.Arg(xo.Int32, name='nelem')],
                n_threads='nelem')
            }
    )

The argument n_threads can be used to specify the name of an argument of the C function from which the number of threads to be used in the GPU can be inferred.

Calling the kernel

The kernel can be called from Python as follows

# obj.a contains [3., 4., 5.]
# obj.b contains [4., 5., 6.]
# obj.c contains [0., 0., 0.]

ctx.kernels.myprod(ob=obj, nelem=len(obj.a))

# obj.a contains [3., 4., 5.]
# obj.b contains [4., 5., 6.]
# obj.c contains [12., 20., 30.]

Inspecting the source code

Before compiling, the context specializes our source code for tgw chosen platform. Such autogenerated specialized code can be inspected from Python as follows:

print(ctx.kernels.myprod.specialized_source)

For our example kernel mymul, if the chosen context is a ContextCpu the generated specialized source is:

void myprod(DataStructure ob, int nelem){

    for (int ii=0; ii<nelem; ii++){ //autovectorized

        double a_ii = DataStructure_get_a(ob, ii);
        double b_ii = DataStructure_get_b(ob, ii);

        double c_ii = a_ii * b_ii;
        DataStructure_set_c(ob, ii, c_ii);
    }//end autovectorized

}

If the chosen context is a ContextCupy the generated specialized source is:

__global__
void myprod(DataStructure ob, int nelem){

    int ii; //autovectorized
    ii=blockDim.x * blockIdx.x + threadIdx.x;//autovectorized
    if (ii<nelem){ //autovectorized
        double a_ii = DataStructure_get_a(ob, ii);
        double b_ii = DataStructure_get_b(ob, ii);

        double c_ii = a_ii * b_ii;
        DataStructure_set_c(ob, ii, c_ii);
    }//end autovectorized
}

If the chosen context is a ContextCupy the generated specialized source is:

__kernel
void myprod(DataStructure ob, int nelem){

    int ii; //autovectorized
    ii=get_global_id(0); //autovectorized

            double a_ii = DataStructure_get_a(ob, ii);
            double b_ii = DataStructure_get_b(ob, ii);

            double c_ii = a_ii * b_ii;
            DataStructure_set_c(ob, ii, c_ii);
    //end autovectorized

}