Multiplatform programming with xobjects
Xobjects provides low-level functionalities to the other Xsuite packages, allowing the same source code to be used on different platforms (CPUs and GPUs).
Data management
The following example shows how to use Xobjects to allocate and manipulate data in the memory of the computing platform (CPU or GPU).
Definition of a simple data structure class
An Xobjects data structure can be defined as follows:
import xobjects as xo
class DataStructure(xo.Struct):
    a = xo.Float64[:]
    b = xo.Float64[:]
    c = xo.Float64[:]
    s = xo.Float64
The structure has four fields: three of them (a, b and c) are arrays and one (s) is a scalar.
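Fields are not restricted to 64-bit floats. As a minimal sketch (the class MixedStructure is made up for illustration, and we assume the other standard Xobjects scalar types, e.g. xo.Int64, and fixed-size arrays declared with an explicit length), a structure mixing different kinds of fields could be defined as:
class MixedStructure(xo.Struct):
    n = xo.Int64       # integer scalar
    w = xo.Float64[3]  # fixed-size array (length fixed at class definition)
    v = xo.Float64[:]  # variable-size array (length chosen at instantiation)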
Allocation of a data object on CPU or GPU
The memory on which the data is stored is chosen by defining a context object, which is passed to the class constructor when the object is instantiated. For example, we can allocate an object of our data structure as follows:
ctx = xo.ContextCpu()
# ctx = xo.ContextCupy() # for NVIDIA GPUs
obj = DataStructure(_context=ctx,
                    a=[1,2,3], b=[4,5,6],
                    c=[0,0,0], s=0)
If ctx = xo.ContextCpu() the object is allocated on the CPU; if ctx = xo.ContextCupy() the object is allocated on the GPU.
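The context on which an object lives can be checked from the object itself. A minimal sketch, assuming the instance exposes a _context attribute matching the _context argument passed at construction:
print(isinstance(obj._context, xo.ContextCpu)) # gives: True (for the CPU context above)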
Access to the data
Independently of the context, the object is accessible in read/write mode directly from Python. For example:
print(obj.a[2]) # gives: 3
obj.a[2] = 10
print(obj.a[2]) # gives: 10
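Scalar fields can be read and written in the same way; for example, for the field s initialized to zero in the constructor above:
print(obj.s) # gives: 0.0
obj.s = 7.0
print(obj.s) # gives: 7.0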
Numpy-like access
Xobjects arrays can be viewed as numpy arrays (or numpy-like arrays on GPUs). For example, for our object:
arr = obj.a.to_nplike()
type(arr) # gives: numpy.ndarray
This object is a view and not a copy, which means that when we operate on the numpy array the underlying xobject is also modified. For example:
print(obj.a[0], obj.b[0]) # gives: 1.0 4.0
a_nplike = obj.a.to_nplike()
b_nplike = obj.b.to_nplike()
# We use np array algebra
a_nplike[:] = b_nplike - 1
print(obj.a[0], obj.b[0]) # gives: 3.0 4.0
# Usage of the .sum method of the numpy array
obj.s = a_nplike.sum()
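We can verify that the scalar field has been updated (a_nplike contains [3., 4., 5.] after the operation above, so the sum is 12):
print(obj.s) # gives: 12.0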
Kernel functions in C
Autogenerated C API
The definition of an Xobject in Python automatically triggers the generation of a set of functions (C API) that can be used in C code to access the data allocated from Python. The available functions for a given Xobject can be inspected using the method _gen_c_decl(). For our example structure this can be done by:
print(DataStructure._gen_c_decl())
which provides the following set of C functions:
typedef /*gpuglmem*/ struct DataStructure_s * DataStructure;
/*gpufun*/ DataStructure DataStructure_getp(DataStructure/*restrict*/ obj);
/*gpufun*/ ArrNFloat64 DataStructure_getp_a(DataStructure/*restrict*/ obj);
/*gpufun*/ int64_t DataStructure_len_a(DataStructure/*restrict*/ obj);
/*gpufun*/ double DataStructure_get_a(const DataStructure/*restrict*/ obj, int64_t i0);
/*gpufun*/ void DataStructure_set_a(DataStructure/*restrict*/ obj, int64_t i0, double value);
/*gpufun*/ /*gpuglmem*/double* DataStructure_getp1_a(DataStructure/*restrict*/ obj, int64_t i0);
/*gpufun*/ ArrNFloat64 DataStructure_getp_b(DataStructure/*restrict*/ obj);
/*gpufun*/ int64_t DataStructure_len_b(DataStructure/*restrict*/ obj);
/*gpufun*/ double DataStructure_get_b(const DataStructure/*restrict*/ obj, int64_t i0);
/*gpufun*/ void DataStructure_set_b(DataStructure/*restrict*/ obj, int64_t i0, double value);
/*gpufun*/ /*gpuglmem*/double* DataStructure_getp1_b(DataStructure/*restrict*/ obj, int64_t i0);
/*gpufun*/ ArrNFloat64 DataStructure_getp_c(DataStructure/*restrict*/ obj);
/*gpufun*/ int64_t DataStructure_len_c(DataStructure/*restrict*/ obj);
/*gpufun*/ double DataStructure_get_c(const DataStructure/*restrict*/ obj, int64_t i0);
/*gpufun*/ void DataStructure_set_c(DataStructure/*restrict*/ obj, int64_t i0, double value);
/*gpufun*/ /*gpuglmem*/double* DataStructure_getp1_c(DataStructure/*restrict*/ obj, int64_t i0);
/*gpufun*/ double DataStructure_get_s(const DataStructure/*restrict*/ obj);
/*gpufun*/ void DataStructure_set_s(DataStructure/*restrict*/ obj, double value);
/*gpufun*/ /*gpuglmem*/double* DataStructure_getp_s(DataStructure/*restrict*/ obj);
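For instance, DataStructure_get_a(ob, i) returns the value of obj.a[i] and DataStructure_set_c(ob, i, v) performs the equivalent of obj.c[i] = v. As a minimal sketch of how these functions combine in C code (the helper name DataStructure_sum_ab is made up for illustration and is not part of the generated API):
/*gpufun*/
double DataStructure_sum_ab(DataStructure ob, int64_t i){
    // sum of the i-th elements of the fields a and b
    return DataStructure_get_a(ob, i) + DataStructure_get_b(ob, i);
}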
Writing a C kernel
A C function that can be parallelized when running on a GPU is called a kernel.
As an example, using our data structure, we write a C kernel function (running on both CPU and GPU) that performs the element-by-element product of the arrays obj.a and obj.b and writes the result in obj.c. In the kernel code we use the functions of the autogenerated C API to access the data of our example DataStructure.
src = '''
/*gpukern*/
void myprod(DataStructure ob, int nelem){
    for (int ii=0; ii<nelem; ii++){ //vectorize_over ii nelem
        double a_ii = DataStructure_get_a(ob, ii);
        double b_ii = DataStructure_get_b(ob, ii);
        double c_ii = a_ii * b_ii;
        DataStructure_set_c(ob, ii, c_ii);
    } //end_vectorize
}
'''
Note the xobjects annotation /*gpukern*/, which specifies that the function is a kernel, as well as the annotations //vectorize_over ii nelem and //end_vectorize, which identify the variable over which the calculation can be parallelized and the corresponding range (i.e. 0 <= ii < nelem).
Compiling the kernel
The Xobjects context that we have already created to allocate the object in memory can also be used to compile the C code and access it from Python. This is done with the method add_kernels, providing the source code and a description of the kernels in the source code that we would like to call from Python:
ctx.add_kernels(
    sources=[src],
    kernels={'myprod': xo.Kernel(
        args=[xo.Arg(DataStructure, name='ob'),
              xo.Arg(xo.Int32, name='nelem')],
        n_threads='nelem')
    })
The argument n_threads specifies the name of an argument of the C function from which the number of threads to be used on the GPU is inferred.
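To illustrate, the following sketch defines a second kernel whose thread count is taken from a different argument. It is hypothetical (the kernel myscale and its arguments arr, n and factor are made up) and assumes that xo.Arg accepts pointer=True for bare array arguments:
src_scale = '''
/*gpukern*/
void myscale(/*gpuglmem*/ double* arr, int n, double factor){
    for (int ii=0; ii<n; ii++){ //vectorize_over ii n
        arr[ii] = arr[ii] * factor; // scale each element in place
    } //end_vectorize
}
'''
ctx.add_kernels(
    sources=[src_scale],
    kernels={'myscale': xo.Kernel(
        args=[xo.Arg(xo.Float64, pointer=True, name='arr'),
              xo.Arg(xo.Int32, name='n'),
              xo.Arg(xo.Float64, name='factor')],
        n_threads='n')  # one thread per element; range inferred from 'n'
    })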
Calling the kernel
The kernel can be called from Python as follows:
# obj.a contains [3., 4., 5.]
# obj.b contains [4., 5., 6.]
# obj.c contains [0., 0., 0.]
ctx.kernels.myprod(ob=obj, nelem=len(obj.a))
# obj.a contains [3., 4., 5.]
# obj.b contains [4., 5., 6.]
# obj.c contains [12., 20., 30.]
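The result can also be inspected through a numpy-like view, as introduced above:
print(obj.c.to_nplike()) # gives: [12. 20. 30.]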
Inspecting the source code
Before compiling, the context specializes our source code for the chosen platform. The autogenerated specialized code can be inspected from Python as follows:
print(ctx.kernels.myprod.specialized_source)
For our example kernel myprod, if the chosen context is a ContextCpu the generated specialized source is:
void myprod(DataStructure ob, int nelem){
    for (int ii=0; ii<nelem; ii++){ //autovectorized
        double a_ii = DataStructure_get_a(ob, ii);
        double b_ii = DataStructure_get_b(ob, ii);
        double c_ii = a_ii * b_ii;
        DataStructure_set_c(ob, ii, c_ii);
    }//end autovectorized
}
If the chosen context is a ContextCupy the generated specialized source is:
__global__
void myprod(DataStructure ob, int nelem){
    int ii; //autovectorized
    ii = blockDim.x * blockIdx.x + threadIdx.x; //autovectorized
    if (ii<nelem){ //autovectorized
        double a_ii = DataStructure_get_a(ob, ii);
        double b_ii = DataStructure_get_b(ob, ii);
        double c_ii = a_ii * b_ii;
        DataStructure_set_c(ob, ii, c_ii);
    }//end autovectorized
}
If the chosen context is a ContextPyopencl the generated specialized source is:
__kernel
void myprod(DataStructure ob, int nelem){
    int ii; //autovectorized
    ii = get_global_id(0); //autovectorized
    double a_ii = DataStructure_get_a(ob, ii);
    double b_ii = DataStructure_get_b(ob, ii);
    double c_ii = a_ii * b_ii;
    DataStructure_set_c(ob, ii, c_ii);
    //end autovectorized
}