During these 4 weeks I completed my pending work from phase 1, i.e. implementing matrix solvers, and started working on my phase 2 tasks. In this period I made a PR for the CGS solver, improved LSMR, and added some distribution methods to Generator.

Fixing the large number of kernel invocations in the LSMR algorithm

In the previous post I wrote about the various methods I tried for reducing the number of individual kernel calls, which were causing very poor performance compared to the SciPy version. This time my mentor suggested that we limit the GPU to matrix and vector computation and keep the scalar arithmetic on the CPU. And BAM!! this worked, and there was a very large performance improvement. You can see the comparison of the time taken to compute the solution vector in the following table: as the size of the array increases, the gap between the computation time required by SciPy and CuPy widens in CuPy's favour.

Size of matrix    SciPy        CuPy
500x500           53.84        287.29
1000x1000         130.22       601.30
5000x5000         4808.27      2992.65
10000x10000       20126.59     3263.74

Looking at the profiler, we can see that the number of kernel invocations was reduced:

(Figure: GPU profile)

(Figure: LSMR stats)

How we fixed it

The main problem in the implementation was that all calculations involving vectors were done on the GPU, even operations like value comparisons, maxima and minima, and so on. The fix was to copy single-element results from the GPU to the CPU and use them for the less intensive scalar operations. This was done using the item method, which converts an array with a single element into a Python scalar. This removes a large number of independent kernel calls, which improves the performance of the algorithm.
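A minimal sketch of the idea (using NumPy as a stand-in for CuPy; with CuPy installed, `import cupy as cp` behaves the same way for this pattern):

```python
import numpy as np  # stand-in for cupy here; the .item() pattern is identical

# A reduction like a vector norm produces a single-element (0-d) array,
# which on CuPy lives on the GPU.
v = np.arange(1000.0)
normv = np.sqrt((v * v).sum())

# .item() converts the single-element array into a plain Python float on
# the host, so later comparisons and scalar arithmetic run on the CPU
# instead of launching one tiny kernel per operation.
normv_cpu = normv.item()

if normv_cpu > 1e-12:          # plain CPU comparison, no kernel launch
    scale = 1.0 / normv_cpu    # CPU scalar arithmetic
    v_normalized = v * scale   # one kernel for the real vector work
```

The vector operations still run on the GPU; only the scalar bookkeeping (comparisons, ratios of norms, stopping tests) moves to the CPU.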

CGS solver

Conjugate Gradient Squared (CGS) is a variant of the biconjugate gradient method and has the same number of operations per iteration, but it does not involve computation with the transpose Aᵀ. This reduces work in cases where applying Aᵀ is computationally costly.

Following is the pseudocode of the preconditioned Conjugate Gradient Squared method:


Compute r(0) = b − A x(0) for some initial guess x(0)
Choose r̃ (for example, r̃ = r(0))
for i = 1, 2, . . .
	ρ(i−1) = r̃ᵀ r(i−1)
	if ρ(i−1) = 0, method fails
	if i = 1
		u(1) = r(0)
		p(1) = u(1)
	else
		β(i−1) = ρ(i−1) / ρ(i−2)
		u(i) = r(i−1) + β(i−1) q(i−1)
		p(i) = u(i) + β(i−1) (q(i−1) + β(i−1) p(i−1))
	endif
	solve M p̂ = p(i)
	v̂ = A p̂
	α(i) = ρ(i−1) / (r̃ᵀ v̂)
	q(i) = u(i) − α(i) v̂
	solve M û = u(i) + q(i)
	x(i) = x(i−1) + α(i) û
	q̂ = A û
	r(i) = r(i−1) − α(i) q̂
	check convergence; continue if necessary
end
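The pseudocode above can be sketched in plain NumPy as follows. This is a minimal unpreconditioned version (M = I, so the two "solve M..." steps become copies); the function name and signature are illustrative, not the actual CuPy API:

```python
import numpy as np

def cgs(A, b, x0=None, tol=1e-8, maxiter=100):
    """Unpreconditioned CGS (M = I), following the pseudocode above."""
    n = b.shape[0]
    x = np.zeros(n) if x0 is None else x0.astype(float).copy()
    r = b - A @ x
    r_tilde = r.copy()              # shadow residual, r~ = r(0)
    p = np.zeros(n)
    q = np.zeros(n)
    rho_prev = 1.0
    b_norm = np.linalg.norm(b)
    for i in range(1, maxiter + 1):
        rho = r_tilde @ r
        if rho == 0.0:
            raise RuntimeError("CGS breakdown: rho == 0")
        if i == 1:
            u = r.copy()
            p = u.copy()
        else:
            beta = rho / rho_prev
            u = r + beta * q
            p = u + beta * (q + beta * p)
        v_hat = A @ p               # with M = I, p_hat = p
        alpha = rho / (r_tilde @ v_hat)
        q = u - alpha * v_hat
        u_hat = u + q               # with M = I, u_hat = u + q
        x = x + alpha * u_hat
        r = r - alpha * (A @ u_hat)
        rho_prev = rho
        if np.linalg.norm(r) < tol * b_norm:   # convergence check
            return x, i
    return x, maxiter
```

Replacing `np` with `cp` gives the GPU version; as with LSMR, the scalar quantities (ρ, α, β, the residual norm) are the values worth pulling to the CPU with `.item()`.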

The above algorithm for the CGS solver over CuPy shows a very large performance improvement over SciPy for larger matrices, as you can see in the following table. As the size of the matrix increases, the time required to compute the solution drops sharply: for a matrix of size 5000x5000 there is nearly a 4x improvement in the time required to get the solution.

Size of matrix    SciPy        CuPy
100x100           47.81        162.54
500x500           277.83       812.68
1000x1000         706.65       1627.82
3000x3000         8468.41      4872.15
5000x5000         33556.39     8166.43

Along with these two solvers, I am working on two more: MINRES and BiCG.

Distributions in the new random Generator

During this period I made PRs for the binomial, geometric, and hypergeometric distributions, and also a PR improving the beta distribution. Of these, I succeeded in getting the geometric and hypergeometric distributions merged into the main repo. In the upcoming weeks I will add more distributions to the Generator. I'll write a separate post describing the random number generator and how it was implemented in CuPy using the cuRAND library provided by CUDA.

Following is an example of random numbers generated using the newly added distributions:


>>> import cupy as cp
>>> ngood = cp.random.randint(low=5, high=10, size=(5))
>>> nbad = cp.random.randint(low=5, high=10, size=(5))
>>> nsample = cp.random.randint(low=1, high=10, size=(5))
>>> rng = cp.random.default_rng()
>>> rng.hypergeometric(ngood, nbad, nsample)
array([4, 5, 1, 2, 3])
>>> p = cp.random.random(10)
>>> rng.geometric(p)
array([1, 1, 1, 3, 3, 1, 1, 2, 1, 6])

All these distributions show a very large performance improvement compared to NumPy.

Tasks for upcoming weeks

In the upcoming weeks I plan to add more distributions, complete the two remaining solvers, i.e. MINRES and BiCG, and get my binomial distribution PR merged.