Parallel Laplace Solver
=======================

The following code solves Laplace's equation in parallel on a square grid
using Jacobi iteration. The grid is divided into :math:`n` horizontal strips,
where :math:`n` is the number of processes. The root process creates the full
matrix and scatters one strip to each of the other processes. At every
iteration each process sends its top row to the process above and its bottom
row to the process below, since the Jacobi update at a point needs the values
at the neighboring points. Each process then performs one Jacobi step on its
strip, and the exchange is repeated. Every 500 iterations the local error
estimates are gathered on the root process and combined into a global error
estimate. You can compare the output of this with the solver we wrote in the
section on f2py.
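
For reference, with grid spacings :math:`\Delta x` and :math:`\Delta y` the
update applied at every interior point by ``numpyTimeStep`` in the listing
below amounts to

.. math::

   u_{i,j}^{(k+1)} = \frac{\left(u_{i-1,j}^{(k)} + u_{i+1,j}^{(k)}\right)\Delta y^2
                         + \left(u_{i,j-1}^{(k)} + u_{i,j+1}^{(k)}\right)\Delta x^2}
                          {2\left(\Delta x^2 + \Delta y^2\right)},

which, since the code uses the same spacing in both directions, is simply the
average of the four neighboring values.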

.. code-block:: python

    from mpi4py import MPI
    import numpy
    from numpy import r_

    comm = MPI.COMM_WORLD
    size = comm.Get_size()
    rank = comm.Get_rank()
    root = 0

    num_points = 500
    dx = 1.0/(num_points-1)
    rows_per_process = num_points//size  # assumes size divides num_points
    max_iter = 5000
    num_iter = 0
    total_err = 1
    sendbuf = None

    def numpyTimeStep(u, dx, dy):
        """Perform one Jacobi step on u; return the new grid and the
        L2 norm of the change."""
        dx2, dy2 = dx**2, dy**2
        dnr_inv = 0.5/(dx2 + dy2)
        u_old = u.copy()
        # Five-point Jacobi update on the interior points
        u[1:-1, 1:-1] = ((u[0:-2, 1:-1] + u[2:, 1:-1])*dy2 +
                         (u[1:-1, 0:-2] + u[1:-1, 2:])*dx2)*dnr_inv
        v = (u - u_old).ravel()
        return u, numpy.sqrt(numpy.dot(v, v))


    if rank == root:
        print("num_points: %d" % num_points)
        print("dx: %f" % dx)
        print("rows_per_process: %d" % rows_per_process)
        # Build the full grid with sin(x) boundary values on the top and
        # bottom rows, then cut it into horizontal strips, one per process.
        m = numpy.zeros((num_points, num_points), dtype=float)
        x = r_[0.0:numpy.pi:num_points*1j]  # num_points samples in [0, pi]
        m[0, :] = numpy.sin(x)
        m[num_points-1, :] = numpy.sin(x)
        sendbuf = [m[i*rows_per_process:(i+1)*rows_per_process, :]
                   for i in range(size)]


    # Each process receives its own horizontal strip of the grid.
    my_grid = comm.scatter(sendbuf, root=root)

    while num_iter < max_iter and total_err > 10e-6:

        # Halo exchange: each process sends its boundary rows to its
        # neighbours and receives theirs in return.
        if rank == 0:
            comm.send(my_grid[-1, :], dest=1)

        if rank > 0 and rank < size-1:
            row_above = comm.recv(source=rank-1)
            comm.send(my_grid[-1, :], dest=rank+1)

        if rank == size-1:
            row_above = comm.recv(source=rank-1)
            comm.send(my_grid[0, :], dest=rank-1)

        if rank > 0 and rank < size-1:
            row_below = comm.recv(source=rank+1)
            comm.send(my_grid[0, :], dest=rank-1)

        if rank == 0:
            row_below = comm.recv(source=1)



        # Jacobi step on the local strip padded with the neighbouring rows;
        # afterwards the padding rows are stripped off again.
        if rank > 0 and rank < size-1:
            row_below.shape = (1, num_points)
            row_above.shape = (1, num_points)
            u, err = numpyTimeStep(r_[row_above, my_grid, row_below], dx, dx)
            my_grid = u[1:-1, :]

        if rank == 0:
            row_below.shape = (1, num_points)
            u, err = numpyTimeStep(r_[my_grid, row_below], dx, dx)
            my_grid = u[0:-1, :]

        if rank == size-1:
            row_above.shape = (1, num_points)
            u, err = numpyTimeStep(r_[row_above, my_grid], dx, dx)
            my_grid = u[1:, :]


        if num_iter % 500 == 0:
            # Collect the local error norms on the root process, combine them
            # into a global error estimate, and broadcast the result so that
            # every process applies the same stopping criterion.
            err_list = comm.gather(err, root=root)
            if rank == root:
                total_err = numpy.sqrt(sum(a**2 for a in err_list))
                print("error: %f" % total_err)
            total_err = comm.bcast(total_err, root=root)


        num_iter += 1



    # Collect the strips on the root process and reassemble the full grid.
    recvbuf = comm.gather(my_grid, root=root)
    if rank == root:
        sol = numpy.vstack(recvbuf)
        # Write your own code to do something with the solution
        print(num_iter)
        print(sol)
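
The final ``print`` calls are only placeholders. If matplotlib is available, a
minimal sketch of doing something with the assembled solution on the root
process might look like this (the output filename is just an example):

.. code-block:: python

    import matplotlib.pyplot as plt

    if rank == 0:
        plt.imshow(sol, origin="lower")  # heat map of the converged grid
        plt.colorbar()
        plt.savefig("laplace_solution.png")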

For small grid sizes this will be slower than a straightforward serial
implementation, because for small grids the interprocess communication costs
more time than the iteration itself. However, on a 1000x1000 grid I find that
with 4 processes the parallel version takes only 6 seconds, while the serial
version we wrote earlier takes 20 seconds.
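
To try a comparison like this yourself, the script has to be launched under an
MPI process manager. Assuming the listing above is saved as ``laplace_mpi.py``
(a filename chosen here only for illustration), a four-process run would look
like::

    mpiexec -n 4 python laplace_mpi.py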

Exercise: Rewrite the above using f2py, so that each process compiles a
Fortran function and uses it. How fast can you make this?
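
As a starting point, one possible shape of the Python side is sketched below.
It assumes a Fortran time-step routine has already been compiled with f2py
into a module called ``flaplace`` exposing a function ``timestep(u, dx, dy)``
that returns the updated grid together with the error norm; both names are
placeholders, not an existing module:

.. code-block:: python

    # Hypothetical drop-in replacement for numpyTimeStep, assuming a module
    # "flaplace" built with f2py that provides timestep(u, dx, dy) -> (u, err).
    import flaplace

    def fortranTimeStep(u, dx, dy):
        u, err = flaplace.timestep(u, dx, dy)
        return u, err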