quda's Issues

Objectify the solvers

The solvers need to be derived from an abstract solver base class. This will allow:

  1. Arbitrary preconditioners to be constructed.
  2. All memory required by a given solver to be allocated when the constructor is called, preventing malloc overhead at solve time (critical for solvers being used as preconditioners).
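A minimal sketch of what such a hierarchy might look like (class and member names are illustrative, not the existing QUDA API):

  // Illustrative sketch of an abstract solver hierarchy (names are not the real API).
  class Solver {
  public:
    // Derived classes should allocate all workspace fields here, not at solve time.
    Solver(const QudaInvertParam &param) : param(param) { }
    virtual ~Solver() { }

    // Solve A x = b; the operator A is supplied by the derived class.
    virtual void operator()(cudaColorSpinorField &x, cudaColorSpinorField &b) = 0;

  protected:
    QudaInvertParam param;
  };

  class CG : public Solver {
  public:
    CG(const DiracMatrix &mat, const QudaInvertParam &param)
      : Solver(param), mat(mat) { /* allocate r, p, Ap workspace here */ }
    void operator()(cudaColorSpinorField &x, cudaColorSpinorField &b);
  private:
    const DiracMatrix &mat;
  };

A preconditioned solver could then simply hold another Solver for the inner solve, and the constructor-time allocation keeps repeated inner solves cheap.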

Add support for QIO

For a first pass, QIO support should be added to allow loading of gauge fields. This will enable algorithmic testing independent of Chroma etc. In particular:

  1. Parallel QIO support, so that the loaded / saved fields are correctly distributed over the parallel machine.
  2. QIO should not be required by the core QUDA library, only by the tests.

Segmentation Fault on trying to read gauge configuration in ILDG

Good Morning USA!

I get a segfault when trying to read a configuration file in ILDG format (Chroma reads the same file without problems).
Maybe I am doing something wrong, but I guess a segfault is not the best answer. This is with a recent GitHub pull of QUDA and a recent CVS checkout of QIO/QMP.

Reading symbols from /home/kpetrov/DEVEL/quda/tests/wilson_invert_test...done.
(gdb) set args --xdim 24 --ydim 24 --zdim 24 --tdim 48 --load-gauge landau_conf.1000
(gdb) run
Starting program: /home/kpetrov/DEVEL/quda/tests/wilson_invert_test --xdim 24 --ydim 24 --zdim 24 --tdim 48 --load-gauge landau_conf.1000
[Thread debugging using libthread_db enabled]

Program received signal SIGSEGV, Segmentation fault.
0x000000000041d657 in read_gauge_field (filename=0x209a820 "landau_conf.1000",
gauge=0x7fffffffdef0, precision=QUDA_DOUBLE_PRECISION, X=0x7fffffffde90,
argc=, argv=) at gauge_qio.cpp:99
99 lattice_size[d] = QMP_get_logical_dimensions()[d]*X[d];
(gdb) bt
#0  0x000000000041d657 in read_gauge_field (filename=0x209a820 "landau_conf.1000",
    gauge=0x7fffffffdef0, precision=QUDA_DOUBLE_PRECISION, X=0x7fffffffde90,
    argc=<value optimized out>, argv=<value optimized out>) at gauge_qio.cpp:99
#1  0x0000000000403332 in main (argc=11, argv=0x7fffffffe038)
    at wilson_invert_test.cpp:238

(gdb) quit
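For what it's worth, the faulting line suggests QMP_get_logical_dimensions() may be returning NULL because no logical topology was declared before the read; a guard along these lines in gauge_qio.cpp (a sketch of a diagnostic, not the confirmed fix) would at least turn the segfault into an error message:

  // Sketch: guard the dereference in read_gauge_field() (gauge_qio.cpp).
  if (QMP_logical_topology_is_declared() == QMP_FALSE) {
    fprintf(stderr, "read_gauge_field: QMP logical topology has not been declared\n");
    exit(1);
  }
  const int *logical_dim = QMP_get_logical_dimensions();
  for (int d = 0; d < 4; d++) lattice_size[d] = logical_dim[d] * X[d];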

Dslash performance reduction

All multi-dimensional Dslash types have gotten slower at all precisions since we introduced multi-dimensional parallelization. We need to understand why this is. Candidate causes:

  1. Increased register pressure?
  2. Reordering of operations?
  3. Increased number of conditionals?

On a related note, we need more flexibility in the SHARED_FLOATS_PER_THREAD parameter, to allow different values for different kernels. Introducing an offline dslash tuner to pick the best values (which should be volume-independent) would be ideal here.

Correct handling of solve_type

As originally conceived, a "DIRECT" solve_type is supposed to result in the matrix M being inverted directly, while a "NORMEQ" solve inverts M^† M. From the NEWS file:

  • Added 'solve_type' to QudaInvertParam. This determines how the linear
    system is solved, in contrast to solution_type which determines what
    system is being solved. When using the CG inverter, solve_type should
    generally be set to 'QUDA_NORMEQ_PC_SOLVE', which will solve the
    even/odd-preconditioned normal equations via CGNR. (The full
    solution will be reconstructed if necessary based on solution_type.)
    For BiCGstab, 'QUDA_DIRECT_PC_SOLVE' is generally best. These choices
    correspond to what was done by default in earlier versions of QUDA.

At the moment, this doesn't work as advertised; CG always does a "NORMEQ" solve, irrespective of the value of solve_type.

It also wouldn't hurt to choose a more sensible naming convention (at the very least replacing "NORMEQ" by "NORMOP"), as advocated by Balint.
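Schematically, the fix amounts to choosing the operator handed to the solver from solve_type rather than from the solver type; something like the following inside the interface (a sketch only; the real change must also handle preconditioning and the solution_type reconstruction):

  // Sketch: pick the operator from solve_type (wiring is illustrative).
  DiracM     m(dirac);       // applies M
  DiracMdagM mdagm(dirac);   // applies M^dag M (normal operator)

  const DiracMatrix *op;
  if (param->solve_type == QUDA_DIRECT_SOLVE || param->solve_type == QUDA_DIRECT_PC_SOLVE)
    op = &m;       // invert M directly (BiCGstab, or CG only if M is Hermitian)
  else
    op = &mdagm;   // NORMEQ: solve M^dag M x = M^dag b (CGNR)
  // ... pass *op to the solver and reconstruct the full solution per solution_type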

--with-mpi

Hi Guys,
--with-mpi on the configure script appears to be overloaded. I originally specified it to point to the location of the MPI distribution that was used to compile QMP programs. It has now been overloaded to also define -DUSE_MPI (to select the MPI vs QMP comms harness). This is bad because it is natural to have both --with-qmp= and --with-mpi= set for a normal QMP build, and that will put both -DQMP_COMMS and -DMPI_COMMS on the command line.

We should find a way around this. How about:

--with-qmp= sets -DQMP_COMMS
--with-mpi= sets -DMPI_COMMS
--with-mpihome= sets MPI_HOME, but not -DMPI_COMMS

Thoughts?

Hypercubic RNG

In order to have machine-independent testing, we need a hypercubic random number generator, i.e., one that assigns a different random number stream to each site on the lattice. This means that, irrespective of how the lattice is mapped to the machine, we will always generate the same random number fields.

This is required to aid debugging and verification within QUDA and to further decouple the testing process from external libraries. It will likely also be desired for any canned multigrid solver that we deploy.
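One simple way to get this property (a sketch, not a committed design) is to derive each site's random seed purely from its global coordinates, so the stream depends only on the site and a base seed, never on how the lattice is partitioned:

  // Sketch: per-site seed from global coordinates only (X[] = global lattice dims).
  // Feeding this seed to any per-site RNG gives machine-independent random fields.
  unsigned long long siteSeed(const int x[4], const int X[4], unsigned long long seed) {
    unsigned long long idx = ((unsigned long long)x[3] * X[2] + x[2]) * X[1];
    idx = (idx + x[1]) * X[0] + x[0];             // global lexicographical site index
    return seed ^ (idx * 2654435761ULL + 1ULL);   // cheap hash; illustrative only
  }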

Should we eliminate make.inc.example?

There seems to be increasing danger of make.in and make.inc.example getting out of sync. If there's consensus, I'd be in favor of going over completely to Balint's autoconf approach.

MPI QUDA error: (CUDA) invalid configuration argument (node 7, blas_quda.cu:1509, hostname=curie7063)

Salut,

I am trying to run a recent QUDA on an M2090 cluster and get this error. I understand it is related to shared-memory overflow, but reducing the local volume didn't quite help.
Also, how many GPUs do I need to run on to handle 48x96 lattices?

script was:

#MSUB -r MPI_GPU_Job    # Request name
#MSUB -n 16             # Total number of tasks to use
#MSUB -N 16             # Total number of nodes to use
#MSUB -T 18000          # Elapsed time limit in seconds
#MSUB -o example_%I.o   # Standard output. %I is the job id
#MSUB -e example_%I.e   # Error output. %I is the job id
#MSUB -q hybrid         # Hybrid partition of GPU nodes
#MSUB -A gch0005        # Project ID

set -x
cd ${BRIDGE_MSUB_PWD}
module load cuda
ccc_mprun ./wilson_invert_test --xdim 32 --ydim 32 --zdim 32 --tdim 64 --xgridsize 2 --ygridsize 2 --zgridsize 2 --tgridsize 2

Fix Blas Autotuning

Hi, a user was trying to run QUDA and came across this error:

(CUDA) too many resources requested for launch (node 0, blas_quda.cu:929)

He was trying to run a 16^4 clover lattice on a single C2050 (but in multi-GPU mode, i.e., with wraparound QMP comms).

Probably the blas params are not optimal (he probably did not run a make tune, because the script that I gave him did not include that step), and he used the default blas_params.h file. The curiosity is that for me on qcd10g0310, also with C2050s, this error does not occur when I try to emulate what he is doing (I used exactly the same package for the build that I gave him).

However, I am using CUDA 3.0 and he's using 3.2.

I think that, in principle, a make tune could fix his problem, but that makes automation really quite difficult. (One needs to know/edit the lattice size in blas_test, and it has to be done interactively, or by submitting a job to a compute node on systems where there is no GPU on the interactive node.)

Any ideas? Can it be done at runtime, without having to wait the 15 minutes for the full BLAS tuning to go through as with make tune?
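One possibility (sketched below, not the current code) is to tune at runtime on first use: benchmark the candidate launch configurations for each blas kernel, skip any configuration that fails to launch, and cache the fastest one, so no separate make tune step or header editing is needed:

  // Sketch: benchmark candidate block sizes at runtime, skipping any that fail
  // to launch (e.g. "too many resources requested for launch").
  int tuneBlockSize(void (*launch)(int block), const int *candidates, int n) {
    int best = -1;
    float best_time = 1e30f;
    for (int i = 0; i < n; i++) {
      cudaGetLastError();                        // clear any earlier error
      cudaEvent_t start, stop;
      cudaEventCreate(&start); cudaEventCreate(&stop);
      cudaEventRecord(start);
      launch(candidates[i]);
      cudaEventRecord(stop);
      cudaEventSynchronize(stop);
      if (cudaGetLastError() == cudaSuccess) {   // this configuration is viable
        float ms;
        cudaEventElapsedTime(&ms, start, stop);
        if (ms < best_time) { best_time = ms; best = candidates[i]; }
      }
      cudaEventDestroy(start); cudaEventDestroy(stop);
    }
    return best;                                 // -1 if nothing launched successfully
  }

This is essentially what make tune does offline, but done lazily per kernel and problem size at run time.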

Non-relativistic host-side spinor fields for Wilson-type quarks

The current interface requires that the host-side spinor fields are in the chiral basis, with the conversion to the non-relativistic basis done when the field is copied to the device. The option to have host-side spinor fields already in the non-relativistic basis is a feature requested by Alexei Strelchenko and should be implemented for the 0.4.0 release.

Occasional dslash hanging

Running on hundreds of GPUs seems to induce a non-reproducible hang of the current master (2310c18). This is likely also related to the occasional hang that I have found when doing extreme dslash autotuning where hundreds of thousands of dslash kernel calls are invoked in rapid succession. I have no idea what's causing this, but this should be investigated and appropriately dealt with.

make tune fails for Ns=4 in version ab7ef49bbaed8c19f55f60d339aa13886cc58075

I just pulled the master and tried to run make tune. The default tests/blas_test.cu has

// volume per GPU
const int LX = 12; // Has to be checkerboarded value... (so 24->12)
const int LY = 24;
const int LZ = 24;
const int LT = 24;
const int Nspin = 1;

and 'make tune' works fine. Changing to Nspin = 4 for Wilson fermions results in:

Testing single precision...
QUDA error: (CUDA) too many resources requested for launch (node 0, blas_test.cu:118)

I am now trying to get more information about where the problem comes from (building with DEBUG_HOST and DEBUG_DEVICE).
NB: I am building with -DMULTI_GPU -DOVERLAP_COMMS -DGPU_WILSON_DIRAC -DQMP_COMMS

Reference clover term computation and application

The host code should be able to test the QUDA clover code for correctness. Thus, it should do the following:

  1. Calculate the clover field given the host gauge field.
  2. Compute the inverse clover field given the clover field.
  3. Apply the clover field or its inverse to a spinor field.
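For reference, the term to be constructed in step 1 is the standard clover term (sign and normalization conventions must match what the GPU kernels assume):

  A(x) \;=\; 1 + \frac{c_{\mathrm{sw}}\,\kappa}{2} \sum_{\mu<\nu} \sigma_{\mu\nu} F_{\mu\nu}(x)

where F_{\mu\nu}(x) is the traceless anti-Hermitian "clover-leaf" average of the four plaquettes in the \mu\nu plane at site x. Step 2 is then the site-by-site matrix inverse of A(x), which is block diagonal in chirality.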

individual residua for multi-shift solver

Hi,
Mike has a nice algorithmic trick for RHMC where we can ask for different residua for different shifts in the multi-shift algorithm. Typically the ill-conditioned small-shift systems contribute less to the overall force, and only the higher shifts need to be solved more accurately. It would be nice if we could specify a desired residuum for each shift, rather than having a single master one for the lowest shift.

This may need some kind of modification to QudaInvertParam, though. Let us get our thinking caps on as to how to specify this in the interface.
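One possible shape for the interface change, as a starting point for discussion (field names below are a suggestion, not the existing QudaInvertParam):

  /* Suggested QudaInvertParam additions (illustrative names): */
  int    num_offset;                        /* number of shifts in the multi-shift solve */
  double offset[QUDA_MAX_MULTI_SHIFT];      /* the shifts themselves */
  double tol_offset[QUDA_MAX_MULTI_SHIFT];  /* per-shift target residuum */

  /* The multi-shift CG would then retire shift i once its iterated residual falls
     below tol_offset[i], instead of running all shifts to a single master tolerance. */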

Add support for qcdlib domain-wall fermion format

Currently qcdlib can't interface with the domain-wall solver because QUDA doesn't understand the format that qcdlib uses (an array of pointers to 4-d spinor fields). A new spinor field format should be introduced, and the packing and unpacking routines updated to include this possibility.

Gauge-fixed Dslash broken for Wilson

This is probably a bug introduced with the new multi-dim kernels. The last time I knew it was working for sure was back in early March (commit ab7ef49). wilson_dslash_test fails whether or not partitioning in T is turned on.

Flaky blas_test

The blas_test is supposed to automatically find good parameters and ignore bad ones. The problem is that the error-detection mechanism that enables this (setBlasTuning) isn't reached until some spinor field copies have been performed. If the default parameters are bad, the initial copy will fail, causing the program to halt.

Multi-gpu device selection

Guochun and I were discussing how devices are selected when running in multi-GPU mode. At the moment devices are allocated automatically, and the lowest device numbers are selected first. This is not optimal. Take Forge, for example, where each 'node' consists of two NUMA nodes: device 0 and device 1 are on the first NUMA node, and devices 2-5 are on the second. It may be advantageous to run two jobs concurrently on each node, using two and four GPUs respectively, say. However, as it stands, those jobs would compete for device 0 and device 1, while device 4 and device 5 go unused.

We'd like to modify initQuda so that we could specify which devices to use when running in multi-gpu mode.
We could add an optional integer array argument to initQuda to specify the devices, or overload initQuda to take either an integer array or a single integer. We would keep the current setting as a default if a device array is not specified, so other code need not change.
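For concreteness, the addition might look something like this (the array-taking entry point is hypothetical; only initQuda(int) exists today):

  /* Existing entry point: QUDA picks / is given a single device ordinal. */
  void initQuda(int dev);

  /* Hypothetical addition: the caller supplies the set of device ordinals this job
     may use, and QUDA assigns one per process (C++ could overload initQuda itself). */
  void initQudaDevices(const int *devices, int num_devices);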

Comments?

Staggered invert test produces segmentation fault

When building staggered in single-GPU mode, with make.inc created via

./configure --enable-os=linux --enable-gpu-arch=sm_20 --enable-staggered-dirac --disable-wilson-dirac --disable-domain-wall-dirac --disable-twisted-mass-dirac

and then running the staggered invert test, I get a segfault on a GTX 480:

[bjoo@qcd10i2 tests]$ ./staggered_invert_test --prec double --recon 18 --test 3
running the following test:
prec sloppy_prec link_recon sloppy_link_recon test_type S_dimension T_dimension
double double 18 18 mcg_even 24 24
QUDA: Found device 0: GeForce GTX 480
QUDA: Using device 0: GeForce GTX 480
Creating a DiracStaggeredPC operator
Creating a DiracStaggeredPC operator
Segmentation fault (core dumped)

MPI version runs fine.

Fix clover code generator

Currently the code generator does not automatically generate the standalone clover kernel code, and needs some hand editing to enable it.

Not all device memory freed

When using the QUDA clover inverter within Chroma, some device memory areas remain allocated after the inversion. This might be okay if QUDA were the only part of the program that accesses the GPU. However, there is ongoing work to extend QDP++ to use the GPU(s) as well. Thus, when using the QDP++ extension along with QUDA in the same Chroma run, device memory remains allocated after exiting the QUDA inverter and cannot be used in the remainder of Chroma, e.g. sink smearing, hadspec, etc.

A thin CUDA layer inserted into QUDA provided a dump of the allocation history made during the QUDA clover inversion:

0: 0x200300000 524288 1 blas_quda.cu:108
1: 0x200380000 1048576 1 blas_quda.cu:114
2: 0x200480000 1572864 1 blas_quda.cu:120

This refers to where cudaMalloc was called without calling cudaFree later.
(Master branch of QUDA pulled today, Sep 30 10am CET. Single GPU version.)
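Such a tracking layer can be quite thin: a pair of wrappers that record file and line at allocation time, plus a dump of whatever is still live at shutdown. A sketch (not the actual layer used for the dump above):

  #include <cstdio>
  #include <map>
  #include <string>
  #include <cuda_runtime.h>

  struct AllocInfo { size_t bytes; std::string where; };
  static std::map<void*, AllocInfo> live_allocs;

  #define trackedMalloc(ptr, bytes) trackedMallocImpl(ptr, bytes, __FILE__, __LINE__)

  cudaError_t trackedMallocImpl(void **ptr, size_t bytes, const char *file, int line) {
    cudaError_t err = cudaMalloc(ptr, bytes);
    if (err == cudaSuccess) {
      char where[256];
      snprintf(where, sizeof(where), "%s:%d", file, line);
      AllocInfo info; info.bytes = bytes; info.where = where;
      live_allocs[*ptr] = info;
    }
    return err;
  }

  cudaError_t trackedFree(void *ptr) { live_allocs.erase(ptr); return cudaFree(ptr); }

  void dumpLiveAllocs() {  // e.g. called at endQuda() time to report unfreed device memory
    int i = 0;
    for (std::map<void*, AllocInfo>::const_iterator it = live_allocs.begin();
         it != live_allocs.end(); ++it, ++i)
      printf("%d: %p %zu %s\n", i, it->first, it->second.bytes, it->second.where.c_str());
  }

With something like this in place, it is easy to verify whether endQuda() really returns the device to a clean state.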

Objectify and generalize the gauge field

This will require reworking gauge_quda.cpp into a form similar to ColorSpinorField, with derived classes for CPU and CUDA fields. This must allow:

  1. An arbitrary number of colours (though Nc=3 will continue to be a special case)
  2. A clean-up of the packing routines to minimize code bloat
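A sketch of the intended shape, mirroring the ColorSpinorField design (member list illustrative):

  // Illustrative sketch only; mirrors the cpu/cuda split used by ColorSpinorField.
  class GaugeField {
  protected:
    int nColor;                    // arbitrary Nc, with Nc=3 kept as a fast special case
    int volume;
    QudaPrecision precision;
    QudaReconstructType reconstruct;
    QudaGaugeFieldOrder order;     // one set of pack/unpack routines keyed on this
  public:
    virtual ~GaugeField() { }
  };

  class cpuGaugeField  : public GaugeField { /* host storage */ };
  class cudaGaugeField : public GaugeField { /* device storage; upload from a cpuGaugeField */ };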

GCR breakdown

The GCR solver seems to occasionally NaN out. We need to understand what is happening here:

  1. Is it the Gram-Schmidt orthogonalization that is going unstable?
  2. Bug in code?
  3. Something else?

Multiple calls to loadGaugeQuda

Multiple calls to loadGaugeQuda() cause the error message:

QUDA error: Error: even/odd field is not null, probably already allocated(even=0x54a0000, odd=0x54a7800)

(This is pernicious when wrapping the solver for use in HMC, where the gauge field changes and a new gauge field needs to be downloaded.)

Possible memory leak?

During QUDA clover BiCGStab inversion the following happens:

BiCGstab: 250 iterations, r2 = 9.738363e-08
BiCGstab: 251 iterations, r2 = 1.270892e-07
QUDA error: (CUDA) too many resources requested for launch (node 1, blas_quda.cu:946 in copyCuda)

Setup as in issue #45. In short:
QUDA: master, QMP, CUDA 4.0
Chroma: master
QMP: OpenMPI
4 C2070s sharing 1 PCIe bus

AntiPeriodicT true
AsymmetricLinop false
SolverType BICGSTAB
Verbose true
CudaPrecision DOUBLE
CudaSloppyPrecision HALF
CudaReconstruct RECONS_12
CudaSloppyReconstruct RECONS_12

Any ideas?

Enable QIO support for parallel reading?

QIO support currently only reads files serially, which means that it takes a very long time to read large gauge fields. I believe QIO supports parallel file reading (?), so this option should be enabled.

Staggered Dslash not disabled.

Hi Folks,
I've been trying to build QUDA with staggered disabled, and indeed, looking at the compile line:

/usr/local/cuda/bin/nvcc -O3 -D__CUDA_ARCH__=200 -ftz=true -prec-div=false -prec-sqrt=false -DMULTI_GPU -DOVERLAP_COMMS -DGPU_WILSON_DIRAC -arch=sm_20 -I/usr/local/cuda/include -DQMP_COMMS -I/home/bjoo/Devel/QCD/install/qmp/qmp2-1-6/openmpi/include -I../include -Idslash_core -I. dslash_quda.cu -c -o dslash_quda.o

i.e., there is no -DGPU_STAGGERED_DIRAC, and yet it looks like the staggered dslash core is still being processed:
dslash_core/staggered_dslash_core.h(237) : warning: variable "sign" was set but never used

Best, B

GPU interface functions

We need to have official GPU interface functions analogous to the currently provided CPU interface functions. This feature is required to enable more groups to use QUDA.

QUDA over QMP hangs

Salut,

Same cluster (M2090), recent QMP, recent QUDA: the code hangs before entering the dslash tuning. It might be a QMP problem, but maybe somebody has hit this before?

The MPI version is OpenMPI, modified by the vendor, but QUDA runs with it when compiled directly against MPI. The compiler is ICC 12 or 11.

Also, how do I make QUDA use all the cards on a given node?

thanks!
k.

Staggered fermion domain decomposition improvements

The following additions are required to allow testing of new preconditioners and domain-decomposition algorithms for improved-staggered actions:
1.) Fix half precision for asqtad fermions; the multi-GPU code is currently broken. Half precision is ideal for use in a DD preconditioner, since low accuracy (~0.1) is sufficient.
2.) Naive staggered fermion kernels, i.e., the nearest-neighbour operator only. The idea here is to use the naive staggered action with the fat-link gauge field to precondition the asqtad operator. This should yield much better scaling.
3.) Have the option to switch off communications for the dslash, i.e., apply the dslash operator to all sites in a node, but do not include the contributions from neighbouring GPUs. Ideally, the ability to switch off both the 1-hop and 3-hop communications, or just the 1-hop term, would yield maximum flexibility.

Allow user to specify logical topology for multi-GPU communications

At present, to properly run an application built with QUDA over QMP, it's necessary to specify "-geom Px Py Pz Pt" on the command-line. This is awkward in cases where the application has built-in logic to determine the best layout and is also incompatible with QDP/C, as summarized by James Osborn:

One issue with interfacing multi-GPU to QDP at the moment will be that
QDP isn't currently setting the logical topology. This was changed to
support multi-lattice in QDP, since one might not want the same node
mapping on each lattice, and QMP didn't have communicator support. Now
that QMP does, I could create a new communicator for each lattice and
set each's topology, but my concern is that MPI communicators could be
expensive in memory and I don't want to rely on this. I'm planning to
add some sort of light-weight communicators to QMP to address this.

Another issue is that the QMP topology has its own fixed mapping of the
ranks to the logical topology, which may not be optimal. Right now QDP
is using a different mapping which was a little better in some cases. I
am also planning to allow the QMP mapping to be more flexible, but
haven't gotten to this yet.

Anyway, the main point is that it would be nice if QUDA didn't rely on
the QMP topology, but instead allowed the user to pass in a function (or
functions) that specified the rank->coords and coords->rank mappings.
That would allow much greater flexibility for the applications using
QUDA. Additionally, allowing a QMP communicator to be specified would be
even better. You said that some groups may want to port QMP and not use
communicators, but it should be possible for those ports to still keep
the same API (with the communicator structure) and just have it always
be the same one (basically make QMP_comm_split always fail).

At this stage, I'd suggest not going so far as to rely on QMP communicators, which are still an "alpha" feature, but allowing the user to pass in a mapping function seems like a nice solution. This would also add much-needed flexibility to the MPI code path, which currently assumes a simple lexicographical ordering when assigning logical grid coordinates to MPI ranks.

To summarize, I propose replacing this declaration:

 void initCommsQuda(int argc, char **argv, const int *X, const int nDim);

with:

 typedef int (*QudaCommsMap)(const int *x, void *fdata);
 void initCommsQuda(const int *X, const int nDim, QudaCommsMap func, void *fdata);

Here fdata points to any auxiliary data required by the user-supplied mapping function func(). Passing NULL for fdata is perfectly valid. As an implementation detail, note that since we'll no longer be able to assume the existence of a QMP logical topology, we'll have to eliminate the use of "relative" sends and receives in face_qmp.cpp. This is a minor inconvenience but again quoting James Osborn:

The relative sends were just a cached version of the calculation of (get my coords) -> (add 1 mod length) -> (get rank). They aren't necessary (and were never used by QDP), since you can just create the neighbor table yourself and use the regular send.
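To make the proposal concrete, an application could supply a simple lexicographical mapping like this (a sketch written against the declaration proposed above):

  /* Example user-supplied mapping: lexicographical rank ordering with x fastest.
     fdata carries the process-grid dimensions chosen by the application. */
  static int lex_rank_from_coords(const int *coords, void *fdata)
  {
    const int *dims = (const int *)fdata;   /* {Px, Py, Pz, Pt} */
    int rank = coords[3];
    for (int d = 2; d >= 0; d--) rank = rank * dims[d] + coords[d];
    return rank;
  }

  /* ... */
  int grid[4] = {2, 2, 2, 2};               /* example 2x2x2x2 process grid */
  initCommsQuda(X, 4, lex_rank_from_coords, grid);   /* X = local lattice dims as before */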

Comments?

Convergence problems

The QUDA solver shows convergence problems. I am using:

  • current master of QUDA (88b9a84)
  • parallel build via QMP
  • clover wilson dirac operator
  • bicgstab solver
  • SP/HP
  • CUDA 4.0
  • 4xC2070
  • 32^3x64 SP configurations (I tried more than one to exclude problems with a particular configuration)

I get convergence problems:
BiCGstab: 14998 iterations, r2 = 1.991275e-06
BiCGstab: 14999 iterations, r2 = 1.724662e-06
BiCGstab: 15000 iterations, r2 = 2.394866e-06
QUDA warning: Exceeded maximum iterations 15000

Of course, I double-checked the clover and kappa parameters, boundary conditions, etc.

Since the initial try didn't succeed, I varied the solver parameters:

  • SP/HP and SP/SP
  • delta ranges from 0.1 to 1.0 resulting in roughly 1% to 20% of reliable updates (when using mixed precision)

Still no convergence!

I also changed the GPU grid layout and used
-geom 1 1 1 4 and -geom 1 1 4 1

Not sure what else to try.

Balint mentioned he sees the same convergence problems. The only difference is that he sees them for large GPU partitions (not only 4 GPUs, as in my case).

make.inc extract:

CUDA_INSTALL_PATH = /opt/cuda4
CPU_ARCH = x86_64   # x86 or x86_64
GPU_ARCH = sm_20    # sm_10, sm_11, sm_12, sm_13, sm_20 or sm_21
OS = linux          # linux or osx
PYTHON = python     # python 2.5 or later required for 'make gen'
DEVICE = 0          # CUDA device to use for 'make tune'

# compilation options
HOST_DEBUG = no                 # compile host debug code
DEVICE_DEBUG = no               # compile device debug code for cuda-gdb
VERBOSE = no                    # display kernel register usage
DSLASH_PROFILING = no           # multi-gpu dslash profiling
BUILD_WILSON_DIRAC = yes        # build Wilson Dirac operators?
BUILD_CLOVER_DIRAC = yes        # build clover Dirac operators?
BUILD_DOMAIN_WALL_DIRAC = no    # build domain wall Dirac operators?
BUILD_STAGGERED_DIRAC = no      # build staggered Dirac operators?
BUILD_TWISTED_MASS_DIRAC = no   # build twisted mass Dirac operators?
BUILD_FATLINK = no              # build code for computing asqtad fat links?
BUILD_HISQLINK = @BUILD_HISQLINK@ # build code for computing hisq fat links?
BUILD_GAUGE_FORCE = no          # build code for (1-loop Symanzik) gauge force?
BUILD_FERMION_FORCE = no        # build code for asqtad fermion force?
BUILD_HISQ_FORCE = no           # build code for hisq fermion force?

# Multiple GPU options
BUILD_MULTI_GPU = yes           # set to 'yes' to build the multi-GPU code
BUILD_QMP = yes                 # set to 'yes' to build the QMP multi-GPU code
BUILD_MPI = no                  # set to 'yes' to build the MPI multi-GPU code
OVERLAP_COMMS = yes             # set to 'yes' to overlap comms and compute

# Do we have qio support?
BUILD_QIO = no                  # set to 'yes' to build QIO code for binary i/o

# Do we compile the numa support
HAVE_NUMA = no

Auto tuner should tune for best cache configuration

Currently we set the cache configuration to 48K L1 and 16K shared (Fermi). However, this isn't optimal for all kernels, and the auto tuner could actually switch away from the default cache configuration whenever a kernel requests more than 16K of shared memory per SM.

The solution is to expand the TuneParam class to include a member variable of type cudaFuncCache, which will be tuned per kernel. This shouldn't be too much work; adding it to the 0.4.1 milestone.
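Schematically, the tuned preference would then be applied per kernel just before launch, e.g. (a sketch; the TuneParam members shown are illustrative):

  // Sketch: TuneParam gains a per-kernel cache preference that the tuner also scans.
  struct TuneParam {
    dim3 block, grid;
    int shared_bytes;
    cudaFuncCache cache_config;   // cudaFuncCachePreferL1 or cudaFuncCachePreferShared
  };

  __global__ void someKernel() { /* stand-in for a dslash/blas kernel */ }

  void launchTuned(const TuneParam &tp)
  {
    cudaFuncSetCacheConfig(someKernel, tp.cache_config);  // per-function setting
    someKernel<<<tp.grid, tp.block, tp.shared_bytes>>>();
  }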

multiple calls to loadGaugeQuda in minvcg branch

In the minvcg branch, when not using mixed precision in multi-GPU mode, multiple calls to loadGaugeQuda can elicit the error:

QUDA error: (CUDA) invalid argument (node 0, gauge_quda.cpp:805)

Background:
We fixed issue 5 (https://github.com/lattice/quda/issues/#issue/5) in the minvcg branch by adding freeGaugeQuda and freeCloverQuda calls, which can be called at the end of a solve so that a subsequent call to loadGaugeQuda can happily re-allocate the gauge field. However, a new issue has arisen: in uniform precision, gauge and gaugeSloppy are actually pointers to the same place, and somehow, after multiple calls to loadGaugeQuda, one can encounter the above error. This is pernicious in an HMC-like situation where multiple calls to loadGaugeQuda are necessary as the gauge field evolves.

An additional data point: when using a mixed-precision solver (e.g., precision = SINGLE, sloppy precision = HALF), this situation does not arise, which makes me suspect that the underlying cause of this bug is the aliasing of gauge to gaugeSloppy in uniform precision.

Reproducing:
configure the minvcg branch with

./configure --enable-os=linux --enable-gpu-arch=sm_20 --disable-staggered-dirac \
  --enable-wilson-dirac --disable-domain-wall-dirac --disable-twisted-mass-dirac \
  --enable-multi-gpu --with-qmp=/home/bjoo/Devel/QCD/install/qmp/qmp2-1-6/openmpi \
  --with-mpi=/home/bjoo/Toolchain/install/openmpi-1.5

Then link chroma against this and run the t_leapfrog test with a QUDA solver in the MD using uniform precision.

NB: Producing this error has so far required an external client making multiple calls to loadGaugeQuda (e.g., Chroma calling loadGaugeQuda during the MD evolution in HMC).
A small self-contained test within QUDA reproducing this error (without Chroma) would be desirable.

Enable support for multiple right hand sides

A major optimization that can be explored is to see whether amortizing the gauge field loads by acting on multiple spinors simultaneously leads to a large speed-up. This is much more important for the staggered dslash kernel, since it is the most bound by gauge field loads.

A simple way to explore this is to introduce a y dimension to the thread blocks, corresponding to the number of right-hand sides.

The easiest way to implement this is probably to create a 5-dimensional spinor field, where the length of the 5th dimension corresponds to the number of right-hand sides. I believe the only change required to the dslash kernel indexing would be to add to the spinor fields an offset equal to the y thread index multiplied by the length of the 4-dimensional spinor field:

spinor_index += threadIdx.y * length_4d;

By definition, the gauge field indexing is independent of threadIdx.y.
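A slightly fuller sketch of the launch and indexing (all names illustrative):

  // Host side: pack n_rhs right-hand sides as a contiguous 5-d field and give the
  // thread block a y extent equal to n_rhs.
  dim3 block(block_x, n_rhs, 1);
  dim3 grid((length_4d + block_x - 1) / block_x, 1, 1);

  // Device side, inside the dslash kernel:
  int sid = blockIdx.x * blockDim.x + threadIdx.x;    // 4-d site index, as before
  int spinor_index = sid + threadIdx.y * length_4d;   // offset into the 5-d spinor
  // gauge_index depends only on sid, so the gauge load is shared across threadIdx.y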

Even with this simple change, there will likely be significant improvement in kernel performance since gauge field loads should obtain reuse through the L1 / texture cache / L2. Further improvement is likely possible through using shared memory management.

Beyond the kernel, there are multiple changes and additions required:

  • Modifying the interface to deal with multiple right hand sides
  • Set the texture binding appropriately for the 5-d spinor length
  • Deal with the packing/unpacking of an array of cpu spinor fields
  • Multi-GPU packing / unpacking

Moving to a 5-d spinor field complicates the communications packing and unpacking logic, which currently expects a 4-d spinor field. How to solve this requires some thought. I think a simple solution would be to create the contiguous 5-dimensional spinor field, but also create an array of 4-dimensional spinor fields that are actually pointers to the corresponding parts of the 5-dimensional field. Thus the communications routines can be called on the 4-dimensional reference spinor fields, while the dslash kernel deals with the 5-dimensional field (as an aside, this may be a very quick and dirty way to get multi-GPU domain wall).

NUMA binding

QUDA should have the feature to conveniently bind to NUMA-optimized CPU/GPU pairings. Here are my thoughts so far:

*) We can add a separate C program to generate a NUMA mapping file. I already have such a C program, and we can copy or modify it for this purpose. Maybe we can add a utility/ or tools/ directory for that.

*) We can compile the tool and run it to generate the NUMA mapping file during "make tune". A recompile should then automatically compile the NUMA info into the executable.

*) QUDA should work correctly without the previous step.

Multi-dimensional Wilson parallelization appears buggy

Using multiple GPUs in anything other than the T dimension seems to give the wrong answer, and wilson_dslash_test fails, e.g.:

$mpirun -np 2 ./wilson_dslash_test --xdim 16 --ydim 16 --zdim 8 --tdim 64 --recon 12 --zgridsize 2 
running the following test:
prec recon   test_type     dagger   S_dim         T_dimension   dslash_type
single   12       1           0       16/16/8        64            wilson
Grid partition info:     X  Y  Z  T
                         0  0  1  0
Randomizing fields... done.
QUDA: Found device 0: GeForce GTX 480
QUDA: Found device 1: GeForce GTX 480
QUDA: Found device 2: GeForce GTX 480
QUDA: Found device 3: GeForce GTX 480
QUDA: Using device 0: GeForce GTX 480
Sending gauge field to GPU
Creating cudaSpinor
Creating cudaSpinorOut
Sending spinor field to GPU
Source: CPU = 1.0497e+06, CUDA = 1.0497e+06
Source: CPU = 1.0497e+06, CUDA = 1.0497e+06
Creating a DiracWilsonPC operator

Spinor mem: 0.006 GiB
Gauge mem: 0.026 GiB
Calculating reference implementation...done.
Executing 100 kernel loops...
done.

206.379265ms per loop
GFLOPS = 85.357785
GiB/s = 70.978243

Results: CPU = 875986.745504, CUDA=876005.116754, CPU-CUDA = 876005.116689
0 fails = 15690
1 fails = 15707
2 fails = 15680
3 fails = 15756
4 fails = 15685
5 fails = 15749
6 fails = 15888
7 fails = 15838
8 fails = 15838
9 fails = 15862
10 fails = 15863
11 fails = 15848
12 fails = 15870
13 fails = 15828
14 fails = 15832
15 fails = 15852
16 fails = 15844
17 fails = 15839
18 fails = 15681
19 fails = 15725
20 fails = 15697
21 fails = 15719
22 fails = 15717
23 fails = 15761
1.000000e-01 Failures: 9630 / 1572864  = 6.122589e-03
1.000000e-02 Failures: 253078 / 1572864  = 1.609027e-01
1.000000e-03 Failures: 378769 / 1572864  = 2.408148e-01
1.000000e-04 Failures: 392425 / 1572864  = 2.494971e-01
1.000000e-05 Failures: 393840 / 1572864  = 2.503967e-01
1.000000e-06 Failures: 393965 / 1572864  = 2.504762e-01
1.000000e-07 Failures: 394184 / 1572864  = 2.506154e-01
1.000000e-08 Failures: 1168774 / 1572864  = 7.430865e-01
1.000000e-09 Failures: 1530791 / 1572864  = 9.732507e-01
1.000000e-10 Failures: 1568667 / 1572864  = 9.973316e-01
1.000000e-11 Failures: 1572459 / 1572864  = 9.997425e-01
1.000000e-12 Failures: 1572821 / 1572864  = 9.999727e-01
1.000000e-13 Failures: 1572862 / 1572864  = 9.999987e-01
1.000000e-14 Failures: 1572864 / 1572864  = 1.000000e+00
1.000000e-15 Failures: 1572864 / 1572864  = 1.000000e+00
1.000000e-16 Failures: 1572864 / 1572864  = 1.000000e+00

This problem is present in the latest master commit a64abf9 on CUDA 4.0 and is likely the same issue that Balint reported, hence it is probably a bug introduced around commit 99b16e1.

Chroma segfault with current quda/master

I get a nice segfault when executing chroma with built-in quda support:

Initialize done
Initializing QUDA device: 0
QUDA: Found device 0: Tesla C2070
QUDA: Found device 1: Tesla C2070
QUDA: Found device 2: Tesla C2070
QUDA: Found device 3: Tesla C2070
[t060:09654] *** Process received signal ***
[t060:09654] Signal: Segmentation fault (11)
[t060:09654] Signal code: Address not mapped (1)
[t060:09654] Failing at address: 0xc
[t060:09654] [ 0] /lib64/libpthread.so.0() [0x337880f4a0]
[t060:09654] [ 1] ../toolchain/install/chroma-parscalar-parscalar-single-quda/bin/chroma(commCoords+0xb) [0x137757b]
[t060:09654] [ 2] ../toolchain/install/chroma-parscalar-parscalar-single-quda/bin/chroma(initQuda+0x174) [0x123b3b4]
[t060:09654] [ 3] ../toolchain/install/chroma-parscalar-parscalar-single-quda/bin/chroma(_ZN6Chroma10initializeEPiPPPc+0xbda) [0x67574a]
[t060:09654] [ 4] ../toolchain/install/chroma-parscalar-parscalar-single-quda/bin/chroma(main+0x29) [0x670c19]
[t060:09654] [ 5] /lib64/libc.so.6(__libc_start_main+0xfd) [0x3377c1ecdd]
[t060:09654] [ 6] ../toolchain/install/chroma-parscalar-parscalar-single-quda/bin/chroma() [0x670099]
[t060:09654] *** End of error message ***
Segmentation fault

Setup:

4 C2070 sharing 1 PCIe bus. 1 host total.
I would like to use the parscalar build of QDP++ for this machine.

Envvars:
$CUDA_NIC_INTEROP 1
$CUDA_VISIBLE_DEVICES 0,2,3,4

QMP,QDP++,QUDA,CHROMA recent clones, master branch each.

QMP with OpenMPI 1.5.4

QUDA with CUDA 4.0 (see configure line below); significant parts of make.inc:
CPU_ARCH = x86_64 # x86 or x86_64
GPU_ARCH = sm_20 # sm_10, sm_11, sm_12, sm_13, sm_20 or sm_21
OS = linux # linux or osx
BUILD_WILSON_DIRAC = yes # build Wilson Dirac operators?
BUILD_CLOVER_DIRAC = yes # build clover Dirac operators?
BUILD_DOMAIN_WALL_DIRAC = no # build domain wall Dirac operators?
BUILD_STAGGERED_DIRAC = no # build staggered Dirac operators?
BUILD_TWISTED_MASS_DIRAC = no # build twisted mass Dirac operators?
BUILD_FATLINK = no # build code for computing asqtad fat links?
BUILD_GAUGE_FORCE = no # build code for (1-loop Symanzik) gauge force?
BUILD_FERMION_FORCE = no # build code for asqtad fermion force?
BUILD_HISQ_FORCE = no # build code for hisq fermion force
BUILD_MULTI_GPU = yes # set to 'yes' to build the multi-GPU code
BUILD_QMP = yes # set to 'yes' to build the QMP multi-GPU code
BUILD_MPI = no # set to 'yes' to build the MPI multi-GPU code
OVERLAP_COMMS = yes # set to 'yes' to overlap comms and compute
BUILD_QIO = no # set to 'yes' to build QIO code for binary i/o

Notice that BUILD_MPI == no, even though it was given as a configure option. Is that correct?

quda/configure --enable-cpu-arch=x86_64 --enable-gpu-arch=sm_20 --enable-wilson-dirac --disable-domain-wall-dirac --disable-staggered-dirac --disable-twisted-mass-dirac --disable-staggered-fatlink --disable-gauge-force --disable-staggered-force --enable-multi-gpu --enable-overlap-comms --with-cuda=/opt/cuda4 --with-qmp=/Home/fwinter1/toolchain/install/qmp-parscalar-parscalar-single-quda --with-mpi=/Home/fwinter1/toolchain/install/openmpi-1.5 CXX=/Home/fwinter1/toolchain/install/openmpi-1.5/bin/mpiCC CC=/Home/fwinter1/toolchain/install/openmpi-1.5/bin/mpicc CFLAGS=-I/Home/fwinter1/toolchain/install/openmpi-1.5/include CXXFLAGS=-I/Home/fwinter1/toolchain/install/openmpi-1.5/include LDFLAGS=-L/Home/fwinter1/toolchain/install/openmpi-1.5/lib LIBS=-lmpi

Chroma:
/chroma/configure --prefix=/Home/fwinter1/toolchain/install/chroma-parscalar-parscalar-single-quda --with-qdp=/Home/fwinter1/toolchain/install/qdp++-parscalar-parscalar-single-quda --with-cuda=/opt/cuda4 --with-quda-0-3=/Home/fwinter1/git/quda CXX=/Home/fwinter1/toolchain/install/openmpi-1.5/bin/mpiCC CXXFLAGS=-O3

Investigate how to do a parallel build

The compile time of QUDA, especially in multi-GPU mode, hinders development. We should really work out how to enable parallel building of QUDA, namely how to split and parallelize the compilation of dslash_quda.cu. This is currently a single file because of the file-scope requirement of textures and constants.

  • The textures can be defined multiple times in different files; this shouldn't cause a problem. There doesn't appear to be any need for a texture bound in one kernel to be available to another kernel.
  • The constants are more problematic. If we define multiple constants in different files, how do they interact?
    1. Do they overlap, requiring the constants to be reset every time we switch to a different file's kernels?
    2. Do they occupy different areas of constant memory, so that with multiple constants declared we risk running out of constant memory very quickly?
  • What is the most time-consuming part of the compilation? Is it possible to split the ptx generation from the final assembly and parallelize over the assembler only?

This problem is only going to get exponentially worse as more and more kernels are incorporated in dslash_quda.cu.

Removal of all global variables

Global variables are causing real headaches for extending and maintaining the code base. This is especially true for multigrid. We need to eliminate them as much as possible.

Optionally use QMP for multi-GPU staggered

When building staggered with the --with-qmp=... flag in multi-GPU mode,
I get unresolved symbols (see the end of this message).

The issue is not present when compiling with pure MPI (--with-mpi=..., but no --with-qmp).
The code also builds fine when multi-GPU is disabled (absence of --enable-multi-gpu).

This is probably just some comms feature that never made it to a QMP version.

Unresolved symbols from the failing case are below:

/home/bjoo/Toolchain/install/openmpi-1.5/bin/mpicxx -fPIC -L/usr/local/cuda/lib64 -lcudart -L/home/bjoo/Devel/QCD/install/qmp/qmp2-1-6/openmpi/lib -lqmp su3_test.o test_util.o wilson_dslash_reference.o ../lib/libquda.a -o su3_test -fPIC -L/usr/local/cuda/lib64 -lcudart -L/home/bjoo/Devel/QCD/install/qmp/qmp2-1-6/openmpi/lib -lqmp
../lib/libquda.a(dslash_quda.o): In function `void staggeredDslashNoReconCuda<2, short2, short2, short2>(short2*, float*, short2 const*, short2 const*, short2 const*, short2 const*, QudaReconstructType_s, short2 const*, float const*, int, int, short2 const*, float const*, double const&, int, int, int, int, int, cudaColorSpinorField*, dim3)':
tmpxft_000053e3_00000000-1_dslash_quda.cudafe1.cpp: undefined reference to `exchange_gpu_spinor_start'
tmpxft_000053e3_00000000-1_dslash_quda.cudafe1.cpp: undefined reference to `exchange_gpu_spinor_wait'
tmpxft_000053e3_00000000-1_dslash_quda.cudafe1.cpp: undefined reference to `exchange_gpu_spinor_start'
tmpxft_000053e3_00000000-1_dslash_quda.cudafe1.cpp: undefined reference to `exchange_gpu_spinor_wait'
tmpxft_000053e3_00000000-1_dslash_quda.cudafe1.cpp:(.te
