Faster subscripting

March 9th, 2010

Subscripting is one of the most powerful M-Language features, a feature that differentiates it from most other languages. In this post we show a small language tweak can speed up some cases of subscripting.

When subscripting, which of the following MATLAB statements performs best?

A(1:n)
A(1:end)

It turns out that in both Jacket and standard MATLAB the latter is faster. In fact, any subscripting you do should make use of that end keyword if possible: A(1:end), A(1:end-1).

Here’s what’s happening. When 1:n is used in MATLAB, it gets expanded out in memory by MATLAB. When MATLAB then goes to use it in the subscript, it has to look at each element individually as it processes the indexing. In contrast, with the 1:end syntax, both Jacket and MATLAB avoid the intermediate expansion and just start executing the indexing assuming a linear sequence.

Let’s do some timings. Here’s the Jacket version…

n = 1e4;
reps = 10000;

% warm up
A = grand(1,n);
b = A(500:end);
b = A(500:n);

gsync
tic
   for r = 1:reps
     b = A(500:end);  geval(b)
   end
gsync
gpu_end = toc / reps * 1e6

gsync
tic
   for r = 1:reps
     b = A(500:n);    geval(b)
   end
gsync
gpu_n = toc / reps * 1e6  % nanoseconds

This gives a speedup of about 2.3x on my laptop — for both the CPU and GPU versions.

1:n    251.9426 ns
1:end  108.2401 ns

Uncategorized

Jacket and GPUs show promise in Neuroscience with fMRI and SPM

March 8th, 2010

For those of you interested in neuroscience and neuroimaging, you have probably heard of a software capability called SPM or Statistical Parametric Mapping developed by a group at University College London.  Well, a group at Georgia Tech has been doing some work with Jacket and CUDA on SPM and have produced some initial results that show some promise.  Being able to speed up the image analysis of functional MRI can benefit the medical community in a big way.  AccelerEyes has been supporting these efforts at Georgia Tech and with the permission of the authors we have produced an initial look at their work.  Enjoy.

http://www.accelereyes.com/resources/spm-fmri

-Dave Gibson

CUDA Programming, MATLAB Programming

Median Filtering: CUDA tips and tricks

March 4th, 2010

Last week we posted a video recording from NVIDIA’s GTC09 conference. In the video, I walked through median filtering, presenting the vanilla implementation and then walking through progressive CUDA optimizations. A comment on that post suggested trying some other compiler flags, and it sparked a new series of experiments.

In the original video, we started with a vanilla CPU implementation of 3×3 median filtering. We then ported this to the GPU to realize some immediate gains, but then we started a string of optimizations to see how far we could drive up performance: switching to textured memory, switching to shared memory, switching the internal sorting of pixels, etc. The conclusion: pay attention to the resource usage reported by nvcc (registers, lmem, smem, cmem).

Times are in milliseconds (not speedups). I’m running this on a GeForce 8600M GT (MacBook Pro laptop). The source code is below if you’d like to try on your own.

  • The original baseline CPU and GPU algorithms both using bubble sort. The GPU version uses vanilla global memory accesses. The GPU is beating the CPU with brute force parallelism.
            cpu       = 0.0894
            gmem      = 0.0144
  • Put in the –maxrregcount=9 flag you suggested. It goes from using 11 reg, 36 lmem bytes to now using 9 reg, 40 lmem. Going from 11 to 9 registers means more blocks can launch, so we get a slight speedup at the cost of only 4 bytes of lmem (4 bytes == one global read).
            gmem      = 0.0144
            gmem_r9   = 0.0115
  • Put the exchange algorithm into both CPU and GPU — no more bubble sort. Still use vanilla memory accesses. We now see a dramatic win for the GPU, but I have no explanation for why the CPU takes longer now. Any ideas?
            cpu_exch  = 0.1215
            gmem_exch = 0.0048
  • For the GPU, keep the original exchange sort version that uses shared memory for accesses. This is the best time for all algorithms: fast memory accesses, efficient sorting, fewest registers.
            exch      = 0.0013

    Here they all are together:

        cpu       = 0.0894
        gmem      = 0.0144
        gmem_r9   = 0.0115
        cpu_exch  = 0.1215
        gmem_exch = 0.0048
        exch      = 0.0013

    Anyone know why the CPU takes longer when using exchange sort? I am checking the return values to make sure all the algorithms are still functionally correct.

    The source code for all of this is included in Jacket v1.3 (currently in the testing phase). But you can also download them here to try for yourself. This experiment is contained in ‘run_13.m’. Note, the timing makes use of gsync/geval which will roll out in v1.3. To run this with v1.2.2 or earlier, just use gforce instead (see comment at top or run_13.m).
    -James Malcolm

    CUDA Programming

Accelerate Computer Vision Data Access Patterns with Jacket & GPUs

March 1st, 2010

For computer vision, we’ve found that efficient implementations require a new data access pattern that MATLAB® does not currently support.  Matlab and the M language is great for linear algebra where blocks of matrices are the typical access pattern, but not for Computer Vision where algorithms typically operate on patches of imagery. For instance, to pull out patches of imagery in M, one must do a double nested for loop,

A = rand(100,100)
for xs = -W:W
for ys = -W:W
patch(xs+W+1, ys+W+1) = A(xs+1+x, ys+1+y);
end
end

…with guards for boundary conditions, etc. It gets even more complicated with non-square patches. On top of that, these implementations don’t translate to the GPUs memory hierarchy at all and are thus slow.

Therefore, we have implemented a command called windows,

B = windows(A, x, y, z, w, T),

whose purpose is to pull a large number of sparse (as opposed to sliding) windows out of A at the locations x, y, z with sizes radii w, and affine transforms T that can then be computed over. The command windows signals to Jacket that we’re doing a patched access pattern that can then be optimized on the GPU.

This alleviates some of the problems MATLAB has with implementing computer vision algorithms in general – - typically lots of mex files are written to handle these problems.

In the future, windows can possibly also be lazy. So,

hist(windows(…)),

would be intelligent and hist would use a windowed access pattern to compute many histograms in patches over an image. The implementations of nlfilter and colfilter would also imply a particular memory access pattern that Jacket can optimize for.

We’re hoping that such developments will lead to efficient implementations of common image patch reliant computer vision algorithms like KLT, SIFT, Mean-Shift etc in MATLAB without having to develop a tightly knit, hard to understand batch of C and CUDA.

-Gallagher

MATLAB Programming

A case study in CUDA optimization

February 20th, 2010

Jimi Malcolm, VP of Engineering and Co-founder of AccelerEyes takes about 15 minutes to share CUDA optimization strategies to maximize performance of CUDA code.  Watch the video below to find out what needs to go into strategizing CUDA development to maximize performance.  Jimi uses Median Filtering for this case study.


Get the Flash Player to see this player.


Jimi Malcolm

M7AJJ4PS8UBQ

CUDA Programming

Using Parallel For Loops (parfor) with MATLAB and Jacket

February 10th, 2010

MATLAB’s parallel for loops (parfor) allow the body of a for loop to be executed across multiple workers simultaneously, but with some pretty large restrictions.  With Jacket MGL, Jacket can be used within parfor loops, with the same restrictions.  However, it is important to note that Jacket MGL does not currently support co-distributed arrays.

Problem Size

Problem size might be the single most important consideration in parallelization using the Parallel Computing Toolbox (PCT) and Jacket MGL.  When data is used by a worker in the MATLAB pool it must be copied from MATLAB to the worker, and must be copied back when the computation is complete.  Additionally, when GPU data is used, it must then be copied by the worker to the GPU and back.  With small, simple problems, parallelization may not offer a performance improvement because of the overhead of this data movement.  Unfortunately, there is no simple answer to the question of “What size problem justifies multi-gpu parallelization?”  We have found that experimentation provides the best answer.

Array Slicing

One of the ways to reduce the overheard of parallelization is to minimize the amount of data being copied.  This can be done through array slicing.  When an array is used with only a single variable dimension, the array can be “sliced” so that ONLY the relevant data is copied.  For example, with a 128×128 array, there would be 16,384 entries.  If the array were used by a worker with both dimensions being referenced with variables, the entire array must be copied.  When only a single dimension is used, MATLAB only needs to copy the elements that the worker could possible refer to.  In the example here, only 128 elements would need to be copied instead of 16,384.

for/gfor/parfor in parfor

It’s possible to use the various types of for loops embedded inside a parfor loop, but they cannot appear directly in the body of the loops.  Instead, they must be inside of a function called by a loop.  Here is a simple example:

parfor i=1:8
  for j=1:8
    m(i,j) = i*j;
  end
end

The above example fails because of transparency.  Other than the iterator variable of the parfor, MATLAB must be able to evaluate indices at the time the code is read, not at the time the code is executed.  We can fix this example by changing it to:

function x = f(n, i)
  m = n;
  j = [];
  gfor j=1:8
    m(j) = i*j;
  gend
end

m = gones(8);
parfor i=1:8
  m(i,:) = f(m(i,:), i);
end

parfor in a gfor

It is not possible to use a parfor loop or an spmd command inside of a gfor

Variable Classification

Inside of a parallel loop, MATLAB attempts to classify each variable.  The different classes are:

- Temporary variable:  a variable created inside a single iteration whose lifespan is only that iteration

- Reduction variable: a variable which accumulates an its value across all iterations, but whose value make be computed regardless of the order of execution of the iterations.

- Broadcast variable: a variable declared outside of the loop and referenced in the loop, but whose value is not changed within the loop

- Sliced variable: a variable whose elements are operated on by independent iterations of the loop

- Loop variable: the iterator of the loop

If a variable cannot be classified into one of these categories, the loop will not be able to execute.  Further restrictions on the ways variables may be classified and used are explained in the Advanced Parfor Topics on The Mathworks web site at:

http://www.mathworks.com/access/helpdesk/help/toolbox/distcomp/brdqtjj-1.html

Understanding Some of the Restrictions

Things that are not compatible with parallelization of for loops:

- Loops may not contain spmd statements.

- Loops may not contain global or persistent variable declarations.

- Loops may not contain break or return statements

For more information about the MATLAB-imposed restrictions on parallel for loops, visit:

http://www.mathworks.com/access/helpdesk/help/toolbox/distcomp/bq__cs7-1.html

-Brett Lucey

MATLAB Programming

Lazy Execution in MATLAB GPU computing

February 9th, 2010

Several users have written to us enquiring about the need for a better understanding of Jacket’s compile on-the-fly approach for GPU computing. This capability in Jacket is one of the ways that makes it possible for users of the MATLAB language to run high performance M code on Nvidia GPUs without having to program in C++ and CUDA. This article tries to illustrate the advantages of this design and also provides a few examples as to how it can be used efficiently.

Jacket employs a compile on-the-fly approach primarily for arithmetic expressions, referred to as “lazy execution”. Each compilation using this approach comes with an overhead cost, and is why Jacket tries to reduce the number of compilations done using this approach (the Jacket User Guide provides more background). The gforce statement can be used to bypass the lazy execution design to force computation at any stage, if applicable.

Consider the following example,

A = cos(a) + sin(b)
B = sin(b) + cos(a)

In this case, lazy execution implies that only two kernels are compiled and executed – one each for the results A and B. If this philosophy isn’t used, individual commands for sin(a), cos(b), plus need to be dispatched for execution – there would have to be six dispatches (two for the additions, and one for each trig operation), which would ultimately be much costlier in the context of a large loop.

To better illustrate the point, consider the following codes:

Code A: Jacket code

gcache flush;               % Makes a clean slate. Removes old traces
tic
for i = 1:10000
  p = grand(100); q = grand(100);
  gforce(p,q);              % Forces inputs to exist before proceeding
  p = cos(p);
  sp = sin(p);
  cq = cos(q);
  sq = sin(q);
  A = cp + sq;
  B = cq + sp;
  gforce(A,B);               % Makes sure A,B are computed
end
toc

- This code just compiles one kernel that computes A,B on the first iteration of the loop, and this compiled kernel is reused subsequently on every iteration. On an Intel Xeon W5580 CPU @ 3.20GHz with a Tesla C1060, this code timed 13.67 seconds.

Code B: Precompiled Kernels, No Lazy Execution,

% Simulation of the case where computation is needed at every step
% with precompiled kernels

gcache flush;              % Makes a clean slate. Removes old traces

% Precompile all kernels that will be used in the loop
x = sin(grand(100)); gforce(x);         % sin kernel compiled
y = cos(grand(100)); gforce(y);         % cos kernel compiled
z = x + y; gforce(z);                   % kernel for a+b compiled
tic
for i = 1:10000
  p = grand(100); q = grand(100);
  gforce(p,q);                       % Forces inputs to exist
  cp = cos(p); gforce(cp);
  sp = sin(p); gforce(sp);
  cq = cos(q); gforce(cq);
  sq = sin(q); gforce(sq);
  A = cp + sq; gforce(A);
  B = cq + sp; gforce(B);
end
toc

The timed code here uses all pre-compiled kernels, but does a dispatch for each computation involved. However, in the same environment as code-A, it takes 22.36 seconds to execute.

Having got this simple piece of code out of the way, there are a couple of other aspects of this compile on-the-fly/ lazy execution mechanism that need to be highlighted:

  • In an ideal scenario, the first few iterations of a loop should be able to generate all the kernels needed to run the loop, and hence generate a good speed-up over a large number of iterations. Most loops in fact (even complex ones!) satisfy this criterion. For example, in Code A above, it is easy to see that if there is a kernel that is capable of a (cos + sin) computation, everything that is needed to run the loop is there by the end of the first iteration itself
  • In such a scenario, the first few iterations themselves would serve as ‘warm-up’ to generate good speed-ups for the entire loop. For example, timing each iteration in Code A reveals that the first iteration takes 0.93 seconds (includes compilations), and subsequent iterations take about 1.2 milliseconds each.
  • However, if the loop is poorly designed, there may be excessive compilations, which cause bad performance. For example, if there is a CPU operand in a loop that changes every iteration (as shown below), it needs a compilation for every iteration.

A = grand(n);
for ii = 1:m
  A = exp(ii) * A;  % BAD: this causes recompile every iteration
end
  • The trick to get around these excessive compilations is to make “ii” a GPU variable (shown below). It may be noted that the same behavior would be seen if a changing CPU variable exists in place of the iterator (e.g. A = rand * A). The same solution applies.
  • for ii = gsingle(1:m)
      A = exp(ii) * A;  % GOOD: this avoids recompile
    end
  • To avoid kernels getting too big (in terms of memory / amount of work they do), Jacket compiles them even if results are not requested / needed.
  • In some cases of loops involving many element-wise arithmetic operations, Jacket may not be able to efficiently cache the arithmetic operations resulting in excessive re-compilations of the loop body. If you notice arithmetic loop performance to be slow or staggered, try using gforce in the loop to denote an evaluation breakpoint.

a = grand(1024);
gcache flush;
tic
for i = gsingle(1:300)
  a = a + sin(a)/4;
  a = a + log(a/4);
  b = a ./ exp(a);
  b = (a + b).^3 .* log(a);
  c = (a - 5) .* (a - 3).^2;
  a = c ./ a;
  % Try adding gforce() here.
end
toc

This code segment takes 2.59s to run as given, but adding the gforce statement at the indicated place accelerates it to be done in 1.08s. However, it must be noted that putting a gforce statement in every loop may not result in acceleration of the code. In fact, in some cases, it may worsen performance by forcing execution flow away from more optimal kernels.
The AccelerEyes team is continuing to drive innovation to overcome this limitation, and make this on-the-fly compilation feature of Jacket more efficient in terms of both finding most optimal kernels and speed of compilation.

- Sandeep Maddipatla

MATLAB Programming , , ,

How long does it take to get 98X performance improvement with GPUs?

February 4th, 2010

Well, here is a recent story with one of our customers that accomplished 98X performance speed up with Jacket in 16 days.  Of the 16 days, 15 days were spent sending emails back and forth about the problem and less than a day was spent getting the customer code in Jacket and running some performance tests!  Who would have imagined GPU programming with performance in 1 day.  Happy Reading.

Day 1:

Customer uses inverse radon (or iradon in MATLAB terms) extensively for their back projection algorithms.  They would like to know when the iradon function will be available/supported in Jacket.

AccelerEyes product management informs the customer that the inverse radon algorithm used in MATLAB is based on the filtered back project algorithm.  The key elements of that algorithm are interpolation, fft and frequency domain filtering which are all supported in Jacket and should provide decent speed-up for large enough data sizes.

Day 4:

The customer provides email response to product management indicating a couple of things:

  1. They would be interested in working with AccelerEyes and Jacket on an initial implementation of inverse radon given the information provided in the Day 1 response.
  2. If the initial implementation demonstrated a substantial savings in execution time, the customer could allocate the appropriate internal resource to leverage Jacket and GPUs for their needs.
  3. They had some (very limited) experience with programming GPUs directly (we assume CUDA), and had determined that the approach would require far too much development effort from a relatively small staff of software developers.
  4. The customer is ready to work on a prototype/pilot to get some results.

Day 17:

6:49am: After 13 days, the customers sends a follow up email to product management and corporate email box asking whether or not AccelerEyes was interested in working with the customer on the inverse radon prototype/pilot.

“turns out the spam-filters are not always your best friend as the customers email on Day 4 got caught in the corporate spam filter and the product manager never got the message” – morale of the story is follow up email with a phone call in time sensitive situations!

11:31am: After profuse apologies by product management regarding the delay, an exchange of information about the customer code begins.  A review of the use case for the function is undergone (use case = the MATLAB code developed by the customer for the application).   The MATLAB code for the iradon function itself , in addition to the MATLAB built-in functions are also reviewed with the customer to understand all aspects of the application/algorithm.

2:31pm: It becomes clear what needs to be done with Jacket and how parts of the customer’s MATLAB code and other m-files need to be prepared for Jacket.

For all Jacket users: many of you are interested in various MATLAB functions to be implemented in Jacket to ease the programming effort to leverage GPUs, increase performance on more applications and in general achieve more results.  Although the outline below of what needed to be done to get iradon working for this customer was not a design goal and is NOT for all Jacket users, this approach can be used by many Jacket customers to achieve results today without waiting for the specific function or functions to show up in Jacket documentation and fully supported.

The approach:

Step 1: Find the function in the MATLAB source tree that is bottlenecking your performance – in this case “iradon.m”

Step 2: Save the function or .m file as a GPU function like foo.m, in this case the function was saved as “giradon.m”

Step 3: In your foo.m function (this case giradon.m), pass the input variables from your application into the new function you are creating (example – function [img,H] = iradon(varargin) becomes function [img,H] = giradon(varargin))

Step 4: Add a “g” pre-fix to zeros when allocating memory which will allocate memory on the GPU for execution – if applicable

Step 5: Assign variables to be GPU variables p=gsingle(p); that will be used during execution of the most computational intensive calls within the m-file.  It is only necessary to make GPU variables of the data that goes to the GPU and where the Jacket function is supported – fft for example.  In this particalur case the following code was added to the iradon.m file to make it Jacket compatible for ‘linear’ case.

%% Change 1: Add below line 158

%% Converting to gsingle values for GPU usage :

costheta=gsingle(costheta);

sintheta=gsingle(sintheta);

p=gsingle(p);

x=gsingle(x);

y=gsingle(y);

ctrIdx=gsingle(ctrIdx);

Step 6: Vectorize code where possible. Especially where for-loops are involved.  In the iradon case the customer vectorized the for-loop in the function filterProjections() as shown below:

%% Change 2: Lines 204-206

%% Following ‘for’ loop unrolled using vectorizations

% for i = 1:size(p,2)

%    p(:,i) = p(:,i).*H; % frequency domain filtering

% end

%% After Vectorizing the above for-loop :

p(:,1:size(p,2))=p(:,1:size(p,2)).*repmat(H,1,size(p,2));

%%

Step 6:  Open your MATLAB code that calls the bottleneck function.  You have two options here, you can program your MATLAB code to always use GPUs by changing the call to iradon.com (in this example) to foo.m or giradon.m or you can set a flag that will ask the user if they have GPUs or not.

If no, they follow the normal execution of the code using the standard m-files.

If yes, you will need to follow an execution path that has the iradon.m file (from this example) changed to giradon.m or foo.m.

Once you have modified your MATLAB application, save and run the code with the GPU option on.

3:54pm: Customer needs to test and benchmark changes to code and chooses to use the following data sizes for initial testing while a determination is made of the most typical parameters used in reconstructions :

P = phantom(128);

R = radon(P,0:179);

I1 = iradon(R,0:179);

I2 = iradon(R,0:179,’linear’,'none’);

Day 18:

10:17am: Customer determines that the most typical parameters would be to increase the phantom size to 1024 X 1024 ( P = phantom(1024) ) from P = phantom(128).  It turned out the number of angles used is 180 so the initial use case of 0:179 was in line with the requirement.

5:55pm: The following performance results are achieved with Jacket on GPUs using two different hardware configurations and various dimensions of data.

System 1:

CPU: Intel Xeon Dual Core 2.8 GHz, 1GB RAM, (2 Cores)

GPU: GeForce GTX 295, 1212 MHz, 895 MB VRAM, Compute 1.3

OS: Window XP (32-bit)

MATLAB Version: R2009b

Jacket Version: 1.2.2

Phantom Dim Projections Dim ‘iradon’ runtime (sec) ‘giradon’ runtime

(sec)

Speed GPU vs CPU Error Norm (inf)
128×128 185×180 0.64 0.98 0.65 1.77E-006
256×256 367×180 2.45 0.91 2.69 3.35E-006
512×512 729×180 10.39 0.97 10.71 6.82E-006
1024×1024 1453×180 41.46 1.15 36.05 1.84E-005
2048×2048 2901×180 167.07 1.69 98.86 3.12E-005

System 2:

CPU: Intel(R) Dual Quad Core Xeon(R) CPUs W5580 @ 3.20GHz, 32GB RAM, (8 cores)

GPU: Tesla C1060, 1265 MHz, 4095 MB VRAM, Compute 1.3

OS: Linux Fedora 10 (32-bit)

MATLAB Version: R2009b

Jacket Version: 1.2.2

Phantom Dim Projections Dim ‘iradon’ runtime (sec) ‘giradon’ runtime

(sec)

Speed GPU vs CPU Error Norm (inf)
128×128 185×180 0.1 0.4 0.25 1.77E-006
256×256 367×180 0.38 0.41 0.93 3.35E-006
512×512 729×180 1.82 0.43 4.23 6.82E-006
1024×1024 1453×180 8.35 0.51 16.37 1.84E-005
2048×2048 2901×180 38.14 1.93 19.76 3.12E-005

The “iradon” runtimes were run on the CPU where the “giradon” runtimes were leveraging the GPU.  Of course the title of this blog post was a little misleading as it is not true that 98X speed up was the case in every data size and every configuration as performance does vary in all applications based on code, data size, hardware and other factors.  In this case, an everyday laptop with a GeForce GPU can provide almost 100X performance improvement versus a garden variety Intel Core Duo 2.8 GHz processor for the right size data problem.  But even when you run on state of the art hardware, both CPU and GPU, and the data size is appropriate for the GPU architectures, it is not unusual to get 10 to 20X performance improvement.

There is actually a bigger story here than the performance improvements and speed up of this code!  It has to do with productivity.  This customer had already determined that they did not have the resources, knowledge and expertise to program CUDA to leverage the GPU resources.  Jacket provides a platform for engineers, scientists and analysts, that are familiar with MATLAB, to quickly (in this case, a couple of hours) make current applications and algorithms work on GPUs to achieve substantial performance gains, even when the latest and greatest hardware is not immediately at ones disposal.  Shortening the time to solution is one of the major value points for Jacket and should be considered even if an option to program in CUDA is available.

PS – one more point to the story, email is great but today’s spam filters can really disrupt communication if alternative methods are not used in conjunction with our new way of life.

MATLAB Programming , , , ,

Streaming data to the GPU

February 1st, 2010

We just got off the phone with a finance firm wanting to using Jacket to process market data as it streams in. Essentially, they want to avoid loading memory into MATLAB before sending data out to the GPU. We’ve had similar requests from defense companies streaming radar and other signals, from healthcare firms dealing with medical image processing, from companies doing video image processing, and more.

Probably the best way is to use the SDK to push data directly out to the GPU. With the SDK, you can get accesses to both the host-side and device-side memory buffers within Jacket. With these pointers in hand, you can copy your data from disk or some socket — and avoid an intermediate MATLAB representation.

Below I’ve pasted the complete source code for a little function that pulls in several floating-point values from a file (in this case /dev/random) and stores that data directly into the internal Jacket host-side (CPU) buffer. This function returns a new Jacket GPU variable that can be used in subsequent computations. From inside MATLAB, our new function ‘my_read’ can be called like this:

>> A = my_read;  % pull in some random values
>> A + 1         % perform some computation

But keep in mind this is random garbage data it’s actually reading.

This approach could similarly be extended to read from sockets, pipes, memory-mapped hardware, etc. You could set it up to take in parameters indicating where to start reading, how much to read, etc.
-James Malcolm

Download the source and Makefile.

#include <jacket.h>
#include <stdio.h>

err_t jktFunction(int nlhs, mxArray *plhs[], int nrhs, mxArray *prhs[])
{
    char *filename = "/dev/random"; /* file to read from */
    off_t offset = 42;              /* starting position to read */
    size_t numel = 10;              /* how many values to read */

    FILE *f = fopen(filename, "rb");
    if (fseek(f, offset, SEEK_CUR) != 0)
        return err("couldn't fseek");

    /* note: assume single-precision */
    mxArray *m = plhs[0] = jkt_new(1, numel, mxSINGLE_CLASS, mxREAL);

    /* get host-side pointer */
    void *h_data;
    TRY(jkt_mem_host(&h_data, m));

    /* copy from disk directly into host-side buffer */
    if (fread(h_data, sizeof(float), numel, f) != numel)
        return err("problem reading");

    return errNone;
}

Here’s a Makefile to build it on Linux/Mac.

# 1) wherever you have CUDA
CUDADIR = /usr/local/cuda
# 2) wherever you have Matlab
MATLAB = /opt/matlab
# 3) wherever you have Jacket installed
JKT = /usr/local/jacket
# 4) uncomment if 64-bit
#  OS = 64

MEXT = $(shell $(MATLAB)/bin/mexext)
CUDAINC = -I$(CUDADIR)/include
CUDALIB = -L$(CUDADIR)/lib$(OS) -lcudart -Wl,-rpath,$(CUDADIR)/lib$(OS)
JKTLIB = -L$(JKT)/engine -ljacket$(OS)
JKTINC = -I$(JKT)/engine/include/
NVMEX = $(JKT)/sdk/common/nvmex -f $(JKT)/sdk/common/nvopts.sh

all: my_read.$(MEXT)

%.$(MEXT) : %.cu
	env MATLAB=$(MATLAB) $(NVMEX) $^ $(JKTLIB) $(JKTINC)
                                     $(CUDAINC) $(CUDALIB)
                                     -cxx -largeArrayDims

CUDA Programming

Torben’s Corner

January 19th, 2010

We work very closely with our customers and really appreciate the feedback we receive and value the insight provided.  The following wiki http://www.accelereyes.com/wiki/index.php?title=Torbens_Corner by Torben Larsen’s team at AAU outlining performance observations on GPUs and CPUs are not only value to our technical team but also valuable to our entire customer base.  Thanks Torben for this great resource!

Announcements