So numpy must be well optimized. I tried optimizing my GPU kernel by minimizing array index lookups and so on, but nothing I came up with made a significant difference for this simple kernel. Both the CPU and GPU gave exactly the same answer, so that's nice.
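For illustration only (this sketch is not from the original post): "minimizing array index lookups" amounts to reading each array element into a local temporary once and reusing it, which is what the kernel's a2/b2 temporaries do. The same transformation in plain Python:

```python
def sum_squares_indexed(a, b, c):
    # indexes a[i] and b[i] twice per iteration
    for i in range(len(a)):
        c[i] = a[i] * a[i] + b[i] * b[i]

def sum_squares_cached(a, b, c):
    # caches each element in a local first, like the a2/b2 temporaries in the kernel
    for i in range(len(a)):
        a2 = a[i]
        b2 = b[i]
        c[i] = a2 * a2 + b2 * b2

a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]
c1 = [0.0] * 3
c2 = [0.0] * 3
sum_squares_indexed(a, b, c1)
sum_squares_cached(a, b, c2)
print(c1 == c2)  # both variants compute the same values
```

As the post notes, this kind of micro-optimization made no measurable difference for such a simple kernel.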
Here's the output with execution times...
GPU execution time: 0.0115399
CPU execution time: 2.7895e-05
CPU/GPU speed ratio for 10^0 kernel executions: 0.241726%
Difference between the 2 answers: 0.0
GPU execution time: 0.0115771
CPU execution time: 2.19345e-05
CPU/GPU speed ratio for 10^1 kernel executions: 0.189464%
Difference between the 2 answers: 0.0
GPU execution time: 0.0116088
CPU execution time: 2.19345e-05
CPU/GPU speed ratio for 10^2 kernel executions: 0.188947%
Difference between the 2 answers: 0.0
GPU execution time: 0.0115681
CPU execution time: 2.59876e-05
CPU/GPU speed ratio for 10^3 kernel executions: 0.22465%
Difference between the 2 answers: 0.0
GPU execution time: 0.011663
CPU execution time: 7.70092e-05
CPU/GPU speed ratio for 10^4 kernel executions: 0.660289%
Difference between the 2 answers: 0.0
GPU execution time: 0.023535
CPU execution time: 0.000612974
CPU/GPU speed ratio for 10^5 kernel executions: 2.60452%
Difference between the 2 answers: 0.0
GPU execution time: 0.0234549
CPU execution time: 0.0182121
CPU/GPU speed ratio for 10^6 kernel executions: 77.6472%
Difference between the 2 answers: 0.0
GPU execution time: 0.0668991
CPU execution time: 0.240016
CPU/GPU speed ratio for 10^7 kernel executions: 358.773%
Difference between the 2 answers: 0.0
GPU execution time: 0.567215
CPU execution time: 2.24371
CPU/GPU speed ratio for 10^8 kernel executions: 395.566%
Difference between the 2 answers: 0.0

With cgminer running at -I 9 on all the GPUs, the GPU's speed advantage doesn't budge significantly.
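To read those numbers: the "speed ratio" is just the CPU time expressed as a percentage of the GPU time, so anything over 100% means the GPU won. A quick sketch, checked against the 10^8 row above:

```python
def speed_ratio(cpu_t, gpu_t):
    # CPU time as a percentage of GPU time; > 100% means the GPU was faster
    return 100.0 * cpu_t / gpu_t

# figures from the 10^8 row above
print('%g%%' % speed_ratio(2.24371, 0.567215))  # prints 395.566%
```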
So pyopencl is pretty effective at interrupting cgminer and prioritizing its own threads.

GPU execution time: 0.179582
CPU execution time: 2.7895e-05
CPU/GPU speed ratio for 10^0 kernel executions: 0.0155333%
Difference between the 2 answers: 0.0
GPU execution time: 0.263615
CPU execution time: 2.31266e-05
CPU/GPU speed ratio for 10^1 kernel executions: 0.00877287%
Difference between the 2 answers: 0.0
GPU execution time: 0.263666
CPU execution time: 2.40803e-05
CPU/GPU speed ratio for 10^2 kernel executions: 0.00913287%
Difference between the 2 answers: 0.0
GPU execution time: 0.011616
CPU execution time: 2.81334e-05
CPU/GPU speed ratio for 10^3 kernel executions: 0.242195%
Difference between the 2 answers: 0.0
GPU execution time: 0.0116951
CPU execution time: 7.60555e-05
CPU/GPU speed ratio for 10^4 kernel executions: 0.650317%
Difference between the 2 answers: 0.0
GPU execution time: 0.023536
CPU execution time: 0.000617981
CPU/GPU speed ratio for 10^5 kernel executions: 2.62569%
Difference between the 2 answers: 0.0
GPU execution time: 0.0236619
CPU execution time: 0.0189419
CPU/GPU speed ratio for 10^6 kernel executions: 80.0524%
Difference between the 2 answers: 0.0
GPU execution time: 0.0630081
CPU execution time: 0.230431
CPU/GPU speed ratio for 10^7 kernel executions: 365.717%
Difference between the 2 answers: 0.0
GPU execution time: 0.82972
CPU execution time: 2.4491
CPU/GPU speed ratio for 10^8 kernel executions: 295.172%
Difference between the 2 answers: 0.0

Installation was a bit tricky. You have to make sure setuptools is overridden by distribute, but Ubuntu 12.04 makes this easy.
Thanks to kermit666 on SO for this simple approach to getting virtualenvwrapper and numpy up and running quickly on a fresh Ubuntu install.

#!/usr/bin/env sh
sudo apt-get install python-pip python-dev
sudo pip install virtualenv virtualenvwrapper
echo 'export PROJECT_HOME="$HOME/src"' >> $HOME/.bashrc
echo 'export WORKON_HOME="$HOME/.virtualenvs"' >> $HOME/.bashrc
echo 'source /usr/local/bin/virtualenvwrapper.sh' >> $HOME/.bashrc
sudo apt-get install -y gfortran g++
# sudo apt-get remove -y --purge python-setuptools

# start a new virtualenv project
mkproject parallel
pip install --upgrade distribute
pip install mako numpy pyopencl
Here's the Python code that ran the kernel and measured execution time. It's based on the official pyopencl example ...
import pyopencl as cl
import numpy
import numpy.linalg as la
import time

for M in range(0, 9):
    N = 10**M

    a = numpy.random.rand(N).astype(numpy.float32)
    b = numpy.random.rand(N).astype(numpy.float32)

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)

    mf = cl.mem_flags
    a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
    b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
    dest_buf = cl.Buffer(ctx, mf.WRITE_ONLY, b.nbytes)

    prg = cl.Program(ctx, """
        __kernel void sum(__global const float *a,
                          __global const float *b,
                          __global float *c)
        {
            int gid = get_global_id(0);
            float a2 = a[gid];
            float b2 = b[gid];
            c[gid] = a2 * a2 + b2 * b2;
        }
        """).build()

    gpu_ans = numpy.empty_like(a)
    gpu_t0 = time.time()
    prg.sum(queue, a.shape, None, a_buf, b_buf, dest_buf)
    # enqueue_copy blocks until the kernel and the copy back to the host finish
    cl.enqueue_copy(queue, gpu_ans, dest_buf)
    gpu_t = time.time() - gpu_t0
    print 'GPU execution time: %g' % gpu_t

    cpu_t0 = time.time()
    cpu_ans = a * a + b * b
    cpu_t = time.time() - cpu_t0
    print 'CPU execution time: %g' % cpu_t

    print 'CPU/GPU speed ratio for 10^%d kernel executions: %g%%' % (M, 100.0 * cpu_t / gpu_t)
    print 'Difference between the 2 answers:'
    print la.norm(cpu_ans - gpu_ans)
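As a side check (not part of the original benchmark): the CPU line `a * a + b * b` is the vectorized numpy equivalent of the kernel's per-element work, and on float32 inputs it matches an explicit element-by-element loop exactly, which is why the `la.norm` difference comes out to 0.0:

```python
import numpy
import numpy.linalg as la

a = numpy.random.rand(1000).astype(numpy.float32)
b = numpy.random.rand(1000).astype(numpy.float32)

vectorized = a * a + b * b  # what the benchmark's CPU path computes
looped = numpy.array([a[i] * a[i] + b[i] * b[i] for i in range(len(a))],
                     dtype=numpy.float32)

print(la.norm(vectorized - looped))  # 0.0: same float32 operations, same results
```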
Nice.