Difference between the @benchmark and @time macros in Julia

I've recently discovered a huge difference between the two macros @benchmark and @time in terms of the memory allocation and time they report. For example:
@benchmark quadgk(x -> x, 0., 1.)
BenchmarkTools.Trial:
memory estimate: 560 bytes
allocs estimate: 17
--------------
minimum time: 575.890 ns (0.00% GC)
median time: 595.049 ns (0.00% GC)
mean time: 787.248 ns (22.15% GC)
maximum time: 41.578 μs (97.60% GC)
--------------
samples: 10000
evals/sample: 182
@time quadgk(x -> x, 0., 1.)
0.234635 seconds (175.02 k allocations: 9.000 MiB)
(0.5, 0.0)
Why is there such a big difference between these two examples?

The reason is compilation overhead. To see this, define:
julia> h() = quadgk(x -> x, 0., 1.)
h (generic function with 1 method)
julia> @time h()
1.151921 seconds (915.60 k allocations: 48.166 MiB, 1.64% gc time)
(0.5, 0.0)
julia> @time h()
0.000013 seconds (21 allocations: 720 bytes)
(0.5, 0.0)
as opposed to
julia> @time quadgk(x -> x, 0., 1.)
0.312454 seconds (217.94 k allocations: 11.158 MiB, 2.37% gc time)
(0.5, 0.0)
julia> @time quadgk(x -> x, 0., 1.)
0.279686 seconds (180.17 k allocations: 9.234 MiB)
(0.5, 0.0)
What happens here is that in the first approach the anonymous function x -> x is defined only once, because it is wrapped inside h, so quadgk is compiled for it only once. In the second approach a new x -> x is defined on every call, so the compilation has to be performed each time.
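You can see the mechanism directly in the REPL: each occurrence of the literal x -> x creates a brand-new anonymous function type, so code compiled for one occurrence cannot be reused for another (a small illustration; the generated type names vary between sessions):
julia> typeof(x -> x) == typeof(x -> x)   # two distinct, freshly generated closure types
false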
And now the crucial point: BenchmarkTools.jl wraps your code in a function, which you can check by inspecting how the generate_benchmark_definition function works in that package, so it is equivalent to the first approach presented above.
Another way to run the code without redefining the optimized function would be:
julia> g(x) = x
g (generic function with 1 method)
julia> @time quadgk(g, 0., 1.)
1.184723 seconds (951.18 k allocations: 49.977 MiB, 1.58% gc time)
(0.5, 0.0)
julia> @time quadgk(g, 0., 1.)
0.000020 seconds (23 allocations: 752 bytes)
(0.5, 0.0)
(though this is not what BenchmarkTools.jl does - I add it to show that when you use the function g you do not pay the compilation cost twice)
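A side note not covered above: BenchmarkTools also exports @btime, which prints a single @time-style line but uses the same function wrapping and repeated sampling as @benchmark, so the reported number excludes the compilation cost:
julia> using BenchmarkTools
julia> @btime quadgk(x -> x, 0., 1.)  # prints the minimum time and allocation count over many samples, then returns the result
(0.5, 0.0)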

Related

Swift - create sine waves with high frequencies

I have the following problem: for the sake of amplitude modulation, I want to generate a sine wave with a given frequency. For lower frequencies (such as 440 Hz) the algorithm works well, but at high frequencies (for example 20,000 Hz) I get additional lower-frequency noise that grows over time - the longer I play the signal, the more unwanted frequencies appear - and the signal is thus distorted.
Here's the essence of my algorithm:
import Foundation
import Accelerate

var index = 0   // sample index; declared elsewhere in the original code
let methodStart = NSDate()
let n = vDSP_Length(1024)
let page: [Float] = (0 ..< n).map { _ in
    let val: Float = sin(2.0 * .pi * 20000.0 / 44100 * Float(index))
    index += 1
    return val
}
let methodFinish = NSDate()
let executionTime = methodFinish.timeIntervalSince(methodStart as Date)
print("Execution time: \(executionTime)") // ~0.0005 s
As you can see, I work with a loop - later the page array is used to generate the corresponding tone.
I measured the execution time, which looks like this:
Execution time: 0.0004639625549316406
Execution time: 0.000661015510559082
Execution time: 0.0005699396133422852
Execution time: 0.00047194957733154297
Execution time: 0.0005259513854980469
Execution time: 0.00047194957733154297
Execution time: 0.0005289316177368164
When we talk about a signal frequency of 20,000 Hz, the execution time should not be an issue: one signal period is 1 / 20,000 = 0.00005 s, so roughly ten periods fit into the ~0.0005 s execution time.
My question: how can I achieve a pure signal?
Should I work with pointers? If so, how can I do so in my case?

CUDA GPU time in MATLAB increasing when the grid size is increased

I am using MATLAB R2017a. I am running a simple code to calculate the cumulative sum from the first point up to the i-th point.
My CUDA kernel code is:
__global__ void summ(const double *A, double *B, int N) {
    for (int i = threadIdx.x; i < N; i++) {
        B[i+1] = B[i] + A[i];
    }
}
My MATLAB code is:
k = parallel.gpu.CUDAKernel('summ.ptx','summ.cu');
n = 10^7;
A = rand(n,1);
ans = zeros(n,1);
A1 = gpuArray(A);
ans2 = gpuArray(ans);
k.ThreadBlockSize = [1024,1,1];
k.GridSize = [3,1];
tic   % missing from the original snippet, but implied by the toc below
G = feval(k,A1,ans2,n);
G1 = gather(G);
GPU_time = toc
I am wondering why the GPU time increases when I increase the grid size (k.GridSize). For instance, for 10^7 data points:
k.GridSize=[1,1] the time is 8.0748s
k.GridSize=[2,1] the time is 8.0792s
k.GridSize=[3,1] the time is 8.0928s
From what I understand, for 10^7 data points the system will need 10^7 / 1024 ~ 9766 blocks, so the grid size should be [9766,1].
The GPU device is
Name: 'Tesla K20c'
Index: 1
ComputeCapability: '3.5'
SupportsDouble: 1
DriverVersion: 9.1000
ToolkitVersion: 8
MaxThreadsPerBlock: 1024
MaxShmemPerBlock: 49152
MaxThreadBlockSize: [1024 1024 64]
MaxGridSize: [2.1475e+09 65535 65535]
SIMDWidth: 32
TotalMemory: 5.2983e+09
AvailableMemory: 4.9132e+09
MultiprocessorCount: 13
ClockRateKHz: 705500
ComputeMode: 'Default'
GPUOverlapsTransfers: 1
KernelExecutionTimeout: 0
CanMapHostMemory: 1
DeviceSupported: 1
DeviceSelected: 1
Thank you for your response.
You appear to be worrying about a very, very small portion of the time compared to the overall effect. The real question you should be asking is: does this amount of time to solve this problem make sense? The answer to that is: no, absolutely not.
Here is modified code which should run much faster:
n=10^7;
dev = gpuDevice;
A = randn(n,1,'gpuArray');
B = randn(n,1,'gpuArray');
tic
G = A+cumsum(B);
wait(dev)
toc
On my 1060 this runs in 0.03 seconds. For even faster speeds you can use single precision.
At any rate, that 0.02 seconds could easily be attributable to small changes in load on your GPU. That is a much more likely scenario than anything to do with grid sizes.
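Another option, not used in the answer above: MATLAB's gputimeit times a function handle on the GPU and takes care of the synchronization that the explicit wait(dev) handles here. A minimal sketch reusing the arrays from the snippet above:
f = @() A + cumsum(B);   % same operation as above, wrapped in a function handle
t = gputimeit(f)         % typical execution time in seconds, with GPU synchronization handled for you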

scipy integrate.quad returns an incorrect value

I use scipy's integrate.quad to calculate the CDF of a normal distribution:
import math
import numpy as np
from scipy import integrate

def nor(delta, mu, x):
    return 1 / (math.sqrt(2 * math.pi) * delta) * np.exp(-np.square(x - mu) / (2 * np.square(delta)))

delta = 0.1
mu = 0
t = np.arange(4.0, 10.0, 1)
nor_int = lambda t: integrate.quad(lambda x: nor(delta, mu, x), -np.inf, t)
nor_int_vec = np.vectorize(nor_int)
s = nor_int_vec(t)
for i in zip(s[0], s[1]):
    print(i)
It prints the following:
(1.0000000000000002, 1.2506543424265854e-08)
(1.9563704110140217e-11, 3.5403445591955275e-11)
(1.0000000000001916, 1.2616577562700088e-08)
(1.0842532749783998e-34, 1.9621183122960244e-34)
(4.234531567162006e-09, 7.753407284370446e-09)
(1.0000000000001334, 1.757986959115912e-10)
For some x it returns a value close to zero when it should return 1.
Can somebody tell me what is wrong?
Same reason as in "Why does quad return both zeros when integrating a simple Gaussian pdf at a very small variance?", but seeing as I can't mark it as a duplicate, here goes:
You are integrating a function with tight localization (at scale delta) over a very large (in fact infinite) interval. The integration routine can simply miss the part of the interval where the function is substantially different from 0, judging it to be 0 instead. Some guidance is required. The parameter points can be used to this effect (see the linked question) but since quad over an infinite interval does not support it, the interval has to be manually split, like so:
for t in range(4, 10):
    int1 = integrate.quad(lambda x: nor(delta, mu, x), -np.inf, mu - 10*delta)[0]
    int2 = integrate.quad(lambda x: nor(delta, mu, x), mu - 10*delta, t)[0]
    print(int1 + int2)
This prints 1 or nearly 1 every time. I picked mu-10*delta as a point to split on, figuring most of the function lies to the right of it, no matter what mu and delta are.
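For reference (not part of the original answer), this is roughly what the points argument mentioned above looks like on a finite interval; the -50 lower limit is an arbitrary hypothetical stand-in for -inf:
# points= marks where the integrand has difficult or concentrated behaviour;
# quad rejects it when either limit is infinite, which is why the split above is needed
val, err = integrate.quad(lambda x: nor(delta, mu, x), -50, 9, points=[mu])
print(val)  # ~1.0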
Notes:
Use np.sqrt etc.; there is usually no reason to use math module functions in NumPy code. The NumPy versions are available and are vectorized.
Applying np.vectorize to quad is not doing anything besides making the code longer and slightly harder to read. Use a normal Python loop or list comprehension. See NumPy vectorization with integration
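For example, combining both notes with the split-interval trick above, the np.vectorize call in the question could be replaced by a plain list comprehension (a sketch; nor is rewritten here with np.sqrt and np.pi):
def nor(delta, mu, x):
    # same density as in the question, with NumPy functions throughout
    return 1 / (np.sqrt(2 * np.pi) * delta) * np.exp(-np.square(x - mu) / (2 * np.square(delta)))

tail = integrate.quad(lambda x: nor(delta, mu, x), -np.inf, mu - 10*delta)[0]
results = [tail + integrate.quad(lambda x: nor(delta, mu, x), mu - 10*delta, ti)[0] for ti in t]
print(results)  # each entry should be ~1.0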

Julia Code optimization

I have the following code from a previous question, and I need help optimizing it for speed. This is the code:
function OfdmSym()
    N = 64
    n = 1000
    symbol = ones(Complex{Float64}, n, 64)
    data = ones(Complex{Float64}, 1, 48)
    unused = zeros(Complex{Float64}, 1, 12)
    pilot = ones(Complex{Float64}, 1, 4)
    s = [-1-im -1+im 1-im 1+im]
    for i = 1:n  # generate 1000 symbols
        for j = 1:48  # generate 48 complex data symbols whose basis is s
            r = rand(1:4)  # 1, 2, 3, or 4
            data[j] = s[r]
        end
        symbol[i,:] = [data[1,1:10] pilot[1] data[1,11:20] pilot[2] data[1,21:30] pilot[3] data[1,31:40] pilot[4] data[1,41:48] unused]
    end
end
OfdmSym()
I appreciate your help.
First of all, I timed it with n = 100000:
OfdmSym()  # Warmup
for i = 1:5
    @time OfdmSym()
end
and it's pretty quick as it is:
elapsed time: 3.235866305 seconds (1278393328 bytes allocated, 15.18% gc time)
elapsed time: 3.147812323 seconds (1278393328 bytes allocated, 14.89% gc time)
elapsed time: 3.144739194 seconds (1278393328 bytes allocated, 14.68% gc time)
elapsed time: 3.118775273 seconds (1278393328 bytes allocated, 14.79% gc time)
elapsed time: 3.137765971 seconds (1278393328 bytes allocated, 14.85% gc time)
But I rewrote it using for loops to avoid the slicing:
function OfdmSym2()
    N = 64
    n = 100000
    symbol = zeros(Complex{Float64}, n, 64)
    s = [-1-im, -1+im, 1-im, 1+im]
    for i = 1:n
        for j = 1:48
            @inbounds symbol[i,j] = s[rand(1:4)]
        end
        symbol[i,11] = one(Complex{Float64})
        symbol[i,22] = one(Complex{Float64})
        symbol[i,33] = one(Complex{Float64})
        symbol[i,44] = one(Complex{Float64})
    end
end
OfdmSym2()  # Warmup
for i = 1:5
    @time OfdmSym2()
end
which is about 20x faster:
elapsed time: 0.159715932 seconds (102400256 bytes allocated, 12.80% gc time)
elapsed time: 0.159113184 seconds (102400256 bytes allocated, 14.75% gc time)
elapsed time: 0.158200345 seconds (102400256 bytes allocated, 14.82% gc time)
elapsed time: 0.158469032 seconds (102400256 bytes allocated, 15.00% gc time)
elapsed time: 0.157919113 seconds (102400256 bytes allocated, 14.86% gc time)
If you look at the profiler (@profile) you'll see that most of the time is spent generating random numbers, as you'd expect, as everything else is just moving numbers around.
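A minimal sketch of how to check this (in recent Julia versions the profiler lives in the Profile standard library; in the older versions this answer targets, @profile is available directly):
using Profile
OfdmSym2()             # warm up so compilation does not dominate the profile
@profile OfdmSym2()
Profile.print()        # most samples fall inside the random-number generation calls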
It's all just bits, right? This isn't clean (at all), but it runs slightly faster on my machine (which is much slower than yours so I won't bother posting my times). Is it a little faster on your machine?
function my_OfdmSym()
    const n = 100000
    const my_one_bits = uint64(1023) << 52
    const my_sign_bit = uint64(1) << 63
    my_sym = Array(Uint64, n<<1, 64)
    fill!(my_sym, my_one_bits)
    for col = [1:10, 12:21, 23:32, 34:43, 45:52]
        for row = 1:(n<<1)
            if randbool() my_sym[row, col] |= my_sign_bit end
        end
    end
    my_symbol = reinterpret(Complex{Float64}, my_sym, (n, 64))
    for k in [11, 22, 33, 44]
        my_symbol[:, k] = 1.0
    end
    for k = 53:64
        my_symbol[:, k] = 0.0
    end
end

Efficient method for finding elements in MATLAB matrix

I would like to know how the bottleneck in the given piece of code can be treated.
%% points is an Nx3 matrix having the coordinates of N points, where N ~ 10^6
Z = points(:,3);
listZ = (Z >= a & Z < b);  % Bottleneck
np = sum(listZ);           % For later usage
slice = points(listZ,:);
Currently, for N ~ 10^6, np ~ 1000 and 1000 calls to this part of the code, the bottleneck statement takes around 10 seconds in total, which is a big chunk of time compared to the rest of my code.
(Screenshots of a sample benchmark of only the indexing statement, requested by @EitanT, are omitted here.)
If the equality on one side is not important, you can reformulate it as a one-sided comparison, and it gets an order of magnitude faster:
Z = rand(1e6,3);
a = 0.5; b = 0.6;
c = (a+b)/2;
d = abs(a-b)/2;
tic
for k = 1:100,
    listZ1 = (Z >= a & Z < b);  % Bottleneck
end
toc
tic
for k = 1:100,
    listZ2 = (abs(Z-c) < d);
end
toc
isequal(listZ1, listZ2)
returns
Elapsed time is 5.567460 seconds.
Elapsed time is 0.625646 seconds.
ans =
1
Assuming the worst case:
element-wise & is not short-circuited internally
the comparisons are single-threaded
You're doing 2*1e6*1e3 = 2e9 comparisons in ~10 seconds. That's ~200 million comparisons per second (~200 MFLOPS).
Considering you can do some 1.7 GFLOPS on a single core, this indeed seems rather low.
Are you running Windows 7? If so, have you checked your power settings? You are on a mobile processor, so I expect that, by default, there will be some low-power consumption scheme in effect. This allows Windows to scale down the processing speed, so check that.
Other than that, I really have no clue.
Try doing something like this:
for i = 1:1000
    x = (a >= 0.5);
    x = (x < 0.6);
end
I found it to be faster than:
for i = 1:1000
    x = (a >= 0.5 & a < 0.6);
end
by about 4 seconds:
Elapsed time is 0.985001 seconds. (first one)
Elapsed time is 4.888243 seconds. (second one)
I think the reason for your slowdown is the element-wise & operation.