I've seen a technique that uses update (mainly for the side effect of adding a new column, I guess) in the form: update someFun each t from t. Is it good or bad practice to use this technique?
Some experiments:
t1:([]a:1 2);
t2:([]a:1 2;b:30 40);
update s:{(x`a)+x`b} each t2 from t1
It seems we can even reference a different table inside the update, so I guessed we'd see 2x memory overhead.
But:
t:([]a:til 1000000;b:-1*til 1000000);
\ts:10 s0: update s:{(x`a)+x`b} each t from t;
4761 32778560
\ts:10 s1: update s:{(x`a)+x`b} each ([]a;b) from t;
4124 32778976
\ts:10 s2: update s:{x+y}'[a;b] from t;
1908 32778512
All three cases show almost the same memory figure. I wonder why the memory consumption is the same?
In all examples you're 'eaching' over rows of the table, and it seems the memory consumption is a result of building up the result vector incrementally (multiple memory block allocations) rather than in one go. Use vector operations whenever possible.
q)n:5000000;t:([]a:til n;b:-1*til n)
q)
q)// each row
q)\ts update s:{(x`a)+x`b} each t from t;
1709 214218848
q)v:n#0
q)\ts {x}each v
361 214218256
q)
q)// vector op
q)\ts update s:a+b from t;
18 67109760
q)\ts til n
5 67109040
Actually it's already 2x the memory even here: -22!t puts the size of t at 16 MB, while the memory used is 32 MB.
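The incremental allocation described above can be sketched explicitly: growing a vector by repeated appends reallocates and copies the buffer many times, whereas til builds it in one go (a rough sketch; timings and byte counts will vary by version and machine):
n:1000000
w:`long$()
\ts do[n;w,:1]   / grows w one item at a time: repeated block reallocations
\ts v:til n      / allocates the whole vector in one go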
We have a hash table of size 16, using the double hashing method:
h1(k) = k mod 16
h2(k) = 2*(k mod 8)
I know the h2 hash function is bad, probably because of the mod 8 and the times 2, but I don't know how to explain why. Is there an explanation along the lines of "h2 should mod by a prime or it will cause ____ problem"?
It is bad because it increases the number of collisions.
The (mod 8) means that you are only ever using 8 pigeonholes in your 16-pigeonhole table.
Multiplying it by 2 just spreads those 8 pigeonholes out so that you don’t have to search too many slots past the hashed index to find an empty hole...
You should always compute modulo the size of your table.
h(x) ::= x (mod N) // where N is the table size
The purpose of making the table size a prime number is just to guard against regular patterns in the keys, since powers of two are very common in computer science. If your data is truly random, then the size of the table doesn't matter.
As long as it is big enough for your expected load factor, that is. A 16-element table is very small; you shouldn't expect to store more than 6-12 random values in it without a high probability of collisions.
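A quick numeric check (sketched in q, the language used elsewhere on this page) makes the pigeonhole point concrete: whatever the key, h2 can only ever produce the 8 even values 0 through 14:
q)h2:{2*x mod 8}
q)asc distinct h2 til 1000
0 2 4 6 8 10 12 14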
A very good linked thread is What is a good Hash Function?, which is totally worth a read just for the links to further reading alone.
I'm trying to work with matrices in KDB, and am frequently having to query their dimensions.
Currently I'm doing count and count flip, but this is verbose and repetitive. Is there a more elegant way to query the dimensions of an n-D matrix?
Assuming that we are dealing with a well-formed matrix, a function that achieves your objective is:
shape:{(count x;count x[0])};
If you use it very often, you can save it in the startup file q.q in the q directory, so that it is loaded on launch and readily available.
Clearly, flipping the entire matrix is more expensive in terms of time:
q)t:(100;100)#til 10000
q)\t:1000000 {:(count x; count flip x);}[t]
33808
q)\t:1000000 {:(count x;count x[0]);}[t]
282
Having said that, the flip method guarantees that the matrix is well formed, which the proposed method will not catch:
q)t2:((2;3;4);(2;3))
q){:((#)x;(#)x[0]);}[t2]
2 3
q){:(count x; count flip x);}[t2]
'length
[1] {:(count x; count flip x);}
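For the n-D case mentioned in the question, a recursive variant is possible. This is only a sketch (shp is a made-up name, and the n-dimensional # reshape below needs a reasonably recent q); like the count-based approach, it trusts that the matrix is well formed:
q)shp:{$[0>type x;();enlist[count x],shp first x]}  / () for an atom, else count and recurse
q)shp (2;3;4)#til 24
2 3 4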
I am learning KDB+ and Q programming and read the following statement:
"select performs vector operations on column lists". What does vector operation mean here? Could somebody please explain with an example? Also, how is it faster than standard SQL?
A vector operation is an operation that takes one or more vectors and produces another vector. For example + in q is a vector operation:
q)a:1 2 3
q)b:10 20 30
q)a + b
11 22 33
If a and b are columns in a table, you can perform vector operations on them in a select statement. Continuing with the previous example, let's put a and b vectors in a table as columns:
q)([]a;b)
a b
----
1 10
2 20
3 30
Now,
q)select c:a + b from ([]a;b)
c
--
11
22
33
The select statement performed the same a+b vector addition, but took input and returned output as table columns.
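A quick check confirms the column result matches the plain vector operation:
q)(select c:a+b from ([]a;b))[`c] ~ a+b
1b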
How is it faster than standard SQL?
"Standard" SQL implementations typically store data row by row. In a table with many columns the first element of a column and its second element can be separated in memory by the data from other columns. Modern computers operate most efficiently when the data is stored contiguously. In kdb+, this is achieved by storing tables column by column.
A vector is a list of atoms of the same type. Some examples:
2 3 4 5 / int
"A fine, clear day" / char
`ibm`goog`aapl`ibm`msft / symbol
2017.01 2017.02 2017.03m / month
Kdb+ stores and handles vectors very efficiently. Q operators – not just +-*% but e.g. mcount, ratios, prds – are optimised for vectors.
These operators can be even more efficient when vectors have attributes, such as u (no repeated items) and s (items are in ascending order).
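For instance, a vector can be given the sorted attribute with `s#. A rough sketch (outputs omitted; timings indicative only):
v:til 1000000
sv:`s#v                / q verifies the order and sets the s attribute
\ts:1000 999999 in v   / plain vector: linear scan
\ts:1000 999999 in sv  / sorted attribute: binary search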
When table columns are vectors, those same efficiencies are available. These efficiencies are not available to standard SQL, which views tables as unordered sets of rows.
Being column-oriented, kdb+ can splay large tables, storing each column as a separate file, which reduces file I/O when selecting from large tables.
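A minimal splaying sketch (hypothetical path; a table with symbol columns would first need enumerating with .Q.en):
q)t:([]a:til 5;b:5#1.5)
q)`:/tmp/db/t/ set t    / trailing slash: splayed, one file per column
`:/tmp/db/t/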
The sentence means that when you refer to a column of a table by its name inside a select, the name resolves to the whole column list rather than to each element of it, so any operation on it should be understood as a list operation.
q)show t: flip `a`b!(til 3;10*til 3)
a b
----
0 0
1 10
2 20
q)select x: count a, y: type b from t
x y
---
3 7
q)type t[`b]
7h
q)type first t[`b]
-7h
count a in the above q-sql is equivalent to count t[`a], which is count 0 1 2, i.e. 3. The same goes for type b; the positive return value 7 means b is a list rather than an atom: http://code.kx.com/q/ref/datatypes/#primitive-datatypes
I am trying to find the mean, median and percentile ranges of price movements for a given volume to be filled, using trade data. The code is attached below. The problem is that it gives me a wsfull error when I run it on ~80k records. I am using a 4 GB Linux box. At the moment I can only run it on ~30k records, and even then q uses >70% of my RAM.
Is there any way to make it more memory friendly?
rangeForVol : {[symIn; vol; dt]
data: select from table where sym=symIn, date=dt;
data: update cumVol: sums quantity, cVol: sums quantity from data;
data: update cumVolTgt: cumVol + vol from data;
data: update pxLst: price[where each ((cumVol>=/:cVol) and (cumVol<=/:cumVolTgt))=1] from data;
.Q.gc[];
data: update minPx: min each pxLst, maxPx: max each pxLst from data;
data: update range: maxPx - minPx from data;
data
};
select count i by floor range%0.5 from rangeForVol[`ABC; 2500; 2012.06.04]
The code you quote above almost certainly does not do what you were trying to achieve.
The columns cumVol and cVol are identical (in that they both contain a running total of that day's volume). Later you calculate cumVol>=/:cVol. The /: means that every element of cVol is compared against the entire vector cumVol, so the result is an n x n boolean matrix; with = instead of >= you would get the identity matrix (plus some extra 1b for any non-distinct values):
q)(til 4)=\:til 4
1000b
0100b
0010b
0001b
It seems you wanted an element-wise comparison between the two vectors (though comparing a vector to itself also doesn't make much sense). If you wanted to do that explicitly, each-both (=') would be the correct adverb. However, in q the = operator implicitly applies item-wise to two vectors of the same length (or to a vector and a scalar, which is what happens inside your each-left version), making any adverb unnecessary.
The fact that you are creating two n x n matrices, when you probably intended a length-n vector, is almost certainly why you are running out of memory.
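A small demonstration of the difference, using a toy vector v:
q)v:1 2 3
q)v=v        / implicit item-wise comparison: length-n vector
111b
q)v=\:v      / each-left: n x n matrix
100b
010b
001b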
I am using PostgreSQL 8.2, which is the main reason I'm asking this question. In this version of PostgreSQL I want to get a column (let's name it C) containing the cumulative minimum of some other preordered column (let's name it B). So the n-th row of column C should hold the minimum of the values of B in rows 1 to n, for some ordering.
In the example below, column A gives the order and column C contains the cumulative minimum of column B in that order:
A B C
------------
1 5 5
2 4 4
3 6 4
4 5 4
5 3 3
6 1 1
Probably the easiest way to explain what I want is the following query, which does it in later versions:
SELECT A , B, min (B) OVER(ORDER BY A) C FROM T;
But version 8.2, of course, doesn't have window functions.
I've written some plpgsql functions that do this on arrays. But to use them I have to use the array_agg aggregate function, which I also wrote myself (there is no built-in array_agg in that version). This approach isn't very efficient, and while it worked well on smaller tables it is becoming almost unusable now that I need it on bigger ones.
So I would be very grateful for any suggestions of alternative, more efficient solutions of this problem.
Thank you!
Well, you can use this simple subselect:
SELECT a, b, (SELECT min(b) FROM t t1 WHERE t1.a <= t.a) AS c
FROM t
ORDER BY a;
But I doubt it will be faster for big tables than a plpgsql function. Maybe you can show us your function. There might be room for improvement there.
For this to be fast you should have a multi-column index like:
CREATE INDEX t_a_b_idx ON t (a,b);
But really, you should upgrade to a more recent version of PostgreSQL. Version 8.2 reached end of life last year: no more security updates, and so many missing features ...