I am learning KDB+ and Q programming and read about the following statement -
"select performs vector operations on column lists". What does Vector operation mean here? Could somebody please explain with an example? Also, How its faster than standard SQL?
A vector operation is an operation that takes one or more vectors and produces another vector. For example + in q is a vector operation:
q)a:1 2 3
q)b:10 20 30
q)a + b
11 22 33
If a and b are columns in a table, you can perform vector operations on them in a select statement. Continuing with the previous example, let's put a and b vectors in a table as columns:
q)([]a;b)
a b
----
1 10
2 20
3 30
Now,
q)select c:a + b from ([]a;b)
c
--
11
22
33
The select statement performed the same a+b vector addition, but took input and returned output as table columns.
How its faster than standard SQL?
"Standard" SQL implementations typically store data row by row. In a table with many columns the first element of a column and its second element can be separated in memory by the data from other columns. Modern computers operate most efficiently when the data is stored contiguously. In kdb+, this is achieved by storing tables column by column.
A vector is a list of atoms of the same type. Some examples:
2 3 4 5 / int
"A fine, clear day" / char
`ibm`goog`aapl`ibm`msft / symbol
2017.01 2017.02 2017.03m / month
Kdb+ stores and handles vectors very efficiently. Q operators – not just +-*% but e.g. mcount, ratios, prds – are optimised for vectors.
These operators can be even more efficient when vectors have attributes, such as u (no repeated items) and s (items are in ascending order).
When table columns are vectors, those same efficiencies are available. These efficiencies are not available to standard SQL, which views tables as unordered sets of rows.
Being column-oriented, kdb+ can splay large tables, storing each column as a separate file, which reduces file I/O when selecting from large tables.
The sentence means when you refer to a specific column of a table with a column label, it is resolved into the whole column list, rather than each element of it, and any operations on it shall be understood as list operations.
q)show t: flip `a`b!(til 3;10*til 3)
a b
----
0 0
1 10
2 20
q)select x: count a, y: type b from t
x y
---
3 7
q)type t[`b]
7h
q)type first t[`b]
-7h
count a in the above q-sql is equivalent to count t[`a] which is count 0 1 2 = 3. The same goes to type b; the positive return value 7 means b is a list rather than an atom: http://code.kx.com/q/ref/datatypes/#primitive-datatypes
Related
I am doing calculation on columns using summation. I want to manually change my first n entries in my calc column from float to NaN. Can someone please advise me how to do that?
For example, if my column in table t now is mycol:(1 2 3 4 5 6 7 8 9), I am trying to get a function that can replace the first n=4 entries with NaN, so my column in table t becomes mycol:(0N 0N 0N 0N 5 6 7 8 9)
Thank you so much!
Emily
We can use amend functionality to replace the first n items with null value. Additionally, it would be better to use the appropriate null literal for each column based on the type. Something like this would work:
f: {nullDict: "ijfs"!(0Ni;0Nj;0Nf:`); #[x; til y; :; nullDict .Q.ty x]}
This will amend the first y items in the list x. .Q.ty will get the type for input so that we can get the corresponding value from the dictionary.
You can then use this for a single column, like so:
update mycol: f[mycol;4] from tbl
You can also do this in one go for multiple columns with varying number of items to be replaced using functional form:
![tbl;();0b;`mycol`mycol2!((f[;4];`mycol);(f[;3];`mycol2))]
Do take note that you will need to modify nullDict with whatever other types you need.
Update: Thanks to Jonathon McMurray for suggesting a better way to build up nullDict for all primitive types using the below code:
{x!first each x$\:()}.Q.t except " "
I have a matrix A and a vector b. I don't know their sizes, the size varies because it is the output of another function. What I want to do is filter A by a column (let's say jth column) which has at least one value that is in b.
How do I do this without measuring the size of b and concatenating every filtered result. Right now, the code is like this (assume j is a given value)
bsize=size(b,1);
for i=1:bsize
if i==1
a=A(A(:,j)==b(i),:);
else
a=[a; A(A(:,j)==b(i),:)];
end
end
I want to code a faster solution.
I am adding a numerical example just to make it clear. Let's say
A=[2 4
7 14
11 13
15 14]
and b=[4 14]
What I'm trying to do is filter to obtain the A matrix whose values are 4 and 14 in the second column, the elements of b to obtain the following output.
A=[2 4
7 14
15 14]
In my data A has more than 12000 rows and b has more than 100 elements. It doesn't always have to be the second column, sometimes the column index changes but that's not the problem now.
Use the ismember function to create a logical index based on column j=2 of A and vector b, and use that index into the rows of A:
output = A(ismember(A(:,j), b), :);
I have a table that looks like:
id aff1 aff2 aff3 value
1 a x b 5
2 b c x 4
3 a b g 1
I would like to aggregate the aff columns to calculate the sum of "value" for each aff. For example, the above gives:
aff sum
a 6
b 10
c 4
g 1
x 9
Ideally, I'd like to do this directly in tableau without remaking the table by unfolding it along all the aff columns.
You can use Tableau’s inbuilt pivot method as below, without reshaping in source .
CTRL Select all 3 dimensions you want to merge , and click on pivot .
You will get your new reshaped data as below, delete other columns :
Finally build your view.
I hope this answers . Rest other options for the above results include JOIN at DB level, or creating multiple calculated fields for each attribute value which are not scalable.
I need some help matching data and combining it. I currently have four columns of data in an Excel sheet, similar to the following:
Column: 1 2 3 4
U 3 A 0
W 6 B 0
R 1 C 0
T 9 D 0
... ... ... ...
Column two is a data value that corresponds to the letter in column one. What I need to do is compare column 3 with column 1 and whenever it matches copy the corresponding value from column 2 to column 4.
You might ask why don't I do this manually ? I have a spreadsheet with around 100,000 rows so this really isn't an option!
I do have access to MATLAB and have the information imported, if this would be more easily completed within that environment, please let me know.
As mentioned by #bla:
a formula similar to =IF(A1=C1,B1,0)
should serve (Excel).
In MATLAB, we can operate on both rows and columns of a matrix. What does it exactly mean by "row major" or "column major"?
It is important to understand that MATLAB stores data in column-major order, so you know what happens when you apply the colon operator without any commas:
>> M = magic(3)
M =
8 1 6
3 5 7
4 9 2
>> M(:)
ans =
8
3
4
1
5
9
6
7
2
I tend to think "MATLAB goes down, then across". This makes it easy to reshape and permute arrays without scrambling your data. It's also necessary in order to grasp linear indexing (e.g. M(4)).
For example, a common way to obtain a column vector inline from some expression that generates an array is:
reshape(<array expression>,[],1)
As with (:) this stacks all the columns on top of each other into single column vector, for all data in any higher dimensions.
But this nifty syntactic trick lets you avoid an extra line of code.
In MATLAB, arrays are stored in column major order.
It means that when you have a multi-dimensional array, its 1D representation in memory is such that leftmost indices change faster.
It's called column major order because for a 2D array (matrix), the first (leftmost) index is typically the row index, so since it changes faster than the second (next to the right) index, the 1D representation of the matrix is memory correspond to the concatenation of the columns of the matrix.