Comparing, matching and combining columns of data

I need some help matching data and combining it. I currently have four columns of data in an Excel sheet, similar to the following:
Column: 1 2 3 4
U 3 A 0
W 6 B 0
R 1 C 0
T 9 D 0
... ... ... ...
Column 2 holds the data value that corresponds to the letter in column 1. I need to compare column 3 with column 1 and, whenever they match, copy the corresponding value from column 2 into column 4.
You might ask why I don't do this manually. The spreadsheet has around 100,000 rows, so that really isn't an option!
I also have access to MATLAB and have the data imported there. If this would be easier in that environment, please let me know.

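As mentioned by @bla:
a formula similar to =IF(A1=C1,B1,0)
should serve in Excel. Entered in column 4 and filled down, it compares the column 1 and column 3 cells of the same row and returns the column 2 value when they match, or 0 otherwise.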
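If the letter in column 3 may appear in any row of column 1, rather than only in the same row, the task becomes a lookup instead of a row-by-row comparison. A minimal sketch of that idea in Python follows. The function name fill_matches, the default of 0 for unmatched rows, and the sample values in col3 are assumptions made for illustration; they are not from the question.

```python
def fill_matches(keys, values, targets, default=0):
    """For each target, return the value paired with the matching key, else default."""
    # Build the lookup once so a large sheet costs one pass, not a scan per row.
    lookup = {}
    for k, v in zip(keys, values):
        lookup.setdefault(k, v)  # keep the first occurrence if a key repeats
    return [lookup.get(t, default) for t in targets]


col1 = ["U", "W", "R", "T"]
col2 = [3, 6, 1, 9]
col3 = ["A", "W", "U", "D"]  # hypothetical targets chosen so that some rows match
col4 = fill_matches(col1, col2, col3)
print(col4)  # [0, 6, 3, 0]
```

Building the dictionary first keeps the work roughly linear in the number of rows, which matters at 100,000 rows.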

Related

How do I replace the first 10 entries in a column with NaN in KDB

I am doing calculations on columns using summation. I want to manually change the first n entries in my calc column from float to NaN. Can someone please advise me how to do that?
For example, if my column in table t now is mycol:(1 2 3 4 5 6 7 8 9), I am trying to get a function that can replace the first n=4 entries with NaN, so my column in table t becomes mycol:(0N 0N 0N 0N 5 6 7 8 9)
Thank you so much!
Emily

We can use amend to replace the first n items with a null value. It is also better to use the null literal that matches each column's type. Something like this would work:
f: {nullDict: "ijfs"!(0Ni;0Nj;0Nf;`); @[x; til y; :; nullDict .Q.ty x]}
This amends the first y items in the list x. .Q.ty returns the type character of its input, which is used to look up the corresponding null in the dictionary.
You can then use this for a single column, like so:
update mycol: f[mycol;4] from tbl
You can also do this in one go for multiple columns with varying number of items to be replaced using functional form:
![tbl;();0b;`mycol`mycol2!((f[;4];`mycol);(f[;3];`mycol2))]
Do take note that you will need to modify nullDict with whatever other types you need.
Update: Thanks to Jonathon McMurray for suggesting a better way to build up nullDict for all primitive types using the below code:
{x!first each x$\:()}.Q.t except " "
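For readers who want to check the logic outside q, the same "overwrite the first n items with a null" idea can be sketched in Python. This is not q code: the helper name null_first and the use of float("nan") as the null value are assumptions for illustration only.

```python
import math


def null_first(column, n, null=float("nan")):
    """Return a copy of column with its first n items replaced by null."""
    n = max(0, min(n, len(column)))  # clamp so an oversized n is safe
    return [null] * n + list(column[n:])


mycol = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
result = null_first(mycol, 4)
print(result)  # [nan, nan, nan, nan, 5.0, 6.0, 7.0, 8.0, 9.0]
```

Unlike the q version, this sketch uses one null for every column; the dictionary of per-type nulls above is what handles that in q.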

KDB/Q : What is Vector operation?

I am learning KDB+ and Q programming and read the following statement:
"select performs vector operations on column lists". What does "vector operation" mean here? Could somebody please explain with an example? Also, how is it faster than standard SQL?

A vector operation is an operation that takes one or more vectors and produces another vector. For example + in q is a vector operation:
q)a:1 2 3
q)b:10 20 30
q)a + b
11 22 33
If a and b are columns in a table, you can perform vector operations on them in a select statement. Continuing with the previous example, let's put a and b vectors in a table as columns:
q)([]a;b)
a b
----
1 10
2 20
3 30
Now,
q)select c:a + b from ([]a;b)
c
--
11
22
33
The select statement performed the same a+b vector addition, but took input and returned output as table columns.
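As a rough analogy for readers coming from other languages, the same idea can be sketched in Python, with the table modelled as a dictionary of equal-length column lists. That model and the helper name add_columns are assumptions for illustration, not a description of kdb+ internals.

```python
def add_columns(x, y):
    """Element-wise sum of two equal-length columns, returned as a new column."""
    if len(x) != len(y):
        raise ValueError("columns must have the same length")
    return [xi + yi for xi, yi in zip(x, y)]


# A "table" held column by column: each name maps to a whole column list.
table = {"a": [1, 2, 3], "b": [10, 20, 30]}

# Referring to a column yields the entire list, and the operation consumes it whole.
c = add_columns(table["a"], table["b"])
print(c)  # [11, 22, 33]
```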
How is it faster than standard SQL?
"Standard" SQL implementations typically store data row by row. In a table with many columns the first element of a column and its second element can be separated in memory by the data from other columns. Modern computers operate most efficiently when the data is stored contiguously. In kdb+, this is achieved by storing tables column by column.

A vector is a list of atoms of the same type. Some examples:
2 3 4 5 / int
"A fine, clear day" / char
`ibm`goog`aapl`ibm`msft / symbol
2017.01 2017.02 2017.03m / month
Kdb+ stores and handles vectors very efficiently. Q operators – not just +-*% but e.g. mcount, ratios, prds – are optimised for vectors.
These operators can be even more efficient when vectors have attributes, such as u (no repeated items) and s (items are in ascending order).
When table columns are vectors, those same efficiencies are available. These efficiencies are not available to standard SQL, which views tables as unordered sets of rows.
Being column-oriented, kdb+ can splay large tables, storing each column as a separate file, which reduces file I/O when selecting from large tables.

The sentence means when you refer to a specific column of a table with a column label, it is resolved into the whole column list, rather than each element of it, and any operations on it shall be understood as list operations.
q)show t: flip `a`b!(til 3;10*til 3)
a b
----
0 0
1 10
2 20
q)select x: count a, y: type b from t
x y
---
3 7
q)type t[`b]
7h
q)type first t[`b]
-7h
count a in the above q-sql is equivalent to count t[`a], which is count 0 1 2 and returns 3. The same goes for type b; the positive return value 7 means b is a list rather than an atom: http://code.kx.com/q/ref/datatypes/#primitive-datatypes

Summing in groups of rows

I need help computing sums in MATLAB. I have a column that looks like this:
7
2
1
0
5
2
8
7
(...)
I want to sum those numbers in groups of 4 rows and store the results in a new matrix. For this example, I would get a new column with:
10 (7+2+1+0)
22 (5+2+8+7)
(...)
Thx for helping

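reshape lets you turn your data into a 4-by-n matrix in which each column holds one group of four values. sum along the first dimension then gives one total per group (this requires numel(x) to be a multiple of 4):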
sum(reshape(x,4,numel(x)/4),1)'
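The grouping logic can be checked with a short Python sketch using the numbers from the question. The function name block_sums is an assumption for illustration; the explicit length check mirrors the fact that reshape needs the element count to divide evenly by the group size.

```python
def block_sums(values, size=4):
    """Sum consecutive, non-overlapping blocks of `size` items."""
    if len(values) % size != 0:
        raise ValueError("length must be a multiple of the block size")
    return [sum(values[i:i + size]) for i in range(0, len(values), size)]


x = [7, 2, 1, 0, 5, 2, 8, 7]
print(block_sums(x))  # [10, 22]
```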

Multiple Columns, way to select closest to a value

I'm trying to analyze data sets that are obtained from CSV files. After the data is read into MATLAB, I am left with a single variable containing only my data. The number of columns and rows changes from file to file.
Is there a way to average each column and then create a variable for the column whose average is closest to a certain value? I would also like to select the columns directly before and after that column and create variables for them, and to create a variable for the column with the lowest average. Currently, I am selecting the columns manually and creating variables for them that way.
For example:
I have this table of numbers. (I used the same number throughout each column to make the averaging easy in this example.)
1 2 3 4 5
1 2 3 4 5
1 2 3 4 5
1 2 3 4 5
1 2 3 4 5
Let's say I want the column whose average is closest to 3.2.
That would be column 3, whose average is 3. I would then want the code to select the column before it (column 2) and the column after it (column 4), as well as the column with the lowest average (column 1).

First get the averages (I assume the data matrix is in variable X):
Xmns = mean(X);
Then to find the minimum, use "min":
[val,ind] = min(Xmns);
"val" holds the minimum value, "ind" the corresponding index in Xmns, which is the corresponding column.
To find the column mean closest to a particular value, again you can use min:
[val,ind] = min(abs(Xmns-key_val));
Now "ind" holds the column index with mean closest to "key_val". The next column is just "ind+1" and the previous "ind-1" - just be sure to check you are not beyond the ends of the matrix (i.e. ind may already be 1 or size(X,2)).
Also, given the column index "ind", to create a new variable with that column, you just use:
sc= X(:,ind);
and if you want to remove that column from X:
X(:,ind) = [];
and that is all.
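Putting the steps above together, including the boundary check, gives something like the following Python sketch. The function names column_means and pick_columns are assumptions for illustration, and the indices are zero-based here, so column 3 in MATLAB terms appears as index 2.

```python
def column_means(rows):
    """Mean of each column of a row-major list of lists."""
    n_rows = len(rows)
    return [sum(col) / n_rows for col in zip(*rows)]


def pick_columns(rows, key_val):
    """Indices of the column closest to key_val, its neighbours, and the lowest-mean column."""
    means = column_means(rows)
    closest = min(range(len(means)), key=lambda j: abs(means[j] - key_val))
    lowest = min(range(len(means)), key=lambda j: means[j])
    # Neighbours are None when the closest column sits at either edge.
    before = closest - 1 if closest > 0 else None
    after = closest + 1 if closest < len(means) - 1 else None
    return {"closest": closest, "before": before, "after": after, "lowest": lowest}


X = [[1, 2, 3, 4, 5] for _ in range(5)]
picked = pick_columns(X, 3.2)
print(picked)  # {'closest': 2, 'before': 1, 'after': 3, 'lowest': 0}
```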

MATLAB: copy a specific portion of an array

I am trying to copy a few elements from a matrix, but not a whole row, and not a single element.
For example, in the following matrix:
a = 1 2
    3 4
    5 6
    7 8
    9 0
How would I copy out just the following data?
b = 1
    3
    5
i.e. rows 1:3 in column 1 only... I know that you can remove an entire column like this:
b = a(:,1)
... and I appreciate that I could just do this and then dump the last two rows, but I'd like to use more streamlined code, as I am running a very resource-intensive solution.

Elements in a matrix in MATLAB are stored in column-major order, which means you could even use a single index and say:
b = a(1:3);
since the first 3 elements are 1, 3, 5. Similarly, a(6) is 2, a(7) is 4, and so on. Note that a(1:3) returns those elements as a row vector; a(1:3,1) returns the same elements as a column. Look at the sub2ind function to understand more:
http://www.mathworks.com/help/techdoc/ref/sub2ind.html
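To see why a single index works, the column-major layout can be imitated in a few lines of Python. The helper name linear_index is an assumption for illustration; it reproduces the 1-based row and column to linear index mapping for a 2-D array.

```python
def linear_index(row, col, n_rows):
    """1-based column-major linear index for a 2-D array with n_rows rows."""
    return (col - 1) * n_rows + row


a = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 0]]
n_rows = len(a)

# Walk down each column in turn, which is the order column-major storage uses.
flat = [a[r][c] for c in range(len(a[0])) for r in range(n_rows)]

print(flat[:3])                               # [1, 3, 5]
print(flat[linear_index(1, 2, n_rows) - 1])   # 2
```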

You are not "removing" the second column, you are referencing the other column.
You should read some of the MATLAB docs; they explain the syntax for accessing portions of matrices:
http://www.mathworks.com/help/techdoc/learn_matlab/f2-12841.html#f2-428