How can I calculate a correlation matrix? - kdb

I have a table with columns a, b, c. Can I calculate the correlation matrix of cor[a;a], cor[a;b], cor[a;c] using the functional form somehow?
?[table; (); 0b; `aa`ab`ac!((cor;`a;`a); (cor;`a;`b); (cor;`a;`c))]
How can I generate the list for the last argument, i.e. the triples of the form
(cor;`a;`b)

q)show t:([]a:5?1.0;b:5?1.0;c:5?1.0)
a          b         c
------------------------------
0.389056   0.949975  0.6919531
0.391543   0.439081  0.4707883
0.08123546 0.5759051 0.6346716
0.9367503  0.5919004 0.9672398
0.2782122  0.8481567 0.2306385
q)u cor/:\:u:flip t
 | a          b          c
-| --------------------------------
a| 1          -0.1328262 0.6671159
b| -0.1328262 1          -0.1830702
c| 0.6671159  -0.1830702 1

So the manually typed-out form:
q)t:([] a:10?10; b:10?10; c:10?10)
q)?[t;();0b;`aa`ab`ac!((cor;`a;`a);(cor;`a;`b);(cor;`a;`c))]
aa ab         ac
-----------------------
1  -0.2530506 0.7966834
If you wanted to generate the last argument, assuming you want the first column paired with every column:
q)a:{(`$raze'[string x])!(cor),/:x}{x[0],/:x}cols t;
q)?[t;();0b;a]
aa ab         ac
-----------------------
1  -0.2530506 0.7966834
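To see how that lambda builds the dictionary, the construction can be broken into steps (a sketch run against the t above; pairs is just an illustrative name):
q)show pairs:{x[0],/:x}cols t   / pair the first column name with every column name
a a
a b
a c
q)`$raze'[string pairs]         / stitch each pair into an output column name
`aa`ab`ac
Finally, (cor),/:pairs prepends cor to each pair, producing the (cor;`a;`a), (cor;`a;`b), (cor;`a;`c) triples used in the select dictionary.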
If you wanted every pair of columns (the full Cartesian product):
q)a:{(`$raze'[string x])!(cor),/:x}{x cross x}cols t
q)?[t;();0b;a]
aa ab         ac        ba         bb bc        ca        cb        cc
----------------------------------------------------------------------
1  -0.2530506 0.7966834 -0.2530506 1  -0.268787 0.7966834 -0.268787 1
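Here cross builds the full Cartesian product of the column names; on a smaller example:
q)`a`b cross `a`b
a a
a b
b a
b b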

KDB Table enlist function call not running

I have a simple problem, shown below.
f2:{[x;y]
  r:sum(x)*sum(y);
  r
  };
tm:([] pr:(100.01 100.02;100.03 100.04); rv:(15.72 55.64; 16.92 15.17 12.21 34.99))
f2 each [tm`rv][tm`pr]
The result I get is
{[x;y]
r:sum(x)*sum(y);
r
}[(15.72 55.64;16.92 15.17 12.21 34.99)'[(100.01 100.02;100.03 100.04)]]
The result I want is to sum each row of tm`rv and tm`pr and multiply them.
tm
pr            rv
-------------------------------------
100.01 100.02 15.72 55.64
100.03 100.04 16.92 15.17 12.21 34.99
Hi, you can sum each nested list then do the multiplication:
select result:(sum each pr)*sum each rv from tm
result
--------
14274.14
15863.55
But if you want to use your f2 function {[x;y] r:sum(x)*sum(y); r}, you should do this:
f2'[tm`rv;tm`pr]
14274.14 15863.55
' (each-both) applies f2 to corresponding pairs of items from the two argument lists.
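To see the same pattern on simpler inputs (a quick sketch):
q)(+)'[1 2 3;10 20 30]                / each-both pairs up items of the two lists
11 22 33
q){sum[x]*sum y}'[(1 2;3 4);10 100]   / same shape as f2'[tm`rv;tm`pr]
30 700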

Read a function name and args from each table row, execute them, and store the output in a single table

I have a data.csv which looks like the below, each row having a function name and a dictionary of arguments.
function,args
fun1,(`startDate`endDate`sym`rollPerct`expDateThreshold`expDateThresholdExpiry)!(.z.D-5;.z.D;`AAPL;0.8;10;1)
fun2,(`startDate`endDate`sym`rollPerct`expDateThreshold`expDateThresholdExpiry)!(.z.D-5;.z.D;`MSFT`ZAK;0.8;10;1)
fun3,(`startDate`endDate`sym`rollPerct`expDateThreshold`expDateThresholdExpiry)!(.z.D-5;.z.D;`NAFK;0.8;10;1)
And if I read the data:
tab:("S*";enlist ",") 0: `:data.csv
Now I want to iterate over all rows of the table as below, call each function, and save all 3 results to a single table res.
fun1 [(`startDate`endDate`sym`rollPerct`expDateThreshold`expDateThresholdExpiry)!(.z.D-5;.z.D;`AAPL;0.8;10;1)]
fun2 [(`startDate`endDate`sym`rollPerct`expDateThreshold`expDateThresholdExpiry)!(.z.D-5;.z.D;`MSFT`ZAK;0.8;10;1)]
fun3 [(`startDate`endDate`sym`rollPerct`expDateThreshold`expDateThresholdExpiry)!(.z.D-5;.z.D;`NAFK;0.8;10;1)]
Here is my code snippet to iterate over f1[args], f2[args] and f3[args] and combine all 3 results into a single table. I used a loop here, but there should be something better than a loop; let me know if there is.
cnt:count table;  //get count of table
ino:0;            //initialize row counter to 0
tab::flip `date`sym`ric!(`date$();`symbol$();`symbol$()); //create a global table to hold iteration data
//perform iteration where f1[args],f2[args],f3[args] -> tab
while[ino<cnt;
  data:exec .[first function;args] from table where i=ino;
  upsert[`tab;data];
  ino:ino+1];
//tab now has all the iteration data of f1 f2 f3
tab
If your inputs are correctly ordered for all functions, the following simple example should work:
q)f1:{x+y+z+2};f2:{x*y*z*22};f3:{x%y%z%42};
q)tab:([]func:`f1`f2`f3;args:`x`y`z!/:3 cut til 9)
q)tab
func args
-----------------
f1   `x`y`z!0 1 2
f2   `x`y`z!3 4 5
f3   `x`y`z!6 7 8
q)update res:func .'get'[args]from tab
func args         res
---------------------------
f1   `x`y`z!0 1 2 5
f2   `x`y`z!3 4 5 1320
f3   `x`y`z!6 7 8 0.1632653
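To unpack res:func .'get'[args]: get resolves each function symbol to the function it names, . (apply) calls a function on its arguments, and ' zips these down the func and args columns. A minimal sketch of the two building blocks:
q)get `f1      / get (same as value) resolves the symbol to the function
{x+y+z+2}
q)f1 . 0 1 2   / . (apply) calls a function with its arguments as a list
5
The update above does the same per row, applying each resolved function to its args entry.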
NB: if your loaded args are strings, you'll want to parse these. For example, taking the above again:
q)tab:update .Q.s1'[args]from tab
q)tab
func args
-------------------
f1   "`x`y`z!0 1 2"
f2   "`x`y`z!3 4 5"
f3   "`x`y`z!6 7 8"
q)meta tab
c   | t f a
----| -----
func| s
args| C
q)tab:update'[reval;parse]'[args]from tab
q)tab
func args
-----------------
f1   `x`y`z!0 1 2
f2   `x`y`z!3 4 5
f3   `x`y`z!6 7 8
q)meta tab
c   | t f a
----| -----
func| s
args|
q)update res:func .'get'[args]from tab
func args         res
---------------------------
f1   `x`y`z!0 1 2 5
f2   `x`y`z!3 4 5 1320
f3   `x`y`z!6 7 8 0.1632653
reval in the above will try to stop anything dodgy being run, but I would avoid parsing code straight from files where possible.
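As a rough illustration of what reval guards against (a sketch; the exact error may vary by kdb+ version):
q)reval parse"2+3"         / pure expressions evaluate as normal
5
q)reval parse"`x set 42"   / attempted side effects are blocked
'noupdate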

Explanation of code that constructs correlation matrix

I am referring to this answer.
The code to construct a correlation matrix given a table of columns is
u cor/:\:u:flip t where t is a table.
Reading right to left, I understand up to u:flip t. May I please ask for an explanation of what the rest of the code does?
Thanks
If you substitute a function which gives more visual output, such as join (,) with two character vectors, it should be easier to see what the derived function cor/:\: is doing:
q)"123","abc" // simple join
"123abc"
q)"123",/:"abc" // join left arg to each item of right arg
"123a"
"123b"
"123c"
q)"123",/:\:"abc" // join each item of left arg to each item of right
"1a" "1b" "1c"
"2a" "2b" "2c"
"3a" "3b" "3c"
Back to a simple example with cor:
q)show t:([]a:3?1.0;b:3?1.0;c:3?1.0)
a         b          c
-------------------------------
0.7935513 0.6377554  0.3573039
0.2037285 0.03845637 0.02547383
0.7757617 0.8972357  0.688089
q)u cor/:\:u:flip t
 | a         b         c
-| -----------------------------
a| 1         0.9474878 0.8529413
b| 0.9474878 1         0.975085
c| 0.8529413 0.975085  1
q)show data:value flip t; // extract the data for clarity
0.7935513 0.2037285  0.7757617
0.6377554 0.03845637 0.8972357
0.3573039 0.02547383 0.688089
q)cor[data 0;]each data // first row cor each row
1 0.9474878 0.8529413
q)cor[data 1;]each data // second row cor each row
0.9474878 1 0.975085
q)cor[data 2;]each data // last row cor each row
0.8529413 0.975085 1
q){cor[x]each data}each data // all at once
1 0.9474878 0.8529413
0.9474878 1 0.975085
0.8529413 0.975085 1
q)data cor/:\:data // derived function much nicer
1 0.9474878 0.8529413
0.9474878 1 0.975085
0.8529413 0.975085 1
If you are looking into correlation matrices then it might be a good idea to have a look into what they are; this might give some context to the inputs/outputs/code.
https://www.displayr.com/what-is-a-correlation-matrix/
In this case, we are finding the correlation between some matrix/table u:flip t and itself.
The rest of the query is composed of the function cor and two kdb+ iterators, each-right /: and each-left \:.
https://code.kx.com/q/ref/cor/
https://code.kx.com/q/wp/iterators/
Each-right applies the function between the left-hand argument and each item of the right-hand argument:
q)1 ,/: 10 20 30
1 10
1 20
1 30
while each-left applies the function between each item of the left-hand argument and the right-hand argument:
q)1 2 3 ,\: 10
1 10
2 10
3 10
If we use both simultaneously, as illustrated below, we join (,) each element of the left-hand list (\:) with each element of the right-hand list (/:):
q)1 2 3,/:\:10 20 30
1 10 1 20 1 30
2 10 2 20 2 30
3 10 3 20 3 30
Then u cor/:\:u:flip t can be understood as taking each element of u and finding its correlation with every element within u, achieved through the use of cor/:\:.
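One last detail: u here is a column dictionary rather than a plain matrix, and iterating over a dictionary preserves its keys, which is why the result displays with row and column labels. A quick sketch:
q)u:flip t   / flipping a table gives a column dictionary: name -> vector
q)key u      / these keys survive cor/:\:, labelling rows and columns
`a`b`c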

Convert KMeans "centres" output to PySpark dataframe

I'm running a K-means clustering model, and I want to analyse the cluster centroids; however, the centers output is a LIST of my 20 centroids, with their coordinates (8 each) as an ARRAY. I need it as a dataframe, with clusters 1:20 as rows and their attribute values (centroid coordinates) as columns, like so:
c1 | 0.85 | 0.03 | 0.01 | 0.00 | 0.12 | 0.01 | 0.00 | 0.12
c2 | 0.25 | 0.80 | 0.10 | 0.00 | 0.12 | 0.01 | 0.00 | 0.77
c3 | 0.05 | 0.10 | 0.00 | 0.82 | 0.00 | 0.00 | 0.22 | 0.00
The dataframe format is important because what I WANT to do is, for each centroid:
- Identify the 3 strongest attributes
- Create a "name" for each of the 20 centroids that is a concatenation of the 3 most dominant traits in that centroid
For example:
c1 | milk_eggs_cheese
c2 | meat_milk_bread
c3 | toiletries_bread_eggs
This code is running in Zeppelin, EMR version 5.19, Spark2.4. The model works great, but this is the boilerplate code from the Spark documentation (https://spark.apache.org/docs/latest/ml-clustering.html#k-means), which produces the list of arrays output that I can't really use.
centers = model.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)
This is an excerpt of the output I get.
Cluster Centers:
[0.12391775 0.04282062 0.00368751 0.27282358 0.00533401 0.03389095
0.04220946 0.03213536 0.00895981 0.00990327 0.01007891]
[0.09018751 0.01354349 0.0130329 0.00772877 0.00371508 0.02288211
0.032301 0.37979978 0.002487 0.00617438 0.00610262]
[7.37626746e-02 2.02469798e-03 4.00944473e-04 9.62304581e-04
5.98964859e-03 2.95190585e-03 8.48736175e-01 1.36797882e-03
2.57451073e-04 6.13320072e-04 5.70559278e-04]
Based on How to convert a list of array to Spark dataframe I have tried this:
df = sc.parallelize(centers).toDF(['fresh_items', 'wine_liquor', 'baby', 'cigarettes', 'fresh_meat', 'fruit_vegetables', 'bakery', 'toiletries', 'pets', 'coffee', 'cheese'])
df.show()
But this throws the following error:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
model.clusterCenters() gives you a list of numpy arrays, not a list of lists as in the answer you have linked. Just convert the numpy arrays to lists before creating the dataframe:
bla = [e.tolist() for e in centers]
df = sc.parallelize(bla).toDF(['fresh_items', 'wine_liquor', 'baby', 'cigarettes', 'fresh_meat', 'fruit_vegetables', 'bakery', 'toiletries', 'pets', 'coffee', 'cheese'])
#or df = spark.createDataFrame(bla, ['fresh_items', 'wine_liquor', 'baby', 'cigarettes', 'fresh_meat', 'fruit_vegetables', 'bakery', 'toiletries', 'pets', 'coffee', 'cheese'])
df.show()

Spark: All RDD data not getting saved to Cassandra table

Hi, I am trying to load RDD data into a Cassandra column family using Scala. Out of a total of 50 rows, only 28 are getting stored into the Cassandra table.
Below is the Code snippet:
val states = sc.textFile("state.txt")
// list of all the 50 states of the USA
var n = 0 // corrected to var
val statesRDD = states.map{ a =>
  n = n+1
  (n, a)
}
scala> statesRDD.count
res2: Long = 50
cqlsh:brs> CREATE TABLE BRS.state(state_id int PRIMARY KEY, state_name text);
statesRDD.saveToCassandra("brs","state", SomeColumns("state_id","state_name"))
// this statement saves only 28 rows out of 50, not sure why!!!!
cqlsh:brs> select * from state;
state_id | state_name
----------+-------------
23 | Minnesota
5 | California
28 | Nevada
10 | Georgia
16 | Kansas
13 | Illinois
11 | Hawaii
1 | Alabama
19 | Maine
8 | Oklahoma
2 | Alaska
4 | New York
18 | Virginia
15 | Iowa
22 | Wyoming
27 | Nebraska
20 | Maryland
7 | Ohio
6 | Colorado
9 | Florida
14 | Indiana
26 | Montana
21 | Wisconsin
17 | Vermont
24 | Mississippi
25 | Missouri
12 | Idaho
3 | Arizona
(28 rows)
Can anyone please help me in finding where the issue is?
Edit:
I understood why only 28 rows are getting stored in Cassandra: it's because I made the first column a PRIMARY KEY, and it looks like in my code n is incremented up to a maximum of 28 and then starts again from 1 up to 22 (50 in total), so rows with duplicate keys overwrite each other.
val states = sc.textFile("states.txt")
var n = 0
var statesRDD = states.map{ a =>
  n += 1
  (n, a)
}
I tried making n an accumulator variable as well (viz. val n = sc.accumulator(0,"Counter")), but I don't see any difference in the output.
scala> statesRDD.foreach(println)
[Stage 2:> (0 + 0) / 2]
(1,New Hampshire)
(2,New Jersey)
(3,New Mexico)
(4,New York)
(5,North Carolina)
(6,North Dakota)
(7,Ohio)
(8,Oklahoma)
(9,Oregon)
(10,Pennsylvania)
(11,Rhode Island)
(12,South Carolina)
(13,South Dakota)
(14,Tennessee)
(15,Texas)
(16,Utah)
(17,Vermont)
(18,Virginia)
(19,Washington)
(20,West Virginia)
(21,Wisconsin)
(22,Wyoming)
(1,Alabama)
(2,Alaska)
(3,Arizona)
(4,Arkansas)
(5,California)
(6,Colorado)
(7,Connecticut)
(8,Delaware)
(9,Florida)
(10,Georgia)
(11,Hawaii)
(12,Idaho)
(13,Illinois)
(14,Indiana)
(15,Iowa)
(16,Kansas)
(17,Kentucky)
(18,Louisiana)
(19,Maine)
(20,Maryland)
(21,Massachusetts)
(22,Michigan)
(23,Minnesota)
(24,Mississippi)
(25,Missouri)
(26,Montana)
(27,Nebraska)
(28,Nevada)
I am curious to know what causes n to stop getting updated after the value 28. Also, what are the ways in which I can create a counter to use when creating an RDD?
There are some misconceptions about distributed systems embedded inside your question. The real heart of this is "How do I have a counter in a distributed system?"
The short answer is you don't. For example, what you've done in your original code is something like this:
Task One {
  var x = 0
  record 1: x = 1
  record 2: x = 2
}
Task Two {
  var x = 0
  record 20: x = 1
  record 21: x = 2
}
Each machine independently creates a new x variable set to 0, which gets incremented within its own context, independently of the other nodes.
For most use cases the "counter" question can be replaced with "How can I get a Unique Identifier per Record in a distributed system?"
For this, most users end up using a UUID, which can be generated on independent machines with an infinitesimally small chance of collision.
If the question is instead "How can I get a monotonically increasing unique identifier?", then you can use zipWithUniqueId, which will not count consecutively but will generate monotonically increasing unique ids.
If you just want the numbering, it's best to do it on the local system before distributing the data.
Edit: why can't I use an accumulator?
Accumulators store their state (surprise) per task. You can see this with a little example:
val x = sc.accumulator(0, "x")
sc.parallelize(1 to 50).foreachPartition{ it => it.foreach(y => x+= 1); println(x)}
/*
6
7
6
6
6
6
6
7
*/
x.value
// res38: Int = 50
The accumulators combine their state after finishing their tasks, which means you can't use them as a global distributed counter.