Can temporals be used as columns in KDB?

I've created a pivot table based on:
https://code.kx.com/q/kb/pivoting-tables/
I've just replaced the symbols with minutes:
t:([]k:1 2 3 2 3;p:09:00 09:30 10:00 09:00 09:30; v:10 20 30 40 50)
P:asc exec distinct p from t;
exec P#(p!v) by k:k from t
Suffice it to say, this doesn't work:
k|
-| -----------------------------
1| `s#09:00 09:30 10:00!10 0N 0N
2| `s#09:00 09:30 10:00!40 20 0N
3| `s#09:00 09:30 10:00!0N 50 30
which I expected, as the docs say P must be a list of symbols.
My question is: can temporal datatypes be used as columns at all in KDB?

Column names must be symbols. You can use .Q.id to give columns valid names, for example:
q)t:([]k:1 2 3 2 3;p:09:00 09:30 10:00 09:00 09:30; v:10 20 30 40 50)
q)P:.Q.id each asc exec distinct p from t;
q)exec P#.Q.id'[p]!v by k:k from t
k| a0900 a0930 a1000
-| -----------------
1| 10
2| 40    20
3|       50    30
You could convert minutes to their symbolic representation like this of course:
q)P:`$string asc exec distinct p from t;
q)exec P#(`$string p)!v by k:k from t
k| 09:00 09:30 10:00
-| -----------------
1| 10
2| 40    20
3|       50    30
but the result would be confusing at best; I strongly advise against such column names.

Related

How can I group my data frame based on conditions on a column

I have a data frame like this:
Date   Version  Value  Name
Jan 1  123.1    3      A
Jan 2  123.23   5      A
Jan 1  223.1    6      B
Jan 2  623.23   7      B
I want to group the table by the 'Version' prefix (everything from the first character up to the '.'). For the 'Value' column, it should take the value from the row with the longest 'Version' string, and for the 'Name' column it can use any of the rows with the same prefix. The expected result is:
Version Prefix  Value  Name
123             5      A
223             6      B
623             7      B
Meaning versions 123.1 and 123.23 have the same prefix '123', so both rows become one row in the result, and 'Value' equals 5 since the row with Version 123.23 (the row with the longest Version) has 5 as its Value.
from pyspark.sql.functions import split, size, max, col
from pyspark.sql.window import Window

(df.withColumn('Version Prefix', split('Version', r'\.')[0])                         # create the prefix column
 .withColumn('size', size(split(split('Version', r'\.')[1], '(?!$)')))               # length of the suffix
 .withColumn('max', max('size').over(Window.partitionBy('Version Prefix', 'Name')))  # longest suffix per group
 .where(col('size') == col('max'))                                                   # keep only rows with the longest suffix
 .drop('Date', 'size', 'max', 'Version')                                             # drop unwanted columns
).show()
+-----+----+--------------+
|Value|Name|Version Prefix|
+-----+----+--------------+
| 5| A| 123|
| 6| B| 223|
| 7| B| 623|
+-----+----+--------------+
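An equivalent approach, if it is easier to reason about, is to rank rows by the length of Version within each prefix and keep the top row. A sketch, assuming the same df with columns Date, Version, Value and Name:
from pyspark.sql import functions as F, Window

w = Window.partitionBy('Version Prefix').orderBy(F.length('Version').desc())
result = (df
    .withColumn('Version Prefix', F.split('Version', r'\.')[0])  # prefix = text before the '.'
    .withColumn('rn', F.row_number().over(w))                    # rank rows by Version length, longest first
    .where(F.col('rn') == 1)                                     # keep the longest Version per prefix
    .select('Version Prefix', 'Value', 'Name'))
result.show()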

Is there any easier way to combine 100+ PySpark dataframes with different columns together (not merge, but append)?

Suppose I have a lot of dataframes with similar structure but different columns. I want to combine all of them together; how can I do it in an easier way?
for example, df1, df2, df3 are as follows:
df1
id base1 base2 col1 col2 col3 col4
1 1 100 30 1 2 3
2 2 200 40 2 3 4
3 3 300 20 4 4 5
df2
id base1 base2 col1
5 4 100 15
6 1 99 18
7 2 89 9
df3
id base1 base2 col1 col2
9 2 77 12 3
10 1 89 16 5
11 2 88 10 7
to be:
id base1 base2 col1 col2 col3 col4
1 1 100 30 1 2 3
2 2 200 40 2 3 4
3 3 300 20 4 4 5
5 4 100 15 NaN NaN NaN
6 1 99 18 NaN NaN NaN
7 2 89 9 NaN NaN NaN
9 2 77 12 3 NaN NaN
10 1 89 16 5 NaN NaN
11 2 88 10 7 NaN NaN
currently I use this code:
from pyspark.sql import SparkSession, HiveContext
from pyspark.sql.functions import lit
from pyspark.sql import Row

def customUnion(df1, df2):
    cols1 = df1.columns
    cols2 = df2.columns
    total_cols = sorted(cols1 + list(set(cols2) - set(cols1)))

    def expr(mycols, allcols):
        def processCols(colname):
            if colname in mycols:
                return colname
            else:
                return lit(None).alias(colname)
        cols = map(processCols, allcols)
        return list(cols)

    appended = df1.select(expr(cols1, total_cols)).union(df2.select(expr(cols2, total_cols)))
    return appended

df_comb1 = customUnion(df1, df2)
df_comb2 = customUnion(df_comb1, df3)
However, if I keep creating new dataframes like df4, df5, etc. (100+), my code becomes messy.
Is there a way to code this in an easier way?
Thanks in advance.
You can manage this with a list of data frames and a function, without necessarily needing to statically name each data frame...
dataframes = [df1,df2,df3] # load data frames
Compute the set of all possible columns:
all_cols = {i for lst in [df.columns for df in dataframes] for i in lst}
#{'base1', 'base2', 'col1', 'col2', 'col3', 'col4', 'id'}
A function to add missing columns to a DF:
from pyspark.sql import functions as f

def add_missing_cols(df, cols):
    v = df
    # add each column the dataframe is missing as a null literal
    for col in [c for c in cols if c not in df.columns]:
        v = v.withColumn(col, f.lit(None))
    return v

completed_dfs = [add_missing_cols(df, all_cols) for df in dataframes]
res = completed_dfs[0]
for df in completed_dfs[1:]:
    res = res.unionAll(df)
res.show()
+---+-----+-----+----+----+----+----+
| id|base1|base2|col1|col2|col3|col4|
+---+-----+-----+----+----+----+----+
| 1| 1| 100| 30| 1| 2| 3|
| 2| 2| 200| 40| 2| 3| 4|
| 3| 3| 300| 20| 4| 4| 5|
| 5| 4| 100| 15|null|null|null|
| 6| 1| 99| 18|null|null|null|
| 7| 2| 89| 9|null|null|null|
| 9| 2| 77| 12| 3|null|null|
| 10| 1| 89| 16| 5|null|null|
| 11| 2| 88| 10| 7|null|null|
+---+-----+-----+----+----+----+----+
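As a side note, if you are on Spark 3.1 or later, unionByName can fill in missing columns with nulls for you, so the whole thing collapses to a reduce. A sketch under that version assumption:
from functools import reduce

dataframes = [df1, df2, df3]  # extend with df4, df5, ... as needed
res = reduce(lambda a, b: a.unionByName(b, allowMissingColumns=True), dataframes)
res.show()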

How to add a column to an existing DataFrame and use a window function to sum specific rows into the new column using Scala/Spark 2.2

E.g. I would like to add up the quantity sold by date.
Date Quantity
11/4/2017 20
11/4/2017 23
11/4/2017 12
11/5/2017 18
11/5/2017 12
Output with the new Column:
Date Quantity New_Column
11/4/2017 20 55
11/4/2017 23 55
11/4/2017 12 55
11/5/2017 18 30
11/5/2017 12 30
Simply use sum as a window function by specifying a WindowSpec:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

df.withColumn("New_Column", sum("Quantity").over(Window.partitionBy("Date"))).show
+---------+--------+----------+
| Date|Quantity|New_Column|
+---------+--------+----------+
|11/5/2017| 18| 30|
|11/5/2017| 12| 30|
|11/4/2017| 20| 55|
|11/4/2017| 23| 55|
|11/4/2017| 12| 55|
+---------+--------+----------+
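For reference, since most of the other examples here are PySpark, a rough PySpark equivalent of the same window sum (assuming a dataframe df with Date and Quantity columns) looks like this:
from pyspark.sql import functions as F, Window

df.withColumn("New_Column", F.sum("Quantity").over(Window.partitionBy("Date"))).show()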

Calculate variance across columns in pyspark

How can I calculate variance across numerous columns in a PySpark dataframe?
For e.g. if the pyspark.sql.dataframe table is:
ID A B C
1 12 15 7
2 6 15 2
3 56 25 25
4 36 12 5
and output needed is
ID A B C Variance
1 12 15 7 10.9
2 6 15 2 29.6
3 56 25 25 213.6
4 36 12 5 176.2
There is a variance function in pyspark but it works only column-wise.
Just concatenate the columns that you need using the concat_ws function and use a udf to calculate the variance, as below:
from pyspark.sql.functions import *
from pyspark.sql.types import *
from statistics import pvariance
def calculateVar(row):
    data = [float(x.strip()) for x in row.split(",")]
    return pvariance(data)

varUDF = udf(calculateVar, FloatType())
df.withColumn('Variance', varUDF(concat_ws(",", df.a, df.b, df.c))).show()
Output:
+---+---+---+---+---------+
| id| a| b| c| Variance|
+---+---+---+---+---------+
| 1| 12| 15| 7|10.888889|
| 2| 6| 15| 2|29.555555|
| 3| 56| 25| 25|213.55556|
| 4| 36| 12| 5|176.22223|
+---+---+---+---+---------+
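If you prefer to avoid a udf, here is a sketch of the same population variance using only column arithmetic; it assumes exactly the three columns a, b and c:
from pyspark.sql import functions as F

cols = [F.col(c).cast('double') for c in ('a', 'b', 'c')]
mean = sum(cols) / len(cols)                                # row-wise mean of the three columns
variance = sum((c - mean) ** 2 for c in cols) / len(cols)   # population variance across the row
df.withColumn('Variance', variance).show()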

Running Sum of last one hour transaction using Spark Scala

I want to calculate a running sum over the last one hour for each transaction using Spark-Scala. I have the following dataframe with three fields and want to calculate the fourth field as given below:
Customer TimeStamp Tr Last_1Hr_RunningSum
Cust-1 6/1/2015 6:51:55 1 1
Cust-1 6/1/2015 6:58:34 3 4
Cust-1 6/1/2015 7:20:46 3 7
Cust-1 6/1/2015 7:40:45 4 11
Cust-1 6/1/2015 7:55:34 5 15
Cust-1 6/1/2015 8:20:34 0 12
Cust-1 6/1/2015 8:34:34 3 12
Cust-1 6/1/2015 9:35:34 7 7
Cust-1 6/1/2015 9:45:34 3 10
Cust-2 6/1/2015 16:26:34 2 2
Cust-2 6/1/2015 16:35:34 1 3
Cust-2 6/1/2015 17:39:34 3 3
Cust-2 6/1/2015 17:43:34 5 8
Cust-3 6/1/2015 17:17:34 6 6
Cust-3 6/1/2015 17:21:34 4 10
Cust-3 6/1/2015 17:45:34 2 12
Cust-3 6/1/2015 17:56:34 3 15
Cust-3 6/1/2015 18:21:34 4 13
Cust-3 6/1/2015 19:24:34 1 1
I want to calculate "Last_1Hr_RunningSum" as new field which look back for one hour from each transaction by customer id and take some of "Tr"(Transaction filed).
For example :Cust-1 at 6/1/2015 8:20:34 will look back till 6/1/2015 7:20:46 and take sum of (0+5+4+3) = 12.
Same way for each row I want to look back for one hour and take sum of all Transaction during that one hour.
I tried running sqlContext.sql with nested query but its giving me error. Also Window function and Row Number over partition is not supported by Spark-Scala SQLContext.
How can I get the sum of last one hour from "Tr" using column 'TimeStamp' with Spark-Scala only.
Thanks in advance.
"I tried running sqlContext.sql with a nested query but it's giving me an error"
Did you try using a join?
df.registerTempTable("input")
val result = sqlContext.sql("""
SELECT
FIRST(a.Customer) AS Customer,
FIRST(a.Timestamp) AS Timestamp,
FIRST(a.Tr) AS Tr,
SUM(b.Tr) AS Last_1Hr_RunningSum
FROM input a
JOIN input b ON
a.Customer = b.Customer
AND b.Timestamp BETWEEN (a.Timestamp - 3600000) AND a.Timestamp
GROUP BY a.Customer, a.Timestamp
ORDER BY a.Customer, a.Timestamp
""")
result.show()
Which prints the expected result:
+--------+-------------+---+-------------------+
|Customer| Timestamp| Tr|Last_1Hr_RunningSum|
+--------+-------------+---+-------------------+
| Cust-1|1420519915000| 1| 1.0|
| Cust-1|1420520314000| 3| 4.0|
| Cust-1|1420521646000| 3| 7.0|
| Cust-1|1420522845000| 4| 11.0|
| Cust-1|1420523734000| 5| 15.0|
| Cust-1|1420525234000| 0| 12.0|
| Cust-1|1420526074000| 3| 12.0|
| Cust-1|1420529734000| 7| 7.0|
| Cust-1|1420530334000| 3| 10.0|
| Cust-2|1420554394000| 2| 2.0|
| Cust-2|1420554934000| 1| 3.0|
| Cust-2|1420558774000| 3| 3.0|
| Cust-2|1420559014000| 5| 8.0|
| Cust-3|1420557454000| 6| 6.0|
| Cust-3|1420557694000| 4| 10.0|
| Cust-3|1420559134000| 2| 12.0|
| Cust-3|1420559794000| 3| 15.0|
| Cust-3|1420561294000| 4| 13.0|
| Cust-3|1420565074000| 1| 1.0|
+--------+-------------+---+-------------------+
(This solution assumes the time is given in milliseconds)
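On newer Spark versions, where window functions are available without a HiveContext, a range-based window avoids the self-join entirely. A minimal PySpark sketch, assuming TimeStamp is an actual timestamp column:
from pyspark.sql import functions as F, Window

w = (Window.partitionBy("Customer")
     .orderBy(F.col("TimeStamp").cast("long"))  # order by epoch seconds
     .rangeBetween(-3600, 0))                   # look back one hour, including the current row
df.withColumn("Last_1Hr_RunningSum", F.sum("Tr").over(w)).show()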