Overriding values in a dataframe while joining two dataframes - Scala

In the example below I would like to override the values in Spark Dataframe A with the corresponding values from Dataframe B (if they exist). Is there a way to do it using Spark (Scala)?
Dataframe A
ID Name Age
1 Paul 30
2 Sean 35
3 Rob 25
Dataframe B
ID Name Age
1 Paul 40
Result
ID Name Age
1 Paul 40
2 Sean 35
3 Rob 25

The combined use of a left join and coalesce should do the trick, something like:
import org.apache.spark.sql.functions.coalesce

dfA
  .join(dfB, Seq("ID"), "left")
  .select(
    dfA.col("ID"),
    dfA.col("Name"),
    // take dfB's Age when the join matched, otherwise keep dfA's Age
    coalesce(dfB.col("Age"), dfA.col("Age")).as("Age")
  )
Explanation: for a given ID some_id there are two cases:
If dfB does not contain some_id, the left join produces null for dfB.col("Age"), and coalesce returns the first non-null value among the expressions passed to it, i.e. the value of dfA.col("Age").
If dfB contains some_id, the value from dfB.col("Age") is used.
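For reference, the same left-join-plus-coalesce pattern in PySpark (a minimal sketch; the dataframes are rebuilt here from the sample data above, so adapt the names and schema to your own):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

dfA = spark.createDataFrame([(1, "Paul", 30), (2, "Sean", 35), (3, "Rob", 25)], ["ID", "Name", "Age"])
dfB = spark.createDataFrame([(1, "Paul", 40)], ["ID", "Name", "Age"])

result = (
    dfA.join(dfB, "ID", "left")
       .select(
           dfA["ID"],
           dfA["Name"],
           # dfB's Age wins when the join matched; otherwise fall back to dfA's Age
           F.coalesce(dfB["Age"], dfA["Age"]).alias("Age"),
       )
)
result.show()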

Related

How to count the number of values in a column in a dataframe based on the values in the other dataframe

I have two dataframes. The first one is a raw dataframe, so its item_value column holds all the item values. The other dataframe has columns named min, avg and max, which hold the min/avg/max values specified for the items in the first dataframe. I want to count the number of item values in the first dataframe based on the aggregate values specified in the second dataframe.
the first dataframe looks like this
item_name  item_value
A          1.4
A          2.1
B          3.0
A          2.8
B          4.5
B          1.1
the second dataframe looks like this
item_name  min  avg  max
A          1.1  2    2.7
B          2.1  3    4.0
I want to count the number of item values that are greater than the defined min,avg,max values in the other dataframe
So the result I want is
item_name  min  avg  max
A          3    2    1
B          2    1    1
Any help would be much appreciated
*please forgive my grammar
If you don't mind a SQL implementation, you can try:
df1.createOrReplaceTempView('df1')
df2.createOrReplaceTempView('df2')
sql = """
select df2.item_name,
sum(case when df1.item_value > df2.min then 1 else 0 end) as min,
sum(case when df1.item_value > df2.avg then 1 else 0 end) as avg,
sum(case when df1.item_value > df2.max then 1 else 0 end) as max
from df2 join df1 on df2.item_name=df1.item_name
group by df2.item_name
"""
df = spark.sql(sql)
df.show()
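If you prefer to stay in the DataFrame API, a roughly equivalent sketch (assuming the same df1/df2 from the question) would be:
from pyspark.sql import functions as F

result = (
    df1.join(df2, "item_name")
       .groupBy("item_name")
       .agg(
           # count rows whose item_value exceeds each of df2's thresholds
           F.sum(F.when(F.col("item_value") > F.col("min"), 1).otherwise(0)).alias("min"),
           F.sum(F.when(F.col("item_value") > F.col("avg"), 1).otherwise(0)).alias("avg"),
           F.sum(F.when(F.col("item_value") > F.col("max"), 1).otherwise(0)).alias("max"),
       )
)
result.show()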

KDB: Select rows from a table based on one of its columns while comparing it to another table

I have table1 as below.
num  value
1    10
2    15
3    20
table2
ver  value
1.0  5
2.0  15
3.0  18
Output should be as below. I need to select all rows from table1 such that table1.value <= table2.value.
num  value
1    10
2    15
I tried this, it's not working.
select from table1 where value <= (exec value from table2)
From a logical point of view what you're asking kdb to compare is:
10 15 20<=5 15 18
Because these are lists of equal length, kdb assumes you mean a pairwise comparison, i.e.
10<=5
15<=15
20<=18
to which it would return
q)10 15 20<=5 15 18
010b
What you actually seem to mean (based on your expected output) is 10 15 20<=max(5 15 18). So in that case you would want:
q)t1:([]num:1 2 3;val:10 15 20)
q)t2:([]ver:1 2 3.;val:5 15 18)
q)select from t1 where val<=exec max val from t2
num val
-------
1 10
2 15
As an aside, you can't/shouldn't have a column called value as it clashes with a keyword
value is a keyword so don't assign to it.
Assuming you want all rows from table1 with value less than or equal to the max value in table2, you could do:
q)table1:([]num:til 3;val:10 15 20)
q)table2:([]ver:`float$til 3;val:5 15 18)
q)select from table1 where val<=max table2`val
num val
-------
0 10
1 15

Select distinct for all columns from keyed table

It seems we cannot get distinct values from a keyed table in the same way as from an unkeyed one:
t:([a:1 2]b:3 4)
?[t;();0b;()] // keyed table
?[0!t;();1b;()] // unkeyed table
?[t;();1b;()] // err 'type
Why do we have this error here?
I suspect it's the same reason you can't run distinct on a dictionary - it's ambiguous. Do you intend to apply distinct to the keys or the values? I think kdb doesn't pick a side so it makes you do it yourself.
q)t:([]a:1 1 1 2 2;b:10 12 10 14 14)
q)select distinct from t
a b
----
1 10
1 12
2 14
q)select distinct from 1!t
'type
q)distinct `a`b`c!(1;"ab";enlist 1b)
'type

Functional update - multivariable function with dynamic columns

Any help with the following would be much appreciated!
I have two tables: table1 is a summary table whilst table2 is a list of all data points. I want to be able to summarise the information in table2 for each row in table1.
table1:flip `grp`constraint!(`a`b`c`d; 10 10 20 20);
table2:flip `grp`cat`constraint`val!(`a`a`a`a`a`b`b`b;`cl1`cl1`cl1`cl2`cl2`cl2`cl2`cl1; 10 10 10 10 10 10 20 10; 1 2 3 4 5 6 7 8);
function:{[grpL;constraintL;catL] first exec total: sum val from table2 where constraint=constraintL, grp=grpL,cat=catL};
update cl1:function'[grp;constraint;`cl1], cl2:function'[grp;constraint;`cl2] from table1;
The fourth line of this code achieves what I want for the two categories, cl1 and cl2.
In table1 I want to name a new column with the name of the category (cl1, cl2, etc.) and I want the values in that column to be the output from running the function over that column.
However, I have hundreds of different categories, so don't want to have to list them out manually as in the fourth line. How would I pass in a list of categories, e.g. below?
`cl1`cl2`cl3
Sticking to your approach, you would just have to make your update statement functional and then iterate over the columns like so:
{![`table1;();0b;(1#x)!enlist ((';function);`grp;`constraint;1#x)]} each `cl1`cl2
This assumes you can amend table1 in place. If you must retain the original table1, you can instead pass it by value, though that will consume more memory:
{![x;();0b;(1#y)!enlist ((';function);`grp;`constraint;1#y)]}/[table1;`cl1`cl2]
Another approach would be to aggregate, pivot and join, though it's not necessarily a better solution, as you get nulls rather than zeros:
a:select sum val by cat,grp,constraint from table2
p:exec (exec distinct cat from a)#cat!val by grp,constraint from a
table1 lj p
There are several different methods you can look into.
The easiest method would be a functional update - http://code.kx.com/wiki/JB:QforMortals2/queries_q_sql#Functional_update
The approach below, though, should prove more useful, quicker and neater.
Your problem can be split into two parts. For the first part, you are looking to sum each category by grp and constraint within table2. For the second part, you are looking to join these results (the lookups) onto the corresponding records from table1.
You can create the necessary groups using by
q)exec val,cat by grp,constraint from table2
grp constraint| val cat
--------------| ------------------------------
a 10 | 1 2 3 4 5 `cl1`cl1`cl1`cl2`cl2
b 10 | 6 8 `cl2`cl1
b 20 | ,7 ,`cl2
Note though, this will only create nested lists of the columns in your select query
Next is to sum each of the cat groups
q)exec sum each val group cat by grp,constraint from table2
grp constraint|
--------------| ------------
a 10 | `cl1`cl2!6 9
b 10 | `cl2`cl1!6 8
b 20 | (,`cl2)!,7
Then, to create the cat columns you can use a pivot-like syntax - http://code.kx.com/wiki/Pivot
q)cats:asc exec distinct cat from table2
q)exec cats#sum each val group cat by grp,constraint from table2
grp constraint| cl1 cl2
--------------| -------
a 10 | 6 9
b 10 | 8 6
b 20 | 7
Now you can use this lookup table and index into each row from table1
q)(exec cats#sum each val group cat by grp,constraint from table2)[table1]
cl1 cl2
-------
6 9
8 6
To fill the nulls with zeros, use the caret operator - http://code.kx.com/wiki/Reference/Caret
q)0^(exec cats#sum each val group cat by grp,constraint from table2)[table1]
cl1 cl2
-------
6 9
8 6
0 0
0 0
And now you can join each row from table1 to your results using join-each
q)table1,'0^(exec cats#sum each val group cat by grp,constraint from table2)[table1]
grp constraint cl1 cl2
----------------------
a 10 6 9
b 10 8 6
c 20 0 0
d 20 0 0
HTH, Sean
This approach is the easiest way to pass in a list of categories
{table1^flip x!function'[table1`grp;table1`constraint;]each x}`cl1`cl2

Pyspark rdd Transpose

I have the following emp table in hive testing database
1 ram 2000.0 101 market
2 shyam 3000.0 102 IT
3 sam 4000.0 103 finance
4 remo 1000.0 103 finance
I want to transpose this table in pyspark, keeping the first two columns the same and stacking the last three columns.
I have done the following in pyspark shell
test = sqlContext.sql("select * from testing.emp")
data = test.flatMap(lambda row: [Row(id=row['id'], name=row['name'], column_name=col, column_val=row[col]) for col in ('sal', 'dno', 'dname')])
emp = sqlContext.createDataFrame(data)
emp.registerTempTable('mytempTable')
sqlContext.sql('create table testing.test(id int,name string,column_name string,column_val int) row format delimited fields terminated by ","')
sqlContext.sql('INSERT INTO TABLE testing.test select * from mytempTable')
the expected output is
1 ram sal 2000
1 ram dno 101
1 ram dname market
2 shyam sal 3000
2 shyam dno 102
2 shyam dname IT
3 sam sal 4000
3 sam dno 103
3 sam dname finance
4 remo sal 1000
4 remo dno 103
4 remo dname finance
But the output I get is
NULL 2000.0 1 NULL
NULL NULL 1 NULL
NULL NULL 1 NULL
NULL 3000.0 2 NULL
NULL NULL 2 NULL
NULL NULL 2 NULL
NULL 4000.0 3 NULL
NULL NULL 3 NULL
NULL NULL 3 NULL
NULL 1000.0 4 NULL
NULL NULL 4 NULL
NULL NULL 4 NULL
Also, please let me know how I can loop over the columns if the table has many columns.
Sorry, I just noticed it is a Hive table:
from pyspark import SparkConf
from pyspark.sql import SparkSession, Row

cfg = SparkConf().setAppName('MyApp')
spark = SparkSession.builder.config(conf=cfg).enableHiveSupport().getOrCreate()
df = spark.table("default.test").cache()
cols = df.columns[2:5]
df = df.rdd.map(lambda x: Row(id=x[0], name=x[1], val=dict(zip(cols, x[2:5]))))
df = spark.createDataFrame(df)
df.createOrReplaceTempView('mytempTable')
sql = """
select
id,
name,
explode(val) AS (k,v)
from mytempTable
"""
df = spark.sql(sql)
df.show()
And in HIVE :
> desc test;
OK
id string
somebody string
sal string
dno string
dname string
dt string
# Partition Information
# col_name data_type comment
dt string
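If you would rather avoid the RDD round-trip, a DataFrame-only sketch of the same idea (create_map plus explode; it assumes the schema above, where all the stacked columns are strings) could look like this:
from itertools import chain
from pyspark.sql import functions as F

df = spark.table("default.test")
cols = df.columns[2:5]   # sal, dno, dname

# build a map column {column_name: column_value} and explode it into (k, v) rows;
# the map values must share a type, which they do here since the Hive columns are all strings
kv = F.create_map(*chain.from_iterable((F.lit(c), F.col(c)) for c in cols))
result = df.select("id", "somebody", F.explode(kv).alias("k", "v"))
result.show()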
P.S.
You can also do this with SQL only, without the PySpark code, as:
select
a.id,
a.somebody,
b.k,
b.v
from (
select
id,
somebody,
map('sal',sal,
'dno',dno,
'dname',dname) as val
from default.test
) a
lateral VIEW explode(val) b as k,v
For your question about small parquet files:
cfg = SparkConf().setAppName('MyApp')
spark = SparkSession.builder.enableHiveSupport().config(conf=cfg).getOrCreate()
df = spark.sparkContext.parallelize(range(26))
df = df.map(lambda x: (x, chr(x + 97), '2017-01-12'))
df = spark.createDataFrame(df, schema=['idx', 'val', 'dt']).coalesce(1)
df.write.saveAsTable('default.testing', mode='overwrite', partitionBy='dt', format='parquet')
The number of small parquet files equals the number of DataFrame partitions.
You can use df.coalesce or df.repartition to decrease the number of DataFrame partitions.
But I am not sure whether reducing the DataFrame to only one partition hides other trouble (e.g. OOM?).
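As a middle ground (a sketch, reusing the df from the snippet above), you can cap the number of output files without collapsing everything into a single partition:
# repartition to 4 partitions so each dt partition gets at most 4 output files
df = df.repartition(4)   # df.coalesce(4) would avoid a full shuffle
df.write.saveAsTable('default.testing', mode='overwrite', partitionBy='dt', format='parquet')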
And there is another way to combine small files without Spark: just use HIVE SQL:
set mapred.reduce.tasks=1;
insert overwrite table default.yourtable partition (dt='2017-01-13')
select col from tmp.yourtable where dt='2017-01-13';