kdb: getting a float from integer division

I have a table
id, turnover, qty
and I want to query
select sum turnover, sum qty, (sum turnover) div (sum qty) by id from Table
However, the resulting value from the division seems to be an int and shows 0 (as the unit price is a lot smaller than 1). I tried to cast the results to float, but that doesn't help:
select sum turnover, sum qty, `float$(`float$(sum turnover) div `float$(sum qty)) by id from Table
How can I get a float in return?
Also, as a side question: how can I name the result column (the equivalent of SQL's select sum(x) as my_column_name ...)?

That's the expected output from div, which performs integer division. Use % to divide numbers instead; it always returns a float.
q)200 div 8.5
22
q)200%8.5
23.52941
References:
div: http://code.kx.com/q/ref/arith-integer/#div
%: http://code.kx.com/q/ref/arith-float/#divide
*edit
Apologies - forgot to address the rest of your question. In your example you are calculating sum turnover and sum qty twice; you will want to avoid that if you're dealing with a lot of records. As for your side question, q's equivalent of SQL's as is name:expression inside the select/update clause - the sumT, sumQ and toverq columns below are named that way.
How about this:
q)show trade:([] id:(`$"A",'string[til 10]);turnover:10?til 10; qty:10?100+til 200)
id turnover qty
---------------
A0 4 152
A1 4 238
A2 2 298
A3 2 268
A4 7 246
A5 2 252
A6 0 279
A7 5 286
A8 7 245
A9 5 191
q)update toverq:sumT%sumQ from select sumT:sum turnover,sumQ:sum qty by id from trade
id| sumT sumQ toverq
--| ---------------------
A0| 4 152 0.02631579
A1| 4 238 0.01680672
A2| 2 298 0.006711409
A3| 2 268 0.007462687
A4| 7 246 0.02845528
A5| 2 252 0.007936508
A6| 0 279 0
A7| 5 286 0.01748252
A8| 7 245 0.02857143
A9| 5 191 0.02617801

Related

Perform bucketing properly on spark query

Let's consider a dataset:
+------+---+
|  name|age|
+------+---+
|   Max| 33|
|  Adam| 32|
|   Zim| 41|
|Muller| 62|
+------+---+
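(The question doesn't show how x is built; for reproducibility, a minimal construction in spark-shell, where the session implicits are in scope, might look like this.)
import spark.implicits._

// Hypothetical setup matching the table above.
val x = Seq(("Max", 33), ("Adam", 32), ("Zim", 41), ("Muller", 62)).toDF("name", "age")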
Now, if we run this query on dataset x:
x.as("a").join(x.as("b")).where(
  $"b.age" - $"a.age" <= 10 and
  $"b.age" > $"a.age").show()
+----+---+----+---+
|name|age|name|age|
+----+---+----+---+
| Max| 33| Zim| 41|
|Adam| 32| Max| 33|
|Adam| 32| Zim| 41|
+----+---+----+---+
That is my desired result.
Now, conceptually, if I have a very big dataset, I might want to use bucketing to reduce the search space.
So, doing bucketing with:
val buck_x = x.withColumn("buc_age", floor($"age" / 10))
which gives me:
+------+---+-------+
|  name|age|buc_age|
+------+---+-------+
|   Max| 33|      3|
|  Adam| 32|      3|
|   Zim| 41|      4|
|Muller| 62|      6|
+------+---+-------+
After exploding each row into its own bucket plus both neighbouring buckets with
val exp_x = buck_x.withColumn("buc_age", explode(array($"buc_age" - 1, $"buc_age", $"buc_age" + 1)))
I get the following result:
+------+---+-------+
|  name|age|buc_age|
+------+---+-------+
|   Max| 33|      2|
|   Max| 33|      3|
|   Max| 33|      4|
|  Adam| 32|      2|
|  Adam| 32|      3|
|  Adam| 32|      4|
|   Zim| 41|      3|
|   Zim| 41|      4|
|   Zim| 41|      5|
|Muller| 62|      5|
|Muller| 62|      6|
|Muller| 62|      7|
+------+---+-------+
Now, after the final query,
exp_x.as("a").join(exp_x.as("b")).where(
$"a.buc_age" === $"b.buc_age" and
$"b.age" - $"a.age" <= 10 and
$"b.age" > $"a.age").show()
I get the following result.
+----+---+-------+----+---+-------+
|name|age|buc_age|name|age|buc_age|
+----+---+-------+----+---+-------+
| Max| 33|      3| Zim| 41|      3|
| Max| 33|      4| Zim| 41|      4|
|Adam| 32|      2| Max| 33|      2|
|Adam| 32|      3| Zim| 41|      3|
|Adam| 32|      3| Max| 33|      3|
|Adam| 32|      4| Zim| 41|      4|
|Adam| 32|      4| Max| 33|      4|
+----+---+-------+----+---+-------+
Clearly, this is not what I expected: I am getting more rows than before. How can I solve this while still using buckets?
Drop your bucketing columns and then select distinct rows. After the explode, each qualifying pair appears once for every exploded bucket value the two rows share, so removing the bucket columns and deduplicating undoes that multiplication:
exp_x.select(exp_x.columns.map(c => col(c).as(c + "_a")) : _*).
  join(exp_x.select(exp_x.columns.map(c => col(c).as(c + "_b")) : _*)).
  where(
    $"buc_age_a" === $"buc_age_b" and
    $"age_b" - $"age_a" <= 10 and
    $"age_b" > $"age_a").
  drop("buc_age_a", "buc_age_b").
  distinct.
  show
+------+-----+------+-----+
|name_a|age_a|name_b|age_b|
+------+-----+------+-----+
| Adam| 32| Zim| 41|
| Adam| 32| Max| 33|
| Max| 33| Zim| 41|
+------+-----+------+-----+
There is really no need for an explode.
Instead, this approach unions two inner self-joins. The two joins find cases where:
- A and B are in the same bucket, and B is older
- B is one bucket higher, but no more than 10 years older
This should perform better than using the explode, since fewer comparisons are performed (the sets being joined here are one third of the exploded size).
val namesDF = Seq(("Max", 33), ("Adam", 32), ("Zim", 41), ("Muller", 62)).toDF("name", "age")
val buck_x = namesDF.withColumn("buc_age", floor($"age" / 10))
// same bucket where b is still older
val same = buck_x.as("a").join(buck_x.as("b"), ($"a.buc_age" === $"b.buc_age" && $"b.age" > $"a.age"), "inner")
// different buckets -- b is one bucket higher but still no more than 10 years older
val diff = buck_x.as("a").join(buck_x.as("b"), ($"a.buc_age" + 1 === $"b.buc_age" && $"b.age" <= $"a.age" + 10), "inner")
val result = same.union(diff)
The result (you can do a drop to remove excess columns like in Charlie's answer):
result.show(false)
+----+---+-------+----+---+-------+
|name|age|buc_age|name|age|buc_age|
+----+---+-------+----+---+-------+
|Adam|32 |3      |Max |33 |3      |
|Max |33 |3      |Zim |41 |4      |
|Adam|32 |3      |Zim |41 |4      |
+----+---+-------+----+---+-------+
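As a quick sanity check (a sketch, assuming the namesDF and result values defined above): the bucketed result should contain exactly the pairs that the naive quadratic self-join finds.
// Naive O(n^2) self-join for comparison; correct but slow on large data.
val naive = namesDF.as("a").join(namesDF.as("b"),
  $"b.age" - $"a.age" <= 10 && $"b.age" > $"a.age", "inner")

// Both approaches should report the same number of qualifying pairs (3 here).
assert(result.count == naive.count)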

Add unique rows for each group when similar group repeats after certain rows

Can anyone help me get a unique group number?
I need a unique number for each group, even when the same group value reappears after other groups.
I have following data:
id version product startdate enddate
123 0 2443 2010/09/01 2011/01/02
123 1 131 2011/01/03 2011/03/09
123 2 131 2011/08/10 2012/09/10
123 3 3009 2012/09/11 2014/03/31
123 4 668 2014/04/01 2014/04/30
123 5 668 2014/05/01 2016/01/01
123 6 668 2016/01/02 2017/09/08
123 7 131 2017/09/09 2017/10/10
123 8 131 2018/10/11 2019/01/01
123 9 550 2019/01/02 2099/01/01
I tried
select *,
  dense_rank() over (partition by id order by id, product)
from table
but dense_rank assigns the same number to a product whenever it reappears, rather than starting a new group.
Expected results:
id version product startdate enddate count
123 0 2443 2010/09/01 2011/01/02 1
123 1 131 2011/01/03 2011/03/09 2
123 2 131 2011/08/10 2012/09/10 2
123 3 3009 2012/09/11 2014/03/31 3
123 4 668 2014/04/01 2014/04/30 4
123 5 668 2014/05/01 2016/01/01 4
123 6 668 2016/01/02 2017/09/08 4
123 7 131 2017/09/09 2017/10/10 5
123 8 131 2018/10/11 2019/01/01 5
123 9 550 2019/01/02 2099/01/01 6
Try the following gaps-and-islands approach: flag each row where the product differs from the previous row's, then a running sum of those flags numbers the groups in order:
SELECT
  id, version, product, startdate, enddate,
  1 + SUM(v) OVER (PARTITION BY id ORDER BY version) n
FROM
(
  SELECT
    *,
    IIF(LAG(product) OVER (PARTITION BY id ORDER BY version) <> product, 1, 0) v
  FROM TestTable
) q

Update Spark dataframe to populate data from another dataframe

I have 2 dataframes. I want to take the distinct values of one column and link them with all the rows of another dataframe. For example:
Dataframe 1: df1 contains
scenarioId
---------------
101
102
103
Dataframe 2: df2 contains columns
trades
-------------------------------------
isin price
ax11 111
re32 909
erre 445
Expected output
trades
----------------
isin price scenarioid
ax11 111 101
re32 909 101
erre 445 101
ax11 111 102
re32 909 102
erre 445 102
ax11 111 103
re32 909 103
erre 445 103
Note that I don't have a common column to join the two dataframes on. Please suggest.
What you need is a cross join, or Cartesian product:
val result = df1.crossJoin(df2)
although I do not recommend it, as the amount of data grows very fast. You'll get all possible pairs (the elements of the Cartesian product): the number of rows will be the number of rows in df1 times the number of rows in df2.
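A minimal sketch with the question's data (assuming spark-shell; the lower-case scenarioid column name follows the expected output above):
import spark.implicits._

val df1 = Seq(101, 102, 103).toDF("scenarioid")
val df2 = Seq(("ax11", 111), ("re32", 909), ("erre", 445)).toDF("isin", "price")

// Every trade is paired with every scenario: 3 x 3 = 9 rows.
val result = df2.crossJoin(df1)
result.show()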

Dataframe groupBy, get corresponding rows value, based on result of aggregate function [duplicate]

This question already has answers here:
How to select the first row of each group?
(9 answers)
Closed 6 years ago.
I have a dataframe with columns c1, c2, c3, c4. I want to group it on one column, apply an aggregate function (e.g. min/max) to another column, and get the corresponding values of the remaining columns based on the result of the aggregate function.
Example :
c1 c2 c3 c4
1 23 1 1
1 45 2 2
1 91 3 3
1 90 4 4
1 71 5 5
1 42 6 6
1 72 7 7
1 44 8 8
1 55 9 9
1 21 0 0
Should result in:
c1 c2 c3 c4
1 91 3 3
Let the dataframe be df:
df.groupBy($"c1").agg(max($"c2"), ??, ??)
Can someone please help with what should go in place of the ?? placeholders?
I know a solution to this problem using RDDs; I wanted to explore whether it can be solved more easily with the DataFrame/Dataset API.
You can do this in two steps:
calculate the aggregated data frame;
join it back to the original data frame and filter on the condition.
So:
val maxDF = df.groupBy("c1").agg(max($"c2").as("maxc2"))
// maxDF: org.apache.spark.sql.DataFrame = [c1: int, maxc2: int]
df.join(maxDF, Seq("c1")).where($"c2" === $"maxc2").drop($"maxc2").show
+---+---+---+---+
| c1| c2| c3| c4|
+---+---+---+---+
| 1| 91| 3| 3|
+---+---+---+---+
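A join-free alternative worth knowing (a sketch of the struct trick from the linked duplicate, not part of this answer): structs compare field by field, so taking the max of a struct whose first field is c2 carries the matching c3 and c4 along.
import org.apache.spark.sql.functions.{max, struct}

// Ties on c2 are broken by c3, then c4, because of the struct field order.
val rowMax = df.groupBy($"c1")
  .agg(max(struct($"c2", $"c3", $"c4")).as("m"))
  .select($"c1", $"m.c2", $"m.c3", $"m.c4")
rowMax.show()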

Difference between SAS merge and full outer join [duplicate]

This question already has answers here:
How to replicate a SAS merge
(2 answers)
Closed 7 years ago.
Table t1:
person | visit | code_num1 | code_desc1
1      | 1     | 100       | OTD
1      | 2     | 101       | SED
2      | 3     | 102       | CHM
3      | 4     | 103       | OTD
3      | 4     | 103       | OTD
4      | 5     | 101       | SED
Table t2:
person | visit | code_num2 | code_desc2
1      | 1     | 104       | DME
1      | 6     | 104       | DME
3      | 4     | 103       | OTD
3      | 4     | 103       | OTD
3      | 7     | 103       | OTD
4      | 5     | 104       | DME
I have the following SAS code that merges the two tables t1 and t2 by person and visit:
DATA t3;
MERGE t1 t2;
BY person visit;
RUN;
Which produces the following output:
person | visit | code_num1 | code_desc1 | code_num2 | code_desc2
1      | 1     | 100       | OTD        | 104       | DME
1      | 2     | 101       | SED        |           |
1      | 6     |           |            | 104       | DME
2      | 3     | 102       | CHM        |           |
3      | 4     | 103       | OTD        | 103       | OTD
3      | 4     | 103       | OTD        | 103       | OTD
3      | 7     |           |            | 103       | OTD
4      | 5     | 101       | SED        | 104       | DME
I want to replicate this in a hive query, and tried using a full outer join:
create table t3 as
select case when a.person is null then b.person else a.person end as person,
case when a.visit is null then b.visit else a.visit end as visit,
a.code_num1, a.code_desc1, b.code_num2, b.code_desc2
from t1 a
full outer join t2 b
on a.person=b.person and a.visit=b.visit
Which yields the table:
person | visit | code_num1 | code_desc1 |code_num2 | code_desc2
1 1 100 OTD 104 DME
1 2 101 SED null null
1 6 null null 104 DME
2 3 102 CHM null null
3 4 103 OTD 103 OTD
3 4 103 OTD 103 OTD
3 4 103 OTD 103 OTD
3 4 103 OTD 103 OTD
3 7 null null 103 OTD
4 5 101 SED 104 DME
This is almost the same as the SAS output, but there are two extra rows for (person=3, visit=4). I assume this is because Hive matches each of the duplicate rows in one table with both matching rows in the other, producing four rows in t3, whereas SAS matches them pairwise. Any suggestions on how I could get my query to match the output of the SAS merge?
If you merge two data sets that have variables with the same names (besides the by variables), then the variables from the second data set will overwrite any same-named variables in the first. So your SAS code creates an overlaid dataset; a full outer join does not do this.
It seems to me that if you first dedupe the right-hand table and then do a full outer join, you should get the equivalent table in Hive. I don't see a need for the case when statements either, as Joe pointed out; just join on the key values:
create table t3 as
select coalesce(a.person, b.person) as person
     , coalesce(a.visit, b.visit) as visit
     , a.code_num1
     , a.code_desc1
     , b.code_num2
     , b.code_desc2
from
(select * from t1) a
full outer join
(select person, visit, code_num2, code_desc2
 from t2
 group by person, visit, code_num2, code_desc2) b
on a.person = b.person and a.visit = b.visit
;
I can't test this code currently, so be sure to test it yourself. Good luck.