Split a dataframe into batches in PySpark

My requirement is to split the dataframe into batches of 2 rows each, with a batch number (BATCH in the output) that increases incrementally.
col#1 col#2 DATE
A     1     202010
B     1.1   202010
C     1.2   202010
D     1.3   202001
E     1.4   202001
Expected output:
col#1 col#2 DATE   BATCH
A     1     202010 1
B     1.1   202010 1
C     1.2   202010 2
D     1.3   202001 2
E     1.4   202001 3

I was able to achieve this with the following approach:
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, LongType

def dfZipWithIndex(df, offset=1, colName='rowId'):
    # Prepend a rowId column by zipping the underlying RDD with its index.
    new_schema = StructType([StructField(colName, LongType(), True)] +
                            df.schema.fields)
    zipped_rdd = df.rdd.zipWithIndex()
    new_rdd = zipped_rdd.map(lambda args: [args[1] + offset] + list(args[0]))
    return spark.createDataFrame(new_rdd, new_schema)

chunk_size = 2
final_new = dfZipWithIndex(input_df)
# rowId is 1-based (offset=1), so shift by 1 before dividing into chunks.
temp_final = final_new.withColumn('BATCH', F.floor((F.col('rowId') - 1) / chunk_size) + 1)
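An alternative sketch that stays in the DataFrame API (assuming PySpark's Window and row_number; ordering by monotonically_increasing_id is only a stand-in for whatever ordering the data actually needs):

from pyspark.sql import functions as F, Window

chunk_size = 2

# A global 1-based row number via a window, then floor-divide into batches.
w = Window.orderBy(F.monotonically_increasing_id())
batched = (input_df
           .withColumn('rowId', F.row_number().over(w))
           .withColumn('BATCH', F.floor((F.col('rowId') - 1) / chunk_size) + 1)
           .drop('rowId'))

Note that a window with no partitionBy pulls all rows into a single partition, so this is only practical for modest row counts.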

Related

SQL aggregate sum produces unexpected output

I don't understand how sum works.
For a PostgreSQL table in DBeaver:
a  b  c  d
----------
1  2  3  2
1  2  4  3
2  1  3  2
2  1  4  2
3  2  4  2
the query
select a, b, c, d, sum(c) as sum_c, sum(d) as sum_d from abc a group by a, b, c, d
produces
a  b  c  d  sum_c  sum_d
------------------------
1  2  3  2  3      2
1  2  4  3  4      3
2  1  3  2  3      2
2  1  4  2  4      2
3  2  4  2  4      2
and I don't understand why: I expected sum_c to be 18 in each row, which is the sum of the values in c, and sum_d to be 11 for the same reason.
Why do sum_c and sum_d just copy the values from c and d in each row?
You can't get the result that you want with group by.
When you aggregate with GROUP BY, you create one group for each distinct combination of the columns listed after GROUP BY, and for each of these groups you get the aggregated results.
For your sample data, one group is 1,2,3,2, and for this combination of values the sum of c is 3, since there is only one row with c=3 in that group.
Use the SUM() window function:
SELECT a, b, c, d,
       SUM(c) OVER () sum_c,
       SUM(d) OVER () sum_d
FROM abc
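With an empty OVER () the window spans the whole result set rather than a group, so for the sample data every row carries the grand totals:
a  b  c  d  sum_c  sum_d
------------------------
1  2  3  2  18     11
1  2  4  3  18     11
2  1  3  2  18     11
2  1  4  2  18     11
3  2  4  2  18     11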

separating the records in a kdb table

There is a table with a column that I would like to break into multiple records. For example
q)tab:([]a:1 2 3;b:(`a;`$"b c";`d);c:2 3 4)
q)tab
a b   c
-------
1 a   2
2 b c 3
3 d   4
There is a space between b and c in the second entry of column b. I would like the table to become:
a b c
-----
1 a 2
2 b 3
2 c 3
3 d 4
I tried
" " string vs exec b from tab
but it didn't work. Any ideas?
Since b is the column with multiple entries per row, you can count each value and expand the corresponding row entries accordingly. Then ungroup, as Terry mentioned, should work.
q)t:([]a:1 2 3;b:(`a;`b`c;`d);c:2 3 4)
/ replicate every cell so each column matches the length of its row's b entry
q)![t;();0b;{x!(enlist({(count each x)#'y};`b)),/:x}cols t]
a   b    c
------------
,1  ,`a  ,2
2 2 `b`c 3 3
,3  ,`d  ,4
q)ungroup ![t;();0b;{x!(enlist({(count each x)#'y};`b)),/:x}cols t]
a b c
-----
1 a 2
2 b 3
2 c 3
3 d 4
EDIT: Realised after your comment that the input is different. I think this is what you want.
q)t:([]a:1 2 3;b:(`a;`$"b c";`d);c:2 3 4)
q)ungroup update`$" "vs'string b from t
a b c
-----
1 a 2
2 b 3
2 c 3
3 d 4
You would normally do this using ungroup:
q)ungroup([]a:1 2 3;b:((),`a;`b`c;(),`d);c:2 3 4)
a b c
-----
1 a 2
2 b 3
2 c 3
3 d 4
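The (), prefix in the last example matters: ungroup expects the column being flattened to contain a list in every row, so the singleton symbols `a and `d are promoted to one-element lists by joining them onto the empty list. A minimal illustration:
q)(),`a                          / join to the empty list promotes an atom
,`a
q)count each ((),`a;`b`c;(),`d)
1 2 1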

Create a Boolean column displaying comparison between 2 other columns in kdb+

I'm currently learning kdb+/q.
I have a table of data. I want to take 2 columns of data (just numbers), compare them and create a new Boolean column that will display whether the value in column 1 is greater than or equal to the value in column 2.
I am comfortable using the update command to create a new column, but I don't know how to ensure that it is Boolean, how to compare the values, or how to display the "greater-than-or-equal-to-ness". Is it possible to produce a simple yes/no output for that?
Thanks.
/ dummy data
q)show t:([]a:1 2 3;b:0 2 4)
a b
---
1 0
2 2
3 4
/ add a boolean column 'ge' holding b>=a
q)update ge:b>=a from t
a b ge
------
1 0 0
2 2 1
3 4 1
Use a vector conditional:
http://code.kx.com/q/ref/lists/#vector-conditional
q)t:([]c1:1 10 7 5 9;c2:8 5 3 4 9)
q)show r:update goe:?[c1>=c2;1b;0b] from t
c1 c2 goe
---------
1  8  0
10 5  1
7  3  1
5  4  1
9  9  1
Use meta to confirm the goe column is of boolean type:
q)meta r
c  | t f a
---| -----
c1 | j
c2 | j
goe| b
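As an aside, c1>=c2 already returns a boolean vector in q, so the vector conditional can arguably be skipped; a direct update should give the same column:
q)update goe:c1>=c2 from t
c1 c2 goe
---------
1  8  0
10 5  1
7  3  1
5  4  1
9  9  1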
The <= operator works on whole vectors, but in cases where a function needs atoms as input to perform an operation, you might want to use ' (the each-both operator).
e.g. To compare the length of symbol string with another column value
q)f:{x<=count string y}
q)f[3;`ab]
0b
q)t:([] l:1 2 3; s: `a`bc`de)
q)update r:f'[l;s] from t
l s  r
------
1 a  1
2 bc 1
3 de 0

How to flatten a data frame in Apache Spark | Scala

I have the following data frame:
df1
uid text frequency
1   a    1
1   b    0
1   c    2
2   a    0
2   b    0
2   c    1
I need to flatten it on the basis of uid to :
df2
uid a b c
1   1 0 2
2   0 0 1
I've done something along these lines in R but haven't been able to translate it into SQL or Scala.
Any suggestions on how to approach this?
You can group by uid, use text as a pivot column and sum frequencies:
df1
  .groupBy("uid")
  .pivot("text")
  .sum("frequency")
  .show()

PostgreSQL: sum data from a row of a table?

x a b c d
---------
A 1 2 3 4
B 5 6 7 8
C 6 7 8 9
I want the sum for row A to be 1 + 2 + 3 + 4, and likewise for B and C. Is there any command that can sum across a row of data in PostgreSQL?
There is no such built-in function, but you can simply do the following:
select x, a+b+c+d as column_sum from mytable
Assuming, of course, that the data types of a, b, c and d are numeric.
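One caveat: in PostgreSQL, a+b+c+d evaluates to NULL as soon as any one of the columns is NULL. If the columns are nullable, COALESCE keeps the sum defined (treating missing values as 0):

-- NULL-safe variant of the same row sum
select x,
       coalesce(a, 0) + coalesce(b, 0) + coalesce(c, 0) + coalesce(d, 0) as column_sum
from mytable;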