Spark PIVOT performance is very slow on High volume Data - scala

I have a dataframe with 3 columns and 20,000 rows. I need to convert all 20,000 transid values into columns.
Input table:

| prodid | transid | flag |
|--------|---------|------|
| A      | 1       | 1    |
| B      | 2       | 1    |
| C      | 3       | 1    |

and so on...
Expected output, with up to 20,000 columns:

| prodid | 1 | 2 | 3 |
|--------|---|---|---|
| A      | 1 | 1 | 1 |
| B      | 1 | 1 | 1 |
| C      | 1 | 1 | 1 |
I have tried the PIVOT/transpose approach, but it takes too long on high-volume data: converting 20,000 rows into columns takes around 10 hours.
e.g.:
val array = a1.select("trans_id").distinct.collect.map(x => x.getString(0)).toSeq
val a2 = a1.groupBy("prodid").pivot("trans_id", array).sum("flag")
When I use pivot on 200-300 rows it runs fast, but as the number of rows (and therefore output columns) grows, PIVOT performance degrades badly.
Can anyone please help me find a solution? Is there any method to avoid the PIVOT function, since PIVOT only seems suitable for low-volume conversion? How should I deal with high-volume data?
I need this type of conversion for matrix multiplication: my input will look like the table below, and the final result will be a matrix product.
|col1|col2|col3|col4|
|----|----|----|----|
|1 | 0 | 1 | 0 |
|0 | 1 | 0 | 0 |
|1 | 1 | 1 | 1 |
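Since the pivoted wide table is only an intermediate step toward a matrix multiplication, one way to sidestep the pivot entirely is to feed the long-format (prodid, trans_id, flag) rows straight into a distributed sparse matrix and multiply that. The sketch below only outlines that idea and is not the original code: the prodid-to-row-index mapping, the column types, and the multiply-by-its-own-transpose step at the end are all assumptions.

import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

// hypothetical 0-based row index for each prodid (any stable mapping works)
val prodIndex = a1.select("prodid").distinct.rdd.map(_.getString(0)).zipWithIndex().collectAsMap()
val prodIndexB = a1.sparkSession.sparkContext.broadcast(prodIndex)

// one sparse MatrixEntry per input row, so no 20,000-column pivot is needed
// (assumes trans_id holds a numeric string and flag is an integer column)
val entries = a1.rdd.map { r =>
  MatrixEntry(prodIndexB.value(r.getAs[String]("prodid")),
              r.getAs[String]("trans_id").toLong,
              r.getAs[Int]("flag").toDouble)
}

val m = new CoordinateMatrix(entries).toBlockMatrix().cache()
val product = m.multiply(m.transpose)   // e.g. a prodid x prodid product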

Related

pySpark sequential window function / access previously computed valued (previous row)

I have a pySpark dataframe and need to compute a column that depends on the value of the same column in the previous row. But instead of the old value of that column in the previous row, I need the new one, to which the calculation has already been applied.
Specifically, I need to compute the following: given a dataframe with a column A, ordered by a column order, compute a column B sequentially as B = MAX(0, LAG(B) - LAG(A)), starting with a default value of 0 for the first row.
Example:
Input:
order | A
------|----
0 | -1
1 | -2
2 | 4
3 | 4
4 | -1
5 | 4
6 | -1
Wanted output:
order | A | B
------|----|---
0 | -1 | 0 <- B is set to the default 0 here
1 | -2 | 1
2 | 4 | 3
3 | 4 | 0
4 | -1 | 0
5 | 4 | 1
6 | -1 | 0
Using the default F.lag window function does not work, since it only yields the old value from the previous row; it cannot see already-updated values, because then the column would have to be computed strictly sequentially and could no longer be distributed.
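For what it's worth, this particular recurrence can be rewritten so that ordinary window functions do work: unrolling B = MAX(0, LAG(B) - LAG(A)) with B = 0 on the first row gives B_n = MAX(0, MAX(S_k for k < n) - S_n), where S_n is the running sum of A over all previous rows. Below is a minimal sketch of that idea in Scala (not from the original question; column names are taken from the example, and it assumes the frame can be globally ordered):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// all rows strictly before the current one, in global "order" (single partition -- fine for a sketch)
val prev = Window.orderBy("order").rowsBetween(Window.unboundedPreceding, -1)

val result = df
  .withColumn("s", coalesce(sum("A").over(prev), lit(0)))    // S_n = A_0 + ... + A_(n-1)
  .withColumn("B", greatest(lit(0),
    coalesce(max("s").over(prev), lit(0)) - col("s")))        // MAX(0, MAX(S_k, k < n) - S_n)
  .drop("s")

On the example above this reproduces the wanted B column (0, 1, 3, 0, 0, 1, 0).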

How to create a column of row id in Spark dataframe for each distinct column value using Scala

I have a data frame in Spark (Scala) like this:
category | score |
A | 0.2
A | 0.3
A | 0.3
B | 0.9
B | 0.8
B | 1
I would like to add a row id column as follows:
category | score | row-id
A | 0.2 | 0
A | 0.3 | 1
A | 0.3 | 2
B | 0.9 | 0
B | 0.8 | 1
B | 1 | 2
Basically I want the row id to be monotonically increasing for each distinct value in column category. I already have a sorted dataframe so all the rows with same category are grouped together. However, I still don't know how to generate the row_id that restarts when a new category appears. Please help!
This is a good use case for window functions:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

val spark = df.sparkSession   // needed as a stable value for the implicits import
import spark.implicits._      // enables the 'category / 'score symbol syntax

// number the rows within each category; row_number() is 1-based, so subtract 1 to start at 0
val window = Window.partitionBy('category).orderBy('score)
df.withColumn("row-id", row_number().over(window) - 1)
Window functions work kind of like groupBy, except that instead of each group returning a single value, each row in each group returns a value; in this case the value is the row's position within the group of rows with the same category. Also, if this is the effect you are trying to achieve, you don't need the dataframe to be pre-sorted by category beforehand.

How to preserve order of a DataFrame when writing it as CSV with partitioning by columns?

I sort the rows of a DataFrame and write it out to disk like so:
df.orderBy("foo")
  .write
  .partitionBy("bar", "moo")
  .option("compression", "gzip")
  .csv(outDir)
When I look into the generated .csv.gz files, their order is not preserved. Is this just how Spark does it? Is there a way to preserve order when writing a DataFrame to disk with partitioning?
Edit: To be more precise: it is not the order of the CSV files that is off, but the order of the rows inside them. Let's say I have the following after df.orderBy (for simplicity, I now only partition by one column):
foo | bar | baz
===============
1 | 1 | 1
1 | 2 | 2
1 | 1 | 3
2 | 3 | 4
2 | 1 | 5
3 | 2 | 6
3 | 3 | 7
3 | 1 | 8
4 | 2 | 9
4 | 1 | 10
I expect it to be like this, e.g. for files in folder bar=1:
part-00000-NNN.csv.gz:
1,1
1,3
2,5
part-00001-NNN.csv.gz:
3,8
4,10
But what it is like:
part-00000-NNN.csv.gz:
1,1
2,5
1,3
part-00001-NNN.csv.gz:
4,10
3,8
It's been a while but I witnessed this again. I finally came across a workaround.
Suppose your schema is like:
time: bigint
channel: string
value: double
If you do:
df.sortBy("time").write.partitionBy("channel").csv("hdfs:///foo")
the timestamps in the individual part-* files get tossed around.
If you do:
df.sortBy("channel", "time").write.partitionBy("channel").csv("hdfs:///foo")
the order is correct.
I think it has to do with shuffling. So, as a workaround, I now sort first by the columns I want the data partitioned by, and then by the column I want the rows ordered by within each individual file.
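A variant of the same workaround (only a sketch using the column names above, not part of the original answer) is to repartition by the partition column and then sort within each partition, which makes the intent explicit:

import org.apache.spark.sql.functions.col

df.repartition(col("channel"))               // shuffle once, keyed by the partition column
  .sortWithinPartitions("channel", "time")   // order the rows inside each output task
  .write
  .partitionBy("channel")
  .csv("hdfs:///foo")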

Add a key element for n rows in PySpark Dataframe

I have a dataframe like the one shown below.
id | run_id
--------------
4 | 12345
6 | 12567
10 | 12890
13 | 12450
I wish to add a new column, say key, that has the value 1 for the first n rows and 2 for the next n rows. The result would look like:
id | run_id | key
----------------------
4 | 12345 | 1
6 | 12567 | 1
10 | 12890 | 2
13 | 12450 | 2
Is it possible to do the same with PySpark? Thanks in advance for the help.
Here is one way to do it using zipWithIndex:
# sample rdd
rdd = sc.parallelize([[4, 12345], [6, 12567], [10, 12890], [13, 12450]])
# group size for key
n = 2
# attach a 0-based row number, then assign a key in batches of size n
rdd = rdd.zipWithIndex().map(lambda pair: pair[0] + [pair[1] // n + 1])
# convert to dataframe
df = rdd.toDF(schema=['id', 'run_id', 'key'])
df.show(4)
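If you need the same thing from Scala, here is a sketch along the same lines (df is the original two-column dataframe from the question, n is the group size, and spark is assumed to be the active SparkSession):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.LongType

val n = 2
// attach a 0-based row number, derive the batch key, and rebuild the dataframe with an extra column
val keyedRdd = df.rdd.zipWithIndex().map { case (row, idx) =>
  Row.fromSeq(row.toSeq :+ (idx / n + 1))
}
val keyedDf = spark.createDataFrame(keyedRdd, df.schema.add("key", LongType))
keyedDf.show(4)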

Max consecutive years for each customer in Tableau

I am trying to find, for each customer, the maximum number of consecutive years in which they bought something. I tried to create a calculated field, but to no avail.
I created two calculated fields:
Consecutive: if max([Count])>0 then previous_value(0)+1+index()-index() else 0 end
max: window_max([Consecutive])
My data looks something like:
Year | Customer | Count
1996 | a | 2
1996 | b | 1
1997 | a | 1
1997 | b | 2
1998 | b | 1
So the result would be
a:2
b:3
Use nested table calcs.
The first calc, call it running_good_years, is a running count of consecutive years with sales.
If count(Sales) = 0 then 0 else previous_value(0) + 1 end
The second just returns the max
Window_max(running_good_years)
With table calcs, defining the partitioning and addressing is critical. Partition by Customer, address by Year.