Combining values across columns and pivoting - pyspark

I have a pyspark df like so:
+-------+-------+-----------+-----------+-----------+-----------+-----------+-----------+
| SEQ_ID|TOOL_ID|kurtosis_1m|kurtosis_2m|kurtosis_3m|kurtosis_4m|kurtosis_5m|kurtosis_6m|
+-------+-------+-----------+-----------+-----------+-----------+-----------+-----------+
|3688539|  99725|     6.7484|     6.2753|     6.2055|     7.2076|     7.0501|     7.5099|
|3689076|  99705|     4.8109|     4.3774|     4.1131|     4.4084|     4.1568|     4.4445|
+-------+-------+-----------+-----------+-----------+-----------+-----------+-----------+
I need to pivot it in such a way that I end up with a dataframe like so:
+-------+-------+--------+
| SEQ_ID|TOOL_ID|kurtosis|
+-------+-------+--------+
|3688539|  99725|  6.7484|
|3688539|  99725|  6.2753|
|3688539|  99725|  6.2055|
|3688539|  99725|  7.2076|
|3688539|  99725|  7.0501|
|3688539|  99725|  7.5099|
|3689076|  99705|  4.8109|
|3689076|  99705|  4.3774|
|3689076|  99705|  4.1131|
|3689076|  99705|  4.4084|
|3689076|  99705|  4.1568|
|3689076|  99705|  4.4445|
+-------+-------+--------+
I figured one way would be to create the kurtosis column as an array column and then explode it. How do you combine the values of multiple columns of a dataframe into a single array column?
I have other columns like mean_1m, mean_2m, etc. that I need to pivot in the same manner.
Any insights?
Thank You

You can create an array of dataframes and union them.
First, identify the kurtosis columns:
sub_string = "kurtosis"
kurtosis_col = [x for x in df.schema.names if sub_string in x]
Now create a list of dataframes, one for each kurtosis column:
from functools import reduce
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

df_array = [df.withColumn('col', F.col(x))
              .select('seq_id', 'tool_id', 'col')
            for x in kurtosis_col]
# Union them
reduce(DataFrame.unionAll, df_array).withColumnRenamed("col", "kurtosis").show()
Output:
+-------+-------+--------+
| seq_id|tool_id|kurtosis|
+-------+-------+--------+
|3688539|  99725|  6.7484|
|3688539|  99725|  7.2076|
|3688539|  99725|  6.2753|
|3688539|  99725|  6.2055|
|3688539|  99725|  7.5099|
|3688539|  99725|  7.0501|
|3689076|  99705|  4.4084|
|3689076|  99705|  4.1131|
|3689076|  99705|  4.8109|
|3689076|  99705|  4.4445|
|3689076|  99705|  4.3774|
|3689076|  99705|  4.1568|
+-------+-------+--------+
You can follow a similar approach for your other sets of columns, like mean_1m, etc. One way to join them back together while avoiding duplicates is to add a row_number() ordered by monotonically_increasing_id() to each dataframe before joining. Let me know if you need that piece of code.
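For reference, here is a minimal sketch of that joining step. It assumes two already-unpivoted dataframes, called kurtosis_df and mean_df below (illustrative names, not from the question), that were built the same way and therefore have their rows in the same order:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# A window ordered by a synthetic, monotonically increasing id gives each row a stable number
w = Window.orderBy(F.monotonically_increasing_id())

kurtosis_rn = kurtosis_df.withColumn("rn", F.row_number().over(w))
mean_rn = mean_df.withColumn("rn", F.row_number().over(w))

# Join on the synthetic row number, then drop the helper column.
# Note: this only lines up correctly if both dataframes keep the same row order.
combined = (kurtosis_rn
            .join(mean_rn.select("rn", "mean"), on="rn")
            .drop("rn"))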

You can use array to combine multiple columns into one array and then - like you have already described in your question - explode the array.
from pyspark.sql import functions as F
cols = [x for x in df.schema.names if "kurtosis" in x]
df.withColumn("kurtosis", F.explode(F.array(cols))) \
.drop(*cols) \
.show()
Output:
+-------+-------+--------+
| SEQ_ID|TOOL_ID|kurtosis|
+-------+-------+--------+
|3688539|  99725|  6.7484|
|3688539|  99725|  6.2753|
|3688539|  99725|  6.2055|
|3688539|  99725|  7.2076|
|3688539|  99725|  7.0501|
|3688539|  99725|  7.5099|
|3689076|  99705|  4.8109|
|3689076|  99705|  4.3774|
|3689076|  99705|  4.1131|
|3689076|  99705|  4.4084|
|3689076|  99705|  4.1568|
|3689076|  99705|  4.4445|
+-------+-------+--------+
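Since the question also mentions mean_1m, mean_2m, etc., a possible extension of this idea (a sketch only, assuming those mean columns exist and Spark 2.4+ for arrays_zip) is to zip the two arrays position-wise and explode once, so each output row carries the matching kurtosis and mean values:
from pyspark.sql import functions as F

kurtosis_cols = [c for c in df.columns if "kurtosis" in c]
mean_cols = [c for c in df.columns if "mean" in c]  # hypothetical mean_1m ... mean_6m columns

result = (df
          # build one array per measurement family
          .withColumn("k_arr", F.array(*kurtosis_cols))
          .withColumn("m_arr", F.array(*mean_cols))
          # zip them position-wise and explode a single struct column
          .withColumn("zipped", F.explode(F.arrays_zip("k_arr", "m_arr")))
          .select("SEQ_ID", "TOOL_ID",
                  F.col("zipped.k_arr").alias("kurtosis"),
                  F.col("zipped.m_arr").alias("mean")))
result.show()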

Related

Average element wise List of Dense vectors in each row of a pyspark dataframe

I have a column in a pyspark dataframe that contains lists of DenseVectors. Different rows might have lists of different sizes, but every vector within a list has the same size. I want to calculate the element-wise average of each of those lists.
To be more concrete, let's say I have the following df:
|ID | Column |
| -------- | ------------------------------------------- |
| 0 | List(DenseVector(1,2,3), DenseVector(2,4,5))|
| 1 | List(DenseVector(1,2,3)) |
| 2 | List(DenseVector(2,2,3), DenseVector(2,4,5))|
What I would like to obtain is
|ID | Column |
| -------- | --------------------|
| 0 | DenseVector(1.5,3,4)|
| 1 | DenseVector(2,4,5) |
| 2 | DenseVector(2,3,4) |
Many thanks!
I don't think there is a direct pyspark function to do this. There is an ElementwiseProduct (which works differently from what is expected here) and other ML feature transformers, so you could try to achieve this with a UDF.
from pyspark.sql import functions as F
from pyspark.ml.linalg import Vectors, VectorUDT
def elementwise_avg(vector_list):
    # Sum each of the three components across the vectors, then divide by the count
    x = y = z = 0
    no_of_v = len(vector_list)
    for elem in vector_list:
        x += elem[0]
        y += elem[1]
        z += elem[2]
    return Vectors.dense(x / no_of_v, y / no_of_v, z / no_of_v)

elementwise_avg_udf = F.udf(elementwise_avg, VectorUDT())
df = df.withColumn("Elementwise Avg", elementwise_avg_udf("Column"))
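If the vectors are not always of length 3, a more general variant (my sketch, not part of the original answer) can do the element-wise average with numpy, assuming every vector in a list has the same length:
import numpy as np
from pyspark.sql import functions as F
from pyspark.ml.linalg import Vectors, VectorUDT

def elementwise_avg_any_length(vector_list):
    # Stack the vectors into a 2D numpy array and average over the rows;
    # assumes all vectors in the list have the same length.
    mat = np.array([v.toArray() for v in vector_list])
    return Vectors.dense(mat.mean(axis=0))

elementwise_avg_any_length_udf = F.udf(elementwise_avg_any_length, VectorUDT())
df = df.withColumn("Elementwise Avg", elementwise_avg_any_length_udf("Column"))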

How to split up a column with 2D array, based on index Spark Scala

My current data frame is this.
+-------------------------------------------------------------------------------------+
|scores |
+-------------------------------------------------------------------------------------+
|[[1000, 1234, 4.6789], [2000, 1234, 4.0], [3000, 1234, 3.6789], [4000, 1234, 2.6789]]|
+-------------------------------------------------------------------------------------+
I want to convert it to the one below, where the columns are separated by their index in the 2D scores array.
+---------------------+---------------------+---------------------------+
|score1               |score2               |score3                     |
+---------------------+---------------------+---------------------------+
|[1000,2000,3000,4000]|[1234,1234,1234,1234]|[4.6789,4.0,3.6789,2.6789] |
+---------------------+---------------------+---------------------------+
I have broken down the required steps below.
First of all, I recreated your data as follows:
val scores = spark.read.json(Seq("""{"scores": [[1000, 1234, 4.6789], [2000, 1234, 4.0], [3000, 1234, 3.6789], [4000, 1234, 2.6789]]}""").toDS)
scores.select(explode($"scores").alias("scores")).show(false)
+------------------------+
|scores |
+------------------------+
|[1000.0, 1234.0, 4.6789]|
|[2000.0, 1234.0, 4.0] |
|[3000.0, 1234.0, 3.6789]|
|[4000.0, 1234.0, 2.6789]|
+------------------------+
The next step splits each element in each array into its own column:
val split = scores.select(explode($"scores")).select((0 until 3).map(i => col("col")(i).alias(s"col$i")): _*)
split.show
+------+------+------+
| col0| col1| col2|
+------+------+------+
|1000.0|1234.0|4.6789|
|2000.0|1234.0| 4.0|
|3000.0|1234.0|3.6789|
|4000.0|1234.0|2.6789|
+------+------+------+
Then, we collect the rows in each column into a sequence
val res = (0 to 2).map(i => split.select(s"col$i").collect.map(_.getDouble(0))).toDS
res.show
+--------------------+
| value|
+--------------------+
|[1000.0, 2000.0, ...|
|[1234.0, 1234.0, ...|
|[4.6789, 4.0, 3.6...|
+--------------------+
Finally, we transpose the data into the format you requested
val scoresFinal = res.agg(collect_list("value").alias("result")).select((0 until 3).map(i => col("result")(i).alias(s"score${i+1}")): _*)
scoresFinal.show
+--------------------+--------------------+--------------------+
| score1| score2| score3|
+--------------------+--------------------+--------------------+
|[1000.0, 2000.0, ...|[1234.0, 1234.0, ...|[4.6789, 4.0, 3.6...|
+--------------------+--------------------+--------------------+

Which is the best way to find element in array column in spark scala?

I have an array column in which I search for text and then build a dataframe. Which is the better way among the two options below?
Option 1
val texts = Seq("text1", "text2", "text3")
val df = mainDf.select(col("*"))
.withColumn("temptext", explode($"textCol"))
.where($"temptext".isin(texts: _*))
And since it has added an extra column "temptext" and increased duplicate rows by exploding:
val tempDf = df.drop("temptext").dropDuplicates("Root.Id") // dropDuplicates does not work since I have passed a nested field
vs
Option 2
val df = mainDf.select(col("*"))
.where(array_contains($"textCol", "text1") ||
array_contains($"textCol", "text2") ||
array_contains($"textCol", "text3"))
Actually I want to make a generic API. If I go with option 2,
the problem is that for every new text I need to add another array_contains($"textCol", "text4") call and create a new API every time,
while with option 1 it creates duplicate rows, since I explode the array, and I also need to drop the temporary column.
Use the arrays_overlap (or array_intersect) function so that you can pass an array of strings instead of chaining array_contains calls.
Example:
1. Filter based on the texts variable:
val df=Seq((Seq("text1")),(Seq("text4","text1")),(Seq("text5"))).
toDF("textCol")
df.show()
//+--------------+
//| textCol|
//+--------------+
//| [text1]|
//|[text4, text1]|
//| [text5]|
//+--------------+
val texts = Array("text1","text2","text3")
//using arrays_overlap
df.filter(arrays_overlap(col("textcol"),lit(texts))).show(false)
//+--------------+
//|textCol |
//+--------------+
//|[text1] |
//|[text4, text1]|
//+--------------+
//using array_intersect
df.filter(size(array_intersect(col("textcol"),lit(texts))) > 0).show(false)
//+--------------+
//|textCol |
//+--------------+
//|[text1] |
//|[text4, text1]|
//+--------------+
2. Adding the texts variable to the dataframe:
val texts = "text1,text2,text3"
val df=Seq((Seq("text1")),(Seq("text4","text1")),(Seq("text5"))).
toDF("textCol").
withColumn("texts",split(lit(s"${texts}"),","))
df.show(false)
//+--------------+---------------------+
//|textCol |texts |
//+--------------+---------------------+
//|[text1] |[text1, text2, text3]|
//|[text4, text1]|[text1, text2, text3]|
//|[text5] |[text1, text2, text3]|
//+--------------+---------------------+
//using array_intersect
df.filter("""size(array_intersect(textcol,texts)) > 0""").show(false)
//+--------------+---------------------+
//|textCol |texts |
//+--------------+---------------------+
//|[text1] |[text1, text2, text3]|
//|[text4, text1]|[text1, text2, text3]|
//+--------------+---------------------+
//using arrays_overlap
df.filter("""arrays_overlap(textcol,texts)""").show(false)
//+--------------+---------------------+
//|textCol       |texts                |
//+--------------+---------------------+
//|[text1]       |[text1, text2, text3]|
//|[text4, text1]|[text1, text2, text3]|
//+--------------+---------------------+

How to split dataframe based on a range of values in a column and store them in separate files?

I have a dataframe created in spark after reading a table from postgres as below.
import java.util.Properties

val url = "jdbc:postgresql://localhost:5432/testdb"
val connectionProperties = new Properties()
connectionProperties.setProperty("driver", "org.postgresql.Driver")
connectionProperties.setProperty("user", "testDB")
connectionProperties.setProperty("password", "123456")
val query = "(select * from testdb.datatable) as datatable"
val dataDF = spark.read.jdbc(url, query, connectionProperties)
I can see the count of data from the dataframe:
scala> dataDF.count
count: 3907891
sample output:
scala> dataDF.take(5)
|-----------|----|--------|
|source_name| id |location|
|-----------|----|--------|
| DB2       | 10 |Hive    |
| SAP       | 20 |Hive    |
| SQL Server| 17 |Hive    |
| Oracle    | 21 |Hive    |
| DB2       | 33 |Hive    |
|-----------|----|--------|
The dataframe contains a column "ID" of type Integer whose values range from 10 to 50.
Is there any way I can split the dataframe into 4 different partitions and write each partition to a file based on the ID column, so that each file contains IDs in a given range: file1: 10-20, file2: 21-30, file3: 31-40, file4: 41-50?
If you know the range of the ids, I would go with something simple.
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.when
import spark.implicits._

val data = Seq(
  ("db2", 10, "Hive"),
  ("sap", 20, "Hive"),
  ("sql", 17, "Hive"),
  ("oracle", 21, "Hive"),
  ("server", 33, "Hive"),
  ("risk", 43, "Hive")
).toDF("source_name", "id", "location")

val bucketed = data.withColumn("bucket",
  when($"id".between(0, 10), "1-10")
    .when($"id".between(11, 20), "11-20")
    .when($"id".between(21, 30), "21-30")
    .when($"id".between(31, 40), "31-40")
    .when($"id".between(41, 50), "41-50")
    .otherwise("50+"))

bucketed.write.option("header", true)
  .mode(SaveMode.Overwrite)
  .partitionBy("bucket")
  .csv("bucketing")

Apache Spark update a row in an RDD or Dataset based on another row

I'm trying to figure out how I can update some rows based on another row.
For example, I have some data like
Id | username | ratings | city
--------------------------------
1, philip, 2.0, montreal, ...
2, john, 4.0, montreal, ...
3, charles, 2.0, texas, ...
I want to update the users in the same city to the same groupId (either 1 or 2)
Id | username | ratings | city
--------------------------------
1, philip, 2.0, montreal, ...
1, john, 4.0, montreal, ...
3, charles, 2.0, texas, ...
How can I achieve this in my RDD or Dataset ?
So, just for the sake of completeness, what if the Id is a String? The dense rank won't work then, will it?
For example:
Id | username | ratings | city
--------------------------------
a, philip, 2.0, montreal, ...
b, john, 4.0, montreal, ...
c, charles, 2.0, texas, ...
So the result looks like this:
grade | username | ratings | city
--------------------------------
a, philip, 2.0, montreal, ...
a, john, 4.0, montreal, ...
c, charles, 2.0, texas, ...
A clean way to do this would be to use dense_rank() from Window functions. It enumerates the unique values in your Window column. Because city is a String column, these will be increasing alphabetically.
import org.apache.spark.sql.functions.dense_rank
import org.apache.spark.sql.expressions.Window

val df = spark.createDataFrame(Seq(
  (1, "philip", 2.0, "montreal"),
  (2, "john", 4.0, "montreal"),
  (3, "charles", 2.0, "texas"))).toDF("Id", "username", "rating", "city")

val w = Window.orderBy($"city")
df.withColumn("id", dense_rank().over(w)).show()
+---+--------+------+--------+
| id|username|rating| city|
+---+--------+------+--------+
| 1| philip| 2.0|montreal|
| 1| john| 4.0|montreal|
| 2| charles| 2.0| texas|
+---+--------+------+--------+
Try:
import org.apache.spark.sql.functions.monotonically_increasing_id

df.select("city").distinct
  .withColumn("id", monotonically_increasing_id())
  .join(df.drop("id"), Seq("city"))