How to convert rows into columns in a Spark DataFrame (Scala)

Is there any way to transpose DataFrame rows into columns?
I have the following structure as input:
val inputDF = Seq(("pid1","enc1", "bat"),
("pid1","enc2", ""),
("pid1","enc3", ""),
("pid3","enc1", "cat"),
("pid3","enc2", "")
).toDF("MemberID", "EncounterID", "entry" )
inputDF.show:
+--------+-----------+-----+
|MemberID|EncounterID|entry|
+--------+-----------+-----+
| pid1| enc1| bat|
| pid1| enc2| |
| pid1| enc3| |
| pid3| enc1| cat|
| pid3| enc2| |
+--------+-----------+-----+
expected result:
+--------+----------+----------+----------+-----+
|MemberID|Encounter1|Encounter2|Encounter3|entry|
+--------+----------+----------+----------+-----+
| pid1| enc1| enc2| enc3| bat|
| pid3| enc1| enc2| null| cat|
+--------+----------+----------+----------+-----+
Please suggest if there is any optimized, direct API available for transposing rows into columns.
My input data size is quite huge, so I won't be able to perform actions like collect, as that would bring all the data onto the driver.
I am using Spark 2.x.

I am not sure that what you need is what you actually asked for. Still, just in case, here is an idea:
val entries = inputDF
  .where('entry isNotNull)
  .where('entry =!= "")
  .select("MemberID", "entry").distinct
val df = inputDF
  .groupBy("MemberID")
  .agg(collect_list("EncounterID") as "encounterList")
  .join(entries, Seq("MemberID"))
df.show
+--------+-------------------------+-----+
|MemberID| encounterList |entry|
+--------+-------------------------+-----+
| pid1| [enc2, enc1, enc3]| bat|
| pid3| [enc2, enc1]| cat|
+--------+-------------------------+-----+
The order of the list is not deterministic, but you may sort it and then extract new columns from it with .withColumn("Encounter1", sort_array($"encounterList")(0)), and so on.
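For example, here is a minimal sketch of that extraction (assuming at most three encounters per member; indexing past the end of the array simply yields null), which reproduces the expected result from the question:
// assumes import spark.implicits._, as in the snippets above
import org.apache.spark.sql.functions.sort_array
val result = df
  .withColumn("sortedList", sort_array($"encounterList"))
  .withColumn("Encounter1", $"sortedList"(0))   // out-of-range indices come back as null
  .withColumn("Encounter2", $"sortedList"(1))
  .withColumn("Encounter3", $"sortedList"(2))
  .select("MemberID", "Encounter1", "Encounter2", "Encounter3", "entry")
result.show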
Other idea
In case what you want is to put the value of entry in the corresponding "Encounter" column, you can use a pivot:
inputDF
.groupBy("MemberID")
.pivot("EncounterID", Seq("enc1", "enc2", "enc3"))
.agg(first("entry")).show
+--------+----+----+----+
|MemberID|enc1|enc2|enc3|
+--------+----+----+----+
| pid1| bat| | |
| pid3| cat| | |
+--------+----+----+----+
Adding Seq("enc1", "enc2", "enc3") is optional, but since you already know the content of the column, it will speed up the computation.
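For comparison, here is a sketch of the same pivot without the explicit value list; Spark then has to run an extra job up front just to collect the distinct EncounterID values:
// same pivot, but Spark first computes the distinct EncounterID values itself
inputDF
  .groupBy("MemberID")
  .pivot("EncounterID")
  .agg(first("entry")).show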

Related

assign values to a new column depending on old column values in dataframe

I have assigned values to 4 variables in a conf or application.properties file,
A = 1
B = 2
C = 3
D = 4
I have a dataframe as follows,
+-----+
|name |
+-----+
| A |
| C |
| B |
| D |
| B |
+-----+
I want to add a new column that has the values assigned from the conf variables declared above for A,B,C,D respectively depending on the value in the name column.
Final Dataframe should have,
+----+----------+
|name|NAME_VALUE|
+----+----------+
| A | 1 |
| C | 3 |
| B | 2 |
| D | 4 |
| B | 2 |
+----+----------+
I tried the lit function in .withColumn with conf.getint($name), but lit does not accept a Column; it requires a string literal, so I have to hardcode the variable names in lit. Is there any way for me to dynamically pass those conf variable names to lit so that it can automatically assign the values to another column in Spark Scala?
At the moment I don't have any idea how to do it the way you intended, with dynamic use of val names.
My proposal is to use a Seq of tuples instead of multiple vals. In that case you could create a UDF and map the value for each row, but you can also use a join, which I show in the example below:
import org.apache.spark.sql.functions.broadcast

val data = Seq("A", "C", "B", "D", "B")
val df = data.toDF("name")
val mappings = Seq(("A", 1), ("B", 2), ("C", 3), ("D", 4))
val mappingsDf = mappings.toDF("name", "value")

df.join(broadcast(mappingsDf), df("name") === mappingsDf("name"), "left")
  .select(
    df("name"),
    mappingsDf("value")
  ).show
output is as expected:
+----+-----+
|name|value|
+----+-----+
| A| 1|
| C| 3|
| B| 2|
| D| 4|
| B| 2|
+----+-----+
This solution is pretty generic: since your mappings are a DataFrame here, you can hardcode them as shown in my example or easily load them from a CSV or JSON file with the Spark API.
Thanks to the broadcast join it should be quite efficient (you should remove this hint if you want to use a large number of mappings!).
I think it is easy to understand and maintain, as it uses only the Spark API and no UDF.
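For completeness, here is a rough sketch of the UDF variant mentioned above, assuming the mappings fit comfortably in a driver-side Map (the map literal here just stands in for whatever you read from your conf file):
import org.apache.spark.sql.functions.udf
// hypothetical map built from the conf values A=1, B=2, C=3, D=4
val nameValues = Map("A" -> 1, "B" -> 2, "C" -> 3, "D" -> 4)
// returning Option[Int] makes unknown names come back as null instead of failing
val lookup = udf((name: String) => nameValues.get(name))
df.withColumn("NAME_VALUE", lookup(df("name"))).show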

transform distinct row values to different columns with corresponding rows using Pyspark

I'm new to PySpark and I'm trying to transform some data.
Given dataframe:
Col1
A=id1a A=id2a B=id1b C=id1c B=id2b
D=id1d A=id3a B=id3b C=id2c
A=id4a C=id3c
Required:
A B C
id1a id1b id1c
id2a id2b id2c
id3a id3b id3c
id4a null null
I have tried pivot, but that gives only the first value.
There might be a better way; however, one approach is to split the column on spaces to create an array of the entries, and then use higher-order functions (Spark 2.4+) to split each entry of that array on '='. Then explode and create two columns, one with the key and one with the value. Then we can assign a row number within each partition, group by it and pivot:
import pyspark.sql.functions as F
df1 = (df.withColumn("Col1", F.split(F.col("Col1"), r"\s+"))
         .withColumn("Col1", F.explode(F.expr("transform(Col1, x -> split(x, '='))")))
         .select(F.col("Col1")[0].alias("cols"), F.col("Col1")[1].alias("vals")))
from pyspark.sql import Window
w = Window.partitionBy("cols").orderBy("cols")
final = (df1.withColumn("Rnum",F.row_number().over(w)).groupBy("Rnum")
.pivot("cols").agg(F.first("vals")).orderBy("Rnum"))
final.show()
+----+----+----+----+----+
|Rnum| A| B| C| D|
+----+----+----+----+----+
| 1|id1a|id1b|id1c|id1d|
| 2|id2a|id2b|id2c|null|
| 3|id3a|id3b|id3c|null|
| 4|id4a|null|null|null|
+----+----+----+----+----+
This is how df1 looks after the transformation:
df1.show()
+----+----+
|cols|vals|
+----+----+
| A|id1a|
| A|id2a|
| B|id1b|
| C|id1c|
| B|id2b|
| D|id1d|
| A|id3a|
| B|id3b|
| C|id2c|
| A|id4a|
| C|id3c|
+----+----+
Maybe I don't know the full picture, but the data format seems strange. If nothing can be done at the data source, then some collects, pivots and joins will be needed. Try this:
import pyspark.sql.functions as F
from functools import reduce
test = sqlContext.createDataFrame([('A=id1a A=id2a B=id1b C=id1c B=id2b',1),('D=id1d A=id3a B=id3b C=id2c',2),('A=id4a C=id3c',3)],schema=['col1','id'])
tst_spl = test.withColumn("item",(F.split('col1'," ")))
tst_xpl = tst_spl.select(F.explode("item"))
tst_map = tst_xpl.withColumn("key",F.split('col','=')[0]).withColumn("value",F.split('col','=')[1]).drop('col')
#%%
tst_pivot = tst_map.groupby(F.lit(1)).pivot('key').agg(F.collect_list(('value'))).drop('1')
#%%
tst_arr = [tst_pivot.select(F.posexplode(coln)).withColumnRenamed('col',coln) for coln in tst_pivot.columns]
tst_fin = reduce(lambda df1,df2:df1.join(df2,on='pos',how='full'),tst_arr).orderBy('pos')
tst_fin.show()
+---+----+----+----+----+
|pos| A| B| C| D|
+---+----+----+----+----+
| 0|id3a|id3b|id1c|id1d|
| 1|id4a|id1b|id2c|null|
| 2|id1a|id2b|id3c|null|
| 3|id2a|null|null|null|
+---+----+----+----+----+

How to check whether a whole column in a pyspark dataframe contains a value using expr

In PySpark, how can I use expr to check whether a whole column contains the value in columnA of that row?
pseudo code below
df=df.withColumn("Result", expr(if any the rows in column1 contains the value colA (for this row) then 1 else 0))
Take an arbitrary example:
valuesCol = [('rose','rose is red'),('jasmine','I never saw Jasmine'),('lily','Lili dont be silly'),('daffodil','what a flower')]
df = sqlContext.createDataFrame(valuesCol,['columnA','columnB'])
df.show()
+--------+-------------------+
| columnA| columnB|
+--------+-------------------+
| rose| rose is red|
| jasmine|I never saw Jasmine|
| lily| Lili dont be silly|
|daffodil| what a flower|
+--------+-------------------+
This is an application of expr(). As for how to use expr(): just look up the corresponding SQL syntax, and it will mostly work inside expr() as-is.
df = df.withColumn('columnA_exists',expr("(case when instr(lower(columnB), lower(columnA))>=1 then 1 else 0 end)"))
df.show()
+--------+-------------------+--------------+
| columnA| columnB|columnA_exists|
+--------+-------------------+--------------+
| rose| rose is red| 1|
| jasmine|I never saw Jasmine| 1|
| lily| Lili dont be silly| 0|
|daffodil| what a flower| 0|
+--------+-------------------+--------------+

Randomly join two dataframes

I have two tables, one called Reasons that has 9 records and another containing IDs with 40k records.
IDs:
+------+------+
|pc_pid|pc_aid|
+------+------+
| 4569| 1101|
| 63961| 1101|
|140677| 4364|
|127113| 7|
| 96097| 480|
| 8309| 3129|
| 45218| 89|
|147036| 3289|
| 88493| 3669|
| 29973| 3129|
|127444| 3129|
| 36095| 89|
|131001| 1634|
|104731| 781|
| 79219| 244|
+------+------+
Reasons:
+-----------------+
| reasons|
+-----------------+
| follow up|
| skin chk|
| annual meet|
|review lab result|
| REF BY DR|
| sick visit|
| body pain|
| test|
| other|
+-----------------+
I want output like this
+------+------+-----------------+
|pc_pid|pc_aid|           reason|
+------+------+-----------------+
|  4569|  1101|        body pain|
| 63961|  1101|review lab result|
|140677|  4364|        body pain|
|127113|     7|       sick visit|
| 96097|   480|             test|
|  8309|  3129|            other|
| 45218|    89|        follow up|
|147036|  3289|      annual meet|
| 88493|  3669|review lab result|
| 29973|  3129|        REF BY DR|
|127444|  3129|         skin chk|
| 36095|    89|            other|
+------+------+-----------------+
In Reasons I have only 9 records and in the IDs dataframe I have 40k records; I want to assign a reason randomly to each and every ID.
The following solution tries to be more robust to the number of reasons (i.e. you can have as many reasons as you can reasonably fit in your cluster). If you just have a few reasons (as the OP asks), you can probably broadcast them or embed them in a UDF and easily solve this problem (a rough sketch of that simpler variant appears at the end of this answer).
The general idea is to create a sequential index for the reasons, generate random values from 0 to N (where N is the number of reasons) on the IDs dataset, and then join the two tables on these two new columns. Here is how you can do this:
case class Reasons(s: String)
defined class Reasons
case class Data(id: Long)
defined class Data
Data will hold the IDs (simplified version of the OP) and Reasons will hold some simplified reasons.
val d1 = spark.createDataFrame( Data(1) :: Data(2) :: Data(10) :: Nil)
d1: org.apache.spark.sql.DataFrame = [id: bigint]
d1.show()
+---+
| id|
+---+
| 1|
| 2|
| 10|
+---+
val d2 = spark.createDataFrame( Reasons("a") :: Reasons("b") :: Reasons("c") :: Nil)
d2.show()
+---+
| s|
+---+
| a|
| b|
| c|
+---+
We will later need the number of reasons so we calculate that first.
val numberOfReasons = d2.count()
val d2Indexed = spark.createDataFrame(d2.rdd.map(_.getString(0)).zipWithIndex)
d2Indexed.show()
+---+---+
| _1| _2|
+---+---+
| a| 0|
| b| 1|
| c| 2|
+---+---+
val d1WithRand = d1.select($"id", (rand * numberOfReasons).cast("int").as("rnd"))
The last step is to join on the new columns and then remove them.
val res = d1WithRand.join(d2Indexed, d1WithRand("rnd") === d2Indexed("_2")).drop("_2").drop("rnd")
res.show()
+---+---+
| id| _1|
+---+---+
| 2| a|
| 10| b|
| 1| c|
+---+---+
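As mentioned at the top of this answer, if you only have a handful of reasons you can skip the join entirely and pick a reason inside a UDF. Here is a rough sketch (assuming the reason list fits on the driver; note the UDF is non-deterministic, so the assignment changes between runs):
import scala.util.Random
import org.apache.spark.sql.functions.udf
// abbreviated reason list; in practice collect it from the Reasons table
val reasonList = Seq("follow up", "skin chk", "annual meet")
val pickReason = udf(() => reasonList(Random.nextInt(reasonList.length)))
d1.withColumn("reason", pickReason()).show()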
pyspark random join itself
from random import uniform

# shuffle one copy randomly, then zip it 1:1 with the original
data_neg = data_pos.sortBy(lambda x: uniform(1, 10000))
data_neg = data_neg.coalesce(1, False).zip(data_pos.coalesce(1, True))
The fastest way to randomly join dataA (huge dataframe) and dataB (smaller dataframe, sorted by any column):
from pyspark.sql import Window
import pyspark.sql.functions as F

dfB = dataB.withColumn(
    "index", F.row_number().over(Window.orderBy("col")) - 1
)
dfA = dataA.withColumn("index", (F.rand() * dfB.count()).cast("bigint"))
df = dfA.join(dfB, on="index", how="left").drop("index")
Since dataB is already sorted, row numbers can be assigned over the sorted window with a high degree of parallelism. F.rand() is another highly parallel function, so adding an index to dataA will be very fast as well.
If dataB is small enough, you may benefit from broadcasting it.
This method is better than using:
zipWithIndex: it can be very expensive to convert the DataFrame to an RDD, call zipWithIndex, and then convert back to a DataFrame.
monotonically_increasing_id: it needs to be combined with row_number, which will collect all the partitions onto a single executor.
Reference: https://towardsdatascience.com/adding-sequential-ids-to-a-spark-dataframe-fa0df5566ff6

PySpark DataFrame Manipulation Efficiency

Suppose I have the following data frame :
+----------+-----+----+-------+
|display_id|ad_id|prob|clicked|
+----------+-----+----+-------+
| 123| 989| 0.9| 0|
| 123| 990| 0.8| 1|
| 123| 999| 0.7| 0|
| 234| 789| 0.9| 0|
| 234| 777| 0.7| 0|
| 234| 769| 0.6| 1|
| 234| 798| 0.5| 0|
+----------+-----+----+-------+
I then perform the following operations to get a final data set (shown below the code) :
# Add a new column with the clicked ad_id if clicked == 1, 0 otherwise
df_adClicked = df.withColumn("ad_id_clicked", when(df.clicked==1, df.ad_id).otherwise(0))
# DF -> RDD with tuple : (display_id, (ad_id, prob), clicked)
df_blah = df_adClicked.rdd.map(lambda x : (x[0],(x[1],x[2]),x[4])).toDF(["display_id", "ad_id","clicked_ad_id"])
# Group by display_id and create column with clicked ad_id and list of tuples : (ad_id, prob)
df_blah2 = df_blah.groupby('display_id').agg(F.collect_list('ad_id'), F.max('clicked_ad_id'))
# Define function to sort list of tuples by prob and create list of only ad_ids
def sortByRank(ad_id_list):
sortedVersion = sorted(ad_id_list, key=itemgetter(1), reverse=True)
sortedIds = [i[0] for i in sortedVersion]
return(sortedIds)
# Sort the (ad_id, prob) tuples by using udf/function and create new column ad_id_sorted
sort_ad_id = udf(lambda x : sortByRank(x), ArrayType(IntegerType()))
df_blah3 = df_blah2.withColumn('ad_id_sorted', sort_ad_id('collect_list(ad_id)'))
# Function to change clickedAdId into an array of size 1
def createClickedSet(clickedAdId):
setOfDocs = [clickedAdId]
return setOfDocs
clicked_set = udf(lambda y : createClickedSet(y), ArrayType(IntegerType()))
df_blah4 = df_blah3.withColumn('ad_id_set', clicked_set('max(clicked_ad_id)'))
# Select the necessary columns
finalDF = df_blah4.select('display_id', 'ad_id_sorted','ad_id_set')
+----------+--------------------+---------+
|display_id|ad_id_sorted |ad_id_set|
+----------+--------------------+---------+
|234 |[789, 777, 769, 798]|[769] |
|123 |[989, 990, 999] |[990] |
+----------+--------------------+---------+
Is there a more efficient way of doing this? Doing this set of transformations the way I am doing them seems to be the bottleneck in my code. I would greatly appreciate any feedback.
I haven't done any timing comparisons, but I would think that by not using any UDFs, Spark should be able to optimize the plan better on its own.
#scala: val dfad = sc.parallelize(Seq((123,989,0.9,0),(123,990,0.8,1),(123,999,0.7,0),(234,789,0.9,0),(234,777,0.7,0),(234,769,0.6,1),(234,798,0.5,0))).toDF("display_id","ad_id","prob","clicked")
#^^^that's^^^ the only difference (besides putting val in front of variables) between this python response and a Scala one
dfad = sc.parallelize(((123,989,0.9,0),(123,990,0.8,1),(123,999,0.7,0),(234,789,0.9,0),(234,777,0.7,0),(234,769,0.6,1),(234,798,0.5,0))).toDF(["display_id","ad_id","prob","clicked"])
dfad.registerTempTable("df_ad")
df1 = sqlContext.sql("SELECT display_id,collect_list(ad_id) ad_id_sorted FROM (SELECT * FROM df_ad SORT BY display_id,prob DESC) x GROUP BY display_id")
+----------+--------------------+
|display_id| ad_id_sorted|
+----------+--------------------+
| 234|[789, 777, 769, 798]|
| 123| [989, 990, 999]|
+----------+--------------------+
df2 = sqlContext.sql("SELECT display_id, max(ad_id) as ad_id_set from df_ad where clicked=1 group by display_id")
+----------+---------+
|display_id|ad_id_set|
+----------+---------+
| 234| 769|
| 123| 990|
+----------+---------+
final_df = df1.join(df2,"display_id")
+----------+--------------------+---------+
|display_id| ad_id_sorted|ad_id_set|
+----------+--------------------+---------+
| 234|[789, 777, 769, 798]| 769|
| 123| [989, 990, 999]| 990|
+----------+--------------------+---------+
I didn't put ad_id_set into an array because you were calculating the max, and max should return only one value. I'm sure if you really need it in an array you can make that happen.
I included the subtle Scala difference in case someone using Scala runs into a similar problem in the future.
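If you really do need ad_id_set as a one-element array, one hedged sketch (written against the Scala variant above, and assuming df_ad has been registered as a temp table in that session as well) is to wrap the max in the SQL array function:
// sketch only: keep ad_id_set as a single-element array
val df2Arr = sqlContext.sql(
  "SELECT display_id, array(max(ad_id)) AS ad_id_set FROM df_ad WHERE clicked = 1 GROUP BY display_id")
df2Arr.show()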