The following code is used to extract ranks from the column products. The ranks are the second numbers in each pair [...]. For example, given [[222,66],[333,55]], the ranks are 66 and 55 for the products with PK 222 and 333, respectively.
But in Spark 2.2 this code runs very slowly when df_products is around 800 MB:
df_products.createOrReplaceTempView("df_products")
val result = df.as("df2")
.join(spark.sql("SELECT * FROM df_products")
.select($"product_PK", explode($"products").as("products"))
.withColumnRenamed("product_PK", "product_PK_temp").as("df1"),
$"df2.product_PK" === $"df1.product_PK_temp" and $"df2.rec_product_PK" === $"df1.products.product_PK",
"left")
.drop($"df1.product_PK_temp")
.select($"product_PK", $"rec_product_PK", coalesce($"df1.products.col2", lit(0.0)).as("rank_product"))
This is a small sample of df_products and df:
df_products =
+----------+--------------------+
|product_PK|            products|
+----------+--------------------+
|       111|[[222,66],[333,55...|
|       222|[[333,24],[444,77...|
...
+----------+--------------------+
df =
+----------+-----------------+
|product_PK|   rec_product_PK|
+----------+-----------------+
|       111|              222|
|       222|              888|
+----------+-----------------+
The above code works well when the arrays in each row of products contain a small number of elements. But when each row's array [[..],[..],...] contains many elements, the code seems to get stuck and does not advance.
How can I optimize this code? Any help is highly appreciated.
Is it possible, for example, to transform df_products into the following DataFrame before joining?
df_products =
+----------+--------------------+------+
|product_PK|      rec_product_PK|  rank|
+----------+--------------------+------+
|       111|                 222|    66|
|       111|                 333|    55|
|       222|                 333|    24|
|       222|                 444|    77|
...
+----------+--------------------+------+
As per my answer here, you can transform df_products using something like this:
import org.apache.spark.sql.functions.explode
val df1 = df_products.withColumn("array_elem", explode(df_products("products")))
val df2 = df1.select("product_PK", "array_elem.*")
This assumes products is an array of structs. If products is an array of arrays, you can use the following instead:
val df2 = df1.select($"product_PK", $"array_elem"(0).as("rec_product_PK"), $"array_elem"(1).as("rank"))
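Either way, once df2 has the flat columns (product_PK, rec_product_PK, rank), the original lookup reduces to a plain equi-join instead of exploding inside the join. A minimal sketch, assuming those column names:
import org.apache.spark.sql.functions.{coalesce, lit}

// Left-join on both keys, then default missing ranks to 0.0
val result = df
  .join(df2, Seq("product_PK", "rec_product_PK"), "left")
  .select($"product_PK", $"rec_product_PK", coalesce($"rank", lit(0.0)).as("rank_product"))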
I'm new to PySpark and am trying to transform some data.
Given this dataframe:
Col1
A=id1a A=id2a B=id1b C=id1c B=id2b
D=id1d A=id3a B=id3b C=id2c
A=id4a C=id3c
Required:
A    B    C
id1a id1b id1c
id2a id2b id2c
id3a id3b id3c
id4a null null
I have tried pivot, but that only gives the first value.
There might be a better way; however, one approach is to split the column on whitespace to create an array of the entries, and then use higher-order functions (Spark 2.4+) to split each entry in that array on '='. Then explode and create two columns, one with the key and one with the value. Then we can assign a row number within each partition, group by it, and pivot:
import pyspark.sql.functions as F
df1 = (df.withColumn("Col1", F.split(F.col("Col1"), r"\s+"))
         .withColumn("Col1", F.explode(F.expr("transform(Col1, x -> split(x, '='))")))
         .select(F.col("Col1")[0].alias("cols"), F.col("Col1")[1].alias("vals")))
from pyspark.sql import Window
w = Window.partitionBy("cols").orderBy("cols")
final = (df1.withColumn("Rnum",F.row_number().over(w)).groupBy("Rnum")
.pivot("cols").agg(F.first("vals")).orderBy("Rnum"))
final.show()
+----+----+----+----+----+
|Rnum|   A|   B|   C|   D|
+----+----+----+----+----+
|   1|id1a|id1b|id1c|id1d|
|   2|id2a|id2b|id2c|null|
|   3|id3a|id3b|id3c|null|
|   4|id4a|null|null|null|
+----+----+----+----+----+
This is how df1 looks after the transformation:
df1.show()
+----+----+
|cols|vals|
+----+----+
|   A|id1a|
|   A|id2a|
|   B|id1b|
|   C|id1c|
|   B|id2b|
|   D|id1d|
|   A|id3a|
|   B|id3b|
|   C|id2c|
|   A|id4a|
|   C|id3c|
+----+----+
Maybe I don't know the full picture, but the data format seems strange. If nothing can be done at the data source, then some collects, pivots, and joins will be needed. Try this:
import pyspark.sql.functions as F
from functools import reduce
test = sqlContext.createDataFrame([('A=id1a A=id2a B=id1b C=id1c B=id2b',1),('D=id1d A=id3a B=id3b C=id2c',2),('A=id4a C=id3c',3)],schema=['col1','id'])
tst_spl = test.withColumn("item",(F.split('col1'," ")))
tst_xpl = tst_spl.select(F.explode("item"))
tst_map = tst_xpl.withColumn("key",F.split('col','=')[0]).withColumn("value",F.split('col','=')[1]).drop('col')
#%%
tst_pivot = tst_map.groupby(F.lit(1)).pivot('key').agg(F.collect_list(('value'))).drop('1')
#%%
tst_arr = [tst_pivot.select(F.posexplode(coln)).withColumnRenamed('col',coln) for coln in tst_pivot.columns]
tst_fin = reduce(lambda df1,df2:df1.join(df2,on='pos',how='full'),tst_arr).orderBy('pos')
tst_fin.show()
+---+----+----+----+----+
|pos|   A|   B|   C|   D|
+---+----+----+----+----+
|  0|id3a|id3b|id1c|id1d|
|  1|id4a|id1b|id2c|null|
|  2|id1a|id2b|id3c|null|
|  3|id2a|null|null|null|
+---+----+----+----+----+
This question already has an answer here:
How to transform DataFrame before joining operation?
Given a DataFrame like this:
df_products =
+----------+--------------------+
|product_PK|            products|
+----------+--------------------+
|       111|[[222,66],[333,55...|
|       222|[[333,24],[444,77...|
...
+----------+--------------------+
how can I transform it into the following DataFrame:
df_products =
+----------+--------------------+------+
|product_PK|      rec_product_PK|  rank|
+----------+--------------------+------+
|       111|                 222|    66|
|       111|                 333|    55|
|       222|                 333|    24|
|       222|                 444|    77|
...
+----------+--------------------+------+
You basically have two steps here: first, explode the arrays (using the explode function) to get a row for each value in the array; then flatten each element into columns.
You did not provide the schema, so the internal structure of each element in the array is not clear; however, I would assume it is something like a struct with two elements.
This means you would do something like this:
import org.apache.spark.sql.functions.explode
val df1 = df_products.withColumn("array_elem", explode(df_products("products")))
val df2 = df1.select("product_PK", "array_elem.*")
Now all you have to do is rename the columns to the names you need.
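For example, a minimal sketch of that rename step, assuming the struct fields came out as col1 and col2 (hypothetical names; check df2.printSchema() for the real ones):
// Hypothetical field names col1/col2; adjust to your actual struct schema
val result = df2
  .withColumnRenamed("col1", "rec_product_PK")
  .withColumnRenamed("col2", "rank")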
Is there any way to transpose DataFrame rows into columns?
I have the following structure as input:
val inputDF = Seq(("pid1","enc1", "bat"),
("pid1","enc2", ""),
("pid1","enc3", ""),
("pid3","enc1", "cat"),
("pid3","enc2", "")
).toDF("MemberID", "EncounterID", "entry" )
inputDF.show:
+--------+-----------+-----+
|MemberID|EncounterID|entry|
+--------+-----------+-----+
|    pid1|       enc1|  bat|
|    pid1|       enc2|     |
|    pid1|       enc3|     |
|    pid3|       enc1|  cat|
|    pid3|       enc2|     |
+--------+-----------+-----+
expected result:
+--------+----------+----------+----------+-----+
|MemberID|Encounter1|Encounter2|Encounter3|entry|
+--------+----------+----------+----------+-----+
|    pid1|      enc1|      enc2|      enc3|  bat|
|    pid3|      enc1|      enc2|      null|  cat|
+--------+----------+----------+----------+-----+
Please suggest if there is any optimized direct API available for transposing rows into columns.
My input data is quite large, so I won't be able to perform actions like collect, as they would pull all the data onto the driver.
I am using Spark 2.x
I am not sure that what you need is what you actually asked. But just in case, here is an idea:
import org.apache.spark.sql.functions.collect_list

val entries = inputDF.where('entry.isNotNull)
  .where('entry =!= "")
  .select("MemberID", "entry").distinct

val df = inputDF.groupBy("MemberID")
  .agg(collect_list("EncounterID") as "encounterList")
  .join(entries, Seq("MemberID"))
df.show
+--------+-------------------------+-----+
|MemberID|            encounterList|entry|
+--------+-------------------------+-----+
|    pid1|       [enc2, enc1, enc3]|  bat|
|    pid3|             [enc2, enc1]|  cat|
+--------+-------------------------+-----+
The order of the list is not deterministic, but you may sort it and then extract new columns from it with .withColumn("Encounter1", sort_array($"encounterList")(0)), etc., as sketched below.
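A minimal sketch of that extraction, assuming at most three encounters (the column count is fixed by hand here):
import org.apache.spark.sql.functions.sort_array

// Sort the collected list, then pick fixed positions;
// out-of-range indices come back as null (e.g. Encounter3 for pid3)
val wide = df
  .withColumn("sorted", sort_array($"encounterList"))
  .select($"MemberID",
    $"sorted"(0).as("Encounter1"),
    $"sorted"(1).as("Encounter2"),
    $"sorted"(2).as("Encounter3"),
    $"entry")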
Other idea
In case what you want is to put the value of entry in the corresponding "Encounter" column, you can use a pivot:
inputDF
.groupBy("MemberID")
.pivot("EncounterID", Seq("enc1", "enc2", "enc3"))
.agg(first("entry")).show
+--------+----+----+----+
|MemberID|enc1|enc2|enc3|
+--------+----+----+----+
|    pid1| bat|    |    |
|    pid3| cat|    |    |
+--------+----+----+----+
Adding Seq("enc1", "enc2", "enc3") is optional, but since you know the content of the column, providing the values up front spares Spark a pass to compute them and will speed up the computation.
I have two Spark DataFrames, df1 and df2:
+-------+-----+---+
| name|empNo|age|
+-------+-----+---+
|shankar|12121| 28|
| ramesh| 1212| 29|
| suresh| 1111| 30|
| aarush| 0707| 15|
+-------+-----+---+
+------+-----+---+-----+
| eName| eNo|age| city|
+------+-----+---+-----+
|aarush|12121| 15|malmo|
|ramesh| 1212| 29|malmo|
+------+-----+---+-----+
I need to get the non-matching records from df1, based on a number of columns specified in another file.
For example, the column lookup file is something like below:
df1col,df2col
name,eName
empNo, eNo
Expected output is:
+-------+-----+---+
| name|empNo|age|
+-------+-----+---+
|shankar|12121| 28|
| suresh| 1111| 30|
| aarush| 0707| 15|
+-------+-----+---+
The question is how to build a where condition dynamically for the above scenario, because the lookup file is configurable and might have 1 to n fields.
You can use the except DataFrame method. For simplicity, I'm assuming that the columns to use are in two lists. The order of both lists must match: columns at the same position in each list will be compared (regardless of column name). After except, use a join to get the missing columns back from the first DataFrame.
import org.apache.spark.sql.functions.broadcast

val df1 = Seq(("shankar","12121",28),("ramesh","1212",29),("suresh","1111",30),("aarush","0707",15))
.toDF("name", "empNo", "age")
val df2 = Seq(("aarush", "12121", 15, "malmo"),("ramesh", "1212", 29, "malmo"))
.toDF("eName", "eNo", "age", "city")
val df1Cols = List("name", "empNo")
val df2Cols = List("eName", "eNo")
val tempDf = df1.select(df1Cols.head, df1Cols.tail: _*)
.except(df2.select(df2Cols.head, df2Cols.tail: _*))
val df = df1.join(broadcast(tempDf), df1Cols)
The resulting dataframe will look as wanted:
+-------+-----+---+
| name|empNo|age|
+-------+-----+---+
| aarush| 0707| 15|
| suresh| 1111| 30|
|shankar|12121| 28|
+-------+-----+---+
If you're doing this from a SQL query, I would remap the column names in the SQL query itself with something like Changing a SQL column title via query. You could do a simple text replace in the query to normalize them to the df1 or df2 column names.
Once you have that, you can diff using something like
How to obtain the difference between two DataFrames?
If you need more columns that wouldn't be used in the diff (e.g. age), you can reselect the data again based on your diff results. This may not be the optimal way of doing it, but it would probably work.
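Alternatively, a minimal sketch of building the condition dynamically from the column pairs and using a left anti join (the colPairs list below stands in for the parsed lookup file, so it is an assumption):
// Hypothetical: column pairs as parsed from the lookup file
val colPairs = Seq(("name", "eName"), ("empNo", "eNo"))

// Fold the pairs into one conjunctive join condition
val joinCond = colPairs
  .map { case (c1, c2) => df1(c1) === df2(c2) }
  .reduce(_ && _)

// left_anti keeps exactly the df1 rows that have no match in df2
val nonMatching = df1.join(df2, joinCond, "left_anti")

This scales to any number of lookup fields without except's requirement that the selected column lists line up.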
Suppose I have the following data frame:
+----------+-----+----+-------+
|display_id|ad_id|prob|clicked|
+----------+-----+----+-------+
|       123|  989| 0.9|      0|
|       123|  990| 0.8|      1|
|       123|  999| 0.7|      0|
|       234|  789| 0.9|      0|
|       234|  777| 0.7|      0|
|       234|  769| 0.6|      1|
|       234|  798| 0.5|      0|
+----------+-----+----+-------+
I then perform the following operations to get a final data set (shown below the code):
from operator import itemgetter
from pyspark.sql import functions as F
from pyspark.sql.functions import udf, when
from pyspark.sql.types import ArrayType, IntegerType

# Add a new column with the clicked ad_id if clicked == 1, 0 otherwise
df_adClicked = df.withColumn("ad_id_clicked", when(df.clicked==1, df.ad_id).otherwise(0))
# DF -> RDD with tuples: (display_id, (ad_id, prob), ad_id_clicked)
df_blah = df_adClicked.rdd.map(lambda x : (x[0],(x[1],x[2]),x[4])).toDF(["display_id", "ad_id","clicked_ad_id"])
# Group by display_id and create column with clicked ad_id and list of tuples : (ad_id, prob)
df_blah2 = df_blah.groupby('display_id').agg(F.collect_list('ad_id'), F.max('clicked_ad_id'))
# Define function to sort list of tuples by prob and create list of only ad_ids
def sortByRank(ad_id_list):
    sortedVersion = sorted(ad_id_list, key=itemgetter(1), reverse=True)
    sortedIds = [i[0] for i in sortedVersion]
    return sortedIds
# Sort the (ad_id, prob) tuples by using udf/function and create new column ad_id_sorted
sort_ad_id = udf(lambda x : sortByRank(x), ArrayType(IntegerType()))
df_blah3 = df_blah2.withColumn('ad_id_sorted', sort_ad_id('collect_list(ad_id)'))
# Function to change clickedAdId into an array of size 1
def createClickedSet(clickedAdId):
    setOfDocs = [clickedAdId]
    return setOfDocs
clicked_set = udf(lambda y : createClickedSet(y), ArrayType(IntegerType()))
df_blah4 = df_blah3.withColumn('ad_id_set', clicked_set('max(clicked_ad_id)'))
# Select the necessary columns
finalDF = df_blah4.select('display_id', 'ad_id_sorted','ad_id_set')
+----------+--------------------+---------+
|display_id|ad_id_sorted        |ad_id_set|
+----------+--------------------+---------+
|234       |[789, 777, 769, 798]|[769]    |
|123       |[989, 990, 999]     |[990]    |
+----------+--------------------+---------+
Is there a more efficient way of doing this? This set of transformations seems to be the bottleneck in my code. I would greatly appreciate any feedback.
I haven't done any timing comparisons, but I would think that by not using any UDFs, Spark should be able to optimize things itself.
#scala: val dfad = sc.parallelize(Seq((123,989,0.9,0),(123,990,0.8,1),(123,999,0.7,0),(234,789,0.9,0),(234,777,0.7,0),(234,769,0.6,1),(234,798,0.5,0))).toDF("display_id","ad_id","prob","clicked")
# ^^^ that's the only difference (besides putting val in front of the variables) between this Python response and a Scala one
dfad = sc.parallelize(((123,989,0.9,0),(123,990,0.8,1),(123,999,0.7,0),(234,789,0.9,0),(234,777,0.7,0),(234,769,0.6,1),(234,798,0.5,0))).toDF(["display_id","ad_id","prob","clicked"])
dfad.registerTempTable("df_ad")
df1 = sqlContext.sql("SELECT display_id,collect_list(ad_id) ad_id_sorted FROM (SELECT * FROM df_ad SORT BY display_id,prob DESC) x GROUP BY display_id")
+----------+--------------------+
|display_id|        ad_id_sorted|
+----------+--------------------+
|       234|[789, 777, 769, 798]|
|       123|     [989, 990, 999]|
+----------+--------------------+
df2 = sqlContext.sql("SELECT display_id, max(ad_id) as ad_id_set from df_ad where clicked=1 group by display_id")
+----------+---------+
|display_id|ad_id_set|
+----------+---------+
|       234|      769|
|       123|      990|
+----------+---------+
final_df = df1.join(df2,"display_id")
+----------+--------------------+---------+
|display_id|        ad_id_sorted|ad_id_set|
+----------+--------------------+---------+
|       234|[789, 777, 769, 798]|      769|
|       123|     [989, 990, 999]|      990|
+----------+--------------------+---------+
I didn't put ad_id_set into an array because you were calculating the max, and max should only return one value. I'm sure that if you really need it in an array, you can make that happen (see the sketch below).
I included the subtle Scala difference in case someone using Scala has a similar problem in the future.
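If the one-element array really is needed, a minimal sketch (in Scala, to match the note above; final_df is assumed to be the Scala analogue of the DataFrame built above):
import org.apache.spark.sql.functions.array

// Wrap the scalar ad_id_set into a one-element array column
val finalWithSet = final_df.withColumn("ad_id_set", array($"ad_id_set"))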