I have data in a PySpark dataframe (it is a very big table with 900M rows).
This is the data that I have:
+-------+---------+----------+
| key| time| cond|
+-------+---------+----------+
| 6| 3704| null|
| 6| 74967| 1062|
| 6|151565068| null|
| 6|154999554| null|
| 6|160595800| null|
| 6|166192324| null|
| 6|166549533| null|
| 6|171318946| null|
| 6|754759092| null|
| 6|754999359| 18882624|
| 6|755171746| 11381128|
| 6|761097038| null|
| 6|774496554| null|
| 6|930609982| null|
| 6|930809622| null|
| 1| 192427| null|
| 1| 192427| 2779|
| 1| 717931| null|
| 1| 1110573| null|
| 1| 1155854| null|
| 1| 70049289| null|
| 1| 70687548| null|
| 1| 71222733| null|
| 1| 85006084| null|
| 1| 85029676| null|
| 1| 85032605| 1424537|
| 1| 85240114| null|
| 1| 85573757| null|
| 1| 85710915| null|
| 1| 85870370| null|
+-------+---------+----------+
This is what I need to do with the dataframe (intermediate step):
+-------+---------+----------+--------+
| key| time| cond| result|
+-------+---------+----------+--------+
| 6| 3704| null| 0|
| 6| 74967| 1062| 1|
| 6|151565068| null| 0|
| 6|154999554| null| 1|
| 6|160595800| null| 2|
| 6|166192324| null| 3|
| 6|166549533| null| 4|
| 6|171318946| null| 5|
| 6|754759092| null| 6|
| 6|754999359| 18882624| 7|
| 6|755171746| 11381128| 0|
| 6|761097038| null| 0|
| 6|774496554| null| 1|
| 6|930609982| null| 2|
| 6|930809622| null| 3|
| 1| 192427| null| 0|
| 1| 192427| 2779| 1|
| 1| 717931| null| 0|
| 1| 1110573| null| 1|
| 1| 1155854| null| 2|
| 1| 70049289| null| 3|
| 1| 70687548| null| 4|
| 1| 71222733| null| 5|
| 1| 85006084| null| 6|
| 1| 85029676| null| 7|
| 1| 85032605| 1424537| 8|
| 1| 85240114| null| 0|
| 1| 85573757| null| 1|
| 1| 85710915| null| 2|
| 1| 85870370| null| 3|
+-------+---------+----------+--------+
The logic for the 'result' column is as follows: keep a running counter per key that increases by 1 on each row and restarts from 0 on the row right after a row whose 'cond' is not null.
We can assume the table is already ordered by orderBy("key", asc("time")).
My end result is actually the average of 'result' (per key) over the rows where 'cond' is not null.
It should look like this for above data (final result):
+--------+--------------+
| key | avg_per_key |
+--------+--------------+
| 6| 2.66666665| ==> (1+7+0)/3
| 1| 4.5| ==> (1+8)/2
+--------+--------------+
I plan to do it like this:
df_results = df3[df3.cond.isNotNull()].groupby(['key']).agg(
    F.expr("avg(result)").alias("avg_per_key")
)
I assume this should work, but maybe there is a better way of doing it without the intermediate step in the middle.
How can this be done efficiently in PySpark? (Remember that the dataset is huge.)
Try this. result is calculated by taking an incremental sum over the condition flags, then using those groups as the partitionBy of a second window, where row_number() - 1 gives the desired value. Filtering before the groupBy should help performance by reducing the shuffle.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window().partitionBy("key").orderBy("time")
w1 = Window().partitionBy("key", "result").orderBy("time")

# Flag the rows that start a new group within a key: a null cond right after a
# non-null cond, or a non-null cond that follows a non-null cond and precedes a null one.
conditions = F.when((F.col("cond").isNotNull()) & (F.col("lag").isNotNull()) &
                    (F.col("lead").isNull()), F.lit(1)) \
             .when((F.col("cond").isNull()) & (F.col("lag").isNotNull()), F.lit(1)) \
             .otherwise(F.lit(0))

# The running sum of the flags gives a group id ("result"); row_number within each
# (key, group) minus 1 gives the counter, then filter and average per key.
df.withColumn("lag", F.lag("cond").over(w)) \
  .withColumn("lead", F.lead("cond").over(w)) \
  .withColumn("result", F.sum(conditions).over(w)) \
  .withColumn("result", F.row_number().over(w1) - 1) \
  .filter("cond is not null") \
  .groupBy("key").agg(F.mean(F.col("result")).alias("avg_per_key")) \
  .show()
#+---+------------------+
#|key| avg_per_key|
#+---+------------------+
#| 6|2.6666666666666665|
#| 1| 4.5|
#+---+------------------+
This was my solution. I am not claiming it is optimal, but it worked for my case when other attempts crashed the cluster.
I am a beginner in Spark, so I understand that this approach might cause issues, since it gathers each key's rows into memory.
If I had more time to play with it, I would try something with sortWithinPartitions.
import numpy as np

def handleRow(row):
    # row is (key, (time1, cond1, time2, cond2, ...)); rebuild the (time, cond) pairs
    temp = list(row[1])
    temp = np.array([temp[x:x + 2] for x in range(0, len(temp), 2)])
    # sort the pairs by time
    temp[:, 0] = temp[:, 0].astype(float)
    temp = temp[temp[:, 0].argsort()]
    avg_per_key = []
    counter = 0
    for time, cond in temp:
        if cond is not None:
            avg_per_key.append(counter)
            counter = 0
        else:
            counter = counter + 1
    return [(row[0], -1 if len(avg_per_key) == 0 else np.mean(avg_per_key))]

count = df3.rdd.map(lambda x: (x.key, (x.time, x.cond)))\
    .reduceByKey(lambda a, b: a + b)\
    .flatMap(handleRow)\
    .collect()
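For reference, a variant of this RDD approach that avoids building one huge concatenated tuple per key through reduceByKey is to use groupByKey and sort each key's rows inside the function. This is only a sketch under the same df3/column-name assumptions, and it still pulls each key's rows into memory:
import numpy as np

def handle_key(pairs):
    # pairs is an iterable of (time, cond) tuples for one key
    counter = 0
    results = []
    for _, cond in sorted(pairs, key=lambda p: p[0]):  # sort by time
        if cond is not None:
            results.append(counter)
            counter = 0
        else:
            counter += 1
    return -1 if not results else float(np.mean(results))

avg_per_key = (df3.rdd
    .map(lambda x: (x.key, (x.time, x.cond)))
    .groupByKey()
    .mapValues(handle_key)
    .collect())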
Related
I have a dataframe
+----------------+------------+-----+
| Sport|Total_medals|count|
+----------------+------------+-----+
| Alpine Skiing| 3| 4|
| Alpine Skiing| 2| 18|
| Alpine Skiing| 4| 1|
| Alpine Skiing| 1| 38|
| Archery| 2| 12|
| Archery| 1| 72|
| Athletics| 2| 50|
| Athletics| 1| 629|
| Athletics| 3| 8|
| Badminton| 2| 5|
| Badminton| 1| 86|
| Baseball| 1| 216|
| Basketball| 1| 287|
|Beach Volleyball| 1| 48|
| Biathlon| 4| 1|
| Biathlon| 3| 9|
| Biathlon| 1| 61|
| Biathlon| 2| 23|
| Bobsleigh| 2| 6|
| Bobsleigh| 1| 60|
+----------------+------------+-----+
Is there a way for me to combine the value of counts from multiple rows if they are from the same sport?
For example, if Sport = Alpine Skiing I would have something like this:
+----------------+-----+
| Sport|count|
+----------------+-----+
| Alpine Skiing| 61|
+----------------+-----+
where count is equal to 4+18+1+38 = 61. I would like to do this for all sports.
Any help would be appreciated.
You need to groupby on the Sport column and then aggregate the count column with the sum() function.
Example:
import pyspark.sql.functions as F
grouped_df = df.groupby('Sport').agg(F.sum('count'))
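If you also want the summed column to keep the name count (by default it comes out as sum(count)), you can alias it; a small sketch, assuming the same df:
import pyspark.sql.functions as F

grouped_df = df.groupby('Sport').agg(F.sum('count').alias('count'))
# sanity check for one sport: should give 4 + 18 + 1 + 38 = 61
grouped_df.filter(F.col('Sport') == 'Alpine Skiing').show()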
I have data in a PySpark dataframe (it is a very big table with 900M rows).
The dataframe contains a column with these values:
+---------------+
|prev_display_id|
+---------------+
| null|
| null|
| 1062|
| null|
| null|
| null|
| null|
| 18882624|
| 11381128|
| null|
| null|
| null|
| null|
| 2779|
| null|
| null|
| null|
| null|
+---------------+
I am trying to generate a new column based on this column, that will look like this:
+---------------+------+
|prev_display_id|result|
+---------------+------+
| null| 0|
| null| 1|
| 1062| 0|
| null| 1|
| null| 2|
| null| 3|
| null| 4|
| 18882624| 0|
| 11381128| 0|
| null| 1|
| null| 2|
| null| 3|
| null| 4|
| 2779| 0|
| null| 1|
| null| 2|
| null| 3|
| null| 4|
+---------------+------+
The function for the new column is something like:
new_col = 0 if prev_display_id is not null else new_col + 1
where new_col acts as a running counter that resets to zero when a non-null value is met.
How can that be done efficiently in pyspark?
UPDATE
I tried the solution suggested by @anki below. It works great for small datasets, but it generates this warning:
WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
Unfortunately it seems that for my big dataset it kills the cluster.
See image below for the error when running on the big dataset with 2 rd5.2xlarge data nodes:
Any idea how to solve this issue?
From what I understand, you can create an id column with monotonically_increasing_id, take a running sum over the window of the cases where prev_display_id is not null, and then take the row number partitioned by that sum, minus 1:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.orderBy(F.monotonically_increasing_id())
# running count of non-null rows: every non-null prev_display_id starts a new group
w1 = F.sum((F.col("prev_display_id").isNotNull()).cast("integer")).over(w)
# row number within each group, minus 1 so the counter starts at 0
df.withColumn("result", F.row_number().over(Window.partitionBy(w1).orderBy(w1)) - 1).show()
+---------------+------+
|prev_display_id|result|
+---------------+------+
| null| 0|
| null| 1|
| 1062| 0|
| null| 1|
| null| 2|
| null| 3|
| null| 4|
| 18882624| 0|
| 11381128| 0|
| null| 1|
| null| 2|
| null| 3|
| null| 4|
| 2779| 0|
| null| 1|
| null| 2|
| null| 3|
| null| 4|
+---------------+------+
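A note on scale: because this window has no partitionBy, Spark moves all rows into a single partition, which is exactly the WindowExec warning mentioned in the question's update. If the data has a natural grouping column, such as the key column used in the first question on this page, including it in the window keeps the work distributed; this is only a sketch under that assumption:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# assumes a grouping column "key" and an ordering column "time" exist in df
w = Window.partitionBy("key").orderBy("time")
grp = F.sum(F.col("prev_display_id").isNotNull().cast("integer")).over(w)
df.withColumn("result",
              F.row_number().over(Window.partitionBy("key", grp).orderBy("time")) - 1).show()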
You can get this by running the following command (note that in this sample the input column is named consumer_id rather than prev_display_id):
from pyspark.sql import functions as f
from pyspark.sql.window import Window
window = Window.orderBy(f.monotonically_increasing_id())
# 'ne' marks non-null rows with their row number; for null rows, 'result' is the distance from the last non-null row (or from the top if none yet)
df.withColumn('row', f.row_number().over(window))\
    .withColumn('ne', f.when(f.col('consumer_id').isNotNull(), f.col('row')))\
    .withColumn('result',
                f.when(f.col('ne').isNull(),
                       f.col('row') - f.when(f.last('ne', ignorenulls=True).over(window).isNull(), 1)
                                       .otherwise(f.last('ne', ignorenulls=True).over(window))
                       ).otherwise(0))\
    .drop('row', 'ne').show()
+-----------+------+
|consumer_id|result|
+-----------+------+
| null| 0|
| null| 1|
| null| 2|
| 11| 0|
| 11| 0|
| null| 1|
| null| 2|
| 12| 0|
| 12| 0|
+-----------+------+
I am facing a problem with a PySpark DataFrame loaded from a CSV file, where my numeric columns have empty values, like below:
+-------------+------------+-----------+-----------+
| Player_Name|Test_Matches|ODI_Matches|T20_Matches|
+-------------+------------+-----------+-----------+
| Aaron, V R| 9| 9| |
| Abid Ali, S| 29| 5| |
|Adhikari, H R| 21| | |
| Agarkar, A B| 26| 191| 4|
+-------------+------------+-----------+-----------+
I cast those columns to integer, and all the empty values became null:
df_data_csv_casted = df_data_csv.select(
    df_data_csv['Country'],
    df_data_csv['Player_Name'],
    df_data_csv['Test_Matches'].cast(IntegerType()).alias("Test_Matches"),
    df_data_csv['ODI_Matches'].cast(IntegerType()).alias("ODI_Matches"),
    df_data_csv['T20_Matches'].cast(IntegerType()).alias("T20_Matches"))
+-------------+------------+-----------+-----------+
| Player_Name|Test_Matches|ODI_Matches|T20_Matches|
+-------------+------------+-----------+-----------+
| Aaron, V R| 9| 9| null|
| Abid Ali, S| 29| 5| null|
|Adhikari, H R| 21| null| null|
| Agarkar, A B| 26| 191| 4|
+-------------+------------+-----------+-----------+
Then I take a total, but if one of the values is null, the result also comes out as null. How can I solve this?
df_data_csv_withTotalCol = df_data_csv_casted.withColumn(
    'Total_Matches',
    df_data_csv_casted['Test_Matches'] + df_data_csv_casted['ODI_Matches'] + df_data_csv_casted['T20_Matches'])
+-------------+------------+-----------+-----------+-------------+
|Player_Name |Test_Matches|ODI_Matches|T20_Matches|Total_Matches|
+-------------+------------+-----------+-----------+-------------+
| Aaron, V R | 9| 9| null| null|
|Abid Ali, S | 29| 5| null| null|
|Adhikari, H R| 21| null| null| null|
|Agarkar, A B | 26| 191| 4| 221|
+-------------+------------+-----------+-----------+-------------+
You can fix this by using the coalesce function. For example, let's create some sample data:
from pyspark.sql.functions import coalesce,lit
cDf = spark.createDataFrame([(None, None), (1, None), (None, 2)], ("a", "b"))
cDf.show()
+----+----+
| a| b|
+----+----+
|null|null|
| 1|null|
|null| 2|
+----+----+
When I do a simple sum as you did:
cDf.withColumn('Total',cDf.a+cDf.b).show()
I get the total as null, same as you described:
+----+----+-----+
| a| b|Total|
+----+----+-----+
|null|null| null|
| 1|null| null|
|null| 2| null|
+----+----+-----+
To fix it, use coalesce along with the lit function, which replaces the null values with zeroes:
cDf.withColumn('Total',coalesce(cDf.a,lit(0)) +coalesce(cDf.b,lit(0))).show()
This gives me the correct results:
+----+----+-----+
| a| b|Total|
+----+----+-----+
|null|null| 0|
| 1|null| 1|
|null| 2| 2|
+----+----+-----+
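Applied to your dataframe, the same pattern would look roughly like this (a sketch reusing df_data_csv_casted and the column names from your question):
from pyspark.sql.functions import coalesce, lit

df_data_csv_withTotalCol = df_data_csv_casted.withColumn(
    'Total_Matches',
    coalesce(df_data_csv_casted['Test_Matches'], lit(0))
    + coalesce(df_data_csv_casted['ODI_Matches'], lit(0))
    + coalesce(df_data_csv_casted['T20_Matches'], lit(0)))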
I have a DataFrame with a column "Speed". Can I efficiently add a column containing, for each row, the number of rows in the DataFrame whose "Speed" is within +/2 of that row's "Speed"?
results = spark.createDataFrame([[1],[2],[3],[4],[5],
[4],[5],[4],[5],[6],
[5],[6],[1],[3],[8],
[2],[5],[6],[10],[12]],
['Speed'])
results.show()
+-----+
|Speed|
+-----+
| 1|
| 2|
| 3|
| 4|
| 5|
| 4|
| 5|
| 4|
| 5|
| 6|
| 5|
| 6|
| 1|
| 3|
| 8|
| 2|
| 5|
| 6|
| 10|
| 12|
+-----+
You could use a window function:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Order the window by Speed and look at the range [Speed; Speed + 2]
w = Window.orderBy('Speed').rangeBetween(0, 2)
# Count the rows whose Speed falls in that range (the current row included)
results = results.withColumn('count+2', F.count('Speed').over(w)).orderBy('Speed')
results.show()
results.show()
+-----+-------+
|Speed|count+2|
+-----+-------+
| 1| 6|
| 1| 6|
| 2| 7|
| 2| 7|
| 3| 10|
| 3| 10|
| 4| 11|
| 4| 11|
| 4| 11|
| 5| 8|
| 5| 8|
| 5| 8|
| 5| 8|
| 5| 8|
| 6| 4|
| 6| 4|
| 6| 4|
| 8| 2|
| 10| 2|
| 12| 1|
+-----+-----+
Note: the window function counts the studied row itself. You can correct for this by subtracting 1 in the count column:
results = results.withColumn('count+2',F.count('Speed').over(w)-1).orderBy('Speed')
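If the intent was a symmetric window of +/-2 around each row's Speed (the "+/2" in the question can be read either way), the same approach works with a symmetric range; a sketch under that assumption:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# count rows whose Speed lies in [Speed-2, Speed+2], excluding the row itself
w_sym = Window.orderBy('Speed').rangeBetween(-2, 2)
results = results.withColumn('count_sym', F.count('Speed').over(w_sym) - 1)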
I joined two data frames and the resulting data frame is shown below. Now I want to do the following with it.
+---------+-----------+-----------+-------------------+---------+-------------------+
|a |b | c | d | e | f |
+---------+-----------+-----------+-------------------+---------+-------------------+
| 7| 2| 1|2015-04-12 23:59:01| null| null |
| 15| 2| 2|2015-04-12 23:59:02| | |
| 11| 2| 4|2015-04-12 23:59:03| null| null|
| 3| 2| 4|2015-04-12 23:59:04| null| null|
| 8| 2| 3|2015-04-12 23:59:05| {NORMAL}|2015-04-12 23:59:05|
| 16| 2| 3|2017-03-12 23:59:06| null| null|
| 5| 2| 3|2015-04-12 23:59:07| null| null|
| 18| 2| 3|2015-03-12 23:59:08| null| null|
| 17| 2| 1|2015-03-12 23:59:09| null| null|
| 6| 2| 1|2015-04-12 23:59:10| null| null|
| 19| 2| 3|2015-03-12 23:59:11| null| null|
| 9| 2| 3|2015-04-12 23:59:12| null| null|
| 1| 2| 2|2015-04-12 23:59:13| null| null|
| 1| 2| 2|2015-04-12 23:59:14| null| null|
| 1| 2| 2|2015-04-12 23:59:15| null| null|
| 10| 3| 2|2015-04-12 23:59:16| null| null|
| 4| 2| 3|2015-04-12 23:59:17| {NORMAL}|2015-04-12 23:59:17|
| 12| 3| 1|2015-04-12 23:59:18| null| null|
| 13| 3| 1|2015-04-12 23:59:19| null| null|
| 14| 2| 1|2015-04-12 23:59:20| null| null|
+---------+-----------+-----------+-------------------+---------+-------------------+
Now I have to find the first occurring 1 before each 3 in column c. For example, for this record:
| 4| 2| 3|2015-04-12 23:59:17| {NORMAL}|2015-04-12 23:59:17|
Before this record, I want to know the first 1 that occurred in column c, which is:
| 17| 2| 1|2015-03-12 23:59:09| null| null|
Any help is appreciated
You can use the Spark window function lag (you will need import org.apache.spark.sql.expressions.Window and org.apache.spark.sql.functions.lag).
In the first step, filter your data on column "c", keeping only the rows where the value is either 1 or 3. You will get data similar to:
dft.show()
+---+---+---+---+
| id| a| b| c|
+---+---+---+---+
| 1| 7| 2| 1|
| 2| 15| 2| 3|
| 3| 11| 2| 3|
| 4| 3| 2| 1|
| 5| 8| 2| 3|
+---+---+---+---+
Next, define the window
val w = Window.orderBy("id")
Once this is done, create a new column and put the previous value of "c" in it:
dft.withColumn("prev", lag("c",1).over(w)).show()
+---+---+---+---+----+
| id| a| b| c|prev|
+---+---+---+---+----+
| 1| 7| 2| 1|null|
| 2| 15| 2| 3| 1|
| 3| 11| 2| 3| 3|
| 4| 3| 2| 1| 3|
| 5| 8| 2| 3| 1|
+---+---+---+---+----+
Finally, filter on the values of columns "c" and "prev", keeping the rows where "c" is 3 and "prev" is 1.
Note: combine the steps when writing the final code, so that the filter is applied directly.
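For reference, a rough PySpark sketch of the same idea, with a hypothetical dataframe df holding the joined data and the timestamp column "d" used for ordering (these names are assumptions, not taken from the answer above):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.orderBy("d")
(df.filter(F.col("c").isin(1, 3))                     # keep only the 1s and 3s
   .withColumn("prev", F.lag("c", 1).over(w))         # value of "c" on the previous kept row
   .withColumn("prev_a", F.lag("a", 1).over(w))       # carry the previous row's "a" along, if you need it
   .filter((F.col("c") == 3) & (F.col("prev") == 1))  # 3s directly preceded by a 1
   .show())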