Given the size of the datasets I am currently using, I started working in Databricks with PySpark. After a few weeks I still struggle to fully understand what is going on under the hood.
I have a dataset of about 40 million rows, and I apply this function that dynamically computes some aggregations in a rolling window:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

def freshness(df):
    days = lambda x: x*60*60*24

    w = Window.partitionBy('csecid', 'date').orderBy('date')
    # 100-day rolling range window per csecid, ordered by the date cast to epoch seconds
    w1 = Window.partitionBy('csecid').orderBy(F.col('date').cast('timestamp').cast('long')).rangeBetween(-days(100), 0)
    w2 = Window.partitionBy('csecid').orderBy('date')
    w3 = Window.partitionBy('csecid', 'id', 'date').orderBy('date')
    w4 = Window.partitionBy('csecid', 'id')
    w5 = Window.partitionBy('csecid', 'id').orderBy(F.col('date').desc())

    df = df.withColumn('dateid', F.row_number().over(w))

    # collect the dates that fall inside each 100-day window, then explode them back into rows
    df1 = df.withColumn('flag', F.collect_list('date').over(w1))
    df1 = df1.withColumn('id', F.row_number().over(w2)).select('csecid', 'flag', 'id')
    df1 = df1.withColumn('date', F.explode('flag')).drop('flag')
    df1 = df1.withColumn('dateid', F.row_number().over(w3))

    df2 = df1.join(df, on=['csecid', 'date', 'dateid'], how='left')

    df3 = (df2
           .withColumn('analyst_fresh', F.floor(F.approx_count_distinct('analystid', 0.005).over(w4)/3))
           .orderBy('csecid', 'id', 'date', F.col('perenddate').desc())
           .groupBy('csecid', 'id', 'analystid')
           .agg(F.last('date', True).alias('date'), F.last('analyst_fresh', True).alias('analyst_fresh'))
           .where(F.col('analystid').isNotNull())
           .orderBy('csecid', 'id', 'date')
           .withColumn('id2', F.row_number().over(w5))
           .withColumn('freshness', F.when(F.col('id2') <= F.col('analyst_fresh'), 1).otherwise(0))
           .drop('analyst_fresh', 'id2', 'analystid'))

    df_fill = _get_fill_dates_df(df3, 'date', ['id', 'csecid'])
    df3 = df3.join(df_fill, on=['csecid', 'date', 'id', 'freshness'], how='outer')
    df3 = df3.groupBy('csecid', 'id', 'date').agg(F.max('freshness').alias('freshness'))
    df3 = df2.join(df3, on=['csecid', 'id', 'date'], how='left').fillna(0, subset=['freshness'])
    df3 = df3.withColumn('fresh_revision', (F.abs(F.col('revisions_improved')) + F.col('freshness')) * F.signum('revisions_improved'))

    df4 = (df3
           .orderBy('csecid', 'id', 'date', F.col('perenddate').desc())
           .groupBy('csecid', 'id', 'analystid')
           .agg(F.last('date').alias('date'), F.last('fresh_revision', True).alias('fresh_revision'))
           .orderBy('csecid', 'id', 'date')
           .groupBy('csecid', 'id')
           .agg(F.last('date').alias('date'),
                F.sum('fresh_revision').alias('fresh_revision'),
                F.sum(F.abs('fresh_revision')).alias('n_revisions'))
           .withColumn('revision_index_improved', F.col('fresh_revision') / F.col('n_revisions'))
           .groupBy('csecid', 'date')
           .agg(F.first('revision_index_improved').alias('revision_index_improved')))

    df5 = df.join(df4, on=['csecid', 'date'], how='left').orderBy('csecid', 'date')
    return df5

weight_list = ['leader', 'truecall']
for c in weight_list:
    # recompute revisions_improved with each weight column, then rerun the aggregation
    df = df.withColumn('revisions_improved', (F.abs(F.col('revisions_improved')) + F.col(c)) * F.col('revisions'))
    df = freshness(df)
This piece of code runs in about 2 hours, and I noticed that most of the computational time is taken by one job that uses a single executor (out of the 16 available).
I read that it is possible to overcome this problem by repartitioning the dataframe with .repartition(). My questions are the following: how can I find out which piece of code the above-mentioned job corresponds to? Where should I repartition my dataframe? Is that the correct solution?
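For what it's worth, here is a minimal sketch of what explicit repartitioning could look like, assuming csecid is the column the heavy windows partition by; the partition count of 200 and the job-description label are illustrative guesses, not values taken from the question:
# hypothetical sketch: spread the window work across executors by repartitioning
# on the key the windows partition by, before calling freshness
df = df.repartition(200, 'csecid')                        # 200 partitions is an arbitrary starting point
spark.sparkContext.setJobDescription('freshness pass')    # labels subsequently triggered jobs in the Spark UI
df = freshness(df)
To map a slow job back to the code, the job description and the SQL/DataFrame tab in the Spark UI are usually easier to read than the raw stage names.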
In my PySpark dataframe, I have a column 'TimeStamp' which is in DateTime format. I want to convert it to 'Date' format and then use it in a 'GroupBy'.
df = spark.sql("SELECT * FROM `myTable`")
df.filter((df.somthing!="thing"))
df.withColumn('MyDate', col('Timestamp').cast('date'))
df.groupBy('MyDate').count().show()
But I get this error:
cannot resolve 'MyDate' given input columns:
Can you please help me with this?
Each time you call a method on df, you are creating a new dataframe.
df was only initialized in your first line of code, so that dataframe object does not have the new column MyDate.
You can look at the id() of each object to see this:
df = spark.sql("SELECT * FROM `myTable`")
print(id(df))
print(id(df.filter(df.somthing!="thing")))
This is the correct syntax to chain the operations:
from pyspark.sql.functions import col

df = spark.sql("SELECT * FROM myTable")
df = (df
      .filter(df.somthing != "thing")
      .withColumn('MyDate', col('Timestamp').cast('date'))
      .groupBy('MyDate').count()
)
df.show(truncate=False)
UPDATE: this is a better way to write it
df = (
    spark.sql("""
        SELECT *
        FROM myTable
    """)
    .filter(col("something") != "thing")
    .withColumn("MyDate", col("Timestamp").cast("date"))
    .groupBy("MyDate").count()
)
I'm performing an inner join between, say, 8 dataframes, all coming from the same parent. Sample code:
// read parquet
val readDF = session.read.parquet(...)
// multiple expensive transformations are performed over readDF, making its DAG grow
// repartition + cache
val df = readDF.repartition($"type").cache
val df1 = df.filter($"type" === 1)
val df2 = df.filter($"type" === 2)
val df3 = df.filter($"type" === 3)
val df4 = df.filter($"type" === 4)
val df5 = df.filter($"type" === 5)
val df6 = df.filter($"type" === 6)
val df7 = df.filter($"type" === 7)
val df8 = df.filter($"type" === 8)
val joinColumns = Seq("col1", "col2", "col3", "col4")
val joinDF = df1
.join(df2, joinColumns)
.join(df3, joinColumns)
.join(df4, joinColumns)
.join(df5, joinColumns)
.join(df6, joinColumns)
.join(df7, joinColumns)
.join(df8, joinColumns)
Unexpectedly, the joinDF statement is taking a long time. A join is supposed to be a transformation, not an action.
Do you know what's happening? Is this a use case for checkpointing?
Notes:
- joinDF.explain shows a long DAG lineage.
- using Spark 2.3.0 with Scala
RDD joins and Spark SQL joins are indeed transformations. I ran this with no issue in a Databricks notebook, but I am not privy to the "// multiple expensive transformations are performed over readDF, making its DAG grow" part. Maybe there is an action in there.
Indeed, checkpointing seems to fix the long-running join. It now behaves as a transformation, returning faster. So I conclude that the delay was related to the large DAG lineage.
Also, the subsequent actions are now faster.
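For reference, a minimal PySpark sketch of the checkpointing fix described above (the question uses Scala, but DataFrame.checkpoint behaves the same way in both APIs; readDF stands in for the expensively transformed parent dataframe, and the checkpoint directory is a placeholder path):
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")   # placeholder location

# checkpoint() materializes the data and truncates the logical plan, so the
# joins below no longer re-analyze the parent's long lineage on every use
df = readDF.repartition("type").checkpoint(eager=True)

df1 = df.filter("type = 1")
df2 = df.filter("type = 2")
join_columns = ["col1", "col2", "col3", "col4"]
joined = df1.join(df2, join_columns)   # ...and so on for df3 through df8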
I need to parameterize a join condition, and the joining columns should be passed in from the command line (I'm passing them at the prompt, in PySpark).
My code is:
x1 = col(argv[1])
x2 = col(argv[2])
df = df1.join(df2, (df1.x1 == df2.x2))
This is how I call the script:
join.py empid emdid
I get this error:
df has no such columns.
Any ideas on how to solve this?
Follow this approach; it will work even if your dataframes are joining on columns that have the same name.
argv = ['join.py', 'empid', 'empid']  # simulating the command-line arguments
x1 = argv[1]
x2 = argv[2]
df1 = spark.createDataFrame([(1, "A"),(2, "B")], ("empid", "c2"))
df2 = spark.createDataFrame([(1, "A"),(2, "B")], ("empid", "c2"))
df = df1.join(df2, df1[x1] == df2[x2])
df.show()
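As a usage note, here is a minimal sketch of the same idea reading the column names from the real command line via sys.argv; the file name join.py and the spark-submit invocation are just illustrative:
# join.py -- hypothetical standalone version, run e.g. with:
#   spark-submit join.py empid empid
import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
x1, x2 = sys.argv[1], sys.argv[2]   # column names passed on the command line

df1 = spark.createDataFrame([(1, "A"), (2, "B")], ("empid", "c2"))
df2 = spark.createDataFrame([(1, "A"), (2, "B")], ("empid", "c2"))

# bracket notation looks the columns up by name at runtime, unlike df1.x1,
# which expects a column literally called 'x1'
df = df1.join(df2, df1[x1] == df2[x2])
df.show()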
var df = List(1,2,3,4,5,6,7,8,9,10,11).toDF("num")
df.show()
var df2 = df.limit(3)
df2.show()
var df3 = df.except(df2)
df3.show()
Surprisingly, I found that except is not working the way it should. Here is my output:
df2 is created correctly and contains 1, 2 and 3, but my df3 still has 1, 2 and/or 3 in it. It's kind of random: if I run it multiple times, I get different results. Can anyone please help me? Thanks in advance.
You need to use a Spark "action" to collect the data required for "df2" before performing the "except" operation. This ensures that the dataframe df2 gets computed beforehand and has fixed content, which is then subtracted from df.
The randomness comes from Spark's lazy evaluation: Spark folds all your code into one plan, so the contents of "df2" are not fixed at the point where you perform the "except" operation. As per the Spark documentation for limit:
Returns a new Dataset by taking the first n rows. The difference between this function
and head is that head is an action and returns an array (by triggering query execution)
while limit returns a new Dataset.
Since limit returns a Dataset, it is evaluated lazily.
The code below will give you a consistent output.
var df = List(1,2,3,4,5,6,7,8,9,10,11).toDF("num")
df.show()
var df2 = df.head(3).map(f => f.mkString).toList.toDF("num")
df2.show()
var df3 = df.except(df2)
df3.show()
The best way to test this is to just create a new DF that has the values you want to diff:
val df = List(1,2,3,4,5,6,7,8,9,10,11).toDF("num")
df.show()
val df2 = List(1,2,3).toDF("num")
df2.show()
val df3 = df.except(df2)
df3.show()
Alternatively, just write a deterministic filter to select the rows you want:
val df = List(1,2,3,4,5,6,7,8,9,10,11).toDF("num")
df.show()
val df2 = df.filter("num <= 3")
df2.show()
val df3 = df.except(df2)
df3.show()
One could use a "leftanti" join for this if you have uniqueness in the column you are comparing on.
Example:
var df = List(1,2,3,4,5,6,7,8,9,10,11).toDF("num")
df.show()
var df2 = df.limit(3)
df2.show()
var df3 = df.join(df2,Seq("num"),"leftanti")
df3.show()
I have a CSV dataset that I want to process using Spark; the second column is in this format:
yyyy-MM-dd hh:mm:ss
I want to group by each MM-dd.
val days: RDD[String] = sc.textFile(<csv file>)
val partitioned = days.map(row => {
  row.split(",")(1).substring(5, 10)
}).invertTheMap.groupOrReduceByKey
The result of groupOrReduceByKey should be of the form:
("MM-dd" -> (row1, row2, row3, ..., row_n) )
How should I implement invertTheMap and groupOrReduceByKey?
I saw this done in Python here, but I wonder how it is done in Scala?
This should do the trick
val testData = List("a, 1987-09-30",
                    "a, 2001-09-29",
                    "b, 2002-09-30")
val input = sc.parallelize(testData)
val grouped = input.map { row =>
  val columns = row.split(",")
  (columns(1).substring(6, 11), row)
}.groupByKey()
grouped.foreach(println)
The output is
(09-29,CompactBuffer(a, 2001-09-29))
(09-30,CompactBuffer(a, 1987-09-30, b, 2002-09-30))