Confusing perf stat results: what does L1-dcache-loads 1110.763 M/sec mean? Is M MB?

perf results:
1501.634694 task-clock # 1.835 CPUs utilized ( +- 0.11% )
137 context-switches # 0.000 M/sec ( +- 1.06% )
5 CPU-migrations # 0.000 M/sec ( +- 7.14% )
145,306 page-faults # 0.097 M/sec ( +- 0.00% )
2,973,182,970 cycles # 1.980 GHz ( +- 0.09% ) [40.04%]
379,990,837 stalled-cycles-frontend # 12.78% frontend cycles idle ( +- 0.79% ) [39.66%]
230,979,839 stalled-cycles-backend # 7.77% backend cycles idle ( +- 5.22% ) [39.88%]
6,457,881,267 instructions # 2.17 insns per cycle
# 0.06 stalled cycles per insn ( +- 0.76% ) [49.76%]
318,376,775 branches # 212.020 M/sec ( +- 0.82% ) [49.62%]
47,093 branch-misses # 0.01% of all branches ( +- 10.27% ) [50.27%]
1,667,960,311 L1-dcache-loads # 1110.763 M/sec ( +- 0.54% ) [50.98%]
11,817,899 L1-dcache-load-misses # 0.71% of all L1-dcache hits ( +- 1.16% ) [51.09%]
1,408,419 LLC-loads # 0.938 M/sec ( +- 3.71% ) [41.25%]
950,688 LLC-load-misses # 67.50% of all LL-cache hits ( +- 8.25% ) [40.74%]
0.818404313 seconds time elapsed ( +- 0.29% )

1,667,960,311 L1-dcache-loads # 1110.763 M/sec ( +- 0.54% ) [50.98%]
No. The M stands for mega (10^6), so the value is about 1,110,763,000 loads per second. It says nothing about the size of the loads; that would require more information about your processor architecture.
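As a sanity check, the rate column can be reproduced from the raw counters in the output. This sketch assumes perf derives the M/sec figure by dividing the event count by the task-clock time (1501.634694 msec here), which is consistent with the printed numbers; the helper name `rate_m_per_sec` is made up for illustration.

```python
# Reproduce perf stat's "M/sec" column from the raw counters above.
# Assumption: rate = event count / task-clock time, with M = 10**6 events.

TASK_CLOCK_MSEC = 1501.634694  # task-clock line of the perf output


def rate_m_per_sec(count: int, task_clock_msec: float = TASK_CLOCK_MSEC) -> float:
    """Return millions of events per second of task-clock time."""
    seconds = task_clock_msec / 1000.0
    return count / seconds / 1e6


l1_loads = rate_m_per_sec(1_667_960_311)  # L1-dcache-loads
branches = rate_m_per_sec(318_376_775)    # branches
page_faults = rate_m_per_sec(145_306)     # page-faults

print(f"L1-dcache-loads: {l1_loads:.3f} M/sec")
print(f"branches:        {branches:.3f} M/sec")
print(f"page-faults:     {page_faults:.3f} M/sec")
```

The three values should agree with the 1110.763, 212.020 and 0.097 M/sec that perf printed, which confirms that the unit is a count of events, not bytes.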

Avoid shuffle when joining two tables with the same partitions without a bucketing strategy

I have two Hive tables A and B with:
same partitions (partition_1, partition_2)
an extra id field that is not sorted in partitions
When I join these two tables in PySpark, for example with:
df_A = spark.table("db.A")
df_B = spark.table("db.B")
df = df_A.join(df_B, how="inner", on=["partition_1", "partition_2", "id"])
I always end up with a shuffle:
+- == Initial Plan ==
   Project (23)
   +- SortMergeJoin Inner (22)
      :- Sort (18)
      :  +- Exchange (17)
      :     +- Filter (16)
      :        +- Scan parquet db.A (15)
      +- Sort (21)
         +- Exchange (20)
            +- Filter (19)
               +- Scan parquet db.B (7)
I created two similar tables, but with a bucketing strategy this time:
df.write.partitionBy("partition_A", "partition_B").bucketBy(10, "id").saveAsTable(...)
And there is no longer a shuffle in the join:
+- == Initial Plan ==
   Project (17)
   +- SortMergeJoin Inner (16)
      :- Sort (13)
      :  +- Filter (12)
      :     +- Scan parquet db.A (11)
      +- Sort (15)
         +- Filter (14)
            +- Scan parquet db.B (5)
My questions are:
Can I avoid this shuffle in the join without re-creating the tables with a bucketing strategy?
Does this shuffle operate on all the data, or does Spark recognise that the partitions are the same and optimise the shuffle?
What I tried so far:
repartitioning both tables on the partition columns (df.repartition("partition_A", "partition_B")) before joining
repartitioning on the partition columns and the id field (df.repartition(numPartitions, "partition_A", "partition_B", "id"))
sorting the data by id before joining
But the shuffle is still there.
I tried on both Databricks and EMR runtimes, with the same behaviour.
Thanks for your help

PySpark: update and insert records in a large Parquet dataset

I have 70M+ records (116 MB) in my data, with columns such as
ID, TransactionDate, CreationDate
Here ID is the primary key column. I need to update my data with incoming Parquet files, each smaller than 50 MB.
Sample Input
ID col1 col2
1 2021-01-01 2020-08-21
2 2021-02-02 2020-08-21
New Data
ID col1 col2
1 2021-02-01 2020-08-21
3 2021-02-02 2020-08-21
Output Rows of Data
1 2021-02-01 2020-08-21 (Updated)
3 2021-02-02 2020-08-21 (Inserted)
2 2021-02-02 2020-08-21 (Remains Same)
I have tried various approaches, but none of them gives correct results with low shuffle read and write and a short execution time.
One of my approaches:
Inner join (updated records), left anti join (inserted records), left anti join (unchanged records)
This takes 10 minutes to execute, with 9.5 GB of shuffle read and 9.5 GB of shuffle write.
I also tried partitioning by creationDate, but I could not work out how to read the new data with the appropriate partitions.
Please suggest a better approach in PySpark that takes less time, with less shuffle read and write.
Thanks in advance.
You cannot avoid shuffling entirely, but you can limit it by doing a single full outer join instead of one inner join and two anti joins.
First add a new column, updated, to your new dataframe to mark whether a joined row comes from the new data. Then perform the full outer join, and finally select the value of each column from the new or the old data according to the updated column. The code is as follows, with the old_data dataframe as the current data and the new_data dataframe as the incoming data:
from pyspark.sql import functions as F

join_columns = ['ID']

final_data = new_data \
    .withColumn('updated', F.lit(True)) \
    .join(old_data, join_columns, 'full_outer') \
    .select(
        [F.col(c) for c in join_columns] +
        [F.when(F.col('updated'), new_data[c]).otherwise(old_data[c]).alias(c) for c in old_data.columns if c not in join_columns]
    )
If you look at the execution plan by calling the .explain() method on the final_data dataframe, you can see only two shuffles (the Exchange steps), one per joined dataframe:
== Physical Plan ==
*(5) Project [coalesce(ID#6L, ID#0L) AS ID#17L, CASE WHEN exists#12 THEN col1#7 ELSE col1#1 END AS col1#24, CASE WHEN exists#12 THEN col2#8 ELSE col2#2 END AS col2#25]
+- SortMergeJoin [ID#6L], [ID#0L], FullOuter
   :- *(2) Sort [ID#6L ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(ID#6L, 200), ENSURE_REQUIREMENTS, [id=#27]
   :     +- *(1) Project [ID#6L, col1#7, col2#8, true AS exists#12]
   :        +- *(1) Scan ExistingRDD[ID#6L,col1#7,col2#8]
   +- *(4) Sort [ID#0L ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(ID#0L, 200), ENSURE_REQUIREMENTS, [id=#32]
         +- *(3) Scan ExistingRDD[ID#0L,col1#1,col2#2]
If you look at the execution plan for one inner join and two anti joins, you get six exchange steps (five Exchange nodes and one ReusedExchange):
== Physical Plan ==
Union
:- *(5) Project [ID#0L, col1#7, col2#8]
:  +- *(5) SortMergeJoin [ID#0L], [ID#6L], Inner
:     :- *(2) Sort [ID#0L ASC NULLS FIRST], false, 0
:     :  +- Exchange hashpartitioning(ID#0L, 200), ENSURE_REQUIREMENTS, [id=#177]
:     :     +- *(1) Project [ID#0L]
:     :        +- *(1) Filter isnotnull(ID#0L)
:     :           +- *(1) Scan ExistingRDD[ID#0L,col1#1,col2#2]
:     +- *(4) Sort [ID#6L ASC NULLS FIRST], false, 0
:        +- Exchange hashpartitioning(ID#6L, 200), ENSURE_REQUIREMENTS, [id=#183]
:           +- *(3) Filter isnotnull(ID#6L)
:              +- *(3) Scan ExistingRDD[ID#6L,col1#7,col2#8]
:- SortMergeJoin [ID#0L], [ID#6L], LeftAnti
:  :- *(7) Sort [ID#0L ASC NULLS FIRST], false, 0
:  :  +- Exchange hashpartitioning(ID#0L, 200), ENSURE_REQUIREMENTS, [id=#192]
:  :     +- *(6) Scan ExistingRDD[ID#0L,col1#1,col2#2]
:  +- *(9) Sort [ID#6L ASC NULLS FIRST], false, 0
:     +- Exchange hashpartitioning(ID#6L, 200), ENSURE_REQUIREMENTS, [id=#197]
:        +- *(8) Project [ID#6L]
:           +- *(8) Filter isnotnull(ID#6L)
:              +- *(8) Scan ExistingRDD[ID#6L,col1#7,col2#8]
+- SortMergeJoin [ID#6L], [ID#0L], LeftAnti
   :- *(11) Sort [ID#6L ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(ID#6L, 200), ENSURE_REQUIREMENTS, [id=#203]
   :     +- *(10) Scan ExistingRDD[ID#6L,col1#7,col2#8]
   +- *(13) Sort [ID#0L ASC NULLS FIRST], false, 0
      +- ReusedExchange [ID#0L], Exchange hashpartitioning(ID#0L, 200), ENSURE_REQUIREMENTS, [id=#177]
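The merge rule itself (take the new row when one exists for an ID, otherwise keep the old row) can be checked without Spark on the sample rows from the question. This is a plain-Python sketch of the expected result only, not a replacement for the join; the `upsert` helper and the dictionary layout are illustrative assumptions.

```python
# Plain-Python model of the upsert rule applied by the full outer join:
# for every ID, prefer the row from new_data, else keep the row from old_data.

old_data = {
    1: ("2021-01-01", "2020-08-21"),
    2: ("2021-02-02", "2020-08-21"),
}
new_data = {
    1: ("2021-02-01", "2020-08-21"),
    3: ("2021-02-02", "2020-08-21"),
}


def upsert(old: dict, new: dict) -> dict:
    """Return old rows overlaid with new rows, keyed by ID."""
    merged = {}
    for key in old.keys() | new.keys():  # union of keys = full outer join
        merged[key] = new[key] if key in new else old[key]
    return merged


final_data = upsert(old_data, new_data)
for row_id in sorted(final_data):
    print(row_id, *final_data[row_id])
```

Comparing a small sample like this against the Spark output is a cheap way to confirm that ID 1 is updated, ID 3 is inserted and ID 2 is left unchanged.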

Aggregation after sort(), persist() and limit() in Spark

I'm trying to get the sum of a column of the top n rows in a persisted DataFrame. For some reason, the following doesn't work:
val df = df0.sort(col("colB").desc).persist()
df.limit(2).agg(sum("colB")).show()
It shows a random number which is clearly less than the sum of the top two. The number changes from run-to-run.
Calling show() on the limit()'ed DF does consistently show the correct top two values:
df.limit(2).show()
It is as if sort() doesn't apply before the aggregation. Is this a bug in Spark? I suppose it's kind of expected that persist() loses the sorting, but why does it work with show() and should this be documented somewhere?
See the query plans below. agg results in an exchange (4th line in physical plan) which removes the sorting, whereas show does not result in any exchange, so sorting is maintained.
scala> df.limit(2).agg(sum("colB")).explain()
== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[sum(cast(colB#4 as bigint))])
+- *(2) HashAggregate(keys=[], functions=[partial_sum(cast(colB#4 as bigint))])
   +- *(2) GlobalLimit 2
      +- Exchange SinglePartition, true, [id=#95]
         +- *(1) LocalLimit 2
            +- *(1) ColumnarToRow
               +- InMemoryTableScan [colB#4]
                  +- InMemoryRelation [colB#4], StorageLevel(disk, memory, deserialized, 1 replicas)
                     +- *(1) Sort [colB#4 DESC NULLS LAST], true, 0
                        +- Exchange rangepartitioning(colB#4 DESC NULLS LAST, 200), true, [id=#7]
                           +- LocalTableScan [colB#4]
scala> df.limit(2).explain()
== Physical Plan ==
CollectLimit 2
+- *(1) ColumnarToRow
   +- InMemoryTableScan [colB#4]
      +- InMemoryRelation [colB#4], StorageLevel(disk, memory, deserialized, 1 replicas)
         +- *(1) Sort [colB#4 DESC NULLS LAST], true, 0
            +- Exchange rangepartitioning(colB#4 DESC NULLS LAST, 200), true, [id=#7]
               +- LocalTableScan [colB#4]
But if you persist the limited dataframe, there won't be any exchange for the aggregation, so that might do the trick:
val df1 = df.limit(2).persist()
scala> df1.agg(sum("colB")).explain()
== Physical Plan ==
*(1) HashAggregate(keys=[], functions=[sum(cast(colB#4 as bigint))])
+- *(1) HashAggregate(keys=[], functions=[partial_sum(cast(colB#4 as bigint))])
   +- *(1) ColumnarToRow
      +- InMemoryTableScan [colB#4]
         +- InMemoryRelation [colB#4], StorageLevel(disk, memory, deserialized, 1 replicas)
            +- CollectLimit 2
               +- *(1) ColumnarToRow
                  +- InMemoryTableScan [colB#4]
                     +- InMemoryRelation [colB#4], StorageLevel(disk, memory, deserialized, 1 replicas)
                        +- *(1) Sort [colB#4 DESC NULLS LAST], true, 0
                           +- Exchange rangepartitioning(colB#4 DESC NULLS LAST, 200), true, [id=#7]
                              +- LocalTableScan [colB#4]
In any case, it's best to use window functions to assign row numbers and sum the rows if their row number meets a certain condition (e.g. row_number <= 2). This will result in a deterministic outcome. For example,
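import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{row_number, sum}

df0.withColumn(
  "rn",
  row_number().over(Window.orderBy($"colB".desc))
).filter("rn <= 2").agg(sum("colB"))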
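To see why the aggregate can come out too small, here is a toy model in plain Python of the first plan: the cached data is range-partitioned and sorted, each partition applies LocalLimit 2, and GlobalLimit 2 keeps whichever two rows it sees first after the Exchange. This is a sketch under the assumption that partitions arrive in arbitrary order after the exchange; it is not how Spark is implemented, and all names in it are made up.

```python
import itertools

# Sorted, range-partitioned data: partition 0 holds the largest values.
partitions = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]


def limit_then_sum(parts, n=2):
    """LocalLimit n per partition, then GlobalLimit n over the merged rows."""
    local = [p[:n] for p in parts]                        # LocalLimit
    merged = list(itertools.chain.from_iterable(local))   # Exchange SinglePartition
    return sum(merged[:n])                                # GlobalLimit + sum


# Try every possible arrival order of the partitions after the exchange.
results = {limit_then_sum(order) for order in itertools.permutations(partitions)}
print(sorted(results))

# Deterministic alternative, analogous to filtering on row_number() <= n
# over a single global ordering.
all_rows = itertools.chain.from_iterable(partitions)
top_n_sum = sum(sorted(all_rows, reverse=True)[:2])
print(top_n_sum)
```

In this model only one arrival order yields the true top-two sum; the others yield smaller values, which matches the symptom described in the question.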

Scala spark: Sum all columns across all rows

I can do this quite easily with
df.groupBy().sum()
But I'm not sure whether the groupBy() adds a performance cost, or is just bad style. I've seen it done with
df.agg( ("col1", "sum"), ("col2", "sum"), ("col3", "sum"))
This skips the (I think unnecessary) groupBy, but has its own ugliness. What's the correct way to do this? Is there any under-the-hood difference between using .groupBy().<aggOp>() and using .agg?
If you check the physical plan for both queries, Spark produces the same plan internally, so you can use either of them.
I think df.groupBy().sum() is handier, as you don't need to specify all the column names.
Example:
val df=Seq((1,2,3),(4,5,6)).toDF("id","j","k")
scala> df.groupBy().sum().explain
== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[sum(cast(id#7 as bigint)), sum(cast(j#8 as bigint)), sum(cast(k#9 as bigint))])
+- Exchange SinglePartition
   +- *(1) HashAggregate(keys=[], functions=[partial_sum(cast(id#7 as bigint)), partial_sum(cast(j#8 as bigint)), partial_sum(cast(k#9 as bigint))])
      +- LocalTableScan [id#7, j#8, k#9]
scala> df.agg(sum("id"),sum("j"),sum("k")).explain
== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[sum(cast(id#7 as bigint)), sum(cast(j#8 as bigint)), sum(cast(k#9 as bigint))])
+- Exchange SinglePartition
   +- *(1) HashAggregate(keys=[], functions=[partial_sum(cast(id#7 as bigint)), partial_sum(cast(j#8 as bigint)), partial_sum(cast(k#9 as bigint))])
      +- LocalTableScan [id#7, j#8, k#9]
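Both plans have the same shape: a partial_sum per partition, an Exchange to a single partition, then a final sum. A plain-Python sketch of that two-stage aggregation on the example rows might look like the following. The split of the two rows into separate partitions is an assumption for illustration, and the function names are made up.

```python
# Toy model of the two-stage aggregation shown in the plans above.
# Assumption: the two example rows happen to live in separate partitions.

columns = ("id", "j", "k")
partitions = [
    [(1, 2, 3)],  # partition 0
    [(4, 5, 6)],  # partition 1
]


def partial_sum(rows):
    """Stage 1: sum each column within one partition."""
    return tuple(sum(col) for col in zip(*rows))


def final_sum(partials):
    """Stage 2: combine the per-partition sums after the exchange."""
    return tuple(sum(col) for col in zip(*partials))


partials = [partial_sum(p) for p in partitions]
totals = dict(zip(columns, final_sum(partials)))
print(totals)
```

Only one small tuple per partition crosses the exchange, which is why neither spelling of the query is more expensive than the other here.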

Apache Spark - Does dataset.dropDuplicates() preserve partitioning?

I know that there exist several transformations which preserve parent partitioning (if it was set before - e.g. mapValues) and some which do not preserve it (e.g. map).
I use Dataset API of Spark 2.2. My question is - does dropDuplicates transformation preserve partitioning? Imagine this code:
case class Item(one: Int, two: Int, three: Int)
import session.implicits._
val ds = session.createDataset(List(Item(1,2,3), Item(1,2,3)))
val repart = ds.repartition('one, 'two).cache()
repart.dropDuplicates(List("one", "two")) // will partitioning be preserved?
Generally, dropDuplicates does a shuffle (and thus does not preserve partitioning), but in your special case it does NOT do an additional shuffle, because you have already partitioned the dataset in a suitable form, which the optimizer takes into account:
repart.dropDuplicates(List("one","two")).explain()
== Physical Plan ==
*HashAggregate(keys=[one#3, two#4, three#5], functions=[])
+- *HashAggregate(keys=[one#3, two#4, three#5], functions=[])
   +- InMemoryTableScan [one#3, two#4, three#5]
      +- InMemoryRelation [one#3, two#4, three#5], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
         +- Exchange hashpartitioning(one#3, two#4, 200)
            +- LocalTableScan [one#3, two#4, three#5]
The keyword to look for here is Exchange.
But consider the following code where you first repartition the dataset using plain repartition():
val repart = ds.repartition(200).cache()
repart.dropDuplicates(List("one","two")).explain()
This will indeed trigger an additional shuffle (now you have 2 Exchange steps):
== Physical Plan ==
*HashAggregate(keys=[one#3, two#4], functions=[first(three#5, false)])
+- Exchange hashpartitioning(one#3, two#4, 200)
   +- *HashAggregate(keys=[one#3, two#4], functions=[partial_first(three#5, false)])
      +- InMemoryTableScan [one#3, two#4, three#5]
         +- InMemoryRelation [one#3, two#4, three#5], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
            +- Exchange RoundRobinPartitioning(200)
               +- LocalTableScan [one#3, two#4, three#5]
NOTE: I checked this with Spark 2.1. It may be different in Spark 2.2, because the optimizer changed in that release (Cost-Based Optimizer).
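Why the first plan needs no extra Exchange can be illustrated with a toy model: once rows are hash-partitioned by the deduplication keys, all duplicates of a key sit in the same partition, so deduplicating each partition on its own gives the global result. The sketch below uses Python's built-in `hash` and made-up helper names purely for illustration; Spark's partitioner works differently.

```python
# Toy model: hash-partition rows by (one, two), then deduplicate per partition.

rows = [(1, 2, 3), (1, 2, 3), (1, 2, 4), (5, 6, 7)]
NUM_PARTITIONS = 4


def hash_partition(data, num_partitions):
    """Place each row in a partition chosen by the hash of its key columns."""
    parts = [[] for _ in range(num_partitions)]
    for row in data:
        key = row[:2]  # columns one, two
        parts[hash(key) % num_partitions].append(row)
    return parts


def drop_duplicates(part):
    """Keep the first row seen for each (one, two) key within one partition."""
    first = {}
    for row in part:
        first.setdefault(row[:2], row)
    return list(first.values())


partitions = hash_partition(rows, NUM_PARTITIONS)
deduped = [row for part in partitions for row in drop_duplicates(part)]
print(sorted(deduped))
```

With a round-robin layout the same key could land in several partitions, so a per-partition pass would leave duplicates behind and a second shuffle on the keys becomes necessary.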
No, dropDuplicates doesn't preserve partitions since it has a shuffle boundary, which doesn't guarantee order.
dropDuplicates is approximately:
ds.groupBy(columnId).agg(/* take first column from any available partition */)