I am trying to convert an RDD into a DataFrame using a case class, as follows:
1.) Fetching data from a text file containing "id,name,country" records separated by "," and without a header:
val x = sc.textFile("file:///home/hdadmin/records.txt")
2.) Creating a case class rec that defines the column names and types:
case class rec(id:Int, name:String, country:String)
3.) Now I define the transformations
val y = x.map(x=>x.split(",")).map(x=>rec(x(0).toInt,x(1),x(2)))
4.) Then I imported the implicits library
import spark.implicits._
5.) Converting the RDD to a DataFrame using the toDF method:
val z = y.toDF()
6.) Now when I try to fetch the records with the command below:
z.select("name").show()
I get the following error:
17/05/19 12:50:14 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerSQLExecutionStart(9,show at :49,org.apache.spark.sql.Dataset.show(Dataset.scala:495)
$line105.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(:49)
$line105.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(:54)
$line105.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(:56)
$line105.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(:58)
$line105.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(:60)
$line105.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(:62)
$line105.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(:64) $line105.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(:66)
$line105.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(:68)
$line105.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(:70)
$line105.$read$$iw$$iw$$iw$$iw$$iw$$iw.(:72)
$line105.$read$$iw$$iw$$iw$$iw$$iw.(:74)
$line105.$read$$iw$$iw$$iw$$iw.(:76)
$line105.$read$$iw$$iw$$iw.(:78)
$line105.$read$$iw$$iw.(:80)
$line105.$read$$iw.(:82)
$line105.$read.(:84)
$line105.$read$.(:88)
$line105.$read$.(),
== Parsed Logical Plan ==
GlobalLimit 21
+- LocalLimit 21
   +- Project [name#91]
      +- LogicalRDD [id#90, name#91, country#92]
== Analyzed Logical Plan ==
name: string
GlobalLimit 21
+- LocalLimit 21
   +- Project [name#91]
      +- LogicalRDD [id#90, name#91, country#92]
== Optimized Logical Plan ==
GlobalLimit 21
+- LocalLimit 21
   +- Project [name#91]
      +- LogicalRDD [id#90, name#91, country#92]
== Physical Plan ==
CollectLimit 21
+- *Project [name#91]
   +- Scan ExistingRDD[id#90,name#91,country#92],org.apache.spark.sql.execution.SparkPlanInfo#b807ee,1495223414636)
17/05/19 12:50:14 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerSQLExecutionEnd(9,1495223414734)
java.lang.IllegalStateException: SparkContext has been shutdown
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1863)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1884)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1897)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:347)
  at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:39)
  at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2183)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
  at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2532)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2182)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2189)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1925)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1924)
  at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2562)
  at org.apache.spark.sql.Dataset.head(Dataset.scala:1924)
  at org.apache.spark.sql.Dataset.take(Dataset.scala:2139)
  at org.apache.spark.sql.Dataset.showString(Dataset.scala:239)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:526)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:486)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:495)
  ... 56 elided
Where could the problem be?
After trying the same code on a couple of text files, I went back and checked the source file closely for any discrepancy in its format.
The column separator in the code below is ",", and after scanning the text file carefully I found it was missing in one place.
val y = x.map(x=>x.split(",")).map(x=>rec(x(0).toInt,x(1),x(2)))
After fixing the file, the code worked fine and gave me the results as a structured table.
It is therefore important that the separator (",", "\t", "|", etc.) given inside
x.split("")
matches the separator in the source file, consistently throughout the file.
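If the source file cannot be guaranteed to be clean, a small guard around the split keeps one malformed line from failing the whole job. This is only a sketch under the assumptions above (comma separator, three fields, the same rec case class, and import spark.implicits._ already in scope):
// Sketch: skip rows that do not have exactly 3 fields or whose id is not numeric,
// instead of letting one malformed line kill the whole job.
val parsed = x
  .map(_.split(",", -1).map(_.trim))                                                        // -1 keeps trailing empty fields
  .filter(fields => fields.length == 3 && fields(0).nonEmpty && fields(0).forall(_.isDigit)) // drop malformed rows
  .map(fields => rec(fields(0).toInt, fields(1), fields(2)))
val cleanedDF = parsed.toDF()
cleanedDF.select("name").show()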
Related
I have two Hive tables A and B with:
same partitions (partition_1, partition_2)
an extra id field that is not sorted in partitions
When I join these two tables in PySpark, for example with:
df_A = spark.table("db.A")
df_B = spark.table("db.B")
df = df_A.join(df_B, how="inner", on=["partition_1", "partition_2", "id"])
I always end up with a shuffle:
+- == Initial Plan ==
Project (23)
+- SortMergeJoin Inner (22)
:- Sort (18)
: +- Exchange (17)
: +- Filter (16)
: +- Scan parquet db.A (15)
+- Sort (21)
+- Exchange (20)
+- Filter (19)
+- Scan parquet db.B (7)
I created two similar tables, but this time with a bucketing strategy:
df.write.partitionBy("partition_1", "partition_2").bucketBy(10, "id").saveAsTable(...)
and there is no more shuffle in the join:
+- == Initial Plan ==
Project (17)
+- SortMergeJoin Inner (16)
:- Sort (13)
: +- Filter (12)
: +- Scan parquet db.A (11)
+- Sort (15)
+- Filter (14)
+- Scan parquet db.B (5)
My questions are:
Can I avoid this shuffle in the join without having to re-create the tables with a bucketing strategy?
Does this shuffle operate on all the data? Or does it recognise that the partitions are the same and optimise the shuffle accordingly?
What I tried so far:
repartitioning both tables on the partition columns before joining (df.repartition("partition_1", "partition_2"))
repartitioning on the partition columns and the id field (df.repartition(numPartitions, "partition_1", "partition_2", "id"))
sorting data by id before joining
But the shuffle is still here.
I tried this on both the Databricks and EMR runtimes, with the same behaviour.
Thanks for your help
I have a Spark query which reads a lot of parquet data from S3, filters it, and adds a column computed as regexp_extract(input_file_name, ...) which I assume is a relatively heavy operation (if applied before filtering rather than after it).
The whole query looks like this:
val df = spark
.read
.option("mergeSchema", "true")
.parquet("s3://bucket/path/date=2020-01-1{5,6}/clientType=EXTENSION_CHROME/type={ACCEPT,IGNORE*}/")
.where(...)
.withColumn("type", regexp_extract(input_file_name, "type=([^/]+)", 1))
.repartition(300)
.cache()
df.count()
Is withColumn executed after where or before where? Does it depend on the order in which I write them? What if my where statement used a column added by withColumn?
withColumn and filter execute in the order they are called; the query plan shows it. Please read the plan bottom up.
val employees = spark.createDataFrame(Seq(("E1",100.0), ("E2",200.0),("E3",300.0))).toDF("employee","salary")
employees.withColumn("column1", when(col("salary") > 200, lit("rich")).otherwise("poor")).filter(col("column1")==="poor").explain(true)
Plan: the Project happens first, then the Filter.
== Parsed Logical Plan ==
'Filter ('column1 = poor)
+- Project [employee#4, salary#5, CASE WHEN (salary#5 > cast(200 as double)) THEN rich ELSE poor END AS column1#8]
+- Project [_1#0 AS employee#4, _2#1 AS salary#5]
+- LocalRelation [_1#0, _2#1]
== Analyzed Logical Plan ==
employee: string, salary: double, column1: string
Filter (column1#8 = poor)
+- Project [employee#4, salary#5, CASE WHEN (salary#5 > cast(200 as double)) THEN rich ELSE poor END AS column1#8]
+- Project [_1#0 AS employee#4, _2#1 AS salary#5]
+- LocalRelation [_1#0, _2#1]
This code filters first, then adds the new column:
employees.filter(col("employee")==="E1").withColumn("column1", when(col("salary") > 200, lit("rich")).otherwise("poor")).explain(true)
Plan: it filters first, then projects.
== Parsed Logical Plan ==
'Project [employee#4, salary#5, CASE WHEN ('salary > 200) THEN rich ELSE poor END AS column1#13]
+- Filter (employee#4 = E1)
+- Project [_1#0 AS employee#4, _2#1 AS salary#5]
+- LocalRelation [_1#0, _2#1]
== Analyzed Logical Plan ==
employee: string, salary: double, column1: string
Project [employee#4, salary#5, CASE WHEN (salary#5 > cast(200 as double)) THEN rich ELSE poor END AS column1#13]
+- Filter (employee#4 = E1)
+- Project [_1#0 AS employee#4, _2#1 AS salary#5]
+- LocalRelation [_1#0, _2#1]
Further evidence: it throws an error when filter references a column before it has been added (as expected):
employees.filter(col("column1")==="poor").withColumn("column1", when(col("salary") > 200, lit("rich")).otherwise("poor")).show()
org.apache.spark.sql.AnalysisException: cannot resolve '`column1`' given input columns: [employee, salary];;
'Filter ('column1 = poor)
+- Project [_1#0 AS employee#4, _2#1 AS salary#5]
+- LocalRelation [_1#0, _2#1]
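One nuance worth adding (my observation, not shown in the plans above): the parsed and analyzed plans preserve the call order, but the optimizer may still push a filter past a withColumn when the predicate does not reference the new column. A hedged sketch with a file-based source, where the effect is easiest to see (the path and column names are illustrative):
// The filter is written after withColumn but only uses an original column,
// so in the optimized/physical plan it is pushed toward the scan (it shows up
// in the FileScan's PushedFilters), while the Project that adds "flag" stays above.
// Assumes import org.apache.spark.sql.functions._ as in the snippets above.
spark.read.parquet("/tmp/some_table")
  .withColumn("flag", when(col("amount") > 200, lit("high")).otherwise("low"))
  .filter(col("amount") > 150)
  .explain(true)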
Let us suppose I have a dataframe that looks like this:
val df2 = Seq("A:job_1, B:whatever1", "A:job_1, B:whatever2", "A:job_2, B:whatever3").toDF("values")
df2.show()
How can I group it by a regular expression like "job_" and then take the first element of each group, to end up with something like:
|A:job_1, B:whatever1|
|A:job_2, B:whatever3|
Thanks a lot and kind regards
You should probably just create a new column with regexp_extract, group by it, and then drop it!
import org.apache.spark.sql.{functions => F}
df2.
withColumn("A", F.regexp_extract($"values", "job_[0-9]+", 0)). // Extract the key of the groupBy
groupBy("A").
agg(F.first("values").as("first value")). // Get the first value
drop("A").
show()
Here is the Catalyst plan if you wish to dig deeper.
As you can see in the optimized logical plan, the following two are strictly equivalent:
explicitly creating a new column with: .withColumn("A", F.regexp_extract($"values", "job_[0-9]+", 0))
grouping directly by the expression with: .groupBy(F.regexp_extract($"values", "job_[0-9]+", 0).alias("A")) (a full snippet of this variant follows after the plan below)
Here is the Catalyst plan:
== Parsed Logical Plan ==
'Aggregate [A#198], [A#198, first('values, false) AS first value#206]
+- Project [values#3, regexp_extract(values#3, job_[0-9]+, 0) AS A#198]
+- Project [value#1 AS values#3]
+- LocalRelation [value#1]
== Analyzed Logical Plan ==
A: string, first value: string
Aggregate [A#198], [A#198, first(values#3, false) AS first value#206]
+- Project [values#3, regexp_extract(values#3, job_[0-9]+, 0) AS A#198]
+- Project [value#1 AS values#3]
+- LocalRelation [value#1]
== Optimized Logical Plan ==
Aggregate [A#198], [A#198, first(values#3, false) AS first value#206]
+- LocalRelation [values#3, A#198]
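For completeness, here is the groupBy-on-expression variant mentioned above as a full snippet; it is the same logic without the intermediate withColumn (a sketch under the same assumptions as the snippet above, i.e. spark.implicits._ in scope for the $ syntax):
import org.apache.spark.sql.{functions => F}
df2.
  groupBy(F.regexp_extract($"values", "job_[0-9]+", 0).alias("A")). // Group directly on the extracted key
  agg(F.first("values").as("first value")).                         // Keep the first value per key
  drop("A").                                                        // Drop the grouping key, as before
  show()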
Alternatively, split your data into two columns and work on those:
val aux = Seq("A:job_1, B:whatever1", "A:job_1, B:whatever2", "A:job_2, B:whatever3")
  .map(x => (x.split(",")(0).replace("A:", ""), x.split(",")(1).replace("B:", "")))
  .toDF("A", "B")
  .groupBy("A")
  .agg(first("B").as("first value")) // needs import org.apache.spark.sql.functions.first
I removed the A: and B: prefixes, but that is not necessary.
Or you can try:
df2.withColumn("A", col("values").substr(3, 5)) // the column is "values"; positions 3-7 hold "job_N" (single-digit ids)
  .groupBy("A")
  .agg(first("values").as("first value"))
Method 1:
Querying a parquet file directly as:
val sqlDF = spark.sql("SELECT columns FROM parquet.`sample.parquet`")
and
Method 2:
Querying the DataFrame after reading a parquet file as:
df = spark.read.parquet(path_to_parquet_file)
df.select(columns)
and
Method 3:
Querying a temporary view as:
df.createOrReplaceTempView("sample")
val sqlDF = spark.sql("SELECT columns FROM sample")
Behind the scenes, are all 3 essentially executed the same way?
In Method 1, does the parquet get converted into a DataFrame/Dataset before query execution?
Which of the 3 methods is most efficient, and why? (if they differ)
Is there a specific use case for each of these methods? (if they differ)
Thank you!
Short Answer
Yes. The 3 ways you have illustrated of querying a Parquet file using Spark are executed in the same way.
Long Answer
The reason why this is so is a combination of two features of Spark: lazy evaluation & query optimization.
As a developer, you could split the Spark operations into multiple steps (as you have done in method 2). Internally, Spark (lazily) evaluates the operations in conjunction and applies optimizations on it. In this case, Spark could optimize the operations by column pruning (basically, it will not read the entire parquet data into memory; only the specific columns you have requested.)
The 3rd method, creating a temporary view, is just about naming the data you have read so that you can reference it in further operations. It does not change how it is computed in the first place.
For more information on the optimizations performed by Spark when reading Parquet, refer to this in-depth article.
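One easy way to see the column pruning in action is to compare the explain output for the first two methods; the path and column name below are only illustrative:
// Both forms should show a FileScan whose ReadSchema contains only the requested
// column, confirming that Spark reads just that column from the Parquet files.
spark.sql("SELECT name FROM parquet.`/tmp/sample.parquet`").explain()
spark.read.parquet("/tmp/sample.parquet").select("name").explain()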
NOTE:
As I have mentioned in a comment on the question, you select specific columns in Method 2, while the other two read the entire data. Since these are essentially different operations, there will be a difference in execution. The answer above assumes that similar operations are performed in each of the three methods (either reading the complete data, or just some specific columns from the file).
If you are trying to evaluate which of the 3 is best for the same objective, there is no difference between them: the physical plan answers your 'behind the scenes?' question.
Method 1:
sqlDF = spark.sql("SELECT CallNumber,CallFinalDisposition FROM parquet.`/tmp/ParquetA`").show()
== Physical Plan ==
CollectLimit 21
+- *(1) Project [cast(CallNumber#2988 as string) AS CallNumber#3026, CallFinalDisposition#2992]
+- *(1) FileScan parquet [CallNumber#2988,CallFinalDisposition#2992] Batched: true, Format: Parquet, Location: InMemoryFileIndex[dbfs:/tmp/ParquetA], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<CallNumber:int,CallFinalDisposition:string>
Method 2:
df = spark.read.parquet('/tmp/ParquetA')
df.select("CallNumber","CallFinalDisposition").show()
== Physical Plan ==
CollectLimit 21
+- *(1) Project [cast(CallNumber#3100 as string) AS CallNumber#3172, CallFinalDisposition#3104]
+- *(1) FileScan parquet [CallNumber#3100,CallFinalDisposition#3104] Batched: true, Format: Parquet, Location: InMemoryFileIndex[dbfs:/tmp/ParquetA], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<CallNumber:int,CallFinalDisposition:string>
Method 3:
tempDF = spark.read.parquet('/tmp/ParquetA/')
tempDF.createOrReplaceTempView("temptable");
tiny = spark.sql("SELECT CallNumber,CallFinalDisposition FROM temptable").show()
== Physical Plan ==
CollectLimit 21
+- *(1) Project [cast(CallNumber#2910 as string) AS CallNumber#2982, CallFinalDisposition#2914]
+- *(1) FileScan parquet [CallNumber#2910,CallFinalDisposition#2914] Batched: true, Format: Parquet, Location: InMemoryFileIndex[dbfs:/tmp/ParquetA], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<CallNumber:int,CallFinalDisposition:string>
I'm trying to join two DataFrames, one of around 10 million records and the other about 1/3 of that. Since the smaller DataFrame fits comfortably in the executors' memory, I perform a broadcast join and then write out the result:
val df = spark.read.parquet("/plablo/data/tweets10M")
.select("id", "content", "lat", "lon", "date")
val fullResult = FilterAndClean.performFilter(df, spark)
.select("id", "final_tokens")
.filter(size($"final_tokens") > 1)
val fullDFWithClean = {
df.join(broadcast(fullResult), "id")
}
fullDFWithClean
.write
.partitionBy("date")
.mode(saveMode = SaveMode.Overwrite)
.parquet("/plablo/data/cleanTokensSpanish")
After a while, I get this error:
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResultInForkJoinSafely(ThreadUtils.scala:215)
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:125)
at org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:231)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:124)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:124)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:123)
at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98)
at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenInner(BroadcastHashJoinExec.scala:197)
at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:82)
at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
at org.apache.spark.sql.execution.ProjectExec.consume(basicPhysicalOperators.scala:36)
at org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:68)
at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
at org.apache.spark.sql.execution.FilterExec.consume(basicPhysicalOperators.scala:88)
at org.apache.spark.sql.execution.FilterExec.doConsume(basicPhysicalOperators.scala:209)
at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
at org.apache.spark.sql.execution.FileSourceScanExec.consume(DataSourceScanExec.scala:141)
at org.apache.spark.sql.execution.FileSourceScanExec.doProduceVectorized(DataSourceScanExec.scala:392)
at org.apache.spark.sql.execution.FileSourceScanExec.doProduce(DataSourceScanExec.scala:315)
.....
There's this question that addresses the same issue. In the comments, it's mentioned that increasing spark.sql.broadcastTimeout could fix the problem, but after setting a large value (5000 seconds) I still get the same error (although much later, of course).
The original data is partitioned by the date column. The function that returns fullResult performs a series of narrow transformations and filters the data, so I'm assuming the partitioning is preserved.
The physical plan confirms that Spark will perform a BroadcastHashJoin:
*Project [id#11, content#8, lat#5, lon#6, date#150, final_tokens#339]
+- *BroadcastHashJoin [id#11], [id#363], Inner, BuildRight
   :- *Project [id#11, content#8, lat#5, lon#6, date#150]
   :  +- *Filter isnotnull(id#11)
   :     +- *FileScan parquet [lat#5,lon#6,content#8,id#11,date#150] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://geoint1.lan:8020/plablo/data/tweets10M], PartitionCount: 182, PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<lat:double,lon:double,content:string,id:int>
   +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)))
      +- *Project [id#363, UDF(UDF(UDF(content#360))) AS final_tokens#339]
         +- *Filter (((UDF(UDF(content#360)) = es) && (size(UDF(UDF(UDF(content#360)))) > 1)) && isnotnull(id#363))
            +- *FileScan parquet [content#360,id#363,date#502] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://geoint1.lan:8020/plablo/data/tweets10M], PartitionCount: 182, PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<content:string,id:int>
I believe that, given the size of my data, this operation should be relatively fast (on 4 executors with 5 cores each and 4g RAM running on YARN in cluster mode).
Any help is appreciated
In situations like this, the first question is: how big is the DataFrame you are trying to broadcast? It's worth estimating its size (see this SO answer and this one as well).
Note that Spark's default spark.sql.autoBroadcastJoinThreshold is only 10MB, so you are really not supposed to broadcast very large datasets.
Your use of broadcast takes precedence and may be forcing Spark to do something it otherwise would choose not to do. A good rule is to only force aggressive optimization if the default behavior is unacceptable because aggressive optimization often creates various edge conditions, like the one you are experiencing.
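A quick way to sanity-check both points is sketched below; the stats-based estimate and the threshold value are illustrative, and the exact stats API differs slightly between Spark versions:
// Rough size estimate of the side you intend to broadcast
// (Spark 2.3+ exposes .stats; some older versions take a conf argument).
val estimatedBytes = fullResult.queryExecution.optimizedPlan.stats.sizeInBytes
println(s"Estimated broadcast-side size: $estimatedBytes bytes")
// Let Spark decide instead of forcing the broadcast: drop the hint and, if needed,
// raise the threshold cautiously (100 MB here, purely illustrative).
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", (100L * 1024 * 1024).toString)
val joinedWithoutHint = df.join(fullResult, "id")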
This can also fail if spark.task.maxDirectResultSize is not increased. Its default is 1 megabyte (1m). Try spark.task.maxDirectResultSize=10g.
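For reference, this is a core Spark setting, so it generally needs to be in place when the SparkContext is created rather than changed at runtime; a minimal sketch (the value and app name are illustrative):
import org.apache.spark.sql.SparkSession
// Or pass --conf spark.task.maxDirectResultSize=10g to spark-submit.
val spark = SparkSession.builder()
  .appName("broadcast-join-example")                // hypothetical app name
  .config("spark.task.maxDirectResultSize", "10g")
  .getOrCreate()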