Flink: Table API copy operators in execution plan - scala

I use Flink 1.2.0 Table API for processing some streaming data. Following is my code:
val dataTable = myDataStream
// table A
val tableA = dataTable
.window(Tumble over 5.minutes on 'rowtime as 'w)
.groupBy("w, group1, group2")
.select("w.start as time, group1, group2, data1.sum as data1, data2.sum as data2")
tableEnv.registerTable("tableA", tableA)
// table A sink
tableA.writeToSink(sinkTableA)
//...
// I shoul get some other outputs from TableA output
//...
val dataTable = tableEnv.ingest("tableA")
// table result1
val result1 = dataTable
.window(Tumble over 5.minutes on 'rowtime as 'w)
.groupBy("w, group1")
.select("w.start as time, group1, data1.sum as data1")
// result1 sink
result2.writeToSink(sinkResult1)
// table result2
val result2 = dataTable
.window(Tumble over 5.minutes on 'rowtime as 'w)
.groupBy("w, group2")
.select("w.start as time, group2, data2.sum as data1")
// result2 sink
result2.writeToSink(sinkResult2)
I wait to get this tree in the flink execution plan.
Same as I have for Flink Streaming in my other Flink jobs.
DataStream_Operators -> TableA_Operators -> TableA_Sink
|-> Result1_Operators -> Result1_Sink
|-> Result2_Operators -> Result2_Sink
But, I get this with 3 copies of same opertoprs for TableA !
DataStream_Operators -> TableA_Operators -> TableA_Sink
|-> Copy_of_TableA_Operators -> Result1_Operators -> Result1_Sink
|-> Copy_of_TableA_Operators -> Result2_Operators -> Result2_Sink
I have bad performance with big input data for this job in result.
How I can fix this and get optimal execution plan ?
I undestand, what the Flink Table API and SQL are experimental features and
maybe it's will fixed in next versions.

At the current state, the Table API translates the whole query whenever you convert a Table into a DataSet or DataStream or write it to a TableSink. In your program, you call three times writeToSink which means that the each time the complete query is translated.
But what is the complete query? There are all Table API operators that have been applied on a Table. When you register a Table in the TableEnvironment it is basically registered as a view, i.e., only its definition (all the operators that define the Table) are registered. Therefore, these operators are translated again when you call writeToSink the second and third time.
You can solve this issue if you translate tableA into a DataStream and register the DataStream in the TableEnvironment instead for registering it as a Table. This would look as follows:
val tableA = ...
val streamA = tableA.toDataStream[X] // X should be a case class for rows of tableA
val tableEnv.registerDataStream("tableA", streamA)
tableEnv.ingest("tableA").writeToSink(sinkTableA) // emit tableA by ingesting the registered DataStream
I know, this is not very convenient but at the moment the only way to avoid repeated translation of a Table.

Related

Why clustering seems doesn't work in spark cogroup function

I have two hive clustered tables t1 and t2
CREATE EXTERNAL TABLE `t1`(
`t1_req_id` string,
...
PARTITIONED BY (`t1_stats_date` string)
CLUSTERED BY (t1_req_id) INTO 1000 BUCKETS
// t2 looks similar with same amount of buckets
the code looks like as following:
val t1 = spark.table("t1").as[T1].rdd.map(v => (v.t1_req_id, v))
val t2= spark.table("t2").as[T2].rdd.map(v => (v.t2_req_id, v))
val outRdd = t1.cogroup(t2)
.flatMap { coGroupRes =>
val key = coGroupRes._1
val value: (Iterable[T1], Iterable[T2])= coGroupRes._2
val t3List = // create a list with some logic on Iterable[T1] and Iterable[T2]
t3List
}
outRdd.write....
I make sure that the both t1 and t2 table has same amount of partitions, and on spark-submit there are
spark.sql.sources.bucketing.enabled=true and spark.sessionState.conf.bucketingEnabled=true flags
But Spark DAG doesn't show any impact of clustering. It seems there is still data full shuffle
What am I missing, any other configurations, tunings? How can it be assured that there is no full data shuffle?
My spark version is 2.3.1
And it shouldn't show.
Any logical optimizations are limited to DataFrame API. Once you push data to black-box functional dataset API (see Spark 2.0 Dataset vs DataFrame) and later to RDD API, no more information is pushed back to the optimizer.
You could partially utilize bucketing by performing join first, getting something around these lines
spark.table("t1")
.join(spark.table("t2"), $"t1.t1_req_id" === $"t2.t2_req_id", "outer")
.groupBy($"t1.v.t1_req_id", $"t2.t2_req_id")
.agg(...) // For example collect_set($"t1.v"), collect_set($"t2.v")
However, unlike cogroup, this will generate full Cartesian Products within groups, and might be not applicable in your case

Joining two clustered tables in spark dataset seems to end up with full shuffle

I have two hive clustered tables t1 and t2
CREATE EXTERNAL TABLE `t1`(
`t1_req_id` string,
...
PARTITIONED BY (`t1_stats_date` string)
CLUSTERED BY (t1_req_id) INTO 1000 BUCKETS
// t2 looks similar with same amount of buckets
The insert part happens in hive
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table `t1` partition(t1_stats_date,t1_stats_hour)
select *
from t1_raw
where t1_stats_date='2020-05-10' and t1_stats_hour='12' AND
t1_req_id is not null
The code looks like as following:
val t1 = spark.table("t1").as[T1]
val t2= spark.table("t2").as[T2]
val outDS = t1.joinWith(t2, t1("t1_req_id) === t2("t2_req_id), "fullouter")
.map { case (t1Obj, t2Obj) =>
val t3:T3 = // do some logic
t3
}
outDS.toDF.write....
I see projection in DAG - but it seems that the job still does full data shuffle
Also, while looking into the logs of executor I don't see it reads the same bucket of the two tables in one chunk - that what I would expect to find
There are spark.sql.sources.bucketing.enabled, spark.sessionState.conf.bucketingEnabled and
spark.sql.join.preferSortMergeJoin flags
What am I missing? and why is there still full shuffle, if there are bucketed tables?
The current spark version is 2.3.1
One possibility here to check for is if you have a type mismatch. E.g. if the type of the join column is string in T1 and BIGINT in T2. Even if the types are both integer (e.g. one is INT, another BIGINT) Spark will still add shuffle here because different types use different hash functions for bucketing.

How to Compare columns of two tables using Spark?

I am trying to compare two tables() by reading as DataFrames. And for each common column in those tables using concatenation of a primary key say order_id with other columns like order_date, order_name, order_event.
The Scala Code I am using
val primary_key=order_id
for (i <- commonColumnsList){
val column_name = i
val tempDataFrameForNew = newDataFrame.selectExpr(s"concat($primaryKey,$i) as concatenated")
val tempDataFrameOld = oldDataFrame.selectExpr(s"concat($primaryKey,$i) as concatenated")
//Get those records which aren common in both old/new tables
matchCountCalculated = tempDataFrameForNew.intersect(tempDataFrameOld)
//Get those records which aren't common in both old/new tables
nonMatchCountCalculated = tempDataFrameOld.unionAll(tempDataFrameForNew).except(matchCountCalculated)
//Total Null/Non-Null Counts in both old and new tables.
nullsCountInNewDataFrame = newDataFrame.select(s"$i").filter(x => x.isNullAt(0)).count().toInt
nullsCountInOldDataFrame = oldDataFrame.select(s"$i").filter(x => x.isNullAt(0)).count().toInt
nonNullsCountInNewDataFrame = newDFCount - nullsCountInNewDataFrame
nonNullsCountInOldDataFrame = oldDFCount - nullsCountInOldDataFrame
//Put the result for a given column in a Seq variable, later convert it to Dataframe.
tempSeq = tempSeq :+ Row(column_name, matchCountCalculated.toString, nonMatchCountCalculated.toString, (nullsCountInNewDataFrame - nullsCountInOldDataFrame).toString,
(nonNullsCountInNewDataFrame - nonNullsCountInOldDataFrame).toString)
}
// Final Step: Create DataFrame using Seq and some Schema.
spark.createDataFrame(spark.sparkContext.parallelize(tempSeq), schema)
The above code is working fine for a medium set of Data, but as the number of Columns and Records increases in my New & Old Table, the execution time is increasing. Any sort of advice is appreciated.
Thank you in Advance.
You can do the following:
1. Outer join the old and new dataframe on priamary key
joined_df = df_old.join(df_new, primary_key, "outer")
2. Cache it if you possibly can. This will save you a lot of time
3. Now you can iterate over columns and compare columns using spark functions (.isNull for not matched, == for matched etc)
for (col <- df_new.columns){
val matchCount = df_joined.filter(df_new[col].isNotNull && df_old[col].isNotNull).count()
val nonMatchCount = ...
}
This should be considerably faster, especially when you can cache your dataframe. If you can't it might be a good idea so save the joined df to disk in order to avoid a shuffle each time

Split data frame into smaller ones and push a big dataframe to all executors?

I'm implementing the following logic using Spark.
Get the result of a table with 50K rows.
Get another table (about 30K rows).
For all the combination between (1) and (2), do some work and get a value.
How about pushing the data frame of (2) to all executors and partition (1) and run each portion on each executor? How to implement it?
val getTable(t String) =
sqlContext.read.format("jdbc").options(Map(
"driver" -> "com.microsoft.sqlserver.jdbc.SQLServerDriver",
"url" -> jdbcSqlConn,
"dbtable" -> s"$t"
)).load()
.select("col1", "col2", "col3")
val table1 = getTable("table1")
val table2 = getTable("table2")
// Split the rows in table1 and make N, say 32, data frames
val partitionedTable1 : List[DataSet[Row]] = splitToSmallerDFs(table1, 32) // How to implement it?
val result = partitionedTable1.map(x => {
val value = doWork(x, table2) // Is it good to send table2 to executors like this?
value
})
Question:
How to break a big data frame into small data frames? (repartition?)
Is it good to send table2 (pass a big data frame as a parameter) to executors like this?
How to break a big data frame into small data frames? (repartition?)
Simple answer would be Yes repartion can be a solution.
The challanging question can be, Would repartitioning a dataframe to smaller partition improve the overall operation?
Dataframes are already distributed in nature. Meaning that the operation you perform on dataframes like join, groupBy, aggregations, functions and many more are all executed where the data is residing. But the operations such as join, groupBy, aggregations where shuffling is needed, repartition would be void as
groupBy operation would shuffle dataframe such that distinct groups would be in the same executor.
partitionBy in Window function performs the same way as groupBy
join operation would shuffle data in the same manner.
Is it good to send table2 (pass a big data frame as a parameter) to executors like this?
Its not good to pass the dataframes as you did. As you are passing dataframes inside transformation so the table2 would not be visible to the executors.
I would suggest you to use broadcast variable
you can do as below
val table2 = sparkContext.broadcast(getTable("table2"))
val result = partitionedTable1.map(x => {
val value = doWork(x, table2.value)
value
})

How to tune mapping/filtering on big datasets (cross joined from two datasets)?

Spark 2.2.0
I have the following code converted from SQL script. It has been running for two hours and it's still running. Even slower than SQL Server. Is anything not done correctly?
The following is the plan,
Push table2 to all executors
Partition table1 and distribute the partitions to executors.
And each row in table2/t2 joins (cross join) each partition of table1.
So the calculation on the result of the cross-join can be run distributed/parallelly. (I wanted to, for example suppose​ I have 16 executors, keep a copy of t2 on all the 16 executors. Then divide table 1 into 16 partitions, one for each executor. Then each executor do the calculation on one partition of table 1 and t2.)
case class Cols (Id: Int, F2: String, F3: BigDecimal, F4: Date, F5: String,
F6: String, F7: BigDecimal, F8: String, F9: String, F10: String )
case class Result (Id1: Int, ID2: Int, Point: Int)
def getDataFromDB(source: String) = {
import sqlContext.sparkSession.implicits._
sqlContext.read.format("jdbc").options(Map(
"driver" -> "com.microsoft.sqlserver.jdbc.SQLServerDriver",
"url" -> jdbcSqlConn,
"dbtable" -> s"$source"
)).load()
.select("Id", "F2", "F3", "F4", "F5", "F6", "F7", "F8", "F9", "F10")
.as[Cols]
}
val sc = new SparkContext(conf)
val table1:DataSet[Cols] = getDataFromDB("table1").repartition(32).cache()
println(table1.count()) // about 300K rows
val table2:DataSet[Cols] = getDataFromDB("table2") // ~20K rows
table2.take(1)
println(table2.count())
val t2 = sc.broadcast(table2)
import org.apache.spark.sql.{functions => func}
val j = table1.joinWith(t2.value, func.lit(true))
j.map(x => {
val (l, r) = x
Result(l.Id, r.Id,
(if (l.F1!= null && r.F1!= null && l.F1== r.F1) 3 else 0)
+(if (l.F2!= null && r.F2!= null && l.F2== r.F2) 2 else 0)
+ ..... // All kind of the similiar expression
+(if (l.F8!= null && r.F8!= null && l.F8== r.F8) 1 else 0)
)
}).filter(x => x.Value >= 10)
println("Total count %d", j.count()) // This takes forever, the count will be about 100
How to rewrite it with Spark idiomatic way?
Ref: https://forums.databricks.com/questions/6747/how-do-i-get-a-cartesian-product-of-a-huge-dataset.html
(Somehow I feel as if I have seen the code already)
The code is slow because you use just a single task to load the entire dataset from the database using JDBC and despite cache it does not benefit from it.
Start by checking out the physical plan and Executors tab in web UI to find out about the single executor and the single task to do the work.
You should use one of the following to fine-tune the number of tasks for loading:
Use partitionColumn, lowerBound, upperBound options for the JDBC data source
Use predicates option
See JDBC To Other Databases in Spark's official documentation.
After you're fine with the loading, you should work on improving the last count action and add...another count action right after the following line:
val table1: DataSet[Cols] = getDataFromDB("table1").repartition(32).cache()
// trigger caching as it's lazy in Dataset API
table1.count
The reason why the entire query is slow is that you only mark table1 to be cached when an action gets executed which is exactly at the end (!) In other words, cache does nothing useful and more importantly makes the query performance even worse.
Performance will increase after you table2.cache.count too.
If you want to do cross join, use crossJoin operator.
crossJoin(right: Dataset[_]): DataFrame Explicit cartesian join with another DataFrame.
Please note the note from the scaladoc of crossJoin (no pun intended).
Cartesian joins are very expensive without an extra filter that can be pushed down.
The following requirement is already handled by Spark given all the optimizations available.
So the calculation on the result of the cross-join can be run distributed/parallelly.
That's Spark's job (again, no pun intended).
The following requirement begs for broadcast.
I wanted to, for example suppose​ I have 16 executors, keep a copy of t2 on all the 16 executors. Then divide table 1 into 16 partitions, one for each executor. Then each executor do the calculation on one partition of table 1 and t2.)
Use broadcast function to hint Spark SQL's engine to use table2 in broadcast mode.
broadcast[T](df: Dataset[T]): Dataset[T] Marks a DataFrame as small enough for use in broadcast joins.