Most Efficient Way to Join Massive/Small Datasets - scala

I currently have a large RDD called chartEvents containing data of the form:
case class ChartEvent(patientID: String, itemID: String, chartTime: String, storeTime: String, value: String,
valueNum: String, warning: String, error: String)
The data is coming from a 35 GB .csv file which I am parsing in using SQL:
CSVUtils.loadCSVAsTable(sqlContext, "data_unzipped/CHARTEVENTS.csv")
val chartEvents = sqlContext.sql(
"""
|SELECT SUBJECT_ID, ITEMID, CHARTTIME, STORETIME, VALUE, VALUENUM, WARNING, ERROR
|FROM CHARTEVENTS
""".stripMargin)
.map(r => ChartEvent(r(0).toString, r(1).toString, r(2).toString, r(3).toString, r(4).toString,
r(5).toString, r(6).toString, r(7).toString))
I have a separate, very small (fewer than 100 rows) RDD called featureMapping of the form RDD[(itemID, label)], where both are strings. What I am trying to do is filter the chartEvents RDD down to only those rows whose itemID appears in featureMapping. My current method is to perform an inner join of the two RDDs as follows:
val result = chartEvents.map{case event => (event.itemID, event)}.join(featureMapping)
However, I am noticing that this is on track to take several hours to run, and is using a massive amount of space in my /user/<user>/appdata/local/temp folder. Is there a more efficient way to perform this filtering? Would coding it into the sqlContext be faster?

If you register your tables in the Hive metastore, you can set spark.sql.autoBroadcastJoinThreshold so that the small table is broadcast to the workers rather than shuffled.
From the docs:
Configures the maximum size in bytes for a table that will be
broadcast to all worker nodes when performing a join. By setting this
value to -1 broadcasting can be disabled. Note that currently
statistics are only supported for Hive Metastore tables where the
command ANALYZE TABLE COMPUTE STATISTICS noscan has been
run.
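For the RDD-based code in the question, the same broadcast idea can be applied by hand. The following is only a sketch, assuming featureMapping is small enough to collect to the driver (the question says fewer than 100 rows); the object and method names are hypothetical. It returns the same shape as the original join, but the large RDD is never shuffled.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object MapSideFilter {
  case class ChartEvent(patientID: String, itemID: String, chartTime: String, storeTime: String,
                        value: String, valueNum: String, warning: String, error: String)

  // Hypothetical helper: keeps only events whose itemID is a key of the small mapping.
  def keepMappedEvents(sc: SparkContext,
                       chartEvents: RDD[ChartEvent],
                       featureMapping: RDD[(String, String)]): RDD[(String, (ChartEvent, String))] = {
    // Collect the small side to the driver and ship one read-only copy to each executor.
    val labels = sc.broadcast(featureMapping.collectAsMap())
    chartEvents.flatMap { event =>
      // A local hash lookup replaces the shuffle-based join.
      labels.value.get(event.itemID).map(label => (event.itemID, (event, label))).iterator
    }
  }
}
```

Note that collectAsMap keeps a single label per itemID, so this assumes itemIDs are unique in featureMapping.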

Related

Joining two clustered tables in spark dataset seems to end up with full shuffle

I have two hive clustered tables t1 and t2
CREATE EXTERNAL TABLE `t1`(
`t1_req_id` string,
...
PARTITIONED BY (`t1_stats_date` string)
CLUSTERED BY (t1_req_id) INTO 1000 BUCKETS
// t2 looks similar with same amount of buckets
The insert part happens in hive
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table `t1` partition(t1_stats_date,t1_stats_hour)
select *
from t1_raw
where t1_stats_date='2020-05-10' and t1_stats_hour='12' AND
t1_req_id is not null
The code looks as follows:
val t1 = spark.table("t1").as[T1]
val t2= spark.table("t2").as[T2]
val outDS = t1.joinWith(t2, t1("t1_req_id") === t2("t2_req_id"), "fullouter")
.map { case (t1Obj, t2Obj) =>
val t3:T3 = // do some logic
t3
}
outDS.toDF.write....
I see the projection in the DAG, but the job still seems to do a full data shuffle.
Also, looking at the executor logs, I don't see it reading the same bucket of the two tables in one chunk, which is what I would expect to find.
There are the spark.sql.sources.bucketing.enabled, spark.sessionState.conf.bucketingEnabled and
spark.sql.join.preferSortMergeJoin flags.
What am I missing? Why is there still a full shuffle if the tables are bucketed?
The current spark version is 2.3.1
One possibility to check for is a type mismatch, e.g. the join column being a string in T1 and a BIGINT in T2. Even if both types are integers (e.g. one is INT and the other BIGINT), Spark will still add a shuffle here, because different types use different hash functions for bucketing.
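A quick way to test for such a mismatch is to compare the declared types of the two join columns. This is a minimal sketch; the object and method names are hypothetical, and it only checks the schema, not the bucketing metadata itself.

```scala
import org.apache.spark.sql.DataFrame

object BucketJoinCheck {
  // Hypothetical helper: true when both join columns have exactly the same Spark SQL type.
  def joinKeyTypesMatch(left: DataFrame, leftKey: String,
                        right: DataFrame, rightKey: String): Boolean =
    left.schema(leftKey).dataType == right.schema(rightKey).dataType
}
```

With the tables from the question this would be called as BucketJoinCheck.joinKeyTypesMatch(spark.table("t1"), "t1_req_id", spark.table("t2"), "t2_req_id"). Calling explain() on the joined Dataset and looking for Exchange nodes shows whether a shuffle is still planned.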

Scala/Spark: Immutable Dataframes and Memory

I am very new to Scala. I have experience in Java and R.
I am confused about the immutability of DataFrames and memory management. The reason is this:
A data frame in R is also immutable, and in R this was found to be unworkable. Simplistically put, when working with a very large number of columns, each transformation led to a new data frame: 1000 consecutive operations on 1000 consecutive columns would lead to 1000 data frame objects. Now, most data scientists prefer R's data.table, which performs operations by reference on a single data.table object.
Scala's DataFrame (to a newbie) seems to have a similar problem. The following code, for example, seems to create 1000 DataFrames when renaming 1000 columns. Despite the foldLeft(), each call to withColumn() creates a new instance of DataFrame.
So, do I trust Scala to have very efficient garbage collection, or do I need to try to limit the number of immutable instances created? If the latter, what techniques should I be looking at?
def castAllTypedColumnsTo(df: DataFrame,
sourceType: DataType, targetType: DataType):
DataFrame =
{
val columnsToBeCasted = df.schema
.filter(s => s.dataType == sourceType)
if (columnsToBeCasted.length > 0)
{
println(s"Found ${columnsToBeCasted.length} columns " +
s"(${columnsToBeCasted.map(s => s.name).mkString(",")})" +
s" - casting to ${targetType.typeName.capitalize}Type")
}
columnsToBeCasted.foldLeft(df)
{ (foldedDf, col) =>
castColumnTo(foldedDf, col.name, targetType)
}
}
This method will return a new instance on each call
private def castColumnTo(df: DataFrame, cn: String, tpe: DataType):
DataFrame =
{
//println("castColumnTo")
df.withColumn(cn, df(cn).cast(tpe)
)
}
The difference is essentially laziness. Each new DataFrame that is returned is not materialized in memory. It just stores the base DataFrame and the function that should be applied to it. It's essentially an execution plan for how to create some data, not the data itself.
When it comes time to actually execute and save the result somewhere, then all 1000 operations can be applied to each row in parallel, so you get 1 additional output DataFrame. Spark condenses as many operations together as possible, and does not materialize anything unnecessary or that hasn't been explicitly requested to be saved or cached.
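To make the laziness concrete, here is a condensed sketch of the same foldLeft pattern (the object and method names are hypothetical). Calling it only builds a plan; no Spark job runs until an action such as count or write is invoked.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DataType

object LazyCasts {
  // Each withColumn returns a new, lightweight DataFrame wrapping a slightly larger plan.
  // No data is copied or computed here.
  def castAll(df: DataFrame, from: DataType, to: DataType): DataFrame =
    df.schema
      .filter(_.dataType == from)
      .foldLeft(df) { (acc, field) => acc.withColumn(field.name, col(field.name).cast(to)) }
}
```

Calling LazyCasts.castAll(df, IntegerType, LongType).explain() prints the resulting plan without running it. The intermediate DataFrame objects are small driver-side plan descriptions that become garbage as soon as the fold moves on.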

Split data frame into smaller ones and push a big dataframe to all executors?

I'm implementing the following logic using Spark.
1. Get the result of a table with 50K rows.
2. Get another table (about 30K rows).
3. For all the combinations between (1) and (2), do some work and get a value.
How about pushing the data frame of (2) to all executors and partition (1) and run each portion on each executor? How to implement it?
def getTable(t: String) =
sqlContext.read.format("jdbc").options(Map(
"driver" -> "com.microsoft.sqlserver.jdbc.SQLServerDriver",
"url" -> jdbcSqlConn,
"dbtable" -> s"$t"
)).load()
.select("col1", "col2", "col3")
val table1 = getTable("table1")
val table2 = getTable("table2")
// Split the rows in table1 and make N, say 32, data frames
val partitionedTable1 : List[Dataset[Row]] = splitToSmallerDFs(table1, 32) // How to implement it?
val result = partitionedTable1.map(x => {
val value = doWork(x, table2) // Is it good to send table2 to executors like this?
value
})
Question:
How to break a big data frame into small data frames? (repartition?)
Is it good to send table2 (pass a big data frame as a parameter) to executors like this?
How to break a big data frame into small data frames? (repartition?)
The simple answer is yes, repartition can be a solution.
The more challenging question is whether repartitioning a DataFrame into smaller partitions would improve the overall operation.
DataFrames are already distributed by nature: the operations you perform on them (join, groupBy, aggregations, functions and many more) are all executed where the data resides. For operations that need a shuffle, such as join, groupBy and aggregations, a prior repartition would be wasted, because:
groupBy operation would shuffle dataframe such that distinct groups would be in the same executor.
partitionBy in Window function performs the same way as groupBy
join operation would shuffle data in the same manner.
Is it good to send table2 (pass a big data frame as a parameter) to executors like this?
It's not good to pass the DataFrames as you did. Because you are passing DataFrames inside a transformation, table2 would not be visible to the executors.
I would suggest using a broadcast variable, as below:
val table2 = sparkContext.broadcast(getTable("table2"))
val result = partitionedTable1.map(x => {
val value = doWork(x, table2.value)
value
})
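One caveat on the snippet above: as far as I know, a DataFrame cannot be used inside code that runs on the executors, and wrapping it in a broadcast variable does not ship its rows. A variant that does is to collect the smaller table and broadcast the resulting array. The following is only a sketch, assuming table2 (about 30K rows) fits in memory on the driver and on each executor; doWork here is a hypothetical stand-in with a changed signature.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

object BroadcastRowsSketch {
  // Hypothetical stand-in for doWork: counts the rows of the small table
  // whose first column equals the first column of the given row.
  def doWork(left: Row, right: Array[Row]): Int =
    right.count(r => r.get(0) == left.get(0))

  def allCombinations(spark: SparkSession, table1: DataFrame, table2: DataFrame): RDD[Int] = {
    // Collect the small table once and broadcast the plain rows, not the DataFrame.
    val small = spark.sparkContext.broadcast(table2.collect())
    // table1 is already partitioned; each task processes one partition against the local copy.
    table1.rdd.mapPartitions(rows => rows.map(row => doWork(row, small.value)))
  }
}
```

There is no need to split table1 into separate DataFrames by hand: its partitions already are the units of parallel work, and repartition(n) only changes how many there are.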

How to tune mapping/filtering on big datasets (cross joined from two datasets)?

Spark 2.2.0
I have the following code converted from SQL script. It has been running for two hours and it's still running. Even slower than SQL Server. Is anything not done correctly?
The following is the plan,
Push table2 to all executors
Partition table1 and distribute the partitions to executors.
And each row in table2/t2 joins (cross join) each partition of table1.
So the calculation on the result of the cross-join can be run distributed/parallelly. (I wanted to, for example suppose I have 16 executors, keep a copy of t2 on all the 16 executors. Then divide table 1 into 16 partitions, one for each executor. Then each executor do the calculation on one partition of table 1 and t2.)
case class Cols (Id: Int, F2: String, F3: BigDecimal, F4: Date, F5: String,
F6: String, F7: BigDecimal, F8: String, F9: String, F10: String )
case class Result (Id1: Int, ID2: Int, Point: Int)
def getDataFromDB(source: String) = {
import sqlContext.sparkSession.implicits._
sqlContext.read.format("jdbc").options(Map(
"driver" -> "com.microsoft.sqlserver.jdbc.SQLServerDriver",
"url" -> jdbcSqlConn,
"dbtable" -> s"$source"
)).load()
.select("Id", "F2", "F3", "F4", "F5", "F6", "F7", "F8", "F9", "F10")
.as[Cols]
}
val sc = new SparkContext(conf)
val table1: Dataset[Cols] = getDataFromDB("table1").repartition(32).cache()
println(table1.count()) // about 300K rows
val table2: Dataset[Cols] = getDataFromDB("table2") // ~20K rows
table2.take(1)
println(table2.count())
val t2 = sc.broadcast(table2)
import org.apache.spark.sql.{functions => func}
val j = table1.joinWith(t2.value, func.lit(true))
j.map(x => {
val (l, r) = x
Result(l.Id, r.Id,
(if (l.F1!= null && r.F1!= null && l.F1== r.F1) 3 else 0)
+(if (l.F2!= null && r.F2!= null && l.F2== r.F2) 2 else 0)
+ ..... // All kind of the similiar expression
+(if (l.F8!= null && r.F8!= null && l.F8== r.F8) 1 else 0)
)
}).filter(x => x.Value >= 10)
println("Total count %d", j.count()) // This takes forever, the count will be about 100
How to rewrite it with Spark idiomatic way?
Ref: https://forums.databricks.com/questions/6747/how-do-i-get-a-cartesian-product-of-a-huge-dataset.html
(Somehow I feel as if I have seen the code already)
The code is slow because you use just a single task to load the entire dataset from the database using JDBC and despite cache it does not benefit from it.
Start by checking out the physical plan and Executors tab in web UI to find out about the single executor and the single task to do the work.
You should use one of the following to fine-tune the number of tasks for loading:
Use partitionColumn, lowerBound, upperBound options for the JDBC data source
Use predicates option
See JDBC To Other Databases in Spark's official documentation.
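As a sketch of the first option, the options map below adds the partitioning settings to the ones already used in the question. It assumes Id is a numeric column; the bounds and partition count are hypothetical and should be taken from the real data. Per the Spark documentation, numPartitions must be given together with partitionColumn, lowerBound and upperBound, and the bounds only determine the partition stride; they do not filter rows.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object PartitionedJdbcRead {
  // Builds JDBC options that split the read into `parts` range queries on `column`.
  def partitionedOptions(url: String, table: String, column: String,
                         lower: Long, upper: Long, parts: Int): Map[String, String] = Map(
    "driver" -> "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    "url" -> url,
    "dbtable" -> table,
    "partitionColumn" -> column,
    "lowerBound" -> lower.toString,
    "upperBound" -> upper.toString,
    "numPartitions" -> parts.toString
  )

  // One task per partition loads its own slice of the table in parallel.
  def load(spark: SparkSession, options: Map[String, String]): DataFrame =
    spark.read.format("jdbc").options(options).load()
}
```

For example, PartitionedJdbcRead.partitionedOptions(jdbcSqlConn, "table1", "Id", 1L, 300000L, 32) would describe 32 parallel range reads, assuming Id values run from roughly 1 to 300000.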
After you're fine with the loading, you should work on improving the last count action and add...another count action right after the following line:
val table1: Dataset[Cols] = getDataFromDB("table1").repartition(32).cache()
// trigger caching as it's lazy in Dataset API
table1.count
The reason why the entire query is slow is that you only mark table1 to be cached when an action gets executed, which is exactly at the end (!). In other words, cache does nothing useful and, more importantly, makes the query performance even worse.
Performance should also improve once you run table2.cache.count.
If you want to do cross join, use crossJoin operator.
crossJoin(right: Dataset[_]): DataFrame Explicit cartesian join with another DataFrame.
Please note the note from the scaladoc of crossJoin (no pun intended).
Cartesian joins are very expensive without an extra filter that can be pushed down.
The following requirement is already handled by Spark given all the optimizations available.
So the calculation on the result of the cross-join can be run distributed/parallelly.
That's Spark's job (again, no pun intended).
The following requirement begs for broadcast.
I wanted to, for example suppose I have 16 executors, keep a copy of t2 on all the 16 executors. Then divide table 1 into 16 partitions, one for each executor. Then each executor do the calculation on one partition of table 1 and t2.
Use broadcast function to hint Spark SQL's engine to use table2 in broadcast mode.
broadcast[T](df: Dataset[T]): Dataset[T] Marks a DataFrame as small enough for use in broadcast joins.
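Putting crossJoin and broadcast together, the scoring from the question could be written with column expressions roughly as follows. This is a sketch under assumptions: the weights are hypothetical placeholders for the elided scoring rules, and only two columns are shown. Since === yields null when either side is null, the when/otherwise falls through to 0 in that case, which matches the explicit null checks in the original.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{broadcast, col, when}

object BroadcastCrossJoin {
  // Hypothetical weights: points awarded when the named column is equal on both sides.
  val weights: Seq[(String, Int)] = Seq("F2" -> 2, "F3" -> 3)

  def score(table1: DataFrame, table2: DataFrame, minPoints: Int): DataFrame = {
    val l = table1.as("l")
    // Hint that the smaller table should be copied to every executor.
    val r = broadcast(table2.as("r"))
    val points = weights
      .map { case (name, w) => when(col(s"l.$name") === col(s"r.$name"), w).otherwise(0) }
      .reduce(_ + _)
    l.crossJoin(r)
      .select(col("l.Id").as("Id1"), col("r.Id").as("Id2"), points.as("Point"))
      .filter(col("Point") >= minPoints)
  }
}
```

Keeping the scoring as column expressions rather than a Scala lambda in map lets Spark optimize the whole expression and avoids deserializing every pair into case class instances.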

Best way to gain performance when doing a join count using spark and scala

I have a requirement to validate an ingest operation. Basically, I have two big files in HDFS: one is Avro-formatted (the ingested files), the other is Parquet-formatted (the consolidated file).
Avro file has this schema:
filename, date, count, afield1,afield2,afield3,afield4,afield5,afield6,...afieldN
Parquet file has this schema:
fileName,anotherField1,anotherField1,anotherField2,anotherFiel3,anotherField14,...,anotherFieldN
If I load both files into DataFrames and then use a naive join-where, the job on my local machine takes more than 24 hours, which is unacceptable.
ingestedDF.join(consolidatedDF).where($"filename" === $"fileName").count()
What is the best way to achieve this? Dropping columns from the DataFrames before doing the join-where-count? Calculating the counts per DataFrame and then joining and summing?
PS
I was reading about the map-side join technique, but it looks like it would only work for me if one of the files were small enough to fit in RAM, and I can't guarantee that, so I would like to know the community's preferred way to achieve this.
http://dmtolpeko.com/2015/02/20/map-side-join-in-spark/
I would approach this problem by stripping down the data to only the field I'm interested in (filename), making a unique set of the filename with the source it comes from (the origin dataset).
At this point, both intermediate datasets have the same schema, so we can union them and just count. This should be orders of magnitude faster than using a join on the complete data.
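import org.apache.spark.sql.functions.lit
import sparkSession.implicits._ // needed for the $"..." column syntax
// prepare some random dataset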
val data1 = (1 to 100000).filter(_ => scala.util.Random.nextDouble<0.8).map(i => (s"file$i", i, "rubbish"))
val data2 = (1 to 100000).filter(_ => scala.util.Random.nextDouble<0.7).map(i => (s"file$i", i, "crap"))
val df1 = sparkSession.createDataFrame(data1).toDF("filename", "index", "data")
val df2 = sparkSession.createDataFrame(data2).toDF("filename", "index", "data")
// select only the column we are interested in and tag it with the source.
// Lets make it distinct as we are only interested in the unique file count
val df1Filenames = df1.select("filename").withColumn("df", lit("df1")).distinct
val df2Filenames = df2.select("filename").withColumn("df", lit("df2")).distinct
// union both dataframes
val union = df1Filenames.union(df2Filenames).toDF("filename","source")
// let's count the occurrences of filename, by using a groupby operation
val occurrenceCount = union.groupBy("filename").count
// we're interested in the count of those files that appear in both datasets (with a count of 2)
occurrenceCount.filter($"count"===2).count
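If only the number of filenames common to both datasets is needed, a shorter variant of the same idea is to intersect the two filename columns. This is a sketch with hypothetical names; it relies on Dataset.intersect following SQL INTERSECT semantics, which returns distinct rows.

```scala
import org.apache.spark.sql.DataFrame

object SharedFilenames {
  // Counts the distinct filenames that occur in both DataFrames.
  def countShared(df1: DataFrame, df2: DataFrame): Long =
    df1.select("filename").intersect(df2.select("filename")).count()
}
```

As with the union approach, only the filename column is read and shuffled, not the full rows.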