Apache Spark SQL: how to optimize chained join for dataframe - scala

I have to left-join a principal DataFrame with several reference DataFrames, which means a chain of joins, and I wonder how to make this efficient and scalable.
Method 1, which is my current approach, is easy to understand, but I'm not satisfied with it: all the transformations are chained and nothing is computed until the final action triggers the computation. If I keep adding transformations and the volume of data grows, Spark eventually fails, so this method is not scalable.
Method 1:
// Composes all steps into a single function; nothing runs until an action is called on its result.
def pipeline(refDF1: DataFrame, refDF2: DataFrame, refDF3: DataFrame, refDF4: DataFrame, refDF5: DataFrame): DataFrame => DataFrame = {
  val transformations: List[DataFrame => DataFrame] = List(
    castColumnsFromStringToLong(ColumnsToCastToLong),
    castColumnsFromStringToFloat(ColumnsToCastToFloat),
    renameColumns(RenameMapping),
    filterAndDropColumns,
    joinRefDF1(refDF1),
    joinRefDF2(refDF2),
    joinRefDF3(refDF3),
    joinRefDF4(refDF4),
    joinRefDF5(refDF5),
    calculate()
  )
  transformations.reduce(_ andThen _)
}
pipeline(refDF1, refDF2, refDF3, refDF4, refDF5)(principleDF)
Method 2: I haven't found a real way to implement my idea, but I would like to trigger the computation of each join immediately.
In my tests, count() is too heavy for Spark and its result is useless for my application, but I don't know how to trigger the join computation with a cheaper action. Such an action is, in fact, the answer to this question.
val joinedDF_1 = castColumnsFromStringToLong(principleDF, ColumnsToCastToLong)
joinedDF_1.cache() // joinedDF is not always used multiple times, but for some data frame, it is, so I add cache() to indicate the usage
joinedDF_1.count()
val joinedDF_2 = castColumnsFromStringToFloat(joinedDF_1, ColumnsToCastToFloat)
joinedDF_2.cache()
joinedDF_2.count()
val joinedDF_3 = renameColumns(joinedDF_2, RenameMapping)
joinedDF_3.cache()
joinedDF_3.count()
val joinedDF_4 = filterAndDropColumns(joinedDF_3)
joinedDF_4.cache()
joinedDF_4.count()
...

When you want to force the computation of a given join (or of any transformation that is not final) in Spark, you can call a simple show or count on your DataFrame. Such terminal operations force the result to be computed, because otherwise the action simply cannot be executed.
Only after this will your DataFrame actually be stored in the cache.
Once you are finished with a given DataFrame, don't hesitate to call unpersist on it. This releases the cached data when your cluster needs more room for further computation.
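The cache, action, unpersist cycle described above can be sketched as follows. This is a minimal illustration that assumes a local SparkSession; `StepwiseJoin`, `joinAndMaterialize` and the tiny DataFrames are hypothetical stand-ins, not the question's code.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object StepwiseJoin {
  // Left-joins `left` with `ref` on a shared key column, marks the result for
  // caching and runs count() so the join is computed and cached right away.
  def joinAndMaterialize(left: DataFrame, ref: DataFrame, key: String): DataFrame = {
    val joined = left.join(ref, Seq(key), "left").cache()
    // count() scans every partition, so the whole result lands in the cache.
    // show() only evaluates enough partitions to print its rows.
    joined.count()
    joined
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("stepwise-join").getOrCreate()
    import spark.implicits._

    // Made-up data standing in for the principal and one reference DataFrame.
    val principal = Seq((1L, "a"), (2L, "b"), (3L, "c")).toDF("id", "value")
    val ref1 = Seq((1L, "x"), (3L, "z")).toDF("id", "ref1")

    val joined = joinAndMaterialize(principal, ref1, "id")
    joined.show()                             // served from the cache
    joined.filter($"ref1".isNotNull).show()   // reused without recomputing the join

    joined.unpersist() // release the cached blocks once the DataFrame is no longer needed
    spark.stop()
  }
}
```

Whether paying for a full count() at every step is worthwhile depends on how often each intermediate result is reused.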

You need to repartition your datasets by the join columns before calling the join transformation.
Example:
val df1Part = df1.repartition(col("col1"), col("col2"))
val df2Part = df2.repartition(col("col1"), col("col2"))
val joinDF = df1Part.join(df2Part, df1Part("col1") === df2Part("col1") && ....)

Try creating a new dataframe based on it.
Ex:
val dfTest = session.createDataFrame(df.rdd, df.schema).cache()
dfTest.storageLevel.useMemory // the result should be true

Related

Why doesn't clustering seem to work in Spark's cogroup function

I have two hive clustered tables t1 and t2
CREATE EXTERNAL TABLE `t1`(
`t1_req_id` string,
...
PARTITIONED BY (`t1_stats_date` string)
CLUSTERED BY (t1_req_id) INTO 1000 BUCKETS
// t2 looks similar with same amount of buckets
The code looks like the following:
val t1 = spark.table("t1").as[T1].rdd.map(v => (v.t1_req_id, v))
val t2 = spark.table("t2").as[T2].rdd.map(v => (v.t2_req_id, v))
val outRdd = t1.cogroup(t2)
  .flatMap { coGroupRes =>
    val key = coGroupRes._1
    val value: (Iterable[T1], Iterable[T2]) = coGroupRes._2
    val t3List = // create a list with some logic on Iterable[T1] and Iterable[T2]
    t3List
  }
outRdd.write....
I made sure that both tables t1 and t2 have the same number of partitions, and spark-submit is run with the
spark.sql.sources.bucketing.enabled=true and spark.sessionState.conf.bucketingEnabled=true flags.
But the Spark DAG doesn't show any effect of the clustering. There still seems to be a full shuffle of the data.
What am I missing? Are there any other configurations or tunings? How can I make sure that there is no full data shuffle?
My Spark version is 2.3.1.
And it shouldn't show.
Logical optimizations are limited to the DataFrame API. Once you push data into the black-box functional Dataset API (see Spark 2.0 Dataset vs DataFrame) and later into the RDD API, no more information is passed back to the optimizer.
You could partially utilize bucketing by performing the join first, with something along these lines:
spark.table("t1")
.join(spark.table("t2"), $"t1.t1_req_id" === $"t2.t2_req_id", "outer")
.groupBy($"t1.v.t1_req_id", $"t2.t2_req_id")
.agg(...) // For example collect_set($"t1.v"), collect_set($"t2.v")
However, unlike cogroup, this will generate full Cartesian products within groups, and might not be applicable in your case.

Spark Transactional remove rows

I am working with DataFrames in Scala in a banking process, and I need to remove some rows when a transaction is a cancellation. For example, if there is a cancellation, I must remove the previous row. If there are three consecutive cancellations, I must remove the three previous rows.
Initial DataFrame:
Expected DataFrame:
I would appreciate your help.
A combination of built-in functions, a udf function and a window function should help you get your desired result (commented for clarity):
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
def windowSpec = Window.partitionBy("Account").orderBy("Sequence").rowsBetween(Long.MinValue, Long.MaxValue)
def filterUdf = udf((array: Seq[Long], sequence: Long) => !array.contains(sequence))
df
  // getting the count of cancellations in each group
  .withColumn("collection", sum(when(col("Type") === "Cancellation", 1).otherwise(0)).over(windowSpec))
  // getting the difference between the sequence number and the count, i.e. the sequence number of the previous row
  .withColumn("Sequence", when(col("Type") === "Cancellation", col("Sequence") - col("collection")).otherwise(col("Sequence")))
  // collecting the differenced sequence numbers of the cancellations
  .withColumn("collection", collect_set(when(col("Type") === "Cancellation", col("Sequence")).otherwise(0)).over(windowSpec))
  // filtering out the rows by calling the udf
  .filter(filterUdf(col("collection"), col("Sequence")))
  .drop("collection")
  .show(false)
which should give you
+-------+-----------+--------+
|Account|Type       |Sequence|
+-------+-----------+--------+
|11047  |Aggregation|11      |
|1030583|Aggregation|1       |
|1030583|Aggregation|4       |
+-------+-----------+--------+
Note: This solution works only when the cancellations are sequential within each Account group.
I think a map of stacks, keyed by account id, is useful in this case. You push the Aggregation rows onto the stack until you encounter a Cancellation, then you pop the stack.
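The stack idea can be sketched in plain Scala, outside Spark. `CancellationStack`, `Txn`, `removeCancelled` and `removeCancelledPerAccount` are hypothetical names, and the row shape is only assumed from the Account, Type and Sequence columns shown above. A cancellation pops the most recent surviving row and is itself dropped, so consecutive cancellations remove as many previous rows.

```scala
object CancellationStack {
  // Assumed row shape, modelled on the Account / Type / Sequence columns.
  final case class Txn(account: String, kind: String, sequence: Long)

  // Replays the rows of one account in sequence order. A normal row is pushed
  // onto the stack; a cancellation pops the latest surviving row (if any).
  def removeCancelled(rows: Seq[Txn]): Seq[Txn] = {
    val stack = rows.sortBy(_.sequence).foldLeft(List.empty[Txn]) { (acc, row) =>
      if (row.kind == "Cancellation") acc.drop(1) // pop; a no-op on an empty stack
      else row :: acc                             // push
    }
    stack.reverse // restore ascending sequence order
  }

  // One stack per account, keyed by account id.
  def removeCancelledPerAccount(rows: Seq[Txn]): Map[String, Seq[Txn]] =
    rows.groupBy(_.account).map { case (account, txns) => account -> removeCancelled(txns) }

  def main(args: Array[String]): Unit = {
    // Made-up rows, for illustration only.
    val rows = Seq(
      Txn("A", "Aggregation", 1), Txn("A", "Aggregation", 2),
      Txn("A", "Cancellation", 3), Txn("A", "Aggregation", 4))
    removeCancelledPerAccount(rows).foreach(println)
  }
}
```

Inside Spark, the same function could presumably be applied per account after grouping a typed Dataset by account, at the cost of holding one account's rows in memory at a time.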

How to get the row top 1 in Spark Structured Streaming?

I have an issue with Spark Streaming (Spark 2.2.1). I am developing a real-time pipeline where I first get data from Kafka, then join the result with another table, and then send the DataFrame to an ALS model (Spark ML), which returns a streaming DataFrame with one additional column, prediction. The problem is when I tried to get the row with the highest score, I couldn't find a way to resolve it.
I tried:
Applying SQL functions such as limit, take and sort
The dense_rank() function
Searching Stack Overflow
I read the Unsupported Operations section, but there doesn't seem to be much there.
Additionally, I would like to send the row with the highest score to a Kafka queue.
My code is as follows:
val result = lines.selectExpr("CAST(value AS STRING)")
.select(from_json($"value", mySchema).as("data"))
//.select("data.*")
.selectExpr("cast(data.largo as int) as largo","cast(data.stock as int) as stock","data.verificavalormax","data.codbc","data.ide","data.timestamp_cli","data.tef_cli","data.nombre","data.descripcion","data.porcentaje","data.fechainicio","data.fechafin","data.descripcioncompleta","data.direccion","data.coordenadax","data.coordenaday","data.razon_social","data.segmento_app","data.categoria","data.subcategoria")
result.printSchema()
val model = ALSModel.load("ALSParaTiDos")
val fullPredictions = model.transform(result)
//fullPredictions is a streaming dataframe with a extra column "prediction", here i need the code to get the first row
val query = fullPredictions.writeStream.format("console").outputMode(OutputMode.Append()).option("truncate", "false").start()
query.awaitTermination()
Update
Maybe I was not clear, so I'm attaching an image that shows my problem. I also wrote a simpler piece of code to complement it, detailing what I need: https://gist.github.com/.../9193c8a983c9007e8a1b6ec280d8df25
I would appreciate any help :)
TL;DR Use stream-stream inner joins (Spark 2.3.0) or use memory sink (or a Hive table) for a temporary storage.
I think that the following sentence describes your case very well:
The problem is when I tried to get the row with the highest score, I couldn't find a way to resolve it.
Machine learning aside (it simply gives you a streaming Dataset with predictions), the real problem here is finding the maximum value of a column in a streaming Dataset.
The first step is to calculate the max value as follows (copied directly from your code):
streaming.groupBy("idCustomer").agg(max("score") as "maxscore")
With that, you have two streaming Datasets that you can join as of Spark 2.3.0 (which was released a few days ago):
In Spark 2.3, we have added support for stream-stream joins, that is, you can join two streaming Datasets/DataFrames.
Inner joins on any kind of columns along with any kind of join conditions are supported.
Inner join the streaming Datasets and you're done.
Try this:
Implement a function that extracts the maximum value of the column and then filters your DataFrame by that maximum:
def getDataFrameMaxRow(df: DataFrame, col: String): DataFrame = {
  // Gson instance used to parse each JSON-encoded row
  val gson = new Gson()
  // get the maximum value (collects the whole column to the driver)
  val list_prediction = df.select(col).toJSON.rdd
    .collect()
    .toList
    .map { x => gson.fromJson[JsonObject](x, classOf[JsonObject]) }
    .map { x => x.get(col).getAsString.toInt }
  val max = getMaxFromList(list_prediction)
  // filter dataframe by the maximum value
  df.filter(df(col) === max.toString)
}
def getMaxFromList(xs: List[Int]): Int = xs match {
  case Nil => throw new NoSuchElementException("cannot take the maximum of an empty list")
  case List(x) => x
  case x :: y :: rest => getMaxFromList((if (x > y) x else y) :: rest)
}
And in the body of your code add:
import com.google.gson.JsonObject
import com.google.gson.Gson
import org.apache.spark.sql.DataFrame
val fullPredictions = model.transform(result)
val df_with_row_max = getDataFrameMaxRow(fullPredictions, "prediction")
Good Luck !!
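For a static (non-streaming) DataFrame, the same result can probably be reached without the JSON round trip by letting Spark compute the maximum. This is a sketch rather than the code above: `MaxRow` and `maxRows` are hypothetical names and the sample predictions are made up. Like the version above it runs an action on the DataFrame, so it is not expected to work on a streaming DataFrame as is.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, max}

object MaxRow {
  // Returns every row whose value in `column` equals the column maximum.
  def maxRows(df: DataFrame, column: String): DataFrame = {
    // agg(max(...)) yields exactly one row; its only cell is null for an empty df,
    // in which case the filter below matches nothing.
    val maxValue = df.agg(max(col(column))).head().get(0)
    df.filter(col(column) === maxValue)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("max-row").getOrCreate()
    import spark.implicits._

    // Made-up predictions, for illustration only.
    val predictions = Seq(("u1", 0.2), ("u2", 0.9), ("u3", 0.5)).toDF("user", "prediction")
    maxRows(predictions, "prediction").show()
    spark.stop()
  }
}
```

Only a single aggregated value travels to the driver here, instead of the whole column.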

Scala/Spark: Immutable Dataframes and Memory

I am very new to Scala. I have experience in Java and R.
I am confused about the immutability of DataFrames and memory management. The reason is this:
A data frame in R is also immutable, and in R this turned out to be unworkable. Simplistically put, when working with a very large number of columns, each transformation led to a new data frame (1000 consecutive operations on 1000 consecutive columns would lead to 1000 data frame objects). Now, most data scientists prefer R's data.table, which performs operations by reference on a single data.table object.
Scala's DataFrame (to a newbie) seems to have a similar problem. The following code, for example, seems to create 1000 DataFrames when renaming 1000 columns. Despite the foldLeft(), each call to withColumn() creates a new instance of DataFrame.
So, do I trust a very efficient garbage collection in Scala, or do I need to try to limit the number of immutable instances created? If the latter, what techniques should I be looking at?
def castAllTypedColumnsTo(df: DataFrame, sourceType: DataType, targetType: DataType): DataFrame = {
  val columnsToBeCasted = df.schema.filter(s => s.dataType == sourceType)
  if (columnsToBeCasted.length > 0) {
    println(s"Found ${columnsToBeCasted.length} columns " +
      s"(${columnsToBeCasted.map(s => s.name).mkString(",")})" +
      s" - casting to ${targetType.typeName.capitalize}Type")
  }
  columnsToBeCasted.foldLeft(df) { (foldedDf, col) =>
    castColumnTo(foldedDf, col.name, targetType)
  }
}
This method will return a new instance on each call
private def castColumnTo(df: DataFrame, cn: String, tpe: DataType): DataFrame = {
  //println("castColumnTo")
  df.withColumn(cn, df(cn).cast(tpe))
}
The difference is essentially laziness. Each new DataFrame that is returned is not materialized in memory. It just stores the base DataFrame and the function that should be applied to it. It's essentially an execution plan for how to create some data, not the data itself.
When it comes time to actually execute and save the result somewhere, then all 1000 operations can be applied to each row in parallel, so you get 1 additional output DataFrame. Spark condenses as many operations together as possible, and does not materialize anything unnecessary or that hasn't been explicitly requested to be saved or cached.
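A small sketch of that laziness, assuming a local SparkSession: building the casts creates new DataFrame objects but runs no job, and all casts can also be expressed as one `select`, so the plan holds a single projection rather than one step per column. `LazyCasts` and `castAll` are hypothetical names, not part of the question's code.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DataType, LongType, StringType}

object LazyCasts {
  // Casts every column of sourceType to targetType in a single projection.
  def castAll(df: DataFrame, sourceType: DataType, targetType: DataType): DataFrame =
    df.select(df.schema.fields.map { f =>
      if (f.dataType == sourceType) col(f.name).cast(targetType).as(f.name)
      else col(f.name)
    }: _*)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("lazy-casts").getOrCreate()
    import spark.implicits._

    // Made-up data: two string columns and one double column.
    val raw = Seq(("1", "2", 3.0), ("4", "5", 6.0)).toDF("a", "b", "c")

    val casted = castAll(raw, StringType, LongType) // only a plan so far, no job has run
    casted.printSchema() // the new schema is known without touching the data
    casted.explain()     // prints the physical plan Spark would execute
    casted.show()        // the action: the casts are applied to the rows now
    spark.stop()
  }
}
```

The intermediate DataFrame objects are small plan descriptions, so they are cheap for the garbage collector regardless of how large the underlying data is.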

Split data frame into smaller ones and push a big dataframe to all executors?

I'm implementing the following logic using Spark.
1. Get the result of a table with 50K rows.
2. Get another table (about 30K rows).
3. For every combination of rows from (1) and (2), do some work and get a value.
How about pushing the data frame of (2) to all executors, partitioning (1), and running each portion on an executor? How can I implement that?
def getTable(t: String) =
  sqlContext.read.format("jdbc").options(Map(
    "driver" -> "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    "url" -> jdbcSqlConn,
    "dbtable" -> s"$t"
  )).load()
    .select("col1", "col2", "col3")
val table1 = getTable("table1")
val table2 = getTable("table2")
// Split the rows in table1 and make N, say 32, data frames
val partitionedTable1: List[Dataset[Row]] = splitToSmallerDFs(table1, 32) // How to implement it?
val result = partitionedTable1.map(x => {
  val value = doWork(x, table2) // Is it good to send table2 to executors like this?
  value
})
Question:
How to break a big data frame into small data frames? (repartition?)
Is it good to send table2 (pass a big data frame as a parameter) to executors like this?
How to break a big data frame into small data frames? (repartition?)
The simple answer is yes, repartition can be a solution.
The more challenging question is whether repartitioning a DataFrame into smaller partitions would improve the overall operation.
DataFrames are already distributed by nature, meaning that the operations you perform on them, such as join, groupBy, aggregations, functions and many more, are all executed where the data resides. But for operations that need shuffling, such as join, groupBy and aggregations, repartitioning beforehand would be void, because:
A groupBy operation shuffles the DataFrame so that each distinct group ends up on the same executor.
partitionBy in a Window function behaves the same way as groupBy.
A join operation shuffles data in the same manner.
Is it good to send table2 (pass a big data frame as a parameter) to executors like this?
It's not good to pass the DataFrames the way you did. Because you are passing a DataFrame inside a transformation, table2 would not be visible to the executors.
I would suggest using a broadcast variable.
You can do it as below:
val table2 = sparkContext.broadcast(getTable("table2"))
val result = partitionedTable1.map(x => {
val value = doWork(x, table2.value)
value
})
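An alternative that stays in the DataFrame API is a cross join with a broadcast hint, which asks Spark to ship the smaller table to every executor and pair it with each partition of the larger one. This is only a sketch under assumptions: `AllPairs`, `prefixed` and `scorePairs` are hypothetical names, `col1` is assumed to be numeric, and the multiplication merely stands in for `doWork`. As far as I know, a value wrapped with `sparkContext.broadcast` has to be plain data (for example collected rows) to be usable inside executor code, which is another reason to consider the hint.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{broadcast, col}

object AllPairs {
  // Prefixes every column so both sides stay distinguishable after the join.
  private def prefixed(df: DataFrame, prefix: String): DataFrame =
    df.select(df.columns.map(c => col(c).as(s"${prefix}_$c")): _*)

  // Pairs every row of table1 with every row of table2 and computes a value per pair.
  def scorePairs(table1: DataFrame, table2: DataFrame): DataFrame =
    prefixed(table1, "a")
      .crossJoin(broadcast(prefixed(table2, "b"))) // hint: replicate table2 to all executors
      .withColumn("value", col("a_col1") * col("b_col1")) // placeholder for the real doWork logic

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("all-pairs").getOrCreate()
    import spark.implicits._

    // Tiny made-up tables standing in for table1 and table2.
    val table1 = Seq(1, 2, 3).toDF("col1")
    val table2 = Seq(10, 20).toDF("col1")

    scorePairs(table1, table2).show()
    spark.stop()
  }
}
```

With 50K and 30K rows the cross join yields 1.5 billion pairs, so the per-pair work should be kept as a column expression where possible and the result aggregated or filtered before it is collected.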