Scala/Spark: Immutable DataFrames and Memory

I am very new to Scala. I have experience in Java and R.
I am confused about the immutability of DataFrames and memory management, for the following reason:
A DataFrame in R is also immutable, and that eventually proved unworkable in R. (Simplistically put) when working with a very large number of columns, each transformation leads to a new DataFrame, so 1000 consecutive operations on 1000 consecutive columns would lead to 1000 DataFrame objects. Now, most data scientists prefer R's data.table, which performs operations by reference on a single data.table object.
Spark's DataFrame (to a newbie) seems to have a similar problem. The following code, for example, seems to create 1000 DataFrames when casting 1000 columns. Despite the foldLeft(), each call to withColumn() creates a new instance of DataFrame.
So, do I trust a very efficient garbage collector in Scala, or do I need to try and limit the number of immutable instances created? If the latter, what techniques should I be looking at?
def castAllTypedColumnsTo(df: DataFrame,
                          sourceType: DataType, targetType: DataType): DataFrame = {
  val columnsToBeCasted = df.schema
    .filter(s => s.dataType == sourceType)
  if (columnsToBeCasted.length > 0) {
    println(s"Found ${columnsToBeCasted.length} columns " +
      s"(${columnsToBeCasted.map(s => s.name).mkString(",")})" +
      s" - casting to ${targetType.typeName.capitalize}Type")
  }
  columnsToBeCasted.foldLeft(df) { (foldedDf, col) =>
    castColumnTo(foldedDf, col.name, targetType)
  }
}
This helper method returns a new DataFrame instance on each call:
private def castColumnTo(df: DataFrame, cn: String, tpe: DataType): DataFrame = {
  // println("castColumnTo")
  df.withColumn(cn, df(cn).cast(tpe))
}

The difference is essentially laziness. Each new DataFrame that is returned is not materialized in memory. It just stores the base DataFrame and the function that should be applied to it. It's essentially an execution plan for how to create some data, not the data itself.
When it comes time to actually execute and save the result somewhere, all 1000 operations can be applied to each row in parallel, so you get one output DataFrame. Spark condenses as many operations together as possible, and does not materialize anything unnecessary or anything that hasn't been explicitly requested to be saved or cached.
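A quick way to see this for yourself (a minimal sketch; the session setup and column names here are illustrative, not taken from the question) is to fold a few casts over a small DataFrame and inspect the plan before any action runs:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.IntegerType

val spark = SparkSession.builder().appName("lazy-demo").master("local[*]").getOrCreate()
import spark.implicits._

// A tiny DataFrame with string columns that we pretend need casting.
val df = Seq(("1", "2"), ("3", "4")).toDF("a", "b")

// Each withColumn returns a new (lazy) DataFrame; only the plan grows.
val casted = df.columns.foldLeft(df) { (acc, c) =>
  acc.withColumn(c, acc(c).cast(IntegerType))
}

// Nothing has been computed yet: this only prints the logical/physical plan.
casted.explain(true)

// The casts are applied once, here, when an action is finally invoked.
casted.show()

The intermediate DataFrame objects are just small plan descriptions on the driver, so the garbage-collection pressure from creating 1000 of them is typically negligible compared to the data itself.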

Related

Why clustering doesn't seem to work in Spark's cogroup function

I have two Hive clustered tables, t1 and t2:
CREATE EXTERNAL TABLE `t1`(
`t1_req_id` string,
...
PARTITIONED BY (`t1_stats_date` string)
CLUSTERED BY (t1_req_id) INTO 1000 BUCKETS
-- t2 looks similar, with the same number of buckets
The code looks like the following:
val t1 = spark.table("t1").as[T1].rdd.map(v => (v.t1_req_id, v))
val t2 = spark.table("t2").as[T2].rdd.map(v => (v.t2_req_id, v))

val outRdd = t1.cogroup(t2)
  .flatMap { coGroupRes =>
    val key = coGroupRes._1
    val value: (Iterable[T1], Iterable[T2]) = coGroupRes._2
    val t3List = // create a list with some logic on Iterable[T1] and Iterable[T2]
    t3List
  }

outRdd.write....
I make sure that both the t1 and t2 tables have the same number of partitions, and on spark-submit the
spark.sql.sources.bucketing.enabled=true and spark.sessionState.conf.bucketingEnabled=true flags are set.
But the Spark DAG doesn't show any impact of the clustering; it seems there is still a full data shuffle.
What am I missing? Are there any other configurations or tunings? How can I ensure that there is no full data shuffle?
My Spark version is 2.3.1.
And it shouldn't.
Logical optimizations are limited to the DataFrame API. Once you push data to the black-box functional Dataset API (see Spark 2.0 Dataset vs DataFrame) and later to the RDD API, no more information is pushed back to the optimizer.
You could partially utilize bucketing by performing the join first, getting something along these lines:
spark.table("t1")
.join(spark.table("t2"), $"t1.t1_req_id" === $"t2.t2_req_id", "outer")
.groupBy($"t1.v.t1_req_id", $"t2.t2_req_id")
.agg(...) // For example collect_set($"t1.v"), collect_set($"t2.v")
However, unlike cogroup, this will generate a full Cartesian product within each group, and might not be applicable in your case.

Apache Spark SQL: how to optimize chained join for dataframe

I have to make a left join between a principal data frame and several reference data frames, i.e. a chained join computation, and I wonder how to make this efficient and scalable.
Method 1 is easy to understand and is also the current method, but I'm not satisfied, because all the transformations are chained and wait for the final action to trigger the computation. If I continue to add transformations and the volume of data grows, Spark will eventually fail, so this method is not scalable.
Method 1:
def pipeline(refDF1: DataFrame, refDF2: DataFrame, refDF3: DataFrame, refDF4: DataFrame, refDF5: DataFrame): DataFrame => DataFrame = {
  val transformations: List[DataFrame => DataFrame] = List(
    castColumnsFromStringToLong(ColumnsToCastToLong),
    castColumnsFromStringToFloat(ColumnsToCastToFloat),
    renameColumns(RenameMapping),
    filterAndDropColumns,
    joinRefDF1(refDF1),
    joinRefDF2(refDF2),
    joinRefDF3(refDF3),
    joinRefDF4(refDF4),
    joinRefDF5(refDF5),
    calculate()
  )
  transformations.reduce(_ andThen _)
}
pipeline(refDF1, refDF2, refDF3, refDF4, refDF5)(principleDF)
Method 2: I haven't found a real way to achieve my idea, but I would like to trigger the computation of each join immediately.
According to my tests, count() is too heavy for Spark and useless for my application, but I don't know how to trigger the join computation with a cheaper action. That kind of action is, in fact, the answer to this question.
val joinedDF_1 = castColumnsFromStringToLong(principleDF, ColumnsToCastToLong)
joinedDF_1.cache() // joinedDF is not always used multiple times, but for some data frames it is, so I add cache() to indicate the usage
joinedDF_1.count()

val joinedDF_2 = castColumnsFromStringToFloat(joinedDF_1, ColumnsToCastToFloat)
joinedDF_2.cache()
joinedDF_2.count()

val joinedDF_3 = renameColumns(joinedDF_2, RenameMapping)
joinedDF_3.cache()
joinedDF_3.count()

val joinedDF_4 = filterAndDropColumns(joinedDF_3)
joinedDF_4.cache()
joinedDF_4.count()
...
When you want to force the computation of a given join (or any transformation that is not final) in Spark, you can use a simple show or count on your DataFrame. This kind of terminal point forces the computation of the result, because otherwise it is simply not possible to execute the action.
Only after this will your DataFrame be effectively stored in your cache.
Once you're finished with a given DataFrame, don't hesitate to unpersist it. This frees your data if your cluster needs more room for further computation.
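A minimal sketch of that cache/count/unpersist pattern, reusing the helper names from the question (they stand in for whatever your stages are):

// castColumnsFromStringToLong / castColumnsFromStringToFloat are the question's own helpers.
val stage1 = castColumnsFromStringToLong(principleDF, ColumnsToCastToLong)
stage1.cache()   // mark for caching
stage1.count()   // action: materializes stage1 and fills the cache

val stage2 = castColumnsFromStringToFloat(stage1, ColumnsToCastToFloat)
stage2.cache()
stage2.count()

// stage1 is no longer needed once stage2 is materialized, so release it.
stage1.unpersist()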
You need to repartition your datasets by the join columns before calling the join transformation.
Example:
df1 = df1.repartition(col("col1"), col("col2"))
df2 = df2.repartition(col("col1"), col("col2"))
joinDF = df1.join(df2, df1.col("col1") === df2.col("col1") && ...)
Try creating a new DataFrame from the original one's RDD and schema.
Ex:
val dfTest = session.createDataFrame(df.rdd, df.schema).cache()
dfTest.storageLevel.useMemory // result should be true

How to tune mapping/filtering on big datasets (cross joined from two datasets)?

Spark 2.2.0
I have the following code, converted from a SQL script. It has been running for two hours and is still going, even slower than SQL Server. Is anything not done correctly?
The following is the plan:
Push table2 to all executors.
Partition table1 and distribute the partitions to the executors.
Each row in table2/t2 joins (cross join) each partition of table1.
So the calculation on the result of the cross join can be run distributed/in parallel. (I wanted to, for example, suppose I have 16 executors, keep a copy of t2 on all 16 executors, then divide table1 into 16 partitions, one for each executor, and have each executor do the calculation on one partition of table1 and t2.)
case class Cols(Id: Int, F2: String, F3: BigDecimal, F4: Date, F5: String,
                F6: String, F7: BigDecimal, F8: String, F9: String, F10: String)
case class Result(Id1: Int, ID2: Int, Point: Int)

def getDataFromDB(source: String) = {
  import sqlContext.sparkSession.implicits._
  sqlContext.read.format("jdbc").options(Map(
    "driver" -> "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    "url" -> jdbcSqlConn,
    "dbtable" -> s"$source"
  )).load()
    .select("Id", "F2", "F3", "F4", "F5", "F6", "F7", "F8", "F9", "F10")
    .as[Cols]
}
val sc = new SparkContext(conf)
val table1: Dataset[Cols] = getDataFromDB("table1").repartition(32).cache()
println(table1.count()) // about 300K rows
val table2: Dataset[Cols] = getDataFromDB("table2") // ~20K rows
table2.take(1)
println(table2.count())
val t2 = sc.broadcast(table2)
import org.apache.spark.sql.{functions => func}
val j = table1.joinWith(t2.value, func.lit(true))
j.map(x => {
  val (l, r) = x
  Result(l.Id, r.Id,
    (if (l.F1 != null && r.F1 != null && l.F1 == r.F1) 3 else 0)
      + (if (l.F2 != null && r.F2 != null && l.F2 == r.F2) 2 else 0)
      + ..... // all kinds of similar expressions
      + (if (l.F8 != null && r.F8 != null && l.F8 == r.F8) 1 else 0)
  )
}).filter(x => x.Value >= 10)
println("Total count %d", j.count()) // This takes forever; the count will be about 100
How can I rewrite it in an idiomatic Spark way?
Ref: https://forums.databricks.com/questions/6747/how-do-i-get-a-cartesian-product-of-a-huge-dataset.html
(Somehow I feel as if I have seen the code already)
The code is slow because you use just a single task to load the entire dataset from the database over JDBC, and despite cache it does not benefit from it.
Start by checking out the physical plan and the Executors tab in the web UI to confirm that a single executor with a single task does the work.
You should use one of the following to fine-tune the number of tasks for loading:
Use the partitionColumn, lowerBound, and upperBound options of the JDBC data source
Use the predicates option
See JDBC To Other Databases in Spark's official documentation.
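For example, a partitioned JDBC read might look like the following sketch (the partition column Id, the bounds, and the partition count are assumptions, not values from the question):

// Spark issues one query per partition, splitting the assumed Id range into 32 slices.
val table1 = sqlContext.read.format("jdbc").options(Map(
  "driver"          -> "com.microsoft.sqlserver.jdbc.SQLServerDriver",
  "url"             -> jdbcSqlConn,
  "dbtable"         -> "table1",
  "partitionColumn" -> "Id",      // must be a numeric, date, or timestamp column
  "lowerBound"      -> "1",       // assumed minimum Id
  "upperBound"      -> "300000",  // assumed maximum Id (~300K rows per the question)
  "numPartitions"   -> "32"
)).load()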
After you're fine with the loading, you should work on improving the last count action by adding another count action right after the following line:
val table1: Dataset[Cols] = getDataFromDB("table1").repartition(32).cache()
// trigger caching, as it's lazy in the Dataset API
table1.count
The reason the entire query is slow is that you only mark table1 to be cached, but caching happens when an action gets executed, which is exactly at the end (!). In other words, cache does nothing useful and, more importantly, makes the query performance even worse.
Performance will also increase after table2.cache.count.
If you want to do a cross join, use the crossJoin operator.
crossJoin(right: Dataset[_]): DataFrame Explicit cartesian join with another DataFrame.
Please note the note from the scaladoc of crossJoin (no pun intended).
Cartesian joins are very expensive without an extra filter that can be pushed down.
The following requirement is already handled by Spark given all the optimizations available.
So the calculation on the result of the cross join can be run distributed/in parallel.
That's Spark's job (again, no pun intended).
The following requirement begs for broadcast.
I wanted to, for example, suppose I have 16 executors, keep a copy of t2 on all 16 executors, then divide table1 into 16 partitions, one for each executor, and have each executor do the calculation on one partition of table1 and t2.
Use the broadcast function to hint Spark SQL's engine to use table2 in broadcast mode.
broadcast[T](df: Dataset[T]): Dataset[T] Marks a DataFrame as small enough for use in broadcast joins.
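Putting those two hints together, a rough sketch could look like this (only F2 is shown in the score; the other columns would follow the same pattern, and the when/otherwise expression is an illustrative replacement for the row-by-row map, not the asker's exact logic):

import org.apache.spark.sql.functions.{broadcast, col, when}

// Broadcast the small table (~20K rows) so every executor gets a full copy,
// then cross join it with the partitioned large table.
val joined = table1.as("t1").crossJoin(broadcast(table2.as("t2")))

// Compute the score as a column expression and let Spark push the filter down.
val scored = joined
  .withColumn("Point",
    when(col("t1.F2").isNotNull && col("t2.F2").isNotNull &&
         col("t1.F2") === col("t2.F2"), 2).otherwise(0)
    /* + ... similar terms for the other columns ... */ )
  .filter(col("Point") >= 10)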

Most Efficient Way to Join Massive/Small Datasets

I currently have a large RDD called chartEvents containing data of the form:
case class ChartEvent(patientID: String, itemID: String, chartTime: String, storeTime: String, value: String,
valueNum: String, warning: String, error: String)
The data comes from a 35 GB .csv file which I am parsing using SQL:
CSVUtils.loadCSVAsTable(sqlContext, "data_unzipped/CHARTEVENTS.csv")
val chartEvents = sqlContext.sql(
  """
    |SELECT SUBJECT_ID, ITEMID, CHARTTIME, STORETIME, VALUE, VALUENUM, WARNING, ERROR
    |FROM CHARTEVENTS
  """.stripMargin)
  .map(r => ChartEvent(r(0).toString, r(1).toString, r(2).toString, r(3).toString, r(4).toString,
    r(5).toString, r(6).toString, r(7).toString))
I have a separate, very small (fewer than 100 rows) RDD called featureMapping of the form RDD[(itemID, label)], where both are strings. What I am trying to do is filter the chartEvents RDD down to rows whose itemID appears in featureMapping. My current method is to perform an inner join of the two RDDs as follows:
val result = chartEvents.map{case event => (event.itemID, event)}.join(featureMapping)
However, I am noticing that this is on track to take several hours to run, and is using a massive amount of space in my /user/<user>/appdata/local/temp folder. Is there a more efficient way to perform this filtering? Would coding it into the sqlContext be faster?
If you register your tables in the Hive metastore, you can set spark.sql.autoBroadcastJoinThreshold.
From the docs:
Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when performing a join. By setting this value to -1 broadcasting can be disabled. Note that currently statistics are only supported for Hive Metastore tables where the command ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan has been run.
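For example (a sketch, assuming a SparkSession named spark is in scope and that both datasets can be converted to DataFrames; the 10 MB value is illustrative):

import org.apache.spark.sql.functions.broadcast
import spark.implicits._

// Raise the automatic broadcast threshold (value in bytes; 10 MB here).
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10485760")

// Or hint the planner explicitly, regardless of table statistics:
val chartEventsDF    = chartEvents.toDF()
val featureMappingDF = featureMapping.toDF("itemID", "label")
val filtered = chartEventsDF.join(broadcast(featureMappingDF), "itemID")

The explicit broadcast hint has the advantage of not depending on Hive table statistics at all, which matters here since the data comes straight from a CSV file.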

How to find out the keywords in a text table with Spark?

I am new to Spark. I have two tables in HDFS. One table (table 1) is a tag table, composed of some text, which could be some words or a sentence. The other table (table 2) has a text column. Every row in table 2 could match more than one keyword from table 1. My task is to find all the keywords in table 1 that match the text column in table 2, and output the keyword list for every row in table 2.
The problem is that I have to iterate over every row in table 2 and every entry in table 1. If I produce a big list from table 1 and use a map function on table 2, I still have to loop over the list inside the map function, and the driver shows a JVM memory limit error even though the loop is not large (10 thousand iterations).
myTag is the tag list of table 1.
def ourMap(line: String, myTag: List[String]): String = {
  var ret = line
  val length = myTag.length
  for (i <- 0 to length - 1) {
    if (line.contains(myTag(i)))
      ret = ret.replaceAll(myTag(i), "_")
  }
  ret
}

val matched = result.map(b => ourMap(b, tagList))
Any suggestions on how to finish this task, with or without Spark?
Many thanks!
An example is as follows:
table1
row1|Spark
row2|RDD
table2
row1| Spark is a fast and general engine. RDD supports two types of operations.
row2| All transformations in Spark are lazy.
row3| It is for test. I am a sentence.
Expected result:
row1| Spark,RDD
row2| Spark
MAJOR EDIT:
The first table may actually contain sentences, not just simple keywords:
row1| Spark
row2| RDD
row3| two words
row4| I am a sentence
Here you go, considering the data sample that you provided:
val table1: Seq[(String, String)] = Seq(("row1", "Spark"), ("row2", "RDD"), ("row3", "Hashmap"))
val table2: Seq[String] = Seq("row1##Spark is a fast and general engine. RDD supports two types of operations.", "row2##All transformations in Spark are lazy.")
val rdd1: RDD[(String, String)] = sc.parallelize(table1)
val rdd2: RDD[(String, String)] = sc.parallelize(table2).map(_.split("##").toList).map(l => (l.head, l.tail(0))).cache
We'll build an inverted index of the second data table, which we will then join to the first table:
val df1: DataFrame = rdd1.toDF("key", "value")
val df2: DataFrame = rdd2.toDF("key", "text")
val df3: DataFrame = rdd2.flatMap { case (row, text) =>
  text.trim.split("""[^\p{IsAlphabetic}]+""").map(word => (word, row))
}.groupByKey.mapValues(_.toSet.toSeq).toDF("word", "index")

import org.apache.spark.sql.functions.explode

val results: RDD[(String, String)] = df3.join(df1, df1("value") === df3("word"))
  .drop("key").drop("value")
  .withColumn("index", explode($"index"))
  .rdd
  .map { case r: Row => (r.getAs[String]("index"), r.getAs[String]("word")) }
  .groupByKey.mapValues(i => i.toList.mkString(","))
results.take(2).foreach(println)
// (row1,Spark,RDD)
// (row2,Spark)
MAJOR EDIT:
As mentioned in the comments: the specifications of the issue changed. The keywords are no longer simple keywords; they might be sentences. In that case, this approach wouldn't work; it's a different kind of problem. One way to solve it is to use a Locality-Sensitive Hashing (LSH) algorithm for nearest-neighbor search.
An implementation of this algorithm is available here.
The algorithm and its implementation are unfortunately too long to discuss on SO.
From what I could gather from your problem statement, you are essentially trying to tag the data in table 2 with the keywords present in table 1. For this, instead of loading table 1 as a list and then pattern matching each keyword against each row of table 2, do this (see the sketch after these steps):
Load table 1 as a HashSet.
Traverse table 2 and, for each word in the text, do a lookup in the above HashSet. I assume the number of words you have to look up this way is smaller than the number of pattern matches you would otherwise perform for each keyword. Remember, a set lookup is an O(1) operation, whereas pattern matching is not.
In this process you can also filter out words like "is", "are", "when", "if", etc., as they will never be used for tagging; that further reduces the words you need to look up in the HashSet.
The HashSet can be loaded into memory (I think 10K keywords should not take more than a few MBs). It can be shared across executors through a broadcast variable.
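A minimal sketch of that approach for the simple-keyword case, using the row|text layout from the question's example (the HDFS paths and the stop-word list are hypothetical):

val stopWords = Set("is", "are", "when", "if", "a", "the")

// Load table 1 ("row1|Spark") into a Set on the driver and broadcast it.
val keywords: Set[String] = sc.textFile("hdfs:///path/to/table1")   // hypothetical path
  .map(_.split("\\|")(1).trim)
  .collect().toSet
val keywordsBc = sc.broadcast(keywords)

// Tag each row of table 2 ("row1| some text ...") by looking its words up in the set.
val tagged = sc.textFile("hdfs:///path/to/table2")                  // hypothetical path
  .map { line =>
    val Array(rowId, text) = line.split("\\|", 2)
    val hits = text.split("""[^\p{IsAlphabetic}]+""")
      .filter(w => w.nonEmpty && !stopWords.contains(w.toLowerCase))
      .filter(keywordsBc.value.contains)   // O(1) lookup per word
      .distinct
    s"$rowId| ${hits.mkString(",")}"
  }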