Why external sorter works in spark - scala

I wrote some data-processing code in Scala and Spark, and for some reason it is very slow. I suspect the cause is 'ExternalSorter'. As you can see in my code below, there is no reason to sort the data, but Spark does it anyway.
I have more than 6,000,000 rows in an RDD and I am trying to group the data by the column 'ID' (there are fewer than 20 distinct IDs, so each ID group has more than 300,000 rows).
I know it is a fairly large dataset, but the other steps were not slow. Any idea why this happens?
val ListByID = allData.map { x => (x.getAs[String]("ID"), List(x)) }.reduceByKey { (a: List[Row], b: List[Row]) => List(a, b).flatten }
val goalData = ListByID.map({ rowList =>
  val list = rowList._2
  val ID = rowList._1
  val SD = list.head.getAs[String]("SD")
  val ANOTHER_ID_CNT = list.map { row => row.getAs[String]("ANOTHER_ID") }.distinct.length
  Row(
    ID, ID, list.length,
    list.count { row => row.getAs[Int]("FLAGA") == 1 },
    list.count { row => row.getAs[Int]("FLAGB") == 1 },
    SD, ANOTHER_ID_CNT)
})

The following part:
allData.map{...}.reduceByKey{ (a: List[Row], b: List[Row]) => List(a, b).flatten }
is just a significantly more expensive implementation of groupByKey. It not only puts more pressure on the GC by applying map-side aggregation, but may also create a huge number of temporary objects. If a single group doesn't fit into memory, an out-of-memory error is inevitable.
Next, you group the data and drag all the fields along when all you do later is count. That could easily be handled with a simple aggregation:
1. Reduce by ID and ANOTHER_ID, counting FLAGA=1 and FLAGB=1 and keeping a single SD.
2. Reduce the result of step 1 by ID, summing the FLAGA=1 and FLAGB=1 counts, adding 1 per pair (distinct ANOTHER_ID), and keeping an arbitrary SD.
Finally, if you start with a DataFrame, why move the data to a less efficient format at all? In pseudocode:
df.groupBy("ID").agg(
  count($"*"),
  count(when($"FLAGA" === 1, 1)),
  count(when($"FLAGB" === 1, 1)),
  countDistinct("ANOTHER_ID"),
  first("SD")
)
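The two reduce steps described in this answer can be sketched on an RDD roughly as follows. This is only a minimal sketch under stated assumptions: the tuple layout (ID, ANOTHER_ID, FLAGA, FLAGB, SD), the sample rows and the local SparkSession are made up for illustration and stand in for the Row columns of the question.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session and hand-made sample rows, for illustration only.
val spark = SparkSession.builder().master("local[2]").appName("two-step-reduce").getOrCreate()
val sc = spark.sparkContext

// Hypothetical layout: (ID, ANOTHER_ID, FLAGA, FLAGB, SD)
val rows = sc.parallelize(Seq(
  ("a", "x", 1, 0, "s1"),
  ("a", "x", 0, 1, "s1"),
  ("a", "y", 1, 1, "s1"),
  ("b", "z", 0, 0, "s2")
))

// Step 1: one partial result per (ID, ANOTHER_ID) pair:
// (row count, FLAGA == 1 count, FLAGB == 1 count, some SD)
val perPair = rows
  .map { case (id, anotherId, flagA, flagB, sd) =>
    ((id, anotherId), (1L, if (flagA == 1) 1L else 0L, if (flagB == 1) 1L else 0L, sd))
  }
  .reduceByKey { case ((n1, a1, b1, sd), (n2, a2, b2, _)) => (n1 + n2, a1 + a2, b1 + b2, sd) }

// Step 2: reduce by ID alone; every (ID, ANOTHER_ID) pair contributes 1
// to the distinct ANOTHER_ID count.
val perId = perPair
  .map { case ((id, _), (n, a, b, sd)) => (id, (n, a, b, 1L, sd)) }
  .reduceByKey { case ((n1, a1, b1, d1, sd), (n2, a2, b2, d2, _)) =>
    (n1 + n2, a1 + a2, b1 + b2, d1 + d2, sd)
  }

// ID -> (rows, FLAGA == 1, FLAGB == 1, distinct ANOTHER_ID, SD)
val result = perId.collect().toMap
spark.stop()
```

Only small fixed-size tuples are shuffled here, so no group ever has to be materialized as a List in memory.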

How do you use the salted key technique to reduce groupBy skew in spark for RDDs

Let's say I have rddA = List((skewKey, (col1, col2))), where skewKey is a number between 1 and 10k assigned to each col1 such that the overall distribution of items over the skewKeys is uniform.
I would like to do an operation grouped on col1, but the data is highly skewed, so some tasks take forever. I would therefore like to first repartition to 10k partitions using the skew key.
rddA.groupBy(x => x._1, numPartitions = 10000) // group by skew key first
  .flatMap(group => group._2.map(x => x._2)) // remove skewKey
  .groupBy(x => x._1) // then do the actual group computation
  .map(group => (group._1, some_process(group._2))) // what I actually want to compute
Will this give me what I want? It uses two groupBy operations: one to repartition by the skewKey, and a second one for the actual grouping.

How to tune mapping/filtering on big datasets (cross joined from two datasets)?

Spark 2.2.0
I have the following code, converted from a SQL script. It has been running for two hours and is still not finished. It is even slower than SQL Server. Is something not done correctly?
The plan is as follows:
Push table2 to all executors.
Partition table1 and distribute the partitions to the executors.
Each row in table2/t2 joins (cross join) each partition of table1.
So the calculation on the result of the cross-join can be run distributed/in parallel. (I wanted to, for example, suppose I have 16 executors, keep a copy of t2 on all 16 executors, then divide table 1 into 16 partitions, one for each executor. Then each executor does the calculation on one partition of table 1 and t2.)
case class Cols (Id: Int, F2: String, F3: BigDecimal, F4: Date, F5: String,
                 F6: String, F7: BigDecimal, F8: String, F9: String, F10: String)
case class Result (Id1: Int, ID2: Int, Point: Int)
def getDataFromDB(source: String) = {
  import sqlContext.sparkSession.implicits._
  sqlContext.read.format("jdbc").options(Map(
    "driver" -> "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    "url" -> jdbcSqlConn,
    "dbtable" -> s"$source"
  )).load()
    .select("Id", "F2", "F3", "F4", "F5", "F6", "F7", "F8", "F9", "F10")
    .as[Cols]
}
val sc = new SparkContext(conf)
val table1: Dataset[Cols] = getDataFromDB("table1").repartition(32).cache()
println(table1.count()) // about 300K rows
val table2: Dataset[Cols] = getDataFromDB("table2") // ~20K rows
table2.take(1)
println(table2.count())
val t2 = sc.broadcast(table2)
import org.apache.spark.sql.{functions => func}
val j = table1.joinWith(t2.value, func.lit(true))
j.map(x => {
  val (l, r) = x
  Result(l.Id, r.Id,
    (if (l.F1 != null && r.F1 != null && l.F1 == r.F1) 3 else 0)
    + (if (l.F2 != null && r.F2 != null && l.F2 == r.F2) 2 else 0)
    + ..... // All kinds of similar expressions
    + (if (l.F8 != null && r.F8 != null && l.F8 == r.F8) 1 else 0)
  )
}).filter(x => x.Point >= 10)
println("Total count %d", j.count()) // This takes forever, the count will be about 100
How can I rewrite it in an idiomatic Spark way?
Ref: https://forums.databricks.com/questions/6747/how-do-i-get-a-cartesian-product-of-a-huge-dataset.html
(Somehow I feel as if I have seen the code already)
The code is slow because a single task loads the entire dataset from the database over JDBC and, despite the cache call, the query does not benefit from caching.
Start by checking the physical plan and the Executors tab in the web UI; you should see a single executor and a single task doing all the work.
You should use one of the following to fine-tune the number of tasks for loading:
Use the partitionColumn, lowerBound and upperBound options of the JDBC data source
Use the predicates option
See JDBC To Other Databases in Spark's official documentation.
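As a rough illustration of the first option, the sketch below only builds the option map for a partitioned read. The helper name partitionedJdbcOptions, the placeholder URL and the bounds are assumptions made for this example; the option keys themselves (partitionColumn, lowerBound, upperBound, numPartitions) are the ones the JDBC data source reads, and they have to be given together.

```scala
// Hypothetical helper: collects the options for a partitioned JDBC read.
def partitionedJdbcOptions(
    url: String,
    table: String,
    column: String,      // column used to split the read into ranges
    lower: Long,         // lowerBound and upperBound only define the range stride
    upper: Long,
    numPartitions: Int   // number of parallel range queries (tasks)
): Map[String, String] = Map(
  "url" -> url,
  "dbtable" -> table,
  "partitionColumn" -> column,
  "lowerBound" -> lower.toString,
  "upperBound" -> upper.toString,
  "numPartitions" -> numPartitions.toString
)

// Assumed values: table1 has about 300K rows with an integer Id column.
val opts = partitionedJdbcOptions(
  "jdbc:sqlserver://<host>;databaseName=<db>", "table1", "Id", 1L, 300000L, 32)

// Usage, with the sqlContext from the question:
//   sqlContext.read.format("jdbc").options(opts).load()
```

The bounds do not filter anything: rows outside the range still end up in the first or last partition, so badly chosen bounds give skewed partitions rather than missing data.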
After you're fine with the loading, you should work on improving the last count action and add...another count action right after the following line:
val table1: Dataset[Cols] = getDataFromDB("table1").repartition(32).cache()
// trigger caching as it's lazy in Dataset API
table1.count
The entire query is slow because cache only marks table1 to be cached; the caching itself happens when an action gets executed, which here is at the very end (!). In other words, cache does nothing useful and, more importantly, makes the query performance even worse.
Performance will also improve after you run table2.cache.count.
If you want to do cross join, use crossJoin operator.
crossJoin(right: Dataset[_]): DataFrame Explicit cartesian join with another DataFrame.
Please note the note from the scaladoc of crossJoin (no pun intended).
Cartesian joins are very expensive without an extra filter that can be pushed down.
The following requirement is already handled by Spark given all the optimizations available.
So the calculation on the result of the cross-join can be run distributed/in parallel.
That's Spark's job (again, no pun intended).
The following requirement begs for broadcast.
I wanted to, for example, suppose I have 16 executors, keep a copy of t2 on all 16 executors, then divide table 1 into 16 partitions, one for each executor. Then each executor does the calculation on one partition of table 1 and t2.
Use the broadcast function to hint Spark SQL's engine to use table2 in broadcast mode.
broadcast[T](df: Dataset[T]): Dataset[T] Marks a DataFrame as small enough for use in broadcast joins.
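Putting crossJoin and broadcast together could look roughly like the sketch below. It is not the asker's code: the two tiny tables, the reduced column set (Id, F2, F5), the weights and the threshold of 5 are assumptions chosen so that the example is small enough to follow. The scoring is expressed with column functions instead of a typed map, so it stays inside Spark SQL.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{broadcast, when}

// Assumption: a local session and made-up sample data, for illustration only.
val spark = SparkSession.builder().master("local[2]").appName("broadcast-cross-join").getOrCreate()
import spark.implicits._

val t1 = Seq((1, "a", "x"), (2, "b", "y")).toDF("Id", "F2", "F5")   // the large table
val t2 = Seq((10, "a", "x"), (20, "b", "z")).toDF("Id", "F2", "F5") // the small table

// `weight` points when both sides are equal; a null on either side makes
// the comparison null, so `when` falls through to 0.
def points(field: String, weight: Int): Column =
  when(t1(field) === t2(field), weight).otherwise(0)

val scored = t1
  .crossJoin(broadcast(t2)) // explicit cartesian join, small side marked for broadcast
  .select(t1("Id").as("Id1"), t2("Id").as("Id2"), (points("F2", 3) + points("F5", 2)).as("Point"))
  .filter($"Point" >= 5)

val pairs = scored.collect().map(r => (r.getInt(0), r.getInt(1), r.getInt(2))).toSet
spark.stop()
```

Calling scored.explain() before collecting is a cheap way to check whether the small side is actually broadcast in the physical plan.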

How to find out the keywords in a text table with Spark?

I am new to Spark. I have two tables in HDFS. One table (table 1) is a tag table composed of text entries, each of which could be a few words or a sentence. The other table (table 2) has a text column. Every row of table 2 could contain more than one keyword from table 1. My task is to find all the keywords from table 1 that match the text column in table 2, and to output the keyword list for every row in table 2.
The problem is that I have to iterate over every row in table 2 and in table 1. If I produce a big list for table 1 and use a map function on table 2, I still have to loop over the list inside the map function. And the driver shows a JVM memory limit error, even though the loop is not large (10 thousand iterations).
myTag is the tag list of table 1.
def ourMap(line: String, myTag: List[String]): String = {
  var ret = line
  val length = myTag.length
  for (i <- 0 to length - 1) {
    if (line.contains(myTag(i)))
      ret = ret.replaceAll(myTag(i), "_")
  }
  ret
}
val matched = result.map(b => ourMap(b, tagList))
Any suggestions for finishing this task, with or without Spark?
Many thanks!
An example is as follows:
table1
row1|Spark
row2|RDD
table2
row1| Spark is a fast and general engine. RDD supports two types of operations.
row2| All transformations in Spark are lazy.
row3| It is for test. I am a sentence.
Expected result :
row1| Spark,RDD
row2| Spark
MAJOR EDIT:
The first table may actually contain sentences and not just simple keywords:
row1| Spark
row2| RDD
row3| two words
row4| I am a sentence
Here you go, considering the data sample that you have provided :
val table1: Seq[(String, String)] = Seq(("row1", "Spark"), ("row2", "RDD"), ("row3", "Hashmap"))
val table2: Seq[String] = Seq("row1##Spark is a fast and general engine. RDD supports two types of operations.", "row2##All transformations in Spark are lazy.")
val rdd1: RDD[(String, String)] = sc.parallelize(table1)
val rdd2: RDD[(String, String)] = sc.parallelize(table2).map(_.split("##").toList).map(l => (l.head, l.tail(0))).cache
We'll build an inverted index of the second table, which we will then join to the first table:
val df1: DataFrame = rdd1.toDF("key", "value")
val df2: DataFrame = rdd2.toDF("key", "text")
val df3: DataFrame = rdd2.flatMap { case (row, text) =>
  text.trim.split("""[^\p{IsAlphabetic}]+""").map(word => (word, row))
}.groupByKey.mapValues(_.toSet.toSeq).toDF("word", "index")
import org.apache.spark.sql.functions.explode
val results: RDD[(String, String)] = df3.join(df1, df1("value") === df3("word"))
  .drop("key").drop("value")
  .withColumn("index", explode($"index"))
  .rdd.map {
    case r: Row => (r.getAs[String]("index"), r.getAs[String]("word"))
  }.groupByKey.mapValues(i => i.toList.mkString(","))
results.take(2).foreach(println)
// (row1,Spark,RDD)
// (row2,Spark)
MAJOR EDIT:
As mentioned in the comments, the specification of the problem changed: keywords are no longer simple keywords, they might be sentences. In that case this approach won't work; it's a different kind of problem. One way to do it is to use a locality-sensitive hashing (LSH) algorithm for nearest-neighbor search.
An implementation of this algorithm is available here.
The algorithm and its implementation are unfortunately too long to discuss on SO.
From what I could gather from your problem statement, you are trying to tag the data in Table 2 with the keywords that are present in Table 1. For this, instead of loading Table 1 as a list and then pattern matching every keyword against each row in Table 2, do this:
Load Table 1 as a hash set.
Traverse Table 2 and, for each word in the phrase, look it up in that hash set. I assume the number of words you have to look up this way is smaller than the number of pattern matches needed to test every keyword. Remember, a lookup is now an O(1) operation, whereas pattern matching is not.
Also, in this process you can filter out words like "is", "are", "when", "if", etc., as they will never be used for tagging. That reduces the number of words you need to look up in the hash set.
The hash set can be loaded into memory (I think 10K keywords should not take more than a few MB). It can be shared across executors through a broadcast variable.
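A minimal sketch of that lookup in plain Scala, assuming single-word tags only (it does not cover the multi-word tags from the question's edit). The names matchTags and stopWords are hypothetical, and the sample sentences are the ones from the question.

```scala
// Assumption: a small hand-picked stop-word list, for illustration only.
val stopWords: Set[String] = Set("is", "are", "when", "if")

// Splits the text into words and keeps those present in the tag set.
// Each lookup in `tags` is a hash lookup instead of a scan over all tags.
def matchTags(text: String, tags: Set[String]): List[String] =
  text.split("""[^\p{IsAlphabetic}]+""")
    .filter(_.nonEmpty)                          // split can yield a leading empty string
    .filterNot(w => stopWords(w.toLowerCase))    // skip words that are never tags
    .filter(tags)                                // a Set is also a String => Boolean
    .distinct
    .toList

val tags = Set("Spark", "RDD")
val row1 = matchTags("Spark is a fast and general engine. RDD supports two types of operations.", tags)
val row3 = matchTags("It is for test. I am a sentence.", tags)
```

With Spark, the set would be broadcast once and read inside the transformation, for example val tagsBc = sc.broadcast(tags) followed by result.map(line => matchTags(line, tagsBc.value)), where result is the RDD of text lines from the question.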

Output of Join in Apache Flink

In Apache Flink, if I join two data sets on a primary key, I get a Tuple2 data set containing the corresponding entry from each of the two data sets.
The problem is that applying the map() method to the resulting Tuple2 data set does not look nice, especially if the entries of both data sets have a large number of fields.
Using tuples in both input data sets gets me some code like this:
var in1: DataSet[(Int, Int, Int, Int, Int)] = /* */
var in2: DataSet[(Int, Int, Int, Int)] = /* */
val out = in1.join(in2).where(0, 1, 2).equalTo(0, 1, 2)
.map(join => (join._1._1, join._1._2, join._1._3,
join._1._4, join._1._5, join._2._4))
I would not mind using POJOs or case classes, but I don't see how this would make it better.
Question 1: Is there a nice way to flatten that Tuple2, e.g. using another operator?
Question 2: How do I handle a join of 3 data sets on the same key? It would make the example source even messier.
Thanks for helping.
You can directly apply a join function to each pair of joined elements, for example:
val leftData: DataSet[(String, Int, Int)] = ...
val rightData: DataSet[(String, Int)] = ...
val joined: DataSet[(String, Int, Int)] = leftData
  .join(rightData).where(0).equalTo(0) { (l, r) => (l._1, l._2, l._3 + r._2) }
To answer the second question, Flink handles only binary joins. However, Flink's optimizer can avoid unnecessary shuffles if you give it a hint about the behavior of your function. Forwarded fields annotations tell the optimizer that certain fields (such as the join key) have not been modified by your join function, which enables it to reuse existing partitioning and sorting.

scala operations on part of RDD

I am new to Scala and I am trying to do something for a project:
I generated an RDD:
[UserID1, Date1, Value1]
[UserID1, Date2, Value2]
[UserID1, Date3, Value3]
[UserID2, Date1, Value1]
[UserID3, Date1, Value1]
I wish to run a function on this RDD that generates an RDD:
[UserID1, FunctionResult1, FunctionResult2]
[UserID1, FunctionResult3, FunctionResult4]
[UserID2, FunctionResult1, FunctionResult2]
The way this function should work is:
1, group by UserID and sort the dates in ascending order (I have already formatted the date as an Int: 20150225).
2, take the first Date and the second Date and find the number of days between them.
3, take the first Value and the second Value and find the difference between them.
Record these values as the function results, then proceed to process the second and third Dates and the second and third Values.
If the input is 5 rows x 3 columns, the result should be 4 rows x 3 columns.
So far, I have tried reduceByKey on the RDD, but it only generates a single row in the output. So I was wondering if there is another efficient way to do this, perhaps without looping? My current code looks like this:
val basicsearchprofile = basicsearch.map(x => (x._1, (x._2, x._3))).reduceByKey((a, b) => funcdiff(a, b))
// x._1 is the User ID, x._2 is the Date, x._3 is the Value;
def funcdiff(a: (Date, Value), b: (Date, Value)): (Day, Value) = {
  val diffdays = (b._1 % 100 - a._1 % 100) + ((b._1 / 100) % 100 - (a._1 / 100) % 100) * 30 + ((b._1 / 10000) % 100 - (a._1 / 10000) % 100) * 365 // difference in days
  val diffvalue = Math.abs(a._2 - b._2)
  (diffdays, diffvalue)
}
I assume that funcdiff reduces the events pairwise and eventually reduces them to a single row? Is it possible to apply funcdiff to the first row and the second row and record the answer, then apply it to the second and third rows, and so on, such that the returned result is an RDD of [ID, Datediff, Valuediff]?
Thanks in advance
Spark processes rows in parallel. Since you have to compute row2-row1, row3-row2 and so on, I think you cannot work in parallel within a single user anymore. So you'll have to forget Spark a bit, use plain Scala, and process all of a user's data on a single node (different users can still be processed in parallel). For instance:
// First, group by user with Spark
case class Info(userId: String, date: Int, value: Int)
val infos = List(
  Info("john", 20150221, 10),
  Info("mary", 20150221, 11),
  Info("john", 20150222, 12),
  Info("mary", 20150223, 15),
  Info("john", 20150223, 14),
  Info("john", 20150224, 16),
  Info("john", 20150225, 18),
  Info("mary", 20150225, 17))
val infoRdd = sc.parallelize(infos)
val infoByIdRdd = infoRdd.map(info => (info.userId, info)).groupByKey()
// Then use plain Scala to process each user's data
def infoDeltas(infos: List[Info]) = {
  // Transform [Info1, Info2, Info3] into [(Info1,Info2),(Info2,Info3)]
  val accZero: (Option[Info], List[(Info, Info)]) = (None, List())
  def accInfo(last: Option[Info], list: List[(Info, Info)], info: Info) = {
    last match {
      case None => (Some(info), list)
      case Some(lastInfo) => (Some(info), list :+ (lastInfo, info))
    }
  }
  val infoIntervals = infos.foldLeft(accZero)(
    (acc, info) => accInfo(acc._1, acc._2, info)
  )._2
  // Transform [(Info1,Info2),(Info2,Info3)] into [Info2-Info1,Info3-Info2]
  infoIntervals.map { case (before, after) => Info(after.userId, after.date - before.date, after.value - before.value) }
}
// groupByKey gives no ordering guarantee within a group, so sort each user's rows by date first
val infoDeltasByIdRdd = infoByIdRdd.mapValues(infos => infoDeltas(infos.toList.sortBy(_.date)))
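The per-user step can also be written with sliding(2), which yields consecutive pairs directly. The sketch below is a variation on the answer's approach, not a replacement for it: the function name deltas and the sample rows are assumptions, and it parses the yyyyMMdd integers into LocalDate values because subtracting the raw integers gives wrong day counts across month boundaries.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoUnit

case class Info(userId: String, date: Int, value: Int)

// BASIC_ISO_DATE parses dates written as yyyyMMdd, e.g. 20150225.
def toDate(d: Int): LocalDate = LocalDate.parse(d.toString, DateTimeFormatter.BASIC_ISO_DATE)

// For one user's rows: sort by date, then emit one
// (userId, days between, value difference) per consecutive pair.
def deltas(infos: Seq[Info]): List[(String, Long, Int)] =
  infos.sortBy(_.date).sliding(2).collect {
    // a group with a single row yields one window of size 1, which is skipped here
    case Seq(before, after) =>
      (after.userId,
       ChronoUnit.DAYS.between(toDate(before.date), toDate(after.date)),
       after.value - before.value)
  }.toList

// Assumed sample data, deliberately unsorted and crossing a month boundary.
val johnDeltas = deltas(Seq(
  Info("john", 20150227, 10),
  Info("john", 20150302, 14),
  Info("john", 20150225, 7)))
```

On the grouped RDD from the answer this would be applied as infoByIdRdd.flatMapValues(infos => deltas(infos.toSeq)), which gives one output row per consecutive pair. The value difference here is signed; wrap it in Math.abs if the absolute difference from the question is wanted.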