Variable number of partitions and batch size for loadDataFrame - Scala

I'm currently using the neo4j-spark-connector for loading data from the graph into a DataFrame.
My Spark job runs different queries against different parts of the graph. It can also run against different versions of the graph containing more or less data, which we don't know upfront. The problem is that I don't know how to set a correct number of partitions and batch size for each query without knowing how many results it will return.
val result = neo.cypher("MATCH (p:Person)-[:KNOWS]->(other:Person) RETURN p.Name, other.Name")
  .partitions(partitions) // PROBLEM
  .batch(batch)           // PROBLEM
I came up with a temporary solution where I count the occurrences of the most common relationship type and divide this by the number of partitions.
val batch: Long = {
  // partitions default: 200
  println("Calculating batch size...")
  val batchSizeCount =
    if (relationShip == null) {
      neo.cypher("MATCH (n)-[r:MOST_OCC_REL]->(m) RETURN COUNT(r)").loadRdd[Long].collect.head
    }
    // (else branch for a specific relationship type not shown)
  val newPartitions =
    if (partitions > 5) partitions - 5
    else partitions
  val batchSize =
    if (batchSizeCount < newPartitions) 1L
    else batchSizeCount / newPartitions
  println("Batch size for Neo4j: " + batchSize)
  batchSize
}
Despite the overhead, this works for most of the (simple) queries, but for more complex queries it does not seem to be correct all the time.
I need to make sure that the reading of source data always uses the appropriate partitions/batch configuration for each run/query.
Is there maybe a better way to determine the correct number of partitions and batch size without knowing how much data a query will return?
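One direction, sketched below, simply generalises the workaround above: run a COUNT query for the exact query you are about to load and derive both values from it. This is only a sketch; the helper name, the countQuery argument and the rowsPerPartition default are made up for illustration, and it assumes the same Neo4j builder methods (cypher/partitions/batch/loadDataFrame/loadRdd) already used in the snippets above.

import org.apache.spark.sql.DataFrame
import org.neo4j.spark.Neo4j

// Hypothetical helper: derive partitions and batch size from an exact row count of the query.
def loadWithDerivedConfig(neo: Neo4j, query: String, countQuery: String,
                          rowsPerPartition: Long = 100000L): DataFrame = {
  // Exact number of rows this particular query will return
  val total: Long = neo.cypher(countQuery).loadRdd[Long].collect.head
  // Enough partitions so that each one handles at most rowsPerPartition rows
  val partitions = math.max(1L, (total + rowsPerPartition - 1) / rowsPerPartition)
  // Batch size so that partitions * batch covers the whole result set
  val batch = math.max(1L, (total + partitions - 1) / partitions)
  neo.cypher(query).partitions(partitions).batch(batch).loadDataFrame
}

// Usage, e.g. for the KNOWS query above:
// val result = loadWithDerivedConfig(neo,
//   "MATCH (p:Person)-[:KNOWS]->(other:Person) RETURN p.Name, other.Name",
//   "MATCH (p:Person)-[:KNOWS]->(other:Person) RETURN COUNT(*)")

The extra COUNT adds one round trip per query, but it keeps the partitions/batch configuration proportional to the actual result size of each query and graph version.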

Related

Using flatMapGroupsWithState with forEachBatch

I have a streaming Spark app in which I'm removing duplicate rows from the running stream using stateful aggregation with flatMapGroupsWithState.
But when I use foreachBatch on the stream and apply the same deduplication functions, each batch is treated as an independent entity, and duplicates are only removed within that single micro-batch.
Code:
case class User(name: String, userId: Integer)

case class StateClass(totalUsers: Int)

def removeDuplicates(inputData: Dataset[User]): Dataset[User] = {
  inputData
    .groupByKey(_.userId)
    .flatMapGroupsWithState(OutputMode.Append, GroupStateTimeout.ProcessingTimeTimeout)(removeDuplicatesInternal)
}
def removeDuplicatesInternal(id: Integer, newData: Iterator[User], state: GroupState[StateClass]): Iterator[User] = {
  if (state.hasTimedOut) {
    state.remove() // Removing state since no same UserId in 4 hours
    return Iterator()
  }
  if (newData.isEmpty)
    return Iterator()

  if (!state.exists) {
    val firstUserData = newData.next()
    val newState = StateClass(1) // Total count = 1 initially
    state.update(newState)
    state.setTimeoutDuration("4 hours")
    Iterator(firstUserData) // Returning UserData the first time
  } else {
    val newState = StateClass(state.get.totalUsers + 1)
    state.update(newState)
    state.setTimeoutDuration("4 hours")
    Iterator() // Returning empty since state already exists (already sent this UserData before)
  }
}
The input stream I use is userStream.
The above functions work fine when I pass the stream to them directly.
val results = removeDuplicates(userStream)
But when I do something like:
userStream
  .writeStream
  .foreachBatch { (batch, batchId) => writeBatch(batch) }

def writeBatch(batch: Dataset[User]): Unit = {
  val distinctBatch = removeDuplicates(batch)
}
I get distinct user data only within that micro-batch, but I want it to be distinct overall, across the 4-hour timeout.
For example:
If the 1st batch has UserIds (1, 3, 5, 1) and the 2nd batch has UserIds (2, 3, 1).
Expected Behaviour:
Output: 1st Batch = (1, 3, 5) and 2nd Batch = (2)
My Output: 1st Batch = (1, 3, 5) and 2nd Batch = (2, 3, 1)
How can I enable it to use the same state throughout? Right now it treats each micro-batch differently and creates a separate state for each batch.
PS: The problem is not limited to deduplicating the stream; I need to use foreachBatch for some computations on micro-batches and remove duplicates before writing.
For a running test script, refer to this: https://ideone.com/nZ5pq2
The behavior is actually the expected one.
flatMapGroupsWithState leverages the state store only when the query is a streaming one. (For a batch query it doesn't even create a state store, because it isn't necessary.) Once you call foreachBatch, the provided batch parameter is no longer continuous across batches - consider it a dataset from a batch query, where "the batch" means a single micro-batch.
So you still need to pass your streaming dataset to removeDuplicates, or find your own way to deduplicate records across batches inside foreachBatch. For example:
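Here is a minimal sketch of the first option, reusing the question's User, removeDuplicates, writeBatch and userStream definitions: deduplicate on the streaming Dataset first, so the state store spans micro-batches, then do the per-batch work inside foreachBatch. The checkpoint path is a placeholder.

import org.apache.spark.sql.Dataset

// Stateful dedup on the streaming Dataset: the state store now spans micro-batches.
val distinctUsers: Dataset[User] = removeDuplicates(userStream)

// A typed function value avoids the foreachBatch overload ambiguity on some Spark/Scala versions.
val processBatch: (Dataset[User], Long) => Unit =
  (batch, _) => writeBatch(batch) // 'batch' already contains only first-seen users

distinctUsers.writeStream
  .outputMode("append")
  .option("checkpointLocation", "/tmp/checkpoints/dedup-users") // placeholder path
  .foreachBatch(processBatch)
  .start()

With this ordering, writeBatch no longer needs to call removeDuplicates itself; it only performs the per-batch computations and writes.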

How to process multiple parquet files individually in a for loop?

I have multiple parquet files (around 1000). I need to load each one of them, process it, and save the result to a Hive table. I have a for loop, but it only seems to work with 2 or 5 files, not with 1000; it seems Spark tries to load them all at the same time, and I need it to process them individually in the same Spark session.
I tried using a for loop, then a foreach, and I used unpersist(), but it fails anyway.
val ids = get_files_IDs()

ids.foreach(id => {
  println("Starting file " + id)
  var df = load_file(id)
  var values_df = calculate_values(df)
  values_df.write.mode(SaveMode.Overwrite).saveAsTable("table.values_" + id)
  df.unpersist()
})

def get_files_IDs(): List[String] = {
  var ids = sqlContext.sql("SELECT CAST(id AS varchar(10)) FROM table.ids WHERE id IS NOT NULL")
  var ids_list = ids.select("id").map(r => r.getString(0)).collect().toList
  return ids_list
}

def calculate_values(df: org.apache.spark.sql.DataFrame): org.apache.spark.sql.DataFrame = {
  val values_id = df.groupBy($"id", $"date", $"hr_time").agg(avg($"value_a") as "avg_val_a", avg($"value_b") as "avg_value_b")
  return values_id
}

def load_file(id: String): org.apache.spark.sql.DataFrame = {
  val df = sqlContext.read.parquet("/user/hive/wh/table.db/parquet/values_for_" + id + ".parquet")
  return df
}
What I would expect is for Spark to load file ID 1, process the data, save it to the Hive table, then discard that data and continue with the second ID, and so on until it finishes the 1000 files, instead of trying to load everything at the same time.
Any help would be very much appreciated! I've been stuck on this for days. I'm using Spark 1.6 with Scala. Thank you!!
EDIT: Added the definitions. Hope it helps to get a better view. Thank you!
OK, so after a lot of inspection I realised that the process was working fine. It processed each file individually and saved the results. The issue was that in some very specific cases the process was taking way too long.
So I can tell that with a for loop or a foreach you can process multiple files and save the results without problem. Unpersisting and clearing the cache does help performance.
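For reference, a minimal sketch of that per-file loop with an explicit cache/unpersist per iteration. It reuses the question's helpers (get_files_IDs, load_file, calculate_values) and assumes sqlContext and the Spark 1.6-style API are in scope; only the cache/clearCache calls are additions.

import org.apache.spark.sql.SaveMode

val ids = get_files_IDs()
ids.foreach { id =>
  println("Starting file " + id)
  val df = load_file(id).cache()   // cache only while this file is being processed
  val valuesDf = calculate_values(df)
  valuesDf.write.mode(SaveMode.Overwrite).saveAsTable("table.values_" + id)
  df.unpersist()                   // release the cached file before the next one
  sqlContext.clearCache()          // optionally drop anything else still cached
}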

SparkSQL performance issue with collect method

We are currently facing a performance issue in a Spark SQL application written in Scala. The application flow is described below.
The Spark application reads a text file from an input HDFS directory.
It creates a DataFrame on top of the file by programmatically specifying the schema. This DataFrame is an exact replica of the input file kept in memory and has around 18 columns.
var eqpDF = sqlContext.createDataFrame(eqpRowRdd, eqpSchema)
It then creates a filtered DataFrame from the first DataFrame constructed in step 2. This DataFrame contains the unique account numbers, obtained with the distinct keyword.
var distAccNrsDF = eqpDF.select("accountnumber").distinct().collect()
Using the two DataFrames constructed in steps 2 & 3, we get all the records that belong to one account number and apply some JSON parsing logic on top of the filtered data.
var filtrEqpDF =
eqpDF.where("accountnumber='" + data.getString(0) + "'").collect()
Finally, the JSON-parsed data is written to an HBase table.
We are facing performance issues when calling the collect method on these DataFrames, because collect fetches all the data to a single node and then does the processing, losing the benefit of parallel processing.
Also, in the real scenario we can expect around 10 billion records, so collecting all of them onto the driver node might crash the program itself due to memory or disk space limitations.
I don't think the take method, which fetches a limited number of records at a time, can be used in our case: we have to get all the unique account numbers from the whole data set, so I am not sure it suits our requirements.
I would appreciate any help on avoiding the collect calls and on other best practices to follow. Code snippets/suggestions/Git links will be very helpful if anyone has faced similar issues.
Code snippet
val eqpSchemaString = "accountnumber ....."
val eqpSchema = StructType(eqpSchemaString.split(" ").map(fieldName =>
  StructField(fieldName, StringType, true)))

val eqpRdd = sc.textFile(inputPath)
val eqpRowRdd = eqpRdd.map(_.split(",")).map(eqpRow => Row(eqpRow(0).trim, eqpRow(1).trim, ....))

var eqpDF = sqlContext.createDataFrame(eqpRowRdd, eqpSchema)
var distAccNrsDF = eqpDF.select("accountnumber").distinct().collect()

distAccNrsDF.foreach { data =>
  var filtrEqpDF = eqpDF.where("accountnumber='" + data.getString(0) + "'").collect()
  var result = new JSONObject()
  result.put("jsonSchemaVersion", "1.0")
  val firstRowAcc = filtrEqpDF(0)
  // JSON parsing logic
  // .....
  // .....
}
The approach usually taken in this kind of situation is:
Instead of collect, invoke foreachPartition: foreachPartition applies a function to each partition (represented by an Iterator[Row]) of the underlying DataFrame separately (the partition being the atomic unit of parallelism in Spark);
the function opens a connection to HBase (thus making it one per partition) and sends all the contained values through this connection.
This means that every executor opens a connection (which is not serializable, but lives within the boundaries of the function, thus not needing to be sent across the network) and independently sends its contents to HBase, without any need to collect all the data on the driver (or on any one node, for that matter).
It looks like you are reading a CSV file, so probably something like the following will do the trick:
spark.read.csv(inputPath). // Using DataFrameReader but your way works too
  foreachPartition { rows =>
    val conn = ??? // Create HBase connection
    for (row <- rows) { // Loop over the iterator
      val data = parseJson(row) // Your parsing logic
      ??? // Use 'conn' to save 'data'
    }
  }
You can avoid collect in your code if you have a large data set.
collect() returns all the elements of the dataset as an array at the driver program. This is usually useful after a filter or another operation that returns a sufficiently small subset of the data.
It can also cause the driver to run out of memory, because collect() fetches the entire RDD/DataFrame onto a single machine.
I have edited your code below; it should work for you.
var distAccNrsDF = eqpDF.select("accountnumber").distinct()

distAccNrsDF.foreach { data =>
  var filtrEqpDF = eqpDF.where("accountnumber='" + data.getString(0) + "'")
  var result = new JSONObject()
  result.put("jsonSchemaVersion", "1.0")
  val firstRowAcc = filtrEqpDF(0)
  // JSON parsing logic
  // .....
  // .....
}

Spark - how to handle lazy evaluation in the case of iterative (or recursive) function calls

I have a recursive function that needs to compare the results of the current call to the previous call to figure out whether it has reached convergence. My function does not contain any action - it only contains map, flatMap, and reduceByKey. Since Spark does not evaluate transformations (until an action is called), my next iteration does not get the proper values to compare for convergence.
Here is a skeleton of the function -
def func1(sc: SparkContext, nodes: RDD[List[Long]], didConverge: Boolean, changeCount: Int): RDD[List[Long]] = {
  if (didConverge)
    nodes
  else {
    val currChangeCount = sc.accumulator(0, "xyz")
    val newNodes = performSomeOps(nodes, currChangeCount) // does a few map/flatMap/reduceByKey operations
    if (currChangeCount.value == changeCount) {
      func1(sc, newNodes, true, currChangeCount.value)
    } else {
      func1(sc, newNodes, false, currChangeCount.value)
    }
  }
}
performSomeOps only contains map, flatMap, and reduceByKey transformations. Since it does not have any action, the code in performSomeOps does not execute, so my currChangeCount does not get the actual count. This means the condition that checks for convergence (currChangeCount.value == changeCount) is invalid. One way to overcome this is to force an action within each iteration by calling count, but that is unnecessary overhead.
I am wondering what I can do to force an action without much overhead, or whether there is another way to address this problem.
I believe there is a very important thing you're missing here:
For accumulator updates performed inside actions only, Spark guarantees that each task’s update to the accumulator will only be applied once, i.e. restarted tasks will not update the value. In transformations, users should be aware of that each task’s update may be applied more than once if tasks or job stages are re-executed.
Because of that, accumulators cannot be reliably used for managing control flow and are better suited for job monitoring.
Moreover, executing an action is not unnecessary overhead. If you want to know the result of a computation, you have to perform it - unless, of course, the result is trivial. The cheapest action possible is:
rdd.foreach { case _ => }
but it won't address the problem you have here.
In general iterative computations in Spark can be structured as follows:
def func1(checkpointInterval: Int)(
    sc: SparkContext, nodes: RDD[List[Long]],
    didConverge: Boolean, changeCount: Long, iteration: Int): RDD[List[Long]] = {
  if (didConverge) nodes
  else {
    // Compute and cache new nodes
    val newNodes = performSomeOps(nodes).cache
    // Periodically checkpoint to avoid stack overflow
    if (iteration % checkpointInterval == 0) newNodes.checkpoint
    /* Call a function which computes the values that determine control flow.
       This executes an action on newNodes. */
    val currChangeCount = computeChangeCount(newNodes)
    // Unpersist old nodes
    nodes.unpersist
    func1(checkpointInterval)(
      sc, newNodes, currChangeCount == changeCount,
      currChangeCount, iteration + 1
    )
  }
}
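For completeness, here is a hypothetical sketch of what computeChangeCount could look like. It is not part of the original answer; the point is only that it runs an action (count) that both forces the transformations and yields the value used for the convergence check. The filter predicate is a made-up placeholder.

import org.apache.spark.rdd.RDD

// Placeholder implementation: any action-based computation of the control-flow value works.
def computeChangeCount(newNodes: RDD[List[Long]]): Long =
  newNodes.filter(_.size > 1).count() // count() is the action that triggers evaluation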
I see that these map/flatMap/reduceByKey transformations are updating an accumulator. Therefore, the only way to perform all the updates is to execute all these functions, and count is the easiest way to achieve that; it gives the lowest overhead compared to the other options (cache + count, first or collect).
Previous answers put me on the right track to solve a similar convergence detection problem.
foreach is presented in the docs as:
foreach(func) : Run a function func on each element of the dataset. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems.
It seems like instead of using rdd.foreach() as a cheap action to trigger accumulator increments placed in various transformations, it should be used to do the incrementing itself.
I'm unable to produce a Scala example, but here's a basic Java version, if it can still help:
// Convergence is reached when two iterations
// return the same number of results
long previousCount = -1;
long currentCount = 0;

while (previousCount != currentCount) {
  rdd = doSomethingThatUpdatesRdd(rdd);

  // Count entries in the new rdd with foreach + accumulator
  rdd.foreach(tuple -> accumulator.add(1));

  // Update helper values
  previousCount = currentCount;
  currentCount = accumulator.sum();
  accumulator.reset();
}
// Convergence is reached
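Since the answer mentions not having a Scala example, here is a rough Scala equivalent of the same loop. The accumulator creation via sc.longAccumulator and the doSomethingThatUpdatesRdd placeholder are assumptions mirroring the Java version; rdd must be declared as a var.

// Convergence is reached when two iterations return the same number of results.
val accumulator = sc.longAccumulator("resultCount")
var previousCount = -1L
var currentCount = 0L
while (previousCount != currentCount) {
  rdd = doSomethingThatUpdatesRdd(rdd) // placeholder transformation step
  // Count entries in the new RDD with foreach + accumulator
  rdd.foreach(_ => accumulator.add(1))
  // Update helper values
  previousCount = currentCount
  currentCount = accumulator.sum
  accumulator.reset()
}
// Convergence is reached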

Method for reducing memory load of Spark program

I have a Spark program which calculates relations between users, i.e. it receives a data set of type:
RDD[(java.lang.Long, Map[(String, String), Integer])]
where the Long is a timestamp and the map holds scores for tuples of two users. The program should run some function over the scores and return the following type:
Map[String, Map[java.lang.Long, java.lang.Double]]
where the String is the first String in the tuple and the map holds the results of the function per timeslot.
In my case I have around 2000 users, so the maps I receive are quite big (2000^2 entries per timeslot), and the results also rely on the previous timeslot's results.
I am running the program locally and getting "GC overhead limit exceeded". I increased the heap memory to 14 GB using -Xmx14G in the VM arguments (I can see the Java process occupying more than 12 GB of memory), but it didn't help.
Currently implemented method
I have tried several directions to decrease the memory consumption and currently came up with the following idea: since every timestamp relies only on the previous one, I collect every timeslot separately and keep the previous results on the driver. In this manner I run calculations only on part of the data and hopefully it will not crash the program.
The code:
def calculateScorePerTimeslot(scorePerTimeslotRDD: RDD[(java.lang.Long, Map[(String, String), Integer])]): Map[String, Map[java.lang.Long, java.lang.Double]] = {
  var distancesPerTimeslotVarRDD = scorePerTimeslotRDD.groupBy(_._1).sortBy(_._1)
  println("Start collecting all the results - cache the data!!")
  distancesPerTimeslotVarRDD.cache()
  println("Caching all the data has completed!")
  while (!distancesPerTimeslotVarRDD.isEmpty()) {
    val dataForTimeslot: (java.lang.Long, Iterable[(java.lang.Long, Map[(String, String), Integer])]) = distancesPerTimeslotVarRDD.first()
    println("Retrieved data for timeslot: " + dataForTimeslot._1)
    // Code which is irrelevant for the question - logic
    // (the omitted logic also builds the result map that the function returns)
    println("Removing timeslot: " + dataForTimeslot._1)
    distancesPerTimeslotVarRDD = distancesPerTimeslotVarRDD.filter(t => !t._1.equals(dataForTimeslot._1))
    println("Filtering has completed! - without: " + dataForTimeslot._1)
  }
}
Summary: Basically, the idea is to extract one timeslot at a time, process it, and save the results on the driver - in this manner I try to reduce the size of the data that is passed to collect.
Reason I write this post
Unfortunately, this doesn't help and the program still dies. My question is: does this manner of taking the first() item of an RDD and then filtering it out have the effect of iterating over the items of the RDD? Are there better ideas to tackle this kind of problem (ideas which don't involve increasing the memory or moving to a real distributed cluster)?
Firstly, RDD[(java.lang.Long, Map[(String, String), Integer])] uses more memory than RDD[(java.lang.Long, Array[(String, String, Integer)])]. You'll save some memory if you can use the latter (see the sketch right after this paragraph).
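A hedged sketch of that first suggestion, flattening each inner Map into an Array of (user1, user2, score) triples; the variable names here are illustrative, and scorePerTimeslotRDD is the parameter from the question's function.

import org.apache.spark.rdd.RDD

// Flatten each inner Map into an Array of (user1, user2, score) triples.
val compactRDD: RDD[(java.lang.Long, Array[(String, String, Integer)])] =
  scorePerTimeslotRDD.mapValues { scores =>
    scores.map { case ((user1, user2), score) => (user1, user2, score) }.toArray
  }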
Secondly, your loop caches data pretty inefficiently. Always call unpersist on any RDD you no longer need:
distancesPerTimeslotVarRDD.cache()
var rddSize = distancesPerTimeslotVarRDD.count()
println("Caching all the data has completed!")

while (rddSize > 0) {
  val prevRDD = distancesPerTimeslotVarRDD
  val dataForTimeslot = distancesPerTimeslotVarRDD.first()
  println("Retrieved data for timeslot: " + dataForTimeslot._1)
  // Code which is irrelevant for the answer - logic
  println("Removing timeslot: " + dataForTimeslot._1)
  // Cache the new value of distancesPerTimeslotVarRDD
  distancesPerTimeslotVarRDD = distancesPerTimeslotVarRDD.filter(t => !t._1.equals(dataForTimeslot._1)).cache()
  // Force calculation so we can throw away the previous iteration's value
  rddSize = distancesPerTimeslotVarRDD.count()
  println("Filtering has completed! - without: " + dataForTimeslot._1)
  // Get rid of the previously cached RDD
  prevRDD.unpersist(false)
}
Thirdly, you can try using the Kryo serializer, though this sometimes makes things worse. You have to configure the serializer and replace cache with persist(StorageLevel.MEMORY_ONLY_SER); a minimal example follows.
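A minimal sketch of that third suggestion; the app name is a placeholder and only the spark.serializer setting is strictly required.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .setAppName("user-relations") // placeholder name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)

// ...and cache the RDD in serialized form instead of a plain cache():
distancesPerTimeslotVarRDD.persist(StorageLevel.MEMORY_ONLY_SER)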