How is the number of tasks and partitions set when using MemoryStream? - scala

I'm trying to understand a strange behavior that I observed in my Spark Structured Streaming application, which runs in local[*] mode.
I have 8 cores on my machine. While the majority of my batches have 8 partitions, every once in a while I get 16, 32, 56 and so on partitions/tasks; I noticed that it is always a multiple of 8. Opening the stage tab, I noticed that when it happens it is because there are multiple LocalTableScans.
That is, if I have 2 LocalTableScans, then the mini-batch job has 16 tasks/partitions, and so on.
I could understand Spark doing two scans, combining the two batches and feeding them to the mini-batch job. But no: it results in a mini-batch job where the number of tasks = number of cores * number of scans.
Here is how I set my MemoryStream:
val rows = MemoryStream[Map[String,String]]
val df = rows.toDF()
val rdf = df.mapPartitions{ it => {.....}}(RowEncoder.apply(StructType(List(StructField("blob", StringType, false)))))
Right after that, I have a Future that feeds my MemoryStream as follows:
Future {
  blocking {
    for (i <- 1 to 100000) {
      rows.addData(maps)
      Thread.sleep(3000)
    }
  }
}
and then my query:
rdf.writeStream
  .trigger(Trigger.ProcessingTime("1 seconds"))
  .format("console")
  .outputMode("append")
  .queryName("SourceConvertor1")
  .start()
  .awaitTermination()
I wonder why the number of tasks varies. How is it supposed to be determined by Spark?
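For reference, a hedged sketch of one possible workaround: coalescing the streaming DataFrame before writing should cap the per-batch task count (coalesce is supported on streaming Datasets), though it does not answer why Spark multiplies the scans.

// Workaround sketch: force each micro-batch down to a fixed number of partitions.
rdf.coalesce(8)                        // 8 = number of cores in this example
  .writeStream
  .trigger(Trigger.ProcessingTime("1 seconds"))
  .format("console")
  .outputMode("append")
  .queryName("SourceConvertor1")
  .start()
  .awaitTermination()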

Related

Spark "No space left on device" when working on extremely large data

The following is my Scala Spark code:
val vertex = graph.vertices
val edges = graph.edges.map(v => (v.srcId, v.dstId)).toDF("key", "value")
var FMvertex = vertex.map(v => (v._1, HLLCounter.encode(v._1)))
var encodedVertex = FMvertex.toDF("keyR", "valueR")
var Degvertex = vertex.map(v => (v._1, 0.toLong))
var lastRes = Degvertex
// calculate FM of the next step
breakable {
  for (i <- 1 to MaxIter) {
    var N_pre = FMvertex.map(v => (v._1, HLLCounter.decode(v._2)))
    var adjacency = edges.join(
      encodedVertex, //FMvertex.toDF("keyR", "valueR"),
      $"value" === $"keyR"
    ).rdd.map(r => (r.getAs[VertexId]("key"), r.getAs[Array[Byte]]("valueR"))).reduceByKey((a, b) => HLLCounter.Union(a, b))
    FMvertex = FMvertex.union(adjacency).reduceByKey((a, b) => HLLCounter.Union(a, b))
    // update vertex encoding
    encodedVertex = FMvertex.toDF("keyR", "valueR")
    var N_curr = FMvertex.map(v => (v._1, HLLCounter.decode(v._2)))
    lastRes = N_curr
    var middleAns = N_curr.union(N_pre).reduceByKey((a, b) => Math.abs(a - b)) //.mapValues(x => x._1 - x._2)
    if (middleAns.values.sum() == 0) {
      println(i)
      break
    }
    Degvertex = Degvertex.join(middleAns).mapValues(x => x._1 + i * x._2) //.map(identity)
  }
}
val res = Degvertex.join(lastRes).mapValues(x => x._1.toDouble / x._2.toDouble)
return res
In it I use several functions that I defined in Java:
import net.agkn.hll.HLL;
import com.google.common.hash.*;
import com.google.common.hash.Hashing;
import java.io.Serializable;

public class HLLCounter implements Serializable {
  private static int seed = 1234567;
  private static HashFunction hs = Hashing.murmur3_128(seed);
  private static int log2m = 15;
  private static int regwidth = 5;

  public static byte[] encode(Long id) {
    HLL hll = new HLL(log2m, regwidth);
    Hasher myhash = hs.newHasher();
    hll.addRaw(myhash.putLong(id).hash().asLong());
    return hll.toBytes();
  }

  public static byte[] Union(byte[] byteA, byte[] byteB) {
    HLL hllA = HLL.fromBytes(byteA);
    HLL hllB = HLL.fromBytes(byteB);
    hllA.union(hllB);
    return hllA.toBytes();
  }

  public static long decode(byte[] bytes) {
    HLL hll = HLL.fromBytes(bytes);
    return hll.cardinality();
  }
}
This code is used for calculating effective closeness on a large graph, and I used a HyperLogLog package.
The code works fine when I run it on a graph with about ten million vertices and a hundred million edges. However, when I run it on a graph with billions of vertices and billions of edges, after several hours running on the cluster it shows
Driver stacktrace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 91 in stage 29.1 failed 4 times, most recent failure: Lost task 91.3 in stage 29.1 (TID 17065, 9.10.135.216, executor 102): java.io.IOException: : No space left on device
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:326)
at org.apache.spark.storage.TimeTrackingOutputStream.write(TimeTrackingOutputStream.java:58)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
Can anybody help me? I have only been using Spark for a few days. Thank you for helping.
Xiaotian, you state "The shuffle read and shuffle write is about 1TB. I do not need those intermediate values or RDDs". This statement affirms that you are not familiar with Apache Spark or possibly the algorithm you are running. Please let me explain.
When adding three numbers, you have to make a choice about the first two numbers to add. For example (a+b)+c or a+(b+c). Once that choice is made, there is a temporary intermediate value that is held for the number within the parenthesis. It is not possible to continue the computation across all three numbers without the intermediary number.
The RDD is a space efficient data structure. Each "new" RDD represents a set of operations across an entire data set. Some RDDs represent a single operation, like "add five" while others represent a chain of operations, like "add five, then multiply by six, and subtract by seven". You cannot discard an RDD without discarding some portion of your mathematical algorithm.
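As a minimal illustrative sketch of that idea (not taken from the question's code), each transformation only records more work on top of the previous RDD, and nothing runs until an action is called:

// Every transformation returns a new, lazy RDD that merely describes the work to do.
val base  = sc.parallelize(1 to 10)        // sc: an existing SparkContext; this is the data set
val step1 = base.map(_ + 5)                // "add five"
val step2 = step1.map(_ * 6).map(_ - 7)    // "... then multiply by six, and subtract by seven"
step2.collect()                            // only now is the whole chain actually evaluated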
At its core, Apache Spark is a scatter-gather algorithm. It distributes a data set to a number of worker nodes, where that data set is part of a single RDD that gets distributed, along with the needed computations. At this point in time, the computations are not yet performed. As the data is requested from the computed form of the RDD, the computations are performed on-demand.
Occasionally, it is not possible to finish a computation on a single worker without knowing some of the intermediate values from other workers. This kind of cross-communication always happens between the head node and the workers, because the head node distributes the data to the various workers and later collects and aggregates the data from them; but, depending on how the algorithm is structured, it can also occur mid-computation (especially in algorithms that groupBy or join data slices).
You have an algorithm that requires shuffling, in such a manner that a single node cannot collect the results from all of the other nodes, because that single node does not have enough RAM to hold the intermediate values collected from the other nodes.
In short, you have an algorithm that can't scale to accommodate the size of your data set with the hardware you have available.
At this point, you need to go back to your Apache Spark algorithm and see if it is possible to
Tune the partitions in the RDD to reduce the cross-talk (partitions that are too small may require more cross-talk during shuffling, since a fully connected inter-transfer grows at O(N^2); partitions that are too big may run out of RAM within a compute node); see the sketch after this list.
Restructure the algorithm such that full shuffling is not required (sometimes you can reduce in stages, so that you are dealing with more reduction phases, each phase combining less data).
Restructure the algorithm such that shuffling is not required (it is possible, but unlikely that the algorithm is simply mis-written, and factoring it differently can avoid requesting remote data from a node's perspective).
If the problem is in collecting the results, rewrite the algorithm to return the results not in the head node's console, but in a distributed file system that can accommodate the data (like HDFS).
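As mentioned above, here is a brief, hedged sketch of the first and last options, assuming the RDDs from your code are in scope (the partition count and output path are illustrative only, not recommendations):

// Re-tune the partitioning before the expensive join/reduce steps.
val retunedFM = FMvertex.repartition(2048)   // hypothetical count; size it to your data and executors
// Write the final result to a distributed file system instead of collecting it on the driver.
res.map { case (id, closeness) => s"$id,$closeness" }
  .saveAsTextFile("hdfs:///tmp/effective_closeness")   // hypothetical HDFS path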
Without the nuts-and-bolts of your Apache Spark program, access to your data set, and access to your Spark cluster and its logs, it's hard to know which one of these common approaches would benefit you the most, so I have listed them all.
Good Luck!

spark streaming - use previously calculated dataframe in next iteration

I have a streaming app that takes a DStream, runs an SQL manipulation over it, and dumps it to a file:
dstream.foreachRDD { rdd =>
  spark.read.json(rdd)
    .select("col")
    .filter("value = 1")
    .write.csv("s3://..")
}
Now I need to be able to take into account the previous calculation (from an earlier batch) in my calculation, something like the following:
dstream.foreachRDD { rdd =>
  val df = spark.read.json(rdd)
  val prev_df = read_prev_calc()
  df.join(prev_df, "id")
    .select("col")
    .filter(prev_df("value").equalTo(1))
    .write.csv("s3://..")
}
Is there a way to keep the calculation result in memory somehow and use it as an input to the next calculation?
Have you tried using the persist() method on a DStream? It will automatically persist every RDD of that DStream in memory.
by default, all input data and persisted RDDs generated by DStream transformations are automatically cleared.
Also, DStreams generated by window-based operations are automatically persisted in memory.
For more details, you can check https://spark.apache.org/docs/latest/streaming-programming-guide.html#caching--persistence
https://spark.apache.org/docs/0.7.2/api/streaming/spark/streaming/DStream.html
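A minimal sketch of what that persist suggestion looks like (the storage level shown is just one reasonable choice; Minutes comes from org.apache.spark.streaming):

// Explicitly cache every RDD this DStream generates across batches.
import org.apache.spark.storage.StorageLevel
dstream.persist(StorageLevel.MEMORY_ONLY_SER)
// Window-based streams are persisted automatically, e.g. the last minute of data:
val lastMinute = dstream.window(Minutes(1))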
If you are looking only for one or two previously calculated dataframes, you should look into Spark Streaming Window.
The snippet below is from the Spark documentation:
val windowedStream1 = stream1.window(Seconds(20))
val windowedStream2 = stream2.window(Minutes(1))
val joinedStream = windowedStream1.join(windowedStream2)
Or even simpler: if we want to generate a word count over the last 20 seconds of data, every 10 seconds, we have to apply the reduceByKey operation on the pairs DStream of (word, 1) pairs over the last 20 seconds of data. This is done using the reduceByKeyAndWindow operation.
// Reduce last 20 seconds of data, every 10 seconds
val windowedWordCounts = pairs.reduceByKeyAndWindow((a:Int,b:Int) => (a + b), Seconds(20), Seconds(10))
More details and examples at:
https://spark.apache.org/docs/latest/streaming-programming-guide.html#window-operations

Reading Mongo data from Spark

I am reading data from mongodb on a spark job, using the com.mongodb.spark.sql connector (v 2.0.0).
It works fine for most db's, but for a specific db, the stage takes a long time and the number of partitions is very high.
My program is set to 128 partitions (2x the number of vCPUs), which works fine after some testing we did. On this load the number jumps to 2061 partitions and the stage takes several minutes to process, even though I am using a filter and the documentation clearly states that filters are pushed down to the underlying data source (https://docs.mongodb.com/spark-connector/v2.0/scala/datasets-and-sql/).
This is how I read data:
val readConfig: ReadConfig = ReadConfig(
Map(
"spark.mongodb.input.uri" -> s"${mongodb.uri}/?${mongodb.uriParams}",
"spark.mongodb.input.database" -> s"${mongodb.dbNamesConfig.siteInstances}",
"collection" -> params.collectionName
), None)
val df: DataFrame = sparkSession.read.format("com.mongodb.spark.sql").options(readConfig.asOptions)
.schema(impressionSchema)
.load()
println("df: " + df.rdd.getNumPartitions) // this is 2061 partitions
val filteredDF = df.coalesce(128).filter(
$"_timestamp".isNotNull
.and($"_timestamp".between(new Timestamp(start.getMillis()), new Timestamp(end.getMillis())))
.and($"component_type" === SITE_INSTANCE_CHART_COMPONENT)
)
println("filteredDF: " + filteredDF.rdd.getNumPartitions) // 128 after using coalesce
filteredDF.select(
$"iid",
$"instance_id".as("instanceId"),
$"_global_visitor_key".as("globalVisitorKey"),
$"_timestamp".as("timestamp"),
$"_timestamp".cast(DataTypes.DateType).as("date")
)
Data is not very big (Shuffle Write is 20MB for this stage) and even if I filter only 1 document, the run time is the same (only the Shuffle Write is much smaller).
How can I solve this?
Thanks
Nir
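For what it's worth, a hedged sketch of one knob that may be worth trying, assuming the 2.0.x connector's partitioner options apply to your setup: make the connector build larger (and therefore fewer) input partitions instead of coalescing 2061 of them after the fact.

// Hypothetical tweak: raise the partition size used by the connector's sample partitioner.
val readConfig: ReadConfig = ReadConfig(
  Map(
    "spark.mongodb.input.uri" -> s"${mongodb.uri}/?${mongodb.uriParams}",
    "spark.mongodb.input.database" -> s"${mongodb.dbNamesConfig.siteInstances}",
    "collection" -> params.collectionName,
    "spark.mongodb.input.partitioner" -> "MongoSamplePartitioner",
    "spark.mongodb.input.partitionerOptions.partitionSizeMB" -> "512"   // default is 64
  ), None)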

How to Persist an array in spark

I'm comparing two tables to find the difference between them (i.e. source and destination). To do that, I load both tables into memory, and the comparison happens as expected on a machine with 8 GB of memory and 4 cores. But when comparing a large amount of data the system hangs and runs out of memory, so I used persist() with storage level DISK_ONLY.
The machine is capable of holding 100,000 rows in memory, so I want to store that many rows to disk at a time and then do the rest of the comparison operations. I'm trying the following:
var partition = math.ceil(c / 100000.toFloat).toInt
println(partition + " partition")
var a = 1
var data = spark.sparkContext.parallelize(Seq(""))
var offset = 0
for (s <- a to partition) {
  val query = "(select * from destination LIMIT 100000 OFFSET " + offset + ") as src"
  data = data.union(spark.read.jdbc(url, query, connectionProperties).rdd.map(_.mkString(","))).persist(StorageLevel.DISK_ONLY)
  offset += 100000
}
val dest = data.collect.toArray
val s = spark.sparkContext.parallelize(dest, 1).persist(StorageLevel.DISK_ONLY)
Yes, of course I can use JDBC partitioning, but the problem is that I need to supply lowerBound, upperBound and numPartitions dynamically to fetch 100,000 rows at a time. I tried:
val destination = spark.read.options(options).jdbc(options("url"), options("dbtable"), "EMPLOYEE_ID", 1, 22, 21, new java.util.Properties()).rdd.map(_.mkString(","))
It takes too much time storing those files into partitions, and since the comparison operation is iterative in nature, it reads all the partitions at each and every step.
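A hedged sketch of how the bounds could be computed at runtime rather than hard-coded (assuming EMPLOYEE_ID is a numeric, reasonably evenly distributed key; the subquery alias and names are illustrative):

// Query min/max once, then let Spark split the table into ~100,000-row partitions.
val boundsRow = spark.read
  .jdbc(url, "(select min(EMPLOYEE_ID) as lo, max(EMPLOYEE_ID) as hi from destination) b", connectionProperties)
  .head()
val lo = boundsRow.getAs[Number](0).longValue    // cast defensively; the JDBC type depends on the database
val hi = boundsRow.getAs[Number](1).longValue
val numPartitions = math.ceil((hi - lo + 1) / 100000.0).toInt
val destination = spark.read
  .jdbc(url, "destination", "EMPLOYEE_ID", lo, hi, numPartitions, connectionProperties)
  .rdd.map(_.mkString(","))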
Coming to the problem
val dest = data.collect.toArray
val s = spark.sparkContext.parallelize(dest, 1).persist(StorageLevel.DISK_ONLY)
The above lines convert all the partitioned RDDs to an Array and parallelize it to a single partition, so that I don't have to iterate through all the partitions again and again. But val dest = data.collect.toArray is not able to convert such a huge number of rows because of the shortage of memory, and it seems Spark won't allow me to persist() an Array.
Is there any way I can store the data and parallelize it to one partition on DISK?
Sorry for being a noob.
Thank you!
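A hedged sketch of one way to get what the last question asks for, avoiding the driver-side collect entirely: coalesce the RDD to a single partition and persist that partition to disk, so later comparison passes reuse it.

// Collapse to one partition and keep it on disk, without pulling the data to the driver.
val single = data.coalesce(1).persist(StorageLevel.DISK_ONLY)
single.count()   // force materialization so subsequent iterations read the cached copy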

Spark: coalesce very slow even the output data is very small

I have the following code in Spark:
myData.filter(t => t.getMyEnum() == null)
.map(t => t.toString)
.saveAsTextFile("myOutput")
There are 2000+ files in the myOutput folder, but only a few records satisfy t.getMyEnum() == null, so there are only very few output records. Since I don't want to search for just a few outputs across 2000+ output files, I tried to combine the output using coalesce like below:
myData.filter(t => t.getMyEnum() == null)
.map(t => t.toString)
.coalesce(1, false)
.saveAsTextFile("myOutput")
Then the job becomes EXTREMELY SLOW! I am wondering why it is so slow, since there are only a few output records scattered across 2000+ partitions. Is there a better way to solve this problem?
if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can pass shuffle = true. This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).
Note: With shuffle = true, you can actually coalesce to a larger
number of partitions. This is useful if you have a small number of partitions, say 100, potentially with a few partitions being abnormally large. Calling coalesce(1000, shuffle = true) will result in 1000 partitions with the data distributed using a hash partitioner.
So try passing true to the coalesce function, i.e.
myData.filter(_.getMyEnum == null)
.map(_.toString)
.coalesce(1, shuffle = true)
.saveAsTextFile("myOutput")