Spark dataframe process partitions in batches, N partitions at a time - scala

I need to process Spark DataFrame partitions in batches, N partitions at a time. For example, if I have 1000 partitions in a Hive table, I need to process 100 partitions at a time.
I tried the following approach:
Get the partition list from the Hive table and find the total count.
Get the loop count using total_count/100.
Then:
for x in range(loop_count):
    files_list=partition_path_list[start_index:end_index]
    df = spark.read.option("basePath", target_table_location).parquet(*files_list)
But this is not working as expected. Can anyone suggest a better method? A solution in Spark Scala is preferred.

Your for loop only increments x on each iteration. start_index and end_index are never updated, which is why the slice does not advance.
Not sure why you mention Scala since your code is in Python.
Here's an example with loop count being 1000.
partitions_per_iteration = 100
loop_count = 1000
for start_index in range(0, loop_count, partitions_per_iteration):
    files_list=partition_path_list[start_index:start_index + partitions_per_iteration]
    df = spark.read.option("basePath", target_table_location).parquet(*files_list)
In Scala, you can do a similar loop:
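val total = 1000
val partitionsPerIteration = 100
for {
  startIndex <- 0 until total by partitionsPerIteration
} {
  // slice clamps the end index, so a shorter final batch is handled
  val filesList = partitionsPathList.slice(startIndex, startIndex + partitionsPerIteration)
  val df = ...
}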
I think total or totalPartitions is a clearer variable name than "loop count".
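A minimal Python sketch of the same batching logic, written so that the total is taken from the list itself and a final batch smaller than N is handled. The `batches` helper and the example paths are hypothetical; `partition_path_list` stands in for the list read from the Hive table, and the Spark read is left as a comment because it needs a live session.

```python
def batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        # Slicing past the end is safe: the last slice is simply shorter.
        yield items[start:start + batch_size]


# Hypothetical partition paths, only for illustration.
partition_path_list = [f"/warehouse/tbl/dt={i:04d}" for i in range(1050)]

batch_sizes = []
for files_list in batches(partition_path_list, 100):
    # With a live session, each batch would be read here, e.g.:
    # df = spark.read.option("basePath", target_table_location).parquet(*files_list)
    batch_sizes.append(len(files_list))
```

With 1050 paths and a batch size of 100, this produces ten full batches and one batch of 50.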

Related

How the number of Tasks and Partitions is set when using MemoryStream?

I'm trying to understand a strange behavior that I observed in my Spark Structured Streaming application, which is running in local[*] mode.
I have 8 cores on my machine. While the majority of my batches have 8 partitions, every once in a while I get 16, 32, 56, and so on partitions/tasks. I notice that it is always a multiple of 8. On opening the Stages tab, I noticed that when this happens, it is because there are multiple LocalTableScans.
That is, if I have 2 LocalTableScans, then the mini-batch job will have 16 tasks/partitions, and so on.
I mean, it could well do the two scans, combine the two batches, and feed them to the mini-batch job. Instead, it results in a mini-batch job where the number of tasks = number of cores * number of scans.
Here is how I set my MemoryStream:
val rows = MemoryStream[Map[String,String]]
val df = rows.toDF()
val rdf = df.mapPartitions{ it => {.....}}(RowEncoder.apply(StructType(List(StructField("blob", StringType, false)))))
I have a future that feeds my memory stream as such, right after:
Future {
  blocking {
    for (i <- 1 to 100000) {
      rows.addData(maps)
      Thread.sleep(3000)
    }
  }
}
and then my query:
rdf.writeStream.
trigger(Trigger.ProcessingTime("1 seconds"))
.format("console").outputMode("append")
.queryName("SourceConvertor1").start().awaitTermination()
I wonder why the number of tasks varies. How is it supposed to be determined by Spark?

spark streaming - use previous calculated dataframe in next iteration

I have a streaming app that takes a DStream, runs an SQL manipulation over it, and dumps the result to a file:
dstream.foreachRDD { rdd =>
  spark.read.json(rdd)
    .select("col")
    .filter("value = 1")
    .write.csv("s3://..")
}
Now I need to be able to take the previous calculation (from an earlier batch) into account in my calculation, something like the following:
dstream.foreachRDD { rdd =>
  val df = spark.read.json(rdd)
  val prev_df = read_prev_calc()
  df.join(prev_df, "id")
    .select("col")
    .filter(prev_df("value").equalTo(1))
    .write.csv("s3://..")
}
Is there a way to keep the calculation result in memory somehow and use it as an input to the next calculation?
Have you tried using the persist() method on a DStream? It will automatically persist every RDD of that DStream in memory.
Note that, by default, all input data and persisted RDDs generated by DStream transformations are automatically cleared.
Also, DStreams generated by window-based operations are automatically persisted in memory.
For more details, you can check https://spark.apache.org/docs/latest/streaming-programming-guide.html#caching--persistence
https://spark.apache.org/docs/0.7.2/api/streaming/spark/streaming/DStream.html
If you only need the one or two most recently calculated dataframes, you should look into Spark Streaming window operations.
The snippet below is from the Spark documentation.
val windowedStream1 = stream1.window(Seconds(20))
val windowedStream2 = stream2.window(Minutes(1))
val joinedStream = windowedStream1.join(windowedStream2)
Or even simpler: if we want to do a word count over the last 20 seconds of data, every 10 seconds, we have to apply the reduceByKey operation on the pairs DStream of (word, 1) pairs over the last 20 seconds of data. This is done using the operation reduceByKeyAndWindow.
// Reduce last 20 seconds of data, every 10 seconds
val windowedWordCounts = pairs.reduceByKeyAndWindow((a:Int,b:Int) => (a + b), Seconds(20), Seconds(10))
More details and examples at:
https://spark.apache.org/docs/latest/streaming-programming-guide.html#window-operations
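To make the window length and slide interval concrete, here is a plain-Python simulation of a windowed word count. It does not use Spark; `windowed_counts` is a hypothetical helper, and the 5-second batch interval is an assumption chosen so that a 20-second window spans 4 batches and a 10-second slide spans 2 batches.

```python
from collections import Counter


def windowed_counts(batches, window_len, slide):
    """Count words over the last window_len batches, every slide batches.

    batches: list of word lists, one per batch interval.
    window_len, slide: measured in number of batches.
    """
    results = []
    for end in range(slide, len(batches) + 1, slide):
        start = max(0, end - window_len)  # early windows are shorter
        counts = Counter()
        for batch in batches[start:end]:
            counts.update(batch)
        results.append(dict(counts))
    return results


# Six 5-second batches of words (made-up data).
batches = [["a"], ["b"], ["a"], ["a", "b"], ["c"], ["a"]]

# Window of 20 s = 4 batches, slide of 10 s = 2 batches.
result = windowed_counts(batches, window_len=4, slide=2)
```

Each entry of `result` covers at most the four most recent batches, so data from earlier batches contributes to later outputs without being stored by hand.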

Spark: convert large input to rdd

I read a lot of lines from an iterator and I need to convert them to an RDD.
I have seen some answers suggesting sc.parallelize(YourIterable.toList), but the toList will raise an out-of-memory exception.
I have also read posts saying that this goes against the Spark model, but I think there should be another solution.
I have two ideas; please tell me which is best, or whether you have any other ideas to solve this.
Solution 1: I store the lines in an ArrayBuffer, 100,000 at a time, and convert the buffer to an RDD with parallelize, repeating until the iterator is empty.
val result = ArrayBuffer[String]()
var counter = 0
var resultRDD:RDD[Array[Option[Any]]] = sc.sparkContext.emptyRDD[Array[Option[Any]]]
while (resultSet.next()) {
  //Do some stuff on resultSet
  result.append(stuff)
  counter = counter + 1
  if(counter % 100000 == 0){
    val tmp = sc.sparkContext.parallelize(result)
    tmp.count // Need to run an action because result will be cleared
    resultRDD = resultRDD union sc.sparkContext.parallelize(result)
    result.clear
  }
}
// Same for last lines
//use resultRDD
With this method, having to call count to force an action on the lazy union before the ArrayBuffer is cleared is a bit annoying.
Solution 2: Read in the same chunks, but write them to files in HDFS and then load them with sc.textFile.
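A rough Python sketch of the chunk-and-spill idea behind Solution 2, under the assumption that a local temporary directory stands in for HDFS. `chunked` and `spill_to_files` are hypothetical helpers; only one chunk is held in memory at a time, and the resulting directory is what would later be passed to a text-file reader.

```python
import os
import tempfile
from itertools import islice


def chunked(rows, size):
    """Yield lists of at most `size` rows without materializing the iterator."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def spill_to_files(rows, out_dir, chunk_size):
    """Write each chunk to its own part file and return the file paths."""
    paths = []
    for i, chunk in enumerate(chunked(rows, chunk_size)):
        path = os.path.join(out_dir, f"part-{i:05d}.txt")
        with open(path, "w") as f:
            f.writelines(line + "\n" for line in chunk)
        paths.append(path)
    return paths


# A generator plays the role of the large input iterator.
out_dir = tempfile.mkdtemp()
rows = (f"row-{i}" for i in range(250))
paths = spill_to_files(rows, out_dir, chunk_size=100)
```

Because every chunk is flushed to a file before the next one is read, there is no need to force an action before clearing a shared buffer.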

How to Persist an array in spark

I'm comparing two tables (source and destination) to find the differences between them. For that, I load both tables into memory, and the comparison works as expected on a machine with 8 GB of memory and 4 cores. When comparing large amounts of data, however, the system hangs and runs out of memory, so I used persist() with storage level DISK_ONLY.
The machine can hold 100,000 rows in memory at a time, so I want to store the rows to disk in chunks of that size and then do the rest of the comparison operations. I'm trying it like below:
var partition = math.ceil(c / 100000.toFloat).toInt
println(partition + " partition")
var a = 1
var data = spark.sparkContext.parallelize(Seq(""))
var offset = 0
for (s <- a to partition) {
  val query = "(select * from destination LIMIT 100000 OFFSET " + offset + ") as src"
  data = data.union(spark.read.jdbc(url, query, connectionProperties).rdd.map(_.mkString(","))).persist(StorageLevel.DISK_ONLY)
  offset += 100000
}
val dest = data.collect.toArray
val s = spark.sparkContext.parallelize(dest, 1).persist(StorageLevel.DISK_ONLY)
Yes, of course I can use partitioning, but the problem is that I need to supply lowerBound, upperBound, and numPartitions dynamically to fetch 100,000 rows at a time. I tried this:
val destination = spark.read.options(options).jdbc(options("url"), options("dbtable"), "EMPLOYEE_ID", 1, 22, 21, new java.util.Properties()).rdd.map(_.mkString(","))
It takes too much time and stores the data into partitions. Because the comparison operation is iterative in nature, it reads all the partitions again at each and every step.
Coming to the problem:
val dest = data.collect.toArray
val s = spark.sparkContext.parallelize(dest, 1).persist(StorageLevel.DISK_ONLY)
The lines above convert all the partitioned RDDs to an Array and parallelize it into a single partition, so that I don't have to iterate through all the partitions again and again. But val dest = data.collect.toArray cannot convert such a huge number of lines because of the memory shortage, and it seems Spark won't allow me to persist() an array.
Is there any way I can store the data and parallelize it to one partition on disk?
Sorry for being a noob.
Thank you!
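On supplying the bounds dynamically: one possible approach is to derive numPartitions from the key range so that each partition covers roughly 100,000 key values. The Python sketch below shows only the arithmetic. `jdbc_bounds` is a hypothetical helper, the minimum and maximum key are assumed to come from a separate query such as a MIN/MAX over EMPLOYEE_ID, and the key is assumed to be numeric and reasonably dense.

```python
import math


def jdbc_bounds(min_id, max_id, rows_per_partition):
    """Derive partitioning options from a numeric key range.

    Assumes keys are dense, so the key span approximates the row count.
    """
    if max_id < min_id:
        raise ValueError("max_id must be >= min_id")
    span = max_id - min_id + 1
    return {
        "lowerBound": min_id,
        "upperBound": max_id,
        "numPartitions": max(1, math.ceil(span / rows_per_partition)),
    }


# Made-up key range for illustration.
bounds = jdbc_bounds(min_id=1, max_id=2_500_000, rows_per_partition=100_000)
```

In Spark's JDBC reader the lower and upper bounds only decide how the key range is split into partitions; they do not filter rows, so keys outside the range still end up in the first or last partition.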

KMeans|| for sentiment analysis on Spark

I'm trying to write a sentiment analysis program based on Spark. To do this I'm using word2vec and KMeans clustering. From word2vec I've got a collection of 20k word vectors in a 100-dimensional space, and now I'm trying to cluster this vector space. When I run KMeans with the default parallel implementation, the algorithm runs for 3 hours! But with the random initialization strategy it takes about 8 minutes.
What am I doing wrong? I have a MacBook Pro with a 4-core processor and 16 GB of RAM.
K ~= 4000
maxIterations was 20
var vectors: Iterable[org.apache.spark.mllib.linalg.Vector] =
model.getVectors.map(entry => new VectorWithLabel(entry._1, entry._2.map(_.toDouble)))
val data = sc.parallelize(vectors.toIndexedSeq).persist(StorageLevel.MEMORY_ONLY_2)
log.info("Clustering data size {}",data.count())
log.info("==================Train process started==================");
val clusterSize = modelSize/5
val kmeans = new KMeans()
kmeans.setInitializationMode(KMeans.K_MEANS_PARALLEL)
kmeans.setK(clusterSize)
kmeans.setRuns(1)
kmeans.setMaxIterations(50)
kmeans.setEpsilon(1e-4)
time = System.currentTimeMillis()
val clusterModel: KMeansModel = kmeans.run(data)
And spark context initialization is here:
val conf = new SparkConf()
.setAppName("SparkPreProcessor")
.setMaster("local[4]")
.set("spark.default.parallelism", "8")
.set("spark.executor.memory", "1g")
val sc = SparkContext.getOrCreate(conf)
Also, a few updates about running this program: I'm running it inside IntelliJ IDEA. I don't have a real Spark cluster, but I thought that a personal machine could act as a Spark cluster.
I saw that the program hangs inside this loop from Spark code LocalKMeans.scala:
// Initialize centers by sampling using the k-means++ procedure.
centers(0) = pickWeighted(rand, points, weights).toDense
for (i <- 1 until k) {
  // Pick the next center with a probability proportional to cost under current centers
  val curCenters = centers.view.take(i)
  val sum = points.view.zip(weights).map { case (p, w) =>
    w * KMeans.pointCost(curCenters, p)
  }.sum
  val r = rand.nextDouble() * sum
  var cumulativeScore = 0.0
  var j = 0
  while (j < points.length && cumulativeScore < r) {
    cumulativeScore += weights(j) * KMeans.pointCost(curCenters, points(j))
    j += 1
  }
  if (j == 0) {
    logWarning("kMeansPlusPlus initialization ran out of distinct points for centers." +
      s" Using duplicate point for center k = $i.")
    centers(i) = points(0).toDense
  } else {
    centers(i) = points(j - 1).toDense
  }
}
Initialisation using KMeans.K_MEANS_PARALLEL is more complicated than random. However, it shouldn't make such a big difference. I would recommend investigating whether it is the parallel algorithm that takes too much time (it should actually be more efficient than KMeans itself).
For information on profiling see:
http://spark.apache.org/docs/latest/monitoring.html
If it is not the initialisation which takes up the time, there is something seriously wrong. However, using random initialisation shouldn't be any worse for the final result (just less efficient!).
Actually, when you use KMeans.K_MEANS_PARALLEL to initialise, you should get reasonable results with 0 iterations. If this is not the case, there might be some regularities in the distribution of the data which send KMeans off track. Hence, if you haven't distributed your data randomly, you could also change this. However, such an impact would surprise me given a fixed number of iterations.
I've run Spark on AWS with 3 slaves (c3.xlarge) and the result is the same. The problem is that the parallel KMeans initialization algorithm runs in N parallel runs, but it's still extremely slow for a small amount of data. My solution is to continue using random initialization.
Data size approximately: 4k clusters for 21k 100-dim vectors.
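A back-of-the-envelope estimate in Python of why the quoted LocalKMeans loop can dominate the run time for large k. It counts only the summing pass and assumes that each pointCost call compares a point against all i current centers. The number of points handed to this local step is not stated in the thread, so using all 21k vectors is an assumption, and `init_distance_evals` is a hypothetical helper.

```python
def init_distance_evals(num_points, k):
    """Distance evaluations in the summing pass of the quoted loop.

    For i = 1 .. k-1, every point is compared against i current centers,
    giving num_points * (1 + 2 + ... + (k - 1)) evaluations in total.
    """
    return num_points * (k * (k - 1) // 2)


# Figures from the thread: about 21k vectors and k of about 4000.
evals_k4000 = init_distance_evals(21_000, 4_000)

# A k ten times smaller, for comparison.
evals_k400 = init_distance_evals(21_000, 400)
```

Under these assumptions the cost grows quadratically in k: going from k = 400 to k = 4000 multiplies the work by roughly 100, and each evaluation is itself over 100 dimensions. The while loop that follows the sum in the quoted code adds further pointCost calls on top of this count.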