How to process a Scala dataframe in chunks?

We have a huge dataframe in Scala of around 120000 rows. We want to process the dataframe in chunks of 25 rows each and make one HTTP request per chunk of 25 rows. What is the best way to divide the dataframe and perform some operation on each chunk?
For Example:
Consider this dataframe: val df = Seq(1, 2, 3, 4, 5, 6, 7, 8, 9, 10).toDF()
and suppose we want to first take 4 rows and perform some operation on them, then take the next 4 and perform the operation on them, and finally perform the operation on the remaining 2.

You could repartition your dataframe to chop it up into partitions of the size you want, and then use foreachPartition to perform an operation on each of these partitions.
On your small example, it could look something like the following:
val df = Seq(1, 2, 3, 4, 5, 6, 7, 8, 9, 10).toDF().repartition(3)
df.rdd.foreachPartition(iterator => {
println(s"printing: ${iterator.mkString(",")}")
})
printing: [1],[3],[8]
printing: [5],[7],[9]
printing: [2],[4],[6],[10]
In here, we're using .repartition(3) to chop up our dataset into the desired size. Note that in this case the size of our partitions is 3 or 4 (since 10 is not evenly divisible by 3).
So in your case, you simply have to:
change .repartition(3) to .repartition(4800) (120000/25 = 4800)
change the println(s"printing: ${iterator.mkString(",")}") line to whatever operation you want to perform (see the sketch below)
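Put together, a minimal sketch of the full use case might look like the following. The sendHttpRequest helper, the payload handling and the SparkSession value spark are assumptions on my part, not part of the question or the answer:

import org.apache.spark.sql.Row

// Hypothetical helper that posts one batch of up to 25 rows; plug in your HTTP client of choice.
def sendHttpRequest(batch: Seq[Row]): Unit = {
  println(s"sending ${batch.size} rows")
}

val df = spark.range(120000).toDF()   // stand-in for the real 120000-row dataframe
val chunked = df.repartition(4800)    // 120000 / 25 = 4800 partitions of roughly 25 rows each

chunked.rdd.foreachPartition { iterator =>
  // repartition does not guarantee exactly 25 rows per partition, so batch within the partition too
  iterator.grouped(25).foreach(batch => sendHttpRequest(batch))
}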

Related

Spark Dataframe grouping and partition by key with a set number of partitions.

I have a spark dataframe with multiple labels and features corresponding to each, like this:
+----------------+--------------------+
| label| feature_paths|
+----------------+--------------------+
| person1|[-0.015756417, 0....|
| person1|[-0.05177306, 0.1...|
| person1|[-0.11631858, 0.1...|
| person2|[-0.058303248, 0....|
| person2|[-0.03415013, 0.0...|
+----------------+--------------------+
I want to train a clustering model for each label (person), so basically, I want to create an rdd for each label, and then run a map operation like rdd.map(service) which will eventually save a gmm model for each entity.
The code is like:
def service(rddentry):
    label = rddentry[0]
    features = rddentry[1]
    print(label)
    from sklearn.mixture import BayesianGaussianMixture
    from sklearn.externals import joblib
    gmm = BayesianGaussianMixture(n_components=3, covariance_type="diag", init_params='kmeans')
    model = gmm.fit(features)
    joblib.dump(model, str(label) + '.joblib')
    return model
The goals I want to achieve are:
Create an RDD where the number of partitions is equal to the number of unique labels, such that: rdd.getNumPartitions() = no_of_unique_labels.
Each rdd entry will have multiple features, belonging to a single label.
Send each rdd partition to the service function.
My experiments until now:
When doing sdf.repartition('label'), it creates several empty dataframes.
sdf.partitionBy('label') also does not work. It creates a random number of partitions.
I have spent almost two days on this but have no concrete results so far. Any help or guidance in the right direction would be appreciated.
You can use partitionBy with new HashPartitioner(number_of_partitions)
One extra action is required to count the number of unique labels, and you can use that as the number of required partitions.
Here is a sample. Note: you need a paired RDD to do this, so after repartitioning you can map to get the needed values back out of the tuple.
scala> val data = sc.parallelize(List("1","1","1","2","3","4","4","4"),4)
scala> data.glom.collect
res20: Array[Array[String]] = Array(Array(1, 1), Array(1, 2), Array(3, 4), Array(4, 4))
scala> import org.apache.spark.HashPartitioner
scala> val data_repart = data.keyBy(x=>x).partitionBy(new HashPartitioner(data.distinct.count.toInt))
scala> data_repart.glom.collect
res21: Array[Array[(String, String)]] = Array(Array((4,4), (4,4), (4,4)), Array((1,1), (1,1), (1,1)), Array((2,2)), Array((3,3)))
scala> data_repart.map(_._2).glom.collect
res22: Array[Array[String]] = Array(Array(4, 4, 4), Array(1, 1, 1), Array(2), Array(3))
Let me know if it helps.
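To connect this back to the label/feature dataframe from the question, a rough sketch could look like the following. The dataframe name sdf, the element type of the feature_paths column and the trainAndSave helper are assumptions here; in the question the actual training happens in the Python service function:

import org.apache.spark.HashPartitioner

// Hypothetical stand-in for the per-label training/saving logic.
def trainAndSave(label: String, features: Seq[Seq[Double]]): Unit =
  println(s"training a model for $label on ${features.size} feature vectors")

// Key each row by its label, then repartition with one partition per distinct label.
// Assumes feature_paths is an array<double> column.
val byLabel = sdf.rdd.map(row =>
  (row.getAs[String]("label"), row.getAs[Seq[Double]]("feature_paths")))
val numLabels = byLabel.keys.distinct.count.toInt
val partitioned = byLabel.partitionBy(new HashPartitioner(numLabels))

// Hash collisions can still put two labels into one partition, so group within the partition.
partitioned.foreachPartition { iter =>
  iter.toSeq.groupBy(_._1).foreach { case (label, rows) =>
    trainAndSave(label, rows.map(_._2))
  }
}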

Create Spark dataset with parts of other dataset

I'm trying to create a new dataset by taking intervals from another dataset, for example, consider dataset1 as input and dataset2 as output:
dataset1 = [1, 2, 3, 4, 5, 6]
dataset2 = [1, 2, 2, 3, 3, 4, 4, 5, 5, 6]
I managed to do that using arrays, but for MLlib a dataset is needed.
My code with array:
def generateSeries(values: Array[Double], n: Int): Seq[Array[Float]] = {
  val m = values.length
  val res = new Array[Array[Float]](m - n + 1)
  for (i <- 0 to m - n) {
    res(i) = values.slice(i, i + n).map(_.toFloat)
  }
  res.toSeq
}
flatMap seems like the way to go, but how can a function look up the next value in the dataset?
The problem here is that an array is in no way similar to a DataSet. A DataSet is unordered and has no indices, so thinking in terms of arrays won't help you. Go for a Seq and treat it without using indices and positions at all.
So, to represent an array-like behaviour on a DataSet you need to create your own indices. This is simply done by pairing the value with the position in the "abstract array" we are representing.
So the type of your DataSet will be something like [(Int, Int)], where the first element is the index and the second is the value. They will arrive unordered, so you will need to rework your logic in a more functional way. It's not really clear what you're trying to achieve, but I hope I gave you a hint. Otherwise, explain the expected result in more detail in a comment on my answer and I will edit.
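For what it's worth, here is a minimal sketch of that index-pairing idea on the example input, with window size n = 2. Working on the underlying RDD with zipWithIndex is my own choice, not something the answer prescribes:

val n = 2
val dataset1 = sc.parallelize(Seq(1, 2, 3, 4, 5, 6))   // stand-in for the input dataset

val windows = dataset1
  .zipWithIndex()                                       // (value, index)
  .flatMap { case (v, i) =>
    // each element belongs to the windows starting at positions i-n+1 .. i
    (math.max(0L, i - n + 1) to i).map(start => (start, (i, v)))
  }
  .groupByKey()
  .filter { case (_, elems) => elems.size == n }        // drop the incomplete window at the end
  .sortByKey()
  .map { case (_, elems) => elems.toSeq.sortBy(_._1).map(_._2) }

// windows.flatMap(identity).collect  =>  Array(1, 2, 2, 3, 3, 4, 4, 5, 5, 6)

If I remember correctly, MLlib also ships a sliding helper in org.apache.spark.mllib.rdd.RDDFunctions that can produce such windows more directly.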

On Spark's RDD's take and takeOrdered methods

I'm a bit confused on how Spark's rdd.take(n) and rdd.takeOrdered(n) works. Can someone explain to me these two methods with some examples? Thanks.
In order to explain how ordering works we create an RDD with integers from 0 to 99:
val myRdd = sc.parallelize(Seq.range(0, 100))
We can now perform:
myRdd.take(5)
Which will extract the first 5 elements of the RDD, and we will obtain an Array[Int] containing the first 5 integers of myRdd: '0 1 2 3 4' (with no ordering function, just the first 5 elements in the first 5 positions).
The takeOrdered(5) operation works in a similar way: it will extract the first 5 elements of the RDD as an Array[Int], but we have the opportunity to specify the ordering criteria:
myRdd.takeOrdered(5)( Ordering[Int].reverse)
Will extract the first 5 elements according to the specified ordering. In our case the result will be: '99 98 97 96 95'
If you have a more complex data structure in your RDD you may want to perform your own ordering function with the operation:
myRdd.takeOrdered(5)( Ordering[Int].reverse.on { x => ??? })
Which will extract the first 5 elements of your RDD as an array according to your custom ordering function.
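As a concrete, purely hypothetical example of such a custom ordering on a more complex element type:

case class Person(name: String, age: Int)

val people = sc.parallelize(Seq(Person("Ann", 34), Person("Bob", 25), Person("Cid", 41)))

// The two oldest people, i.e. the first 2 elements under descending age.
people.takeOrdered(2)(Ordering[Int].reverse.on(_.age))
// => Array(Person(Cid,41), Person(Ann,34))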

How to transpose an RDD in Spark

I have an RDD like this:
1 2 3
4 5 6
7 8 9
It is a matrix. Now I want to transpose the RDD like this:
1 4 7
2 5 8
3 6 9
How can I do this?
Say you have an N×M matrix.
If both N and M are so small that you can hold N×M items in memory, it doesn't make much sense to use an RDD. But transposing it is easy:
val rdd = sc.parallelize(Seq(Seq(1, 2, 3), Seq(4, 5, 6), Seq(7, 8, 9)))
val transposed = sc.parallelize(rdd.collect.toSeq.transpose)
If N or M is so large that you cannot hold N or M entries in memory, then you cannot have an RDD line of this size. Either the original or the transposed matrix is impossible to represent in this case.
N and M may be of a medium size: you can hold N or M entries in memory, but you cannot hold N×M entries. In this case you have to blow up the matrix and put it together again:
val rdd = sc.parallelize(Seq(Seq(1, 2, 3), Seq(4, 5, 6), Seq(7, 8, 9)))
// Split the matrix into one number per line.
val byColumnAndRow = rdd.zipWithIndex.flatMap {
  case (row, rowIndex) => row.zipWithIndex.map {
    case (number, columnIndex) => columnIndex -> (rowIndex, number)
  }
}
// Build up the transposed matrix. Group and sort by column index first.
val byColumn = byColumnAndRow.groupByKey.sortByKey().values
// Then sort by row index.
val transposed = byColumn.map {
  indexedRow => indexedRow.toSeq.sortBy(_._1).map(_._2)
}
A first draft without using collect(), so everything runs on the worker side and nothing is done on the driver:
val rdd = sc.parallelize(Seq(Seq(1, 2, 3), Seq(4, 5, 6), Seq(7, 8, 9)))
rdd.flatMap(row => row.map(col => (col, row.indexOf(col)))) // flatMap, keeping the column position of each element
  .map(v => (v._2, v._1))   // key by column position
  .groupByKey.sortByKey()   // regroup on column position, so all elements of the first column end up in the first row
  .map(_._2)                // discard the key, keep only the values
The problem with this solution is that the columns in the transposed matrix will end up shuffled if the operation is performed in a distributed system. I will think about an improved version.
My idea is that in addition to attaching the 'column number' to each element of the matrix, we also attach the 'row number'. Then we could key by column position and regroup by key as in the example, reorder each row by the row number, and finally strip the row/column numbers from the result.
I just don't have a way to know the row number when importing a file into an RDD (one way to get it is sketched below).
You might think it's heavy to attach a column and a row number to each matrix element, but I guess that's the price to pay to be able to process your input in chunks in a distributed fashion and thus handle huge matrices.
I will update the answer when I find a solution to the ordering problem.
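On the row-number question above, one option (my assumption, not part of the original answer) is to attach the row index at import time with zipWithIndex. A minimal sketch, assuming one whitespace-separated matrix row per line of a text file (the path is illustrative):

val withRowNumbers = sc.textFile("hdfs:///path/to/matrix.txt")   // hypothetical input path
  .zipWithIndex()                                                // (line, rowIndex)
  .map { case (line, rowIndex) =>
    (rowIndex, line.trim.split("\\s+").map(_.toInt))             // keep the row index next to the parsed row
  }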
As of Spark 1.6 you can use the pivot operation on DataFrames. Depending on the actual shape of your data, if you put it into a DataFrame you could pivot columns to rows; the Databricks blog post on pivoting is very useful, as it describes a number of pivoting use cases in detail with code examples.
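As a rough sketch of that idea on the 3×3 example, assuming the matrix is first flattened into (rowIdx, colIdx, value) records (the column names and the SparkSession value spark are my own choices):

import org.apache.spark.sql.functions.first
import spark.implicits._   // assumes a SparkSession named spark, as in spark-shell

// Flatten the 3x3 matrix into (rowIdx, colIdx, value) records.
val cells = Seq(Seq(1, 2, 3), Seq(4, 5, 6), Seq(7, 8, 9)).zipWithIndex.flatMap {
  case (row, rowIdx) => row.zipWithIndex.map { case (value, colIdx) => (rowIdx, colIdx, value) }
}
val df = cells.toDF("rowIdx", "colIdx", "value")

// Pivot the row index into columns: each output row is one column of the original matrix.
val transposed = df.groupBy("colIdx").pivot("rowIdx").agg(first("value")).orderBy("colIdx")
// transposed.show() prints the rows (0, 1, 4, 7), (1, 2, 5, 8), (2, 3, 6, 9)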

Spark: How to write different group values from an RDD to different files?

I need to write values with key 1 to file file1.txt and values with key 2 to file2.txt:
val ar = Array (1 -> 1, 1 -> 2, 1 -> 3, 1 -> 4, 1 -> 5, 2 -> 6, 2 -> 7, 2 -> 8, 2 -> 9)
val distAr = sc.parallelize(ar)
val grk = distAr.groupByKey()
How can I do this without iterating over the collection grk twice?
We write data from different customers to different tables, which is essentially the same use case. The common pattern we use is something like this:
val customers:List[String] = ???
customers.foreach{customer => rdd.filter(record => belongsToCustomer(record,customer)).saveToFoo()}
This probably does not fulfill the wish of 'not iterating over the rdd twice (or n times)', but filter is a cheap operation to do in a parallel distributed environment and it works, so I think it does comply with the 'general Spark way' of doing things.
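Applied to the key-1/key-2 example from the question, a minimal sketch might look like this. Note that saveAsTextFile writes a directory of part files rather than a single file1.txt, so the paths are only an approximation of what the question asks for:

val keys = distAr.keys.distinct.collect()

// One cheap, parallel filter per key; each pass writes only that key's values.
keys.foreach { k =>
  distAr.filter { case (key, _) => key == k }
        .values
        .saveAsTextFile(s"file$k.txt")
}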