Filter columns in large dataset with Spark - scala

I have a dataset which is 1,000,000 rows by about 390,000 columns. The fields are all binary, either 0 or 1. The data is very sparse.
I've been using Spark to process this data. My current task is to filter the data--I only want data in 1000 columns that have been preselected. This is the current code that I'm using to achieve this task:
val result = bigdata.map(_.zipWithIndex.filter{case (value, index) => selectedColumns.contains(index)})
bigdata is just an RDD[Array[Int]]
However, this code takes quite a while to run. I'm sure there's a more efficient way to filter my dataset that doesn't involve going in and filtering every single row separately. Would loading my data into a DataFrame, and maniuplating it through the DataFrame API make things faster/easier? Should I be looking into column-store based databases?

You can start with making your filter a little bit more efficient. Please note that:
your RDD contains Array[Int]. It means you can access nth element of each row in O(1) time
#selectedColumns << #columns
Considering these two facts it should be obvious that it doesn't make sense to iterate over all elements for each row not to mention contains calls. Instead you can simply map over selectedColumns
// Optional if selectedColumns are not ordered
val orderedSelectedColumns = selectedColumns.toList.sorted.toArray
rdd.map(row => selectedColumns.map(row))
Comparing time complexity:
zipWithIndex + filter (assuming best case scenario when contains is O(1)) - O(#rows * # columns)
map - O(#rows * #selectedColumns)

The easiest way to speed up execution is to parallelize it with partitionBy:
bigdata.partitionBy(new HashPartitioner(numPartitions)).foreachPartition(...)
foreachPartition receives a Iterator over which you can map and filter.
numPartitions is a val which you can set with the amount of desired parallel partitions.

Related

Spark: groupBy and treat grouped data as a Dataset

I have a Spark Dataset, and I would like to group the data and process the groups, yielding zero or one element per each group. Something like:
val resulDataset = inputDataset
.groupBy('x, 'y)
.flatMap(...)
I didn't find a way to apply a function after a groupBy, but it appears I can use groupByKey instead (is it a good idea? is there a better way?):
val resulDataset = inputDataset
.groupByKey(v => (v.x, v.y))
.flatMap(...)
This works, but here is a thing: I would like process the groups as Datasets. The reason is that I already have convenient functions to use on Datasets and would like to reuse them when calculating the result for each group. But, the groupByKey.flatMap yields an Iterator over the grouped elements, not the Dataset.
The question: is there a way in Spark to group an input Dataset and map a custom function to each group, while treating the grouped elements as a Dataset ? E.g.:
val inputDataset: Dataset[T] = ...
val resulDataset: Dataset[U] = inputDataset
.groupBy(...)
.flatMap(group: Dataset[T] => {
// using Dataset API to calculate resulting value, e.g.:
group.withColumn(row_number().over(...))....as[U]
})
Note, that grouped data is bounded, and it is OK to process it on a single node. But the number of groups can be very high, so the resulting Dataset needs to be distributed. The point of using the Dataset API to process a group is purely a question of using a convenient API.
What I tried so far:
creating a Dataset from an Iterator in the mapped function - it fails with an NPE from a SparkSession (my understanding is that it boils down to the fact that one cannot create a Dataset within the functions which process a Dataset; see this and this)
tried to overcome the issues in the first solution, attempted to create new SparkSession to create the Dataset within a new session; fails with NPE from SparkSession.newSession
(ab)using repartition('x, 'y).mapPartitions(...), but this also yields an Iterator[T] for each partition, not a Dataset[T]
finally, (ab)using filter: I can collect all distinct values of the grouping criteria into an Array (select.distinct.collect), and iterate this array to filter the source Dataset, yielding one Dataset for each group (sort of joins the idea of multiplexing from this article); although this works, my understanding is that it collects all the data on a single node, so it doesn't scale and will eventually have memory issues

How to avoid the need for nested calls to Spark dataframes - which dont' work

Suppose I have a Spark dataframe called trades which has in its schema a few columns, some dimensions (let's say Product and Type) and some facts (let's say Price and Volume).
Rows in the dataframe which have the same dimension columns belong logically to the same group.
What I need is to map each dimension set (Product, Type) to a numeric value, so to obtain in the end a dataframe stats which has as many rows as the distinct number of dimensions and a value - this is the critical part - which is obtained from all the rows in trades of that (Product, Type) and which must be computed sequentially in order, because the function applied row by row is neither associative nor commutative, and it cannot be parallelized.
I managed to handle the sequential function I need to apply to each subset by repartitioning to 1 single chunk each dataframe and sorting the rows, so to get exactly what I need.
The thing I am struggling with is how to do the map from trades to stats as a Spark job: in my scenario master is remote and can leverage multiple executors, while the deploy mode is local and local machine is poorly equipped.
So I don't want to do looping over the driver, but push it down to the cluster.
If this was not Spark, I'd have done something like:
val dimensions = trades.select("Product", "Type").distinct()
val stats = dimensions.map( row =>
val product = row.getAs[String]("Product")
val type = row.getAs[String]("Type")
val inScope = col("Product") === product and col("Type") === type
val tradesInScope = trades.filter(inScope)
Row(product, type, callSequentialFunction(tradesInScope))
)
This seemed fine to me, but it's absolutely not working: I am trying to do a nested call on trades, and it seem they are not supported. Indeed, when running this the spark job compile but when actually performing an action I get a NullPointerException because the dataframe trades is null within the map
I am new to Spark, and I don't know any other way of achieving the same intent in a valid way. Could you help me?
you get a NullpointerExecptionbecause you cannot use dataframes within executor-side code, they only live on the driver.Also, your code would not ensure thatcallSequentialFunction will be called sequentially, because map on a dataframe will run in parallel (if you have more than 1 partition). What you can do is something like this:
val dimensions = trades.select("Product", "Type").distinct().as[(String,String)].collect()
val stats = dimensions.map{case (product,type) =>
val inScope = col("Product") === product and col("Type") === type
val tradesInScope = trades.filter(inScope)
(product, type, callSequentialFunction(tradesInScope))
}
But note that the order in dimensionsis somewhat arbitrary, so you should sort dimensionsaccording to your needs

Caching Large Dataframes in Spark Effectively

I am currently working on 11,000 files. Each file will generate a data frame which will be Union with the previous one. Below is the code:
var df1 = sc.parallelize(Array(("temp",100 ))).toDF("key","value").withColumn("Filename", lit("Temp") )
files.foreach( filename => {
val a = filename.getPath.toString()
val m = a.split("/")
val name = m(6)
println("FILENAME: " + name)
if (name == "_SUCCESS") {
println("Cannot Process '_SUCCSS' Filename")
} else {
val freqs=doSomething(a).toDF("key","value").withColumn("Filename", lit(name) )
df1=df1.unionAll(freqs)
}
})
First, i got an error of java.lang.StackOverFlowError on 11,000 files. Then, i add a following line after df1=df1.unionAll(freqs):
df1=df1.cache()
It resolves the problem but after each iteration, it is getting slower. Can somebody please suggest me what should be done to avoid StackOverflowError with no decrease in time.
Thanks!
The issue is that spark manages a dataframe as a set of transformations. It begins with the "toDF" of the first dataframe, then perform the transformations on it (e.g. withColumn), then unionAll with the previous dataframe etc.
The unionAll is just another such transformation and the tree becomes very long (with 11K unionAll you have an execution tree of depth 11K). The unionAll when building the information can get to a stack overflow situation.
The caching doesn't solve this, however, I imagine you are adding some action along the way (otherwise nothing would run besides building the transformations). When you perform caching, spark might skip some of the steps and therefor the stack overflow would simply arrive later.
You can go back to RDD for iterative process (your example actually is not iterative but purely parallel, you can simply save each separate dataframe along the way and then convert to RDD and use RDD union).
Since your case seems to be join unioning a bunch of dataframes without true iterations, you can also do the union in a tree manner (i.e. union pairs, then union pairs of pairs etc.) this would change the depth from O(N) to O(log N) where N is the number of unions.
Lastly, you can read and write the dataframe to/from disk. The idea is that after every X (e.g. 20) unions, you would do df1.write.parquet(filex) and then df1 = spark.read.parquet(filex). When you read the lineage of a single dataframe would be the file reading itself. The cost of course would be the writing and reading of the file.

Why is dataset.count causing a shuffle! (spark 2.2)

Here is my dataframe:
The underlying RDD has 2 partitions
When I do a df.count, the DAG produced is
When I do a df.rdd.count, the DAG produced is:
Ques: Count is an action in spark, the official definition is ‘Returns the number of rows in the DataFrame.’. Now, when I perform the count on the dataframe why is a shuffle occurring? Besides, when I do the same on the underlying RDD no shuffle occurs.
It makes no sense to me why a shuffle would occur anyway. I tried to go through the source code of count here spark github But it doesn’t make sense to me fully. Is the “groupby” being supplied to the action the culprit?
PS. df.coalesce(1).count does not cause any shuffle
It seems that DataFrame's count operation uses groupBy resulting in shuffle. Below is the code from https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
* Returns the number of rows in the Dataset.
* #group action
* #since 1.6.0
*/
def count(): Long = withAction("count", groupBy().count().queryExecution) {
plan =>
plan.executeCollect().head.getLong(0)
}
While if you look at RDD's count function, it passes on the aggregate function to each of the partitions, which returns the sum of each partition as Array and then use .sum to sum elements of array.
Code snippet from this link:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala
/**
* Return the number of elements in the RDD.
*/
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
When spark is doing dataframe operation, it does first compute partial counts for every partition and then having another stage to sum those up together. This is particularly good for large dataframes, where distributing counts to multiple executors actually adds to performance.
The place to verify this is SQL tab of Spark UI, which would have some sort of the following physical plan description:
*HashAggregate(keys=[], functions=[count(1)], output=[count#202L])
+- Exchange SinglePartition
+- *HashAggregate(keys=[], functions=[partial_count(1)], output=[count#206L])
In the shuffle stage, the key is empty, and the value is count of the partition, and all these (key,value) pairs are shuffled to one single partition.
That is, the data moved in the shuffle stage is very little.

Spark: How to combine 2 sorted RDDs so that order is preserved after union?

I have 2 sorted RDDs:
val rdd_a = some_pair_rdd.sortByKey().
zipWithIndex.filter(f => f._2 < n).
map(f => f._1)
val rdd_b = another_pair_rdd.sortByKey().
zipWithIndex.filter(f => f._2 < n).
map(f => f._1)
val all_rdd = rdd_a.union(rdd_b)
In all_rdd, I see that the order is not necessarily maintained as I'd imagined (that all elements of rdd_a come first, followed by all elements of rdd_b). Is my assumption incorrect (about the contract of union), and if so, what should I use to append multiple sorted RDDs into a single rdd?
I'm fairly new to Spark so I could be wrong, but from what I understand Union is a narrow transformation. That is, each executor joins only its local blocks of RDD a with its local blocks of RDD b and then returns that to the driver.
As an example, let's say that you have 2 executors and 2 RDDS.
RDD_A = ["a","b","c","d","e","f"]
and
RDD_B = ["1","2","3","4","5","6"]
Let Executor 1 contain the first half of both RDD's and Executor 2 contain the second half of both RDD's. When they perform the union on their local blocks, it would look something like:
Union_executor1 = ["a","b","c","1","2","3"]
and
Union_executor2 = ["d","e","f","4","5","6"]
So when the executors pass their parts back to the driver you would have ["a","b","c","1","2","3","d","e","f","4","5","6"]
Again, I'm new to Spark and I could be wrong. I'm just sharing based on my understanding of how it works with RDD's. Hopefully we can both learn something from this.
You can't. Spark does not have a merge sort, because you can't make assumptions about the way that the RDDs are actually stored on the nodes. If you want things in sort order after you take the union, you need to sort again.