I have this DataFrame and I'd like to combine all the arrays in the data column into one big array, separate from the DataFrame.
Scala and DataFrame API are still pretty new to me, but I gave it a shot:
import scala.collection.mutable.{ListBuffer, WrappedArray}

case class Tile(data: Array[Int])
val ta = Tile(Array(1,2))
val tb = Tile(Array(3,4))
val tc = Tile(Array(5,6))
val df = ListBuffer(ta, tb, tc).toDF()
// Combine contents of DF into one array
val result = new Array[Int](6)
var offset = 0
val combine = (t: WrappedArray[Int]) => {
Array.copy(t, 0, result, offset, t.length)
offset += t.length
}
df.foreach(r => combine(r(0).asInstanceOf[WrappedArray[Int]]))
df.show()
+------+
| data|
+------+
|[1, 1]|
|[2, 2]|
|[3, 3]|
+------+
When I run this, I get the following error:
16/08/23 11:21:32 ERROR executor.Executor: Exception in task 0.0 in stage 17.0 (TID 17)
scala.MatchError: WrappedArray(1, 1) (of class scala.collection.mutable.WrappedArray$ofRef)
at scala.runtime.ScalaRunTime$.array_apply(ScalaRunTime.scala:71)
at scala.Array$.slowcopy(Array.scala:81)
at scala.Array$.copy(Array.scala:107)
at $line150.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:32)
at $line150.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:31)
at $line190.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:46)
at $line190.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:46)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$32.apply(RDD.scala:912)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$32.apply(RDD.scala:912)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1869)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1869)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:74
Can anyone point me in the right direction? Thanks!
When working with Spark you cannot accumulate results in driver-side variables from inside a foreach the way you would in plain Scala. Spark serializes the closure (including everything it captures, such as result and offset) and ships it to the executors, so each executor mutates its own copy and the driver never sees the updates.
If you still want to do things in a way similar to what you normally do, use an Accumulator, which is designed for Spark's distributed model.
import org.apache.spark.rdd.RDD

val myRdd: RDD[List[Int]] = sc.parallelize(List(List(1,2), List(3,4), List(5,6)))
val acc = sc.collectionAccumulator[Int]("MyAccumulator")
myRdd.foreach(l => l.foreach(i => acc.add(i)))
Or in your case
case class Tile(data: Array[Int])
val myRdd: RDD[Tile] = sc.parallelize(List(
  Tile(Array(1,2)),
  Tile(Array(3,4)),
  Tile(Array(5,6))
))
val acc = sc.collectionAccumulator[Int]("MyAccumulator")
myRdd.foreach(t => t.data.foreach(i => acc.add(i)))
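Once the foreach action has finished, the accumulated values can be read back on the driver via acc.value. A minimal sketch (note that collectionAccumulator exposes a java.util.List and the element order is not guaranteed):
import scala.collection.JavaConverters._
// acc.value is only meaningful on the driver after the action has completed.
val combined: Array[Int] = acc.value.asScala.toArray
For this particular task you could also skip the accumulator entirely and collect on the driver, e.g. df.select("data").rdd.flatMap(_.getSeq[Int](0)).collect(), provided the combined data fits in driver memory.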
Related
I am new to Scala. I am trying to write code that maps parsed numbers in a series of XML files. My code works for a small RDD, as below:
val myrdd = sc.parallelize(Array("FavoriteCount=\"23\" Score=\"43\"","FavoriteCount=\"12\" Score=\"32\"","FavoriteCount=\"32\" Score=\"2\""))
def successMatches(s: String): (String, Int) = {
  val fcountMatcher = """FavoriteCount=\"(\d+)\"""".r
  val scoreMatcher = """Score=\"(\d+)\"""".r
  val fcount = fcountMatcher.findFirstMatchIn(s).get.group(1)
  val score = scoreMatcher.findFirstMatchIn(s).get.group(1)
  (fcount, score.toInt)
}
val myWords = myrdd.map(x => successMatches(x))
myWords.take(3)
myrdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[4] at parallelize at <console>:29
successMatches: (s: String)(String, Int)
myWords: Array[(String, Int)] = Array((23,43), (12,32), (32,2))
res1: Array[(String, Int)] = Array((23,43), (12,32), (32,2))
For the actual XML RDD it returns the error message below:
val myWords = valid_lines.take(1).map(x => successMatches(x))
myWords.take(1)
java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:347)
at scala.None$.get(Option.scala:345)
at successMatches(<console>:52)
at $anonfun$1.apply(<console>:57)
at $anonfun$1.apply(<console>:57)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
... 42 elided
What am I missing?
This is what the first element of the xml RDD looks like:
valid_lines.take(1)
res51: Array[String] = Array(" <row AnswerCount="0" Body="<p>I'm having trouble with a basic machine learning methodology question. I understand the concept of not using the same data to both train and evaluate a classifier, and furthermore when there are parameters in an algorithm to be optimized, you should use an independent third test set to get the final reportable performance figures (e.g. recall rate). However, using a <em>single</em> test set to measure performance seems to be problematic because the measures of performance will likely differ depending on how the data is partitioned into training (plus validation) and test sets, especially for small datasets. It would be better to average the results of N different partitions.</p>
<p>For t...
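The None.get comes from findFirstMatchIn returning None for at least one input line, i.e. the patterns do not match the real XML rows the way they match the toy strings. A defensive sketch (the name successMatchesSafe and the flatMap usage are mine) that drops non-matching lines instead of throwing:
def successMatchesSafe(s: String): Option[(String, Int)] = {
  val fcountMatcher = """FavoriteCount=\"(\d+)\"""".r
  val scoreMatcher = """Score=\"(\d+)\"""".r
  for {
    fcount <- fcountMatcher.findFirstMatchIn(s).map(_.group(1))
    score <- scoreMatcher.findFirstMatchIn(s).map(_.group(1))
  } yield (fcount, score.toInt)
}
// Lines where either attribute is absent are simply dropped instead of throwing.
val myWords = valid_lines.flatMap(successMatchesSafe)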
I'm working on a UDAF that returns an array of elements.
The input for each update is a tuple of index and value.
What the UDAF does is to sum all the values under the same index.
Example:
For input (index, value): (2,1), (3,1), (2,3)
it should return (0, 0, 4, 1, ..., 0)
The logic works fine, but I have an issue with the update method: my implementation only updates one cell for each row, yet the last assignment in that method copies the entire array, which is redundant and extremely time-consuming.
This assignment alone is responsible for 98% of my query execution time.
My question is: how can I reduce that time? Is it possible to assign one value in the buffer array without having to replace the entire buffer?
P.S.: I'm working with Spark 1.6 and I cannot upgrade it anytime soon, so please stick to solutions that work with this version.
import scala.collection.mutable

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class SumArrayAtIndexUDAF() extends UserDefinedAggregateFunction {

  val bucketSize = 1000

  def inputSchema: StructType = StructType(StructField("index", LongType) :: StructField("value", LongType) :: Nil)

  def dataType: DataType = ArrayType(LongType)

  def deterministic: Boolean = true

  def bufferSchema: StructType = {
    StructType(
      StructField("buckets", ArrayType(LongType)) :: Nil
    )
  }

  override def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = new Array[Long](bucketSize)
  }

  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    val index = input.getLong(0)
    val value = input.getLong(1)
    val arr = buffer.getAs[mutable.WrappedArray[Long]](0)
    arr(index.toInt) += value
    buffer(0) = arr // TODO THIS TAKES WAYYYYY TOO LONG - it actually copies the entire array for every call to this method (which essentially updates only 1 cell)
  }

  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    val arr1 = buffer1.getAs[mutable.WrappedArray[Long]](0)
    val arr2 = buffer2.getAs[mutable.WrappedArray[Long]](0)
    for (i <- arr1.indices) {
      arr1.update(i, arr1(i) + arr2(i))
    }
    buffer1(0) = arr1
  }

  override def evaluate(buffer: Row): Any = {
    buffer.getAs[mutable.WrappedArray[Long]](0)
  }
}
TL;DR Either don't use UDAF or use primitive types in place of ArrayType.
Without UDAF
Both solutions should skip expensive juggling between internal and external representation.
Using standard aggregates and pivot
This uses standard SQL aggregations. While optimized internally, it might be expensive when the number of keys and the size of the array grow.
Given input:
val df = Seq((1, 2, 1), (1, 3, 1), (1, 2, 3)).toDF("id", "index", "value")
You can:
import org.apache.spark.sql.functions.{array, coalesce, col, lit}
val nBuckets = 10
@transient val values = array(
0 until nBuckets map (c => coalesce(col(c.toString), lit(0))): _*
)
df
  .groupBy("id")
  .pivot("index", 0 until nBuckets)
  .sum("value")
  .select($"id", values.alias("values"))
+---+--------------------+
| id| values|
+---+--------------------+
| 1|[0, 0, 4, 1, 0, 0...|
+---+--------------------+
Using RDD API with combineByKey / aggregateByKey.
Plain old byKey aggregation with a mutable buffer. No bells and whistles, but it should perform reasonably well with a wide range of inputs. If you suspect the input to be sparse, you may consider a more efficient intermediate representation, such as a mutable Map.
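Here rdd is assumed to be a pair RDD keyed by id with (index, value) pairs as values; a minimal sketch of how it could be derived from the df above:
// Key by id; (index, value) becomes the value side of the pair RDD.
val rdd = df.rdd.map(r => (r.getInt(0), (r.getInt(1), r.getInt(2))))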
rdd
  .aggregateByKey(Array.fill(nBuckets)(0L))(
    { case (acc, (index, value)) => { acc(index) += value; acc } },
    (acc1, acc2) => { for (i <- 0 until nBuckets) acc1(i) += acc2(i); acc1 }
  ).toDF
+---+--------------------+
| _1| _2|
+---+--------------------+
| 1|[0, 0, 4, 1, 0, 0...|
+---+--------------------+
Using UDAF with primitive types
As far as I understand the internals, the performance bottleneck is ArrayConverter.toCatalystImpl.
It looks like it is called for each call to MutableAggregationBuffer.update, and in turn allocates a new GenericArrayData for each Row.
If we redefine bufferSchema as:
def bufferSchema: StructType = {
  StructType(
    0 to nBuckets map (i => StructField(s"x$i", LongType))
  )
}
both update and merge can be expressed as plain replacements of primitive values in the buffer. The call chain will remain pretty long, but it won't require copies / conversions or crazy allocations. Omitting null checks, you'll need something similar to
val index = input.getLong(0).toInt
buffer.update(index, buffer.getLong(index) + input.getLong(1))
and
for (i <- 0 to nBuckets) {
  buffer1.update(i, buffer1.getLong(i) + buffer2.getLong(i))
}
respectively.
Finally evaluate should take Row and convert it to output Seq:
for (i <- 0 to nBuckets) yield buffer.getLong(i)
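Putting those pieces together, a minimal sketch of the primitive-buffer variant (the class name, the nBuckets constructor parameter, the initialize body and the null checks are mine, added only to make the sketch self-contained):

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class SumArrayAtIndexPrimitiveUDAF(nBuckets: Int) extends UserDefinedAggregateFunction {

  def inputSchema: StructType = StructType(
    StructField("index", LongType) :: StructField("value", LongType) :: Nil)

  // One LongType field per bucket instead of a single ArrayType column.
  def bufferSchema: StructType = StructType(
    0 to nBuckets map (i => StructField(s"x$i", LongType)))

  def dataType: DataType = ArrayType(LongType)

  def deterministic: Boolean = true

  def initialize(buffer: MutableAggregationBuffer): Unit =
    for (i <- 0 to nBuckets) buffer.update(i, 0L)

  def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0) && !input.isNullAt(1)) {
      val index = input.getLong(0).toInt
      buffer.update(index, buffer.getLong(index) + input.getLong(1))
    }

  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    for (i <- 0 to nBuckets) buffer1.update(i, buffer1.getLong(i) + buffer2.getLong(i))

  def evaluate(buffer: Row): Any =
    for (i <- 0 to nBuckets) yield buffer.getLong(i)
}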
Please note that in this implementation a possible bottleneck is merge. While it shouldn't introduce any new performance problems, with M buckets, each call to merge is O(M).
With K unique keys and P partitions it will be called roughly P * K times in the worst-case scenario, where each key occurs at least once on each partition. This effectively increases the complexity of the merge component to O(M * P * K).
In general there is not much you can do about it. However if you make specific assumptions about the data distribution (data is sparse, key distribution is uniform), you can shortcut things a bit, and shuffle first:
df
  .repartition(n, $"key")
  .groupBy($"key")
  .agg(new SumArrayAtIndexUDAF()($"index", $"value"))
If the assumptions are satisfied it should:
Counterintuitively reduce shuffle size by shuffling sparse pairs, instead of dense array-like Rows.
Aggregate data using updates only (each O(1)), possibly touching only a subset of indices.
However if one or both assumptions are not satisfied, you can expect that the shuffle size will increase while the number of updates stays the same. At the same time, data skew can make things even worse than in the update - shuffle - merge scenario.
Using Aggregator with "strongly" typed Dataset:
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.{Encoder, Encoders}
class SumArrayAtIndex[I](f: I => (Int, Long))(bucketSize: Int) extends Aggregator[I, Array[Long], Seq[Long]]
    with Serializable {

  def zero = Array.fill(bucketSize)(0L)

  def reduce(acc: Array[Long], x: I) = {
    val (i, v) = f(x)
    acc(i) += v
    acc
  }

  def merge(acc1: Array[Long], acc2: Array[Long]) = {
    for {
      i <- 0 until bucketSize
    } acc1(i) += acc2(i)
    acc1
  }

  def finish(acc: Array[Long]) = acc.toSeq

  def bufferEncoder: Encoder[Array[Long]] = Encoders.kryo[Array[Long]]

  def outputEncoder: Encoder[Seq[Long]] = ExpressionEncoder()
}
which could be used as shown below
val ds = Seq((1, (1, 3L)), (1, (2, 5L)), (1, (0, 1L)), (1, (4, 6L)), (2, (1, 11L))).toDS
ds
  .groupByKey(_._1)
  .agg(new SumArrayAtIndex[(Int, (Int, Long))](_._2)(10).toColumn)
  .show(false)
+-----+-------------------------------+
|value|SumArrayAtIndex(scala.Tuple2) |
+-----+-------------------------------+
|1 |[1, 3, 5, 0, 6, 0, 0, 0, 0, 0] |
|2 |[0, 11, 0, 0, 0, 0, 0, 0, 0, 0]|
+-----+-------------------------------+
Note:
See also SPARK-27296 - User Defined Aggregating Functions (UDAFs) have a major efficiency problem
I need to convert an Array[Array[Double]] to an RDD, e.g. [[1.1, 1.2 ...], [2.1, 2.2 ...], [3.1, 3.2 ...], ...] to
+-----+-----+-----+
| 1.1 | 1.2 | ... |
| 2.1 | 2.2 | ... |
| 3.1 | 3.2 | ... |
| ... | ... | ... |
+-----+-----+-----+
val testDensities: Array[Array[Double]] = Array(Array(1.1, 1.2), Array(2.1, 2.2), Array(3.1, 3.2))
val testData = spark.sparkContext
.parallelize(Seq(testDensities
.map { x => x.toArray }
.map { x => x.toString() } ))
And this code even feels incorrect; the second map call is supposed to map each element in the array to convert the Double to a String. This is what I get when I save it as a text file:
[Ljava.lang.String;@773d7a60
Can anybody please comment on what I should do, and where I am making a horrendous mistake?
Thanks.
If you want to convert an Array[Double] to a String you can use the mkString method, which joins each item of the array with a delimiter (in my example ",").
scala> val testDensities: Array[Array[Double]] = Array(Array(1.1, 1.2), Array(2.1, 2.2), Array(3.1, 3.2))
scala> val rdd = spark.sparkContext.parallelize(testDensities)
scala> val rddStr = rdd.map(_.mkString(","))
rddStr: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[7] at map at
scala> rddStr.collect.foreach(println)
1.1,1.2
2.1,2.2
3.1,3.2
Maybe something like this:
scala> val testDensities: Array[Array[Double]] = Array(Array(1.1, 1.2), Array(2.1, 2.2), Array(3.1, 3.2))
scala> val strRdd = sc.parallelize(testDensities).map(_.mkString("[",",","]"))
strRdd: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[16] at map at <console>:26
scala> strRdd.collect
res7: Array[String] = Array([1.1,1.2], [2.1,2.2], [3.1,3.2])
But I have two questions:
Why do you want to do it? I understand that it is probably only because you are learning and playing with Spark.
Why do you try to use Array? It is not the first time I have seen people trying to transform everything into arrays. Keep the RDD until the end and use more generic collection types.
Why your code is wrong:
Because you apply the maps to your local array (on the driver) and then create an RDD from the already-transformed result.
So:
You are not parallelizing the execution of the maps. In fact, you are parallelizing nothing.
You create an RDD of arrays of String, not an RDD of String.
If you execute your code in the console:
scala> val testData = sc.parallelize(Seq(testDensities.map { x => x.toArray }.map { x => x.toString() } ))
testData: org.apache.spark.rdd.RDD[Array[String]] = ParallelCollectionRDD[14] at parallelize at <console>:26
the response is clear: RDD[Array[String]]
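If the goal is really the tabular layout shown in the question rather than strings, one option is to keep the values numeric and build a DataFrame. A sketch, assuming each inner array has two elements (the column names c1 and c2 are made up):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

val schema = StructType(Seq(StructField("c1", DoubleType), StructField("c2", DoubleType)))
// Parallelize the outer array directly; each inner Array[Double] becomes one Row.
val rowRdd = spark.sparkContext.parallelize(testDensities).map(arr => Row(arr: _*))
val doubleDf = spark.createDataFrame(rowRdd, schema)
doubleDf.show()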
The following Spark code correctly demonstrates what I want to do and generates the correct output with a tiny demo data set.
When I run this same general type of code on a large volume of production data, I am having runtime problems. The Spark job runs on my cluster for ~12 hours and fails out.
Just glancing at the code below, it seems inefficient to explode every row just to merge it back down. In the given test data set, the fourth row, with three values in array_value_1 and three values in array_value_2, will explode to 3*3 = nine exploded rows.
So, in a larger data set, a row with five such array columns, and ten values in each column, would explode out to 10^5 exploded rows?
Looking at the provided Spark functions, there are no out-of-the-box functions that would do what I want. I could supply a user-defined function. Are there any speed drawbacks to that?
import scala.collection.JavaConverters._

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{col, collect_set, explode}
import org.apache.spark.sql.types.{ArrayType, IntegerType, StringType, StructField, StructType}

val sparkSession = SparkSession.builder
  .master("local")
  .appName("merge list test")
  .getOrCreate()
val schema = StructType(
  StructField("category", IntegerType) ::
  StructField("array_value_1", ArrayType(StringType)) ::
  StructField("array_value_2", ArrayType(StringType)) ::
  Nil)

val rows = List(
  Row(1, List("a", "b"), List("u", "v")),
  Row(1, List("b", "c"), List("v", "w")),
  Row(2, List("c", "d"), List("w")),
  Row(2, List("c", "d", "e"), List("x", "y", "z"))
)
val df = sparkSession.createDataFrame(rows.asJava, schema)
val dfExploded = df.
withColumn("scalar_1", explode(col("array_value_1"))).
withColumn("scalar_2", explode(col("array_value_2")))
// This will output 19. 2*2 + 2*2 + 2*1 + 3*3 = 19
logger.info(s"dfExploded.count()=${dfExploded.count()}")
val dfOutput = dfExploded.groupBy("category").agg(
  collect_set("scalar_1").alias("combined_values_1"),
  collect_set("scalar_2").alias("combined_values_2"))
dfOutput.show()
It could be inefficient to explode, but fundamentally the operation you are trying to implement is simply expensive. Effectively it is just another groupByKey, and there is not much you can do here to make it better. Since you use Spark > 2.0 you can collect_list directly and then flatten:
import org.apache.spark.sql.functions.{collect_list, udf}
val flatten_distinct = udf(
  (xs: Seq[Seq[String]]) => xs.flatten.distinct)

df
  .groupBy("category")
  .agg(
    flatten_distinct(collect_list("array_value_1")),
    flatten_distinct(collect_list("array_value_2"))
  )
In Spark >= 2.4 you can replace udf with composition of built-in functions:
import org.apache.spark.sql.functions.{array_distinct, flatten}
val flatten_distinct = (array_distinct _) compose (flatten _)
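It is used the same way as the udf version above; as a sketch, with the composition spelled out explicitly (the aliases are mine):

df
  .groupBy("category")
  .agg(
    array_distinct(flatten(collect_list("array_value_1"))).alias("combined_values_1"),
    array_distinct(flatten(collect_list("array_value_2"))).alias("combined_values_2")
  )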
It is also possible to use a custom Aggregator, but I doubt any of these will make a huge difference.
If the sets are relatively large and you expect a significant number of duplicates, you could try to use aggregateByKey with mutable sets:
import scala.collection.mutable.{Set => MSet}
val rdd = df
  .select($"category", struct($"array_value_1", $"array_value_2"))
  .as[(Int, (Seq[String], Seq[String]))]
  .rdd

val agg = rdd
  .aggregateByKey((MSet[String](), MSet[String]()))(
    { case ((accX, accY), (xs, ys)) => (accX ++= xs, accY ++= ys) },
    { case ((accX1, accY1), (accX2, accY2)) => (accX1 ++= accX2, accY1 ++= accY2) }
  )
  .mapValues { case (xs, ys) => (xs.toArray, ys.toArray) }
  .toDF
So I'm running into an issue where a filter I'm using on an RDD can potentially create an empty RDD. I feel that doing a count() in order to test for emptiness would be very expensive, and was wondering if there is a more performant way to handle this situation.
Here is an example of what this issue might look like:
val b:RDD[String] = sc.parallelize(Seq("a","ab","abc"))
println(b.filter(a => !a.contains("a")).reduce(_+_))
would give the result
empty collection
java.lang.UnsupportedOperationException: empty collection
at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$apply$36.apply(RDD.scala:1005)
at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$apply$36.apply(RDD.scala:1005)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1005)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
at org.apache.spark.rdd.RDD.reduce(RDD.scala:985)
Does anyone have any suggestions for how I should go about addressing this edge case?
Consider .fold("")(_ + _) instead of .reduce(_ + _)
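A minimal sketch of the difference: fold takes a zero element, so an empty RDD simply yields that zero instead of throwing:

// reduce on the empty filtered RDD throws UnsupportedOperationException;
// fold returns the zero element ("") instead.
val combined = b.filter(a => !a.contains("a")).fold("")(_ + _)
// combined: String = ""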
how about
scala> val b = sc.parallelize(Seq("a","ab","abc"))
b: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[1] at parallelize at <console>:24
scala> b.isEmpty
res1: Boolean = false
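Combined with the filter from the question, a short sketch (the empty-case default "" is an arbitrary choice):

val filtered = b.filter(a => !a.contains("a"))
// isEmpty stops as soon as it finds a single element, so it is much cheaper than count()
val combined = if (filtered.isEmpty) "" else filtered.reduce(_ + _)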