Apache Spark reduceByKey to sum decimals - scala

I'm trying to map an RDD as such (see output for results) and map reduce by the decimal values and I keep getting error. When I tried using reduceByKey() with word count it worked fine. Are decimal values summed differently?
val voltageRDD= myRDD.map(i=> i.split(";"))
.filter(i=> i(0).split("/")(2)=="2008")
.map(i=> (i(0).split("/")(2),i(2).toFloat)).take(5)
Output:
voltageRDD: Array[(String, Float)] = Array((2008,1.62), (2008,1.626), (2008,1.622), (2008,1.612), (2008,1.612))
When trying to reduce:
val voltageRDD= myRDD.map(i=> i.split(";"))
.filter(i=> i(0).split("/")(2)=="2008")
.map(i=> (i(0).split("/")(2),i(2).toFloat)).reduceByKey(_+_).take(5)
I get the following error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2954.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2954.0 (TID 15696, 10.19.240.54): java.lang.NumberFormatException: For input string: "?"

If your data contains columns which are not parseable to a float, then you should either filter them out beforehand or treat them accordingly. Such a treatment could mean that you assign a value of 0.0f, if you see a non-parseable entry. The following code does exactly this.
val voltageRDD= myRDD.map(i=> i.split(";"))
.filter(i => i(0).split("/")(2)=="2008")
.map(i => (i(0).split("/")(2), Try{ i(2).toFloat }.toOption.getOrElse(0.0f)))
.reduceByKey(_ + _).take(5)

Short version: you probably have a line for which i(2) equals ?.
As per my comment your data most probably isn't consistent which won't be a problem in the first snippet because of the take(5) and no actions that require spark to perform operations on the whole data set. Spark is lazy and therefore will perform computations only until it gets 5 results from the map -> filter -> map chain.
The second snippet on the other hand will perform computations on your whole data set so it can perform the reduceByKey and only then it will take 5 results therefore it might catch problems which are too far in your data set for the first snippet.

Related

PySpark filtering gives inconsistent behavior

So I have a data set where I do some transformations and the last step is to filter out rows that have a 0 in a column called frequency. The code that does the filtering is super simple:
def filter_rows(self, name: str = None, frequency_col: str = 'frequency', threshold: int = 1):
df = getattr(self, name)
df = df.where(df[frequency_col] >= threshold)
setattr(self, name, df)
return self
The problem is a very strange behavior where if I put a rather high threshold like 10, it works fine, filtering out all the rows below 10. But if I make the threshold just 1, it does not remove the 0s! Here is an example of the former (threshold=10):
{"user":"XY1677KBTzDX7EXnf-XRAYW4ZB_vmiNvav7hL42BOhlcxZ8FQ","domain":"3a899ebbaa182778d87d","frequency":12}
{"user":"lhoAWb9U9SXqscEoQQo9JqtZo39nutq3NgrJjba38B10pDkI","domain":"3a899ebbaa182778d87d","frequency":9}
{"user":"aRXbwY0HcOoRT302M8PCnzOQx9bOhDG9Z_fSUq17qtLt6q6FI","domain":"33bd29288f507256d4b2","frequency":23}
{"user":"RhfrV_ngDpJex7LzEhtgmWk","domain":"390b4f317c40ac486d63","frequency":14}
{"user":"qZqqsNSNko1V9eYhJB3lPmPp0p5bKSq0","domain":"390b4f317c40ac486d63","frequency":11}
{"user":"gsmP6RG13azQRmQ-RxcN4MWGLxcx0Grs","domain":"f4765996305ccdfa9650","frequency":10}
{"user":"jpYTnYjVkZ0aVexb_L3ZqnM86W8fr082HwLliWWiqhnKY5A96zwWZKNxC","domain":"f4765996305ccdfa9650","frequency":15}
{"user":"Tlgyxk_rJF6uE8cLM2sArPRxiOOpnLwQo2s","domain":"f89838b928d5070c3bc3","frequency":17}
{"user":"qHu7fpnz2lrBGFltj98knzzbwWDfU","domain":"f89838b928d5070c3bc3","frequency":11}
{"user":"k0tU5QZjRkBwqkKvMIDWd565YYGHfg","domain":"f89838b928d5070c3bc3","frequency":17}
And now here is some of the data with threshold=1:
{"user":"KuhSEPFKACJdNyMBBD2i6ul0Nc_b72J4","domain":"d69cb6f62b885fec9b7d","frequency":0}
{"user":"EP1LomZ3qAMV3YtduC20","domain":"d69cb6f62b885fec9b7d","frequency":0}
{"user":"UxulBfshmCro-srE3Cs5znxO5tnVfc0_yFps","domain":"d69cb6f62b885fec9b7d","frequency":1}
{"user":"v2OX7UyvMVnWlDeDyYC8Opk-va_i8AwxZEsxbk","domain":"d69cb6f62b885fec9b7d","frequency":0}
{"user":"4hu1uE2ucAYZIrNLeOY2y9JMaArFZGRqjgKzlKenC5-GfxDJQQbLcXNSzj","domain":"68b588cedbc66945c442","frequency":0}
{"user":"5rFMWm_A-7N1E9T289iZ65TIR_JG_OnZpJ-g","domain":"68b588cedbc66945c442","frequency":1}
{"user":"RLqoxFMZ7Si3CTPN1AnI4hj6zpwMCJI","domain":"68b588cedbc66945c442","frequency":1}
{"user":"wolq9L0592MGRfV_M-FxJ5Wc8UUirjqjMdaMDrI","domain":"68b588cedbc66945c442","frequency":0}
{"user":"9spTLehI2w0fHcxyvaxIfo","domain":"68b588cedbc66945c442","frequency":1}
I should note that before this step I perform some other transformations, and I've noticed weird behaviors in Spark in the past sometimes doing very simple things like this after a join or a union can give very strange results where eventually the only solution is to write out the data and read it back in again and do the operation in a completely separate script. I hope there is a better solution than this!

spark - extract elements from an RDD[Row] when reading Hive table in Spark

I was going to read a Hive table in spark using scala, and extract some/all of fields from it and then save the data into HDFS.
My code is as follow:
val data = spark.sql("select * from table1 limit 1000")
val new_rdd = data.rdd.map(row => {
var arr = new ArrayBuffer[String]
val len = row.size
for(i <- 0 to len-1) arr.+=(row.getAs[String](i))
arr.toArray
})
new_rdd.take(10).foreach(println)
new_rdd.map(_.mkString("\t")).saveAsTextFile(dataOutputPath)
The above chunk is the one that finally worked.
I had written another version, where this line:
for(i <- 0 to len-1) arr.+=(row.getAs[String](i))
was replaced by this line:
for(i <- 0 to len-1) arr.+=(row.get(i).toString)
To me, both lines did exactly the same thing: for each row, I get the ith element as a string, and put it into the ArrayBuffer, which comes to an Array at the end.
However, the two methods have different results.
The first line works well. Data were able to be correctly saved on HDFS.
While the Error was thrown when I am going to save the data if using the second line:
ERROR ApplicationMaster: User class threw exception:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 56
in stage 3.0 failed 4 times, most recent failure: Lost task 56.3 in stage
3.0 (TID 98, ip-172-31-18-87.ec2.internal, executor 6):
java.lang.NullPointerException
Therefore, I wonder if there is some intrinsic differences in between
getAs[String](i)
and
get(i).toString
?
Many thanks
getAs[String](i) is the same as
get(i).asInstanceOf[String]
therefore it is just a type casting. toString is not.

SparkML MultilayerPerceptron error: java.lang.ArrayIndexOutOfBoundsException

I have the following model that I would like to estimate using SparkML MultilayerPerceptronClassifier().
val formula = new RFormula()
.setFormula("vtplus15predict~ vhisttplus15 + vhistt + vt + vtminus15 + Time + Length + Day")
.setFeaturesCol("features")
.setLabelCol("label")
formula.fit(data).transform(data)
Note: The features is a vector and label is a Double
root
|-- features: vector (nullable = true)
|-- label: double (nullable = false)
I define my MLP estimator as follows:
val layers = Array[Int](6, 5, 8, 1) //I suspect this is where it went wrong
val mlp = new MultilayerPerceptronClassifier()
.setLayers(layers)
.setBlockSize(128)
.setSeed(1234L)
.setMaxIter(100)
// train the model
val model = mlp.fit(train)
Unfortunately, I got the following error:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 3, localhost, executor driver): java.lang.ArrayIndexOutOfBoundsException: 11
at org.apache.spark.ml.classification.LabelConverter$.encodeLabeledPoint(MultilayerPerceptronClassifier.scala:121)
at org.apache.spark.ml.classification.MultilayerPerceptronClassifier$$anonfun$3.apply(MultilayerPerceptronClassifier.scala:245)
at org.apache.spark.ml.classification.MultilayerPerceptronClassifier$$anonfun$3.apply(MultilayerPerceptronClassifier.scala:245)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:363)
at scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:935)
at scala.collection.Iterator$GroupedIterator.go(Iterator.scala:950)
...
org.apache.spark.ml.classification.LabelConverter$.encodeLabeledPoint(MultilayerPerceptronClassifier.scala:121)
This tells us that an array is out of bounds in the MultilayerPerceptronClassifier.scala file, let's look at the code there:
def encodeLabeledPoint(labeledPoint: LabeledPoint, labelCount: Int): (Vector, Vector) = {
val output = Array.fill(labelCount)(0.0)
output(labeledPoint.label.toInt) = 1.0
(labeledPoint.features, Vectors.dense(output))
}
It performs a one-hot encoding of the labels in the dataset. The ArrayIndexOutOfBoundsException occurs since the output array is too short.
By going back in the code, it's possible to find that labelCount is the same as the number of output nodes in the layers array. In other words, the number of output nodes should be the same as the number of classes. Looking at the documentation for MLP there is the following line:
The number of nodes N in the output layer corresponds to the number of classes.
The solution is therefore to either:
Change the number of nodes in the final layer of the network (output nodes)
Reconstruct the data to have the same number of classes as your network output nodes.
Note: The final output layer should always be 2 or more, never 1, since there should be one node per class and a problem with a single class does not make sense.
rearrange your dataset as the error shows you have fewer arrays than you have in your features set or your data set has a null set which prompted an error.I came across this type of error while working on my MLP project.hope my answer helps you.
thanks for reaching out
The solution is to first find the local optimal that allows one to escape the ArrayIndexOutBound and then use brute-force search to find the global optimal. Shaido suggest finding n
For example, val layers =
Array[Int](6, 5, 8, n). This assumes the length of the feature vectors
are 6. – Shaido
So make n be a large integer(n =100) then manually use brute-force search to arrive at a good solution(n =50 then try n =32 - error, n = 35 - perfect).
Credit to Shaido.

Data-processing takes too long if pre-processing just before

So I've been trying to perform a cumsum operation on a data-set. I want to emphasize that I want my cumsum to happen on partitions on my data-set (eg. cumsum for feature1 over time for personA).
I know how to do it, and it works "on its own" perfectly - i'll explain that part later. Here's the piece of code doing it:
// it's admitted that this DF contains all data I need
// with one column/possible value, with only 1/0 in each line
// 1 <-> feature has the value
// 0 <-> feature doesn't contain the value
// this DF is the one I get after the one-hot operation
// this operation is performed to apply ML algorithms on features
// having simultaneously multiple values
df_after_onehot.createOrReplaceTempView("test_table")
// #param DataFrame containing all possibles values eg. A, B, C
def cumSumForFeatures(values: DataFrame) = {
values
.map(value => "CAST(sum(" + value(0) + ") OVER (PARTITION BY person ORDER BY date) as Integer) as sum_" + value(0))
.reduce(_+ ", " +_)
}
val req = "SELECT *, " + cumSumForFeatures(possible_segments) + " FROM test_table"
// val req = "SELECT * FROM test_table"
println("executing: " + req)
val data_after_cumsum = sqLContext.sql(req).orderBy("person", "date")
data_after_cumsum.show(10, false)
The problem happens when I try to perform the same operation with some pre-processing before (like the one-hot operation, or adding computed features before). I tried with a very small dataset and it doesn't work.
Here is the printed stack trace (the part that should interess you at least):
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
[Executor task launch worker-3] ERROR util.SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-3,5,main]
java.lang.OutOfMemoryError: GC overhead limit exceeded
So it seems it's related to a GC issue/JVM heap size? I just don't understand how it's related to my pre-processing?
I tried unpersist operation on not-used-anymore DFs.
I tried modifying the options on my machine (eg. -Xmx2048m).
The issue is the same once I deploy on AWS.
Extract of my pom.xml (for versions of Java, Spark, Scala):
<spark.version>2.1.0</spark.version>
<scala.version>2.10.4</scala.version>
<maven.compiler.source>1.8</maven.compiler.source>
<maven.compiler.target>1.8</maven.compiler.target>
Would you know how I could fix my issue?
Thanks
From what I understand, I think that we could have two reasons for that:
JVM's heap overflow because of kept-in-memory-but-no-longer-used DataFrames
the cum-sum request could be too big to be processed with the few amount of RAM left
show/print operations increase the number of steps necessary for the job, and may interfer with Spark's inner optimizations
Considering that, I decided to "unpersist" no-longer-used DataFrames. That did not seem to change much.
Then, I decided to remove all unecessary show/print operations. That improved the number of step very much.
I changed my code to be more functionnal, but I kept 3 separate values to help debugging. That did not change much, but my code is cleaner.
Finally, here is the thing that helped me deal with the problem. Instead of making my request go through the dataset in one pass, I partitionned the list of features into slices:
def listOfSlices[T](list: List[T], sizeOfSlices: Int): List[List[T]] =
(for (i <- 0 until list.length by sizeOfSlices) yield list.slice(i, i+sizeOfSlices)).toList
I perform the request for each slice of, with a map operation. Then I join together them to have my final DataFrame. That way, I kind of distribute the computation, and it seems that this way is more efficient.
val possible_features_slices = listOfSlices[String](possible_features, 5)
val df_cum_sum = possible_features_slices
.map(possible_features_slice =>
dfWithCumSum(sqLContext, my_df, possible_segments_slice, "feature", "time")) // operation described in the original post
.foldLeft[DataFrame](null)((a, b) => if (a == null) b else if (b == null) a else a.join(b, Seq("person", "list_features", "time")))
I just really want to emphasize that I still not understand the reason behind my problem, and I still expect an answer at this level.

Implement a MergeSort like feature in spark with scala

Spark Version 1.2.1
Scala Version 2.10.4
I have 2 SchemaRDD which are associated by a numeric field:
RDD 1: (Big table - about a million records)
[A,3]
[B,4]
[C,5]
[D,7]
[E,8]
RDD 2: (Small table < 100 records so using it as a Broadcast Variable)
[SUM, 2]
[WIN, 6]
[MOM, 7]
[DOM, 9]
[POM, 10]
Result
[C,5, WIN]
[D,7, MOM]
[E,8, DOM]
[E,8, POM]
I want the max(field) from RDD1 which is <= the field from RDD2.
I am trying to approach this using Merge by:
Sorting RDD by a key (sort within a group will have not more than 100 records in that group. In the above example is within a group)
Performing the merge operation similar to mergesort. Here I need to keep a track of the previous value as well to find the max; still I traverse the list only once.
Since there are too may variables here I am getting "Task not serializable" exception. Is this implementation approach Correct? I am trying to avoid the Cartesian Product here. Is there a better way to do it?
Adding the code -
rdd1.groupBy(itm => (itm(2), itm(3))).mapValues( itmorg => {
val miorec = itmorg.toList.sortBy(_(1).toString)
for( r <- 0 to miorec.length) {
for ( q <- 0 to rdd2.value.length) {
if ( (miorec(r)(1).toString > rdd2.value(q).toString && miorec(r-1)(1).toString <= rdd2.value(q).toString && r > 0) || r == miorec.length)
org.apache.spark.sql.Row(miorec(r-1)(0),miorec(r-1)(1),miorec(r-1)(2),miorec(r-1)(3),rdd2.value(q))
}
}
}).collect.foreach(println)
I would not do a global sort. It is an expensive operation for what you need. Finding the maximum is certainly cheaper than getting a global ordering of all values. Instead, do this:
For each partition, build a structure that keeps the max on RDD1 for each row on RDD2. This can be trivially done using mapPartitions and normal scala data structures. You can even use your one-pass merge code here. You should get something like a HashMap(WIN -> (C, 5), MOM -> (D, 7), ...)
Once this is done locally on each executor, merging these resulting data structures should be simple using reduce.
The goal here is to do little to no shuffling an keeping the most complex operation local, since the result size you want is very small (it would be easier in code to just create all valid key/values with RDD1 and RDD2 then aggregateByKey, but less efficient).
As for your exception, you woudl need to show the code, "Task not serializable" usually means you are passing around closures which are not, well, serializable ;-)