Spark reducer and summation result issue - scala

Here is sample file
Department,Designation,costToCompany,State
Sales,Trainee,12000,UP
Sales,Lead,32000,AP
Sales,Lead,32000,LA
Sales,Lead,32000,TN
Sales,Lead,32000,AP
Sales,Lead,32000,TN
Sales,Lead,32000,LA
Sales,Lead,32000,LA
Marketing,Associate,18000,TN
Marketing,Associate,18000,TN
HR,Manager,58000,TN
Produce the output as CSV, grouped by Department, Designation and State, with additional columns for sum(costToCompany) and the total employee count. The result should look like:
Dept,Desg,state,empCount,totalCost
Sales,Lead,AP,2,64000
Sales,Lead,LA,3,96000
Sales,Lead,TN,2,64000
Following is my solution, but writing to a file results in an error. What am I doing wrong here?
Step #1: Load file
val file = sc.textFile("data/sales.txt")
Step #2: Create a case class to represent the data
scala> case class emp(Dept:String, Desg:String, totalCost:Double, State:String)
defined class emp
Step #3: Split data and create RDD of emp object
scala> val fileSplit = file.map(_.split(","))
scala> val data = fileSplit.map(x => emp(x(0), x(1), x(2).toDouble, x(3)))
Step #4: Turn the data into key/value pairs with key=(dept, desg, state) and value=(1, totalCost)
scala> val keyVals = data.map(x => ((x.Dept,x.Desg,x.State),(1,x.totalCost)))
Step #5: Aggregate using reduceByKey, since we want sums of both the employee count and the cost
scala> val results = keyVals.reduceByKey{(a,b) => (a._1+b._1, a._2+b._2)} //(a.count+ b.count, a.cost+b.cost)
results: org.apache.spark.rdd.RDD[((String, String, String), (Int, Double))] = ShuffledRDD[41] at reduceByKey at <console>:55
Step #6: Save the results
scala> results.repartition(1).saveAsTextFile("data/result")
Error
17/08/16 22:16:59 ERROR executor.Executor: Exception in task 0.0 in stage 20.0 (TID 23)
java.lang.NumberFormatException: For input string: "costToCompany"
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1250)
at java.lang.Double.parseDouble(Double.java:540)
at scala.collection.immutable.StringLike$class.toDouble(StringLike.scala:232)
at scala.collection.immutable.StringOps.toDouble(StringOps.scala:31)
at $line85.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:51)
at $line85.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:51)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:194)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:242)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
17/08/16 22:16:59 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 20.0 (TID 23, localhost, executor driver): java.lang.NumberFormatException: For input string: "costToCompany"
Update 1
I forgot to remove the header; updated code is below. Save is now throwing a different error. Also, I need to put the header back in the output file.
scala> val file = sc.textFile("data/sales.txt")
scala> val header = fileSplit.first()
scala> val noHeaderData = fileSplit.filter(_(0) != header(0))
scala> case class emp(Dept:String, Desg:String, totalCost:Double, State:String)
scala> val data = noHeaderData.map(x => emp(x(0), x(1), x(2).toDouble, x(3)))
scala> val keyVals = data.map(x => ((x.Dept,x.Desg,x.State),(1,x.totalCost)))
scala> val resultSpecific = results.map(x => (x._1._1, x._1._2, x._1._3, x._2._1, x._2._2))
scala> resultSpecific.repartition(1).saveASTextFile("data/specific")
<console>:64: error: value saveASTextFile is not a member of org.apache.spark.rdd.RDD[(String, String, String, Int, Double)]
resultSpecific.repartition(1).saveASTextFile("data/specific")

To answer your question as well as comments:
It would be easier to use DataFrames in this case. Since your file is in CSV format, you can load and save the data as follows, so you don't have to split the rows yourself or take care of the header (both when loading and saving).
val spark = SparkSession.builder.getOrCreate()
import spark.implicits._
val df = spark.read
.format("com.databricks.spark.csv")
.option("header", "true") //reading the headers
.load("csv/file/path");
The DataFrame column names will then be the same as the header in the file. Instead of reduceByKey() you can use the DataFrame's groupBy() and agg():
val res = df.groupBy($"Department", $"Designation", $"State")
.agg(count($"costToCompany").alias("empCount"), sum($"costToCompany").alias("totalCost"))
Then save it:
res.coalesce(1)
.write.format("com.databricks.spark.csv")
.option("header", "true")
.save("results.csv")

When you try to cast to Double, the header string "costToCompany" won't cast, which is why the job fails as soon as an action is fired. Just drop the first record (the header) from the file and it will work; see the sketch below. You could also do this with a DataFrame, which would be easier.
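A minimal sketch of filtering the header out of the RDD before parsing, reusing the file and emp definitions from the question:
val header = file.first() // the header line "Department,Designation,costToCompany,State"
val noHeader = file.filter(_ != header)
val data = noHeader.map(_.split(",")).map(x => emp(x(0), x(1), x(2).toDouble, x(3)))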

The error is straightforward and says that
:64: error: value saveASTextFile is not a member of
org.apache.spark.rdd.RDD[(String, String, String, Int, Double)]
resultSpecific.repartition(1).saveASTextFile("data/specific")
In fact, there is no method called saveASTextFile(...); the method is saveAsTextFile(...). You have a case error in the method name.
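With the case fixed, resultSpecific.repartition(1).saveAsTextFile("data/specific") works. A hypothetical sketch of also formatting the tuples as CSV lines and putting a header back before saving (the header partition ends up first because union preserves partition order and coalesce without a shuffle keeps it):
val headerLine = sc.parallelize(Seq("Dept,Desg,State,empCount,totalCost"))
val csvLines = resultSpecific.map { case (dept, desg, state, cnt, cost) => s"$dept,$desg,$state,$cnt,$cost" }
(headerLine ++ csvLines).coalesce(1).saveAsTextFile("data/specific")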

Related

spark create Dataframe in UDF

I have an example where I want to create a DataFrame inside a UDF, something like the one below.
import org.apache.spark.ml.classification.LogisticRegressionModel
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.feature.VectorAssembler
Data to DataFrame:
val df = Seq((1,1,34,23,34,56),(2,1,56,34,56,23),(3,0,34,23,23,78),(4,0,23,34,78,23),(5,1,56,23,23,12),
(6,1,67,34,56,34),(7,0,23,23,23,56),(8,0,12,34,45,89),(9,1,12,34,12,34),(10,0,12,34,23,34)).toDF("id","label","tag1","tag2","tag3","tag4")
val assemblerDF = new VectorAssembler().setInputCols(Array("tag1", "tag2", "tag3","tag4")).setOutputCol("features")
val data = assemblerDF.transform(df)
val Array(train,test) = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val testData=test.toDF
val loadmodel=LogisticRegressionModel.load("/user/xu/savemodel")
sc.broadcast(loadmodel)
val assemblerFe = new VectorAssembler().setInputCols(Array("a", "b", "c","d")).setOutputCol("features")
sc.broadcast(assemblerFe)
UDF
def predict(predictSet: Vector): Double = {
  val set = Seq((1, 2, 3, 4)).toDF("a", "b", "c", "d")
  val predata = assemblerFe.transform(set)
  val result = loadmodel.transform(predata)
  result.rdd.take(1)(0)(3).toString.toDouble
}
spark.udf.register("predict", predict _)
testData.registerTempTable("datatable")
spark.sql("SELECT predict(features) FROM datatable").take(1)
I get an error like
ERROR Executor: Exception in task 3.0 in stage 4.0 (TID 7) [Executor task launch worker for task 7]
org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (vector) => double)
and
WARN TaskSetManager: Lost task 3.0 in stage 4.0 (TID 7, localhost, executor driver): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (vector) => double)
Is a DataFrame not supported inside a UDF? I am using Spark 2.3.0 and Scala 2.11. Thanks.
As mentioned in the comments, you don't need a UDF here to apply the trained model to test data. You can apply the model to a test DataFrame in the main program itself, as below:
val df = Seq((1,1,34,23,34,56),(2,1,56,34,56,23),(3,0,34,23,23,78),(4,0,23,34,78,23),(5,1,56,23,23,12),
(6,1,67,34,56,34),(7,0,23,23,23,56),(8,0,12,34,45,89),(9,1,12,34,12,34),(10,0,12,34,23,34)).toDF("id","label","tag1","tag2","tag3","tag4")
val assemblerDF = new VectorAssembler().setInputCols(Array("tag1", "tag2", "tag3","tag4")).setOutputCol("features")
val data = assemblerDF.transform(df)
val Array(train,test) = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val testData=test.toDF
val loadmodel=LogisticRegressionModel.load("/user/xu/savemodel")
sc.broadcast(loadmodel)
val assemblerFe = new VectorAssembler().setInputCols(Array("a", "b", "c","d")).setOutputCol("features")
sc.broadcast(assemblerFe)
val set=Seq((1,2,3,4)).toDF("a","b","c","d")
val predata = assemblerFe.transform(set)
val result=loadmodel.transform(predata) // Applying model on predata dataframe. You can apply model on any DataFrame.
result is now a DataFrame; you can register it as a table and query the prediction label and features using SQL, or you can select the prediction label and other fields directly from the DataFrame.
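A minimal sketch, assuming the default LogisticRegressionModel output column name "prediction":
result.select("features", "prediction").show()
// or register the DataFrame and query it with SQL
result.createOrReplaceTempView("predictions")
spark.sql("SELECT prediction, features FROM predictions").show()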
Please note, a UDF is a Spark SQL feature for defining new column-based functions that extend the vocabulary of Spark SQL's DSL for transforming Datasets. It does not return a DataFrame as a return type, and generally it is not advised to use UDFs unless necessary; refer to: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-udfs-blackbox.html

Merge two columns of type Array[string] into a new Array[string] column

I have two columns in a Spark SQL DataFrame with each entry in either column as an array of strings.
val ngramDataFrame = Seq(
(Seq("curious", "bought", "20"), Seq("iwa", "was", "asj"))
).toDF("filtered_words", "ngrams_array")
I want to merge the arrays in each row to make a single array in a new column. My code is as follows:
def concat_array(firstarray: Array[String], secondarray: Array[String]): Array[String] = {
  (firstarray ++ secondarray).toArray
}
val concatUDF = udf(concat_array _)
val concatFrame = ngramDataFrame.withColumn("full_array", concatUDF($"filtered_words", $"ngrams_array"))
I can successfully use the concat_array function on two arrays. However when I run the above code, I get the following exception:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 16.0 failed 1 times, most recent failure: Lost task 0.0 in stage 16.0 (TID 12, localhost): org.apache.spark.SparkException: Failed to execute user defined function(anonfun$1: (array, array) => array) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:389) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) at org.apache.spark.scheduler.Task.run(Task.scala:86) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to [Ljava.lang.String; at $line80.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(:76) ... 13 more Driver stacktrace:
In Spark 2.4 or later you can use concat (if you want to keep duplicates):
ngramDataFrame.withColumn(
"full_array", concat($"filtered_words", $"ngrams_array")
).show
+--------------------+---------------+--------------------+
| filtered_words| ngrams_array| full_array|
+--------------------+---------------+--------------------+
|[curious, bought,...|[iwa, was, asj]|[curious, bought,...|
+--------------------+---------------+--------------------+
or array_union (if you want to drop duplicates):
ngramDataFrame.withColumn(
"full_array",
array_union($"filtered_words", $"ngrams_array")
)
The same results can also be composed from other built-in functions, for example
ngramDataFrame.withColumn(
"full_array",
flatten(array($"filtered_words", $"ngrams_array"))
)
with duplicates, and
ngramDataFrame.withColumn(
"full_array",
array_distinct(flatten(array($"filtered_words", $"ngrams_array")))
)
without.
On a side note, you shouldn't use WrappedArray when working with ArrayType columns. Instead, you should expect the guaranteed interface, which is Seq. So the udf should use a function with the following signature:
(Seq[String], Seq[String]) => Seq[String]
Please refer to the SQL Programming Guide for details.
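For completeness, a minimal sketch of the original UDF rewritten against the Seq interface (the built-in functions above are still preferable):
import org.apache.spark.sql.functions.udf
val concatUDF = udf((first: Seq[String], second: Seq[String]) => first ++ second)
val concatFrame = ngramDataFrame.withColumn("full_array", concatUDF($"filtered_words", $"ngrams_array"))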
Arjun, there is an error in the UDF you created. When you pass array-type columns, the data type is not Array[String] but WrappedArray[String]. Below is the modified UDF along with its output.
import scala.collection.mutable
import org.apache.spark.SparkContext
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}
import org.apache.spark.sql.functions._
val SparkCtxt = new SparkContext(sparkConf) // sparkConf assumed to be defined elsewhere
val sqlContext = new SQLContext(SparkCtxt)
import sqlContext.implicits._
val temp=SparkCtxt.parallelize(Seq(Row(Array("String1","String2"),Array("String3","String4"))))
val df= sqlContext.createDataFrame(temp,
StructType(List(
StructField("Col1",ArrayType(StringType),true),
StructField("Col2",ArrayType(StringType),true)
)
) )
def concat_array(firstarray: mutable.WrappedArray[String],
                 secondarray: mutable.WrappedArray[String]): mutable.WrappedArray[String] = {
  firstarray ++ secondarray
}
val concatUDF = udf(concat_array _)
val df2=df.withColumn("udftest",concatUDF(df.col("Col1"), df.col("Col2")))
df2.select("udftest").foreach(each=>{println("***********")
println(each(0))})
df2.show(true)
OUTPUT:
+------------------+------------------+--------------------+
| Col1| Col2| udftest|
+------------------+------------------+--------------------+
|[String1, String2]|[String3, String4]|[String1, String2...|
+------------------+------------------+--------------------+
WrappedArray(String1, String2, String3, String4)

Unable to parallelize a list in Scala

I am unable to parallelize a list in Scala; I am getting a java.lang.NullPointerException.
messages.foreachRDD { rdd =>
  for (avroLine <- rdd) {
    val record = Injection.injection.invert(avroLine.getBytes).get
    val field1Value = record.get("username")
    val jsonStrings = Seq(record.toString())
    val newRow = sqlContext.sparkContext.parallelize(Seq(record.toString()))
  }
}
output
jsonStrings...List({"username": "user_118", "tweet": "tweet_218", "timestamp": 18})
Exception
Caused by: java.lang.NullPointerException
at com.capitalone.AvroConsumer$$anonfun$main$1$$anonfun$apply$1.apply(AvroConsumer.scala:83)
at com.capitalone.AvroConsumer$$anonfun$main$1$$anonfun$apply$1.apply(AvroConsumer.scala:74)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at org.apache.spark.util.CompletionIterator.foreach(CompletionIterator.scala:26)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:917)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:917)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
Thanks in Advance!!
You're trying to create an RDD in the Spark worker context. While foreachRDD operates in the driver, the foreach operation you perform on each RDD is distributed to the workers. It seems unlikely that you actually want to create a new RDD for each line of the input stream.
Update after comments:
It's hard to have this discussion in a comment thread where there is no formatting for code. My basic question is why aren't you doing something like this:
val messages: ReceiverInputDStream[String] = RabbitMQUtils.createStream(ssc, rabbitParams)
def toJsonString(message: String): String = SparkUtils.getRecordInjection(QUEUE_NAME).invert(message.getBytes()).get
val jsonStrings: DStream[String] = messages map toJsonString
I haven't bothered to figure out and track down all the libraries you're using (please, next time, submit a MCVE), so I haven't tried to compile that. But it looks like all you want is to map each input message to a JSON string. Maybe you want to do something fancy with the resulting DStream of Strings but that might be a different question.
def toJsonString(message: String): String = {
  val record = SparkUtils.getRecordInjection(QUEUE_NAME).invert(message.getBytes()).get
  record.toString // return the decoded record as a JSON string instead of Unit
}
dStreams.foreachRDD { rdd =>
  val jsonStrings = rdd.map(stream => toJsonString(stream))
  val df = sqlContext.read.json(jsonStrings)
  df.write.mode("Append").csv("/Users/Documents/kafka-poc/consumer-out/def/")
}

Spark sqlContext UDF acting on Sets

I've been trying to define a function that works within Spark's DataFrame API, taking Scala sets as input and outputting an integer. I'm getting the following error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 20 in stage 25.0 failed 1 times, most recent failure: Lost task 20.0 in stage 25.0 (TID 473, localhost): java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to scala.collection.immutable.Set
Here is a simple code that gives the crux of the issue:
// generate sample data
case class Dummy( x:Array[Integer] )
val df = sqlContext.createDataFrame(Seq(
Dummy(Array(1,2,3)),
Dummy(Array(10,20,30,40))
))
// define the UDF
import org.apache.spark.sql.functions._
def setSize(A:Set[Integer]):Integer = {
A.size
}
// For some reason I couldn't get it to work without this valued function
val sizeWrap: (Set[Integer] => Integer) = setSize(_)
val sizeUDF = udf(sizeWrap)
// this produces the error
df.withColumn("colSize", sizeUDF('x)).show
What am I missing here? How can I get this to work? I know I can do this by casting to RDD but I don't want to go back and forth between RDD and DataFrames.
Use Seq:
val sizeUDF = udf((x: Seq[Integer]) => setSize(x.toSet))
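With the element type spelled out, the original call from the question works as a quick check:
df.withColumn("colSize", sizeUDF('x)).show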

How to build a large distributed [sparse] matrix in Apache Spark 1.0?

I have an RDD like this: byUserHour: org.apache.spark.rdd.RDD[(String, String, Int)]. I would like to create a sparse matrix of the data for calculations like median, mean, etc. The RDD contains the row_id, column_id and value. I have two Arrays containing the row_id and column_id strings for lookup.
Here is my attempt:
import breeze.linalg._
val builder = new CSCMatrix.Builder[Int](rows = BCnUsers.value.toInt, cols = broadcastTimes.value.size)
byUserHour.foreach { x =>
  val row = userids.indexOf(x._1)
  val col = broadcastTimes.value.indexOf(x._2)
  builder.add(row, col, x._3)
}
builder.result()
Here is my error:
14/06/10 16:39:34 INFO DAGScheduler: Failed to run foreach at <console>:38
org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: breeze.linalg.CSCMatrix$Builder$mcI$sp
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:770)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:713)
at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:697)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1176)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
My dataset is quite large so I would like to do this distributed if possible. Any help would be appreciated.
Progress update:
CSCMatrix is not meant to work in Spark. However, there is RowMatrix, which extends DistributedMatrix. RowMatrix has a method, computeColumnSummaryStatistics(), that should compute some of the stats I am looking for. I know MLlib is growing every day, so I will watch for updates, but in the meantime I will try to make an RDD[Vector] to feed a RowMatrix. Note that RowMatrix is experimental and represents a row-oriented distributed matrix with no meaningful row indices.
Starting with a slightly different mapping, byUserHour is now an RDD[(String, (String, Int))]. Because RowMatrix does not preserve the order of rows, I groupByKey on the row_id. Perhaps in the future I will figure out how to do this with a sparse matrix.
val byUser = byUserHour.groupByKey // RDD[(String, Iterable[(String, Int)])]
val times = countHour.map(x => x._1.split("\\+")(1)).distinct.collect.sortWith(_ < _) // Array[String]
val broadcastTimes = sc.broadcast(times) // Broadcast[Array[String]]
val userMaps = byUser.mapValues { x =>
  x.map { case (time, cnt) => time -> cnt }.toMap
} // RDD[(String, scala.collection.immutable.Map[String,Int])]
val rows = userMaps.map { case (u, ut) =>
  u.toDouble +: broadcastTimes.value.map(ut.getOrElse(_, 0).toDouble)
} // RDD[Array[Double]]
import org.apache.spark.mllib.linalg.{Vector, Vectors}
val rowVec = rows.map(x => Vectors.dense(x)) // RDD[org.apache.spark.mllib.linalg.Vector]
import org.apache.spark.mllib.linalg.distributed._
val countMatrix = new RowMatrix(rowVec)
val stats = countMatrix.computeColumnSummaryStatistics()
val meanvec = stats.mean
You can do that with a CoordinateMatrix:
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
val sparseMatrix = new CoordinateMatrix(byUserHour.map {
case (row, col, data) => MatrixEntry(row, col, data)
})
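Note that MatrixEntry expects (Long, Long, Double), so if row and col are the string ids from the question they first need to be mapped to numeric indices. A minimal sketch, assuming the userids and broadcastTimes lookup arrays from the original attempt are available:
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
val entries = byUserHour.map { case (row, col, data) =>
  MatrixEntry(userids.indexOf(row).toLong, broadcastTimes.value.indexOf(col).toLong, data.toDouble)
}
val sparseMatrix = new CoordinateMatrix(entries)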