I'm running a simple Spark project on an EMR YARN cluster to:
read a text file on S3 into an RDD[String]
define a schema and convert that RDD into a DataFrame
I am calling mapPartitions on the RDD to convert the RDD[String] into an RDD[Row].
My problem: I get a java.lang.NullPointerException and I can't figure out what causes it.
The stack trace lists these two line numbers in the source code:
the line of rdd1.mapPartitions
within the anonymous function, the line with the match case that matches the regular expression
Here's the stack trace excerpt:
Caused by: java.lang.NullPointerException
at packageA.Herewego$$anonfun$3.apply(Herewego.scala:107)
at packageA.Herewego$$anonfun$3.apply(Herewego.scala:88)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
What I've tried:
The error occurs when running in YARN cluster mode, but not in local mode (in my IDE). This made me think that something isn't defined on the executors. I moved the createrow function definition into the anonymous function, but that didn't help.
Here's the code block:
val rdd4: RDD[Row] = rdd1.mapPartitions((it: Iterator[String]) => {
  def createrow(a: List[String]): Row = {
    val format = new java.text.SimpleDateFormat("dd/MMM/yyyy HH:mm:ss Z")
    val re1: Row = Row.apply(a.head)
    val d: Date = format.parse(a.tail.mkString(" "))
    val t = new Timestamp(d.getTime)
    val re2: Row = Row.apply(t)
    Row.merge(re1, re2)
  }
  var output: List[Row] = List()
  while (it.hasNext) {
    val data: String = it.next()
    val res = data match {
      case rx(ipadd, date, time) => createrow(List(ipadd, date, time))
      case _ => createrow(List("0.0.0.0", "00/Jan/0000", "00:00:00 0"))
    }
    output = output :+ res
  }
  output.toIterator
}).persist(MEMORY_ONLY)
// Collect and Persist the RDD in Memory
val tmp = rdd4.collect()
Do I need to broadcast any variables or functions used within mapPartitions?
Any pointers in the right direction will be more than appreciated.
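One thing that may be worth checking, offered as a guess because the definition of rx is not shown above: if rx is a field of an enclosing object (for example one that extends App), it might not be initialized on the executors, which would explain a NullPointerException on the match line in cluster mode only. Below is a minimal sketch of the per-partition logic with the pattern and the date format created inside the function. The object name LogLineParser, the pattern, and the sample line layout are hypothetical. It returns tuples instead of Row so that it runs without Spark, and it drops unmatched or unparseable lines rather than substituting a placeholder row.

```scala
import java.sql.Timestamp
import java.text.SimpleDateFormat
import java.util.Locale
import scala.util.Try

object LogLineParser {
  // Intended to be passed to rdd1.mapPartitions. Everything it needs is
  // created locally, so nothing depends on state initialized on the driver.
  def parsePartition(lines: Iterator[String]): Iterator[(String, Timestamp)] = {
    // Hypothetical pattern for lines shaped like "<ip> <dd/MMM/yyyy> <HH:mm:ss Z>".
    val rx = """(\S+) (\S+) (\S+ \S+)""".r
    // SimpleDateFormat is not thread-safe, so create one per partition call.
    val format = new SimpleDateFormat("dd/MMM/yyyy HH:mm:ss Z", Locale.ENGLISH)

    // Returns None for lines that do not match or whose date cannot be parsed.
    def parseLine(line: String): Option[(String, Timestamp)] = line match {
      case rx(ip, date, time) =>
        Try(new Timestamp(format.parse(s"$date $time").getTime)).toOption.map(ts => (ip, ts))
      case _ => None
    }

    // Lazily transform the iterator instead of building a List by appending.
    lines.map(parseLine).collect { case Some(row) => row }
  }
}
```

With Spark available, the tuples could be turned into Row values at the end, for example with Row(ip, ts), before applying the schema.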
I want to replace the string "a" with an array of strings, so that .contains() is checked against every string in the array. Is that possible?
val filtered = stream.flatMap(status => status.getText.split(" ").filter(_.contains("a")))
Edit:
Also tried this (sc is sparkContext):
val ssc = new StreamingContext(sc, Seconds(15))
val stream = TwitterUtils.createStream(ssc, None)
val filtered = stream.flatMap(status => status.getText.split(" ").filter(a.contains(_)))
And got the following error:
java.io.NotSerializableException: Object of org.apache.spark.streaming.twitter.TwitterInputDStream is being serialized possibly as a part of closure of an RDD operation. This is because the DStream object is being referred to from within the closure. Please rewrite the RDD operation inside this DStream to avoid this. This has been enforced to avoid bloating of Spark tasks with unnecessary objects.
Then I tried to broadcast the array before it is used:
val aBroadcast = sc.broadcast(a)
val filtered = stream.flatMap(status => status.getText.split(" ").filter(aBroadcast.value.contains(_)))
And got the same error.
Thanks
As I understand the question, you want to keep the words of the split status text that are also elements of a:
val a = Array("a1", "a2")
val filtered = stream.flatMap(status => status.getText.split(" ").filter(word => a.contains(word)))
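The question can be read two ways: keep words that are equal to one of the keywords, or keep words that contain one of the keywords as a substring (which is closer to the original _.contains("a")). The sketch below shows both filters on a plain string, without Spark. The object and method names are hypothetical; either predicate could be used inside the flatMap in place of the existing filter.

```scala
object WordFilter {
  // Keeps the words that are exactly equal to one of the keywords.
  def exactMatches(text: String, keywords: Set[String]): Seq[String] =
    text.split(" ").filter(word => keywords.contains(word)).toSeq

  // Keeps the words that contain at least one keyword as a substring.
  def substringMatches(text: String, keywords: Set[String]): Seq[String] =
    text.split(" ").filter(word => keywords.exists(k => word.contains(k))).toSeq
}
```

For the text "a1 xa2 b" and the keywords a1 and a2, the first filter keeps only a1, while the second keeps a1 and xa2.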
I have two columns in a Spark SQL DataFrame with each entry in either column as an array of strings.
val ngramDataFrame = Seq(
(Seq("curious", "bought", "20"), Seq("iwa", "was", "asj"))
).toDF("filtered_words", "ngrams_array")
I want to merge the arrays in each row to make a single array in a new column. My code is as follows:
def concat_array(firstarray: Array[String],
                 secondarray: Array[String]): Array[String] = {
  (firstarray ++ secondarray).toArray
}
val concatUDF = udf(concat_array _)
val concatFrame = ngramDataFrame.withColumn("full_array", concatUDF($"filtered_words", $"ngrams_array"))
I can successfully use the concat_array function on two arrays. However when I run the above code, I get the following exception:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 16.0 failed 1 times, most recent failure: Lost task 0.0 in stage 16.0 (TID 12, localhost): org.apache.spark.SparkException: Failed to execute user defined function(anonfun$1: (array<string>, array<string>) => array<string>)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:389)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to [Ljava.lang.String;
at $line80.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:76)
... 13 more
Driver stacktrace:
In Spark 2.4 or later you can use concat (if you want to keep duplicates):
ngramDataFrame.withColumn(
"full_array", concat($"filtered_words", $"ngrams_array")
).show
+--------------------+---------------+--------------------+
| filtered_words| ngrams_array| full_array|
+--------------------+---------------+--------------------+
|[curious, bought,...|[iwa, was, asj]|[curious, bought,...|
+--------------------+---------------+--------------------+
or array_union (if you want to drop duplicates):
ngramDataFrame.withColumn(
"full_array",
array_union($"filtered_words", $"ngrams_array")
)
These can also be composed from other higher-order functions, for example
ngramDataFrame.withColumn(
"full_array",
flatten(array($"filtered_words", $"ngrams_array"))
)
with duplicates, and
ngramDataFrame.withColumn(
"full_array",
array_distinct(flatten(array($"filtered_words", $"ngrams_array")))
)
without.
On a side note, you shouldn't use WrappedArray when working with ArrayType columns. Instead you should expect the guaranteed interface, which is Seq. So the UDF should use a function with the following signature:
(Seq[String], Seq[String]) => Seq[String]
Please refer to the SQL Programming Guide for details.
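As a minimal sketch of a function with that signature, the concatenation from the question could be written against Seq rather than Array. The object name ArrayConcat is hypothetical, and the snippet is plain Scala so that it runs without Spark.

```scala
object ArrayConcat {
  // Accepts any Seq implementation, including the wrapper that Spark
  // passes in for ArrayType columns, and returns the concatenation.
  def concatArrays(first: Seq[String], second: Seq[String]): Seq[String] =
    first ++ second
}
```

With Spark on the classpath, this would be wrapped the same way as in the question, that is udf(ArrayConcat.concatArrays _), and applied to the two array columns.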
Arjun, there is an error in the UDF you created. When you pass array-type columns, the data type is not Array[String] but WrappedArray[String]. Below is the modified UDF along with its output.
import org.apache.spark.SparkContext
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}
import scala.collection.mutable

// sparkConf is assumed to be defined elsewhere
val SparkCtxt = new SparkContext(sparkConf)
val sqlContext = new SQLContext(SparkCtxt)
import sqlContext.implicits._

val temp = SparkCtxt.parallelize(Seq(Row(Array("String1", "String2"), Array("String3", "String4"))))
val df = sqlContext.createDataFrame(temp,
  StructType(List(
    StructField("Col1", ArrayType(StringType), true),
    StructField("Col2", ArrayType(StringType), true)
  )))

// Array columns reach the UDF as WrappedArray, not as Array
def concat_array(firstarray: mutable.WrappedArray[String],
                 secondarray: mutable.WrappedArray[String]): mutable.WrappedArray[String] = {
  firstarray ++ secondarray
}
val concatUDF = udf(concat_array _)
val df2 = df.withColumn("udftest", concatUDF(df.col("Col1"), df.col("Col2")))
df2.select("udftest").foreach(each => {
  println("***********")
  println(each(0))
})
df2.show(true)
OUTPUT:
+------------------+------------------+--------------------+
| Col1| Col2| udftest|
+------------------+------------------+--------------------+
|[String1, String2]|[String3, String4]|[String1, String2...|
+------------------+------------------+--------------------+
WrappedArray(String1, String2, String3, String4)
Here is sample file
Department,Designation,costToCompany,State
Sales,Trainee,12000,UP
Sales,Lead,32000,AP
Sales,Lead,32000,LA
Sales,Lead,32000,TN
Sales,Lead,32000,AP
Sales,Lead,32000,TN
Sales,Lead,32000,LA
Sales,Lead,32000,LA
Marketing,Associate,18000,TN
Marketing,Associate,18000,TN
HR,Manager,58000,TN
Produce the output as CSV
Group by department, designation, state
With additional columns for sum(costToCompany) and sum(TotalEmployeeCount)
The result should look like
Dept,Desg,state,empCount,totalCost
Sales,Lead,AP,2,64000
Sales,Lead,LA,3,96000
Sales,Lead,TN,2,64000
The following is my solution; writing to a file results in an error. What am I doing wrong here?
Step #1: Load file
val file = sc.textFile("data/sales.txt")
Step #2: Create a case class to represent the data
scala> case class emp(Dept:String, Desg:String, totalCost:Double, State:String)
defined class emp
Step #3: Split data and create RDD of emp object
scala> val fileSplit = file.map(_.split(","))
scala> val data = fileSplit.map(x => emp(x(0), x(1), x(2).toDouble, x(3)))
Step #4: Turn the data into key/value pairs with key=(dept, desg, state) and value=(1, totalCost)
scala> val keyVals = data.map(x => ((x.Dept,x.Desg,x.State),(1,x.totalCost)))
Step #5: Group using reduceByKey, as we also want sums for the total number of employees and the cost
scala> val results = keyVals.reduceByKey{(a,b) => (a._1+b._1, a._2+b._2)} //(a.count+ b.count, a.cost+b.cost)
results: org.apache.spark.rdd.RDD[((String, String, String), (Int, Double))] = ShuffledRDD[41] at reduceByKey at <console>:55
Step #6: save the results
scala> results.repartition(1).saveAsTextFile("data/result")
Error
17/08/16 22:16:59 ERROR executor.Executor: Exception in task 0.0 in stage 20.0 (TID 23)
java.lang.NumberFormatException: For input string: "costToCompany"
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1250)
at java.lang.Double.parseDouble(Double.java:540)
at scala.collection.immutable.StringLike$class.toDouble(StringLike.scala:232)
at scala.collection.immutable.StringOps.toDouble(StringOps.scala:31)
at $line85.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:51)
at $line85.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:51)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:194)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:242)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
17/08/16 22:16:59 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 20.0 (TID 23, localhost, executor driver): java.lang.NumberFormatException: For input string: "costToCompany"
Update 1
I forgot to remove the header. The updated code is below. Saving now throws a different error. I also need to put the header back in the output file.
scala> val file = sc.textFile("data/sales.txt")
scala> val header = fileSplit.first()
scala> val noHeaderData = fileSplit.filter(_(0) != header(0))
scala> case class emp(Dept:String, Desg:String, totalCost:Double, State:String)
scala> val data = noHeaderData.map(x => emp(x(0), x(1), x(2).toDouble, x(3)))
scala> val keyVals = data.map(x => ((x.Dept,x.Desg,x.State),(1,x.totalCost)))
scala> val resultSpecific = results.map(x => (x._1._1, x._1._2, x._1._3, x._2._1, x._2._2))
scala> resultSpecific.repartition(1).saveASTextFile("data/specific")
<console>:64: error: value saveASTextFile is not a member of org.apache.spark.rdd.RDD[(String, String, String, Int, Double)]
resultSpecific.repartition(1).saveASTextFile("data/specific")
To answer your question as well as comments:
It would be easier to use DataFrames in this case. Since your file is in CSV format, you can load and save the data as follows. This way you do not need to split the rows yourself or handle the header (either when loading or when saving).
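import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, sum}

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._
val df = spark.read
  .format("com.databricks.spark.csv")
  .option("header", "true") // read the header row as column names
  .load("csv/file/path")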
The dataframe column names will then be the same as the header in the file. Instead of reduceByKey() you can use the dataframe's groupBy() and agg():
val res = df.groupBy($"Department", $"Designation", $"State")
  .agg(count($"costToCompany").alias("empCount"), sum($"costToCompany").alias("totalCost"))
Then save it:
res.coalesce(1)
  .write.format("com.databricks.spark.csv")
  .option("header", "true")
  .save("results.csv")
When you try to convert to Double, the header string "costToCompany" cannot be converted, which is why the job fails once an action is triggered. Drop the first record (the header) from the file and it will work. You can also do this kind of operation on a DataFrame, which is easier.
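To make the header handling and the aggregation concrete, here is a minimal sketch on plain Scala collections rather than RDDs, so that it runs without Spark. The object name CostAggregation is hypothetical; groupBy followed by reduce stands in for reduceByKey, and drop(1) assumes the header is the first line.

```scala
object CostAggregation {
  // Returns (dept, desg, state) -> (employee count, total cost).
  def aggregate(lines: Seq[String]): Map[(String, String, String), (Int, Double)] =
    lines
      .drop(1)                                  // skip the header row
      .map(_.split(","))
      .map(f => ((f(0), f(1), f(3)), (1, f(2).toDouble)))
      .groupBy(_._1)                            // group the pairs by key
      .map { case (key, rows) =>
        // Same combining function as in the reduceByKey step.
        key -> rows.map(_._2).reduce((a, b) => (a._1 + b._1, a._2 + b._2))
      }
}
```

On the sample file from the question, the key (Sales, Lead, LA) maps to a count of 3 and a total cost of 96000.0.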
The error is straightforward. It says:
<console>:64: error: value saveASTextFile is not a member of
org.apache.spark.rdd.RDD[(String, String, String, Int, Double)]
resultSpecific.repartition(1).saveASTextFile("data/specific")
There is no method called saveASTextFile(...); the method is saveAsTextFile(...). The capitalization in your method name is wrong.
I'm trying to create a string sampler using an RDD of strings as a dictionary and the RandomDataGenerator class from the org.apache.spark.mllib.random package.
import org.apache.spark.mllib.random.RandomDataGenerator
import org.apache.spark.rdd.RDD
import scala.util.Random
class StringSampler(var dic: RDD[String], var seed: Long = System.nanoTime) extends RandomDataGenerator[String] {
  require(dic != null, "Dictionary cannot be null")
  require(!dic.isEmpty, "Dictionary must contains lines (words)")
  Random.setSeed(seed)
  var fraction: Long = 1 / dic.count()
  //return a random line from dictionary
  override def nextValue(): String = dic.sample(withReplacement = true, fraction).take(1)(0)
  override def setSeed(newSeed: Long): Unit = Random.setSeed(newSeed)
  override def copy(): StringSampler = new StringSampler(dic)
  def setDictionary(newDic: RDD[String]): Unit = {
    require(newDic != null, "Dictionary cannot be null")
    require(!newDic.isEmpty, "Dictionary must contains lines (words)")
    dic = newDic
    fraction = 1 / dic.count()
  }
}
val dictionaryName: String
val dictionaries: Broadcast[Map[String, RDD[String]]]
val dictionary: RDD[String] = dictionaries.value(dictionaryName) // dictionary.cache()
val sampler = new StringSampler(dictionary)
RandomRDDs.randomRDD(context, sampler, size, numPartitions)
But I encounter a SparkException saying that the dictionary lacks a SparkContext when I try to generate a random RDD of strings. It seems that Spark loses the context of the dictionary RDD when copying it to the cluster nodes, and I don't know how to fix it.
I tried caching the dictionary before passing it to the StringSampler, but it didn't change anything.
I was thinking about linking it back to the original SparkContext, but I don't even know if that's possible. Does anyone have an idea?
Caused by: org.apache.spark.SparkException: This RDD lacks a SparkContext. It could happen in the following cases:
(1) RDD transformations and actions are NOT invoked by the driver, but inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
(2) When a Spark Streaming job recovers from checkpoint, this exception will be hit if a reference to an RDD not defined by the streaming job is used in DStream operations. For more information, See SPARK-13758.
I believe the issue is here:
val dictionaries: Broadcast[Map[String, RDD[String]]]
val dictionary: RDD[String] = dictionaries.value(dictionaryName)
You should not be broadcasting anything containing an RDD. RDDs are already parallelized and spread throughout the cluster. The error comes from trying to serialize and deserialize an RDD, which loses its context and is pointless anyway.
Just do this:
val dictionaries: Map[String, RDD[String]]
val dictionary: RDD[String] = dictionaries(dictionaryName)
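Two further points may be worth checking. As far as I can tell, the generator's nextValue() is called inside tasks, so calling dic.sample(...) there would still fall under case (1) of the error message. Also, 1 / dic.count() is Long division, which evaluates to 0 for any dictionary with more than one line. A possible alternative, assuming the dictionary is small enough to collect to the driver with dic.collect(), is to hold the words in a local sequence. The class LocalStringSampler below is hypothetical; it mirrors the nextValue, setSeed and copy methods of RandomDataGenerator but does not extend it, so that the snippet runs without Spark.

```scala
import scala.util.Random

// Hypothetical sampler that draws from a local, serializable dictionary
// (for example the result of dic.collect()) instead of from an RDD.
class LocalStringSampler(words: IndexedSeq[String], seed: Long = System.nanoTime)
    extends Serializable {
  require(words.nonEmpty, "Dictionary must contain at least one word")

  // Each sampler owns its generator instead of reseeding the global Random.
  private val rng = new Random(seed)

  // Returns a uniformly chosen word from the dictionary.
  def nextValue(): String = words(rng.nextInt(words.length))

  def setSeed(newSeed: Long): Unit = rng.setSeed(newSeed)

  // A copy shares the immutable dictionary but gets its own generator.
  def copy(): LocalStringSampler = new LocalStringSampler(words)
}
```

To use it with RandomRDDs.randomRDD, the class would additionally need to extend RandomDataGenerator[String] and mark the three methods with override.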