Calling distinct and map together throws an NPE in the Spark library - Scala

I am unsure whether this is a bug, but if you do something like this
// d: spark.RDD[String]
d.distinct().map(x => d.filter(_.equals(x)))
you will get a Java NPE. However, if you do a collect immediately after the distinct, everything works fine.
I am using Spark 0.6.1.

Spark does not support nested RDDs or user-defined functions that refer to other RDDs, hence the NullPointerException; see this thread on the spark-users mailing list.
It looks like your current code is trying to group the elements of d by value; you can do this efficiently with the groupBy() RDD method:
scala> val d = sc.parallelize(Seq("Hello", "World", "Hello"))
d: spark.RDD[java.lang.String] = spark.ParallelCollection@55c0c66a
scala> d.groupBy(x => x).collect()
res6: Array[(java.lang.String, Seq[java.lang.String])] = Array((World,ArrayBuffer(World)), (Hello,ArrayBuffer(Hello, Hello)))
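If you really do need, for each distinct value, the matching elements of d (rather than the grouped RDD above), the workaround hinted at in the question is to collect the distinct values to the driver first, so that no RDD is referenced inside another RDD's closure. A minimal sketch under that assumption (names are illustrative):
// d: spark.RDD[String]
// Collect the distinct values to the driver, then filter d once per value.
// This avoids referencing an RDD inside another RDD's transformation.
val distinctValues: Array[String] = d.distinct().collect()
val perValue: Map[String, Array[String]] =
  distinctValues.map(v => v -> d.filter(_ == v).collect()).toMap
Note that this launches one Spark job per distinct value, so groupBy() above is usually the better choice.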

What about the windowing example provided in the Spark 1.3.0 streaming programming guide?
val dataset: RDD[(String, String)] = ...
val windowedStream = stream.window(Seconds(20))...
val joinedStream = windowedStream.transform { rdd => rdd.join(dataset) }
SPARK-5063 causes the example to fail, since the join is being called from within the transform method on an RDD.

Find columns to select, for spark.read(), from another Dataset - Spark Scala

I have a Dataset[Year] that has the following schema:
case class Year(day: Int, month: Int, Year: Int)
Is there any way to make a collection of the current schema?
I have tried:
println("Print -> "+ds.collect().toList)
But the result was:
Print -> List([01,01,2022], [31,01,2022])
I expected something like:
Print -> List(Year(01,01,2022), Year(31,01,2022))
I know that with a map I can adjust it, but I am trying to create a generic method that accepts any schema, and for this I cannot add a map doing the conversion.
That is my method:
class SchemeList[A] {
  def set[A](ds: Dataset[A]): List[A] = {
    ds.collect().toList
  }
}
Apparently the method's return type has the correct signature, but at runtime it throws an error:
val setYears = new SchemeList[Year]
val YearList: List[Year] = setYears.set(df)
Exception in thread "main" java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to schemas.Schemas$Year
Based on your additional information in your comment:
I need this list to use as variables when creating another dataframe via jdbc (I need to make a specific select within postgresql). Is there a more performative way to pass values from a dataframe as parameters in a select?
Given your initial dataset:
val yearsDS: Dataset[Year] = ???
and that you want to do something like:
val desiredColumns: Array[String] = ???
spark.read.jdbc(..).select(desiredColumns.head, desiredColumns.tail: _*)
You could find the column names of yearsDS by doing:
val desiredColumns: Array[String] = yearsDS.columns
Spark achieves this by using def schema, which is defined on Dataset.
You can see the definition of def columns in the Dataset source.
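Putting it together, a minimal sketch of reading from Postgres and selecting only those columns (the JDBC URL, table name, and connection properties below are placeholders, not from the original post):
import java.util.Properties

val desiredColumns: Array[String] = yearsDS.columns  // e.g. Array("day", "month", "Year")

// Hypothetical connection details; replace with your own.
val jdbcUrl = "jdbc:postgresql://localhost:5432/mydb"
val props = new Properties()
props.setProperty("user", "myuser")
props.setProperty("password", "mypassword")

val selected = spark.read
  .jdbc(jdbcUrl, "my_table", props)
  .select(desiredColumns.head, desiredColumns.tail: _*)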
Maybe you got a DataFrame, not a Dataset.
Try using .as[Year] to convert the DataFrame to a Dataset, like this:
val year = Year(1, 1, 1)
val years = Array(year, year).toList
import spark.implicits._
val df = spark
  .sparkContext
  .parallelize(years)
  .toDF("day", "month", "Year")
  .as[Year]
println(df.collect().toList)

How to call a method after a spark structured streaming query (Kafka)?

I need to execute some functions based on the values that I receive from topics. I'm currently using ForeachWriter to convert all the topics to a List.
Now, I want to pass this List as a parameter to the methods.
This is what I have so far
def doA(mylist: List[String]) = { /* something for A */ }
def doB(mylist: List[String]) = { /* something for B */ }
And this is how I call my streaming queries:
//{"s":"a","v":"2"}
//{"s":"b","v":"3"}
val readTopics = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "myTopic")
  .load()

val schema = new StructType()
  .add("s", StringType)
  .add("v", StringType)

val parseStringDF = readTopics.selectExpr("CAST(value AS STRING)")

val parseDF = parseStringDF
  .select(from_json(col("value"), schema).as("data"))
  .select("data.*")

parseDF.writeStream
  .format("console")
  .outputMode("append")
  .start()

// fails here
val listOfTopics = parseDF.select("s").map(row => row.getString(0)).collect.toList

// unable to call the below methods
for (t <- listOfTopics) {
  if (t == "a")
    doA(listOfTopics)
  else if (t == "b")
    doB(listOfTopics)
  else
    println("do nothing")
}
spark.streams.awaitAnyTermination()
Questions:
How can I call a stand-alone (non-streaming) method in a streaming job?
I cannot use ForeachWriter here because I want to pass a SparkSession to the methods, and SparkSession is not serializable. What are the alternatives for calling the methods doA and doB in parallel?
If you want to be able to collect data to the local Spark driver, you need to use parseDF.writeStream.foreachBatch, i.e. process each micro-batch with a function (or use a ForeachWriter via .foreach).
It's unclear what you need the SparkSession for within your two methods, but since they work on non-Spark datatypes, you probably shouldn't be using a SparkSession instance there anyway.
Alternatively, you should .select() and filter your topic column, then apply the functions to two "topic-a" and "topic-b" dataframes, thus parallelizing the workload. Otherwise, you would be better off just using a regular KafkaConsumer from kafka-clients, or Kafka Streams, rather than Spark.
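A minimal sketch of the foreachBatch approach, reusing doA, doB, and parseDF from the question and assuming Spark 2.4+ where foreachBatch is available (the batch-handling logic below is illustrative, not from the original answer):
import org.apache.spark.sql.DataFrame

def handleBatch(batchDF: DataFrame, batchId: Long): Unit = {
  // foreachBatch runs this function on the driver for every micro-batch,
  // so plain (non-streaming) methods can be called here.
  val values: List[String] = batchDF.select("s").collect().map(_.getString(0)).toList
  if (values.contains("a")) doA(values)
  if (values.contains("b")) doB(values)
}

parseDF.writeStream
  .outputMode("append")
  .foreachBatch(handleBatch _)
  .start()

spark.streams.awaitAnyTermination()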

spark-streaming scala: how can I pass an array of strings to a filter?

I want to replace the string "a" with an array of Strings, making .contains() check for every String in the array. Is that possible?
val filtered = stream.flatMap(status => status.getText.split(" ").filter(_.contains("a")))
Edit:
Also tried this (sc is sparkContext):
val ssc = new StreamingContext(sc, Seconds(15))
val stream = TwitterUtils.createStream(ssc, None)
val filtered = stream.flatMap(status => status.getText.split(" ").filter(a.contains(_)))
And got the following error:
java.io.NotSerializableException: Object of org.apache.spark.streaming.twitter.TwitterInputDStream is being serialized possibly as a part of closure of an RDD operation. This is because the DStream object is being referred to from within the closure. Please rewrite the RDD operation inside this DStream to avoid this. This has been enforced to avoid bloating of Spark tasks with unnecessary objects.
Then I tried to broadcast the array before it is used:
val aBroadcast = sc.broadcast(a)
val filtered = stream.flatMap(status => status.getText.split(" ").filter(aBroadcast.value.contains(_)))
And got the same error.
Thanks
As I understand the question, you want to keep the words of the split status text that appear in a:
val a = Array("a1", "a2")
val filtered = stream.flatMap(status => status.getText.split(" ").filter(a.contains(_)))
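For what it's worth, the same filter logic can be checked on a plain Scala collection before wiring it into the stream (the sample words below are made up):
val a = Array("a1", "a2")
val words = "a1 b2 a2 c3".split(" ")
val kept = words.filter(a.contains(_))
// kept: Array(a1, a2)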

Map reduce to perform group by and sum in Cassandra, with spark and job server

I am creating a Spark job server job which connects to Cassandra. After getting the records I want to perform a simple group by and sum on them. I am able to retrieve the data, but I could not print the output. I have searched Google for hours and have posted in the Cassandra Google group as well. My current code is below, and I am getting an error at collect.
override def runJob(sc: SparkContext, config: Config): Any = {
  //sc.cassandraTable("store", "transaction").select("terminalid","transdate","storeid","amountpaid").toArray().foreach(println)
  // Printing of each record is successful
  val rdd = sc.cassandraTable("POSDATA", "transaction").select("terminalid","transdate","storeid","amountpaid")
  val map1 = rdd.map(x => (x.getInt(0), x.getInt(1), x.getDate(2)) -> x.getDouble(3)).reduceByKey((x, y) => x + y)
  println(map1)
  // output is ShuffledRDD[3] at reduceByKey at Daily.scala:34
  map1.collect
  //map1.collectAsMap().map(println(_))
  // Throwing error java.lang.ClassNotFoundException: transaction.Daily$$anonfun$2
}
Your map1 is an RDD. You can try the following:
map1.foreach(r => println(r))
Spark does lazy evaluation on RDDs, so try some action:
map1.take(10).foreach(println)
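Note that foreach on an RDD runs on the executors, so on a cluster (or under the job server) the println output ends up in the executor logs rather than in the job's result. If the goal is to print or return the aggregated values from the driver, a small sketch building on map1 above is:
// Bring the aggregated results back to the driver and print them there.
val results = map1.collect()
results.foreach { case (key, total) => println(s"$key -> $total") }
// In a Spark Job Server job you would typically return such a (serializable) result from runJob instead of printing it.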

Why don't keys() and values() work on a (String, String) pair RDD, while sortByKey() works?

I create an RDD using the README.md file in the Spark directory. The type of newRDD is (String, String).
val lines = sc.textFile("README.md")
val newRDD = lines.map(x => (x.split(" ")(0),x))
So, when I try to run newRDD.values() or newRDD.keys(), I get the error:
error: org.apache.spark.rdd.RDD[String] does not take parameters
for newRDD.values() and newRDD.keys() respectively.
What I can understand from the error is that maybe the String data type cannot be a key (and I think I am wrong). But if that is the case, why does
newRDD.sortByKey() work?
Note: I am trying the values() and keys() transformations because they are listed as valid transformations for pair RDDs.
Edit: I am using Apache Spark version 1.5.2 in Scala
It doesn't work because values (like keys) is defined without a parameter list, so it has to be called without parentheses. sortByKey, on the other hand, is declared with a parameter list whose arguments all have default values, which is why sortByKey() compiles:
val rdd = sc.parallelize(Seq(("foo", "bar")))
rdd.keys.first
// String = foo
rdd.values.first
// String = bar