How can I deal with Tuples in Spark Streaming? - scala

I have a problem in Spark with Scala: I want to multiply tuple elements in Spark Streaming. I get data from Kafka into a DStream, and my RDD data looks like this:
(2,[2,3,4,6,5])
(4,[2,3,4,6,5])
(7,[2,3,4,6,5])
(9,[2,3,4,6,5])
I want to apply a multiplication like this:
(2,[2*2,3*2,4*2,6*2,5*2])
(4,[2*4,3*4,4*4,6*4,5*4])
(7,[2*7,3*7,4*7,6*7,5*7])
(9,[2*9,3*9,4*9,6*9,5*9])
Then I get an RDD like this:
(2,[4,6,8,12,10])
(4,[8,12,16,24,20])
(7,[14,21,28,42,35])
(9,[18,27,36,54,45])
Finally, I take the smallest value from the second element of each tuple, like this:
(2,4)
(4,8)
(7,14)
(9,18)
How can I do this with Scala on a DStream? I use Spark version 1.6.

Here is a demo in Scala:
// val conf = new SparkConf().setAppName("ttt").setMaster("local")
// val sc = new SparkContext(conf)
// val data = Array("2,2,3,4,6,5", "4,2,3,4,6,5", "7,2,3,4,6,5", "9,2,3,4,6,5")
// val lines = sc.parallelize(data)
// change to your data (each RDD in streaming)
lines
  .map { x =>
    val fields = x.split(",")                  // "key,v1,v2,v3,v4,v5"
    (fields(0).toInt, fields.tail.map(_.toInt).toList)
  }
  .map(x => (x._1, x._2.min))                  // take the smallest value
  .map(x => (x._1, x._2 * x._1))               // multiply it by the key
  .foreach(x => println(x))
Here is the result:
(2,4)
(4,8)
(7,14)
(9,18)
Each RDD in a DStream contains the data for a specific time interval, and you can manipulate each RDD however you want.
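For example, if the Kafka input arrives as a DStream of strings like "2,2,3,4,6,5" (here called dstream, an assumed name), the same logic can be applied to every micro-batch with foreachRDD; this is only a sketch under that assumption:
// Sketch: apply the parse/min/multiply logic to each micro-batch.
// `dstream` is an assumed DStream[String] of lines like "2,2,3,4,6,5".
dstream.foreachRDD { rdd =>
  rdd
    .map { x =>
      val fields = x.split(",").map(_.toInt)
      (fields.head, fields.tail.toList)      // (key, values)
    }
    .map(x => (x._1, x._2.min * x._1))       // smallest value times the key
    .foreach(println)                        // note: prints on the executors
}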

Let's say you are getting the tuple RDD in a variable input:
import scala.collection.mutable.ListBuffer

val result = input
  .map { x =>                        // for each (key, values) pair
    val l = new ListBuffer[Int]()    // a new list for storing the multiplication results
    for (i <- x._2) {                // for each element in the array
      l += x._1 * i                  // append the element multiplied by the key
    }
    (x._1, l.toList)                 // return the new tuple
  }
  .map { x =>
    (x._1, x._2.min)                 // return the new tuple with the minimum element from the list
  }
result.foreach(println) should result in:
(2,4)
(4,8)
(7,14)
(9,18)
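As a side note (not part of the original answer), the same result can be obtained without a mutable buffer, assuming input is an RDD[(Int, List[Int])]:
// Multiply every value by its key, then keep the minimum per key.
val result = input.map { case (k, values) => (k, values.map(_ * k).min) }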

Related

Pyspark equivalent of Scala Spark

I have the following code in Scala:
val checkedValues = inputDf.rdd.map(row => {
val size = row.length
val items = for (i <- 0 until size) yield {
val fieldName = row.schema.fieldNames(i)
val sourceField = sourceFields(fieldName) // sourceField is a map which returns another object
val value = Option(row.get(i))
sourceField.checkType(value)
}
items
})
Basically, the above snippet takes a Spark DataFrame, converts it into an RDD, and applies a map function to return an RDD which is just a collection of objects with the datatype and other information for each of the values in the DataFrame.
How would I go about writing something equivalent in PySpark, given that schema is not an attribute of Row in PySpark, among other things?

Spark 2.3: Reading dataframe inside rdd.map()

I want to iterate through each row of an RDD using .map() and I want to use a dataframe inside the map function as follows:
val rdd = ... // rdd holding seq of ids in each row
val df = ... // df with columns `id: String` and `value: Double`
rdd
  .map { case Row(listOfStrings: Seq[String]) =>
    listOfStrings.foldLeft(Seq[Double]())(op = (temp, curr) => {
      // calling df here
      val extractValue: Double = df.filter(s"id == $curr").first()(1)
      temp :+ extractValue
    })
  }
The above is pseudocode which I made up, and it results in an exception because I cannot reference a DataFrame inside .map().
The only way I can think of overcoming this is to collect df before .map() so that it is no longer a dataframe. Is there a method in which I can do this without collecting? Note that joining the rdd and df is not suitable.
Basically, you have an RDD of lists of IDs (RDD[Seq[String]]) and a dataframe of tuples (id, value). You are trying to replace the IDs in the RDD with the corresponding values from the dataframe.
The way you try to do it is impossible in Spark. You cannot reference a dataframe or an RDD inside a map. Indeed, they are objects that you manipulate in the driver to parallelize jobs executed by the workers. However, the code inside map is executed by a worker, and a worker cannot delegate work to other workers; only the driver can. This is why (intuitively) what you are trying to do is not possible.
You say that a join is not suitable. I am not sure why, but a join is exactly what I propose, in combination with a flatMap. I use the RDD API, but we could write similar code using the DataFrame API.
// generating data
val data = Seq(Seq("a", "b", "c"), Seq("d", "e"), Seq("f"))
val rdd = sc.parallelize(data)
val df = Seq("a" -> 1d, "b" -> 2d, "c" -> 3d,
             "d" -> 4d, "e" -> 5d, "f" -> 6d)
  .toDF("id", "value")

// Transforming the dataframe into a RDD[(String, Double)]
val rdd_df = df.rdd
  .map(row => row.getAs[String]("id") -> row.getAs[Double]("value"))

val result = rdd
  // We start with zipWithUniqueId to remember how the lists were arranged
  .zipWithUniqueId
  // we flatten the lists, remembering for each id its list id
  .flatMap { case (ids, unique_id) => ids.map(id => id -> unique_id) }
  .join(rdd_df)
  .map { case (_, (unique_id, value)) => unique_id -> value }
  // we reform the lists by grouping by list id
  .groupByKey
  .map(_._2.toArray)
scala> result.collect
res: Array[Array[Double]] = Array(Array(1.0, 2.0, 3.0), Array(4.0, 5.0), Array(6.0))
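Since the answer mentions that similar code could be written with the DataFrame API, here is one possible sketch of that variant (my own sketch, not from the original answer; spark is assumed to be the active SparkSession). Note that collect_list, like groupByKey, does not guarantee the order of values inside each rebuilt list.
// A possible DataFrame-API variant of the same idea, using the `rdd` and `df` defined above.
import org.apache.spark.sql.functions.{collect_list, explode}
import spark.implicits._

val listsDf = rdd.zipWithUniqueId.toDF("ids", "list_id")   // tag each list with an id

val resultDf = listsDf
  .select($"list_id", explode($"ids").as("id"))            // one row per (list_id, id)
  .join(df, "id")                                          // look up the value of each id
  .groupBy("list_id")
  .agg(collect_list($"value").as("values"))                // rebuild one list per list id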

spark-streaming scala: how can I pass an array of strings to a filter?

I want to replace the string "a" with an array of Strings, so that .contains() checks for every String in the array. Is that possible?
val filtered = stream.flatMap(status => status.getText.split(" ").filter(_.contains("a")))
Edit:
Also tried this (sc is sparkContext):
val ssc = new StreamingContext(sc, Seconds(15))
val stream = TwitterUtils.createStream(ssc, None)
val filtered = stream.flatMap(status => status.getText.split(" ").filter(a.contains(_)))
And got the following error:
java.io.NotSerializableException: Object of org.apache.spark.streaming.twitter.TwitterInputDStream is being serialized possibly as a part of closure of an RDD operation. This is because the DStream object is being referred to from within the closure. Please rewrite the RDD operation inside this DStream to avoid this. This has been enforced to avoid bloating of Spark tasks with unnecessary objects.
Then I tried to broadcast the array before it is used:
val aBroadcast = sc.broadcast(a)
val filtered = stream.flatMap(status => status.getText.split(" ").filter(aBroadcast.value.contains(_)))
And got the same error.
Thanks
As I understand the question, you want to keep only those words of the split status text that appear in a:
val a = Array("a1", "a2")
val filtered = stream.flatMap(status => status.getText.split(" ").filter(word => a.contains(word)))
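If the goal was instead to keep the words that contain any of the strings in a as substrings (closer to the original .contains("a") filter), a possible variant, offered only as a sketch, is:
// Keep a word if it contains at least one of the strings in `a` as a substring.
val filtered = stream.flatMap(status => status.getText.split(" ").filter(word => a.exists(word.contains)))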

Spark Scala: Generating list of DataFrame based on values in RDD

I have an RDD containing values; each of those values will be passed to a function generate_df(num: Int) to create a DataFrame. So essentially, in the end, we will have a list of DataFrames stored in a ListBuffer, like this: var df_list_example = new ListBuffer[org.apache.spark.sql.DataFrame]().
First I will show the code and result of doing it using a list instead of RDD:
var df_list = new ListBuffer[org.apache.spark.sql.DataFrame]()

for (i <- list_values) { // list_values contains the values
  df_list += generate_df(i)
}
Result:
df_list:
scala.collection.mutable.ListBuffer[org.apache.spark.sql.DataFrame] =
ListBuffer([value: int], [value: int], [value: int])
However, when I use an RDD, which is essential for my use case, I run into an issue:
var df_rdd_list = new ListBuffer[org.apache.spark.sql.DataFrame]()
//rdd_values contains values
rdd_values.map( i => df_rdd_list += generate_df(i) )
Result:
df_rdd_list:
scala.collection.mutable.ListBuffer[org.apache.spark.sql.DataFrame] =
ListBuffer()
Basically, the ListBuffer remains empty and does not store any DataFrames, unlike when I use a list of values instead of an RDD of values. Mapping over the RDD is essential for my original use case.
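One likely explanation (stated as an assumption, not from the original post): rdd_values.map is a lazy transformation executed on the executors, so appending to a driver-side ListBuffer inside it never updates the driver's copy. If the values are few enough to collect, a minimal sketch of a workaround, reusing the question's generate_df, would be:
// Bring the values to the driver first, then build the DataFrames there.
// Assumes rdd_values is small enough to collect and that generate_df is in scope.
val df_list_from_rdd: List[org.apache.spark.sql.DataFrame] =
  rdd_values.collect().toList.map(i => generate_df(i))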

Converting String RDD to Int RDD

I am new to Scala. I want to know whether, when processing large datasets with Scala in Spark, it is possible to read the data as an Int RDD instead of a String RDD.
I tried the below:
val intArr = sc
.textFile("Downloads/data/train.csv")
.map(line=>line.split(","))
.map(_.toInt)
But I am getting the error:
error: value toInt is not a member of Array[String]
I need to convert to an Int RDD because down the line I need to do the following:
val vectors = intArr.map(p => Vectors.dense(p))
which requires a numeric type rather than String.
Any kind of help is truly appreciated. Thanks in advance.
As far as I understand, one line should create one vector, so it should go like this:
val result = sc
  .textFile("Downloads/data/train.csv")
  .map(line => line.split(","))
  .map(numbers => Vectors.dense(numbers.map(_.toDouble)))
numbers.map(_.toDouble) maps every element of the array to a Double, so the result type is Array[Double], which is what Vectors.dense expects (it does not accept Array[Int], so parsing straight to Int would not compile here).
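If an integer RDD is genuinely needed as an intermediate result (the original ask), a hedged variant under the same file layout keeps that step explicit and converts to Double only when building the vectors:
// RDD[Array[Int]] kept as an explicit intermediate result.
val intArr = sc
  .textFile("Downloads/data/train.csv")
  .map(_.split(",").map(_.toInt))

// Vectors.dense expects Array[Double], so convert at the last step.
val vectors = intArr.map(p => Vectors.dense(p.map(_.toDouble)))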