Problem
How do I update this line in my code
"case StringType => concat_ws(",",collect_list(col(c)))"
so that it only appends strings that are not already in the existing field? In this example, the letter "b" would not appear twice.
Code
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val df = Seq(
  (1, 1.0, true, "a"),
  (2, 2.0, false, "b"),
  (3, 2.0, false, "b"),
  (3, 2.0, false, "c")
).toDF("id", "d", "b", "s")

val dataTypes: Map[String, DataType] = df.schema.map(sf => (sf.name, sf.dataType)).toMap

def genericAgg(c: String) = {
  dataTypes(c) match {
    case DoubleType  => sum(col(c))
    case StringType  => concat_ws(",", collect_list(col(c)))
    case BooleanType => max(col(c))
  }
}

val aggExprs: Seq[Column] = df.columns.filterNot(_ == "id").map(c => genericAgg(c))

df
  .groupBy("id")
  .agg(aggExprs.head, aggExprs.tail: _*)
  .show()
You probably want to use collect_set() instead of collect_list(). This will automatically remove duplicates during the collection.
I am not sure why you want to turn the array of unique strings into a comma-delimited list. Spark can easily handle array columns and they are displayed such that each element can be seen. Still, if you absolutely must have the array converted into a comma-delimited string, use array_join in Spark 2.4+ or a UDF in earlier versions of Spark.
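A minimal sketch of that change, assuming Spark 2.4+ for array_join; only the StringType case differs from the question's genericAgg:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// collect_set drops duplicates while collecting; array_join (Spark 2.4+)
// turns the resulting array into a comma-delimited string.
def genericAgg(c: String) = dataTypes(c) match {
  case DoubleType  => sum(col(c))
  case StringType  => array_join(collect_set(col(c)), ",")
  case BooleanType => max(col(c))
}
If you stay on an older Spark version, keep collect_set for the deduplication and convert the array to a string with a small UDF instead.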
I have an RDD[Sale] and want to keep only the latest sale for each id. So what I did was create a pair RDD and then perform grouping and filtering:
val sales: RDD[(String, Sale)] = rawSales
  .map(sale => sale.id -> sale)
  .groupByKey()
  .mapValues(_.maxBy(_.timestamp))
But how do I get back to an RDD[Sale] instead of the pair RDD in this case?
The only way I figured out is the following:
val value: RDD[Sale] = sales.map(salePaired => salePaired._2)
Is this the proper solution?
You can access the keys or values of a pair RDD directly, much like you would with any Map:
val keys: RDD[String] = sales.keys
val values: RDD[Sale] = sales.values
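Applied to the question's pipeline, a minimal sketch (reusing rawSales and the Sale fields from the question) that ends directly in an RDD[Sale]:
import org.apache.spark.rdd.RDD

// Keep only the latest Sale per id, returning a plain RDD[Sale].
val latestSales: RDD[Sale] = rawSales
  .map(sale => sale.id -> sale)
  .groupByKey()
  .mapValues(_.maxBy(_.timestamp))
  .values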
I have a problem in Spark with Scala: I want to multiply tuple elements in Spark Streaming. I get data from Kafka into a DStream, and my RDD data looks like this:
(2,[2,3,4,6,5])
(4,[2,3,4,6,5])
(7,[2,3,4,6,5])
(9,[2,3,4,6,5])
I want to multiply each value in the list by the key, like this:
(2,[2*2,3*2,4*2,6*2,5*2])
(4,[2*4,3*4,4*4,6*4,5*4])
(7,[2*7,3*7,4*7,6*7,5*7])
(9,[2*9,3*9,4*9,6*9,5*9])
Then I get an RDD like this:
(2,[4,6,8,12,10])
(4,[8,12,16,24,20])
(7,[14,21,28,42,35])
(9,[18,27,36,54,45])
Finally, I take the smallest value from the second element of each tuple, like this:
(2,4)
(4,8)
(7,14)
(9,18)
How can I do this in Scala from a DStream? I use Spark version 1.6.
Here is a demo in Scala:
// val conf = new SparkConf().setAppName("ttt").setMaster("local")
// val sc = new SparkContext(conf)
// val data = Array("2,2,3,4,6,5", "4,2,3,4,6,5", "7,2,3,4,6,5", "9,2,3,4,6,5")
// val lines = sc.parallelize(data)
// change to your data (each RDD in the stream)
lines
  .map { x =>
    val fields = x.split(",").map(_.toInt)
    (fields(0), fields.slice(1, 6).toList) // (key, list of values)
  }
  .map(x => (x._1, x._2.min))    // take the smallest value in the list
  .map(x => (x._1, x._2 * x._1)) // multiply it by the key
  .foreach(x => println(x))
Here is the result:
(2,4)
(4,8)
(7,14)
(9,18)
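Since the question is about a DStream, the same chain can be applied per micro-batch. A minimal sketch, assuming a DStream[String] of "key,v1,v2,..." records named stream built from Kafka (the name is illustrative):
import org.apache.spark.streaming.dstream.DStream

// stream is assumed to be the DStream[String] of comma-separated records from Kafka.
val result: DStream[(Int, Int)] = stream
  .map { line =>
    val fields = line.split(",").map(_.trim.toInt)
    (fields.head, fields.tail.min) // smallest of the remaining values
  }
  .map { case (k, minV) => (k, minV * k) } // multiply by the key

result.print() // e.g. (2,4), (4,8), (7,14), (9,18) per batch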
Each RDD in a DStream contains the data for a specific time interval, and you can manipulate each RDD however you want.
Let's say you are getting the tuple RDD in a variable called input:
import scala.collection.mutable.ListBuffer
val result = input
  .map(x => { // for each (key, list) element
    var l = new ListBuffer[Int]() // a new list to store the multiplication results
    for (i <- x._2) { // for each element in the list
      l += x._1 * i // append the element multiplied by the key
    }
    (x._1, l.toList) // return the new tuple
  })
  .map(x => {
    (x._1, x._2.min) // return the new tuple with the minimum element of the list
  })
result.foreach(println) should result in:
(2,4)
(4,8)
(7,14)
(9,18)
I have an RDD (let's call it RDD1) with the following schema:
RDD[(String, Array[String])]
I would like to create a new RDD (RDD2) in which each row is a (String, String) pair, with the key and value coming from RDD1.
For example:
RDD1 = Array(("Fruit", Array("Orange", "Apple", "Peach")), ("Shape", Array("Square", "Rectangle")), ("Mathematician", Array("Aryabhatt")))
I want the output to be:
RDD2 = Array(("Fruit", "Orange"), ("Fruit", "Apple"), ("Fruit", "Peach"), ("Shape", "Square"), ("Shape", "Rectangle"), ("Mathematician", "Aryabhatt"))
Can someone help me with this piece of code?
My Try:
val R1 = RDD1.map(line => (line._1,line._2.split((","))))
val R2 = R1.map(line => line._2.foreach(ph => ph.map(line._1)))
This gives me an error:
error: value map is not a member of Char
I understand that this happens because the map ends up being applied to each String/Char rather than to the RDD. Please help me find a way to use nested functions for this purpose in Spark.
Break down the problem.
("Fruit",Array("Orange","Apple","Peach") -> Array(("Fruit", "Orange"), ("Fruit", "Apple"), ("Fruit", "Peach"))
def flattenLine(line: (String, Array[String])) = line._2.map(x => (line._1, x))
Apply that function to your rdd:
rdd1.flatMap(flattenLine)
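A minimal end-to-end sketch of the same idea (the sample data mirrors the question; sc is an existing SparkContext):
val rdd1 = sc.parallelize(Seq(
  ("Fruit", Array("Orange", "Apple", "Peach")),
  ("Shape", Array("Square", "Rectangle")),
  ("Mathematician", Array("Aryabhatt"))
))

def flattenLine(line: (String, Array[String])) = line._2.map(x => (line._1, x))

// flatMap pairs every key with each element of its array
val rdd2 = rdd1.flatMap(flattenLine)
// rdd2.collect(): Array((Fruit,Orange), (Fruit,Apple), (Fruit,Peach),
//                       (Shape,Square), (Shape,Rectangle), (Mathematician,Aryabhatt))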
I started testing Spark with Cassandra.
I get data from a Cassandra table that has two columns (a primary key and a set).
val sc = new SparkContext("spark://172.31.32.224:7077","test", conf)
val rdd = sc.cassandraTable("test", "table").select("pk", "lists")
.map(l => (l.get[String]("pk"), l.getList[String]("lists")))
But this code maps to (String, Seq[String]).
I'd like to break up the Seq[String] and pair each element with "pk", such as:
((pk1, list(1)), (pk1, list(2)), (pk1, list(3)))
Is there way to do this?
Replace map with flatMap and create a collection of pairs:
.flatMap { l =>
  val pk = l.get[String]("pk")
  l.getList[String]("lists").map(item => (pk, item)) // one (pk, element) pair per list element
}
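Combined with the code from the question, a minimal sketch (same keyspace, table, and column names as above):
import com.datastax.spark.connector._
import org.apache.spark.rdd.RDD

val pairs: RDD[(String, String)] = sc.cassandraTable("test", "table")
  .select("pk", "lists")
  .flatMap { l =>
    val pk = l.get[String]("pk")
    l.getList[String]("lists").map(item => (pk, item)) // one pair per list element of each row
  }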