I am trying to get an RDD[(String, Iterable[String])] without using groupByKey. These are my tuples:
(Group 1, John)
(Group 2, Sam)
(Group 1, Mary)
(Group 3, Pam)
I need to get:
(Group 1, List(John, Mary)), (Group 2, List(Sam)), (Group 3, List(Pam))
without using the groupBy or groupByKey functions. How can I do this?
If you want to use the Spark APIs, you can use a window function over the key and collect the values into a list:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val df = Seq(
  ("Group 1", "John"),
  ("Group 2", "Sam"),
  ("Group 1", "Mary"),
  ("Group 3", "Pam")
).toDF("key", "value")
val keyWindow = Window.partitionBy("key")
val result = df
.withColumn(
"values",
collect_list(col("value")).over(keyWindow)
)
.select("key", "values")
.distinct
result.show()
// here is the result:
+-------+------------+
| key| values|
+-------+------------+
|Group 1|[John, Mary]|
|Group 2| [Sam]|
|Group 3| [Pam]|
+-------+------------+
You can then easily convert this DataFrame to an RDD.
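For example, a minimal sketch of that conversion back to the requested RDD[(String, Iterable[String])], assuming the key/values columns produced above:
import org.apache.spark.rdd.RDD

val rdd: RDD[(String, Iterable[String])] =
  result.rdd.map(row => (row.getString(0), row.getSeq[String](1): Iterable[String]))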
Update:
If you want to use plain Scala, assuming the data is given as follows:
val data = Seq(
("Group 1", "John"),
("Group 2", "Sam"),
("Group 1", "Mary"),
("Group 3", "Pam")
)
You can foldLeft over the data with a state of Map[String, List[String]] (a mapping from each key/group to its values), updating that Map on each iteration (note that updatedWith requires Scala 2.13+):
val result = data.foldLeft(Map.empty[String, List[String]]) {
case (acc, (key, value)) =>
acc.updatedWith(key) {
case Some(values) => Some(value :: values)
case None => Some(value :: Nil)
}
}
Just to make things clearer: you can use .collect on a DataFrame to collect its rows as an Array, and to make an RDD from the resulting Map[String, List[String]] you can use spark.sparkContext.parallelize(result.toSeq).
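A minimal sketch of that last step, widening the List values so the RDD has the element type asked for in the question (the pairs name is just illustrative):
// Widen List[String] to Iterable[String] so the RDD element type matches the question.
val pairs: Seq[(String, Iterable[String])] =
  result.toSeq.map { case (k, vs) => (k, vs: Iterable[String]) }
val rdd = spark.sparkContext.parallelize(pairs)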
I am new to Scala and Spark.
There are 2 RDDs like:
RDD_A = (keyA,5),(keyB,10)
RDD_B = (keyA,3),(keyB,7)
How do I calculate RDD_A - RDD_B so that I get (keyA,2),(keyB,3)?
I tried subtract and subtractByKey but I am unable to get output like the above.
Let's assume that each RDD has only one value per key:
val df =
Seq(
("A", 5),
("B", 10)
).toDF("key", "value")
val df2 =
Seq(
("A", 3),
("B", 7)
).toDF("key", "value")
You can merge these DataFrames using union and perform the computation via groupBy as follows:
import org.apache.spark.sql.functions._
df.union(df2)
.groupBy("key")
.agg(first("value").minus(last("value")).as("value"))
.show()
will print:
+---+-----+
|key|value|
+---+-----+
| B| 3|
| A| 2|
+---+-----+
An RDD solution for the question; please see the inline code comments for the explanation:
import org.apache.spark.sql.SparkSession

object SubtractRDD {

  def main(args: Array[String]): Unit = {

    val spark = SparkSession.builder().master("local[*]").getOrCreate() // Create the Spark session

    val list1 = List(("keyA", 5), ("keyB", 10))
    val list2 = List(("keyA", 3), ("keyB", 7))

    val rdd1 = spark.sparkContext.parallelize(list1) // Convert the lists to RDDs
    val rdd2 = spark.sparkContext.parallelize(list2)

    val result = rdd1.join(rdd2)             // Inner join the RDDs by key
      .map(x => (x._1, x._2._1 - x._2._2))   // Subtract the right value from the left
      .collectAsMap()                        // Collect the result as a Map

    println(result)
  }
}
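If you'd rather avoid the join, a rough alternative sketch using only pair-RDD operations, reusing rdd1 and rdd2 from the snippet above and assuming every key appears in both RDDs (as in the sample data):
// Negate the values of the second RDD, union the two, and sum per key.
val diff = rdd1.union(rdd2.mapValues(v => -v)).reduceByKey(_ + _)
diff.collect().foreach(println) // (keyA,2), (keyB,3)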
I have a dataframe with an array column like:
val df = Seq(
Array("abc", "abc", "null", "null"),
Array("bcd", "bc", "bcd", "null"),
Array("ijk", "abc", "bcd", "ijk")).toDF("col")
It looks like:
col:
["abc","abc","null","null"]
["bcd","bc","bcd","null"]
["ijk","abc","bcd","ijk"]
I am trying to get the duplicate values of each array in Scala:
col_1:
['abc']
['bcd']
['ijk']
I tried to get the duplicate values for a plain list, but I have no clue how this can be done with an array column:
val list = List("bcd", "bc", "bcd", "null")
list.groupBy(identity).collect { case (x, List(_, _, _*)) => x }
df.withColumn("id", monotonically_increasing_id())
.withColumn("col", explode(col("col")))
.groupBy("id", "col")
.count()
.filter(col("count") > 1 /*&& col("col") =!= "null"*/)
.select("col")
.show()
You can simply use a custom UDF:
def findDuplicate = udf((in: Seq[String]) =>
in.groupBy(identity)
.filter(_._2.length > 1)
.keys
.toArray
)
df.withColumn("col_1", explode(findDuplicate($"col")))
.show()
If you are willing to skip the "null" values (as in your example), just add .filterNot(_ == "null") before .groupBy, as shown below.
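For illustration, a sketch of the same UDF with that filter applied (the findDuplicateNonNull name is just illustrative):
def findDuplicateNonNull = udf((in: Seq[String]) =>
  in.filterNot(_ == "null")   // skip the literal "null" strings
    .groupBy(identity)        // group equal values together
    .filter(_._2.length > 1)  // keep only values that occur more than once
    .keys
    .toArray
)

df.withColumn("col_1", explode(findDuplicateNonNull($"col")))
  .show()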
The duplicate values of an array column can be obtained by assigning a monotonically increasing id to each array, exploding the array, grouping by id and col, and then using a window partitioned by id.
import org.apache.spark.sql.functions.max
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.explode
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.functions.monotonically_increasing_id
import spark.implicits._
val df = spark.sparkContext.parallelize(Seq(
Array("abc", "abc", null, null),
Array("bcd", "bc", "bcd", null),
Array("ijk", "abc", "bcd", "ijk"))).toDF("col")
df.show(10)
val idfDF = df.withColumn("id", monotonically_increasing_id)
val explodeDF = idfDF.select(col("id"), explode(col("col")))
val countDF = explodeDF.groupBy("id", "col").count()
// Define a window partitioned by id
val byId = Window.partitionBy("id")
val maxDF = countDF.withColumn("max", max("count") over byId)
val finalDf = maxDF.where("max == count").where("col is not null").select("col")
finalDf.show(10)
+---+
|col|
+---+
|abc|
|ijk|
|bcd|
+---+
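If you need the duplicates back as one array per original row (the col_1 column from the question), a possible follow-up sketch that groups by the generated id again (the perRow name is just illustrative):
import org.apache.spark.sql.functions.collect_list

val perRow = maxDF
  .where("max == count")
  .where("col is not null")
  .groupBy("id")
  .agg(collect_list("col").as("col_1"))

perRow.show(10)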
I have a file with 20+ columns of which I would like to extract a few. Until now, I have the following code. I'm sure there is a smarter way to do it, but I'm not able to get it working successfully. Any ideas?
mvnmdata is of type RDD[String]
val strpcols = mvnmdata.map(x => x.split('|')).map(x => (x(0),x(1),x(5),x(6),x(7),x(8),x(9),x(10),x(11),x(12),x(13),x(14),x(15),x(16),x(17),x(18),x(19),x(20),x(21),x(22),x(23)))
The following solution provides an easy and scalable way to manage your column names and indices. It is based on a map that defines the column name/index relation; the map lets us handle both the index of each extracted column and its name.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructType, StructField}
val rdd = spark.sparkContext.parallelize(Seq(
"1|500|400|300",
"1|34|67|89",
"2|10|20|56",
"3|2|5|56",
"3|1|8|22"))
val dictColums = Map("c0" -> 0, "c2" -> 2)
// create schema from map keys
val schema = StructType(dictColums.keys.toSeq.map(StructField(_, StringType, true)))
val mappedRDD = rdd.map(line => line.split('|'))
  .map(cols => Row.fromSeq(dictColums.values.toSeq.map(cols(_))))

val df = spark.createDataFrame(mappedRDD, schema)
df.show()
//output
+---+---+
| c0| c2|
+---+---+
| 1|400|
| 1| 67|
| 2| 20|
| 3| 5|
| 3| 8|
+---+---+
First we declare dictColums; in this example we will extract the columns "c0" -> 0 and "c2" -> 2.
Next we create the schema from the keys of the map.
The first map (which you already have) splits each line by |; the second one creates a Row containing the values that correspond to the items of dictColums.values.
UPDATE:
You could also create a function from the above functionality in order to be able to reuse it multiple times:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame

def stringRddToDataFrame(colsMapping: Map[String, Int], rdd: RDD[String]): DataFrame = {
  val schema = StructType(colsMapping.keys.toSeq.map(StructField(_, StringType, true)))
  val mappedRDD = rdd.map(line => line.split('|'))
    .map(cols => Row.fromSeq(colsMapping.values.toSeq.map(cols(_))))
  spark.createDataFrame(mappedRDD, schema)
}
And then use it for your case:
val cols = Map("c0" -> 0, "c1" -> 1, "c5" -> 5, ... "c23" -> 23)
val df = stringRddToDataFrame(cols, rdd)
If you don't want to write the repeated x(i) by hand, you can process the indices in a loop, as below. Example 1:
import scala.collection.mutable.ArrayBuffer

val strpcols = mvnmdata.map(x => x.split('|'))
  .map { x =>
    val xbuffer = new ArrayBuffer[String]()
    for (i <- Array(0,1,5,6...)) {
      xbuffer.append(x(i))
    }
    xbuffer
  }
If you only want to define the index list with a start & end and the numbers to be excluded, see Example 2 below:
scala> (1 to 10).toSet
res8: scala.collection.immutable.Set[Int] = Set(5, 10, 1, 6, 9, 2, 7, 3, 8, 4)
scala> ((1 to 10).toSet -- Set(2,9)).toArray.sortBy(row=>row)
res9: Array[Int] = Array(1, 3, 4, 5, 6, 7, 8, 10)
The final code you want:
//define the function to process indexes
def getSpecIndexes(start:Int, end:Int, removedValueSet:Set[Int]):Array[Int] = {
((start to end).toSet -- removedValueSet).toArray.sortBy(row=>row)
}
val strpcols = mvnmdata.map(x => x.split('|'))
.map(x =>{
val xbuffer = new ArrayBuffer[String]()
//call the function
for (i <- getSpecIndexes(0,100,Set(3,4,5,6))){
xbuffer.append(x(i))
}
xbuffer
})
As a little bit of context, what I'm trying to achieve here: given multiple rows grouped by a certain set of keys, after that first reduce I would like to group them into a single row by, for example, date, with each of the previously calculated counters as its own column. This may not seem clear from just reading it, so here is a (quite simple) example of what should happen.
(("Volvo", "T4", "2019-05-01"), 5)
(("Volvo", "T5", "2019-05-01"), 7)
(("Audi", "RS6", "2019-05-01"), 4)
And once those rows are merged...
date , volvo_counter , audi_counter
"2019-05-01" , 12 , 4
I reckon this is quite a corner case and that there may be different approaches, but I was wondering whether there is any solution within the same RDD, so there's no need for multiple RDDs divided by counter.
What you want to do is a pivot. You talk about RDDs, so I assume your question is: "how do I do a pivot with the RDD API?". As far as I know there is no built-in function in the RDD API that does it. You could do it yourself like this:
// let's create sample data
val rdd = sc.parallelize(Seq(
(("Volvo", "T4", "2019-05-01"), 5),
(("Volvo", "T5", "2019-05-01"), 7),
(("Audi", "RS6", "2019-05-01"), 4)
))
// If the keys are not known in advance, we compute their distinct values
val values = rdd.map(_._1._1).distinct.collect.toSeq
// values: Seq[String] = WrappedArray(Volvo, Audi)
// Finally we make the pivot and use reduceByKey on the sequence
val res = rdd
.map{ case ((make, model, date), counter) =>
date -> values.map(v => if(make == v) counter else 0)
}
.reduceByKey((a, b) => a.indices.map(i => a(i) + b(i)))
// which gives you this
res.collect.head
// (String, Seq[Int]) = (2019-05-01,Vector(12, 4))
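If you also want to know which count belongs to which make, a small optional sketch that zips the counters back with the distinct values collected earlier (the named value is just illustrative):
// values was collected above (e.g. Seq(Volvo, Audi)), in the same order used for the counters.
val named = res.mapValues(counts => values.zip(counts).toMap)
named.collect.head
// e.g. (2019-05-01, Map(Volvo -> 12, Audi -> 4))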
Note that you can write much simpler code with the SparkSQL API:
// let's first transform the previously created RDD to a dataframe:
val df = rdd.map{ case ((a, b, c), d) => (a, b, c, d) }
.toDF("make", "model", "date", "counter")
// And then it's as simple as that:
import org.apache.spark.sql.functions.sum

df.groupBy("date")
  .pivot("make")
  .agg(sum("counter"))
  .show
+----------+----+-----+
| date|Audi|Volvo|
+----------+----+-----+
|2019-05-01| 4| 12|
+----------+----+-----+
I think it's easier to do with DataFrame:
// Assumed shape of the key and record, based on the fields used below:
case class Key(model: String, date: String)
case class Record(key: Key, value: Int)

val data = Seq(
  Record(Key("Volvo", "2019-05-01"), 5),
  Record(Key("Volvo", "2019-05-01"), 7),
  Record(Key("Audi", "2019-05-01"), 4)
)
import org.apache.spark.sql.functions.{sum, when}
import spark.implicits._

val rdd = spark.sparkContext.parallelize(data)
val df = rdd.toDF()
val modelsExpr = df
.select("key.model").as("model")
.distinct()
.collect()
.map(r => r.getAs[String]("model"))
.map(m => sum(when($"key.model" === m, $"value").otherwise(0)).as(s"${m}_counter"))
df
.groupBy("key.date")
.agg(modelsExpr.head, modelsExpr.tail: _*)
.show(false)
I have a list of around 20-25 columns from a conf file and have to aggregate the first non-null value per group. I wrote a function to pass the column list and the aggregation expressions read from the conf file.
I was able to use the first function but couldn't find how to call first with ignoreNulls set to true.
The code that I tried is:
def groupAndAggregate(df: DataFrame, cols: List[String] , aggregateFun: Map[String, String]): DataFrame = {
df.groupBy(cols.head, cols.tail: _*).agg(aggregateFun)
}
val df = sc.parallelize(Seq(
(0, null, "1"),
(1, "2", "2"),
(0, "3", "3"),
(0, "4", "4"),
(1, "5", "5"),
(1, "6", "6"),
(1, "7", "7")
)).toDF("grp", "col1", "col2")
//first
groupAndAggregate(df, List("grp"), Map("col1"-> "first", "col2"-> "COUNT") ).show()
+---+-----------+-----------+
|grp|first(col1)|count(col2)|
+---+-----------+-----------+
| 1| 2| 4|
| 0| | 3|
+---+-----------+-----------+
I need to get 3 as a result in place of null.
I am using Spark 2.1.0 and Scala 2.11
Edit 1:
If I use first directly, as follows:
import org.apache.spark.sql.functions.{first,count}
df.groupBy("grp").agg(first(df("col1"), ignoreNulls = true), count("col2")).show()
I get my desired result. Can we pass ignoreNulls = true for the first function in the Map?
I have been able to achieve this by creating a list of Columns and passing it to the agg function of groupBy. The earlier approach had an issue where I was not able to name the columns, as the agg function did not return the columns in a known order in the output DataFrame, so I rename the columns in the list itself.
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions._
import scala.collection.JavaConverters._
import scala.collection.mutable.ListBuffer

def groupAndAggregate(df: DataFrame): DataFrame = {
  val list: ListBuffer[Column] = new ListBuffer[Column]()
  try {
    val columnFound = getAggColumns(df) // function returning a java.util.Map[String, String]
    columnFound.entrySet().asScala.foreach { field =>
      list += first(df(columnFound.getOrDefault(field.getKey, "")), ignoreNulls = true).as(field.getKey)
    }
    list += sum(df("col1")).as("watch_time")
    list += count("*").as("frequency")
    val groupColumns = getGroupColumns(df) // function returning a List[String]
    df.groupBy(groupColumns.head, groupColumns.tail: _*).agg(list.head, list.tail: _*)
  } catch {
    case e: Exception =>
      e.printStackTrace()
      null
  }
}
I think you should use the na operator and drop all the nulls before you do the aggregation.
na: DataFrameNaFunctions Returns a DataFrameNaFunctions for working with missing data.
drop(cols: Array[String]): DataFrame Returns a new DataFrame that drops rows containing any null or NaN values in the specified columns.
The code would then look as follows:
df.na.drop(Array("col1")).groupBy(...).agg(first("col1"))
That will impact the count, so you'd have to do the count separately, as sketched below.
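For instance, a rough sketch of doing that: compute the count on the original DataFrame, compute first on the null-dropped one, and join the two back on grp (the column aliases here are just illustrative):
import org.apache.spark.sql.functions.{count, first}

val counts = df.groupBy("grp").agg(count("col2").as("count_col2"))
val firsts = df.na.drop(Array("col1")).groupBy("grp").agg(first("col1").as("first_col1"))

firsts.join(counts, "grp").show()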