I have here a toy data set for which I need to compute the list of cities in each state and the population of that state (the sum of the populations of all the cities in that state): Data
I want to do it using RDDs without using groupByKey and joins. My approach so far:
In this approach I used 2 separate key-value pairs and joined them.
val rdd1 = inputRdd.map(x => (x._1, x._3.toInt))   // (state, population)
val rdd2 = inputRdd.map(x => (x._1, List(x._2)))   // (state, List(city))
val popn_sum = rdd1.reduceByKey(_ + _)             // total population per state
val list_cities = rdd2.reduceByKey(_ ++ _)         // list of cities per state
popn_sum.join(list_cities).collect()
Is it possible to get the same output with just 1 key-value pair and without any joins?
I have created a new key-value pair, but I do not know how to proceed to get the same output using aggregateByKey or reduceByKey with this RDD:
val rdd3 = inputRdd.map(x => (x._1, (List(x._2), x._3)))
I am new to Spark and want to learn the best way to get this output:
Array((B,(12,List(B1, B2))), (A,(6,List(A1, A2, A3))), (C,(8,List(C1, C2))))
Thanks in advance
If your inputRdd is of type
inputRdd: org.apache.spark.rdd.RDD[(String, String, Int)]
then you can achieve your desired result by simply using one reduceByKey:
inputRdd.map(x => (x._1, (List(x._2), x._3.toInt))).reduceByKey((x, y) => (x._1 ++ y._1, x._2+y._2))
and you can do it with aggregateByKey as
inputRdd.map(x => (x._1, (List(x._2), x._3.toInt))).aggregateByKey((List.empty[String], 0))((x, y) => (x._1 ++ y._1, x._2+y._2), (x, y) => (x._1 ++ y._1, x._2+y._2))
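Note that both versions produce values of the form (cityList, populationSum), whereas the Array in the question has the sum first. Here is a small sketch (the toy data is taken from the table below; names are illustrative) that builds such an inputRdd, swaps the tuple, and collects:
val inputRdd = spark.sparkContext.parallelize(Seq(
  ("A", "A1", 1), ("B", "B1", 2), ("C", "C1", 3),
  ("A", "A2", 2), ("A", "A3", 3), ("B", "B2", 10), ("C", "C2", 5)
))

inputRdd
  .map(x => (x._1, (List(x._2), x._3.toInt)))
  .reduceByKey((x, y) => (x._1 ++ y._1, x._2 + y._2))
  .map { case (state, (cities, population)) => (state, (population, cities)) }
  .collect()
// expected to resemble: Array((B,(12,List(B1, B2))), (A,(6,List(A1, A2, A3))), (C,(8,List(C1, C2))))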
DataFrame way
An even better approach would be to use the DataFrame way. You can convert your rdd to a dataframe simply by applying .toDF("state", "city", "population"), which should give you
+-----+----+----------+
|state|city|population|
+-----+----+----------+
|A |A1 |1 |
|B |B1 |2 |
|C |C1 |3 |
|A |A2 |2 |
|A |A3 |3 |
|B |B2 |10 |
|C |C2 |5 |
+-----+----+----------+
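A minimal sketch of that conversion, assuming inputRdd holds the (state, city, population) triples shown above and a SparkSession named spark is in scope:
import spark.implicits._

// Convert the RDD of triples into a DataFrame with named columns
val inputDf = inputRdd.toDF("state", "city", "population")
inputDf.show(false)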
After that you can just use groupBy and apply the collect_list and sum built-in aggregation functions as
import org.apache.spark.sql.functions._
inputDf.groupBy("state").agg(collect_list(col("city")).as("cityList"), sum("population").as("sumPopulation"))
which should give you
+-----+------------+-------------+
|state|cityList |sumPopulation|
+-----+------------+-------------+
|B |[B1, B2] |12 |
|C |[C1, C2] |8 |
|A |[A1, A2, A3]|6 |
+-----+------------+-------------+
The Dataset way is almost the same but comes with additional type safety.
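A rough sketch of the typed Dataset variant (the case class, its field types, and the val names are assumptions, not part of the original answer):
import spark.implicits._

case class CityRow(state: String, city: String, population: Int)

val inputDs = inputDf.as[CityRow]   // typed view over the same data

inputDs
  .groupByKey(_.state)
  .mapGroups { (state, rows) =>
    val buffered = rows.toList
    (state, buffered.map(_.city), buffered.map(_.population).sum)
  }
  .toDF("state", "cityList", "sumPopulation")
  .show(false)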
Related
Given a DF, let's say I have 3 classes each with a method addCol that will use the columns in the DF to create and append a new column to the DF (based on different calculations).
What is the best way to get a resulting df that will contain the original df A and the 3 added columns?
val df = Seq((1, 2), (2,5), (3, 7)).toDF("num1", "num2")
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

trait ColumnAdder { def addCol(df: DataFrame): DataFrame }   // illustrative names
class Method1 extends ColumnAdder {
  def addCol(df: DataFrame): DataFrame =
    df.withColumn("method1", col("num1") / col("num2"))
}
class Method2 extends ColumnAdder {
  def addCol(df: DataFrame): DataFrame =
    df.withColumn("method2", col("num1") * col("num2"))
}
class Method3 extends ColumnAdder {
  def addCol(df: DataFrame): DataFrame =
    df.withColumn("method3", col("num1") + col("num2"))
}
One option is actions.foldLeft(df) { (df, action) => action.addCol(df) }. The end result is the DF I want, with columns num1, num2, method1, method2, and method3. But from my understanding this will not make use of distributed evaluation, and each addCol will happen sequentially. What is the more efficient way to do this?
The efficient way to do this is to use select.
select is faster than foldLeft if you have very large data - check this post.
You can build the required expressions and use them inside select; check the code below.
scala> df.show(false)
+----+----+
|num1|num2|
+----+----+
|1 |2 |
|2 |5 |
|3 |7 |
+----+----+
scala> val colExpr = Seq(
$"num1",
$"num2",
($"num1"/$"num2").as("method1"),
($"num1" * $"num2").as("method2"),
($"num1" + $"num2").as("method3")
)
Final Output
scala> df.select(colExpr:_*).show(false)
+----+----+-------------------+-------+-------+
|num1|num2|method1 |method2|method3|
+----+----+-------------------+-------+-------+
|1 |2 |0.5 |2 |3 |
|2 |5 |0.4 |10 |7 |
|3 |7 |0.42857142857142855|21 |10 |
+----+----+-------------------+-------+-------+
Update
Return a Column instead of a DataFrame. Try using higher-order functions; all three of your functions can be replaced with the single function below.
scala> def add(
num1: Column, // maybe you could use variable args here if you want
num2:Column,
f: (Column,Column) => Column
): Column = f(num1,num2)
For example, with varargs; when invoking this version you need to pass the required columns at the end.
def add(f: (Column,Column) => Column,cols:Column*): Column = cols.reduce(f)
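For instance, a sketch (not from the original answer; the colExprVarargs name is illustrative) of invoking the varargs version with the same columns:
val colExprVarargs = Seq(
  $"num1",
  $"num2",
  add((_ / _), $"num1", $"num2").as("method1"),
  add((_ * _), $"num1", $"num2").as("method2"),
  add((_ + _), $"num1", $"num2").as("method3")
)

df.select(colExprVarargs: _*).show(false)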
Invoking the three-argument add function:
scala> val colExpr = Seq(
$"num1",
$"num2",
add($"num1",$"num2",(_ / _)).as("method1"),
add($"num1", $"num2",(_ * _)).as("method2"),
add($"num1", $"num2",(_ + _)).as("method3")
)
Final Output
scala> df.select(colExpr:_*).show(false)
+----+----+-------------------+-------+-------+
|num1|num2|method1 |method2|method3|
+----+----+-------------------+-------+-------+
|1 |2 |0.5 |2 |3 |
|2 |5 |0.4 |10 |7 |
|3 |7 |0.42857142857142855|21 |10 |
+----+----+-------------------+-------+-------+
I have the following dataframe where certain columns like version and dataSetName are supposedly constants. I am trying to get these constants into variables (version is of type float and dataSetName is a string).
+---+-------+-----------+
|id |version|dataSetName|
+---+-------+-----------+
|1  |1.0    |employee   |
|2  |1.0    |employee   |
|3  |1.0    |employee   |
|4  |1.0    |employee   |
+---+-------+-----------+
Using the following way gives me a Row:
val datSetName = df.select("dataSetName").distinct.collect()(0)
What's the best way to get dataSetName and version into String and Float variables respectively?
Check the code below.
version
df
.select("version")
.distinct.map(_.getAs[Double](0))
.collect
.head
dataSetName
df
.select("dataSetName")
.distinct
.map(_.getAs[String](0))
.collect
.head
version & dataSetName
df
.select("version","dataSetName")
.distinct
.map(c => (c.getAs[Double](0),c.getAs[String](1)))
.collect
.head
(Double, String) = (1.0,employee) // Output
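If you specifically want a Float and a String variable as asked, a small sketch building on the code above (the .toFloat conversion is an assumption about the desired types; spark.implicits._ is assumed to be in scope as in the snippets above):
val (version, dataSetName) =
  df.select("version", "dataSetName")
    .distinct
    .map(r => (r.getAs[Double](0).toFloat, r.getAs[String](1)))
    .collect
    .head
// version: Float = 1.0, dataSetName: String = employee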
The below code gives a count vector for each row in the DataFrame:
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}
val df = spark.createDataFrame(Seq(
(0, Array("a", "b", "c")),
(1, Array("a", "b", "b", "c", "a"))
)).toDF("id", "words")
// fit a CountVectorizerModel from the corpus
val cvModel: CountVectorizerModel = new CountVectorizer()
.setInputCol("words")
.setOutputCol("features")
.fit(df)
cvModel.transform(df).show(false)
The result is:
+---+---------------+-------------------------+
|id |words |features |
+---+---------------+-------------------------+
|0 |[a, b, c] |(3,[0,1,2],[1.0,1.0,1.0])|
|1 |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])|
+---+---------------+-------------------------+
How do I get the total counts of each word, like:
+---+------+------+
|id |words |counts|
+---+------+------+
|0 |a | 3 |
|1 |b | 3 |
|2 |c | 2 |
+---+------+------+
Shankar's answer only gives you the actual frequencies if the CountVectorizer model keeps every single word in the corpus (e.g. no minDF or vocabSize limitations). In those cases you can use Summarizer to directly sum each Vector. Note: this requires Spark 2.3+ for Summarizer.
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.stat.Summarizer.metrics
import spark.implicits._
// You need to select normL1 and another item (like mean) because, for some reason, Spark
// won't allow one Vector to be selected at a time (at least in 2.4)
val totalCounts = cvModel.transform(df)
.select(metrics("normL1", "mean").summary($"features").as("summary"))
.select("summary.normL1", "summary.mean")
.as[(Vector, Vector)]
.first()
._1
You'll then have to zip totalCounts with cvModel.vocabulary to get the words themselves.
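A sketch of that zip (the wordCounts name is illustrative):
val wordCounts = cvModel.vocabulary
  .zip(totalCounts.toArray)
  .toSeq
  .toDF("words", "counts")

wordCounts.show(false)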
You can simply explode and groupBy to get the count of each word:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

cvModel.transform(df).withColumn("words", explode($"words"))
.groupBy($"words")
.agg(count($"words").as("counts"))
.withColumn("id", row_number().over(Window.orderBy("words")) -1)
.show(false)
Output:
+-----+------+---+
|words|counts|id |
+-----+------+---+
|a    |3     |0  |
|b    |3     |1  |
|c    |2     |2  |
+-----+------+---+
I would like to write code that groups a line iterator, inputs: Iterator[InputRow], by unique items (by unit and eventName), i.e. eventTime should be the latest timestamp in the new Iterator[T] list, where InputRow is defined as
case class InputRow(unit:Int, eventName: String, eventTime:java.sql.Timestamp, value: Int)
Example data before grouping:
+-------------------+----+---------+-----+
|eventTime          |unit|eventName|value|
+-------------------+----+---------+-----+
|2018-06-02 16:05:11|2   |B        |1    |
|2018-06-02 16:05:12|1   |A        |2    |
|2018-06-02 16:05:13|2   |A        |2    |
|2018-06-02 16:05:14|1   |A        |3    |
|2018-06-02 16:05:15|2   |A        |3    |
+-------------------+----+---------+-----+
After:
+-------------------+----+---------+-----+
|eventTime          |unit|eventName|value|
+-------------------+----+---------+-----+
|2018-06-02 16:05:11|2   |B        |1    |
|2018-06-02 16:05:14|1   |A        |3    |
|2018-06-02 16:05:15|2   |A        |3    |
+-------------------+----+---------+-----+
What is a good approach to writing the above code in Scala?
Good news: your question already contains the verbs that correspond to the functional calls to be used in the code: group by, sort by (latest timestamp).
To sort InputRow by latest timestamp, we'll need an implicit ordering:
implicit val rowSortByTimestamp: Ordering[InputRow] =
(r1: InputRow, r2: InputRow) => r1.eventTime.compareTo(r2.eventTime)
// or shorter:
// implicit val rowSortByTimestamp: Ordering[InputRow] =
// _.eventTime compareTo _.eventTime
And now, having
val input: Iterator[InputRow] = // input data
Let's group them by (unit, eventName)
val result = input.toSeq.groupBy(row => (row.unit, row.eventName))
then extract the one with the latest timestamp
.map { case (gr, rows) => rows.sorted.last }
and sort from earliest to latest
.toSeq.sorted
The result is
InputRow(2,B,2018-06-02 16:05:11.0,1)
InputRow(1,A,2018-06-02 16:05:14.0,3)
InputRow(2,A,2018-06-02 16:05:15.0,3)
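Putting the chained steps together in one place (a sketch using the same names as above):
val result: Seq[InputRow] =
  input.toSeq
    .groupBy(row => (row.unit, row.eventName))
    .map { case (_, rows) => rows.sorted.last }   // keep the row with the latest timestamp
    .toSeq
    .sorted                                       // order the result from earliest to latest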
You can use the struct built-in function to combine the eventTime and value columns into a struct so that the max by eventTime (the latest) can be taken when grouping by unit and eventName and aggregating, which should give you your desired output
import org.apache.spark.sql.functions._
df.withColumn("struct", struct("eventTime", "value"))
.groupBy("unit", "eventName")
.agg(max("struct").as("struct"))
.select(col("struct.eventTime"), col("unit"), col("eventName"), col("struct.value"))
as
+-------------------+----+---------+-----+
|eventTime |unit|eventName|value|
+-------------------+----+---------+-----+
|2018-06-02 16:05:14|1 |A |3 |
|2018-06-02 16:05:11|2 |B |1 |
|2018-06-02 16:05:15|2 |A |3 |
+-------------------+----+---------+-----+
You can accomplish that with a foldLeft and a map:
val grouped: Map[(Int, String), InputRow] =
rows
.foldLeft(Map.empty[(Int, String), Seq[InputRow]])({ case (acc, row) =>
val key = (row.unit, row.eventName)
// Get from the accumulator the Seq that already exists or Nil if
// this key has never been seen before
val value = acc.getOrElse(key, Nil)
// Update the accumulator
acc + (key -> (value :+ row))
})
// Get the last element from the list of rows when grouped by unit and event.
.map({case (k, v) => k -> v.last})
This assumes that the eventTimes are already stored in sorted order. If this is not a safe assumption, you can define an implicit Ordering for java.sql.Timestamp and replace v.last with v.maxBy(_.eventTime).
Edit
Or use .groupBy(row => (row.unit, row.eventName)) instead of the foldLeft:
import java.sql.Timestamp

implicit val ordering: Ordering[Timestamp] = _ compareTo _
val grouped = rows.toSeq.groupBy(row => (row.unit, row.eventName))
.values
.map(_.maxBy(_.eventTime))
I am new to Scala programming. I have worked with R very extensively, but while working in Scala it has become tough to work in a loop to extract specific columns and perform computations on the column values.
Let me explain with the help of an example:
I have a final dataframe, arrived at after joining the 2 dataframes,
and now I need to perform a calculation like
the one above with reference to the columns, so that after the computation we'll get the spark dataframe below.
How do I refer to the column index in a for loop to compute the new column values in a spark dataframe in Scala?
Here is one solution:
Input Data:
+---+---+---+---+---+---+---+---+---+
|a1 |b1 |c1 |d1 |e1 |a2 |b2 |c2 |d2 |
+---+---+---+---+---+---+---+---+---+
|24 |74 |74 |21 |66 |65 |100|27 |19 |
+---+---+---+---+---+---+---+---+---+
Zip the columns, which drops the non-matching ones (e1 has no pair):
val oneCols = data.schema.filter(_.name.contains("1")).map(x => x.name).sorted
val twoCols = data.schema.filter(_.name.contains("2")).map(x => x.name).sorted
val cols = oneCols.zip(twoCols)
//cols: Seq[(String, String)] = List((a1,a2), (b1,b2), (c1,c2), (d1,d2))
Use the foldLeft function to dynamically add the columns:
import org.apache.spark.sql.functions._
val result = cols.foldLeft(data)((data, c) =>
  data.withColumn(s"Diff_${c._1}", (col(c._2) - col(c._1)) / col(c._2)))
Here is the result:
result.show(false)
+---+---+---+---+---+---+---+---+---+------------------+-------+-------------------+--------------------+
|a1 |b1 |c1 |d1 |e1 |a2 |b2 |c2 |d2 |Diff_a1 |Diff_b1|Diff_c1 |Diff_d1 |
+---+---+---+---+---+---+---+---+---+------------------+-------+-------------------+--------------------+
|24 |74 |74 |21 |66 |65 |100|27 |19 |0.6307692307692307|0.26 |-1.7407407407407407|-0.10526315789473684|
+---+---+---+---+---+---+---+---+---+------------------+-------+-------------------+--------------------+
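As a side note, echoing the foldLeft question earlier on this page, the same diff columns could also be built with a single select instead of repeated withColumn calls; a sketch (not part of the original answer; the diffCols and resultViaSelect names are illustrative):
val diffCols = cols.map { case (one, two) =>
  ((col(two) - col(one)) / col(two)).as(s"Diff_$one")
}

val resultViaSelect = data.select(data.columns.map(col) ++ diffCols: _*)
resultViaSelect.show(false)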