How to achieve the below requirement using Spark RDD - Scala

I am really new to Spark. Could you please help with the requirement below?
I have the source file below. The first field is a name, the second field is a group id. I need to count how many groups each name belongs to, and list all the groups along with the count.
abc 1
abc 2
abc 3
xyz 1
xyz 3
def 2
def 4
lmn 6
I want to get output like the example below:
name dept count
abc 1,2,3 3
xyz 1,3 2
def 2,4 2
lmn 6 1
thanks in advance.

Assuming you have a CSV file, first create a DataFrame using the following steps.
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._   // collect_set, size and desc are used below

// parse each comma-separated line into a Row(name, depart)
val members = sc.textFile("member").map(line => line.split(",")).map(a => Row(a(0), a(1)))
val rddStruct = new StructType(Array(StructField("Name", StringType, nullable = true), StructField("Depart", StringType, nullable = true)))
val df = sqlContext.createDataFrame(members, rddStruct)
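If the file is actually space-delimited, as in the sample data in the question, only the parsing line changes (this is an assumption about your delimiter):
// split on one or more whitespace characters instead of a comma
val members = sc.textFile("member").map(line => line.split("\\s+")).map(a => Row(a(0), a(1)))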
To achieve the output, the following steps can be taken.
Apply a groupBy on the name and collect all the departments as a set:
val df2 = df.groupBy("Name").agg(collect_set("Depart").as("Depart"))
df2.show
+----+---------+
|Name|   Depart|
+----+---------+
| lmn|      [6]|
| def|   [2, 4]|
| abc|[1, 2, 3]|
| xyz|   [1, 3]|
+----+---------+
Then apply a size function on the Depart column to get the count.
val df3 = df2.withColumn("Count", size(df2("Depart")))
df3.show
+----+---------+-----+
|Name|   Depart|Count|
+----+---------+-----+
| lmn|      [6]|    1|
| def|   [2, 4]|    2|
| abc|[1, 2, 3]|    3|
| xyz|   [1, 3]|    2|
+----+---------+-----+
If the result should be sorted in descending order, you can apply an orderBy function on the above output.
val df4 = df3.orderBy(desc("Count"))
df4.show
+----+---------+-----+
|Name|   Depart|Count|
+----+---------+-----+
| abc|[1, 2, 3]|    3|
| def|   [2, 4]|    2|
| xyz|   [1, 3]|    2|
| lmn|      [6]|    1|
+----+---------+-----+
You can read more about StructType here.

You can do this more simply using an RDD transformation:
scala> val rdd = sc.textFile("/data_test1")
scala> rdd.map(x => x.split(" ")).
         map(x => (x(0), x(1))).
         groupByKey().
         map(x => (x._1, x._2.toSet.mkString(","), x._2.toSet.size)).
         toDF("name", "dept", "count").show()
Output:
+----+-----+-----+
|name| dept|count|
+----+-----+-----+
| abc|1,2,3|    3|
| lmn|    6|    1|
| def|  2,4|    2|
| xyz|  1,3|    2|
+----+-----+-----+
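If you prefer to avoid groupByKey (which shuffles every value for a key), here is a minimal sketch of the same aggregation using aggregateByKey, assuming the spark-shell implicits for toDF are in scope as in the snippet above:
val result = sc.textFile("/data_test1").
  map(_.split(" ")).
  map(x => (x(0), x(1))).
  aggregateByKey(Set.empty[String])(_ + _, _ ++ _).   // build the set of groups per name
  map { case (name, groups) => (name, groups.mkString(","), groups.size) }.
  toDF("name", "dept", "count")
result.show()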

Related

Spark GroupBy and Aggregate Strings to Produce a Map of Counts of the Strings Based on a Condition

I have a dataframe with multiple columns, two of which are id and label, as shown below.
+---+-----+
| id|label|
+---+-----+
|  1|"abc"|
|  1|"abc"|
|  1|"def"|
|  2|"def"|
|  2|"def"|
+---+-----+
I want to groupBy "id" and aggregate the label column into a map of label to count (ignoring nulls); the expected result is shown below:
+---+------------------+
| id|             label|
+---+------------------+
|  1|{"abc":2, "def":1}|
|  2|         {"def":2}|
+---+------------------+
Is it possible to do this without using user-defined aggregate functions? I saw a similar answer here, but it doesn't aggregate based on the count of each item.
I apologize if this question is silly, I am new to both Scala and Spark.
Thanks
Without Custom UDFs
import org.apache.spark.sql.functions.{map, collect_list}

df.groupBy("id", "label")
  .count
  .select($"id", map($"label", $"count").as("map"))
  .groupBy("id")
  .agg(collect_list("map"))
  .show(false)
+---+------------------------+
|id |collect_list(map)       |
+---+------------------------+
|1  |[[def -> 1], [abc -> 2]]|
|2  |[[def -> 2]]            |
+---+------------------------+
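The result above is an array of one-entry maps rather than a single map per id. If you are on Spark 2.4 or later (an assumption here), you can instead collect (label, count) structs and build one map per id with map_from_entries:
import org.apache.spark.sql.functions.{collect_list, map_from_entries, struct}

df.groupBy("id", "label")
  .count
  .groupBy("id")
  .agg(map_from_entries(collect_list(struct($"label", $"count"))).as("label_counts"))
  .show(false)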
Using a custom UDF:
import org.apache.spark.sql.functions.{udf, collect_list}

val customUdf = udf((seq: Seq[String]) => {
  seq.groupBy(x => x).map(x => x._1 -> x._2.size)
})

df.groupBy("id")
  .agg(collect_list("label").as("list"))
  .select($"id", customUdf($"list").as("map"))
  .show(false)
+---+--------------------+
|id |map                 |
+---+--------------------+
|1  |[abc -> 2, def -> 1]|
|2  |[def -> 2]          |
+---+--------------------+

minBy equivalent in Spark Dataframe [duplicate]

This question already has answers here:
How to select the first row of each group?
(9 answers)
Closed 4 years ago.
I'm looking for the equivalent of the minBy aggregate in a Spark DataFrame, or I may need to aggregate manually. Any thoughts? Thanks.
https://prestodb.io/docs/current/functions/aggregate.html#min_by
There is no direct function to get the 'min_by' values from a DataFrame.
It is a two-stage operation in Spark: first group by the key column, then apply the min function to get the minimum value of each numeric column per group.
scala> val inputDF = Seq(("a", 1),("b", 2), ("b", 3), ("a", 4), ("a", 5)).toDF("id", "count")
inputDF: org.apache.spark.sql.DataFrame = [id: string, count: int]
scala> inputDF.show()
+---+-----+
| id|count|
+---+-----+
|  a|    1|
|  b|    2|
|  b|    3|
|  a|    4|
|  a|    5|
+---+-----+
scala> inputDF.groupBy($"id").min("count").show()
+---+----------+
| id|min(count)|
+---+----------+
|  b|         2|
|  a|         1|
+---+----------+
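Newer Spark versions (3.0+) also added a built-in min_by SQL expression. On older versions, a common workaround for true min_by semantics (returning the value of one column at the row where another column is minimal) is to take the min of a struct whose first field is the ordering column. A sketch, using a hypothetical extra payload column "name" and assuming the spark-shell implicits are in scope:
import org.apache.spark.sql.functions.{min, struct}

val withName = Seq(("a", 1, "x"), ("b", 2, "y"), ("b", 3, "z"), ("a", 4, "p"), ("a", 5, "q"))
  .toDF("id", "count", "name")

// structs compare field by field, so min(struct(count, name)) picks the row with the smallest count
withName.groupBy($"id")
  .agg(min(struct($"count", $"name")).as("m"))
  .select($"id", $"m.count".as("min_count"), $"m.name".as("min_by_name"))
  .show()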

spark flatten records using a key column

I am trying to implement logic to flatten the records using the Spark/Scala API, using the map function.
Could you please help me with the easiest approach to solve this problem?
Assume that, for a given key, there are 3 process codes.
Input dataframe-->
Keycol|processcode
John |1
Mary |8
John |2
John |4
Mary |1
Mary |7
==============================
Output dataframe-->
Keycol|processcode1|processcode2|processcode3
john |1 |2 |4
Mary |8 |1 |7
Assuming the same number of rows per Keycol, one approach would be to aggregate processcode into an array for each Keycol and then expand it out into individual columns:
import org.apache.spark.sql.functions.{collect_list, size, col}

val df = Seq(
("John", 1),
("Mary", 8),
("John", 2),
("John", 4),
("Mary", 1),
("Mary", 7)
).toDF("Keycol", "processcode")
val df2 = df.groupBy("Keycol").agg(collect_list("processcode").as("processcode"))
val numCols = df2.select( size(col("processcode")) ).as[Int].first
val cols = (0 to numCols - 1).map( i => col("processcode")(i) )
df2.select(col("Keycol") +: cols: _*).show
+------+--------------+--------------+--------------+
|Keycol|processcode[0]|processcode[1]|processcode[2]|
+------+--------------+--------------+--------------+
|  Mary|             8|             1|             7|
|  John|             1|             2|             4|
+------+--------------+--------------+--------------+
A couple of alternative approaches.
SQL
df.createOrReplaceTempView("tbl")
val q = """
select keycol,
c[0] processcode1,
c[1] processcode2,
c[2] processcode3
from (select keycol, collect_list(processcode) c
from tbl
group by keycol) t0
"""
sql(q).show
Result
scala> sql(q).show
+------+------------+------------+------------+
|keycol|processcode1|processcode2|processcode3|
+------+------------+------------+------------+
|  Mary|           1|           7|           8|
|  John|           4|           1|           2|
+------+------------+------------+------------+
PairRDDFunctions (groupByKey) + mapPartitions
import org.apache.spark.sql.Row
val my_rdd = df.map{ case Row(a1: String, a2: Int) => (a1, a2)
}.rdd.groupByKey().map(t => (t._1, t._2.toList))
def f(iter: Iterator[(String, List[Int])]): Iterator[Row] = {
  var res = List[Row]()
  while (iter.hasNext) {
    val (keycol: String, c: List[Int]) = iter.next
    res = res ::: List(Row(keycol, c(0), c(1), c(2)))
  }
  res.iterator
}
import org.apache.spark.sql.types.{StringType, IntegerType, StructField, StructType}
val schema = new StructType().add(
StructField("Keycol", StringType, true)).add(
StructField("processcode1", IntegerType, true)).add(
StructField("processcode2", IntegerType, true)).add(
StructField("processcode3", IntegerType, true))
spark.createDataFrame(my_rdd.mapPartitions(f, true), schema).show
Result
scala> spark.createDataFrame(my_rdd.mapPartitions(f, true), schema).show
+------+------------+------------+------------+
|Keycol|processcode1|processcode2|processcode3|
+------+------------+------------+------------+
|  Mary|           1|           7|           8|
|  John|           4|           1|           2|
+------+------------+------------+------------+
Please keep in mind that in all cases the order of the values in the process code columns is non-deterministic unless explicitly specified.
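If a specific order is needed, one option (assuming ordering by the process code value itself is acceptable) is to sort the collected array before expanding it into columns. A sketch on top of the df defined above:
import org.apache.spark.sql.functions.{collect_list, sort_array, col}

val df2Sorted = df.groupBy("Keycol")
  .agg(sort_array(collect_list("processcode")).as("processcode"))

val orderedCols = (0 until 3).map(i => col("processcode")(i).as(s"processcode${i + 1}"))
df2Sorted.select(col("Keycol") +: orderedCols: _*).show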

How to aggregate a Spark data frame to get a sparse vector using Scala?

I have a data frame like the one below in Spark, and I want to group it by the id column and then for each line in the grouped data I need to create a sparse vector with elements from the weight column at indices specified by the index column. The length of the sparse vector is known, say 1000 for this example.
Dataframe df:
+-----+------+-----+
|   id|weight|index|
+-----+------+-----+
|11830|     1|    8|
|11113|     1|    3|
| 1081|     1|    3|
| 2654|     1|    3|
|10633|     1|    3|
|11830|     1|   28|
|11351|     1|   12|
| 2737|     1|   26|
|11113|     3|    2|
| 6590|     1|    2|
+-----+------+-----+
I have read this, which is sort of similar to what I want to do, but for an RDD. Does anyone know of a good way to do this for a data frame in Spark using Scala?
My attempt so far is to first collect the weights and indices as lists like this:
val dfWithLists = df
  .groupBy("id")
  .agg(collect_list("weight") as "weights", collect_list("index") as "indices")
which looks like:
+-----+---------+----------+
|   id|  weights|   indices|
+-----+---------+----------+
|11830|   [1, 1]|   [8, 28]|
|11113|   [1, 3]|    [3, 2]|
| 1081|      [1]|       [3]|
| 2654|      [1]|       [3]|
|10633|      [1]|       [3]|
|11351|      [1]|      [12]|
| 2737|      [1]|      [26]|
| 6590|      [1]|       [2]|
+-----+---------+----------+
Then I define a udf and do something like this:
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.udf
def toSparseVector: ((Array[Int], Array[BigInt]) => Vector) = {(a1, a2) => Vectors.sparse(1000, a1, a2.map(x => x.toDouble))}
val udfToSparseVector = udf(toSparseVector)
val dfWithSparseVector = dfWithLists.withColumn("SparseVector", udfToSparseVector($"indices", $"weights"))
but this doesn't seem to work, and it feels like there should be an easier way to do it without needing to collect the weights and indices into lists first.
I'm pretty new to Spark, Dataframes and Scala, so any help is highly appreciated.
You have to collect them, as vectors must be local to a single machine: https://spark.apache.org/docs/latest/mllib-data-types.html#local-vector
For creating the sparse vectors you have 2 options, using unordered (index, value) pairs or specifying the indices and values arrays:
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.Vectors$
If you can get the data into a different format (pivoted), you could also make use of the VectorAssembler:
https://spark.apache.org/docs/latest/ml-features.html#vectorassembler
With some small tweaks you can get your approach working:
:paste
// Entering paste mode (ctrl-D to finish)
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
val df = Seq((11830,1,8), (11113, 1, 3), (1081, 1,3), (2654, 1, 3), (10633, 1, 3), (11830, 1, 28), (11351, 1, 12), (2737, 1, 26), (11113, 3, 2), (6590, 1, 2)).toDF("id", "weight", "index")
val dfWithFeat = df
  .rdd
  .map(r => (r.getInt(0), (r.getInt(2), r.getInt(1).toDouble)))
  .groupByKey()
  .map(r => LabeledPoint(r._1, Vectors.sparse(1000, r._2.toSeq)))
  .toDS
dfWithFeat.printSchema
dfWithFeat.show(10, false)
// Exiting paste mode, now interpreting.
root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)

+-------+-----------------------+
|label  |features               |
+-------+-----------------------+
|11113.0|(1000,[2,3],[3.0,1.0]) |
|2737.0 |(1000,[26],[1.0])      |
|10633.0|(1000,[3],[1.0])       |
|1081.0 |(1000,[3],[1.0])       |
|6590.0 |(1000,[2],[1.0])       |
|11830.0|(1000,[8,28],[1.0,1.0])|
|2654.0 |(1000,[3],[1.0])       |
|11351.0|(1000,[12],[1.0])      |
+-------+-----------------------+
dfWithFeat: org.apache.spark.sql.Dataset[org.apache.spark.mllib.regression.LabeledPoint] = [label: double, features: vector]
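Alternatively, the UDF route from the question can be made to work on top of dfWithLists with a couple of type fixes. This is a sketch that assumes the weight and index columns are integers, as in the sample data, and that the two collect_list results stay aligned within each group:
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.udf

val toSparse = udf { (indices: Seq[Int], weights: Seq[Int]) =>
  // sparse vectors require indices in ascending order, so sort the (index, weight) pairs first
  val (idx, vals) = indices.zip(weights.map(_.toDouble)).sortBy(_._1).unzip
  Vectors.sparse(1000, idx.toArray, vals.toArray)
}

val dfWithSparseVector = dfWithLists.withColumn("SparseVector", toSparse($"indices", $"weights"))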

Spark : Aggregating based on a column

I have a file consisting of 3 fields (Emp_ids, Groups, Salaries)
100 A 430
101 A 500
201 B 300
I want to get the result as:
1) group name and count(*)
2) group name and max(salary)
val myfile = "/home/hduser/ScalaDemo/Salary.txt"
val conf = new SparkConf().setAppName("Salary").setMaster("local[2]")
val sc= new SparkContext( conf)
val sal= sc.textFile(myfile)
Scala DSL:
case class Data(empId: Int, group: String, salary: Int)

// parse the lines of the text RDD created above (sal) into Data records
val df = sqlContext.createDataFrame(sal.map { v =>
  val arr = v.split(' ').map(_.trim())
  Data(arr(0).toInt, arr(1), arr(2).toInt)
})
df.show()
+-----+-----+------+
|empId|group|salary|
+-----+-----+------+
|  100|    A|   430|
|  101|    A|   500|
|  201|    B|   300|
+-----+-----+------+
df.groupBy($"group").agg(count("*") as "count").show()
+-----+-----+
|group|count|
+-----+-----+
|    A|    2|
|    B|    1|
+-----+-----+
df.groupBy($"group").agg(max($"salary") as "maxSalary").show()
+-----+---------+
|group|maxSalary|
+-----+---------+
|    A|      500|
|    B|      300|
+-----+---------+
Or with plain SQL:
df.registerTempTable("salaries")
sqlContext.sql("select group, count(*) as count from salaries group by group").show()
+-----+-----+
|group|count|
+-----+-----+
|    A|    2|
|    B|    1|
+-----+-----+
sqlContext.sql("select group, max(salary) as maxSalary from salaries group by group").show()
+-----+---------+
|group|maxSalary|
+-----+---------+
|    A|      500|
|    B|      300|
+-----+---------+
While Spark SQL is the recommended way to do such aggregations for performance reasons, they can easily be done with the RDD API:
val rdd = sc.parallelize(Seq(Data(100, "A", 430), Data(101, "A", 500), Data(201, "B", 300)))
rdd.map(v => (v.group, 1)).reduceByKey(_ + _).collect()
res0: Array[(String, Int)] = Array((B,1), (A,2))
rdd.map(v => (v.group, v.salary)).reduceByKey((s1, s2) => if (s1 > s2) s1 else s2).collect()
res1: Array[(String, Int)] = Array((B,300), (A,500))
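Both aggregations can also be computed in a single pass over the RDD with aggregateByKey, for example (a sketch; output order may vary):
val stats = rdd.map(v => (v.group, v.salary))
  .aggregateByKey((0, Int.MinValue))(
    { case ((cnt, mx), salary) => (cnt + 1, math.max(mx, salary)) },   // fold a salary into (count, max)
    { case ((c1, m1), (c2, m2)) => (c1 + c2, math.max(m1, m2)) })      // merge partial results
  .collect()
// e.g. Array((B,(1,300)), (A,(2,500)))  -- (group, (count, maxSalary))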