Make RDD from List in Scala & Spark

Original data:
ID, NAME, SEQ, NUMBER
A, John, 1, 3
A, Bob, 2, 5
A, Sam, 3, 1
B, Kim, 1, 4
B, John, 2, 3
B, Ria, 3, 5
To make an ID-grouped list, I did the following:
val MapRDD = originDF.map { x => (x.getAs[String](colMap.ID), List(x)) }
val ListRDD = MapRDD.reduceByKey { (a: List[Row], b: List[Row]) => List(a, b).flatten }
My goal is to make the following RDD (the purpose is to find, for each row, the NAME at the previous SEQ and the NUMBER difference within each ID group):
ID, NAME, SEQ, NUMBER, PRE_NAME, DIFF
A, John, 1, 3, NULL, NULL
A, Bob, 2, 5, John, 2
A, Sam, 3, 1, Bob, -4
B, Kim, 1, 4, NULL, NULL
B, John, 2, 3, Kim, -1
B, Ria, 3, 5, John, 2
Currently ListRDD looks like:
A, ([A,John,1,3], [A,Bob,2,5], ..)
B, ([B,Kim,1,4], [B,John,2,3], ..)
This is the code I tried to build my goal RDD from ListRDD (not working as I want):
def myFunction(ListRDD: RDD[(String, List[Row])]) = {
  var rows: List[Row] = Nil
  ListRDD.foreach { row =>
    rows ::: make(row._2)   // the ::: result is discarded, and this runs on the executors
  }
  // rows has nothing and it's not an RDD
}

def make(eachList: List[Row]): List[Row] = {
  eachList.foreach { x =>
    // ... make PRE_NAME and DIFF in a new List
  }
  // ...
}
My final goal is to save this RDD as CSV (RDD.saveAsTextFile...). How can I make this RDD (not a List) from this data?

Window functions look like a good fit here:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag
val df = sc.parallelize(Seq(
  ("A", "John", 1, 3),
  ("A", "Bob", 2, 5),
  ("A", "Sam", 3, 1),
  ("B", "Kim", 1, 4),
  ("B", "John", 2, 3),
  ("B", "Ria", 3, 5))).toDF("ID", "NAME", "SEQ", "NUMBER")

val w = Window.partitionBy($"ID").orderBy($"SEQ")

df.select($"*",
  lag($"NAME", 1).over(w).alias("PREV_NAME"),
  ($"NUMBER" - lag($"NUMBER", 1).over(w)).alias("DIFF"))

Related

Calculate date difference for a specific column ID Scala

I need to calculate a date difference for a column, considering a specific ID shown in a different column and the first date for that specific ID, using Scala.
I have the following dataset:
The column ID shows the specific ID previously mentioned, the column date shows the date of the event and the column rank shows the chronological positioning of the different event dates for each specific ID.
I need to calculate, for ID 1, the date difference of ranks 2 and 3 compared to rank 1 for that same ID; the same for ID 2 and so forth.
The expected result is the following:
Does somebody know how to do it?
Thanks!!!
Outside of using a library like Spark to reason about your data in SQL-esque terms, this can be accomplished with the Collections API: first find the minimum date for each ID, then compare each date in the original collection to it:
# import java.time.temporal.ChronoUnit.DAYS
import java.time.temporal.ChronoUnit.DAYS
# import java.time.LocalDate
import java.time.LocalDate
# case class Input(id : Int, date : LocalDate, rank : Int)
defined class Input
# case class Output(id : Int, date : LocalDate, rank : Int, diff : Long)
defined class Output
# val inData = Seq(Input(1, LocalDate.of(2020, 12, 10), 1),
    Input(1, LocalDate.of(2020, 12, 12), 2),
    Input(1, LocalDate.of(2020, 12, 16), 3),
    Input(2, LocalDate.of(2020, 12, 11), 1),
    Input(2, LocalDate.of(2020, 12, 13), 2),
    Input(2, LocalDate.of(2020, 12, 14), 3))
inData: Seq[Input] = List(
  Input(1, 2020-12-10, 1),
  Input(1, 2020-12-12, 2),
  Input(1, 2020-12-16, 3),
  Input(2, 2020-12-11, 1),
  Input(2, 2020-12-13, 2),
  Input(2, 2020-12-14, 3)
)
# val minDates = inData.groupMapReduce(_.id)(identity){ (a, b) =>
    a.date.isBefore(b.date) match {
      case true  => a
      case false => b
    }
  }
minDates: Map[Int, Input] = Map(1 -> Input(1, 2020-12-10, 1), 2 -> Input(2, 2020-12-11, 1))
# val outData = inData.map(a => Output(a.id, a.date, a.rank, DAYS.between(minDates(a.id).date, a.date)))
outData: Seq[Output] = List(
  Output(1, 2020-12-10, 1, 0L),
  Output(1, 2020-12-12, 2, 2L),
  Output(1, 2020-12-16, 3, 6L),
  Output(2, 2020-12-11, 1, 0L),
  Output(2, 2020-12-13, 2, 2L),
  Output(2, 2020-12-14, 3, 3L)
)
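If you are on a Scala version older than 2.13 (where groupMapReduce is not available), a rough equivalent (just a sketch under that assumption) uses groupBy and minBy:
// Pre-2.13 alternative: group by id, then keep the row with the earliest date.
// toEpochDay gives a Long key, so no Ordering[LocalDate] is required.
val minDatesCompat: Map[Int, Input] =
  inData.groupBy(_.id).map { case (id, rows) => id -> rows.minBy(_.date.toEpochDay) }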
You can get the required output by performing the steps below:
// Creating the sample data
import org.apache.spark.sql.types._
val sampledf = Seq((1,"2020-12-10",1),(1,"2020-12-12",2),(1,"2020-12-16",3),(2,"2020-12-08",1),(2,"2020-12-11",2),(2,"2020-12-13",3))
  .toDF("ID","Date","Rank").withColumn("Date", $"Date".cast("Date"))
// Adding a column that holds the Date only for the Rank = 1 rows (null otherwise)
import org.apache.spark.sql.functions._
val df1 = sampledf.withColumn("Basedate", when($"Rank" === 1, $"Date"))
// Grouping by ID and Basedate and keeping only the groups whose minimum Rank is 1 (the non-null base dates)
val groupedDF = df1.groupBy("ID","Basedate").min("Rank").filter($"min(Rank)" === 1)
// Joining the two dataframes and selecting the required columns
val joinedDF = df1.join(groupedDF.as("t"), Seq("ID"), "left").select("ID","Date","Rank","t.Basedate")
// Applying the built-in datediff function to get the required output
val finalDF = joinedDF.withColumn("DateDifference", datediff($"Date", $"Basedate"))
finalDF.show(false)
// If using Databricks you can use the display method
display(finalDF)
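As in the first question's answer above, a window function can also do this without the groupBy/join round trip. A sketch, assuming Spark 2.x, that takes the first Date per ID when ordered by Rank:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{datediff, first}

// first(Date) over the window gives the Rank-1 date for each ID
val w = Window.partitionBy("ID").orderBy("Rank")
val viaWindow = sampledf.withColumn("DateDifference", datediff($"Date", first($"Date").over(w)))
viaWindow.show(false)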

Values of a Dataframe Column into an Array in Scala Spark

Say, I have dataframe
val df1 = sc.parallelize(List(
("A1",45, "5", 1, 90),
("A2",60, "1", 1, 120),
("A3", 45, "9", 1, 450),
("A4", 26, "7", 1, 333)
)).toDF("CID","age", "children", "marketplace_id","value")
Now I want all the values of the column "children" in a separate array, in the same order.
The code below works for a smaller dataset with only one partition:
val list1 = df1.select("children").map(r => r(0).asInstanceOf[String]).collect()
output:
list1: Array[String] = Array(5, 1, 9, 7)
But the above code fails to preserve the order when the data has been repartitioned:
val partitioned = df1.repartition($"CID")
val list = partitioned.select("children").map(r => r(0).asInstanceOf[String]).collect()
output:
list: Array[String] = Array(9, 1, 7, 5)
Is there a way I can get all the values of a column into an array without changing the order?
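One way to keep the original order even after a repartition (a sketch, assuming the order you want is simply the row order of df1 before repartitioning) is to attach an index with zipWithIndex first and sort by it after collecting:
import org.apache.spark.sql.Row

// Pair each row with its original position before repartitioning
val indexed = df1.select("children").rdd.zipWithIndex()   // RDD[(Row, Long)]
val repartitioned = indexed.repartition(4)                // row order across partitions is no longer stable

// Collect, restore the original order via the index, then drop the index
val list = repartitioned.collect().sortBy(_._2).map { case (Row(c: String), _) => c }
// list: Array[String] = Array(5, 1, 9, 7)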

how to filter few rows in a table using Scala

Using Scala:
I have an emp table as below:
id, name, dept, address
1, a, 10, hyd
2, b, 10, blr
3, a, 5, chn
4, d, 2, hyd
5, a, 3, blr
6, b, 2, hyd
Code:
val inputFile = sc.textFile("hdfs:/user/edu/emp.txt");
val inputRdd = inputFile.map(iLine => (iLine.split(",")(0),
iLine.split(",")(1),
iLine.split(",")(3)
));
// selecting only a few columns; now I want to pull the complete data of the hyd-addressed employees
Problem: I don't want to print all emp details; I want to print only the details of the employees who are from hyd.
I have loaded this emp dataset into an RDD.
I have split this RDD on ','.
Now I want to print only the hyd-addressed employees.
I think the solution below will help solve your problem.
val fileName = "/path/stact_test.txt"
val strRdd = sc.textFile(fileName).map { line =>
val data = line.split(",")
(data(0), data(1), data(3))
}.filter(rec=>rec._3.toLowerCase.trim.equals("hyd"))
After splitting the data, filter on the location using the third item of each tuple in the RDD.
Output:
(1, a, hyd)
(4, d, hyd)
(6, b, hyd)
You may try to use a DataFrame instead:
import org.apache.spark.sql.functions.{split, trim}

val viewsDF = spark.read.text("hdfs:/user/edu/emp.txt")
val splitedViewsDF = viewsDF.withColumn("id", split($"value", ",").getItem(0))
  .withColumn("name", split($"value", ",").getItem(1))
  .withColumn("address", split($"value", ",").getItem(3))
  .drop($"value")
  .filter(trim($"address") === "hyd")   // trim handles the space after each comma in the sample file
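A quick sanity check (the expected ids follow from the sample data above):
splitedViewsDF.show(false)   // should keep only the hyd rows: ids 1, 4 and 6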

Spark Scala: Aggregate DataFrame Column Values into a Ordered List

I have a Spark Scala DataFrame that has four columns: (id, day, val, order). I want to create a new DataFrame with columns (id, day, value_list: List(val1, val2, ..., valn)) where val1 through valn are ordered by ascending order value.
For instance:
(50, 113, 1, 1),
(50, 113, 1, 3),
(50, 113, 2, 2),
(51, 114, 1, 2),
(51, 114, 2, 1),
(51, 113, 1, 1)
would become:
((51,113),List(1))
((51,114),List(2, 1))
((50,113),List(1, 2, 1))
I'm close, but I don't know what to do after I've aggregated the data into a list. I'm not sure how to then have Spark order each value list by the order int:
import org.apache.spark.sql.Row
val testList = List((50, 113, 1, 1), (50, 113, 1, 3), (50, 113, 2, 2), (51, 114, 1, 2), (51, 114, 2, 1), (51, 113, 1, 1))
val testDF = sqlContext.sparkContext.parallelize(testList).toDF("id1", "id2", "val", "order")
val rDD1 = testDF.map{case Row(key1: Int, key2: Int, val1: Int, val2: Int) => ((key1, key2), List((val1, val2)))}
val rDD2 = rDD1.reduceByKey{case (x, y) => x ++ y}
where the output looks like:
((51,113),List((1,1)))
((51,114),List((1,2), (2,1)))
((50,113),List((1,3), (1,1), (2,2)))
The next step would be to produce:
((51,113),List((1,1)))
((51,114),List((2,1), (1,2)))
((50,113),List((1,1), (2,2), (1,3)))
You will just need to map over your RDD and use sortBy:
scala> val df = Seq((50, 113, 1, 1), (50, 113, 1, 3), (50, 113, 2, 2), (51, 114, 1, 2), (51, 114, 2, 1), (51, 113, 1, 1)).toDF("id1", "id2", "val", "order")
df: org.apache.spark.sql.DataFrame = [id1: int, id2: int, val: int, order: int]
scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row
scala> val rDD1 = df.map{case Row(key1: Int, key2: Int, val1: Int, val2: Int) => ((key1, key2), List((val1, val2)))}
rDD1: org.apache.spark.rdd.RDD[((Int, Int), List[(Int, Int)])] = MapPartitionsRDD[10] at map at <console>:28
scala> val rDD2 = rDD1.reduceByKey{case (x, y) => x ++ y}
rDD2: org.apache.spark.rdd.RDD[((Int, Int), List[(Int, Int)])] = ShuffledRDD[11] at reduceByKey at <console>:30
scala> val rDD3 = rDD2.map(x => (x._1, x._2.sortBy(_._2)))
rDD3: org.apache.spark.rdd.RDD[((Int, Int), List[(Int, Int)])] = MapPartitionsRDD[12] at map at <console>:32
scala> rDD3.collect.foreach(println)
((51,113),List((1,1)))
((50,113),List((1,1), (2,2), (1,3)))
((51,114),List((2,1), (1,2)))
testDF.groupBy("id1","id2").agg(collect_list($"val")).show
+---+---+-----------------+
|id1|id2|collect_list(val)|
+---+---+-----------------+
| 51|113|              [1]|
| 51|114|           [1, 2]|
| 50|113|        [1, 1, 2]|
+---+---+-----------------+
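Note that collect_list by itself does not guarantee any particular ordering of the collected values. If you want the DataFrame route to respect the order column as well, one option (a sketch, assuming Spark 2.x, where sort_array on an array of structs sorts by the first struct field) is to collect (order, val) pairs and sort them before extracting the values:
import org.apache.spark.sql.functions.{collect_list, sort_array, struct}

testDF.groupBy("id1", "id2")
  .agg(sort_array(collect_list(struct($"order", $"val"))).alias("pairs"))   // sort the pairs by order
  .select($"id1", $"id2", $"pairs.val".alias("value_list"))
  .show()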

How to make column pairs of map?

I have some columns like
age | company | country | gender |
----------------------------------
1 | 1 | 1 | 1 |
-----------------------------------
I want to create pairs like
(age,company)
(company,country)
(country,gender)
(company,gender)
(age,gender)
(age,country)
(age,company,country)
(company,country,gender)
(age,country,gender)
(age,company,gender)
(age,company,country,gender)
An idiomatic approach is to generate the power set with the Set collection method subsets:
implicit class groupCols[A](val cols: List[A]) extends AnyVal {
  def grouping() = cols.toSet.subsets.filter { _.size > 1 }.toList
}
Then
List("age","company","country","gender").grouping
delivers
List( Set(age, company),
Set(age, country),
Set(age, gender),
Set(company, country),
Set(company, gender),
Set(country, gender),
Set(age, company, country),
Set(age, company, gender),
Set(age, country, gender),
Set(company, country, gender),
Set(age, company, country, gender))
Note that the power set includes the empty set and a singleton set for each element of the original set; here we filter them out.
I doubt you can achieve this with tuples (and this topic confirms it).
But what you are looking for is called a power set.
Consider this piece of code:
object PowerSetTest extends App {
  val ls = List(1, 2, 3, 4)
  println(power(ls.toSet).filter(_.size > 1))

  def power[A](t: Set[A]): Set[Set[A]] = {
    @annotation.tailrec
    def pwr(t: Set[A], ps: Set[Set[A]]): Set[Set[A]] =
      if (t.isEmpty) ps
      else pwr(t.tail, ps ++ (ps map (_ + t.head)))

    pwr(t, Set(Set.empty[A]))
  }
}
Running this gives you:
Set(Set(1, 3), Set(1, 2), Set(2, 3), Set(1, 2, 3, 4), Set(3, 4), Set(2, 4), Set(1, 2, 4), Set(1, 4), Set(1, 2, 3), Set(2, 3, 4), Set(1, 3, 4))
You can read here for more information
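For completeness, a similar result can be produced with the standard library's combinations (a sketch; it keeps the original List type and declaration order instead of returning Sets):
val cols = List("age", "company", "country", "gender")

// All groupings of size 2 up to the full column list
val groupings: Seq[List[String]] = (2 to cols.length).flatMap(n => cols.combinations(n))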