Need to generate sequence ids without using groupby in Spark Scala

I want to generate a sequence-number column (Seq_No) that increments whenever Product_IDs changes in the table. My input table has only Product_IDs, and I want the output to include Seq_No. We cannot use GroupBy or row_number over a partition in SQL, as that is not supported in our setup.
Logic:
Seq_No = 1
for (i = 2 to No_of_Rows)
  when Product_IDs(i) != Product_IDs(i-1) then Seq_No(i) = Seq_No(i-1) + 1
  else Seq_No(i) = Seq_No(i-1)
end as Seq_No
Product_IDs  Seq_No
ID1          1
ID1          1
ID1          1
ID2          2
ID3          3
ID3          3
ID3          3
ID3          3
ID1          4
ID1          4
ID4          5
ID5          6
ID3          7
ID6          8
ID6          8
ID5          9
ID5          9
ID4          10
So I want Seq_No to increase whenever the current Product_ID differs from the previous one. The input table has only the single Product_IDs column, and I want to produce Product_IDs together with Seq_No using Spark Scala.

I would probably just write a function to generate the sequence numbers:
scala> val getSeqNum: String => Int = {
  var prevId = ""
  var n = 0
  (id: String) => {
    if (id != prevId) {
      n += 1
      prevId = id
    }
    n
  }
}
getSeqNum: String => Int = <function1>

scala> for { id <- Seq("foo", "foo", "bar") } yield getSeqNum(id)
res8: Seq[Int] = List(1, 1, 2)
UPDATE:
I'm not quite clear on what you want beyond that, Nikhil, and I am not a Spark expert, but I imagine you want something like
val rdd = ??? // Hopefully you know how to get the RDD
for {
  (id, col2, col3) <- rdd // assuming the entries are tuples
  seqNum = getSeqNum(id)
} yield ??? // Hopefully you know how to transform the entries
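One Spark-specific caveat (this sketch is mine, not part of the original answer): getSeqNum keeps mutable state, so the numbering is only well defined if the rows are processed in their original order inside a single task. Assuming the data is small enough for that to be acceptable, and assuming a SparkSession named spark and a DataFrame df holding the single Product_IDs column, something along these lines could work:

import spark.implicits._

// Collapse to one partition so the stateful closure sees the rows in order;
// only reasonable for data that comfortably fits in a single task.
val withSeq = df.rdd
  .coalesce(1)
  .map(row => (row.getString(0), getSeqNum(row.getString(0))))
  .toDF("Product_IDs", "Seq_No")

For anything large this single-partition trick defeats Spark's parallelism, so treat it as a workaround rather than a general solution.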

Related

How to count enum values in a dataframe in Scala

I'm new to Scala. I have a Spark dataframe like the one below:
userid | productid | enumXX
1      | 3         | 1
2      | 3         | 1
3      | 4         | 2
1      | 3         | 3
The enumXX values are 1, 2, 3; it's an enum type, defined as follows:
object enumXX extends Enumeration {
  type EnumXX = Value
  val apple = Value(1)
  val balana = Value(2)
  val orign = Value(3)
}
I would like to group by userid and productid and count how many apple, balana and orign values each group has. How should I do this in Scala?
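A minimal sketch of one way to do this, assuming the dataframe is named df, the enumXX column holds the integer ids 1, 2 and 3 shown above, and the renamed count columns are only illustrative:

import org.apache.spark.sql.functions.count

// One row per (userid, productid), one count column per enum id.
val counts = df
  .groupBy("userid", "productid")
  .pivot("enumXX", Seq(1, 2, 3))
  .agg(count("enumXX"))
  .withColumnRenamed("1", "apple")
  .withColumnRenamed("2", "balana")
  .withColumnRenamed("3", "orign")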

How to explode an array into multiple columns in Spark

I have a Spark dataframe that looks like:
id  DataArray
a   array(3, 2, 1)
b   array(4, 2, 1)
c   array(8, 6, 1)
d   array(8, 2, 4)
I want to transform this dataframe into:
id  col1  col2  col3
a   3     2     1
b   4     2     1
c   8     6     1
d   8     2     4
What function should I use?
Use apply:
import org.apache.spark.sql.functions.col

df.select(
  col("id") +: (0 until 3).map(i => col("DataArray")(i).alias(s"col${i + 1}")): _*
)
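Since Column's apply delegates to the same element extraction as getItem, an equivalent and slightly more explicit spelling (same assumed df) is:

// Equivalent selection using getItem explicitly.
df.select(
  col("id") +: (0 until 3).map(i => col("DataArray").getItem(i).alias(s"col${i + 1}")): _*
)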
You can use foldLeft to add each column from DataArray.
Make a list of the column names that you want to add:
val columns = List("col1", "col2", "col3")

columns.zipWithIndex
  .foldLeft(df) { (memodDF, column) =>
    memodDF.withColumn(column._1, col("DataArray")(column._2))
  }
  .drop("DataArray")
Hope this helps!

Spark dataframe explode pair of lists

My dataframe has 2 columns which look like this:
col_id       | col_name
-------------------------------
[id1, id2]   | [name1, name2]
-------------------------------
[id3, id4]   | [name3, name4]
....
So for each row there are two matching arrays of the same length in the id and name columns. What I want is to get each id/name pair as a separate row, like:
col_id | col_name
-----------------
id1    | name1
-----------------
id2    | name2
....
explode seems like the function to use but I can't seem to get it to work. What I tried is:
rdd.explode(col("col_id"), col("col_name")) ({
  case row: Row =>
    val ids: java.util.List[String] = row.getList(0)
    val names: java.util.List[String] = row.getList(1)
    var res: Array[(String, String)] = new Array[(String, String)](ids.size)
    for (i <- 0 until ids.size) {
      res :+ (ids.get(i), names.get(i))
    }
    res
})
This, however, returns only nulls, so it might just be my poor knowledge of Scala. Can anyone point out the issue?
Looks like the last 10 minutes out of the past 1-2 hours did the trick, lol. (The original attempt failed because res :+ (...) returns a new array instead of updating res in place, so the preallocated array of nulls was what got returned.) This works just fine:
import scala.collection.JavaConverters._  // needed for .asScala

df.explode(col("id"), col("name")) ({
  case row: Row =>
    val ids: List[String] = row.getList(0).asScala.toList
    val names: List[String] = row.getList(1).asScala.toList
    ids zip names
})
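Dataset.explode has since been deprecated; on Spark 2.4+ a hedged alternative sketch is arrays_zip plus the explode function (same assumed array columns id and name; I'm also assuming the zipped struct fields come out named after the input columns, which is what I'd expect for plain column references):

import org.apache.spark.sql.functions.{arrays_zip, col, explode}

// arrays_zip pairs the two arrays element-wise into an array of structs;
// explode then emits one row per id/name pair.
val pairs = df
  .withColumn("pair", explode(arrays_zip(col("id"), col("name"))))
  .select(col("pair.id").as("col_id"), col("pair.name").as("col_name"))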

Add several columns to an RDD from other RDDs in Scala

I have 4 RDDs with the same key but different columns, and I want to attach them to each other. I thought about performing a fullOuterJoin because, even if the ids don't match, I want to keep the complete record.
Maybe this is easier with dataframes (taking care not to lose any records)? But so far I have the following code:
var final_agg = rdd_agg_1.fullOuterJoin(rdd_agg_2).fullOuterJoin(rdd_agg_3).fullOuterJoin(rdd_agg_4).map {
  case (id, _) => ??? // TODO: format the resulting rdd in this case
}
If I have these rdds:
rdd1         rdd2         rdd3         rdd4
id  field1   id  field2   id  field3   id  field4
1   2        1   2        1   2        2   3
2   5        5   1
So the resulting rdd would have the following form:
rdd
id  field1  field2  field3  field4
1   2       2       2       3
2   5       -       -       -
5   -       1       -       -
EDIT:
This is the RDD that I want to format in the case:
org.apache.spark.rdd.RDD[(String, (Option[(Option[(Option[Int], Option[Int])], Option[Int])], Option[String]))]
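A minimal sketch of how that TODO case could be filled in, assuming the joined RDD has exactly the nested type shown above (the rdd_agg_* names come from the question; the fieldN names and the choice to keep None for the missing sides, i.e. the "-" cells above, are mine):

// Pattern-match the nested Options produced by the three chained fullOuterJoins.
val final_agg = rdd_agg_1.fullOuterJoin(rdd_agg_2).fullOuterJoin(rdd_agg_3).fullOuterJoin(rdd_agg_4).map {
  case (id, (nested, field4)) =>
    val (inner, field3) = nested.getOrElse((None, None))  // unwrap the 2nd join
    val (field1, field2) = inner.getOrElse((None, None))  // unwrap the 1st join
    (id, field1, field2, field3, field4)                  // (String, Option[Int], Option[Int], Option[Int], Option[String])
}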

Spark MLlib - Scala

I have a dataset containing the Customer_ID and the movies that each customer has seen.
I am analyzing the viewing patterns, e.g. if customer X sees movie Y, then he also sees movie Z.
I have already grouped my dataset by Customer_ID, and here is a sample of the data:
Customer_ID,Movie_ID
1, 2,1,3
2, 1
3, 3,6,8
What I want is to ignore the Customer_ID column and keep only the list of movies, like this:
2,1,3
1
3,6,8
How can I do this? My code is:
val data = sc.textFile("FILE")

case class Movies(Customer_ID: String, Movie_ID: String)

def csvToMyClass(line: String) = {
  val split = line.split(',')
  Movies(split(0), split(1))
}

val df = data.map(csvToMyClass).toDF("Customer_ID", "Movie_ID")
df.show

val movies = df.groupBy(col("Customer_ID"))
  .agg(collect_list(col("Movie_ID")) as "Movie_ID")
  .withColumn("Movie_ID", concat_ws(",", col("Movie_ID")))
  .rdd
Thanks
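A minimal sketch of the missing step, building on the question's own code: select only the aggregated column instead of keeping Customer_ID, then map each row to its string to get an RDD[String] of comma-separated movie lists.

import org.apache.spark.sql.functions.{col, collect_list, concat_ws}

// One comma-separated movie list per customer, without Customer_ID.
val movieLists = df
  .groupBy(col("Customer_ID"))
  .agg(collect_list(col("Movie_ID")) as "Movie_ID")
  .select(concat_ws(",", col("Movie_ID")) as "Movie_ID")
  .rdd
  .map(_.getString(0))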