Spark RDD mapping one row of data into multiple rows - scala

I have a text file with data that look like this:
Type1 1 3 5 9
Type2 4 6 7 8
Type3 3 6 9 10 11 25
I'd like to transform it into an RDD with rows like this:
1 Type1
3 Type1
3 Type3
......
I started with a case class:
case class MyData(uid: Int, gid: String)
I'm new to Spark and Scala, and I can't seem to find an example that does this.

It seems you want something like this?
rdd.flatMap { line =>
  val splitLine = line.split(' ').toList
  splitLine match {
    case gid :: rest => rest.map(x => MyData(x.toInt, gid))
    case Nil         => Nil   // guard against lines with no fields
  }
}
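For completeness, a hedged usage sketch (the file path is an assumption; the case class is the one from the question):

case class MyData(uid: Int, gid: String)

val rdd = sc.textFile("/path/to/data.txt")   // hypothetical path to the text file
val result = rdd.flatMap { line =>
  line.split(' ').toList match {
    case gid :: rest => rest.map(x => MyData(x.toInt, gid))
    case Nil         => Nil
  }
}
result.take(3).foreach(println)   // MyData(1,Type1), MyData(3,Type1), MyData(5,Type1)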

Related

How to count enum values in a dataframe with Scala

I'm new to Scala. There's a Spark dataframe like the one below:
userid | productid | enumXX
1      | 3         | 1
2      | 3         | 1
3      | 4         | 2
1      | 3         | 3
The enumXX values are 1, 2, 3; it's an enum type, defined below:
object enumXX extends Enumeration {
  type EnumXX = Value
  val apple  = Value(1)
  val balana = Value(2)
  val orign  = Value(3)
}
I would like to group by userid and productid and count how many apple, balana and orign values each group has. How should I do this in Scala?
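A hedged sketch of one way to do this, assuming the data is already in a DataFrame named df with the columns shown above (the code-to-name mapping mirrors the Enumeration; pivot values are listed explicitly so missing values still get a column):

import org.apache.spark.sql.functions.{col, count, when}

// map the integer codes to the Enumeration names
val labelled = df.withColumn("enumName",
  when(col("enumXX") === 1, "apple")
    .when(col("enumXX") === 2, "balana")
    .otherwise("orign"))

// one count column per enum value for each (userid, productid) pair
val counts = labelled
  .groupBy("userid", "productid")
  .pivot("enumName", Seq("apple", "balana", "orign"))
  .agg(count("enumXX"))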

spark Group By data-frame columns without aggregation [duplicate]

This question already has answers here:
How to aggregate values into collection after groupBy?
(3 answers)
Closed 4 years ago.
I have a csv file in HDFS (/hdfs/test.csv). I'd like to group the data below using Spark & Scala, and I need output something like this.
I want to group the A1...AN columns based on the A1 column, and the output should be something like this:
all the rows should be grouped as below.
Output:
JACK, ABCD, ARRAY("0,1,0,1", "2,9,2,9")
JACK, LMN, ARRAY("0,1,0,3", "0,4,3,T")
JACK, HBC, ARRAY("1,T,5,21", "E7,4W,5,8")
Input:
name  A1    A1  A2  A3 .. AN
------------------------------
JACK  ABCD  0   1   0   1
JACK  LMN   0   1   0   3
JACK  ABCD  2   9   2   9
JAC   HBC   1   T   5   21
JACK  LMN   0   4   3   T
JACK  HBC   E7  4W  5   8
You can achieve this by having the columns as an array.
import org.apache.spark.sql.functions.{collect_set, concat_ws, array, col}

val aCols = 1.to(250).map(x => col(s"A$x"))   // adjust 250 to the actual number of A columns
val concatCol = concat_ws(",", array(aCols: _*))

val groupedDf = df.withColumn("aConcat", concatCol)
  .groupBy("name", "A")
  .agg(collect_set("aConcat"))
If you're okay with duplicates you can also use collect_list instead of collect_set.
Your input has two different columns called A1. I will assume the groupBy category is called A, while the element to put in that final array is A1.
If you load the data into a DataFrame, you can do this to achieve the output specified:
import org.apache.spark.sql.functions.{collect_set, concat_ws}
import spark.implicits._   // for the $"colName" syntax

val grouped = someDF
  .groupBy($"name", $"A")
  .agg(collect_set(concat_ws(",", $"A1", $"A2", $"A3", $"A4")).alias("grouped"))
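Both answers assume the CSV has already been loaded into a DataFrame. A hedged sketch of loading the file mentioned in the question (the header and delimiter options are assumptions; adjust them to match the actual file):

val someDF = spark.read
  .option("header", "true")
  .csv("/hdfs/test.csv")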

Combining rows in a Spark dataframe

If I have an input as below:
sno name time
1 hello 1
1 hello 2
1 hai 3
1 hai 4
1 hai 5
1 how 6
1 how 7
1 are 8
1 are 9
1 how 10
1 how 11
1 are 12
1 are 13
1 are 14
I want to combine the rows having the same value in name into the output format below:
sno name timestart timeend
1 hello 1 2
1 hai 3 5
1 how 6 7
1 are 8 9
1 how 10 11
1 are 12 14
The input will be sorted by time, and only records that have the same name over consecutive time intervals must be merged.
I am trying to do this with Spark, but I cannot figure out a way to do it using Spark functions since I am new to Spark. Any suggestions on the approach will be appreciated.
I tried writing a user-defined function and applying maps to the data frame, but I could not come up with the right logic for the function.
PS: I am trying to do this using Spark with Scala.
One way to do so would be a plain SQL query.
Let's say df is your input dataframe.
def query(viewName: String): String =
  s"SELECT sno, name, MIN(time) AS timestart, MAX(time) AS timeend FROM $viewName GROUP BY sno, name"

val viewName = "dataframe"
df.createOrReplaceTempView(viewName)
spark.sql(query(viewName))
You can of course do the same with the DataFrame API. That would be something like:
import org.apache.spark.sql.functions.{min, max}
import spark.implicits._

df.groupBy($"sno", $"name")
  .agg(min($"time").as("timestart"), max($"time").as("timeend"))

Add several columns to an RDD through other RDDs in Scala

I have 4 RDDs with the same key but different columns, and I want to attach them to each other. I thought of performing a fullOuterJoin because, even if the ids do not match, I want the complete register.
Maybe this is easier with dataframes (taking into account not losing registers)? But so far I have the following code:
var final_agg = rdd_agg_1.fullOuterJoin(rdd_agg_2).fullOuterJoin(rdd_agg_3).fullOuterJoin(rdd_agg_4).map {
  case (id, fields) => ??? // TODO: format the resulting rdd here
}
If I have these RDDs:
rdd1          rdd2          rdd3          rdd4
id  field1    id  field2    id  field3    id  field4
1   2         1   2         1   2         2   3
2   5         5   1
So the resulting rdd would have the following form:
rdd
id  field1  field2  field3  field4
1   2       2       2       3
2   5       -       -       -
5   -       1       -       -
EDIT:
This is the RDD that I want to format in the case clause:
org.apache.spark.rdd.RDD[(String, (Option[(Option[(Option[Int], Option[Int])], Option[Int])], Option[String]))]
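A hedged sketch of how that nested type can be flattened in the map (the sample data here is hypothetical, mirroring the tables above; missing fields simply stay None):

val rdd_agg_1 = sc.parallelize(Seq(("1", 2), ("2", 5)))
val rdd_agg_2 = sc.parallelize(Seq(("1", 2), ("5", 1)))
val rdd_agg_3 = sc.parallelize(Seq(("1", 2)))
val rdd_agg_4 = sc.parallelize(Seq(("2", "3")))

val final_agg = rdd_agg_1
  .fullOuterJoin(rdd_agg_2)
  .fullOuterJoin(rdd_agg_3)
  .fullOuterJoin(rdd_agg_4)
  .map { case (id, (nested, f4)) =>
    // unwrap the Option nesting produced by chaining fullOuterJoin
    val (f12, f3) = nested.getOrElse((None, None))
    val (f1, f2)  = f12.getOrElse((None, None))
    (id, f1, f2, f3, f4)
  }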

Need to generate sequence ids without using groupby in Spark Scala

I want to generate a sequence number column (Seq_No) that increments whenever Product_IDs changes in the table. My input table has only Product_IDs, and I want output with Seq_No. We cannot use GroupBy or Row Number over partition in SQL, as our Scala setup does not support it.
Logic: Seq_No = 1
for (i = 2 : No_of_Rows)
  when Product_IDs(i) != Product_IDs(i-1) then Seq_No(i) = Seq_No(i-1) + 1
  else Seq_No(i) = Seq_No(i-1)
end as Seq_No
Product_IDs Seq_No
ID1 1
ID1 1
ID1 1
ID2 2
ID3 3
ID3 3
ID3 3
ID3 3
ID1 4
ID1 4
ID4 5
ID5 6
ID3 7
ID6 8
ID6 8
ID5 9
ID5 9
ID4 10
So I want to increment Seq_No whenever the current Product_ID is not equal to the previous one. The input table has only the Product_IDs column, and we want Product_IDs with Seq_No using Spark Scala.
I would probably just write a function to generate the sequence numbers:
scala> val getSeqNum: String => Int = {
  var prevId = ""
  var n = 0
  (id: String) => {
    if (id != prevId) {
      n += 1
      prevId = id
    }
    n
  }
}
getSeqNum: String => Int = <function1>
scala> for { id <- Seq("foo", "foo", "bar") } yield getSeqNum(id)
res8: Seq[Int] = List(1, 1, 2)
UPDATE:
I'm not quite clear on what you want beyond that, Nikhil, and I am not a Spark expert, but I imagine you want something like
val rdd = ??? // Hopefully you know how to get the RDD
for {
  (id, col2, col3) <- rdd // assuming the entries are tuples
  seqNum = getSeqNum(id)
} yield ??? // Hopefully you know how to transform the entries
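One caveat worth adding: a stateful closure like getSeqNum only behaves as intended when the rows are processed in their original order on a single partition. If a DataFrame with window functions is in fact an option, a hedged sketch (the DataFrame name df and the row-order column are assumptions):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lag, lit, monotonically_increasing_id, sum, when}

// attach a row-order column; assumes the natural read order is the intended order
val ordered = df.withColumn("row_id", monotonically_increasing_id())
val w = Window.orderBy("row_id")

val withSeq = ordered
  .withColumn("changed",
    when(lag(col("Product_IDs"), 1).over(w) === col("Product_IDs"), lit(0)).otherwise(lit(1)))
  .withColumn("Seq_No", sum(col("changed")).over(w))
  .drop("row_id", "changed")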