Spark RDD mapping one row of data into multiple rows - scala

I have a text file with data that look like this:
Type1 1 3 5 9
Type2 4 6 7 8
Type3 3 6 9 10 11 25
I'd like to transform it into an RDD with rows like this:
1 Type1
3 Type1
3 Type3
......
I started with a case class:
case class MyData(uid: Int, gid: String)
I'm new to Spark and Scala, and I can't seem to find an example that does this.

It seems you want something like this?
rdd.flatMap { line =>
  val splitLine = line.split(' ').toList
  splitLine match {
    case gid :: rest => rest.map(x => MyData(x.toInt, gid))
    case Nil         => Nil   // guard against lines with no fields
  }
}
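For completeness, a hedged usage sketch (the file path is an assumption; the case class is the one from the question):

case class MyData(uid: Int, gid: String)

val rdd = sc.textFile("/path/to/data.txt")   // hypothetical path to the text file
val result = rdd.flatMap { line =>
  line.split(' ').toList match {
    case gid :: rest => rest.map(x => MyData(x.toInt, gid))
    case Nil         => Nil
  }
}
result.take(3).foreach(println)   // MyData(1,Type1), MyData(3,Type1), MyData(5,Type1)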

Related

How to count enum values in a dataframe with Scala

I'm new to Scala. There's a Spark dataframe like the one below:
userid | productid | enumXX
1      | 3         | 1
2      | 3         | 1
3      | 4         | 2
1      | 3         | 3
The enumXX values are 1, 2, 3; it's an enum type, defined below:
object enumXX extends Enumeration {
  type EnumXX = Value
  val apple  = Value(1)
  val balana = Value(2)
  val orign  = Value(3)
}
I would like to group by userid and productid and count how many apple, balana and orign values each group has. How should I do this in Scala?
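A hedged sketch of one way to do this, assuming the data is already in a DataFrame named df with the columns shown above (the code-to-name mapping mirrors the Enumeration; pivot values are listed explicitly so missing values still get a column):

import org.apache.spark.sql.functions.{col, count, when}

// map the integer codes to the Enumeration names
val labelled = df.withColumn("enumName",
  when(col("enumXX") === 1, "apple")
    .when(col("enumXX") === 2, "balana")
    .otherwise("orign"))

// one count column per enum value for each (userid, productid) pair
val counts = labelled
  .groupBy("userid", "productid")
  .pivot("enumName", Seq("apple", "balana", "orign"))
  .agg(count("enumXX"))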

spark Group By data-frame columns without aggregation [duplicate]

This question already has answers here:
How to aggregate values into collection after groupBy?
(3 answers)
Closed 4 years ago.
I have a csv file in HDFS (/hdfs/test.csv). I'd like to group the data below using Spark & Scala, and I need output something like this.
I want to group the A1...AN columns based on the A1 column, and the output should be something like this:
all the rows should be grouped as below.
Output:
JACK, ABCD, ARRAY("0,1,0,1", "2,9,2,9")
JACK, LMN, ARRAY("0,1,0,3", "0,4,3,T")
JACK, HBC, ARRAY("1,T,5,21", "E7,4W,5,8")
Input:
name  A1    A1  A2  A3 .. AN
------------------------------
JACK  ABCD  0   1   0   1
JACK  LMN   0   1   0   3
JACK  ABCD  2   9   2   9
JAC   HBC   1   T   5   21
JACK  LMN   0   4   3   T
JACK  HBC   E7  4W  5   8
You can achieve this by having the columns as an array.
import org.apache.spark.sql.functions.{collect_set, concat_ws, array, col}

val aCols = 1.to(250).map(x => col(s"A$x"))   // adjust 250 to the actual number of A columns
val concatCol = concat_ws(",", array(aCols: _*))

val groupedDf = df.withColumn("aConcat", concatCol)
  .groupBy("name", "A")
  .agg(collect_set("aConcat"))
If you're okay with duplicates you can also use collect_list instead of collect_set.
Your input has two different columns called A1. I will assume the groupBy category is called A, while the element to put in that final array is A1.
If you load the data into a DataFrame, you can do this to achieve the output specified:
import org.apache.spark.sql.functions.{collect_set, concat_ws}
import spark.implicits._   // for the $"colName" syntax

val grouped = someDF
  .groupBy($"name", $"A")
  .agg(collect_set(concat_ws(",", $"A1", $"A2", $"A3", $"A4")).alias("grouped"))
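Both answers assume the CSV has already been loaded into a DataFrame. A hedged sketch of loading the file mentioned in the question (the header and delimiter options are assumptions; adjust them to match the actual file):

val someDF = spark.read
  .option("header", "true")
  .csv("/hdfs/test.csv")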

Combining rows in a Spark dataframe

If I have an input as below:
sno name time
1 hello 1
1 hello 2
1 hai 3
1 hai 4
1 hai 5
1 how 6
1 how 7
1 are 8
1 are 9
1 how 10
1 how 11
1 are 12
1 are 13
1 are 14
I want to combine the rows having the same value in name into the output format below:
sno name timestart timeend
1 hello 1 2
1 hai 3 5
1 how 6 7
1 are 8 9
1 how 10 11
1 are 12 14
The input will be sorted by time, and only records that have the same name over consecutive time intervals must be merged.
I am trying to do this with Spark, but I cannot figure out a way to do it using Spark functions since I am new to Spark. Any suggestions on the approach will be appreciated.
I tried writing a user-defined function and applying maps to the data frame, but I could not come up with the right logic for the function.
PS: I am trying to do this using Spark with Scala.
One way to do so would be a plain SQL query.
Let's say df is your input dataframe.
def query(viewName: String): String =
  s"SELECT sno, name, MIN(time) AS timestart, MAX(time) AS timeend FROM $viewName GROUP BY sno, name"

val viewName = "dataframe"
df.createOrReplaceTempView(viewName)
spark.sql(query(viewName))
You can of course do the same with the DataFrame API. That would be something like:
import org.apache.spark.sql.functions.{min, max}
import spark.implicits._

df.groupBy($"sno", $"name")
  .agg(min($"time").as("timestart"), max($"time").as("timeend"))

Add several columns to an RDD through other RDDs in Scala

I have 4 RDDs with the same key but different columns, and I want to attach them to each other. I thought of performing a fullOuterJoin because, even if the ids do not match, I want the complete register.
Maybe this is easier with dataframes (taking into account not losing registers)? But so far I have the following code:
var final_agg = rdd_agg_1.fullOuterJoin(rdd_agg_2).fullOuterJoin(rdd_agg_3).fullOuterJoin(rdd_agg_4).map {
  case (id, fields) => ??? // TODO: format the resulting rdd here
}
If I have these RDDs:
rdd1          rdd2          rdd3          rdd4
id  field1    id  field2    id  field3    id  field4
1   2         1   2         1   2         2   3
2   5         5   1
So the resulting rdd would have the following form:
rdd
id  field1  field2  field3  field4
1   2       2       2       3
2   5       -       -       -
5   -       1       -       -
EDIT:
This is the RDD that I want to format in the case clause:
org.apache.spark.rdd.RDD[(String, (Option[(Option[(Option[Int], Option[Int])], Option[Int])], Option[String]))]
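A hedged sketch of how that nested type can be flattened in the map (the sample data here is hypothetical, mirroring the tables above; missing fields simply stay None):

val rdd_agg_1 = sc.parallelize(Seq(("1", 2), ("2", 5)))
val rdd_agg_2 = sc.parallelize(Seq(("1", 2), ("5", 1)))
val rdd_agg_3 = sc.parallelize(Seq(("1", 2)))
val rdd_agg_4 = sc.parallelize(Seq(("2", "3")))

val final_agg = rdd_agg_1
  .fullOuterJoin(rdd_agg_2)
  .fullOuterJoin(rdd_agg_3)
  .fullOuterJoin(rdd_agg_4)
  .map { case (id, (nested, f4)) =>
    // unwrap the Option nesting produced by chaining fullOuterJoin
    val (f12, f3) = nested.getOrElse((None, None))
    val (f1, f2)  = f12.getOrElse((None, None))
    (id, f1, f2, f3, f4)
  }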

Need to generate sequence ids without using groupby in Spark Scala

I want to generate a sequence number column (Seq_No) that increments whenever Product_IDs changes in the table. My input table has only Product_IDs, and I want output with Seq_No. We cannot use GroupBy or Row Number over partition in SQL, as our Scala setup does not support it.
Logic: Seq_No = 1
for (i = 2 : No_of_Rows)
  when Product_IDs(i) != Product_IDs(i-1) then Seq_No(i) = Seq_No(i-1) + 1
  else Seq_No(i) = Seq_No(i-1)
end as Seq_No
Product_IDs Seq_No
ID1 1
ID1 1
ID1 1
ID2 2
ID3 3
ID3 3
ID3 3
ID3 3
ID1 4
ID1 4
ID4 5
ID5 6
ID3 7
ID6 8
ID6 8
ID5 9
ID5 9
ID4 10
So I want to increment Seq_No whenever the current Product_ID is not equal to the previous one. The input table has only the Product_IDs column, and we want Product_IDs with Seq_No using Spark Scala.
I would probably just write a function to generate the sequence numbers:
scala> val getSeqNum: String => Int = {
  var prevId = ""
  var n = 0
  (id: String) => {
    if (id != prevId) {
      n += 1
      prevId = id
    }
    n
  }
}
getSeqNum: String => Int = <function1>
scala> for { id <- Seq("foo", "foo", "bar") } yield getSeqNum(id)
res8: Seq[Int] = List(1, 1, 2)
UPDATE:
I'm not quite clear on what you want beyond that, Nikhil, and I am not a Spark expert, but I imagine you want something like
val rdd = ??? // Hopefully you know how to get the RDD
for {
  (id, col2, col3) <- rdd // assuming the entries are tuples
  seqNum = getSeqNum(id)
} yield ??? // Hopefully you know how to transform the entries
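One caveat worth adding: a stateful closure like getSeqNum only behaves as intended when the rows are processed in their original order on a single partition. If a DataFrame with window functions is in fact an option, a hedged sketch (the DataFrame name df and the row-order column are assumptions):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lag, lit, monotonically_increasing_id, sum, when}

// attach a row-order column; assumes the natural read order is the intended order
val ordered = df.withColumn("row_id", monotonically_increasing_id())
val w = Window.orderBy("row_id")

val withSeq = ordered
  .withColumn("changed",
    when(lag(col("Product_IDs"), 1).over(w) === col("Product_IDs"), lit(0)).otherwise(lit(1)))
  .withColumn("Seq_No", sum(col("changed")).over(w))
  .drop("row_id", "changed")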