I have a dataframe where one column contains several pieces of information in a 'key=value' format.
There are almost 30 different 'key=value' pairs that can appear in that column; I will use 4 of them
(_age, _city, _sal, tag) for illustration.
id name properties
0 A {_age=10, _city=A, _sal=1000}
1 B {_age=20, _city=B, _sal=3000, tag=XYZ}
2 C {_city=BC, tag=ABC}
How can I convert this string column into multiple columns?
I need to do this with a Spark Scala DataFrame.
The expected output is:
id  name  _age  _city  _sal  tag
0   A     10    A      1000
1   B     20    B      3000  XYZ
2   C           BC           ABC
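For reference, the sample data can be recreated as a DataFrame like this (a sketch for spark-shell; the properties column is kept as a plain string and the types are assumed):

// spark-shell sketch: `spark` is the usual SparkSession
import spark.implicits._

val df = Seq(
  (0, "A", "{_age=10, _city=A, _sal=1000}"),
  (1, "B", "{_age=20, _city=B, _sal=3000, tag=XYZ}"),
  (2, "C", "{_city=BC, tag=ABC}")
).toDF("id", "name", "properties")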
Short answer (this assumes properties is a struct column, so its fields can be selected directly):
df.select(
  col("id"),
  col("name"),
  col("properties.*")
)
Try this:
val s = df.withColumn("dummy",
  explode(split(regexp_replace($"properties", "\\{|\\}", ""), ",")))
val result = s.drop("properties")
  .withColumn("col1", trim(split($"dummy", "=")(0)))       // trim the space left after splitting on ","
  .withColumn("col1-value", trim(split($"dummy", "=")(1)))
  .drop("dummy")
result.groupBy("id", "name").pivot("col1").agg(first($"col1-value")).orderBy($"id").show()
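Since pivot creates one column per distinct key in col1, each of the roughly 30 possible keys becomes its own column, with null wherever a row does not carry that key, which matches the blanks in the expected output above.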
I have a csv file in HDFS (/hdfs/test.csv) that I would like to group using Spark and Scala. I want to group the A1...AN columns based on the A1 column, so that all the rows with the same key are collected together, like this:
Output:
JACK, ABCD, ARRAY("0,1,0,1", "2,9,2,9")
JACK, LMN, ARRAY("0,1,0,3", "0,4,3,T")
JACK, HBC, ARRAY("1,T,5,21", "E7,4W,5,8")
Input:
++++++++++++++++++++++++++++++
name A1 A1 A2 A3..AN
--------------------------------
JACK ABCD 0 1 0 1
JACK LMN 0 1 0 3
JACK ABCD 2 9 2 9
JACK HBC 1 T 5 21
JACK LMN 0 4 3 T
JACK HBC E7 4W 5 8
You can achieve this by combining the A columns into an array.
import org.apache.spark.sql.functions.{collect_set, concat_ws, array, col}

val aCols = (1 to 250).map(x => col(s"A$x"))
val concatCol = concat_ws(",", array(aCols: _*))
val groupedDf = df.withColumn("aConcat", concatCol).
  groupBy("name", "A").
  agg(collect_set("aConcat"))
If you're okay with duplicates you can also use collect_list instead of collect_set.
Your input has two different columns called A1. I will assume the groupBy category is called A, while the element to put in that final array is A1.
If you load the data into a DataFrame, you can do this to achieve the output specified:
import org.apache.spark.sql.functions.{collect_set, concat_ws}
val grouped = someDF
.groupBy($"name", $"A")
.agg(collect_set(concat_ws(",", $"A1", $"A2", $"A3", $"A4")).alias("grouped"))
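If the data still has to be read from /hdfs/test.csv first, a minimal sketch (assuming the header row has been stripped and there are exactly four value columns, so the duplicate A1 header is avoided by naming the columns explicitly):

val someDF = spark.read
  .csv("/hdfs/test.csv")                     // every column is read as a string
  .toDF("name", "A", "A1", "A2", "A3", "A4") // positional renaming avoids the duplicate A1

After that the groupBy/collect_set snippet above can be applied directly.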
I have a Spark dataframe that looks like:
id DataArray
a array(3,2,1)
b array(4,2,1)
c array(8,6,1)
d array(8,2,4)
I want to transform this dataframe into:
id col1 col2 col3
a 3 2 1
b 4 2 1
c 8 6 1
d 8 2 4
What function should I use?
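For reference, a DataFrame of this shape can be built like so (a sketch; the array element type is assumed to be integer):

import spark.implicits._

val df = Seq(
  ("a", Seq(3, 2, 1)),
  ("b", Seq(4, 2, 1)),
  ("c", Seq(8, 6, 1)),
  ("d", Seq(8, 2, 4))
).toDF("id", "DataArray")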
Use apply:
import org.apache.spark.sql.functions.col

df.select(
  col("id") +: (0 until 3).map(i => col("DataArray")(i).alias(s"col${i + 1}")): _*
)
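Here col("DataArray")(i) is shorthand for col("DataArray").getItem(i), i.e. it extracts the i-th element of the array column, and the trailing : _* expands the resulting sequence of columns into the varargs that select expects.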
You can use foldLeft to add each column from DataArray.
First make a list of the column names that you want to add:
val columns = List("col1", "col2", "col3")

val result = columns.zipWithIndex.foldLeft(df) {
  (memoDF, column) => memoDF.withColumn(column._1, col("DataArray")(column._2))
}.drop("DataArray")
Hope this helps!
I want to generate a sequence number column (Seq_No) that increments whenever Product_IDs changes in the table. My input table has only Product_IDs, and I want the output to include Seq_No. We cannot use GroupBy or row number over partition in SQL, as Scala does not support it.
Logic: Seq_No(1) = 1
for (i = 2 to No_of_Rows)
  when Product_IDs(i) != Product_IDs(i-1) then Seq_No(i) = Seq_No(i-1) + 1
  else Seq_No(i) = Seq_No(i-1)
end as Seq_No
Product_IDs Seq_No
ID1 1
ID1 1
ID1 1
ID2 2
ID3 3
ID3 3
ID3 3
ID3 3
ID1 4
ID1 4
ID4 5
ID5 6
ID3 7
ID6 8
ID6 8
ID5 9
ID5 9
ID4 10
So I want Seq_No to increase whenever the current Product_ID is not equal to the previous one. The input table has only one column, Product_IDs, and I want Product_IDs together with Seq_No as output, using Spark Scala.
I would probably just write a function to generate the sequence numbers:
scala> val getSeqNum: String => Int = {
var prevId = ""
var n = 0
(id: String) => {
if (id != prevId) {
n += 1
prevId = id
}
n
}
}
getSeqNum: String => Int = <function1>
scala> for { id <- Seq("foo", "foo", "bar") } yield getSeqNum(id)
res8: Seq[Int] = List(1, 1, 2)
UPDATE:
I'm not quite clear on what you want beyond that, Nikhil, and I am not a Spark expert, but I imagine you want something like
val rdd = ??? // Hopefully you know how to get the RDD
for {
  (id, col2, col3) <- rdd // assuming the entries are tuples
  seqNum = getSeqNum(id)
} yield ??? // Hopefully you know how to transform the entries
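For example, one small-data sketch of wiring this into Spark (productsDF is an assumed name for a DataFrame holding the single Product_IDs column; it assumes the data is small enough to collect to the driver and that the original row order is preserved, so the stateful closure sees the rows in sequence):

import spark.implicits._

// Collect the IDs in order, apply getSeqNum on the driver, and rebuild a DataFrame.
val ids = productsDF.as[String].collect()
val withSeq = ids.map(id => (id, getSeqNum(id))).toSeq.toDF("Product_IDs", "Seq_No")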
I have a text file with data that look like this:
Type1 1 3 5 9
Type2 4 6 7 8
Type3 3 6 9 10 11 25
I'd like to transform it into an RDD with rows like this:
1 Type1
3 Type1
3 Type3
......
I started with a case class:
case class MyData(uid: Int, gid: String)
I'm new to Spark and Scala, and I can't seem to find an example that does this.
It seems you want something like this?
rdd.flatMap { line =>
  line.split(' ').toList match {
    case gid :: rest => rest.map(x => MyData(x.toInt, gid))
    case Nil         => Nil
  }
}
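To round this out, the file can be loaded with sc.textFile first (the path below is a placeholder):

val rdd = sc.textFile("/path/to/data.txt") // hypothetical path

Applying the flatMap above to that rdd then yields an RDD[MyData] whose first elements are MyData(1,Type1), MyData(3,Type1), and so on, matching the rows you want.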