I have a dataframe where one column contains several pieces of information in a 'key=value' format.
There are almost 30 different 'key=value' pairs that can appear in that column; I will use 4 of them
(_age, _city, _sal, tag) for illustration.
id name properties
0 A {_age=10, _city=A, _sal=1000}
1 B {_age=20, _city=B, _sal=3000, tag=XYZ}
2 C {_city=BC, tag=ABC}
How can I convert this string column into multiple columns?
I need to do this with a Spark Scala DataFrame.
The expected output is:
id  name  _age  _city  _sal  tag
0   A     10    A      1000
1   B     20    B      3000  XYZ
2   C           BC           ABC
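For reference, the sample data can be recreated as a DataFrame like this (a sketch for spark-shell; the properties column is kept as a plain string and the types are assumed):

// spark-shell sketch: `spark` is the usual SparkSession
import spark.implicits._

val df = Seq(
  (0, "A", "{_age=10, _city=A, _sal=1000}"),
  (1, "B", "{_age=20, _city=B, _sal=3000, tag=XYZ}"),
  (2, "C", "{_city=BC, tag=ABC}")
).toDF("id", "name", "properties")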
Short answer (this assumes properties is a struct column, so its fields can be selected directly):
df.select(
  col("id"),
  col("name"),
  col("properties.*")
)
Try this:
val s = df.withColumn("dummy",
  explode(split(regexp_replace($"properties", "\\{|\\}", ""), ",")))
val result = s.drop("properties")
  .withColumn("col1", trim(split($"dummy", "=")(0)))       // trim the space left after splitting on ","
  .withColumn("col1-value", trim(split($"dummy", "=")(1)))
  .drop("dummy")
result.groupBy("id", "name").pivot("col1").agg(first($"col1-value")).orderBy($"id").show()
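Since pivot creates one column per distinct key in col1, each of the roughly 30 possible keys becomes its own column, with null wherever a row does not carry that key, which matches the blanks in the expected output above.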
I have a csv file in HDFS (/hdfs/test.csv) that I would like to group using Spark and Scala. I want to group the A1...AN columns based on the A1 column, so that all the rows with the same key are collected together, like this:
Output:
JACK, ABCD, ARRAY("0,1,0,1", "2,9,2,9")
JACK, LMN, ARRAY("0,1,0,3", "0,4,3,T")
JACK, HBC, ARRAY("1,T,5,21", "E7,4W,5,8")
Input:
++++++++++++++++++++++++++++++
name A1 A1 A2 A3..AN
--------------------------------
JACK ABCD 0 1 0 1
JACK LMN 0 1 0 3
JACK ABCD 2 9 2 9
JACK HBC 1 T 5 21
JACK LMN 0 4 3 T
JACK HBC E7 4W 5 8
You can achieve this by combining the A columns into an array.
import org.apache.spark.sql.functions.{collect_set, concat_ws, array, col}

val aCols = (1 to 250).map(x => col(s"A$x"))
val concatCol = concat_ws(",", array(aCols: _*))
val groupedDf = df.withColumn("aConcat", concatCol).
  groupBy("name", "A").
  agg(collect_set("aConcat"))
If you're okay with duplicates you can also use collect_list instead of collect_set.
Your input has two different columns called A1. I will assume the groupBy category is called A, while the element to put in that final array is A1.
If you load the data into a DataFrame, you can do this to achieve the output specified:
import org.apache.spark.sql.functions.{collect_set, concat_ws}
val grouped = someDF
.groupBy($"name", $"A")
.agg(collect_set(concat_ws(",", $"A1", $"A2", $"A3", $"A4")).alias("grouped"))
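If the data still has to be read from /hdfs/test.csv first, a minimal sketch (assuming the header row has been stripped and there are exactly four value columns, so the duplicate A1 header is avoided by naming the columns explicitly):

val someDF = spark.read
  .csv("/hdfs/test.csv")                     // every column is read as a string
  .toDF("name", "A", "A1", "A2", "A3", "A4") // positional renaming avoids the duplicate A1

After that the groupBy/collect_set snippet above can be applied directly.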
I have a Spark dataframe that looks like:
id DataArray
a array(3,2,1)
b array(4,2,1)
c array(8,6,1)
d array(8,2,4)
I want to transform this dataframe into:
id col1 col2 col3
a 3 2 1
b 4 2 1
c 8 6 1
d 8 2 4
What function should I use?
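For reference, a DataFrame of this shape can be built like so (a sketch; the array element type is assumed to be integer):

import spark.implicits._

val df = Seq(
  ("a", Seq(3, 2, 1)),
  ("b", Seq(4, 2, 1)),
  ("c", Seq(8, 6, 1)),
  ("d", Seq(8, 2, 4))
).toDF("id", "DataArray")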
Use apply:
import org.apache.spark.sql.functions.col

df.select(
  col("id") +: (0 until 3).map(i => col("DataArray")(i).alias(s"col${i + 1}")): _*
)
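Here col("DataArray")(i) is shorthand for col("DataArray").getItem(i), i.e. it extracts the i-th element of the array column, and the trailing : _* expands the resulting sequence of columns into the varargs that select expects.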
You can use foldLeft to add each column from DataArray.
First make a list of the column names that you want to add:
val columns = List("col1", "col2", "col3")

val result = columns.zipWithIndex.foldLeft(df) {
  (memoDF, column) => memoDF.withColumn(column._1, col("DataArray")(column._2))
}.drop("DataArray")
Hope this helps!
I want to generate a sequence number column (Seq_No) that increments whenever Product_IDs changes in the table. My input table has only Product_IDs, and I want the output to include Seq_No. We cannot use GroupBy or row number over partition in SQL, as Scala does not support it.
Logic: Seq_No(1) = 1
for (i = 2 to No_of_Rows)
  when Product_IDs(i) != Product_IDs(i-1) then Seq_No(i) = Seq_No(i-1) + 1
  else Seq_No(i) = Seq_No(i-1)
end as Seq_No
Product_IDs Seq_No
ID1 1
ID1 1
ID1 1
ID2 2
ID3 3
ID3 3
ID3 3
ID3 3
ID1 4
ID1 4
ID4 5
ID5 6
ID3 7
ID6 8
ID6 8
ID5 9
ID5 9
ID4 10
So I want Seq_No to increase whenever the current Product_ID is not equal to the previous one. The input table has only one column, Product_IDs, and I want Product_IDs together with Seq_No as output, using Spark Scala.
I would probably just write a function to generate the sequence numbers:
scala> val getSeqNum: String => Int = {
var prevId = ""
var n = 0
(id: String) => {
if (id != prevId) {
n += 1
prevId = id
}
n
}
}
getSeqNum: String => Int = <function1>
scala> for { id <- Seq("foo", "foo", "bar") } yield getSeqNum(id)
res8: Seq[Int] = List(1, 1, 2)
UPDATE:
I'm not quite clear on what you want beyond that, Nikhil, and I am not a Spark expert, but I imagine you want something like
val rdd = ??? // Hopefully you know how to get the RDD
for {
  (id, col2, col3) <- rdd // assuming the entries are tuples
  seqNum = getSeqNum(id)
} yield ??? // Hopefully you know how to transform the entries
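For example, one small-data sketch of wiring this into Spark (productsDF is an assumed name for a DataFrame holding the single Product_IDs column; it assumes the data is small enough to collect to the driver and that the original row order is preserved, so the stateful closure sees the rows in sequence):

import spark.implicits._

// Collect the IDs in order, apply getSeqNum on the driver, and rebuild a DataFrame.
val ids = productsDF.as[String].collect()
val withSeq = ids.map(id => (id, getSeqNum(id))).toSeq.toDF("Product_IDs", "Seq_No")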
I have a text file with data that look like this:
Type1 1 3 5 9
Type2 4 6 7 8
Type3 3 6 9 10 11 25
I'd like to transform it into an RDD with rows like this:
1 Type1
3 Type1
3 Type3
......
I started with a case class:
case class MyData(uid: Int, gid: String)
I'm new to Spark and Scala, and I can't seem to find an example that does this.
It seems you want something like this?
rdd.flatMap { line =>
  line.split(' ').toList match {
    case gid :: rest => rest.map(x => MyData(x.toInt, gid))
    case Nil         => Nil
  }
}
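To round this out, the file can be loaded with sc.textFile first (the path below is a placeholder):

val rdd = sc.textFile("/path/to/data.txt") // hypothetical path

Applying the flatMap above to that rdd then yields an RDD[MyData] whose first elements are MyData(1,Type1), MyData(3,Type1), and so on, matching the rows you want.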