Spark DF: create Seq column in withColumn - Scala

I have a df:
col1 | col2
1    | abcdefghi
2    | qwertyuio
and I want to repeat each row, splitting col2 into 3 substrings of length 3:
col1 | col2
1    | abcdefghi
1    | abc
1    | def
1    | ghi
2    | qwertyuio
2    | qwe
2    | rty
2    | uio
I was trying to create a new column of Seq containing Seq((col("col1"), substring(col("col2"),0,3))...):
val df1 = df.withColumn("col3", Seq(
(col("col1"), substring(col("col2"),0,3)),
(col("col1"), substring(col("col2"),3,3)),
(col("col1"), substring(col("col2"),6,3)) ))
My idea was to select that new column and reduce it to one final Seq, then turn it into a DataFrame and append it to the initial df.
I am getting an error in the withColumn call like:
Exception in thread "main" java.lang.RuntimeException: Unsupported literal type class scala.collection.immutable.$colon$colon

withColumn expects a Column, and Spark tries to turn the plain Scala Seq into a literal Column, which is what fails with the error above. You can use the Spark array function together with explode instead:
val df1 = df.union(
  df.select(
    $"col1",
    // substring in Spark SQL is 1-based: positions 1, 4 and 7 give the three 3-char slices
    explode(array(
      substring(col("col2"), 1, 3),
      substring(col("col2"), 4, 3),
      substring(col("col2"), 7, 3)
    )).as("col2")
  )
)
df1.show
+----+---------+
|col1| col2|
+----+---------+
| 1|abcdefghi|
| 2|qwertyuio|
| 1| abc|
| 1| def|
| 1| ghi|
| 2| qwe|
| 2| rty|
| 2| uio|
+----+---------+
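Note that union makes no guarantee about row order, so the original rows and their slices can come out interleaved. If you want each full string followed by its three slices, as in the expected output, you can sort explicitly; a minimal sketch, assuming the df1 built above:
// Sort so the 9-character original comes first within each col1 group,
// followed by its 3-character slices (length is in org.apache.spark.sql.functions).
df1.orderBy($"col1", length($"col2").desc).show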

You can also use a UDF:
val df = spark.sparkContext.parallelize(Seq((1L,"abcdefghi"), (2L,"qwertyuio"))).toDF("col1","col2")
df.show(false)
// input
+----+---------+
|col1|col2 |
+----+---------+
|1 |abcdefghi|
|2 |qwertyuio|
+----+---------+
// udf: split col2 into chunks of 3 characters
val getSeq = udf((col2: String) => col2.split("(?<=\\G...)"))
df.withColumn("col2", explode(getSeq($"col2")))
.union(df).show(false)
+----+---------+
|col1|col2 |
+----+---------+
|1 |abc |
|1 |ghi |
|1 |abcdefghi|
|1 |def |
|2 |qwe |
|2 |rty |
|2 |uio |
|2 |qwertyuio|
+----+---------+
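For reference, the chunking pattern used in the UDF is plain Java regex: "\\G" anchors at the end of the previous match, so the lookbehind "(?<=\\G...)" makes split break the string after every three characters. A quick check in the Scala REPL, independent of Spark:
// Plain-Scala check of the chunking regex used in the udf above.
val chunks = "abcdefghi".split("(?<=\\G...)")
// chunks: Array(abc, def, ghi)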

Related

How do you split a column such that the first half becomes the column name and the second half the column value in Scala Spark?

I have a column which has values like
+----------------------+-----------------------------------------+
|UserId |col |
+----------------------+-----------------------------------------+
|1 |firstname=abc |
|2 |lastname=xyz |
|3 |firstname=pqr;lastname=zzz |
|4 |firstname=aaa;middlename=xxx;lastname=bbb|
+----------------------+-----------------------------------------+
and what I want is something like this:
+------+---------+--------+----------+
|UserId|firstname|lastname|middlename|
+------+---------+--------+----------+
|1     |abc      |null    |null      |
|2     |null     |xyz     |null      |
|3     |pqr      |zzz     |null      |
|4     |aaa      |bbb     |xxx       |
+------+---------+--------+----------+
I have already done this:
var new_df = df.withColumn("temp_new", split(col("col"), "\\;")).select(
(0 until numCols).map(i => split(col("temp_new").getItem(i), "=").getItem(1).as(s"col$i")): _*
)
where numCols is the maximum number of key=value pairs in col,
but as you may have guessed I get something like this as the output:
+------+----+----+----+
|UserId|col0|col1|col2|
+------+----+----+----+
|1     |abc |null|null|
|2     |xyz |null|null|
|3     |pqr |zzz |null|
|4     |aaa |xxx |bbb |
+------+----+----+----+
NOTE: The above is just an example. There could be more additions to the columns, like firstname=aaa;middlename=xxx;lastname=bbb;age=20;country=India and so on, for around 40-50 column names and values. They are dynamic and I don't know most of them in advance.
I am looking for a way to achieve the result with Scala in Spark.
You could apply groupBy/pivot to generate key columns after converting the key/value-pairs string column into a Map column via SQL function str_to_map, as shown below:
val df = Seq(
  (1, "firstname=joe;age=33"),
  (2, "lastname=smith;country=usa"),
  (3, "firstname=zoe;lastname=cooper;age=44;country=aus"),
  (4, "firstname=john;lastname=doe")
).toDF("user_id", "key_values")

df.
  select($"user_id", explode(expr("str_to_map(key_values, ';', '=')"))).
  groupBy("user_id").pivot("key").agg(first("value").as("value")).
  orderBy("user_id").  // only for ordered output
  show
/*
+-------+----+-------+---------+--------+
|user_id| age|country|firstname|lastname|
+-------+----+-------+---------+--------+
| 1| 33| null| joe| null|
| 2|null| usa| null| smith|
| 3| 44| aus| zoe| cooper|
| 4|null| null| john| doe|
+-------+----+-------+---------+--------+
*/
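If the key names are known up front (or collected once beforehand), you can pass them to pivot explicitly, which saves the extra job Spark otherwise runs to discover the distinct keys; a sketch based on the keys present in the sample data above:
// Explicit pivot values: Spark skips the distinct-key discovery pass.
val knownKeys = Seq("age", "country", "firstname", "lastname")
df.
  select($"user_id", explode(expr("str_to_map(key_values, ';', '=')"))).
  groupBy("user_id").pivot("key", knownKeys).agg(first("value").as("value")).
  orderBy("user_id").
  show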
Since your data is delimited by ; and the key/value pairs by =, you may consider using str_to_map as follows:
First create a temporary view of your data, e.g.
df.createOrReplaceTempView("my_table")
then run the query below on your Spark session:
val resultDf = spark.sql("""
  WITH split_data AS (
    SELECT
      UserId,
      str_to_map(col, ';', '=') AS full_name
    FROM
      my_table
  )
  SELECT
    UserId,
    full_name['firstname']  AS firstname,
    full_name['lastname']   AS lastname,
    full_name['middlename'] AS middlename
  FROM
    split_data
""")
This solution is proposed in accordance with the expanded requirement described in the other answer's comments section: the key_values column may contain duplicate keys, and only the duplicate-key columns should be aggregated as ArrayType.
There are probably other approaches. The solution below uses groupBy/pivot with collect_list, followed by extracting the single element (null if empty) from the non-duplicate key columns.
val df = Seq(
  (1, "firstname=joe;age=33;moviegenre=comedy"),
  (2, "lastname=smith;country=usa;moviegenre=drama"),
  (3, "firstname=zoe;lastname=cooper;age=44;country=aus"),
  (4, "firstname=john;lastname=doe;moviegenre=drama;moviegenre=comedy")
).toDF("user_id", "key_values")

val mainCols = df.columns diff Seq("key_values")

val dfNew = df.
  withColumn("kv_arr", split($"key_values", ";")).
  withColumn("kv", explode(expr("transform(kv_arr, kv -> split(kv, '='))"))).
  groupBy("user_id").pivot($"kv"(0)).agg(collect_list($"kv"(1)))

val dupeKeys = Seq("moviegenre")  // user-provided
val nonDupeKeys = dfNew.columns diff (mainCols ++ dupeKeys)

dfNew.select(
    mainCols.map(col) ++
    dupeKeys.map(col) ++
    nonDupeKeys.map(k => when(size(col(k)) > 0, col(k)(0)).as(k)): _*
  ).
  orderBy("user_id").  // only for ordered output
  show
/*
+-------+---------------+----+-------+---------+--------+
|user_id| moviegenre| age|country|firstname|lastname|
+-------+---------------+----+-------+---------+--------+
| 1| [comedy]| 33| null| joe| null|
| 2| [drama]|null| usa| null| smith|
| 3| []| 44| aus| zoe| cooper|
| 4|[drama, comedy]|null| null| john| doe|
+-------+---------------+----+-------+---------+--------+
*/
Note that higher-order function transform is used to handle the key/value split, as SQL function str_to_map (used in the original solution) can't handle duplicate keys.
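To illustrate why: in Spark 3.x, building a map with duplicated keys is governed by spark.sql.mapKeyDedupPolicy (EXCEPTION by default, LAST_WIN keeps only the last value), so a duplicated genre is either rejected or silently dropped. A small probe, assuming the same toDF setup as above:
// With the default EXCEPTION policy this fails with a duplicate-map-key error;
// with LAST_WIN only one of the two genres survives -- either way a value is lost.
val probe = Seq((4, "moviegenre=drama;moviegenre=comedy")).toDF("user_id", "key_values")
probe.select(expr("str_to_map(key_values, ';', '=')").as("kv")).show(false)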

Filter DF using the column of another DF (same col in both DF) Spark Scala

I am trying to filter a DataFrame DF1 using a column of another DataFrame DF2; the column is country_id. I want to reduce the first DataFrame to only the rows whose country appears in the second DF. An example:
+--------------+------------+-------+
|Date          | country_id | value |
+--------------+------------+-------+
|2015-12-14    |ARG         |5      |
|2015-12-14    |GER         |1      |
|2015-12-14    |RUS         |1      |
|2015-12-14    |CHN         |3      |
|2015-12-14    |USA         |1      |
+--------------+------------+-------+

+-----+------------+
|USE  | country_id |
+-----+------------+
| F   |RUS         |
| F   |CHN         |
+-----+------------+
Expected:
+--------------+------------+-------+
|Date          | country_id | value |
+--------------+------------+-------+
|2015-12-14    |RUS         |1      |
|2015-12-14    |CHN         |3      |
+--------------+------------+-------+
How could I do this? I am new to Spark, so I have thought about maybe using intersect? Or would another method be more efficient?
Thanks in advance!
You can use left semi join:
val DF3 = DF1.join(DF2, Seq("country_id"), "left_semi")
DF3.show
//+----------+----------+-----+
//|country_id| Date|value|
//+----------+----------+-----+
//| RUS|2015-12-14| 1|
//| CHN|2015-12-14| 3|
//+----------+----------+-----+
You can also use an inner join:
val DF3 = DF1.alias("a").join(DF2.alias("b"), Seq("country_id")).select("a.*")
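If DF2 (the country list) is small, you can additionally hint a broadcast join so the filter avoids shuffling DF1; a minimal sketch using the broadcast function from org.apache.spark.sql.functions:
import org.apache.spark.sql.functions.broadcast

// Same left-semi filter, but DF2 is shipped to every executor instead of shuffled.
val DF3 = DF1.join(broadcast(DF2), Seq("country_id"), "left_semi")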

Creating a new dataframe with many rows for each row in existing dataframe

I currently have a dataframe
df1 =
+-----+
| val|
+-----+
| 1|
| 2|
| 3|
....
| 2456|
+-----+
Each value corresponds to a single cell in a 3d cube.
I have a function findNeighbors which returns a list of the neighboring cubes, which I then map to df1 to get the neighbors of every row.
df2 = df1.map(row => findNeighbors(row(0).toInt))
This results in something like
df2 =
+---------------+
| neighbors|
+---------------+
| (1,2), (1, 7)|
| (2,1), (2, 3)|
.... etc
+---------------+
where, for each row, each array's first item is the value of the cell and the second is the value of one of its neighbors.
I now want to create a new dataframe that takes all of those nested arrays and makes them rows like this:
finalDF =
+-----+------+
| cell|neighb|
+-----+------+
| 1| 2|
| 1| 7|
| 2| 1|
| 2| 3|
.... etc
+-----+------+
And this is where I am stuck
I tried using the code below, but I can't append to a local dataframe from within the foreach function.
var df: DataFrame = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], my_schema)
val colNames = Seq("cell", "neighb")
neighborsDf.foreach(row => {
var rowDf: DataFrame = row.toDF(colNames: _*)
df.union(rowDf)
})
I'm sure there is a much better way to approach this problem, but I'm very new and very lost in scala/spark, and 10 hours of googling hasn't helped me.
Starting a little down the track, a somewhat similar example:
val df2 = df.select(explode($"neighbours").as("neighbours_flat"))
val df3 = df2.select(
  col("neighbours_flat").getItem(0) as "cell",
  col("neighbours_flat").getItem(1) as "neighbour"
)
df3.show(false)
Starting from the neighbours field definition:
+----------------+
|neighbours_flat |
+----------------+
|[[1, 2], [1, 7]]|
|[[2, 1], [2, 3]]|
+----------------+
results in:
+----+---------+
|cell|neighbour|
+----+---------+
|1 |2 |
|1 |7 |
|2 |1 |
|2 |3 |
+----+---------+
You need to have an array column definition and then use explode, as sketched below.
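To get such an array column in the first place, one option is to wrap findNeighbors in a UDF; this is only a sketch, under the assumption that findNeighbors(cell: Int) returns a Seq of (cell, neighbour) pairs as described in the question:
import org.apache.spark.sql.functions._

// Hypothetical wrapper: each (cell, neighbour) tuple becomes a 2-element array,
// so the column is an array of arrays and matches the neighbours_flat shape above.
val findNeighborsUdf = udf((cell: Int) => findNeighbors(cell).map { case (c, n) => Array(c, n) })

// df here plays the role of the dataframe used in the explode step above.
val df = df1.select(findNeighborsUdf(col("val")).as("neighbours"))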

Calculate residual amount in dataframe column

I have a "capacity" dataframe:
scala> sql("create table capacity (id String, capacity Int)");
scala> sql("insert into capacity values ('A', 50), ('B', 100)");
scala> sql("select * from capacity").show(false)
+---+--------+
|id |capacity|
+---+--------+
|A |50 |
|B |100 |
+---+--------+
I have another "used" dataframe with following information:
scala> sql ("create table used (id String, capacityId String, used Int)");
scala> sql ("insert into used values ('item1', 'A', 10), ('item2', 'A', 20), ('item3', 'A', 10), ('item4', 'B', 30), ('item5', 'B', 40), ('item6', 'B', 40)")
scala> sql("select * from used order by capacityId").show(false)
+-----+----------+----+
|id |capacityId|used|
+-----+----------+----+
|item1|A |10 |
|item3|A |10 |
|item2|A |20 |
|item6|B |40 |
|item4|B |30 |
|item5|B |40 |
+-----+----------+----+
Column "capacityId" of the "used" dataframe is foreign key to column "id" of the "capacity" dataframe.
I want to calculate the "capacityLeft" column, which is the residual amount at that point in time.
+-----+----------+----+--------------+
|id |capacityId|used| capacityLeft |
+-----+----------+----+--------------+
|item1|A |10 |40 | <- 50(capacity of 'A')-10
|item3|A |10 |30 | <- 40-10
|item2|A |20 |10 | <- 30-20
|item6|B |40 |60 | <- 100(capacity of 'B')-40
|item4|B |30 |30 | <- 60-30
|item5|B |40 |-10 | <- 30-40
+-----+----------+----+--------------+
In the real scenario, a "createdDate" column is used for ordering the "used" dataframe.
Spark version: 2.2
This can be solved by using window functions in Spark. Note that for this to work there needs to exist a column that keeps track of the row order for each capacityId.
Start by joining the two dataframes together:
val df = used.join(capacity.withColumnRenamed("id", "capacityId"), Seq("capacityId"), "inner")
Here the id column in the capacity dataframe is renamed to capacityId to match the column name in the used dataframe, so as not to end up with duplicate columns after the join.
Now create a window and calculate the cumulative sum of the used column. Take the value of the capacity and subtract the cumulative sum to get the remaining amount:
import org.apache.spark.sql.expressions.Window

val w = Window.partitionBy("capacityId").orderBy("createdDate")
val df2 = df.withColumn("capacityLeft", $"capacity" - sum($"used").over(w))
Resulting dataframe with example createdDate column:
+----------+-----+----+-----------+--------+------------+
|capacityId| id|used|createdDate|capacity|capacityLeft|
+----------+-----+----+-----------+--------+------------+
| B|item6| 40| 1| 100| 60|
| B|item4| 30| 2| 100| 30|
| B|item5| 40| 3| 100| -10|
| A|item1| 10| 1| 50| 40|
| A|item3| 10| 2| 50| 30|
| A|item2| 20| 3| 50| 10|
+----------+-----+----+-----------+--------+------------+
Any unwanted columns can now be removed with drop.
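For example, to drop the joined capacity column once capacityLeft has been computed (keep createdDate if you still need it for ordering):
val result = df2.drop("capacity")
result.show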

Splitting a row into multiple rows in spark-shell

I have imported data into a Spark dataframe in spark-shell. The data is filled in like:
Col1 | Col2 | Col3 | Col4
A1 | 11 | B2 | a|b;1;0xFFFFFF
A1 | 12 | B1 | 2
A2 | 12 | B2 | 0xFFF45B
Here in Col4 the values are of different kinds and I want to separate them (suppose "a|b" is of type alphabets, "1" or "2" is of type digit, and "0xFFFFFF" or "0xFFF45B" is of type hexadecimal number):
So, the output should be :
Col1 | Col2 | Col3 | alphabets | digits | hexadecimal
A1 | 11 | B2 | a | 1 | 0xFFFFFF
A1 | 11 | B2 | b | 1 | 0xFFFFFF
A1 | 12 | B1 | | 2 |
A2 | 12 | B2 | | | 0xFFF45B
Hope I've made my query clear to you and I am using spark-shell. Thanks in advance.
Edit, after getting this answer about how to use a backreference in regexp_replace:
You can use regexp_replace with a backreference, then split twice and explode. It is, imo, cleaner than my original solution.
val df = List(
("A1" , "11" , "B2" , "a|b;1;0xFFFFFF"),
("A1" , "12" , "B1" , "2"),
("A2" , "12" , "B2" , "0xFFF45B")
).toDF("Col1" , "Col2" , "Col3" , "Col4")
val regExStr = "^([A-z|]+)?;?(\\d+)?;?(0x.*)?$"
val res = df
.withColumn("backrefReplace",
split(regexp_replace('Col4,regExStr,"$1;$2;$3"),";"))
.select('Col1,'Col2,'Col3,
explode(split('backrefReplace(0),"\\|")).as("letter"),
'backrefReplace(1) .as("digits"),
'backrefReplace(2) .as("hexadecimal")
)
+----+----+----+------+------+-----------+
|Col1|Col2|Col3|letter|digits|hexadecimal|
+----+----+----+------+------+-----------+
| A1| 11| B2| a| 1| 0xFFFFFF|
| A1| 11| B2| b| 1| 0xFFFFFF|
| A1| 12| B1| | 2| |
| A2| 12| B2| | | 0xFFF45B|
+----+----+----+------+------+-----------+
You still need to replace the empty strings by null, though; one way is sketched below.
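A possible way to do that replacement, for the result above (all columns are strings here, so it is applied across the board):
// Turn empty strings into proper nulls, column by column.
val resClean = res.select(res.columns.map(c =>
  when(col(c) === "", lit(null)).otherwise(col(c)).as(c)
): _*)
resClean.show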
Previous Answer (somebody might still prefer it):
Here is a solution that sticks to DataFrames but is also quite messy. You can first use regexp_extract three times (possible to do less with backreference?), and finally split on "|" and explode. Note that you need a coalesce for explode to return everything (you still might want to change the empty strings in letter to null in this solution).
val res = df
.withColumn("alphabets", regexp_extract('Col4,"(^[A-z|]+)?",1))
.withColumn("digits", regexp_extract('Col4,"^([A-z|]+)?;?(\\d+)?;?(0x.*)?$",2))
.withColumn("hexadecimal",regexp_extract('Col4,"^([A-z|]+)?;?(\\d+)?;?(0x.*)?$",3))
.withColumn("letter",
explode(
split(
coalesce('alphabets,lit("")),
"\\|"
)
)
)
res.show
+----+----+----+--------------+---------+------+-----------+------+
|Col1|Col2|Col3| Col4|alphabets|digits|hexadecimal|letter|
+----+----+----+--------------+---------+------+-----------+------+
| A1| 11| B2|a|b;1;0xFFFFFF| a|b| 1| 0xFFFFFF| a|
| A1| 11| B2|a|b;1;0xFFFFFF| a|b| 1| 0xFFFFFF| b|
| A1| 12| B1| 2| null| 2| null| |
| A2| 12| B2| 0xFFF45B| null| null| 0xFFF45B| |
+----+----+----+--------------+---------+------+-----------+------+
Note: The regexp part could be so much better with backreference, so if somebody knows how to do it, please comment!
Not sure this is doable while staying 100% with Dataframes, here's a (somewhat messy?) solution using RDDs for the split itself:
import org.apache.spark.sql.functions._
import sqlContext.implicits._
// we switch to RDD to perform the split of Col4 into 3 columns
val rddWithSplitCol4 = input.rdd.map { r =>
val indexToValue = r.getAs[String]("Col4").split(';').map {
case s if s.startsWith("0x") => 2 -> s
case s if s.matches("\\d+") => 1 -> s
case s => 0 -> s
}
val newCols: Array[String] = indexToValue.foldLeft(Array.fill[String](3)("")) {
case (arr, (index, value)) => arr.updated(index, value)
}
(r.getAs[String]("Col1"), r.getAs[Int]("Col2"), r.getAs[String]("Col3"), newCols(0), newCols(1), newCols(2))
}
// switch back to Dataframe and explode alphabets column
val result = rddWithSplitCol4
.toDF("Col1", "Col2", "Col3", "alphabets", "digits", "hexadecimal")
.withColumn("alphabets", explode(split(col("alphabets"), "\\|")))
result.show(truncate = false)
// +----+----+----+---------+------+-----------+
// |Col1|Col2|Col3|alphabets|digits|hexadecimal|
// +----+----+----+---------+------+-----------+
// |A1 |11 |B2 |a |1 |0xFFFFFF |
// |A1 |11 |B2 |b |1 |0xFFFFFF |
// |A1 |12 |B1 | |2 | |
// |A2 |12 |B2 | | |0xFFF45B |
// +----+----+----+---------+------+-----------+