Split values in dataframe into separate dataframe column -spark scala - scala

My requirement is to convert the below dataframe
df.show()
Id | vals
1 | name=John || age=25 || col1 =val1 || col2= val2
2 | name=Joe || age=23 || col1 =val11 || col2= val22
Into
Id | name | age | col1 | col2
1 | John | 25 | val1 | val2
2 | Joe | 23 | val11 |val22
Please assist me with this.

To generate the wanted result in a dynamic fashion, here's one approach that uses a mix of split and explode to transform column vals into an ArrayType column of [key, value] (e.g. ["name", "john"]), followed by a grouping by id and pivot on the key to aggregate value:
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(
(1, "name=John || age=25 || col1 =val1 || col2= val2"),
(2, "name=Joe || age=23 || col1 =val11 || col2= val22")
).toDF("id", "vals")
df.
withColumn("flattened", explode(split($"vals", "\\s*\\|\\|\\s*"))).
withColumn("kv_array", split($"flattened", "\\s*=\\s*")).
groupBy($"id").pivot($"kv_array"(0)).agg(first($"kv_array"(1))).
show
// +---+---+-----+-----+----+
// |id |age|col1 |col2 |name|
// +---+---+-----+-----+----+
// |1 |25 |val1 |val2 |John|
// |2 |23 |val11|val22|Joe |
// +---+---+-----+-----+----+

You could use spark sql split function to split your string and convert to array[string] and then select the columns accordingly. Something like below:
val df1 = df.withColumn("vals",split($"vals","\\|\\|"))
.select($"id",split($"vals"(0),"=")(1).alias("name"),
split($"vals"(1),"=")(1).alias("age"),
split($"vals"(2),"=")(1).alias("col1"),
split($"vals"(3),"=")(1).alias("col2"))

Related

Merge multiple spark rows inside dataframe by ID into one row based on update_time

we need to merge multiple rows based on ID into a single record using Pyspark. If there are multiple updates to the column, then we have to select the one with the last update made to it.
Please note, NULL would mean there was no update made to the column in that instance.
So, basically we have to create a single row with the consolidated updates made to the records.
So,for example, if this is the dataframe ...
Looking for similar answer, but in Pyspark .. Merge rows in a spark scala Dataframe
------------------------------------------------------------
| id | column1 | column2 | updated_at |
------------------------------------------------------------
| 123 | update1 | <*no-update*> | 1634228709 |
| 123 | <*no-update*> | 80 | 1634228724 |
| 123 | update2 | <*no-update*> | 1634229000 |
expected output is -
------------------------------------------------------------
| id | column1 | column2 | updated_at |
------------------------------------------------------------
| 123 | update2 | 80 | 1634229000 |
Let's say that our input dataframe is:
+---+-------+----+----------+
|id |col1 |col2|updated_at|
+---+-------+----+----------+
|123|null |null|1634228709|
|123|null |80 |1634228724|
|123|update2|90 |1634229000|
|12 |update1|null|1634221233|
|12 |null |80 |1634228333|
|12 |update2|null|1634221220|
+---+-------+----+----------+
What we want is to covert updated_at to TimestampType then order by id and updated_at in desc order:
df = df.withColumn("updated_at", F.col("updated_at").cast(TimestampType())).orderBy(
F.col("id"), F.col("updated_at").desc()
)
that gives us:
+---+-------+----+-------------------+
|id |col1 |col2|updated_at |
+---+-------+----+-------------------+
|12 |null |80 |2021-10-14 18:18:53|
|12 |update1|null|2021-10-14 16:20:33|
|12 |update2|null|2021-10-14 16:20:20|
|123|update2|90 |2021-10-14 18:30:00|
|123|null |80 |2021-10-14 18:25:24|
|123|null |null|2021-10-14 18:25:09|
+---+-------+----+-------------------+
Now get first non None value in each column or return None and group by id:
exp = [F.first(x, ignorenulls=True).alias(x) for x in df.columns[1:]]
df = df.groupBy(F.col("id")).agg(*exp)
And the result is:
+---+-------+----+-------------------+
|id |col1 |col2|updated_at |
+---+-------+----+-------------------+
|123|update2|90 |2021-10-14 18:30:00|
|12 |update1|80 |2021-10-14 18:18:53|
+---+-------+----+-------------------+
Here's the full example code:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import TimestampType
if __name__ == "__main__":
spark = SparkSession.builder.master("local").appName("Test").getOrCreate()
data = [
(123, None, None, 1634228709),
(123, None, 80, 1634228724),
(123, "update2", 90, 1634229000),
(12, "update1", None, 1634221233),
(12, None, 80, 1634228333),
(12, "update2", None, 1634221220),
]
columns = ["id", "col1", "col2", "updated_at"]
df = spark.createDataFrame(data, columns)
df = df.withColumn("updated_at", F.col("updated_at").cast(TimestampType())).orderBy(
F.col("id"), F.col("updated_at").desc()
)
exp = [F.first(x, ignorenulls=True).alias(x) for x in df.columns[1:]]
df = df.groupBy(F.col("id")).agg(*exp)

Spark scala dynamically create filter condition value from sequence Pair

I have a spark dataframe : df :
|id | year | month |
-------------------
| 1 | 2020 | 01 |
| 2 | 2019 | 03 |
| 3 | 2020 | 01 |
I have a sequence year_month = Seq[(2019,01),(2020,01),(2021,01)]
val year_map gets genrated dynamically based on code runs everytime
I want to filter the dataframe : df based on the year_month sequence for on ($year=seq[0] & $month = seq[1]) for each value pair in sequence year_month
You can achieve this by
Create a dataframe from year_month
Perform an inner join on year_month with your original dataframe on month and year
Choosing distinct records
The resulting dataframe will be the matched rows
Working Example
Setup
import spark.implicits._
val dfData = Seq((1,2020,1),(2,2019,3),(3,2020,1))
val df = dfData.toDF()
.selectExpr("_1 as id"," _2 as year","_3 as month")
df.createOrReplaceTempView("original_data")
val year_month = Seq((2019,1),(2020,1),(2021,1))
Step 1
// Create Temporary DataFrame
val yearMonthDf = year_month.toDF()
.selectExpr("_1 as year","_2 as month" )
yearMonthDf.createOrReplaceTempView("temp_year_month")
Step 2
var dfResult = spark.sql("select o.id, o.year, o.month from original_data o inner join temp_year_month t on o.year = t.year and o.month = t.month")
Step3
var dfResultDistinct = dfResult.distinct()
Output
dfResultDistinct.show()
+---+----+-----+
| id|year|month|
+---+----+-----+
| 1|2020| 1|
| 3|2020| 1|
+---+----+-----+
NB. If you are interested in finding the similar records that exist irrespective of the id. You could update the spark sql to the following (it has removed o.id)
select
o.year,
o.month
from
original_data o
inner join
temp_year_month t on o.year = t.year and
o.month = t.month
which would give as the result
+----+-----+
|year|month|
+----+-----+
|2020| 1|
+----+-----+

How to create a data frame in a for loop with the variable that is iterating in loop

So I have a huge data frame which is combination of individual tables, it has an identifier column at the end which specifies the table number as shown below
+----------------------------+
| col1 col2 .... table_num |
+----------------------------+
| x y 1 |
| a b 1 |
| . . . |
| . . . |
| q p 2 |
+----------------------------+
(original table)
I have to split this into multiple little dataframes based on table num. The number of tables combined to create this is pretty large so it's not feasible to individually create the disjoint subset dataframes, so I was thinking if I made a for loop iterating over min to max values of table_num I could achieve this task but I can't seem to do it, any help is appreciated.
This is what I came up with
for (x < min(table_num) to max(table_num)) {
var df(x)= spark.sql("select * from df1 where state = x")
df(x).collect()
but I don't think the declaration is right.
so essentially what I need is df's that look like this
+-----------------------------+
| col1 col2 ... table_num |
+-----------------------------+
| x y 1 |
| a b 1 |
+-----------------------------+
+------------------------------+
| col1 col2 ... table_num |
+------------------------------+
| xx xy 2 |
| aa bb 2 |
+------------------------------+
+-------------------------------+
| col1 col2 ... table_num |
+-------------------------------+
| xxy yyy 3 |
| aaa bbb 3 |
+-------------------------------+
... and so on ...
(how I would like the Dataframes split)
In Spark Arrays can be almost in data type. When made as vars you can dynamically add and remove elements from them. Below I am going to isolate the table nums into their own array, this is so I can easily iterate through them. After isolated I go through a while loop to add each table as a unique element to the DF Holder Array. To query the elements of the array use DFHolderArray(n-1) where n is the position you want to query with 0 being the first element.
//This will go and turn the distinct row nums in a queriable (this is 100% a word) array
val tableIDArray = inputDF.selectExpr("table_num").distinct.rdd.map(x=>x.mkString.toInt).collect
//Build the iterator
var iterator = 1
//holders for DF and transformation step
var tempDF = spark.sql("select 'foo' as bar")
var interimDF = tempDF
//This will be an array for dataframes
var DFHolderArray : Array[org.apache.spark.sql.DataFrame] = Array(tempDF)
//loop while the you have note reached end of array
while(iterator<=tableIDArray.length) {
//Call the table that is stored in that location of the array
tempDF = spark.sql("select * from df1 where state = '" + tableIDArray(iterator-1) + "'")
//Fluff
interimDF = tempDF.withColumn("User_Name", lit("Stack_Overflow"))
//If logic to overwrite or append the DF
DFHolderArray = if (iterator==1) {
Array(interimDF)
} else {
DFHolderArray ++ Array(interimDF)
}
iterator = iterator + 1
}
//To query the data
DFHolderArray(0).show(10,false)
DFHolderArray(1).show(10,false)
DFHolderArray(2).show(10,false)
//....
Approach is to collect all unique keys and build respective data frames. I added some functional flavor to it.
Sample dataset:
name,year,country,id
Bayern Munich,2014,Germany,7747
Bayern Munich,2014,Germany,7747
Bayern Munich,2014,Germany,7746
Borussia Dortmund,2014,Germany,7746
Borussia Mönchengladbach,2014,Germany,7746
Schalke 04,2014,Germany,7746
Schalke 04,2014,Germany,7753
Lazio,2014,Germany,7753
Code:
val df = spark.read.format(source = "csv")
.option("header", true)
.option("delimiter", ",")
.option("inferSchema", true)
.load("groupby.dat")
import spark.implicits._
//collect data for each key into a data frame
val uniqueIds = df.select("id").distinct().map(x => x.mkString.toInt).collect()
// List buffer to hold separate data frames
var dataframeList: ListBuffer[org.apache.spark.sql.DataFrame] = ListBuffer()
println(uniqueIds.toList)
// filter data
uniqueIds.foreach(x => {
val tempDF = df.filter(col("id") === x)
dataframeList += tempDF
})
//show individual data frames
for (tempDF1 <- dataframeList) {
tempDF1.show()
}
One approach would be to write the DataFrame as partitioned Parquet files and read them back into a Map, as shown below:
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(
("a", "b", 1), ("c", "d", 1), ("e", "f", 1),
("g", "h", 2), ("i", "j", 2)
).toDF("c1", "c2", "table_num")
val filePath = "/path/to/parquet/files"
df.write.partitionBy("table_num").parquet(filePath)
val tableNumList = df.select("table_num").distinct.map(_.getAs[Int](0)).collect
// tableNumList: Array[Int] = Array(1, 2)
val dfMap = ( for { n <- tableNumList } yield
(n, spark.read.parquet(s"$filePath/table_num=$n").withColumn("table_num", lit(n)))
).toMap
To access the individual DataFrames from the Map:
dfMap(1).show
// +---+---+---------+
// | c1| c2|table_num|
// +---+---+---------+
// | a| b| 1|
// | c| d| 1|
// | e| f| 1|
// +---+---+---------+
dfMap(2).show
// +---+---+---------+
// | c1| c2|table_num|
// +---+---+---------+
// | g| h| 2|
// | i| j| 2|
// +---+---+---------+

Scala: Using a spark sql function when selecting column from a dataframe

I have a two table/dataframe: A and B
A has following columns: cust_id, purch_date
B has one column: cust_id, col1 (col1 is not needed)
Following sample shows content of each table:
Table A
cust_id purch_date
34564 2017-08-21
34564 2017-08-02
34564 2017-07-21
23847 2017-09-13
23423 2017-06-19
Table B
cust_id col1
23442 x
12452 x
12464 x
23847 x
24354 x
I want to select the cust_id and first day of month of purch_date where the selected cust_id are not there in B.
This can be achieved in SQL by following command:
select a.cust_id, trunc(purch_date, 'MM') as mon
from a
left join b
on a.cust_id = b.cust_id
where b.cust_id is null
group by cust_id, mon;
Following will be the output:
Table A
cust_id purch_date
34564 2017-08-01
34564 2017-07-01
23423 2017-06-01
I tried the following to implement the same in Scala:
import org.apache.spark.sql.functions._
a = spark.sql("select * from db.a")
b = spark.sql("select * from db.b")
var out = a.join(b, Seq("cust_id"), "left")
.filter("col1 is null")
.select("cust_id", trunc("purch_date", "month"))
.distinct()
But I am getting different errors like:
error: type mismatch; found: StringContext required: ?{def $: ?}
I am stuck here and couldn't find enough documentation/answers on net.
Select should contain Columns instead of Strings:
Input:
df1:
+-------+----------+
|cust_id|purch_date|
+-------+----------+
| 34564|2017-08-21|
| 34564|2017-08-02|
| 34564|2017-07-21|
| 23847|2017-09-13|
| 23423|2017-06-19|
+-------+----------+
df2:
+-------+----+
|cust_id|col1|
+-------+----+
| 23442| X|
| 12452| X|
| 12464| X|
| 23847| X|
| 24354| X|
+-------+----+
Change your query as below:
df1.join(df2, Seq("cust_id"), "left").filter("col1 is null")
.select($"cust_id", trunc($"purch_date", "MM"))
.distinct()
.show()
Output:
+-------+---------------------+
|cust_id|trunc(purch_date, MM)|
+-------+---------------------+
| 23423| 2017-06-01|
| 34564| 2017-07-01|
| 34564| 2017-08-01|
+-------+---------------------+

Scala how to match two dfs if mathes then update the key in first df

I have data in two dataframes:
selectedPersonDF:
ID key
1
2
3
4
5
selectedDetailsDF:
first second third key
--------------------------
1 9 9 777
9 8 8 878
8 10 10 765
10 12 19 909
11 2 20 708
Code :
val personDF = spark.read.option("header", "true").option("inferSchema", "false").csv("person.csv")
val detailsDF = spark.read.option("header", "true").option("inferSchema", "false").csv("details.csv")
val selectedPersonDF=personDF.select((col("ID"),col("key"))).show()
val selectedDetailsDF=detailsDF.select(col("first"),col("second"),col("third"),col("key")).show()
I have to match the selectedPersonDF id column with selectedDetailsDF all the columns(First, Second, Third) if any of the column data matches with persons id then we have to take the key value from selectedDetailsDF and have to update in selectedPersonDF key column.
Expected output (in selectedPersonDF):
ID key
1 777
2 708
3
4
5
and after removing the first row from persons'df since its matched with detailsdf remaining data should be stored in another df.
You can use join and use || condition checking and left join as
val finalDF = selectedPersonDF.join(selectedDetailsDF.withColumnRenamed("key", "key2"), $"ID" === $"first" || $"ID" === $"second" || $"ID" === $"third", "left")
.select($"ID", $"key2".as("key"))
.show(false)
so finalDF should give you
+---+----+
|ID |key |
+---+----+
|1 |777 |
|2 |708 |
|3 |null|
|4 |null|
|5 |null|
+---+----+
We can call .na.fill("") on above dataframe (key column has to be StringType()) to get
+---+---+
|ID |key|
+---+---+
|1 |777|
|2 |708|
|3 | |
|4 | |
|5 | |
+---+---+
After that you can use filter to separate the dataframe into matching and non matching using key column with value and null repectively
val notMatchingDF = finalDF.filter($"key" === "")
val matchingDF = finalDF.except(notMatchingDF)
Updated for if the column names of selectedDetailsDF is unknown except the key column
If the column names of the second dataframe is unknown then you will have to form an array column of the unknown columns as
val columnsToCheck = selectedDetailsDF.columns.toSet - "key" toList
import org.apache.spark.sql.functions._
val tempSelectedDetailsDF = selectedDetailsDF.select(array(columnsToCheck.map(col): _*).as("array"), col("key").as("key2"))
Now tempSelectedDetailsDF dataframe has two columns: combined column of all the unknown columns as an array column and the key column renamed as key2.
After that you would need a udf function for checking the condition while joining
val arrayContains = udf((array: collection.mutable.WrappedArray[String], value: String) => array.contains(value))
And then you join the dataframes using the call to the defined udf function as
val finalDF = selectedPersonDF.join(tempSelectedDetailsDF, arrayContains($"array", $"ID"), "left")
.select($"ID", $"key2".as("key"))
.na.fill("")
Rest of the process is already defined above.
I hope the answer is helpful and understandable.