Hive/PySpark: pivot non-numeric data for a huge dataset

I'm looking for a way to pivot an input dataset with the structure below in Hive or PySpark. The input contains more than half a billion records, and for each emp_id there are up to 8 rows and 5 possible columns, so I will end up with 40 columns. I did refer to this link, but there the pivoted output column already exists in the dataset; in mine it doesn't. I also tried this link, but the SQL is becoming very huge (not that it matters). Is there a better way to do this where the resultant pivoted column names are concatenated with the rank?
input
emp_id, dept_id, dept_name, rank
1001, 101, sales, 1
1001, 102, marketing, 2
1002, 101, sales, 1
1002, 102, marketing, 2
expected output
emp_id, dept_id_1, dept_name_1, dept_id_2, dept_name_2
1001, 101, sales, 102, marketing
1002, 101, sales, 102, marketing

You can use aggregations after pivoting; that gives you the option to rename the columns (via alias), like so:
import pyspark.sql.functions as F

(df
    .groupBy('emp_id')
    .pivot('rank')
    .agg(
        F.first('dept_id').alias('dept_id'),
        F.first('dept_name').alias('dept_name')
    )
    .show()
)
# Output
# +------+---------+-----------+---------+-----------+
# |emp_id|1_dept_id|1_dept_name|2_dept_id|2_dept_name|
# +------+---------+-----------+---------+-----------+
# | 1002| 101| sales| 102| marketing|
# | 1001| 101| sales| 102| marketing|
# +------+---------+-----------+---------+-----------+
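If you also want the column names in the dept_id_1 / dept_name_1 layout from the expected output, a small rename pass after the pivot should get you there. This is just a sketch added on top of the answer above (pivoted and renamed are illustrative names); it only moves the rank prefix that Spark generates to a suffix:
import pyspark.sql.functions as F

pivoted = (df
    .groupBy('emp_id')
    .pivot('rank')
    .agg(
        F.first('dept_id').alias('dept_id'),
        F.first('dept_name').alias('dept_name')
    ))

# Turn Spark's "1_dept_id" into "dept_id_1", "2_dept_name" into "dept_name_2", and so on
renamed = pivoted.select(
    'emp_id',
    *[F.col(c).alias(c.split('_', 1)[1] + '_' + c.split('_', 1)[0])
      for c in pivoted.columns if c != 'emp_id']
)
renamed.show()
# columns: emp_id, dept_id_1, dept_name_1, dept_id_2, dept_name_2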

Related

Pyspark copy values from other columns depending on a specific column

I want to create the New column depending on the Column A values using PySpark.
I want to take the Column B values for items of 1300 or greater when creating the new column,
but keep the Column C values for items less than 1300.
I am only a beginner. Thank you for your help.
Column A  Column B  Column C  New
1210      100       200       200
1300      70        50        70
1200      10        50        50
1310      15        300       15
I have tried to filter out items greater than 1300.
You can do so with col and when: when A >= 1300, take column B, otherwise column C.
from pyspark.sql.functions import when, col

data = [
    [1210, 100, 200],
    [1300, 70, 50],
    [1200, 10, 50],
    [1310, 15, 300]
]
df = spark.createDataFrame(data, ['A', 'B', 'C'])
df.withColumn('New', when(col('A') >= 1300, col('B')).otherwise(col('C'))).show()
+----+---+---+---+
| A| B| C|New|
+----+---+---+---+
|1210|100|200|200|
|1300| 70| 50| 70|
|1200| 10| 50| 50|
|1310| 15|300| 15|
+----+---+---+---+
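For what it's worth, the same condition can also be written as a SQL expression via expr; this is an equivalent sketch of the when/otherwise above, not a different approach:
from pyspark.sql import functions as F

# Same logic as when/otherwise, expressed as a CASE WHEN
df.withColumn('New', F.expr('CASE WHEN A >= 1300 THEN B ELSE C END')).show()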

Get all the rows which have mismatch between values in columns in pyspark dataframe

I have the following PySpark dataframe df1:
SL No  category 1   category 2
1      Apples       Oranges
2      Apples       APPLE FRUIT
3      Grapes       Grape
4      Bananas      Oranges
5      Orange       Grape
I want to get the rows of the dataframe where the values do not match between columns category 1 and category 2, handling case and partial string matches (category 1 only ever contains the strings Apples/Bananas/Orange/Grapes, and category 2 likewise only contains the distinct strings shown under it):
SL No  category 1   category 2
1      Apples       Oranges
4      Bananas      Oranges
5      Orange       Grape
First, please avoid column names with spaces.
My df
from pyspark.sql import functions as F
from pyspark.sql.functions import split, initcap, expr

df = spark.createDataFrame([(1, 'Apples', 'Oranges'),
                            (2, 'Apples', 'APPLE FRUIT'),
                            (3, 'Grapes', 'Grape'),
                            (4, 'Bananas', 'Oranges'),
                            (5, 'Orange', 'Grape')],
                           ('SL No', 'category1', 'category2'))
df.show()

new = (
    # Make the string columns share a common case and split them into arrays of words
    df.select('*', *[split(initcap(F.col(c)), r'\s').alias(c + '_1') for c in df.drop('SL No').columns])
    # Filter out the unwanted rows using a higher order function
    .filter(expr("filter(category1_1, (x,i) -> rlike(x, category2_1[i]))")[0].isNull())
    # Drop the helper columns that are no longer needed
    .drop('category1_1', 'category2_1')
)
new.show()
+-----+---------+---------+
|SL No|category1|category2|
+-----+---------+---------+
| 1| Apples| Oranges|
| 4| Bananas| Oranges|
| 5| Orange| Grape|
+-----+---------+---------+
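Since category 1 always holds a single word and only the first word of category 2 matters in this data, a plain prefix comparison might also be enough. This is an alternative sketch (not the answer above), and it assumes the cleaned column names category1 and category2 used in that answer:
from pyspark.sql import functions as F

c1 = F.initcap(F.col('category1'))
c2 = F.split(F.initcap(F.col('category2')), r'\s')[0]   # first word of category 2

# Keep only the rows where neither normalized value is a prefix of the other
mismatches = df.filter(~(c1.startswith(c2) | c2.startswith(c1)))
mismatches.show()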

Scala Spark Explode multiple columns pairs into rows

How can I explode multiple column pairs into multiple rows?
I have a dataframe with the following
client, type, address, type_2, address_2
abc, home, 123 Street, business, 456 Street
I want to have a final dataframe with the following:
client, type, address
abc, home, 123 Street
abc, business, 456 Street
I tried using the code below, but it returns 4 records instead of the two records I want:
df
.withColumn("type", explode(array("type", "type_2")))
.withColumn("address", explode(array("address", "address_2")))
I can do this with two separate dataframes and a union, but I wanted to see if there is another way to do it within a single dataframe.
Thanks
You can do it using structs:
df
.withColumn("str",explode(
array(
struct($"type",$"address"),
struct($"type_2".as("type"),$"address_2".as("address"))))
)
.select($"client",$"str.*")
.show()
gives
+------+--------+----------+
|client| type| address|
+------+--------+----------+
| abc| home|123 Street|
| abc|business|456 Street|
+------+--------+----------+
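Since the main question in this thread is PySpark, the same struct-and-explode idea might look roughly like this in Python. This is an added sketch, not part of the original Scala answer, and df2 is just an illustrative name:
from pyspark.sql import functions as F

df2 = (df
    .withColumn('str', F.explode(F.array(
        F.struct(F.col('type').alias('type'), F.col('address').alias('address')),
        F.struct(F.col('type_2').alias('type'), F.col('address_2').alias('address'))
    )))
    .select('client', 'str.*'))
df2.show()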
Here is a technique I use for complicated transformations: map the records of the dataframe and use Scala to apply a transformation of any complexity.
Here I am hard-coding the creation of 2 rows; however, any logic can be put here to explode rows as needed. I used flatMap to split the array of rows into rows.
val df = spark.createDataFrame(Seq(("abc","home","123 Street","business","456 Street"))).toDF("client", "type", "address","type_2","address_2")
df.map{ r =>
Seq((r.getAs[String]("client"),r.getAs[String]("type"),r.getAs[String]("address")),
(r.getAs[String]("client"),r.getAs[String]("type_2"),r.getAs[String]("address_2")))
}.flatMap(identity(_)).toDF("client", "type", "address").show(false)
Result
+------+--------+----------+
|client|type |address |
+------+--------+----------+
|abc |home |123 Street|
|abc |business|456 Street|
+------+--------+----------+

How to Merge values from multiple rows so they can be processed together - Spark scala

I have multiple database rows per personId, with columns that may or may not have values. I'm using colors here as the data is text, not numeric, so it doesn't lend itself to built-in aggregation functions. A simplified example is:
PersonId ColA ColB ColC
100 red
100 green
100 gold
100 green
110 yellow
110 white
110
120
etc...
I want to be able to decide in a function which column data to use per unique PersonId. A three-way join of the table against itself would be a good solution if the data didn't have multiple values (colors) per column; e.g. that join merges 3 of the rows into one but still produces multiple rows:
PersonId ColA ColB ColC
100 red green gold
100 green
110 white yellow
110
120
So the solution I'm looking for is something that will allow me to address all the values (colors) for a person in one place (function) so the decision can be made across all their data.
The real data of course has more columns, but the primary ones for this decision are the three shown. The data is being read in Scala Spark as a DataFrame, and I'd prefer using the API to SQL. I don't know if any of the exotic window or groupBy functions will help or if it's going to come down to plain old iterate and accumulate.
The technique used in How to aggregate values into collection after groupBy? might be applicable, but it's a bit of a leap.
Think of using a custom UDF for doing this.
import org.apache.spark.sql.functions._
val df = Seq((100, "red", null, null), (100, null, "white", null), (100, null, null, "green"), (200, null, "red", null)).toDF("PID", "A", "B", "C")
df.show()
+---+----+-----+-----+
|PID| A| B| C|
+---+----+-----+-----+
|100| red| null| null|
|100|null|white| null|
|100|null| null|green|
|200|null| red| null|
+---+----+-----+-----+
val customUDF = udf((array: Seq[String]) => {
  val newts = array.filter(_.nonEmpty)
  if (newts.size == 0) null
  else newts.head
})
df.groupBy($"PID").agg(customUDF(collect_set($"A")).as("colA"), customUDF(collect_set($"B")).as("colB"), customUDF(collect_set($"C")).as("colC")).show
+---+----+-----+-----+
|PID|colA| colB| colC|
+---+----+-----+-----+
|100| red|white|green|
|200|null| red| null|
+---+----+-----+-----+
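As an aside for PySpark readers: if the blank entries really are nulls, as in this example, the built-in first aggregation with ignorenulls=True should give the same result without a UDF. A sketch, not the original Scala answer:
from pyspark.sql import functions as F

# first(..., ignorenulls=True) picks the first non-null value in each group
df.groupBy('PID').agg(
    F.first('A', ignorenulls=True).alias('colA'),
    F.first('B', ignorenulls=True).alias('colB'),
    F.first('C', ignorenulls=True).alias('colC')
).show()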

Splitting Dataframe into two DataFrame

I have a dataframe which has unique as well as repeated records on the basis of number. Now I want to split the dataframe into two dataframes. In the first dataframe I need to copy only the unique rows, and in the second dataframe I want all the repeated rows. For example:
id name number
1 Shan 101
2 Shan 101
3 John 102
4 Michel 103
The two split dataframes should look like:
Unique
id name number
3 John 102
4 Michel 103
Repeated
id name number
1 Shan 101
2 Shan 101
The solution you tried could probably get you there.
Your data looks like this
val df = sc.parallelize(Array(
  (1, "Shan", 101),
  (2, "Shan", 101),
  (3, "John", 102),
  (4, "Michel", 103)
)).toDF("id", "name", "number")
Then you yourself suggest grouping and counting. If you do it like this:
val repeatedNames = df.groupBy("name").count.where(col("count")>1).withColumnRenamed("name","repeated").drop("count")
then you could actually get all the way by doing something like this afterwards:
val repeated = df.join(repeatedNames, repeatedNames("repeated")===df("name")).drop("repeated")
val distinct = df.except(repeated)
repeated.show:
+---+----+------+
| id|name|number|
+---+----+------+
| 1|Shan| 101|
| 2|Shan| 101|
+---+----+------+
distinct.show:
+---+------+------+
| id| name|number|
+---+------+------+
| 4|Michel| 103|
| 3| John| 102|
+---+------+------+
Hope it helps.
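For comparison, a window-based variant in PySpark (the language of the main question) avoids the self-join entirely. This is a sketch assuming a PySpark DataFrame df with the same id/name/number columns and that duplicates are defined by number, as in the question:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Count how many rows share each number, then split on that count
w = Window.partitionBy('number')
counted = df.withColumn('cnt', F.count(F.lit(1)).over(w))

repeated = counted.filter(F.col('cnt') > 1).drop('cnt')
unique = counted.filter(F.col('cnt') == 1).drop('cnt')

repeated.show()
unique.show()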