Removing alphabets from alphanumeric values present in column of dataframe of spark - scala

The two columns of the dataframe look like this:
SKU | COMPSKU
PT25M | PT10M
PT3H | PT20M
TH | QR12
S18M | JH
I am using Spark with Scala.
How can I remove all letters and retain only the numbers?
Expected output:
25|10
3|20
0|12
18|0

You could also do it this way.
df.withColumn(
"SKU",
when(regexp_replace(col("SKU"),"[a-zA-Z]","")==="",0
).otherwise(regexp_replace(col("SKU"),"[a-zA-Z]",""))
).withColumn(
"COMPSKU",
when(regexp_replace(col("COMPSKU"),"[a-zA-Z]","")==="", 0
).otherwise(regexp_replace(col("COMPSKU"),"[a-zA-Z]",""))
).show()
/*
+-----+-------+
| SKU|COMPSKU|
+-----+-------+
| 25 | 10 |
| 3 | 20 |
| 0 | 12 |
| 18 | 0 |
+-----+-------+
*/

Try the regexp_replace function, then use a when/otherwise expression to replace empty values with 0.
Example:
df.show()
/*
+-----+-------+
| SKU|COMPSKU|
+-----+-------+
|PT25M| PT10M|
| PT3H| PT20M|
| TH| QR12|
| S18M| JH|
+-----+-------+
*/
df.withColumn("SKU",regexp_replace(col("SKU"),"[a-zA-Z]","")).
withColumn("COMPSKU",regexp_replace(col("COMPSKU"),"[a-zA-Z]","")).
withColumn("SKU",when(length(trim(col("SKU")))===0,lit(0)).otherwise(col("SKU"))).
withColumn("COMPSKU",when(length(trim(col("COMPSKU")))===0,lit(0)).otherwise(col("COMPSKU"))).
show()
/*
+---+-------+
|SKU|COMPSKU|
+---+-------+
| 25| 10|
| 3| 20|
| 0| 12|
| 18| 0|
+---+-------+
*/
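If you want to sanity-check the rule itself outside Spark, here is a minimal plain-Python sketch of the same logic (strip ASCII letters, fall back to 0 when nothing is left). The helper name digits_or_zero is made up for illustration and is not part of any Spark API.

```python
import re

def digits_or_zero(value: str) -> str:
    """Strip ASCII letters; return "0" when nothing is left (hypothetical helper)."""
    stripped = re.sub(r"[a-zA-Z]", "", value).strip()
    return stripped if stripped else "0"

# Sample rows copied from the question.
rows = [("PT25M", "PT10M"), ("PT3H", "PT20M"), ("TH", "QR12"), ("S18M", "JH")]
cleaned = [(digits_or_zero(sku), digits_or_zero(comp)) for sku, comp in rows]
print(cleaned)
# [('25', '10'), ('3', '20'), ('0', '12'), ('18', '0')]
```

The values stay strings in this sketch; if you need integers you would convert them as a separate step.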

Related

Pyspark - advanced aggregation of monthly data

I have a table of the following format.
|---------------------|------------------|------------------|
| Customer | Month | Sales |
|---------------------|------------------|------------------|
| A | 3 | 40 |
|---------------------|------------------|------------------|
| A | 2 | 50 |
|---------------------|------------------|------------------|
| B | 1 | 20 |
|---------------------|------------------|------------------|
I need it in the format as below
|---------------------|------------------|------------------|------------------|
| Customer | Month 1 | Month 2 | Month 3 |
|---------------------|------------------|------------------|------------------|
| A | 0 | 50 | 40 |
|---------------------|------------------|------------------|------------------|
| B | 20 | 0 | 0 |
|---------------------|------------------|------------------|------------------|
Can you please help me out to solve this problem in PySpark?
This should help. I am assuming you are using SUM to aggregate values from the original DF.
>>> df.show()
+--------+-----+-----+
|Customer|Month|Sales|
+--------+-----+-----+
| A| 3| 40|
| A| 2| 50|
| B| 1| 20|
+--------+-----+-----+
>>> import pyspark.sql.functions as F
>>> df2=(df.withColumn('COLUMN_LABELS',F.concat(F.lit('Month '),F.col('Month')))
.groupby('Customer')
.pivot('COLUMN_LABELS')
.agg(F.sum('Sales'))
.fillna(0))
>>> df2.show()
+--------+-------+-------+-------+
|Customer|Month 1|Month 2|Month 3|
+--------+-------+-------+-------+
| A| 0| 50| 40|
| B| 20| 0| 0|
+--------+-------+-------+-------+
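To see what the pivot is doing, here is a rough plain-Python sketch of the same steps (sum per customer and month, one column per month, missing cells filled with 0). It assumes SUM as the aggregate, like the answer does, and the variable names are only illustrative.

```python
from collections import defaultdict

# Sample rows copied from the question: (Customer, Month, Sales).
rows = [("A", 3, 40), ("A", 2, 50), ("B", 1, 20)]

# Sum sales per (customer, month) pair, like groupby + agg(sum).
totals = defaultdict(int)
for customer, month, sales in rows:
    totals[(customer, month)] += sales

# One output column per distinct month; missing cells become 0, like fillna(0).
months = sorted({month for _, month, _ in rows})
customers = sorted({customer for customer, _, _ in rows})
pivoted = {
    customer: {f"Month {m}": totals.get((customer, m), 0) for m in months}
    for customer in customers
}
print(pivoted)
# {'A': {'Month 1': 0, 'Month 2': 50, 'Month 3': 40}, 'B': {'Month 1': 20, 'Month 2': 0, 'Month 3': 0}}
```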

How to enrich dataframe by adding columns in specific condition in pyspark?

I have two different dataframes:
users:
+-------+---------+--------+
|user_id| movie_id|timestep|
+-------+---------+--------+
| 100 | 1000 |20200728|
| 101 | 1001 |20200727|
| 101 | 1002 |20200726|
+-------+---------+--------+
movies:
+--------+---------+--------------------------+
|movie_id| title | genre |
+--------+---------+--------------------------+
| 1000 |Toy Story|Adventure|Animation|Chil..|
| 1001 | Jumanji |Adventure|Children|Fantasy|
| 1002 | Iron Man|Action|Adventure|Sci-Fi |
+--------+---------+--------------------------+
How can I get a dataframe in the following format, so that I can build each user's taste profile and compare users by their similarity score?
+-------+---------+---------+---------+--------+-----+
|user_id| Action |Adventure|Animation|Children|Drama|
+-------+---------+---------+---------+--------+-----+
| 100 | 0 | 1 | 1 | 1 | 0 |
| 101 | 1 | 2 | 0 | 1 | 0 |
+-------+---------+---------+---------+--------+-----+
First, you need to split your "genre" column.
from pyspark.sql import functions as F
movies = movies.withColumn("genre", F.explode(F.split("genre", r'\|')))
# escape | with a backslash because split takes a regex pattern (raw string avoids a Python escape warning)
Then you join:
user_movie = users.join(movies, on='movie_id')
And you pivot:
user_movie.groupBy("user_id").pivot("genre").agg(F.count("*")).fillna(0).show()
+-------+------+---------+---------+--------+-------+------+
|user_id|Action|Adventure|Animation|Children|Fantasy|Sci-Fi|
+-------+------+---------+---------+--------+-------+------+
| 100| 0| 1| 1| 1| 0| 0|
| 101| 1| 2| 0| 1| 1| 1|
+-------+------+---------+---------+--------+-------+------+
FYI: the Drama column does not appear because no movie in the movies dataframe has the Drama genre. With your full data, you will get one column for each genre.
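For intuition, the split, join and pivot steps can be sketched in plain Python as below. This is only an illustration, not PySpark code. One assumption to flag: the genre string for movie 1000 is truncated in the question, and I assume it ends in Children because that is what the answer's output suggests.

```python
from collections import Counter, defaultdict

# (user_id, movie_id) pairs from the users table.
users = [(100, 1000), (101, 1001), (101, 1002)]

# movie_id -> genre string. The entry for 1000 is an ASSUMPTION: the question
# truncates it, and "Children" is inferred from the answer's output.
movies = {
    1000: "Adventure|Animation|Children",
    1001: "Adventure|Children|Fantasy",
    1002: "Action|Adventure|Sci-Fi",
}

# "Join" each user row to its movie, "explode" the genres, and count per user.
profile = defaultdict(Counter)
for user_id, movie_id in users:
    for genre in movies[movie_id].split("|"):  # str.split is literal, no escaping needed
        profile[user_id][genre] += 1

# One column per genre seen in the data; a Counter returns 0 for missing genres.
genres = sorted({g for counts in profile.values() for g in counts})
table = {user: [counts[g] for g in genres] for user, counts in profile.items()}
print(genres)
print(table)
# ['Action', 'Adventure', 'Animation', 'Children', 'Fantasy', 'Sci-Fi']
# {100: [0, 1, 1, 1, 0, 0], 101: [1, 2, 0, 1, 1, 1]}
```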

Group rows that match sub string in a column using scala

I have the following df:
Zip | Name | id |
abc | xyz | 1 |
def | wxz | 2 |
abc | wex | 3 |
bcl | rea | 4 |
abc | txc | 5 |
def | rfx | 6 |
abc | abc | 7 |
I need to group all the names that contain 'x' by Zip and count them, using Scala.
Desired Output:
Zip | Count |
abc | 3 |
def | 2 |
Any help is highly appreciated
As @Shaido mentioned in the comment above, all you need is filter, groupBy and an aggregation:
import org.apache.spark.sql.functions._
df.filter(col("Name").contains("x")) // keep only the rows whose Name contains x
.groupBy("Zip") // group by the Zip column
.agg(count("Zip").as("Count")) // count the rows in each group
.show(false)
and you should have the desired output
+---+-----+
|Zip|Count|
+---+-----+
|abc|3 |
|def|2 |
+---+-----+
You want to group the data frame below.
+---+----+---+
|zip|name| id|
+---+----+---+
|abc| xyz| 1|
|def| wxz| 2|
|abc| wex| 3|
|bcl| rea| 4|
|abc| txc| 5|
|def| rfx| 6|
|abc| abc| 7|
+---+----+---+
Then you can simply call groupBy with the column name, followed by count. Note that this counts every row per zip; it does not apply the filter on names containing 'x' that the question asks for.
val groupedDf: DataFrame = df.groupBy("zip").count()
groupedDf.show()
// +---+-----+
// |zip|count|
// +---+-----+
// |bcl| 1|
// |abc| 4|
// |def| 2|
// +---+-----+
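The difference between the two answers (counting only names that contain x versus counting every row) can be checked with a small plain-Python sketch. It mirrors the logic only and is not Spark code.

```python
from collections import Counter

# Sample rows copied from the question: (Zip, Name, id).
rows = [
    ("abc", "xyz", 1), ("def", "wxz", 2), ("abc", "wex", 3), ("bcl", "rea", 4),
    ("abc", "txc", 5), ("def", "rfx", 6), ("abc", "abc", 7),
]

# Filter first, then count per Zip (what the question asks for).
filtered_counts = Counter(zip_code for zip_code, name, _ in rows if "x" in name)

# Count per Zip without any filter.
all_counts = Counter(zip_code for zip_code, _, _ in rows)

print(dict(filtered_counts))  # {'abc': 3, 'def': 2}
print(dict(all_counts))       # {'abc': 4, 'def': 2, 'bcl': 1}
```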

How to iterate over pairs in a column in Scala

I have a data frame like this, imported from a parquet file:
| Store_id | Date_d_id |
| 0 | 23-07-2017 |
| 0 | 26-07-2017 |
| 0 | 01-08-2017 |
| 0 | 25-08-2017 |
| 1 | 01-01-2016 |
| 1 | 04-01-2016 |
| 1 | 10-01-2016 |
What I am trying to achieve next is to loop through each customer's dates in pairs and get the day difference. Here is what it should look like:
| Store_id | Date_d_id | Day_diff |
| 0 | 23-07-2017 | null |
| 0 | 26-07-2017 | 3 |
| 0 | 01-08-2017 | 6 |
| 0 | 25-08-2017 | 24 |
| 1 | 01-01-2016 | null |
| 1 | 04-01-2016 | 3 |
| 1 | 10-01-2016 | 6 |
And finally, I would like to reduce the data frame to the average day difference by customer:
| Store_id | avg_diff |
| 0 | 7.75 |
| 1 | 3 |
I am very new to Scala and I don't even know where to start. Any help is highly appreciated! Thanks in advance.
Also, I am using Zeppelin notebook
One approach would be to use lag(Date) over a Window partition and a UDF to calculate the difference in days between consecutive rows, followed by grouping the DataFrame for the average difference in days. Note that Date_d_id is converted to yyyy-MM-dd format for proper String ordering within the Window partitions, and that rows without a prior date get a difference of 0, so they count toward the average:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val df = Seq(
(0, "23-07-2017"),
(0, "26-07-2017"),
(0, "01-08-2017"),
(0, "25-08-2017"),
(1, "01-01-2016"),
(1, "04-01-2016"),
(1, "10-01-2016")
).toDF("Store_id", "Date_d_id")
def daysDiff = udf(
(d1: String, d2: String) => {
import java.time.LocalDate
import java.time.temporal.ChronoUnit.DAYS
DAYS.between(LocalDate.parse(d1), LocalDate.parse(d2))
}
)
val df2 = df.
withColumn( "Date_ymd",
regexp_replace($"Date_d_id", """(\d+)-(\d+)-(\d+)""", "$3-$2-$1")).
withColumn( "Prior_date_ymd",
lag("Date_ymd", 1).over(Window.partitionBy("Store_id").orderBy("Date_ymd"))).
withColumn( "Days_diff",
when($"Prior_date_ymd".isNotNull, daysDiff($"Prior_date_ymd", $"Date_ymd")).
otherwise(0L))
df2.show
// +--------+----------+----------+--------------+---------+
// |Store_id| Date_d_id| Date_ymd|Prior_date_ymd|Days_diff|
// +--------+----------+----------+--------------+---------+
// | 1|01-01-2016|2016-01-01| null| 0|
// | 1|04-01-2016|2016-01-04| 2016-01-01| 3|
// | 1|10-01-2016|2016-01-10| 2016-01-04| 6|
// | 0|23-07-2017|2017-07-23| null| 0|
// | 0|26-07-2017|2017-07-26| 2017-07-23| 3|
// | 0|01-08-2017|2017-08-01| 2017-07-26| 6|
// | 0|25-08-2017|2017-08-25| 2017-08-01| 24|
// +--------+----------+----------+--------------+---------+
val resultDF = df2.groupBy("Store_id").agg(avg("Days_diff").as("Avg_diff"))
resultDF.show
// +--------+--------+
// |Store_id|Avg_diff|
// +--------+--------+
// | 1| 3.0|
// | 0| 8.25|
// +--------+--------+
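The note about String ordering in the answer above is easy to verify outside Spark. A small plain-Python sketch with the store 0 dates from the question shows that dd-MM-yyyy strings sort in the wrong order, while the ISO form sorts chronologically.

```python
from datetime import datetime

# Store 0 dates from the question, already in chronological order.
dates = ["23-07-2017", "26-07-2017", "01-08-2017", "25-08-2017"]

# Sorting the raw dd-MM-yyyy strings compares the day first, which is wrong.
print(sorted(dates))
# ['01-08-2017', '23-07-2017', '25-08-2017', '26-07-2017']

# Converting to yyyy-MM-dd makes string order match date order.
iso = [datetime.strptime(d, "%d-%m-%Y").date().isoformat() for d in dates]
print(sorted(iso))
# ['2017-07-23', '2017-07-26', '2017-08-01', '2017-08-25']
```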
You can use the lag function over a Window to get the previous date, then do some manipulation to get the final dataframe that you require.
First of all, the Date_d_id column needs to be converted to a timestamp so that sorting works correctly:
import org.apache.spark.sql.functions._
val timestapeddf = df.withColumn("Date_d_id", from_unixtime(unix_timestamp($"Date_d_id", "dd-MM-yyyy")))
which should give your dataframe as
+--------+-------------------+
|Store_id| Date_d_id|
+--------+-------------------+
| 0|2017-07-23 00:00:00|
| 0|2017-07-26 00:00:00|
| 0|2017-08-01 00:00:00|
| 0|2017-08-25 00:00:00|
| 1|2016-01-01 00:00:00|
| 1|2016-01-04 00:00:00|
| 1|2016-01-10 00:00:00|
+--------+-------------------+
then you can apply the lag function over window function and finally get the date difference as
import org.apache.spark.sql.expressions._
val windowSpec = Window.partitionBy("Store_id").orderBy("Date_d_id")
val laggeddf = timestapeddf.withColumn("Day_diff", when(lag("Date_d_id", 1).over(windowSpec).isNull, null).otherwise(datediff($"Date_d_id", lag("Date_d_id", 1).over(windowSpec))))
laggeddf should be
+--------+-------------------+--------+
|Store_id|Date_d_id |Day_diff|
+--------+-------------------+--------+
|0 |2017-07-23 00:00:00|null |
|0 |2017-07-26 00:00:00|3 |
|0 |2017-08-01 00:00:00|6 |
|0 |2017-08-25 00:00:00|24 |
|1 |2016-01-01 00:00:00|null |
|1 |2016-01-04 00:00:00|3 |
|1 |2016-01-10 00:00:00|6 |
+--------+-------------------+--------+
now the final step is to use groupBy and aggregation to find the average
laggeddf.groupBy("Store_id")
.agg(avg("Day_diff").as("avg_diff"))
which should give you
+--------+--------+
|Store_id|avg_diff|
+--------+--------+
| 0| 11.0|
| 1| 4.5|
+--------+--------+
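Note that avg already skips the null Day_diff values. If you instead want the first row of each store to count toward the average (treating its null as 0), divide the sum by the total row count:
laggeddf.groupBy("Store_id")
.agg((sum("Day_diff")/count(lit(1))).as("avg_diff"))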
which should give you
+--------+--------+
|Store_id|avg_diff|
+--------+--------+
| 0| 8.25|
| 1| 3.0|
+--------+--------+
I hope the answer is helpful
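The two sets of averages in this thread (11.0 and 4.5 versus 8.25 and 3.0) come down to whether the first row of each store is left out or counted as 0. Here is a plain-Python sketch of that arithmetic; it only mirrors the logic and does not use Spark, and the variable names are illustrative.

```python
from datetime import datetime

# Sample rows copied from the question: (Store_id, Date_d_id).
rows = [
    (0, "23-07-2017"), (0, "26-07-2017"), (0, "01-08-2017"), (0, "25-08-2017"),
    (1, "01-01-2016"), (1, "04-01-2016"), (1, "10-01-2016"),
]

# Group parsed dates by store.
by_store = {}
for store, text in rows:
    by_store.setdefault(store, []).append(datetime.strptime(text, "%d-%m-%Y").date())

avg_skipping_first = {}   # first row has no difference and is left out
avg_first_as_zero = {}    # first row counts as a difference of 0
for store, dates in by_store.items():
    dates.sort()
    # Day difference between each date and the previous one (the "lag").
    diffs = [(later - earlier).days for earlier, later in zip(dates, dates[1:])]
    avg_skipping_first[store] = sum(diffs) / len(diffs) if diffs else None
    avg_first_as_zero[store] = sum(diffs) / len(dates)

print(avg_skipping_first)  # {0: 11.0, 1: 4.5}
print(avg_first_as_zero)   # {0: 8.25, 1: 3.0}
```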

subtract two columns with null in spark dataframe

I am new to Spark. I have a dataframe df:
+----------+------------+-----------+
| Column1 | Column2 | Sub |
+----------+------------+-----------+
| 1 | 2 | 1 |
+----------+------------+-----------+
| 4 | null | null |
+----------+------------+-----------+
| 5 | null | null |
+----------+------------+-----------+
| 6 | 8 | 2 |
+----------+------------+-----------+
When subtracting the two columns, a null in one column makes the result null as well.
df.withColumn("Sub", col("Column1") - col("Column2"))
Expected output should be:
+----------+------------+-----------+
| Column1 | Column2 | Sub |
+----------+------------+-----------+
| 1 | 2 | 1 |
+----------+------------+-----------+
| 4 | null | 4 |
+----------+------------+-----------+
| 5 | null | 5 |
+----------+------------+-----------+
| 6 | 8 | 2 |
+----------+------------+-----------+
I don't want to replace the nulls in Column2 with 0; they should stay null.
Can someone help me on this?
You can use the when function:
import org.apache.spark.sql.functions._
df.withColumn("Sub", when(col("Column1").isNull, lit(0)).otherwise(col("Column1")) - when(col("Column2").isNull, lit(0)).otherwise(col("Column2")))
You should get the final result below (note the negative signs; wrap the expression in abs if you want the positive values shown in the question):
+-------+-------+----+
|Column1|Column2| Sub|
+-------+-------+----+
| 1| 2|-1.0|
| 4| null| 4.0|
| 5| null| 5.0|
| 6| 8|-2.0|
+-------+-------+----+
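You can coalesce nulls to zero on both columns, subtract, and take abs to match the expected positive values:
import org.apache.spark.sql.functions._
// toDF and $"..." need spark.implicits._ (already in scope in spark-shell)
val df = Seq[(Option[Int], Option[Int])](
(Some(1), Some(2)),
(Some(4), None),
(Some(5), None),
(Some(6), Some(8))
).toDF("A", "B")
df.withColumn("Sub", abs(coalesce($"A", lit(0)) - coalesce($"B", lit(0)))).show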
+---+----+---+
| A| B|Sub|
+---+----+---+
| 1| 2| 1|
| 4|null| 4|
| 5|null| 5|
| 6| 8| 2|
+---+----+---+
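The coalesce-then-subtract idea can be sketched in plain Python as below, with None standing in for null. The function name is made up for illustration. The original columns are left untouched, so the nulls in the second column stay null.

```python
def sub_null_as_zero(a, b):
    """Absolute difference, treating None as 0 (hypothetical helper)."""
    left = a if a is not None else 0
    right = b if b is not None else 0
    return abs(left - right)

# Sample rows copied from the question: (Column1, Column2).
rows = [(1, 2), (4, None), (5, None), (6, 8)]
result = [(a, b, sub_null_as_zero(a, b)) for a, b in rows]
print(result)
# [(1, 2, 1), (4, None, 4), (5, None, 5), (6, 8, 2)]
```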