Conditional string manipulation in Pyspark - pyspark

I have a Pyspark dataframe, among others, a column of MSNs (of string type) like the following:
+------+
| Col1 |
+------+
| 654- |
| 1859 |
| 5875 |
| 784- |
| 596- |
| 668- |
| 1075 |
+------+
As you can see, those entries with a value of less than 1000 (i.e. three characters) have a - character at the end to make a total of 4 characters.
I want to get rid of that - character, so that I end up with something like:
+------+
| Col2 |
+------+
| 654 |
| 1859 |
| 5875 |
| 784 |
| 596 |
| 668 |
| 1075 |
+------+
I have tried the following code (where df is the dataframe containing the column, but it does not appear to work:
if df.Col1[3] == "-":
df = df.withColumn('Col2', df.series.substr(1, 3))
return df
else:
return df
Does anyone know how to do it?

You can replace - in the column with empty string ("") using F.regexp_replace
See the code below,
df.withColumn("Col2", F.regexp_replace("Col1", "-", "")).show()
+----+----+
|Col1|Col2|
+----+----+
|589-| 589|
|1245|1245|
|145-| 145|
+----+----+

Here is a solution using the .substr() method:
df.withColumn("Col2", F.when(F.col("Col1").substr(4, 1) == "-",
F.col("Col1").substr(1, 3)
).otherwise(
F.col("Col1"))).show()
+----+----+
|Col1|Col2|
+----+----+
|654-| 654|
|1859|1859|
|5875|5875|
|784-| 784|
|596-| 596|
|668-| 668|
|1075|1075|
+----+----+

Related

Removing alphabets from alphanumeric values present in column of dataframe of spark

The two column of dataframe looks like.
SKU | COMPSKU
PT25M | PT10M
PT3H | PT20M
TH | QR12
S18M | JH
spark with scala
How can i remove all alphabets and only numbers retain..
Expected output:
25|10
3|20
0|12
18|0
You could also do it this way.
df.withColumn(
"SKU",
when(regexp_replace(col("SKU"),"[a-zA-Z]","")==="",0
).otherwise(regexp_replace(col("SKU"),"[a-zA-Z]",""))
).withColumn(
"COMPSKU",
when(regexp_replace(col("COMPSKU"),"[a-zA-Z]","")==="", 0
).otherwise(regexp_replace(col("COMPSKU"),"[a-zA-Z]",""))
).show()
/*
+-----+-------+
| SKU|COMPSKU|
+-----+-------+
| 25 | 10 |
| 3 | 20 |
| 0 | 12 |
| 18 | 0 |
+-----+-------+
*/
Try with regexp_replace function then use case when otherwise statement to replace empty values with 0.
Example:
df.show()
/*
+-----+-------+
| SKU|COMPSKU|
+-----+-------+
|PT25M| PT10M|
| PT3H| PT20M|
| TH| QR12|
| S18M| JH|
+-----+-------+
*/
df.withColumn("SKU",regexp_replace(col("SKU"),"[a-zA-Z]","")).
withColumn("COMPSKU",regexp_replace(col("COMPSKU"),"[a-zA-Z]","")).
withColumn("SKU",when(length(trim(col("SKU")))===0,lit(0)).otherwise(col("SKU"))).
withColumn("COMPSKU",when(length(trim(col("COMPSKU")))===0,lit(0)).otherwise(col("COMPSKU"))).
show()
/*
+---+-------+
|SKU|COMPSKU|
+---+-------+
| 25| 10|
| 3| 20|
| 0| 12|
| 18| 0|
+---+-------+
*/

how to explode a spark dataframe

I exploded a nested schema but I am not getting what I want,
before exploded it looks like this:
df.show()
+----------+----------------------------------------------------------+
|CaseNumber| SourceId |
+----------+----------------------------------------------------------+
| 0 |[{"id":"1","type":"Sku"},{"id":"22","type":"ContractID"}] |
+----------|----------------------------------------------------------|
| 1 |[{"id":"3","type":"Sku"},{"id":"24","type":"ContractID"}] |
+---------------------------------------------------------------------+
I want it to be like this
+----------+-------------------+
| CaseNumber| Sku | ContractId |
+----------+-------------------+
| 0 | 1 | 22 |
+----------|------|------------|
| 1 | 3 | 24 |
+------------------------------|
Here is one way using the build-in get_json_object function:
import org.apache.spark.sql.functions.get_json_object
val df = Seq(
(0, """[{"id":"1","type":"Sku"},{"id":"22","type":"ContractID"}]"""),
(1, """[{"id":"3","type":"Sku"},{"id":"24","type":"ContractID"}]"""))
.toDF("CaseNumber", "SourceId")
df.withColumn("sku", get_json_object($"SourceId", "$[0].id").cast("int"))
.withColumn("ContractId", get_json_object($"SourceId", "$[1].id").cast("int"))
.drop("SourceId")
.show
// +----------+---+----------+
// |CaseNumber|sku|ContractId|
// +----------+---+----------+
// | 0| 1| 22|
// | 1| 3| 24|
// +----------+---+----------+
UPDATE
After our discussion we realised that the mentioned data is of array<struct<id:string,type:string>> type and not a simple string. Next is the solution for the new schema:
df.withColumn("sku", $"SourceIds".getItem(0).getField("id"))
.withColumn("ContractId", $"SourceIds".getItem(1).getField("id"))

How to iterate over pairs in a column in Scala

I have a data frame like this, imported from a parquet file:
| Store_id | Date_d_id |
| 0 | 23-07-2017 |
| 0 | 26-07-2017 |
| 0 | 01-08-2017 |
| 0 | 25-08-2017 |
| 1 | 01-01-2016 |
| 1 | 04-01-2016 |
| 1 | 10-01-2016 |
What I am trying to achieve next is to loop through each customer's date in pair and get the day difference. Here is what it should look like:
| Store_id | Date_d_id | Day_diff |
| 0 | 23-07-2017 | null |
| 0 | 26-07-2017 | 3 |
| 0 | 01-08-2017 | 6 |
| 0 | 25-08-2017 | 24 |
| 1 | 01-01-2016 | null |
| 1 | 04-01-2016 | 3 |
| 1 | 10-01-2016 | 6 |
And finally, I will like to reduce the data frame to the average day difference by customer:
| Store_id | avg_diff |
| 0 | 7.75 |
| 1 | 3 |
I am very new to Scala and I don't even know where to start. Any help is highly appreciated! Thanks in advance.
Also, I am using Zeppelin notebook
One approach would be to use lag(Date) over Window partition and a UDF to calculate the difference in days between consecutive rows, then follow by grouping the DataFrame for the average difference in days. Note that Date_d_id is converted to yyyy-mm-dd format for proper String ordering within the Window partitions:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val df = Seq(
(0, "23-07-2017"),
(0, "26-07-2017"),
(0, "01-08-2017"),
(0, "25-08-2017"),
(1, "01-01-2016"),
(1, "04-01-2016"),
(1, "10-01-2016")
).toDF("Store_id", "Date_d_id")
def daysDiff = udf(
(d1: String, d2: String) => {
import java.time.LocalDate
import java.time.temporal.ChronoUnit.DAYS
DAYS.between(LocalDate.parse(d1), LocalDate.parse(d2))
}
)
val df2 = df.
withColumn( "Date_ymd",
regexp_replace($"Date_d_id", """(\d+)-(\d+)-(\d+)""", "$3-$2-$1")).
withColumn( "Prior_date_ymd",
lag("Date_ymd", 1).over(Window.partitionBy("Store_id").orderBy("Date_ymd"))).
withColumn( "Days_diff",
when($"Prior_date_ymd".isNotNull, daysDiff($"Prior_date_ymd", $"Date_ymd")).
otherwise(0L))
df2.show
// +--------+----------+----------+--------------+---------+
// |Store_id| Date_d_id| Date_ymd|Prior_date_ymd|Days_diff|
// +--------+----------+----------+--------------+---------+
// | 1|01-01-2016|2016-01-01| null| 0|
// | 1|04-01-2016|2016-01-04| 2016-01-01| 3|
// | 1|10-01-2016|2016-01-10| 2016-01-04| 6|
// | 0|23-07-2017|2017-07-23| null| 0|
// | 0|26-07-2017|2017-07-26| 2017-07-23| 3|
// | 0|01-08-2017|2017-08-01| 2017-07-26| 6|
// | 0|25-08-2017|2017-08-25| 2017-08-01| 24|
// +--------+----------+----------+--------------+---------+
val resultDF = df2.groupBy("Store_id").agg(avg("Days_diff").as("Avg_diff"))
resultDF.show
// +--------+--------+
// |Store_id|Avg_diff|
// +--------+--------+
// | 1| 3.0|
// | 0| 8.25|
// +--------+--------+
You can use lag function to get the previous date over Window function, then do some manipulation to get the final dataframe that you require
first of all the Date_d_id column need to be converted to include timestamp for sorting to work correctly
import org.apache.spark.sql.functions._
val timestapeddf = df.withColumn("Date_d_id", from_unixtime(unix_timestamp($"Date_d_id", "dd-MM-yyyy")))
which should give your dataframe as
+--------+-------------------+
|Store_id| Date_d_id|
+--------+-------------------+
| 0|2017-07-23 00:00:00|
| 0|2017-07-26 00:00:00|
| 0|2017-08-01 00:00:00|
| 0|2017-08-25 00:00:00|
| 1|2016-01-01 00:00:00|
| 1|2016-01-04 00:00:00|
| 1|2016-01-10 00:00:00|
+--------+-------------------+
then you can apply the lag function over window function and finally get the date difference as
import org.apache.spark.sql.expressions._
val windowSpec = Window.partitionBy("Store_id").orderBy("Date_d_id")
val laggeddf = timestapeddf.withColumn("Day_diff", when(lag("Date_d_id", 1).over(windowSpec).isNull, null).otherwise(datediff($"Date_d_id", lag("Date_d_id", 1).over(windowSpec))))
laggeddf should be
+--------+-------------------+--------+
|Store_id|Date_d_id |Day_diff|
+--------+-------------------+--------+
|0 |2017-07-23 00:00:00|null |
|0 |2017-07-26 00:00:00|3 |
|0 |2017-08-01 00:00:00|6 |
|0 |2017-08-25 00:00:00|24 |
|1 |2016-01-01 00:00:00|null |
|1 |2016-01-04 00:00:00|3 |
|1 |2016-01-10 00:00:00|6 |
+--------+-------------------+--------+
now the final step is to use groupBy and aggregation to find the average
laggeddf.groupBy("Store_id")
.agg(avg("Day_diff").as("avg_diff"))
which should give you
+--------+--------+
|Store_id|avg_diff|
+--------+--------+
| 0| 11.0|
| 1| 4.5|
+--------+--------+
Now if you want to neglect the null Day_diff then you can do
laggeddf.groupBy("Store_id")
.agg((sum("Day_diff")/count($"Day_diff".isNotNull)).as("avg_diff"))
which should give you
+--------+--------+
|Store_id|avg_diff|
+--------+--------+
| 0| 8.25|
| 1| 3.0|
+--------+--------+
I hope the answer is helpful

subtract two columns with null in spark dataframe

I new to spark, I have dataframe df:
+----------+------------+-----------+
| Column1 | Column2 | Sub |
+----------+------------+-----------+
| 1 | 2 | 1 |
+----------+------------+-----------+
| 4 | null | null |
+----------+------------+-----------+
| 5 | null | null |
+----------+------------+-----------+
| 6 | 8 | 2 |
+----------+------------+-----------+
when subtracting two columns, one column has null so resulting column also resulting as null.
df.withColumn("Sub", col(A)-col(B))
Expected output should be:
+----------+------------+-----------+
| Column1 | Column2 | Sub |
+----------+------------+-----------+
| 1 | 2 | 1 |
+----------+------------+-----------+
| 4 | null | 4 |
+----------+------------+-----------+
| 5 | null | 5 |
+----------+------------+-----------+
| 6 | 8 | 2 |
+----------+------------+-----------+
I don't want to replace the column2 to replace with 0, it should be null only.
Can someone help me on this?
You can use when function as
import org.apache.spark.sql.functions._
df.withColumn("Sub", when(col("Column1").isNull(), lit(0)).otherwise(col("Column1")) - when(col("Column2").isNull(), lit(0)).otherwise(col("Column2")))
you should have final result as
+-------+-------+----+
|Column1|Column2| Sub|
+-------+-------+----+
| 1| 2|-1.0|
| 4| null| 4.0|
| 5| null| 5.0|
| 6| 8|-2.0|
+-------+-------+----+
You can coalesce nulls to zero on both columns and then do the subtraction:
val df = Seq((Some(1), Some(2)),
(Some(4), null),
(Some(5), null),
(Some(6), Some(8))
).toDF("A", "B")
df.withColumn("Sub", abs(coalesce($"A", lit(0)) - coalesce($"B", lit(0)))).show
+---+----+---+
| A| B|Sub|
+---+----+---+
| 1| 2| 1|
| 4|null| 4|
| 5|null| 5|
| 6| 8| 2|
+---+----+---+

how to ignore special row in the collect_list

I have a table like below.
| COLUMN A| COLUMN b|
| Case| 1111111111|
| Rectype| ABCD|
| Key| UMUM_REF_ID=A1234|
| UMSV ERROR| UNITS_ALLOW must|
| NTNB ERROR| GGGGGGG Value|
| Case| 2222222222|
| Rectype| ABCD|
| Key| UMUM_REF_ID=B8765|
| UMSV ERROR| UNITS_ALLOW must|
| NTNB ERROR| Invalid Value|
I want to add new column "C".
C is the collect_list "Case", "Rectype", "key", "UMSV ERROR" and "NTNB ERRO" in A.
My code is
val window = Window.rowsBetween(0,4)
val begin = rddDF.withColumn("C", collect_list( $"value").over( window)).where( $"A" like "Case")
begin.show()
It works well.
Now, I want to get the collect_list again but ignore the "NTNB ERROR" where its value in column b is "Invalid Value".
What should I do please?