I have an input dataframe with the following data:
id date_column
1 2011-07-09 11:29:31+0000
2 2011-07-09T11:29:31+0000
3 2011-07-09T11:29:31
4 2011-07-09T11:29:31+0000
I want to check whether date_column matches the format "%Y-%m-%dT%H:%M:%S+0000". If it matches, I want to add a column with the value 1, otherwise 0.
Currently, I have defined a UDF to do this:
from datetime import datetime

def date_pattern_matching(value, pattern):
    try:
        datetime.strptime(str(value), pattern)
        return "1"
    except:
        return "0"
It generates the output dataframe below:
id date_column output
1 2011-07-09 11:29:31+0000 0
2 2011-07-09T11:29:31+0000 1
3 2011-07-09T11:29:31 0
4 2011-07-09T11:29:31+0000 1
Execution through the UDF takes a lot of time; is there an alternative way to achieve this?
Try the regex-based pyspark.sql.Column.rlike method combined with a when/otherwise expression:
from pyspark.sql import functions as F
data = [[1, "2011-07-09 11:29:31+0000"],
        [1, "2011-07-09 11:29:31+0000"],
        [2, "2011-07-09T11:29:31+0000"],
        [3, "2011-07-09T11:29:31"],
        [4, "2011-07-09T11:29:31+0000"]]
df = spark.createDataFrame(data, ["id", "date_column"])
regex = "([0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}\+?\-?[0-9]{4})"
df_w_output = df.select("*", F.when(F.col("date_column").rlike(regex), 1).otherwise(0).alias("output"))
df_w_output.show()
Output
+---+------------------------+------+
|id |date_column |output|
+---+------------------------+------+
|1 |2011-07-09 11:29:31+0000|0 |
|1 |2011-07-09 11:29:31+0000|0 |
|2 |2011-07-09T11:29:31+0000|1 |
|3 |2011-07-09T11:29:31 |0 |
|4 |2011-07-09T11:29:31+0000|1 |
+---+------------------------+------+
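One caveat worth noting (not part of the original answer): rlike returns true if the pattern matches anywhere in the string, so if the whole value must conform to the format you may want to anchor the regex:

# Anchored variant: requires the entire string to match the pattern
regex_anchored = r"^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}[+-][0-9]{4}$"
df_w_output = df.select("*", F.when(F.col("date_column").rlike(regex_anchored), 1).otherwise(0).alias("output"))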
This question is similar to one I've asked before (Pandas pivot ussing column as suffix), but this time I need to do it using PySpark instead of pandas. The problem is as follows.
I have a dataframe like the following example:
Id    Type  Value_1  Value_2
----------------------------
1234  A     1        2
1234  B     1        2
789   A     1        2
789   B     1        2
567   A     1        2
And I want to transform it to get the following:
Id    Value_1_A  Value_1_B  Value_2_A  Value_2_B
------------------------------------------------
1234  1          1          2          2
789   1          1          2          2
567   1          1
In summary: replicate the value columns using the 'Type' column as a suffix and convert the dataframe to a wide format.
One solution I can think of is creating the columns with the suffix manually and then aggregating.
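For what it's worth, a rough sketch of that manual approach might look like this (column names and Type values are taken from the example; first(..., ignorenulls=True) is an arbitrary choice of aggregate):

import pyspark.sql.functions as F

value_cols = ["Value_1", "Value_2"]
types = ["A", "B"]

# Build one suffixed column per (value column, Type value) pair, then collapse per Id
df_manual = df.groupBy("Id").agg(
    *[F.first(F.when(F.col("Type") == t, F.col(v)), ignorenulls=True).alias(f"{v}_{t}")
      for v in value_cols for t in types]
)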
Another solution I've tried is PySpark's GroupedData pivot function, as follows:
import pandas as pd
df = spark.createDataFrame(pd.DataFrame({'Id': {0: 1234, 1: 1234, 2: 789, 3: 789, 4: 567},
                                         'Type': {0: 'A', 1: 'B', 2: 'A', 3: 'B', 4: 'A'},
                                         'Value_1': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
                                         'Value_2': {0: 2, 1: 2, 2: 2, 3: 2, 4: 2}}))
df.groupBy("Id").pivot("Type").avg().show()
The issue with this solution is that the resulting dataframe includes the Id column three times, and there is no way to name the columns with Type as a suffix; they end up named like this:
['Id',
'A_avg(Id)',
'A_avg(Value_1)',
'A_avg(Value_2)',
'B_avg(Id)',
'B_avg(Value_1)',
'B_avg(Value_2)']
I also tried specifying the value columns in the pivot function, as follows:
df.groupBy("Id").pivot("Type", values=["Value_1", "Value_2"]).avg().show()
This removes the extra Id columns, but the rest of the columns only have null values.
Is there any elegant way to do the transformation I'm attempting on pyspark?
Option 1:
If you don't mind having your Type values as column prefixes rather than suffixes, you can use a combination of agg, avg, and alias:
import pyspark.sql.functions as F
df_pivot = df \
    .groupBy("Id") \
    .pivot("Type") \
    .agg(F.avg("Value_1").alias("Value_1"), F.avg("Value_2").alias("Value_2"))
+----+---------+---------+---------+---------+
|Id |A_Value_1|A_Value_2|B_Value_1|B_Value_2|
+----+---------+---------+---------+---------+
|789 |1.0 |2.0 |1.0 |2.0 |
|567 |1.0 |2.0 |null |null |
|1234|1.0 |2.0 |1.0 |2.0 |
+----+---------+---------+---------+---------+
Separately, it's worth noting here that the values argument in the pivot method is used to limit which values you want to retain from your pivot (i.e., Type) column. For example, if you only wanted A and not B in your output, you would specify pivot("Type", values=["A"]).
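As a small illustration of that, using the same aggregation as Option 1:

# Keep only Type == "A" in the pivot output
df.groupBy("Id").pivot("Type", values=["A"]).agg(
    F.avg("Value_1").alias("Value_1"), F.avg("Value_2").alias("Value_2")
).show()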
Option 2:
If you do still want them as suffixes, you'll likely have to use some regex and withColumnRenamed, which could look something like this:
import pyspark.sql.functions as F
import re
df_pivot = df \
    .groupBy("Id") \
    .pivot("Type") \
    .agg(F.avg("Value_1"), F.avg("Value_2"))

for col in df_pivot.columns:
    if "avg(" in col:
        suffix = re.findall(r"^.*(?=_avg\()|$", col)[0]
        base_name = re.findall(r"(?<=\().*(?=\)$)|$", col)[0]
        df_pivot = df_pivot.withColumnRenamed(col, "_".join([base_name, suffix]))
+----+---------+---------+---------+---------+
|Id |Value_1_A|Value_2_A|Value_1_B|Value_2_B|
+----+---------+---------+---------+---------+
|789 |1.0 |2.0 |1.0 |2.0 |
|567 |1.0 |2.0 |null |null |
|1234|1.0 |2.0 |1.0 |2.0 |
+----+---------+---------+---------+---------+
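As a side note, the renaming loop above could also be written as a single toDF call over precomputed names (the same regexes, just a stylistic variation):

import re

def suffixed(name):
    # Turn e.g. "A_avg(Value_1)" into "Value_1_A"; leave other columns unchanged
    if "avg(" not in name:
        return name
    suffix = re.findall(r"^.*(?=_avg\()|$", name)[0]
    base_name = re.findall(r"(?<=\().*(?=\)$)|$", name)[0]
    return "_".join([base_name, suffix])

df_pivot = df_pivot.toDF(*[suffixed(c) for c in df_pivot.columns])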
I need to convert a Python script to PySpark, and it's proving to be a tough task for me.
I'm trying to remove null values from a dataframe (without removing the entire column or row) and shift the next value to the prior column. Example:
      | CLIENT | ANIMAL_1 | ANIMAL_2 | ANIMAL_3 | ANIMAL_4
ROW_1 | 1      | cow      | frog     | null     | dog
ROW_2 | 2      | pig      | null     | cat      | null
My goal is to have:
      | CLIENT | ANIMAL_1 | ANIMAL_2 | ANIMAL_3 | ANIMAL_4
ROW_1 | 1      | cow      | frog     | dog      | null
ROW_2 | 2      | pig      | cat      | null     | null
The code I'm using in Python is (which I got here on Stack Overflow):
df_out = df.apply(lambda x: pd.Series(x.dropna().to_numpy()), axis=1)
Then I rename the columns. But I have no idea how to do this in PySpark.
Here's a way to do this for Spark version 2.4+:
Create an array of the columns you want and sort by your conditions, which are the following:
Sort non-null values first
Sort values in the order they appear in the columns
We can do the sorting using array_sort. To achieve the multiple conditions, use arrays_zip. To make it easy to extract the value you want (i.e. the animal in this example), zip the column values as well.
from pyspark.sql.functions import array, array_sort, arrays_zip, col, lit
animal_cols = df.columns[1:]
N = len(animal_cols)
df_out = df.select(
    df.columns[0],
    array_sort(
        arrays_zip(
            array([col(c).isNull() for c in animal_cols]),
            array([lit(i) for i in range(N)]),
            array([col(c) for c in animal_cols])
        )
    ).alias('sorted')
)
df_out.show(truncate=False)
#+------+----------------------------------------------------------------+
#|CLIENT|sorted |
#+------+----------------------------------------------------------------+
#|1 |[[false, 0, cow], [false, 1, frog], [false, 3, dog], [true, 2,]]|
#|2 |[[false, 0, pig], [false, 2, cat], [true, 1,], [true, 3,]] |
#+------+----------------------------------------------------------------+
Now that things are in the right order, you just need to extract the values. In this case, the animal is the field named '2' at the i-th position of the sorted column.
df_out = df_out.select(
    df.columns[0],
    *[col("sorted")[i]['2'].alias(animal_cols[i]) for i in range(N)]
)
df_out.show(truncate=False)
#+------+--------+--------+--------+--------+
#|CLIENT|ANIMAL_1|ANIMAL_2|ANIMAL_3|ANIMAL_4|
#+------+--------+--------+--------+--------+
#|1 |cow |frog |dog |null |
#|2 |pig |cat |null |null |
#+------+--------+--------+--------+--------+
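As a side note, a different technique than the one above: on Spark 2.4+ the same left-shift can also be done with the filter higher-order function via expr. This sketch relies on out-of-range array indexing returning null, which is Spark's default (non-ANSI) behaviour:

from pyspark.sql.functions import expr

animal_cols = df.columns[1:]
N = len(animal_cols)

# Drop nulls from each row's animal values, keeping the original column order
df_alt = df.withColumn(
    "packed", expr("filter(array({}), x -> x is not null)".format(", ".join(animal_cols)))
)

# Spread the packed array back out; positions past the end come back as null
df_alt = df_alt.select(
    df.columns[0],
    *[df_alt["packed"][i].alias(animal_cols[i]) for i in range(N)]
)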
I have the following DataFrame in Spark and Scala:
group nodeId date
1 1 2016-10-12T12:10:00.000Z
1 2 2016-10-12T12:00:00.000Z
1 3 2016-10-12T12:05:00.000Z
2 1 2016-10-12T12:30:00.000Z
2 2 2016-10-12T12:35:00.000Z
I need to group records by group, sort them in ascending order by date and make pairs of sequential nodeId. Also, date should be converted to Unix epoch.
This can be better explained with the expected output:
group nodeId_1 nodeId_2 date
1 2 3 2016-10-12T12:00:00.000Z
1 3 1 2016-10-12T12:05:00.000Z
2 1 2 2016-10-12T12:30:00.000Z
This is what I did so far:
df
.groupBy("group")
.agg($"nodeId",$"date")
.orderBy(asc("date"))
But I don't know how to create pairs of nodeId.
You can use a Window function with the lead built-in function to create the pairs, and the to_utc_timestamp built-in function to convert the date strings to timestamps. Finally, filter out the unpaired rows, since you don't need them in the output.
The following program implements the explanation above; I have used comments for clarity.
import org.apache.spark.sql.expressions._
def windowSpec = Window.partitionBy("group").orderBy("date") //defining window function grouping by group and ordering by date
import org.apache.spark.sql.functions._
df.withColumn("date", to_utc_timestamp(col("date"), "Asia/Kathmandu")) //converting the date to epoch datetime you can choose other timezone as required
.withColumn("nodeId_2", lead("nodeId", 1).over(windowSpec)) //using window for creating pairs
.filter(col("nodeId_2").isNotNull) //filtering out the unpaired rows
.select(col("group"), col("nodeId").as("nodeId_1"), col("nodeId_2"), col("date")) //selecting as required final dataframe
.show(false)
You should get the final dataframe as required
+-----+--------+--------+-------------------+
|group|nodeId_1|nodeId_2|date |
+-----+--------+--------+-------------------+
|1 |2 |3 |2016-10-12 12:00:00|
|1 |3 |1 |2016-10-12 12:05:00|
|2 |1 |2 |2016-10-12 12:30:00|
+-----+--------+--------+-------------------+
I hope the answer is helpful
Note: to get the correct timestamp I have used Asia/Kathmandu as the timezone.
If I understand your requirement correctly, you can use a self-join on group and a < inequality condition on nodeId:
val df = Seq(
  (1, 1, "2016-10-12T12:10:00.000Z"),
  (1, 2, "2016-10-12T12:00:00.000Z"),
  (1, 3, "2016-10-12T12:05:00.000Z"),
  (2, 1, "2016-10-12T12:30:00.000Z"),
  (2, 2, "2016-10-12T12:35:00.000Z")
).toDF("group", "nodeId", "date")
df.as("df1").join(
df.as("df2"),
$"df1.group" === $"df2.group" && $"df1.nodeId" < $"df2.nodeId"
).select(
$"df1.group", $"df1.nodeId", $"df2.nodeId",
when($"df1.date" < $"df2.date", $"df1.date").otherwise($"df2.date").as("date")
)
// +-----+------+------+------------------------+
// |group|nodeId|nodeId|date |
// +-----+------+------+------------------------+
// |1 |1 |3 |2016-10-12T12:05:00.000Z|
// |1 |1 |2 |2016-10-12T12:00:00.000Z|
// |1 |2 |3 |2016-10-12T12:00:00.000Z|
// |2 |1 |2 |2016-10-12T12:30:00.000Z|
// +-----+------+------+------------------------+
I have data in two dataframes:
selectedPersonDF:
ID key
1
2
3
4
5
selectedDetailsDF:
first second third key
--------------------------
1 9 9 777
9 8 8 878
8 10 10 765
10 12 19 909
11 2 20 708
Code :
val personDF = spark.read.option("header", "true").option("inferSchema", "false").csv("person.csv")
val detailsDF = spark.read.option("header", "true").option("inferSchema", "false").csv("details.csv")
val selectedPersonDF = personDF.select(col("ID"), col("key"))
val selectedDetailsDF = detailsDF.select(col("first"), col("second"), col("third"), col("key"))
I have to match the ID column of selectedPersonDF against all the columns (first, second, third) of selectedDetailsDF. If any of those columns matches a person's ID, then we have to take the key value from selectedDetailsDF and update the key column in selectedPersonDF.
Expected output (in selectedPersonDF):
ID key
1 777
2 708
3
4
5
and after removing the matched rows from the persons dataframe (since they matched detailsDF), the remaining data should be stored in another dataframe.
You can use a left join with an || (or) condition check, as follows:
val finalDF = selectedPersonDF.join(selectedDetailsDF.withColumnRenamed("key", "key2"), $"ID" === $"first" || $"ID" === $"second" || $"ID" === $"third", "left")
  .select($"ID", $"key2".as("key"))

finalDF.show(false)
so finalDF should give you
+---+----+
|ID |key |
+---+----+
|1 |777 |
|2 |708 |
|3 |null|
|4 |null|
|5 |null|
+---+----+
We can call .na.fill("") on the above dataframe (the key column has to be StringType) to get
+---+---+
|ID |key|
+---+---+
|1 |777|
|2 |708|
|3 | |
|4 | |
|5 | |
+---+---+
After that, you can use filter to separate the dataframe into non-matching and matching rows, using the key column with empty and non-empty values respectively:
val filledDF = finalDF.na.fill("")
val notMatchingDF = filledDF.filter($"key" === "")
val matchingDF = filledDF.except(notMatchingDF)
Update: if the column names of selectedDetailsDF are unknown, apart from the key column
If the column names of the second dataframe are unknown, you will have to combine the unknown columns into an array column, as follows:
val columnsToCheck = (selectedDetailsDF.columns.toSet - "key").toList
import org.apache.spark.sql.functions._
val tempSelectedDetailsDF = selectedDetailsDF.select(array(columnsToCheck.map(col): _*).as("array"), col("key").as("key2"))
Now the tempSelectedDetailsDF dataframe has two columns: all the unknown columns combined into a single array column, and the key column renamed to key2.
After that, you need a udf function for checking the condition while joining:
val arrayContains = udf((array: collection.mutable.WrappedArray[String], value: String) => array.contains(value))
Then you join the dataframes using a call to the defined udf function:
val finalDF = selectedPersonDF.join(tempSelectedDetailsDF, arrayContains($"array", $"ID"), "left")
  .select($"ID", $"key2".as("key"))
  .na.fill("")
Rest of the process is already defined above.
I hope the answer is helpful and understandable.
I have an RDD with multiple rows, which looks like this:
val row = [(String, String), (String, String, String)]
The value is a sequence of tuples. In each tuple, the last String is a timestamp and the second one is the category. I want to filter this sequence based on the maximum timestamp for each category.
(A,B) Id Category Timestamp
-------------------------------------------------------
(123,abc) 1 A 2016-07-22 21:22:59+0000
(234,bcd) 2 B 2016-07-20 21:21:20+0000
(123,abc) 1 A 2017-07-09 21:22:59+0000
(345,cde) 4 C 2016-07-05 09:22:30+0000
(456,def) 5 D 2016-07-21 07:32:06+0000
(234,bcd) 2 B 2015-07-20 21:21:20+0000
I want one row for each of the categories. I'm looking for help getting the row with the max timestamp for each category. The result I'm looking for is:
(A,B) Id Category Timestamp
-------------------------------------------------------
(234,bcd) 2 B 2016-07-20 21:21:20+0000
(123,abc) 1 A 2017-07-09 21:22:59+0000
(345,cde) 4 C 2016-07-05 09:22:30+0000
(456,def) 5 D 2016-07-21 07:32:06+0000
Given the input dataframe as
+---------+---+--------+------------------------+
|(A,B) |Id |Category|Timestamp |
+---------+---+--------+------------------------+
|[123,abc]|1 |A |2016-07-22 21:22:59+0000|
|[234,bcd]|2 |B |2016-07-20 21:21:20+0000|
|[123,abc]|1 |A |2017-07-09 21:22:59+0000|
|[345,cde]|4 |C |2016-07-05 09:22:30+0000|
|[456,def]|5 |D |2016-07-21 07:32:06+0000|
|[234,bcd]|2 |B |2015-07-20 21:21:20+0000|
+---------+---+--------+------------------------+
You can do the following to get the result dataframe you require
import org.apache.spark.sql.functions._
val requiredDataframe = df.orderBy($"Timestamp".desc).groupBy("Category").agg(first("(A,B)").as("(A,B)"), first("Id").as("Id"), first("Timestamp").as("Timestamp"))
You should have the requiredDataframe as
+--------+---------+---+------------------------+
|Category|(A,B) |Id |Timestamp |
+--------+---------+---+------------------------+
|B |[234,bcd]|2 |2016-07-20 21:21:20+0000|
|D |[456,def]|5 |2016-07-21 07:32:06+0000|
|C |[345,cde]|4 |2016-07-05 09:22:30+0000|
|A |[123,abc]|1 |2017-07-09 21:22:59+0000|
+--------+---------+---+------------------------+
You can do the same by using a Window function, as below:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val windowSpec = Window.partitionBy("Category").orderBy($"Timestamp".desc)
df.withColumn("rank", rank().over(windowSpec)).filter($"rank" === lit(1)).drop("rank")