Validate data from the same column in different rows with pyspark

How can I change the value of a column depending on a validation between cells? What I need is to compare the kilometraje values of each customer's (id) records to check whether each following record has a higher kilometraje than the one before it.
fecha id estado id_cliente error_code kilometraje error_km
1/1/2019 1 A 1 10
2/1/2019 2 A ERROR 20
3/1/2019 1 D 1 ERROR 30
4/1/2019 2 O ERROR
The error in the error_km column is there because for customer (id) 2 the kilometraje value is less than that customer's record from 2/1/2019. (As time passes the car is used, so the kilometraje increases; for a record to be error-free its mileage has to be higher than or equal to the previous one.)
I know that with withColumn I can overwrite or create a column that doesn't exist, and that using when I can set conditions. For example, this is the code I use to validate the estado and id_cliente columns and overwrite error_code with ERROR where applicable, but I don't understand how to validate between different rows for the same client.
from pyspark import StorageLevel
from pyspark.sql import functions as F
from pyspark.sql.functions import col, lit

file_path = 'archive.txt'
error = 'ERROR'

df = spark.read.parquet(file_path)
df = df.persist(StorageLevel.MEMORY_AND_DISK)
df = df.select('estado', 'id_cliente')
df = df.withColumn('error_code', lit(''))

df = df.withColumn('error_code',
                   F.when(((F.col('estado') == 'O') & (F.col('id_cliente') != '')) |
                          ((F.col('estado') == 'D') & (F.col('id_cliente') != '')) |
                          ((F.col('estado') == 'A') & (F.col('id_cliente') == '')),
                          F.concat(F.col('error_code'), F.lit(':[{}]'.format(error))))
                    .otherwise(F.col('error_code')))

You can achieve that with the lag window function. lag returns the value from the previous row within the window, so you can easily compare the kilometraje values. Have a look at the code below:
import pyspark.sql.functions as F
from pyspark.sql import Window

l = [('1/1/2019', 1, 10),
     ('2/1/2019', 2, 20),
     ('3/1/2019', 1, 30),
     ('4/1/2019', 1, 10),
     ('5/1/2019', 1, 30),
     ('7/1/2019', 3, 30),
     ('4/1/2019', 2, 5)]
columns = ['fecha', 'id', 'kilometraje']

df = spark.createDataFrame(l, columns)
df = df.withColumn('fecha', F.to_date(df.fecha, 'dd/MM/yyyy'))

# One window per customer, ordered by date
w = Window.partitionBy('id').orderBy('fecha')

# Flag a row when the previous kilometraje (lag) is greater than the current one
df = df.withColumn('error_km',
                   F.when(F.lag('kilometraje').over(w) > df.kilometraje, F.lit('ERROR'))
                    .otherwise(F.lit('')))
df.show()
Output:
+----------+---+-----------+--------+
| fecha| id|kilometraje|error_km|
+----------+---+-----------+--------+
|2019-01-01| 1| 10| |
|2019-01-03| 1| 30| |
|2019-01-04| 1| 10| ERROR|
|2019-01-05| 1| 30| |
|2019-01-07| 3| 30| |
|2019-01-02| 2| 20| |
|2019-01-04| 2| 5| ERROR|
+----------+---+-----------+--------+
The fourth row doesn't get labeled with 'ERROR' because the previous value had a smaller kilometraje (10 < 30). If you want to label every id that contains at least one corrupted row with 'ERROR', perform a left join.
df.drop('error_km').join(
    df.filter(df.error_km == 'ERROR')
      .groupby('id')
      .agg(F.first(df.error_km).alias('error_km')),
    'id', 'left').show()

I use .rangeBetween(Window.unboundedPreceding, 0).
This window frame reaches from the start of each partition up to the current row, so the running maximum always covers every earlier record for the same vehicle.
import pyspark
from pyspark.sql import functions as F
from pyspark.sql.functions import col, lit
from pyspark.sql import Window
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .getOrCreate()

error = 'This is error'
l = [('1/1/2019', 1, 10),
     ('2/1/2019', 2, 20),
     ('3/1/2019', 1, 30),
     ('4/1/2019', 1, 10),
     ('5/1/2019', 1, 22),
     ('7/1/2019', 1, 23),
     ('22/1/2019', 2, 5),
     ('11/1/2019', 2, 24),
     ('13/2/2019', 1, 16),
     ('14/2/2019', 2, 18),
     ('5/2/2019', 1, 19),
     ('6/2/2019', 2, 23),
     ('7/2/2019', 1, 14),
     ('8/3/2019', 1, 50),
     ('8/3/2019', 2, 50)]
columns = ['date', 'vin', 'mileage']

df = spark.createDataFrame(l, columns)
df = df.withColumn('date', F.to_date(df.date, 'dd/MM/yyyy'))
df = df.withColumn('max', lit(0))
df = df.withColumn('error_code', lit(''))

# Running window per vehicle: everything from the first record up to the current row
w = Window.partitionBy('vin').orderBy('date').rangeBetween(Window.unboundedPreceding, 0)

# Running maximum of the mileage seen so far, then flag rows that fall below it
df = df.withColumn('max', F.max('mileage').over(w))
df = df.withColumn('error_code',
                   F.when(F.col('mileage') < F.col('max'), F.lit('ERROR')).otherwise(F.lit('')))
df.show()
Finally, all that remains is to drop the helper column that holds the running maximum:
df = df.drop('max')
df.show()
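As a small variant (just a sketch, reusing the window w defined above), the running maximum can also be computed inline inside the when, so no helper column has to be created and dropped afterwards:
df = df.withColumn('error_code',
                   F.when(F.col('mileage') < F.max('mileage').over(w), F.lit('ERROR'))
                    .otherwise(F.lit('')))
df.show()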

Related

I have a df [rn, rn1, rn2]. When rn is null, I want to generate a random number and assign it to rn1 and rn2

The df consists of these columns: [rn, rn1, rn2].
The condition is: if rn is null, generate a random number between 0 and 1000 and then assign that value to rn1 and rn2. Any suggestions, please?
I have tried all the possible options but could not figure it out since I'm new to Azure. Please help.
from pyspark.sql.functions import rand, col, when, floor

data = [(None, 200, 1000), (2, 300, 400), (None, 300, 500)]
df = spark.createDataFrame(data).toDF("rn", "rn1", "rn2")
df.select("*").show()
+----+---+----+
| rn|rn1| rn2|
+----+---+----+
|null|200|1000|
| 2|300| 400|
|null|300| 500|
+----+---+----+
df.select(df.rn,
          when(
              df.rn.isNull(),       # condition
              floor(rand() * 1000)  # true value
          ).otherwise(
              df.rn1                # false value
          ).alias("rn1")).show()
+----+---+
| rn|rn1|
+----+---+
|null|545|
| 2|300|
|null|494|
+----+---+
Rinse and repeat for rn2.
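For completeness, a minimal sketch (column names taken from the question; df2 and the helper column rnd are my own) that generates the random value once so rn1 and rn2 receive the same number whenever rn is null:
from pyspark.sql.functions import rand, col, when, floor

# Hypothetical helper column 'rnd' holds the shared random value per row
df2 = df.withColumn("rnd", floor(rand() * 1000))
df2 = (df2
       .withColumn("rn1", when(col("rn").isNull(), col("rnd")).otherwise(col("rn1")))
       .withColumn("rn2", when(col("rn").isNull(), col("rnd")).otherwise(col("rn2")))
       .drop("rnd"))
df2.show()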

Pyspark agg max abs val but keep sign

I'm aggregating on another column. For example, if the column was
ID val
A 10
A 100
A -150
A 15
B 10
B 200
B -150
B 15
I'd want to return the below (keeping the sign). I'm not sure how to do this while keeping the sign.
ID max(val)
A -150
B 200
Option 1: using a window function + row_number. We partition by ID and order by abs(val) descending. Then we simply take the first row.
import pyspark.sql.functions as F
from pyspark.sql import Window

data = [
    ('A', 10),
    ('A', 100),
    ('A', -150),
    ('A', 15),
    ('B', 10),
    ('B', 200),
    ('B', -150),
    ('B', 15)
]
df = spark.createDataFrame(data=data, schema=('ID', 'val'))

w = Window().partitionBy('ID').orderBy(F.abs('val').desc())
(df
 .withColumn('rn', F.row_number().over(w))
 .filter(F.col('rn') == 1)
 .drop('rn')
).show()
+---+----+
| ID| val|
+---+----+
| A|-150|
| B| 200|
+---+----+
Option 2: a solution which works with agg. We compare the max value to the max of the absolute values. If they match, we take the max; if they don't, we take the min. Note that this solution prefers the positive value in case of ties.
df.groupby('ID').agg(
    F.when(F.max('val') == F.max(F.abs('val')), F.max('val'))
     .otherwise(F.min('val')).alias('max_val')
).show()
+---+-------+
| ID|max_val|
+---+-------+
| A| -150|
| B| 200|
+---+-------+
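To make the tie behaviour concrete, here is a small hypothetical example (ID 'C' is not part of the original data): with values 150 and -150, max(val) equals max(abs(val)), so the positive 150 is kept.
tie_df = spark.createDataFrame([('C', 150), ('C', -150)], ('ID', 'val'))
tie_df.groupby('ID').agg(
    F.when(F.max('val') == F.max(F.abs('val')), F.max('val'))
     .otherwise(F.min('val')).alias('max_val')
).show()
# +---+-------+
# | ID|max_val|
# +---+-------+
# |  C|    150|
# +---+-------+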

pyspark create a structType with conditions in a dataframe

I have this dataframe
Item value
A 1
B 1
C 2
D 2
I want to get this
Item value
{A,B} 1
{C,D} 2
Something like groupby 'value' and combine 'Item'.
Your DF:
df = sqlContext.createDataFrame(
    [
        ('A', 1),
        ('B', 1),
        ('C', 2),
        ('D', 2)
    ],
    ['Item', 'value']
)
Steps are:
create a dummy matrix
apply a lambda function
map the list returned by the lambda function to the column names list
from pyspark.sql import functions as F
from pyspark.sql import types as T

def map_func(flags):
    z = []
    for index, elem in enumerate(flags):
        if elem == 1:
            z.append(cols[index])
    return z

df = df.groupBy("value").pivot("Item").agg(F.lit(1)).fillna(0)
cols = df.columns[1:]

# Declare an array return type so "Item" comes back as a real list column
map_func_lamb = F.udf(lambda row: map_func(row), T.ArrayType(T.StringType()))

df = df.withColumn("list", map_func_lamb(F.struct([df[x] for x in df.columns[1:]])))\
       .select(F.col("list").alias("Item"), "value")
df.show()
+------+-----+
| Item|value|
+------+-----+
|[A, B]| 1|
|[C, D]| 2|
+------+-----+
Not sure if you really want a dict as a result or if a list is fine, but I don't see much difference at this point.
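If a plain Python dict is actually what you're after, a minimal sketch (assuming the grouped df above) is to collect the result to the driver:
# Builds e.g. {1: ['A', 'B'], 2: ['C', 'D']} on the driver
result = {row['value']: row['Item'] for row in df.collect()}
print(result)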

pyspark, get rows where first column value equals id and second column value is between two values, do this for each row in a dataframe

So I have one pyspark dataframe like so, let's call it dataframe a:
+-------------------+---------------+----------------+
| reg| val1| val2 |
+-------------------+---------------+----------------+
| N110WA| 1590030660| 1590038340000|
| N876LF| 1590037200| 1590038880000|
| N135MH| 1590039060| 1590040080000|
And another like this, let's call it dataframe b:
+-----+-------------+-----+-----+---------+----------+---+----+
| reg| postime| alt| galt| lat| long|spd| vsi|
+-----+-------------+-----+-----+---------+----------+---+----+
|XY679|1590070078549| 50| 130|18.567169|-69.986343|132|1152|
|HI949|1590070091707| 375| 455| 18.5594|-69.987804|148|1344|
|JX784|1590070110666| 825| 905|18.544968|-69.990414|170|1216|
Is there some way to create a numpy array or pyspark dataframe where, for each row in dataframe a, all the rows in dataframe b with the same reg and a postime between val1 and val2 are included?
You can try the solution below -- let us know if it works or if anything else is expected.
I have modified the inputs a little in order to showcase the working solution.
Input here
from pyspark.sql import functions as F
df_a = spark.createDataFrame([('N110WA', 1590030660, 1590038340000),
                              ('N110WA', 1590070078549, 1590070078559)],
                             ["reg", "val1", "val2"])
df_b = spark.createDataFrame([('N110WA', 1590070078549)], ["reg", "postime"])
df_a.show()
df_a
+------+-------------+-------------+
| reg| val1| val2|
+------+-------------+-------------+
|N110WA| 1590030660|1590038340000|
|N110WA|1590070078549|1590070078559|
+------+-------------+-------------+
df_b
+------+-------------+
| reg| postime|
+------+-------------+
|N110WA|1590070078549|
+------+-------------+
Solution here
from pyspark.sql import types as T
from pyspark.sql import functions as F

# The join brings postime from df_b onto df_a so the range check can be applied
df_a = df_a.join(df_b, 'reg', 'left')
df_a = df_a.withColumn('condition_col',
                       F.when((F.col('postime') >= F.col('val1')) & (F.col('postime') <= F.col('val2')), '1')
                        .otherwise('0'))
df_a = df_a.filter(F.col('condition_col') == 1).drop('condition_col')
df_a.show()
Final Output
+------+-------------+-------------+-------------+
| reg| val1| val2| postime|
+------+-------------+-------------+-------------+
|N110WA|1590070078549|1590070078559|1590070078549|
+------+-------------+-------------+-------------+
Yes, assuming df_a and df_b are both pyspark dataframes, you can use an inner join in pyspark:
delta = 0  # tolerance around the [val1, val2] window; adjust as needed
df = df_a.join(df_b, [
    df_a.reg == df_b.reg,
    df_b.postime >= df_a.val1 - delta,
    df_b.postime <= df_a.val2 + delta
], "inner")
This will filter the results to include only the rows that satisfy those conditions.

How to replace leading 0 with 91 using regex in pyspark dataframe

In Python I am doing this to replace a leading 0 in the column phone with 91.
But how do I do it in pyspark?
The con dataframe is:
id phone1
1 088976854667
2 089706790002
Output I want is
1 9188976854667
2 9189706790002
# Replace leading Zeros in a phone number with 91
con.filter(regex='[_]').replace('^0','385',regex=True)
You are looking for the regexp_replace function. This function takes 3 parameters:
column name
pattern
replacement
from pyspark.sql import functions as F
columns = ['id', 'phone1']
vals = [(1, '088976854667'),(2, '089706790002' )]
df = spark.createDataFrame(vals, columns)
df = df.withColumn('phone1', F.regexp_replace('phone1',"^0", "91"))
df.show()
Output:
+---+-------------+
| id| phone1|
+---+-------------+
| 1|9188976854667|
| 2|9189706790002|
+---+-------------+