How to remove rows from pyspark dataframe using pattern matching? - pyspark

I have a pyspark dataframe read from a CSV file that has a value column which contains hexadecimal values.
| date | part | feature | value" |
|----------|-------|---------|--------------|
| 20190503 | par1 | feat2 | 0x0 |
| 20190503 | par1 | feat3 | 0x01 |
| 20190501 | par2 | feat4 | 0x0f32 |
| 20190501 | par5 | feat9 | 0x00 |
| 20190506 | par8 | feat2 | 0x00f45 |
| 20190507 | par1 | feat6 | 0x0e62300000 |
| 20190501 | par11 | feat3 | 0x000000000 |
| 20190501 | par21 | feat5 | 0x03efff |
| 20190501 | par3 | feat9 | 0x000 |
| 20190501 | par6 | feat5 | 0x000000 |
| 20190506 | par5 | feat8 | 0x034edc45 |
| 20190506 | par8 | feat1 | 0x00000 |
| 20190508 | par3 | feat6 | 0x00000000 |
| 20190503 | par4 | feat3 | 0x0c0deffe21 |
| 20190503 | par6 | feat4 | 0x0000000000 |
| 20190501 | par3 | feat6 | 0x0123fe |
| 20190501 | par7 | feat4 | 0x00000d0 |
The requirement is to remove rows whose value column contains values like 0x0, 0x00, 0x000, etc., which evaluate to decimal 0 (zero). The number of zeros after '0x' varies across the dataframe. I tried to remove them through pattern matching, but I wasn't successful.
from pyspark.sql.functions import col, regexp_extract
from pyspark.sql.types import StructField, StructType, StringType

myFile = sc.textFile("file.txt")
header = myFile.first()
fields = [StructField(field_name, StringType(), True) for field_name in header.split(',')]
schema = StructType(fields)
myFile_header = myFile.filter(lambda l: "date" in l)
myFile_NoHeader = myFile.subtract(myFile_header)
myFile_df = myFile_NoHeader.map(lambda line: line.split(",")).toDF(schema)
## this is the pattern match I tried
result = myFile_df.withColumn('Test', regexp_extract(col('value'), '(0x)(0\1*\1*)', 2))
result.show()
The other approach I tried was a udf:
def convert_value(x):
    return int(x, 16)
Using this udf in pyspark gives me:
ValueError: invalid literal for int() with base 16: value

I don't really understand your regular expression, but if you want to match all strings consisting of 0x followed by one or more zeros (and nothing else), you can use ^0x0+$. Filtering with a regular expression can be done with rlike, and the tilde (~) negates the match.
l = [('20190503', 'par1', 'feat2', '0x0'),
('20190503', 'par1', 'feat3', '0x01'),
('20190501', 'par2', 'feat4', '0x0f32'),
('20190501', 'par5', 'feat9', '0x00'),
('20190506', 'par8', 'feat2', '0x00f45'),
('20190507', 'par1', 'feat6', '0x0e62300000'),
('20190501', 'par11', 'feat3', '0x000000000'),
('20190501', 'par21', 'feat5', '0x03efff'),
('20190501', 'par3', 'feat9', '0x000'),
('20190501', 'par6', 'feat5', '0x000000'),
('20190506', 'par5', 'feat8', '0x034edc45'),
('20190506', 'par8', 'feat1', '0x00000'),
('20190508', 'par3', 'feat6', '0x00000000'),
('20190503', 'par4', 'feat3', '0x0c0deffe21'),
('20190503', 'par6', 'feat4', '0x0000000000'),
('20190501', 'par3', 'feat6', '0x0123fe'),
('20190501', 'par7', 'feat4', '0x00000d0')]
columns = ['date', 'part', 'feature', 'value']
df=spark.createDataFrame(l, columns)
expr = "^0x0+$"
df.filter(~ df["value"].rlike(expr)).show()
Output:
+--------+-----+-------+------------+
| date| part|feature| value|
+--------+-----+-------+------------+
|20190503| par1| feat3| 0x01|
|20190501| par2| feat4| 0x0f32|
|20190506| par8| feat2| 0x00f45|
|20190507| par1| feat6|0x0e62300000|
|20190501|par21| feat5| 0x03efff|
|20190506| par5| feat8| 0x034edc45|
|20190503| par4| feat3|0x0c0deffe21|
|20190501| par3| feat6| 0x0123fe|
|20190501| par7| feat4| 0x00000d0|
+--------+-----+-------+------------+
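As an aside, the udf approach from the question can also work: the ValueError suggests the literal string "value" (the column header) reached int(x, 16) rather than a hex literal. A minimal sketch, assuming myFile_df is built as in the question and hex_to_long is just an illustrative name:
from pyspark.sql.functions import col, udf
from pyspark.sql.types import LongType

# Python's int(x, 16) accepts the '0x' prefix, so the raw string can be parsed directly
hex_to_long = udf(lambda x: int(x, 16), LongType())

result = myFile_df.filter(hex_to_long(col('value')) != 0)
result.show()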

Related

How to replace null values in a dataframe based on values in other dataframe?

Here's a dataframe, df1, that I have:
+---------+-------+---------+
| C1 | C2 | C3 |
+---------+-------+---------+
| xr | 1 | ixfg |
| we | 5 | jsfd |
| r5 | 7 | hfga |
| by | 8 | srjs |
| v4 | 4 | qwks |
| c0 | 0 | khfd |
| ba | 2 | gdbu |
| mi | 1 | pdlo |
| lp | 7 | ztpq |
+---------+-------+---------+
Here's another dataframe, df2, that I have:
+----------+-------+---------+
| V1 | V2 | V3 |
+----------+-------+---------+
| Null | 6 | ixfg |
| Null | 2 | jsfd |
| Null | 2 | hfga |
| Null | 7 | qwks |
| Null | 1 | khfd |
| Null | 9 | gdbu |
+----------+-------+---------+
What I would like to have is another dataframe that:
Ignores the values in V2 and takes the values in C2 wherever V3 and C3 match, and
Replaces V1 with the values in C1 wherever V3 and C3 match.
The result should look like the following:
+----------+-------+---------+
| M1 | M2 | M3 |
+----------+-------+---------+
| xr | 1 | ixfg |
| we | 5 | jsfd |
| r5 | 7 | hfga |
| v4 | 4 | qwks |
| c0 | 0 | khfd |
| ba | 2 | gdbu |
+----------+-------+---------+
You can join and use coalesce to take the value with the higher priority.
Note that coalesce takes any number of columns (highest priority to lowest, in argument order) and returns the first non-null value, so if you actually want to end up with null when the lower-priority column is null, you cannot use this function.
from pyspark.sql import functions as F

df = (df1.join(df2, on=(df1.C3 == df2.V3))
      .select(F.coalesce(df1.C1, df2.V1).alias('M1'),
              F.coalesce(df1.C2, df2.V2).alias('M2'),
              F.col('C3').alias('M3')))
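For reference, a minimal self-contained sketch of the same idea, with the sample data reconstructed from the tables above (an explicit schema is given because V1 is entirely null, so its type cannot be inferred):
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame(
    [('xr', 1, 'ixfg'), ('we', 5, 'jsfd'), ('r5', 7, 'hfga'),
     ('by', 8, 'srjs'), ('v4', 4, 'qwks'), ('c0', 0, 'khfd'),
     ('ba', 2, 'gdbu'), ('mi', 1, 'pdlo'), ('lp', 7, 'ztpq')],
    'C1 string, C2 int, C3 string')

df2 = spark.createDataFrame(
    [(None, 6, 'ixfg'), (None, 2, 'jsfd'), (None, 2, 'hfga'),
     (None, 7, 'qwks'), (None, 1, 'khfd'), (None, 9, 'gdbu')],
    'V1 string, V2 int, V3 string')

result = (df1.join(df2, on=(df1.C3 == df2.V3))
          .select(F.coalesce(df1.C1, df2.V1).alias('M1'),
                  F.coalesce(df1.C2, df2.V2).alias('M2'),
                  F.col('C3').alias('M3')))
result.show()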

How to explode a string column based on specific delimiter in Spark Scala

I want to explode a string column based on a specific delimiter (| in my case).
I have a dataset like this:
+-----+--------------+
|Col_1|Col_2 |
+-----+--------------+
| 1 | aa|bb |
| 2 | cc |
| 3 | dd|ee |
| 4 | ff |
+-----+--------------+
I want an output like this:
+-----+--------------+
|Col_1|Col_2 |
+-----+--------------+
| 1 | aa |
| 1 | bb |
| 2 | cc |
| 3 | dd |
| 3 | ee |
| 4 | ff |
+-----+--------------+
Use the explode and split functions, and use \\ to escape | (split takes a regular expression, and the pipe is a special character in it).
val df1 = df.select(col("Col_1"), explode(split(col("Col_2"),"\\|")).as("Col_2"))
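For anyone following along in PySpark, a rough equivalent (a sketch, assuming df has the same two columns):
from pyspark.sql.functions import col, explode, split

# split takes a Java regex, so the pipe has to be escaped as "\\|"
df1 = df.select(col("Col_1"), explode(split(col("Col_2"), "\\|")).alias("Col_2"))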

Check if a value is between two columns, spark scala

I have two dataframes, one with my data and another one to compare against. What I want to do is check whether a value falls within the range given by two different columns, for example:
Df_player
+--------+-------+
| Baller | Power |
+--------+-------+
| John | 1.5 |
| Bilbo | 3.7 |
| Frodo | 6 |
+--------+-------+
Df_Check
+--------+--------+--------+
| First | Second | Value |
+--------+--------+--------+
| 1 | 1.5 | Bad- |
| 1.5 | 3 | Bad |
| 3 | 4.2 | Good |
| 4.2 | 6 | Good+ |
+--------+--------+--------+
The result would be:
Df_out
+--------+-------+--------+
| Baller | Power | Value |
+--------+-------+--------+
| John | 1.5 | Bad- |
| Bilbo | 3.7 | Good |
| Frodo | 6 | Good+ |
+--------+-------+--------+
You can do a join based on a between condition, but note that .between is not appropriate here because you want inequality in one of the comparisons:
val result = df_player.join(
  df_check,
  df_player("Power") > df_check("First") && df_player("Power") <= df_check("Second"),
  "left"
).select("Baller", "Power", "Value")
result.show
+------+-----+-----+
|Baller|Power|Value|
+------+-----+-----+
| John| 1.5| Bad-|
| Bilbo| 3.7| Good|
| Frodo| 6.0|Good+|
+------+-----+-----+
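A rough PySpark equivalent, as a sketch assuming df_player and df_check carry the columns shown above:
result = df_player.join(
    df_check,
    (df_player["Power"] > df_check["First"]) & (df_player["Power"] <= df_check["Second"]),
    "left"
).select("Baller", "Power", "Value")
result.show()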

Group by certain record in array (pyspark)

I want to group the data in such a way that, for a particular record, each of the values in its array is also used for grouping, so records whose arrays overlap end up in the same group.
I am able to group by name only; I have not been able to figure out how to do this.
I have tried the following query:
import pyspark.sql.functions as f
df.groupBy('name').agg(f.collect_list('data').alias('data_new')).show()
Following is the dataframe:
|-------|--------------------|
| name | data |
|-------|--------------------|
| a | [a,b,c,d,e,f,g,h,i]|
| b | [b,c,d,e,j,k] |
| c | [c,f,l,m] |
| d | [k,b,d] |
| n | [n,o,p,q] |
| p | [p,r,s,t] |
| u | [u,v,w,x] |
| b | [b,f,e,g] |
| c | [c,b,g,h] |
| a | [a,l,f,m] |
|-------|--------------------|
I am expecting the following output:
|-------|----------------------------|
| name | data |
|-------|----------------------------|
| a | [a,b,c,d,e,f,g,h,i,j,k,l,m]|
| n | [n,o,p,q,r,s,t] |
| u | [u,v,w,x] |
|-------|----------------------------|
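The expected output is essentially a connected-components problem: rows whose arrays overlap, directly or transitively, collapse into one group under a representative name (a, n and u in the example), and a plain groupBy cannot express that. For small data, a driver-side union-find sketch illustrates the idea (collect() pulls everything to the driver; at scale a graph approach such as GraphFrames' connected components would be the usual tool). This assumes data is an array column:
from collections import defaultdict

rows = df.select('name', 'data').collect()

# union-find over every name and every array element
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression
        x = parent[x]
    return x

def union(a, b):
    ra, rb = find(a), find(b)
    if ra != rb:
        parent[rb] = ra

for r in rows:
    for item in r['data']:
        union(r['name'], item)

# gather the members of each component
groups = defaultdict(set)
for r in rows:
    root = find(r['name'])
    groups[root].add(r['name'])
    groups[root].update(r['data'])

for members in groups.values():
    print(min(members), sorted(members))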

Spark Dataframe Union giving duplicates

I have a base dataset, and one of its columns has both null and non-null values.
So I do:
val nonTrained_ds = base_ds.filter(col("col_name").isNull)
val trained_ds = base_ds.filter(col("col_name").isNotNull)
When I print these out, I get a clean separation of rows. But when I do
val combined_ds = nonTrained_ds.union(trained_ds)
I get duplicates of the rows from nonTrained_ds, and the strange thing is that the rows from trained_ds are no longer in the combined dataset.
Why does this happen?
The values of trained_ds are:
+----------+----------------+
|unique_no | running_id|
+----------+----------------+
|0456700001|16 |
|0456700004|16 |
|0456700007|16 |
|0456700010|16 |
|0456700013|16 |
|0456700016|16 |
|0456700019|16 |
|0456700022|16 |
|0456700025|16 |
|0456700028|16 |
|0456700031|16 |
|0456700034|16 |
|0456700037|16 |
|0456700040|16 |
|0456700043|16 |
|0456700046|16 |
|0456700049|16 |
|0456700052|16 |
|0456700055|16 |
|0456700058|16 |
|0456700061|16 |
|0456700064|16 |
|0456700067|16 |
|0456700070|16 |
+----------+----------------+
The values of nonTrained_ds are:
+----------+----------------+
|unique_no | running_id|
+----------+----------------+
|0456700002|null |
|0456700003|null |
|0456700005|null |
|0456700006|null |
|0456700008|null |
|0456700009|null |
|0456700011|null |
|0456700012|null |
|0456700014|null |
|0456700015|null |
|0456700017|null |
|0456700018|null |
|0456700020|null |
|0456700021|null |
|0456700023|null |
|0456700024|null |
|0456700026|null |
|0456700027|null |
|0456700029|null |
|0456700030|null |
|0456700032|null |
|0456700033|null |
|0456700035|null |
|0456700036|null |
|0456700038|null |
|0456700039|null |
|0456700041|null |
|0456700042|null |
|0456700044|null |
|0456700045|null |
|0456700047|null |
|0456700048|null |
|0456700050|null |
|0456700051|null |
|0456700053|null |
|0456700054|null |
|0456700056|null |
|0456700057|null |
|0456700059|null |
|0456700060|null |
|0456700062|null |
|0456700063|null |
|0456700065|null |
|0456700066|null |
|0456700068|null |
|0456700069|null |
|0456700071|null |
|0456700072|null |
+----------+----------------+
The values of the combined dataset are:
+----------+----------------+
|unique_no | running_id|
+----------+----------------+
|0456700002|null |
|0456700003|null |
|0456700005|null |
|0456700006|null |
|0456700008|null |
|0456700009|null |
|0456700011|null |
|0456700012|null |
|0456700014|null |
|0456700015|null |
|0456700017|null |
|0456700018|null |
|0456700020|null |
|0456700021|null |
|0456700023|null |
|0456700024|null |
|0456700026|null |
|0456700027|null |
|0456700029|null |
|0456700030|null |
|0456700032|null |
|0456700033|null |
|0456700035|null |
|0456700036|null |
|0456700038|null |
|0456700039|null |
|0456700041|null |
|0456700042|null |
|0456700044|null |
|0456700045|null |
|0456700047|null |
|0456700048|null |
|0456700050|null |
|0456700051|null |
|0456700053|null |
|0456700054|null |
|0456700056|null |
|0456700057|null |
|0456700059|null |
|0456700060|null |
|0456700062|null |
|0456700063|null |
|0456700065|null |
|0456700066|null |
|0456700068|null |
|0456700069|null |
|0456700071|null |
|0456700072|null |
|0456700002|16 |
|0456700005|16 |
|0456700008|16 |
|0456700011|16 |
|0456700014|16 |
|0456700017|16 |
|0456700020|16 |
|0456700023|16 |
|0456700026|16 |
|0456700029|16 |
|0456700032|16 |
|0456700035|16 |
|0456700038|16 |
|0456700041|16 |
|0456700044|16 |
|0456700047|16 |
|0456700050|16 |
|0456700053|16 |
|0456700056|16 |
|0456700059|16 |
|0456700062|16 |
|0456700065|16 |
|0456700068|16 |
|0456700071|16 |
+----------+----------------+
Spark's union keeps duplicates (it behaves like SQL's UNION ALL), so de-duplicating the two filtered datasets first did the trick:
val nonTrained_ds = base_ds.filter(col("primary_offer_id").isNull).distinct()
val trained_ds = base_ds.filter(col("primary_offer_id").isNotNull).distinct()
val combined_ds = nonTrained_ds.union(trained_ds)