Optimized way to validate string length in PySpark - pyspark

I have the code below for validating string length in PySpark.
It collects the result in two DataFrames: one with the valid records and the other with the invalid records.
from pyspark.sql import functions as f
from pyspark.sql.functions import col
def val_string(DfName, column, len, nullable):
    if nullable == 'no':
        # Rows whose value survives the cast to string, split by the length limit
        dt_valid = DfName.where(DfName[column].cast("string").isNotNull())
        valid_len = dt_valid.where(f.length(col(column)) <= len)
        invalid_len = dt_valid.where(f.length(col(column)) > len)
        invalid_len = invalid_len.withColumn("dataTypeValidationErrors", f.lit(column + ' ' + 'Length More than specified'))
        # Rows whose value is null after the cast
        dt_invalid = DfName.where(DfName[column].cast("string").isNull())
        dt_invalid = dt_invalid.withColumn('dataTypeValidationErrors', f.lit(column + ' ' + 'Invalid Data for the Datatype'))
        dt_invalid = dt_invalid.union(invalid_len)
        return valid_len, dt_invalid
For one column the validation runs fine.
When it runs in a loop over 100 columns, the run time is far too high; it seems to grow exponentially with the number of columns.
Let me know if there is a way to handle this.

sdf = sc.parallelize([[123,123], [456,456],[12345678,None],[None,1245678]]).toDF(["col1","col2"])
sdf.show()
+--------+-------+
| col1| col2|
+--------+-------+
| 123| 123|
| 456| 456|
|12345678| null|
| null|1245678|
+--------+-------+
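import pyspark.sql.functions as sf
length_dict = {"col1": 5, "col2": 3}
def val_length(col, length_dict=length_dict):
    # true when the value fits, false when it is too long, null when the value is null
    return sf.length(col) <= sf.lit(length_dict[col])
sdf.select("*", *[val_length(i, length_dict).alias(i + "_length_val") for i in sdf.columns]).show()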
+--------+-------+---------------+---------------+
| col1| col2|col1_length_val|col2_length_val|
+--------+-------+---------------+---------------+
| 123| 123| true| true|
| 456| 456| true| true|
|12345678| null| false| null|
| null|1245678| null| false|
+--------+-------+---------------+---------------+
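The point of the answer above is that every column check is evaluated in one pass over each row, instead of filtering and unioning the whole DataFrame once per column. The sketch below is a plain-Python model of that idea, not Spark code, so it can be run without a session. The function name validate_lengths is hypothetical, and it assumes every column is non-nullable (the nullable == 'no' case from the question) and reuses the question's error messages.

```python
from typing import Dict, List, Optional, Tuple

Row = Dict[str, Optional[int]]


def validate_lengths(rows: List[Row], max_len: Dict[str, int]) -> Tuple[List[Row], List[dict]]:
    """Split rows into valid and invalid in a single pass over the data.

    Assumption: every column in max_len is non-nullable, so a null value
    counts as invalid data.
    """
    valid, invalid = [], []
    for row in rows:
        errors = []
        # Check all columns of the row before moving to the next row
        for column, limit in max_len.items():
            value = row.get(column)
            if value is None:
                errors.append(f"{column} Invalid Data for the Datatype")
            elif len(str(value)) > limit:
                errors.append(f"{column} Length More than specified")
        if errors:
            invalid.append({**row, "dataTypeValidationErrors": errors})
        else:
            valid.append(row)
    return valid, invalid


rows = [
    {"col1": 123, "col2": 123},
    {"col1": 456, "col2": 456},
    {"col1": 12345678, "col2": None},
    {"col1": None, "col2": 1245678},
]
valid_rows, invalid_rows = validate_lengths(rows, {"col1": 5, "col2": 3})
```

Each row is visited once regardless of how many columns are checked, which is the property the single select in the answer gives you in Spark.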

Related

How does Spark DataFrame find out some lines that only appear once?

I want to eliminate the rows whose ‘county’ value appears only once, because they distort my statistics.
I used groupBy+count to find:
fault_data.groupBy("county").count().show()
The data looks like this:
+----------+-----+
| county|count|
+----------+-----+
| A| 117|
| B| 31|
| C| 1|
| D| 272|
| E| 1|
| F| 1|
| G| 280|
| H| 1|
| I| 1|
| J| 1|
| K| 112|
| L| 63|
| M| 18|
| N| 71|
| O| 1|
| P| 1|
| Q| 82|
| R| 2|
| S| 31|
| T| 2|
+----------+-----+
Next, I want to eliminate the data whose count is 1.
But when I wrote it like this, it was wrong:
fault_data.filter("count(county)=1").show()
The result is:
Aggregate/Window/Generate expressions are not valid in where clause of the query.
Expression in where clause: [(count(county) = CAST(1 AS BIGINT))]
Invalid expressions: [count(county)];
Filter (count(county#7) = cast(1 as bigint))
+- Relation [fault_id#0,fault_type#1,acs_way#2,fault_1#3,fault_2#4,province#5,city#6,county#7,town#8,detail#9,num#10,insert_time#11] JDBCRelation(fault_data) [numPartitions=1]
So I want to know the right way, thank you.
fault_data.groupBy("county").count().where(col("count")===1).show()
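Note that the line above lists the counties whose count is 1; to remove those rows from the data, the condition has to be inverted and applied to the original rows. In Spark that would typically mean keeping the grouped counts greater than 1 and joining them back to fault_data on county. The logic itself can be sketched in plain Python (a model only, not Spark code; drop_singletons and the sample rows are hypothetical):

```python
from collections import Counter


def drop_singletons(rows, key):
    """Keep only rows whose value for `key` occurs more than once."""
    # First pass: count occurrences per key (the groupBy + count step)
    counts = Counter(row[key] for row in rows)
    # Second pass: keep rows whose key count is above 1 (the join-back + filter step)
    return [row for row in rows if counts[row[key]] > 1]


rows = [
    {"county": "A", "fault_id": 1},
    {"county": "A", "fault_id": 2},
    {"county": "C", "fault_id": 3},
]
kept = drop_singletons(rows, "county")
```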

Pyspark filter where value is in another dataframe

I have two data frames. I need to filter one to only show values that are contained in the other.
table_a:
+---+----+
|AID| foo|
+---+----+
| 1 | bar|
| 2 | bar|
| 3 | bar|
| 4 | bar|
+---+----+
table_b:
+---+
|BID|
+---+
| 1 |
| 2 |
+---+
In the end I want to filter out what was in table_a to only the IDs that are in the table_b, like this:
+--+----+
|ID| foo|
+--+----+
| 1| bar|
| 2| bar|
+--+----+
Here is what I'm trying to do
result_table = table_a.filter(table_b.BID.contains(table_a.AID))
But this doesn't seem to be working. It looks like I'm getting ALL values.
NOTE: I can't add any other imports other than pyspark.sql.functions import col
You can join the two tables and specify how = 'left_semi'.
A left semi-join returns the rows from the left side of the relation that have a match on the right side.
result_table = table_a.join(table_b, (table_a.AID == table_b.BID), \
how = "left_semi").drop("BID")
result_table.show()
+---+---+
|AID|foo|
+---+---+
| 1|bar|
| 2|bar|
+---+---+
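To make the semantics of the left semi join concrete, here is a plain-Python model of what it computes (not Spark code; the function name left_semi_join is hypothetical). It shows two properties worth knowing: only columns from the left table come back, and duplicate keys on the right do not duplicate rows on the left.

```python
def left_semi_join(left_rows, right_rows, left_key, right_key):
    """Return the left rows whose key has at least one match on the right."""
    # Collapsing the right side to a set is why duplicates there have no effect
    right_keys = {row[right_key] for row in right_rows}
    return [row for row in left_rows if row[left_key] in right_keys]


table_a = [{"AID": i, "foo": "bar"} for i in (1, 2, 3, 4)]
table_b = [{"BID": 1}, {"BID": 2}, {"BID": 1}]  # BID 1 appears twice on purpose
result = left_semi_join(table_a, table_b, "AID", "BID")
```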
If the second DataFrame contains duplicates or multiple values and you want to use only the distinct values, the approach below can be useful.
Create the DataFrames:
df = spark.createDataFrame([(1,"bar"),(2,"bar"),(3,"bar"),(4,"bar")],[ "col1","col2"])
df_lookup = spark.createDataFrame([(1,1),(1,2)],[ "id","val"])
df.show(truncate=True)
df_lookup.show()
+----+----+
|col1|col2|
+----+----+
| 1| bar|
| 2| bar|
| 3| bar|
| 4| bar|
+----+----+
+---+---+
| id|val|
+---+---+
| 1| 1|
| 1| 2|
+---+---+
Get all the unique values of the val column in the second DataFrame and collect them into a list variable:
from pyspark.sql import functions as F
df_lookup_var = df_lookup.groupBy("id").agg(F.collect_set("val").alias("val")).collect()[0][1]
print(df_lookup_var)
df = df.withColumn("case_col", F.when(F.col("col1").isin(df_lookup_var), F.lit("1")).otherwise(F.lit("0")))
df = df.filter(F.col("case_col") == F.lit("1"))
df.show()
+----+----+--------+
|col1|col2|case_col|
+----+----+--------+
| 1| bar| 1|
| 2| bar| 1|
+----+----+--------+
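This should work too, after collecting the BID values to the driver:
bids = [row.BID for row in table_b.select("BID").distinct().collect()]
table_a.where(col("AID").isin(bids))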

Spark adding indexes to dataframe and append other dataset that doesn't have index

I have a dataset that has column userid and index values.
+---------+--------+
| userid | index|
+---------+--------+
| user1| 1|
| user2| 2|
| user3| 3|
| user4| 4|
| user5| 5|
| user6| 6|
| user7| 7|
| user8| 8|
| user9| 9|
| user10| 10|
+---------+--------+
I want to append a new data frame to it and add an index to the newly added rows.
The userid is unique, and the existing data frame will not contain any of the user ids from the second data frame.
+----------+
| userid |
+----------+
| user11|
| user21|
| user41|
| user51|
| user64|
+----------+
The expected output with newly added userid and index
+---------+--------+
| userid | index|
+---------+--------+
| user1| 1|
| user2| 2|
| user3| 3|
| user4| 4|
| user5| 5|
| user6| 6|
| user7| 7|
| user8| 8|
| user9| 9|
| user10| 10|
| user11| 11|
| user21| 12|
| user41| 13|
| user51| 14|
| user64| 15|
+---------+--------+
Is it possible to achieve this by passing a max index value and starting the index for the second DataFrame from that value?
If the userid has some ordering, you can use the row_number function. Even if it does not, you can add an id using monotonically_increasing_id(). For now I assume that userid can be ordered. Then you can do this:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
df_merge = df1.select('userid').union(df2.select('userid'))
w=Window.orderBy('userid')
df_result = df_merge.withColumn('indexid',F.row_number().over(w))
EDIT : After discussions in comment.
#%% Test data and imports
import pyspark.sql.functions as F
from pyspark.sql import Window
df = sqlContext.createDataFrame([('a',100),('ab',50),('ba',300),('ced',60),('d',500)],schema=['userid','index'])
df1 = sqlContext.createDataFrame([('fgh',100),('ff',50),('fe',300),('er',60),('fi',500)],schema=['userid','dummy'])
#%%
#%% Merge the two dataframes, with a null columns as the index
df1=df1.withColumn('index', F.lit(None))
df_merge = df.select(df.columns).union(df1.select(df.columns))
#%%Define a window to arrange the newly added rows at the last and order them by userid
#%% The user id, even though random strings, can be ordered
w= Window.orderBy(F.col('index').asc_nulls_last(),F.col('userid'))# if possible add a partition column here, otherwise all your data will come in one partition, consider salting
#%% For the newly added rows, define index as the maximum value + increment of number of rows in main dataframe
df_final = df_merge.withColumn("index_new",F.when(~F.col('index').isNull(),F.col('index')).otherwise((F.last(F.col('index'),ignorenulls=True).over(w))+F.sum(F.lit(1)).over(w)))
#%% If number of rows in main dataframe is huge, then add an offset in the above line
df_final.show()
+------+-----+---------+
|userid|index|index_new|
+------+-----+---------+
| ab| 50| 50|
| ced| 60| 60|
| a| 100| 100|
| ba| 300| 300|
| d| 500| 500|
| er| null| 506|
| fe| null| 507|
| ff| null| 508|
| fgh| null| 509|
| fi| null| 510|
+------+-----+---------+
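In the output above the new indexes start at 506 rather than 501, because the running count in the window also counts the five existing rows; that is the offset the last code comment refers to. The contiguous numbering asked for in the question (max index + 1, + 2, ...) can be modelled in plain Python as follows. This is a sketch of the arithmetic only, not Spark code, and append_with_index is a hypothetical name; it assumes the new user ids are ordered by sorting them.

```python
def append_with_index(existing, new_userids):
    """Append new user ids, numbering them from max(existing index) + 1."""
    # Highest index already in use (0 if there are no existing rows)
    start = max((index for _, index in existing), default=0)
    # Row number within the new rows, ordered by userid, added to the max
    appended = [
        (userid, start + offset)
        for offset, userid in enumerate(sorted(new_userids), start=1)
    ]
    return existing + appended


existing = [(f"user{i}", i) for i in range(1, 11)]
new_users = ["user11", "user21", "user41", "user51", "user64"]
result = append_with_index(existing, new_users)
```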

Fill null or empty with next Row value with spark

Is there a way to replace null values in a Spark data frame with the next non-null value from a following row? An additional row_count column has been added for window partitioning and ordering. More specifically, I'd like to achieve the following result:
+---------+----+        +---------+----+
|row_count|  id|        |row_count|  id|
+---------+----+        +---------+----+
|        1|null|        |        1| 109|
|        2| 109|        |        2| 109|
|        3|null|        |        3| 108|
|        4|null|        |        4| 108|
|        5| 108|   =>   |        5| 108|
|        6|null|        |        6| 110|
|        7| 110|        |        7| 110|
|        8|null|        |        8|null|
|        9|null|        |        9|null|
|       10|null|        |       10|null|
+---------+----+        +---------+----+
I tried the code below, but it does not give the proper result.
val ss = dataframe.select($"*", sum(when(dataframe("id").isNull||dataframe("id") === "", 1).otherwise(0)).over(Window.orderBy($"row_count")) as "value")
val window1=Window.partitionBy($"value").orderBy("id").rowsBetween(0, Long.MaxValue)
val selectList=ss.withColumn("id_fill_from_below",last("id").over(window1)).drop($"row_count").drop($"value")
Here is an approach:
1. Filter the non-nulls (dfNonNulls)
2. Filter the nulls (dfNulls)
3. Find the right value for each null id, using a join and a Window function
4. Fill the null DataFrame (dfNullFills)
5. Union dfNonNulls and dfNullFills
data.csv
row_count,id
1,
2,109
3,
4,
5,108
6,
7,110
8,
9,
10,
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
import spark.implicits._
val df = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("data.csv")
// Rows with a null id, renamed so they can be joined against the non-null rows
var dfNulls = df.filter($"id".isNull)
  .withColumnRenamed("row_count", "row_count_nulls")
  .withColumnRenamed("id", "id_nulls")
val dfNonNulls = df.filter($"id".isNotNull)
  .withColumnRenamed("row_count", "row_count_values")
  .withColumnRenamed("id", "id_values")
// Pair every null row with all later non-null rows
dfNulls = dfNulls.join(dfNonNulls, $"row_count_nulls" lt $"row_count_values", "left")
  .select($"id_nulls", $"id_values", $"row_count_nulls", $"row_count_values")
// Keep only the nearest later non-null row for each null row
val window = Window.partitionBy("row_count_nulls").orderBy("row_count_values")
val dfNullFills = dfNulls.withColumn("rn", row_number().over(window))
  .where($"rn" === 1).drop("rn")
  .select($"row_count_nulls".alias("row_count"), $"id_values".alias("id"))
dfNullFills.union(dfNonNulls).orderBy($"row_count").show()
which results in:
+---------+----+
|row_count| id|
+---------+----+
| 1| 109|
| 2| 109|
| 3| 108|
| 4| 108|
| 5| 108|
| 6| 110|
| 7| 110|
| 8|null|
| 9|null|
| 10|null|
+---------+----+
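The fill rule itself (each null takes the value of the nearest following non-null row, and trailing nulls stay null) is easy to check in plain Python. The sketch below is a model of that rule, not Spark code; fill_from_next is a hypothetical name, and it treats only None as missing, whereas the question also mentions empty strings.

```python
def fill_from_next(values):
    """Replace each None with the nearest following non-None value."""
    filled = list(values)
    next_seen = None  # most recent non-null value seen while scanning backwards
    for i in range(len(filled) - 1, -1, -1):
        if filled[i] is None:
            filled[i] = next_seen
        else:
            next_seen = filled[i]
    return filled


ids = [None, 109, None, None, 108, None, 110, None, None, None]
filled_ids = fill_from_next(ids)
```

Scanning from the last row to the first means every row only needs the single value carried over from the row after it, which is the same information the join plus row_number step recovers in the Spark version.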

How to call print from otherwise in spark scala?

I have a DataFrame where I check a condition and perform an operation accordingly. I'm using the "when and otherwise" functions and trying to print the failed result in "otherwise", but it is not printing. Any help will be appreciated.
joinDF=
+--------+-------------+----------+--------+-------+
| ID | A | B | C | D |
+--------+-------------+----------+--------+-------+
| 9574| F| 005912| 2016022| 10|
| 9576| F| 005912| 2016022| 21|
| 9578| F| 005912| 2016022| 0|
| 9580| F| 005912| 2016022| 19|
| 9582| F| 005912| 2016022| 89|
+--------+-------------+----------+--------+-------+
joinDF
  .withColumn("Validate", when(joinDF("D") =!= 0, lit(true)).otherwise(print(joinDF("ID"))))
It's quite simple and straightforward:
Seq(("A",1),("B",0)).toDF("key","value")
.withColumn("verdict", when($"value"=!=0, lit("true")).otherwise("false")).show
+---+-----+-------+
|key|value|verdict|
+---+-----+-------+
| A| 1| true|
| B| 0| false|
+---+-----+-------+
You don't need if, else or UDFs.
With your example:
Seq(("A",1),("B",0)).toDF("ID","D").withColumn("validate",when($"D" =!= 0 ,lit("true")).otherwise($"ID")).show
+---+---+--------+
| ID| D|validate|
+---+---+--------+
| A| 1| true|
| B| 0| B|
+---+---+--------+
If you are trying to print the values that did not match, then you can create a udf as below:
val validate = udf((id: String, d: Int) => {
  // If D is not equal to 0 return "true"; otherwise print the ID and return it
  if (d != 0) "true"
  else {
    // Runs on the executors, so the output may show up in the executor logs
    println(s"ID = $id")
    id
  }
})
// This adds a new column named Validate by calling the udf on ID and D
joinDF.withColumn("Validate", validate($"ID".cast("string"), $"D".cast("int")))
Hope this helps!