How to Stop Spark Execution if df is empty - pyspark

Here's my use case:
df_1:
user_name
Mark
Jane
Mary
df_2:
participant_name
Mark
Jane
Mary
Bill
Expectation:
Compare these two DataFrames; if they differ, throw an exception and stop the execution.
Example:
In the above example, since user_name does not contain Bill, the session should be stopped and the message should say that Bill is not in the user_name df.
My approach:
Use a left anti join to see if I can get an empty df as the result; if not, throw an exception.
My questions are:
Any neat way to do this comparison in pyspark?
How do I throw an exception and stop the execution if the condition is not met?

Performing a left anti join is a correct approach here. The only drawback is that you need to specify the join condition for each column when both DataFrames have multiple columns. PySpark has a similar function, exceptAll(), which does exactly what you are looking for: df1.exceptAll(df2) returns a DataFrame with all the rows that are present in df1 but not in df2. Similarly, we can use it the other way around, and only if both results are empty are the DataFrames equal. (A left anti join version is sketched at the end of this answer.)
from pyspark.sql import DataFrame

def is_equal(df1: DataFrame, df2: DataFrame) -> bool:
    # Rows in df1 that are missing from df2
    if len(df1.exceptAll(df2).take(1)) > 0:
        return False
    # Rows in df2 that are missing from df1
    if len(df2.exceptAll(df1).take(1)) > 0:
        return False
    return True

df_1 = spark.createDataFrame([("Mark",), ("Jane",), ("Mary",)], schema="user_name string")
df_2 = spark.createDataFrame([("Mark",), ("Jane",), ("Mary",), ("Bill",)], schema="participant_name string")

if not is_equal(df_1, df_2):
    raise Exception("DataFrames are different")
The above code will throw an Exception as df_2 has "Bill" while df_1 does not.
len(df1.exceptAll(df2).take(1)) > 0 is used to check whether the returned DataFrame is non-empty (take(1) avoids materialising the whole difference). If it is non-empty, the DataFrames differ and we return False.
For your second point, you can raise an exception using raise, as shown above.
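For reference, here is a rough sketch of the left anti join check mentioned at the start of this answer, assuming the single-column DataFrames and column names from the question; a left anti join keeps the rows of the left DataFrame that have no match on the right:

missing = df_2.join(
    df_1,
    df_2["participant_name"] == df_1["user_name"],
    "left_anti",  # rows of df_2 with no matching user_name in df_1
)

if missing.take(1):  # non-empty: at least one participant is missing from df_1
    names = [r["participant_name"] for r in missing.collect()]
    raise Exception(f"{', '.join(names)} is not in user_name df")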

You can use the subtract function and count the result; if both counts are zero, the two DataFrames are equal.
You can also use an inner join: if the joined DataFrame has the same count as both originals, then the two are equal (sketched below, after the subtract example).
The exception can be thrown in the typical Pythonic way:
if df1.subtract(df2).count() > 0:
    raise Exception("some items in df1 do not exist in df2")
if df2.subtract(df1).count() > 0:
    raise Exception("some items in df2 do not exist in df1")

Related

How to use dataframe within for each loop?

My objective is to get groups of user ids from the dataframe, and if consecutive rows of Dept match, the MergeCol column should be set to 1.
I tried to do this within a for loop: getting the distinct user ids, then getting all the corresponding records from the dataframe for one user id, and looping over them.
It's throwing a NullPointerException inside the foreach loop.
When I searched on Stack Overflow, I found out that dataframes cannot be used within a foreach loop, because the loop body runs on the executors, which have no access to the driver-side DataFrame.
Can anyone suggest a workaround?
val distinctIdUser = inputTableDf.select("idUser").distinct()

distinctIdUser.foreach { row =>
  val id_user = row.getAs[Long]("idUser")
  val subset = inputTableDf.filter($"idUser" === id_user) // DataFrame used inside foreach
  val window = Window.orderBy("tsStart")
  subset.withColumn("MergeCol", when(compareCol(col("Dept"), lead(col("Dept"), 1).over(window)), 1))
}
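One common workaround is to drop the foreach entirely and express the consecutive-row comparison with a window partitioned by idUser. A rough sketch of that idea, written in PySpark and with the custom compareCol check replaced by a plain equality (both are assumptions, since compareCol is not shown):

from pyspark.sql import Window
from pyspark.sql import functions as F

# One window per user, ordered by the timestamp column from the question
w = Window.partitionBy("idUser").orderBy("tsStart")

result = inputTableDf.withColumn(
    "MergeCol",
    F.when(F.col("Dept") == F.lead("Dept", 1).over(w), 1).otherwise(0),
)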

Join two dataframe using Spark Scala

I have this code :
val o = p_value.alias("d1").join(t_d.alias("d2"),
    (col("d1.origin_latitude") === col("d2.origin_latitude") &&
     col("d1.origin_longitude") === col("d2.origin_longitude")), "left")
  .filter(col("d2.origin_longitude").isNull)
val c = p_value2.alias("d3").join(o.alias("d4"),
    (col("d3.origin_latitude") === col("d4.origin_latitude") &&
     col("d3.origin_longitude") === col("d4.origin_longitude")), "left")
  .filter(col("d3.origin_longitude").isNull)
I get this error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Reference 'd4.origin_latitude' is ambiguous, could be: d4.origin_latitude, d4.origin_latitude.;
at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:240)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:101)
On this line
(col("d3.origin_latitude")===col("d4.origin_latitude") && col("d3.origin_longitude")===col("d4.origin_longitude")),"left").
Any idea?
Thank you.
You are aliasing the DataFrame, not the columns; the alias is only used to access/refer to the columns of that DataFrame.
So the first join results in another DataFrame that has the same column name twice (origin_latitude as well as origin_longitude). Once you try to access one of these columns in the resulting DataFrame, you get the ambiguity error.
So you need to make sure that the DataFrame contains each column only once.
You can rewrite the first join as below:
p_value
  .join(t_d, Seq("origin_latitude", "origin_longitude"), "left")
  .filter(t_d.col("origin_longitude").isNull)
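Since the left join plus isNull filter is only there to keep the rows of p_value that have no match in t_d, the same intent can also be expressed as an anti join, which sidesteps the duplicated columns entirely. A rough sketch of that idea, written here in PySpark syntax with the variable names from the question:

# Rows of p_value with no matching (origin_latitude, origin_longitude) in t_d
o = p_value.join(t_d, ["origin_latitude", "origin_longitude"], "left_anti")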

Spark Transactional remove rows

I am working with DataFrames in Scala in a banking process, and I need to remove some rows if the transaction is a cancellation. For example, if I have a cancellation, I must remove the previous row; in the case of three consecutive cancellations, I must remove the three previous rows.
Initial DataFrame:
Expected DataFrame:
I will appreciate your help.
A combination of inbuilt functions, a udf function and a window function should help you get your desired result (commented for clarity):
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._

def windowSpec = Window.partitionBy("Account").orderBy("Sequence").rowsBetween(Long.MinValue, Long.MaxValue)
def filterUdf = udf((array: Seq[Long], sequence: Long) => !array.contains(sequence))

df.withColumn("collection", sum(when(col("Type") === "Cancellation", 1).otherwise(0)).over(windowSpec)) // count of cancellations in each group
  .withColumn("Sequence", when(col("Type") === "Cancellation", col("Sequence") - col("collection")).otherwise(col("Sequence"))) // difference between count and sequence number gives the sequence number of the previous row
  .withColumn("collection", collect_set(when(col("Type") === "Cancellation", col("Sequence")).otherwise(0)).over(windowSpec)) // collect the shifted sequence numbers of the cancellations
  .filter(filterUdf(col("collection"), col("Sequence"))) // filter out the rows by calling the udf
  .drop("collection")
  .show(false)
which should give you
+-------+-----------+--------+
|Account|Type |Sequence|
+-------+-----------+--------+
|11047 |Aggregation|11 |
|1030583|Aggregation|1 |
|1030583|Aggregation|4 |
+-------+-----------+--------+
Note: this solution works only when the cancellations are sequential within each Account group.
I think a Map of stacks is a useful data structure in this case, with the account id as the key. You push the aggregation rows onto the stack until you encounter a cancellation, and then you pop the stack.
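A rough sketch of that stack idea, written here in PySpark and grouping per account on the RDD side; the column names Account, Type and Sequence are assumed from the example above:

def drop_cancelled(rows):
    # Replay one account's rows in Sequence order against a stack
    stack = []
    for r in sorted(rows, key=lambda r: r["Sequence"]):
        if r["Type"] == "Cancellation":
            if stack:
                stack.pop()  # a cancellation removes the most recently kept row
        else:
            stack.append(r)
    return stack

cleaned = spark.createDataFrame(
    df.rdd.groupBy(lambda r: r["Account"]).flatMap(lambda kv: drop_cancelled(kv[1])),
    df.schema,
)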

Spark: efficient way to search another dataframe

I have one dataframe (df) with IP addresses and their corresponding long value (ip_int), and now I want to search another dataframe (ip2Country), which contains geolocation information, to find their corresponding country name. How should I do it in Scala? My current code did not work out: memory limit exceeded.
val ip_ints = df.select("ip_int").distinct.collect().flatMap(_.toSeq)
val df_list = ListBuffer[DataFrame]()
for (v <- ip_ints) {
  var ip_int = v.toString.toLong
  df_list += ip2Country.filter(($"network_start_integer" <= ip_int) && ($"network_last_integer" >= ip_int)).select("country_name").withColumn("ip_int", lit(ip_int))
}
var df1 = df_list.reduce(_ union _)
df = df.join(df1, Seq("ip_int"), "left")
Basically, I try to iterate through every ip_int value, search for it in ip2Country, and merge the results back into df.
Any help is much appreciated!
A simple join should do the trick for you (df1 here is the ip2Country lookup table):
df.join(df1, df1("network_start_integer") <= df("ip_int") && df1("network_last_integer") >= df("ip_int"), "left")
  .select("ip", "ip_int", "country_name")
If you want to remove the rows with a null country_name, you can add a filter too:
df.join(df1, df1("network_start_integer") <= df("ip_int") && df1("network_last_integer") >= df("ip_int"), "left")
  .select("ip", "ip_int", "country_name")
  .filter($"country_name".isNotNull)
I hope the answer is helpful.
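Because this is a range (non-equi) join, Spark has to fall back to a nested-loop style join, so if the geolocation table is small enough it is usually worth broadcasting it explicitly. A rough sketch of that tweak, written here in PySpark with the column names used above:

from pyspark.sql import functions as F

result = (
    df.join(
        F.broadcast(ip2Country),  # hint: replicate the small lookup table to every executor
        (ip2Country["network_start_integer"] <= df["ip_int"])
        & (ip2Country["network_last_integer"] >= df["ip_int"]),
        "left",
    )
    .select("ip", "ip_int", "country_name")
)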
You want to do a non-equi join, which you can implement by cross joining and then filtering, though it is resource-heavy to do so. Assuming you are using Spark 2.1:
df.createOrReplaceTempView("ip_int")
ip2Country.select("network_start_integer", "network_last_integer", "country_name").createOrReplaceTempView("ip_int_lookup")
// val spark: SparkSession
val result: DataFrame = spark.sql(
  "select a.*, b.country_name from ip_int a, ip_int_lookup b " +
  "where b.network_start_integer <= a.ip_int and b.network_last_integer >= a.ip_int")
If you want to include null ip_int, you will need to right join df to result.
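One way to sketch that last step, assuming ip_int uniquely identifies the rows of df (shown in PySpark syntax):

# Keep every row of df; country_name stays null where no range matched
full = df.join(result.select("ip_int", "country_name"), ["ip_int"], "left")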
I feel puzzled here:
df1("network_start_integer") <= df("ip_int") && df1("network_last_integer") >= df("ip_int")
Can we use
df1("network_start_integer") === df("ip_int")
here instead, please?

How to insert record into a dataframe in spark

I have a dataframe (df1) which has 50 columns; the first one is cust_id and the rest are features. I also have another dataframe (df2) which contains only cust_id. I'd like to add one record per customer in df2 to df1, with all the features set to 0. But as the two dataframes have different schemas, I cannot do a union. What is the best way to do that?
I used a full outer join, but it generates two cust_id columns and I need one. I should somehow merge these two cust_id columns, but I don't know how.
You can try to achieve something like that by doing a full outer join like the following:
val result = df1.join(df2, Seq("cust_id"), "full_outer")
However, the features are going to be null instead of 0. If you really need them to be zero, one way to do it would be:
import org.apache.spark.sql.functions.{col, lit}

val features = df1.columns.toSet - "cust_id" // Remove "cust_id" column
val newDF = features.foldLeft(df2)(
  (df, colName) => df.withColumn(colName, lit(0)) // add each feature column as a literal 0
)
// unionAll matches columns by position, so align newDF's columns with df1's order first
df1.unionAll(newDF.select(df1.columns.map(col): _*))
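Alternatively, staying with the full outer join from the first snippet, the null features can simply be filled in afterwards; a minimal sketch in PySpark syntax, assuming all feature columns are numeric:

# Fill the null feature values produced by the outer join with 0
result = df1.join(df2, ["cust_id"], "full_outer").na.fill(0)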