Join two dataframes using Spark Scala

I have this code:
val o = p_value.alias("d1").join(t_d.alias("d2"),
    col("d1.origin_latitude") === col("d2.origin_latitude") &&
    col("d1.origin_longitude") === col("d2.origin_longitude"), "left")
  .filter(col("d2.origin_longitude").isNull)
val c = p_value2.alias("d3").join(o.alias("d4"),
    col("d3.origin_latitude") === col("d4.origin_latitude") &&
    col("d3.origin_longitude") === col("d4.origin_longitude"), "left")
  .filter(col("d3.origin_longitude").isNull)
I get this error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Reference 'd4.origin_latitude' is ambiguous, could be: d4.origin_latitude, d4.origin_latitude.;
at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:240)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:101)
On this line:
(col("d3.origin_latitude") === col("d4.origin_latitude") && col("d3.origin_longitude") === col("d4.origin_longitude")), "left").
Any idea?
Thank you.

You are aliasing the DataFrames, not the columns; the alias is only used to access/refer to columns of that DataFrame.
So the first join results in another DataFrame that has the same column name twice (origin_latitude as well as origin_longitude). As soon as you try to access one of those columns in the resulting DataFrame, you get the ambiguity error.
So you need to make sure that the DataFrame contains each column only once.
You can rewrite the first join as below:
p_value
  .join(t_d, Seq("origin_latitude", "origin_longitude"), "left")
  .filter(t_d.col("origin_longitude").isNull)
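Along the same lines, here is a minimal sketch of how both joins could look once each join key appears only once. It assumes the intent of each left join followed by an isNull filter was to keep only the rows of the left side that have no coordinate match, which is what a "left_anti" join expresses directly:
// Rows of p_value with no coordinate match in t_d.
val o = p_value.join(t_d, Seq("origin_latitude", "origin_longitude"), "left_anti")

// Rows of p_value2 with no coordinate match in o.
val c = p_value2.join(o, Seq("origin_latitude", "origin_longitude"), "left_anti")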

Related

How to Stop Spark Execution if df is empty

Here's my use case:
df_1:
user_name
Mark
Jane
Mary
df_2:
participant_name
Mark
Jane
Mary
Bill
Expectation:
Compare these two dfs; if they are different, throw an exception and stop the execution.
Example:
In the example above, since user_name does not contain Bill, the session should be stopped and the message should say that Bill is not in the user_name df.
My approach:
Use a left anti join to see if I get an empty df as the result; if not, throw an exception.
My questions are:
Is there a neat way to do this comparison in PySpark?
How do I throw an exception and stop the execution if the condition is not met?
Performing a left anti join can be a correct approach here. The only drawback is that you have to spell out the join condition for each column when both dataframes have multiple columns. PySpark has a similar function, exceptAll(), which does exactly what you are looking for: df1.exceptAll(df2) returns a dataframe with all the rows that are present in df1 but not in df2. We can also use it the other way around; only if both directions return an empty dataframe are the two dataframes the same.
from pyspark.sql import DataFrame

def is_equal(df1: DataFrame, df2: DataFrame) -> bool:
    if len(df1.exceptAll(df2).take(1)) > 0:
        return False
    if len(df2.exceptAll(df1).take(1)) > 0:
        return False
    return True

df_1 = spark.createDataFrame([("Mark",), ("Jane",), ("Mary",)], schema="user_name string")
df_2 = spark.createDataFrame([("Mark",), ("Jane",), ("Mary",), ("Bill",)], schema="participant_name string")

if not is_equal(df_1, df_2):
    raise Exception("DataFrames are different")
The above code will throw an Exception, as df_2 has "Bill" while df_1 does not.
len(df1.exceptAll(df2).take(1)) > 0 checks whether the returned dataframe is non-empty. If it is, there is a difference between the dataframes and we return False.
For your second point, we can raise an exception using raise, as shown above.
You can use the subtract function and count the resulting dataset; if the count is zero, the two dataframes are equal.
You can also use an inner join: if the joined dataframe has the same count as each of the originals, the two are equal (a sketch of this check follows the snippet below).
The exception can be thrown in the typical Pythonic way:
if df1.subtract(df2).count() > 0:
    raise Exception("some items in df1 do not exist in df2")
if df2.subtract(df1).count() > 0:
    raise Exception("some items in df2 do not exist in df1")

Trying to join tables and getting "Resolved attribute(s) columnName#17 missing from ..."

I'm trying to join two tables and getting a frustrating series of errors:
If I try this:
pop_table = mtrips.join(trips, (mtrips["DOLocationID"] == trips["PULocationID"]))
Then I get this error:
Resolved attribute(s) PULocationID#17 missing from PULocationID#2508,
If I try this:
pop_table = mtrips.join(trips, (col("DOLocationID") == col("PULocationID")))
I get this error:
"Reference 'DOLocationID' is ambiguous, could be: DOLocationID, DOLocationID.;"
If I try this:
pop_table = mtrips.join(trips, col("mtrips.DOLocationID") == col("trips.PULocationID"))
I get this error:
"cannot resolve '`mtrips.DOLocationID`' given input columns: [DOLocationID]
When I search SO for these errors, it seems like every post tells me to try something that I've already tried and that isn't working.
I don't know where to go from here. Help appreciated!
It looks like this problem. There is some ambiguity in the names.
Are you deriving one of the dataframes from another one? In that case, use withColumnRenamed() to rename the 'join' columns in the second dataframe before you do the join operation.
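Here is a minimal sketch of that suggestion, in Scala like the rest of this page (the question itself is PySpark, but withColumnRenamed exists in both APIs). mtrips and trips are the dataframes from the question; trips_PULocationID is just an illustrative new name:
// Rename the join column on the derived side so its attribute no longer
// collides with the column it was derived from.
val tripsRenamed = trips.withColumnRenamed("PULocationID", "trips_PULocationID")

val popTable = mtrips.join(
  tripsRenamed,
  mtrips("DOLocationID") === tripsRenamed("trips_PULocationID")
)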
It is pretty evident that the issue is with the column names in the two dataframes.
1. When all the columns in the two dataframes are different, except for the join key column, which has the same name in both, use this:
df = df.join(df_right, 'join_col_which_is_same_in_both_df', 'left')
2. When the join column has a different name in each dataframe, use this; the joined df will keep both columns, i.e. col1 and col2:
df = df.join(df_right, df.col1 == df_right.col2, 'left')

Fetching all columns from one and some from the other

I'm using Spark Scala. I have two dataframes that I want to join, selecting all columns from the first and a few from the second.
This is my code, which doesn't work:
val df = df1.join(df2,
df1("a") <=> df2("a")
&& df1("b") <=> df2("b"),
"left").select(df1("*"),---> is this correct?
df2("c AS d", "e AS f")) ---> fails here
This fails with the following error:
too many arguments for method apply: (colName: String)org.apache.spark.sql.Column in class Dataset
df2("c AS d", "e AS f"))
I couldn't find a different method in the API to do this. How do I do it?
Try using aliases. I don't know Scala, but the code below is Python/PySpark; it joins and gets all columns from one table and some columns from the other:
from pyspark.sql import functions as f

df1_col = df1.columns

resultdf = df1.alias('left_table') \
    .join(df2.alias('right_table'), f.col('left_table.col1') == f.col('right_table.col1')) \
    .select([f.col('left_table.' + xx) for xx in df1_col]
            + [f.col('right_table.col2'), f.col('right_table.col3'), f.col('right_table.col4')])
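Since the question is Spark Scala, here is a sketch of the same idea in Scala, fixing the part that failed: df2("c AS d", "e AS f") is not valid because apply takes a single column name, so select each column individually and rename it with as. The column names a, b, c, e and the target names d, f are taken from the question:
val df = df1.join(df2,
    df1("a") <=> df2("a") && df1("b") <=> df2("b"),
    "left")
  // df1("*") expands to every column of df1; the df2 columns are renamed with as.
  .select(df1("*"), df2("c").as("d"), df2("e").as("f"))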

Spark: efficient way to search another dataframe

I have one dataframe (df) with IP addresses and their corresponding long values (ip_int), and I want to search another dataframe (ip2Country), which contains geolocation information, to find the corresponding country names. How should I do it in Scala? My current code does not work: it exceeds the memory limit.
val ip_ints = df.select("ip_int").distinct.collect().flatMap(_.toSeq)
val df_list = ListBuffer[DataFrame]()
for (v <- ip_ints) {
  var ip_int = v.toString.toLong
  df_list += ip2Country.filter(($"network_start_integer" <= ip_int) && ($"network_last_integer" >= ip_int))
    .select("country_name").withColumn("ip_int", lit(ip_int))
}
var df1 = df_list.reduce(_ union _)
df = df.join(df1, Seq("ip_int"), "left")
Basically I try to iterate through every ip_int value, search for it in ip2Country, and merge the results back into df.
Any help is much appreciated!
A simple join should do the trick for you:
df.join(df1, df1("network_start_integer") <= df("ip_int") && df1("network_last_integer") >= df("ip_int"), "left")
  .select("ip", "ip_int", "country_name")
If you want to remove the null country_name rows, you can add a filter too:
df.join(df1, df1("network_start_integer") <= df("ip_int") && df1("network_last_integer") >= df("ip_int"), "left")
  .select("ip", "ip_int", "country_name")
  .filter($"country_name".isNotNull)
I hope the answer is helpful
You want to do a non-equi join, which you can implement by cross joining and then filtering, though it is resource heavy to do so. Assuming you are using Spark 2.1:
df.createOrReplaceTempView("ip_int")
ip2Country.select("network_start_integer", "network_last_integer", "country_name").createOrReplaceTempView("ip_int_lookup")

// val spark: SparkSession
val result: DataFrame = spark.sql(
  "select a.*, b.country_name from ip_int a, ip_int_lookup b " +
  "where b.network_start_integer <= a.ip_int and b.network_last_integer >= a.ip_int")
If you want to include null ip_int, you will need to right join df to result.
I feel puzzled here. Given
df1("network_start_integer") <= df("ip_int") && df1("network_last_integer") >= df("ip_int")
can we use
df1("network_start_integer") === df("ip_int")
here instead?

Join Dataframes in Spark

I have joined two DataFrames in Spark using the code below.
The DataFrames are expDataFrame and accountList.
val expDetails = expDataFrame.as("fex").join(accountList.as("acctlist"), $"fex.acct_id" === $"acctlist.acct_id", "inner")
Now I am trying to show the acct_id from both DataFrames.
I have tried the code below:
expDetails.select($"fex.acct_id",$"acct_id.acct_id").show
but getting same column name twice as acct_id
I want two unique column name like fex_acct_id, acctlist_acct_id to identify the column from which dataframe.
You simply have to add an alias to the columns using the as or alias methods. This will do the job:
expDetails.select(
  $"fex.acct_id".as("fex_acct_id"),
  $"acctlist.acct_id".as("acctlist_acct_id")
).show