I have two DataFrames that I'm joining on multiple columns. The first pair of columns is compared for equality and the second pair for inequality. The code looks like this:
val arule_1w = itemLHS
.join(itemRHS, itemLHS("CUST_ID") === itemRHS("CUST_ID") && itemLHS("LHS") != itemRHS("RHS"))
The resulting data still contains rows where itemLHS("LHS") equals itemRHS("RHS"), which it shouldn't with a not-equals join. It may be user error, but all my research tells me that format is correct. All the columns are string values.
Thanks for your help!
The correct method is =!=, not !=.
Use the syntax below (!== is the older spelling of the =!= operator):
itemLHS("LHS") !== itemRHS("RHS")
While it seems a trivial task, I haven't been able to find a tidy solution for it. I want to add a new (integer) column, nCol, to a dataframe, whose value is determined by comparing two existing columns (both String type) of the dataframe, eCol1 and eCol2,
something like:
df(nCol) = {
if df(eCol1) == df(eCol2) then 1
else 0
}
I believe it could be done with the help of user-defined functions (UDFs). But isn't there a tidier way for such a trivial task?
You need to work with the DataFrame DSL's when/otherwise; to test equality, use ===:
import org.apache.spark.sql.functions.when

df.withColumn("newCol", when(df("eCol1") === df("eCol2"), 1).otherwise(0))
How do I apply a where condition on a dataframe? For example, I need to groupBy one column and count the distinct values in another column based on a certain where condition, and I need to do this where condition for multiple columns.
I tried the way below. Please let me know how I can do this.
case class testRdd(name:String,id:Int,price:Int)
val Cols = testRdd.toDF().groupBy("id").agg(countDistinct("name").when(col("price") > 0, 1).otherwise(0))
This will not work. Or is there a way to do something like the following? Thanks in advance.
testRdd.toDF().groupBy("id").agg(if(col("price")>0)countDistinct("name"))
Here is an alternative approach to @Robin's answer, namely introducing an additional boolean column to group by:
df.groupBy($"id",when($"price">0,true).otherwise(false).as("positive_price"))
.agg(
countDistinct($"name")
)
.where($"positive_price")
.show
testRDD.select("name", "id").where($"price" > 0).distinct.groupBy($"id").agg(count("name")).show
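For reference, a minimal runnable sketch of the boolean-column approach above, with made-up rows (assuming a SparkSession value named spark):

import org.apache.spark.sql.functions.{countDistinct, when}
import spark.implicits._

case class TestRow(name: String, id: Int, price: Int)
val df = Seq(TestRow("a", 1, 10), TestRow("b", 1, 0), TestRow("a", 2, 5)).toDF()

// distinct names per id, keeping only the groups with a positive price
df.groupBy($"id", when($"price" > 0, true).otherwise(false).as("positive_price"))
  .agg(countDistinct($"name"))
  .where($"positive_price")
  .show()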
I have a data frame with n columns and I want to replace the empty strings in all these columns with nulls.
I tried using
val ReadDf = rawDF.na.replace("columnA", Map( "" -> null));
and
val ReadDf = rawDF.withColumn("columnA", if($"columnA"=="") lit(null) else $"columnA" );
Neither of them worked.
Any leads would be highly appreciated. Thanks.
Your first approach seems to fail due to a bug that prevents replace from being able to replace values with nulls; see here.
Your second approach fails because you're confusing driver-side Scala code with executor-side DataFrame instructions: your if-else expression would be evaluated once on the driver (and not per record). You'd want to replace it with a call to the when function. Moreover, to compare a column's value you need to use the === operator, not Scala's ==, which just compares the driver-side Column objects:
import org.apache.spark.sql.functions._
rawDF.withColumn("columnA", when($"columnA" === "", lit(null)).otherwise($"columnA"))
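To cover all n columns rather than just columnA, one possible extension (a sketch, reusing the wildcard import above and assuming every column is of String type) is to fold the same expression over df.columns:

val cleanedDF = rawDF.columns.foldLeft(rawDF) { (df, colName) =>
  // replace empty strings with null in each column, leaving other values untouched
  df.withColumn(colName, when(df(colName) === "", lit(null)).otherwise(df(colName)))
}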
I have a dataframe df, which contains the data below:
customers  product  Val_id
1          A        1
2          B        X
3          C
4          D        Z
I have successfully filtered for the data where column val_id is blank:
df.where(col("val_id").isin(""))
But I am not able to figure out a way to filter the data where column val_id is not blank. I tried something like the below, but it did not work for me:
df.where(col("val_id").isnotin(""))
Can anyone please help me achieve this using Spark Scala?
You can use filter to get the desired output:
df.filter("val_id != ''")
Assuming Val_id is of String type, you can use this inequality operator !==:
df.where(col("Val_id") !== "").show
Conversely, you can also use === for matching the blank.
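For instance, to keep only the rows where Val_id is blank:

df.where(col("Val_id") === "").show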
If the column type is String:
df.where(trim(col("val_id")) =!= "")
Attempting to remove rows in which a Spark dataframe column contains blank strings. Originally I did val df2 = df1.na.drop(), but it turns out many of these values are being encoded as "".
I'm stuck using Spark 1.3.1 and also cannot rely on the DSL. (Importing spark.implicits._ isn't working.)
Removing things from a dataframe requires filter().
val newDF = oldDF.filter("colName != ''")
or am I misunderstanding your question?
In case someone doesn't want to drop the records with blank strings, but just wants to convert the blank strings to some constant value:
val newdf = df.na.replace(df.columns,Map("" -> "0")) // to convert blank strings to zero
newdf.show()
You can use this:
df.filter(!($"col_name"===""))
It filters out the rows where the value of "col_name" is "", i.e. an empty/blank string. This uses the equality match and then inverts it with "!".
I am also new to Spark, so I don't know whether the code below is more complex than necessary, but it works.
Here we are creating a UDF which converts blank values to null.
sqlContext.udf().register("convertToNull", (String abc) -> (abc.trim().length() > 0 ? abc : null), DataTypes.StringType);
After the above code you can use "convertToNull" (works on strings) in a select clause to make all blank fields null, and then use .na().drop().
crimeDataFrame.selectExpr("C0","convertToNull(C1)","C2","C3").na().drop()
Note: You can use the same approach in Scala.
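A rough Scala sketch of that idea (assuming the same sqlContext and crimeDataFrame as above):

// register a UDF that turns blank or whitespace-only strings into null
sqlContext.udf.register("convertToNull", (abc: String) => if (abc != null && abc.trim.nonEmpty) abc else null)

crimeDataFrame.selectExpr("C0", "convertToNull(C1)", "C2", "C3").na.drop()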
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-udfs.html