Apache Spark Dataframe - Issue with setting up a not-equal join - scala

I have two dataframes that I'm joining on multiple columns. The first pair of columns is compared for equality and the second pair for inequality. The code looks like this:
val arule_1w = itemLHS
  .join(itemRHS, itemLHS("CUST_ID") === itemRHS("CUST_ID") && itemLHS("LHS") != itemRHS("RHS"))
The resulting data still contains rows where itemLHS("LHS") equals itemRHS("RHS"), which it shouldn't with a not-equal join. It may be user error as well, but all my research tells me that format is correct. All datatypes are string values.
Thanks for your help!

The correct operator is =!=, not !=.

Use the syntax below:
itemLHS("LHS") !== itemRHS("RHS")

Related

Create new column based on equality between existing columns

While it seems a trivial task, I haven't been able to find a tidy solution for it. I want to add a new (integer) column, nCol, to a dataframe, whose value is determined by comparing two existing columns (both of String type) of the dataframe, eCol1 and eCol2. Something like:
df(nCol) = {
if df(eCol1) == df(eCol2) then 1
else 0
}
I believe it could be done with the help of user-defined functions (UDFs), but isn't there a tidier way for such a trivial task?
You need to work with the DataFrame DSL's when/otherwise; to test equality, use ===:
df
.withColumn("newCol", when(df(eCol1) === df(eCol2),1).otherwise(0))

How to do aggregation on dataframe to get distinct count of column

How do I apply a where condition on a dataframe? For example, I need to groupBy on one column and count the distinct values in another column based on a certain where condition, and I need to apply this where condition for multiple columns.
I tried the way below. Please let me know how I can do this.
case class testRdd(name: String, id: Int, price: Int)
val Cols = testRdd.toDF().groupBy("id").agg(countDistinct("name").when(col("price") > 0, 1).otherwise(0))
This will not work. Or is there a way to do something like the following? Thanks in advance.
testRdd.toDF().groupBy("id").agg(if(col("price")>0)countDistinct("name"))
Here is an alternative approach to @Robin's answer, namely introducing an additional boolean column to group by:
df.groupBy($"id",when($"price">0,true).otherwise(false).as("positive_price"))
.agg(
countDistinct($"name")
)
.where($"positive_price")
.show
testRDD.select("name","id").where($"price">0).distinct.groupBy($"id").agg( count("name")).show

Replace Empty values with nulls in Spark Dataframe

I have a data frame with n columns and I want to replace empty strings in all of these columns with nulls.
I tried using
val ReadDf = rawDF.na.replace("columnA", Map( "" -> null));
and
val ReadDf = rawDF.withColumn("columnA", if($"columnA"=="") lit(null) else $"columnA" );
Neither of them worked.
Any leads would be highly appreciated. Thanks.
Your first approach seems to fail due to a bug that prevents replace from replacing values with nulls; see here.
Your second approach fails because you're confusing driver-side Scala code with executor-side DataFrame instructions: your if-else expression would be evaluated once on the driver (and not per record). You'd want to replace it with a call to the when function. Moreover, to compare a column's value you need to use the === operator, not Scala's ==, which just compares the driver-side Column object:
import org.apache.spark.sql.functions._
rawDF.withColumn("columnA", when($"columnA" === "", lit(null)).otherwise($"columnA"))

Filtering out data in Spark dataframe in Scala

I have a dataframe df, which contains the data below:
**customers**  **product**  **Val_id**
1              A            1
2              B            X
3              C
4              D            Z
I have successfully filtered for the data where the column val_id is blank:
df.where(col("val_id").isin(""))
But I am not able to figure out a way to filter data where the column val_id is not blank. I tried something like the below, but it did not work for me:
df.where(col("val_id").isnotin(""))
Can anyone please help me achieve this using Spark with Scala?
You can use filter to get the desired output:
df.filter("val_id != ''")
Assuming Val_id is of String type, you can use the inequality operator !==:
df.where(col("Val_id") !== "").show
Conversely, you can also use === for matching the blank.
If the column type is String:
df.where(trim(col("val_id")) =!= "")

Removing Blank Strings from a Spark Dataframe

I'm attempting to remove rows in which a Spark dataframe column contains blank strings. I originally did val df2 = df1.na.drop(), but it turns out many of these values are being encoded as "".
I'm stuck on Spark 1.3.1 and also cannot rely on the DSL. (Importing spark.implicits._ isn't working.)
Removing things from a dataframe requires filter().
val newDF = oldDF.filter("colName != ''")
or am I misunderstanding your question?
In case someone doesn't want to drop the records with blank strings, but just convert the blank strings to some constant value:
val newdf = df.na.replace(df.columns,Map("" -> "0")) // to convert blank strings to zero
newdf.show()
You can use this:
df.filter(!($"col_name"===""))
It filters out the rows where the value of "col_name" is "", i.e. nothing/blank string. The equality condition is built with === and then inverted with "!".
I am also new to Spark, so I don't know if the code mentioned below is more complex than it needs to be, but it works.
Here we are creating a UDF which converts blank values to null.
sqlContext.udf().register("convertToNull",(String abc) -> (abc.trim().length() > 0 ? abc : null),DataTypes.StringType);
After the above code you can use "convertToNull" (it works on strings) in the select clause to make all blank fields null, and then use .na().drop().
crimeDataFrame.selectExpr("C0","convertToNull(C1)","C2","C3").na().drop()
Note: You can use the same approach in Scala.
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-udfs.html
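For completeness, a hedged Scala sketch of the same UDF approach, assuming the same sqlContext and crimeDataFrame as above (null values are also handled before trimming):
// register a UDF that turns blank or whitespace-only strings into null
sqlContext.udf.register("convertToNull",
  (s: String) => if (s != null && s.trim.nonEmpty) s else null)

// then drop the rows where the converted column became null
crimeDataFrame.selectExpr("C0", "convertToNull(C1)", "C2", "C3").na.drop()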