generate dynamic join condition spark/scala - scala

I have an array of tuples and I want to generate a join condition (OR) from it.
e.g.
input --> [("leftId", "rightId"), ("leftAltId", "rightAltId")]
output --> leftDF("leftId") === rightDF("rightId") || leftDF("leftAltId") === rightDF("rightAltId")
method signature:
def inner(leftDF: DataFrame, rightDF: DataFrame, fieldsToJoin: Array[(String,String)]): Unit = {
}
I tried using a reduce operation on the array, but the output of my reduce is a Column and not a String, so it can't be fed back as input. I could do it recursively, but I'm hoping there's a simpler way to initialize an empty column variable and build up the condition. Thoughts?

You can do something like this:
import org.apache.spark.sql.functions.col
val cond = fieldsToJoin.map(x => col(x._1) === col(x._2)).reduce(_ || _)
leftDF.join(rightDF, cond)
Basically, you first turn the array into an array of conditions (col turns each string into a Column, and === builds the equality test), and then reduce inserts the "or" between them. The result is a single Column you can pass to join.
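If you want this wrapped in the method from the question, a minimal sketch (returning the joined DataFrame rather than Unit, and qualifying each side with its own DataFrame so overlapping column names stay unambiguous) could look like:
import org.apache.spark.sql.DataFrame
def inner(leftDF: DataFrame, rightDF: DataFrame, fieldsToJoin: Array[(String, String)]): DataFrame = {
  // build one equality per pair, each column qualified by the DataFrame it belongs to
  val cond = fieldsToJoin
    .map { case (l, r) => leftDF(l) === rightDF(r) }
    .reduce(_ || _) // OR the individual conditions together
  leftDF.join(rightDF, cond)
}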

Related

PySpark Parsing nested array of struct

I would like to parse out the value of a specific key from a PySpark SQL dataframe with the format below.
I was able to achieve this with a UDF, but it takes almost 20 minutes to process 40 columns with a JSON size of 100 MB. I tried explode as well, but it gives separate rows for each array element, while I need only the value of the specific key in a given array of structs.
Format
array<struct<key:string,value:struct<int_value:string,string_value:string>>>
Function to get the value for a specific key:
from pyspark.sql.functions import udf, col, lit
from pyspark.sql.types import StringType

def getValueFunc(searcharray, searchkey):
    for val in searcharray:
        if val["key"] == searchkey:
            if val["value"]["string_value"] is not None:
                actual = val["value"]["string_value"]
                return actual
            elif val["value"]["int_value"] is not None:
                actual = val["value"]["int_value"]
                return str(actual)
            else:
                return "---"
.....
getValue = udf(getValueFunc, StringType())
....
# register the udf so it can also be used in SQL expressions
spark.udf.register("getValue", getValue)
.....
df.select(getValue(col("event_params"), lit("category")).alias("event_category"))
For Spark 2.4.0+, you can use Spark SQL's filter() function to find the first array element which matches key == searchkey and then retrieve its value. Below is a Spark SQL snippet template (with searchkey as a variable) to do the first part mentioned above.
stmt = '''filter(event_params, x -> x.key == "{}")[0]'''.format(searchkey)
Run the above stmt with the expr() function, assign the resulting value (a StructType) to a temporary column f1, and then use the coalesce() function to retrieve the non-null value.
from pyspark.sql.functions import expr
df.withColumn('f1', expr(stmt)) \
.selectExpr("coalesce(f1.value.string_value, string(f1.value.int_value),'---') AS event_category") \
.show()
Let me know if you have any problem running the above code.
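For reference, a rough Scala equivalent of the same filter-plus-coalesce approach (a sketch only, assuming the same event_params schema and the key "category") would be:
import org.apache.spark.sql.functions.expr
val searchkey = "category"
// pick the first struct whose key matches, then take the first non-null value field
df.withColumn("f1", expr(s"""filter(event_params, x -> x.key == "$searchkey")[0]"""))
  .selectExpr("coalesce(f1.value.string_value, string(f1.value.int_value), '---') AS event_category")
  .show()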

Create new column based on equality between existing columns

While it seems like a trivial task, I haven't been able to find a tidy solution for it. I want to add a new (integer) column, nCol, to a dataframe, whose value is determined by comparing two existing columns (both String type) of the dataframe, eCol1 and eCol2,
something like:
df(nCol) = {
  if df(eCol1) == df(eCol2) then 1
  else 0
}
I believe it could be done with the help of user-defined functions (UDFs), but isn't there a tidier way for such a trivial task?
You need to use the DataFrame DSL's when/otherwise; to test equality, use ===:
df.withColumn("newCol", when(df(eCol1) === df(eCol2), 1).otherwise(0))

Datatype validation of Spark columns in for loop - Spark DataFrame

I'm trying to validate the datatypes of a DataFrame's columns before entering a loop in which I do a SQL calculation, but the datatype validation doesn't pass and execution never gets inside the loop. The operation needs to be performed only on numeric columns.
How can this be solved? Is this the right way to handle datatype validation?
// get datatype of dataframe fields
val datatypes = parquetRDD_subset.schema.fields
// check if datatype of column is String and enter the loop for calculations
for (val_datatype <- datatypes if val_datatype.dataType == "StringType") {
  val dfs = x.map(field => spark.sql(s"select * from table"))
  val withSum = dfs.reduce((x, y) => x.union(y)).distinct()
}
You are comparing the dataType to a string, which will never be true (for me the compiler even complains that they are unrelated). dataType is an object whose type is a subtype of org.apache.spark.sql.types.DataType.
Try replacing your for with
for (val_datatype <- datatypes if val_datatype.dataType.isInstanceOf[StringType])
In any case, your for loop does nothing but declare the vals; it doesn't do anything with them.
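As an aside, since the question says the calculation should run only on numeric columns, here is a small sketch (under that assumption) of collecting column names by data type; NumericType covers IntegerType, LongType, DoubleType, and so on:
import org.apache.spark.sql.types.{NumericType, StringType}
// names of all numeric columns in the DataFrame
val numericCols = parquetRDD_subset.schema.fields
  .filter(_.dataType.isInstanceOf[NumericType])
  .map(_.name)
// a similar check for string columns
val stringCols = parquetRDD_subset.schema.fields
  .filter(_.dataType == StringType)
  .map(_.name)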

Replace Empty values with nulls in Spark Dataframe

I have a data frame with n columns and I want to replace empty strings in all of these columns with nulls.
I tried using
val ReadDf = rawDF.na.replace("columnA", Map( "" -> null));
and
val ReadDf = rawDF.withColumn("columnA", if($"columnA"=="") lit(null) else $"columnA" );
Neither of them worked.
Any leads would be highly appreciated. Thanks.
Your first approach seems to fail due to a bug that prevents replace from being able to replace values with nulls, see here.
Your second approach fails because you're confusing driver-side Scala code with executor-side DataFrame instructions: your if-else expression would be evaluated once on the driver (and not per record); you'd want to replace it with a call to the when function. Moreover, to compare a column's value you need to use the === operator, and not Scala's ==, which just compares the driver-side Column objects:
import org.apache.spark.sql.functions._
rawDF.withColumn("columnA", when($"columnA" === "", lit(null)).otherwise($"columnA"))

How to write a condition based on multiple values for a DataFrame in Spark

I'm working on a Spark application (using Scala) and I have a List which contains multiple values. I'd like to use this list to write a where clause for my DataFrame and select only a subset of rows. For example, my List contains 'value1', 'value2', and 'value3', and I would like to write something like this:
mydf.where($"col1" === "value1" || $"col1" === "value2" || $"col1" === "value3")
How can I do that programmatically, since the list contains many values?
You can map a list of values to a list of "filters" (with type Column), and reduce this list into a single filter by applying the || operator on every two filters:
val possibleValues = Seq("value1", "value2", "value3")
val result = mydf.where(possibleValues.map($"col1" === _).reduce(_ || _))
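As a side note, for a plain membership test like this one, the built-in Column.isin method expresses the same filter more directly:
// equivalent to OR-ing the individual equality checks together
val result = mydf.where($"col1".isin(possibleValues: _*))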