Variable substitution in scala - scala

I have two dataframes in scala both having data from two different tables but of same structure (srcdataframe and tgttable). I have to join these two based on composite primary key and select few columns and append two columns the code for which is as below:
for(i <- 2 until numCols) {
srcdataframe.as("A")
.join(tgttable.as("B"), $"A.INSTANCE_ID" === $"B.INSTANCE_ID" &&
$"A.CONTRACT_LINE_ID" === $"B.CONTRACT_LINE_ID", "inner")
.filter($"A." + srcColnm(i) =!= $"B." + srcColnm(i))
.select($"A.INSTANCE_ID",
$"A.CONTRACT_LINE_ID",
"$"+"\""+"A."+srcColnm(i)+"\""+","+"$"+"\""+"B."+srcColnm(i)+"\"")
.withColumn("MisMatchedCol",lit("\""+srcColnm(i)+"\""))
.withColumn("LastRunDate",current_timestamp.cast("long"))
.createOrReplaceTempView("IPF_1M_Mismatch");
hiveSQLContext.sql("Insert into table xxxx.f2f_Mismatch1 select t.* from (select * from IPF_1M_Mismatch) t");}
Here are the things am trying to do:
Inner join of srcdataframe and tgttable based on instance_id and contract_line_id.
Select only instance_id, contract_line_id, mismatched_col_values, hardcode of mismatched_col_nm, timestamp.
srcColnm(i) is an array of strings which contains the non-primary keys to be compared.
However, I am not able to resolve the variables inside the dataframe in the for loop. I tried looking up for solutions here and here. I got to know that it may be because of the way spark substitutes the variables only at compile time, in this case I'm not sure how to resolve it.

Instead of creating columns with $, you can simply use strings or the col() function. I would also recommend performing the join outside of the for as it's an expensive operation. Slightly changed code, the main difference to solve your problem is in the select:
val df = srcdataframe.as("A")
.join(tgttable.as("B"), Seq("INSTANCE_ID", "CONTRACT_LINE_ID"), "inner")
for(columnName <- srcColnm) {
df.filter(col("A." + columnName) =!= col("B." + columnName))
.select("INSTANCE_ID", "CONTRACT_LINE_ID", "A." + columnName, "B." + columnName)
.withColumn("MisMatchedCol", lit(columnName))
.withColumn("LastRunDate", current_timestamp().cast("long"))
.createOrReplaceTempView("IPF_1M_Mismatch")
// Hive command
}
Regarding the problem in select:
$ is short for the col() function, it's selecting a column in the dataframe by name. The problem in the select is that the two first arguments col("A.INSTANCE_ID") and col("A.CONTRACT_LINE_ID") are two columns ($replaced bycol()` for clarity).
However, the next two arguments are strings. It is not possible to mix these two, either all arguments should be columns or all are strings. As you used "A."+srcColnm(i) to build up the column name $ can't be used, however, you could have used col("A."+srcColnm(i)).

Related

pyspark join on a secondary column when primary column finds no match

The following code is intended to do this operation, however, does not work. It still joins on both columns yielding two rows when I should only get one row.
df = df.join(
msa,
(msa.msa_join_key == df.metro_join_key)
| (
(msa.msa_join_key != df.metro_join_key)
& (msa.msa_join_key == df.state_join_key)
),
"left",
)
This code should work, so I am not sure why it does not. I've seen similar questions that also cannot solve this given that they too yield additional rows. pyspark - join with OR condition
Is this a known bug in pyspark? Is there a different way to do this join?

Is there a Scala collection that maintains the order of insert?

I have a List:hdtList which contain columns that represent the columns of a Hive table:
forecast_id bigint,period_year bigint,period_num bigint,period_name string,drm_org string,ledger_id bigint,currency_code string,source_system_name string,source_record_type string,gl_source_name string,gl_source_system_name string,year string
I have a List: partition_columns which contains two elements: source_system_name, period_year
Using the List: partition_columns, I am trying to match them and move the corresponding columns in List: hdtList to the end of it as below:
val (pc, notPc) = hdtList.partition(c => partition_columns.contains(c.takeWhile(x => x != ' ')))
But when I print them as: println(notPc.mkString(",") + "," + pc.mkString(","))
I see the output unordered as below:
forecast_id bigint,period_num bigint,period_name string,drm_org string,ledger_id bigint,currency_code string,source_record_type string,gl_source_name string,gl_source_system_name string,year string,period string,period_year bigint,source_system_name string
The columns period_year comes first and the source_system_name last. Is there anyway I can make data as below so that the order of columns in the List: partition_columns is maintained.
forecast_id bigint,period_num bigint,period_name string,drm_org string,ledger_id bigint,currency_code string,source_record_type string,gl_source_name string,gl_source_system_name string,year string,period string,source_system_name string,period_year bigint
I know there is an option to reverse a List but I'd like to learn if I can implement a collection that maintains that order of insert.
It doesn't matter which collections you use; you only use partition_columns to call contains which doesn't depend on its order, so how could it be maintained?
But your code does maintain order: it's just hdtList's.
Something like
// get is ugly, but safe here
val pc1 = partition_columns.map(x => pc.find(y => y.startsWith(x)).get)
after your code will give you desired order, though there's probably more efficient way to do it.

Minus logic implementation not working with spark/scala

Minus Logic in Hive:
The below (Hive)query will return only records available in left side table ( Full_Table ft), but not in both.
Select ft.* from Full_Table ft left join Stage_Table stg where stg.primary_key1 IS null and stg.primary_key2 IS null
I tried to implement the same in spark/scala using following method ( To support both primary key and composite key ) , But joined result set does not have column from right table, because of that not able to apply stg.primary_key2 IS null condition in joined result set.
ft.join(stg,usingColumns, “left_outer”) // used seq to support composite key column join
Please suggest me how to implement minus logic in spark scala.
Thanks,
Saravanan
https://www.linkedin.com/in/saravanan303/
If your tables have the same columns you can use except method from DataSet:
fullTable.except(stageTable)
If they don't have, but you are interested only on subset of columns that exists in both tables you can first select those column using select transformation and than use except:
val fullTableSelectedColumns = fullTable.select(c1,c2,c3)
val stageTableSelectedColumns = stageTable.select(c1,c2,c3)
fullTableSelectedColumns.except(stageTableSelectedColumns)
On other case, you can use join and filter transformations:
fullTable
.join(stageTable, fullTable("primary_key") === stageTable("primary_key"), "left")
.filter(stageTable("primary_key1").isNotNull)

Replace Empty values with nulls in Spark Dataframe

I have a data frame with n number of columns and I want to replace empty strings in all these columns with nulls.
I tried using
val ReadDf = rawDF.na.replace("columnA", Map( "" -> null));
and
val ReadDf = rawDF.withColumn("columnA", if($"columnA"=="") lit(null) else $"columnA" );
Both of them didn't work.
Any leads would be highly appreciated. Thanks.
Your first approach seams to fail due to a bug that prevents replace from being able to replace values with nulls, see here.
Your second approach fails because you're confusing driver-side Scala code for executor-side Dataframe instructions: your if-else expression would be evaluated once on the driver (and not per record); You'd want to replace it with a call to when function; Moreover, to compare a column's value you need to use the === operator, and not Scala's == which just compares the driver-side Column object:
import org.apache.spark.sql.functions._
rawDF.withColumn("columnA", when($"columnA" === "", lit(null)).otherwise($"columnA"))

How to use orderby() with descending order in Spark window functions?

I need a window function that partitions by some keys (=column names), orders by another column name and returns the rows with top x ranks.
This works fine for ascending order:
def getTopX(df: DataFrame, top_x: String, top_key: String, top_value:String): DataFrame ={
val top_keys: List[String] = top_key.split(", ").map(_.trim).toList
val w = Window.partitionBy(top_keys(1),top_keys.drop(1):_*)
.orderBy(top_value)
val rankCondition = "rn < "+top_x.toString
val dfTop = df.withColumn("rn",row_number().over(w))
.where(rankCondition).drop("rn")
return dfTop
}
But when I try to change it to orderBy(desc(top_value)) or orderBy(top_value.desc) in line 4, I get a syntax error. What's the correct syntax here?
There are two versions of orderBy, one that works with strings and one that works with Column objects (API). Your code is using the first version, which does not allow for changing the sort order. You need to switch to the column version and then call the desc method, e.g., myCol.desc.
Now, we get into API design territory. The advantage of passing Column parameters is that you have a lot more flexibility, e.g., you can use expressions, etc. If you want to maintain an API that takes in a string as opposed to a Column, you need to convert the string to a column. There are a number of ways to do this and the easiest is to use org.apache.spark.sql.functions.col(myColName).
Putting it all together, we get
.orderBy(org.apache.spark.sql.functions.col(top_value).desc)
Say for example, if we need to order by a column called Date in descending order in the Window function, use the $ symbol before the column name which will enable us to use the asc or desc syntax.
Window.orderBy($"Date".desc)
After specifying the column name in double quotes, give .desc which will sort in descending order.
Column
col = new Column("ts")
col = col.desc()
WindowSpec w = Window.partitionBy("col1", "col2").orderBy(col)