Fetching all columns from one and some from the other - scala

I'm using Spark with Scala. I have two DataFrames that I want to join, selecting all columns from the first and a few from the second.
This is my code, which doesn't work:
val df = df1.join(df2,
df1("a") <=> df2("a")
&& df1("b") <=> df2("b"),
"left").select(df1("*"), // is this correct?
df2("c AS d", "e AS f")) // fails here
This fails with the following error,
too many arguments for method apply: (colName: String)org.apache.spark.sql.Column in class Dataset
df2("c AS d", "e AS f"))
I couldn't find a different method in the API to do it.
How do I do this?

Try using aliases. I don't know the Scala syntax, but the PySpark code below joins two DataFrames and selects all columns from one and some columns from the other:
from pyspark.sql import functions as f
df1_col = df1.columns
resultdf = df1.alias('left_table') \
    .join(df2.alias('right_table'), f.col('left_table.col1') == f.col('right_table.col1')) \
    .select(
        [f.col('left_table.' + xx) for xx in df1_col]
        + [f.col('right_table.col2'), f.col('right_table.col3'), f.col('right_table.col4')])
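In Scala, the same idea might look like the sketch below. The error in the question arises because df2(...) accepts a single column name, so each column from the second DataFrame has to be selected and renamed individually with .as. The sample data, the extra column z, and the aliases l and r are assumptions made for illustration; only the column names a, b, c, e, d and f come from the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[1]").appName("select-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample data; column z is invented to show that "l.*" keeps every df1 column.
val df1 = Seq((1, 10, "x"), (2, 20, "y")).toDF("a", "b", "z")
val df2 = Seq((1, 10, "c1", "e1")).toDF("a", "b", "c", "e")

val joined = df1.as("l")
  .join(df2.as("r"), col("l.a") <=> col("r.a") && col("l.b") <=> col("r.b"), "left")
  // All columns of the left side, plus two renamed columns of the right side.
  .select(col("l.*"), col("r.c").as("d"), col("r.e").as("f"))
```

With this data, joined has the columns a, b, z, d and f. Because it is a left join, the unmatched row from df1 is kept, with null in d and f.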

Related

Spark: replace all smaller values than X by their sum

I have a dataframe, that has a type and a sub type (broadly speaking).
Say something like:
What I'd like to do is, for each type, sum all values that are smaller than X (say 100 here) and replace them with a single row whose sub-type is "other".
I.e.
Using a window over(Type), I guess I could build two DataFrames (<100 and >=100), sum the first, pick one row and hack it into the single "Other" row, and union the result with the >=100 one. But that seems a rather clumsy way to do it?
(Apologies, I don't have access to PySpark right now to write some code.)
The approach I propose takes into account that the aggregation needs a key that is valid for every row; otherwise you would lose the rows with a value >= 100.
The idea is therefore to add a column that identifies the rows to be aggregated and gives every other row a unique key. Afterwards, you have to clean up the result to match the expected output.
Here is what I propose:
from pyspark.sql import Window, functions as F
# Rows below the threshold share the key "Other"; the rest get a unique key per Type and Sub-Type.
df = df \
    .withColumn("to_agg",
                F.when(F.col("Value") < 100, "Other")
                 .otherwise(F.concat(F.col("Type"), F.lit("-"), F.col("Sub-Type")))
                ) \
    .withColumn("sum_other",
                F.sum(F.col("Value")).over(Window.partitionBy("Type", "to_agg"))) \
    .withColumn("Sub-Type",
                F.when(F.col("to_agg") == "Other", F.col("to_agg"))
                 .otherwise(F.col("Sub-Type"))) \
    .withColumn("Value", F.col("sum_other")) \
    .drop("to_agg", "sum_other") \
    .dropDuplicates(["Type", "Sub-Type"]) \
    .orderBy(F.col("Type").asc(), F.col("Value").desc())
Note: using a groupBy is also valid and simpler, but the result then contains only the grouping columns and the aggregates. That is why I prefer a window function, which keeps all the other columns of the original dataset.
You could simply replace Sub-Type with "Other" for all rows with Value < 100 and then group by and sum:
from pyspark.sql import functions as F
(
    df
    .withColumn('Sub-Type', F.when(F.col('Value') < 100, 'Other').otherwise(F.col('Sub-Type')))
    .groupby('Type', 'Sub-Type')
    .agg(
        F.sum('Value').alias('Value')
    )
)

Join two dataframe using Spark Scala

I have this code :
val o = p_value.alias("d1").join(t_d.alias("d2"),
(col("d1.origin_latitude")===col("d2.origin_latitude")&&
col("d1.origin_longitude")===col("d2.origin_longitude")),"left").
filter(col("d2.origin_longitude").isNull)
val c = p_value2.alias("d3").join(o.alias("d4"),
(col("d3.origin_latitude")===col("d4.origin_latitude") &&
col("d3.origin_longitude")===col("d4.origin_longitude")),"left").
filter(col("d3.origin_longitude").isNull)
I get this error :
Exception in thread "main" org.apache.spark.sql.AnalysisException: Reference 'd4.origin_latitude' is ambiguous, could be: d4.origin_latitude, d4.origin_latitude.;
at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:240)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:101)
on this line:
(col("d3.origin_latitude")===col("d4.origin_latitude") && col("d3.origin_longitude")===col("d4.origin_longitude")),"left").
Any ideas?
Thank you.
You are aliasing the DataFrames, not the columns; a DataFrame alias is only a qualifier used to refer to the columns of that DataFrame.
So the first join produces a DataFrame that contains origin_latitude and origin_longitude twice each (once from d1 and once from d2). When that result is aliased as d4, both copies become d4.origin_latitude, so any reference to it raises the ambiguity error.
You therefore need to make sure the DataFrame contains each column only once.
You can rewrite the first join as below:
p_value
.join(t_d, Seq("origin_latitude", "origin_longitude"), "left")
.filter(t_d.col("origin_longitude").isNull)
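Since the goal of the left join plus isNull filter is to keep only the rows that have no match on the right side, a left anti join is an alternative worth considering. The sketch below is one way to write it, not the original author's method. The sample rows and the payload columns are hypothetical; only the two coordinate column names come from the question.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("anti-join-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample data: the second p_value row has no counterpart in t_d.
val p_value = Seq((1.0, 2.0, "p1"), (3.0, 4.0, "p2"))
  .toDF("origin_latitude", "origin_longitude", "p_payload")
val t_d = Seq((1.0, 2.0, "t1"))
  .toDF("origin_latitude", "origin_longitude", "t_payload")

// "left_anti" keeps the left rows without a match and returns only the left side's columns,
// so the result has no duplicate column names to disambiguate later.
val o = p_value.join(t_d, Seq("origin_latitude", "origin_longitude"), "left_anti")
```

Because o then contains each coordinate column only once, it can be aliased and joined again without an ambiguous reference.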

Spark Scala GroupBy

I have the following dataset:
col1_id, col2_id, type
1 t1_1 t1
1 t1_2 t1
2 t2_2 t2
col1_id and col2_id have a one-to-many relationship, i.e. multiple col2_id values can share the same col1_id value.
type (e.g. t1) is derived from col2_id.
The objective is to find the number of col1_id values for each type (i.e. t1, t2, etc.).
Here is what I'm doing currently:
val df1 = df.select($"col1_id", $"type").groupBy($"col1_id", $"type").count()
df1.drop($"count").groupBy($"type").count().show()
This works fine, but I'm wondering if there is a better way to accomplish it.
Please let me know.
I'm not sure why you mention col2_id, since it does not seem to play a role here.
I assume you want to count the distinct col1_id values per type? If so:
import org.apache.spark.sql.functions.countDistinct
df
.groupBy($"type")
.agg(
countDistinct($"col1_id")
)
.show()
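As a worked example, the sketch below applies that aggregation to the three sample rows from the question. The local SparkSession setup and the output column name n_col1_ids are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.countDistinct

val spark = SparkSession.builder().master("local[1]").appName("count-distinct-sketch").getOrCreate()
import spark.implicits._

// The three sample rows from the question.
val df = Seq((1, "t1_1", "t1"), (1, "t1_2", "t1"), (2, "t2_2", "t2"))
  .toDF("col1_id", "col2_id", "type")

// One row per type with the number of distinct col1_id values.
val perType = df.groupBy($"type").agg(countDistinct($"col1_id").as("n_col1_ids"))

// Collected into a Map for inspection; countDistinct yields a Long.
val result = perType.as[(String, Long)].collect().toMap
```

For this data both t1 and t2 map to 1, because the two t1 rows share the same col1_id.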

Variable substitution in scala

I have two DataFrames in Scala (srcdataframe and tgttable) that hold data from two different tables with the same structure. I have to join them on a composite primary key, select a few columns, and append two columns. The code is below:
for(i <- 2 until numCols) {
srcdataframe.as("A")
.join(tgttable.as("B"), $"A.INSTANCE_ID" === $"B.INSTANCE_ID" &&
$"A.CONTRACT_LINE_ID" === $"B.CONTRACT_LINE_ID", "inner")
.filter($"A." + srcColnm(i) =!= $"B." + srcColnm(i))
.select($"A.INSTANCE_ID",
$"A.CONTRACT_LINE_ID",
"$"+"\""+"A."+srcColnm(i)+"\""+","+"$"+"\""+"B."+srcColnm(i)+"\"")
.withColumn("MisMatchedCol",lit("\""+srcColnm(i)+"\""))
.withColumn("LastRunDate",current_timestamp.cast("long"))
.createOrReplaceTempView("IPF_1M_Mismatch");
hiveSQLContext.sql("Insert into table xxxx.f2f_Mismatch1 select t.* from (select * from IPF_1M_Mismatch) t");}
Here is what I am trying to do:
Inner join srcdataframe and tgttable on instance_id and contract_line_id.
Select only instance_id, contract_line_id, mismatched_col_values, a hard-coded mismatched_col_nm, and a timestamp.
srcColnm is an array of strings containing the non-primary-key column names to be compared.
However, I am not able to resolve the variables inside the DataFrame expressions in the for loop. I tried looking up solutions here and here. I gather it may be because Spark substitutes the variables only at compile time, and in that case I'm not sure how to resolve it.
Instead of creating columns with $, you can simply use strings or the col() function. I would also recommend performing the join outside of the for loop, as it's an expensive operation. The code below is slightly changed; the main difference that solves your problem is in the select:
val df = srcdataframe.as("A")
.join(tgttable.as("B"), Seq("INSTANCE_ID", "CONTRACT_LINE_ID"), "inner")
for(columnName <- srcColnm) {
df.filter(col("A." + columnName) =!= col("B." + columnName))
.select("INSTANCE_ID", "CONTRACT_LINE_ID", "A." + columnName, "B." + columnName)
.withColumn("MisMatchedCol", lit(columnName))
.withColumn("LastRunDate", current_timestamp().cast("long"))
.createOrReplaceTempView("IPF_1M_Mismatch")
// Hive command
}
Regarding the problem in select:
$ is short for the col() function; it selects a column in the DataFrame by name. The problem in the select is that the first two arguments, col("A.INSTANCE_ID") and col("A.CONTRACT_LINE_ID"), are columns ($ replaced by col() for clarity).
However, the next two arguments are strings. It is not possible to mix the two: either all arguments must be columns or all must be strings. Because you build the column name with "A."+srcColnm(i), $ can't be used; you could, however, have used col("A."+srcColnm(i)).
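The all-columns variant can be sketched as follows, with every argument to select being a Column. The sample rows, the compared column QTY, and the output names src_value and tgt_value are assumptions for illustration; the key column names and the A/B aliases come from the question.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[1]").appName("select-columns-sketch").getOrCreate()
import spark.implicits._

// Hypothetical data: the second row differs in QTY between source and target.
val srcdataframe = Seq((1, 10, 5), (2, 20, 7)).toDF("INSTANCE_ID", "CONTRACT_LINE_ID", "QTY")
val tgttable = Seq((1, 10, 5), (2, 20, 8)).toDF("INSTANCE_ID", "CONTRACT_LINE_ID", "QTY")

val joined = srcdataframe.as("A")
  .join(tgttable.as("B"), Seq("INSTANCE_ID", "CONTRACT_LINE_ID"), "inner")

val columnName = "QTY" // stands in for srcColnm(i)

// Key columns and dynamically named columns are all built with col(), so they can be mixed freely.
val selected: Seq[Column] = Seq(
  col("INSTANCE_ID"),
  col("CONTRACT_LINE_ID"),
  col("A." + columnName).as("src_value"),
  col("B." + columnName).as("tgt_value"))

val mismatches = joined
  .filter(col("A." + columnName) =!= col("B." + columnName))
  .select(selected: _*)
```

Renaming the two compared columns also avoids ending up with two output columns that share the same name, which can cause trouble when the result is written to a table.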

Join Dataframes in Spark

I have joined two DataFrames in Spark, expDataFrame and accountList, using the code below:
val expDetails = expDataFrame.as("fex").join(accountList.as("acctlist"),$"fex.acct_id" === $"acctlist.acct_id", "inner")
Now I am trying to show acct_id from both DataFrames, using this code:
expDetails.select($"fex.acct_id",$"acctlist.acct_id").show
but I get the same column name, acct_id, twice.
I want two unique column names, like fex_acct_id and acctlist_acct_id, to identify which DataFrame each column comes from.
You simply have to add an alias to the columns using the as or alias methods. This will do the job:
expDetails.select(
$"fex.acct_id".as("fex_acct_id"),
$"acctlist.acct_id".as("acctlist_acct_id")
).show