How to properly join two DataFrames for my case? - scala

I use Spark 2.2.0 and Scala 2.11.8. I have some problems with joining two DataFrames.
df1 =
product1_PK  product2_PK
111          222
333          111
...
and:
df2 =
product_PK  product_name
111         AAA
222         BBB
333         CCC
I want to get this result:
product1_PK  product2_PK  product1_name  product2_name
111          222          AAA            BBB
333          111          CCC            AAA
...
How can I do it?
This is what I tried as a partial solution, but I don't know how to do the join efficiently for both product1_PK and product2_PK and rename the columns:
val result = df1.as("left")
.join(df2.as("right"), $"left.product1_PK" === $"right.product_PK")
.drop($"left.product_PK")
.withColumnRenamed("right.product_name","product1_name")

You need to use two joins: the first for product1_name and the second for product2_name.
df1.join(df2.withColumnRenamed("product_PK", "product1_PK")
            .withColumnRenamed("product_name", "product1_name"), Seq("product1_PK"), "left")
   .join(df2.withColumnRenamed("product_PK", "product2_PK")
            .withColumnRenamed("product_name", "product2_name"), Seq("product2_PK"), "left")
   .show(false)
You should have your desired output as
+-----------+-----------+-------------+-------------+
|product2_PK|product1_PK|product1_name|product2_name|
+-----------+-----------+-------------+-------------+
|222        |111        |AAA          |BBB          |
|111        |333        |CCC          |AAA          |
+-----------+-----------+-------------+-------------+
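For reference, here is a minimal self-contained sketch of the same two-join approach using the sample data from the question (it assumes an existing SparkSession named spark):

import spark.implicits._

val df1 = Seq((111, 222), (333, 111)).toDF("product1_PK", "product2_PK")
val df2 = Seq((111, "AAA"), (222, "BBB"), (333, "CCC")).toDF("product_PK", "product_name")

// join df2 once per key column, renaming its columns so each join targets the right key
val result = df1
  .join(df2.withColumnRenamed("product_PK", "product1_PK")
           .withColumnRenamed("product_name", "product1_name"), Seq("product1_PK"), "left")
  .join(df2.withColumnRenamed("product_PK", "product2_PK")
           .withColumnRenamed("product_name", "product2_name"), Seq("product2_PK"), "left")

result.show(false)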

Related

select columns and add fixed width space between columns and save to fixedWidth File in Spark

I want to select a few columns from a DataFrame.
Between the columns I need to add varying amounts of whitespace, because the end user wants a fixed-width file (and the set of columns may change in the future), so some fixed-width spacing needs to be added in between.
I need to save this as a text file, without a header, as a fixed-width file.
My output string should look like below:
aaa   bbb   ccc   ddd
where aaa, bbb, ... are column values selected from the DF, with 3 spaces added in between.
Can anyone please help here?
This is for PySpark:
In pyspark, how do you add/concat a string to a column?
But in Scala it is almost the same:
df.select(concat(col("firstColumn"), lit(" "),
col("secondColumn"), lit(" "),
col("thirdColumn"))).show()
I think it is better to work with RDDs if you want to save the output as a text file. Here is my solution in PySpark:
>>> data = sc.parallelize([
... ('aaa','bbb','ccc','ddd'),
... ('aaa','bbb','ccc','ddd'),
... ('aaa','bbb','ccc','ddd')])
>>> columns = ['a','b','c','d']
>>>
>>> df = spark.createDataFrame(data, columns)
>>>
>>> df.show()
+---+---+---+---+
|  a|  b|  c|  d|
+---+---+---+---+
|aaa|bbb|ccc|ddd|
|aaa|bbb|ccc|ddd|
|aaa|bbb|ccc|ddd|
+---+---+---+---+
>>>
>>> df.registerTempTable("table1")
>>>
>>> table1 = spark.sql("select concat(a,'   ', b,'   ', c,'   ', d) col from table1")
>>>
>>> table1.show()
+--------------------+
|                 col|
+--------------------+
|aaa   bbb   ccc  ...|
|aaa   bbb   ccc  ...|
|aaa   bbb   ccc  ...|
+--------------------+
>>>
>>> rdd = table1.rdd.map(lambda x: "".join([str(i) for i in x]))
>>>
>>> rdd.collect()
['aaa   bbb   ccc   ddd', 'aaa   bbb   ccc   ddd', 'aaa   bbb   ccc   ddd']
>>>
>>> rdd.saveAsTextFile("/yourpath")
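In Scala you can also stay in the DataFrame API and skip the RDD step, since write.text accepts a DataFrame with a single string column; a small sketch assuming the same columns a, b, c, d:

import org.apache.spark.sql.functions._

// build one string column with three spaces between the fields and save it without a header
df.select(concat_ws("   ", col("a"), col("b"), col("c"), col("d")).as("value"))
  .write
  .text("/yourpath")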

Update data from two Data Frames Scala-Spark

I have two Data Frames:
DF1:
ID | Col1 | Col2
1  | a    | aa
2  | b    | bb
3  | c    | cc
DF2:
ID | Col1 | Col2
1  | ab   | aa
2  | b    | bba
4  | d    | dd
How can I join these two DFs so that the result is:
Result:
ID | Col1 | Col2
1  | ab   | aa
2  | b    | bba
3  | c    | cc
4  | d    | dd
My code is:
val df = DF1.join(DF2, Seq("ID"), "outer")
.select($"ID",
when(DF1("Col1").isNull, lit(0)).otherwise(DF1("Col1")).as("Col1"),
when(DF1("Col2").isNull, lit(0)).otherwise(DF2("Col2")).as("Col2"))
.orderBy("ID")
And it works, but I don't want to specify each column explicitly, because I have large files.
So, is there any way to update the DataFrame (and to add records when the second DF contains new ones) without specifying each column?
A simple leftanti join of df1 with df2, with the result merged back into df2, should get your desired output:
df2.union(df1.join(df2, Seq("ID"), "leftanti")).orderBy("ID").show(false)
which should give you
+---+----+----+
|ID |Col1|Col2|
+---+----+----+
|1  |ab  |aa  |
|2  |b   |bba |
|3  |c   |cc  |
|4  |d   |dd  |
+---+----+----+
The solution doesn't follow the logic you have in your code, but it generates the expected result.
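A minimal self-contained sketch of the same idea, with df1 and df2 holding the question's DF1 and DF2 data (spark being an existing SparkSession):

import spark.implicits._

val df1 = Seq((1, "a", "aa"), (2, "b", "bb"), (3, "c", "cc")).toDF("ID", "Col1", "Col2")
val df2 = Seq((1, "ab", "aa"), (2, "b", "bba"), (4, "d", "dd")).toDF("ID", "Col1", "Col2")

// rows of df2 win; rows of df1 whose ID is missing from df2 are appended
df2.union(df1.join(df2, Seq("ID"), "leftanti")).orderBy("ID").show(false)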

How to apply Window function to multiple columns in DataFrame

I have the following DataFrame df:
Id  label  field1  field2
1   xxx    2       3
1   yyy    1       5
2   aaa    0       10
1   zzz    2       6
For each unique Id I want to know which label has the highest field1 and which has the highest field2.
Expected result:
Id  labelField1  labelField2
1   xxx          zzz
2   aaa          aaa
I know how to do it if I only had labelField1 or labelField2, but I am not sure what the best way is to deal with both labels.
val w1 = Window.partitionBy($"Id").orderBy($"field1".desc)
val w2 = Window.partitionBy($"Id").orderBy($"field2".desc)

val myLabels = df.select("Id", "label", "field1", "field2")
  .withColumn("rn", row_number.over(w1)).where($"rn" === 1)
  .drop("rn")
  .drop("field1")
You can combine the struct and max built-in functions to achieve your requirement as follows:
import org.apache.spark.sql.functions._
df.groupBy("Id")
.agg(max(struct("field1", "label")).as("temp1"), max(struct("field2", "label")).as("temp2"))
.select(col("Id"), col("temp1.label").as("labelField1"), col("temp2.label").as("labelField2"))
.show(false)
which should give you
+---+-----------+-----------+
|Id |labelField1|labelField2|
+---+-----------+-----------+
|1  |xxx        |zzz        |
|2  |aaa        |aaa        |
+---+-----------+-----------+
Note: in the case of a tie, as in field1 for Id=1 where xxx and zzz are tied, which label gets chosen is not guaranteed, so treat it as arbitrary.
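If you would rather stay with the window approach you already started, a minimal sketch reusing your w1 and w2 (and assuming spark.implicits._ is in scope for the $ syntax) is to rank by each field separately and join the two results on Id:

import org.apache.spark.sql.functions._

// top label per Id by field1
val byField1 = df.withColumn("rn", row_number.over(w1)).where($"rn" === 1)
  .select($"Id", $"label".as("labelField1"))

// top label per Id by field2
val byField2 = df.withColumn("rn", row_number.over(w2)).where($"rn" === 1)
  .select($"Id", $"label".as("labelField2"))

byField1.join(byField2, Seq("Id")).show(false)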

How to add a new column to data frame based on two columns of other data frames

I have two DataFrames df_data and df_node_labels:
df_data =
nodeId  field1
1       abc
2       def
3       fed
4       kfl
df_node_labels =
srcId  srcLabel  dstId  dstLabel
1      AAA       2      BBB
2      BBB       4      FFF
4      FFF       3      CCC
I want to add a column label to df_data. The values of label should be taken from srcLabel and dstLabel:
This is how I tried to grab label information:
var df = df_data.join(df_node_labels.select("srcId", "srcLabel"), col("nodeId") === col("srcId"), "left")
df = df.join(df_node_labels.select("dstId", "dstLabel"), col("nodeId") === col("dstId"), "left")
However, this creates two columns srcLabel and dstLabel in df, while I want to get just one column label.
This is the expected result:
df =
nodeId  field1  label
1       abc     AAA
2       def     BBB
3       fed     CCC
4       kfl     FFF
Update:
I can do it this way, but in my opinion it is a long way round for such a simple thing:
df = df.withColumn("label", when(col("srcLabel") =!= "", col("srcLabel")).otherwise(col("dstLabel"))).drop("srcLabel").drop("dstLabel")
You can build one lookup DataFrame from df_node_labels, as finalDF below, and perform a single join operation, which gives you the expected result.
val finalDF = df_node_labels.select($"srcId".as("nodeId"), $"srcLabel".as("label"))
  .union(
    df_node_labels.select($"dstId".as("nodeId"), $"dstLabel".as("label"))
  ).dropDuplicates()

df_data.join(finalDF, Seq("nodeId"), "left")
  .show(false)
Output:
+------+------+-----+
|nodeId|field1|label|
+------+------+-----+
|1     |abc   |AAA  |
|2     |def   |BBB  |
|3     |fed   |CCC  |
|4     |kfl   |FFF  |
+------+------+-----+
I hope this helped you!
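As a side note, your own two-join attempt from the update can also be collapsed into a single label column with coalesce instead of when; a rough sketch (the dropDuplicates guards against a node matching several edges):

import org.apache.spark.sql.functions._

df_data
  .join(df_node_labels.select(col("srcId"), col("srcLabel")), col("nodeId") === col("srcId"), "left")
  .join(df_node_labels.select(col("dstId"), col("dstLabel")), col("nodeId") === col("dstId"), "left")
  .withColumn("label", coalesce(col("srcLabel"), col("dstLabel")))  // take whichever side matched
  .select("nodeId", "field1", "label")
  .dropDuplicates("nodeId")
  .show(false)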

create map from dataframe in spark scala

I have a DataFrame as below:
aaa | bbb  | ccc | ddd | eee
----------------------------
100 | xxxx | 123 | yyy | 2017
100 | yyyy | 345 | zzz | 2017
200 | rrrr | 500 | qqq | 2017
300 | uuuu | 200 | ttt | 2017
200 | iiii | 500 | ooo | 2017
I want to get the result as
{100,[{xxxx:{123,yyy}},{yyyy:{345,zzz}}],2017}
{200,[{rrrr:{500,qqq}},{iiii:{500,ooo}}],2017}
{300,[{uuuu:{200,ttt}}],2017}
Kindly help
This works:
val df = data
  .withColumn("cd", array('ccc, 'ddd))                       // create arrays of ccc and ddd
  .withColumn("valuesMap", map('bbb, 'cd))                    // create the bbb -> [ccc, ddd] mapping
  .withColumn("values", collect_list('valuesMap)              // collect mappings per aaa
    .over(Window.partitionBy('aaa)))
  .withColumn("eee", first('eee)                              // eee is constant, just take the first value of the Window
    .over(Window.partitionBy('aaa)))
  .select("aaa", "values", "eee")                             // keep only the columns asked for in the question
  .select(to_json(struct("aaa", "values", "eee")).as("value")) // create the JSON
Make sure you do
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
and that spark.implicits._ is in scope for the 'ccc-style column references.
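If you want exactly one JSON line per aaa, as in the expected result, a sketch of an alternative that aggregates with groupBy instead of a Window (grouping also by eee, which is constant per aaa in the sample data) would be:

import org.apache.spark.sql.functions._

// `data` is the input DataFrame with columns aaa, bbb, ccc, ddd, eee
data
  .withColumn("valuesMap", map(col("bbb"), array(col("ccc"), col("ddd")))) // bbb -> [ccc, ddd]
  .groupBy(col("aaa"), col("eee"))                                         // one output row per aaa
  .agg(collect_list(col("valuesMap")).as("values"))
  .select(to_json(struct(col("aaa"), col("values"), col("eee"))).as("value"))
  .show(false)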
You can create a map defining the values as constants with lit() or taking them from other columns in the dataframe with $"col_name", like this:
val new_df = df.withColumn("map_feature", map(lit("key1"), lit("value1"), lit("key2"), $"col2"))