How to merge two columns into a new DataFrame? - scala

I have two DataFrames (Spark 2.2.0 and Scala 2.11.8). The first DataFrame df1 has one column called col1, and the second one df2 has also 1 column called col2. The number of rows is equal in both DataFrames.
How can I merge these two columns into a new DataFrame?
I tried join, but I think that there should be some other way to do it.
Also, I tried to apply withColumm, but it does not compile.
val result = df1.withColumn(col("col2"), df2.col1)
UPDATE:
For example:
df1 =
col1
1
2
3
df2 =
col2
4
5
6
result =
col1 col2
1 4
2 5
3 6

If that there's no actual relationship between these two columns, it sounds like you need the union operator, which will return, well, just the union of these two dataframes:
var df1 = Seq("a", "b", "c").toDF("one")
var df2 = Seq("d", "e", "f").toDF("two")
df1.union(df2).show
+---+
|one|
+---+
| a |
| b |
| c |
| d |
| e |
| f |
+---+
[edit]
Now you've made clear that you just want two columns, then with DataFrames you can use the trick of adding a row index with the function monotonically_increasing_id() and joining on that index value:
import org.apache.spark.sql.functions.monotonically_increasing_id
var df1 = Seq("a", "b", "c").toDF("one")
var df2 = Seq("d", "e", "f").toDF("two")
df1.withColumn("id", monotonically_increasing_id())
.join(df2.withColumn("id", monotonically_increasing_id()), Seq("id"))
.drop("id")
.show
+---+---+
|one|two|
+---+---+
| a | d |
| b | e |
| c | f |
+---+---+

As far as I know, the only way to do want you want with DataFrames is by adding an index column using RDD.zipWithIndex to each and then doing a join on the index column. Code for doing zipWithIndex on a DataFrame can be found in this SO answer.
But, if the DataFrames are small, it would be much simpler to collect the two DFs in the driver, zip them together, and make the result into a new DataFrame.
[Update with example of in-driver collect/zip]
val df3 = spark.createDataFrame(df1.collect() zip df2.collect()).withColumnRenamed("_1", "col1").withColumnRenamed("_2", "col2")

Depends in what you want to do.
If you want to merge two DataFrame you should use the join. There are the same join's types has in relational algebra (or any DBMS)
You are saying that your Data Frames just had one column each.
In that case you might want todo a cross join (cartesian product) with give you a two columns table of all possible combination of col1 and col2, or you might want the uniao (as referred by #Chondrops) witch give you a one column table with all elements.
I think all other join's types uses can be done specialized operations in spark (in this case two Data Frames one column each).

Related

spark merge datasets based on the same input of one column and concat the others

Currently I have several Dataset[UserRecord], and it looks like this
case class UserRecord(
Id: String,
ts: Timestamp,
detail: String
)
Let's call the several datasets datasets.
Previously I tried this
datasets.reduce(_ union _)
.groupBy("Id")
.agg(collect_list("ts", "detail"))
.as[(String, Seq[DetailRecord]]
but this code gives me an OOM error. I think the root cause is collect_list.
Now I'm thinking if I can do the groupBy and agg for each of the dataset first and then join them together to solve the OOM issue. Any other good advice is welcome too :)
I have an IndexedSeq of datasets look like this
|name| lists |
| x |[[1,2], [3,4]]|
|name| lists |
| y |[[5,6], [7,8]]|
|name| lists |
| x |[[9,10], [11,12]]|
How can I combine them to get a Dataset that looks like
|name| lists |
| x |[[1,2], [3,4],[9,10], [11,12]]|
| y |[[5,6], [7,8]] |
I tried ds.reduce(_ union _) but it didn't seem to work
You can aggregate after union:
val ds2 = ds.reduce(_ unionAll _).groupBy("name").agg(flatten(collect_list("lists")).as("lists"))

spark: merge two dataframes, if ID duplicated in two dataframes, the row in df1 overwrites the row in df2

There are two dataframes: df1, and df2 with the same schema. ID is the primary key.
I need merge the two df1, and df2. This can be done by union except one special requirement: if there are duplicates rows with the same ID in df1 and df2. I need keep the one in df1.
df1:
ID col1 col2
1 AA 2019
2 B 2018
df2:
ID col1 col2
1 A 2019
3 C 2017
I need the following output:
df1:
ID col1 col2
1 AA 2019
2 B 2018
3 C 2017
How to do this? Thanks. I think it is possible to register two tmp tables, do full joins and use coalesce. but I do not prefer this way, because there are about 40 columns, in fact, instead of 3 in the above example.
Given that the two DataFrames have the same schema, you could simply union df1 with the left_anti join of df2 & df1:
df1.union(df2.join(df1, Seq("ID"), "left_anti")).show
// +---+---+----+
// | ID|co1|col2|
// +---+---+----+
// | 1| AA|2019|
// | 2| B|2018|
// | 3| C|2017|
// +---+---+----+
One way to do this is, unioning the dataframes with an identifier column that specifies the dataframe and use it thereafter for prioritizing row from df1 with a function like row_number.
PySpark SQL solution shown here.
from pyspark.sql.functions import lit,row_number,when
from pyspark.sql import Window
df1_with_identifier = df1.withColumn('identifier',lit('df1'))
df2_with_identifier = df2.withColumn('identifier',lit('df2'))
merged_df = df1_with_identifier.union(df2_with_identifier)
#Define the Window with the desired ordering
w = Window.partitionBy(merged_df.id).orderBy(when(merged_df.identifier == 'df1',1).otherwise(2))
result = merged_df.withColumn('rownum',row_number().over(w))
result.select(result.rownum == 1).show()
A solution with a left join on df1 could be a lot simpler, except that you have to write multiple coalesces.

Comparing the value of columns in two dataframe

I have two dataframe, one has unique value of id and other can have multiple values of different id.
This is dataframe df1:
id | dt| speed | stats
358899055773504 2018-07-31 18:38:34 0 [9,-1,-1,13,0,1,0]
358899055773505 2018-07-31 18:48:23 4 [8,-1,0,22,1,1,1]
df2:
id | dt| speed | stats
358899055773504 2018-07-31 18:38:34 0 [9,-1,-1,13,0,1,0]
358899055773505 2018-07-31 18:54:23 4 [9,0,0,22,1,1,1]
358899055773504 2018-07-31 18:58:34 0 [9,0,-1,22,0,1,0]
358899055773504 2018-07-31 18:28:34 0 [9,0,-1,22,0,1,0]
358899055773505 2018-07-31 18:38:23 4 [8,-1,0,22,1,1,1]
I aim to compare the second dataframe with the first dataframe and updating the values in first dataframe, only if the value of dt of a particular id of df2 is greater than that in df1 and if it satisfies the greater than condition then comparing the other fields as well.
You need to join the two dataframes together to make any comparison of their columns.
What you can do is first joining the dataframes and then perform all the filtering to get a new dataframe with all rows that should be updated:
val diffDf = df1.as("a").join(df2.as("b"), Seq("id"))
.filter($"b.dt" > $"a.dt")
.filter(...) // Any other filter required
.select($"id", $"b.dt", $"b.speed", $"b.stats")
Note: In some situations it would be required to do a groupBy(id) or use a window function since there should only be one final row per id in the diffDf dataframe. This can be done as as follows (the example here will select the row with maximum in the speed, but it depends on the actual requirements):
val w = Window.partitionBy($"id").orderBy($"speed".desc)
val diffDf2 = diffDf.withColumn("rn", row_number.over(w)).where($"rn" === 1).drop("rn")
More in-depth information about different approaches can be seen here: How to max value and keep all columns (for max records per group)?.
To replace the old rows with the same id in the df1 dataframe, combine the dataframes with an outer join and coalesce:
val df = df1.as("a").join(diffDf.as("b"), Seq("id"), "outer")
.select(
$"id",
coalesce($"b.dt", $"a.dt").as("dt"),
coalesce($"b.speed", $"a.speed").as("speed"),
coalesce($"b.stats", $"a.stats").as("stats")
)
coalesce works by first trying to take the value from the diffDf (b) dataframe. If that value is null it will take the value from df1 (a).
Result when only using the time filter with the provided example input dataframes:
+---------------+-------------------+-----+-----------------+
| id| dt|speed| stats|
+---------------+-------------------+-----+-----------------+
|358899055773504|2018-07-31 18:58:34| 0|[9,0,-1,22,0,1,0]|
|358899055773505|2018-07-31 18:54:23| 4| [9,0,0,22,1,1,1]|
+---------------+-------------------+-----+-----------------+

How to combine several Dataframes together in scala?

I have several dataframes which contains single column in them. Let's say I have 4 such dataframe all with one column. How can I form a single dataframe by combining all of them?
val df = xmldf.select(col("UserData.UserValue._valueRef"))
val df2 = xmldf.select(col("UserData.UserValue._title"))
val df3 = xmldf.select(col("author"))
val df4 = xmldf.select(col("price"))
To combine, I am trying this, but it doesn't work:
var newdf = df
newdf = newdf.withColumn("col1",df1.col("UserData.UserValue._title"))
newdf.show()
It errors out saying that field of one column are not present in another. I am not sure how can I combine these 4 dataframes together. They don't have any common column.
df2 looks like this:
+---------------+
| _title|
+---------------+
|_CONFIG_CONTEXT|
|_CONFIG_CONTEXT|
|_CONFIG_CONTEXT|
+---------------+
and df looks like this:
+-----------+
|_valuegiven|
+-----------+
| qwe|
| dfdfrt|
| dfdf|
+-----------+
df3 and df4 are also in same format. I want like below dataframe:
+-----------+---------------+
|_valuegiven| _title|
+-----------+---------------+
| qwe|_CONFIG_CONTEXT|
| dfdfrt|_CONFIG_CONTEXT|
| dfdf|_CONFIG_CONTEXT|
+-----------+---------------+
I used this:
val newdf = xmldf.select(col("UserData.UserValue._valuegiven"),col("UserData.UserValue._title") )
newdf.show()
But I am getting column name on the go and as such, I would need to append on the go, due to which I don't know exactly how many columns I will get. Which is why I cannot use the above command.
It's a little unclear of your goal. If asking to join these dataframes, but perhaps you just want to select those 4 columns.
val newdf = xmldf.select($"UserData.UserValue._valueRef", $"UserData.UserValue._title", 'author,'price")
newdf.show
If you really want to join all these dataframes, you'll need to join them all and select the appropriate fields.
If the goal is to get 4 columns from xmldf into a new dataframe you shouldn't be splitting it into 4 dataframes in the first place.
You can select multiple columns from a dataframe by providing additional column names in the select function.
val newdf = xmldf.select(
col("UserData.UserValue._valueRef"),
col("UserData.UserValue._title"),
col("author"),
col("price"))
newdf.show()
So I looked at various ways and finally Ram Ghadiyaram's answer in Solution 2 does what I wanted to do. Using this approach, you can combine any number of columns on the go. Basically, you need to create indexes by which you can join the dataframes together and after joining, drop the index column altogether.

How to calculate product of columns followed by sum over all columns?

Table 1 --Spark DataFrame table
There is a column called "productMe" in Table 1; and there are also other columns like a, b, c and so on whose schema name is contained in a schema array T.
What I want is the inner product of columns(product each row of the two columns) in schema array T with the column productMe(Table 2). And sum each column of Table 2 to get Table 3.
Table 2 is not necessary if you have good idea to get Table 3 in one step.
Table 2 -- Inner product table
For example, the column "a·productMe" is (3*0.2, 6*0.6, 5*0.4) to get (0.6, 3.6, 2)
Table 3 -- sum table
For example, the column "sum(a·productMe)" is 0.6+3.6+2=6.2.
Table 1 is DataFrame of Spark, how can I get Table 3?
You can try something like the following :
val df = Seq(
(3,0.2,0.5,0.4),
(6,0.6,0.3,0.1),
(5,0.4,0.6,0.5)).toDF("productMe", "a", "b", "c")
import org.apache.spark.sql.functions.col
val columnsToSum = df.
columns. // <-- grab all the columns by their name
tail. // <-- skip productMe
map(col). // <-- create Column objects
map(c => round(sum(c * col("productMe")), 3).as(s"sum_${c}_productMe"))
val df2 = df.select(columnsToSum: _*)
df2.show()
# +---------------+---------------+---------------+
# |sum_a_productMe|sum_b_productMe|sum_c_productMe|
# +---------------+---------------+---------------+
# | 6.2| 6.3| 4.3|
# +---------------+---------------+---------------+
The trick is to use df.select(columnsToSum: _*) which means that you want to select all the columns on which we did the sum of columns times the productMe column. The :_* is a Scala-specific syntax to specify that we are passing repeated arguments because we don't have a fix number of arguments.
We can do it with simple SparkSql
val table1 = Seq(
(3,0.2,0.5,0.4),
(6,0.6,0.3,0.1),
(5,0.4,0.6,0.5)
).toDF("productMe", "a", "b", "c")
table1.show
table1.createOrReplaceTempView("table1")
val table2 = spark.sql("select a*productMe, b*productMe, c*productMe from table1") //spark is sparkSession here
table2.show
val table3 = spark.sql("select sum(a*productMe), sum(b*productMe), sum(c*productMe) from table1")
table3.show
All the other answers use sum aggregation that use groupBy under the covers.
groupBy always introduces a shuffle stage and usually (always?) is slower than corresponding window aggregates.
In this particular case, I also believe that window aggregates give better performance as you can see in their physical plans and details for their only one job.
CAUTION
Either solution uses one single partition to do the calculation that in turn makes them unsuitable for large datasets as their size together may easily exceed the memory size of a single JVM.
Window Aggregates
What follows is a window aggregate-based calculation which, in this particular case where we group over all the rows in a dataset, unfortunately gives the same physical plan. That makes my answer just a (hopefully) nice learning experience.
val df = Seq(
(3,0.2,0.5,0.4),
(6,0.6,0.3,0.1),
(5,0.4,0.6,0.5)).toDF("productMe", "a", "b", "c")
// yes, I did borrow this trick with columns from #eliasah's answer
import org.apache.spark.sql.functions.col
val columns = df.columns.tail.map(col).map(c => c * col("productMe") as s"${c}_productMe")
val multiplies = df.select(columns: _*)
scala> multiplies.show
+------------------+------------------+------------------+
| a_productMe| b_productMe| c_productMe|
+------------------+------------------+------------------+
|0.6000000000000001| 1.5|1.2000000000000002|
|3.5999999999999996|1.7999999999999998|0.6000000000000001|
| 2.0| 3.0| 2.5|
+------------------+------------------+------------------+
def sumOverRows(name: String) = sum(name) over ()
val multipliesCols = multiplies.
columns.
map(c => sumOverRows(c) as s"sum_${c}")
val answer = multiplies.
select(multipliesCols: _*).
limit(1) // <-- don't use distinct or dropDuplicates here
scala> answer.show
+-----------------+---------------+-----------------+
| sum_a_productMe|sum_b_productMe| sum_c_productMe|
+-----------------+---------------+-----------------+
|6.199999999999999| 6.3|4.300000000000001|
+-----------------+---------------+-----------------+
Physical Plan
Let's see the physical plan then (as it was the only reason why we wanted to see how to do the query using window aggregates, wasn't it?)
The following is the details for the only job 0.
If I understand your question correctly then following can be your solution
val df = Seq(
(3,0.2,0.5,0.4),
(6,0.6,0.3,0.1),
(5,0.4,0.6,0.5)
).toDF("productMe", "a", "b", "c")
This gives input dataframe as you have (you can add more)
+---------+---+---+---+
|productMe|a |b |c |
+---------+---+---+---+
|3 |0.2|0.5|0.4|
|6 |0.6|0.3|0.1|
|5 |0.4|0.6|0.5|
+---------+---+---+---+
And
val productMe = df.columns.head
val colNames = df.columns.tail
var tempdf = df
for(column <- colNames){
tempdf = tempdf.withColumn(column, col(column)*col(productMe))
}
Above steps should give you Table2
+---------+------------------+------------------+------------------+
|productMe|a |b |c |
+---------+------------------+------------------+------------------+
|3 |0.6000000000000001|1.5 |1.2000000000000002|
|6 |3.5999999999999996|1.7999999999999998|0.6000000000000001|
|5 |2.0 |3.0 |2.5 |
+---------+------------------+------------------+------------------+
Table3 can be achieved as following
tempdf.select(sum("a").as("sum(a.productMe)"), sum("b").as("sum(b.productMe)"), sum("c").as("sum(c.productMe)")).show(false)
Table3 is
+-----------------+----------------+-----------------+
|sum(a.productMe) |sum(b.productMe)|sum(c.productMe) |
+-----------------+----------------+-----------------+
|6.199999999999999|6.3 |4.300000000000001|
+-----------------+----------------+-----------------+
Table2 can be achieved for any number of columns you have but Table3 would require you to define columns explicitly