Scala join different datasets to get value for one column - scala

I am new to Scala. I now have 3 tables.
A:
Marketplace
Level
Band
US
LEVEL_1
CA
LEVEL_1
BAND_1
B:
Marketplace
Level
Value
US
LEVEL_1
10
C:
Marketplace
Level
Band
Value
CA
LEVEL_1
BAND_1
20
I would want to:
For rows with marketplace = US in table A -> join table B on Seq(Marketplace, Level) to get the Value;
For rows with marketplace = CA in table A -> join table C on Seq(Marketplace, Level, Band) to get the Value.
The output table will be like:
Marketplace
Level
Band
Value
US
LEVEL_1
10
CA
LEVEL_1
BAND_1
20
How should I write Scala code to achieve this? Thanks!

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, col}
import spark.implicits._
val A = Seq(("US", "LEVEL_1", ""), ("CA", "LEVEL_1", "BAND_1"))
.toDF("Marketplace", "Level", "Band")
val B = Seq(("US", "LEVEL_1", 10)).toDF("Marketplace", "Level", "Value")
val C = Seq(("CA", "LEVEL_1", "BAND_1", 20)).toDF(
"Marketplace",
"Level",
"Band",
"Value"
)
val res = A
.join(B, A.col("Marketplace") === B.col("Marketplace"), "left")
.join(C, A.col("Marketplace") === C.col("Marketplace"), "left")
.select(
A.col("Marketplace").alias("Marketplace"),
A.col("Level").alias("Level"),
C.col("Band").alias("Band"),
coalesce(B.col("Value"), C.col("Value")).alias("Value")
)
res.show(false)
// +-----------+-------+------+-----+
// |Marketplace|Level |Band |Value|
// +-----------+-------+------+-----+
// |US |LEVEL_1|null |10 |
// |CA |LEVEL_1|BAND_1|20 |
// +-----------+-------+------+-----+

Related

Searching for strings from column and creating new column in spark dataframe with results

I have dataframe consisting of columns. I need to check the http link and stored the value code in new column using spark dataframe.
Dataframe :
colA colB ColC colD
A B C a1.abc.com/823659) a1.abc.com/823521)
B C D go.xyz.com/?LinkID=226971 a1.abc.com/823521)
C D E a1.abc.com/?LinkID=226971 go.xyz.com/?LinkID=226975
Required Output:
colA colB ColC colD ColE
A B C a1.abc.com/823659) a1.abc.com/823521) 823659,823521
B C D go.xyz.com/?LinkID=226971 a1.abc.com/823521) 226971,823521
C D E a1.abc.com/?LinkID=226971 go.xyz.com/?LinkID=226975 226971,226975
df.withColumn("colE", regexp_extract(col("colD"), 'regex', ".com"
I have tried using regexp_extract, dont the pattern in not getting matched. Could you please help to get the required output.
Anyway, run this, you can add the other columns, and I am not sure of your string, but assume spaces. Added to the original answer.
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(("A", "a1.abc.com/823659) a1.abc.com/823521 a1.abc.com/9999)"), ("B", "go.xyz.com/?LinkID=226971 a1.abc.com/823521)") ).toDF("col1", "col2")
val df2 = df.withColumn("col3", split(col("col2"), " "))
val df3 = df2.select($"col1", $"col2", array_except($"col3", array(lit(""))).as("col3"))
val df4 = df3.withColumn("xyz", explode($"col3")).withColumn("leng", length($"xyz"))
// Get values to right of .com
val df5 = df4.withColumn("xyz2", substring_index($"xyz", ".com", -1))
val df6 = df5.withColumn("col4", regexp_replace(df5("xyz2"), "[^0-9]", "") )
val df7 = df6.select($"col1", $"col4")
val dfres = df7.groupBy("col1").agg(collect_list(col("col4")).as("col2"))
dfres.show(false)
returns on my data:
+----+----------------------+
|col1|col2 |
+----+----------------------+
|B |[226971, 823521] |
|A |[823659, 823521, 9999]|
+----+----------------------+

How to calculate product of columns followed by sum over all columns?

Table 1 --Spark DataFrame table
There is a column called "productMe" in Table 1; and there are also other columns like a, b, c and so on whose schema name is contained in a schema array T.
What I want is the inner product of columns(product each row of the two columns) in schema array T with the column productMe(Table 2). And sum each column of Table 2 to get Table 3.
Table 2 is not necessary if you have good idea to get Table 3 in one step.
Table 2 -- Inner product table
For example, the column "a·productMe" is (3*0.2, 6*0.6, 5*0.4) to get (0.6, 3.6, 2)
Table 3 -- sum table
For example, the column "sum(a·productMe)" is 0.6+3.6+2=6.2.
Table 1 is DataFrame of Spark, how can I get Table 3?
You can try something like the following :
val df = Seq(
(3,0.2,0.5,0.4),
(6,0.6,0.3,0.1),
(5,0.4,0.6,0.5)).toDF("productMe", "a", "b", "c")
import org.apache.spark.sql.functions.col
val columnsToSum = df.
columns. // <-- grab all the columns by their name
tail. // <-- skip productMe
map(col). // <-- create Column objects
map(c => round(sum(c * col("productMe")), 3).as(s"sum_${c}_productMe"))
val df2 = df.select(columnsToSum: _*)
df2.show()
# +---------------+---------------+---------------+
# |sum_a_productMe|sum_b_productMe|sum_c_productMe|
# +---------------+---------------+---------------+
# | 6.2| 6.3| 4.3|
# +---------------+---------------+---------------+
The trick is to use df.select(columnsToSum: _*) which means that you want to select all the columns on which we did the sum of columns times the productMe column. The :_* is a Scala-specific syntax to specify that we are passing repeated arguments because we don't have a fix number of arguments.
We can do it with simple SparkSql
val table1 = Seq(
(3,0.2,0.5,0.4),
(6,0.6,0.3,0.1),
(5,0.4,0.6,0.5)
).toDF("productMe", "a", "b", "c")
table1.show
table1.createOrReplaceTempView("table1")
val table2 = spark.sql("select a*productMe, b*productMe, c*productMe from table1") //spark is sparkSession here
table2.show
val table3 = spark.sql("select sum(a*productMe), sum(b*productMe), sum(c*productMe) from table1")
table3.show
All the other answers use sum aggregation that use groupBy under the covers.
groupBy always introduces a shuffle stage and usually (always?) is slower than corresponding window aggregates.
In this particular case, I also believe that window aggregates give better performance as you can see in their physical plans and details for their only one job.
CAUTION
Either solution uses one single partition to do the calculation that in turn makes them unsuitable for large datasets as their size together may easily exceed the memory size of a single JVM.
Window Aggregates
What follows is a window aggregate-based calculation which, in this particular case where we group over all the rows in a dataset, unfortunately gives the same physical plan. That makes my answer just a (hopefully) nice learning experience.
val df = Seq(
(3,0.2,0.5,0.4),
(6,0.6,0.3,0.1),
(5,0.4,0.6,0.5)).toDF("productMe", "a", "b", "c")
// yes, I did borrow this trick with columns from #eliasah's answer
import org.apache.spark.sql.functions.col
val columns = df.columns.tail.map(col).map(c => c * col("productMe") as s"${c}_productMe")
val multiplies = df.select(columns: _*)
scala> multiplies.show
+------------------+------------------+------------------+
| a_productMe| b_productMe| c_productMe|
+------------------+------------------+------------------+
|0.6000000000000001| 1.5|1.2000000000000002|
|3.5999999999999996|1.7999999999999998|0.6000000000000001|
| 2.0| 3.0| 2.5|
+------------------+------------------+------------------+
def sumOverRows(name: String) = sum(name) over ()
val multipliesCols = multiplies.
columns.
map(c => sumOverRows(c) as s"sum_${c}")
val answer = multiplies.
select(multipliesCols: _*).
limit(1) // <-- don't use distinct or dropDuplicates here
scala> answer.show
+-----------------+---------------+-----------------+
| sum_a_productMe|sum_b_productMe| sum_c_productMe|
+-----------------+---------------+-----------------+
|6.199999999999999| 6.3|4.300000000000001|
+-----------------+---------------+-----------------+
Physical Plan
Let's see the physical plan then (as it was the only reason why we wanted to see how to do the query using window aggregates, wasn't it?)
The following is the details for the only job 0.
If I understand your question correctly then following can be your solution
val df = Seq(
(3,0.2,0.5,0.4),
(6,0.6,0.3,0.1),
(5,0.4,0.6,0.5)
).toDF("productMe", "a", "b", "c")
This gives input dataframe as you have (you can add more)
+---------+---+---+---+
|productMe|a |b |c |
+---------+---+---+---+
|3 |0.2|0.5|0.4|
|6 |0.6|0.3|0.1|
|5 |0.4|0.6|0.5|
+---------+---+---+---+
And
val productMe = df.columns.head
val colNames = df.columns.tail
var tempdf = df
for(column <- colNames){
tempdf = tempdf.withColumn(column, col(column)*col(productMe))
}
Above steps should give you Table2
+---------+------------------+------------------+------------------+
|productMe|a |b |c |
+---------+------------------+------------------+------------------+
|3 |0.6000000000000001|1.5 |1.2000000000000002|
|6 |3.5999999999999996|1.7999999999999998|0.6000000000000001|
|5 |2.0 |3.0 |2.5 |
+---------+------------------+------------------+------------------+
Table3 can be achieved as following
tempdf.select(sum("a").as("sum(a.productMe)"), sum("b").as("sum(b.productMe)"), sum("c").as("sum(c.productMe)")).show(false)
Table3 is
+-----------------+----------------+-----------------+
|sum(a.productMe) |sum(b.productMe)|sum(c.productMe) |
+-----------------+----------------+-----------------+
|6.199999999999999|6.3 |4.300000000000001|
+-----------------+----------------+-----------------+
Table2 can be achieved for any number of columns you have but Table3 would require you to define columns explicitly

Lookup in Spark dataframes

I am using Spark 1.6 and I would like to know how to implement in lookup in the dataframes.
I have two dataframes employee & department.
Employee Dataframe
-------------------
Emp Id | Emp Name
------------------
1 | john
2 | David
Department Dataframe
--------------------
Dept Id | Dept Name | Emp Id
-----------------------------
1 | Admin | 1
2 | HR | 2
I would like to lookup emp id from the employee table to the department table and get the dept name. So, the resultset would be
Emp Id | Dept Name
-------------------
1 | Admin
2 | HR
How do I implement this look up UDF feature in SPARK. I don't want to use JOIN on both the dataframes.
As already mentioned in the comments, joining the dataframes is the way to go.
You can use a lookup, but I think there is no "distributed" solution, i.e. you have to collect the lookup-table into driver memory. Also note that this approach assumes that EmpID is unique:
import org.apache.spark.sql.functions._
import sqlContext.implicits._
import scala.collection.Map
val emp = Seq((1,"John"),(2,"David"))
val deps = Seq((1,"Admin",1),(2,"HR",2))
val empRdd = sc.parallelize(emp)
val depsDF = sc.parallelize(deps).toDF("DepID","Name","EmpID")
val lookupMap = empRdd.collectAsMap()
def lookup(lookupMap:Map[Int,String]) = udf((empID:Int) => lookupMap.get(empID))
val combinedDF = depsDF
.withColumn("empNames",lookup(lookupMap)($"EmpID"))
My initial thought was to pass the empRdd to the UDF and use the lookup method defined on PairRDD, but this does of course not work because you cannot have spark actions (i.e. lookup) within transformations (ie. the UDF).
EDIT:
If your empDf has multiple columns (e.g. Name,Age), you can use this
val empRdd = empDf.rdd.map{row =>
(row.getInt(0),(row.getString(1),row.getInt(2)))}
val lookupMap = empRdd.collectAsMap()
def lookup(lookupMap:Map[Int,(String,Int)]) =
udf((empID:Int) => lookupMap.lift(empID))
depsDF
.withColumn("lookup",lookup(lookupMap)($"EmpID"))
.withColumn("empName",$"lookup._1")
.withColumn("empAge",$"lookup._2")
.drop($"lookup")
.show()
As you are saying you already have Dataframes then its pretty easy follow these steps:
1)create a sqlcontext
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
2) Create Temporary tables for all 3 Eg:
EmployeeDataframe.createOrReplaceTempView("EmpTable")
3) Query using MySQL Queries
val MatchingDetails = sqlContext.sql("SELECT DISTINCT E.EmpID, DeptName FROM EmpTable E inner join DeptTable G on " +
"E.EmpID=g.EmpID")
Starting with some "lookup" data, there are two approaches:
Method #1 -- using a lookup DataFrame
// use a DataFrame (via a join)
val lookupDF = sc.parallelize(Seq(
("banana", "yellow"),
("apple", "red"),
("grape", "purple"),
("blueberry","blue")
)).toDF("SomeKeys","SomeValues")
Method #2 -- using a map in a UDF
// turn the above DataFrame into a map which a UDF uses
val Keys = lookupDF.select("SomeKeys").collect().map(_(0).toString).toList
val Values = lookupDF.select("SomeValues").collect().map(_(0).toString).toList
val KeyValueMap = Keys.zip(Values).toMap
def ThingToColor(key: String): String = {
if (key == null) return ""
val firstword = key.split(" ")(0) // fragile!
val result: String = KeyValueMap.getOrElse(firstword,"not found!")
return (result)
}
val ThingToColorUDF = udf( ThingToColor(_: String): String )
Take a sample data frame of things that will be looked up:
val thingsDF = sc.parallelize(Seq(
("blueberry muffin"),
("grape nuts"),
("apple pie"),
("rutabaga pudding")
)).toDF("SomeThings")
Method #1 is to join on the lookup DataFrame
Here, the rlike is doing the matching. And null appears where that does not work. Both columns of the lookup DataFrame get added.
val result_1_DF = thingsDF.join(lookupDF, expr("SomeThings rlike SomeKeys"),
"left_outer")
Method #2 is to add a column using the UDF
Here, only 1 column is added. And the UDF can return a non-Null value. However, if the lookup data is very large it may fail to "serialize" as required to send to the workers in the cluster.
val result_2_DF = thingsDF.withColumn("AddValues",ThingToColorUDF($"SomeThings"))
Which gives you:
In my case I had some lookup data that was over 1 million values, so Method #1 was my only choice.

Merge multiple Dataframes into one Dataframe in Spark [duplicate]

I have two DataFrame a and b.
a is like
Column 1 | Column 2
abc | 123
cde | 23
b is like
Column 1
1
2
I want to zip a and b (or even more) DataFrames which becomes something like:
Column 1 | Column 2 | Column 3
abc | 123 | 1
cde | 23 | 2
How can I do it?
Operation like this is not supported by a DataFrame API. It is possible to zip two RDDs but to make it work you have to match both number of partitions and number of elements per partition. Assuming this is the case:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType, LongType}
val a: DataFrame = sc.parallelize(Seq(
("abc", 123), ("cde", 23))).toDF("column_1", "column_2")
val b: DataFrame = sc.parallelize(Seq(Tuple1(1), Tuple1(2))).toDF("column_3")
// Merge rows
val rows = a.rdd.zip(b.rdd).map{
case (rowLeft, rowRight) => Row.fromSeq(rowLeft.toSeq ++ rowRight.toSeq)}
// Merge schemas
val schema = StructType(a.schema.fields ++ b.schema.fields)
// Create new data frame
val ab: DataFrame = sqlContext.createDataFrame(rows, schema)
If above conditions are not met the only option that comes to mind is adding an index and join:
def addIndex(df: DataFrame) = sqlContext.createDataFrame(
// Add index
df.rdd.zipWithIndex.map{case (r, i) => Row.fromSeq(r.toSeq :+ i)},
// Create schema
StructType(df.schema.fields :+ StructField("_index", LongType, false))
)
// Add indices
val aWithIndex = addIndex(a)
val bWithIndex = addIndex(b)
// Join and clean
val ab = aWithIndex
.join(bWithIndex, Seq("_index"))
.drop("_index")
In Scala's implementation of Dataframes, there is no simple way to concatenate two dataframes into one. We can simply work around this limitation by adding indices to each row of the dataframes. Then, we can do a inner join by these indices. This is my stub code of this implementation:
val a: DataFrame = sc.parallelize(Seq(("abc", 123), ("cde", 23))).toDF("column_1", "column_2")
val aWithId: DataFrame = a.withColumn("id",monotonicallyIncreasingId)
val b: DataFrame = sc.parallelize(Seq((1), (2))).toDF("column_3")
val bWithId: DataFrame = b.withColumn("id",monotonicallyIncreasingId)
aWithId.join(bWithId, "id")
A little light reading - Check out how Python does this!
What about pure SQL ?
SELECT
room_name,
sender_nickname,
message_id,
row_number() over (partition by room_name order by message_id) as message_index,
row_number() over (partition by room_name, sender_nickname order by message_id) as user_message_index
from messages
order by room_name, message_id
I know the OP was using Scala but if, like me, you need to know how to do this in pyspark then try the Python code below. Like #zero323's first solution it relies on RDD.zip() and will therefore fail if both DataFrames don't have the same number of partitions and the same number of rows in each partition.
from pyspark.sql import Row
from pyspark.sql.types import StructType
def zipDataFrames(left, right):
CombinedRow = Row(*left.columns + right.columns)
def flattenRow(row):
left = row[0]
right = row[1]
combinedVals = [left[col] for col in left.__fields__] + [right[col] for col in right.__fields__]
return CombinedRow(*combinedVals)
zippedRdd = left.rdd.zip(right.rdd).map(lambda row: flattenRow(row))
combinedSchema = StructType(left.schema.fields + right.schema.fields)
return zippedRdd.toDF(combinedSchema)
joined = zipDataFrames(a, b)

How to zip two (or more) DataFrame in Spark

I have two DataFrame a and b.
a is like
Column 1 | Column 2
abc | 123
cde | 23
b is like
Column 1
1
2
I want to zip a and b (or even more) DataFrames which becomes something like:
Column 1 | Column 2 | Column 3
abc | 123 | 1
cde | 23 | 2
How can I do it?
Operation like this is not supported by a DataFrame API. It is possible to zip two RDDs but to make it work you have to match both number of partitions and number of elements per partition. Assuming this is the case:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType, LongType}
val a: DataFrame = sc.parallelize(Seq(
("abc", 123), ("cde", 23))).toDF("column_1", "column_2")
val b: DataFrame = sc.parallelize(Seq(Tuple1(1), Tuple1(2))).toDF("column_3")
// Merge rows
val rows = a.rdd.zip(b.rdd).map{
case (rowLeft, rowRight) => Row.fromSeq(rowLeft.toSeq ++ rowRight.toSeq)}
// Merge schemas
val schema = StructType(a.schema.fields ++ b.schema.fields)
// Create new data frame
val ab: DataFrame = sqlContext.createDataFrame(rows, schema)
If above conditions are not met the only option that comes to mind is adding an index and join:
def addIndex(df: DataFrame) = sqlContext.createDataFrame(
// Add index
df.rdd.zipWithIndex.map{case (r, i) => Row.fromSeq(r.toSeq :+ i)},
// Create schema
StructType(df.schema.fields :+ StructField("_index", LongType, false))
)
// Add indices
val aWithIndex = addIndex(a)
val bWithIndex = addIndex(b)
// Join and clean
val ab = aWithIndex
.join(bWithIndex, Seq("_index"))
.drop("_index")
In Scala's implementation of Dataframes, there is no simple way to concatenate two dataframes into one. We can simply work around this limitation by adding indices to each row of the dataframes. Then, we can do a inner join by these indices. This is my stub code of this implementation:
val a: DataFrame = sc.parallelize(Seq(("abc", 123), ("cde", 23))).toDF("column_1", "column_2")
val aWithId: DataFrame = a.withColumn("id",monotonicallyIncreasingId)
val b: DataFrame = sc.parallelize(Seq((1), (2))).toDF("column_3")
val bWithId: DataFrame = b.withColumn("id",monotonicallyIncreasingId)
aWithId.join(bWithId, "id")
A little light reading - Check out how Python does this!
What about pure SQL ?
SELECT
room_name,
sender_nickname,
message_id,
row_number() over (partition by room_name order by message_id) as message_index,
row_number() over (partition by room_name, sender_nickname order by message_id) as user_message_index
from messages
order by room_name, message_id
I know the OP was using Scala but if, like me, you need to know how to do this in pyspark then try the Python code below. Like #zero323's first solution it relies on RDD.zip() and will therefore fail if both DataFrames don't have the same number of partitions and the same number of rows in each partition.
from pyspark.sql import Row
from pyspark.sql.types import StructType
def zipDataFrames(left, right):
CombinedRow = Row(*left.columns + right.columns)
def flattenRow(row):
left = row[0]
right = row[1]
combinedVals = [left[col] for col in left.__fields__] + [right[col] for col in right.__fields__]
return CombinedRow(*combinedVals)
zippedRdd = left.rdd.zip(right.rdd).map(lambda row: flattenRow(row))
combinedSchema = StructType(left.schema.fields + right.schema.fields)
return zippedRdd.toDF(combinedSchema)
joined = zipDataFrames(a, b)