Looping through dataframe columns to form a nested dataframe - Spark - Scala

I have a dataframe as below,
val x = Seq(("A", "B", "C", "D")).toDF("DOC", "A1", "A2", "A3")
+---+---+---+---+
|DOC| A1| A2| A3|
+---+---+---+---+
|  A|  B|  C|  D|
+---+---+---+---+
Here the A columns can go up to A100, so I want to loop over all of them and nest them under a common column, as below:
+---+---------+
|DOC|  A LIST |
+---+---------+
| A |[B, C, D]|
+---+---------+
I want to build dynamic column names like A1, A2, ... by looping from 1 to 100 and use them in a select.
How can I do this?
Cheers!

Simply assemble the list of columns to be combined into an array: transform the column names into Columns via col and apply method array to the resulting list:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  (1, "a", "b", "c", 10.0),
  (2, "d", "e", "f", 20.0)
).toDF("id", "a1", "a2", "a3", "b")

val selectedCols = df.columns.filter(_.startsWith("a")).map(col)
val otherCols = df.columns.map(col) diff selectedCols

df.select((otherCols :+ array(selectedCols: _*).as("a_list")): _*).show
// +---+----+---------+
// | id| b| a_list|
// +---+----+---------+
// | 1|10.0|[a, b, c]|
// | 2|20.0|[d, e, f]|
// +---+----+---------+
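If you prefer to stay closer to the question's "loop from 1 to 100", here is a hedged sketch applied to the original x dataframe: it builds the candidate names A1..A100 explicitly, keeps only those that actually exist (A1..A3 here), and collects them into one array column (the column name A_LIST and the upper bound of 100 are assumptions on my part):
import org.apache.spark.sql.functions._
import spark.implicits._

val x = Seq(("A", "B", "C", "D")).toDF("DOC", "A1", "A2", "A3")

// candidate names A1..A100, reduced to the columns x actually has
val aCols = (1 to 100).map(i => s"A$i").filter(name => x.columns.contains(name)).map(col)

x.select($"DOC", array(aCols: _*).as("A_LIST")).show
// +---+---------+
// |DOC|   A_LIST|
// +---+---------+
// |  A|[B, C, D]|
// +---+---------+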

Related

Join two dataframes with different records and size in Spark

It seems this issue has been asked a couple of times, but the solutions suggested in the previous questions are not working for me.
I have two dataframes with different dimensions, as shown in the picture below. The second table (second) was part of the first table (first), but after some processing on it I added one more column, column4. Now I want to join these two tables so that I get the third table (required) after joining.
Things I tried:
I tried a couple of different solutions, but none of them works for me.
I tried
val required = first.join(second, first("PDE_HDR_CMS_RCD_NUM") === second("PDE_HDR_CMS_RCD_NUM"), "left_outer")
I also tried
val required = first.withColumn("SEQ", when(second.col("PDE_HDR_FILE_ID") === (first.col("PDE_HDR_FILE_ID").alias("PDE_HDR_FILE_ID1")), second.col("uniqueID")).otherwise(lit(0)))
In the second attempt I used .alias after I got an error that says:
Error occured during extract process. Error:
org.apache.spark.sql.AnalysisException: Resolved attribute(s) uniqueID#775L missing from.
Thanks for taking the time to read my question.
To generate the wanted result, you should join the two tables on the column(s) that are row-identifying in your first table. Assuming c1 + c2 + c3 uniquely identify each row in the first table, here's an example using a partial set of your sample data:
import org.apache.spark.sql.functions._
import spark.implicits._
val df1 = Seq(
  (1, "e", "o"),
  (4, "d", "t"),
  (3, "f", "e"),
  (2, "r", "r"),
  (6, "y", "f"),
  (5, "t", "g"),
  (1, "g", "h"),
  (4, "f", "j"),
  (6, "d", "k"),
  (7, "s", "o")
).toDF("c1", "c2", "c3")

val df2 = Seq(
  (3, "f", "e", 444),
  (5, "t", "g", 555),
  (7, "s", "o", 666)
).toDF("c1", "c2", "c3", "c4")
df1.join(df2, Seq("c1", "c2", "c3"), "left_outer").show
// +---+---+---+----+
// | c1| c2| c3| c4|
// +---+---+---+----+
// | 1| e| o|null|
// | 4| d| t|null|
// | 3| f| e| 444|
// | 2| r| r|null|
// | 6| y| f|null|
// | 5| t| g| 555|
// | 1| g| h|null|
// | 4| f| j|null|
// | 6| d| k|null|
// | 7| s| o| 666|
// +---+---+---+----+

Scala Spark collect_list() vs array()

What is the difference between collect_list() and array() in spark using scala?
I see uses all over the place, and the use cases are not clear enough for me to determine the difference.
Even though both array and collect_list return an ArrayType column, the two methods are very different.
Method array combines "column-wise" a number of columns into an array, whereas collect_list aggregates "row-wise" on a single column typically by group (or Window partition) into an array, as shown below:
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(
  (1, "a", "b"),
  (1, "c", "d"),
  (2, "e", "f")
).toDF("c1", "c2", "c3")

df.
  withColumn("arr", array("c2", "c3")).
  show
// +---+---+---+------+
// | c1| c2| c3| arr|
// +---+---+---+------+
// | 1| a| b|[a, b]|
// | 1| c| d|[c, d]|
// | 2| e| f|[e, f]|
// +---+---+---+------+
df.
  groupBy("c1").agg(collect_list("c2")).
  show
// +---+----------------+
// | c1|collect_list(c2)|
// +---+----------------+
// | 1| [a, c]|
// | 2| [e]|
// +---+----------------+
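For completeness, here is a small sketch of the Window variant mentioned above, using the same df; unlike groupBy, it keeps every input row and attaches the per-partition list to each of them:
import org.apache.spark.sql.expressions.Window

df.
  withColumn("c2_list", collect_list("c2").over(Window.partitionBy("c1"))).
  show
// +---+---+---+-------+
// | c1| c2| c3|c2_list|
// +---+---+---+-------+
// |  1|  a|  b| [a, c]|
// |  1|  c|  d| [a, c]|
// |  2|  e|  f|    [e]|
// +---+---+---+-------+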

How to replace values in RDD 1 per keys in RDD 2?

Here are two RDDs.
Table1-pair(key,value)
val table1 = sc.parallelize(Seq(("1", "a"), ("2", "b"), ("3", "c")))
//RDD[(String, String)]
Table2-Arrays
val table2 = sc.parallelize(Array(Array("1", "2", "d"), Array("1", "3", "e")))
//RDD[Array[String]]
I am trying to change elements of table2 such as "1" to "a" using the keys and values in table1. My expected result is as follows:
RDD[Array[String]] = (Array(Array("a", "b", "d"), Array("a", "c", "e")))
Is there a way to make this possible?
If so, would it be efficient using a huge dataset?
I think we can do better with dataframes while avoiding joins, since a join might involve shuffling of data.
import org.apache.spark.sql.functions._
import spark.implicits._

val table1 = spark.sparkContext.parallelize(Seq(("1", "a"), ("2", "b"), ("3", "c"))).collectAsMap()

// Broadcasting so that the mapping is available on all nodes
val broadcastedMapping = spark.sparkContext.broadcast(table1)

val table2 = spark.sparkContext.parallelize(Array(Array("1", "2", "d"), Array("1", "3", "e")))

def changeMapping(value: String): String = {
  broadcastedMapping.value.getOrElse(value, value)
}
val changeMappingUDF = udf(changeMapping(_: String))

table2.toDF.withColumn("exploded", explode($"value"))
  .withColumn("new", changeMappingUDF($"exploded"))
  .groupBy("value")
  .agg(collect_list("new").as("mappedCol"))
  .select("mappedCol").rdd.map(r => r.getSeq[String](0).toArray) // extract the array elements from each Row
Let me know if it suits your requirement otherwise I can modify it as needed.
You can do that with Datasets:
package dataframe

import org.apache.spark.sql.SparkSession

/**
 * @author vaquar.khan#gmail.com
 */
object Test {

  case class table1Class(key: String, value: String)
  case class table2Class(key: String, value: String, value1: String)

  def main(args: Array[String]) {
    val spark =
      SparkSession.builder()
        .appName("DataFrame-Basic")
        .master("local[4]")
        .getOrCreate()
    import spark.implicits._
    //
    val table1 = Seq(
      table1Class("1", "a"), table1Class("2", "b"), table1Class("3", "c"))
    val df1 = spark.sparkContext.parallelize(table1, 4).toDF()
    df1.show()

    val table2 = Seq(
      table2Class("1", "2", "d"), table2Class("1", "3", "e"))
    val df2 = spark.sparkContext.parallelize(table2, 4).toDF()
    df2.show()
    //
    df1.createOrReplaceTempView("A")
    df2.createOrReplaceTempView("B")

    spark.sql("select d1.key, d1.value, d2.value1 from A d1 inner join B d2 on d1.key = d2.key").show()

    //TODO
    /* need to fix query
    spark.sql( "select * from ( "+ //B1.value,B1.value1,A.value
      " select A.value,B.value,B.value1 "+
      " from B "+
      " left join A "+
      " on B.key = A.key ) B2 "+
      " left join A " +
      " on B2.value = A.key" ).show()
    */
  }
}
Results:
+---+-----+
|key|value|
+---+-----+
| 1| a|
| 2| b|
| 3| c|
+---+-----+
+---+-----+------+
|key|value|value1|
+---+-----+------+
| 1| 2| d|
| 1| 3| e|
+---+-----+------+
+-----+-----+------+
|value|value|value1|
+-----+-----+------+
| 1| a| d|
| 1| a| e|
+-----+-----+------+
Is there a way to make this possible?
Yes. Use Datasets (not RDDs, which are less effective and less expressive), join them together and select the fields of your liking.
val table1 = Seq(("1", "a"), ("2", "b"), ("3", "c")).toDF("key", "value")
scala> table1.show
+---+-----+
|key|value|
+---+-----+
| 1| a|
| 2| b|
| 3| c|
+---+-----+
val table2 = sc.parallelize(
  Array(Array("1", "2", "d"), Array("1", "3", "e"))).
  toDF("a").
  select($"a"(0) as "a0", $"a"(1) as "a1", $"a"(2) as "a2")
scala> table2.show
+---+---+---+
| a0| a1| a2|
+---+---+---+
| 1| 2| d|
| 1| 3| e|
+---+---+---+
scala> table2.join(table1, $"key" === $"a0").select($"value" as "a0", $"a1", $"a2").show
+---+---+---+
| a0| a1| a2|
+---+---+---+
| a| 2| d|
| a| 3| e|
+---+---+---+
Repeat for the other a columns and combine the results. While repeating the code, you'll notice the pattern that will make the code generic; one possible generalization is sketched below.
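For illustration, a sketch of one way to write that generically (an assumption on my part, not necessarily the author's intent): fold over the a columns and, for each one, do a name-based left join against table1, substituting the mapped value when a key matches.
import org.apache.spark.sql.functions.coalesce

val remapped = Seq("a0", "a1", "a2").foldLeft(table2) { (acc, c) =>
  acc.
    withColumnRenamed(c, "key").                // line the column up with table1's key
    join(table1, Seq("key"), "left_outer").     // name-based join avoids ambiguous column references
    withColumn(c, coalesce($"value", $"key")).  // mapped value if present, original value otherwise
    drop("key", "value")
}

scala> remapped.show
+---+---+---+
| a0| a1| a2|
+---+---+---+
|  a|  b|  d|
|  a|  c|  e|
+---+---+---+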
If so, would it be efficient using a huge dataset?
Yes (again). We're talking Spark here and a huge dataset is exactly why you chose Spark, isn't it?

How to union DataFrames and add only missing rows?

I have a dataframe df1, which contains the data below:
customer_id  product  Val_id  rule_name
1            A        1       rule1
2            B        X       rule1
I have another dataframe df2, which contains the data below:
customer_id  product  Val_id  rule_name
1            A        1       rule2
2            B        X       rule2
3            C        y       rule2
The rule_name value within each dataframe is always fixed.
I want a new, unioned dataframe df3. It should have all customers from dataframe df1 and all other customers from dataframe df2, which are not present in df1. So the final df3 should look like:
customer_id  product  Val_id  rule_name
1            A        1       rule1
2            B        X       rule1
3            C        y       rule2
Can anyone please help me achieve this outcome? Any help will be appreciated.
Given the following datasets:
val df1 = Seq(
  (1, "A", "1", "rule1"),
  (2, "B", "X", "rule1")
).toDF("customer_id", "product", "Val_id", "rule_name")

val df2 = Seq(
  (1, "A", "1", "rule2"),
  (2, "B", "X", "rule2"),
  (3, "C", "y", "rule2")
).toDF("customer_id", "product", "Val_id", "rule_name")
And the requirement:
It should have all customers from dataframe df1 and all other customers from dataframe df2, which are not present in df1.
My first solution could be as follows:
val missingCustomers = df2.
  join(df1, Seq("customer_id"), "leftanti").
  select($"customer_id", df2("product"), df2("Val_id"), df2("rule_name"))

val all = df1.union(missingCustomers)
scala> all.show
+-----------+-------+------+---------+
|customer_id|product|Val_id|rule_name|
+-----------+-------+------+---------+
| 1| A| 1| rule1|
| 2| B| X| rule1|
| 3| C| y| rule2|
+-----------+-------+------+---------+
Another (and perhaps slower) solution could be as follows:
// find missing ids, i.e. ids in df2 that are not in df1
// BE EXTRA CAREFUL: "Downloading" all missing ids to the driver
val missingIds = df2.
  select("customer_id").
  except(df1.select("customer_id")).
  as[Int].
  collect
// filter ids in df2 that match missing ids
val missingRows = df2.filter($"customer_id" isin (missingIds: _*))
scala> df1.union(missingRows).show
+-----------+-------+------+---------+
|customer_id|product|Val_id|rule_name|
+-----------+-------+------+---------+
| 1| A| 1| rule1|
| 2| B| X| rule1|
| 3| C| y| rule2|
+-----------+-------+------+---------+

Spark aggregateByKey with tuple

I am new to the RDD API of Spark. Thanks to Spark migrate sql window function to RDD for better performance, I managed to generate the following table:
+-----------------+---+
| _1| _2|
+-----------------+---+
| [col3TooMany,C]| 0|
| [col1,A]| 0|
| [col2,B]| 0|
| [col3TooMany,C]| 1|
| [col1,A]| 1|
| [col2,B]| 1|
|[col3TooMany,jkl]| 0|
| [col1,d]| 0|
| [col2,a]| 0|
| [col3TooMany,C]| 0|
| [col1,d]| 0|
| [col2,g]| 0|
| [col3TooMany,t]| 1|
| [col1,A]| 1|
| [col2,d]| 1|
| [col3TooMany,C]| 1|
| [col1,d]| 1|
| [col2,c]| 1|
| [col3TooMany,C]| 1|
| [col1,c]| 1|
+-----------------+---+
with an initial input of
val df = Seq(
  (0, "A", "B", "C", "D"),
  (1, "A", "B", "C", "D"),
  (0, "d", "a", "jkl", "d"),
  (0, "d", "g", "C", "D"),
  (1, "A", "d", "t", "k"),
  (1, "d", "c", "C", "D"),
  (1, "c", "B", "C", "D")
).toDF("TARGET", "col1", "col2", "col3TooMany", "col4")
val columnsToDrop = Seq("col3TooMany")
val columnsToCode = Seq("col1", "col2")
val target = "TARGET"
import org.apache.spark.sql.functions._
val exploded = explode(array(
  (columnsToDrop ++ columnsToCode).map(c =>
    struct(lit(c).alias("k"), col(c).alias("v"))): _*
)).alias("level")
val long = df.select(exploded, $"TARGET")
import org.apache.spark.util.StatCounter
then
long.as[((String, String), Int)].rdd.aggregateByKey(StatCounter())(_ merge _, _ merge _).collect.head
res71: ((String, String), org.apache.spark.util.StatCounter) = ((col2,B),(count: 3, mean: 0,666667, stdev: 0,471405, max: 1,000000, min: 0,000000))
is aggregating statistics of all the unique values for each column.
How can I add to the count (which is 3 for B in col2) a second count (maybe as a tuple) which represents the number of B in col2 where TARGET == 1? In this case, it should be 2.
You shouldn't need an additional aggregate here. With a binary target column, the mean is just the empirical probability of the target being equal to 1:
number of 1s = count * mean
number of 0s = count * (1 - mean)
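A minimal sketch of that, reusing long and StatCounter from above (no second pass over the data; the rounding guards against floating-point noise in the mean):
val stats = long.as[((String, String), Int)].rdd.
  aggregateByKey(StatCounter())(_ merge _, _ merge _)

// derive both counts from count and mean instead of running a second aggregation
val withTargetCounts = stats.mapValues { s =>
  val ones  = math.round(s.count * s.mean)  // rows with TARGET == 1
  val zeros = s.count - ones                // rows with TARGET == 0
  (s.count, ones, zeros)
}

withTargetCounts.lookup(("col2", "B"))
// -> Seq((3, 2, 1)): 3 rows in total, 2 of them with TARGET == 1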