How to filter rows in one table based on matching columns in another table using Spark - Scala

I need to filter rows in one table (fixTablehb004_p) based on the same columns in another table (filtredTable109_p).
I first wanted to use this code:
val filtredTablehb004_p = fixTablehb004_p
.where($"servizio_rap" === filtredTable109_p.col("servizio_rap"))
.where($"filiale_rap" === filtredTable109_p.col("filiale_rap"))
.where($"codice_rap" === filtredTable109_p.col("codice_rap"))
But it threw an error.
Then I tried code based on this Stack Overflow question and ended up with the query below. The problem is that the joined result contains extra columns. I know I can drop(columnName), but I want to ask whether I'm doing this right and whether there is a better option.
val filtredTablehb004_p = sparkSession.sql("SELECT * FROM fixTablehb004_p " +
"JOIN filtredTable109_p " +
"ON fixTablehb004_p.servizio_rap = filtredTable109_p.servizio_rap AND " +
"fixTablehb004_p.filiale_rap = filtredTable109_p.filiale_rap AND " +
"fixTablehb004_p.codice_rap = filtredTable109_p.codice_rap ")

Let's take two sample dataframes and see how we can select the required columns, or avoid duplicate key column names, in the joined output dataframe.
USING DATAFRAME API:
val df1 = Seq(("A1", "A2", 1), ("A3", "A4", 2), ("A1", "A3", 3))
.toDF("c1", "c2", "c3")
val df2 = Seq(("A1", "A2", 10), ("A3", "A4", 11))
.toDF("c1", "c2", "c4")
df1.createOrReplaceTempView("tab1")
df2.createOrReplaceTempView("tab2")
If the column names used in the join condition are the same in both dataframes, the output dataframe will have duplicate columns. To avoid this, you can pass all those columns as a Seq to join().
df1.join(df2, Seq("c1", "c2")).show()
+---+---+---+---+
| c1| c2| c3| c4|
+---+---+---+---+
| A1| A2| 1| 10|
| A3| A4| 2| 11|
+---+---+---+---+
To select the required columns from a specific dataframe, you can use the following syntax:
df1.join(df2, Seq("c1", "c2")).select('c1, 'c2, df1("c3")).show()
// OR
df1.join(df2, df1("c1") === df2("c1") && df1("c2") === df2("c2"))
.select(df1("c1"), df1("c2"), df1("c3")).show()
+---+---+---+
| c1| c2| c3|
+---+---+---+
| A1| A2| 1|
| A3| A4| 2|
+---+---+---+
USING SQL API:
spark.sql(
"""
|SELECT t2.c1, t2.c2, t2.c4 FROM tab1 t1
|JOIN tab2 t2 ON t1.c1 = t2.c1 AND t1.c2 = t2.c2
|""".stripMargin).show()
//OR
spark.sql(
"""
|SELECT c1, c2, t2.c4 FROM tab1 t1
|JOIN tab2 t2 USING(c1, c2)
|""".stripMargin).show()
+---+---+---+
| c1| c2| c4|
+---+---+---+
| A1| A2| 10|
| A3| A4| 11|
+---+---+---+
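Finally, if the goal is only to keep the rows of fixTablehb004_p that have a match in filtredTable109_p, without taking any columns from the second table, a left semi join avoids the duplicate-column problem entirely. A minimal sketch, assuming both tables are available as DataFrames with exactly these column names:
val filtredTablehb004_p = fixTablehb004_p.join(
  filtredTable109_p,
  Seq("servizio_rap", "filiale_rap", "codice_rap"),
  "leftsemi" // keeps only matching left rows; the output contains only fixTablehb004_p's columns
)
The SQL equivalent is LEFT SEMI JOIN in place of JOIN in the original query.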

Related

Create summary of Spark Dataframe

I have a Spark Dataframe which I am trying to summarise in order to find overly long columns:
// Set up test data
// Look for long columns (>= 3): row 1 is OK, row 2 is bad on column 3, row 3 is bad on column 2
val df = Seq(
( 1, "a", "bb", "cc", "file1" ),
( 2, "d", "ee", "fff", "file2" ),
( 3, "g", "hhhh", "ii", "file3" )
).
toDF("rowId", "col1", "col2", "col3", "filename")
I can summarise the lengths of the columns and find overly long ones like this:
// Look for long columns (>= 3): row 1 is OK, row 2 is bad on column 3, row 3 is bad on column 2
val df2 = df.columns
.map(c => (c, df.agg(max(length(df(s"$c")))).as[String].first()))
.toSeq.toDF("columnName", "maxLength")
.filter($"maxLength" > 2)
If I try to add the existing filename column to the map, I get an error:
val df2 = df.columns
.map(c => ($"filename", c, df.agg(max(length(df(s"$c")))).as[String].first()))
.toSeq.toDF("fn", "columnName", "maxLength")
.filter($"maxLength" > 2)
I have tried a few variations of the $"filename" syntax. How can I incorporate the filename column into the summary?
The desired output would be:
+----------+---------+--------+
|columnName|maxLength|filename|
+----------+---------+--------+
|      col2|        4|   file3|
|      col3|        3|   file2|
+----------+---------+--------+
The real dataframes have 300+ columns and millions of rows, so I cannot hard-code column names.
@wBob, does the following achieve your goal?
Group by filename and get the maximum length per column:
val cols = df.columns.dropRight(1) // to remove the filename col
val maxLength = cols.map(c => s"max(length(${c})) as ${c}").mkString(",")
print(maxLength)
df.createOrReplaceTempView("temp")
val df1 = spark
.sql(s"select filename, ${maxLength} from temp group by filename")
df1.show()
With the output:
+--------+-----+----+----+----+
|filename|rowId|col1|col2|col3|
+--------+-----+----+----+----+
| file1| 1| 1| 2| 2|
| file2| 1| 1| 2| 3|
| file3| 1| 1| 4| 2|
+--------+-----+----+----+----+
Use subqueries to get the maximum per column and concatenate the results using union:
df1.createOrReplaceTempView("temp2")
val res = cols.map(col => {
spark.sql(s"select '${col}' as columnName, $col as maxLength, filename from temp2 " +
s"where $col = (select max(${col}) from temp2)")
}).reduce(_ union _)
res.show()
With the result:
+----------+---------+--------+
|columnName|maxLength|filename|
+----------+---------+--------+
| rowId| 1| file1|
| rowId| 1| file2|
| rowId| 1| file3|
| col1| 1| file1|
| col1| 1| file2|
| col1| 1| file3|
| col2| 4| file3|
| col3| 3| file2|
+----------+---------+--------+
Note that there are multiple entries for rowId and col1 since the maximum is not unique.
There is probably a more elegant way to write it, but I am struggling to find one at the moment.
Pushed a little further for a better result.
df.select(
col("*"),
array( // make array of columns name/value/length
(for{ col_name <- df.columns } yield
struct(
length(col(col_name)).as("length"),
lit(col_name).as("col"),
col(col_name).cast("String").as("col_value")
)
).toSeq:_* ).alias("rowInfo")
)
.select(
col("rowId"),
explode( // explode array into rows
expr("filter(rowInfo, x -> x.length >= 3)") //filter the array for the length your interested in
).as("rowInfo")
)
.select(
col("rowId"),
col("rowInfo.*") // turn struct fields into columns
)
.sort("length").show
+-----+------+--------+---------+
|rowId|length| col|col_value|
+-----+------+--------+---------+
| 2| 3| col3| fff|
| 3| 4| col2| hhhh|
| 3| 5|filename| file3|
| 1| 5|filename| file1|
| 2| 5|filename| file2|
+-----+------+--------+---------+
It might be enough to sort your table by total text length. This can be achieved quickly and concisely.
df.select(
col("*"),
length( // take the length
concat( //slap all the columns together
(for( col_name <- df.columns ) yield col(col_name)).toSeq:_*
)
)
.as("length")
)
.sort( //order by total length
col("length").desc
).show()
+-----+----+----+----+--------+------+
|rowId|col1|col2|col3|filename|length|
+-----+----+----+----+--------+------+
| 3| g|hhhh| ii| file3| 13|
| 2| d| ee| fff| file2| 12|
| 1| a| bb| cc| file1| 11|
+-----+----+----+----+--------+------+
Sorting an array[struct] sorts on the first field first and the second field next. This works because we put the length of the string up front. If you re-order the fields, you'll get different results. You could easily accept more than one result if you so desired (see the sketch after the output below), but I think discovering that a row is challenging is likely enough.
df.select(
col("*"),
reverse( // reverse to descending, so the largest entry comes first
sort_array( // sorts ascending by the first struct field (the length)
array( // add all columns lengths to an array
(for( col_name <- df.columns ) yield struct(length(col(col_name)),lit(col_name),col(col_name).cast("String")) ).toSeq:_* )
)
)(0) // grab the row max
.alias("rowMax") )
.sort("rowMax").show
+-----+----+----+----+--------+--------------------+
|rowId|col1|col2|col3|filename| rowMax|
+-----+----+----+----+--------+--------------------+
| 1| a| bb| cc| file1|[5, filename, file1]|
| 2| d| ee| fff| file2|[5, filename, file2]|
| 3| g|hhhh| ii| file3|[5, filename, file3]|
+-----+----+----+----+--------+--------------------+
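To accept more than one result per row, the same sorted array can be sliced instead of indexed. A sketch, assuming Spark 2.4+ for slice (struct fields are named here for readability):
import org.apache.spark.sql.functions._
df.select(
  col("rowId"),
  slice( // take the first n elements of the sorted array
    reverse(sort_array(array(
      (for( col_name <- df.columns ) yield
        struct(
          length(col(col_name)).as("length"),
          lit(col_name).as("col"),
          col(col_name).cast("String").as("col_value")
        )
      ).toSeq:_* ))),
    1, 2 // the two largest entries per row
  ).alias("top2")
).show(false)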

Spark: map columns of a dataframe to the IDs of their distinct elements

I have the following dataframe with two string columns, A and B:
val df = (
spark
.createDataFrame(
Seq(
("a1", "b1"),
("a1", "b2"),
("a1", "b2"),
("a2", "b3")
)
)
).toDF("A", "B")
I create maps between the distinct elements of each column and a set of integers:
val mapColA = (
df
.select("A")
.distinct
.rdd
.zipWithIndex
.collectAsMap
)
val mapColB = (
df
.select("B")
.distinct
.rdd
.zipWithIndex
.collectAsMap
)
Now I want to create new columns in the dataframe by applying those maps to their corresponding columns. For one map only, this would be
df.select("A").map(x=>mapColA.get(x)).show()
However, I don't understand how to apply each map to its corresponding column and create the two new columns (e.g. with withColumn). The expected result would be
val result = (
spark
.createDataFrame(
Seq(
("a1", "b1", 1, 1),
("a1", "b2", 1, 2),
("a1", "b2", 1, 2),
("a2", "b3", 2, 3)
)
)
).toDF("A", "B", "idA", "idB")
Could you help me?
If I understood correctly, this can be achieved using dense_rank:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.dense_rank
val df2 = df.withColumn("idA", dense_rank().over(Window.orderBy("A")))
.withColumn("idB", dense_rank().over(Window.orderBy("B")))
df2.show
+---+---+---+---+
| A| B|idA|idB|
+---+---+---+---+
| a1| b1| 1| 1|
| a1| b2| 1| 2|
| a1| b2| 1| 2|
| a2| b3| 2| 3|
+---+---+---+---+
If you want to stick with your original code, you can make some modifications. Note that zipWithIndex assigns arbitrary zero-based indices, so the resulting ids will not match the expected output exactly:
val mapColA = df.select("A").distinct().rdd.map(r=>r.getAs[String](0)).zipWithIndex.collectAsMap
val mapColB = df.select("B").distinct().rdd.map(r=>r.getAs[String](0)).zipWithIndex.collectAsMap
val df2 = df.map(r => (r.getAs[String](0), r.getAs[String](1), mapColA.get(r.getAs[String](0)), mapColB.get(r.getAs[String](1)))).toDF("A","B", "idA", "idB")
df2.show
+---+---+---+---+
| A| B|idA|idB|
+---+---+---+---+
| a1| b1| 1| 2|
| a1| b2| 1| 0|
| a1| b2| 1| 0|
| a2| b3| 0| 1|
+---+---+---+---+
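If the distinct sets are too large to collect to the driver, an alternative is to keep the mappings as small DataFrames and join them back. A sketch (the id column names are illustrative, and spark.implicits._ is assumed in scope for toDF):
val dfIdA = df.select("A").distinct().rdd.zipWithIndex
  .map{ case (r, i) => (r.getString(0), i) }.toDF("A", "idA")
val dfIdB = df.select("B").distinct().rdd.zipWithIndex
  .map{ case (r, i) => (r.getString(0), i) }.toDF("B", "idB")
val df3 = df.join(dfIdA, "A").join(dfIdB, "B").select("A", "B", "idA", "idB")
df3.show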

Iterating over columns in a dataframe

I have the following data frames
df1
+----------+----+----+----+-----+
| WEEK|DIM1|DIM2| T1| T2|
+----------+----+----+----+-----+
|2016-04-02| 14|NULL|9874| 880|
|2016-04-30| 14| FR|9875| 13|
|2017-06-10| 15| PQR|9867|57721|
+----------+----+----+----+-----+
df2
+----------+----+----+----+-----+
| WEEK|DIM1|DIM2| T1| T2|
+----------+----+----+----+-----+
|2016-04-02| 14|NULL|9879| 820|
|2016-04-30| 14| FR|9785| 9|
|2017-06-10| 15| XYZ|9967|57771|
+----------+----+----+----+-----+
I need to produce my output as follows:
+----------+----+----+----+-----+----+-----+-------+-------+----------+------------+
| WEEK|DIM1|DIM2| T1| T2| T1| T2|t1_diff|t2_diff|pr_primary|pr_reference|
+----------+----+----+----+-----+----+-----+-------+-------+----------+------------+
|2016-04-02| 14|NULL|9874| 880|9879| 820| -5| 60| Y| Y|
|2017-06-10| 15| PQR|9867|57721|null| null| null| null| Y| N|
|2017-06-10| 15| XYZ|null| null|9967|57771| null| null| N| Y|
|2016-04-30| 14| FR|9875| 13|9785| 9| 90| 4| Y| Y|
+----------+----+----+----+-----+----+-----+-------+-------+----------+------------+
Here, t1_diff is the difference between the left T1 and the right T1, t2_diff is the difference between the left T2 and the right T2, pr_primary is Y if the row is present in df1, and pr_reference is Y if the row is present in df2.
I have generated the above with the following piece of code:
val df1 = Seq(
("2016-04-02", "14", "NULL", 9874, 880), ("2016-04-30", "14", "FR", 9875, 13), ("2017-06-10", "15", "PQR", 9867, 57721)
).toDF("WEEK", "DIM1", "DIM2","T1","T2")
val df2 = Seq(
("2016-04-02", "14", "NULL", 9879, 820), ("2016-04-30", "14", "FR", 9785, 9), ("2017-06-10", "15", "XYZ", 9967, 57771)
).toDF("WEEK", "DIM1", "DIM2","T1","T2")
import org.apache.spark.sql.functions._
val joined = df1.as("l").join(df2.as("r"), Seq("WEEK", "DIM1", "DIM2"), "fullouter")
val j1 = joined.withColumn("t1_diff", col("l.T1") - col("r.T1")).withColumn("t2_diff", col("l.T2") - col("r.T2"))
val isPresentSubstitution = udf( (x: String, y: String) => if (x == null && y == null) "N" else "Y" )
j1.withColumn("pr_primary", isPresentSubstitution(col("l.T1"), col("l.T2"))).withColumn("pr_reference", isPresentSubstitution(col("r.T1"), col("r.T2"))).show
I want to generalize this for any number of columns, not just T1 and T2. Can someone suggest a better way to do this? I am running this in Spark.
To be able to add any number of columns like t1_diff, each with its own expression calculating its values, we need some refactoring that allows using withColumn in a more generic manner.
First, we need to collect the target values: the names of the target columns and the expressions that calculate their contents. This can be done with a sequence of Tuples:
val diffColumns = Seq(
("t1_diff", col("l.T1") - col("r.T1")),
("t2_diff", col("l.T2") - col("r.T2"))
)
// or, to make it more readable, create a dedicated "case class DiffColumn(colName: String, expression: Column)"
Now we can use folding to produce the final DataFrame from joined and the sequence above:
val joinedWithDiffCols =
diffColumns.foldLeft(joined) { case(df, diffTuple) =>
df.withColumn(diffTuple._1, diffTuple._2)
}
joinedWithDiffCols contains the same data as j1 from the question.
To append new columns, you now only have to modify the diffColumns sequence. You can even put the calculation of pr_primary and pr_reference in this sequence (but then rename the reference to appendedColumns, to be more precise), as sketched below.
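For example, a sketch reusing the isPresentSubstitution UDF from the question:
val appendedColumns = Seq(
  ("t1_diff", col("l.T1") - col("r.T1")),
  ("t2_diff", col("l.T2") - col("r.T2")),
  ("pr_primary", isPresentSubstitution(col("l.T1"), col("l.T2"))),
  ("pr_reference", isPresentSubstitution(col("r.T1"), col("r.T2")))
)
val result = appendedColumns.foldLeft(joined) { case (df, (name, expression)) =>
  df.withColumn(name, expression) // same folding mechanism as above
}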
Update
To facilitate the creation of the tuples for diffColumns, this can also be generalized, for example:
// when both column names are same:
def generateDiff(column: String): (String, Column) = generateDiff(column, column)
// when left and right column names are different:
def generateDiff(leftCol: String, rightCol: String): (String, Column) =
(s"${leftCol}_diff", col("l." + leftCol) - col("r." + rightCol))
val diffColumns = Seq("T1", "T2").map(generateDiff)
End-of-update
Assuming the columns are named the same in both df1 and df2, you can do something like:
val diffCols = df1.columns
.filter(_.matches("T\\d+"))
.map(c => col(s"l.$c") - col(s"r.$c") as (s"${c.toLowerCase}_diff") )
And then use it with joined like:
joined.select((col("*") +: diffCols): _*).show(false)
//+----------+----+----+----+-----+----+-----+-------+-------+
//|WEEK |DIM1|DIM2|T1 |T2 |T1 |T2 |t1_diff|t2_diff|
//+----------+----+----+----+-----+----+-----+-------+-------+
//|2016-04-02|14 |NULL|9874|880 |9879|820 |-5 |60 |
//|2017-06-10|15 |PQR |9867|57721|null|null |null |null |
//|2017-06-10|15 |XYZ |null|null |9967|57771|null |null |
//|2016-04-30|14 |FR |9875|13 |9785|9 |90 |4 |
//+----------+----+----+----+-----+----+-----+-------+-------+
You can do it by adding a sequence number to each dataframe and then joining the two dataframes on that number. Note that monotonically_increasing_id generates partition-dependent values, so the numbers only line up when both dataframes have the same partitioning and row order:
val df3 = df1.withColumn("SeqNum", monotonically_increasing_id)
val df4 = df2.withColumn("SeqNum", monotonically_increasing_id)
df3.as("l").join(df4.as("r"),"SeqNum").withColumn("t1_diff",col("l.T1") - col("r.T1")).withColumn("t2_diff",col("l.T2") - col("r.T2")).drop("SeqNum").show()

How to update a column of a Spark dataframe based on the values of the previous record

I have three columns in df
Col1,col2,col3
X,x1,x2
Z,z1,z2
Y,
X,x3,x4
P,p1,p2
Q,q1,q2
Y
I want to do the following: when col1 = X, store the values of col2 and col3, and assign those values to the next row where col1 = Y.
expected output
X,x1,x2
Z,z1,z2
Y,x1,x2
X,x3,x4
P,p1,p2
Q,q1,q2
Y,x3,x4
Any help would be appreciated.
Note: Spark 1.6
Here's one approach using a Window function, with steps as follows:
1. Add a row-identifying column (not needed if one already exists) and combine the non-key columns (presumably many of them) into a single struct column
2. Create tmp1 with conditional nulls, and tmp2 using the last/rowsBetween Window function to back-fill with the last non-null value
3. Create newcols conditionally from cols and tmp2
4. Expand newcols back into individual columns using foldLeft
Note that this solution uses a Window function without partitioning, so it may not work well for a large dataset.
val df = Seq(
("X", "x1", "x2"),
("Z", "z1", "z2"),
("Y", "", ""),
("X", "x3", "x4"),
("P", "p1", "p2"),
("Q", "q1", "q2"),
("Y", "", "")
).toDF("col1", "col2", "col3")
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val colList = df.columns.filter(_ != "col1")
val df2 = df.select($"col1", monotonically_increasing_id.as("id"),
struct(colList.map(col): _*).as("cols")
)
val df3 = df2.
withColumn( "tmp1", when($"col1" === "X", $"cols") ).
withColumn( "tmp2", last("tmp1", ignoreNulls = true).over(
Window.orderBy("id").rowsBetween(Window.unboundedPreceding, 0)
) )
df3.show
// +----+---+-------+-------+-------+
// |col1| id| cols| tmp1| tmp2|
// +----+---+-------+-------+-------+
// | X| 0|[x1,x2]|[x1,x2]|[x1,x2]|
// | Z| 1|[z1,z2]| null|[x1,x2]|
// | Y| 2| [,]| null|[x1,x2]|
// | X| 3|[x3,x4]|[x3,x4]|[x3,x4]|
// | P| 4|[p1,p2]| null|[x3,x4]|
// | Q| 5|[q1,q2]| null|[x3,x4]|
// | Y| 6| [,]| null|[x3,x4]|
// +----+---+-------+-------+-------+
val df4 = df3.withColumn( "newcols",
when($"col1" === "Y", $"tmp2").otherwise($"cols")
).select($"col1", $"newcols")
df4.show
// +----+-------+
// |col1|newcols|
// +----+-------+
// | X|[x1,x2]|
// | Z|[z1,z2]|
// | Y|[x1,x2]|
// | X|[x3,x4]|
// | P|[p1,p2]|
// | Q|[q1,q2]|
// | Y|[x3,x4]|
// +----+-------+
val dfResult = colList.foldLeft( df4 )(
(accDF, c) => accDF.withColumn(c, df4(s"newcols.$c"))
).drop($"newcols")
dfResult.show
// +----+----+----+
// |col1|col2|col3|
// +----+----+----+
// | X| x1| x2|
// | Z| z1| z2|
// | Y| x1| x2|
// | X| x3| x4|
// | P| p1| p2|
// | Q| q1| q2|
// | Y| x3| x4|
// +----+----+----+
[UPDATE]
For Spark 1.x, last(colName, ignoreNulls) isn't available in the DataFrame API. A workaround is to fall back to Spark SQL, which supports ignore-nulls in its last() function:
df2.
withColumn( "tmp1", when($"col1" === "X", $"cols") ).
createOrReplaceTempView("df2table")
// might need to use registerTempTable("df2table") instead
val df3 = sqlContext.sql("""
select col1, id, cols, tmp1, last(tmp1, true) over (
order by id rows between unbounded preceding and current row
) as tmp2
from df2table
""")
Yes, there is a lag function, which requires an ordering. Note that lag(col, 1) only looks back a single row, so this back-fills correctly only when there are no two consecutive rows to fill.
import org.apache.spark.sql.expressions.Window.orderBy
import org.apache.spark.sql.functions.{coalesce, lag}
case class Temp(a: String, b: Option[String], c: Option[String])
val input = ss.createDataFrame( // ss is the SparkSession
Seq(
Temp("A", Some("a1"), Some("a2")),
Temp("D", Some("d1"), Some("d2")),
Temp("B", Some("b1"), Some("b2")),
Temp("E", None, None),
Temp("C", None, None)
))
+---+----+----+
| a| b| c|
+---+----+----+
| A| a1| a2|
| D| d1| d2|
| B| b1| b2|
| E|null|null|
| C|null|null|
+---+----+----+
val order = orderBy($"a")
input
.withColumn("b", coalesce($"b", lag($"b", 1).over(order)))
.withColumn("c", coalesce($"c", lag($"c", 1).over(order)))
.show()
+---+---+---+
| a| b| c|
+---+---+---+
| A| a1| a2|
| B| b1| b2|
| C| b1| b2|
| D| d1| d2|
| E| d1| d2|
+---+---+---+

How to pivot a dataset?

I use Spark 2.1.
I have some data in a Spark Dataframe, which looks like below:
ID   type   val
1    t1     v1
1    t11    v11
2    t2     v2
I want to pivot this data using either Spark Scala (preferably) or Spark SQL, so that the final output looks like this:
ID   t1   t11   t2
1    v1   v11
2               v2
You can use groupBy.pivot:
import org.apache.spark.sql.functions.first
df.groupBy("ID").pivot("type").agg(first($"val")).na.fill("").show
+---+---+---+---+
| ID| t1|t11| t2|
+---+---+---+---+
| 1| v1|v11| |
| 2| | | v2|
+---+---+---+---+
Note: depending on the actual data, i.e. how many values there are for each combination of ID and type, you might choose a different aggregation function.
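For instance, to see how many values each combination of ID and type actually has, you could aggregate with count instead of first (a quick sketch):
import org.apache.spark.sql.functions.count
df.groupBy("ID").pivot("type").agg(count($"val")).show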
Here's one way to do it:
val df = Seq(
(1, "T1", "v1"),
(1, "T11", "v11"),
(2, "T2", "v2")
).toDF(
"id", "type", "val"
).as[(Int, String, String)]
import org.apache.spark.sql.functions.{concat_ws, collect_list}
val df2 = df.groupBy("id").pivot("type").agg(concat_ws(",", collect_list("val")))
df2.show
+---+---+---+---+
| id| T1|T11| T2|
+---+---+---+---+
| 1| v1|v11| |
| 2| | | v2|
+---+---+---+---+
Note that if there are different vals associated with a given type, they will be grouped (comma-delimited) under the type in df2.
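For instance, with a hypothetical second value added for the (1, T1) pair:
val dfDup = Seq(
  (1, "T1", "v1"),
  (1, "T1", "v1b"), // duplicate (id, type) combination
  (2, "T2", "v2")
).toDF("id", "type", "val")
dfDup.groupBy("id").pivot("type").agg(concat_ws(",", collect_list("val"))).show
// +---+------+---+
// | id|    T1| T2|
// +---+------+---+
// |  1|v1,v1b|   |
// |  2|      | v2|
// +---+------+---+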
This one should work:
val seq = Seq((1,"t1","v1"),(1,"t11","v11"),(2,"t2","v2"))
val df = seq.toDF("id","type","val")
import org.apache.spark.sql.functions.first
val pivotedDF = df.groupBy("id").pivot("type").agg(first("val"))
pivotedDF.show
Output:
+---+----+----+----+
| id| t1| t11| t2|
+---+----+----+----+
| 1| v1| v11|null|
| 2|null|null| v2|
+---+----+----+----+