How to compare two files using spark? - scala

I want to compare two files and, where records do not match, write the unmatched records to another file.
I need to compare every field in both files, as well as the record counts.

Let's say you have two files:
scala> val a = spark.read.option("header", "true").csv("a.csv").alias("a"); a.show
+---+-----+
|key|value|
+---+-----+
| a| b|
| b| c|
+---+-----+
a: org.apache.spark.sql.DataFrame = [key: string, value: string]
scala> val b = spark.read.option("header", "true").csv("b.csv").alias("b"); b.show
+---+-----+
|key|value|
+---+-----+
| b| c|
| c| d|
+---+-----+
b: org.apache.spark.sql.DataFrame = [key: string, value: string]
It is unclear which sort of unmatched records you are looking for, but it is easy to find them by any definition with join:
scala> a.join(b, Seq("key")).show
+---+-----+-----+
|key|value|value|
+---+-----+-----+
| b| c| c|
+---+-----+-----+
scala> a.join(b, Seq("key"), "left_outer").show
+---+-----+-----+
|key|value|value|
+---+-----+-----+
| a| b| null|
| b| c| c|
+---+-----+-----+
scala> a.join(b, Seq("key"), "right_outer").show
+---+-----+-----+
|key|value|value|
+---+-----+-----+
| b| c| c|
| c| null| d|
+---+-----+-----+
scala> a.join(b, Seq("key"), "outer").show
+---+-----+-----+
|key|value|value|
+---+-----+-----+
| c| null| d|
| b| c| c|
| a| b| null|
+---+-----+-----+
If you are looking for the records in b.csv that are not present in a.csv:
scala> val diff = a.join(b, Seq("key"), "right_outer").filter($"a.value".isNull).drop($"a.value")
scala> diff.show
+---+-----+
|key|value|
+---+-----+
| c| d|
+---+-----+
scala> diff.write.csv("diff.csv")
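The question also asks to compare every field, not only the key. A minimal sketch of one way to do that, assuming both files share the same schema: except compares whole rows (and returns distinct rows), while a left_anti join compares only the key. The local SparkSession, the inline sample data standing in for a.csv and b.csv, and the output path unmatched_in_b are assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for a standalone run. In spark-shell a session
// named `spark` and its implicits already exist, so this setup can be skipped.
val spark = SparkSession.builder().master("local[*]").appName("compare-files").getOrCreate()
import spark.implicits._

// Stand-ins for the contents of a.csv and b.csv.
val a = Seq(("a", "b"), ("b", "c")).toDF("key", "value")
val b = Seq(("b", "c"), ("c", "d")).toDF("key", "value")

// Rows of b that have no identical row in a (all columns are compared).
val onlyInB = b.except(a)
// Rows of a that have no identical row in b.
val onlyInA = a.except(b)

// Rows of b whose key does not occur in a (only the key is compared).
val newKeysInB = b.join(a, Seq("key"), "left_anti")

// Record counts of both inputs.
val counts = (a.count(), b.count())

// Hypothetical output location; Spark writes a directory of part files.
onlyInB.write.mode("overwrite").option("header", "true").csv("unmatched_in_b")
```

Unlike the right_outer join followed by a null filter, except also reports rows whose key exists in both files but whose other fields differ.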

Related

Rename Duplicate Columns of a Spark DataFrame?

There are several good answers about managing duplicate columns from joined dataframes, e.g. (How to avoid duplicate columns after join?), but what if I'm simply presented with a DataFrame that has duplicate columns that I have to deal with? I have no control over the processes leading up to this point.
What I have:
val data = Seq((1,2),(3,4)).toDF("a","a")
data.show
+---+---+
| a| a|
+---+---+
| 1| 2|
| 3| 4|
+---+---+
What I want:
+---+---+
| a|a_2|
+---+---+
| 1| 2|
| 3| 4|
+---+---+
withColumnRenamed("a","a_2") does not work, for obvious reasons.
The simplest way I found to do this is:
val data = Seq((1,2),(3,4)).toDF("a","a")
val deduped = data.toDF("a","a_2")
deduped.show
+---+---+
| a|a_2|
+---+---+
| 1| 2|
| 3| 4|
+---+---+
For a more general solution:
val data = Seq(
(1,2,3,4,5,6,7,8),
(9,0,1,2,3,4,5,6)
).toDF("a","b","c","a","d","b","e","b")
data.show
+---+---+---+---+---+---+---+---+
| a| b| c| a| d| b| e| b|
+---+---+---+---+---+---+---+---+
| 1| 2| 3| 4| 5| 6| 7| 8|
| 9| 0| 1| 2| 3| 4| 5| 6|
+---+---+---+---+---+---+---+---+
import org.apache.spark.sql.DataFrame
import scala.annotation.tailrec
def dedupeColumnNames(df: DataFrame): DataFrame = {
  // Walks the columns from last to first; a repeated name gets the suffix _N,
  // where N is how many times it has occurred up to that position.
  @tailrec
  def dedupe(fixed_columns: List[String], columns: List[String]): List[String] = {
    if (columns.isEmpty) fixed_columns
    else {
      val count = columns.count(_ == columns.head)
      if (count == 1) dedupe(columns.head :: fixed_columns, columns.tail)
      else dedupe(s"${columns.head}_${count}" :: fixed_columns, columns.tail)
    }
  }
  val new_columns = dedupe(List.empty[String], df.columns.reverse.toList).toArray
  df.toDF(new_columns: _*)
}
data
.transform(dedupeColumnNames)
.show
+---+---+---+---+---+---+---+---+
| a| b| c|a_2| d|b_2| e|b_3|
+---+---+---+---+---+---+---+---+
| 1| 2| 3| 4| 5| 6| 7| 8|
| 9| 0| 1| 2| 3| 4| 5| 6|
+---+---+---+---+---+---+---+---+
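The renaming itself does not need Spark, so it can be tried on a plain list of names. The sketch below is an alternative formulation, not the code from the answer above: it makes a single left-to-right pass with a running count per name. The helper name dedupeNames is hypothetical, and the sketch assumes a generated name such as a_2 does not clash with a column that already has that name.

```scala
// Pure Scala helper: renames repeated names to name_2, name_3, ... in order of appearance.
def dedupeNames(names: Seq[String]): Seq[String] = {
  val (_, renamed) = names.foldLeft((Map.empty[String, Int], Vector.empty[String])) {
    case ((seen, acc), name) =>
      // How many times this name has been seen so far, including this occurrence.
      val n = seen.getOrElse(name, 0) + 1
      val fixed = if (n == 1) name else s"${name}_$n"
      (seen.updated(name, n), acc :+ fixed)
  }
  renamed
}

val renamed = dedupeNames(Seq("a", "b", "c", "a", "d", "b", "e", "b"))
// renamed == Seq("a", "b", "c", "a_2", "d", "b_2", "e", "b_3")
```

With a DataFrame it could be applied as df.toDF(dedupeNames(df.columns.toSeq): _*).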

spark aggregation count on condition

I'm trying to group a DataFrame and, when aggregating with a count, apply a condition to the rows before counting.
Here is an example:
val test=Seq(("A","X"),("A","X"),("B","O"),("B","O"),("c","O"),("c","X"),("d","X"),("d","O")).toDF
test.show
+---+---+
| _1| _2|
+---+---+
| A| X|
| A| X|
| B| O|
| B| O|
| c| O|
| c| X|
| d| X|
| d| O|
+---+---+
In this example I want to group by column _1 and count the rows where column _2 equals 'X'.
Here is the expected result:
+---+-----------+
| _1| count(_2) |
+---+-----------+
| A| 2 |
| B| 0 |
| c| 1 |
| d| 1 |
+---+-----------+
Use when to get this aggregation. A PySpark solution is shown here:
from pyspark.sql.functions import col, count, when
test.groupBy(col("_1")).agg(count(when(col("_2") == 'X', 1))).show()
The equivalent in Scala:
import org.apache.spark.sql.functions.{count, when}
import spark.implicits._
val test=Seq(("A","X"),("A","X"),("B","O"),("B","O"),("c","O"),("c","X"),("d","X"),("d","O")).toDF
test.groupBy("_1").agg(count(when($"_2"==="X", 1)).as("count")).orderBy("_1").show
+---+-----+
| _1|count|
+---+-----+
| A| 2|
| B| 0|
| c| 1|
| d| 1|
+---+-----+
As an alternative in Scala, map the condition to 1 or 0 and sum the result:
import org.apache.spark.sql.functions.{col, lit, sum, when}
val counter1 = test.select(col("_1"),
  when(col("_2") === lit("X"), lit(1)).otherwise(lit(0)).as("_2"))
val agg1 = counter1.groupBy("_1").agg(sum("_2")).orderBy("_1")
agg1.show
This gives the result:
+---+-------+
| _1|sum(_2)|
+---+-------+
| A| 2|
| B| 0|
| c| 1|
| d| 1|
+---+-------+

Sort a Dataset[String] in Scala

I have dsString: Dataset[(String, Long)] (not a DataFrame or Dataset[Row]) and I'm trying to order it by the Long, as in .orderBy(_._2).
My problem is that .orderBy() and .sort() only accept columns, and I can only use .sortBy with RDDs.
DataFrame solution:
dsString.toDF("a", "b")
.orderBy("b")
RDD solution:
dsString.rdd
.sortBy(_._2)
How can I do the same with a Dataset[(String, Long)]?
orderBy can also be applied to a Dataset. For example,
+---+---+
| _1| _2|
+---+---+
| c| 3|
| b| 5|
| a| 4|
+---+---+
this is my Dataset and
df2.orderBy(col("_1").desc).show
df2.orderBy(col("_2").asc).show
give the results as follows:
+---+---+
| _1| _2|
+---+---+
| c| 3|
| b| 5|
| a| 4|
+---+---+
+---+---+
| _1| _2|
+---+---+
| c| 3|
| a| 4|
| b| 5|
+---+---+
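To make the typed case explicit, here is a minimal sketch assuming a local SparkSession and the same three rows as above. The point it illustrates is that orderBy on a Dataset[(String, Long)] returns a Dataset of the same element type, and that the tuple fields are addressable as the columns _1 and _2.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions.col

// Assumption: a local session for a standalone run. In spark-shell a session
// named `spark` and its implicits already exist, so this setup can be skipped.
val spark = SparkSession.builder().master("local[*]").appName("sort-dataset").getOrCreate()
import spark.implicits._

val dsString: Dataset[(String, Long)] = Seq(("c", 3L), ("b", 5L), ("a", 4L)).toDS()

// The result keeps its element type, so no conversion to DataFrame or RDD is needed.
val sorted: Dataset[(String, Long)] = dsString.orderBy(col("_2"))

// Bring the sorted tuples back to the driver.
val result = sorted.collect().toList
```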

How to show only relevant columns from Spark's DataFrame?

I have a large JSON file with 432 key-value pairs and many rows of such data. That data is loaded pretty well, however when I want to use df.show() to display 20 items I see a bunch of nulls. The file is quite sparse. It's very hard to make something out of it. What would be nice is to drop columns that have only nulls for 20 rows, however, given that I have a lot of key-value pairs it's hard to do manually. Is there a way to detect in Spark's dataframe what columns contain only nulls and drop them?
You can try something like the following; for more information, see referred_question.
scala> val df = Seq((1,2,null),(3,4,null),(5,6,null),(7,8,"9")).toDF("a","b","c")
scala> df.show
+---+---+----+
| a| b| c|
+---+---+----+
| 1| 2|null|
| 3| 4|null|
| 5| 6|null|
| 7| 8| 9|
+---+---+----+
scala> val dfl = df.limit(3) // limit to the number of rows you need; in your case that is 20
scala> val col_names = dfl.select(dfl.columns.map(x => count(col(x)).alias(x)):_*).first.toSeq.zipWithIndex.filter(x => x._1.toString.toInt > 0).map(_._2).map(x => dfl.columns(x)).map(x => col(x)) // the columns that have at least one non-null value
col_names: Seq[org.apache.spark.sql.Column] = ArrayBuffer(a, b)
scala> dfl.select(col_names : _*).show
+---+---+
| a| b|
+---+---+
| 1| 2|
| 3| 4|
| 5| 6|
+---+---+
Let me know if it works for you.
Similar to Sathiyan's idea, but using the column name in the count() itself.
scala> val df = Seq((1,2,null),(3,4,null),(5,6,null)).toDF("a","b","c")
df: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field]
scala> df.show
+---+---+----+
| a| b| c|
+---+---+----+
| 1| 2|null|
| 3| 4|null|
| 5| 6|null|
+---+---+----+
scala> val notnull_cols = df.select(df.columns.map(x=>concat_ws("=",first(lit(x)),count(col(x)))):_*).first.toSeq.map(_.toString).filter(!_.contains("=0")).map( x=>col(x.split("=")(0)) )
notnull_cols: Seq[org.apache.spark.sql.Column] = ArrayBuffer(a, b)
scala> df.select(notnull_cols:_*).show
+---+---+
| a| b|
+---+---+
| 1| 2|
| 3| 4|
| 5| 6|
+---+---+
The intermediate result shows each column name along with its count of non-null values:
scala> df.select(df.columns.map(x=>concat_ws("=",first(lit(x)),count(col(x))).as(x+"_nullcount")):_*).show
+-----------+-----------+-----------+
|a_nullcount|b_nullcount|c_nullcount|
+-----------+-----------+-----------+
| a=3| b=3| c=0|
+-----------+-----------+-----------+
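Both answers rely on count(col) skipping nulls. A more compact sketch of the same idea, assuming a local SparkSession and plain column names without dots, pairs each column name with its non-null count by position instead of parsing strings. The names nonNullCounts, keep and trimmed are illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count}

// Assumption: a local session for a standalone run. In spark-shell a session
// named `spark` and its implicits already exist, so this setup can be skipped.
val spark = SparkSession.builder().master("local[*]").appName("drop-null-columns").getOrCreate()
import spark.implicits._

val df = Seq[(Int, Int, Option[String])]((1, 2, None), (3, 4, None), (5, 6, None)).toDF("a", "b", "c")

// One aggregation pass: count(col) counts only the non-null values of each column.
val nonNullCounts = df.select(df.columns.map(c => count(col(c)).alias(c)): _*).first()

// Keep the names of the columns whose non-null count is positive.
val keep = df.columns.zipWithIndex.collect {
  case (name, i) if nonNullCounts.getLong(i) > 0 => name
}

val trimmed = df.select(keep.map(col): _*)
```

To inspect only the rows that df.show() would print, the same steps can be run on df.limit(20) instead of df.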

Scala/Spark: How to select columns to read ONLY when list of columns > 0

I'm passing in a parameter fieldsToLoad: List[String] and I want to load ALL columns if this list is empty, and only the columns specified in the list if it contains one or more columns. I have this now, which reads the columns passed in the list:
val parquetDf = sparkSession.read.parquet(inputPath:_*).select(fieldsToLoad.head, fieldsToLoad.tail:_*)
But how do I add a condition to load * (all columns) when the list is empty?
@Andy Hayden's answer is correct, but I want to show how to use the selectExpr function to simplify the selection:
scala> val df = Range(1, 4).toList.map(x => (x, x + 1, x + 2)).toDF("c1", "c2", "c3")
df: org.apache.spark.sql.DataFrame = [c1: int, c2: int ... 1 more field]
scala> df.show()
+---+---+---+
| c1| c2| c3|
+---+---+---+
| 1| 2| 3|
| 2| 3| 4|
| 3| 4| 5|
+---+---+---+
scala> val fieldsToLoad = List("c2", "c3")
fieldsToLoad: List[String] = List(c2, c3)
scala> df.selectExpr((if (fieldsToLoad.nonEmpty) fieldsToLoad else List("*")):_*).show()
+---+---+
| c2| c3|
+---+---+
| 2| 3|
| 3| 4|
| 4| 5|
+---+---+
scala> val fieldsToLoad = List()
fieldsToLoad: List[Nothing] = List()
scala> df.selectExpr((if (fieldsToLoad.nonEmpty) fieldsToLoad else List("*")):_*).show()
+---+---+---+
| c1| c2| c3|
+---+---+---+
| 1| 2| 3|
| 2| 3| 4|
| 3| 4| 5|
+---+---+---+
You could first use an if expression to replace the empty list with just *:
val cols = if (fieldsToLoad.nonEmpty) fieldsToLoad else List("*")
sparkSession.read.parquet(inputPath:_*).select(cols.head, cols.tail:_*)
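Another way to express the same condition, sketched here with a hypothetical helper named selectOrAll, is a pattern match on the list: an empty list returns the DataFrame untouched, so no "*" placeholder is needed. The local SparkSession and the small in-memory DataFrame stand in for the Parquet input and are assumptions.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Returns df unchanged for an empty list, otherwise only the listed columns.
def selectOrAll(df: DataFrame, fieldsToLoad: List[String]): DataFrame =
  fieldsToLoad match {
    case Nil          => df
    case head :: tail => df.select(head, tail: _*)
  }

// Assumption: a local session for a standalone run. In spark-shell a session
// named `spark` and its implicits already exist, so this setup can be skipped.
val spark = SparkSession.builder().master("local[*]").appName("select-or-all").getOrCreate()
import spark.implicits._

// Stand-in for the DataFrame read from Parquet.
val df = Seq((1, 2, 3), (2, 3, 4), (3, 4, 5)).toDF("c1", "c2", "c3")

val some = selectOrAll(df, List("c2", "c3"))
val all = selectOrAll(df, Nil)
```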