I would like to compare the size of two DataFrames that have been extracted from Oracle and PostgreSQL databases, then either add the missing rows to PostgreSQL or delete the extra ones. How does one insert into or delete from PostgreSQL directly? Here is what I did:
System.setProperty("hadoop.home.dir", "C:\\hadoop");
System.setProperty("spark.sql.warehouse.dir", "file:///C:/spark-warehouse");
val sparkSession = SparkSession.builder.master("local").appName("spark session example").getOrCreate()
//connect to table TMP_STRUCTURE oracle
val spark = sparkSession.sqlContext
val df = spark.load("jdbc",
Map("url" -> "jdbc:oracle:thin:System/maher@//localhost:1521/XE",
"dbtable" -> "IPTECH.TMP_STRUCTURE"))
import sparkSession.implicits._
val usedGold = df.filter(length($"CODE") === 2) // keep rows whose CODE has length 2
val article_groups = spark.load("jdbc", Map(
"url" -> "jdbc:postgresql://localhost:5432/gemodb?user=postgres&password=maher",
"dbtable" -> "article_groups")).select("id", "name")
val usedArticleGroup = article_groups.select($"*", $"id".cast(StringType) as "newId") // cast column id to String
val usedPostg = usedArticleGroup.select("newId", "name")
// val df3 = usedGold.join(usedPostg, $"code" === $"newId", "outer")
//get different rows
val differentData = usedGold.except(usedPostg).toDF("code", "name")
if (usedGold.count > usedPostg.count) {
//insert into usedPostg values(differentData("code"),differentData("name"))
} else if (usedGold.count < usedPostg.count) {
// delete from usedPostg where newId= differentData("code") in postgresql
}
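Comparing row counts alone cannot tell which rows to insert and which to delete, since both tables may contain rows the other lacks. A minimal sketch of the two-directional difference, using plain Scala sets with made-up (code, name) pairs as stand-ins for the two DataFrames; on the DataFrames themselves the analogous calls would be usedGold.except(usedPostg) and usedPostg.except(usedGold):

```scala
// Hypothetical (code, name) pairs standing in for the Oracle and PostgreSQL rows
val oracleRows   = Set(("01", "gold"), ("02", "silver"), ("03", "bronze"))
val postgresRows = Set(("02", "silver"), ("04", "copper"))

// Rows present in Oracle but missing from PostgreSQL: candidates for INSERT
val toInsert = oracleRows.diff(postgresRows)

// Rows present in PostgreSQL but missing from Oracle: candidates for DELETE
val toDelete = postgresRows.diff(oracleRows)

println(s"insert: $toInsert")
println(s"delete: $toDelete")
```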


Create Spark DataFrame from list row keys

I have a list of HBase row keys in the form of Array[Row] and want to create a Spark DataFrame out of the rows that are fetched from HBase using these row keys.
I am thinking of something like:
def getDataFrameFromList(spark: SparkSession, rList : Array[Row]): DataFrame = {
val conf = HBaseConfiguration.create()
val mlRows : List[RDD[String]] = new ArrayList[RDD[String]]
conf.set("hbase.zookeeper.quorum", "dev.server")
conf.set("", "2181")
conf.set(TableInputFormat.INPUT_TABLE, "hbase_tbl1")
rList.foreach( r => {
var rStr = r.toString()
conf.set(TableInputFormat.SCAN_ROW_START, rStr)
conf.set(TableInputFormat.SCAN_ROW_STOP, rStr + "_")
// read one row
val recsRdd = readHBaseRdd(spark, conf)
// This works, but it is only one row
//val resourcesDf =
var resourcesDf = <Code here to convert List[RDD[String]] to DataFrame>
I can call recsRdd.collect() in the for loop, convert the result to a string and append that JSON to an ArrayList[String], but I am not sure it is efficient to call collect() in a loop like this.
readHBaseRdd is using newAPIHadoopRDD to get data from HBase
def readHBaseRdd(spark: SparkSession, conf: Configuration) = {
val hBaseRDD = spark.sparkContext.newAPIHadoopRDD(conf, classOf[TableInputFormat],
classOf[ImmutableBytesWritable], classOf[Result])
hBaseRDD.map {
case (_: ImmutableBytesWritable, value: Result) =>
Use spark.sparkContext.union(Seq(mainRdd, recsRdd)) instead of a list of RDDs (mlRows).
And why read only one row from HBase at a time? Try to scan the largest interval possible.
Always avoid calling collect(); use it only for debugging and tests.
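One way to read "the largest interval possible" is to compute a single start/stop pair that covers every key, rather than one scan per key. The sketch below is plain Scala and only computes the range. The helper name coveringScanRange is hypothetical, the "_" suffix simply mirrors the stop-row convention from the question, and it assumes ASCII row keys so that string ordering matches HBase's byte ordering:

```scala
// Hypothetical helper: one [start, stop) scan range covering every row key.
// The "_" suffix follows the question's own stop-row convention.
def coveringScanRange(rowKeys: Seq[String]): Option[(String, String)] =
  if (rowKeys.isEmpty) None
  else Some((rowKeys.min, rowKeys.max + "_"))

// Made-up keys for illustration
val keys = Seq("user_0042", "user_0007", "user_0019")
val range = coveringScanRange(keys)
println(range)
```

The two values would be set once on TableInputFormat.SCAN_ROW_START and TableInputFormat.SCAN_ROW_STOP. Rows that fall inside the range but are not in the key list would still have to be filtered out afterwards, so this only pays off when the keys are reasonably dense.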

spark 2.x with mapPartitions large number of records parallel processing

I am trying to use Spark mapPartitions with Datasets [Spark 2.x] to copy a large list of files [1 million records] from one location to another in parallel.
However, at times I see that one record gets copied multiple times.
The idea is to split the 1 million files into a number of partitions (here, 24), then perform the copy operation for each partition in parallel, and finally collect the result from each partition to perform further actions.
Can someone please tell me what I am doing wrong?
def process(spark: SparkSession): DataFrame = {
import spark.implicits._
//Get source and target List for 1 million records
val sourceAndTargetList =
List(("source1" -> "target1"), ("source 1 Million" -> "Target 1 Million"))
// convert list to dataframe with number of partitions as 24
val SourceTargetDataSet =
sourceAndTargetList.toDF.repartition(24).as[(String, String)]
var dfBuffer = new ListBuffer[DataFrame]()
dfBuffer += SourceTargetDataSet
.mapPartitions(partition => {
println("partition id: " + TaskContext.getPartitionId)
//for each partition
val result = partition
.map(row => {
val source = row._1
val target = row._2
val copyStatus = copyFiles(source, target) // Function to copy files that returns a boolean
val dataframeRow = (target, copyStatus)
val dfList = dfBuffer.toList
val newDF = dfList.tail.foldLeft(dfList.head)(
(accDF, newDF) => accDF.join(newDF, Seq("_1"))
println("newDF Count " + newDF.count)
Update 2: I changed the function as shown below, and so far it is giving me consistent results as expected. What was I doing wrong, and am I getting the required parallelization with the function below? If not, how can this be optimized?
def process(spark: SparkSession): DataFrame = {
import spark.implicits._
//Get source and target List for 1 million records
val sourceAndTargetList =
List(("source1" -> "target1"), ("source 1 Million" -> "Target 1 Million"))
// convert list to dataframe with number of partitions as 24
val SourceTargetDataSet =
sourceAndTargetList.toDF.repartition(24).as[(String, String)]
val iterator = SourceTargetDataSet.toDF
(it: Iterator[Row]) =>
.map(row => {
val source = row.toString.split(",")(0).drop(1)
val target = row.toString.split(",")(1).dropRight(1)
println("source : " + source)
println("target: " + target)
val copyStatus = copyFiles() // Function to copy files that returns a boolean
val dataframeRow = (target, copyStatus)
val df = y.toList.toDF("targetKey", "copyStatus")
One should avoid performing write operations inside map transformations, because they can be replayed: when an executor dies, the same map has to be performed again by another executor.
I'd choose foreach instead.
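Since a task can still be re-run even with foreach, it may also help to make the copy itself safe to repeat. A minimal sketch, assuming local files reachable through java.nio; the helper name copyIfNeeded is hypothetical, and the size comparison is only a cheap heuristic for "already copied", not a content check:

```scala
import java.nio.file.{Files, Path, StandardCopyOption}

// Hypothetical helper: copies source to target unless a same-sized target
// already exists. Returns true if a copy was performed, false if skipped.
def copyIfNeeded(source: Path, target: Path): Boolean = {
  val alreadyCopied =
    Files.exists(target) && Files.size(target) == Files.size(source)
  if (!alreadyCopied) {
    // REPLACE_EXISTING overwrites a partial file left by a failed attempt
    Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING)
  }
  !alreadyCopied
}
```

Called from inside foreach, a replayed task would then skip the files that were already copied instead of copying them a second time.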

Spark DF: Schema for type Unit is not supported

I am new to Scala and Spark and am trying to build on some samples I found. Essentially, I am trying to call a function from within a DataFrame to get the state for a zip code using the Google API.
I have the two pieces of code working separately, but not together ;(
Here is the piece of code that is not working:
Exception in thread "main" java.lang.UnsupportedOperationException: Schema for type Unit is not supported
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:716)
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:654)
at org.apache.spark.sql.functions$.udf(functions.scala:2837)
at MovieRatings$.getstate(MovieRatings.scala:51)
at MovieRatings$$anonfun$4.apply(MovieRatings.scala:48)
at MovieRatings$$anonfun$4.apply(MovieRatings.scala:47)...
Line 51 starts with def getstate = udf {(zipcode:String)...
// SQL statements can be run by using the sql methods provided by Spark
val zipcodesDF = spark.sql("SELECT distinct zipcode, zipcode as state FROM Users")
// zipcodesDF.map(zipcodes => "zipcode: " + zipcodes.getAs[String]("zipcode") + getstate(zipcodes.getAs[String]("zipcode"))).show()
val colNames = zipcodesDF.columns
val cols = colNames.map(cName => zipcodesDF.col(cName))
val theColumn = zipcodesDF("state")
val mappedCols = cols.map(c =>
if (c.toString() == theColumn.toString()) getstate(c).as("transformed") else c)
val newDF = zipcodesDF.select(mappedCols:_*).show()
def getstate = udf {(zipcode:String) => {
val url = ""+zipcode
val result = scala.io.Source.fromURL(url).mkString
val address = parse(result)
val shortnames = for {
JObject(address_components) <- address
JField("short_name", short_name) <- address_components
} yield short_name
val state = shortnames(3)
//return state.toString()
val stater = state.toString()
}
}
Thanks for the responses. I think I figured it out. Here is the code that works. One thing to note is that the Google API has restrictions, so some valid zip codes don't return state info; that is not an issue for me, though.
private def loaduserdata(spark: SparkSession): Unit = {
import spark.implicits._
// Create an RDD of User objects from a text file, convert it to a Dataframe
val userDF = spark.sparkContext
.map(attributes => users(attributes(0).trim.toInt, attributes(1), attributes(2).trim.toInt, attributes(3), attributes(4)))
// Register the DataFrame as a temporary view
// SQL statements can be run by using the sql methods provided by Spark
val zipcodesDF = spark.sql("SELECT distinct zipcode, substr(zipcode,1,5) as state FROM Users ORDER BY zipcode desc")
// zipcodesDF.map(zipcodes => "zipcode: " + zipcodes.getAs[String]("zipcode") + getstate(zipcodes.getAs[String]("zipcode"))).show()
val colNames = zipcodesDF.columns
val cols = colNames.map(cName => zipcodesDF.col(cName))
val theColumn = zipcodesDF("state")
val mappedCols = cols.map(c =>
if (c.toString() == theColumn.toString()) getstate(c).as("state") else c)
val geoDF = zipcodesDF.select(mappedCols:_*)//.show()
val getstate = udf {(zipcode: String) =>
val url = ""+zipcode
val result = scala.io.Source.fromURL(url).mkString
val address = parse(result)
val statenm = for {
JObject(statename) <- address
JField("types", JArray(types)) <- statename
JField("short_name", JString(short_name)) <- statename
if types.toString().equals("List(JString(administrative_area_level_1), JString(political))")
// if types.head.equals("JString(administrative_area_level_1)")
} yield short_name
val str = if (statenm.isEmpty) "N/A" else statenm.head
str
}
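The exception is consistent with the first version of getstate ending in a val definition (val stater = ...): a Scala block whose last statement is a definition has type Unit, and udf derives the column schema from the function's result type. The effect can be reproduced without Spark. This is a minimal sketch; lookup and its one-entry table are hypothetical stand-ins for the Google API call:

```scala
// Hypothetical stand-in for the geocoding request
def lookup(zipcode: String): String =
  Map("90210" -> "CA").getOrElse(zipcode, "N/A")

// Ends with a val definition, so the function has type String => Unit
val returnsUnit = (zipcode: String) => {
  val state = lookup(zipcode)
}

// Ends with an expression, so the function has type String => String
val returnsString = (zipcode: String) => {
  val state = lookup(zipcode)
  state
}

println(returnsString("90210"))
```

Passing a function shaped like returnsUnit to udf is what would ask Spark for a schema for Unit; ending the body with the value itself, as in returnsString, gives udf a String result to work with.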