How to perform UPSERT or MERGE operation in Apache Spark? - scala

I am trying to update and insert records into an old DataFrame using Apache Spark, matching on the unique column "ID".

To update the DataFrame, you can perform a "left_anti" join on the unique columns and then UNION the result with the DataFrame that contains the new records:
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions.{col, lit}

def refreshUnion(oldDS: Dataset[_], newDS: Dataset[_], usingColumns: Seq[String]): Dataset[_] = {
  val filteredNewDS = selectAndCastColumns(newDS, oldDS)
  oldDS.join(
      filteredNewDS,
      usingColumns,
      "left_anti")
    .select(oldDS.columns.map(columnName => col(columnName)): _*)
    .union(filteredNewDS.toDF)
}

// Align ds to refDS's schema: add missing columns as nulls and cast the rest.
def selectAndCastColumns(ds: Dataset[_], refDS: Dataset[_]): Dataset[_] = {
  val columns = ds.columns.toSet
  ds.select(refDS.columns.map(c => {
    if (!columns.contains(c)) {
      lit(null).cast(refDS.schema(c).dataType) as c
    } else {
      ds(c).cast(refDS.schema(c).dataType) as c
    }
  }): _*)
}
val df = refreshUnion(oldDS, newDS, Seq("ID"))

Spark DataFrames are immutable structures, so you can't update rows in place based on the ID.
The way to update a DataFrame is to merge the older DataFrame with the newer one and save the merged result to HDFS. To decide which record wins for an existing ID you need some de-duplication key (a timestamp, for example).
I am adding sample Scala code for this. You need to call the merge function with the unique-id column name and the timestamp column name; the timestamp should be a Long.
case class DedupableDF(unique_id: String, ts: Long)

def merge(snapshot: DataFrame)(
    delta: DataFrame)(uniqueId: String, timeStampStr: String): DataFrame = {
  val mergedDf = snapshot.union(delta)
  dedupeData(mergedDf)(uniqueId, timeStampStr)
}

def dedupeData(dataFrameToDedupe: DataFrame)(
    uniqueId: String,
    timeStampStr: String): DataFrame = {
  import sqlContext.implicits._

  def removeDuplicates(
      duplicatedDataFrame: DataFrame): Dataset[DedupableDF] = {
    val dedupableDF = duplicatedDataFrame.map(a =>
      DedupableDF(a(0).asInstanceOf[String], a(1).asInstanceOf[Long]))
    val mappedPairRdd =
      dedupableDF.map(row => (row.unique_id, (row.unique_id, row.ts))).rdd
    // Keep the pair with the highest timestamp per unique_id.
    val reduceByKeyRDD = mappedPairRdd
      .reduceByKey((row1, row2) => {
        if (row1._2 > row2._2) {
          row1
        } else {
          row2
        }
      })
      .values
    reduceByKeyRDD.toDF.map(a =>
      DedupableDF(a(0).asInstanceOf[String], a(1).asInstanceOf[Long]))
  }

  /** get distinct unique_id, timestamp combinations **/
  val filteredData =
    dataFrameToDedupe.select(uniqueId, timeStampStr).distinct
  val dedupedData = removeDuplicates(filteredData)

  dataFrameToDedupe.createOrReplaceTempView("duplicatedDataFrame")
  dedupedData.createOrReplaceTempView("dedupedDataFrame")

  // Join back on (unique_id, ts) to recover the full rows of the winners.
  sqlContext.sql(s""" select distinct duplicatedDataFrame.*
                 from duplicatedDataFrame
                 join dedupedDataFrame on
                 (duplicatedDataFrame.${uniqueId} = dedupedDataFrame.unique_id
                  and duplicatedDataFrame.${timeStampStr} = dedupedDataFrame.ts)""")
}
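
As an aside, the same keep-the-latest-record-per-ID logic can be written more compactly with a window function, avoiding the RDD round trip and the temp views. A minimal sketch (my own variation, not the original answer's code), assuming snapshot and delta share the same schema:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

def mergeLatest(snapshot: DataFrame, delta: DataFrame)(
    uniqueId: String, timeStampStr: String): DataFrame = {
  // Rank rows per unique id, newest timestamp first, and keep only the top row.
  val w = Window.partitionBy(col(uniqueId)).orderBy(col(timeStampStr).desc)
  snapshot.union(delta)
    .withColumn("row_num", row_number().over(w))
    .where(col("row_num") === 1)
    .drop("row_num")
}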


ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast

I am using an Aggregator to apply some custom merge on a DataFrame after grouping its records by their primary key:
case class Player(
  pk: String,
  ts: String,
  first_name: String,
  date_of_birth: String
)

case class PlayerProcessed(
  var ts: String,
  var first_name: String,
  var date_of_birth: String
)
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// Custom Aggregator - this is just for the example, the actual one is more complex
object BatchDedupe extends Aggregator[Player, PlayerProcessed, PlayerProcessed] {

  def zero: PlayerProcessed = PlayerProcessed("0", null, null)

  def reduce(bf: PlayerProcessed, in: Player): PlayerProcessed = {
    bf.ts = in.ts
    bf.first_name = in.first_name
    bf.date_of_birth = in.date_of_birth
    bf
  }

  def merge(bf1: PlayerProcessed, bf2: PlayerProcessed): PlayerProcessed = {
    bf1.ts = bf2.ts
    bf1.first_name = bf2.first_name
    bf1.date_of_birth = bf2.date_of_birth
    bf1
  }

  def finish(reduction: PlayerProcessed): PlayerProcessed = reduction
  def bufferEncoder: Encoder[PlayerProcessed] = Encoders.product
  def outputEncoder: Encoder[PlayerProcessed] = Encoders.product
}
val ply1 = Player("12121212121212", "10000001", "Rogger", "1980-01-02")
val ply2 = Player("12121212121212", "10000002", "Rogg", null)
val ply3 = Player("12121212121212", "10000004", null, "1985-01-02")
val ply4 = Player("12121212121212", "10000003", "Roggelio", "1982-01-02")
val seq_users = sc.parallelize(Seq(ply1, ply2, ply3, ply4)).toDF.as[Player]
val grouped = seq_users.groupByKey(_.pk)
val non_sorted = grouped.agg(BatchDedupe.toColumn.name("deduped"))
non_sorted.show(false)
This returns:
+--------------+--------------------------------+
|key |deduped |
+--------------+--------------------------------+
|12121212121212|{10000003, Roggelio, 1982-01-02}|
+--------------+--------------------------------+
Now, I would like to order the records based on ts before aggregating them. From here I understand that .sortBy("ts") does not guarantee the order after the .groupByKey(_.pk). So I was trying to apply the .sortBy between the .groupByKey and the .agg.
The output of .groupByKey(_.pk) is a KeyValueGroupedDataset[String,Player], where the second element is an Iterator. So to apply some sorting logic there, I convert it into a Seq:
val sorted = grouped.mapGroups{case(k, iter) => (k, iter.toSeq.sortBy(_.ts))}.agg(BatchDedupe.toColumn.name("deduped"))
sorted.show(false)
However, after adding the sorting logic, the output of .mapGroups is a Dataset[(String, Seq[Player])], so when I try to invoke .agg on it I get the following exception:
Caused by: ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to $line050e0d37885948cd91f7f7dd9e3b4da9311.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$Player
How could I convert back the output of my .mapGroups(...) into a KeyValueGroupedDataset[String,Player]?
I tried to cast back to Iterator as follows:
val sorted = grouped.mapGroups{case(k, iter) => (k, iter.toSeq.sortBy(_.ts).toIterator)}.agg(BatchDedupe.toColumn.name("deduped"))
But this approach produced the following exception:
UnsupportedOperationException: No Encoder found for Iterator[Player]
- field (class: "scala.collection.Iterator", name: "_2")
- root class: "scala.Tuple2"
How else can I add the sort logic between the .groupByKey and .agg methods?
Based on the discussion above, the purpose of the Aggregator is to get the latest field values per Player by ts, ignoring null values.
This can be achieved fairly easily by aggregating each field individually with max_by. With that, there is no need for a custom Aggregator or a mutable aggregation buffer.
import org.apache.spark.sql.functions._

val players: Dataset[Player] = ...

// aggregate all columns except the key individually by ts
// NULLs will be ignored (SQL standard)
val aggColumns = players.columns
  .filterNot(_ == "pk")
  .map(colName => expr(s"max_by($colName, if(isNotNull($colName), ts, null))").as(colName))

val aggregatedPlayers = players
  .groupBy(col("pk"))
  .agg(aggColumns.head, aggColumns.tail: _*)
  .as[Player]
On more recent versions of Spark you can also use the built-in max_by Column function instead of the SQL expression:
import org.apache.spark.sql.functions._

val players: Dataset[Player] = ...

// aggregate all columns except the key individually by ts
// NULLs will be ignored (SQL standard)
val aggColumns = players.columns
  .filterNot(_ == "pk")
  .map(colName => max_by(col(colName), when(col(colName).isNotNull, col("ts"))).as(colName))

val aggregatedPlayers = players
  .groupBy(col("pk"))
  .agg(aggColumns.head, aggColumns.tail: _*)
  .as[Player]
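
If you specifically want the sort-inside-the-group shape from the question, a rough sketch (my own variation, reusing the BatchDedupe callbacks and seq_users defined above, with spark.implicits._ in scope) is to fold the sorted group inside mapGroups instead of going back through .agg:

// Sort each group by ts, then apply the Aggregator's callbacks manually.
val deduped: Dataset[(String, PlayerProcessed)] =
  seq_users
    .groupByKey(_.pk)
    .mapGroups { case (k, iter) =>
      val reduced = iter.toSeq.sortBy(_.ts)
        .foldLeft(BatchDedupe.zero)((buf, p) => BatchDedupe.reduce(buf, p))
      (k, BatchDedupe.finish(reduced))
    }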

Create Spark DataFrame from list row keys

I have a list of HBase row keys in the form of Array[Row] and want to create a Spark DataFrame out of the rows fetched from HBase using these row keys.
I am thinking of something like:
def getDataFrameFromList(spark: SparkSession, rList: Array[Row]): DataFrame = {
  val conf = HBaseConfiguration.create()
  val mlRows: ListBuffer[RDD[String]] = new ListBuffer[RDD[String]]()
  conf.set("hbase.zookeeper.quorum", "dev.server")
  conf.set("hbase.zookeeper.property.clientPort", "2181")
  conf.set("zookeeper.znode.parent", "/hbase-unsecure")
  conf.set(TableInputFormat.INPUT_TABLE, "hbase_tbl1")
  rList.foreach(r => {
    val rStr = r.toString()
    conf.set(TableInputFormat.SCAN_ROW_START, rStr)
    conf.set(TableInputFormat.SCAN_ROW_STOP, rStr + "_")
    // read one row
    val recsRdd = readHBaseRdd(spark, conf)
    mlRows.append(recsRdd)
  })
  // This works, but it is only one row
  //val resourcesDf = spark.read.json(recsRdd)
  var resourcesDf = <Code here to convert List[RDD[String]] to DataFrame>
  //resourcesDf
  spark.emptyDataFrame
}
I could call recsRdd.collect() inside the loop, convert the result to strings, and append that JSON to an ArrayList[String], but I am not sure it is efficient to call collect() in a loop like this.
readHBaseRdd uses newAPIHadoopRDD to get data from HBase:
def readHBaseRdd(spark: SparkSession, conf: Configuration) = {
  val hBaseRDD = spark.sparkContext.newAPIHadoopRDD(conf, classOf[TableInputFormat],
    classOf[ImmutableBytesWritable],
    classOf[Result])
  hBaseRDD.map {
    case (_: ImmutableBytesWritable, value: Result) =>
      Bytes.toString(value.getValue(Bytes.toBytes("cf"),
        Bytes.toBytes("jsonCol")))
  }
}
Use sc.union(Seq(mainRdd, recsRdd)) (SparkContext.union) instead of keeping a list of RDDs (mlRows).
And why read only one row from HBase at a time? Try to scan as large an interval as possible.
Always avoid calling collect(); do it only for debugging/tests.
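
A rough sketch of that idea, reusing readHBaseRdd and the configuration from the question (the per-key copy of the conf is my own addition so each scan keeps its own row range, and spark.read.json on a Dataset[String] assumes Spark 2.2+):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Encoders

// Build one RDD per row key, union them once, and read all JSON strings in one go.
val rdds: Seq[RDD[String]] = rList.toSeq.map { r =>
  val scanConf = HBaseConfiguration.create(conf) // per-key copy of the base config
  val rStr = r.toString()
  scanConf.set(TableInputFormat.SCAN_ROW_START, rStr)
  scanConf.set(TableInputFormat.SCAN_ROW_STOP, rStr + "_")
  readHBaseRdd(spark, scanConf)
}

val allRows: RDD[String] = spark.sparkContext.union(rdds)
val resourcesDf = spark.read.json(spark.createDataset(allRows)(Encoders.STRING))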

Scala/Spark: Converting zero inflated data in dataframe to libsvm

I am very new to Scala (typically I do this in R).
I have imported a large dataframe (2000+ columns, 100000+ rows) that is zero-inflated.
Task
To convert the data to libsvm format
Steps
As I understand it, the steps are as follows:
1. Ensure the feature columns are set to DoubleType and the target is an Int
2. Iterate through each row, retaining each value > 0 in one array and the index of its column in another array
3. Convert to RDD[LabeledPoint]
4. Save the RDD in libsvm format
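For reference, each line in a libsvm file has the form "label index1:value1 index2:value2 ..." with strictly increasing, 1-based indices; MLUtils.saveAsLibSVMFile converts the sparse vector's 0-based indices when writing, so only the non-zero values and their column positions need to be collected per row.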
I am stuck on step 3, possibly because I am doing step 2 wrong.
Here is my code:
Main Function:
@Test
def testSpark(): Unit =
{
  try
  {
    val mDF: DataFrame = spark.read.option("header", "true").option("inferSchema", "true").csv("src/test/resources/knimeMergedTRimmedVariables.csv")
    val mDFTyped = castAllTypedColumnsTo(mDF, IntegerType, DoubleType)
    val indexer = new StringIndexer()
      .setInputCol("Majors_Final")
      .setOutputCol("Majors_Final_Indexed")
    val mDFTypedIndexed = indexer.fit(mDFTyped).transform(mDFTyped)
    val mDFFinal = castColumnTo(mDFTypedIndexed, "Majors_Final_Indexed", IntegerType)

    // only doubles accepted by sparse vector, so that's what we filter for
    val fieldSeq: scala.collection.Seq[StructField] = mDFFinal.schema.fields.toSeq.filter(f => f.dataType == DoubleType)
    val fieldNameSeq: Seq[String] = fieldSeq.map(f => f.name)

    val labeled:DataFrame = mDFFinal.map(row => convertRowToLabeledPoint(row,fieldNameSeq,row.getAs("Majors_Final_Indexed"))).toDF()

    assertTrue(true)
  }
  catch
  {
    case ex: Exception =>
    {
      println(s"There has been an Exception. Message is ${ex.getMessage} and ${ex}")
      fail()
    }
  }
}
Convert each row to LabeledPoint:
@throws(classOf[Exception])
private def convertRowToLabeledPoint(rowIn: Row, fieldNameSeq: Seq[String], label: Int): LabeledPoint =
{
  try
  {
    val values: Map[String, Double] = rowIn.getValuesMap(fieldNameSeq)
    val sortedValuesMap = ListMap(values.toSeq.sortBy(_._1): _*)
    val rowValuesItr: Iterable[Double] = sortedValuesMap.values

    var positionsArray: ArrayBuffer[Int] = ArrayBuffer[Int]()
    var valuesArray: ArrayBuffer[Double] = ArrayBuffer[Double]()
    var currentPosition: Int = 0

    rowValuesItr.foreach
    {
      kv =>
        if (kv > 0)
        {
          valuesArray += kv
          positionsArray += currentPosition
        }
        currentPosition = currentPosition + 1
    }

    // Vector length is the total number of feature columns, not the count of non-zeros.
    val lp: LabeledPoint = new LabeledPoint(label, org.apache.spark.mllib.linalg.Vectors.sparse(sortedValuesMap.size, positionsArray.toArray, valuesArray.toArray))
    return lp
  }
  catch
  {
    case ex: Exception =>
    {
      throw new Exception(ex)
    }
  }
}
Problem
So then I try to create a DataFrame of LabeledPoints, which can easily be converted to an RDD.
val labeled:DataFrame = mDFFinal.map(row => convertRowToLabeledPoint(row,fieldNameSeq,row.getAs("Majors_Final_Indexed"))).toDF()
But I get the following error:
SparkTest.scala:285: error: Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
[INFO] val labeled:DataFrame = mDFFinal.map(row => convertRowToLabeledPoint(row,fieldNameSeq,row.getAs("Majors_Final_Indexed"))).toDF()
OK, so I skipped the DataFrame and created an Array of LabeledPoints, which is easily converted to an RDD. The rest is easy.
I stress that while this works, I am new to Scala and there may be more efficient ways to do this.
The main function is now as follows:
val mDF: DataFrame = spark.read.option("header", "true").option("inferSchema", "true").csv("src/test/resources/knimeMergedTRimmedVariables.csv")
val mDFTyped = castAllTypedColumnsTo(mDF, IntegerType, DoubleType)
val indexer = new StringIndexer()
  .setInputCol("Majors_Final")
  .setOutputCol("Majors_Final_Indexed")
val mDFTypedIndexed = indexer.fit(mDFTyped).transform(mDFTyped)
val mDFFinal = castColumnTo(mDFTypedIndexed, "Majors_Final_Indexed", IntegerType)
mDFFinal.show()

// only doubles accepted by sparse vector, so that's what we filter for
val fieldSeq: scala.collection.Seq[StructField] = mDFFinal.schema.fields.toSeq.filter(f => f.dataType == DoubleType)
val fieldNameSeq: Seq[String] = fieldSeq.map(f => f.name)

var positionsArray: ArrayBuffer[LabeledPoint] = ArrayBuffer[LabeledPoint]()
mDFFinal.collect().foreach
{
  row => positionsArray += convertRowToLabeledPoint(row, fieldNameSeq, row.getAs("Majors_Final_Indexed"))
}

val mRdd: RDD[LabeledPoint] = spark.sparkContext.parallelize(positionsArray.toSeq)
MLUtils.saveAsLibSVMFile(mRdd, "./output/libsvm")
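
If the collect() on the driver ever becomes a bottleneck, a sketch of an alternative (my own suggestion, assuming convertRowToLabeledPoint is reachable from a serializable scope) is to map over the DataFrame's underlying RDD so the conversion runs on the executors:

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD

// Convert each Row to a LabeledPoint on the executors instead of the driver.
val mRdd: RDD[LabeledPoint] = mDFFinal.rdd.map { row =>
  convertRowToLabeledPoint(row, fieldNameSeq, row.getAs[Int]("Majors_Final_Indexed"))
}
MLUtils.saveAsLibSVMFile(mRdd, "./output/libsvm")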

Add scoped variable per row iteration in Apache Spark

I'm reading multiple html files into a dataframe in Spark.
I'm converting elements of the HTML to columns in the dataframe using a custom UDF:
val dataset = spark
  .sparkContext
  .wholeTextFiles(inputPath)
  .toDF("filepath", "filecontent")
  .withColumn("biz_name", parseDocValue(".biz-page-title")('filecontent))
  .withColumn("biz_website", parseDocValue(".biz-website a")('filecontent))
  ...

def parseDocValue(cssSelectorQuery: String) =
  udf((html: String) => Jsoup.parse(html).select(cssSelectorQuery).text())
This works perfectly; however, each withColumn call re-parses the HTML string, which is redundant.
Is there a way (without using lookup tables or such) to generate one parsed Document (Jsoup.parse(html)) per row based on the "filecontent" column and make it available to all withColumn calls on the dataframe?
Or shouldn't I even try using DataFrames and just use RDDs?
So the final answer was in fact quite simple:
Just map over the rows and create the object once there.
def docValue(cssSelectorQuery: String, attr: Option[String] = None)(implicit document: Document): Option[String] = {
  val domObject = document.select(cssSelectorQuery)
  val domValue = attr match {
    case Some(a) => domObject.attr(a)
    case None => domObject.text()
  }
  domValue match {
    case x if x == null || x.isEmpty => None
    case y => Some(y)
  }
}

val dataset = spark
  .sparkContext
  .wholeTextFiles(inputPath, minPartitions = 265)
  .map {
    case (filepath, filecontent) => {
      implicit val document = Jsoup.parse(filecontent)
      val customDataJson = docJson(filecontent, customJsonRegex)

      DataEntry(
        biz_name = docValue(".biz-page-title"),
        biz_website = docValue(".biz-website a"),
        url = docValue("meta[property=og:url]", attr = Some("content")),
        ...
        filename = Some(fileName(filepath)),
        fileTimestamp = Some(fileTimestamp(filepath))
      )
    }
  }
  .toDS()
I'd probably rewrite it as follows, to do the parsing and selecting in one go and put them in a temporary column:
val dataset = spark
  .sparkContext
  .wholeTextFiles(inputPath)
  .toDF("filepath", "filecontent")
  .withColumn("temp", parseDocValue(Array(".biz-page-title", ".biz-website a"))('filecontent))
  .withColumn("biz_name", col("temp")(0))
  .withColumn("biz_website", col("temp")(1))
  .drop("temp")

def parseDocValue(cssSelectorQueries: Array[String]) =
  udf((html: String) => {
    val j = Jsoup.parse(html)
    cssSelectorQueries.map(query => j.select(query).text())
  })
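
A small variation on that rewrite (my own suggestion, not part of the original answers) is to have the UDF return a Map keyed by selector, so the lookups are by name rather than by position:

import org.apache.spark.sql.functions.{col, udf}

// Parse once per row and return all selector results keyed by the selector string.
def parseDocValues(cssSelectorQueries: Seq[String]) =
  udf((html: String) => {
    val doc = Jsoup.parse(html)
    cssSelectorQueries.map(q => q -> doc.select(q).text()).toMap
  })

// Usage: col("temp")(".biz-page-title") instead of col("temp")(0)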

Union Dataframes based on condition in spark(scala)

I have a folder which consists of 4 subfolders containing parquet files:
Folder -> A.parquet, B.parquet, C.parquet, D.parquet (subfolders). My requirement is to union the data frames based on the file names I provide to the method.
I am doing it with this code:
val df = listDirectoriesGetWantedFile(folderPath, sqlContext, A, B)

def listDirectoriesGetWantedFile(folderPath: String, sqlContext: SQLContext, str1: String, str2: String): DataFrame = {
  var df: DataFrame = null
  val sb = new StringBuilder
  sb.setLength(0)
  var done = false
  val path = new Path(folderPath)
  if (fileSystem.isDirectory(path)) {
    var files = fileSystem.listStatus(path)
    for (file <- files) {
      if (file.getPath.getName.contains(str1) && !done) {
        sb.append(file.getPath.toString())
        sb.append(",")
        done = true
      } else if (file.getPath.getName.contains(str2)) {
        sb.append(file.getPath.toString())
      }
    }
  }
But then I need to split the sb and union the DataFrames, and I am unable to find a solution for that. How can I approach and solve this?
If I understand your question, you could simply do something like this:
def listDirectoriesGetWantedFile(path: String,
                                 sqlContext: SQLContext,
                                 folder1: String,
                                 folder2: String): DataFrame = {
  val df1 = sqlContext.read.parquet(s"$path/$folder1")
  val df2 = sqlContext.read.parquet(s"$path/$folder2")
  df1.union(df2)
}
EDIT
By using the Hadoop FileSystem API, you can check path existence for your folders. So you may try something like this:
def listDirectoriesGetWantedFile(path: String, sqlContext: SQLContext, folders: Seq[String]): DataFrame = {
  val conf = new Configuration()
  val fs = FileSystem.get(conf)
  val existingFolders = folders
    .map(folder => new Path(s"$path/$folder"))
    .filter(fs.exists(_))
    .map(_.toString)
  if (existingFolders.isEmpty) {
    sqlContext.emptyDataFrame
  } else {
    sqlContext.read.parquet(existingFolders: _*)
  }
}
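
For example (with hypothetical folder names), the call from the question would then look like:

val df = listDirectoriesGetWantedFile(folderPath, sqlContext, Seq("A", "B"))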