How to convert nested object from rdd row to some custom object - scala

I'm learning Scala and Spark and practicing with a basic Spark integration example. I have a MongoDB instance running locally; I pull some data from it and build an RDD. The documents in the database are structured like this:
{
"_id": 0,
"name": "aimee Zank",
"scores": [
{
"score": 1.463179736705023,
"type": "exam"
},
{
"score": 11.78273309957772,
"type": "quiz"
},
{
"score": 35.8740349954354,
"type": "homework"
}
]
}
Here is some code:
val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("simple-app")
val sparkSession = SparkSession.builder()
.appName("example-spark-scala-read-and-write-from-mongo")
.config(conf)
.config("spark.mongodb.output.uri", "mongodb://sproot:12345@172.18.0.3:27017/spdb.students")
.config("spark.mongodb.input.uri", "mongodb://sproot:12345@172.18.0.3:27017/spdb.students")
.getOrCreate()
// Reading Mongodb collection into a dataframe
val df = MongoSpark.load(sparkSession)
val dataRdd: RDD[Row] = df.rdd
dataRdd.foreach(row => println(row.getValuesMap[Any](row.schema.fieldNames)))
The code above provides me this:
Map(_id -> 0, name -> aimee Zank, scores -> WrappedArray([1.463179736705023,exam], [11.78273309957772,quiz], [35.8740349954354,homework]))
Map(_id -> 1, name -> Aurelia Menendez, scores -> WrappedArray([60.06045071030959,exam], [52.79790691903873,quiz], [71.76133439165544,homework]))
At the end I have a problem converting this data to:
case class Student(id: Long, name: String, scores: Scores)
case class Scores(@JsonProperty("scores") scores: List[Score])
case class Score (
@JsonProperty("score") score: Double,
@JsonProperty("type") scoreType: String
)
To sum up: I cannot convert the data from the RDD into Student objects. The part that gives me the most trouble is the nested 'scores' object.
Please help me to understand how this should be done.

Played a bit more with it and ended up with the following solution:
object MainClass {
def main(args: Array[String]): Unit = {
val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("simple-app")
val sparkSession = SparkSession.builder()
.appName("example-spark-scala-read-and-write-from-mongo")
.config(conf)
.config("spark.mongodb.output.uri", "mongodb://sproot:12345@172.18.0.3:27017/spdb.students")
.config("spark.mongodb.input.uri", "mongodb://sproot:12345@172.18.0.3:27017/spdb.students")
.getOrCreate()
val objectMapper = new ObjectMapper()
objectMapper.registerModule(DefaultScalaModule)
// Reading Mongodb collection into a dataframe
val df = MongoSpark.load(sparkSession)
val dataRdd: RDD[Row] = df.rdd
val students: List[Student] =
dataRdd
.collect()
.map(row => Student(row.getInt(0), row.getString(1), createScoresObject(row))).toList
println()
}
def createScoresObject(row: Row): Scores = {
Scores(getAllScoresFromWrappedArray(row).map(x => Score(x.getDouble(0), x.getString(1))).toList)
}
def getAllScoresFromWrappedArray(row: Row): mutable.WrappedArray[GenericRowWithSchema] = {
getScoresWrappedArray(row).map(x => x.asInstanceOf[GenericRowWithSchema])
}
def getScoresWrappedArray(row: Row): mutable.WrappedArray[AnyVal] = {
row.getAs[mutable.WrappedArray[AnyVal]](2)
}
}
case class Student(id: Long, name: String, scores: Scores)
case class Scores(scores: List[Score])
case class Score (score: Double, scoreType: String)
But I would be glad to know if there is some elegant solution.
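One possibly tidier variant, offered as a sketch rather than the definitive answer: read the nested array with Row.getSeq[Row] instead of casting to WrappedArray and GenericRowWithSchema. It assumes the same column order as the printed output above (_id, name, scores, and score before type inside each struct) and that _id is stored as an Int, as in the solution above. The object name StudentRows and the helper toScore are made up for illustration.

```scala
import org.apache.spark.sql.Row

final case class Score(score: Double, scoreType: String)
final case class Scores(scores: List[Score])
final case class Student(id: Long, name: String, scores: Scores)

object StudentRows {
  // Converts one nested struct (score, type) into a Score; fields are read by position.
  def toScore(row: Row): Score = Score(row.getDouble(0), row.getString(1))

  // Converts a top-level row (_id, name, scores) into a Student.
  // Assumption: _id is an Int column, so it is widened to Long here.
  def toStudent(row: Row): Student =
    Student(
      row.getInt(0).toLong,
      row.getString(1),
      Scores(row.getSeq[Row](2).map(toScore).toList)
    )
}
```

With this in place, df.rdd.map(StudentRows.toStudent) keeps the conversion distributed, and collect() is only needed if the full list really has to live on the driver.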

Related

Update to the delta table in spark not working

package jobs
import java.sql.Timestamp // needed by the Rate case class below
import io.delta.tables.DeltaTable
import model.RandomUtils
import org.apache.spark.sql.streaming.{ OutputMode, Trigger }
import org.apache.spark.sql.{ DataFrame, Dataset, Encoder, Encoders, SparkSession }
import jobs.SystemJob.Rate
import org.apache.spark.sql.functions._
import org.apache.spark.sql._
case class Student(firstName: String, lastName: String, age: Long, percentage: Long)
case class Rate(timestamp: Timestamp, value: Long)
case class College(name: String, address: String, principal: String)
object RCConfigDSCCDeltaLake {
def getSpark(): SparkSession = {
SparkSession.builder
.appName("Delta table demo")
.master("local[*]")
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
.getOrCreate()
}
def main(args: Array[String]): Unit = {
val spark = getSpark()
val rate = 1
val studentProfile = "student_profile"
if (!DeltaTable.isDeltaTable(s"spark-warehouse/$studentProfile")) {
val deltaTable: DataFrame = spark.sql(s"CREATE TABLE `$studentProfile` (firstName String, lastName String, age Long, percentage Long) USING delta")
deltaTable.show()
deltaTable.printSchema()
}
val studentProfileDT = DeltaTable.forPath(spark, s"spark-warehouse/$studentProfile")
def processStream(student: Dataset[Student], college: Dataset[College]) = {
val studentQuery = student.writeStream.outputMode(OutputMode.Update()).foreachBatch {
(st: Dataset[Student], y: Long) =>
val listOfStudents = st.collect().toList
println("list of students :::" + listOfStudents)
val (o, n) = ("oldData", "newData")
val colMap = Map(
"firstName" -> col(s"$n.firstName"),
"lastName" -> col(s"$n.lastName"),
"age" -> col(s"$n.age"),
"percentage" -> col(s"$n.percentage"))
studentProfileDT.as(s"$o").merge(st.toDF.as(s"$n"), s"$o.firstName = $n.firstName AND $o.lastName = $n.lastName")
.whenMatched.update(colMap)
.whenNotMatched.insert(colMap)
.execute()
}.start()
val os = spark.readStream.format("delta").load(s"spark-warehouse/$studentProfile").writeStream.format("console")
.outputMode(OutputMode.Append())
.option("truncate", value = false)
.option("checkpointLocation", "retrieved").start()
studentQuery.awaitTermination()
os.awaitTermination()
}
import spark.implicits._
implicit val encStudent: Encoder[Student] = Encoders.product[Student]
implicit val encCollege: Encoder[College] = Encoders.product[College]
def rateStream = spark
.readStream
.format("rate") // <-- use RateStreamSource
.option("rowsPerSecond", rate)
.load()
.as[Rate]
val studentStream: Dataset[Student] = rateStream.filter(_.value % 25 == 0).map {
stu =>
Student(...., ....., ....., .....) //fill with values
}
val collegeStream: Dataset[College] = rateStream.filter(_.value % 40 == 0).map {
stu =>
College(...., ....., ......) //fill with values
}
processStream(studentStream, collegeStream)
}
}
I am trying to do a simple UPSERT with streaming Datasets, but it fails with this error:
22/04/13 19:50:33 ERROR MicroBatchExecution: Query [id = 8cf759fd-9bee-460f-b0d9-91889c59c524, runId = 55723708-fd3c-4425-a2bc-83d737c37589] terminated with error
java.lang.UnsupportedOperationException: Detected a data update (for example part-00000-d026d92e-1798-4d21-a505-67ec72d334e2-c000.snappy.parquet) in the source table at version 4. This is currently not supported. If you'd like to ignore updates, set the option 'ignoreChanges' to 'true'. If you would like the data update to be reflected, please restart this query with a fresh checkpoint directory.
Dependencies :
--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2
--packages io.delta:delta-core_2.12:0.7.0
The same merge works when the Datasets are hardcoded rather than streamed.
Am I doing something wrong here?

Spark Structured Streaming with HBase Sink

My use case is to read Kafka messages with Structured Streaming and use foreachBatch to push them into HBase with bulk Puts, to gain some performance over single Puts. I am able to push messages using foreach (thanks to "Spark Structured Streaming with Hbase integration"), but I cannot get the same to work with foreachBatch.
Can someone please help with this? The code is attached below.
KafkaStructured.scala :
package com.test
import java.math.BigInteger
import java.util
import com.fasterxml.jackson.annotation.JsonIgnoreProperties
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
object KafkaStructured {
@JsonIgnoreProperties(ignoreUnknown = true)
case class Header(field1: String, field2: String, field3: String)
@JsonIgnoreProperties(ignoreUnknown = true)
case class Body(fieldx: String)
@JsonIgnoreProperties(ignoreUnknown = true)
case class Event(header: Header, body: Body)
@JsonIgnoreProperties(ignoreUnknown = true)
case class KafkaResp(event: Event)
@JsonIgnoreProperties(ignoreUnknown = true)
case class HBaseDF(field1: String, field2: String, field3: String)
def main(args: Array[String]): Unit = {
val jsonSchema = Encoders.product[KafkaResp].schema
val spark = SparkSession
.builder()
.appName("Kafka Spark")
.getOrCreate()
val df = spark
.readStream
.format("kafka")
.option...
.load()
import spark.sqlContext.implicits._
val flattenedDf: DataFrame =
df
.select($"value".cast("string").as("json"))
.select(from_json($"json", jsonSchema).as("data"))
.select("data.event.header.field1", "data.event.header.field2", "data.event.header.field3")
val hbaseDf = flattenedDf
.as[HBaseDF]
.filter(hbasedf => hbasedf != null && hbasedf.field1 != null)
flattenedDf
.writeStream
.option("truncate", "false")
.option("checkpointLocation", "some hdfs location")
.format("console")
.outputMode("append")
.start()
def bytes(data: String) = {
val bytes = data match {
case data if data != null && !data.isEmpty => Bytes.toBytes(data)
case _ => Bytes.toBytes("")
}
bytes
}
hbaseDf
.writeStream
.foreachBatch(function = (batchDf, batchId) => {
val putList = new util.ArrayList[Put]()
batchDf
.foreach(row => {
val p: Put = new Put(bytes(row.field1))
val cfName= bytes("fam1")
p.addColumn(cfName, bytes("field1"), bytes(row.field1))
p.addColumn(cfName, bytes("field2"), bytes(row.field2))
p.addColumn(cfName, bytes("field3"), bytes(row.field3))
putList.add(p)
})
new HBaseBulkForeachWriter[HBaseDF] {
override val tableName: String = "<my table name>"
override def bulkPut: util.ArrayList[Put] = {
putList
}
}
}
)
.start()
spark.streams.awaitAnyTermination()
}
}
HBaseBulkForeachWriter.scala :
package com.test
import java.util
import java.util.concurrent.ExecutorService
import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory, Put, Table}
import org.apache.hadoop.hbase.security.User
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.spark.sql.ForeachWriter
import scala.collection.mutable
trait HBaseBulkForeachWriter[RECORD] extends ForeachWriter[RECORD] {
val tableName: String
val hbaseConfResources: mutable.Seq[String] = mutable.Seq("location for core-site.xml", "location for hbase-site.xml")
def pool: Option[ExecutorService] = None
def user: Option[User] = None
private var hTable: Table = _
private var connection: Connection = _
override def open(partitionId: Long, version: Long): Boolean = {
connection = createConnection()
hTable = getHTable(connection)
true
}
def createConnection(): Connection = {
val hbaseConfig = HBaseConfiguration.create()
hbaseConfResources.foreach(hbaseConfig.addResource)
ConnectionFactory.createConnection(hbaseConfig, pool.orNull, user.orNull)
}
def getHTable(connection: Connection): Table = {
connection.getTable(TableName.valueOf(tableName))
}
override def process(record: RECORD): Unit = {
val put = bulkPut
hTable.put(put)
}
override def close(errorOrNull: Throwable): Unit = {
hTable.close()
connection.close()
}
def bulkPut: util.ArrayList[Put]
}
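foreachBatch allows you to use foreachPartition inside the function.
The function passed to foreachPartition runs once per partition (not once per row), so anything it sets up, such as a connection, is shared by every row in that partition.
So you can create a function that builds a Put: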
def putValue(key: String, columnName: String, data: Array[Byte]): Put = {
val put = new Put(Bytes.toBytes(key))
put.addColumn(Bytes.toBytes("colFamily"), Bytes.toBytes(columnName), data)
}
Then a function to bulk insert the Puts:
def writePutList(putList: List[Put]): Unit = {
val config: Configuration = HBaseConfiguration.create()
config.set("hbase.zookeeper.quorum", zookeperUrl)
val connection: Connection = ConnectionFactory.createConnection(config)
val table = connection.getTable(TableName.valueOf(tableName))
table.put(putList.asJava)
logger.info("INSERT record[s] " + putList.size + " to table " + tableName + " OK.")
table.close()
connection.close()
}
And use them inside a foreachPartition and a map:
def writeFunction: (DataFrame, Long) => Unit = {
(batchData, id) => {
batchData.foreachPartition(
(partition: Iterator[Row]) => { // explicit type avoids an overload ambiguity on some Scala/Spark versions
val putList = partition.map(
data =>
putValue(data.getAs[String]("keyField"), "colName", Bytes.toBytes(data.getAs[String]("valueField")))
).toList
writePutList(putList)
}
)
}
}
And finally use the function created in your streaming query:
df.writeStream
.queryName("yourQueryName")
.option("checkpointLocation", checkpointLocation)
.outputMode(OutputMode.Update())
.foreachBatch(writeFunction)
.start()
.awaitTermination()
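If partitions are large, turning a whole partition into one List of Puts can use a lot of memory. A possible refinement, sketched here in plain Scala with no HBase dependency, is to write the partition in fixed-size chunks. The names PartitionBatches and writeInBatches are hypothetical and the batch size is an assumption to tune.

```scala
object PartitionBatches {
  // Feeds the records of one partition to `write` in chunks of at most `batchSize`,
  // so that only one chunk is materialized at a time.
  // Returns the number of chunks that were written.
  def writeInBatches[A](records: Iterator[A], batchSize: Int)(write: Seq[A] => Unit): Int = {
    require(batchSize > 0, "batchSize must be positive")
    var batches = 0
    records.grouped(batchSize).foreach { batch =>
      write(batch)
      batches += 1
    }
    batches
  }
}
```

Inside foreachPartition this could be called with the iterator of Puts and a write function that does table.put(batch.asJava) on a connection opened once for the partition.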

How to deal with contexts in Spark/Scala when using map()

I'm not very familiar with Scala or Spark, and I'm trying to write a basic test to understand how DataFrames actually work. My goal is to update myDF based on the values of some records in another table.
Well, on the one hand, I have my App:
object TestApp {
def main(args: Array[String]) {
val conf: SparkConf = new SparkConf().setAppName("test").setMaster("local[*]")
val sc = new SparkContext(conf)
implicit val hiveContext : SQLContext = new HiveContext(sc)
val test: Test = new Test()
test.test
}
}
On the other hand, I have my Test class :
class Test(implicit sqlContext: SQLContext) extends Serializable {
val hiveContext: SQLContext = sqlContext
import hiveContext.implicits._
def test(): Unit = {
val myDF = hiveContext.read.table("myDB.Customers").sort($"cod_a", $"start_date".desc)
myDF.map(myMap).take(1)
}
def myMap(row: Row): Row = {
def _myMap: (String, String) = {
val investmentDF: DataFrame = hiveContext.read.table("myDB.Investment")
var target: (String, String) = casoX(investmentDF, row.getAs[String]("cod_a"), row.getAs[String]("cod_p"))
target
}
def casoX(df: DataFrame, codA: String, codP: String)(implicit hiveContext: SQLContext): (String, String) = {
var rows: Array[Row] = null
if (codP != null) {
println(df)
rows = df.filter($"cod_a" === codA && $"cod_p" === codP).orderBy($"sales".desc).select($"cod_t", $"nom_t").collect
} else {
rows = df.filter($"cod_a" === codA).orderBy($"sales".desc).select($"cod_t", $"nom_t").collect
}
// Use the first collected row (rows(0)), not the outer customer row
if (rows.length > 0) (rows(0).getString(0), rows(0).getString(1)) else null
}
val target: (String, String) = _myMap
Row(row(0), row(1), row(2), row(3), row(4), row(5), row(6), target._1, target._2, row(9))
}
}
When I execute it, I get a NullPointerException on the instruction val investmentDF: DataFrame = hiveContext.read.table("myDB.Investment"), more precisely on hiveContext.read.
If I inspect hiveContext in the "test" function, I can access its SparkContext and load my DF without any problem.
However, if I inspect my hiveContext object just before the NullPointerException, its sparkContext is null. I suppose this is because SparkContext is not Serializable (and since I am inside a map function, I'm losing part of my hiveContext object, am I right?).
Anyway, I don't know exactly what is wrong with my code, or how I should change it to get my investmentDF without a NullPointerException.
Thanks!
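A DataFrame or SQLContext can only be used on the driver, so a per-row query inside map cannot work on the executors. One possible way around it, assuming the Investment table is small enough to collect, is to load it once on the driver and do the per-row lookup against plain Scala data. The sketch below mirrors the logic of casoX; the Investment case class, its field types (sales as Double) and the name bestTarget are assumptions for illustration.

```scala
final case class Investment(codA: String, codP: String, codT: String, nomT: String, sales: Double)

object InvestmentLookup {
  // Returns (codT, nomT) of the matching investment with the highest sales.
  // When codP is None only codA is matched, like the else branch of casoX.
  def bestTarget(investments: Seq[Investment], codA: String, codP: Option[String]): Option[(String, String)] = {
    val candidates = investments.filter(i => i.codA == codA && codP.forall(_ == i.codP))
    if (candidates.isEmpty) None
    else {
      // Keep the candidate with the largest sales value (first one wins on ties)
      val top = candidates.reduce((a, b) => if (b.sales > a.sales) b else a)
      Some((top.codT, top.nomT))
    }
  }
}
```

The collected sequence could be shared with the executors through SparkContext.broadcast and read inside map via its value. If the table is too large to collect, a join between the two DataFrames is the usual alternative.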

Storing each RDD from Twitter streaming into a database with Apache Spark fails with a "Task not serializable" error (Scala)

I wrote code in which a Twitter stream produces RDDs of the Tweet class and stores each RDD in a database, but it fails with a "Task not serializable" error. The code is below.
sparkstreaming.scala
case class Tweet(id: Long, source: String, content: String, retweet: Boolean, authName: String, username: String, url: String, authId: Long, language: String)
trait SparkStreaming extends Connector {
def startStream(appName: String, master: String): StreamingContext = {
val db = connector("localhost", "rmongo", "rmongo", "pass")
val dbcrud = new DBCrud(db, "table1")
val sparkConf: SparkConf = new SparkConf().setAppName(appName).setMaster(master).set("spark.driver.allowMultipleContexts", "true").set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
// .set("spark.kryo.registrator", "HelloKryoRegistrator")
// sparkConf.registerKryoClasses(Array(classOf[DBCrud]))
val sc: SparkContext = new SparkContext(sparkConf)
val ssc: StreamingContext = new StreamingContext(sc, Seconds(10))
ssc
}
}
object SparkStreaming extends SparkStreaming
I use this streaming context in a Play controller to store tweets in the database, but it throws an exception. I'm using MongoDB for storage.
def streamstart = Action {
val stream = SparkStreaming
val a = stream.startStream("ss", "local[2]")
val db = connector("localhost", "rmongo", "rmongo", "pass")
val dbcrud = DBCrud
val twitterauth = new TwitterClient().tweetCredantials()
val tweetDstream = TwitterUtils.createStream(a, Option(twitterauth.getAuthorization))
val tweets = tweetDstream.filter { x => x.getUser.getLang == "en" }.map { x => Tweet(x.getId, x.getSource, x.getText, x.isRetweet(), x.getUser.getName, x.getUser.getScreenName, x.getUser.getURL, x.getUser.getId, x.getUser.getLang) }
// tweets.foreachRDD { x => x.foreach { x => dbcrud.insert(x) } }
tweets.saveAsTextFiles("/home/knoldus/sentiment project/spark services/tweets/tweets")
// val s=new BirdTweet()
// s.hastag(a.sparkContext)
a.start()
Ok("start streaming")
}
When I put everything into a single streaming class that receives the tweets and uses foreachRDD to store each tweet, it works, but when I call it from outside it doesn't.
Please help me.
Try creating the MongoDB connection inside the foreachRDD block, as described in the Spark documentation:
tweets.foreachRDD { x =>
x.foreach { x =>
val db = connector("localhost", "rmongo", "rmongo", "pass")
val dbcrud = new DBCrud(db, "table1")
dbcrud.insert(x)
}
}
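The snippet above opens a connection for every tweet. The same section of the Spark documentation suggests opening one connection per partition instead. The helper below is a plain Scala sketch of that pattern, with the hypothetical names PartitionResource and withResource; open and close stand in for whatever creates and releases the MongoDB connection.

```scala
object PartitionResource {
  // Opens one resource for the whole partition, applies `use` to every record,
  // and always closes the resource, even if `use` throws.
  def withResource[R, A](open: () => R, close: R => Unit)(partition: Iterator[A])(use: (R, A) => Unit): Unit = {
    val resource = open()
    try partition.foreach(record => use(resource, record))
    finally close(resource)
  }
}
```

It would be called from inside rdd.foreachPartition, so the connection is created on the executor and never has to be serialized.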

Generate keywords using Apache Spark and mllib

I wrote code like this:
val hashingTF = new HashingTF()
val tfv: RDD[Vector] = sparkContext.parallelize(articlesList.map { t => hashingTF.transform(t.words) })
tfv.cache()
val idf = new IDF().fit(tfv)
val rate: RDD[Vector] = idf.transform(tfv)
How to get top 5 keywords from the "rate" RDD for each articlesList item?
ADD:
articlesList contains objects:
case class ArticleInfo (val url: String, val author: String, val date: String, val keyWords: List[String], val words: List[String])
words contains all words from article.
I do not understand the structure of rate; the documentation only says:
@return an RDD of TF-IDF vectors
My solution is:
(articlesList, rate.collect()).zipped.foreach { (art,tfidf) =>
val keywords = new mutable.TreeSet[(String, Double)]
art.words.foreach { word =>
val wordHash = hashingTF.indexOf(word)
val wordTFIDF = tfidf.apply(wordHash)
if (keywords.size == KEYWORD_COUNT) {
val minimum = keywords.minBy(_._2)
// Compare against the word's TF-IDF weight, not its hash index
if (minimum._2 < wordTFIDF) {
keywords.remove(minimum)
keywords.add((word,wordTFIDF))
}
} else {
keywords.add((word,wordTFIDF))
}
}
// Note: this assignment only compiles if keyWords is declared as a var in ArticleInfo
art.keyWords = keywords.toList.map(_._1)
}
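The selection step can also be written without a mutable set. The following is a minimal sketch in plain Scala, assuming the TF-IDF weight of each distinct word has already been looked up (for example with hashingTF.indexOf and the article's vector, as above). The names Keywords and topKeywords are made up for illustration.

```scala
object Keywords {
  // Returns the k words with the highest weight, highest first.
  // Ties are broken alphabetically so the result is deterministic.
  def topKeywords(weights: Map[String, Double], k: Int): List[String] =
    weights.toList
      .sortWith((a, b) => a._2 > b._2 || (a._2 == b._2 && a._1 < b._1))
      .take(k)
      .map(_._1)
}
```

Because HashingTF maps words to indices by hashing, two different words can share an index and therefore a weight, so the ranking is approximate.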