Issue in inserting data to Hive Table using Spark and Scala - scala

I am new to Spark. Here is something I wanna do.
I have created two data streams; first one reads data from text file and register it as a temptable using hivecontext. The other one continuously gets RDDs from Kafka and for each RDD, it it creates data streams and register the contents as temptable. Finally I join these two temp tables on a key to get final result set. I want to insert that result set in a hive table. But I am out of ideas. Tried to follow some exmples but that only create a table with one column in hive and that too not readable. Could you please show me how to insert results in a particular database and table of hive. Please note that I can see the results of join using show function so the real challenge lies with insertion in hive table.
Below is the code I am using.
imports.....
object MSCCDRFilter {
def main(args: Array[String]) {
val sparkConf = new SparkConf().setAppName("Flume, Kafka and Spark MSC CDRs Manipulation")
val sc = new SparkContext(sparkConf)
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._
val cgiDF = sc.textFile("file:///tmp/omer-learning/spark/dim_cells.txt").map(_.split(",")).map(p => CGIList(p(0).trim, p(1).trim, p(2).trim,p(3).trim)).toDF()
cgiDF.registerTempTable("my_cgi_list")
val CGITable=sqlContext.sql("select *"+
" from my_cgi_list")
CGITable.show() // this CGITable is a structure I defined in the project
val streamingContext = new StreamingContext(sc, Seconds(10)
val zkQuorum="hadoopserver:2181"
val topics=Map[String, Int]("FlumeToKafka"->1)
val messages: ReceiverInputDStream[(String, String)] = KafkaUtils.createStream(streamingContext,zkQuorum,"myGroup",topics)
val logLinesDStream = messages.map(_._2) //获取数据
logLinesDStream.print()
val MSCCDRDStream = logLinesDStream.map(MSC_KPI.parseLogLine) // change MSC_KPI to MCSCDR_GO if you wanna change the class
// MSCCDR_GO and MSC_KPI are structures defined in the project
MSCCDRDStream.foreachRDD(MSCCDR => {
println("+++++++++++++++++++++NEW RDD ="+ MSCCDR.count())
if (MSCCDR.count() == 0) {
println("==================No logs received in this time interval=================")
} else {
val dataf=sqlContext.createDataFrame(MSCCDR)
dataf.registerTempTable("hive_msc")
cgiDF.registerTempTable("my_cgi_list")
val sqlquery=sqlContext.sql("select a.cdr_type,a.CGI,a.cdr_time, a.mins_int, b.Lat, b.Long,b.SiteID from hive_msc a left join my_cgi_list b"
+" on a.CGI=b.CGI")
sqlquery.show()
sqlContext.sql("SET hive.exec.dynamic.partition = true;")
sqlContext.sql("SET hive.exec.dynamic.partition.mode = nonstrict;")
sqlquery.write.mode("append").partitionBy("CGI").saveAsTable("omeralvi.msc_data")
val FilteredCDR = sqlContext.sql("select p.*, q.* " +
" from MSCCDRFiltered p left join my_cgi_list q " +
"on p.CGI=q.CGI ")
println("======================print result =================")
FilteredCDR.show()
streamingContext.start()
streamingContext.awaitTermination()
}
}

I have had some success writing to Hive, using the following:
dataFrame
.coalesce(n)
.write
.format("orc")
.options(Map("path" -> savePath))
.mode(SaveMode.Append)
.saveAsTable(fullTableName)
Our attempts to use partitions weren't followed through with, because I think there was some issue with our desired partitioning column.
The only limitation is with concurrent writes, where the table does not exist yet, then any task tries to create the table (because it didn't exist when it first attempted to write to the table) will Exception out.
Be aware, that writing to Hive in streaming applications is usually bad design, as you will often write many small files, which is very inefficient to read and store. So if you write more often than every hour or so to Hive, you should make sure you include logic for compaction, or add an intermediate storage layer more suited to transactional data.

Related

update date for a Dataframe and join with Kafka stream data live in spark

I have a Kafka Stream source and a map table, which I want to join and then write the data to another Kafka topic. this job runs 24/7 without stop.
My issue is that the Map table that I wish to join is partitioned on date and every day I need the new updated Map table for the join.
But when the code runs it keeps using the same old map table, day after day without updating it.
import java.text.SimpleDateFormat
object joiningDF{
def newDate: String = {
val dFormat = new SimpleDateFormat("yyyy-MM-dd")
dateFormat.format(System.currentTimeMillis)
}
def main(args: Array[String]): Unit = {
var date=newDate
val source =spark.readStream.
format("kafka").
option("kafka.bootstrap.servers", "....").
option("subscribe", "....").
option("startingOffsets", "latest").
load()
// MAP TABLE date variable is used to get new date daily
var map=spark.read.parquet("path/day="+date)
val joinDF=source.join(map,Seq("id"),"left")
val outQ = joinDF.
writeStream.
outputMode("append").
format("kafka").
option("kafka.bootstrap.servers", "...").
option("topic", "...").
option("checkpointLocation", "...").
trigger(Trigger.ProcessingTime("300 seconds")).
start()
outQ.awaitTermination()
}
}
Is there a way to resolve it or a work around?
You could use a FileSystem source in continuous processing mode that is watching a directory into which you atomically move new versions of the file when they are ready. This will give you an updating stream to join with.

Using map() and filter() in Spark instead of spark.sql

I have two datasets that I want to INNER JOIN to give me a whole new table with the desired data. I used SQL and manage to get it. But now I want to try it with map() and filter(), is it possible?
This is my code using the SPARK SQL:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
object hello {
def main(args: Array[String]): Unit = {
val conf = new SparkConf()
.setMaster("local")
.setAppName("quest9")
val sc = new SparkContext(conf)
val spark = SparkSession.builder().appName("quest9").master("local").getOrCreate()
val zip_codes = spark.read.format("csv").option("header", "true").load("/home/hdfs/Documents/quest_9/doc/zip.csv")
val census = spark.read.format("csv").option("header", "true").load("/home/hdfs/Documents/quest_9/doc/census.csv")
census.createOrReplaceTempView("census")
zip_codes.createOrReplaceTempView("zip")
//val query = spark.sql("SELECT * FROM census")
val query = spark.sql("SELECT DISTINCT census.Total_Males AS male, census.Total_Females AS female FROM census INNER JOIN zip ON census.Zip_Code=zip.Zip_Code WHERE zip.City = 'Inglewood' AND zip.County = 'Los Angeles'")
query.show()
query.write.parquet("/home/hdfs/Documents/population/census/IDE/census.parquet")
sc.stop()
}
}
The only sensible way, in general to do this would be to use the join() method of `Dataset̀. I would urge you to question the need to use only map/filter to do this, as this is not intuitive, and will probably confuse any experienced spark developer (or simply put, make him roll his eyes). It may also lead to scalability issues should the dataset grow.
That said, in your use case, it is pretty simple to avoid using join. Another possibility would be to issue two separate jobs to spark :
fetch the zip code(s) that interests you
filter on the census data on that (those) zip code(s)
Step 1 collect the zip codes of interest (not sure of the exact syntax as I do not have a spark shell at hand, but it should be trivial to find the right one).
var codes: Seq[String] = zip_codes
// filter on the city
.filter(row => row.getAs[String]("City").equals("Inglewood"))
// filter on the county
.filter(row => row.getAs[String]("County").equals("Los Angeles"))
// map to zip code as a String
.map(row => row.getAs[String]("Zip_Code"))
.as[String]
// Collect on the driver side
.collect()
Then again, writing it this way instead of using select/where is pretty strange to anyone being used to spark.
Yet, the reason this will work is because we can be sure that zip codes matching a given town and county will be really small. So it is safe to perform driver side collcetion of the result.
Now on to step 2 :
census.filter(row => codes.contains(row.getAs[String]("Zip_Code")))
.map( /* whatever to get your data out */ )
What you need is a join, your query roughly translates to :
census.as("census")
.join(
broadcast(zip_codes
.where($"City"==="Inglewood")
.where($"County"==="Los Angeles")
.as("zip"))
,Seq("Zip_Code"),
"inner" // "leftsemi" would also be sufficient
)
.select(
$"census.Total_Males".as("male"),
$"census.Total_Females".as("female")
).distinct()

Update/Upsert Data in Redshift Table using Spark (or Spark Streaming)

I am Reading data from S3 using Spark Streaming, and I want to update stream data into Amazon Redshift. Data with same primary key exist then that row should be updated and new rows should be inserted. Can someone please suggest the right approach to do it considering the performance?
val ssc = new StreamingContext(sc, Duration(30000))
val lines = ssc.textFileStream("s3://<path-to-data>/YYYY/MM/DD/HH")
lines.foreachRDD(
x => {
val normalizedRDD = processRDD(x)
val df = spark.createDataset(normalizedRDD)
//TODO: How to Update/Upsert data in Redshift?
}
)

Hbase insert are very slow when kafka avro records are converted to Json

I am using Kafka 10 and receiving records in it from DB2 CDC. Kafka 10 uses Confluent Schema Registry to store the DB2 table schema and sends the records as Avro Array[Byte]. I want to store these records into Hbase (lets say Raw Hbase) and then run some transformation over those new records(like dropping columns, aggregation etc) using Hive and store the transformed records again into Hbase (lets say conformed Hbase). I tried 2 approaches and both are giving me some kind of issues. The records are big in length with ~500 columns(although only 10% of columns are req.) and each record is of size ~10kb.
1) I tried deserializing the records into Array[Byte] and then use the streamBulkPut method to insert it into Hbase.
Deserializer code:
def toRecord(buffer: Array[Byte]): Array[Byte] = {
var schemaRegistry: SchemaRegistryClient = null
schemaRegistry= new CachedSchemaRegistryClient(url, 10)
val bb = ByteBuffer.wrap(buffer)
bb.get() // consume MAGIC_BYTE
val schemaId = bb.getInt // consume schemaId //println(schemaId.toString)
val schema = schemaRegistry.getByID(schemaId) // consult the Schema Registry //println(schema)
val reader = new GenericDatumReader[GenericRecord](schema)
val decoder = DecoderFactory.get().binaryDecoder(buffer, bb.position(), bb.remaining(), null)
val writer = new GenericDatumWriter[GenericRecord](schema)
val baos = new ByteArrayOutputStream
val jsonEncoder = EncoderFactory.get.jsonEncoder(schema, baos)
writer.write( reader.read(null, decoder), jsonEncoder) //reader.read(null, decoder): returns Generic record
jsonEncoder.flush
baos.toByteArray
}
HBase bulkPut code:
val messages = KafkaUtils.createDirectStream[Object,Array[Byte],KafkaAvroDecoder,DefaultDecoder](ssc, kafkaParams, topicSet)
val hconf = HBaseConfiguration.create()
val hbaseContext = new HBaseContext(ssc.sparkContext, hconf)
val tableName = "your_table"
var rowKeyArray: Array[String] = null
hbaseContext.streamBulkPut(messages,TableName.valueOf(tableName),putFunction)
def putFunction(avroRecord:Tuple2[Object,Array[Byte]]):Put = {
implicit val formats = DefaultFormats
val recordKey = getKeyString(parse(avroRecord._1.toString.mkString).extract[Map[String,String]].values.mkString)
var put = new Put(Bytes.toBytes(recordKey))
put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("row"), AvroDeserializer.toRecord(avroRecord._2))
put
}
def getKeyString(keystr:String):String = {
(Math.abs(keystr map (_.hashCode) reduceLeft( 31 * _ + _) ) % 10 + 48).toChar + "_" + keystr.trim
}
Now this method works but the inserts are painfully slow. I am getting a throughput of ~5k records per minute. The plan was once the records are in Raw Hbase I will use Hive to read and explode the json to run the transformation.
2) Instead of re-serializing the records while storing into Raw Hbase I thought of doing it while loading from Raw->Conformed Hbase (I can manage the slowness here as the data will be already with me i.e. out of kafka). So I tried storing Avro records as it is into Hbase and it ran very fast, I was able to insert 1.5 Million records in 2 mins. Below is code:
hbaseContext.streamBulkPut(messages,TableName.valueOf(tableName),putFunction)
def putFunction(avroRecord:Tuple2[Object,Array[Byte]]):Put = {
implicit val formats = DefaultFormats
val recordKey = parse(avroRecord._1.toString.mkString).extract[Map[String,String]]
var put = new Put(Bytes.toBytes(getKeyString(recordKey.values.mkString)))
put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("row"), avroRecord._2)
put
}
The problem with this approach is Hive is not able to read Avro records from Hbase and I cannot filter the records/run any logic on it.
I would appreciate any kind of help or resource that I can follow to improve the performance. Any approach would work for me if its corresponding issue is solved. Thanks

Spark is duplicating work

I am facing a strange behaviour from Spark. Here's my code:
object MyJob {
def main(args: Array[String]): Unit = {
val sc = new SparkContext()
val sqlContext = new hive.HiveContext(sc)
val query = "<Some Hive Query>"
val rawData = sqlContext.sql(query).cache()
val aggregatedData = rawData.groupBy("group_key")
.agg(
max("col1").as("max"),
min("col2").as("min")
)
val redisConfig = new RedisConfig(new RedisEndpoint(sc.getConf))
aggregatedData.foreachPartition {
rows =>
writePartitionToRedis(rows, redisConfig)
}
aggregatedData.write.parquet(s"/data/output.parquet")
}
}
Against my intuition the spark scheduler yields two jobs for each data sink (Redis, HDFS/Parquet). The problem is the second job is also performing the hive query and doubling the work. I assumed both write operations would share the data from aggregatedData stage. Is something wrong or is it behaviour to be expected?
You've missed a fundamental concept of spark: Lazyness.
An RDD does not contain any data, all it is is a set of instructions that will be executed when you call an action (like writing data to disk/hdfs). If you reuse an RDD (or Dataframe), there's no stored data, just store instructions that will need to be evaluated everytime you call an action.
If you want to reuse data without needing to reevaluate an RDD, use .cache() or preferably persist. Persisting an RDD allows you to store the result of a transformation so that the RDD doesn't need to be reevaluated in future iterations.