Unable to Analyse data - scala

The imports I made
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.api.common.state.{MapStateDescriptor, ValueState, ValueStateDescriptor}
import org.apache.flink.api.scala.typeutils.Types
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.datastream.BroadcastStream
import org.apache.flink.streaming.api.functions.co.KeyedBroadcastProcessFunction
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector
Here's the code
val env = StreamExecutionEnvironment.getExecutionEnvironment
val properties = new Properties()
properties.setProperty("bootstrap.servers","localhost:9092")
val patternStream = new FlinkKafkaConsumer010("patterns", new SimpleStringSchema, properties)
val patterns = env.addSource(patternStream)
var patternData = patterns.map {
str =>
val splitted_str = str.split(",")
PatternStream(splitted_str(0).trim, splitted_str(1).trim, splitted_str(2).trim)
}
val logsStream = new FlinkKafkaConsumer010("logs", new SimpleStringSchema, properties)
// logsStream.setStartFromEarliest()
val logs = env.addSource(logsStream)
var data = logs.map {
str =>
val splitted_str = str.split(",")
LogsTest(splitted_str.head.trim, splitted_str(1).trim, splitted_str(2).trim)
}
val keyedData: KeyedStream[LogsTest, String] = data.keyBy(_.metric)
val bcStateDescriptor = new MapStateDescriptor[Unit, PatternStream]("patterns", Types.UNIT, Types.of[PatternStream]) // first type defined is for the key and second data type defined is for the value
val broadcastPatterns: BroadcastStream[PatternStream] = patternData.broadcast(bcStateDescriptor)
val alerts = keyedData
.connect(broadcastPatterns)
.process(new PatternEvaluator())
alerts.print()
// println(alerts.getClass)
// val sinkProducer = new FlinkKafkaProducer010("output", new SimpleStringSchema(), properties)
env.execute("Flink Broadcast State Job")
}
class PatternEvaluator()
extends KeyedBroadcastProcessFunction[String, LogsTest, PatternStream, (String, String, String)] {
private lazy val patternStateDescriptor = new MapStateDescriptor("patterns", classOf[String], classOf[String])
private var lastMetricState: ValueState[String] = _
override def open(parameters: Configuration): Unit = {
val lastMetricDescriptor = new ValueStateDescriptor("last-metric", classOf[String])
lastMetricState = getRuntimeContext.getState(lastMetricDescriptor)
}
override def processElement(reading: LogsTest,
readOnlyCtx: KeyedBroadcastProcessFunction[String, LogsTest, PatternStream, (String, String, String)]#ReadOnlyContext,
out: Collector[(String, String, String)]): Unit = {
val metrics = readOnlyCtx.getBroadcastState(patternStateDescriptor)
if (metrics.contains(reading.metric)) {
val metricPattern: String = metrics.get(reading.metric)
val metricPatternValue: String = metrics.get(reading.value)
val lastMetric = lastMetricState.value()
val logsMetric = (reading.metric)
val logsValue = (reading.value)
if (logsMetric == metricPattern) {
if (metricPatternValue == logsValue) {
out.collect((reading.timestamp, reading.value, reading.metric))
}
}
}
}
override def processBroadcastElement(
update: PatternStream,
ctx: KeyedBroadcastProcessFunction[String, LogsTest, PatternStream, (String, String, String)]#Context,
out: Collector[(String, String, String)]
): Unit = {
val patterns = ctx.getBroadcastState(patternStateDescriptor)
if (update.metric == "IP") {
patterns.put(update.metric /*,update.operator*/ , update.value)
}
// else if (update.metric == "username"){
// patterns.put(update.metric, update.value)
// }
// else {
// println("No required data found")
// }
// }
}
}
Sample Data :- Logs Stream
"21/09/98","IP", "5.5.5.5"
Pattern Stream
"IP","==","5.5.5.5"
I'm unable to analyse the data and get the desired result, i.e. 21/09/98,IP,5.5.5.5
There's no error as of now; the data just isn't being analysed.
The code is reading both streams (checked)

One common source of trouble in cases like this is that the API offers no control over the order in which the patterns and the data are ingested. It could be that processElement is being called before processBroadcastElement.
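If readings can reach processElement before any pattern has been stored by processBroadcastElement, one common remedy is to buffer them in keyed state and evaluate them once patterns arrive. Below is a minimal sketch of that idea, not the asker's exact setup: it assumes LogsTest(timestamp, metric, value) and PatternStream(metric, operator, value) case classes and a broadcast state mapping metric name to expected value (note that getBroadcastState must be called with the same MapStateDescriptor that was passed to patternData.broadcast(...)).
import org.apache.flink.api.common.state.{ListStateDescriptor, MapStateDescriptor}
import org.apache.flink.streaming.api.functions.co.KeyedBroadcastProcessFunction
import org.apache.flink.util.Collector

// Assumed record shapes (adjust to your own case classes)
case class LogsTest(timestamp: String, metric: String, value: String)
case class PatternStream(metric: String, operator: String, value: String)

class BufferingPatternEvaluator
  extends KeyedBroadcastProcessFunction[String, LogsTest, PatternStream, (String, String, String)] {

  // Broadcast state: metric name -> expected value (same descriptor used for .broadcast(...))
  private lazy val patternStateDescriptor =
    new MapStateDescriptor("patterns", classOf[String], classOf[String])

  // Keyed state: readings that arrived before their pattern was broadcast
  private lazy val bufferDescriptor =
    new ListStateDescriptor("buffered-readings", classOf[LogsTest])

  override def processElement(
      reading: LogsTest,
      ctx: KeyedBroadcastProcessFunction[String, LogsTest, PatternStream, (String, String, String)]#ReadOnlyContext,
      out: Collector[(String, String, String)]): Unit = {
    val patterns = ctx.getBroadcastState(patternStateDescriptor)
    if (patterns.contains(reading.metric)) {
      if (patterns.get(reading.metric) == reading.value) {
        out.collect((reading.timestamp, reading.metric, reading.value))
      }
    } else {
      // No pattern for this metric yet -- park the reading instead of silently dropping it
      getRuntimeContext.getListState(bufferDescriptor).add(reading)
    }
  }

  override def processBroadcastElement(
      update: PatternStream,
      ctx: KeyedBroadcastProcessFunction[String, LogsTest, PatternStream, (String, String, String)]#Context,
      out: Collector[(String, String, String)]): Unit = {
    ctx.getBroadcastState(patternStateDescriptor).put(update.metric, update.value)
    // Buffered readings could now be re-evaluated, e.g. via ctx.applyToKeyedState
    // over bufferDescriptor or on a timer; omitted here for brevity.
  }
}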

Related

Spark Structured Streaming with HBase Sink

My use case is to read Kafka messages with Structured Streaming and use foreachBatch to push those messages into HBase, using bulk Puts to gain some performance over single Puts. I am able to push messages using foreach (thanks to Spark Structured Streaming with Hbase integration) but not able to do the same with the foreachBatch operation.
Can someone please help with this? Attaching the code below.
KafkaStructured.scala :
package com.test
import java.math.BigInteger
import java.util
import com.fasterxml.jackson.annotation.JsonIgnoreProperties
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
object KafkaStructured {
@JsonIgnoreProperties(ignoreUnknown = true)
case class Header(field1: String, field2: String, field3: String)
@JsonIgnoreProperties(ignoreUnknown = true)
case class Body(fieldx: String)
@JsonIgnoreProperties(ignoreUnknown = true)
case class Event(header: Header, body: Body)
@JsonIgnoreProperties(ignoreUnknown = true)
case class KafkaResp(event: Event)
@JsonIgnoreProperties(ignoreUnknown = true)
case class HBaseDF(field1: String, field2: String, field3: String)
def main(args: Array[String]): Unit = {
val jsonSchema = Encoders.product[KafkaResp].schema
val spark = SparkSession
.builder()
.appName("Kafka Spark")
.getOrCreate()
val df = spark
.readStream
.format("kafka")
.option...
.load()
import spark.sqlContext.implicits._
val flattenedDf: DataFrame =
df
.select($"value".cast("string").as("json"))
.select(from_json($"json", jsonSchema).as("data"))
.select("data.event.header.field1", "data.event.header.field2", "data.event.header.field3")
val hbaseDf = flattenedDf
.as[HBaseDF]
.filter(hbasedf => hbasedf != null && hbasedf.field1 != null)
flattenedDf
.writeStream
.option("truncate", "false")
.option("checkpointLocation", "some hdfs location")
.format("console")
.outputMode("append")
.start()
def bytes(data: String) = {
val bytes = data match {
case data if data != null && !data.isEmpty => Bytes.toBytes(data)
case _ => Bytes.toBytes("")
}
bytes
}
hbaseDf
.writeStream
.foreachBatch(function = (batchDf, batchId) => {
val putList = new util.ArrayList[Put]()
batchDf
.foreach(row => {
val p: Put = new Put(bytes(row.field1))
val cfName= bytes("fam1")
p.addColumn(cfName, bytes("field1"), bytes(row.field1))
p.addColumn(cfName, bytes("field2"), bytes(row.field2))
p.addColumn(cfName, bytes("field3"), bytes(row.field3))
putList.add(p)
})
new HBaseBulkForeachWriter[HBaseDF] {
override val tableName: String = "<my table name>"
override def bulkPut: util.ArrayList[Put] = {
putList
}
}
}
)
.start()
spark.streams.awaitAnyTermination()
}
}
HBaseBulkForeachWriter.scala :
package com.test
import java.util
import java.util.concurrent.ExecutorService
import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory, Put, Table}
import org.apache.hadoop.hbase.security.User
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.spark.sql.ForeachWriter
import scala.collection.mutable
trait HBaseBulkForeachWriter[RECORD] extends ForeachWriter[RECORD] {
val tableName: String
val hbaseConfResources: mutable.Seq[String] = mutable.Seq("location for core-site.xml", "location for hbase-site.xml")
def pool: Option[ExecutorService] = None
def user: Option[User] = None
private var hTable: Table = _
private var connection: Connection = _
override def open(partitionId: Long, version: Long): Boolean = {
connection = createConnection()
hTable = getHTable(connection)
true
}
def createConnection(): Connection = {
val hbaseConfig = HBaseConfiguration.create()
hbaseConfResources.foreach(hbaseConfig.addResource)
ConnectionFactory.createConnection(hbaseConfig, pool.orNull, user.orNull)
}
def getHTable(connection: Connection): Table = {
connection.getTable(TableName.valueOf(tableName))
}
override def process(record: RECORD): Unit = {
val put = bulkPut
hTable.put(put)
}
override def close(errorOrNull: Throwable): Unit = {
hTable.close()
connection.close()
}
def bulkPut: util.ArrayList[Put]
}
foreachBatch allows you to use foreachPartition inside the function.
The code executed inside a foreachPartition runs once per partition (rather than once per row), so connection setup and bulk writes can be amortized.
So you can create a function to build a Put:
def putValue(key: String, columnName: String, data: Array[Byte]): Put = {
val put = new Put(Bytes.toBytes(key))
put.addColumn(Bytes.toBytes("colFamily"), Bytes.toBytes(columnName), data)
}
Then a function to bulk insert the puts
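// Assumed to be in scope: import scala.collection.JavaConverters._ (for putList.asJava),
// org.apache.hadoop.conf.Configuration and the HBase client classes, plus zookeperUrl, tableName and a logger.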
def writePutList(putList: List[Put]): Unit = {
val config: Configuration = HBaseConfiguration.create()
config.set("hbase.zookeeper.quorum", zookeperUrl)
val connection: Connection = ConnectionFactory.createConnection(config)
val table = connection.getTable(TableName.valueOf(tableName))
table.put(putList.asJava)
logger.info("INSERT record[s] " + putList.size + " to table " + tableName + " OK.")
table.close()
connection.close()
}
And use them inside a foreachPartition and a map
def writeFunction: (DataFrame, Long) => Unit = {
(batchData, id) => {
batchData.foreachPartition(
partition => {
val putList = partition.map(
data =>
putValue(data.getAs[String]("keyField"), "colName", Bytes.toBytes(data.getAs[String]("valueField")))
).toList
writePutList(putList)
}
)
}
}
And finally use the function created in your streaming query:
df.writeStream
.queryName("yourQueryName")
.option("checkpointLocation", checkpointLocation)
.outputMode(OutputMode.Update())
.foreachBatch(writeFunction)
.start()
.awaitTermination()

How to build uber jar for Spark Structured Streaming application to MongoDB sink

I am unable to build a fat jar for my Kafka-SparkStructuredStreaming-MongoDB pipeline.
I have built StructuredStreamingProgram: it receives streaming data from Kafka topics and applies some parsing, and then my intention is to save the structured streaming data into a MongoDB collection.
I have followed this article to build my pipeline https://learningfromdata.blog/2017/04/16/real-time-data-ingestion-with-apache-spark-structured-streaming-implementation/
I have created Helpers.scala and MongoDBForeachWriter.scala as suggested in the article for my streaming pipeline and saved them under src/main/scala/example.
When I do sbt assembly to build a fat jar I face this error:
"[error] C:\spark_streaming\src\main\scala\example\structuredStreamApp.scala:63: class MongoDBForeachWriter is abstract; cannot be instantiated
[error] val structuredStreamForeachWriter: MongoDBForeachWriter = new MongoDBForeachWriter(mongodb_uri,mdb_name,mdb_collection,CountAccum)"
I need guidance in making this pipeline work.
Any help will be appreciated
package example
import java.util.Calendar
import org.apache.spark.util.LongAccumulator
import org.apache.spark.sql.Row
import org.apache.spark.sql.ForeachWriter
import org.mongodb.scala._
import org.mongodb.scala.bson.collection.mutable.Document
import org.mongodb.scala.bson._
import example.Helpers._
abstract class MongoDBForeachWriter(p_uri: String,
p_dbName: String,
p_collectionName: String,
p_messageCountAccum: LongAccumulator) extends ForeachWriter[Row] {
val mongodbURI = p_uri
val dbName = p_dbName
val collectionName = p_collectionName
val messageCountAccum = p_messageCountAccum
var mongoClient: MongoClient = null
var db: MongoDatabase = null
var collection: MongoCollection[Document] = null
def ensureMongoDBConnection(): Unit = {
if (mongoClient == null) {
mongoClient = MongoClient(mongodbURI)
db = mongoClient.getDatabase(dbName)
collection = db.getCollection(collectionName)
}
}
override def open(partitionId: Long, version: Long): Boolean = {
true
}
override def process(record: Row): Unit = {
val valueStr = new String(record.getAs[Array[Byte]]("value"))
val doc: Document = Document(valueStr)
doc += ("log_time" -> Calendar.getInstance().getTime())
// lazy opening of MongoDB connection
ensureMongoDBConnection()
val result = collection.insertOne(doc).results()
// tracks how many records I have processed
if (messageCountAccum != null)
messageCountAccum.add(1)
}
}
package example
import java.util.concurrent.TimeUnit
import scala.concurrent.Await
import scala.concurrent.duration.Duration
import org.mongodb.scala._
object Helpers {
implicit class DocumentObservable[C](val observable: Observable[Document]) extends ImplicitObservable[Document] {
override val converter: (Document) => String = (doc) => doc.toJson
}
implicit class GenericObservable[C](val observable: Observable[C]) extends ImplicitObservable[C] {
override val converter: (C) => String = (doc) => doc.toString
}
trait ImplicitObservable[C] {
val observable: Observable[C]
val converter: (C) => String
def results(): Seq[C] = Await.result(observable.toFuture(), Duration(10, TimeUnit.SECONDS))
def headResult() = Await.result(observable.head(), Duration(10, TimeUnit.SECONDS))
def printResults(initial: String = ""): Unit = {
if (initial.length > 0) print(initial)
results().foreach(res => println(converter(res)))
}
def printHeadResult(initial: String = ""): Unit = println(s"${initial}${converter(headResult())}")
}
}
package example
import org.apache.spark.sql.functions.{col, _}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.util.LongAccumulator
import example.Helpers._
import java.util.Calendar
object StructuredStreamingProgram {
def main(args: Array[String]): Unit = {
val spark = SparkSession
.builder()
.appName("OSB_Streaming_Model")
.getOrCreate()
import spark.implicits._
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "10.160.172.45:9092, 10.160.172.46:9092, 10.160.172.100:9092")
.option("subscribe", "TOPIC_WITH_COMP_P2_R2, TOPIC_WITH_COMP_P2_R2.DIT, TOPIC_WITHOUT_COMP_P2_R2.DIT")
.load()
val dfs = df.selectExpr("CAST(value AS STRING)").toDF()
val data =dfs.withColumn("splitted", split($"SERVICE_NAME8", "/"))
.select($"splitted".getItem(4).alias("region"),$"splitted".getItem(5).alias("service"),col("_raw"))
.withColumn("service_type", regexp_extract($"service", """.*(Inbound|Outbound|Outound).*""",1))
.withColumn("region_type", concat(
when(col("region").isNotNull,col("region")).otherwise(lit("null")), lit(" "),
when(col("service").isNotNull,col("service_type")).otherwise(lit("null"))))
val extractedDF = data.filter(
col("region").isNotNull &&
col("service").isNotNull &&
col("_raw").isNotNull &&
col("service_type").isNotNull &&
col("region_type").isNotNull)
.filter("region != ''")
.filter("service != ''")
.filter("_raw != ''")
.filter("service_type != ''")
.filter("region_type != ''")
// sends to MongoDB once every 20 seconds
val mongodb_uri = "mongodb://dstk8sdev06.us.dell.com/?maxPoolSize=1"
val mdb_name = "HANZO_MDB"
val mdb_collection = "Testing_Spark"
val CountAccum: LongAccumulator = spark.sparkContext.longAccumulator("mongostreamcount")
val structuredStreamForeachWriter: MongoDBForeachWriter = new MongoDBForeachWriter(mongodb_uri,mdb_name,mdb_collection,CountAccum)
val query = df.writeStream
.foreach(structuredStreamForeachWriter)
.trigger(Trigger.ProcessingTime("20 seconds"))
.start()
while (!spark.streams.awaitAnyTermination(60000)) {
println(Calendar.getInstance().getTime()+" :: mongoEventsCount = "+CountAccum.value)
}
}
}
With corrections to the above, I need to be able to save the structured streaming data into MongoDB.
You cannot instantiate an abstract class. To resolve this issue, implement the close function in the MongoDBForeachWriter class and make it a concrete class.
class MongoDBForeachWriter(p_uri: String,
p_dbName: String,
p_collectionName: String,
p_messageCountAccum: LongAccumulator) extends ForeachWriter[Row] {
val mongodbURI = p_uri
val dbName = p_dbName
val collectionName = p_collectionName
val messageCountAccum = p_messageCountAccum
var mongoClient: MongoClient = null
var db: MongoDatabase = null
var collection: MongoCollection[Document] = null
def ensureMongoDBConnection(): Unit = {
if (mongoClient == null) {
mongoClient = MongoClient(mongodbURI)
db = mongoClient.getDatabase(dbName)
collection = db.getCollection(collectionName)
}
}
override def open(partitionId: Long, version: Long): Boolean = {
true
}
override def process(record: Row): Unit = {
val valueStr = new String(record.getAs[Array[Byte]]("value"))
val doc: Document = Document(valueStr)
doc += ("log_time" -> Calendar.getInstance().getTime())
// lazy opening of MongoDB connection
ensureMongoDBConnection()
val result = collection.insertOne(doc).results()
// tracks how many records I have processed
if (messageCountAccum != null)
messageCountAccum.add(1)
}
override def close(errorOrNull: Throwable): Unit = {
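// Try here is scala.util.Try (add "import scala.util.Try"); it simply swallows any error thrown while closing the client.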
if(mongoClient != null) {
Try {
mongoClient.close()
}
}
}
}
Hope this helps.
Ravi

Connect to Amazon account using Scala

I want to connect to my Amazon account in order to delete resources inside my s3 storage.
I have the access key and secret key, and this is how I started to build my connection to Amazon:
def connectToAmaozn(): Unit = {
val AWS_ACCESS_KEY=conf.getString("WebRecorder.PushSession.AccessKey")
val AWS_SECRET_KEY=conf.getString("WebRecorder.PushSession.SecretKey")
val AWSCredentials = new BasicAWSCredentials(AWS_ACCESS_KEY,AWS_SECRET_KEY)
}
Can you elaborate on how I may do this?
I used this solution to get bucket name and number of objects:
import scala.collection.JavaConversions._
import com.amazonaws.auth.{AWSStaticCredentialsProvider, BasicAWSCredentials}
import com.amazonaws.services.s3
import com.amazonaws.services.s3.model.{GetObjectTaggingRequest, ObjectListing, S3ObjectSummary}
import com.amazonaws.services.s3.{AmazonS3Client, AmazonS3ClientBuilder}
import com.clicktale.pipeline.framework.dal.ConfigParser.conf
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.auth.BasicAWSCredentials
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model._
import scala.language.postfixOps
class Amazon {
val AWS_ACCESS_KEY = conf.getString("WebRecorder.PushSession.AccessKey")
val AWS_SECRET_KEY = conf.getString("WebRecorder.PushSession.SecretKey")
val bucketName = "nv-q-s3-assets-01"
val provider = new AWSStaticCredentialsProvider(new BasicAWSCredentials(AWS_ACCESS_KEY, AWS_SECRET_KEY))
val client = AmazonS3ClientBuilder.standard().withCredentials(provider).withRegion("us-east-1").build()
// def connectToAmazon(): Unit = {
//
// val provider = new AWSStaticCredentialsProvider(new BasicAWSCredentials(AWS_ACCESS_KEY, AWS_SECRET_KEY))
// val client = AmazonS3ClientBuilder.standard().withCredentials(provider).withRegion("us-east-1").build()
def removeObjectsFromBucket(){
println("Removing objects from bucket...")
var object_listing: ObjectListing = client.listObjects(bucketName)
var flag: Boolean = true
while (flag) {
val iterator: Iterator[_] = object_listing.getObjectSummaries.iterator()
while (iterator.hasNext) {
val summary: S3ObjectSummary = iterator.next().asInstanceOf[S3ObjectSummary]
client.deleteObject(bucketName, summary.getKey())
}
flag=false
}
}
def countNumberOfObjectsInsideBucket(): Unit ={
var object_listing: ObjectListing = client.listObjects(bucketName)
var flag: Boolean = true
var count=0
while (flag) {
val iterator: Iterator[_] = object_listing.getObjectSummaries.iterator()
while (iterator.hasNext) {
val summary: S3ObjectSummary = iterator.next().asInstanceOf[S3ObjectSummary]
count+=1
}
flag=false
println("Number of objects are: " + count)
}
}
}
You need an AWSCredentialsProvider:
val provider = new AWSStaticCredentialsProvider(
new BasicAWSCredentials(AWS_ACCESS_KEY,AWS_SECRET_KEY)
)
and then use it to create the client:
val client = AmazonS3ClientBuilder
.standard
.withCredentials(provider)
.withRegion("us-west-1") // or whatever your region is
.build
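If the end goal is to delete everything in the bucket, note that listObjects returns at most one page of keys, so the single-pass while loops above can miss objects in large buckets. A minimal sketch (assuming a client and bucketName like the ones defined above) that follows the pagination:
import scala.collection.JavaConversions._
import com.amazonaws.services.s3.model.ObjectListing

def deleteAllObjects(bucketName: String): Unit = {
  var listing: ObjectListing = client.listObjects(bucketName)
  var morePages = true
  while (morePages) {
    // Delete every key on the current page
    listing.getObjectSummaries.foreach(summary => client.deleteObject(bucketName, summary.getKey))
    // Fetch the next page, if any
    if (listing.isTruncated) listing = client.listNextBatchOfObjects(listing)
    else morePages = false
  }
}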

How to write RDD[List[(ImmutableBytesWritable, Put)] to HBase using saveAsNewAPIHadoopDataset

I know that we can use saveAsNewAPIHadoopDataset with RDD[(ImmutableBytesWritable, Put)] to write to an HBase table using Spark.
But I have a list, i.e. RDD[List[(ImmutableBytesWritable, Put)]], which I want to write to 2 different HBase tables.
How to do it?
Below is the code.
package com.scryAnalytics.FeatureExtractionController
import com.scryAnalytics.FeatureExtractionController.DAO.{DocumentEntitiesDAO, NLPEntitiesDAO, SegmentFeaturesDAO}
import com.scryAnalytics.NLPGeneric.{GateGenericNLP, NLPEntities}
import com.sun.xml.bind.v2.TODO
import com.vocp.ner.main.GateNERImpl
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.hbase.{HBaseConfiguration, HConstants, HTableDescriptor, TableName}
import org.apache.hadoop.hbase.client.{HBaseAdmin, Result}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{MultiTableOutputFormat, TableInputFormat, TableOutputFormat}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.mapreduce.Job
import com.scryAnalytics.FeatureExtraction.SegmentsFeatureExtraction
import com.scryAnalytics.FeatureExtraction.DAO.VOCPEntities
import scala.collection.JavaConversions._
import gate.FeatureMap
import java.util.Map.Entry
import scala.collection.JavaConversions
import scala.util.control.Breaks.break
import scala.util.control.ControlThrowable
/**
* Created by sahil on 1/12/16.
*/
object Main {
def main(args: Array[String]): Unit = {
val inputTableName = "posts"
val outputTableName = "drugSegmentNew1"
val pluginHome = "/home/sahil/Voice-of-Cancer-Patients/VOCP Modules/bin/plugins"
val sc = new SparkContext(new SparkConf().setAppName("HBaseRead").setMaster("local[4]"))
val conf = HBaseConfiguration.create()
conf.set(HConstants.ZOOKEEPER_QUORUM, "localhost")
conf.set(TableInputFormat.INPUT_TABLE, inputTableName)
val admin = new HBaseAdmin(conf)
if (!admin.isTableAvailable(inputTableName)) {
val tableDesc = new HTableDescriptor(TableName.valueOf(inputTableName))
admin.createTable(tableDesc)
}
val job: Job = Job.getInstance(conf, "FeatureExtractionJob")
job.setOutputFormatClass(classOf[MultiTableOutputFormat])
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
classOf[ImmutableBytesWritable], classOf[Result])
val resultRDD = hBaseRDD.map(x => x._2)
// TODO: Add filters
val entity: VOCPEntities = VOCPEntities.DRUG
val nlpRDD = resultRDD.mapPartitions { iter =>
val nlpEntities: NLPEntitiesDAO = new NLPEntitiesDAO
iter.map {
result =>
val message = Bytes.toString(result.getValue(Bytes.toBytes("p"), Bytes.toBytes("message")))
val row_key = Bytes.toString(result.getRow)
nlpEntities.setToken(Utility.jsonToAnnotations(Bytes.toString(
result.getValue(Bytes.toBytes("gen"), Bytes.toBytes("token")))))
nlpEntities.setSpaceToken(Utility.jsonToAnnotations(Bytes.toString(
result.getValue(Bytes.toBytes("gen"), Bytes.toBytes("spaceToken")))))
nlpEntities.setSentence(Utility.jsonToAnnotations(Bytes.toString(
result.getValue(Bytes.toBytes("gen"), Bytes.toBytes("sentence")))))
nlpEntities.setVG(Utility.jsonToAnnotations(Bytes.toString(
result.getValue(Bytes.toBytes("gen"), Bytes.toBytes("verbGroup")))))
nlpEntities.setSplit(Utility.jsonToAnnotations(Bytes.toString(
result.getValue(Bytes.toBytes("gen"), Bytes.toBytes("split")))))
nlpEntities.setNounChunk(Utility.jsonToAnnotations(Bytes.toString(
result.getValue(Bytes.toBytes("gen"), Bytes.toBytes("nounChunk")))))
nlpEntities.setDrugs(Utility.jsonToAnnotations(Bytes.toString(
result.getValue(Bytes.toBytes("ner"), Bytes.toBytes("drug")))))
nlpEntities.setRegimen(Utility.jsonToAnnotations(Bytes.toString(
result.getValue(Bytes.toBytes("ner"), Bytes.toBytes("regimen")))))
nlpEntities.setSideEffects(Utility.jsonToAnnotations(Bytes.toString(
result.getValue(Bytes.toBytes("ner"), Bytes.toBytes("sideEffect")))))
nlpEntities.setALT_DRUG(Utility.jsonToAnnotations(Bytes.toString(
result.getValue(Bytes.toBytes("ner"), Bytes.toBytes("altDrug")))))
nlpEntities.setALT_THERAPY(Utility.jsonToAnnotations(Bytes.toString(
result.getValue(Bytes.toBytes("ner"), Bytes.toBytes("altTherapy")))))
(row_key, message, nlpEntities)
}
}
val featureExtractionOld: SegmentsFeatureExtraction = new SegmentsFeatureExtraction(
pluginHome, entity)
val outputRDD = nlpRDD.mapPartitions { iter =>
val featureExtraction: SegmentsFeatureExtraction = new SegmentsFeatureExtraction(
pluginHome, entity)
iter.map { x =>
val featuresJson = featureExtraction.generateFeatures(x._2, Utility.objectToJson(x._3))
val segmentFeatures: SegmentFeaturesDAO = Utility.jsonToSegmentFeatures(featuresJson)
val documentEntities: DocumentEntitiesDAO = new DocumentEntitiesDAO
documentEntities.setSystemId(x._1)
documentEntities.setToken(x._3.getToken)
documentEntities.setSpaceToken(x._3.getSpaceToken)
documentEntities.setSentence(x._3.getSentence)
documentEntities.setVG(x._3.getVG)
documentEntities.setNounChunk(x._3.getNounChunk)
documentEntities.setSplit(x._3.getSplit)
documentEntities.setDRUG(x._3.getDrugs)
documentEntities.setSE(x._3.getSideEffects)
documentEntities.setREG(x._3.getRegimen)
documentEntities.setALT_DRUG(x._3.getALT_DRUG)
documentEntities.setALT_THERAPY(x._3.getALT_THERAPY)
documentEntities.setSegment(segmentFeatures.getSegment)
documentEntities.setSegmentClass(segmentFeatures.getSegmentClass)
documentEntities.setSegmentInstance(segmentFeatures.getSegmentInstance)
(x._1, documentEntities)
}
}
val newRDD = outputRDD.map { k => convertToPut(k) }
newRDD.saveAsNewAPIHadoopDataset(job.getConfiguration())
}
def convertToPut(NlpWithRowKey: (String, DocumentEntitiesDAO)): List[(ImmutableBytesWritable, Put)] = {
val rowkey = NlpWithRowKey._1
val documentEntities = NlpWithRowKey._2
var returnList: List[(ImmutableBytesWritable, Put)] = List()
val segmentInstances = documentEntities.getSegmentInstance
val segments = documentEntities.getSegment
if(segments != null) {
var count = 0
for(segment <- segmentInstances) {
val keyString: String = documentEntities.getSystemId + "#" + Integer.toString(count)
count = count + 1
val outputKey: ImmutableBytesWritable = new ImmutableBytesWritable(keyString.getBytes())
val put = new Put(outputKey.get())
val features: FeatureMap = segment.getFeatures
val it: Iterator[Entry[Object, Object]] = features.entrySet.iterator()
var sideEffect_offset = "NULL"
var entity_offset = "NULL"
while(it.hasNext) {
val pair = it.next()
if(pair.getKey.equals("sideEffect-offset")) {
sideEffect_offset = pair.getValue().toString()
}
else if(pair.getKey.equals("drug-offset")) {
entity_offset = pair.getValue().toString()
}
else if(pair.getKey().equals("drug") || pair.getKey().equals("sideEffect")){
put.add(Bytes.toBytes("seg"), Bytes.toBytes(pair.getKey.toString), Bytes
.toBytes(pair.getValue().toString))
}
else {
put.add(Bytes.toBytes("segFeatures"), Bytes.toBytes(pair.getKey.toString), Bytes
.toBytes(pair.getValue().toString))
}
}
put.add(Bytes.toBytes("seg"), Bytes.toBytes("RelationId"),
Bytes.toBytes(documentEntities.getSystemId() + "-" + entity_offset + "-" + sideEffect_offset))
put.add(Bytes.toBytes("segInst"),Bytes.toBytes("id"), Bytes.toBytes(segment.getId()))
put.add(Bytes.toBytes("segInst"), Bytes.toBytes("type"), Bytes.toBytes(segment.getType()))
put.add(Bytes.toBytes("segInst"), Bytes.toBytes("startNodeId"), Bytes.toBytes(
segment.getStartNode().getId()))
put.add(Bytes.toBytes("segInst"), Bytes.toBytes("startNodeOffset"),
Bytes.toBytes(segment.getStartNode().getOffset()))
put.add(Bytes.toBytes("segInst"),Bytes.toBytes("endNodeId"),
Bytes.toBytes(segment.getEndNode().getId()))
put.add(Bytes.toBytes("segInst"), Bytes.toBytes("endNodeOffset"),
Bytes.toBytes(segment.getEndNode().getOffset()))
put.add(Bytes.toBytes("seg"),Bytes.toBytes("system_id"),
Bytes.toBytes(documentEntities.getSystemId()))
put.add(Bytes.toBytes("seg"), Bytes.toBytes("segmentText"),
Bytes.toBytes(segment.getAnnotatedText()))
for(segmentClassAnnots <- documentEntities.getSegmentClass) {
try {
if (segment.getId().equals(segmentClassAnnots.getFeatures().get("instance-id"))) {
put.add(Bytes.toBytes("segClass"), Bytes.toBytes("id"),
Bytes.toBytes(segmentClassAnnots.getId()))
put.add(Bytes.toBytes("segClass"), Bytes.toBytes("type"),
Bytes.toBytes(segmentClassAnnots.getType()))
put.add(Bytes.toBytes("segClass"), Bytes.toBytes("startNodeId"), Bytes
.toBytes(segmentClassAnnots.getStartNode()
.getId()))
put.add(Bytes.toBytes("segClass"), Bytes.toBytes("startNodeOffset"), Bytes
.toBytes(segmentClassAnnots.getStartNode()
.getOffset()))
put.add(Bytes.toBytes("segClass"), Bytes.toBytes("endNodeId"), Bytes
.toBytes(segmentClassAnnots.getEndNode()
.getId()))
put.add(Bytes.toBytes("segClass"), Bytes.toBytes("endNodeOffset"), Bytes
.toBytes(segmentClassAnnots.getEndNode()
.getOffset()))
break
}
} catch {
case t: Throwable => t.printStackTrace
}
returnList = returnList:+((new ImmutableBytesWritable(Bytes.toBytes("drugSegmentNew1")), put))
}
}
}
val PUT = new Put(Bytes.toBytes(rowkey))
PUT.add(Bytes.toBytes("f"), Bytes.toBytes("dStatus"), Bytes.toBytes("1"))
returnList = returnList:+((new ImmutableBytesWritable(Bytes.toBytes("posts")), PUT))
(returnList)
}
}
Just change your line below:
val newRDD = outputRDD.map { k => convertToPut(k) }
with this line:
val newRDD = outputRDD.flatMap { k => convertToPut(k) }
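The reason this works: convertToPut returns a List of (table, Put) pairs per input record, so map produces an RDD of Lists, while saveAsNewAPIHadoopDataset with MultiTableOutputFormat expects a flat RDD of (ImmutableBytesWritable, Put) pairs whose key names the target table. A sketch of the types involved, reusing outputRDD, convertToPut and job from the code above:
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.spark.rdd.RDD

// outputRDD.map(convertToPut)     : RDD[List[(ImmutableBytesWritable, Put)]]  -- nested, wrong shape
// outputRDD.flatMap(convertToPut) : RDD[(ImmutableBytesWritable, Put)]        -- one pair per Put
val newRDD: RDD[(ImmutableBytesWritable, Put)] = outputRDD.flatMap(k => convertToPut(k))

// MultiTableOutputFormat routes each Put to the table named by its key
// ("posts" or "drugSegmentNew1" in convertToPut above).
newRDD.saveAsNewAPIHadoopDataset(job.getConfiguration())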
Hope this helps!

Serializing to disk and deserializing Scala objects using Pickling

Given a stream of homogeneously typed objects, how would I go about serializing them to binary, writing them to disk, reading them back from disk and then deserializing them using Scala Pickling?
For example:
object PicklingIteratorExample extends App {
import scala.pickling.Defaults._
import scala.pickling.binary._
import scala.pickling.static._
case class Person(name: String, age: Int)
val personsIt = Iterator.from(0).take(10).map(i => Person(i.toString, i))
val pklsIt = personsIt.map(_.pickle)
??? // Write to disk
val readIt: Iterator[Person] = ??? // Read from disk and unpickle
}
I found a way to do so for regular (uncompressed) files:
object PickleIOExample extends App {
import scala.pickling.Defaults._
import scala.pickling.binary._
import scala.pickling.static._
val tempPath = File.createTempFile("pickling", ".gz").getAbsolutePath
val outputStream = new FileOutputStream(tempPath)
val inputStream = new FileInputStream(tempPath)
val persons = for{
i <- 1 to 100
} yield Person(i.toString, i)
val output = new StreamOutput(outputStream)
persons.foreach(_.pickleTo(output))
outputStream.close()
val personsIt = new Iterator[Person]{
val streamPickle = BinaryPickle(inputStream)
override def hasNext: Boolean = inputStream.available > 0
override def next(): Person = streamPickle.unpickle[Person]
}
println(personsIt.mkString(", "))
inputStream.close()
}
But I am still unable to find a solution that works with gzipped files, since I do not know how to detect the EOF. The following throws an EOFException because GZIPInputStream's available method does not indicate EOF:
object PickleIOExample extends App {
import scala.pickling.Defaults._
import scala.pickling.binary._
import scala.pickling.static._
val tempPath = File.createTempFile("pickling", ".gz").getAbsolutePath
val outputStream = new GZIPOutputStream(new FileOutputStream(tempPath))
val inputStream = new GZIPInputStream(new FileInputStream(tempPath))
val persons = for{
i <- 1 to 100
} yield Person(i.toString, i)
val output = new StreamOutput(outputStream)
persons.foreach(_.pickleTo(output))
outputStream.close()
val personsIt = new Iterator[Person]{
val streamPickle = BinaryPickle(inputStream)
override def hasNext: Boolean = inputStream.available > 0
override def next(): Person = streamPickle.unpickle[Person]
}
println(personsIt.mkString(", "))
inputStream.close()
}
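One way to detect end-of-stream when available() is unreliable (as it is for GZIPInputStream) is to peek a single byte through a PushbackInputStream and push it back when the stream is not yet exhausted. A minimal sketch along those lines, reusing Person, tempPath and the BinaryPickle streaming reader from the snippets above (not verified against every scala-pickling release):
import java.io.{FileInputStream, PushbackInputStream}
import java.util.zip.GZIPInputStream
import scala.pickling.Defaults._
import scala.pickling.binary._
import scala.pickling.static._

val pushback = new PushbackInputStream(new GZIPInputStream(new FileInputStream(tempPath)))
val streamPickle = BinaryPickle(pushback)

val personsIt = new Iterator[Person] {
  // Peek one byte: -1 means the decompressed stream is exhausted;
  // otherwise push the byte back so unpickling still sees it.
  override def hasNext: Boolean = {
    val b = pushback.read()
    if (b == -1) false else { pushback.unread(b); true }
  }
  override def next(): Person = streamPickle.unpickle[Person]
}

println(personsIt.mkString(", "))
pushback.close()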