Can't understand TwoPhaseCommitSinkFunction lifecycle - scala

I needed a sink to a Postgres DB, so I started building a custom Flink SinkFunction. Since FlinkKafkaProducer extends TwoPhaseCommitSinkFunction, I decided to do the same. As stated in O'Reilly's book Stream Processing with Apache Flink, you just need to implement the abstract methods and enable checkpointing, and you're good to go. What actually happens when I run my code is that commit is called only once, and it is called before invoke, which is totally unexpected: you shouldn't be ready to commit if your set of ready-to-commit transactions is empty. Worse, after the commit, invoke is called for every line in my file, and then abort is called, which is even more unexpected.
My understanding is that, once the sink is initialized, the following should occur:
beginTransaction is called and passes an identifier to invoke
invoke adds the lines to the transaction that the identifier refers to
preCommit makes all final modifications to the current transaction's data
commit handles the finalized transaction of pre-committed data
So I can't see why my program doesn't show this behaviour.
Here goes my sink code:
package PostgresConnector
import java.sql.{BatchUpdateException, DriverManager, PreparedStatement, SQLException, Timestamp}
import java.text.ParseException
import java.util.{Date, Properties, UUID}
import org.apache.flink.api.common.ExecutionConfig
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.sink.{SinkFunction, TwoPhaseCommitSinkFunction}
import org.apache.flink.streaming.api.scala._
import org.slf4j.{Logger, LoggerFactory}
class PostgreSink(props : Properties, config : ExecutionConfig) extends TwoPhaseCommitSinkFunction[(String,String,String,String),String,String](createTypeInformation[String].createSerializer(config),createTypeInformation[String].createSerializer(config)){
private var transactionMap : Map[String,Array[(String,String,String,String)]] = Map()
private var parsedQuery : PreparedStatement = _
private val insertionString : String = "INSERT INTO mydb (field1,field2,point) values (?,?,point(?,?))"
override def invoke(transaction: String, value: (String,String,String,String), context: SinkFunction.Context[_]): Unit = {
val LOG = LoggerFactory.getLogger(classOf[FlinkCEPClasses.FlinkCEPPipeline])
val res = this.transactionMap.get(transaction)
if(res.isDefined){
var array = res.get
array = array ++ Array(value)
this.transactionMap += (transaction -> array)
}else{
val array = Array(value)
this.transactionMap += (transaction -> array)
}
LOG.info("\n\nPassing through invoke\n\n")
()
}
override def beginTransaction(): String = {
val LOG: Logger = LoggerFactory.getLogger(classOf[FlinkCEPClasses.FlinkCEPPipeline])
val identifier = UUID.randomUUID.toString
LOG.info("\n\nPassing through beginTransaction\n\n")
identifier
}
override def preCommit(transaction: String): Unit = {
val LOG = LoggerFactory.getLogger(classOf[FlinkCEPClasses.FlinkCEPPipeline])
try{
val tuple : Option[Array[(String,String,String,String)]]= this.transactionMap.get(transaction)
if(tuple.isDefined){
tuple.get.foreach( (value : (String,String,String,String)) => {
LOG.info("\n\n"+value.toString()+"\n\n")
this.parsedQuery.setString(1,value._1)
this.parsedQuery.setString(2,value._2)
this.parsedQuery.setString(3,value._3)
this.parsedQuery.setString(4,value._4)
this.parsedQuery.addBatch()
})
}
}catch{
case e : SQLException =>
LOG.info("\n\nError when adding transaction to batch: SQLException\n\n")
case f : ParseException =>
LOG.info("\n\nError when adding transaction to batch: ParseException\n\n")
case g : NoSuchElementException =>
LOG.info("\n\nError when adding transaction to batch: NoSuchElementException\n\n")
case h : Exception =>
LOG.info("\n\nError when adding transaction to batch: Exception\n\n")
}
this.transactionMap = this.transactionMap.empty
LOG.info("\n\nPassing through preCommit...\n\n")
}
override def commit(transaction: String): Unit = {
val LOG : Logger = LoggerFactory.getLogger(classOf[FlinkCEPClasses.FlinkCEPPipeline])
if(this.parsedQuery != null) {
LOG.info("\n\n" + this.parsedQuery.toString+ "\n\n")
}
try{
this.parsedQuery.executeBatch
val LOG : Logger = LoggerFactory.getLogger(classOf[FlinkCEPClasses.FlinkCEPPipeline])
LOG.info("\n\nExecuting batch\n\n")
}catch{
case e : SQLException =>
val LOG : Logger = LoggerFactory.getLogger(classOf[FlinkCEPClasses.FlinkCEPPipeline])
LOG.info("\n\n"+"Error : SQLException"+"\n\n")
}
this.transactionMap = this.transactionMap.empty
LOG.info("\n\nPassing through commit...\n\n")
}
override def abort(transaction: String): Unit = {
val LOG : Logger = LoggerFactory.getLogger(classOf[FlinkCEPClasses.FlinkCEPPipeline])
this.transactionMap = this.transactionMap.empty
LOG.info("\n\nPassing through abort...\n\n")
}
override def open(parameters: Configuration): Unit = {
val LOG: Logger = LoggerFactory.getLogger(classOf[FlinkCEPClasses.FlinkCEPPipeline])
val driver = props.getProperty("driver")
val url = props.getProperty("url")
val user = props.getProperty("user")
val password = props.getProperty("password")
Class.forName(driver)
val connection = DriverManager.getConnection(url + "?user=" + user + "&password=" + password)
this.parsedQuery = connection.prepareStatement(insertionString)
LOG.info("\n\nConfiguring BD conection parameters\n\n")
}
}
And this is my main program:
package FlinkCEPClasses
import PostgresConnector.PostgreSink
import org.apache.flink.api.java.io.TextInputFormat
import org.apache.flink.api.java.utils.ParameterTool
import org.apache.flink.cep.PatternSelectFunction
import org.apache.flink.cep.pattern.conditions.SimpleCondition
import org.apache.flink.cep.scala.pattern.Pattern
import org.apache.flink.core.fs.{FileSystem, Path}
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.cep.scala.{CEP, PatternStream}
import org.apache.flink.streaming.api.functions.source.FileProcessingMode
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import java.util.Properties
import org.apache.flink.api.common.ExecutionConfig
import org.slf4j.{Logger, LoggerFactory}
class FlinkCEPPipeline {
val LOG: Logger = LoggerFactory.getLogger(classOf[FlinkCEPPipeline])
LOG.info("\n\nStarting the pipeline...\n\n")
var env : StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
env.enableCheckpointing(10)
env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime)
env.setParallelism(1)
//var input : DataStream[String] = env.readFile(new TextInputFormat(new Path("/home/luca/Desktop/lines")),"/home/luca/Desktop/lines",FileProcessingMode.PROCESS_CONTINUOUSLY,1)
var input : DataStream[String] = env.readTextFile("/home/luca/Desktop/lines").name("Raw stream")
var tupleStream : DataStream[(String,String,String,String)] = input.map(new S2PMapFunction()).name("Tuple Stream")
var properties : Properties = new Properties()
properties.setProperty("driver","org.postgresql.Driver")
properties.setProperty("url","jdbc:postgresql://localhost:5432/mydb")
properties.setProperty("user","luca")
properties.setProperty("password","root")
tupleStream.addSink(new PostgreSink(properties,env.getConfig)).name("Postgres Sink").setParallelism(1)
tupleStream.writeAsText("/home/luca/Desktop/output",FileSystem.WriteMode.OVERWRITE).name("File Sink").setParallelism(1)
env.execute()
}
My S2PMapFunction code:
package FlinkCEPClasses
import org.apache.flink.api.common.functions.MapFunction
case class S2PMapFunction() extends MapFunction[String,(String,String,String,String)] {
override def map(value: String): (String, String, String,String) = {
var tuple = value.replaceAllLiterally("(","").replaceAllLiterally(")","").split(',')
(tuple(0),tuple(1),tuple(2),tuple(3))
}
}
My pipeline works like this: I read lines from a file, map them to a tuple of strings, and use the data inside the tuples to save them in a Postgres DB
If you want to simulate the data, just create a file with lines in a format like this:
(field1,field2,pointx,pointy)
Edit
The execution order of the TwoPhaseCommitSinkFunction's methods is the following:
Starting pipeline...
beginTransaction
preCommit
beginTransaction
commit
invoke
invoke
invoke
invoke
invoke
invoke
invoke
invoke
invoke
invoke
invoke
invoke
invoke
invoke
invoke
invoke
invoke
invoke
abort

I'm not an expert on this topic, but a couple of guesses:
preCommit is called whenever Flink begins a checkpoint, and commit is called when the checkpoint is complete. These methods are called simply because checkpointing is happening, regardless of whether the sink has received any data.
Checkpointing is happening periodically, regardless of whether any data is flowing through your pipeline. Given your very short checkpointing interval (10 msec), it does seem plausible that the first checkpoint barrier will reach the sink before the source has managed to send it any data.
It also looks like you are assuming that only one transaction will be open at a time. I'm not sure that's strictly guaranteed, but so long as maxConcurrentCheckpoints is 1 (which is the default), you should be okay.
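The ordering described above can be reproduced with a toy model that has no Flink dependency. This is a minimal sketch, assuming the sink opens a transaction at start-up, pre-commits it and opens a new one when a checkpoint starts, commits the pending transactions when the checkpoint completes, and aborts the open transaction on shutdown. `ToySink` and its method names are hypothetical and only mimic the callbacks in the question's log.

```scala
import scala.collection.mutable.ListBuffer

// Toy model of the callback order seen in the question's log.
// ToySink is hypothetical; it is NOT Flink's TwoPhaseCommitSinkFunction.
object LifecycleSketch {
  final class ToySink {
    val log = ListBuffer.empty[String]
    private var nextId = 0
    private val pending = ListBuffer.empty[Int]
    // A transaction is opened as soon as the sink starts.
    private var current: Int = begin()

    private def begin(): Int = {
      nextId += 1
      log += s"beginTransaction($nextId)"
      nextId
    }

    // Records always go into whichever transaction is currently open.
    def invoke(): Unit = log += s"invoke($current)"

    // Checkpoint starts: pre-commit the open transaction, then open a new one.
    def checkpointStarted(): Unit = {
      log += s"preCommit($current)"
      pending += current
      current = begin()
    }

    // Checkpoint completes: commit everything that was pre-committed.
    def checkpointCompleted(): Unit = {
      pending.foreach(tx => log += s"commit($tx)")
      pending.clear()
    }

    // Shutdown: the transaction that is still open is aborted.
    def close(): Unit = log += s"abort($current)"
  }

  def run(): List[String] = {
    val sink = new ToySink
    sink.checkpointStarted()   // first barrier arrives before any record
    sink.checkpointCompleted()
    (1 to 3).foreach(_ => sink.invoke())
    sink.close()               // bounded input ends before the next checkpoint
    sink.log.toList
  }
}
```

In this model the records land in the second transaction, which is aborted because the input ends before another checkpoint completes. That matches the order shown in the question's edit, although the real class may differ in details.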

So, here goes the "answer" to this question. Just to be clear: at this moment, the problem with the TwoPhaseCommitSinkFunction hasn't been solved. If you're looking for a solution to the original problem, you should look for another answer. If you don't mind which sink you use, then maybe I can help you.
As suggested by @DavidAnderson, I started studying the Table API to see whether it could solve my problem, which was using Flink to insert lines into my database table.
It turned out to be really simple, as you'll see.
Note: Beware of the version you are using. My Flink version is 1.9.0.
Source code
package FlinkCEPClasses
import java.sql.Timestamp
import java.util.Properties
import org.apache.flink.api.common.typeinfo.{TypeInformation, Types}
import org.apache.flink.api.java.io.jdbc.JDBCAppendTableSink
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.table.api.{EnvironmentSettings, Table}
import org.apache.flink.table.api.scala.StreamTableEnvironment
import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.sinks.TableSink
import org.postgresql.Driver
class TableAPIPipeline {
// --- normal pipeline initialization in this block ---
var env : StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
env.enableCheckpointing(10)
env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime)
env.setParallelism(1)
var input : DataStream[String] = env.readTextFile("/home/luca/Desktop/lines").name("Original stream")
var tupleStream : DataStream[(String,Timestamp,Double,Double)] = input.map(new S2PlacaMapFunction()).name("Tuple Stream")
var properties : Properties = new Properties()
properties.setProperty("driver","org.postgresql.Driver")
properties.setProperty("url","jdbc:postgresql://localhost:5432/mydb")
properties.setProperty("user","myuser")
properties.setProperty("password","mypassword")
// --- normal pipeline initialization in this block END ---
// These two lines create what Flink calls StreamTableEnvironment.
// It seems pretty similar to a normal stream initialization.
val settings = EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build()
val tableEnv = StreamTableEnvironment.create(env,settings)
// Since I wanted to sink data into a database, I used the JDBC TableSink,
// because it is very intuitive and is an exact match for my need. You may
// look for other TableSink classes that fit your solution better.
var tableSink : JDBCAppendTableSink = JDBCAppendTableSink.builder()
.setBatchSize(1)
.setDBUrl("jdbc:postgresql://localhost:5432/mydb")
.setDrivername("org.postgresql.Driver")
.setPassword("mypassword")
.setUsername("myuser")
.setQuery("INSERT INTO mytable (data1,data2,data3) VALUES (?,?,point(?,?))")
.setParameterTypes(Types.STRING,Types.SQL_TIMESTAMP,Types.DOUBLE,Types.DOUBLE)
.build()
val fieldNames = Array("data1","data2","data3","data4")
val fieldTypes = Array[TypeInformation[_]](Types.STRING,Types.SQL_TIMESTAMP,Types.DOUBLE, Types.DOUBLE)
// This is the crucial part of the code: first, you need to register
// your table sink, providing the name, the field names, the field types and
// the TableSink object.
tableEnv.registerTableSink("postgres-table-sink",
fieldNames,
fieldTypes,
tableSink
)
// Then, you transform your DataStream into a Table object.
var table = tableEnv.fromDataStream(tupleStream)
// Finally, you insert your stream data into the registered sink.
table.insertInto("postgres-table-sink")
env.execute()
}


Apache Beam stops processing Pub/Sub messages after some time

I'm trying to write a simple Apache Beam pipeline (which will run on the Dataflow runner) to do the following:
Read PubSub messages containing file paths on GCS from a subscription.
For each message, read the data contained in the file associated with the message (the files can be in a variety of formats: csv, jsonl, json, xml, ...).
Do some processing on each record.
Write back the result on GCS.
I'm using a 10-second fixed window on the messages. Since incoming files are already chunked (max size of 10MB), I decided not to use splittable DoFns to read the files, in order to avoid adding useless complexity (especially for files that are not trivially splittable into chunks).
Here is a simplified code sample that reproduces exactly the same problem as the full one:
package skytv.ingester
import java.io.{BufferedReader, InputStreamReader}
import java.nio.charset.StandardCharsets
import kantan.csv.rfc
import org.apache.beam.sdk.Pipeline
import org.apache.beam.sdk.io.{Compression, FileIO, FileSystems, TextIO, WriteFilesResult}
import org.apache.beam.sdk.io.gcp.pubsub.{PubsubIO, PubsubMessage}
import org.apache.beam.sdk.options.PipelineOptionsFactory
import org.apache.beam.sdk.transforms.DoFn.ProcessElement
import org.apache.beam.sdk.transforms.windowing.{BoundedWindow, FixedWindows, PaneInfo, Window}
import org.apache.beam.sdk.transforms.{Contextful, DoFn, MapElements, PTransform, ParDo, SerializableFunction, SimpleFunction, WithTimestamps}
import org.apache.beam.sdk.values.{KV, PCollection}
import org.joda.time.{Duration, Instant}
import skytv.cloudstorage.CloudStorageClient
import skytv.common.Closeable
import kantan.csv.ops._
import org.apache.beam.sdk.io.FileIO.{Sink, Write}
class FileReader extends DoFn[String, List[String]] {
private def getFileReader(filePath: String) = {
val cloudStorageClient = new CloudStorageClient()
val inputStream = cloudStorageClient.getInputStream(filePath)
val isr = new InputStreamReader(inputStream, StandardCharsets.UTF_8)
new BufferedReader(isr)
}
private def getRowsIterator(fileReader: BufferedReader) = {
fileReader
.asUnsafeCsvReader[Seq[String]](rfc
.withCellSeparator(',')
.withoutHeader
.withQuote('"'))
.toIterator
}
@ProcessElement
def processElement(c: ProcessContext): Unit = {
val filePath = c.element()
Closeable.tryWithResources(
getFileReader(filePath)
) {
fileReader => {
getRowsIterator(fileReader)
.foreach(record => c.output(record.toList))
}
}
}
}
class DataWriter(tempFolder: String) extends PTransform[PCollection[List[String]], WriteFilesResult[String]] {
private val convertRecord = Contextful.fn[List[String], String]((dr: List[String]) => {
dr.mkString(",")
})
private val getSink = Contextful.fn[String, Sink[String]]((destinationKey: String) => {
TextIO.sink()
})
private val getPartitioningKey = new SerializableFunction[List[String], String] {
override def apply(input: List[String]): String = {
input.head
}
}
private val getNaming = Contextful.fn[String, Write.FileNaming]((destinationKey: String) => {
new Write.FileNaming {
override def getFilename(
window: BoundedWindow,
pane: PaneInfo,
numShards: Int,
shardIndex: Int,
compression: Compression
): String = {
s"$destinationKey-${window.maxTimestamp()}-${pane.getIndex}.csv"
}
}
})
override def expand(input: PCollection[List[String]]): WriteFilesResult[String] = {
val fileWritingTransform = FileIO
.writeDynamic[String, List[String]]()
.by(getPartitioningKey)
.withDestinationCoder(input.getPipeline.getCoderRegistry.getCoder(classOf[String]))
.withTempDirectory(tempFolder)
.via(convertRecord, getSink)
.withNaming(getNaming)
.withNumShards(1)
input
.apply("WriteToAvro", fileWritingTransform)
}
}
object EnhancedIngesterScalaSimplified {
private val SUBSCRIPTION_NAME = "projects/<project>/subscriptions/<subscription>"
private val TMP_LOCATION = "gs://<path>"
private val WINDOW_SIZE = Duration.standardSeconds(10)
def main(args: Array[String]): Unit = {
val options = PipelineOptionsFactory.fromArgs(args: _*).withValidation().create()
FileSystems.setDefaultPipelineOptions(options)
val p = Pipeline.create(options)
val messages = p
.apply("ReadMessages", PubsubIO.readMessagesWithAttributes.fromSubscription(SUBSCRIPTION_NAME))
// .apply("AddProcessingTimeTimestamp", WithTimestamps.of(new SerializableFunction[PubsubMessage, Instant] {
// override def apply(input: PubsubMessage): Instant = Instant.now()
// }))
val parsedMessages = messages
.apply("ApplyWindow", Window.into[PubsubMessage](FixedWindows.of(WINDOW_SIZE)))
.apply("ParseMessages", MapElements.via(new SimpleFunction[PubsubMessage, String]() {
override def apply(msg: PubsubMessage): String = new String(msg.getPayload, StandardCharsets.UTF_8)
}))
val dataReadResult = parsedMessages
.apply("ReadData", ParDo.of(new FileReader))
val writerResult = dataReadResult.apply(
"WriteData",
new DataWriter(TMP_LOCATION)
)
writerResult.getPerDestinationOutputFilenames.apply(
"FilesWritten",
MapElements.via(new SimpleFunction[KV[String, String], String]() {
override def apply(input: KV[String, String]): String = {
println(s"Written ${input.getKey}, ${input.getValue}")
input.getValue
}
}))
p.run.waitUntilFinish()
}
}
The problem is that after processing some messages (on the order of 1000), the job stops processing new messages, and they remain unacknowledged in the Pub/Sub subscription forever.
I saw that in this situation the watermark stops advancing and the data freshness increases linearly without bound.
Here is a screenshot from Dataflow:
And here is the situation on the Pub/Sub queue:
Is it possible that some messages remain stuck in the Dataflow queues and fill them up, so that no new messages can be added?
I thought there was a problem with how PubsubIO computes timestamps, so I tried forcing the timestamps to equal the processing time of each message, but without success.
If I leave the Dataflow job in this state, it seems to continuously reprocess the same messages without writing any data to storage.
Do you have any idea how to solve this problem?
Thanks!
Most likely the pipeline has hit an error while processing one or more elements (this shouldn't have anything to do with how PubsubIO computes timestamps). That stops the watermark from advancing, because Dataflow retries the failed work again and again.
You can check the logs for failures, specifically from the worker or harness components. An unhandled runtime exception, such as a parse error, is very likely the root cause of a streaming pipeline getting stuck.
If there is no UserCodeException, the issue is likely caused by the Dataflow backend, and you can reach out to Dataflow customer support so engineers can look into it for your pipeline.
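One way to keep a single bad record from being retried forever is to catch failures per element and set the rejected input aside instead of letting the exception escape. A minimal sketch in plain Scala, with no Beam dependency; `parseRow`, `process` and the three-cell rule are hypothetical stand-ins for the per-record work done in `processElement`.

```scala
import scala.util.{Failure, Success, Try}

object DeadLetterSketch {
  // Hypothetical per-record step: expects exactly three comma-separated cells
  // and throws IllegalArgumentException otherwise.
  def parseRow(line: String): List[String] = {
    val cells = line.split(",", -1).toList
    require(cells.length == 3, s"expected 3 cells, got ${cells.length}")
    cells
  }

  // Returns (successfully parsed rows, rejected input paired with the error message).
  def process(lines: Seq[String]): (Seq[List[String]], Seq[(String, String)]) = {
    val results = lines.map(line => line -> Try(parseRow(line)))
    val good = results.collect { case (_, Success(row)) => row }
    val bad = results.collect { case (line, Failure(e)) => line -> e.getMessage }
    (good, bad)
  }
}
```

In a Beam pipeline the same idea is usually expressed by emitting rejected elements to an additional output, so that they can be inspected later while the rest of the bundle succeeds.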

How to fetch data from MongoDB in Scala

I wrote the following code to fetch data from MongoDB:
import com.typesafe.config.ConfigFactory
import org.mongodb.scala.{ Document, MongoClient, MongoCollection, MongoDatabase }
import scala.concurrent.ExecutionContext
object MongoService extends Service {
val conf = ConfigFactory.load()
implicit val mongoService: MongoClient = MongoClient(conf.getString("mongo.url"))
implicit val mongoDB: MongoDatabase = mongoService.getDatabase(conf.getString("mongo.db"))
implicit val ec: ExecutionContext = ExecutionContext.global
def getAllDocumentsFromCollection(collection: String) = {
mongoDB.getCollection(collection).find()
}
}
But when I call getAllDocumentsFromCollection, I don't get the individual documents for further manipulation. Instead I get
FindObservable(com.mongodb.async.client.FindIterableImpl@23555cf5)
UPDATED:
object MongoService {
// My settings (see available connection options)
val mongoUri = "mongodb://localhost:27017/smsto?authMode=scram-sha1"
import ExecutionContext.Implicits.global // use any appropriate context
// Connect to the database: Must be done only once per application
val driver = MongoDriver()
val parsedUri = MongoConnection.parseURI(mongoUri)
val connection = parsedUri.map(driver.connection(_))
// Database and collections: Get references
val futureConnection = Future.fromTry(connection)
def db1: Future[DefaultDB] = futureConnection.flatMap(_.database("smsto"))
def personCollection = db1.map(_.collection("person"))
// Write Documents: insert or update
implicit def personWriter: BSONDocumentWriter[Person] = Macros.writer[Person]
// or provide a custom one
def createPerson(person: Person): Future[Unit] =
personCollection.flatMap(_.insert(person).map(_ => {})) // use personWriter
def getAll(collection: String) =
db1.map(_.collection(collection))
// Custom persistent types
case class Person(firstName: String, lastName: String, age: Int)
}
I also tried reactivemongo with the code above, but I couldn't make getAll work, and I get an error in createPerson.
Please suggest how I can get all the data from a collection.
This is likely too late for the OP, but hopefully the following methods of retrieving and iterating over collections using the mongo-scala driver will prove useful to others.
The Asynchronous Way - Iterating over documents asynchronously means you won't have to store an entire collection in-memory, which can become unreasonable for large collections. However, you won't have access to all your documents outside the subscribe code block for reuse. I'd recommend doing things asynchronously if you can, since this is how the mongo-scala driver was intended to be used.
db.getCollection(collectionName).find().subscribe(
(doc: org.mongodb.scala.bson.Document) => {
// operate on an individual document here
},
(e: Throwable) => {
// do something with errors here, if desired
},
() => {
// this signifies that you've reached the end of your collection
}
)
The "Synchronous" Way - This is a pattern I use when my use-case calls for a synchronous solution, and I'm working with smaller collections or result-sets. It still uses the asynchronous mongo-scala driver, but it returns a list of documents and blocks downstream code execution until all documents are returned. Handling errors and timeouts may depend on your use case.
import org.mongodb.scala._
import org.mongodb.scala.bson.Document
import org.mongodb.scala.model.Filters
import scala.collection.mutable.ListBuffer
/* This function optionally takes filters if you do not wish to return the entire collection.
* You could extend it to take other optional query params, such as org.mongodb.scala.model.{Sorts, Projections, Aggregates}
*/
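def getDocsSync(db: MongoDatabase, collectionName: String, filters: Option[bson.conversions.Bson]): ListBuffer[Document] = {
val docs = scala.collection.mutable.ListBuffer[Document]()
// @volatile so the polling loop below sees updates made on the driver's callback thread
@volatile var processing = true
@volatile var failure: Option[Throwable] = None
val query = filters match {
case Some(f) => db.getCollection(collectionName).find(f)
case None => db.getCollection(collectionName).find()
}
query.subscribe(
(doc: Document) => docs.append(doc), // add doc to mutable list
(e: Throwable) => { failure = Some(e); processing = false }, // record the error and stop waiting
() => processing = false
)
while (processing) {
Thread.sleep(100) // wait here until all docs have been returned
}
failure.foreach(e => throw e) // rethrow on the calling thread instead of the callback thread
docs
}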
// sample usage of 'synchronous' method
val client: MongoClient = MongoClient(uriString)
val db: MongoDatabase = client.getDatabase(dbName)
val allDocs = getDocsSync(db, "myCollection", Option.empty)
val someDocs = getDocsSync(db, "myCollection", Option(Filters.eq("fieldName", "foo")))
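The polling loop can also be replaced by a `Promise` that the callbacks complete, so the caller blocks with a timeout instead of sleeping. A minimal sketch in plain Scala; `fakeSubscribe` is a hypothetical stand-in for an observable's `subscribe(onNext, onError, onComplete)` and is not part of the MongoDB driver.

```scala
import scala.collection.mutable.ListBuffer
import scala.concurrent.duration._
import scala.concurrent.{Await, ExecutionContext, Future, Promise}
import scala.util.control.NonFatal

object CollectSketch {
  // Hypothetical callback source: emits the items on another thread, then completes.
  def fakeSubscribe[A](items: Seq[A])(
      onNext: A => Unit,
      onError: Throwable => Unit,
      onComplete: () => Unit
  )(implicit ec: ExecutionContext): Unit =
    Future {
      try { items.foreach(onNext); onComplete() }
      catch { case NonFatal(e) => onError(e) }
    }

  // Collects every emitted item, or fails with the source's error or a timeout.
  def collectAll[A](items: Seq[A], timeout: FiniteDuration)(
      implicit ec: ExecutionContext
  ): Seq[A] = {
    val buffer = ListBuffer.empty[A]
    val done = Promise[Seq[A]]()
    fakeSubscribe(items)(
      a => buffer.synchronized { buffer += a },
      e => done.tryFailure(e),
      () => done.trySuccess(buffer.synchronized(buffer.toList))
    )
    // Blocks the caller; throws if the promise failed or the timeout elapsed.
    Await.result(done.future, timeout)
  }
}
```

With this shape an error from the source surfaces on the calling thread, and a stalled query ends in a timeout rather than an endless wait. If the driver version in use offers `toFuture()` on its observables, that should serve the same purpose with less code.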

Spark Jupyter notebook does not show Scala console output

1) I am learning streaming and ran into a problem: nothing shows up on the (Scala) console from the println inside sendEvent. I then tried inserting println("xyz") lines and found that they are only printed when they are not inside the while block; otherwise nothing is printed, even for a println placed before the while loop. I added a few more println("xyz") lines and found that some of them get swallowed, and only the last one is printed.
Previously I also ran into this twice with two different pieces of Storm streaming code: nothing was printed in the Jupyter notebook, but everything worked fine in the Scala shell.
2) I also wonder about calls to awaitTermination(), such as:
messages.writeStream.outputMode("append").format("console").option("truncate", false).start().awaitTermination() (I also get no console output from this)
or "infinite loops" like the one in the code below:
var finished = false
while (!finished) {................. ..}
Are they waiting for a hard break such as a halt or Ctrl+C? How do I break out of them properly so that the next line gets executed? I am confused because the authors of the samples and tutorials explain nothing about this.
import java.util._
import scala.collection.JavaConverters._
import java.util.concurrent._
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.eventhubs.ConnectionStringBuilder
// Event hub configurations
// Replace values below with yours
val eventHubName = "<Event hub name>"
val eventHubNSConnStr = "<Event hub namespace connection string>"
val connStr = ConnectionStringBuilder (eventHubNSConnStr)
.setEventHubName(eventHubName).build
import com.microsoft.azure.eventhubs._
val pool = Executors.newFixedThreadPool(1)
val eventHubClient = EventHubClient.create(connStr.toString(), pool)
def sendEvent(message: String) = {
val messageData = EventData.create(message.getBytes("UTF-8"))
eventHubClient.get().send(messageData)
println("Sent event: " + message + "\n")
}
import twitter4j._
import twitter4j.TwitterFactory
import twitter4j.Twitter
import twitter4j.conf.ConfigurationBuilder
// Twitter application configurations
// Replace values below with yours
val twitterConsumerKey = "<CONSUMER KEY>"
val twitterConsumerSecret = "<CONSUMER SECRET>"
val twitterOauthAccessToken = "<ACCESS TOKEN>"
val twitterOauthTokenSecret = "<TOKEN SECRET>"
val cb = new ConfigurationBuilder()
cb.setDebugEnabled(true)
.setOAuthConsumerKey(twitterConsumerKey)
.setOAuthConsumerSecret(twitterConsumerSecret)
.setOAuthAccessToken(twitterOauthAccessToken)
.setOAuthAccessTokenSecret(twitterOauthTokenSecret)
val twitterFactory = new TwitterFactory(cb.build())
val twitter = twitterFactory.getInstance()
//Getting tweets with keyword "Azure" and sending them to Event Hub realtime
val query = new Query(" #Azure ")
query.setCount(100)
query.lang("en")
var finished = false
while (!finished) {
val result = twitter.search(query)
val statuses = result.getTweets()
var lowestStatusId = Long.MaxValue
for (status <- statuses.asScala) {
if(!status.isRetweet()){
sendEvent(status.getText())
}
lowestStatusId = Math.min(status.getId(), lowestStatusId)
Thread.sleep(2000)
}
query.setMaxId(lowestStatusId - 1)
}
// Closing connection to the Event Hub
eventHubClient.get().close()

the generation of parse tree of StanfordCoreNLP is stuck

When I use StanfordCoreNLP to generate parse trees for a large dataset on Spark, one of the tasks gets stuck for a long time. I looked at the stack trace, and it shows the following:
at edu.stanford.nlp.ling.CoreLabel.<init>(CoreLabel.java:68)
  at edu.stanford.nlp.ling.CoreLabel$CoreLabelFactory.newLabel(CoreLabel.java:248)
  at edu.stanford.nlp.trees.LabeledScoredTreeFactory.newLeaf(LabeledScoredTreeFactory.java:51)
  at edu.stanford.nlp.parser.lexparser.Debinarizer.transformTreeHelper(Debinarizer.java:27)
  at edu.stanford.nlp.parser.lexparser.Debinarizer.transformTreeHelper(Debinarizer.java:34)
  at edu.stanford.nlp.parser.lexparser.Debinarizer.transformTreeHelper(Debinarizer.java:34)
  at edu.stanford.nlp.parser.lexparser.Debinarizer.transformTreeHelper(Debinarizer.java:34)
  at edu.stanford.nlp.parser.lexparser.Debinarizer.transformTreeHelper(Debinarizer.java:34)
I think the relevant code is as follows:
import edu.stanford.nlp.pipeline.Annotation
import edu.stanford.nlp.pipeline.StanfordCoreNLP
import java.util.Properties
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation
import edu.stanford.nlp.trees.TreeCoreAnnotations.TreeAnnotation
import edu.stanford.nlp.util.CoreMap
import scala.collection.JavaConversions._
object CoreNLP {
def transform(Content: String): String = {
val v = new CoreNLP
v.runEnglishAnnotators(Content);
v.runChineseAnnotators(Content)
}
}
class CoreNLP {
def runEnglishAnnotators(inputContent: String): String = {
var document = new Annotation(inputContent)
val props = new Properties
props.setProperty("annotators", "tokenize, ssplit, parse")
val coreNLP = new StanfordCoreNLP(props)
coreNLP.annotate(document)
parserOutput(document)
}
def runChineseAnnotators(inputContent: String): String = {
var document = new Annotation(inputContent)
val props = new Properties
val corenlp = new StanfordCoreNLP("StanfordCoreNLP-chinese.properties")
corenlp.annotate(document)
parserOutput(document)
}
def parserOutput(document: Annotation): String = {
val sentences = document.get(classOf[SentencesAnnotation])
var result = ""
for (sentence: CoreMap <- sentences) {
val tree = sentence.get(classOf[TreeAnnotation])
//output the tree to file
result = result + "\n" + tree.toString
}
result
}
}
My classmate said that the test data is recursive and that the parser therefore runs endlessly. I don't know whether that's true.
If you add props.setProperty("parse.maxlen", "100"); to your code that will set the parser to not parse sentences longer than 100 tokens. That can help prevent crash issues. You should experiment with the best max sentence length for your application.
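Applied to the `Properties` built in `runEnglishAnnotators`, the suggestion looks like the sketch below. Only `java.util.Properties` is used so that the snippet stands alone; `englishProps` is a hypothetical helper name, and the limit of 100 is the answer's example value, not a tuned recommendation.

```scala
import java.util.Properties

object ParserPropsSketch {
  // Builds the annotator properties with a cap on the sentence length
  // the parser will attempt (hypothetical helper, not part of CoreNLP).
  def englishProps(maxLen: Int): Properties = {
    val props = new Properties()
    props.setProperty("annotators", "tokenize, ssplit, parse")
    props.setProperty("parse.maxlen", maxLen.toString)
    props
  }
}
```

The result would be passed to `new StanfordCoreNLP(props)` in the same way as in the question's code.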

Scala Spark Shared connection across all nodes

I want to write some data from a Spark DataFrame to Postgres. After searching Stack Overflow, I found that the easiest way is to open a connection each time I use my prepared statement, and this works fine. But I want to share a variable holding the connection across all nodes.
I took some code from here: https://www.nicolaferraro.me/2016/02/22/using-non-serializable-objects-in-apache-spark/ and wrote this:
class SharedVariable[T](constructor: => T) extends AnyRef with Serializable {
@transient private lazy val instance: T = constructor
def get = instance
}
object SharedVariable {
def apply[T](constructor: => T): SharedVariable[T] = new SharedVariable[T](constructor)}
val dsv = SharedVariable {
val ds = new BasicDataSource()
ds.setDriverClassName("org.postgresql.Driver")
ds.setUrl("jdbc:postgresql://...")
ds.setUsername("user")
ds.setPassword("pass")
ds}
But I get an error:
error: reference to SharedVariable is ambiguous;
it is imported twice in the same scope by
import $line1031857785174$read.SharedVariable
and import INSTANCE.SharedVariable
val dsv = SharedVariable {
Can someone help me?