Apache Flink - Unable to get data From Twitter - scala

I'm trying to get some messages with Twitter Streaming API using Apache Flink.
But my code is not writing anything to the output file. I'm trying to count the occurrences of specific words in the input data.
Please check my example:
import java.util.Properties
import org.apache.flink.api.scala._
import org.apache.flink.streaming.connectors.twitter._
import org.apache.flink.api.java.utils.ParameterTool
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import com.twitter.hbc.core.endpoint.{Location, StatusesFilterEndpoint, StreamingEndpoint}
import org.apache.flink.streaming.api.windowing.time.Time
import scala.collection.JavaConverters._
//////////////////////////////////////////////////////
// Create an Endpoint to Track our terms
class myFilterEndpoint extends TwitterSource.EndpointInitializer with Serializable {
override def createEndpoint(): StreamingEndpoint = {
//val chicago = new Location(new Location.Coordinate(-86.0, 41.0), new Location.Coordinate(-87.0, 42.0))
val endpoint = new StatusesFilterEndpoint()
//endpoint.locations(List(chicago).asJava)
endpoint.trackTerms(List("odebrecht", "lava", "jato").asJava)
endpoint
}
}
object Connection {
def main(args: Array[String]): Unit = {
val props = new Properties()
val params: ParameterTool = ParameterTool.fromArgs(args)
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.getConfig.setGlobalJobParameters(params)
env.setParallelism(params.getInt("parallelism", 1))
props.setProperty(TwitterSource.CONSUMER_KEY, params.get("consumer-key"))
props.setProperty(TwitterSource.CONSUMER_SECRET, params.get("consumer-secret"))
props.setProperty(TwitterSource.TOKEN, params.get("token"))
props.setProperty(TwitterSource.TOKEN_SECRET, params.get("token-secret"))
val source = new TwitterSource(props)
val epInit = new myFilterEndpoint()
source.setCustomEndpointInitializer(epInit)
val streamSource = env.addSource(source)
streamSource.map(s => (0, 1))
.keyBy(0)
.timeWindow(Time.minutes(2), Time.seconds(30))
.sum(1)
.map(t => t._2)
.writeAsText(params.get("output"))
env.execute("Twitter Count")
}
}
The point is, I get no error message, and in my dashboard I can see that my source is sending data to my TriggerWindow, but the TriggerWindow is not receiving any data.
I have two questions in one.
First: why is my source sending bytes to my TriggerWindow if the window has not received anything?
Second: is there something wrong with my code that prevents me from getting data from Twitter?

Your application source did not send actual records to the window, which you can see by looking at the Records sent column. The bytes that are sent belong to control messages which Flink sends from time to time between tasks. More specifically, it is the LatencyMarker message, which is used to measure the end-to-end latency of a Flink job.
The code looks good to me. I even tried out your code and it worked for me. Thus, I conclude that there has to be something wrong with the Twitter connection credentials. Please re-check whether you've entered the right credentials.
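One quick way to rule out missing arguments is to fail fast before building the source. This is a minimal sketch; the parameter names are assumptions matching the ParameterTool keys used in the code above:
// abort early if any credential argument was not passed on the command line
val required = Seq("consumer-key", "consumer-secret", "token", "token-secret")
val missing = required.filterNot(key => params.has(key))
require(missing.isEmpty, s"Missing Twitter credentials: ${missing.mkString(", ")}")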

Related

What's the best way to execute "not transformation" actions on elements of a Dataset

I'm new to Spark, and I'm looking for a way to execute actions on all elements of a Dataset with Spark Structured Streaming.
I know this is a specific use case: I want to iterate through all elements of the Dataset, perform an action on each, and then continue to work with the Dataset.
Example:
Given a val df: Dataset[Person], I would like to be able to do something like:
def execute(df: Dataset[Person]): Dataset[Person] = {
df.foreach((p: Person) => {
someHttpClient.doRequest(httpPostRequest(p.asString)) // this is pseudo code / not compiling
})
df
}
Unfortunately, foreach is not available with Structured Streaming, since I get the error "Queries with streaming sources must be executed with writeStream.start()".
I tried to use map(), but then the error "Task not serializable" occurred; I think this is because the HTTP request, or the HTTP client, is not serializable.
I know Spark is mostly used to filter and transform, but is there a way to handle this specific use case well?
Thanks :)
val conf = new SparkConf().setMaster("local[*]").setAppName("Example")
val jssc = new JavaStreamingContext(conf, Durations.seconds(1)) // the second argument is the time interval at which streaming data will be divided into batches
Before concluding whether a solution exists or not, let's ask a few questions.
How does Spark Streaming work?
Spark Streaming receives live input data streams from an input source and divides the data into batches, which are then processed by the Spark engine; the final batch results are pushed down to downstream applications.
How does the batch execution start?
Spark evaluates all transformations applied to a DStream lazily; they are only executed once an action runs, i.e. when you start the streaming context:
jssc.start(); // Start the computation
jssc.awaitTermination(); // Wait for the computation to terminate.
Note: each batch of a DStream contains multiple partitions (it is just like running a sequence of Spark batch jobs until the input source stops producing data).
So you can have custom logic like the one below.
dStream.foreachRDD(new VoidFunction[JavaRDD[Object]] {
override def call(t: JavaRDD[Object]): Unit = {
t.foreach(new VoidFunction[Object] {
override def call(t: Object): Unit = {
//pseudo code someHttpClient.doRequest(httpPostRequest(t.asString))
}
})
}
})
But again, make sure your someHttpClient is serializable, or create that object inside the call, as shown below.
dStream.foreachRDD(new VoidFunction[JavaRDD[Object]] {
override def call(t: JavaRDD[Object]): Unit = {
// create someHttpClient object
t.foreach(new VoidFunction[Object] {
override def call(t: Object): Unit = {
//pseudo code someHttpClient.doRequest(httpPostRequest(t.asString))
}
})
}
})
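The same idea can also be written with the Scala DStream API, building the (non-serializable) HTTP client once per partition on the executor. This is a minimal sketch; Apache HttpClient is used here purely as an example, and the request itself stays pseudo code:
import org.apache.http.impl.client.HttpClientBuilder

dStream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // build the client on the executor, inside the partition, so it is never serialized
    val someHttpClient = HttpClientBuilder.create().build()
    records.foreach { record =>
      // pseudo code, as above: someHttpClient.doRequest(httpPostRequest(record.toString))
    }
    someHttpClient.close()
  }
}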
Related to Spark Structured Streaming
import org.apache.spark.api.java.function.ForeachFunction
import org.apache.spark.sql._
import org.apache.spark.sql.streaming.StreamingQuery
val spark = SparkSession
.builder()
.appName("example")
.getOrCreate();
val lines = spark.readStream.format("socket").option("host", "localhost").option("port", 9999).load(); // this is example source load copied from spark-streaming doc
lines.foreach(new ForeachFunction[Row] {
override def call(t: Row): Unit = {
// someHttpClient.doRequest(httpPostRequest(t.asString))
// OR
// create the someHttpClient object here and use it to tackle serialization errors
}
})
// Start running the query; specify your downstream sink below
val query = lines.writeStream.start
query.awaitTermination()
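For Structured Streaming specifically, the supported hook for per-record side effects is writeStream.foreach with a ForeachWriter, whose open/process/close callbacks let you create the HTTP client on the executor so it never has to be serialized. A minimal sketch, again with the actual request left as pseudo code and Apache HttpClient assumed only as an example:
import org.apache.spark.sql.{ForeachWriter, Row}
import org.apache.http.impl.client.{CloseableHttpClient, HttpClientBuilder}

val query = lines.writeStream
  .foreach(new ForeachWriter[Row] {
    var client: CloseableHttpClient = _

    override def open(partitionId: Long, epochId: Long): Boolean = {
      client = HttpClientBuilder.create().build() // created per partition, on the executor
      true // returning true means: process this partition
    }

    override def process(row: Row): Unit = {
      // pseudo code: client.execute(httpPostRequest(row.mkString))
    }

    override def close(errorOrNull: Throwable): Unit = {
      if (client != null) client.close()
    }
  })
  .start()
query.awaitTermination()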

Making HTTP post requests on Spark using foreachPartition

I need some help understanding the behaviour of the code below in Spark (using Scala and Databricks).
I have a dataframe (read from S3, if that matters) and I would like to send that data by making HTTP POST requests in batches of at most 1000 records. So I repartitioned the dataframe to make sure each partition has no more than 1000 records. I also created a json column for each line (so I only need to put them in an array later on).
The trouble is in making the requests. I created a Serializable object using the following code:
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.http.client.methods.HttpPost
import org.apache.http.impl.client.HttpClientBuilder
import org.apache.http.HttpHeaders
import org.apache.http.entity.StringEntity
import org.apache.commons.io.IOUtils
object postObject extends Serializable{
val client = HttpClientBuilder.create().build()
val post = new HttpPost("https://my-cool-api-endpoint")
post.addHeader(HttpHeaders.CONTENT_TYPE,"application/json")
def makeHttpCall(row: Iterator[Row]) = {
val json_str = """{"people": [""" + row.toSeq.map(x => x.getAs[String]("json")).mkString(",") + "]}"
post.setEntity(new StringEntity(json_str))
val response = client.execute(post)
val entity = response.getEntity()
println(Seq(response.getStatusLine.getStatusCode(), response.getStatusLine.getReasonPhrase()))
println(IOUtils.toString(entity.getContent()))
}
}
Now when I try the following:
postObject.makeHttpCall(data.head(2).toIterator)
It works like a charm. The requests go through, there is some output on the screen, and my API gets that data.
But when I try to put it in the foreachPartition:
data.foreachPartition { x =>
postObject.makeHttpCall(x)
}
Nothing happens. No output on screen, and nothing arrives at my API. If I try to rerun it, almost all stages are just skipped. I believe that, for some reason, it is just lazily evaluating my requests but never actually performing them. I don't understand why, or how to force it.
postObject has two fields, client and post, which have to be serialized.
I'm not sure that client is serialized properly, and the post object is potentially mutated from several partitions (on the same worker). So many things could go wrong here.
I propose trying to remove postObject and inlining its body into foreachPartition directly.
Addition:
I tried to run it myself:
sc.parallelize((1 to 10).toList).foreachPartition(row => {
val client = HttpClientBuilder.create().build()
val post = new HttpPost("https://google.com")
post.addHeader(HttpHeaders.CONTENT_TYPE,"application/json")
val json_str = """{"people": [""" + row.toSeq.map(x => x.toString).mkString(",") + "]}"
post.setEntity(new StringEntity(json_str))
val response = client.execute(post)
val entity = response.getEntity()
println(Seq(response.getStatusLine.getStatusCode(), response.getStatusLine.getReasonPhrase()))
println(IOUtils.toString(entity.getContent()))
})
I ran it both locally and on a cluster.
It completes successfully and prints 405 errors to the worker logs, so the requests definitely hit the server.
foreachPartition returns nothing as the result. To debug your issue you can change it to mapPartitions:
val responseCodes = sc.parallelize((1 to 10).toList).mapPartitions(row => {
val client = HttpClientBuilder.create().build()
val post = new HttpPost("https://google.com")
post.addHeader(HttpHeaders.CONTENT_TYPE,"application/json")
val json_str = """{"people": [""" + row.toSeq.map(x => x.toString).mkString(",") + "]}"
post.setEntity(new StringEntity(json_str))
val response = client.execute(post)
val entity = response.getEntity()
println(Seq(response.getStatusLine.getStatusCode(), response.getStatusLine.getReasonPhrase()))
println(IOUtils.toString(entity.getContent()))
Iterator.single(response.getStatusLine.getStatusCode)
}).collect()
println(responseCodes.mkString(", "))
This code returns the list of response codes so you can analyze it.
For me it prints 405, 405 as expected.
There is a way to do this without having to find out what exactly is not serializable. If you want to keep the structure of your code, you can make all fields @transient lazy val. Also, any call with side effects should be wrapped in a block. For example
val post = {
val httpPost = new HttpPost("https://my-cool-api-endpoint")
httpPost.addHeader(HttpHeaders.CONTENT_TYPE,"application/json")
httpPost
}
That will delay the initialization of all fields until they are used by the workers. Each worker will have its own instance of the object, and you will be able to invoke the makeHttpCall method.
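Putting both suggestions together, here is a sketch of the same object with both fields deferred, reusing the endpoint, header, and imports from the question (whether sharing one HttpPost instance across partitions is safe is a separate concern):
object postObject extends Serializable {
  // @transient + lazy: not serialized with the closure, rebuilt on first use on each worker
  @transient lazy val client = HttpClientBuilder.create().build()
  @transient lazy val post = {
    val httpPost = new HttpPost("https://my-cool-api-endpoint")
    httpPost.addHeader(HttpHeaders.CONTENT_TYPE, "application/json")
    httpPost
  }

  def makeHttpCall(rows: Iterator[Row]): Unit = {
    val json_str = """{"people": [""" + rows.map(_.getAs[String]("json")).mkString(",") + "]}"
    post.setEntity(new StringEntity(json_str))
    val response = client.execute(post)
    println(response.getStatusLine.getStatusCode)
  }
}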

spark jupyter notebook does not show scala console output

1) I am learning streaming and I ran into the problem that nothing shows up on the (Scala) console from the println inside sendEvent. I then tried inserting lines such as println("xyz") and found that they only get printed if they are not inside the while loop; otherwise they are not printed, even when placed before the loop. I placed a few more of those println("xyz") lines and found that some get swallowed and only the last one gets printed.
Previously I ran into this twice with two different pieces of Storm streaming code: nothing got printed from the Jupyter Notebook, but everything worked fine in the Scala shell.
2) I also wonder about awaitTermination(), as in:
messages.writeStream.outputMode("append").format("console").option("truncate", false).start().awaitTermination() (I also get no output on the console)
and about "infinite loops" like the one in the code below:
var finished = false
while (!finished) { ... }
Are they waiting for a hard break such as halt or [Ctrl]+C, or how do I break out of them properly so that the next line gets executed? I am confused, because the author of the samples/tutorials explains nothing about this.
import java.util._
import scala.collection.JavaConverters._
import java.util.concurrent._
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.eventhubs.ConnectionStringBuilder
// Event hub configurations
// Replace values below with yours
val eventHubName = "<Event hub name>"
val eventHubNSConnStr = "<Event hub namespace connection string>"
val connStr = ConnectionStringBuilder (eventHubNSConnStr)
.setEventHubName(eventHubName).build
import com.microsoft.azure.eventhubs._
val pool = Executors.newFixedThreadPool(1)
val eventHubClient = EventHubClient.create(connStr.toString(), pool)
def sendEvent(message: String) = {
val messageData = EventData.create(message.getBytes("UTF-8"))
eventHubClient.get().send(messageData)
println("Sent event: " + message + "\n")
}
import twitter4j._
import twitter4j.TwitterFactory
import twitter4j.Twitter
import twitter4j.conf.ConfigurationBuilder
// Twitter application configurations
// Replace values below with yours
val twitterConsumerKey = "<CONSUMER KEY>"
val twitterConsumerSecret = "<CONSUMER SECRET>"
val twitterOauthAccessToken = "<ACCESS TOKEN>"
val twitterOauthTokenSecret = "<TOKEN SECRET>"
val cb = new ConfigurationBuilder()
cb.setDebugEnabled(true)
.setOAuthConsumerKey(twitterConsumerKey)
.setOAuthConsumerSecret(twitterConsumerSecret)
.setOAuthAccessToken(twitterOauthAccessToken)
.setOAuthAccessTokenSecret(twitterOauthTokenSecret)
val twitterFactory = new TwitterFactory(cb.build())
val twitter = twitterFactory.getInstance()
//Getting tweets with keyword "Azure" and sending them to Event Hub realtime
val query = new Query(" #Azure ")
query.setCount(100)
query.lang("en")
var finished = false
while (!finished) {
val result = twitter.search(query)
val statuses = result.getTweets()
var lowestStatusId = Long.MaxValue
for (status <- statuses.asScala) {
if(!status.isRetweet()){
sendEvent(status.getText())
}
lowestStatusId = Math.min(status.getId(), lowestStatusId)
Thread.sleep(2000)
}
query.setMaxId(lowestStatusId - 1)
}
// Closing connection to the Event Hub
eventHubClient.get().close()

Nothing is being printed out from a Flink Patterned Stream

I have this code below:
import java.util.Properties
import com.google.gson._
import com.typesafe.config.ConfigFactory
import org.apache.flink.cep.scala.pattern.Pattern
import org.apache.flink.cep.scala.CEP
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010
import org.apache.flink.streaming.util.serialization.SimpleStringSchema
object WindowedWordCount {
val configFactory = ConfigFactory.load()
def main(args: Array[String]) = {
val brokers = configFactory.getString("kafka.broker")
val topicChannel1 = configFactory.getString("kafka.topic1")
val props = new Properties()
...
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
val dataStream = env.addSource(new FlinkKafkaConsumer010[String](topicChannel1, new SimpleStringSchema(), props))
val partitionedInput = dataStream.keyBy(jsonString => {
val jsonParser = new JsonParser()
val jsonObject = jsonParser.parse(jsonString).getAsJsonObject()
jsonObject.get("account")
})
val priceCheck = Pattern.begin[String]("start").where({jsonString =>
val jsonParser = new JsonParser()
val jsonObject = jsonParser.parse(jsonString).getAsJsonObject()
jsonObject.get("account").toString == "iOS"})
val pattern = CEP.pattern(partitionedInput, priceCheck)
val newStream = pattern.select(x =>
x.get("start").map({str =>
str
})
)
newStream.print()
env.execute()
}
}
For some reason, nothing is being printed out by newStream.print() in the code above. I am positive that there is data in Kafka matching the pattern I defined, but nothing is printed.
Can anyone please help me spot the error in this code?
EDIT:
class TimestampExtractor extends AssignerWithPeriodicWatermarks[String] with Serializable {
override def extractTimestamp(e: String, prevElementTimestamp: Long) = {
val jsonParser = new JsonParser()
val context = jsonParser.parse(e).getAsJsonObject.getAsJsonObject("context")
Instant.parse(context.get("serverTimestamp").toString.replaceAll("\"", "")).toEpochMilli
}
override def getCurrentWatermark(): Watermark = {
new Watermark(System.currentTimeMillis())
}
}
I saw in the Flink docs that you can just return prevElementTimestamp in the extractTimestamp method (if you are using Kafka 0.10) and new Watermark(System.currentTimeMillis) in the getCurrentWatermark method.
But I don't understand what prevElementTimestamp is, or why one would return new Watermark(System.currentTimeMillis) as the watermark and not something else. Can you please elaborate on why we do this, and on how watermarks and event time work together?
You set up your job to work in event time, but you do not provide a timestamp and watermark assigner.
For more on working in event time, see these docs. If you want to use the Kafka-embedded timestamps, these docs may help you.
In event time, the CEP library buffers events until a watermark arrives so that it can properly handle out-of-order events. In your case no watermarks are generated, so the events are buffered indefinitely.
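For example, here is a minimal sketch of attaching a bounded out-of-orderness assigner to the Kafka stream before keying it; the 10-second bound and the serverTimestamp field are assumptions based on the extractor shown in the edit above:
import java.time.Instant
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.windowing.time.Time

val withTimestamps = dataStream.assignTimestampsAndWatermarks(
  new BoundedOutOfOrdernessTimestampExtractor[String](Time.seconds(10)) {
    override def extractTimestamp(element: String): Long = {
      // pull the event time out of the JSON payload
      val context = new JsonParser().parse(element).getAsJsonObject.getAsJsonObject("context")
      Instant.parse(context.get("serverTimestamp").getAsString).toEpochMilli
    }
  })
// then keyBy and run the CEP pattern on withTimestamps instead of dataStream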
Edit:
For the prevElementTimestamp I think the docs are pretty clear:
There is no need to define a timestamp extractor when using the timestamps from Kafka. The previousElementTimestamp argument of the extractTimestamp() method contains the timestamp carried by the Kafka message.
Since Kafka 0.10.x Kafka messages can have embedded timestamp.
Generating the watermark as new Watermark(System.currentTimeMillis) is not a good idea in this case. You should create the watermark based on your knowledge of the data. For information on how watermarks and event time work together, I could not be clearer than the docs.
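As an illustration only, here is a sketch of a periodic assigner that keeps the Kafka-embedded timestamp and derives the watermark from the data rather than from the wall clock; the 10-second allowed lateness is an assumption:
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks
import org.apache.flink.streaming.api.watermark.Watermark

class KafkaTimestampAssigner extends AssignerWithPeriodicWatermarks[String] {
  private var maxSeenTimestamp = Long.MinValue

  override def extractTimestamp(element: String, previousElementTimestamp: Long): Long = {
    // previousElementTimestamp carries the timestamp embedded in the Kafka 0.10 message
    maxSeenTimestamp = math.max(maxSeenTimestamp, previousElementTimestamp)
    previousElementTimestamp
  }

  override def getCurrentWatermark(): Watermark =
    new Watermark(if (maxSeenTimestamp == Long.MinValue) Long.MinValue else maxSeenTimestamp - 10000)
}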

Monitoring Akka Streams Sources with ScalaFX

What I'm trying to solve is the following case:
Given an infinitely running Akka Stream, I want to be able to monitor certain points of the stream. The best way I could think of was to send the messages at those points to an actor which is also a Source. This makes it very flexible to then connect either individual Sources, or merge multiple Sources, to a websocket or whatever other client I want to connect. However, in this specific case I'm trying to connect ScalaFX with an Akka Source and it is not working as expected.
When I run the code below, both counters start out fine, but after a short while one of them stops and never recovers. I know there are special threading considerations when using ScalaFX, but I don't have enough knowledge to understand what is going on here or to debug it. Below is a minimal example to run; the issue should be visible after about 5 seconds.
My question is:
How could I change this code to work as expected?
import akka.NotUsed
import scalafx.Includes._
import akka.actor.{ActorRef, ActorSystem}
import akka.stream.{ActorMaterializer, OverflowStrategy, ThrottleMode}
import akka.stream.scaladsl.{Flow, Sink, Source}
import scalafx.application.JFXApp
import scalafx.beans.property.{IntegerProperty, StringProperty}
import scalafx.scene.Scene
import scalafx.scene.layout.BorderPane
import scalafx.scene.text.Text
import scala.concurrent.duration._
/**
* Created by henke on 2017-06-10.
*/
object MonitorApp extends JFXApp {
implicit val system = ActorSystem("monitor")
implicit val mat = ActorMaterializer()
val value1 = StringProperty("0")
val value2 = StringProperty("0")
stage = new JFXApp.PrimaryStage {
title = "Akka Stream Monitor"
scene = new Scene(600, 400) {
root = new BorderPane() {
left = new Text {
text <== value1
}
right = new Text {
text <== value2
}
}
}
}
override def stopApp() = system.terminate()
val monitor1 = createMonitor[Int]
val monitor2 = createMonitor[Int]
val marketChangeActor1 = monitor1
.to(Sink.foreach{ v =>
value1() = v.toString
}).run()
val marketChangeActor2 = monitor2
.to(Sink.foreach{ v =>
value2() = v.toString
}).run()
val monitorActor = Source[Int](1 to 100)
.throttle(1, 1.second, 1, ThrottleMode.shaping)
.via(logToMonitorAndContinue(marketChangeActor1))
.map(_ * 10)
.via(logToMonitorAndContinue(marketChangeActor2))
.to(Sink.ignore).run()
def createMonitor[T]: Source[T, ActorRef] = Source.actorRef[T](Int.MaxValue, OverflowStrategy.fail)
def logToMonitorAndContinue[T](monitor: ActorRef): Flow[T, T, NotUsed] = {
Flow[T].map{ e =>
monitor ! e
e
}
}
}
It seems that you assign values to the properties (and therefore affect the UI) on the actor system's threads. However, all interaction with the UI should be done on the JavaFX GUI thread. Try wrapping value1() = v.toString and the second assignment in Platform.runLater calls.
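For example, a minimal sketch for the first sink (the second one is analogous):
import scalafx.application.Platform

val marketChangeActor1 = monitor1
  .to(Sink.foreach { v =>
    // marshal the property update onto the JavaFX application thread
    Platform.runLater {
      value1() = v.toString
    }
  }).run()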
I wasn't able to find a definitive statement about using runLater to interact with JavaFX data except in the JavaFX-Swing integration document, but this is quite a common thing in UI libraries; the same is true for Swing with its SwingUtilities.invokeLater method, for example.