Creating a stream from an API in Apache Flink - Scala

First, I'll describe what I want to do. I have an API that takes a function as an argument (it looks like this: dataFromApi => { /* do something */ }) and I would like to process this data with Flink. I wrote this code to simulate the API:
val myIterator = new TestIterator
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

val th1 = new Thread {
  override def run(): Unit = {
    for (i <- 0 to 10) {
      Thread sleep 1000
      myIterator.addToQueue("test" + i)
    }
  }
}
th1.start()

val texts: DataStream[String] = env
  .fromCollection(new TestIterator)

texts.print()
This is my iterator:
class TestIterator extends Iterator[String] with Serializable {
  private val q: BlockingQueue[String] = new LinkedBlockingQueue[String]

  def addToQueue(s: String): Unit = {
    println("Put")
    q.put(s)
  }

  override def hasNext: Boolean = true

  override def next(): String = {
    println("Wait for queue")
    q.take()
  }
}
My idea was to execute myIterator.addToQueue(dataFromApi) whenever I receive data, but this code doesn't work. Despite adding to the queue, execution blocks on q.take(). I tried to write my own SourceFunction based on the queue idea, and I also tried the async I/O operator described here: https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/operators/asyncio/ but I can't achieve what I want.
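For reference, here is a minimal sketch of the queue-backed SourceFunction approach mentioned above (my own assumption of what it could look like, not code from the post; note that a SourceFunction is serialized and shipped to the workers, so sharing an in-memory queue like this only makes sense for local, single-JVM execution):

import java.util.concurrent.BlockingQueue
import org.apache.flink.streaming.api.functions.source.SourceFunction

// Hypothetical source that drains a shared BlockingQueue fed by the API callback.
class QueueSource(queue: BlockingQueue[String]) extends SourceFunction[String] {
  @volatile private var running = true

  override def run(ctx: SourceFunction.SourceContext[String]): Unit = {
    while (running) {
      val element = queue.take() // blocks until the API callback adds data
      ctx.getCheckpointLock.synchronized {
        ctx.collect(element)
      }
    }
  }

  override def cancel(): Unit = running = false
}

// Usage sketch: env.addSource(new QueueSource(sharedQueue)).print()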

Related

SCALA: monix Observable method

I am trying to use a Monix Observable to read a big file in smaller chunks of bytes so that it won't use up too much RAM loading the file's bytes.
However, when I use Observable.fromInputStream, it doesn't provide the Array[Byte] that should be fed into the update() function of MessageDigest.
Any suggestions on my code?
def SHA256_5(file: File) = {
  val sha256 = MessageDigest.getInstance("SHA-256")
  val in: Observable[Array[Byte]] = {
    Observable.fromInputStream(Task(new FileInputStream(file)))
  }
  in.map(byteArray => sha256.update(byteArray)).completed
  sha256.digest().map("%02x".format(_)).mkString
}

def main(args: Array[String]): Unit = {
  val path = "C:\\Users\\ME\\IdeaProjects\\HELLO\\src\\main\\scala\\TRY.scala"
  println(SHA256_5(new File(path)))
}
in.map(byteArray => sha256.update(byteArray)).completed
returns a Task, which means that you have to execute that Task, and only when it finishes will you be able to call
sha256.digest().map("%02x".format(_)).mkString
because Task is used for lazily building asynchronous operations.
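As a small side illustration (mine, not part of the original answer), building a Task does not run it; it only executes when you run it explicitly:

val hashing = Task { println("hashing..."); 42 }
// Nothing has been printed yet. The Task runs only when executed, e.g. via
// hashing.runToFuture, which requires an implicit monix.execution.Scheduler in scope.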
Try this instead:
def calcuateSHA(file: File) = for {
  sha256 <- Task(MessageDigest.getInstance("SHA-256"))
  in = Observable.fromInputStream(Task(new FileInputStream(file)))
  _ <- in.map(byteArray => sha256.update(byteArray)).completed
} yield sha256.digest().map("%02x".format(_)).mkString

def main(args: Array[String]): Unit = {
  val path = "C:\\Users\\ME\\IdeaProjects\\HELLO\\src\\main\\scala\\TRY.scala"
  import monix.execution.Scheduler.Implicits.global
  Await.result(calcuateSHA(new File(path)).runToFuture, Duration.Inf)
}
for starters, or, if you want to use the built-in Monix TaskApp instead of hacks for running an asynchronous computation in a synchronous main:
object Test extends TaskApp {
  def calcuateSHA(file: File) = for {
    sha256 <- Task(MessageDigest.getInstance("SHA-256"))
    in = Observable.fromInputStream(Task(new FileInputStream(file)))
    _ <- in.map(byteArray => sha256.update(byteArray)).completed
  } yield sha256.digest().map("%02x".format(_)).mkString

  def run(args: List[String]) = {
    val path = "C:\\Users\\ME\\IdeaProjects\\HELLO\\src\\main\\scala\\TRY.scala"
    for {
      sha <- calcuateSHA(new File(path))
      _ = println(sha)
    } yield ExitCode.Success
  }
}

Kafka tests failing intermittently if not starting/stopping Kafka each time

I'm trying to run some integration tests for a data stream using an embedded Kafka cluster. When executing all the tests in an environment other than my local one, the tests fail due to some internal state that isn't removed properly.
I can get all the tests running in the non-local environment when I start/stop the Kafka cluster before/after each test, but I only want to start and stop the cluster once, at the beginning and at the end of the execution of my suite of tests.
I tried to remove the local streams state, but that didn't seem to work:
override protected def afterEach(): Unit = KStreamTestUtils.purgeLocalStreamsState(properties)
Is there a way to get my suite of tests running without having to start/stop the cluster each time?
The relevant classes are right below.
class TweetStreamProcessorSpec extends FeatureSpec
  with MockFactory with GivenWhenThen with Eventually with BeforeAndAfterEach with BeforeAndAfterAll {

  val CLUSTER: EmbeddedKafkaCluster = new EmbeddedKafkaCluster
  val TEST_TOPIC: String = "test_topic"
  val properties = new Properties()

  override def beforeAll(): Unit = {
    CLUSTER.start()
    CLUSTER.createTopic(TEST_TOPIC, 1, 1)
  }

  override def afterAll(): Unit = CLUSTER.stop()

  // if these lines are uncommented, the tests work
  // override def afterEach(): Unit = CLUSTER.stop()
  // override protected def beforeEach(): Unit = CLUSTER.start()

  def createProducer: KafkaProducer[String, TweetEvent] = {
    val properties = Map(
      KEY_SERIALIZER_CLASS_CONFIG -> classOf[StringSerializer].getName,
      VALUE_SERIALIZER_CLASS_CONFIG -> classOf[ReflectAvroSerializer[TweetEvent]].getName,
      BOOTSTRAP_SERVERS_CONFIG -> CLUSTER.bootstrapServers(),
      SCHEMA_REGISTRY_URL_CONFIG -> CLUSTER.schemaRegistryUrlForcedToLocalhost()
    )
    new KafkaProducer[String, TweetEvent](properties)
  }

  def kafkaConsumerSettings: KafkaConfig = {
    val bootstrapServers = CLUSTER.bootstrapServers()
    val schemaRegistryUrl = CLUSTER.schemaRegistryUrlForcedToLocalhost()
    val zookeeper = CLUSTER.zookeeperConnect()

    KafkaConfig(
      ConfigFactory.parseString(
        s"""
        akka.kafka.bootstrap.servers = "$bootstrapServers"
        akka.kafka.schema.registry.url = "$schemaRegistryUrl"
        akka.kafka.zookeeper.servers = "$zookeeper"
        akka.kafka.topic-name = "$TEST_TOPIC"
        akka.kafka.consumer.kafka-clients.key.deserializer = org.apache.kafka.common.serialization.StringDeserializer
        akka.kafka.consumer.kafka-clients.value.deserializer = ${classOf[ReflectAvroDeserializer[TweetEvent]].getName}
        akka.kafka.consumer.kafka-clients.client.id = client1
        akka.kafka.consumer.wakeup-timeout=20s
        akka.kafka.consumer.max-wakeups=10
        """).withFallback(ConfigFactory.load()).getConfig("akka.kafka")
    )
  }

  feature("Logging tweet data from kafka topic") {
    scenario("log id and payload when consuming a update tweet event") {
      publishEventsToKafka(List(upTweetEvent))
      val logger = Mockito.mock(classOf[Logger])
      val pipeline = new TweetStreamProcessor(kafkaConsumerSettings, logger)
      pipeline.start
      eventually(timeout(Span(5, Seconds))) {
        Mockito.verify(logger, Mockito.times(1)).info(s"updating tweet uuid=${upTweetEvent.getUuid}, payload=${upTweetEvent.getPayload}")
      }
      pipeline.stop
    }

    scenario("log id when consuming a delete tweet event") {
      publishEventsToKafka(List(delTweetEvent))
      val logger = Mockito.mock(classOf[Logger])
      val pipeline = new TweetStreamProcessor(kafkaConsumerSettings, logger)
      pipeline.start
      eventually(timeout(Span(5, Seconds))) {
        Mockito.verify(logger, Mockito.times(1)).info(s"deleting tweet uuid=${delTweetEvent.getUuid}")
      }
      pipeline.stop
    }
  }
}
class TweetStreamProcessor(kafkaConfig: KafkaConfig, logger: Logger)
  extends Lifecycle with TweetStreamProcessor with Logging {

  private var control: Control = _
  private val valueDeserializer: Option[Deserializer[TweetEvent]] = None

  // ...

  def tweetsSource(implicit mat: Materializer): Source[CommittableMessage[String, TweetEvent], Control] =
    Consumer.committableSource(tweetConsumerSettings, Subscriptions.topics(kafkaConfig.topicName))

  override def start: Future[Unit] = {
    control = tweetsSource(materializer)
      .mapAsync(1) { msg =>
        logTweetEvent(msg.record.value())
          .map(_ => msg.committableOffset)
      }.batch(max = 20, first => CommittableOffsetBatch.empty.updated(first)) { (batch, elem) =>
        batch.updated(elem)
      }
      .mapAsync(3)(_.commitScaladsl())
      .to(Sink.ignore)
      .run()
    Future.successful()
  }

  override def stop: Future[Unit] = {
    control.shutdown()
      .map(_ => Unit)
  }
}
Any help on this would be much appreciated. Thanks in advance.
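One possible direction (my own sketch, not from the post; it reuses only the CLUSTER.createTopic call shown above, and withIsolatedTopic is a hypothetical helper) is to keep the single cluster but give each test its own uniquely named topic, so no state is shared between tests:

// Hypothetical helper: each test gets a fresh topic on the shared cluster.
def withIsolatedTopic(test: String => Unit): Unit = {
  val topic = s"test_topic_${java.util.UUID.randomUUID()}"
  CLUSTER.createTopic(topic, 1, 1)
  test(topic)
}

// Usage sketch inside a scenario (publishEventsToKafka and the consumer settings
// would have to be parameterized by the topic for this to work):
// withIsolatedTopic { topic =>
//   publishEventsToKafka(List(upTweetEvent))
//   ...
// }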

Akka-streams: how to get flow names in metrics reported by kamon-akka

I've been trying to set up some instrumentation for Akka Streams. I got it working, but even though I named all the Flows that are part of the streams, I still get this sort of name in the metrics: flow-0-0-unknown-operation
A simple example of what I'm trying to do:
val myflow = Flow[String].named("myflow").map(println)
Source.single("hello").via(myflow).to(Sink.ignore).run()
I basically want to see the metrics for the Actor that gets created for "myflow", with a proper name.
Is this even possible? Am I missing something?
I had this challenge in my project and I solved it by using Kamon + Prometheus. However, I had to create a custom Akka Streams flow stage whose name I can set through metricName, and export the metric values from it using val kamonThroughputGauge: Metric.Gauge.
class MonitorProcessingTimerFlow[T](interval: FiniteDuration)(metricName: String = "monitorFlow") extends GraphStage[FlowShape[T, T]] {
  val in = Inlet[T]("MonitorProcessingTimerFlow.in")
  val out = Outlet[T]("MonitorProcessingTimerFlow.out")

  Kamon.init()
  val kamonThroughputGauge: Metric.Gauge = Kamon.gauge("akka-stream-throughput-monitor")

  override def createLogic(inheritedAttributes: Attributes): GraphStageLogic = new TimerGraphStageLogic(shape) {
    // mutable state
    var open = false
    var count = 0
    var start = System.nanoTime

    setHandler(in, new InHandler {
      override def onPush(): Unit = {
        try {
          push(out, grab(in))
          count += 1
          if (!open) {
            open = true
            scheduleOnce(None, interval)
          }
        } catch {
          case e: Throwable => failStage(e)
        }
      }
    })

    setHandler(out, new OutHandler {
      override def onPull(): Unit = {
        pull(in)
      }
    })

    override protected def onTimer(timerKey: Any): Unit = {
      open = false
      val duration = (System.nanoTime - start) / 1e9d
      val throughput = count / duration
      kamonThroughputGauge.withTag("name", metricName).update(throughput)
      count = 0
      start = System.nanoTime
    }
  }

  override def shape: FlowShape[T, T] = FlowShape[T, T](in, out)
}
Then I created a simple stream that uses the MonitorProcessingTimerFlow to export the metrics:
implicit val system = ActorSystem("FirstStreamMonitoring")

val source = Source(Stream.from(1)).throttle(1, 1 second)

/** Simulating workload fluctuation: a Flow that expands each event into a random number of events. */
val flow = Flow[Int].extrapolate { element =>
  Stream.continually(Random.nextInt(100)).take(Random.nextInt(100)).iterator
}

val monitorFlow = Flow.fromGraph(new MonitorProcessingTimerFlow[Int](5 seconds)("monitorFlow"))

val sink = Sink.foreach[Int](println)

val graph = source
  .via(flow)
  .via(monitorFlow)
  .to(sink)

graph.run()
with the corresponding configuration in application.conf:
kamon.instrumentation.akka.filters {
  actors.track {
    includes = [ "FirstStreamMonitoring/user/*" ]
  }
}
I can see the throughput metrics in the Prometheus console under the tag name="monitorFlow".

Router Hanging in Dealer-Router Setup

Given the following attempt to connect 1 DEALER to 1 ROUTER:
package net.async

import org.zeromq.ZMQ
import org.zeromq.ZMQ.Socket

import scala.annotation.tailrec

object Client {
  val Empty = "".getBytes
  def message(x: Int) = s"HELLO_#$x".getBytes
  val Count = 5
}

class Client(name: String) extends Runnable {
  import Client._
  import AsyncClientServer.Port

  override def run(): Unit = {
    val context = ZMQ.context(1)
    val dealer = context.socket(ZMQ.DEALER)
    dealer.setIdentity(name.getBytes)
    dealer.connect(s"tcp://localhost:$Port")
    runHelper(dealer, Count)
  }

  @tailrec
  private def runHelper(dealer: Socket, count: Int): Unit = {
    dealer.send(dealer.getIdentity, ZMQ.SNDMORE)
    dealer.send(Empty, ZMQ.SNDMORE)
    dealer.send(message(count), 0)
    println(s"Dealer: ${dealer.getIdentity} received message: " + dealer.recv(0))
    runHelper(dealer, count - 1)
  }
}
object AsyncClientServer {
  val Port = 5555
  val context = ZMQ.context(1)
  val router = context.socket(ZMQ.ROUTER)

  def main(args: Array[String]): Unit = {
    router.bind(s"tcp://*:$Port")
    mainHelper()
    new Thread(new Client("Joe")).start()
  }

  private def mainHelper(): Unit = {
    println("Waiting to receive messages from Dealer.")
    val identity = router.recv(0)
    val empty = router.recv(0)
    val message = router.recv(0)
    println(s"Router received message, ${new String(message)} from sender: ${new String(identity)}.")
    mainHelper()
  }
}
I see the following output, hanging on the second message.
[info] Running net.async.AsyncClientServer
[info] Waiting to receive messages from Dealer.
Why is that?
Not sure if it's the cause of your problem, but you don't need to send the identity frame from your DEALER; ZeroMQ will do this for you. By adding it you're actually sending a 4-part message:
IDENTITY
IDENTITY
EMPTY
CONTENT
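For illustration, a minimal sketch (my own, assuming only the empty delimiter and the content should be sent explicitly) of what the DEALER's send loop could look like without the extra identity frame:

@tailrec
private def runHelper(dealer: Socket, count: Int): Unit = {
  dealer.send(Empty, ZMQ.SNDMORE)  // empty delimiter frame
  dealer.send(message(count), 0)   // content frame; the ROUTER still sees the identity ZeroMQ prepends
  println(s"Dealer ${new String(dealer.getIdentity)} received: ${new String(dealer.recv(0))}")
  runHelper(dealer, count - 1)
}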

Performance using OrientDB graph API vs SQL query

I'm currently using this API for standard operations (read, remove, find, save):
http://orientdb.com/docs/last/Graph-Database-Tinkerpop.html
I noticed that the performance of this remove approach is really bad:
object Odb {
  val factory = new OrientGraphFactory("remote:localhost:2424/recommendation-system", "root", "root").setupPool(1, 10)

  def clearDb = {
    val graph = factory.getNoTx
    val vertices = graph.getVertices().asScala.map(v => v.remove())
  }
}

object TagsOdb extends TagsDao {
  override def count: Future[Long] = Future {
    val graph = Odb.factory.getNoTx
    val count = graph.countVertices("Tags")
    count
  }

  override def update(newTag: Tag, oldTag: Tag): Future[Boolean] = Future { synchronized {
    val graph = Odb.factory.getTx
    val tagVertices = graph.getVertices("Tags.tag", oldTag.flatten).asScala
    if (tagVertices.isEmpty) throw new Exception("Tag not found: " + oldTag.id)
    tagVertices.head.setProperty("tag", newTag.flatten)
    graph.commit()
    true
  }}

  override def all: Future[List[Tag]] = Future {
    val graph = Odb.factory.getNoTx
    val tagVertices = graph.getVerticesOfClass("Tags").asScala
    val tagList = tagVertices.map(v => Tag(v.getProperty("tag"), None)).toList
    tagList
  }

  override def remove(e: Tag): Future[Boolean] = Future { synchronized {
    val graph = Odb.factory.getTx
    val tagVertices = graph.getVertices("Tags.tag", e.flatten).asScala
    if (tagVertices.isEmpty) throw new Exception("Tag not found: " + e.flatten)
    tagVertices.head.remove()
    graph.commit()
    true
  }}

  override def save(e: Tag, upsert: Boolean = false): Future[Boolean] = Future { synchronized {
    val graph = Odb.factory.getTx
    val v = graph.getVertices("Tags.tag", e.flatten).asScala
    if (v.nonEmpty) {
      if (upsert)
        v.head.setProperty("tag", e.flatten)
      else
        throw new Exception("Element already in database")
    } else {
      val tagVertex = graph.addVertex("Tags", null)
      tagVertex.setProperty("tag", e.flatten)
    }
    graph.commit()
    true
  }}

  override def find(query: String): Future[List[Tag]] = Future {
    val graph = Odb.factory.getNoTx
    val res: OrientDynaElementIterable = graph.command(new OCommandSQL(query)).execute()
    val ridTags: Iterable[Vertex] = res.asScala.asInstanceOf[Iterable[Vertex]]

    def getTag(rid: AnyRef): Tag = {
      val tagVertex = graph.getVertex(rid)
      Tag(tagVertex.getProperty("tag"), None)
    }

    ridTags.map(r => getTag(r)).toList
  }
}
Is there any way to get better performance? Should I use SQL queries?
Since you're connected via the remote interface, each remove() is a separate RPC. Try executing a DELETE VERTEX command on the server instead.
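For example, a minimal sketch (my own, based only on the graph.command pattern already used in find above) of letting the server delete the Tags vertices in a single command instead of one remote remove() per vertex:

// Hypothetical variant of clearDb restricted to the Tags class.
def clearTagsWithCommand(): Unit = {
  val graph = Odb.factory.getNoTx
  // One round trip: the server deletes every Tags vertex itself.
  val deleted: Any = graph.command(new OCommandSQL("DELETE VERTEX Tags")).execute()
  println(s"Deleted vertices: $deleted")
}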