Possible encoding issue with Google PubSub - scala

When running a subscription source from the Alpakka PubSub library, I receive what looks like encoded data.
@Singleton
class Consumer @Inject()(config: Configuration, credentialsService: google.creds.Service)(implicit actorSystem: ActorSystem) {
  implicit val m: ActorMaterializer = ActorMaterializer.create(actorSystem)
  val logger = Logger(this.getClass)

  val subName: String = config.get[String]("google.pubsub.subname")
  val credentials: Credentials = credentialsService.getCredentials
  val pubSubConfig = PubSubConfig(credentials.projectId, credentials.clientEmail, credentials.privateKey)

  val subSource: Source[ReceivedMessage, NotUsed] = GooglePubSub.subscribe(subName, pubSubConfig)
  val ackSink: Sink[AcknowledgeRequest, Future[Done]] = GooglePubSub.acknowledge(subName, pubSubConfig)

  val computeGraph = Flow[ReceivedMessage].map { x =>
    logger.info(x.message.data)
    x
  }

  val ackGraph = Flow.fromFunction((msgs: Seq[ReceivedMessage]) => AcknowledgeRequest(msgs.map(_.ackId).toList))

  subSource
    .via(computeGraph)
    .groupedWithin(10, 5.minutes)
    .via(ackGraph)
    .to(ackSink)
    .run()
}
I publish the message from the Pub/Sub console. I am expecting my test message to appear; however, when I publish "test", I receive "dGVzdA==". Is this an expected result? I have had issues with importing the private key; could this be a result of that?
The consumer is bound eagerly with Guice.

Data that is received over the REST APIs will be base64-encoded ("dGVzdA==" is the base64 encoding of "test"). My guess would be that the Alpakka Pub/Sub library, which uses the REST APIs, is not decoding the received data for you. It looks like they also have a library that uses the gRPC Pub/Sub client as the underlying layer, which may not suffer from this defect. You can also use the Cloud Pub/Sub Java client library from Scala directly.
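If you stay with the REST-based connector, you can decode the field yourself. A minimal sketch (assuming the payload is UTF-8 text), adapting the computeGraph from the question:

import java.nio.charset.StandardCharsets
import java.util.Base64

// Decode the base64-encoded data field before logging; "dGVzdA==" becomes "test".
val computeGraph = Flow[ReceivedMessage].map { x =>
  val decoded = new String(Base64.getDecoder.decode(x.message.data), StandardCharsets.UTF_8)
  logger.info(decoded)
  x
}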

Related

Change a materialized value in a source using the contents of the stream

Alpakka provides a great way to access dozens of different data sources. File-oriented sources such as HDFS and FTP sources are delivered as Source[ByteString, Future[IOResult]]. However, HTTP requests via Akka HTTP are delivered as entity streams of Source[ByteString, NotUsed]. In my use case, I would like to retrieve content from HTTP sources as Source[ByteString, Future[IOResult]] so I can build a unified resource fetcher that works with multiple schemes (hdfs, file, ftp and S3 in this case).
In particular, I would like to convert the Source[ByteString, NotUsed] source to Source[ByteString, Future[IOResult]], where I am able to calculate the IOResult from the incoming byte stream. There are plenty of methods like flatMapConcat and viaMat, but none seem to be able to extract details from the input stream (such as the number of bytes read) or initialise the IOResult structure properly. Ideally, I am looking for a method with the following signature that will update the IOResult as the stream comes in.
def matCalc(src: Source[ByteString, Any]): Source[ByteString, Future[IOResult]] = {
  src.someMatFoldMagic[ByteString, IOResult](IOResult.createSuccessful(0))((m, b) => m.withCount(m.count + b.length))
}
I can't recall any existing functionality which can do this out of the box, but you can use the alsoToMat flow operator (surprisingly, I didn't find it in the Akka Streams docs, although you can look it up in the source code documentation and the Java API) together with Sink.fold to accumulate some value and hand it over at the very end. E.g.:
def magic(source: Source[Int, Any]): Source[Int, Future[Int]] =
  source.alsoToMat(Sink.fold(0)((acc, _) => acc + 1))((_, f) => f)
The thing is that alsoToMat combines the input materialized value with the one provided in alsoToMat. At the same time, the values produced by the source are not affected by the sink in alsoToMat:
def alsoToMat[Mat2, Mat3](that: Graph[SinkShape[Out], Mat2])(matF: (Mat, Mat2) ⇒ Mat3): ReprMat[Out, Mat3] =
  viaMat(alsoToGraph(that))(matF)
It's not that hard to adapt this function to return IOResult, which according to the source code is:
final case class IOResult(count: Long, status: Try[Done]) { ... }
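For example, a sketch of such an adaptation (reusing the matCalc name from the question; note that if the stream fails, the future simply fails rather than carrying a Failure inside the IOResult):

import akka.stream.IOResult
import akka.stream.scaladsl.{Sink, Source}
import akka.util.ByteString
import scala.concurrent.{ExecutionContext, Future}

def matCalc(src: Source[ByteString, Any])(implicit ec: ExecutionContext): Source[ByteString, Future[IOResult]] =
  // Count the bytes on a side sink and turn the count into an IOResult when it completes.
  src.alsoToMat(Sink.fold(0L)((count, bytes: ByteString) => count + bytes.length))(
    (_, futureCount) => futureCount.map(IOResult.createSuccessful))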
One last thing you need to pay attention to: you want your source to be of type
Source[ByteString, Future[IOResult]]
But if you want to carry this materialized value to the very end of the stream definition and then do something based on the completion of that future, it might be an error-prone approach. E.g., in this example I finish the work based on that future, so the last value will not be processed:
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Keep, Sink, Source}

import scala.concurrent.duration._
import scala.concurrent.{Await, ExecutionContext, Future}

object App extends App {
  private implicit val sys: ActorSystem = ActorSystem()
  private implicit val mat: ActorMaterializer = ActorMaterializer()
  private implicit val ec: ExecutionContext = sys.dispatcher

  val source: Source[Int, Any] = Source((1 to 5).toList)

  def magic(source: Source[Int, Any]): Source[Int, Future[Int]] =
    source.alsoToMat(Sink.fold(0)((acc, _) => acc + 1))((_, f) => f)

  val f = magic(source).throttle(1, 1.second).toMat(Sink.foreach(println))(Keep.left).run()
  f.onComplete(t => println(s"f1 completed - $t"))

  Await.ready(f, 5.minutes)
  mat.shutdown()
  sys.terminate()
}
This can be done by using a Promise for the materialized value propagation.
val completion = Promise[IOResult]()
val httpWithIoResult = http.mapMaterializedValue(_ => completion.future)
What is left now is to complete the completion promise when the relevant data becomes available.
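For example, one way to complete it (a sketch that reuses the byte-counting sink from above; http stands in for the Source[ByteString, NotUsed] entity stream):

import akka.stream.IOResult
import akka.stream.scaladsl.{Sink, Source}
import akka.util.ByteString
import scala.concurrent.{ExecutionContext, Future, Promise}

def withPromisedIoResult(http: Source[ByteString, Any])(implicit ec: ExecutionContext): Source[ByteString, Future[IOResult]] = {
  val completion = Promise[IOResult]()
  http
    .mapMaterializedValue(_ => completion.future)
    // Count the bytes on a side channel and complete the promise when that sink is done.
    .alsoToMat(Sink.fold(0L)((count, bytes: ByteString) => count + bytes.length)) { (matValue, futureCount) =>
      completion.completeWith(futureCount.map(IOResult.createSuccessful))
      matValue
    }
}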
An alternative approach would be to drop down to the GraphStage API, where you get lower-level control of materialized value propagation. But even there, using Promises is often the chosen implementation for materialized value propagation. Take a look at built-in operator implementations like Ignore.

Testing Twitter with Spark Streaming API

I am new to the Streaming framework of Spark and was trying to process the Twitter stream.
I am in the process of writing test cases for it and understand that I can use Spark's StreamingSuiteBase, which helps me test input as a stream against my functions.
But I have written a function which takes DStream[Status] as input and, after processing, gives DStream[String] as output.
The API I am using from StreamingSuiteBase is testOperation.
test("Filter only words Starting with #") {
val inputTweet = List(List("this is #firstHash"), List("this is #secondHash"), List("this is #thirdHash"))
val expected = List(List("#firstHash"), List("#secondHash"), List("#thirdHash"))
testOperation(inputTweet, TransformTweets.getText _, expected, ordered = false)
And this is the function to which the input is sent:
def getText(englishTweets: DStream[Status]): DStream[String] = {
  println(englishTweets.toString)
  val hashTags = englishTweets.flatMap(x => x.getText.split(" ").filter(_.startsWith("#")))
  hashTags
}
But I am getting a "type mismatch" error, because the test input is DStream[String] while the function expects DStream[Status]. How do I mock DStream[Status]?
So, I resolved this issue by building the Twitter status with the createStatus API of TwitterObjectFactory. There was no need to mock Status. Even if you manage to mock it, there are serialization issues. So, this is the best solution:
val rawJson = Source.fromURL(getClass.getResource("/tweetStatus.json")).getLines.mkString
val tweetStatus = TwitterObjectFactory.createStatus(rawJson)
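For context, a sketch of how the whole test might look with real Status objects built from JSON fixtures (the fixture file names here are hypothetical):

import twitter4j.{Status, TwitterObjectFactory}
import scala.io.Source

def statusFromResource(resource: String): Status = {
  val rawJson = Source.fromURL(getClass.getResource(resource)).getLines().mkString
  TwitterObjectFactory.createStatus(rawJson)
}

test("Filter only words starting with #") {
  // Each inner List is one micro-batch of the DStream[Status] input.
  val inputTweets = List(List(statusFromResource("/firstTweet.json")), List(statusFromResource("/secondTweet.json")))
  val expected = List(List("#firstHash"), List("#secondHash"))
  testOperation(inputTweets, TransformTweets.getText _, expected, ordered = false)
}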
Hope this helps someone!

Kafka topic to websocket

I am trying to implement a setup where multiple web browsers open a websocket connection to my akka-http server in order to read all messages posted to a Kafka topic.
So the stream of messages should go this way:
kafka topic -> akka-http -> websocket connection 1
-> websocket connection 2
-> websocket connection 3
For now I have created a path for the websocket:
val route: Route =
  path("ws") {
    handleWebSocketMessages(notificationWs)
  }
Then I have created a consumer for my kafka topic:
val consumerSettings = ConsumerSettings(system, new ByteArrayDeserializer, new StringDeserializer)
  .withBootstrapServers("localhost:9092")
  .withGroupId("group1")
  .withProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")

val source = Consumer
  .plainSource(consumerSettings, Subscriptions.topics("topic1"))
And then, finally, I want to connect this source to the websocket in handleWebSocketMessages:
def handleWebSocketMessages: Flow[Message, Message, Any] =
  Flow[Message].mapConcat {
    case tm: TextMessage =>
      TextMessage(source) :: Nil
    case bm: BinaryMessage =>
      // ignore binary messages but drain content to avoid the stream being clogged
      bm.dataStream.runWith(Sink.ignore)
      Nil
  }
Here is the error I get when I try to use source in the TextMessage:
Error:(77, 9) overloaded method value apply with alternatives:
(textStream: akka.stream.scaladsl.Source[String,Any])akka.http.scaladsl.model.ws.TextMessage
(text: String)akka.http.scaladsl.model.ws.TextMessage.Strict
cannot be applied to (akka.stream.scaladsl.Source[org.apache.kafka.clients.consumer.ConsumerRecord[Array[Byte],String],akka.kafka.scaladsl.Consumer.Control])
TextMessage(source)::Nil
I think I'm making numerous mistakes along the way, but I would say that the biggest blocker is handleWebSocketMessages.
The first thing is to understand that source is of type Source[ConsumerRecord[K, V], Control].
So it's not something that you can pass as an argument to TextMessage.
Now, let's take the websocket's point of view:
- an outgoing message is built for each record in the Kafka source, as a TextMessage wrapping a String transformation of the record;
- each incoming message is simply println()-ed.
So, the Flow can be seen as two components: the Source & the Sink.
val incomingMessages: Sink[Message, Future[Done]] =
  Sink.foreach(println(_))

val outgoingMessages: Source[Message, Consumer.Control] =
  source
    .map { consumerRecord => TextMessage(consumerRecord.value) }

val handleWebSocketMessages: Flow[Message, Message, Any] =
  Flow.fromSinkAndSource(incomingMessages, outgoingMessages)
Hope it helps.

Apache Flink - Unable to get data From Twitter

I'm trying to get some messages from the Twitter Streaming API using Apache Flink.
But my code is not writing anything to the output file. I'm trying to count the input data for specific words.
Please check my example:
import java.util.Properties

import org.apache.flink.api.scala._
import org.apache.flink.streaming.connectors.twitter._
import org.apache.flink.api.java.utils.ParameterTool
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import com.twitter.hbc.core.endpoint.{Location, StatusesFilterEndpoint, StreamingEndpoint}
import org.apache.flink.streaming.api.windowing.time.Time

import scala.collection.JavaConverters._

//////////////////////////////////////////////////////
// Create an Endpoint to track our terms
class myFilterEndpoint extends TwitterSource.EndpointInitializer with Serializable {
  @Override
  def createEndpoint(): StreamingEndpoint = {
    //val chicago = new Location(new Location.Coordinate(-86.0, 41.0), new Location.Coordinate(-87.0, 42.0))
    val endpoint = new StatusesFilterEndpoint()
    //endpoint.locations(List(chicago).asJava)
    endpoint.trackTerms(List("odebrecht", "lava", "jato").asJava)
    endpoint
  }
}

object Connection {

  def main(args: Array[String]): Unit = {
    val props = new Properties()
    val params: ParameterTool = ParameterTool.fromArgs(args)
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    env.getConfig.setGlobalJobParameters(params)
    env.setParallelism(params.getInt("parallelism", 1))

    props.setProperty(TwitterSource.CONSUMER_KEY, params.get("consumer-key"))
    props.setProperty(TwitterSource.CONSUMER_SECRET, params.get("consumer-secret"))
    props.setProperty(TwitterSource.TOKEN, params.get("token"))
    props.setProperty(TwitterSource.TOKEN_SECRET, params.get("token-secret"))

    val source = new TwitterSource(props)
    val epInit = new myFilterEndpoint()
    source.setCustomEndpointInitializer(epInit)

    val streamSource = env.addSource(source)

    streamSource.map(s => (0, 1))
      .keyBy(0)
      .timeWindow(Time.minutes(2), Time.seconds(30))
      .sum(1)
      .map(t => t._2)
      .writeAsText(params.get("output"))

    env.execute("Twitter Count")
  }
}
The point is, I have no error message, and in my dashboard I can see that my source is sending data to my TriggerWindow, but the window is not receiving any data.
I have two questions in one.
First: why is my source sending bytes to my TriggerWindow if the window has not received anything?
Second: is there something wrong with my code such that I can't get data from Twitter?
Your application source did not send actual records to the window, which you can see by looking at the "Records sent" column. The bytes which are sent belong to control messages which Flink sends from time to time between the tasks. More specifically, it is the LatencyMarker message, which is used to measure the end-to-end latency of a Flink job.
The code looks good to me. I even tried out your code and it worked for me. Thus, I conclude that there has to be something wrong with the Twitter connection credentials. Please re-check whether you've entered the right credentials.

How to read a file into an array of bytes from GridFS?

I want to get an Array/List[Byte] from an Enumerator[Array[Byte]]. Some articles show how to do it in the Play Framework, but I'm not using Play for this project. This is what I have:
lazy val gridfs = GridFS[BSONSerializationPack.type](db, "resource")

gridfs.find(BSONDocument("_id" -> BSONObjectID(id))).headOption.map {
  case Some(file) => {
    // this gives me an Enumerator[Array[Byte]];
    // I'm not using Play Framework, so how do I get a Future[Array[Byte]] from here?
    gridfs.enumerate(file)
  }
}
Using Play Iteratees is not specific to a Play application; Iteratees are a general streaming abstraction, much like Akka Streams or Rx.
You can consume all the chunks of such an enumerator using Iteratee.consume:
val sink: Iteratee[Array[Byte], Array[Byte]] = Iteratee.consume[Array[Byte]]()
val allInMem: Future[Array[Byte]] = enumerator |>>> sink
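Putting it together with the question's gridfs value (a sketch; error handling kept minimal):

import play.api.libs.iteratee.Iteratee
import reactivemongo.bson.{BSONDocument, BSONObjectID}
import scala.concurrent.{ExecutionContext, Future}

def fileBytes(id: String)(implicit ec: ExecutionContext): Future[Array[Byte]] =
  gridfs.find(BSONDocument("_id" -> BSONObjectID(id))).headOption.flatMap {
    case Some(file) =>
      // Run the enumerator into the consuming iteratee to get all the bytes in memory.
      gridfs.enumerate(file) |>>> Iteratee.consume[Array[Byte]]()
    case None =>
      Future.failed(new NoSuchElementException(s"no file with _id $id"))
  }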
For obvious reasons, it's recommended not to consume big data into memory this way, but to use an appropriate Iteratee to process the data in a streaming fashion.