NullPointerException in Flink custom SourceFunction - scala

I wanted to create a SourceFunction which reads a http stream.
I used ScalaJ which does what I want (it splits the incoming text by \n-s).
Obviously the code works outside Flink, but I get a NullPointerExcetion every time I start it as a Flink job (sometimes immediately sometimes after 1-2 seconds after it transmitted 1-2 elements). It kind of looks like the Http object has some problems.
import org.apache.flink.streaming.api.functions.source.SourceFunction
import scala.io.Source.fromInputStream
import scalaj.http._
class HttpSource(url: String) extends SourceFunction[String] {
#volatile var isRunning = true
override def cancel(): Unit = isRunning = false
override def run(ctx: SourceFunction.SourceContext[String]): Unit =
httpStream(ctx.collect)
private def httpStream(f: String => Unit) = {
val request = Http(url)
request
.execute { inputStream =>
fromInputStream(inputStream)
.getLines()
.takeWhile(_ => isRunning)
.foreach(f)
}
}
}
Here's the exception I usually get:
(Sometimes it's a bit different, for example I tried to make the request value transient, then it's already null when it tries to refer to request)
Caused by: java.lang.NullPointerException
at java.io.Reader.<init>(Reader.java:78)
at java.io.InputStreamReader.<init>(InputStreamReader.java:129)
at scala.io.BufferedSource.reader(BufferedSource.scala:24)
at scala.io.BufferedSource.bufferedReader(BufferedSource.scala:25)
at scala.io.BufferedSource.scala$io$BufferedSource$$charReader$lzycompute(BufferedSource.scala:35)
at scala.io.BufferedSource.scala$io$BufferedSource$$charReader(BufferedSource.scala:33)
at scala.io.BufferedSource.scala$io$BufferedSource$$decachedReader(BufferedSource.scala:62)
at scala.io.BufferedSource$BufferedLineIterator.<init>(BufferedSource.scala:67)
at scala.io.BufferedSource.getLines(BufferedSource.scala:86)
at flinkextension.HttpSource$$anonfun$httpStream$1.apply(HttpSource.scala:21)
at flinkextension.HttpSource$$anonfun$httpStream$1.apply(HttpSource.scala:19)
at scalaj.http.HttpRequest$$anonfun$execute$1.apply(Http.scala:323)
at scalaj.http.HttpRequest$$anonfun$execute$1.apply(Http.scala:323)
at scalaj.http.HttpRequest$$anonfun$toResponse$3.apply(Http.scala:388)
at scalaj.http.HttpRequest$$anonfun$toResponse$3.apply(Http.scala:380)
at scala.Option.getOrElse(Option.scala:121)
at scalaj.http.HttpRequest.toResponse(Http.scala:380)
at scalaj.http.HttpRequest.scalaj$http$HttpRequest$$doConnection(Http.scala:360)
at scalaj.http.HttpRequest.exec(Http.scala:335)
at scalaj.http.HttpRequest.execute(Http.scala:323)
at flinkextension.HttpSource.httpStream(HttpSource.scala:19)
at flinkextension.HttpSource.run(HttpSource.scala:14)
at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:87)
at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:55)
at org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:95)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:263)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:702)
at java.lang.Thread.run(Thread.java:748)
Everything else seems to be working fine, when I don't use a http request, but something else like file read with the same InputStream type, just a plain while loop with strings or even when I use single http requests, which aren't streaming.
I feel like I'm missing some theoretical background, maybe flink does something in the background which destroys the Http object or the InputStream, but I didn't find anything in the documentation.
UPDATE #1:
If I put a null check into the lambda, the job usually exits immediately, sometimes processes a few elements, sometimes timeouts after hanging for a minute. Here's this version of the httpStream function:
private def httpStream(f: String => Unit) = {
val request = Http(url)
request
.execute { inputStream =>
if (inputStream == null) println("null inputstream")
else {
println("not null inputstream")
fromInputStream(inputStream)
.getLines()
.takeWhile(_ => isRunning)
.foreach(f)
}
}
}
UPDATE #2:
The code actually works in distributed mode and with StreamExecutionEnvironment.createLocalEnvironment()
I only experience the issue if I use start-local.sh and submit the jar to it.

Related

Strange jetty warning "java.lang.IllegalStateException: Committed" on redirecting, how to fix it?

I have a custom Jetty redirect Handler:
object RedirectHandler extends HandlerWrapper {
override def handle(
target: String,
baseRequest: Request,
request: HttpServletRequest,
response: HttpServletResponse
): Unit = {
require(!response.isCommitted)
val redirectURIOpt = rewrite(request.getRequestURI)
redirectURIOpt.foreach { v =>
response.sendRedirect(v)
require(response.isCommitted)
}
}
def rewrite(target: String): Option[String] = {
if (target.endsWith(".html")) None
else {
var _target = target.stripSuffix("/")
if (_target.nonEmpty) {
_target += ".html"
Some(_target)
} else {
None
}
}
}
}
It is configured to work with an ordinary ResourceHandler to set up a web server:
...
val server = new Server(10092)
val resourceHandler = new ResourceHandler
resourceHandler.setDirectoriesListed(true)
// resource_handler.setWelcomeFiles(Array[String]("test-sites.html"))
resourceHandler.setResourceBase(CommonConst.USER_DIR \\ "test-sites")
val handlers = new HandlerList
handlers.setHandlers(Array(RedirectHandler, resourceHandler))
server.setHandler(handlers)
Every time a redirect is triggered, I found a strange warning message:
WARN HttpChannel: /test-sites/e-commerce/static
java.lang.IllegalStateException: Committed
at org.sparkproject.jetty.server.HttpChannel.resetBuffer(HttpChannel.java:994)
at org.sparkproject.jetty.server.HttpOutput.resetBuffer(HttpOutput.java:1459)
at org.sparkproject.jetty.server.Response.resetBuffer(Response.java:1217)
at org.sparkproject.jetty.server.Response.sendRedirect(Response.java:569)
at org.sparkproject.jetty.server.Response.sendRedirect(Response.java:503)
at org.sparkproject.jetty.server.Response.sendRedirect(Response.java:578)
at org.sparkproject.jetty.server.ResourceService.sendWelcome(ResourceService.java:398)
at org.sparkproject.jetty.server.ResourceService.doGet(ResourceService.java:257)
at org.sparkproject.jetty.server.handler.ResourceHandler.handle(ResourceHandler.java:262)
at org.sparkproject.jetty.server.handler.HandlerList.handle(HandlerList.java:59)
at org.sparkproject.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
at org.sparkproject.jetty.server.Server.handle(Server.java:516)
at org.sparkproject.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:388)
at org.sparkproject.jetty.server.HttpChannel.dispatch(HttpChannel.java:633)
at org.sparkproject.jetty.server.HttpChannel.handle(HttpChannel.java:380)
at org.sparkproject.jetty.server.HttpConnection.onFillable(HttpConnection.java:277)
at org.sparkproject.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
at org.sparkproject.jetty.io.FillInterest.fillable(FillInterest.java:105)
at org.sparkproject.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
at org.sparkproject.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883)
at org.sparkproject.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034)
at java.lang.Thread.run(Thread.java:748)
According to this post:
jetty webSocket : java.lang.IllegalStateException: Committed
The problem is triggered if the response is further tampered after sendRedirect. However, given that resourceHandler is a very mature component, it is very unlikely to make that low level mistake.
So what are the possible causes, and more importantly, what can be done to fix it by declaring the redirect response to be final?
You missed the baseRequest.setHandled(true) to tell Jetty to stop processing further Handlers. You use this when your Handler produces a response (or handles the request).
A later handler (ResourceHandler according to your stacktrace) attempted to write a response and was smacked down because the response was already committed.

Wrapping Pub-Sub Java API in Akka Streams Custom Graph Stage

I am working with a Java API from a data vendor providing real time streams. I would like to process this stream using Akka streams.
The Java API has a pub sub design and roughly works like this:
Subscription sub = createSubscription();
sub.addListener(new Listener() {
public void eventsReceived(List events) {
for (Event e : events)
buffer.enqueue(e)
}
});
I have tried to embed the creation of this subscription and accompanying buffer in a custom graph stage without much success. Can anyone guide me on the best way to interface with this API using Akka? Is Akka Streams the best tool here?
To feed a Source, you don't necessarily need to use a custom graph stage. Source.queue will materialize as a buffered queue to which you can add elements which will then propagate through the stream.
There are a couple of tricky things to be aware of. The first is that there's some subtlety around materializing the Source.queue so you can set up the subscription. Something like this:
def bufferSize: Int = ???
Source.fromMaterializer { (mat, att) =>
val (queue, source) = Source.queue[Event](bufferSize).preMaterialize()(mat)
val subscription = createSubscription()
subscription.addListener(
new Listener() {
def eventsReceived(events: java.util.List[Event]): Unit = {
import scala.collection.JavaConverters.iterableAsScalaIterable
import akka.stream.QueueOfferResult._
iterableAsScalaIterable(events).foreach { event =>
queue.offer(event) match {
case Enqueued => () // do nothing
case Dropped => ??? // handle a dropped pubsub element, might well do nothing
case QueueClosed => ??? // presumably cancel the subscription...
}
}
}
}
)
source.withAttributes(att)
}
Source.fromMaterializer is used to get access at each materialization to the materializer (which is what compiles the stream definition into actors). When we materialize, we use the materializer to preMaterialize the queue source so we have access to the queue. Our subscription adds incoming elements to the queue.
The API for this pubsub doesn't seem to support backpressure if the consumer can't keep up. The queue will drop elements it's been handed if the buffer is full: you'll probably want to do nothing in that case, but I've called it out in the match that you should make an explicit decision here.
Dropping the newest element is the synchronous behavior for this queue (there are other queue implementations available, but those will communicate dropping asynchronously which can be really bad for memory consumption in a burst). If you'd prefer something else, it may make sense to have a very small buffer in the queue and attach the "overall" Source (the one returned by Source.fromMaterializer) to a stage which signals perpetual demand. For example, a buffer(downstreamBufferSize, OverflowStrategy.dropHead) will drop the oldest event not yet processed. Alternatively, it may be possible to combine your Events in some meaningful way, in which case a conflate stage will automatically combine incoming Events if the downstream can't process them quickly.
Great answer! I did build something similar. There are also kamon metrics to monitor queue size exc.
class AsyncSubscriber(projectId: String, subscriptionId: String, metricsRegistry: CustomMetricsRegistry, pullParallelism: Int)(implicit val ec: Executor) {
private val logger = LoggerFactory.getLogger(getClass)
def bufferSize: Int = 1000
def source(): Source[(PubsubMessage, AckReplyConsumer), Future[NotUsed]] = {
Source.fromMaterializer { (mat, attr) =>
val (queue, source) = Source.queue[(PubsubMessage, AckReplyConsumer)](bufferSize).preMaterialize()(mat)
val receiver: MessageReceiver = {
(message: PubsubMessage, consumer: AckReplyConsumer) => {
metricsRegistry.inputEventQueueSize.update(queue.size())
queue.offer((message, consumer)) match {
case QueueOfferResult.Enqueued =>
metricsRegistry.inputQueueAddEventCounter.increment()
case QueueOfferResult.Dropped =>
metricsRegistry.inputQueueDropEventCounter.increment()
consumer.nack()
logger.warn(s"Buffer is full, message nacked. Pubsub should retry don't panic. If this happens too often, we should also tweak the buffer size or the autoscaler.")
case QueueOfferResult.Failure(ex) =>
metricsRegistry.inputQueueDropEventCounter.increment()
consumer.nack()
logger.error(s"Failed to offer message with id=${message.getMessageId()}", ex)
case QueueOfferResult.QueueClosed =>
logger.error("Destination Queue closed. Something went terribly wrong. Shutting down the jvm.")
consumer.nack()
mat.shutdown()
sys.exit(1)
}
}
}
val subscriptionName = ProjectSubscriptionName.of(projectId, subscriptionId)
val subscriber = Subscriber.newBuilder(subscriptionName, receiver).setParallelPullCount(pullParallelism).build
subscriber.startAsync().awaitRunning()
source.withAttributes(attr)
}
}
}

Alpakka S3 connector stream won't handle the load, throwing akka.stream.BufferOverflowException

I have an akka-http service and I am trying out the alpakka s3 connector for uploading files. Previously I was using a temporary file and then uploading with Amazon SDK. This approach required some adjustments for Amazon SDK to make it more scala like, but it could handle even a 1000 requests at once. Throughput wasn't amazing, but all of the requests went through eventually. Here is the code before changes, with no alpakka:
```
path("uploadfile") {
withRequestTimeout(20.seconds) {
storeUploadedFile("csv", tempDestination) {
case (metadata, file) =>
val uploadFuture = upload(file, file.toPath.getFileName.toString)
onComplete(uploadFuture) {
case Success(_) => complete(StatusCodes.OK)
case Failure(_) => complete(StatusCodes.FailedDependency)
}
}
}
}
}
case class S3UploaderException(msg: String) extends Exception(msg)
def upload(file: File, key: String): Future[String] = {
val s3Client = AmazonS3ClientBuilder.standard()
.withCredentials(new DefaultAWSCredentialsProviderChain())
.withRegion(Regions.EU_WEST_3)
.build()
val promise = Promise[String]()
val listener = new ProgressListener() {
override def progressChanged(progressEvent: ProgressEvent): Unit = {
(progressEvent.getEventType: #unchecked) match {
case ProgressEventType.TRANSFER_FAILED_EVENT => promise.failure(S3UploaderException(s"Uploading a file with a key: $key"))
case ProgressEventType.TRANSFER_COMPLETED_EVENT |
ProgressEventType.TRANSFER_CANCELED_EVENT => promise.success(key)
}
}
}
val request = new PutObjectRequest("S3_BUCKET", key, file)
request.setGeneralProgressListener(listener)
s3Client.putObject(request)
promise.future
}
```
When I changed this to use alpakka connector, the code looks much nicer as we can just connect the ByteSource and alpakka Sink together. However this approach cannot handle such a big load. When I execute 1000 requests at once (10 kb files) less than 10% go through and the rest fails with exception:
akka.stream.alpakka.s3.impl.FailedUpload: Exceeded configured
max-open-requests value of [32]. This means that the request queue of
this pool
(HostConnectionPoolSetup(bargain-test.s3-eu-west-3.amazonaws.com,443,ConnectionPoolSetup(ConnectionPoolSettings(4,0,5,32,1,30
seconds,ClientConnectionSettings(Some(User-Agent: akka-http/10.1.3),10
seconds,1
minute,512,None,WebSocketSettings(,ping,Duration.Inf,akka.http.impl.settings.WebSocketSettingsImpl$$$Lambda$4787/1279590204#4d809f4c),List(),ParserSettings(2048,16,64,64,8192,64,8388608,256,1048576,Strict,RFC6265,true,Set(),Full,Error,Map(If-Range
-> 0, If-Modified-Since -> 0, If-Unmodified-Since -> 0, default -> 12, Content-MD5 -> 0, Date -> 0, If-Match -> 0, If-None-Match -> 0,
User-Agent ->
32),false,true,akka.util.ConstantFun$$$Lambda$4534/1539966798#69c23cd4,akka.util.ConstantFun$$$Lambda$4534/1539966798#69c23cd4,akka.util.ConstantFun$$$Lambda$4535/297570074#6b426c59),None,TCPTransport),New,1
second),akka.http.scaladsl.HttpsConnectionContext#7e0f3726,akka.event.MarkerLoggingAdapter#74f3a78b)))
has completely filled up because the pool currently does not process
requests fast enough to handle the incoming request load. Please retry
the request later. See
http://doc.akka.io/docs/akka-http/current/scala/http/client-side/pool-overflow.html
for more information.
Here is how the summary of a Gatling test looks like:
---- Response Time Distribution ----------------------------------------
t < 800 ms 0 ( 0%)
800 ms < t < 1200 ms 0 ( 0%)
t > 1200 ms 90 ( 9%)
failed 910 ( 91%)
When I execute 100 of simultaneous requests, half of it fails. So, still close to satisfying.
This is a new code:
```
path("uploadfile") {
withRequestTimeout(20.seconds) {
extractRequestContext { ctx =>
implicit val materializer = ctx.materializer
extractActorSystem { actorSystem =>
fileUpload("csv") {
case (metadata, byteSource) =>
val uploadFuture = byteSource.runWith(S3Uploader.sink("s3FileKey")(actorSystem, materializer))
onComplete(uploadFuture) {
case Success(_) => complete(StatusCodes.OK)
case Failure(_) => complete(StatusCodes.FailedDependency)
}
}
}
}
}
}
def sink(s3Key: String)(implicit as: ActorSystem, m: Materializer) = {
val regionProvider = new AwsRegionProvider {
def getRegion: String = Regions.EU_WEST_3.getName
}
val settings = new S3Settings(MemoryBufferType, None, new DefaultAWSCredentialsProviderChain(), regionProvider, false, None, ListBucketVersion2)
val s3Client = new S3Client(settings)(as, m)
s3Client.multipartUpload("S3_BUCKET", s3Key)
}
```
The complete code with both endpoints can be seen here
I have a couple of questions.
1) Is this a feature? Is this what we can call a backpressure?
2) If I would like this code to behave like the old approach with a temporary file (no failed requests and all of them finish at some point) what do I have to do? I was trying to implement a queue for the stream (link to the source below), but this made no difference. The code can be seen here.
(* DISCLAIMER * I am still a scala newbie trying to quickly understand akka streams and find some workaround for the issue. There are big chances that there is something simple wrong in this code. * DISCLAIMER *)
It’s a backpressure feature.
Exceeded configured max-open-requests value of [32] In the config max-open-requests is set to 32 by default.
Streaming is used to work with big amount of data, not to handle many many requests per second.
Akka developers had to put something for max-open-requests. They choose 32 for some reason for sure. And they had no idea what it will be used for. May it be sending 1000 32KB files or 1000 1GB files at once? They don’t know. But they still want to make sure that by default (and 80% of people use defaults probably) the apps will be handled gracefully and safely. So they had to limit processing power.
You asked to do 1000 “now” but I am pretty sure AWS did not send 1000 files simultaneously but used some queue, which may be a good case for you too if you have many small files to upload.
But it is perfectly fine to tune it to your case!
If you know your machine and the target will take care of more simultaneous connections, you can change the number to a higher value.
Also, for a lot of HTTP calls use cached host connection pool.

How to get InputStream from request in Play

Ithink this used to be possible in Play 1.x, but I can't find how to do it in Play 2.x
I know that Play is asynchronous and uses Iteratees. However, there is generally much better support for InputStreams.
(In this case, I will be using a streaming JSON parser like Jackson to process the request body.)
How can I get an InputStream from a chunked request body?
I was able to get this working with the following code:
// I think all of the parens and braces line up -- copied/pasted from code
val pos = new PipedOutputStream()
val pis = new PipedInputStream(pos)
val result = Promise[Either[Errors, String]]()
def outputStreamBodyParser = {
BodyParser("outputStream") {
requestHeader =>
val length = requestHeader.headers(HeaderNames.CONTENT_LENGTH).toLong
Future {
result.completeWith(saveFile(pis, length)) // save returns Future[Either[Errors, String]]
}
Iteratee.fold[Array[Byte], OutputStream](pos) {
(os, data) =>
os.write(data)
os
}.map {
os =>
os.close()
Right(os)
}
}
}
Action.async(parse.when(
requestHeaders => {
val maybeContentLength = requestHeaders.headers.get(HeaderNames.CONTENT_LENGTH)
maybeContentLength.isDefined && allCatch.opt(maybeContentLength.get.toLong).isDefined
},
outputStreamBodyParser,
requestHeaders => Future.successful(BadRequest("Missing content-length header")))) {
request =>
result.future.map {
case Right(fileRef) => Ok(fileRef)
case Left(errors) => BadRequest(errors)
}
}
Play 2 is meant to be fully asynchronous, so this isn't easily possible or desirable. The problem with InputStream is there is no push back, there is no way for the reader of the InputStream to communicate to the input that it wants more data without blocking on read. Technically it is possible to write an Iteratee that could read data and put it into an InputStream, and would wait for a call to read on the InputStream before asking the Enumerator for more data, but it would be dangerous. You would have to make sure that the InputStream was closed properly, or the Enumerator would sit waiting forever (or until it times out) and the call to read must be made from a thread that is not running on the same ExecutionContext as the Enumerator and Iteratee or the application could deadlock.

how to cancel ConsoleReader.readLine()

first of all, i'm learning scala and new to the java world.
I want to create a console and run this console as a service that you could start and stop.
I was able to run a ConsoleReader into an Actor but i don't know how to stop properly the ConsoleReader.
Here is the code :
import eu.badmood.util.trace
import scala.actors.Actor._
import tools.jline.console.ConsoleReader
object Main {
def main(args:Array[String]){
//start the console
Console.start(message => {
//handle console inputs
message match {
case "exit" => Console.stop()
case _ => trace(message)
}
})
//try to stop the console after a time delay
Thread.sleep(2000)
Console.stop()
}
}
object Console {
private val consoleReader = new ConsoleReader()
private var running = false
def start(handler:(String)=>Unit){
running = true
actor{
while (running){
handler(consoleReader.readLine("\33[32m> \33[0m"))
}
}
}
def stop(){
//how to cancel an active call to ConsoleReader.readLine ?
running = false
}
}
I'm also looking for any advice concerning this code !
The underlying call to read a characters from the input is blocking. On non-Windows platform, it will use System.in.read() and on Windows it will use org.fusesource.jansi.internal.WindowsSupport.readByte.
So your challenge is to cause that blocking call to return when you want to stop your console service. See http://www.javaspecialists.eu/archive/Issue153.html and Is it possible to read from a InputStream with a timeout? for some ideas... Once you figure that out, have read return -1 when your console service stops, so that ConsoleReader thinks it's done. You'll need ConsoleReader to use your version of that call:
If you are on Windows, you'll probably need to override tools.jline.AnsiWindowsTerminal and use the ConsoleReader constructor that takes a Terminal (otherwise AnsiWindowsTerminal will just use WindowsSupport.readByte` directly)
On unix, there is one ConsoleReader constructor that takes an InputStream, you could provide your own wrapper around System.in
A few more thoughts:
There is a scala.Console object already, so for less confusion name yours differently.
System.in is a unique resource, so you probably need to ensure that only one caller uses Console.readLine at a time. Right now start will directly call readLine and multiple callers can call start. Probably the console service can readLine and maintain a list of handlers.
Assuming that ConsoleReader.readLine responds to thread interruption, you could rewrite Console to use a Thread which you could then interrupt to stop it.
object Console {
private val consoleReader = new ConsoleReader()
private var thread : Thread = _
def start(handler:(String)=>Unit) : Thread = {
thread = new Thread(new Runnable {
override def run() {
try {
while (true) {
handler(consoleReader.readLine("\33[32m> \33[0m"))
}
} catch {
case ie: InterruptedException =>
}
}
})
thread.start()
thread
}
def stop() {
thread.interrupt()
}
}
You may overwrite your ConsoleReader InputStream. IMHO this is reasonable well because of STDIN is a "slow" stream. Please improve example for your needs. This is only sketch, but it works:
def createReader() =
terminal.synchronized {
val reader = new ConsoleReader
terminal.enableEcho()
reader.setBellEnabled(false)
reader.setInput(new InputStreamWrapper(reader.getInput())) // turn on InterruptedException for InputStream.read
reader
}
with InputStream wrapper:
class InputStreamWrapper(is: InputStream, val timeout: Long = 50) extends FilterInputStream(is) {
#tailrec
final override def read(): Int = {
if (is.available() != 0)
is.read()
else {
Thread.sleep(timeout)
read()
}
}
}
P.S. I tried to use NIO - a lot of troubles with System.in (especially crossplatform). I returned to this variant. CPU load is near 0%. This is suitable for such interactive application.