My flink job keeps reading the same bytes over and over - scala

I have a Flink job that always falls behind no matter how much resource I throw at it.
I am using Flink 1.15.2, but the problem also happened on 1.14.4 and 1.14.6.
Here is how I am connecting to the input stream:
val properties = getInputProperties(kafkaReadConfig)
val byteArrayDeserializer = new AbstractDeserializationSchema[Array[Byte]]() {
  override def deserialize(bytes: Array[Byte]): Array[Byte] = bytes
}
val flinkKafkaConsumer = new FlinkKafkaConsumer[Array[Byte]](kafkaReadConfig.topic, byteArrayDeserializer, properties)
val res = env.addSource(flinkKafkaConsumer).setParallelism(kafkaReadConfig.parallelism).name(kafkaReadConfig.topic)
Just doing a simple flatMap, I see the same bytes being read over and over again. My pipeline ends in a print (for debugging):
inputStream.setParallelism(1)
  .flatMap(lookForDupes).setParallelism(1)
  .print()
Is there something dumb I am doing? Do I need to verify that the packet was processed somehow?
Looking for some way to mark the block as read at this point.

Have you inspected the task manager logs for clues? This can happen if the job is in a fail -> recover -> fail again loop.
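If it does turn out to be restarting, one thing worth double-checking (an assumption about your setup, nothing in the snippet shows it either way) is whether checkpointing is enabled and actually completing: the Kafka source commits offsets back to Kafka on completed checkpoints, so a job stuck in a recovery loop before its first successful checkpoint keeps rewinding to the same position and re-reads the same bytes. A minimal sketch, reusing env and flinkKafkaConsumer from your snippet:
// Hedged sketch, not a drop-in fix: make restarts visible/bounded and make sure
// offsets can actually be committed. `env` and `flinkKafkaConsumer` are the
// values defined in the question.
import org.apache.flink.api.common.restartstrategy.RestartStrategies
import org.apache.flink.api.common.time.Time

env.enableCheckpointing(60000)                          // checkpoint (and commit offsets) about once a minute
flinkKafkaConsumer.setCommitOffsetsOnCheckpoints(true)  // this is the default, shown here for clarity
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, Time.seconds(10)))  // give up after 3 attempts so the loop surfaces as a failure in the logs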

Related

Spark mapPartitions not working in yarn-cluster mode

I am running a Spark Scala program that performs text scanning on an input file. I am trying to achieve parallelism by using rdd.mapPartitions. Inside the mapPartitions block I perform a few checks and then call the map function to achieve parallel execution for each partition. Inside the map function I call a custom method where I perform the scanning and send the results back.
Now, the code works fine when I submit it using --master local[*], but the same code does not work when I submit it using --master yarn-cluster. It runs without any error, but the call never gets inside the mapPartitions block itself. I verified this by placing a few println statements.
Please help me with your suggestions.
Here is the sample code:
def main(args: Array[String]) {
  val inputRdd = sc.textFile(inputFile, 2)
  val resultRdd = inputRdd.mapPartitions { iter =>
    println("Inside scanning method..")
    var scanEngine = ScanEngine.getInstance();
    ...
    ....
    ....
    var mapresult = iter.map { y =>
      line = y
      val last = line.lastIndexOf("|");
      message = line.substring(last + 1, line.length());
      getResponse(message)
    }
  }
  val finalRdd = sc.parallelize(resultRdd.map(x => x.trim()))
  finalRdd.coalesce(1, true).saveAsTextFile(hdfsOutpath)
}

def getResponse(input: String): String = {
  var result = "";
  val rList = new ListBuffer[String]();
  try {
    //logic here
  }
  return result;
}
If your evidence of it working is seeing Inside scanning method.. printed out, it won't show up when run on the cluster because that code is executed by the workers, not the driver.
You're going to have to go over the code in forensic detail, with an open mind and try to find why the job has no output. Usually when a job works on local mode but not on a cluster it is because of some subtlety in where the code is executed, or where output is recorded.
There's too much clipped code to provide a more specific answer.
Spark achieves parallelism with the plain map function as well as with mapPartitions. The number of partitions determines the amount of parallelism, and each partition executes independently whether or not you use mapPartitions.
There are only a few reasons to use mapPartitions over map; for example, when a function has a high initialization cost but can then be called many times, such as doing some NLP task on text.
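To make that concrete, here is a hedged sketch of the case described above, loosely modeled on the question's code: the expensive ScanEngine is built once per partition instead of once per record, and the iterator produced by iter.map is the last expression of the block, so it is what mapPartitions returns. ScanEngine.getInstance mirrors the name in the question; scanEngine.scan is a hypothetical stand-in for the clipped logic.
val resultRdd = inputRdd.mapPartitions { iter =>
  val scanEngine = ScanEngine.getInstance() // paid once per partition, not once per line
  iter.map { line =>
    val message = line.substring(line.lastIndexOf("|") + 1)
    scanEngine.scan(message)                // hypothetical per-record call
  }                                         // the mapped iterator is what gets returned to Spark
}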

Is this a suitable way to implement a lazy `take` on RDD?

It's quite unfortunate that take on RDD is a strict operation instead of lazy but I won't get into why I think that's a regrettable design here and now.
My question is whether this is a suitable implementation of a lazy take for RDD. It seems to work, but I might be missing some non-obvious problem with it.
def takeRDD[T: scala.reflect.ClassTag](rdd: RDD[T], num: Long): RDD[T] =
  new RDD[T](rdd.context, List(new OneToOneDependency(rdd))) {
    // An unfortunate consequence of the way the RDD AST is designed
    var doneSoFar = 0L
    def isDone = doneSoFar >= num

    override def getPartitions: Array[Partition] = rdd.partitions

    // Should I do this? Doesn't look like I need to
    // override val partitioner = self.partitioner

    override def compute(split: Partition, ctx: TaskContext): Iterator[T] = new Iterator[T] {
      val inner = rdd.compute(split, ctx)

      override def hasNext: Boolean = !isDone && inner.hasNext

      override def next: T = {
        doneSoFar += 1
        inner.next
      }
    }
  }
Answer to your question
No, this doesn't work. There's no way to have a variable that can be seen and updated concurrently across a Spark cluster, and that's exactly what you're trying to use doneSoFar as. If you try this, then when you run compute (in parallel across many nodes), you:
a) serialize the takeRDD in the task, because you reference the class variable doneSoFar. This means that you write the class to bytes and make a new instance in each JVM (executor)
b) update doneSoFar in compute, which updates the local instance on each executor JVM. You'll take a number of elements from each partition equal to num.
It's possible this will work in Spark local mode due to some of the JVM properties there, but it CERTAINLY will not work when running Spark in cluster mode.
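A minimal way to see (a) and (b) for yourself is the captured-variable example from the Spark programming guide; the sketch below assumes an existing SparkContext sc:
var counter = 0
sc.parallelize(1 to 100).foreach(_ => counter += 1) // each executor increments its own deserialized copy
println(counter)                                    // still 0 in cluster mode: the driver's copy was never touched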
Why take is an action, not transformation
RDDs are distributed, and so subsetting to an exact number of elements is an inefficient operation -- it can't be done totally in parallel, since each shard needs information about the other shards (like whether it should be computed at all). Take is great for bringing distributed data back into local memory.
rdd.sample is a similar operation that stays in the distributed world, and can be run in parallel easily.
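A small sketch of that distinction, assuming an existing SparkContext sc (both calls are standard RDD API):
val rdd = sc.parallelize(1 to 1000000)

val local = rdd.take(10)          // action: runs jobs now and pulls 10 elements into driver memory
val sampled = rdd.sample(         // transformation: lazy, stays distributed,
  withReplacement = false,        // returns roughly fraction * count elements
  fraction = 0.0001,
  seed = 42L)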

akka streams, InputStream and binary data

I'm trying to use akka streams (scala) to read gzipped data streamed from Play's WS client. The data ultimately need to be presented as an InputStream to a legacy API. I'm using akka 2.4.10, so using akka.stream.scaladsl.Compression isn't an option. If I do the following (replacing the Play stuff with a toy app for simplicity; the results in Play are the same):
object AkkaTester extends App {
  implicit val system = ActorSystem("system")
  implicit val materializer = ActorMaterializer()

  val filename = "/tmp/helloworld.txt.gz"
  val path = Paths.get(filename)
  val sink = StreamConverters.asInputStream()
  val akkaInput = new GZIPInputStream(FileIO.fromPath(path).runWith(sink))

  var akkaByte = akkaInput.read()
  while (akkaByte != -1) {
    print(akkaByte)
    akkaByte = akkaInput.read() // advance, otherwise the loop never terminates
  }
  akkaInput.close()
  system.terminate()
}
...then it generates the following:
Exception in thread "main" java.io.IOException: akka.stream.impl.io.InputStreamAdapter.read() returned value out of range -1..255: -117
at java.util.zip.GZIPInputStream.readUByte(GZIPInputStream.java:272)
at java.util.zip.GZIPInputStream.readUShort(GZIPInputStream.java:259)
at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:164)
at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:79)
at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:91)
I'm new to akka streams, so I'm not sure if this is a bug (the java docs for InputStream.read() clearly state that it should return values in the range of 0..255, so it feels like a bug) or just me mis-using akka streams in some way. FWIW, adding 256 to the negative value would give us the value that we would get if we used a straight FileInputStream in place of akka streams.
UPDATE
Definitely a bug - the read() method in the relevant code returns a raw, signed byte, meaning it will be in the range -128 to 127 rather than the required 0 to 255. I've filed an issue on it.
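For anyone puzzled by the -117 in the meantime, the numbers line up with the "add 256" observation above: the second gzip magic byte is 0x8B, which comes back as -117 when the adapter returns the raw signed byte, and masking with & 0xFF gives the 139 that InputStream.read() is contractually required to return:
val magic: Byte = 0x8B.toByte // second byte of the gzip header
println(magic)                // -117: the raw signed byte the buggy adapter returned
println(magic & 0xFF)         // 139 (= -117 + 256): the value read() should return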

How do I cleanly log to io.stdOutLines and respond to the client with a scalaz.stream.tcp server

I'm very new to both scalaz-stream and specifically scalaz.stream.tcp. I'm trying to do a very simple server for my own educational purposes. I parse the requests into commands, execute them to produce responses, and write the responses back to the client. The part I am having issues with is that I want to log each received command to stdout.
Here is my inner Process that I am passing to tcp.server:
def inner: Process[tcp.Connection, Unit] = {
  val requests: Process[Connection, String] = tcp.reads(1024) pipe text.utf8Decode
  val cmds: Process[Connection, Command] = requests.map(parseRequest)
  val header: Process[Task, ByteVector] = Process("HEADER\n").pipe(text.utf8Encode)
  val loggedCmds: Process[Connection, Command] = cmds.map { cmd =>
    println(cmd.toString)
    cmd
  }
  val results: Process[Connection, Process[Task, ByteVector]] = loggedCmds.map(_.execute)
  val processedRequests: Process[Connection, Unit] = results.flatMap(result => tcp.writes(tcp.lift(header ++ result)))
  processedRequests
}
(I am not in the habit of specifying the types everywhere; I just did that to try to get a handle on things. I plan to remove those.)
The above code actually compiles and runs correctly, but I do not feel it is very clean or idiomatic. Specifically I am unhappy with the loggedCmds part. I wanted to use io.stdOutLines, either through .observer or using writer.logged/mapW/drainW, but no matter what I tried I could not seem to get the types to line up correctly. I was always getting type conflicts between Task and Connection. tcp.lift seems to help with an input stream, but it does not seem to work for a Sink. Is there a cleaner/better way to do the loggedCmds part (FWIW: I'm open to corrections or improvements to any of the above code).
I should note that if I just have the results go to stdout via io.stdOutLines I do not have an issue ("through" seems to work in that case, which I have seen in examples); the problem only arises when I want to send the stream to io.stdOutLines and also continue using the stream to respond to the client.
Figured it out on my own (finally). Using ".toChannel" I was able to do it:
val cmdFormatter = process1.id[Command].map(_.toString)
val cmdPrinter = io.stdOutLines.pipeIn(cmdFormatter)
...
val cmds: Process[Connection, Command] = requests.map(parseRequest) through
  cmdPrinter.toChannel
A shorter solution is
val log = stdOutLines contramap { (_: Any).toString }
requests map(parseRequest) observe log

akka timeout when using spray client for multiple request

Using spray 1.3.2 with akka 2.3.6 (akka is used only for spray).
I need to read huge files and make an HTTP request for each line.
I read the files line by line with an iterator, and for each item I make the request.
It runs successfully for some of the lines, but at some point it starts to fail with:
akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://default/user/IO-HTTP#-35162984]] after [60000 ms].
I first thought I was overloading the service, so I set "spray.can.host-connector.max-connections" to 1. It ran much slower, but I got the same errors.
Here is the code:
import spray.http.MediaTypes._

val EdnType = register(
  MediaType.custom(
    mainType = "application",
    subType = "edn",
    compressible = true,
    binary = false,
    fileExtensions = Seq("edn")))

val pipeline = (
  addHeader("Accept", "application/json")
    ~> sendReceive
    ~> unmarshal[PipelineResponse])

def postData(data: String) = {
  val request = Post(pipelineUrl).withEntity(HttpEntity.apply(EdnType, data))
  val responseFuture: Future[PipelineResponse] = pipeline(request)
  responseFuture
}

dataLines.map { d =>
  val f = postData(d)
  f.onFailure { case e => println("Error - " + e) } // This is where the errors are displayed
  f.map { p => someMoreLogic(d, p) }
}

aggrigateResults(dataLines)
I do it this way since I don't need the entire data set, just some aggregations.
How can I solve this and keep it entirely async?
Akka ask timeout is implemented via firstCompletedOf, so the timer starts when the ask is initialized.
What you seem to be doing, is spawning a Future for each line (during the map) - so all your calls execute nearly at the same time. The timeouts start counting when the futures are initialized, but there are no executor threads left for all the spawned actors to do their work. Hence the asks time out.
Instead of processing everything "all at once", I would suggest a more flexible approach, somewhat similar to using iteratees or akka-streams: the Work Pulling Pattern (GitHub).
You provide the iterator that you already have as an Epic. Introduce a Worker actor, which will perform the call and some logic. If you spawn N workers, there will be at most N lines being processed concurrently (and the processing pipeline may involve multiple steps). This way you can ensure that you are not overloading the executors, and the timeouts shouldn't happen.
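If wiring in the work-pulling machinery feels like too much, a cruder sketch of the same idea is to cap the number of in-flight requests by processing the lines in bounded groups. postData and someMoreLogic are the question's own methods, batchSize is a hypothetical knob, and the Await between batches means this is simpler but not entirely async like the worker approach:
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

val batchSize = 8
dataLines.grouped(batchSize).foreach { batch =>
  // at most batchSize requests are outstanding at any time
  val inFlight = Future.traverse(batch.toList)(d => postData(d).map(p => someMoreLogic(d, p)))
  Await.result(inFlight, 5.minutes) // finish this batch before starting the next
}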