How can I get some chunks come from what flow using mitmproxy? - mitmproxy

I am building web proxy using the following mitmproxy.
https://github.com/mitmproxy/mitmproxy
I want to scan streaming body and pass them to client.
mitmproxy already has the similar feature; https://github.com/mitmproxy/mitmproxy/blob/master/mitmproxy/addons/streambodies.py
I implemented the following code.
def responseheaders(self, flow):
flow.response.stream = self.response_stream
def response_stream(self, chunks):
for chunk in chunks:
if not my_function(chunk):
raise Exception('catch')
yield chunk
In my above code, there is no way to catch some chunks come from whose flow when my_function returns False.
i want to get from what flow some chunks come in flow.response.stream feature.

I solved my question with the following code.
from functools import partial
def responseheaders(self, flow):
flow.response.stream = partial(self.response_stream, flow=flow)
def response_stream(self, chunks, flow):
for chunk in chunks:
yield chunk
Otherwise, using __call__ in object could be one of solutions.
class Streamer:
def __init__(self, flow):
self.flow = flow
def __call__(self, chunks):
for chunk in chunks:
yield chunk
def responseheaders(self, flow):
flow.response.stream = Streamer(flow)

Related

How to update the "bytes written" count in custom Spark data source?

I created a Spark Data Source that uses the "older" DataSource V1 API to write data in a specific binary format our measuring devices and some software requires, i.e., my DefaultSource extends CreatableRelationProvider.
In the appropriate createRelation method I call my own custom method to write the data from the DataFrame passed in. I am doing this with the help of Hadoop's FileSystem API, initialized from the Hadoop Configuration one can pull out of the supplied DataFrame:
def createRelation(sqlContext: SQLContext,
mode : SaveMode,
parameters: Map[String, String],
data : DataFrame): BaseRelation = {
val path = ... // get from parameters; in real here is more preparation code, checking save mode etc.
MyCustomWriter.write(data, path)
EchoingRelation(data) // small class that just wraps the data frame into a BaseRelation with TableScan
}
In the MyCustomWriter I then do all sorts of things, and in the end, I save data as a side effect to map, mapPartitions and foreachPartition calls on the executors of the cluster, like this:
val confBytes = conf.toByteArray // implicit I wrote turning Hadoop Writables to Byte Array, as Configuration isn't serializable
data.
select(...).
where(...).
// much more
as[Foo].
mapPartitions { it =>
val conf = confBytes.toWritable[Configuration] // vice-versa like toByteArray
val writeResult = customWriteRecords(it, conf) // writes data to the disk using Hadoop FS API
writeResult.iterator
}.
// do more stuff
While this approach works fine, I notice that when running this, the Output column in the Spark job UI is not updated. Is it somehow possible to propagate this information or do I have to wrap the data in Writables and use a Hadoop FileOutputFormat approach instead?
I found a hacky approach.
Inside a RDD/DF operation you can get OutputMetrics:
val metrics = TaskContext.get().taskMetrics().outputMetrics
This has the fields bytesWritten and recordsWritten. However, the setters are package-local for org.apache.spark.executor. So, I created a "breakout object" in the package:
package org.apache.spark.executor
object OutputMetricsBreakout {
def setRecordsWritten(outputMetrics: OutputMetrics,
recordsWritten: Long): Unit =
outputMetrics.setRecordsWritten(recordsWritten)
def setBytesWritten(outputMetrics: OutputMetrics,
bytesWritten: Long): Unit =
outputMetrics.setBytesWritten(bytesWritten)
}
Then I can use:
val myBytesWritten = ... // calculate written bytes
OutputMetricsBreakout.setBytesWritten(metrics, myBytesWritten + metrics.bytesWritten)
This is a hack but the only "simple" way I could come up with.

fs2.Stream[IO, Something] does not return on take(1)

I have a function that returns a fs2.Stream of Measurements.
import cats.effect._
import fs2._
def apply(sds: SerialPort, interval: Int)(implicit cs: ContextShift[IO]): Stream[IO, SdsMeasurement] =
for {
blocker <- Stream.resource(Blocker[IO])
stream <- io.readInputStream(IO(sds.getInputStream), 1, blocker)
.through(SdsStateMachine.collectMeasurements())
} yield stream
Normally it is an infinite Stream, unless I pass it a test flag, in which case it should output one value and halt.
val infiniteSource: Stream[IO, SdsMeasurement] = ...
val source = if (isTest) infiniteSource.take(1) else infiniteSource
source.compile.drain
The infinite Stream works fine. It gives me all Measurements infinitely. The test Stream indeed gives me only the first measurement, nothing more. The problem I have is that the Stream does not return after this last measurements. It blocks forever. What am I doing wrong?
Note: I think I abstracted the essential code, but for more context, please take a look at my project: https://github.com/jkransen/fijnstof/blob/ZIO/src/main/scala/nl/kransen/fijnstof/Main.scala
The code you've presented here looks fine, I don't think the issue lies within that code. If it blocks, then presumably one of the underlying APIs blocks, for instance it might be the close method of the InputStream. What I usually do in such situations is to add log statements before and after every function call that might block.

Scala fs2 Streams with Chunks and Tasks?

// Simulated external API that synchronously returns elements one at a time indefinitely.
def externalApiGet[A](): A = ???
// This wraps with the proper fs2 stream that will indefinitely return values.
def wrapGetWithFS2[A](): Stream[Task, A] = Stream.eval(Task.delay(externalApiGet))
// Simulated external API that synchronously returns "chunks" of elements at a time indefinitely.
def externalApiGetSeq[A](): Seq[A] = ???
// How do I wrap this with a stream that hides the internal chunks and just provides a stream of A values.
// The following doesn't compile. I need help fixing this.
def wrapGetSeqWithFS2[A](): Stream[Task, A] = Stream.eval(Task.delay(externalApiGetSeq))
You need to mark the sequence as a Chunk and then use flatMap to flatten the stream.
def wrapGetSeqWithFS2[A](): Stream[Task, A] =
Stream.eval(Task.delay(externalApiGetSeq()))
.flatMap(Stream.emits)
(Edited to simplify solution)

How to read a web page, as a stream of lines, using akka-http 2.4.6

I've done a sample project in GitHub: akauppi/akka-2.4.6-trial
What I want seems simple: read a URL, provide the contents as a line-wise stream of Strings. Now I've struggled with this (and reading documentation) for the whole day so decided to push the sample public and ask for help.
I'm comfortable with Scala. I know Akka, and last time I've used Akka-streams it was probably pre-2.4. Now, I'm lost.
Questions:
On these lines I'd like to return a Source[String,Any], not a Future (note: those lines do not compile).
The problem probably is that Http().singleRequest(...) materialises the flow, and I don't want that. How to just inject the "recipe" of reading a web page without actually reading it?
def sourceAsByteString(url: URL)(implicit as: ActorSystem, mat: Materializer): Source[ByteString, Any] = {
import as.dispatcher
val req: HttpRequest = HttpRequest( uri = url.toString )
val tmp: Source[ByteString, Any] = Http().singleRequest(req).map( resp => resp.entity.dataBytes ) // does not compile, gives a 'Future'
tmp
}
The problem is that the chunks you get from the server are not lines, but might be anything. You will often get small responses in a single chunk. So you have to split the stream to lines yourself.
Something like this should work:
import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.client.RequestBuilding._
import akka.stream.ActorMaterializer
implicit val system = ActorSystem("test")
implicit val mat = ActorMaterializer()
val delimiter: Flow[ByteString, ByteString, NotUsed] =
Framing.delimiter(
ByteString("\r\n"),
maximumFrameLength = 100000,
allowTruncation = true)
import system.dispatcher
val f = Http().singleRequest(Get("http://www.google.com")).flatMap { res =>
val lines = res.entity.dataBytes.via(delimiter).map(_.utf8String)
lines.runForeach { line =>
println(line)
}
}
f.foreach { _ =>
system.terminate()
}
Note that if you wanted to return the lines instead of printing them, you would end up with a Future[Source[String, Any]], which is unavoidable because everything in akka-http is async. You could "flatten" this to a Source[String, Any] that produces no elements in case of a failed request, but that would probably not be a good idea.
To get a "recipe" for reading a web page, you could use Http().outgoingConnection("http://www.google.com"), which creates a Flow[HttpRequest, HttpResponse, Future[OutgoingConnection]], so a thing where you put in HttpRequest objects and get back HttpResponse objects.
The problem probably is that Http().singleRequest(...) materialises
the flow, and I don't want that.
That was indeed at the heart of the problem. There are two ways to start:
Http().singleRequest(...) leads to a Future (i.e. materializes the stream, already in the very beginning).
Source.single(HttpRequest(...)) leads to a Source (non-materialized).
http://doc.akka.io/docs/akka/2.4.7/scala/http/client-side/connection-level.html#connection-level-api
Ideally, such an important difference would be visible in the names of the methods used, but it's not. One simply has to know it, and understand the two above approaches are actually vastly different.
#RüdigerKlaehn's answer covers the linewise cutting pretty well, but also see the Cookbook
when dealing with a Source, use mapConcat in place of the flatMap (Futures), to flatten the res.entity.dataBytes (which is an inner stream). Having these two levels of streams (requests, then chunks per request) adds to the mental complexity especially since we only have one of the outer entities.
There might still be some simpler way, but I'm not looking at that actively any more. Things work. Maybe once I become more fluent with akka streams, I'll suggest a further solution.
Code for reading an HTTP response, linewise (akka-http 1.1.0-RC2):
val req: HttpRequest = ???
val fut: Future[Source[String,_]] = Http().singleRequest(req).map { resp =>
resp.entity.dataBytes.via(delimiter)
.map(_.utf8String)
}
delimiter as in #Rüdiger-klaehn's answer.

Memory consumption of a parallel Scala Stream

I have written a Scala (2.9.1-1) application that needs to process several million rows from a database query. I am converting the ResultSet to a Stream using the technique shown in the answer to one of my previous questions:
class Record(...)
val resultSet = statement.executeQuery(...)
new Iterator[Record] {
def hasNext = resultSet.next()
def next = new Record(resultSet.getString(1), resultSet.getInt(2), ...)
}.toStream.foreach { record => ... }
and this has worked very well.
Since the body of the foreach closure is very CPU intensive, and as a testament to the practicality of functional programming, if I add a .par before the foreach, the closures get run in parallel with no other effort, except to make sure that the body of the closure is thread safe (it is written in a functional style with no mutable data except printing to a thread-safe log).
However, I am worried about memory consumption. Is the .par causing the entire result set to load in RAM, or does the parallel operation load only as many rows as it has active threads? I've allocated 4G to the JVM (64-bit with -Xmx4g) but in the future I will be running it on even more rows and worry that I'll eventually get an out-of-memory.
Is there a better pattern for doing this kind of parallel processing in a functional manner? I've been showing this application to my co-workers as an example of the value of functional programming and multi-core machines.
If you look at the scaladoc of Stream, you will notice that the definition class of par is the Parallelizable trait... and, if you look at the source code of this trait, you will notice that it takes each element from the original collection and put them into a combiner, thus, you will load each row into a ParSeq:
def par: ParRepr = {
val cb = parCombiner
for (x <- seq) cb += x
cb.result
}
/** The default `par` implementation uses the combiner provided by this method
* to create a new parallel collection.
*
* #return a combiner for the parallel collection of type `ParRepr`
*/
protected[this] def parCombiner: Combiner[A, ParRepr]
A possible solution is to explicitly parallelize your computation, thanks to actors for example. You can take a look at this example from the akka documentation for example, that might be helpful in your context.
The new akka stream library is the fix you're looking for:
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Source, Sink}
def iterFromQuery() : Iterator[Record] = {
val resultSet = statement.executeQuery(...)
new Iterator[Record] {
def hasNext = resultSet.next()
def next = new Record(...)
}
}
def cpuIntensiveFunction(record : Record) = {
...
}
implicit val actorSystem = ActorSystem()
implicit val materializer = ActorMaterializer()
implicit val execContext = actorSystem.dispatcher
val poolSize = 10 //number of Records in memory at once
val stream =
Source(iterFromQuery).runWith(Sink.foreachParallel(poolSize)(cpuIntensiveFunction))
stream onComplete {_ => actorSystem.shutdown()}