Scala making parallel network calls using Futures - scala

i'm new to Scala, i have a method, that reads data from the given list of files and does api calls with
the data, and writes the response to a file.
listOfFiles.map { file =>
val bufferedSource = Source.fromFile(file)
val data = bufferedSource.mkString
bufferedSource.close()
val response = doApiCall(data) // time consuming task
if (response.nonEmpty) writeFile(response, outputLocation)
}
the above method, is taking too much time, during the network call, so tried to do using parallel
processing to reduce the time.
so i tried wrapping the block of code, which consumes more time, but the program ends quickly
and its not generating any output, as the above code.
import scala.concurrent.ExecutionContext.Implicits.global
listOfFiles.map { file =>
val bufferedSource = Source.fromFile(file)
val data = bufferedSource.mkString
bufferedSource.close()
Future {
val response = doApiCall(data) // time consuming task
if (response.nonEmpty) writeFile(response, outputLocation)
}
}
it would be helpful, if you have any suggestions.
(I also tried using "par", it works fine,
I'm exploring other options other than 'par' and using frameworks like 'akka', 'cats' etc)

Based on Jatin instead of using default execution context which contains deamon threads
import scala.concurrent.ExecutionContext.Implicits.global
define execution context with non-deamon threads
implicit val nonDeamonEc = ExecutionContext.fromExecutor(Executors.newCachedThreadPool)
Also you can use Future.traverse and Await like so
val resultF = Future.traverse(listOfFiles) { file =>
val bufferedSource = Source.fromFile(file)
val data = bufferedSource.mkString
bufferedSource.close()
Future {
val response = doApiCall(data) // time consuming task
if (response.nonEmpty) writeFile(response, outputLocation)
}
}
Await.result(resultF, Duration.Inf)
traverse converts List[Future[A]] to Future[List[A]].

Related

What the best way to execute "not transformation" actions in elements of a Dataset

Newly coming in spark, I'm looking for a way to execute actions in all elements of a Dataset with Spark structured streaming:
I know this is a specific purpose case, what I want is iterate through all elements of Dataset, do an action on it, then continue to work with Dataset.
Example:
I got val df = Dataset[Person], I would like to be able to do something like:
def execute(df: Dataset[Person]): Dataset[Person] = {
df.foreach((p: Person) => {
someHttpClient.doRequest(httpPostRequest(p.asString)) // this is pseudo code / not compiling
})
df
}
Unfortunately, foreach is not available with structured streaming since I got error "Queries with streaming sources must be executed with writeStream.start"
I tried to use map(), but then error "Task not serializable" occured, I think because http request, or http client, is not serializable.
I know Spark is mostly use for filter and transform, but is there a way to handle well this specific use case ?
Thanks :)
val conf = new SparkConf().setMaster(“local[*]").setAppName(“Example")
val jssc = new JavaStreamingContext(conf, Durations.seconds(1)) // second option tell about The time interval at which streaming data will be divided into batches
Before concluding on whether a solution exists or not
Let’s as few questions
How does Spark Streaming work?
Spark Streaming receives live input data streams from input source and divides the data into batches, which are then processed by the Spark engine and final batch results are pushed down to downstream applications
How Does the batch execution start?
Spark does lazy evaluations on all the transformation applied on Dstream.it will apply transformation on actions (i.e only when you start streaming context)
jssc.start(); // Start the computation
jssc.awaitTermination(); // Wait for the computation to terminate.
Note : Each Batch of Dstream contains multiple partitions ( it is just like running sequence of spark-batch job until input source stop producing data)
So you can have custom logic like below.
dStream.foreachRDD(new VoidFunction[JavaRDD[Object]] {
override def call(t: JavaRDD[Object]): Unit = {
t.foreach(new VoidFunction[Object] {
override def call(t: Object): Unit = {
//pseudo code someHttpClient.doRequest(httpPostRequest(t.asString))
}
})
}
})
But again make sure your someHttpClient is serializable or
you can create that object As mentioned below.
dStream.foreachRDD(new VoidFunction[JavaRDD[Object]] {
override def call(t: JavaRDD[Object]): Unit = {
// create someHttpClient object
t.foreach(new VoidFunction[Object] {
override def call(t: Object): Unit = {
//pseudo code someHttpClient.doRequest(httpPostRequest(t.asString))
}
})
}
})
Related to Spark Structured Streaming
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.sql._;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.streaming.StreamingQuery
import java.util.Arrays;
import java.util.Iterator;
val spark = SparkSession
.builder()
.appName("example")
.getOrCreate();
val lines = spark.readStream.format("socket").option("host", "localhost").option("port", 9999).load(); // this is example source load copied from spark-streaming doc
lines.foreach(new ForeachFunction[Row] {
override def call(t: Row): Unit = {
//someHttpClient.doRequest(httpPostRequest(p.asString))
OR
// create someHttpClient object here and use it to tackle serialization errors
}
})
// Start running the query foreach and do mention downstream sink below/
val query = lines.writeStream.start
query.awaitTermination()

Making HTTP post requests on Spark usign foreachPartition

Need some help to understand the behaviour of the below in Spark (using Scala and Databricks)
I have some dataframe (reading from S3 if that matters), and would send that data by making HTTP post requests in batches of 1000 (at most). So I repartitioned the dataframe to make sure each partition has no more than 1000 records. Also, created a json column for each line (so I need only to put them in an array later on)
The trouble is on the making the requests. I created the following a Serializable class using the following code
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.http.client.methods.HttpPost
import org.apache.http.impl.client.HttpClientBuilder
import org.apache.http.HttpHeaders
import org.apache.http.entity.StringEntity
import org.apache.commons.io.IOUtils
object postObject extends Serializable{
val client = HttpClientBuilder.create().build()
val post = new HttpPost("https://my-cool-api-endpoint")
post.addHeader(HttpHeaders.CONTENT_TYPE,"application/json")
def makeHttpCall(row: Iterator[Row]) = {
val json_str = """{"people": [""" + row.toSeq.map(x => x.getAs[String]("json")).mkString(",") + "]}"
post.setEntity(new StringEntity(json_str))
val response = client.execute(post)
val entity = response.getEntity()
println(Seq(response.getStatusLine.getStatusCode(), response.getStatusLine.getReasonPhrase()))
println(IOUtils.toString(entity.getContent()))
}
}
Now when I try the following:
postObject.makeHttpCall(data.head(2).toIterator)
It works like a charm. The requests go through, there is some output on the screen, and my API gets that data.
But when I try to put it in the foreachPartition:
data.foreachPartition { x =>
postObject.makeHttpCall(x)
}
Nothing happens. No output on screen, nothing arrives in my API. If I try to rerun it, almost all stages just skips. I believe, for any reason, it is just lazy evaluating my requests, but not actually performing it. I don't understand why, and how to force it.
postObject has 2 fields: client and post which has to be serialized.
I'm not sure that client is serialized properly. post object is potentially mutated from several partitions (on the same worker). So many things could go wrong here.
I propose tryng removing postObject and inlining its body into foreachPartition directly.
Addition:
Tried to run it myself:
sc.parallelize((1 to 10).toList).foreachPartition(row => {
val client = HttpClientBuilder.create().build()
val post = new HttpPost("https://google.com")
post.addHeader(HttpHeaders.CONTENT_TYPE,"application/json")
val json_str = """{"people": [""" + row.toSeq.map(x => x.toString).mkString(",") + "]}"
post.setEntity(new StringEntity(json_str))
val response = client.execute(post)
val entity = response.getEntity()
println(Seq(response.getStatusLine.getStatusCode(), response.getStatusLine.getReasonPhrase()))
println(IOUtils.toString(entity.getContent()))
})
Ran it both locally and in cluster.
It completes successfully and prints 405 errors to worker logs.
So requests definitely hit the server.
foreachPartition returns nothing as the result. To debug your issue you can change it to mapPartitions:
val responseCodes = sc.parallelize((1 to 10).toList).mapPartitions(row => {
val client = HttpClientBuilder.create().build()
val post = new HttpPost("https://google.com")
post.addHeader(HttpHeaders.CONTENT_TYPE,"application/json")
val json_str = """{"people": [""" + row.toSeq.map(x => x.toString).mkString(",") + "]}"
post.setEntity(new StringEntity(json_str))
val response = client.execute(post)
val entity = response.getEntity()
println(Seq(response.getStatusLine.getStatusCode(), response.getStatusLine.getReasonPhrase()))
println(IOUtils.toString(entity.getContent()))
Iterator.single(response.getStatusLine.getStatusCode)
}).collect()
println(responseCodes.mkString(", "))
This code returns the list of response codes so you can analyze it.
For me it prints 405, 405 as expected.
There is a way to do this without having to find out what exactly is not serializable. If you want to keep the structure of your code, you can make all fields #transient lazy val. Also, any call with side effects should be wrapped in a block. For example
val post = {
val httpPost = new HttpPost("https://my-cool-api-endpoint")
httpPost.addHeader(HttpHeaders.CONTENT_TYPE,"application/json")
httpPost
}
That will delay the initialization of all fields until they are used by the workers. Each worker will have an instance of the object and you will be able to make invoke the makeHttpCall method.

This Akka stream sometimes doesn't finish

I have a graph that reads lines from multiple gzipped files and writes those lines to another set of gzipped files, mapped according to some value in each line.
It works correctly against small data sets, but fails to terminate on larger data. (It may not be the size of the data that's to blame, as I have not run it enough times to be sure - it takes a while).
def files: Source[File, NotUsed] =
Source.fromIterator(
() =>
Files
.fileTraverser()
.breadthFirst(inDir)
.asScala
.filter(_.getName.endsWith(".gz"))
.toIterator)
def extract =
Flow[File]
.mapConcat[String](unzip)
.mapConcat(s =>
(JsonMethods.parse(s) \ "tk").extract[Array[String]].map(_ -> s).to[collection.immutable.Iterable])
.groupBy(1 << 16, _._1)
.groupedWithin(1000, 1.second)
.map { lines =>
val w = writer(lines.head._1)
w.println(lines.map(_._2).mkString("\n"))
w.close()
Done
}
.mergeSubstreams
def unzip(f: File) = {
scala.io.Source
.fromInputStream(new GZIPInputStream(new FileInputStream(f)))
.getLines
.toIterable
.to[collection.immutable.Iterable]
}
def writer(tk: String): PrintWriter =
new PrintWriter(
new OutputStreamWriter(
new GZIPOutputStream(
new FileOutputStream(new File(outDir, s"$tk.json.gz"), true)
))
)
val process = files.via(extract).toMat(Sink.ignore)(Keep.right).run()
Await.result(process, Duration.Inf)
The thread dump shows that the process is WAITING at Await.result(process, Duration.Inf) and nothing else is happening.
OpenJDK v11 with Akka v2.5.15
Most likely it's stuck in groupBy because it ran out of available threads in dispatcher to gather items into 2^16 groups for all sources.
So if I were you I'd probably implement grouping in extract semi-manually using statefulMapConcat with mutable Map[KeyType, List[String]]. Or buffer lines with groupedWithin first and split them into groups that you would write to different files in Sink.foreach.

Large file download with Play framework

I have a sample download code that works fine if the file is not zipped because I know the length and when I provide, it I think while streaming play does not have to bring the whole file in memory and it works. The below code works
def downloadLocalBackup() = Action {
var pathOfFile = "/opt/mydir/backups/big/backup"
val file = new java.io.File(pathOfFile)
val path: java.nio.file.Path = file.toPath
val source: Source[ByteString, _] = FileIO.fromPath(path)
logger.info("from local backup set the length in header as "+file.length())
Ok.sendEntity(HttpEntity.Streamed(source, Some(file.length()), Some("application/zip"))).withHeaders("Content-Disposition" -> s"attachment; filename=backup")
}
I don't know how the streaming in above case takes care of the difference in speed between disk reads(Which are faster than network). This never runs out of memory even for large files. But when I use the below code, which has zipOutput stream I am not sure of the reason to run out of memory. Somehow the same 3GB file when I try to use with zip stream, is not working.
def downloadLocalBackup2() = Action {
var pathOfFile = "/opt/mydir/backups/big/backup"
val file = new java.io.File(pathOfFile)
val path: java.nio.file.Path = file.toPath
val enumerator = Enumerator.outputStream { os =>
val zipStream = new ZipOutputStream(os)
zipStream.putNextEntry(new ZipEntry("backup2"))
val is = new BufferedInputStream(new FileInputStream(pathOfFile))
val buf = new Array[Byte](1024)
var len = is.read(buf)
var totalLength = 0L;
var logged = false;
while (len >= 0) {
zipStream.write(buf, 0, len)
len = is.read(buf)
if (!logged) {
logged = true;
logger.info("logging the while loop just one time")
}
}
is.close
zipStream.close()
}
logger.info("log right before sendEntity")
val kk = Ok.sendEntity(HttpEntity.Streamed(Source.fromPublisher(Streams.enumeratorToPublisher(enumerator)).map(x => {
val kk = Writeable.wByteArray.transform(x); kk
}),
None, Some("application/zip"))
).withHeaders("Content-Disposition" -> s"attachment; filename=backupfile.zip")
kk
}
In the first example, Akka Streams handles all details for you. It knows how to read the input stream without loading the complete file in memory. That is the advantage of using Akka Streams as explained in the docs:
The way we consume services from the Internet today includes many instances of streaming data, both downloading from a service as well as uploading to it or peer-to-peer data transfers. Regarding data as a stream of elements instead of in its entirety is very useful because it matches the way computers send and receive them (for example via TCP), but it is often also a necessity because data sets frequently become too large to be handled as a whole. We spread computations or analyses over large clusters and call it “big data”, where the whole principle of processing them is by feeding those data sequentially—as a stream—through some CPUs.
...
The purpose [of Akka Streams] is to offer an intuitive and safe way to formulate stream processing setups such that we can then execute them efficiently and with bounded resource usage—no more OutOfMemoryErrors. In order to achieve this our streams need to be able to limit the buffering that they employ, they need to be able to slow down producers if the consumers cannot keep up. This feature is called back-pressure and is at the core of the Reactive Streams initiative of which Akka is a founding member.
At the second example, you are handling the input/output streams by yourself, using the standard blocking API. I'm not 100% sure about how writing to a ZipOutputStream works here, but it is possible that it is not flushing the writes and accumulating everything before close.
Good thing is that you don't need to handle this manually since Akka Streams provides a way to gzip a Source of ByteStrings:
import javax.inject.Inject
import akka.util.ByteString
import akka.stream.scaladsl.{Compression, FileIO, Source}
import play.api.http.HttpEntity
import play.api.mvc.{BaseController, ControllerComponents}
class FooController #Inject()(val controllerComponents: ControllerComponents) extends BaseController {
def download = Action {
val pathOfFile = "/opt/mydir/backups/big/backup"
val file = new java.io.File(pathOfFile)
val path: java.nio.file.Path = file.toPath
val source: Source[ByteString, _] = FileIO.fromPath(path)
val gzipped = source.via(Compression.gzip)
Ok.sendEntity(HttpEntity.Streamed(gzipped, Some(file.length()), Some("application/zip"))).withHeaders("Content-Disposition" -> s"attachment; filename=backup")
}
}

How to close enumerated file?

Say, in an action I have:
val linesEnu = {
val is = new java.io.FileInputStream(path)
val isr = new java.io.InputStreamReader(is, "UTF-8")
val br = new java.io.BufferedReader(isr)
import scala.collection.JavaConversions._
val rows: scala.collection.Iterator[String] = br.lines.iterator
Enumerator.enumerate(rows)
}
Ok.feed(linesEnu).as(HTML)
How to close readers/streams?
There is a onDoneEnumerating callback that functions like finally (will always be called whether or not the Enumerator fails). You can close the streams there.
val linesEnu = {
val is = new java.io.FileInputStream(path)
val isr = new java.io.InputStreamReader(is, "UTF-8")
val br = new java.io.BufferedReader(isr)
import scala.collection.JavaConversions._
val rows: scala.collection.Iterator[String] = br.lines.iterator
Enumerator.enumerate(rows).onDoneEnumerating {
is.close()
// ... Anything else you want to execute when the Enumerator finishes.
}
}
The IO tools provided by Enumerator give you this kind of resource management out of the box—e.g. if you create an enumerator with fromStream, the stream is guaranteed to get closed after running (even if you only read a single line, etc.).
So for example you could write the following:
import play.api.libs.iteratee._
val splitByNl = Enumeratee.grouped(
Traversable.splitOnceAt[Array[Byte], Byte](_ != '\n'.toByte) &>>
Iteratee.consume()
) compose Enumeratee.map(new String(_, "UTF-8"))
def fileLines(path: String): Enumerator[String] =
Enumerator.fromStream(new java.io.FileInputStream(path)).through(splitByNl)
It's a shame that the library doesn't provide a linesFromStream out of the box, but I personally would still prefer to use fromStream with hand-rolled splitting, etc. over using an iterator and providing my own resource management.