Making HTTP post requests on Spark using foreachPartition - scala

I need some help understanding the behaviour of the code below in Spark (using Scala and Databricks).
I have a dataframe (read from S3, if that matters) and want to send its data via HTTP POST requests in batches of at most 1000 records. So I repartitioned the dataframe to make sure each partition has no more than 1000 records. I also created a json column for each row (so I only need to put the values in an array later on).
The trouble is in making the requests. I created a Serializable object using the following code:
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.http.client.methods.HttpPost
import org.apache.http.impl.client.HttpClientBuilder
import org.apache.http.HttpHeaders
import org.apache.http.entity.StringEntity
import org.apache.commons.io.IOUtils
object postObject extends Serializable{
val client = HttpClientBuilder.create().build()
val post = new HttpPost("https://my-cool-api-endpoint")
post.addHeader(HttpHeaders.CONTENT_TYPE,"application/json")
def makeHttpCall(row: Iterator[Row]) = {
val json_str = """{"people": [""" + row.toSeq.map(x => x.getAs[String]("json")).mkString(",") + "]}"
post.setEntity(new StringEntity(json_str))
val response = client.execute(post)
val entity = response.getEntity()
println(Seq(response.getStatusLine.getStatusCode(), response.getStatusLine.getReasonPhrase()))
println(IOUtils.toString(entity.getContent()))
}
}
Now when I try the following:
postObject.makeHttpCall(data.head(2).toIterator)
It works like a charm. The requests go through, there is some output on the screen, and my API gets that data.
But when I try to put it in the foreachPartition:
data.foreachPartition { x =>
postObject.makeHttpCall(x)
}
Nothing happens. No output on screen, and nothing arrives at my API. If I rerun it, almost all stages are just skipped. I believe that, for some reason, it is lazily evaluating my requests but not actually performing them. I don't understand why, or how to force it.

postObject has two fields, client and post, which have to be serialized.
I'm not sure that client is serialized properly. The post object is potentially mutated from several partitions (on the same worker). So many things could go wrong here.
I propose trying to remove postObject and inline its body into foreachPartition directly.
Addition:
Tried to run it myself:
sc.parallelize((1 to 10).toList).foreachPartition(row => {
val client = HttpClientBuilder.create().build()
val post = new HttpPost("https://google.com")
post.addHeader(HttpHeaders.CONTENT_TYPE,"application/json")
val json_str = """{"people": [""" + row.toSeq.map(x => x.toString).mkString(",") + "]}"
post.setEntity(new StringEntity(json_str))
val response = client.execute(post)
val entity = response.getEntity()
println(Seq(response.getStatusLine.getStatusCode(), response.getStatusLine.getReasonPhrase()))
println(IOUtils.toString(entity.getContent()))
})
Ran it both locally and in cluster.
It completes successfully and prints 405 errors to worker logs.
So requests definitely hit the server.
foreachPartition returns nothing as the result. To debug your issue you can change it to mapPartitions:
val responseCodes = sc.parallelize((1 to 10).toList).mapPartitions(row => {
val client = HttpClientBuilder.create().build()
val post = new HttpPost("https://google.com")
post.addHeader(HttpHeaders.CONTENT_TYPE,"application/json")
val json_str = """{"people": [""" + row.toSeq.map(x => x.toString).mkString(",") + "]}"
post.setEntity(new StringEntity(json_str))
val response = client.execute(post)
val entity = response.getEntity()
println(Seq(response.getStatusLine.getStatusCode(), response.getStatusLine.getReasonPhrase()))
println(IOUtils.toString(entity.getContent()))
Iterator.single(response.getStatusLine.getStatusCode)
}).collect()
println(responseCodes.mkString(", "))
This code returns the list of response codes so you can analyze it.
For me it prints 405, 405 as expected.
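Independently of the serialization question, the cap of 1000 records per request can also be enforced inside each partition instead of relying only on repartitioning, which may not give strictly bounded partition sizes. The following is a minimal sketch of that idea using plain Scala iterators. The names BatchPayloads, payloads and batchSize are hypothetical and introduced only for illustration; the payload shape copies the {"people": [...]} wrapper from the question.

```scala
object BatchPayloads {
  // Groups already-serialized JSON rows into request bodies of at most batchSize rows each.
  // Hypothetical helper: the wrapper format mirrors the string built in the question.
  def payloads(jsonRows: Iterator[String], batchSize: Int = 1000): Iterator[String] =
    jsonRows
      .grouped(batchSize)                                   // lazily yields Seq[String] chunks
      .map(_.mkString("""{"people": [""", ",", "]}"))       // one JSON body per chunk
}
```

Inside foreachPartition, each element produced by payloads(rows.map(_.getAs[String]("json"))) could then be sent as the body of one request.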

There is a way to do this without having to find out what exactly is not serializable. If you want to keep the structure of your code, you can make all fields @transient lazy val. Also, any call with side effects should be wrapped in a block. For example:
val post = {
val httpPost = new HttpPost("https://my-cool-api-endpoint")
httpPost.addHeader(HttpHeaders.CONTENT_TYPE,"application/json")
httpPost
}
That will delay the initialization of all fields until they are used by the workers. Each worker will have its own instance of the object, and you will be able to invoke the makeHttpCall method.
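The effect of @transient lazy val can be checked without Spark. The sketch below assumes plain Java serialization and uses a class rather than an object so that the round trip is visible. FakeClient, Poster and TransientLazyDemo are hypothetical names standing in for a non-serializable HTTP client and its holder; they are not part of the original code.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical stand-in for a resource that is NOT Serializable, such as an HTTP client.
class FakeClient

class Poster(val endpoint: String) extends Serializable {
  // Skipped when the object is serialized; rebuilt on first access after deserialization.
  @transient lazy val client: FakeClient = new FakeClient
}

object TransientLazyDemo {
  // Serializes and deserializes a value with plain Java serialization.
  def roundTrip[A <: AnyRef](a: A): A = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(a)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[A] finally in.close()
  }
}
```

Even if client has already been initialized before the round trip, serialization succeeds because the field is transient, and the deserialized copy creates its own FakeClient on first use.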

Related

Scala making parallel network calls using Futures

I'm new to Scala. I have a method that reads data from a given list of files, makes API calls with the data, and writes the response to a file.
listOfFiles.map { file =>
val bufferedSource = Source.fromFile(file)
val data = bufferedSource.mkString
bufferedSource.close()
val response = doApiCall(data) // time consuming task
if (response.nonEmpty) writeFile(response, outputLocation)
}
The above method takes too much time during the network call, so I tried parallel processing to reduce the time.
I wrapped the time-consuming block of code in a Future, but the program ends quickly and does not generate any output, unlike the code above.
import scala.concurrent.ExecutionContext.Implicits.global
listOfFiles.map { file =>
val bufferedSource = Source.fromFile(file)
val data = bufferedSource.mkString
bufferedSource.close()
Future {
val response = doApiCall(data) // time consuming task
if (response.nonEmpty) writeFile(response, outputLocation)
}
}
It would be helpful if you have any suggestions.
(I also tried using "par" and it works fine. I'm exploring options other than "par" and frameworks like "akka", "cats", etc.)
As Jatin suggested, instead of using the default execution context, which uses daemon threads,
import scala.concurrent.ExecutionContext.Implicits.global
define an execution context with non-daemon threads:
import java.util.concurrent.Executors
import scala.concurrent.ExecutionContext
implicit val nonDeamonEc = ExecutionContext.fromExecutor(Executors.newCachedThreadPool)
You can also use Future.traverse and Await, like so:
val resultF = Future.traverse(listOfFiles) { file =>
val bufferedSource = Source.fromFile(file)
val data = bufferedSource.mkString
bufferedSource.close()
Future {
val response = doApiCall(data) // time consuming task
if (response.nonEmpty) writeFile(response, outputLocation)
}
}
Await.result(resultF, Duration.Inf)
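Future.traverse takes a List[A] and a function A => Future[B] and returns a single Future[List[B]]. Here it combines the per-file futures into one future that can be awaited, so the program does not exit before the work is done. (Converting an existing List[Future[A]] to Future[List[A]] is what Future.sequence does.)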
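As a small self-contained illustration of the Future.traverse plus Await pattern, here is a sketch that does not touch files or the network. TraverseDemo and lengths are hypothetical names, and the pool size of 4 and the 10 second timeout are arbitrary choices for the example.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object TraverseDemo {
  // Computes the length of each input in parallel and waits for all results.
  def lengths(inputs: List[String]): List[Int] = {
    val pool = Executors.newFixedThreadPool(4)            // non-daemon worker threads
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
    try {
      // One Future per element, combined into a single Future[List[Int]].
      val all: Future[List[Int]] = Future.traverse(inputs)(s => Future(s.length))
      Await.result(all, 10.seconds)                       // block until every element is done
    } finally pool.shutdown()                             // let the JVM exit afterwards
  }
}
```

The results come back in the same order as the inputs, regardless of which future finishes first.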

Complete akka-http response with an iterator

I have an iterator of mongodb query results and I want to stream those results to http response without loading the whole results set into memory.
Is it possible to complete akka http response with an iterator instead of a collection or future?
Given an Iterator of data:
type Data = ???
val dataIterator : () => Iterator[Data] = ???
You will first need a function to convert Data to ByteString representation, and the ContentType (e.g. json, binary, csv, xml, ...) of the representation:
import akka.util.ByteString
import akka.http.scaladsl.model.ContentType
val dataToByteStr : Data => ByteString = ???
//see akka.http.scaladsl.model.ContentTypes for possible values
val contentType : ContentType = ???
The Iterator and converter function can now be used to create an HttpResponse that will stream the results back to the http client without holding the entire set of Data in memory:
import akka.http.scaladsl.model.HttpEntity.{Chunked, ChunkStreamPart}
import akka.http.scaladsl.model.ResponseEntity
import akka.stream.scaladsl.Source
import akka.http.scaladsl.model.HttpResponse
val chunks : Source[ChunkStreamPart,_] =
Source.fromIterator(dataIterator)
.map(dataToByteStr)
.map(ChunkStreamPart.apply)
val entity : ResponseEntity = Chunked.fromData(contentType, chunks)
val httpResponse : HttpResponse = HttpResponse(entity=entity)
Note: Since a new Iterator is produced each time from dataIterator you don't have to create a new httpResponse for each incoming request; the same response can be used for all requests.
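The reason the factory shape () => Iterator[Data] matters can be seen with plain Scala iterators: an Iterator can only be traversed once, while a function returning a new Iterator can be invoked repeatedly. The sketch below is only an illustration of that property; IteratorFactoryDemo, freshIterator and drain are hypothetical names and no Akka types are involved.

```scala
import scala.collection.mutable.ListBuffer

object IteratorFactoryDemo {
  // Hypothetical factory, analogous in shape to dataIterator: each call yields a fresh Iterator.
  val freshIterator: () => Iterator[Int] = () => Iterator(1, 2, 3)

  // Drains an iterator using only hasNext and next.
  def drain(it: Iterator[Int]): List[Int] = {
    val buf = ListBuffer.empty[Int]
    while (it.hasNext) buf += it.next()
    buf.toList
  }
}
```

Draining the same Iterator instance a second time yields nothing, whereas calling freshIterator() again yields the full sequence each time.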
Take a look at the Alpakka MongoDB connector. It lets you create a Source from a Mongo collection like this:
val source: Source[Document, NotUsed] = MongoSource(numbersColl.find())
val rows: Future[Seq[Document]] = source.runWith(Sink.seq)
Or you may want your own source implementation, for example as a GraphStage.

Apache Flink - Unable to get data From Twitter

I'm trying to get some messages with Twitter Streaming API using Apache Flink.
But, my code is not writing anything in the output file. I'm trying to count the input data for specific words.
Please check my example:
import java.util.Properties
import org.apache.flink.api.scala._
import org.apache.flink.streaming.connectors.twitter._
import org.apache.flink.api.java.utils.ParameterTool
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import com.twitter.hbc.core.endpoint.{Location, StatusesFilterEndpoint, StreamingEndpoint}
import org.apache.flink.streaming.api.windowing.time.Time
import scala.collection.JavaConverters._
//////////////////////////////////////////////////////
// Create an Endpoint to Track our terms
class myFilterEndpoint extends TwitterSource.EndpointInitializer with Serializable {
@Override
def createEndpoint(): StreamingEndpoint = {
//val chicago = new Location(new Location.Coordinate(-86.0, 41.0), new Location.Coordinate(-87.0, 42.0))
val endpoint = new StatusesFilterEndpoint()
//endpoint.locations(List(chicago).asJava)
endpoint.trackTerms(List("odebrecht", "lava", "jato").asJava)
endpoint
}
}
object Connection {
def main(args: Array[String]): Unit = {
val props = new Properties()
val params: ParameterTool = ParameterTool.fromArgs(args)
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.getConfig.setGlobalJobParameters(params)
env.setParallelism(params.getInt("parallelism", 1))
props.setProperty(TwitterSource.CONSUMER_KEY, params.get("consumer-key"))
props.setProperty(TwitterSource.CONSUMER_SECRET, params.get("consumer-key"))
props.setProperty(TwitterSource.TOKEN, params.get("token"))
props.setProperty(TwitterSource.TOKEN_SECRET, params.get("token-secret"))
val source = new TwitterSource(props)
val epInit = new myFilterEndpoint()
source.setCustomEndpointInitializer(epInit)
val streamSource = env.addSource(source)
streamSource.map(s => (0, 1))
.keyBy(0)
.timeWindow(Time.minutes(2), Time.seconds(30))
.sum(1)
.map(t => t._2)
.writeAsText(params.get("output"))
env.execute("Twitter Count")
}
}
The point is, I get no error message, and in my Dashboard I can see that my source is sending data to my TriggerWindow. But the window is not receiving any data:
I have two questions in one.
First: why is my source sending bytes to my TriggerWindow if nothing is received?
Second: is there something wrong with my code that prevents me from getting data from Twitter?
Your application source did not send actual records to the window, which you can see by looking at the Records sent column. The bytes that are sent belong to control messages which Flink sends from time to time between the tasks. More specifically, it is the LatencyMarker message, which is used to measure the end-to-end latency of a Flink job.
The code looks good to me. I even tried out your code and it worked for me. Thus, I conclude that there has to be something wrong with the Twitter connection credentials. Please re-check whether you've entered the right credentials.

Scala/Akka WSResponse recursively call

I'm trying to parse some data from an API.
I have a recursive method that calls this method:
def getJsonValue( url: (String)): JsValue = {
val builder = new com.ning.http.client.AsyncHttpClientConfig.Builder()
val client = new play.api.libs.ws.ning.NingWSClient(builder.build())
val newUrl = url.replace("\"", "").replace("|", "%7C").trim
val response: Future[WSResponse] = client.url(newUrl).get()
Await.result(response, Duration.create(10, "seconds")).json
}
Everything works well, but after 128 method calls I get this warning:
WARNING: You are creating too many HashedWheelTimer instances. HashedWheelTimer is a shared resource that must be reused across the application, so that only a few instances are created.
After about 20 more calls I get this exception:
23:24:57.425 [main] ERROR com.ning.http.client.AsyncHttpClient - Unable to instantiate provider com.ning.http.client.providers.netty.NettyAsyncHttpProvider. Trying other providers.
23:24:57.438 [main] ERROR com.ning.http.client.AsyncHttpClient - org.jboss.netty.channel.ChannelException: Failed to create a selector.
Questions
1. I'm assuming that the connections were not closed, and therefore I can't create new connections?
2. What is the correct and safe way to make those HTTP calls?
Had the same problem.
Found 2 interesting solutions:
make sure you are not creating tons of clients without closing them
the threadPool you are using may be causing this.
My piece of code (commenting out that line solved it; I'm now testing several configurations):
private[this] def withClient(block: NingWSClient => WSResponse): Try[WSResponse] = {
val config = new NingAsyncHttpClientConfigBuilder().build()
val clientConfig = new AsyncHttpClientConfig.Builder(config)
// .setExecutorService(new ThreadPoolExecutor(5, 15, 30L, TimeUnit.SECONDS, new SynchronousQueue[Runnable]))
.build()
val client = new NingWSClient(clientConfig)
val result = Try(block(client))
client.close()
result
}
To avoid this, you can use a different provider:
private AsyncHttpProvider httpProvider =new ApacheAsyncHttpProvider(config);
private AsyncHttpClient asyncHttpClient = new AsyncHttpClient(httpProvider,config);
I ran into this same problem. Before you call your recursive method, you should create the builder and client once, and pass the client to the recursive method as well as to getJsonValue. This is what getJsonValue should look like:
def getJsonValue(url: String, client: NingWSClient): JsValue = {
// The client is created once by the caller and reused here instead of being rebuilt on every call
val newUrl = url.replace("\"", "").replace("|", "%7C").trim
val response: Future[WSResponse] = client.url(newUrl).get()
Await.result(response, Duration.create(10, "seconds")).json
}
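The create-once, pass-down, close-at-the-end structure can be sketched without any HTTP library. The example below assumes Scala 2.13 or later for scala.util.Using. DemoClient, SharedClientDemo, crawl and run are hypothetical names, with DemoClient standing in for an expensive client such as NingWSClient.

```scala
import scala.util.Using

// Hypothetical stand-in for an expensive, closeable client.
final class DemoClient extends AutoCloseable {
  DemoClient.created += 1                       // count how many clients get built
  def get(url: String): String = s"response for $url"
  override def close(): Unit = DemoClient.closed += 1
}

object DemoClient {
  var created = 0
  var closed = 0
}

object SharedClientDemo {
  // The recursion receives the client instead of constructing a new one per call.
  @annotation.tailrec
  def crawl(urls: List[String], client: DemoClient, acc: List[String] = Nil): List[String] =
    urls match {
      case Nil       => acc.reverse
      case u :: rest => crawl(rest, client, client.get(u) :: acc)
    }

  // One client for the whole traversal, closed when the traversal ends (even on failure).
  def run(urls: List[String]): List[String] =
    Using.resource(new DemoClient)(client => crawl(urls, client))
}
```

However many URLs are visited, only one client is created and it is closed exactly once.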

How to use Java libraries asynchronously in a Scala Play 2.0 application?

I see in the Play 2.0 Scala doc for calling web services that the idiomatic approach is to use Scala's asynchronous mechanisms to call web services. So if I'm using Java libraries for, say, downloading images from S3 and uploading to Facebook and Twitter (restfb and twitter4j), does this make for a highly inefficient use of resources (what resources?) or does it not make much difference (or no difference at all)?
If it makes a difference, how would I go about making something like the following asynchronous? Is there a quick way, or would I have to write libraries from scratch?
Note this will be running on heroku, if that matters in this discussion.
def tweetJpeg = Action(parse.urlFormEncoded) { request =>
val form = request.body
val folder = form("folder").head
val mediaType = form("type").head
val photo = form("photo").head
val path = folder + "/" + mediaType + "/" + photo
val config = Play.current.configuration;
val awsAccessKey = config.getString("awsAccessKey").get
val awsSecretKey = config.getString("awsSecretKey").get
val awsBucket = config.getString("awsBucket").get
val awsCred = new BasicAWSCredentials(awsAccessKey, awsSecretKey)
val amazonS3Client = new AmazonS3Client(awsCred)
val obj = amazonS3Client.getObject(awsBucket, path)
val stream = obj.getObjectContent()
val twitterKey = config.getString("twitterKey").get
val twitterSecret = config.getString("twitterSecret").get
val token = form("token").head
val secret = form("secret").head
val tweet = form("tweet").head
val cb = new ConfigurationBuilder();
cb.setDebugEnabled(true)
.setOAuthConsumerKey(twitterKey)
.setOAuthConsumerSecret(twitterSecret)
.setOAuthAccessToken(token)
.setOAuthAccessTokenSecret(secret)
val tf = new TwitterFactory(cb.build())
val twitter = tf.getInstance()
val status = new StatusUpdate(tweet)
status.media(photo, stream)
val twitResp = twitter.updateStatus(status)
Logger.info("Tweeted " + twitResp.getText())
Ok("Tweeted " + twitResp.getText())
}
def facebookJpeg = Action(parse.urlFormEncoded) { request =>
val form = request.body
val folder = form("folder").head
val mediaType = form("type").head
val photo = form("photo").head
val path = folder + "/" + mediaType + "/" + photo
val config = Play.current.configuration;
val awsAccessKey = config.getString("awsAccessKey").get
val awsSecretKey = config.getString("awsSecretKey").get
val awsBucket = config.getString("awsBucket").get
val awsCred = new BasicAWSCredentials(awsAccessKey, awsSecretKey)
val amazonS3Client = new AmazonS3Client(awsCred)
val obj = amazonS3Client.getObject(awsBucket, path)
val stream = obj.getObjectContent()
val token = form("token").head
val msg = form("msg").head
val facebookClient = new DefaultFacebookClient(token)
val fbClass = classOf[FacebookType]
val param = com.restfb.Parameter.`with`("message", msg)
val attachment = com.restfb.BinaryAttachment.`with`(photo + ".png", stream)
val fbResp = facebookClient.publish("me/photos", fbClass, attachment, param)
Logger.info("Posted " + fbResp.toString())
Ok("Posted " + fbResp.toString())
}
My attempt at a guess:
Yes it's better to do things asynchronous; you're tying up threads if you do everything synchronously. Threads are memory hogs, so your server can only use so many; the more that are tied up waiting, the fewer requests your server can respond to.
No it's not a huge issue. With node.js (and Rails? Django?) it is a huge issue because there's only one thread and so it blocks your whole web server. A JVM server is multithreaded so you can still service new requests.
You can easily wrap the whole thing in a future, or do it more granularly, but that doesn't really buy you anything because you're calling the same methods, so you're just shifting the wait from one thread to another.
If those Java libraries offer asynchronous methods, you can wrap those in a future to get the real benefits of asynchrony <-how to do?. Otherwise yes you're looking at writing from the ground up.
Don't really know if running on heroku matters. Is one dyno == one simultaneous request?
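For reference, wrapping a synchronous call in a future can be sketched as follows. This is only an illustration of shifting the wait to another thread, as described above, not a recommendation of a particular Play API. BlockingWrapDemo, slowCall and slowCallAsync are hypothetical names, and slowCall stands in for a blocking Java library method.

```scala
import scala.concurrent.{ExecutionContext, Future, blocking}

object BlockingWrapDemo {
  // Hypothetical stand-in for a synchronous Java library call.
  def slowCall(input: String): String = {
    Thread.sleep(50)
    input.toUpperCase
  }

  // Runs the blocking call on the supplied execution context.
  // The caller gets a Future back immediately; a thread of ec does the waiting.
  def slowCallAsync(input: String)(implicit ec: ExecutionContext): Future[String] =
    Future(blocking(slowCall(input)))
}
```

In a web application the execution context passed here would typically be a separate pool sized for blocking work, so that the threads serving requests are not the ones tied up waiting.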
I think it's best to do these requests asynchronously for two main reasons:
high latency (network calls)
failures
With Play, you should use Akka actors to implement your actions; they provide great ways to deal with these two concerns.
The problem with synchronous code is that it will block the web server, so it won't be available to other requests. Here the waiting happens in other threads unrelated to the web server.
You could do something like:
// you will have to write the TwitterActor
val twitterActor = Akka.system.actorOf(Props[TwitterActor], name = "twitter-actor")
def tweetJpeg = Action(parse.urlFormEncoded) { request =>
val futureMessage = (twitterActor ? request.body).map {
// Do something with the response from the actor
case ... => ...
}
Async {
futureMessage.map( message =>
Ok("Tweeted " + message)
)
}
}
Your actor would receive the body and send back the response of the service.
Moreover with Akka, you can tune your process to have several actors available, have a circuit breaker ...
To go further: http://doc.akka.io/docs/akka/2.1.2/scala/actors.html
PS: I never tried Play on Heroku so I don't know the impact of a single dyno.