continuously fetch database results with scalaz.stream - scala

I'm new to Scala and extremely new to scalaz. Through a different Stack Overflow answer and some handholding, I was able to use scalaz.stream to implement a Process that continuously fetches Twitter API results. Now I'd like to do the same thing for the Cassandra DB where the Twitter handles are stored.
The code for fetching the Twitter results is here:
def urls: Seq[(Handle, URL)] = {
  Await.result(
    getAll(connection).map { List =>
      List.map(twitterToGet =>
        (twitterToGet.handle, urlBoilerPlate + twitterToGet.handle + parameters + twitterToGet.sinceID)
      )
    },
    5 seconds)
}

val fetchUrl = channel.lift[Task, (Handle, URL), Fetched] {
  url => Task.delay {
    val finalResult = callTwitter(url)
    if (finalResult.tweets.nonEmpty) {
      connection.updateTwitter(finalResult)
    } else {
      println("\n" + finalResult.handle + " does not have new tweets")
    }
    s"\ntwitter Fetch & database update completed"
  }
}

val P = Process

val process =
  (time.awakeEvery(3.second) zipWith P.emitAll(urls))((b, url) => url)
    .through(fetchUrl)

val fetched = process.runLog.run
fetched.foreach(println)
What I'm planning to do is use
def urls: Seq[(Handle,URL)] = {
to continuously fetch Cassandra results (with an awakeEvery) and send them off to an actor to run the above Twitter-fetching code.
My question is, what is the best way to implement this with scalaz.stream? Note that I'd like it to get ALL the database results, then have a delay before getting ALL the database results again. Should I use the same architecture as the Twitter-fetching code above? If so, how would I create a channel.lift that doesn't require input? Is there a better way in scalaz.stream?
Thanks in advance

Got this working today. The cleanest way to do it would be to emit the database results as a stream and attach a sink to the end of the stream to do the Twitter processing (a minimal sketch of that simpler shape is included at the end of this answer). What I actually have is a bit more complex, as it retrieves the database results continuously and sends them off to an actor for the Twitter processing. The style of retrieving the results follows my original code from the question:
val connection = new simpleClient(conf.getString("cassandra.node"))
implicit val threadPool = new ScheduledThreadPoolExecutor(4)
val system = ActorSystem("mySystem")
val twitterFetch = system.actorOf(Props[TwitterFetch], "twitterFetch")

def myEffect = channel.lift[Task, simpleClient, String] {
  connection: simpleClient => Task.delay {
    val results = Await.result(
      getAll(connection).map { List =>
        List.map(twitterToGet =>
          (twitterToGet.handle, urlBoilerPlate + twitterToGet.handle + parameters + twitterToGet.sinceID)
        )
      },
      5 seconds)
    println("Query Successful, results= " + results + " at " + format.print(System.currentTimeMillis()))
    twitterFetch ! fetched(connection, results)
    s"database fetch completed"
  }
}

val P = Process

val process =
  time.awakeEvery(3.second).flatMap(_ =>
    P.emit(connection).through(myEffect))

val fetching = process.runLog.run
fetching.foreach(println)
Some notes:
I had asked about using channel.lift without input, but it became clear that the input should be the Cassandra connection.
The process definition
val process =
  time.awakeEvery(3.second).flatMap(_ =>
    P.emit(connection).through(myEffect))
changed from zipWith to flatMap because I wanted to retrieve the results continuously instead of only once.
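For reference, here is a minimal sketch of the simpler "emit the results and attach a sink" shape mentioned at the top of this answer. It reuses the urls, callTwitter and connection definitions from the question, so treat it as an outline rather than the code I actually run:
// Sketch only: push every (Handle, URL) pair through a sink that does the
// Twitter call, and re-read the database on every tick.
// Assumes the implicit ScheduledThreadPoolExecutor above is in scope for awakeEvery.
val twitterSink: Sink[Task, (Handle, URL)] =
  sink.lift { handleAndUrl =>
    Task.delay {
      val result = callTwitter(handleAndUrl)
      if (result.tweets.nonEmpty) connection.updateTwitter(result)
      ()                                   // sinks expect Task[Unit]
    }
  }

val repeatedFetch: Process[Task, Unit] =
  time.awakeEvery(3.seconds)
    .flatMap(_ => Process.emitAll(urls))   // urls is a def, so the Cassandra query re-runs on each tick
    .to(twitterSink)

repeatedFetch.run.run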

Related

foreachPartition method of RDD is not working in GCP cluster

I am trying to upload data in a Spark job through API calls, where the API has a payload limit of 5 MB per call (a third-party API limitation). I am accumulating data to build the API body up to the payload limit, to minimize the number of API calls. I do this inside the foreachPartition method of the RDD, with some log messages added for analysis. The whole thing runs fine when I run the Spark job locally on my machine (the APIs get called and the data gets uploaded), but it does not work the same way in the GCP cluster. When running the job in a GCP Dataproc cluster, the data is not uploaded through the APIs, so I believe the code inside foreachPartition is not getting called.
While running locally I can see all the log messages (1, 2, 3, 4, 5, 6, 7), but when running the job in the GCP cluster I can see only a few of them (1, 3).
Sample code is below for your reference.
I would appreciate your suggestions to make it run in the GCP cluster as well.
def exportData(
    client: ApiClient,
    batchSize: Int
): Unit = {
  val exportableDataRdd = getDataToUpload() // this is an RDD of type RDD[UserDataObj]
  logger.info(s"1. exportableDataRdd count:${exportableDataRdd.count}") // not good practice to call count here, only for debugging
  exportableDataRdd.foreachPartition { iterator =>
    logger.info(s"2. perPartition iteration")
    perPartitionMethod(client, iterator, batchSize)
  }
  logger.info(s"3. Data export completed")
}
def perPartitionMethod(
    client: ApiClient,
    iterator: Iterator[UserDataObj],
    batchSize: Int
): Unit = {
  logger.info(s"4. Inside perPartition")
  iterator.grouped(batchSize).foreach { userDataGroup =>
    val payLoadLimit = 5000000 // 5 MB
    val groupSize = userDataGroup.size
    var counter = 0
    var batchedUsersData = Seq[UserDataObj]()
    userDataGroup.map { user =>
      counter = counter + 1
      val curUsersDataSet = batchedUsersData :+ user
      val body = Map[String, Any]("data" -> curUsersDataSet.map(_.toMap))
      val apiPayload = Serialization.write(body)
      val size = apiPayload.getBytes().length
      if (size > payLoadLimit) {
        val usersToUpload = batchedUsersData
        logger.info(s"5. API called with batch size: ${usersToUpload.size}")
        uploadDataThroughAPI(usersToUpload, client) // method to upload data through the API
        batchedUsersData = Seq[UserDataObj](user)
      } else {
        batchedUsersData = batchedUsersData :+ user
      }
      // upload leftover data
      if (counter == groupSize && batchedUsersData.size > 0) {
        uploadDataThroughAPI(batchedUsersData, client)
        logger.info(s"6. API called with batch size: ${batchedUsersData.size}")
      }
    }
  }
  logger.info(s"7. perPartition completed")
}

Akka Streams for server streaming (gRPC, Scala)

I am new to Akka Streams and gRPC. I am trying to build an endpoint where the client sends a single request and the server sends back multiple responses.
This is my protobuf definition:
syntax = "proto3";
option java_multiple_files = true;
option java_package = "customer.service.proto";
service CustomerService {
rpc CreateCustomer(CustomerRequest) returns (stream CustomerResponse) {}
}
message CustomerRequest {
string customerId = 1;
string customerName = 2;
}
message CustomerResponse {
enum Status {
No_Customer = 0;
Creating_Customer = 1;
Customer_Created = 2;
}
string customerId = 1;
Status status = 2;
}
The flow I am trying to achieve: the client sends a customer request, the server first checks and responds with No_Customer, then sends Creating_Customer, and finally Customer_Created.
I have no idea where to start with the implementation; I have looked for hours but am still clueless. I will be very thankful if anyone can point me in the right direction.
The place to start is the Akka gRPC documentation and, in particular, the service WalkThrough. It is pretty straightforward to get the samples working in a clean project.
The relevant server sample method is this:
override def itKeepsReplying(in: HelloRequest): Source[HelloReply, NotUsed] = {
  println(s"sayHello to ${in.name} with stream of chars...")
  Source(s"Hello, ${in.name}".toList).map(character => HelloReply(character.toString))
}
The problem is now to create a Source that returns the right results, but that depends on how you are planning to implement the server so it is difficult to answer. Check the Akka Streams documentation for various options.
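One hedged sketch of such a Source for the question's proto, assuming the ScalaPB-generated CustomerRequest, CustomerResponse and CustomerResponse.Status names simply mirror the proto above, would just emit the three statuses in order:
import akka.NotUsed
import akka.stream.scaladsl.Source

// Sketch only: stream the three statuses back for the requested customer.
// A real implementation would do the actual checking/creation work between steps.
override def createCustomer(in: CustomerRequest): Source[CustomerResponse, NotUsed] =
  Source(
    List(
      CustomerResponse.Status.No_Customer,
      CustomerResponse.Status.Creating_Customer,
      CustomerResponse.Status.Customer_Created
    )
  ).map(status => CustomerResponse(customerId = in.customerId, status = status))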
The client code is simpler: just call runForeach on the Source that gets returned by CreateCustomer, as in the sample:
def runStreamingReplyExample(): Unit = {
  val responseStream = client.itKeepsReplying(HelloRequest("Alice"))
  val done: Future[Done] =
    responseStream.runForeach(reply => println(s"got streaming reply: ${reply.message}"))

  done.onComplete {
    case Success(_) =>
      println("streamingReply done")
    case Failure(e) =>
      println(s"Error streamingReply: $e")
  }
}

Gatling Feeders - creating new instances

The following code works as expected: for each iteration, the next value from valueFeed is popped and written to the output.csv file.
class TestSimulation extends Simulation {
  val valueFeed = csv("input.csv")
  val writer = {
    val fos = new java.io.FileOutputStream("output.csv")
    new java.io.PrintWriter(fos, true)
  }
  val scn = scenario("Test Sim")
    .repeat(2) {
      feed(valueFeed)
        .exec(session => {
          writer.println(session("value").as[String])
          session
        })
    }
  setUp(scn.inject(constantUsersPerSec(1) during (10 seconds)))
}
When the feed creation is inlined in the feed method, the behaviour is still exactly the same:
class TestSimulation extends Simulation {
  val writer = {
    val fos = new java.io.FileOutputStream("output.csv")
    new java.io.PrintWriter(fos, true)
  }
  val scn = scenario("Test Sim")
    .repeat(2) {
      feed(csv("input.csv"))
        .exec(session => {
          writer.println(session("value").as[String])
          session
        })
    }
  setUp(scn.inject(constantUsersPerSec(1) during (10 seconds)))
}
Since the feed creation is not extracted, I would not expect each iteration to use the same feed but rather to create its own feed instance.
Why, then, does the behaviour imply that the same feed is being used, with the first value from the input file not always written to the output?
Example input file (data truncated, tested with more lines to prevent empty feeder exception):
value
1
2
3
4
5
Because csv(...) is in fact a FeederBuilder, which is called once to produce the feeder used within the scenario.
The Gatling DSL defines builders; these are executed only once at startup, so even when you inline the csv call you get a feeder shared between all users, as the same (and only) builder is used to create all the users.
If you want each user to have its own copy of the data, you can't use the .feed method, but you can get all the records and use other looping constructs to iterate through them:
val records = csv("foo.csv").records

foreach(records, "record") {
  exec(flattenMapIntoAttributes("${record}"))
}
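A hedged sketch of how that could slot into the question's scenario (reusing the writer and input.csv definitions from the question), so that every virtual user walks through all of the records itself:
val records = csv("input.csv").records

val scn = scenario("Test Sim")
  .foreach(records, "record") {
    // flatten the current record's columns into session attributes, then write "value"
    exec(flattenMapIntoAttributes("${record}"))
      .exec(session => {
        writer.println(session("value").as[String])
        session
      })
  }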

Consuming a service using WS in Play

I was hoping someone could briefly go over the various ways of consuming a service (this one just returns a string; normally it would be JSON, but I just want to understand the concepts here).
My service:
def ping = Action {
  Ok("pong")
}
Now in my Play (2.3.x) application, I want to call my client and display the response.
When working with Futures, I want to display the value.
I am a bit confused: what are all the ways I could call this method? For example, there are some approaches I have seen that use Success/Failure:
val futureResponse: Future[String] = WS.url(url + "/ping").get().map { response =>
  response.body
}

var resp = ""
futureResponse.onComplete {
  case Success(str) => {
    Logger.trace(s"future success $str")
    resp = str
  }
  case Failure(ex) => {
    Logger.trace(s"future failed")
    resp = ex.toString
  }
}
Ok(resp)
I can see the trace in STDOUT for success/failure, but my controller action just returns "" to my browser.
I understand that this is because it returns a Future and my action finishes before the future completes.
How can I force it to wait?
What options do I have with error handling?
If you really want to block until the future is completed, look at the Await.ready() and Await.result() methods. But you shouldn't.
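For completeness, the blocking variant being advised against would look roughly like this, reusing the futureResponse value from the question:
import scala.concurrent.Await
import scala.concurrent.duration._

// Blocks the current thread until the future completes (or the timeout expires).
val resp: String = Await.result(futureResponse, 5.seconds)
Ok(resp)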
The point about Future is that you can tell it how to use the result once it arrived, and then go on, no blocks required.
A Future can be the result of an Action; in this case, the framework takes care of it:
def index = Action.async {
  WS.url(url + "/ping").get()
    .map(response => Ok("Got result: " + response.body))
}
Look at the documentation; it describes the topic very well.
As for error handling, you can use the Future.recover() method. You tell it what to return in case of an error, and it gives you a new Future that you should return from your action:
def index = Action.async {
  WS.url(url + "/ping").get()
    .map(response => Ok("Got result: " + response.body))
    .recover { case e: Exception => InternalServerError(e.getMessage) }
}
So the basic way to consume a service is to get the result Future, transform it the way you want using monadic methods (methods that return a new, transformed Future, such as map and recover), and return it as the result of an Action.
You may also want to look at the questions "Play 2.2 - Scala - How to chain Futures in Controller Action" and "Dealing with failed futures".

waiting for ws future response in play framework

I am trying to build a service that grabs some pages from another web service, processes the content, and returns the results to users. I am using Play 2.2.3 with Scala.
val aas = WS.url("http://localhost/")
  .withRequestTimeout(1000)
  .withQueryString(("mid", mid), ("t", txt))
  .get

val result = aas.map { response =>
  (response.json \ "status").asOpt[Int].map { st =>
    status = st
  }
  (response.json \ "msg").asOpt[String].map { txt =>
    msg = txt
  }
}

val rs1 = Await.result(result, 5 seconds)
if (rs1.isDefined) {
  Ok("good")
}
The problem is that the service waits 5 seconds to return "good" even when the WS request takes 100 ms. I also cannot set the Await timeout to 100 ms because the other web service I am calling may take anywhere from 100 ms to 1 second to respond.
My question is: is there a way to process and serve the results as soon as they are ready, instead of waiting a fixed amount of time?
#wingedsubmariner already provided the answer. Since there is no code example, I will just post what it should be:
def wb = Action.async { request =>
  val aas = WS.url("http://localhost/").withRequestTimeout(1000).get
  aas.map(response => {
    Ok("responded")
  })
}
Now you don't need to wait for the WS call to respond and then decide what to do. You can just tell Play what to do when it responds.
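As a further hedged sketch tying this back to the original question, the JSON handling can move inside the map, so the HTTP result is built as soon as the WS call completes; mid and txt are assumed to be in scope as in the question:
def wb = Action.async { request =>
  WS.url("http://localhost/")
    .withRequestTimeout(1000)
    .withQueryString(("mid", mid), ("t", txt))
    .get()
    .map { response =>
      // Build the result from the JSON as soon as it arrives; no Await needed.
      val status = (response.json \ "status").asOpt[Int]
      val msg = (response.json \ "msg").asOpt[String]
      if (status.isDefined) Ok("good")
      else BadRequest(msg.getOrElse("no status in response"))
    }
}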