How can I integrate Apache Spark with the Play Framework to display predictions in real time? - scala

I'm doing some testing with Apache Spark for my final college project. I have a data set that I use to generate a decision tree and make predictions on new data.
In the future I'd like to put this project into production: I would generate a decision tree in a batch process, receive new data through a web interface or a mobile application, predict the class of that entry, and return the result to the user instantly. I would also keep storing these new entries so that, after a while, I can generate a new decision tree (batch processing again) and repeat the cycle continuously.
Although Apache Spark is aimed at batch processing, it does have a streaming API for receiving real-time data. In my application this data would only be run through a model built beforehand in a batch process (the decision tree), and since the prediction itself is quite fast, the user can get an answer almost instantly.
My question is: what are the best ways to integrate Apache Spark with a web application (I plan to use the Scala version of the Play Framework)?

One of the issues you will run into with Spark is that it takes some time to start up and build a SparkContext. If you want to serve Spark queries via web calls, it is not practical to fire up spark-submit on every request. Instead, you will want to turn your driver application (these terms will make more sense later) into an RPC server.
In my application I embed a web server (http4s) so I can issue XmlHttpRequests from JavaScript to query my application directly, which returns JSON objects.
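A minimal sketch of that pattern, using the JDK's built-in HttpServer in place of http4s and a hypothetical model path, just to show the shape of a long-lived driver answering HTTP requests:

```scala
import java.net.InetSocketAddress
import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}
import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.SparkSession

object PredictionServer extends App {
  // Build the SparkSession/SparkContext once at startup; this is the slow part
  val spark = SparkSession.builder()
    .appName("prediction-server")
    .master("local[*]")                     // assumption: the driver runs on the web host
    .getOrCreate()

  // Hypothetical path to a model produced by the batch training job
  val model = PipelineModel.load("/models/decision-tree")

  val server = HttpServer.create(new InetSocketAddress(8080), 0)
  server.createContext("/predict", new HttpHandler {
    override def handle(exchange: HttpExchange): Unit = {
      // A real handler would parse features from the request and call
      // model.transform(...); the point here is that the context and model
      // are already warm, so the response is immediate.
      val response = """{"prediction": 1.0}"""
      exchange.sendResponseHeaders(200, response.length.toLong)
      exchange.getResponseBody.write(response.getBytes("UTF-8"))
      exchange.close()
    }
  })
  server.start()
}
```

The expensive parts (the SparkContext and the model load) happen once at startup; every request after that only pays the cost of the prediction itself.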

Spark is a fast, large-scale data processing platform. The key here is large-scale data: in most cases, the time to process that data will not be fast enough to meet the expectations of your average web app user. It is far better practice to perform the processing offline and write the results of your Spark processing to, e.g., a database. Your web app can then efficiently retrieve those results by querying that database.
That being said, spark-jobserver provides a REST API for submitting Spark jobs.
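For example, the batch job could write its predictions into a relational table that the web app then queries per request. A rough sketch, where the JDBC URL, credentials, and table name are all placeholders:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object ExportPredictions extends App {
  val spark = SparkSession.builder().appName("export-predictions").getOrCreate()
  import spark.implicits._

  // Stand-in for the real batch output
  val predictions = Seq(("user-1", 0.0), ("user-2", 1.0)).toDF("id", "predictedClass")

  predictions.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/analytics")  // placeholder URL
    .option("dbtable", "predictions")
    .option("user", "spark")
    .option("password", "changeme")
    .mode(SaveMode.Append)
    .save()
}
```

The Play side then only needs an ordinary database query, which is comfortably within interactive latency.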

Spark (< v1.6) uses Akka underneath. So does Play. You should be able to write a Spark action as an actor that communicates with a receiving actor in the Play system (that you also write).
You can let Akka worry about de/serialization, which will work as long as both systems have the same class definitions on their classpaths.
If you want to go further than that, you can write Akka Streams code that tees the data stream to your Play application.
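A hedged sketch of that actor-to-actor idea, using classic Akka actors with remoting; the message class, actor path, host, and port are all invented for the example:

```scala
import akka.actor.{Actor, ActorSelection, ActorSystem, Props}

// Shared message class; it must be on the classpath of BOTH systems
final case class Prediction(id: String, label: Double)

// Play-side actor that receives predictions pushed from the Spark driver
class PredictionReceiver extends Actor {
  def receive: Receive = {
    case Prediction(id, label) =>
      // e.g. push to a websocket, or store it for the next page load
      println(s"received prediction $label for $id")
  }
}

object PlaySide extends App {
  val system = ActorSystem("play-side")
  system.actorOf(Props(new PredictionReceiver), "predictions")
}

// Spark-driver side: look up the remote actor and fire predictions at it
// (assumes classic akka-remote is enabled and configured on both systems)
object SparkSide {
  def send(system: ActorSystem, p: Prediction): Unit = {
    val receiver: ActorSelection =
      system.actorSelection("akka.tcp://play-side@play-host:2552/user/predictions")
    receiver ! p
  }
}
```

As noted above, the Prediction class (and anything it references) has to be on both classpaths so Akka can serialize and deserialize it.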

Check this link out: you need to run Spark in local mode (on your web server), and the offline ML model should be saved in S3 so you can access the model from the web app and cache it just once, with the Spark context running continuously in local mode.
https://commitlogs.com/2017/02/18/serve-spark-ml-model-using-play-framework-and-s3/
Another approach is to use Livy (REST API calls on Spark):
https://index.scala-lang.org/luqmansahaf/play-livy-module/play-livy/1.0?target=_2.11
The S3 option is the way forward, I guess; if the batch model changes, you need to refresh the website cache, which means a few minutes of downtime (see the sketch after the links below).
Also look into these links:
https://github.com/openforce/spark-mllib-scala-play/blob/master/app/modules/SparkUtil.scala
https://github.com/openforce/spark-mllib-scala-play
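The gist of the S3 approach boiled down to a sketch; the bucket and model path are placeholders, and reading from s3a:// assumes the hadoop-aws connector and credentials are configured:

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.SparkSession

// One local-mode SparkSession and one model instance, loaded lazily on first
// use and then reused by every web request until the app is restarted.
object ModelCache {
  lazy val spark: SparkSession = SparkSession.builder()
    .appName("play-ml")
    .master("local[*]")
    .getOrCreate()

  lazy val model: PipelineModel =
    PipelineModel.load("s3a://my-bucket/models/decision-tree")  // placeholder path
}
```

A Play controller can then call ModelCache.model.transform(...) on the incoming features; refreshing the model after a new batch run means replacing this cached instance, which is the short downtime mentioned above.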

Related

Design question: best way to aggregate data from several microservices and show in UI

We have a scenario where we need to aggregate data from several services and show it in a UI. When an agent logs in, we need to show the cases assigned to that agent. The case information needs to be aggregated from several microservices. There can be around 1K cases assigned to an agent at a time, and all of them need to be shown so that the agent can sort them by certain case data.
What is the best approach to showing the data in this scenario? Should we make API calls to several services for each case and then aggregate and show the results? Or are there better approaches to achieve this?
No, you should certainly not call multiple APIs to aggregate data at runtime. Even if you call the APIs in parallel, the latency will be huge.
You need to pre-aggregate the case details and cache them in a distributed caching system (e.g. Redis or Memcached), keeping them up to date via a streaming platform (e.g. Kafka). Also store the pre-aggregated case details in a persistent database. Essentially, these are materialized views.
Caching lets you serve the case details to the user without any noticeable latency, and streaming keeps the cache and the DB aggregations updated in near real time. Storing the materialized view in the database saves you from holding everything in memory: you can use an LRU cache so that only recently used data stays in memory, and if you need to show case data that is not in the cache, you read it from the database and cache it for future requests.
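A small read-through sketch of that cache-then-database path, using the Jedis client; the key scheme, TTL, and loadFromDb function are placeholders:

```scala
import redis.clients.jedis.Jedis

// Try the cache first, fall back to the materialized view in the database,
// then populate the cache so the next request is served from memory.
class CaseStore(jedis: Jedis, loadFromDb: String => Option[String]) {

  def caseDetails(caseId: String): Option[String] =
    Option(jedis.get(s"case:$caseId")) match {
      case hit @ Some(_) => hit                       // cache hit
      case None =>
        val fromDb = loadFromDb(caseId)               // cache miss: read the materialized view
        fromDb.foreach(json => jedis.setex(s"case:$caseId", 3600, json))  // 1h TTL
        fromDb
    }
}
```

The Kafka consumer that maintains the materialized view would write to both the database and the cache under the same keys, so most reads never reach loadFromDb.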
I recommend you read these two Martin Kleppmann articles, here and here.

Kafka Streams Application Updates

I've built a Kafka Streams application. It's my first one, so I'm moving out of a proof-of-concept mindset into a "how can I productionalize this?" mindset.
The tl;dr version: I'm looking for kafka streams deployment recommendations and tips, specifically related to updating your application code.
I've been able to find lots of documentation about how Kafka and the Streams API work, but I couldn't find anything on actually deploying a Streams app.
The initial deployment seems to be fairly easy - there is good documentation for configuring your Kafka cluster, then you must create topics for your application, and then you're pretty much fine to start it up and publish data for it to process.
But what if you want to upgrade your application later? Specifically, if the update contains a change to the topology. My application does a decent amount of data enrichment and aggregation into windows, so it's likely that the processing will need to be tweaked in the future.
My understanding is that changing the order of processing or inserting additional steps into the topology will shift the internal ids of each processing step. At best, new state stores will be created and the previous state lost; at worst, processing steps will read from the wrong state store topic on startup. This implies that you either have to reset the application or give the new version a new application id. But there are some problems with that:
If you reset the application or give a new id, processing will start from the beginning of source and intermediate topics. I really don't want to publish the output to the output topics twice.
Currently "in-flight" data would be lost when you stop your application for an upgrade (since that application would never start again to resume processing).
The only way I can think to mitigate this is to:
Stop data from being published to source topics. Let the application process all messages, then shut it off.
Truncate all source and intermediate topics.
Start new version of application with a new app id.
Start publishers.
This is "okay" for now since my application is the only one reading from the source topics, and intermediate topics are not currently used beyond feeding to the next processor in the same application. But, I can see this getting pretty messy.
Is there a better way to handle application updates? Or are my steps generally along the lines of what most developers do?
I think you have a full picture of the problem here and your solution seems to be what most people do in this case.
During the latest Kafka Summit this question was asked after the talk by Gwen Shapira and Matthias J. Sax about Kubernetes deployment. The response was the same: if your upgrade contains topology modifications, rolling upgrades can't be done.
It looks like there is no KIP about this for now.
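If you do go with a new application id per incompatible topology change, the version can simply be baked into the id. A sketch of that idea; the topology shown is a trivial byte pass-through standing in for the real enrichment logic, and the topic names and broker address are placeholders:

```scala
import java.util.Properties
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig, Topology}

object EnrichmentApp extends App {
  // Bump this when the topology changes incompatibly, so the new deployment
  // gets fresh internal (changelog/repartition) topics and state stores.
  val topologyVersion = "v2"

  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, s"enrichment-app-$topologyVersion")
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092")

  // Pass-through topology as a stand-in for the real enrichment/aggregation
  val bytes = Serdes.ByteArray()
  val topology = new Topology()
  topology.addSource("source", bytes.deserializer(), bytes.deserializer(), "source-topic")
  topology.addSink("sink", "output-topic", bytes.serializer(), bytes.serializer(), "source")

  val streams = new KafkaStreams(topology, props)
  streams.start()
  sys.addShutdownHook(streams.close())
}
```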

Using spark as an application server?

We have complex finance / portfolio analytics for which we would like to take advantage of Spark.
Instead of having the application submit isolated jars that perform the computation and then having to retrieve the data out of SQL, how viable would it be to simply have the entire application run as a Spark driver so that the results from Spark can be seamlessly accessed from the main application?
Is this a recommended use case of Spark? What would be the potential disadvantages of this approach? Would there be any performance or latency implications?
This should be fine as long as you own the cluster and don't mind holding onto it while it has nothing to process.
You can programmatically set up your SparkContext and keep it running for as long as you want (see the sketch after the list of concerns below).
Everything will be one long running application that is using some constant resources.
Things to worry about:
If Spark dies, how will this affect your server?
If the driver runs out of memory, it will crash your server.
If you have answers for the above, I don't see anything fundamentally wrong with this approach.
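A rough sketch of what "keeping the context running" can look like when the whole application is the driver; the cluster URL, data path, and column names are placeholders:

```scala
import org.apache.spark.sql.{Row, SparkSession}

object AnalyticsServer {
  // Created once when the application starts and held for its lifetime
  val spark: SparkSession = SparkSession.builder()
    .appName("portfolio-analytics")
    .master("spark://cluster-master:7077")      // assumption: a cluster you own
    .getOrCreate()

  // Keeping the data cached is what "holds" executors and memory on the cluster,
  // but it is also what makes interactive queries from the main application fast.
  val positions = spark.read.parquet("hdfs:///data/positions").cache()

  def exposureByDesk(): Array[Row] =
    positions.groupBy("desk").sum("exposure").collect()
}
```

Because positions stays cached for the lifetime of the session, repeated analytics calls avoid re-reading the data, which is the main latency win of this setup.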

How to ensure that parallel queries to an external system are executed only once and then cached

Server frameworks: Scala, Play 2.2, ReactiveMongo, Heroku
I think I have quite an interesting brain teaser for you:
In my trip-planning application I want to display a weather forecast on a map (similar to this). I'm using a paid REST service to query the weather data. To speed up the user experience and reduce costs, I plan to cache the weather data for each location for one hour.
There are a few not-so-obvious things to consider:
It might require querying up to 100 locations to display one weather map.
The weather must be queried in parallel, because querying it serially would take too long given the network latency.
However, launching 100 threads for each user request is not an option either (imagine just 5 users looking at a map at the same time).
The solution is to have, say, 50 workers that query the weather for user requests.
Multiple users might be viewing the same portion of the map.
There is a possible race condition where one location is queried multiple times.
However, it should be queried only once and then cached.
The application runs in a clustered environment, meaning there will be several Play instances.
Coming from a Java EE background I can come up with a pretty good solution using the Java EE stack.
However, I wonder how to do this using something more natural to the Scala/Play stack: Akka. There is an example (google "heroku scala akka") for a similar problem, but it doesn't solve one issue: the race condition when multiple users query the same data at once.
How would you implement this?
EDIT: I have decided that the requirement to ensure that weather data is updated only once is not necessary. The situation would happen far too infrequently to be a real problem, and all the proposed solutions would add too much overhead and complexity to the system to be viable.
Thanks everyone for your time and effort. I hope answers to this question will help someone in the future with similar problem.
In Akka you can choose from multiple routing strategies, and ConsistentHashingRoutingLogic could serve you well in this situation. Since actors are single-threaded, you can easily maintain a cache in each actor. This routing logic ensures that two equal messages will always hit the same actor.
Each actor can work in the following way:
1. Check the local cache (for example Apache Commons LRUMap); if found, return the result.
2. Check the global cache (distributed memcached or any other key-value store); if found, store the result in the local cache and return it.
3. Query the REST service.
4. Store the result in both the global and the local cache.
You can have a look at this question, which I based my answer on.
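A compact sketch of the router-plus-per-actor-cache idea in classic Akka; the global cache lookup and the REST call are collapsed into a placeholder fetchForecast:

```scala
import akka.actor.{Actor, ActorSystem, Props}
import akka.routing.ConsistentHashingPool
import akka.routing.ConsistentHashingRouter.ConsistentHashableEnvelope

final case class WeatherRequest(locationKey: String)

class WeatherActor extends Actor {
  // Per-actor cache; the actor is single-threaded, so no locking is needed
  private val localCache = scala.collection.mutable.Map.empty[String, String]

  def receive: Receive = {
    case WeatherRequest(key) =>
      val forecast = localCache.getOrElseUpdate(key, fetchForecast(key))
      sender() ! forecast
  }

  // Placeholder for: check the global cache, then call the paid REST service
  private def fetchForecast(key: String): String = s"forecast-for-$key"
}

object WeatherRouting extends App {
  val system = ActorSystem("weather")

  // 50 workers; equal hash keys are always routed to the same worker,
  // so a given location is only ever fetched by one actor at a time
  val router = system.actorOf(
    ConsistentHashingPool(50).props(Props(new WeatherActor)),
    "weatherRouter")

  router ! ConsistentHashableEnvelope(WeatherRequest("48.85,2.35"), hashKey = "48.85,2.35")
}
```

Because equal hash keys always land on the same single-threaded actor, a given location is never fetched concurrently within one Play instance; the global cache is what keeps the separate instances from duplicating work.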
I decided that I'll post my JMS solution as well.
The controller that processes a weather request does the following:
Query the DB for weather data. If there are no locations with out-of-date data, reply immediately. Otherwise continue:
Start listening on a topic (explained later).
For each location, check whether the weather for that location is already being updated.
If not, send a weather-update request message to a queue.
A certain number of workers (50?) listen to that queue.
A worker first marks the location's weather as being updated.
The worker retrieves the updated weather and updates the DB.
The worker sends a message to a topic with the weather data for that location.
When the controller receives (via the topic) weather updates for all out-of-date locations, it combines them with the up-to-date locations and replies.
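For completeness, a hedged sketch of the worker side of this JMS design; the ConnectionFactory, destination names, and fetchForecast are placeholders, and the "mark as being updated" step is assumed to happen in the database layer:

```scala
import javax.jms.{Connection, ConnectionFactory, Message, MessageListener, Session, TextMessage}

// Consume update requests from a queue, fetch the forecast, then publish the
// result on a topic that the waiting controllers are subscribed to.
class WeatherWorker(factory: ConnectionFactory, fetchForecast: String => String) {

  def start(): Unit = {
    val connection: Connection = factory.createConnection()
    val session: Session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE)

    val requests = session.createQueue("weather.update.requests")
    val results  = session.createTopic("weather.updates")
    val producer = session.createProducer(results)

    session.createConsumer(requests).setMessageListener(new MessageListener {
      def onMessage(message: Message): Unit = message match {
        case text: TextMessage =>
          val location = text.getText              // e.g. "48.85,2.35"
          val forecast = fetchForecast(location)   // call the paid REST service
          producer.send(session.createTextMessage(s"$location:$forecast"))
        case _ => // ignore other message types
      }
    })
    connection.start()
  }
}
```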

What are the differences between Taobao's open source projects: Metamorphosis and Timetunnel?

I'd like to build a log aggregation system, and I found these tools developed by Taobao. Both of them can be used to collect logs for further processing and analysis. What is the difference between them?
If you can read Chinese, there are more details at http://rdc.taobao.com/team/jm/archives/921.
Metamorphosis is written in Java. It uses a text-based protocol (similar to memcached's) and a self-defined storage structure, supports both sync and async subscribe APIs, improves server recovery performance, adds an interface for real-time status (like memcached's stats), and supports client-side connection reuse.