Getting Streaming Data to the Right Client(s) - scala

I have streaming data coming in to a platform that looks like this:
RestAPI -> AWS Kinesis Stream A -> Custom Analytics Engine
Once each datum is processed by the Analytics engine, the result needs to be pushed to web front ends so users can look at it in real time. The web front end is a webapp that connects to a cluster of web servers (spray or play or whatever) via a websocket... and there can be many users interested in the same data.
Question: Since the front end user can connect to any web server in the cluster, how do I get the data back to the cluster in a way where it can be pushed to all the users interested in that particular datum? Do I get the datum to a single server and then somehow distribute it to all of the machines in the cluster? Is there a way for that single datum to go to all servers in the cluster and then each server decides whether there is a connected user that is interested in it and, if not, drops it?

Related

What is the correct way to prevent duplicates being stored in a DB used by a realtime app

I currently have a server that watches some events on the Ethereum blockchain. When some events are triggered there, my server, being subscribed to them, picks them up, does some processing and fills my database accordingly.
That being said, for scalability purposes, let's say I would now like to have several instances of my server. So now I have 3 servers that watch the Ethereum blockchain for events and fill my db.
What is the proper/standard way to tackle the fact that all my servers will be pushing the same data to my db?

SymmetricDS simple configuration guidance

Over the past couple of weeks I have been prototyping some examples in SymmetricDS. I am looking for some guidance and examples because I am really running into some walls here. I have used the server and Android examples successfully, so I don't need any assistance with setup or getting the basics working. It is a complex tool and I'm still learning it as well.
So I am trying to set up an environment where all the clients that run on Android devices sync up to a server. I know it's fairly straightforward to do a setup where it's 1 MASTER -> <- multiple clients, as the examples they provide do.
What I am trying to do is multiple masters to multiple clients. Essentially I want a database on the server for each client. I'll attach a diagram to try to help explain, but I want a database for each store, so store #1 has a master DB on the server and it syncs both ways with the client device.
[diagram: server-diagram]
SymmetricDS requires a central node to store the configuration. I would recommend having a central node with a bunch of databases that connect to the central database. Connect each Android application to its own one of those databases. This topology will allow configuring what data syncs from the central node to the bunch of databases and what goes back.
On the router from client to server you can set the target catalog to be a variable: $(sourceExternalId). This will use the client's external ID as the database name on your server.
If you also need to replicate data back down, you can set the external select on the triggers at the server. This would need to be an expression on your server database that evaluates to the current database. It would fire when a change occurs on the server database and, during capture, populate the external_data column on sym_data with the database that the change occurred in. You would then adjust the router from server to client to be a column match router type. Your expression for the router would then be: EXTERNAL_DATA=:EXTERNAL_ID. This ensures that the data is only sent to the appropriate client.
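To make the effect of the EXTERNAL_DATA=:EXTERNAL_ID expression concrete, here is a small Python simulation of the matching rule. It is not SymmetricDS code or configuration: the class names, the route function and the store1/store2/store3 IDs are all made up for illustration, and it only assumes that each captured change carries the name of the database it came from.

```python
from dataclasses import dataclass


@dataclass
class CapturedChange:
    table: str
    # Hypothetical stand-in for sym_data.external_data: the database
    # in which the change occurred, as filled in by the external select.
    external_data: str


@dataclass
class ClientNode:
    external_id: str  # hypothetical client external ID, e.g. "store1"


def route(change, clients):
    """Mimic EXTERNAL_DATA=:EXTERNAL_ID: keep only the clients whose
    external ID equals the change's external_data value."""
    return [c.external_id for c in clients if c.external_id == change.external_data]


clients = [ClientNode("store1"), ClientNode("store2"), ClientNode("store3")]
change = CapturedChange(table="sale", external_data="store2")
targets = route(change, clients)
print(targets)
```

In this sketch a change captured in the store2 database is routed only to the client whose external ID is store2, and a change from a database with no matching client goes nowhere.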

Best way to write to Kafka from web site?

I mean, I know how to get data into Kafka either by some file agent or programmatically using any of the clients, but speaking from an architectural point of view...
It can't just be collecting HTTP logs.
I'm assuming when someone clicks a link or does something of interest, we can use some kind of ajax/javascript call to some microservice to capture the extra info that we want? But that's not always "reliable" per se, but do we care?
Or while the given "action" posts back to the server, do we simultaneously write to Kafka and perform the other action?
It’s not clear from your question if you are trying to collect all the clickstream logs from a set of web servers, or if you are trying to selectively publish some data to Kafka from your web app, so I will answer both.
The easiest way to collect every web click is to configure your web servers to use Syslog (see http://archive.oreilly.com/pub/a/sysadmin/2006/10/12/httpd-syslog.html) and configure your Syslog server to send data to Kafka (see https://www.balabit.com/documents/syslog-ng-ose-latest-guides/en/syslog-ng-ose-guide-admin/html/configuring-destinations-kafka.html). Alternatively there are some more advanced features available in this Kafka Connector for Syslog-NG (see https://github.com/jcustenborder/kafka-connect-syslog). You can also write httpd logs to a file and use a Kafka File Connector to publish to Kafka (see https://docs.confluent.io/current/connect/connect-filestream/filestream_connector.html).
If you just want to enable your apps to send certain log data to Kafka directly, you can use the Kafka REST Proxy and publish using a simple HTTP POST from either your client JavaScript or your server-side logic (see https://docs.confluent.io/current/kafka-rest/docs/index.html).
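As a rough sketch of that last option from server-side logic, the Python snippet below builds such a POST using only the standard library. It assumes the REST Proxy's v2 JSON embedded format; the proxy address http://localhost:8082, the topic name clicks, the event fields and the helper name build_produce_request are all placeholders, so check them against the REST Proxy version you run.

```python
import json
import urllib.request


def build_produce_request(proxy_url, topic, events):
    """Build (but do not send) a POST that publishes JSON events to a topic
    through the Kafka REST Proxy, assuming its v2 JSON embedded format."""
    body = json.dumps({"records": [{"value": e} for e in events]}).encode("utf-8")
    return urllib.request.Request(
        url=f"{proxy_url}/topics/{topic}",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/vnd.kafka.json.v2+json",
            "Accept": "application/vnd.kafka.v2+json",
        },
    )


# Hypothetical click event and proxy address, for illustration only.
req = build_produce_request(
    "http://localhost:8082",
    "clicks",
    [{"user": "u1", "action": "click", "target": "/pricing"}],
)

# Sending it requires a running REST Proxy:
# with urllib.request.urlopen(req) as resp:
#     print(resp.status, resp.read())
```

Building the request separately from sending it keeps the click-handling path easy to test, and lets you decide whether a failed publish should be retried or simply dropped.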

AWS API Gateway - to use with AWS EC2 Endpoint or AWS Lambda?

I need to create an API where vendors will push data to the server using REST calls, and this data then needs to be pushed on to the user on the mobile app to whom the data belongs (using WebSocket, I'm guessing as of now).
For vendors to use the REST API, I need to check the vendor's credentials and write the data to the DB.
I am keen to know what approach I should use. Should I use AWS API Gateway, which can help with security and scalability?
And when using AWS API Gateway, which would be the better approach: an EC2 endpoint or a Lambda endpoint?
Using EC2 vs Lambda depends on how you want to design your services and on your specific use cases. Going serverless is a trend these days, but you do not need to go serverless just for the sake of being serverless.
For your use case, if the REST API you expose updates a database, let's say RDS, a Lambda function is probably not an ideal choice, as you will need to open a connection every time the Lambda function is invoked. Moreover, if you are running the Lambda in a no-VPC config, you will need to publicly expose your RDS port. If it's DynamoDB, it works out well.
But again, you want to push out the update to mobile apps over, say, WebSockets. You definitely need a WebSocket server somewhere, and I guess it's EC2.
You may design your application in such a way that all your business logic resides in the Lambda functions, which update the DB and post a message to an SQS queue. The WebSocket server can then pick up messages from the SQS queue and post updates. This decouples your application architecture. This is just one approach, and it won't scale horizontally out of the box.
OR - You may choose to put everything in one EC2 instance and expose a REST API that updates the DB and also posts updates to the WebSocket connection.
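The decoupled variant (business logic writes to the DB and posts to a queue, the WebSocket server picks messages up and pushes them) can be sketched in plain Python. This is only an in-memory model: queue.Queue stands in for SQS, a dict stands in for the DB, and handle_vendor_post, WebSocketServer and the sample IDs are invented names, not AWS APIs.

```python
import json
import queue

updates = queue.Queue()  # in-memory stand-in for the SQS queue
database = {}            # in-memory stand-in for the DB


def handle_vendor_post(vendor_id, user_id, payload):
    """Business-logic step (the Lambda role): store the data, then enqueue
    a notification for the WebSocket server."""
    database[(vendor_id, user_id)] = payload
    updates.put(json.dumps({"user_id": user_id, "payload": payload}))


class WebSocketServer:
    def __init__(self):
        # user_id -> messages "sent" over that user's socket (simulated)
        self.connections = {}

    def connect(self, user_id):
        self.connections.setdefault(user_id, [])

    def drain(self, q):
        """Pull every queued message and push it to its user if that user
        is connected here; return how many messages were delivered."""
        delivered = 0
        while True:
            try:
                msg = json.loads(q.get_nowait())
            except queue.Empty:
                return delivered
            outbox = self.connections.get(msg["user_id"])
            if outbox is not None:
                outbox.append(msg["payload"])
                delivered += 1


server = WebSocketServer()
server.connect("alice")
handle_vendor_post("vendor-1", "alice", {"order": 42, "status": "shipped"})
handle_vendor_post("vendor-1", "bob", {"order": 43, "status": "packed"})
delivered = server.drain(updates)
print(delivered, server.connections["alice"])
```

Note that the message for bob is consumed and dropped because bob is not connected to this server. With a single shared queue and several WebSocket servers, a message may be taken by a server that does not hold the target user's connection, which is one way to read the remark that this design does not scale horizontally out of the box.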

Rolling Over Streaming Connections During Upgrades

I am working on an application that uses Amazon Kinesis, and one of the things I was wondering about is how you can roll over an application during an upgrade without data loss on streams. I have heard about things like blue/green deployments and such, but I was wondering what the best practice is for upgrading a data streaming service so you don't lose data from your streams.
For example, my application has an HTTP endpoint that ingests data as a series of POST operations. If I want to replace the service with a newer version, how do I manage existing applications that are streaming to my endpoint?
One common method is having a software load balancer (LB) with a virtual IP; behind this LB there would be at least two HTTP ingestion endpoints during normal operation. During an upgrade, each endpoint is announced out and upgraded in turn. The LB ensures that no traffic is forwarded to an announced-out endpoint.
(The endpoints themselves can be on separate VMs, Docker containers or physical nodes).
Of course, the stream needs to be finite; the TCP socket/HTTP stream is owned by one of the endpoints. However, as long as the stream can be stopped gracefully, the following flow works, assuming endpoint A owns the current ingestion:
Tell endpoint A not to accept new streams. All new streams will be redirected only to endpoint B by the LB.
Gracefully stop existing streams on endpoint A.
Upgrade A.
Announce A back in.
Rinse and repeat with endpoint B.
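The flow above can be modelled in a few lines of Python. This is a toy in-memory simulation rather than a real load balancer: Endpoint, LoadBalancer, rolling_upgrade and the version strings are hypothetical names, and "gracefully stopping" a stream is reduced to clearing a set, where a real endpoint would wait for in-flight streams to finish.

```python
class Endpoint:
    def __init__(self, name, version):
        self.name = name
        self.version = version
        self.streams = set()  # IDs of streams currently owned by this endpoint


class LoadBalancer:
    def __init__(self, endpoints):
        self.endpoints = list(endpoints)
        self.announced_out = set()  # names of endpoints taken out of rotation

    def route(self, stream_id):
        """Assign a new stream to the least-loaded endpoint still announced in."""
        candidates = [e for e in self.endpoints if e.name not in self.announced_out]
        if not candidates:
            raise RuntimeError("no endpoint available")
        target = min(candidates, key=lambda e: len(e.streams))
        target.streams.add(stream_id)
        return target.name


def rolling_upgrade(lb, new_version):
    """Upgrade every endpoint in turn, one at a time."""
    for ep in lb.endpoints:
        lb.announced_out.add(ep.name)      # stop accepting new streams
        ep.streams.clear()                 # stand-in for a graceful stop
        ep.version = new_version           # upgrade
        lb.announced_out.discard(ep.name)  # announce back in


a, b = Endpoint("A", "1.0"), Endpoint("B", "1.0")
lb = LoadBalancer([a, b])
lb.route("s1")                 # lands on A (both idle, A is listed first)
lb.announced_out.add("A")      # A is announced out
routed = lb.route("s2")        # new stream can only go to B
lb.announced_out.discard("A")
rolling_upgrade(lb, "2.0")
print(routed, a.version, b.version)
```

Because only one endpoint is announced out at any moment, route always has at least one candidate, which is the property that keeps ingestion available throughout the upgrade.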
As a side point, you would need two endpoints in a load-balanced (or master/slave) set-up anyway if you require any reasonable uptime and reliability guarantees.
There are also methods which allow hot code swap on the same endpoint, but they are more bespoke and rely on a specific internal design (e.g. separate processes for the networking and processing stacks, connected by IPC).