MSK & building aggregate tables (e.g. for analytics) - apache-kafka

I use MSK and I manually build aggregate tables of my streams in my application code (e.g. TypeScript in a node.js webservice). I have lots of data (approaching 1M events per day), and I want to be able to productionise different real-time 'views' on the incoming stream. E.g. for some sales data, I might want to create these views:
sales per customer (table schema: customer, sum_of_sales)
sales per day (table schema: date, sum_of_sales)
sales per customer per day (table schema: date, customer, sum_of_sales)
Today, if I wanted to achieve this, I would scaffold three tables (in an RDBMS or something like DynamoDB), and then in my application code I would insert/upsert into the relevant table for every sales event that arrived. The scaffolding around that feels a little tedious, and I was wondering if there is a better way that doesn't require writing a bunch of code in my webservice to pull from the consumer and upsert the data into a table.
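For concreteness, here is a rough sketch of the kind of code I mean, in TypeScript with kafkajs; the topic name, the event shape, and the in-memory map standing in for a real table are all just illustrative:

import { Kafka } from "kafkajs";

const kafka = new Kafka({ clientId: "sales-aggregator", brokers: ["my-msk-broker:9092"] });
const consumer = kafka.consumer({ groupId: "sales-per-customer" });

// Stand-in for a real table such as sales_per_customer (customer, sum_of_sales)
const salesPerCustomer = new Map<string, number>();

async function run() {
  await consumer.connect();
  await consumer.subscribe({ topics: ["sales"], fromBeginning: true });
  await consumer.run({
    eachMessage: async ({ message }) => {
      if (!message.value) return;
      const event = JSON.parse(message.value.toString()); // e.g. { customer, amount, date }
      // In practice this would be an upsert into the aggregate table, e.g.
      // INSERT ... ON CONFLICT (customer) DO UPDATE SET sum_of_sales = sum_of_sales + amount
      salesPerCustomer.set(event.customer, (salesPerCustomer.get(event.customer) ?? 0) + event.amount);
    },
  });
}

run().catch(console.error);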
All I would expect my code in my web service to do is provide APIs (e.g. REST APIs) to fetch data from these views. E.g. a client makes a REST request to get all sales in the last 7 days for customers X, Y and Z.
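And the read side would be something like this (again just a sketch; the route, query parameters and the querySalesPerCustomerPerDay helper are made up):

import express from "express";

const app = express();

// Stand-in for a SELECT against the sales_per_customer_per_day view
async function querySalesPerCustomerPerDay(customers: string[], since: Date) {
  return [] as { date: string; customer: string; sum_of_sales: number }[];
}

// GET /sales?customers=X,Y,Z&days=7
app.get("/sales", async (req, res) => {
  const customers = String(req.query.customers ?? "").split(",").filter(Boolean);
  const days = Number(req.query.days ?? 7);
  const since = new Date(Date.now() - days * 24 * 60 * 60 * 1000);
  res.json(await querySalesPerCustomerPerDay(customers, since));
});

app.listen(3000);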
There seem to be a lot of technologies out there, but my use case is fairly trivial, and from the not-so-brief look I took, nothing seems to do this out of the box.
Thanks
If it's noteworthy, I currently keep my data indefinitely.

Related

XRPL: How to get the history of the balance of an account?

I would like to query the history of the balance of an XRPL account with the new WebSocket API.
For example, how do I check the balance of an account on a particular day?
I know that with the v2 API it was possible to query balance_changes, but this doesn't seem to be part of the new version.
For example:
https://data.ripple.com/v2/accounts/rf1BiGeXwwQoi8Z2ueFYTEXSwuJYfV2Jpn/balance_changes?start=2018-01-01T00:00:00Z
How is this done with the new WebSocket APIs?
There's no convenient API call that the WebSocket API can do to get this. I assume you want the XRP balance, not token/issued currency balances, which are in a different place.
One way to go about it is to make an account_tx call and then iterate through the metadata. Many, but not all, transactions will have a ModifiedNode entry of type AccountRoot—if that transaction changed the account's XRP balance, you can see the difference in the PreviousFields vs. FinalFields for that entry. The Look Up Transaction Results tutorial has some details on how to parse out metadata this way. There are some tricky edge cases here: for example, if you send a transaction that buys 10 drops of XRP on the exchange but burns 10 drops of XRP as a transaction cost, then the metadata won't show a balance change because the net change was zero (+10, -10).
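A rough sketch of that metadata-scanning approach with the xrpl.js client (the exact response shape differs slightly between library versions, so treat the field access below as illustrative):

import { Client } from "xrpl";

async function xrpBalanceChanges(address: string) {
  const client = new Client("wss://s1.ripple.com");
  await client.connect();
  const resp = await client.request({ command: "account_tx", account: address, limit: 50 });
  for (const entry of resp.result.transactions as any[]) {
    const meta = entry.meta;
    if (!meta || typeof meta === "string") continue; // skip binary metadata
    for (const node of meta.AffectedNodes) {
      const mod = node.ModifiedNode;
      if (mod?.LedgerEntryType === "AccountRoot" && mod.PreviousFields?.Balance && mod.FinalFields?.Balance) {
        // Balance is in drops (1 XRP = 1,000,000 drops)
        const deltaDrops = Number(mod.FinalFields.Balance) - Number(mod.PreviousFields.Balance);
        console.log("XRP balance change (drops):", deltaDrops);
      }
    }
  }
  await client.disconnect();
}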
Another approach could be to estimate what ledger_index was most recently closed at a given time, then use account_info to look up the account's balance as of that time. The hard part there is figuring out what the latest ledger index was at a given time. This is one of the places where the Data API was just more convenient than the WebSocket API—there's no way to look up by date in WebSocket so you have to try a ledger index, see what the close time of the ledger was, try another ledger index, see what the date is, etc.
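A sketch of that second approach, binary-searching ledger indexes by close time (close_time is seconds since the Ripple epoch, 2000-01-01T00:00:00Z; the field names assume the standard ledger response):

import { Client } from "xrpl";

const RIPPLE_EPOCH = 946684800; // Unix timestamp of 2000-01-01T00:00:00Z

// Returns the index of the first ledger closed at or after `when`,
// searching between the lo and hi ledger indexes supplied by the caller.
async function ledgerIndexAt(client: Client, when: Date, lo: number, hi: number) {
  const target = Math.floor(when.getTime() / 1000) - RIPPLE_EPOCH;
  while (lo < hi) {
    const mid = Math.floor((lo + hi) / 2);
    const resp = await client.request({ command: "ledger", ledger_index: mid });
    if (Number((resp.result.ledger as any).close_time) < target) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}
// Then call account_info with { ledger_index: <result> } to read the XRP balance as of that ledger.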

Microservices "JOIN" tables within different databases and data replication

I'm trying to join data between entities.
I've got two separate microservices which can communicate with each other using events (RabbitMQ), and all requests currently go through an API gateway.
Suppose my first service is UserService and my second service is ProductService.
Usually, to get a list of products we make a GET request to /products; likewise, to create a product we make a POST request to /products.
The product schema looks something like this:
{
title: 'ProductTitle',
description: 'ProductDescription',
user: 'userId'
...
}
The user schema looks something like this:
{
username: 'UserUsername',
email: 'UserEmail'
...
}
So, when creating a product or getting a list of products, we won't have details about the user like email and username.
What I'm trying to achieve is to get the user details when creating a product or querying for a list of products, like so:
[
{
title: 'ProductTitle',
description: 'ProductDescription',
user: {
username: 'UserUsername',
email: 'UserEmail'
}
}
]
I could make a REST GET request to UserService to get the user details for each product.
But my concern is that if UserService goes down the product will not have user details.
What are other ways to JOIN tables, other than making REST API calls?
I've read about DATA REPLICATION, but here's another concern: how do we keep a copy of user details in ProductService when we create a new product with a POST request?
Usually I don't want to keep a copy of a user's details in ProductService if that user has never created a product. I could also have each service emit events to the other.
Approach 1 - Data Replication
Data replication is not harmful as long as it makes your services independent and resilient, but too much data replication is not good either. Microservices don't fit every case well, so we have to compromise on some things.
Approach 2 - Event Sourcing and Materialized Views
Generally, if your data spans multiple services, you should consider event sourcing and materialized views. These views are pre-computed, disposable data tables that are updated from events published by the different data services: say your "user" service publishes an event, you update your view; if another related event is published, you add to or update the materialized views again, and so on. These views can be kept in a cache for fast retrieval and queried to get the data. This pattern adds a little complexity, but it's highly scalable.
Event sourcing is basically a store that saves all your events, so you can replay them to reach a particular state of the system. Generally we create materialized views from the event store.
Say, for example, you have an event store where you keep saving all your published events, and at the same time you are updating your materialized views. If you want to query the data, you get it from the materialized views. Since materialized views are disposable, they can always be regenerated from the event store: if a materialized view held in cache gets corrupted, you can completely regenerate it by replaying the events, and if you miss a cache hit you can still get the data from the event store the same way. You can find more on the following links.
Event Sourcing, Materialized View
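As a toy illustration of the "disposable view" idea, here is a sketch of a materialized view being rebuilt by replaying an event store; the event names and shapes are made up:

type Event =
  | { type: "UserCreated"; userId: string; username: string; email: string }
  | { type: "ProductCreated"; productId: string; userId: string; title: string };

interface ProductView { productId: string; title: string; username?: string; email?: string }

function rebuildProductView(events: Event[]): Map<string, ProductView> {
  const users = new Map<string, { username: string; email: string }>();
  const view = new Map<string, ProductView>();
  for (const e of events) { // replaying the full event store
    if (e.type === "UserCreated") {
      users.set(e.userId, { username: e.username, email: e.email });
    } else if (e.type === "ProductCreated") {
      const u = users.get(e.userId);
      view.set(e.productId, { productId: e.productId, title: e.title, ...u });
    }
  }
  return view; // disposable: can always be regenerated from the event store
}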
Actually we are working with data replication to make each microservice more resilient (giving them the chance to still work even if another service is down).
This can be achieved in many ways, e.g. in your case by making the ProductService listen to the events sent by the UserService when a user is created, deleted, etc.
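A minimal sketch of that option, assuming amqplib, a topic exchange named user-events, and user.created / user.updated / user.deleted routing keys (all of these names are assumptions):

import amqp from "amqplib";

const localUsers = new Map<string, { username: string; email: string }>(); // local replica in ProductService

async function listenForUserEvents() {
  const conn = await amqp.connect("amqp://localhost");
  const ch = await conn.createChannel();
  await ch.assertExchange("user-events", "topic", { durable: true });
  const { queue } = await ch.assertQueue("product-service.user-events");
  await ch.bindQueue(queue, "user-events", "user.*");

  ch.consume(queue, (msg) => {
    if (!msg) return;
    const event = JSON.parse(msg.content.toString());
    if (msg.fields.routingKey === "user.created" || msg.fields.routingKey === "user.updated") {
      localUsers.set(event.userId, { username: event.username, email: event.email });
    } else if (msg.fields.routingKey === "user.deleted") {
      localUsers.delete(event.userId);
    }
    ch.ack(msg);
  });
}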
Or the UserService could expose a feed that the ProductService reads every n minutes or so, marking the position it last read on the feed. Etc.
There are many things to consider when designing services, and it really depends on your system's mission. E.g. you always have to evaluate the impact of coupling: is it acceptable or not for a service to stop working when another service is down? How important is a service, and what is the impact on other services when this one is not able to work?
If you do not want to keep a copy of data you don't need, you could replicate only the data of the users that are related to a product. If a new product is created by a user that is not in your dataset, you would then fetch it from the UserService. This gives you stronger coupling than replicating everything, but weaker coupling than replicating no data at all.
Again, it really depends on what your system is designed for and what it needs to achieve.

How to handle circular documents in MongoDB/DynamoDB?

Currently the site uses a relational database (MySQL); however, joining all the data is too slow and has required caching, which has led to other issues.
The issue is how the two tables would nest into each other creating a circular reference. A simple example would be two tables, one for an ACTOR and a second for a MOVIE. The movie would have the actor and the actor would have a movie. Obviously this is easy in a relational database.
So for example, an ACTOR schema:
ACTOR1
- AGE
- BIO
- MOVIES
- FILM1 (ties to the FILM1 document)
- FILM2
Then the MOVIE schema:
FILM1
- RELEASE DATE
- ACTORS
- ACTOR1 (ties back to the ACTOR document)
- ACTOR2
Speed is the most important thing to me. I can easily add IDs to an ACTOR document in place of the full MOVIE documents. However, then I'm back to multiple calls. Are there any features in a NoSQL database like MongoDB or DynamoDB that could solve this in a single call? Or is NoSQL just not the right choice?
While NoSQL generally recommends denormalization of data models, it is best not to have an unbounded list in a single database entry. To model this data in DynamoDB, you should use an adjacency list for modeling the many-to-many relationship. There's no cost-effective way of modeling the data, that I know of, to allow you to get all the data you want in a single call. However, you have said that speed is most important (without giving a latency requirement), so I will try to give you an idea as to how fast you can get the data if stored in DynamoDB.
Your schemas would become something like this:
Actor {
ActorId, <-- This is the application/database id, not the actor's actual ID
Name,
Age,
Bio
}
Film {
FilmId, <-- This is the application/database id for the film
Title,
Description,
ReleaseDate
}
ActedIn {
ActorId,
FilmId
}
To indicate that an actor acted in a movie, you only need to perform one write (which is consistently single-digit milliseconds using DynamoDB in my experience) to add an ActedIn item to your table.
To get all the movies for an actor, you would need to query once to get all the ActedIn relationships, and then do a batch read to get all the movies. Typical latencies for a query (in my experience) are under 10ms, depending on network speed and the amount of data being sent over the network. Since the ActedIn relationship is such a small object, I think you could expect an average case of 5ms for a query, if your query originates from something that is also running in an AWS datacenter (EC2, Lambda, etc).
Getting a single item is going to be under 5 ms, and you can do that in parallel. There's also a BatchGetItems API, but I don't have any statistics for you on that.
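As a sketch of that query-then-batch-read pattern with the AWS SDK v3 document client (table and attribute names here are assumptions based on the schemas above):

import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, QueryCommand, BatchGetCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

async function filmsForActor(actorId: string) {
  // 1) All ActedIn relationships for the actor
  const acted = await ddb.send(new QueryCommand({
    TableName: "ActedIn",
    KeyConditionExpression: "ActorId = :a",
    ExpressionAttributeValues: { ":a": actorId },
  }));

  const keys = (acted.Items ?? []).map((i) => ({ FilmId: i.FilmId }));
  if (keys.length === 0) return [];

  // 2) Batch read the film details (BatchGet is limited to 100 keys per call)
  const films = await ddb.send(new BatchGetCommand({
    RequestItems: { Film: { Keys: keys } },
  }));
  return films.Responses?.Film ?? [];
}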
So, is ~10ms fast enough for you?
If not, you can use DAX, which adds a caching layer to DynamoDB and promises request latency of <1ms.
What's the unmaintainable, not-cost-effective way to do this in a single call?
For every ActedIn relationship, store your data like this:
ActedIn {
ActorId,
ActorName,
ActorAge,
ActorBio,
FilmId,
FilmTitle,
FilmDescription,
FilmReleaseDate
}
You only need to make one query for any given Actor to get all of their film details, and only one query to get all the Actor details for a given film. Don't actually do this. The duplicated data means that every time you have to update the details for an Actor, you need to update it for every Film they were in, and similarly for Film details. This will be an operational nightmare.
I'm not convinced; it seems like NoSQL is terrible for this.
You should remember that NoSQL comes in many varieties (NoSQL = Not Only SQL), and so even if one NoSQL solution doesn't work for you, you shouldn't rule it out entirely. If you absolutely need this in a single call, you should consider using a Graph database (which is another type of NoSQL database).

Inter-microservices Communication using REST & PUB/SUB

This is still a theory in my mind.
I'm rebuilding my backend by splitting things into microservices. The microservices I'm imagining for starting off are:
- Order (stores order details and status of each order)
- Customer (stores customer details, addresses, orders booked)
- Service Provider (stores service provider details, status & location of each service provider, order(s) currently being processed by the service provider, etc.)
- Payment (stores payment info for each order)
- Channel (communicates with customers via email / SMS / mobile push)
I hope to be able to use PUB/SUB to create a message with corresponding data, which can be used by any other microservice subscribing to that message.
First off, I understand the concept that each microservice should have complete code & data isolation (thus, on different instances / VMs); and that all microservices should communicate strictly using HTTP REST API contracts.
My doubts are as follows:
To show a list of orders, I'll be using the Order DB to get all orders. In each Order document (I'll be using MongoDB for storage), I'll have a customer_id foreign key. Now there's the issue of resolving customer_name from customer_id.
If I need to show 100 orders on the page and go with the assumption that each order has a unique customer_id associated with it, then will I need to do a REST API call 100 times so as to get the names of all the 100 customer_ids?
Or, is data replication a good solution for this problem?
I am envisioning something like this w.r.t. PUB/SUB: The business center personnel mark an order as assigned & select the service provider to allot to that order. This creates a message on the cross-server PUB/SUB channel.
Then, the Channel microservice (which is on a totally different instance / VM) captures this message & sends a Push message & SMS to the service provider's device using the data within the message's contents.
Is this possible at all?
UPDATE TO QUESTION 2: I want the Order microservice to be completely independent of any other microservices that will be built upon / side-by-side it. Channel microservice is an example of a microservice that depends upon events taking place within Order microservice.
Also, please guide me as to what all technologies / libraries to use.
What I'll be developing on:
Java
MongoDB
Amazon AWS instances for each microservice.
Would appreciate anyone's help on this.
Thanks!
#1
If I need to show 100 orders and each order has a unique customer_id, will I need to make 100 REST API calls?
No, just make one batch request with the 100 customer_ids and return a dictionary of customer_id <=> customer_name.
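As a sketch, the batched lookup could look like this in TypeScript; the /customers?ids=... endpoint on the Customer service is hypothetical:

async function customerNamesFor(orders: { customer_id: string }[]) {
  const ids = [...new Set(orders.map((o) => o.customer_id))]; // de-duplicate the ids on the page
  const resp = await fetch(`http://customer-service/customers?ids=${ids.join(",")}`);
  const customers: { id: string; name: string }[] = await resp.json();
  // dictionary: customer_id => customer_name
  return new Map(customers.map((c) => [c.id, c.name]));
}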
#2
It's a single request
POST /orders/new
{
"selected_service_provider_id": "123",
...
}
This can return an order_id, which you can show to the customer, use to track progress, or what have you.
On the server side, you receive an order and process it. Processing can include sending an SMS at some stage. This functionality can be implemented inside the original service that received the request, or as a separate call to another dedicated service.
To your first question: you don't need to do 100 queries, just one with the array of your 100 IDs, like the following:
db.collection.find( { _id : { $in : [1,2,3,4] } } );
https://stackoverflow.com/a/7713461/1384539
I know this question is 1 year old, but I would like to add my answer to the first point.
One option would be to use some form of CQRS and also store some of the user details in the OrderDB when creating an order. This way, when you have to show the list of orders, you already have all the details you need. Also, the order document then represents a snapshot of the user's state at the moment the order was created.
Of course, if you don't have the user details when storing the order, you just need to make a GET call to the User Service, but that would be one call, not 100.
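A sketch of that idea: when an order is created, fetch the user once and embed a snapshot of the details the order list will need (the endpoint and field names are assumptions):

async function createOrder(input: { customerId: string; items: string[] }) {
  const resp = await fetch(`http://user-service/users/${input.customerId}`); // one call, not 100
  const user = await resp.json();

  const order = {
    customer_id: input.customerId,
    customer: { username: user.username, email: user.email }, // snapshot at creation time
    items: input.items,
    createdAt: new Date(),
  };
  // insert `order` into the Order DB (e.g. a MongoDB insertOne)
  return order;
}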

Which of CouchDB or MongoDB suits my needs?

Where I work, we use Ruby on Rails to create both backend and frontend applications. Usually, these applications interact with the same MySQL database. It works great for a majority of our data, but we have one situation which I would like to move to a NoSQL environment.
We have clients, and our clients have what we call "inventories"--one or more of them. An inventory can have many thousands of items. This is currently done through two relational database tables, inventories and inventory_items.
The problems start when two different inventories have different parameters:
# Inventory item from inventory 1, televisions
{
inventory_id: 1
sku: 12345
name: Samsung LCD 40 inches
model: 582903-4
brand: Samsung
screen_size: 40
type: LCD
price: 999.95
}
# Inventory item from inventory 2, accomodation
{
inventory_id: 2
sku: 48cab23fa
name: New York Hilton
accomodation_type: hotel
star_rating: 5
price_per_night: 395
}
Since we obviously can't use brand or star_rating as the column name in inventory_items, our solution so far has been to use generic column names such as text_a, text_b, float_a, int_a, etc, and introduce a third table, inventory_schemas. The tables now look like this:
# Inventory schema for inventory 1, televisions
{
inventory_id: 1
int_a: sku
text_a: name
text_b: model
text_c: brand
int_b: screen_size
text_d: type
float_a: price
}
# Inventory item from inventory 1, televisions
{
inventory_id: 1
int_a: 12345
text_a: Samsung LCD 40 inches
text_b: 582903-4
text_c: Samsung
int_b: 40
text_d: LCD
float_a: 999.95
}
This has worked well... up to a point. It's clunky, it's unintuitive and it lacks scalability. We have to devote resources to set up inventory schemas. Using separate tables is not an option.
Enter NoSQL. With it, we could let each and every item have its own parameters and still store them together. From the research I've done, it certainly seems like a great alternative for this situation.
Specifically, I've looked at CouchDB and MongoDB. Both look great. However, there are a few other bits and pieces we need to be able to do with our inventory:
We need to be able to select items from only one (or several) inventories.
We need to be able to filter items based on their parameters (eg. get all items from inventory 2 where type is 'hotel').
We need to be able to group items based on parameters (eg. get the lowest price from items in inventory 1 where brand is 'Samsung').
We need to (potentially) be able to retrieve thousands of items at a time.
We need to be able to access the data from multiple applications; both backend (to process data) and frontend (to display data).
Rapid bulk insertion is desired, though not required.
Based on the structure, and the requirements, are either CouchDB or MongoDB suitable for us? If so, which one will be the best fit?
Thanks for reading, and thanks in advance for answers.
EDIT: One of the reasons I like CouchDB is that it would be possible for us in the frontend application to request data via JavaScript directly from the server after page load, and display the results without having to use any backend code whatsoever. This would lead to faster page loads and less server strain, as the fetching/processing of the data would be done client-side.
I work on MongoDB, so you should take this with a grain of salt, but this looks like a great fit for Mongo.
We need to be able to select items from only one (or several) inventories.
It's easy to run ad hoc queries on any field.
We need to be able to filter items based on their parameters (eg. get all items from inventory 2 where type is 'hotel').
The query for this would be: {"inventory_id" : 2, "type" : "hotel"}.
We need to be able to group items based on parameters (eg. get the lowest price from items in inventory 1 where brand is 'Samsung').
Again, super easy: db.items.find({"inventory_id" : 1, "brand" : "Samsung"}).sort({"price" : 1}).limit(1)
We need to (potentially) be able to retrieve thousands of items at a time.
No problem.
Rapid bulk insertion is desired, though not required.
MongoDB has much faster bulk inserts than CouchDB.
Also, there's a REST interface for MongoDB: http://github.com/kchodorow/sleepy.mongoose
You might want to read http://chemeo.com/doc/technology, which describes how they dealt with the arbitrary property search problem with MongoDB.