Enterprise integration via a data warehouse, or via messages?

Imagine a large organisation with many applications. The applications are not currently integrated to any great extent. There is a new and empty enterprise data warehouse, and it would store all data in a canonical format. The first step is to set up the warehouse and seed it with data from the applications.
I am looking for pros and cons between the following two enterprise integration patterns:
1) Using a combination of integration tools, set up batch jobs to extract, transform, and load data into the warehouse at periodic intervals. Then, as part of the process, integrate the data from the warehouse into the required applications.
2) Using a combination of integration tools, detect changes in real time or in batch, and publish them to a service bus (in canonical format). Then, for each required application, subscribe to the messages to integrate them. The data warehouse is another subscriber to the same messages.
Thanks in advance.

One aspect that is hard to get right with integration-via-messages is periodic datasets.
Say you have a table in your data warehouse (DW) that contains data partitioned by day. If an ETL job loads that table, you can be sure that if the load job is finished, the respective dataset is complete (unless there's a bug in the job).
Messaging systems, on the other hand, usually don't provide guarantees of timely delivery. So you might get 90% of messages for a particular day by midnight, 8% within the next hour, and the remaining 2% within the next 6 hours (and a few messages might never arrive). In this situation, if you have a job that depends on this data, how can you know that the dataset is ready? You can set an arbitrary cutoff time (e.g. 1 hour past midnight) based on previous experience, SLAs, or some other criteria, when you consider the dataset complete, but that will by design be an approximation. You will also need some means to detect missing data (because of lost messages) and re-request it from the source.
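A minimal sketch of the cutoff idea and of detecting lost messages. It assumes each message carries a per-source sequence number starting at 1, which is an assumption for illustration and not something a messaging system necessarily provides. The names `is_ready` and `missing_sequences` and the one-hour grace period are hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical grace period after the end of the day (an approximation by design).
CUTOFF = timedelta(hours=1)


def is_ready(partition_day, now, cutoff=CUTOFF):
    """Treat a daily partition as complete once the cutoff after midnight has passed."""
    end_of_day = partition_day + timedelta(days=1)
    return now >= end_of_day + cutoff


def missing_sequences(received, last_expected):
    """Return the sequence numbers in 1..last_expected that never arrived.

    Assumes the source numbers its messages consecutively, so gaps can be
    re-requested from the source.
    """
    return [n for n in range(1, last_expected + 1) if n not in received]
```

A dependent job could poll `is_ready` before starting and then use `missing_sequences` to decide what to re-request, accepting that late messages after the cutoff will still be missed.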
This answer talks about similar problems.
Another issue is backfills. Imagine your source sends a backdated message, for example to correct some previously-sent one that belongs to a dataset in the past. Presumably, any consumers of that dataset need to be notified of the change and recompute their results. However, without some additional logic in the DW they might not know about it. With the ETL approach, since you already have dependencies between jobs, if you rerun some job with a backfill date, its dependencies will run automatically, or at least it'll be explicitly known that some consumers are affected.
With these caveats in mind, the messaging approach has some great advantages:
all your systems will be integrated using a uniform approach
the propagation time for your data will potentially be much lower
you won't have to fix ETL jobs that exploded because the data volume has grown past their ability to scale
you won't get SLA violations because your ETL jobs timed out

I guess you are talking about both ETL systems and the Mediation (intra-communication) design pattern. I don't see why you have to choose between them; in my current project we combine them.
The ETL solution is implemented as a layer responsible for managing data integration (via an Orchestrator module). It is a single entry point and part of the Pipes and Filters design pattern that we rely on. It is able to perform a variety of tasks of varying complexity on the information that it processes.
On the other hand, Mediation as an EAI system acts as a "broker" between multiple applications. Whenever an interesting event occurs in an application (for instance, new information is created or a new transaction is completed), an integration module in the EAI system is notified. The module then propagates the changes to the other relevant applications.
So, as a bottom line, I can't give you pros and cons for each, since to me they are a good solution together and their use depends on your goals, design, and so on. But from your description, it seems to me that what you have is similar to what I've suggested.
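The broker role described above can be sketched as a tiny in-process publish/subscribe bus in which the data warehouse is just one more subscriber. This is an illustration only: `MessageBus`, the topic name, and the handlers are hypothetical, and a real EAI product would add persistence, retries, and delivery guarantees.

```python
from collections import defaultdict


class MessageBus:
    """Hypothetical in-memory broker: applications publish, others subscribe."""

    def __init__(self):
        # topic name -> list of handler callables
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, message):
        # Deliver the same canonical message to every subscriber of the topic.
        for handler in self._subscribers[topic]:
            handler(message)


bus = MessageBus()
crm_inbox = []        # stands in for an application that needs the change
warehouse_rows = []   # stands in for the data warehouse loader

bus.subscribe("customer.updated", crm_inbox.append)
bus.subscribe("customer.updated", warehouse_rows.append)
bus.publish("customer.updated", {"id": 1, "name": "Alice"})
```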


How to overcome API/websocket limitations with OHLC data for a trading platform with lots of users?

I'm using CCXT for some REST API calls and for websockets. That's OK for one user, but if I wanted many users on the platform, how would I go about an in-house solution?
Currently each chart uses either websockets or REST calls, so 20 charts means 20 calls, and with more users that's 20 times the number of users. If I fetch a complete coin list with real-time prices from one exchange, that just slows everything down.
Some ideas I have thought about so far are:
Use proxies with REST/websockets
Use TimescaleDB to store the data and serve that, OR
Use caching on the server, and serve that to the users
Would this be a solution? There's got to be a way to overcome rate limiting and reduce the number of calls to the exchanges.
It's probably good to think about having separate layers to:
receive market data (a single connection that broadcasts data to the OHLC processors)
process OHLC histograms (subscribe to internal market data)
serve histogram data (subscribe to processed data)
The market data stream is huge, and if you think about these layers independently, it will be easier to scale and even decouple the components later if necessary.
With TimescaleDB, you can build materialized views (continuous aggregates) that make the information easy to access and retrieve. Each materialized view can have its own continuous aggregate policy based on the interval of the histograms.
Fetching all data all the time for all the users is not a good idea.
Pagination can help by serving the visible histograms first and limiting the query results, which avoids heavy I/O and large chunks of memory on the server.
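As a rough sketch of the processing and serving layers, the class below folds a single stream of trades into OHLC candles and serves paginated results from memory, so user requests never reach the exchange. `OhlcAggregator` and its methods are hypothetical names. The sketch assumes timestamps in whole seconds and trades arriving in time order, and it leaves out the exchange connection itself.

```python
class OhlcAggregator:
    """Hypothetical OHLC processor fed by one shared market data connection."""

    def __init__(self, interval_seconds):
        self.interval = interval_seconds
        # (symbol, bucket_start) -> [open, high, low, close]
        self.candles = {}

    def on_trade(self, symbol, timestamp, price):
        # Align the trade to the start of its candle interval.
        bucket = timestamp - (timestamp % self.interval)
        key = (symbol, bucket)
        candle = self.candles.get(key)
        if candle is None:
            self.candles[key] = [price, price, price, price]
        else:
            candle[1] = max(candle[1], price)  # high
            candle[2] = min(candle[2], price)  # low
            candle[3] = price                  # close (assumes in-order trades)

    def get_candles(self, symbol, limit=100):
        # Serve the most recent `limit` candles, oldest first.
        rows = sorted((b, *c) for (s, b), c in self.candles.items() if s == symbol)
        return rows[-limit:]


agg = OhlcAggregator(interval_seconds=60)
for ts, price in [(0, 10.0), (20, 12.0), (59, 9.0), (60, 11.0)]:
    agg.on_trade("BTC/USDT", ts, price)
```

With this shape, 20 charts for 1,000 users still cost one upstream connection per exchange; only `get_candles` is called per user.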

How to create a specific warehouse order picking strategy in AnyLogic

I am currently working on a generalized warehouse model containing all processes that take place in warehouse operations. I have just started to work with AnyLogic and I cannot figure out how to implement order picking strategies. My current model is able to receive truckloads containing pallets; the pallets are checked, booked, and stored in a racking system. For the outbound processes of picking, packing, and shipping, I created an order containing a single pallet that moves through all the processes. However, a picking process of only single pallets is not really representative of warehouse operations. Therefore, I want to know whether it is possible to implement order picking strategies such as batch picking, wave picking, and discrete picking, amongst others. I hope someone can help me out.
Kind regards,
Stefan
What is packaged inside standard AnyLogic is just a way to easily simulate full pallet moves (receiving, putaway, picking, shipping) without a lot of programming. To do that you just use existing AnyLogic objects: RackStore, RackPick, maybe some MoveTo, Queue, etc.
But if you go beyond that and want to build a realistic warehouse with processing at a lower level of the packing structure (pieces, and maybe even more: layers, blisters, etc.), that will require quite some coding. Depending on your chosen abstraction level, in the extreme case you may even want to code everything exactly as it is coded in your WMS. Or maybe you simplify some things, but it is still quite a modelling task for every new process (wave picking, discrete picking, etc.) you want to implement. So the answer to your question is: yes, it's possible, but beware of the high effort.

Eventual consistency in plain English

I often hear about eventual consistency in various talks about NoSQL, data grids, etc.
It seems that the definition of eventual consistency varies between sources (and maybe even depends on the concrete data store).
Can anyone give a simple explanation of what eventual consistency is in general terms, not related to any concrete data store?
Eventual consistency:
I watch the weather report and learn that it's going to rain tomorrow.
I tell you that it's going to rain tomorrow.
Your neighbor tells his wife that it's going to be sunny tomorrow.
You tell your neighbor that it is going to rain tomorrow.
Eventually, all of the servers (you, me, your neighbor) know the truth (that it's going to rain tomorrow), but in the meantime the client (his wife) came away thinking it is going to be sunny, even though she asked after one or more of the servers (you and me) had a more up-to-date value.
As opposed to Strict Consistency / ACID compliance:
Your bank balance is $50.
You deposit $100.
Your bank balance, queried from any ATM anywhere, is $150.
Your daughter withdraws $40 with your ATM card.
Your bank balance, queried from any ATM anywhere, is $110.
At no time can your balance reflect anything other than the actual sum of all of the transactions made on your account to that exact moment.
The reason why so many NoSQL systems have eventual consistency is that virtually all of them are designed to be distributed, and with fully distributed systems there is super-linear overhead to maintaining strict consistency (meaning you can only scale so far before things start to slow down, and when they do you need to throw exponentially more hardware at the problem to keep scaling).
Eventual consistency:
Your data is replicated on multiple servers
Your clients can access any of the servers to retrieve the data
Someone writes a piece of data to one of the servers, but it hasn't yet been copied to the rest
A client accesses the server with the data, and gets the most up-to-date copy
A different client (or even the same client) accesses a different server (one which didn't get the new copy yet), and gets the old copy
Basically, because it takes time to replicate the data across multiple servers, requests to read the data might go to a server with a new copy, and then go to a server with an old copy. The term "eventual" means that eventually the data will be replicated to all the servers, and thus they will all have the up-to-date copy.
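The steps above can be simulated in a few lines. This is a toy model, not how any particular data store works: `Cluster`, `write`, `read`, and `replicate` are hypothetical names, and replication is triggered by hand to make the stale window visible.

```python
class Cluster:
    """Toy model of replicas that receive writes asynchronously."""

    def __init__(self, replica_count):
        self.replicas = [{} for _ in range(replica_count)]
        self.pending = []  # (replica_index, key, value) not yet copied

    def write(self, replica_index, key, value):
        # The write lands on one server immediately...
        self.replicas[replica_index][key] = value
        # ...and is queued for every other server.
        for i in range(len(self.replicas)):
            if i != replica_index:
                self.pending.append((i, key, value))

    def read(self, replica_index, key):
        # Each server answers from its own copy without consulting the others.
        return self.replicas[replica_index].get(key)

    def replicate(self):
        # "Eventually": the queued changes reach all the servers.
        for i, key, value in self.pending:
            self.replicas[i][key] = value
        self.pending.clear()


cluster = Cluster(3)
cluster.write(0, "weather", "rain")
fresh = cluster.read(0, "weather")   # server with the new copy
stale = cluster.read(1, "weather")   # server that has not received it yet
cluster.replicate()
after = [cluster.read(i, "weather") for i in range(3)]
```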
Eventual consistency is a must if you want low latency reads, since the responding server must return its own copy of the data, and doesn't have time to consult other servers and reach a mutual agreement on the content of the data. I wrote a blog post explaining this in more detail.
Suppose you have an application and a replica of it, and you add a new data item to the application.
The application then synchronises the data to the other replica.
Meanwhile, a new client reads from the replica that has not been updated yet. It cannot get the up-to-date data, because synchronisation takes some time. At that moment the system is not yet consistent.
The problem is: how can we achieve eventual consistency?
For that, we use a mediator application to create, update, and delete data, and use direct querying to read data. That helps to achieve eventual consistency.
When an application makes a change to a data item on one machine, that change has to be propagated to the other replicas. Since the change propagation is not instantaneous, there’s an interval of time during which some of the copies will have the most recent change, but others won’t. In other words, the copies will be mutually inconsistent. However, the change will eventually be propagated to all the copies, and hence the term “eventual consistency”. The term eventual consistency is simply an acknowledgement that there is an unbounded delay in propagating a change made on one machine to all the other copies. Eventual consistency is not meaningful or relevant in centralized (single copy) systems since there’s no need for propagation.
source: http://www.oracle.com/technetwork/products/nosqldb/documentation/consistency-explained-1659908.pdf
Eventual consistency means changes take time to propagate and the data might not be in the same state after every action, even for identical actions or transformations of the data. This can cause very bad things to happen when people don’t know what they are doing when interacting with such a system.
Please don’t implement business-critical document data stores until you understand this concept well. Screwing up a document data store implementation is much harder to fix than a relational model, because the fundamental things that are going to be screwed up simply cannot be fixed: the things required to fix them are just not present in the ecosystem. Refactoring the data of an in-flight store is also much harder than the simple ETL transformations of an RDBMS.
Not all document stores are created equal. Some these days (MongoDB) do support transactions of a sort, but migrating datastores is likely comparable to the expense of re-implementation.
WARNING: Developers and even architects who do not know or understand the technology of a document data store and are afraid to admit that for fear of losing their jobs, but who have been classically trained in RDBMS and only know ACID systems (how different can it be?), and who don’t know the technology or take the time to learn it, will misdesign a document data store. They may also try to use it as an RDBMS or for things like caching. They will break down what should be atomic transactions operating on an entire document into “relational” pieces, forgetting that replication and latency are things, or worse yet, dragging third-party systems into a “transaction”. They’ll do this so their RDBMS can mirror their data lake, without regard to whether it will work or not, and with no testing, because they know what they are doing. Then they will act surprised when complex objects stored in separate documents, like “orders”, have fewer “order items” than expected, or maybe none at all. But it won’t happen often, or often enough, so they’ll just march forward. They may not even hit the problem in development. Then, rather than redesign things, they will throw in “delays” and “retries” and “checks” to fake a relational data model, which won’t work, but will add additional complexity for no benefit. But it’s too late now: the thing has been deployed and the business is running on it. Eventually, the entire system will be thrown out, the department will be outsourced, and someone else will maintain it. It still won’t work correctly, but they can fail less expensively than the current failure.
In simple English, we can say: Although your system may be in inconsistent states, the aim is always to reach consistency at some point for each piece of data.
Eventual consistency is more like a spectrum. On one end you have strong consistency and on the other you have eventual consistency. In between there are levels like snapshot, read-my-writes, and bounded staleness. Doug Terry has a beautiful explanation in his paper on eventual consistency through baseball.
As I see it, eventual consistency is basically tolerance of random data in random order every time you read from a data store. Anything better than that is a stronger consistency model. For example, a snapshot has stale data but will return the same data if read again, so it is predictable. Sometimes an application can tolerate data that is stale for a given amount of time, beyond which it demands consistent data.
If you look at the meaning of consistency, it relates more to uniformity or lack of deviation. So in non-computer terms it could mean tolerance of unexpected variations. It can be explained very well through an ATM. An ATM could be offline and hence diverge from the account balance in the core systems. However, showing different balances is tolerated for a window of time. Once the ATM comes online, it can sync with the core systems and reflect the same balance. So an ATM could be said to be eventually consistent.
Eventual consistency guarantees consistency throughout the system, but not at all times. There is an inconsistency window, where a node might not have the latest value, but will still return a valid response when queried, even if that response is not accurate. Cassandra has a ring system where your data is split up across different nodes.
Any of those nodes can act as the primary interface point for your application. So there is no single point of failure, because any of those nodes can serve as your primary API point. But there is a trade-off here. Because any node can be primary, the data needs to be replicated amongst all of these nodes in order to stay up to date. So all of the other nodes need to know what is where at all times, and that means that, as a trade-off for this architecture, we have eventual consistency, because it takes time for the data to propagate throughout the ring, through every node in your system. So, as the data is written, it might be a little while before you can actually read back the data you just wrote. Maybe the data was written to one node, but you are reading it from a different node and the written data has not made it to that other node yet.
Let's say you back up the photos on your phone to the cloud every Sunday. If you check your photos in the cloud on Friday, you are not going to see the photos that were taken between Monday and Friday. You still get a response, just not an up-to-date one. But if you check your cloud on Sunday night, you will see all of your photos. So your data across the phone and the cloud service eventually reaches consistency.

Creating a snapshot in a distributed architecture

I'm thinking about the problem in the question title: if I have to query for an aggregate in a distributed architecture where the distributed event store may still be waiting for the last events to be distributed, how can I know that the aggregate I'm reading via the read model is not being replaced by an updated one on another server in the network?
I have an HTTP server that receives events to save in the store. The store does not exist yet, but I want to implement it soon.
The events concern a huge aggregate that takes 4 MB when serialized as JSON.
Another sub-question: what storage do you recommend for the snapshot?
EDIT
I don't understand if the question is not written well or if I have selected wrong tags...
The ability to know when the "last" event in the distributed store is processed depends on two things:
Can you define "last"?
Does the distributed storage engine expose it to you?
The CAP theorem is a good reference to the sort of problems you are going to have with both of those in a distributed data store; in general, unless you give up availability you are not going to be able to have the properties needed to get what you want.
On the other hand, if you can define last in a meaningful way, you can still have what you want. For example: do your events expire after a while? If, for example, they expire after 12 hours, you know that you can always meaningfully define last as "the moment in time 12 hours ago", because any unprocessed event older than that is obsolete...
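A minimal sketch of that definition of "last", assuming events carry a timestamp and expire after 12 hours as in the example. The names `watermark` and `settled_events` and the dictionary layout of an event are hypothetical.

```python
from datetime import datetime, timedelta

# Assumed event lifetime, taken from the 12-hour example above.
EVENT_TTL = timedelta(hours=12)


def watermark(now, ttl=EVENT_TTL):
    """The point in time before which no unprocessed event can still matter."""
    return now - ttl


def settled_events(events, now, ttl=EVENT_TTL):
    """Events at or before the watermark, which are safe to fold into a snapshot."""
    mark = watermark(now, ttl)
    return [event for event in events if event["timestamp"] <= mark]


now = datetime(2024, 1, 2, 12, 0)
events = [
    {"id": 1, "timestamp": datetime(2024, 1, 1, 23, 0)},
    {"id": 2, "timestamp": datetime(2024, 1, 2, 6, 0)},
]
snapshot_input = settled_events(events, now)
```

A snapshot built only from `snapshot_input` lags by the expiry period, but it cannot be invalidated by an event that is still in flight.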
To answer your sub-question, I strongly recommend a storage engine that you do not write yourself, because distributed data storage is an awesomely hard problem that many very smart people, working for companies that do nothing but solve problems in this space, are solving for you.
Leverage their work instead.

Data Synchronization in a Distributed system

We have a REST-based application built on the Restlet framework which supports CRUD operations. It uses a local file to store the data.
Now the requirement is to deploy this application on multiple VMs, and any update operation on one VM needs to be propagated to the application instances running on the other VMs.
Our idea to solve this was to send multiple POST messages (to all other applications) when an update operation happens on a given VM.
The assumption here is that each application has a list/URLs of all other applications.
Is there a better way to solve this?
Consistency is a deep topic, and a hard thing to get right. The trouble comes when two nearly-simultaneous changes occur to the same data: conflicting updates can arrive in one order on one server, and in another order on another. This is a problem, since the two servers no longer agree on what the data is, and it isn't clear who is "right".
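The ordering problem is easy to reproduce. In this small sketch, two servers receive the same two updates in opposite orders and end up disagreeing; `apply_updates` and the sample updates are made up for illustration.

```python
def apply_updates(initial, updates):
    """Apply (key, value) updates in arrival order; the last write wins."""
    state = dict(initial)
    for key, value in updates:
        state[key] = value
    return state


update_a = ("price", 10)
update_b = ("price", 12)

# Same updates, different arrival order on each server.
server_1 = apply_updates({}, [update_a, update_b])
server_2 = apply_updates({}, [update_b, update_a])
```

Here `server_1` holds a price of 12 and `server_2` a price of 10, and nothing in the data says which one is "right".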
The short story: get your favorite RDBMS (for example, MySQL is popular) and have your app servers connect to it in what is called the three-tier model. Be sure to perform complex updates in transactions, which will provide an acceptable consistency model.
The long story: The three-tier model serves well for small-to-medium scale web sites/services. You will eventually find that the single database becomes the bottleneck. For services whose read traffic is substantially larger than write traffic, a common optimization is to create a single-master, many-slave database replication arrangement, where all writes go to the single master (required for consistency with non-distributed transactions), but the more-common reads can go to any of the read slaves.
For services with evenly-mixed read/write traffic, you may be better served by dropping some of the conveniences (and accompanying restrictions) that formal SQL provides and instead using one of the various "NoSQL" data stores that have recently emerged. Their relative merits and fitness for various problems is a deep topic in itself.
I can see 7 major options for now. You should find out more details and decide whether the facilities and trade-offs are appropriate for your purpose.
Perform the CRUD operations on a common RDBMS. This is the simplest and most consistent option.
Perform the CRUD operations on a common RDBMS that runs as a fast in-memory RDBMS, e.g. TimesTen from Oracle.
Perform the CRUD operations on a distributed cache, or on your own home-cooked distributed hash table, that can guarantee synchronization, e.g. Hazelcast, Ehcache, and others.
Use a fast common state server like Redis or memcached, perform your updates on it in a synchronized manner, and lazily write the successful operations out to a DB if required.
Distribute your REST servers such that the CRUD operations on a single entity are only performed by a single master. Once this is done, the details about the changes can be communicated to everyone else using a reliable message bus or a distributed database (e.g. Postgres) that runs underneath and syncs all of your updates fairly fast.
Target eventual consistency and use a distributed data store like Cassandra, which lets you choose the consistency level you require.
Use a distributed consensus algorithm like Paxos or Raft, or (recommended) an existing implementation of such an algorithm, like ZooKeeper or etcd, and take ownership of the item you want to change from each REST server before you perform the CRUD operation. This might be a bit slow, though, and it is much the same as what Cassandra might give you.
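For the single-master-per-entity option, one simple way to decide which server owns an entity is a stable hash of its ID. This is only a sketch under assumptions: `owner_for` and the server URLs are hypothetical, and the server list is assumed to be identical and fixed on every VM.

```python
import hashlib


def owner_for(entity_id, servers):
    """Pick the one server that is allowed to modify this entity.

    Uses SHA-256 rather than the built-in hash(), because hash() of a string
    differs between Python processes and every VM must agree on the owner.
    """
    digest = hashlib.sha256(entity_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]


# Hypothetical instance URLs; every VM would hold the same list.
servers = ["http://vm1:8080", "http://vm2:8080", "http://vm3:8080"]
owner = owner_for("order-42", servers)
```

A server that receives an update for an entity it does not own would forward the request to `owner` instead of writing locally. Note that plain modulo hashing reassigns many entities when the server list changes; consistent hashing reduces that.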