I know this question has been asked before at PostgreSQL to Data-Warehouse: Best approach for near-real-time ETL / extraction of data.
But I want to rephrase this question.
I am attempting to build a real-time data warehouse. The difference between real-time and near real-time is huge.
As I see it, a real-time data warehouse is event-driven and transactional in approach,
while a near-real-time one runs the same batch-mode application but polls for data more frequently.
That polling would put so much extra load on the production server that it would certainly kill the production system.
Being a batch approach, it would scan through all the tables for changes and pick up the rows that have changed since
a cut-off timestamp.
By event-driven I mean that it would be specific to the tables that have undergone changes and would focus only on the transactions
that are happening currently.
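To make the contrast concrete, here is a minimal sketch of the cut-off timestamp polling described above, using an in-memory SQLite table. The table `orders`, the column `updated_at` and the function `extract_changes` are hypothetical names chosen for illustration; a real source would need a reliably maintained modification timestamp on every table, which is part of the cost being questioned here.

```python
import sqlite3

def extract_changes(conn, table, cutoff):
    """Return rows modified after `cutoff`, plus the new cutoff (hypothetical schema)."""
    # Table names cannot be bound as parameters; `table` must come from trusted config.
    rows = conn.execute(
        f"SELECT id, amount, updated_at FROM {table} "
        "WHERE updated_at > ? ORDER BY updated_at",
        (cutoff,),
    ).fetchall()
    new_cutoff = rows[-1][2] if rows else cutoff
    return rows, new_cutoff

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10.0, "2024-01-01T10:00:00"), (2, 20.0, "2024-01-01T10:05:00")],
)

first_batch, cutoff = extract_changes(conn, "orders", "1970-01-01T00:00:00")
# A later change to row 1 is picked up by the next poll; row 2 is not re-read.
conn.execute("UPDATE orders SET amount = 15.0, updated_at = '2024-01-01T10:07:00' WHERE id = 1")
second_batch, cutoff = extract_changes(conn, "orders", cutoff)
```

Every poll still costs one query per table on the source, whether or not anything changed, which is why this does not scale well to thousands of tables.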
But the source system is an elephant of a system: SAP, which we can assume has 25,000 tables. It is not easy to model that, and
it is not easy to write database triggers on each table to capture each change. I want the impact on the production server to be minimal.
Is there any trigger at the database level that would let me capture all changes happening in the database with one trigger?
Is there any way to write that database trigger on a different database server so that the production server goes untouched?
I have not been keeping pace with changes in database technology, and I am sure some nice new technologies have come along to capture these changes easily.
I know of log miners and change data capture, but it would be difficult to filter the information I need out of the redo logs.
Are there alternative ways to capture database write operations on the go?
Just for completeness' sake, let us assume the databases are a heterogeneous mix of Oracle, SQL Server and DB2. But my focus is on
the concepts we want to develop.
This is a universal problem; every company is looking for an easy-to-implement solution, so a good discussion would benefit all.
Don't ever try to access SAP directly. Use the APIs of SAP Data Services (http://help.sap.com/bods). Look for the words "Integrator Guide" on that page for documentation.
This document should give you a good hint about where to look for your data sources (http://wiki.scn.sap.com/wiki/display/EIM/What+Extractors+to+use). Extractors are somewhat like views in a DBMS: they abstract all the SAP stuff into something human-readable.
As for near-real-time, think in terms of micro-batches. Run your extract jobs every 5 (?) minutes, or at longer intervals if necessary.
Check the Lambda Architecture from Nathan Marz (I provide no link, but you'll find the resources easily). The implementation is all Java and NoSQL, but the main ideas are applicable to classical relational databases as well. In a nutshell: you have two implementations, one real-time but responsible for only a limited time interval. The "long tail" is maintained with a classical best-practice batch implementation.
The real-time part is always discarded after the batch refresh, which effectively blocks problems in the real-time processing from propagating into the historical data.
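The two-layer idea can be sketched in a few lines. This is only an illustration of the principle described above, not Marz's implementation; the class `LambdaView` and its method names are made up for the example.

```python
from collections import Counter

class LambdaView:
    """Batch view plus a disposable real-time view (illustrative names)."""

    def __init__(self):
        self.batch_view = Counter()     # rebuilt from the full history by the batch job
        self.realtime_view = Counter()  # covers only events since the last batch run

    def on_event(self, key, amount):
        # Speed layer: cheap incremental update.
        self.realtime_view[key] += amount

    def batch_refresh(self, full_history):
        # Batch layer: recompute from scratch, then discard the real-time part,
        # so any error in incremental processing does not reach the history.
        self.batch_view = Counter()
        for key, amount in full_history:
            self.batch_view[key] += amount
        self.realtime_view = Counter()

    def query(self, key):
        return self.batch_view[key] + self.realtime_view[key]

view = LambdaView()
history = [("sales", 100), ("sales", 50)]
view.batch_refresh(history)
view.on_event("sales", 25)          # arrives between batch runs
before = view.query("sales")        # batch + real-time
history.append(("sales", 25))
view.batch_refresh(history)         # real-time part is thrown away
after = view.query("sales")
```

Queries give the same answer before and after the refresh, but after it the answer rests entirely on the batch computation.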
As of now I can see only two solutions:
Write services on the source systems. If the source is COBOL, wrap it in services. Put all the services on a service bus and
somehow trap when changes happen to the database. How that trap would work needs to be explored.
But from the outset it appears to be a very
expensive and uncertain proposition. Convincing management to accept a three-year lag time would be difficult. Services are not easy.
Log shippers: This is a trusted database solution. Logs would be available on another server, so the production server need not be
burdened. There are a good number of tools available as well.
But the spirit does not match. The event-driven aspect is missing, so the action is not captured at the moment things
are happening. I will settle for this.
As Ron pointed out, NEVER TOUCH SAP TABLES DIRECTLY. There are adapters and adapters to access SAP tables. This will build another layer in between, but it is unavoidable. One piece of good news I want to share: a customer did a study of SAP tables and found that only 14% of the tables are actually populated or touched by the SAP system. Even then, 14% of 25,000 tables still comes to a huge data model of 2000+ entities. Again, micro-batches amount to dividing the system into Purchase, Receivables, Payables etc., which is heading for a data mart and not an EDW. I want to focus on an Enterprise Data Warehouse.
I'm new to PipelineDB and have yet to even experience it at runtime (installation pending ...). But I'm reading over the documentation and I'm totally intrigued.
Apparently, PipelineDB is able to take set-based query representations and mechanically transform them into an incremental representation for efficiently processing the streams of deltas with storage limited as a function of the output of the continuous view.
Is it also supported to run the set-based query as a set-based query for priming a continuous view? It seems to me that upon creation of a Continuous View the initial data would be computed this traditional way. Also, since Continuous Views can be truncated, can they then be repopulated (from still-available source tables) without tearing down whatever dependent objects it has to allow a drop/create?
It seems to me that this feature would be critical in many practical scenarios. One easy example would be refreshing occasionally to reset the drift from rounding errors in, say, fractional averages.
Another example is if a bug were discovered and fixed in PipelineDB itself which had caused errors in the data. After the software is patched, the queries based on data still available ought to be rerun.
Continuous Views based fully on event streams with no permanent storage could not be rebuilt in that way. I'm not sure what happens if only some of the join sources are ephemeral.
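To illustrate what I mean by priming and rebuilding, here is a small sketch of an incrementally maintained average. It says nothing about how PipelineDB works internally; the class `IncrementalAvg` and the method `primed_from` are hypothetical. The state is kept as (total, count), and the same view can be filled either from a stream of deltas or by a set-based pass over rows that are still stored.

```python
from fractions import Fraction

class IncrementalAvg:
    """Keeps (total, count) so the average is derived, never accumulated."""

    def __init__(self):
        self.total = Fraction(0)
        self.count = 0

    def update(self, value):
        self.total += Fraction(value)
        self.count += 1

    def value(self):
        return self.total / self.count if self.count else None

    @classmethod
    def primed_from(cls, rows):
        # "Priming": run the set-based computation over rows that are still stored.
        view = cls()
        for v in rows:
            view.update(v)
        return view

stored_rows = [1, 2, 4]
view = IncrementalAvg.primed_from(stored_rows)
view.update(5)                      # a new event from the stream
rebuilt = IncrementalAvg.primed_from(stored_rows + [5])
```

With exact arithmetic the incremental and rebuilt results agree; with floating point they can drift apart, which is the case where an occasional rebuild would help.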
I don't see these topics covered in the docs. Can you explain how these are or aren't a concern?
Thanks!
Jeff from PipelineDB here.
The main answer to your question is covered in the introduction section of the PipelineDB technical docs:
"PipelineDB can dramatically reduce the amount of information that needs to be persisted to disk because only the output of continuous queries is stored. Raw data is discarded once it has been read by the continuous queries that need to read it."
While continuous views only store the output of continuous queries, almost everybody who is using PipelineDB is storing their raw data somewhere cheap like S3. PipelineDB is meant to be the realtime analytics layer that powers things like realtime reporting applications and realtime monitoring & alerting systems, used almost always in conjunction with other systems for data infrastructure.
If you're interested in PipelineDB you might also want to check out the new realtime analytics API product we recently rolled out called Stride. The Stride API gives developers the benefit of continuous SQL queries, integrated storage, windowed queries, and other things like realtime webhooks, all without having to manage any underlying data infrastructure, all via a simple HTTP API.
If you have any additional technical questions you can always find our open-source users and dev team hanging out in our gitter chat channel.
Event sourcing and CQRS is great because it gets rid of developers being stuck with one pre-modelled database which the developer has to work with for the lifetime of the application unless there is a big data migration project. CQRS and ES also have other benefits, like scaling the event store, audit logs etc., that are already covered all over the internet.
But what are the disadvantages ?
Here are some disadvantages that I can think of after researching and writing small demo apps
Complex: Some people say ES is complex. But I'd say having a complex application is better than a complex database model on which you can only run very restricted queries using a query language (multiple joins, indexes etc). I mean, some programming languages like Scala have a very rich collection library that is flexible enough to produce some seriously complex aggregations, and there is also Apache Spark, which makes it easy to query distributed collections. But databases will always be restricted to their query language's capabilities, and distributing databases is harder than distributing application code (just deploy another instance on another machine!).
High disk space usage: The event store might end up using a lot of disk space to store events. But we can schedule a clean-up every few weeks and create a snapshot, and maybe we can store historical events locally on an external HD just in case we need old events in the future?
High memory usage: The state of every domain object is stored in memory, which might increase RAM usage, and we all know how expensive RAM is. BIG PROBLEM!! because I'm poor! Any solution to this? Maybe use SQLite instead of storing state in memory? Am I making things more complex by introducing multiple SQLite instances in my application?
Longer bootup time: On failure or software upgrade, bootup is slow depending on the number of events. But we can use snapshots to solve this?
Eventual consistency: A problem for some applications. Imagine if Facebook used event sourcing with CQRS for storing posts: considering how busy Facebook's system is, if I posted something I might not see my post until the next day :)
Serialized events in Event store: Event stores store events as serialized objects, which means we can't query the content of events in the event store (which is discouraged anyway). And we won't be able to add another attribute to the event in the future. Would the solution be to store events as JSON objects instead of serialized events? But is that a good idea? Or add more events to support the change to the original event object?
Can someone please comment on the disadvantages I brought up here, correct me if I am wrong, and suggest any others I may have missed?
Here is my take on this.
CQRS + ES can make things a lot simpler in complex software systems by having rich domain objects, simple data models, history tracking, more visibility into concurrency problems, scalability and much more. It does require a different way of thinking about systems, so it can be difficult to find qualified developers. But CQRS makes it simpler to separate responsibilities across developers. For example, a junior developer can work purely with the read side without having to touch business logic.
Copies of data will require more disk space for sure. But storage is relatively cheap these days. It may require the IT support team to do more backups and to plan how to restore the system in case things go wrong. However, server virtualization these days makes that a more streamlined workflow. Also, it is much easier to create redundancy in the system without a monolithic database.
I do not consider higher memory usage a problem. Business object hydration should be done on demand. Objects should not keep references to events that have already been persisted. And event hydration should happen only when persisting data. On the read side you do not have Entity -> DTO -> ViewModel conversions that usually happened in tiered systems, and you would not have any kind of object change tracking that full featured ORMs usually do. Most systems perform significantly more reads than writes.
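A minimal sketch of on-demand hydration, assuming a toy `Account` aggregate and events represented as `(kind, amount)` tuples (both invented for this example): the object is rebuilt only when it is needed, keeps no references to the events, and can start from a snapshot to shorten the replay.

```python
from dataclasses import dataclass

@dataclass
class Account:
    balance: int = 0
    version: int = 0   # number of events applied so far

    def apply(self, event):
        kind, amount = event
        if kind == "deposited":
            self.balance += amount
        elif kind == "withdrawn":
            self.balance -= amount
        self.version += 1

def hydrate(events, snapshot=None):
    """Rebuild one aggregate on demand; replay only events after the snapshot."""
    account = Account(snapshot.balance, snapshot.version) if snapshot else Account()
    for event in events[account.version:]:
        account.apply(event)
    return account

events = [("deposited", 100), ("withdrawn", 30), ("deposited", 5)]
snapshot = hydrate(events[:2])             # state after the first two events
from_snapshot = hydrate(events, snapshot)  # replays only the third event
from_scratch = hydrate(events)
```

Only the aggregates touched by the current command live in memory, and each is discarded once the command has been handled.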
Longer boot-up time can be a slight problem if you are using multiple heterogeneous databases, due to the initialization of the various data contexts. However, if you are using something simple like ADO.NET to interact with the event store and a micro-ORM for the read side, the system will "cold start" faster than with any full-featured ORM. The important thing here is not to over-complicate how you access the data. That is actually a problem CQRS is supposed to solve. And as I said before, the read side should be modeled for the views and not have any overhead of re-mapping data.
In my experience, two-phase commit can work well for systems that do not need to scale to thousands of users. You would need to choose databases that work well with the distributed transaction coordinator. PostgreSQL can work well for separate read and write models, for example. If the system needs to scale to a high number of concurrent users, it has to be designed with eventual consistency in mind. There are cases where you would have aggregate roots or context boundaries that do not use CQRS, to avoid eventual consistency. That makes sense for non-collaborative parts of the domain.
You can query events in a serialized format like JSON or XML if you choose the right database for the event store. That should only be done for analytics purposes. Nothing inside the system should query the event store by anything other than the aggregate root id and the event type. That data would be indexed and live outside the serialized event.
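Here is one way such an event table could look, sketched with SQLite. The table and column names are assumptions for the example; the point is that `aggregate_id` and `event_type` are ordinary indexed columns, while the event body is stored as serialized JSON that the system itself never filters on.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events ("
    " seq INTEGER PRIMARY KEY AUTOINCREMENT,"
    " aggregate_id TEXT NOT NULL,"
    " event_type TEXT NOT NULL,"
    " payload TEXT NOT NULL)"      # serialized JSON, opaque to the system itself
)
conn.execute("CREATE INDEX idx_events_agg ON events (aggregate_id, event_type)")

def append(aggregate_id, event_type, data):
    conn.execute(
        "INSERT INTO events (aggregate_id, event_type, payload) VALUES (?, ?, ?)",
        (aggregate_id, event_type, json.dumps(data)),
    )

def load(aggregate_id):
    """Return (event_type, data) pairs for one aggregate, in write order."""
    rows = conn.execute(
        "SELECT event_type, payload FROM events WHERE aggregate_id = ? ORDER BY seq",
        (aggregate_id,),
    )
    return [(event_type, json.loads(payload)) for event_type, payload in rows]

append("order-1", "OrderPlaced", {"total": 40})
append("order-2", "OrderPlaced", {"total": 15})
append("order-1", "OrderShipped", {"carrier": "post"})
order_1 = load("order-1")
```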
Just to comment on point 5. I've been told that Facebook does use ES with Eventual Consistency, which is why you can sometimes see a post disappear and reappear after you've posted it.
Usually the read-model your browser is accessing is located 'close' to you, but after you make a post the SPA switches over to a read-model that is close to your write-model. The close proximity between the write-model (events) and the read-model means you get to see your own post.
However, 15 minutes later your SPA switches back to the first, closer, read-model. If the event containing your post hasn't yet propagated to that read-model you'll see your own post disappear only to reappear sometime later.
I know it's been almost 3 years since this question was asked, but still this article may be useful for someone. Key points are
Scaling with snapshots
Visibility of data
Schema changing
Dealing with complex domains
Need to explain it to most new team members
Event sourcing and CQRS is great because it gets rid of developers being stuck with one pre-modeled database which the developer has to work with for the lifetime of the application unless there is a big data migration project.
This is a big misconception. Relational databases were invented exactly to allow the model to evolve (thanks to simple two-dimensional tables as opposed to pre-defined hierarchical structures). With views and procedures ensuring the encapsulation of data access, the logical and physical model can evolve independently. This is also why SQL defines DDL and DML in the same language. Some RDBMSs also allow all those evolutions to be versioned and deployed online (continuous delivery), such as Oracle Edition-Based Redefinition.
Big data structures are predefined and can be read only with the code developed for this structure. That is OK when the data is consumed immediately, but you will have a hard time reading it 10 years later without the exact version of the code and the language compiler or interpreter.
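The point about views encapsulating data access can be shown with a small SQLite sketch. The table and view names are invented for the example. The application only ever runs `QUERY` against the view, and that query keeps working after the physical table is restructured.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, full_name TEXT);
    INSERT INTO person VALUES (1, 'Ada Lovelace');
    CREATE VIEW person_v AS SELECT id, full_name FROM person;
""")
QUERY = "SELECT full_name FROM person_v WHERE id = 1"   # what the application runs
before = conn.execute(QUERY).fetchone()[0]

# The physical model changes: the name is split into two columns.
conn.executescript("""
    DROP VIEW person_v;
    CREATE TABLE person_new (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
    INSERT INTO person_new VALUES (1, 'Ada', 'Lovelace');
    DROP TABLE person;
    CREATE VIEW person_v AS
        SELECT id, first_name || ' ' || last_name AS full_name FROM person_new;
""")
after = conn.execute(QUERY).fetchone()[0]
```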
I hope I'm not too late to try to give an answer. In recent months I've done a lot of research on this topic with the goal of implementing a production-grade solution for some parts of my architecture where ES can make sense.
Complex: Actually, it should not be complex; its mission is to be dead simple. How? By pushing all the complexity from business logic code to infrastructure code. The data access should be done by frameworks, which are not mature enough yet. There is still no clear winner in the ES/CQRS race, maybe because it is still a niche/hipster approach (?), so some teams roll their own solution while others adopt a ready-made technology such as Axon.
High disk space usage: I would go further and say *potentially infinite* disk usage. But if you go towards ES, you also have very good reasons to tolerate this apparent drawback. Let me give some of them:
Audit Logs: The datastore is an event log, as we already know. Financial apps or any mission/safety-critical system could need a centralized audit log that makes it possible to state who did what and when. ES provides this capability out of the box... you can also decorate your event entries with some business-meaningful metadata (e.g. a transaction id correlated with some API consumer identity, a severity level of the operation...)
High Concurrency: There are systems where logical resource states are mutated by many clients concurrently, such as games, IoT platforms, and so on. Logging events instead of changing a state representation can be a smart way to provide a total order of events. The other way is to delegate the synchronization to the DB. But this is not what you want if you're into ES.
Analytics: Let's say you have a lot of data with a lot of business value, but you still don't know which parts. For years we extracted knowledge from application data by reorganizing it into different information models (OLAP cubes). The event store again provides something similar out of the box. Event logs are the rawest representation of information, and you can process them in many ways, in batch or by reacting to events as they are stored.
High memory usage: I think it should be the same once you have built your projection
Longer bootup time: If the read side caches its projections and "remembers" the last update event, it should not need to re-apply the entire event sequence. Snapshots are a mitigation, but if you need a lot of snapshots maybe you made a bad choice with ES. I think that this problem is minor in microservices ecosystems, where the boot time can be masked without service interruption. In fact, you get the most out of ES/CQRS when you apply it to microservices.
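A sketch of a read-side projection that "remembers" the last applied event might look like this. The class `BalanceProjection` and the tuple-shaped events are assumptions for illustration; in practice `last_position` would be persisted together with the projection so that it survives a restart.

```python
class BalanceProjection:
    """Read-side projection that remembers the last event position it applied."""

    def __init__(self):
        self.balances = {}
        self.last_position = 0     # would be persisted alongside the projection

    def catch_up(self, event_log):
        """Apply only events after last_position; return how many were applied."""
        new_events = event_log[self.last_position:]
        for account, delta in new_events:
            self.balances[account] = self.balances.get(account, 0) + delta
        self.last_position += len(new_events)
        return len(new_events)

event_log = [("a", 10), ("b", 5), ("a", -3)]
projection = BalanceProjection()
applied_at_boot = projection.catch_up(event_log)
event_log.append(("b", 1))
applied_after_restart = projection.catch_up(event_log)   # only the new event
```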
Eventual consistency: Blame the CAP theorem for this, not ES. Many non-ES/CQRS systems have to deal with this too, but there are a lot of scenarios where it is not a real problem. These are the scenarios where ES fits well. And you can mix ES and non-ES services on the same platform.
Serialized events in Event store: If it's important to have a non-serialized event representation, you could use a document-oriented DB, but if you do this in order to query event payloads, you are missing the point of ES/CQRS. ES means moving all data manipulation from the DB side to the application tier, where every piece changes quickly and all are stateless. This enhances scalability and fault tolerance and provides a means to shape the organization of your team, for example by letting the frontend guy/girl easily write his/her BFF in JavaScript.
I hope to put these principles into practice with good results and to draw on the benefits of this exciting approach.
I'm looking to convert a relatively new web-based application with a clear domain model over to more of a CQRS style system. My new application is essentially an enhanced replacement of an older existing system.
The existing systems in my organization share a set of common databases, which are updated by an untold number of applications (developed via the Chaos Method) that exist in silos throughout the company. (As it stands, I believe that no single person in the company can identify them all.)
My question is therefore about the read model(s) for my application. Since various status changes, general user data, etc. are updated by other applications outside my control, what's the best way to handle building the read models in such a way that I can deal with outside updates, but still keep things relatively simple?
I've considered the following so far:
Create Views in the database for read models, that read all tables, legacy and new
Add triggers to existing tables to update new read model tables
Add some code to the database (CLR Stored proc/etc [sql server]) to update an outside datastore for read models
Abandon hope
What is the general consensus on how to approach this? Is it folly to think I can bring order to a legacy system without fully rewriting everything from scratch?
I've used option #1 with success. Creating views to denormalize the data into a read model is a viable option, depending on the complexity of the write database(s). Meaning, if it comes down to relatively straightforward joins that most developers can understand, then I would take a closer look to see if it's viable for you. I would be careful about having too much complexity in these views.
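As a small illustration of option #1, here is a view that joins a legacy table and a new table into one flat read model, sketched with SQLite (the asker is on SQL Server; all table and column names here are invented). Because the view is evaluated at query time, a direct update by some other application shows up immediately.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE legacy_customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE new_order (id INTEGER PRIMARY KEY, customer_id INTEGER, status TEXT);
    INSERT INTO legacy_customer VALUES (1, 'Acme');
    INSERT INTO new_order VALUES (10, 1, 'open');
    CREATE VIEW order_summary AS
        SELECT o.id AS order_id, c.name AS customer_name, o.status
        FROM new_order o JOIN legacy_customer c ON c.id = o.customer_id;
""")
# Another application updates the status directly; the view reflects it at once.
conn.execute("UPDATE new_order SET status = 'shipped' WHERE id = 10")
summary = conn.execute("SELECT * FROM order_summary").fetchall()
```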
Another thing to consider is periodic polling to build and update the read model, similar to a traditional reporting database. Although not optimal in comparison to a notification, depending on how stale your read model can be, this might also be an option to look at.
I was once in a similar situation; the following steps are how I did it:
To improve the legacy system and achieve a cleaner code base, the key is to take over the write responsibility. But don't be too ambitious, as this may introduce interface/contract changes which make the final deployment risky.
1. If all the writes are fired through anything except direct SQL updates, keep them as backward compatible as you can. Treat them as adapters/clients of your newly developed command handlers.
2. Some of the writes are direct SQL updates but out of your control:
Ask the team in charge if they can change to your new interface/contract.
If no, see step 3.
3. Ask if they can tolerate eventual consistency and are willing to replace the SQL updates with database procedures.
If yes, put all the SQL updates in the procedures, schedule a deployment and see step 4.
If no, maybe you should include them in your refactoring.
4. Modify the procedures, replacing the SQL updates with event inserts. Develop a backend job to roll through the events and publish them. Make your new application subscribe to these events and fire commands to your command handlers.
5. Emit events from your command handlers and use them to update the tables that the other applications consume.
6. Move to the next part of the legacy system.
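The step where the procedure inserts events and a backend job publishes them can be sketched as follows. This is a simplified illustration using SQLite and plain Python functions; the table `event_outbox`, the function names and the in-process `command_handler` are all assumptions, and a real setup would use stored procedures and a message broker.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE event_outbox ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " type TEXT, payload TEXT, published INTEGER DEFAULT 0)"
)

def legacy_procedure_set_status(order_id, status):
    # Stands in for the modified stored procedure: it records an event
    # instead of updating the table directly.
    conn.execute(
        "INSERT INTO event_outbox (type, payload) VALUES (?, ?)",
        ("StatusChangeRequested", json.dumps({"order_id": order_id, "status": status})),
    )

def publish_pending(handler):
    """Backend job: hand unpublished events to `handler` in order, then mark them."""
    rows = conn.execute(
        "SELECT id, type, payload FROM event_outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for event_id, event_type, payload in rows:
        handler(event_type, json.loads(payload))
        conn.execute("UPDATE event_outbox SET published = 1 WHERE id = ?", (event_id,))
    return len(rows)

orders = {}
def command_handler(event_type, data):
    # The new application turns the event into a command against its own model.
    orders[data["order_id"]] = data["status"]

legacy_procedure_set_status(42, "approved")
first_run = publish_pending(command_handler)
second_run = publish_pending(command_handler)   # nothing left to publish
```

The delay between the insert and the job run is the eventual consistency the other team has to tolerate.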
If we had an infinitely powerful server, we wouldn't bother with view models and would instead just read from the basic entities tables. View models are meant to improve performance by preparing and maintaining an appropriate dataset for display. If you use a database View as a view model, you've really not gotten much of a performance gain over an adhoc query (if you ignore the preplanning that the sql parser can do for a view).
I've had success with a solution that's less intrusive than #Hippoom's solution, but more responsive than #Derek's. If you have access to the legacy system and can make minor code changes, you can add an asynchronous queue write to generate an event in a queueing system (RabbitMQ, Kafka, or whatever) in the legacy system repositories or wherever data is persisted. Making these writes asynchronous should not introduce any significant performance cost, and should the queue write fail, it will not affect the legacy system. This change is also fairly easy to get through QA.
Then write an event driven piece that updates your read models. During the legacy system update phase (which can take a while), or if you only have access to some of the legacy systems that write to these databases, you can have a small utility that puts a new "UpdateViewModel" event in the queue every couple minutes. Then you would get timely events when the legacy systems save something significant, but are also covered for the systems that you are not able to update.
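A toy version of this setup, with Python's in-process `queue.Queue` standing in for RabbitMQ or Kafka (the function names and the `"Saved"` event type are made up for the example; only `"UpdateViewModel"` comes from the description above):

```python
import queue

events = queue.Queue(maxsize=1000)

def legacy_save(record, store):
    store.append(record)                       # the legacy write happens as before
    try:
        events.put_nowait(("Saved", record))   # best-effort notification
    except queue.Full:
        pass                                   # a failed queue write must not break the save

def enqueue_refresh_tick():
    # Run every couple of minutes by a scheduler to cover writers that emit no events.
    try:
        events.put_nowait(("UpdateViewModel", None))
    except queue.Full:
        pass

def drain(read_model, source):
    """Event-driven updater: apply each queued event to the read model."""
    while True:
        try:
            kind, record = events.get_nowait()
        except queue.Empty:
            return
        if kind == "Saved":
            read_model.add(record)
        elif kind == "UpdateViewModel":
            read_model.update(source)          # full re-read catches unannounced writes

database, read_model = [], set()
legacy_save("order-1", database)
database.append("order-2")                     # written by a system we cannot change
drain(read_model, database)
after_event = set(read_model)
enqueue_refresh_tick()
drain(read_model, database)
```

After the first drain the read model only knows about the announced write; the periodic tick picks up the write from the system that could not be changed.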
I want to build a web app similar to Reddit.com, where you have multiple levels of comments and lots of reads and writes. I was wondering if NoSQL, and MongoDB in particular, is the right tool for this?
Comments are really a good fit for a NoSQL database, no doubt. You avoid multiple self-joins. And it means that your system can scale out!
With MongoDB you can store the whole hierarchy within one document. Some people may say that there will be problems with atomic updates, but I guess that it's not a problem because you can load and save back the entire comment tree. In any case, you can easily redesign your system later to support atomic updates and avoid concurrency issues.
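To show the shape of such a document, here is a sketch using plain Python dicts in place of a MongoDB document (no driver calls are shown, and the field names are assumptions). The whole thread is one nested structure that is loaded, modified and saved as a unit.

```python
def new_comment(comment_id, author, text):
    return {"id": comment_id, "author": author, "text": text, "replies": []}

def add_reply(node, parent_id, reply):
    """Depth-first search for parent_id; append the reply. Returns True if found."""
    if node["id"] == parent_id:
        node["replies"].append(reply)
        return True
    return any(add_reply(child, parent_id, reply) for child in node["replies"])

def depth(node):
    return 1 + max((depth(child) for child in node["replies"]), default=0)

# One document holds the whole thread, so it is loaded and saved as a unit.
thread = new_comment(1, "alice", "First!")
add_reply(thread, 1, new_comment(2, "bob", "Reply to alice"))
add_reply(thread, 2, new_comment(3, "carol", "Reply to bob"))
missing = add_reply(thread, 99, new_comment(4, "dave", "Orphan"))
```

Note that two users who load and save the same tree at the same time can overwrite each other's replies, which is the atomic update concern mentioned above.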
Reddit itself uses Cassandra. If you want something "similar to reddit.com," maybe you should look at their source -- https://github.com/reddit/reddit/wiki.
Here's what David King (ketralnis) said earlier this year about the Cassandra 0.7 release: "Running any large website is a constant race between scaling your user base and scaling your infrastructure to support it. Our traffic more than tripled this year, and the transparent scalability afforded to us by Apache Cassandra is in large part what allowed us to do it on our limited resources. Cassandra v0.7 represents the real-life operations lessons learned from installations like ours and provides further features like column expiration that allow us to scale even more of our infrastructure."
However, Rick Branson notes that Reddit doesn't take full advantage of Cassandra's features, so if you were to start from scratch, you'd want to do some things differently.
I'm deciding between a NoSQL engine and a regular SQL one for a document management system for small businesses.
I have experience with Firebird/SQL Server and have found them to have a good track record of reliability (especially Firebird).
This market is full of crappy "servers" (clone PCs, for the majority), cheap hard disks, rare use of RAID or anything like that; some are in locations where power cuts are normal, some do not have a UPS, etc. (I will include off-site auto-backup to external servers, but that does not change the internal setup.) (I know about educating end users on proper setups, but it is stupid to depend on that, so stick to the point.)
From the design point of view, a schema-less database is the way to go for my system, but I worry whether any of the current solutions (MongoDB, Tokyo Cabinet, etc.) are like Firebird and survive crashes, malfunctions & abuse so that data corruption is very rare.
The plan is to store the office documents there & provide a central repository.
Check out Neo4j. It is a graph database (schema-free) that can be used like a document or key/value store.
Neo4j has been in production for many years in environments like you describe. Unlike many other NOSQL databases Neo4j actually flushes data to disk and uses a transaction log to recover from an inconsistent state. It also has real transactions (full ACID) that can span multiple operations and treat them as a single unit (which also seems to be a feature that is frequently left out in many other NOSQL stores).
-Johan
(Disclaimer: I am part of the Neo4j team)
CouchDB has the reliability you need:
The CouchDB file layout and commitment system features all Atomic Consistent Isolated Durable (ACID) properties. On-disk, CouchDB never overwrites committed data or associated structures, ensuring the database file is always in a consistent state.
Look at the ACID Properties section here for more info.
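To give a feel for why never overwriting committed data helps after a crash, here is a generic sketch of an append-only log with checksummed records. It is not CouchDB's actual file format; the record framing (length plus CRC32) and the function names are assumptions, and a `bytearray` stands in for the file on disk.

```python
import struct
import zlib

HEADER = struct.Struct(">II")   # payload length, CRC32 of payload

def append_record(buf, payload):
    """Append one record; existing bytes are never modified."""
    buf.extend(HEADER.pack(len(payload), zlib.crc32(payload)) + payload)

def recover(buf):
    """Return every complete, checksum-valid record; stop at the first torn one."""
    records, pos = [], 0
    while pos + HEADER.size <= len(buf):
        length, crc = HEADER.unpack_from(buf, pos)
        start, end = pos + HEADER.size, pos + HEADER.size + length
        if end > len(buf) or zlib.crc32(bytes(buf[start:end])) != crc:
            break                      # incomplete or corrupt tail from a crash
        records.append(bytes(buf[start:end]))
        pos = end
    return records

log = bytearray()
append_record(log, b"doc-1 rev-1")
append_record(log, b"doc-1 rev-2")
append_record(log, b"doc-2 rev-1")
crashed = log[:-4]                     # simulate power loss in the middle of the last write
recovered = recover(crashed)
```

Only the half-written last record is lost; everything committed before it is still readable.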
With CouchDB you also get easy backup and replication.
I've no code in production using CouchDB yet, but so far I'm very happy with the tests and the development process with CouchDB.