I am new to Ignite. I came across Ignite as an in-memory DB, and it might be a good improvement to our current systems.
Here is my situation:
1, We have an existing, very large OLTP system for online e-commerce.
2, Right now the app uses Spring Boot, and the database is Postgres (on AWS).
3, The app contains thousands of SQL statements like select … from A inner join B inner join C … (usually joining 5~10 tables).
4, The app uses select … for update to lock rows and perform updates. Retry/timeout handling for concurrency is configured in the app.
5, The system handles online traffic (100 requests/second) as well as some backend job updates, so concurrent access to a single record can happen every second.
Here are my goals:
1, Change as little application code as possible to integrate Ignite;
2, Set up the architecture as: App -> Ignite (in-memory DB) -> Postgres (backup DB). (Since we are new to Ignite, we want to reduce operational risk, so we still prefer to keep Postgres as a backup.)
Some questions
Q1. Is writeThrough not supported together with TRANSACTIONAL?
Q2. Since we require transactions/locks (select … for update), I use CacheAtomicityMode.TRANSACTIONAL, but it seems it cannot auto-sync to Postgres (see Q1). Is there a way to have TRANSACTIONAL and auto-sync to PG at the same time? Otherwise it is very troublesome, because we would need to do the sync ourselves.
Q3. If we implement a dynamic dataSource in the app, we can switch to PG if Ignite is down, but that requires the data in PG to be the same as in Ignite. May I ask for advice on how to keep the data consistent between Postgres and Ignite?
writeThrough is supported with TRANSACTIONAL.
However, Apache Ignite does not have transactional SQL in GA currently, so you will need to use cache API transactions (get/put, etc).
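Below is a minimal sketch of a TRANSACTIONAL cache with write-through enabled, using the cache API rather than SQL. The cache name, the Account class, the "account" table, and the JDBC URL/credentials are placeholders; in a real setup you would more likely configure Ignite's CacheJdbcPojoStoreFactory instead of the hand-written store shown here.

```java
import javax.cache.Cache;
import javax.cache.configuration.FactoryBuilder;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CacheAtomicityMode;
import org.apache.ignite.cache.store.CacheStoreAdapter;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.transactions.Transaction;
import org.apache.ignite.transactions.TransactionConcurrency;
import org.apache.ignite.transactions.TransactionIsolation;
import java.io.Serializable;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class IgniteWriteThroughSketch {

    /** Placeholder value class; a real model would map your Postgres table. */
    public static class Account implements Serializable {
        public long id;
        public long balance;
    }

    /** Hand-written write-through store against a hypothetical "account" table. */
    public static class PgAccountStore extends CacheStoreAdapter<Long, Account> {
        private static final String URL = "jdbc:postgresql://localhost:5432/shop";

        @Override public Account load(Long key) {
            try (Connection c = DriverManager.getConnection(URL, "app", "secret");
                 PreparedStatement ps = c.prepareStatement("SELECT id, balance FROM account WHERE id = ?")) {
                ps.setLong(1, key);
                try (ResultSet rs = ps.executeQuery()) {
                    if (!rs.next()) return null;
                    Account a = new Account();
                    a.id = rs.getLong(1);
                    a.balance = rs.getLong(2);
                    return a;
                }
            } catch (Exception e) { throw new RuntimeException(e); }
        }

        @Override public void write(Cache.Entry<? extends Long, ? extends Account> e) {
            try (Connection c = DriverManager.getConnection(URL, "app", "secret");
                 PreparedStatement ps = c.prepareStatement(
                     "INSERT INTO account(id, balance) VALUES (?, ?) " +
                     "ON CONFLICT (id) DO UPDATE SET balance = EXCLUDED.balance")) {
                ps.setLong(1, e.getKey());
                ps.setLong(2, e.getValue().balance);
                ps.executeUpdate();
            } catch (Exception ex) { throw new RuntimeException(ex); }
        }

        @Override public void delete(Object key) {
            try (Connection c = DriverManager.getConnection(URL, "app", "secret");
                 PreparedStatement ps = c.prepareStatement("DELETE FROM account WHERE id = ?")) {
                ps.setLong(1, (Long) key);
                ps.executeUpdate();
            } catch (Exception ex) { throw new RuntimeException(ex); }
        }
    }

    public static void main(String[] args) {
        CacheConfiguration<Long, Account> cfg = new CacheConfiguration<>("accounts");
        cfg.setAtomicityMode(CacheAtomicityMode.TRANSACTIONAL);
        cfg.setReadThrough(true);
        cfg.setWriteThrough(true);  // every put/remove is propagated to Postgres
        cfg.setCacheStoreFactory(FactoryBuilder.factoryOf(PgAccountStore.class));

        try (Ignite ignite = Ignition.start()) {
            IgniteCache<Long, Account> cache = ignite.getOrCreateCache(cfg);

            // PESSIMISTIC + REPEATABLE_READ behaves like "select ... for update":
            // the first read of the key inside the transaction acquires the lock.
            try (Transaction tx = ignite.transactions().txStart(
                    TransactionConcurrency.PESSIMISTIC, TransactionIsolation.REPEATABLE_READ)) {
                Account acc = cache.get(42L);   // locks key 42, loads from PG if absent (assume the row exists)
                acc.balance -= 100;
                cache.put(42L, acc);            // written through to Postgres
                tx.commit();
            }
        }
    }
}
```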
Related
I have a use case where I want to audit DB table data changes into another table for compliance purposes. Primarily, any changes to the data, like inserts/updates/deletes, should be audited. I found different options like JaVers, Hibernate Envers, database triggers, and Debezium.
I am avoiding JaVers and Hibernate Envers because they will not capture data changes made through direct SQL queries or through other applications. The other issue I see is that we would need to add the audit-related code to the main application code within the same transaction boundary.
I am also avoiding database triggers, as we do not use triggers at all in any of our deployments.
That leaves Debezium, which looks promising. My only concern is that we need to use Kafka to leverage Debezium. Is Kafka necessary for using Debezium if both the primary table and the audit table sit in the same DB instance?
Debezium is well suited for auditing, but since it is a source connector, it represents just one part of the data pipeline in your use case. You capture every table change event (c=create, r=read, u=update, d=delete), store it on a Kafka topic or on local disk, and then you need a sink connector (e.g. Camel Kafka SQL/JDBC or kafka-connect-jdbc) to insert into the target table.
For the same-transaction-boundary requirement you can use the Outbox pattern if eventual consistency is fine. There is also an Outbox Event Router SMT component that is part of the project.
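Here is a minimal sketch of the Outbox pattern, assuming a hypothetical orders table and an outbox table with the columns the Outbox Event Router expects (id, aggregatetype, aggregateid, type, payload); all names, SQL, and connection details are illustrative. The point is that the business change and the event row commit in the same local transaction, and Debezium then captures the outbox table.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.UUID;

public class OutboxSketch {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/app", "app", "secret")) {
            con.setAutoCommit(false);
            try (PreparedStatement upd = con.prepareStatement(
                     "UPDATE orders SET status = ? WHERE id = ?");
                 PreparedStatement out = con.prepareStatement(
                     "INSERT INTO outbox(id, aggregatetype, aggregateid, type, payload) " +
                     "VALUES (?, ?, ?, ?, ?::jsonb)")) {

                // Business change
                upd.setString(1, "SHIPPED");
                upd.setLong(2, 1001L);
                upd.executeUpdate();

                // Outbox event describing that change
                out.setObject(1, UUID.randomUUID());
                out.setString(2, "order");
                out.setString(3, "1001");
                out.setString(4, "OrderShipped");
                out.setString(5, "{\"orderId\":1001,\"status\":\"SHIPPED\"}");
                out.executeUpdate();

                con.commit();   // both rows commit atomically in one local transaction
            } catch (Exception e) {
                con.rollback();
                throw e;
            }
        }
    }
}
```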
Note that Debezium can also run embedded in a standalone Java application, storing the offsets on local disk, but you lose the HA capability provided by Kafka Connect running in distributed mode. With the embedded mode, you are also switching from a configuration-driven approach to a code-driven one.
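For completeness, this is roughly what embedded mode looks like. The connector class, property names (which differ slightly between Debezium versions), file paths, and table names are all illustrative, so check the documentation for your release.

```java
import io.debezium.engine.ChangeEvent;
import io.debezium.engine.DebeziumEngine;
import io.debezium.engine.format.Json;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class EmbeddedDebeziumSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("name", "audit-engine");
        props.setProperty("connector.class", "io.debezium.connector.postgresql.PostgresConnector");
        // Offsets on local disk instead of a Kafka topic
        props.setProperty("offset.storage", "org.apache.kafka.connect.storage.FileOffsetBackingStore");
        props.setProperty("offset.storage.file.filename", "/var/debezium/offsets.dat");
        props.setProperty("database.hostname", "localhost");
        props.setProperty("database.port", "5432");
        props.setProperty("database.user", "debezium");
        props.setProperty("database.password", "secret");
        props.setProperty("database.dbname", "app");
        props.setProperty("plugin.name", "pgoutput");
        props.setProperty("topic.prefix", "app");            // "database.server.name" in older versions
        props.setProperty("table.include.list", "public.orders");

        DebeziumEngine<ChangeEvent<String, String>> engine = DebeziumEngine.create(Json.class)
                .using(props)
                .notifying(event -> {
                    // event.value() contains the before/after state as JSON; here you
                    // would write it into your audit table via plain JDBC.
                    System.out.println(event.value());
                })
                .build();

        ExecutorService executor = Executors.newSingleThreadExecutor();
        executor.execute(engine);   // DebeziumEngine implements Runnable
    }
}
```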
I found Debezium to be a very comprehensive solution, and it is open source, backed by Red Hat. That gives it not only credibility but also some assurance that it will continue to be supported.
It provides rich configuration to whitelist or blacklist databases/tables/columns (with wildcard patterns), along with controls to limit the data captured from very large columns.
Since it is driven by the binlogs, you get not only the current state but also the previous state. This is ideal for audit trails, and you can customize it to build a proper sync to Elastic topics (one per table).
Use of Kafka is necessary to account for HA and for latency when bulk updates are made on the DB, even though the primary and audit tables are in the same DB.
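As a hedged illustration of the filtering/limiting knobs mentioned above: the property names below follow current Debezium releases (older versions used *.whitelist / *.blacklist instead of *.include.list / *.exclude.list), and the database/table/column values are made up.

```java
import java.util.Properties;

public class DebeziumFilterConfigSketch {
    static Properties filterProps() {
        Properties p = new Properties();
        p.setProperty("database.include.list", "inventory");                      // capture only this DB
        p.setProperty("table.include.list", "inventory.orders,inventory.customers");
        p.setProperty("column.exclude.list", "inventory.customers.ssn");          // drop a sensitive column
        p.setProperty("column.truncate.to.1024.chars", "inventory.orders.notes"); // cap a very large column
        return p;
    }
}
```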
I have PostgreSQL as my primary database, and I would like to take advantage of Elasticsearch as a search engine for my Spring Boot application.
Problem: The queries are quite complex, and with millions of rows in each table, most of the search queries are timing out.
Partial solution: I used materialized views in PostgreSQL and have a job that refreshes them every X minutes. But on systems with huge amounts of data and with other database transactions (especially writes) in progress, the views take a long time to refresh (about 10 minutes for 5 views). I realized that the current views are at their capacity and I cannot add more.
That's when I started exploring other options just for search and landed on Elasticsearch, which works great with the amount of data I have. As a POC, I used Logstash's JDBC input plugin, but it doesn't support the DELETE operation (bummer).
From here, soft delete is an option I cannot take because:
A) Almost all the tables in the PostgreSQL DB are updated every few minutes, and some of them have constraints on the "name" key, which in this case would remain until a clean-up job runs.
B) Many tables in my PostgreSQL DB are referenced with CASCADE DELETE, and it's not feasible for me to update 220 tables' schemas and JPA queries to check for a soft-delete boolean.
The same question mentioned in the link above also suggests PgSync, which syncs PostgreSQL with Elasticsearch periodically. However, I cannot go with that either, since it has an LGPL license, which is forbidden in our organization.
I'm starting to wonder if anyone else has encountered this strange limitation of Elasticsearch and RDBMSes.
I'm open to options other than Elasticsearch to solve my need; I just don't know what the right stack is. Any help here is much appreciated!
I'd like to preface this by saying I'm not a DBA, so sorry for any gaps in technical knowledge.
I am working within a microservices architecture, where we have about a dozen or so applications, each supported by its own Postgres database instance (in RDS, if that helps). Each of the microservices' databases contains a few tables. It's safe to assume that there are no naming conflicts across any of the schemas/tables, and that no data is sharded across the databases.
One of the issues we keep running into is wanting to analyze/join data across the databases. Right now, we're relying on a third-party tool that caches our data and makes it possible to query across multiple database sources (via the shared cache).
Is it possible to create read-replicas of the schemas/tables from all of our production databases and have them available to query in a single database?
Are there any other ways to configure Postgres or RDS to make joining across our databases possible?
Is it possible to create read-replicas of the schemas/tables from all of our production databases and have them available to query in a single database?
Yes, that's possible and it's actually quite easy.
Set up one Postgres server that acts as the master.
For each remote server, create a foreign server, which you then use to create foreign tables that make the data accessible from the master server.
If you have multiple tables on multiple servers that should be viewed as a single table on the master, you can set up inheritance to make all those tables appear as one. If you can define a "sharding" key that distinguishes the servers, you can even make Postgres request the data only from the relevant server.
All foreign tables can be joined as if they were local tables. Depending on the kind of query, some (or a lot) of the filter and join criteria can even be pushed down to the remote server to distribute the work.
As the Postgres Foreign Data Wrapper is writeable, you can even update the remote tables from the master server.
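To make the steps above concrete, here is a rough sketch of the postgres_fdw setup, with the DDL executed over JDBC purely for illustration (you would normally run it in psql). The server names, credentials, schemas, and the example join are all placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class FdwSetupSketch {
    public static void main(String[] args) throws Exception {
        // Connect to the "master" server that will host the foreign tables
        try (Connection master = DriverManager.getConnection(
                "jdbc:postgresql://master-host:5432/reporting", "admin", "secret");
             Statement st = master.createStatement()) {

            st.execute("CREATE EXTENSION IF NOT EXISTS postgres_fdw");

            // One foreign server + user mapping per microservice database
            st.execute("CREATE SERVER orders_srv FOREIGN DATA WRAPPER postgres_fdw " +
                       "OPTIONS (host 'orders-db.rds.amazonaws.com', port '5432', dbname 'orders')");
            st.execute("CREATE USER MAPPING FOR CURRENT_USER SERVER orders_srv " +
                       "OPTIONS (user 'readonly', password 'secret')");

            // Pull all remote tables into a local schema as foreign tables
            st.execute("CREATE SCHEMA IF NOT EXISTS orders");
            st.execute("IMPORT FOREIGN SCHEMA public FROM SERVER orders_srv INTO orders");

            // Foreign tables can now be joined like local ones, e.g.:
            //   SELECT c.name, o.total
            //   FROM customers.customers c
            //   JOIN orders.orders o ON o.customer_id = c.id;
        }
    }
}
```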
If the remote access and joins are too slow, you can create materialized views based on the remote tables to keep a local copy of the data. This, however, means that it's not a real-time copy and you have to manage the regular refresh of those views.
Other (more complicated) options are the BDR project or pglogical. It seems that logical replication will be built into the next Postgres version (to be released at the end of this year).
Or you could use a distributed, shared-nothing system like Postgres-XL (which is probably the most complicated system to set up and maintain).
I am very new to Db2. I have developed a few procedures that perform some operations on a Db2 database. My question is how to run multiple threads against the Db2 server concurrently. I have a database with 70,000 tables, each having more than 1,000 records, and a procedure that updates all 70,000 tables, so time consumption is the main factor here. I want to divide my update work across 10 threads, where each thread updates 7,000 tables, and run all 10 threads simultaneously.
Can someone kindly let me know how to achieve this?
This is DB2 Express-C on Windows.
There's nothing in DB2 for creating multiple threads.
The enterprise-level version of DB2 will automatically process a single statement across multiple cores when and where needed, but that's not what you're asking for.
I don't believe any SQL-based RDBMS allows a stored procedure to create its own threads. The whole point of SQL is that it's a higher level of abstraction; you don't have access to those kinds of details.
You'll need to write an external app in a language that supports threads and that opens 10 connections to the DB simultaneously. Depending on the specifics of the update you're doing and the hardware you have, you might find that 10 connections is too many.
To elaborate on Charles's correct answer, it is up to the client application to parallelize its DML workload by opening multiple connections to the database. You could write such a program on your own, but many ETL utilities provide components that enable parallel workflows similar to what you've described. Aside from reduced programming, another advantage of using an ETL tool to define and manage a multi-threaded database update is built-in exception handling, making it easier to roll back all of the involved connections if any of them encounter an error.
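As a rough illustration of the client-side approach both answers describe, the sketch below uses a fixed thread pool where each worker gets its own JDBC connection and its own slice of the table list. The connection URL, credentials, UPDATE statement, and the table-loading helper are all placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelTableUpdater {
    private static final int THREADS = 10;
    private static final String URL = "jdbc:db2://dbhost:50000/MYDB";

    public static void main(String[] args) throws Exception {
        List<String> tables = loadTableNames();                     // all 70,000 table names
        int slice = (tables.size() + THREADS - 1) / THREADS;        // ~7,000 tables per worker

        ExecutorService pool = Executors.newFixedThreadPool(THREADS);
        for (int t = 0; t < THREADS; t++) {
            int from = Math.min(t * slice, tables.size());
            int to = Math.min(from + slice, tables.size());
            List<String> myTables = tables.subList(from, to);
            pool.submit(() -> {
                // One dedicated connection per worker thread
                try (Connection con = DriverManager.getConnection(URL, "user", "pass");
                     Statement st = con.createStatement()) {
                    for (String table : myTables) {
                        st.executeUpdate("UPDATE " + table + " SET processed = 1");
                    }
                } catch (Exception e) {
                    e.printStackTrace();   // real code needs proper error handling / rollback
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(2, TimeUnit.HOURS);
    }

    private static List<String> loadTableNames() {
        // e.g. query SYSCAT.TABLES; the returned list is a placeholder here
        return List.of();
    }
}
```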
Can anyone kindly tell me how to process distributed transactions within PostgreSQL, which is also called "XA"? Are there any resources about it? Many thanks for any answer.
It looks like you are a bit confused. Generally, database systems support two notions of distributed transactions:
Native distributed transactions and
XA transactions.
Native distributed transactions are generally between different servers of the same RDBMS. Postgres supports this with the dblink_exec function. Generally, the connection to the other server is created via a so-called database link. Postgres is a bit clumsier to use here than some commercial-grade RDBMSes: you first need to install an extension to be able to use database links. However, the Postgres RDBMS itself manages the transaction.
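As a minimal sketch of this database-link style, assuming the dblink extension is available on the local server; the connection strings and table names are made up.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class DblinkSketch {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://local-host:5432/appdb", "app", "secret");
             Statement st = con.createStatement()) {

            st.execute("CREATE EXTENSION IF NOT EXISTS dblink");

            // Run an UPDATE on the remote server through the link
            st.execute("SELECT dblink_exec(" +
                       "'host=remote-host dbname=otherdb user=app password=secret', " +
                       "'UPDATE accounts SET balance = balance - 100 WHERE id = 42')");
        }
    }
}
```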
XA transactions, on the other hand, are managed by an external transaction manager (TM), and each participating database takes the role of an XA resource, which enlists with the transaction manager. The RDBMS can no longer decide by itself when to commit a transaction; that is the task of the XA transaction manager, which uses a two-phase commit (2PC) protocol to make sure the changes are applied or rolled back in a consistent manner across the databases.
On some OSes, like Windows, a transaction manager is part of the operating system; on others it is not. In the Java world, application servers generally ship with a transaction manager, and the corresponding data source needs to be configured as an XA data source.
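To show what the TM does under the hood, here is a very condensed, hand-driven sketch of 2PC against two Postgres servers using the JDBC driver's PGXADataSource. In practice a transaction manager (e.g. the one in your application server, or Atomikos/Narayana) drives these calls and handles failure recovery, and PostgreSQL needs max_prepared_transactions > 0. All hostnames, credentials, and SQL are placeholders.

```java
import javax.sql.XAConnection;
import javax.transaction.xa.XAResource;
import javax.transaction.xa.Xid;
import java.sql.Connection;
import java.sql.Statement;
import org.postgresql.xa.PGXADataSource;

public class ManualXaSketch {

    // Minimal Xid implementation; a real transaction manager generates and persists these.
    static final class SimpleXid implements Xid {
        private final int formatId;
        private final byte[] gtrid;
        private final byte[] bqual;
        SimpleXid(int formatId, byte[] gtrid, byte[] bqual) {
            this.formatId = formatId; this.gtrid = gtrid; this.bqual = bqual;
        }
        public int getFormatId() { return formatId; }
        public byte[] getGlobalTransactionId() { return gtrid; }
        public byte[] getBranchQualifier() { return bqual; }
    }

    public static void main(String[] args) throws Exception {
        PGXADataSource ds1 = new PGXADataSource();
        ds1.setUrl("jdbc:postgresql://host1:5432/db1");
        PGXADataSource ds2 = new PGXADataSource();
        ds2.setUrl("jdbc:postgresql://host2:5432/db2");

        XAConnection xc1 = ds1.getXAConnection("app", "secret");
        XAConnection xc2 = ds2.getXAConnection("app", "secret");
        XAResource r1 = xc1.getXAResource();
        XAResource r2 = xc2.getXAResource();

        // Same global transaction id, one branch per database
        byte[] gtrid = "demo-global-tx".getBytes();
        Xid xid1 = new SimpleXid(1, gtrid, new byte[] {1});
        Xid xid2 = new SimpleXid(1, gtrid, new byte[] {2});

        r1.start(xid1, XAResource.TMNOFLAGS);
        r2.start(xid2, XAResource.TMNOFLAGS);

        Connection c1 = xc1.getConnection();
        Connection c2 = xc2.getConnection();
        try (Statement s1 = c1.createStatement(); Statement s2 = c2.createStatement()) {
            s1.executeUpdate("UPDATE accounts SET balance = balance - 100 WHERE id = 1");
            s2.executeUpdate("UPDATE accounts SET balance = balance + 100 WHERE id = 7");
        }

        r1.end(xid1, XAResource.TMSUCCESS);
        r2.end(xid2, XAResource.TMSUCCESS);

        // Phase 1: both resources must vote OK (a real TM checks the return values) ...
        r1.prepare(xid1);
        r2.prepare(xid2);
        // Phase 2: ... only then is either branch committed.
        r1.commit(xid1, false);
        r2.commit(xid2, false);

        xc1.close();
        xc2.close();
    }
}
```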