We are using triggers to store data in the warehouse. Whenever a process is executed, a trigger fires and stores some information in the data warehouse. As the number of transactions increases, this affects processing time.
What would be the best way to do this?
I was thinking about a Foreign Data Wrapper or an AWS read replica. Suggestions for any other way to do this would be appreciated as well. Or maybe I don't need to use triggers at all?
Here are some quick tips:
Reduce latency between the database servers
The target database tables should have fewer indexes, to improve DML performance
Logical replication may solve syncing data to the warehouse (see the sketch below)
Option 3 is an architectural change, though with it you don't need to write triggers on each table to sync data.
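For the logical replication option, a minimal sketch, assuming PostgreSQL on both sides; the table names and connection details below are placeholders, and the source must run with wal_level = logical:

-- On the OLTP (source) database: publish the tables you want synced.
CREATE PUBLICATION warehouse_pub FOR TABLE orders, order_items;

-- On the warehouse (target) database: subscribe to that publication.
-- The target tables must already exist with a compatible schema.
CREATE SUBSCRIPTION warehouse_sub
  CONNECTION 'host=oltp-db dbname=app user=replicator password=...'
  PUBLICATION warehouse_pub;

Rows then flow to the warehouse continuously, with no triggers on the source tables.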
We have a project where we let users execute workflows based on a selection of steps.
Basically each step is linked to an execution, and an execution can be linked to one or more executionData rows (the data created or updated during that execution for that step, stored as a blob in Postgres).
Today, we execute this through a queuing mechanism: executions are created in queues, and workers run the executions and create the next job in the queue.
But this architecture and our implementation make our Postgres database slow when multiple jobs are scheduled at the same time:
We are basically always creating, reading, and updating the execution table (we create the execution to be scheduled, read the execution when starting the job, and update the status when the job is finished)
We are basically always creating and reading from the executionData table (we add and update executionData during executions)
We have the following issues:
Our executionData table is growing very fast, and it's almost impossible to remove rows because there are constant locks on the table. What could we do to avoid that? Is Postgres a good fit for that kind of data?
Our execution table is also growing very fast, and it impacts overall execution because, to run a workflow, we need to create, read, and update executions. Deleting rows is likewise almost impossible. What could we do to improve this? Should we use a historical table? Suggestions?
We need to compute statistics on the total executions run and the data saved; these queries also hit the tables above, which slows down the process.
We use RDS on AWS for our Postgres database.
Thanks for your insights!
Try going for a faster database architecture. Your use case seems well suited to DynamoDB for your executions. You can get O(1) key lookups, and the blob storage can fit right into the record as long as you keep it under DynamoDB's item size limit (400 KB).
I'm migrating from a proprietary DBMS to PG. In the proprietary DBMS, "offlining" and "onlining" data partitions is a very lightweight operation. I'm looking to implement similar functionality with PG by backing up and restoring individual table partitions. Obviously I need to avoid a performance regression. So my question is what the fastest way is of:
Backing up a table (partition), both data and indexes
Taking the table offline (meaning that the data is now gone from the database)
Restoring the table (partition), both data and indexes
Once I have some advice I can design more targeted performance comparisons. Thanks in advance for any pointers.
What is fast and needs to be fast is adding or removing a partition (ALTER TABLE ... ATTACH/DETACH PARTITION).
After you have detached the partition, you are in no great hurry to back up or export the data. This can be done comfortably with pg_dump.
Similarly, importing the data for a table that is to become a new partition is normally not time critical.
If you need this to happen faster (for example, you want the old partition to be visible in another database as soon as it is detached in the old one), you could use logical replication to replicate the partition to another PostgreSQL database before you detach it. As soon as replication has caught up, you detach or drop the original partition and attach the copy in the other database.
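A rough sketch of that offline/online flow, with placeholder table, partition, database, and bound names:

-- "Offline": detach the partition; its rows are no longer visible through the parent table.
ALTER TABLE measurements DETACH PARTITION measurements_2023_q1;

-- Back up the detached table at leisure, and restore it later (shell commands, not SQL):
--   pg_dump -Fc -t measurements_2023_q1 mydb > measurements_2023_q1.dump
--   pg_restore -d mydb measurements_2023_q1.dump

-- "Online": re-attach the restored table as a partition.
ALTER TABLE measurements ATTACH PARTITION measurements_2023_q1
  FOR VALUES FROM ('2023-01-01') TO ('2023-04-01');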
I'm on GCP, and I have a use case where I want to ingest a large volume of events streaming from remote machines.
To compose a final event, I need to ingest and "combine" an event of type X with events of types Y and Z.
event type X schema:
SrcPort
ProcessID
event type Y schema:
DstPort
ProcessID
event type Z schema:
ProcessID
ProcessName
I'm currently using Cloud SQL (PostgreSQL) to store most of my relational data.
I'm wondering whether I should use BigQuery for this use case, since I'm expecting a large volume of these kinds of events, and I may have future plans for running analysis on this data.
I'm also wondering about how to model these events.
What I care about is the "JOIN" between these events, so the joined event will be:
SrcPort, SrcProcessID, SrcProcessName, DstPort, DstProcessID, DstProcessName
When the "final event" is complete, I want to publish it to PubSub.
I can create a de-normalized table and just update partially upon event (how is BigQuery doing in terms of update performance?), and then publish to pubsub when complete.
Or, I can store these as raw events in separate "tables", and then JOIN periodically complete events, then publish to pubsub.
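For illustration, that periodic JOIN could look roughly like the following; the table names are placeholders, and the flow_id column correlating X and Y events is an assumption, since the schemas above don't show how those two are matched:

-- Sketch only: flow_id is a hypothetical key tying an X event to its Y event.
SELECT x.SrcPort,
       x.ProcessID    AS SrcProcessID,
       zs.ProcessName AS SrcProcessName,
       y.DstPort,
       y.ProcessID    AS DstProcessID,
       zd.ProcessName AS DstProcessName
FROM events_x AS x
JOIN events_y AS y  ON y.flow_id = x.flow_id
JOIN events_z AS zs ON zs.ProcessID = x.ProcessID
JOIN events_z AS zd ON zd.ProcessID = y.ProcessID;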
I'm not sure how well PostgreSQL handles storing and querying a large volume of events.
The thing that attracted me to BigQuery is the comfort of handling large volumes with ease.
If you already have this on Postgres, I advise you to treat BigQuery as a complementary system that stores a duplicate of the data for analysis purposes.
BigQuery offers you different ways to reduce costs and improve query performance:
Read about partitioning and clustering; with these in place you "scan" only the partitions you are interested in when performing the "event completion".
You can use scheduled queries to run MERGE statements periodically to maintain a materialized table; you can schedule this as often as you want (see the sketch after this list).
You can use materialized views for some of the situations.
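As a rough sketch of the first two points, with hypothetical dataset, table, and column names (reusing the flow_id correlation key assumed in the question): partition the raw-event tables by day, cluster them on the join key, and let a scheduled query MERGE newly completed events into the denormalized table while scanning only recent partitions.

-- Raw-event table partitioned by day and clustered on the join key (sketch).
CREATE TABLE IF NOT EXISTS mydataset.events_x (
  flow_id    STRING,
  SrcPort    INT64,
  ProcessID  STRING,
  event_time TIMESTAMP
)
PARTITION BY DATE(event_time)
CLUSTER BY flow_id;

-- Scheduled query: upsert newly completed events into the denormalized table.
MERGE mydataset.joined_events AS t
USING (
  SELECT x.flow_id, x.SrcPort, x.ProcessID AS SrcProcessID,
         y.DstPort, y.ProcessID AS DstProcessID
  FROM mydataset.events_x AS x
  JOIN mydataset.events_y AS y ON y.flow_id = x.flow_id
  WHERE DATE(x.event_time) >= DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)  -- scan only recent partitions
) AS s
ON t.flow_id = s.flow_id
WHEN MATCHED THEN
  UPDATE SET SrcPort = s.SrcPort, SrcProcessID = s.SrcProcessID,
             DstPort = s.DstPort, DstProcessID = s.DstProcessID
WHEN NOT MATCHED THEN
  INSERT (flow_id, SrcPort, SrcProcessID, DstPort, DstProcessID)
  VALUES (s.flow_id, s.SrcPort, s.SrcProcessID, s.DstPort, s.DstProcessID);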
BigQuery works well with bulk imports and frequent inserts such as HTTP logging. Inserting into BigQuery in batches of ~100 or ~1,000 rows every few seconds works well.
Your idea of creating a final view will definitely help. Storing data in BigQuery is cheaper than processing it so it won't hurt to keep a raw set of your data.
How you model or structure your events is up to you.
Sorry if this has been asked before. I am hoping to save some time this way :)
What would be the best way to unload delta data from a DB2 source database that has been optimized for OLTP? E.g. by analyzing the redo logs, as with Oracle LogMiner?
Background: we want near-realtime ETL, and a full table unload every 5 minutes is not feasible.
This is more about the actual technology behind accessing DB2 than about determining the deltas to load into the (Teradata) target.
I.e., we want to unload all records changed since the last unload timestamp.
Many thanks!
Check out IBM InfoSphere Data Replication.
Briefly:
There are 3 replication solutions: CDC, SQL & Q replication.
All three solutions read Db2 transaction logs using the same db2ReadLog API, which anyone may use for a custom implementation. Everything else, such as the staging and transformation of the data changes read from the logs, their transport, and how the data is applied to the target, differs between the methods.
So a little background: we have a large number of data sources, ranging from RDBMSs to S3 files. We would like to synchronize and integrate this data with various other data warehouses, databases, etc.
At first, this seemed like the canonical model for Kafka. We would like to stream the data changes through Kafka to the data output sources. In our test case we are capturing the changes with Oracle GoldenGate and successfully pushing them to a Kafka queue. However, pushing these changes through to the data output sources has proven challenging.
I realize that this would work very well if we were just adding new data to the Kafka topics and queues. We could cache the changes and write them to the various data output sources. However, this is not the case. We will be updating, deleting, modifying partitions, etc. The logic for handling this seems to be much more complicated.
We tried using staging tables and joins to update/delete the data, but I feel that would become unwieldy quite quickly.
This comes to my question - are there any different approaches we could go about handling these operations? Or should we totally move in a different direction?
Any suggestions/help is much appreciated. Thank you!
There are 3 approaches you can take:
Full dump load
Incremental dump load
Binlog replication
Full dump load
Periodically, dump your RDBMS source table into a file and load that into the data warehouse, replacing the previous version. This approach is mostly useful for small tables, but it is very simple to implement and easily supports updates and deletes to the data.
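A minimal sketch of the load side, assuming a Postgres-compatible warehouse and placeholder table and file names; the fresh dump goes into a staging table that is then swapped in atomically:

-- Load the new dump into a staging table, then swap it for the old table in one transaction.
BEGIN;
CREATE TABLE my_table_staging (LIKE my_table INCLUDING ALL);
COPY my_table_staging FROM '/data/my_table_dump.csv' WITH (FORMAT csv);  -- example path
DROP TABLE my_table;
ALTER TABLE my_table_staging RENAME TO my_table;
COMMIT;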
Incremental dump load
Periodically, get the records that have changed since your last query, and send them to be loaded into the data warehouse. Something along the lines of:
SELECT *
FROM my_table
WHERE last_update > #{last_import}
This approach is slightly more complex to implement, because you have to maintain state (the last_import value in the snippet above), and it does not support deletes. It can be extended to support deletes, but that makes it more complicated. Another disadvantage of this approach is that it requires your tables to have a last_update column.
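On the warehouse side, such an incremental batch is typically applied as an upsert. A sketch using a standard MERGE, with assumed table and column names and a one-row import_state bookkeeping table:

-- Apply the incremental batch (already loaded into my_table_increment) as an upsert.
MERGE INTO my_table AS t
USING my_table_increment AS s
  ON t.id = s.id
WHEN MATCHED THEN
  UPDATE SET col_a = s.col_a, col_b = s.col_b, last_update = s.last_update
WHEN NOT MATCHED THEN
  INSERT (id, col_a, col_b, last_update)
  VALUES (s.id, s.col_a, s.col_b, s.last_update);

-- Remember the high-water mark for the next run.
UPDATE import_state
SET last_import = (SELECT MAX(last_update) FROM my_table_increment);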
Binlog replication
Write a program that continuously listens to the binlog of your RDBMS and sends these updates to be loaded into an intermediate table in the data warehouse, containing the updated values of the row and whether it is a delete or an update/create. Then write a query that periodically consolidates these updates to create a table that mirrors the original table. The idea behind this consolidation process is to select, for each id, the latest (most advanced) version seen across all the updates and the previous version of the consolidated table.
This approach is the most complex to implement, but it allows you to achieve high performance even on large tables and supports updates and deletes.
Kafka is relevant to this approach in that it can be used as a pipeline for the row updates between the binlog listener and the loading to the data warehouse intermediate table.
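A sketch of that consolidation query, assuming an intermediate table my_table_updates that carries the row id, the changed column values, an is_delete flag, and a monotonically increasing log position:

-- Rebuild the mirror from the previous version plus the new updates,
-- keeping only the latest version of each id and dropping deleted rows.
CREATE TABLE my_table_mirror_new AS
WITH combined AS (
  SELECT id, col_a, col_b, is_delete, log_pos
  FROM my_table_updates
  UNION ALL
  SELECT id, col_a, col_b, FALSE AS is_delete, 0 AS log_pos
  FROM my_table_mirror
),
ranked AS (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY log_pos DESC) AS rn
  FROM combined
)
SELECT id, col_a, col_b
FROM ranked
WHERE rn = 1 AND NOT is_delete;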
You can read more about these different replication approaches in this blog post.
Disclosure: I work at Alooma (a co-worker wrote the blog post linked above, and we provide data pipelines as a service, solving problems like this).