I am planning to use AWS RDS Postgres version 10.4 or above to store data in a single table of ~15 columns.
My use case is to serve:
1. Periodically (every hour) store/update rows in this table.
2. Periodically (every hour) fetch data from the table, say 500 rows at a time.
3. Frequently fetch small amounts of data (~10 rows) from the table (hundreds of queries in parallel).
Does AWS RDS Postgres support all of the above use cases?
I am aware of read replica support, but is there any built-in load balancer to serve the queries that come in parallel?
How many read queries can Postgres process concurrently?
Thanks in advance
Your use cases seem to be a normal fit for any relational database system, so I would say: yes.
The question is how fast the DB can handle the ~100 parallel queries (use case 3).
In general, the PostgreSQL documentation is one of the best I have ever read, so give it a try:
https://www.postgresql.org/docs/10/parallel-query.html
But also take into consideration how big your data is!
That said, try without read replicas first! You might not need them.
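If you want to sanity-check the concurrency side yourself, the relevant knobs are the connection limit and the parallel-query workers, and the small 10-row lookups will mostly come down to having the right index. A minimal sketch (the table and column names in the index example are made up):

    -- How many simultaneous client connections the server accepts.
    SHOW max_connections;

    -- Parallel-query settings described in the documentation linked above.
    SHOW max_parallel_workers_per_gather;
    SHOW max_parallel_workers;

    -- Hundreds of small 10-row lookups are usually served best by an index
    -- on the filter column (illustrative names):
    CREATE INDEX IF NOT EXISTS my_table_lookup_idx ON my_table (lookup_key);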
We are scraping data online, and the data is (for us) growing fast. Each row consists of one big text field (approx. 2,000 chars) and a dozen simple text fields (a few words each).
We scrape around 1-2M rows per week, and that will probably grow to 5-10M rows per week.
Currently we use MongoDB Atlas to store the rows. Until recently we were keeping every piece of information available, but we have now defined a schema and only keep what we need, so the flexibility of a document DB is no longer necessary. And since MongoDB pricing grows steeply with storage and tier upgrades, we are looking for another storage solution for that data.
Here is the pipeline: the scrapers send data to MongoDB -> Airbyte periodically replicates the data to BigQuery -> we process the data on BigQuery using Spark or Apache Beam -> we analyze the transformed data with Sisense.
With those requirements in mind, could we use Postgres to replace MongoDB for the "raw" storage?
Does Postgres scale well for that kind of data (we are not even close to big data, but in the near future we will have at least 1 TB)? We don't plan to use relations in Postgres; isn't it overkill then? We will, however, use array and JSON types, which is why I selected Postgres first. Are there other storage solutions we could use? Also, would it be possible / good practice to store the data directly in BigQuery?
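For reference, a table roughly along these lines is what we have in mind (the table and column names are just placeholders):

    CREATE TABLE scraped_items (
        id          bigserial PRIMARY KEY,
        scraped_at  timestamptz NOT NULL DEFAULT now(),
        body_text   text,        -- the big ~2,000-char text
        tags        text[],      -- array column
        attributes  jsonb        -- semi-structured leftovers
    );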
Thanks for the help
I have PostgreSQL as my primary database and I would like to use Elasticsearch as a search engine for my Spring Boot application.
Problem: the queries are quite complex, and with millions of rows in each table, most of the search queries are timing out.
Partial solution: I used materialized views in PostgreSQL and have a job that refreshes them every X minutes. But on systems with huge amounts of data and with other database transactions (especially writes) in progress, the views take a long time to refresh (about 10 minutes for 5 views). I realized the current views are at their capacity and I cannot add more.
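The refresh job essentially boils down to statements like the following, run every X minutes (the view names are illustrative; CONCURRENTLY requires a unique index on each view):

    -- Refresh without locking out readers; each view needs a unique index.
    REFRESH MATERIALIZED VIEW CONCURRENTLY search_products_mv;
    REFRESH MATERIALIZED VIEW CONCURRENTLY search_orders_mv;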
That's when I started exploring other options just for search and landed on Elasticsearch, which works great with the amount of data I have. As a POC, I used Logstash's JDBC input plugin, but it doesn't support the DELETE operation (bummer).
From here, soft deletes are an option I cannot take because:
A) Almost all the tables in the PostgreSQL DB are updated every few minutes, and some of them have constraints on the "name" key, which in this case would linger until a clean-up job runs.
B) Many tables in my PostgreSQL DB are referenced with CASCADE DELETE, and it's not feasible for me to update 220 tables' schemas and JPA queries to check a soft-delete boolean.
The same question mentioned in the link above also suggests PgSync, which syncs PostgreSQL with Elasticsearch periodically. However, I cannot go with that either, since it has an LGPL license, which is forbidden in our organization.
I'm starting to wonder whether anyone else has encountered this limitation of Elasticsearch with an RDBMS.
I'm open to options other than Elasticsearch to solve my need; I just don't know what the right stack is. Any help here is much appreciated!
Sorry if this has been asked before. I am hoping to save some time this way :)
What would be the best way to unload delta data from a DB2 source database that has been optimized for OLTP? E.g. by analyzing the redo logs, as with Oracle LogMiner?
Background: we want near-real-time ETL, and a full table unload every 5 minutes is not feasible.
This is more about the actual technology behind accessing DB2 than about determining the deltas to load into the (Teradata) target.
I.e., we want to unload all records since the last unload timestamp.
Many, many thanks!
Check out IBM InfoSphere Data Replication.
Briefly:
There are 3 replication solutions: CDC, SQL & Q replication.
All three solutions read Db2 transaction logs using the same db2ReadLog API, which anyone may use for a custom implementation. Everything else (staging and transformation of the data changes read from the logs, transport, and applying the data to the target) is different for each method.
I have a set of tables with 20 million records in a Postgres server. Right now I am migrating some table data from one server to another using insert and update queries, with dependent tables handled in functions. It takes around 2 hours even after optimizing the queries. I need a way to migrate the data faster, possibly using MongoDB or Cassandra. How?
Try putting your updates and inserts into a file and then loading the file. I understand PostgreSQL will optimise loading the file contents. It has always worked for me, although I haven't used that quantity of data.
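If you go that route, the mechanism Postgres provides for fast bulk loads is COPY (or psql's \copy), which avoids per-row statement overhead. A minimal sketch, with made-up table and file names:

    -- On the source server: dump the table (or a query result) to a CSV file.
    \copy (SELECT * FROM my_table) TO '/tmp/my_table.csv' WITH (FORMAT csv)

    -- On the target server: bulk-load the file in a single pass.
    \copy my_table FROM '/tmp/my_table.csv' WITH (FORMAT csv)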
I have 3 tables in my Redshift database, and data comes in from 3 different CSV files in S3 every few seconds. One table has ~3 billion records and the other 2 have ~100 million records each. For near-real-time reporting, I have to merge these tables into one table. How do I achieve this in Redshift?
Near Real Time Data Loads in Amazon Redshift
I would say that the first step is to consider whether Redshift is the best platform for the workload you are considering. Redshift is not an optimal platform for streaming data.
Redshift's architecture is better suited to batch inserts than streaming inserts; COMMITs are costly in Redshift.
You also need to consider the performance impact of VACUUM and ANALYZE if those operations are going to compete for resources with the streaming data.
It might still make sense to use Redshift for your project depending on the full set of requirements and the workload, but bear in mind that you will have to engineer around these constraints, probably by changing your workload from "near real time" to a micro-batch architecture.
This blog post details the recommendations for micro-batch loads in Redshift. Read the micro-batch article here.
To summarize it:
Break input files --- Split your load files into several smaller files whose count is a multiple of the number of slices.
Column encoding --- Have the column encodings pre-defined in your DDL.
COPY settings --- Ensure COPY does not attempt to evaluate the best encoding for each load (see the sketch after this list).
Load in SORT key order --- If possible, your input files should have the same "natural order" as your sort key.
Staging tables --- Use multiple staging tables and load them in parallel.
Multiple time-series tables --- This is a documented approach for dealing with time series in Redshift.
ELT --- Do the transformations in-database using SQL to load into the main fact table (see the sketch after this list).
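To make the COPY-settings and staging/ELT points concrete, here is a hedged sketch of what one micro-batch could look like (the table names, S3 prefix, and IAM role ARN are all placeholders):

    -- Load a micro-batch into a staging table; skip encoding analysis and
    -- automatic statistics so the COPY stays cheap.
    COPY staging_events
    FROM 's3://my-bucket/batches/part_'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-load-role'
    FORMAT AS CSV
    COMPUPDATE OFF
    STATUPDATE OFF;

    -- ELT step: transform in-database while merging into the main fact table.
    BEGIN;
    INSERT INTO fact_events (event_id, event_time, payload)
    SELECT event_id, event_time, payload
    FROM staging_events;
    COMMIT;

    -- TRUNCATE commits immediately in Redshift, so run it outside the transaction.
    TRUNCATE staging_events;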
Of course all the recommendations for data loading in Redshift still apply. Look at this article here.
Last but not least, enable Workload Management to ensure the online queries can access the proper amount of resources. Here is an article on how to do it.