Increase PostgreSQL 9.5 performance on 200+ million records

I have 200+ million records in a PostgreSQL 9.5 table. Almost all queries are analytical. So far I have tried to improve query performance with indexing, but it does not seem sufficient. What other options should I look into?

Depending on your WHERE clause conditions, create a partitioned table (https://www.postgresql.org/docs/10/static/ddl-partitioning.html); it will reduce query cost drastically. Also, if a WHERE clause always filters on a certain fixed value, add a partial index on the partitioned table.
One more important point: check the order of columns in your WHERE clause and match it when defining the index.
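A minimal sketch of what this could look like. The events table, its columns, and the yearly ranges are made-up placeholders, and the declarative syntax shown requires v10; on 9.5 the same layout is built with inheritance and CHECK constraints.

```sql
-- Range-partitioned parent table (declarative partitioning, PostgreSQL 10+).
CREATE TABLE events (
    id         bigint      NOT NULL,
    created_at timestamptz NOT NULL,
    status     text        NOT NULL,
    payload    jsonb
) PARTITION BY RANGE (created_at);

-- One partition per year: queries that filter on created_at only scan
-- the partitions they need (partition pruning).
CREATE TABLE events_2017 PARTITION OF events
    FOR VALUES FROM ('2017-01-01') TO ('2018-01-01');
CREATE TABLE events_2018 PARTITION OF events
    FOR VALUES FROM ('2018-01-01') TO ('2019-01-01');

-- Partial index on one partition for queries that always filter on a
-- fixed value, e.g. WHERE status = 'error'.
CREATE INDEX events_2018_error_idx
    ON events_2018 (created_at)
    WHERE status = 'error';
```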

You should upgrade to PostgreSQL v10 so that you can use parallel query.
That enables you to run sequential and index scans with several background workers in parallel, which can speed up these operations on large tables.
A good database layout, good indexing, lots of RAM and fast storage are also important factors for good performance of analytical queries.
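As a rough illustration, reusing the hypothetical events table from above: max_parallel_workers_per_gather is the setting that caps the number of workers per query, and the plan will contain Gather / Parallel Seq Scan nodes when the planner decides a parallel plan is worthwhile.

```sql
-- Allow up to 4 parallel workers per query (session-level for testing;
-- set it in postgresql.conf for a permanent change).
SET max_parallel_workers_per_gather = 4;

-- Check whether the planner actually chooses a parallel plan:
-- look for "Gather" and "Parallel Seq Scan" nodes in the output.
EXPLAIN (ANALYZE)
SELECT status, count(*)
FROM events
GROUP BY status;
```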

If the analysis involves a lot of aggregation, consider materialized views to store the aggregates. Materialized views do take up space and need to be refreshed, but they are very useful for pre-computing aggregations that are queried repeatedly.
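A hedged sketch, again using the hypothetical events table, of a materialized view that pre-computes a daily rollup:

```sql
-- Pre-aggregate once, then query the much smaller result many times.
CREATE MATERIALIZED VIEW daily_event_counts AS
SELECT date_trunc('day', created_at) AS day,
       status,
       count(*)                      AS n
FROM events
GROUP BY 1, 2;

-- A materialized view can be indexed like any other table.
CREATE INDEX ON daily_event_counts (day, status);

-- Refresh when the underlying data changes.
REFRESH MATERIALIZED VIEW daily_event_counts;
```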

Related

Use Elasticsearch and Postgres simultaneously to perform queries

I'm working on a querying tool that would allow making complex relational queries alongside full-text search on quite large datasets. My data is stored in a Postgres database, with an Elasticsearch engine for convenient and efficient full-text search. However, some queries require complex joins, cardinality tests, filters on joined data, and so on.
My dilemma is that I cannot use only Elasticsearch or only Postgres. I need both to answer specific needs, but combining them seems to be a really difficult task.
My approach
... was to perform the ES query first and then use the resulting ids as a filter for the SQL query. The problem is that ES has a max_result_window that prevents getting all the matching data at once. Even worse, ES may return the first 10K results matching the full-text search, but the subsequent SQL query may narrow that result set down to something ridiculously small, even though there are actually many more matching items; the 10K limit acts as a bottleneck.
Going the other way around is no better: if we use the ids returned by the SQL query as filters for the Elasticsearch query, the max_clause_count limit is easily reached, and the ES query cannot be run on more than (by default) 1024 items.
Maybe my logic isn't good. Is there any other approach to combining ES and Postgres queries? Thanks.

MongoDB Index - Performance considerations on collections with high write-to-read ratio

I have a MongoDB collection with around 100k inserts every day. Each document consumes around 1 MB of space and has a lot of elements.
The data is mainly stored for analytics purposes and read a couple of times each day. I want to speed up the queries by adding indexes to a few fields which are usually used for filtering, but I stumbled across this statement in the MongoDB documentation:
Adding an index has some negative performance impact for write operations. For collections with high write-to-read ratio, indexes are expensive since each insert must also update any indexes.
source
I was wondering whether this should bother me with 100k inserts vs. a couple of big read operations, and whether it would add a lot of overhead to the insert operations.
If so, should I separate reads from writes into separate collections and duplicate the data, or are there other solutions for this?

Reorganizing indexes and database size

I have a fragmentation problem on my production database. One of my main data tables is about 6 GB in size (3 GB of that being indexes, about 9M records) and has 94% (!) index fragmentation.
I know that reorganizing indexes will solve this problem, BUT my database is on SQL Server 2008 R2 Express, which has a 10 GB database size limit, and my database is already 8 GB in size.
I have read a few blog posts about this issue, but none gave an answer for my situation.
Question 1:
How much of a size increase (% or in GB) can I expect after reorganizing the indexes on that table?
Question 2:
Will dropping the index and then recreating the same index take less space? Time is not a factor for me at the moment.
Extra question:
Any other suggestions for dealing with fragmentation? I only know to avoid shrinking like fire ;)
Having an index on key columns will improve joins and filters by removing the need for a table scan. A well-maintained index can drastically improve performance.
It is right that GUIDs make a poor choice for indexed columns, but by no means does that mean you should not create those indexes; ideally a data type of INT or BIGINT would be advisable.
For me, adding NEWID() as a default has shown some improvement in counteracting index fragmentation, but if all alternatives fail you may have to run index maintenance operations (rebuild, reorganize) more often than for other indexes. Reorganizing needs some working space, but in your scenario, since time is not a concern, I would disable the index, shrink the database, and then rebuild the index.
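For reference, the maintenance operations mentioned above look roughly like this in T-SQL; the table and index names are placeholders.

```sql
-- Reorganize: lighter-weight, runs online, defragments leaf pages in place.
ALTER INDEX IX_MyTable_MyColumn ON dbo.MyTable REORGANIZE;

-- Rebuild: recreates the index from scratch; on Express edition this is an
-- offline operation and needs working space while it runs.
ALTER INDEX IX_MyTable_MyColumn ON dbo.MyTable REBUILD;

-- The disable / shrink / rebuild route described above:
ALTER INDEX IX_MyTable_MyColumn ON dbo.MyTable DISABLE;
-- (shrink the database here if space must be reclaimed first)
ALTER INDEX IX_MyTable_MyColumn ON dbo.MyTable REBUILD;
```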

Number of indexes per table

Based on your experience, is there any practical limit on the number of indexes per table in PostgreSQL? In theory there is none, according to the documentation: "Maximum Indexes per Table: Unlimited". But:
Is it the case that the more indexes you have, the slower the queries? Does it make a difference whether I have tens vs. hundreds or even thousands of indexes? I am asking after reading the documentation on Postgres's partial indexes, which makes me think of some very creative solutions that, however, require a lot of indexes.
There is overhead in having a high number of indexes in a few different ways:
Space consumption, although this would be lower with partial indexes of course.
Query optimisation, through making the choice of optimiser plan potentially more complex.
Table modification time, through the additional work in modifying indexes when a new row is inserted, or current row deleted or modified.
I tend by default to go heavy on indexing as:
Space is generally pretty cheap
Queries with bound variables only need to be optimised once
Rows generally have to be found much more often than they are modified, so it's generally more important to design the system for efficiently finding rows than it is for reducing overhead in making modifications to them.
The impact of missing a required index can be very high, even if the index is only required occasionally.
I've worked on an Oracle system with denormalised reporting tables having over 200 columns with 100 of them indexed, and it was not a problem. Partial indexes would have been nice, but Oracle does not support them directly (you use a rather inconvenient CASE hack).
So I'd go ahead and get creative, as long as you're aware of the pros and cons, and preferably you would also measure the impact that you're having on the system.
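To make the partial-index trade-off concrete, here is a small hypothetical PostgreSQL example; the orders table, its columns, and the predicate are invented for illustration. Each partial index only covers the rows a particular query pattern actually touches, so it stays small and cheap to maintain.

```sql
-- Full index: covers every row, large, maintained on every write.
CREATE INDEX orders_customer_idx ON orders (customer_id);

-- Partial index: only covers the small subset a query pattern needs,
-- e.g. "open orders".
CREATE INDEX orders_open_customer_idx
    ON orders (customer_id)
    WHERE status = 'open';

-- The planner uses the partial index only when the query's WHERE clause
-- implies the index predicate:
SELECT *
FROM orders
WHERE customer_id = 42
  AND status = 'open';
```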

Does MongoDB use statement-based and row-based replication like MySQL?

How does MongoDB replicate an update query that affects multiple documents?
Will it use a statement-based approach, conserving oplog space, or a row-based approach?
What are the criteria for selecting row-based or statement-based replication?
Will it use a statement-based approach, conserving oplog space, or a row-based approach?
MongoDB works on a row-by-row (document-by-document) basis using an oplog. So when you do an update that affects multiple rows, it will actually write each row to the oplog one by one, which of course takes up space; as noted in the manual: http://docs.mongodb.org/manual/core/replication/#oplog
The oplog must translate multi-updates into individual operations, in order to maintain idempotency. This can use a great deal of oplog space without a corresponding increase in disk utilization.
The oplog is basically a capped collection and it will replicate going oldest first.
As far as I know, MongoDB does not do statement-based replication, unlike many SQL technologies.