Sync Elasticsearch Postgresql on a Springboot application - postgresql

I have Postgresql as my primary database and I would like to take advantage of the Elasticsearch as a search engine for my SpringBoot application.
Problem: The queries are quite complex and with millions of rows in each table, most of the search queries are timing out.
Partial solution: I utilized the materialized views concept in the Postgresql and have a job running that refreshes them every X minutes. But on systems with huge amounts of data and with other database transactions (especially writes) in progress, the views tend to take long times to refresh (about 10 minutes to refresh 5 views). I realized that the current views are at it's capacity and I cannot add more.
That's when I started exploring other options just for the search and landed on Elasticsearch and it works great with the amount of data I have. As a POC, I used the Logstash's Jdbc input plugin but then it doesn't support the DELETE operation (bummer).
From here the soft delete is the option which I cannot take because:
A) Almost all the tables in the postgresql DB are updated every few minutes and some of them have constraints on the "name" key which in this case will stay until a clean-up job runs.
B) Many tables in my Postgresql Db are referenced with CASCADE DELETE and it's not possible for me to update 220 table's Schema and JPA queries to check for the soft delete boolean.
The same question mentioned in the link above also provides PgSync that syncs the postgresql with elasticsearch periodically. However, I cannot go with that either since it has LGPL license which is forbidden in our organization.
I'm starting to wonder if anyone else encountered this strange limitation of elasticsearch and RDMS.
I'm open to other options rather than elasticsearch to solve my need. I just don't know what's the right stack to use. Any help here is much appreciated!

Related

Processing multiple concurrent read queries in Postgres

I am planning to use AWS RDS Postgres version 10.4 and above for storing data in a single table comprising of ~15 columns.
My use case is to serve:
1. Periodically (after 1 hour) store/update rows in to this table.
2. Periodically (after 1 hour) fetch data from the table say 500 rows at a time.
3. Frequently fetch small data (10 rows) from the table (100's of queries in parallel)
Does AWS RDS Postgres support serving all of above use cases
I am aware of Read-Replicas support, but is there any in built load balancer to serve the queries that come in parallel?
How many read queries can Postgres be able to process concurrently?
Thanks in advance
Your usecases seems to be a normal fit for all relational database systems. So I would say: yes.
The question is: how fast the DB can handle the 100 queries (3).
In general the postgresql documentation is one of the best I ever read. So give it a try:
https://www.postgresql.org/docs/10/parallel-query.html
But also take into consideration how big your data is!
That said, try w/o read replicas first! You might not need them.

How to see changes in a postgresql database

My postresql database is updated each night.
At the end of each nightly update, I need to know what data changed.
The update process is complex, taking a couple of hours and requires dozens of scripts, so I don't know if that influences how I could see what data has changed.
The database is around 1 TB in size, so any method that requires starting a temporary database may be very slow.
The database is an AWS instance (RDS). I have automated backups enabled (these are different to RDS snapshots which are user initiated). Is it possible to see the difference between two RDS automated backups?
I do not know if it is possible to see difference between RDS snapshots. But in the past we tested several solutions for similar problem. Maybe you can take some inspiration from it.
Obvious solution is of course auditing system. This way you can see in relatively simply way what was changed. Depending on granularity of your auditing system down to column values. Of course there is impact on your application due auditing triggers and queries into audit tables.
Another possibility is - for tables with primary keys you can store values of primary key and 'xmin' and 'ctid' hidden system columns (https://www.postgresql.org/docs/current/static/ddl-system-columns.html) for each row before updated and compare them with values after update. But this way you can identify only changed / inserted / deleted rows but not changes in different columns.
You can make streaming replica and set replication slots (and to be on the safe side also WAL log archiving ). Then stop replication on replica before updates and compare data after updates using dblink selects. But these queries can be very heavy.

Best way to backup and restore data in PostgreSQL for testing

I'm trying to migrate our database engine from MsSql to PostgreSQL. In our automated test, we restore the database back to "clean" state at the start of every test. We do this by comparing the "diff" between the working copy of the database with the clean copy (table by table). Then copying over any records that have changed. Or deleting any records that have been added. So far this strategy seems to be the best way to go about for us because per test, not a lot of data is changed, and the size of the database is not very big.
Now I'm looking for a way to essentially do the same thing but with PostgreSQL. I'm considering doing the exact same thing with PostgreSQL. But before doing so, I was wondering if anyone else has done something similar and what method you used to restore data in your automated tests.
On a side note - I considered using MsSql's snapshot or backup/restore strategy. The main problem with these methods is that I have to re-establish the db connection from the app after every test, which is not possible at the moment.
If you're okay with some extra storage, and if you (like me) are particularly not interested in re-inventing the wheel in terms of checking for diffs via your own code, you should try creating a new DB (per run) via templates feature of createdb command (or CREATE DATABASE statement) in PostgreSQL.
So for e.g.
(from bash) createdb todayDB -T snapshotDB
or
(from psql) CREATE DATABASE todayDB TEMPLATE snaptshotDB;
Pros:
In theory, always exact same DB by design (no custom logic)
Replication is a file-transfer (not DB restore). So far less time taken (i.e. doesn't run SQL again, doesn't recreate indexes / restore tables etc.)
Cons:
Takes 2x the disk space (although template could be on a low performance NFS etc)
For my specific situation. I decided to go back to the original solution. Which is to compare the "working" copy of the database with "clean" copy of the database.
There are 3 types of changes.
For INSERT records - find max(id) from clean table and delete any record on working table that has higher ID
For UPDATE or DELETE records - find all records in clean table EXCEPT records found in working table. Then UPSERT those records into working table.

Existing Postgres Database vs Solr

We have an app that uses postgres database, that has about 50 tables. Each table contains about 3 Million records (on average). The tables get updated with new data every now and than. Now, we want to implement search feature in our app. The search needs to be performed on one table at a time (no joins needed).
I've read about postgres full text support and that looks promising. But it seems that Solr is Super fast in comparison to it. Can I use my existing postgres database with Solr? If tables get updated would I need to re-index everything again?
It is definitely worth giving Solr a try. We moved many MySQL queries involving JOINs on multiple tables with sorting on different fields to Solr. We are very happy with Solr's search speed, sort speed, faceting capabilities and highly configurable text analysis/tokenization options.
If tables get updated would I need to re-index everything again?
No, you can run delta imports to only re-index your new and updated documents. See https://wiki.apache.org/solr/DataImportHandler.
Get started with https://lucene.apache.org/solr/4_1_0/tutorial.html and all the links in there.
Since nobody has leapt in, I'll answer.
I'm afraid it all depends. It depends on (at least)
how big the text is in each "document"
how flexible you want your searching to be
how much integration you need between database and text-search
how fast is fast enough
how much experience you have with both
When I've had a database that needs some text searching, I've just used PG's built-in options. If I didn't have superuser access to the db, or was already running a big Java setup then Solr might well have appealed.

How to prevent Write Ahead Logging on just one table in PostgreSQL?

I am considering log-shipping of Write Ahead Logs (WAL) in PostgreSQL to create a warm-standby database. However I have one table in the database that receives a huge amount of INSERT/DELETEs each day, but which I don't care about protecting the data in it. To reduce the amount of WALs produced I was wondering, is there a way to prevent any activity on one table from being recorded in the WALs?
Ran across this old question, which now has a better answer. Postgres 9.1 introduced "Unlogged Tables", which are tables that don't log their DML changes to WAL. See the docs for more info, but at least now there is a solution for this problem.
See Waiting for 9.1 - UNLOGGED tables by depesz, and the 9.1 docs.
Unfortunately, I don't believe there is. The WAL logging operates on the page level, which is much lower than the table level and doesn't even know which page holds data from which table. In fact, the WAL files don't even know which pages belong to which database.
You might consider moving your high activity table to a completely different instance of PostgreSQL. This seems drastic, but I can't think of another way off the top of my head to avoid having that activity show up in your WAL files.
To offer one option to my own question. There are temp tables - "temporary tables are automatically dropped at the end of a session, or optionally at the end of the current transaction (see ON COMMIT below)" - which I think don't generate WALs. Even so, this might not be ideal as the table creation & design will be have to be in the code.
I'd consider memcached for use-cases like this. You can even spread the load over a bunch of cheap machines too.