Why does fetching from PostgreSQL by Hibernate take extremely long on AWS?

I have an environment on AWS with an RDS PostgreSQL 9.6 instance and a Spring Boot v1.2.7.RELEASE application running on an EC2 instance. Now I want to fetch about 10,000 entries from a table of the PostgreSQL DB, which takes about 1 minute. If I do this locally, it takes about a second to fetch the entities.
I would expect the fetching to take somewhat longer than locally, say 2 or 3 seconds.
Actually the request takes 1 minute.
To determine whether the problem is caused by a bad query, I ran
explain analyze SELECT * FROM view_name where uuid ='4e663553-4271-4d7d-8de9-d7b746787cc6'
which tells me that executing the query itself takes just 300 ms.
Therefore I think the performance issue comes from transmitting the data from the DB to the application, but I don't know how to evaluate this or how to improve it.
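One way to evaluate it (a rough sketch; the endpoint, credentials and fetch size below are placeholders, only the view name comes from the query above) is to time the same rows over plain JDBC from the EC2 instance, taking Hibernate out of the picture:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class FetchTiming {
    public static void main(String[] args) throws Exception {
        long start = System.currentTimeMillis();
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://<rds-endpoint>:5432/<db>", "<user>", "<password>")) {
            // pgjdbc only honours setFetchSize() when autocommit is off.
            con.setAutoCommit(false);
            try (Statement st = con.createStatement()) {
                st.setFetchSize(1000);
                int rows = 0;
                try (ResultSet rs = st.executeQuery("SELECT * FROM view_name")) {
                    while (rs.next()) {
                        rows++;
                    }
                }
                System.out.println(rows + " rows in "
                        + (System.currentTimeMillis() - start) + " ms");
            }
        }
    }
}

If this plain JDBC loop is fast, the time is more likely being lost in how the rows are fetched and materialised by the ORM than in the network itself.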
To reproduce this, I guess you need an AWS environment with an RDS instance and an application which just uses Hibernate to fetch a table with approximately 10,000 entries from the RDS.
EDIT 1
Persistence and DataSource Configuration.
We are using Hibernate and have the following configuration:
hibernate.default_batch_fetch_size=8
hibernate.jdbc.fetch_size=10
hibernate.jdbc.batch_size=8
hibernate.cache.use_query_cache=true
hibernate.cache.use_second_level_cache=true
hibernate.cache.region.factory_class=org.hibernate.cache.redis.SingletonRedisRegionFactory
hibernate.cache.use_structured_entries=true
hibernate.max_fetch_depth=10
hibernate.transaction.factory_class=org.hibernate.engine.transaction.internal.jdbc.JdbcTransactionFactory
javax.persistence.sharedCache.mode=ENABLE_SELECTIVE
I should also note that we use ElastiCache Redis with version 2.8.24.
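One thing that stands out in the configuration above: with hibernate.jdbc.fetch_size=10, fetching roughly 10,000 rows can mean on the order of 1,000 driver round trips to RDS, each of which pays the EC2-to-RDS latency that a local database does not have. A minimal sketch of raising the fetch size for just this query through a Hibernate query hint (ViewEntity and the EntityManager wiring are placeholders):

import java.util.List;
import javax.persistence.EntityManager;

// Inside a transactional service method, with the EntityManager injected
// via @PersistenceContext. ViewEntity stands in for the mapped view.
List<ViewEntity> entries = entityManager
        .createQuery("SELECT v FROM ViewEntity v", ViewEntity.class)
        .setHint("org.hibernate.fetchSize", 1000)   // the global default above is 10
        .getResultList();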

Related

Trino-PostgreSQL schema metadata cannot be queried

I have deployed a test Trino cluster composed of a coordinator and one worker node.
I have defined several catalogs, all PostgreSQL databases, and I am trying to execute some simple operations such as
describe analysis_n7pt_sarar4.public.tests_summary;
or
show tables from analysis_n7pt_sarar4.public like '%sub_step%'
From the Trino web UI I found the queries blocked at 9% and everything seems to hang.
If I execute queries such as:
select * from analysis_n7pt_sarar4.public.bench limit 5
or
select count(*) from analysis_n7pt_sarar4.public.tests_summary;
I obtain results within a few seconds.
In http-request.log I found no errors on either the coordinator or the worker.
What should I check?
Thanks

Best way to set up a Jupyter notebook project in AWS

My current project has the following structure:
It starts with a script in a Jupyter notebook which downloads data from a CRM API and puts it in a local PostgreSQL database I run with pgAdmin. After that it runs a cluster analysis, returns some scoring values, creates a table in the database with the results, and updates these values in the CRM with another API call. This process takes between 10 and 20 hours (the API only allows 400 requests per minute).
The second notebook reads the database, detects the last update, runs an API call to update the database since the last call, runs a k-means analysis to cluster the data, compares the results with the previous call, and updates the new ones and the CRM via the API. This second process takes less than 2 hours by my estimation, and I want this script to run every 24 hours.
After testing, this works fine. Now I'm evaluating how to put this into production on AWS. I understand that for the notebooks I need SageMaker, and from what I have seen it is not that complicated; my only doubt here is whether I can call the API without implementing additional code or whether it needs some configuration. My second problem is the database. I don't understand the difference between RDS, which is the one I think I have to use for this, and Aurora or S3. My goal is to write as little code as possible, but I have tried an RDS tutorial like this one: https://www.youtube.com/watch?v=6fDTre5gikg&t=10s, and I understand it connects my local Postgres to AWS, but I can't find the data in the Amazon console; it only creates an instance? And how do I connect to it to analyse this data from SageMaker? My final goal is to run the notebooks in the cloud and connect to my Postgres in the cloud. Just some orientation about how to use these tools would be appreciated.
I don't understand the difference between RDS which is the one I think I have to use for this and Aurora or S3
RDS and Aurora are relational databases fully managed by AWS. "Regular" RDS lets you launch the existing popular databases such as MySQL, PostgreSQL and others, which you could run at home or work as well.
Aurora is AWS's in-house, cloud-native database implementation, compatible with MySQL and PostgreSQL. It can store the same data as RDS MySQL or PostgreSQL, but provides a number of features not available in RDS, such as more read replicas, distributed storage, global databases and more.
S3 is not a database but an object store, where you can keep files such as images, CSVs and Excel sheets, much like you would store them on your computer.
I understand this connects my local Postgres to AWS but I can't find the data in the Amazon console, it only creates an instance??
You can migrate your data from your local Postgres to RDS or Aurora if you wish. But neither RDS nor Aurora will connect to your existing local database, as they are databases themselves.
My final goal is to run the notebooks in the cloud and connect to my postgres in the cloud.
I don't see a reason why you wouldn't be able to connect to the database. You can try to make it work, and if you encounter difficulties you can ask a new question on SO with your RDS/Aurora setup details.

Spring Boot application performs extremely high number of SET application_name queries to postgres

I have a Spring Boot application (v2.1.5) that uses JPA (Hibernate) to connect to a Postgres DB. Spring Boot uses HikariCP for connection pooling.
In my production environment, I see the following query executed every few seconds regardless of DB activity (almost as if they are some kind of health check?):
SET application_name = 'PostgreSQL JDBC Driver'
I am struggling to get to the bottom of why these queries are performed so often and whether they can be avoided, because in my local environment the above statement is only executed when a query to the DB is performed. I still do not understand why, but it is less frequent and different behaviour compared to production.
Are these queries necessary? Can they be avoided?
Thanks.
UPDATE:
Here is a screenshot of the queries received by the DB to which the Spring Boot app is connecting using HikariCP. The time is shown as "Just now" because all of the queries shown are only ~0.5 seconds apart and all within the current minute.
This seems to be performed by the Hikari connection pool. See Default HikariCP connection pool starting Spring Boot application and its answers.
I wouldn't worry about it, since it is not performing an "extremely high number" of these operations, but only one every few seconds, possibly whenever a connection is handed out by or returned to the pool.
If it really bothers you, you could look into various places for disabling it.
The last comment here suggests that setting the connection property assumeMinServerVersion to 9.0 or higher might help.
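For example, the property can go straight into the JDBC URL that HikariCP hands to the driver (a sketch only; host, database and credentials are placeholders):

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

// With assumeMinServerVersion set, the pgjdbc driver can send parameters such as
// application_name in the startup packet instead of issuing separate SET statements.
HikariConfig config = new HikariConfig();
config.setJdbcUrl("jdbc:postgresql://db-host:5432/mydb?assumeMinServerVersion=9.0");
config.setUsername("user");
config.setPassword("secret");
HikariDataSource dataSource = new HikariDataSource(config);

In a Spring Boot setup the same parameter can simply be appended to the configured spring.datasource.url.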
Since this is probably triggered by the Hikari connection pool, it might be configurable there, by configuring the behaviour when starting, lending and returning a connection.

Spring Batch remote partitioning remote step performance

We are using remote partitioning in our POC, where we process around 20 million records. To process these records, the slave needs some static metadata, which is around 5000 rows. Our current POC uses EhCache to load this metadata into the slave once from the DB and put it in the cache, so that subsequent calls just get the data from the cache for better performance.
Now, since we are using remote partitioning, our slave has approximately 20 MDPs/threads, so each message listener first calls the DB to get the metadata; basically 20 threads are hitting the DB at the same time on each remote machine. We have 2 machines for now but will grow to 4.
My question is: is there a better way to load this metadata only once, e.g. before the job starts, and make it accessible to all remote slaves?
Or can we use a step listener in the remote step? I don't think this is a good idea, as it will be executed for each remote step execution, but I need expert thoughts on this.
You could set up an EhCache server running as a separate application, or use another caching product such as Hazelcast instead. If commercial products are an option for you, Coherence might also work.
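A minimal sketch of the Hazelcast option, assuming Hazelcast 4.x is on the classpath of every slave JVM; the map name, key and the metadata-loading method are placeholders, not taken from the original job:

import java.util.List;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.map.IMap;

public class SharedMetadataCache {

    private final IMap<String, List<String>> cache;

    public SharedMetadataCache() {
        // Every slave JVM joins the same cluster, so the map contents are shared
        // and the ~5000 metadata rows are loaded from the database only once.
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        this.cache = hz.getMap("step-metadata");
    }

    public List<String> metadata() {
        List<String> rows = cache.get("static-metadata");
        if (rows == null) {
            // Only the first caller across all slaves pays the DB round trip.
            cache.putIfAbsent("static-metadata", loadMetadataFromDb());
            rows = cache.get("static-metadata");
        }
        return rows;
    }

    private List<String> loadMetadataFromDb() {
        // Placeholder for the real JDBC/JPA lookup of the static metadata rows.
        return List.of();
    }
}

The map could also be pre-populated from a listener on the master before the partitions are sent out, so the slaves only ever read from it.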

How to monitor a Heroku Postgres database

New Relic gives nice database analyses; however, it seems to track only the web app's transactions.
I have independently managed servers which query and load my Heroku PostgreSQL database. Is there a way I can get diagnostics and analysis of the database activity so that it includes all connections to it?
New Relic application monitoring will only collect data on database queries that are part of a web transaction or background task that is being monitored. If you're using one of New Relic's supported languages to query your database, you may be able to track that code as a background task (see https://newrelic.com/docs/features/monitoring-background-processes). If you would like a general monitoring plugin for your postgresql database, you could check out the postgresql plugin for New Relic (created and supported by Boundless): http://newrelic.com/plugins/boundless/109.
You should also try Heroku PG Extras: https://github.com/heroku/heroku-pg-extras. That will give you info about cache hit rates, indexes, long-running queries, etc.