Debug why Firehose is not delivering to Redshift [duplicate] - amazon-redshift

This question already has answers here:
AWS Kinesis Firehose not inserting data in Redshift
(5 answers)
Closed 5 years ago.
I set up a Firehose stream that delivers data to my Redshift cluster. It worked for a short period but then suddenly stopped delivering to Redshift. I have been checking with the following queries:
select * from stl_query order by endtime desc limit 10;
select * from stl_load_errors order by starttime desc;
select * from stl_connection_log where remotehost like '52%' order by recordtime desc;
select * from stl_error where userid!=0 order by recordtime desc;
Running those queries does not show the most recent connections or COPY commands. For example I see:
disconnecting session ... 52.70.63.204 ...
initiating session ... 52.70.63.204 ...
... in my connection logs, but the entries stop after a certain time. I've tried recreating the table and the stream, but still nothing new is listed. All my data is being received in S3, however.
The other problem is that there are no error manifests in the S3 directory, which suggests nothing failed.
How can I debug this?

Found the answer for my case. I had configured the Redshift cluster with a VPC security group. Without whitelisted access, the connection attempts do not show up in stl_connection_log. I added an entry for Firehose to the VPC security group for my Redshift cluster:
Custom TCP Rule, TCP, 5493, 52.70.63.192/27
The IP ranges to whitelist can be found at the bottom of: http://docs.aws.amazon.com/firehose/latest/dev/controlling-access.html
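As a quick sanity check on a rule like the one above, the address seen in the connection log can be tested against the whitelisted CIDR block. This is only a minimal sketch using Python's standard ipaddress module; the helper name is_in_range is hypothetical, and the range for your region may differ from the one quoted in this answer.

```python
import ipaddress

# CIDR block from the security group rule above (may differ per region).
FIREHOSE_RANGE = ipaddress.ip_network("52.70.63.192/27")


def is_in_range(address: str, network=FIREHOSE_RANGE) -> bool:
    """Return True if the address falls inside the whitelisted CIDR block."""
    return ipaddress.ip_address(address) in network


# Address taken from the "initiating session" lines in the question.
print(is_in_range("52.70.63.204"))            # True
# First and last address covered by the /27 block.
print(FIREHOSE_RANGE[0], FIREHOSE_RANGE[-1])  # 52.70.63.192 52.70.63.223
```

If an address from the logs falls outside the block, the rule would not cover it and the range should be checked against the documentation page linked above.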

Related

Dynamically change VPC inside Glue job

Hi, I am having issues with VPC settings. When I use a connection (that has a VPC attached) in a Glue job, I can read data from SQL Server, but then I can't write data to the target Snowflake server due to a "timeout". By timeout I mean that the job fails only because it times out; there are no errors in the logs.
If I remove the connection from the same job and replace the SQL server data frame with some dummy spark data frame it all writes to Snowflake without any issue.
For the connection to SQL Server I use the data catalog, and for Snowflake I use the Spark JDBC connector (2 jar files added to the job).
I am thinking about connecting to the VPC dynamically from the job script itself, pulling the data into a data frame, then disconnecting from the VPC and writing the data frame to the target. Does anyone think that is possible? I didn't find any mention of this in the documentation.

Trino-PostgreSQL schema metadata cannot be queried

I have deployed a test Trino cluster composed of a coordinator and one worker node.
I have defined several catalogs, all PostgreSQL databases, and I am trying to execute some simple operations such as
describe analysis_n7pt_sarar4.public.tests_summary;
or
show tables from analysis_n7pt_sarar4.public like '%sub_step%'
In the Trino web UI I see the queries blocked at 9%, and everything seems to hang.
If I execute queries such as:
select * from analysis_n7pt_sarar4.public.bench limit 5
or
select count(*) from analysis_n7pt_sarar4.public.tests_summary;
I get results within a few seconds.
In http-request.log I found no errors on either the coordinator or the worker.
What should I check?
Thanks

Why does fetching from PostgreSQL via Hibernate take extremely long on AWS?

I have an environment on AWS with an RDS PostgreSQL 9.6 instance and a Spring Boot v1.2.7.RELEASE application running on an EC2 instance. Now I want to fetch about 10,000 entries from a table in the PostgreSQL DB, which takes about 1 minute. If I do this locally, it takes about a second to fetch the entities.
I would expect the fetch to take only slightly longer than locally, around 2 or 3 seconds.
Instead, the request takes 1 minute.
To determine whether the problem is caused by a bad query, I ran
explain analyze SELECT * FROM view_name where uuid ='4e663553-4271-4d7d-8de9-d7b746787cc6'
which tells me that the execution of the query itself takes just 300 ms.
Therefore I think the performance issue comes from transmitting the data from the DB to the application, but I don't know how to evaluate this or how to improve it.
To reproduce this, I guess you need an AWS environment with an RDS instance and an application that just uses Hibernate to fetch a table with approximately 10,000 entries from the RDS instance.
EDIT 1
Persistence and DataSource configuration.
We are using Hibernate and have the following configuration:
hibernate.default_batch_fetch_size=8
hibernate.jdbc.fetch_size=10
hibernate.jdbc.batch_size=8
hibernate.cache.use_query_cache=true
hibernate.cache.use_second_level_cache=true
hibernate.cache.region.factory_class=org.hibernate.cache.redis.SingletonRedisRegionFactory
hibernate.cache.use_structured_entries=true
hibernate.max_fetch_depth=10
hibernate.transaction.factory_class=org.hibernate.engine.transaction.internal.jdbc.JdbcTransactionFactory
javax.persistence.sharedCache.mode=ENABLE_SELECTIVE
I should also note that we use ElastiCache Redis with version 2.8.24.

AWS Glue : Unable to process data from multiple sources S3 bucket and postgreSQL db with AWS Glue using Scala-Spark

For my requirement, I need to join data in a PostgreSQL DB (hosted in RDS) with a file in an S3 bucket. I have created a Glue job (Spark/Scala) which should connect to both PostgreSQL and the S3 bucket and complete the processing.
But the Glue job encounters a connection timeout while connecting to S3 (the error message is below). It successfully fetches data from PostgreSQL.
There is no permission-related issue with S3, because I am able to read from and write to the same S3 bucket/path using a different job. The exception happens only if I try to connect to both PostgreSQL and S3 in one Glue job/script.
In the Glue job, the Glue context is created from a SparkContext object. I have tried creating two different SparkSessions, one each for S3 and the PostgreSQL DB, but this approach didn't work; the same timeout occurred.
Please help me resolve the issue.
Error/Exception from log:
ERROR[main] glue.processLauncher (Logging.scala:logError(91)):Exception in User Class
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to emp_bucket.s3.amazonaws.com:443
[emp_bucket.s3.amazonaws.com/] failed : connect timed out
This is fixed.
The issue was with the security group. Only TCP traffic was allowed earlier.
As part of the fix, traffic was opened for all. I also added an HTTPS rule to the inbound rules.
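A connect timeout like the one in the log can sometimes be narrowed down with a plain TCP probe run from the same network as the job. The following is only a rough sketch using Python's standard socket module; the helper name can_connect is hypothetical, and it only checks that a TCP connection to the given host and port can be opened, nothing about S3 permissions.

```python
import socket


def can_connect(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and DNS resolution failures.
        return False


# Example call (host taken from the error message above):
# can_connect("emp_bucket.s3.amazonaws.com", 443)
```

If the probe returns False from inside the VPC but True elsewhere, the security group or routing is a more likely cause than the job code.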

Issues with postgres_operator in Airflow dag

I am currently using Airflow 1.8.2 to schedule some EMR tasks and then execute some long-running queries on our Redshift cluster. For that purpose I am using the postgres_operator. The queries take about 30 minutes to run. However, once they are done, the connection never closes, and the operator runs for another hour and a half until it is terminated at the 2-hour mark every time. The message on termination is that the server closed the connection unexpectedly.
I've checked the logs on Redshift's end, and they show that the queries have run and the connection has been closed. Somehow, that is never communicated back to Airflow. Any pointers on what else I could check would be helpful. To give some more info, my Airflow installation is an extension of the https://github.com/puckel/docker-airflow docker image, runs in an ECS cluster, and has SQLite as its backend since I am still testing Airflow out. Also, I'm using the sequential executor. I would appreciate any help in this matter.
We had a similar issue before. I am using SQLAlchemy with Redshift, but if you are using postgres_operator it should be very similar. It seems Redshift will close the connection if it doesn't see any activity during a long-running query, and in your case 30 minutes is a pretty long query.
Check https://www.postgresql.org/docs/9.5/static/runtime-config-connection.html
There are three settings, tcp_keepalives_idle, tcp_keepalives_interval, and tcp_keepalives_count, that control sending keepalive messages to Redshift to indicate "Hey, I am still alive".
You can pass the following as an argument, something like this: connect_args={'keepalives': 1, 'keepalives_idle':60, 'keepalives_interval': 60}
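To make the suggestion above concrete, here is a minimal sketch of how those keepalive options might be wired into a SQLAlchemy engine. The helper names and the default values are assumptions for illustration; keepalives, keepalives_idle, keepalives_interval and keepalives_count are libpq connection parameters, which psycopg2 passes through. The engine helper assumes SQLAlchemy and psycopg2 are installed and imports SQLAlchemy lazily so the first function works on its own.

```python
def keepalive_connect_args(idle_seconds: int = 60,
                           interval_seconds: int = 60,
                           count: int = 5) -> dict:
    """Build libpq keepalive options (values here are illustrative)."""
    return {
        "keepalives": 1,                          # turn client-side TCP keepalives on
        "keepalives_idle": idle_seconds,          # idle time before the first probe
        "keepalives_interval": interval_seconds,  # time between unanswered probes
        "keepalives_count": count,                # lost probes before giving up
    }


def make_engine(url: str):
    """Create a SQLAlchemy engine with keepalives enabled.

    Assumes SQLAlchemy and psycopg2 are installed; `url` is your own
    connection string.
    """
    from sqlalchemy import create_engine  # imported lazily on purpose
    return create_engine(url, connect_args=keepalive_connect_args())
```

With Airflow's Postgres hook, the same keys would have to be supplied through whatever mechanism your Airflow version offers for extra connection parameters; that part is not covered by this sketch.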