AWS Athena Federated Query - GENERIC_USER_ERROR when running DB query for PostgreSQL - postgresql

Hi all,
I am trying to execute queries on a postgresql database I created in AWS.
I added a data source to Athena, I created the data source for postgresql and I created the lambda function.
In Lambda function I set:
default connection string
spill_bucket and spill prefix (I set the same for both: 'athena-spill'. In the S3 page I cannot see any athena-spill bucket)
the security group --> I set the security group I created to access the db
the subnet --> I set one of the database subnet
I deployed the lambda function but I received an error and I had to add a new environment variable created with the connection string but named as 'dbname_connection_string'.
After adding this new env variable I am able to see the database in Athena but when I try to execute any query on this database as:
select * from tests_summary limit 10;
I receive this error after running query:
GENERIC_USER_ERROR: Encountered an exception[com.amazonaws.SdkClientException] from your LambdaFunction[arn:aws:lambda:eu-central-1:449809321626:function:data-production-athena-connector-nina-lambda] executed in context[retrieving meta-data] with message[Unable to execute HTTP request: Connect to s3.eu-central-1.amazonaws.com:443 [s3.eu-central-1.amazonaws.com/52.219.170.25] failed: connect timed out]
This query ran against the "public" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: 3366bd80-143e-459c-a4da-5350b5ab4a77
What could be causing the problem?
Thanks a lot!

Root Cause:
VPC have no internet connection issue, causing Lambda can't access S3.
Solution:
Add VPC Gateway Endpoint (Select com.amazonaws.eu-central-1.s3) in Lambda associated VPC.

Related

timeout errors from Lambda when trying to access an Amazon RDS DB instance

I am writing a Python app to run as lambda function and want to connect to an RDS DB instance without making it publicly accessible.
The RDS DB instance was already created under the default VPC with security group "sg-abcd".
So I have:
created a lambda function under the same default VPC
created a role with the following permission AWSLambdaVPCAccessExecutionRole and assigned it to the lambda function as in https://docs.aws.amazon.com/lambda/latest/dg/services-rds-tutorial.html
set sg-abcd as the lambda function's security group
added sg-abcd as source in the security group's inbound rules
added the CIDR range of the lambda function's subnet as source in the security group's inbound rules
However, when I invoke the lambda function it times out.
I can connect to the RDS DB from my laptop (after setting my IP as source in the sg's inbound rules), so I now that it is not an authentication problem. Also, for the RDS DB "Publicly Accessible" is set to "Yes".
Here's part of the app's code (where I try to connect):
rds_host = "xxx.rds.amazonaws.com"
port = xxxx
name = rds_config.db_username
password = rds_config.db_password
db_name = rds_config.db_name
logger = logging.getLogger()
logger.setLevel(logging.INFO)
try:
conn = psycopg2.connect(host=rds_host, database=db_name, user=name, password=password, connect_timeout=20)
except psycopg2.Error as e:
logger.error("ERROR: Unexpected error: Could not connect to PostgreSQL instance.")
logger.error(e)
sys.exit()
I really can't understand what I am missing. Any suggestion is welcomed, please help me figure it out!
Edit: the inbound rules that I have set look like this:
Security group rule ID: sgr-123456789
Type Info: PostgreSQL
Protocol Info: TPC
Port range Info: 5432
Source: sg-abcd OR IP or CIDR range
This document should help you out. Just make sure to get the suggestions for your specific scenario, whether the lambda function and the RDS instance are in the same VPC or not.
In my case I have the lambda function and the RDS instance in the same VPC and also both have the same subnets and SGs. But just make sure to follow the instructions in that document for the configurations needed for each scenario.

Cannot get AWS Data Pipeline connected to Redshift

I have a query I'd like to run regularly in Redshift. I've set up an AWS Data Pipeline for it.
My problem is that I cannot figure out how to access Redshift. I keep getting "Unable to establish connection" errors. I have an Ec2Resource and I've tried including a subnet from our cluster's VPC and using the Security Group Id that Redshift uses, while also adding that sg-id to the inbound part of the rules. No luck.
Does anyone have a from-scratch way to set up a data pipeline to run against Redshift?
How I currently have my pipeline set up
RedshiftDatabase
Connection String: jdbc:redshift://[host]:[port]/[database]
Username, Password
Ec2Resource
Resource Role: DataPipelineDefaultResourceRole
Role: DataPipelineDefaultRole
Terminate after: 20 minutes
SqlActivity
Database: [database] (from Connection String)
Runs on: Ec2Resource
Script: SQL query
Error message
Unable to establish connection to jdbc:postgresql://[host]:[port]/[database] Connection refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections.
Ok, so the answer lies in Security Groups. I had to find the Security Group my Redshift cluster is in, and then add that as a value to "Security Group" parameter on the Ec2Resource in the DataPipeline.
Ec2Resource
Resource Role: DataPipelineDefaultResourceRole
Role: DataPipelineDefaultRole
Terminate after: 20 minutes
Security Group: sg-XXXXX [pull from Redshift]
Try opening inbound rules to all sources, just to narrow down possible causes. You've probably done this, but make sure you've set up your jdbc driver and configurations according to this.

Newbie help - how to connect to AWS Redshift cluster (currently using Aginity)

(I'm afraid I'm probably about to reveal myself as completely unfit for the task at hand!)
I'm trying to setup a Redshift cluster and database to help manage data for a class/group project.
I have a dc2.large cluster running with either default options, or what looked like the most generic in the couple of place I was forced to make entries.
I have downloaded Aginity (Win64) as it is described as being specialized for Redshift. That said, I can't find any instructions for connecting using it. The connection dialog requests the follwoing:
Server: using the endpoint for my cluster (less :57xx at the end).
UserID: the Master username for the database defined for the cluster.
Password: to match the UserID
SSL Mode (Disable, Allow, Prefer, Require): trying various options
Database: as named in cluster setup
Port: as defined in cluster setup
I can't get it to connect ("failed to establish connection") and don't know if I'm entering something wrong in Aginity or if I haven't set up my cluster properly.
Message: Failed to establish a connection to 'abc1234-smtm.crone7m2jcwv.us-east-1.redshift.amazonaws.com'.
Type : Npgsql.NpgsqlException
Source : Npgsql
Trace : at Npgsql.NpgsqlClosedState.Open(NpgsqlConnector context, Int32 timeout)
at Npgsql.NpgsqlConnector.Open()
at Npgsql.NpgsqlConnection.Open()
at Aginity.MPP.Common.BaseDataProvider.get_Connection()
at Aginity.MPP.Common.BaseDataProvider.CreateCommand(String commandText, CommandType commandType, IDataParameter[] commandParams)
at Aginity.MPP.Common.BaseDataProvider.ExecuteReader(String commandText, CommandType commandType, IDataParameter[] commandParams)
--- Inner Exception: ---
......
It seems there is not enough information going into Aginity to authorize connection to my cluster - no account credential are supplied. For UserID, am I meant to enter the ID of a valid user? Can I use the root account? What would the ID look like? I have setup a User with FullAccess to S3 and Redshift, then entered the UserID in this format
arn:aws:iam::600123456789:user/john
along with the matching password, but that hasn't worked either.
The only training/tutorial I have been able to find/do on this is the Intro AWS direct you to, at https://qwiklabs.com/focuses/2366, which uses a web-based client that I can't find outside of the tutorial (pgweb).
Any advice what I am doing wrong, and how to do it right?
Well, I think I got it working - I haven't had a chance to see if I can actually create table yet, but it seems to be connected. I had to allow inbound traffic from outside the VPC, as per the above snapshot.
I'm guessing there's a better way than opening it up to all IP addresses, but I don't know the users' (fellow team members) IPs, and aren't they all subject to change depending on the device they're using to connect?
How does one go about getting inside the VPC to connect that way, presumably more securely?

EMR Spark Fails to Save Dataframe to S3

I am using the RunJobFlow command to spin up a Spark EMR cluster. This command sets the JobFlowRole to an IAM Role which has the policies AmazonElasticMapReduceforEC2Role and AmazonRedshiftReadOnlyAccess. The first policy contains an action to allow all s3 permissions.
When the EC2 instances spin up, they assume this IAM role, and generate temporary credentials via STS.
The first thing which I do is read a table from my Redshift cluster into a Spark Dataframe using the com.databricks.spark.redshift format and using the same IAM Role to unload the data from redshift as I did for the EMR JobFlowRole.
So far as I understand, this runs an UNLOAD command on Redshift to dump into the S3 bucket I specify. Spark then loads the newly unloaded data into a Dataframe. I use the recommended s3n:// protocol for the tempdir option.
This command works great, and it always successfully loads the data into the Dataframe.
I then run some transformations and attempt to save the dataframe in the csv format to the same S3 bucket Redshift Unloaded into.
However, when I try to do this, it throws the following error
java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively)
Okay. So I don't know why this happens, but I tried to hack around it by setting the recommended hadoop configuration parameters. I then used DefaultAWSCredentialsProviderChain to load the AWSAccessKeyID and AWSSecretKey and set via
spark.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", <CREDENTIALS_ACCESS_KEY>)
spark.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", <CREDENTIALS_SECRET_ACCESS_KEY>)
When I run it again it throws the following error:
java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: The AWS Access Key Id you provided does not exist in our records. (Service: Amazon S3; Status Code: 403; Error Code: InvalidAccessKeyId;
Okay. So that didn't work. I then removed setting the hadoop configurations and hardcoded an IAM user's credentials in the s3 url via s3n://ACCESS_KEY:SECRET_KEY#BUCKET/KEY
When I ran this it spit out the following error:
java.lang.IllegalArgumentException: Bucket name should be between 3 and 63 characters long
So it tried to create a bucket.. which is definitely not what we want it to do.
I am really stuck on this one and would really appreciate any help here! It works fine when I run it locally, but completely fails on EMR.
The problem was the following:
EC2 Instance Generated Temporary Credentials on EMR Bootstrap Phase
When I queried Redshift, I passed the aws_iam_role to theDatabricks driver. The driver then re-generated temporary credentials for that same IAM role. This invalidated the credentials the EC2 instance generated.
I then tried to upload to S3 using the old credentials (and the credentials which were stored in the instance's metadata)
It failed because it was trying to use out-of-date credentials.
The solution was to remove redshift authorization via aws_iam_role and replace it with the following:
val credentials = EC2MetadataUtils.getIAMSecurityCredentials
...
.option("temporary_aws_access_key_id", credentials.get(IAM_ROLE).accessKeyId)
.option("temporary_aws_secret_access_key", credentials.get(IAM_ROLE).secretAccessKey)
.option("temporary_aws_session_token", credentials.get(IAM_ROLE).token)
On amazon EMR, try usong the prefix s3:// to refer to an object in S3.
It's a long story.

Connection to a S3 instance using a service-connector

I'm trying to create a service-connector to my s3 instance like this:
cf service-connector 13001 mybucketname.ds31s3.swisscom.com:443
But I get the following error:
Server-Error 403: Check of security groups failed (no access)
I have created my service key according to this documentation.
Connecting to my MongoDB works perfectly using a service connector.
You can access Swisscom's S3 directly without the service connector.
The error message suggests that your current org and space do no have access to the S3. This is usually the case is there is no app-binding for that service in the current space. Please check whether you created your service key in the right org and space.
There was a misconfiguration due to security changes. We fixed the issue, so connecting to s3 with the service-connector should now work.