Copying data from S3 to Redshift stuck - amazon-redshift

I have a redshift cluster in private subnet, I have attached an IAM role with the policy of full permission to S3.
But sill copy or unload command stuck and finally got timed out error, any thought?
Seems I have to modify some network or connectivity changes.

Related

Dynamically change VPC inside Glue job

Hi I am having issues with VPC settings. When I use a connection (that has VPC attached) in Glue job I can read data from SQL server but then I can't write data into target Snowflake server due to "timeout". By timeout I mean that job doesn't fail because of any reason but timeout. No errors in logs etc.
If I remove the connection from the same job and replace the SQL server data frame with some dummy spark data frame it all writes to Snowflake without any issue.
For connection to SQL Server I use data catalog and for snowflake I use Spark jdbc connector (2 jar files added to job).
I am thinking about connecting to VPC dynamically from the job script itself, then pull the data into data frame, then disconnect from VPC and write data frame to the target. Does anyone think that it is possible? I didn't find any mentions in documentation TBH.

AWS Glue : Unable to process data from multiple sources S3 bucket and postgreSQL db with AWS Glue using Scala-Spark

For my requirement, I need to join data present in PostgreSQL db(hosted in RDS) and file present in S3 bucket. I have created a Glue job(spark-scala) which should connect to both PostgreSQL, S3 bucket and complete processing.
But Glue job encounters connection timeout while connecting to S3(below is error msg). It is successfully fetching data from PostgreSQL.
There is no permission related issue with S3 because I am able to read/write from same S3 bucket/path using different job. The exception/issue happens only if I try to connect both postgreSQL and S3 in one glue job/script.
In Glue job, glue context is created using SparkContext as object. I have tried creating two different sparkSession, each for S3 and postgreSQL db but this approach didnt work. Same timeout issue encountered.
Please help me in resolving the issue.
Error/Exception from log:
ERROR[main] glue.processLauncher (Logging.scala:logError(91)):Exception in User Class
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to emp_bucket.s3.amazonaws.com:443
[emp_bucket.s3.amazonaws.com/] failed : connect timed out
This is fixed.
Issue was with security group. Only TCP traffic was allowed earlier.
As part of the fix traffic was opened for all. Also, added HTTPS rule in inbound rules as well.

Best way to set up jupyter notebook project in AWS

My current project have the following structure:
Starts with a script in jupyter notebook which dowloads data from a CRM API to put in a local PostgressSql database I run with PgAdmin. After that it runs cluster analysis, return some scoring values, creates a table in database with the results and updates this values in the CRM with another API call. This process will take between 10 to 20 hours (the API only allows 400 requests per minute).
The second notebook reads the database, detects last update, runs api call to update database since the last call, runs kmeans analysis to cluster the data, compare results with the previous call, updates the new ones and the CRM via API. This second process takes less than 2 hours in my estimation and I want this script to run every 24 hours.
After testing, this works fine. Now I'm evaluating how to put this in production in AWS. I understand for the notebooks I need Sagemaker and from I have seen is not that complicated, my only doubt here is if I can call the API without implementing aditional code or need some configuration. My second problem is database. I don't understand the difference between RDS which is the one I think I have to use for this and Aurora or S3. My goal is to write the less code as possible, but a have try some tutorial of RDS like this one: [1]: https://www.youtube.com/watch?v=6fDTre5gikg&t=10s, and I understand this connect my local postgress to AWS but I can't find the data in the amazon page, only creates an instance?? and how to connect to it to analysis this data from SageMaker. My final goal is to run the notebooks in the cloud and connect to my postgres in the cloud. Just some orientation about how to use this tools would be appreciated.
I don't understand the difference between RDS which is the one I think I have to use for this and Aurora or S3
RDS and Aurora are relational databases fully managed by AWS. "Regular" RDS allows you to launch the existing popular databases such as MySQL, PostgreSQSL and other which you can launch at home/work as well.
Aurora is in-house, cloud-native implementation databases compatible with MySQL and PosrgreSQL. It can store the same data as RDS MySQL or PosrgreSQL, but provides a number of features not available for RDS, such as more read replicas, distributed storage, global databases and more.
S3 is not a database, but an object storage, where you can store your files, such as images, csv, excels, similarly like you would store them on your computer.
I understand this connect my local postgress to AWS but I can't find the data in the amazon page, only creates an instance??
You can migrate your data from your local postgress to RDS or Aurora if you wish. But RDS nor Aurora will not connect to your existing local database, as they are databases themselfs.
My final goal is to run the notebooks in the cloud and connect to my postgres in the cloud.
I don't see a reason why you wouldn't be able to connect to the database. You can try to make it work, and if you encounter difficulties you can make new question on SO with RDS/Aurora setup details.

Data Fusion Dataproc compute profle in a different account

I'm trying to execute a pipeline on a Data Proc cluster in a different project by the one Data Fusion instance is deployed but I am having some trouble. Data Proc instance seems to be created correctly but the start of the job fails. Any idea on how to solve?
Here the stack trace of the error
Thanks
This seems like the project where the Google Cloud Dataproc is doesn't have SSH port open. Can you check that your project allow port 22 connection? Cloud Data Fusion uses SSH to upload and monitor the job in the Cloud Dataproc.

What is the best way to take snapshots of an EC2 instance running MongoDB?

I wanted to automate taking snapshots of the volume attached to an EC2 instance running the primary node of our production MongoDB replicaSet. While trying to gauge the pitfalls and best practices over Google, I came across the fact that data inconsistency and corruption are very much possible while creating a snapshot but not of journaling is enabled, which it is in our case.
So my question is - is it safe to go ahead and execute aws ec2 create-snapshot --volume-id <volume-id> to get clean backups of my data?
Moreover, I plan on running the same command via a cron job that runs once every week. Is that a good enough plan to have scheduled backups?
For MongoDB on an EC2 instance I do the following:
mongodump to a backup directory on the EBS volume
zip the mongodump output directory
copy the zip file to an S3 bucket (with encryption and versioning enabled)
initiate a snapshot of the EBS volume
I write a script to perform the above tasks, and schedule it to run daily via cron. Now you will have a backup of the database on the EC2 instance, in the EBS snapshot, and on S3. You could go one step further by enabling cross region replication on the S3 bucket.
This setup provides multiple levels of backup. You should now be able to recover your database in the event of an EC2 failure, an EBS failure, an AWS Availability Zone failure or even a complete AWS Region failure.
I would recommend reading the official MongoDB documentation on EC2 deployments:
https://docs.mongodb.org/ecosystem/platforms/amazon-ec2/
https://docs.mongodb.org/ecosystem/tutorial/backup-and-restore-mongodb-on-amazon-ec2/