Common DB Configuration in AWS Lambda functions
I have 50 Lambda functions that call a MongoDB instance set up on AWS EC2. Currently the IP address, port number and DB name are written inside all 50 Lambda functions. I need to put this configuration in a single place that all Lambda functions can read from.
Please guide me on the best way to achieve this.
Thanks.
Sounds like an excellent use case for AWS Systems Manager Parameter Store:
AWS Systems Manager Parameter Store provides secure, hierarchical storage for configuration data management and secrets management. You can store data such as passwords, database strings, and license codes as parameter values. You can store values as plain text or encrypted data. You can then reference values by using the unique name that you specified when you created the parameter. Highly scalable, available, and durable, Parameter Store is backed by the AWS Cloud. Parameter Store is offered at no additional charge.
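For example, each Lambda function could read the connection details from a single parameter at cold start. A minimal sketch with boto3; the parameter name "/myapp/mongodb/connection" and its JSON layout are assumptions, not something from your setup:

```python
# Minimal sketch: fetch shared MongoDB settings from SSM Parameter Store.
# The parameter name and its JSON contents below are illustrative only.
import json
import boto3

ssm = boto3.client("ssm")
_cached_config = None  # reuse across warm invocations instead of calling SSM every time

def get_db_config():
    global _cached_config
    if _cached_config is None:
        resp = ssm.get_parameter(Name="/myapp/mongodb/connection", WithDecryption=True)
        _cached_config = json.loads(resp["Parameter"]["Value"])
    return _cached_config

def lambda_handler(event, context):
    cfg = get_db_config()  # e.g. {"host": "10.0.0.5", "port": 27017, "db": "mydb"}
    # ...connect to MongoDB with cfg["host"], cfg["port"], cfg["db"]...
    return {"statusCode": 200}
```

Each Lambda's execution role would also need ssm:GetParameter permission on that parameter, and a SecureString type can be used if you later add credentials to it.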
Related
I have an API where users can query some time-series data. But now I want to make the entire data set available for users to download for their own uses. How would I go about doing something like this? I have RDS and an EC2 instance set up. What would my next steps be?
In this scenario, and without any other data or restrictions given, I would put an S3 bucket at the center of this process.
Create an S3 Bucket to save the database/dataset dump.
Dump the database/dataset to S3 (for example, from Docker or a Lambda function).
Transform the dataset to CSV manually, or use a Lambda function triggered on every dataset dump (I'm not sure pg_dump can give you CSV out of the box).
Host those datasets in a bucket accessible to your users and grant access as appropriate:
You can create a publicly available bucket and share its HTTP URL.
You can create a pre-signed URL to allow limited, time-bound access to your dataset.
S3 is proposed since it's cheap and there is a lot of readily available tooling to work with it.
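For the pre-signed URL option, a few lines of boto3 are enough. A minimal sketch; the bucket and key names are illustrative:

```python
# Minimal sketch: generate a temporary download link for an exported dataset on S3.
# Bucket and key names are placeholders.
import boto3

s3 = boto3.client("s3")

def make_download_url(bucket="my-dataset-exports", key="exports/timeseries.csv", expires=3600):
    # Anyone holding the URL can download the object until it expires (seconds).
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires,
    )

print(make_download_url())
```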
I have two sources:
1. A CSV that will be uploaded to a cloud storage service, probably GCP Cloud Storage.
2. The output of a scraping process done with Python.
When a user updates 1) (the cloud-stored file), an event should be triggered to execute 2) (the scraping process), and then some transformation should take place in order to merge these two sources into one in JSON format. Finally, the content of this JSON file should be stored in a DB that is easy to access and low cost. The files the user will update are at most 5 MB, and the updates will take place once weekly.
From what I've read, I can use GCP Cloud Functions to accomplish this whole process, or I can use Dataflow; I've even considered using both. I've also thought of using MongoDB to store the JSON objects resulting from the final merge of the two sources.
Why should I use Cloud Functions, Dataflow or both? What are your thoughts on the DB? I'm open to different approaches. Thanks.
Regarding the use of Cloud Functions and Dataflow: in your case I would go for Cloud Functions, as you don't have a big volume of data. Dataflow is more complex, more expensive, and you would have to use Apache Beam. If you are comfortable with Python, and taking your scenario into consideration, I would choose Cloud Functions. Easy and convenient.
To trigger a Cloud Function when a Cloud Storage object is updated, you will have to configure the trigger. Pretty easy.
https://cloud.google.com/functions/docs/calling/storage
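To give a rough idea, here is a minimal sketch of a background Cloud Function fired by the google.storage.object.finalize event described in that page; the function name, bucket handling and the scraping/merge steps are placeholders:

```python
# Minimal sketch: background Cloud Function triggered when an object is
# created/overwritten in the watched bucket. Names and merge logic are illustrative.
from google.cloud import storage

def handle_upload(event, context):
    """event carries the bucket and object name that triggered the function."""
    bucket_name = event["bucket"]
    file_name = event["name"]

    client = storage.Client()
    csv_text = client.bucket(bucket_name).blob(file_name).download_as_text()

    # 1. run the Python scraping step here
    # 2. merge the CSV rows with the scraped data into one JSON document
    # 3. persist the result (e.g. in Datastore, as suggested below)
    print(f"Processed gs://{bucket_name}/{file_name} ({len(csv_text)} characters)")
```

When deploying, you point the trigger at your bucket with the google.storage.object.finalize event, as shown in the linked documentation.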
Regarding the DB: MongoDB is a good option, but if you want something quick and inexpensive, consider Datastore.
As a managed service it will make your life easier, with a lot of native integrations. It also has a very attractive free tier.
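If you go with Datastore, writing the merged JSON document is only a few lines with the official client. A minimal sketch; the kind name and fields are illustrative:

```python
# Minimal sketch: store the merged JSON document as a Datastore entity.
# The kind "MergedRecord" and the example fields are placeholders.
from google.cloud import datastore

client = datastore.Client()

def save_merged_record(record: dict):
    key = client.key("MergedRecord")   # Datastore assigns an ID automatically
    entity = datastore.Entity(key=key)
    entity.update(record)              # each JSON field becomes an entity property
    client.put(entity)

save_merged_record({"source_file": "input.csv", "scraped_at": "2023-01-01", "rows": 42})
```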
In our AWS environment, we have two different types of SAGs (service account groups) for data storage. One SAG is for generic storage; the other SAG is for secure data and will only hold PII or restricted data. We are planning to deploy Glue in this environment. In that case:
Would we have one metastore covering both secure and non-secure data?
If we needed two metastores, how would this work with Databricks?
If we have one metastore, how do we handle the secure data?
Please help us with more details on this.
If you are using a single region with one AWS account, there will be only one metastore for both secure and generic data, and you will have to handle access with fine-grained access policies.
A better approach would be to either use 2 different regions in a single AWS Account, or two different AWS accounts, so that access is easily managed for two different metastores.
To integrate your metastore with Databricks for (1), you will have to create two Glue Catalog instance profiles with resource level access. One instance profile will have access to generic database and tables while the other will have access to the secure databases and tables.
To integrate your metastores with Databricks for (2), you will simply create two Glue Catalog instance profiles with access to the respective metastore.
It is recommended to go with the second option, as it will save you a lot of maintenance cost and human error in the long run. More details on Glue Catalog and Databricks integration are available in the Databricks documentation.
Edit:
Based on the discussion in the comments: if both datasets have to be accessed inside the same Databricks Runtime, option 2 won't work. Option 1 can be used with two permission sets: the first only for generic data, and the second for both generic and secure data.
In AWS Glue, each AWS account has one persistent metadata store per region (called the Glue Data Catalog).
It contains database definitions, table definitions, job definitions, and other control information to manage your AWS Glue environment. You manage permissions to those objects using IAM (e.g., who can make GetTable or GetDatabase API calls against them).
In addition to AWS Glue permissions, you would also need to configure permissions to the data itself (e.g., who can make GetObject API calls against the data stored in S3).
So, answering your questions: yes, you would have a single Data Catalog.
However, depending on your security requirements, you would be able to define resource-based and role-based permissions on metadata and content.
You can find a detailed overview here - https://aws.amazon.com/blogs/big-data/restrict-access-to-your-aws-glue-data-catalog-with-resource-level-iam-permissions-and-resource-based-policies
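To make the fine-grained approach concrete, here is a sketch of a resource-level IAM policy (expressed as a Python dict) that grants read access only to a hypothetical generic_db in the Glue Data Catalog; the region, account ID, database and action list are placeholders you would adapt:

```python
# Minimal sketch: resource-level policy limiting Glue catalog reads to one database.
# Region, account ID and database name are placeholders.
import json

glue_read_generic_only = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["glue:GetDatabase", "glue:GetTable", "glue:GetTables", "glue:GetPartitions"],
            "Resource": [
                "arn:aws:glue:us-east-1:111122223333:catalog",
                "arn:aws:glue:us-east-1:111122223333:database/generic_db",
                "arn:aws:glue:us-east-1:111122223333:table/generic_db/*",
            ],
        }
    ],
}

print(json.dumps(glue_read_generic_only, indent=2))
```

A second, broader policy (covering the secure databases as well) would back the second permission set / instance profile mentioned above.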
All,
I have to BULK INSERT data into Azure SQL from an Azure Blob Storage account. I know one way is to use SAS keys, but are there more secure ways to load data from T-SQL?
For example, is there a way to use the user's AAD account to connect to the storage? Would Managed Identity work? I have not come across an example on the Internet that uses anything other than SAS keys.
Gopi
Azure Data Factory generally serves this purpose. You can build a pipeline that grabs data from Blob Storage and massages/loads it into SQL, which is pretty much what it's designed for. However, if you do not wish to use that,
the recommended way is SAS, because a SAS token can be temporary and revoked at any time. Why do you think SAS is less secure?
As per the documentation (https://learn.microsoft.com/en-us/sql/t-sql/statements/create-external-data-source-transact-sql?view=sql-server-ver15#credential--credential_name), if you create an external data source with the BLOB_STORAGE type, the identity/credential MUST be SAS, as it doesn't support any other authentication type. As such, you cannot use any other auth method to reach Blob Storage from T-SQL.
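To make the SAS route concrete, here is a minimal sketch of the T-SQL involved, driven from Python via pyodbc; the server, credential, storage account, container, file and table names are all placeholders, and it assumes a database master key already exists in the database:

```python
# Minimal sketch: SAS-based BULK INSERT from Blob Storage into Azure SQL via pyodbc.
# All names and the SAS secret are placeholders; a database master key must exist
# before the database scoped credential can be created.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net;DATABASE=mydb;"
    "UID=myuser;PWD=mypassword"
)
cur = conn.cursor()

# Per the documentation above, a BLOB_STORAGE data source only accepts a SAS
# credential (note: the secret must not start with '?').
cur.execute("""
CREATE DATABASE SCOPED CREDENTIAL BlobSasCredential
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
     SECRET = 'sv=2021-06-08&ss=b&srt=co&sp=rl&sig=REPLACE_ME';
""")
cur.execute("""
CREATE EXTERNAL DATA SOURCE MyBlobStorage
WITH (TYPE = BLOB_STORAGE,
      LOCATION = 'https://mystorageaccount.blob.core.windows.net/mycontainer',
      CREDENTIAL = BlobSasCredential);
""")
cur.execute("""
BULK INSERT dbo.MyTable
FROM 'data/myfile.csv'
WITH (DATA_SOURCE = 'MyBlobStorage', FORMAT = 'CSV', FIRSTROW = 2);
""")
conn.commit()
```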
I would like to develop a multi-tenant web application using a PostgreSQL DB, keeping the data of each tenant in a dedicated schema.
Each query or update will access only a single tenant schema and/or the public schema.
Assuming I will, at some point, need to scale out and have several PostgreSQL servers, is there some automatic way in which I can connect to a single load balancer of some sort that will redirect the queries/updates to the relevant server, based on the required schema?
The challenging part of this question is the 'automatic way'. I have a feeling that Postgres is moving in that direction; maybe 9.5 or later will have multi-master tendencies, with partitioning allowing data to be spread across a cluster so that your frontend doesn't have to change.
Assuming that your tenants can operate in separate databases, and you are looking for a way to run a query against the correct database, perhaps something like DNS could be used when connecting to the database, using the tenant ID as a component of the DNS host name. Something like:
tenant_1.example.com -> 192.168.0.10
tenant_2.example.com -> 192.168.0.11
tenant_3.example.com -> 192.168.0.11
etc.example.com -> 192.168.0.X
Then you could use the connection host as a map to the correct DB installation. The tricky part here is the overlapping data that all tenants need access to. If that overlapping data needs to be joined against, it will have to exist in all databases, either copied or accessed via dblink. If the overlapping data needs to be updated, then 'automatic' is going to be tough. Good question.
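Until something automatic exists, the routing usually ends up in the application's connection layer. A minimal sketch with psycopg2, assuming the tenant_<id>.example.com naming above and a per-tenant schema named tenant_<id> (host naming, credentials and schema names are assumptions):

```python
# Minimal sketch: pick the server via the tenant's DNS name, then point the
# session at that tenant's schema plus the shared public schema.
import psycopg2
from psycopg2 import sql

def connect_for_tenant(tenant_id: int):
    host = f"tenant_{tenant_id}.example.com"  # DNS resolves to the right server
    conn = psycopg2.connect(host=host, dbname="app", user="app_user", password="secret")
    with conn.cursor() as cur:
        cur.execute(
            sql.SQL("SET search_path TO {}, public").format(
                sql.Identifier(f"tenant_{tenant_id}")
            )
        )
    conn.commit()
    return conn

# usage: run all of tenant 42's queries through this connection
# conn = connect_for_tenant(42)
```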