Exporting an AWS Postgres RDS Table to AWS S3 - postgresql

I wanted to use AWS Data Pipeline to pipe data from a Postgres RDS to AWS S3. Does anybody know how this is done?
More precisely, I wanted to export a Postgres table to AWS S3 using Data Pipeline. The reason I am using Data Pipeline is that I want to automate this process, and this export is going to run once every week.
Any other suggestions will also work.

There is a sample on GitHub:
https://github.com/awslabs/data-pipeline-samples/tree/master/samples/RDStoS3
Here is the code:
https://github.com/awslabs/data-pipeline-samples/blob/master/samples/RDStoS3/RDStoS3Pipeline.json

You can define a copy activity in the Data Pipeline interface to extract data from a Postgres RDS instance into S3.
Create a data node of the type SqlDataNode. Specify the table name and a select query.
Set up the database connection by specifying the RDS instance ID (the instance ID is in your URL, e.g. your-instance-id.xxxxx.eu-west-1.rds.amazonaws.com), along with the username, password and database name.
Create a data node of the type S3DataNode.
Create a Copy activity and set the SqlDataNode as input and the S3DataNode as output; a sketch of a weekly selectQuery follows.
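Since the export is meant to run once every week, the selectQuery on the SqlDataNode can restrict each run to the last week of data using Data Pipeline expressions. A minimal sketch, assuming a hypothetical events table with an updated_at timestamp column (format and minusDays are Data Pipeline expression functions, evaluated before the query reaches Postgres):
-- Hypothetical weekly window; adjust table and column names to your schema.
SELECT *
FROM events
WHERE updated_at >= '#{format(minusDays(@scheduledStartTime, 7), 'YYYY-MM-dd')}'
  AND updated_at < '#{format(@scheduledStartTime, 'YYYY-MM-dd')}';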
Another option is to use an external tool like Alooma. Alooma can replicate tables from a PostgreSQL database hosted on Amazon RDS to Amazon S3 (https://www.alooma.com/integrations/postgresql/s3). The process can be automated, and you can run it once a week.

I built a pipeline from scratch using the MySQL sample and the documentation as references.
You need to have the roles in place, DataPipelineDefaultResourceRole and DataPipelineDefaultRole.
I haven't loaded the parameters, so you need to go into the architect and put in your credentials and folders.
Hope it helps.
{
  "objects": [
    {
      "failureAndRerunMode": "CASCADE",
      "resourceRole": "DataPipelineDefaultResourceRole",
      "role": "DataPipelineDefaultRole",
      "pipelineLogUri": "#{myS3LogsPath}",
      "scheduleType": "ONDEMAND",
      "name": "Default",
      "id": "Default"
    },
    {
      "database": {
        "ref": "DatabaseId_WC2j5"
      },
      "name": "DefaultSqlDataNode1",
      "id": "SqlDataNodeId_VevnE",
      "type": "SqlDataNode",
      "selectQuery": "#{myRDSSelectQuery}",
      "table": "#{myRDSTable}"
    },
    {
      "*password": "#{*myRDSPassword}",
      "name": "RDS_database",
      "id": "DatabaseId_WC2j5",
      "type": "RdsDatabase",
      "rdsInstanceId": "#{myRDSId}",
      "username": "#{myRDSUsername}"
    },
    {
      "output": {
        "ref": "S3DataNodeId_iYhHx"
      },
      "input": {
        "ref": "SqlDataNodeId_VevnE"
      },
      "name": "DefaultCopyActivity1",
      "runsOn": {
        "ref": "ResourceId_G9GWz"
      },
      "id": "CopyActivityId_CapKO",
      "type": "CopyActivity"
    },
    {
      "dependsOn": {
        "ref": "CopyActivityId_CapKO"
      },
      "filePath": "#{myS3Container}#{format(@scheduledStartTime, 'YYYY-MM-dd-HH-mm-ss')}",
      "name": "DefaultS3DataNode1",
      "id": "S3DataNodeId_iYhHx",
      "type": "S3DataNode"
    },
    {
      "resourceRole": "DataPipelineDefaultResourceRole",
      "role": "DataPipelineDefaultRole",
      "instanceType": "m1.medium",
      "name": "DefaultResource1",
      "id": "ResourceId_G9GWz",
      "type": "Ec2Resource",
      "terminateAfter": "30 Minutes"
    }
  ],
  "parameters": []
}

You can now do this with the aws_s3.query_export_to_s3 function from within Postgres itself: https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/postgresql-s3-export.html
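A minimal sketch of the export, assuming the aws_s3 extension is available on the instance and an IAM role with write access to the target bucket is attached to it (bucket, key, region and table names below are placeholders):
-- One-time setup: install the extension (CASCADE also creates aws_commons).
CREATE EXTENSION IF NOT EXISTS aws_s3 CASCADE;
-- Export the result of a query to S3 as CSV.
SELECT *
FROM aws_s3.query_export_to_s3(
    'SELECT * FROM my_table',
    aws_commons.create_s3_uri('my-bucket', 'exports/my_table.csv', 'us-east-1'),
    options := 'format csv'
);
For the weekly cadence, this statement can be driven by any scheduler that can execute SQL against the instance.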

Related

Azure ARM Template parameters for parametrized linked service

Please forgive the title if it is confusing, but it does describe the problem I am having.
So, I have a linked service in my Azure Data Factory. It is used to connect to an Azure SQL Database.
The database name and user name are taken from parameters defined in the linked service itself. Here is a snippet of the JSON config:
"typeProperties": {
"connectionString": "Integrated Security=False;Encrypt=True;Connection Timeout=30;Data Source=myserver.database.windows.net;Initial Catalog=#{linkedService().dbName};User ID=#{linkedService().dbUserName}",
"password": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "KeyVaultLink",
"type": "LinkedServiceReference"
},
"secretName": "DBPassword"
},
"alwaysEncryptedSettings": {
"alwaysEncryptedAkvAuthType": "ManagedIdentity"
}
}
This works fine when debugging in the Azure portal. However, when I get the ARM template for the whole thing, the ARM template deployment asks for an input connection string for the linked service. If I go to the linked service definition and look up its connection string, it comes out this way:
"connectionString": "Integrated Security=False;Encrypt=True;Connection Timeout=30;Data Source=dmsql.database.windows.net;Initial Catalog=#{linkedService().dbName};User ID=#{linkedService().dbUserName}"
Then, when I input it during the ARM template deployment, should I be replacing "#{linkedService().dbName}" and "#{linkedService().dbUserName}" with actual values on the spot as I enter it? I am confused because during the ARM template deployment there are no separate fields for these parameters, and these (parameters specific to the linked service itself) are not present as separate parameters in the ARM template definition.
I created a database in the Azure portal
and enabled system-assigned managed identity for the SQL DB.
I created an Azure Key Vault and created a secret.
I created a new access policy for Azure Data Factory.
I created an Azure Data Factory and enabled system-assigned managed identity.
I created a new parameterized linked service to connect to the database, with the parameters dbName and userName. The database name and user name are taken dynamically from these parameters.
The linked service was created successfully.
JSON format of my linked service:
{
  "name": "SqlServer1",
  "properties": {
    "parameters": {
      "dbName": {
        "type": "String"
      },
      "userName": {
        "type": "String"
      }
    },
    "annotations": [],
    "type": "SqlServer",
    "typeProperties": {
      "connectionString": "Integrated Security=False;Data Source=dbservere;Initial Catalog=#{linkedService().dbName};User ID=#{linkedService().userName}",
      "password": {
        "type": "AzureKeyVaultSecret",
        "store": {
          "referenceName": "AzureKeyVault1",
          "type": "LinkedServiceReference"
        },
        "secretName": "DBPASSWORD"
      },
      "alwaysEncryptedSettings": {
        "alwaysEncryptedAkvAuthType": "ManagedIdentity"
      }
    }
  }
}
I exported the ARM template of the data factory.
This is my linked service in the ARM template:
"SqlServer1_connectionString": {
"type": "secureString",
"metadata": "Secure string for 'connectionString' of 'SqlServer1'",
"defaultValue": "Integrated Security=False;Data Source=dbservere;Initial Catalog=#{linkedService().dbName};User ID=#{linkedService().userName}"
},
"AzureKeyVault1_properties_typeProperties_baseUrl": {
"type": "string",
"defaultValue": "https://keysqlad.vault.azure.net/"
}
The parameters dbName and userName appear in my ARM template definition:
{
  "name": "[concat(parameters('factoryName'), '/SqlServer1')]",
  "type": "Microsoft.DataFactory/factories/linkedServices",
  "apiVersion": "2018-06-01",
  "properties": {
    "parameters": {
      "dbName": {
        "type": "String"
      },
      "userName": {
        "type": "String"
      }
    },
    "annotations": [],
    "type": "SqlServer",
    "typeProperties": {
      "connectionString": "[parameters('SqlServer1_connectionString')]",
      "password": {
        "type": "AzureKeyVaultSecret",
        "store": {
          "referenceName": "AzureKeyVault1",
          "type": "LinkedServiceReference"
        },
        "secretName": "DBPASSWORD"
      },
      "alwaysEncryptedSettings": {
        "alwaysEncryptedAkvAuthType": "ManagedIdentity"
      }
    }
  },
  "dependsOn": [
    "[concat(variables('factoryId'), '/linkedServices/AzureKeyVault1')]"
  ]
}
If you don't get these parameters in your ARM template definition, copy the value of "connectionString", modify what you need to while leaving the #{linkedService().…} parameters in place, and add it to the "connectionString" override parameter in your Azure Release Pipeline; it will work.

Unable to ingest data from postgresql to druid (deployed using helm)

I deployed Druid using Helm from the repository, following the commands from https://github.com/apache/druid/tree/master/helm/druid, and it deployed successfully. But when I created a task with the following spec
{
  "type": "index_parallel",
  "id": "sairam_testing_postgresql_100",
  "spec": {
    "dataSchema": {
      "dataSource": "test-ingestion-postgresql-100",
      "timestampSpec": {
        "format": "iso",
        "column": "created_at"
      },
      "dimensionsSpec": {
        "dimensions": [
          "app_id", "user_id"
        ]
      }
    },
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "sql",
        "database": {
          "type": "postgresql",
          "connectorConfig": {
            "connectURI": "jdbc:postgresql://35.200.128.167:5432/mhere_trans",
            "user": "postgres#jiovishwam-frp-att-prod-mhere-trans-psql-db-1",
            "password": "lFRWncdXG4Po0e"
          }
        },
        "sqls": [
          "SELECT app_id ,user_id FROM transactions limit 10"
        ]
      }
    },
    "maxNumConcurrentSubTasks": 2,
    "tuningConfig": {
      "type": "index_parallel",
      "partitionsSpec": {
        "type": "dynamic"
      }
    }
  }
}
it throws the error:
Failed to submit task: Cannot construct instance of org.apache.druid.firehose.PostgresqlFirehoseDatabaseConnector, problem: java.lang.ClassNotFoundException: org.postgresql.Driver at [Source: (org.eclipse.jetty.server.HttpInputOverHTTP); line: 1, column: 969] (through reference chain: org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexSupervisorTask["spec"]->org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexIngestionSpec["ioConfig"]->org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexIOConfig["inputSource"]->org.apache.druid.metadata.input.SqlInputSource["database"])
NOTE: I did try the quickstart and found a similar issue (fixed by manually adding the PostgreSQL JAR file to the lib directory), but I am not sure how to handle this when Druid is deployed using Helm charts in production.
According to the docs, in order to use a SQL input source for PostgreSQL, you need to add the extension to the load list:
druid.extensions.loadList=["postgresql-metadata-storage"]
If you are installing with the Helm chart, you can specify this in the general parameters section of your custom values.yaml:
configVars:
  druid_extensions_loadList: '["druid-histogram", "druid-datasketches", "druid-lookups-cached-global", "postgresql-metadata-storage"]'
Note: while you can extract data from PostgreSQL using a JDBC connection, for larger data sets it is recommended to use multiple SQL statements so that ingestion can be parallelized; see the sketch below.
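Each string in the inputSource's sqls array becomes one split that can be handed to a separate sub-task, so non-overlapping range queries let maxNumConcurrentSubTasks parallelize the read. A sketch, assuming the transactions table from the question and hypothetical created_at boundaries:
-- Hypothetical splits; each statement would be one element of the "sqls" array.
SELECT app_id, user_id, created_at FROM transactions WHERE created_at >= '2022-01-01' AND created_at < '2022-02-01';
SELECT app_id, user_id, created_at FROM transactions WHERE created_at >= '2022-02-01' AND created_at < '2022-03-01';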

Execute SQL script with Azure ARM template

I'm deploying a PostgreSQL server with a database and trying to seed this database with a SQL script. I've learned that the best way to execute a SQL script from an ARM template is to use a deployment script resource. Here is part of the template:
{
  "type": "Microsoft.DBforPostgreSQL/flexibleServers/databases",
  "apiVersion": "2021-06-01",
  "name": "[concat(parameters('psqlServerName'), '/', parameters('psqlDatabaseName'))]",
  "dependsOn": [
    "[resourceId('Microsoft.DBforPostgreSQL/flexibleServers', parameters('psqlServerName'))]"
  ],
  "properties": {
    "charset": "[parameters('psqlDatabaseCharset')]",
    "collation": "[parameters('psqlDatabaseCollation')]"
  },
  "resources": [
    {
      "type": "Microsoft.Resources/deploymentScripts",
      "apiVersion": "2020-10-01",
      "name": "deploySQL",
      "location": "[parameters('location')]",
      "kind": "AzureCLI",
      "dependsOn": [
        "[resourceId('Microsoft.DBforPostgreSQL/flexibleServers/databases', parameters('psqlServerName'), parameters('psqlDatabaseName'))]"
      ],
      "properties": {
        "azCliVersion": "2.34.1",
        "storageAccountSettings": {
          "storageAccountKey": "[listKeys(resourceId('Microsoft.Storage/storageAccounts', parameters('storageAccountName')), '2019-06-01').keys[0].value]",
          "storageAccountName": "[parameters('storageAccountName')]"
        },
        "cleanupPreference": "Always",
        "environmentVariables": [
          {
            "name": "psqlFqdn",
            "value": "[reference(resourceId('Microsoft.DBforPostgreSQL/flexibleServers', parameters('psqlServerName')), '2021-06-01').fullyQualifiedDomainName]"
          },
          {
            "name": "psqlDatabaseName",
            "value": "[parameters('psqlDatabaseName')]"
          },
          {
            "name": "psqlAdminLogin",
            "value": "[parameters('psqlAdminLogin')]"
          },
          {
            "name": "psqlServerName",
            "value": "[parameters('psqlServerName')]"
          },
          {
            "name": "psqlAdminPassword",
            "secureValue": "[parameters('psqlAdminPassword')]"
          }
        ],
        "retentionInterval": "P1D",
        "scriptContent": "az config set extension.use_dynamic_install=yes_without_prompt\r\naz postgres flexible-server execute --name $env:psqlServerName --admin-user $env:psqlAdminLogin --admin-password $env:psqlAdminPassword --database-name $env:psqlDatabaseName --file-path test.sql --debug"
      }
    }
  ]
}
Azure does not show any errors regarding the syntax and starts the deployment. However, the deploySQL deployment gets stuck and then fails after 1 hour due to agent execution timeout. The PostgreSQL server itself, the database, and a firewall rule (not shown in the code above) are deployed without any issues, but the SQL script is not executed. I've tried adding the --debug option to the Azure CLI commands but got nothing new from the pipeline output. I've also tried executing these commands in an Azure CLI pipeline task, and they worked perfectly. What am I missing here?

Azure Blob Storage Deployment: Stored Access Policy gets deleted

Context:
I deploy a storage account as well as one or more containers with the following ARM template via Azure DevOps, using a Resource Deployment task:
{
  "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "storageAccountName": {
      "type": "string",
      "metadata": {
        "description": "The name of the Azure Storage account."
      }
    },
    "containerNames": {
      "type": "array",
      "metadata": {
        "description": "The names of the blob containers."
      }
    },
    "location": {
      "type": "string",
      "metadata": {
        "description": "The location in which the Azure Storage resources should be deployed."
      }
    }
  },
  "resources": [
    {
      "name": "[parameters('storageAccountName')]",
      "type": "Microsoft.Storage/storageAccounts",
      "apiVersion": "2018-07-01",
      "location": "[parameters('location')]",
      "kind": "StorageV2",
      "sku": {
        "name": "Standard_LRS",
        "tier": "Standard"
      },
      "properties": {
        "accessTier": "Hot"
      }
    },
    {
      "name": "[concat(parameters('storageAccountName'), '/default/', parameters('containerNames')[copyIndex()])]",
      "type": "Microsoft.Storage/storageAccounts/blobServices/containers",
      "apiVersion": "2018-03-01-preview",
      "dependsOn": [
        "[parameters('storageAccountName')]"
      ],
      "copy": {
        "name": "containercopy",
        "count": "[length(parameters('containerNames'))]"
      }
    }
  ],
  "outputs": {
    "storageAccountName": {
      "type": "string",
      "value": "[parameters('storageAccountName')]"
    },
    "storageAccountKey": {
      "type": "string",
      "value": "[listKeys(parameters('storageAccountName'), '2018-02-01').keys[0].value]"
    },
    "storageContainerNames": {
      "type": "array",
      "value": "[parameters('containerNames')]"
    }
  }
}
The input can be, e.g.:
-storageAccountName 'stor1' -containerNames [ 'con1', 'con2' ] -location 'westeurope'
As a next step, I create Stored Access Policies for the deployed containers.
Problem:
The first time I do this, everything works fine. But if I execute the pipeline a second time, the Stored Access Policies get deleted by the deployment of the template. The storage account itself, with its containers and blobs, is not deleted (as it should be). This is unfortunate because I want to keep the Stored Access Policies, with their start time and expiry time, as deployed the first time; furthermore, I expect that the SAS tokens also become invalid (not tested so far).
Questions:
Why is this happening?
How can I avoid this problem respectively keep the Stored Access Policies?
Thanks
After doing some investigation, this seems to be by design. When deploying ARM templates for storage accounts, the PUT operation is used, i.e. elements that are not specified within the template are removed. As it is not possible to specify Stored Access Policies for containers within an ARM template for storage accounts, existing ones get deleted when the template is redeployed...

Calling Azure SQL DW Stored Procedure from Azure Data Factory

I am getting the following error when trying to run a stored procedure in an Azure SQL Data Warehouse.
Activity 'SprocActivitySample' contains an invalid Dataset reference 'Destination-SQLDW-nna'. This dataset points to the Azure SQL DW that contains the stored procedure.
Here is the entire code.
{
  "name": "SprocActivitySamplePipeline",
  "properties": {
    "activities": [
      {
        "type": "SqlServerStoredProcedure",
        "typeProperties": {
          "storedProcedureName": "DailyImport",
          "storedProcedureParameters": {
            "DateToImportFor": "$$Text.Format('{0:yyyy-MM-dd HH:mm:ss}', SliceStart)"
          }
        },
        "outputs": [
          {
            "name": "Destination-SQLDW-nna"
          }
        ],
        "scheduler": {
          "frequency": "Day",
          "interval": 1
        },
        "name": "SprocActivitySample"
      }
    ],
    "start": "2017-01-01T00:00:00Z",
    "end": "2017-02-20T05:00:00Z",
    "isPaused": true
  }
}
I'm afraid that Azure SQL Data Warehouse does not support table-valued parameters in stored procedures.
Read more about it here: https://learn.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-develop-stored-procedures
If you find a workaround for it, please share! I couldn't find any.
Also, it would be good if you could post the dataset JSON so we can try to find any errors in it.
Cheers!
I got this working. The problem was that I was referencing the wrong dataset:
"outputs": [
  {
    "name": "Destination-SQLDW-nna"
  }
]
After correcting the name to the right dataset, it works.