Unable to ingest data from PostgreSQL into Druid (deployed using Helm)

I deployed Druid with Helm, using the commands from https://github.com/apache/druid/tree/master/helm/druid, and the deployment succeeded. However, when I created a task with the following spec:
{
"type": "index_parallel",
"id": "sairam_testing_postgresql_100",
"spec": {
"dataSchema": {
"dataSource": "test-ingestion-postgresql-100",
"timestampSpec": {
"format": "iso",
"column": "created_at"
},
"dimensionsSpec": {
"dimensions": [
"app_id","user_id"
]
}
},
"ioConfig": {
"type": "index_parallel",
"inputSource": {
"type": "sql",
"database": {
"type": "postgresql",
"connectorConfig": {
"connectURI": "jdbc:postgresql://35.200.128.167:5432/mhere_trans",
"user": "postgres#jiovishwam-frp-att-prod-mhere-trans-psql-db-1",
"password": "lFRWncdXG4Po0e"
}
},
"sqls": [
"SELECT app_id ,user_id FROM transactions limit 10"
]
}
},
"maxNumConcurrentSubTasks": 2,
"tuningConfig": {
"type": "index_parallel",
"partitionsSpec": {
"type": "dynamic"
}
}
}
}
it throws the following error:
Failed to submit task: Cannot construct instance of org.apache.druid.firehose.PostgresqlFirehoseDatabaseConnector, problem: java.lang.ClassNotFoundException: org.postgresql.Driver at [Source: (org.eclipse.jetty.server.HttpInputOverHTTP); line: 1, column: 969] (through reference chain: org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexSupervisorTask["spec"]->org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexIngestionSpec["ioConfig"]->org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexIOConfig["inputSource"]->org.apache.druid.metadata.input.SqlInputSource["database"])
NOTE: I hit a similar issue with the quickstart and fixed it by manually adding the PostgreSQL JDBC jar file to the lib directory, but I'm not sure how to handle this when Druid is deployed with Helm charts in production.

According to the docs, to use the SQL input source with PostgreSQL you need to add the extension to the load list:
druid.extensions.loadList=["postgresql-metadata-storage"]
If you are installing with the Helm chart, you can specify this in the general parameters section of your custom values.yaml:
configVars:
  druid_extensions_loadList: '["druid-histogram", "druid-datasketches", "druid-lookups-cached-global", "postgresql-metadata-storage"]'
Note: while you can extract data from PostgreSQL over a JDBC connection, for larger data sets it is recommended to use multiple SQL statements so that ingestion can be parallelized.
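One way to follow the advice about multiple SQL statements is to split the query into key ranges and put one statement per range into the `sqls` array of the spec. The following is only a sketch in Python: it assumes the table has a numeric `id` column and that the key bounds (0 to 1000 here) are known in advance, neither of which appears in the question.

```python
import json


def build_range_queries(table, columns, key_column, lower, upper, chunks):
    """Split [lower, upper) into at most `chunks` contiguous ranges over
    key_column and return one SELECT statement per range."""
    if chunks < 1 or upper <= lower:
        raise ValueError("need chunks >= 1 and upper > lower")
    step = -(-(upper - lower) // chunks)  # ceiling division
    select_list = ", ".join(columns)
    queries = []
    start = lower
    while start < upper:
        end = min(start + step, upper)
        queries.append(
            f"SELECT {select_list} FROM {table} "
            f"WHERE {key_column} >= {start} AND {key_column} < {end}"
        )
        start = end
    return queries


# `id` and the 0..1000 bounds are illustrative assumptions, not from the question.
sqls = build_range_queries("transactions", ["app_id", "user_id"], "id", 0, 1000, 4)

# Print a fragment that can be pasted into the inputSource of the spec.
print(json.dumps({"sqls": sqls}, indent=2))
```

Each generated statement covers a disjoint range, so no row is read twice as long as the key column is not null.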

Related

ADF: linkedService template function not defined

I am currently trying to add some parameterised linked services. I have two services currently: a key vault and a data lake. Their configurations are:
// Key vault
{
"name": "Logical Key Vault",
"properties": {
"parameters": {
"environment": {
"type": "String"
}
},
"annotations": [],
"type": "AzureKeyVault",
"typeProperties": {
"baseUrl": "https://kv-#{linkedService().environment}.vault.azure.net"
}
}
}
// Data lake
{
"name": "Logical Data Lake",
"properties": {
"type": "AzureBlobFS",
"parameters": {
"environment": {
"type": "String"
}
},
"annotations": [],
"typeProperties": {
"url": "https://sa#{replace(linkedService().environment, '-', '')}.dfs.core.windows.net",
"accountKey": {
"type": "AzureKeyVaultSecret",
"secretName": "storageAccountKey",
"store": {
"referenceName": "Logical Key Vault",
"type": "LinkedServiceReference",
"parameters": {
"environment": {
"value": "#linkedService().environment",
"type": "Expression"
}
}
}
}
}
}
}
Both linked services are parameterised by an environment parameter, and I have confirmed that the Key Vault works fine and is able to correctly retrieve secrets. The problem happens when I attempt to retrieve the storage key from the key vault. I get the following error:
Error code
FailToResolveParametersInExploratoryController
Details
The parameters and expression cannot be resolved for schema operations.
Error Message: {
"message": "ErrorCode=InvalidTemplate, ErrorMessage=The template function 'linkedService' is not defined or not valid."
}
My attempts at debugging this have identified the use of #linkedService on line 38 as the issue. That is where the Data Lake tries to pass its own environment parameter to the Key Vault so that it can obtain the storage key. If I remove this use of #linkedService.environment and replace it with a hard-coded value, the linked service successfully connects to the data lake.
The expression is trivially simple, and the web interface itself offers the option.
As a result, I am unsure why the use of #linkedService fails here. The web interface and ability to use expressions would suggest it should work, but then #linkedService is undefined for some reason.
While debugging this, I did find that using the expression
#string(linkedService.environment)
does indeed work, but this seems rather odd, as the environment is itself a string and so converting it to a string should be a no-op. I have also tried removing the # entirely and using
linkedService.environment
and while this does correctly resolve to the environment, it still results in an error, because the resulting parameter contains the surrounding quotation marks. The linked service then fails to connect to the key vault at https://kv-'foobar'.vault.azure.net, which is clearly invalid (assuming my environment was foobar).

Execute SQL script with Azure ARM template

I'm deploying a PostgreSQL server with a database and trying to seed this database with a SQL script. I've learned that the best way to execute a SQL script from an ARM template is to use a deployment script resource. Here is part of the template:
{
"type": "Microsoft.DBforPostgreSQL/flexibleServers/databases",
"apiVersion": "2021-06-01",
"name": "[concat(parameters('psqlServerName'), '/', parameters('psqlDatabaseName'))]",
"dependsOn": [
"[resourceId('Microsoft.DBforPostgreSQL/flexibleServers', parameters('psqlServerName'))]"
],
"properties": {
"charset": "[parameters('psqlDatabaseCharset')]",
"collation": "[parameters('psqlDatabaseCollation')]"
},
"resources": [
{
"type": "Microsoft.Resources/deploymentScripts",
"apiVersion": "2020-10-01",
"name": "deploySQL",
"location": "[parameters('location')]",
"kind": "AzureCLI",
"dependsOn": [
"[resourceId('Microsoft.DBforPostgreSQL/flexibleServers/databases', parameters('psqlServerName'), parameters('psqlDatabaseName'))]"
],
"properties": {
"azCliVersion": "2.34.1",
"storageAccountSettings": {
"storageAccountKey": "[listKeys(resourceId('Microsoft.Storage/storageAccounts', parameters('storageAccountName')), '2019-06-01').keys[0].value]",
"storageAccountName": "[parameters('storageAccountName')]"
},
"cleanupPreference": "Always",
"environmentVariables": [
{
"name": "psqlFqdn",
"value": "[reference(resourceId('Microsoft.DBforPostgreSQL/flexibleServers', parameters('psqlServerName')), '2021-06-01').fullyQualifiedDomainName]"
},
{
"name": "psqlDatabaseName",
"value": "[parameters('psqlDatabaseName')]"
},
{
"name": "psqlAdminLogin",
"value": "[parameters('psqlAdminLogin')]"
},
{
"name": "psqlServerName",
"value": "[parameters('psqlServerName')]"
},
{
"name": "psqlAdminPassword",
"secureValue": "[parameters('psqlAdminPassword')]"
}
],
"retentionInterval": "P1D",
"scriptContent": "az config set extension.use_dynamic_install=yes_without_prompt\r\naz postgres flexible-server execute --name $env:psqlServerName --admin-user $env:psqlAdminLogin --admin-password $env:psqlAdminPassword --database-name $env:psqlDatabaseName --file-path test.sql --debug"
}
}
]
}
Azure does not report any syntax errors and starts the deployment. However, the deploySQL deployment gets stuck and then fails after 1 hour due to an agent execution timeout. The PostgreSQL server itself, the database, and the firewall rule (not shown in the code above) are deployed without any issues, but the SQL script is not executed. I've tried adding the --debug option to the Azure CLI commands, but got nothing new from the pipeline output. I've also tried executing these commands in an Azure CLI pipeline task, and they worked perfectly. What am I missing here?

Calling Azure SQL DW Stored Procedure from Azure Data Factory

I am getting the following error when trying to run a stored procedure in an Azure SQL Data Warehouse:
Activity 'SprocActivitySample' contains an invalid Dataset reference 'Destination-SQLDW-nna'.
This dataset points to Azure SQL DW, and the stored procedure is in it.
Here is the entire code:
{
"name": "SprocActivitySamplePipeline",
"properties": {
"activities": [
{
"type":"SqlServerStoredProcedure",
"typeProperties": {
"storedProcedureName": "DailyImport",
"storedProcedureParameters": {
"DateToImportFor": "$$Text.Format('{0:yyyy-MM-dd HH:mm:ss}', SliceStart)"
}
},
"outputs": [
{
"name": "Destination-SQLDW-nna"
}
],
"scheduler": {
"frequency": "Day",
"interval": 1
},
"name": "SprocActivitySample"
}
],
"start": "2017-01-01T00:00:00Z",
"end": "2017-02-20T05:00:00Z",
"isPaused": true
}
}
I'm afraid that Azure SQL Data Warehouse does not support table-valued parameters in stored procedures.
Read more about it here: https://learn.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-develop-stored-procedures
If you find a workaround for it, please share! I couldn't find any.
It would also help if you posted the dataset JSON so we can look for errors in it.
Cheers!
I got this working. The problem was that I was referencing the wrong dataset name in:
"outputs": [
{
"name": "Destination-SQLDW-nna"
}
]
After correcting the name to the right dataset, it works.
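The fix above came down to an output name that did not match any existing dataset. A check like that can be done before deploying. The sketch below is written in Python and is not part of Data Factory; the set of known dataset names (`Destination-SQLDW`) is a hypothetical value and would in practice come from the dataset definitions of the factory.

```python
def find_unknown_outputs(pipeline, known_datasets):
    """Return (activity name, dataset name) pairs whose output dataset
    is not in known_datasets."""
    unknown = []
    for activity in pipeline["properties"]["activities"]:
        for output in activity.get("outputs", []):
            if output["name"] not in known_datasets:
                unknown.append((activity["name"], output["name"]))
    return unknown


# Trimmed copy of the pipeline from the question.
pipeline = {
    "name": "SprocActivitySamplePipeline",
    "properties": {
        "activities": [
            {
                "type": "SqlServerStoredProcedure",
                "name": "SprocActivitySample",
                "outputs": [{"name": "Destination-SQLDW-nna"}],
            }
        ]
    },
}

# Hypothetical: the dataset names that actually exist in the factory.
known_datasets = {"Destination-SQLDW"}

print(find_unknown_outputs(pipeline, known_datasets))
```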

Exporting an AWS Postgres RDS Table to AWS S3

I wanted to use AWS Data Pipeline to pipe data from a Postgres RDS to AWS S3. Does anybody know how this is done?
More precisely, I want to export a Postgres table to AWS S3 using Data Pipeline. I am using Data Pipeline because I want to automate this process, and the export is going to run once every week.
Any other suggestions will also work.
There is a sample on GitHub:
https://github.com/awslabs/data-pipeline-samples/tree/master/samples/RDStoS3
Here is the code:
https://github.com/awslabs/data-pipeline-samples/blob/master/samples/RDStoS3/RDStoS3Pipeline.json
You can define a copy activity in the Data Pipeline interface to extract data from a Postgres RDS instance into S3:
1. Create a data node of the type SqlDataNode. Specify the table name and the select query.
2. Set up the database connection by specifying the RDS instance ID (the instance ID is in your URL, e.g. your-instance-id.xxxxx.eu-west-1.rds.amazonaws.com) along with the username, password and database name.
3. Create a data node of the type S3DataNode.
4. Create a Copy activity and set the SqlDataNode as input and the S3DataNode as output.
Another option is to use an external tool like Alooma. Alooma can replicate tables from a PostgreSQL database hosted on Amazon RDS to Amazon S3 (https://www.alooma.com/integrations/postgresql/s3). The process can be automated, and you can run it once a week.
I built a pipeline from scratch, using the MySQL example and the documentation as a reference.
You need to have the roles in place: DataPipelineDefaultResourceRole and DataPipelineDefaultRole.
I haven't loaded the parameters, so you need to go into Architect and enter your credentials and folders.
Hope it helps.
{
"objects": [
{
"failureAndRerunMode": "CASCADE",
"resourceRole": "DataPipelineDefaultResourceRole",
"role": "DataPipelineDefaultRole",
"pipelineLogUri": "#{myS3LogsPath}",
"scheduleType": "ONDEMAND",
"name": "Default",
"id": "Default"
},
{
"database": {
"ref": "DatabaseId_WC2j5"
},
"name": "DefaultSqlDataNode1",
"id": "SqlDataNodeId_VevnE",
"type": "SqlDataNode",
"selectQuery": "#{myRDSSelectQuery}",
"table": "#{myRDSTable}"
},
{
"*password": "#{*myRDSPassword}",
"name": "RDS_database",
"id": "DatabaseId_WC2j5",
"type": "RdsDatabase",
"rdsInstanceId": "#{myRDSId}",
"username": "#{myRDSUsername}"
},
{
"output": {
"ref": "S3DataNodeId_iYhHx"
},
"input": {
"ref": "SqlDataNodeId_VevnE"
},
"name": "DefaultCopyActivity1",
"runsOn": {
"ref": "ResourceId_G9GWz"
},
"id": "CopyActivityId_CapKO",
"type": "CopyActivity"
},
{
"dependsOn": {
"ref": "CopyActivityId_CapKO"
},
"filePath": "#{myS3Container}#{format(@scheduledStartTime, 'YYYY-MM-dd-HH-mm-ss')}",
"name": "DefaultS3DataNode1",
"id": "S3DataNodeId_iYhHx",
"type": "S3DataNode"
},
{
"resourceRole": "DataPipelineDefaultResourceRole",
"role": "DataPipelineDefaultRole",
"instanceType": "m1.medium",
"name": "DefaultResource1",
"id": "ResourceId_G9GWz",
"type": "Ec2Resource",
"terminateAfter": "30 Minutes"
}
],
"parameters": [
]
}
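Because the `parameters` array in the definition above is empty, every `#{my...}` placeholder still has to be given a value before the pipeline can run. A small sketch in Python that lists the placeholders a definition refers to, so none is missed; the regular expression assumes parameter ids start with `my` (optionally prefixed with `*`), as they do in this definition.

```python
import json
import re

# Matches #{myName} and #{*myName}; expressions such as #{format(...)} are skipped.
PLACEHOLDER = re.compile(r"#\{(\*?my\w+)\}")


def list_placeholders(definition):
    """Return the sorted, de-duplicated parameter ids referenced as #{...}."""
    found = set()

    def walk(value):
        if isinstance(value, dict):
            for item in value.values():
                walk(item)
        elif isinstance(value, list):
            for item in value:
                walk(item)
        elif isinstance(value, str):
            found.update(PLACEHOLDER.findall(value))

    walk(definition)
    return sorted(found)


# A trimmed copy of two objects from the definition above.
definition = json.loads("""
{
  "objects": [
    {"id": "Default", "pipelineLogUri": "#{myS3LogsPath}"},
    {"id": "DatabaseId_WC2j5", "type": "RdsDatabase",
     "*password": "#{*myRDSPassword}",
     "rdsInstanceId": "#{myRDSId}",
     "username": "#{myRDSUsername}"}
  ],
  "parameters": []
}
""")

print(list_placeholders(definition))
```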
You can now do this with the aws_s3.query_export_to_s3 function from within Postgres itself: https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/postgresql-s3-export.html
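As a rough illustration of what such a call looks like, the Python sketch below only builds the SQL statement as a string. It would still have to be run from a client connected to the database, and the linked guide describes the prerequisites (the aws_s3 extension and S3 access for the instance). The bucket name, file path and region are placeholders, and the quoting helper assumes PostgreSQL's default standard-conforming string literals.

```python
def quote_literal(value):
    """Quote a string as a SQL literal by doubling single quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_export_statement(query, bucket, file_path, region):
    """Build a SELECT that calls aws_s3.query_export_to_s3 for `query`."""
    return (
        "SELECT * FROM aws_s3.query_export_to_s3("
        f"{quote_literal(query)}, "
        "aws_commons.create_s3_uri("
        f"{quote_literal(bucket)}, {quote_literal(file_path)}, {quote_literal(region)}"
        "))"
    )


# Placeholder values: replace with your own bucket, key and region.
statement = build_export_statement(
    "SELECT * FROM transactions",
    "my-bucket",
    "exports/transactions.csv",
    "eu-west-1",
)
print(statement)
```

Scheduling this statement weekly (for example from cron or a scheduled job) would cover the automation requirement from the question.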

How to dynamically name an ECS cluster with CloudFormation?

It's easy to create the cluster MyCluster with a hardcoded name:
"MyCluster": {
"Type": "AWS::ECS::Cluster"
}
However, I want a dynamic name while still being able to reference the named resource. Something like this, where the cluster name would be the stack name:
"NamedReferenceButNotClusterName": {
"Type": "AWS::ECS::Cluster",
"Properties": {
"Name": {"Ref": "AWS::StackName"} <-- Name property isn't allowed
}
},
"ecsService": {
"Type": "AWS::ECS::Service",
"DependsOn": [
{"Ref": "NamedReferenceButNotClusterName"} <-- not sure if I can even do this
],
"Properties": {
"Cluster": {
"Ref": "NamedReferenceButNotClusterName" <-- I really want this part
},
"DesiredCount": 2,
"TaskDefinition": {
"Ref": "EcsTask"
}
}
}
Is there any way to do this?
It's not possible with AWS CloudFormation.
"MyCluster": {
"Type": "AWS::ECS::Cluster"
}
The above CloudFormation template will generate an ECS cluster with a name of the form <StackName>-MyCluster-<RandomSequence>.
The stack name is provided as input when the CloudFormation template is executed. The random sequence is generated by CloudFormation and is not deterministic.
At this point, the best bet for creating a cluster with the desired naming convention is to use the AWS CLI or a small program using the AWS SDK.
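A small program of that kind could look like the sketch below, written in Python against the `create_cluster` call of the boto3 ECS client. The `<stack name>-cluster` naming scheme is just an example. The fake client stands in for `boto3.client("ecs")` so the sketch runs without AWS access, and the ARN it returns is made up.

```python
def create_named_cluster(ecs_client, stack_name, suffix="cluster"):
    """Create an ECS cluster named '<stack_name>-<suffix>' and return its ARN."""
    cluster_name = f"{stack_name}-{suffix}"
    response = ecs_client.create_cluster(clusterName=cluster_name)
    return response["cluster"]["clusterArn"]


class FakeEcsClient:
    """Stand-in for boto3.client('ecs'); returns a made-up ARN."""

    def create_cluster(self, clusterName):
        return {
            "cluster": {
                "clusterName": clusterName,
                "clusterArn": f"arn:aws:ecs:us-east-1:123456789012:cluster/{clusterName}",
            }
        }


arn = create_named_cluster(FakeEcsClient(), "my-stack")
print(arn)

# With real credentials (requires boto3):
#   import boto3
#   create_named_cluster(boto3.client("ecs"), "my-stack")
```

The returned ARN or the chosen name can then be passed to the stack as a parameter and used in the service's Cluster property.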