How to create an ETL from BigQuery to Google Storage using CDAP?

I'm setting up CDAP in my Google Cloud environment, but I'm having trouble executing the following pipeline: run a query on BigQuery and save the result as a CSV file on Google Cloud Storage.
My process was:
Install CDAP using the CDAP OSS image at Google Marketplace.
Build the following pipeline:
"artifact": {
"name": "cdap-data-pipeline",
"version": "6.0.0",
"scope": "SYSTEM"
"description": "Data Pipeline Application",
"name": "cdap_dsc_test",
"config": {
"resources": {
"memoryMB": 2048,
"virtualCores": 1
"driverResources": {
"memoryMB": 2048,
"virtualCores": 1
"connections": [
"from": "BigQuery",
"to": "Google Cloud Storage"
"comments": [],
"postActions": [],
"properties": {},
"processTimingEnabled": true,
"stageLoggingEnabled": true,
"stages": [
"name": "BigQuery",
"plugin": {
"name": "BigQueryTable",
"type": "batchsource",
"label": "BigQuery",
"artifact": {
"name": "google-cloud",
"version": "0.12.2",
"scope": "SYSTEM"
"properties": {
"project": "bi-data-science",
"serviceFilePath": "/home/ubuntu/bi-data-science-cdap-4cbf526de374.json",
"schema": "{\"type\":\"record\",\"name\":\"etlSchemaBody\",\"fields\":[{\"name\":\"destination_name\",\"type\":[\"string\",\"null\"]},{\"name\":\"destination_country\",\"type\":[\"string\",\"null\"]},{\"name\":\"timestamp\",\"type\":[\"double\",\"null\"]},{\"name\":\"desktop\",\"type\":[\"double\",\"null\"]},{\"name\":\"tablet\",\"type\":[\"double\",\"null\"]},{\"name\":\"mobile\",\"type\":[\"double\",\"null\"]}]}",
"referenceName": "test_tables",
"dataset": "google_trends",
"table": "devices"
"outputSchema": [
"name": "etlSchemaBody",
"schema": "{\"type\":\"record\",\"name\":\"etlSchemaBody\",\"fields\":[{\"name\":\"destination_name\",\"type\":[\"string\",\"null\"]},{\"name\":\"destination_country\",\"type\":[\"string\",\"null\"]},{\"name\":\"timestamp\",\"type\":[\"double\",\"null\"]},{\"name\":\"desktop\",\"type\":[\"double\",\"null\"]},{\"name\":\"tablet\",\"type\":[\"double\",\"null\"]},{\"name\":\"mobile\",\"type\":[\"double\",\"null\"]}]}"
"name": "Google Cloud Storage",
"plugin": {
"name": "GCS",
"type": "batchsink",
"label": "Google Cloud Storage",
"artifact": {
"name": "google-cloud",
"version": "0.12.2",
"scope": "SYSTEM"
"properties": {
"project": "bi-data-science",
"suffix": "yyyy-MM-dd",
"format": "json",
"serviceFilePath": "/home/ubuntu/bi-data-science-cdap-4cbf526de374.json",
"schema": "{\"type\":\"record\",\"name\":\"etlSchemaBody\",\"fields\":[{\"name\":\"destination_name\",\"type\":[\"string\",\"null\"]},{\"name\":\"destination_country\",\"type\":[\"string\",\"null\"]},{\"name\":\"timestamp\",\"type\":[\"double\",\"null\"]},{\"name\":\"desktop\",\"type\":[\"double\",\"null\"]},{\"name\":\"tablet\",\"type\":[\"double\",\"null\"]},{\"name\":\"mobile\",\"type\":[\"double\",\"null\"]}]}",
"delimiter": ",",
"referenceName": "gcs_cdap",
"path": "gs://hurb_sandbox/cdap_experiments/"
"outputSchema": [
"name": "etlSchemaBody",
"schema": "{\"type\":\"record\",\"name\":\"etlSchemaBody\",\"fields\":[{\"name\":\"destination_name\",\"type\":[\"string\",\"null\"]},{\"name\":\"destination_country\",\"type\":[\"string\",\"null\"]},{\"name\":\"timestamp\",\"type\":[\"double\",\"null\"]},{\"name\":\"desktop\",\"type\":[\"double\",\"null\"]},{\"name\":\"tablet\",\"type\":[\"double\",\"null\"]},{\"name\":\"mobile\",\"type\":[\"double\",\"null\"]}]}"
"inputSchema": [
"name": "BigQuery",
"schema": "{\"type\":\"record\",\"name\":\"etlSchemaBody\",\"fields\":[{\"name\":\"destination_name\",\"type\":[\"string\",\"null\"]},{\"name\":\"destination_country\",\"type\":[\"string\",\"null\"]},{\"name\":\"timestamp\",\"type\":[\"double\",\"null\"]},{\"name\":\"desktop\",\"type\":[\"double\",\"null\"]},{\"name\":\"tablet\",\"type\":[\"double\",\"null\"]},{\"name\":\"mobile\",\"type\":[\"double\",\"null\"]}]}"
"schedule": "0 * * * *",
"engine": "mapreduce",
"numOfRecordsPreview": 100,
"description": "Data Pipeline Application",
"maxConcurrentRuns": 1
The credential key has owner privileges and I'm able to access the query result using the "preview" option.
Pipeline result:
_SUCCESS (empty)
part-r-00000 (query result)
No CSV file was generated, and I also couldn't find a place in CDAP where I can set a name for my output file. Did I miss a configuration step?
We eventually gave up on CDAP, and we're now using Google Dataflow.

When configuring the GCS sink in the pipeline, there is a 'format' field, which you have set to JSON. You can set this to CSV to achieve the format you would like.
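
As a rough illustration of that change, the sketch below edits an exported pipeline definition in Python. It assumes the usual export layout, with stages under `config.stages` and the sink settings under `plugin.properties`; the helper name `set_sink_format` is hypothetical, and the pipeline shown is trimmed down to the sink stage only. It changes the format and nothing else, so it does not address the output file name.

```python
import copy
import json


def set_sink_format(pipeline: dict, sink_stage_name: str, new_format: str) -> dict:
    """Return a copy of the pipeline with the named stage's 'format' changed."""
    updated = copy.deepcopy(pipeline)
    for stage in updated["config"]["stages"]:
        if stage["name"] == sink_stage_name:
            stage["plugin"]["properties"]["format"] = new_format
            return updated
    raise KeyError(f"stage {sink_stage_name!r} not found")


# Trimmed-down version of the pipeline from the question (sink stage only).
pipeline = {
    "name": "cdap_dsc_test",
    "config": {
        "stages": [
            {
                "name": "Google Cloud Storage",
                "plugin": {
                    "name": "GCS",
                    "type": "batchsink",
                    "properties": {
                        "format": "json",
                        "delimiter": ",",
                        "path": "gs://hurb_sandbox/cdap_experiments/",
                    },
                },
            }
        ]
    },
}

csv_pipeline = set_sink_format(pipeline, "Google Cloud Storage", "csv")
sink_properties = csv_pipeline["config"]["stages"][0]["plugin"]["properties"]
print(json.dumps(sink_properties, indent=2))
```

The original dictionary is left untouched, so the two versions can be compared before re-importing the edited one.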


Can't see my custom extension on Azure Devops Marketplace

My issue
I created an Azure DevOps extension task, deployed it to a publisher, and shared it. But I can't find it on the Marketplace.
What I did
This is my project:
This is my task.json:
"id": "0f6ee401-2a8e-4110-b51d-c8d05086c1d0",
"name": "deployRG",
"category": "Utility",
"visibility": [
"demands": [],
"version": {
"Major": "0",
"Minor": "1",
"Patch": "0"
"instanceNameFormat": "DeployRG $(name)",
"groups": [],
"inputs": [
"name": "Name",
"type": "string",
"label": "RG name",
"defaultValue": "",
"required": true,
"execution": {
"PowerShell3": {
"target": "CreateRG.ps1"
My manifest vss-extension.json:
"manifestVersion": 1,
"id": "deployRG",
"version": "0.1.0",
"name": "Deploy RG",
"publisher": "Amethyste-MyTasks",
"public": false,
"categories": [
"Azure Pipelines"
"tags": [
"contributions": [
"id": "DeployRG",
"type": "ms.vss-distributed-task.task",
"targets": [
"properties": {
"name": "DeployRG"
"targets": [
"id": "Microsoft.VisualStudio.Services"
"files": [
"path": "DeployRG",
"packagePath": "DeployRG"
"path": "VstsTaskSdk"
What I checked
I am owner of the organization and belong to Project Collection Administrators group.
On the portal:
On the publisher portal:
What I need
I checked some tutorials on the Internet and can't see what I'm doing wrong.
Does anybody have an idea?
Thank you
Aargh, I have just found it, and it's easy.
After sharing, one should install the extension as indicated here:
I don't know why so many tutorials skip this step.
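
To make the order of operations explicit, here is a small Python sketch of the checklist described above. It is only an illustration: the function name `remaining_steps` and the `shared` / `installed` flags are hypothetical, and the only manifest field it reads is `public` from vss-extension.json.

```python
from typing import List


def remaining_steps(manifest: dict, shared: bool, installed: bool) -> List[str]:
    """List what is still missing before the task shows up in the organization."""
    steps = []
    # A private extension ("public": false) has to be shared with the organization first.
    if not manifest.get("public", False) and not shared:
        steps.append("share the extension with the organization")
    # Sharing alone is not enough: the extension must also be installed.
    if not installed:
        steps.append("install the extension in the organization")
    return steps


manifest = {"id": "deployRG", "publisher": "Amethyste-MyTasks", "public": False}
todo = remaining_steps(manifest, shared=True, installed=False)
print(todo)
```

With the situation from the question (shared but not installed), the only remaining step is the install.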

Google Cloud Data Fusion produces inconsistent output data

I am creating a Data Fusion pipeline that ingests a CSV file from an S3 bucket, applies Wrangler directives, and stores the result in a GCS bucket. The input CSV file has 18 columns, but the output CSV file has only 8 columns. I suspect this could be due to the CSV encoding, but I am not sure. What could be the reason?
Pipeline JSON
"name": "aws_fusion_v1",
"description": "Data Pipeline Application",
"artifact": {
"name": "cdap-data-pipeline",
"version": "6.1.2",
"scope": "SYSTEM"
"config": {
"resources": {
"memoryMB": 2048,
"virtualCores": 1
"driverResources": {
"memoryMB": 2048,
"virtualCores": 1
"connections": [
"from": "Amazon S3",
"to": "Wrangler"
"from": "Wrangler",
"to": "GCS2"
"from": "Argument Setter",
"to": "Amazon S3"
"comments": [],
"postActions": [],
"properties": {},
"processTimingEnabled": true,
"stageLoggingEnabled": true,
"stages": [
"name": "Amazon S3",
"plugin": {
"name": "S3",
"type": "batchsource",
"label": "Amazon S3",
"artifact": {
"name": "amazon-s3-plugins",
"version": "1.11.0",
"scope": "SYSTEM"
"properties": {
"format": "text",
"authenticationMethod": "Access Credentials",
"filenameOnly": "false",
"recursive": "false",
"ignoreNonExistingFolders": "false",
"schema": "{\"type\":\"record\",\"name\":\"etlSchemaBody\",\"fields\":[{\"name\":\"body\",\"type\":\"string\"}]}",
"referenceName": "aws_source",
"path": "${input.bucket}",
"accessID": "${input.access_id}",
"accessKey": "${input.access_key}"
"outputSchema": [
"name": "etlSchemaBody",
"schema": "{\"type\":\"record\",\"name\":\"etlSchemaBody\",\"fields\":[{\"name\":\"body\",\"type\":\"string\"}]}"
"type": "batchsource",
"label": "Amazon S3",
"icon": "icon-s3"
"name": "Wrangler",
"plugin": {
"name": "Wrangler",
"type": "transform",
"label": "Wrangler",
"artifact": {
"name": "wrangler-transform",
"version": "4.1.5",
"scope": "SYSTEM"
"properties": {
"field": "*",
"precondition": "false",
"threshold": "1",
"workspaceId": "804a2995-7c06-4ab2-b342-a9a01aa03a3d",
"schema": "${output.schema}",
"directives": "${directive}"
"outputSchema": [
"name": "etlSchemaBody",
"schema": "${output.schema}"
"inputSchema": [
"name": "Amazon S3",
"schema": "{\"type\":\"record\",\"name\":\"etlSchemaBody\",\"fields\":[{\"name\":\"body\",\"type\":\"string\"}]}"
"type": "transform",
"label": "Wrangler",
"icon": "icon-DataPreparation"
"name": "GCS2",
"plugin": {
"name": "GCS",
"type": "batchsink",
"label": "GCS2",
"artifact": {
"name": "google-cloud",
"version": "0.14.2",
"scope": "SYSTEM"
"properties": {
"project": "auto-detect",
"suffix": "yyyy-MM-dd-HH-mm",
"format": "csv",
"serviceFilePath": "auto-detect",
"location": "us",
"referenceName": "gcs_sink",
"path": "${output.path}",
"schema": "${output.schema}"
"outputSchema": [
"name": "etlSchemaBody",
"schema": "${output.schema}"
"inputSchema": [
"name": "Wrangler",
"schema": ""
"type": "batchsink",
"label": "GCS2",
"icon": "fa-plug"
"name": "Argument Setter",
"plugin": {
"name": "ArgumentSetter",
"type": "action",
"label": "Argument Setter",
"artifact": {
"name": "argument-setter-plugins",
"version": "1.1.1",
"scope": "USER"
"properties": {
"method": "GET",
"connectTimeout": "60000",
"readTimeout": "60000",
"numRetries": "0",
"followRedirects": "true",
"url": "${argfile}"
"outputSchema": [
"name": "etlSchemaBody",
"schema": ""
"type": "action",
"label": "Argument Setter",
"icon": "fa-plug"
"schedule": "0 * * * *",
"engine": "spark",
"numOfRecordsPreview": 100,
"description": "Data Pipeline Application",
"maxConcurrentRuns": 1
The missing columns in the output file were due to spaces in the column names. But I am facing another issue. In Wrangler, when I pass the directive "parse-as-csv :body ',' false", the output file is empty. But when I pass "parse-as-csv :body ',' true", the output file has all the data without the header, as expected.

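Since the missing columns turned out to be caused by spaces in the column names, one way to catch this early is to inspect the header of the input file before running the pipeline. The sketch below is a plain Python check and does not model what Wrangler itself does; the sample data and the helper names are made up for illustration, and replacing spaces with underscores is just one possible naming convention.

```python
import csv
import io
from typing import List


def columns_with_spaces(csv_text: str) -> List[str]:
    """Return the header names that contain a space."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return [name for name in header if " " in name.strip()]


def sanitize(name: str) -> str:
    """Replace inner spaces with underscores (one possible convention)."""
    return name.strip().replace(" ", "_")


sample = "id,first name,last name,age\n1,Ada,Lovelace,36\n"
problem_columns = columns_with_spaces(sample)
print(problem_columns)
print([sanitize(name) for name in problem_columns])
```
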
How do I create Virtual Machine with WinRM from an ARM Template?

I'm running into an issue when I attempt to run the 'Azure Resource Group Deploy' release task to create/update a resource group and the resources within it via an ARM Template. In particular, I need the Virtual Machine created by the ARM template to be accessible via WinRM, so that I can copy files (specifically a ZIP file containing the results of a build) to the VM in a later step.
Currently, I have the 'Template' portion of this task set up as follows: (I can't post images since I don't have reputation here yet...)
Unless I've misunderstood (which is definitely possible), the "Configure with WinRM" option should allow the release step to create a WinRM Listener on any Virtual Machines created by this step.
I currently have the following resources in the ARM Template:
"type": "Microsoft.Storage/storageAccounts",
"sku": {
"name": "Standard_LRS",
"tier": "Standard"
"kind": "Storage",
"name": "[variables('StorageAccountName')]",
"apiVersion": "2018-02-01",
"location": "[parameters('LocationPrimary')]",
"scale": null,
"tags": {},
"properties": {
"networkAcls": {
"bypass": "AzureServices",
"virtualNetworkRules": [],
"ipRules": [],
"defaultAction": "Allow"
"supportsHttpsTrafficOnly": false,
"encryption": {
"services": {
"file": {
"enabled": true
"blob": {
"enabled": true
"keySource": "Microsoft.Storage"
"dependsOn": []
"name": "[variables('NetworkInterfaceName')]",
"type": "Microsoft.Network/networkInterfaces",
"apiVersion": "2018-04-01",
"location": "[parameters('LocationPrimary')]",
"dependsOn": [
"[concat('Microsoft.Network/networkSecurityGroups/', variables('NetworkSecurityGroupName'))]",
"[concat('Microsoft.Network/virtualNetworks/', variables('VNetName'))]",
"[concat('Microsoft.Network/publicIpAddresses/', variables('PublicIPAddressName'))]"
"properties": {
"ipConfigurations": [
"name": "ipconfig1",
"properties": {
"subnet": {
"id": "[variables('subnetRef')]"
"privateIPAllocationMethod": "Dynamic",
"publicIpAddress": {
"id": "[resourceId(resourceGroup().name, 'Microsoft.Network/publicIpAddresses', variables('PublicIPAddressName'))]"
"networkSecurityGroup": {
"id": "[variables('nsgId')]"
"tags": {}
"name": "[variables('NetworkSecurityGroupName')]",
"type": "Microsoft.Network/networkSecurityGroups",
"apiVersion": "2018-08-01",
"location": "[parameters('LocationPrimary')]",
"properties": {
"securityRules": [
"name": "RDP",
"properties": {
"priority": 300,
"protocol": "TCP",
"access": "Allow",
"direction": "Inbound",
"sourceAddressPrefix": "*",
"sourcePortRange": "*",
"destinationAddressPrefix": "*",
"destinationPortRange": "3389"
"tags": {}
"name": "[variables('VNetName')]",
"type": "Microsoft.Network/virtualNetworks",
"apiVersion": "2018-08-01",
"location": "[parameters('LocationPrimary')]",
"properties": {
"addressSpace": {
"addressPrefixes": [ "" ]
"subnets": [
"name": "default",
"properties": {
"addressPrefix": ""
"tags": {}
"name": "[variables('PublicIPAddressName')]",
"type": "Microsoft.Network/publicIpAddresses",
"apiVersion": "2018-08-01",
"location": "[parameters('LocationPrimary')]",
"properties": {
"publicIpAllocationMethod": "Dynamic"
"sku": {
"name": "Basic"
"tags": {}
"name": "[variables('VMName')]",
"type": "Microsoft.Compute/virtualMachines",
"apiVersion": "2018-06-01",
"location": "[parameters('LocationPrimary')]",
"dependsOn": [
"[concat('Microsoft.Network/networkInterfaces/', variables('NetworkInterfaceName'))]",
"[concat('Microsoft.Storage/storageAccounts/', variables('StorageAccountName'))]"
"properties": {
"hardwareProfile": {
"vmSize": "Standard_A7"
"storageProfile": {
"osDisk": {
"createOption": "fromImage",
"managedDisk": {
"storageAccountType": "Standard_LRS"
"imageReference": {
"publisher": "MicrosoftWindowsDesktop",
"offer": "Windows-10",
"sku": "rs4-pro",
"version": "latest"
"networkProfile": {
"networkInterfaces": [
"id": "[resourceId('Microsoft.Network/networkInterfaces', variables('NetworkInterfaceName'))]"
"osProfile": {
"computerName": "[variables('VMName')]",
"adminUsername": "[parameters('AdminUsername')]",
"adminPassword": "[parameters('AdminPassword')]",
"windowsConfiguration": {
"enableAutomaticUpdates": true,
"provisionVmAgent": true
"licenseType": "Windows_Client",
"diagnosticsProfile": {
"bootDiagnostics": {
"enabled": true,
"storageUri": "[concat('https://', variables('StorageAccountName'), '')]"
"tags": {}
This ARM Template currently works if I do not attempt to configure the VM to have the WinRM Listener.
When I attempt to run the release, I get the following error message:
Error number: -2144108526 0x80338012
The client cannot connect to the destination specified in the request. Verify that the service on the destination is running and is accepting requests. Consult the logs and documentation for the WS-Management service running on the destination, most commonly IIS or WinRM. If the destination is the WinRM service, run the following command on the destination to analyze and configure the WinRM service: "winrm quickconfig".
In all honesty, my problem is likely a lack of understanding, as this is my first time working with VM Setup in any real capacity. Any insight and advice would be greatly appreciated.
You just need to add this to the "windowsConfiguration":
"winRM": {
  "listeners": [
    {
      "protocol": "http"
    },
    {
      "protocol": "https",
      "certificateUrl": "<URL for the certificate you got in Step 4>"
    }
  ]
}
You also need to provision certificates.
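
To show where the listeners sit relative to the settings already in the question's template, here is a minimal Python sketch that builds the whole "windowsConfiguration" object and prints it as JSON. The helper name `windows_configuration` is hypothetical, and the certificate URL is left as the placeholder from the answer; it must be replaced with the URL of a certificate you have actually provisioned.

```python
import json


def windows_configuration(certificate_url: str) -> dict:
    """Build a windowsConfiguration block with HTTP and HTTPS WinRM listeners."""
    return {
        # These two settings are taken from the template in the question.
        "enableAutomaticUpdates": True,
        "provisionVmAgent": True,
        # The WinRM listeners from the answer.
        "winRM": {
            "listeners": [
                {"protocol": "http"},
                {"protocol": "https", "certificateUrl": certificate_url},
            ]
        },
    }


config = windows_configuration("<URL for the certificate you got in Step 4>")
print(json.dumps({"windowsConfiguration": config}, indent=2))
```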

azure-data-factory waiting for source

I'm trying to copy sample data from one SQL Server DB to another.
For some reason the pipeline keeps waiting for source data.
When I look at the source dataset, no slices have been created.
The following are my JSON definitions:
Destination table:
"name": "DestTable1",
"properties": {
"structure": [
"name": "C1",
"type": "Int16"
"name": "C2",
"type": "Int16"
"name": "C3",
"type": "String"
"name": "C4",
"type": "String"
"published": false,
"type": "SqlServerTable",
"linkedServiceName": "SqlServer2",
"typeProperties": {
"tableName": "OferTarget1"
"availability": {
"frequency": "Hour",
"interval": 1
Source Table:
"name": "SourceTable1",
"properties": {
"structure": [
"name": "C1",
"type": "Int16"
"name": "C2",
"type": "Int16"
"name": "C3",
"type": "String"
"name": "C4",
"type": "String"
"published": false,
"type": "SqlServerTable",
"linkedServiceName": "SqlServer",
"typeProperties": {
"tableName": "OferSource1"
"availability": {
"frequency": "Hour",
"interval": 1
"external": true,
"policy": { }
"name": "CopyTablePipeline",
"properties": {
"description": "Copy data from source table to target table",
"activities": [
"type": "Copy",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select c1,c2,c3,c4 from OferSource1"
"sink": {
"type": "SqlSink",
"writeBatchSize": 1000,
"writeBatchTimeout": "60.00:00:00"
"inputs": [
"name": "SourceTable1"
"outputs": [
"name": "DestTable1"
"policy": {
"timeout": "01:00:00",
"concurrency": 1
"scheduler": {
"frequency": "Hour",
"interval": 1
"name": "CopySqlToSql",
"description": "Demo Copy"
"start": "2017-10-22T09:55:00Z",
"end": "2017-10-22T13:55:00Z",
"isPaused": true,
"hubName": "wer-dev-datafactoryv1_hub",
"pipelineMode": "Scheduled"
I can see the process in the monitor view, but the pipeline is stuck and waiting for the source data to arrive.
What am I doing wrong?
Scheduling can be a bit tricky initially. There are a few reasons why a time slice might be waiting on a trigger.
Activity Level
Source Properties
If the source is of type SqlServerTable, then the property external should be true. I have personally fallen into this trap when copy-pasting the JSON files, and it took me a while to understand it. More literature is available here:
Setting "external": "true" and specifying the externalData policy informs the Azure Data Factory service that this is a table that is external to the data factory and not produced by an activity in the data factory.
Concurrency (not likely in your case): an activity can also be held up if multiple slices of the activity are valid in the specific time window. For example, if your start/end dates are 01-01-2014 to 01-01-2015 for a monthly activity and the concurrency is set to 4, then 4 months will run in parallel while the rest are stuck with the message "Waiting on Concurrency".
Pipeline Level
Ensure that DateTime.Now lies between the start and end, accounting for the delay. More on how the scheduling of activities works is explained in this article
Paused: a pipeline can be paused, in which case the time slice will appear in the monitor with the message "Waiting for the pipeline to Resume". You can either edit the pipeline JSON and set isPaused to false, or resume the pipeline by right-clicking it and selecting Resume.
A good way to check when your next iteration is scheduled is to use the Monitor option.
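
The checks above can be sketched as a small Python function that inspects the dataset and pipeline JSON before deployment. This is only an illustration under stated assumptions: the function name `waiting_reasons` is hypothetical, the dictionaries are trimmed-down versions of the JSON in the question, and the current time is passed in explicitly so the result is reproducible. The concurrency case is not modeled.

```python
from datetime import datetime, timezone
from typing import List


def _parse_utc(stamp: str) -> datetime:
    """Parse timestamps of the form used in the pipeline JSON, e.g. 2017-10-22T09:55:00Z."""
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def waiting_reasons(pipeline: dict, input_datasets: List[dict], now: datetime) -> List[str]:
    """Collect likely reasons why a slice would stay in a waiting state."""
    reasons = []
    # Source level: input datasets not produced by the factory must be external.
    for dataset in input_datasets:
        if not dataset["properties"].get("external", False):
            reasons.append(f"dataset {dataset['name']} is not marked external")
    props = pipeline["properties"]
    # Pipeline level: a paused pipeline never starts its slices.
    if props.get("isPaused", False):
        reasons.append("pipeline is paused")
    # Pipeline level: the current time must fall inside the active period.
    if not (_parse_utc(props["start"]) <= now <= _parse_utc(props["end"])):
        reasons.append("current time is outside the start/end window")
    return reasons


source = {"name": "SourceTable1", "properties": {"external": True}}
pipeline = {
    "name": "CopyTablePipeline",
    "properties": {
        "start": "2017-10-22T09:55:00Z",
        "end": "2017-10-22T13:55:00Z",
        "isPaused": True,
    },
}
now = datetime(2017, 10, 22, 11, 0, tzinfo=timezone.utc)
reasons = waiting_reasons(pipeline, [source], now)
print(reasons)
```

With the values from the question, the source dataset is already external and the chosen time is inside the window, so the only reason reported is the paused pipeline.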

Azure ARM Template and PowerShell Module

I have a module published on the PowerShell Gallery and I want to deploy it with an Azure ARM Template, but I could not find out how.
Here is my template:
"resources": [
"name": "[variables('automationAccountName')]",
"type": "Microsoft.Automation/automationAccounts",
"apiVersion": "2015-10-31",
"location": "[parameters('AutomationLocation')]",
"tags": {
"displayName": "Compte Automation"
"properties": {
"sku": {
"name": "Basic",
"family": "B"
"resources": [
"name": "[variables('powerShellGalleryModuleName')]",
"type": "modules",
"apiVersion": "2015-10-31",
"location": "[parameters('AutomationLocation')]",
"properties": {
"isGlobal": false,
"sizeInBytes": 0,
"contentLink": {
"uri": "[variables('powerShellGalleryModule')]"
What should be provided for the variable powerShellGalleryModule?
I found a way to do it via the PowerShell Gallery.
This way:
"name": "[variables('powerShellGalleryModule')]",
"type": "modules",
"apiVersion": "2015-10-31",
"location": "[parameters('AutomationLocation')]",
"properties": {
"isGlobal": false,
"sizeInBytes": 0,
"contentLink": {
"uri": "[concat('', variables('powerShellGalleryModule'))]"
"dependsOn": [
"[resourceId('Microsoft.Automation/automationAccounts', variables('automationAccountName'))]"
We can import these integration modules into Azure Automation using any of the following methods:
1. Using the New-AzureRmAutomationModule cmdlet in the AzureRm.Automation module.
2. Using the Azure portal and navigating to the Assets within the Automation account.
3. Using an Azure Resource Manager (ARM) template.
We can use an ARM template to deploy our custom integration modules. Here is an example template:
"$schema": "",
"contentVersion": "1.0",
"parameters": {
"automationAccountType": {
"type": "string",
"allowedValues": [
"automationAccountName": {
"type": "string"
"moduleName": {
"type": "string"
"type": "string"
"variables": {
"templatelink": "[concat('', parameters('automationAccountType'), 'AccountTemplate.json')]"
"resources": [
"apiVersion": "2015-01-01",
"name": "nestedTemplate",
"type": "Microsoft.Resources/deployments",
"properties": {
"mode": "incremental",
"templateLink": {
"uri": "[variables('templatelink')]",
"contentVersion": "1.0"
"parameters": {
"accountName": {
"value": "[parameters('automationAccountName')]"
"accountLocation": {
"value": "[resourceGroup().Location]"
"moduleName": {
"value": "[parameters('moduleName')]"
"moduleUri": {
"value": "[parameters('moduleUri')]"
For more information about deploying a custom Azure Automation integration module using an ARM template, please refer to this link written by Ravikanth.
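
The two answers above can be sketched together in Python: one helper builds the "modules" child resource from the first answer, and another wraps plain values in the `{"value": ...}` shape that ARM template parameters use, as in the nested deployment. Both helper names are hypothetical, and the module name and package URI are placeholders: the base URL was not given in the answers, so it is not filled in here.

```python
import json


def module_resource(module_name: str, content_uri: str) -> dict:
    """Build the 'modules' child resource of an Automation account."""
    return {
        "name": module_name,
        "type": "modules",
        "apiVersion": "2015-10-31",
        "location": "[parameters('AutomationLocation')]",
        "properties": {
            "isGlobal": False,
            "sizeInBytes": 0,
            "contentLink": {"uri": content_uri},
        },
        # The module can only be imported once the account exists.
        "dependsOn": [
            "[resourceId('Microsoft.Automation/automationAccounts', "
            "variables('automationAccountName'))]"
        ],
    }


def to_arm_parameters(values: dict) -> dict:
    """Wrap each value as {"value": ...}, the shape ARM parameters take."""
    return {name: {"value": value} for name, value in values.items()}


# Placeholder values: replace them with your module name and its package URL.
resource = module_resource("MyModule", "<package URL for MyModule>")
parameters = to_arm_parameters(
    {"moduleName": "MyModule", "moduleUri": "<package URL for MyModule>"}
)
print(json.dumps(resource, indent=2))
print(json.dumps(parameters, indent=2))
```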