Apache Drill using Google Cloud Storage - google-cloud-storage

The Apache Drill features list mentions that it can query data from Google Cloud Storage, but I can't find any information on how to do that. I've got it working fine with S3, but suspect I'm missing something very simple in terms of Google Cloud Storage.
Does anyone have an example Storage Plugin configuration for Google Cloud Storage?
Thanks
M

This is quite an old question, so I imagine you either found a solution or moved on with your life, but for anyone who wants to do this without using Dataproc, here's how:
Add the JAR file from the GCP connectors to the jars/3rdparty directory (a download sketch follows these steps).
Add the following to the core-site.xml file in the conf directory (change the upper-case values such as YOUR_PROJECT_ID to your own details):
<property>
  <name>fs.gs.project.id</name>
  <value>YOUR_PROJECT_ID</value>
  <description>
    Optional. Google Cloud Project ID with access to GCS buckets.
    Required only for list buckets and create bucket operations.
  </description>
</property>
<property>
  <name>fs.gs.auth.service.account.private.key.id</name>
  <value>YOUR_PRIVATE_KEY_ID</value>
</property>
<property>
  <name>fs.gs.auth.service.account.private.key</name>
  <value>-----BEGIN PRIVATE KEY-----\nYOUR_PRIVATE_KEY\n-----END PRIVATE KEY-----\n</value>
</property>
<property>
  <name>fs.gs.auth.service.account.email</name>
  <value>YOUR_SERVICE_ACCOUNT_EMAIL</value>
  <description>
    The email address associated with the service account used for GCS
    access when fs.gs.auth.service.account.enable is true. Required
    when an authentication key specified in the configuration file (Method 1)
    or a PKCS12 certificate (Method 3) is being used.
  </description>
</property>
<property>
  <name>fs.gs.working.dir</name>
  <value>/</value>
  <description>
    The directory inside the default bucket that relative gs: URIs resolve against.
  </description>
</property>
<property>
  <name>fs.gs.implicit.dir.repair.enable</name>
  <value>true</value>
  <description>
    Whether or not to create objects for the parent directories of objects
    with / in their path, e.g. creating gs://bucket/foo/ upon deleting or
    renaming gs://bucket/foo/bar.
  </description>
</property>
<property>
  <name>fs.gs.glob.flatlist.enable</name>
  <value>true</value>
  <description>
    Whether or not to prepopulate potential glob matches in a single list
    request to minimize calls to GCS in nested glob cases.
  </description>
</property>
<property>
  <name>fs.gs.copy.with.rewrite.enable</name>
  <value>true</value>
  <description>
    Whether or not to perform copy operations using Rewrite requests. This allows
    copying files between different locations and storage classes.
  </description>
</property>
Start Apache Drill.
Add a custom storage plugin to Drill.
You're good to go.
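For the JAR and startup steps above, here is a minimal shell sketch; the connector URL, version, and Drill install path are assumptions you should adjust for your environment:
#!/bin/sh
# Assumed Drill install location; change to match your environment.
DRILL_HOME=/opt/apache-drill

# Drop the GCS connector JAR into Drill's 3rdparty jars directory.
# Google publishes the Hadoop 2 connector at this location.
wget https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop2-latest.jar \
  -O ${DRILL_HOME}/jars/3rdparty/gcs-connector-hadoop2-latest.jar

# The core-site.xml shown above goes in ${DRILL_HOME}/conf.

# Start (or restart) the drillbit so it picks up the new JAR and config.
${DRILL_HOME}/bin/drillbit.sh restart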
The solution is from here, where I go into more detail about what we do around data exploration with Apache Drill.

I managed to query parquet data in Google Cloud Storage (GCS) using Apache Drill (1.6.0) running on a Google Dataproc cluster.
In order to set that up, I took the following steps:
Install Drill and make the GCS connector accessible (this can be used as an initialization action for Dataproc; just note it wasn't thoroughly tested and relies on a local ZooKeeper instance):
#!/bin/sh
set -x -e
BASEDIR="/opt/apache-drill-1.6.0"
mkdir -p ${BASEDIR}
cd ${BASEDIR}
wget http://apache.mesi.com.ar/drill/drill-1.6.0/apache-drill-1.6.0.tar.gz
tar -xzvf apache-drill-1.6.0.tar.gz
mv apache-drill-1.6.0/* .
rm -rf apache-drill-1.6.0 apache-drill-1.6.0.tar.gz
ln -s /usr/lib/hadoop/lib/gcs-connector-1.4.5-hadoop2.jar ${BASEDIR}/jars/gcs-connector-1.4.5-hadoop2.jar
mv ${BASEDIR}/conf/core-site.xml ${BASEDIR}/conf/core-site.xml.old
ln -s /etc/hadoop/conf/core-site.xml ${BASEDIR}/conf/core-site.xml
drillbit.sh start
set +x +e
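If you want to run this as a Dataproc initialization action, the general shape is to stage the script in a bucket and reference it at cluster creation time; the bucket, cluster, and region names below are placeholders:
# Stage the init script, then reference it when creating the cluster.
gsutil cp drill-init.sh gs://YOUR_BUCKET/drill-init.sh
gcloud dataproc clusters create drill-cluster \
  --region us-central1 \
  --initialization-actions gs://YOUR_BUCKET/drill-init.sh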
Connect to the Drill console, create a new storage plugin (call it, say, gcs), and use the following configuration (note that I copied most of it from the S3 config and made minor changes):
{
  "type": "file",
  "enabled": true,
  "connection": "gs://myBucketName",
  "config": null,
  "workspaces": {
    "root": {
      "location": "/",
      "writable": false,
      "defaultInputFormat": null
    },
    "tmp": {
      "location": "/tmp",
      "writable": true,
      "defaultInputFormat": null
    }
  },
  "formats": {
    "psv": {
      "type": "text",
      "extensions": ["tbl"],
      "delimiter": "|"
    },
    "csv": {
      "type": "text",
      "extensions": ["csv"],
      "delimiter": ","
    },
    "tsv": {
      "type": "text",
      "extensions": ["tsv"],
      "delimiter": "\t"
    },
    "parquet": {
      "type": "parquet"
    },
    "json": {
      "type": "json",
      "extensions": ["json"]
    },
    "avro": {
      "type": "avro"
    },
    "sequencefile": {
      "type": "sequencefile",
      "extensions": ["seq"]
    },
    "csvh": {
      "type": "text",
      "extensions": ["csvh"],
      "extractHeader": true,
      "delimiter": ","
    }
  }
}
Query using the following syntax (note the backticks):
select * from gs.`root`.`path/to/data/*` limit 10;

Related

Dynamically create Step Function state machines locally from CFN template

Goal
I am trying to dynamically create state machines locally from generated CloudFormation (CFN) templates. I need to be able to do so without deploying to an AWS account or creating the definition strings manually.
Question
How do I "build" a CFN template into a definition string that can be used locally?
Is it possible to achieve my original goal? If not, how are others successfully testing SFN locally?
Setup
I am using Cloud Development Kit (CDK) to write my state machine definitions and generating CFN json templates using cdk synth. I have followed the instructions from AWS here to create a local Docker container to host Step Functions (SFN). I am able to use the AWS CLI to create, run, etc. state machines successfully on my local SFN Docker instance. I am also hosting a DynamoDB Docker instance and using sam local start-lambda to host my lambdas. This all works as expected.
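For anyone reproducing this setup, a rough sketch of how the local pieces are typically started (the image names and ports are the documented defaults, but treat the exact invocations as assumptions for your environment):
# Step Functions Local listens on 8083 by default; DynamoDB Local on 8000.
docker run -d -p 8083:8083 amazon/aws-stepfunctions-local
docker run -d -p 8000:8000 amazon/dynamodb-local
# Serve the Lambdas referenced by the state machines locally.
sam local start-lambda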
To make local testing easier, I have written a series of bash scripts to dynamically parse the CFN templates and create JSON input files by calling the AWS CLI. This works successfully when writing simple state machines with no references (no lambdas, resources from other stacks, etc.). The issue arises when I want to create and test a more complicated state machine. A state machine DefinitionString in my generated CFN templates looks something like:
{'Fn::Join': ['', ['{
"StartAt": "Step1",
"States": {
{
"StartAt": "Step1",
"States": {
"Step1": {
"Next": "Step2",
"Retry": [
{
"ErrorEquals": [
"Lambda.ServiceException",
"Lambda.AWSLambdaException",
"Lambda.SdkClientException"
],
"IntervalSeconds": 2,
"MaxAttempts": 6,
"BackoffRate": 2
}
],
"Type": "Task",
"Resource": "arn:', {'Ref': 'AWS::Partition'}, ':states:::lambda:invoke",
"Parameters": {
"FunctionName": "', {'Fn::ImportValue': 'OtherStackE9E150CFArn77689D69'}, '",
"Payload.$": "$"
}
},
"Step2": {
"Next": "Step3",
"Retry": [
{
"ErrorEquals": [
"Lambda.ServiceException",
"Lambda.AWSLambdaException",
"Lambda.SdkClientException"
],
"IntervalSeconds": 2,
"MaxAttempts": 6,
"BackoffRate": 2
}
],
"Type": "Task",
"Resource": "arn:', {'Ref': 'AWS::Partition'}, ':states:::lambda:invoke",
"Parameters": {
"FunctionName": "', {'Fn::ImportValue': 'OtherStackE9E150CFArn77689D69'}, '",
"Payload.$": "$"
}
}
}
}
]
},
"TimeoutSeconds": 10800
}']]}
Problem
The AWS CLI does not accept the definition as a JSON object, CFN intrinsic functions like 'Fn::Join' are not supported, and references such as {'Ref': 'AWS::Partition'} are not allowed in the definition string.
There is not going to be any magic here to get this done. The CDK renders CloudFormation and that CloudFormation is not truly ASL, as it contains references to other resources, as you pointed out.
One direction you could go would be to deploy the SFN to a sandbox stack, allow CFN to dereference all the values and produce the SFN ASL in the service, and then re-extract that ASL for local testing.
It's hacky, but I don't know any other way to do it, unless you want to start writing parsers that turn all those JSON intrinsics (like Fn::Join) into static strings.
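A rough sketch of that round trip with the AWS CLI; the ARNs, names, and local endpoint below are placeholders, and Step Functions Local does not validate the role ARN:
# Pull the fully dereferenced ASL from the deployed sandbox state machine.
aws stepfunctions describe-state-machine \
  --state-machine-arn arn:aws:states:us-east-1:123456789012:stateMachine:SandboxMachine \
  --query definition --output text > sandbox-asl.json

# Recreate it on the local Step Functions endpoint for testing.
aws stepfunctions create-state-machine \
  --endpoint-url http://localhost:8083 \
  --name LocalMachine \
  --definition file://sandbox-asl.json \
  --role-arn arn:aws:iam::123456789012:role/DummyRole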

Unable to consume OData in my SAPUI5 app - initial loading of metadata failed

I am facing an issue while trying to consume OData and binding (aggregation binding) it with list item in my demo app.
The webpage shows "No Data". I referred to other threads, but none were similar to my issue. I even posted the thread in the SAP Q&A forum, with no help.
ODATA SERVICE METADATA:
https://sapes5.sapdevcenter.com/sap/opu/odata/IWBEP/GWSAMPLE_BASIC/$metadata
The destination ES5 has also been configured in the back end (SAP HANA Cloud Platform cockpit). I tried with no authentication and with basic authentication; still no data is displayed.
Connection testing was successful with message "Connection to "ES5" established. Response returned: 307: Temporary Redirect"
Error:
[ODataMetadata] initial loading of metadata failed
Error: HTTP request failed
Code:
VIEW:
<IconTabFilter text="Data Binding" key="db">
  <content>
    <List headerText="Products" items="{/ProductSet}">
      <items>
        <ObjectListItem title="{Name}" number="{Price}" intro="{ProductID}"/>
      </items>
    </List>
  </content>
</IconTabFilter>
Manifest.json:
"sap.app": {
......
"dataSources": {
"ES5": {
"uri": "/destinations/ES5/sap/opu/odata/IWBEP/GWSAMPLE_BASIC/",
"type": "OData",
"settings": {
"odataVersion": "2.0"
}
}
}
},
....
"sap.ui5": {
"models": {
.......other models
"" : {
"dataSource": "ES5"
}
}
}
neo-app.json:
{
  "path": "/destinations/ES5",
  "target": {
    "type": "destination",
    "name": "ES5"
  },
  "description": "ES5 Demo Service"
}
The issue with my OData consumption was that the ES5 destination had been created in the Cloud Foundry trial. Since the app has to be developed in SAP Web IDE, which is available only in the Neo environment, you have to create a Neo trial account from the SAP Cloud Platform Cockpit and create the same ES5 destination there. After doing that, I am able to consume the product list from the OData service.
From: https://answers.sap.com/questions/13075637/please-help-no-data-is-showing-in-webpage-es5-dest.html

How to create Data Factory's integration runtime in an ARM template

I am trying to deploy a data factory using an ARM template. It is easy to use the exported template to create a deployment pipeline.
However, as the data factory needs to access an on-premises database server, I need an integration runtime. The problem is: how can I include the runtime in the ARM template?
The template looks like this and we can see that it is trying to include the runtime:
{
  "name": "[concat(parameters('factoryName'), '/OnPremisesSqlServer')]",
  "type": "Microsoft.DataFactory/factories/linkedServices",
  "apiVersion": "2018-06-01",
  "properties": {
    "annotations": [],
    "type": "SqlServer",
    "typeProperties": {
      "connectionString": "[parameters('OnPremisesSqlServer_connectionString')]"
    },
    "connectVia": {
      "referenceName": "OnPremisesSqlServer",
      "type": "IntegrationRuntimeReference"
    }
  },
  "dependsOn": [
    "[concat(variables('factoryId'), '/integrationRuntimes/OnPremisesSqlServer')]"
  ]
},
{
  "name": "[concat(parameters('factoryName'), '/OnPremisesSqlServer')]",
  "type": "Microsoft.DataFactory/factories/integrationRuntimes",
  "apiVersion": "2018-06-01",
  "properties": {
    "type": "SelfHosted",
    "typeProperties": {}
  },
  "dependsOn": []
}
Running this template gives me this error:
\"connectVia\": {\r\n \"referenceName\": \"OnPremisesSqlServer\",\r\n \"type\": \"IntegrationRuntimeReference\"\r\n }\r\n }\r\n} and error is: Failed to encrypted linked service credentials on self-hosted IR 'OnPremisesSqlServer', reason is: NotFound, error message is: No online instance..
The problem is that I will need to type a key into the integration runtime's UI so that it can be registered in Azure, but I can only get that key from my data factory instance's UI. So the above ARM template deployment will always fail at least once. I am wondering if there is a way to create the runtime independently?
It seems that you already know how to create a self-hosted IR in the ADF ARM template.
{
  "name": "[concat(parameters('dataFactoryName'), '/integrationRuntime1')]",
  "type": "Microsoft.DataFactory/factories/integrationRuntimes",
  "apiVersion": "2018-06-01",
  "properties": {
    "additionalProperties": {},
    "description": "jaygongIR1",
    "type": "SelfHosted"
  }
}
The only concern is that the Windows IR tool needs to be configured with an authentication key to access the ADF self-hosted IR node, so the IR will show an Unavailable status right after it is created. This flow makes sense, I think: the authentication key has to be created first, and then you can use it to configure the on-premises tool. You can't implement everything in one step because these operations happen on both the Azure side and the on-premises side.
Based on the self-hosted IR tool documentation, the registration steps can't be implemented with PowerShell code. So the steps that can be automated in this flow are creating the IR and getting the authentication key, not registering the node in the tool.
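To make the second half less manual, the authentication key can at least be fetched without opening the portal UI, for example via the Data Factory listAuthKeys REST API. This is only a sketch; the subscription, resource group, factory, and IR names below are placeholders:
# Placeholders: substitute your subscription, resource group, factory, and IR names.
SUB=00000000-0000-0000-0000-000000000000
RG=my-resource-group
ADF=my-data-factory
IR=OnPremisesSqlServer

# After the ARM deployment creates the (still offline) self-hosted IR, fetch its
# authentication keys, then paste one into the self-hosted IR tool on the on-premises machine.
az rest --method post \
  --url "https://management.azure.com/subscriptions/$SUB/resourceGroups/$RG/providers/Microsoft.DataFactory/factories/$ADF/integrationRuntimes/$IR/listAuthKeys?api-version=2018-06-01"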

Can I update Windows ClientIdentities after cluster creation?

I currently have something like this in my clusterConfig.json file.
"ClientIdentities": [
{
"Identity": "{My Domain}\\{My Security Group}",
"IsAdmin": true
}
]
My questions are:
My cluster is stood up and running. Can I add a second security group to this cluster while it is running? I've searched through the PowerShell commands and didn't see one that matched, but I may have missed it.
If I can't do this while the cluster is running, do I need to delete the cluster and recreate it? If I do need to recreate it, I'm zeroing in on the word ClientIdentities; I'm assuming I can have multiple identities and my config should look something like:
ClientIdentities": [{
"Identity": "{My Domain}\\{My Security Group}",
"IsAdmin": true
},
{
"Identity": "{My Domain}\\{My Second Security Group}",
"IsAdmin": false
}
]
Thanks,
Greg
Yes, it is possible to update ClientIdentities once the cluster is up using a configuration upgrade.
Create a new JSON file with the added client identities.
Modify the clusterConfigurationVersion in the JSON config.
Run Start-ServiceFabricClusterConfigurationUpgrade -ClusterConfigPath "Path to new JSON"

Connect Apache Drill to Google Cloud

How do I connect Google Cloud Storage buckets to Apache Drill? I want to connect Apache Drill to Google Cloud Storage buckets and fetch data from the files stored in those buckets.
I can specify an access ID and key in core-site.xml in order to connect to AWS. Is there a similar way to connect Drill to Google Cloud?
I found the answer here useful: Apache Drill using Google Cloud Storage
On Google Cloud Dataproc you can set it up with an initialization action as in the answer above. There's also a complete one you can use, which creates a GCS plugin for you, pointed by default at the ephemeral bucket created with your Dataproc cluster.
If you're not using Cloud Dataproc you can do the following on your already-installed Drill cluster.
Get the GCS connector from somewhere and put it in Drill's 3rdparty jars directory. GCS configuration is detailed at the link above. On Dataproc the connector JAR is in /usr/lib/hadoop, so the above initialization action does this:
# Link GCS connector to drill jars
ln -sf /usr/lib/hadoop/lib/gcs-connector-1.6.0-hadoop2.jar $DRILL_HOME/jars/3rdparty
You need to also configure core-site.xml and make it available to Drill. This is necessary so that Drill knows how to connect to GCS.
# Symlink core-site.xml to $DRILL_HOME/conf
ln -sf /etc/hadoop/conf/core-site.xml $DRILL_HOME/conf
Start or restart your drillbits as needed.
Once Drill is up, you can create a new plugin that points to a GCS bucket. First write out a JSON file containing the plugin configuration:
export DATAPROC_BUCKET=gs://your-bucket-name
cat > /tmp/gcs_plugin.json <<EOF
{
  "config": {
    "connection": "$DATAPROC_BUCKET",
    "enabled": true,
    "formats": {
      "avro": {
        "type": "avro"
      },
      "csv": {
        "delimiter": ",",
        "extensions": ["csv"],
        "type": "text"
      },
      "csvh": {
        "delimiter": ",",
        "extensions": ["csvh"],
        "extractHeader": true,
        "type": "text"
      },
      "json": {
        "extensions": ["json"],
        "type": "json"
      },
      "parquet": {
        "type": "parquet"
      },
      "psv": {
        "delimiter": "|",
        "extensions": ["tbl"],
        "type": "text"
      },
      "sequencefile": {
        "extensions": ["seq"],
        "type": "sequencefile"
      },
      "tsv": {
        "delimiter": "\t",
        "extensions": ["tsv"],
        "type": "text"
      }
    },
    "type": "file",
    "workspaces": {
      "root": {
        "defaultInputFormat": null,
        "location": "/",
        "writable": false
      },
      "tmp": {
        "defaultInputFormat": null,
        "location": "/tmp",
        "writable": true
      }
    }
  },
  "name": "gs"
}
EOF
Then POST the new plugin to any drillbit (I'm assuming you're running this on one of the drillbits):
curl -d#/tmp/gcs_plugin.json \
-H "Content-Type: application/json" \
-X POST http://localhost:8047/storage/gs.json
I believe you need to repeat this procedure changing the name ("gs" above) if you want Drill to query multiple buckets.
Then you can launch sqlline and check that you can query files in that bucket.
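For example, a quick check could look like the following; the data path in the query is a placeholder, and the REST query endpoint is used here as an alternative to sqlline:
# Confirm the plugin was registered.
curl http://localhost:8047/storage/gs.json

# Run a test query through Drill's REST API (or run the same statement in sqlline).
curl -X POST -H "Content-Type: application/json" \
  -d '{"queryType": "SQL", "query": "SELECT * FROM gs.root.`path/to/data` LIMIT 10"}' \
  http://localhost:8047/query.json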
I know this question is quite old, but still, here's the way to do this without using Dataproc.
Add the JAR file from the GCP connectors to the jars/3rdparty directory.
Add the following to the core-site.xml file in the conf directory (change the upper-case values such as YOUR_PROJECT_ID to your own details):
<property>
  <name>fs.gs.project.id</name>
  <value>YOUR_PROJECT_ID</value>
  <description>
    Optional. Google Cloud Project ID with access to GCS buckets.
    Required only for list buckets and create bucket operations.
  </description>
</property>
<property>
  <name>fs.gs.auth.service.account.private.key.id</name>
  <value>YOUR_PRIVATE_KEY_ID</value>
</property>
<property>
  <name>fs.gs.auth.service.account.private.key</name>
  <value>-----BEGIN PRIVATE KEY-----\nYOUR_PRIVATE_KEY\n-----END PRIVATE KEY-----\n</value>
</property>
<property>
  <name>fs.gs.auth.service.account.email</name>
  <value>YOUR_SERVICE_ACCOUNT_EMAIL</value>
  <description>
    The email address associated with the service account used for GCS
    access when fs.gs.auth.service.account.enable is true. Required
    when an authentication key specified in the configuration file (Method 1)
    or a PKCS12 certificate (Method 3) is being used.
  </description>
</property>
<property>
  <name>fs.gs.working.dir</name>
  <value>/</value>
  <description>
    The directory inside the default bucket that relative gs: URIs resolve against.
  </description>
</property>
<property>
  <name>fs.gs.implicit.dir.repair.enable</name>
  <value>true</value>
  <description>
    Whether or not to create objects for the parent directories of objects
    with / in their path, e.g. creating gs://bucket/foo/ upon deleting or
    renaming gs://bucket/foo/bar.
  </description>
</property>
<property>
  <name>fs.gs.glob.flatlist.enable</name>
  <value>true</value>
  <description>
    Whether or not to prepopulate potential glob matches in a single list
    request to minimize calls to GCS in nested glob cases.
  </description>
</property>
<property>
  <name>fs.gs.copy.with.rewrite.enable</name>
  <value>true</value>
  <description>
    Whether or not to perform copy operations using Rewrite requests. This allows
    copying files between different locations and storage classes.
  </description>
</property>
Start Apache Drill.
Add a custom storage plugin to Drill.
You're good to go.
The solution is from here, where I go into more detail about what we do around data exploration with Apache Drill.