Connect Apache Drill to Google Cloud - google-cloud-storage

How do I connect Google Cloud Storage buckets to Apache Drill? I want to connect Apache Drill to Google Cloud Storage buckets and fetch data from the files stored in those buckets.
I can specify the access ID and key in core-site.xml in order to connect to AWS. Is there a similar way to connect Drill to Google Cloud Storage?

I found the answer here useful: Apache Drill using Google Cloud Storage
On Google Cloud Dataproc you can set it up with an initialization action as in the answer above. There is also a complete initialization action you can use, which creates a GCS plugin for you, pointed by default at the ephemeral bucket created with your Dataproc cluster.
If you're not using Cloud Dataproc, you can do the following on your already-installed Drill cluster.
Get the GCS connector jar and put it in Drill's jars/3rdparty directory. GCS configuration is detailed at the link above. On Dataproc the connector jar is already under /usr/lib/hadoop, so the initialization action above does this:
# Link GCS connector to drill jars
ln -sf /usr/lib/hadoop/lib/gcs-connector-1.6.0-hadoop2.jar $DRILL_HOME/jars/3rdparty
You also need to configure core-site.xml and make it available to Drill, so that Drill knows how to connect to GCS.
# Symlink core-site.xml to $DRILL_HOME/conf
ln -sf /etc/hadoop/conf/core-site.xml $DRILL_HOME/conf
Start or restart your drillbits as needed.
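For example (a minimal sketch, assuming $DRILL_HOME points at your Drill installation as in the snippets above):
# Restart the drillbit so it picks up the new jar and core-site.xml
$DRILL_HOME/bin/drillbit.sh restart
# Optionally confirm it came back up
$DRILL_HOME/bin/drillbit.sh status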
Once Drill is up, you can create a new plugin that points to a GCS bucket. First write out a JSON file containing the plugin configuration:
export DATAPROC_BUCKET=gs://your-bucket-name
cat > /tmp/gcs_plugin.json <<EOF
{
  "config": {
    "connection": "$DATAPROC_BUCKET",
    "enabled": true,
    "formats": {
      "avro": {
        "type": "avro"
      },
      "csv": {
        "delimiter": ",",
        "extensions": [
          "csv"
        ],
        "type": "text"
      },
      "csvh": {
        "delimiter": ",",
        "extensions": [
          "csvh"
        ],
        "extractHeader": true,
        "type": "text"
      },
      "json": {
        "extensions": [
          "json"
        ],
        "type": "json"
      },
      "parquet": {
        "type": "parquet"
      },
      "psv": {
        "delimiter": "|",
        "extensions": [
          "tbl"
        ],
        "type": "text"
      },
      "sequencefile": {
        "extensions": [
          "seq"
        ],
        "type": "sequencefile"
      },
      "tsv": {
        "delimiter": "\t",
        "extensions": [
          "tsv"
        ],
        "type": "text"
      }
    },
    "type": "file",
    "workspaces": {
      "root": {
        "defaultInputFormat": null,
        "location": "/",
        "writable": false
      },
      "tmp": {
        "defaultInputFormat": null,
        "location": "/tmp",
        "writable": true
      }
    }
  },
  "name": "gs"
}
EOF
Then POST the new plugin to any drillbit (I'm assuming you're running this on one of the drillbits):
curl -d @/tmp/gcs_plugin.json \
  -H "Content-Type: application/json" \
  -X POST http://localhost:8047/storage/gs.json
I believe you need to repeat this procedure, changing the name ("gs" above), if you want Drill to query multiple buckets; for example:
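A second plugin for another bucket could be registered the same way; the bucket and plugin names below are hypothetical, and the format list is trimmed down for brevity:
# Hypothetical second bucket and plugin name; adjust both to your own
cat > /tmp/gcs_plugin_2.json <<EOF
{
  "name": "gs2",
  "config": {
    "type": "file",
    "enabled": true,
    "connection": "gs://your-second-bucket-name",
    "formats": {
      "parquet": { "type": "parquet" },
      "csv": { "type": "text", "extensions": ["csv"], "delimiter": "," }
    },
    "workspaces": {
      "root": { "location": "/", "writable": false, "defaultInputFormat": null }
    }
  }
}
EOF
curl -d @/tmp/gcs_plugin_2.json \
  -H "Content-Type: application/json" \
  -X POST http://localhost:8047/storage/gs2.json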
Then you can launch sqlline and check that you can query files in that bucket.
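A minimal smoke test might look like this (the ZooKeeper address and the file path are placeholders):
# Open sqlline against a drillbit
$DRILL_HOME/bin/sqlline -u jdbc:drill:zk=localhost:2181
# then, at the sqlline prompt, run something like:
#   SELECT * FROM gs.`root`.`path/to/some/file.csv` LIMIT 10;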

I know this question is quite old, but still, here's the way to do this without using Dataproc.
Add the GCS connector JAR file from the GCP connectors to Drill's jars/3rdparty directory.
Add the following to the core-site.xml file in Drill's conf directory (change the upper-case values such as YOUR_PROJECT_ID to your own details):
<property>
  <name>fs.gs.project.id</name>
  <value>YOUR_PROJECT_ID</value>
  <description>
    Optional. Google Cloud Project ID with access to GCS buckets.
    Required only for list buckets and create bucket operations.
  </description>
</property>
<property>
  <name>fs.gs.auth.service.account.private.key.id</name>
  <value>YOUR_PRIVATE_KEY_ID</value>
</property>
<property>
  <name>fs.gs.auth.service.account.private.key</name>
  <value>-----BEGIN PRIVATE KEY-----\nYOUR_PRIVATE_KEY\n-----END PRIVATE KEY-----\n</value>
</property>
<property>
  <name>fs.gs.auth.service.account.email</name>
  <value>YOUR_SERVICE_ACCOUNT_EMAIL</value>
  <description>
    The email address associated with the service account used for GCS
    access when fs.gs.auth.service.account.enable is true. Required
    when an authentication key specified in the configuration file (Method 1)
    or a PKCS12 certificate (Method 3) is being used.
  </description>
</property>
<property>
  <name>fs.gs.working.dir</name>
  <value>/</value>
  <description>
    The directory that relative gs: URIs resolve to inside the default bucket.
  </description>
</property>
<property>
  <name>fs.gs.implicit.dir.repair.enable</name>
  <value>true</value>
  <description>
    Whether or not to create objects for the parent directories of objects
    with / in their path, e.g. creating gs://bucket/foo/ upon deleting or
    renaming gs://bucket/foo/bar.
  </description>
</property>
<property>
  <name>fs.gs.glob.flatlist.enable</name>
  <value>true</value>
  <description>
    Whether or not to prepopulate potential glob matches in a single list
    request to minimize calls to GCS in nested glob cases.
  </description>
</property>
<property>
  <name>fs.gs.copy.with.rewrite.enable</name>
  <value>true</value>
  <description>
    Whether or not to perform the copy operation using Rewrite requests,
    which allows copying files between different locations and storage classes.
  </description>
</property>
Start Apache Drill.
Add a custom storage plugin to Drill (see the sketch after these steps).
You're good to go.
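As a sketch of those last two steps: start a drillbit with drillbit.sh and create the plugin either through the Drill Web UI at http://localhost:8047 under Storage, or over the REST API as shown in the first answer above. Either way, a quick way to confirm the plugin is registered:
# List the storage plugins the drillbit currently knows about
curl http://localhost:8047/storage.json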
The solution is from here, where I detail some more about what we do around data exploration with Apache Drill.

Related

Dynamically create Step Function state machines locally from CFN template

Goal
I am trying to dynamically create state machines locally from generated CloudFormation (CFN) templates. I need to be able to do so without deploying to an AWS account or creating the definition strings manually.
Question
How do I "build" a CFN template into a definition string that can be used locally?
Is it possible to achieve my original goal? If not, how are others successfully testing SFN locally?
Setup
I am using Cloud Development Kit (CDK) to write my state machine definitions and generating CFN json templates using cdk synth. I have followed the instructions from AWS here to create a local Docker container to host Step Functions (SFN). I am able to use the AWS CLI to create, run, etc. state machines successfully on my local SFN Docker instance. I am also hosting a DynamoDB Docker instance and using sam local start-lambda to host my lambdas. This all works as expected.
To make local testing easier, I have written a series of bash scripts to dynamically parse the CFN templates and create json input files by calling the AWS CLI. This works successfully when writing simple state machines with no references (no lambdas, resources from other stacks, etc.). The issue arises when I want to create and test a more complicated state machine. A state machine DefinitionString in my generated CFN templates looks something like:
{'Fn::Join': ['', ['{
  "StartAt": "Step1",
  "States": {
    "Step1": {
      "Next": "Step2",
      "Retry": [
        {
          "ErrorEquals": [
            "Lambda.ServiceException",
            "Lambda.AWSLambdaException",
            "Lambda.SdkClientException"
          ],
          "IntervalSeconds": 2,
          "MaxAttempts": 6,
          "BackoffRate": 2
        }
      ],
      "Type": "Task",
      "Resource": "arn:', {'Ref': 'AWS::Partition'}, ':states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "', {'Fn::ImportValue': 'OtherStackE9E150CFArn77689D69'}, '",
        "Payload.$": "$"
      }
    },
    "Step2": {
      "Next": "Step3",
      "Retry": [
        {
          "ErrorEquals": [
            "Lambda.ServiceException",
            "Lambda.AWSLambdaException",
            "Lambda.SdkClientException"
          ],
          "IntervalSeconds": 2,
          "MaxAttempts": 6,
          "BackoffRate": 2
        }
      ],
      "Type": "Task",
      "Resource": "arn:', {'Ref': 'AWS::Partition'}, ':states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "', {'Fn::ImportValue': 'OtherStackE9E150CFArn77689D69'}, '",
        "Payload.$": "$"
      }
    }
  },
  "TimeoutSeconds": 10800
}']]}
Problem
The AWS CLI does not accept such a JSON object as a definition string: CFN intrinsic functions like 'Fn::Join' are not supported, and references such as {'Ref': 'AWS::Partition'} are not allowed in the definition string.
There is not going to be any magic here to get this done. The CDK renders CloudFormation and that CloudFormation is not truly ASL, as it contains references to other resources, as you pointed out.
One direction you could go would be to deploy the SFN to a sandbox stack, allow CFN to dereference all the values and produce the SFN ASL in the service, and then re-extract that ASL for local testing.
It's hacky, but I don't know any other way to do it, unless you want to start writing parsers that turn all those CFN intrinsics (like Fn::Join) into static strings.
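As a rough sketch of that direction (the stack name, state machine name, ARNs, and region are placeholders, and the local endpoint assumes the default port of the Step Functions Local Docker image):
# 1. Deploy to a sandbox account/stack so CloudFormation resolves the intrinsics
cdk deploy SandboxStack
# 2. Pull the resolved ASL definition back out of the deployed state machine
aws stepfunctions describe-state-machine \
  --state-machine-arn arn:aws:states:us-east-1:111111111111:stateMachine:MyMachine \
  --query 'definition' --output text > my_machine.asl.json
# 3. Register that definition with the local Step Functions container
aws stepfunctions create-state-machine \
  --endpoint-url http://localhost:8083 \
  --name MyMachine \
  --role-arn arn:aws:iam::012345678901:role/DummyRole \
  --definition file://my_machine.asl.json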

Unable to consume OData in my SAPUI5 app - initial loading of metadata failed

I am facing an issue while trying to consume an OData service and bind it (aggregation binding) to a list item in my demo app.
The webpage is showing "No Data". I referred to other threads, but none were similar to my issue. I even posted the thread in the SAP Q&A forum, with no help.
ODATA SERVICE METADATA:
https://sapes5.sapdevcenter.com/sap/opu/odata/IWBEP/GWSAMPLE_BASIC/$metadata
The destination ES5 has also been configured in the backend (SAP Cloud Platform cockpit). I tried both no authentication and basic authentication; still no data is displayed.
Connection testing was successful with message "Connection to "ES5" established. Response returned: 307: Temporary Redirect"
Error:
[ODataMetadata] initial loading of metadata failed
Error: HTTP request failed
Code:
VIEW:
<IconTabFilter text="Data Binding" key="db">
  <content>
    <List headerText="Products" items="{/ProductSet}">
      <items>
        <ObjectListItem title="{Name}" number="{Price}" intro="{ProductID}"/>
      </items>
    </List>
  </content>
</IconTabFilter>
manifest.json:
"sap.app": {
  ......
  "dataSources": {
    "ES5": {
      "uri": "/destinations/ES5/sap/opu/odata/IWBEP/GWSAMPLE_BASIC/",
      "type": "OData",
      "settings": {
        "odataVersion": "2.0"
      }
    }
  }
},
....
"sap.ui5": {
  "models": {
    .......other models
    "": {
      "dataSource": "ES5"
    }
  }
}
neo-app.json:
{
  "path": "/destinations/ES5",
  "target": {
    "type": "destination",
    "name": "ES5"
  },
  "description": "ES5 Demo Service"
}
The issue with my OData consumption was that the ES5 destination had been created in the Cloud Foundry trial. Since the app is developed in SAP Web IDE, which is available only in the Neo environment, you have to create a Neo trial account from the SAP Cloud Platform Cockpit and create the same ES5 destination there. Now I am able to consume the product list from the OData service.
From: https://answers.sap.com/questions/13075637/please-help-no-data-is-showing-in-webpage-es5-dest.html
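Independently of the destination configuration, it can help to confirm that the ES5 service itself is reachable with your credentials; a rough check from the command line (replace the user and password; -L follows the 307 redirect mentioned above):
# Fetch the service metadata directly, bypassing the destination/proxy
curl -L -u ES5_USER:ES5_PASSWORD \
  "https://sapes5.sapdevcenter.com/sap/opu/odata/IWBEP/GWSAMPLE_BASIC/\$metadata"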

Consuming OData from Eclipse

I've been practicing SAPUI5 with the documentation, but it uses SAP Web IDE to consume OData services. However, because my company won't expose their server to the cloud, I can't use SAP Web IDE, so I need to use Eclipse. I need a step-by-step tutorial (for dummies) for consuming OData with SAPUI5 from Eclipse. I already know how to create OData services, but not how to consume them from Eclipse.
For now I use the Northwind OData service, but with SAP I'll need credentials and other things.
"dataSources": {
  "invoiceRemote": {
    "uri": "https://services.odata.org/V2/Northwind/Northwind.svc/",
    "type": "OData",
    "settings": {
      "odataVersion": "2.0"
    }
  }
}
[...] because they won't expose their server to the cloud, I can't use SAP Web IDE
An alternative to the cloud-based Web IDE is the Web IDE Personal Edition, which you can deploy on your local machine but which still runs in the browser (localhost). Create a corresponding destination file to connect to remote systems, and the rest is pretty much the same as the Orion-based Web IDE in the cloud.
Here is the destination file for the services from odata.org (e.g. Northwind)
Description=OData service from odata.org for testing, tutorials, demos, etc.
Type=HTTP
Authentication=NoAuthentication
WebIDEUsage=odata_gen
Name=odata_org
WebIDEEnabled=true
URL=http\://services.odata.org/
ProxyType=Internet
WebIDESystem=odata_org
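As for where that file goes: if I remember correctly, the Personal Edition picks up destination files from a config_master folder inside its installation directory, but treat the exact path below as an assumption and double-check it against the Personal Edition documentation:
# Assumed location; verify against the Web IDE Personal Edition docs
WEBIDE_HOME=/path/to/webide-personal-edition   # placeholder
cp odata_org "$WEBIDE_HOME/config_master/service.destinations/destinations/odata_org"
# Restart the local Web IDE server afterwards so it picks up the new destination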
Otherwise, if you want to stick with Eclipse, take a look at the documentation topic App Development Using SAPUI5 Tools for Eclipse and its underlying topic Use a SimpleProxyServlet for Testing to Avoid Cross-domain Requests.
An example using the Northwind OData service (I made this in Eclipse; the only difference with the SAP Web IDE Personal Edition, which I haven't tried yet but should work, is that you must configure the destination file for the services):
manifest.json
// under the "sap.app" key you put this:
"dataSources": {
  "mainService": {
    "uri": "/northwind/V2/OData/OData.svc/",
    "type": "OData",
    "settings": {
      "odataVersion": "2.0",
      "localUri": "localService/metadata.xml"
    }
  }
}
...
// under the "sap.ui5"/"models" key; the default model name can be empty (if you use more than one model, only one of them can have the empty name)
"": {
  "dataSource": "mainService",
  "preload": true
}
In the view where I'm going to use the data:
<List
  id="list"
  items="{
    path: '/Categories',
    sorter: {
      path: 'Name',
      descending: false
    },
    groupHeaderFactory: '.createGroupHeader'
  }"
  busyIndicatorDelay="{masterView>/delay}"
  noDataText="{masterView>/noDataText}"
  mode="{= ${device>/system/phone} ? 'None' : 'SingleSelectMaster'}"
  growing="true"
  growingScrollToLoad="true"
  updateFinished="onUpdateFinished"
  selectionChange="onSelectionChange">
  <infoToolbar>
    <Toolbar
      active="true"
      id="filterBar"
      visible="{masterView>/isFilterBarVisible}"
      press="onOpenViewSettings">
      <Title
        id="filterBarLabel"
        text="{masterView>/filterBarLabel}" />
    </Toolbar>
  </infoToolbar>
  <items>
    <ObjectListItem
      type="Active"
      press="onSelectionChange"
      title="{Name}">
    </ObjectListItem>
  </items>
</List>
If you're going to consume an OData service made for you, just paste its URL in the "uri" property of your dataSource key (something like "https://<proxy-host>:<port>/sap/opu/odata/SAP/ZNAME_OF_YOUR_CREATED_ODATA_SRV"; don't worry, you can see this URL in transaction /IWFND/MAINT_SERVICE), and when the app is ready to deploy, just leave the uri as a relative path like /sap/opu/odata/SAP/ZNAME_OF_YOUR_CREATED_ODATA_SRV.
I'll give you some pointers, but not a full tutorial.
Working in Eclipse is not that different from working with Web IDE.
First you need to use JSONModel(). You can find the reference here.
Create a JSONModel object and then use the method loadData.
For the sURL use (in your example):
"https://services.odata.org/V2/Northwind/Northwind.svc/?$format=json"
Then you will have your OData in your front end. Now you just need to learn how to use it in your view elements, which you can learn here.
If you want further explanations, please ask small and specific questions, so it is easier to give answers directed to your needs.
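To see the payload that loadData would fetch, you can hit the same URL directly; Categories is just one example entity set from the Northwind service used above:
# Preview the JSON returned by the Northwind demo service
curl 'https://services.odata.org/V2/Northwind/Northwind.svc/?$format=json'
# ...or a single entity set, e.g. Categories
curl 'https://services.odata.org/V2/Northwind/Northwind.svc/Categories?$format=json'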

Can I update Windows ClientIdentities after cluster creation?

I currently have something like this in my clusterConfig.json file:
"ClientIdentities": [
  {
    "Identity": "{My Domain}\\{My Security Group}",
    "IsAdmin": true
  }
]
My questions are:
My cluster is stood up and running. Can I add a second security group to this cluster while it is running? I've searched through the PowerShell commands and didn't see one that matched this, but I may have missed it.
If I can't do this while the cluster is running, do I need to delete the cluster and recreate it? If I do need to recreate it, I'm zeroing in on the word ClientIdentities: I'm assuming I can have multiple identities, and my config should look something like
"ClientIdentities": [
  {
    "Identity": "{My Domain}\\{My Security Group}",
    "IsAdmin": true
  },
  {
    "Identity": "{My Domain}\\{My Second Security Group}",
    "IsAdmin": false
  }
]
Thanks,
Greg
Yes, it is possible to update ClientIdentities once the cluster is up using a configuration upgrade.
Create a new JSON file with the added client identities.
Modify the clusterConfigurationVersion in the JSON config.
Run Start-ServiceFabricClusterConfigurationUpgrade -ClusterConfigPath "Path to new JSON"

Apache Drill using Google Cloud Storage

The Apache Drill features list mentions that it can query data from Google Cloud Storage, but I can't find any information on how to do that. I've got it working fine with S3, but suspect I'm missing something very simple in terms of Google Cloud Storage.
Does anyone have an example Storage Plugin configuration for Google Cloud Storage?
Thanks
M
This is quite an old question, so I imagine you either found a solution or moved on with your life, but for anyone looking for a way to do this without using Dataproc, here's a solution:
Add the GCS connector JAR file from the GCP connectors to Drill's jars/3rdparty directory.
Add the following to the core-site.xml file in Drill's conf directory (change the upper-case values such as YOUR_PROJECT_ID to your own details):
<property>
  <name>fs.gs.project.id</name>
  <value>YOUR_PROJECT_ID</value>
  <description>
    Optional. Google Cloud Project ID with access to GCS buckets.
    Required only for list buckets and create bucket operations.
  </description>
</property>
<property>
  <name>fs.gs.auth.service.account.private.key.id</name>
  <value>YOUR_PRIVATE_KEY_ID</value>
</property>
<property>
  <name>fs.gs.auth.service.account.private.key</name>
  <value>-----BEGIN PRIVATE KEY-----\nYOUR_PRIVATE_KEY\n-----END PRIVATE KEY-----\n</value>
</property>
<property>
  <name>fs.gs.auth.service.account.email</name>
  <value>YOUR_SERVICE_ACCOUNT_EMAIL</value>
  <description>
    The email address associated with the service account used for GCS
    access when fs.gs.auth.service.account.enable is true. Required
    when an authentication key specified in the configuration file (Method 1)
    or a PKCS12 certificate (Method 3) is being used.
  </description>
</property>
<property>
  <name>fs.gs.working.dir</name>
  <value>/</value>
  <description>
    The directory that relative gs: URIs resolve to inside the default bucket.
  </description>
</property>
<property>
  <name>fs.gs.implicit.dir.repair.enable</name>
  <value>true</value>
  <description>
    Whether or not to create objects for the parent directories of objects
    with / in their path, e.g. creating gs://bucket/foo/ upon deleting or
    renaming gs://bucket/foo/bar.
  </description>
</property>
<property>
  <name>fs.gs.glob.flatlist.enable</name>
  <value>true</value>
  <description>
    Whether or not to prepopulate potential glob matches in a single list
    request to minimize calls to GCS in nested glob cases.
  </description>
</property>
<property>
  <name>fs.gs.copy.with.rewrite.enable</name>
  <value>true</value>
  <description>
    Whether or not to perform the copy operation using Rewrite requests,
    which allows copying files between different locations and storage classes.
  </description>
</property>
Start Apache Drill.
Add a custom storage plugin to Drill.
You're good to go.
The solution is from here, where I detail some more about what we do around data exploration with Apache Drill.
I managed to query parquet data in Google Cloud Storage (GCS) using Apache Drill (1.6.0) running on a Google Dataproc cluster.
In order to set that up, I took the following steps:
Install Drill and make the GCS connector accessible (this can be used as an initialization action for Dataproc; just note it wasn't really tested and relies on a local ZooKeeper instance):
#!/bin/sh
set -x -e
BASEDIR="/opt/apache-drill-1.6.0"
mkdir -p ${BASEDIR}
cd ${BASEDIR}
wget http://apache.mesi.com.ar/drill/drill-1.6.0/apache-drill-1.6.0.tar.gz
tar -xzvf apache-drill-1.6.0.tar.gz
mv apache-drill-1.6.0/* .
rm -rf apache-drill-1.6.0 apache-drill-1.6.0.tar.gz
ln -s /usr/lib/hadoop/lib/gcs-connector-1.4.5-hadoop2.jar ${BASEDIR}/jars/gcs-connector-1.4.5-hadoop2.jar
mv ${BASEDIR}/conf/core-site.xml ${BASEDIR}/conf/core-site.xml.old
ln -s /etc/hadoop/conf/core-site.xml ${BASEDIR}/conf/core-site.xml
drillbit.sh start
set +x +e
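To use this as a Dataproc initialization action, something along these lines should work (the script filename, bucket, and cluster name are placeholders):
# Stage the script in GCS and reference it at cluster-creation time
gsutil cp drill-init.sh gs://your-config-bucket/drill-init.sh
gcloud dataproc clusters create my-drill-cluster \
  --initialization-actions gs://your-config-bucket/drill-init.sh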
Connect to the Drill console, create a new storage plugin (call it, say, gcs), and use the following configuration (note: I copied most of it from the S3 config and made minor changes):
{
  "type": "file",
  "enabled": true,
  "connection": "gs://myBucketName",
  "config": null,
  "workspaces": {
    "root": {
      "location": "/",
      "writable": false,
      "defaultInputFormat": null
    },
    "tmp": {
      "location": "/tmp",
      "writable": true,
      "defaultInputFormat": null
    }
  },
  "formats": {
    "psv": {
      "type": "text",
      "extensions": [
        "tbl"
      ],
      "delimiter": "|"
    },
    "csv": {
      "type": "text",
      "extensions": [
        "csv"
      ],
      "delimiter": ","
    },
    "tsv": {
      "type": "text",
      "extensions": [
        "tsv"
      ],
      "delimiter": "\t"
    },
    "parquet": {
      "type": "parquet"
    },
    "json": {
      "type": "json",
      "extensions": [
        "json"
      ]
    },
    "avro": {
      "type": "avro"
    },
    "sequencefile": {
      "type": "sequencefile",
      "extensions": [
        "seq"
      ]
    },
    "csvh": {
      "type": "text",
      "extensions": [
        "csvh"
      ],
      "extractHeader": true,
      "delimiter": ","
    }
  }
}
Query using the following syntax (note the backticks):
select * from gs.`root`.`path/to/data/*` limit 10;