How do we access a file in a GitHub repo inside our Azure Databricks notebook?

We have a requirement to access a file hosted in our private GitHub repo from our Azure Databricks notebook.
Currently we do this with curl, using a user's Personal Access Token:
curl -H 'Authorization: token INSERTACCESSTOKENHERE' \
     -H 'Accept: application/vnd.github.v3.raw' \
     -O -L https://api.github.com/repos/*owner*/*repo*/contents/*path*
Is there a way to avoid using a PAT and use deploy keys or something similar instead?

Since summer 2021, Databricks has offered built-in Git repos integration (Repos).
More info here: https://learn.microsoft.com/en-us/azure/databricks/repos
If you add your file (Excel, JSON, etc.) to the repo, you can access and read it with a relative path,
e.g. pd.read_excel("./test_data.xlsx")
Be aware that you need a cluster with Databricks Runtime 8.4+ (or 9.1+?).
You can also check your current working directory by executing os.getcwd().
If you have integrated the repo correctly, the result should look something like:
/Workspace/Repos/george@myemail.com/REPO_FOLDER/analysis
otherwise it will be something like: /databricks/driver
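For example, a minimal sanity check from a notebook cell (assuming the repo has a test_data.xlsx at its root; the file name is only an illustration):
import os
import pandas as pd

# Prints /Workspace/Repos/<user>/<repo>/... when running from a Repo,
# and /databricks/driver when the repo is not integrated
print(os.getcwd())

# Read a file committed to the repo via a relative path
df = pd.read_excel("./test_data.xlsx")
print(df.head())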

Integrate Git and Azure Databricks.
This documentation shows how to integrate Git and Azure Databricks.
Step 1: Get the raw URL of the file.
Step 2: Use wget to fetch the file via that raw URL (the blob URL returns an HTML page, not the file itself):
wget https://raw.githubusercontent.com/githubtraining/hellogitworld/master/resources/labels.properties

Related

How to create and place a .env file in a Kafka cluster so a Kafka connector config can reference it

I'm trying to create a Kafka connector from Kafka to Snowflake, using the Kafka-to-Snowflake section of this tutorial.
Here is the full connect config I'm starting with, contained in a curl request. As you can see, it references ${file:/data/credentials.properties:ENV_VAR_NAME} multiple times to grab env vars:
curl -i -X PUT -H "Content-Type:application/json" \
    http://localhost:8083/connectors/sink_snowflake_01/config \
    -d '{
    "connector.class":"com.snowflake.kafka.connector.SnowflakeSinkConnector",
    "tasks.max":1,
    "topics":"mssql-01-mssql.dbo.ORDERS",
    "snowflake.url.name":"${file:/data/credentials.properties:SNOWFLAKE_HOST}",
    "snowflake.user.name":"${file:/data/credentials.properties:SNOWFLAKE_USER}",
    "snowflake.user.role":"SYSADMIN",
    "snowflake.private.key":"${file:/data/credentials.properties:SNOWFLAKE_PRIVATE_KEY}",
    "snowflake.database.name":"DEMO_DB",
    "snowflake.schema.name":"PUBLIC",
    "key.converter":"org.apache.kafka.connect.storage.StringConverter",
    "value.converter":"com.snowflake.kafka.connector.records.SnowflakeAvroConverter",
    "value.converter.schema.registry.url":"https://${file:/data/credentials.properties:CCLOUD_SCHEMA_REGISTRY_HOST}",
    "value.converter.basic.auth.credentials.source":"USER_INFO",
    "value.converter.basic.auth.user.info":"${file:/data/credentials.properties:CCLOUD_SCHEMA_REGISTRY_API_KEY}:${file:/data/credentials.properties:CCLOUD_SCHEMA_REGISTRY_API_SECRET}"
    }'
My question is: how do I put a .env file at /data/credentials.properties within the cluster, so that my connect config can access the env vars in that file using the "${…}" syntax, as in this line of the example connect config JSON:
"snowflake.url.name":"${file:/data/credentials.properties:SNOWFLAKE_HOST}",
Robin's tutorial (thank you for the link) makes specific reference to that credentials file:
All the code shown here is based on this github repo. If you’re following along then make sure you set up .env (copy the template from .env.example) with all of your cloud details. This .env file gets mounted in the Docker container to /data/credentials.properties, which is what’s referenced in the connector configurations below.
So if you're using his repo, then that file is already dropped in the right spot for you, as you can see at https://github.com/confluentinc/demo-scene/blob/master/pipeline-to-the-cloud/docker-compose.yml#L73
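Note that the ${file:...} placeholders only resolve if the Connect worker has a file config provider enabled. Robin's Docker setup already does this; if you are assembling your own worker instead, the standard Kafka Connect worker properties are along these lines (a sketch, not taken from the tutorial):
# Enable the file config provider in the Connect worker configuration
config.providers=file
config.providers.file.class=org.apache.kafka.common.config.provider.FileConfigProvider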

Download a released asset from a private GitHub repo

I'm trying to download a single asset added to a release in a private repository, but all I get is a 404 response. The download URL of the asset is https://github.com/<user>/<repo>/releases/download/20211022/file.json
I've tried several different ways of specifying the username and personal access token, but they all give the same result. When I use the access token against api.github.com, it seems to work.
I've tried the following formats in curl
curl -i -u <user>:<token> <url>
curl -i "https://<user>:<token>@github.com/ ...."
curl -i -H "Authorization: token <token>" <url>
I can download the source code (zip) from the release, but that has a different URL: https://github.com/<user>/<repo>/archive/refs/tags/20211022.zip
What am I doing wrong?
As in the documentation here,
you have to use the URL /repos/{owner}/{repo}/releases/assets/{asset_id}, not the browser_download_url, and set the request's Accept header to application/octet-stream to download the binary.
And don't forget to add the Authorization header too.
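A sketch of the two calls with curl (20211022 is the tag from your URL; <asset_id> comes from the first response):
# Look up the release by tag; the JSON response lists each asset with its id
curl -H "Authorization: token <token>" \
     https://api.github.com/repos/<user>/<repo>/releases/tags/20211022
# Download the asset itself by id, asking for the binary content
curl -L -H "Authorization: token <token>" \
     -H "Accept: application/octet-stream" \
     -o file.json \
     https://api.github.com/repos/<user>/<repo>/releases/assets/<asset_id>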
The alternative is to use the GitHub CLI gh, with the command gh release download.
Since gh 2.19.0 (Nov. 2022), you can download a single asset with gh release download --output (after a gh auth login, for authentication):
In your case:
gh release download 20211022 --pattern 'file.json' --output file.json

I have code in a GitHub repo and want to clone/copy it to an Azure repository. Is there any way to do it via REST?

I have code in a GitHub repo and want to clone/copy it to an Azure repository. Is there any way to do it via curl?
Is there any way to do it via curl?
Yes, the method exists.
You need to use curl to call the following REST API to create an "Other Git" service connection:
POST https://dev.azure.com/{Organization Name}/_apis/serviceendpoint/endpoints?api-version=6.0-preview.4
curl example:
curl -X POST \
-u USERName:PAT "https://dev.azure.com/org/_apis/serviceendpoint/endpoints?api-version=6.0-preview.4" \
-H 'Accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"authorization":{"scheme":"UsernamePassword","parameters":{"username":"{User name}","password":"{Password}"}},
"data":{"accessExternalGitServer":"true"},
"name":"{name}",
"serviceEndpointProjectReferences":[{"description":"","name":"{Service connection name}","projectReference":{"id":"{Project Id}","name":"{Project Name}"}}],
"type":"git",
"url":"{Target Git URL}",
"isShared":false,
"owner":"library"
}'
Then you can use the importRequests REST API to import the GitHub repo into the Azure repo.
POST https://dev.azure.com/{Organization Name}/{Project Name}/_apis/git/repositories/{Repo Name}/importRequests?api-version=5.0-preview.1
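A sketch of that call with curl (the body shape follows the importRequests API; {Service connection Id} is the id returned when creating the service connection above, so verify the details against the current API docs):
curl -X POST \
    -u USERName:PAT "https://dev.azure.com/{Organization Name}/{Project Name}/_apis/git/repositories/{Repo Name}/importRequests?api-version=5.0-preview.1" \
    -H 'Content-Type: application/json' \
    -d '{
    "parameters": {
        "gitSource": { "url": "{Source GitHub clone URL}" },
        "serviceEndpointId": "{Service connection Id}",
        "deleteServiceEndpointAfterImportIsDone": true
    }
}'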
For more detailed info, you can refer to my other answer: Get github repository using Azure devops REST API
This method is rather involved. I suggest instead directly importing the GitHub repo with the Import repository option in Azure Repos.
You provide the clone URL and auth info, and the repo is cloned straight into Azure Repos.

Copying directly from S3 to Google Cloud Storage

I can migrate data from Amazon AWS S3 to Azure using the AWS SDK for Java and the Azure SDK for Java. Now I want to migrate data from Amazon AWS S3 to Google Cloud Storage, also using Java.
The gsutil command-line tool supports S3. After you've configured gsutil, you'll see this in your ~/.boto file:
# To add aws credentials ("s3://" URIs), edit and uncomment the
# following two lines:
#aws_access_key_id =
#aws_secret_access_key =
Fill in the aws_access_key_id and aws_secret_access_key settings with your S3 credentials and uncomment the variables.
Once that's set up, copying from S3 to GCS is as easy as:
gsutil cp -R s3://bucketname gs://bucketname
If you have a lot of objects, run with the -m flag to perform the copy in parallel with multiple threads:
gsutil -m cp -R s3://bucketname gs://bucketname
Use the Google Cloud Storage Transfer Service.
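If you prefer the CLI over the console, recent gcloud releases can create a transfer job directly; a rough sketch (verify the exact syntax with gcloud transfer jobs create --help, as the command surface may differ by gcloud version):
gcloud transfer jobs create s3://bucketname gs://bucketname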
The answer suggested by jterrace (AWS key and secret in the .boto file) is correct and worked for me in many regions, but not in regions that accept only AWS Signature Version 4. For instance, while connecting to the Mumbai region I got this error:
BadRequestException: 400 InvalidRequest
The authorization mechanism you have provided is not supported. Please use AWS4-HMAC-SHA256
To overcome this (i.e. make gsutil use AWS Signature v4), I had to add the following lines to the ~/.boto file, creating a new [s3] section:
[s3]
host = s3.ap-south-1.amazonaws.com
use-sigv4 = True
Reference:
Interoperability support for AWS Signature Version 4
Gsutil cannot copy to s3 due to authentication
Create a new .boto file:
[Credentials]
aws_access_key_id = ACCESS_KEY_ID
aws_secret_access_key = SECRET_ACCESS_KEY
and run this command:
BOTO_CONFIG=.boto gsutil -m cp s3://bucket-name/filename gs://bucket-name
or this one:
BOTO_CONFIG=.boto gsutil -m cp gs://bucket-name/filename s3://bucket-name
AWS_ACCESS_KEY_ID=XXXXXXXX AWS_SECRET_ACCESS_KEY=YYYYYYYY gsutil -m cp s3://bucket-name/filename gs://bucket-name
This approach lets you copy data from S3 to GCS without needing a .boto file, which helps when storing a credentials file on the running virtual machine is not acceptable. You can integrate GCP Secret Manager, generate the above command at runtime, and execute it, so the credentials are never stored permanently as a file on the machine.
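For instance, a sketch combining gcloud Secret Manager lookups with the gsutil call above (the secret names aws-access-key-id and aws-secret-access-key are hypothetical placeholders):
AWS_ACCESS_KEY_ID=$(gcloud secrets versions access latest --secret=aws-access-key-id) \
AWS_SECRET_ACCESS_KEY=$(gcloud secrets versions access latest --secret=aws-secret-access-key) \
gsutil -m cp s3://bucket-name/filename gs://bucket-name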

Deployment from a private GitHub repository

I am new to Git and GitHub (I used Subversion before). I cannot find a way to export only the master branch from my private repository to my production server.
I need to prepare an automated solution (called via Fabric). I found that Git has the archive command, but this doesn't work with GitHub (I have set up SSH keys):
someuser@ews1:~/sandbox$ git archive --format=tar --remote=git@github.com:someuser/somerepository.git master
Invalid command: 'git-upload-archive 'someuser/somerepository.git''
You appear to be using ssh to clone a git:// URL.
Make sure your core.gitProxy config option and the
GIT_PROXY_COMMAND environment variable are NOT set.
fatal: The remote end hung up unexpectedly
So I need another way to do this. I don't want any Git metadata in the export. If I clone, I get all those files in the .git directory (which I don't want), and it downloads more data than I really need.
Is there a way to do this over SSH? Or do I have to download the zip over HTTPS?
I'm not sure I fully understood your question.
I use this command to pull the current master version to my server:
curl -sL --user "user:pass" https://github.com/<organisation>/<repository>/archive/master.zip > master.zip
Does this help?
As I explained in my SO answer on downloading GitHub archives from a private repo, after creating a GitHub access token you can use wget or curl to obtain the desired <version> of your repo (e.g. master, HEAD, a commit SHA-1, ...). Note that GitHub has since dropped support for passing the token as an ?access_token= query parameter, so send it in the Authorization header instead.
Solution with wget:
wget --header="Authorization: token <OAUTH-TOKEN>" \
     --output-document=<version>.tar.gz \
     https://api.github.com/repos/<owner>/<repo>/tarball/<version>
Solution with curl:
curl -L -H "Authorization: token <OAUTH-TOKEN>" \
     https://api.github.com/repos/<owner>/<repo>/tarball/<version> \
     > <version>.tar.gz