Ingest data into Databricks using REST API (Scala)

I need to call a REST API from Databricks, preferably using Scala, to get the data and persist it in Databricks. This is the first time I am doing this and I need help. Can any of you please walk me through, step by step, how to achieve this? The API team has already created a service principal and has granted it access to the API, so authentication needs to be done through the SPN.
Thanks!

Using the REST API is not the recommended approach to ingest data into Databricks.
Reason: the amount of data uploaded by a single API call cannot exceed 1 MB.
To upload a file larger than 1 MB to DBFS, use the streaming API, which is a combination of create, addBlock, and close.
Here is an example of how to perform this action using Python.
import json
import base64
import requests

DOMAIN = '<databricks-instance>'
TOKEN = '<your-token>'
BASE_URL = 'https://%s/api/2.0/dbfs/' % (DOMAIN)

def dbfs_rpc(action, body):
    """A helper function to make a DBFS API request; request/response bodies are encoded/decoded as JSON."""
    response = requests.post(
        BASE_URL + action,
        headers={'Authorization': 'Bearer %s' % TOKEN},
        json=body
    )
    return response.json()
# Create a handle that will be used to add blocks
handle = dbfs_rpc("create", {"path": "/temp/upload_large_file", "overwrite": "true"})['handle']
with open('/a/local/file', 'rb') as f:
    while True:
        # A block can be at most 1 MB
        block = f.read(1 << 20)
        if not block:
            break
        data = base64.standard_b64encode(block)
        dbfs_rpc("add-block", {"handle": handle, "data": data.decode('UTF8')})
# Close the handle to finish uploading
dbfs_rpc("close", {"handle": handle})
For more details, refer to the "DBFS API" documentation.
Hope this helps.

Note the data.decode('UTF8') in the add-block call above: in Python 3, base64.standard_b64encode returns bytes, which are not JSON serializable, so the value must be decoded to a string before sending. Opening the file in binary mode ('rb') also lets you upload jar files and other non-ASCII content. The rest of the details are the same.
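Since the question asks for authentication through a service principal (SPN) rather than a personal access token, here is a minimal sketch, assuming an Azure Databricks workspace: acquire an Azure AD token for the SPN via the client-credentials flow and use it as TOKEN in the code above. The tenant and client values are placeholders.
import requests

TENANT_ID = '<tenant-id>'
CLIENT_ID = '<spn-client-id>'
CLIENT_SECRET = '<spn-client-secret>'

response = requests.post(
    'https://login.microsoftonline.com/%s/oauth2/v2.0/token' % TENANT_ID,
    data={
        'grant_type': 'client_credentials',
        'client_id': CLIENT_ID,
        'client_secret': CLIENT_SECRET,
        # scope for the well-known Azure Databricks resource ID
        'scope': '2ff814a6-3304-4ab8-85cb-2cd00de8b41e/.default',
    },
)
TOKEN = response.json()['access_token']  # use as the Bearer token above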


Is there a way for me to extract request information from headers/body using Telegraf

So I have Telegraf installed on one of our EC2 instances, and on that instance we have a Spring Boot application servicing POST requests and forwarding them on.
I was wondering if it is possible to extract information from requests coming into this service, such as header information or key-value pairs from a JSON body, and then forward this information on to Telegraf?
Telegraf would then push this information to InfluxDB.
The idea is that we will extract this information and display it using Grafana so we can visualise how many requests are coming from where.
I know there are some plugins like http_listener, but I can't tell from the readme if this could possibly work or if there is a better way to get this working?
Thanks for any information you can provide in advance!
You could try working from this sample. The sample code could be:
[[inputs.http]]
  # URL for your data in JSON format
  urls = ["https://yourURL/sample.json"]
  # Overwrite measurement name from default `http` to `someOtherMeasurement`
  name_override = "someOtherMeasurement"
  # Data from HTTP in JSON format
  data_format = "json_v2"
  # Add a subtable to use the `json_v2` parser
  [[inputs.http.json_v2]]
    # Add an object subtable to parse a JSON object
    [[inputs.http.json_v2.object]]
      # Parse data in the `data.stations` path only
      path = "data.stations"
      # Set station metadata as tags
      tags = ["yourTags"]
      # Latest station information reported at `last_reported`
      timestamp_key = "theTimeStamp"
      # Time is reported in unix timestamp format
      timestamp_format = "unix"
      # All other JSON key-value pairs will be turned into fields
An excerpt of the original HTTP JSON response is:
{
  "data": {
    "stations": [
      {
        "last_reported": 1655171050,
        "is_renting": 1,
        "num_bikes_available": 21,
        "is_installed": 1,
        "legacy_id": "72",
        "station_status": "active",
        "num_ebikes_available": 2,
        "is_returning": 1,
        "eightd_has_available_keys": false,
        "num_docks_disabled": 0,
        "station_id": "72",
        "num_bikes_disabled": 1,
        "num_docks_available": 32
      },
      "..."
    ]
  },
  "last_updated": 1655171085,
  "ttl": 5
}
The above Telegraf configuration will turn the JSON into the following line protocol:
someOtherMeasurement,station_id=72 eightd_has_available_keys=false,is_installed=1,is_renting=1,is_returning=1,legacy_id="72",num_bikes_available=21,num_bikes_disabled=1,num_docks_available=32,num_docks_disabled=1,num_ebikes_available=2,station_status="active" 1655171050000000000
You could play with the sample JSON and Telegraf to refine your configuration. Happy coding.
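If you want to experiment locally before pointing at the real endpoint, one option (an assumption, not part of the original answer) is to serve the sample JSON from a tiny Python web server and set urls = ["http://localhost:8000/sample.json"] in the config above:
# serve the current directory (containing sample.json) over HTTP for Telegraf to poll
import http.server
import socketserver

PORT = 8000
with socketserver.TCPServer(("", PORT), http.server.SimpleHTTPRequestHandler) as httpd:
    print("Serving at http://localhost:%d" % PORT)
    httpd.serve_forever()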

How to insert an entity into Azure Storage table using Web Activity of Azure Data Factory service

I have a table in a storage account. I would like to do a test by inserting an entity into this table using a Web Activity, with guidance from this link (https://learn.microsoft.com/en-us/rest/api/storageservices/insert-entity).
I also tried to create a header in the Web Activity settings with the following format for my shared key (https://learn.microsoft.com/en-us/rest/api/storageservices/authorize-with-shared-key):
Authorization="SharedKey <_AccountName>:<_Signature>"
But it seems that there is no function in the dynamic expression builder to compute a Hash-based Message Authentication Code (HMAC) for the <_Signature>.
Could someone give me some sample or some hints? Thanks.
There is a provision for sha2 encoding in the expression builder when using Data Flows, but when using a Web Activity in Data Factory pipelines you will have to use a workaround. Here is what I tried: call a serverless Function App, based on PowerShell, to compute the signature.
The basic idea in PowerShell (note that Shared Key requires an HMAC-SHA256 keyed with the decoded account key, not a plain SHA-256 hash):
$ClearString = "String_to_sign"
$AccountKey = "<base64-storage-account-key>"
$hmac = New-Object System.Security.Cryptography.HMACSHA256
$hmac.Key = [System.Convert]::FromBase64String($AccountKey)
$hash = $hmac.ComputeHash([System.Text.Encoding]::UTF8.GetBytes($ClearString))
$body = [System.Convert]::ToBase64String($hash)
1. Call the Function App:
With body: a JSON containing the string to sign
{
    "name": "@{pipeline().parameters.StringToSign}"
}
2. Assign the Function App output (the base64-encoded HMAC) to a variable:
@activity('Azure Function1').output.Response
3. Configure the Web Activity as per your scenario:
Note: I have used sample data for demonstration purposes; please modify this method as per your need.
Since the function already returns the base64-encoded HMAC, prep the Authorization header with the concat function:
Authorization: @concat('SharedKey kteststoragee:', variables('sha256'))
Build the Authorization header following the MS doc on the Table service (Shared Key authorization); use a string function such as concat to build the final string.
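For reference, here is a minimal Python sketch of the same Shared Key signature computation, following the MS doc for the Table service; the account key and table name are placeholder assumptions:
import base64
import hashlib
import hmac
from datetime import datetime, timezone

account = "kteststoragee"             # account name from the example above
account_key = "<base64-account-key>"  # assumption: your storage account key
table = "mytable"                     # hypothetical table name

# the Date (or x-ms-date) header you send must match the value signed here
date = datetime.now(timezone.utc).strftime("%a, %d %b %Y %H:%M:%S GMT")
content_type = "application/json"
# String-to-Sign for the Table service:
# VERB \n Content-MD5 \n Content-Type \n Date \n CanonicalizedResource
string_to_sign = "\n".join(["POST", "", content_type, date, "/%s/%s" % (account, table)])
signature = base64.b64encode(
    hmac.new(base64.b64decode(account_key),
             string_to_sign.encode("utf-8"),
             hashlib.sha256).digest()
).decode()
print("SharedKey %s:%s" % (account, signature))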

Connect to a REST endpoint using OAuth2

I am trying to explore different options to connect to a REST endpoint using Azure Data Factory. I have the Python code below, which does what I am looking for, but I am not sure if Azure Data Factory offers something out of the box to connect to the API, or a way to call custom code.
Code:
import sys
import requests
from requests_oauthlib import OAuth2Session
from oauthlib.oauth2 import BackendApplicationClient
import json
import logging
import time
logging.captureWarnings(True)
api_url = "https://webapi.com/api/v1/data"
client_id = 'client'
client_secret = 'secret'
client = BackendApplicationClient(client_id=client_id)
oauth = OAuth2Session(client=client)
token = oauth.fetch_token(token_url='https://webapi.com/connect/accesstoken', client_id=client_id, client_secret=client_secret)
client = OAuth2Session(client_id, token=token)
response = client.get(api_url)
data = response.json()
When I look at the REST linked service, I don't see many authentication options.
Could you please point me to which activities to use to make OAuth2 work in Azure Data Factory?
You would have to use a Web Activity to call the token endpoint with the POST method and get the authentication token before getting data from the API.
Here is an example.
First create a Web Activity.
Select the URL that performs the authentication and returns the token.
Set Method to POST.
Create a header > Name: Content-Type, Value: application/x-www-form-urlencoded
Configure the request body for the HTTP request.
Format: grant_type=refresh_token&client_id={client_id}&client_secret={client_secret}&refresh_token={refresh_token}
Example: grant_type=refresh_token&client_id=HsdO3t5xxxxxxxxx0VBsbGYb&client_secret=t0_0CqU8oA5snIOKyT8gWxxxxxxxxxYhsQ-S1XfAIYaEYrpB&refresh_token={refresh_token}
The values above are only examples; please replace them with your respective id and secret when you try.
As output from this Web Activity you will receive a JSON string, from which you can extract the access_token to use in request headers of further activities (e.g. a REST linked service) in the pipeline, depending on your need.
You can get the access_token like below. I have assigned it to a variable for simplicity.
@activity('GetOauth2 token').output.access_token
Here is an example from the official MS docs of an OAuth authentication implementation for copying data.
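For comparison, here is a minimal Python sketch of the same POST the Web Activity performs (refresh-token flow; the token URL is taken from the question, credential values are placeholders):
import requests

response = requests.post(
    'https://webapi.com/connect/accesstoken',
    # requests sends dict bodies as application/x-www-form-urlencoded
    data={
        'grant_type': 'refresh_token',
        'client_id': '<client_id>',
        'client_secret': '<client_secret>',
        'refresh_token': '<refresh_token>',
    },
)
# this is what @activity('GetOauth2 token').output.access_token reads
access_token = response.json()['access_token']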

How can a REST API pass large JSON?

I am building a REST API and facing this issue: how can a REST API pass very large JSON?
Basically, I want to connect to a database and return the training data. The problem is that the database holds 400,000 records. If I wrap them into one JSON payload and return it through a GET method, the server throws a heap overflow exception.
What methods can we use to solve this problem?
DBTraining trainingdata = new DBTraining();

@GET
@Produces("application/json")
@Path("/{cat_id}")
public Response getAllDataById(@PathParam("cat_id") String cat_id) {
    List<TrainingData> list = new ArrayList<TrainingData>();
    try {
        list = trainingdata.getAllDataById(cat_id);
        Gson gson = new Gson();
        Type dataListType = new TypeToken<List<TrainingData>>() {}.getType();
        String jsonString = gson.toJson(list, dataListType);
        return Response.ok()
                .entity(jsonString)
                .header("Access-Control-Allow-Origin", "*")
                .header("Access-Control-Allow-Methods", "GET")
                .build();
    } catch (SQLException e) {
        logger.warn(e.getMessage());
    }
    // signal failure instead of returning null
    return Response.serverError().build();
}
The RESTful way of doing this is to create a paginated API. First, add query parameters to set page size, page number, and maximum number of items per page. Use sensible defaults if any of these are not provided or unrealistic values are provided. Second, modify the database query to retrieve only a subset of the data. Convert that to JSON and use that as the payload of your response. Finally, in following HATEOAS principles, provide links to the next page (provided you're not on the last page) and previous page (provided you're not on the first page). For bonus points, provide links to the first page and last page as well.
By designing your endpoint this way, you get very consistent performance characteristics and can handle data sets that continue to grow.
The GitHub API provides a good example of this.
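To illustrate the shape of such an endpoint, here is a minimal pagination sketch, using Python/Flask rather than the asker's JAX-RS stack; query_page is a hypothetical DB helper standing in for a LIMIT/OFFSET query:
from flask import Flask, jsonify, request

app = Flask(__name__)

def query_page(cat_id, limit, offset):
    # hypothetical helper: a real version would run something like
    # SELECT * FROM training WHERE cat_id = ? LIMIT ? OFFSET ?
    return []

@app.route("/training/<cat_id>")
def get_all_data_by_id(cat_id):
    page = max(int(request.args.get("page", 1)), 1)
    per_page = min(int(request.args.get("per_page", 100)), 1000)  # cap the page size
    rows = query_page(cat_id, per_page, (page - 1) * per_page)
    # HATEOAS-style links to neighbouring pages
    base = "/training/%s?per_page=%d" % (cat_id, per_page)
    links = {"next": "%s&page=%d" % (base, page + 1)}
    if page > 1:
        links["prev"] = "%s&page=%d" % (base, page - 1)
    return jsonify({"data": rows, "links": links})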
My suggestion is not to pass the data as JSON but as a file, using multipart/form-data. In your file, each line could be a JSON document representing one data record. Then it is easy to use a FileOutputStream to receive the file and process it line by line to avoid memory problems.
A Grails example:
if (params.myFile) {
    if (params.myFile instanceof org.springframework.web.multipart.commons.CommonsMultipartFile) {
        def fileName = "/tmp/myReceivedFile.txt"
        new FileOutputStream(fileName).leftShift(params.myFile.getInputStream())
    } else {
        // print or signal error
    }
}
You can use curl to send your file:
curl -F "myFile=@/mySendigFile.txt" http://acme.com/my-service
More details on a similar solution at https://stackoverflow.com/a/13076550/2476435
HTTP has the notion of chunked transfer encoding, which allows you to send an HTTP response body in smaller pieces so the server does not have to hold the entire response in memory. You need to find out how your server framework supports chunked encoding.
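As a minimal sketch of such a streamed (chunked) JSON response, again using Python/Flask for illustration; fetch_rows is a hypothetical generator standing in for a DB cursor read row by row:
import json
from flask import Flask, Response

app = Flask(__name__)

def fetch_rows():
    # hypothetical: a real version would yield rows from a database cursor
    yield {"id": 1}
    yield {"id": 2}

@app.route("/training")
def stream_training():
    def generate():
        yield "["
        first = True
        for row in fetch_rows():
            if not first:
                yield ","
            first = False
            yield json.dumps(row)
        yield "]"
    # Flask streams generator responses with chunked transfer encoding,
    # so the full payload is never held in memory
    return Response(generate(), mimetype="application/json")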

Calling CloudStack With Parameters

I am trying to make an API call using the code below, and it works fine:
import urllib2
import urllib
import hashlib
import hmac
import base64

baseurl = 'http://www.xxxx.com:8080/client/api?'
request = {}
request['command'] = 'listUsers'
request['response'] = 'xml'
request['apikey'] = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
secretkey = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'

# build the query string, then a lower-cased, sorted copy of it for signing
request_str = '&'.join(['='.join([k, urllib.quote_plus(request[k])]) for k in request.keys()])
sig_str = '&'.join(['='.join([k.lower(), urllib.quote_plus(request[k].lower().replace('+', '%20'))]) for k in sorted(request.iterkeys())])

# HMAC-SHA1 over the signing string, then base64-encode and URL-encode the digest
sig = hmac.new(secretkey, sig_str, hashlib.sha1).digest()
sig = urllib.quote_plus(base64.encodestring(sig).strip())

req = baseurl + request_str + '&signature=' + sig
res = urllib2.urlopen(req)
result = res.read()
print result
What I want to know is: how can I send additional parameters with the API call?
And how do I send parameters when I am sending data to CloudStack (e.g. createUser) instead of getting data from CloudStack?
Add additional parameters to the request dictionary.
E.g. listUsers allows details of a specific username to be listed (listUsers API Reference). To do so, you'd update the request creation as follows:
request = {}
request['command'] = 'listUsers'
request['username'] = 'admin'
request['response'] = 'xml'
request['apikey'] = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
Also, the Rules for Signing say to "Lower case the entire Command String and sort it alphabetically via the field for each field-value pair". That section of the docs also covers adding an expiry to the URL.
Finally, you need to ensure the HTTP GET is not cached by network infrastructure by making each HTTP GET unique. The CloudStack API uses a cache buster. Alternatively, you can add an expiry to each query, or use an HTTP POST.
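For the createUser part of the question, sending data to CloudStack works the same way: put the command's parameters in the request dictionary and sign exactly as in the code above. A hedged sketch (parameter names follow the CloudStack createUser API reference; verify against your CloudStack version):
request = {}
request['command'] = 'createUser'
request['username'] = 'newuser'
request['password'] = 'secret'              # handling depends on server config
request['email'] = 'newuser@example.com'
request['firstname'] = 'New'
request['lastname'] = 'User'
request['account'] = 'admin'                # account the user belongs to
request['response'] = 'xml'
request['apikey'] = '<your-api-key>'
# then build request_str and sig_str and sign with HMAC-SHA1 as shown above;
# consider an HTTP POST since this command modifies state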