Convert Blob to Dataframe for GCS Storage - scala

I am trying to find a solution or function for converting a Blob into a DataFrame.
My use case: GCS stores data as blobs. How can I access that GCS blob data and convert it into a DataFrame so I can operate on it?
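One common approach is to let Spark read the GCS objects directly through the GCS Hadoop connector, so the blobs never have to be handled as raw bytes at all. Below is a minimal sketch, assuming the cluster has the connector configured (gs:// scheme) and the blobs contain line-delimited JSON; the bucket and path are placeholders:

```scala
import org.apache.spark.sql.SparkSession

object GcsBlobToDataFrame {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("gcs-blob-to-dataframe")
      .getOrCreate()

    // Placeholder bucket/path; requires the GCS connector on the classpath.
    // Swap .json for .csv or .parquet depending on how the blobs were written.
    val df = spark.read.json("gs://my-bucket/path/to/blobs/*.json")

    df.printSchema()
    df.show(5)
  }
}
```

If the blobs are not in a format Spark can read directly, another option is to download them with the google-cloud-storage client, decode the bytes yourself, and build a DataFrame from the resulting rows.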

Related

Reading object's data from IBM Cloud Object storage

I'm learning about IBM COS and I haven't found many details about one item in the docs. Could you please let me know if we can read the object data (row by row) after storing a .xlsx file in a bucket? Thanks.
If you save the .xlsx as .csv and then upload it to Cloud Object Storage, you can query the data in place with IBM SQL Query.
https://cloud.ibm.com/docs/sql-query?topic=sql-query-overview

How to read data in Qliksense from Databricks Rest API

I am trying to read datasets stored in Databricks DBFS, and I want to load them into Qlik Sense.
I am using the Databricks REST API and getting data in JSON format, and the retrieved data is base64 encoded.
How can I get this data in tabular format directly from the REST API?
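The DBFS read API returns the file contents base64-encoded, so the payload has to be decoded before it can be treated as tabular data. Here is a minimal sketch of that decoding step in Scala, assuming the base64 string has already been pulled out of the JSON response (the sample value below is just illustrative CSV):

```scala
import java.nio.charset.StandardCharsets
import java.util.Base64

object DecodeDbfsPayload {
  def main(args: Array[String]): Unit = {
    // Placeholder: pretend this came from the "data" field of a /api/2.0/dbfs/read response.
    val base64Data = "aWQsbmFtZQoxLEFsaWNlCjIsQm9i" // decodes to "id,name\n1,Alice\n2,Bob"

    val decoded = new String(Base64.getDecoder.decode(base64Data), StandardCharsets.UTF_8)

    // Split into rows and columns so the data can be handed to a tabular consumer
    // (for example, written back out as CSV for Qlik Sense to load).
    val rows = decoded.split("\n").map(_.split(","))
    rows.foreach(r => println(r.mkString(" | ")))
  }
}
```

Note that the read endpoint is paged, so larger files need to be fetched in chunks, with each chunk decoded and the bytes concatenated.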

Table storage showing data in only string format

I'm using an ADF pipeline to copy data from Data Lake to Blob Storage and then from Blob Storage to Table Storage.
As you can see below, here are the column types in ADF Data Flow Sink - Blob Storage (integer, string, timestamp):
Here is the Mapping settings in Copy data activity:
On checking the output in table storage, I see all columns are of string type:
Why is Table Storage saving the data as strings? How do I resolve this so that the columns are stored with the right types (integer, string, timestamp)? Please let me know. Thank you!
Usually, when loading data from Blob Storage in Data Factory, every column in the blob file defaults to String, and Data Factory converts the data types automatically for the Sink.
But that doesn't cover every scenario.
I tested copying data from Blob to Table Storage and found that if we don't specify the data types manually in the Source, all the data types end up as String in the Sink (Table Storage) after the pipeline executes.
For example, this is my Source blob file:
If I don't change the source data types, everything looks fine in the Sink table:
But after the pipeline executes, the data types in Table Storage are all String:
If we change the data types in the Source blob manually, it works!
One other point of confusion: your screenshot appears to show the UI of a Mapping Data Flow Sink, but Mapping Data Flow doesn't support Table Storage as a Sink.
Hope this helps.
Finally figured out the issue - I was using the DelimitedText format for Blob Storage. After converting to the Parquet format, I can see the data being written to Table Storage with the correct types.

Can we write a custom method to load data from S3 bucket into Spark DataFrame

I have a scenario where I need to load JSON data from an S3 bucket into a Spark DataFrame, but the data in the S3 bucket is encrypted with the javax.crypto library using the AES/ECB/PKCS5Padding algorithm. When I try to read the data from S3, Spark throws an error that it is not JSON data, since it is in an encrypted format. Is there any way I can write custom Spark code that reads the data from the S3 bucket as a file input stream, applies the decryption with the javax.crypto utilities, and assigns the result to a DataFrame? (I want my custom Spark code to run on the distributed cluster.) Appreciate your help.
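One way to approach this (a sketch under stated assumptions, not a drop-in solution) is to bypass Spark's JSON reader for the initial load: read the objects as raw bytes with binaryFiles, decrypt each file on the executors with javax.crypto, and only then hand the decrypted text to spark.read.json. The bucket path, key bytes, and the UTF-8 / line-delimited-JSON assumptions below are placeholders:

```scala
import javax.crypto.Cipher
import javax.crypto.spec.SecretKeySpec
import org.apache.spark.sql.SparkSession

object DecryptS3Json {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("decrypt-s3-json").getOrCreate()
    import spark.implicits._

    // Placeholder 128-bit key; in practice pull this from a secrets store.
    val keyBytes = "0123456789abcdef".getBytes("UTF-8")

    val decryptedLines = spark.sparkContext
      .binaryFiles("s3a://my-bucket/encrypted-json/*") // RDD[(path, PortableDataStream)]
      .flatMap { case (_, stream) =>
        // Create the Cipher on the executor, once per file; Cipher is not serializable.
        val cipher = Cipher.getInstance("AES/ECB/PKCS5Padding")
        cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(keyBytes, "AES"))
        val plainText = new String(cipher.doFinal(stream.toArray()), "UTF-8")
        plainText.split("\n")
      }

    // Parse the decrypted lines as JSON and let Spark infer the schema.
    val df = spark.read.json(decryptedLines.toDS())
    df.printSchema()
    df.show(5)
  }
}
```

Because the decryption runs inside the flatMap, it executes on the executors and scales with the cluster; just make sure the key is distributed securely rather than hard-coded.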

Use output data from Copy activity in subsequent activity

I have a ForEach activity which uses a Copy activity with an HTTP source and a blob storage sink to download a json file for each item. The HTTP source is set to Binary Copy whereas the blob storage sink is not, since I want to both copy the complete json file to blob storage and also extract some data from each json file and store this data in a database table in the following activity.
In the blob storage sink, I've added a column definition which extracts some data from the json file. The json files are stored in blob storage successfully, but how can I access the extracted data in the subsequent stored procedure activity?
You can try using a Lookup activity to extract the data from the blob and use its output in a later activity. Refer to https://learn.microsoft.com/en-us/azure/data-factory/control-flow-lookup-activity to see whether the Lookup activity satisfies your requirements.