I'm learning about IBM COS and I haven't got a lot of details about one item in the docs. Could you please let me know if we can read the object data (row by row) after storing a .xlsx file in a bucket? Thanks
If you save the .xlsx as .csv the upload to Cloud Object Storage you can query the data in place with IBM SQL Query.
https://cloud.ibm.com/docs/sql-query?topic=sql-query-overview
Related
Recently I face an issue while writing the dataframe data into BigQuery using pyspark. Here it was:
pyspark.sql.utils.IllegalArgumentException: u'Temporary or persistent GCS bucket must be informed
After research the issue I found that Temporary GCS bucket to be mentioned spark.conf.
bucket = "temp_bucket"
spark.conf.set('temporaryGcsBucket', bucket)
I think there is no concept to have a file for a table in Biquery like Hive.
I would like to know more about it, why we need to have temp-gcs-bucket to write the data into bigquery?
I was searching for the reason behind this but I couldn't.
Please clarify.
Spark BigQuery connector has two write modes(writeMethod), 1. Direct 2.Indirect while writing data into BigQuery. This is a optional parameter, default is Indirect.
Indirect
You can specify indirect option like this option("writeMethod","indirect"). Its optional, and Indirect is default. This requires you to specify a temporary gcs bucket, if not you will get the error.
The need of temporary bucket is .
The connector writes the data to BigQuery by first buffering all the
data into a Cloud Storage temporary table. Then it copies all data
from into BigQuery in one operation.
Taken from the GCFS spark example docs here
Direct
In this method the data is written directly to BigQuery using the BigQuery Storage Write API
In scala you can specify like this option("writeMethod","direct"). which eliminates the need for a temporary bucket.
You can read more about the bigquery connector here
We have a ORC file format which are stored in s3 and we want to load the files into AWS Aurora postgres DB .
What we got from internet was :
postgres support csv, txt and other formats not ORC ..
INSERT OVERWRITE DIRECTORY '<Hdfs-Directory-Path>' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE SELECT * FROM default.foo;
Can any one please help us to find a solution?
This date PostgreSQL on Aurora supports ingestion of data from S3 through the COPY command only from TXT and CSV files.
Since your files are in ORC format, you could convert these tiles in either CSV or TXT and then ingest the data. You could do this very easily with Athena, by simply creating a table for your original data and running a SELECT * FROM table query. As explained in the Working with Query Results, Output Files, and Query History
page, this will automatically generate a CSV file containing the results.
This would not be optimal as you’d pay not only the transform price but also the he storage twice (as original ORC and converted CSV), but it would allow you to convert the data pretty easily.
A better way to do it would instead be to use a service like AWS Glue, that supports S3 as source and that has an Aurora connector. Using this method would give you an actual ETL and even if now you just need the E(xtract) and L(oad), would still leave the door open for any kind of transform you might need in the future.
In this AWS Blog titled How to extract, transform, and load data for analytic processing using AWS Glue (Part 2) they show the opposite flow (Aurora->S3 via Glue), but it should still give you an idea of the process.
I am accessing Databricks Delta tables from Azure Data Factory, which does not have a native connector to Databricks tables. So, as a workaround, I create the tables with the LOCATION keyword to store them in Azure Data Lake. Then, since I know the table file location, I just read the underlying Parquet files from Data Factory. This works fine.
But... what if there is cached information in the Delta transaction log that has not yet been written to disk? Say, an application updated a row in the table, and the disk does not yet reflect this fact. Then my read from Data Factory will be wrong.
So, two questions...
Could this happen? Are changes held in the log for a while before being written out?
Can I force a transaction log flush, so I know the disk copy is updated?
Azure Data Factory has built in delta lake support (this was not the case at the time the question was raised).
Delta is available as an inline dataset in a Azure Data Factory data flow activity. To get column metadata, click the Import schema button in the Projection tab. This will allow you to reference the column names and data types specified by the corpus (see also the docs here).
ADF supports Delta Lake format as of July 2020:
https://techcommunity.microsoft.com/t5/azure-data-factory/adf-adds-connectors-for-delta-lake-and-excel/ba-p/1515793
The Microsoft Azure Data Factory team is enabling .. and a data flow connector for data transformation using Delta Lake
Delta is currently available in ADF as a public preview in data flows as an inline dataset.
I am Coping files from Azure blob storage to azure data lake store, I need to pick files from year(folder)\month(folder)\day(txt files are on day bases).I am able to do one file with hadrcoded path but i am not able to pick file per day and process to copy in azure data lake store. Can anyone please help me.
I am using ADF V2 and using UI designer to create my connections,datasets and pipeline my steps are which i is working fine
copy file from blob storage to data lake store
picking that file from data lake store and processing through usql for transform data.
that transform data i am saving in Azure SQL DB
Please give me answer i am not able to get any help b/c all help is in JSON i am looking how i will define and pass parameters in UI designer.
Thanks
For the partitioned file path part, you could take a look at this post.
You could use copy data tool to handle it.
Does the IBM DataWorks Data Load API support CSV files as input source?
The answer is yes. To accomplish this, you have provide the structure of the file in the request payload. This is explained in the API documentation Creating a Data Load Activity. This an excerpt of the documentation:
Within the columns array, specify the columns to provision data
from. If Analytics for Hadoop, Amazon S3, or SoftLayer Object Storage
is the source, you must specify the columns. If you specify columns,
only the columns that you specify are provisioned to the target...
The Data Load application included in DataWorks is provided just as an example and assumes the input file has 2 columns, the first being an INTEGER and the second one a VARCHAR.
Note: This question was answered on dW Answers by user emalaga.