I have just started using Google Cloud Data Fusion and I am looking for a way to checksum a file. There is an MD5/SHA Field Dataset plugin, but it only hashes individual columns, not the whole file. Is there a way to hash the file itself, or any suggestion for this issue? I need it to check for duplicates against the data already in my table.
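This is not a Data Fusion feature, but as a rough sketch of one way to do it outside the plugin (for example in a small script run before the pipeline), you could hash the file's bytes in chunks and compare the digest with the ones you stored for earlier loads. The function names and the idea of keeping the previous digests in a set are assumptions made for illustration only.

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the hex digest of the whole file at `path`."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read in fixed-size chunks so large files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_duplicate(path, known_checksums, algorithm="sha256"):
    """Hypothetical helper: True if this file's digest was already recorded."""
    return file_checksum(path, algorithm) in known_checksums
```

You would then store the digest next to each load (for instance in an extra column or a small tracking table) and skip any file whose digest is already there.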
I have an AS400 with an IBM DB2 database and I need to create a Format Description File (FDF) for each table in the DB. I can create the FDF file using the IBM Export tool but it will only create one file at a time which will take several days to complete. I have not found a way to create the files systematically using a tool or query. Is this possible or should this be done using scripting?
First of all, to correct a misunderstanding...
A Format Description File has nothing at all to do with the format of a Db2 table. It actually describes the format of the data in a stream file that you are uploading into the Db2 table. Sure you can turn on an option during the download from Db2 to create the FDF file, but it's still actually describing the data in the stream file you've just downloaded the data into. You can use the resulting FDF file to upload a modified version of the downloaded data or as the starting point for creating an FDF file that matches the actual data you want to upload.
Which explains why there's no built-in way to create an appropriate FDF file for every table on the system.
I question why you think you actually need to generate an FDF file for every table.
As I recall, the format of the FDF (or its newer variant, FDFX) is pretty simple; it shouldn't be all that difficult to generate if you really wanted to. But I don't have one handy at the moment, and my Google-fu has failed me.
We have ORC files stored in S3 and we want to load them into an AWS Aurora PostgreSQL DB.
What we found on the internet was:
Postgres supports CSV, TXT and other formats, but not ORC.
INSERT OVERWRITE DIRECTORY '<Hdfs-Directory-Path>' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE SELECT * FROM default.foo;
Can anyone please help us find a solution?
To date, PostgreSQL on Aurora supports ingestion of data from S3 through the COPY command only from TXT and CSV files.
Since your files are in ORC format, you could convert these files to either CSV or TXT and then ingest the data. You could do this very easily with Athena, by simply creating a table for your original data and running a SELECT * FROM table query. As explained on the Working with Query Results, Output Files, and Query History page, this will automatically generate a CSV file containing the results.
This would not be optimal, as you’d pay not only the transform price but also for the storage twice (the original ORC and the converted CSV), but it would allow you to convert the data pretty easily.
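Once the CSV is sitting in S3, the load step could look roughly like the sketch below. It only builds the SQL text and assumes the aws_s3 extension is installed on the Aurora PostgreSQL cluster and that the cluster has an IAM role allowed to read the bucket. The table, bucket, key and region values are placeholders, and the helper names are made up for the example.

```python
def quote_literal(value):
    """Wrap a string as a SQL literal, doubling any embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_s3_import_sql(table, bucket, key, region, columns=""):
    """Build a call to aws_s3.table_import_from_s3 for a CSV file with a header row.

    An empty `columns` string means "all columns, in table order".
    """
    options = "(format csv, header true)"
    return (
        "SELECT aws_s3.table_import_from_s3("
        f"{quote_literal(table)}, {quote_literal(columns)}, {quote_literal(options)}, "
        f"aws_commons.create_s3_uri({quote_literal(bucket)}, "
        f"{quote_literal(key)}, {quote_literal(region)})"
        ");"
    )


# Placeholder names, purely illustrative.
print(build_s3_import_sql("foo", "my-bucket", "athena-results/foo.csv", "us-east-1"))
```

You would run the resulting statement from psql or any PostgreSQL client connected to the cluster. The header option is there because Athena result files start with a header row.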
A better way to do it would instead be to use a service like AWS Glue, which supports S3 as a source and has an Aurora connector. Using this method would give you an actual ETL and, even if right now you just need the E(xtract) and L(oad), it would still leave the door open for any kind of transform you might need in the future.
In this AWS Blog titled How to extract, transform, and load data for analytic processing using AWS Glue (Part 2) they show the opposite flow (Aurora->S3 via Glue), but it should still give you an idea of the process.
I'm learning about IBM COS and I haven't got a lot of details about one item in the docs. Could you please let me know if we can read the object data (row by row) after storing a .xlsx file in a bucket? Thanks
If you save the .xlsx as .csv and then upload it to Cloud Object Storage, you can query the data in place with IBM SQL Query.
https://cloud.ibm.com/docs/sql-query?topic=sql-query-overview
I found that Google ML Engine expects data in Cloud Storage, BigQuery, etc. Is there any way to stream data to ML Engine? For example, imagine that I need to use data from a WordPress or Drupal site to create a TensorFlow model, say a spam detector. One way is to export the whole data set as CSV and upload it to Cloud Storage using the google-cloud-php library. The problem here is that, for every minor change, we have to upload the whole data set again. Is there any better way?
By minor change, do you mean "when you get new data, you have to upload everything--the old and new data--again to gcs"? One idea is to export just the new data to gcs on some schedule, making many csv files over time. You can write your trainer to take a file pattern and expand it using get_matching_files/Glob or multiple file paths.
You can also modify your training code to start from an old checkpoint and train over just the new data (which is in its own file) for a few steps.
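To make the file-pattern idea concrete, here is a minimal sketch using only the standard library on local paths. For gs:// paths, TensorFlow's tf.io.gfile.glob plays the same role. The function name and the export file naming are assumptions for the example.

```python
import glob


def expand_patterns(patterns):
    """Expand one or more file patterns into a sorted, de-duplicated list of paths."""
    paths = set()
    for pattern in patterns:
        # glob.glob returns every existing path that matches the pattern.
        paths.update(glob.glob(pattern))
    return sorted(paths)


# Hypothetical layout: one CSV export per scheduled run,
# e.g. exports/comments-2019-01-01.csv, exports/comments-2019-01-02.csv, ...
training_files = expand_patterns(["exports/comments-*.csv"])
```

The trainer then reads every file in the list, so each scheduled export only has to contain the new rows.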
Does the IBM DataWorks Data Load API support CSV files as input source?
The answer is yes. To accomplish this, you have to provide the structure of the file in the request payload. This is explained in the API documentation under Creating a Data Load Activity. This is an excerpt of the documentation:
Within the columns array, specify the columns to provision data
from. If Analytics for Hadoop, Amazon S3, or SoftLayer Object Storage
is the source, you must specify the columns. If you specify columns,
only the columns that you specify are provisioned to the target...
The Data Load application included in DataWorks is provided just as an example and assumes the input file has 2 columns, the first being an INTEGER and the second one a VARCHAR.
Note: This question was answered on dW Answers by user emalaga.