Is there a way to dynamically generate output schema in Data Fusion? - google-cloud-data-fusion

I am using Google Cloud Data Fusion to ingest a CSV file from AWS, apply Wrangler directives, and store the resulting file in GCS. Is there a way to dynamically generate the output schema of the CSV file for Wrangler? I am passing the input path, directives, and output schema as macros to the Argument Setter. I can only see options to import and export the output schema.

Currently there is no way to get the output schema dynamically since the output schema can be wildly different as a result of a parse directive.
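For context on the macro approach mentioned in the question: the Argument Setter plugin reads its runtime arguments from a JSON document served over HTTP, roughly of the shape sketched below (expressed here as a Python dict so it can be annotated). The argument names, paths, directives, and schema are purely illustrative, and the exact payload fields should be checked against the plugin's own documentation; the point is that the output schema still has to be computed ahead of time, which is exactly the limitation described above.

    # Sketch only: a Python dict mirroring the JSON an Argument Setter endpoint
    # could serve. All argument names, paths, directives and schema fields below
    # are made-up examples, not values from the original question.
    import json

    runtime_arguments = {
        "arguments": [
            {"name": "input.path", "value": "s3a://my-bucket/input/data.csv"},
            {"name": "directives", "value": "parse-as-csv :body ',' true\ndrop body"},
            {
                "name": "output.schema",
                # Pre-computed schema, serialized as record-style JSON.
                "value": json.dumps({
                    "name": "etlSchemaBody",
                    "type": "record",
                    "fields": [
                        {"name": "id", "type": "string"},
                        {"name": "price", "type": "double"},
                    ],
                }),
            },
        ]
    }

    print(json.dumps(runtime_arguments, indent=2))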

Related

Foundry convert uploaded CSV to dataframe

How can I convert an uploaded CSV to a dataframe in Foundry using a Code Workbook? Should I use the @transform decorator with spark.read.... (not sure of the exact syntax)?
Thanks!
CSV is a "special format" where Foundry can infer the schema of the CSV and automatically convert it to a dataset. As an example, I uploaded a CSV with hail data from NOAA and let Foundry infer the schema.
If you hit Edit Schema on the main page, you can use a front-end tool to set certain configurations for the CSV (delimiter, column name row, column types, etc.).
Once it's a dataset, you can import the dataset in Code Workbooks and it should work out of the box.
If you want to parse the raw file, I would recommend using Code Authoring rather than Code Workbooks, as it's better suited for production-level code, has less overhead, and is more efficient for this type of workflow (parsing raw files). If you really want to use Code Workbooks, change the type of your input using the input helper bar, or in the Inputs tab.
Once you finish iterating, please move this to a Code Authoring repository and repartition your data. File reads in Code Workbooks can substantially slow down your whole Code Workbook pipeline. Code Authoring now offers previews of raw files, so it's just as fast to develop with as Code Workbooks.
Note: Only imported datasets and persisted datasets can be read in as Python transform inputs. Transforms that are not saved as a dataset cannot be read in as Python transform inputs. Datasets with no schema should be read in as a transform input automatically.
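As a rough illustration of the dataset route, a minimal Foundry Python transform that consumes the inferred-schema dataset could look like the sketch below; both dataset paths are hypothetical placeholders, not paths from the example above.

    # Minimal sketch, assuming the uploaded CSV has already been converted to a
    # dataset with an inferred schema. Both paths are hypothetical placeholders.
    from transforms.api import transform_df, Input, Output

    @transform_df(
        Output("/Project/datasets/hail_clean"),
        hail=Input("/Project/datasets/hail_raw"),
    )
    def compute(hail):
        # 'hail' arrives as a Spark DataFrame with the schema Foundry inferred,
        # so there is no need to call spark.read yourself.
        return hail

If you really do need to parse the raw file yourself, the plain @transform decorator (rather than @transform_df) exposes the input's filesystem so you can open the CSV directly, but as noted above that kind of raw-file parsing is usually better done in Code Authoring.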

Migrate from OrientDB to AWS Neptune

I need to migrate a database from OrientDB to Neptune. I have an exported JSON file from Orient that contains the schema (classes) and the records, and I now need to import this into Neptune. However, it seems that to import data into Neptune there must be a CSV file containing all the vertices and another file containing all the edges.
Are there any existing tools to help with this migration and converting to the required files/format?
If you are able to export the data as GraphML then you can use the GraphML2CSV tool. It will create a CSV file for the nodes and another for the edges with the appropriate header rows.
Keep in mind that GraphML is a lossy format (it cannot describe complex Java types the way GraphSON can), but you would not be able to import those into Neptune either.
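For reference, the Neptune bulk loader expects the Gremlin CSV format with ~-prefixed system columns and property:Type headers; the sketch below simply writes two tiny files in that shape (all values are invented examples, not data from the question).

    import csv

    # Sketch of the Neptune Gremlin bulk-load CSV layout: one file for vertices,
    # one for edges. The ~id/~label/~from/~to system columns and the
    # "property:Type" header syntax are the documented format; the rows are
    # made-up examples.
    with open("vertices.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["~id", "~label", "name:String", "age:Int"])
        writer.writerow(["v1", "Person", "Alice", "34"])
        writer.writerow(["v2", "Person", "Bob", "29"])

    with open("edges.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["~id", "~from", "~to", "~label", "since:Int"])
        writer.writerow(["e1", "v1", "v2", "knows", "2015"])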

How to capture schema of a JSON file using Talend

I'm trying to capture the schema of a JSON file that I am generating from a SQL database using Talend. I need to store this schema in a separate file. Does anyone know of a way to capture this?
With the Metadata section in the Repository, you can create a JSON file schema. There you can import an example JSON file; Talend will then generate a schema that you can reuse in the output of your job, for example in a tWriteJSONField component.

Load ORC format to Aurora Postgres DB

We have ORC files stored in S3 and we want to load them into an AWS Aurora PostgreSQL DB.
What we found on the internet was:
Postgres supports CSV, TXT and other formats, but not ORC.
INSERT OVERWRITE DIRECTORY '<Hdfs-Directory-Path>' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE SELECT * FROM default.foo;
Can anyone please help us find a solution?
To date, PostgreSQL on Aurora supports ingesting data from S3 through the COPY command only from TXT and CSV files.
Since your files are in ORC format, you could convert them to either CSV or TXT and then ingest the data. You could do this very easily with Athena, by simply creating a table for your original data and running a SELECT * FROM table query. As explained in the Working with Query Results, Output Files, and Query History page, this will automatically generate a CSV file containing the results.
This would not be optimal, as you'd pay not only for the transform but also for the storage twice (the original ORC and the converted CSV), but it would allow you to convert the data pretty easily.
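If you script the Athena route, the conversion boils down to running the SELECT and pointing the results at an S3 location; a rough boto3 sketch (region, database, table and bucket names are all placeholders) would be:

    import boto3

    # Sketch: run a plain SELECT over the Athena table defined on the ORC files.
    # Athena writes the query results to the output location as a CSV file, which
    # can then be ingested into Aurora PostgreSQL as described above.
    # All names below (region, database, table, bucket) are placeholders.
    athena = boto3.client("athena", region_name="us-east-1")

    response = athena.start_query_execution(
        QueryString="SELECT * FROM my_orc_table",
        QueryExecutionContext={"Database": "my_database"},
        ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
    )

    # Poll athena.get_query_execution(QueryExecutionId=...) until the query
    # reaches the SUCCEEDED state, then pick up the CSV from the output location.
    print(response["QueryExecutionId"])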
A better way would instead be to use a service like AWS Glue, which supports S3 as a source and has an Aurora connector. This method would give you an actual ETL pipeline, and even if for now you just need the E(xtract) and L(oad), it would still leave the door open for any kind of transform you might need in the future.
In the AWS blog post titled How to extract, transform, and load data for analytic processing using AWS Glue (Part 2) they show the opposite flow (Aurora -> S3 via Glue), but it should still give you an idea of the process.
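A minimal Glue job along those lines might look like the sketch below; the database, table and connection names are assumptions, and the blog post covers the actual setup of the catalog and the Aurora connection.

    # Sketch of a Glue PySpark job: read the ORC-backed Glue catalog table from S3
    # and write it to Aurora PostgreSQL through a pre-configured JDBC connection.
    # All names (my_database, my_orc_table, my-aurora-connection,
    # public.target_table) are placeholders.
    from awsglue.context import GlueContext
    from pyspark.context import SparkContext

    glue_context = GlueContext(SparkContext.getOrCreate())

    source = glue_context.create_dynamic_frame.from_catalog(
        database="my_database",
        table_name="my_orc_table",
    )

    glue_context.write_dynamic_frame.from_jdbc_conf(
        frame=source,
        catalog_connection="my-aurora-connection",
        connection_options={"dbtable": "public.target_table", "database": "postgres"},
    )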

Storing & reading custom metadata in parquet files using Spark / Scala

I know Parquet files store metadata, but is it possible to add custom metadata to a Parquet file using Spark, preferably in Scala?
The idea is that I store many similarly structured Parquet files in Hadoop storage, but each has a uniquely named source (a String field, also present as a column in the Parquet file). However, I'd like to access this information without the overhead of actually reading the Parquet data, and possibly even remove this redundant column from the Parquet file.
I really don't want to put this info in a filename, so my best option right now is just to read the first row of each Parquet file and use its source column as a String field.
It works, but I was just wondering if there is a better way.
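For context, the workaround described above (reading just the source column from each file instead of encoding it in the filename) amounts to something like the following; it is written as a PySpark sketch even though the question prefers Scala, and the path and column name are placeholders. Thanks to Parquet column pruning, only that one column is actually read.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Read only the 'source' column from one file and take a single row; the path
    # and column name are placeholders standing in for the question's setup.
    row = (
        spark.read.parquet("hdfs:///data/similar_files/file_001.parquet")
        .select("source")
        .first()
    )
    source_name = row["source"] if row is not None else None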