Foundry: convert an uploaded CSV to a DataFrame - PySpark

How can I convert an uploaded CSV to a DataFrame in Foundry using a Code Workbook? Should I use the @transform decorator with spark.read... (not sure of the exact syntax)?
Thx!!

CSV is a "special" format: Foundry can infer the schema of the CSV and automatically convert it to a dataset. In this example, I uploaded a CSV with hail data from NOAA.
If you hit Edit Schema on the main page, you can use a front-end tool to set certain configurations for the CSV (delimiter, column-name row, column types, etc.).
Once it's a dataset, you can import it in Code Workbooks and it should work out of the box.
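For illustration, in a Code Workbook the imported dataset simply arrives as a Spark DataFrame argument to your transform function, so no spark.read call is needed. A minimal sketch (the alias hail_data and the column names are hypothetical):

```python
# Code Workbook transform: the imported dataset is passed in as a Spark DataFrame.
def hail_events_clean(hail_data):
    # Plain PySpark from this point on (column names are illustrative only).
    return hail_data.dropna(subset=["BEGIN_LAT", "BEGIN_LON"])
```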
If you want to parse the raw file, I would recommend Code Authoring rather than Code Workbooks: it's better suited for production-level code, has less overhead, and is more efficient for this type of workflow (parsing raw files). If you really want to use Code Workbooks, change the type of your input using the input helper bar or in the Inputs tab.
Once you finish iterating, please move this to a Code Authoring repository and repartition your data. File reads in a Code Workbook can substantially slow down your whole Code Workbook pipeline. Code Authoring now offers a preview of raw files, so it's just as fast to develop with as Code Workbooks.
Note: only imported and persisted datasets can be read in as Python transform inputs; transforms that are not saved as a dataset cannot. Datasets with no schema should be read in as a transform input automatically.
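If you do go the Code Authoring route and need to parse the raw CSV yourself, a minimal sketch of a raw-file transform might look like the following (the dataset paths, the glob pattern, and the header/inferSchema options are assumptions; check the transforms documentation for the exact filesystem API available on your stack):

```python
from transforms.api import transform, Input, Output


@transform(
    parsed=Output("/Project/datasets/hail_parsed"),   # hypothetical output path
    raw=Input("/Project/datasets/hail_csv_upload"),   # hypothetical schemaless upload
)
def parse_raw_csv(ctx, parsed, raw):
    fs = raw.filesystem()
    # Collect the Hadoop paths of the CSV files inside the schemaless dataset.
    paths = [f"{fs.hadoop_path}/{f.path}" for f in fs.ls(glob="*.csv")]
    df = (
        ctx.spark_session.read
        .option("header", "true")       # assumes the first row holds column names
        .option("inferSchema", "true")
        .csv(paths)
    )
    parsed.write_dataframe(df)
```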

Related

Synapse mapping data flow with a parameterized dynamic source: need to import the projection dynamically

I am trying to build a cloud data warehouse where I have staged the on-prem tables as parquet files in a data lake.
I implemented a metadata-driven incremental load.
In the data flow, I am trying to implement a merge query, passing the table name as a parameter so that the data flow dynamically locates the respective parquet files for the full data and the incremental data, and then goes through some ETL steps to implement the merge query.
The merge query is working fine, but I found that the projection is not correct. As the source files are dynamic, I also want to "import projection" dynamically at runtime, so that the same data flow can be used to implement the merge query for any table.
In the picture, it shows 104 columns (a static projection imported at development time); for this table it should actually be 38 columns.
Can I dynamically (i.e. at run time) assign the projection? If so, how?
Or does anyone have a suggestion regarding this?
Thanks,
Muntasir Joarder
Enable schema drift in your source transformation when the metadata changes often. This removes or adds columns at run time.
The source projection displays what was imported at design time, but with schema drift enabled the actual columns are resolved from the source schema at run time.
Refer to this document for more details with examples.

Migrate from OrientDB to AWS Neptune

I need to migrate a database from OrientDB to Neptune. I have an exported JSON file from OrientDB that contains the schema (classes) and the records, and I now need to import this into Neptune. However, it seems that to import data into Neptune there must be a CSV file containing all the vertices and another file containing all the edges.
Are there any existing tools to help with this migration and converting to the required files/format?
If you are able to export the data as GraphML then you can use the GraphML2CSV tool. It will create a CSV file for the nodes and another for the edges with the appropriate header rows.
Keep in mind that GraphML is a lossy format (it cannot describe complex Java types the way GraphSON can), but you would not be able to import those into Neptune either.

Is there a way to dynamically generate output schema in Data Fusion?

I am using Google Cloud Data Fusion to ingest a CSV file from AWS, apply directives, and store the resulting file in GCS. Is there a way to dynamically generate the output schema of the CSV file for Wrangler? I am passing the input path, directives, and output schema as macros to the Argument Setter. I can see options only to import and export the output schema.
Currently there is no way to get the output schema dynamically, since the output schema can be wildly different as a result of a parse directive.

How to import a CSV file as a dataset to Power BI using the REST API?

I want to automate the import process in Power BI, but I can't find how to publish a CSV file as a dataset.
I'm using a C# solution for this.
Is there a way to do that?
You can't directly import CSV files into a published dataset in the Power BI Service. The AddRowsAPIEnabled property of datasets published from Power BI Desktop is false, i.e. this API is disabled. Currently the only way to enable it is to create a push dataset programmatically using the API (or create a streaming dataset from the site). You will then be able to push rows to it (read the CSV file and push batches of rows, either with C# or some other language, even PowerShell) and to create reports using this dataset. However, there are a lot of limitations, and you have to take care of cleaning up the dataset yourself (to avoid reaching the limit of 5 million rows; you can't delete "some" of the rows, only truncate the whole dataset), or make it basicFIFO, which lowers the limit to 200k rows.
However, a better solution would be to automate the import of these CSV files into a database and make the report read the data from there. For example, import these files into Azure SQL Database or Databricks and use that as the data source for your report. You can then schedule the refresh of this dataset (if you use Import mode) or use DirectQuery.
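As a sketch of the push-dataset route (shown here in Python with requests for brevity; the same REST calls can be made from C# with HttpClient, and the dataset name, table name, CSV file, and pre-acquired AAD token are all assumptions):

```python
import csv
import requests

API = "https://api.powerbi.com/v1.0/myorg"
headers = {"Authorization": "Bearer <AAD_ACCESS_TOKEN>"}  # token acquisition not shown

# 1. Create a push dataset with an explicit schema.
dataset_def = {
    "name": "SalesPush",                      # hypothetical dataset name
    "defaultMode": "Push",
    "tables": [{
        "name": "Sales",
        "columns": [
            {"name": "OrderDate", "dataType": "DateTime"},
            {"name": "Amount", "dataType": "Double"},
        ],
    }],
}
dataset_id = requests.post(f"{API}/datasets", json=dataset_def, headers=headers).json()["id"]

# 2. Read the CSV and push its rows in batches.
#    CSV values arrive as strings; cast them as needed to match the column types.
with open("sales.csv", newline="") as f:      # hypothetical file
    rows = list(csv.DictReader(f))
for i in range(0, len(rows), 10000):          # the rows API accepts at most 10,000 rows per request
    requests.post(
        f"{API}/datasets/{dataset_id}/tables/Sales/rows",
        json={"rows": rows[i:i + 10000]},
        headers=headers,
    )
```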
After a Power BI update, it is now possible to import the dataset without importing the whole report.
So what I do is import the new dataset and update the parameters that I set up for the CSV file source (stored in the data lake).
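If you follow that approach, the parameter update itself can also be done through the REST API; a minimal sketch (the dataset ID, parameter name, and file path are placeholders, and this assumes the parameter is defined in the dataset's Power Query):

```python
import requests

API = "https://api.powerbi.com/v1.0/myorg"
headers = {"Authorization": "Bearer <AAD_ACCESS_TOKEN>"}

# Point the CSV-source parameter at a new file in the data lake, then trigger a refresh.
requests.post(
    f"{API}/datasets/<DATASET_ID>/Default.UpdateParameters",
    json={"updateDetails": [{"name": "CsvFilePath", "newValue": "2024-06/sales.csv"}]},
    headers=headers,
)
requests.post(f"{API}/datasets/<DATASET_ID>/refreshes", headers=headers)
```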

SAS Enterprise Guide - importing date data that contains gibberish as well

Regarding SAS Enterprise Guide and importing data. I have a column of data, some of which is in a date format and some of which is gibberish. Is there a way to force convert the column to date format (MMDDYY10.) and convert all the gibberish (non date format) information to blanks?
Your question lacks some information, but I assume that you are using proc import to read the file. That procedure doesn't provide any control over how each column is read, as it internally determines the type and structure of the data. In your case, if the file you are importing is a text file such as a CSV and the internal scan cannot read the column as numeric, then it reads it as character to preserve the information, which is the right thing to do. The main purpose of proc import is to create a SAS data set from an external file format. You can post-process the data after the initial import.
Another option is to use the Import Wizard, which steps you through a series of dialogs to gather information about the file you are importing. One of the steps allows you to specify the type and format of the data in each column. For the date column in question, you can specify the appropriate date informat and get the desired result in your output data set. Additionally, a final step in the wizard gives you the option to save the import code to a file so that you can use it in the future to import more data from a similarly structured file. The code is a data step with the appropriate infile, informat, format, and input statements built from the information gathered in the wizard.
Of course, the ultimate in control is to write the data step yourself. I prefer to do this even with simple, well-structured files, because I can precisely control the type and length of variables and have the full power of the data step at my disposal. Once I develop a working version of the code, I find it's easier to modify in the future to adapt to changes in the input file.
Hopefully, this gives you options to solve your date problem.