PySpark DataFrame to CSV file with GML format

Is there a way to quickly save a DataFrame to a CSV file that I can then modify so that it is in GML format?
My strategy, for now, is to save the file as a standard CSV file and then modify that file.
I appreciate any help you can provide.
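A minimal sketch of that strategy, assuming a small DataFrame and a local file system; the GML node template at the end is only an assumption, since the target structure is not defined here:
import glob
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

# Step 1: write one standard CSV part file (coalesce(1) is fine for small data only)
df.coalesce(1).write.mode("overwrite").option("header", True).csv("/tmp/df_csv")

# Step 2: rewrite that CSV into the target format
# (the GML template below is an assumption -- adapt it to your schema)
part_file = glob.glob("/tmp/df_csv/part-*.csv")[0]
with open(part_file) as src, open("/tmp/df.gml", "w") as dst:
    dst.write("graph [\n")
    next(src)  # skip the CSV header line
    for line in src:
        node_id, label = line.rstrip("\n").split(",")
        dst.write(f'  node [ id {node_id} label "{label}" ]\n')
    dst.write("]\n")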

Related

pyspark read delta csv file by date

I have several CSV files in a folder; please refer to the screenshot below.
The files with '20221205' are delta files that were newly uploaded into the folder today.
I want to read only these 2 delta CSV files, do some transformation, and then append to an existing table.
Every day I will upload 2 files with the current date as a suffix, then run the notebook to handle only the files uploaded that day.
Question: how can I read only today's files with PySpark? How should I load these delta files?
What you call delta is actually a normal CSV file with a different suffix, not to be confused with the Delta Lake data format.
You can match that suffix using glob patterns: simply put the date into the path string and Spark will read only the files ending with that date:
spark.read.csv("path/to/folder/*20221205.csv")
However, if possible, I recommend storing the CSVs partitioned in your file system. This means each date is in a separate folder.
The file system will look something like:
folder
  date=2022-01-01
  date=2022-01-02
  ...
Then you can simply:
spark.read.csv('folder').filter(col('date') == '2022-01-02')
The filter on the date will take milliseconds since the data is partitioned: behind the scenes, Spark knows that CSVs with date = X are stored ONLY in the date=X folder.
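To show the whole round trip, a minimal sketch of writing and then reading such a partitioned layout (the column names and the 'folder' path are assumptions):
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("2022-01-01", "a"), ("2022-01-02", "b")], ["date", "value"])

# partitionBy creates the folder/date=.../ subdirectories automatically
df.write.partitionBy("date").mode("append").csv("folder")

# On read, the filter on the partition column is pushed down, so only
# the folder/date=2022-01-02/ directory is actually scanned
daily = spark.read.csv("folder").filter(col("date") == "2022-01-02")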

Upload multiple files to pentaho

In Pentaho Data Integration, how do I import a list of xlsx files that are in the same folder?
Note: the number of columns is always the same.
If your Excel column names and sheet name are always the same, then you can use THIS solution. Here I take every xlsx file from the source folder and convert the files one by one to CSV.
But if your Excel column names and sheet name are dynamic, or you need a more dynamic solution, then you can use my other Stack Overflow solution from Here.
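Outside Pentaho, the same one-by-one conversion can be sketched in Python with pandas, assuming a fixed sheet name and uniform columns (the folder and sheet names are placeholders):
from pathlib import Path
import pandas as pd

# Convert every .xlsx file in the source folder to a .csv next to it
source = Path("source_folder")
for xlsx in source.glob("*.xlsx"):
    df = pd.read_excel(xlsx, sheet_name="Sheet1")
    df.to_csv(xlsx.with_suffix(".csv"), index=False)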

How can we Exclude the unnecessary rows from Excel File while doing Data Load using Copy activity in ADF

I have an Excel file which is semi-structured. There is data in a table, but there are dividers in certain rows that need to be ignored.
The processing of the data should start with the column headers (Col1, Col2, ...) and only process the rows with actual data.
Could anyone suggest a way to achieve this using the Copy activity in ADF?
My source is an xls file and the target is ADLA (Parquet file).
Any help appreciated. Thanks in advance.
The closest solution is that you need to manually choose the data range in the Excel file:
Ref: https://learn.microsoft.com/en-us/azure/data-factory/format-excel#dataset-properties
HTH.
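For comparison, the same trimming outside ADF is easy to sketch in pandas; the header row index and the divider-row test are assumptions about the file's layout:
import pandas as pd

# Take the row holding Col1, Col2, ... as the header (row index is an assumption)
header_row = 3
df = pd.read_excel("input.xls", header=header_row)

# Divider rows usually leave the first data column empty, so drop those
df = df.dropna(subset=[df.columns[0]])
df.to_parquet("output.parquet", index=False)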

Dataprep : Invalid array type after run job to excel file

I am trying to use an array-type column in Dataprep and it looks good in the Dataprep display UI, as in the picture below.
But when I run the job with .csv file output, there are invalid values in the array column.
Why is the .csv output different from the Dataprep display?
Array in Dataprep display
Array in csv output
It looks like these two columns each contain the complete record...? I also see some non-English characters in there. I suspect something to do with line breaks and/or encoding.
What do you see if you open the CSV file in a plaintext editor, instead of Excel?
What edition of Dataprep are you using (click Help => About Dataprep => see the Edition heading)?
What version of Excel are you using to open the CSV file?
Assuming that this is a straightforward flow with a single dataset and recipe, could you post a few rows of data and the recipe itself (which you can download), for testing purposes?
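A quick way to do that plaintext check programmatically is to dump the raw bytes of the first few lines, which makes stray line breaks and encoding issues visible (the file name is a placeholder):
# Print raw bytes so embedded newlines and non-UTF-8 bytes show up literally
with open("output.csv", "rb") as f:
    for _ in range(5):
        print(f.readline())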

Talend Open Studio DI: Replace content of one column of .xlsx file with another column of .csv file

I have two input files:
an .xlsx file that looks like this:
a .csv file that looks like this:
I already have a talend job that transforms the .xlsx file into an .xml file.
One node in the .xml file contains the
<stockLocationCode>SL213</stockLocationCode>
The output .xml file looks like this:
Now I need to replace every occurrence of the stockLocationCode with the second column of the .csv file. In this case the result would be:
My talend job looks like this:
I use a tMap component to put the columns of the .xlsx file into the right node of the output xml file.
But I do not know how I can replace the stockLocationCode with the actual full stock location using the .csv file. I tried to also map the .csv file with the tMap component.
I would need to build in a method that looks at the current value of the node <stockLocationCode>, loops over the whole .csv file until it finds it in the first column, and then replaces the <stockLocationCode> content with the content of the second column of the .csv file.
Performance is not important ;)
First, you'll need a lookup in e.g. a tMap or tXMLMap component, where you map your keys and add a new column with the second column of the CSV file.
The resulting columns would look like this:
Product; Stock Location Code; CSV 2nd column data
Now in a second map you could just remove the stock location code and do the rest of your job.
Voila, you exchanged the columns.
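Outside Talend, the lookup itself is just a dictionary join; a sketch in Python (the file name and column order are assumptions):
import csv

# Build code -> full location from the CSV, then look up each code
with open("locations.csv", newline="") as f:
    lookup = {row[0]: row[1] for row in csv.reader(f)}

code = "SL213"
full_location = lookup.get(code, code)  # fall back to the code if unmatched
print(full_location)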
You can use tXMLMap with a lookup.