How to read and write to the same excel file in Talend in the same subjob? - talend

I have a sub job that reads from an excel file and then uses a tMap to edit some of the rows and then outputs it to another excel file. I need to be able to read and write to the same excel file. Is this possible? Maybe there is a component that I don't know about that can store what is in the tMap and in another sub job I can write that content to the same excel file?

You can do that by add a step.
You need to output in another Excel file in the first step.
Then subjob OK
Then input your temporary file in the first one.
Then delete the temporary file.
TInputExcel (first file) -> tmap -> toutputexcel (Tempory file)
|
subjob ok
|
TInputExcel (temporary file) -> tmap -> toutputexcel (first file)
|
tfiledelete (temporary file)

Related

Requiremet of Error logging mail(mail should contain all the missing file details). in the ADF flow

the reqirement is simple , i have a folder having 4 txt files(1.txt,2.txt,3.txt,4.txt) . the Flow is controlled by a parameter called all or some which is of string type.
If i select all in the parameter, all 4 file should be processed. the requirement starts here >>
IF any file is missing from the folder(for ex 2.txt and 3.txt is not present and i selected ALL in the parameter) , i need a mail saying file is 2.txt and 3.txt is missing.
If i select some in the parameter, for ex 1.txt and 4.txt and if any of the file is missing 1.txt and 4.txt is missing(for example 1.txt is missing) , i need a mail with the missing file name(i.e 1.txt in our case).
capture missing file details in one variable
I tried to repro this capturing missing files using azure data factory. Below is the approach.
Take a parameter of array type in the pipeline. During runtime, you can give the list of file names in a folder to be processed in this array parameter.
Take a get metadata activity and add a dataset in the activity. Click +New in the field list and select child items as an argument.
Take a filter activity and give the array parameter value in items and write the condition to filter the missing files in condition box
item:
#pipeline().parameters.AllorSome
condition:
#not(contains(string(activity('Get Metadata1').output.childItems),item()))
I tried to run this pipeline. During run time, four file names are given to the array parameter.
Get Metadata activity output has three file names.
Parameter has 4 filenames and Get meta data activity has 3 filenames. Missing file names are to be filtered out.
Output of filter activity:
Use this output and send it in email.
Refer the MS document on How to send email - Azure Data Factory & Azure Synapse | Microsoft Learn for sending email.

How can I have nice file names & efficient storage usage in my Foundry Magritte dataset export?

I'm working on exporting data from Foundry datasets in parquet format using various Magritte export tasks to an ABFS system (but the same issue occurs with SFTP, S3, HDFS, and other file based exports).
The datasets I'm exporting are relatively small, under 512 MB in size, which means they don't really need to be split across multiple parquet files, and putting all the data in one file is enough. I've done this by ending the previous transform with a .coalesce(1) to get all of the data in a single file.
The issues are:
By default the file name is part-0000-<rid>.snappy.parquet, with a different rid on every build. This means that, whenever a new file is uploaded, it appears in the same folder as an additional file, the only way to tell which is the newest version is by last modified date.
Every version of the data is stored in my external system, this takes up unnecessary storage unless I frequently go in and delete old files.
All of this is unnecessary complexity being added to my downstream system, I just want to be able to pull the latest version of data in a single step.
This is possible by renaming the single parquet file in the dataset so that it always has the same file name, that way the export task will overwrite the previous file in the external system.
This can be done using raw file system access. The write_single_named_parquet_file function below validates its inputs, creates a file with a given name in the output dataset, then copies the file in the input dataset to it. The result is a schemaless output dataset that contains a single named parquet file.
Notes
The build will fail if the input contains more than one parquet file, as pointed out in the question, calling .coalesce(1) (or .repartition(1)) is necessary in the upstream transform
If you require transaction history in your external store, or your dataset is much larger than 512 MB this method is not appropriate, as only the latest version is kept, and you likely want multiple parquet files for use in your downstream system. The createTransactionFolders (put each new export in a different folder) and flagFile (create a flag file once all files have been written) options can be useful in this case.
The transform does not require any spark executors, so it is possible to use #configure() to give it a driver only profile. Giving the driver additional memory should fix out of memory errors when working with larger datasets.
shutil.copyfileobj is used because the 'files' that are opened are actually just file objects.
Full code snippet
example_transform.py
from transforms.api import transform, Input, Output
import .utils
#transform(
output=Output("/path/to/output"),
source_df=Input("/path/to/input"),
)
def compute(output, source_df):
return utils.write_single_named_parquet_file(output, source_df, "readable_file_name")
utils.py
from transforms.api import Input, Output
import shutil
import logging
log = logging.getLogger(__name__)
def write_single_named_parquet_file(output: Output, input: Input, file_name: str):
"""Write a single ".snappy.parquet" file with a given file name to a transforms output, containing the data of the
single ".snappy.parquet" file in the transforms input. This is useful when you need to export the data using
magritte, wanting a human readable name in the output, when not using separate transaction folders this should cause
the previous output to be automatically overwritten.
The input to this function must contain a single ".snappy.parquet" file, this can be achieved by calling
`.coalesce(1)` or `.repartition(1)` on your dataframe at the end of the upstream transform that produces the input.
This function should not be used for large dataframes (e.g. those greater than 512 mb in size), instead
transaction folders should be enabled in the export. This function can work for larger sizes, but you may find you
need additional driver memory to perform both the coalesce/repartition in the upstream transform, and here.
This produces a dataset without a schema, so features like expectations can't be used.
Parameters:
output (Output): The transforms output to write the single custom named ".snappy.parquet" file to, this is
the dataset you want to export
input (Input): The transforms input containing the data to be written to output, this must contain only one
".snappy.parquet" file (it can contain other files, for example logs)
file_name: The name of the file to be written, if the ".snappy.parquet" will be automatically appended if not
already there, and ".snappy" and ".parquet" will be corrected to ".snappy.parquet"
Raises:
RuntimeError: Input dataset must be coalesced or repartitioned into a single file.
RuntimeError: Input dataset file system cannot be empty.
Returns:
void: writes the response to output, no return value
"""
output.set_mode("replace") # Make sure it is snapshotting
input_files_df = input.filesystem().files() # Get all files
input_files = [row[0] for row in input_files_df.collect()] # noqa - first column in files_df is path
input_files = [f for f in input_files if f.endswith(".snappy.parquet")] # filter non parquet files
if len(input_files) > 1:
raise RuntimeError("Input dataset must be coalesced or repartitioned into a single file.")
if len(input_files) == 0:
raise RuntimeError("Input dataset file system cannot be empty.")
input_file_path = input_files[0]
log.info("Inital output file name: " + file_name)
# check for snappy.parquet and append if needed
if file_name.endswith(".snappy.parquet"):
pass # if it is already correct, do nothing
elif file_name.endswith(".parquet"):
# if it ends with ".parquet" (and not ".snappy.parquet"), remove parquet and append ".snappy.parquet"
file_name = file_name.removesuffix(".parquet") + ".snappy.parquet"
elif file_name.endswith(".snappy"):
# if it ends with just ".snappy" then append ".parquet"
file_name = file_name + ".parquet"
else:
# if doesn't end with any of the above, add ".snappy.parquet"
file_name = file_name + ".snappy.parquet"
log.info("Final output file name: " + file_name)
with input.filesystem().open(input_file_path, "rb") as in_f: # open the input file
with output.filesystem().open(file_name, "wb") as out_f: # open the output file
shutil.copyfileobj(in_f, out_f) # write the file into a new file
You can also use the rewritePaths functionality of the export plugin, to rename the file under spark/*.snappy.parquet file to "export.parquet" while exporting. This of course only works if there is only a single file, so .coalesce(1) in the transform is a must:
excludePaths:
- ^_.*
- ^spark/_.*
rewritePaths:
'^spark/(.*[\/])(.*)': $1/export.parquet
uploadConfirmation: exportedFiles
incrementalType: snapshot
retriesPerFile: 0
bucketPolicy: BucketOwnerFullControl
directoryPath: features
setBucketPolicy: true
I ran into the same requirement the only difference was that the dataset required to be split into multiple parts due to the size. Posting here the code and how I have updated it to handle this use case.
def rename_multiple_parquet_outputs(output: Output, input: list, file_name_prefix: str):
"""
Slight improvement to allow multiple output files to be renamed
"""
output.set_mode("replace") # Make sure it is snapshotting
input_files_df = input.filesystem().files() # Get all files
input_files = [row[0] for row in input_files_df.collect()] # noqa - first column in files_df is path
input_files = [f for f in input_files if f.endswith(".snappy.parquet")] # filter non parquet files
if len(input_files) == 0:
raise RuntimeError("Input dataset file system cannot be empty.")
input_file_path = input_files[0]
print(f'input files {input_files}')
print("prefix for target name: " + file_name_prefix)
for i,f in enumerate(input_files):
with input.filesystem().open(f, "rb") as in_f: # open the input file
with output.filesystem().open(f'{file_name_prefix}_part_{i}.snappy.parquet', "wb") as out_f: # open the output file
shutil.copyfileobj(in_f, out_f) # write the file into a new file
Also to use this into a code workbook the input needs to be persisted and the output parameter can be retrieved as shown below.
def rename_outputs(persisted_input):
output = Transforms.get_output()
rename_parquet_outputs(output, persisted_input, "prefix_for_renamed_files")

spark - write to separate files by key

My code:
df.write.partitionBy("col").format("json").save(output)
then my output:
col=1
col=2
col=3
I need a csv format, but when I changed the format("json") to format("csv") I don't have separate files, but one file with all data
part-00000
why? and how can I fix it?

What is the ideal way to take backup of a configuration file using perl script, which you are going to edit

I am trying to create a script that manipulates a configuration file . So I need to take back up of the existing configuration file , in case there is any problem during manipulation the contents of the back up file should replace the contents of the configuration file . also when rollback is given as argument to the script the contents of the backup file should replace configuration file .
Typically I'd create a file with a name based on the original file's name:
my $file = 'input.txt';
my $new_file = "$file.new";
Start reading lines from the input file, and manipulate them as necessary before writing them to the new file.
When you reach the end-of-file of the input file, close them both. Rename the input file to "$file.old", then rename the new file to the old name $file.
You want to keep the original file intact as long as possible so it remains available in case something fails during processing.
If you have to roll back, reverse the rename process if processing completed. If processing didn't complete just delete the new file.

How can I copy columns from several files into the same output file using Perl

This is my problem.
I need to copy 2 columns each from 7 different files to the same output file.
All input and output files are CSV files.
And I need to add each new pair of columns beside the columns that have already been copied, so that at the end the output file has 14 columns.
I believe I cannot use
open(FILEHANDLE,">>file.csv").
Also all 7 CSV files have nearlly 20,000 rows each, therefore I'm reading and writing the files line by line.
It would be a great help if you could give me an idea as to what I should do.
Thanx a lot in advance.
Provided that your lines are 1:1 (Meaning you're combining data from line 1 of File_1, File_2, etc):
open all 7 files for input
open output file
read line of data from all input files
write line of combined data to output file
Text::CSV is probably the way to access CSV files.
You could define a csv handler for each file (including output), use getline or getline_hr (returns hashref) methods to fetch data, combine it into arrayrefs, than use print.