Apache NiFi: MergeContent does not merge files properly

I have the following flow:
I get data from Salesforce, but there is a lot of it (10000 records), so I have to use the nextUrl attribute to fetch the next batches of 1000 records each. That is why the first file comes from one flow and all the others come from another flow (the one that uses the nextUrl attribute).
In total, 1 file comes from the 1st flow and 8 files come from the 2nd flow. I then split the JSON, so there are 1000 records from the first flow and 8000 records from the second flow.
Then I have to merge the 1000 + 8000 records into 1 file. But the result is not 1 file but 9 files (just as at the beginning, before the JSON was split). Only when I add one more MergeContent processor do I finally get 1 file instead of 9.
I have the following configuration for MergeContent:
Merge Strategy: Bin-Packing Algorithm;
Merge Format: Binary Concatenation;
Attribute Strategy: Keep Only Common Attributes.
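For reference, how many FlowFiles end up in each merged file is also capped by properties not listed above; the values shown are the processor defaults and may differ between NiFi versions, which may be relevant when roughly 9000 split FlowFiles arrive at once:
Minimum Number of Entries: 1;
Maximum Number of Entries: 1000;
Correlation Attribute Name: (not set).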
I have tried other parameters as well, but all the data is combined into 1 file only when I use the second MergeContent.
Please help me figure out why this happens and whether there is a way to use only one MergeContent processor.
If you need any additional information, just let me know.

Related

How to handle exceptions in Spring Batch for CSV -> DB?

I came across this link: How to handle exception and skip the wrong csv line as well?
The question is marked as answered, but I don't understand the answer clearly.
I have a similar situation:
a. I am processing a CSV file with, say, 100 records.
b. The first 10 are valid, the next 10 are invalid data, the next 10 are valid data but fail business rules, and the next 10 have no business errors but hit a DB size-limit error.
How can I generate one single error CSV file covering all these cases, where the error CSV holds each failed CSV row's data, with the failure reason as one column, the file name as another column, the row number, and so on?
I can't find a good example for this scenario.
Can someone help me with one, or explain how the answer at the link above handles this case?
General Link for Reference: https://docs.spring.io/spring-batch/docs/1.1.x/spring-batch-docs/reference/html/execution.html
Thanks.
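One common way to cover all three failure cases is Spring Batch's skip mechanism: mark the step fault-tolerant, declare which exceptions are skippable, and register a SkipListener that records each skipped row. A minimal sketch (class name, CSV layout and file paths are illustrative, not taken from the original post) might look like this:

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

import org.springframework.batch.core.SkipListener;
import org.springframework.batch.item.file.FlatFileParseException;

// Sketch: append every skipped record to one error CSV, together with the
// failure reason, the source file name and the line number.
// Object is used as the item type here; in a real job these would be your item classes.
public class CsvErrorSkipListener implements SkipListener<Object, Object> {

    private final String errorCsvPath;    // e.g. "errors.csv" (illustrative)
    private final String sourceFileName;  // the CSV file currently being processed

    public CsvErrorSkipListener(String errorCsvPath, String sourceFileName) {
        this.errorCsvPath = errorCsvPath;
        this.sourceFileName = sourceFileName;
    }

    // Parse failures (the "invalid data" rows): FlatFileParseException carries
    // the raw line and its line number.
    @Override
    public void onSkipInRead(Throwable t) {
        if (t instanceof FlatFileParseException) {
            FlatFileParseException e = (FlatFileParseException) t;
            writeError(e.getInput(), "parse error: " + e.getMessage(), e.getLineNumber());
        } else {
            writeError("", "read error: " + t.getMessage(), -1);
        }
    }

    // Business-rule failures thrown from the ItemProcessor.
    @Override
    public void onSkipInProcess(Object item, Throwable t) {
        writeError(String.valueOf(item), "business error: " + t.getMessage(), -1);
    }

    // Write failures, e.g. DB column-size violations.
    @Override
    public void onSkipInWrite(Object item, Throwable t) {
        writeError(String.valueOf(item), "write error: " + t.getMessage(), -1);
    }

    private void writeError(String rowData, String reason, int lineNumber) {
        // Append one line per skipped record: row data, reason, file name, row number.
        try (PrintWriter out = new PrintWriter(new FileWriter(errorCsvPath, true))) {
            out.printf("%s,%s,%s,%d%n", rowData, reason, sourceFileName, lineNumber);
        } catch (IOException e) {
            throw new IllegalStateException("Could not write error CSV", e);
        }
    }
}

The listener would then be registered on a fault-tolerant step, for example via .faultTolerant().skip(FlatFileParseException.class).skip(YourBusinessException.class).skipLimit(...).listener(new CsvErrorSkipListener(...)) (YourBusinessException is a placeholder for your own exception type), so that parse failures, business-rule rejections from the processor and write failures such as the DB size-limit error all land in the same error CSV.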

How to capture the last record in a file

I have a requirement to split a sequential file into 3 parts: Header, Data, Trailer. I have the Header and Data worked out.
Is there a way, in a Transformer, to determine whether you have the last record in a sequential file? I tried using LastRow(), but that gives me the last row for each node. I need to keep parallel execution on.
Thanks in advance for any help.
You have no a priori knowledge about which node the trailer row will come through on. There is therefore no solution in a Transformer stage if you want to retain parallel execution.
One way to do it is to have a reject link on the Sequential File stage. This will capture any row that does not match the defined metadata. Set up the stage with the metadata for your Data rows, then the Header and Trailer will be captured onto the reject link. It should be pretty obvious from their data which is which, and you can process them further and perhaps even rejoin them to your Data rows.
You could also capture the last row separately (e.g. via tail -1 filename) and compare it against every row processed to determine whether it is the last one, but that is computationally heavy for very little gain.

How to take data from 2 databases (with the same schema) and copy it into 1 database using Data Factory

I want to take data from 2 databases and copy (coalesce) it into 1 using Data Factory.
The issue is that multiple inputs do not seem to be allowed for copy activities.
So I resorted to having 2 different datasets that are exact copies but with different names, and then putting 2 different activities into the 1 pipeline, each using its own output dataset.
It just seems odd and wrong to do it this way.
Can I have some help?
This is what my diagram currently looks like:
Is there no way of just copying data from 2 separate databases (which have the same structure but different data) to the 1 database?
The short answer is yes. But you need to work within the constraints of how ADF handles this.
A couple of things to help...
You'll always need at least 2 activities to do this when using the copy activity type. Microsoft of course charges per activity execution in ADF, so they aren't going to let you take shortcuts by having many inputs and outputs per single copy activity (a single charge).
The approach you show above is OK. To pass the ADF validation, as you've found, you simply need to have the output datasets created separately and named differently, even if they still refer to the same underlying target table. This is really only a problem for the copy activity. What you could do is first land the data into separate staging tables in the target Azure database, just for the copy (1:1), and then have a third downstream activity that executes a stored procedure to union those tables. In that case you could have 2 inputs to 1 output in the activity, if you want that level of control in ADF.
Like this:
Final point: if you don't want the activities to execute in parallel, you could chain the datasets to enforce a fake dependency or add a simple 'delay' clause to one of the copy operations. A delay on an activity would be simpler than provisioning a time-slice offset.
Hope this helps

Using Talend Open Studio DI to extract a value from a unique 1st row before continuing to process columns

I have a number of Excel files in which there is a line of text (and a blank row) above the table's header row.
What would be the best way to process the file so I can extract the text from that row AND include it as a column when appending multiple files? Is it possible without having to process each file twice?
Example
This file was created on machine A on 01/02/2013
Task|Quantity|ErrorRate
0102|4550|6 per minute
0103|4004|5 per minute
And I want to end up with the following data from multiple similar files:
Task|Quantity|ErrorRate|Machine|Date
0102|4550|6 per minute|machine A|01/02/2013
0103|4004|5 per minute|machine A|01/02/2013
0467|1264|2 per minute|machine D|02/02/2013
I put together a small, crude sample of how it can be done. I call it crude because (a) it is not dynamic: you can add more files to process, but you need to know how many files there are before building your job, and (b) it shows the basic concept but would require more work to suit your needs. For example, in my test files I simply have "MachineA" or "MachineB" in the first line; you will need to parse that data out to obtain the machine name and the date.
But here is how my sample works. Each Excel file is set up as two inputs. For the header, the tFileInput_Excel is configured to read only the first line, while the body tFileInput_Excel is configured to start reading at line 4.
In the tMap they are combined (not joined) into the output schema. This is done for both the Machine A and Machine B Excel files, and then those tMaps are combined with a tUnite for the final output.
As you can see in the log row, the data is combined and includes the header info.
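As noted above, the header line still has to be parsed to pull out the machine name and the date. In Talend that would typically be a small Java expression or routine (for example called from a tJavaRow or directly in the tMap). A standalone sketch, assuming the header is shaped like the example line above, might be:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of parsing a header line such as
// "This file was created on machine A on 01/02/2013"
// into Machine and Date values. The pattern assumes that exact wording.
public class HeaderParser {

    private static final Pattern HEADER =
            Pattern.compile("created on (.+) on (\\d{2}/\\d{2}/\\d{4})");

    // Returns { machine, date }, or empty strings if the line does not match.
    public static String[] parse(String headerLine) {
        Matcher m = HEADER.matcher(headerLine);
        if (m.find()) {
            return new String[] { m.group(1), m.group(2) };
        }
        return new String[] { "", "" };
    }

    public static void main(String[] args) {
        String[] parts = parse("This file was created on machine A on 01/02/2013");
        System.out.println(parts[0] + "|" + parts[1]); // machine A|01/02/2013
    }
}

The two captured values can then be mapped onto the Machine and Date output columns in the tMap alongside the body rows.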

tFileList catches only one of the 6 files

I am trying to display some results from several files in a directory. I use a tFileList and 2 tFileInputDelimited components, both linked to the tFileList. I don't know why, but at the end of the processing my results are pulled from just one of the 6 files I want; it appears that the results come only from the last file in the directory list.
Each tFileInputDelimited has ((String)globalMap.get("tFileList_1_CURRENT_FILEPATH")) as the name of the flow.
Here is my tMap:
Your job is set up so that your lookup is iterative, which causes some issues, as Talend only seems to use the last iteration rather than doing what you might expect and iterating through every step for everything it needs (although this may be more complicated than you first think).
One option is to rework the job so that you use the iterating part of the job as the main input to the tMap rather than as the lookup.
Alternatively, you could iterate the data into a tBufferOutput component and then, OnSubjobOk, link the job as before but replace the iterative part with a tBufferInput component, since it will hold all of the data from all of the files iterated through.