Getting FlatFileParseException when there is an empty line in my flat file when doing batch processing with Spring Batch

I used the code below to handle it:
<skippable-exception-classes>
    <include class="org.springframework.batch.item.file.FlatFileParseException"/>
</skippable-exception-classes>
But what is the correct way to handle empty lines in a flat file, so that the FlatFileParseException is avoided altogether when processing with Spring Batch?

I think you could accomplish what you're looking for with a custom RecordSeparatorPolicy. You could strip out any empty lines via the RecordSeparatorPolicy#preProcess method and return true from RecordSeparatorPolicy#isEndOfRecord only once you have a record with text. The only gotcha is that you'd have to do some trickery around the case where a file has blank lines at the end, but I think it's doable.
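A minimal sketch of such a policy, extending Spring Batch's SimpleRecordSeparatorPolicy (the class name is mine; verify the end-of-file behaviour against your Spring Batch version):

import org.springframework.batch.item.file.separator.SimpleRecordSeparatorPolicy;

public class BlankLineRecordSeparatorPolicy extends SimpleRecordSeparatorPolicy {

    @Override
    public boolean isEndOfRecord(String line) {
        // A blank line is never a complete record on its own.
        return line.trim().length() > 0 && super.isEndOfRecord(line);
    }

    @Override
    public String preProcess(String record) {
        // Strip blank content so it is not prepended to the next line read.
        return record.trim().length() == 0 ? "" : super.preProcess(record);
    }

    @Override
    public String postProcess(String record) {
        // Covers the end-of-file gotcha: trailing blank lines arrive here as
        // an empty record, which is dropped by returning null.
        return (record == null || record.trim().length() == 0)
                ? null : super.postProcess(record);
    }
}

You would then set it on your FlatFileItemReader, e.g. via <property name="recordSeparatorPolicy" ref="blankLineRecordSeparatorPolicy"/> in the reader's bean definition.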

Related

Specman: how to read a specific line from a file, with no loop

I have a long file, and I want to read a specific line that is not among the first few lines.
Is there a way to do it without looping over the whole file and counting the lines?
For example, something like files.read that takes an index of which line to read?
Thanks
You can use the predefined method files.get_text_lines().
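For example, a short sketch; the file name and index are made up, and I'm assuming get_text_lines() returns the file's contents as a list of string:

extend sys {
    run() is also {
        var lines: list of string;
        lines = files.get_text_lines("long_file.txt");
        print lines[99];  // the 100th line, with no explicit loop
    };
};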

Regarding Capture in Stata

I have code that I mostly took from here (bottom of the page): http://www.ats.ucla.edu/stat/stata/faq/append_many_files.htm
clear
file open myfile9 using C:\Users\RNCZF01\Documents\Cameron-Fen\Economics-Projects\Neighborhood-Project\list.csv, read
file read myfile9 line
insheet using `line', comma
save `line'.dta, replace
save master_data.dta, replace
drop _all
file read myfile9 line
while r(eof)==0 {
    capture insheet using `line', comma
    if _rc!=0 {
        insheet using `line', comma
        save `line'.dta, replace
        append using master_data.dta, force
        save master_data.dta, replace
    }
    drop _all
    file read myfile9 line
}
Originally I had insheet using `line', comma on its own. But the problem was that some of the sheets I was attempting to read were blank, and so Stata would close. Thus I changed that to this:
capture insheet using `line', comma
if _rc!=0 {
insheet using `line', comma
However, this exits after reading only the first document (it leaves the while loop before the first iteration of the loop, i.e. the second document, is done). My thought was that macros may disappear once they are used, but I have no idea.
The reason your loop is exiting is that you want
if _rc==0
for the inner block. I am guessing your first file is not found, insheet throws an error, and you trigger the if _rc!=0 condition. Then the loop tries to run insheet without the capture and errors out. Another way of diagnosing this would be to run
set trace on
which I have found helpful in this sort of situation.
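For reference, a sketch of the corrected inner block; note that the second insheet becomes redundant, since a successful capture insheet has already loaded the data:
capture insheet using `line', comma
if _rc==0 {
    * capture already ran insheet successfully, so no need to run it again
    save `line'.dta, replace
    append using master_data.dta, force
    save master_data.dta, replace
}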
P.S. Not sure of the etiquette, but credit for this answer goes to lmo. I just thought it was worth writing up as an answer rather than a comment.

Store row numbers which are causing "error"

I have to retrieve certain information from URLs. For this I have to enter text into fields of the URL. I am using the GET operation for this. I have to modify the text to replace spaces with "%20". Sometimes the text (which is taken from the database) is badly formed. I would like to know the row numbers, so I can manually fix the text for those rows in the database and run it again. I have tried to use the logs and errors section, but with little luck. Does anybody have an idea of how to do this?
First shot: output bad URLs on the console
So far, I came up with the following job design for your problem (shown as a screenshot in the original post).
The trick is to catch the exceptions of the tHttpRequest component and print the necessary details on the console. For this example, I included the line number, the exception message and the URL that produced the exception.
Output (I couldn't reproduce your "Illegal character" error, so I took a different one; the console output was shown as a screenshot in the original post).
Second shot: Output to a file
If you really need to output the line numbers to a file, things get a little more complicated.
Instead of printing the info straight onto the console, we collect all line numbers into a context variable of type (Java) List inside the tJavaFlex. After the usual URL processing (which I have left out of the job design to keep the example small), we iterate over the Java List and save it into a tHashOutput, so that we can finally write it to a file.
We cannot write to the file directly in the tLoop section, since the Iterate flow would cause the file output component to be opened several times. If "Append" were disabled, only the last bad URL line number would end up in the output file. If "Append" were enabled, you would get the full list of line numbers after the very first job run, but every further run would append to it again, making the list longer and longer. Workarounds would be to use a runtime-dependent file name (e.g. with a timestamp) or to delete the file at the beginning of the job run. I chose a third option, which simply overwrites the file on every run. Feel free to choose whichever of these options suits your use case best.
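Outside of Talend, the core idea of both variants boils down to the following standalone Java sketch; the class name, sample URLs and the java.net.URI-based validation are all illustrative, not Talend-generated code:

import java.net.URI;
import java.util.ArrayList;
import java.util.List;

public class BadUrlCollector {
    public static void main(String[] args) {
        List<String> urls = List.of(
                "http://example.com/a%20b",   // well-formed
                "http://example.com/a b");    // badly formed: raw space
        List<Integer> badLines = new ArrayList<>();
        for (int i = 0; i < urls.size(); i++) {
            try {
                new URI(urls.get(i));         // throws on illegal characters
            } catch (Exception e) {
                // "First shot": print line number, message and URL.
                System.err.println("line " + (i + 1) + ": "
                        + e.getMessage() + " -- " + urls.get(i));
                // "Second shot": collect the line numbers for later output.
                badLines.add(i + 1);
            }
        }
        System.out.println("bad line numbers: " + badLines);
    }
}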
Details
The tHashOutput/tHashInput components are not visible by default, but must be enabled first to show up: https://www.talendforge.org/forum/viewtopic.php?pid=107249#p107249
The individual component settings, shown as screenshots in the original post:
Context variable:
INIT:
tJavaFlex "catch errors", end code:
tLoop:
tFixedFlowInput "badURL":
tHashOutput: needs to have "Append" enabled.

Can I use importdata to return only a part of a text file?

I am using:
importdata(fileName,'',headerLength)
to get data from a text file which is carriage-return-line-feed delimited. The problem I have is that the files are relatively large and there are several thousand of them, which makes the data loading slow. I only want a small part of each file, so I would like to know whether I can use importdata to achieve this.
Something like this:
importdata(fileName,'',headerLength:dataEnd);
This does not work and I can't find any support for doing something like this in the importdata documentation.
Does anyone know of a more suitable function?
If you know the lines (the row numbers) in each file you wish to load, you can use a slower, more traditional way of reading in your data. The readline.m submission on the File Exchange allows you to do this:
http://uk.mathworks.com/matlabcentral/fileexchange/20026-readline-m-v3-0--jun--2009-
It lets you read whichever line you want from your data block. Per line it is much slower than your normal csvread/textscan, but it can work out faster overall if you only need a few specific lines.
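If the part you need is one contiguous block right after the header, another option built on the textscan mentioned above is to skip the header and stop after a fixed number of rows; the format spec, nRows and the comma delimiter below are assumptions about your data:

% Read only nRows rows after headerLength header lines, instead of
% importing the whole file. Adjust '%f%f%f' to match your columns.
fid = fopen(fileName, 'r');
data = textscan(fid, '%f%f%f', nRows, ...
                'HeaderLines', headerLength, ...
                'Delimiter', ',');
fclose(fid);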

Processing text inside variable before writing it into file

I'm using the Perl WWW::Mechanize package to fetch and process data from some websites. My usual way of working is as follows:
Fetch a webpage
$mech->get("$url");
Save the webpage contents in a variable (BTW, I'm not sure it's right to keep this amount of text inside a scalar, which, as far as I know, is supposed to hold a single value)
my $list = $mech->content();
Use a subroutine that I've created to write the contents of the variable to a text file (the writeToFile subroutine includes a few more features, like path and existing-file validations)
writeToFile("$filename.tmp","$path",$list);
Process the text in the file created in the previous step by creating an additional file and saving the processed content there (then deleting the initial temporary file).
What I wonder is whether it is possible to perform the processing before storing the text in a file, directly on the $list variable? The whole process works as expected, but I don't really like the logic behind it, and it seems a bit inefficient as well, since I have to rewrite the same file multiple times.
EDIT:
Just to give a bit more information about what I'm actually after when I process the variable's contents: the data I fetch from the website in this case is a list of items separated by blank lines, and the first line is irrelevant to me. So what I'm doing while processing this data is two things:
Remove the empty (CRLF) lines
Remove the first line if it includes a particular text.
Ideally I want to save the processed list (no blank lines and first line removed) to a file without creating any additional files along the way. To save the file I would like to use the writeToFile sub I wrote, since it also validates whether such a file already exists (if a file were saved before the final processing, writeToFile would always rewrite the existing file).
Hope it makes sense.
You're looking for split. The pattern depends: use (?<=\n) to split at a newline character and keep it. If that doesn't matter, use \R to match all sorts of line breaks.
foreach my $line (split qr/\R/, $mech->content) {
…
}
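Applied to the two processing steps from the edit, a minimal sketch; writeToFile, $mech, $filename and $path are the asker's own, and the marker text in the regex is made up:

my @lines = split /\R/, $mech->content;

# Step 1: remove the empty (CRLF-only) lines.
@lines = grep { /\S/ } @lines;

# Step 2: drop the first line if it contains the (made-up) marker text.
shift @lines if @lines and $lines[0] =~ /some marker text/;

# Write once; no temporary file and no rewriting needed.
my $list = join "\n", @lines;
writeToFile($filename, $path, $list);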
Now the obligatory HTML-parsing-with-regex admonishment: if you get HTML source with Mechanize, parsing it line-by-line does not make much sense. You probably want to process the HTML-stripped text version of the document instead, or pass the HTML source to a parser such as Web::Query to declaratively get at the pieces you need.
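For the text version, WWW::Mechanize can hand it to you directly; per its documentation this requires HTML::TreeBuilder to be installed:

my $text = $mech->content(format => 'text');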