Can ItemReaders just pass in the record read and not need a lineMapper to convert to an object - spring-batch

I'm asking whether I can pass the entire delimited record read by the ItemReader into the ItemProcessor as one long string.
I have situations with unpredictable data. The file is pipe-delimited, but even so, a single double-quote causes a parse error in Spring Batch's ItemReader.
In a standalone Java application I wrote code using Spring's StringUtils class. I read in the full delimited record as a String (via BufferedReader), then call Spring's StringUtils.delimitedListToStringArray(...,...). This gets all the characters, valid or not, and then I can do a search/replace to handle things like a lone double-quote or commas within the fields.
My standalone Java program is a down-n-dirty solution; I'm turning it into a Spring Batch job as the long-term solution. It's a monthly process, and it's an impractical, if not impossible, task to get SAP users to keep trash out of the data fields (i.e. fat-finger city).
I see where it appears I have to have a domain object for the input record to be mapped into. Is this correct, or can I do a pass-through scenario and handle the parsing myself using StringUtils?
The pipe-delimited records simply turn into comma-delimited records; there's really no need to create a domain object and do all the field-set mapping.
I'm happy to hear ideas if I'm approaching this the wrong way.
Thank you in advance.
Thanks,
Michael
EDIT:
This is the error, and the record; the lone double-quote in column 6 is the problem. I can't control the input, so I'm scrubbing each field (all Strings) for unwanted characters. So my solution was to skip the line mapping and use StringUtils to do it myself, as I described earlier.
Caused by: org.springframework.batch.item.file.FlatFileParseException: Parsing error at line: 33526 in resource=[URL [file:/temp/comptroller/myfile.txt]], input=[xxx|xxx|xxx|xxx|xxx|xxx x xxx xxxxxxx xxxx xxxx "x|xxx|xxx|xxxxx|xx|xxxxxxxxxxxxx|xxxxxxx|xxx|xx |xxx ]
at org.springframework.batch.item.file.FlatFileItemReader.doRead(FlatFileItemReader.java:182)
at org.springframework.batch.item.support.AbstractItemCountingItemStreamItemReader.read(AbstractItemCountingItemStreamItemReader.java:85)
at org.springframework.batch.core.step.item.SimpleChunkProvider.doRead(SimpleChunkProvider.java:90)
at org.springframework.batch.core.step.item.FaultTolerantChunkProvider.read(FaultTolerantChunkProvider.java:87)
... 27 more
Caused by: org.springframework.batch.item.file.transform.IncorrectTokenCountException: Incorrect number of tokens found in record: expected 15 actual 6

Since the domain objects you read from ItemReaders, write to ItemWriters, and optionally process with ItemProcessors can be any Object, they can be Strings.
So the short answer is yes: you should be able to use a FlatFileItemReader to read one line at a time, pass it to SomeItemProcessor<String, String>, which replaces your pipes with commas (and handles existing commas) with whatever code you want, and sends those converted lines to a FlatFileItemWriter. Spring Batch includes common implementations of the LineTokenizer and LineAggregator interfaces which could help.
In this scenario, Spring Batch would be acting as a glorified search-and-replace tool, with saner failure handling. To answer the bigger question of whether you should be using domain objects, or at least beans, think about whether you want to perform other tasks in the conversion process, such as validation.
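A minimal sketch of that pass-through setup, assuming a FlatFileItemReader configured with Spring Batch's PassThroughLineMapper and a FlatFileItemWriter configured with a PassThroughLineAggregator; the processor name PipeToCommaProcessor is made up for illustration:

import org.springframework.batch.item.ItemProcessor;
import org.springframework.util.StringUtils;

// Hypothetical processor for the pass-through scenario: the reader hands us
// the raw line as a String, we scrub and re-delimit it, and the writer
// writes the converted String back out unchanged.
public class PipeToCommaProcessor implements ItemProcessor<String, String> {

    @Override
    public String process(String line) throws Exception {
        // Split on pipes; a stray double-quote is just another character
        // here, so nothing blows up the way a line tokenizer would.
        String[] fields = StringUtils.delimitedListToStringArray(line, "|");
        for (int i = 0; i < fields.length; i++) {
            // Scrub each field: drop lone quotes and replace embedded commas
            // so they can't corrupt the comma-delimited output record.
            fields[i] = fields[i].replace("\"", "").replace(",", " ");
        }
        return StringUtils.arrayToCommaDelimitedString(fields);
    }
}

The reader and writer stay generic: setLineMapper(new PassThroughLineMapper()) on the FlatFileItemReader and setLineAggregator(new PassThroughLineAggregator<>()) on the FlatFileItemWriter.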
P.S. I'm not aware that FlatFileItemReader blows up on a single double-quote; you might want to file that as a bug.

Reading a CSV file with Spring Batch, mapping to domain objects based on the first field, and then inserting them into the DB accordingly

How can we implement pattern matching in Spring Batch? I am using org.springframework.batch.item.file.mapping.PatternMatchingCompositeLineMapper.
I have learned that I can only use ? or * here to create my pattern.
My requirement is as follows:
I have a fixed-length record file, and in each record there is a two-character field at the 35th and 36th positions which gives the record type;
for example, in the record below, "05" is the record type at the 35th and 36th positions, and the total length of the record is 400.
0000001131444444444444445589868444050MarketsABNAKKAAAAKKKA05568551456...........
I tried to write a regular expression, but it does not work; I learned that only two special characters can be used, which are * and ?.
In that case I can only write something like this:
??????????????????????????????????05?????????????..................
but it does not seem to be a good solution.
Please suggest how I can write this solution. Thanks a lot for the help in advance.
The PatternMatchingCompositeLineMapper uses an instance of org.springframework.batch.support.PatternMatcher to do the matching. It's important to note that PatternMatcher does not use true regular expressions; it uses something closer to Ant-style patterns (the code is actually lifted from AntPathMatcher in Spring Core).
That being said, you have two options:
1. Use a pattern like the one you are referring to (since there is no shorthand way to specify how many ? characters should be checked, as there is in regular expressions).
2. Create your own composite LineMapper implementation that uses regular expressions to do the mapping (a sketch follows below).
For the record, if you choose option 2, contributing it back would be appreciated!
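A minimal sketch of option 2, assuming plain java.util.regex plus Spring Batch's LineMapper interface; the class name RegexMatchingCompositeLineMapper is made up for illustration:

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;
import org.springframework.batch.item.file.LineMapper;

// Hypothetical regex-based alternative to PatternMatchingCompositeLineMapper:
// delegates are tried in insertion order and the first full match wins.
public class RegexMatchingCompositeLineMapper<T> implements LineMapper<T> {

    private final Map<Pattern, LineMapper<T>> delegates = new LinkedHashMap<>();

    public void addDelegate(String regex, LineMapper<T> mapper) {
        delegates.put(Pattern.compile(regex), mapper);
    }

    @Override
    public T mapLine(String line, int lineNumber) throws Exception {
        for (Map.Entry<Pattern, LineMapper<T>> entry : delegates.entrySet()) {
            if (entry.getKey().matcher(line).matches()) {
                return entry.getValue().mapLine(line, lineNumber);
            }
        }
        throw new IllegalStateException("No mapper matches line " + lineNumber);
    }
}

A delegate registered under the regex ".{34}05.*" would then catch the "05" record type at the 35th and 36th positions, with no run of 34 question marks needed.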

How to generate stream of custom symbols instead of ints in lexer?

I am asking about GPLEX; however, the solution to this problem may well work for other lex-derived tools.
I wrote all the rules, and everything is fine with one exception: the return type of the scan method of the generated scanner is int, and I would like it to be MySymbol (which would consist of the id of the token -- INT, STR, PLUS, and so on -- its value, and possibly its location in the file).
I checked the samples (there are not many of them), but they are very simplistic and just write out the fact that a rule was matched. I've read the manual, but it starts from the parser's perspective, and for now I am a bit lost.
One of my rules in the lex file:
while { return new MySymbol(MyTokens.WHILE); }
All I have now is the scanning phase; I have to finish it, and then I will think about the parser.
Yacc and yacc-like tools (here GPLEX) rely on side effects. Normally you would think of returning the data, but here you return the token id, and any extra data has to be "passed" via special variables like yylval.
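An illustrative sketch of that side-effect pattern in Java terms; MySymbol, MyTokens and the yylval-style field are hypothetical stand-ins, not GPLEX's actual API:

public class LexerSketch {

    public static final class MySymbol {
        public final int tokenId;
        public MySymbol(int tokenId) { this.tokenId = tokenId; }
    }

    public interface MyTokens { int WHILE = 1; }

    // The side channel: the scan method can only return an int token id,
    // so the full symbol is stashed in a field the caller reads afterwards.
    private MySymbol yylval;

    public int yylex() {
        // As if the rule for "while" had just fired:
        yylval = new MySymbol(MyTokens.WHILE);
        return MyTokens.WHILE;
    }

    public MySymbol lastSymbol() { return yylval; }
}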

Spreadsheet::ParseExcel module in perl

I always get confused when I deal with classes and objects. As I am trying to understand the Spreadsheet::ParseExcel module, I have some doubts about its classes and objects.
My doubt is:
With $parser = Spreadsheet::ParseExcel->new();, we are creating an object for Spreadsheet::ParseExcel, and after this we create the object for Spreadsheet::ParseExcel::Workbook.
Why can't we create the object directly for Spreadsheet::ParseExcel::Workbook and start parsing?
Thanks
Why can't we create the object directly for Spreadsheet::ParseExcel::Workbook and start parsing
That is a reasonable question and in older versions of Spreadsheet::ParseExcel there was a Spreadsheet::ParseExcel::Workbook->Parse() method that did just that. (*)
Users tend to see an Excel file only as a workbook. However, the file format also contains data such as metadata (author, creation date, etc.) and VBA macros that are separate from the workbook data.
As such the logical division of the parser from the workbook probably occurred due to the physical division of the data in the file.
Or it may have been to allow reporting of file parsing errors rather than just returning an undefined workbook object.
Either way, other people may have chosen to model the interface differently but that is what the original author chose. It is not completely intuitive but it works.
(*) This method is now deprecated since it doesn't allow error checking on the file.
Think of Spreadsheet::ParseExcel and Spreadsheet::ParseExcel::Workbook as simply being of different types, like integer and string: both are scalar, but you cannot, say, multiply them, although they can interact in some cases. E.g. length() applied to a string gives you the integer length of the string. In the same way, Spreadsheet::ParseExcel::parse() gives you a Spreadsheet::ParseExcel::Workbook. They are bound by a common namespace, but they are completely different: Spreadsheet::ParseExcel is a parser and Spreadsheet::ParseExcel::Workbook is a workbook.

How to prevent SQL injection if I don't have option to use "PreparedStatement" in Java/J2EE

I have one application in which I can't use PreparedStatement in some places.
Most of the SQL queries are like:
String sql = "delete from " + tableName;
So I'd like to know how to fix the SQL injection problem in my code.
Regards,
Sanjay Singh
EDIT (after getting answers; I'd like to verify my solution):
Following the provided suggestions, I have identified a strategy to prevent SQL injection in my case.
I'd like to know your views; I am working on the VeraCode certification for our application.
Filter the data so it does not contain any spaces, and escape SQL characters (so if there is any injected code,
it will not become part of my dynamic SQL, and my column names and table names can't be used to inject a SQL query).
public static String getTabColName(String tabColName)
{
    if (tabColName == null || "".equals(tabColName.trim()))
        return "";
    String tempStr = StringEscapeUtils.escapeSql(tabColName.trim());
    // If this value contains a space, it is not a valid table or column
    // name, so don't use it in the dynamically generated SQL; returning
    // "" instead yields an invalid SQL query rather than an injected one.
    return tempStr.indexOf(' ') == -1 ? tempStr : "";
}
Parameterised queries are a major step towards preventing SQL injection attacks. If you cannot use them, you have an equally major setback on your hands. You can somewhat mitigate the danger by:
input string validation. And I mean validation with all the bells and whistles, which can sometimes reach the level of a full-blown parser, not just a few checks.
input manipulation (e.g. quoting and string escaping). Again, you have to do this right, which can be harder than it seems.
Both techniques are problematic - you have to let valid input through unchanged, in order to maintain compatibility with your current codebase, while still effectively protecting your system. Good luck with that...
From my experience, refactoring - or even rewriting - your code to use prepared statements will save you a lot of time and tears in the long run.
If you don't have a peer-reviewed library of string-escaping functions, at the very least you should white-list characters that you know are safe to embed in strings. For instance, ensure your strings are composed only of letters, digits and underscores, and nothing else. Black-listing known "bad characters" is bound to get you in trouble.
Making sure that the input contains only allowed characters is just an important first step. Your sample statement is a good example of the value of the strategy "find the input in a list of all known-good values" (you surely know the set of tables in your database and the subset of tables users are allowed to zap). "Compare the input against a plausible range" (a salary shouldn't be increased by millions or by half a cent) and "match the input against a regex to reveal structural violations" are further examples. A sketch of the known-good-values approach follows.
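A minimal sketch of that known-good-list strategy for the delete-from example, using only JDK classes; the class name and table names are made up for illustration:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

public final class SqlIdentifiers {

    // Structural check: letters, digits and underscores only.
    private static final Pattern SAFE = Pattern.compile("[A-Za-z0-9_]+");

    // Known-good list: the only tables users may delete from (hypothetical names).
    private static final Set<String> DELETABLE_TABLES =
            new HashSet<>(Arrays.asList("temp_import", "audit_log"));

    public static String safeTableName(String name) {
        if (name == null
                || !SAFE.matcher(name).matches()
                || !DELETABLE_TABLES.contains(name.toLowerCase())) {
            throw new IllegalArgumentException("Table name not allowed: " + name);
        }
        return name;
    }
}

The caller then builds String sql = "delete from " + SqlIdentifiers.safeTableName(tableName); and anything outside the list fails fast instead of reaching the database.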
To gain confidence in your defenses, you may consider using a QuickCheck-like testing library to attack your validation functions with (suitably biased) random strings. This article lists implementations for languages other than Haskell.

Dealing with files in PSUnit

I'm writing a PowerShell script which is going to go out into a client's current source control system and do a mass rename on all of the files so that they follow a new naming convention.
Being the diligent TDD developer that I am, I started with putting together a PSUnit test case. At first I was thinking that I would pass in a string to my function for the file name (along with a couple of other relevant parameters) and then return a string. Then it occurred to me that I am going to need to break apart the file name into an extension and a base name. Since System.IO.FileInfo already has that functionality, I thought why not just pass in a file object instead of a string?
If I do that, however, then I don't see how I can write my PSUnit test without it being reliant on an external resource (in this case, the file must exist for me to get the FileInfo object - or does it?).
Is there a "clean" way to handle this? How are others approaching these issues?
Thanks for any help or advice!
My suggestion is: Be pragmatic and pass in the base name and extension as two separate strings. For convenience reasons, you can provide an overload that accepts a FileInfo.
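An illustrative Java analog of that design, since the idea is language-agnostic; the class name, naming rule and helpers are hypothetical. The core logic takes plain strings, which makes it trivial to unit-test without touching the file system, while a convenience overload unpacks a real file object:

import java.io.File;

public class RenameRules {

    // Testable core: operates on plain strings, no file required.
    public static String newName(String baseName, String extension) {
        // Hypothetical convention: lower-case the base name, keep the extension.
        String base = baseName.toLowerCase();
        return extension.isEmpty() ? base : base + "." + extension;
    }

    // Convenience overload: unpack a file into base name and extension.
    public static String newName(File file) {
        String name = file.getName();
        int dot = name.lastIndexOf('.');
        String base = (dot == -1) ? name : name.substring(0, dot);
        String ext = (dot == -1) ? "" : name.substring(dot + 1);
        return newName(base, ext);
    }
}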