OpenCSV vs FlatFileItemReader in Spring Batch

What is the best way to read a CSV file with Spring Batch: OpenCSV or FlatFileItemReader? OpenCSV seems to provide more capabilities for column filtering.
Thanks in advance.
Edit: I was using OpenCSV to read a file and it worked fine. Afterwards I decided to use FlatFileItemReader, and now I am facing some issues and exceptions.
org.springframework.batch.item.file.FlatFileParseException: Parsing error at line:
Caused by: org.springframework.beans.NotWritablePropertyException: Invalid property 'xxx' of bean class [xxx.Record]: Duplicate match with distance <= 5 found for this property in input keys:..
This is not the whole log, but these are the main exceptions I get. The duplicate match exception is a bit confusing because I don't have duplicate records in the file.
Please, any help or explanation will be welcome.
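For context, a typical FlatFileItemReader setup that can produce this kind of error looks roughly like the sketch below; the Record bean, column names, and file path are placeholders, not the actual configuration. The "Duplicate match with distance <= 5" message generally comes from BeanWrapperFieldSetMapper's fuzzy property matching (the default distance limit is 5); calling setDistanceLimit(0) forces exact matches between the tokenizer's column names and the bean properties.
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.mapping.BeanWrapperFieldSetMapper;
import org.springframework.batch.item.file.mapping.DefaultLineMapper;
import org.springframework.batch.item.file.transform.DelimitedLineTokenizer;
import org.springframework.core.io.FileSystemResource;

public class RecordReaderSketch {

    // Placeholder bean standing in for the real Record class.
    public static class Record {
        private String id;
        private String name;
        public String getId() { return id; }
        public void setId(String id) { this.id = id; }
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
    }

    public FlatFileItemReader<Record> recordReader() {
        DelimitedLineTokenizer tokenizer = new DelimitedLineTokenizer(DelimitedLineTokenizer.DELIMITER_COMMA);
        tokenizer.setNames("id", "name"); // placeholder column names; must match the bean properties

        BeanWrapperFieldSetMapper<Record> fieldSetMapper = new BeanWrapperFieldSetMapper<>();
        fieldSetMapper.setTargetType(Record.class);
        fieldSetMapper.setDistanceLimit(0); // disable fuzzy matching that can trigger "duplicate match" errors

        DefaultLineMapper<Record> lineMapper = new DefaultLineMapper<>();
        lineMapper.setLineTokenizer(tokenizer);
        lineMapper.setFieldSetMapper(fieldSetMapper);

        FlatFileItemReader<Record> reader = new FlatFileItemReader<>();
        reader.setResource(new FileSystemResource("input.csv")); // placeholder path
        reader.setLinesToSkip(1); // skip the header row
        reader.setLineMapper(lineMapper);
        return reader;
    }
}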

Related

Spark giving multiple datasource error on saving parquet file

I am trying to learn Spark and Scala. While trying to write the DataFrame holding my result to a Parquet file by calling the parquet method, I get the following error.
Code that fails:
df2.write.mode(SaveMode.Overwrite).parquet(outputPath)
This fails too
df2.write.format("org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat").mode(SaveMode.Overwrite).parquet(outputPath)
Error log:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Multiple sources found for parquet (org.apache.spark.sql.execution.datasources.v2.parquet.ParquetDataSourceV2, org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat), please specify the fully qualified class name.;
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:707)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:733)
at org.apache.spark.sql.DataFrameWriter.lookupV2Provider(DataFrameWriter.scala:967)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:304)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:288)
at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:848)
However, if I call another method for the save, the code works properly.
This works fine:
df2.write.format("org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat").mode(SaveMode.Overwrite).save(outputPath)
Although I have a workaround for the issue, I'd like to understand why the first approach is not working and how I can fix it.
The versions I am using are:
Scala 2.12.9
Java 1.8
Spark 2.4.4
P.S. This issue only appears with spark-submit.

What might be some possible reasons for the exception 'Cannot find KieModule'?

I have four rule files, and two of them are problematic. When I remove those two, Drools runs fine, but with those two rule files I always get the above-mentioned exception.
I have uploaded one of the problematic files here: http://www.mediafire.com/file/6xsm8ilxysmyq3i/rulefile.drl. It is auto-generated, and I am pretty sure it is an instant turn-off. It is difficult to check every single line, so I am asking for suggestions on what to check to get a hint about the KieModule exception.
The other two files, with which everything runs smoothly, have the same structure, except that they are smaller. So I am almost out of clues.
Any help is appreciated.
No errors are shown in the Eclipse editor.
Can a syntactically correct rule throw a 'Cannot find KieModule' exception when fired? In my case, according to the editor, the rule has no syntax errors.
This exception occurs when the Drools compiler cannot find the KieModule. Check whether there is a file at src/main/resources/META-INF/kmodule.xml in your project directory. If it is not there, create it.
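For a quick sanity check, the usual classpath bootstrap looks roughly like this sketch; KieServices discovers KieModules from META-INF/kmodule.xml files on the classpath, and the session name "ksession-rules" here is only an example that must match a ksession declared in your kmodule.xml.
import org.kie.api.KieServices;
import org.kie.api.runtime.KieContainer;
import org.kie.api.runtime.KieSession;

public class DroolsBootstrapSketch {
    public static void main(String[] args) {
        // KieServices builds KieModules from META-INF/kmodule.xml files found on the classpath.
        KieServices kieServices = KieServices.Factory.get();
        KieContainer kieContainer = kieServices.getKieClasspathContainer();

        // "ksession-rules" is an example; it must match a <ksession name="..."/> entry in kmodule.xml.
        KieSession kieSession = kieContainer.newKieSession("ksession-rules");
        try {
            // kieSession.insert(someFact); // insert facts before firing
            kieSession.fireAllRules();
        } finally {
            kieSession.dispose();
        }
    }
}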

Spark cannot find case class on classpath

I have an issue where Spark is failing to generate code for a case class. Here is the Spark error:
Caused by: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 52, Column 43: Identifier expected instead of '.'
Here is the referenced line in the generated code:
/* 052 */ private com.avro.message.video.public.MetricObservation MapObjects_loopValue34;
It should be noted that com.avro.message.video.public.MetricObservation is a nested case class that is part of a larger hierarchy, and it is used successfully in other places in the code. It should also be noted that this pipeline works fine if I use the RDD API, but I want to use the Dataset API because I want to write the Dataset out as Parquet. Has anyone seen this issue before?
I'm using Scala 2.11 and Spark 2.1.0. I was able to upgrade to Spark 2.2.1 and the issue is still there.
Do you think that SI-7555 or something like it has any bearing on this? I have noticed in the past that Scala reflection has had issues generating TypeTags for statically nested classes. Do you think something like that is going on, or is this strictly a Catalyst issue in Spark? You might want to file a Spark ticket too.
So it turns out that changing the package name of the affected class "fixes" the problem (i.e., makes it go away). I really have no idea why this is, or even how to reproduce it in a small test case. What worked for me was creating a higher-level package that works: specifically, com.avro.message.video.public -> com.avro.message.publicVideo. (One possible factor: public is a reserved word in Java, so the generated Java code shown above may simply be unable to reference a package segment named public.)

Can ItemReaders just pass along the record that was read, without needing a LineMapper to convert it to an object?

I'm asking whether I can pass the entire delimited record read by the ItemReader into the ItemProcessor as one long String.
I have situations with unpredictable data. The file is pipe-delimited, but even so, a single double-quote causes a parse error with Spring Batch's ItemReader.
In a standalone Java application I wrote code using Spring's StringUtils class. I read the full delimited record in as a String (via BufferedReader), then call Spring's StringUtils.delimitedListToStringArray(...,...). This gets all the characters, whether valid or not, and then I can do a search/replace to handle things like a lone double-quote or commas in the fields.
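Roughly, the standalone version looks like this (the file name and the exact scrubbing rules here are simplified for illustration):
import java.io.BufferedReader;
import java.io.FileReader;
import org.springframework.util.StringUtils;

public class StandaloneScrubber {
    public static void main(String[] args) throws Exception {
        try (BufferedReader reader = new BufferedReader(new FileReader("myfile.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Split on the pipe delimiter; nothing gets rejected, good data or bad.
                String[] fields = StringUtils.delimitedListToStringArray(line, "|");
                for (int i = 0; i < fields.length; i++) {
                    // Scrub lone double-quotes and embedded commas before re-joining.
                    fields[i] = fields[i].replace("\"", "").replace(",", " ");
                }
                System.out.println(StringUtils.arrayToCommaDelimitedString(fields));
            }
        }
    }
}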
My standalone Java program is a down-and-dirty solution. I'm turning it into a Spring Batch job as the long-term solution. It's a monthly process, and it's impractical, if not impossible, to get SAP users to keep trash out of the data fields (i.e., fat-finger city).
It appears I have to have a domain object for the input record to be mapped into. Is this correct, or can I do a pass-through scenario and handle the parsing myself using StringUtils?
The pipe-delimited records just get turned into comma-delimited records, so there's really no need to create a domain object and do all the FieldSet mapping.
I'm happy to hear ideas if I'm approaching this the wrong way.
Thanks in advance,
Michael
EDIT:
This is the error and the record. The lone double-quote in column 6 is the problem. I can't control the input, so I'm scrubbing each field (they're all Strings) for unwanted characters. So my solution was to skip the line mapping and use StringUtils to do it myself, as mentioned earlier.
Caused by: org.springframework.batch.item.file.FlatFileParseException: Parsing error at line: 33526 in resource=[URL [file:/temp/comptroller/myfile.txt]], input=[xxx|xxx|xxx|xxx|xxx|xxx x xxx xxxxxxx xxxx xxxx "x|xxx|xxx|xxxxx|xx|xxxxxxxxxxxxx|xxxxxxx|xxx|xx |xxx ]
at org.springframework.batch.item.file.FlatFileItemReader.doRead(FlatFileItemReader.java:182)
at org.springframework.batch.item.support.AbstractItemCountingItemStreamItemReader.read(AbstractItemCountingItemStreamItemReader.java:85)
at org.springframework.batch.core.step.item.SimpleChunkProvider.doRead(SimpleChunkProvider.java:90)
at org.springframework.batch.core.step.item.FaultTolerantChunkProvider.read(FaultTolerantChunkProvider.java:87)
... 27 more
Caused by: org.springframework.batch.item.file.transform.IncorrectTokenCountException: Incorrect number of tokens found in record: expected 15 actual 6
Since the domain objects you read from ItemReaders, write to ItemWriters, and optionally process with ItemProcessors can be any Object, they can be Strings.
So the short answer is yes: you should be able to use a FlatFileItemReader to read one line at a time, pass it to SomeItemProcessor<String, String>, which replaces your pipes with commas (and handles existing commas) with whatever code you want, and send those converted lines to a FlatFileItemWriter. Spring Batch includes common implementations of the LineTokenizer and LineAggregator interfaces that could help.
In this scenario, Spring Batch would be acting like a glorified search-and-replace tool, with saner failure handling. To answer the bigger question of whether you should be using domain objects, or at least beans, think about whether you want to perform other tasks in the conversion process, like validation.
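For example, a minimal sketch of that arrangement might look like the following (the output path and the exact scrubbing rules are illustrative, and the three beans would still need to be wired into a chunk-oriented step):
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.FlatFileItemWriter;
import org.springframework.batch.item.file.mapping.PassThroughLineMapper;
import org.springframework.batch.item.file.transform.PassThroughLineAggregator;
import org.springframework.core.io.FileSystemResource;
import org.springframework.util.StringUtils;

public class PassThroughConfig {

    public FlatFileItemReader<String> reader() {
        FlatFileItemReader<String> reader = new FlatFileItemReader<>();
        reader.setResource(new FileSystemResource("/temp/comptroller/myfile.txt"));
        reader.setLineMapper(new PassThroughLineMapper()); // no tokenizing, no domain object
        return reader;
    }

    public ItemProcessor<String, String> pipeToCommaProcessor() {
        return line -> {
            String[] fields = StringUtils.delimitedListToStringArray(line, "|");
            for (int i = 0; i < fields.length; i++) {
                // Scrub whatever trash shows up, e.g. lone double-quotes or embedded commas.
                fields[i] = fields[i].replace("\"", "").replace(",", " ");
            }
            return StringUtils.arrayToCommaDelimitedString(fields);
        };
    }

    public FlatFileItemWriter<String> writer() {
        FlatFileItemWriter<String> writer = new FlatFileItemWriter<>();
        writer.setResource(new FileSystemResource("/temp/comptroller/myfile.csv")); // illustrative output path
        writer.setLineAggregator(new PassThroughLineAggregator<>()); // writes each String as-is
        return writer;
    }
}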
P.S. I'm not aware that FlatFileItemReader blows up on a single double-quote; you might want to file that as a bug.

org.dozer.MappingException: No read or write method found for field

org.dozer.MappingException: No read or write method found for field
(tarShipMethodCode.lmCourier.courierName) in class (class
com.essilor.ong.domain.inventory.POLocationEntity)
I am getting this error when I build my WAR file and try to run it on Tomcat.
I am using JPA and Dozer mapping.
Can anyone tell me how to fix it?
Check your beans and your Dozer mapping file.
There are multiple (more or less common) errors possible:
A typo in the mapping file. Check the package and field names in your POLocationEntity: does it have a field named tarShipMethodCode, does that have an lmCourier field, and does that in turn have a courierName field?
Missing getters/setters. Again, check the beans: Dozer usually expects getFieldName and setFieldName methods, unless you specified others (which I don't assume; maybe post your mapping file).
Narrow the problem down: is this the only field that is not working, or is this field not specified at all? Dozer tends to map fields by name when they have no corresponding entries in the mapping file, which can lead to unexpected errors.
tl;dr
With some more information (mapping XML, bean code) this would be easier to analyze, but the pointers above are the ones that solve these kinds of problems in my experience.
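For illustration, the property path tarShipMethodCode.lmCourier.courierName implies a bean structure roughly like this (TarShipMethodCode and LmCourier are guessed type names based on the field names); every field along the path needs a matching getter/setter pair for Dozer to read or write it:
// Illustrative sketch only; the nested type names are guesses from the property path.
public class POLocationEntity {
    private TarShipMethodCode tarShipMethodCode;

    public TarShipMethodCode getTarShipMethodCode() { return tarShipMethodCode; }
    public void setTarShipMethodCode(TarShipMethodCode tarShipMethodCode) { this.tarShipMethodCode = tarShipMethodCode; }
}

class TarShipMethodCode {
    private LmCourier lmCourier;

    public LmCourier getLmCourier() { return lmCourier; }
    public void setLmCourier(LmCourier lmCourier) { this.lmCourier = lmCourier; }
}

class LmCourier {
    private String courierName;

    public String getCourierName() { return courierName; }
    public void setCourierName(String courierName) { this.courierName = courierName; }
}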