I am trying to read unl file in Spring batch.
Use FlatFileItemReader and delimiter is "|".
001-A472468827" |N|100| The delimiter does not work when encountering this data.
Data cannot be divided by the delimiter if it contains " and spaces or if it contains the # character.
quoteCharacter doesn't seem to work.
In this situation, is there a way to import special characters such as " and # as they are?
#Bean
#StepScope
public FlatFileItemReader unlFileReader() throws MalformedURLException {
return new FlatFileItemReaderBuilder<ExampleDTO>()
.name("unlFileReader")
/*.encoding(StandardCharsets.UTF_8.name())*/
.resource(fileService.inputFileResource(UNZIP_PATH + "example.unl"))
.fieldSetMapper(new BeanWrapperFieldSetMapper<>())
.targetType(ExampleDTO.class)
.delimited().delimiter("|")
.quoteCharacter('#')
.quoteCharacter('"')
.quoteCharacter(DelimitedLineTokenizer.DEFAULT_QUOTE_CHARACTER)
.includedFields(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141
)
.names(ExampleDTO.getFieldNameArrays())
.build();
}
In this situation, is there a way to import special characters such as " and # as they are?
You are calling quoteCharacter() several times, note that this overrides the previous value and does not add the quote character to a list of quote characters. Only one quote character will be used (the last one added if you chain such calls).
Data cannot be divided by the delimiter if it contains " and spaces or if it contains the # character
This is because " is the default quote character. If the input contains a single ", you need to specify another delimiter (otherwise Spring Batch considers that as a "bug" in your data, which is true as the field is not correctly quoted). Here is a quick test that passes:
#Test
void testPipeDelimiter() {
DelimitedLineTokenizer tokenizer = new DelimitedLineTokenizer();
tokenizer.setDelimiter("|");
tokenizer.setQuoteCharacter(' ');
String s = "001-A472468827\"|N|100|";
FieldSet fieldSet = tokenizer.tokenize(s);
Assertions.assertEquals("001-A472468827\"", fieldSet.readString(0));
Assertions.assertEquals("N", fieldSet.readString(1));
Assertions.assertEquals("100", fieldSet.readString(2));
}
This test shows that the " is part of the first field. The same test passes with a # in the input:
#Test
void testPipeDelimiter() {
DelimitedLineTokenizer tokenizer = new DelimitedLineTokenizer();
tokenizer.setDelimiter("|");
tokenizer.setQuoteCharacter(' ');
String s = "001-A472468827#|N|100|";
FieldSet fieldSet = tokenizer.tokenize(s);
Assertions.assertEquals("001-A472468827#", fieldSet.readString(0));
Assertions.assertEquals("N", fieldSet.readString(1));
Assertions.assertEquals("100", fieldSet.readString(2));
}
I'm trying to load a text file (.csv) into a SQL Server database table. Each line in the file is supposed to be loaded into a single column in the table. I find that lines starting with "#" are skipped, with no error. For example, the first two of the following four lines are loaded fine, but the last two are not. Anybody knows why?
ThisLineShouldBeLoaded
This one as well
#ThisIsATestLine
#This is another test line
Here's the segment of my code:
var sqlConn = connection.StoreConnection as SqlConnection;
sqlConn.Open();
CsvReader reader = new CsvReader(new StreamReader(f), false);
using (var bulkCopy = new SqlBulkCopy(sqlConn))
{
bulkCopy.DestinationTableName = "dbo.TestTable";
try
{
reader.SkipEmptyLines = true;
bulkCopy.BulkCopyTimeout = 300;
bulkCopy.WriteToServer(reader);
reader.Dispose();
reader = null;
}
catch (Exception ex)
{
Console.WriteLine(ex.Message);
System.Diagnostics.Debug.WriteLine(ex.Message);
throw;
}
}
# is the default comment character for CsvReader. You can change the comment character by changing the Comment property of the Configuration object. You can disable comment processing altogether by setting the AllowComment property to false, eg:
reader.Configuration.AllowComments=false;
SqlBulkCopy doesn't deal with CSV files at all, it sends any data that's passed to WriteServer to the database. It doesn't care where the data came from or what it contains, as long as the column mappings match
Update
Assuming LumenWorks.Framework.IO.Csv refers to this project the comment character can be specified in the constructor. One could set it to something that wouldn't appear in a normal file, perhaps even the NUL character, the default char value :
CsvReader reader = new CsvReader(new StreamReader(f), false, escape:default);
or
CsvReader reader = new CsvReader(new StreamReader(f), false, escape : '\0');
I want to read an csv file from the cloud bucket and write it to a bigquery table with columns using dataflow in java. How can I set the headers to the csv file while writing to bigquery?
There are two issues to solve here
Skipping the header when reading the data, and
Using the header to correctly populate teh bigquery table columns.
For (1) this is, as of June 2019, not implemented natively, though you could try the options listed at Skipping header rows - is it possible with Cloud DataFlow?. For (2) the easiest would be to read the first line of your CSV in your main program, and pass the list of column names in the constructor to a DoFn that converts CSV lines into TableRow objects ready to write to Bigquery.
Your final program would look something like
public void CsvToBigquery(csvInputPattern, bigqueryTable) {
final String[] columns = readAndSplitFirstLineOfFirstFile(csvInputPattern);
Pipeline p = new Pipeline.create(...);
p
.apply(TextIO.read().from(csvInputPattern)
.apply(Filter.by(new MatchIfNonHeader())
.apply(ParDo.of(new DoFn<String, TableRow>() {
... // use columns here to TableRows
})
.apply(BigtableIO.write().withTableId(bigqueryTable)...);
}
I've done a similar task and used Apache Common library in ParDo function to extract the data from CSV files and then converted them to Table Row Objects for BQ.
String fileData = c.element();
BufferedReader fileReader = new BufferedReader(new InputStreamReader(
new ByteArrayInputStream(fileData.getBytes("UTF-8")), "UTF-8"));
CSVParser csvParser = new CSVParser(fileReader,CSVFormat.DEFAULT.withFirstRecordAsHeader().withIgnoreHeaderCase().withTrim());
Iterable<CSVRecord> csvRecords = csvParser.getRecords();
for (CSVRecord csvRecord : csvRecords) {
TableRow row = new TableRow();
checkAndConvertIntoBqDataType(csvRecord.toMap());
c.output(row);
}
I have a folder structure of the type year/month/day/hour/*, and I'd like the beam to read this as an unbounded source in chronological order. Specifically, this means reading in all the files in the first hour on record and adding their contents for processing. Then, add the file contents of the next hour for processing, up until the current time where it waits for new files to arrive in the latest year/month/day/hour folder.
Is it possible to do this with apache beam?
So what I would do is to add timestamps to each element according to the file path. As a test I used the following example.
First of all, as explained in this answer, you can use FileIO to match continuously a file pattern. This will help as, per your use case, once you have finished with the backfill you want to keep reading new arriving files within the same job. In this case I provide gs://BUCKET_NAME/data/** because my files will be like gs://BUCKET_NAME/data/year/month/day/hour/filename.extension:
p
.apply(FileIO.match()
.filepattern(inputPath)
.continuously(
// Check for new files every minute
Duration.standardMinutes(1),
// Never stop checking for new files
Watch.Growth.<String>never()))
.apply(FileIO.readMatches())
Watch frequency and timeout can be adjusted at will.
Then, in the next step we'll receive the matched file. I will use ReadableFile.getMetadata().resourceId() to get the full path and split it by "/" to build the corresponding timestamp. I round it to the hour and do not account for timezone correction here. With readFullyAsUTF8String we'll read the whole file (be careful if the whole file does not fit into memory, it is recommended to shard your input if needed) and split it into lines. With ProcessContext.outputWithTimestamp we'll emit downstream a KV of filename and line (the filename is not needed anymore but it will help to see where each file comes from) and the timestamp derived from the path. Note that we're shifting timestamps "back in time" so this can mess up with the watermark heuristics and you will get a message such as:
Cannot output with timestamp 2019-03-17T00:00:00.000Z. Output timestamps must be no earlier than the timestamp of the current input (2019-06-05T15:41:29.645Z) minus the allowed skew (0 milliseconds). See the DoFn#getAllowedTimestampSkew() Javadoc for details on changing the allowed skew.
To overcome this I set getAllowedTimestampSkew to Long.MAX_VALUE but take into account that this is deprecated. ParDo code:
.apply("Add Timestamps", ParDo.of(new DoFn<ReadableFile, KV<String, String>>() {
#Override
public Duration getAllowedTimestampSkew() {
return new Duration(Long.MAX_VALUE);
}
#ProcessElement
public void processElement(ProcessContext c) {
ReadableFile file = c.element();
String fileName = file.getMetadata().resourceId().toString();
String lines[];
String[] dateFields = fileName.split("/");
Integer numElements = dateFields.length;
String hour = dateFields[numElements - 2];
String day = dateFields[numElements - 3];
String month = dateFields[numElements - 4];
String year = dateFields[numElements - 5];
String ts = String.format("%s-%s-%s %s:00:00", year, month, day, hour);
Log.info(ts);
try{
lines = file.readFullyAsUTF8String().split("\n");
for (String line : lines) {
c.outputWithTimestamp(KV.of(fileName, line), new Instant(dateTimeFormat.parseMillis(ts)));
}
}
catch(IOException e){
Log.info("failed");
}
}}))
Finally, I window into 1-hour FixedWindows and log the results:
.apply(Window
.<KV<String,String>>into(FixedWindows.of(Duration.standardHours(1)))
.triggering(AfterWatermark.pastEndOfWindow())
.discardingFiredPanes()
.withAllowedLateness(Duration.ZERO))
.apply("Log results", ParDo.of(new DoFn<KV<String, String>, Void>() {
#ProcessElement
public void processElement(ProcessContext c, BoundedWindow window) {
String file = c.element().getKey();
String value = c.element().getValue();
String eventTime = c.timestamp().toString();
String logString = String.format("File=%s, Line=%s, Event Time=%s, Window=%s", file, value, eventTime, window.toString());
Log.info(logString);
}
}));
For me it worked with .withAllowedLateness(Duration.ZERO) but depending on the order you might need to set it. Keep in mind that a value too high will cause windows to be open for longer and use more persistent storage.
I set the $BUCKET and $PROJECT variables and I just upload two files:
gsutil cp file1 gs://$BUCKET/data/2019/03/17/00/
gsutil cp file2 gs://$BUCKET/data/2019/03/18/22/
And run the job with:
mvn -Pdataflow-runner compile -e exec:java \
-Dexec.mainClass=com.dataflow.samples.ChronologicalOrder \
-Dexec.args="--project=$PROJECT \
--path=gs://$BUCKET/data/** \
--stagingLocation=gs://$BUCKET/staging/ \
--runner=DataflowRunner"
Results:
Full code
Let me know how this works. This was just an example to get started and you might need to adjust windowing and triggering strategies, lateness, etc to suit your use case
I have a large csv file as below:
DATE status code value value2
2014-12-13 Shipped 105732491-20091002165230 0.000803398 0.702892835
2014-12-14 Shipped 105732491-20091002165231 0.012925206 1.93748834
2014-12-15 Shipped 105732491-20091002165232 0.000191278 0.004772389
2014-12-16 Shipped 105732491-20091002165233 0.007493046 0.44883348
2014-12-17 Shipped 105732491-20091002165234 0.022015049 3.081006137
2014-12-18 Shipped 105732491-20091002165235 0.001894693 0.227268466
2014-12-19 Shipped 105732491-20091002165236 0.000312871 0.003113062
2014-12-20 Shipped 105732491-20091002165237 0.001754068 0.105016053
2014-12-21 Shipped 105732491-20091002165238 0.009773315 0.585910214
:
:
What i need to do is remove the header and change the date format to an integer yyyymmdd (eg. 20141217)
I am using opencsv to read and write the file.
Is there a way where i can change all the dates at once without parsing them one by one?
Below is my code to remove the header and create a new file:
void formatCsvFile(String fileToChange) throws Exception {
CSVReader reader = new CSVReader(new FileReader(new File(fileToChange)), CSVParser.DEFAULT_SEPARATOR, CSVParser.NULL_CHARACTER, CSVParser.NULL_CHARACTER, 1)
info "Read all rows at once"
List<String[]> allRows = reader.readAll();
CSVWriter writer = new CSVWriter(new FileWriter(fileToChange), CSVWriter.DEFAULT_SEPARATOR, CSVWriter.NO_QUOTE_CHARACTER)
info "Write all rows at once"
writer.writeAll(allRows)
writer.close()
}
Please can some one help?
Thanks
You don't need to parse the dates, but you do need to process each line in the file and convert the data on each line you want to convert. Java/Groovy doesn't have anything like awk where you can work with file data as columns, for example, the first 10 "columns" (characters usually) in every line in a file. Java/Groovy only deals with "rows" of data in a file, not "columns".
You could try something like this: (in Groovy)
reader.eachLine { String theLine ->
int idx = theLine.indexOf(' ')
String oldDate = theLine.subString(0, idx)
String newDate = oldDate.replaceAll('-', '')
String newLine = newDate + theLine.subString(idx);
writer.writeLine(newline);
}
Edit:
If your CSVReader class is not derived from File, then you can't use Groovy's eachLine method on it. And if the CSVReader class's readAll() method really returns a List of String arrays, then the above code could change to this:
allRows.each { String[] theLine ->
String newDate = theLine[0].replaceAll('-', '')
writer.writeLine(newDate + theLine[1..-1])
}
Ignore the first line (the header):
List<String[]> allRows = reader.readAll()[1..-1];
and replace the '-' in the dates by splitting each row and editting the first:
allrows = allrows.collect{
row -> row.split(',')[0].replace(',','') // the date
+ row.split(',')[1..-1] // the rest
}
I don't know what you mean by "all dates at once". For me can only be iterated.