Exception while writing an empty multipart CSV file from Apache Beam into NetApp StorageGRID - apache-beam

Problem Statement
We consume multiple CSV files into PCollections, apply Beam SQL to transform the data, and write the resulting PCollection.
This works absolutely fine when all the source PCollections contain data and the Beam SQL generates a new collection with some data in it.
When the transform generates an empty PCollection and we write it to NetApp StorageGRID, it throws the following:
Exception in thread "main" org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.io.IOException: Failed closing channel to s3://bucket-name/.temp-beam-847f362f-8884-454e-bfbe-baf9a4e32806/0b72948a5fcccb28-174f-426b-a225-ae3d3cbb126f
at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:373)
at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:341)
at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:218)
at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:67)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:323)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:309)
at ECSOperations.main(ECSOperations.java:53)
Caused by: java.io.IOException: Failed closing channel to s3://bucket-name/.temp-beam-847f362f-8884-454e-bfbe-baf9a4e32806/0b72948a5fcccb28-174f-426b-a225-ae3d3cbb126f
at org.apache.beam.sdk.io.FileBasedSink$Writer.close(FileBasedSink.java:1076)
at org.apache.beam.sdk.io.FileBasedSink$WriteOperation.createMissingEmptyShards(FileBasedSink.java:759)
at org.apache.beam.sdk.io.FileBasedSink$WriteOperation.finalizeDestination(FileBasedSink.java:639)
at org.apache.beam.sdk.io.WriteFiles$FinalizeTempFileBundles$FinalizeFn.process(WriteFiles.java:1040)
Caused by: java.io.IOException: com.amazonaws.services.s3.model.AmazonS3Exception: Your proposed upload is smaller than the minimum allowed object size. (Service: Amazon S3; Status Code: 400; Error Code: EntityTooSmall; Request ID: 1643869619144605; S3 Extended Request ID: null; Proxy: null), S3 Extended Request ID: null
Following is the sample code:
ECSOptions options = PipelineOptionsFactory.fromArgs(args).as(ECSOptions.class);
setupConfiguration(options);
Pipeline p = Pipeline.create(options);
PCollection<String> pSource= p.apply(TextIO.read().from("src/main/resources/empty.csv"));
pSource.apply(TextIO.write().to("s3://bucket-name/empty.csv").withoutSharding());
p.run();
Observation
It works fine if we write a simple file rather than a multipart file (a simple put object to StorageGRID).
This seems to be a known issue with StorageGRID, but we want to check whether we can handle it from the Beam pipeline or not.
What I have tried
Tried to check the size of the PCollection before writing and to put some placeholder string into the output file, but since the PCollection is empty the data never reaches the PTransform at all.
Tried Count.globally as well, but even that didn't help.
Ask
Is there any way to handle this in Beam, e.g. check the size of the PCollection before writing, so that if the size is zero (an empty PCollection) we can skip writing the file and avoid this issue?
Has anyone faced a similar issue and been able to sort it out?

I came up with two more options:
TextIO.write().withFooter(...) to always write a single empty line (or space or whatever) at the end of your file to ensure it's not empty.
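A minimal sketch of that first option, reusing the write from the question (the destination path and the single-space footer are just placeholders):
// Always append a footer line so the uploaded object is never zero bytes.
pSource.apply(TextIO.write()
    .to("s3://bucket-name/empty.csv")
    .withFooter(" ")
    .withoutSharding());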
You could flatten your PCollection with a PCollection that has a single empty line iff the given PCollection is empty. (This is more complicated, but could be used more generally.) Specifically:
PCollection<String> pcollToWrite = ...
// This will count the number of elements in pcollToWrite at runtime.
PCollectionView<Long> toWriteSize = pcollToWrite.apply(Count.globally().asSingletonView());

PCollection<String> emptyOrSingletonPCollection =
    p
        // Creates a PCollection with a single element.
        .apply(Create.of(Collections.singletonList("")))
        // Applies a ParDo that will emit this single element if and only if
        // toWriteSize is zero.
        .apply(ParDo.of(new DoFn<String, String>() {
          @ProcessElement
          public void processElement(@Element String mainElement, OutputReceiver<String> out, ProcessContext c) {
            if (c.sideInput(toWriteSize) == 0) {
              out.output("");
            }
          }
        }).withSideInputs(toWriteSize));

// We now flatten pcollToWrite and emptyOrSingletonPCollection together.
// The right hand side has exactly one element whenever the left hand side
// is empty, so there will always be at least one element.
PCollectionList.of(pcollToWrite, emptyOrSingletonPCollection)
    .apply(Flatten.pCollections())
    .apply(TextIO.write(...))

You can't check to see if the PCollection is empty during pipeline construction, as it has not been computed yet. If this filesystem can't support empty files, you could try writing to another filesystem and then copying iff the file is not empty (assuming the file in question is not too large).
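If you go the copy route, a rough sketch using Beam's FileSystems API after the pipeline finishes; the staging and destination paths are placeholders and exception handling is omitted:
p.run().waitUntilFinish();

// Make sure the relevant filesystems (local, S3) are registered.
FileSystems.setDefaultPipelineOptions(options);
MatchResult staged = FileSystems.match("/tmp/staging/output*");
for (MatchResult.Metadata meta : staged.metadata()) {
  if (meta.sizeBytes() > 0) {   // copy only non-empty files to StorageGRID
    FileSystems.copy(
        Collections.singletonList(meta.resourceId()),
        Collections.singletonList(FileSystems.matchNewResource(
            "s3://bucket-name/" + meta.resourceId().getFilename(), false)));
  }
}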

Related

QuickFIX/J not reading all the repeating groups in FIX message

We are receiving FIX messages from the WebICE exchange in a text file, and our application reads and parses them line by line using QuickFIX/J. We noticed that in some messages the repeating group fields are not being parsed, and validating them against the data dictionary gives the following error:
quickfix.FieldException: Out of order repeating group members, field=326
For example, in the sample file data-test.csv the first 2 rows parse successfully, but the third one fails with the above error message.
Upon investigation I found that in the first 2 rows tag 326 comes after tag 9133, but in the third row it comes before it, and hence it fails validation. If I adjust the data dictionary to match the third one it succeeds, but of course then the first one starts failing.
This is happening only for a few messages; most of the other FIX messages are validated and parsed quite fine. This is part of a migration project from an existing C# application using QuickFIX/n to our Scala application using QuickFIX/J, and it has been working fine at the source end (with QuickFIX/n). Is there any difference between the two libraries, QuickFIX/J and QuickFIX/n, in terms of dealing with group fields?
To help recreate the issue, I have shared a data file with the 3 FIX messages explained above.
Data file : data-test.csv
Data dictionary : ICE-FIX42.xml
Here is the test code snippet
import java.io.File
import scala.io.Source
import quickfix.{DataDictionary, Group}

val dd: DataDictionary = new DataDictionary("ICE-FIX42.xml")
val mfile = new File("data-test.csv")
for (line <- Source.fromFile(mfile).getLines) {
  val message = new quickfix.Message(line, dd)
  dd.setCheckUnorderedGroupFields(true)
  dd.validate(message)
  val noOfunderlyings = message.getInt(711)
  println("Number of Underlyings " + noOfunderlyings)
  for (i <- 1 to noOfunderlyings) {
    val fixGroup: Group = message.getGroup(i, 711)
    println("UnderlyingSecurityID : " + fixGroup.getString(311))
  }
}
Requesting fellow SO users to help me with this.
Many thanks!
You should use setCheckUnorderedGroupFields(false) to disable the validation of the ordering in repeating groups. However, this is only a workaround.
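A minimal sketch of that workaround in Java (exception handling omitted; the dictionary file name is taken from the question):
DataDictionary dd = new DataDictionary("ICE-FIX42.xml");
// Workaround: accept repeating-group fields that arrive out of order.
dd.setCheckUnorderedGroupFields(false);

quickfix.Message message = new quickfix.Message(line, dd);
dd.validate(message);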
I would suggest approaching your counterparty about this, because especially in repeating groups the field order is required to follow the message definition, i.e. the order in the data dictionary.
FIX TagValue encoding spec
Field sequence within a repeating group
...
Fields within repeating groups must be specified in the order that the fields are specified in the message definition.

Cloud Dataflow GlobalWindow trigger ignored

Using the AfterPane.elementCountAtLeast trigger does not work when run using the Dataflow runner, but works correctly when run locally. When run on Dataflow, it produces only a single pane.
The goal is to extract data from Cloud SQL, transform it, and write it to Cloud Storage. However, there is too much data to keep in memory, so it needs to be split up and written to Cloud Storage in chunks. That's what I hoped this would do.
The complete code is:
val pipe = sc.jdbcSelect(getReadOptions(connOptions, stmt))
  .applyTransform(ParDo.of(new Translator()))
  .map(row => row.mkString("|"))
  // produce one global window with one pane per ~500 records
  .withGlobalWindow(WindowOptions(
    trigger = Repeatedly.forever(AfterPane.elementCountAtLeast(500)),
    accumulationMode = AccumulationMode.DISCARDING_FIRED_PANES
  ))

val out = TextIO
  .write()
  .to("gs://test-bucket/staging")
  .withSuffix(".txt")
  .withNumShards(1)
  .withShardNameTemplate("-P-S")
  .withWindowedWrites() // gets us one file per window & pane

pipe.saveAsCustomOutput("writer", out)
I think the root of the problem may be that the JdbcIO class is implemented as a PTransform<PBegin,PCollection> and a single call to processElement outputs the entire SQL query result:
public void processElement(ProcessContext context) throws Exception {
  try (PreparedStatement statement =
      connection.prepareStatement(
          query.get(), ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
    statement.setFetchSize(fetchSize);
    parameterSetter.setParameters(context.element(), statement);
    try (ResultSet resultSet = statement.executeQuery()) {
      while (resultSet.next()) {
        context.output(rowMapper.mapRow(resultSet));
      }
    }
  }
}
In the end, I had two problems to resolve:
1. The process would run out of memory, and 2. the data was written to a single file.
There is no way to work around problem 1 with Beam's JdbcIO and Cloud SQL because of the way it uses the MySQL driver. The driver itself loads the entire result within a single call to executeStatement. There is a way to get the driver to stream results, but I had to implement my own code to do that. Specifically, I implemented a BoundedSource for JDBC.
For the second problem, I used the row number to set the timestamp of each element. That allows me to explicitly control how many rows are in each window using FixedWindows.
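A rough sketch of that idea with the Beam Java SDK; MyRow and its rowNumber field are hypothetical stand-ins for whatever the custom JDBC source emits, and rows is the PCollection it produces:
PCollection<MyRow> windowed = rows
    // Stamp each element with a timestamp derived from its row number...
    .apply(ParDo.of(new DoFn<MyRow, MyRow>() {
      @ProcessElement
      public void processElement(@Element MyRow row, OutputReceiver<MyRow> out) {
        out.outputWithTimestamp(row, new Instant(row.getRowNumber()));
      }
    }))
    // ...so that 500 "milliseconds" of event time correspond to ~500 rows per window.
    .apply(Window.<MyRow>into(FixedWindows.of(Duration.millis(500))));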
elementCountAtLeast is only a lower bound, so producing a single pane is a valid choice for a runner.
You have a couple of options when doing this for a batch pipeline:
Allow the runner to decide how big the files are and how many shards are written:
val pipe = sc.jdbcSelect(getReadOptions(connOptions, stmt))
  .applyTransform(ParDo.of(new Translator()))
  .map(row => row.mkString("|"))

val out = TextIO
  .write()
  .to("gs://test-bucket/staging")
  .withSuffix(".txt")

pipe.saveAsCustomOutput("writer", out)
This is typically the fastest option when the TextIO is preceded by a GroupByKey or a source that supports splitting. To my knowledge JDBC doesn't support splitting, so your best option is to add a Reshuffle after the jdbcSelect, which will enable parallel processing after the data has been read from the database.
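In the Beam Java SDK that fusion break would look roughly like this (a sketch only; dataSourceConfig and the row mapper are placeholders for your connection settings and query):
PCollection<String> rows = pipeline
    .apply("ReadFromJdbc", JdbcIO.<String>read()
        .withDataSourceConfiguration(dataSourceConfig)
        .withQuery(stmt)
        .withRowMapper(rs -> rs.getString(1))
        .withCoder(StringUtf8Coder.of()))
    // Break fusion so the steps after the read can run in parallel.
    .apply("BreakFusion", Reshuffle.viaRandomKey());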
Manually group into batches using the GroupIntoBatches transform.
val pipe = sc.jdbcSelect(getReadOptions(connOptions, stmt))
  .applyTransform(ParDo.of(new Translator()))
  .map(row => row.mkString("|"))
  .apply(GroupIntoBatches.ofSize(500))

val out = TextIO
  .write()
  .to("gs://test-bucket/staging")
  .withSuffix(".txt")
  .withNumShards(1)

pipe.saveAsCustomOutput("writer", out)
In general, this will be slower than option #1, but it does allow you to choose how many records are written per file.
There are a few other ways to do this, each with their own pros and cons, but the above two are likely the closest to what you want. If you add more details to your question, I may revise this answer further.

mirth connect Database Reader automatic column mapping

Could somebody please confirm the following?
I am using Mirth Connect 3.5.08232.
My Source Connector is a Database Reader.
Say I am using a query that returns multiple rows and I return the result (via JavaScript), as the documentation suggests, so that Mirth treats each row as a separate message. I also use a couple of Mapper steps as source transformers and save the mapped fields in my channel map (which ends up containing only those fields that I define in the transformers).
In the destination, and specifically in the destination response transformer (or the destination body, if it is a JavaScript Writer), how do I access the source fields?
The only way I found, by trial and error, is:
var rawMsg = connectorMessage.getRawData();
var xmlMsg = new XML(rawMsg);
logger.info(xmlMsg.some_field); // ignore the root element of rawMsg
Is this the right way to do it? I thought that maybe the fields that were nicely auto-detected would be put into some kind of map, like sourceMap, but that doesn't seem to be the case, right?
Thank you
If you are using Mapper steps in your transformer to extract the data and put it into a variable map (like the channel map), then you can use any of the following methods to retrieve it from a subsequent JavaScript context (including a JavaScript Writer, and your response transformer):
var value = channelMap.get('key');
var value = $c('key');
var value = $('key');
Look at the Variable Maps section of the User Guide for more information.
So to recap, say you're selecting a column "mycolumn" with a Database Reader. The XML sent to the channel will be something like this:
<result>
<mycolumn>value</mycolumn>
</result>
Then you can choose to extract pieces of that message into specific variables for later use. The transformer allows you to easily drag and drop pieces of the sample inbound message.
Finally, in your JavaScript Writer (or in any subsequent filter, transformer, or response transformer), you can just drag the value into the field you want, and the corresponding JavaScript code will automatically be inserted.
One last note: if you are selecting a lot of columns and don't want to create a Mapper step for each one individually, you can use a JavaScript step to iterate through the message and extract each column into a separate map variable:
for each (child in msg.children()) {
channelMap.put(child.localName(), child.toString());
}
Or, you can just reference the columns directly from within the JavaScript Writer:
var msg = new XML(connectorMessage.getEncodedData());
var column1 = msg.column1.toString();
var column2 = msg.column2.toString();
...

Spark: How to structure a series of side effect actions inside mapping transformation to avoid repetition?

I have a Spark Streaming application that needs to take these steps:
Take a string, apply some map transformations to it
Map again: If this string (now an array) has a specific value in it, immediately send an email (or do something OUTSIDE the spark environment)
collect() and save in a specific directory
apply some other transformation/enrichment
collect() and save in another directory.
As you can see, this implies two lazily activated computations, which perform the OUTSIDE action twice. I am trying to avoid caching, as at some hundreds of lines per second this would kill my server.
I am also trying to maintain the order of operations, though this is not as important: is there a solution I do not know of?
EDIT: my program as of now:
kafkaStream;
lines = take the value, discard the topic;
lines.foreachRDD {
  splittedRDD = arg.map { split the string };
  assRDD = splittedRDD.map { associate to a table };
  flaggedRDD = assRDD.map { add a boolean parameter under a if condition + send mail };
  externalClass.saveStaticMethod( flaggedRDD.collect() and save in file );
  enrichRDD = flaggedRDD.map { enrich with external data };
  externalClass.saveStaticMethod( enrichRDD.collect() and save in file );
}
I put the saving part after the email so that if something goes wrong with it at least the mail has been sent.
The final 2 methods I found were these:
In the DStream transformation before the side-effecting one, branch the DStream: one branch goes on with the transformation, the other gets the .foreachRDD { outside action }. There is no major downside to this, as it is just one more RDD on a worker node (see the sketch after this list).
Extracting the { outside action } from the transformation and mapping the already sent mails: filter out elements whose mail has already been sent. This is almost a superfluous operation, as it will filter out all of the RDD elements.
Caching before going on (although I was trying to avoid it, there was not much else to do).
If you are trying to avoid caching, solution 1 is the way to go.
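A minimal sketch of solution 1 with the Java Spark Streaming API; assStream, addFlag, needsAlert, sendMail, enrich and saveToFile are hypothetical names standing in for the pseudocode above:
// flagged is the DStream right before the side-effecting step.
JavaDStream<String[]> flagged = assStream.map(row -> addFlag(row));

// Branch 1: perform the outside action (mail) and the first save via foreachRDD.
flagged.foreachRDD(rdd -> {
  List<String[]> rows = rdd.collect();
  rows.stream().filter(r -> needsAlert(r)).forEach(r -> sendMail(r));
  saveToFile(rows, "dir1");
});

// Branch 2: continue the main transformation on the same DStream reference.
JavaDStream<String[]> enriched = flagged.map(row -> enrich(row));
enriched.foreachRDD(rdd -> saveToFile(rdd.collect(), "dir2"));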

Pinpointing a skipped item and the error field in a chunk in Spring Batch

Scenario 1
The skip listener interface is as below:
public interface SkipListener<T,S> extends StepListener {
void onSkipInRead(Throwable t);
void onSkipInProcess(T item, Throwable t);
void onSkipInWrite(S item, Throwable t);
}
This interface is best used to log the skipped item and the error.
Is it possible to get the position of the skipped item in the input? For example, if the 10th item in the input gets skipped, I should be able to log "Item number 10 was skipped!" through the above listener.
I need this since my input is a file whose rows do not have any identifying key, so just logging the item would not make it possible to pinpoint it in the file.
What if, instead of a file, the input is a database table? Is it possible to get the position number of the skipped item there as well?
Scenario 2
My bean has three string properties: one, two and three. The input is read from a file through an appropriate row mapper, and then a database table gets loaded with the data after some processing.
Below is a code block from the processor:
if (two.charAt(4) == '_') {
  // do some processing
}
Clearly, if field two comes in empty from the file, the above block will throw a StringIndexOutOfBoundsException and the item will get skipped.
So, inside the skip listener, what I want is information about the column which threw the error.
Here, since the field named two gave the error, the information I would like to log in the skip listener would be something like "Property two threw error 'string index out of bounds exception' in line number 10" or, if possible, even more specific: "property two is empty in line number 10", which makes more sense to business users who do not know Java jargon.
Hope I made my doubts clear.
Thanks for reading!
Scenario 1
To get the line number for a skipped item when reading from a file, you can:
for onSkipInRead - implement your own reader (or wrap the existing one) to act on the exception
for onSkipInProcess - implement your own LineMapper which writes the line number into the item, or writes the current line number into the step context (see the sketch after this list)
for onSkipInWrite - same as for onSkipInProcess
To get the line number for a skipped item when reading from a table, you can:
for onSkipInRead - implement your own reader (or wrap the existing one) to act on the exception, but I'm not sure the line number is available here; it might only be the current chunk start number
for onSkipInProcess - implement your own RowMapper which writes the row number into the item, or writes the current row number into the step context
for onSkipInWrite - same as for onSkipInProcess
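A minimal sketch of the LineMapper idea; the MyRecord item type and its lineNumber property are hypothetical:
import org.springframework.batch.item.file.LineMapper;

// Wraps an existing LineMapper and stamps the source line number onto each item,
// so the SkipListener can later report exactly which line was skipped.
public class LineNumberStampingLineMapper implements LineMapper<MyRecord> {

    private final LineMapper<MyRecord> delegate;

    public LineNumberStampingLineMapper(LineMapper<MyRecord> delegate) {
        this.delegate = delegate;
    }

    @Override
    public MyRecord mapLine(String line, int lineNumber) throws Exception {
        MyRecord record = delegate.mapLine(line, lineNumber);
        record.setLineNumber(lineNumber); // hypothetical property on the item
        return record;
    }
}
In onSkipInProcess/onSkipInWrite you can then log item.getLineNumber() together with the exception.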
Scenario 2
See scenario 1, and add a try/catch block inside your processor so that you throw your own exception, e.g. carrying information about the property position, or alter the step context in a similar way.
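For example (a sketch; MyRecord, its lineNumber property and FieldProcessingException are hypothetical names, and the exception type would need to be configured as skippable):
import org.springframework.batch.item.ItemProcessor;

// Converts the low-level StringIndexOutOfBoundsException into a business-friendly
// exception carrying the property name and line number, which the SkipListener then logs.
public class MyItemProcessor implements ItemProcessor<MyRecord, MyRecord> {

    @Override
    public MyRecord process(MyRecord item) {
        try {
            if (item.getTwo().charAt(4) == '_') {
                // do some processing
            }
        } catch (StringIndexOutOfBoundsException | NullPointerException e) {
            throw new FieldProcessingException(
                "property 'two' is empty in line number " + item.getLineNumber(), e);
        }
        return item;
    }
}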