I am getting a large SOAP response with around 10,000+ rows of data, and I am extracting the relevant data from within CDATA and inserting it into a database.
The code looks like this:
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setCoalescing(true);
DocumentBuilder db = dbf.newDocumentBuilder();
InputStream is = new ByteArrayInputStream(resp.getBytes());
Document doc = db.parse(is);
XPathFactory xPathfactory = XPathFactory.newInstance();
XPath xpath = xPathfactory.newXPath();
XPathExpression expr = xpath.compile(fetchResult);
String result = (String) expr.evaluate(doc, XPathConstants.STRING);
resp contains the SOAP response
fetchResult is
String fetchResult = "//result/text()";
I'm getting an OutOfMemoryError on the last line. I did try increasing the heap size to 1.2 GB, but that didn't work either.
Could any of you please help me out?
Process the document as a stream rather than consuming the entire response as a Document. The code you have at the moment builds an in-memory structure representing the whole document, and then evaluates an XPath expression against it.
A more memory-efficient approach would be to use a SAX parser instead (see here). With a SAX parser you stream through the document, keeping only a tiny part of it in memory at any one time.
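As a rough illustration, here is a minimal SAX sketch. It assumes the text you need lives in <result> elements, as your //result/text() expression suggests; the element name, the sample response in main, and the handleResultText callback are placeholders for your own extraction and database-insert logic:
import java.io.ByteArrayInputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class ResultTextHandler extends DefaultHandler {
    private final StringBuilder current = new StringBuilder();
    private boolean inResult = false;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        if ("result".equals(qName)) {
            inResult = true;
            current.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times per element (CDATA arrives here too); accumulate the text.
        if (inResult) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("result".equals(qName)) {
            inResult = false;
            // Hand the accumulated text off here instead of keeping the whole document in memory.
            handleResultText(current.toString());
        }
    }

    private void handleResultText(String text) {
        // Placeholder: parse the row and insert it into the database (ideally in batches).
    }

    public static void main(String[] args) throws Exception {
        String resp = "<Envelope><result><![CDATA[row data]]></result></Envelope>";
        SAXParserFactory.newInstance()
                .newSAXParser()
                .parse(new ByteArrayInputStream(resp.getBytes("UTF-8")), new ResultTextHandler());
    }
}
Because nothing larger than a single <result>'s text is held at any point, heap usage stays flat no matter how many rows the response contains.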
Problem Statement
We are consuming multiple CSV files into PCollections -> applying Beam SQL to transform the data -> writing the resulting PCollection.
This works absolutely fine as long as all the source PCollections contain some data and Beam SQL generates a new PCollection with some data.
When the transform generates an empty PCollection and we write it to NetApp StorageGRID, it throws the exception below:
Exception in thread "main" org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.io.IOException: Failed closing channel to s3://bucket-name/.temp-beam-847f362f-8884-454e-bfbe-baf9a4e32806/0b72948a5fcccb28-174f-426b-a225-ae3d3cbb126f
at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:373)
at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:341)
at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:218)
at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:67)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:323)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:309)
at ECSOperations.main(ECSOperations.java:53)
Caused by: java.io.IOException: Failed closing channel to s3://bucket-name/.temp-beam-847f362f-8884-454e-bfbe-baf9a4e32806/0b72948a5fcccb28-174f-426b-a225-ae3d3cbb126f
at org.apache.beam.sdk.io.FileBasedSink$Writer.close(FileBasedSink.java:1076)
at org.apache.beam.sdk.io.FileBasedSink$WriteOperation.createMissingEmptyShards(FileBasedSink.java:759)
at org.apache.beam.sdk.io.FileBasedSink$WriteOperation.finalizeDestination(FileBasedSink.java:639)
at org.apache.beam.sdk.io.WriteFiles$FinalizeTempFileBundles$FinalizeFn.process(WriteFiles.java:1040)
Caused by: java.io.IOException: com.amazonaws.services.s3.model.AmazonS3Exception: Your proposed upload is smaller than the minimum allowed object size. (Service: Amazon S3; Status Code: 400; Error Code: EntityTooSmall; Request ID: 1643869619144605; S3 Extended Request ID: null; Proxy: null), S3 Extended Request ID: null
Following is the sample code:
ECSOptions options = PipelineOptionsFactory.fromArgs(args).as(ECSOptions.class);
setupConfiguration(options);
Pipeline p = Pipeline.create(options);
PCollection<String> pSource = p.apply(TextIO.read().from("src/main/resources/empty.csv"));
pSource.apply(TextIO.write().to("s3://bucket-name/empty.csv").withoutSharding());
p.run();
Observation
It works fine if we write a simple file rather than a multipart file (a simple put object to StorageGRID).
It seems to be a known issue with StorageGRID, but we want to check whether we can handle it from the Beam pipeline or not.
What I have tried
Tried to check the size of the PCollection before writing and put some string into the output file, but since the PCollection is empty it never enters the PTransform at all.
Tried Count.globally as well, but even that didn't help.
Ask
Is there any way we can handle this in Beam, i.e. check the size of the PCollection before writing, and skip the write when the size is zero (an empty PCollection) so we can avoid this issue?
Has anyone faced a similar issue and been able to sort it out?
I came up with two more options:
TextIO.write().withFooter(...) to always write a single empty line (or space, or whatever) at the end of your file to ensure it's not empty (a one-line sketch follows the code below).
You could flatten your PCollection with a PCollection that has a single empty line iff the given PCollection is empty. (This is more complicated, but could be used more generally.) Specifically:
PCollection<String> pcollToWrite = ...
// This will count the number of elements in pcollToWrite at runtime.
PCollectionView<Long> toWriteSize = pcollToWrite.apply(Count.globally().asSingletonView());

PCollection<String> emptyOrSingletonPCollection =
    p
        // Creates a PCollection with a single element.
        .apply(Create.of(Collections.singletonList("")))
        // Applies a ParDo that will emit this single element if and only if
        // toWriteSize is zero.
        .apply(ParDo.of(new DoFn<String, String>() {
          @ProcessElement
          public void processElement(@Element String mainElement, OutputReceiver<String> out, ProcessContext c) {
            if (c.sideInput(toWriteSize) == 0) {
              out.output("");
            }
          }
        }).withSideInputs(toWriteSize));

// We now flatten pcollToWrite and emptyOrSingletonPCollection together.
// The right hand side has exactly one element whenever the left hand side
// is empty, so there will always be at least one element.
PCollectionList.of(pcollToWrite, emptyOrSingletonPCollection)
    .apply(Flatten.pCollections())
    .apply(TextIO.write().to(...));
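For completeness, option 1 above (withFooter) is just a one-line change to the write. A minimal sketch, assuming a TextIO write similar to the one in your pipeline; the destination path is illustrative:
// The footer line is appended to every shard, so the object written to S3 is
// never 0 bytes, even when pcollToWrite is empty.
pcollToWrite.apply(
    TextIO.write()
        .to("s3://bucket-name/output")
        .withoutSharding()
        .withFooter(" "));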
You can't check to see if the PCollection is empty during pipeline construction, as it has not been computed yet. If this filesystem can't support empty files, you could try writing to another filesystem and then copying iff the file is not empty (assuming the file in question is not too large).
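If you do try that last route, here is a minimal sketch of the "copy only if non-empty" step, using plain Java file APIs after the pipeline finishes; the paths are illustrative, and the actual upload to StorageGRID/S3 is left as a comment:
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class CopyIfNotEmpty {
    public static void main(String[] args) throws IOException {
        // Run the Beam pipeline against a local path first (e.g. /tmp/beam-output/result.csv),
        // then push the file to the object store only when it actually has content.
        Path localOutput = Paths.get("/tmp/beam-output/result.csv");
        if (Files.exists(localOutput) && Files.size(localOutput) > 0) {
            // Upload localOutput to s3://bucket-name/result.csv here (e.g. via the AWS SDK).
        }
    }
}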
I have a ReactiveMongo BSONDocument that I want to write to a file. I know there's the BSON format (http://bsonspec.org/spec.html) and I want to write it according to that spec, but the problem is that I can't find any method call to do this. I've been able to convert it to an array of bytes, but the problem begins when I convert that to a string, which is UTF-8 by default.
However, the BSON spec requires a 32-bit number at the beginning. Is there a library that can do this for me? If not, how can I put a string representing a 32-bit number and a UTF-8 string together without losing the encoding of either or both?
Here's what I've got in Scala:
import java.nio.charset.Charset
import reactivemongo.bson.BSONDocument
import reactivemongo.bson.buffer.ArrayBSONBuffer
val doc = BSONDocument("data" -> overall)
val buffer = new ArrayBSONBuffer()
BSONDocument.write(doc, buffer)
val bytes = buffer.array
val str = new String(bytes, Charset.forName("UTF8"))
For reference, I know that in Ruby we can do something like this, but how do I do the same thing with ReactiveMongo?
bson_data = BSON.serialize({data: arr}).to_s
As indicated in the documentation, you can use BSONDocument.pretty(myDoc).
Note that you are using the deprecated BSON API, which is being removed.
Could somebody please confirm the following?
I am using Mirth Connect 3.5.08232.
My Source Connector is a Database Reader.
Say I am using a query that returns multiple rows and I return the result (via JavaScript), as the documentation suggests, so that Mirth treats each row as a separate message. I also use a couple of Mapper steps as source transformers and save the mapped fields in my channel map (which ends up containing only those fields that I define in the transformers).
In the destination, and specifically in the destination response transformer (or the destination body, if it is a JavaScript Writer), how do I access the source fields?
The only way I have found, by trial and error, is:
var rawMsg = connectorMessage.getRawData();
var xmlMsg = new XML(rawMsg);
logger.info(xmlMsg.some_field); // ignore the root element of rawMsg
Is this the right way to do it? I thought that the fields that were so nicely auto-detected would be put into some kind of map, like sourceMap, but that doesn't seem to be the case, right?
Thank you
If you are using Mapper steps in your transformer to extract the data and put it into a variable map (like the channel map), then you can use any of the following methods to retrieve it from a subsequent JavaScript context (including a JavaScript Writer, and your response transformer):
var value = channelMap.get('key');
var value = $c('key');
var value = $('key');
Look at the Variable Maps section of the User Guide for more information.
So to recap, say you're selecting a column "mycolumn" with a Database Reader. The XML sent to the channel will be something like this:
<result>
<mycolumn>value</mycolumn>
</result>
Then you can choose to extract pieces of that message into specific variables for later use. The transformer allows you to easily drag-and-drop pieces of the sample inbound message.
Finally, in your JavaScript Writer (or in any subsequent filter, transformer, or response transformer), just drag the value into the field you want, and the corresponding JavaScript code (for example, $('key')) will be inserted automatically.
One last note: if you are selecting a lot of columns and don't want to create a Mapper step for each one individually, you can use a JavaScript step to iterate through the message and extract each column into a separate map variable:
for each (child in msg.children()) {
channelMap.put(child.localName(), child.toString());
}
Or, you can just reference the columns directly from within the JavaScript Writer:
var msg = new XML(connectorMessage.getEncodedData());
var column1 = msg.column1.toString();
var column2 = msg.column2.toString();
...
I have a dataset of employees and their leave records. Every record (of type EmployeeRecord) contains an EmpID (of type String) and other fields. I read the records from a file and then transform them into PairRDDFunctions:
val empRecords = sc.textFile(args(0))
....
val empsGroupedByEmpID = this.groupRecordsByEmpID(empRecords)
At this point, 'empsGroupedByEmpID' is of type RDD[(String, Iterable[EmployeeRecord])]. I wrap it in PairRDDFunctions:
val empsAsPairRDD = new PairRDDFunctions[String,Iterable[EmployeeRecord]](empsGroupedByEmpID)
Then I process the records as per the application's logic. Finally, I get an RDD of type RDD[Iterable[EmployeeRecord]]:
val finalRecords: RDD[Iterable[EmployeeRecord]] = <result of a few computations and transformation>
When I try to write the contents of this RDD to a text file using the available API thus:
finalRecords.saveAsTextFile("./path/to/save")
I find that every record in the file begins with an ArrayBuffer(...). What I need is a file with one EmployeeRecord on each line. Is that not possible? Am I missing something?
I have spotted the missing API. It is, well... flatMap! :-)
By using flatMap with identity, I can get rid of the Iterable and 'unpack' its contents, like so:
finalRecords.flatMap(identity).saveAsTextFile("./path/to/file")
That solves the problem I have been having.
I also found this post suggesting the same thing. I wish I had seen it a bit earlier.
My application takes a user-entered string and tries to parse it with the Lucene query parser. I have noticed, however, that several string formats provoke an error in this query parser.
e.g.:
~anystring
anystring +
At first I tried sanitizing the user-entered string so that it could not contain these cases, but as I see it, there could be more cases that I do not foresee now.
How do you handle Query parser exceptions? How do you prevent them?
We catch the remaining parse exceptions and display an error message ("Your search did not match any documents. Suggestion: Try different keywords.").
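A minimal sketch of that catch-and-fall-back pattern, assuming Lucene 5+ where QueryParser takes a field name and an Analyzer; the "content" field and the SearchBox class are illustrative:
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

public class SearchBox {
    // Returns the parsed query, or null when the input is not valid query syntax,
    // in which case the caller shows the "did not match any documents" message.
    public Query parseUserQuery(String userInput) {
        QueryParser parser = new QueryParser("content", new StandardAnalyzer());
        try {
            return parser.parse(userInput);
        } catch (ParseException e) {
            return null;
        }
    }
}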
See also How to make the Lucene QueryParser more forgiving?
// Alternatively, strip Lucene's special characters from the query string before parsing:
query = query.replace(/([\!\*\+\&\|\(\)\[\]\{\}\^\~\?\:\"\/])/g, "");