Streamsets: SpoolDIR_01 Failed to process file

Hi, I'm trying to run a pipeline to process a very large file (about 4 million records). Every time it reaches around 270,000 records it fails, stops processing any more records, and returns this error:
'/FileLocation/FiLeNAME..DAT' at position '93167616': com.streamsets.pipeline.lib.dirspooler.BadSpoolFileException: com.streamsets.pipeline.api.ext.io.OverrunException: Reader exceeded the read limit '131072'.
If anyone else has experienced a similar issue, please help. Thank you.
I have checked the lines where the pipeline stops, but there seems to be nothing obvious there. I tried another file and it still doesn't work.

Looks like you're hitting the maximum record size limit. This limit is in place to guard against badly formatted data causing 'out of memory' errors.
Check your data format configuration and increase Max Record Length, Maximum Object Length, Max Line Length, etc., depending on the data format you are using.
See the Directory Origin documentation for more detail. Note in particular that you may have to edit sdc.properties if the records you are parsing are bigger than the system-wide limit of 1048576 bytes.
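For example, if your longest records are a few megabytes, you would raise the origin's length setting accordingly and, if I remember correctly, also raise the system-wide cap via the parser.limit property in sdc.properties (the exact property name may differ between versions, so treat this as a sketch; the value below is illustrative):
parser.limit=5242880
Data Collector normally has to be restarted for sdc.properties changes to take effect.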

I recently received this error message as well. When I run up against such size limits in StreamSets, I'll often set the limit to something ridiculously large, then set the maximum value to the value reported in the subsequent error message.
I find it really unfortunate that StreamSets then fails to process the rest of the file when an extra-long record is encountered. That seems counterintuitive to me for a tool used to process vast amounts of data.

Related

How can I avoid "Data too large" in ELK / elasticsearch bulk inserts?

I'm sending data daily to my ELK stack via https://metacpan.org/pod/Search::Elasticsearch::Client::7_0::Bulk
Sometimes, and more often recently, I receive a "Data too large" error. The first part of my data is received, but after this error my sending script stops and I end up with incomplete data.
As far as I understand (correct me if I'm wrong), this happens when the stack runs into memory issues while processing the data it has already received. I assumed that after some time I could send the rest of the data, but the next day the same thing happens: the first chunk of my data is processed, the rest rejected with "Data too large".
I saw that I can add an "on-error" callback, but I have no clue what I could do in it. My idea would be to implement a delay and retry after some time.
Can anyone give me a hint how to achieve that?
Are there any ideas how to avoid the issue in the first place? I already increased heap space some time ago, but after two months the issue reoccurred.
You'd need to check your Elasticsearch logs and the full response that Elasticsearch sends back (e.g. was it a 429?). However, heap pressure can cause this, and you'd probably need to dig into why you are experiencing it.
The other option is to reduce the size of the requests you are sending.
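If you do keep retrying, a rough sketch of the delay-and-retry idea looks like this. It is not specific to any client: sendBulk() is a hypothetical stand-in for whatever bulk call your client makes, and the thresholds are made up.
import java.util.List;

public class BulkRetrySketch {
    // Hypothetical stand-in: sends one chunk and returns the HTTP status of the bulk response.
    static int sendBulk(List<String> chunk) {
        return 200; // call your real client here
    }

    static void sendWithBackoff(List<String> chunk) throws InterruptedException {
        long delayMs = 1_000;                          // start with a one-second pause
        for (int attempt = 0; attempt < 6; attempt++) {
            if (sendBulk(chunk) != 429) {              // 429 = the cluster is pushing back
                return;                                // success, or an error retrying won't fix
            }
            Thread.sleep(delayMs);                     // wait before retrying the same chunk
            delayMs = Math.min(delayMs * 2, 60_000);   // exponential backoff, capped at one minute
        }
        throw new IllegalStateException("bulk request still rejected after retries");
    }
}
Combined with smaller request chunks, this kind of backoff at least gives the cluster room to recover instead of the import dying mid-run.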
Update: Remembering my "experience" with Java, I simply restarted my ELK stack and the next import went through smoothly.
So despite the fact that 512m of memory seems a bit low, it worked after a restart. I'll check again today and over the next few days.
Increase memory
Schedule a nightly restart

Can I break anything if I set log.segment.delete.delay.ms to 0?

I'm trying to put an absolute limit on the size of a topic so it won't fill up disks in case of unexpectedly high write speeds.
Mostly, this amounts to a calculation like segment = min(1GB, 0.1 * $max_space / $partitions), then setting retention.bytes = ($max_space / $partitions - $segment) and segment.bytes = $segment, and reducing log.retention.check.interval.ms. (You can even pull stunts like determining your maximum write rate (disk- or network-bound) and then subtracting a safety margin based on that and log.retention.check.interval.ms, and maybe some absolute offset for the 10MB head index file.) That's all been discussed in other questions; I'm not asking about any of that.
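For concreteness, that arithmetic with made-up numbers (100 GB of usable space, 20 partitions) looks like the sketch below; the class name and values are purely illustrative.
public class TopicSizingSketch {
    public static void main(String[] args) {
        long maxSpace = 100L * 1024 * 1024 * 1024;  // $max_space: total bytes this topic may use
        int partitions = 20;                        // $partitions
        long perPartition = maxSpace / partitions;

        // segment = min(1GB, 0.1 * $max_space / $partitions)
        long segmentBytes = Math.min(1L * 1024 * 1024 * 1024, perPartition / 10);
        // retention.bytes = $max_space / $partitions - $segment
        long retentionBytes = perPartition - segmentBytes;

        System.out.println("segment.bytes=" + segmentBytes);
        System.out.println("retention.bytes=" + retentionBytes);
    }
}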
I'm wondering what log.segment.delete.delay.ms is for, and whether I can safely set it to 0 (or some very small value, say 5?). The documentation just states "The amount of time to wait before deleting a file from the filesystem". I'd like to know why you would do that, i.e. what or who benefits from this amount of time being non-zero.
This configuration can be used to ensure that a consumer of your topic actually has time to consume a message before it gets deleted.
It declares the delay that is applied when the log cleaner is triggered by the time-based or size-based cleanup policies set through log.retention.bytes or log.retention.ms.
Especially when you have calculated the retention based on the size of the data, this configuration can help you set a minimal time before a message gets deleted, so that a consumer can actually consume it during that window.
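For illustration only (the numbers are made up, not a recommendation), the broker side of this would be tightened in server.properties along these lines:
log.retention.check.interval.ms=10000
log.segment.delete.delay.ms=1000
That keeps a small safety margin instead of going all the way down to 0.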

Kafka Streams topology with windowing doesn't trigger state changes

I am building the following Kafka Streams topology (pseudo code):
gK = builder.stream().groupByKey();
g1 = gK.windowedBy(TimeWindows.of("PT1H")).reduce().mapValues().toStream().mapValues().selectKey();
g2 = gK.reduce().mapValues();
g1.leftJoin(g2).to();
If you notice, this is a diamond-shaped (rhomb-like) topology that starts at a single input topic and ends in a single output topic, with messages flowing through two parallel flows that eventually get joined together at the end. One flow applies (tumbling?) windowing, the other does not. Both parts of the flow work on the same key (apart from the WindowedKey intermediately introduced by the windowing).
The timestamps for my messages are event-time; that is, they get picked from the message body by my custom-configured TimestampExtractor implementation. The actual timestamps in my messages are several years in the past.
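For reference, a more concrete (though still simplified) compilable sketch of the same diamond shape, with hypothetical topic names, String serdes, and trivial placeholder lambdas where the real reduce()/mapValues()/join logic would go:
import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Joined;
import org.apache.kafka.streams.kstream.KGroupedStream;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;

public class DiamondTopologySketch {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        KGroupedStream<String, String> gK = builder
                .stream("input", Consumed.with(Serdes.String(), Serdes.String()))
                .groupByKey();

        // "left" flow: hourly tumbling windows, then back to the plain key
        KStream<String, String> g1 = gK
                .windowedBy(TimeWindows.of(Duration.ofHours(1)))
                .reduce((oldValue, newValue) -> newValue)              // placeholder reducer
                .toStream()
                .selectKey((windowedKey, value) -> windowedKey.key()); // drop the window wrapper

        // "right" flow: plain, unwindowed reduce
        KTable<String, String> g2 = gK.reduce((oldValue, newValue) -> newValue);

        // join both flows and write to the output topic
        g1.leftJoin(g2,
                (left, right) -> left + "|" + right,                   // placeholder joiner
                Joined.with(Serdes.String(), Serdes.String(), Serdes.String()))
          .to("output", Produced.with(Serdes.String(), Serdes.String()));

        // builder.build() would then be passed to new KafkaStreams(topology, props) as usual
    }
}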
That all works well at first sight in my unit tests with a couple of input/output messages and in the runtime environment (with real Kafka).
The problem seems to appear when the number of messages becomes significant (e.g. 40K).
My failing scenario is the following:
1. ~40K records with the same key get uploaded into the input topic first.
2. ~40K updates come out of the output topic, as expected.
3. Another ~40K records with the same key as each other, but different from the key in step 1, get uploaded into the input topic.
4. Only ~100 updates come out of the output topic, instead of the expected new ~40K updates. There is nothing special to see in those ~100 updates; their contents seem to be right, but only for certain time windows. For other time windows there are no updates at all, even though the flow logic and input data should definitely generate 40K records. In fact, when I exchange the datasets in steps 1 and 3, I get exactly the same situation, with ~40K updates coming from the second dataset and the same ~100 from the first.
I can easily reproduce this issue locally in the unit tests using TopologyTestDriver (but only with bigger numbers of input records).
In my tests, I've tried disabling caching with StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG. Unfortunately, that didn't make any difference.
UPDATE
I tried both reduce() calls and aggregate() calls instead. The issue persists in both cases.
Something else I'm noticing is that, both with StreamsConfig.TOPOLOGY_OPTIMIZATION set to StreamsConfig.OPTIMIZE and without it, the mapValues() handler gets called in the debugger before the preceding reduce() (or aggregate()) handlers, at least the first time. I didn't expect that.
I tried both join() and leftJoin(); unfortunately, the result is the same.
In the debugger, the second portion of the data doesn't trigger the reduce() handler in the "left" flow at all, but it does trigger the reduce() handler in the "right" flow.
With my configuration, if the number of records in both datasets is 100 each, the problem doesn't manifest itself and I get the 200 output messages I expect. When I raise the number to 200 in each dataset, I get fewer than the expected 400 messages out.
So, it seems at the moment that something like "old" windows get dropped and new records for those old windows get ignored by the stream.
There is a window retention setting that can be configured, but with the default value that I use I expected windows to retain their state and stay active for at least 12 hours (which significantly exceeds the duration of my unit test run).
I tried to amend the left reducer with the following window store config:
Materialized.as(
    Stores.inMemoryWindowStore(
        "rollup-left-reduce",
        Duration.ofDays(5 * 365),
        Duration.ofHours(1),
        false)
)
Still no difference in the results.
The same issue persists even with only the single "left" flow, without the "right" flow and without the join(). It seems the problem is in the window retention settings of my setup. The timestamps (event-time) of my input records span 2 years, and the second dataset starts again from the beginning of those 2 years. This place in Kafka Streams makes sure that the records of the second dataset get ignored:
https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/state/internals/InMemoryWindowStore.java#L125
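Paraphrasing the check at that line from memory (names approximate, not the exact source), it boils down to:
// inside InMemoryWindowStore.put(): observedStreamTime is the largest timestamp seen so far
if (windowStartTimestamp <= observedStreamTime - retentionPeriod) {
    expiredRecordSensor.record();                     // the drop only shows up as a metric...
    LOG.warn("Skipping record for expired segment."); // ...and a warning log line, not an error
} else {
    // the record is actually written to the store
}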
Kafka Streams Version is 2.4.0. Also using Confluent dependencies version 5.4.0.
My questions are:
What could be the reason for such behaviour?
Did I miss anything in my stream topology?
Is such topology expected to work at all?
After some debugging time I found the reason for my problem.
My input datasets contain records with timestamps that span 2 years. I load the first dataset, and with that the "observed" time of my stream gets set to the maximum timestamp from that input dataset.
The upload of the second dataset, which starts with records whose timestamps are 2 years before the new observed time, causes the stream internals to drop those messages. This can be seen if you set the Kafka logging to TRACE level.
So, to fix my problem I had to configure the retention and grace period for my windows:
instead of
.windowedBy(TimeWindows.of(windowSize))
I have to specify
.windowedBy(TimeWindows.of(windowSize).grace(Duration.ofDays(5 * 365)))
Also, I had to explicitly configure the reducer storage settings as:
Materialized.as(
    Stores.inMemoryWindowStore(
        "rollup-left-reduce",
        Duration.ofDays(5 * 365),
        windowSize,
        false)
)
That's it, the output is as expected.

Partial batch sizes

I'm trying to simulate pallet behavior by using batch and moveTo blocks. This works fine except towards the end, where the number of elements left is smaller than the batch size and these never get picked up. Is there any way out of this situation?
I have tried messing with custom queues and pickup/dropoff pairs.
To elaborate: the batch object has a queue size of 15. However, once the entire set has been processed, a number of elements less than 15 remain, and these don't get picked up by the subsequent moveTo block. I need to send the agents to the subsequent block once the queue size falls below 15.
You can dynamically change the batch size of your Batch object towards "the end" (whatever you mean by that :-) ). You need to figure out when to change the batch size (as this depends on your model). But once it is time to adjust, you can call myBatchItem.set_batchSize(1) and it will now batch things together individually.
However, a better model design might be to have a cool-down period before the model end, i.e. stop taking model measurements before your batch objects run out of agents to batch.
You need to know which element is the last one somehow, for example by using a boolean variable called isLast in your agent that is true for the last agent.
Then, in the Batch block, you have to change the batch size programmatically, maybe like this in the "On enter" action of your batch:
if(agent.isLast)
self.set_batchSize(self.size());
To determine whether the "end" (or any lack of supply) is reached, I suggest a timeout. I would save a timestamp in a variable lastBatchDate in the "On exit" code of the Batch block:
lastBatchDate = date();
A cyclically activated event checkForLeftovers checks every once in a while whether there are objects waiting to be batched and the timeout (here: 10 minutes) has been reached. In that case, the batch size is reduced to exactly the number of waiting objects, so that they continue in a smaller batch:
if (lastBatchDate != null                                            // prevent a NullPointerException when the date is not yet filled
        && ((date().getTime() - lastBatchDate.getTime()) / 1000) > 600  // more than 600 seconds since the last batch
        && batch.size() > 0                                          // something is waiting
        && batch.size() < BATCH_SIZE) {                              // not more than a normal batch is waiting
    batch.set_batchSize(batch.size());  // set the batch size to exactly the amount waiting
} else {
    batch.set_batchSize(BATCH_SIZE);    // reset the batch size to the default value BATCH_SIZE
}
The model will look something like this:
However, as Benjamin already noted, you should be careful about whether this is what you really need to model. Take care, for example, of these aspects:
Is the timeout long enough to not accidentally push smaller batches during normal operations?
Is the timeout short enough to have any effect?
Is it ok to have a batch of a smaller size downstream in your process?
etc.
You might just want to make sure upstream that the number of objects reaching the batching station always fills full batches, or you might just stop your simulation before the line "runs dry".
You can see the model and download the source code here.

Can Mirth Connect 3.0.1.7051 choke on a large file without a heap error?

We are using Mirth Connect 3.0.1.7051 on a Linux machine (Red Hat, I think). The machine has 16 GB of memory, and the Java heap (in Mirth) is set to 12 GB. We have a JavaScript reader channel attempting to read a delimited text file and convert it to XML (after checking some things). Other settings are: Durable message delivery, remove content and attachments on completion, Polling type = interval, Polling frequency = 2 minutes,
Source queue off – respond after processing, Destination/channel type = channel writer, Queue retry = always, retry interval every 10,000 ms.
The delimited text file is close to 80 KB. At first, when we had much less memory in the machine and a smaller heap, it would fail to read the file and show a heap error in the Mirth log. Now, with more memory, it is not throwing a heap error, but Mirth stops reading the input file somewhere in the middle and then starts reading it again. The result is that two incomplete XML files are produced, with some overlapping data, and no apparent errors in the log.