Mirth - removing segments on its own

We are running Mirth 3.9.0 and have noticed that some of our diet order messages are being stripped of segments. For example, we create an order that may contain two ORC and ODS segments along with an NTE afterwards. Our interface adds the extra segments, but by the time Mirth sends the message they have been removed. We can verify this by calling our API (via Postman) and seeing that the resulting message contains the extra segments, yet when we view the same message in Mirth Connect the segments are missing.
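For illustration, the shape of the message our interface produces is roughly like the hypothetical ORM fragment below (all field contents are invented; the point is the repeated ODS segment and the trailing NTE that end up missing):
MSH|^~\&|DIETAPP|FAC|MIRTH|FAC|20200101120000||ORM^O01|MSG0001|P|2.3
PID|1||123456^^^MRN||DOE^JANE
ORC|NW|ORDER1
ODS|R||REG^Regular Diet^L
ODS|R||LS^Low Sodium^L
NTE|1||Patient prefers meals before 6 PM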
Why would Mirth remove the segments and how can we keep it from doing so?

I discovered that Mirth does not allow two ODS segments in a row. We added code to adjust for that by adding an extra ORC segment before the second ODS. We didn't see any improvement, and then found out that our master branch does not seem to match production: the change allowed it to work in our test branch but not in production. So this is not a Mirth issue at this point but a deployment issue.
Sorry for wasting anyone's time on this.

Related

Kafka - different configuration settings

I am going through the documentation, and there seem to be a lot of moving parts with respect to message processing guarantees such as exactly-once and at-least-once processing, and the relevant settings are scattered here and there. There doesn't seem to be a single place that documents, even roughly, the properties that need to be configured for exactly-once versus at-least-once processing.
I know there are many moving parts involved and that it always depends. However, as I mentioned before, what are the settings that need to be configured, at a minimum, to provide exactly-once, at-most-once and at-least-once processing?
You might be interested in the first part of the Kafka FAQ, which describes some approaches to avoiding duplication during data production (i.e. on the producer side):
Exactly once semantics has two parts: avoiding duplication during data production and avoiding duplicates during data consumption.
There are two approaches to getting exactly once semantics during data production:
1. Use a single-writer per partition and every time you get a network error check the last message in that partition to see if your last write succeeded.
2. Include a primary key (UUID or something) in the message and deduplicate on the consumer.
If you do one of these things, the log that Kafka hosts will be duplicate-free. However, reading without duplicates depends on some co-operation from the consumer too. If the consumer is periodically checkpointing its position then if it fails and restarts it will restart from the checkpointed position. Thus if the data output and the checkpoint are not written atomically it will be possible to get duplicates here as well. This problem is particular to your storage system. For example, if you are using a database you could commit these together in a transaction. The HDFS loader Camus that LinkedIn wrote does something like this for Hadoop loads. The other alternative that doesn't require a transaction is to store the offset with the data loaded and deduplicate using the topic/partition/offset combination.
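Since the question also asks which properties to set, here is a minimal, hedged sketch of the client settings usually associated with exactly-once delivery on Kafka 0.11+ clients (the bootstrap address and transactional id are placeholders, and the surrounding application logic of sending inside transactions and committing offsets together with results still has to be written accordingly):
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.producer.ProducerConfig;

public class ExactlyOnceConfigSketch {
    public static void main(String[] args) {
        // Producer side: idempotence removes duplicates caused by retries,
        // and a transactional.id enables atomic writes across partitions.
        Properties producerProps = new Properties();
        producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        producerProps.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        producerProps.put(ProducerConfig.ACKS_CONFIG, "all");
        producerProps.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "my-app-tx-1"); // placeholder

        // Consumer side: only read committed transactional data and manage
        // offset commits yourself, together with the processing results.
        Properties consumerProps = new Properties();
        consumerProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        consumerProps.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");
        consumerProps.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

        // For a Kafka Streams application the equivalent single switch is:
        // props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
    }
}
For at-most-once the usual lever is the opposite: commit offsets before (or automatically, regardless of) processing; for at-least-once, commit offsets only after processing and accept possible duplicates downstream.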

Brod: Every brod_group_subscriber_v2 needs its own brod client

DISCLAIMER: This question is specifically about the Erlang/OTP Kafka Client library Brod (No Tag available yet)
I am trying to establish three consumer groups: one should just write messages plainly to the console, another should update a state-representing API with certain messages, and the third should store every message in a long-term database (Crate). I use a supervisor to start three corresponding brod_group_subscriber_v2 processes (see this GIST). If I also start three brod (Kafka) clients first and attach its own client to each group subscriber, everything works perfectly: offsets are committed to Kafka for every group and reads start from the latest committed offset.
If I use only one client (as should be possible and is intended according to the Brod docs and issues, see Reference by zmstone), only the last group in my CHILD_SPEC works; the other two never receive handle_message calls.
At the moment starting a client for every group is not an issue for me, as there are only three. But as our project grows, we plan to establish a number of consumer groups, and I don't really think it would be a good idea to run 20 to 30 brod clients and block resources for each of them.

Kafka Streams topology with windowing doesn't trigger state changes

I am building the following Kafka Streams topology (pseudo code):
gK = builder.stream().groupByKey();
g1 = gK.windowedBy(TimeWindows.of("PT1H")).reduce().mapValues().toStream().mapValues().selectKey();
g2 = gK.reduce().mapValues();
g1.leftJoin(g2).to();
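For readability, here is a rough Java sketch of that pseudo code; the String serdes, placeholder topic names and trivial reducer/mapper bodies are my own stand-ins, not the real application logic:
import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.*;

StreamsBuilder builder = new StreamsBuilder();

// Shared grouped stream that both flows branch off from.
KGroupedStream<String, String> gK = builder
        .stream("input-topic", Consumed.with(Serdes.String(), Serdes.String()))
        .groupByKey();

// "Left" flow: windowed reduce, then map back to the original (un-windowed) key.
KStream<String, String> g1 = gK
        .windowedBy(TimeWindows.of(Duration.ofHours(1)))
        .reduce((oldValue, newValue) -> newValue)
        .mapValues(value -> value)
        .toStream()
        .mapValues(value -> value)
        .selectKey((windowedKey, value) -> windowedKey.key());

// "Right" flow: plain (un-windowed) reduce on the same key.
KTable<String, String> g2 = gK
        .reduce((oldValue, newValue) -> newValue)
        .mapValues(value -> value);

// Join the two flows and write to the single output topic.
g1.leftJoin(g2, (left, right) -> left + "|" + right)
  .to("output-topic", Produced.with(Serdes.String(), Serdes.String()));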
If you notice, this is a rhomb-like topology: it starts at a single input topic and ends in a single output topic, with messages flowing through two parallel flows that eventually get joined together at the end. One flow applies (tumbling?) windowing, the other does not. Both parts of the flow work on the same key (apart from the WindowedKey introduced in between by the windowing).
The timestamps of my messages are event-time, i.e. they get picked from the message body by my custom-configured TimestampExtractor implementation. The actual timestamps in my messages are several years in the past.
That all works well at first sight, both in my unit tests with a couple of input/output messages and in the runtime environment (with real Kafka).
The problem seems to appear when the number of messages becomes significant (e.g. 40K).
My failing scenario is the following:
1. ~40K records with the same key get uploaded into the input topic first.
2. ~40K updates come out of the output topic, as expected.
3. Another ~40K records with the same key (but a different key than in step 1) get uploaded into the input topic.
4. Only ~100 updates come out of the output topic, instead of the expected ~40K new updates. There is nothing special about those ~100 updates; their contents seem to be right, but only for certain time windows. For other time windows there are no updates, even though the flow logic and input data should definitely generate 40K records. In fact, when I swap the datasets of steps 1) and 3), I get exactly the same situation: ~40K updates coming from the second dataset and the same ~100 from the first.
I can easily reproduce this issue locally in unit tests using TopologyTestDriver (but only with larger numbers of input records).
In my tests, I've tried disabling caching with StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG. Unfortunately, that didn't make any difference.
UPDATE
I tried both reduce() and aggregate() calls instead; the issue persists in both cases.
Something else I'm noticing is that, both with StreamsConfig.TOPOLOGY_OPTIMIZATION set to StreamsConfig.OPTIMIZE and without it, the mapValues() handler gets called in the debugger before the preceding reduce() (or aggregate()) handlers, at least for the first time. I didn't expect that.
I tried both join() and leftJoin(); unfortunately, same result.
In the debugger, the second portion of the data doesn't trigger the reduce() handler in the "left" flow at all, but does trigger the reduce() handler in the "right" flow.
With my configuration, if the number of records in both datasets is 100 each, the problem doesn't manifest itself and I get 200 output messages, as I expect. When I raise the number to 200 in each dataset, I get fewer than the expected 400 messages out.
So, it seems at the moment that something like "old" windows get dropped and the new records for those old windows get ignored by the stream.
There is a window retention setting that can be configured, but with the default value that I use I was expecting the windows to retain their state and stay active for at least 12 hours (which significantly exceeds the duration of my unit test run).
I tried to amend the left reducer with the following window store config:
Materialized.as(
    Stores.inMemoryWindowStore(
        "rollup-left-reduce",
        Duration.ofDays(5 * 365),
        Duration.ofHours(1),
        false))
Still no difference in the results.
The same issue persists even with only the single "left" flow, without the "right" flow and without the join(). It seems that the problem is in the window retention settings of my setup. The timestamps (event-time) of my input records span 2 years, and the second dataset starts again from the beginning of those 2 years. This place in Kafka Streams makes sure that the second dataset's records get ignored:
https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/state/internals/InMemoryWindowStore.java#L125
Kafka Streams Version is 2.4.0. Also using Confluent dependencies version 5.4.0.
My questions are
What could be the reason for such behaviour?
Did I miss anything in my stream topology?
Is such topology expected to work at all?
After some debugging I found the reason for my problem.
My input datasets contain records with timestamps that span 2 years. I load the first dataset, and with that the "observed" time of my stream gets set to the maximum timestamp from that input dataset.
The upload of the second dataset, which starts with records whose timestamps are 2 years before the new observed time, causes the stream internals to drop those messages. This can be seen if you set the Kafka logging to TRACE level.
So, to fix my problem I had to configure the retention and grace period for my windows:
instead of
.windowedBy(TimeWindows.of(windowSize))
I have to specify
.windowedBy(TimeWindows.of(windowSize).grace(Duration.ofDays(5 * 365)))
Also, I had to explicitly configure the reducer's store settings as:
Materialized.as(
    Stores.inMemoryWindowStore(
        "rollup-left-reduce",
        Duration.ofDays(5 * 365),
        windowSize,
        false))
That's it, the output is as expected.
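Putting the two changes together, a condensed sketch of the fixed "left" aggregation could look like this (the reducer body is a placeholder, the store name is taken from the snippets above, and retention and grace are both sized to cover the full 2-year span of event timestamps):
import java.time.Duration;
import org.apache.kafka.streams.kstream.*;
import org.apache.kafka.streams.state.Stores;

Duration windowSize = Duration.ofHours(1);
Duration retention = Duration.ofDays(5 * 365);

// gK is the grouped stream from the topology above; the reducer body is a stand-in.
KTable<Windowed<String>, String> leftReduced = gK
        .windowedBy(TimeWindows.of(windowSize).grace(retention))
        .reduce(
                (oldValue, newValue) -> newValue,
                Materialized.as(
                        Stores.inMemoryWindowStore(
                                "rollup-left-reduce",
                                retention,
                                windowSize,
                                false)));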

Filter read access events in Debezium

We are using Debezium + PostgreSQL.
We notice that we get 4 types of events, for create, read, update and delete: c, r, u and d.
The read type of event is unused by our application. Actually, I could not think of a use case for the 'r' events unless we are doing auditing or mirroring the activity of a transaction.
We are facing difficulties scaling, and I suspect it is because the network is getting hogged by the read type of events.
How do we filter out those events in PostgreSQL itself?
I got a clue from one of the contributors to use snapshot.mode. I guess it is something that has to be done when Debezium creates a snapshot, but I am unable to figure out how to do it.
It is likely that your database has existed for some time and contains data and changes that have been purged from the logical decoding logs. If you then start using the Debezium PostgreSQL connector to start capturing changes into Kafka, the question becomes what a consumer of the events in Kafka should be able to see.
One scenario is that a consumer should be able to see events for all rows in the database, even those that existed prior to the start of CDC. For example, this allows a consumer to completely reproduce/replicate all of the existing data and keep that data in sync over time. To accomplish this, the Debezium PostgreSQL connector can, on startup, begin by creating a snapshot of the database contents before it starts capturing the changes. This is done atomically, so that even if the snapshot process takes a while to run, the connector will still see all of the events that occurred since the snapshot process was started. These events are represented as "read" events, since in effect the connector is simply reading the existing rows. However, they are identical to "insert" events, so any application could treat reads and inserts in the same way.
On the other hand, if consumers of the events in Kafka do not need to see events for all existing rows, then the connector can be configured to avoid the snapshot and to instead begin by capturing the changes. This may be useful in some scenarios where the entire database state need not be found in Kafka, but instead the goal is to simply capture the changes that are occurring.
The Debezium PostgreSQL connector will work either way, so you should use the approach that works for how you're consuming the events.
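Concretely, skipping the initial snapshot (and with it the 'r' events for pre-existing rows) is a connector configuration choice via snapshot.mode. A hypothetical connector properties sketch is below; every host, credential and name here is a placeholder, and whether never is appropriate depends on whether your consumers can live without the pre-existing rows:
# Hypothetical Debezium PostgreSQL connector configuration (placeholder values)
name=my-postgres-connector
connector.class=io.debezium.connector.postgresql.PostgresConnector
database.hostname=postgres-host
database.port=5432
database.user=debezium
database.password=secret
database.dbname=mydb
database.server.name=dbserver1
# Skip the initial snapshot so no "r" (read) events are emitted for existing rows
snapshot.mode=never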

divert on large files on hornetq leaves a copy of the file behind

We have a queue where we get some large files and many small ones, so we use a divert to get the large files onto their own queue.
The divert is exclusive. However, we see that under the large-messages folder there are two copies of the message (which is a BytesMessage, a file with properties) for every large message diverted. Once we consume the diverted message one of the copies goes away, but the other is left behind until HornetQ is restarted. When it is restarted we see messages like the following:
[org.hornetq.core.server] HQ221018: Large message: 19,327,352,827 did not have any associated reference, file will be deleted
We use streaming over JMS to put them in and get them out.
Below is the divert configuration. By the way, anything over 100k is considered a large message in our HornetQ setup.
Are we missing anything, or did we just discover a bug?
HornetQ version is 2.3.0
<diverts>
   <divert name="large-message-divert">
      <routing-name>large-message-divert</routing-name>
      <address>jms.queue.FileDelivery</address>
      <forwarding-address>jms.queue.FileDelivery.large</forwarding-address>
      <filter string="_HQ_LARGE_SIZE IS NOT NULL AND _HQ_LARGE_SIZE > 52428800"/>
      <exclusive>true</exclusive>
   </divert>
</diverts>
This is likely a bug fixed in a later release of HornetQ, since there have been at least two related fixes between 2.3.0 and the date this answer was written:
HORNETQ-1292 - Delete large message from disk when message is dropped
HORNETQ-431 - Large Messages Files NOT DELETED on unbounded address