Reading from Kinesis returns empty records when using a previous sequence number or timestamp

I am trying to read the messages pushed to Kinesis stream with the help of
get_records() and get_shard_iterator() APIs.
My producer pushes records as it finishes processing them, and my consumer runs as a cron job every 30 minutes. I tried storing the sequence number of the last message read in my database and using the AFTER_SEQUENCE_NUMBER shard iterator type with that sequence number. However, this does not work on the second run, after new messages have been pushed (the first run successfully reads all messages in the stream).
I also tried using AT_TIMESTAMP with the timestamp that the producer includes in each message, storing that timestamp for use in the next run. Again, the first run processes all messages, and from the second run onward I get empty records.
I am really not sure where I am going wrong. I would appreciate it if someone could help me with this.
The code below uses the timestamp approach, but the sequence number approach is done the same way.
import datetime
import json

import boto3


def listen_to_kinesis_stream():
    kinesis_client = boto3.client('kinesis', region_name=SETTINGS['region_name'])
    stream_response = kinesis_client.describe_stream(StreamName=SETTINGS['kinesis_stream'])
    for shard_info in stream_response['StreamDescription']['Shards']:
        kinesis_stream_status = mongo_coll.find_one({'_id': "DOC_ID"})
        # Fall back to the epoch when no timestamp is stored for this shard yet.
        last_read_ts = kinesis_stream_status.get('state', {}).get(
            shard_info['ShardId'],
            datetime.datetime(1970, 1, 1).strftime("%Y-%m-%dT%H:%M:%S.%f"))
        shard_iterator = kinesis_client.get_shard_iterator(
            StreamName=SETTINGS['kinesis_stream'],
            ShardId=shard_info['ShardId'],
            ShardIteratorType='AT_TIMESTAMP',
            Timestamp=last_read_ts)
        get_response = kinesis_client.get_records(ShardIterator=shard_iterator['ShardIterator'], Limit=1)
        if len(get_response['Records']) == 0:
            continue
        message = json.loads(get_response['Records'][0]['Data'])
        process_resp = process_message(message)
        if process_resp['success'] is False:
            print(process_resp)
        # Remember the timestamp of the last message read for this shard.
        mongo_coll.update({'_id': "DOC_ID"}, {'$set': {'state.{0}'.format(shard_info['ShardId']): message['ts']}})
        print("Processed {0}".format(message))
        while 'NextShardIterator' in get_response:
            get_response = kinesis_client.get_records(ShardIterator=get_response['NextShardIterator'], Limit=1)
            if len(get_response['Records']) == 0:
                break
            message = json.loads(get_response['Records'][0]['Data'])
            process_resp = process_message(message)
            if process_resp['success'] is False:
                print(process_resp)
            mongo_coll.update({'_id': "DOC_ID"}, {'$set': {'state.{0}'.format(shard_info['ShardId']): message['ts']}})
            print("Processed {0}".format(message))
    logger.debug("Processed all messages from Kinesis stream")
    print("Processed all messages from Kinesis stream")

As per my discussion with an AWS technical support person, a get_records() call can return an empty Records list even when more messages remain in the shard, so it is not a good idea to break when len(get_response['Records']) == 0.
The better approach suggested was to keep a counter for the maximum number of messages to read in a run and to exit the loop after reading that many messages.
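The suggestion above can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: `read_shard`, `handle`, `max_messages`, and `max_empty_polls` are hypothetical names, and the cap on consecutive empty responses is an added assumption that is not part of the advice quoted above. `client` is assumed to be any object exposing a boto3-style `get_records(ShardIterator=..., Limit=...)` method.

```python
import json


def read_shard(client, shard_iterator, handle, max_messages=1000, max_empty_polls=5):
    """Read up to max_messages records from one shard.

    An empty Records list is not treated as the end of the shard. The loop
    keeps following NextShardIterator and only gives up after
    max_empty_polls consecutive empty responses (an assumed safety cap).
    Returns the sequence number of the last record handled, or None.
    """
    processed = 0
    empty_polls = 0
    last_sequence_number = None
    while shard_iterator and processed < max_messages and empty_polls < max_empty_polls:
        response = client.get_records(ShardIterator=shard_iterator, Limit=1)
        records = response.get('Records', [])
        if records:
            empty_polls = 0
            for record in records:
                handle(json.loads(record['Data']))
                last_sequence_number = record['SequenceNumber']
                processed += 1
        else:
            empty_polls += 1
        # A missing or None NextShardIterator ends the loop.
        shard_iterator = response.get('NextShardIterator')
    return last_sequence_number
```

The returned sequence number is what one would store for an AFTER_SEQUENCE_NUMBER iterator on the next run. A real consumer would probably also pause briefly between calls to stay within the per-shard read limits.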

Related

How to debug further this dropped record in apache beam?

I am seeing intermittently dropped records (only for error messages, though, not for success ones). We have a test case that intermittently fails or passes because of a lost record. We are using "org.apache.beam.sdk.testing.TestPipeline.java" in the test case. This is the relevant setup code, where I have also tracked the dropped record:
PCollectionTuple processed = records
    .apply("Process RosterRecord", ParDo.of(new ProcessRosterRecordFn(factory))
        .withOutputTags(TupleTags.OUTPUT_INTEGER, TupleTagList.of(TupleTags.FAILURE))
    );
errors = errors.and(processed.get(TupleTags.FAILURE));
PCollection<OrderlyBeamDto<Integer>> validCounts = processed.get(TupleTags.OUTPUT_INTEGER);
PCollection<OrderlyBeamDto<Integer>> errorCounts = errors
    .apply("Flatten Roster File Error Count", Flatten.pCollections())
    .apply("Publish Errors", ParDo.of(new ErrorPublisherFn(factory)));
The relevant code in ProcessRosterRecordFn.java is this:
if(dto.hasValidationErrors()) {
    RosterIngestError error = new RosterIngestError(record.getRowNumber(), record.toTitleValue());
    error.getValidationErrors().addAll(dto.getValidationErrors());
    error.getOldValidationErrors().addAll(dto.getOldValidationErrors());
    log.info("Tagging record row number="+record.getRowNumber());
    c.output(TupleTags.FAILURE, new OrderlyBeamDto<>(error));
    return;
}
I see the "Tagging record row number" log for both of the 2 rows that fail, including the lost record. However, on the first line of ErrorPublisherFn.java we log immediately on receiving each message, and SOMETIMES we receive only 1 of the 2 rows. When we receive both, the test passes. The test is very flaky in this regard.
Apache Beam's thread naming is unhelpful (all threads have the same name), so I added the thread hash code to the logback output to get more insight. That did not reveal anything, and ErrorPublisherFn could publish #4 on any thread anyway.
So now the big question: what else can I insert to figure out why this record is being dropped INTERMITTENTLY?
Do I have to debug Apache Beam itself? Can I insert other functions or make changes to figure out why this error is 'sometimes' lost on some test runs and not on others?
EDIT: Thankfully, this set of tests does not test errors upstream, so the line "errors = errors.and(processed.get(TupleTags.FAILURE));" can be removed, which forces me to also remove ".apply("Flatten Roster File Error Count", Flatten.pCollections())". With those 2 lines removed, the issue goes away for 10 test runs in a row (i.e. with this much flakiness I can't say for certain that it is gone). Are we doing something wrong in the join and flattening? I checked the Error structure, and rowNumber is part of equals and hashCode, so there should be no duplicates. I am also not sure why duplicate objects would cause an intermittent failure.
What more can be done to debug this and figure out why the join is not working in the TestPipeline?
How can I get insight into the flatten and join, so I can debug why we are losing an event and why we lose it only 'sometimes'?
Is this a windowing issue, even though our job starts with a file to read in and we just want to process that file? We wanted a constant Dataflow stream available, as Google kept running into limits, but perhaps this was the wrong decision?

perl msgrcv() errno 22 (EINVAL) Ubuntu

I have two Perl processes that communicate over a System V IPC message queue on Ubuntu.
The receiver runs successfully for a time, receiving messages in a poller function like this:
sub getCompleteRecord {
    while( msgrcv($q, $buff, $size, $msgType, &IPC_NOWAIT) ) {
        # assemble record from messages and return
    }
    # print $! error code
    # here if nothing ready (42) or other error
}
After some time I eventually get error code 22 (EINVAL), which means invalid argument. Subsequently all calls to msgrcv() fail with 22, and the separate sender process also cannot msgsnd(), again getting EINVAL.
When I restart the processes, the queue can be used again.
Any suggestions for reasons, or for how to approach diagnosing this?
As noted in the comments, error code 22 means that either the msqid ($q) or the buffer size ($size) is incorrect. However, this all happens in a loop, and those two values never change. I log the values before each call, and I see many successes and then suddenly a failure for seemingly the same values.
masterQueue 360448 read buffer size:5000 msgType1
Message received
--- many similar successes, then:
masterQueue 360448 read buffer size:5000 msgType1
read error 22=Invalid argument
And from this point both reader process and writer process fail. If I restart, then everything works for about 30 minutes and then fails again.

Spark Local File Streaming - Fault tolerance

I am working on an application where some files are dropped into a file system every 30 seconds (it can also be every 5 seconds). I have to read each file, parse it, and push some records to Redis.
In each file all records are independent, and I am not doing any calculation that would require updateStateByKey.
My question: if a file is not processed completely because of some issue (e.g. a Redis connection issue or a data issue in the file), I want to reprocess the file (say, n times) and also keep track of the files already processed.
For testing purposes I am reading from a local folder. I am also not sure how to determine that a file has been fully processed so that I can mark it as completed (i.e. record in a text file or database that the file was processed).
val lines = ssc.textFileStream("E:\\SampleData\\GG")
val words = lines.map(x => x.split("_"))
words.foreachRDD(
  x => {
    x.foreach(
      x => {
        var jedis = jPool.getResource();
        try {
          i = i + 1
          jedis.set("x" + i + "__" + x(0) + "__" + x(1), x(2))
        } finally {
          jedis.close()
        }
      }
    )
  }
)
Spark has a fault-tolerance guide. Read more:
https://spark.apache.org/docs/2.1.0/streaming-programming-guide.html#fault-tolerance-semantics
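The bookkeeping asked about in the question (retry a failed file up to n times and remember which files are done) can be sketched independently of Spark. The following is only an illustration in plain Python, not Spark API code: `FileLedger`, `process_with_retries`, and the JSON ledger file are hypothetical, and the sketch assumes a file counts as completed only after every record in it has been pushed without an exception.

```python
import json
import os


class FileLedger:
    """Persist the names of fully processed files in a small JSON file."""

    def __init__(self, path):
        self.path = path
        self.done = set()
        if os.path.exists(path):
            with open(path) as fh:
                self.done = set(json.load(fh))

    def is_done(self, name):
        return name in self.done

    def mark_done(self, name):
        self.done.add(name)
        with open(self.path, 'w') as fh:
            json.dump(sorted(self.done), fh)


def process_with_retries(name, process_file, ledger, max_attempts=3):
    """Run process_file(name) until it succeeds or max_attempts is reached.

    Returns True if the file is (or already was) fully processed.
    """
    if ledger.is_done(name):
        return True
    for attempt in range(1, max_attempts + 1):
        try:
            process_file(name)
        except Exception as exc:  # e.g. a Redis connection error or bad data
            print("attempt {0} for {1} failed: {2}".format(attempt, name, exc))
        else:
            ledger.mark_done(name)
            return True
    return False
```

Because a retry replays the whole file, the writes need to be idempotent. A Redis SET whose key is derived from the record itself would qualify, whereas a key built from a running counter would not.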

Mirth is reading too slow from disk

I am using Mirth 3.0.1. I am reading a file (using File Reader) that has 34,000 records. Every record has 45 pipe-separated (|) columns. Mirth takes too much time reading the file from disk. Mirth is installed on the same server where the file is located. Earlier, I was facing a Java heap space issue, which I resolved by setting -Xms1024m -Xmx4096m in the files mcserver.vmoptions and mcservice.vmoptions. Now I have to solve the read performance issue. Please find the channel attached.
The answer to this problem is highly dependent on the solution itself. As an example, if you are doing transformations when you benchmark, it might be that the problem is not with reading the files, but rather with doing massive amounts of filtering and transformations in Mirth. Since Mirth converts everything you configure into basically one gigantic Javascript that executes on the server, it might just as well be that this is causing the performance problem. Pre-processor scripts might also create a problem if you do something that causes Mirth to read the whole file.
It might also be that the 34,000 lines in your file contain huge quantities of information, simply making the file very big and expensive to process. If every record in the file is supposed to create a new message within Mirth, you might also want to check your batch settings for the reader.
And in addition to this, the performance of the read operations from disk is of course affected a lot by the infrastructure and hardware of the platform itself. You did mention that you are reading the files locally and that you had to increase the memory for Mirth. All of this could of course be a problem in itself. To make a benchmark you would want to compare this to something else. Maybe write a small Java program to just read the file to compare performance outside of Mirth.
Thanks for the suggestions.
I used router.routeMessage('channelName','PartOfMsg') to route the records in batches of 5000 (from one channel to a second channel) out of the file of 34,000 records. This helped to read the file faster while processing the records at the same time.
For the Mirth community, below is the code to route a message from one channel to another. This solution also applies if you need to process a large number of records in batches.
In the Source Transformer:
debug = "ON";
XML.ignoreWhitespace = true;
logger.debug('Inside source transformer "SplitFileIntoFiles" of channel: SplitFile');
var
    subSegmentCounter = 0,
    xmlMessageProcessCounter = 0,
    singleFileLimit = 5000,
    isError = false,
    xmlMessageProcess = new XML(<delimited><row><column1></column1><column2></column2></row></delimited>),
    newSubSegment = <row><column1></column1><column2></column2></row>,
    totalPatientRecords = msg.children().length();
logger.debug('Total number of records found in the patient input file are: ');
logger.debug(totalPatientRecords);
try
{
    for each (seg in msg.children())
    {
        xmlMessageProcess.appendChild(newSubSegment);
        xmlMessageProcess['row'][xmlMessageProcessCounter] = msg['row'][subSegmentCounter];
        if (xmlMessageProcessCounter == singleFileLimit - 1)
        {
            logger.debug('Now sending the 5000 records to the next channel from channel DOR Batch File Process IHI');
            router.routeMessage('DOR SendPatientsToMedicare', xmlMessageProcess);
            logger.debug('After sending the 5000 records to the next channel from channel DOR Batch File Process IHI');
            xmlMessageProcessCounter = 0;
            delete xmlMessageProcess['row'];
        }
        subSegmentCounter++;
        xmlMessageProcessCounter++;
    } // End of FOR loop
} // End of try block
catch (exception)
{
    logger.error('The exception has been raised in source transformer "SplitFileIntoFiles" of channel: SplitFile');
    logger.error(exception);
    globalChannelMap.put('isFailed', true);
    globalChannelMap.put('errDesc', exception);
    return true;
}
if (xmlMessageProcessCounter > 1)
{
    try
    {
        logger.debug('Now sending the remaining records to the next channel from channel DOR Batch File Process IHI');
        router.routeMessage('DOR SendPatientsToMedicare', xmlMessageProcess);
        logger.debug('After sending the remaining records to the next channel from channel DOR Batch File Process IHI');
        delete xmlMessageProcess['row'];
    }
    catch (exception)
    {
        logger.error('The exception has been raised in source transformer "SplitFileIntoFiles" of channel: SplitFile');
        logger.error(exception);
        globalChannelMap.put('isFailed', true);
        globalChannelMap.put('errDesc', exception);
        return true;
    }
}
return true;
// End of JavaScript
Hope this helps.
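The batching idea in the transformer above (collect rows until a limit is reached, hand the batch to the next channel, then send whatever is left over) can be sketched outside Mirth as follows. This is an illustration in Python rather than Mirth JavaScript: `batches` and `route_in_batches` are hypothetical names, and `send` merely stands in for `router.routeMessage`.

```python
def batches(rows, batch_size=5000):
    """Yield consecutive lists of at most batch_size rows."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # the final, possibly shorter, batch
        yield batch


def route_in_batches(rows, send, batch_size=5000):
    """Call send(batch) once per batch and return the number of batches sent."""
    count = 0
    for batch in batches(rows, batch_size):
        send(batch)
        count += 1
    return count
```

With 34,000 rows and a batch size of 5000, this yields six full batches followed by one batch of 4,000 rows.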

Kafka 0.7.2, minimum number of brokers

I'm trying to create a Kafka producer that sends messages directly to Kafka brokers (and not via ZooKeeper).
I know that the better practice is to work with ZooKeeper, but for the moment I would like to send messages directly to a broker.
To do that, I'm setting the property "broker.list" as described in the documentation. The thing is that it appears to require a minimum of 3 brokers in order to work (otherwise I get an exception).
In the Kafka source code I can see:
if(brokerInfo.size < 3) throw new InvalidConfigException("broker.list has invalid value")
This is weird, because in my data center I have only 2 Kafka nodes (and 3 ZooKeeper nodes). What can I do in this case?
Is there a way around this?
brokerInfo is obtained by splitting the info for an individual broker, so its size is NOT the number of brokers. If you check the source code more carefully, you will see something like
// check if each individual broker info is valid => (brokerId: brokerHost: brokerPort)
and then each entry is split as below:
brokerInfoList.foreach { bInfo =>
  val brokerInfo = bInfo.split(":")
  if(brokerInfo.size < 3) throw new InvalidConfigException("broker.list has invalid value")
}
So every single broker entry is expected to have an id, a host name, and a port, separated by the : delimiter.
Regarding the number of brokers, it only does this:
val brokerInfoList = config.brokerList.split(",")
if(brokerInfoList.size == 0) throw new InvalidConfigException("broker.list is empty")
So you should be fine, I guess. Just try passing a single broker and it should work. Let us know how it goes.
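The two checks quoted above can be mirrored in a few lines to see which strings pass. This is a Python restatement of the quoted Scala logic, not Kafka's own code: `parse_broker_list` is a hypothetical helper, and the host names in the usage note are made up.

```python
def parse_broker_list(broker_list):
    """Split 'id:host:port,id:host:port' into (broker_id, host, port) tuples.

    Raises ValueError in the two situations the quoted checks are aimed at:
    an empty list, or an entry with fewer than three ':'-separated fields.
    """
    entries = [entry for entry in broker_list.split(",") if entry]
    if not entries:
        raise ValueError("broker.list is empty")
    brokers = []
    for entry in entries:
        fields = entry.split(":")
        if len(fields) < 3:
            raise ValueError("broker.list has invalid value")
        brokers.append((int(fields[0]), fields[1], int(fields[2])))
    return brokers
```

Under these assumptions, a single entry such as "0:host1:9092" is accepted, while "host1:9092" is rejected because it has only two fields.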
Apparently, when writing
props.put("broker.list", "0:" + <host:port>);
it works (I added the "0:" to the original string).
I found it in section 9 of the quick start guide.
I'm not sure I fully get it; maybe this zero is the partition number(?) or maybe something else (it would be nice if someone could shed some light here).