Debug missing messages in Akka - MongoDB

I have the following architecture at the moment:
Load(Play app with basic interface for load tests) -> Gateway(Spray application with REST interface for incoming messages) -> Processor(akka app that works with MongoDB) -> MongoDB
Everything works fine as long as the number of messages I am pushing through is low. However, when I try to push 10000 events, which will eventually end up in MongoDB as documents, it stops at random places, for example at message 742 or 982, and does nothing after that.
What would be the best way to debug such situations? On the load side I am just pushing hard into the REST service:
for (i ← 0 until users) workerRouter ! Load(server, i)
and then in the workerRouter
WS.url(server + "/track/message").post(Json.toJson(newUser)).map { response =>
  println(response.body)
  true
}
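One quick check worth doing here (a minimal sketch only, assuming the Play 2.x WS API and an implicit execution context in scope) is to log failed futures as well, since a request that fails or times out on the load side would look exactly like a message lost further down the pipeline:
// Sketch: the same call as above, but logging non-200 responses and failed futures
// instead of silently discarding them (assumes Play 2.x WS and an implicit ExecutionContext).
WS.url(server + "/track/message").post(Json.toJson(newUser)).map { response =>
  if (response.status != 200) println("Unexpected status " + response.status + ": " + response.body)
  true
}.recover {
  case e: Throwable =>
    println("Request failed: " + e.getMessage)
    false
}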
On the spray side:
pathPrefix("track") {
path("message") {
post {
entity(as[TrackObj]) { msg =>
processors ! msg
complete("")
}
}
}
}
On the processor side it's basically just an insert into a collection. Any suggestions on where to start?
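One concrete place to start (a minimal sketch, assuming a plain Akka 2.x ActorSystem on the processor side; the listener name is made up) is to subscribe to the system's dead letters, so that messages dropped because an actor has stopped or could not be delivered show up in the log:
import akka.actor.{Actor, DeadLetter, Props}

// Sketch: log every dead letter in the processor's ActorSystem (Akka 2.x API).
// "system" is assumed to be the already-running ActorSystem of the Processor app.
class DeadLetterListener extends Actor {
  def receive = {
    case d: DeadLetter =>
      println("Dead letter: " + d.message + " from " + d.sender + " to " + d.recipient)
  }
}

val listener = system.actorOf(Props[DeadLetterListener], "dead-letter-listener")
system.eventStream.subscribe(listener, classOf[DeadLetter])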
Update:
I tried moving the logic of creating messages to the Gateway and looped from 1 to 10000, and it works just fine. However, if Spray and Play are involved in the pipeline, it stops at random places. Any suggestions on how to debug this case?

In a distributed and parallel environment it is next to impossible to create a system that works reliably. Whatever debugging method you use, it will only let you find the few bugs that happen during the debug session.
Our team once spent 3 months(!) tuning an application for robust 24/7 operation, and still there were bugs. Then we applied model checking (Spin). Within a couple of weeks we implemented a model that allowed us to get a robust application. However, model checking requires a somewhat different way of thinking, and it can be difficult to start.

I moved the load test app to the Spray framework and now it works like a charm. So I suppose the problem was somewhere in the way I used the WS API in the Play framework:
WS.url(server + "/track/message").post(Json.toJson(newUser)).map { response =>
  println(response.body)
  true
}
The problem is resolved but not solved; I won't work on a solution based on Play.

Related

ZMQ socket - disconnect when all requests are served

I am trying to implement the ZMQ REQ/REP model in Java.
I have a server role, running on port 5564, which acts as the replier:
ZMQ.Socket repSock = context.socket(ZMQ.REP);
I have a client role, running on port 5563:
ZMQ.Socket syncclient = context.socket(ZMQ.REQ);
I have a proxy server in the middle, which passes requests and responses:
ZMQ.proxy(reqSocket, repSocket, null);
The good thing about having a proxy is that I can add multiple servers:
repSocket.connect("tcp://" + addr.getHostAddress() + ":" + port);
This is working fine.
Now, when I remove a server node from the proxy:
repSocket.disconnect("tcp://" + addr.getHostAddress() + ":" + port);
The client gets stuck, as a request has been made and the REQ socket waits for a response.
So the process gets stuck at syncclient.recvStr():
for (int request_nbr = 0; request_nbr < (request_nbr + 1); request_nbr++) {
    syncclient.send(str.getBytes(), 0);
    System.out.println("Send Dataaaa....... ");
    String data = syncclient.recvStr(Charset.defaultCharset());
    System.out.println(" here.. " + data);
    request_nbr++;
}
I searched and couldn't find a way to track the REQ socket.
I need one of two things:
A way to keep track of a socket instance which I am about to disconnect, and wait until all messages are processed, so that syncclient.recvStr() will not block
A way to reset the syncclient socket, so that I can keep getting REQ/REP responses without interruption
In real-world scenarios, avoid using the blocking mode of the ZeroMQ .send() / .recv() methods and prefer .poll().
While this may require a few more lines of code, the result leaves you in control, whereas a blocking call takes all control away from your code and you cannot do much about that until (if at all) the next message gets delivered. That is a very poor design practice; except for the most simplistic schoolbook examples, blocking calls are effectively an anti-pattern for the real world.
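For illustration, a minimal poll-based version of the client side might look like the sketch below (written in Scala against the same org.zeromq binding used in the question; the endpoint, the one-second timeout, and the decision to rebuild the socket on timeout are illustrative assumptions, not part of the original code):
import org.zeromq.ZMQ

// Sketch only: a REQ client that polls instead of blocking in recvStr().
val context = ZMQ.context(1)
val req = context.socket(ZMQ.REQ)
req.connect("tcp://localhost:5563")

val poller = context.poller(1)
poller.register(req, ZMQ.Poller.POLLIN)

req.send("request".getBytes, 0)
// Wait up to roughly one second for a reply instead of blocking indefinitely.
if (poller.poll(1000) > 0 && poller.pollin(0)) {
  println("reply: " + req.recvStr(java.nio.charset.Charset.defaultCharset()))
} else {
  // No reply arrived in time: drop the stuck REQ socket (LINGER 0) and
  // recreate/reconnect a fresh one before sending again.
  req.setLinger(0)
  req.close()
}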
So, do not expect Question 2 to become somehow magically solved; this is not a part of the ZeroMQ API ( for reasons rather loudly evangelised many times ). Better decide between .setsockopt( ZMQ.REQ_RELAXED, 1 ), if the API version and context of use permit, or not using the trivial REQ/REP pattern at all, due to its known risk of falling into an unsalvageable mutual deadlock ( ref. my other posts on this very subject, where this phenomenon has been both illustrated and explained countless times ).
In a similar manner, Question 1 only seems reasonable if you have never read the ZeroMQ specifications and/or documentation and the ZeroMQ "Best Practices". Having spent some time on these, your options become crystal clear: there are no such tools built in. One can add an add-on, if in need of any such non-core logic for her/his own purposes. The only setting that indirectly influences the behaviour on aSocket.close() is .setsockopt( ZMQ.LINGER, 0 ), which may help prevent the system from transitioning into an effective hang-up state, where aSocket waits infinitely for a state that will never happen, in cases when the message-queue is still non-empty ( messages still waiting to get delivered ).
Going into distributed-systems design is like entering a new world. No sequences are guaranteed ( non-serial code execution paths happen ). There are no means of local control over remote entities, their states, their failures, their presence at all, or their actual ZeroMQ API version.
Indeed a challenging world to enter into.
N.b.:
You might already know that one can .connect() aSocket-instance ( better, an access point to aSocket-instance ) to more than one remote end without using the proxy. Some additional .setsockopt() tuning, setting ZMQ.IMMEDIATE to a value of 1, will help better manage the round-robin distribution policy, irrespective of the transport classes used for the actual message delivery ( { tcp:// | ipc:// | vmci:// | pgm:// | epgm:// | inproc:// } ). All that at your fingertips.
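For completeness, a minimal sketch of that no-proxy variant (the two endpoints below are made up; the binding is the same org.zeromq one used in the question):
import org.zeromq.ZMQ

// Sketch: one REQ socket connected directly to two REP servers; ZeroMQ
// round-robins outgoing requests among the connected endpoints.
// The answer above additionally suggests tuning ZMQ.IMMEDIATE to 1 via setsockopt,
// if the binding in use exposes that option.
val context = ZMQ.context(1)
val req = context.socket(ZMQ.REQ)
req.connect("tcp://10.0.0.1:5564")
req.connect("tcp://10.0.0.2:5564")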

Clustering in AEM

I am facing an error which is somewhat peculiar. I am using AEM 5.6.1.
I have 2 author instances (a1 and a2) and both are in a cluster. We are performing tar optimization on the instances daily between 2 a.m. and 5 a.m. (London time zone). Now, in the error.log of a2, I am seeing the below error every day in the above-mentioned time window:
419 ERROR [pool-6-thread-1] org.apache.sling.discovery.impl.cluster.ClusterViewServiceImpl getEstablishedView: the existing established view does not incude the local instance yet! Assuming isolated mode.
Now, I did some research on this and have come to know that AEM uses ClusterViewServiceImpl.java for clustering, and in that, the below-mentioned code snippet is basically failing:
EstablishedClusterView clusterViewImpl = new EstablishedClusterView(
        config, view, getSlingId());
boolean foundLocal = false;
for (Iterator<InstanceDescription> it = clusterViewImpl
        .getInstances().iterator(); it.hasNext();) {
    InstanceDescription instance = it.next();
    if (instance.isLocal()) {
        foundLocal = true;
        break;
    }
}
if (foundLocal) {
    return clusterViewImpl;
} else {
    logger.info("getEstablishedView: the existing established view does not incude the local instance yet! Assuming isolated mode.");
    return getIsolatedClusterView();
}
Can someone help me understand this in more depth? Does it mean that the clustering is not working properly? What could be the possible impacts of this error?
I think you've got a classic case of split brain.
Clustering authors is not a good approach and has been disfavoured in later versions of AEM, as the authors often get out of sync when they can't talk to each other for whatever reason, usually something temporary and network-related. Believe me, they are sensitive.
When communication drops, the slave thinks it no longer has a master and claims to be the master itself. When that occurs and communication is re-established, the damage has been done, as there is no recovery mechanism.
At best, only ever allow users to connect to the primary author and have the secondary author as a High Availability server.
Better still, set up replication from the primary author that everyone writes to, and have it auto replicate on write to the secondary backup author.
Hope that helps.

Moving from file-based tracing session to real time session

I need to log trace events during boot so I configure an AutoLogger with all the required providers. But when my service/process starts I want to switch to real-time mode so that the file doesn't explode.
I'm using TraceEvent and I can't figure out how to do this move correctly and atomically.
The first thing I tried:
const int timeToWait = 5000;
using (var tes = new TraceEventSession("TEMPSESSIONNAME", @"c:\temp\TEMPSESSIONNAME.etl") { StopOnDispose = false })
{
    tes.EnableProvider(ProviderExtensions.ProviderName<MicrosoftWindowsKernelProcess>());
    Thread.Sleep(timeToWait);
}
using (var tes = new TraceEventSession("TEMPSESSIONNAME", TraceEventSessionOptions.Attach))
{
    Thread.Sleep(timeToWait);
    tes.SetFileName(null);
    Thread.Sleep(timeToWait);
    Console.WriteLine("Done");
}
Here I wanted to make sure that I could transfer the session to real-time mode. But instead, the file I got contained events from a 15s period instead of just 10s.
The same happens if I use new TraceEventSession("TEMPSESSIONNAME", @"c:\temp\TEMPSESSIONNAME.etl", TraceEventSessionOptions.Create) instead.
It seems that the following will cause the file to stop being written to:
using (var tes = new TraceEventSession("TEMPSESSIONNAME"))
{
tes.EnableProvider(ProviderExtensions.ProviderName<MicrosoftWindowsKernelProcess>());
Thread.Sleep(timeToWait);
}
But here I must re-enable all the providers, and according to the documentation, "if the session already existed it is closed and reopened (thus orphans are cleaned up on next use)". I don't understand the last part about orphans. Obviously some events might occur in the time between closing, opening and subscribing to the events. Does this mean I will lose these events, or will I get them later?
I also found the following in the documentation of the library:
In real time mode, events are buffered and there is at least a second or so delay (typically 3 sec) between the firing of the event and the reception by the session (to allow events to be delivered in efficient clumps of many events)
Does this make the above code alright (well, unless the improbable happens and for some reason my thread is delayed for more than a second between creating the real-time session and starting processing the events)?
I could close the session and create a new different one but then I think I'd miss some events. Or I could open a new session and then close the file-based one but then I might get duplicate events.
I couldn't find online any examples of moving from a file-based trace to a real-time trace.
I managed to contact the author of TraceEvent and this is the answer I got:
Re the exception of the 'auto-closing and restarting' feature, it is really questions about the OS (TraceEvent simply calls the underlying OS API). Just FYI, the deal about orphans is that it is EASY for your process to exit but leave a session going. This MAY be what you want, but often it is not, and so to make the common case 'just work' if you do Create (which is the default), it will close a session if it already existed (since you asked for a new one).
Experimentation of course is the touchstone of 'truth', but I would frankly say that expecting unusual combinations to just work is generally NOT realistic.
My recommendation is to keep it simple. You need to open a new session and close the original one. Yes, you will end up with duplicates, but you CAN filter them out (after all they are IDENTICAL timestamps).
The other possibility is to use SetFileName in its intended way (from one file to another). This certainly solves your problem of file size growth, and often is a good way to deal with other scenarios (after all, you can start up your processing and start deleting files even as new files are being generated).

How do I disable Celery's default timeout for a task, and/or prevent it from retrying?

I'm having some trouble with celery. Unfortunately the person who set it up isn't working here any more, and until now we've never had problems and thought we understood how it works well enough. Now it has become clear that we don't, and after hours of searching through the documentation and other posts on here, I have to admit defeat. Hopefully someone here can shed some light on what I am missing.
We're using several tasks, all of them are defined in a CELERYBEAT_SCHEDULE like this:
CELERYBEAT_SCHEDULE = {
    'runs-every-5-minutes': {
        'task': 'tasks.webhook',
        'schedule': crontab(minute='*/5'),
        'args': (WEBHOOK_BASE + '/task/refillordernumberbuffer', {'refill_count': 1000})
    },
    'send-sameday-delivery-confirmation': {
        'task': 'tasks.webhook',
        'schedule': crontab(minute='*/2'),
        'args': (WEBHOOK_BASE + '/task/sendsamedaydeliveryconfirmation', {})
    },
    'send-customer-hotspot-notifications': {
        'task': 'tasks.webhook',
        'schedule': crontab(hour=9, minute=0),
        'args': (WEBHOOK_BASE + '/task/sendcustomerhotspotnotifications', {})
    },
}
That's not all of them, but they all work like this. All of those are actually PHP scripts that have no knowledge of the whole celery concept. They are just scripts that execute certain things, and send notifications if necessary. When they are done, they just spit out a JSON response that says success=true.
As far as I know, celery is only used to execute them periodically. We don't have problems with any of them except the last one in my code snippet. That task/script sends out emails, usually 5 to 10, but sometimes a lot more. And that's where the problems start, because (as far as I could tell by watching celery events; I honestly could not find any confirmation for this in the docs anywhere) when the successful JSON response from the PHP script doesn't arrive within 3 minutes, celery retries the task, and the script sends a lot of emails again. And again, because only a small number of emails was saved as "done" from the task's initial run. This often leads to 4 or 5 retries until enough emails have been marked as "successfully sent" by the prior retries that the last retry finally finishes under this mystical 3-minute limit.
My questions:
Is there a default time limit? Where is it set? How do I override it? I've read about time_limit and soft_time_limit, but nothing I tried in the config seemed to help. If this is the solution, I would need some guidance on how to apply the settings properly.
Can't I "just" disable the whole retry concept (for one task or for all, doesn't really matter) altogether? It seems to me that we don't need it, as we're running our tasks periodically and missing one due to a temporary error would not matter. I guess that means we shouldn't have used celery in the first place as we're misusing it, but for now I'd just like to understand it better.
Thanks for any help, and sorry if I left anything unclear – happy to answer any follow-up questions and provide more details if necessary.
The rest of the config file goes like this:
## Broker settings.
databases = parse_databases_xml()
settings = parse_custom_settings_xml()
BROKER_URL = 'redis://' + databases['taskqueue']['host'] + '/' + databases['taskqueue']['dbname']
# List of modules to import when celery starts.
CELERY_IMPORTS = ("tasks", )
## Using the database to store task state and results.
CELERY_RESULT_BACKEND = BROKER_URL
CELERY_TASK_SERIALIZER = 'json'
CELERY_RESULT_SERIALIZER = 'json'
CELERY_ANNOTATIONS = {
    "*": {"rate_limit": "100/m"},
    "ping": {"rate_limit": "100/m"},
}
There is no time_limit to be found anywhere, so I don't think we're setting it ourselves. I left out the Python imports and the functions that read from our config XML files, as that stuff is all working fine and just concerns some database auth data.

WMI and Win32_DeviceChangeEvent - Wrong event type returned?

I am trying to register for a "device added / device removed" event using WMI. When I say device, I mean something along the lines of a Disk-On-Key or any other device that has files on it which I can access...
I am registering for the event, and the event is raised, but the EventType property is different from the one I am expecting to see.
The documentation (MSDN) states: 1 - configuration change, 2 - device added, 3 - device removed, 4 - docking. For some reason I always get a value of 1.
Any ideas?
Here's the sample code:
public class WMIReceiveEvent
{
    public WMIReceiveEvent()
    {
        try
        {
            WqlEventQuery query = new WqlEventQuery(
                "SELECT * FROM Win32_DeviceChangeEvent");
            ManagementEventWatcher watcher = new ManagementEventWatcher(query);
            Console.WriteLine("Waiting for an event...");
            watcher.EventArrived +=
                new EventArrivedEventHandler(
                    HandleEvent);
            // Start listening for events
            watcher.Start();
            // Do something while waiting for events
            System.Threading.Thread.Sleep(10000);
            // Stop listening for events
            watcher.Stop();
            return;
        }
        catch (ManagementException err)
        {
            MessageBox.Show("An error occurred while trying to receive an event: " + err.Message);
        }
    }

    private void HandleEvent(object sender, EventArrivedEventArgs e)
    {
        Console.WriteLine(e.NewEvent.GetPropertyValue("EventType"));
    }

    public static void Main()
    {
        WMIReceiveEvent receiveEvent = new WMIReceiveEvent();
        return;
    }
}
Well, I couldn't find the code. Tried on my old RAC account, nothing. Nothing in my old backups. Go figure. But I tried to work out how I did it, and I think this is the correct sequence (I based a lot of it on this article):
1. Get all drive letters and cache them.
2. Wait for the WM_DEVICECHANGE message, and start a timer with a timeout of 1 second (this is done to avoid a lot of spurious WM_DEVICECHANGE messages that start as soon as you insert the USB key/other device and only end when the drive is "settled").
3. Compare the drive letters with the old cache and detect the new ones.
4. Get device information for those.
I know there are other methods, but that proved to be the only one that would work consistently across different versions of Windows, and we needed that because my client used the ActiveX control on a web page that uploaded images from any kind of device you inserted (I think they produced some kind of printing kiosk).
Oh! Yup, I've been through that, but using the raw Windows API calls some time ago, while developing an ActiveX control that detected the insertion of any kind of media. I'll try to unearth the code from my backups and see if I can tell you how I solved it. I'll subscribe to the RSS just in case somebody gets there first.
Well, you can try the Win32_LogicalDisk class and bind it to the __InstanceCreationEvent.
You can easily get the required info.
I tried this on my system and I eventually get the right code. It just takes a while. I get a dozen or so events, and one of them is the device connect code.