Clustering in AEM

I am facing a peculiar error. I am using AEM 5.6.1.
I have two author instances (a1 and a2), both in a cluster. We perform Tar optimization on the instances daily between 2 a.m. and 5 a.m. (London time). In the error.log of a2, I see the error below every day during that window:
419 ERROR [pool-6-thread-1] org.apache.sling.discovery.impl.cluster.ClusterViewServiceImpl getEstablishedView: the existing established view does not incude the local instance yet! Assuming isolated mode.
I did some research on this and came to know that AEM uses ClusterViewServiceImpl.java for clustering, and that the code snippet below is basically what is failing:
EstablishedClusterView clusterViewImpl = new EstablishedClusterView(
        config, view, getSlingId());
boolean foundLocal = false;
for (Iterator<InstanceDescription> it = clusterViewImpl
        .getInstances().iterator(); it.hasNext();) {
    InstanceDescription instance = it.next();
    if (instance.isLocal()) {
        foundLocal = true;
        break;
    }
}
if (foundLocal) {
    return clusterViewImpl;
} else {
    logger.info("getEstablishedView: the existing established view does not incude the local instance yet! Assuming isolated mode.");
    return getIsolatedClusterView();
}
Can someone help me understand this in more depth? Does it mean that clustering is not working properly? What are the possible impacts of this error?

I think you've got a classic case of split brain.
Clustering authors is not a good approach and has been disfavoured in later versions of AEM, because the authors often get out of sync when they can't talk to each other for whatever reason, usually something temporary and network related. Believe me, they are sensitive.
When communication drops, the slave thinks it no longer has a master and claims to be the master itself. When that happens and communication is later re-established, the damage has already been done, as there is no recovery mechanism.
At a minimum, only ever allow users to connect to the primary author and keep the secondary author as a high-availability standby.
Better still, set up replication from the primary author that everyone writes to, and have it auto-replicate on write to the secondary backup author.
Hope that helps.

Late data handling | Apache Beam

Late data that has missed the window and the .withAllowedLateness() period is dropped from the pipeline, as documented here.
I have a few questions on this behavior:
How do we handle late data that is dropped from the pipeline? Can we add a default behavior, say, logging all late data somewhere like a catch-all bucket?
Can we have a metric (Google Dataflow/Beam metrics) that tells us how many of these messages are dropped from the pipeline because they arrive too late?
In general, we define late data as elements that, by the time they arrive, we simply prefer to drop and do not want to process any further. As far as I know, adding extra functionality to handle those messages would require substantial changes to the Java SDK. However, if you just want to log them, that is already done by the LateDataDroppingDoFnRunner code, which is responsible for dropping data from expired windows:
for (WindowedValue<InputT> input : concatElements) {
    BoundedWindow window = Iterables.getOnlyElement(input.getWindows());
    if (canDropDueToExpiredWindow(window)) {
        // The element is too late for this window.
        droppedDueToLateness.inc();
        WindowTracing.debug(
            "{}: Dropping element at {} for key:{}; window:{} "
                + "since too far behind inputWatermark:{}; outputWatermark:{}",
            LateDataFilter.class.getSimpleName(),
            input.getTimestamp(),
            key,
            window,
            timerInternals.currentInputWatermarkTime(),
            timerInternals.currentOutputWatermarkTime());
    }
}
Note that the log message is at DEBUG level, so you might not see it. As explained here, to override the level in Dataflow you can use --defaultWorkerLogLevel=DEBUG or, even better, target a particular class with --workerLogLevelOverrides={"org.apache.beam.sdk.util.WindowTracing":"DEBUG"}. Choosing your keys wisely can help expose information to identify the dropped message (i.e., data lineage).
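The same overrides can also be set programmatically on the pipeline options instead of on the command line. The snippet below is only a sketch, assuming the Dataflow runner's DataflowWorkerLoggingOptions interface as it exists in Beam 2.x (class and method names may differ between SDK versions):
import org.apache.beam.runners.dataflow.options.DataflowWorkerLoggingOptions;
import org.apache.beam.runners.dataflow.options.DataflowWorkerLoggingOptions.Level;
import org.apache.beam.runners.dataflow.options.DataflowWorkerLoggingOptions.WorkerLogLevelOverrides;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class WorkerLoggingSetup {
    public static void main(String[] args) {
        DataflowWorkerLoggingOptions options = PipelineOptionsFactory
                .fromArgs(args)
                .as(DataflowWorkerLoggingOptions.class);

        // Equivalent of --defaultWorkerLogLevel=INFO
        options.setDefaultWorkerLogLevel(Level.INFO);

        // Equivalent of --workerLogLevelOverrides={"org.apache.beam.sdk.util.WindowTracing":"DEBUG"}
        options.setWorkerLogLevelOverrides(
                new WorkerLogLevelOverrides()
                        .addOverrideForName("org.apache.beam.sdk.util.WindowTracing", Level.DEBUG));

        // Pipeline p = Pipeline.create(options); ... build and run the pipeline as usual.
    }
}
Either way, these settings only affect what the workers log; they do not change which elements are dropped.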
As can be seen in the LateDataFilter snippet above, droppedDueToLateness is a Counter metric that is incremented each time an element is dropped: droppedDueToLateness.inc();. You can monitor it using Stackdriver with resource type dataflow_job and metric custom.googleapis.com/dataflow/droppedDueToLateness.
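On top of that built-in counter, you can keep your own metric (and, if you like, a crude version of the "catch-all bucket" asked about above) with Beam's Metrics API, applied before the data reaches the windowed aggregation. The DoFn below is only a sketch: the class name and threshold are made up, and it compares against processing time because user code cannot observe the watermark, so it only approximates canDropDueToExpiredWindow():
import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.transforms.DoFn;
import org.joda.time.Duration;
import org.joda.time.Instant;

public class FlagPotentiallyLateFn extends DoFn<String, String> {
    // Hypothetical threshold; mirror whatever you pass to withAllowedLateness().
    private static final Duration ALLOWED_LATENESS = Duration.standardMinutes(10);

    // Custom counter, reported alongside the built-in metrics.
    private final Counter potentiallyLate =
            Metrics.counter(FlagPotentiallyLateFn.class, "potentiallyLateElements");

    @ProcessElement
    public void processElement(ProcessContext c) {
        // c.timestamp() is the element's event-time timestamp.
        if (c.timestamp().isBefore(Instant.now().minus(ALLOWED_LATENESS))) {
            potentiallyLate.inc();
            // This is also the place where you could log the element or divert
            // it to a separate "catch-all" output before it gets dropped later.
        }
        c.output(c.element());
    }
}
Counters declared this way appear in the Dataflow monitoring UI and can be queried from the PipelineResult's metrics.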

Moving from file-based tracing session to real time session

I need to log trace events during boot, so I configure an AutoLogger with all the required providers. But when my service/process starts, I want to switch to real-time mode so that the file doesn't explode in size.
I'm using TraceEvent and I can't figure out how to do this move correctly and atomically.
The first thing I tried:
const int timeToWait = 5000;
using (var tes = new TraceEventSession("TEMPSESSIONNAME", @"c:\temp\TEMPSESSIONNAME.etl") { StopOnDispose = false })
{
    tes.EnableProvider(ProviderExtensions.ProviderName<MicrosoftWindowsKernelProcess>());
    Thread.Sleep(timeToWait);
}
using (var tes = new TraceEventSession("TEMPSESSIONNAME", TraceEventSessionOptions.Attach))
{
    Thread.Sleep(timeToWait);
    tes.SetFileName(null);
    Thread.Sleep(timeToWait);
    Console.WriteLine("Done");
}
Here I wanted to make sure that I could transfer the session to real-time mode. But instead, the file I got contained events from a 15 s period instead of just 10 s.
The same happens if I use new TraceEventSession("TEMPSESSIONNAME", @"c:\temp\TEMPSESSIONNAME.etl", TraceEventSessionOptions.Create) instead.
It seems that the following will cause the file to stop being written to:
using (var tes = new TraceEventSession("TEMPSESSIONNAME"))
{
    tes.EnableProvider(ProviderExtensions.ProviderName<MicrosoftWindowsKernelProcess>());
    Thread.Sleep(timeToWait);
}
But here I must re-enable all the providers, and according to the documentation, "if the session already existed it is closed and reopened (thus orphans are cleaned up on next use)". I don't understand the last part about orphans. Obviously some events might occur in the time between closing, opening and subscribing to the events. Does this mean I will lose these events, or will I get them later?
I also found the following in the documentation of the library:
In real time mode, events are buffered and there is at least a second or so delay (typically 3 sec) between the firing of the event and the reception by the session (to allow events to be delivered in efficient clumps of many events)
Does this make the above code alright (well, unless the improbable happens and for some reason my thread is delayed for more than a second between creating the real-time session and starting processing the events)?
I could close the session and create a new different one but then I think I'd miss some events. Or I could open a new session and then close the file-based one but then I might get duplicate events.
I couldn't find online any examples of moving from a file-based trace to a real-time trace.
I managed to contact the author of TraceEvent and this is the answer I got:
Re the exception of the 'auto-closing and restarting' feature, those are really questions about the OS (TraceEvent simply calls the underlying OS API). Just FYI, the deal about orphans is that it is EASY for your process to exit but leave a session going. This MAY be what you want, but often it is not, and so to make the common case 'just work' if you do Create (which is the default), it will close a session if it already existed (since you asked for a new one).
Experimentation of course is the touchstone of 'truth', but frankly, expecting unusual combinations to just work is generally NOT true.
My recommendation is to keep it simple. You need to open a new session and close the original one. Yes, you will end up with duplicates, but you CAN filter them out (after all, they have IDENTICAL timestamps).
The other possibility is to use SetFileName in its intended way (from one file to another). This certainly solves your problem of file size growth, and often is a good way to deal with other scenarios (after all, you can start up your processing and start deleting files even as new files are being generated).

Debug missing messages in akka

I have the following architecture at the moment:
Load (a Play app with a basic interface for load tests) -> Gateway (a Spray application with a REST interface for incoming messages) -> Processor (an Akka app that works with MongoDB) -> MongoDB
Everything works fine as long as the number of messages I am pushing through is low. However, when I try to push 10000 events, which will eventually end up in MongoDB as documents, it stops at a random place, for example at message 742 or 982, and does nothing after that.
What would be the best way to debug such situations? On the load side I am just pushing hard into the REST service:
for (i ← 0 until users) workerRouter ! Load(server, i)
and then in the workerRouter:
WS.url(server + "/track/message").post(Json.toJson(newUser)).map { response =>
  println(response.body)
  true
}
On the spray side:
pathPrefix("track") {
  path("message") {
    post {
      entity(as[TrackObj]) { msg =>
        processors ! msg
        complete("")
      }
    }
  }
}
On the processor side it's basically just an insert into a collection. Any suggestions on where to start?
Update:
I tried moving the logic of creating messages into the Gateway and ran a loop from 1 to 10000; that works just fine. However, if Spray and Play are both involved in the pipeline, it stops at random places. Any suggestions on how to debug this case?
In a distributed and parallel environment it is next to impossible to create a system that works reliably. Whatever debugging method you use, it will only let you find the few bugs that happen to occur during the debugging session.
Our team once spent 3 months (!) tuning an application for robust 24/7 operation, and still there were bugs. Then we applied model checking (Spin). Within a couple of weeks we implemented a model that allowed us to get a robust application. However, model checking requires a somewhat different way of thinking, and it can be difficult to get started.
I moved the load test app to the Spray framework and now it works like a charm, so I suppose the problem was somewhere in the way I used the WS API in the Play framework:
WS.url(server + "/track/message").post(Json.toJson(newUser)).map { response =>
  println(response.body)
  true
}
The problem is resolved but not solved; I won't pursue a solution based on Play.

WF4 InstancePersistenceCommand interrupted

I have a Windows service running workflows. The workflows are XAMLs loaded from a database (users can define their own workflows using a rehosted designer). The service is configured with one instance of SqlWorkflowInstanceStore to persist workflows when they become idle. (It's basically derived from the example code in \ControllingWorkflowApplications from Microsoft's WCF/WF samples.)
But sometimes I get an error like the one below:
System.Runtime.DurableInstancing.InstanceOwnerException: The execution of an InstancePersistenceCommand was interrupted because the instance owner registration for owner ID 'a426269a-be53-44e1-8580-4d0c396842e8' has become invalid. This error indicates that the in-memory copy of all instances locked by this owner have become stale and should be discarded, along with the InstanceHandles. Typically, this error is best handled by restarting the host.
I've been trying to find the cause, but it is hard to reproduce in development; on production servers, however, I get it once in a while. One hint I found: when I look at the LockOwnersTable, the LockExpiration is set to 01/01/2000 0:00:00 and is not getting updated anymore, while under normal circumstances it should be updated every x seconds according to the host lock renewal period...
So, why would SqlWorkflowInstanceStore stop renewing this LockExpiration, and how can I detect the cause?
This happens because there are procedures running in the background that try to extend the lock of the instance store every 30 seconds, and it seems that once a connection to the SQL service fails, the instance store is marked as invalid.
You can see the same behaviour if you delete the instance store record from the [LockOwnersTable] table.
The proposed solution is that when this exception fires, you free the old instance store and initialize a new one:
public class WorkflowInstanceStore : IWorkflowInstanceStore, IDisposable
{
    public WorkflowInstanceStore(string connectionString)
    {
        _instanceStore = new SqlWorkflowInstanceStore(connectionString);
        InstanceHandle handle = _instanceStore.CreateInstanceHandle();
        InstanceView view = _instanceStore.Execute(handle,
            new CreateWorkflowOwnerCommand(), TimeSpan.FromSeconds(30));
        handle.Free();
        _instanceStore.DefaultInstanceOwner = view.InstanceOwner;
    }

    public InstanceStore Store
    {
        get { return _instanceStore; }
    }

    public void Dispose()
    {
        if (null != _instanceStore)
        {
            var deleteOwner = new DeleteWorkflowOwnerCommand();
            InstanceHandle handle = _instanceStore.CreateInstanceHandle();
            _instanceStore.Execute(handle, deleteOwner, TimeSpan.FromSeconds(10));
            handle.Free();
        }
    }

    private InstanceStore _instanceStore;
}
You can find the best practices for creating an instance store handle at this link:
Workflow Instance Store Best practices
This is an old thread, but I just stumbled on the same issue.
Damir's Corner suggests checking whether the instance handle is still valid before calling the instance store. I quote the whole post here:
Certain aspects of Workflow Foundation are still poorly documented; the persistence framework being one of them. The following snippet is typically used for setting up the instance store:
var instanceStore = new SqlWorkflowInstanceStore(connectionString);
instanceStore.HostLockRenewalPeriod = TimeSpan.FromSeconds(30);
var instanceHandle = instanceStore.CreateInstanceHandle();
var view = instanceStore.Execute(instanceHandle,
    new CreateWorkflowOwnerCommand(), TimeSpan.FromSeconds(10));
instanceStore.DefaultInstanceOwner = view.InstanceOwner;
It's difficult to find a detailed explanation of what all of this does; and to be honest, usually it's not necessary. At least not until you start encountering problems, such as InstanceOwnerException:
The execution of an InstancePersistenceCommand was interrupted because the instance owner registration for owner ID '9938cd6d-a9cb-49ad-a492-7c087dcc93af' has become invalid. This error indicates that the in-memory copy of all instances locked by this owner have become stale and should be discarded, along with the InstanceHandles. Typically, this error is best handled by restarting the host.
The error is closely related to the HostLockRenewalPeriod property, which defines how long the obtained instance handle is valid without being renewed. If you monitor the database while an instance store with a valid instance handle is instantiated, you will notice [System.Activities.DurableInstancing].[ExtendLock] being called periodically. This stored procedure is responsible for renewing the handle. If for some reason it fails to be called within the specified HostLockRenewalPeriod, the above-mentioned exception will be thrown when attempting to persist a workflow. A typical reason for this would be a temporarily inaccessible database due to maintenance or networking problems. It's not something that happens often, but it's bound to happen if you have a long-living instance store, e.g. in a constantly running workflow host, such as a Windows service.
Fortunately it's not all that difficult to fix the problem once you know the cause of it. Before using the instance store you should always check if the handle is still valid, and renew it if it's not:
if (!instanceHandle.IsValid)
{
    instanceHandle = instanceStore.CreateInstanceHandle();
    var view = instanceStore.Execute(instanceHandle,
        new CreateWorkflowOwnerCommand(), TimeSpan.FromSeconds(10));
    instanceStore.DefaultInstanceOwner = view.InstanceOwner;
}
It's definitely less invasive than the restart of the host suggested by the error message.
You have to be sure about the expiration of the owner user.
Here is how I usually handle this issue:
public SqlWorkflowInstanceStore SetupSqlpersistenceStore()
{
    SqlWorkflowInstanceStore sqlWFInstanceStore = new SqlWorkflowInstanceStore(ConfigurationManager.ConnectionStrings["DB_WWFConnectionString"].ConnectionString);
    sqlWFInstanceStore.InstanceCompletionAction = InstanceCompletionAction.DeleteAll;
    InstanceHandle handle = sqlWFInstanceStore.CreateInstanceHandle();
    InstanceView view = sqlWFInstanceStore.Execute(handle, new CreateWorkflowOwnerCommand(), TimeSpan.FromSeconds(30));
    handle.Free();
    sqlWFInstanceStore.DefaultInstanceOwner = view.InstanceOwner;
    return sqlWFInstanceStore;
}
And here is how you can use this method:
wfApp.InstanceStore = SetupSqlpersistenceStore();
I hope this helps.

Why could database changes disappear?

I have a MongoDB server running on a 64-bit Amazon EC2 instance (journaling enabled). Yesterday I updated some documents and refreshed the webpage to make sure it reflected the changes. It did.
But today I see that not only are yesterday's changes gone; I have lost a week of updates!
Why could this be and is it possible to recover the lost data?
Maybe there's something wrong in the way I make the changes?
public function edit_app()
{
    $query = array('_id' => $_POST['id']);
    $apps = $this->mongo->db->apps;
    if ($app = $apps->findOne($query)) {
        $app['title'] = $_POST['title'];
        $app['version'] = $_POST['version'];
        $app['author'] = $_POST['author'];
        ...
        $apps->save($app);
    }
}
There is not much that can be said definitively based on the information you have provided. I can, however, offer some hints to point you in the right direction:
Think about whether there is a possibility that an application process could have been holding some documents in memory (loaded before your update) and re-saved them after your update.
Is the server part of a replica set? If so, were all members of the replica set healthy, with the primary server up and elected correctly?
I apologize, I must have been blind. There was an error in the edit_app() function:
$app['visible'] = $_POST['visible']; // was
$app['visible'] = isset($_POST['visible']); // fixed