syslog - log line classifications - saas

A very generic question, asked from a programmer's point of view with the operational side of the process (program) in mind.
Is there any best practice or guide for classifying log messages, particularly in a SaaS / multi-tenant server environment that generates errors and warnings because of user actions or misconfiguration? Because of the nature of the software, most modules I deal with are stateless; i.e. when an error happens, it is quite hard to tell a user error apart from an operational error (network misconfiguration, etc.).
What I would like to hear from experienced folks: what is a sensible approach here that makes it easy for the operations team to classify these messages and identify problems?

Just three aspects from an admin and log analysis/classification perspective:
Make the tag field/program name configurable. Then one can configure multiple instances to use log tags like app/user_1, app/user_2 etc., allowing for fast and simple filters on the syslog level.
Structure your messages from left to right, so one can filter different categories of log lines with simple search patterns or regular expressions. E.g. config error - cannot parse line 123 or runtime warning - lost connection to DB xyz
For very structured logs you might also take a look at the 'structured data' field in the syslog protocol (RFC 5424). So far it is rarely used and lacks tool support, but it allows for application log messages with namespaces and very clear key-value attributes.
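A minimal sketch of the last two points, assuming a hypothetical "category severity - text" layout and a made-up structured-data ID (`app@32473`, where 32473 is the enterprise number reserved for documentation examples). The helper names are illustrative and not part of any syslog library.

```python
import re

def format_message(category: str, severity: str, text: str) -> str:
    """Build a message whose most general fields come first (left to right)."""
    return f"{category} {severity} - {text}"

def sd_element(sd_id: str, params: dict) -> str:
    """Render one RFC 5424 STRUCTURED-DATA element.

    Inside a parameter value, '"', '\\' and ']' are backslash-escaped.
    """
    def escape(value: str) -> str:
        return re.sub(r'(["\\\]])', r"\\\1", value)

    body = " ".join(f'{key}="{escape(str(val))}"' for key, val in params.items())
    return f"[{sd_id} {body}]" if body else f"[{sd_id}]"

lines = [
    format_message("config", "error", "cannot parse line 123"),
    format_message("runtime", "warning", "lost connection to DB xyz"),
]

# Because the category comes first, an anchored pattern is enough to filter.
config_errors = [line for line in lines if re.match(r"^config error - ", line)]
```

With this layout an operator can select a whole category with an anchored pattern such as `^config ` without parsing the rest of the line.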

Identify the servers and server types (name, ip address, etc.)
Classify by severity, and make sure all the clocks are in sync so that the messages are ordered correctly.
Include a message/error code so that filters and rules can be created in your monitoring tool.
Include a module name (useful if several modules run on one server).
Include a category for general services like networking, etc.
I guess you will gather the logs from the different machines, via their syslog daemons, on a central machine in charge of supervision/monitoring.

Most *nix processes log to syslog (or at least should), using a semi-standard format "Month Day 24H-Time host process_name[pid]: message". Syslog has ways to indicate a message's severity; use them (though keep in mind that the severity is from the system's perspective, not the application's).
If the message is for debugging, it usually looks like "Function_Name File_Name Line_No Error_Code Error_Desc"; otherwise the format of the message is entirely program dependent.
For multi-tenant systems it's pretty common for the "message" part to start with some form of tenant identification, followed by the actual log message.
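As a rough illustration, the layout above can be split with a regular expression, and a tenant prefix can then be peeled off the message part. The `tenant=<id>` prefix convention, the host and process names, and the function name are assumptions made for this sketch.

```python
import re

# Traditional layout: "Mon DD HH:MM:SS host process[pid]: message"
SYSLOG_LINE = re.compile(
    r"^(?P<month>[A-Z][a-z]{2})\s+(?P<day>\d{1,2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}) (?P<host>\S+) "
    r"(?P<process>[^\[\s:]+)(?:\[(?P<pid>\d+)\])?: (?P<message>.*)$"
)

# Hypothetical convention: the message starts with "tenant=<id> ".
TENANT_PREFIX = re.compile(r"^tenant=(?P<tenant>\S+) (?P<rest>.*)$")

def parse_line(line: str) -> dict:
    """Split a syslog line into its fields and extract the tenant, if any."""
    match = SYSLOG_LINE.match(line)
    if match is None:
        raise ValueError(f"not a syslog line: {line!r}")
    fields = match.groupdict()
    tenant = TENANT_PREFIX.match(fields["message"])
    fields["tenant"] = tenant.group("tenant") if tenant else None
    if tenant:
        fields["message"] = tenant.group("rest")
    return fields

record = parse_line(
    "Mar  3 14:05:09 web01 billing[4211]: "
    "tenant=acme config error - cannot parse line 123"
)
```

Here `record["tenant"]` is `"acme"` and `record["message"]` is the remaining text, so per-tenant filtering does not depend on the wording of the message itself.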

Related

Want to know what all the terms in the WinDbg output of "!analyze -v" indicate?

What does each key/value indicate, and which of the terms help me understand how WinDbg buckets the crashes, i.e. how does it broadly classify them?
help me to understand the windbg bucket
IMHO, the idea of buckets was introduced for WER (Windows Error Reporting). WER was used by Microsoft but was also available for companies. WER included a service where you could log in on a Microsoft website and then get an overview of your application crashes.
Of course, people were not interested in a flat list of crashes; they wanted to know how many crashes of the same type occurred. That way Microsoft and other companies could focus first on fixing the bugs that affected the most users.
The bucket, as the name suggests, is a container where similar problems are grouped. The bucket ID is generated in 2 phases: a labeling process which was done on the client side and a classifying process which was done on the server side.
What you get from !analyze is the classification, so basically you have access to the functionality via WinDbg that Microsoft used on the server side for providing the WER services.
These WER services are not available any more. They have been replaced by something else, but I have forgotten the name.
how does it broadly classify the crashes?
An ideal bucketing algorithm would create a new bucket for each bug. So the number of buckets is just limited by the amount of bugs you can code into your application.
The command !analyze implements more than 500 different heuristics. The combination of these can create more than 25,000,000 different buckets.
Buckets can differ because of
stack
modules
function name
function offset
corruptions (heap corruption, image corruption)
detected malware
known outdated programs or libraries
known defective hardware
exception codes
exception subcodes
...
The result of that bucketing process is this line of output:
FAILURE_BUCKET_ID: BREAKPOINT_80000003_ntdll.dll!LdrpDoDebuggerBreak
which is probably somehow equivalent to this hash:
FAILURE_ID_HASH: {06f54d4d-201f-7f5c-0224-0b1f2e1e15a5}
I have read some of your previous questions in the windbg tag and I get the impression that you want to use the bucket ID to display some meaningful information to humans.
Actually, the WER system provided such a feature. It worked like this: a developer analyzes the crashes in a bucket and finds out what to do (e.g. update a driver, install a newer version of the application etc). He then assigns that bucket ID a text. Any customers that experience the same crash again were redirected to a website at Microsoft that contained the text written by the developer.
However, note that there is no magic involved that would transfer a crash into something human readable. That's the developer doing hard work and then creating a mapping from the bucket ID to some text that is displayed.
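The mapping described above can be sketched as a plain lookup table. This is only an illustration: the advice text is a placeholder, the table is hypothetical and hand-maintained, and the function name is made up. Only the `FAILURE_BUCKET_ID:` line format is taken from the output shown earlier.

```python
import re

# Matches the bucket line as printed by "!analyze -v".
BUCKET_LINE = re.compile(r"^FAILURE_BUCKET_ID:\s*(?P<bucket>\S+)\s*$", re.MULTILINE)

# Hypothetical table, filled in by a developer after analyzing each bucket.
BUCKET_ADVICE = {
    "BREAKPOINT_80000003_ntdll.dll!LdrpDoDebuggerBreak":
        "Placeholder advice written by a developer for this bucket.",
}

def advice_for(analyze_output: str) -> str:
    """Return the developer-written text for the bucket found in the output."""
    match = BUCKET_LINE.search(analyze_output)
    if match is None:
        return "No bucket ID found in the output."
    return BUCKET_ADVICE.get(match.group("bucket"),
                             "Unknown bucket: needs manual analysis.")

sample_output = (
    "FAILURE_BUCKET_ID: BREAKPOINT_80000003_ntdll.dll!LdrpDoDebuggerBreak\n"
    "FAILURE_ID_HASH: {06f54d4d-201f-7f5c-0224-0b1f2e1e15a5}\n"
)
message = advice_for(sample_output)
```

Any bucket that is not in the table falls through to the "needs manual analysis" case, which is where the developer's work comes in.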
IMHO, the latter can easily be achieved. However, any new bug will require an analysis first. But, who knows, maybe we can train an AI that does better at this.
For more on buckets etc. please read the Microsoft paper "Debugging in the (Very) Large: Ten Years of Implementation and Experience".

Is it correct to use TargetSubID as a flag for test data in FIX protocol?

We are currently working on a FIX connection, whereby data that should only be validated can be marked. It has been decided to mark this data with a specific TargetSubID. But that implies a new session.
Let's say we send the messages to the session FIX.4.4:S->T. If we then get a message that should only be validated with TargetSubID V, this implies the session FIX.4.4:S->T/V. If this Session is not configured, we get the error
Unknown session: FIX.4.4:S->T/V
and if we explicitly configure this session next to the other, there is the error
quickfix.Session – [FIX/Session] Disconnecting: Encountered END_OF_STREAM
which, as bhageera says, happens because you log in with the same credentials:
"(...) the counterparty I was connecting to allows only 1 connection per user/password (i.e. session with those credentials) at a time."
I'm not a FIX expert, but I'm wondering if the TargetSubID is not just being misused here. If not, I would like to know how to do that. We develop the FIX client with camel-quickfix.
It depends a lot on what your system is like and what you want to achieve in the end.
Usually the dimensions to assess are:
maximising the flexibility
minimising the amount of additional logic required to support the testing
minimising the risk of bad things happening on accidental connection from a test to a prod environment (may happen, despite what you might think).
Speaking for myself, I would not use tags potentially involved in the session/routing behavior for testing unless all I need is routing features and my system reliably behaves the way I expect (probably not your case).
Instead I would consider one of these:
pick something from a user defined range (5000-9999)
use one of the symbology tags (say Symbol(55)) corrupted in some reversible way (say "TEST_VOD.L" in tag 55 instead of "VOD.L")
A tag from a custom range would give a lot of flexibility, a corrupted symbology tag would make sure a test order would bounce if sent to prod by accident.
For either solution you may need a tag-based routing and transformation layer. Both can be done in a couple of hours in generic form if you are using something Java-based (I'd look towards javax.script / Nashorn).
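A minimal sketch of the reversible symbol marking, written in Python only for brevity. It works on already-parsed tag/value pairs and leaves serialization aside, since a real message would also need BodyLength(9) and CheckSum(10) recomputed after any change. The `TEST_` prefix comes from the example above; the function names are made up.

```python
SOH = "\x01"            # FIX field delimiter
SYMBOL_TAG = "55"       # Symbol
TEST_PREFIX = "TEST_"   # reversible marker from the example above

def parse_fix(raw: str) -> list:
    """Split a raw FIX string into (tag, value) pairs, keeping field order."""
    return [tuple(field.split("=", 1)) for field in raw.split(SOH) if field]

def mark_as_test(fields: list) -> list:
    """Prefix the symbol so a production venue would reject the order."""
    return [(tag, TEST_PREFIX + value if tag == SYMBOL_TAG else value)
            for tag, value in fields]

def strip_test_marker(fields: list) -> tuple:
    """Undo mark_as_test; also report whether the message was marked."""
    is_test = False
    cleaned = []
    for tag, value in fields:
        if tag == SYMBOL_TAG and value.startswith(TEST_PREFIX):
            value = value[len(TEST_PREFIX):]
            is_test = True
        cleaned.append((tag, value))
    return cleaned, is_test

order = parse_fix("35=D" + SOH + "55=VOD.L" + SOH + "54=1" + SOH)
marked = mark_as_test(order)
restored, was_test = strip_test_marker(marked)
```

The receiving side would call `strip_test_marker` in its transformation layer and route on the returned flag, while an unmodified production system would see an unknown symbol.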
It's up to the counterparties - sometimes Sender/TargetSubID are considered part of the unique connection, sometimes they distinguish messages on one connection.
Does your library have a configuration option to exclude the sub IDs from the connection lookups? e.g. in QuickFix you can set the SessionQualifier.

What's a unique, persistent alternative to MAC address?

I need to be able to repeatably, non-randomly, uniquely identify a server host, which may be arbitrarily virtualized and over which I have no control.
A MAC address doesn't work because in some virtualized environments, network interfaces don't have hardware addresses.
Generating a state file and saving it to disk doesn't work because the virtual machine may be cloned, thus duplicating the file.
The server's SSH host keys may be a candidate. They can be cloned like a state file, but in practice they generally aren't because it's such a security problem that it's a mistake not often made.
There's also /var/lib/dbus/machine-id, but that's dependent on dbus. (Thanks Preetam).
There's a cpuid but that's apparently deprecated. (Thanks Bruno Aguirre on Twitter).
Hostname is worth considering. Many systems like Chef already require unique hostnames. (Thanks Alfie John)
I'd like the solution to persist a long time, and certainly across server reboots and software restarts. Ultimately, I also know that users of my software will deprecate a host and want to replace it with another, but keep continuity of the data associated with it, so there are reasons a UUID might be considered mutable over the long term, but I don't particularly want a host to start considering itself to be unknown and re-register itself for no reason.
Are there any alternative persistent, unique identifiers for a host?
It really depends on what is meant by "persistent". For example, two VMs can't each open the same network socket to you, so even if they are bit-level clones of each other it is possible to tell them apart.
So, all that is required is sufficient information to tell the machines apart for whatever the duration of the persistence is.
If the duration of the persistence is the length of a network connection, then you don't need any identifiers at all -- the sockets themselves are unique.
If the persistence needs to be longer -- say, for the length of a boot -- then you can regenerate UUIDs whenever the system boots. (Note that a VM that is cloned would still have to reboot, unless you're hot-copying it.)
If it needs to be longer than that -- say, indefinitely -- then you can generate a UUID identifier on boot and save it to disk, but only use this as part of the identifying information of the machine. If the virtual machine is subsequently cloned, you will know this since you will have two machines reporting the same ID from different sources -- for instance, two different network sockets, different boot times, etc. Since you can tell them apart, you have enough information to differentiate the two cloned machines, which means you can take a subsequent action that forces further differentiation, like instructing each machine to regenerate its state file.
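One way to picture this, as a sketch under assumptions: the server tracks which live connection currently holds each stored ID, and a second connection reporting the same ID is treated as a clone and told to regenerate. The class, method names and return values are hypothetical, and this variant only asks the newcomer to regenerate rather than both machines.

```python
import uuid

def new_host_id() -> str:
    """Generate the ID a host would save to disk on first boot."""
    return str(uuid.uuid4())

class HostRegistry:
    """Server-side view: which live connection currently holds each host ID."""

    def __init__(self):
        self._active = {}  # host_id -> connection_id

    def connect(self, host_id: str, connection_id: str) -> str:
        holder = self._active.setdefault(host_id, connection_id)
        # Same ID arriving over a different live connection: treat as a clone.
        return "ok" if holder == connection_id else "regenerate"

    def disconnect(self, host_id: str, connection_id: str) -> None:
        if self._active.get(host_id) == connection_id:
            del self._active[host_id]

registry = HostRegistry()
original_id = new_host_id()
first = registry.connect(original_id, "conn-1")    # original machine
second = registry.connect(original_id, "conn-2")   # clone with the copied file
clone_id = new_host_id()                            # clone regenerates its ID
third = registry.connect(clone_id, "conn-2")
```

A reboot of the original machine is not mistaken for a clone here, because its old connection is gone before the new one reports the same ID.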
Ultimately, if a machine is perfectly cloned, then by definition you cannot tell which one was the "real one" to begin with, only that there are now two distinguishable machines.
Implying that you can tell the difference between the "real one" and the "cloned one" means that there is some state you can use to record the difference between the two, like the timestamp of when the virtual machine itself was created, in which case you can incorporate that into the state record.
It looks like simple solutions have been ruled out.
So that could lead to complex solutions, like this protocol:
- Client sends tuple [ MAC addr, SSH public host key, sequence number ]
- If server receives this tuple as expected, server and client both increment sequence number.
- Otherwise server must determine what happened (was client cloned? did client move?), perhaps reaching a tentative conclusion and alerting a human to verify it.
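The server side of this protocol might look roughly like the following. It is a sketch only: the class name, the return values and the choice to start unknown identities at sequence number 0 are assumptions, and the MAC address and key in the usage are placeholders.

```python
class SequenceTracker:
    """Tracks the next expected sequence number per (MAC, host key) pair."""

    def __init__(self):
        self._expected = {}  # (mac, host_key) -> next expected sequence number

    def receive(self, mac: str, host_key: str, seq: int) -> str:
        identity = (mac, host_key)
        expected = self._expected.get(identity, 0)  # assumption: start at 0
        if seq == expected:
            # Both sides advance; the client increments after seeing "ok".
            self._expected[identity] = expected + 1
            return "ok"
        # Stale or skipped number: cloned? moved? Leave it for a human to check.
        return "investigate"

tracker = SequenceTracker()
mac, key = "02:00:00:00:00:01", "ssh-ed25519 AAAA-placeholder"
results = [
    tracker.receive(mac, key, 0),  # original host, first contact
    tracker.receive(mac, key, 1),  # original host
    tracker.receive(mac, key, 1),  # clone replaying the copied counter
    tracker.receive(mac, key, 2),  # original host continues
]
```

The clone is noticed because it resumes from a counter value the original has already used.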
I don't think there is a straightforward "use X solution" based on the info available, but here are some general suggestions that might get you to a better spot.
If cloning from a "gold image", consider using some "first boot" logic to generate a unique ID. Config management systems like Chef, Puppet or CFEngine provide some scaffolding to achieve this.
Consider a global state manager like ZooKeeper, specifically its atomic counter functionality. The same system could get a new ID over time, but it would be unique.
Also, this Stack Overflow question might give you some other direction. It references Twitter's approach to a similar problem.
If I understand correctly, you want a durable, globally unique identifier under these conditions:
An OS installation that can be cloned while running, so any state inside the VM won't work, and
Could be running in an arbitrary virtualization environment, so any state outside the VM won't work.
I realize this doesn't directly answer your question, but it really seems like either the design or the constraints need some substantial adjustment to accommodate a solution.

Why do we need Operational Transformation for real-time collaboration?

Having seen apps like Google Docs and libraries like ShareJS and EtherPad Lite, I am pretty excited about real-time collaboration, and this seems to be implemented using a very complex technique known as Operational Transformation.
My question is perhaps somewhat odd: why is OT necessary?
What I mean is, we have very low latency on the web in most settings - with tools like Google Docs, ShareJS and EtherPad, changes are almost instantly reflected on connected clients.
Why the incredibly complex solution of resolving conflicts and keeping things synchronized on the server-side?
Being familiar with the command pattern and undo/redo, it seems to me a much simpler solution would be to simply implement every change to a document as a command with an equivalent undo-command.
Let clients submit serialized commands when they make changes. Assign a serial number on the server-side to every received command. Distribute all commands applied to a document back to the clients, which also maintain a history of commands.
Each connected client receives back from the server all the commands applied to the document, now with serial numbers indicating the "correct" order, i.e. the order in which the commands were received by the server and applied to the master document held by the server.
If a client was at command number 100 and submits a new command that comes back as number 102, the client knows that it missed a command. It then simply applies the "undo" command for the last command it submitted, applies command number 101, and then applies its own command number 102 again, thus putting things back in order.
If it's behind by several commands, it simply rolls back as far as needed, then applies all missed commands, etc.
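To make the proposal concrete, here is a small sketch of the client side in Python. The `Insert` command and the `Client` class are hypothetical, and the sketch assumes the batch received from the server contains every command after the client's last known serial number, including the client's own pending ones.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Insert:
    """A command with an equivalent undo: insert text at a position."""
    pos: int
    text: str

    def apply(self, doc: str) -> str:
        return doc[:self.pos] + self.text + doc[self.pos:]

    def undo(self, doc: str) -> str:
        return doc[:self.pos] + doc[self.pos + len(self.text):]

class Client:
    def __init__(self, doc: str, serial: int):
        self.doc = doc
        self.serial = serial   # last serial number confirmed by the server
        self.pending = []      # local commands applied but not yet confirmed

    def local_edit(self, command: Insert) -> None:
        self.doc = command.apply(self.doc)
        self.pending.append(command)

    def receive(self, confirmed: list) -> None:
        """confirmed: (serial, command) pairs in server order."""
        for command in reversed(self.pending):   # roll back, newest first
            self.doc = command.undo(self.doc)
        self.pending.clear()
        for serial, command in confirmed:        # replay in server order
            self.doc = command.apply(self.doc)
            self.serial = serial

client = Client("hello", serial=100)
client.local_edit(Insert(0, ">> "))              # local view: ">> hello"
client.receive([(101, Insert(5, "!")),           # someone else's command
                (102, Insert(0, ">> "))])        # our own, as ordered by server
```

Note that this sketch replays each command exactly as submitted, with its original position unchanged.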
That sounds much simpler to me.
In what way is Operational Transformation better than that?

machine learning and code generator from strings

The problem: Given a set of hand-categorized strings (or a set of ordered vectors of strings), generate a categorization function to categorize more input. In my case, that data (or most of it) is not natural language.
The question: are there any tools out there that will do that? I'm thinking of some kind of reasonably polished, download, install and go kind of thing, as opposed to some library or a brittle academic program.
(Please don't get stuck on details as the real details would restrict answers to less generally useful responses AND are under NDA.)
As an example of what I'm looking at; the input I'm wanting to filter is computer generated status strings pulled from logs. Error messages (as an example) being filtered based on who needs to be informed or what action needs to be taken.
Doing Things Manually
If the error messages are being generated automatically and the list of exceptions behind the messages is not terribly large, you might just want to have a table that directly maps each error message type to the people who need to be notified.
This should make it easy to keep track of exactly who/which-groups will be getting what types of messages and to update the routing of messages should you decide that some of the messages are being misdirected.
Typically, a small fraction of the types of errors make up a large fraction of error reports. For example, Microsoft noticed that 80% of crashes were caused by 20% of the bugs in their software. So, to get something useful, you wouldn't even need to start with a complete table covering every type of error message. Instead, you could start with just a list that maps the most common errors to the right person and routes everything else to a person for manual routing. Each time an error is routed manually, you could then add an entry to the routing table so that errors of that type are handled automatically in the future.
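The incremental routing table described above could look something like this. The error type names, team names and the class itself are hypothetical.

```python
class ErrorRouter:
    """Maps known error types to owners; everything else goes to a person."""

    def __init__(self, table: dict, fallback: str):
        self.table = dict(table)
        self.fallback = fallback   # who routes unknown errors by hand

    def route(self, error_type: str) -> str:
        return self.table.get(error_type, self.fallback)

    def learn(self, error_type: str, owner: str) -> None:
        """Record a manual routing decision so it is automatic next time."""
        self.table[error_type] = owner

router = ErrorRouter({"DB_CONNECTION_LOST": "dba-team"}, fallback="triage")
before = router.route("DISK_FULL")          # not in the table yet
router.learn("DISK_FULL", "ops-team")       # triage decided who owns it
after = router.route("DISK_FULL")
```

Starting with only the most common error types keeps the table small, and the `learn` step grows it one manual decision at a time.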
Document Classification
Unless the error messages are being editorialized by people who submit them and you want to use this information when routing them, I wouldn't recommend treating this as a document classification task. However, if this is what you want to do, here's a list of reasonably good packages for document classification organized by programming language:
Python - To do this using the Python based Natural Language Toolkit (NLTK), see the Document Classification section in the freely available NLTK book.
Ruby - If Ruby is more of your thing, you can use the Classifier gem. Here's sample code that detects whether Family Guy quotes are funny or not-funny.
C# - C# programmers can use nBayes. The project's home page has sample code for a simple spam/not-spam classifier.
Java - Java folks have Classifier4J, Weka, Lucene Mahout, and as adi92 mentioned Mallet.
Learning Rules with Weka - If rules are what you want, Weka might be of particular interest, since it includes a rule set based learner. You'll find a tutorial on using Weka for text categorization here.
Mallet has a bunch of classifiers which you can train and deploy entirely from the command line.
Weka is nice too because it has a huge number of classifiers and preprocessors for you to play with.
Have you tried spam or email filters? By using text files that have been marked with appropriate categories, you should be able to categorize further text input. That's what those programs do, anyway, but instead of labeling your outputs as 'spam' and 'not spam', you could use other categories.
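Many spam filters are built on naive Bayes, so the idea can be sketched in a few lines of plain Python. This is a toy multinomial naive Bayes with add-one smoothing, not one of the packages mentioned above. The training strings and category names are made up.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text: str) -> list:
    return re.findall(r"[a-z0-9_]+", text.lower())

class NaiveBayes:
    def __init__(self):
        self.doc_counts = Counter()               # category -> training strings
        self.token_counts = defaultdict(Counter)  # category -> token -> count
        self.vocabulary = set()

    def train(self, text: str, category: str) -> None:
        tokens = tokenize(text)
        self.doc_counts[category] += 1
        self.token_counts[category].update(tokens)
        self.vocabulary.update(tokens)

    def classify(self, text: str):
        """Return the most likely category, or None if nothing was trained."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_category, best_score = None, -math.inf
        for category, docs in self.doc_counts.items():
            counts = self.token_counts[category]
            total_tokens = sum(counts.values())
            score = math.log(docs / total_docs)   # log prior
            for token in tokenize(text):
                # Add-one smoothing keeps unseen tokens from zeroing the score.
                score += math.log((counts[token] + 1) / (total_tokens + vocab_size))
            if score > best_score:
                best_category, best_score = category, score
        return best_category

model = NaiveBayes()
model.train("disk full on /var", "ops")
model.train("lost connection to DB xyz", "ops")
model.train("cannot parse line 123 of user config", "customer")
model.train("invalid api key supplied by user", "customer")
```

After training, `model.classify("lost connection to DB abc")` picks the category whose training strings share the most tokens with the input.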
You could also try something involving AdaBoost for a more hands-on approach to rolling your own. This library from Google looks promising, but probably doesn't meet your ready-to-deploy requirements.