I am building a logging system that will log requests and responses to a web service which is distributed across multiple application nodes. I was thinking of using MongoDB as the repository and logging in real-time, or more realistically dumping logs to DB after x number of requests. The application is designed to be considerably high volume and is built in Perl. Does anyone have any experience doing this? Recommendations? Or is this a no-no?
I've seen a lot of companies using MongoDB to store logs. Its schema-free design is really flexible for application logs, whose schema tends to change from time to time. Also, its capped collection feature is really useful because it automatically purges old data, keeping the collection at a fixed size so the data can stay in memory.
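For reference, creating such a collection is a one-liner; a minimal mongo-shell sketch (the collection name and size limits here are illustrative, not from the answer):

// Create a capped collection: fixed maximum size, insertion order preserved,
// oldest documents dropped automatically once the limit is reached.
db.createCollection("logs", { capped: true, size: 1024 * 1024 * 1024, max: 10000000 });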
People aggregate the logs with normal grouping or MapReduce, but it's not that fast. In particular, MongoDB's MapReduce runs in a single thread and its JavaScript execution overhead is huge. The new aggregation framework could solve this problem.
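As a rough illustration of that aggregation framework (MongoDB 2.2+), here is a shell sketch that counts log entries per level over the last hour; the collection and field names are assumptions, not the poster's schema:

// Group recent log entries by level and count them.
db.logs.aggregate([
    { $match: { timestamp: { $gte: new Date(Date.now() - 3600 * 1000) } } },
    { $group: { _id: "$level", count: { $sum: 1 } } },
    { $sort: { count: -1 } }
]);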
Another concern is high write throughput. Although MongoDB's insert is fire-and-forget style by default, issuing a lot of insert commands causes heavy write-lock contention. This can hurt application performance and prevent readers from aggregating / filtering the stored logs.
One solution might be to use a log collector framework such as Fluentd, Logstash, or Flume. These daemons are meant to run on every application node and take the logs from the app processes.
They buffer the logs and asynchronously write the data out to other systems like MongoDB / PostgreSQL / etc. The writes are done in batches, so it's a lot more efficient than writing directly from the apps. This link describes how to send logs to Fluentd from a Perl program.
Fluentd: Data Import from Perl Applications
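Whichever collector you choose (or if you simply buffer in the application and flush every N requests, as the question suggests), the efficiency comes from turning many small writes into one bulk insert. A rough sketch of such a flush step with the MongoDB Node.js driver, using illustrative names (the same idea applies to the Perl driver):

// Flush an in-memory buffer of log entries to MongoDB in a single round trip.
// "db" is an open driver handle; "buffer" is an array of plain log objects.
async function flushLogs(db, buffer) {
    if (buffer.length === 0) return;
    const batch = buffer.splice(0, buffer.length);                      // take everything currently buffered
    await db.collection('logs').insertMany(batch, { ordered: false });  // one batched, unordered insert
}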
I use it in several applications through Log::Dispatch::MongoDB; works like a charm!
# Declaration
use Log::Dispatch;
use Log::Dispatch::MongoDB;
use Log::Dispatch::Screen;
use Moose;

has log => (
    is      => 'ro',
    isa     => 'Log::Dispatch',
    default => sub { Log::Dispatch->new },
    lazy    => 1,
);
...
# Configuration
$self->log->add(
    Log::Dispatch::Screen->new(
        min_level => 'debug',
        name      => 'screen',
        newline   => 1,
    )
);
$self->log->add(
    Log::Dispatch::MongoDB->new(
        collection => MongoDB::Connection->new(
            host => $self->config->mongodb
        )->saveme->log,
        min_level => 'debug',
        name      => 'crawler',
    )
);
...
# The logging facility
$self->log->log(
    level   => 'info',
    message => 'Crawler finished',
    info    => {
        origin  => $self->origin,
        country => $self->country,
        counter => $self->counter,
        start   => $self->start,
        finish  => time,
    },
);
And here is a sample record from the capped collection:
{
    "_id" : ObjectId("50c453421329307e4f000007"),
    "info" : {
        "country" : "sa",
        "finish" : NumberLong(1355043650),
        "origin" : "onedayonly_sa",
        "counter" : NumberLong(2),
        "start" : NumberLong(1355043646)
    },
    "level" : "info",
    "name" : "crawler",
    "message" : "Crawler finished"
}
I've done this on a webapp that runs on two app servers. Writes in MongoDB are non-blocking by default (the Java driver just takes the request and returns immediately; I assume it's the same for Perl, but you'd better check), which is perfect for this use case since you don't want your users to wait for a log to be recorded.
The downside is that in certain failure scenarios you might lose some logs (for example, your app fails before mongo gets the data).
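The knob behind this trade-off is the write concern. A minimal Node.js-driver sketch (option names may differ in the Perl driver, so check its documentation as suggested above):

// Sketch: the same insert with two different write concerns.
async function writeLog(db, logEntry) {
    // w: 0 - fire-and-forget: fastest, but the application never sees failures.
    await db.collection('logs').insertOne(logEntry, { writeConcern: { w: 0 } });

    // w: 1 - wait for the primary to acknowledge: slower, but failed writes surface as errors.
    // await db.collection('logs').insertOne(logEntry, { writeConcern: { w: 1 } });
}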
For some interesting ideas for your app, I recommend checking out Graylog2 if you haven't already. They use a combination of MongoDB and Elasticsearch quite effectively. Adding a powerful search engine into the mix can give you some interesting query and analysis options.
For your reference, here's an Elasticsearch page dedicated to log processing tools and techniques.
If you are planning to queue the log entries before processing (which I would recommend), I suggest Kestrel as a solid message queue option. This is what Gaug.es uses, and I've been putting it through its paces lately. A JVM app, it's extremely fast and atomic, and it conveniently speaks the memcache protocol. It's a great way to scale horizontally, and the memory cache is backed by a journaled file for a good balance of speed and durability.
In a Meteor app, real-time reactive updates between all connected clients are achieved by writing to collections and publishing and subscribing to the right data. In the normal case this also means database writes.
But what if I want to sync particular data which does not need to be persistent, and I would like to avoid the overhead of writing to the database? Is it possible to use Minimongo or another in-memory cache on the server while still preserving DDP synchronisation to all clients?
Example
In my app I have multiple collapsed threads and I want to show which users have currently expanded a particular thread:
Viewed by: Mike, Johny, Steven ...
I can store the information in the threads collection or make a separate viewers collection and publish the information to the clients. But there is actually no point in making this information persistent and incurring the overhead of database writes.
I am confused by the collections documentation, which states:
OPTIONS
connection Object
The server connection that will manage this collection. Uses the default connection if not specified. Pass the return value of calling DDP.connect to specify a different server. Pass null to specify no connection.
and
... when you pass a name, here’s what happens:
...
On the client (and on the server if you specify a connection), a Minimongo instance is created.
But if I create a new collection and pass the options object with connection: null
// Creates a new Mongo collection and exports it
export const Presentations = new Mongo.Collection('presentations', { connection: null });

/**
 * Publications
 */
if (Meteor.isServer) {
    // This code only runs on the server
    Meteor.publish(PRESENTATION_BY_MAP_ID, (mapId) => {
        check(mapId, nonEmptyString);
        return Presentations.find({ matchingMapId: mapId });
    });
}
no data is being published to the clients.
TLDR: it's not possible.
There is no magic in Meteor that allows data to be synced between clients without the data passing through the MongoDB database. The whole sync process through publications and subscriptions is triggered by MongoDB writes. Hence, if you don't write to the database, you cannot sync data between clients (using the native pub/sub system available in Meteor).
After countless hours of trying everything possible I found a way to do what I wanted:
export const Presentations = new Mongo.Collection('presentations', Meteor.isServer ? { connection: null } : {});
I checked MongoDB and no presentations collection is being created. Also, on every server restart the collection is empty. There is a small downside on the client: even though collectionHandle.ready() is truthy, findOne() first returns undefined and the data is synced afterwards.
I don't know if this is the right/preferable way, but it was the only one that worked for me so far. I tried to leave { connection: null } in the client code as well, but wasn't able to achieve any sync even though I implemented the added/changed/removed methods.
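For completeness, a server-side low-level publication over such a connection: null collection typically looks like the sketch below (the collection and publication names come from the question's code; the rest is a sketch, not a verified fix for the issue described above):

// Server: publish from the in-memory collection by driving DDP messages manually
// instead of returning a cursor.
Meteor.publish(PRESENTATION_BY_MAP_ID, function (mapId) {
    check(mapId, nonEmptyString);
    const handle = Presentations.find({ matchingMapId: mapId }).observeChanges({
        added: (id, fields) => this.added('presentations', id, fields),
        changed: (id, fields) => this.changed('presentations', id, fields),
        removed: (id) => this.removed('presentations', id),
    });
    this.ready();
    this.onStop(() => handle.stop());
});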
Sadly, I wasn't able to get any further help even in the meteor forum here and here
I'm using the Play framework with Scala. I also use the rediscala driver (this one: https://github.com/etaty/rediscala) to communicate with Redis. If Redis doesn't contain the data, my app looks for it in MongoDB.
When Redis fails or is just not available for some reason, the application waits too long for a response. How do I implement a failover strategy in this case? I would like to stop querying Redis if requests take too long, and start using Redis again when it is back online.
To clarify the question, my code currently looks like the following:
private def getUserInfo(userName: String): Future[Option[UserInfo]] = {
  CacheRepository.getBaseUserInfo(userName) flatMap {
    case Some(userInfo) =>
      Logger.trace(s"AuthenticatedAction.getUserInfo($userName). User has been found in cache")
      Future.successful(Some(userInfo))
    case None =>
      getUserFromMongo(userName)
  }
}
I think you need to distinguish between the following cases (in order of their likelihood of occurrence). In every case you need to wrap your RedisClient so that your application code is aware of disconnects/reconnects; essentially keep two states: one where Redis is working properly, and one where Redis is down/slow.
1. No data in the cache (Redis): I guess in this case Redis will return very quickly and you have to get the data from Mongo. In your code above you should also set the data in Redis after you get it from Mongo, so that it is in the cache for subsequent calls.
2. Redis is slow: this could happen for one of the following reasons.
2.1. The network is slow: again, you cannot do much about this except return a message to your client. Going to Mongo is unlikely to help if your network itself is slow.
2.2. The operation is slow: this happens if you are trying to fetch a lot of data or you are running a range query on a sorted set, for example. In this case you need to revisit the Redis data structures you are using and the amount of data you are storing in Redis. However, in your example this is not going to be an issue; single Redis get operations are generally low latency on a LAN.
3. The Redis node is not reachable: I'm not sure how often this will happen unless your network is down, and in that case you will probably have trouble connecting to MongoDB as well. It can also happen when the node running Redis is down, its disk is full, etc., so you should handle it in your design. That said, the rediscala client will automatically detect disconnects and reconnect; I have personally stopped Redis, upgraded the Redis version and restarted it without touching my running client (JVM).
Finally, you can use a Future with a timeout (see Scala Futures - built in timeout?) in your program above. If the Future is not completed by the timeout, you can take your other action(s) (go to Mongo or return an error message to the user). Given that #1 and #2 are likely to happen much more frequently than #3, your timeout value should reflect these two cases. Since #1 and #2 are fast on a LAN, you can start with a timeout value of 100ms.
Soumya Simanta provided a detailed answer, and I just want to post the code I used for the timeout. The code requires the Play framework, which is used in my project.
private def get[B](key: String, valueExtractor: Map[String, ByteString] => Option[B], logErrorMessage: String): Future[Option[B]] = {
  val timeoutFuture = Promise.timeout(None, Duration(Settings.redisTimeout))

  val mayBeHaveData = redisClient.hgetall(key) map { value =>
    valueExtractor(value)
  } recover {
    case e =>
      Logger.info(logErrorMessage + e)
      None
  }

  // if the timeout fires first, None will be the result of this method
  Future.firstCompletedOf(List(mayBeHaveData, timeoutFuture))
}
This question is specifically pertaining to Couchbase, but I believe it would apply to anything with the memcached api.
Let's say I am creating a client/server chat application, and on my server I am storing chat session information for each user in a data bucket. After the chat session is over I will remove the session object from the data bucket, but at the same time I also want to persist it to a permanent NoSQL datastore for reporting and analytics purposes. I also want session objects to be persisted upon cache eviction, when sessions time out, etc.
Is there some sort of "best practice" (or even a feature of Couchbase that I am missing) that enables me to do this efficiently while maintaining the best possible performance of my in-memory caching system?
Using Couchbase Server 2.0, you could set up two buckets (or two separate clusters if you want to separate physical resources). On the session cluster, you'd store JSON documents (the value in the key/value pair), perhaps like the following:
{
    "sessionId" : "some-guid",
    "users" : [ "user1", "user2" ],
    "chatData" : [ "message1", "message2" ],
    "isActive" : true,
    "timestamp" : [2012, 8, 6, 11, 57, 00]
}
You could then write a Map/Reduce view in the session database that gives you a list of all expired items (note: the example below, with the meta argument, requires a recent build of Couchbase Server 2.0, not the DP4).
function (doc, meta) {
    if (doc.sessionId && !doc.isActive) {
        emit(meta.id, null);
    }
}
Then, using whichever Couchbase client library you prefer, you could have a task to query the view, get the items and move them into the analytics cluster (or bucket). So in C# this would look something like:
var view = sessionClient.GetView("sessions", "all_inactive");

foreach (var item in view)
{
    var doc = sessionClient.Get(item.ItemId);
    analyticsClient.Store(StoreMode.Add, item.ItemId, doc);
    sessionClient.Remove(item.ItemId);
}
If you instead wanted to use an explicit timestamp or expiry, your view could index based on the timestamp:
function (doc) {
    if (doc.sessionId && !doc.isActive) {
        emit(doc.timestamp, null);
    }
}
Your task could then query the view by including a startkey to return all documents that have not been touched in x days.
var view = sessionClient.GetView("sessions", "all_inactive")
                        .StartKey(new int[] { DateTime.Now.Year, DateTime.Now.Month, DateTime.Now.Day - 1 });

foreach (var item in view)
{
    var doc = sessionClient.Get(item.ItemId);
    analyticsClient.Store(StoreMode.Add, item.ItemId, doc);
    sessionClient.Remove(item.ItemId);
}
Check out http://www.couchbase.com/couchbase-server/next for more info on Couchbase Server 2.0, and if you need any clarification on this approach, just let me know in this thread.
-- John
CouchDB storage is (eventually) persistent and has no built-in expiry mechanism, so whatever you store in it will remain stored until you remove it - it's not like Memcached, where you can set a timeout for stored data.
So if you are storing sessions in CouchDB you will have to remove them yourself when they expire, and since that is not an automated mechanism but something you do on your own, there is no reason not to save the data wherever you want at the same time.
TBH I see no advantage of using a persistent NoSQL store over SQL for session storage (and vice versa) - the performance of both will be IO bound. A memory-only key store or a hybrid solution is a whole different story.
As for your problem: move the data in your app's session-expiry / session-close mechanism, and/or run a cron job that periodically checks the session storage for expired sessions and moves the data.
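A rough Node.js sketch of that second option, the periodic sweep; findExpiredSessions, archiveSession and removeSession are hypothetical helpers standing in for whatever stores you actually use:

// Periodically move expired sessions from the session store to the permanent store.
// All three helper functions are placeholders for your own storage calls.
setInterval(async () => {
    const expired = await findExpiredSessions();        // e.g. query by last-activity timestamp
    for (const session of expired) {
        await archiveSession(session);                   // write to the permanent store first
        await removeSession(session.sessionId);          // then delete it from the session store
    }
}, 60 * 1000);                                           // run once a minute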
I have an Elasticsearch river set up using the JDBC river plugin that just does a simple select * from the table and indexes it.
But I would like to be able to trigger the river on demand via the API, as well as at a standard time interval, so that I can have it index a document when it's inserted into this table.
Anybody know if there's any way to do this at present?
i.e.
/_river/my_river/_refresh
Thanks.
I don't see a good way for you to trigger the JDBC River into indexing your specific updated document in real time, and I'm not sure it's meant to be used for that anyways.
Instead of triggering the JDBC river to index your document, why don't you just index the document from the update code?
The JDBC river is a great way to feed in large streams of data, and there is documentation for maintaining coherency with the polling, but I don't think there is an easy way to meet your real-time requirement.
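For illustration, pushing a record into Elasticsearch from the update path is only a few lines with an Elasticsearch client; this sketch uses the official JavaScript client, and the index name and document shape are assumptions (in 8.x clients the body option is called document):

// After saving the row in your database, index the same record in Elasticsearch.
const { Client } = require('@elastic/elasticsearch');
const client = new Client({ node: 'http://localhost:9200' });

async function indexRecord(record) {
    await client.index({
        index: 'items',
        id: String(record.id),   // reuse the SQL primary key as the document id
        body: record
    });
}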
Thanks for your suggestion - feedback is very welcome, please join the elasticsearch community. I will open an issue for triggering a fetch at https://github.com/jprante/elasticsearch-river-jdbc/issues
It sounds like you are struggling with the classic "push vs. pull" indexing problem. Rivers are designed to pull data out of the database at an interval. They're easy to set up, but like all things in computer science, they are a trade-off. Specifically, you lose real-time indexing. A river that you can trigger might be the best of both worlds, or it might inundate your server with a lot of unnecessary traffic (i.e. why do "SELECT * ..." when you know exactly which document was updated?).
If you have a real-time indexing requirement (like I did), you "push" your updates into Elasticsearch. You just need to write an Elasticsearch client that will deliver your updated records to Elasticsearch as they are saved. FWIW, I solved this problem by firing messages on a service bus, and a service waiting on the other end retrieved the entity from SQL and indexed it. Once you have that infrastructure, it's not a big deal to write a small app to do an initial import of SQL data or create a scheduled job to index data.
An alternative would be to use Logstash with the JDBC input plugin:
https://www.elastic.co/guide/en/logstash/current/plugins-inputs-jdbc.html
Download Logstash, install the jdbc input plugin, and use a config like the following example:
input {
    jdbc {
        jdbc_connection_string => "jdbc:oracle:thin:@localhost:1521:XE"
        jdbc_user => "user"
        jdbc_driver_library => "/home/logstash/lib/ojdbc6.jar"
        jdbc_driver_class => "Java::oracle.jdbc.driver.OracleDriver"
        statement => "select * from events where update_date > :sql_last_value order by update_date"
        last_run_metadata_path => "run_metadata_event.log"
        schedule => "* * * * *"
        jdbc_password => "password"
    }
}

# The filter part of this config is optional.
filter {
    mutate {
        split => { "field_1" => ";" }
    }
}

output {
    elasticsearch {
        #protocol => "http"
        hosts => ["localhost:9200"]
        index => "items"
        document_type => "doc_type"
        document_id => "%{doc_id}"
    }
}
So I'm getting a ton of data continuously that's getting put into a processedData collection. The data looks like:
{
    date: "2011-12-4",
    time: 2243,
    gender: {
        males: 1231,
        females: 322
    },
    age: 32
}
So I'll get lots and lots of data objects like this continually. I want to be able to see all "males" that are above 40 years old. This does not seem to be an efficient query because of the sheer size of the data.
Any tips?
Generally speaking, you can't.
However, there may be some shortcuts, depending on your actual requirements. Do you want to count 'males above 40' across the whole dataset, or just for one day?
One day: split your data into daily collections (processedData-20111121, ...); this will help your queries. You can also cache the results of such a query.
Whole dataset: pre-aggregate the data. That is, upon insertion of a new data entry, do something like this:
db.preaggregated.update({_id : 'male_40'},
{$set : {gender : 'm', age : 40}, $inc : {count : 1231}},
true);
Similarly, if you know all your queries beforehand, you can just precalculate them (and not keep raw data).
It also depends on how you define "real-time" and how big a query load you will have. In some cases it is ok to just fire ad-hoc map-reduces.
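For reference, the ad-hoc version of the example query ('males above 40') against the document shape in the question could look like the aggregation sketch below; field names come from the question, and on a huge collection this is exactly the kind of scan the pre-aggregation above avoids:

// Sum the male counts across all documents with age above 40 (mongo shell sketch).
db.processedData.aggregate([
    { $match: { age: { $gt: 40 } } },
    { $group: { _id: null, males: { $sum: "$gender.males" } } }
]);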
My guess is that your target GUI is a website? In that case you are looking for something called Comet. You should build a layer which processes all the data and broadcasts new mutations to your clients or to an event bus (more on that below). Mongo doesn't give you real-time data out of the box, as it doesn't emit anything on a mutation, so you can use any data store that suits you.
Depending on the language you'll use, you have different options (for Comet):
Socket.io (nodejs) - Javascript
Cometd - Java
SignalR - C#
Libwebsocket - C++
Most of the time you'll need an event bus or message queue to put the mutation events on. Take a look at JMS, Redis, or NServiceBus (depending on what you'll use).
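As a tiny sketch of that broadcast layer with Socket.io (one of the options listed above); the port, event name and the shape of the mutation object are illustrative:

// Broadcast every processed mutation to all connected clients.
const { Server } = require('socket.io');
const io = new Server(3000);            // Socket.io server listening on port 3000

function onMutation(mutation) {
    // Call this from your processing layer whenever new data arrives.
    io.emit('mutation', mutation);      // push the change to every connected browser
}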