I have an Elasticsearch river set up using the JDBC river plugin that just does a simple SELECT * FROM on a table and indexes it.
But I would like to be able to trigger the river on demand via the API, as well as at a standard time interval, so that I can have it index a document when it's inserted into this table.
Does anybody know if there's any way to do this at present?
i.e.
/_river/my_river/_refresh
Thanks.
I don't see a good way for you to trigger the JDBC river into indexing your specific updated document in real time, and I'm not sure it's meant to be used for that anyway.
Instead of triggering the JDBC river to index your document, why don't you just index the document from the update code?
The JDBC river is a great way to feed in large streams of data, and there is documentation for maintaining coherency with the polling, but I don't think there is an easy way to meet your real-time requirement.
Thanks for your suggestion. You are very welcome to give feedback; please join the Elasticsearch community. I will open an issue for triggering a fetch at https://github.com/jprante/elasticsearch-river-jdbc/issues
It sounds like you are struggling with the classic "push vs. pull" indexing problem. Rivers are designed to pull data out of the database at an interval. They're easy to set up, but like all things in computer science, they are a trade-off. Specifically, you lose real-time indexing. A river that you can trigger might be the best of both worlds, or it might inundate your server with a lot of unnecessary traffic (i.e. why do "SELECT * ..." when you know exactly which document was updated?).
If you have a real-time indexing requirement (like I did), you "push" your updates into Elasticsearch. You just need to write an Elasticsearch client that will deliver your updated records to Elasticsearch as they are saved. FWIW, I solved this problem by firing messages on a service bus, and a service waiting on the other end retrieved the entity from SQL and indexed it. Once you have that infrastructure, it's not a big deal to write a small app to do an initial import of SQL data or create a scheduled job to index data.
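To make the push approach concrete, here is a minimal sketch of indexing a record right after it is saved, using one call to Elasticsearch's HTTP index API from TypeScript (Node 18+ for the built-in fetch). The index name "items", the record shape, and the use of the modern _doc endpoint are assumptions for illustration, not anything from the question; older Elasticsearch versions put an explicit type name in the URL instead of _doc.

// Hedged sketch: re-index a single record right after the SQL write succeeds.
// Assumes Elasticsearch on localhost:9200 and an index called "items".
async function indexRecord(id: string, record: Record<string, unknown>) {
  const res = await fetch(`http://localhost:9200/items/_doc/${id}`, {
    method: "PUT",                                    // PUT with an explicit id creates or replaces the document
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(record),
  });
  if (!res.ok) {
    throw new Error(`indexing failed: ${res.status} ${await res.text()}`);
  }
}

// Call it from your update code:
// await indexRecord("42", { name: "example", updated_at: new Date().toISOString() });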
An alternative would be to use Logstash with the JDBC input plugin:
https://www.elastic.co/guide/en/logstash/current/plugins-inputs-jdbc.html
Download Logstash, install the logstash-input-jdbc plugin, and use a config like this:
input {
  jdbc {
    jdbc_connection_string => "jdbc:oracle:thin:@localhost:1521:XE"
    jdbc_user => "user"
    jdbc_password => "password"
    jdbc_driver_library => "/home/logstash/lib/ojdbc6.jar"
    jdbc_driver_class => "Java::oracle.jdbc.driver.OracleDriver"
    # :sql_last_value is persisted in last_run_metadata_path, so each run only fetches new rows
    statement => "select * from events where update_date > :sql_last_value order by update_date"
    last_run_metadata_path => "run_metadata_event.log"
    # cron-style schedule: run every minute
    schedule => "* * * * *"
  }
}
# The filter section is optional.
filter {
  mutate {
    split => { "field_1" => ";" }
  }
}
output {
  elasticsearch {
    # protocol => "http"
    hosts => ["localhost:9200"]
    index => "items"
    document_type => "doc_type"
    document_id => "%{doc_id}"
  }
}
Not quite sure if this question is a good fit for Stack Overflow or not.
I'm currently creating a webpage that uses MongoDB and Redis (with Node.js).
When a user is on a page, the backend gets asked for their user details every 5 seconds.
When retrieving data that frequently, should I get it from / store it in Redis, or in MongoDB? I need it for some sort of caching.
The reason it's every 5 seconds is that there could be changes to the data that need to be reflected from the backend.
Each user has details such as username, password, money, and about 25 other values.
How should I approach this to make it less heavy than using MongoDB alone?
example:
function calledEvery5Sec(userid) {
  // get the user details from MongoDB
}

or

function calledEvery5Sec(userid) {
  // get from Redis if it's available there, else load from MongoDB (and cache it)
}
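For what it's worth, the second variant is the classic cache-aside pattern. Here is a rough TypeScript sketch using ioredis and the official mongodb driver; the database/collection names, the lookup field, and the 5-second TTL are assumptions for illustration, not something prescribed by the question.

import Redis from "ioredis";
import { MongoClient } from "mongodb";

const redis = new Redis();                                   // localhost:6379
const mongo = new MongoClient("mongodb://localhost:27017");  // call mongo.connect() once at startup
const users = mongo.db("app").collection("users");           // assumed db/collection names

// Cache-aside: try Redis first, fall back to MongoDB and cache the result briefly.
async function getUserDetails(userId: string) {
  const cached = await redis.get(`user:${userId}`);
  if (cached) return JSON.parse(cached);

  const doc = await users.findOne({ userId });               // assumed lookup field
  if (doc) {
    // Short TTL so backend-side changes still show up within a few seconds.
    await redis.set(`user:${userId}`, JSON.stringify(doc), "EX", 5);
  }
  return doc;
}

The trade-off is a few seconds of staleness; if every change must be visible immediately, update or delete the cached key in the same place you write to MongoDB.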
Use a TTL index for data you only want to keep for a short period of time; MongoDB will then expire the documents automatically.
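As a small illustration of that suggestion (with the Node.js driver; the collection and field names here are made up): a TTL index is declared on a date field, and the server deletes expired documents on its own.

import { MongoClient } from "mongodb";

async function setUpCacheCollection() {
  const client = new MongoClient("mongodb://localhost:27017");
  await client.connect();
  const cache = client.db("app").collection("user_cache");    // hypothetical collection
  // Documents are removed automatically ~5 minutes after their createdAt value.
  await cache.createIndex({ createdAt: 1 }, { expireAfterSeconds: 300 });
  await cache.insertOne({ userId: "42", details: { money: 100 }, createdAt: new Date() });
}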
Alternatively, use $out or $merge to write the data you need into a temporary collection and fetch the details from that.
Can you post a sample document?
In a Meteor app, real-time reactive updates between all connected clients are achieved by writing to collections and publishing and subscribing to the right data. In the normal case this also means database writes.
But what if I would like to sync particular data that does not need to be persistent, and I would like to avoid the overhead of writing to the database? Is it possible to use Minimongo or some other in-memory cache on the server while still preserving DDP synchronization to all clients?
Example
In my app I have multiple collapsed threads and I want to show which users have currently expanded a particular thread:
Viewed by: Mike, Johny, Steven ...
I can store the information in the threads collection or make a separate viewers collection and publish the information to the clients. But there is actually no point in making this information persistent and having the overhead of database writes.
I am confused by the collections documentation, which states:
OPTIONS
connection Object
The server connection that will manage this collection. Uses the default connection if not specified. Pass the return value of calling DDP.connect to specify a different server. Pass null to specify no connection.
and
... when you pass a name, here’s what happens:
...
On the client (and on the server if you specify a connection), a Minimongo instance is created.
But if I create a new collection and pass the options object with connection: null
// Creates a new Mongo collection and exports it
export const Presentations = new Mongo.Collection('presentations', {connection: null});

/**
 * Publications
 */
if (Meteor.isServer) {
  // This code only runs on the server
  Meteor.publish(PRESENTATION_BY_MAP_ID, (mapId) => {
    check(mapId, nonEmptyString);
    return Presentations.find({ matchingMapId: mapId });
  });
}
no data is being published to the clients.
TLDR: it's not possible.
There is no magic in Meteor that allows data to be synced between clients without the data transiting through the MongoDB database. The whole sync process through publications and subscriptions is triggered by MongoDB writes. Hence, if you don't write to the database, you cannot sync data between clients (using the native pub/sub system available in Meteor).
After countless hours of trying everything possible I found a way to what I wanted:
export const Presentations = new Mongo.Collection('presentations', Meteor.isServer ? {connection: null} : {});
I checked MongoDB and no presentations collection is created. Also, on every server restart the collection is empty. There is a small downside on the client: even when collectionHandle.ready() is truthy, findOne() at first returns undefined and the data is synced afterwards.
I don't know if this is the right/preferable way, but it was the only one working for me so far. I tried to leave {connection: null} in the client code, but wasn't able to achieve any sync even though I implemented the added/changed/removed methods.
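For reference, the low-level added/changed/removed API mentioned above looks roughly like the sketch below when used to publish purely in-memory data; the publication name, the viewers collection, and the sample users are all made up, so treat it as a starting point rather than a verified solution to the sync problem described here.

import { Meteor } from 'meteor/meteor';
import { Mongo } from 'meteor/mongo';
import { check } from 'meteor/check';

// Server: publish in-memory data without any database write.
Meteor.publish('threadViewers', function (this: Subscription, threadId: string) {
  check(threadId, String);
  // 'viewers' exists only in this publication and in client-side Minimongo.
  this.added('viewers', threadId, { users: ['Mike', 'Johny', 'Steven'] });
  this.ready();
  // Call this.changed('viewers', threadId, { users: [...] }) whenever the in-memory state changes,
  // and this.removed('viewers', threadId) when the thread is collapsed again.
});

// Client: a collection with the same name receives the published documents.
const Viewers = new Mongo.Collection('viewers');
Meteor.subscribe('threadViewers', 'thread-1');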
Sadly, I wasn't able to get any further help, even in the Meteor forum (here and here).
I have the following query in my project and it is taking a lot of time to execute. I am trying to optimize it, but have not been able to do so successfully. Any suggestions would be highly appreciated.
_context.MainTable
    .Include(mt => mt.ChildTable1)
    .Include(mt => mt.ChildTable1.ChildTable2)
    .Include(mt => mt.ChildTable3)
    .Include(mt => mt.ChildTable3.ChildTable4)
    .SingleOrDefault(mt =>
        mt.ChildTable3.ChildTable4.Id == id
        && mt.ChildTable1.Operation == operation
        && mt.ChildTable1.Method == method
        && mt.StatusId == statusId);
Include() gets translated to a join, and you are using a lot of joins in this query. You can optimize the indexes with the help of the database engine's execution plan.
I suggest you do not use all the Include calls in one go. Instead, break up the query and apply Include one at a time: apply an Include, get the result, then apply the next Include, and so on. Having more than two Include calls affects performance.
I don't see any performance issues with your query.
Since you have a SingleOrDefault, I would look at optimizing the database call. If you have the analysis tools available in SQL Server Management Studio, choose Tools > SQL Server Profiler. Get the query into SQL Server Management Studio, select the query, and choose "Analyze Query in Database Engine Tuning Advisor".
I've installed Riak 1.0.2 on Ubuntu Natty.
I have also added some sample data to the database. I'm using a LevelDB backend because I want to test the Secondary Indexing functionality.
I added a test_1 bucket, and to that bucket I added the following data:
array("name" => "Ray_6", "age" => rand(25, 90), "email" => "addr_6@orbican.com") with key "id_1"
array("name" => "Ray_7", "age" => rand(25, 90), "email" => "addr_7@orbican.com") with key "id_2"
array("name" => "Ray_8", "age" => rand(25, 90), "email" => "addr_8@orbican.com") with key "id_3"
I'm trying to use the Search feature to query this data. Below is the cURL request that I enter on the command line:
curl http://localhost:8098/solr/test_1/select?q=name:Ray_6
But when I do this, I get a "not found" error.
Is there something I'm missing? Am I supposed to do something to the bucket to make it searchable?
I'd appreciate some assistance.
Thanks in advance.
Well, firstly, the above URL is using Riak Search, not secondary indexes. The URL to query a secondary index has the form:
/buckets/<bucket>/index/<fieldname_bin>/query
You form a secondary index by adding metadata headers when creating a record through the cURL interface. Client libraries for different languages will generate this for you.
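To make that concrete, here is a rough sketch of both steps against Riak's HTTP interface, written in TypeScript with Node's built-in fetch purely for readability; the bucket, key, host/port, and name_bin index follow the question's setup, while the rest is assumed.

// Hedged sketch: write a record and attach a secondary index entry via a metadata header.
async function putWithIndex() {
  await fetch("http://localhost:8098/buckets/test_1/keys/id_1", {
    method: "PUT",
    headers: {
      "Content-Type": "application/json",
      "x-riak-index-name_bin": "Ray_6",        // binary (string) secondary index on "name"
    },
    body: JSON.stringify({ name: "Ray_6", age: 42, email: "addr_6@orbican.com" }),
  });
}

// Query the secondary index (instead of the /solr search endpoint).
async function findByName(name: string) {
  const res = await fetch(`http://localhost:8098/buckets/test_1/index/name_bin/${name}`);
  const { keys } = await res.json();           // e.g. { "keys": ["id_1"] }
  return keys;
}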
Back to your specific question, though. Did you use the search-cmd tool to install an index for the test_1 bucket? If you did, did you have data in the bucket before doing so? Riak Search will not retroactively index your data. There are ways to index it after the fact, but they are time-consuming if this is just an experimental app.
If you don't have much data, I suggest you re-enter it after setting up the index. Otherwise, you need to add a secondary index or push each piece of data through the search API as you read/write it. It'll take time, but that's what is available in Riak right now.
Hope this helps.
I am building a logging system that will log requests and responses to a web service which is distributed across multiple application nodes. I was thinking of using MongoDB as the repository and logging in real-time, or more realistically dumping logs to DB after x number of requests. The application is designed to be considerably high volume and is built in Perl. Does anyone have any experience doing this? Recommendations? Or is this a no-no?
I've seen a lot of companies using MongoDB to store logs. Its schemalessness is really flexible for application logs, whose schema tends to change from time to time. Also, its capped collection feature is really useful because it automatically purges old data and keeps the data set fitting in memory.
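For illustration, a capped collection has to be created explicitly with a fixed size; a quick sketch with the Node.js MongoDB driver, where the database name, collection name, and size limits are arbitrary choices:

import { MongoClient } from "mongodb";

async function createLogCollection() {
  const client = new MongoClient("mongodb://localhost:27017");
  await client.connect();
  // Oldest entries are discarded automatically once the 100 MB / 100,000-document cap is reached.
  await client.db("logs").createCollection("requests", {
    capped: true,
    size: 100 * 1024 * 1024,
    max: 100_000,
  });
}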
People aggregate the logs with normal grouping or MapReduce, but it's not that fast. In particular, MongoDB's MapReduce only runs in a single thread and its JavaScript execution overhead is huge. The new aggregation framework can solve this problem.
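As an example of what the aggregation framework buys you, counting stored log entries per level is a single pipeline instead of a JavaScript MapReduce job; a sketch with the Node.js driver, assuming the requests collection and a level field as in the earlier sketch:

import { MongoClient } from "mongodb";

async function countByLevel() {
  const client = new MongoClient("mongodb://localhost:27017");
  await client.connect();
  // Group the stored logs by level; runs inside the server, no JavaScript involved.
  // Example result shape: [ { _id: "info", count: 5231 }, { _id: "error", count: 12 } ]
  return client
    .db("logs")
    .collection("requests")
    .aggregate([{ $group: { _id: "$level", count: { $sum: 1 } } }])
    .toArray();
}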
Another concern is high write throughput. Although MongoDB's inserts are fire-and-forget style by default, issuing a lot of insert commands causes heavy write lock contention. This can affect application performance and prevent readers from aggregating / filtering the stored logs.
One solution might be to use a log collector framework such as Fluentd, Logstash, or Flume. These daemons are launched on every application node and take the logs from the app processes.
They buffer the logs and asynchronously write the data out to other systems like MongoDB / PostgreSQL / etc. The writes are done in batches, so it's a lot more efficient than writing directly from the apps. This link describes how to put logs into Fluentd from a Perl program:
Fluentd: Data Import from Perl Applications
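The buffering idea itself is simple; a toy TypeScript sketch of it (not Fluentd itself, and the collection name, flush interval, and client setup are assumptions): log entries accumulate in memory and are flushed to MongoDB in one batched insert.

import { MongoClient } from "mongodb";

const client = new MongoClient("mongodb://localhost:27017");    // connect() once at application startup
const buffer: Record<string, unknown>[] = [];

// Called by the application for every request/response to be logged.
function log(entry: Record<string, unknown>) {
  buffer.push({ ...entry, ts: new Date() });
}

// Flush the buffer every 5 seconds as a single batched insert.
setInterval(async () => {
  if (buffer.length === 0) return;
  const batch = buffer.splice(0, buffer.length);                // take everything currently buffered
  try {
    await client.db("logs").collection("requests").insertMany(batch);
  } catch (err) {
    console.error("flush failed, dropping batch", err);
  }
}, 5000);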
I use it in several applications through Log::Dispatch::MongoDB; works like a charm!
# Declaration
use Log::Dispatch;
use Log::Dispatch::MongoDB;
use Log::Dispatch::Screen;
use Moose;

has log => (is => 'ro', isa => 'Log::Dispatch', default => sub { Log::Dispatch->new }, lazy => 1);
...
# Configuration
$self->log->add(
    Log::Dispatch::Screen->new(
        min_level => 'debug',
        name      => 'screen',
        newline   => 1,
    )
);
$self->log->add(
    Log::Dispatch::MongoDB->new(
        collection => MongoDB::Connection->new(
            host => $self->config->mongodb
        )->saveme->log,
        min_level => 'debug',
        name      => 'crawler',
    )
);
...
# The logging facility
$self->log->log(
    level   => 'info',
    message => 'Crawler finished',
    info    => {
        origin  => $self->origin,
        country => $self->country,
        counter => $self->counter,
        start   => $self->start,
        finish  => time,
    }
);
And here is a sample record from the capped collection:
{
    "_id" : ObjectId("50c453421329307e4f000007"),
    "info" : {
        "country" : "sa",
        "finish" : NumberLong(1355043650),
        "origin" : "onedayonly_sa",
        "counter" : NumberLong(2),
        "start" : NumberLong(1355043646)
    },
    "level" : "info",
    "name" : "crawler",
    "message" : "Crawler finished"
}
I've done this in a web app that runs on two app servers. Writes in MongoDB are non-blocking by default (the Java driver just takes the request and returns immediately; I assume it's the same for Perl, but you'd better check), which is perfect for this use case since you don't want your users to wait for a log to be recorded.
The downside of this is that in certain failure scenarios you might lose some logs (your app fails before mongo gets the data for example).
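If you want to opt into that trade-off explicitly rather than rely on driver defaults, it corresponds to an unacknowledged write concern. A hedged sketch with the Node.js driver (database and collection names assumed); the lost-log scenario above is exactly the price of w: 0.

import { MongoClient } from "mongodb";

async function logFireAndForget(entry: Record<string, unknown>) {
  const client = new MongoClient("mongodb://localhost:27017"); // reuse a shared client in real code
  await client.connect();
  // w: 0 = do not wait for the server to acknowledge the write.
  // Fast, but writes can be silently lost in the failure scenarios described above.
  await client
    .db("logs")
    .collection("requests")
    .insertOne(entry, { writeConcern: { w: 0 } });
}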
For some interesting ideas for your app, I recommend checking out Graylog2 if you haven't already. They use a combination of MongoDB and Elasticsearch quite effectively. Adding a powerful search engine into the mix can give you some interesting query and analysis options.
For your reference, here's an Elasticsearch page dedicated to log processing tools and techniques.
If you are planning to queue the log entries before processing (which I would recommend), I suggest Kestrel as a solid message queue option. This is what Gaug.es uses, and I've been putting it through its paces lately. A JVM app (written in Scala), it's extremely fast and atomic, and it conveniently speaks the memcache protocol. It's a great way to scale horizontally, and the in-memory queue is backed by a journaled file for a good balance of speed and durability.
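Because Kestrel speaks the memcache protocol, enqueuing and dequeuing are just memcache set/get calls against the queue name. A rough sketch using the memcached npm client; the queue name web_logs is made up, and the port assumes Kestrel's stock memcache listener on 22133 (adjust to your config).

import Memcached from "memcached";

const queue = new Memcached("localhost:22133");   // Kestrel exposes each queue over the memcache protocol

// Producer (web app): enqueue a JSON-encoded log entry on the "web_logs" queue.
function enqueue(entry: object) {
  queue.set("web_logs", JSON.stringify(entry), 0, (err) => {
    if (err) console.error("enqueue failed", err);
  });
}

// Consumer (the process that writes to MongoDB): a get pops the next item off the queue.
function dequeue(handle: (entry: object) => void) {
  queue.get("web_logs", (err, data) => {
    if (!err && data) handle(JSON.parse(data));
  });
}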