Mongoid/MongoDB query with time condition is extremely slow

I want to select all the data whose ts (timestamp) is less than a specific time:
last_record = History.where(report_type: /#{params["report_type"]}/).order_by(ts: 1).only(:ts).last
History.where(:ts.lte => last_record.ts)
This query seems to take an extremely long time.
I don't understand why. Is there a quicker way to do this sort of query?
class History
  include Mongoid::Document
  include Mongoid::Timestamps
  include Mongoid::Attributes::Dynamic

  field :report_type, type: String
  field :symbol, type: String
  field :ts, type: Time
end
The query log in the console:
Started GET "/q/com_disagg/last" for 127.0.0.1 at 2015-01-10 10:36:55 +0800
Processing by QueryController#last as HTML
Parameters: {"report_type"=>"com_disagg"}
MOPED: 127.0.0.1:27017 COMMAND database=admin command={:ismaster=>1} runtime: 0.4290ms
...
MOPED: 127.0.0.1:27017 GET_MORE database=cot_development collection=histories limit=0 cursor_id=44966970901 runtime: 349.9560ms
I have set an index on the timestamp, but the query is still extremely slow:
db.system.indexes.find()
{ "v" : 1, "key" : { "ts" : 1 },
"name" : "ts_index",
"ns" : "cot_development.histories" }

Related

MongoDB does not remove documents from a collection with a configured TTL Index

I am trying to get started with TTL indexes in Mongo and wanted to get a simple demo setup running, but I just cannot get it to work, and I am not sure what I am doing wrong.
Mongod version:
$ mongod --version
db version v4.4.5
Build Info: {
"version": "4.4.5",
"gitVersion": "ff5cb77101b052fa02da43b8538093486cf9b3f7",
"openSSLVersion": "OpenSSL 1.1.1m 14 Dec 2021",
"modules": [],
"allocator": "tcmalloc",
"environment": {
"distmod": "debian10",
"distarch": "x86_64",
"target_arch": "x86_64"
}
}
I start with a completely fresh, default-configured mongod instance:
$ mkdir /tmp/db && mongod --dbpath /tmp/db
Then I run the following commands:
use test
db.ttldemo.insertOne({
created_at: Date(Date.now() + 60_000)
})
db.ttldemo.createIndex({created_at: 1}, {expireAfterSeconds: 1})
db.ttldemo.find()
I would expect the document to vanish after some time, but even after waiting a few minutes and running find() again, the document is still present.
Is there anything I am missing here?
db.adminCommand({getParameter:1, ttlMonitorSleepSecs: 1}) yields 60 to me
I tried in Mongo 4.2.17; your insertion of documents yields a document like:
{
"_id" : ObjectId("61f17ea34844b5f0505d80ea"),
"created_at" : "Wed Jan 26 2022 19:02:27 GMT+0200 (GTB Standard Time)"
}
whereas
db.getCollection('ttldemo').insertOne({created_at: new Date()})
yields a document like:
{
"_id" : ObjectId("61f17f404844b5f0505d80ec"),
"created_at" : ISODate("2022-01-26T17:05:04.023Z")
}
Note the created_at field: in the first case it is a string, while in the second case it is an ISODate.
As the documentation states, this index works on dates only:
To create a TTL index, use the createIndex() method on a field whose value is either a date or an array that contains date values, and specify the expireAfterSeconds option with the desired TTL value in seconds.
The solution here is to construct the date as follows:
db.getCollection('ttldemo').insertOne({created_at: new Date(new Date().getTime()+60000)})
This will create a new ISODate based on the current date plus 60 seconds.
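For completeness, a minimal shell sketch of the corrected sequence (the 60-second offset mirrors the question; keep in mind the TTL monitor only wakes up roughly once every 60 seconds by default, so documents disappear some time after they expire, not immediately):
// Store a real BSON date (ISODate), not a string.
db.ttldemo.insertOne({ created_at: new Date(Date.now() + 60000) })
// Expire documents about 1 second after the time stored in created_at.
db.ttldemo.createIndex({ created_at: 1 }, { expireAfterSeconds: 1 })
// Verify the index exists and that the field really is a Date.
db.ttldemo.getIndexes()
db.ttldemo.findOne().created_at instanceof Date   // should print true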

Importing nested JSON documents into Elasticsearch and making them searchable

We have a MongoDB collection which we want to import into Elasticsearch (for now as a one-off effort). To this end, we have exported the collection with mongoexport. It is a huge JSON file with entries like the following:
{
"RefData" : {
"DebtInstrmAttrbts" : {
"NmnlValPerUnit" : "2000",
"IntrstRate" : {
"Fxd" : "3.1415"
},
"MtrtyDt" : "2020-01-01",
"TtlIssdNmnlAmt" : "200000000",
"DebtSnrty" : "SNDB"
},
"TradgVnRltdAttrbts" : {
"IssrReq" : "false",
"Id" : "BMTF",
"FrstTradDt" : "2019-04-01T12:34:56.789"
},
"TechAttrbts" : {
"PblctnPrd" : {
"FrDt" : "2019-04-04"
},
"RlvntCmptntAuthrty" : "GB"
},
"FinInstrmGnlAttrbts" : {
"ClssfctnTp" : "DBFNXX",
"ShrtNm" : "AVGO 3.625 10/16/24 c24 (URegS)",
"FullNm" : "AVGO 3 5/8 10/15/24 BOND",
"NtnlCcy" : "USD",
"Id" : "USU1109MAXXX",
"CmmdtyDerivInd" : "false"
},
"Issr" : "549300WV6GIDOZJTVXXX"
}
}
We are using the following Logstash configuration file to import this data set into Elasticsearch:
input {
  file {
    path => "/home/elastic/FIRDS.json"
    start_position => "beginning"
    sincedb_path => "/dev/null"
    codec => json
  }
}
filter {
  mutate {
    remove_field => [ "_id", "path", "host" ]
  }
}
output {
  elasticsearch {
    hosts => [ "localhost:9200" ]
    index => "firds"
  }
}
All this works fine: the data ends up in the index firds of Elasticsearch, and a GET /firds/_search returns all the entries within the _source field.
We understand that this field is not indexed and thus is not searchable, and searchability is exactly what we are after. We want to make all of the entries within the original nested JSON searchable in Elasticsearch.
We assume that we have to adjust the filter {} part of our Logstash configuration, but how? For consistency reasons, it would not be bad to keep the original nested JSON structure, but that is not a must. Flattening would also be an option, so that e.g.
"RefData" : {
"DebtInstrmAttrbts" : {
"NmnlValPerUnit" : "2000" ...
becomes a single key-value pair "RefData.DebtInstrmAttrbts.NmnlValPerUnit" : "2000".
It would be great if we could do that immediately with Logstash, without using an additional Python script operating on the JSON file we exported from MongoDB.
EDIT: Workaround
Our current work-around is to (1) dump the MongoDB database to dump.json and then (2) flatten it with jq using the following expression, and finally (3) manually import it into Elastic
ad (2): This is the flattening step:
jq '. as $in | reduce leaf_paths as $path ({}; . + { ($path | join(".")): $in | getpath($path) }) | del(."_id.$oid") '
-c dump.json > flattened.json
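Applied to the sample document shown above, the flattened file then contains one flat key-value pair per leaf, for example (abbreviated):
{
  "RefData.DebtInstrmAttrbts.NmnlValPerUnit": "2000",
  "RefData.DebtInstrmAttrbts.IntrstRate.Fxd": "3.1415",
  "RefData.TradgVnRltdAttrbts.FrstTradDt": "2019-04-01T12:34:56.789",
  "RefData.TechAttrbts.PblctnPrd.FrDt": "2019-04-04",
  "RefData.FinInstrmGnlAttrbts.NtnlCcy": "USD",
  "RefData.Issr": "549300WV6GIDOZJTVXXX"
}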
References
Walker Rowe: ElasticSearch Nested Queries: How to Search for Embedded Documents
ElasticSearch search in document and in dynamic nested document
Mapping for Nested JSON document in Elasticsearch
Logstash - import nested JSON into Elasticsearch
Remark for the curious: The JSON shown is a (modified) entry from the Financial Instruments Reference Database System (FIRDS), available from the European Securities and Markets Authority (ESMA), which is a European financial regulatory agency overseeing the capital markets.

elasticsearch: Import data from sql using c#

Can anyone give me some directions/examples on how to import about 100 million rows from SQL Server to Elasticsearch using C#?
Currently I'm using the NEST client in C#, but it is very slow (5k-10k per minute); the slowness seems to come more from the app side than from ES.
Appreciate any help.
You can use IndexMany, but if you want to index only one table, I think you can try the JDBC plugin. After installation, you can simply execute a .bat script to index your table.
@echo off
set DIR=%~dp0
set LIB=%DIR%..\lib\*
set BIN=%DIR%..\bin
REM ???
echo {^
"type" : "jdbc",^
"jdbc" : {^
"url" : "jdbc:sqlserver://localhost:25488;instanceName=SQLEXPRESS;databaseName=AdventureWorks2014",^
"user" : "hintdesk",^
"password" : "123456",^
"sql" : "SELECT BusinessEntityID as _id, BusinessEntityID, Title, FirstName, MiddleName, LastName FROM Person.Person",^
"treat_binary_as_string" : true,^
"elasticsearch" : {^
"cluster" : "elasticsearch",^
"host" : "localhost",^
"port" : 9200^
},^
"index" : "person",^
"type" : "person"^
}^
}^ | "%JAVA_HOME%\bin\java" -cp "%LIB%" -Dlog4j.configurationFile="%BIN%\log4j2.xml" "org.xbib.tools.Runner" "org.xbib.tools.JDBCImporter"

elasticsearch jdbc river polling--- load data from mysql repeatedly

When using https://github.com/jprante/elasticsearch-river-jdbc I notice that the following curl statement successfully indexes data the first time. However, the river fails to repeatedly poll the database for updates.
To restate, when I run the following, the river successfully connects to MySQL, runs the query successfully, indexes the results, but never runs the query again.
curl -XPUT '127.0.0.1:9200/_river/projects_river/_meta' -d '{
"type" : "jdbc",
"index" : {
"index" : "test_projects",
"type" : "project",
"bulk_size" : 100,
"max_bulk_requests" : 1,
"autocommit": true
},
"jdbc" : {
"driver" : "com.mysql.jdbc.Driver",
"poll" : "1m",
"strategy" : "simple",
"url" : "jdbc:mysql://localhost:3306/test",
"user" : "root",
"sql" : "SELECT name, updated_at from projects p where p.updated_at > date_sub(now(),interval 1 minute)"
}
}'
Tailing the log, I see:
[2013-09-27 16:32:24,482][INFO ][org.elasticsearch.river.jdbc.strategy.simple.SimpleRiverFlow] next run, waiting 1m
[2013-09-27 16:33:24,488][INFO ][org.elasticsearch.river.jdbc.strategy.simple.SimpleRiverFlow] next run, waiting 1m
[2013-09-27 16:34:24,494][INFO ][org.elasticsearch.river.jdbc.strategy.simple.SimpleRiverFlow] next run, waiting 1m
But the index stays empty. I am running on a MacBook Pro with Elasticsearch version stable 0.90.2, HEAD, and mysql-connector-java-5.1.25-bin.jar in the river plugins directory.
I think if you switch your strategy value from "simple" to "poll" you may get what you are looking for; it has worked for me with JDBC on that version of Elasticsearch against MS SQL.
Also, you will need to select a field as _id (SELECT primarykey AS _id), as this is used by the Elasticsearch river to determine which records are added/deleted/updated.
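For illustration only, the river definition from the question might look like this with those two changes applied (assuming the projects table has an id primary key column; untested against your setup):
curl -XPUT '127.0.0.1:9200/_river/projects_river/_meta' -d '{
  "type" : "jdbc",
  "index" : {
    "index" : "test_projects",
    "type" : "project",
    "bulk_size" : 100,
    "max_bulk_requests" : 1,
    "autocommit": true
  },
  "jdbc" : {
    "driver" : "com.mysql.jdbc.Driver",
    "poll" : "1m",
    "strategy" : "poll",
    "url" : "jdbc:mysql://localhost:3306/test",
    "user" : "root",
    "sql" : "SELECT id AS _id, name, updated_at from projects p where p.updated_at > date_sub(now(),interval 1 minute)"
  }
}'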

No updatedExisting from getLastError in MongoLab

I am running updates against a database in MongoLab (Heroku) and cannot get information from getLastError.
As an example, below are statements to update a collection in a MongoDB database running locally on my machine (db version v2.0.3-rc1).
ariels-MacBook:mongodb ariel$ mongo
MongoDB shell version: 2.0.3-rc1
connecting to: test
> db.mycoll.insert({'key': '1','data': 'somevalue'});
> db.mycoll.find();
{ "_id" : ObjectId("505bcc5783cdc9e90ffcddd8"), "key" : "1", "data" : "somevalue" }
> db.mycoll.update({'key': '1'},{$set: {'data': 'anothervalue'}});
> db.runCommand('getlasterror');
{
"updatedExisting" : true,
"n" : 1,
"connectionId" : 4,
"err" : null,
"ok" : 1
}
>
All is well locally.
Now I switch to a database in MongoLab and run the same statements to update a document, but getLastError does not return an updatedExisting field. Hence, I am unable to tell whether my update was successful or not.
ariels-MacBook:mongodb ariel$ mongo ds0000000.mongolab.com:00000/heroku_app00000 -u someuser -p somepassword
MongoDB shell version: 2.0.3-rc1
connecting to: ds000000.mongolab.com:00000/heroku_app00000
> db.mycoll.insert({'key': '1','data': 'somevalue'});
> db.mycoll.find();
{ "_id" : ObjectId("505bcf9b2421140a6b8490dd"), "key" : "1", "data" : "somevalue" }
> db.mycoll.update({'key': '1'},{$set: {'data': 'anothervalue'}});
> db.runCommand('getlasterror');
{
"n" : 0,
"lastOp" : NumberLong("5790450143685771265"),
"connectionId" : 1097505,
"err" : null,
"ok" : 1
}
> db.mycoll.find();
{ "_id" : ObjectId("505bcf9b2421140a6b8490dd"), "data" : "anothervalue", "key" : "1" }
>
Has anyone run into this?
If it matters, my resource at MongoLab is running mongod v2.0.7 (my shell is 2.0.3).
Not exactly sure what I am missing.
I am waiting to hear from their support (I will post here when I hear back) but wanted to check with you fine folks here as well just in case.
Thank you.
This looks to be a limitation of not having admin privileges to the mongod process. You might file a ticket with 10gen as it doesn't seem like a necessary limitation.
When I run Mongo in auth mode on my laptop I need to authenticate as a user in the admin database in order to see an "n" other than 0 or the "updatedExisting" field. When I authenticate as a user in any other database I get similar results to what you're seeing in MongoLab production.
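To make the difference concrete, here is a rough shell sketch of what is described above (the credentials are hypothetical placeholders):
// Authenticated as a user in the admin database, getLastError reports updatedExisting.
db.getSiblingDB("admin").auth("adminUser", "adminPassword")   // hypothetical credentials
var testDb = db.getSiblingDB("test")
testDb.mycoll.update({ key: "1" }, { $set: { data: "anothervalue" } })
testDb.runCommand("getlasterror")   // includes "updatedExisting" : true and "n" : 1
// Authenticated only as a user in a regular (non-admin) database, as on MongoLab,
// the same call comes back without "updatedExisting" and with "n" : 0.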
(Full disclosure: I work for MongoLab. As a side note, I don't see the support ticket you mention in our system. We'd be happy to work with you directly if you'd like. You can reach us at support@mongolab.com or http://support.mongolab.com.)