Importing nested JSON documents into Elasticsearch and making them searchable - mongodb

We have a MongoDB collection which we want to import into Elasticsearch (for now as a one-off effort). To this end, we have exported the collection with mongoexport. It is a huge JSON file with entries like the following:
{
    "RefData" : {
        "DebtInstrmAttrbts" : {
            "NmnlValPerUnit" : "2000",
            "IntrstRate" : {
                "Fxd" : "3.1415"
            },
            "MtrtyDt" : "2020-01-01",
            "TtlIssdNmnlAmt" : "200000000",
            "DebtSnrty" : "SNDB"
        },
        "TradgVnRltdAttrbts" : {
            "IssrReq" : "false",
            "Id" : "BMTF",
            "FrstTradDt" : "2019-04-01T12:34:56.789"
        },
        "TechAttrbts" : {
            "PblctnPrd" : {
                "FrDt" : "2019-04-04"
            },
            "RlvntCmptntAuthrty" : "GB"
        },
        "FinInstrmGnlAttrbts" : {
            "ClssfctnTp" : "DBFNXX",
            "ShrtNm" : "AVGO 3.625 10/16/24 c24 (URegS)",
            "FullNm" : "AVGO 3 5/8 10/15/24 BOND",
            "NtnlCcy" : "USD",
            "Id" : "USU1109MAXXX",
            "CmmdtyDerivInd" : "false"
        },
        "Issr" : "549300WV6GIDOZJTVXXX"
    }
}
We are using the following Logstash configuration file to import this data set into Elasticsearch:
input {
    file {
        path => "/home/elastic/FIRDS.json"
        start_position => "beginning"
        sincedb_path => "/dev/null"
        codec => json
    }
}
filter {
    mutate {
        remove_field => [ "_id", "path", "host" ]
    }
}
output {
    elasticsearch {
        hosts => [ "localhost:9200" ]
        index => "firds"
    }
}
All this works fine: the data ends up in the firds index of Elasticsearch, and a GET /firds/_search returns all the entries within the _source field.
We understand that this field is not indexed and thus is not searchable, but searchability is exactly what we are after. We want to make all of the entries within the original nested JSON searchable in Elasticsearch.
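For example, we would like a query like the following to return the sample document above (a hypothetical query, just to illustrate the goal):
GET /firds/_search
{
    "query": {
        "match": { "RefData.FinInstrmGnlAttrbts.NtnlCcy": "USD" }
    }
}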
We assume that we have to adjust the filter {} part of our Logstash configuration, but how? For consistency reasons it would be nice to keep the original nested JSON structure, but that is not a must. Flattening would also be an option, so that e.g.
"RefData" : {
"DebtInstrmAttrbts" : {
"NmnlValPerUnit" : "2000" ...
becomes a single key-value pair "RefData.DebtInstrmAttrbts.NmnlValPerUnit" : "2000".
It would be great if we could do that immediately with Logstash, without using an additional Python script operating on the JSON file we exported from MongoDB.
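One direction we have been considering, shown below as an untested sketch (it assumes the json codec has already parsed each line into a nested RefData field), is a ruby filter that flattens the hash into dotted top-level keys:
filter {
    ruby {
        # Recursively turn { "RefData" => { "A" => { "B" => "x" } } }
        # into a top-level field "RefData.A.B" => "x"
        code => '
            flatten = lambda do |prefix, value, out|
                if value.is_a?(Hash)
                    value.each { |k, v| flatten.call("#{prefix}.#{k}", v, out) }
                else
                    out[prefix] = value
                end
            end
            flat = {}
            flatten.call("RefData", event.get("RefData") || {}, flat)
            flat.each { |k, v| event.set(k, v) }
            event.remove("RefData")
        '
    }
}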
EDIT: Workaround
Our current workaround is to (1) dump the MongoDB database to dump.json, then (2) flatten it with jq using the following expression, and finally (3) manually import it into Elasticsearch.
ad (2): This is the flattening step:
jq -c '. as $in | reduce leaf_paths as $path ({}; . + { ($path | join(".")): $in | getpath($path) }) | del(."_id.$oid")' dump.json > flattened.json
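ad (3): For the manual import we wrap each flattened line in a bulk action and POST the result to the _bulk API (a sketch, assuming a local node and the firds index; older Elasticsearch versions also require a _type in the action line, and very large files should be split into chunks first):
awk '{ print "{\"index\":{\"_index\":\"firds\"}}"; print }' flattened.json > bulk.json
curl -s -H 'Content-Type: application/x-ndjson' -XPOST 'localhost:9200/_bulk' --data-binary @bulk.json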
References
Walker Rowe: ElasticSearch Nested Queries: How to Search for Embedded Documents
ElasticSearch search in document and in dynamic nested document
Mapping for Nested JSON document in Elasticsearch
Logstash - import nested JSON into Elasticsearch
Remark for the curious: The shown JSON is a (modified) entry from the Financial Instruments Reference Data System (FIRDS), available from the European Securities and Markets Authority (ESMA), which is a European financial regulatory agency overseeing the capital markets.

Related

Creating mongodb database and populating it with some data via docker-compose

I have the following script with a custom database specified, but I don't see the user database getting created within the GUI (Compass). I only see the 3 default databases (admin, config, local).
I've looked into the linked answer, but I need a specific answer to my question, please.
mongo:
  image: mongo:4.0.10
  container_name: mongo
  restart: always
  environment:
    MONGO_INITDB_ROOT_USERNAME: root
    MONGO_INITDB_ROOT_PASSWORD: mypass
    MONGO_INITDB_DATABASE: mydb
  ports:
    - 27017:27017
    - 27018:27018
    - 27019:27019
The expectation is for a user database to be created, prefilled with some records.
Edit: made some progress, but there are 2 problems.
I added volumes:
mongo:
  image: mongo:4.0.10
  container_name: mongo
  restart: always
  volumes:
    - ./assets:/docker-entrypoint-initdb.d/
1. Files are being ignored
Within the assets folder I have 3 files, and I see in the logs that they are getting ignored:
/usr/local/bin/docker-entrypoint.sh: ignoring /docker-entrypoint-initdb.d/file1.json
/usr/local/bin/docker-entrypoint.sh: ignoring /docker-entrypoint-initdb.d/file2.json
/usr/local/bin/docker-entrypoint.sh: ignoring /docker-entrypoint-initdb.d/file3.json
All my JSON files look like the following (no root array object, i.e. no [] at the root):
{ "_id" : { "$oid" : "5d3a9d423b881e4ca04ae8f0" }, "name" : "Human Resource" }
{ "_id" : { "$oid" : "5d3a9d483b881e4ca04ae8f1" }, "name" : "Sales" }
2. Default database not getting created. The following line is not having any effect:
MONGO_INITDB_DATABASE: mydb
All files with a *.json extension will be ignored; the scripts must be *.js. Look into the documentation on the MongoDB Docker Hub page:
MONGO_INITDB_DATABASE
This variable allows you to specify the name of a database to be used for creation scripts in /docker-entrypoint-initdb.d/*.js (see Initializing a fresh instance below). MongoDB is fundamentally designed for "create on first use", so if you do not insert data with your JavaScript files, then no database is created.
Initializing a fresh instance
When a container is started for the first time it will execute files with extensions .sh and .js that are found in /docker-entrypoint-initdb.d. Files will be executed in alphabetical order. .js files will be executed by mongo using the database specified by the MONGO_INITDB_DATABASE variable, if it is present, or test otherwise. You may also switch databases within the .js script.
You can look into this example: create a folder data and place create_article.js in it (in the example I am passing user as the database name).
db = db.getSiblingDB("user");
db.article.drop();

db.article.save({
    title: "this is my title",
    author: "bob",
    posted: new Date(1079895594000),
    pageViews: 5,
    tags: ["fun", "good", "fun"],
    comments: [
        { author: "joe", text: "this is cool" },
        { author: "sam", text: "this is bad" }
    ],
    other: { foo: 5 }
});

db.article.save({
    title: "this is your title",
    author: "dave",
    posted: new Date(4121381470000),
    pageViews: 7,
    tags: ["fun", "nasty"],
    comments: [
        { author: "barbara", text: "this is interesting" },
        { author: "jenny", text: "i like to play pinball", votes: 10 }
    ],
    other: { bar: 14 }
});

db.article.save({
    title: "this is some other title",
    author: "jane",
    posted: new Date(978239834000),
    pageViews: 6,
    tags: ["nasty", "filthy"],
    comments: [
        { author: "will", text: "i don't like the color" },
        { author: "jenny", text: "can i get that in green?" }
    ],
    other: { bar: 14 }
});
Mount the data directory:
docker run --rm -it --name some-mongo -v /home/data/:/docker-entrypoint-initdb.d/ -e MONGO_INITDB_DATABASE=user -e MONGO_INITDB_ROOT_USERNAME=root -e MONGO_INITDB_ROOT_PASSWORD=mypass mongo:4.0.10
Once the container is created, you will be able to see the DBs.
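If you would rather keep the original .json exports instead of converting them to .js scripts, an alternative is a .sh script in /docker-entrypoint-initdb.d/ that calls mongoimport. A rough sketch (depending on the image version you may need to add authentication flags for the temporary mongod used during initialization):
#!/bin/bash
# import.sh - executed by the entrypoint because of its .sh extension.
# Loads every mounted JSON export (one document per line) into "mydb",
# using the file name as the collection name.
for f in /docker-entrypoint-initdb.d/*.json; do
    mongoimport --db mydb --collection "$(basename "$f" .json)" --file "$f"
done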

logstash-input-mongodb loop on a "restarting error" - Timestamp

I am trying to use the mongodb plugin as an input for Logstash.
Here is my simple configuration:
input {
    mongodb {
        uri => 'mongodb://localhost:27017/testDB'
        placeholder_db_dir => '/Users/TEST/Documents/WORK/ELK_Stack/LogStash/data/'
        collection => 'logCollection_ALL'
        batch_size => 50
    }
}
filter {}
output { stdout {} }
But I'm facing a "loop issue", probably due to a "timestamp" field, and I don't know what to do:
[2018-04-25T12:01:35,998][WARN ][logstash.inputs.mongodb ] MongoDB Input threw an exception, restarting {:exception=>#<TypeError: wrong argument type String (expected LogStash::Timestamp)>}
There is also a DEBUG log:
[2018-04-25T12:01:34.893000 #2900] DEBUG -- : MONGODB | QUERY | namespace=testDB.logCollection_ALL selector={:_id=>{:$gt=>BSON::ObjectId('5ae04f5917e7979b0a000001')}} flags=[:slave_ok] limit=50 skip=0 project=nil | runtime: 39.0000ms
How can I parameterize my Logstash config to get the output in the stdout console?
It's because of the #timestamp field, which has the ISODate data type. You must remove this field from all documents:
db.getCollection('collection1').update({}, {$unset: {"#timestamp": 1}}, {multi: true})
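Afterwards you can check that no document still carries the field, e.g.:
db.getCollection('collection1').find({"#timestamp": {$exists: true}}).count()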

Fiware cygnus: no data have been persisted in mongo DB

I am trying to use Cygnus with MongoDB, but no data has been persisted in the database.
Here is the notification received in Cygnus:
15/07/21 14:48:01 INFO handlers.OrionRestHandler: Starting transaction (1437482681-118-0000000000)
15/07/21 14:48:01 INFO handlers.OrionRestHandler: Received data ({ "subscriptionId" : "55a73819d0c457bb20b1d467", "originator" : "localhost", "contextResponses" : [ { "contextElement" : { "type" : "enocean", "isPattern" : "false", "id" : "enocean:myButtonA", "attributes" : [ { "name" : "ButtonValue", "type" : "", "value" : "ON", "metadatas" : [ { "name" : "TimeInstant", "type" : "ISO8601", "value" : "2015-07-20T21:29:56.509293Z" } ] } ] }, "statusCode" : { "code" : "200", "reasonPhrase" : "OK" } } ]})
15/07/21 14:48:01 INFO handlers.OrionRestHandler: Event put in the channel (id=1454120446, ttl=10)
Here is my agent configuration:
cygnusagent.sources = http-source
cygnusagent.sinks = OrionMongoSink
cygnusagent.channels = mongo-channel
#=============================================
# source configuration
# channel name where to write the notification events
cygnusagent.sources.http-source.channels = mongo-channel
# source class, must not be changed
cygnusagent.sources.http-source.type = org.apache.flume.source.http.HTTPSource
# listening port the Flume source will use for receiving incoming notifications
cygnusagent.sources.http-source.port = 5050
# Flume handler that will parse the notifications, must not be changed
cygnusagent.sources.http-source.handler = com.telefonica.iot.cygnus.handlers.OrionRestHandler
# URL target
cygnusagent.sources.http-source.handler.notification_target = /notify
# Default service (service semantic depends on the persistence sink)
cygnusagent.sources.http-source.handler.default_service = def_serv
# Default service path (service path semantic depends on the persistence sink)
cygnusagent.sources.http-source.handler.default_service_path = def_servpath
# Number of channel re-injection retries before a Flume event is definitely discarded (-1 means infinite retries)
cygnusagent.sources.http-source.handler.events_ttl = 10
# Source interceptors, do not change
cygnusagent.sources.http-source.interceptors = ts gi
# TimestampInterceptor, do not change
cygnusagent.sources.http-source.interceptors.ts.type = timestamp
# GroupingInterceptor, do not change
cygnusagent.sources.http-source.interceptors.gi.type = com.telefonica.iot.cygnus.interceptors.GroupingInterceptor$Builder
# Grouping rules for the GroupingInterceptor, put the right absolute path to the file if necessary
# See the doc/design/interceptors document for more details
cygnusagent.sources.http-source.interceptors.gi.grouping_rules_conf_file = /home/egm_demo/usr/fiware-cygnus/conf/grouping_rules.conf
# ============================================
# OrionMongoSink configuration
# sink class, must not be changed
cygnusagent.sinks.mongo-sink.type = com.telefonica.iot.cygnus.sinks.OrionMongoSink
# channel name from where to read notification events
cygnusagent.sinks.mongo-sink.channel = mongo-channel
# FQDN/IP:port where the MongoDB server runs (standalone case) or comma-separated list of FQDN/IP:port pairs where the MongoDB replica set members run
cygnusagent.sinks.mongo-sink.mongo_hosts = 127.0.0.1:27017
# a valid user in the MongoDB server (or empty if authentication is not enabled in MongoDB)
cygnusagent.sinks.mongo-sink.mongo_username =
# password for the user above (or empty if authentication is not enabled in MongoDB)
cygnusagent.sinks.mongo-sink.mongo_password =
# prefix for the MongoDB databases
#cygnusagent.sinks.mongo-sink.db_prefix = kura
# prefix for the MongoDB collections
#cygnusagent.sinks.mongo-sink.collection_prefix = button
# true if collection names are based on a hash, false for human readable collections
cygnusagent.sinks.mongo-sink.should_hash = false
# ============================================
# mongo-channel configuration
# channel type (must not be changed)
cygnusagent.channels.mongo-channel.type = memory
# capacity of the channel
cygnusagent.channels.mongo-channel.capacity = 1000
# amount of bytes that can be sent per transaction
cygnusagent.channels.mongo-channel.transactionCapacity = 100
Here is my rule:
{
    "grouping_rules": [
        {
            "id": 1,
            "fields": [
                "button"
            ],
            "regex": ".*",
            "destination": "kura",
            "fiware_service_path": "/kuraspath"
        }
    ]
}
Any ideas about what I have missed? Thanks in advance for your help!
This configuration parameter is wrong:
cygnusagent.sinks = OrionMongoSink
According to your configuration, it must be mongo-sink (I mean, you are configuring a Mongo sink named mongo-sink when you configure lines such as cygnusagent.sinks.mongo-sink.type).
In addition, I would recommend not using the grouping rules feature; it is an advanced feature for sending the data to a collection different from the default one, and in a first stage I would stick with the default behaviour. Thus, my recommendation is to leave the path to the file in cygnusagent.sources.http-source.interceptors.gi.grouping_rules_conf_file, but comment out all the JSON within it :)
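In other words, the sink declaration at the top of the agent configuration has to use the same name as the per-sink properties:
cygnusagent.sinks = mongo-sink
cygnusagent.channels = mongo-channel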

mongoid/ mongodb query with time condition is extremely slow

I want to select all the data whose ts (timestamp) is less than a specific time:
last_record = History.where(report_type: /#{params["report_type"]}/).order_by(ts: 1).only(:ts).last
History.where(:ts.lte => last_record.ts)
It seems this query takes a very long time. I don't understand why; is there a quicker way to do this sort of query?
class History
  include Mongoid::Document
  include Mongoid::Timestamps
  include Mongoid::Attributes::Dynamic

  field :report_type, type: String
  field :symbol, type: String
  field :ts, type: Time
end
The query log in the console:
Started GET "/q/com_disagg/last" for 127.0.0.1 at 2015-01-10 10:36:55 +0800
Processing by QueryController#last as HTML
Parameters: {"report_type"=>"com_disagg"}
MOPED: 127.0.0.1:27017 COMMAND database=admin command={:ismaster=>1} runtime: 0.4290ms
...
MOPED: 127.0.0.1:27017 GET_MORE database=cot_development collection=histories limit=0 cursor_id=44966970901 runtime: 349.9560ms
I have set the timestamp as an index, but the query is still extremely slow:
db.system.indexes.find()
{ "v" : 1, "key" : { "ts" : 1 },
"name" : "ts_index",
"ns" : "cot_development.histories" }

elasticsearch jdbc river polling--- load data from mysql repeatedly

When using https://github.com/jprante/elasticsearch-river-jdbc I notice that the following curl statement successfully indexes data the first time. However, the river fails to repeatedly poll the database for updates.
To restate: when I run the following, the river successfully connects to MySQL, runs the query, and indexes the results, but never runs the query again.
curl -XPUT '127.0.0.1:9200/_river/projects_river/_meta' -d '{
    "type" : "jdbc",
    "index" : {
        "index" : "test_projects",
        "type" : "project",
        "bulk_size" : 100,
        "max_bulk_requests" : 1,
        "autocommit" : true
    },
    "jdbc" : {
        "driver" : "com.mysql.jdbc.Driver",
        "poll" : "1m",
        "strategy" : "simple",
        "url" : "jdbc:mysql://localhost:3306/test",
        "user" : "root",
        "sql" : "SELECT name, updated_at from projects p where p.updated_at > date_sub(now(),interval 1 minute)"
    }
}'
Tailing the log, I see:
[2013-09-27 16:32:24,482][INFO ][org.elasticsearch.river.jdbc.strategy.simple.SimpleRiverFlow] next run, waiting 1m
[2013-09-27 16:33:24,488][INFO ][org.elasticsearch.river.jdbc.strategy.simple.SimpleRiverFlow] next run, waiting 1m
[2013-09-27 16:34:24,494][INFO ][org.elasticsearch.river.jdbc.strategy.simple.SimpleRiverFlow] next run, waiting 1m
But the index stays empty. I am running on a MacBook Pro with Elasticsearch version stable 0.90.2, HEAD, and mysql-connector-java-5.1.25-bin.jar in the river plugins directory.
I think if you switch your strategy value from "simple" to "poll" you may get what you are looking for - it has worked for me with jdbc on that version of elasticsearch against MS SQL.
Also, you will need to select a field as _id (select primarykey as _id), as this is used in the Elasticsearch river for determining which records are added/deleted/updated.
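Putting both suggestions together, the river definition would then look roughly like this (a sketch; id is assumed here to be the primary-key column of projects):
curl -XPUT '127.0.0.1:9200/_river/projects_river/_meta' -d '{
    "type" : "jdbc",
    "index" : {
        "index" : "test_projects",
        "type" : "project",
        "bulk_size" : 100,
        "max_bulk_requests" : 1,
        "autocommit" : true
    },
    "jdbc" : {
        "driver" : "com.mysql.jdbc.Driver",
        "poll" : "1m",
        "strategy" : "poll",
        "url" : "jdbc:mysql://localhost:3306/test",
        "user" : "root",
        "sql" : "SELECT id AS _id, name, updated_at FROM projects p WHERE p.updated_at > date_sub(now(), interval 1 minute)"
    }
}'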