Failed importing Cyrillic (UTF-8) JSON file into MongoDB database - mongodb

Question: How do I use the mongoimport.exe utility to import UTF-8 JSON documents with Cyrillic characters into a MongoDB database under MS Windows 10? (The mongoimport.exe online docs do not seem to cover this, or am I missing it?)
Environment: MS Windows 10, MongoDB 3.2.3, mongoimport.exe
Source UTF8 json file:
{
    "PersonnelNumber": "15128",
    "OrderNumber": "765-01",
    "OrderDate": "2011-05-04T00:00:00",
    "JobPosition": "Слесарь по ремонту подвижного состава"
}
Command line:
"C:\Program Files\MongoDB\Server\3.2\bin\mongoimport.exe" --db testOrders --collection orders --drop < "TestOrder.json"
Import log:
2016-03-01T14:47:04.350+0300 connected to: localhost
dropping: testOrders.orders
Failed: error processing document #1: invalid character 'ï' looking for beginning of value
imported 0 documents
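A likely cause (an assumption worth checking against the file's hex dump): the 'ï' in the error is 0xEF, the first byte of the UTF-8 byte order mark (EF BB BF) that Windows editors such as Notepad prepend when saving as "UTF-8". A minimal Node.js sketch of stripping a leading BOM before feeding the file to mongoimport (the file handling and sample content are illustrative only):

```javascript
// Strip a UTF-8 BOM (the three-byte sequence EF BB BF) from the start of a buffer.
function stripBom(buf) {
  const hasBom = buf.length >= 3 &&
    buf[0] === 0xef && buf[1] === 0xbb && buf[2] === 0xbf;
  return hasBom ? buf.slice(3) : buf;
}

// Simulate a file saved as "UTF-8 with BOM" by a Windows editor.
const withBom = Buffer.concat([
  Buffer.from([0xef, 0xbb, 0xbf]),
  Buffer.from('{"JobPosition": "Слесарь"}', 'utf8'),
]);

const clean = stripBom(withBom);
// The cleaned bytes now start with '{' and parse as JSON.
console.log(JSON.parse(clean.toString('utf8')).JobPosition);
// → Слесарь
```

In practice you would read the exported file with fs.readFileSync, strip the BOM, write the result back, and then run mongoimport on the cleaned file.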
Note:
An ANSI version of the test file imports without errors, but the Cyrillic character codes are stored incorrectly in the MongoDB database; here is how they are displayed in the MongoDB shell:
MongoDB shell version: 3.2.3
connecting to: testOrders
> db.orders.findOne()
{
"_id" : ObjectId("56d5838aef35e4f7c03e81bd"),
"PersonnelNumber" : "15128",
"OrderNumber" : "765-01",
"OrderDate" : "2011-05-04T00:00:00",
"JobPosition" : "������� �� ������� ���������� �������"
}
>
Here is a screenshot with the hex codes of the UTF-8 and ANSI files:
When using the C# driver (VS2015) to list the JSON documents imported by the mongoimport.exe utility:
foreach (var order in
    (new MongoClient()).GetDatabase("testOrders")
        .GetCollection<BsonDocument>("orders")
        .FindSync(new BsonDocument()).ToList())
    System.Console.WriteLine("{0}", order.ToString());
the test output also shows the wrong Cyrillic characters:
{
"_id" : ObjectId("56d5838aef35e4f7c03e81bd"),
"PersonnelNumber" : "15128",
"OrderNumber" : "765-01",
"OrderDate" : "2011-05-04T00:00:00",
"JobPosition" : "������� �� ������� ���������� �������"
}
When using the C# driver (VS2015) to insert a BsonDocument from C# and then list both the document imported by the mongoimport.exe utility and the BsonDocument inserted via the C# code:
var collection = (new MongoClient()).GetDatabase("testOrders")
    .GetCollection<BsonDocument>("orders");
var document = new BsonDocument
{
    { "PersonnelNumber", "15128" },
    { "OrderNumber", "765-01" },
    { "OrderDate", "2011-05-04T00:00:00" },
    { "JobPosition", "Слесарь по ремонту подвижного состава" }
};
collection.InsertOne(document);
foreach (var order in collection.FindSync(new BsonDocument()).ToList())
    System.Console.WriteLine("{0}", order.ToString());
the former shows incorrect Cyrillic characters, while the latter shows them correctly:
{
"_id" : ObjectId("56d5838aef35e4f7c03e81bd"),
"PersonnelNumber" : "15128",
"OrderNumber" : "765-01",
"OrderDate" : "2011-05-04T00:00:00",
"JobPosition" : "������� �� ������� ���������� �������"
}
{
"_id" : ObjectId("56d58c5c1ed24820b80b80f6"),
"PersonnelNumber" : "15128",
"OrderNumber" : "765-01",
"OrderDate" : "2011-05-04T00:00:00",
"JobPosition" : "Слесарь по ремонту подвижного состава"
}

Related

Importing nested JSON documents into Elasticsearch and making them searchable

We have a MongoDB collection which we want to import into Elasticsearch (for now as a one-off effort). To this end, we have exported the collection with mongoexport. It is a huge JSON file with entries like the following:
{
    "RefData" : {
        "DebtInstrmAttrbts" : {
            "NmnlValPerUnit" : "2000",
            "IntrstRate" : {
                "Fxd" : "3.1415"
            },
            "MtrtyDt" : "2020-01-01",
            "TtlIssdNmnlAmt" : "200000000",
            "DebtSnrty" : "SNDB"
        },
        "TradgVnRltdAttrbts" : {
            "IssrReq" : "false",
            "Id" : "BMTF",
            "FrstTradDt" : "2019-04-01T12:34:56.789"
        },
        "TechAttrbts" : {
            "PblctnPrd" : {
                "FrDt" : "2019-04-04"
            },
            "RlvntCmptntAuthrty" : "GB"
        },
        "FinInstrmGnlAttrbts" : {
            "ClssfctnTp" : "DBFNXX",
            "ShrtNm" : "AVGO 3.625 10/16/24 c24 (URegS)",
            "FullNm" : "AVGO 3 5/8 10/15/24 BOND",
            "NtnlCcy" : "USD",
            "Id" : "USU1109MAXXX",
            "CmmdtyDerivInd" : "false"
        },
        "Issr" : "549300WV6GIDOZJTVXXX"
    }
}
We are using the following Logstash configuration file to import this data set into Elasticsearch:
input {
  file {
    path => "/home/elastic/FIRDS.json"
    start_position => "beginning"
    sincedb_path => "/dev/null"
    codec => json
  }
}
filter {
  mutate {
    remove_field => [ "_id", "path", "host" ]
  }
}
output {
  elasticsearch {
    hosts => [ "localhost:9200" ]
    index => "firds"
  }
}
All this works fine, the data ends up in the index firds of Elasticsearch, and a GET /firds/_search returns all the entries within the _source field.
We understand that this field is not indexed and thus not searchable, but searchability is exactly what we are after: we want to make all of the entries within the original nested JSON searchable in Elasticsearch.
We assume that we have to adjust the filter {} part of our Logstash configuration, but how? For consistency reasons, it would be nice to keep the original nested JSON structure, but that is not a must. Flattening would also be an option, so that e.g.
"RefData" : {
"DebtInstrmAttrbts" : {
"NmnlValPerUnit" : "2000" ...
becomes a single key-value pair "RefData.DebtInstrmAttrbts.NmnlValPerUnit" : "2000".
It would be great if we could do that immediately with Logstash, without using an additional Python script operating on the JSON file we exported from MongoDB.
EDIT: Workaround
Our current work-around is to (1) dump the MongoDB database to dump.json, (2) flatten it with jq using the expression below, and (3) manually import it into Elasticsearch.
Regarding (2), this is the flattening step:
jq '. as $in | reduce leaf_paths as $path ({}; . + { ($path | join(".")): $in | getpath($path) }) | del(."_id.$oid")' -c dump.json > flattened.json
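The jq expression enumerates every leaf path of each document with leaf_paths and joins the path components with dots. As a sanity check of what that transformation produces, the same flattening can be sketched in plain JavaScript (the flatten helper is our own illustration, not part of any library; unlike jq's leaf_paths, it treats arrays as leaf values):

```javascript
// Recursively flatten a nested object into dotted key-value pairs,
// e.g. { RefData: { Issr: "X" } } becomes { "RefData.Issr": "X" }.
function flatten(obj, prefix = '', out = {}) {
  for (const [key, value] of Object.entries(obj)) {
    const path = prefix ? prefix + '.' + key : key;
    if (value !== null && typeof value === 'object' && !Array.isArray(value)) {
      flatten(value, path, out);
    } else {
      out[path] = value;
    }
  }
  return out;
}

// A trimmed-down version of the FIRDS entry shown above.
const doc = {
  RefData: {
    DebtInstrmAttrbts: { NmnlValPerUnit: '2000' },
    Issr: '549300WV6GIDOZJTVXXX',
  },
};

console.log(flatten(doc));
// → an object with the keys "RefData.DebtInstrmAttrbts.NmnlValPerUnit"
//   and "RefData.Issr"
```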
References
Walker Rowe: ElasticSearch Nested Queries: How to Search for Embedded Documents
ElasticSearch search in document and in dynamic nested document
Mapping for Nested JSON document in Elasticsearch
Logstash - import nested JSON into Elasticsearch
Remark for the curious: The shown JSON is a (modified) entry from the Financial Instruments Reference Database System (FIRDS), available from the European Securities and Markets Authority (ESMA), which is a European financial regulatory agency overseeing the capital markets.

Elasticsearch 6 not allowing multiple types when trying to pipeline with mongo-connector

I am trying to push data from MongoDB 3.6 to Elasticsearch 6.1 using mongo-connector.
My records are:
db.administrators.find({}).pretty()
{
"_id" : ObjectId("5701d81893dc484c812b4fc1"),
"name" : "Test Naupada",
"username" : "adminn",
"ward" : "56a6129f44fc869f215fe3fe",
"password" : "nadmin"
}
rs0:PRIMARY> db.sub_ward_master.find({}).pretty()
{
"_id" : ObjectId("56a6129f44fc869f215fe3fe"),
"wardCode" : "3",
"wardName" : "Naupada",
"wardgeoCodes" : [],
"cityName" : "thane"
}
When I run mongo-connector, I get the following error:
OperationFailed: (u'1 document(s) failed to index.', [{u'index': {u'status': 400, u'_type': u'administrators', u'_index': u'smartjn', u'error': {u'reason': u'Rejecting mapping update to [smartjn] as the final mapping would have more than 1 type: [sub_ward_master, administrators]', u'type': u'illegal_argument_exception'}, u'_id': u'5701d81893dc484c812b4fc1', u'data': {u'username': u'adminn', u'ward': u'56a6129f44fc869f215fe3fe', u'password': u'nadmin', u'name': u'Test Naupada'}}}
Any help, anyone?
Thanks
ES 6 does not allow creating more than one type in a single index.
There's an open issue in the mongo-connector repo to support ES 6. Until that's solved, you should go with ES 5 instead.
You can do it in ES 6 by creating a new index for each document type (i.e., each collection in MongoDB) and using the -g flag to direct it to the new index.
For example:
mongo-connector -m localhost:27017 -t localhost:9200 -d elastic2_doc_manager -n {db}.{collection_name} -g {new_index}.{document_type}
Refer to the mongo-connector wiki.

JSON to CSV in MongoDB

I know this question has been answered... using the following:
print("name,id,email");
db.User.find().forEach(function(user){
print(user.name+","+user._id.valueOf()+","+user.email);
});
But I am facing an issue while reading records whose field names start with a number.
Below is the output:
db.Detail.find({"Comment": /ABCD/,"CreateDt": { "$gte" : ISODate("2015-12-03") }},{'Data.01-WaitQueue.EndTime':1}).limit().pretty()
{
"Data" : {
"01-WaitQueue" : {
"EndTime" : ISODate("2015-12-03T02:39:11Z")
}
},
"_id" : ObjectId("565fab4ea5c75a3c4f000000")
}
When I use forEach to convert to CSV:
db.Detail.find({"Comment": /ABCD/,"CreateDt": { "$gte" : ISODate("2015-12-03") }},{'Data.01-WaitQueue.EndTime':1}).limit().forEach(function(PD) {
print(PD.Data.01-WaitQueue.EndTime +":"+ PD._id);
});
I get the error below:
Fri Dec 4 07:11:38 SyntaxError: missing ) after argument list (shell):1
Can someone please help me in rectifying it?
The exception is thrown because the print line has a syntax error: 01-WaitQueue starts with a digit, so it cannot be accessed with dot notation.
Instead of
print(PD.Data.01-WaitQueue.EndTime +":"+ PD._id);
Try this instead:
print(PD.Data['01-WaitQueue'].EndTime +":"+ PD._id);
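This is standard JavaScript behavior (the mongo shell is a JavaScript interpreter): an identifier cannot start with a digit, so the parser fails at .01-WaitQueue, while bracket notation accepts any string key. A standalone sketch with a made-up document shaped like the query result above (the _id is a plain string stand-in for the real ObjectId):

```javascript
// A stand-in document shaped like the projected query result.
const PD = {
  _id: '565fab4ea5c75a3c4f000000',
  Data: { '01-WaitQueue': { EndTime: '2015-12-03T02:39:11Z' } },
};

// PD.Data.01-WaitQueue.EndTime would be a SyntaxError:
// identifiers may not start with a digit.
// Bracket notation works with any string key:
console.log(PD.Data['01-WaitQueue'].EndTime + ':' + PD._id);
// → 2015-12-03T02:39:11Z:565fab4ea5c75a3c4f000000
```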

Elasticsearch: Import data from SQL using C#

Can anyone give me some directions/examples on how to import about 100 million rows from SQL Server to Elasticsearch using C#?
Currently I'm using the NEST client in C#, but it is very slow (5k-10k rows/minute); the slowness looks like it comes more from the app side than from ES.
Appreciate any help.
You can use IndexMany, but if you want to index only one table, I think you can try the JDBC importer plugin. After installation, you can simply execute a .bat script to index your table.
#echo off
set DIR=%~dp0
set LIB=%DIR%..\lib\*
set BIN=%DIR%..\bin
REM ???
echo {^
"type" : "jdbc",^
"jdbc" : {^
"url" : "jdbc:sqlserver://localhost:25488;instanceName=SQLEXPRESS;databaseName=AdventureWorks2014",^
"user" : "hintdesk",^
"password" : "123456",^
"sql" : "SELECT BusinessEntityID as _id, BusinessEntityID, Title, FirstName, MiddleName, LastName FROM Person.Person",^
"treat_binary_as_string" : true,^
"elasticsearch" : {^
"cluster" : "elasticsearch",^
"host" : "localhost",^
"port" : 9200^
},^
"index" : "person",^
"type" : "person"^
}^
}^ | "%JAVA_HOME%\bin\java" -cp "%LIB%" -Dlog4j.configurationFile="%BIN%\log4j2.xml" "org.xbib.tools.Runner" "org.xbib.tools.JDBCImporter"

No updatedExisting from getLastError in MongoLab

I am running updates against a database in MongoLab (Heroku) and cannot get information from getLastError.
As an example, below are statements to update a collection in a MongoDB database running locally on my machine (db version v2.0.3-rc1).
ariels-MacBook:mongodb ariel$ mongo
MongoDB shell version: 2.0.3-rc1
connecting to: test
> db.mycoll.insert({'key': '1','data': 'somevalue'});
> db.mycoll.find();
{ "_id" : ObjectId("505bcc5783cdc9e90ffcddd8"), "key" : "1", "data" : "somevalue" }
> db.mycoll.update({'key': '1'},{$set: {'data': 'anothervalue'}});
> db.runCommand('getlasterror');
{
"updatedExisting" : true,
"n" : 1,
"connectionId" : 4,
"err" : null,
"ok" : 1
}
>
All is well locally.
Now I switch to a database in MongoLab and run the same statements to update a document. getLastError does not return an updatedExisting field, so I am unable to tell whether my update succeeded.
ariels-MacBook:mongodb ariel$ mongo ds0000000.mongolab.com:00000/heroku_app00000 -u someuser -p somepassword
MongoDB shell version: 2.0.3-rc1
connecting to: ds000000.mongolab.com:00000/heroku_app00000
> db.mycoll.insert({'key': '1','data': 'somevalue'});
> db.mycoll.find();
{ "_id" : ObjectId("505bcf9b2421140a6b8490dd"), "key" : "1", "data" : "somevalue" }
> db.mycoll.update({'key': '1'},{$set: {'data': 'anothervalue'}});
> db.runCommand('getlasterror');
{
"n" : 0,
"lastOp" : NumberLong("5790450143685771265"),
"connectionId" : 1097505,
"err" : null,
"ok" : 1
}
> db.mycoll.find();
{ "_id" : ObjectId("505bcf9b2421140a6b8490dd"), "data" : "anothervalue", "key" : "1" }
>
Did anyone run into this?
If it matters, my resource at MongoLab is running mongod v2.0.7 (my shell is 2.0.3).
Not exactly sure what I am missing.
I am waiting to hear from their support (I will post here when I hear back) but wanted to check with you fine folks here as well just in case.
Thank you.
This looks to be a limitation of not having admin privileges to the mongod process. You might file a ticket with 10gen as it doesn't seem like a necessary limitation.
When I run Mongo in auth mode on my laptop I need to authenticate as a user in the admin database in order to see an "n" other than 0 or the "updatedExisting" field. When I authenticate as a user in any other database I get similar results to what you're seeing in MongoLab production.
(Full disclosure: I work for MongoLab. As a side note, I don't see the support ticket you mention in our system. We'd be happy to work with you directly if you'd like. You can reach us at support@mongolab.com or http://support.mongolab.com)