Elasticsearch (NEST) cannot search document content (PDF, MS Office, txt) - plugins

I am using "Nest" Elastic Search to index my documents. everything is working normally: indexing and retrieve Documents.
I need to search the content of any document type, i installed the "mapper attachment types" plugin, and restarted the elastic search service.
When indexing the document, i convert its content to base64 as requested.
foreach (string file in Directory.GetFiles(@"C:\Lucene\sample"))
{
    var document = new Document
    {
        ID = counter++,
        Name = "Current file " + counter.ToString(),
        Content = Convert.ToBase64String(File.ReadAllBytes(file)),
        IsLatest = true,
        VersionNo = 1,
        FilePath = file,
    };
    // the document is then indexed through the NEST client
}
However, when searching I cannot get any results. Using the URL http://localhost:9200/my_first_index/Node/20?pretty I can retrieve the following document:
{
    "_index" : "my_first_index",
    "_type" : "Node",
    "_id" : "20",
    "_version" : 1,
    "found" : true,
    "_source" : {
        "_id" : 20,
        "name" : "Current file 21",
        "content" : "QXJjaGl0ZWN0aW5nIEFwcGxpY2F0aW9ucyBmb3IgdGhlIFJlYWwgV29ybGQgaW4gLk5FVAydCAtIERlc2lnbiBQYXR0ZXJucyBPbi1SYW",
        "originalNodeID" : 0,
        "versionNo" : 1,
        "isLatest" : true,
        "path" : "C:\\Lucene\\sample\\extensible software downloads.txt"
    }
}
Also, searching on other, non-content fields works normally.
Is there anything I have missed?
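One thing that is commonly missed with the mapper attachments plugin: the field holding the Base64 data must be explicitly mapped with the attachment type, and that mapping must exist before the documents are indexed; otherwise the Base64 string is indexed as ordinary text and content searches find nothing. A minimal mapping sketch, using the index, type, and field names from the question (the exact request shape depends on the Elasticsearch/plugin version):
PUT /my_first_index/Node/_mapping
{
    "Node": {
        "properties": {
            "content": { "type": "attachment" }
        }
    }
}
After applying the mapping and reindexing, a query against the content field should hit the extracted text.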

Related

How to move Embedded Fields out of their embedded document?

Here is an example of one of my JSON docs:
{
    "_id": 1,
    "SongId": 1,
    "Details": {
        "Artist": "Cyndi Lauper",
        "Album": "She's So Unusual",
        "ReleaseYear": 1983
    },
    "SongTitle": "Girls Just Want To Have Fun"
}
How would one write a query to move the location of "Artist" and its value out of the "Details" document, leaving "Album" and "ReleaseYear" still embedded?
In addition to renaming a field, the $rename operator can be used to move fields out of (or into) embedded documents.
When working with fields in embedded documents, you need to use dot notation to refer to the field name.
Assuming a collection named discography, you could move your Details.Artist field using:
db.discography.update(
    { _id: 1 },
    { $rename: { "Details.Artist": "Artist" } }
)
Example result:
> db.discography.findOne({_id: 1})
{
    "_id" : 1,
    "SongId" : 1,
    "Details" : {
        "Album" : "She's So Unusual",
        "ReleaseYear" : 1983
    },
    "SongTitle" : "Girls Just Want To Have Fun",
    "Artist" : "Cyndi Lauper"
}
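Since $rename accepts dot notation on either side, the "(or into)" direction works the same way; a quick sketch that would move the field back into the embedded document:
db.discography.update(
    { _id: 1 },
    { $rename: { "Artist": "Details.Artist" } }
)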

Dictionary style access returning None in MongoDB

I have been trying to read data from a MongoDB collection and then reshape the fetched document into key-value pairs for further use. My code works fine with other collections, but it gives me errors with this one.
1: Here is the snippet of my code that raised the first error:
kpi = sample["kpi"]
for proc in kpi:
volume_used = int(float(kpi[proc]["percent"][:-1]))
volume_free = 100 - volume_used
volume_name = kpi[proc]["folder"]
vol_first = [volume_name, volume_used]
vol_second = [volume_name, volume_free]
data_first.append(vol_first)
data_second.append(vol_second)
value_first.append({"key": "volume used", "values": data_first})
value_first.append({"key": "volume free", "values": data_second})
disk_data.append({
"key": dev["device_name"] + "," + dev["ipaddr"],
"values": value_first
})
print disk_data
The error I got was:
File "stats_server.py", line 1547, in getD3DiskData_columnchart
"key": dev["device_name"] + "," + dev["ipaddr"],
TypeError: unsupported operand type(s) for +: 'float' and 'str'
Then I converted device_name, which was a float, to a string.
2: Modified code:
kpi = sample["kpi"]
for proc in kpi:
volume_used = int(float(kpi[proc]["percent"][:-1]))
volume_free = 100 - volume_used
volume_name = kpi[proc]["folder"]
vol_first = [volume_name, volume_used]
vol_second = [volume_name, volume_free]
data_first.append(vol_first)
data_second.append(vol_second)
value_first.append({"key": "volume used", "values": data_first})
value_first.append({"key": "volume free", "values": data_second})
disk_data.append({
"key": str(dev["device_name"]) + "," + dev["ipaddr"],
"values": value_first
})
print disk_data
Then I started getting this error:
File "stats_server.py", line 1530, in getD3DiskData_columnchart
kpi = sample["kpi"]
TypeError: 'NoneType' object has no attribute '__getitem__'
The line kpi = sample["kpi"] is supposed to read the kpi data from the document fetched from the collection.
The query I used to fetch the data is:
disk_util_coll = db[kpi_meta]
disk_docs = disk_util_coll.find_one()
sample = disk_docs
where kpi_meta is the collection's name.
The kpi field of the document contains the data I need:
"kpi" : {
"none" : {
"usage" : "0",
"folder" : "/run/shm",
"percent" : "0%",
"free" : "246M",
"dev" : "none"
},
"tmpfs" : {
"usage" : "256K",
"folder" : "/run",
"percent" : "1%",
"free" : "99M",
"dev" : "tmpfs"
},
"/dev/sda1" : {
"usage" : "1.2G",
"folder" : "/",
"percent" : "74%",
"free" : "404M",
"dev" : "/dev/sda1"
},
"udev" : {
"usage" : "4.0K",
"folder" : "/dev",
"percent" : "1%",
"free" : "238M",
"dev" : "udev"
}
Any help will be appreciated. Let me know if I should provide anything more from my side.
Thank you
The error message:
TypeError: 'NoneType' object has no attribute '__getitem__'
means that "sample" is None. This means your find_one query didn't return a document. That is, the query didn't match any documents in the collection. Check that find_one() returns a document before trying to access its fields.

Can I utilize indexes when querying by MongoDB subdocument without known field names?

I have a document structure like the following:
{
    "_id": ...,
    "name": "Document name",
    "properties": {
        "prop1": "something",
        "2ndprop": "other_prop",
        "other3": ["tag1", "tag2"]
    }
}
I can't know the actual field names in the properties subdocument (they are supplied by the application's users), so I can't create indexes like properties.prop1. Nor can I know the structure of the field values: they can be a single value, an embedded document, or an array.
Is there any practical way to run performant queries against a collection with this kind of schema design?
One option that came to mind is to add a new field to the document, index it, and store the field names used in each document in this field:
{
    "_id": ...,
    "name": "Document name",
    "properties": {
        "prop1": "something",
        "2ndprop": "other_prop",
        "other3": ["tag1", "tag2"]
    },
    "property_fields": ["prop1", "2ndprop", "other3"]
}
Now I could first query against the property_fields field and then let MongoDB scan the matching documents to check whether properties.prop1 contains the required value (see the sketch below). This is definitely slower, but could be viable.
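A sketch of that two-step filter in the shell (the collection name docs is made up):
db.docs.ensureIndex({ property_fields: 1 })
db.docs.find({ property_fields: "prop1", "properties.prop1": "something" })
The index narrows the candidates to documents that contain the field at all; the second condition is then evaluated on those documents without an index.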
One way of dealing with this is to use a schema like the one below.
{
    "name" : "Document name",
    "properties" : [
        { "k" : "prop1", "v" : "something" },
        { "k" : "2ndprop", "v" : "other_prop" },
        { "k" : "other3", "v" : "tag1" },
        { "k" : "other3", "v" : "tag2" }
    ]
}
Then you can index "properties.k" and "properties.v", for example like this:
db.foo.ensureIndex({"properties.k": 1, "properties.v": 1})
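With that index in place, a lookup by a user-supplied key/value pair can use it; $elemMatch keeps both conditions on the same array element:
db.foo.find({ properties: { $elemMatch: { k: "prop1", v: "something" } } })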

Indexing embedded mongoDB documents (in an array) with Solr

Is there any way to make Solr index embedded MongoDB documents? We can already index top-level keys of a Mongo document via mongo-connector, which pushes the data to Solr.
However, consider a structure like this one, which represents a post:
{
    author: "someone",
    post_text: "some really long text which is already indexed by solr",
    comments: [
        {
            author: "someone else",
            comment_text: "some quite long comment, which I do not know how to index in Solr"
        },
        {
            author: "me",
            comment_text: "another quite long comment, which I do not know how to index in Solr"
        }
    ]
}
This is just an example structure. In our project we handle more complicated structures, and sometimes the text we want to index is nested two or three levels deep.
I believe there is a community of MongoDB + Solr users, so this issue must have been addressed before, but I was unable to find good material covering the problem: whether there is a clean way to handle it, or whether there is no solution yet and workarounds have to be found (maybe you could provide me with one).
For a better understanding, one of our structures has a top-level key whose value is an array of analysis results, one of which holds an array of singular values that are parts of the result. We need to index these values. E.g. (this is not the actual data structure we use):
{
    ...
    Analysis_performed: [
        {
            User_tags: [
                {
                    tag_name: "awesome",
                    tag_score: 180
                },
                {
                    tag_name: "boring",
                    tag_score: 10
                }
            ]
        }
    ]
}
In this case we would need to index on the tag names. It is possible that we have a bad structure for the data we want to store, but we thought hard about it and we think it's quite good. However, even if we switch to less nested information, we will most likely come across at least one situation where we have to index information stored in embedded documents inside an array, and that is this question's main focus. Can we index such data with Solr somehow?
I had a question like this a couple of months ago. My solution is to use a custom doc_manager.
You can use solr_doc_manager (its upsert method) to modify documents before they are posted to Solr. For example, if you have
ACL: {
    Read: [ id1, id2 ... ]
}
you can handle it with something like:
def upsert(self, doc):
    if ("ACL" in doc) and ("Read" in doc["ACL"]):
        # collect the ids into a flat, multivalued field that Solr can store
        doc["ACL.Read"] = []
        for item in doc["ACL"]["Read"]:
            if not isinstance(item, dict):
                id = ObjectId(item)
                doc["ACL.Read"].append(str(id))
    self.solr.add([doc], commit=False)
It adds a new field, ACL.Read. This field is multivalued and stores the list of ids from ACL : { Read: [ ... ] }.
If you do not want to write your own handlers for nested documents, you can try another mongo connector, https://github.com/SelfishInc/solr-mongo-connector. It supports nested documents out of the box.
The official 10gen mongo connector now supports flattening of arrays and indexing of subdocuments.
See https://github.com/10gen-labs/mongo-connector
However, for arrays it does something unpleasant. It would transform this document:
{
    "hashtagEntities" : [
        {
            "start" : "66",
            "end" : "81",
            "text" : "startupweekend"
        },
        {
            "start" : "82",
            "end" : "90",
            "text" : "startup"
        },
        {
            "start" : "91",
            "end" : "100",
            "text" : "startups"
        },
        {
            "start" : "101",
            "end" : "108",
            "text" : "london"
        }
    ]
}
into this:
{
    "hashtagEntities.0.start" : "66",
    "hashtagEntities.0.end" : "81",
    "hashtagEntities.0.text" : "startupweekend",
    "hashtagEntities.1.start" : "82",
    "hashtagEntities.1.end" : "90",
    "hashtagEntities.1.text" : "startup",
    ....
}
The above is very difficult to index in Solr, even more so if you have no stable schema for your documents. We wanted something more like this:
{
    "hashtagEntities.xArray.start": [
        "66",
        "82",
        "91",
        "101"
    ],
    "hashtagEntities.xArray.text": [
        "startupweekend",
        "startup",
        "startups",
        "london"
    ],
    "hashtagEntities.xArray.end": [
        "81",
        "90",
        "100",
        "108"
    ]
}
I have implemented an alternative solr_doc_manager.py.
If you want to use this, just edit the flatten_doc function in your doc_manager to the following, to get this behavior:
def flattened(doc):
    return dict(flattened_kernel(doc, []))

def flattened_kernel(doc, path):
    for k, v in doc.items():
        path.append(k)
        if isinstance(v, dict):
            # recurse into subdocuments, extending the dotted path
            for inner_k, inner_v in flattened_kernel(v, path):
                yield inner_k, inner_v
        elif isinstance(v, list):
            # lists are collapsed into multivalued fields by flattened_list
            for inner_k, inner_v in flattened_list(v, path).items():
                yield inner_k, inner_v
            path.pop()  # remove the "xArray" marker appended by flattened_list
        else:
            yield ".".join(path), v
        path.pop()  # remove k before moving on to the next key

def flattened_list(v, path):
    tem = dict()
    path.append(str("xArray"))
    for li, lv in enumerate(v):
        if isinstance(lv, dict):
            for dk, dv in flattened_kernel(lv, path):
                got = tem.get(dk, list())
                if isinstance(dv, list):
                    got.extend(dv)
                else:
                    got.append(dv)
                tem[dk] = got
        else:
            # plain values in the array are collected under a ".ROOT" key
            got = tem.get(".".join(path) + ".ROOT", list())
            if isinstance(lv, list):
                got.extend(lv)
            else:
                got.append(lv)
            tem[".".join(path) + ".ROOT"] = got
    return tem
If you do not want to lose data from arrays whose elements are not sub-documents, this implementation places such data into an "array.ROOT" attribute. For example, it turns this:
{
    "array" : [
        {
            "innerArray" : [
                {
                    "c" : 1,
                    "d" : 2
                },
                {
                    "ahah" : "asdf"
                },
                42,
                43
            ]
        },
        1,
        2
    ]
}
into:
{
    "array.xArray.ROOT": [
        "1.0",
        "2.0"
    ],
    "array.xArray.innerArray.xArray.ROOT": [
        "42.0",
        "43.0"
    ],
    "array.xArray.innerArray.xArray.c": [
        "1.0"
    ],
    "array.xArray.innerArray.xArray.d": [
        "2.0"
    ],
    "array.xArray.innerArray.xArray.ahah": [
        "asdf"
    ]
}
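For a quick sanity check, the function can be called directly on a small made-up document (Python 2, as in the connector code; key order of the printed dict may vary):
doc = {"a": {"b": 1}, "tags": ["x", "y"]}
print flattened(doc)
# prints {'a.b': 1, 'tags.xArray.ROOT': ['x', 'y']}
# values pass through unchanged here; any stringification, as in the
# example output above, happens elsewhere in the pipeline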
I had the same issue: I wanted to index/store complicated documents in Solr. My approach was to modify the JsonLoader to accept complicated JSON documents with arrays/objects as values.
It stores the object/array and then flattens it and indexes the fields.
E.g., a basic example document:
{
    "titles_json": {"FR": "This is the FR title", "EN": "This is the EN title"},
    "id": 1000003,
    "guid": "3b2f2998-85ac-4a4e-8867-beb551c0b3c6"
}
It will store
titles_json: {
    "FR": "This is the FR title",
    "EN": "This is the EN title"
}
and then index the fields
titles.FR: "This is the FR title"
titles.EN: "This is the EN title"
Not only will you be able to index the child documents, but when you perform a search in Solr you will also receive the original complicated structure of the document that you indexed.
If you want to check the source code, installation, and integration details with your existing Solr, see
http://www.solrfromscratch.com/2014/08/20/embedded-documents-in-solr/
Please note that I have tested this with Solr 4.9.0.
M.

Read and update a MongoDB document with a single call

I have a collection called books.
When a user browses a particular book, I get the book by id.
But I also want to increase the view count by 1 each time the document is read.
I could use two commands: one to read and another to update the views counter by 1.
Is there a way to do this with a single command, like findAndModify?
And how would I use that with the C# driver?
Books:
[
    {
        "_id": "1",
        "title": "Earth Day",
        "author": "John",
        "pages": 212,
        "price": 14.5,
        "views": 1000
    },
    {
        "_id": "2",
        "title": "The last voyage",
        "author": "Bob",
        "pages": 112,
        "price": 10.5,
        "views": 100
    }
]
I have this:
var query = Query.And(Query.EQ("_id", id));
var sortBy = SortBy.Null;
var update = Update.Inc("views", 1);
var result = Books.FindAndModify(query, sortBy, update, true);
But how do I get the matching document back?
EDIT: I got it working with:
return result.GetModifiedDocumentAs<T>();
My question now is: will this call to GetModifiedDocumentAs() hit the database again?
No, it won't hit the database again.
When in doubt about things like this, look at the source. It shows that the GetModifiedDocumentAs method just accesses the resulting doc from the existing Response object and casts it to the requested type.
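Putting the question's pieces together, a sketch against the legacy C# driver (Book and Books are assumed to be the mapped class and the MongoCollection from the question):
public Book ReadAndCountView(string id)
{
    var query = Query.EQ("_id", id);
    var update = Update.Inc("views", 1);
    // returnNew: true makes the server return the document after the increment
    var result = Books.FindAndModify(query, SortBy.Null, update, true);
    // no extra round trip: the modified document is read from the same response
    return result.GetModifiedDocumentAs<Book>();
}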