Haskell mongodb text search

What is the status of text search with the Haskell MongoDB driver?
There is no 'LIKE' operator in Mongo similar to the SQL variants, so what is the best way to search a collection or the whole db for a particular text string?
I've read some people referencing external tools, but I can also see that some text search was implemented in Mongo version 2.4, which is done through the command interface.
There should not be any problem doing it from the console, but how would I do it from the Haskell driver? I found the 'runCommand' function in the driver APIs, and it looks like it should be possible to send the 'text' command to the server, but the signature shows that it returns only one document, not a list of documents. So how is it done correctly?
How would I efficiently search for a word or a sentence in a collection or db so that it returns a list of documents containing the word? Is it possible to do this without external tools, using the Mongo 'text search' feature? Should it be done at the application level?
Thanks.

The result type already contains the list of documents (those that contain the searched text). Unfortunately, I could not test the query on my running database, but I have used runCommand to run an aggregation (before it was implemented for the Haskell driver). The result document you get for such a query looks something like this:
{ results: [
{ score : ...,
obj : { ... }
},
...
],
... ,
ok : 1
}
The result document has a field results whose value is a list of documents, each with the fields score and obj. So in the end, you can find each matched document behind the obj field of an entry in the results list.
For more details, you should take a look here.
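As a rough illustration of walking that reply shape, here is a minimal sketch in Python (used only for brevity; the same traversal would apply to the single Document that runCommand returns in the Haskell driver). The helper name matched_documents and the sample reply are hypothetical, and the field names are taken from the example reply above.

```python
def matched_documents(reply):
    """Collect the 'obj' value of every entry in the reply's 'results' list."""
    # The reply is a single document; a failed command is assumed to have ok != 1.
    if reply.get("ok") != 1:
        raise ValueError("command failed: %r" % (reply,))
    return [hit["obj"] for hit in reply.get("results", [])]


# Hypothetical reply with the same shape as the example above.
sample_reply = {
    "results": [
        {"score": 1.5, "obj": {"_id": 1, "title": "text search"}},
        {"score": 0.75, "obj": {"_id": 2, "title": "search tips"}},
    ],
    "ok": 1,
}

if __name__ == "__main__":
    for document in matched_documents(sample_reply):
        print(document)
```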

Related

Mongo - query, Embedded document not match except dot notation

I am confused.
Here is the example:
MongoDB Enterprise > db.employee.find()
result:
{"_id":1002,"name":"Jack","address":{"previous":"Cresent Street","current":"234,Bald Hill Street","unit":"MongoDB" } }
I try this:
db.employee.find({address:{previous: "Cresent Street"}})
result: nothing returns
Next I try this:
db.employee.find({"address.previous": "Cresent Street"})
result:
{"_id":1002,"name":"Jack","address":{"previous":"Cresent Street","current":"234,Bald Hill Street","unit":"MongoDB"}}
The question is: what is wrong with this?
I use:
MongoDB shell version v4.2.7
db.version() reports 4.2.6
Debian 10.4
Thanks for your replies.

When you query on embedded/nested documents using dotted field notation,
{"address.previous": "Cresent Street"}
means: find a document that contains an address field that contains a document whose previous field is equal to "Cresent Street".
When you provide a subdocument like
{address:{previous: "Cresent Street"}}
this means to find a document that contains an address field whose content is exactly the document {previous: "Cresent Street"}, with no additional fields. If you provide multiple fields in the subdocument, field order also matters.
Both of these queries are useful in specific scenarios; pick the one that does what you need in your situation.
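To make the difference concrete, the two matching rules can be imitated on plain Python dicts. This is only a simplified model of the semantics described above, not how the server implements them; it ignores arrays, and the helper names matches_dotted and matches_exact are made up for the illustration.

```python
def matches_dotted(doc, path, value):
    """Dot notation: follow the path and compare only the final field."""
    current = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return False
        current = current[key]
    return current == value


def matches_exact(doc, field, subdoc):
    """Subdocument match: same fields, same values, same field order."""
    stored = doc.get(field)
    # Comparing item lists (not dicts) keeps the comparison order-sensitive.
    return isinstance(stored, dict) and list(stored.items()) == list(subdoc.items())


employee = {
    "_id": 1002,
    "name": "Jack",
    "address": {
        "previous": "Cresent Street",
        "current": "234,Bald Hill Street",
        "unit": "MongoDB",
    },
}

if __name__ == "__main__":
    print(matches_dotted(employee, "address.previous", "Cresent Street"))
    print(matches_exact(employee, "address", {"previous": "Cresent Street"}))
```

The first call matches because only the previous field is compared; the second does not, because the stored address has two more fields than the subdocument in the query.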

MongoDB search via index of documents containing JSON

Say I have objects in a MongoDB collection:
{
...
"json" : "{\"things\":[2494090781803658355,5114030115038563045,3035856943768375362,8931213615561493991,7574631742057150605,480863244020297489]}"
}
It's an Azure "MongoDB" so doesn't support all the features, but suppose it does.
This search will find that document:
db.coll.find({"json" : {$regex : "5114030115038563045|8931213615561493991"}})
Of course, it's scanning the whole collection to pull these records out. What's an efficient/faster way to find documents where the list of "things" contains any of a list of "things" in a query? It seems like throwing a search engine like Solr or ElasticSearch at it would solve this, and perhaps using Azure's Data Lake storage would make this more searchable, so I'm considering those options. They're outside the scope of this question though; I'd like to know if there's a Mongo-ish way to search this collection by index.

The only option you have available to you if you're storing a JSON string is to use a text index with a $text operator.
If this document structure isn't set in stone, however, you might consider also separately storing the JSON as a nested subdocument (with the appropriate sanitization, of course). This would allow you to construct an index on json.things, while still storing the JSON string, and allow you to perform a query on e.g. "json.things": {$in: [ "5114030115038563045", "8931213615561493991" ]}. The values in the query must have the same type (string or number) as the values stored in the array.
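A minimal Python sketch of that restructuring, under a few assumptions: the parsed data takes over the json field so that the json.things path from the answer works, the original string moves to a hypothetical json_raw field, and the values are kept as integers. The code only builds the document and the filter as plain dicts; passing them to a driver is left out.

```python
import json


def with_parsed_things(doc):
    """Return a copy of doc that also stores the JSON string as a subdocument."""
    parsed = json.loads(doc["json"])
    updated = dict(doc)
    updated["json_raw"] = doc["json"]  # hypothetical field for the original string
    updated["json"] = {"things": parsed["things"]}
    return updated


def any_of_filter(values):
    """Build a filter that matches documents whose things contain any value."""
    return {"json.things": {"$in": list(values)}}


original = {"json": "{\"things\":[2494090781803658355,5114030115038563045]}"}

if __name__ == "__main__":
    print(with_parsed_things(original))
    print(any_of_filter([5114030115038563045, 8931213615561493991]))
```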

Pymongo: iterate over all documents in the collection

I am using PyMongo and trying to iterate over all (about 10 million) documents in my MongoDB collection, extract just a couple of keys, "name" and "address", and then output them to a .csv file.
I cannot figure out the right syntax to do it with find().forEach()
I was trying workarounds like
cursor = db.myCollection.find({"name": {$regex: REGEX}})
where REGEX would match everything - and it resulted in "Killed".
I also tried
cursor = db.myCollection.find({"name": {"$exist": True}})
but that did not work either.
Any suggestions?

I cannot figure out the right syntax to do it with find().forEach()
cursor.forEach() is not available in Python; it's a JavaScript function. You have to get a cursor and iterate over it. See PyMongo Tutorial: querying for more than one document, where you can do:
for document in db.myCollection.find():
    print(document)  # iterate the cursor
where REGEX would match everything - and it resulted in "Killed".
Unfortunately there is not enough information here to debug why the process was 'Killed'. If you would like to match everything, you can just state:
cursor = db.myCollection.find({"name": {"$regex": ".*"}})
This assumes that the field name contains string values. Using $exists to check whether the field name exists would be preferable to using a regex, though.
The use of the $exists operator in your example above is incorrect: you're missing an s in $exists. Again, unfortunately we don't have enough information on what 'did not work' meant to help debug further.
If you're writing this script for Python exercise, I would recommend to review:
PyMongo Tutorial
MongoDB Tutorial: query documents
You could also enrol in a free online course at MongoDB University for M220P: MongoDB for Python Developers.
However, if you are just trying to accomplish your task of exporting CSV from a collection, you could use MongoDB's mongoexport as an alternative. It supports:
Exporting specific fields via --fields "name,address"
Exporting in CSV via --type "csv"
Exporting specific values with query via --query "..."
See mongoexport usage for more information.

I had no luck with .find().forEach() either, but this should find what you are searching for and then print it.
First find all documents that match what you are searching for:
cursor = db.myCollection.find({"name": {"$regex": REGEX}})
Then iterate over the matches:
for document in cursor:
    print(document.get("name"))

The find() method returns a PyMongo cursor, which is a reference to the result set of a query.
You have to de-reference that reference somehow, for example by iterating over it or converting it to a list.
After that, you will have a better understanding of how to manipulate and manage the cursor.
Try the following for a start:
result = db.collection_name.find()
print(list(result))

I think I get the question, but I believe there's no accurate answer yet. I had the same challenge, which is how I came across this, although I don't know how to output to a .csv file. In my situation I needed the result as JSON. Here's my solution to your question using MongoDB projections:
your_collection = db.myCollection
cursor = list(your_collection.find( { }, {"name": 1, "address": 1}))
The second line returns the result as a list, using the Python list() function.
You can then use jsonify(cursor), or just print(cursor), on that list.
I believe that with the list it should be easier to figure out how to output to a .csv.
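For the missing .csv step, one possible sketch uses the standard csv module. The function below accepts any iterable of dict-like documents, so a PyMongo cursor such as db.myCollection.find({}, {"name": 1, "address": 1}) could be passed in directly and streamed row by row instead of being loaded into a list first; the function name export_csv and the sample rows are made up for the example.

```python
import csv
import io


def export_csv(documents, out, fields=("name", "address")):
    """Write the chosen fields of each document to out as CSV; return the row count."""
    writer = csv.DictWriter(
        out,
        fieldnames=list(fields),
        extrasaction="ignore",  # skip keys such as _id that are not exported
        restval="",             # leave the cell empty when a field is missing
    )
    writer.writeheader()
    count = 0
    for doc in documents:       # works for a list as well as for a cursor
        writer.writerow(doc)
        count += 1
    return count


# Hypothetical documents standing in for the query result.
sample_docs = [
    {"_id": 1, "name": "Jack", "address": "Bald Hill"},
    {"_id": 2, "name": "Ann"},
]

if __name__ == "__main__":
    buffer = io.StringIO()
    export_csv(sample_docs, buffer)
    print(buffer.getvalue())
```

When writing to a real file, open it with newline="" as the csv module documentation recommends, for example open("out.csv", "w", newline="").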

Field's datatype of collection in mongodb

How to get field information of a collection in mongodb.
information I am looking for are
field name
data type

You will need to loop over all the documents and figure out which field names are used, and which types each specific field uses. MongoDB does not have a schema, so there is no shortcut to fetch this. Be aware also that each field's value can have a totally different data type from one document to the next, which is another one of MongoDB's strengths.
To figure out some statistics, such as field names, the following script can help:
mr = db.runCommand({
"mapreduce" : "things",
"map" : function() {
for (var key in this) { emit(key, null); }
},
"reduce" : function(key, stuff) { return null; },
"out": "things" + "_keys"
})
Then run distinct on the resulting collection so as to find all the keys:
db[mr.result].distinct("_id");
But there is no way to also include the field types with a Map/Reduce job like this.
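The client-side loop mentioned at the start of this answer can collect the types as well. Here is a minimal Python sketch, assuming the documents arrive as dicts (for example from a PyMongo cursor); it reports Python type names rather than BSON type names and looks only at top-level fields. The function name field_types is made up for the example.

```python
from collections import defaultdict


def field_types(documents):
    """Map each top-level field name to the sorted type names seen for it."""
    seen = defaultdict(set)
    for doc in documents:
        for key, value in doc.items():
            seen[key].add(type(value).__name__)
    return {key: sorted(names) for key, names in seen.items()}


# Hypothetical documents with differing shapes and types.
sample_docs = [
    {"_id": 1, "name": "a"},
    {"_id": "x", "tags": []},
]

if __name__ == "__main__":
    for field, types in field_types(sample_docs).items():
        print(field, types)
```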

You can't determine the schema of a collection. Each of the objects in a collection might have a different schema, and you should be aware of this.
I asked a similar question a few months ago; in that post you can find how to retrieve the schema of an object using the Java programming language. However, to the best of my knowledge, there is no way to retrieve the data types other than trying to cast the objects (this is the way the BasicBsonObjects do it).

MongoDB supports dynamic schema, and there is no built-in feature for schema introspection or analysis as of MongoDB 2.4.
However, it is possible to infer the schema by inspecting documents with a Map/Reduce job across either a sample of documents or the entire collection.
There are a few open source tools which package this approach up in a helpful interface, for example:
Schema.js - extends the mongo shell with collection.schema() prototypes
Variety - runs as a standalone script
I like the approach of schema.js, and include it in my ~/.mongorc.js startup file so it is available in my mongo shell sessions.
By default schema.js analyzes up to 50 documents in a collection and returns the results inline. There is a limit option to inspect more (or even all) documents in a collection, and it supports the Map/Reduce out options so results can optionally be saved or merged with an output collection.

MongoDB fulltext search + workaround for partial word match

Since it is not possible to find "blueberry" by searching for the word "blue" with a MongoDB full text search, I want to help my users complete the word "blue" to "blueberry". To do so, is it possible to query all the words in a MongoDB full text index, so that I can use the words as suggestions, e.g. for typeahead.js?

Language stemming in text search uses an algorithm to try to relate words derived from a common base (eg. "running" should match "run"). This is different from the prefix match (eg. "blue" matching "blueberry") that you want to implement for an autocomplete feature.
To most effectively use typeahead.js with MongoDB text search I would suggest focusing on the prefetch support in typeahead:
Create a keywords collection which has the common words (perhaps with usage frequency count) used in your collection. You could create this collection by running a Map/Reduce across the collection you have the text search index on, and keep the word list up to date using a periodic Incremental Map/Reduce as new documents are added.
Have your application generate a JSON document from the keywords collection with the unique keywords (perhaps limited to "popular" keywords based on word frequency to keep the list manageable/relevant).
You can then use the generated keywords JSON for client-side autocomplete with typeahead's prefetch feature:
$('.mysearch .typeahead').typeahead({
name: 'mysearch',
prefetch: '/data/keywords.json'
});
typeahead.js will cache the prefetch JSON data in localStorage for client-side searches. When the search form is submitted, your application can use the server-side MongoDB text search to return the full results in relevance order.
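As a rough sketch of the keyword-generation step, the word counting can also be done in application code instead of Map/Reduce. The Python below is only an illustration under several assumptions: the texts have already been read from the indexed fields, a word is a run of ASCII letters, and the prefetch file is a plain JSON array of strings (check the format your typeahead version expects). The function names and parameters are hypothetical.

```python
import json
import re
from collections import Counter


def keyword_list(texts, limit=1000, min_length=3):
    """Return the most frequent words, most common first."""
    counts = Counter()
    for text in texts:
        for word in re.findall(r"[a-z]+", text.lower()):
            if len(word) >= min_length:  # skip very short words
                counts[word] += 1
    return [word for word, _ in counts.most_common(limit)]


def keywords_json(texts, limit=1000):
    """Serialize the keyword list for a file such as /data/keywords.json."""
    return json.dumps(keyword_list(texts, limit))


# Hypothetical field values from the text-indexed collection.
sample_texts = ["Blueberry pie", "blueberry jam and blue cheese"]

if __name__ == "__main__":
    print(keywords_json(sample_texts, limit=5))
```

Because the list holds full, unstemmed words, a prefix such as "blue" can be completed to "blueberry" on the client, while the submitted search still goes through the server-side text search.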

A simple workaround I am using right now is to break the text into individual characters, stored as a text-indexed array.
Then when you run the $search query you simply break the query up into characters as well.
Please note that this only works for short strings, say shorter than 32 characters; otherwise building the index takes a really long time, so performance drops significantly when inserting new records.

You cannot query for all the words in the index, but you can of course query the original document's fields. The words in the search index are also not always the full words; they are stemmed. So you probably wouldn't find "blueberry" in the index, but just "blueberri".

Don't know if this might be useful to some new people facing this problem.
Depending on the size of your collection and how much RAM you have available, you can make a search by $regex, by creating the proper index. E.g:
db.collection.find( {query : {$regex: /querywords/}}).sort({'criteria': -1}).limit(limit)
You would need an index as follows:
db.collection.ensureIndex( { "query": 1, "criteria" : -1 } )
This could be really fast if you have enough memory.
Hope this helps.

For those who have not yet started implementing any database architecture and are here for a solution, go for Elasticsearch. It's a JSON document-driven database, structurally similar to MongoDB. It has an "edge-ngram" analyzer which is really efficient and quick at giving you "did you mean" suggestions for misspelled searches. You can also search partially.