I have a bunch of HTML pages. The idea is to let users enter keywords to search through these pages. Only the pages that match the criteria will be stored for later reference. I know that Elasticsearch can index HTML, PDF, and more, but I already have PostgreSQL as my database and my system is small enough that I don't want Elasticsearch as an extra dependency for this project.
I have a few questions:
Because a page won't be stored unless it matches the user's keywords, is there a better approach than indexing the HTML in the search engine first, searching it, and then removing it if it doesn't match the criteria?
Is it possible to index the whole HTML content, as in Elasticsearch?
Thanks a lot for your help!
Related
I am working on a website where users can create entries in a database. These entries all have the same form, so I have been using SQLite (with FTS5) so far, which is the only database I know ;)
However, it is crucial that users can search these entries on the website. The full-text search works decently well (the users know approximately what they are looking for), but I need two improvements:
Misspelled queries should still return the correct entry (I have seen the spellfix extension for SQLite for that, but I don't know how well it works).
More importantly, when a user enters a few query words on the website, I try to MATCH them with an SQL query. If the user enters too many words, it returns 0 matches.
For example, if a user types "sars covid 19" into the search bar:
CREATE VIRTUAL TABLE TEST USING FTS5(SomeText);
INSERT INTO TEST(SomeText) VALUES
('Covid 19');
SELECT SomeText
FROM TEST
WHERE SomeText MATCH 'sars covid 19';
=> 0 matches, but I would want it to return the 'covid 19' entry.
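The query above returns nothing because FTS5 treats whitespace-separated words as an implicit AND, so every word has to be present. A minimal sketch of one possible workaround, assuming the application builds the MATCH string itself: join the user's words with OR, each quoted as an FTS5 string. The helper name toFts5OrQuery is hypothetical and not part of SQLite.

```javascript
// Hypothetical helper: turn free-form user input into an FTS5 query in
// which the words are combined with OR instead of the implicit AND.
function toFts5OrQuery(input) {
  return input
    .split(/\s+/)                                   // split on whitespace
    .filter((w) => w.length > 0)                    // drop empty pieces
    .map((w) => '"' + w.replace(/"/g, '""') + '"')  // quote as an FTS5 string
    .join(" OR ");
}

// The result is meant to be bound as a parameter, for example:
//   SELECT SomeText FROM TEST WHERE SomeText MATCH ? ORDER BY rank;
// An empty result means there is nothing to search for, so the caller
// should skip the query in that case.
const query = toFts5OrQuery("sars covid 19");
```

With OR, an entry only needs to contain one of the words; ordering by rank puts the entries that match more of the words first.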
I realise that SQLite might be too simplistic for a database that needs to handle searches well. Now my question is: do Postgres or MongoDB have search engines that include the functionality I need? Or is it worth diving into solutions such as Elasticsearch?
Most articles I found on this are 5-10 years old, so I am wondering what the current state of affairs is regarding search engines in databases. Any hints are greatly appreciated.
The combination of Elasticsearch and MongoDB works well: you index and perform full-text search in Elasticsearch, and you keep the original documents, with some key fields indexed, in MongoDB.
Elasticsearch will work for sure. You only have to think about how you will index your documents, because you can only find them in the way you indexed them. In your context, it seems that the default text field will work with a match query:
https://www.elastic.co/guide/en/elasticsearch/reference/current/full-text-queries.html
MongoDB will work too in this simple case: https://docs.mongodb.com/manual/text-search/ . However, MongoDB does not let you work with tokenizers, so if you later need to upgrade your text search, MongoDB will be limiting.
PostgreSQL could do it using LIKE (it also has built-in full-text search based on tsvector and tsquery), but I am not familiar enough with it. If you have 10k entries, it will be fine for sure; if you expect 1 million, MongoDB or Elasticsearch would be better.
If you have to choose between MongoDB and Elasticsearch, you have to be more specific in your question. For full-text search, Elasticsearch is really nice and has a lot of features; MongoDB gives you some nice database tools too. Sometimes Elasticsearch will be better, sometimes MongoDB; it depends on what you need. If you only want full-text search, Elasticsearch is a must.
I'm using the text search feature and I couldn't find a way to get the stemmed terms in the query. Is there a way to return the list of words in their stemmed form together with the query results, along with the parts of the document that matched? This would help to understand and identify which part of the document matches.
Cheers!
As of MongoDB 2.6, the only meta information about the text search that can be used is a score indicating the strength of the match. You can submit a ticket on the Core Server project to request this feature (I looked and I don't think one exists at the moment).
Some of the fields in my MongoDB documents contain sensitive data, and when I use this data for testing I need to sanitise them.
The data was previously stored in MySQL and I did this with something like REPEAT('x', LENGTH(fieldName)).
I would like to keep the length of the sanitized fields the same as they were and ideally preserve whitespace.
Can anyone suggest a good way to do this in MongoDB?
Update
The sensitive data is stuff like performance review feedback that has been provided for employees, so testers must not see this data when they use the app. I want to preserve the length of the strings and the whitespace so that the layout of the text is similar to what it is in production.
I was wondering if it would be possible to do this using some simple MongoDB operators, but haven't been able to find what I'm looking for.
The application is written in Java and I am using Spring Data. In the case of MySQL, replacing characters with 'x' in the strings in Java and then updating the rows was slow, which is why I resorted to using REPEAT even though I lost the whitespace in the strings.
You can do this from a MongoDB shell:
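// Overwrite myField in every document with a run of "X" of the same length.
db.myColl.find().forEach(function(doc){
    // Array(n + 1).join("X") yields a string of n "X" characters.
    doc.myField = Array(doc.myField.length+1).join("X");
    db.myColl.save(doc);
});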
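The loop above replaces every character, so whitespace is lost just as with REPEAT. A minimal sketch of a whitespace-preserving variant, assuming ordinary JavaScript string handling as available in the shell; the function name maskPreservingWhitespace is hypothetical:

```javascript
// Hypothetical helper: replace every non-whitespace character with "x",
// leaving spaces, tabs and line breaks where they were. The string keeps
// its original length because each matched character is replaced by one "x".
function maskPreservingWhitespace(text) {
  return text.replace(/\S/g, "x");
}

// Example: the layout (word lengths, line breaks) survives, the content does not.
const masked = maskPreservingWhitespace("Great work,\n well done");
```

Inside the forEach loop, the assignment would then read doc.myField = maskPreservingWhitespace(doc.myField).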
I'm pretty new to MongoDB (I have only worked with it on one small project) and I would like your tips on how to organize my documents. My brain is not (yet) NoSQL-formatted.
I have a collection storing all kinds of information and I want to add tags to it. There will be 1-5 tags per document. I want to be able to search by tags (among other things), display all the documents for one or more given tags, and know the number of documents per tag.
What do you think is the best way to approach this simple problem? Should I give tags their own collection, or should I embed them?
How would you do it?
Thanks
Embed the tags in the document. You can search on embedded arrays and you can index them.
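As a rough illustration of what embedding means here, the sketch below uses plain JavaScript on an in-memory array to mimic the results; the shell commands that would do the same against a real collection are given in the comments. The collection name items, the field name tags, and the sample documents are all assumptions.

```javascript
// Hypothetical documents: the tags live in an array inside each document.
const docs = [
  { _id: 1, title: "Intro to indexes", tags: ["mongodb", "indexing"] },
  { _id: 2, title: "Schema design", tags: ["mongodb", "modeling"] },
  { _id: 3, title: "Sphinx basics", tags: ["search"] },
];

// Shell equivalent (index on the embedded array, then a query):
//   db.items.createIndex({ tags: 1 })
//   db.items.find({ tags: { $all: ["mongodb", "modeling"] } })
// Use $in instead of $all to match documents having ANY of the tags.
function findByTags(documents, wanted) {
  return documents.filter((d) => wanted.every((t) => d.tags.includes(t)));
}

// Shell equivalent (number of documents per tag):
//   db.items.aggregate([
//     { $unwind: "$tags" },
//     { $group: { _id: "$tags", count: { $sum: 1 } } }
//   ])
function countByTag(documents) {
  const counts = Object.create(null); // no inherited keys to collide with
  for (const d of documents) {
    for (const t of d.tags) counts[t] = (counts[t] || 0) + 1;
  }
  return counts;
}
```

With only 1-5 tags per document the arrays stay small, so the three requirements in the question can each be served by one query on the same collection.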
I'm using the Sphinx search engine to index all my Intranet documents using tags. With that, I don't have any trouble finding specific documents with one or more tags.
I want to go further with a new feature like the StackOverflow "related tags" feature.
Does anybody know the best way to do this with Sphinx?
Thanks
You run a boolean OR query on all terms in the document you want to find related items for. It can be fairly slow because all documents in the database have to be ranked on similarity, unless you limit the search using ANDed terms. See my text here: https://stackoverflow.com/questions/3121266/efficient-item-similarity-search-using-sphinx
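A minimal sketch of how such an OR query string could be assembled, assuming Sphinx's extended query syntax, where | is the OR operator and double quotes mark a phrase. The function name buildRelatedQuery and the sample tags are hypothetical, and the set of escaped characters is an assumption about which characters act as operators in that syntax.

```javascript
// Hypothetical helper: build an extended-syntax OR query from the tags of
// the document for which related items are wanted.
function buildRelatedQuery(tags) {
  return tags
    // Backslash-escape characters assumed to be query operators.
    .map((t) => t.replace(/[\\()|\-!@~"&\/^$=<]/g, "\\$&"))
    // Quote each tag so a multi-word tag is matched as a phrase.
    .map((t) => '"' + t + '"')
    .join(" | ");
}

// The resulting string would be passed as the full-text query, for example
// inside MATCH('...') in SphinxQL; documents sharing more tags rank higher.
const related = buildRelatedQuery(["sphinx", "full-text", "intranet docs"]);
```

Adding one or two required terms in front of the OR group is one way to apply the ANDed restriction mentioned above and keep the candidate set small.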