Using MongoDB to check existence of a URL in a small crawler

I'm using MongoDB to index URLs in a small crawler. The crawler will handle at most about 500 million URLs. I want to query the URL collection to check whether a URL already exists, but MongoDB is very slow for this query:
db.hosts.find({URL:"http://myhost.com"})
My questions are:
What can I do to improve the search speed in MongoDB?
For my purpose, is Lucene a better fit than MongoDB?

It's fairly well established in the documentation that the way to improve query performance is by adding an index to the field on which you are querying.
There isn't enough information about what you are doing for anyone to tell whether Lucene will be better than MongoDB.
Also, if you are checking whether a URL already exists so that you don't insert a duplicate, then what you want is a unique index on the URL field.
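A minimal sketch of what that unique index buys you, with an in-memory set standing in for the real index (the collection and field names are taken from the question; the shell command in the comment is the standard index-creation call):

```python
# Sketch of the duplicate-check pattern a unique index gives you.
# The mongo shell equivalent (names from the question) would be:
#   db.hosts.createIndex({URL: 1}, {unique: true})
# Here the index is simulated with an in-memory set.

class DuplicateKeyError(Exception):
    """Stands in for the driver error a unique index raises on duplicates."""

class UniqueUrlIndex:
    def __init__(self):
        self._seen = set()   # the "index": O(1) membership test

    def insert(self, url):
        # With a unique index, the database rejects the duplicate for you,
        # so a separate find() round-trip before every insert is unnecessary.
        if url in self._seen:
            raise DuplicateKeyError(url)
        self._seen.add(url)

idx = UniqueUrlIndex()
idx.insert("http://myhost.com")
try:
    idx.insert("http://myhost.com")
    inserted_twice = True
except DuplicateKeyError:
    inserted_twice = False
```

The point of the pattern: instead of a find-then-insert pair per URL, you insert unconditionally and treat the duplicate-key error as "already crawled".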

Related

How to set the size of a hash-table of a hash-index in MongoDB?

I'm using MongoDB to store data for experiments.
Each document represents a web page with a unique ID and a URL. There are approximately 500M documents.
I need to map some data based on a URL string field, so I built a hash-typed index for that field.
I have some performance issues when querying by URL, so I wanted to check the size of the index's hash table and whether I can make it bigger.
I've failed to find any help on this question online or in MongoDB's documentation.
Thanks
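To illustrate the idea behind a hashed index (not MongoDB's actual internals, which use their own internal hash function and are not user-configurable in the way the question hopes): the index stores a fixed-size hash of the field value rather than the long URL string itself, so every key in the index is the same small size.

```python
import hashlib

def url_key(url: str) -> bytes:
    # A hashed index stores a fixed-size hash of the field value instead of
    # the (possibly long) URL string. MongoDB uses its own internal hash
    # function; MD5 here is only an illustration of the idea.
    return hashlib.md5(url.encode("utf-8")).digest()

k1 = url_key("http://example.com/some/very/long/path?with=query&strings=1")
k2 = url_key("http://example.com/other")
```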

Real Time Searching of a Lucene Index that is Updated Frequently - Is this practical?

I have a query that involves many joins across many tables and takes too long to run. I've been asked to try Lucene to speed things up. What I've done is export the query results to XML, parse the XML with Java, index it with Lucene, and create a Java API to query this index. This reduces the query time 6-10 fold.
However, unless a dedicated VM or machine constantly queries the database, exports the data, and reindexes it, any end user who searches the Lucene index through the API will receive stale data. Even with a dedicated machine, the data will not be fully up to date on every search of the Lucene index.
I believe that "delta import" for Solr is what I am talking about. I think that is unique to Solr though, not Lucene.
What options exist for Lucene to index data that will change with some frequency, and allow users to search/query in real time? Is this too much to ask from Lucene?
Solr happens to be a search application built on top of Lucene, so any indexing and searching functionality it provides comes from Lucene.
Lucene supports Near real time search - http://wiki.apache.org/lucene-java/NearRealtimeSearch
As for your indexing concerns, it depends on the app that syncs data between your database and Lucene. Lucene can index at very high throughput: http://people.apache.org/~mikemccand/lucenebench/indexing.html
So your app should be smart enough to figure out which changes were made in the database and re-index only that "delta".
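The delta-sync logic the answer describes can be sketched as follows. The row format, the timestamp bookkeeping, and the dict standing in for the Lucene index are all invented for illustration, not a real Lucene API:

```python
# Minimal sketch of the "delta" sync: the app remembers the last sync
# timestamp and re-indexes only rows modified after it.

def sync_delta(rows, index, last_sync):
    """Re-index only rows whose modified_at is newer than last_sync."""
    newest = last_sync
    for row in rows:
        if row["modified_at"] > last_sync:
            index[row["id"]] = row["text"]        # re-index just this row
            newest = max(newest, row["modified_at"])
    return newest                                  # becomes the next last_sync

index = {}
rows = [
    {"id": 1, "text": "old doc", "modified_at": 100},
    {"id": 2, "text": "new doc", "modified_at": 200},
]
last_sync = sync_delta(rows, index, last_sync=150)
```

Combined with Lucene's near-real-time readers, a loop like this keeps the searchable index seconds (not hours) behind the database without full reindexing.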

Mongodb- embedded vs Indexes

My question is pretty simple. I am building my first application with MongoDB; up until now I have always used SQL. I have read a lot of information about embedding documents versus linking documents.
My question to the MongoDB veterans is: Is there a huge difference in speed/performance between indexed links/queries and embedded docs? If there is, can you please explain why?
Again, I am new to MongoDB and just don't want to get off on the wrong foot. Thank you.
Yes, there is an enormous difference between references and embedded docs.
An embedded document is stored in the document in the same disk location as the rest of the doc's fields, so there's no additional network round-trips or disk seeks to retrieve the embedded document when you query the document as a whole.
DBRefs, on the other hand, are simply the _id of a document in another collection. It will take an additional roundtrip and additional disk seeks to get the "linked" document. See the spec for DBRefs here:
http://www.mongodb.org/display/DOCS/Database+References#DatabaseReferences-DBRef
You should try to optimize your most common query by including in a single document all the info needed to satisfy that query.
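The round-trip difference can be made concrete with a toy model. The collections and the "fetch" counter below are invented for illustration; each dict lookup stands in for one query to the database:

```python
# Contrast the two layouts; "fetches" counts separate lookups needed.

posts_embedded = {
    1: {"title": "hello", "comments": [{"text": "hi"}]},  # comments embedded
}
posts_ref = {1: {"title": "hello", "comment_ids": [10]}}  # DBRef-style link
comments = {10: {"text": "hi"}}

def load_embedded(post_id):
    post = posts_embedded[post_id]      # one fetch: comments come along free
    return post["comments"], 1

def load_referenced(post_id):
    post = posts_ref[post_id]           # first fetch
    cs = [comments[c] for c in post["comment_ids"]]  # one more per reference
    return cs, 1 + len(post["comment_ids"])

emb_comments, emb_fetches = load_embedded(1)
ref_comments, ref_fetches = load_referenced(1)
```

Both layouts return the same data; the referenced layout just pays an extra lookup per linked document, which is exactly the cost the answer's advice about single-document queries avoids.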

Do you need Solr/Lucene for MongoDB, CouchDB and Cassandra?

If you have an RDBMS, you probably have to use Solr to index your relational tables, flattening them into denormalized documents.
I'm new to non-SQL databases like MongoDB, CouchDB, and Cassandra, but it seems to me that the data you save is already in that document structure, like the documents saved in Solr/Lucene.
Does this mean that you don't have to use Solr/Lucene when using these databases?
Is it already indexed so that you can do full-text search?
It depends on your needs. They do offer full-text search; in CouchDB the search is Lucene-based (the same engine behind Solr). Unfortunately, this is just a full-text index: if you need complex scoring or DisMax-type searching, you'll likely want the added capabilities of an independent Solr index.
Solr (Lucene) uses an algorithm to return relevant documents for a query, and it returns a score indicating how relevant each document is to the query.
This is different from what a database (relational or not) does, which is to return results that either match a query or do not.
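A toy contrast of the two behaviors. The documents are invented, and the scoring is plain term frequency, far simpler than Lucene's actual relevance model; it only shows the shape of the difference:

```python
docs = {
    "a": "mongodb stores documents",
    "b": "lucene ranks documents documents documents",
}

def db_match(query):
    # database style: a doc either matches or it doesn't, no ordering implied
    return {d for d, text in docs.items() if query in text.split()}

def ranked_search(query):
    # search-engine style: every match gets a relevance score, best first
    scored = [(text.split().count(query), d) for d, text in docs.items()]
    return [(d, s) for s, d in sorted(scored, reverse=True) if s > 0]

matches = db_match("documents")
ranking = ranked_search("documents")
```

The database answer is an unordered set; the search-engine answer is an ordered list with scores, which is what lets a UI show "best results first".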

Sphinx search engine and related tags

I'm using the Sphinx search engine to index all my intranet documents using tags. With that, I have no trouble finding specific documents with one or more tags.
I want to go further with a new feature like the StackOverflow "related tags" feature.
Does anybody know the best way to do this with Sphinx ?
Thanks
You run a boolean OR query on all terms in the document you want to find related items for. It can be fairly slow because all documents in the database have to be ranked for similarity, unless you limit the search with AND-ed terms. See my answer here: https://stackoverflow.com/questions/3121266/efficient-item-similarity-search-using-sphinx
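The same "related tags" ranking can be computed offline by simple co-occurrence counting; the OR-query approach above is a search-side version of this idea. The tag data below is invented for illustration:

```python
from collections import Counter

# Related tags by co-occurrence: tags appearing together with the selected
# tag on the same document, most frequent first.

docs_tags = [
    {"python", "mongodb"},
    {"python", "lucene"},
    {"python", "mongodb", "indexing"},
]

def related_tags(tag):
    co = Counter()
    for tags in docs_tags:
        if tag in tags:
            co.update(tags - {tag})   # count every co-occurring tag
    return [t for t, _ in co.most_common()]

rel = related_tags("python")
```

For a small, slowly changing tag set, precomputing this table and refreshing it periodically is often cheaper than running a similarity query per page view.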