Using facet.date with stats.facet in solr

I'm using the Stats component in Solr to get faceted statistics, which works very well, and now I'm interested in doing the same for my date fields. But it seems facet.date fields don't work with the stats component; is there a way of getting this to work?
My fallback plan is to add my facets as specific fields (date, year-quarter, year-month, etc.), but this would require heavy re-indexing.

Your best bet might be to store a copy of all date fields in unix time. This way you have integer fields and can run stats on them.
It seems that stats on date fields is in the works though.
https://issues.apache.org/jira/browse/SOLR-1023
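If it helps, here's a minimal sketch of that idea using the pysolr client; the created_ts integer field is hypothetical and would need to exist in your schema alongside the regular date field:

import pysolr  # assumes the pysolr client; any Solr client works the same way
from datetime import datetime, timezone

solr = pysolr.Solr("http://localhost:8983/solr/mycore")  # hypothetical core URL

created = datetime(2011, 3, 15, 9, 30, tzinfo=timezone.utc)
solr.add([{
    "id": "doc-1",
    "created": created.strftime("%Y-%m-%dT%H:%M:%SZ"),  # the normal date field
    "created_ts": int(created.timestamp()),             # unix-time copy for stats
}])

You could then run stats queries (stats=true&stats.field=created_ts, optionally with stats.facet) against the integer copy.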

Related

DB Schema Design

Currently trying to model a DB schema in MongoDB. The bit I'm getting stuck on is where an employee must indicate the times they are available to work.
I.e.
Monday:
9AM-12PM, 2:00PM-6:00PM
Tuesday:
8AM-10AM, 12:00PM-2:00PM, 4:00PM-6:00PM
etc.
I could just have an embedded field in my schema with a list of times, but I'm not sure if that's the best solution to this.
Opinions?
There is no universal rule when it comes to schema design. I would store a list of numerical ranges, where a range's unit is seconds from the beginning of the workweek. This way it would be possible to search directly in Mongo for available personnel in a single query. Date manipulation should not be a problem on a modern platform.
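A rough pymongo sketch of that idea, with made-up collection and field names; a range is stored as seconds from Monday 00:00:

from datetime import timedelta
from pymongo import MongoClient

employees = MongoClient().scheduling.employees  # made-up db/collection names

def week_offset(day, hour, minute=0):
    """Seconds from the start of the workweek (Monday 00:00 = day 0)."""
    return int(timedelta(days=day, hours=hour, minutes=minute).total_seconds())

# Monday 9AM-12PM and 2PM-6PM stored as plain numeric ranges:
employees.insert_one({
    "name": "Alice",
    "availability": [
        {"start": week_offset(0, 9),  "end": week_offset(0, 12)},
        {"start": week_offset(0, 14), "end": week_offset(0, 18)},
    ],
})

# One query finds everyone free for the whole slot Monday 10AM-11AM:
available = employees.find({"availability": {"$elemMatch": {
    "start": {"$lte": week_offset(0, 10)},
    "end": {"$gte": week_offset(0, 11)},
}}})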

Multiple index search in redis

I have a series of text data (tweets) which needs to be indexed on 3 attributes. I wanted to use Redis for this, as the response time has to be fast. Can anyone suggest how to go about it? Or should I go with MongoDB?
In most cases, with Redis you'll need to maintain an index for each attribute you want to search on. Here's a simple example - let's say that you store your tweets in hashes, e.g.:
HMSET tweet:<id> text <tweet text> time <timestamp> ...
To create an index on your tweets' timestamps, you'll need to maintain a sorted set with the timestamp as score and the tweet's id as the value:
ZADD _tweet:time <timestamp> <id>
This will allow you to search for certain tweets in a given time period with ZRANGEBYSCORE.
Note that you'll also have to take care of maintaining the index yourself (on modifications and deletions), and you'd need to repeat this approach for any additional indices. If you're looking for more material, here are some slides on the subject: http://www.slideshare.net/itamarhaber/20140922-redis-tlv-redis-indices
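For illustration, a small redis-py sketch of keeping the hash and the index in sync (key names follow the example above; the helper functions are made up):

import time
import redis  # assumes the redis-py client

r = redis.Redis()

def add_tweet(tweet_id, text, timestamp=None):
    """Store the tweet hash and update the time index in one round trip."""
    timestamp = timestamp or int(time.time())
    pipe = r.pipeline()
    pipe.hset(f"tweet:{tweet_id}", mapping={"text": text, "time": timestamp})
    pipe.zadd("_tweet:time", {tweet_id: timestamp})
    pipe.execute()

def delete_tweet(tweet_id):
    """Deleting the hash must also remove the id from every index."""
    pipe = r.pipeline()
    pipe.delete(f"tweet:{tweet_id}")
    pipe.zrem("_tweet:time", tweet_id)
    pipe.execute()

def tweets_between(start_ts, end_ts):
    ids = r.zrangebyscore("_tweet:time", start_ts, end_ts)
    return [r.hgetall(f"tweet:{i.decode()}") for i in ids]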
I'm a little bit confused about the term "index"... If you want to search in (JSON) data, I think ElasticSearch (http://www.elasticsearch.org/) would be a much better choice.
If your use case is mostly about geo information, have a look at:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-geo-point-type.html
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-geo-distance-filter.html
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-bucket-geohashgrid-aggregation.html
I think building this in Redis would give you a hard time. Don't get me wrong, I love Redis and I'm a huge advocate, but I think it's the wrong tool for what you apparently want to build.
There's even a plugin for ElasticSearch which gathers the tweets for you:
https://github.com/elasticsearch/elasticsearch-river-twitter

Is MongoDB a good fit for this?

The system I'm building is essentially an issue tracking system, but with various issue templates. Some issue types will have different formats than others.
I was originally planning on using MySQL with a main issues table and an issues_meta table that contains key => value pairs. However, I'm thinking NoSQL (MongoDB) might be the better option.
Can MongoDB provide me with the ability to generate "standard" reports, like # of issues by type, # of issues by type by month, # of issues assigned per person, etc.? I ask this because I've read a few sources that said Mongo was bad at reporting.
I'm also planning on storing my audit logs in Mongo, since I want a single "table" for all actions (modifications to any table). In Mongo I can easily store each field that was changed, since it is schemaless. Is this a bad idea?
Anything else I should know, and will Mongo work for what I want?
I think MongoDB will be a perfect match for that use case.
MongoDB collections are heterogeneous, meaning you can store documents with different fields in the same collection, so your different issue templates won't be a show-stopper. You will be able to model a full issue as a single document.
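For the reporting question specifically, a hedged pymongo sketch (field names are illustrative; this assumes the aggregation framework available since MongoDB 2.2):

from datetime import datetime
from pymongo import MongoClient

issues = MongoClient().tracker.issues  # made-up db/collection names

# Documents from different issue templates coexist in one collection:
issues.insert_many([
    {"type": "bug", "created": datetime(2012, 3, 1), "assignee": "alice",
     "severity": "high"},
    {"type": "feature", "created": datetime(2012, 3, 7), "assignee": "bob",
     "spec_url": "http://example.com/spec"},
])

# "# of issues by type by month" as a single aggregation:
report = issues.aggregate([
    {"$group": {
        "_id": {"type": "$type",
                "year": {"$year": "$created"},
                "month": {"$month": "$created"}},
        "count": {"$sum": 1},
    }},
    {"$sort": {"count": -1}},
])
for row in report:
    print(row)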
MongoDB would be a good fit for logging too. You may be interested in capped collections.
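For the audit-log case, a minimal sketch of creating a capped collection with pymongo (names and size are illustrative):

from pymongo import MongoClient

db = MongoClient().tracker  # made-up database name

# A capped collection preserves insertion order and silently drops the
# oldest entries once the size limit is hit - a natural fit for audit logs.
db.create_collection("audit_log", capped=True, size=100 * 1024 * 1024)
db.audit_log.insert_one({
    "collection": "issues",
    "doc_id": 42,
    "changed": {"status": ["open", "closed"]},  # illustrative change format
})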
Should you need relational associations between documents, you can have those too.
If you are using Ruby, I can recommend Mongoid. It will make things easier, and it also has support for versioning of documents.
MongoDB will definitely work (and you can use capped collections to automatically drop old records, if you want), but you should ask yourself whether it fits this task well. For the use case you've described, Redis (simple and fast enough) or Riak (if you care a lot about your log data) may be better options.

advanced searching mongodb using mongomapper, sunspot/solr or sphinx?

I am using MongoDB with MongoMapper to store all my products. Each product belongs to multiple categories that have many levels, i.e. category, sub-category, etc.
Each product has many search fields that are embedded documents in the product.
All this is working and I now want to add search to the app.
The search system needs full-text search plus multiple, dynamic, faceted searches, including min/max range search.
I have been looking into the Sunspot gem but am having difficulty setting it up in development, let alone running it in production! I have also looked at Sphinx.
But I am wondering whether just MongoMapper/MongoDB would be quick enough, and the best way to go, given that it's quite a complex search system.
Any help / suggestions / experiences / tutorials and examples on this would be most appreciated.
Thanks a lot,
Rick
I've been involved with a very large Sphinx-powered search and I think it's awful: very difficult to configure if you want anything past a very simple full-text search. Solr/Lucene, on the other hand, is incredibly flexible and was unbelievably easier to set up and get running.
I am now using Solr in conjunction with MongoDB to power full-text search with all the extra goodies, like facets, etc. Depending on how you configure Solr, you may not even need to hit MongoDB for data. Or you may tell Solr to index fields but not store them, and instead just store the ObjectIds that correspond to the data inside MongoDB.
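As a rough illustration of that setup, assuming the pysolr and pymongo clients and made-up core/collection names:

import pysolr
from bson import ObjectId
from pymongo import MongoClient

solr = pysolr.Solr("http://localhost:8983/solr/products")  # made-up core
products = MongoClient().shop.products                     # made-up collection

# Ask Solr for facet counts and a price range, returning only stored ids:
results = solr.search("name:camera", **{
    "fq": "price:[100 TO 500]",
    "facet": "true",
    "facet.field": "category",
    "fl": "id",
})
print(results.facets)  # facet counts come straight back from Solr

# Hit MongoDB only for the full documents behind the matching ids:
docs = products.find({"_id": {"$in": [ObjectId(r["id"]) for r in results]}})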
If your search truly is a complex search system, I very strongly recommend that you do not use MongoDB for search and go with Solr instead. One big reason is that MongoDB doesn't have a full-text search feature - instead, it has regular expression matching. Regex matches work wonderfully but will only use indexes in certain cases.
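To make that index caveat concrete, a small pymongo illustration (collection and field names are hypothetical):

from pymongo import MongoClient

products = MongoClient().shop.products  # made-up collection
products.create_index("name")

# A left-anchored, case-sensitive regex can walk the index on "name"...
fast = products.find({"name": {"$regex": "^canon"}})

# ...while an unanchored or case-insensitive one scans every document,
# which is what makes regex-based "search" hard to scale.
slow = products.find({"name": {"$regex": "canon", "$options": "i"}})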

Best practices for combining Lucene.NET and a relational database?

I'm working on a project where I will have a LOT of data, and it will be searchable via several forms that are very efficiently expressed as SQL queries, but it also needs to be searchable via natural language processing.
My plan is to build an index using Lucene for this form of search.
My question is: if I do this and perform a search, Lucene will return the IDs of matching documents in the index, and I then have to look up these entities in the relational database.
This could be done in two ways (that I can think of so far):
Run N individual queries (horrible).
Pass all the IDs to a stored procedure at once (perhaps as a comma-delimited parameter). This has the downside of being limited by the maximum parameter size, and by the slow performance of a UDF splitting the string into a temporary table.
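A middle ground worth considering is batching: one parameterized IN query per chunk of IDs. A sketch, using sqlite3 purely as a stand-in for your relational store (table and column names are made up):

import sqlite3  # stand-in for any relational store; the batching is the point

def fetch_by_ids(conn, ids, batch_size=500):
    """Resolve Lucene hits with one parameterized IN query per batch,
    avoiding both N single-row queries and an oversized parameter list."""
    rows = []
    for i in range(0, len(ids), batch_size):
        batch = ids[i:i + batch_size]
        placeholders = ",".join("?" * len(batch))
        rows.extend(conn.execute(
            f"SELECT id, title, body FROM documents WHERE id IN ({placeholders})",
            batch,
        ).fetchall())
    return rows

On SQL Server 2008 and later, a table-valued parameter would also avoid the string-splitting UDF entirely.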
I'm almost tempted to mirror everything into Lucene's index, so that I can periodically regenerate the index from the backing store but only need to access the index for the frontend.
Advice?
I would store the 'frontend' data inside the index itself, avoiding any DB interaction. The DB would be queried only when you want more information on a specific record.
When I encountered this problem I went with a relational database that has full-text search capabilities (I used PostgreSQL 8.3, which has built-in full-text support with stemming and thesaurus support). This way the database can be queried using both SQL and full-text constructs. The downside is that you need a DB with full-text search capabilities, and those capabilities might be inferior to what Lucene can do.
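For example, with PostgreSQL's built-in full-text search (sketched here via psycopg2; table and column names are made up), one query can mix an ordinary SQL predicate with a full-text match:

import psycopg2  # assumes the psycopg2 driver; the SQL is the interesting part

conn = psycopg2.connect("dbname=app")  # hypothetical database
cur = conn.cursor()

cur.execute("""
    SELECT id, title
    FROM documents
    WHERE created > %s
      AND to_tsvector('english', body) @@ plainto_tsquery('english', %s)
""", ("2011-01-01", "natural language search"))
rows = cur.fetchall()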
I guess the answer depends on what you are going to do with the results. If you are going to display them in a grid and let the user choose the exact document they want to access, then you may want to add enough text to the index to help the user identify the document, like a blurb of say 200 characters, and then hit the DB to retrieve the whole thing once the user selects a document.
This will impact the size of your index for sure, so that is another consideration you need to keep in mind. I would also put a cache between the DB and the front end so that the most used items will not incur the full cost of a DB access every time.
Probably not an option depending on how much data is in your database, but what I have done is store the DB IDs in the search index along with the properties I wanted indexed. Then, in my service classes, I cache all the data needed to display search results for all the objects (e.g., name, DB ID, image URLs, description blurbs, social media info). The service class returns a Dictionary that can look up objects by DB ID, and I use the IDs returned by Lucene.NET to pull data from the in-memory cache.
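A Python sketch of that pattern (the original is C#/Lucene.NET, and every name here is hypothetical):

class ProductSearch:
    def __init__(self, db, searcher):
        self.db = db              # relational store
        self.searcher = searcher  # wrapper around the Lucene index
        self._cache = {}          # db id -> fields needed on result pages
        self.refresh_cache()

    def refresh_cache(self):
        """Rebuild the id -> summary map; run this every few hours."""
        self._cache = {
            row["id"]: {"name": row["name"], "image_url": row["image_url"],
                        "blurb": row["blurb"]}
            for row in self.db.all_summaries()
        }

    def search(self, query):
        ids = self.searcher.search_ids(query)  # only ids live in the index
        return [self._cache[i] for i in ids if i in self._cache]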
You could also forego the in-memory cache and store all the necessary properties for displaying a search result in the search index. I didn't do this because the in-memory cache is also used in scenarios other than search.
The in-memory cache is always fresh to within a few hours, and the only time I have to hit the db is if I need to pull more detailed data for a single object (if the user clicks on the link for a specific object to go to the page for that object).