Solr: How to Search by Time *AND* Distance

We are working on an app that uses Solr to search by distance. A Solr consultant wrote the original code but is no longer available, so I, a Solr newbie, am trying to take this over. The current index insertion code looks like this:
{add:
  {doc:
    {id: <my_id>,
     category: <my_type>,
     resourcename: <private_flag>,
     store: <my_latlng>
    },
   overwrite: true,
   commitWithin: <commit_time>
  }
}
And the query below properly returns all the documents near (mylat, mylng):
localhost:8983/solr/select?wt=json&q=category:"<mytype>"&fl=id&fq={!bbox}&sfield=store&pt=<mylat>,<mylng>&d=<my_distance>&rows=200
All was well in paradise. Now we need to add a time dimension: instead of just retrieving nearby docs, we need to retrieve nearby docs within a specific time range (e.g. from 2 days ago, or from 2 months ago). This means adding an "origin_time" field to each document, and then modifying the query to search by time plus distance.
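Here is roughly what I imagine the updated doc and query would look like, though I have no idea whether the syntax is right (the origin_time field name, the timestamp format, and the NOW-2DAYS date math are pure guesses on my part):
{add:
  {doc:
    {id: <my_id>,
     category: <my_type>,
     resourcename: <private_flag>,
     store: <my_latlng>,
     origin_time: <my_iso_timestamp>
    },
   overwrite: true,
   commitWithin: <commit_time>
  }
}
and then something like this for the query:
localhost:8983/solr/select?wt=json&q=category:"<mytype>"&fl=id&fq={!bbox}&sfield=store&pt=<mylat>,<mylng>&d=<my_distance>&fq=origin_time:[NOW-2DAYS TO NOW]&rows=200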
Can anyone suggest how I should add this time field to the index and how to adjust the query to search by distance and time?
Thanks!

Related

Firestore 1 global index vs 1 index per query what is better?

I'm working on my app and I just ran into a dilemma regarding the best way to handle indexes for Firestore.
I have a query that searches for publications in a specific community that contain at least one of a set of tags and fall within a geohash range. The index for that query looks like this:
community Ascending tag Ascending location.geohash Ascending
Now, if my user doesn't need to filter by tag, I run the query without the arrayContains(tag) clause, which prompts me to create another index:
community Ascending location.geohash Ascending
My question is: is it better to create that second index, or to just use the first one and specify all possible tags in arrayContains in the query when the user wants no filter on tag?
Neither is inherently better; it's a typical space vs. time tradeoff.
Adding the extra tags in the query adds some overhead there, but it saves you the (storage) cost of the additional index. So you're trading a small amount of runtime performance for a small amount of space/cost savings.
One thing to check is whether the query with tags can actually run on just the second index, as Firestore may be able to do a zigzag merge join. In that case you could keep only the second, smaller index and avoid the runtime cost of adding extra clauses, though you would then see a (similarly small) performance difference on the queries where you do specify one or more tags.
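To make the two query shapes concrete, here is a rough sketch with the Firebase JS SDK (the publications collection name, db, communityId, selectedTag, and the geohash bounds are assumptions; the index layouts are the ones from the question):
import { collection, query, where, orderBy, startAt, endAt, getDocs } from "firebase/firestore";

// Tag filter present: uses the community + tag + location.geohash index
const withTag = query(
  collection(db, "publications"),
  where("community", "==", communityId),
  where("tag", "array-contains", selectedTag),
  orderBy("location.geohash"),
  startAt(geohashStart),
  endAt(geohashEnd)
);

// No tag filter: needs the community + location.geohash index
const withoutTag = query(
  collection(db, "publications"),
  where("community", "==", communityId),
  orderBy("location.geohash"),
  startAt(geohashStart),
  endAt(geohashEnd)
);

const snapshot = await getDocs(withTag);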

MongoDB Geospatial and createdAt sorting

I'm struggling to come up with a way to properly sort data from MongoDB. The collection uses a 2dsphere index and has a createdAt timestamp. The goal is to show the latest pictures (that's what this collection is about, it just has a mediaUrl field...), but they also have to be close to the user. I'm not very familiar with complex MongoDB aggregation queries, so I thought this is a good place to ask. Sorting with $near shows items sorted only by distance. But there's also an upload time: e.g. if an item is 5 minutes fresh but is, say, 500 meters farther away than an older item, it should still be sorted higher.
An ugly way would be to iterate every few hundred meters and collect the data, but maybe there's a smarter way?
So if I understand correctly, you want to be able to sort on 2 fields?
distance
timestamp
You should check out the $sort aggregation stage:
https://docs.mongodb.com/manual/reference/operator/aggregation/sort/
It allows you to sort on multiple fields.
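For example, a rough sketch of what that could look like (the pictures collection name, the userLng/userLat variables, and the 5 km limit are assumptions); $geoNear computes the distance into its own field, so a later $sort can combine it with createdAt:
db.pictures.aggregate([
  {
    $geoNear: {
      near: { type: "Point", coordinates: [userLng, userLat] },
      distanceField: "dist",   // computed distance in meters
      maxDistance: 5000,       // only consider items within ~5 km
      spherical: true
    }
  },
  // newest uploads first, nearest first among equal timestamps
  { $sort: { createdAt: -1, dist: 1 } }
])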

Algolia Custom Ranking on the Fly

I have an index of thousands of music tracks. For searching I want the tracks to be returned by track title ascending.
I also have a created_at field, which is the datetime of when I added the track to the library. Is it OK for me to change the ranking on the fly?
So for my normal artist / title search before the query I would run:
index.setSettings({
  ranking: [
    "asc(title)",
    "asc(artist)"
  ]
});
And then when I want to return the tracks I recently added to the database I would run:
index.setSettings({
  ranking: [
    "desc(created_at)",
    "asc(title)",
    "asc(artist)"
  ]
});
My question is: Is this performant? Are there any down sides for doing this for each query?
Thanks for the advice!
Algolia sorts data at indexing time, not query time. That's why you have to use the setSettings method. If you do it this way, all the data will be re-sorted every time you set new settings.
The solution is to use replicas. They are copies of the master index, each sorted differently.
You can find the relevant doc here:
https://www.algolia.com/doc/guides/relevance/sorting/#multiple-sorting-strategies
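A rough sketch of what that could look like with the JavaScript client (the replica name tracks_by_created_at is just an example, and the ranking arrays are the ones from your question):
// On the primary index: keep the title/artist ranking and declare a replica
index.setSettings({
  ranking: ["asc(title)", "asc(artist)"],
  replicas: ["tracks_by_created_at"]
});

// On the replica: rank by newest first
const recentIndex = client.initIndex("tracks_by_created_at");
recentIndex.setSettings({
  ranking: ["desc(created_at)", "asc(title)", "asc(artist)"]
});

// Search the primary index normally, and the replica when you want recently added tracks first
recentIndex.search("some artist").then(res => console.log(res.hits));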

Best way to structure MongoDB with the following use cases?

Sorry to have to ask this, but I am new to MongoDB (I only have experience with relational databases) and was just curious how you would structure your MongoDB.
The documents will be JSON with some of the following fields:
{
  "url": "http://....",
  "text": "entire ad content including HTML (very long)",
  "body": "text (50-200 characters)",
  "date": "01/01/1990",
  "phone": "8001112222",
  "posting_title": "buy now"
}
Some of the values will be very long strings.
Each document is essentially an ad from a certain city. We are storing all ads for a lot of big cities in the US (about 422). We are storing more ads every day, and the number of ads per city varies from as few as 0 to as many as 2000. The average is probably around 700-900.
We need to do the following types of queries, in almost instant time (if possible):
Get all ads for any specific city, for any specific date range.
Get all ads that were posted by a specific phone number, for any city, for any date range.
What would you recommend? I'm thinking I should have 422 collections - one for each city. I'm just worried about the query time when we query for phone numbers because it needs to go through each collection. I have an iterable list of all collection names.
Or would it be faster to just have one collection so that I don't have to switch through 422 collections?
Thank you so much, everyone. I'm here to answer any questions!
EDIT:
Here is my "iterating through all collections" snippet:
import glob

for name in glob.glob(r"Data\Nov. 12 - 5pm\*"):
    val = name.split("5pm")[1].split(".json")[0][1:]
    coll = db[val]
    # Add into collection here...
MongoDB does not offer any operations that get results from more than one collection, so putting your data in multiple collections is not advisable in this case.
You can considerably speed up all the use cases you mentioned by creating indexes for them. When you have a very large dataset and always query for exact equality, hashed indexes are the fastest.
When you query a range of dates (between day x and day y), you should use the Date type and not strings, because this not only allows you to use lots of handy date operators in aggregation, it also lets you speed up ranged queries and sorts with ascending or descending indexes.
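A rough sketch in the mongo shell of what that could look like, assuming a single ads collection with a city field (as the answer below also suggests) and date stored as a Date:
// Indexes covering the two query patterns
db.ads.createIndex({ city: 1, date: 1 })    // ads for a city within a date range
db.ads.createIndex({ phone: 1, date: 1 })   // ads for a phone number within a date range
db.ads.createIndex({ phone: "hashed" })     // optional: hashed index if you only ever do exact phone lookups

// Example queries these indexes can serve
db.ads.find({ city: "Boston", date: { $gte: ISODate("2015-11-01"), $lt: ISODate("2015-12-01") } })
db.ads.find({ phone: "8001112222", date: { $gte: ISODate("2015-11-01"), $lt: ISODate("2015-12-01") } })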
Maybe I'm missing something, but wouldn't making "city" a field in your JSON solve your problem? That way you only need to do something like this: db.posts.find({ city: { $in: ['Boston', 'Michigan'] } })

Solr: Query for documents whose from-to date range contains the user input

I would like to store and query documents that contain a from-to date range, where the range represents an interval when the document has been valid.
Typical use cases in the Lucene/Solr documentation address the opposite problem: querying for documents that contain a single timestamp, where this timestamp is contained in a date range provided as a query parameter (createdate:[1976-03-06T23:59:59.999Z TO *]).
I want to use the edismax parser.
I have found the ms() function, which seems to me to be designed for boosting score only, not to eliminate non-matching results entirely.
I have found the article Spatial Search Tricks for People Who Don't Have Spatial Data, where the problem described by me is said to be Easy... (Find People Alive On May 25, 1977).
Is there any simpler way to express something like
date_from_query:[valid_from_field TO valid_to_field] than using the spatial approach?
The most direct approach is to create the bounds yourself:
valid_from_field:[* TO date_from_query] AND valid_to_field:[date_from_query TO *]
.. which would give you documents where valid_from_field is earlier than the date you're querying and valid_to_field is later than it; in effect, it selects documents whose interval between valid_from_field and valid_to_field contains the query date. This assumes that neither field is multivalued.
I'd probably add it as a filter query, since you don't need any scoring from it, and you probably want to allow other search queries at the same time.
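For example, plugging in the article's "alive on May 25, 1977" scenario with the field names above (the exact timestamp is just an illustration), the filter query could look like:
fq=valid_from_field:[* TO 1977-05-25T00:00:00Z] AND valid_to_field:[1977-05-25T00:00:00Z TO *]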