Complex URL handling concept - MongoDB

I'm currently struggling with a complex URL handling concept question. The application has a product property database table/collection with all the different product types (i.e. categories, colors, manufacturers, materials, etc.).
{_id:1,alias:"mercedes-benz",type:"brand"},
{_id:2,alias:"suv-cars",type:"category"},
{_id:3,alias:"cars",type:"category"},
{_id:4,alias:"toyota",type:"manufacturer"},
{_id:5,alias:"red",type:"color"},
{_id:6,alias:"yellow",type:"color"},
{_id:7,alias:"bmw",type:"manufacturer"},
{_id:8,alias:"leather",type:"material"}
...
Now the mission is to handle URL requests in the style below, in every(!) possible order, to retrieve the included product properties. The only allowed separator is the dash (a settled SEO requirement; some property values can themselves include dashes, which I think is also an important point, e.g. the category "suv-cars" or the manufacturer "mercedes-benz"):
http://www.example.com/{category}-{color}-{manufacturer}-{material}
http://www.example.com/{color}-{manufacturer}
http://www.example.com/{color}-{category}-{material}-{manufacturer}
http://www.example.com/{category}-{color}-nonexistingproperty-{manufacturer}
http://www.example.com/{color}-{category}-{manufacturer}
http://www.example.com/{manufacturer}
http://www.example.com/{manufacturer}-{category}-{color}-{material}
http://www.example.com/{category}
http://www.example.com/{manufacturer}-nonexistingproperty-{category}-{color}-{material}
http://www.example.com/{color}-crap-{manufacturer}
...
...so: every order of the properties should be allowed! The result has to be the information about the properties used per URL request (BTW, yes, the duplicate content will be handled by redirects and a predefined schema). The "nonexistingproperty"/"crap" parts are possible and should simply be ignored.
UPDATE:
Idea 1: One way I'm thinking about the question is to split the query string by dashes and analyze the values one by one. The problem: for properties made up of two, three or more words, there are too many different combinations and variations, so a lot of queries would be needed, which kills this idea, I think.
Idea 2: The other way is to build a (in my opinion) far too large alias/URL table with all of the different combinations, but I think that's just an ugly workaround. There are about 15,000 different properties, so the number of aliases across the different sort orders kills this idea.
Idea 3: It's your turn! Thanks for your thoughts and your time.

While your question is a bit broad, below are some ideas. There isn't a single great answer unless you find a free or commercial engine for this that works exactly the way you want.
The way I thought about your problem was to consider the URL as a list of keywords.
use Lucene as a keyword/tag system. It's good at the types of searches you suggest you want, including phrases, stems, etc.
store and index the data in the DB of your choice, but pull the keywords into memory and build a bit index of all keywords vs. items. Iterate through the keyword table, producing weighted results. If the order of keywords matters, you'll also need to make a pass through the result set to weight based on word order. These types of searches always need to cap their result set quickly in order to return results quickly.
cache the results like crazy from working matches, and give precedence to results that users seem to click on the most for a given URL.
attack the database by using tag indexes in MongoDB. You'd still need to merge and weight results. Very intensive and not likely a good use of DB resources.
read some of the academic papers on keyword searches. It's a popular topic.
build a table of words that have dashes in them, and normalize/convert those before running your queries
always check for full exact matches first
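Putting the last two ideas together, the greedy longest-match approach can be sketched in plain JavaScript. This is only an illustration under my own assumptions: the alias map is the sample data from the question loaded into memory (15,000 entries is still tiny), parseSlug is a name I made up, and ties are resolved by always preferring the longest matching alias:

```javascript
// Alias -> type lookup, mirroring the product property collection.
// Multi-word aliases ("suv-cars", "mercedes-benz") contain dashes themselves.
const aliases = new Map([
  ["mercedes-benz", "brand"],
  ["suv-cars", "category"],
  ["cars", "category"],
  ["toyota", "manufacturer"],
  ["red", "color"],
  ["yellow", "color"],
  ["bmw", "manufacturer"],
  ["leather", "material"],
]);

// The longest alias (in words) bounds how far ahead we ever need to look.
const maxWords = Math.max(...[...aliases.keys()].map(a => a.split("-").length));

function parseSlug(slug) {
  const words = slug.split("-");
  const found = {};
  let i = 0;
  while (i < words.length) {
    let matched = false;
    // Try the longest candidate first, so "suv-cars" wins over "cars".
    for (let len = Math.min(maxWords, words.length - i); len >= 1; len--) {
      const candidate = words.slice(i, i + len).join("-");
      if (aliases.has(candidate)) {
        found[aliases.get(candidate)] = candidate;
        i += len;
        matched = true;
        break;
      }
    }
    if (!matched) i++; // unknown word ("crap"): skip it
  }
  return found;
}
```

Unknown words simply fall through, and the order of properties in the URL does not matter; if two properties of the same type appear, the later one wins in this sketch.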

The only way this may work is if you restrict all property values to be unique. So you make one set of categories + colors + manufacturers, etc. All values have to be unique. This allows you to determine which property a given value belongs to.
The data structure for this should be fairly simple:
{_id:ValueOfTheProperty, Property:TypeOfProperty}
Here are some possible samples:
{ _id: Red, Property: Color }
{ _id: Green, Property: Color }
{ _id: Boots, Property: Category }
{ _id: Shoes, Property: Category }
...
This way, the order does not matter, and you are able to convert them in a single pass to a map:
{ Color: Red, Category: Boots }
Though, I predict some problems with ambiguous names here.
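A minimal sketch of that single-pass conversion, assuming the whole value-to-property table fits in memory (the props map and toPropertyMap are illustrative names of mine, not part of any library):

```javascript
// Mirror of the {_id: value, Property: type} collection, preloaded in memory.
const props = new Map([
  ["Red", "Color"],
  ["Green", "Color"],
  ["Boots", "Category"],
  ["Shoes", "Category"],
]);

// One pass over the URL values; unknown values are simply ignored.
function toPropertyMap(values) {
  const result = {};
  for (const value of values) {
    const type = props.get(value);
    if (type) result[type] = value;
  }
  return result;
}
```

toPropertyMap(["Red", "Boots"]) then yields { Color: "Red", Category: "Boots" }, regardless of input order.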

Related

Using found set of records as basis for value list

Beginner question. I would like to have a value list display only the records in a found set.
For example, in a law firm database that has two tables, Clients and Cases, I can easily create a value list that displays all cases for clients.
But that is a lot of cases to pick from, and invites user mistakes. I would like the selection from the value list to be restricted to cases matched to a particular client.
I have tried this method https://support.claris.com/s/article/Creating-conditional-Value-Lists-1503692929150?language=en_US and it works up to a point, but it requires too much entry of data and too many tables.
It seems like there ought to be a simpler method using the find function. Any help or ideas would be greatly appreciated.

Firestore query where document id not in array [duplicate]

I'm having a little trouble wrapping my head around how to best structure my (very simple) Firestore app. I have a set of users like this:
users: {
  'A123': {
    'name': 'Adam'
  },
  'B234': {
    'name': 'Bella'
  },
  'C345': {
    'name': 'Charlie'
  }
}
...and each user can 'like' or 'dislike' any number of other users (like Tinder).
I'd like to structure a "likes" table (or Firestore equivalent) so that I can list people who I haven't yet liked or disliked. My initial thought was to create a "likes" object within the user table with boolean values like this:
users: {
  'A123': {
    'name': 'Adam',
    'likedBy': {
      'B234': true,
    },
    'disLikedBy': {
      'C345': true
    }
  },
  'B234': {
    'name': 'Bella'
  },
  'C345': {
    'name': 'Charlie'
  }
}
That way if I am Charlie and I know my ID, I could list users that I haven't yet liked or disliked with:
var usersRef = firebase.firestore().collection('users')
.where('likedBy.C345','==',false)
.where('dislikedBy.C345','==',false)
This doesn't work (everyone gets listed) so I suspect that my approach is wrong, especially the '==false' part. Could someone please point me in the right direction of how to structure this? As a bonus extra question, what happens if somebody changes their name? Do I need to change all of the embedded "likedBy" data? Or could I use a cloud function to achieve this?
Thanks!
There isn't a perfect solution for this problem, but there are alternatives you can do depending on what trade-offs you want.
The options: Overscan vs Underscan
Remember that Cloud Firestore only allows queries that scale independent of the total size of your dataset.
This can be really helpful in preventing you from building something that works in test with 10 documents, but blows up as soon as you go to production and become popular. Unfortunately, this type of problem doesn't fit that scalable pattern and the more profiles you have, and the more likes people create, the longer it takes to answer the query you want here.
The solution then is to find one or more queries that scale and most closely represent what you want. There are 2 options I can think of that make trade-offs in different ways:
Overscan --> Do a broader query and then filter on the client-side
Underscan --> Do one or more narrower queries that might miss a few results.
Overscan
In the Overscan option, you're basically trading increased cost to get 100% accuracy.
Given your use-case, I imagine this might actually be your best option. Since the total number of profiles is likely orders of magnitude larger than the number of profiles an individual has liked, the increased cost of overscanning is probably inconsequential.
Simply select all profiles that match any other conditions you have, and then on the client side, filter out any that the user has already liked.
First, get all the profiles liked by the user:
var likedUsers = firebase.firestore().collection('users')
  .where('likedBy.C345', '==', true)
Then get all users, checking against the first list and discarding anything that matches.
var allUsers = firebase.firestore().collection('users').get()
Depending on the scale, you'll probably want to optimize the first step, e.g. every time the user likes someone, update an array in a single document for that user for everyone they have liked. This way you can simply get a single document for the first step.
var likedUsers = firebase.firestore().collection('likedUsers')
.doc('C345').get()
Since this query does scale by the size of the result set (by defining the result set to be the data set), Cloud Firestore can answer it without a bunch of hidden unscalable work. The unscalable part is left to you to optimize (with 2 examples above).
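The client-side discard step can be sketched without any Firestore calls; here allUsers stands in for the documents returned by the broad query and likedOrDislikedIds for the IDs gathered in the first step (the names are mine):

```javascript
// Overscan: fetch broadly, then drop everything the user has already
// liked or disliked before rendering.
function discardSeen(allUsers, likedOrDislikedIds) {
  const seen = new Set(likedOrDislikedIds);
  return allUsers.filter(user => !seen.has(user.id));
}
```

Since the liked set is usually far smaller than the full profile set, the wasted reads here are the cost you accept for 100% accuracy.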
Underscan
In the Underscan option, you're basically trading accuracy to get a narrower (hence cheaper) set of results.
This method is more complex, so you probably only want to consider it if for some reason the liked-to-unliked ratio is not as I suspect in the Overscan option.
The basic idea is to exclude someone if you've definitely liked them, and accept the trade-off that you might also exclude someone you haven't yet liked - yes, basically a Bloom filter.
In each user's profile, store a map of true/false values from 0 to m (we'll get to what m is later), where everything is set to false initially.
When a user likes the profile, calculate the hash of the user's ID to insert into the Bloom filter and set all those bits in the map to true.
So let's say C345 hashes to 0110 if we used m = 4, then your map would look like:
likedBy: {
  0: false,
  1: true,
  2: true,
  3: false
}
Now, to find people you definitely haven't liked, you need to use the same concept to do a query against each bit in the map. For any bit 0 to m that your hash is true on, query for it to be false:
var usersRef = firebase.firestore().collection('users')
.where('likedBy.1','==',false)
Etc. (This will get easier when we support OR queries in the future). Anyone who has a false value on a bit where your user's ID hashes to true definitely hasn't been liked by them.
Since it's unlikely you want to display ALL profiles, just enough to display a single page, you can probably randomly select a single one of the ID's hash bits that is true and just query against it. If you run out of profiles, just select another one that was true and restart.
Assuming most profiles are liked 500 or less times, you can keep the false positive ratio to ~20% or less using m = 1675.
There are handy online calculators to help you work out ratios of likes per profile, desired false positive ratio, and m, for example here.
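The mechanics can be sketched in plain JavaScript. Everything here is illustrative: the hash (FNV-1a run with k different seeds) is just one stable choice both writer and reader could agree on, m = 16 is deliberately tiny, and optimalM is the standard Bloom-filter sizing formula, which reproduces the m = 1675 figure above:

```javascript
const m = 16; // bits in the filter (tiny, for demonstration only)
const k = 3;  // number of hash functions

// FNV-1a hash, varied by seed so we get k independent-ish bit positions.
function fnv1a(str, seed) {
  let h = 0x811c9dc5 ^ seed;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

function bitsFor(id) {
  const bits = [];
  for (let seed = 0; seed < k; seed++) bits.push(fnv1a(id, seed) % m);
  return bits;
}

// "likerId likes this profile" -> set those bits in the likedBy map.
function addLike(likedBy, likerId) {
  for (const b of bitsFor(likerId)) likedBy[b] = true;
}

// true = maybe liked (skip it), false = definitely never liked (safe to show).
function maybeLiked(likedBy, likerId) {
  return bitsFor(likerId).every(b => likedBy[b] === true);
}

// Optimal filter size for n insertions and false-positive rate p:
// m = -n * ln(p) / (ln 2)^2; n = 500, p = 0.2 gives ~1675, as above.
function optimalM(n, p) {
  return Math.ceil(-n * Math.log(p) / Math.LN2 ** 2);
}
```

A true from maybeLiked means "possibly liked, skip this profile"; false is a guarantee the profile was never liked, which is exactly the asymmetry the Underscan option exploits.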
Overscan - bonus
You'll quickly realize in the Overscan option that every time you run the query, the same profiles the user didn't like last time will be shown. I'm assuming you don't want that. Worse, all the ones the user liked will appear early in the query results, meaning you'll end up having to skip them all the time, increasing your costs.
There is an easy fix for that, use the method I describe on this question, Firestore: How to get random documents in a collection. This will enable you to pull random profiles from the set, giving you a more even distribution and reducing the chance of stumbling on lots of previously liked profiles.
Underscan - bonus
One problem I suspect you'll have with the Underscan option is really popular profiles. If someone is almost always liked, the Bloom filter for that profile might outgrow its usefulness, since the map would reach a size not reasonable to keep in a single document (you'll want m to be less than, say, 8000 to avoid running into per-document index limits in Cloud Firestore).
For this problem, you want to combine in the Overscan option just for these profiles. Using Cloud Functions, any profile that has more than x% of its map set to true gets a popular flag set to true. Overscan everyone with the popular flag and weave them into your results from the Underscan (remember to apply the client-side discard step).

Why are there two refs when declaring a one-to-many association in Mongoose?

I'm very new to MongoDB; see this one-to-many example.
As per my understanding:
This example says that a person can write many stories, or a story belongs_to a person. I think storing person._id in the stories collection would be enough.
Why does the person collection have the stories field?
cases for fetching data
case 1
Fetch all stories of a person whose id is let us say x
solution: For this, just fire a query on the story collection where author = x
case 2
Fetch the author name of a particular story
solution: For this, we have the author field in the story collection
TL;DR
Put simply: Because there is no such notion as explicit relations in MongoDB.
Mongoose cannot know how you want to resolve the relationship. Will the search start from a given story object, with the author to be found? Or will it be to find all stories for a given author object? So it makes sure that it can resolve the relation in either direction.
Note that there is a problem with that approach, and a big one. Say we are not talking about a one-to-few relation as in this example, but a "One-To-A-Shitload"™ relation. Since BSON documents have a size limit of 16 MB, there is a limit on the number of relations you can manage this way. Quite a few, but it is an artificial limit.
How to solve this: instead of using an ODM, do the modelling properly yourself, since you know your use cases. I will give you an example below.
Detailed
Let us first elaborate your cases a bit:
For a given user (aka "we already have all the data of that user document"), what are his or her stories?
List all stories together with the user name on an overview page.
For a selected ("given") story, what are the authors details?
And just for demonstration purposes: A given user wants to change the name under which a story is displayed, be it his user name or natural name (it happens!) or even pseudonym.
Ok, now let us put Mongoose aside and think about how we could implement this ourselves, keeping in mind that:
Data modelling in MongoDB means deriving your model from the questions that come from your use cases, so that the most common use cases are covered in the most efficient way.
As opposed to RDBMS modelling, where you identify your entities, their properties and relations, and then jump through some hoops to get your questions answered somehow.
So, looking at our user stories, I guess we can agree that 2 is the most common use case, 3 and 1 come next, and 4 is rather rare compared to the others.
So now we can start
Modelling
We model the data involved in our most common use cases first.
So, we want to make the query for stories the most efficient one. And we want to sort the stories in descending order of submission. Simple enough:
{
  _id: new ObjectId(),
  user: "Name to Display",
  story: "long story cut short",
}
Now let's say you want to display your stories, 10 of them:
db.stories.find({}).sort({_id:-1}).limit(10)
No relation, all the data we need, a single query, using the default index on _id for sorting. Since a timestamp is part of the ObjectId, and it is the most significant part, we can use it to sort the stories by time. The question "Hey, but what if somebody changes his or her user name?" usually comes up now. Simple:
db.stories.update({"user":oldname},{$set:{"user":newname}},{multi:true})
Since this is a rare use case, it only has to be doable and does not have to be extremely efficient. However, later we will see that we have to put an index on user anyway.
Talking of authors: Here it really depends on how you want to do it. But I will show you how I tend to model something like that:
{
  _id: "username",
  info1: "foo",
  info2: "bar",
  active: true,
  ...
}
We make use of some properties of _id here: It is a required index with a unique constraint. Just what we want for usernames.
However, it comes with a caveat: _id is immutable. So if somebody wants to change his or her username, we need to copy the original user to a document whose _id is the new username, and change the user property in our stories accordingly. The advantage of doing it this way is that even if the update for changing usernames (see above) fails during its run, each and every story can still be related to a user. If the update is successful, I tend to log the user out and have him log in with the new username again.
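The rename flow just described can be sketched with in-memory arrays standing in for the two collections (with a real database these three steps would be an insert, a multi-document update, and a delete; renameUser is my own name for it):

```javascript
// In-memory sketch of the "rename a user" flow.
function renameUser(authors, stories, oldName, newName) {
  const user = authors.find(a => a._id === oldName);
  if (!user) throw new Error("no such user");
  // 1. Copy the author document under the new _id (it is immutable).
  authors.push({ ...user, _id: newName });
  // 2. Re-point every story at the new name.
  for (const s of stories) if (s.user === oldName) s.user = newName;
  // 3. Only now remove the old document: if step 2 fails midway,
  //    every story still resolves to *some* author document.
  const idx = authors.findIndex(a => a._id === oldName);
  authors.splice(idx, 1);
}
```

The step ordering is the point: the old author document is only deleted once nothing references it any more.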
In case you want to have a distinction between username and displayed name, it all becomes even easier:
{
  _id: "username",
  displayNames: ["Foo B. Baz", "P.S. Eudonym"],
  ...
}
Then you use the display name in your stories, of course.
Now let us see how we can get the user details for a given story. We know the author's name, so it is as easy as:
db.authors.find({"_id":authorNameOfStory})
or
db.authors.find({"displayNames": authorNameOfStory})
Finding all stories for a given user is quite simple, too. It is either:
db.stories.find({"user":idFieldOfUser})
or
db.stories.find({"user":{$in:displayNamesOfUser}})
Now we have all of our use cases covered; we can make them even more efficient with
Indexing
An obvious index is on the story model's user field, so we create it:
db.stories.ensureIndex({"user":1})
If you are happy with the "username as _id" way only, you are done with indexing. Using display names, you obviously need to index them. Since you most likely want display names and pseudonyms to be unique, it is a bit more complicated:
db.authors.ensureIndex({"displayNames":1},{sparse:true, unique:true})
Note: we need to make this a sparse index in order to prevent unnecessary errors when somebody has not decided on a display name or pseudonym yet. Make sure you explicitly add this field to an author document only when a user decides on a display name. Otherwise, it would evaluate to null server-side, which is a valid value, and you would get a constraint violation error, namely "E11000 duplicate key".
Conclusion
We have covered all your use cases with relations handled by the application, thereby simplifying our data model a great deal, and we have the most efficient queries for our most common use cases. Every use case is covered by a single query, taking into account the information we already have at the time we run the query.
Note that there is no artificial limit on how many stories a user can publish since we use implicit relations to our advantage.
As for more complicated queries ("How many stories does each user submit per month?"), use the aggregation framework. That is what it is there for.

SQL Database Design - Flag or New Table?

Some of the Users in my database will also be Practitioners.
This could be represented by either:
an is_practitioner flag in the User table
a separate Practitioner table with a user_id column
It isn't clear to me which approach is better.
Advantages of flag:
fewer tables
only one id per user (hence no possibility of confusion, and also no confusion in which id to use in other tables)
flexibility (I don't have to decide whether fields are Practitioner-only or not)
possible speed advantage for finding User-level information for a practitioner (e.g. e-mail address)
Advantages of new table:
no nulls in the User table
clearer as to what information pertains to practitioners only
speed advantage for finding practitioners
In my case specifically, at the moment, practitioner-related information is generally one-to-many (such as the locations they can work in, or the shifts they can work, etc). I would not be at all surprised if it turns I need to store simple attributes for practitioners (i.e., one-to-one).
Questions
Are there any other considerations?
Is either approach superior?
You might want to consider the fact that someone who is a practitioner today may be something else tomorrow. (And by that I don't mean no longer being a practitioner.) Say, a consultant, an author, or whatever the variants are in your subject domain; you might want to keep track of their latest status in the Users table. So it might make sense to have a ProfType field (type of professional practice) or equivalent. This way, you have all the advantages of having a flag: you could keep it as a string field and leave it as a blank string, or fill it with other profession-type codes as your requirements grow.
You mention that having a new table has the advantage of finding practitioners. No, you are better off with a WHERE clause on the Users table for that.
Your last paragraph (one-to-many), however, may tilt the whole choice in favour of a separate table. You might also want to consider the likely number of records, likely growth, the criticality of complicated queries, etc.
I tried to draw two scenarios, with some notes inside the image. It's really only a draft, just to help you "see" the various entities. Maybe you have already done something like it: in that case, please disregard my answer. As Whirl stated in his last paragraph, you should consider other things too.
Personally I would go for a separate table - as long as you can already identify some extra data that make sense only for a Practitioner (e.g.: full professional title, University, Hospital or any other Entity the Practitioner is associated with).
So in case in the future you discover more data that make sense only for the Practitioner and/or identify another distinct "subtype" of User (e.g. Intern) you can just add fields to the Practitioner subtable, or a new Table for the Intern.
It might be advantageous to use a User Type field as suggested by @Whirl Mind above.
I think that this is just one example of having to identify different type of Objects in your DB, and for that I refer to one of my previous answers here: Designing SQL database to represent OO class hierarchy

Does Algolia have a search with recommendation?

I was wondering if the Algolia service provides some kind of recommendation mechanism when doing a search.
I could not find in the API documentation anything related with providing the client with more refined and intelligent search alternatives based on the index data.
The scenario I am trying to describe is the following (this example is a bit over-the-top):
Given a user is searching for "red car" then the system provides more specific search alternatives and possibly related items that exist in the database (e.g. ferrari, red driving gloves, fast and furious soundtrack :-) )
Update
For future reference, I ended up doing a basic recommendation system using the text search capabilities of Algolia.
Summarising: when a car is saved, its attributes (color, speed, engine, etc.) are used to create synonym indexes; for example, for the engine "ferrari" in the Engine index:
{
  synonyms: ['red', 'ferrari', 'fast'],
  value: 'ferrari'
}
Finally, each index must indicate the synonyms attribute for search, and value as the returned result of a search.
Algolia does not provide that kind of "intelligence" out of the box.
Something you can do to approximate what you're looking for is using synonyms and a combination of other parameters:
Define synonyms groups such as "car,Ferrari,driving gloves", "red,dark red,tangerine,orange", ...
when sending a search query, set optionalWords to the list of words contained in that query. This will make each word of your query optional.
Also set removeStopWords to true so that words like "the", "a" (...) are ignored, to improve relevance.
With a well defined list of synonyms, this will make your original query interpreted as many other possibilities and thus increase the variety of possible results.
Be aware, though, that it could also impact the relevance of your results, as users might for instance not want to see gloves when they search for a car!
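As a hedged sketch, the combination described above would look roughly like this with the Algolia JavaScript client (optionalWords and removeStopWords are real Algolia search parameters; APP_ID, API_KEY and the 'products' index name are placeholders of mine):

```javascript
const query = 'red car';

// Query-time settings suggested above.
const searchParams = {
  optionalWords: query,   // every word of the query becomes optional
  removeStopWords: true,  // ignore "the", "a", ... for better relevance
};

// With a configured client this would be wired up as:
// const index = algoliasearch('APP_ID', 'API_KEY').initIndex('products');
// index.search(query, searchParams).then(({ hits }) => render(hits));
```

Combined with synonym groups defined on the index, a query for "red car" can then also surface "ferrari" or "driving gloves" records.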