Is there a way to create and index N 'fields' dynamically with Sphinx? - sphinx

I am using Sphinx (with the Thinking Sphinx v2.0 plugin for RoR).
Let's say I have several indexes on the User model: on 'name', 'address', and on its one-to-many associations such as 'posts' and 'comments'.
This means searching by post content returns the User who made the post, and by using the :fieldmask rank mode of Sphinx I can determine that the user matched because of 'posts'. But a user has many posts, so how do I determine which post matched?
Is there any way to specify the indexed fields dynamically while indexing?
For example, if I could specify 'post_1'='< post1content >', 'post_5'='< post5content >' as different fields for user1, and similarly 'post_2', 'post_7' for user2, then a search would return user2 with post_7 as the matching field...

Sphinx can't have different fields for each record, I'm afraid, so what you're hoping to do isn't possible with that approach.
If you need to know which posts match a query, I'd recommend conducting the search on the Post model instead; you can then refer to each post's user. You could sort by user_id before weight, or group by user_id (so only one post per user is returned). You can also bring user data into the Post index definition (and since a post has one user, that data is kept to single values, instead of many, per record).
Hope this gives you some clarity with your options.
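To illustrate what "group by user_id" gives you, here is a minimal Python sketch that works on hits already returned from a search over posts. The Hit tuple and its field names are hypothetical, not part of Sphinx or Thinking Sphinx; the point is only that each user ends up paired with the specific post that matched best.

```python
from collections import namedtuple

# Hypothetical shape of one hit from a search over the Post index.
Hit = namedtuple("Hit", ["post_id", "user_id", "weight"])

def best_post_per_user(hits):
    """Keep only the highest-weighted post for each user."""
    best = {}
    for hit in hits:
        current = best.get(hit.user_id)
        if current is None or hit.weight > current.weight:
            best[hit.user_id] = hit
    # Strongest matches first.
    return sorted(best.values(), key=lambda h: h.weight, reverse=True)

hits = [
    Hit(post_id=1, user_id=1, weight=10),
    Hit(post_id=5, user_id=1, weight=30),
    Hit(post_id=2, user_id=2, weight=5),
    Hit(post_id=7, user_id=2, weight=40),
]
result = best_post_per_user(hits)
```

In practice you would let Sphinx do the grouping itself; the sketch just shows the shape of the result you get back.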

If you know that you want to search post_5 in one query and post_7 in another, you could use JSON such as {post_1:, post_2:}.
The problem is that you have to know the number of the post you are searching for.
Maybe look at https://stackoverflow.com/a/24505347/1444576 to see whether it is similar to your example.

Related

How to form an unordered key with many elements in mongodb

I'm attempting to use MongoDB to implement a simple messaging system between two users. I want to be able to take two users, user0 and user1, and search for their entry in a collection. If the entry for those two users doesn't exist, I want to create it and then add the message that was sent to its message field. If it does exist, I just want to push the message to the message field.
I'm not really sure of the best way to implement this. I have tried:
db.privateChat.update(
{between:{$all:['user0', 'user1']}},
{$push:{message:'text'}}, {upsert:true}
)
I have also tried other similar schemes, but they don't work. They produce the error:
"Cannot create base during insert of update. Caused by :ConflictingUpdateOperators Cannot update 'between' and 'between' at the same time"
I can think of other ways to do this, such as producing a symmetric key (where the order of the users doesn't matter for the purposes of the search) by, say, adding the hashes together, or using a query that checks whether either messenger0 or messenger1 is either user0 or user1, but these don't seem like great ways of doing it. Is this totally the wrong approach?
Thanks.
I think this could be solved by design.
Let's say that we have a document like this in the chats collection:
chat: {
  _id,
  between: [arrayOfIds],
  startTime,
  events: [
    {message: {
      fromUserId,
      timeStamp,
      data
    }}
  ]
}
Messages are then stored as message objects inside the chat.
The app will know the chat _id, so there will be no issues when you have a group chat between more than two users.
This approach also helps you stay under the document size limit, as you could start a new chat entry every week, day, etc.
Have fun!
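For the symmetric key mentioned in the question, one simple option is to sort the two user ids before building the key, so both orderings map to the same value. The sketch below is only an assumption about how that could look: the pairKey field name and the "|" separator are hypothetical, and it assumes user ids never contain "|". It builds plain dictionaries and does not talk to a database.

```python
def pair_key(user_a, user_b):
    """Return the same key regardless of argument order."""
    return "|".join(sorted([user_a, user_b]))

def build_upsert(user_a, user_b, text):
    """Build (filter, update) documents for an upsert keyed on the pair."""
    query = {"pairKey": pair_key(user_a, user_b)}
    update = {
        # Only written when the document is first created.
        "$setOnInsert": {"between": sorted([user_a, user_b])},
        # Appended on every call.
        "$push": {"message": text},
    }
    return query, update

query, update = build_upsert("user1", "user0", "text")
```

Because the filter is an equality match on a single field that the update never touches, the two documents could be passed to an update with upsert enabled without the filter and the update both trying to write 'between'.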

Finding similar posts with PostgreSQL

I have a table posts:
CREATE TABLE posts (
id serial primary key,
content text
);
When a user submits a post, how can I compare his post with the others and find similar posts?
I'm looking for something like StackOverflow does with the "Similar Questions".
While Text Search is an option, it is not primarily meant for this type of search. The typical use case would be to find words in a document based on dictionaries and stemming, not to compare whole documents.
I am sure StackOverflow has put some smarts into the similarity search, as this is not a trivial matter.
You can get halfway decent results with the similarity function and operators provided by the pg_trgm module:
SELECT content, similarity(content, 'grand new title asking foo') AS sim_score
FROM posts
WHERE content % 'grand new title asking foo'
ORDER BY 2 DESC, content;
Be sure to have a GiST index on content (using the gist_trgm_ops operator class) for this.
But you'll probably have to do more. You could combine it with Text Search after identifying keywords in the new content.
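To get a feel for what the similarity score measures, here is a rough Python approximation of trigram similarity. It is a sketch, not pg_trgm itself: it assumes each word is lowercased and padded with two leading spaces and one trailing space, that only ASCII letters and digits count as word characters, and that the score is the number of shared trigrams divided by the number of distinct trigrams in both strings.

```python
import re

def trigrams(text):
    """Return the set of trigrams for text, padding every word with
    two leading spaces and one trailing space."""
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams

def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)
```

Since the score only counts shared three-character chunks, long posts that share a few common words can still score low, which is one reason the answer suggests combining this with keyword-based Text Search.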
You need to use Full Text Search in Postgres.
http://www.postgresql.org/docs/9.1/static/textsearch-intro.html

Efficient way to model azure table storage for social networking

I have tables like this in SQL Server
Users
UserId (Unique)
Name
Age
Friends
UserId
FriendId
Topics
UserId
Subject
There can be several thousand users, and there are several other properties in the tables.
I can query to get the following answers:
1. Give me all the friends of user "Tom".
2. Give me all the topics created by "Tom".
3. Give me all the topics created by Tom's friends that contain "abc" in the subject.
If I were to do it in Azure table storage, how do I structure my tables?
I have gone through this and this. I would like someone with more experience modeling Azure Table storage to give some insights.
1 and 2 are pretty easy. You create two Azure tables - Friends and Topics indexed by user id (with user id in the key).
3rd one is much more difficult with Azure tables, especially "that contains 'abc' in the subject" part.
Azure tables don't support full text search. Basically, it is only possible to efficiently retrieve values (or ranges of values) either by exact key or with a 'startswith' operator on the key. For example: "Give me all records where the key is equal to 'key value'", or "Give me all records where the key is greater than 'key lower bound' and less than 'key upper bound'".
It is also possible to filter using 'startswith' by any non-key field of a record, but this will involve table scan and is not efficient. It's not possible to do similar filtering with 'contains'.
So I think you need something with full text search support here.
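The 'startswith' lookup on keys is really just a range query in disguise, which is why it stays efficient. The Python sketch below shows the idea on an in-memory sorted list; the "userId|friendId" key layout is a hypothetical example, and the code does not use any Azure SDK.

```python
import bisect

def prefix_bounds(prefix):
    """Return (lower, upper) so that lower <= key < upper matches prefix.

    The upper bound is the prefix with its last character bumped by one.
    """
    if not prefix:
        raise ValueError("prefix must be non-empty")
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    return prefix, upper

def scan_prefix(sorted_keys, prefix):
    """Return all keys starting with prefix using two binary searches."""
    lower, upper = prefix_bounds(prefix)
    start = bisect.bisect_left(sorted_keys, lower)
    end = bisect.bisect_left(sorted_keys, upper)
    return sorted_keys[start:end]

# Hypothetical keys of the form "userId|friendId".
keys = sorted(["tom|alice", "tom|bob", "tim|carol", "tony|dave"])
toms_friends = scan_prefix(keys, "tom|")
```

A 'contains' filter has no such bounds: a match can sit anywhere in the key space, so every record has to be inspected.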

Documentation for CONTAINS() in FQL?

There have recently been several questions posted on Facebook.SO using CONTAINS() in the WHERE clause. It seems to work like the Graph API search function, and it functions as an indexed field. All great things for the FQL developer.
SELECT name,
username,
type
FROM profile
WHERE CONTAINS("Facebook")
However, the only official mention of the CONTAINS function appears in the unified_thread documentation. It is mentioned in passing, as a way to search for text contained in a message. It also appeared in this fbrell code sample.
But Contains doesn't seem to be a straightforward search. For example, this query:
SELECT name
FROM user
WHERE CONTAINS("Joe Biden")
returns "Joe Biden" and also "Joseph Biden" and "Biden Joe". But it also returns "Joe Scardino", "Lindsay Noyan" and "Mehmad Moha" among others. What relationship do these people have with the VP of the USA? They aren't my friends, so I'll never know.
There also appears to be the ability to pass CONTAINS a field to search on; however, changing the end of my first query to CONTAINS("Facebook", name) returns an OAuth error:
(#615) 'name' is not a valid search field for the profile table.
In my not-so-rigorous testing, I have yet to find a field/table combination that does not return this error.
So what is this mystery function? How does it work? Can it allow us to do things to date impossible in FQL like traversing arrays and filtering data stored in strings?
An answer here would be great, but a description on an FQL functions & methods reference page on the official developer documentation site would be better still.
I don't think that I have any great answers here, but I can give a workaround for the issue of returning unrelated names, which I suspect happens because people have made public posts about Joe Biden, liked him, and so on. If you do the following:
SELECT name
FROM user
WHERE CONTAINS("Joe Biden")
AND strpos(lower(name),lower("Joe Biden")) >=0
You will get a result set that only contains the right names, though it removes the advantage of also returning Joseph Biden, etc.
My personal point of pain is that CONTAINS() appears to work with partial strings (e.g. "Joe Bide") on the profile table, but not on the user table. Very frustrating.

Searches (and general querying) with HBase and/or Cassandra (best practices?)

I have a User model object with quite a few fields (properties, if you wish) in it. Say "firstname", "lastname", "city" and "year-of-birth". Each user also gets a "unique id".
I want to be able to search by them. How do I do that properly? How to do that at all?
My understanding (will work for pretty much any key-value storage -- first goes key, then value)
u:123456789 = serialized_json_object
("u" as a simple prefix for user's keys, 123456789 is "unique id").
Now, thinking that I want to be able to search by firstname and lastname, I can save in:
f:Steve = u:384734807,u:2398248764,u:23276263
f:Alex = u:12324355,u:121324334
so the key is "f", which is the prefix for first names, followed by "Steve", the actual first name.
For "f:Steve" we save as the value all the user ids of users named "Steve".
That makes every search very, very easy. Querying by a few fields (properties), say by firstname (i.e. "f:Steve") and lastname (i.e. "l:Anything"), is still easy: first get the list of user ids from "f:Steve", then the list from "l:Anything", find the user ids that appear in both, and there you go.
Problems (and there are quite a few):
1. Saving, updating, or deleting a user is a pain. It has to be an atomic and consistent operation. Also, if the size of a value is limited, then we are in (potential) trouble, and there is not really an answer here. Only zipping the list of user ids? Not too cool, though.
2. What if we want to add a new field to search by, eventually? Say by "city". We certainly can do it the same way, "c:Los Angeles" = ..., "c:Chicago" = ..., but if we didn't foresee all those "search choices" from the very beginning, then we will have to create some nightly job or something to go through all existing User records and update those "c:CITY" keys for them... Quite a big job!
3. Problems with locking. User "u:123" updates his name to "Alex", and user "u:456" updates his name to "Alex". They both have to update "f:Alex" with their ids. That means either we get into an overwriting problem, or one update will wait for another (and imagine if there are many of them?!).
What's the best way of doing that? Keeping in mind that I want to search by many fields?
P.S. Please, the question is about HBase/Cassandra/NoSQL/key-value storage. Please, please: no advice to use MySQL and "read about" SELECTs and worry about scaling problems "later". There is a reason why I asked MY question exactly the way I did. :-)
Being able to query properties directly is one of the features you lose when moving away from SQL, so you need a way to maintain your own index to let you find records.
If your datastore does not have built in indexing or atomic list operations, you will need to deal with the locking issues you mention. However, indexing doesn't necessarily need to be synchronous - maintain a queue of updated records to be reindexed and you have a solution for 3 that can be reused to solve 2 also.
If the index list for a particular value becomes too large for the system to handle in a single list, you can replace the list of users with a list of lists. However, if you have that many records with the same value, it probably isn't a particularly useful search criterion anyway.
Another option that is useful in some cases is to use a separate system for the indexing. For example, you could set up Lucene to index the records in your main datastore.
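As a rough illustration of the asynchronous approach, the Python sketch below keeps an in-memory index that is only updated when a queue of changed record ids is drained. All names (AsyncIndex, save, process_pending, add_field) are hypothetical, and the dictionaries stand in for whatever key-value store is actually used. Because a single consumer drains the queue, two users renaming themselves "Alex" never write the same index entry at once, and adding a new searchable field just re-queues every record.

```python
from collections import defaultdict, deque

class AsyncIndex:
    def __init__(self, fields):
        self.fields = list(fields)      # searchable fields
        self.records = {}               # uid -> record (main store)
        self.index = defaultdict(set)   # (field, value) -> set of uids
        self.indexed = {}               # uid -> record as last indexed
        self.pending = deque()          # uids waiting to be reindexed

    def save(self, uid, record):
        """Write the record and queue it; the index is not touched here."""
        self.records[uid] = dict(record)
        self.pending.append(uid)

    def process_pending(self):
        """Single consumer: apply queued changes to the index in order."""
        while self.pending:
            uid = self.pending.popleft()
            old = self.indexed.get(uid, {})
            new = self.records.get(uid, {})
            for field in self.fields:
                if field in old:
                    self.index[(field, old[field])].discard(uid)
                if field in new:
                    self.index[(field, new[field])].add(uid)
            self.indexed[uid] = dict(new)

    def add_field(self, field):
        """Make a new field searchable by re-queuing every record."""
        self.fields.append(field)
        self.pending.extend(self.records)

    def search(self, **criteria):
        """Intersect the uid sets of all requested field values."""
        sets = [self.index.get(item, set()) for item in criteria.items()]
        return set.intersection(*sets) if sets else set()

idx = AsyncIndex(["firstname", "lastname"])
idx.save("u:1", {"firstname": "Steve", "lastname": "Doe", "city": "Chicago"})
idx.save("u:2", {"firstname": "Alex", "lastname": "Doe", "city": "Boston"})
idx.process_pending()
```

The trade-off is that searches can briefly return stale results between a save and the next drain of the queue.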
I guess I would implement this as a MapReduce job that runs on a schedule.
Each search word would be a row key with a lookup to the UIDs.
Rowkey: uid1
profile:firstName: Joe
profile:lastName: Doe
profile:nick: DoeMaster
Rowkey: uid2
profile:firstName: Jane
profile:lastName: Doe
profile:nick: SuperBabe
MapReduce indexes all searchable properties and adds them with the search word as the row key:
Rowkey: Jane
lookup:uid: uid2
Rowkey: Doe
lookup:uid: uid2, uid1
Rowkey: DoeMaster
lookup:uid: uid1
..etc
Now, if you need to update the index on the fly as a user changes, you would write the change directly to the index table by removing the uid value from the old row key and adding it to the new one. In case two such updates happen at the same time, temporary locking could be implemented.
For users being removed, an additional attribute telling the state of the user could be used to filter them out of search results.
Adding an additional search word isn't very hard, since it's just a matter of which name:value pairs you want to index. You could also filter searches further by adding a type attribute to your row key/keyword, e.g. boston - lookup:type: city.
The idea is to maintain your own row-key-based search index inside HBase.
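The batch job described above can be sketched in plain Python as a map step that emits (search word, uid) pairs and a reduce step that collects the uids for each word. This is only an in-memory illustration with hypothetical function names; it does not use Hadoop or the HBase API.

```python
from collections import defaultdict

# The two example rows from above, keyed by uid.
profiles = {
    "uid1": {"firstName": "Joe", "lastName": "Doe", "nick": "DoeMaster"},
    "uid2": {"firstName": "Jane", "lastName": "Doe", "nick": "SuperBabe"},
}

def map_profile(uid, profile):
    """Map step: emit one (search word, uid) pair per searchable property."""
    for value in profile.values():
        yield value, uid

def reduce_pairs(pairs):
    """Reduce step: collect the uids emitted for each search word."""
    grouped = defaultdict(set)
    for word, uid in pairs:
        grouped[word].add(uid)
    return {word: sorted(uids) for word, uids in grouped.items()}

def build_index(all_profiles):
    """Run the map step over every profile, then reduce the pairs."""
    pairs = []
    for uid, profile in all_profiles.items():
        pairs.extend(map_profile(uid, profile))
    return reduce_pairs(pairs)

index = build_index(profiles)
```

Each key of the resulting dictionary corresponds to one index row key, and its value to the lookup:uid column.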