We have an issue around deduplication when our data is spread across multiple indexes and the same id exists in more than one index.
When doing a straight select we get X records back, but when we do a group by, the counts add up to more than X. As stated above, we have tracked this back to the offending id existing in more than one index.
Sphinx is smart enough to deduplicate the records when doing the straight select, but doesn't when bucketing them for a group by.
Of course it would be better not to have the duplicates in the first place, and we'll hopefully find a way to deal with that, but in the meantime: is there a way to tell Sphinx to do the deduplication on group by as well?
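To illustrate, the two queries look roughly like this (index and attribute names are simplified stand-ins for ours, not the real schema; we talk SphinxQL to searchd over the MySQL protocol on port 9306):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SphinxDupCheck {
    public static void main(String[] args) throws Exception {
        // SphinxQL is served over the MySQL wire protocol, usually on port 9306.
        try (Connection conn = DriverManager.getConnection("jdbc:mysql://127.0.0.1:9306");
             Statement st = conn.createStatement()) {

            // Straight select across both indexes: duplicate ids come back collapsed.
            ResultSet plain = st.executeQuery(
                "SELECT id FROM idx_main, idx_delta WHERE MATCH('foo') LIMIT 1000");
            int rows = 0;
            while (plain.next()) rows++;

            // Grouped query: each copy of a duplicated id lands in its bucket,
            // so the per-group counts can add up to more than the row count above.
            ResultSet grouped = st.executeQuery(
                "SELECT category_id, COUNT(*) AS cnt FROM idx_main, idx_delta " +
                "WHERE MATCH('foo') GROUP BY category_id LIMIT 1000");
            long total = 0;
            while (grouped.next()) total += grouped.getLong("cnt");

            System.out.printf("plain rows=%d, grouped total=%d%n", rows, total);
        }
    }
}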
When should I use "eclipselink.join-fetch", and when should I use "eclipselink.batch" (batch type = IN)?
Related: are there any limitations to join fetch, such as the number of tables being fetched?
The answer is always specific to your query, your use case, and your database, so there is no hard rule on when to use one over the other, or whether to use either at all. You cannot determine what to use unless you are serious about performance and willing to test both under production load conditions - just like any other query performance tweaking.
Join-fetch is just what it says: it causes all the data to be brought back in the one query. If your goal is to reduce the number of SQL statements, it is perfect. But it comes at a cost, as inner/outer joins, Cartesian joins, etc. can increase the amount of data being sent across and the work the database has to do.
Batch fetching is one extra query (1+1), and can be done a number of ways. IN collects all the foreign key values and puts them into one statement (more than one statement if you have >1000 values on Oracle). JOIN is similar to a fetch join, as it uses the criteria from the original query to select over the relationship, but won't return as much data, since it only fetches the required rows. EXISTS is very similar, using a subquery to filter.
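For reference, this is roughly how the two hints are applied on a JPA query; the Employee/phoneNumbers mapping below is just an assumed example relationship, not anything from your model:

import java.util.List;
import javax.persistence.EntityManager;
import javax.persistence.TypedQuery;

// Employee -> phoneNumbers is an assumed one-to-many mapping, used only for illustration.
public class FetchHintExamples {

    // join-fetch: one SQL statement; employees and their phone numbers come back joined.
    static List<Employee> withJoinFetch(EntityManager em) {
        TypedQuery<Employee> q = em.createQuery("SELECT e FROM Employee e", Employee.class);
        q.setHint("eclipselink.join-fetch", "e.phoneNumbers");
        return q.getResultList();
    }

    // batch (type IN): two SQL statements (1 + 1); the second collects the employee ids
    // into an IN clause and loads all of their phone numbers at once.
    static List<Employee> withBatchIn(EntityManager em) {
        TypedQuery<Employee> q = em.createQuery("SELECT e FROM Employee e", Employee.class);
        q.setHint("eclipselink.batch", "e.phoneNumbers");
        q.setHint("eclipselink.batch.type", "IN");
        return q.getResultList();
    }
}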
Consider this scenario.
You're a link shortening service, and you have two tables:
Links
Clicks - predominantly append only, but will need a full scan to produce aggregates, which should be (but probably won't be) quick.
Links is millions of rows, Clicks is billions of rows.
Should you split these onto separate hardware? What's the right approach to getting the most out of postgres for this sort of problem?
With partitioning, it should be scalable enough. Partition links on a hash of the shortened link (the key used for retrieval). Depending on your aggregation and reporting needs, you might partition clicks by date (maybe one partition per day?). When you create a new partition, the old one can be summed and moved to history (or removed, if the summed data is enough for your needs).
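A minimal sketch of that layout with declarative partitioning (PostgreSQL 11+); the table and column names are assumptions, and JDBC is only used here to run the DDL:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreatePartitions {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/shortener", "app", "secret");
             Statement st = conn.createStatement()) {

            // Links partitioned by hash of the lookup key.
            st.execute("CREATE TABLE links (" +
                       "  short_code text NOT NULL," +
                       "  target_url text NOT NULL," +
                       "  PRIMARY KEY (short_code)" +
                       ") PARTITION BY HASH (short_code)");
            for (int i = 0; i < 8; i++) {
                st.execute("CREATE TABLE links_p" + i + " PARTITION OF links " +
                           "FOR VALUES WITH (MODULUS 8, REMAINDER " + i + ")");
            }

            // Clicks partitioned by day, so old days can be rolled up and detached.
            st.execute("CREATE TABLE clicks (" +
                       "  short_code text NOT NULL," +
                       "  clicked_at timestamptz NOT NULL" +
                       ") PARTITION BY RANGE (clicked_at)");
            st.execute("CREATE TABLE clicks_2024_01_01 PARTITION OF clicks " +
                       "FOR VALUES FROM ('2024-01-01') TO ('2024-01-02')");
        }
    }
}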
In addition to partitioning, I suggest pre-aggregating the data. If you never need the individual rows, but only aggregates per day, then perform the aggregation and materialize it in another table after each day is over. That will reduce the data volume considerably and make it manageable.
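And a sketch of that daily roll-up, again with assumed names and placeholder dates: once a day is over, materialize its aggregates, then detach (or drop) the raw partition.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class DailyRollup {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/shortener", "app", "secret");
             Statement st = conn.createStatement()) {

            // One row per link per day; this table stays small compared to raw clicks.
            st.execute("CREATE TABLE IF NOT EXISTS clicks_daily (" +
                       "  day date NOT NULL," +
                       "  short_code text NOT NULL," +
                       "  click_count bigint NOT NULL," +
                       "  PRIMARY KEY (day, short_code))");

            // Materialize the finished day's aggregates from the raw clicks.
            st.execute("INSERT INTO clicks_daily (day, short_code, click_count) " +
                       "SELECT clicked_at::date, short_code, count(*) " +
                       "FROM clicks " +
                       "WHERE clicked_at >= date '2024-01-01' AND clicked_at < date '2024-01-02' " +
                       "GROUP BY clicked_at::date, short_code");

            // Once summed, the raw partition can be detached (kept as history) or dropped.
            st.execute("ALTER TABLE clicks DETACH PARTITION clicks_2024_01_01");
        }
    }
}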
I'm learning to use DynamoDB and am storing some job postings with info like date posted, company, and job title.
The query I use most is: get all job postings posted after date x.
What partition key should I use so that I can do the above query without using a scan?
A partition key can only be checked for equality, so using the date as the partition key is no good. Date as the sort key seems best, since I can use range conditions on it.
However, I'm a bit stuck on what a good partition key would be. If I use company or job title, I would have to include that in my query, but I want ALL job postings after a certain date, not just those for a specific company or job.
One way I thought of was using the month as the partition key and the date as the sort key. That way, to get say the last 14 days, I know I need to hit the partition for this month and maybe the previous month, and then use the sort key to keep only the records within the last 14 days. This seems hackish, though.
I would probably do something similar to what you mentioned in the last paragraph - keep a sub-part of the date as the partition key. Either use something like the month, or the first N digits of the unix timestamp, or something similar.
Note that, depending on how large a partition you choose, you may still need to perform multiple queries when querying for, say, the last 14 days' worth of posts, due to crossing partition boundaries (when querying for the last 14 days on January 4 you would also want to query December of the previous year, and so on), but it should still be usable.
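As a rough sketch of that pattern with the AWS SDK for Java v2 (the table, key and attribute names are assumptions: partition key postMonth like "2024-01", sort key postedDate as an ISO-8601 string; pagination is omitted for brevity):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.QueryRequest;
import software.amazon.awssdk.services.dynamodb.model.QueryResponse;

public class RecentPostings {

    // Fetch all postings on or after 'sinceDate' from the month buckets that overlap the window.
    static List<Map<String, AttributeValue>> postingsSince(
            DynamoDbClient ddb, List<String> monthBuckets, String sinceDate) {
        List<Map<String, AttributeValue>> items = new ArrayList<>();
        for (String month : monthBuckets) {            // e.g. ["2023-12", "2024-01"]
            QueryRequest req = QueryRequest.builder()
                    .tableName("JobPostings")
                    .keyConditionExpression("postMonth = :m AND postedDate >= :d")
                    .expressionAttributeValues(Map.of(
                            ":m", AttributeValue.builder().s(month).build(),
                            ":d", AttributeValue.builder().s(sinceDate).build()))
                    .build();
            QueryResponse resp = ddb.query(req);
            items.addAll(resp.items());               // note: each query may also need paging
        }
        return items;
    }
}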
Remember that it's important to choose the partition key so that items are distributed as evenly as possible, so any hack that puts a lot of (or, as is sometimes seen in questions on SO: ALL!) items under the same partition key to simplify sorting is not a good idea.
Perhaps you might also want to have a look at Time to Live (TTL) to have AWS automatically delete items after a certain amount of time. This way, you could keep one table of the newest items and "archive" all other items which are not frequently queried. Of course, you could also do something similar manually by keeping separate tables for new and archived posts, but TTL is pretty neat for auto-expiring items. Querying for all new posts would then simply be a full scan of the table with the new posts.
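If TTL sounds useful, enabling it is a one-off call; the attribute name ttl is an assumption, and each item would then need that attribute set to its expiry time in epoch seconds:

import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.TimeToLiveSpecification;
import software.amazon.awssdk.services.dynamodb.model.UpdateTimeToLiveRequest;

public class EnableTtl {
    public static void main(String[] args) {
        try (DynamoDbClient ddb = DynamoDbClient.create()) {
            // Items whose 'ttl' attribute (epoch seconds) is in the past get deleted by DynamoDB.
            ddb.updateTimeToLive(UpdateTimeToLiveRequest.builder()
                    .tableName("JobPostings")
                    .timeToLiveSpecification(TimeToLiveSpecification.builder()
                            .attributeName("ttl")
                            .enabled(true)
                            .build())
                    .build());
        }
    }
}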
I am using OrientDB to graph a social network. Vertices are unique-indexed by user_id. When inserting new vertices, I might encounter potential duplicates. For example, user_id(1) has friends: user_id(0)...user_id(5)...user_id(n). That same user 1, might also have followers that intersect, for example, user_id(5) might be both in the set of friends and followers.
So, I insert the root user vertex, and then grab all the ID's of friends, as well as the ID's of followers, and then insert those in a batch operation.
My question is, when inserting vertices, is there a canonical method to insert new records that accounts for duplicates?
Given the structure of my application, I can think of a couple of methods:
1) Creating each new user vertex with the following:
create vertex User content {user_node}
Prior to each insertion, I check the user_id against a secondary index kept in a Redis database. I use the #rid field for the subsequent creation of edges. The Redis store contains a hash whose fields are user_ids and whose values are #rids, so I can look up the #rid this way.
This has the advantage of keeping OrientDB access down. Also, accessing Redis is likely quite a bit faster than accessing OrientDB, and the time complexity of getting the value of a hash field is O(1); therefore, I'm thinking that, even though I'm querying two databases for each operation, I'm still coming out ahead (maybe). However, in the event of divergence between OrientDB and the secondary-index, duplicate records would raise ORecordDuplicatedException. I can catch this exception at the application level, but then I have #rid issues. I am using the #rid field because, according to the OrientDB docs, this is the fastest way to access a record.
2) Update/Upsert for each insertion:
update User content {user_node} upsert where user_id = 1
This would, I think, keep all the error catching and duplicate detection on the database side. Coming from Neo4j, I know MERGE operations like this are a bit more expensive than CREATE; does the same apply to OrientDB?
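Concretely, what I have in mind for option 2 looks roughly like this (OrientDB 3.x Java API; the RETURN AFTER @rid part is my attempt to get the #rid back in the same round trip, and the class/property names are just from my example):

import com.orientechnologies.orient.core.db.ODatabaseSession;
import com.orientechnologies.orient.core.db.OrientDB;
import com.orientechnologies.orient.core.db.OrientDBConfig;
import com.orientechnologies.orient.core.sql.executor.OResult;
import com.orientechnologies.orient.core.sql.executor.OResultSet;

public class UpsertUser {
    public static void main(String[] args) {
        try (OrientDB orient = new OrientDB("remote:localhost", OrientDBConfig.defaultConfig());
             ODatabaseSession db = orient.open("social", "admin", "admin")) {

            // Create the vertex if user_id 1 doesn't exist yet, otherwise update it,
            // and return the affected record's @rid either way.
            try (OResultSet rs = db.command(
                    "UPDATE User SET user_id = ?, name = ? UPSERT RETURN AFTER @rid WHERE user_id = ?",
                    1, "alice", 1)) {
                if (rs.hasNext()) {
                    OResult row = rs.next();
                    Object rid = row.getProperty("@rid");
                    // Use this #rid for the follow-up CREATE EDGE statements.
                    System.out.println(rid);
                }
            }
        }
    }
}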
Are there any other insert methods that I'm missing? Is there a best solution here? Thanks!
I want to store a list of users in a Cassandra column family (wide rows).
The columns in the CF will have composite keys of the pattern id:updated_time:name:score.
After inserting all the users, I need to query them in a different sorted order each time.
For example, if I specify updated_time, I should be able to fetch the 10 most recently updated users.
And if I specify score, I should be able to fetch the top 10 users by score.
Does Cassandra support this?
Kindly help me in this regard...
I need to query users in a different sorted order each time...
Does Cassandra support this?
It does not. Unlike an RDBMS, you cannot make arbitrary queries and expect reasonable performance. Instead, you must design your data model so that the queries you anticipate making will be efficient:
The best way to approach data modeling for Cassandra is to start with your queries and work backwards from there. Think about the actions your application needs to perform, how you want to access the data, and then design column families to support those access patterns.
So rather than having one column family (table) for your data, you might want several with cross references between them. That is, you might have to denormalise your data.
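Concretely, that usually means one table per sort order, each clustered on the column you want to sort by. A rough sketch with the DataStax Java driver; the keyspace, table and column names are assumptions, and the single bucket partition key is only workable for a small user set (shard it otherwise):

import com.datastax.oss.driver.api.core.CqlSession;

public class CreateUserViews {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder()
                .withKeyspace("app")                // assumes the keyspace already exists
                .build()) {                         // contact point/datacenter from application.conf

            // Newest-first: query a bucket and take the first 10 rows.
            session.execute(
                "CREATE TABLE IF NOT EXISTS users_by_updated (" +
                "  bucket text, updated_time timestamp, id uuid, name text, score int," +
                "  PRIMARY KEY (bucket, updated_time, id)" +
                ") WITH CLUSTERING ORDER BY (updated_time DESC, id ASC)");

            // Highest-score-first: same data, different clustering, written at insert time.
            session.execute(
                "CREATE TABLE IF NOT EXISTS users_by_score (" +
                "  bucket text, score int, id uuid, name text, updated_time timestamp," +
                "  PRIMARY KEY (bucket, score, id)" +
                ") WITH CLUSTERING ORDER BY (score DESC, id ASC)");

            // e.g. SELECT * FROM users_by_score WHERE bucket = 'all' LIMIT 10;
            // Note: a single bucket value puts every user in one partition; split the
            // bucket (or use a coarser natural key) once the user count grows.
        }
    }
}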