An album A has missing tracks if the Tracks table contains fewer rows for A than the number of tracks for A reported in the Albums table. For each album without missing tracks, find its total running time in seconds.
Since this is an exercise, I don't want to spoil the learning effect totally by just giving you the final solution. I'll try to guide you there.
Your problem is the WHERE clause:
WHERE albums.tracks >= tracks.number
I guess you intend it to implement the requirement “for each album without missing tracks, …”.
However, that's not what the condition does; rather, it excludes tracks whose number exceeds the track count of the album.
You need something like: “where the count of tracks that are related to the album is (greater than or) equal to the album track count”.
In other words: WHERE count(tracks.*) >= albums.tracks. (The “related to the album” part is implemented by the join condition — it excludes tracks not related to the album.)
See? The secret is often just to translate a natural language sentence into SQL.
Now we are facing a problem, because we cannot have an aggregate function like count in the WHERE clause. This is because WHERE is processed before GROUP BY where the groups are formed.
Fortunately for us there is a kind of “WHERE clause” that is processed after grouping, and that is HAVING.
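For illustration only, here is the general shape of such a query on a hypothetical orders/customers schema (not this exercise's tables), showing how HAVING filters on an aggregate after the groups are formed:
SELECT customers.id, count(orders.id) AS order_count
FROM customers
JOIN orders ON orders.customer_id = customers.id
GROUP BY customers.id
HAVING count(orders.id) >= 3;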
I leave the rest to you :^)
I'm currently building a forum-like application. Users will be able to see recent posts with the total like count. If a post is interesting to a user, they can like it as well and contribute to the total like count.
The normalized approach would be to have two tables: user_post (contains id, metadata, ...) and liked_post (contains the user id + post id). When posts are queried, the like count would be determined with COUNT() on the liked_post table, grouped by post id.
I'm thinking of another approach, which requires no GROUP BY on a potentially huge table: add a like_count column to the user_post table and break the normalization. This column would always be updated when a liked_post entry gets inserted or deleted. That means every time a user likes a post, there is an update on the user_post table (increment the like_count column) plus an insert/delete in the liked_post table (with a trigger or code in the app layer).
Would this on-the-fly aggregation approach have any disadvantages, apart from consistency concerns? It would enable very simple and fast SELECT queries, but I'm not sure whether the additional update would be an issue.
What are your thoughts?
I'm really interested in the performance impact, not in whether you should do this from the beginning of a project.
Your idea is correct and widely used. The problem you will face:
How do you make sure that like_count is valid? Can this number be delayed or approximated somehow?
In general you can do this in the following ways:
update like_count within application code
update like_count by triggers
If you want the values to be exactly correct, you can maintain those counts with triggers or do it programmatically, ensuring that the like_count update is always within the same transaction as the insert into liked_posts.
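A minimal sketch of the programmatic variant, assuming the table and column names used in the trigger below (the user_id column and the id values are purely illustrative):
BEGIN;
INSERT INTO liked_posts (user_id, post_id) VALUES (42, 7);
UPDATE user_post SET liked_count = liked_count + 1 WHERE id = 7;
COMMIT;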
Using triggers it could be something like this:
CREATE FUNCTION public.update_like_count() RETURNS trigger
    LANGUAGE plpgsql
AS $$
BEGIN
    UPDATE user_post SET liked_count = liked_count + 1
    WHERE user_post.id = NEW.post_id;
    RETURN NEW;
END;
$$;

CREATE TRIGGER update_like_counts
AFTER INSERT ON public.liked_posts
FOR EACH ROW EXECUTE PROCEDURE public.update_like_count();
You should also handle AFTER DELETE with a separate trigger.
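A sketch of that delete-side trigger, assuming the same tables and columns as above (the function and trigger names are just illustrative):
CREATE FUNCTION public.decrement_like_count() RETURNS trigger
    LANGUAGE plpgsql
AS $$
BEGIN
    -- OLD is the liked_posts row that was just deleted
    UPDATE user_post SET liked_count = liked_count - 1
    WHERE user_post.id = OLD.post_id;
    RETURN OLD;
END;
$$;

CREATE TRIGGER update_like_counts_on_delete
AFTER DELETE ON public.liked_posts
FOR EACH ROW EXECUTE PROCEDURE public.decrement_like_count();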
Be aware that depending on the transaction isolation level you might run into concurrency problems here (if two inserts happen at the same time, like_count may end up being the same number in both transactions) and end up with an invalid total.
I've had a problem similar to this in the past, and the solution I went with is similar to what you've described: keeping an aggregated stored value like_count. As you mentioned, the only downside is the consistency concern, but that problem exists even with the normalized approach.
The solution to something like this lies more on the application side, e.g. using something like WebSockets to keep posts up to date without too much fluff.
When a user's browser/client loads a post, they join a room keyed by the post id, and when a user interacts with a post (like, dislike, etc.) that interaction is broadcast to all users in that room (post id).
Finally, when it comes to finding out which users liked the post, you can query/load that at the point where the user clicks to find out. ~ cheers
I use SphinxQL for searching and filtering in a product database, and I store the last X search phrases of each user. I wonder whether it is possible to show all products (all rows) to every user, but weighted by relevance to their previous searches.
Let's say one user searched for mobile phones (iphone, galaxy s7, ...), i.e. the electronics category. I want to show him all products randomly, but products from the electronics category more often, and products with those searched keywords even more often.
Is it even possible with Sphinx?
Thanks, and sorry for my English.
An alternative would perhaps be to attach random numbers to each result: a high and a low number, with an overlapping range.
sql_query = SELECT id, RAND()*100 AS rand_low, (RAND()*100)+50 AS rand_high, ...
sql_attr_uint = rand_low
sql_attr_uint = rand_high
You can then arrange the ranking expression to pick either of these numbers depending on whether the document matches or not, and sort by the result.
SELECT id FROM index WHERE MATCH('_all_ MAYBE electronics MAYBE (galaxy s7)')
OPTION ranker=expr('IF(doc_word_count>1,rand_high,rand_low)');
The results will be mixed up, but results that match one of the words have a greater chance of showing up first (because they use the weighted random number). It's still only a chance, because a rand_high CAN still be smaller than a rand_low.
... you can change the size of the number 'overlap' to tweak the mix of matching/non-matching results.
(Added as a new answer as it's quite a different idea, although it uses the same 'all' keyword.)
Sphinx doesn't have a 'mode' to do just that, but you can get very close...
You can use the MAYBE operator:
MATCH('_all_ MAYBE electronics MAYBE (galaxy s7)')
The complication is that you need a way to match all products. Depending on your data you may already have a word you can use (e.g. a word like 'the' that appears in every single product), or you can add such a word to every document during indexing.
... using MAYBE allows the matching results to have a higher weight.
But you don't want to sort strictly by weight, so you need a different algorithm, something to shuffle the results a bit (as you don't really want 'random'!).
SELECT id, IDIV(id/10000) AS band, WEIGHT() AS w
FROM index WHERE MATCH('_all_ MAYBE electronics MAYBE (galaxy s7)')
ORDER BY band DESC, w DESC;
This creates banding by ID; since results are in theory spread across the whole id-space, it will mix them up a bit. But the matching category results will still tend to be shown first within each band.
If you have one, a different attribute other than ID might be better, something more spread out. (Or you can add a deliberate random attribute to the results.)
... there are all sorts of variations; your imagination is the only limitation. This basic technique can be used to mix things up quite a bit.
(There are other possibilities: Sphinx's little-known GROUP N BY function can be used to produce a sampling search result. This isn't random, but it might give a similar enough effect, i.e. just mixing up the results.)
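For example, a rough sketch of that GROUP N BY idea, where category_id stands in for whatever grouping attribute your index actually has:
SELECT id, WEIGHT() AS w
FROM index WHERE MATCH('_all_ MAYBE electronics MAYBE (galaxy s7)')
GROUP 3 BY category_id
ORDER BY w DESC;
This returns up to 3 documents per category, which mixes categories together near the top of the result set.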
Background Info
C#
MS MVC 4
Sql Azure
Linq - Identities
Problem at hand:
Selecting records in an Items table where zip code is within a certain range of miles.
Items Table
id (PK)
Title
Body
ZipCode (Int)
Summary of Progress:
I have a class which uses the 2013 US Gazetteer zip code tabulation areas to gather zip codes and assess distances between them. It is basically a .csv/.txt file that I open into a stream and convert to POCOs in order to process distances. That much of the equation is working fine; however, selecting a list of Items from an Items table based on this list of zip codes is where I'm not sure what to do.
Scenario
User A wants to search for items within a 25 miles radius of area code 46324.
User A hits search and in the background my class returns a list of 124 zip codes within a 25 mile radius.
Question: What is the best way (performance wise) to retrieve items in my Item table using this list of zipcodes?
Possible Solutions
I thought about creating a dynamic query using the T-SQL IN keyword within my WHERE clause and simply supplying this list as the parameters. This does not seem to be a very performance-oriented way of doing it; however, considering my current architecture I do not see any other way.
I also thought about incorporating a sort of paging functionality that will only take the first 5 zip codes to return results, followed by the next 5, and so on. This will involve more work, but it definitely seems like a better choice performance-wise.
Any ideas?
I stumbled across your question purely by chance while searching for something else, and I see it's quite old, but I thought I'd give you a comment nonetheless:
What I would do in this case is actually let the database do the search and the C# do the calculations. You have a class in C# which calculates the distances? Then why not save the distance from each zip code to each other zip code in a "lookup table" in SQL.
Doing it this way makes sure that the data is calculated once, but you let SQL find the right data for you.
i.e.:
Create a table with from_zip, to_zip, distance fields
Calculate and populate table once at the beginning
Query by saying "SELECT * FROM zip_lookup WHERE zip_from = bla AND distance BETWEEN 0 AND 100", or something like that (see the sketch below)
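A rough sketch of that idea against the Items table from the question (the zip_lookup table and its column names are illustrative):
CREATE TABLE zip_lookup (
    zip_from INT NOT NULL,
    zip_to   INT NOT NULL,
    distance FLOAT NOT NULL,
    PRIMARY KEY (zip_from, zip_to)
);

-- items within 25 miles of zip code 46324
SELECT i.*
FROM Items i
JOIN zip_lookup z ON z.zip_to = i.ZipCode
WHERE z.zip_from = 46324
  AND z.distance <= 25;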
This is similar to this question, which doesn't have any answers. I've read all about how to use cursors with the Twitter, Facebook, and Disqus APIs and also this article about how Disqus generally built their cursors, but I still cannot seem to grok the concept of how they work and how to implement a similar solution in my own projects. Can someone explain specifically the different techniques and concepts behind them?
Let's first understand, with an example, why offset pagination fails for large data sets.
Clients provide two parameters: limit for the number of results, and offset for the page offset.
For example, with offset = 40, limit = 20, we can tell the database to return the next 20 items, skipping the first 40.
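In SQL terms, that request is roughly the following (a sketch, assuming a users table like the one used later in this answer):
SELECT * FROM users
ORDER BY id DESC
LIMIT 20 OFFSET 40;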
Drawbacks:
Using LIMIT OFFSET doesn't scale well for large datasets. The farther you go within the dataset as the offset increases, the more the database still has to read: up to offset + count rows from disk, before discarding the offset rows and only returning count rows.
If items are being written to the dataset at a high frequency, the page window becomes unreliable, potentially skipping or returning duplicate results.
How do cursors solve this?
Cursor-based pagination works by returning a pointer to a specific item in the dataset. On subsequent requests, the server returns results after the given pointer.
In this case the client provides next_cursor along with limit as the request parameters.
Let's assume we want to paginate from the most recent user to the oldest user. When the client requests the first page, suppose we select it with this query:
SELECT * FROM users
WHERE team_id = %team_id
ORDER BY id DESC
LIMIT %limit
Here %limit is equal to the client's limit plus one, to fetch one more result than the count specified by the client. The extra result isn't returned in the result set, but we use its id as the next_cursor.
The response from the server would be:
{
"users": [...],
"next_cursor": "1234", # the user id of the extra result
}
The client would then provide next_cursor as cursor in the second request.
SELECT * FROM users
WHERE team_id = %team_id
AND id <= %cursor
ORDER BY id DESC
LIMIT %limit
With this, we’ve addressed the drawbacks of offset based pagination:
Instead of the window being calculated from scratch on each request based on the total number of items, we’re always fetching the next count rows after a specific reference point. If items are being written to the dataset at a high frequency, the overall position of the cursor in the set might change, but the pagination window adjusts accordingly.
This will scale well for large datasets. We’re using a WHERE clause to fetch rows with id values less than the last id from the previous page. This lets us leverage the index on the column and the database doesn’t have to read any rows that we’ve already seen.
For a detailed explanation, you can read this wonderful engineering article from Slack!
Here is an article about pagination: paginating-real-time-data-cursor-based-pagination
Cursors – we need to have at least one column with unique sequential values to implement cursor based pagination. This can be similar to Twitter’s max_id parameter or Facebook’s after parameter.
In general you should pass the current item or page number in the request as a param. Another usual param is the batch size of the page. Then on the server-side backend you select and return the proper dataset, with an SQL query for example.
Here's what I've done: the cursor works as a pointer, and it points to that index; the limit will pick that many rows from that pointer. Let's say we are given id 10 and limit 5; then it will go to id 10 and pick 5 elements from there.
Some Graph API connections use cursors by default. You can use the 'limit' and 'before'/'after' parameters in your call. If you are still not clear, you can post your code here and I can explain using it.
My iPhone application is using a SQLite database with the following schema:
items(id, name, ...) -> this table contains 50 records
tags(id, name) -> this table contains 50 records
item_tags(id, item_id, tag_id, user_id)
similarities(id, item1_id, item2_id, score)
The items, tags, item_tags, and similarities tables are populated with pre-defined records, so the similarities between different items have also been calculated offline (using a cosine similarity algorithm based on the items' tags).
Users are able to add additional tags to items and to remove their custom tags later on. Whenever this happens the similarity scores between the items should be updated locally, i.e. without contacting the server application.
My question now is the following:
What is the most efficient way to do so? So far, on startup of the iPhone application, I compute a term-document matrix for all the items and tags (which reflects the tag frequencies for each item) and keep this matrix in memory as long as the application is running. Whenever a tag is added or removed, I use this matrix to update the similarities in the database. However, this is rather inefficient. Do you have any suggestions?
Thanks!
This presentation might help you:
http://www.slideshare.net/jnvms/incremental-itembased-collaborative-filtering-4095306