What is the best method of tracking page views (uniques especially) for a certain page?
Example: threads in a forum, videos on a video website, questions in a Q&A script (SO).
Currently, I'm taking the approach of a simple "Views" column on each row whose views I want to count; however, I know this is not the most efficient way.
For unique views, I have a separate table that holds a row with the "QuestionID" and "UserID". When a user visits a question/page, my code attempts to find a row in the views table with that "UserID" and "QuestionID"; if it can't, it adds a row and then increments the Views value of the question in the "Questions" table.
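In rough SQL, the current layout looks something like this (names simplified for the example; the exact types don't matter here):

CREATE TABLE Questions (
    QuestionID INT PRIMARY KEY,
    Title      VARCHAR(255) NOT NULL,
    Views      INT NOT NULL DEFAULT 0          -- the simple total counter
);

CREATE TABLE QuestionViews (
    QuestionID INT NOT NULL,
    UserID     INT NOT NULL,
    PRIMARY KEY (QuestionID, UserID)           -- one row per user per question
);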
Your storage solution seems to be the best way to track registered users. The tables don't have redundant data, and your many-to-many relationship is represented by its own table.
On another note:
For anonymous users you would need to record their IP address, and then you can use the SQL COUNT() function to get the number of unique visitors that way, though even an IP address isn't "unique" per se.
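For example, assuming an IPAddress column were added to the views table, the distinct-visitor count for one question could be pulled like this (names are illustrative):

SELECT COUNT(DISTINCT IPAddress) AS UniqueViews
FROM QuestionViews
WHERE QuestionID = 42;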
First of all, when you have a table that stores UserID-QuestionID pairs, you can simply count them; adding a Views column arguably goes against the rules of normalisation.
Another thing is that I can hit F5 as much as I want and inflate the count unless you implement a cookie-based check, only adding a row to the table when no cookie is found.
As for the IP address approach, which is far from being a solution at all, it will block people behind the same router from being counted separately.
The solution I'm thinking of (and also implementing right now) checks cookies, session IDs, and the DB table for registered users, and only adds a row to the table if none is found. It also records the SessionID and IP address anyway, but doesn't really rely on them.
If you are using ASP.NET Membership and the Anonymous Provider for anonymous users, then each anonymous user gets a row created in the aspnet_Users table as soon as you call Profile.Save(). In that case, you can track both anonymous and registered users viewing a certain page. All you need to do is record the user's UserID from aspnet_Users and the QuestionID.
However, I strongly discourage doing this at the database level since it can blow up your database. If you have 10,000 questions, 1,000 registered users and 100,000 anonymous users, and assuming each user visits 10 questions on average, you end up with about 1M rows in the tracking table. Quite some load.
Moreover, doing a SELECT COUNT() on the tracking table is quite a load on the database, especially if you are doing it on almost every page view of your forum. Best is to keep a total counter on the Questions table against each question. Whenever a unique user looks at a page, you just increase the counter.
Also, don't create an FK relationship from the tracking table to the user table. You will need to clean up the aspnet_Users table, as it piles up a lot of anonymous users over time who will most likely never come back. So the tracking table needs to have just the UserID field, with no FK. Moreover, you will have to clean up the tracking table over time as well, as it will keep accumulating millions of rows. That's why the TotalViews counter needs to live on the Questions table, and you should never use SELECT COUNT() to calculate the number of views while displaying the page.
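A rough sketch of that write path in T-SQL (since you're on ASP.NET Membership); table and column names are just placeholders:

BEGIN TRANSACTION;

-- record the view only if this user hasn't seen this question before;
-- note there is no foreign key to aspnet_Users, just the raw UserID value
INSERT INTO QuestionViews (QuestionID, UserID)
SELECT @QuestionID, @UserID
WHERE NOT EXISTS (SELECT 1 FROM QuestionViews
                  WHERE QuestionID = @QuestionID AND UserID = @UserID);

-- bump the denormalised counter only when a new row was actually inserted,
-- so displaying the page never needs a SELECT COUNT()
IF @@ROWCOUNT = 1
    UPDATE Questions
    SET TotalViews = TotalViews + 1
    WHERE QuestionID = @QuestionID;

COMMIT;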
Does this answer your question?
This feels like it should be a common requirement, but I'm not sure how best to implement it.
I'm going to make up a really simple example. It's similar enough to what I actually need, without getting over-complicated.
Imagine we have a table called transport and it has the following columns:
type
model_name
size
number_of_wheels
fuel
maximum_passenger_count
We want to store all sorts of different types of transportation in this table, but not every type will have values in every column. Imagine the commonality is a lot higher than here, as this is a bit of a fake example.
Here's a few examples of how this might work in practice:
type = cycle, we ban fuel, as it's not relevant for a cycle
type = bus, all columns are valid
type = sledge, we ban number_of_wheels, as sledges don't have wheels, we also ban fuel
type = car, all columns are valid
I want my UI to show a grid with a list of all the rows in the transport table. Users can edit the data directly in the grid and add new rows. When they add a new row, they're forced to pick the transport type in a dropdown before it appears in the grid. They then complete the details. All the values are optional, apart from the ones we explicitly don't want to record a value for, where we expect to not see anything at all.
I can see a number of ways to implement this, but none of them seems like a complete solution:
I could put this logic into the UI, enabling/disabling grid cells based on type. But there's nothing to stop someone directly inserting data into the "wrong" columns in the backend of the database or via the API, which would then come through into the UI unless I added a rule to also mask out values in disabled cells. Making changes to which columns are restricted per transport type would also be very difficult.
I could put this logic into the API, raising an error if someone enters data into a cell that should be disallowed. This closes one gap for insertion into the database via the API, but SQL scripts would still allow entry into the "wrong" columns. Also, the user experience would suck, as users would have to guess which columns to complete and which to leave blank. It would still be difficult to make changes to which columns are allowed/restricted.
I could add a trigger to the database, maybe to set the values to NULL if they shouldn't be allowed, but this seems clunky and users would not understand what was happening
I could add a generated column, but this doesn't help if I sometimes need to set a value and sometimes don't
I could just allow the unnecessary data to be stored in the database, then hide it by using a view to report it back. It doesn't seem great, as users would still see data disappearing from the UI with no explanation
I could add a second table, storing a matrix of which values are allowed and which are restricted by type (a rough sketch of such a rules table is below). The API, UI and database could all implement this list using different mechanisms. This comes with the advantage of having a single place to make changes that will immediately be reflected across the entire system, but it's a lot of work, and I have lots of these tables that work the same way.
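For illustration, the rules table in that last option might look something like this (names are invented, and the BOOLEAN type assumes something like PostgreSQL); the UI and API would read it, and a trigger or constraint in the database could enforce it:

CREATE TABLE transport_type_rule (
    type        VARCHAR(20) NOT NULL,
    column_name VARCHAR(63) NOT NULL,
    is_allowed  BOOLEAN     NOT NULL DEFAULT TRUE,
    PRIMARY KEY (type, column_name)
);

-- only the banned combinations need rows; anything missing is treated as allowed
INSERT INTO transport_type_rule (type, column_name, is_allowed) VALUES
    ('cycle',  'fuel',             FALSE),
    ('sledge', 'fuel',             FALSE),
    ('sledge', 'number_of_wheels', FALSE);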
I'm using Lazarus and a Firebird database to display several forms. The query behind the datasource (query_1) joins two, sometimes three, tables, and the data is shown in TDBEdit controls.
I need to place two buttons on the forms for the users: MODIFY and SAVE. If the data came from only one table, MODIFY would basically call Query_1.Edit and SAVE would call ApplyUpdates. With the joined tables in the query, ApplyUpdates gives an error, but I thought I could have SAVE call a query_2 to select the record in one table (and later the corresponding one in the other, all in the same transaction) and edit the modified field, taking the value from the TDBEdit.Text or the modified dataset's fields. This does not work, because Lazarus/Firebird does not let the user write into the TDBEdit of the joined table even though it is not ReadOnly.
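For clarity, what I imagine query_2 (and a query_3 for the second table) running on SAVE is something along these lines, all inside the same transaction; every table, column and parameter name here is made up:

UPDATE CUSTOMER
SET NAME = :NEW_NAME              -- value taken from the TDBEdit.Text / dataset field
WHERE CUSTOMER_ID = :CUST_ID;

UPDATE ADDRESS
SET CITY = :NEW_CITY
WHERE CUSTOMER_ID = :CUST_ID;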
I could instead use simple TEdits, populating them from the Qry_1 fields to display the data and make changes, and then use the modified texts to edit the corresponding fields, but this would require a lot of changes as well as programming to keep the user from introducing garbage, and seems not to be the best solution.
I believe I am complicating something that should be more "standard" and simple but have not found other solutions. Would appreciate somebody pointing me to the right “normal” practice.
Some of the Users in my database will also be Practitioners.
This could be represented by either:
an is_practitioner flag in the User table
a separate Practitioner table with a user_id column
It isn't clear to me which approach is better.
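For concreteness, a sketch of the two shapes (column names are illustrative):

-- Option 1: a flag on the User table
CREATE TABLE users (
    user_id         INTEGER PRIMARY KEY,
    email           TEXT    NOT NULL,
    is_practitioner BOOLEAN NOT NULL DEFAULT FALSE
);

-- Option 2: a separate Practitioner table keyed by user_id
CREATE TABLE practitioners (
    user_id INTEGER PRIMARY KEY REFERENCES users (user_id)
    -- practitioner-only columns would go here
);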
Advantages of flag:
fewer tables
only one id per user (hence no possibility of confusion about which id to use in other tables)
flexibility (I don't have to decide whether fields are Practitioner-only or not)
possible speed advantage for finding User-level information for a practitioner (e.g. e-mail address)
Advantages of new table:
no nulls in the User table
clearer as to what information pertains to practitioners only
speed advantage for finding practitioners
In my case specifically, at the moment, practitioner-related information is generally one-to-many (such as the locations they can work in, or the shifts they can work, etc.). I would not be at all surprised if it turns out I need to store simple attributes for practitioners (i.e., one-to-one).
Questions
Are there any other considerations?
Is either approach superior?
You might want to consider the fact that someone who is a practitioner today may be something else tomorrow. (And by that I don't mean no longer being a practitioner.) Say a consultant, an author, or whatever the variants are in your subject domain, and you might want to keep track of their latest status in the Users table. So it might make sense to have a ProfType field (type of professional practice) or equivalent. This way you have all the advantages of having a flag: you could keep it as a string field and leave it as a blank string, or fill it with other ProfType codes as your requirements grow.
You mention that having a new table has the advantage of making it faster to find practitioners. No, you are better off with a WHERE clause on the Users table for that.
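For example (column and value names are only illustrative):

SELECT user_id, email
FROM users
WHERE prof_type = 'PRACTITIONER';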
Your last paragraph (one-to-many), however, may tilt the whole choice in favour of a separate table. You might also want to consider the likely number of records, likely growth, the criticality of complicated queries, etc.
I tried to draw the two scenarios, with some notes inside the image. It's really only a draft, just to help you "see" the various entities. Maybe you have already done something like it; in that case, please don't consider my answer. As Whirl stated in his last paragraph, you should consider other things too.
Personally I would go for a separate table - as long as you can already identify some extra data that make sense only for a Practitioner (e.g.: full professional title, University, Hospital or any other Entity the Practitioner is associated with).
So if in the future you discover more data that makes sense only for the Practitioner, and/or identify another distinct "subtype" of User (e.g. Intern), you can just add fields to the Practitioner subtable, or a new table for the Intern.
It might be advantageous to use a User Type field as suggested by @Whirl Mind above.
I think that this is just one example of having to identify different type of Objects in your DB, and for that I refer to one of my previous answers here: Designing SQL database to represent OO class hierarchy
I'm building a DynamoDB table that holds notification messages. Messages are directed from a given user (from_user) to another user (to_user). They're quite simple:
{ "to_user": "e17818ae-104e-11e3-a1d7-080027880ca6", "from_user": "e204ea36-104e-11e3-9b0b-080027880ca6", "notification_id": "e232f73c-104e-11e3-9b30-080027880ca6", "message": "Bob recommended a good read.", "type": "recommended", "isbn": "1844134016" }
These are the Hash/Range keys defined on the table:
HashKey: to_user, RangeKey: notification_id
Case 1: Users regularly phone home to ask for any available notifications.
With these keys, it's easy to fetch the notifications awaiting a given user:
notifications.query(to_user="e17818ae-104e-11e3-a1d7-080027880ca6")
Case 2: Once a user has seen a message, they will explicitly acknowledge it and it will be deleted. This is similarly simple to accomplish with the given Hash/Range keys:
notifications.delete(to_user="e17818ae-104e-11e3-a1d7-080027880ca6", notification_id="e232f73c-104e-11e3-9b30-080027880ca6")
Case 3: It may sometimes be necessary to delete items in this table identified by keys other than to_user and notification_id. For example, user Bob decides to un-recommend a book and we would like to pull notifications with from_user=Bob, type=recommended and isbn=isbnval.
I know this can't be done with the Hash/Range keys I've chosen. Local secondary indexes also seem unhelpful here since I don't want to work within the table's chosen HashKey.
So am I stuck doing a full Scan? I can imagine creating a second table to map from_user+action+isbn back to items in the original table but that means I have to manage that additional complexity... and it seems like this hand-rolled index could get out of sync easily.
Any insights would be appreciated. I'm new to DynamoDB and trying to understand how typical data models map to it. Thanks.
Your analysis is correct. For case 3 and this schema, you must do a table scan.
There are a number of options which you've identified, but all of them will add a layer of complexity to your application.
1. Use a second table, as you state. You are effectively creating your own global index and must manage that complexity yourself. This grows in complexity as you require more indices.
2. Perform a full table scan. Look at DynamoDB's scan segmenting for a method of distributing the scan across multiple worker nodes. Depending on your latency requirements (is it OK if the recommendations don't go away until the next scan?), you may be able to combine this and other future background tasks into a constant background process. This is also simpler than option 1.
Both of these seem to be fairly common models.
I am making an API over HTTP that fetches many rows from PostgreSQL with pagination. In ordinary cases, I usually implement such pagination through a naive OFFSET/LIMIT clause. However, there are some special requirements in this case:
There are so many rows that I believe users cannot reach the end (imagine a Twitter timeline).
Pages do not have to be randomly accessible, only sequentially.
The API would return a URL which contains a cursor token that directs to the page of the next continuous chunk.
Cursor tokens do not have to exist permanently, only for some time.
The ordering fluctuates frequently (like Reddit rankings); however, a continuing cursor should keep its consistent ordering.
How can I achieve this? I am ready to change my whole database schema for it!
Assuming it's only the ordering of the results that fluctuates and not the data in the rows, Fredrik's answer makes sense. However, I'd suggest the following additions:
Store the id list in a PostgreSQL table using the array type rather than in memory. Doing it in memory, unless you carefully use something like Redis with auto-expiry and memory limits, is setting yourself up for a DoS memory-consumption attack. I imagine it would look something like this:
create table foo_paging_cursor (
cursor_token ..., -- probably a uuid is best or timestamp (see below)
result_ids integer[], -- or text[] if you have non-integer ids
expiry_time TIMESTAMP
);
You need to decide if the cursor_token and result_ids can be shared between users to reduce your storage needs and the time needed to run the initial query per user. If they can be shared, choose a cache window, say 1 or 5 minutes, and then upon a new request create the cursor_token for that time period and check to see if the result ids have already been calculated for that token. If not, add a new row for that token. You should probably add a lock around the check/insert code to handle concurrent requests for a new token.
Have a scheduled background job that purges old tokens/results and make sure your client code can handle any errors related to expired/invalid tokens.
Don't even consider using real db cursors for this.
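To show how a page would then be served from that table, here is a sketch; it assumes PostgreSQL 9.4+ for WITH ORDINALITY, a uuid primary key on cursor_token, an illustrative target table called items, and a page size of 50:

-- fetch the second page (rows 51-100) for a cursor, preserving the ordering
-- that was frozen into result_ids when the cursor row was created;
-- :token stands for whatever parameter placeholder your client library uses
SELECT t.*
FROM foo_paging_cursor c
CROSS JOIN LATERAL unnest(c.result_ids[51:100]) WITH ORDINALITY AS ids(id, ord)
JOIN items t ON t.id = ids.id
WHERE c.cursor_token = :token
  AND c.expiry_time > now()
ORDER BY ids.ord;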
Keeping the result ids in Redis lists is another way to handle this (see the LRANGE command), but be careful with expiry and memory usage if you go down that path. Your Redis key would be the cursor_token and the ids would be the members of the list.
I know absolutely nothing about PostgreSQL, but I'm a pretty decent SQL Server developer, so I'd like to take a shot at this anyway :)
How many rows/pages do you expect a user would maximally browse through per session? For instance, if you expect a user to page through a maximum of 10 pages per session [each page containing 50 rows], you could take that max and set up the webservice so that when the user requests the first page, you cache 10*50 rows (or just the ids for the rows, depending on how much memory/how many simultaneous users you've got).
This would certainly help speed up your webservice, in more ways than one. And it's quite easy to implement too. So:
When a user requests data from page #1: run a query (complete with order by, join checks, etc.), store all the ids in an array (but a maximum of 500 ids), and return the data rows that correspond to the ids at positions 0-49 in the array.
When the user requests page #2-10: return the data rows that correspond to the ids in the array at positions (page-1)*50 to page*50-1.
You could also bump up the numbers; an array of 500 ints would only occupy about 2K of memory, but it also depends on how fast you want your initial query/response to be.
I've used a similar technique on a live website, and when the user continued past page 10, I just switched to queries. I guess another solution would be to continue to expand/fill the array (running the query again, but excluding already included ids).
Anyway, hope this helps!