How do I design a data warehouse model that allows me to dynamically query for total action count, unique user count, and a count of total users - amazon-redshift

I'm currently facing a problem where I am trying to create a login utilization report for a web application. To describe the report a bit, users in our system are tagged with different metadata about the user. For example, I could be tagged with "New York City" and "Software Engineer", while other users may be tagged with different locations and job titles. The utilization report is essentially the following:
Time period (quarterly)
Total number of logins
Unique logins
Total users
"Engagement percentage" (Unique logins / Total users)
The catch is, the report needs to be a bit dynamic. I need to be able to apply any combination of job titles and locations and have each of the numbers reflect the applied metadata. The time period also needs to be easily adjustable to support weekly, monthly, and yearly reporting. Ideally, I can create a view in Redshift that allows our BI software users to run this report whenever they see fit.
My question is, what is an ideal strategy to design a data model to support this report? I currently have an atomic fact table that contains all logins with this schema:
User ID
Login ID
Login Timestamp
Job Title Group ID (MD5 hash of job titles to support multi-valued attributes)
Location Group ID (MD5 hash of locations to support multi-valued attributes)
The fact table allows me to easily write a query to aggregate on total (count of login id) and unique (distinct count of user id).
How can I supplement the data I have to include a count of total users? Is what I currently have the best approach?

Hierarchical, fixed-depth many-to-one (M:1) relationships between attributes are typically denormalized or collapsed into a flattened dimension table. If you’ve spent most of your career designing entity-relationship models for transaction processing systems, you’ll need to resist your instinctive tendency to normalize or snowflake a M:1 relationship into smaller subdimensions; dimension denormalization is the name of the game in dimensional modeling.
It is relatively common to have multiple M:1 relationships represented in a single dimension table. One-to-one relationships, like a unique product description associated with a product code, are also handled in a dimension table. Occasionally many-to-one relationships are resolved in the fact table, such as the case when the detailed dimension table has millions of rows and its roll-up attributes are frequently changing. However, using the fact table to resolve M:1 relationships should be done sparingly.
In your case, I would recommend a design along the following lines.
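A minimal sketch of one way that could look, assuming you add a denormalized dim_user table that carries the same job_title_group_id and location_group_id columns as the login fact; every table, column, and connection detail below is illustrative rather than taken from the original post:

```typescript
// Rough sketch only: dim_user, fact_login, and all column names are assumptions.
import { Client } from "pg"; // Redshift speaks the PostgreSQL wire protocol

const utilizationSql = `
  WITH filtered_users AS (          -- the population, fixed by the metadata filters
    SELECT user_id
    FROM   dim_user
    WHERE  job_title_group_id = $1
      AND  location_group_id  = $2
  ),
  user_count AS (
    SELECT COUNT(*) AS total_users FROM filtered_users
  ),
  logins AS (                       -- only logins by users in the filtered population
    SELECT f.user_id,
           f.login_id,
           DATE_TRUNC('quarter', f.login_ts) AS period   -- swap for week/month/year
    FROM   fact_login f
    JOIN   filtered_users u ON u.user_id = f.user_id
  )
  SELECT l.period,
         COUNT(l.login_id)                 AS total_logins,
         COUNT(DISTINCT l.user_id)         AS unique_logins,
         MAX(c.total_users)                AS total_users,
         COUNT(DISTINCT l.user_id)::float
           / NULLIF(MAX(c.total_users), 0) AS engagement_pct
  FROM   logins l
  CROSS JOIN user_count c
  GROUP  BY l.period
  ORDER  BY l.period;
`;

async function runUtilizationReport(jobTitleGroupId: string, locationGroupId: string) {
  const client = new Client(); // connection details come from the environment
  await client.connect();
  try {
    const { rows } = await client.query(utilizationSql, [jobTitleGroupId, locationGroupId]);
    return rows;
  } finally {
    await client.end();
  }
}
```

The point of the user dimension is that total_users is counted under the same metadata filters but independently of the fact table, so the engagement percentage stays meaningful even in periods where only part of the population logs in; changing the DATE_TRUNC unit gives the weekly, monthly, or yearly versions of the report.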

REST - should endpoints include summary data?

Simple model:
Products, which have weights (can be mixed - ounces, grams, kilograms etc)
Categories, which have many products in them
REST endpoints:
/products - get all products and post to add a new one
/products/id - delete,update,patch a single product
/categories/id - delete,update,patch a single category
/categories - get all categories and post to add a new one
The question is that the frontend client wants to display a chart of the total weight of products in ALL categories. Imagine a bar chart or a Pie chart showing all categories, and the total weight of products in each.
Obviously the product model has a 'weight_value' and a 'weight_unit' so you know the weight of a product and its measure (oz, grams etc).
I see 2 ways of solving this:
In the category model, add a computed field that totals the weight of all the products in a category. The total is calculated on the fly for that category (not stored) and so is always up to date. Note the client always needs all categories (e.g. to populate drop-downs when creating a product, you have to choose the category it belongs to), so it will now automatically always have the total weight of each category. Constructing the chart is therefore easy - no need to fetch the chart data from the server, it's already on the client. But loading that data for the first time might be slow. Only when a product is added will the totals be stale (an insignificant change to the overall number, and stale totals are fine anyway).
Create a separate endpoint, say categories/totals, that returns for all categories: a category id, name and total weight. This endpoint would loop through all the categories, calculating the weight for each and returning a category dataset with weights for each.
The problem I see with (1) is that it is not that performant. I know it's not very scalable, especially when a category ends up with a lot of products (millions!), but this is a hobby project so I'm not worried about that.
The advantage of (1) is that you always have the data on the client and don't have to make a separate request to get the data for the chart.
However, the advantage of (2) is that every request to /categories/id does not incur a potentially expensive totalling operation (because that now lives inside its own dedicated endpoint). Of course, that endpoint will have to run a fairly complex database query to calculate the weights of all products for all categories (although handing that off to the database should be the way to go, since that is what databases are good for!).
I'm really stumped on which is the better way to do this. Which way is more true to the RESTful way (HATEOAS, basically)?
I would go with option 2, favouring scalability and best practices. It does not really make much sense to perform the weight calculation every time a category is requested, and even though you may not anticipate this being a problem since it is a 'hobby' project, it's always best to avoid shortcuts where the benefits are minimal (or so experience has taught me!).
The only advantage of choosing option 1 is that you set up one less endpoint and avoid one extra call to get the weight totals - the extra call shouldn't add too much overhead, and setting up the extra endpoint shouldn't take too much effort.
Regardless of whether you choose 1 or 2, I would consider caching the weight totals (for a reasonable amount of time, depending on the accuracy required) to increase performance even further.
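As a sketch of what option 2 could look like: the question doesn't name a web framework or database, so Express and a SQL store reached through node-postgres are assumptions here, and the table/column names and unit factors are only illustrative of pushing the totalling into one grouped query.

```typescript
// Sketch only: framework, database, schema and unit factors are assumptions.
import express from "express";
import { Pool } from "pg";

const app = express();
const pool = new Pool(); // connection settings come from the environment

// GET /categories/totals -> [{ id, name, total_weight_grams }]
// The totalling is handed off to the database: every product weight is
// normalised to grams and summed per category in a single grouped query.
app.get("/categories/totals", async (_req, res) => {
  const { rows } = await pool.query(`
    SELECT c.id,
           c.name,
           COALESCE(SUM(p.weight_value * CASE p.weight_unit
                          WHEN 'oz' THEN 28.3495
                          WHEN 'kg' THEN 1000
                          ELSE 1               -- grams
                        END), 0) AS total_weight_grams
    FROM categories c
    LEFT JOIN products p ON p.category_id = c.id
    GROUP BY c.id, c.name
    ORDER BY c.name
  `);
  res.json(rows);
});

app.listen(3000);
```

A response cache on this one endpoint, as suggested above, is also easy to bolt on, since the payload is small and a slightly stale total is acceptable for a chart.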

MongoDB scheme on a big project

We recently started work on a big project and decided to use MongoDB as the database solution.
We have written a lot of code, but as the project has grown we found ourselves trying to use joins instead of doing things the NoSQL way, which points to a bad database design.
What I'm trying to ask here is a good design for our project, which, at this point consists of the following:
More than 12,000 products
More than 2,000 sellers
Every seller should have their own private area that allows them to create a product catalog based on the 12,000+ product "template list".
The seller should be able to set the price, stock and offers, which will then be reflected only in their public product listing. The template list of products will remain unchanged.
Currently we have two collections. One for the products (which holds the general product information, like name, description, photos, etc...) and one collection in which we store documents that contain the ID of the product from the first collection, an ID that is related to the seller and the stock, price and offers values.
We are using aggregate with $lookup to "emulate" SQL's left join to merge the two collections, but the process is not scaling as we'd like it to and we're hitting serious performance issues.
We're aware that using joins is not the way to go in NoSQL. What should we do? How should we refactor our database design? Should we embed the prices, offers and stock for each seller in each document?
The decision between embedded documents and joins across two or more collections should depend on how you are going to retrieve the data. If, every time you fetch a product, you are also going to fetch its sellers, then it makes sense to embed them rather than use separate collections. But if you plan to fetch these two entities separately, then the only option you are left with is a join.
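To make the trade-off concrete, here is a small sketch with the MongoDB Node driver; the collection names, field names and connection string are assumptions based on the description above, not the project's actual schema. The first function is the current $lookup join; the second shows the embedded alternative, where a seller's listing copies the few template fields its read path actually needs.

```typescript
// Sketch only: names and connection string are assumptions for illustration.
import { MongoClient, ObjectId } from "mongodb";

const client = new MongoClient("mongodb://localhost:27017");
await client.connect();
const db = client.db("catalog");

// Current approach: join seller listings to the product templates with $lookup.
async function listingsWithLookup(sellerId: string) {
  return db.collection("listings").aggregate([
    { $match: { sellerId } },
    {
      $lookup: {
        from: "products",        // the 12,000-product template collection
        localField: "productId",
        foreignField: "_id",
        as: "product",
      },
    },
    { $unwind: "$product" },
  ]).toArray();
}

// Embedded alternative: copy the template fields the listing page actually
// needs into the seller's listing document at write time, so reads become a
// single query on one collection with no join.
async function createListing(sellerId: string, productId: string, price: number, stock: number) {
  const template = await db.collection("products").findOne({ _id: new ObjectId(productId) });
  return db.collection("listings").insertOne({
    sellerId,
    productId: new ObjectId(productId),
    price,
    stock,
    product: {                          // denormalised copy
      name: template?.name,
      photo: template?.photos?.[0],
    },
  });
}
```

The denormalised copy has to be refreshed whenever the template's name or photos change, which is the usual price of optimising for reads; the upside is that rendering a seller's public listing becomes a single query against one collection.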

Which way of storing this data in MongoDB is more performant? Caching max/min values in Item collection or on-the-fly calculation based on all bids?

I'm working with a startup building an exchange platform where commodities from an Item collection with around 50,000 documents can be bought and sold by users, who create buy and sell bids for these items.
For our "buy it now"/"sell it now" features, it's required to calculate the best buy and sell bids for an item. Currently we are calculating these on the fly with an index in the UserBids collection on the buy and sell bids field (for a given Item document, let's say with ID 1234, we'll find all UserBids for item 1234 and get the maximum buy bid and minimum sell bid). This is used to present every user with the best price they can buy/sell an item instantly at, and requires a lot of queries on the UserBids collection, but prevents having to update a canonical 'best' price for each item.
I'm wondering if it would be more performant for the Item schema to have a MaxBuy and MinSell field. This would require the MaxBuy and MinSell fields for an Item document to receive an update every time a user enters a new bid, using something like Items.update({id: itemId, $or: [{maxBuy: {$lt: currentBuyBid}}, {maxBuy: null}]}). We would still have to perform the same number of queries to show a user the best price, but the queries wouldn't require an aggregation, and as the exchange grows we expect the UserBids collection to grow much more than the Items collection (which should stay roughly the same size).
Bids may be added/modified regularly, but we expect the volume of users checking best buy/sell prices to be about 10-100 times greater. Is there a good way to evaluate which one of these approaches would be best?
This mostly depends on which use-case is more frequent and performance-critical:
a user placing a bid which would trigger a recalculation of said fields
someone checks the price
If you assume that the latter use case is more frequent, that is the one you should optimize for.
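As a rough sketch of the cached variant (the maxBuy/minSell field names come from the question; the collection names and everything else here are assumptions), $max and $min write to the Item document only when a new bid actually improves on the cached best price, so the extra write cost per bid is small:

```typescript
// Sketch only: collection names and document shapes are assumptions.
import { MongoClient } from "mongodb";

interface Item {
  _id: number;      // the question's items use numeric IDs (e.g. 1234)
  maxBuy?: number;
  minSell?: number;
}

const client = new MongoClient("mongodb://localhost:27017");
await client.connect();
const db = client.db("exchange");
const items = db.collection<Item>("items");

// Called on every new bid; $max/$min touch the Item document only when the
// new bid actually improves on the cached best price.
async function recordBid(itemId: number, side: "buy" | "sell", price: number) {
  await db.collection("userBids").insertOne({ itemId, side, price, createdAt: new Date() });

  const update = side === "buy"
    ? { $max: { maxBuy: price } }   // keep the highest outstanding buy bid
    : { $min: { minSell: price } }; // keep the lowest outstanding sell ask
  await items.updateOne({ _id: itemId }, update);
}

// Showing a user the instant buy/sell price is now a single indexed point lookup.
async function bestPrices(itemId: number) {
  return items.findOne(
    { _id: itemId },
    { projection: { maxBuy: 1, minSell: 1 } },
  );
}
```

One thing the cached approach does need is a recomputation path for when the current best bid is withdrawn or filled, since $max/$min can only push the cached value in one direction; a periodic or event-driven rebuild from UserBids covers that.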

How do I fetch objects from Core Data that do not have any relatives that meet a set of criteria?

I'm new to Core Data and having a little trouble understanding the best way to fetch data efficiently, particularly with entities that are related.
Imagine that I have two entities: Patients and Appointments. Patients have many Appointments.
I want to fetch all the patients that haven't had an appointment this [Patient.appointment_frequency], where appointment_frequency is weekly, monthly, etc.
How would I do that, particularly in a way that's fast with hundreds or thousands of Patient objects and hundreds of appointments per patient?
First, you'd predicate your fetch request for appointments within the appointment-frequency threshold. The returned set will contain each of the matching appointment objects; you can then build a patient set by asking each returned appointment for its patient, and then group the appointments by patient (i.e. to present them in a table view).
If the result set contains many objects (hundreds, thousands), Core Data will manage populating and faulting the objects in the result set, so don't concern yourself with memory or performance unless you're actually using it and find performance less than expected.
Apple has provided a core data programming guide with sample code. It explains how to make different kinds of fetch requests and is very clearly written.
In your situation, I'd take the current date and subtract a week (or month, or whatever the frequency says). Then create a fetch request for all patients using a predicate that says you want all patients with frequency X whose last appointment is earlier than the calculated date. The returned patients are the ones who need to schedule appointments.

Tracking Unique Page Views

What is the best method of tracking page views (uniques specially) for a certain page?
Example: Threads in a forum, Videos in a video website, questions in a Q/A script(SO).
Currently, I'm taking the approach of just having a simple "Views" column for each row I'm trying to count the views for; however, I know this is not the most efficient way.
And for unique views, I have a separate table that holds a row with the "QuestionID" and "UserID". When a user visits a question/page, my code attempts to find a row in the views table with that "UserID" and "QuestionID"; if it can't, it adds a row and then increments the Views value of the question in the "Questions" table.
Your solution for storage seems to be the best way to track views for registered users. The tables don't hold redundant data, and your many-to-many relationship is represented by its own table.
On another note:
For anonymous users you would need to record their IP address, and then you can use the COUNT() SQL function to get the number of unique visitors that way, but even an IP address is not "unique" per se.
First of all, when you have a table that stores userid-questionid pairs, it means you can simply count them; adding a views column is against the rules of normalisation, I suppose.
Another thing is that I can hit F5 as much as I want; you need to implement a cookie-based check, and only add a row to the table if no cookie is found.
As for the IP address approach, which is far from being a solution at all, it will lump together people behind routers who share a single address.
The solution I'm thinking of (and implementing right now) checks cookies, session IDs, and the DB table for registered users, and if none is found, adds a row to the table. It also records session IDs and IP addresses anyway, but doesn't really rely on them.
If you are using ASP.NET Membership and the Anonymous Provider for anonymous users, then each anonymous user gets a row created in the aspnet_Users table as soon as you call Profile.Save(). In that case, you can track both anonymous and registered users viewing a certain page. All you need to do is record the aspnet_Users UserID and the QuestionID.
However, I strongly discourage doing this at the database level since it can blow up your database. If you have 10,000 questions, 1,000 registered users and 100,000 anonymous users, and assume each user visits 10 questions on average, you end up with about 1M rows in the tracking table. Quite some load.
Moreover, doing a SELECT COUNT() on the tracking table puts quite some load on the database, especially if you are doing it on almost every page view of your forum. It is best to keep a total counter on the Questions table against each question. Whenever a unique user looks at a page, you just increase the counter.
Also, don't create an FK relationship from the tracking table to the user table. You will need to clean up the aspnet_Users table, as it piles up a lot of anonymous users over time who will most likely never come back. So the tracking table needs to have just the UserID field, with no FK. Moreover, you will have to clean up the tracking table over time as well, as it will keep accumulating millions of rows. That's why the total-view counter needs to live on the Questions table, and you should never use SELECT COUNT() to calculate the number of views while displaying the page.
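The counter-on-the-Questions-table advice boils down to at most two statements per first-time view. A minimal sketch, assuming a PostgreSQL-style database and illustrative table names; the discussion above is about ASP.NET/SQL Server, where the same idea would use a unique key plus a conditional INSERT instead of ON CONFLICT:

```typescript
// Sketch only: PostgreSQL syntax and table/column names are assumptions; the
// idea is a unique (question_id, user_id) pair plus a denormalised counter.
import { Pool } from "pg";

const pool = new Pool();

// question_views has a primary key on (question_id, user_id), so the insert
// is a no-op for repeat visits and the counter is bumped only on first view.
async function recordView(questionId: number, userId: string) {
  const inserted = await pool.query(
    `INSERT INTO question_views (question_id, user_id)
     VALUES ($1, $2)
     ON CONFLICT (question_id, user_id) DO NOTHING`,
    [questionId, userId],
  );

  if (inserted.rowCount === 1) {
    // First time this user views this question: maintain the cached total so
    // rendering the page never needs a SELECT COUNT(*) over the tracking table.
    await pool.query(
      `UPDATE questions SET unique_views = unique_views + 1 WHERE id = $1`,
      [questionId],
    );
  }
}
```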
Does this answer your question?