Rails 4 + PostgreSQL (Heroku): complex/scalable queries for analysis/reporting on data - postgresql

I'm working on a financial application that tracks sales, but I'm running into problems trying to create a schema that properly tracks the data for reports (the main point of the app).
A purchase is the foundation of the app. It has several associations (listed below). Each purchase is tracked via year and month fields. A year is the smallest unit a user may filter a report by, so I only have to show data for each month in that year.
# purchase.rb model
class Purchase < ActiveRecord::Base
  # Associations:
  #   belongs_to :partner
  #   belongs_to :purchase_type
  #   belongs_to :purchase_category
  # Attributes:
  #   partner_id           => association
  #   purchase_type_id     => association
  #   purchase_category_id => association
  #   year   => year as an integer (2013, 2014, etc.)
  #   month  => month as an integer ("January" => 1, etc.)
  #   amount => amount a product sold for, in cents ($10.00 => 1000)
  #   fee    => fee for the associated partner (if there is one), in cents ($2.00 => 200)
end
The problem is that I need to show an overview for a given year that breaks things down by how many purchases were completed, which partners completed them, and what the fee amounts were. I solved that by having YearMetric and MonthMetric tables that are updated every time a purchase is added/updated/removed. So when you add a new purchase for a given year/month, the corresponding YearMetric and MonthMetric rows are found and updated with +/- the appropriate amounts/fees.
This solution works well for the overview page. However, I also need to be able to view purchases in the context of partners, purchase_types, and purchase_categories. If I followed the same strategy as my overview report, I would have to add the following tables:
PartnerYearMetric, PartnerMonthMetric
PurchaseCategoryYearMetric, PurchaseCategoryMonthMetric
PurchaseTypeYearMetric, PurchaseTypeMonthMetric
So every time I add a purchase, I would be doing up to 8 additional DB updates (8 finds and then 8 updates).
The items I'm reporting on are total purchases made, average purchases (historical comparison), total amounts/fees for the period, top partners by number of purchases and by highest fee totals, etc.
There has to be a better solution than this. "Live calculation" by updating 8 records for every 1 purchase seems a bit overkill.

What you're doing is maintaining materialized views of the data in the application. It's a form of denormalization. That can be OK as an optimization but should not be your first choice. It can be very error prone, especially in the presence of concurrency, and must be done quite carefully.
Instead, when you wish to generate a summary report, use an aggregate to SUM them, COUNT them, etc. as appropriate. See aggregate functions in the Pg docs, Rails Calculations, and Rails aggregates.
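As a rough sketch of what that looks like in SQL (assuming the Rails-conventional purchases table for the Purchase model above):

-- Monthly totals for one year, computed on the fly instead of via
-- YearMetric/MonthMetric rows.
SELECT month,
       COUNT(*)    AS purchases_count,
       SUM(amount) AS total_amount_cents,
       SUM(fee)    AS total_fee_cents
FROM purchases
WHERE year = 2014
GROUP BY month
ORDER BY month;

-- The same idea covers "top partners by fees" for a period:
SELECT partner_id,
       COUNT(*) AS purchases_count,
       SUM(fee) AS total_fee_cents
FROM purchases
WHERE year = 2014
GROUP BY partner_id
ORDER BY total_fee_cents DESC
LIMIT 10;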
You may find it convenient to create a VIEW over the query you use, and then access the view from the application.
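For example (the view name is just a placeholder):

CREATE VIEW monthly_purchase_totals AS
SELECT year, month,
       COUNT(*)    AS purchases_count,
       SUM(amount) AS total_amount_cents,
       SUM(fee)    AS total_fee_cents
FROM purchases
GROUP BY year, month;

-- The app can then query it like an ordinary table:
-- SELECT * FROM monthly_purchase_totals WHERE year = 2014;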
If you find performance of calculating the aggregates in real time for the summary to be a problem, and you cannot solve it with proper indexing and tuning, then you should think about denormalizing. Rather than maintaining your materialized views in the app, though, consider using triggers in the database; they're much easier to write in a concurrency-safe way.
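A minimal sketch of the trigger approach, assuming a summary table named month_metrics; this version only handles INSERT, and the update-then-insert fallback still needs care under heavy concurrency (an upsert or explicit locking):

CREATE TABLE month_metrics (
    year         integer NOT NULL,
    month        integer NOT NULL,
    total_amount bigint  NOT NULL DEFAULT 0,
    total_fees   bigint  NOT NULL DEFAULT 0,
    PRIMARY KEY (year, month)
);

CREATE OR REPLACE FUNCTION bump_month_metrics() RETURNS trigger AS $$
BEGIN
    UPDATE month_metrics
       SET total_amount = total_amount + NEW.amount,
           total_fees   = total_fees   + NEW.fee
     WHERE year = NEW.year AND month = NEW.month;
    IF NOT FOUND THEN
        INSERT INTO month_metrics (year, month, total_amount, total_fees)
        VALUES (NEW.year, NEW.month, NEW.amount, NEW.fee);
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER purchases_month_metrics
AFTER INSERT ON purchases
FOR EACH ROW EXECUTE PROCEDURE bump_month_metrics();

-- UPDATE and DELETE on purchases would need similar branches that
-- subtract OLD.amount/OLD.fee before adding the new values.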
You may also want to look up PostgreSQL 9.4's enhanced materialized views support.
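As a sketch of the built-in materialized view route (CONCURRENTLY is the 9.4 addition and requires a unique index on the view):

CREATE MATERIALIZED VIEW monthly_purchase_totals_mv AS
SELECT year, month,
       COUNT(*)    AS purchases_count,
       SUM(amount) AS total_amount_cents,
       SUM(fee)    AS total_fee_cents
FROM purchases
GROUP BY year, month;

CREATE UNIQUE INDEX ON monthly_purchase_totals_mv (year, month);

-- Refresh on whatever schedule the reports can tolerate:
REFRESH MATERIALIZED VIEW CONCURRENTLY monthly_purchase_totals_mv;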

Related

Segregate Products based on shipping <SQL>

I have 10 different products (A, B, C, ..., J) that have multiple purchase dates (by various customers) and delivery dates. I want to see which products have a date difference of less than 5 days and, of those, which have a customer rating of more than 3. If those criteria are satisfied, I want to fetch the products that have the minimum date difference, along with the "Important_date". If there are several rows with the same minimum date difference for a particular product, I would like to select the most recent one for that product and mark its purchase date as the "Important_date".
The columns in the table are: Product, Purchasedate, deliverydate, date_difference, customer_rating.
I am trying to use case statements to solve the problem in PostgreSQL.
I am looking for an output which will give me all the columns of the table along with "Important_date."
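A rough PostgreSQL sketch of one reading of the requirement (per product, pick the row with the smallest date_difference, breaking ties by the most recent purchase date). The table name product_sales and the "rating more than 3" reading are assumptions, and DISTINCT ON is used here instead of CASE statements:

SELECT DISTINCT ON (Product)
       Product, Purchasedate, deliverydate, date_difference, customer_rating,
       Purchasedate AS Important_date
FROM product_sales                -- hypothetical table name
WHERE date_difference < 5
  AND customer_rating > 3         -- assumed reading of the rating condition
ORDER BY Product, date_difference ASC, Purchasedate DESC;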

How to accomplish star schema in Tableau with multiple facts (without records falling off)

I have a fairly simple data model which consists of a star schema of 2 Fact tables and 2 dimension tables:
Fact 1 - Revenue
Fact 2 - Purchases
Dimension 1 - Time
Dimension 2 - Product
These tables are at different levels of granularity - meaning a given date could have many rows across many products. A specific date and product may have revenue, but no purchases. Likewise it may have purchases but no revenue.
Each fact joins to both dimensions, which contain additional detail such as the product name, product category, etc.
What I would like to do is combine these two facts so that I can report revenue and purchases together (for example, by date, by product, or by date and product combined).
I can get very close with data blending; however, the issue I run into is that data blending only supports a pseudo 'inner join'. As a result, if either of these data sources is specified as primary, then dates without purchases/revenue will cause rows in the secondary source to fall off.
What is the best way to blend this data without causing records to fall off?
1) Create a union of your fact tables (see the SQL sketch below). There will be mismatched fields, but that is OK.
2) Perform data blending on the connection to bring in additional dimensions (not shown in the example).
3) Build the view.
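If it is easier to build the union in the database than in Tableau, it could look roughly like this (table and column names are placeholders):

SELECT date_id, product_id, revenue, NULL AS purchases
FROM fact_revenue
UNION ALL
SELECT date_id, product_id, NULL AS revenue, purchases
FROM fact_purchases;

-- The mismatched measure is simply NULL on the side that doesn't have it,
-- and the Time and Product dimensions can then be joined or blended in.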

How to Handle Rows that Change over Time in Druid

I'm wondering how we could handle data that changes over time in Druid. I realize that Druid is built for streaming data where we wouldn't expect a particular row to have data elements change. However, I'm working on a project where we want to stream transactional data from a logistics management system, but there's a calculation that happens in that system that can change for a particular transaction based on other transactions. What I mean:
- 9th of the month: I post transaction A with a date of today (the 9th), which results in the stock on hand coming to 0 units.
- 10th of the month: I post transaction B with a date of the 1st of the month, crediting my stock amount by 10 units. At this time (on the 10th of the month) the stock on hand for transaction A recalculates to 10 units. The same would be true for ALL transactions after the 1st of the month.
As I understand it, we would re-extract transaction A, resulting in transaction A2.
The stock on hand dimension is incredibly important to our metrics. Specifically, identifying when stockouts occur (when stock on hand = 0). In the above example, if I have two rows for transaction A, I would be mistakenly identifying a stockout with transaction A1, whereas transaction A2 is the source of truth.
Is there any ability to archive a row and replace it with an updated row, or do we need to add logic to our queries that finds the rows with the freshest timestamp per transaction id?
Thanks
I have two thoughts that I hope help you. The key documentation for this is "Updating Existing Data": http://druid.io/docs/latest/ingestion/update-existing-data.html which gives you three options: Lookup Tables, Reindexing, and Delta Ingestion. The last one, Delta Ingestion, is only for adding new rows to old segments, so that's not very useful for you. Let's go over the other two.
Reindexing: You can crunch all the numbers that change in your ETL process, identify the segments that would need to be reloaded, and simply have Druid re-index those segments. That will replace the stock-on-hand value for A in your example whenever you want, whenever you do the re-indexing.
Lookups: If you have stock values for multiple products, you can store the product id in the segment and have that be immutable, but look up the stock-on-hand value in a lookup. So, you would store:
A, 2018-01-01, product-id: 123
And in your lookup, you'd have:
product-id: 123, stock-on-hand: 0
And later, you'd update the lookup and change that to 10. This would update any rows that reference product-id: 123.
I can't be sure but you may be mixing up dimensions and metrics while you're doing this, and you may need to read over that terminology in OLAP descriptions like this: https://en.wikipedia.org/wiki/Online_analytical_processing
Good luck!

Access Crosstab or Form based on 2 tables with dates

There are a few answers here already that have part answered my challenge in Access but not fully.
I have 2 tables that form the basis of my database: customers and items
I have a further 2 tables; one for order quantities against customers and items (orders_a), and one for forecast quantities against customers and items (forecast_a).
forecast_a and orders_a also have a date for each customer and item combination (basically there will only be 12 dates, one for each of the 12 months of the year - 01/01/12, 01/02/12, 01/03/12, etc.).
Because a user will want to manually forecast quantities for a full year for each customer and each item, if there were 2 customers and 2 items the forecast_a table would contain 48 rows: 2 items x 2 customers = 4 combinations, and 4 x 12 dates = 48. The same goes for orders_a.
I know this is a slightly unusual set up but the user requires visibility of a full year.
My main challenge based on this is as follows:
A user will want to see a form with customers in the first column, items in the second and then (like a crosstab): Jan Forecast Qty, Jan Order Qty, Feb Forecast Qty, Feb Order Qty etc.
Therefore how would I create a crosstab to pull both these tables together, and how would I go about creating a form for data entry off the back of it?
I may well be constructing my database the wrong way but the fact that the user needs a 'grid' where every entry is manual means I can't just have a form that creates a record one at a time for orders or forecasts.
Thanks in advance!
Nick
The problem you have is that this is, in essence, a spreadsheet task. Accordingly, it may be best handled in Excel. To achieve this, create an Excel object, create a blank worksheet, populate it with the data, then have a button to pull it back into the database when the user has finished.
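For the read-only crosstab view (reporting rather than data entry, since crosstab query results cannot be edited), an Access crosstab query along these lines is a starting point; cust_id, item_id, fdate and qty are hypothetical column names:

TRANSFORM Sum(forecast_a.qty)
SELECT forecast_a.cust_id, forecast_a.item_id
FROM forecast_a
GROUP BY forecast_a.cust_id, forecast_a.item_id
PIVOT Format(forecast_a.fdate, "mmm");

A second crosstab over orders_a (or a union of the two tables with a Forecast/Order marker column) can then be joined on customer and item to get the "Jan Forecast Qty, Jan Order Qty" layout.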

Database design challenge

I'm creating a virtual stamp card program for the iPhone and have run into an issue with implementing my database. The program essentially has a main points system that can be utilized across all merchants (sort of like air miles), but I also want to keep track of how many times you've been to EACH merchant.
So far, I have created 3 main tables for users, merchants, and transactions.
1) Users table contains basic info like user_id and total points collected.
2) Merchants table contains info like merchant_id, location, total points given.
3) Transactions table simply creates a new row for every time someone checks into each merchant, and records date-stamp, user name, merchant name, and points awarded.
So the most basic way to find out how many times you've been to each merchant is to query the entire transactions table for both user and merchant. This will give me a transaction history of how many times you've been to that specific merchant (which is perfect), but in the long run, I feel this will be horrible for performance.
The other straightforward, yet "dumb" method for implementing this would be to create a column in the users table for EACH merchant and keep the running totals there. This seems inappropriate, as I will be adding new merchants on a regular basis, and new columns would need to be added to every user each time that happens.
I've looked into one-to-many and many-to-many relationships for MySQL databases, but can't seem to come up with something very concrete. As I'm extremely new to web/PHP/MySQL development, I'm guessing this is what I'm looking for...
I've also thought of creating a special transaction table for each user, which will have a column for merchant and another for the # of times visited. Again, not sure if this is the most efficient implementation.
Can someone point me in the right direction?
You're doing the right thing in the sense of thinking up the different options, and weighing up the good and bad for each.
Personally, I'd go with a MerchantCounter table which joins on your Merchant table by id_merchant (for example) and which you keep up-to-date explicitly.
Over time it does not get slower (unlike an activity-search), and does not take up lots of space.
Edit: based on your comment, Janan, no I would use a single MerchantCounter table. So you've got your Merchant table:
id_merchant nm_merchant
12 Jim
15 Tom
17 Wilbur
You would add a single additional table, MerchantCounter (edited to show how to tally totals for individual users):
id_merchant id_user num_visits
12 101 3
12 102 8
15 101 6007
17 102 88
17 104 19
17 105 1
You can see how id_merchant links the table to the Merchant table, and id_user links to a further User table.
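Since the question mentions MySQL, keeping MerchantCounter up to date at check-in time can be done with a single statement, assuming a composite primary key on (id_merchant, id_user); the literal ids are just example values:

-- Run alongside the insert into the Transactions table for each check-in.
INSERT INTO MerchantCounter (id_merchant, id_user, num_visits)
VALUES (12, 101, 1)
ON DUPLICATE KEY UPDATE num_visits = num_visits + 1;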