How to use RLS with compund field - amazon-redshift

In Redshift we have a table (let's call it entity) which among other columns it has two important ones: hierarchy_id & entity_timestampt, the hierarchy_id is a combination of the ids of three hierarchical dimensions (A, B, C; each one having a relationship of one-to-many with the next one).
Thus: hierarchy_id == A.a_id || '-' || B.b_id || '-' || C.c_id
Additionally the table is distributed according to DISTKEY(hierarchy_id) and sorted using COMPOUND SORTKEY(hierarchy_id, entity_timestampt).
Over this table we need to generate multiple reports, some of them are fixed to the depths level of the hierarchy, while others will be filtered by higher parts and group the results by the lowers. However, the first layer of the hierarchy (the A dimension) is what defines our security model, users will never have access to different A dimensions other than the one they belong (this is our tenant information).
The current design proven to be useful for that matter when we were prototyping the reports in plain SQL as we could do things like this for the depths queries:
WHERE
entity.hierarchy_id = 'fixed_a_id-fixed_b_id-fixed_c_id' AND
entity.entity_timestampt BETWEEN 'start_date' AND 'end_data'
Or like this for filtering by other points of the hierarchy:
WHERE
entity.hierarchy_id LIKE 'fixed_a_id-%' AND
entity.entity_timestampt BETWEEN 'start_date' AND 'end_data'
Which would still take advantage of the DISTKEY & SORTKEY setup, even though we are filtering just for a partial path of the hierarchy.
Now we want to use QuickSight for creating and sharing those reports using the embedding capabilities. But we haven't found a way to filter the data of the analysis as we want.
We tried to use the RLS by tags for annonymous users, but we have found two problems:
How to inject the A.a_id part of the query in the API that generates the embedding URL in a secure way (i.e. that users can't change it), While allowing them to configure the other parts of the hierarchy. And finally combining those independent pieces in the filter; without needing to generate a new URL each time users change the other parts.
(however, we may live with this limitation but)
How to do partial filters; i.e., the ones that looked like LIKE 'fixed_a_id-fixed_b_id-%' Since it seems RLS is always an equals condition.
Is there any way to make QuickSight to work as we want with our current table design? Or would we need to change the design?
For the latter, we have thought on keeping the three dimension ids as separated columns, that way we may add RLS for the A.a_id column and use parameters for the other ones, the problem would be for the reports that group by lower parts of the hierarchy, it is not clear how we could define the DISTKEY and SORTKEY so that the queries are properly optimized.

COMPOUND SORTKEY(hierarchy_id, entity_timestampt)
You are aware you are sorting on only the first eight bytes of hierarchy_id? and the ability of the zone map to differentiate between blocks is based purely on the first eight bytes of the string?
I suspect you would have done a lot better to have had three separate columns.
Which would still take advantage of the DISTKEY & SORTKEY setup, even though we are filtering just for a partial path of the hierarchy.
I may be wrong - I would need to check - but I think if you use operators of any kind (such as functions, or LIKE, or even addition or subtraction) on a sortkey, the zone map does not operate and you read all blocks.
Also in your case, it may be - I've not tried using it yet - if you have AQUA enabled, because you're using LIKE, your entire query is being processed by AQUA. The performance consequences of this, positive and/or negative, are completely unknown to me.
Have you been using the system tables to verify your expectations of what is going on with your queries when it comes to zone map use?
the problem would be for the reports that group by lower parts of the hierarchy, it is not clear how we could define the DISTKEY and SORTKEY so that the queries are properly optimized.
You are now facing the fundamental nature of sorted column-store; the sorting you choose defines the queries you can issue and so also defines the queries you cannot issue.
You either alter your data design, in some way, so what you want becomes possible, or you can duplicate the table in question where each duplicate has different sorting orders.
The first is an art, the second has obvious costs.
As an aside, although I've never used Quicksight, my experience with all SQL generators has been that they are completely oblivious to sorting and so the SQL they issue cannot be used on Big Data (as sorting is the method by which Big Data can be handled in a timely manner).
If you do not have Big Data, you'll be fine, but the question then is why are you using Redshift?
If you do have Big Data, the only solution I know of is to create a single aggregate table per dashboard, about 100k rows, and have the given dashboard use and only use that one table. The dashboard should normally simply read the entire table, which is fine, and then you avoid the nightmare SQL it normally will produce.

Related

Will one Foreign key to two tables cause confusion?

If I have table A, and table B, and I have data that start off in table A but end up in table B, and I have a table C which has a foreign key that points to the primary key of A, but when the data get removed from A and ends up in table B, it should point to B instead (having the same id as A's data did). Will this cause confusion. Heres and example to show what I mean:
A (Pending results)
id =3
B( Completed Results)
null
C(user)
id = 1
results id = 3 (foreign key to both A and B)
After three minutes, the results have been posted.
A (Pending results)
null
B( Completed Results)
id = 3
C(user)
id = 1
results id = 3 (foreign key to both A and B)
Is there anything wrong with this implementation. Or would it be better to have A and B as one table? The table could grow very large which is what I am worried about. As separate tables, the reads to table A would be far greater than the reads to table B and table A would be much smaller, as it is just pending results. If A and B were combined into one table, then it would be both pending and a history of all completed results, so finding the ones which are pending would take much more time I assume. All of this is being done is postgresql if that makes a difference.
So I guess my question is: Is this implementation fine for a medium scale, or given the information I just said, should I combine table A and B (Even though B will grow infinitely large whereas A only contains present data and is significantly smaller).
Sounds like you've already found that this does not work. I couldn't follow your example properly because "A", "B", and "C" never work for me. I suspect those kinds of formulaic labels are better than specifics for other people. You just can't win ;-) In any case, it sounds like you're facing a practical concern about table size, and are being tempted to use a design that splits a natural table into two parts. (Hot and old.) As you found, that doesn't really work with the keys in a system. The relational model (etc., etc.) doesn't have a concept for "this thing is a child of this or that." So, you're swimming up stream there. Regardless, this kind of setup is very commonplace in the wild, so much so that it's got a name. Well, several names. "Polymorphic Association" from Bill Karwin's SQL Anti-Patterns is common. That's a good book, and short, by the way. Similarly, "promiscuous association" is a term you'll see. Or sometimes you'll see the table itself listed as a "jump table", or a "hub", etc.
I suspect there's a reason this non-relational pattern is so widely used: It makes sense to humans. An area where the relational model is always a tight pinch is when you have things which are kinds of things. Like, people who are staff or student. So many fields in common, several that are distinct to their specific type. One table? Two? Three? Table inheritance in Postgres might help...at least it's trying to. Anyway, polymorphic relations are problematic in an RDBMS because they're not able to be modeled or constrained automatically. You need custom code to figure out that this record is a child of that table...or the other table. You can't bake that into the relations. If you're interested in various solutions to this design problem, Karwin's chapter is quite good, easy to read, and full of alternative designs. If you don't feel like tracking down the book but are a bit interested, check out this article from a few years ago:
https://hashrocket.com/blog/posts/modeling-polymorphic-associations-in-a-relational-database
Chances are, your interest right now is more day-to-day. Is sounds like you've got a processing pipeline with a few active records and an ever-increasing collection of older records. You don't mention your Postgres version, but you might have less to worry about than you imagine. First up, you could consider partitioning the table. A partitioned table has a single logical table that you talk to in your queries with a collection of smaller physical tables under the hood. You can get at the partitions directly, but you don't need to. You just talk to my_big_table and Postgres figures out where to look. So, you could split the data on week, month, etc. so that no one bucket every gets too big for you. In this case, the individual partitions have their own indexes too. So, you'll end up with smaller tables and smaller indexes under the hood. For this, you're best off using PG 11, or maybe PG 10. Partitioning is a big topic, and the Postgres feature set isn't a perfect match for every situation...you have to work within its limits. I'll leave it at that now as it's likely not what you need first.
Simpler than partitioning is an awesome Postgres feature you may not know about, partial indexes. This isn't unique to Postgres (SQL Server calls the same sort of feature a "filtered" index), but I don't think MySQL has it. Okay, the idea is really simple: Build an index that only includes rows that match a condition. Here's an example:
CREATE INDEX tasks_pending
ON tasks (status)
WHERE status = 'Pending'
If you're table has 100M records, a full B-tree has to catalog all 100M rows. You need that for a uniqueness check on a primary key...but it's big and expensive. Now imagine your 100M records have only 1,000 rows where status = pending. You've got an index with just those 1,000 rows. Tiny, fast, perfect. The beauty part here is that the partial index doesn't necessarily get bigger as your historical data set grows. And, shout out to historical data sets, they're very nice to have when you need to get aggregates, etc. in a simple search. If you split things into multiple tables, you'll need to write longer queries with UNION. (That wouldn't be the case with partitions where the physical division is masked by the logical partition master table.)
HTH

OLAP Approach for Backend redshift connection

We have a system where we do some aggregations in Redshift based on some conditions. We aggregate this data with complex joins which usually takes about 10-15 minutes to complete. We then show this aggregated data on Tableau to generate our reports.
Lately, we are getting many changes regarding adding a new dimension ( which usually requires join with a new table) or get data on some more specific filter. To entertain these requests we have to change our queries everytime for each of our subprocesses.
I went through OLAP a little bit. I just want to know if it would be better in our use case or is there any better way to design our system to entertain such adhoc requests which does not require developer to change things everytime.
Thanks for the suggestions in advance.
It would work, rather it should work. Efficiency is the key here. There are few things which you need to strictly monitor to make sure your system (Redshift + Tableau) remains up and running.
Prefer Extract over Live Connection (in Tableau)
Live connection would query the system everytime someone changes the filter or refreshes the report. Since you said the dataset is large and queries are complex, prefer creating an extract. This'll make sure data is available upfront whenever someone access your dashboard .Do not forget to schedule the extract refresh, other wise the data will be stale forever.
Write efficient queries
OLAP systems are expected to query a large dataset. Make sure you write efficient queries. It's always better to first get a small dataset and join them rather than bringing everything in the memory and then joining / using where clause to filter the result.
A query like (select foo from table1 where ... )a left join (select bar from table2 where) might be the key at times where you only take out small and relevant data and then join.
Do not query infinite data.
Since this is analytical and not transactional data, have an upper bound on the data that Tableau will refresh. Historical data has an importance, but not from the time of inception of your product. Analysing the data for the past 3, 6 or 9 months can be the key rather than querying the universal dataset.
Create aggregates and let Tableau query that table, not the raw tables
Suppose you're analysing user traits. Rather than querying a raw table that captures 100 records per user per day, design a table which has just one (or two) entries per user per day and introduce a column - count which'll tell you the number of times the event has been triggered. By doing this, you'll be querying sufficiently smaller dataset but will be logically equivalent to what you were doing earlier.
As mentioned by Mr Prashant Momaya,
"While dealing with extracts,your storage requires (size)^2 of space if your dashboard refers to a data of size - **size**"
Be very cautious with whatever design you implement and do not forget to consider the most important factor - scalability
This is a typical problem and we tackled it by writing SQL generators in Python. If the definition of the metric is the same (like count(*)) but you have varying dimensions and filters you can declare it as JSON and write a generator that will produce the SQL. Example with pageviews:
{
metric: "unique pageviews"
,definition: "count(distinct cookie_id)"
,source: "public.pageviews"
,tscol: "timestamp"
,dimensions: [
['day']
,['day','country']
}
can be relatively easy translated to 2 scripts - this:
drop table metrics_daily.pageviews;
create table metrics_daily.pageviews as
select
date_trunc('day',"timestamp") as date
,count(distinct cookie_id) as "unique_pageviews"
from public.pageviews
group by 1;
and this:
drop table metrics_daily.pageviews_by_country;
create table metrics_daily.pageviews_by_country as
select
date_trunc('day',"timestamp") as date
,country
,count(distinct cookie_id) as "unique_pageviews"
from public.pageviews
group by 1,2;
the amount of complexity of a generator required to produce such sql from such config is quite low but in increases exponentially as you need to add new joins etc. It's much better to keep your dimensions in the encoded form and just use a single wide table as aggregation source, or produce views for every join you might need and use them as sources.

Is it good practice to have 2 or more tables with the same columns?

I'm creating a web-app that lets users search for restaurants and cafes. Since I currently have no data other than their type to differentiate the two, I have two options on storing the list of eateries.
Use a single table for both restaurants and cafes, and have an enum (text) column stating if an entry is a restaurant or cafe.
Create two separate tables, one for restaurants, and one for cafes.
I will never need to execute a query that collects data from both, so the only thing that matters to me I guess is performance. What would you suggest as the better option for PostgreSQL?
Typical database modeling would lend itself to a single table. The main reason is maintainability. If you have two tables with the same columns and your client decides they want to add a column, say hours of operation. You now have to write two sets of code for creating the column, reading the new column, updating the new column, etc. Also, what if your client wants you to start tracking bars, now you need a third table with a third set of code. It gets messy quick. It would be better to have two tables, a data table (say Establishment) with most of the columns (name, location, etc.) and then a second table that's a "type" table (say EstablishmentType) with a row for Restaurant, Cafe, Bar, etc. And of course a foreign key linking the two. This way you can have "X" types and only need to maintain a single set of code.
There are of course exceptions to this rule where you may want separate tables:
Performance due to a HUGE data set. (It depends on your server, but were talking at least hundreds of thousands of rows before it should matter in Postgres). If this is the reason I would suggest table inheritance to keep much of the proper maintainability while speeding up performance.
Cafes and Restaurants have two completely different sets of functionality in your website. If the entirety of your code is saying if Cafe, do this, if Restaurant, do that, then you already have two sets of code to maintain, with the added hassle of if logic in your code. If that's the case, two separate tables is a much cleaner and logical option.
In the end I chose to use 2 separate tables, as I really will never need to search for both at the same time, and this way I can expand a single table in the future if I need to add another data field specific to cafes, for example.

ef-core - bool index vs historical table

I´m using aspnet-core, ef-core with sql server. I have an 'order' entity. As I'm expecting the orders table to be large and a the most frequent query would get the active orders only for certain customer (active orders are just a tiny fraction of the whole table) I like to optimize the speed of the query but I can decide from this two approaches:
1) I don't know if this is possible as I haven't done this before, but I was thinking about creating a Boolean column named 'IsActive' and make it an index thus when querying only Active orders would be faster.
2) When an order becomes not active, move the order to another table, i.e HistoricalOrders, thus keeping the orders table small.
Which of the two would have better results?, or none of this is a good solution and a third approach could be suggested?
If you want to partition away cold data then a leading boolean index column is a valid way to do that. That column must be added to all indexes that you want to hot/cold partition. This includes the clustered index. This is quite awkward. The query optimizer requires that you add a dummy predicate where IsActive IN (0, 1) to make it able to still seek on such indexes. Of course, this will now also touch the cold data. So you probably need to know the concrete value of IsActive or try the 1 value first and be sure that it matches 99% of the time.
Depending on the schema this can be impractical. I have never seen a good case for this but I'm sure it exists.
A different way to do that is to use partitioning. Here, the query optimizer is used to probing multiple partitions anyway but again you don't want it to probe cold data. Even if it does not find anything this will pull pages into memory making the partitioning moot.
The historical table idea (e.g. HistoricalOrders) is the same thing in different clothes.
So in order to make this work you need:
Modify all indexes that you are about (likely all), or partition, or create a history table.
Have a way to almost never need to probe the cold partition.
I think (2) kills it for most cases.
Among the 3 solutions I'd probably pick the indexing solution because it is simplest. If you are worried about people making mistakes by writing bad queries all the time, I'd pick a separate table. That makes mistakes hard but makes the code quite awkward.
Note, that many indexes are already naturally partitioned. Indexes on the identity column or on an increasing datetime column are hot at the end and cold elsewhere. An index on (OrderStatus INT, CreateDateTime datetime2) would have one hot spot per order status and be cold otherwise. So those are already solved.
Some further discussion.
Before think about the new table HistoricalOrders,Just create a column name IsActive and test it with your data.You don't need to make it as index column.B'cos Indexes eat up storage and it slows down writes and updates.So we must very careful when we create an index.When you query the data do it as shown below.On the below query where data selection (or filter) is done on the SQL srever side (IQueryable).So it is very fast.
Note : Use AsNoTracking() too.It will boost the perfromnace too.
var activeOrders =_context.Set<Orders>().Where(o => o.IsActive == true).AsNoTracking()
.ToList();
Reference : AsNoTracking()

One big and wide table or many not so big for statistics data

I'm writing simplest analytics system for my company. I have about 100 different event types that should be collected per tens of projects. We are not interested in cross-project analytic requests but events have similar types through all projects. I use PostgreSQL as primary storage for this system. Now I should decide which architecture is more preferable.
First architecture is one very big table (in terms of rows count) per project that contains data for all types of events. It will be about 20 or more columns many of them will be nullable. May be it will be used partitioning to split this table by event type but table still be so wide.
Second one architecture is a lot of tables (fairly big in terms of rows count but not so wide) with one table per event type.
I going to retrieve analytic data from this tables using different join queries (self join in case of first architecture). Which one is more preferable and where are pitfalls of them?
UPD. All events have about 10 common attributes. And remain attributes are varied from one event type to another.
In the past, I've had similar situations. With postgres you have a bunch of options.
Depending on how your data is input into the system (all at once/ a little at a time) and the volume of your data per project (hundreds of data points vs millions of data points) and the querying pattern (IE, querying after the data is all in, querying nightly, or reports running constantly throughout), there are many options. One other factor will be IF new project types (with new data point types) are likely to crop up.
First, in your "first architecture" the first question that comes up for me is: Are all the "data points" the same data type (or at least very similar). Are some text and others numeric? Are some numeric and others floats? If so, you're likely to run into issues with rolling up your data without either building a column or a table for every data type.
If all your data is the same datatype, then the first architecture you mentioned might work really well.
The second architecture you mentioned is OK especially if you don't predict having a bunch of new project types coming down the pike anytime soon, otherwise, you'll be constantly modifying the DB, which I prefer to avoid when unnecessary.
A third architecture that you didn't mention is to have a combination of 1 and 2. Basically have 1 table to hold the 10 common attributes and use either 1 or 2 to hold the additional attributes. This would have an advantage, especially if the additional data wasn't that frequently used, or was non-numeric.
Lastly, you could use one of PostgreSQLs "document store" type datatypes. You could store this information in arrays, hstores, or json. Now, this will be fairly inefficient if you're doing a ton of aggregate functions as you might be left calculating the aggregates outside of Pgsql, or at a minimum, running an inefficient query. You could store the 10 common fields in normal fields, and the additional ones as hstore or json.
I didn't ask you, but it'd be nice to know that if each event within a project had more than 1 data point (IE are you logging changes, or just updating data).If your overall table has less than 100,000 rows, it's likely just going to be best to focus on what's easier to maintain and program rather than performance, as small amounts of data are pretty quick regardless of how they're stored.