SSAS Tabular - dimension records of special interest

I have a dimension with 700K entries. 10 of these are of special interest to the client and will be frequently used in queries, and as such need to be quickly retrieved. Do I:
a) add an attribute to the dimension that flags these records, or
b) adopt a snowflake schema and add another dimension with the 10 records in it and attach it to the 700K dimension (effectively a 1 to 1 optional), or
c) add a new dimension with the 10 and attach it to the fact (but I need to obtain information from the 700K table - unless I denormalise the solution further and duplicate the attributes in the new dimension as part of ETL)?

I always go for option a), because it's the simplest and most intuitive solution for users. Only instead of adding a flag attribute, I prefer to create a "VIP" version of the main attribute (if possible). For example, let's say your dimension is "Customer", and it contains "Customer Name". I'd create a new attribute "Special Customer Name", list the names of the special-interest customers there, and replace the rest with "Other" or something like that. Such a design looks great on reports and is easy to implement.
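For example, the derived attribute could be built during ETL with a simple CASE; the table, column, and flag names here are purely illustrative:

SELECT
    CustomerKey,
    CustomerName,
    CASE WHEN IsSpecialInterest = 1 THEN CustomerName
         ELSE 'Other'
    END AS SpecialCustomerName
FROM dbo.DimCustomer;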
I always avoid snowflake designs as much as possible. They are inferior in performance and non-intuitive to users. The only justifiable case where I'd use one is when I need to share a dimension between fact tables of different granularity.
The third option is rather ugly conceptually. I'd consider it only if a dimension is too large (several million entries) and kills the performance of the entire star schema.

How to use RLS with compound field

In Redshift we have a table (let's call it entity) which, among other columns, has two important ones: hierarchy_id and entity_timestampt. The hierarchy_id is a combination of the ids of three hierarchical dimensions (A, B, C; each one having a one-to-many relationship with the next one).
Thus: hierarchy_id == A.a_id || '-' || B.b_id || '-' || C.c_id
Additionally the table is distributed according to DISTKEY(hierarchy_id) and sorted using COMPOUND SORTKEY(hierarchy_id, entity_timestampt).
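For reference, a rough sketch of the definition (other columns omitted, and the types are placeholders):

CREATE TABLE entity (
    hierarchy_id      varchar(64) NOT NULL,
    entity_timestampt timestamp   NOT NULL
    -- ...other columns...
)
DISTKEY (hierarchy_id)
COMPOUND SORTKEY (hierarchy_id, entity_timestampt);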
Over this table we need to generate multiple reports; some of them are fixed to the deepest level of the hierarchy, while others are filtered by the higher parts and group the results by the lower ones. However, the first layer of the hierarchy (the A dimension) is what defines our security model: users will never have access to any A dimension other than the one they belong to (this is our tenant information).
The current design proved useful for that matter when we were prototyping the reports in plain SQL, as we could do things like this for the deepest-level queries:
WHERE
entity.hierarchy_id = 'fixed_a_id-fixed_b_id-fixed_c_id' AND
entity.entity_timestampt BETWEEN 'start_date' AND 'end_date'
Or like this for filtering by other points of the hierarchy:
WHERE
entity.hierarchy_id LIKE 'fixed_a_id-%' AND
entity.entity_timestampt BETWEEN 'start_date' AND 'end_date'
Which would still take advantage of the DISTKEY & SORTKEY setup, even though we are filtering just for a partial path of the hierarchy.
Now we want to use QuickSight for creating and sharing those reports using the embedding capabilities. But we haven't found a way to filter the data of the analysis as we want.
We tried to use RLS by tags for anonymous users, but we found two problems:
How to inject the A.a_id part of the query into the API call that generates the embedding URL in a secure way (i.e. so that users can't change it), while still allowing users to configure the other parts of the hierarchy, and finally how to combine those independent pieces in the filter without needing to generate a new URL each time users change the other parts.
(However, we may be able to live with this limitation.)
How to do partial filters, i.e. the ones that looked like LIKE 'fixed_a_id-fixed_b_id-%', since it seems RLS always applies an equality condition.
Is there any way to make QuickSight work as we want with our current table design? Or would we need to change the design?
For the latter, we have thought of keeping the three dimension ids as separate columns; that way we could add RLS for the A.a_id column and use parameters for the other ones. The problem would be the reports that group by the lower parts of the hierarchy: it is not clear how we could define the DISTKEY and SORTKEY so that the queries are properly optimized.
COMPOUND SORTKEY(hierarchy_id, entity_timestampt)
You are aware you are sorting on only the first eight bytes of hierarchy_id, and that the ability of the zone map to differentiate between blocks is based purely on the first eight bytes of the string?
I suspect you would have done a lot better to have had three separate columns.
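A sketch of what that could look like, reusing the names from the question (the types and exact sort order are assumptions):

CREATE TABLE entity (
    a_id              varchar(32) NOT NULL,  -- tenant; also the natural RLS column
    b_id              varchar(32) NOT NULL,
    c_id              varchar(32) NOT NULL,
    entity_timestampt timestamp   NOT NULL
)
DISTKEY (a_id)
COMPOUND SORTKEY (a_id, b_id, c_id, entity_timestampt);

Equality predicates on the leading columns then keep the zone maps usable at every level of the hierarchy, with no LIKE needed. (Note that DISTKEY (a_id) can skew the distribution if you have only a handful of tenants.)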
Which would still take advantage of the DISTKEY & SORTKEY setup, even though we are filtering just for a partial path of the hierarchy.
I may be wrong - I would need to check - but I think if you use operators of any kind (such as functions, or LIKE, or even addition or subtraction) on a sortkey, the zone map does not operate and you read all blocks.
Also in your case, it may be - I've not tried using it yet - if you have AQUA enabled, because you're using LIKE, your entire query is being processed by AQUA. The performance consequences of this, positive and/or negative, are completely unknown to me.
Have you been using the system tables to verify your expectations of what is going on with your queries when it comes to zone map use?
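For example, SVL_QUERY_SUMMARY carries an is_rrscan flag indicating whether a scan step was range-restricted (i.e. the zone maps actually pruned blocks), and comparing rows_pre_filter with rows shows how much was read and then thrown away. A sketch:

SELECT query, seg, step, label, rows, rows_pre_filter, is_rrscan
FROM svl_query_summary
WHERE query = pg_last_query_id()
ORDER BY seg, step;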
the problem would be for the reports that group by lower parts of the hierarchy, it is not clear how we could define the DISTKEY and SORTKEY so that the queries are properly optimized.
You are now facing the fundamental nature of sorted column-store; the sorting you choose defines the queries you can issue and so also defines the queries you cannot issue.
You either alter your data design in some way so that what you want becomes possible, or you duplicate the table in question, with each duplicate having a different sort order.
The first is an art, the second has obvious costs.
As an aside, although I've never used QuickSight, my experience with all SQL generators has been that they are completely oblivious to sorting, and so the SQL they issue cannot be used on Big Data (as sorting is the method by which Big Data can be handled in a timely manner).
If you do not have Big Data, you'll be fine, but the question then is why are you using Redshift?
If you do have Big Data, the only solution I know of is to create a single aggregate table per dashboard, of about 100k rows, and have the given dashboard use, and only use, that one table. The dashboard should normally simply read the entire table, which is fine, and then you avoid the nightmare SQL it would normally produce.
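A sketch of such an aggregate table, with made-up names and a made-up grain (a nightly CTAS that pre-aggregates to exactly what one dashboard displays):

CREATE TABLE dashboard_daily_agg AS
SELECT
    a_id,
    b_id,
    DATE_TRUNC('day', entity_timestampt) AS day,
    COUNT(*) AS events
FROM entity
GROUP BY 1, 2, 3;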

Will one Foreign key to two tables cause confusion?

If I have table A and table B, data starts off in table A but ends up in table B, and I have a table C with a foreign key that points to the primary key of A. When the data gets removed from A and ends up in B, that key should point to B instead (B's row having the same id as A's row did). Will this cause confusion? Here's an example to show what I mean:
A (Pending results)
id = 3
B (Completed results)
(empty)
C (User)
id = 1
results id = 3 (foreign key to both A and B)
After three minutes, the results have been posted.
A (Pending results)
(empty)
B (Completed results)
id = 3
C (User)
id = 1
results id = 3 (foreign key to both A and B)
Is there anything wrong with this implementation? Or would it be better to have A and B as one table? The table could grow very large, which is what I am worried about. As separate tables, the reads of table A would be far more frequent than the reads of table B, and table A would be much smaller, as it holds only pending results. If A and B were combined into one table, it would hold both pending results and a history of all completed results, so finding the pending ones would take much more time, I assume. All of this is being done in PostgreSQL, if that makes a difference.
So I guess my question is: is this implementation fine for a medium scale, or, given what I just said, should I combine tables A and B (even though B will grow indefinitely, whereas A only contains current data and is significantly smaller)?
Sounds like you've already found that this does not work. I couldn't follow your example properly because "A", "B", and "C" never work for me. I suspect those kinds of formulaic labels work better than specifics for other people. You just can't win ;-) In any case, it sounds like you're facing a practical concern about table size, and are being tempted to use a design that splits a natural table into two parts. (Hot and old.) As you found, that doesn't really work with the keys in a system. The relational model (etc., etc.) doesn't have a concept for "this thing is a child of this or that", so you're swimming upstream there. Regardless, this kind of setup is very commonplace in the wild, so much so that it's got a name. Well, several names. "Polymorphic Association", from Bill Karwin's SQL Antipatterns, is common. That's a good book, and short, by the way. Similarly, "promiscuous association" is a term you'll see. Or sometimes you'll see the table itself listed as a "jump table", or a "hub", etc.
I suspect there's a reason this non-relational pattern is so widely used: it makes sense to humans. An area where the relational model is always a tight pinch is when you have things which are kinds of things. Like people who are staff or students: so many fields in common, several that are distinct to their specific type. One table? Two? Three? Table inheritance in Postgres might help... at least it's trying to. Anyway, polymorphic relations are problematic in an RDBMS because they can't be modeled or constrained automatically. You need custom code to figure out that this record is a child of that table... or the other table. You can't bake that into the relations. If you're interested in various solutions to this design problem, Karwin's chapter is quite good, easy to read, and full of alternative designs. If you don't feel like tracking down the book but are a bit interested, check out this article from a few years ago:
https://hashrocket.com/blog/posts/modeling-polymorphic-associations-in-a-relational-database
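One of the alternative designs you'll find in both places, sketched against your example (all names here are mine, and hypothetical): give C one nullable foreign key per parent table, plus a CHECK that exactly one of them is set. A minimal sketch for PG 9.6+:

CREATE TABLE pending_results   (id bigint PRIMARY KEY);
CREATE TABLE completed_results (id bigint PRIMARY KEY);

CREATE TABLE users (
    id                  bigint PRIMARY KEY,
    pending_result_id   bigint REFERENCES pending_results (id),
    completed_result_id bigint REFERENCES completed_results (id),
    -- exactly one of the two references must be set at any time
    CHECK (num_nonnulls(pending_result_id, completed_result_id) = 1)
);

Moving a result from "pending" to "completed" then means flipping the two columns in users inside the same transaction that moves the row.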
Chances are, your interest right now is more day-to-day. It sounds like you've got a processing pipeline with a few active records and an ever-increasing collection of older records. You don't mention your Postgres version, but you might have less to worry about than you imagine. First up, you could consider partitioning the table. A partitioned table gives you a single logical table that you talk to in your queries, with a collection of smaller physical tables under the hood. You can get at the partitions directly, but you don't need to. You just talk to my_big_table and Postgres figures out where to look. So, you could split the data by week, month, etc. so that no one bucket ever gets too big for you. In this case, the individual partitions have their own indexes too. So, you'll end up with smaller tables and smaller indexes under the hood. For this, you're best off using PG 11, or maybe PG 10. Partitioning is a big topic, and the Postgres feature set isn't a perfect match for every situation... you have to work within its limits. I'll leave it at that for now, as it's likely not what you need first.
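For concreteness, a minimal sketch of declarative partitioning (PG 10+ syntax; the table and column names are made up):

CREATE TABLE results (
    id         bigint      NOT NULL,
    status     text        NOT NULL,
    created_at timestamptz NOT NULL
) PARTITION BY RANGE (created_at);

CREATE TABLE results_2019_01 PARTITION OF results
    FOR VALUES FROM ('2019-01-01') TO ('2019-02-01');

-- Queries target the parent; Postgres routes them to the right partition(s):
SELECT count(*) FROM results WHERE created_at >= '2019-01-15';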
Simpler than partitioning is an awesome Postgres feature you may not know about: partial indexes. This isn't unique to Postgres (SQL Server calls the same sort of feature a "filtered" index), but I don't think MySQL has it. Okay, the idea is really simple: build an index that only includes rows that match a condition. Here's an example:
CREATE INDEX tasks_pending
ON tasks (status)
WHERE status = 'Pending';
If your table has 100M records, a full B-tree has to catalog all 100M rows. You need that for a uniqueness check on a primary key... but it's big and expensive. Now imagine your 100M records have only 1,000 rows where status = 'Pending'. You've got an index with just those 1,000 rows. Tiny, fast, perfect. The beauty here is that the partial index doesn't necessarily get bigger as your historical data set grows. And, shout out to historical data sets: they're very nice to have when you need aggregates, etc. in a simple search. If you split things into multiple tables, you'll need to write longer queries with UNION. (That wouldn't be the case with partitions, where the physical division is masked by the logical parent table.)
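For contrast, here's the sort of query a two-table split forces on you (hypothetical table and column names):

SELECT id, 'pending' AS state
FROM pending_results
WHERE user_id = 1
UNION ALL
SELECT id, 'completed' AS state
FROM completed_results
WHERE user_id = 1;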
HTH

Is it good practice to have 2 or more tables with the same columns?

I'm creating a web-app that lets users search for restaurants and cafes. Since I currently have no data other than their type to differentiate the two, I have two options for storing the list of eateries.
Use a single table for both restaurants and cafes, and have an enum (text) column stating if an entry is a restaurant or cafe.
Create two separate tables, one for restaurants, and one for cafes.
I will never need to execute a query that collects data from both, so I guess the only thing that matters to me is performance. What would you suggest as the better option for PostgreSQL?
Typical database modeling would lend itself to a single table. The main reason is maintainability. If you have two tables with the same columns and your client decides they want to add a column, say hours of operation, you now have to write two sets of code for creating the column, reading the new column, updating the new column, etc. Also, what if your client wants you to start tracking bars? Now you need a third table with a third set of code. It gets messy quickly. It would be better to have two tables: a data table (say Establishment) with most of the columns (name, location, etc.) and a second "type" table (say EstablishmentType) with a row for Restaurant, Cafe, Bar, etc., and of course a foreign key linking the two. This way you can have "X" types and only need to maintain a single set of code.
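A sketch of that shape (any columns beyond the ones mentioned above are guesses):

CREATE TABLE establishment_type (
    id   serial PRIMARY KEY,
    name text   NOT NULL UNIQUE  -- 'Restaurant', 'Cafe', 'Bar', ...
);

CREATE TABLE establishment (
    id       serial  PRIMARY KEY,
    name     text    NOT NULL,
    location text,
    type_id  integer NOT NULL REFERENCES establishment_type (id)
);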
There are of course exceptions to this rule where you may want separate tables:
Performance due to a HUGE data set. (It depends on your server, but we're talking at least hundreds of thousands of rows before it should matter in Postgres.) If this is the reason, I would suggest table inheritance to keep much of the proper maintainability while speeding up performance.
Cafes and restaurants have two completely different sets of functionality in your website. If the entirety of your code is saying "if cafe, do this; if restaurant, do that", then you already have two sets of code to maintain, with the added hassle of if-logic in your code. If that's the case, two separate tables is a much cleaner and more logical option.
In the end I chose to use two separate tables, as I really will never need to search for both at the same time, and this way I can expand a single table in the future if I need to add another data field specific to cafes, for example.

One big and wide table or many not so big for statistics data

I'm writing the simplest of analytics systems for my company. I have about 100 different event types that should be collected across tens of projects. We are not interested in cross-project analytic queries, but events have similar types across all projects. I use PostgreSQL as the primary storage for this system. Now I should decide which architecture is preferable.
The first architecture is one very big table (in terms of row count) per project that contains data for all types of events. It will have about 20 or more columns, many of them nullable. Partitioning might be used to split this table by event type, but the table will still be very wide.
The second architecture is a lot of tables (fairly big in terms of row count, but not so wide), with one table per event type.
I am going to retrieve analytic data from these tables using various join queries (self-joins in the case of the first architecture). Which one is preferable, and what are the pitfalls of each?
UPD. All events have about 10 common attributes; the remaining attributes vary from one event type to another.
In the past, I've had similar situations. With postgres you have a bunch of options.
Depending on how your data is input into the system (all at once / a little at a time), the volume of data per project (hundreds of data points vs. millions), and the querying pattern (i.e. querying after the data is all in, querying nightly, or reports running constantly throughout the day), there are many options. One other factor is whether new project types (with new data point types) are likely to crop up.
First, in your "first architecture" the first question that comes up for me is: Are all the "data points" the same data type (or at least very similar). Are some text and others numeric? Are some numeric and others floats? If so, you're likely to run into issues with rolling up your data without either building a column or a table for every data type.
If all your data is the same datatype, then the first architecture you mentioned might work really well.
The second architecture you mentioned is OK, especially if you don't predict a bunch of new project types coming down the pike anytime soon; otherwise you'll be constantly modifying the DB, which I prefer to avoid when unnecessary.
A third architecture that you didn't mention is a combination of 1 and 2: basically, have one table to hold the 10 common attributes, and use either approach 1 or 2 to hold the additional attributes. This has an advantage especially if the additional data isn't used that frequently, or is non-numeric.
Lastly, you could use one of PostgreSQL's "document store" type datatypes. You could store this information in arrays, hstore, or json. Now, this will be fairly inefficient if you're doing a ton of aggregate functions, as you might be left calculating the aggregates outside of PostgreSQL, or at a minimum running an inefficient query. You could store the 10 common fields in normal columns, and the additional ones as hstore or json.
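A sketch of that hybrid, using jsonb for the varying attributes (jsonb is the binary successor to json; all names here are made up):

CREATE TABLE event (
    id          bigserial   PRIMARY KEY,
    project_id  integer     NOT NULL,
    event_type  text        NOT NULL,
    occurred_at timestamptz NOT NULL,
    -- ...the rest of the ~10 common attributes...
    extra       jsonb       NOT NULL DEFAULT '{}'
);

-- Event-specific attributes come out of the document column:
SELECT avg((extra->>'duration_ms')::int)
FROM event
WHERE event_type = 'page_view';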
One thing you didn't mention, and it would be nice to know: does each event within a project have more than one data point (i.e. are you logging changes, or just updating data)? If your overall table has fewer than 100,000 rows, it's likely best to focus on what's easier to maintain and program rather than on performance, as small amounts of data are pretty quick regardless of how they're stored.

What NoSQL DB to use for sparse Time Series like data?

I'm planning a side project where I will be dealing with Time Series like data and would like to give one of those shiny new NoSQL DBs a try and am looking for a recommendation.
For a (growing) set of symbols I will have a list of (time,value) tuples (increasing over time).
Not all symbols will be updated; some symbols may be updated while others may not, and completely new symbols may be added.
The database should therefore allow:
Add Symbols with initial one-element (tuple) list. E.g. A: [(2012-04-14 10:23, 50)]
Update Symbols with a new tuple. (Append that tuple to the list of that symbol).
Read the data for a given symbol. (Ideally even let me specify the time frame for which the data should be returned)
The create and update operations should ideally be atomic. If reading multiple symbols at once is possible, that would be interesting.
Performance is not critical. Updates/Creates will happen roughly once every few hours.
I believe literally all the major NoSQL databases will support those requirements, especially since you don't actually have a large volume of data (which raises the question: why NoSQL?).
That said, I've recently had to design and work with a NoSQL database for time series data, so I can give some input on that design, which can then be extrapolated to others.
Our chosen database was Cassandra, and our design was as follows:
A single keyspace for all 'symbols'
Each symbol was a new row
Each time entry was a new column for that relevant row
Each value (can be more than a single value) was the value part of the time entry
This lets you achieve everything you asked for, most notably reading the data for a single symbol, with a range if necessary (column range calls). Although you said performance wasn't critical, it was for us, and this design was quite performant: all data for any single symbol is by definition sorted (column-name sort) and always stored on the same node (no cross-node communication for simple queries). Finally, this design translates well to other NoSQL databases that have dynamic columns.
Further to this, here's some information on using MongoDB (and capped collections if necessary) for a time series store: MongoDB as a Time Series Database
Finally, here's a discussion of SQL vs NoSQL for time series: https://dba.stackexchange.com/questions/7634/timeseries-sql-or-nosql
I can add to that discussion the following:
The learning curve for NoSQL will be higher; you don't get the added flexibility and functionality for free in terms of 'soft costs'. Who will be supporting this database operationally?
If you expect this functionality to grow in the future (either more fields added to each time entry, or much larger capacity in terms of the number of symbols or the size of each symbol's time series), then definitely go with NoSQL. The flexibility benefit is huge, and the scalability you get (with the above design) on both the 'per symbol' and 'number of symbols' axes is almost unbounded (I say almost unbounded: the maximum number of columns per row is in the billions, and the maximum number of rows per keyspace is unbounded, I believe).
Have a look at opentsdb.org, an open-source time series database which uses HBase. They have been smart about how they store the time series. It is well documented here: http://opentsdb.net/misc/opentsdb-hbasecon.pdf