OLAP Cube: per Business Process? per Fact table? - sql-server-2008-r2

So I've finished my dimensional modeling. It resulted in two business processes: one simple, with only one fact table and a few dimensions; the other a bit more complex, with two fact tables (related in a similar way as Invoice and InvoiceRecord) and a lot more dimensions.
My question now is how to start building the OLAP cube(s): one for each business process? Or one for each business process and for each fact table?

You need to consider all the fact tables and dimension tables when creating a common star schema. You should consider building a single cube unless the fact and dimension pairs are not interrelated at all (i.e., they share no dimensions). It all depends on your design.

Related

Star Schema horizontal scaling

AFAIK, in the case of a relational database on MPP hardware, the key to performance is correct data distribution. But dimensional modeling is about query flexibility: you don't even know how the data will be queried (shuffled) in the future.
For example, say you have an MPP data warehouse (Greenplum, Redshift, Synapse Analytics), and in 1-2 years you expect the fact table to grow to 10 billion rows, with 15-30 dimension tables of tens of millions of rows each. How should the data be distributed across the DW nodes? Are there any common techniques, like sharding the fact table and replicating the dimension tables? Or should I minimize the number of nodes in the MPP DW?
I can bring a specific use case, but I believe the question arises from my misunderstanding of how dimensional modeling can be paired with scaling out.
One technique I've seen applied with success in the past is: segment the fact table across nodes (e.g., by mod'ing the date key), and replicate all dimensions to every node. That way all joins can be done locally.
Note that even with large dimensions, their total size on disk should be a small fraction of what the fact table needs.
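Here is a minimal sketch of that layout as DDL, run from Python with psycopg2. It assumes Amazon Redshift syntax (Greenplum would use DISTRIBUTED BY / DISTRIBUTED REPLICATED instead of DISTSTYLE/DISTKEY); the table and column names and the connection string are invented for illustration.

```python
# Hypothetical fact/dimension layout for an MPP warehouse (Redshift syntax).
import psycopg2

FACT_DDL = """
-- Fact rows are hash-distributed on the date key, so each date segment
-- lands on a predictable slice/node.
CREATE TABLE fact_sales (
    date_key     INT NOT NULL,
    customer_key INT NOT NULL,
    amount       DECIMAL(18,2)
)
DISTSTYLE KEY
DISTKEY (date_key)
SORTKEY (date_key);
"""

DIM_DDL = """
-- The dimension is replicated in full to every node, so joins to the
-- fact table never need to move data across the network.
CREATE TABLE dim_customer (
    customer_key  INT PRIMARY KEY,
    customer_name VARCHAR(200)
)
DISTSTYLE ALL;
"""

with psycopg2.connect("dbname=dw user=etl") as conn:  # placeholder DSN
    with conn.cursor() as cur:
        for ddl in (FACT_DDL, DIM_DDL):
            cur.execute(ddl)
```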

What kind of program(s) cannot maintain its complexity structure when the program or data is partitioned?

I received this question in my operating systems class and after some research I still cannot find an answer to this question.
I understand complexity structure to be the minimum complexity (number of computation steps) necessary to perform the computation at a given level of partitioning of the data or program.
The answer is in the question, namely: programs that need more steps to tackle partitioned data or partitioned processing units.
Extra steps appear when the data access pattern (granularity, scope, cardinality) requires accessing several partitions and integrating the results, or when the products of the divided computation (threads, processes, nodes) must themselves be accessed and integrated (I/O and integration overhead).
Consider a program with complexity X that uses indexed access to all parts and granularities of the data at the same time. If the data were partitioned, more steps Y would be necessary to access and query each partition individually, and still more steps W to integrate the partial results, giving a complexity f(X, Y, W) that depends on the level of integration and access.
One example would be programs that perform table join queries, optimising searches by indexing (SQL joins). Such programs could not operate at the same complexity if the tables or columns were spread across different databases or nodes (NoSQL key-value or columnar stores, for example).
Another example would be a program spawning threads, processes, or nodes and combining their results: calling the workers and combining the results takes more computation steps than doing the work sequentially.
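As a toy illustration of the X/Y/W point above (the data and the partitioning are invented; plain in-memory dicts stand in for tables or nodes):

```python
# One indexed lookup vs. probing every partition and merging the results.
orders = {1: "invoice-A", 2: "invoice-B", 3: "invoice-C"}   # single indexed table

def lookup_single(key):
    # X: one index/hash probe
    return orders.get(key)

# The same data split across "nodes" (here, plain dicts).
partitions = [{1: "invoice-A"}, {2: "invoice-B"}, {3: "invoice-C"}]

def lookup_partitioned(key):
    # Y: probe each partition individually ...
    partial = [p.get(key) for p in partitions]
    # ... W: integrate (merge/filter) the partial results.
    return next((hit for hit in partial if hit is not None), None)

assert lookup_single(2) == lookup_partitioned(2) == "invoice-B"
```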
The question is a bit out of context; you would do well to add more context!

Designed PostGIS Database...Points table and polygon tables...How to make more efficient?

This is a conceptual question, but I should have asked it long ago on this forum.
I have a PostGIS database with many tables in it. I have done some research on the use of keys in databases, but I'm not sure how to incorporate keys in the case of point data that is dynamic and grows over time.
I'm storing point data in one table, and this data grows each day. It's about 10 million rows right now and will probably grow about 10 million rows each year or so. There are lat, lon, time, and the_geom columns.
I have several other tables, each representing different polygon groups (converted shapefiles to tables with shp2pgsql), like counties, states, etc.
I'm writing queries that relate the point data to the spatial tables to see if points are inside of the polygons, resulting in things like "55 points in X polygon in the past 24 hours", etc.
The problem is, I don't have a key that relates the point table to the other tables. I think this is probably inhibiting query efficiency, but I'm not sure.
I know this question is fairly vague, and I'm happy to clarify anything, but I basically have a bunch of points in a table that I'm spatially comparing to other tables, and I'm trying to find the best way to design things.
Thanks for any help!
If you don't have them already, you should build a spatial index on both the point and polygon tables.
Even so, spatial comparisons are usually slower than numerical comparisons.
So adding one or more keys to the point table referencing the other tables, and using them in your SELECT queries instead of spatial operations, will surely speed things up.
Obviously, inserts will be slower, but given the numbers you gave (10 million rows per year), that should not be a problem.
Probably just adding a foreign key to the smallest entities (cities, for example) and joining the others to get the results (counties, states, ...) will be faster than the spatial comparison.
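A minimal sketch of that approach, run here through psycopg2; the counties table, its id/name/geom columns, the index names, and the exact report query are assumptions based on your description:

```python
# Hypothetical: GiST indexes, a county_id key stamped onto each point once,
# and a routine report that counts by the plain key instead of re-running
# the spatial predicate every time.
import psycopg2

statements = [
    "CREATE INDEX IF NOT EXISTS points_geom_gix ON points USING GIST (the_geom);",
    "CREATE INDEX IF NOT EXISTS counties_geom_gix ON counties USING GIST (geom);",
    "ALTER TABLE points ADD COLUMN IF NOT EXISTS county_id INT;",
    # One-time (or nightly) spatial tagging of the new points.
    """
    UPDATE points p
    SET    county_id = c.id
    FROM   counties c
    WHERE  p.county_id IS NULL
      AND  ST_Contains(c.geom, p.the_geom);
    """,
    # Reporting then joins on the integer key only.
    """
    SELECT c.name, COUNT(*) AS points_last_24h
    FROM   points p
    JOIN   counties c ON c.id = p.county_id
    WHERE  p."time" > now() - interval '24 hours'
    GROUP  BY c.name;
    """,
]

with psycopg2.connect("dbname=gisdb user=gis") as conn:  # placeholder DSN
    with conn.cursor() as cur:
        for sql in statements:
            cur.execute(sql)
        print(cur.fetchall())  # rows from the final SELECT
```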
Foreign keys (and other constraints) are not needed in order to query. Moreover, they arise as a consequence of whatever design turns out to be appropriate to the application, per principles of good design.
They just tell the DBMS that a list of values under a list of columns in one table also appears elsewhere as a list of values under a list of columns in some table. (This is useful for avoiding errors and improving optimization.)
You would still want indices on the columns involved in joins. E.g., you might want the X coordinates in two tables to have sorted indices, in the same order. This is independent of whether one column's values form a subset of another's, i.e., whether a foreign key constraint holds between them.
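To make the distinction concrete (reusing the hypothetical points/counties columns from the sketch in the previous answer; the index and constraint names are invented): the index is what helps the join, while the foreign key is a separate declaration about value containment.

```python
import psycopg2

with psycopg2.connect("dbname=gisdb user=gis") as conn:  # placeholder DSN
    with conn.cursor() as cur:
        # B-tree index on the join column of the big table
        # (counties.id, being a primary key, is typically indexed already).
        cur.execute(
            "CREATE INDEX IF NOT EXISTS points_county_id_idx ON points (county_id);"
        )
        # Independently, declare the subset relationship if it in fact holds.
        cur.execute("""
            ALTER TABLE points
            ADD CONSTRAINT points_county_fk
            FOREIGN KEY (county_id) REFERENCES counties (id);
        """)
```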

Microstrategy Data Model

I am new to MSTR.
We are working on migrating from Essbase to Microstrategy 10.2.
After migration, we expect business users to be able to create reports on top of the MSTR cube and play around with the data similarly to the way they have been doing with Essbase and Excel.
I need help to design data model for given scenario:
FactTb:
Subcategory Revenue
1 100
2 200
3 300
DimensionTb:
Category Subcategory
A 1
A 2
B 1
B 2
B 3
C 2
C 3
User wants to see revenue by category or subcategory.
FactTb has 3 rows. Assuming a size of 10 bytes per row, the size of FactTb is 30 bytes.
If it is joined with DimensionTb there will be 7 rows and the size will grow (approximately) to 70 bytes.
Is there any way to restrict size of Cube?
The mapping of Category to Subcategory is static, and there is no need to maintain a table for it.
Can I create/define DimensionTb outside the cube (store it in the report, creating a derived element from Subcategory)?
We want to restrict size of cube to maintain it in memory and ensure that report will always hit cube over database.
A cube is just the result of a SQL query, copied in memory for faster access. Just as you cannot imagine the result of a query being split in two, the same goes for a cube.
In-memory cubes are compressed by MicroStrategy using multiple algorithms (to apply the best compression depending on column data types and value distributions), but cubes also contain internal indexes (to speed up data access), created automatically depending on the queries used against the cube.
A VLDB setting can help reduce the size of the cube.
If you check the technote TN32540: Intelligent Cube Population methods in MicroStrategy 9.x, you will see different options. In my experience the last setting (Direct loading of dimensional data and filtered fact data) is quite helpful in speeding up cube loading and reducing size, but you can also try the other ones (Normalize Intelligent Cube data in the Database).
With this approach the values from the dimension tables will be stored in memory, but separately from the fact data, saving space.
Finally, to be sure that your users always use the cube, allow/teach them to create reports and dashboards by clicking directly on the cube (or selecting it).
This is the safe way; MicroStrategy also offers a dynamic way to map reports to cubes (when conditions are satisfied), but users can surprise even the most thorough designer.

Unsupervised Anomaly Detection with Mixed Numeric and Categorical Data

I am working on a data analysis project over the summer. The main goal is to use some access logging data in the hospital about user accessing patient information and try to detect abnormal accessing behaviors. Several attributes have been chosen to characterize a user (e.g. employee role, department, zip-code) and a patient (e.g. age, sex, zip-code). There are about 13 - 15 variables under consideration.
I was using R before and now I am using Python. I am able to use either depending on any suitable tools/libraries you guys suggest.
Before I ask any question, I do want to mention that a lot of the data fields have undergone an anonymization process when handed to me, as required in the healthcare industry for the protection of personal information. Specifically, a lot of VARCHAR values are turned into random integer values, only maintaining referential integrity across the dataset.
Questions:
An exact definition of an outlier was not given (it's defined based on the behavior of most of the data, if there's a general behavior) and there's no labeled training set telling me which rows of the dataset are considered abnormal. I believe the project belongs to the area of unsupervised learning so I was looking into clustering.
Since the data is mixed (numeric and categorical), I am not sure how would clustering work with this type of data.
I've read that one could expand the categorical data and let each category in a variable be either 0 or 1 in order to do the clustering, but then how would R/Python handle such high-dimensional data for me? (Simply expanding employee role would bring in ~100 more variables.)
How would the result of clustering be interpreted?
Using a clustering algorithm, wouldn't the potential "outliers" be grouped into clusters as well? And how am I supposed to detect them?
Also, with categorical data involved, I am not sure how "distance between points" is defined any more and does the proximity of data points indicate similar behaviors? Does expanding each category into a dummy column with true/false values help? What's the distance then?
Faced with the challenges of cluster analysis, I also started to try slicing the data up and just looking at two variables at a time. For example, I would look at the age range of patients accessed by a certain employee role, and use the quartiles and inter-quartile range to define outliers. For categorical variables, for instance employee role and the types of events being triggered, I would just look at the frequency of each event being triggered.
Can someone explain to me the problem of using quartiles with data that's not normally distributed? And what would be the remedy of this?
And in the end, which of the two approaches (or some other approaches) would you suggest? And what's the best way to use such an approach?
Thanks a lot.
You can decide upon a similarity measure for mixed data (e.g. Gower distance).
Then you can use any of the distance-based outlier detection methods.
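As a minimal sketch of that idea (the data frame, the neighbour count k, and the cut-off are invented for illustration; it relies on the third-party gower package plus NumPy/pandas):

```python
# Gower distance on mixed columns, then a simple distance-based outlier
# score: the mean distance to the k nearest neighbours.
import gower
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "employee_role": ["nurse", "nurse", "billing", "nurse", "admin"],
    "department":    ["ICU", "ICU", "finance", "ICU", "IT"],
    "patient_age":   [54, 61, 47, 58, 23],
})

dist = gower.gower_matrix(df)   # pairwise mixed-type distances in [0, 1]

k = 2
# Sort each row of the distance matrix; column 0 is the distance to itself (0).
knn_dist = np.sort(dist, axis=1)[:, 1:k + 1].mean(axis=1)

# Flag rows whose neighbourhood is unusually far away (arbitrary cut-off).
threshold = knn_dist.mean() + 2 * knn_dist.std()
print(np.where(knn_dist > threshold)[0])
```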
You can use the k-prototypes algorithm for mixed numeric and categorical attributes.
Here you can find a Python implementation.
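A small sketch of k-prototypes with the kmodes package (the data, the column split, and the number of clusters are made up; the usage follows the package's documented pattern):

```python
# k-prototypes clustering on mixed numeric/categorical data.
import numpy as np
from kmodes.kprototypes import KPrototypes

# Columns: patient_age (numeric), employee_role, department (categorical).
X = np.array([
    [54, "nurse",   "ICU"],
    [61, "nurse",   "ICU"],
    [47, "billing", "finance"],
    [58, "nurse",   "ICU"],
    [23, "admin",   "IT"],
], dtype=object)
X[:, 0] = X[:, 0].astype(float)   # the numeric column must be numeric

kp = KPrototypes(n_clusters=2, init="Huang", n_init=5, random_state=0)
labels = kp.fit_predict(X, categorical=[1, 2])   # indices of categorical columns

# Very small clusters, or points far from their cluster prototype,
# are candidate outliers for manual review.
print(labels)
```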