Optimal relational design for groups and subgroup relationships - postgresql

I have a bit of an intro-level relational database design question. I'm working on a project where I'm capturing information from scientific journal articles and storing that in a Postgres database. One of my primary goals is to define a schema that is flexible enough to cover most cases I might encounter in a broad set of papers. In reality, articles tend to report a semi-standard set of details, but there's definitely variance once you get into the details. These things are written for humans, not machines.
For the most part, defining the schema has been pretty straightforward, but one thing I'm stuck on is how to sensibly structure a set of tables to capture details about a study's subject groups and subsets of subjects.
Take, for example, a simple randomized controlled trial - you typically have a set of people screened for eligibility, a set determined to be eligible, a set randomized into the control group, and a set randomized into the treatment group. Within each of those groups you can have subgroups defined in all sorts of specific ways, but generally by some sort of interval (e.g. age 26-32) or a category (e.g. pregnant/not pregnant).
Currently, I've set this up so that a Study record can have many Subject records, and Subject records can have many Interval_Subgroup records and many Categorical_Subgroup records.
Subject
-----------------------------------------
id | groupType | measure | value | study
-----------------------------------------
13 | treatment | count | 578 | 17
14 | control | count | 552 | 17
Interval_Subgroup
---------------------------------------------------------------
id | factor | factorMin | factorMax | measure | value | subject
---------------------------------------------------------------
41 | age | 18 | 24 | count | 125 | 13
42 | age | 25 | 32 | count | 204 | 13
Categorical_Subgroup
-----------------------------------------------------
id | factor | factorValue | measure | value | subject
-----------------------------------------------------
74 | sex | male | count | 251 | 13
75 | sex | female | count | 327 | 13
This seems workable, but it feels clunky because I have two tables capturing the same type of information. It's also limiting because it wouldn't allow me to capture any combination of subgroup sets, such as males aged 18-24. Some studies report that kind of detail, some don't, but I want to be able to capture whatever depth of subgroup info the paper offers.
What is a more flexible way to structure these tables than what I've described above? I'm trying to sketch out how I think this should work, and right now I have subject groups having many subgroups, and subgroups having many subgroup definitions. There would be just one table capturing measurements about subgroups, and another table defining what each subgroup is. I'm not sure if that's the right direction. Maybe there's a far simpler solution that you might know of.
Thanks for taking the time to help out - it's much appreciated!
Edit:
Fixed id to be unique in the example tables.

From your description it sounds like a factor is a thing in its own right, and that each subgroup has one or more factors. To me this implies that factor needs its own table. Factors can in turn be of type interval or categorical, which suggests single table inheritance (STI) might be in order.
Example tables might look something like this:
subgroups
------------------------------
id | measure | value | subject
------------------------------
41 | count | 125 | 13
42 | count | 204 | 13
factors
-----------------------------------------------------------------------------
id | type        | factor | category | interval_min | interval_max | subgroup
-----------------------------------------------------------------------------
68 | interval    | age    | NULL     | 18           | 24           | 41
69 | categorical | sex    | male     | NULL         | NULL         | 41
In this example subgroup 41 has two factors: age 18-24 and sex = male.
It could also be that STI is overkill here, in which case you'd split factor into two tables, categorical_factors and interval_factors, and a subgroup could have zero or many of each.
As far as I'm aware, the complexity of using STI mostly depends on what ORM you're using. Rails / ActiveRecord has good support, other frameworks vary.
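If it helps, here's a minimal PostgreSQL sketch of the STI version. It's only a sketch under assumed names - the subjects table, the column names and the CHECK constraints are illustrative, not something from your post:

CREATE TABLE subgroups (
    id      serial  PRIMARY KEY,
    subject integer NOT NULL REFERENCES subjects (id),  -- assumes a subjects table exists
    measure text    NOT NULL,                           -- e.g. 'count'
    value   numeric NOT NULL
);

CREATE TABLE factors (
    id           serial  PRIMARY KEY,
    subgroup     integer NOT NULL REFERENCES subgroups (id),
    type         text    NOT NULL CHECK (type IN ('interval', 'categorical')),
    factor       text    NOT NULL,   -- e.g. 'age', 'sex'
    category     text,               -- only set for categorical factors
    interval_min numeric,            -- only set for interval factors
    interval_max numeric,
    CHECK (
        (type = 'categorical' AND category IS NOT NULL)
        OR (type = 'interval' AND interval_min IS NOT NULL AND interval_max IS NOT NULL)
    )
);

-- A subgroup like "males aged 18-24" is then one subgroups row with two factors rows,
-- so any combination of factors a paper reports can be attached to it:
SELECT s.measure, s.value
FROM subgroups s
JOIN factors sex ON sex.subgroup = s.id
                AND sex.factor = 'sex' AND sex.category = 'male'
JOIN factors age ON age.subgroup = s.id
                AND age.factor = 'age' AND age.interval_min = 18 AND age.interval_max = 24;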
Hope that helps!

Related

Kafka / KSQL, stuck in reducing stream/table

What I have are two streams (from two different systems, imported via connectors). Some of the information from the different streams will be used to build combined information.
Currently, I'm working with ksqlDB but I'm having problems with the last step to reduce the information from both streams.
Both streams contain a tree structure (id/parentId), so I've used a second table for each stream to find certain information from the parents, which is then joined into a table containing all the information needed to do the final reduce.
The main matching column is always the same; however, one or more additional columns (not a fixed set) are also needed to do the final match. Those columns might also only partially match between the streams.
An example output of the table might look like this:
| id | match | matchExtra1 | matchExtra2 | matchExtra3 |
|----|-------|-------------|-------------|-------------|
| 1  | 1     | Extra1      | Extra2      | Extra3      |
| 2  | 1     | Extra1      | Extra4      | Extra5      |
| 3  | 1     | Extra6      | Extra7      | Extra8      |
| 4  | 1     | Extra9      | Extr10      | tra8        |
In this case, ids 1 and 2 should be one match, and ids 3 and 4 should be another match.
If this is possible within ksqlDB, that would be great. If needed to work with low-level Kafka, that's fine as long as we can achieve the end result.
Basic flow as I have it right now:

Getting breakdown of "Others" (the rest of Top N members) with SSAS MDX

How can I recursively get the breakdown of "Others" when Top N is applied to dimensions?
Imagine a measure Sales Amount is sliced by 3 dimensions, Region, Category and Product, and Top 1 is applied to each dimension. The result I want to see is a table like below. On each slice, the rest of members are grouped as "Others".
Region | Category | Product | Sales
============================================
Europe | Bikes | Mountain Bikes | $100
| |------------------------
| | Others | $ 30
|-----------------------------------
| Others | Gloves | $ 50
| |------------------------
| | Others | $120
--------------------------------------------
Others | Clothes | Jackets | $ 80
| |------------------------
| | Others | $130
|-----------------------------------
| Others | Shoes | $ 90
| |------------------------
| | Others | $110
--------------------------------------------
When an "Others" appears, I want to see the Top 1 of the next dimension within the scope of this "Others". This seems a little tricky. e.g. tuples like (North America, Clothes) and (Central America, Clothes) need to be aggregated as (Other Regions, Clothes). Is there a neat way to aggregate the measure based on the 2nd dimension, Category?
Alternatively, I think a sub cube that filters out Europe will easily provide the breakdown of Other Regions, Clothes and Other Categories. However, this is likely to result in creating many dependent queries. For an easy processing of the result set, it would be ideal if the query returns data in the above format.
Can this be possibly achieved by a single MDX query?
To get the breakdown of "Others" you can use dynamic sets, EXCEPT() and the AGGREGATE() function.
In each of the three dimensions you will need to create a named dynamic set that holds two members (the Top 1 member and "Others").
As an example, in the Category dimension I have created an "Others" calculated member and a dynamic set that holds the two members (Top 1 and Others), like this:
CREATE MEMBER CURRENTCUBE.[Product].[French Product Category Name].[ALL].[OTHERS] AS
    AGGREGATE(
        EXCEPT(
            [Product].[French Product Category Name].[French Product Category Name].MEMBERS,
            TOPCOUNT(
                [Product].[French Product Category Name].[French Product Category Name].MEMBERS,
                1,
                [Measures].[Sales Amount])));

CREATE DYNAMIC SET CURRENTCUBE.[TOP1 and Others] AS
    { TOPCOUNT(
          [Product].[French Product Category Name].[French Product Category Name].MEMBERS,
          1,
          [Measures].[Sales Amount]),
      [OTHERS] };
Because the set is dynamic, the values of the Top 1 member and "Others" will change according to the filters and slicers that you apply.

Scala use ML to find outliers in a Dataframe

This question is going to be a little vague, but I can't seem to find any concrete examples online.
https://spark.apache.org/docs/0.9.0/mllib-guide.html
From the above Spark docs, I can see multiple ways of training and predicting anomalies/outliers with the MLlib library. However, every single one of those examples involves only plain numbers, or at most 2 columns of data.
I can't figure out how to train and predict on a dataset with more columns.
Let's say I wanted to use the clustering method to find outliers in my data, and my data looks like the following in a DataFrame:
UserId | Department | Date | Item | Cost
user1 | Electronic | 11-19 | Iphone | 115
user1 | Electronic | 11-19 | Iphone | 150
user1 | Electronic | 11-19 | Iphone | 900
user1 | Electronic | 11-23 | Iphone | 85
user1 | Electronic | 11-20 | Iphone | 120
user2 | Electronic | 11-19 | Iphone | 600
user2 | Electronic | 11-19 | Iphone | 550
user2 | Electronic | 11-19 | Iphone | 600
user2 | Electronic | 11-23 | Iphone | 575
user2 | Electronic | 11-20 | Iphone | 570
....
There will be millions of rows like this across months.
I want to study the patterns of the users over the past X months and keep updating my model every day with new data. So something like
user1 | Electronic | 11-19 | Iphone | 900
should be considered an outlier
How can I apply any of the above supervised learning methods to this type of dataset?
Thanks!
Are you sure that you are using Spark 0.9 (the current version is 2.2)? The site you have quoted shows a KMeans example [1]. The parsedData parameter can have more than two columns, but KMeans in Spark 0.9 can only handle double values [2].
The other examples can also have more than two columns [3]: the label parameter could be a running number and the features would be your listed data, but, as with KMeans, Spark 0.9 is only able to handle double values.
Looking at the other available classes of the 0.9 API leads me to assume that Spark 0.9 was only able to handle double values. If you want to handle data like you have shown above, you should consider using a more recent version of Spark.
[1] https://spark.apache.org/docs/0.9.0/mllib-guide.html#clustering-1
[2] https://spark.apache.org/docs/0.9.0/api/mllib/index.html#org.apache.spark.mllib.clustering.KMeans$
[3] https://spark.apache.org/docs/0.9.0/api/mllib/index.html#org.apache.spark.mllib.regression.LabeledPoint

Which table is correct Database Design?

Option 1:
Country | Risk Category | Value
USA | Health | 0.75
USA | Market | 0.66
USA | Technology | 0.35
Option 2:
Country | Health Risk | Market Risk | Technology Risk
USA | 0.75 | 0.66 | 0.35
Option 1: allows new risk categories to be added dynamically without needing to add columns when changes occur, but I'll have to run algorithms to find the values I'm looking for, as it doesn't work well with LINQ.
Option 2: is easier to work with in Entity Framework, as everything is scaffolded out. However, the database needs to change every time a new category is added. Also, there could potentially be 200+ columns.
Which option is the best for long-term success and maintainability?
Honestly, neither is optimal for long-term success and maintainability, but if I had to choose I'd go for Option 2.
Since you have a many-to-many relationship between the country entity and the risk entity, I'd separate the information a bit more and do it like this:
Country
ID | Name |
1 | USA |
2 | CANADA |
Risk
Id | Name |
1 | Health |
2 | Market |
3 | Technology |
CountryRisk
CountryId | RiskId | Value |
1 | 1 | 0.75 |
1 | 2 | 0.66 |
1 | 3 | 0.35 |
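As a rough SQL sketch of that design (PostgreSQL-flavoured syntax; the table and column names are just illustrative):

CREATE TABLE country (
    id   serial PRIMARY KEY,
    name text   NOT NULL
);

CREATE TABLE risk (
    id   serial PRIMARY KEY,
    name text   NOT NULL
);

CREATE TABLE country_risk (
    country_id integer NOT NULL REFERENCES country (id),
    risk_id    integer NOT NULL REFERENCES risk (id),
    value      numeric NOT NULL,
    PRIMARY KEY (country_id, risk_id)  -- one value per country/risk pair
);

-- New risk categories become rows instead of columns, e.g. all risks for one country:
SELECT c.name AS country, r.name AS risk, cr.value
FROM country_risk cr
JOIN country c ON c.id = cr.country_id
JOIN risk    r ON r.id = cr.risk_id
WHERE c.name = 'USA';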
Option 2.
If every country is always going to have a Health Risk, Market Risk and Technology Risk, you'll never risk repeating data. Databases should never repeat data.
I honestly can't see any benefits to Option 1 over Option 2.

Calculate median and average in a partition in Tableau using table calculation

I have a detail table of posts and subjects dug from a forum. Each row is a single post-subject combination (i.e. postID and subjectID together form the primary key for the table), and I have some measures at the subject level and some at the post level. For example:
+---------+-------------+--------------+------------+--------------+--------+
| post.ID | post.Author | post.Replies | subject.ID | subject.Rank | year |
+---------+-------------+--------------+------------+--------------+--------+
| 1 | mike | 10 | movie | 4 | 1990 |
| 1 | mike | 10 | comics | 6 | 1990 |
| 2 | sarah | 0 | tv | 10 | 2001 |
| 3 | tom | 4 | tv | 10 | 2003 |
| 3 | tom | 4 | comics | 6 | 2003 |
| 4 | mike | 1 | movie | 4 | 2008 |
+---------+-------------+--------------+------------+--------------+--------+
I want to study the trend of posts and subjects by year and color it by subject.Rank.
The first two are easily measured by putting COUNTD(post.ID) and COUNTD(subject.ID) in Rows and 'year' in Columns.
But if I drag MEDIAN(subject.Rank) onto Color, I get a wrong result: it's calculated at the row level, not at the distinct subject.ID level.
I think I can accomplish it using table calculation features, but I have no idea on how to proceed.
It sounds like you are trying to treat Subject.Rank as a dimension, instead of as a measure. If so, just convert it to a dimension on the worksheet in question by right clicking on the field and choosing dimension. You can also convert it to a dimension in the data pane by dragging the field from the measures section up to the dimensions section. That will tell Tableau to treat that field as a dimension by default in the future.
A field can be treated as a dimension in some cases, and as a measure in others; it depends on what you are trying to achieve. If you are familiar with SQL, dimensions are the fields used to partition data rows for aggregation, as with the GROUP BY clause.
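To make that analogy concrete, the counts from the question would look roughly like this in SQL (the forum_posts table and column names are made up for illustration):

-- year acts as the dimension (the GROUP BY column);
-- the distinct counts act as the measures (the aggregates)
SELECT year,
       COUNT(DISTINCT post_id)    AS distinct_posts,
       COUNT(DISTINCT subject_id) AS distinct_subjects
FROM forum_posts
GROUP BY year;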
Finally, count distinct (COUNTD) can be expensive on large datasets. Often, you can get the same result another way. So try to think of other approaches and save COUNTD for when you really need it.
Try using a fixed level-of-detail expression, e.g. {FIXED [1st level], [2nd level] : MEDIAN([subject.Rank])}
or
Table calculation approach:
When you add the MEDIAN pill, open Edit Table Calculation and, under Compute Using > Advanced, put your fields in there (make sure they are ordered the way you want the calculation to run when you select them), then click OK and set "At the level" and "Restarting every" as needed.