I have a hierarchical clustering tree (built using linkage). Each cluster has its own level in the dendrogram, which corresponds to the cost of that cluster. I have a budget for n1 clusters with cost c1, n2 clusters with cost c2, and n3 clusters with cost c3, where c1 > c2 > c3. The question is which clusters to select, within the budget, so that all original items are covered. Obviously, a slot for a cluster of cost c1 can also be used for a c2 or c3 cluster.
For only one budget category, the solution is trivial: just start from the root, and in each subtree, as soon as we go below c1, add that subtree as a cluster. If the n1 budget is exhausted before all subtrees are covered, there is no solution.
For two categories it is also simple. Find all candidate clusters at cost c1, label each with the number of c2 sub-clusters it contains, and sort in descending order. Select, for the c1 budget, the n1 candidates with the largest labels; then select c2 clusters for the rest.
But the problem becomes complex for more than 2 categories, because sorting for the c2 category does not necessarily keep track of the c3 budget.
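For reference, here is a minimal sketch of the single-category greedy described above, assuming the dendrogram is available as a simple tree of merge heights (the Node type and all names are made up for illustration; Scala):

// Hypothetical dendrogram representation: each node carries the merge
// height ("cost") of its cluster.
case class Node(height: Double, children: List[Node])

// Walk down from the root; the first time a subtree's height drops to
// c1 or below, that whole subtree can be bought as one c1 cluster.
def selectClusters(root: Node, c1: Double, n1: Int): Option[List[Node]] = {
  def collect(node: Node): List[Node] =
    if (node.height <= c1) List(node)   // affordable: take the whole subtree
    else node.children.flatMap(collect) // too expensive: recurse into children

  val selected = collect(root)
  if (selected.size <= n1) Some(selected) // the n1 budget suffices
  else None                               // budget exhausted: no solution
}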
I am reading through the code for the statistical estimator in Postgres' query optimizer to understand how it works.
For reference, the statistical estimator estimates the size of the output of an operation (e.g. join, select) in a Postgres plan tree. This allows Postgres to choose between the different ways a query can be executed.
Postgres' statistical estimator uses cached statistics about the contents of each of a relation's columns to help estimate output size. The two key saved data structures seem to be:
A most common value (MCV) list: a list of the most common values stored in that column and the frequency with which each of them appears.
A histogram of the data stored in that column.
For example, given the table:
X Y
1 A
1 B
1 C
2 A
2 D
3 B
The most common values list for X would contain {1: 0.5, 2: 0.333}.
However, when Postgres completes the first join in a multi-join operation like the example below:
SELECT *
FROM A, B, C, D
WHERE A.ID = B.ID AND B.ID2 = C.ID2 AND C.ID3 = D.ID3
the resulting relation does not have an MCV list (or histogram), since we've just created it and haven't ANALYZEd it! This will make it harder to estimate the output size/cost of the remaining joins.
Does Postgres automatically generate/estimate the MCV (and histogram) for this table to help statistical estimation? If it does, how does it create this MCV?
For reference, here's what I've looked at so far:
The documentation giving a high-level overview of how Postgres's statistical planner works:
https://www.postgresql.org/docs/12/planner-stats-details.html
The code which carries out the majority of Postgres's statistical estimation:
https://github.com/postgres/postgres/blob/master/src/backend/utils/adt/selfuncs.c
The code which generates a relation's MCV:
https://github.com/postgres/postgres/blob/master/src/backend/statistics/mcv.c
Generic logic for clause selectivities:
https://github.com/postgres/postgres/blob/master/src/backend/optimizer/path/clausesel.c
A pointer to the right code file to look at would be much appreciated! Many thanks for your time. :)
The result of a join is called a join relation in PostgreSQL jargon, but that does not mean that it is a “materialized” table that is somehow comparable to a regular PostgreSQL table (which is called a base relation).
In particular, since the join relation does not physically exist, it cannot be ANALYZEd to collect statistics. Rather, the row count is estimated based on the size of the joined relations and the selectivity of the join conditions. This selectivity is a number between 0 (the condition excludes all rows) and 1 (the condition does not filter out anything).
The relevant code is in calc_joinrel_size_estimate in src/backend/optimizer/path/costsize.c, which you are invited to study.
The key points are:
Join conditions that correspond to foreign keys are considered specially:
If all columns of a foreign key appear as join conditions, then we know that the result of such a join must be exactly as big as the referencing table, so the selectivity is 1 / (size of the referenced table).
Other join conditions are estimated separately by guessing what percentage of rows will be eliminated by that condition.
In the case of a left (or right) outer join, we know that the result size must be at least as big as the left (or right) side.
Finally, the size of the cartesian join (the product of the relation sizes) is multiplied by all the selectivities calculated above.
Note that this treats all conditions as independent, which causes bad estimates if the conditions are correlated. But since PostgreSQL doesn't have cross-table statistics, it cannot do better.
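As a toy illustration of that final multiplication (in Scala; all numbers are invented, not Postgres output):

// Join size estimate sketch: cartesian size times the product of the
// per-condition selectivities.
val rowsA = 100000.0
val rowsB = 50000.0
val selFk    = 1.0 / rowsB // foreign-key join: 1 / size of the referenced table
val selOther = 0.1         // guessed selectivity of an additional condition
val estimatedRows = rowsA * rowsB * selFk * selOther
// = 100000 * 50000 * (1 / 50000) * 0.1 = 10000 rows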
I performed a clustering analysis of the media usage of different users in order to find groups that use a specific set of media (e.g. group 1 uses media A, B and C, and group 2 uses media B, C and D). Then I divided the dataset into the different groups, since each user belongs to a specific group (as a consequence, the original dataset and the new datasets have different sizes). Within these groups, I would like to cluster again to find which different media sets are used.
How can I determine the number of clusters to guarantee that the results are comparable?
Thank you in advance!
Don't rely on clustering to be stable.
It's a hypothesis generation tool.
You clustered, and now you have the hypothesis that there are groups ABCD of media usage. You should first evaluate whether this hypothesis is adequate. What you want to do in your next step is assign the labels to subsets of the data. First of all, you should be able to simply subset this from the previous labels. But if this really is different data, you can label the new data, for example using the most similar record (nearest-neighbor classification). But that is classification now, because your classes are fixed.
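A minimal sketch of that nearest-neighbor labeling step (in Scala; all names are illustrative):

// Assign each new record the cluster label of the closest
// already-labeled record (1-nearest-neighbor classification).
def euclidean(a: Array[Double], b: Array[Double]): Double =
  math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

def labelByNearest(labeled: Seq[(Array[Double], Int)], // (features, label)
                   newRecord: Array[Double]): Int =
  labeled.minBy { case (features, _) => euclidean(features, newRecord) }._2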
I have been trying to generate recommendations for some selected users in Spark. This is done by taking the dot product of a user factor (a vector of n floats) with each product factor (also a vector of n floats) and then sorting in descending order.
So, let's say I have customer factors as (customerId, Array[Float]) and product factors as (productId, Array[Float]). I have to compute the score of each product for every customer and produce (customerId, productId, score), keeping only the top N results for each customer. So I do this:
val customers = ... // (customerId, Array[Float])
val products = ... // (productId, Array[Float])
val combination = customers.cartesian(products)
val result = combination.map(x => (x._1._1, x._2._1,
  dotProd(x._1._2, x._2._2)))
... then filter the top N for each customer using a DataFrame
But this is taking ages, and one reason is that the cartesian product makes the data size huge, repeating the same product factors for each and every customer.
As you can see, this is 11 TB of data for 100K customers and 300K products. And this is the DAG created (I do a select and keep only the top N scores, hence the partition):
What would you suggest? How can I improve the process to get around the huge IO?
Thanks
UPDATE
In the end, it took 10 hours to run this on 48 cores.
And with 80TB of IO!
Update 2
I suspect the solution is to collect and then broadcast the two RDDs, build the cartesian product on just the IDs, and then look up the factors. This will massively reduce the IO.
I will give it a go.
[NOTE: I will not accept my own answer since this is just an improvement, and not materially better]
As I described, I broadcasted the customer and product factors, and that sped up the process by almost 3x and reduced the IO to 2.4 TB.
There could be even more improved approaches, but I guess this is OK for now.
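For reference, here is a sketch of the broadcast variant described in Update 2, assuming the product factors fit in driver memory (sc, customers, products and dotProd are as in the question):

// Collect the (smaller) product factors to the driver and broadcast
// them, so each customer is scored locally instead of shuffling a
// full cartesian product across the cluster.
val bProducts = sc.broadcast(products.collect()) // Array of (productId, factors)
val topN = 10

val recommendations = customers.flatMap { case (customerId, custFactors) =>
  bProducts.value
    .map { case (productId, prodFactors) =>
      (customerId, productId, dotProd(custFactors, prodFactors))
    }
    .sortBy(-_._3) // highest score first
    .take(topN)    // keep only the top N per customer
}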
I am working on an exercise to build an influencer score for each user in my data set. What that means is that a user with higher engagement should get a higher score and vice versa. However, I have many different types of engagement variables and I am not sure which ones should be weighted higher.
So, I first did a cluster analysis to divide users into different groups based on engagement activity, using 5 different types of engagement. Based on this, I found that one of the clusters has a high level of engagement across all the engagement variables. This is the group I am interested in. However, it is possible that the group size I get may be smaller than the number of users I want to use in the future. So, I now want to use these clusters to create a propensity score.
E.g., in the cluster analysis, say I get 5 clusters c1, c2, c3, c4, c5, and c5 is my cluster of interest. I give all users in c5 a value of 1 (= influencer) and all users in c1 to c4 a value of 0 (= not influencer). Now, I use this binary variable to build a logistic regression model (using the same engagement variables as used for clustering) to get everyone's propensity to be an influencer. This way, I can change the threshold to reduce or increase the number of users I want to select.
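For concreteness, a hypothetical sketch of that pipeline using Spark ML (every column, variable and DataFrame name here is made up; userClusters is assumed to hold the engagement variables plus the cluster assignment):

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.functions.{col, when}

// Label c5 members as 1 (influencer) and everyone else as 0, then fit
// a logistic regression on the same engagement variables.
val labeled = userClusters.withColumn("label",
  when(col("cluster") === "c5", 1.0).otherwise(0.0))

val assembled = new VectorAssembler()
  .setInputCols(Array("eng1", "eng2", "eng3", "eng4", "eng5"))
  .setOutputCol("features")
  .transform(labeled)

val model = new LogisticRegression().fit(assembled)
// model.transform(assembled) adds a "probability" column, i.e. the
// propensity score, which can be thresholded to select more or fewer users.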
Now, the issue I am running into is that one of the engagement variables predicts influencer status very well, and hence my propensity scores are very close to either 1 or 0, which defeats the purpose of wanting a propensity score in the first place.
So, 2 questions:
1) Is this approach of building an unsupervised classification and then using it to build a supervised classification a sound approach for what I am trying to do?
2) How do I reduce the contribution of the variable that predicts influencer status really well, to get a much smoother curve instead of values near 0 or 1? I don't want to remove this variable from the model, as it is important from a business perspective.
I have a dataset that contains 3 categories {c1, c2, c3}. I'm using the single-linkage hierarchical clustering method (from MATLAB) to cluster the dataset, with a distance measure I built myself. The following figure shows the results. Note that the hierarchical clustering method clusters the data correctly: the points of c1 (yellow) are very close to each other, and similarly for c2 (green) and c3 (blue).
From the figure, we can see that the distances between the points in c1 are very small compared to those in c2 and c3. So, for example, if I decide to cut the tree at 8, this will result in c1, c2 and c3 being split into 8 clusters, where each point ends up in a different cluster.
How can I overcome this problem? Do I need to change the clustering method, or should I cut the tree at 17 and then cluster the resulting clusters again?
There are different ways of extracting clusters from a dendrogram. You are not required to do a single cut (although MATLAB may only offer this choice). Selecting regions like you did is also reasonable, and so is cutting the dendrogram at multiple heights. But not every tool has all of these capabilities.
Notice that c3 was split into two, half of which is not well separated from c2.
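For illustration, flat clusters at a chosen cut height can be computed directly from a linkage matrix; here is a small sketch in Scala (the representation is assumed: row k merges cluster ids left and right at the given height, producing cluster id n + k, as in a MATLAB/SciPy-style linkage matrix with 0-based ids):

// Extract flat clusters by cutting the dendrogram at cutHeight.
// Ids 0..n-1 are the original points; n + k is the cluster created by
// linkage row k. Assumes non-decreasing merge heights (true for
// single linkage), so the merges below the cut form a prefix.
def cutTree(linkage: Array[(Int, Int, Double)], n: Int, cutHeight: Double): Array[Int] = {
  val parent = Array.tabulate(2 * n - 1)(identity)
  def find(x: Int): Int =
    if (parent(x) == x) x else { parent(x) = find(parent(x)); parent(x) }

  for (((left, right, height), k) <- linkage.zipWithIndex if height <= cutHeight) {
    parent(find(left)) = n + k  // fold both sides into the new cluster id
    parent(find(right)) = n + k
  }
  Array.tabulate(n)(i => find(i)) // flat cluster id for each original point
}

Calling it with different thresholds (e.g. 8 vs. 17) corresponds to the different cuts discussed above.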