How can I cluster a set of queries with starting and ending times? - cluster-analysis

I'm given a set of queries q_1, q_2, ..., q_m. The query q_i for any 1<=i<=m can be executed several times in different times. I'm also given the starting and ending times of any execution of the query q_i.
Given the above information, I am interested in clustering queries such that a cluster consists of a set of queries that are ''often'' executed within some ''time gaps'' or distance. As it is shown in the following figure, for an example, we are given 5 queries with their execution time windows.
A feasible clustering algorithm outputs the following result.
Which clustering algorithms do you recommend for my problem?

Extract appropriate features.
Such as duration, frequency, cooccurrence.
Next, define a similarity function toquantify which queries are similar.
Then choose the algorithm based on that.

Related

Why does every simulation I perform in Arena Simulation give the same results?

Every time I run a simulation with the same parameters in the Run window I get exactly the same results. The results are different if a different number of replications is set each time the run is started
These are my settings in the run window:
enter image description here
I have a lot of Process blocks. Most of them have a normal distribution in duration. Why are the results not different?
If it helps in any way, her is a photo of the constructed model:
enter image description here
Arena uses the same random nr stream for your run unless you tell it to use another - so your first rep will look the same every time. The answer depends on the distributions you sample from. If you change the logic and sample at other times, the answer will change. Each following rep will have a unique answer based on the random nr streams you are sampling from. It makes it easier/possible to find errors if you can execute exactly the same run.
Arena provides (by default) ten different "streams" of (pseudo) random numbers. If you don't ask the system to use a particular stream, it will use stream 10. For example NORM(10,2) will use stream 10 to calculate a random normally distributed number (mean 10, standard deviation 2); NORM(10,2,4) will use stream 4 to calculate a similarly distributed number.
By default, the 10 random number generators for the streams are initialised at the beginning of each run to 14561, 25971, 31131, 22553, 12121, 32323, 19991, 18765, 14327, and 32535 (from Arena help). At the end of one replication, the generators will not be re-initialised, so they will start the next replication with a new value.
You can control the random number generator initialisation with the SEEDS element.
As #Marlize says, this helps to ensure that you can reproduce a simulation result if you need to.

Prometheus query quantile of pod memory usage performance

I'd like to get the 0.95 percentile memory usage of my pods from the last x time. However this query start to take too long if I use a 'big' (7 / 10d) range.
The query that i'm using right now is:
quantile_over_time(0.95, container_memory_usage_bytes[10d])
Takes around 100s to complete
I removed extra namespace filters for brevity
What steps could I take to make this query more performant ? (except making the machine bigger)
I thought about calculating the 0.95 percentile every x time (let's say 30min) and label it p95_memory_usage and in the query use p95_memory_usage instead of container_memory_usage_bytes, so that i can reduce the amount of points the query has to go through.
However, would this not distort the values ?
As you already observed, aggregating quantiles (over time or otherwise) doesn't really work.
You could try to build a histogram of memory usage over time using recording rules, looking like a "real" Prometheus histogram (consisting of _bucket, _count and _sum metrics) although doing it may be tedious. Something like:
- record: container_memory_usage_bytes_bucket
labels:
le: 100000.0
expr: |
container_memory_usage_bytes > bool 100000.0
+
(
container_memory_usage_bytes_bucket{le="100000.0"}
or ignoring(le)
container_memory_usage_bytes * 0
)
Repeat for all bucket sizes you're interested in, add _count and _sum metrics.
Histograms can be aggregated (over time or otherwise) without problems, so you can use a second set of recording rules that computes an increase of the histogram metrics, at much lower resolution (e.g. hourly or daily increase, at hourly or daily resolution). And finally, you can use histogram_quantile over your low resolution histogram (which has a lot fewer samples than the original time series) to compute your quantile.
It's a lot of work, though, and there will be a couple of downsides: you'll only get hourly/daily updates to your quantile and the accuracy may be lower, depending on how many histogram buckets you define.
Else (and this only came to me after writing all of the above) you could define a recording rule that runs at lower resolution (e.g. once an hour) and records the current value of container_memory_usage_bytes metrics. Then you could continue to use quantile_over_time over this lower resolution metric. You'll obviously lose precision (as you're throwing away a lot of samples) and your quantile will only update once an hour, but it's much simpler. And you only need to wait for 10 days to see if the result is close enough. (o:
The quantile_over_time(0.95, container_memory_usage_bytes[10d]) query can be slow because it needs to take into account all the raw samples for all the container_memory_usage_bytes time series on the last 10 days. The number of samples to process can be quite big. It can be estimated with the following query:
sum(count_over_time(container_memory_usage_bytes[10d]))
Note that if the quantile_over_time(...) query is used for building a graph in Grafana (aka range query instead of instant query), then the number of raw samples returned from the sum(count_over_time(...)) must be multiplied by the number of points on Grafana graph, since Prometheus executes the quantile_over_time(...) individually per each point on the displayed graph. Usually Grafana requests around 1000 points for building smooth graph. So the number returned from sum(count_over_time(...)) must be multiplied by 1000 in order to estimate the number of raw samples Prometheus needs to process for building the quantile_over_time(...) graph. See more details in this article.
There are the following solutions for reducing query duration:
To add more specific label filters in order to reduce the number of selected time series and, consequently, the number of raw samples to process.
To reduce the lookbehind window in square brackets. For example, changing [10d] to [1d] reduces the number of raw samples to process by 10x.
To use recording rules for calculating coarser-grained results.
To try using other Prometheus-compatible systems, which may process heavy queries at faster speed. Try, for example, VictoriaMetrics.

ELKI GUI no clustering results for Hierarchical clustering

I'm new to ELKI and I need to do some basic clustering of a dataset that I already tested and clustered in Weka. I'm using the "GUI version" and I read the tutorial Analyzing the "mouse" data set on ELKI site: http://elki.dbs.ifi.lmu.de/wiki/Tutorial#Analyzingthemousedataset
I clustered my dataset with EM and successfully visualized and output the results (from the tutorial I just changed the parameter resultHandler: ResultWriter). The results I got in the folder are are: cluster.txt, cluster-evaluation.txt and settings.txt.
I have problems with the output results for hierarchical algorithms (SLINK,CLINK, etc.). The output that I got is just the settings.txt, but I need the cluster.txt.
I need to change some other parameters, because on the log view there are no errors?
To get partitions from a hierarchical clustering result, you also need to specify a cluster extraction method:
-algorithm clustering.hierarchical.extraction.HDBSCANHierarchyExtraction
-algorithm CLINK
-hdbscan.minclsize 50
Note that we have two -algorithm parameters now, and order is important. The extraction algorithm has a "nested" algorithm call to do the actual hierarchical clustering.
On the long run, we want to move to an operator-based approach (in particular for GUIs). For the command line, the nested-invocation is more safe, as you cannot attempt to extract without running a hierarchical clustering.
As for CLINK, the cluster quality is usually not too good (it also is order dependent, so shuffling the data and running multiple times will give different results). I'd also give AGNES or Anderberg with complete linkage a try; AGNES is always O(n^3), Anderberg is usually in O(n^2) (only worst case is O(n^3)) and both produce much better results (they are expected to produce the same results except for tied distances, CLINK is different):

rapidminer: cluster performance operators..what does different value mean?

I have to check performance of various clustering algos using different performance operators in rapidminer. For that I want to know the following things:
what does cluster number index value shows which is output of cluster count performance operator?
what does small and large value of avg within cluster distance and avg. within centroid distance mean in terms of good and bad clustering?
I also want to check other indexes value like Dunn index,Jaccard index, Fowlkes–Mallows for various clustering algos. but rapidminer don't have any operator for this, what to do for that. I don't have experience with R.
I have copied part of the answer I gave on the Rapid-I forum
The cluster number index is the count of clusters - pointless you might say but when used with DBSCAN, it can be quite interesting http://rapidminernotes.blogspot.co.uk/2010/12/counting-clusters.html
The avg within cluster and centroid distances are hard to interpret - one thing to search for is "elbow criterion" in this context. As the number of clusters varies, note how the validity measure changes and look for an "elbow" that marks the point where the natural progression of the measure dominates the structure.
R has many validity measures and it's worth investing some time because you can always call the R process from RapidMiner which makes it easier to work out what is going on.

Mahout K-means has different behavior based on the number of mapping tasks

I experience a strange situation when running Mahout K-means:
Using the a pre-selected set of initial centroids, I run K-means on a SequenceFile generated by lucene.vector. The run is for testing purposes, so the file is small (around 10MB~10000 vectors).
When K-means is executed with a single mapper (the default considering the Hadoop split size which in my cluster is 128MB), it reaches a given clustering result in 2 iterations (Case A).
However, I wanted to test if there would be any improvement/deterioration in the algorithm's execution speed by firing more mapping tasks (the Hadoop cluster has in total 6 nodes).
I therefore set the -Dmapred.max.split.size parameter to 5242880 bytes, in order to make mahout fire 2 mapping tasks (Case B).
I indeed succeeded in starting two mappers, but the strange thing was that the job finished after 5 iterations instead of 2, and that even at the first assignment of points to clusters, the mappers made different choices compared to the single-map execution . What I mean is that after close inspection of the clusterDump for the first iteration for both two cases, I found that in case B some points were not assigned to their closest cluster.
Could this behavior be justified by the existing K-means Mahout implementation?
From a quick look at the sources, I see two problems with the Mahout k-means implementation.
First of all, the way the S0, S1, S2 statistics are kept is probably not numerically stable for large data sets. Oh, and since k-means actually does not even use S2, it is also unnecessary slow. I bet a good implementation can beat this version of k-means by a factor of 2-5 at least.
For small data sets split onto multiple machines, there seems to be an error in the way they compute their means. Ouch. This will amplify if the reducer is applied to more than one input, in particular when the partitions are small. To be more verbose, the cluster mean apparently is initialized with the previous mean instead of the 0 vector. Now if you if you reduce 't' copies of it, the resulting vector will be off by 't' times the previous mean.
Initialization of AbstractCluster:
setS1(center.like());
Update of the mean:
getS1().assign(x, Functions.PLUS);
Merge of multiple copies of a cluster:
setS1(getS1().plus(cl.getS1()));
Finalization to new center:
setCenter(getS1().divide(getS0()));
So with this approach, the center will be offset from the proper value by the previous center times t / n where t is the number of splits, and n the number of objects.
To fix the numerical instability (which arises whenever the data set is not centered on the 0 vector), I recommend replacing the S1 statistic by the true mean, not S0*mean. Both S1 and S2 can be incrementally updated at little cost using the incremental mean formula which AFAICT was used in the original "k-means" publication by MacQueen (which actually is an online kmeans, while this is Lloyd style batch iterations). Well, for an incremental k-means you obviously need the updatable mean vector anyway... I believe the formula was also discussed by Knuth in his essential books. I'm surprised that Mahout does not seem to use it. It's fairly cheap (just a few CPU instructions more, no additional data, so it all happens in the CPU cache line) and gives you extra precision when you are dealing with large data sets.