Does OptaPlanner have a "built-in" way to perform multi-unit score normalization? - drools

At the moment, my problem has four metrics. Each of these measures something entirely different (each has different units, a different range, etc.) and each is weighted externally. I am using Drools for scoring.
I only have only one score level (SimpleLongScore) and I have to find a way to appropriately combine the individual scores of these metrics onto one long value
The most significant problem at the moment is that the range of values for the metrics can be wildly different.
So if, for example, after a move the score of a metric with a small possible range improves by, say, 10%, that could be completely dwarfed by an alternate move which improves the metric with a larger range's score by only 1% because OptaPlanner only considers the actual score value rather than the possible range of values and how changes affect them proportionally (to my knowledge).
So, is there a way to handle this cleanly which is already part of OptaPlanner that I cannot find?
Is the only feasible solution to implement Pareto scoring? Because that seems like a hack-y nightmare.
So far I have code/math to compute the best-possible and worst-possible scores for a metric that I access from within the Drools and then I can compute where in that range a move puts us, but this also feel quite hack-y and will cause issues with incremental scoring if we want to scale non-linearly within that range.
I keep coming back to thinking I should just just bite the bullet and implement Pareto scoring.
Thanks!

Take a look at #ConstraintConfiguration and #ConstraintWeight in the docs.
Also take a look at the chapter "explaning the score", which can exactly tell you which constraint had which score impact on the best solution found.
If, however, you need pareto optimization, so you need multiple best solutions that don't dominate each other, then know that OptaPlanner doesn't support that yet, but I know of 2 cases that implemented it in OptaPlanner by hacking BestSolutionRecaller.
That being said, 99% of the cases that think of pareto optimization, are 100% happy with #ConstraintWeight instead, because users don't want multiple best solutions (except during simulations), they just want one in production.

Related

Can I plot the same subset of many days in grafana

I have an interesting event that happens once a day at (say) 10:00. I would like to plot some data for a small time around then for multiple days to explore the data. Is there an easy way to do this with Grafana?
Rough sketch of what I’d like (but I’m not very picky):
A few things I tried:
Searching on the internet. I think this failed because I don’t know the right words to describe what I want
Assuming this is impossible. The reason I would expect this to be possible is that it seems somewhat common to e.g. want to plot the value of some metric during working hours only. But it also seems like something that mightn’t be needed for a reasonably steady 24/7 operation.
Setting up a dashboard with many panels with appropriate timeshifts. This is a bit fiddly to set up and I think it is hard to change the metric being looked at. It also leaves a lot of empty space and I’m not sure if it is possible to lock the y axis scale to be the same for all the panels
I assume that this isn’t currently supported and therefore I suppose the question is what the best workarounds would be. In particular I’m hoping to optimize for something that is convenient to explore the data rather than somethings that is powerful.
Some alternatives that would likely also be good for me:
Some scheme for weighting the x axis (so I could give time in the gap a weight of 0)
Some way to automatically add discontinuities to the x axis when there is no data
A few more specifics:
Grafana v8.1.8 (52edcff798)
The data is stored in VictoriaMetrics, and queried via a PromQL interface (PromQL is a subset of VictoriaMetrics’ MetricsQL. Currently I can’t make MetricsQL queries but can maybe change to do that if it would help here).

How to ensure the output of _best_programs of SymbolicTransformer of gplearn is different?

I am using the SymbolicTransformer of gplearn to generate some automated features. The issue is, when I inspect the expression of the features via looking at _best_programs after fitting, I find that most of the features have the same expression. I am wondering whether there is a way to ensure that we output different features using SymbolicTransformer after fitting?
I don't know if there is a way to explicitly enforce this but you can probably try to enforce more diverse populations each generation in the hopes that this leads to a a collection of more diverse _best_programs. In my opinion a few parameters you could look into are:
p_crossover
p_subtree_mutation
p_hoise_mutation
p_point_mutation
p_point_replace
If you increase the chance of crossover or mutation you will increase your expected diversity but you must not overdue it. There is a balance between a diverse population and an accurate one. The higher the crossover or mutation the more likely that you will take a strong individual candidate and change it into something meaningless.

Clustering Category Purchases in Customer Data

I am attempting to cluster a group of customers based on spend, order frequency, order breadth and what % of purchases they make in each category (there are around 20).
It will probably be a simple answer but I cannot figure out whether I should standardize (subtract mean and divide by sd) the % category buy columns or not. When I dont standardize I can get around 90% of the variance explained in 4-5 principal components (using SVD), but when I standardize each column I only get around 40% for the same number of principal components. My worry is that because each column is related, I am removing the relationship by standardizing. At the same time I am worried that not standardizing will cause issues with the other variables in the data that I have standardized.
I would assume if others tried clustering in this way they would face a similar issue but I cant seem to find one so it might be that I just dont understand the situation. Thanks for any clarification in advance!
Chris,
Percentage scale has a well defined range and nice properties.
By heuristically scaling these features you usually make things worse.

Doubts about clustering methods for tweets

I'm fairly new to clustering and related topics so please forgive my questions.
I'm trying to get introduced into this area by doing some tests, and as a first experiment I'd like to create clusters on tweets based on content similarity. The basic idea for the experiment would be storing tweets on a database and periodically calculate the clustering (ie. using a cron job). Please note that the database would obtain new tweets from time to time.
Being ignorant in this field, my idea (probably naive) would be to do something like this:
1. For each new tweet in the db, extract N-grams (N=3 for example) into a set
2. Perform Jaccard similarity and compare with each of the existing clusters. If result > threshold then it would be assigned to that cluster
3. Once finished I'd get M clusters containing similar tweets
Now I see some problems with this basic approach. Let's put aside computational cost, how would the comparison between a tweet and a cluster be done? Assuming I have a tweet Tn and a cluster C1 containing T1, T4, T10 which one should I compare it to? Given that we're talking about similarity, it could well happen that sim(Tn,T1) > threshold but sim(Tn,T4) < threshold. My gut feeling tells me that something like an average should be used for the cluster, in order to avoid this problem.
Also, it could happen that sim(Tn, C1) and sim(Tn, C2) are both > threshold but similarity with C1 would be higher. In that case Tn should go to C1. This could be done brute force as well to assign the tweet to the cluster with maximum similarity.
And last of all, it's the computational issue. I've been reading a bit about minhash and it seems to be the answer to this problem, although I need to do some more research on it.
Anyway, my main question would be: could someone with experience in the area recommend me which approach should I aim to? I read some mentions about LSA and other methods, but trying to cope with everything is getting a bit overwhelming, so I'd appreciate some guiding.
From what I'm reading a tool for this would be hierarchical clustering, as it would allow regrouping of clusters whenever new data enters. Is this correct?
Please note that I'm not looking for any complicated case. My use case idea would be being able to cluster similar tweets into groups without any previous information. For example, tweets from Foursquare ("I'm checking in ..." which are similar to each other would be one case, or "My klout score is ..."). Also note that I'd like this to be language independent, so I'm not interested in having to deal with specific language issues.
It looks like to me that you are trying to address two different problems in one, i.e. "syntactic" and "semantic" clustering. They are quite different problems, expecially if you are in the realm of short-text analysis (and Twitter is the king of short-text analysis, of course).
"Syntactic" clustering means aggregating tweets that come, most likely, from the same source. Your example of Foursquare fits perfectly, but it is also common for retweets, people sharing online newspaper articles or blog posts, and many other cases. For this type of problem, using a N-gram model is almost mandatory, as you said (my experience suggests that N=2 is good for tweets, since you can find significant tweets that have as low as 3-4 features). Normalization is also an important factor here, removing RT tag, mentions, hashtags might help.
"Semantic" clustering means aggregating tweets that share the same topic. This is a much more difficult problem, and it won't likely work if you try to aggregate random sample of tweets, due to the fact that they, usually, carry too little information. These techniques might work, though, if you restrict your domain to a specific subset of tweets (i.e. the one matching a keyword, or an hashtag). LSA could be useful here, while it is useless for syntactic clusters.
Based on your observation, I think what you want is syntactic clustering. Your biggest issue, though, is the fact that you need online clustering, and not static clustering. The classical clustering algorithms that would work well in the static case (like hierarchical clustering, or union find) aren't really suited for online clustering , unless you redo the clustering from scratch every time a new tweet gets added to your database. "Averaging" the clusters to add new elements isn't a great solution according to my experience, because you need to retain all the information of every cluster member to update the "average" every time new data gets in. Also, algorithms like hierarchical clustering and union find work well because they can join pre-existant clusters if a link of similarity is found between them, and they don't simply assign a new element to the "closest" cluster, which is what you suggested to do in your post.
Algorithms like MinHash (or SimHash) are indeed more suited to online clustering, because they support the idea of "querying" for similar documents. MinHash is essentially a way to obtain pairs of documents that exceed a certain threshold of similarity (in particular, MinHash can be considered an estimator of Jaccard similarity) without having to rely on a quadratic algorithm like pairwise comparison (it is, in fact, O(nlog(n)) in time). It is, though, quadratic in space, therefore a memory-only implementation of MinHash is useful for small collections only (say 10000 tweets). In your case, though, it can be useful to save "sketches" (i.e., the set of hashes you obtain by min-hashing a tweet) of your tweets in a database to form an "index", and query the new ones against that index. You can then form a similarity graph, by adding edges between vertices (tweets) that matched the similarity query. The connected components of your graph will be your clusters.
This sounds a lot like canopy pre-clustering to me.
Essentially, each cluster is represented by the first object that started the cluster.
Objects within the outer radius join the cluster. Objects that are not within the inner radius of at least one cluster start a new cluster. This way, you get an overlapping (non-disjoint!) quantization of your dataset. Since this can drastically reduce the data size, it can be used to speed up various algorithms.
However don't expect useful results from clustering tweets. Tweet data is just to much noise. Most tweets have just a few words, too little to define a good similarity. On the other hand, you have the various retweets that are near duplicates - but trivial to detect.
So what would be a good cluster of tweets? Can this n-gram similarity actually capture this?

Whate are the basic concepts for implementing anti-shock and anti-shake algorithms?

I have some animations happening upon fine acceleration detections. But when the user sits in a car or is walking it may get annoying.
Basically, all that stuff has to be disabled automatically as soon as there is too much vibration or shaking. Conceptually, I think that it's very hard to filter those vibrations out , since the "vibration phase" changes permanently. I woul define "unwanted vibration or shocks" as acceleration values that change very fast by an large interval of values, or, an permanently changing accumulated value that does not exceed an specified treshold range in an specified minimum period of time.
I am looking for "proven" concepts, before I start reinventing the wheel for a couple of days.
I don't have any concrete answers for you, but you might want to Google band-pass filters or anti-aliasing filters for some ideas on how to approach this. Basically, if you can identify the frequency range of accelerations that you want to consider real, you can filter out frequencies that fall outside this range.
Before you start doing too much pre-optimization, I think you should implement a low pass filter and see if that does the job. Most iPhone apps effectively use a variation of an LPF to get rid of unwanted accelerometer noise.
You could also go the other way and use a high pass filter. Once you get a certain power level passing through the HPF, stop processing data.