I am using the DBSCAN implementation in Weka and it seems to be giving me different results based on whether I select "Use training set" or "Classes to clusters evaluation" as the 'Cluster mode'. As per the documentation here, selecting "Classes to clusters evaluation" should only change the metrics reported.
With DBSCAN however I actually see a different number of clusters. Here's a way to reproduce the problem:
Load the IRIS dataset: Select the "Preprocess" tab, click "Open file", go to the "data" folder inside your Weka installation and load the "iris" dataset.
Go over to the "Cluster" tab and choose DBSCAN. Set epsilon=0.5 and minpts=5.
In cluster mode, select the radio button "Use training set" and Start the clustering. Look for the string "Number of generated clusters" - this number is 3 for me.
Now change the cluster mode to "Classes to clusters evaluation" and re-run the clustering. I now get 1 cluster.
Is this expected behavior? Am I missing something?
What I was missing is that with the "Use training set" setting, all attributes, including the class label, are used for clustering. If I explicitly remove the class attribute, the results match.
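To illustrate why including the class label can change the number of clusters, here is a minimal sketch of a density-based clustering in plain Python. It is not Weka's DBSCAN implementation; the toy data, the `eps` and `min_pts` values, and the function names are all assumptions chosen for illustration. The same points form one cluster when only the feature is used, and split into two once the class label is added as an extra coordinate.

```python
from math import dist

def dbscan(points, eps, min_pts):
    """Toy DBSCAN: returns one label per point (-1 means noise)."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]
        if len(neighbors) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = [k for k in range(len(points)) if dist(points[j], points[k]) <= eps]
            if len(nb) >= min_pts:
                queue.extend(nb)  # j is a core point, so keep expanding
    return labels

def count_clusters(labels):
    return len(set(labels) - {-1})

# Hypothetical data: one feature, evenly spaced, with a class label of 0 or 1.
features = [i / 10 for i in range(10)]
classes = [0] * 5 + [1] * 5

without_class = [(x,) for x in features]
with_class = [(x, c) for x, c in zip(features, classes)]

clusters_without = count_clusters(dbscan(without_class, eps=0.25, min_pts=3))
clusters_with = count_clusters(dbscan(with_class, eps=0.25, min_pts=3))
print(clusters_without, clusters_with)
```

In this sketch the class column pushes the two groups at least one unit apart, which is larger than `eps`, so the single chain of points breaks into two clusters.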
Well, let's say that I have the query from my previous question: How to do multi graph time series on Grafana with Kusto
Then I'd like to consume the tiemposCicloBruto variable from one panel to another in order to avoid repeating queries.
I saw: https://grafana.com/blog/2020/10/14/learn-grafana-share-query-results-between-panels-to-reduce-load-time/
But there isn't any way to share variables at all...
I also tried it as a dashboard variable, but it doesn't seem to support tabular expressions at all...
You can share only input variables across dashboard panels. Variables work as primitive text substitution in one direction (from dashboard to query), and do not take into account any context in your query language.
Your link describes sharing the results of a query between different panels. If the exact same result set fits another panel's needs, you can reuse it "for free", without putting extra load on the database. You don't need to save it in any variable; you just set it as a pseudo-datasource and you get the result immediately.
You can factor this feature into the design of your panels. Examples could be:
time series plus histogram visualizations of the same data;
time-series chart plus a panel with latest readings (or use other Grafana reduce expressions).
I am trying to summarize the power-usage metrics from my smart plugs. They are stored in InfluxDB in different series. I've tried using a wildcard in the "from" part, but I cannot sum the results. Is it even possible to add the different series together and get a single result? I do not want to create 4 (or more) series manually, one per plug, because I want to stay flexible when new plugs are added to the system. FYI, I've already tried a "sum" in the select statement.
I found a solution. I don't know why I did not try this before. I added the transformation called "Add field from calculation" with the mode set to "Reduce row" and the calculation set to "Total".
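Conceptually, that transformation adds one new field per row whose value is the sum of the other numeric fields. A minimal Python sketch of the idea follows; it is not Grafana's code, and the field names (`plug_a`, `plug_b`, `plug_c`) and the `add_total_field` helper are hypothetical.

```python
def add_total_field(rows):
    """Append a 'Total' field that sums every field except 'time' in each row."""
    out = []
    for row in rows:
        # Missing readings (None) are skipped rather than treated as errors.
        total = sum(v for k, v in row.items() if k != "time" and v is not None)
        out.append({**row, "Total": total})
    return out

# Hypothetical readings: one column per smart plug, aligned on the timestamp.
rows = [
    {"time": 0, "plug_a": 10.0, "plug_b": 5.0, "plug_c": 2.5},
    {"time": 60, "plug_a": 12.0, "plug_b": None, "plug_c": 3.0},
]

result = add_total_field(rows)
for row in result:
    print(row["time"], row["Total"])
```

Because the sum runs over whatever fields are present in the row, a newly added plug would be included without changing the calculation, which is the flexibility the question asks for.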
According to IBM's online help:
Optionally, for CHAID, QUEST, and C&R Tree models, an additional field can be added that indicates the ID for the node to which each record is assigned.
I cannot find that option. I am using an (exhaustive) CHAID which adds the $R- (prediction field) variable but there is no $RI- (node identifier field) variable. Just in case IBM was being literal I checked running a regular CHAID (not exhaustive) but still without getting the $RI-variable I need.
I know that in SPSS v. 25 this is easily configured so is IBM just confused in their online help for modeler, or am I missing something obvious? Thanks in advance for any help.
To get the rule identifier added to the data set, you first need to train the model to generate the model nugget.
You can then edit (or open) the model nugget and select the "Settings" tab. Here you will find the option "Rule identifier", which must be checked to include the ID of the node to which each record is assigned.
It is important to realize that this is a setting in the generated model nugget and not in the modeling node. This also means that this setting must be checked (and rechecked) each time the model is retrained and the nugget is regenerated.
I built a custom CHAID tree in SPSS modeler. I would like to assign the particular terminal nodes to all of the records in the dataset. How would I go about doing this from within the software?
Assuming that you used the regular node called CHAID: open the diamond icon (the generated CHAID model) and, in its configuration tab, select the rule identifier option. The output will then include another variable called $RI-XXX that assigns every record to its terminal node. Just check that option and then put a Table node after it, and all the records will be classified.
You just need to apply the algorithm to whatever data set you need; the inputs only need to be the same (type and, where relevant, storage).
The diamond contains the algorithm, and you can disconnect it and connect it to whatever you want.
http://beyondthearc.com/blog/wp-content/uploads/2015/02/spss.png
Context
I want to use the Weka clustering algorithm XMeans. However, I cannot figure out how to obtain cluster assignments from the Weka GUI.
At the moment I can only see a list of cluster IDs along with the percentage of entries assigned to each cluster.
Question
Is there any way to save the cluster assignment for each entry in, e.g., CSV format?
Do everything in the "Preprocess Panel".
This is one way to do this:
Load Data File.
Remove any classification attribute or identifiers.
Choose Preprocess / Filter / Unsupervised attribute filter / AddCluster.
Click on the word "AddCluster", choose the XMeans clusterer, and click Apply.
This should add a new column "cluster" in the Attribute panel.
Click on the "Save..." button to export.
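The result of these steps is simply the original table with one extra "cluster" column, saved to a file. The sketch below shows that shape in Python. It is only an illustration: the nearest-centroid assignment is a stand-in for XMeans, and the data, centroids, label format and function names are all assumptions.

```python
import csv
import io

def assign_cluster(row, centroids):
    """Index of the nearest centroid (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda c: sum((a - b) ** 2 for a, b in zip(row, centroids[c])),
    )

def add_cluster_column(header, rows, centroids):
    """Return a new header and rows with a trailing 'cluster' column."""
    new_header = header + ["cluster"]
    new_rows = [row + [f"cluster{assign_cluster(row, centroids) + 1}"] for row in rows]
    return new_header, new_rows

# Hypothetical data and already-fitted centroids (stand-in for a trained clusterer).
header = ["x", "y"]
rows = [[1.0, 1.1], [0.9, 1.0], [5.0, 5.2], [5.1, 4.9]]
centroids = [[1.0, 1.0], [5.0, 5.0]]

new_header, new_rows = add_cluster_column(header, rows, centroids)

# Write to an in-memory buffer; swap in open("out.csv", "w", newline="") for a real file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(new_header)
writer.writerows(new_rows)
csv_text = buffer.getvalue()
print(csv_text)
```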