I am using WEKA for classification. I am using two functions, "setClassIndex" and "setAttributeIndices". My dataset has two attributes: the class and one more attribute. Following are some instances from my dataset:
@relation sms_test
@attribute spamclass {spam,ham}
@attribute mes string
@data
ham,'Go until jurong point'
ham,'Ok lar...'
spam,'Free entry in 2 a wkly'
Following is part of my code.
trainData.setClassIndex(0);
filter = new StringToWordVector();
filter.setAttributeIndices("2");
This code runs fine. But when I set trainData.setClassIndex(1) or filter.setAttributeIndices("1"), my code stops running. Does the setClassIndex function take an argument starting from 0 while setAttributeIndices takes an argument starting from 1? How do we identify which WEKA functions start counting from 0 and which from 1?
Does the setClassIndex function take an argument starting from 0?
Yes, the index starts from 0.
And does setAttributeIndices take an argument starting from 1?
Yes, attribute indices start from 1.
Source: http://weka.sourceforge.net/doc.stable
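To make the two conventions concrete, here is a small plain-Java sketch (it deliberately does not use the Weka API; the attribute names come from the ARFF header above, and the helper name is made up). setClassIndex expects a 0-based int, while setAttributeIndices expects a 1-based range string, so index 0 and range "1" refer to the same attribute:

```java
public class IndexConvention {
    // setClassIndex takes a 0-based integer index.
    // setAttributeIndices takes a 1-based range string such as "2" or "first-last".
    static int rangeToZeroBased(String oneBasedIndex) {
        return Integer.parseInt(oneBasedIndex) - 1;
    }

    public static void main(String[] args) {
        String[] attributes = {"spamclass", "mes"};
        int classIndex = 0;                       // setClassIndex(0)      -> "spamclass"
        int filterIndex = rangeToZeroBased("2");  // setAttributeIndices("2") -> "mes"
        System.out.println(attributes[classIndex] + " / " + attributes[filterIndex]);
    }
}
```

So setClassIndex(0) and setAttributeIndices("2") in the question address different attributes: the class is the first attribute, and the filter is applied to the second one.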
My incoming field is a double data type (Veeva), e.g. 134.0, 45.4, 61.234.
My output to a file always needs two decimal places: field values 134.00, 45.40, 61.23.
I have tried creating an expression in Informatica Cloud:
TO_DECIMAL( IN_TAX, 2 )
but it's not giving the expected output.
Keep the data type as double with decimal point = 2 and use ROUND(). I am assuming your target is a flat file.
Create an output port in the Expression transformation like below:
out_tax = ROUND(in_tax, 2) -- data type is DOUBLE(xx,2), where xx is your precision.
Link this port to target.
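Outside of Informatica, the same round-then-format step can be sketched in Java (this is only an illustration of the rounding behavior, not Informatica code; the helper name formatTax is made up):

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

public class TwoDecimals {
    // Round to 2 decimal places and render with exactly two digits after the
    // point, mirroring ROUND(in_tax, 2) written to a DOUBLE(xx,2) flat-file field.
    static String formatTax(double inTax) {
        return new BigDecimal(Double.toString(inTax))
                .setScale(2, RoundingMode.HALF_UP)
                .toPlainString();
    }

    public static void main(String[] args) {
        System.out.println(formatTax(134.0));   // 134.00
        System.out.println(formatTax(45.4));    // 45.40
        System.out.println(formatTax(61.234));  // 61.23
    }
}
```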
I have an RRD file in which the traffic_in and traffic_out stats of interfaces are stored.
What I want is the max and min values over a certain time period.
I'm trying this command, but it gives me the error ERROR: invalid rpn expression in: v,MAX
rrdtool graph -s 1537466100 -e 1537552237 DEF:v=lhr-spndc-7609_traffic_in_612.rrd:traffic_in:MAX CDEF:vm=v,MAX PRINT:vm:%lf
Can you please help me with the correct command to achieve the desired functionality?
You should be using VDEF for the definition of vm, not CDEF.
A CDEF is for transforming one or more data series created by either a DEF or CDEF into another series, ready for graphing or summarising.
A VDEF is for transforming a single data series into a single value via a consolidation function, such as to get the maximum value of a series over the entire graph. This is different from the function specified in a DEF, which only specifies how to consolidate a higher-granularity series into a lower-granularity series.
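An untested sketch of the corrected command, based on the one in the question (filename and DS name taken as given there). Note that in a VDEF the consolidation function is spelled MAXIMUM/MINIMUM, not MAX, and that rrdtool graph also expects an output filename argument (here /dev/null, since only the PRINT output is wanted):

```
rrdtool graph /dev/null -s 1537466100 -e 1537552237 \
  DEF:v=lhr-spndc-7609_traffic_in_612.rrd:traffic_in:MAX \
  VDEF:vmax=v,MAXIMUM \
  VDEF:vmin=v,MINIMUM \
  PRINT:vmax:%lf \
  PRINT:vmin:%lf
```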
I load a file which has some columns with data. The first line contains ,CITY,YEAR2000.
The first column has the names of cities. The other columns contain numeric data.
I am trying to search for a specific city using:
data(data.CITY=='Athens',3:end)
where
data = dataset('File','cities.txt','Delimiter',',')
but I receive an error
Undefined function 'eq' for input arguments of type 'cell'.
--------UPDATE-----------------------------
OK, use:
data(find(strncmp(data.CITY,'Athens',length('Athens'))),3:end)
Have you tried using strncmp combined with find?
I would use it this way:
find(strncmp(data.CITY,'ATHENS',length('ATHENS')))
EDIT
Another option would be strfind:
strfind(data.CITY,'ATHENS')
EDIT 2
You could also try with
data(ismember(data.CITY,'ATHENS'),3:end)
This should lead you to the results you expect (at least I guess so).
EDIT 3
Given your last request I would go for this solution:
inp = input('Name of the CITY: ','s')
Name of the City: ATHENS
data(find(strncmp(data.CITY,inp,length(inp))),3:end)
I am using Weka in Scala (although the syntax is virtually identical to Java). I am trying to evaluate my data with a SimpleKMeans clusterer, but the clusterer won't accept string data. I don't want to cluster on the string data; I just want to use it to label the points.
Here is the data I am using:
@relation Locations
@attribute ID string
@attribute Latitude numeric
@attribute Longitude numeric
@data
'Carnegie Mellon University', 40.443064, -79.944163
'Stanford University', 37.427539, -122.170169
'Massachusetts Institute of Technology', 42.358866, -71.093823
'University of California Berkeley', 37.872166, -122.259444
'University of Washington', 47.65601, -122.30934
'University of Illinois Urbana Champaign', 40.091022, -88.229992
'University of Southern California', 34.019372, -118.28611
'University of California San Diego', 32.881494, -117.243079
As you can see, it's essentially a collection of points on an x and y coordinate plane. The value of any patterns is negligible; this is simply an exercise in working with Weka.
Here is the code that is giving me trouble:
val instance = new Instances(new StringReader(wekaHeader + wekaData))
val simpleKMeans = new SimpleKMeans()
simpleKMeans.buildClusterer(instance)
val eval = new ClusterEvaluation()
eval.setClusterer(simpleKMeans)
eval.evaluateClusterer(new Instances(instance))
Logger.info(eval.clusterResultsToString)
I get the following error on simpleKMeans.buildClusterer(instance):
[UnsupportedAttributeTypeException: weka.clusterers.SimpleKMeans: Cannot handle string attributes!]
How do I get Weka to retain IDs while doing clustering?
Here are a couple of other steps I have taken to troubleshoot this:
I used the Weka Explorer and loaded this data as a CSV:
ID, Latitude, Longitude
'Carnegie Mellon University', 40.443064, -79.944163
'Stanford University', 37.427539, -122.170169
'Massachusetts Institute of Technology', 42.358866, -71.093823
'University of California Berkeley', 37.872166, -122.259444
'University of Washington', 47.65601, -122.30934
'University of Illinois Urbana Champaign', 40.091022, -88.229992
'University of Southern California', 34.019372, -118.28611
'University of California San Diego', 32.881494, -117.243079
This does what I want it to do in the Weka Explorer. Weka clusters the points and retains the ID column to identify each point. I would do this in my code, but I'm trying to do this without generating additional files. As you can see from the Weka Java API, Instances interprets a java.io.Reader only as an ARFF.
I have also tried the following code:
val instance = new Instances(new StringReader(wekaHeader + wekaData))
instance.deleteAttributeAt(0)
val simpleKMeans = new SimpleKMeans()
simpleKMeans.buildClusterer(instance)
val eval = new ClusterEvaluation()
eval.setClusterer(simpleKMeans)
eval.evaluateClusterer(new Instances(instance))
Logger.info(eval.clusterResultsToString)
This works in my code, and displays results. That proves that Weka is working in general, but since I am deleting the ID attribute, I can't really map the clustered points back on the original values.
I am answering my own question, and in doing so, there are two issues that I would like to address:
Why CSV works with string values
How to get cluster information from the cluster evaluation
As Sentry points out in the comments, the ID does in fact get converted to a nominal attribute when loaded from a CSV.
If the data must be in an ARFF format (like in my example where the Instances object is created from a StringReader), then the StringToNominal filter can be applied:
val instances = new Instances(new StringReader(wekaHeader + wekaData))
val filter = new StringToNominal()
filter.setAttributeRange("first")
filter.setInputFormat(instances)
val filteredInstance = Filter.useFilter(instances, filter)
val simpleKMeans = new SimpleKMeans()
simpleKMeans.buildClusterer(filteredInstance)
...
This allows for "string" values to be used in clustering, although it's really just treated as a nominal value. It doesn't impact the clustering (if the ID is unique), but it doesn't contribute to the evaluation as I had hoped, which brings me to the next issue.
I was hoping to be able to get a nice map of cluster and data, like cluster: Int -> Array[(ID, latitude, longitude)] or ID -> cluster: Int. However, the cluster results are not that convenient. In my experience these past few days, there are two approaches that can be used to find the cluster of each point of data.
To get the cluster assignments, simpleKMeans.getAssignments returns an array of integers that is the cluster assignments for each data element. The array of integers is in the same order as the original data items and has to be manually related back to the original data items. This can be easily accomplished in Scala by using the zip method on the original list of data items and then using other methods like groupBy or map to get the collection in your favorite format. Keep in mind that this method alone does not use the ID attribute at all, and the ID attribute could be omitted from the data points entirely.
However, you can also get the cluster centers with simpleKMeans.getClusterCentroids or eval.clusterResultsToString(). I have not used this very much, but it does seem to me that the ID attribute can be recovered from the cluster centers here. As far as I can tell, this is the only situation in which the ID data can be utilized or recovered from the cluster evaluation.
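As a sketch of the first approach in plain Java (no Weka calls; the assignment array and ID list below are made-up stand-ins for what simpleKMeans.getAssignments() and the original data would provide), the zip/groupBy step amounts to grouping the parallel arrays by cluster number:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AssignmentZip {
    // Group original IDs by cluster, given the parallel array that
    // getAssignments() returns (same order as the input instances).
    static Map<Integer, List<String>> groupByCluster(int[] assignments, String[] ids) {
        Map<Integer, List<String>> clusters = new HashMap<>();
        for (int i = 0; i < assignments.length; i++) {
            clusters.computeIfAbsent(assignments[i], k -> new ArrayList<>()).add(ids[i]);
        }
        return clusters;
    }

    public static void main(String[] args) {
        String[] ids = {"Carnegie Mellon University", "Stanford University",
                        "Massachusetts Institute of Technology"};
        int[] assignments = {0, 1, 0};  // stand-in for simpleKMeans.getAssignments()
        System.out.println(groupByCluster(assignments, ids));
    }
}
```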
I got the same error from a string value in one of the lines of a CSV file with a couple of million rows. Here is how I figured out which line had the string value.
Exception "Cannot handle string attributes!" doesn't give any clue about the line number. Hence:
I imported the CSV file into the Weka Explorer GUI and created a *.arff file.
Then I manually changed the type from string to numeric at the beginning of the *.arff file, as shown below.
After that I tried to build the cluster using the *.arff file.
I got the exact line number as part of the exception.
I removed the line from the *.arff file and loaded it again. It worked without any issue.
Converted string --> numeric in *.arff file
@attribute total numeric
@attribute avgDailyMB numeric
@attribute mccMncCount numeric
@attribute operatorCount numeric
@attribute authSuccessRate numeric
@attribute totalMonthlyRequets numeric
@attribute tokenCount numeric
@attribute osVersionCount numeric
@attribute totalAuthUserIds numeric
@attribute makeCount numeric
@attribute modelCount numeric
@attribute maxDailyRequests numeric
@attribute avgDailyRequests numeric
Error reported the exact line number
java.io.IOException: number expected, read Token[value.total], line 1750464
at weka.core.converters.ArffLoader$ArffReader.errorMessage(ArffLoader.java:354)
at weka.core.converters.ArffLoader$ArffReader.getInstanceFull(ArffLoader.java:728)
at weka.core.converters.ArffLoader$ArffReader.getInstance(ArffLoader.java:545)
at weka.core.converters.ArffLoader$ArffReader.readInstance(ArffLoader.java:514)
at weka.core.converters.ArffLoader$ArffReader.readInstance(ArffLoader.java:500)
at weka.core.Instances.<init>(Instances.java:138)
at com.lokendra.dissertation.ModelingUtils.kMeans(ModelingUtils.java:50)
at com.lokendra.dissertation.ModelingUtils.main(ModelingUtils.java:28)
I am writing a very small URL shortener with Dancer. It uses the REST plugin to store a posted URL in a database with a six-character string which is used by the user to access the shortened URL.
Now I am a bit unsure about my random string generation method.
sub generate_random_string {
    my $length_of_randomstring = shift;  # the length of
                                         # the random string to generate
    my @chars = ('a'..'z', 'A'..'Z', '0'..'9', '_');
    my $random_string;
    for (1..$length_of_randomstring) {
        # rand @chars will generate a random
        # number between 0 and scalar @chars
        $random_string .= $chars[rand @chars];
    }
    # Start over if the string is already in the database
    # (the recursive result must be returned, not discarded)
    return generate_random_string($length_of_randomstring)
        if database->quick_select('urls', { shortcut => $random_string });
    return $random_string;
}
This generates a six-char string and calls the function recursively if the generated string is already in the DB. I know there are 63^6 possible strings, but this will take some time as the database gathers more entries. And maybe it will become nearly infinite recursion, which I want to prevent.
Are there ways to generate unique random strings, which prevent recursion?
Thanks in advance
We don't really need to be hand-wavy about how many iterations (or recursions) of your function there will be. I believe that at every invocation, the number of iterations is geometrically distributed (i.e. the number of trials before the first success is governed by the geometric distribution), which has mean 1/p, where p is the probability of successfully finding an unused string. I believe that p is just 1 - n/63^6, where n is the number of currently stored strings. Therefore, I think that you will need to have stored about 30 billion strings (~63^6/2) in your database before your function recurses on average more than 2 times per call (p = .5).
Furthermore, the variance of the geometric distribution is (1-p)/p^2, so even at 30 billion entries, one standard deviation is just sqrt(2). Therefore I expect ~99% of the time that the loop will take fewer than 2 + 2*sqrt(2) iterations, or ~5 iterations. In other words, I would just not worry too much about it.
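The arithmetic above is easy to check directly (a sketch plugging 63^6 and n into the formulas from this answer; the class and method names are made up):

```java
public class CollisionMath {
    static final double TOTAL = Math.pow(63, 6);  // number of possible 6-char strings

    // Expected iterations of the generate loop: 1/p, with p = 1 - n/63^6.
    static double expectedIterations(double storedStrings) {
        double p = 1.0 - storedStrings / TOTAL;
        return 1.0 / p;
    }

    public static void main(String[] args) {
        System.out.println(TOTAL);                          // ~6.25e10
        System.out.println(expectedIterations(0));          // 1.0: empty DB, no retries
        System.out.println(expectedIterations(TOTAL / 2));  // 2.0: half-full DB
    }
}
```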
From an academic stance this seems like an interesting program to work on. But if you're on the clock and just need random and distinct strings I'd go with the Data::GUID module.
use strict;
use warnings;
use Data::GUID qw( guid_string );
my $guid = guid_string();
Getting rid of recursion is easy; turn your recursive call into a do-while loop. For instance, split your function into two; the "main" one and a helper. The "main" one simply calls the helper and queries the database to ensure it's unique. Assuming generate_random_string2 is the helper, here's a skeleton:
do {
$string = generate_random_string2(6);
} while (database->quick_select(...));
As for limiting the number of iterations before getting a valid string, what about just saving the last generated string and always building your new string as a function of that?
For example, when you start off, you have no strings, so let's just say your string is 'a'. Then the next time you build a string, you get the last built string ('a') and apply a transformation to it, for instance incrementing the last character. This gives you 'b', and so on. Eventually you get to the highest character you care for (say 'z'), at which point you append an 'a' to get 'za', and repeat.
Now there is no database, just one persistent value that you use to generate the next value. Of course if you want truly random strings, you will have to make the algorithm more sophisticated, but the basic principle is the same:
Your current value is a function of the last stored value.
When you generate a new value, you store it.
Ensure your generation will produce a unique value (one that did not occur before).
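One way to sketch the counting scheme described above (plain Java rather than the question's Perl; the alphabet matches the question's 'a'..'z','A'..'Z','0'..'9','_', and this variant carries like a base-63 counter rather than only appending):

```java
public class NextString {
    static final String ALPHABET =
            "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_";

    // Return the successor of s in base-63 counting: bump the last character,
    // carrying left; if every position overflows, grow the string by one char.
    static String next(String s) {
        char[] c = s.toCharArray();
        for (int i = c.length - 1; i >= 0; i--) {
            int pos = ALPHABET.indexOf(c[i]);
            if (pos < ALPHABET.length() - 1) {
                c[i] = ALPHABET.charAt(pos + 1);
                return new String(c);
            }
            c[i] = ALPHABET.charAt(0);  // overflow: reset this position and carry
        }
        return new String(c) + "a";     // all positions overflowed: grow the string
    }

    public static void main(String[] args) {
        System.out.println(next("a"));  // b
        System.out.println(next("z"));  // A
        System.out.println(next("_"));  // aa
    }
}
```

Storing only the last generated string and calling next() on it guarantees uniqueness without any database lookup, at the cost of the strings no longer being random.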
I've got one more idea based on using MySQL.
create table string (
string_id int(10) not null auto_increment,
string varchar(6) not null default '',
primary key(string_id)
);
insert into string set string='';
update string
set string = lpad( hex( last_insert_id() ), 6, uuid() )
where string_id = last_insert_id();
select string from string
where string_id = last_insert_id();
This gives you an incremental hex value which is left padded with non-zero junk.