Weka, SimpleKMeans cannot handle string attributes - Scala

I am using Weka in Scala (although the syntax is virtually identical to Java). I am trying to evaluate my data with a SimpleKMeans clusterer, but the clusterer won't accept string data. I don't want to cluster on the string data; I just want to use it to label the points.
Here is the data I am using:
@relation Locations
@attribute ID string
@attribute Latitude numeric
@attribute Longitude numeric
@data
'Carnegie Mellon University', 40.443064, -79.944163
'Stanford University', 37.427539, -122.170169
'Massachusetts Institute of Technology', 42.358866, -71.093823
'University of California Berkeley', 37.872166, -122.259444
'University of Washington', 47.65601, -122.30934
'University of Illinois Urbana Champaign', 40.091022, -88.229992
'University of Southern California', 34.019372, -118.28611
'University of California San Diego', 32.881494, -117.243079
As you can see, it's essentially a collection of points on an x and y coordinate plane. The value of any patterns is negligible; this is simply an exercise in working with Weka.
Here is the code that is giving me trouble:
val instance = new Instances(new StringReader(wekaHeader + wekaData))
val simpleKMeans = new SimpleKMeans()
simpleKMeans.buildClusterer(instance)
val eval = new ClusterEvaluation()
eval.setClusterer(simpleKMeans)
eval.evaluateClusterer(new Instances(instance))
Logger.info(eval.clusterResultsToString)
I get the following error on simpleKMeans.buildClusterer(instance):
[UnsupportedAttributeTypeException: weka.clusterers.SimpleKMeans: Cannot handle string attributes!]
How do I get Weka to retain IDs while doing clustering?
Here are a couple of other steps I have taken to troubleshoot this:
I used the Weka Explorer and loaded this data as a CSV:
ID, Latitude, Longitude
'Carnegie Mellon University', 40.443064, -79.944163
'Stanford University', 37.427539, -122.170169
'Massachusetts Institute of Technology', 42.358866, -71.093823
'University of California Berkeley', 37.872166, -122.259444
'University of Washington', 47.65601, -122.30934
'University of Illinois Urbana Champaign', 40.091022, -88.229992
'University of Southern California', 34.019372, -118.28611
'University of California San Diego', 32.881494, -117.243079
This does what I want it to do in the Weka Explorer: Weka clusters the points and retains the ID column to identify each point. I would do the same in my code, but I'm trying to do it without generating additional files. As you can see from the Weka Java API, the Instances constructor that takes a java.io.Reader interprets the input only as ARFF.
I have also tried the following code:
val instance = new Instances(new StringReader(wekaHeader + wekaData))
instance.deleteAttributeAt(0)
val simpleKMeans = new SimpleKMeans()
simpleKMeans.buildClusterer(instance)
val eval = new ClusterEvaluation()
eval.setClusterer(simpleKMeans)
eval.evaluateClusterer(new Instances(instance))
Logger.info(eval.clusterResultsToString)
This works in my code and displays results. That proves that Weka is working in general, but since I am deleting the ID attribute, I can't really map the clustered points back to the original values.

I am answering my own question, and in doing so, there are two issues that I would like to address:
Why CSV works with string values
How to get cluster information from the cluster evaluation
As Sentry points out in the comments, the ID does in fact get converted to a nominal attribute when loaded from a CSV.
If the data must be in an ARFF format (like in my example where the Instances object is created from a StringReader), then the StringToNominal filter can be applied:
val instances = new Instances(new StringReader(wekaHeader + wekaData))
val filter = new StringToNominal()
filter.setAttributeRange("first")
filter.setInputFormat(instances)
val filteredInstance = Filter.useFilter(instances, filter)
val simpleKMeans = new SimpleKMeans()
simpleKMeans.buildClusterer(filteredInstance)
...
This allows "string" values to be used in clustering, although they're really just treated as nominal values. It doesn't impact the clustering (if the ID is unique), but it doesn't contribute to the evaluation as I had hoped, which brings me to the next issue.
I was hoping to be able to get a nice map of cluster and data, like cluster: Int -> Array[(ID, latitude, longitude)] or ID -> cluster: Int. However, the cluster results are not that convenient. In my experience these past few days, there are two approaches that can be used to find the cluster of each point of data.
To get the cluster assignments, simpleKMeans.getAssignments returns an array of integers that is the cluster assignments for each data element. The array of integers is in the same order as the original data items and has to be manually related back to the original data items. This can be easily accomplished in Scala by using the zip method on the original list of data items and then using other methods like groupBy or map to get the collection in your favorite format. Keep in mind that this method alone does not use the ID attribute at all, and the ID attribute could be omitted from the data points entirely.
However, you can also get the cluster centers with simpleKMeans.getClusterCentroids or eval.clusterResultsToString(). I have not used this very much, but it does seem to me that the ID attribute can be recovered from the cluster centers here. As far as I can tell, this is the only situation in which the ID data can be utilized or recovered from the cluster evaluation.

I got the same error from a string value in one of the lines of a CSV file with a couple of million rows. Here is how I figured out which line had the string value.
The exception "Cannot handle string attributes!" doesn't give any clue about the line number. Hence:
I imported the CSV file into the Weka Explorer GUI and created a *.arff file.
Then I manually changed the type from string to numeric at the beginning of the *.arff file, as shown below.
After that I tried to build the clusterer using the *.arff file.
I got the exact line number as part of the exception.
I removed that line from the *.arff file and loaded it again. It worked without any issue.
Converted string --> numeric in *.arff file
@attribute total numeric
@attribute avgDailyMB numeric
@attribute mccMncCount numeric
@attribute operatorCount numeric
@attribute authSuccessRate numeric
@attribute totalMonthlyRequets numeric
@attribute tokenCount numeric
@attribute osVersionCount numeric
@attribute totalAuthUserIds numeric
@attribute makeCount numeric
@attribute modelCount numeric
@attribute maxDailyRequests numeric
@attribute avgDailyRequests numeric
Error reported the exact line number
java.io.IOException: number expected, read Token[value.total], line 1750464
at weka.core.converters.ArffLoader$ArffReader.errorMessage(ArffLoader.java:354)
at weka.core.converters.ArffLoader$ArffReader.getInstanceFull(ArffLoader.java:728)
at weka.core.converters.ArffLoader$ArffReader.getInstance(ArffLoader.java:545)
at weka.core.converters.ArffLoader$ArffReader.readInstance(ArffLoader.java:514)
at weka.core.converters.ArffLoader$ArffReader.readInstance(ArffLoader.java:500)
at weka.core.Instances.<init>(Instances.java:138)
at com.lokendra.dissertation.ModelingUtils.kMeans(ModelingUtils.java:50)
at com.lokendra.dissertation.ModelingUtils.main(ModelingUtils.java:28)

Related

Azure Data Factory - Data Wrangling with Data Flow - Array bug

I have a tricky firewall log file to wrangle using Azure Data Factory. The file consists of four tab-separated columns: Date and Time, Source, IP, and Data.
The Data column consists of key-value pairs separated with equal signs and text delimited by double-quotes. The challenge is that the data column is inconsistent and contains any number of key-value pair combinations.
Three lines from the source file.
2022-02-13 00:59:59 Local7.Notice 192.168.40.1 date=2022-02-13 time=00:59:59 devname="NoHouse" devid="FG100ETK18006624" eventtime=1644706798637882880 tz="+0200" logid="0000000013" type="traffic" subtype="forward" level="notice" vd="root" srcip=192.168.41.200 srcport=58492 srcintf="port1" srcintfrole="undefined" dstip=216.239.36.55 dstport=443 dstintf="wan1" dstintfrole="undefined" srccountry="Reserved" dstcountry="United States" sessionid=137088638 proto=6 action="client-rst" policyid=5 policytype="policy" poluuid="c2a960c4-ac1b-51e6-8011-6f00cb1fddf2" policyname="All LAN over WAN1" service="HTTPS" trandisp="snat" transip=196.213.203.122 transport=58492 appcat="unknown" applist="block-p2p" duration=6 sentbyte=3222 rcvdbyte=1635 sentpkt=14 rcvdpkt=8 srchwvendor="Microsoft" devtype="Computer" osname="Debian" mastersrcmac="00:15:5d:29:b4:06" srcmac="00:15:5d:29:b4:06" srcserver=0
2022-02-13 00:59:59 Local7.Notice 192.168.40.1 date=2022-02-13 time=00:59:59 devname="NoHouse" devid="FG100ETK18006624" eventtime=1644706798657887422 tz="+0200" logid="0000000013" type="traffic" subtype="forward" level="notice" vd="root" srcip=192.168.41.200 srcport=58496 srcintf="port1" srcintfrole="undefined" dstip=216.239.36.55 dstport=443 dstintf="wan1" dstintfrole="undefined" srccountry="Reserved" dstcountry="United States" sessionid=137088640 proto=6 action="client-rst" policyid=5 policytype="policy" poluuid="c2a960c4-ac1b-51e6-8011-6f00cb1fddf2" policyname="All LAN over WAN1" service="HTTPS" trandisp="snat" transip=196.213.203.122 transport=58496 appcat="unknown" applist="block-p2p" duration=6 sentbyte=3410 rcvdbyte=1791 sentpkt=19 rcvdpkt=11 srchwvendor="Microsoft" devtype="Computer" osname="Debian" mastersrcmac="00:15:5d:29:b4:06" srcmac="00:15:5d:29:b4:06" srcserver=0
2022-02-13 00:59:59 Local7.Notice 192.168.40.1 date=2022-02-13 time=00:59:59 devname="NoHouse" devid="FG100ETK18006624" eventtime=1644706798670487613 tz="+0200" logid="0001000014" type="traffic" subtype="local" level="notice" vd="root" srcip=192.168.41.180 srcname="GKHYPERV01" srcport=138 srcintf="port1" srcintfrole="undefined" dstip=192.168.41.255 dstport=138 dstintf="root" dstintfrole="undefined" srccountry="Reserved" dstcountry="Reserved" sessionid=137088708 proto=17 action="deny" policyid=0 policytype="local-in-policy" service="udp/138" trandisp="noop" app="netbios forward" duration=0 sentbyte=0 rcvdbyte=0 sentpkt=0 rcvdpkt=0 appcat="unscanned" srchwvendor="Intel" osname="Windows" srcswversion="10 / 2016" mastersrcmac="a0:36:9f:9b:de:b6" srcmac="a0:36:9f:9b:de:b6" srcserver=0
My strategy for wrangling this data set is as follows.
Source the data file from Azure Data Lake using a tab-delimited CSV dataset. This successfully delivers the source data in four columns to my data flow.
Add a surrogate key transformation to add an incrementing key value to each row of data.
Add a derived column with the following function:
regexSplit(Column_4,'\s(?=(?:[^"](["])[^"]\1)[^"]$)')
This splits the data by spaces, ignoring the spaces inside the double-quoted values.
Then the unfold creates a new record for each item in the array while preserving the other column values:
unfold(SplitBySpace)
Then split the key-value pairs into their respective key and value on the '=' delimiter.
The final step would then be to unpivot the data back into columns with the respected values grouped by the surrogate key added in step 2.
This all sounds good, but unfortunately step 5 fails with the following error: "Indexing is only allowed on the array and map types".
The output after step 4.
The unfold function returns an array according to the inspect tab, see below. I would expect a string here!!
Now in step 5, I split by “=” with the expression split(unfoldSplitBySpace, '=') but this errors in the expression builder with the message “Split expect string type of argument”
Changing the expression to split(unfoldSplitBySpace1, '=') removes the error from the expression builder.
BUT THEN the Spark execution engine errors with "Indexing is only allowed on the array and map types".
The problem.
According to the Azure Data Factory UI, the output of the unfold() function is an array type, but when accessing the array elements or passing the output to any other function, the Spark engine does not recognise the object as an array type.
Is this a bug in the execution or do I have a problem in my understanding of how the data factory and a spark engine understand arrays?
The split() function splits a string into multiple values based on a delimiter and returns an array type.
If you are splitting the value at a particular index of an array, specify the index within brackets [].
Example:
Here I have an array value ["employee=Robert", "D"], and using split(), I am splitting the value at index 1 on =.
split(value[1], '=')
Microsoft help provided an answer.
First, it looks like a bug.
Second, there is a workaround. Cast the array output to string with toString(unfold(SplitBySpace))
https://learn.microsoft.com/en-us/answers/questions/860243/azure-data-factory-data-wrangling-with-data-flow-a.html#answer-865321

Need to convert the double data type precision of scale 1 to 2 in informatica cloud expression transformation

My incoming field is of double data type (Veeva), e.g. 134.0, 45.4, 61.234.
My output to the file always needs 2 decimal places, e.g. field values: 134.00, 45.40, 61.23.
I have tried creating an expression on informatica cloud:
TO_DECIMAL( IN_TAX, 2 )
but it's not giving the expected output.
Keep the data type as double with decimal places = 2 and use ROUND(). I am assuming your target is a flat file.
Create an output port in the Expression transformation like below:
out_tax = ROUND(in_tax, 2) -- data type is DOUBLE(xx,2), where xx is your precision.
Link this port to target.

Write a struct into a DICOM header

I created a private DICOM tag, and I would like to know if it is possible to use this tag to store a struct in a DICOM file using dicomwrite (or similar), instead of creating a field inside the DICOM header for each struct field.
(Something like saving a Patient's name, but instead of using a char data, I would use double)
Here is an example:
headerdicom = dicominfo('Test.dcm');
a.a = 1; a.b = 2; a.c = 3;
headerdicom.Private_0011_10xx_Creator = a;
img = dicomread('Test.dcm');
dicomwrite(img, 'test_modif.dcm', 'ObjectType', 'MR Image Storage', 'WritePrivate', true, headerdicom)
Undefined function 'fieldnames' for input arguments of type 'double'.
Thank you all in advance,
Depending on what "struct" means, here are your options. As you want to use a private tag which means no application but yours will be able to interpret it, you can choose the solution which is technically most appropriate. Basically your question is "which Value Representation should I assign to my private attribute using the DICOM toolkit of my choice?":
Sequence:
There is a DICOM Value Representation "Sequence" (VR=SQ) which allows you to store a list of attributes of different types. This VR is closest to a struct. A sequence can contain an arbitrary number of items each of which has the same attributes in the same order. Each attribute can have its own VR, so if your struct contains different data types (like string, integer, float), this would be my recommendation
Multi-value attribute:
DICOM supports the concept of "Value Multiplicity". This means that a single attribute can contain multiple values which are separated by backslashes. As the VR is a property of the attribute, all values must have the same type. If I understand you correctly, you have a list of floating point numbers which could be encoded as an array of doubles in one field with VR=FD (=Floating Point Double): 0.001\0.003\1.234...
Most toolkits support an indexed access to the attributes.
"Blob":
You can use an attribute with VR=OB (Other Byte) which is also used for encoding pixel data. It can contain up to 4 GB of binary data. The length of the attribute tells you of how many bytes the attribute's value consists. If you just want to copy the memory from / to the struct, this would be the way to go, but obviously it is the weakest approach in terms of type-safety and correctness of encoding. You are going to lose built in methods of your DICOM toolkit that ensure these properties.
To add a private attribute, you have to
reserve a range for the attribute by specifying an odd group number and a prefix (2 hex digits) for the element numbers. E.g. group = 0x0011, element prefix = 0x10 reserves the range (0x0011, 0x1000) - (0x0011, 0x10ff). This is done by specifying a Private Creator DICOM tag which holds a manufacturer name. So I suspect that instead of
headerdicom.Private_0011_10xx_Creator = a;
it should read e.g.
headerdicom.Private_0011_10xx_Creator = "Gabs";
register your private tags in the private dictionary, most of the time by specifying the Private Creator, group, element and VR (one of the options above)
Not sure how this can be done in Matlab.

Weka LibSVM one class classifier always predicts one class

I'm trying to use LibSVM classifier in Weka to build a one class SVM classifier.
My training file has list of noun words.
My test file has many words. My aim is to use the classifier to predict the words which are nouns in test file.
My input arff file (ip.arff)(training file) looks like this:
@relation test1
@attribute name string
@attribute class {yes}
@data
'building',yes
'car',yes
..... and so on
My test file(test.arff) (test file) looks like this:
@relation test2
@attribute name string
@attribute class {yes}
@data
'car',?
'window',?
'running',?
..... and so on
Here's what I've done:
Since the datatype is string, I used batch filtering on both input files to generate ipstd.arff and teststd.arff, as mentioned in http://weka.wikispaces.com/Batch+filtering
Next I load and run the classifier with ipstd.arff. (Note: all the words are classified as yes.)
Next I load the test set teststd.arff and re-evaluate the model.
But all the words are classified as nouns ('yes').
=== Predictions on user test set ===
inst# actual predicted error prediction
1 1:? 1:yes 1
2 1:? 1:yes 1
3 1:? 1:yes 1
and so on
My problem is that all the words in the test file (teststd.arff) are classified as nouns.
Can someone tell me where I'm going wrong?
What should I do to classify noun words in the test set as 'yes' and others as outliers?
Thanks...

Create ordinal array with multiple groups

I need to categorize a dataset according to different age groups. The categorization depends on whether the Sex is Male or Female. I first subset the data by gender and then use the ordinal function (dataset is from a Matlab example). The following code crashes on the last line when I try to vertically concatenate the subsets:
load hospital;
subset_m=hospital(hospital.Sex=='Male',:);
subset_f=hospital(hospital.Sex=='Female',:);
edges_f=[0 20 max(subset_f.Age)];
edges_m=[0 30 max(subset_m.Age)];
labels_m = {'0-19','20+'};
labels_f = {'0-29','30+'};
subset_m.AgeGroup= ordinal(subset_m.Age,labels_m,[],edges_m);
subset_f.AgeGroup = ordinal(subset_f.Age,labels_f,[],edges_f);
vertcat(subset_m,subset_f);
Error using dataset/vertcat (line 76)
Could not concatenate the dataset variable 'AgeGroup' using VERTCAT.
Caused by:
Error using ordinal/vertcat (line 36)
Ordinal levels and their ordering must be identical.
Edit
It seems that a vital part was missing in the question; here is the answer to the corrected question. You need to use join rather than vertcat, for example:
joinFull = join(subset_f,subset_m,'LeftKeys','LastName','RightKeys','LastName','type','rightouter','mergekeys',true)
Solution of original problem
It seems like you are actually trying to work with the wrong variable. If I change all instances of hospitalCopy into hospital then everything works fine for me.
Perhaps you copied hospital and edited it, thus losing the validity of the input.
If you really need to have hospitalCopy make sure to assign to it directly after load hospital.
If this does not help, try using clear all before running the code and make sure there is no file called 'hospital' in your current directory.