How to show ignored attributes in Weka - cluster-analysis

I have an .arff file:
@RELATION Employee
@ATTRIBUTE EmployeeID string
@ATTRIBUTE sex {male,female}
@ATTRIBUTE age {young,middle-age,old-age}
@DATA
'5s6s6ss',male,old-age
'5s6s6tt',female,old-age
'5s6s6ii',male,young
I want to cluster this data in WEKA, but I have a string attribute, "EmployeeID". I have to ignore the string attribute, but then how can I show which EmployeeID is in cluster 0 and which is in cluster 1?

In the "Preprocess" panel, use the unsupervised attribute filter AddCluster to add the cluster assignment to the result. Do not forget to set the ignoredAttributeIndices value in the filter's configuration dialog. Enter "1" there to exclude EmployeeID from the clustering process (because it has too much predictive/discriminative power). The attribute value will still be displayed in the table.
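Once the filtered data has been saved back to an ARFF file, the IDs can be grouped by the added cluster attribute outside Weka. A minimal sketch, assuming the ID is the first attribute, the cluster label is the last one, and no value contains a comma. The sample text and its cluster assignments are made up for illustration and are not real Weka output:

```python
from collections import defaultdict

def ids_by_cluster(arff_text):
    """Group the first field (the ID) of each data row by the last field (the cluster label)."""
    groups = defaultdict(list)
    in_data = False
    for line in arff_text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):  # skip blank lines and ARFF comments
            continue
        if line.lower().startswith("@data"):
            in_data = True
            continue
        if in_data:
            # Naive split: assumes no quoted value contains a comma.
            fields = [f.strip().strip("'") for f in line.split(",")]
            groups[fields[-1]].append(fields[0])
    return dict(groups)

# Hypothetical filter output; the cluster column values are invented.
sample = """@RELATION Employee_clustered
@ATTRIBUTE EmployeeID string
@ATTRIBUTE sex {male,female}
@ATTRIBUTE age {young,middle-age,old-age}
@ATTRIBUTE cluster {cluster1,cluster2}
@DATA
'5s6s6ss',male,old-age,cluster1
'5s6s6tt',female,old-age,cluster2
'5s6s6ii',male,young,cluster1
"""

clusters = ids_by_cluster(sample)
for label, ids in clusters.items():
    print(label, ids)
```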

Related

SQL query to extract default (initial) value of a value property

I am trying to create a custom template fragment that builds a table of value properties. I started by creating a SQL query fragment that pulls all properties classified by a Value Type. Now I would like to pull in the default (initial) value assigned. I figured out that it's in the Description column of t_xref, with the property GUID in the Client field, but I don't know how to write a query that will reliably parse the default value out, since the string length may differ depending on which other values are set. I tried using the template content selector first but I couldn't figure out how to filter to only value properties. I'm still using the default .qeax file but will be migrating to a Windows-based DBMS soon. Appreciate any help!
Tried using the content selector. Successfully built a query to get value properties but got stuck trying to join and query t_xref for default value.
Edited to add current query and image
Value Properties are block properties that are typed to Value Types. I'm using SysML.
This is my current query, I am no SQL expert! I don't pull anything from t_xref yet but am pulling out only the value properties with this query:
SELECT property.ea_guid AS CLASSGUID, property.Object_Type AS CLASSTYPE, property.Name, property.Note as [Notes], classifier.Name AS TYPE
FROM t_object property
LEFT JOIN t_object classifier ON property.PDATA1 = classifier.ea_guid
LEFT JOIN t_object block on property.ParentID = block.Object_ID
WHERE block.Object_ID = #OBJECTID# AND property.Object_Type = 'Part' AND classifier.Object_Type = 'DataType'
ORDER BY property.Name
I guess that Geert will come up with a more elaborate answer, but (assuming you are after the Run State) here are some details. The value for these Run States is stored in t_object.runstate as one of the crude Sparxian formats. You find something like
@VAR;Variable=v1;Value=4711;Op==;@ENDVAR;
where v1 is the name and 4711 the default in this example. How can you marry that with your template? Not the faintest idea :-/
I can't give a full answer to the original question as I can't reproduce your data, but I can provide an answer for the generic problem of "how to extract data through SQL from the name-value pair in t_xref".
Note, this is heavily dependent on the database used. The example below extracts fully qualified stereotype names from t_xref in SQL Server for custom profiles.
select
substring(
t_xref.Description, charindex('FQName=',t_xref.Description)+7,
charindex(';ENDSTEREO',t_xref.Description,charindex('FQName=',t_xref.Description))
-charindex('FQName=',t_xref.Description)-7
),
Description from t_xref where t_xref.Description like '%FQName%'
This works using:
substring(string, start, length)
The string is the xref description column, and the start and length are set using:
charindex(substring, string, [start position])
This finds the start and end tags within the xref description field, for the data you're trying to parse.
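The same start-tag/end-tag logic can be sketched outside the database, which makes it easier to check against a few sample strings before writing the SQL. This is only an illustration in Python: the helper name extract_between and the sample string are made up, and the sample is a simplified stand-in rather than real t_xref content.

```python
from typing import Optional

def extract_between(text: str, start_tag: str, end_tag: str, after: str = "") -> Optional[str]:
    """Return the text between start_tag and the next end_tag, or None if a tag is missing.

    If `after` is given, the search for start_tag begins at the first
    occurrence of `after`, like the optional start position of charindex.
    """
    anchor = text.find(after) if after else 0
    if anchor == -1:
        return None
    start = text.find(start_tag, anchor)
    if start == -1:
        return None
    start += len(start_tag)           # skip past the tag itself
    end = text.find(end_tag, start)   # first end tag after the value starts
    if end == -1:
        return None
    return text[start:end]

# Made-up, simplified sample string.
sample = "Name=block;FQName=MyProfile::block;ENDSTEREO;"
fq_name = extract_between(sample, "FQName=", ";ENDSTEREO")
print(fq_name)
```

Returning None when a tag is missing plays the same role as the "like" filter in the query: rows without the tag are skipped instead of producing an invalid length.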
For your data, I imagine something like the below is the equivalent (I haven't tested this). It's then a case of combining it with the query you've already got.
select
substring(
t_xref.Description, -- the string to search in
charindex('@VALU=', t_xref.Description, charindex('@NAME=default', t_xref.Description)) + 6, -- the start position: just past the first @VALU= tag after @NAME=default
charindex('@ENDVALU;', t_xref.Description, charindex('@VALU=', t_xref.Description, charindex('@NAME=default', t_xref.Description)))
- charindex('@VALU=', t_xref.Description, charindex('@NAME=default', t_xref.Description)) - 6 -- the length: position of the first @ENDVALU; tag after that @VALU= tag, minus the start position
),
Description from t_xref where t_xref.Description like '%@NAME=default%' -- filter out anything which doesn't contain this tag to avoid invalid-length errors

Mapping Data Flows Dynamic Column Updates

I have a text input source. This has over 100 columns so I won't show all of them here - a cut-down view of the data would be:
CustomerNo | DOB        | DOD  | Status
01418495   | 01/02/1940 | NULL | 1
01418496   | 01/01/1930 | NULL | 1
The users want to be able to update/override any of these columns during processing by providing another input text file containing the PK (CustomerNo) and the key/value pairs of the columns to be updated e.g.
CustomerNo | Variable | New Value
01418495   | DOB      | 01/12/1941
01418496   | DOD      | 01/01/2021
01418496   | Status   | 0
Can this data be used to create dynamic columns somehow that update the customer records regardless of the columns they want to update - in the example above this would result in:
CustomerNo | DOB        | DOD        | Status
01418495   | 01/12/1941 | NULL       | 1
01418496   | 01/01/1930 | 01/01/2021 | 0
I have looked at the documentation but don't see any examples of how something like this could be achieved. Thanks in advance for any advice.
You would use a technique similar to what I describe in this video: https://www.youtube.com/watch?v=q7W6J-DUuJY. What I've done is created a file with rules that have expressions and then apply those rules dynamically inside of my data flow.
The key to making this work is using the expr() function to dynamically evaluate the expression from the external file.
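To make the requested result concrete, the override logic itself can be sketched in plain Python. This is not Data Flow expression code and does not use expr(); it only shows the key/value overlay the question describes. The function name apply_overrides and the choice to skip unknown customers or columns are assumptions made for the sketch.

```python
def apply_overrides(customers, overrides, key="CustomerNo"):
    """Return copies of the customer rows with (key, column, new value) overrides applied."""
    by_key = {row[key]: dict(row) for row in customers}  # copy rows so the input stays untouched
    for cust_no, column, new_value in overrides:
        row = by_key.get(cust_no)
        if row is None or column not in row:
            continue  # assumption: silently skip unknown customers or columns
        row[column] = new_value
    return list(by_key.values())

customers = [
    {"CustomerNo": "01418495", "DOB": "01/02/1940", "DOD": None, "Status": "1"},
    {"CustomerNo": "01418496", "DOB": "01/01/1930", "DOD": None, "Status": "1"},
]
overrides = [
    ("01418495", "DOB", "01/12/1941"),
    ("01418496", "DOD", "01/01/2021"),
    ("01418496", "Status", "0"),
]

updated = apply_overrides(customers, overrides)
for row in updated:
    print(row)
```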

FileMaker database design with calculated fields and filtering

I am trying out FileMaker Pro 12 right now with no previous FM experience, although I have other basic DB experience. The issue I have is trying to do filtered queries for a report that span one-to-many relationships. Here is an example:
The 2 tables:
Sample_Replicate
  PK
  Sample FK
  other fields
Weights
  Sample_Replicate_FK (linked to PK of Sample_Replicate)
  Weight
  Measurement type (tare, gross, dry, ash)
  Wash type (null or from list of lab assays)
I want to create a report that displays: (gross-tare), (dry-tare)/(gross-tare), (ash-tare)/(gross-tare), and (dry-tare)/(gross-tare) for all dry weights with non null wash types.
It seems that FM wants me to create columns for each of these values (which is doable as the list of lab assays changes minimally and updating the database would be acceptable, though not preferred). I have tried to add a gross wt, tare wt, etc to the Sample_Replicate table, but it only is returning the first record (tare wt) when I use calculated field and method:
tare wt field = Case ( Weights::Measurement type = "Tare"; Weights::Weights )
gross wt field = Case ( Weights::Measurement type = "Gross"; Weights::Weights )
etc...
It also seems to fail when I add the criterion:
and IsEmpty ( Weights::Wash type )
Could someone point me in the right direction on this issue? Thanks
EDIT:
I came across this: http://www.filemakertoday.com/com/showthread.php/14084-Calculation-based-on-1-to-many-relationship
It seems that I can create ~15 calculated fields for each combination of measurement and wash type on the weights table, then do a sum of these columns in the sample_replicate after adding these 15 columns to the table. This seems absolutely asinine. Isn't there a better way to filter results of a one-to-many relationship in FM?
What about the following structure:
Replicate
  ID
Wash Weight
  Replicate ID
  Type (null or from list of lab assays)
  Tare
  Gross
  Dry
  Ash
  + calculated fields
I assume you only calculate weight ratios of the same wash type. The weight types (tare, gross, etc.) are not just labels here; since you use them in formulas in specific places, they are more like roles, so I think they deserve their own fields.
Add the tare wt field, etc. in the Weights table, but then add a calc field in your Sample_Replicate table to get the sum of all related values.
Example: add a field "total tare wt" defined as "Sum ( Weights::tare wt )"
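The sum-over-related-records idea can be sketched outside FileMaker to check the arithmetic. This is a plain Python illustration, not FileMaker calculation syntax; the row layout, the function name role_total, and the sample weights are all made up.

```python
def role_total(weight_rows, replicate_id, role):
    """Sum the weights of one measurement type for one replicate."""
    return sum(
        row["weight"]
        for row in weight_rows
        if row["replicate_id"] == replicate_id and row["measurement_type"] == role
    )

# Made-up rows standing in for related Weights records.
weights = [
    {"replicate_id": 1, "measurement_type": "tare", "weight": 10.0},
    {"replicate_id": 1, "measurement_type": "gross", "weight": 30.0},
    {"replicate_id": 1, "measurement_type": "dry", "weight": 20.0},
    {"replicate_id": 2, "measurement_type": "tare", "weight": 5.0},
]

tare = role_total(weights, 1, "tare")
gross = role_total(weights, 1, "gross")
dry = role_total(weights, 1, "dry")

net = gross - tare                 # (gross - tare)
dry_fraction = (dry - tare) / net  # (dry - tare) / (gross - tare)
print(net, dry_fraction)
```

Because each replicate has at most one row per measurement type here, the sum simply picks out that one value, which is what the per-type field plus a sum in the parent table achieves.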

simplest example of a query by date range in cassandra 1.x

I want to store an ID and a date and I want to retrieve all entries from dateA up to dateB, what exactly do I need to be able to perform select from my_column_family where date >= dateA and date < dateB; ?
the guys at #cassandra (IRC) helped me find a way, there's many subtle details so I'd like to document that here.
first you need to declare a column family similar to this (examples from cassandra-cli):
create column family users with comparator=UTF8Type and key_validation_class=UTF8Type and column_metadata=[
{column_name: id, validation_class: LongType},
{column_name: name, validation_class: UTF8Type, index_type: KEYS},
{column_name: age, validation_class: LongType}
];
few important things about this declaration:
the comparator and key_validation_class are there to be able to use strings as key names
the first declared column is special, it's the "row key" which is used to address each row and therefore cannot contain duplicate values (the INSERT is really an UPSERT so when there's duplicates the new values overwrite the old ones)
the second column declares a "secondary index" on its values (more on that below)
the dates are stored as Long datatypes, interpretation is up to the client
now let's add some values:
set users[1][name] = john;
set users[1][age] = 19;
set users[2][name] = jane;
set users[2][age] = 21;
set users[3][name] = john;
set users[3][age] = 32;
according to this: http://pkghosh.wordpress.com/2011/03/02/cassandra-secondary-index-patterns/ Cassandra does not support the < operators on their own; what it does is manually exclude the rows that don't match, but it does that AFTER there's a resultset, and it refuses to do so unless an actual index lookup has taken place.
what that means is that a query like get users where age > 20; will return null but if we add a predicate that includes = it'll magically work.
here's where the secondary index is important, without it you can't use = so on this example I can do get users where name = jane; but I cannot ask for get users where age = 21;
the funny thing is that, after using = the < works so having a secondary index allows you to ask for get users where name = john and age > 20; and it'll filter correctly.
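the behaviour described above can be pictured with a toy model: an equality lookup on the indexed column produces the candidate rows, and the range predicate is only applied to those candidates. this is a plain Python sketch of that idea, not Cassandra client code and not a description of Cassandra's internals; the function names are made up.

```python
from collections import defaultdict

# Same sample data as the cassandra-cli "set" statements above.
users = {
    1: {"name": "john", "age": 19},
    2: {"name": "jane", "age": 21},
    3: {"name": "john", "age": 32},
}

def build_index(rows, column):
    """Map each value of `column` to the row keys holding it (a toy secondary index)."""
    index = defaultdict(list)
    for key, columns in rows.items():
        index[columns[column]].append(key)
    return index

def query(rows, index, eq_value, predicate):
    """Equality lookup through the index first, then filter the candidates."""
    return [key for key in index.get(eq_value, []) if predicate(rows[key])]

name_index = build_index(users, "name")

# Equivalent in spirit to: get users where name = john and age > 20;
result = query(users, name_index, "john", lambda row: row["age"] > 20)
print(result)
```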
There are a few ways to solve this. The simplest is probably the secondary index solution with the equality limitation mentioned in your own answer. I've used this method, adding an additional column called 'valid', setting the value to 1. Then the queries can become where valid=1 and date>nnnn
The other solutions require additional column families and additional queries.
When loading the data, create and add to a column family which contains the timestamps as keys, and each entry would list all the user ids as column names.
If the partitioning strategy is ordered, then a single RangeSliceQuery can specify the date range as a key range and get all the columns for each key. Then iterate through the result keys, using the column values for each user id and if needed, query the original column family for the data associated with each id. Cassandra always stores the column names sorted, and can be reversed when reading.
But, as documented, the ordered partitioner is not ideal, leading to hot spots and difficulty in load balancing the nodes.
Without the ordered partitioner, still keeping the timestamp column family, you would have to create another column family while loading data where you can store all the timestamps as the columns under one or more known keys (e.g. 'created' or 'updated'). The first query would be a SliceQuery for a known key, and then the column names (as timestamps) would provide the keys for the MultigetSliceQuery to the timestamp column family.
I've used variations on this, usually adding Composite keys or columns for additional flexibility.
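The timestamp-index idea can be modelled in memory to show what the range lookup does: timestamps are kept sorted (as column names are), and a slice from dateA up to but excluding dateB yields the matching ids. This is a plain Python sketch with a made-up DateIndex class; it does not use any Cassandra client library.

```python
import bisect

class DateIndex:
    """Toy model of an index row whose sorted column names are timestamps."""

    def __init__(self):
        self._timestamps = []  # kept sorted, like column names within a row
        self._ids = {}         # timestamp -> list of entry ids

    def add(self, timestamp, entry_id):
        if timestamp not in self._ids:
            bisect.insort(self._timestamps, timestamp)
            self._ids[timestamp] = []
        self._ids[timestamp].append(entry_id)

    def range(self, start, end):
        """Return ids with start <= timestamp < end, in timestamp order."""
        lo = bisect.bisect_left(self._timestamps, start)
        hi = bisect.bisect_left(self._timestamps, end)
        return [i for ts in self._timestamps[lo:hi] for i in self._ids[ts]]

index = DateIndex()
index.add(100, "a")
index.add(200, "b")
index.add(150, "c")
index.add(300, "d")

matches = index.range(100, 300)
print(matches)
```

The ids returned by the range lookup would then be used as keys for a second read against the original column family, as described above.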

Create Unique Field Value by Concatenation

How can you generate a unique value for a field that is a concatenation of certain fields and a random number?
i.e.
First Name: Jim
Last Name: Jones
Field Value: jimjones0345
Obviously, there's a need to ensure that this value has not been used before. How would one go about this?
Assuming you're using SQL Server 2005 or later...
You might try something like
update myTable
set myNewColumn = FirstName+LastName+convert(varchar,(ABS(CAST(CAST(NEWID() AS VARBINARY(5)) AS Bigint))));
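A random suffix alone does not guarantee uniqueness, so the value still has to be checked against the ones already in use. A minimal sketch of that check-and-retry loop in Python, assuming the existing values are available as a set; the function name make_unique_value and its parameters are made up, and in a real database a unique constraint on the column would be the final safeguard.

```python
import random

def make_unique_value(first, last, taken, rng=random, digits=4, max_tries=1000):
    """Build lowercase first+last plus a zero-padded random number not already in `taken`."""
    base = (first + last).lower()
    for _ in range(max_tries):
        candidate = f"{base}{rng.randrange(10 ** digits):0{digits}d}"
        if candidate not in taken:
            taken.add(candidate)  # reserve it so later calls cannot reuse it
            return candidate
    raise RuntimeError("no free value found; consider more digits")

taken = set()
first_value = make_unique_value("Jim", "Jones", taken)
second_value = make_unique_value("Jim", "Jones", taken)
print(first_value, second_value)
```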