How do I use previous token's label as feature in my CRF? - mallet

I am looking for a way to use features conditioned with attributes and label bigrams in mallet. I am still trying to understand how would one be able to use the label of a token just generated as a feature for determining the label of the next token? Are the feature vectors for tokens generated as the labels for previous tokens obtained?
Have I misunderstood that CRF allows the usage of predicted previous label as a feature for the next token?
Thanks in advance!

Are the feature vectors for tokens generated as the labels for previous tokens obtained?
No, CRF optimizes loss jointly, there is no left-to-right processing like in MEMM where you predict a label and then use it. CRF takes in account all possible previous labels and finds a most likely sequence.
Have I misunderstood that CRF allows the usage of predicted previous label as a feature for the next token?
CRF allows to use previous labels as features; most likely it already happens automatically in your case. I don't have experience with Mallet, but in most out-of-box linear-chain CRF packages there are 2 kinds of features:
"state features". These are per-token features user defines; they can use any information from the input sequence (e.g. current and previous token, last 3 letters of the current token, etc.) Each state feature is usually conditioned on the current output label.
"transition features". In the most common 1-st order linear-chain CRF it is the current label conditioned on a previous label. Usually these features are generated automatically, for all possible label pairs.
Sometimes you can also condition (2) transition features on user-defined features which depend on a current token. It seems this is what you're looking for, but I'm not sure. Some packages implement this (e.g. wapiti), some don't (e.g. crfsuite). Some packages allow to define arbitrary CRFs, and use arbitrary features (e.g. pystruct, factorie, GRMM(?)). Sorry, I don't have experience with Mallet, so it is not really an answer :)

Related

Modifying AFL to include a new variable for the Fuzzer to consider in seed selection

I am looking on understanding how AFL implements its seed selection. To my understanding,afl-fuzz.c has a function called has_new_bits which returns values in identifying if the result of input creates a new path, new edge or if it is not an interesting branch we are considering. So my question is this, given that I am able to insert some lines of codes that allowed me to insert variables such as a counter, which I can insert other line of codes that will increment it in a given branch, how do I modify the AFL such that it is able to detect this?
In AFL++, you can affect the coverage bitmap directly using __afl_coverage_interesting. You can for instance compute the val parameter using the value of your counter (but remind that val is u8).
Another way, is to use FuzzFactory, a modified version of AFL that allow the user to define custom coverage metrics. In their paper the authors discuss one of the possible coverage metrics that FuzzFactory can use, that is validity. With validity, the fuzzer select with more probability valid inputs. You can hack around it and make a FuzzFactory version that focus on inputs triggering unsafe code instead of valid inputs.

Generating sub-systems based on user input (MATLAB/SimMechanics)

The user in this webinar;
http://www.mathworks.com.au/videos/parameterizing-bodies-68850.html?form_seq=conf1134
can create new levels of links for the scissor lift by copy pasting the sub systems.
I was wondering if there was any way the number of subsystems and the joints could be automated via user input.
i.e a gui which allows the user to input the number of levels in the scissor lift and that number of levels (subsystems) is generated in SimMechanics.
If someone could provide a solution I could adapt it to the problem I'm trying to solve.
Thanks in advance!
Yes, you can automate it, as long as you know what susbsytems and what joints you want to add. The functions of interest are:
add_block(path_to_your_subsystem,path_to_destination_subsystem) (I assume your susbsystem is stored in a library). You probably want to specify the 'Position` parameter so that all blocks don't end up on top of each other. It will take some experimenting to find coordinates that work for your model and that are parameterised based on the number of susbystems to add.
add_line(path_to_subsystem_of_interest,path_to_output_port,path_to_input_port). You'll need to know which port you want to connect to which and figure out how many times you need to do this based on the number of subsystems to add. Simscape and SimMechanics are a special type of ports, and you need to refer to them correctly otherwise it won't work, see Programatically connect two subsystems for more details (note: this is undocumented as far as I know and is therefore likely to change in a future release).
So in short, yes it's possible (I've done it in the past), but it's by no means easy. See this blog for a very simple introduction.

Are Operational Transformation Frameworks only meant for text?

Looking at all the examples of Operational Transformation Frameworks out there, they all seem to resolve around the transformation of changes to plain text documents. How would an OT framework be used for more complex objects?
I'm wanting to dev a real-time sticky notes style app, where people can co-create sticky notes, change their positon and text value. Would I be right in assuming that the position values wouldn't be transformed? (I mean, how would they, you can't merge them right?). However, I would want to use an OT framework to resolve conflicts with the posit-its value, correct?
I do not see any problem to use Operational Transformation to work with Complex Objects, what you need is to define what operations your OT system support and how concurrency is solved for them
For instance, if you receive two Sticky notes "coordinates move operation" from two different users from same 'client state', you need to make both states to converge, probably cancelling out second operation.
This is exactly the same behaviour with text when two users generate two updates to delete a text range that overlaps completely, (or maybe partially), the second update processed must be transformed against the previous and the resultant operation will only effectively delete a portion of the original one, (or completely cancelled with a 'no-op')
You can take a look on this nice explanation about how Google Wave Operational Transformation works and guess from this point how it should work your own implementation
See the following paper for an approach to using OT with trees if you want to go down that route:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.100.74
However, in your particular case, I would use a separate plain text OT document for each stickynote and use an existing library, eg: etherPad, to do the heavy lifting. The positions of the notes could then be broadcast on a last-committer-wins basis.
Operation Transformation is a general technique, it works for any data type. The point is you need to define your transformation functions. Also, there are some atomic attributes that you cannot merge automatically like (position and background color) those will be mostly "last-update wins" or the user solves them manually when there is a conflict.
there are some nice libs and frameworks that provide OT for complex data already out there:
ShareJS : library for Node which provides all operations on JSON objects
DerbyJS: framework for NodeJS, it uses ShareJS for OT stuff.
Open Coweb framework : Dojo foundation project for cooperative web applications using OT

How to start working with a large decision table

Today I've been presented with a fun challenge and I want your input on how you would deal with this situation.
So the problem is the following (I've converted it to demo data as the real problem wouldn't make much sense without knowing the company dictionary by heart).
We have a decision table that has a minimum of 16 conditions. Because it is an impossible feat to manage all of them (2^16 possibilities) we've decided to only list the exceptions. Like this:
As an example I've only added 10 conditions but in reality there are (for now) 16. The basic idea is that we have one baseline (the default) which is valid for everyone and all the exceptions to this default.
Example:
You have a foreigner who is also a pirate.
If you go through all the exceptions one by one, and condition by condition you remove the exceptions that have at least one condition that fails. In the end you'll end up with the following two exceptions that are valid for our case. The match is on the IsPirate and the IsForeigner condition. But as you can see there are 2 results here, well 3 actually if you count the default.
Our solution
Now what we came up with on how to solve this is that in the GUI where you are adding these exceptions, there should run an algorithm which checks for such cases and force you to define the exception more specifically. This is only still a theory and hasn't been tested out but we think it could work this way.
My Question
I'm looking for alternative solutions that make the rules manageable and prevent the problem I've shown in the example.
Your problem seem to be resolution of conflicting rules. When multiple rules match your input, (your foreigner and pirate) and they end up recommending different things (your cangetjob and cangetevicted), you need a strategy for resolution of this conflict.
What you mentioned is one way of resolution -- which is to remove the conflict in the first place. However, this may not always be possible, and not always desirable because when a user adds a new rule that conflicts with a set of old rules (which he/she did not write), the user may not know how to revise it to remove the conflict.
Another possible resolution method is prioritization. Mark a priority on each rule (based on things like the user's own authority etc.), sort the matching rules according to priority, and apply in ascending sequence of priority. This usually works and is much simpler to manage (e.g. everybody knows that the top boss's rules are final!)
Prioritization may also be used to mark a certain rule as "global override". In your example, you may want to make "IsPirate" as an override rule -- which means that it overrides settings for normal people. In other words, once you're a pirate, you're treated differently. This make it very easy to design a system in which you have a bunch of normal business rules governing 90% of the cases, then a set of "exceptions" that are treated differently, automatically overriding certain things. In this case, you should also consider making "?" available in the output columns as well.
One other possible resolution method is to include attributes in each of your conditions. For example, certain conditions must have no "zeros" in order to pass (? doesn't matter). Some conditions must have at least one "one" in order to pass. In other words, mark each condition as either "AND", "OR", or "XOR". Some popular file-system security uses this model. For example, CanGetJob may be AND (you want to be stringent on rights-to-work). CanBeEvicted may be OR -- you may want to evict even a foreigner if he is also a pirate.
An enhancement on the AND/OR method is to provide a threshold that the total result must exceed before passing that condition. For example, putting CanGetJob at a threshold of 2 then it must get at least two 1's in order to return 1. This is sometimes useful on conditions that are not clearly black-and-white.
You can mix resolution methods: e.g. first prioritize, then use AND/OR to resolve rules with similar priorities.
The possibilities are limitless and really depends on what your actual needs are.
To me this problem reminds business rules engine where there is no known algorithm to define outputs from inputs (e.g. using boolean logic) but the user (typically some sort of administrator) has to define all or some the logic itself.
This might sound a bit of an overkill but OTOH this provides virtually limit-less extension capabilities: you don't have to code any new business logic, just define a new rule set.
As I understand your problem, you are looking for a nice way to visualise the editing for these rules. But this all depends on your programming language and the tool you select for this. Java, for example, has JBoss Drools. Quoting their page:
Drools Guvnor provides a (logically
centralized) repository to store you
business knowledge, and a web-based
environment that allows business users
to view and (within certain
constraints) possibly update the
business logic directly.
You could possibly use this generic tool or write your own.
Everything depends on what your actual rules will look like. Rules like 'IF has an even number of these properties THEN' would be painful to represent in this format, whereas rules like 'IF pirate and not geek THEN' are easy.
You can 'avoid the ambiguity' by stating that you'll always be taking the first actual match, in other words your rules have a priority. You'd then want to flag rules which have no effect because they are 'shadowed' by rules higher up. They're not hard to find, so it's something your program should do.
Your interface could also indicate groups of rules where rules within the group can be in any order without changing the outcomes. This will add clarity to what the rules are really saying.
If some of your outputs are relatively independent of the others, you will also get a more compact and much clearer table by allowing question marks in the output. In that design the scan for first matching rule is done once for each output. Consider for example if 'HasChildren' is the only factor relevant to 'Can Be Evicted'. With question marks in the outputs (= no effect) you could be halving the number of exception rules.
My background for this is circuit logic design, not business logic. What you're designing is similar to, but not the same as, a PLA. As long as your actual rules are close to sum of products then it can work well. If your rules aren't, for example the 'even number of these properties' rule, then the grid like presentation will break down in a combinatorial explosion of cases. Your best hope if your rules are arbitrary is to get a clearer more compact presentation with either equations or with diagrams like a circuit diagram. To be avoided, if you can.
If you are looking for a Decision Engine with a GUI, than you can try this one: http://gandalf.nebo15.com/
We just released it, it's open source and production ready.
You probably need some kind of inference engine. Think about doing it in prolog.

a simple/practical example of fuzzy c-means algorithm

I am writing my master thesis on the subject of dynamic keystroke authentication. To support ongoing research, I am writing code to test out different methods of feature extraction and feature matching.
My current simple approach just checks if the reference password keycodes matches the currently typed in keycodes and also checks if the keypress times (dwell) and the key-to-key times (flight) are the same as reference times +/- 100ms (tolerance). This is of course very limited and I want to extend it with some sort of fuzzy c-means pattern matching.
For each key the features look like: keycode, dwelltime, flighttime (first flighttime is always 0).
Obviously the keycodes can be taken out of the fuzzy algorithm because they have to be exactly the same.
In this context, how would a practical implementation of fuzzy c-means look like?
Generally, you would do the following:
Determine how many clusters you want (2? "Authentic" and "Fake"?)
Determine what elements you want to cluster (individual keystrokes? login attempts?)
Determine what your feature vectors will look like (dwell time, flight time?)
Determine what distance metric you will be using (how will you measure the distance of each sample from each cluster?)
Create exemplar training data for each cluster type (what does an authentic login look like?)
Run the FCM algorithm on the training data to generate the clusters
To create the membership vector for any given login attempt sample, run it through the FCM algorithm using the clusters you found in step 6
Use the resulting membership vector to determine (based on some threshold criteria) whether the login attempt is authentic
I'm not an expert, but this seems like an odd approach to determining whether a login attempt is authentic or not. I've seen FCM used for pattern recognition (eg. which facial expression am I making?), which makes sense because you're dealing with several categories (eg. happy, sad, angry, etc...) with defining characteristics. In your case, you really only have one category (authentic) with defining characteristics. Non-authentic keystrokes are simply "not like" authentic keystrokes, so they won't cluster.
Perhaps I am missing something?
I don't think you really want to do clustering here. You might want to do some proper fuzzy matching though instead of just allowing some delta on each value.
For clustering, you need to have many data points. Additionally, you'd need to know the proper number of means you need.
But what are these multiple objects meant to be? You have one data point for every keycode. You don't want to have the user type the password 100 times to see if he can do it consistently. And even then, what do you expect the clusters to be? You already know which keycode comes at which position, you don't want to find out what keycodes the user use for his password...
Sorry, I really don't see any clustering here. The term "fuzzy" seems to have mislead you to this clustering algorithm. Try "fuzzy logic" instead.