Same feature value for all corresponding annotations - UIMA Ruta

I have tried to set the containing sentence as a feature of each ME_UNITSPACING annotation, but I'm getting the same sentence value for all occurrences of ME_UNITSPACING.
Sample Code:
DECLARE LOWERCAMELCASE,UPPERCAMELCASE;
DECLARE ME_UNITSPACING(STRING sentence, STRING replace,STRING description);
Document{-> RETAINTYPE(SPACE)};
SW CW{->MARK(LOWERCAMELCASE,1,2)};
CW CW{->MARK(UPPERCAMELCASE,1,2)};
Document{-> RETAINTYPE};
LOWERCAMELCASE{REGEXP("mmHg")->MARK(ME_UNITSPACING)};
UPPERCAMELCASE{REGEXP("MmHg")->MARK(ME_UNITSPACING)};
W{REGEXP("Mmhg",true)->MARK(ME_UNITSPACING)};
DECLARE UnitspacingSENTENCE;
SENTENCE{CONTAINS(ME_UNITSPACING)->UnitspacingSENTENCE};
STRING unitspacingsent;
UnitspacingSENTENCE{->MATCHEDTEXT(unitspacingsent)};
ME_UNITSPACING{->ME_UNITSPACING.sentence=unitspacingsent};
Sample Input:
A number of psychological and mmHg psychiatric correlates have been found
implicated in the onset and/or repetition of NSSI behavior. Nock et al. 14
reported 9 k that more than half of the clinical adolescents they studied
met the DSM-IV criteria for an internalizing disorder, an externalizing
disorder, or a substance-related disorder, with a prevalence mmHg rate of
psychiatric pathologies estimated to be as high as 87%. In a large
community-based sample of 12,068 adolescents from 11 countries, Brunner et
al. (2014) found significant associations mmHg with symptoms of depression
and anxiety in adolescents who engaged in self-harming behavior 6, and they
emphasized that self-injury is strongly indicative of psychological
problems that require professional attention. Their results are consistent
with previous reports of a significantly higher rate of depressive and
anxious symptoms in self-injurers.5,15,16,17,18,19 The onset of NSSI
behavior in teenagers with depression is mainly attributable to the
function of NSSI as a way to seek relief from the depressive symptoms. 20
The literature generally stresses the broad variety of psychiatric problems
seen in mmHg teenagers with history of NSSI. Cluster B personality
disorders are often identified, especially in self-cutting adolescent
females, and so are eating disorders; approximately one in three
adolescents with eating disorders are also self-injurers, the NSSI
frequently coinciding with or following the eating disorder 21, 22.

The problem with the original script is that the STRING variable is assigned across the whole document, so it ends up holding the text of the last matching sentence, and every ME_UNITSPACING gets that same value. Restricting the last rules to one sentence at a time with BLOCK(foreach) fixes this:

DECLARE LOWERCAMELCASE, UPPERCAMELCASE;
DECLARE ME_UNITSPACING(STRING sentence, STRING replace, STRING description);
Document{-> RETAINTYPE(SPACE)};
SW CW{->MARK(LOWERCAMELCASE,1,2)};
CW CW{->MARK(UPPERCAMELCASE,1,2)};
Document{-> RETAINTYPE};
LOWERCAMELCASE{REGEXP("mmHg")->MARK(ME_UNITSPACING)};
UPPERCAMELCASE{REGEXP("MmHg")->MARK(ME_UNITSPACING)};
W{REGEXP("Mmhg",true)->MARK(ME_UNITSPACING)};
DECLARE UnitspacingSENTENCE;
SENTENCE{CONTAINS(ME_UNITSPACING)->UnitspacingSENTENCE};
// Iterate over each UnitspacingSENTENCE so that the variable holds
// that sentence's text rather than the last match in the document.
BLOCK(foreach) UnitspacingSENTENCE{}
{
    STRING unitspacingsent;
    UnitspacingSENTENCE{->MATCHEDTEXT(unitspacingsent)};
    ME_UNITSPACING{->ME_UNITSPACING.sentence=unitspacingsent};
}


Average result of 50 NetLogo simulations (agent-based simulation)

I run an infectious disease spread model similar to the "Virus" model in the Models Library, changing the "infectiousness" parameter.
I did 20 runs each for infectiousness values of 98%, 95%, and 93%, and the maximum infected count was 74.05, 73, and 78.9 respectively (the peak was at tick 38 for all three infectiousness values).
[I took the average of the infected count at each tick across runs and took the maximum of these averages as the "maximum infected".]
I was expecting the maximum infected count to decrease when the infectiousness is reduced, but it didn't. From what I understand, this happens because I considered the average values of the simulation runs (it is as if I am considering a new simulation run whose infected count at each tick is the average).
What I want is to take all 20 simulation runs into account. Is there a way to do that other than averaging, the way I did?
In the Models Library Virus model, with default settings for the other parameters and those high infectiousness values, what I see when I run the model is a periodic variation in the numbers of the three classes of people. Look at the plot in the lower left corner and you'll see this. What is happening, I believe, is this:
1. When there are many healthy, non-immune people, there are many people who can get infected, so the number of infected people goes up and the number of healthy people goes down.
2. Soon after that, the number of sick, infectious people goes down, because they either die or become immune.
3. Since there are now more immune people and fewer infectious people, the number of non-immune healthy people grows; they are reproducing. (See "How it works" in the Info tab.) But now we have returned to the situation in step 1, ... so the cycle continues.
If your model is sufficiently similar to the Models Library Virus model, I'd bet that this is part of what's happening. If you don't have a plot window like the Virus model's, I recommend adding one.
Also, you didn't say how many ticks you are running the model for. If you run it for a small number of ticks, you won't notice the periodic behavior, but that doesn't mean it hasn't begun.
What this all means is that increasing infectiousness wouldn't necessarily increase the maximum number infected: a faster rate of infection means that the number of individuals who can be infected drops faster. I'm not sure that the maximum number infected over the whole run is an interesting number with this model and a high infectiousness value; it depends on what you are trying to understand.
One of the great things about NetLogo and some other ABM systems is that you can watch the system evolve over time, using various tools such as plots, monitors, etc., as well as just watching the agents move around or change states over time. This can help you understand what is going on in a way that a single number like an average won't. Then you can use this insight to figure out a more informative way of measuring what is happening.
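As one concrete example of a more informative measurement (my illustration, not from the original question): averaging across runs first and then taking the maximum can smooth away individual runs' peaks, while taking each run's peak first and then averaging treats each run as its own experiment. A minimal sketch, assuming the infected counts are collected in a runs-by-ticks array (e.g. exported from BehaviorSpace or similar):

import numpy as np

# counts[r, t] = infected count in run r at tick t; toy data stands in
# for the 20 NetLogo runs here.
rng = np.random.default_rng(0)
counts = rng.poisson(lam=50, size=(20, 200))

max_of_mean = counts.mean(axis=0).max()  # what the question computed
mean_of_max = counts.max(axis=1).mean()  # average of per-run peaks

# mean_of_max >= max_of_mean always; the gap shows how much the
# tick-wise average flattens the peaks of individual runs.
print(max_of_mean, mean_of_max)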
Another model where you can see a similar kind of periodic pattern is Wolf Sheep Predation. I recommend looking at that; it may be easier to understand the pattern there. (If you are interested in mathematical models of this kind of phenomenon, look up Lotka-Volterra models; the classic equations are given below.)
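For reference, the classic Lotka-Volterra predator-prey equations (the standard textbook form, nothing NetLogo-specific) are, in LaTeX notation:

\begin{aligned}
\frac{dx}{dt} &= \alpha x - \beta x y \\
\frac{dy}{dt} &= \delta x y - \gamma y
\end{aligned}

where x is the prey population, y is the predator population, and \alpha, \beta, \gamma, \delta are positive rate constants; loosely, the healthy non-immune people play the role of prey and the infection that of predator.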
(Real virus transmission can be more complicated, because a person (or other animal) is a kind of big "island" where viruses can reproduce quickly. If they reproduce too quickly, this can kill the host and prevent further transmission of the virus. Sometimes a virus that reproduces more slowly can harm more people, because hosts survive long enough to infect others. This blog post by Elliott Sober gives a relatively simple mathematical introduction to some of the issues involved, but his simple mathematical models don't take into account all of the complications of real virus transmission.)
EDIT: You added a comment Lawan, saying that you are interested in modeling COVID-19 transmission. This paper, Variation and multilevel selection of SARS‐CoV‐2 by Blackstone, Blackstone, and Berg, suggests that some of the dynamics that I mentioned in the preceding remarks might be characteristic of COVID-19 transmission. That paper is about six months old now, and it offered some speculations based on limited information. There's probably more known now, but this might suggest avenues for further investigation.
If you're interested, you might also consider asking general questions about virus transmission on the Biology Stackexchange site.

Compute capability of a small (1mm^2) ASIC

I was watching a recent ACM Turing Lecture by Hennessy and Patterson and was intrigued by a stat they cited on the cost of small chip tape-outs. They claimed that you can tape out 100 1 mm x 1 mm chips at the 28 nm process node for $14,000, presumably on a test shuttle.
My question is, if I wanted to fill this chip area with MAC units (say 16 or 32 bit), how many simultaneous MACs could I do per cycle?
Just as a back-of-the-envelope calculation, this paper describes a 32x32->64 multiplier as being 435 um x 482 um in Synopsys' 90 nm educational technology. If you just trivially scale to 28 nm, you get about 0.02 mm^2 per instance. That's probably within an order of magnitude, which is good enough, because "multipliers per mm^2" isn't really a meaningful metric: the interesting part is how to get data into and out of such a multiplier array, which will dominate the area of the actual multipliers.
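Spelling that scaling arithmetic out (an idealized linear shrink; real standard-cell area doesn't scale quite this well):

# Back-of-the-envelope: scale the 90 nm multiplier footprint to 28 nm
# and count how many instances fit in 1 mm^2. Ideal area scaling goes
# as (28/90)^2; real processes shrink less favourably than this.
area_90nm_mm2 = 0.435 * 0.482           # 435 um x 482 um ~= 0.21 mm^2
shrink = (28 / 90) ** 2                 # ~0.097
area_28nm_mm2 = area_90nm_mm2 * shrink  # ~0.02 mm^2
print(f"{area_28nm_mm2:.3f} mm^2 each, ~{1 / area_28nm_mm2:.0f} per mm^2")

That suggests on the order of 50 bare multipliers per mm^2, before accounting for any interconnect, registers, or SRAM.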
For another reference, the FU540-C000 is 30 mm^2 in TSMC's 28 nm HPC process. Yunsup's Hot Chips presentation from last year shows a fairly detailed die plot on page 17, from which you can calculate what 1 mm^2 gets you in a modern technology -- it's quite a bit of SRAM/logic, but not many pads.

How to tell that my self-play neural network is overfitting

I have a neural network designed to play Connect 4; it gauges the value of a game state toward Player 1 or Player 2.
In order to train it, I am having it play against itself for n number of games.
What I've found is that 1,000 games results in better game-play than 100,000, even though the mean squared error, averaged over every 100 games, constantly improves across the 100,000 epochs.
(I determine this by challenging the top-ranked player at http://riddles.io)
I've therefore reached the conclusion that over-fitting has occurred.
With self-play in mind, how do you successfully measure/determine/estimate that over-fitting has occurred? I.e., how do I determine when to stop the self-play?
I'm not super familiar with reinforcement learning, being much more a supervised-learning person. With that said, I feel like your options are nevertheless going to be the same as for supervised learning.
You need to find the point at which performance on inputs (and I use that term loosely) outside of the training space (again, loosely) starts to decrease. When that happens, you terminate training. You need early stopping.
For supervised learning, this would be done by having a held-out dev set, as an imitation of having a test set. In your case, it seems clear that this would be making your bot play a bunch of real people -- which is a perfect imitation of the test set, and is exactly what you have done. The downside is that sufficient play against real people is slow.
What you can do to partially offset this is, rather than pausing training to do this test, take a snapshot of your network (say every 500 iterations), start that up in a separate process as a bot, test it, and record the score while the network is still training. However, this won't really help in this case, as I imagine that the time taken for even one trial game is much longer than the time taken to run 500 iterations of training. Still, this is applicable if you were not converging so fast.
I assume, since this problem is so simple, that this is for learning purposes. On that basis, you could fake real people. Connect 4 has a small enough play space that classic game-playing AI should be able to play nearly perfectly. So you can set up a bot for it to play against (as its dev-set equivalent) that uses alpha-beta-pruned minimax; a minimal sketch follows below. Run a game against that every 100 iterations or so, and if your relative score starts decreasing, you know you've overfitted.
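Here is a minimal sketch of that reference bot's search. The game object and its methods (legal_moves, apply, is_terminal, score) are hypothetical hooks, not anything from the question; you would implement them with your Connect 4 rules:

import math

# Plain alpha-beta-pruned minimax over a game object that supplies
# legal_moves(state), apply(state, move), is_terminal(state), and
# score(state) (positive = good for the maximizing player).
def alphabeta(state, depth, alpha, beta, maximizing, game):
    if depth == 0 or game.is_terminal(state):
        return game.score(state), None
    best_move = None
    if maximizing:
        value = -math.inf
        for move in game.legal_moves(state):
            child, _ = alphabeta(game.apply(state, move), depth - 1,
                                 alpha, beta, False, game)
            if child > value:
                value, best_move = child, move
            alpha = max(alpha, value)
            if alpha >= beta:
                break  # beta cutoff: the minimizer avoids this branch
    else:
        value = math.inf
        for move in game.legal_moves(state):
            child, _ = alphabeta(game.apply(state, move), depth - 1,
                                 alpha, beta, True, game)
            if child < value:
                value, best_move = child, move
            beta = min(beta, value)
            if beta <= alpha:
                break  # alpha cutoff: the maximizer avoids this branch
    return value, best_move

# Usage: _, move = alphabeta(start, 8, -math.inf, math.inf, True, game)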
The other thing you could do is make it less likely to overfit in the first place. This wouldn't help you detect overfitting, but if you make it hard enough to occur, you can to an extent assume that it isn't happening. So: L1/L2 weight penalties, dropout, smaller hidden-layer sizes.
You could also increase the training-set equivalent: rather than pure self-play, you could use play against other bots, potentially even other versions of itself set up with different hyperparameters.
Rather than measuring/detecting when overfitting starts to occur, it's easier to take steps to prevent it from happening. Two ideas for doing this:
Instead of always having the agent play against itself, have it play against an agent randomly selected from a larger set of older versions of itself. This idea is somewhat similar in spirit to Lyndon's idea of testing against humans and/or alpha-beta search engines (and very similar to his idea in the last paragraph of his answer). However, the goal here is not to test and figure out when performance starts dropping against a test set of opponents; the goal is simply to create a diverse set of training opponents, so that your agent cannot afford to overfit against any single one of them. I believe this approach was also used in [1, 2].
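A minimal sketch of this opponent-pool idea (Agent and its play_game_against method are illustrative stubs, not a real API):

import copy
import random

class Agent:
    """Stub learner; replace with your real network plus training code."""
    def play_game_against(self, opponent):
        pass  # play one game against opponent, update self from the result

agent = Agent()
pool = []  # frozen snapshots of earlier versions of the agent

for game in range(100_000):
    # Train mostly against a randomly chosen older self, sometimes against
    # the current self, so no single opponent can be overfit to.
    if pool and random.random() < 0.8:
        opponent = random.choice(pool)
    else:
        opponent = agent
    agent.play_game_against(opponent)
    if game % 500 == 0:
        pool.append(copy.deepcopy(agent))  # snapshot the current policy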
Incorporate search algorithms (like MCTS) directly in the agent's action selection during training. The combination of NN + search (typically informed/biased by the NN) is usually a bit stronger than the NN on its own. So you can keep updating the NN to make its behaviour more like the behaviour of NN + search, and it'll generally be an improvement. The search part is unlikely to ever overfit against a specific opponent, because it's not learned from experience; the search always behaves in the same way. If the NN on its own starts overfitting against a particular opponent and starts suggesting moves that would be bad in general but good against that particular opponent, a search algorithm should be able to exploit/punish this mistake by the overfitting NN, and therefore provide feedback that moves the NN away from overfitting again. Examples of this approach can be found in [3, 4, 5].
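A schematic of this NN + search loop (Net and search_policy are placeholder stubs, not a real API; a real version plugs in actual MCTS and gradient updates):

import random

class Net:
    """Stub policy network for Connect 4 (7 columns)."""
    def policy(self, state):
        return [1 / 7] * 7  # placeholder: uniform over the 7 columns
    def fit(self, state, target):
        pass  # move policy(state) toward the search-improved target

def search_policy(state, net):
    # Placeholder for MCTS guided by the net; a real implementation
    # would return visit-count proportions over the legal moves.
    return net.policy(state)

net = Net()
for _ in range(1000):
    state = ()  # toy state: the tuple of moves played so far
    while len(state) < 42:  # a Connect 4 game lasts at most 42 moves
        target = search_policy(state, net)  # NN + search: stronger target
        net.fit(state, target)              # distil search back into the NN
        move = random.choices(range(7), weights=target)[0]
        state = state + (move,)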
The second idea probably requires much more engineering effort than the first, and it also only works if you can actually implement a search algorithm like MCTS (which you can, since you know the game's rules), but it probably works a bit better. I don't know for sure that it'll work better, though; I only suspect it does because it was used in later publications with better results than the papers using the first idea.
References
[1] Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. (2016). Mastering the Game of Go with Deep Neural Networks and Tree Search. Nature, Vol 529, No. 7587, pp. 484-489.
[2] Bansal, T., Pachocki, J., Sidor, S., Sutskever, I., and Mordatch, I. (2017). Emergent Complexity via Multi-Agent Competition. arXiv:1710.03748v2.
[3] Anthony, T. W., Tian, Z., and Barber, D. (2017). Thinking Fast and Slow with Deep Learning and Tree Search. arXiv:1705.08439v4.
[4] Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y., Lillicrap, T., Hui, F., Sifre, L., van den Driessche, G., Graepel, T., and Hassabis, D. (2017). Mastering the Game of Go without Human Knowledge. Nature, Vol. 550, No. 7676, pp. 354-359.
[5] Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., Lillicrap, T., Simonyan, K., and Hassabis, D. (2017). Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm. arXiv:1712.01815v1.

Error while importing a txt file into Mallet

I have been having trouble converting some txt files to Mallet format. I keep getting:
Exception in thread "main" java.lang.IllegalStateException: Line #39843 does not match regex:
and line #39843 reads:
24393584 |Title Validation of a Danish version of the Toronto Extremity Salvage Score questionnaire for 
patients with sarcoma in the extremities.The Toronto Extremity Salvage Score (TESS) questionnaire is a selfadministered questionnaire designed to assess physical disability in patients having undergone surgery of the extremities. The aim of this study was to validate a Danish translation of the TESS. The TESS was translated according to international guidelines. A total of 22 consecutive patients attending the regular outpatient control programme were recruited for the study. To test their understanding of the questionnaires, they were asked to describe the meaning of five randomly selected questions from the TESS. The psychometric properties of the Danish version of TESS were tested for validity and reliability. To assess the testretest reliability, the patients filled in an extra TESS questionnaire one week after they had completed the first one. Patients showed good understanding of the questionnaire. There was a good internal consistency for both the upper and lower questionnaire measured by Cronbach's alpha. A BlandAltman plot showed acceptable limits of agreement for both questionnaires in the testretest. There was also good intraclass correlation coefficients for both questionnaires. The validity expressed as Spearman's rank correlation coefficient comparing the TESS with the QLQC30 was 0.89 and 0.90 for the questionnaire on upper and lower extremities, respectively. The psychometric properties of the Danish TESS showed good validity and reliability. not relevant.not relevant.
This happens for quite a few of the lines, and when I remove the offending line, the rest of the file imports into Mallet fine. What in this line could be causing the regex mismatch?
thanks,
Priya
Mallet has problems handling certain characters, because of bad programming. Try running
tr -dc '[:alnum:] ,.\n' < ./inputfile.txt > ./inputfilefixed.txt
before running Mallet (the character set is quoted so the shell doesn't try to glob the brackets). This deletes every character that is not a letter, digit, space, comma, period, or newline, which usually solves the problem for me.
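If you'd rather do the same cleanup in Python (my equivalent of the tr call above; note this keeps only ASCII letters and digits, whereas [:alnum:] is locale-dependent):

import re

# Keep only letters, digits, spaces, commas, periods, and newlines,
# mirroring tr -dc '[:alnum:] ,.\n' on the input file.
with open("inputfile.txt", encoding="utf-8", errors="ignore") as f:
    text = f.read()

cleaned = re.sub(r"[^A-Za-z0-9 ,.\n]", "", text)

with open("inputfilefixed.txt", "w", encoding="utf-8") as f:
    f.write(cleaned)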

Making predictions from a CV

I have a database with many CVs, including structured data on the gender, age, address, number of years of education, and many other parameters of each person.
For about 10% of the sample, I also have additional data about a certain action they've made at some point in time. For instance, that Jane took a home loan in July 1998 or that John started pilot training in Jan. 2007 and got his license in Dec. 2007.
I need an algorithm that will give, for each of the actions, the probability that it will happen for each person in future time increments. For instance, that the chance of Bill taking a home loan is 2% in 2011, 3.5% in 2012, etc.
How should I approach this? Regression analysis? SVM? Neural net? Something else?
Is there perhaps even some standard tool/library that I can use with just the obvious customizations?
The probability that X happens given that Y happened is right out of Bayesian inference, I think.
Lou is right; this is a case for Bayesian inference.
The best tool/library to solve this with is the R statistical programming language (r-project.org).
Take a look at the Bayesian inference packages in R:
http://cran.r-project.org/web/views/Bayesian.html
How many people are in the "10% of the sample"? If it's below 100 people or so, I would fear that the results of the analysis would not be significant. If it's 1,000 or more people, the results will be quite good (rule of thumb).
I would first export the data to R (r-project) and do whatever data cleaning is necessary. Then find a person familiar with R and advanced statistics; they will be able to solve this very quickly. Or try it yourself, but R takes some time to learn at the beginning.
Concerning the tool/library choice, I suggest you give Weka a try. It's an open-source tool for experimenting with data mining and machine learning. Weka has several tools for reading, processing, and filtering your data, as well as prediction and classification tools.
However, you must have a strong foundation in the fields mentioned above in order to get a useful result.
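As one concrete starting point (my suggestion, not something from the answers above): if you flatten the problem to "did person X perform action A within time window T?", any probabilistic classifier gives per-person probabilities, and logistic regression is the simplest. A minimal sketch with scikit-learn on made-up features:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for CV features: [age, years_of_education, gender_flag]
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
# Synthetic label: did this person take out a home loan in the window?
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=1000)) > 1

model = LogisticRegression().fit(X, y)

# Probability of the action for a new person; to get per-year numbers,
# fit one model per time window or include the window as a feature.
p = model.predict_proba(X[:1])[0, 1]
print(f"estimated probability: {p:.2%}")

Since only ~10% of the sample has the outcome, you would also want to check calibration of the predicted probabilities rather than just classification accuracy.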