I have training data (.arff) and I want to convert it to test data.
This is my training data:
@relation fix_labeled_tweet
@attribute Text string
@attribute class-att {relevant,not_relevant,additional}
@data
'pvj dengan ciwalk masih tetap jadi tempat fav untuk belanja;',additional
'deta di bandung trade centre btc fashion mall;',additional
'promo hotel bandung ibis trans studio enjoy our special price akan your wonderful weekend periode s di 27 desember;',not_relevant
'indri theressa di cihampelas walk ciwalk;',additional
'beiga we di jatinangor town square jatos;',additional
'nonton di paris van java my husband;',relevant
'mainya seringnya ke paris van java mall miko mall mana;',not_relevant
'double date yeahhhh di braga city walk;',relevant
'sinta di jatinangor town square jatos;',additional
'terimakasih tas dompet teguh di cihampelas walk ciwalk;',additional
'malam minggu miko the movie di cinema 21 mall panakukang;',additional
'karaokean sekalian dugem patriot handrian di inul vista paskal hypersquare;',relevant
'makan di mujigae korean resto ciwalk;',relevant
'just posted a photo bandung trade center;',additional
What I've tried is removing the labels (additional, relevant, not_relevant) from the data and saving the result under a different name, but it's not working. Weka says that the train and test set are not compatible.
They are incompatible because the structure of the training set and the testing set is different.
If you make a copy of the file (say as Testing.arff) and supply it as the test set, the classifier will accept it fine. If, however, you remove attributes from the testing file, the file cannot be used, because some of the inputs (for classification) or outputs (for evaluation) are missing.
I was able to replicate your issue by removing the class attribute; with a straight copy of the file, the test set works correctly as expected.
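For concreteness, a compatible test file keeps the exact header of the training file, including the class attribute. Instances whose class is unknown can use ? as the class value, which is standard ARFF notation for a missing value (the tweet below is invented purely for illustration):

```
@relation fix_labeled_tweet
@attribute Text string
@attribute class-att {relevant,not_relevant,additional}
@data
'nonton film di paris van java;',?
```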
Hope this helps!
I have a dataset and I want to tag it for Named Entity Recognition. My dataset is in Persian.
I want to know how I should tag expressions like:
*** آقای مهدی کاظمی = Mr Mehdi Kazemi / Mr Will Smith >>> (names with titles) Should I tag the whole phrase as a person, or only the first and last name? (I mean, should I also tag "Mr"?)
Mr >> b_per || Mr >> O
Mehdi >> i_per || Mehdi >> b_per
Kazemi >> i_per || Kazemi >> i_per
*** بیمارستان نور = Noor Hospital >>> Should I tag only the name, or both the name and "hospital" as the named entity?
*** Eiffel Tower / The Ministry of Defense (I mean the US DoD, for example) >>> in Persian it is called:
وزارت دفاع (vezarate defa)
Should I tag only "Defense", or everything together?
There are many more examples for schools, movies, cities, countries and so on, since in Persian the entity-class word comes before the named entity.
I would appreciate it if you could help me with tagging this dataset.
I'll give you some examples from the CoNLL 2003 training data:
"Mr." is not tagged as part of the person, so titles are ignored.
"Columbia Presbyterian Hospital" is tagged as (LOC, LOC, LOC)
"a New York hospital" (O, LOC, LOC, O)
"Ministry of Commerce" is (ORG, ORG, ORG)
I think "Eiffel Tower" should be (LOC, LOC)
In general, you tag the way you want the output to look. It's up to you whether titles are included, for example. However, CoreNLP won't tag overlapping entities, so you have to make a decision for cases like a hospital named after a person.
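Following the CoNLL convention above (titles left outside the entity), the asker's first example could be encoded like this. This is a minimal sketch; the tag names follow the asker's b_per/i_per scheme, and the title list is just illustrative:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BioTagging {
    // Tag a tokenized person mention: titles get O, the name gets b_per/i_per.
    public static Map<String, String> tagPerson(String[] tokens) {
        Map<String, String> tags = new LinkedHashMap<>();
        boolean inEntity = false;
        for (String token : tokens) {
            if (token.equals("Mr") || token.equals("Mr.")) {
                tags.put(token, "O");       // titles are ignored, as in CoNLL 2003
            } else if (!inEntity) {
                tags.put(token, "b_per");   // first name token opens the entity
                inEntity = true;
            } else {
                tags.put(token, "i_per");   // remaining name tokens continue it
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        System.out.println(tagPerson(new String[] {"Mr", "Mehdi", "Kazemi"}));
        // {Mr=O, Mehdi=b_per, Kazemi=i_per}
    }
}
```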
I believe you are heading toward Stanford NLP and the BIO format. But in case you'd also consider other options, you may have a look at structured entities such as: http://www.afcp-parole.org/etape/docs/etape-06022012-quaero-en.pdf.
Those allow entities to be described as trees, providing a finer-grained analysis for information extraction. They are more tedious to annotate, but probably relevant if you intend to use the annotation for semantic purposes, not only for indexing.
I am using Weka in Scala (although the syntax is virtually identical to Java). I am trying to evaluate my data with a SimpleKMeans clusterer, but the clusterer won't accept string data. I don't want to cluster on the string data; I just want to use it to label the points.
Here is the data I am using:
@relation Locations
@attribute ID string
@attribute Latitude numeric
@attribute Longitude numeric
@data
'Carnegie Mellon University', 40.443064, -79.944163
'Stanford University', 37.427539, -122.170169
'Massachusetts Institute of Technology', 42.358866, -71.093823
'University of California Berkeley', 37.872166, -122.259444
'University of Washington', 47.65601, -122.30934
'University of Illinois Urbana Champaign', 40.091022, -88.229992
'University of Southern California', 34.019372, -118.28611
'University of California San Diego', 32.881494, -117.243079
As you can see, it's essentially a collection of points on an x-y coordinate plane. Any patterns in the data are beside the point; this is simply an exercise in working with Weka.
Here is the code that is giving me trouble:
val instance = new Instances(new StringReader(wekaHeader + wekaData))
val simpleKMeans = new SimpleKMeans()
simpleKMeans.buildClusterer(instance)
val eval = new ClusterEvaluation()
eval.setClusterer(simpleKMeans)
eval.evaluateClusterer(new Instances(instance))
Logger.info(eval.clusterResultsToString)
I get the following error on simpleKMeans.buildClusterer(instance):
[UnsupportedAttributeTypeException: weka.clusterers.SimpleKMeans: Cannot handle string attributes!]
How do I get Weka to retain IDs while doing clustering?
Here are a couple of other steps I have taken to troubleshoot this:
I used the Weka Explorer and loaded this data as a CSV:
ID, Latitude, Longitude
'Carnegie Mellon University', 40.443064, -79.944163
'Stanford University', 37.427539, -122.170169
'Massachusetts Institute of Technology', 42.358866, -71.093823
'University of California Berkeley', 37.872166, -122.259444
'University of Washington', 47.65601, -122.30934
'University of Illinois Urbana Champaign', 40.091022, -88.229992
'University of Southern California', 34.019372, -118.28611
'University of California San Diego', 32.881494, -117.243079
This does what I want it to do in the Weka Explorer. Weka clusters the points and retains the ID column to identify each point. I would do this in my code, but I'm trying to do this without generating additional files. As you can see from the Weka Java API, Instances interprets a java.io.Reader only as an ARFF.
I have also tried the following code:
val instance = new Instances(new StringReader(wekaHeader + wekaData))
instance.deleteAttributeAt(0)
val simpleKMeans = new SimpleKMeans()
simpleKMeans.buildClusterer(instance)
val eval = new ClusterEvaluation()
eval.setClusterer(simpleKMeans)
eval.evaluateClusterer(new Instances(instance))
Logger.info(eval.clusterResultsToString)
This works in my code and displays results, which proves that Weka is working in general. But since I am deleting the ID attribute, I can't map the clustered points back to the original values.
I am answering my own question, and in doing so, there are two issues that I would like to address:
Why CSV works with string values
How to get cluster information from the cluster evaluation
As Sentry points out in the comments, the ID does in fact get converted to a nominal attribute when loaded from a CSV.
If the data must be in an ARFF format (like in my example where the Instances object is created from a StringReader), then the StringToNominal filter can be applied:
val instances = new Instances(new StringReader(wekaHeader + wekaData))
val filter = new StringToNominal()
filter.setAttributeRange("first")
filter.setInputFormat(instances)
val filteredInstance = Filter.useFilter(instances, filter)
val simpleKMeans = new SimpleKMeans()
simpleKMeans.buildClusterer(filteredInstance)
...
This allows for "string" values to be used in clustering, although it's really just treated as a nominal value. It doesn't impact the clustering (if the ID is unique), but it doesn't contribute to the evaluation as I had hoped, which brings me to the next issue.
I was hoping to be able to get a nice map of cluster and data, like cluster: Int -> Array[(ID, latitude, longitude)] or ID -> cluster: Int. However, the cluster results are not that convenient. In my experience these past few days, there are two approaches that can be used to find the cluster of each point of data.
To get the cluster assignments, simpleKMeans.getAssignments returns an array of integers that is the cluster assignments for each data element. The array of integers is in the same order as the original data items and has to be manually related back to the original data items. This can be easily accomplished in Scala by using the zip method on the original list of data items and then using other methods like groupBy or map to get the collection in your favorite format. Keep in mind that this method alone does not use the ID attribute at all, and the ID attribute could be omitted from the data points entirely.
However, you can also get the cluster centers with simpleKMeans.getClusterCentroids or eval.clusterResultsToString(). I have not used this very much, but it does seem to me that the ID attribute can be recovered from the cluster centers here. As far as I can tell, this is the only situation in which the ID data can be utilized or recovered from the cluster evaluation.
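The zip/groupBy idea can be sketched in plain Java, with the Weka results stubbed out as arrays. The assignments array here is hypothetical, standing in for what simpleKMeans.getAssignments() would return; the point is only how to relate it back to the IDs:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ClusterMapping {
    // Pair each ID with its cluster index, then group IDs by cluster.
    // assignments[i] is the cluster of the i-th instance, in original data order.
    public static Map<Integer, List<String>> byCluster(String[] ids, int[] assignments) {
        Map<Integer, List<String>> clusters = new HashMap<>();
        for (int i = 0; i < ids.length; i++) {
            clusters.computeIfAbsent(assignments[i], k -> new ArrayList<>()).add(ids[i]);
        }
        return clusters;
    }

    public static void main(String[] args) {
        String[] ids = {"Carnegie Mellon University", "Stanford University",
                        "University of Washington"};
        int[] assignments = {0, 1, 1};  // hypothetical output of getAssignments()
        System.out.println(byCluster(ids, assignments));
    }
}
```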
I got the same error when one of the lines in a CSV file with a couple of million rows contained a string value. Here is how I figured out which line it was.
The exception "Cannot handle string attributes!" doesn't give any clue about the line number. Hence:
1. I imported the CSV file into the Weka Explorer GUI and created a *.arff file.
2. I then manually changed the type from string to numeric at the beginning of the *.arff file, as shown below.
3. After that I tried to build the clusterer using the *.arff file.
4. I got the exact line number as part of the exception.
5. I removed that line from the *.arff file and loaded it again. It worked without any issue.
Converted string --> numeric in *.arff file
@attribute total numeric
@attribute avgDailyMB numeric
@attribute mccMncCount numeric
@attribute operatorCount numeric
@attribute authSuccessRate numeric
@attribute totalMonthlyRequets numeric
@attribute tokenCount numeric
@attribute osVersionCount numeric
@attribute totalAuthUserIds numeric
@attribute makeCount numeric
@attribute modelCount numeric
@attribute maxDailyRequests numeric
@attribute avgDailyRequests numeric
Error reported the exact line number
java.io.IOException: number expected, read Token[value.total], line 1750464
at weka.core.converters.ArffLoader$ArffReader.errorMessage(ArffLoader.java:354)
at weka.core.converters.ArffLoader$ArffReader.getInstanceFull(ArffLoader.java:728)
at weka.core.converters.ArffLoader$ArffReader.getInstance(ArffLoader.java:545)
at weka.core.converters.ArffLoader$ArffReader.readInstance(ArffLoader.java:514)
at weka.core.converters.ArffLoader$ArffReader.readInstance(ArffLoader.java:500)
at weka.core.Instances.<init>(Instances.java:138)
at com.lokendra.dissertation.ModelingUtils.kMeans(ModelingUtils.java:50)
at com.lokendra.dissertation.ModelingUtils.main(ModelingUtils.java:28)
I need to extract names (including uncommon ones) from blocks of text using Perl. I've looked into this module for extracting names, but it only covers the top 1000 most popular given names and surnames in the US dating back to 1990; I need something a bit more comprehensive.
I've considered using the Social Security Index to build a database for comparison, but this seems very tedious and processing-intensive. Is there another way to pull names from text in Perl?
Example of text to parse:
LADNIER Louis Anthony Ladnier, [Louie] age 48, of Mobile, Alabama died at home Friday, November 16, 2012. Louie was born January 9, 1964 in Mobile, Alabama. He was the son of John E. Ladnier, Sr. and Gloria Bosarge Ladnier. He was a graduate of McGill-Toolen High School and attended University of South Alabama. He was employed up until his medical retirement as Communications Supervisor with the Bayou La Batre Police Department. He is preceded in death by his father, John. Survived by his mother, Gloria, nephews, Dominic Ladnier and Christian Rubio, whom he loved and help raise as his own sons, sisters, Marj Ladnier and Morgan Gordy [Julian], and brother Eddie Ladnier [Cindy], and nephews, Jamie, Joey, Eddie, Will, Ben and nieces, Anna and Elisabeth. Memorial service will be held at St. Dominic's Catholic Church in Mobile on Wednesday at 1pm. Serenity Funeral Home is in charge of arrangements. In lieu of flowers, memorials may be sent to St. Dominic School, 4160 Burma Road Mobile, AL 36693, education fund for Christian Rubio and McGill-Toolen High School, 1501 Old Shell Road Mobile, AL 36604, education Fund for Dominic Ladnier. The family is grateful for all the prayers and support during this time. Louie was a rock and a joy to us all.
Use Stanford's NER (GPL). Demo:
http://nlp.stanford.edu:8080/ner/process
There is no surefire way to do this, due to the nature of the English language. You either need lists to (fuzzy-)compare against, or you will have to accept a significant accuracy penalty.
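A minimal sketch of the list-comparison idea, using edit distance for the fuzzy part. The name list here is made up for illustration; a real one would come from something like the Social Security index the asker mentions:

```java
import java.util.List;

public class NameMatcher {
    // Classic Levenshtein edit distance between two strings.
    static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
        return d[a.length()][b.length()];
    }

    // A token "looks like" a known name if it is within one edit of the list.
    static boolean fuzzyMatch(String token, List<String> knownNames) {
        for (String name : knownNames)
            if (editDistance(token.toLowerCase(), name.toLowerCase()) <= 1) return true;
        return false;
    }

    public static void main(String[] args) {
        List<String> names = List.of("Ladnier", "Louis", "Gloria");  // hypothetical list
        System.out.println(fuzzyMatch("Glori", names));  // one deletion from "Gloria"
    }
}
```

The edit-distance threshold is the accuracy trade-off mentioned above: a looser threshold catches more misspellings but also more false positives.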
The Apache Software Foundation has a few projects that cover entity extraction, with pre-trained models for English names (NameFinder). I would recommend OpenNLP or Stanbol. In the meantime, if you have just a few queries, I have an NLP demo I've implemented in C# in my apps section at http://www.augmentedintel.com/apps/csharpnlp/extract-names-from-text.aspx.
Best,
Don
You're trying to implement named-entity recognition. The bad news is that it's really hard.
You could try Lingua::EN::NamedEntity, however:
$ perl -MLingua::EN::NamedEntity -nE 'say $_ for map { $_->{class} eq "person" ? $_->{entity} : () } extract_entities($_)' names.txt
Louie
Louis Anthony Ladnier
Louie
John E
Bayou La Batre Police Department
Gloria
Julian
Cindy
Eddie Ladnier
Eddie
John
Catholic Church
Christian Rubio
Dominic Ladnier
Burma Road Mobile
Louie
You can also use Calais, a Reuters web service for natural language processing, which offers much better results.
I think you want to Google something like:
perl part of speech tagging
I want to store a long description in the SQLite database in my iPhone app, like the following data.
"The Golden Temple: The Golden Temple, popular as Sri Harmandir Sahib or Sri Darbar Sahib, is the sacred seat of Sikhism. Bathed in a quintessential golden hue that dazzles in the serene waters of the Amrit Sarovar that lace around it, the swarn mandir (Golden temple) is one that internalizes in the mindscape of its visitors, no matter what religion or creed, as one of the most magnificent House of Worship. On a jewel-studded platform is the Adi Grantha or the sacred scripture of Sikhs wherein are enshrined holy inscriptions by the ten Sikh gurus and various Hindu and Moslem saints. While visiting the Golden Temple you need to cover your head. Street sellers sell bandanas outside the temple at cheap prices."
I am using description VARCHAR(5000), but when I execute the query, the text is shown truncated with dots (...) like this: http://i.stack.imgur.com/gyMqi.png
Thanks
The ... almost certainly indicates that the full text is present in the database, and that "Sqlite database browser" simply truncates the displayed text past a certain length:
m_textWidthMarkSize = s.value("prefs/sqleditor/textWidthMarkSpinBox", 60).toInt();
Is there a way to change the settings?
Edit
You can verify that the text is fully saved with the following query (replace theTable with the correct table name):
select length(description) from theTable;
I am trying to figure out how to parse an address using T-SQL, and I suck at T-SQL. My challenge is this:
I have a table called Locations defined as follows:
- City [varchar(100)]
- State [char(2)]
- PostalCode [char(5)]
My UI has a text box in which a user can enter an address. This address could take essentially any form (yuck, I know). Unfortunately, I cannot change this UI either. Anyway, the value of the text box is passed into the stored procedure that is responsible for parsing the address. I need to take what the person enters and get the PostalCode from the Locations table associated with their input. For the life of me, I cannot figure out how to do this. There are so many cases. For instance, the user could enter any of the following:
Chicago, IL
Chicago, IL 60601
Chicago, IL, 60601
Chicago, IL 60601 USA
Chicago, IL, 60601 USA
Chicago IL 60601 USA
New York NY 10001 USA
New York, NY 10001, USA
You get the idea; there are a lot of cases. I can't find any parsers online either, so I must not be searching correctly. Can someone please point me to a parser online or explain how to do this? I'm willing to pay for a solution to this problem, but I can't find anything. I'm shocked.
Perhaps a CLR function might be a better choice than T-SQL. Check out http://msdn.microsoft.com/en-us/magazine/cc163473.aspx for an example of using regular expressions to parse some pretty complex string inputs into table-valued results. You can be as creative as you please with your regex matching, but the following regex should get you started:
(.*?)([A-Z]{2}),? (\d+)( USA)?$
If you're reluctant to use CLR functions, perhaps you have regex functionality in the calling system, like ASP.Net or PHP.
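To show the groups that regex produces, here is a quick sketch in Java rather than a CLR function, purely for illustration. It handles the space-separated and comma-then-space variants from the examples, but not forms without a zip code or with a comma before "USA":

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AddressParser {
    // Group 1: city (reluctant), group 2: two-letter state, group 3: zip.
    private static final Pattern P = Pattern.compile("(.*?)([A-Z]{2}),? (\\d+)( USA)?$");

    // Returns {city, state, zip}, or null when the pattern does not match.
    public static String[] parse(String input) {
        Matcher m = P.matcher(input);
        if (!m.find()) return null;  // e.g. "Chicago, IL" (no zip) does not match
        String city = m.group(1).replaceAll("[,\\s]+$", "");  // strip trailing comma/space
        return new String[] { city, m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        for (String s : new String[] {"Chicago, IL 60601", "New York NY 10001 USA"}) {
            String[] parts = parse(s);
            System.out.println(parts[0] + " | " + parts[1] + " | " + parts[2]);
        }
    }
}
```

Once the state and zip are split out, the lookup against the Locations table becomes a plain equality join on State and PostalCode.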