Tagging and Training NER dataset - tags

I have a data set and I want to tag it for Named Entity Recognition. My dataset is in Persian.
I want to know how should I tag expressions like :
*** آقای مهدی کاظمی = Mr Mehdi Kazemi / Mr will Smith. >>> (names with titles) should I tag all as a person or just the first name and last name should be tagged? (I mean should i also tag "Mr")
Mr >> b_per || Mr >> O
Mehdi >> i_per || Mehdi >> b_per
Kazemi >> i_per || Kazemi >> i_per
*** بیمارستان نور = Noor hospital >>> Should I tag the name only or the name and hospital both as Named Entity?
*** Eiffel tower / The Ministry of Defense (I mean the us DOD for example) >>> in Persian it is called :
وزارت دفاع (vezarate defa)
should I only tag Defense ? or all together?
There are many more examples for schools, movies, cities, countries and.... since we use the entity class before the named entity.
I would appreciate if you can help me with tagging this dataset.

I'll give you some examples from the CoNLL 2003 training data:
"Mr." is not tagged as part of the person, so titles are ignored.
"Columbia Presbyterian Hospital" is tagged as (LOC, LOC, LOC)
"a New York hospital" (O, LOC, LOC, O)
"Ministry of Commerce" is (ORG, ORG, ORG)
I think "Eiffel Tower" should be (LOC, LOC)

In general, you tag as the way you want the output to look. It's up to you if you want titles included, for example. However, Core NLP won't tag overlapping entities, so you have to make a decision in for cases like the hospital named after someone.

I believe you are heading to Stanford NLP and BIO format. But in case you'd also consider other options, you may have a look a structured entities such as: http://www.afcp-parole.org/etape/docs/etape-06022012-quaero-en.pdf.
Those allow to describe entities as trees, providing a finer analysis for information extraction. More tedious to annotate but probably relevant if you intend to use annotation for semantic purposes, not only indexing.

Related

Using context variables to write the user's car registration

In watson-conversation, I have reached a point where I ask the user their car number (registration), which follows this format: 0000BBB (4 numbers, 3 letters).
I want to type that to the user like:
User: "My car id is 0123asd"
Watson: "Okay so your car id is this one: 0123asd"!
I have tried defining an entity #carId with some examples, but every time I input something with that format (0000BBB), it shows "irrelevant".
If the chat detects #carId, respond with Okay so this is your carId! #exampleCarId (I have some examples like 5487qwe, or 8521rty, I thought the machine learning below that would learn the "pattern").
And my #carId has these examples:
I know I am missing something! Do I need to code anything? I think it's not necessary. I tried to save it on a sys-number but it does not work as it is not a "number".
Most developers would consider he car ID an entity (denoted by #). IBM Watson Assistant allows to define so-called dictionary-based entities. One form of such an entity is pattern-based. Thus, you would define a pattern of 4 digits and 3 letters.
You could have an intent (denoted by #) that identifies that a user inputs the car ID. In the dialog node you could match against the intent and then assign the matched entity for the actual ID to a variable.
The linked documentation has examples.

Layout for importing .csv intent file for IBM watson chatbot

I am in the process of creating a chatbot and need to import a .csv file to act as the intents. My question is regarding the layout; I currently have a list of cities however when importing should my intent name be in row A next to the city name or in row B seperated with a "," before the city ?
Thanks.
Intents should be questions where there is a clear intent of what is being asked.
Saying "Dubai" doesn't mean anything except as a word meaning. Using city names for training in intents will not give you great results.
File format for Intents is listed here. It is question,intent format.
Example:
Tell me the current weather conditions.,weather_conditions
Is it raining?,weather_conditions
What's the temperature?,weather_conditions
Where is your nearest location?,find_location
Do you have a store in Raleigh?,find_location
Entites are non standard format of entitiy,value, synomns,...
Example:
weekday,Monday,Mon
weekday,Tuesday,Tue,Tues
weekday,Wednesday,Wed
weekday,Thursday,Thur,Thu,Thurs
weekday,Friday,Fri
weekday,Saturday,Sat
weekday,Sunday,Sun
More details for entities here.

Gremlin - how do you merge vertices to combine their properties without listing the properties explicitly?

Background: I'm trying to implement a time-series versioned DB using this approach, using gremlin (tinkerpop v3).
I want to get the latest state node (in red) for a given identity node (in blue) (linked by a 'state' edge which contains a timestamp range), but I want to return a single aggregated object which contains the id (cid) from the identity node and all the properties from the state node, but I don't want to have to list them explicitly.
(8640000000000000 is my way of indicating no 'to' date - i.e. the edge is current - slightly different from the image shown).
I've got this far:
:> g.V().hasLabel('product').
as('cid').
outE('state').
has('to', 8640000000000000).
inV().
as('name').
as('price').
select('cid', 'name','price').
by('cid').
by('name').
by('price')
=>{cid=1, name="Cheese", price=2.50}
=>{cid=2, name="Ham", price=5.00}
but as you can see I have to list out the properties of the 'state' node - in the example above the name and price properties of a product. But this will apply to any domain object so I don't want to have to list the properties all the time. I could run a query before this to get the properties but I don't think I should need to run 2 queries, and have the overhead of 2 round trips. I've looked at 'aggregate', 'union', 'fold' etc but nothing seems to do this.
Any ideas?
===================
Edit:
Based on Daniel's answer (which doesn't quite do what I want ATM) I'm going to use his example graph. In the 'modernGraph' people-create->software. If I run:
> g.V().hasLabel('person').valueMap()
==>[name:[marko], age:[29]]
==>[name:[vadas], age:[27]]
==>[name:[josh], age:[32]]
==>[name:[peter], age:[35]]
then the results are a list of entities's with the properties. What I want is, on the assumption that a person can only create one piece of software ever (although hopefully we will see how this could be opened up later for lists of software created), to include the created software 'language' property into the returned entity to get:
> <run some query here>
==>[name:[marko], age:[29], lang:[java]]
==>[name:[vadas], age:[27], lang:[java]]
==>[name:[josh], age:[32], lang:[java]]
==>[name:[peter], age:[35], lang:[java]]
At the moment the best suggestion so far comes up with the following:
> g.V().hasLabel('person').union(identity(), out("created")).valueMap().unfold().group().by {it.getKey()}.by {it.getValue()}
==>[name:[marko, lop, lop, lop, vadas, josh, ripple, peter], lang:[java, java, java, java], age:[29, 27, 32, 35]]
I hope that's clearer. If not please let me know.
Since you didn't provide I sample graph, I'll use TinkerPop's toy graph to show how it's done.
Assume you want to merge marko and lop:
gremlin> g = TinkerFactory.createModern().traversal()
==>graphtraversalsource[tinkergraph[vertices:6 edges:6], standard]
gremlin> g.V(1).valueMap()
==>[name:[marko],age:[29]]
gremlin> g.V(1).out("created").valueMap()
==>[name:[lop],lang:[java]]
Note, that there are two name properties and in theory you won't be able to predict which name makes it into your merged result; however that doesn't seem to be an issue in your graph.
Get the properties for both vertices:
gremlin> g.V(1).union(identity(), out("created")).valueMap()
==>[name:[marko],age:[29]]
==>[name:[lop],lang:[java]]
Merge them:
gremlin> g.V(1).union(identity(), out("created")).valueMap().
unfold().group().by(select(keys)).by(select(values))
==>[name:[lop],lang:[java],age:[29]]
UPDATE
Thank you for the added sample output. That makes it a lot easier to come up with a solution (although I think your output contains errors; vadas didn't create anything).
gremlin> g.V().hasLabel("person").
filter(outE("created")).map(
union(valueMap(),
outE("created").limit(1).inV().valueMap("lang")).
unfold().group().by {it.getKey()}.by {it.getValue()})
==>[name:[marko], lang:[java], age:[29]]
==>[name:[josh], lang:[java], age:[32]]
==>[name:[peter], lang:[java], age:[35]]
Merging edge and vertex properties using gremlin java DSL:
g.V().has('User', 'id', userDbId).outE(Edges.TWEETS)
.union(__.identity().valueMap(), __.inV().valueMap())
.unfold().group().by(__.select(Column.keys)).by(__.select(Column.values))
.map(v -> converter.toTweet((Map) v.get())).toList();
Thanks for the answer by Daniel Kuppitz and youhans it has given me a basic idea on the solution of the issue. But later I found out that the solution is not working for multiple rows. It is required to have local step for handling multiple rows. The modified gremlin query will look like:
g.V()
.local(
__.union(__.valueMap(), __.outE().inV().valueMap())
.unfold().group().by(__.select(Column.keys)).by(__.select(Column.values))
)
This will limit the scope of union and group by to a single row.
If you can work with custom DSL ,create custom DSL with java like this one.
public default GraphTraversal<S, LinkedHashMap> unpackMaps(){
GraphTraversal<S, LinkedHashMap> it = map(x -> {
LinkedHashMap mapSource = (LinkedHashMap) x.get();
LinkedHashMap mapDest = new LinkedHashMap();
mapSource.keySet().stream().forEach(key->{
Object obj = mapSource.get(key);
if (obj instanceof LinkedHashMap) {
LinkedHashMap childMap = (LinkedHashMap) obj;
childMap.keySet().iterator().forEachRemaining( key_child ->
mapDest.put(key_child,childMap.get(key_child)
));
} else
mapDest.put(key,obj);
});
return mapDest;
});
return it;
}
and use it freely like
g.V().as("s")
.valueMap().as("value_map_0")
.select("s").outE("INFO1").inV().valueMap().as("value_map_1")
.select("s").outE("INFO2").inV().valueMap().as("value_map_2")
.select("s").outE("INFO3").inV().valueMap().as("value_map_3")
.select("s").local(__.outE("INFO1").count()).as("value_1")
.select("s").outE("INFO1").inV().value("name").as("value_2")
.project("val_map1","val_map2","val_map3","val1","val2")
.by(__.select("value_map_1"))
.by(__.select("value_map_2"))
.by(__.select("value_1"))
.by(__.select("value_2"))
.unpackMaps()
results to rows with
map1_val1, map1_val2,.... ,map2_va1, map2_val2....,value1, value2
This can handle mix of values and valueMaps in a natural gremlin way.

Text classification using Weka

I'm a beginner to Weka and I'm trying to use it for text classification. I have seen how to StringToWordVector filter for classification. My question is, is there any way to add more features to the text I'm classifying? For example, if I wanted to add POS tags and named entity tags to the text, how would I use these features in a classifier?
It depends of the format of your dataset and the preprocessing steps you perform. For instance, let us suppose that you have pre-POS-tagged your texts, looking like:
The_det dog_n barks_v ._p
So you can build an specific tokenizer (see weka.core.tokenizers) to generate two tokens per word, one would be "The" and the other one would be "The_det" so you keep the tag information.
If you want only tagged words, then you can just ensure that "_" is not a delimiter in the weka.core.tokenizers.WordTokenizer.
My advice is to have both the words and tagged words, so a simpler way would be to write an script that joins the texts and the tagged texts. From a file containing "The dog barks" and another one cointaining "The_det dog_n barks_v ._p", it would generate a file with "The The_det dog dog_n barks barks_v . ._p". You may even forget about the order unless you are going to make use of n-grams.

RESTful URL design for search

I'm looking for a reasonable way to represent searches as a RESTful URLs.
The setup: I have two models, Cars and Garages, where Cars can be in Garages. So my urls look like:
/car/xxxx
xxx == car id
returns car with given id
/garage/yyy
yyy = garage id
returns garage with given id
A Car can exist on its own (hence the /car), or it can exist in a garage. What's the right way to represent, say, all the cars in a given garage? Something like:
/garage/yyy/cars ?
How about the union of cars in garage yyy and zzz?
What's the right way to represent a search for cars with certain attributes? Say: show me all blue sedans with 4 doors :
/car/search?color=blue&type=sedan&doors=4
or should it be /cars instead?
The use of "search" seems inappropriate there - what's a better way / term? Should it just be:
/cars/?color=blue&type=sedan&doors=4
Should the search parameters be part of the PATHINFO or QUERYSTRING?
In short, I'm looking for guidance for cross-model REST url design, and for search.
[Update] I like Justin's answer, but he doesn't cover the multi-field search case:
/cars/color:blue/type:sedan/doors:4
or something like that. How do we go from
/cars/color/blue
to the multiple field case?
For the searching, use querystrings. This is perfectly RESTful:
/cars?color=blue&type=sedan&doors=4
An advantage to regular querystrings is that they are standard and widely understood and that they can be generated from form-get.
The RESTful pretty URL design is about displaying a resource based on a structure (directory-like structure, date: articles/2005/5/13, object and it's attributes,..), the slash / indicates hierarchical structure, use the -id instead.
Hierarchical structure
I would personaly prefer:
/garage-id/cars/car-id
/cars/car-id #for cars not in garages
If a user removes the /car-id part, it brings the cars preview - intuitive. User exactly knows where in the tree he is, what is he looking at. He knows from the first look, that garages and cars are in relation. /car-id also denotes that it belongs together unlike /car/id.
Searching
The searchquery is OK as it is, there is only your preference, what should be taken into account. The funny part comes when joining searches (see below).
/cars?color=blue;type=sedan #most prefered by me
/cars;color-blue+doors-4+type-sedan #looks good when using car-id
/cars?color=blue&doors=4&type=sedan #also possible, but & blends in with text
Or basically anything what isn't a slash as explained above.
The formula: /cars[?;]color[=-:]blue[,;+&], though I wouldn't use the & sign as it is unrecognizable from the text at first glance if that's your thing.
** Did you know that passing JSON object in URI is RESTful? **
Lists of options
/cars?color=black,blue,red;doors=3,5;type=sedan #most prefered by me
/cars?color:black:blue:red;doors:3:5;type:sedan
/cars?color(black,blue,red);doors(3,5);type(sedan) #does not look bad at all
/cars?color:(black,blue,red);doors:(3,5);type:sedan #little difference
possible features?
Negate search strings (!)
To search any cars, but not black and red:
?color=!black,!red
color:(!black,!red)
Joined searches
Search red or blue or black cars with 3 doors in garages id 1..20 or 101..103 or 999 but not 5
/garage[id=1-20,101-103,999,!5]/cars[color=red,blue,black;doors=3]
You can then construct more complex search queries. (Look at CSS3 attribute matching for the idea of matching substrings. E.g. searching users containing "bar" user*=bar.)
Conclusion
Anyway, this might be the most important part for you, because you can do it however you like after all, just keep in mind that RESTful URI represents a structure which is easily understood e.g. directory-like /directory/file, /collection/node/item, dates /articles/{year}/{month}/{day}.. And when you omit any of last segments, you immediately know what you get.
So.., all these characters are allowed unencoded:
unreserved: a-zA-Z0-9_.-~
Typically allowed both encoded and not, both uses are then equivalent.
special characters: $-_.+!*'(),
reserved: ;/?:#=&
May be used unencoded for the purpose they represent, otherwise they must be encoded.
unsafe: <>"#%{}|^~[]`
Why unsafe and why should rather be encoded: RFC 1738 see 2.2
Also see RFC 1738#page-20 for more character classes.
RFC 3986 see 2.2
Despite of what I previously said, here is a common distinction of delimeters, meaning that some "are" more important than others.
generic delimeters: :/?#[]#
sub-delimeters: !$&'()*+,;=
More reading:
Hierarchy: see 2.3, see 1.2.3
url path parameter syntax
CSS3 attribute matching
IBM: RESTful Web services - The basics
Note: RFC 1738 was updated by RFC 3986
Although having the parameters in the path has some advantages, there are, IMO, some outweighing factors.
Not all characters needed for a search query are permitted in a URL. Most punctuation and Unicode characters would need to be URL encoded as a query string parameter. I'm wrestling with the same problem. I would like to use XPath in the URL, but not all XPath syntax is compatible with a URI path. So for simple paths, /cars/doors/driver/lock/combination would be appropriate to locate the 'combination' element in the driver's door XML document. But /car/doors[id='driver' and lock/combination='1234'] is not so friendly.
There is a difference between filtering a resource based on one of its attributes and specifying a resource.
For example, since
/cars/colors returns a list of all colors for all cars (the resource returned is a collection of color objects)
/cars/colors/red,blue,green would return a list of color objects that are red, blue or green, not a collection of cars.
To return cars, the path would be
/cars?color=red,blue,green or /cars/search?color=red,blue,green
Parameters in the path are more difficult to read because name/value pairs are not isolated from the rest of the path, which is not name/value pairs.
One last comment. I prefer /garages/yyy/cars (always plural) to /garage/yyy/cars (perhaps it was a typo in the original answer) because it avoids changing the path between singular and plural. For words with an added 's', the change is not so bad, but changing /person/yyy/friends to /people/yyy seems cumbersome.
To expand on Peter's answer - you could make Search a first-class resource:
POST /searches # create a new search
GET /searches # list all searches (admin)
GET /searches/{id} # show the results of a previously-run search
DELETE /searches/{id} # delete a search (admin)
The Search resource would have fields for color, make model, garaged status, etc and could be specified in XML, JSON, or any other format. Like the Car and Garage resource, you could restrict access to Searches based on authentication. Users who frequently run the same Searches can store them in their profiles so that they don't need to be re-created. The URLs will be short enough that in many cases they can be easily traded via email. These stored Searches can be the basis of custom RSS feeds, and so on.
There are many possibilities for using Searches when you think of them as resources.
The idea is explained in more detail in this Railscast.
Justin's answer is probably the way to go, although in some applications it might make sense to consider a particular search as a resource in its own right, such as if you want to support named saved searches:
/search/{searchQuery}
or
/search/{savedSearchName}
I use two approaches to implement searches.
1) Simplest case, to query associated elements, and for navigation.
/cars?q.garage.id.eq=1
This means, query cars that have garage ID equal to 1.
It is also possible to create more complex searches:
/cars?q.garage.street.eq=FirstStreet&q.color.ne=red&offset=300&max=100
Cars in all garages in FirstStreet that are not red (3rd page, 100 elements per page).
2) Complex queries are considered as regular resources that are created and can be recovered.
POST /searches => Create
GET /searches/1 => Recover search
GET /searches/1?offset=300&max=100 => pagination in search
The POST body for search creation is as follows:
{
"$class":"test.Car",
"$q":{
"$eq" : { "color" : "red" },
"garage" : {
"$ne" : { "street" : "FirstStreet" }
}
}
}
It is based in Grails (criteria DSL): http://grails.org/doc/2.4.3/ref/Domain%20Classes/createCriteria.html
This is not REST. You cannot define URIs for resources inside your API. Resource navigation must be hypertext-driven. It's fine if you want pretty URIs and heavy amounts of coupling, but just do not call it REST, because it directly violates the constraints of RESTful architecture.
See this article by the inventor of REST.
In addition i would also suggest:
/cars/search/all{?color,model,year}
/cars/search/by-parameters{?color,model,year}
/cars/search/by-vendor{?vendor}
Here, Search is considered as a child resource of Cars resource.
There are a lot of good options for your case here. Still you should considering using the POST body.
The query string is perfect for your example, but if you have something more complicated, e.g. an arbitrary long list of items or boolean conditionals, you might want to define the post as a document, that the client sends over POST.
This allows a more flexible description of the search, as well as avoids the Server URL length limit.
RESTful does not recommend using verbs in URL's /cars/search is not restful. The right way to filter/search/paginate your API's is through Query Parameters. However there might be cases when you have to break the norm. For example, if you are searching across multiple resources, then you have to use something like /search?q=query
You can go through http://saipraveenblog.wordpress.com/2014/09/29/rest-api-best-practices/ to understand the best practices for designing RESTful API's
Though I like Justin's response, I feel it more accurately represents a filter rather than a search. What if I want to know about cars with names that start with cam?
The way I see it, you could build it into the way you handle specific resources:
/cars/cam*
Or, you could simply add it into the filter:
/cars/doors/4/name/cam*/colors/red,blue,green
Personally, I prefer the latter, however I am by no means an expert on REST (having first heard of it only 2 or so weeks ago...)
My advice would be this:
/garages
Returns list of garages (think JSON array here)
/garages/yyy
Returns specific garage
/garage/yyy/cars
Returns list of cars in garage
/garages/cars
Returns list of all cars in all garages (may not be practical of course)
/cars
Returns list of all cars
/cars/xxx
Returns specific car
/cars/colors
Returns lists of all posible colors for cars
/cars/colors/red,blue,green
Returns list of cars of the specific colors (yes commas are allowed :) )
Edit:
/cars/colors/red,blue,green/doors/2
Returns list of all red,blue, and green cars with 2 doors.
/cars/type/hatchback,coupe/colors/red,blue,green/
Same idea as the above but a lil more intuitive.
/cars/colors/red,blue,green/doors/two-door,four-door
All cars that are red, blue, green and have either two or four doors.
Hopefully that gives you the idea. Essentially your Rest API should be easily discoverable and should enable you to browse through your data. Another advantage with using URLs and not query strings is that you are able to take advantage of the native caching mechanisms that exist on the web server for HTTP traffic.
Here's a link to a page describing the evils of query strings in REST: http://web.archive.org/web/20070815111413/http://rest.blueoxen.net/cgi-bin/wiki.pl?QueryStringsConsideredHarmful
I used Google's cache because the normal page wasn't working for me here's that link as well:
http://rest.blueoxen.net/cgi-bin/wiki.pl?QueryStringsConsideredHarmful