Google Cloud NL API term/classification quality and batch processing on Traditional Chinese (zh-Hant) data - chinese-locale

I've been testing Google Cloud NL API v1.0 recently, mainly on Traditional Chinese (a.k.a. zh-Hant) data. The quality is not satisfactory: classification is often wrong, there are far too many one-character terms (many of which should be stop words), and unknown-word recognition is the weakest part.
Also, some analysis methods (e.g. entity sentiment) don't support zh-Hant, so I can only run zh-Hant data with the language set to 'en', which is a pity.
Does anyone know whether the NL API provides any way (configuration, parameters, or some additional process) to improve the results?
Does anyone actually have experience using NL API output to add a value-added feature to a business product or service?
Also, if I want to feed it high-volume data, is there a library or SDK I can use to write batch-in, batch-out processing?
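For reference, this is roughly the batch-in, batch-out processing I have in mind, sketched with the official Python client (this assumes google-cloud-language is installed, credentials are configured, and that the method accepts "zh-Hant"; otherwise fall back to 'en'):

from concurrent.futures import ThreadPoolExecutor
from google.cloud import language_v1

client = language_v1.LanguageServiceClient()

def analyze(text):
    # Entity analysis on a single document; swap in whichever analysis you need.
    document = language_v1.Document(
        content=text,
        type_=language_v1.Document.Type.PLAIN_TEXT,
        language="zh-Hant",  # assumption: use "en" if this method rejects zh-Hant
    )
    return client.analyze_entities(request={"document": document})

texts = ["第一段文字", "第二段文字"]  # replace with your own corpus
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(analyze, texts))

for response in results:
    for entity in response.entities:
        print(entity.name, entity.salience)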

Related

Building an OpenEars-compatible language model

I am doing some development on speech-to-text and text-to-speech, and I found the OpenEars API very useful.
This CMU-SLM-based API uses a language model to map the speech picked up by the iPhone. So I decided to find a big English language model to feed the API's speech recognizer engine, but I couldn't work out the format of the VoxForge English data model or how to use it with OpenEars.
Does anyone know how I can get the .languagemodel and .dic files for English to work with OpenEars?
Regarding LM Formats:
AFAIK most language models use the ARPA standard. Sphinx / CMU language models are compiled into a binary format; you'd need the source (text) format to convert a Sphinx LM into another format. Most other language models are in text format.
I'd recommend using the HTK Speech Recognition Toolkit; detailed documentation is here: http://htk.eng.cam.ac.uk/ftp/software/htkbook_html.tar.gz
Here's also a description of CMU's SLM Toolkit: http://www.speech.cs.cmu.edu/SLM/toolkit_documentation.html
Here's an example of a language model in ARPA format I found on the net: http://www.arborius.net/~jphekman/sphinx/full/index.html
You probably want to create an ARPA LM first, then convert it into any binary format if needed.
In General:
To build a language model, you need lots and lots of training data, so you can determine the probability of any word in your vocabulary given the input observed up to that point.
You can't just "make" a language model by adding the words you want to recognize; you also need a lot of training data (i.e. typical input you would observe when running your speech recognition application).
A Language Model is not just a word list -- it estimates the probability of the next token (word) in the input.
To estimate those probabilities, you need to run a training process, which goes over the training data (e.g. historical data) and observes word frequencies there to estimate the probabilities mentioned above.
For your problem, maybe as a quick solution, just assume all words have the same frequency / probability (there's a small sketch after this list):
create a dictionary with the words you want to recognize (N words in dictionary)
create a language model which has 1/N as the probability for each word (uni-gram language model)
you can then interpolate that uni-gram language model (LM) with another LM for a bigger corpus using HTK Toolkit
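To make the uniform uni-gram idea concrete, here is a rough sketch that writes such a model in ARPA text format from a plain word list. The file names are hypothetical, and a real model would normally also include <s>, </s> and smoothing, so treat this only as a starting point:

import math

# Read a plain word list (one word per line); "words.txt" is a hypothetical name.
with open("words.txt", encoding="utf-8") as f:
    words = sorted({line.strip() for line in f if line.strip()})

log_prob = math.log10(1.0 / len(words))  # every word gets probability 1/N (log10)

# Write a uniform uni-gram LM in ARPA text format; "uniform.arpa" is hypothetical.
with open("uniform.arpa", "w", encoding="utf-8") as out:
    out.write("\\data\\\n")
    out.write("ngram 1={}\n\n".format(len(words)))
    out.write("\\1-grams:\n")
    for word in words:
        out.write("{:.4f} {}\n".format(log_prob, word))
    out.write("\n\\end\\\n")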
Old question, but maybe the answer is still interesting. OpenEars now has built-in language model generation, so one option is for you to create models dynamically in your app as you need them using the LanguageModelGenerator class, which uses the MITLM library and NSScanner to accomplish the same task as the CMU toolkit mentioned above. Processing a corpus with >5000 words on the iPhone is going to take a very long time, but you could always use the Simulator to run it once and get the output out of the documents folder and keep it.
Another option for large vocabulary recognition is explained here:
Creating ARPA language model file with 50,000 words
Having said that, I need to point out as the OpenEars developer that the CMU tool's limit of 5000 words corresponds pretty closely to the maximum vocabulary size that is likely to have decent accuracy and processing speed on the iPhone when using Pocketsphinx. So, the last suggestion would be to either reconceptualize your task so that it doesn't absolutely require large vocabulary recognition (for instance, since OpenEars allows you to switch models on the fly, you may find that you don't need one enormous model but can get by with multiple smaller ones that you can switch in for different contexts), or to use a network-based API that can do large vocabulary recognition on a server (or make your own API that uses Sphinx4 on your own server). Good luck!

machine learning and code generator from strings

The problem: Given a set of hand-categorized strings (or a set of ordered vectors of strings), generate a categorization function to categorize further input. In my case, that data (or most of it) is not natural language.
The question: are there any tools out there that will do that? I'm thinking of something reasonably polished, a download-install-and-go kind of thing, as opposed to a library or a brittle academic program.
(Please don't get stuck on details as the real details would restrict answers to less generally useful responses AND are under NDA.)
As an example of what I'm looking at: the input I want to filter is computer-generated status strings pulled from logs, e.g. error messages being filtered based on who needs to be informed or what action needs to be taken.
Doing Things Manually
If the error messages are being generated automatically and the list of exceptions behind the messages is not terribly large, you might just want to have a table that directly maps each error message type to the people who need to be notified.
This should make it easy to keep track of exactly who/which-groups will be getting what types of messages and to update the routing of messages should you decide that some of the messages are being misdirected.
Typically, a small fraction of the types of errors make up a large fraction of error reports. For example, Microsoft noticed that 80% of crashes were caused by 20% of the bugs in their software. So, to get something useful, you wouldn't even need to start with a complete table covering every type of error message. Instead, you could start with just a list that maps the most common errors to the right person and routes everything else to a person for manual routing. Each time an error is routed manually, you could then add an entry to the routing table so that errors of that type are handled automatically in the future.
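As a sketch of that table-driven approach (the error types and addresses below are invented for illustration):

# Direct routing table from error type to recipients, with a manual-routing fallback.
ROUTING_TABLE = {
    "DiskQuotaExceeded": ["ops-team@example.com"],
    "PaymentGatewayTimeout": ["billing-oncall@example.com"],
    "NullReferenceException": ["dev-team@example.com"],
}
MANUAL_TRIAGE = ["triage@example.com"]  # everything unknown goes here first

def route(error_type):
    # Unknown errors go to manual routing; once a human decides, add an entry
    # to ROUTING_TABLE so future occurrences of that type are handled automatically.
    return ROUTING_TABLE.get(error_type, MANUAL_TRIAGE)

print(route("DiskQuotaExceeded"))    # ['ops-team@example.com']
print(route("SomethingNewEntirely")) # ['triage@example.com']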
Document Classification
Unless the error messages are being editorialized by the people who submit them and you want to use this information when routing them, I wouldn't recommend treating this as a document classification task. However, if this is what you want to do, here's a list of reasonably good packages for document classification, organized by programming language:
Python - To do this using the Python-based Natural Language Toolkit (NLTK), see the Document Classification section in the freely available NLTK book (there's a small sketch after this list).
Ruby - If Ruby is more of your thing, you can use the Classifier gem. Here's sample code that detects whether Family Guy quotes are funny or not-funny.
C# - C# programmers can use nBayes. The project's home page has sample code for a simple spam/not-spam classifier.
Java - Java folks have Classifier4J, Weka, Lucene Mahout, and as adi92 mentioned Mallet.
Learning Rules with Weka - If rules are what you want, Weka might be of particular interest, since it includes a rule set based learner. You'll find a tutorial on using Weka for text categorization here.
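To make the NLTK option above a bit more concrete, here is a rough sketch of a bag-of-words Naive Bayes classifier routing log messages; the messages and categories are invented for illustration:

from nltk.classify import NaiveBayesClassifier

def features(message):
    # Simple bag-of-words features; real log lines may need smarter tokenization.
    return {token: True for token in message.lower().split()}

# Invented training examples: (message, who should handle it)
training_data = [
    ("disk quota exceeded on /var", "ops"),
    ("out of memory killing process", "ops"),
    ("payment gateway timeout", "billing"),
    ("invoice generation failed", "billing"),
    ("null reference in checkout page", "developers"),
    ("unhandled exception in OrderService", "developers"),
]

classifier = NaiveBayesClassifier.train(
    [(features(msg), label) for msg, label in training_data]
)
print(classifier.classify(features("disk is full on /home")))  # likely "ops"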
Mallet has a bunch of classifiers which you can train and deploy entirely from the commandline
Weka is nice too because it has a huge number of classifiers and preprocessors for you to play with
Have you tried spam or email filters? By using text files that have been marked with appropriate categories, you should be able to categorize further text input. That's what those programs do, anyway, but instead of labeling your outputs as 'spam' and 'not spam', you could use other categories.
You could also try something involving AdaBoost for a more hands-on approach to rolling your own. This library from Google looks promising, but probably doesn't meet your ready-to-deploy requirements.

How do I adapt my recommendation engine to cold starts?

I am curious what methods / approaches exist to overcome the "cold start" problem: when a new user or item enters the system, the lack of information about this new entity makes producing recommendations a problem.
I can think of doing some prediction-based recommendation (using gender, nationality and so on).
You can cold start a recommendation system.
There are two types of recommendation systems: collaborative filtering and content-based. Content-based systems use metadata about the things you are recommending; the question then is which metadata is important. The second approach, collaborative filtering, doesn't care about the metadata; it just uses what people did or said about an item to make a recommendation. With collaborative filtering you don't have to worry about which terms in the metadata are important; in fact you don't need any metadata at all to make the recommendation. The problem with collaborative filtering is that you need data. Before you have enough data you can use content-based recommendations. You can provide recommendations based on both methods: start at 100% content-based, then mix in collaborative filtering as you get more data.
That is the method I have used in the past.
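A sketch of what that gradual mix can look like in code; the linear weighting and the 50-interaction threshold are my own assumptions, not a standard formula:

def blended_score(content_score, collab_score, num_interactions, full_data_at=50):
    # Weight for collaborative filtering ramps from 0 to 1 as interaction data accumulates.
    w = min(num_interactions / full_data_at, 1.0)
    return (1 - w) * content_score + w * collab_score

# New item with no interactions: recommendation is 100% content-based.
print(blended_score(content_score=0.8, collab_score=0.0, num_interactions=0))    # 0.8
# Well-known item: collaborative filtering dominates.
print(blended_score(content_score=0.8, collab_score=0.6, num_interactions=200))  # 0.6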
Another common technique is to treat the content-based portion as a simple search problem. You just put in meta data as the text or body of your document then index your documents. You can do this with Lucene & Solr without writing any code.
If you want to know how basic collaborative filtering works, check out Chapter 2 of "Programming Collective Intelligence" by Toby Segaran.
Maybe there are times you just shouldn't make a recommendation? "Insufficient data" should qualify as one of those times.
I just don't see how prediction recommendations based on "gender, nationality and so on" will amount to more than stereotyping.
IIRC, places such as Amazon built up their databases for a while before rolling out recommendations. It's not the kind of thing you want to get wrong; there are lots of stories out there about inappropriate recommendations based on insufficient data.
I'm working on this problem myself, but this paper from Microsoft on Boltzmann machines looks worthwhile: http://research.microsoft.com/pubs/81783/gunawardana09__unified_approac_build_hybrid_recom_system.pdf
This has been asked several times before (naturally, I cannot find those questions now :/), but the general conclusion was that it's better to avoid such recommendations. In various parts of the world the same names belong to different sexes, and so on...
Recommendations based on "similar users liked..." clearly must wait. You can give out coupons or other incentives to survey respondents if you are absolutely committed to doing predictions based on user similarity.
There are two other ways to cold-start a recommendation engine.
Build a model yourself.
Get your suppliers to fill in key information to a skeleton model. (Also may require $ incentives.)
Lots of potential pitfalls in all of these, which are too common sense to mention.
As you might expect, there is no free lunch here. But think about it this way: recommendation engines are not a business plan. They merely enhance the business plan.
There are three things needed to address the Cold-Start Problem:
The data must have been profiled such that you have many different features (with product data the term used for 'feature' is often 'classification facets'). If you don't properly profile data as it comes in the door, your recommendation engine will stay 'cold' as it has nothing with which to classify recommendations.
MOST IMPORTANT: You need a user-feedback loop through which users can review the personalization engine's suggestions (there's a small sketch after this list). For example, a Yes/No button for 'Was This Suggestion Helpful?' should queue that example for review, moving it from one training dataset (the 'Recommend' dataset) to the other (the 'DO NOT Recommend' dataset).
The model used for (Recommend / DO NOT Recommend) suggestions should never be considered a one-size-fits-all recommendation. In addition to classifying the product or service to suggest to a customer, how the firm classifies each specific customer matters too. If things are functioning properly, customers with different features should get different (Recommend / DO NOT Recommend) suggestions in a given situation. That is the 'personalization' part of personalization engines.
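A toy sketch of that feedback loop; the dataset names and the feedback handler are made up for illustration:

# Two training datasets backing the (Recommend / DO NOT Recommend) model.
recommend_dataset = []
do_not_recommend_dataset = []

def record_feedback(customer_features, item_features, was_helpful):
    # A 'Was This Suggestion Helpful?' Yes/No answer routes the example to the
    # appropriate dataset; a real system would queue it for review and retrain periodically.
    example = {"customer": customer_features, "item": item_features}
    if was_helpful:
        recommend_dataset.append(example)
    else:
        do_not_recommend_dataset.append(example)

record_feedback({"segment": "small-business"}, {"category": "accounting"}, was_helpful=False)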

What is the Best Test Automation Approach for WatiN

I studied both data-driven and keyword-driven approaches. After reading, it seems data-driven is better than keyword-driven. For documentation purposes keyword-driven sounds great, but it has many levels. I need guidance from people who have actually implemented automation frameworks. Personally, I want to store all data in a database or Excel and break up the system into modular parts (functions that are common to major company products).
Currently using: WatiN, NUnit, CC.NET.
Any advice please?
I would highly recommend that you look into the stack that Michael Hunter (aka The Braidy Tester) built for testing Expression at Microsoft; he has a lot of articles about it: http://www.thebraidytester.com/stack.html
Essentially he splits things out into a logical model, a physical model and a data model, and all three are loosely coupled. All my stacks are written this way now. So the test cases end up looking like this:
Logical.Google.Search.Websearch("watin");
Verification.VerifySearchResult("watin");
All the test data is then stored in a SQL Express database that is indexed by the text string, in this case "watin".
You will need to build a full domain model and data access layer; I personally auto-generate those using SubSonic.

What makes a good spec? [closed]

One of the items in the Joel Test is that a project/company should have a specification.
I'm wondering what makes a spec good. Some companies will write volumes of useless specification that no one ever reads, others will not write anything down because "no one will read any of it anyway". So, what do you put into your spec? What is the good balance between the two extremes? Is there something particularly important that really, really (!) should always be recorded in a specification?
The best spec is one that:
Exists
Describes WHAT, not HOW (no solutions)
Can be interpreted in as few ways as possible
Is widely-distributed
Is agreed-upon as being THE spec by all parties involved
Is concise
Is consistent
Is updated regularly as requirements change
Describes as much of the problem as is possible and practical
Is testable
What to put in a spec
You need to look at the audience of the spec and work out what they need to know. Is it just a document between you and a business sponsor? In this case it can probably be fairly lightweight. If it's a functional spec for a 100+ man-year J2EE project it will probably need a bit more detail.
The audience
The key question is: who is going to read the spec? A spec will have several potential sets of stakeholders:
The business owner who is signing off the system.
The developer who is building the system (which may or may not be you).
QA people who have to write test plans for it.
Maintenance staff wanting to understand the system.
Developers or analysts on other projects who may want to integrate other systems into it.
Requirements of typical key stakeholders:
The business owner needs to have a clear idea of what the system workflows and business rules are so they can have a fighting chance of understanding what they have agreed to. If they are the only major audience of the spec, concentrate on the user interface, screen-screen workflow and business and data validation rules.
Developers need a data model, data validation rules, some or all of the user interface design and enough description of the expected system behaviour so they know what to build. If you are writing for developers concentrate on the user interface, mapping to data model and rules in the user interface. This should be more detailed than if you are doing the development yourself because you are acting as an intermediary in a communication between two third parties.
If you are specifying an interface between two systems, this has to be very precise.
QA staff need enough information to work out how to test and validate the logic, validation and expected user interface behaviour of the application. A spec intended for developers and QA staff needs to be fairly unambiguous.
Maintenance staff need much the same information as developers plus a system roadmap document describing the architecture.
Integrators need a data model and clear definitions of any interfaces.
Key components of a spec:
I'm assuming that one is writing specs for business apps, so the content below is geared to this. Specs for other types of systems will have different emphasis. In my experience the key elements of a functional spec are:
User Interface: screen mockups and a description of the interaction behaviour of the system and workflow between screens.
Data Model: Definition of the data items and mapping to the user interface. User interface mappings are normally done in the bits of the spec describing the user interface.
Data Validation and Business Rules: What checks for correctness need to be made on the data and what computations are being made, along with definitions. Examples can be quite useful here.
Definitions of interfaces: If you have interfaces exposed that other systems can use, you need to specify those pretty tightly. The simpler internet RFCs give quite good examples of protocol design and are quite a good starting point for interface documents. Clearly defining interfaces isn't easy, but it will almost certainly save you grief down the track.
Glue: this is where use cases, workflow diagrams and other requirements related artifacts help. Generally an exhaustive listing of these is pointless, but there will be key areas within the system where this type of documentation helps to put items in context. My experience is that selective inclusion of use cases and other requirements level descriptions does a lot to add clarity and meaning to a spec but writing up a user story for every single interaction with the system is a waste of time.
Joel (of 'on software' fame) wrote a good series of articles on this called Painless Functional Specifications, which I've referred people to on quite a few occasions. It's quite a good set of articles and well worth a read. In my opinion, your objective is to clearly explain what the system is supposed to do in a way that minimises ambiguity. It's quite useful to think of the spec as a reference document - what might the various stakeholders want to be able to easily look up.
Having written a glib set of bullet points about specs, I should add that the clear communication part is harder than it looks. Specs are actually non-trivial technical documents and are quite a test of one's technical writing and editorial skills. You are actually in the business of writing a document that describes what someone is supposed to build. Doing good specs is a bit of an art.
The pay-off for doing specs is that no-one else wants to do them. As you've written what is probably the only document of any importance for the system, you get to call the shots. Anyone else with an agenda has to either lobby you to change the spec or somehow impose a competing spec on the project. This is a good example of the pen being mightier than the sword.
EDIT: It has been my experience that debate about the distinction between 'how' and 'what' tends to be pretty self-serving. On any non-trivial project the data model and user interface will have multiple stakeholders, not all of whom are the system's developers. Working in data warehousing will give one a taste for the chaos that ensues when an application data model is allowed to become a free-for-all, and PFS should give one a feel for the potential set of stakeholders a spec has to cater to.
The fact that someone owns a data model or user interface design doesn't mean that these are just decided by fiat - there can be a discourse and negotiation process. However, as a project gets larger the value of ownership and consistency in these gets greater. It's been my observation in the past that the best way to appreciate the value of a good analyst is to see the damage done by a bad one.
In my experience a spec will have more chance of being read if it has the following:
Use diagrams where possible - pictures are worth 1000 words
Have a title page that clearly indicates what the spec is describing
Have a style that is used throughout the document. Make all headers the same font, size and style. Make the font the same all the way through, use the same bullet styles etc
DON'T WAFFLE - Be clear, concise and to the point, and don't add extra cruft to pad out your document. If a point can't be explained in a few lines of text, then maybe you need to break it down further
I have seen companies where the person writing the spec doesn't understand the system; it's almost a way of learning the system by writing the spec. This usually ends in tears...
As someone who develops bespoke software for clients, the best spec is the one which the customer has signed.
It doesn't matter how refined your spec is - if the customer hasn't explicitly agreed to it in writing, they'll change it and expect you to roll with their changes seamlessly, wrecking your beautiful architecture...
Good specs should contain requirements that are measurable and verifiable. When looking at each requirement, you should be able to easily answer the question, "How can I prove I have fulfilled this requirement?".
Read Joel's series of "Painless Functional Specifications" followups to the Joel Test article. They also appear in the "Joel on Software" book.
Depends on how big the project is and (like all architecture decisions) what the constraints are. A good start is:
a short description, a "one pager"
a context diagram: where are the boundaries, what interacts with the system?
use cases / user stories
a GUI prototype or paper prototype, if applicable
a description of the needed non-functional requirements (performance etc.)
Best of all is to have an acceptance test, ie, a testable statement of things that can be checked, along with an agreement that when those things are done, the project is complete.
It also helps if you start by stating the goal the user has, or the general idea of a certain function, rather than spelling out the exact implementation. To me, spelling out the implementation narrows the open-mindedness and rules out more creative (and more usable) solutions. So you should keep all options open.
Example
You're writing software to measure "X".
Instead of stating:
There has to be a start button and a save button.
Use:
The user has to be able to start a measurement and save it.
Why?
Because in the first case you have already determined what the solution has to be, while the second gives you flexibility in how to implement it. This may seem trivial, but I have the feeling "programmers" tend to think more in solutions than in problems (or situations). When you add more functionality this becomes more obvious: it might have been better to use a wizard or to automate the process, but you have already narrowed the ideas down to using buttons.
For functional requirements—or, more specifically, behavioral requirements—I like to use Cucumber and Gherkin.
Here’s an example of a simple and short specification for a new feature in a simple mapping application. The feature allows small businesses to sign up to the mapping platform and add their places of business on a Google Maps-like service.
Feature: Allow new businesses to appear on the map
Scenario Outline: Businesses should provide required data
Given a <business> at <location>
When <business> signs up to the map platform
Then it <should?> be added to the platform
And its name <should?> appear on the map at <location>
Examples: Business name and location should be required
| business | location | should? |
| UNNAMED BUSINESS | NOWHERE | shouldn't |
Examples: Allow only businesses with correct names
| business | location | should? |
| Back to Black | 8114 2nd Street, Stockton | should |
| UNNAMED BUSINESS | 8114 2nd Street, Stockton | shouldn't |
Examples: Allow businesses with two or more establishments
| business | location | should? |
| Deep Lemon | 6750 Street South, Reno | should |
| Deep Lemon | 289 Laurel Drive, Reno | should |
Examples: Allow only suitable locations
| business | location | should? |
| Anchor | 77 Chapel Road, Chicago | should |
| Anchor | Chicago River, Chicago | shouldn't |
| Anchor | NOWHERE | shouldn't |
This specification looks deceptively simple, but it is in fact quite powerful.
Good specifications are clear, unambiguous and concrete; they don't need to be deciphered in order to write working code. That's exactly what Gherkin specs are. They're best served short and simple. Instead of writing a long-winded specification document up front, you let the specification suite evolve along with your product by writing new specs in every iteration.
Gherkin is a business-readable language for writing specification documents based on the Given-When-Then template. The template can be automated into acceptance tests. Automating the specification ensures it stays up to date because the captured conversation is directly tied to testing code. This way, tests can be used as documentation, because Gherkin features have to change every time the code changes.
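The answer doesn't name a particular automation tool; as one hedged illustration, the Given/When/Then steps above could be wired to code with the Python behave library, roughly like this (the map_platform module and its functions are hypothetical):

from behave import given, when, then
import map_platform  # hypothetical application code under test

@given("a {business} at {location}")
def step_given_business(context, business, location):
    context.business = business
    context.location = location

@when("{business} signs up to the map platform")
def step_when_sign_up(context, business):
    context.result = map_platform.sign_up(business, context.location)

@then("it {should} be added to the platform")
def step_then_added(context, should):
    # "should" / "shouldn't" comes from the Examples tables above.
    assert context.result.added == (should == "should")

# The remaining "And its name ... appear on the map" step would be defined the same way.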
When each business rule is given an automated test, Gherkin specifications become so-called executable specifications—specifications that can be run as computer programs. The program tests whether the acceptance criteria were implemented correctly. So at the end of the day, we get a yes-or-no answer to the question of whether our product is actually doing what we expect it to do—which in itself is very valuable, as it contributes to making software of better quality.
The direct connection between Gherkin specifications and testing code often reduces wasted effort by creating and cultivating a system of living documentation. Thanks to frequent validation of tests, as in continuous integration systems, you can know that Given-When-Thens are still up to date, and when you trust your tests, you can use the corresponding Gherkin specifications as documentation for the entire system.
In fact, there’s an entire methodology called Specification by Example that uses tools like Gherkin. Specification by Example's practices reduce possibility for misunderstandings and rework by giving you a framework for talking with business stakeholders by forcing you to use concrete, discrete, unambiguous examples in your specification documents.
If you want to read more about Cucumber, Gherkin, BDD, and Specification by Example, I wrote a book on the subject. “Writing Great Specifications” explores the art of writing great scenarios and will help you make executable specifications a core part of your development process.
If you are interested in buying “Writing Great Specifications,” you can save 39% with the promo code 39nicieja2 :)
I think writing "use cases" should save you a bunch of pages.
+1 @KiwiBastard, and I would add: write it bullet-like and make each bullet testable.
A blueprint that describes all of the critical information necessary for the implementation, but doesn't waste any effort on describing all of the trivial or obvious information that is also necessary.
It should be just enough information to ensure that the implementation is "as expected", without providing additional noise that isn't necessary.
In practice, most people get this wrong, as they focus on the easy stuff (which is the least necessary) and shy away from the hard stuff (which is what you really really want to lock down). I've seen way too many 2 inch documents that completely and utterly miss the point, and very few 3 page ones that hit it dead on.
Specs don't have to be long, they just have to contain the right stuff!
(hint: if the programmer didn't look at that page while coding, it probably wasn't required)
Paul.