How to find a relation between two columns of a CSV file (containing labels and related data) using doc2vec?

I am working on a problem related to doc2vec where I need to find labels that are related to a particular word. For example (CSV file):
Data | Label / Tags
In a future world devastated by disease, a convict is sent back in time to gather information about the man-made virus that wiped out most of the human population on the planet. | sci-fi
"You have slipped under my skin, invaded my blood and seized my heart. That sounds more like a poison than a person," was all I could say. His confession had both shocked and thrilled me. | action
Plenty of data like this is available for training the model. Now I want results like this: when I enter a particular word such as "virus", the model gives me the corresponding labels (sci-fi) wherever the word is used, and also gives me the labels (action) of texts where the word "virus" itself is not present but semantically related words (like poison, poisonous) are. The semantically related words can easily be fetched from the model; I just want to list the labels.
I want to know whether something other than plain keyword search could be applied here. Is there a particular method that could help me solve this problem?
Thanks
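For what it's worth, here is a minimal sketch of the kind of setup described above, using gensim's Doc2Vec with the labels as document tags. The file name, column names, and hyperparameters are placeholders, not anything from the question:

```python
# Minimal sketch: train Doc2Vec with labels as tags, then rank labels
# against a query word. "data.csv", "data", and "label" are assumptions.
import pandas as pd
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess

df = pd.read_csv("data.csv")  # hypothetical file with "data" and "label" columns

# Tag each text with its label so repeated labels share one tag vector.
corpus = [
    TaggedDocument(simple_preprocess(row.data), [row.label])
    for row in df.itertuples()
]

# dm=1 (the default PV-DM mode) trains word vectors and tag vectors
# jointly, so they end up in the same vector space.
model = Doc2Vec(corpus, vector_size=100, min_count=2, epochs=40)

# Rank labels by similarity of their tag vectors to the query word's vector.
word_vec = model.wv["virus"]
print(model.dv.most_similar([word_vec], topn=5))
```

Because words and tags share one space here, a query for "virus" should surface labels of texts containing semantically related words (poison, poisonous) as well, which is the behaviour asked for.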

Related

Classification of gender for given names

After some research, I have not yet been able to find a suitable open-source library or software package to classify a long table of first names I have by most likely gender.
For my application I have a set of first names from many different countries, and many of them are also pretty exotic.
For example, when I tried to use Genderize I could only get 1/8 of the names classified, while the remaining ones are labeled as Unknown (I made sure that the format is correct, no lower/upper-case ambiguity, etc.).
Any advice would be appreciated. Thank you in advance!
For the record, the best I could find was really just to do it manually, looking up names on Google or dedicated websites such as https://namepedia.org. I am afraid there is no automated solution for my use case, mostly for the following reasons:
Many names are somewhat archaic (I could not even recognise several names of my own nationality)
Many names were truncated to form nicknames or had two nearby letters swapped: here a lookup-table approach would fail, and one would rather need a score from a model (see the sketch below)
Several names were not based on the Roman alphabet, and the mapping into Roman characters produced some ambiguities, I guess
For those curious about the original dataset: it is part of a Kaggle challenge (Spaceship Titanic, https://www.kaggle.com/competitions/spaceship-titanic).
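To illustrate the "score from a model" point, a hedged sketch of a character n-gram classifier; the labelled training file (names.csv with name and gender columns) is hypothetical and would have to come from a curated list:

```python
# Sketch of a scoring model for gender-from-name. Character n-grams are
# fairly robust to truncated nicknames and swapped letters, unlike a LUT.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

df = pd.read_csv("names.csv")  # hypothetical labelled data: name, gender

clf = make_pipeline(
    # char_wb n-grams include word-boundary context, useful for short names
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
clf.fit(df["name"].str.lower(), df["gender"])

# predict_proba gives a score you can threshold, rather than a hard lookup
print(clf.predict_proba(["alexndra"]))  # misspelled, but still scorable
```

This only helps if you can assemble labelled training data; it does not solve the archaic-name or transliteration problems above.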

How can I perform automated tests against MS Word documents using PowerShell?

We regularly need to perform a handful of relatively simple tests against a bunch of MS Word documents. As these checks are currently done manually, I am looking for a way to automate them. For example:
Check if every page actually has a page number and verify that it is correct.
Verify that a version identifier in the page header is identical across all pages.
Check if the document has a table of contents.
Check if the document has a table of figures.
Check if every figure has a caption.
et cetera. Is this reasonably feasible using PowerShell in conjunction with a Word API?
PowerShell can access Word via its object model/Interop (on Windows, at any rate) and, AIUI, can also work with the Office Open XML (OOXML) API, so you should be able to write any checks you want on the document content. What is slightly less obvious is how you verify that the document content will result in a particular "printed appearance". I'll start with some comments on the details.
Bear in mind that in the following notes I'm just pointing out a few things that you might have to deal with. If you're examining documents produced by an organisation where people are already, broadly speaking, following the same standards, it may be easier.
Of the five examples you give, I couldn't say exactly how you would do them without checking the details, and there could be difficulties with all of them, but for example:
Check if every page actually has a page number and verify that it is correct.
Difficult using either OOXML or the object model, because what you would really be checking is that the header for a particular section has a visible { PAGE } field code. Because that field code might be nested inside other fields that say, in effect, "if such-and-such, don't display this field code", it's not so easy to be sure that a page number would actually be displayed.
Which is what I mean by checking the document's "printed appearance": if, for example, you can use the object model to print to PDF and have some mechanism that lets PowerShell inspect the PDF's content, that might be a better approach.
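As a rough illustration of that idea (Python here for brevity; the equivalent PDF inspection could be scripted from PowerShell too), a naive page-number check over an exported PDF using the pypdf library; the file name is a placeholder:

```python
# Naive "printed appearance" check: after exporting the document to PDF,
# verify that each page's extracted text contains its expected page number.
from pypdf import PdfReader

reader = PdfReader("report.pdf")  # hypothetical export of the Word document
for i, page in enumerate(reader.pages, start=1):
    text = page.extract_text() or ""
    # crude: the printed page number should appear as a standalone token
    if str(i) not in text.split():
        print(f"page {i}: expected page number not found")
```

This is deliberately crude (the number could appear in body text, or numbering could restart per section), but it checks what is actually rendered rather than what the field codes promise.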
Verify that a version identifier in the page header is identical across all pages.
Similar problem to the above, IMO. It depends partly on how the version identifier might be inserted. Is it just a piece of text? Could it be constructed from a number of fields? Might it reference Document Properties or Variables, or Custom XML content?
Check if the document has a table of contents.
Perhaps enough to look for a TOC field that does not have certain options, such as a \c option that a Table of Figures would contain.
Check if the document has a table of figures.
Perhaps enough to check for a TOC field that does have a \c option, possibly with a specific parameter such as "Figure".
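A minimal sketch of those two field-code checks, reading the OOXML part directly (Python for brevity; the same XPath-style queries can be run from PowerShell). The file name is a placeholder, and note that complex fields split their instruction text across several runs, so treat this as a starting point only:

```python
# Look for TOC field instructions in word/document.xml of a .docx:
# a plain TOC field (no \c switch) suggests a table of contents,
# a TOC field with \c suggests a table of figures.
import zipfile
from lxml import etree

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

with zipfile.ZipFile("report.docx") as z:
    root = etree.fromstring(z.read("word/document.xml"))

# Field instructions appear either as w:fldSimple/@w:instr
# or as the text of w:instrText runs (complex fields).
instrs = [el.get(W + "instr", "") for el in root.iter(W + "fldSimple")]
instrs += [el.text or "" for el in root.iter(W + "instrText")]

toc_fields = [i for i in instrs if i.strip().startswith("TOC")]
has_toc = any("\\c" not in f for f in toc_fields)  # table of contents
has_tof = any("\\c" in f for f in toc_fields)      # table of figures
print(f"ToC: {has_toc}, ToF: {has_tof}")
```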
Check if every figure has a caption.
Not sure that you can tell whether a particular image is "a Figure". But if you mean "verify that every graphic object has a caption", you could probably iterate through the inline and floating graphics in the document and verify that there is something that looks like a standard Word caption paragraph within a certain distance of each object. Word has two standard field code patterns for captions AFAIK (one where the chapter number is included and one where it isn't), so you could look for those. You could measure the distance between image and caption by ensuring that they are no more than a predefined number of paragraphs apart or, in the case of a floating image, that the paragraph anchoring the image is no more than so many paragraphs away from the caption.
A couple of more general problems that you might have to deal with:
- just because a document contains a certain feature, such as a TOC field, does not mean that the feature is visible. A TOC field might have been formatted as hidden text; even harder to detect, it could have been formatted as colored white.
- change tracking. You might have to use the Word object model to "accept changes" before checking whether any given feature is actually there or not. Unless you can find existing code that would help you do that using the OOXML representation of the document, that's probably a strong case for doing checks via the object model.
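For that "accept changes" step, a small sketch via COM automation (pywin32 here; the equivalent calls on the object model look the same from PowerShell). The paths are placeholders, and the original file is left untouched by saving the accepted version under a new name:

```python
# Accept all tracked changes before running feature checks.
import win32com.client

word = win32com.client.Dispatch("Word.Application")
word.Visible = False
doc = word.Documents.Open(r"C:\checks\report.docx")
try:
    doc.Revisions.AcceptAll()  # resolve tracked changes in memory
    doc.SaveAs2(r"C:\checks\report-accepted.docx")
finally:
    doc.Close(False)  # 0 = wdDoNotSaveChanges, original stays as-is
    word.Quit()
```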
Some final observations
- for future checks, it is perhaps worth noting that in principle you could create a "Document Inspector" that users could call from Word's Backstage view to perform checks on a document. I am not sure you can force users to run it, or that you could create it in PowerShell, but it might be a useful tool.
- longer term, if you are doing a very large number of checks, it is perhaps worth considering whether you could train an ML model to detect problems.

Determining canonical classes with text data

I have a unique problem and I'm not aware of any algorithm that can help me. Maybe someone here knows of one.
I have a dataset compiled from many different sources (teams). One field in particular is called "type". Here are some example values for type:
aple, apples, appls, ornge, fruits, orange, orange z, pear,
cauliflower, colifower, brocli, brocoli, leeks, veg, vegetables.
What I would like to be able to do is to group them together into e.g. fruits, vegetables, etc.
Put another way, I have multiple spellings of various permutations of a parent-level variable (fruits or vegetables in this example) and I need to be able to group them as best I can.
The only other potentially relevant feature of the data is the team that entered it, assuming some consistency in the way each team enters their data.
So, I have several million records with multiple spellings and shortened spellings (e.g. apple, appls), and I want to group them together in some way; in this example, by fruits and vegetables.
Clustering would be challenging, since each entry is most often one or two words, making it tricky to calculate a distance between terms.
Short of creating a massive lookup table by hand (not likely with millions of rows), is there any approach I can take to this problem?
You will need to solve the spelling problem first, unless you have Google-scale data that would let you learn spelling correction from the statistics alone.
Then you will still have the problem that "apple" could be a fruit or a computer, and that "apple" and "Granny Smith" will look completely different. Your best bet at this second stage is something like word2vec trained on massive data. Then you get high-dimensional word vectors, and can finally try to solve the clustering challenge, if you ever get that far with decent results. Good luck.
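As a toy sketch of the first (spelling) stage only, here is fuzzy matching against a canonical vocabulary; the vocabulary and category map are tiny stand-ins that would in practice have to be curated rather than hand-built per row:

```python
# Normalize misspelled "type" values via fuzzy matching, then map the
# matched canonical term to its parent category.
from difflib import get_close_matches

VOCAB = ["apple", "orange", "pear", "cauliflower", "broccoli", "leek",
         "fruit", "vegetable", "veg"]
CATEGORY = {"apple": "fruits", "orange": "fruits", "pear": "fruits",
            "fruit": "fruits",
            "cauliflower": "vegetables", "broccoli": "vegetables",
            "leek": "vegetables", "vegetable": "vegetables",
            "veg": "vegetables"}

def normalise(raw):
    """Map a raw 'type' value to a category, or None if no close match."""
    token = raw.strip().lower().rstrip("sz")  # crude plural/noise trim
    match = get_close_matches(token, VOCAB, n=1, cutoff=0.6)
    return CATEGORY[match[0]] if match else None

for raw in ["aple", "appls", "ornge", "brocli", "colifower", "veg"]:
    print(raw, "->", normalise(raw))
```

This does nothing about the apple-the-fruit vs Apple-the-company ambiguity or about terms like "Granny Smith"; that is where the word-vector stage would have to take over.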

How to break up large document into smaller answer units on Retrieve and Rank?

I am still very new to the Retrieve and Rank and Document Conversion services, so I have been playing around with them lately.
I encountered a situation where, when I upload a large document (100+ pages), Retrieve and Rank helps me automatically break it up into answer units, which is great and helpful.
However, some questions only require ONE small line within those big chunks of answer units. Is there a way I can manually break down the answer units that the service has provided me even further?
I heard that you can do it through JavaScript, but is there a way to do it through the UI?
I am contemplating manually breaking up the huge document into multiple smaller documents, but that could potentially lead to hundreds of them, so it is probably the last option I'd resort to.
Any help or suggestions are greatly appreciated!
Thank you all!
First off, one clarification:
Retrieve and Rank does not break up your documents into answer units. That is something that the Document Conversion Service does when your conversion target is ANSWER_UNITS.
Regarding your question:
I don't fully understand exactly what you're trying to do, but if the answer units produced by default don't meet your requirements, you can customize different steps of the conversion process to adjust the produced answer units; see the Document Conversion documentation.
Specifically, you want to make sure that the heading levels (for Word, PDF, or HTML, depending on your document type) are defined such that they mark the start of each answer unit. Then make sure that the heading levels you defined (h1, h2, h3, etc.) are included in the selector_tags list within the answer_units section (see the sketch below).
Once your custom Document Conversion Service configuration produces the answer units you are looking for, you will be ready to send them to Retrieve and Rank to be indexed.
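As far as I recall the configuration schema (verify against the current service documentation before relying on it), the relevant part of a custom configuration would look roughly like this, expressed as the Python dict you would serialize and send with the conversion request:

```python
# Hypothetical sketch of a Document Conversion configuration that splits
# on h1/h2/h3 headings; field names are from memory of the service docs.
import json

config = {
    "conversion_target": "ANSWER_UNITS",
    "answer_units": {
        # headings that mark the start of a new answer unit
        "selector_tags": ["h1", "h2", "h3"],
    },
}

print(json.dumps(config, indent=2))  # payload for the conversion request
```

Adding finer-grained tags (h3, h4) to selector_tags is the lever for producing smaller answer units, which seems to be what you are after.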

Efficiently extract WikiData entities from text

I have a lot of texts (millions), ranging from 100 to 4000 words. The texts are formatted as written work, with punctuation and grammar. Everything is in English.
The problem is simple: How to extract every WikiData entity from a given text?
An entity is defined as any noun, proper or common, i.e. names of people, organizations, and locations, as well as things like chairs, potatoes, etc.
So far I've tried the following:
1. Tokenize the text with OpenNLP, and use the pre-trained models to extract people, locations, organizations, and regular nouns.
2. Apply Porter stemming where applicable.
3. Match all extracted nouns with the wmflabs API to retrieve a potential WikiData ID (a sketch of this lookup step follows below).
This works, but I feel like I can do better. One obvious improvement would be to cache the relevant pieces of WikiData locally, which I plan on doing. However, before I do that, I want to check if there are other solutions.
Suggestions?
I tagged the question Scala because I'm using Spark for the task.
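For reference, the lookup in step 3 can also go against WikiData's own search API, and the local caching mentioned above is easy to approximate in-process. A minimal Python sketch (the question uses Scala/Spark, but the HTTP call is the same there):

```python
# Query WikiData's public wbsearchentities API for candidate item IDs,
# with an in-process cache standing in for the local cache mentioned above.
from functools import lru_cache

import requests

@lru_cache(maxsize=100_000)
def wikidata_ids(term):
    """Return candidate WikiData item IDs for a surface form."""
    resp = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={
            "action": "wbsearchentities",
            "search": term,
            "language": "en",
            "type": "item",
            "format": "json",
        },
        timeout=10,
    )
    resp.raise_for_status()
    return tuple(hit["id"] for hit in resp.json().get("search", []))

print(wikidata_ids("potato"))  # tuple of Q-ids, most relevant first
```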
Some suggestions:
- consider Stanford NER in comparison to OpenNLP to see how it compares on your corpus
- I wonder at the value of stemming for most entity names
- I suspect you might be losing information by dividing the task into discrete stages
- although Wikidata is new, the task isn't, so you might look at papers for Freebase|DBpedia|Wikipedia entity recognition|disambiguation
In particular, DBpedia Spotlight is one system designed for exactly this task.
http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/38389.pdf
http://ceur-ws.org/Vol-1057/Nebhi_LD4IE2013.pdf