stemCompletion is not working - tm

I am using the tm package for text analysis of repair data: reading the data into a data frame, converting it to a Corpus object, and applying various cleaning methods such as tolower, stripWhitespace, removeWords with stopwords, and so on.
I took a backup copy of the Corpus object to use as the dictionary for stemCompletion.
I then performed stemDocument using the tm_map function; the words in my object got stemmed and the results were as expected.
But when I run the stemCompletion operation using tm_map, it does not work, and I get the error below:
Error in UseMethod("words") : no applicable method for 'words'
applied to an object of class "character"
I executed traceback() and got the steps below:
> traceback()
9: FUN(X[[1L]], ...)
8: lapply(dictionary, words)
7: unlist(lapply(dictionary, words))
6: unique(unlist(lapply(dictionary, words)))
5: FUN(X[[1L]], ...)
4: lapply(X, FUN, ...)
3: mclapply(content(x), FUN, ...)
2: tm_map.VCorpus(c, stemCompletion, dictionary = c_orig)
1: tm_map(c, stemCompletion, dictionary = c_orig)
How can I resolve this error?

I received the same error when using tm v0.6. I suspect this occurs because stemCompletion is not in the default transformations for this version of the tm package:
> getTransformations
function ()
c("removeNumbers", "removePunctuation", "removeWords", "stemDocument",
"stripWhitespace")
<environment: namespace:tm>
Now, the tolower function has the same problem, but can be made operational by using the content_transformer function. I tried a similar approach for stemCompletion but was not successful.
Note, even though stemCompletion isn't a default transformation, it still works when manually fed stemmed words:
> stemCompletion("compani",dictCorpus)
compani
"companies"
So that I could continue with my work, I manually delimited each document in the corpus by single spaces, fed them through stemCompletion, and concatenated them back together with the following (clunky and not graceful!) function:
stemCompletion_mod <- function(x, dict = dictCorpus) {
  # split the document into words, complete each stem against the dictionary,
  # then paste the completed words back into a single document
  PlainTextDocument(stripWhitespace(paste(
    stemCompletion(unlist(strsplit(as.character(x), " ")),
                   dictionary = dict, type = "shortest"),
    sep = "", collapse = " ")))
}
where dictCorpus is just a copy of the cleaned corpus, but before it's stemmed. The extra stripWhitespace is specific for my corpus, but is likely benign for a general corpus. You may want to change the type option from "shortest" as needed.
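As an aside, stemCompletion supports several completion heuristics through its type argument (as far as I recall from tm's documentation: "prevalent" is the default, with "first", "longest", "none", "random", and "shortest" as alternatives). A quick comparison with the dictionary used above:
> stemCompletion("compani", dictCorpus, type = "shortest")   # shortest matching completion
> stemCompletion("compani", dictCorpus, type = "prevalent")  # most frequent completion (the default)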
For a full example, let's set up a dummy corpus using the crude data in the tm package:
> data("crude")
> docs = Corpus(VectorSource(crude))
> docs <- tm_map(docs, content_transformer(tolower))
> docs <- tm_map(docs, removeNumbers)
> docs <- tm_map(docs, removeWords, stopwords("english"))
> docs <- tm_map(docs, removePunctuation)
> docs <- tm_map(docs, stripWhitespace)
> docs <- tm_map(docs, PlainTextDocument)
> dictCorpus <- docs
> docs <- tm_map(docs, stemDocument)
> # Define modified stemCompletion function
> stemCompletion_mod <- function(x, dict = dictCorpus) {
    PlainTextDocument(stripWhitespace(paste(
      stemCompletion(unlist(strsplit(as.character(x), " ")),
                     dictionary = dict, type = "shortest"),
      sep = "", collapse = " ")))
  }
> # Original doc in crude data
> crude[[1]]
<<PlainTextDocument (metadata: 15)>>
Diamond Shamrock Corp said that
effective today it had cut its contract prices for crude oil by
1.50 dlrs a barrel.
The reduction brings its posted price for West Texas
Intermediate to 16.00 dlrs a barrel, the copany said.
"The price reduction today was made in the light of falling
oil product prices and a weak crude oil market," a company
spokeswoman said.
Diamond is the latest in a line of U.S. oil companies that
have cut its contract, or posted, prices over the last two days
citing weak oil markets.
Reuter
> # Stemmed example in crude data
> docs[[1]]
<<PlainTextDocument (metadata: 7)>>
diamond shamrock corp said effect today cut contract price crude oil dlrs barrel
reduct bring post price west texa intermedi dlrs barrel copani said price reduct today
made light fall oil product price weak crude oil market compani spokeswoman said diamond
latest line us oil compani cut contract post price last two day cite weak oil market reuter
> # Stem-completed example in crude data
> stemCompletion_mod(docs[[1]],dictCorpus)
<<PlainTextDocument (metadata: 7)>>
diamond shamrock corp said effect today cut contract price crude oil dlrs barrel
reduction brings posted price west texas intermediate dlrs barrel NA said price reduction today
made light fall oil product price weak crude oil market companies spokeswoman said diamond
latest line us oil companies cut contract posted price last two day cited weak oil market reuter
Note: This example is odd, since the misspelled word "copany" is mapped "copany" -> "copani" -> NA in this process. I am not sure how to correct this...
To run stemCompletion_mod through the entire corpus, I just use sapply (or parSapply with the snow package).
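For instance, a minimal sketch of that step (using lapply here; the variable names are those from the example above):
> docs_completed <- lapply(docs, stemCompletion_mod, dict = dictCorpus)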
Perhaps someone with more experience than me could suggest a simpler modification to get stemCompletion to work in v0.6 of the tm package.

I had success with the following workflow:
use content_transformer to apply an anonymous function on each document of the corpus,
split the document into words by spaces,
call stemCompletion on the words with the help of the dictionary,
and concatenate the separate words into a document again with paste.
POC demo code:
tm_map(c, content_transformer(function(x, d)
  # stem the document, complete each word against the dictionary d,
  # and glue the completed words back into one string
  paste(stemCompletion(strsplit(stemDocument(x), ' ')[[1]], d), collapse = ' ')), d)
PS: using c as a variable name to store the corpus is not a good idea, as it masks base::c.
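For instance, applied to the cleaned (but not yet stemmed) corpus and dictionary from the earlier answer (a sketch; docs and dictCorpus are assumed from that example):
docs <- tm_map(docs, content_transformer(function(x, d)
  paste(stemCompletion(strsplit(stemDocument(x), ' ')[[1]], d), collapse = ' ')), dictCorpus)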

Thanks, cdxsza. Your method worked for me.
A note to all who are going to use stemCompletion:
The function completes an empty string with a word from the dictionary, which is unexpected. See the example below, where the first "monday" was produced for the blank at the beginning of the string.
stemCompletion(unlist(strsplit(" mond tues ", " ")), dict=c("monday", "tuesday"))
[1] "monday" "monday" "tuesday"
It can be easily fixed by removing empty string "" before stemCompletion as below.
stemCompletion2 <- function(x, dictionary) {
  x <- unlist(strsplit(as.character(x), " "))
  x <- x[x != ""]
  x <- stemCompletion(x, dictionary = dictionary)
  x <- paste(x, sep = "", collapse = " ")
  PlainTextDocument(stripWhitespace(x))
}
myCorpus <- lapply(myCorpus, stemCompletion2, dictionary=myCorpusCopy)
myCorpus <- Corpus(VectorSource(myCorpus))
See a detailed example on page 12 of the slides at
http://www.rdatamining.com/docs/RDataMining-slides-text-mining.pdf
Regards
Yanchang Zhao
RdataMining.com

The problem is that using tolower (e.g. myCorpus <- tm_map(myCorpus, tolower)) converts the text to simple character values, which tm version 0.6 does not accept for use with tm_map.
If you instead do your original tolower like this
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
then the data will be in the correct format for when you need stemCompletion.
Other functions like removePunctuation and removeNumbers are used with tm_map as usual, i.e. without content_transformer.
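Putting that together, a minimal sketch of the corrected cleaning pipeline (assuming myCorpus is a VCorpus):
myCorpus <- tm_map(myCorpus, content_transformer(tolower))  # wrapper needed: tolower returns plain character
myCorpus <- tm_map(myCorpus, removePunctuation)             # built-in transformation, no wrapper needed
myCorpus <- tm_map(myCorpus, removeNumbers)                 # likewise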
Reference: https://stackoverflow.com/a/24771621

Related

Function Corpus in Quanteda doesn't work because of kwic objects

First of all, I'm working on a big data project which consists of analyzing some press URLs to detect the most popular topics. My topic is football (the Mbappe contract) and I collected 180 URLs from Marca, a Spanish mass-media outlet, in a .txt file.
When I try to create a document matrix with the corpus() function from the quanteda package, I obtain this: Error: corpus() only works on character, corpus, Corpus, data.frame, kwic objects.
Some of the URLs seem to yield a kwic object (maybe a video, adverts...) that doesn't let me work with just the text, and I think it happens because inspecting the HTML div class body automatically picks up these kwic objects.
Here is the code I use to read them:
library(rvest)    # read_html, html_nodes, html_text (also re-exports %>%)
library(stringr)  # str_replace_all

url_marca <- read.table("mbappe.txt", stringsAsFactors = FALSE)$V1
get_marca_text <- function(url) {
  url %>%
    read_html() %>%
    html_nodes("div.ue-c-article__body") %>%
    html_text() %>%
    str_replace_all("[\r\n]", "")
}
text_marca_mbappe <- sapply(url_marca, get_marca_text)
Does anyone know whether this is because of a mistake in html_nodes when inspecting the URL, or is it something different?
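One quick check worth running (an assumption on my part, not from the original post): corpus() accepts a character vector but not a plain list, and sapply() falls back to returning a list whenever some URLs match zero or several body nodes. Inspecting the result's type may therefore point at the culprit:
str(text_marca_mbappe)                       # character vector, or list?
lengths(lapply(url_marca, get_marca_text))   # body nodes matched per URL; anything other than 1 prevents simplification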

Issue with biofam3c data in seqHMM

I have the biofam3c data from the seqHMM package, which comes in list format. I wanted to see it in matrix form, so I used this command:
matrix_biofam3c <- matrix(unlist(biofam3c), ncol = 16, byrow = TRUE)
But its results were confusing. Then I also tried to change matrix_biofam3c back into a list with
U <- as.list(as.data.frame(matrix_biofam3c))
but it does not give results similar to the original form in the package (a list of length 4). Please suggest the right way to turn biofam3c into matrix data (like biofam in the TraMineR package), and also how to convert it back again so it becomes a list as it is given there.
The biofam3c data is a list of three matrices that contain the sequence data of three channels, namely children, married, and left (for left home). The children and left channels hold binary data (children vs childless; left home vs living with parents), while married has three states: single, married, and divorced.
The help page of biofam3c gives the code used to generate these three channels from the biofam data that ships with TraMineR. Note that this code assumes there are no divorced people living with their parents. The biofam data does not distinguish between divorced with and without children, nor between divorced who have left home and those who have not. See also the example on the help page of the seqdistmc function of TraMineR.
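Regarding the "list of length 4" mentioned in the question: besides the three sequence matrices, biofam3c also carries a covariates element (component names as I recall them from the seqHMM documentation), which a quick inspection shows:
library(seqHMM)
data("biofam3c")
str(biofam3c, max.level = 1)  # children, married, left, covariates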
You can reconstruct biofam from the three channels using interaction
library(seqHMM)
library(TraMineR)  # needed below for seqstatl, seqdef, seqdplot
data(biofam3c)
## the 3 channels
children <- biofam3c[[1]]
married <- biofam3c[[2]]
left <- biofam3c[[3]]
## Reconstructing biofam
bf.rec <- matrix(NA, nrow(children), ncol(children))
for (i in 1:ncol(children)) {
  bf.rec[, i] <- as.matrix(interaction(children[, i], married[, i], left[, i]))
}
alph <- seqstatl(bf.rec)
## reordering states as in biofam
alphabet <- alph[c(6,5,4,3,10,9,8,7,2,1)]
## creating a single state divorced as in biofam
bf.rec[bf.rec %in% alphabet[8:10]] <- "divorced"
alphabet[8] <- "divorced"
alphabet <- alphabet[1:8]
seq.bf.rec <- seqdef(bf.rec, alphabet=alphabet)
seqdplot(seq.bf.rec)
You get the same figure using the original biofam
## in biofam sequence data are in columns 10 to 25
library(TraMineR)
bf <- biofam[,10:25]
seq.bf <- seqdef(bf)
seqdplot(seq.bf)

regexp_extract function - Spark Scala - getting an error

Here are the sample records
SYSTEM, paid18.26 toward test
sys, paid $861.82 toward your
L, paid $1119.00toward your
I need to extract the data between "paid" and "toward". I have written the statement below, but I am not getting the expected output:
withColumn("message_comment_txt_amount",regexp_extract(col("message_comment_txt"),"(?i)paid\\s+(.*?)\\s+(?i)toward",1))
Expected output:
18.26
861.82
1119.00
Please let me know where exactly the error is.
Assuming the amount always comes in between the strings "paid" and "toward":
val amount = df.withColumn(
  "amount",
  // no ^ anchor, since "paid" appears mid-string in the sample records;
  // (?i) makes the match case-insensitive and \\s* trims the optional spaces
  // around the captured amount (the samples have them on either side or not at all)
  regexp_extract(col("message_comment_txt"), "(?i)paid\\s*(.*?)\\s*toward", 1)
)
The above snippet adds a new column amount to the dataset/df. It doesn't check for or replace the $ symbol, though; that can be stripped in a next step (e.g. with regexp_replace) if this works as expected for all your cases.

Sentence similarity - How to calculate the depth of subsumer using WordNet?

I am trying to build a tool to calculate the similarity between two words, and I found a formula for this that comes from Manchester Metropolitan University.
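For reference, this appears to be the word-similarity function from the Li, Bandar and McLean paper linked below, which combines the shortest path length $l$ between the two words with the depth $h$ of their subsumer (the parameter values $\alpha = 0.2$ and $\beta = 0.45$ are the ones reported as optimal there, as far as I recall):

$$s(w_1, w_2) = e^{-\alpha l} \cdot \frac{e^{\beta h} - e^{-\beta h}}{e^{\beta h} + e^{-\beta h}}$$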
Until now, I am still confused about how to get h, which is the depth of the subsumer in the hierarchical semantic nets.
As I understand it, h is the path length from the top word down to a certain word; according to the author, the top word for NOUNs is 'entity'.
But how about other kinds of words such as ADJ, ADV, VERB...?
And if we already have the top word, how can we list the path from it to the word we need to calculate?
The paper is at the following link: https://www.researchgate.net/profile/Keeley_Crockett/publication/232645326_Sentence_Similarity_Based_on_Semantic_Nets_and_Corpus_Statistics/links/0deec51b8db68f19fa000000.pdf
I'd really appreciate any answer.
Thanks
Every time I've tried to understand the WordNet hierarchy I have found something that invalidated everything I previously assumed :)
Regarding similarities, if you are using Python and NLTK, I'd recommend you use the provided similarity metrics; if not, they may be a good start to understand how things work.
In this link, scroll down to Similarity:
http://www.nltk.org/howto/wordnet.html
I would like to add more detail which I have just found.
These details were enough for my own search and may not answer the question above exactly, but I think I need to share them for anybody who needs them in the future.
'Entity' is not only the root of NOUNs, but also the root of any word, even VERBs, ADJs, ADVs....
E.g. the full path for the word 'kiss': ROOT#n#1 < entity#n#1 < abstraction#n#6 < psychological_feature#n#1 < event#n#1 < act#n#2 < touch#n#5 < kiss#n#1
E.g. the full path for the word 'kick': ROOT#n#1 < entity#n#1 < abstraction#n#6 < psychological_feature#n#1 < event#n#1 < act#n#2 < speech_act#n#1 < objection#n#2 < kick#n#4
To calculate the depth of any word, we need to count from the beginning word ('entity'), based on the WordNet hierarchical database.
Coming back to the example above, h (the depth of the subsumer of 'kiss' and 'kick') is 6, counted from the top tree node ROOT down to the word 'act'.

How to express queries in Tuple Relational Calculus?

Problem:
Consider a relation with schema Building(Street, Number, No.Apartments, Color, Age).
TRC: find the oldest building in Downing Street.
The associated SQL statement would be:
SELECT MAX(Age) AS "Oldest building" FROM Building WHERE Street = 'Downing Street';
My answer using TRC (B stands for the Building relation):
{V.* | V(B) | V.BAge >= Age ^ V.Bstreet = 'Downing Street'}
V.* (it returns every single tuple of Building)
V(B) (it maps the variable V to Building's tuples)
V.BAge >= Age ^ V.Bstreet = 'Downing Street' (here I set the condition... maybe...)
If this is still relevant: the hint would be to realize that the oldest building is the one such that no other building is older than it.
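Written out, that hint gives a TRC expression along these lines (a sketch in one common notation; attribute names follow the schema above):
{V | V(B) ^ V.Street = 'Downing Street' ^ ¬∃W (W(B) ^ W.Street = 'Downing Street' ^ W.Age > V.Age)}
That is, select the Downing Street buildings for which no other Downing Street building has a greater Age.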