First of all, I'm working on a big data project that consists of analyzing some press URLs to detect the most popular topics. My topic is football (Mbappé's contract), and I collected 180 URLs from Marca, a Spanish mass media outlet, in a .txt file.
When I try to create a document-feature matrix with the corpus() function from the quanteda package, I get this: Error: corpus() only works on character, corpus, Corpus, data.frame, kwic objects.
Some of the URLs seem to contain a kwic object (maybe a video, adverts...) that keeps me from working with plain text, and I think it's because, when inspecting the HTML div class body, these objects get picked up automatically.
Here is the code I use to read it:
library(rvest)    # read_html(), html_nodes(), html_text(); also re-exports %>%
library(stringr)  # str_replace_all()

url_marca <- read.table("mbappe.txt", stringsAsFactors = FALSE)$V1

get_marca_text <- function(url) {
  url %>%
    read_html() %>%
    html_nodes("div.ue-c-article__body") %>%
    html_text() %>%
    str_replace_all("[\r\n]", "")
}

text_marca_mbappe <- sapply(url_marca, get_marca_text)
Does anyone know if this is because of a mistake in html_nodes when inspecting the URL, or is it something different?
I've had great success using the gtsummary::tbl_regression function to display regression model results. I can't see how to use tbl_regression with pooled regression models from imputed data sets, however, and I'd really like to.
I don't have a reproducible example handy; I just wanted to see if anyone else has found a way to work with, say, mids objects created by the mice package in tbl_regression.
In the current development version of gtsummary, it's possible to summarize models estimated on imputed data from the mice package. Here's an example:
# install dev version of gtsummary
remotes::install_github("ddsjoberg/gtsummary")
library(gtsummary)
packageVersion("gtsummary")
#> [1] ‘1.3.5.9012’
# impute the data
df_imputed <- mice::mice(trial, m = 2)
# build the model
imputed_model <- with(df_imputed, lm(age ~ marker + grade))
# present beautiful table with gtsummary
tbl_regression(imputed_model)
#> pool_and_tidy_mice: Tidying mice model with
#> `mice::pool(x) %>% mice::tidy(exponentiate = FALSE, conf.int = TRUE, conf.level = 0.95)`
Created on 2020-12-16 by the reprex package (v0.3.0)
It's important to note that you pass the mice model object to tbl_regression() BEFORE you pool the results. The tbl_regression() function needs access to the individual models in order to correctly identify the reference row and variable labels (among other things). Internally, the tidying function used on the mice model will first pool the results, then tidy the results. The code used for this process is printed to the console for transparency (as seen in the example above).
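For illustration, this is roughly the internal tidying step reported in the console message above; tbl_regression() runs it for you, so you don't need to call it yourself:

# the pool-then-tidy step tbl_regression() performs internally
pooled <- mice::pool(imputed_model)
mice::tidy(pooled, exponentiate = FALSE, conf.int = TRUE, conf.level = 0.95)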
I work on a project with MaxMSP where I have multiple colls. I want to combine all the lists in there into one single coll. Is there a way to do that directly, without unpacking and repacking everything?
To be clearer, let's say I have two colls, with the first one being:
0, 2
1, 4
2, 4
….
99, 9
while the second one is:
100, 8
101, 4
…
199, 7
I would like the final coll to be one list from 0-199.
Please keep in mind I don't want to unpack everything (with uzi, for instance) because my lists are very long and I find it problematic for the CPU to use colls with such long lists. That's why I broke my huge list into sublists/subcolls in the first place.
Hope that’s clear enough.
If the two colls do not have overlapping indices, then you can just dump one into the other, like this:
----------begin_max5_patcher----------
524.3ocyU0tSiCCD72IOEQV7ybnZmFJ28pfPUNI6AlKwIxeTZEh28ydsCDNB
hzdGbTolTOd20yXOd6CoIjp98flj8irqxRRdHMIAg7.IwwIjN995VtFCizAZ
M+FfjGly.6MHdisaXDTZ6DxVvfYvhfCbS8sB4MaUPsIrhWxNeUdFsf5esFex
bPYW+bc5slwBQinhFbA6qt6aaFWwPXlCCPnxDxSEQaNzhnDhG3wzT+i7+R4p
AS1YziUvTV44W3+r1ozxUnrKNdYW9gKaIbuagdkpGTv.HalU1z26bl8cTpkk
GufK9eI35911LMT2ephtnbs+0l2ybu90hl81hNex241.hHd1usga3QgGUteB
qDoYQdDYLpqv3dJR2L+BNLQodjc7VajJzrqivgs5YSkMaprkjZwroVLI03Oc
0HtKv2AMac6etChsbiQIprlPKto6.PWEfa0zX5+i8L+TnzlS7dBEaLPC8GNN
OC8qkm4MLMKx0Pm21PWjugNuwg9A6bv8URqP9m+mJdX6weocR2aU0imPwyO+
cpHiZ.sQH4FQubRLtt+YOaItUzz.3zqFyRn4UsANtZVa8RYyKWo4YSwmFane
oXSwBXC6SiMaV.anmHaBlZ9vvNPoikDIhqa3c8J+vM43PgLLDqHQA6Diwisp
Hbkqimwc8xpBMc1e4EjPp8MfRZEw6UtU9wzeCz5RFED
-----------end_max5_patcher-----------
mzed's answer works, as stated, if the lists have no overlapping indices, which they shouldn't, based on the design you specify.
Whether you are treating your 'huge list' as multiple lists, or vice versa, might help shape the answer. One question some may ask is: "why are you merging it again?"
- you consider your program to have one large list
- that large list is really an interface that handles how you interact with several sub-lists for efficiency's sake
- the interface to your data persistence (the lists) for storing and retrieval then acts like one large list but works with several under the hood
- your interface should then provide an insertion and retrieval mechanism that handles the multiple lists as one list
- it should also save and reload the sublists individually
If you wrap this into a poly~, the voice acts as the sublist, so when I say voice I basically mean sublist:
You could use a universal send/receive in and out of a poly~ abstraction that contains your sublist's unique coll; the voice # from poly~ can be appended to the sublist filename that each voice's [coll] reads from and saves to.
With that set up, you could specify the number of sublists (voices) and master list length you want in the poly~ arguments like:
[poly~ sublist_manager.maxpat 10 1000] // 10 sublists emulating a 1000-length list
The math for index lookup is:
// main variables for master list creation/usage
master_list_length = 1000;
sublist_count = 10;
sublist_length = master_list_length / sublist_count;
// variables computed when inserting/looking up an index
sublist_number = desired_index / sublist_length; // integer divide to get the sublist you'll be performing the lookup in
sublist_index = desired_index % sublist_length;  // actual index within that sublist to access
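As a quick sanity check, here is the same lookup math in R (the values are just the hypothetical ones from the poly~ example above; %/% is integer division):

master_list_length <- 1000
sublist_count <- 10
sublist_length <- master_list_length / sublist_count  # 100
desired_index <- 250                                  # arbitrary example index
sublist_number <- desired_index %/% sublist_length    # 2  -> third sublist (0-based)
sublist_index  <- desired_index %%  sublist_length    # 50 -> position within that sublist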
If the above is closer to what you're looking for, I can work on a patch for that. Cheers
I am using the tm package for text analysis of repair data: reading the data into a data frame, converting it to a Corpus object, and applying various methods to clean the data (tolower, stripWhitespace, removeWords with stopwords, and so on).
I took a backup of the Corpus object for stemCompletion.
I performed stemDocument using the tm_map function; the words in my object got stemmed and I got the results as expected.
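For reference, here is a minimal sketch of the pipeline described above (the data frame and its text column are assumptions; the variable names c and c_orig match the traceback below):

library(tm)
# df$text is a hypothetical column holding the repair descriptions
c <- Corpus(VectorSource(df$text))
c <- tm_map(c, tolower)          # plain tolower, as described
c <- tm_map(c, stripWhitespace)
c <- tm_map(c, removeWords, stopwords("english"))
c_orig <- c                      # backup kept as the stemCompletion dictionary
c <- tm_map(c, stemDocument)     # stemming itself works as expected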
When I run the stemCompletion operation using the tm_map function, it does not work and I get the error below:
Error in UseMethod("words") : no applicable method for 'words'
applied to an object of class "character"
I executed traceback() and got the steps below:
> traceback()
9: FUN(X[[1L]], ...)
8: lapply(dictionary, words)
7: unlist(lapply(dictionary, words))
6: unique(unlist(lapply(dictionary, words)))
5: FUN(X[[1L]], ...)
4: lapply(X, FUN, ...)
3: mclapply(content(x), FUN, ...)
2: tm_map.VCorpus(c, stemCompletion, dictionary = c_orig)
1: tm_map(c, stemCompletion, dictionary = c_orig)
How can I resolve this error?
I received the same error when using tm v0.6. I suspect this occurs because stemCompletion is not in the default transformations for this version of the tm package:
> getTransformations
function ()
c("removeNumbers", "removePunctuation", "removeWords", "stemDocument",
"stripWhitespace")
<environment: namespace:tm>
Now, the tolower function has the same problem, but can be made operational by using the content_transformer function. I tried a similar approach for stemCompletion but was not successful.
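For example, the tolower wrapping looks like this (the same call appears in the full example below):

docs <- tm_map(docs, content_transformer(tolower))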
Note, even though stemCompletion isn't a default transformation, it still works when manually fed stemmed words:
> stemCompletion("compani",dictCorpus)
compani
"companies"
So that I could continue with my work, I manually split each document in the corpus on single spaces, fed the words through stemCompletion, and concatenated them back together with the following (clunky and not graceful!) function:
stemCompletion_mod <- function(x, dict = dictCorpus) {
  PlainTextDocument(stripWhitespace(paste(
    stemCompletion(unlist(strsplit(as.character(x), " ")),
                   dictionary = dict, type = "shortest"),
    sep = "", collapse = " ")))
}
where dictCorpus is just a copy of the cleaned corpus, but before it's stemmed. The extra stripWhitespace is specific for my corpus, but is likely benign for a general corpus. You may want to change the type option from "shortest" as needed.
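For instance, type = "prevalent" (the default) completes to the most frequent matching word in the dictionary:

stemCompletion("compani", dictCorpus, type = "prevalent")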
For a full example, let's setup a dummy corpus using the crude data in the tm package:
> data("crude")
> docs = Corpus(VectorSource(crude))
> docs <- tm_map(docs, content_transformer(tolower))
> docs <- tm_map(docs, removeNumbers)
> docs <- tm_map(docs, removeWords, stopwords("english"))
> docs <- tm_map(docs, removePunctuation)
> docs <- tm_map(docs, stripWhitespace)
> docs <- tm_map(docs, PlainTextDocument)
> dictCorpus <- docs
> docs <- tm_map(docs, stemDocument)
> # Define modified stemCompletion function
> stemCompletion_mod <- function(x,dict=dictCorpus) {
PlainTextDocument(stripWhitespace(paste(stemCompletion(unlist(strsplit(as.character(x)," ")),dictionary=dict, type="shortest"),sep="", collapse=" ")))
}
> # Original doc in crude data
> crude[[1]]
<<PlainTextDocument (metadata: 15)>>
Diamond Shamrock Corp said that
effective today it had cut its contract prices for crude oil by
1.50 dlrs a barrel.
The reduction brings its posted price for West Texas
Intermediate to 16.00 dlrs a barrel, the copany said.
"The price reduction today was made in the light of falling
oil product prices and a weak crude oil market," a company
spokeswoman said.
Diamond is the latest in a line of U.S. oil companies that
have cut its contract, or posted, prices over the last two days
citing weak oil markets.
Reuter
> # Stemmed example in crude data
> docs[[1]]
<<PlainTextDocument (metadata: 7)>>
diamond shamrock corp said effect today cut contract price crude oil dlrs barrel
reduct bring post price west texa intermedi dlrs barrel copani said price reduct today
made light fall oil product price weak crude oil market compani spokeswoman said diamond
latest line us oil compani cut contract post price last two day cite weak oil market reuter
> # Stem completed example in crude data
> stemCompletion_mod(docs[[1]],dictCorpus)
<<PlainTextDocument (metadata: 7)>>
diamond shamrock corp said effect today cut contract price crude oil dlrs barrel
reduction brings posted price west texas intermediate dlrs barrel NA said price reduction today
made light fall oil product price weak crude oil market companies spokeswoman said diamond
latest line us oil companies cut contract posted price last two day cited weak oil market reuter
Note: This example is odd, since the misspelled word "copany" is mapped: -> "copani" -> "NA", in this process. Not sure how to correct this...
To run stemCompletion_mod through the entire corpus, I just use sapply (or parSapply with the snow package).
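Something along these lines, using the objects from the example above:

docs_completed <- sapply(docs, stemCompletion_mod, dict = dictCorpus)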
Perhaps someone with more experience than me could suggest a simpler modification to get stemCompletion to work in v0.6 of the tm package.
I had success with the following workflow:
1. use content_transformer to apply an anonymous function on each document of the corpus,
2. split the document into words by spaces,
3. call stemCompletion on the words with the help of the dictionary,
4. and concatenate the separate words into a document again with paste.
POC demo code:
tm_map(c, content_transformer(function(x, d)
paste(stemCompletion(strsplit(stemDocument(x), ' ')[[1]], d), collapse = ' ')), d)
PS: using c as a variable name to store the corpus is not a good idea due to base::c
Thanks, cdxsza. Your method worked for me.
A note to all who are going to use stemCompletion:
The function completes an empty string with a word from the dictionary, which is unexpected. See the example below, where the first "monday" was produced for the blank at the beginning of the string.
stemCompletion(unlist(strsplit(" mond tues ", " ")), dict=c("monday", "tuesday"))
[1] "monday" "monday" "tuesday"
It can easily be fixed by removing empty strings ("") before stemCompletion, as below.
stemCompletion2 <- function(x, dictionary) {
  x <- unlist(strsplit(as.character(x), " "))
  x <- x[x != ""]
  x <- stemCompletion(x, dictionary = dictionary)
  x <- paste(x, sep = "", collapse = " ")
  PlainTextDocument(stripWhitespace(x))
}
myCorpus <- lapply(myCorpus, stemCompletion2, dictionary=myCorpusCopy)
myCorpus <- Corpus(VectorSource(myCorpus))
See a detailed example on page 12 of the slides at
http://www.rdatamining.com/docs/RDataMining-slides-text-mining.pdf
Regards
Yanchang Zhao
RdataMining.com
The problem is that using tolower (e.g. myCorpus <- tm_map(myCorpus, tolower)) converts the text to simple character values, which tm version 0.6 does not accept for use with tm_map.
If you instead do your original tolower like this
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
then the data will be in the correct format for when you need stemCompletion.
Other functions like removePunctuation and removeNumbers are used with tm_map as usual, i.e. without content_transformer.
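For example, these still work without the wrapper:

myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpus <- tm_map(myCorpus, removeNumbers)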
Reference: https://stackoverflow.com/a/24771621
I have a page where the markup includes nested definition lists of both random depth and random numbers of DDs associated with any DT. Thus:
DL
- DT
- DD
- DT
- DD
- DD
-- DL
-- DT
-- DD
-- DT
-- DD
-- DD
- DT
- DD
- DD
I need:
A. to zebra stripe the groups of DT/DDs with one another, and
B. to start the even/odd sequence over for each nested list that is encountered.
Using :even and :odd won't work because of the extra DDs.
I've tried using an each loop, shown here: http://jsfiddle.net/XJ9j4/, which fixes A but ignores B; i.e., compare the background color of the 1st child dt/dd combination to the 1st parent's, and consider the return to the parent list, which should be blue, not green.
Thoughts?
With my new understanding of what you want, I think this will do. Let me know if I am still misunderstanding.
$("dl").each(function(){
$this = $(this);
$this.children("dt:even").addClass("even").nextUntil("dt").addClass("even");
$this.children("dt:odd").addClass("odd").nextUntil("dt").addClass("odd");
});
http://jsfiddle.net/XJ9j4/8/