How to print textual representation of single documents stored in a tm corpus in R? - tm

I was using the {tm} package and generated a corpus with
corpus = Corpus(VectorSource(sample.words))
Then I wanted to check the content of the corpus, but it prints this instead of its texts:
> corpus
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 3933
I have since found some methods to look into the corpus, but I started wondering what exactly R prints when an object's name is typed at the prompt.
> class(corpus)
[1] "VCorpus" "Corpus"
> typeof(corpus)
[1] "list"
Why doesn't it print like other ordinary lists, showing its columns and rows? Does this have something to do with the class attribute?
I'm new to R and not familiar with some basic concepts, thanks for your patience!

The introduction document to the tm package says that you can use, say, writeLines(as.character(myCorpus[[4]])) to get a textual representation of document 4.
You can also use content(myCorpus[[23]]).
To read the intro document, enter browseVignettes() at your R prompt and then search for it in the browser window that opens.


Reading documents with r-tm to use with r-mallet

I have this code to fit a topic model with the R wrapper for MALLET:
docs <- mallet.import(DF$document, DF$text, stop_words)
mallet_model <- MalletLDA(num.topics = 4)
I have used the tm package to read my documents, which are txt files in a directory:
myCorpus <- Corpus(DirSource("data")) # a directory of txt files
The corpus can't be used as input for mallet.import, so how do I get from the tm corpus myCorpus above to the DF to call upon?
RMallet is intended to be a standalone package, so integration with tm isn't great. RMallet requires as input a data frame with one row per document and a character field containing the text, which it expects not to be tokenized already.
You can use tidy data principles to process your text and get it ready for input into mallet, with one row per document, as described here.
Also, there are tidiers for the mallet package in tidytext, and you can use them to analyze the output of mallet topic modeling:
# word-topic pairs
tidy(mallet_model)
# document-topic pairs
tidy(mallet_model, matrix = "gamma")
# column needs to be named "term" for "augment"
term_counts <- rename(word_counts, term = word)
augment(mallet_model, term_counts)

How to do a partial search in MongoDB?

How do I do a partial search?
Right now I'm trying
db.content.find({$text: {$search: "Customer london"}})
It finds all records matching customer, and all records matching london.
If I search for part of a word, for example lond or custom,
db.content.find({$text: {$search: "lond"}})
it returns an empty result. How can I modify the query to get the same result as when I search for london?
You can use a regex to get around this. However, it will work for the following:
If you have the word Cooking, the following queries may give you a result:
cooking (exact match)
coo (part of the word)
cooked (a word sharing the English root of the document word, where cook is the root from which cooking and cooked are derived)
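As a rough illustration of the regex approach, here is a minimal Python sketch that builds a filter document using MongoDB's `$regex` and `$options` operators. The field name `title` is an assumption (the question does not say which field holds the text), and the helper `partial_match_filter` is hypothetical. With a driver such as pymongo, the resulting dict would be passed to `find()`.

```python
import re

def partial_match_filter(field, fragment):
    """Build a filter matching `fragment` anywhere in `field`, ignoring case.

    re.escape keeps characters such as '.' or '*' in the user's input
    from being interpreted as regex syntax.
    """
    return {field: {"$regex": re.escape(fragment), "$options": "i"}}

# "title" is a placeholder field name, not taken from the question.
query = partial_match_filter("title", "lond")
print(query)

# Local sanity check of what the pattern would match:
print(bool(re.search(query["title"]["$regex"], "Customer London", re.IGNORECASE)))
```

Note that an unanchored, case-insensitive regex generally cannot use an index efficiently, so this is likely to be slow on large collections.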
If you would like to go one step further and get a result document containing cooking when you type vooking (misspelled with V instead of C), go for Elasticsearch.
Elasticsearch is easy to set up and has an extremely powerful edge-ngram analyzer, which breaks each word into smaller weighted tokens. Hence, when you misspell a word, you will still get a document back, ranked by the score Elasticsearch gives to that document.
You can read about it here:
A $text search will always return an empty array for partial words, for example when you search for lond hoping to match london,
because it takes full words and searches for them exactly as they are.
So lond will not achieve the same results as london.
Here Elasticsearch may help. It is quite good for full-text search when used together with MongoDB.
Reference: Elasticsearch
I fetch all documents and convert the result to an array:
clientDB.collection('details').find({}).toArray().then((docs) => {
I then used startsWith in a for loop to pick out my record:
  for (let i = 0; i < docs.length; i++) {
    if (docs[i].name.startsWith('U')) {
      // Stop at the first name that starts with 'U'.
      return console.log(docs[i].name);
    } else {
      console.log('Record not found!!!');
    }
  }
});
This may not be efficient, but it works for now.

Importing a website table into Google Sheets

I have tried searching everywhere online for a good answer but cannot seem to find anything that matches specifically what I am looking for.
When I use the IMPORTHTML function in Google Sheets, I end up with data that looks like:
${} (${player.position}, ${team.abbrev}) ${opponent.abbrev} #${opponent_rank} ${minutes} ${pts} ${fgm}-${fga} ${ftm}-${fta} ${p3m}-${p3a} ${treb} ${ast} ${stl} ${blk} ${tov} ${pf} ${fp} $${salary} ${ratio}
The formula that I am using looks like this:
=IMPORTHTML("", "table",2)
When I use the same as above (=IMPORTHTML("", "table",2)) only with "0" as my index, it pulls this:
Opp Stats
Player Team Rank Min Pts FGM/A FTM/A 3PM/A Reb Ast Stl Blk Tov Foul FP Cost Value
Basically, I am attempting to pull the table data from this website:
(because of my rep I cannot post more than two links, however my IMPORTHTML function has the above link as input in both formulas)
into a Google Sheet. Please help. Any feedback is much appreciated... thanks!
The best advice is to find another web table you can import. If you do "view source" on the page, you will find that the table content is dynamically populated from a variable named NF_DATA.
Otherwise, you need to create a document script to extract the data you want:
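function this_is_test() {
  // Fetch the raw page source; the table itself is filled in client-side from NF_DATA.
  var response = UrlFetchApp.fetch("");
  var raw_content = response.getContentText();
  // Match from "daily_projections":[ up to (not including) the first closing bracket.
  var re = new RegExp('"daily_projections":\\[[^\\]]+', 'i');
  var proj = raw_content.match(re);
  return proj;
}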
It will extract the text starting at "daily_projections":[ and running up to the closing ], which is (as of today):
Note that even this is not complete. You still need to somehow map nba_player_id to the appropriate name. Either way, a lot of coding will be involved...
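To see what that pattern captures without fetching the page, here is a small Python sketch that applies the same regular expression to a made-up string. The sample text and the helper name `extract_projections` are assumptions for illustration only; the real page content will differ.

```python
import re

# Same pattern as the Apps Script above: the key, the opening bracket,
# then everything up to (not including) the first closing bracket.
PATTERN = re.compile(r'"daily_projections":\[[^\]]+', re.IGNORECASE)

def extract_projections(raw_content):
    """Return the matched text, or None if the key is not present."""
    match = PATTERN.search(raw_content)
    return match.group(0) if match else None

# Hypothetical page fragment, not real data from the site.
sample = 'var NF_DATA = {"daily_projections":[{"nba_player_id":1,"pts":20.5}],"other":[]};'
print(extract_projections(sample))
# prints: "daily_projections":[{"nba_player_id":1,"pts":20.5}
```

Because the pattern stops at the first `]`, it would cut the data short if any projection entry contained a nested array; parsing the whole NF_DATA object as JSON would be more robust if that turns out to be the case.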

Solr search error when dealing with Arabic string

I have been struggling with Solr search for Arabic for several days and ran some experiments. Here is a simple reproduction of the problem.
After I store an Arabic sentence (for now only one word, السوري) in the database and have Solr index it, I query it with q=*:*&wt=python (without the wt part it showed garbled characters), and the response is:
The actual word I stored for indexing is encoded in another way:
As you can tell, there is a one-to-one correspondence \xd8↔\u00d8. But I don't know the name of this encoding, so I cannot convert it. And when I run the search as <>/select/?q=السوري&wt=python, the response is:
No docs are found, and it seems to use a third encoding, u'\u0627\u0644\u0633\u0648\u0631\u064a'. If I take it and call encode('utf8'), it converts back to '\xd8\xa7\xd9\x84\xd8\xb3\xd9\x88\xd8\xb1\xd9\x8a'.
In summary, when it (السوري) is in my code (Python) or in the database (MySQL),
it is represented as 'form1':
When it is indexed by Solr, it is converted to form2:
And when I use <>/select/?q=السوري&wt=python to query from the browser (Google Chrome), it becomes form3:
(which can be converted back to form1 with encode('utf8')). But since they are different, the search matches nothing.
Therefore, these three different encodings may be the core problem. Could anyone help me figure this out and solve the search problem?
Thanks in advance.
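For reference, the relationship between the three forms can be reproduced in a few lines of Python 3. This is only a sketch of the encodings involved, under the assumption that form2 arises from UTF-8 bytes being read as Latin-1 somewhere between the database and Solr (that would explain the one-to-one \xd8↔\u00d8 mapping); it does not show where in the pipeline this happens. The helper name `repair_mojibake` is hypothetical.

```python
word = "السوري"

# form3: the Unicode code points (what u'\u0627\u0644...' denotes).
form3 = word

# form1: the UTF-8 byte string, as seen in the Python code and in MySQL.
form1 = form3.encode("utf-8")

# form2: each UTF-8 byte misread as one Latin-1 character, so the byte
# 0xd8 becomes the code point U+00D8, 0xa7 becomes U+00A7, and so on.
form2 = form1.decode("latin-1")

def repair_mojibake(text):
    """Undo a UTF-8-read-as-Latin-1 mix-up (assumes that is what happened)."""
    return text.encode("latin-1").decode("utf-8")

print(form1)                            # b'\xd8\xa7\xd9\x84\xd8\xb3\xd9\x88\xd8\xb1\xd9\x8a'
print(repair_mojibake(form2) == form3)  # True
```

If this assumption holds, the indexed value and the query value are different character sequences (form2 versus form3), which would be consistent with the search matching nothing.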

Is there any open topic classifier API for demo purposes?

Is there an open classifier/categorizer API which returns categories/topics for a given entity string? For example, for an input string like "Barack Obama" it should return topics like "Politics"; similarly, for "Albert Einstein" it should emit "Physicist, Nobel Laureate", etc.
Try TexLexAn. It gave me some really good results, but with slightly longer input texts.