Not able to see the text after doing some pre-processing with the "tm" package

corpus = tm_map(corpus, tolower)
corpus = tm_map(corpus, PlainTextDocument)
corpus[[1]]
I was expecting the tweet text in lowercase; instead it shows the following:
<<PlainTextDocument>>
Metadata: 7
Content: chars: 101


How to export data from a txt file (lines with values separated by ",") to csv, so the csv file has 6 columns even if some lines have only 4 or 5 values?

I am a total beginner and I cannot find an answer, so this is my last resort. Please help.
I have a txt file with lines of data separated by commas. Some lines have 6 values, but some have only 5 or 4.
I am trying to export this data to csv so that it has 6 columns; if a line in the txt file has only 4 values, the missing data should appear in the columns as zeros or NaNs.
Txt:
ZEQ-851,Toyota Corolla,63,Air Conditioning,Hybrid,Automatic Transmission
BMC-69,Nissan Micra,42,Manual Transmission
And I need to format it this way:
Plate, Model, Number, Property1, Property2, Property3
ZEQ-851,Toyota Corolla,63,Air Conditioning,Hybrid,Automatic Transmission
BMC-69,Nissan Micra,42,0,0,Manual Transmission
Thank you!
This is what I was trying to do:
import csv

with open('Vehicles.txt', 'r') as in_file:
    stripped = (line.strip() for line in in_file)
    lines = (line.split(",") for line in stripped if line)
    with open('Vehicles.csv', 'w', newline='') as out_file:
        writer = csv.writer(out_file)
        writer.writerow(('Reg. nr', 'Model', 'PRD', 'Prop1', 'Prop2', 'Prop3'))
        writer.writerows(lines)
And then:
all_cars = []
print("The following cars are available:\n")
with open('Vehicles.csv', 'r') as garage:
    reader = csv.reader(garage)
    next(reader)  # skip the header row
    for row in reader:
        regnum = row[0]
        model = row[1]
        ppd = row[2]
        prop1 = row[3]
        prop2 = row[4]
        prop3 = row[5]
        veh = Vehicle(regnum=regnum, model=model, ppd=ppd, prop1=prop1, prop2=prop2, prop3=prop3)
        all_cars.append(veh)
for each in all_cars:
    print("* Reg. nr: ", each.regnum, ", Model: ", each.model, ", Price per day: ", each.ppd, "\nProperties: ", each.prop1, " ", each.prop2, " ", each.prop3, ".", sep="")
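For the padding itself, one approach is to pad each split line to six fields before writing it. This is a sketch, not a definitive answer: the six-column width, the "0" filler, and the choice to insert the zeros before the last field are all inferred from the example output above.

```python
import csv
import io

def pad_row(fields, width=6, filler="0"):
    # Insert the filler before the last field so a trailing property such as
    # "Manual Transmission" stays in the final column, matching the example output.
    missing = width - len(fields)
    if missing > 0:
        fields = fields[:-1] + [filler] * missing + fields[-1:]
    return fields

lines = [
    "ZEQ-851,Toyota Corolla,63,Air Conditioning,Hybrid,Automatic Transmission",
    "BMC-69,Nissan Micra,42,Manual Transmission",
]

out = io.StringIO()  # stands in for the csv file on disk
writer = csv.writer(out)
writer.writerow(('Reg. nr', 'Model', 'PRD', 'Prop1', 'Prop2', 'Prop3'))
for line in lines:
    writer.writerow(pad_row(line.split(",")))
print(out.getvalue())
```

Every row written this way has exactly six fields, so the reading loop above can index row[0] through row[5] without an IndexError.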

ruamel.yaml edit document before dump

I have a YAML file which has documents of this sort
%YAML 1.2
---
!some_tag
name: xyz
constants:
  state: abc
After reading in the documents, and before dumping with ruamel.yaml.YAML().dump, I want to remove this part of the document:
%YAML 1.2
---
My output file should have just these sections of the document
!some_tag
name: xyz
constants:
  state: abc
How can this be done?
If the loaded document has an explicit version directive, the .version attribute on the
YAML() instance gets set. You can either instantiate a new instance for dumping
or just set that attribute to None:
import sys
import pathlib
import ruamel.yaml
inp = pathlib.Path('input.yaml')
yaml = ruamel.yaml.YAML()
data = yaml.load(inp)
print(f'>>> version info {yaml.version}\n')
yaml.version = None
yaml.dump(data, sys.stdout)
which gives:
>>> version info (1, 2)
!some_tag
name: xyz
constants:
state: abc

Error when importing a tm VCorpus into a quanteda corpus

This code snippet worked just fine until I updated R (3.6.3) and RStudio (1.2.5042) yesterday, though it is not obvious to me that the update is the source of the problem.
In a nutshell, I convert 91 PDF files into a volatile corpus named Vcorp and confirm that I created a volatile corpus as follows:
> Vcorp <- VCorpus(VectorSource(citiesText))
> class(Vcorp)
[1] "VCorpus" "Corpus"
Then I attempt to import this tm VCorpus into quanteda, but keep getting an error message that I did not get before (e.g. the day before the update).
> data(Vcorp, package = "tm")
> citiesCorpus <- corpus(Vcorp)
Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 8714, 91
Any suggestions? Thank you.
Impossible to know the exact problem without (a) version information on your packages and (b) a reproducible example.
Why use tm at all? You could have created a quanteda corpus directly as:
corpus(citiesText)
Converting a VCorpus works fine for me.
library("quanteda")
## Package version: 2.0.1
library("tm")
packageVersion("tm")
## [1] ‘0.7.7’
reut21578 <- system.file("texts", "crude", package = "tm")
VCorp <- VCorpus(
  DirSource(reut21578, mode = "binary"),
  list(reader = readReut21578XMLasPlain)
)
corpus(VCorp)
## Corpus consisting of 20 documents and 16 docvars.
## text1 :
## "Diamond Shamrock Corp said that effective today it had cut i..."
##
## text2 :
## "OPEC may be forced to meet before a scheduled June session t..."
##
## text3 :
## "Texaco Canada said it lowered the contract price it will pay..."
##
## text4 :
## "Marathon Petroleum Co said it reduced the contract price it ..."
##
## text5 :
## "Houston Oil Trust said that independent petroleum engineers ..."
##
## text6 :
## "Kuwait's Oil Minister, in remarks published today, said ther..."
##
## [ reached max_ndoc ... 14 more documents ]

Trying to load a corpus for CountVectorizer in the sklearn package

I am trying to load a corpus from my local drive into Python with a for loop, reading each text file and saving it for analysis with CountVectorizer. But I am only getting the last file. How do I get the results from all of the files stored for analysis with CountVectorizer?
This code only ends up with the text from the last file in the folder.
import glob
import os

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

folder_path = "folder"

# import and read all files in animal_corpus
for filename in glob.glob(os.path.join(folder_path, '*.txt')):
    with open(filename, 'r') as f:
        txt = f.read()
        print(txt)

MyList = [txt]  # <-- only holds the text of the last file read

## Create a CountVectorizer object that you can use
MyCV1 = CountVectorizer()
## Call your MyCV1 on the data
DTM1 = MyCV1.fit_transform(MyList)
## get col names
ColNames = MyCV1.get_feature_names()
print(ColNames)
## convert DTM to DF
MyDF1 = pd.DataFrame(DTM1.toarray(), columns=ColNames)
print(MyDF1)
This code works, but would not work for the huge corpus I am preparing it for.
# import and read text files
f1 = open("folder/animal_1.txt", 'r')
f1r = f1.read()
f2 = open("folder/animal_2.txt", 'r')
f2r = f2.read()
f3 = open("folder/animal_3.txt", 'r')
f3r = f3.read()
#reassemble corpus in python
MyCorpus=[f1r, f2r, f3r]
## Create a CountVectorizer object that you can use
MyCV1 = CountVectorizer()
## Call your MyCV1 on the data
DTM1 = MyCV1.fit_transform(MyCorpus)
## get col names
ColNames=MyCV1.get_feature_names()
print(ColNames)
## convert DTM to DF
MyDF2 = pd.DataFrame(DTM1.toarray(), columns=ColNames)
print(MyDF2)
I figured it out. Just gotta keep grinding.
MyCorpus = []

# import and read all files in animal_corpus
for filename in glob.glob(os.path.join(folder_path, '*.txt')):
    with open(filename, 'r') as f:
        txt = f.read()
        MyCorpus.append(txt)
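The same accumulate-in-a-list pattern, as a self-contained sketch. The folder, file names, and contents here are made up for illustration; a real run would point folder_path at the actual corpus directory.

```python
import glob
import os
import tempfile

# Stand-in corpus: a throwaway folder with a few sample text files.
folder_path = tempfile.mkdtemp()
for i, text in enumerate(["cats purr", "dogs bark", "birds sing"], start=1):
    with open(os.path.join(folder_path, f"animal_{i}.txt"), "w") as f:
        f.write(text)

# Accumulate every file's text instead of overwriting one variable.
MyCorpus = []
for filename in sorted(glob.glob(os.path.join(folder_path, "*.txt"))):
    with open(filename, "r") as f:
        MyCorpus.append(f.read())

print(MyCorpus)  # one string per file, ready for CountVectorizer.fit_transform
```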

Aligning and italicising table column headings using Rmarkdown and pander

I am writing an R Markdown document, knitted to PDF, with tables taken from portions of lists returned by ezANOVA (from the ez package). The tables are made using the pander package. A toy R Markdown file with a toy dataset is below.
---
title: "Table Doc"
output: pdf_document
---
```{r global_options, include=FALSE}
#set global knit options parameters.
knitr::opts_chunk$set(fig.width=12, fig.height=8, fig.path='Figs/',
echo=FALSE, warning=FALSE, message=FALSE, dev = 'pdf')
```
```{r, echo=FALSE}
# toy data
id <- rep(c(1,2,3,4), 5)
group1 <- factor(rep(c("A", "B"), 10))
group2 <- factor(rep(c("A", "B"), each = 10))
dv <- runif(20, min = 0, max = 10)
df <- data.frame(id, group1, group2, dv)
```
```{r anova, echo = FALSE}
library(ez)
library(plyr)
library(pander)
# create anova object
anOb <- ezANOVA(df,
                dv = dv,
                wid = id,
                between = c(group1, group2),
                type = 3,
                detailed = TRUE)
# extract the output table from the anova object, reduce it down to only desired columns
anOb <- data.frame(anOb[[1]][, c("Effect", "F", "p", "p<.05")])
# format entries in columns
anOb[,2] <- format( round (anOb[,2], digits = 1), nsmall = 1)
anOb[,3] <- format( round (anOb[,3], digits = 4), nsmall = 1)
pander(anOb, justify = c("left", "center", "center", "right"))
```
Now I have two problems:
a) For the last three columns, I would like the column heading aligned in the center but the actual column entries underneath aligned to the right.
b) I would like the column headings 'F' and 'p' in italics, and the 'p' in the 'p<.05' column in italics as well, with the rest in normal font.
I tried renaming the column headings using plyr::rename like so:
anOb <- rename(anOb, c("F" = "italic(F)", "p" = "italic(p)", "p<.05" = ""))
But it didn't work.
In markdown, you have to use the markdown syntax for italics, which is wrapping the text between stars or underscores:
> names(anOb) <- c('Effect', '*F*', '*p*', '*p<.05*')
> pander(anOb)
-----------------------------------------
Effect *F* *p* *p<.05*
--------------- ------ -------- ---------
(Intercept) 52.3 0.0019 *
group1 1.3 0.3180
group2 2.0 0.2261
group1:group2 3.7 0.1273
-----------------------------------------
If you want to do that in a programmatic way, you can also use the pandoc.emphasis helper function to add the stars to a string.
But your other problem is due to a bug in the package, for which I've just proposed a fix on GH. Please feel free to give that branch a try and report back on GH; I will try to get some time later this week to clean up the related unit tests and merge the branch if everything seems to be OK.