Reading documents with r-tm to use with r-mallet

I have this code to fit a topic model with the R wrapper for MALLET:
docs <- mallet.import(DF$document, DF$text, stop_words)
mallet_model <- MalletLDA(num.topics = 4)
mallet_model$loadDocuments(docs)
mallet_model$train(100)
I have used the tm package to read my documents, which are txt files in a directory:
myCorpus <- Corpus(DirSource("data")) # a directory of txt files
The corpus can't be used as input for mallet.import, so how do I get from the tm corpus myCorpus above to a data frame DF that I can call it with?

RMallet is intended to be a standalone package, so integration with tm isn't great. RMallet expects as input a data frame with one row per document and a character field containing the text, which it expects not to be tokenized already.

You can use tidy data principles to process your text and get it ready for input into mallet, with one row per document, as described here.
Also, there are tidiers for the mallet package in tidytext, and you can use them to analyze the output of mallet topic modeling:
# word-topic pairs
tidy(mallet_model)
# document-topic pairs
tidy(mallet_model, matrix = "gamma")
# column needs to be named "term" for "augment"
term_counts <- rename(word_counts, term = word)
augment(mallet_model, term_counts)

How to read a CSV file in Scala

I have a CSV file and I want to read it and store it in a case class. As I know, a CSV is a comma-separated values file, but in my CSV file some of the data itself contains commas, and splitting on every comma creates a new column. So the problem is how to split the data correctly.
1st row:
04/20/2021 16:20(1st column) Here a bunch of basic techniques that suit most businesses, and easy-to-follow steps that can help you create a strategy for your social media marketing goals.(2nd column)
2nd row:
11-07-2021 12:15(1st column) Focus on attracting real followers who are genuinely interested in your content, and make the most of your social media marketing efforts.(2nd column)
import scala.io.Source

var i = 0
var length = 0
val data = Source.fromFile(file) // `file` is the path to the CSV
for (line <- data.getLines) {
  val cols = line.split(",").map(_.trim) // naive split: also breaks on commas inside a field
  length = cols.length
  while (i < length) {
    // println(cols(i))
    i = i + 1
  }
  i = 0
}
If you are reading a complex CSV file then the ideal solution is to use an existing library. Here is a link to the ScalaDex search results for CSV.
ScalaDex CSV Search
However, based on the comments, it appears that you might actually be wanting to read data stored in a Google Sheet. If that is the case, you can utilize the fact that you have some flexibility to save the data in a text file yourself. When I want to read data from a Google Sheet in Scala, the approach I use first is to save the file in a format that isn't hard to read. If the fields have embedded commas but no tabs, which is common, then I will save the file as a TSV and parse that with split("\t").
A simple bit of code that only uses the standard library might look like the following:
val source = scala.io.Source.fromFile("data.tsv")
val data = source.getLines.map(_.split("\t")).toArray
source.close
After this, data will be an Array[Array[String]] with your data in it that you can process as you desire.
Of course, if your data includes both tabs and commas then you'll really want to use one of those more robust external libraries.
You could use the univocity CSV parser for faster parsing.
You can also use it to write CSV files.
Univocity parsers

Need to create dictionary of idf values, associating words with their idf values

I understand how to get the idf values and the vocabulary using the vectorizer. With the vocabulary, the word is the key of the dictionary and the word's frequency is the value; however, what I want the value to be is the idf value.
I haven't been able to try anything because I don't know how to work with sklearn.
from sklearn.feature_extraction.text import TfidfVectorizer
# list of text documents
text = ["The quick brown fox jumped over the lazy dog.",
"The dog.",
"The fox"]
# create the transform
vectorizer = TfidfVectorizer()
# tokenize and build vocab
vectorizer.fit(text)
# summarize
print(vectorizer.vocabulary_)
print(vectorizer.idf_)
# encode document
vector = vectorizer.transform([text[0]])
# summarize encoded vector
print(vector.shape)
print(vector.toarray())
The code provided above is what I was originally trying to work with.
I have since come up with a new solution that does not use scikit:
import math

total_dict = {}
for string in text_array:  # text_array: a list of document strings
    for word in string.split():  # split each document into words
        if word not in total_dict.keys():  # build up a word frequency in the dictionary
            total_dict[word] = 1
        else:
            total_dict[word] += 1
# note: idf is normally based on document frequency (counting each word at most once per document)
for word in total_dict.keys():  # calculate the idf of each word in the dictionary using this url: https://nlpforhackers.io/tf-idf/
    total_dict[word] = math.log(len(text_array) / float(1 + total_dict[word]))
    print("word", word, ":", total_dict[word])
Let me know if the code snippet above is enough to allow a reasonable estimation of what is going on. I included a link to what I was using for guidance.
You can directly use vectorizer.fit_transform(text) the first time.
What it does is build a vocabulary set according to all the words/tokens in text.
You can then use vectorizer.transform(anothertext) to vectorize another text with the same mapping as the previous text.
More explanation:
fit() learns the vocabulary and idf from the training set. transform() transforms the documents based on the vocabulary learned by the previous fit().
So you should only call fit() once, and can transform many times.
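To build the word-to-idf dictionary the title asks for, one option is to combine vectorizer.vocabulary_ (word to column index) with vectorizer.idf_ (one idf value per column). A minimal sketch, reusing the vectorizer fitted on text above:
# vocabulary_ maps each word to its column index; idf_ is indexed by that same column index
idf_dict = {word: vectorizer.idf_[index]
            for word, index in vectorizer.vocabulary_.items()}
print(idf_dict)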

How can I merge multiple tfrecords file into one file?

My question is: if I want to create one tfrecords file for my data, it will take approximately 15 days to finish. The data has 500,000 pairs of templates, and each template is 32 frames (images). To save time, since I have 3 GPUs, I thought I could create three tfrecords files, one on each GPU, and then finish creating the tfrecords in 5 days. But then I searched for a way to merge these three files into one and couldn't find a proper solution.
So, is there any way to merge these three files into one file, or is there any way to train my network by feeding batches of examples extracted from the three tfrecords files, given that I am using the Dataset API?
As the question was asked two months ago, I thought you might already have found the solution. If not, the answer is NO, you do not need to create a single HUGE tfrecord file. Just use the new Dataset API:
import os
import tensorflow as tf

dataset = tf.data.TFRecordDataset(filenames_to_read,
                                  compression_type=None,  # or 'GZIP', 'ZLIB' if you compress your data
                                  buffer_size=10240,  # any buffer size you want, or 0 for no buffering
                                  num_parallel_reads=os.cpu_count())  # or 0 to read sequentially
# Maybe you want to prefetch some data first.
dataset = dataset.prefetch(buffer_size=batch_size)
# Decode the example
dataset = dataset.map(single_example_parser, num_parallel_calls=os.cpu_count())
dataset = dataset.shuffle(buffer_size=number_larger_than_batch_size)
dataset = dataset.batch(batch_size).repeat(num_epochs)
...
For details, check the documentation.
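The single_example_parser above is whatever decoding function matches how you wrote your records; it is not defined in the snippet. A hypothetical sketch, assuming TF 1.13+/2.x (where these helpers live under tf.io) and records holding a serialized image plus an integer label; adjust the feature keys and types to your own data:
import tensorflow as tf

def single_example_parser(serialized_example):
    # Hypothetical feature spec: the keys and types must match how the records were written.
    features = tf.io.parse_single_example(
        serialized_example,
        features={
            "image": tf.io.FixedLenFeature([], tf.string),
            "label": tf.io.FixedLenFeature([], tf.int64),
        })
    image = tf.io.decode_raw(features["image"], tf.uint8)
    return image, features["label"]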
Addressing the question title directly for anyone looking to merge multiple .tfrecord files:
The most convenient approach would be to use the tf.data API:
(adapting an example from the docs)
# Create dataset from multiple .tfrecord files
list_of_tfrecord_files = [dir1, dir2, dir3, dir4]
dataset = tf.data.TFRecordDataset(list_of_tfrecord_files)
# Save dataset to .tfrecord file
filename = 'test.tfrecord'
writer = tf.data.experimental.TFRecordWriter(filename)
writer.write(dataset)
However, as pointed out by holmescn, you'd likely be better off leaving the .tfrecord files as separate files and reading them together as a single tensorflow dataset.
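For that "keep them separate and read them together" route, a minimal sketch, assuming TF's Dataset API and a hypothetical glob pattern for the three files produced on each GPU:
import tensorflow as tf

# Hypothetical glob pattern matching the three .tfrecord files
files = tf.data.Dataset.list_files("data/train_part*.tfrecord")
dataset = (files
           .interleave(tf.data.TFRecordDataset, cycle_length=3)  # read the files in an interleaved fashion
           .shuffle(buffer_size=1000)
           .batch(32))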
You may also refer to a longer discussion regarding multiple .tfrecord files on Data Science Stackexchange
The answer by MoltenMuffins works for higher versions of tensorflow. However, if you are using lower versions, you have to iterate through the three tfrecords and save them into a new record file, as follows. This works for tf versions 1.0 and above.
def comb_tfrecord(tfrecords_path, save_path, batch_size=128):
    with tf.Graph().as_default(), tf.Session() as sess:
        ds = tf.data.TFRecordDataset(tfrecords_path).batch(batch_size)
        batch = ds.make_one_shot_iterator().get_next()
        writer = tf.python_io.TFRecordWriter(save_path)
        while True:
            try:
                records = sess.run(batch)
                for record in records:
                    writer.write(record)
            except tf.errors.OutOfRangeError:
                break
        writer.close()  # flush the combined records to disk
Customizing the above script for better tfrecords listing:
import os
import glob
import tensorflow as tf
save_path = 'data/tf_serving_warmup_requests'
tfrecords_path = glob.glob('data/*.tfrecords')
dataset = tf.data.TFRecordDataset(tfrecords_path)
writer = tf.data.experimental.TFRecordWriter(save_path)
writer.write(dataset)

How to print textual representation of single documents stored in a tm corpus in R?

I was using the {tm} package and generated a corpus using
corpus = Corpus(VectorSource(sample.words))
Then I wanted to check the content of the corpus, but it printed this instead of its texts:
> corpus
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 3933
Now I have found some methods to look into the corpus, but then I started wondering: what exactly does R print when an object's name is typed in?
> class(corpus)
[1] "VCorpus" "Corpus"
> typeof(corpus)
[1] "list"
Why doesn't it print like other ordinary lists, showing its columns and rows? Does this have something to do with the class attribute?
I'm new to R and not familiar with some basic concepts; thanks for your patience!
The introduction document to the tm package says that you can use, say, writeLines(as.character(mycorpus[[4]])) to get a textual representation of document 4.
You can also use content(myCorpus[[23]]).
To read the intro document, enter browseVignettes() at your R prompt and then search for it in the browser window that opens.

Spark/Scala read hadoop file

In a Pig script I saved a table using PigStorage('|').
In the corresponding Hadoop folder I have files like
part-r-00000
etc.
What is the best way to load it in Spark/Scala? In this table I have 3 fields: Int, String, Float.
I tried:
text = sc.hadoopFile("file", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], sc.defaultMinPartitions)
But then I would need somehow to split each line. Is there a better way to do it?
If I were coding in Python, I would create a DataFrame indexed by the first field, with the values found in the string field as columns and the float values as coefficients. But I need to use Scala to use the PCA module, and Scala's dataframes don't seem that close to Python's.
Thanks for the insight
PigStorage creates a text file without schema information, so you need to do that work yourself, with something like:
sc.textFile("file") // or directory where the part files are
val data = csv.map(line => {
vals=line.split("|")
(vals(0).toInt,vals(1),vals(2).toDouble)}
)