When using Mallet, how do I get a list of topics associated with each document?

When using Mallet, how do I get a list of topics associated with each document? I think I need to use train-topics and --output-topic-docs, but when I do, I get an error.
I'm using Mallet (2.0.8), and I use the following bash script to do my modeling:
MALLET=/Users/emorgan/desktop/mallet/bin/mallet
INPUT=/Users/emorgan/desktop/sermons
OBJECT=./object.mallet
$MALLET import-dir --input $INPUT --output $OBJECT --keep-sequence --remove-stopwords
$MALLET train-topics --input $OBJECT --num-topics 10 --num-top-words 1 \
--num-iterations 50 \
--output-doc-topics ./topics.txt \
--output-topic-keys ./keys.txt \
--xml-topic-report ./topic.xml \
--output-topic-docs ./docs.txt
Unfortunately, ./docs.txt does not get created. Instead I get the following error:
Exception in thread "main" java.lang.ClassCastException: java.net.URI cannot be cast to java.lang.String
at cc.mallet.topics.ParallelTopicModel.printTopicDocuments(ParallelTopicModel.java:1773)
at cc.mallet.topics.tui.TopicTrainer.main(TopicTrainer.java:281)
More specifically, I want Mallet to generate a list of documents and the associated topics assigned to them, or I want a list of topics and then the list of associated documents. How do I create such lists?

At least in Mallet 2.0.7, it is --output-doc-topics ./topics.txt that produces the desired table (the topic composition of each document). While the output format changed between 2.0.7 and 2.0.8, the main content of the file stayed the same.
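That table is easy to post-process. A minimal Python sketch, assuming the 2.0.8-style layout (document index, source name, then one proportion column per topic; the 2.0.7 topic/proportion-pair layout would need different slicing):

```python
def doc_topics(lines, threshold=0.0):
    """Map each document to its topics, sorted by proportion (descending).

    Expects Mallet --output-doc-topics lines of the assumed form:
    <doc-index> <source-name> <prop-topic-0> <prop-topic-1> ...
    Header/comment lines starting with '#' are skipped.
    """
    result = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        source = fields[1]
        proportions = [float(p) for p in fields[2:]]
        ranked = sorted(enumerate(proportions), key=lambda t: -t[1])
        result[source] = [(t, p) for t, p in ranked if p > threshold]
    return result

# Usage sketch:
# with open("topics.txt") as f:
#     print(doc_topics(f))
```

Inverting the resulting dictionary gives the second view you asked for: a list of topics with their associated documents.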

Related

How to read new files uploaded in last one hour using pyspark filestream?

I am trying to read the latest files (say, new files from the last hour) available in a directory and load that data. I am trying with PySpark Structured Streaming. I have tried the maxFileAge option of Spark streaming, but it still loads all the files in the directory, regardless of the time specified in the option.
spark.readStream\
.option("maxFileAge", "1h")\
.schema(cust_schema)\
.csv(upload_path) \
.withColumn("closing_date", get_date_udf_func(input_file_name()))\
.writeStream.format('parquet') \
.trigger(once=True) \
.option('checkpointLocation', checkpoint_path) \
.option('path', write_path) \
.start()
Above is the code that I tried, but it loads all available files regardless of time. Please point out what I am doing wrong here.

How to change the metadata of all specific file of exist objects in Google Cloud Storage?

I have uploaded thousands of files to Google Cloud Storage, and I found out that all the files are missing their content-type, so my website cannot serve them correctly.
I wonder if I can set some kind of policy, like changing the content-type of all the files at the same time. For example, I have a bunch of .html files inside the bucket:
a/b/index.html
a/c/a.html
a/c/a/b.html
a/a.html
...
Is it possible to set the content-type of all the .html files in these different places with one command?
You could do:
gsutil -m setmeta -h Content-Type:text/html gs://your-bucket/**.html
There is no single command to achieve exactly the behavior you are looking for (one command to edit every object's metadata); however, there is a gsutil command to edit metadata which you could use in a bash script to loop through all the objects inside the bucket.
1.- Use the gsutil "setmeta" command in a bash script:
# list every object in the bucket and edit its metadata one by one
for OBJECT in $(gsutil ls "gs://[BUCKET_NAME]/**"); do
    gsutil setmeta -h "[METADATA_KEY]:[METADATA_VALUE]" "$OBJECT"
done
2.- You could also use the C++ client library to achieve the same thing:
namespace gcs = google::cloud::storage;
using ::google::cloud::StatusOr;
[](gcs::Client client, std::string bucket_name, std::string key,
   std::string value) {
  // List all the objects; inside the loop, edit the metadata of each one.
  for (auto&& item : client.ListObjects(bucket_name)) {
    if (!item) continue;  // skip entries that failed to list
    std::string object_name = item->name();
    StatusOr<gcs::ObjectMetadata> object_metadata =
        client.GetObjectMetadata(bucket_name, object_name);
    if (!object_metadata) continue;
    gcs::ObjectMetadata desired = *object_metadata;
    desired.mutable_metadata().emplace(key, value);
    StatusOr<gcs::ObjectMetadata> updated = client.UpdateObject(
        bucket_name, object_name, desired,
        gcs::Generation(object_metadata->generation()));
  }
}
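The same loop can be sketched in Python with the google-cloud-storage client. The bucket name is a placeholder, and the .html selection rule is an assumption about which objects you want to patch:

```python
def is_html(name: str) -> bool:
    # Assumption: only objects ending in .html need their Content-Type fixed.
    return name.endswith(".html")

def fix_content_types(bucket_name: str) -> None:
    """Patch the Content-Type of every matching object in the bucket."""
    from google.cloud import storage  # pip install google-cloud-storage

    client = storage.Client()
    for blob in client.list_blobs(bucket_name):
        if is_html(blob.name):
            blob.content_type = "text/html"
            blob.patch()  # sends only the changed metadata fields

# fix_content_types("your-bucket")
```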

spark structured streaming file source read from a certain partition onwards

I have a folder on HDFS like below containing ORC files:
/path/to/my_folder
It contains partitions:
/path/to/my_folder/dt=20190101
/path/to/my_folder/dt=20190102
/path/to/my_folder/dt=20190103
...
Now I need to process the data here using streaming.
A spark.readStream.format("orc").load("/path/to/my_folder") nicely works.
However, I do not want to process the whole table, but rather start only from a certain partition onwards, similar to a certain Kafka offset.
How can this be implemented? I.e., how can I specify the initial state to read from?
Spark Structured Streaming File Source Starting Offset claims that there is no such feature.
Their suggestion to use latestFirst is not desirable for my use case, as I do not aim to build an always-on streaming application but rather to use Trigger.Once like a batch job, with the nice streaming semantics of duplicate reduction and handling of late-arriving data.
If this is not available, what would be a suitable workaround?
edit
Run warn-up stream with option("latestFirst", true) and
option("maxFilesPerTrigger", "1") with checkpoint, dummy sink and huge
processing time. This way, warm-up stream will save latest file
timestamp to checkpoint.
Run real stream with option("maxFileAge", "0"), real sink using the
same checkpoint location. In this case stream will process only newly
available files.
https://stackoverflow.com/a/51399134/2587904
Building from this idea, let's look at an example:
# in bash
rm -rf data
mkdir -p data/dt=20190101
echo "1,1,1" >> data/dt=20190101/1.csv
echo "1,1,2" >> data/dt=20190101/2.csv
mkdir data/dt=20190102
echo "1,2,1" >> data/dt=20190102/1.csv
echo "1,2,2" >> data/dt=20190102/2.csv
mkdir data/dt=20190103
echo "1,3,1" >> data/dt=20190103/1.csv
echo "1,3,2" >> data/dt=20190103/2.csv
mkdir data/dt=20190104
echo "1,4,1" >> data/dt=20190104/1.csv
echo "1,4,2" >> data/dt=20190104/2.csv
spark-shell --conf spark.sql.streaming.schemaInference=true
// from now on in scala
val df = spark.readStream.csv("data")
df.printSchema
val query = df.writeStream.format("console").start
query.stop
// cleanup the data and start from scratch.
// this time instead of outputting to the console, write to file
val query = df.writeStream.format("csv")
.option("path", "output")
.option("checkpointLocation", "checkpoint")
val started = query.start
// in bash
# generate new data
mkdir data/dt=20190105
echo "1,5,1" >> data/dt=20190105/1.csv
echo "1,5,2" >> data/dt=20190105/2.csv
echo "1,4,3" >> data/dt=20190104/3.csv
// in scala
started.stop
// cleanup the output, start later on with custom checkpoint
//bash: rm -rf output/*
val started = query.start
// bash
echo "1,4,3" >> data/dt=20190104/4.csv
started.stop
// *****************
//bash: rm -rf output/*
Everything works as intended. The operation picks up where the checkpoint marks the last processed file.
How can a checkpoint definition be generated by hand such that all files in dt=20190101 and dt=20190102 count as processed, no late-arriving data is tolerated there anymore, and processing continues with all the files from dt=20190103 onwards?
I see that spark generates:
commits
metadata
offsets
sources
_spark-metadata
files and folders.
So far I only know that _spark-metadata can be ignored to set an initial state / checkpoint.
But I have not yet figured out (from the other files) which minimal values need to be present so that processing picks up from dt=20190103 onwards.
edit 2
By now I know that:
commits/0 needs to be present
metadata needs to be present
offsets needs to be present
but can be very generic:
v1
{"batchWatermarkMs":0,"batchTimestampMs":0,"conf":{"spark.sql.shuffle.partitions":"200"}}
{"logOffset":0}
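A Python sketch that writes those minimal files by hand. The offsets content mirrors the generic values above; the metadata and commits/0 contents are assumptions about what Spark stores there (a stream id and a versioned commit record), so treat this as a starting point rather than a verified recipe:

```python
import json
import os

def seed_checkpoint(checkpoint_dir):
    """Create the minimal checkpoint skeleton: metadata, offsets/0, commits/0."""
    os.makedirs(os.path.join(checkpoint_dir, "offsets"), exist_ok=True)
    os.makedirs(os.path.join(checkpoint_dir, "commits"), exist_ok=True)

    # Assumption: metadata holds only a stream id (normally a random UUID).
    with open(os.path.join(checkpoint_dir, "metadata"), "w") as f:
        json.dump({"id": "00000000-0000-0000-0000-000000000000"}, f)

    # The very generic offsets file, as shown above.
    with open(os.path.join(checkpoint_dir, "offsets", "0"), "w") as f:
        f.write("v1\n")
        f.write(json.dumps({"batchWatermarkMs": 0, "batchTimestampMs": 0,
                            "conf": {"spark.sql.shuffle.partitions": "200"}}) + "\n")
        f.write(json.dumps({"logOffset": 0}) + "\n")

    # Assumption: an "empty" commit record for batch 0.
    with open(os.path.join(checkpoint_dir, "commits", "0"), "w") as f:
        f.write("v1\n{}\n")
```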
When I tried to remove one of the already processed and committed files from sources/0/0, the query still runs, but instead of processing only data newer than the existing committed data, it processes any data, in particular the files I had just removed from the log.
How can I change this behavior to only process data more current than the initial state?
edit 3
The docs (https://jaceklaskowski.gitbooks.io/spark-structured-streaming/spark-sql-streaming-FileStreamSource.html) as well as the javadocs list getOffset:
The maximum offset (getOffset) is calculated by fetching all the files
in path excluding files that start with _ (underscore).
That sounds interesting, but so far I have not figured out how to use it to solve my problem.
Is there a simpler way to achieve the desired functionality besides creating a custom (copy) of the FileSource?
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala#L237
maxFileAge also sounds interesting.
I have started to work on a custom file stream source but fail to properly instantiate it: https://gist.github.com/geoHeil/6c0c51e43469ace71550b426cfcce1c1
When calling:
val df = spark.readStream.format("org.apache.spark.sql.execution.streaming.StatefulFileStreamSource")
.option("partitionState", "/path/to/data/dt=20190101")
.load("data")
The operation fails with:
InstantiationException: org.apache.spark.sql.execution.streaming.StatefulFileStreamSource
at java.lang.Class.newInstance(Class.java:427)
at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:196)
at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:88)
at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:88)
at org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:150)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:159)
... 53 elided
Caused by: java.lang.NoSuchMethodException: org.apache.spark.sql.execution.streaming.StatefulFileStreamSource.<init>()
at java.lang.Class.getConstructor0(Class.java:3082)
at java.lang.Class.newInstance(Class.java:412)
... 59 more
Even though it is basically a copy of the original source, what is different? Why is the constructor not found from https://github.com/apache/spark/blob/v2.2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L196, even though it works just fine for https://github.com/apache/spark/blob/v2.2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala#L42?
Even:
touch -t 201801181205.09 data/dt=20190101/1.csv
touch -t 201801181205.09 data/dt=20190101/2.csv
val df = spark.readStream
.option("maxFileAge", "2d")
.csv("data")
returns the whole dataset and fails to filter down to only the k most recent days.

Batch Filtering with Multi-Filter throws a 'Class attribute not set' exception

We have a data set of 15k classified tweets with which we need to perform sentiment analysis, and I would like to test against a test set of 5k classified tweets. Because Weka needs the same attributes in the header of the test set as in the header of the training set, I will have to use batch filtering if I want to run my classifier against this 5k test set.
However, there are several filters that I need to run my training set through, so I figured running a MultiFilter against the training set would be a good idea. The MultiFilter works fine when not running the batch argument, but when I try to batch filter I get an error from the CLI as it tries to execute the first filter within the multi-filter:
CLI multiFilter command w/batch argument:
java weka.filters.MultiFilter -F "weka.filters.supervised.instance.Resample -B 1.0 -S 1 -Z 15.0 -no-replacement" \
-F "weka.filters.unsupervised.attribute.StringToWordVector -R first-last -W 100000 -prune-rate -1.0 -N 0 -S -stemmer weka.core.stemmers.NullStemmer -M 2 -tokenizer weka.core.tokenizers.AlphabeticTokenizer" \
-F "weka.filters.unsupervised.attribute.Reorder -R 2-last,first"\
-F "weka.filters.supervised.attribute.AttributeSelection -E \"weka.attributeSelection.InfoGainAttributeEval \" -S \"weka.attributeSelection.Ranker -T 0.0 -N -1\"" \
-F weka.filters.AllFilter \
-b -i input\Train.arff -o output\Train_b_out.arff -r input\Test.arff -s output\Test_b_out.arff
Here is the resultant error from the CLI:
weka.core.UnassignedClassException: weka.filters.supervised.instance.Resample: Class attribute not set!
at weka.core.Capabilities.test(Capabilities.java:1091)
at weka.core.Capabilities.test(Capabilities.java:1023)
at weka.core.Capabilities.testWithFail(Capabilities.java:1302)
at weka.filters.Filter.testInputFormat(Filter.java:434)
at weka.filters.Filter.setInputFormat(Filter.java:452)
at weka.filters.SimpleFilter.setInputFormat(SimpleFilter.java:195)
at weka.filters.Filter.batchFilterFile(Filter.java:1243)
at weka.filters.Filter.runFilter(Filter.java:1319)
at weka.filters.MultiFilter.main(MultiFilter.java:425)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at weka.gui.SimpleCLIPanel$ClassRunner.run(SimpleCLIPanel.java:265)
And here are the headers with a portion of data for both the training and test input arffs:
Training:
@RELATION classifiedTweets
@ATTRIBUTE ##sentence## string
@ATTRIBUTE ##class## {1,-1,0}
@DATA
"Conditioning be very important for curly dry hair",0
"Combine with Sunday paper coupon and",0
"Price may vary by store",0
"Oil be not really moisturizers",-1
Testing:
@RELATION classifiedTweets
@ATTRIBUTE ##sentence## string
@ATTRIBUTE ##class## {1,-1,0}
@DATA
"5",0
"I give the curl a good form and discipline",1
"I have be cowashing every day",0
"LOL",0
"TITLETITLE Walgreens Weekly and Midweek Deal",0
"And then they walk away",0
Am I doing something wrong here? I know that supervised resampling requires the class attribute to be on the bottom of the attribute list within the header, and it is... within both the test and training input files.
EDIT:
Further testing reveals that this error does not occur in relation to the batch filtering; it occurs whenever I run the supervised Resample filter from the CLI. The data that I use works with every other filter I've tried in the CLI, so I don't understand why this filter is any different. Resampling the data in the GUI works fine as well.
Update:
This also happens with the SMOTE filter instead of the resample filter
Could not get the batch filter to work with any resampling filter. However, our workaround was to simply resample (and then randomize) the training data as step 1. From this reduced set we ran batch filters for everything else we wanted on the test set. This seemed to work fine.
You could have used the MultiFilter along with the ClassAssigner filter to make it work:
java -classpath $jcp weka.filters.MultiFilter
-F "weka.filters.unsupervised.attribute.ClassAssigner -C last"
-F "weka.filters.supervised.instance.Resample -B 1.0 -S 1 -Z 66.0"

hadoop-streaming example failed to run - Type mismatch in key from map

I was running $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-D stream.map.output.field.separator=. \
-D stream.num.map.output.key.fields=4 \
-input myInputDirs \
-output myOutputDir \
-mapper org.apache.hadoop.mapred.lib.IdentityMapper \
-reducer org.apache.hadoop.mapred.lib.IdentityReducer
What should be the input file when IdentityMapper is the mapper?
I was hoping to see that it can sort on certain selected keys and not the entire key. My input file is simple:
"aa bb"
"cc dd"
Not sure what I missed? I always get this error:
java.lang.Exception: java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.Text, recieved org.apache.hadoop.io.LongWritable
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:371)
Caused by: java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.Text, recieved org.apache.hadoop.io.LongWritable
This is a known bug and here is the JIRA. The bug was identified in Hadoop 0.21.0, but I don't think the fix is in any of the Hadoop release versions. If you are really interested in fixing this, you can:
download the source code for Hadoop (for the release you are working)
download the patch from JIRA and apply it
build and test Hadoop
Here are the instructions on how to apply a patch.
Or, instead of using the IdentityMapper and IdentityReducer, use a Python/Perl script which reads the k/v pairs from STDIN and writes the same k/v pairs to STDOUT without any processing. It's like creating your own IdentityMapper and IdentityReducer without using Java.
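Such a pass-through script might look like this in Python (a sketch; Hadoop Streaming frames records as tab-separated key/value lines on stdin/stdout):

```python
#!/usr/bin/env python
import sys

def identity(stream_in, stream_out):
    """Echo every tab-separated key/value line unchanged, like IdentityMapper."""
    for line in stream_in:
        stream_out.write(line)

if __name__ == "__main__":
    identity(sys.stdin, sys.stdout)
```

The same script can then be passed as both the -mapper and the -reducer arguments.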
I was trying my hand at Hadoop with my own example, but got the same error. I used KeyValueTextInputFormat to resolve the issue. You can have a look at the following blog for the same:
http://sanketraut.blogspot.in/2012/06/hadoop-example-setting-up-hadoop-on.html
Hope it helps you.
Peace.
Sanket Raut