How to get the probability of each topic in MALLET

I am doing topic modelling with MALLET. I imported my file (one document per line) and trained MALLET with 200 topics. I now have 200 topics, each with its related words. Now I need to know each topic's probability. How can I find it?
Thank you

The command bin/mallet train-topics has an option --output-doc-topics topic-composition.txt. This outputs a big table in TAB-separated text format containing the topic composition of each text.
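
As a rough sketch of what to do with that table: assuming each line holds a document index, a document name, and then one proportion per topic in topic order (the layout I believe recent MALLET versions write; older versions write topic-id/proportion pairs instead, which this sketch does not handle), averaging each column gives an estimate of each topic's overall share of the corpus. The function name and the sample data are made up for illustration.

```python
import csv
import io

def mean_topic_proportions(lines):
    """Average the per-document topic proportions, column by column."""
    totals = []
    n_docs = 0
    for row in csv.reader(lines, delimiter="\t"):
        # Skip blank lines and a possible '#'-prefixed header line.
        if not row or row[0].startswith("#"):
            continue
        # Assumed layout: doc index, doc name, then one proportion per topic.
        proportions = [float(x) for x in row[2:] if x != ""]
        if not totals:
            totals = [0.0] * len(proportions)
        for k, p in enumerate(proportions):
            totals[k] += p
        n_docs += 1
    return [t / n_docs for t in totals] if n_docs else []

# Tiny made-up sample with 3 topics and 2 documents.
sample = io.StringIO(
    "0\tdoc0\t0.50\t0.25\t0.25\n"
    "1\tdoc1\t0.10\t0.70\t0.20\n"
)
means = mean_topic_proportions(sample)
for topic_id, share in enumerate(means):
    print(f"topic {topic_id}: {share:.3f}")
```

For a real run you would pass an open file handle for topic-composition.txt instead of the in-memory sample.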

Related

How do you send the record matching certain condition/s to specific/multiple output topics in Spring apache-kafka?

I have referred to this, but it is an old post, so I'm looking for a better solution, if any.
I have an input topic that contains 'userActivity' data. I want to gather different analytics based on userInterest, userSubscribedGroup, userChoice, etc., each produced to a distinct output topic from the same Kafka Streams application.
Could you help me achieve this? PS: This is my first time using Kafka Streams, so I'm unaware of any other alternatives.
edit:
It's possible that one record matches multiple criteria, in which case the same record should go to each of the matching output topics.
if(record1 matches criteria1) then... output to topic1;
if(record1 matches criteria2) then ... output to topic2;
and so on.
note: I'm not looking for an else-if kind of solution.
To choose the output topic dynamically at runtime based on each record's key and value, Apache Kafka 2.0 and later provide a feature called dynamic routing.
Here is an example of it: https://kafka-tutorials.confluent.io/dynamic-output-topic/confluent.html
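
Since the question asks for independent checks rather than else-if, one record may need to go to several topics. The selection logic can be sketched in plain Python. This is not Kafka Streams code; the predicates, field values and topic names are hypothetical and only illustrate that every rule is evaluated on its own.

```python
def matching_topics(record, rules):
    """Return every topic whose predicate accepts the record (no else-if)."""
    return [topic for topic, predicate in rules if predicate(record)]

# Hypothetical rules: each one is evaluated independently of the others.
rules = [
    ("topic1", lambda r: r.get("userInterest") == "sports"),
    ("topic2", lambda r: r.get("userSubscribedGroup") is not None),
    ("topic3", lambda r: r.get("userChoice") == "premium"),
]

record = {"userInterest": "sports", "userSubscribedGroup": "runners"}
print(matching_topics(record, rules))  # ['topic1', 'topic2']
```

In a Kafka Streams topology the same effect is usually obtained by applying one filter per criterion to the same input stream and sending each filtered stream to its own topic.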

Get topics from new document based on trained LDA model

I've seen similar questions, but the answers only work with PySpark.
I went through this LDA example in the Spark docs, but I did not find any way to use that model to detect topics (from the topics it found) in a brand-new text or document.
E.g.: if I use the subset of the Reuters dataset, I have the following topics:
comp.graphics rec.motorcycles sci.crypt sci.space talk.religion.misc
comp.sys.ibm.pc.hardware rec.sport.baseball sci.electronics talk.politics.guns
rec.autos rec.sport.hockey sci.med talk.politics.misc
Then I have a model that knows 13 topics, and if I pass a brand-new document about diabetes to the model, I should get back the most suitable topics, e.g. sci.med.
Is it possible to achieve that? If yes, how should I do that?

Reusing PCollection from output of a Transform in another Transform which is later stage of pipeline

In Java or any other programming language, we can save the state of a variable and refer to its value later as needed. This does not seem to be possible with Apache Beam. Can someone confirm? If it is possible, please point me to some samples or documentation.
I am trying to solve the problem below, which needs the output of my previous transform as context.
I am new to Apache Beam, so I am finding it hard to understand how to solve it.
Approach#1:
PCollection config = p.apply(ReadDBConfigFn(options.getPath()));
PCollection<Record> records = config.apply(FetchRecordsFn());
PCollection<Users> users = config.apply(FetchUsersFn());
// Now process using both 'records' and 'users'. How can this be done with Beam?
Approach#2:
PCollection config = p.apply(ReadDBConfigFn(options.getPath()));
PCollection<Record> records = config.apply(FetchRecordsFn()).apply(FetchUsersAndProcessRecordsFn());
// In the line above, 'FetchUsersAndProcessRecordsFn' needs 'config' so it can fetch users, but there seems to be no way to pass it in.
If I understand correctly, you want to use elements from the two collections records and users in a processing step? There are two commonly used patterns in Beam to accomplish this:
If you are looking to join the two collections, you probably want to use a CoGroupByKey to group related records and users together for processing.
If one of the collections (records or users) is "small" and the entire set needs to be available during processing, you might want to send it as a side input to your processing step.
It isn't clear what might be in the PCollection config in your example, so I may have misinterpreted... Does this meet your use case?
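
To make the two patterns concrete, here is a plain-Python imitation of what each one computes. It deliberately does not use the Beam API; the sample data, the keys and both function names are invented for illustration.

```python
from collections import defaultdict

# Made-up keyed data: (user id, value) pairs.
records = [("u1", "login"), ("u2", "purchase"), ("u1", "logout")]
users = [("u1", "Alice"), ("u2", "Bob")]

def co_group_by_key(left, right):
    """Group two keyed collections by key, similar in spirit to CoGroupByKey."""
    grouped = defaultdict(lambda: ([], []))
    for key, value in left:
        grouped[key][0].append(value)
    for key, value in right:
        grouped[key][1].append(value)
    return dict(grouped)

def process_with_side_input(main, side):
    """Process each main element with the whole small collection available."""
    lookup = dict(side)  # plays the role of the side input
    return [(lookup.get(key, "unknown"), value) for key, value in main]

print(co_group_by_key(records, users))
print(process_with_side_input(records, users))
```

The first function corresponds to the join case, where both collections are keyed the same way. The second corresponds to the side-input case, where the smaller collection fits in memory and is looked up while each record is processed.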

Is there any test link for Mpeg DASH in both: type = "dynamic" and with multiple representations (bitrates)?

As the title says:
I have tried to find some, but in most cases,
if the test URL is of type = "dynamic", there is ONLY ONE representation (a single bitrate, so bitrate switching CANNOT be applied).
Does anyone know if there is a test link?
Thanks
There are several DASH data sets and test vectors out there; lots of them are listed in this blog post. Many don't have live streams, but some do (at least simulated live streams).
The DASH IF Test Vectors might be a good starting point. There are several live streams (look at the column mpd_type and search for the value dynamic), and at least some of them should have multiple representations.
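
Once you have a candidate manifest, a quick way to check it is to parse the MPD and look at the type attribute and the Representation elements. The sketch below uses only the Python standard library; the embedded manifest is a made-up minimal example, not one of the real test vectors, and the function name is my own.

```python
import xml.etree.ElementTree as ET

# Namespace used by MPEG-DASH MPD documents.
DASH_NS = {"mpd": "urn:mpeg:dash:schema:mpd:2011"}

def summarize_mpd(mpd_xml):
    """Return (mpd type, list of bandwidth lists, one per AdaptationSet)."""
    root = ET.fromstring(mpd_xml)
    # When the type attribute is absent, the MPD is treated as 'static'.
    mpd_type = root.get("type", "static")
    bitrates = [
        [int(rep.get("bandwidth")) for rep in aset.findall("mpd:Representation", DASH_NS)]
        for aset in root.findall(".//mpd:AdaptationSet", DASH_NS)
    ]
    return mpd_type, bitrates

# Made-up minimal manifest for illustration only.
SAMPLE_MPD = """<MPD xmlns="urn:mpeg:dash:schema:mpd:2011" type="dynamic">
  <Period>
    <AdaptationSet mimeType="video/mp4">
      <Representation id="v1" bandwidth="500000"/>
      <Representation id="v2" bandwidth="1500000"/>
    </AdaptationSet>
  </Period>
</MPD>"""

mpd_type, bitrates = summarize_mpd(SAMPLE_MPD)
print(mpd_type, bitrates)
```

A stream is useful for testing bitrate switching when the type is dynamic and at least one adaptation set lists two or more bandwidths.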

DDS Keyed Topics

I am currently using RTI DDS on a system where we will have one main topic for multiple items, such as a car topic with multiple VINs. Given this design, I am trying to create a "keyed" topic, which is basically a topic that has a member acting as a key (kind of like a primary key in a database); in this example, the key would be the VIN of each car. To implement the keyed topic I am using an IDL file, which is as follows:
const string CAR_TOPIC = "CAR";
enum ALARMSTATUS {
    ON,
    OFF
};
struct keys {
    long vin; //#key
    string make;
    ALARMSTATUS alarm;
};
When I run the IDL file through the rtiddsgen tool, which generates C, Java, etc. files from the IDL, the only thing I can do is run the program and see
Writing keys, count 0
Writing keys, count 1 ...
and
keys subscriber sleeping for 4 sec...
Received:
vin: 38
make:
alarm : ON
keys subscriber sleeping for 4 sec...
Received:
vin: 38
make:
alarm : ON ...
Thus making it hard to see how the keyed topics work and if they are really working at all. Does anyone have any input on what to do with the files generated from the IDL file to make the program more functional? Also I never see the topic CAR so I am not sure I am using the right syntax to set the topic for the DDS.
When you say "the only thing I can do is run the program", it is not clear what "the" program is. I do not recognize the exact output that you give, so did you adjust the code of the generated example?
Anyway, responding to some of your remarks:
Thus making it hard to see how the keyed topics work and if they are really working at all.
The concept of keys is most clearly visible when you have values for multiple instances (that is, different key values) present simultaneously in your DataReader. This is comparable to having a database table containing multiple rows at the same time. So in order to demonstrate the key concept, you will have to assign different values to the key fields on the DataWriter side and write() the resulting samples. This does not happen by default in the generated examples, so you have to adjust the code to achieve that.
On the DataReader side, you will have to make sure that multiple values remain stored to demonstrate the effect. This means that you should not do a take() (which is similar to a "destructive read"), but a read(). This way, the number of values in your DataReader will grow in line with the number of distinct key values that you wrote.
Note that in real life, you should not have the number of key values growing forever, just like you do not want a database table to contain an ever-growing number of rows.
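
The difference between read() and take() on a keyed topic can be illustrated with a toy model in plain Python. This is not the RTI API; the class and method names are invented, and it assumes the reader keeps only the latest sample per key value.

```python
class ToyKeyedReader:
    """Toy model of a reader cache that keeps the latest sample per key."""

    def __init__(self):
        self._instances = {}  # key value (vin) -> latest sample

    def on_sample(self, sample):
        # A sample with an already-known vin replaces the previous one.
        self._instances[sample["vin"]] = sample

    def read(self):
        """Non-destructive: samples stay in the cache."""
        return list(self._instances.values())

    def take(self):
        """Destructive: returns the samples and empties the cache."""
        samples = list(self._instances.values())
        self._instances.clear()
        return samples

reader = ToyKeyedReader()
for vin, alarm in [(38, "ON"), (39, "OFF"), (38, "OFF")]:
    reader.on_sample({"vin": vin, "alarm": alarm})

print(len(reader.read()))  # 2: one entry per distinct vin
print(len(reader.take()))  # 2: same samples, but now removed
print(len(reader.read()))  # 0: nothing left after take()
```

With only one vin ever written, as in the output shown in the question, there is a single instance, so the effect of the key cannot be observed.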
Also I never see the topic CAR so I am not sure I am using the right syntax to set the topic for the DDS.
Check out the piece of code that creates the Topic. The method name depends on the language you use, but should have something like create_topic() in it. The second parameter to that call is the name of the Topic. In general, the IDL constant CAR_TOPIC that you defined will not be automatically used as the name of the Topic; you have to indicate that in the code.
Depending on the example you are running, you could try -h to get some extra flags to use. You might be able to increase verbosity to see the name of the Topic being created, or set the topic name from the command line.
If you want to verify the name of the Topic in your system, you could use rtiddsspy to watch the data flowing. Its output includes the names of the Topics it discovers.