how to import file to mallet for topic modelling - mallet

I wanna use mallet for topic modelling and I have a question.My data is in a file one instance per line.But I didnt consider any label or instance name.So each line starts with the text.Is it required to have those labels or instance names?

I am not sure about what exactly do you want.
For me, in Windows, I put all my data in a folder like "D:\Data\test1", in "test1" folder, there are a number of .txt files, each of them is one instance.
Then I use bin\mallet import-dir --input D:\Data\test1 --output test1.mallet --keep-sequence --remove-stopwords --extra-stopwords extra.txt to generate the model.
I wish this could help. BTW, you can generate separate .txt files using Word or Excel Macro.

Related

How to generate a 10000 lines test file from original file with 10 lines?

I want to test an application with a file containing 10000 lines of records (plus header and footer lines). I have a test file with 10 lines now, so I want to duplicate these line 1000 times. I don't want to create a C# code in my app to generate that file (is only for test), so I am looking for a different and simple way to do that.
What kind of tool can I use to do that? CMD? Visual Studio/VS Code extension? Any thought?
If your data is textual, load the 10 records from your test file into an editor. Select all, copy, insert at the end of file. Repeat until the file is of length 10000+
This procedure requires ceil(log_2(1000)) cycles, 10 in your case, in general ceil(log_2(<target_number_of_lines>/<base_number_of_lines>)).
Alternative (large files)
Modern editors should not have performance problems here. However, the principle can be applied using a cat cli command. Assuming that you copy the original file into a file named dup0.txt proceed as follows:
cat dup0.txt dup0.txt >dup1.txt
cat dup1.txt dup1.txt >dup0.txt
leaving you with the quadrupled number of lines in dup0.txt.

copy pairs of files with same filename and different file extension

So I'm trying to copy files for statements so there is a pdf and an xml file with the same name and different file extensions. I need to copy lets say 1000 files at a time so that would be 500 pairs of pdf and xml files. I'm trying setup a batch file to do this, but I'm not sure how to get it to check the filenames to get the pair to copy at the same time. The reason I need the pair is because if just one or the other is there it will bomb out and not process. Any help is greatly appreciated.
Thanks,
Corey

Talend tWaitForFile insufficiency

We have a producer process that write files into a specific folder, which run continuously, we have to read files one by one using talend, there is 2 issues:
The 1st: tWaitForFile read only files which exist before its starting, so files which have created after the component starting are not visible for it.
The 2nd: There is no way to know if the file is released by the producer process, it may be read while it is not completely written, the parameter _wait_release_ of tWaitForFile does not work on Linux system !
So how can make Talend read complete written files from a directory that have an increasing files number ?
I'm not sure what you mean by your first issue. tWaitForFile has options to trigger when files are created, modified or deleted in a folder.
As for the second issue, your best bet here is for the file producer to be creating an OK or control file which is a 0 byte touch when it has finished writing the file you want.
In this case you simply look for the appearance of the OK file and then pick up the relevant completed file. If you name the 2 files the same but with a different file extension (the OK file is typically called ".OK" then this should be easy enough to look for. So you would set your tWaitForFile to look for "*.OK" files and then connect this to an iterate to a tFileInputDelimited (in the case you want to pick up a delimited text file) and then declare the file name as ((String)globalMap.get("tWaitForFile_1_CREATED_FILE")).substring(0,((String)globalMap.get("tWaitForFile_1_CREATED_FILE")).length()-3) + ".txt"
I've included some screenshots to help you below:

Extracting file names from an online data server in Matlab

I am trying to write a script that will allow me to download numerous (1000s) of data files from a data server (e.g, http://hydro1.sci.gsfc.nasa.gov/thredds/catalog/GLDAS_NOAH10SUBP_3H/2011/345/). Unfortunately, the names of the files in each directory are not formatted in a similar way (the time that they were created were appended to the end of the file name). I need to be able to specify the file name to subset the data (I have a special tool for these data types) and download it. I cannot find a function in matlab that will extract the file names.
I have looked at URLREAD, but it downloads everything including html code.
Thanks for your help!
You can easily parse the link.
x=urlread(url)
links=regexp(x,'<a href=''([^>]+)''>','tokens')
Reads every link, you have to filter all unwanted links.
For example this gets all grb files:
a=regexp(x,'<a href=''([^>]+.grb)''>','tokens')

Grouping two files into one custom file-type

I am currently working on a simple tower defense game for iOS (using objective-c), which contains several maps/levels. However, as it is now, each map consists of an image file and a .plist file with information. My question is: is there any way I could create a custom file type (for example, *.map) that contains both the image and the information from the plist?
If this is possible, how do I implement this?
Thanks in advance!
You have several good choices for that:
The simplest solution would be grouping the related files in subfolders: rather than having xyz.map file, you could have an xyz sub-folder, and reference the files out of it. You would not need to use any additional libraries for this, and you would be able to use the same name for all your image files and all your level files, because they would be in separate folders.
You can make a zip archive with the files that you would like to combine, and unzip it before use. Here is a link to an answer referencing a library to do it.
You can use a tar format - here is a list to an answer referencing a library that supports it. You would be able to use tar utility on OS-X to group images with plists on your workstation.
Finally, you can define a format of your own: store the length of the first file in the first four bytes, then store the content of the first file, and then the second. You would need to write a utility for combining the two files into one. This sounds like the hardest choice to implement.