How to get mallet to load all tokens from a line without a label?

I'm trying to perform topic modeling on a dataset that's in a whitespace-delimited file with no labels, and I can't get mallet to load all the tokens. I'm using version 2.0.8 on Linux and Mac.
As a test for the issue, I created a file with the one line:
1 2 3 4 5
Then ran
mallet import-file --token-regex '[0-9]+' --keep-sequence true --label 0 --input testData --output testLoaded
mallet train-topics --input testLoaded
I should get 4 tokens, but I only get 3:
Data loaded.
max tokens: 3
total tokens: 3
It gets even worse if I try to use the --data flag (the result is the same whether I combine it with --label 0 or use --data 2 on its own):
mallet import-file --token-regex '[0-9]+' --keep-sequence true --label 0 --data 2 --input testData --output testLoaded2
mallet train-topics --input testLoaded2
Data loaded.
max tokens: 1
total tokens: 1
So either I lose the first token, or I get only the first token (in the latter case, 2 shows up later in the output, so I know it isn't loading the rest of the line as one big token).

Mallet parses lines in two phases: first it segments the line into fields using the --line-regex option; then it maps those fields onto the three instance fields (name, label, data).
The command isn't working because it only changes the second phase, the mapping from regex groups to instance fields. It still tells Mallet to split off the first two fields, but then to ignore the label. Here's an example of the default behavior:
$ bin/mallet import-file --input token_test.txt --keep-sequence \
--token-regex [0-9]+ --print-output
name: 1
target: 2
input: 0: 3 (0)
1: 4 (1)
2: 5 (2)
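(For reference, the default is --line-regex '^(\S*)[\s,]*(\S*)[\s,]*(.*)$' with --name 1 --label 2 --data 3, if I'm reading the import-file help output correctly; that is why the first two whitespace-separated fields get peeled off before tokenization.)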
If we add --label 0, Mallet still captures the second field but ignores it:
$ bin/mallet import-file --input token_test.txt --keep-sequence \
--token-regex [0-9]+ --label 0 --print-output
name: 1
target: <null>
input: 0: 3 (0)
1: 4 (1)
2: 5 (2)
Now if we redefine the line regex, we can grab the whole line as a single field and use all of it as data:
$ bin/mallet import-file --input token_test.txt --keep-sequence \
--token-regex [0-9]+ --line-regex '(.*)' --data 1 --name 0 --label 0 --print-output
name: csvline:1
target: <null>
input: 0: 1 (0)
1: 2 (1)
2: 3 (2)
3: 4 (3)
4: 5 (4)
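Putting it together for the test file from the question, the import command would become something like this (a sketch; the file names are the question's own):
mallet import-file --input testData --output testLoaded \
    --keep-sequence --token-regex '[0-9]+' \
    --line-regex '(.*)' --data 1 --name 0 --label 0
mallet train-topics --input testLoaded
All five tokens should now survive the import.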

Related

extract mapped reads from SAM to classify them

I want to extract mapped reads from a SAM file (from a resistome analysis using AMR++) to taxonomically classify them.
Searching mainly the samtools manual and Stack Overflow, I found these steps:
samtools view -@ 20 -S -b SRR4454621_1.alignment.sam > SRR4454621_1.bam ## to convert SAM to BAM
samtools view -@ 20 -c SRR4454621_1.bam ### to count reads: 10000126
samtools view -@ 20 -c -F 260 SRR4454621_1.bam ### to count mapped reads: 53189
samtools view -@ 20 -b -F 4 SRR4454621_1.bam > SRR4454621_1_mapped.bam ### to get mapped reads
samtools view -@ 20 -c SRR4454621_1_mapped.bam ### new check to count mapped reads: 53189
samtools bam2fq SRR4454621_1_mapped.bam | seqtk seq -A > SRR4454621_1_mapped.fa ## to extract sequences
grep ">" SRR4454621_1_mapped.fa | wc -l ### to check whether everything is going rigth: 53063 (lost 126 sequences)
Then I run centrifuge to classify them.
centrifuge -f -x centridb/hpvc testing/SRR4454621_1_mapped.fa -S testing/SRR4454621_1_mapped.tsv -p 24 --report-file testing/SRR4454621_1_mapped_report.tsv
My problem is that the sum of the "numReads" column from SRR4454621_1_mapped_report.tsv is 107682, while I would expect it to match the sum of the equivalent column from the resistome analysis, which is 51961.
Where could the problem be? Are the main steps I described above correct?
Thank you very much for your help,
Manuel
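One thing worth checking, though it is only a guess on my part: -F 4 keeps secondary and supplementary alignments in the BAM, and the various counting and extraction steps may treat those differently from primary alignments. The SAM flag values below are standard; the file name is the one from the question:
samtools flagstat SRR4454621_1.bam          # per-category breakdown: mapped, secondary, supplementary, ...
samtools view -c -F 0x904 SRR4454621_1.bam  # primary mapped reads only (excludes unmapped 0x4, secondary 0x100, supplementary 0x800)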

push multi-line text as one message to a kafka topic

I want to push text consisting of multiple lines as one message into a Kafka topic.
After I enter:
kafka-console-producer --broker-list localhost:9092 --topic myTopic
and copy my text:
My Text consists of:
two lines instead of one
I get two messages in the kafka topic, but I want to have just one. Any ideas how to achieve that? Thanks
You can use kafkacat for this, with its -D flag to specify a custom message delimiter (in this example /):
kafkacat -b kafka:29092 \
-t test_topic_01 \
-D/ \
-P <<EOF
this is a string message
with a line break/this is
another message with two
line breaks!
EOF
Note that the delimiter must be a single byte; multi-byte characters will end up being included in the resulting message (see issue #140).
Resulting messages, inspected also using kafkacat:
$ kafkacat -b kafka:29092 -C \
-f '\nKey (%K bytes): %k\t\nValue (%S bytes): %s\nPartition: %p\tOffset: %o\n--\n' \
-t test_topic_01
Key (-1 bytes):
Value (43 bytes): this is a string message
with a line break
Partition: 0 Offset: 0
--
Key (-1 bytes):
Value (48 bytes): this is
another message with two
line breaks!
Partition: 0 Offset: 1
--
% Reached end of topic test_topic_01 [0] at offset 2
Inspecting using kafka-console-consumer:
$ kafka-console-consumer \
--bootstrap-server kafka:29092 \
--topic test_topic_01 \
--from-beginning
this is a string message
with a line break
this is
another message with two
line breaks!
(thus illustrating why kafkacat, with its optional verbosity, is nicer to work with than kafka-console-consumer :) )
It's not possible with kafka-console-producer, as it uses a Java Scanner object that is newline-delimited.
You would need to do it via your own producer code.
With the console consumer you are presumably just testing the data you expect from your client. If it is logically a single message, it is better to keep it as a single string and mark the internal line breaks with a unique delimiter, e.g.
{this is line one ^^ this is line two}
Then handle the message accordingly in your consumer job. Even if the client plans to send multiple sentences per message, keeping them in a single string improves serialization of the message and is more efficient after serialization.
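A minimal sketch of that delimiter idea with the stock console tools (assumptions: a shell variable multiline_text holding the text, and the control character \001 standing in for the delimiter, since it cannot appear in normal text):
# producer side: collapse real newlines into \001 so the whole text becomes one line, i.e. one message
printf '%s' "$multiline_text" | tr '\n' '\001' |
  kafka-console-producer --broker-list localhost:9092 --topic myTopic

# consumer side: restore the newlines on the way out
kafka-console-consumer --bootstrap-server localhost:9092 --topic myTopic --from-beginning |
  tr '\001' '\n'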

Make single instance of `sed` with multiple filenames skip to next file

sed's n ("next") command skips to the next line, but with multiple files there doesn't seem to be any command to skip to the next file.
Is there any workaround using only a single invocation of sed?
Demonstration of problem...
Make two simple 3-number data files:
seq 3 > three ; seq 10 1 13 > thirteen
Show that sed handles multiple files (by finding all lines ending with 3 and printing the filenames) and is somewhat aware of them as distinct objects:
sed -n '/3$/{p;F}' three thirteen
Output:
3
three
13
thirteen
This next attempt to print both last lines doesn't work, however; or rather, it works as though both files were a single stream:
sed -n '$p' three thirteen
Output:
13
See if your version supports the -s option:
$ seq 3 > three ; seq 10 1 13 > thirteen
$ sed -n '$p' three thirteen
13
$ sed -n '2p' three thirteen
2
$ sed -sn '$p' three thirteen
3
13
$ sed -sn '2p' three thirteen
2
11
From man sed:
-s, --separate
consider files as separate rather than as a single continuous long stream.
When using the -i option, GNU sed uses -s by default.
In case the -s option is not available, here's an alternative with perl:
$ perl -ne 'print if eof' three thirteen
3
13
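If perl isn't available either, awk can do the same per-file addressing, since its FNR counter resets at each file boundary (a sketch using the same test files):
$ awk 'FNR == 2' three thirteen
2
11
$ awk 'FNR == 1 && NR > 1 { print prev } { prev = $0 } END { print prev }' three thirteen
3
13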

How to use gnuplot xdata time with feedgnuplot?

I'm using feedgnuplot and gnuplot to show real-time data on my linux desktop.
My data is produced by a command which outputs 12 space-separated integer values per line, at up to four lines per second.
I'd like to add the time, so I prepend it to each line before feeding the data to feedgnuplot.
If I specify the time as seconds, the data is properly plotted, but the x-axis is quite unreadable:
data-producer |
while read -a line
do
echo -n $(date +"%s")
for n in "${line[@]}"
do
echo -en "\t$n"
done
echo
done |
feedgnuplot --terminal 'qt' \
--domain \
--stream 1 \
--xlen 5 \
--with lines \
--set "xdata time" \
--set "timefmt '%s'"
So I tried to get human readable time on the horizontal scale:
data-producer |
while read -a line
do
echo -n $(date +"%H:%M:%S")
for n in "${line[@]}"
do
echo -en "\t$n"
done
echo
done |
feedgnuplot --terminal 'qt' \
--domain \
--stream 1 \
--xlen 5 \
--with lines \
--set "xdata time" \
--set "timefmt '%H:%M:%S'"
This does not work, because feedgnuplot complains that comparison operators are being applied to non-numeric data:
Argument "09:45:58" isn't numeric in numeric lt (<) at /usr/bin/feedgnuplot line 694.
Argument "09:45:57" isn't numeric in numeric ge (>=) at /usr/bin/feedgnuplot line 797.
Looking into the feedgnuplot code (it is a Perl script), I see that comparisons are performed on the x values, both to sort them and to decide whether the graph has to be replotted.
Is it possible to have feedgnuplot handle times by using some command line switches? If not, is there any other option before resorting to patching the feedgnuplot source code? Thank you.
Gnuplot requires some special settings for date/time data (e.g. a using statement must be specified). Therefore, feedgnuplot provides its own option for time data, --timefmt <format>:
for i in `seq 0 100`; do echo $i; sleep 1; done |
while read -a line
do
echo -n $(date +"%Y-%m-%d %H:%M:%S")
for n in "${line[@]}"
do
echo -en "\t$n"
done
echo
done |
feedgnuplot --terminal 'qt' \
--domain \
--stream 1 \
--lines \
--timefmt "%Y-%m-%d %H:%M:%S" \
--set 'format x "%H:%M:%S"'
Note that different versions of gnuplot use different reference points for time in seconds: versions 4.6 (reference point 1 January 2000) and earlier give wrong results when using %s. So it is better to use a time format of the kind %H:%M:%S. In the code above I used a fully qualified datetime to avoid possible problems with plots that span midnight.
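For reference, the special settings feedgnuplot generates correspond roughly to this plain-gnuplot script (a sketch; data.txt and the column choice are assumptions, with an %H:%M:%S timestamp in column 1):
gnuplot -persist <<'EOF'
set xdata time                    # interpret x values as time
set timefmt '%H:%M:%S'            # how to parse the time column on input
set format x '%H:%M:%S'           # how to label the x axis
plot 'data.txt' using 1:2 with lines title 'channel 1'
EOF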

split a large text (xyz) database into x equal parts

I want to split a large text database (~10 million lines). I can use a command like
$ sed -i -e '4 s/(dB)//' -e '4 s/Best\ unit/Best_Unit/' -e '1,3 d' '/cygdrive/c/Radio Mobile/Output/TRC_TestProcess/trc_longlands.txt'
$ split -l 1000000 /cygdrive/P/2012/Job_044_DM_Radio_Propogation/Working/FinalPropogation/TRC_Longlands/trc_longlands.txt 1
The first command cleans the database and the second splits it,
but then the output files do not have the field names. How can I incorporate the field names into each part, and produce a list with the original file, the new file name, and the line numbers (from the original file)? This is so that it can be used in the ArcGIS model to re-join the final simplified polygon datasets.
ALTERNATIVELY, AND MORE USEFULLY: as this needs to go into an ArcGIS model, a Python-based solution is best. More details are in https://gis.stackexchange.com/questions/21420/large-point-to-polygon-by-buffer-join-buffer-dissolve-issues#comment29062_21420 and Remove specific lines from a large text file in python.
So, going with a Cygwin-based Python solution as per the answer by icyrock.com, we have process_text.sh:
cd /cygdrive/P/2012/Job_044_DM_Radio_Propogation/Working/FinalPropogation/TRC_Longlands
mkdir processing
cp trc_longlands.txt processing/trc_longlands.txt
cd txt_processing
sed -i -e '4 s/(dB)//' -e '4 s/Best\ unit/Best_Unit/' -e '1,3 d' 'trc_longlands.txt'
split -l 1000000 trc_longlands.txt trc_longlands_
cat > a
h
1
2
3
4
5
6
7
8
9
^D
split -l 3
split -l 3 a 1
mv 1aa 21aa
for i in 1*; do head -n1 21aa|cat - $i > 2$i; done
for i in 21*; do echo ---- $i; cat $i; done
how can "TRC_Longlands" and the path be replaced with the input filename -in python we have %path%/%name for this.
in the last line is "do echo" necessary?
and this is called by python using
import os
os.system("process_text.bat")
where process_text.bat is basically
bash process_text.sh
I get the following error when running from DOS:
Microsoft Windows [Version 6.1.7601]
Copyright (c) 2009 Microsoft Corporation. All rights reserved.

C:\Users\georgec>bash P:\2012\Job_044_DM_Radio_Propogation\Working\FinalPropogation\TRC_Longlands\process_text.sh
'bash' is not recognized as an internal or external command, operable program or batch file.
Also, when I run the bash command from Cygwin, I get:
georgec#ATGIS25
/cygdrive/P/2012/Job_044_DM_Radio_Propogation/Working/FinalPropogation/TRC_Longlands
$ bash process_text.sh : No such file or directory:
/cygdrive/P/2012/Job_044_DM_Radio_Propogation/Working/FinalPropogation/TRC_Longlands
cp: cannot create regular file `processing/trc_longlands.txt\r': No
such file or directory : No such file or directory: txt_processing :
No such file or directoryds.txt
but the files are created in the root directory.
Why is there a "." after the directory name? How can the files be given a .txt extension?
If you want to just prepend the first line of the original file to all but the first of the splits, you can do something like:
$ cat > a
h
1
2
3
4
5
6
7
^D
$ split -l 3
$ split -l 3 a 1
$ ls
1aa 1ab 1ac a
$ mv 1aa 21aa
$ for i in 1*; do head -n1 21aa|cat - $i > 2$i; done
$ for i in 21*; do echo ---- $i; cat $i; done
---- 21aa
h
1
2
---- 21ab
h
3
4
5
---- 21ac
h
6
7
Obviously, the first file will have one line less than the middle parts, and the last part might be shorter too, but if that's not a problem, this should work just fine. Of course, if your header has more lines, just change head -n1 to head -nX, X being the number of header lines.
Hope this helps.
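As for the other part of the question, the list of original file, new file name, and original line numbers, here's a sketch along the same lines (assumptions: the 21* parts produced above from file a, whose line 1 is the header):
start=1
for f in 21*; do
  n=$(( $(wc -l < "$f") - 1 ))                        # data lines in this part, not counting the prepended header
  echo "a  $f  lines $((start + 1))-$((start + n))"   # line numbers refer to the original file
  start=$(( start + n ))
done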