extract mapped reads from SAM to classify them - classification

I want to extract mapped reads from a SAM file (from a resistome analysis using AMR++) to taxonomically classify them.
I was searching mainly the samtools manual and Stack Overflow and found these steps:
samtools view -@ 20 -S -b SRR4454621_1.alignment.sam > SRR4454621_1.bam ## to convert SAM to BAM
samtools view -@ 20 -c SRR4454621_1.bam ### to count reads: 10000126
samtools view -@ 20 -c -F 260 SRR4454621_1.bam ### to count mapped reads: 53189
samtools view -@ 20 -b -F 4 SRR4454621_1.bam > SRR4454621_1_mapped.bam ### to get mapped reads
samtools view -@ 20 -c SRR4454621_1_mapped.bam ### new check to count mapped reads: 53189
samtools bam2fq SRR4454621_1_mapped.bam | seqtk seq -A > SRR4454621_1_mapped.fa ## to extract sequences
grep ">" SRR4454621_1_mapped.fa | wc -l ### to check whether everything is going right: 53063 (lost 126 sequences)
Then I run centrifuge to classify them.
centrifuge -f -x centridb/hpvc testing/SRR4454621_1_mapped.fa -S testing/SRR4454621_1_mapped.tsv -p 24 --report-file testing/SRR4454621_1_mapped_report.tsv
And my problem is that the sum of the "numReads" column from SRR4454621_1_mapped_report.tsv is 107682, while I would expect it to be the same as the sum of the equivalent column from the resistome analysis, which is 51961.
Where could the problem be? Are the main steps I described above correct?
Thank you very much for your help,
Manuel
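A likely place to look: Centrifuge can assign one read to several taxa, and such a read then contributes to several rows of the report, so the sum of numReads can exceed the number of input sequences even when the extraction itself is fine. A quick check is sketched below (assumptions: the per-read output has the read ID in column 1 and one header line; the -F 0x904 variant is only needed if secondary/supplementary alignments might still be in the BAM):
# count distinct read IDs in the per-read classification output
awk 'NR > 1 && !seen[$1]++ { n++ } END { print n }' testing/SRR4454621_1_mapped.tsv
# compare with the number of sequences that went into centrifuge
grep -c ">" SRR4454621_1_mapped.fa
# optional: keep only primary mapped alignments (0x4 unmapped, 0x100 secondary, 0x800 supplementary)
samtools view -@ 20 -b -F 0x904 SRR4454621_1.bam > SRR4454621_1_mapped_primary.bam
If the distinct-ID count matches the FASTA count while the report sum is much larger, the difference comes from multi-assignment in Centrifuge rather than from the samtools steps.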

Related

I want to update a file with rollno. and name from a csv file where name and roll no are separated by comma

My input file is a CSV file containing details such as:
2233,anish sharma
2234,azad khan
2235,birbal singh
2236,chaitanya kumar
My expected output is to display the two details in two separate columns.
I executed the following code. The full name is not displayed; the part after the space doesn't appear. What changes should be made?
echo "Roll no updation"
tput cup 10 10
echo "Key in file name (rollno,name separated by comma)"
tput cup 12 10
read infile
for i in `cat $infile`
do
rollno=`echo $i|cut -d , -f1`
name=`echo $i|cut -d , -f2`
psql -U postgres -A -t -F, -c "update student set name = '$name' where rollno = '$rollno' current record" >bq
done
Your loop should be written in this fashion
# comma separates records
IFS=,
cat "$infile" | while read rollno name; do
psql -U postgres -A -t -F, -c \
"update student set name = '$name'
where rollno = '$rollno'" >bq
done
But you should be aware that this code is susceptible to SQL injection. Only use it if you can trust the source of the data!
Any ' in the file will cause errors, or worse.
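If the file cannot be fully trusted, one way to avoid building the SQL string by hand is to pass the values as psql variables and let the :'var' syntax quote them as SQL literals. A minimal sketch of that approach (assuming bash and the same student table as above; adjust to your schema):
#!/bin/bash
# read rollno and name from the CSV; psql quotes :'rollno' and :'name' itself
while IFS=, read -r rollno name; do
    psql -U postgres -A -t -v rollno="$rollno" -v name="$name" <<'SQL' > bq
UPDATE student SET name = :'name' WHERE rollno = :'rollno';
SQL
done < "$infile"
The quoted here-document ('SQL') keeps the shell from expanding anything inside it, so the substitution is done entirely by psql, which escapes any embedded quotes.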

comparing two directories with separate diff output per file

I need to see what has changed between two directories which contain different versions of a software source code. While I have found a way to get a single .diff file, how can I obtain a separate file for each changed file in the two directories? I need this because the "main" diff is about 6 MB and I wanted something handier.
I came across this problem too, so I ended up with a few lines of shell script. It takes three arguments: the source and destination directories (as used for diff) and a target folder (which should exist) for the output.
It's a bit hacky, but maybe it would be useful for someone. So use with care, especially if your paths have special characters.
#!/bin/sh
DIFFARGS="-wb"
LANG=C
TARGET=$3
SRC=`echo $1 | sed -e 's/\//\\\\\\//g'`
DST=`echo $2 | sed -e 's/\//\\\\\\//g'`
if [ ! -d "$TARGET" ]; then
echo "'$TARGET' is not a directory." >&2
exit 1
fi
diff -rqN $DIFFARGS "$1" "$2" | sed "s/Files $SRC\/\(.*\?\) and $DST\/\(.*\?\) differ/\1/" | \
while read file
do
if [ ! -d "$TARGET/`dirname \"$file\"`" ]; then
mkdir -p "$TARGET/`dirname \"$file\"`"
fi
diff $DIFFARGS -N "$1/$file" "$2/$file" > "$TARGET"/"$file.diff"
done
If you want to compare source code, it is better to commit it to a version control system such as svn.
After you have done so, do a diff of your uploaded code and redirect it to file.diff:
svn diff --old svn:url1 --new svn:url2 > file.diff
A bash for loop will work for you. The following will diff two directories with C source code and produce a separate diff for each file.
for FILE in $(find <FIRST_DIR> -name '*.[ch]'); do DIFF=<DIFF_DIR>/$(echo $FILE | grep -o '[-_a-zA-Z0-9.]*$').diff; diff -u $FILE <SECOND_DIR>/$FILE > $DIFF; done
Use the correct patch level for the lines starting with +++
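The script above has to escape the source and destination paths before feeding them to sed, which is where the "special characters" caveat comes from. Below is a rough variant of the same idea that avoids the escaping by using shell parameter expansion instead; treat it as a sketch, and note it still assumes the relative paths do not contain the literal text " and ":
#!/bin/sh
# usage: ./dirdiff.sh OLD_DIR NEW_DIR OUT_DIR
SRC=$1; DST=$2; OUT=$3
diff -rqN -wb "$SRC" "$DST" | while IFS= read -r line; do
    case "$line" in
        "Files $SRC/"*" differ") ;;   # keep only the "Files ... differ" lines
        *) continue ;;
    esac
    rel=${line#"Files $SRC/"}         # strip the leading "Files OLD_DIR/"
    rel=${rel%" and $DST/"*}          # strip " and NEW_DIR/... differ"
    mkdir -p "$OUT/$(dirname "$rel")"
    diff -N -wb -u "$SRC/$rel" "$DST/$rel" > "$OUT/$rel.diff"
done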

Why can't Postfix write to maillog after deleting some lines with sed? [closed]

I want to use a cron job that will clean and sort the maillog once every three days.
My job looks like:
/bin/sed -i /status=/!d /var/log/maillog |
(/bin/grep "status=bounced" /var/log/maillog | /bin/grep -E -o --color "\b[a-zA-Z0-9.-]+#[a-zA-Z0-9.-]+\.[a-zA-Z0-9.-]+\b" | /bin/sort -u >> /root/unsent.log) |
(/bin/grep "status=deferred" /var/log/maillog | /bin/grep -E -o --color "\b[a-zA-Z0-9.-]+#[a-zA-Z0-9.-]+\.[a-zA-Z0-9.-]+\b" | /bin/sort -u >> /root/deferred.log) |
(/bin/grep "status=sent" /var/log/maillog | /bin/grep -E -o --color "\b[a-zA-Z0-9.-]+#[a-zA-Z0-9.-]+\.[a-zA-Z0-9.-]+\b" | /bin/sort -u >> /root/sent.log) |
/bin/sed -i "/status=/d" /var/log/maillog
The job works fine and does 3 steps:
Delete from maillog all lines that don't contain "status="
Sort sent, bounced, deferred into different logs.
Delete from maillog all lines that contain "status="
After this job my maillog is fully cleaned and sorted into 3 logs.
But Postfix doesn't want to write any further records to maillog.
If I delete the sed command, Postfix writes further records fine.
Why does the sed command block maillog after the cron job runs?
sed -i replaces the file it modifies with a new one, unlinking the original, so syslog/postfix will continue writing to the old, now-deleted file.
From http://en.wikipedia.org/wiki/Sed:
Note: "sed -i" overwrites the original file with a new one, breaking any links the original may have had
It is more common to process log files after rotating them out of place with a tool like logrotate or savelog, so that syslog can continue writing uninterrupted.
If you must edit /var/log/maillog in place, you can add a line to the end of your cron job to reload syslog when you are done. Note that you can lose log lines written to the file while your script is running if you do this. The command will depend on what distribution / operating system you are running. On ubuntu, which uses rsyslog, it would be reload rsyslog >/dev/null 2>&1.
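For completeness, a sketch of what that extra line at the end of the cron job could look like; the first form is the Ubuntu/upstart one mentioned above, and the commented alternatives are assumptions for other init systems, so check what your distribution actually uses:
# at the end of the cron job, after the sed -i edits, tell syslog to reopen its files
reload rsyslog >/dev/null 2>&1
# service rsyslog reload      # sysvinit-style systems (assumption)
# systemctl reload rsyslog    # systemd systems (assumption)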
I've reformatted your original code to highlight the pipelines you added
/bin/sed -i /status=/!d /var/log/maillog \
| (/bin/grep "status=bounced" /var/log/maillog \
| /bin/grep -E -o --color "\b[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9.-]+\b" \
| /bin/sort -u >> /root/unsent.log\
) \
| (/bin/grep "status=deferred" /var/log/maillog \
| /bin/grep -E -o --color "\b[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9.-]+\b" \
| /bin/sort -u >> /root/deferred.log\
) \
| (/bin/grep "status=sent" /var/log/maillog \
| /bin/grep -E -o --color "\b[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9.-]+\b" \
| /bin/sort -u >> /root/sent.log \
) \
| /bin/sed -i "/status=/d" /var/log/maillog
As @alberge noted, you could very likely lose log messages with all of this sed -i processing on the same file.
I propose a different approach:
I would move the maillog to a dated filename (the assumption here is that Postfix will create a new file with the standard name that it 'likes' to use, /var/log/maillog).
Then your real goal seems to be to extract various categories of messages into separately named files, i.e. unsent.log, deferred.log, sent.log, AND then you're discarding any lines that don't contain the string status= (although you do that first).
Here's my alternative (please read the whole message, don't copy/paste/execute right away!).
logDate=$(/bin/date +%Y%m%d.%H%M%S)
/bin/mv /var/log/maillog /var/log/maillog.${logDate}
/bin/grep "status=bounced" /var/log/maillog.${logDate} \
| /bin/grep -E -o --color "\b[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9.-]+\b" \
| /bin/sort -u \
>> /root/unsent.log.${logDate}
/bin/grep "status=deferred" /var/log/maillog.${logDate} \
| /bin/grep -E -o --color "\b[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9.-]+\b" \
| /bin/sort -u \
>> /root/deferred.log.${logDate}
/bin/grep "status=sent" \
| /bin/grep -E -o --color "\b[a-zA-Z0-9.-]+#[a-zA-Z0-9.-]+\.[a-zA-Z0-9.-]+\b" \
| /bin/sort -u \
>> /root/sent.log.${logDate}
To test that this code is working, replace the 2nd line ( /bin/mv .... ) with
/bin/cp /var/log/maillog /var/log/maillog.${logDate}
Copy/paste that into a terminal window, confirm that the /var/log/maillog.${logDate} was copied correctly, then copy/paste each section, 1 at a time and check that the expected output is created in each of the /root logfiles.
(If you get error messages for any of these blocks, make sure there are NO space/tab chars after the last '\' char on each of the continued lines. OR you can fold each of those 3 pipelines back into one line, removing the '\' chars as you go.)
(Note that to create each of the /root logfiles, I don't use any connecting sections via pipes surrounded by sub-processes. But, in other situations, I do use this sort of technique for advanced problems, so don't throw the technique away, just use it when it is really required ;-)!
After you confirm that all of this is working as you need, you can extend the script to do a final clean-up:
/bin/rm /var/log/maillog.${logDate}
I've added ${logDate} to each of your output files, but as I see you're using sort -u >>, you may want to remove that 'extension' from your sub-logfile names (unsent.log, deferred.log, sent.log) and just let those files grow naturally. In either case, you'll have to come back at some point and determine how far back you want to keep this data, and develop a plan and method for how you'll clean up these logfiles when they're no longer useful. Someone mentioned the logrotate package; you might want to look into that as your long-term solution.
This solution avoids a lot of extra processes being created, and it eliminates (mostly) the possibility of lost log records. I think you might lose all or part of a record if Postfix is writing to the logfile in the same split-second as you are moving the file, but your solution would have similar problems AND more opportunities for that to happen.
If I have misunderstood the intention of your design, using the nested ( .... ) | ( .... ) sub-processes, sorry! Consider updating your post to explain why you are using that technique.
I hope this helps.

split a large text (xyz) database into x equal parts

I want to split a large text database (~10 million lines). I can use a command like
$ sed -i -e '4 s/(dB)//' -e '4 s/Best\ unit/Best_Unit/' -e '1,3 d' '/cygdrive/c/ Radio Mobile/Output/TRC_TestProcess/trc_longlands.txt'
$ split -l 1000000 /cygdrive/P/2012/Job_044_DM_Radio_Propogation/Working/FinalPropogation/TRC_Longlands/trc_longlands.txt 1
The first line is to clean the database and the next is to split it -
but then the output files do not have the field names. How can I incorporate the field names into each dataset and output a list which has the original file, new file name and line numbers (from the original file) in it? This is so that it can be used in the ArcGIS model to re-join the final simplified polygon datasets.
ALTERNATIVELY AND MORE USEFULLY - as this needs to go into an ArcGIS model, a Python-based solution is best. More details are in https://gis.stackexchange.com/questions/21420/large-point-to-polygon-by-buffer-join-buffer-dissolve-issues#comment29062_21420 and Remove specific lines from a large text file in python
SO GOING WITH A CYGWIN-based Python solution as per the answer by icyrock.com
we have process_text.sh
cd /cygdrive/P/2012/Job_044_DM_Radio_Propogation/Working/FinalPropogation/TRC_Longlands
mkdir processing
cp trc_longlands.txt processing/trc_longlands.txt
cd txt_processing
sed -i -e '4 s/(dB)//' -e '4 s/Best\ unit/Best_Unit/' -e '1,3 d' 'trc_longlands.txt'
split -l 1000000 trc_longlands.txt trc_longlands_
cat > a
h
1
2
3
4
5
6
7
8
9
^D
split -l 3
split -l 3 a 1
mv 1aa 21aa
for i in 1*; do head -n1 21aa|cat - $i > 2$i; done
for i in 21*; do echo ---- $i; cat $i; done
how can "TRC_Longlands" and the path be replaced with the input filename -in python we have %path%/%name for this.
in the last line is "do echo" necessary?
and this is called by python using
import os
os.system("process_text.bat")
where process_text.bat is basically
bash process_text.sh
I get the following error when run from DOS:
Microsoft Windows [Version 6.1.7601] Copyright (c) 2009 Microsoft
Corporation. All rights reserved.
C:\Users\georgec>bash
P:\2012\Job_044_DM_Radio_Propogation\Working\FinalPropogat
ion\TRC_Longlands\process_text.sh 'bash' is not recognized as an
internal or external command, operable program or batch file.
also when I run the bash command from cygwin -I get
georgec@ATGIS25
/cygdrive/P/2012/Job_044_DM_Radio_Propogation/Working/FinalPropogation/TRC_Longlands
$ bash process_text.sh : No such file or directory:
/cygdrive/P/2012/Job_044_DM_Radio_Propogation/Working/FinalPropogation/TRC_Longlands
cp: cannot create regular file `processing/trc_longlands.txt\r': No
such file or directory : No such file or directory: txt_processing :
No such file or directoryds.txt
but the files are created in the root directory.
Why is there a "." after the directory name? How can the files be given a .txt extension?
If you want to just prepend the first line of the original file to all but the first of the splits, you can do something like:
$ cat > a
h
1
2
3
4
5
6
7
^D
$ split -l 3
$ split -l 3 a 1
$ ls
1aa 1ab 1ac a
$ mv 1aa 21aa
$ for i in 1*; do head -n1 21aa|cat - $i > 2$i; done
$ for i in 21*; do echo ---- $i; cat $i; done
---- 21aa
h
1
2
---- 21ab
h
3
4
5
---- 21ac
h
6
7
Obviously, the first file will have one line fewer than the middle parts and the last part might be shorter, too, but if that's not a problem, this should work just fine. Of course, if your header has more lines, just change head -n1 to head -nX, X being the number of header lines.
Hope this helps.
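An alternative that keeps the header on every part, including the first, is to split only the data lines and then glue the header back onto each piece; a sketch along those lines (file and prefix names are just placeholders):
# keep the header aside, split only the data lines, then prepend the header to every part
head -n1 trc_longlands.txt > header.txt
tail -n +2 trc_longlands.txt | split -l 1000000 - trc_longlands_part_
for part in trc_longlands_part_*; do
    cat header.txt "$part" > "$part.txt" && rm "$part"
done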

Do not show directories in rsync output

Does anybody know if there is an rsync option so that directories that are being traversed do not show on stdout?
I'm syncing music libraries, and the massive number of directories makes it very hard to see which file changes are actually happening.
I've already tried -v and -i, but both also show directories.
If you're using --delete in your rsync command, the problem with calling grep -E -v '/$' is that it will omit the information lines like:
deleting folder1/
deleting folder2/
deleting folder3/folder4/
If you're making a backup and the remote folder has been completely wiped out for some reason, it will also wipe out your local folder, because you won't see the deleting lines.
To omit the already existing folders but keep the deleting lines at the same time, you can use this expression:
rsync -av --delete remote_folder local_folder | grep -E '^deleting|[^/]$'
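If you want to check what that filter will show before letting --delete run for real, the same expression can be combined with rsync's dry-run flag (just a usage example of the command above):
rsync -avn --delete remote_folder local_folder | grep -E '^deleting|[^/]$'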
I'd be tempted to filter by piping through grep -E -v '/$', which uses an end-of-line anchor to remove lines that finish with a slash (a directory).
Here's the demo terminal session where I checked it...
cefn@cefn-natty-dell:~$ mkdir rsynctest
cefn@cefn-natty-dell:~$ cd rsynctest/
cefn@cefn-natty-dell:~/rsynctest$ mkdir 1
cefn@cefn-natty-dell:~/rsynctest$ mkdir 2
cefn@cefn-natty-dell:~/rsynctest$ mkdir -p 1/first 1/second
cefn@cefn-natty-dell:~/rsynctest$ touch 1/first/file1
cefn@cefn-natty-dell:~/rsynctest$ touch 1/first/file2
cefn@cefn-natty-dell:~/rsynctest$ touch 1/second/file3
cefn@cefn-natty-dell:~/rsynctest$ touch 1/second/file4
cefn@cefn-natty-dell:~/rsynctest$ rsync -r -v 1/ 2
sending incremental file list
first/
first/file1
first/file2
second/
second/file3
second/file4
sent 294 bytes received 96 bytes 780.00 bytes/sec
total size is 0 speedup is 0.00
cefn@cefn-natty-dell:~/rsynctest$ rsync -r -v 1/ 2 | grep -E -v '/$'
sending incremental file list
first/file1
first/file2
second/file3
second/file4
sent 294 bytes received 96 bytes 780.00 bytes/sec
total size is 0 speedup is 0.00
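Another way to approach it, sketched below, is to rely on rsync's itemized output instead of the trailing slash: with -i the second character of each change line is the file type (f for a file, d for a directory), so directory lines can be dropped by that column while "deleting" lines pass through untouched. This assumes a reasonably recent rsync and uses the same demo directories as above:
# -i prints one itemized line per change; the 2nd character is the type: f = file, d = directory
rsync -ri 1/ 2 | grep -v '^.d'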