Why is my sed script for splitting a FASTA file so slow? - sed

I have a 600 MB FASTA file containing many alignment blocks from 12 species, and I want to split it into smaller FASTA files containing one block each, with its corresponding alignments.
I have a sed script that looks like this:
#!/bin/bash
# Nblocks = index of the last block in the file
for i in $(seq 0 "$Nblocks"); do
    sed -n "/block_index=$i|/,/^$/p" genome12species.fasta > "bloque$i.fasta"
done
This works at a small scale, but for a big file like 600 MB it takes too long, around 2 days. I don't think this is a matter of the computer I am running it on.
Does anyone know how to make this faster?
The input FASTA file looks like this:
dm3.chr3R(-):17092630-17092781|sequence_index=0|block_index=4|species=dm3|dm3_4_0
GGCGGAGATCAAGAATCGCGTCGGGCCGCCGTCCAGCGCCACTGACAACGCTAGCAAAGTGAAAATCGATCAAGGACGTCCAGTGGAAAACACCAAATCCGGTTGCTGCTGAATAA-CTCTGATTGTGAATCattattttattatacaatta
droGri2.scaffold_15074(-):2610183-2610334|sequence_index=0|block_index=4|species=droGri2|droGri2_4_0
GGCGGAGATCAAGAATCGTGTTGGGCCGCCGTCGAGCGCCACCGATAACGCTAGCAAAGTGAAAATCGATCAGGGACGCCCAGTGGAAAACAATAGATCTGGTTGCTGCTAAATAA-CTCTGATTGTGAATCATTATTTTATTATACAATTa
droMoj3.scaffold_6540(+):33866311-33866462|sequence_index=0|block_index=4|species=droMoj3|droMoj3_4_0
TGCCGAGATTAAGAATCGTGTCGGTCCGCCGTCCAGCGCAACCGACAATGCAAGCAAAGTGAAAATCGATCAGGGACGTCCAGTGGAGAACACCAGATCTGGTTGCTGCTGAATAA-CTCTGATTGTGAATCATTATTTTATTatacaatta
droVir3.scaffold_12822(+):1248119-1248270|sequence_index=0|block_index=4|species=droVir3|droVir3_4_0
GGCCGAGATTAAGAATCGCGTCGGGCCGCCGTCCAGCGCCACCGATAATGCTAGCAAAGTGAAAATCGATCAGGGTCGTCCAGTGGAGAACACCAAATCTGGTTGCTGCTGAATAA-CTCTGATTGTGAATCattattttattatacaatta
droWil1.scaffold_181130(-):16071336-16071488|sequence_index=0|block_index=4|species=droWil1|droWil1_4_0
GGCCGAGATTAAGAATCGTGTTGGGCCGCCGTCCAGCGCCACTGATAATGCTAGCAAAGTGAAAATCGATCAAGGACGTCCAGTGGAAAATACCAAATCCGGTTGCTGCTGAATAAACTCTGATTGTGAATCATTATTTTATTATACAATTA
droPer1.super_19(-):1310088-1310239|sequence_index=0|block_index=4|species=droPer1|droPer1_4_0
GGCTGAGATCAAGAATCGCGTCGGACCGCCGTCCAGCGCCACCGACAACGCTAGCAAAGTGAAAATCGATCAAGGACGTCCAGTGGAAAAACCCAATTCTGGTTGCTGCTGAATAA-CTCTGATTGTGAATCattattttattatacaatta
dp4.chr2(-):5593491-5593642|sequence_index=0|block_index=4|species=dp4|dp4_4_0
GGCTGAGATCAAGAATCGCGTCGGACCGCCGTCCAGCGCCACCGACAACGCTAGCAAAGTGAAAATCGATCAAGGACGTCCAGTGGAAAAGCCCAATTCTGGTTGCTGCTGAATAA-CTCTGATTGTGAATCattattttattatacaatta
droAna3.scaffold_13340(-):3754154-3754305|sequence_index=0|block_index=4|species=droAna3|droAna3_4_0
GGCCGAGATCAAGAATCGCGTCGGGCCACCGTCCAGCGCCACCGACAACGCTAGCAAAGTGAAAATCGATCAAGGACGTCCAGTGGAAAACACCAGATCCGGTTGCTGCTGAATAA-CTCTGATTGTGAATCattattttattataaaatta
droEre2.scaffold_4770(+):4567591-4567742|sequence_index=0|block_index=4|species=droEre2|droEre2_4_0
GGCCGAGATCAAGAATCGCGTCGGGCCGCCGTCCAGCGCCACCGACAACGCTAGCAAAGTGAAAATCGATCAAGGACGTCCAGTGGAAAACACCAAATCCGGTTGCTGCTGAATAA-CTCTGATTGTGAATCattattttattatacaatta
droYak2.chr3R(-):5883047-5883198|sequence_index=0|block_index=4|species=droYak2|droYak2_4_0
GGCCGAGATCAAGAATCGCGTCGGGCCGCCATCCAGCGCCACCGACAACGCTAGCAAAGTGAAAATCGATCAAGGACGTCCAGTGGAAAACACCAAATCCGGTTGCTGCTGAATAA-CTCTGATTGTGAATCattattttattatacaatta
droSec1.super_38(+):36432-36583|sequence_index=0|block_index=4|species=droSec1|droSec1_4_0
GGCGGAGATCAAGAATCGCGTCGGTCCGCCGTCCAGCGCCACTGACAACGCTAGCAAAGTGAAAATCGATCAAGGACGTCCAGTGGAAAACACCAAATCCGGTTGCTGCTGAATAA-CTCTGATTGTGAATCattattttattatacaatta
droSim1.chr3R(+):4366350-4366501|sequence_index=0|block_index=4|species=droSim1|droSim1_4_0
GGCGGAGATCAAGAATCGCGTCGGGCCGCCGTCCAGCGCCACTGACAACGCTAGCAAAGTGAAAATCGATCAAGGACGTCCAGTGGAAAACACCAAATCCGGTTGCTGCTGAATAA-CTCTGATTGTGAATCattattttattatacaatta
dm3.chr3R(-):17092781-17092867|sequence_index=0|block_index=5|species=dm3|dm3_5_0
GAGTACGCCGCCCAGTTAGGCATTCCATTCCTTGAAACTTCGGCCAAGAGCGCCACCAACGTTGAGCAGGCCTTCATGACGATGGC
droSim1.chr3R(+):4366264-4366350|sequence_index=0|block_index=5|species=droSim1|droSim1_5_0
GAGTACGCCGCCCAGTTAGGCATTCCATTCCTTGAAACTTCGGCCAAGAGCGCCACCAACGTTGAGCAGGCCTTTATGACGATGGC
droSec1.super_38(+):36346-36432|sequence_index=0|block_index=5|species=droSec1|droSec1_5_0
GAGTACGCCGCCCAGTTAGGCATTCCATTCCTTGAAACTTCGGCCAAGAGCGCCACCAACGTTGAGCAGGCCTTCATGACGATGGC
droYak2.chr3R(-):5883198-5883284|sequence_index=0|block_index=5|species=droYak2|droYak2_5_0
GAGTACGCCGCCCAGTTAGGCATTCCATTCCTTGAAACATCGGCCAAGAGCGCCACCAACGTGGAGCAGGCCTTCATGACGATGGC
droEre2.scaffold_4770(+):4567505-4567591|sequence_index=0|block_index=5|species=droEre2|droEre2_5_0
GAGTACGCCGCCCAGTTAGGCATTCCATTCCTTGAAACTTCGGCCAAGAGCGCCACCAACGTGGAGCAGGCCTTCATGACGATGGC
droAna3.scaffold_13340(+):20375068-20375148|sequence_index=0|block_index=5|species=droAna3|droAna3_5_0
------GCCGAAAACTTCGACATGCCCTTCTTCGAGGTCTCTTGCAAGTCAAACATCAATATTGAAGATGCGTTTCTTTCCCTGGC
dp4.chr2(-):5593642-5593728|sequence_index=0|block_index=5|species=dp4|dp4_5_0
GAGTATGCAGCTCAGTTAGGCATTCCATTTCTTGAAACTTCGGCCAAGAGCGCCACGAACGTGGAGCAGGCCTTCATGACGATGGC
droPer1.super_19(-):1310239-1310325|sequence_index=0|block_index=5|species=droPer1|droPer1_5_0
GAGTATGCAGCTCAGTTAGGCATTCCATTTCTTGAAACTTCGGCCAAGAGCGCCACGAACGTGGAGCAGGCCTTCATGACGATGGC
droWil1.scaffold_181130(-):16071488-16071574|sequence_index=0|block_index=5|species=droWil1|droWil1_5_0
GAATATGCGGCTCAGTTAGGCATTCCATTCCTTGAAACTTCGGCAAAGAGTGCCACCAATGTGGAGCAGGCCTTTATGACGATGGC
droVir3.scaffold_12822(+):1248033-1248119|sequence_index=0|block_index=5|species=droVir3|droVir3_5_0
GAGTACGCACATCAGTTAGGCATTCCATTCCTTGAAACTTCGGCCAAGAGCGCCACCAACGTGGAGCAGGCATTTATGACGATGGC
droMoj3.scaffold_6540(+):33866225-33866311|sequence_index=0|block_index=5|species=droMoj3|droMoj3_5_0
GAGTATGCACATCAGTTAGGCATTCCATTCCTTGAAACTTCGGCCAAGAGCGCCACCAATGTAGAGCAGGCATTCATGACGATGGC
droGri2.scaffold_15074(-):2610334-2610420|sequence_index=0|block_index=5|species=droGri2|droGri2_5_0
GAGTACGCAAATCAGTTAGGCATTCCATTCCTTGAAACTTCGGCGAAGAGTGCCACCAATGTGGAACAGGCATTCATGACGATGGC

Here's an awk one-liner to get you started. It uses the same regex range as your sed; the matched block_index ends up in m[1]. A 600 MB file should take just a few minutes:
awk 'match($0, /block_index=([0-9]+)\|/, m),/^$/ {print >"bloque"m[1]".fasta"}'
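One caveat: the three-argument match() is a GNU awk (gawk) extension, so mawk or BSD awk will reject this one-liner. If you can't use gawk, here's a sketch of a portable equivalent (untested against your full data) that pulls the index out with sub() instead, and closes each file when its block ends, which also avoids running out of open file descriptors:
# Start of a block: a header line while no output file is open.
/block_index=[0-9]+\|/ && out == "" {
    n = $0
    sub(/.*block_index=/, "", n)   # strip everything before the index
    sub(/\|.*/, "", n)             # strip everything after it
    out = "bloque" n ".fasta"
}
out != "" { print > out }          # copy every line of the current block
/^$/ { close(out); out = "" }      # a blank line ends the block
Save it as, say, split_blocks.awk and run: awk -f split_blocks.awk genome12species.fasta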

Related

sed search and replace with 2 conditions/patterns

I have a file:
http://www.gnu.org/software/coreutils/
glob
lxc-ls
I need to replace only the line below:
lxc-ls
with
lxc-ls
lxc-ls can be any word, as I have multiple such links in several files that I need to replace.
I do not want to make any changes to the other two links, i.e.:
http://www.gnu.org/software/coreutils/
glob
What I have tried so far is:
$ sed '/html/ i\
..' file
But this inserts text before the line, and the other condition of excluding the two URLs is not fulfilled either.
Here is a more realistic example from one of the files.
<b>echoping</b>(1),
<b>getaddrinfo</b>(3),
<b>getaddrinfo_a</b>(3),
<b>getpeername</b>(2),
<b>getsockname</b>(2),
<b>ping_setopt</b>(3),
<b>proc</b>(5),
<b>rds</b>(7),
<b>recv</b>(2),
<b>rtnetlink</b>(7),
<b>sctp</b>(7),
<b>sctp_connectx</b>(3),
<b>send</b>(2),
<b>udplite</b>(7)
http://gnu.org/licenses/gpl.html
http://translationproject.org/team/
Here I need to replace only:
<b>rds</b>(7),
<b>rtnetlink</b>(7),
<b>sctp</b>(7),
<b>udplite</b>(7)
with:
<b>rds</b>(7),
<b>rtnetlink</b>(7),
<b>sctp</b>(7),
<b>udplite</b>(7)
Using sed
$ sed 's|"\([[:alpha:]].*\)|"../\1|' file
<b>echoping</b>(1),
<b>getaddrinfo</b>(3),
<b>getaddrinfo_a</b>(3),
<b>getpeername</b>(2),
<b>getsockname</b>(2),
<b>ping_setopt</b>(3),
<b>proc</b>(5),
<b>rds</b>(7),
<b>recv</b>(2),
<b>rtnetlink</b>(7),
<b>sctp</b>(7),
<b>sctp_connectx</b>(3),
<b>send</b>(2),
<b>udplite</b>(7)
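Since the link markup was stripped from the listing above, here is a hypothetical illustration (made-up file name and hrefs) of what that substitution does: it inserts ../ right after the first double quote of any line whose quoted value starts with a letter, and leaves unquoted URL lines alone:
$ cat demo.txt
<a href="rds.7.html"><b>rds</b>(7)</a>,
http://gnu.org/licenses/gpl.html
$ sed 's|"\([[:alpha:]].*\)|"../\1|' demo.txt
<a href="../rds.7.html"><b>rds</b>(7)</a>,
http://gnu.org/licenses/gpl.html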

Perl script to copy each hour's logs with timestamps into a different file

First of all, I'm very new to programming, so I need your help writing a Perl script to do the following on Windows.
I have a big log file (1 GB) with timestamps, and it is difficult to read because it takes a long time to open. My requirement is to copy the last hour's logs from the big file into another file, then the next hour of data into a different file, and so on (so we will have 24 files for a day). The next day, the data in these files needs to be overwritten, or deleted and recreated.
Sample log :
09092016-00:02:00,..................
09092016-00:02:08,..................
09092016-00:02:15,..................
09092016-00:02:18,..................
Please help me with this; thanks in advance.
A simpler solution would be to use the split command to cut the log into manageable pieces.
split -l 1000 logfile logfile.
This will split your logfile into smaller files of 1000 lines each, named logfile.aa, logfile.ab, and so on.
You can then use grep -l to list the files that contain the day you need.
grep -l 09092016 logfile.*
for example:
logfile="./log"
while read -r d m y h; do
grep "^$d$m$y-$h" "$logfile" > "partial-${y}${m}{$d}-${h}.log"
done < <(sed -n 's/\(..\)\(..\)\(....\)-\(..\)\(.*\)/\1 \2 \3 \4/p' "$logfile" | sort -u)
Easy, but not efficient: it reads the whole big logfile 25 times for the split (once to gather the existing ddmmyyyy-hh stamps, and once again for every distinct date-hour found).
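If that is too slow, a single pass is enough. This awk sketch (untested against your real logs) assumes every line starts with a ddmmyyyy-hh:mm:ss stamp, so the date and hour sit in the first 11 characters:
awk '{
    d = substr($0, 1, 2); m = substr($0, 3, 2)    # day, month
    y = substr($0, 5, 4); h = substr($0, 10, 2)   # year, hour
    f = "partial-" y m d "-" h ".log"
    if (f != last) { close(last); last = f }      # keep one output file open at a time
    print >> f
}' "$logfile"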

How to use fish shell to add numbers to files

I have a very simple mp3 player, and the order in which it plays audio files is based on the file names; the rule is that there must be a 3-digit number at the beginning of each file name, such as:
001file.mp3
002file.mp3
003file.mp3
I want to write a fish shell script sortmp3 to add numbers to the files of a directory. Say the directory myfiles contains these files:
aaa.mp3
bbb.mp3
ccc.mp3
When I run sortmp3 myfiles, the file names will be changed to:
001aaa.mp3
002bbb.mp3
003ccc.mp3
But my questions are:
how to generate sequential numbers?
how to make sure each number is exactly 3 digits wide?
I would write this, which makes no assumptions about how many files there are in a directory:
function sortmp3
    set -l files *
    set -l i
    for i in (seq (count $files))
        echo mv $files[$i] (printf "%03d%s" $i $files[$i])
    end
end
Remove the "echo" if you like how it works.
You can generate sequential numbers with the seq tool - an external program.
That only takes care of the first part, though; it won't pad to three characters.
To do that, there's a variety of choices:
printf '%s\n' 00(seq 0 99) | rev | cut -c 1-3 | rev
printf '%s\n' 00(seq 0 99) | sed 's/^.*\(...\)$/\1/'
The 00(seq 0 99) part generates the numbers 0 to 99 with two zeroes prepended - i.e. from "000" to "0099". The later parts of the pipeline trim each result to its last three characters, removing the superfluous zeroes again.
Or with the next fish version, you can use the new string tool:
string sub -s -3 -- 00(seq 0 99)
Depending on your specific situation, you should use the "seq" command to generate sequential numbers or the "math" command to increment a counter. To format the number with a predictable number of leading zeros, use the "printf" command:
set idx 12
printf '%03d' $idx

Reading huge .csv files with matlab - file is not well organized

I have several .csv files that I read with MATLAB using textscan, because csvread and xlsread do not support files of this size (200 MB-600 MB).
I use this line to read it:
C = textscan(fileID,'%s%d%s%f%f%d%d%d%d%d%d%d','delimiter',',');
The problem I have found is that sometimes the data is not in this format, and then textscan stops reading at that line without any error.
So what I have done is to read it this way:
C = textscan(fileID,'%s%d%s%f%f%s%s%s%s%s%s%s%s%s%s%s','delimiter',',');
This way I can see that in 2 rows out of 3 million there is a change in the format.
I want to read all the lines except the bad/different lines.
In addition, is it possible to read only the lines whose first string is 'PAA'?
I have tried to load the file directly into MATLAB, but it is super slow and sometimes gets stuck, and for the really big files it reports an out-of-memory problem.
Any recommendations?
For large files which are still small enough to fit in memory, parsing all lines at once is typically the best choice.
f = fopen('data.txt');
g = textscan(f,'%s','delimiter','\n');
fclose(f);
In the next step you have to identify the lines starting with PAA; use strncmp.
Now that your data is filtered, apply your textscan format from above to each line. If it fails, try the other one.
MATLAB is slow with this kind of thing because it needs to load everything into memory. I would suggest using grep/bash/command-line tools to reduce the file to readable lines before processing it in MATLAB. On Linux you can use:
# Keep only the lines whose first field is PAA (note: case sensitive)
awk -F',' '$1 == "PAA"' yourfile.csv > yourNewFile.csv
To find the lines that do not have the expected format, you can use:
awk -F',' 'NF != 12 {print NR": "$0}' yourfile.csv
This counts the comma-separated fields on each line and prints the line number and content of every line that does not have exactly 12 fields (i.e. 11 commas).
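The two filters can also be combined into a single pass, so that MATLAB only ever sees well-formed PAA rows. This sketch assumes no field contains an embedded comma, and takes the 12-field count from the textscan format above:
awk -F',' '$1 == "PAA" && NF == 12' yourfile.csv > yourNewFile.csv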

Replace matches of one regex with matches from another, across two files

I am currently helping a friend reorganise several hundred images on a database-driven website. I have generated a list of the new, reorganised image paths offline and would like to replace each matching image reference in the SQL export of the database with the new path.
EDIT: Here is an example of what I am trying to achieve
The new_paths_list.txt is a file that I generated using a batch script after I had organised all of the existing images into folders. Prior to this all of the images were in just a few folders. A sample of this generated list might be:
image/data/product_photos/telephones/snom/snom_xyz.jpg
image/data/product_photos/telephones/gigaset/giga_xyz.jpg
A sample of my_exported_db.sql (the database exported from the website) might be:
...
,(110,32,'data/phones/snom_xyz.jpg',3),(213,50,'data/telephones/giga_xyz.jpg',0),
...
The result I want is my_exported_db.sql to be:
...
,(110,32,'data/product_photos/telephones/snom/snom_xyz.jpg',3),(213,50,'data/product_photos/telephones/gigaset/giga_xyz.jpg',0),
...
Some pseudo code to illustrate:
1/ Find the first image name in my_exported_db.sql, such as 'snom_xyz.jpg'.
2/ Find the same image name in new_paths_list.txt
3/ If it is present, copy the whole line (the path and filename)
4/ Replace the whole path of this image in my_exported_db.sql with the copied line
5/ Repeat for all other image names in my_exported_db.sql
A regex that appears to match image names is:
([^)''"/])+\.(?:jpg|jpeg|gif|png)
and one to match image names, complete with path (for relative or absolute) is:
\bdata[^)''"\s]+\.(?:jpg|jpeg|gif|png)
I have looked around and have seen that Sed or Awk may be capable of doing this, but some pointers would be greatly appreciated. I understand that this will only work accurately if there are no duplicated filenames.
You can use sed to convert new_paths_list.txt into a set of sed replacement commands:
sed 's|\(.*\(/[^/]*$\)\)|s#data\2#\1#|' new_paths_list.txt > rules.sed
The file rules.sed will look like this:
s#data/snom_xyz.jpg#image/data/product_photos/telephones/snom/snom_xyz.jpg#
s#data/giga_xyz.jpg#image/data/product_photos/telephones/gigaset/giga_xyz.jpg#
Then use sed again to translate my_exported_db.sql:
sed -i -f rules.sed my_exported_db.sql
I think in some shells it's possible to combine these steps and do without rules.sed:
sed 's|\(.*\(/[^/]*$\)\)|s#data\2#\1#|' new_paths_list.txt | sed -i -f - my_exported_db.sql
but I'm not certain about that.
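If your shell is bash and your sed is GNU sed, process substitution gives the same one-step effect without relying on -f - reading the script from standard input (a sketch, not tested on every sed):
sed -i -f <(sed 's|\(.*\(/[^/]*$\)\)|s#data\2#\1#|' new_paths_list.txt) my_exported_db.sql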
EDIT:
If the images are in several directories under data/, make this change:
sed "s|image/\(.*\(/[^/]*$\)\)|s#[^']*\2#\1#|" new_paths_list.txt > rules.sed