sed search and replace with 2 conditions/patterns - sed

I have a file:
http://www.gnu.org/software/coreutils/
glob
lxc-ls
I need to only replace below:
lxc-ls
with
lxc-ls
lxc-ls can be any word as I have multiple such links in several files which I need to replace.
I do not want to make any changes to the other 2 links. i.e.
http://www.gnu.org/software/coreutils/
glob
What I have tried so far is:
$ sed '/html/ i\
..' file
But this inserts the text at the start of the line, and the other condition, excluding the two URLs, is not fulfilled either.
Here is a more realistic example from one of the files.
<b>echoping</b>(1),
<b>getaddrinfo</b>(3),
<b>getaddrinfo_a</b>(3),
<b>getpeername</b>(2),
<b>getsockname</b>(2),
<b>ping_setopt</b>(3),
<b>proc</b>(5),
<b>rds</b>(7),
<b>recv</b>(2),
<b>rtnetlink</b>(7),
<b>sctp</b>(7),
<b>sctp_connectx</b>(3),
<b>send</b>(2),
<b>udplite</b>(7)
http://gnu.org/licenses/gpl.html
http://translationproject.org/team/
Here I only need to replace:
<b>rds</b>(7),
<b>rtnetlink</b>(7),
<b>sctp</b>(7),
<b>udplite</b>(7)
with:
<b>rds</b>(7),
<b>rtnetlink</b>(7),
<b>sctp</b>(7),
<b>udplite</b>(7)

Using sed
$ sed s'|"\([[:alpha:]].*\)|"../\1|' file
<b>echoping</b>(1),
<b>getaddrinfo</b>(3),
<b>getaddrinfo_a</b>(3),
<b>getpeername</b>(2),
<b>getsockname</b>(2),
<b>ping_setopt</b>(3),
<b>proc</b>(5),
<b>rds</b>(7),
<b>recv</b>(2),
<b>rtnetlink</b>(7),
<b>sctp</b>(7),
<b>sctp_connectx</b>(3),
<b>send</b>(2),
<b>udplite</b>(7)
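The link markup did not survive in the listings above, so the following is only a sketch under assumptions: each link sits on its own line as an href="..." attribute, the links to be changed are bare relative names such as href="rds.7.html", and the links to be left alone start with ../ or with a scheme such as http://. The function name fix_links and the sample lines are made up for illustration. It skips any line that holds an absolute URL and prefixes ../ only where the href value starts with a letter.

```shell
# Hypothetical helper: prefix "../" to bare relative hrefs only.
# Assumes one link per line; lines containing href="scheme://..." are skipped.
fix_links() {
  sed '/href="[a-z][a-z]*:\/\//!s|href="\([[:alpha:]]\)|href="../\1|' "$@"
}

# Example with made-up sample lines (reads stdin when no file is given)
printf '%s\n' \
  '<a href="rds.7.html"><b>rds</b>(7)</a>,' \
  '<a href="../man2/recv.2.html"><b>recv</b>(2)</a>,' \
  '<a href="http://gnu.org/licenses/gpl.html">http://gnu.org/licenses/gpl.html</a>' |
  fix_links
```

Anchoring the match on href=" rather than on a bare double quote keeps other quoted attributes untouched, and the address with ! is what implements the second condition (leave the absolute URLs alone).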

Related

oneliner -- multiple file substitution transformation produces out-of-sync results

Context
perl 5.22
multi-file transformation with perl oneliner
Overview
TrevorWattanStewie has a directory full of config files, and he wants to transform them.
The transformation operation is best understood by comparing "BEFORE" to "AFTER".
Files BEFORE
## ./configfile001.config
TrevorWattanStewie@oldmail.com;--blank--
## ./configfile002.config
TrevorWattanStewie@oldmail.com;--blank--
## ./configfile003.config
TrevorWattanStewie@oldmail.com;--blank--
## ./configfile004.config
TrevorWattanStewie@oldmail.com;--blank--
SallyWattanStewie@oldmail.com;--blank--
RickyWattanStewie@oldmail.com;--blank--
Files AFTER (Desired result)
## ./configfile001.config
TrevorWattanStewie@newmail.com;configfile001.config
## ./configfile002.config
TrevorWattanStewie@newmail.com;configfile002.config
## ./configfile003.config
TrevorWattanStewie@newmail.com;configfile003.config
## ./configfile004.config
TrevorWattanStewie@newmail.com;configfile004.config
SallyWattanStewie@newmail.com;configfile004.config
RickyWattanStewie@newmail.com;configfile004.config
Step by Step Explanation
Trevor wants to:
replace all --blank-- tokens with the name of the file currently being processed.
change all substrings from @oldmail into @newmail
Trevor's attempt
Trevor decides the quickest way to get the job done is with a perl oneliner script.
The oneliner Trevor uses is as follows:
perl -pi -e '$curf=$ARGV[0];s/--blank--/$curf/; s/@oldmail.com/@newmail.com/;' *.asc
Problem
When Trevor runs the script, the output does not meet his expectations.
The actual result is as follows:
Files AFTER (Actual result)
## ./configfile001.config
TrevorWattanStewie@oldmail.com;configfile002.config
## ./configfile002.config
TrevorWattanStewie@oldmail.com;configfile003.config
## ./configfile003.config
TrevorWattanStewie@oldmail.com;configfile004.config
## ./configfile004.config
TrevorWattanStewie@oldmail.com;
SallyWattanStewie@oldmail.com;
RickyWattanStewie@oldmail.com;
Questions
Why did Trevor's script fail to transform @oldmail to @newmail?
Why is the file numbering mismatched? The sequence numbering is off by one.
You want to use the variable $ARGV, which is the name of the currently processed file. $ARGV[0] is something else: the first element of the array @ARGV, which holds the files still waiting to be processed, so it names the next file (and is empty for the last one).
So s/--blank--/$ARGV/;
Also, @oldmail (etc) will be interpolated inside the regex, as Wumpus Q. Wumbley notes.
I always run my one-liners with -wE.
Trevor didn't enable warnings, thus missing out on the explanation:
$ perl -wpi -e '$curf=$ARGV[0];s/--blank--/$curf/; s/@oldmail.com/@newmail.com/;' *.asc
Possible unintended interpolation of @oldmail in string at -e line 1.
Possible unintended interpolation of @newmail in string at -e line 1.
@oldmail and @newmail are arrays. The s/// operator interpolates variables, including arrays. You need to use \@

Why is my sed script to split a FASTA file slow?

I have a 600 MB FASTA file containing many alignment blocks from 12 species, and I want to split it into smaller FASTA files, each containing one block with its corresponding alignments.
I have a sed script that looks like this:
#!/bin/bash
echo
for i in {0..Nblocks}; do
sed -n "/block_index=$i|/,/^$/p" genome12species.fasta > bloque$i.fasta
done
This works at a small scale, but for a file as big as 600 MB it takes too long, around 2 days. I don't think this is a matter of the computer I am running it on.
Does anyone know how to make this faster?
The input FASTA file looks like this:
dm3.chr3R(-):17092630-17092781|sequence_index=0|block_index=4|species=dm3|dm3_4_0
GGCGGAGATCAAGAATCGCGTCGGGCCGCCGTCCAGCGCCACTGACAACGCTAGCAAAGTGAAAATCGATCAAGGACGTCCAGTGGAAAACACCAAATCCGGTTGCTGCTGAATAA-CTCTGATTGTGAATCattattttattatacaatta
droGri2.scaffold_15074(-):2610183-2610334|sequence_index=0|block_index=4|species=droGri2|droGri2_4_0
GGCGGAGATCAAGAATCGTGTTGGGCCGCCGTCGAGCGCCACCGATAACGCTAGCAAAGTGAAAATCGATCAGGGACGCCCAGTGGAAAACAATAGATCTGGTTGCTGCTAAATAA-CTCTGATTGTGAATCATTATTTTATTATACAATTa
droMoj3.scaffold_6540(+):33866311-33866462|sequence_index=0|block_index=4|species=droMoj3|droMoj3_4_0
TGCCGAGATTAAGAATCGTGTCGGTCCGCCGTCCAGCGCAACCGACAATGCAAGCAAAGTGAAAATCGATCAGGGACGTCCAGTGGAGAACACCAGATCTGGTTGCTGCTGAATAA-CTCTGATTGTGAATCATTATTTTATTatacaatta
droVir3.scaffold_12822(+):1248119-1248270|sequence_index=0|block_index=4|species=droVir3|droVir3_4_0
GGCCGAGATTAAGAATCGCGTCGGGCCGCCGTCCAGCGCCACCGATAATGCTAGCAAAGTGAAAATCGATCAGGGTCGTCCAGTGGAGAACACCAAATCTGGTTGCTGCTGAATAA-CTCTGATTGTGAATCattattttattatacaatta
droWil1.scaffold_181130(-):16071336-16071488|sequence_index=0|block_index=4|species=droWil1|droWil1_4_0
GGCCGAGATTAAGAATCGTGTTGGGCCGCCGTCCAGCGCCACTGATAATGCTAGCAAAGTGAAAATCGATCAAGGACGTCCAGTGGAAAATACCAAATCCGGTTGCTGCTGAATAAACTCTGATTGTGAATCATTATTTTATTATACAATTA
droPer1.super_19(-):1310088-1310239|sequence_index=0|block_index=4|species=droPer1|droPer1_4_0
GGCTGAGATCAAGAATCGCGTCGGACCGCCGTCCAGCGCCACCGACAACGCTAGCAAAGTGAAAATCGATCAAGGACGTCCAGTGGAAAAACCCAATTCTGGTTGCTGCTGAATAA-CTCTGATTGTGAATCattattttattatacaatta
dp4.chr2(-):5593491-5593642|sequence_index=0|block_index=4|species=dp4|dp4_4_0
GGCTGAGATCAAGAATCGCGTCGGACCGCCGTCCAGCGCCACCGACAACGCTAGCAAAGTGAAAATCGATCAAGGACGTCCAGTGGAAAAGCCCAATTCTGGTTGCTGCTGAATAA-CTCTGATTGTGAATCattattttattatacaatta
droAna3.scaffold_13340(-):3754154-3754305|sequence_index=0|block_index=4|species=droAna3|droAna3_4_0
GGCCGAGATCAAGAATCGCGTCGGGCCACCGTCCAGCGCCACCGACAACGCTAGCAAAGTGAAAATCGATCAAGGACGTCCAGTGGAAAACACCAGATCCGGTTGCTGCTGAATAA-CTCTGATTGTGAATCattattttattataaaatta
droEre2.scaffold_4770(+):4567591-4567742|sequence_index=0|block_index=4|species=droEre2|droEre2_4_0
GGCCGAGATCAAGAATCGCGTCGGGCCGCCGTCCAGCGCCACCGACAACGCTAGCAAAGTGAAAATCGATCAAGGACGTCCAGTGGAAAACACCAAATCCGGTTGCTGCTGAATAA-CTCTGATTGTGAATCattattttattatacaatta
droYak2.chr3R(-):5883047-5883198|sequence_index=0|block_index=4|species=droYak2|droYak2_4_0
GGCCGAGATCAAGAATCGCGTCGGGCCGCCATCCAGCGCCACCGACAACGCTAGCAAAGTGAAAATCGATCAAGGACGTCCAGTGGAAAACACCAAATCCGGTTGCTGCTGAATAA-CTCTGATTGTGAATCattattttattatacaatta
droSec1.super_38(+):36432-36583|sequence_index=0|block_index=4|species=droSec1|droSec1_4_0
GGCGGAGATCAAGAATCGCGTCGGTCCGCCGTCCAGCGCCACTGACAACGCTAGCAAAGTGAAAATCGATCAAGGACGTCCAGTGGAAAACACCAAATCCGGTTGCTGCTGAATAA-CTCTGATTGTGAATCattattttattatacaatta
droSim1.chr3R(+):4366350-4366501|sequence_index=0|block_index=4|species=droSim1|droSim1_4_0
GGCGGAGATCAAGAATCGCGTCGGGCCGCCGTCCAGCGCCACTGACAACGCTAGCAAAGTGAAAATCGATCAAGGACGTCCAGTGGAAAACACCAAATCCGGTTGCTGCTGAATAA-CTCTGATTGTGAATCattattttattatacaatta
dm3.chr3R(-):17092781-17092867|sequence_index=0|block_index=5|species=dm3|dm3_5_0
GAGTACGCCGCCCAGTTAGGCATTCCATTCCTTGAAACTTCGGCCAAGAGCGCCACCAACGTTGAGCAGGCCTTCATGACGATGGC
droSim1.chr3R(+):4366264-4366350|sequence_index=0|block_index=5|species=droSim1|droSim1_5_0
GAGTACGCCGCCCAGTTAGGCATTCCATTCCTTGAAACTTCGGCCAAGAGCGCCACCAACGTTGAGCAGGCCTTTATGACGATGGC
droSec1.super_38(+):36346-36432|sequence_index=0|block_index=5|species=droSec1|droSec1_5_0
GAGTACGCCGCCCAGTTAGGCATTCCATTCCTTGAAACTTCGGCCAAGAGCGCCACCAACGTTGAGCAGGCCTTCATGACGATGGC
droYak2.chr3R(-):5883198-5883284|sequence_index=0|block_index=5|species=droYak2|droYak2_5_0
GAGTACGCCGCCCAGTTAGGCATTCCATTCCTTGAAACATCGGCCAAGAGCGCCACCAACGTGGAGCAGGCCTTCATGACGATGGC
droEre2.scaffold_4770(+):4567505-4567591|sequence_index=0|block_index=5|species=droEre2|droEre2_5_0
GAGTACGCCGCCCAGTTAGGCATTCCATTCCTTGAAACTTCGGCCAAGAGCGCCACCAACGTGGAGCAGGCCTTCATGACGATGGC
droAna3.scaffold_13340(+):20375068-20375148|sequence_index=0|block_index=5|species=droAna3|droAna3_5_0
------GCCGAAAACTTCGACATGCCCTTCTTCGAGGTCTCTTGCAAGTCAAACATCAATATTGAAGATGCGTTTCTTTCCCTGGC
dp4.chr2(-):5593642-5593728|sequence_index=0|block_index=5|species=dp4|dp4_5_0
GAGTATGCAGCTCAGTTAGGCATTCCATTTCTTGAAACTTCGGCCAAGAGCGCCACGAACGTGGAGCAGGCCTTCATGACGATGGC
droPer1.super_19(-):1310239-1310325|sequence_index=0|block_index=5|species=droPer1|droPer1_5_0
GAGTATGCAGCTCAGTTAGGCATTCCATTTCTTGAAACTTCGGCCAAGAGCGCCACGAACGTGGAGCAGGCCTTCATGACGATGGC
droWil1.scaffold_181130(-):16071488-16071574|sequence_index=0|block_index=5|species=droWil1|droWil1_5_0
GAATATGCGGCTCAGTTAGGCATTCCATTCCTTGAAACTTCGGCAAAGAGTGCCACCAATGTGGAGCAGGCCTTTATGACGATGGC
droVir3.scaffold_12822(+):1248033-1248119|sequence_index=0|block_index=5|species=droVir3|droVir3_5_0
GAGTACGCACATCAGTTAGGCATTCCATTCCTTGAAACTTCGGCCAAGAGCGCCACCAACGTGGAGCAGGCATTTATGACGATGGC
droMoj3.scaffold_6540(+):33866225-33866311|sequence_index=0|block_index=5|species=droMoj3|droMoj3_5_0
GAGTATGCACATCAGTTAGGCATTCCATTCCTTGAAACTTCGGCCAAGAGCGCCACCAATGTAGAGCAGGCATTCATGACGATGGC
droGri2.scaffold_15074(-):2610334-2610420|sequence_index=0|block_index=5|species=droGri2|droGri2_5_0
GAGTACGCAAATCAGTTAGGCATTCCATTCCTTGAAACTTCGGCGAAGAGTGCCACCAATGTGGAACAGGCATTCATGACGATGGC
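Your loop is slow because it starts sed once per block, and every one of those runs reads the whole 600 MB file again. Here's an awk one-liner to get you started. It reads the file only once and uses the same regex range as your sed; the matched block_index is in m[1]. Note that the three-argument form of match() is a GNU awk (gawk) extension. 600 MB should take just a few minutes.
awk 'match($0, /block_index=([0-9]+)\|/, m),/^$/ {print > ("bloque" m[1] ".fasta")}' genome12species.fasta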
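If gawk is not available, the same single-pass idea can be sketched with POSIX awk only, using the two-argument match() together with RSTART and RLENGTH. This is an illustration under assumptions, not the answerer's code: the function name split_blocks is made up, and it assumes every header line carries block_index=N| as in the sample, with each sequence line following its header.

```shell
# Hypothetical single-pass splitter using only POSIX awk features.
# Writes bloqueN.fasta files into the current directory.
split_blocks() {
  awk '
    # Header line: pull N out of "block_index=N|" (12 = length of "block_index=")
    match($0, /block_index=[0-9]+\|/) {
      idx  = substr($0, RSTART + 12, RLENGTH - 13)
      file = "bloque" idx ".fasta"
      if (file != current) {
        if (current != "") close(current)   # keep only one output file open
        current = file
      }
    }
    # Every line after the first header goes to the file of the latest header
    current != "" { print >> current }
  ' "$1"
}

# Usage: split_blocks genome12species.fasta
```

The >> operator appends, so blocks that reappear later in the input are not truncated; the flip side is that a second run adds to existing bloque*.fasta files, so an empty output directory is the safer starting point.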

Find and Replace one list of "words" with another list of "words" pairwise in csh

I am trying to modify some lengthy code. I want to replace all words in list 1 with the words in list 2 (pairwise).
List 1:
Vsap1*(GF/(Kagf+GF))
kdap1*AP1
vsprb
kpc1*pRB*E2F
.
.
List 2:
v1
v2
v3
v4
.
.
In other words, I'd like it to replace all instances of "Vsap1*(GF/(Kagf+GF))" with "v1" (and so on) in the file "code.txt". I have List 1 in a text file ("search_for.txt").
So far, I've been doing something like this:
set search_for=`cat search_for.txt`
set vv=1
foreach reaction $search_for
sed -i s/$reaction/$vv/g code.txt
set vv=$vv+1
end
There are many problems with this code. First, it seems the code can't handle expressions with parentheses (something about "regular expressions"?). Second, I'm not sure my counter is working properly. Third, I haven't even integrated the replace list; I thought it would be easier to just replace with 1, 2, 3… instead. Ideally, I would like to replace with v1, v2, v3…
Any help would be greatly appreciated!! I work mainly in Matlab (in which it is hard to deal with strings and such) so I'm not that great at csh.
Best,
Mehdi
awk would be a better fit, I think.
set search_for=`cat search_for.txt`
set vindex=1
foreach reaction ${search_for}
ReactionEscaped="`printf \"%s\" \"${reaction}\" | sed 's²[\+*./[]²\\\\&²g'`"
sed -i "s/${ReactionEscaped}/v${vindex}/g" code.txt
let vindex+=1
end
I haven't tested this (no system available here), so the line
ReactionEscaped="`printf \"%s\" \"${reaction}\" | sed 's²[\+*./[]²\\\\&²g'`"
will certainly need fine-tuning, because of the doubled backslashes between the double quotes and the special meaning of some characters in the first sed pattern. There are lots of posts on this site about escaping special characters in a sed pattern.
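One way to sidestep the escaping problem entirely is to treat the search strings as literal text instead of regular expressions. The sketch below does that with awk's index(), written for a POSIX shell rather than csh; the function name replace_pairwise is made up, and it assumes search_for.txt holds one search string per line, with line N being replaced by vN.

```shell
# Hypothetical helper: replace the Nth line of the list file with "vN",
# using literal (non-regex) matching. Result is written to stdout.
#   $1 = list of search strings, one per line
#   $2 = file to rewrite
replace_pairwise() {
  awk '
    NR == FNR { if ($0 != "") pat[++n] = $0; next }   # first file: search list
    {
      line = $0
      for (i = 1; i <= n; i++) {
        out = ""
        # index() searches for a plain substring, so * ( ) + / need no escaping
        while ((p = index(line, pat[i])) > 0) {
          out  = out substr(line, 1, p - 1) "v" i
          line = substr(line, p + length(pat[i]))
        }
        line = out line
      }
      print line
    }
  ' "$1" "$2"
}

# Usage: replace_pairwise search_for.txt code.txt > code_new.txt
```

The strings are applied in list order, so if one search string is contained in another, the longer one should come first in the list.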

Replace matches of one regex expression with matches from another, across two files

I am currently helping a friend reorganise several hundred images on a database driven website. I have generated a list of the new, reorganised image paths offline and would like to replace each matching image reference in the sql export of the database with the new paths.
EDIT: Here is an example of what I am trying to achieve
The new_paths_list.txt is a file that I generated using a batch script after I had organised all of the existing images into folders. Prior to this all of the images were in just a few folders. A sample of this generated list might be:
image/data/product_photos/telephones/snom/snom_xyz.jpg
image/data/product_photos/telephones/gigaset/giga_xyz.jpg
A sample of my_exported_db.sql (the database exported from the website) might be:
...
,(110,32,'data/phones/snom_xyz.jpg',3),(213,50,'data/telephones/giga_xyz.jpg',0),
...
The result I want is my_exported_db.sql to be:
...
,(110,32,'data/product_photos/telephones/snom/snom_xyz.jpg',3),(213,50,'data/product_photos/telephones/gigaset/giga_xyz.jpg',0),
...
Some pseudo code to illustrate:
1/ Find the first image name in my_exported_db.sql, such as 'snom_xyz.jpg'.
2/ Find the same image name in new_paths_list.txt
3/ If it is present, copy the whole line (the path and filename)
4/ Replace the whole path of this image in my_exported_db.sql with the copied line
5/ Repeat for all other image names in my_exported_db.sql
A regular expression that appears to match image names is:
([^)''"/])+\.(?:jpg|jpeg|gif|png)
and one to match image names, complete with path (for relative or absolute) is:
\bdata[^)''"\s]+\.(?:jpg|jpeg|gif|png)
I have looked around and have seen that Sed or Awk may be capable of doing this, but some pointers would be greatly appreciated. I understand that this will only work accurately if there are no duplicated filenames.
You can use sed to convert new_paths_list.txt into a set of sed replacement commands:
sed 's|\(.*\(/[^/]*$\)\)|s#data\2#\1#|' new_paths_list.txt > rules.sed
The file rules.sed will look like this:
s#data/snom_xyz.jpg#image/data/product_photos/telephones/snom/snom_xyz.jpg#
s#data/giga_xyz.jpg#image/data/product_photos/telephones/gigaset/giga_xyz.jpg#
Then use sed again to translate my_exported_db.sql:
sed -i -f rules.sed my_exported_db.sql
I think in some shells it's possible to combine these steps and do without rules.sed:
sed 's|\(.*\(/[^/]*$\)\)|s#data\2#\1#|' new_paths_list.txt | sed -i -f - my_exported_db.sql
but I'm not certain about that.
EDIT:
If the images are in several directories under data/, make this change:
sed "s|image/\(.*\(/[^/]*$\)\)|s#[^']*\2#\1#|" new_paths_list.txt > rules.sed
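The two steps can be wrapped up as in the sketch below, which uses the rule from the EDIT. It is an illustration under assumptions: the function name rewrite_paths is made up, the result goes to stdout instead of being written back with -i, and a g flag is added to the generated rules so that a file name occurring more than once on a line is replaced each time.

```shell
# Hypothetical wrapper: build the sed rules from the path list, then apply them.
#   $1 = new_paths_list.txt, $2 = my_exported_db.sql; result goes to stdout
rewrite_paths() {
  rules=$(mktemp)
  # Each list line becomes a rule such as:
  #   s#[^']*/snom_xyz.jpg#data/product_photos/telephones/snom/snom_xyz.jpg#g
  sed "s|image/\(.*\(/[^/]*$\)\)|s#[^']*\2#\1#g|" "$1" > "$rules"
  sed -f "$rules" "$2"
  rm -f "$rules"
}

# Usage: rewrite_paths new_paths_list.txt my_exported_db.sql > my_updated_db.sql
```

The dots in the file names end up unescaped in the generated rules, so they match any character; with names like snom_xyz.jpg that is unlikely to matter, but it is worth keeping in mind, as is the assumption that no path contains a # character.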

How to extract strings from plist files for translation (localization)?

I need to prepare list of strings for translation of my iPhone application.
I have extracted strings from *.m files using genstrings and from the XIB files using the ibtool command.
But I also have lots of text to translate in plist files (String field types enclosed in string tags).
Is there a nice bash script / command to extract those strings into a flat txt file?
I could then review and filter it, so that my translators can work with a nice list rather than an alien-looking XML file.
I made a custom shell script which tries to figure out the values needed. You can then use the localize.py script in a modified way (see below) to automatically create the translation files. (The line breaks were somehow very important.) If there are more entities to be translated, the shell script can be modified accordingly.
#!/bin/bash
rm -f $2
sed -n 'N;/<key>Title<\/key>/{N;/<string>.*<\/string>/{s/.*<string>\(.*\)<\/string>.*/\/* \1 *\/\
"\1" = "\1";\
/p;};}' $1 >> $2
sed -n 'N;/<key>FooterText<\/key>/{N;/<string>.*<\/string>/{s/.*<string>\(.*\)<\/string>.*/\/* \1 *\/\
\"\1" = "\1";\
/p;}
;}' $1 >> $2
sed -n 'N;/<key>Titles<\/key>/{N;/<array>/{:a
N;/<\/array>/!{
/<string>.*<\/string>/{s/.*<string>\(.*\)<\/string>.*/\/* \1 *\/\
\"\1" = "\1";\
/p;}
ba
;};};}' $1 >> $2
The localize.py script needed some modification. Therefore I created a small package containing the localizer for the source code and for the plist files. The new script even supports duplicates (meaning it will kick them out).
We recently made a small online application to do that, please take a look at: http://www.icapps.be/plist-translator/
I can't think of any command off the top of my head. However, plists are glorified xml files and there are various parsers available for them.
It shouldn't be too difficult to create a simple python script to get all the strings from the file.
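For simple plists, a rough shell equivalent may already be enough to produce the flat list. The sketch below is an assumption-laden shortcut rather than a real XML parser: it only catches string elements that open and close on the same line, it does not decode XML entities such as &amp;amp;, and the function name extract_plist_strings is made up.

```shell
# Hypothetical helper: list the text of every single-line <string>...</string>
# element in an XML plist, sorted and with duplicates removed.
extract_plist_strings() {
  sed -n 's/.*<string>\(.*\)<\/string>.*/\1/p' "$1" | sort -u
}

# Usage: extract_plist_strings Root.plist > strings_to_translate.txt
```

Binary plists would have to be converted to XML first, and anything with multi-line strings is better served by a proper plist parser.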
Does this help?
http://www.icanlocalize.com/site/tutorials/how-to-translate-plist-files/
We much prefer paying clients who use our translation system with our translators, but you can translate yourself in our GUI at no charge.