How do I repeat a character n times in a string? - perl

I am learning Perl, so please bear with me for this noob question.
How do I repeat a character n times in a string?
I want to do something like below:
$numOfChar = 10;
s/^\s*(.*)/' ' x $numOfChar$1/;

By default, the replacement part of a substitution is treated as a string. To execute code there instead, you have to use the e modifier.
$numOfChar = 10;
s/^(.*)/' ' x $numOfChar . $1/e;
This will add $numOfChar spaces to the start of your text. To do it for every line of a file, use the -p flag (for quick one-liner processing):
perl -pe '$n = 10; s/^(.*)/" " x $n . $1/e' foo.txt > bar.txt
(Note the single quotes around the program: with double quotes the shell would expand $n itself.)
or, if it's part of a larger script, use the g and m modifiers (g for global, i.e. repeated, substitution and m to make ^ match at the start of each line):
$n = 10;
$text =~ s/^(.*)/' ' x $n . $1/mge;

Your regular expression can be written as:
$numOfChar = 10;
s/^(.*)/(' ' x $numOfChar).$1/e;
but you can simplify it to:
s/^/' ' x $numOfChar/e;
Or without using regexps at all:
$_ = ( ' ' x $numOfChar ) . $_;

You're on the right track: Perl's x operator repeats a string a given number of times.
print "test\n" x 10; # prints 10 lines of "test"
EDIT: To do this inside a regular expression, it would probably be best (a.k.a. most maintainer friendly) to just assign the value to another variable.
my $spaces = " " x 10;
s/^\s*(.*)/$spaces$1/;
There are ways to do it without an extra variable, but it's just my $0.02 that it'll be easier to maintain if you do it this way.
EDIT: I fixed my regex. Sorry I didn't read it right the first time.

Related

Perl CLI one-liner cannot append a string to each line

I'm trying to use a perl -npe one-liner to surround each line with =.
$ for i in {1..4}; { echo $i ;} |perl -npe '...'
=1=
=2=
=3=
=4=
The following is my first attempt. Note that the line feeds are in the incorrect position.
$ for i in {1..4}; { echo $i ;} |perl -npe '$_= "=".$_."=" '
=1
==2
==3
==4
=
I tried using chop to remove the line feeds and then re-add them in the correct position, but it didn't work.
$ for i in {1..4} ;{ echo $i ;} |perl -npe '$_= "=".chop($_)."=\n" '
=
=
=
=
=
=
=
=
How can I fix this? Thanks.
chop returns the removed character, not the remaining string. It modifies the variable in-place. So the following is the correct usage:
perl -npe'chop( $_ ); $_ = "=$_=\n"'
But we can improve this.
It's safer to use chomp instead of chop to remove trailing line feeds.
-n is implied by -p, and it's customary to leave it out when -p is used.
chomp and chop modify $_ by default, so we don't need to explicitly pass $_.
perl -pe'chomp; $_ = "=$_=\n"'
Finally, we can get exactly the same behaviour out of -l, which chomps each input line and restores the newline on output.
perl -ple'$_ = "=$_="'

Conditional editing of line by one-liner

My question is more of an optimization one, rather than a "howto".
I have a lef file, with thousands of lines in the form of:
RECT 429.336 273.821 426.246 274.721 ;
I wanted to move left by 4 um all rects above a certain point using a one-liner:
perl -lane '$F[2] > 1200 ? print $F[0]," ", ($F[1] - 4)," ", $F[2]," ", ($F[3] -4)," ", $F[4], " ;" : print $_' trial.lef
Thing is, this is UGLY.
Is there a nicer way of editing the file?
I'm not picky and will be happy to have answers with other languages (awk, sed, etc.) as long as they are nicer than what I wrote.
Additional input:
LAYER M12 ;
RECT 0 411.214 1 412.214 ; <-- shouldn't change, because 411.214 < 1200
END
END kuku_pin
PIN gaga_pin
DIRECTION OUTPUT ;
USE SIGNAL ;
PORT
LAYER M11 ;
RECT 43.1045 1203.138 43.1805 1207.29 ; <-- should change to "RECT 39.1045 1203.138 39.1805 1207.29"
END
There really is not much room for improvement, but you can replace -n with -p to skip the extra print. Further, you can edit the array elements and use join for a bit prettier code:
perl -lape'if ($F[2] > 1200) { $F[1] -= 4; $F[3] -= 4; $_ = join " ", @F }'
-a autosplit mode, splits the line $_ on space and puts the values in the predefined @F array. This switch is used with -n or -p.
-p loops around the <> operator input, file or standard input
-= decreases the LHS by amount in RHS
join joins the line back together after math has been done
-l can be skipped in this case, since we never touch the line endings, but keeping it makes the code more flexible if we decide to edit the last field.
When the condition is not met, the original line is printed unchanged. Otherwise, it is replaced with the joined values in @F.
If you decide to keep the leading whitespace before RECT you can surround your if-statement with
if (($pre) = /^(\s*RECT)/)
to store the beginning of the line, making the one-liner:
perl -lape'if (($pre) = /^(\s*RECT)/) { if ($F[2] > 1200) { $F[1] -= 4; $F[3] -= 4; $F[0] = $pre; $_ = join " ", @F }}'

How can I write this sed/bash command in awk or perl (or python, or ...)?

I need to replace instances of Progress (n,m) and Progress label="some text title" (n,m) in a scripting language with new values (N,M) where
N= integer ((n/m) * normal)
M= integer ( normal )
The progress statement can be anywhere on the script line (and worse, though not with current scripts, split across lines).
The value normal is a specified number between 1 and 255, and n and m are floating point numbers
So far, my sed implementation is below. It only works on Progress (n,m) formats and not Progress label="Title" (n,m) formats, but it's just plain nuts:
#!/bin/bash
normal=$1;
file=$2
for n in $(sed -rn '/Progress/s/Progress[ \t]+\(([0-9\. \t]+),([0-9\. \t]+)\).+/\1/p' "$file" )
do
m=$(sed -rn "/Progress/s/Progress[ \t]+\(${n},([0-9\. \t]+).+/\1/p" "$file")
N=$(echo "($normal * $n)/$m" | bc)
M=$normal
sed -ri "/Progress/s/Progress[ \t]+\($n,$m\)/Progress ($N,$M)/" "$file"
done
Simply put: This works, but, is there a better way?
My toolbox has sed and bash scripting in it, and not so much perl, awk and the like which I think this problem is more suited to.
Edit Sample input.
Progress label="qt-xx-95" (0, 50) thermal label "qt-xx-95" ramp(slew=.75,sp=95,closed) Progress (20, 50) Pause 5 Progress (25, 50) Pause 5 Progress (30, 50) Pause 5 Progress (35, 50) Pause 5 Progress (40, 50) Pause 5 Progress (45, 50) Pause 5 Progress (50, 50)
Progress label="qt-95-70" (0, 40) thermal label "qt-95-70" hold(sp=70) Progress (10, 40) Pause 5 Progress (15, 40) Pause 5 Progress (20, 40) Pause 5 Progress (25, 40) Pause 5
awk has good splitting capabilities, so it might be a good choice for this problem.
Here is a solution that works for the supplied input, let's call it update_m_n_n.awk. Run it like this in bash: awk -f update_m_n_n.awk -v normal=$NORMAL input_file.
#!/usr/bin/awk -f
BEGIN {
ORS = RS = "Progress"
FS = "[)(]"
if(normal == 0) normal = 10
}
NR == 1 { print }
length > 1 {
split($2, A, /, */)
N = int( normal * A[1] / A[2] )
M = int( normal )
sub($2, N ", " M)
print $0
}
Explanation
ORS = RS = "Progress": Split sections at Progress and include Progress in the output.
FS = "[)(]": Separate fields at parenthesis.
NR == 1 { print }: Insert ORS before the first section.
split($2, A, /, */): Assuming there is only one parenthesized item between occurrences of Progress, this splits n and m into the A array.
sub($2, N ", " M): Substitute the new values into the current record.
This is somewhat brittle but it seems to do the trick? It could be turned into a one-liner with perl -pe but I think this is clearer:
use 5.16.0;
my $normal = $ARGV[0];
while (<STDIN>) {
    # literal parentheses must be escaped; $1 may be undef when there is no label
    s{Progress +(label=".+?" +)?\( *([0-9.]+) *, *([0-9.]+) *\)}
     {sprintf("Progress %s(%d,%d)", $1 // "", int(($2/$3)*$normal), int($normal))}eg;
    print $_;
}
The basic idea is to optionally capture the label clause in $1, and to capture n and m into $2 and $3. We use perl's ability to replace the matched string with an evaluated piece of code by providing the "e" modifier. It's going to fail dramatically if the label clause has any escaped quotes or contains something that looks like a Progress token, so it's not ideal. I agree that you need an honest-to-goodness parser here, though you could modify this regex to correct some of the obvious deficiencies, like the weak number matching for n and m.
My initial thought was to try sed with recursive substitutions (t command), however I suspected that would get stuck.
This perl code might work for statements that are not split across lines. For splits across lines, perhaps it makes sense to write a separate pre-processor to join disparate lines.
The code splits "Progress" statements into separate line-segments, applies any replacement rules then rejoins the segments into one line and prints. Non-matching lines are simply printed. The matching code uses back-references and becomes somewhat unreadable. I have assumed your "normal" parameter can take floating values as the spec didn't seem clear.
#!/usr/bin/perl -w
use strict;
die("Wrong arguments") if (@ARGV != 2);
my ($normal, $file) = @ARGV;
open(FILE, '<', $file) or die("Cannot open $file");
while (<FILE>) {
chomp();
my $line = $_;
# Match on lines containing "Progress"
if (/Progress/) {
$line =~ s/(Progress)/\n$1/go; # Insert newlines on which to split
my #segs = split(/\n/, $line); # Split line into segments containing possibly one "Progress" clause
# Apply text-modification rules
@segs = map {
if (/(Progress[\s\(]+)([0-9\.]+)([\s,]+)([0-9\.]+)(.*)/) {
my $newN = int($2/$4 * $normal);
my $newM = int($normal);
$1 . $newN . $3 . $newM . $5;
} elsif (/(Progress\s+label="[^"]+"[\s\(]+)([0-9\.]+)([\s,]+)([0-9\.]+)(.*)/) {
my $newN = int($2/$4 * $normal);
my $newM = int($normal);
$1 . $newN . $3 . $newM . $5;
} else {
$_; # Segment doesn't contain "Progress"
}
} @segs;
$line = join("", @segs); # Reconstruct the single line
}
print($line,"\n"); # Print all lines
}

How to quickly find and replace many items on a list without replacing previously replaced items in BASH?

I want to perform about many find and replace operations on some text. I have a UTF-8 CSV file containing what to find (in the first column) and what to replace it with (in the second column), arranged from longest to shortest.
E.g.:
orange,fruit2
carrot,vegetable1
apple,fruit3
pear,fruit4
ink,item1
table,item2
Original file:
"I like to eat apples and carrots"
Resulting output file:
"I like to eat fruit3s and vegetable1s."
However, I want to ensure that if one part of text has already been replaced, that it doesn't mess with text that was already replaced. In other words, I don't want it to appear like this (it matched "table" from within vegetable1):
"I like to eat fruit3s and vegeitem21s."
Currently, I am using this method which is quite slow, because I have to do the whole find and replace twice:
(1) Convert the CSV to three files, e.g.:
a.csv b.csv c.csv
orange 0001 fruit2
carrot 0002 vegetable1
apple 0003 fruit3
pear 0004 fruit4
ink 0005 item1
table 0006 item2
(2) Then, replace all items from a.csv in file.txt with the matching column in b.csv, using ZZZ around the words to make sure there is no mistake later in matching the numbers:
a=1
b=`wc -l < ./a.csv`
while [ $a -le $b ]
do
for i in `sed -n "$a"p ./b.csv`; do
for j in `sed -n "$a"p ./a.csv`; do
sed -i "s/$i/ZZZ$j\ZZZ/g" ./file.txt
echo "Instances of '"$i"' replaced with '"ZZZ$j\ZZZ"' ("$a"/"$b")."
a=`expr $a + 1`
done
done
done
(3) Then running this same script again, but to replace ZZZ0001ZZZ with fruit2 from c.csv.
Running the first replacement takes about 2 hours, but as I must run this code twice to avoid editing the already replaced items, it takes twice as long. Is there a more efficient way to run a find and replace that does not perform replacements on text already replaced?
Here's a perl solution which does the replacement in a single pass.
#!/usr/bin/perl
use strict;
my %map = (
orange => "fruit2",
carrot => "vegetable1",
apple => "fruit3",
pear => "fruit4",
ink => "item1",
table => "item2",
);
my $repl_rx = '(' . join("|", map { quotemeta } keys %map) . ')';
my $str = "I like to eat apples and carrots";
$str =~ s{$repl_rx}{$map{$1}}g;
print $str, "\n";
Tcl has a command to do exactly this: string map
tclsh <<'END'
set map {
"orange" "fruit2"
"carrot" "vegetable1"
"apple" "fruit3"
"pear" "fruit4"
"ink" "item1"
"table" "item2"
}
set str "I like to eat apples and carrots"
puts [string map $map $str]
END
I like to eat fruit3s and vegetable1s
This is how to implement it in bash (requires bash v4 for the associative array)
declare -A map=(
[orange]=fruit2
[carrot]=vegetable1
[apple]=fruit3
[pear]=fruit4
[ink]=item1
[table]=item2
)
str="I like to eat apples and carrots"
echo "$str"
i=0
while (( i < ${#str} )); do
matched=false
for key in "${!map[@]}"; do
if [[ ${str:$i:${#key}} = $key ]]; then
str=${str:0:$i}${map[$key]}${str:$((i+${#key}))}
((i+=${#map[$key]}))
matched=true
break
fi
done
$matched || ((i++))
done
echo "$str"
I like to eat apples and carrots
I like to eat fruit3s and vegetable1s
This will not be speedy.
Clearly, you may get different results if you order the map differently. In fact, I believe the order of "${!map[@]}" is unspecified, so you might want to specify the order of the keys explicitly:
keys=(orange carrot apple pear ink table)
# ...
for key in "${keys[@]}"; do
One way to do it would be to do a two-phase replace:
phase 1:
s/orange/##1##/
s/carrot/##2##/
...
phase 2:
s/##1##/fruit2/
s/##2##/vegetable1/
...
The ##1## markers should be chosen so that they don't appear in the original text or the replacements of course.
Here's a proof-of-concept implementation in perl:
#!/usr/bin/perl -w
#
my $repls = $ARGV[0];
die ("first parameter must be the replacement list file") unless defined ($repls);
my $tmpFmt = "###%d###";
open(my $replsFile, "<", $repls) || die("$!: $repls");
shift;
my @replsList;
my $i = 0;
while (<$replsFile>) {
chomp;
my ($from, $to) = /\"([^\"]*)\",\"([^\"]*)\"/;
if (defined($from) && defined($to)) {
push(@replsList, [$from, sprintf($tmpFmt, ++$i), $to]);
}
}
while (<>) {
foreach my $r (@replsList) {
s/$r->[0]/$r->[1]/g;
}
foreach my $r (@replsList) {
s/$r->[1]/$r->[2]/g;
}
print;
}
I would guess that most of your slowness is coming from creating so many sed commands, which each need to individually process the entire file. Some minor adjustments to your current process would speed this up a lot by running 1 sed per file per step.
a=1
b=`wc -l < ./a.csv`
while [ $a -le $b ]
do
cmd=""
for i in `sed -n "$a"p ./a.csv`; do
for j in `sed -n "$a"p ./b.csv`; do
cmd="$cmd ; s/$i/ZZZ${j}ZZZ/g"
echo "Instances of '"$i"' replaced with '"ZZZ${j}ZZZ"' ("$a"/"$b")."
a=`expr $a + 1`
done
done
sed -i "$cmd" ./file.txt
done
Doing it twice is probably not your problem. If you managed to just do it once using your basic strategy, it would still take you an hour, right? You probably need to use a different technology or tool. Switching to Perl, as above, might make your code a lot faster (give it a try)
But continuing down the path of other posters, the next step might be pipelining. Write a little program that replaces two columns, then run that program twice, simultaneously. The first run swaps out strings in column1 with strings in column2, the next swaps out strings in column2 with strings in column3.
Your command line would be like this
cat input_file.txt | perl replace.pl replace_file.txt 1 2 | perl replace.pl replace_file.txt 2 3 > completely_replaced.txt
And replace.pl would be like this (similar to other solutions)
#!/usr/bin/perl -w
my $replace_file = $ARGV[0];
my $before_replace_colnum = $ARGV[1] - 1;
my $after_replace_colnum = $ARGV[2] - 1;
open(REPLACEFILE, $replace_file) || die("couldn't open $replace_file: $!");
my @replace_pairs;
# read in the list of things to replace
while(<REPLACEFILE>) {
chomp();
my @cols = split /\t/, $_;
my $to_replace = $cols[$before_replace_colnum];
my $replace_with = $cols[$after_replace_colnum];
push @replace_pairs, [$to_replace, $replace_with];
}
# read input from stdin, do swapping
while(<STDIN>) {
# loop over all replacement strings
foreach my $replace_pair (@replace_pairs) {
my($to_replace,$replace_with) = @{$replace_pair};
$_ =~ s/${to_replace}/${replace_with}/g;
}
print STDOUT $_;
}
A bash+sed approach:
count=0
bigfrom=""
bigto=""
while IFS=, read from to; do
read countmd5sum x < <(md5sum <<< $count)
count=$(( $count + 1 ))
bigfrom="$bigfrom;s/$from/$countmd5sum/g"
bigto="$bigto;s/$countmd5sum/$to/g"
done < replace-list.csv
sed "${bigfrom:1}$bigto" input_file.txt
I have chosen md5sum to get a unique token, but some other mechanism can also be used to generate such a token, like reading from /dev/urandom or shuf -n1 -i 10000000-20000000
An awk+sed approach:
awk -F, '{a[NR-1]="s/####"NR"####/"$2"/";print "s/"$1"/####"NR"####/"}; END{for (i=0;i<NR;i++)print a[i];}' replace-list.csv > /tmp/sed_script.sed
sed -f /tmp/sed_script.sed input.txt
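For a two-entry list (made-up data), the generated sed script looks like this: the word-to-token rules are printed as the csv is read, and the token-to-replacement rules collected in the array are printed by the END block:

```shell
printf 'orange,fruit2\ntable,item2\n' |
awk -F, '{a[NR-1]="s/####"NR"####/"$2"/"; print "s/"$1"/####"NR"####/"}
         END{for (i=0;i<NR;i++) print a[i]}'
```

Feeding the resulting four s/// commands to sed performs the safe two-phase replacement in one sed invocation.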
A cat+sed+sed approach:
cat -n replace-list.csv | sed -rn 'H;g;s|(.*)\n *([0-9]+) *[^,]*,(.*)|\1\ns/####\2####/\3/|;x;s|.*\n *([0-9]+)[ \t]*([^,]+).*|s/\2/####\1####/|p;${g;s/^\n//;p}' > /tmp/sed_script.sed
sed -f /tmp/sed_script.sed input.txt
Mechanism:
Here, it first generates the sed script, using the csv as input file.
Then uses another sed instance to operate on input.txt
Notes:
The intermediate file generated - sed_script.sed can be re-used again, unless the input csv file changes.
####<number>#### is chosen as some pattern, which is not present in the input file. Change this pattern if required.
cat -n | is not UUOC :)
This might work for you (GNU sed):
sed -r 'h;s/./&\\n/g;H;x;s/([^,]*),.*,(.*)/s|\1|\2|g/;$s/$/;s|\\n||g/' csv_file | sed -rf - original_file
Convert the csv file into a sed script. The trick here is to replace the substitution string with one which will not be re-substituted. In this case each character in the substitution string is replaced by itself and a \n. Finally once all substitutions have taken place the \n's are removed leaving the finished string.
There are a lot of cool answers here already. I'm posting this because I'm taking a slightly different approach by making some large assumptions about the data to replace ( based on the sample data ):
Words to replace don't contain spaces
Words are replaced based on the longest, exactly matching prefix
Each word to replace is exactly represented in the csv
This a single pass, awk only answer with very little regex.
It reads the "repl.csv" file into an associative array ( see BEGIN{} ), then attempts to match on prefixes of each word when the length of the word is bound by key length limits, trying to avoid looking in the associative array whenever possible:
#!/bin/awk -f
BEGIN {
while( getline repline < "repl.csv" ) {
split( repline, replarr, "," )
replassocarr[ replarr[1] ] = replarr[2]
# set some bounds on the replace word sizes
if( minKeyLen == 0 || length( replarr[1] ) < minKeyLen )
minKeyLen = length( replarr[1] )
if( maxKeyLen == 0 || length( replarr[1] ) > maxKeyLen )
maxKeyLen = length( replarr[1] )
}
close( "repl.csv" )
}
{
i = 1
while( i <= NF ) { print_word( $i, i == NF ); i++ }
}
function print_word( w, end ) {
wl = length( w )
for( j = wl; j >= 0 && prefix_len_bound( wl, j ); j-- ) {
key = substr( w, 1, j )
wl = length( key )
if( wl >= minKeyLen && key in replassocarr ) {
printf( "%s%s%s", replassocarr[ key ],
substr( w, j+1 ), !end ? " " : "\n" )
return
}
}
printf( "%s%s", w, !end ? " " : "\n" )
}
function prefix_len_bound( len, jlen ) {
return len >= minKeyLen && (len <= maxKeyLen || jlen > maxKeyLen)
}
Based on input like:
I like to eat apples and carrots
orange you glad to see me
Some people eat pears while others drink ink
It yields output like:
I like to eat fruit3s and vegetable1s
fruit2 you glad to see me
Some people eat fruit4s while others drink item1
Of course, any "savings" from not looking in replassocarr go away when the words to be replaced drop to length=1, or if the average word length is much greater than that of the words to replace.

sed/awk + count char until relevant char

Dear friends
I have the following:
PARAM=1,2,3=,4,5,6,=,7#,8,9
How can I count, with sed/awk, the "=" characters between PARAM and the "#" character?
For example
PARAM=1,2,3=,4,5,6,=,7#,8,9
Then sed/awk should return 3
OR
PARAM=1,2,3=,4=,5=,6,=,7#,=8,9
Then sed/awk should return 5
THX
yael
You can use this one-liner. There is no need to use split() as in the other answer; just use gsub(), which returns the count of replacements made. Also, set the field delimiter to "#", so you only need to deal with the first field.
$ echo "PARAM=1,2,3=,4,5,6,=,7#,8,9" | awk -F"#" '{print gsub("=","",$1)}'
3
$ echo "PARAM=1,2,3=,4=,5=,6,=,7#,=8,9" | awk -F"#" '{print gsub("=","",$1)}'
5
Here is an awk script that finds the count using field separators and split(). It sets the field separator to the # symbol and then splits the first field (the stuff to the left of the first #) on the = character. An odd approach possibly, but it is one method. Note that it assumes there are no = characters to the left of PARAM. If that is a bad assumption, this will not work.
BEGIN{ FS="#" }
/PARAM.*#/{
n = split( $1, a, "=" );
printf( "Count = %d\n", n-1 );
}
It can be done with one line as well:
[]$ export LINE=PARAM=1,2=3,4=5#=6
[]$ echo $LINE | awk 'BEGIN{FS="#"}/PARAM.*#/{n=split($1,a,"="); print n-1;}'
3