Converting long single-line comments into multiple short lines - sed

I have some lines with very long single-line comments:
# this is a comment describing the function, let's pretend it's long.
function whatever()
{
    # this is an explanation of something that happens in here.
    do_something();
}
For this example (adapting it to other numbers should be trivial) I want each line
to contain at most 33 characters (each indentation level is 4 spaces),
to be broken at the last possible space, and
each additional line to be indented exactly like the original line.
So it would end up looking like this:
# this is a comment describing
# the function, let's pretend
# it's long.
function whatever()
{
    # this is an explanation of
    # something that happens in
    # here.
    do_something();
}
I'm trying to write a sed script for that, my attempt looking like this (leaving out the attempts to make it break at a particular character count for clarity and because it didn't work):
s/\(^[^#]*# \)\(.*\) \(.*\)/\1\2\n\1\3/g;
This breaks the line only once and not repeatedly like I falsely assumed g to do (and which it actually would do if it were only s/ /\n/g or something).

Perl to the rescue!
Its Text::Wrap module does what you need:
perl -MText::Wrap='wrap,$columns' -pe '
s/^(\s*#)(.*)/$columns = 33 - length $1; wrap("$1", "$1 ", "$2")/e
' < input > output
-M uses the given module with the given parameters. Here, we'll use the wrap function and the $columns variable.
-p reads the input line by line and prints the possibly modified line (like sed)
s///e is a substitution that uses code in the replacement part, the matching part is replaced by the value returned from the code
to calculate the width, we subtract the initial whitespace from 33. If you use tabs in your sources, you'll have to handle them specially.
wrap takes three parameters: prefix for the first line, prefix for the rest of the lines (in this case, they're almost the same: the comment prefix, we just need to add the space to the second one), and the text to wrap.
Comparing the output to yours, it seems you want 33 characters regardless of the leading whitespace. If that's true, just remove the - length $1 part.
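If Perl is not available, the same wrapping can be sketched in awk. This is only a sketch under a few assumptions: the function name wrap_comments and its width argument are made up for illustration, the limit counts the whole line including indentation (which is what the expected output in the question suggests), and a single word longer than the limit is left unbroken.

```shell
# wrap_comments WIDTH: hypothetical helper, reads stdin, writes stdout.
wrap_comments() {
    awk -v width="$1" '
    # Only lines consisting of indentation plus a "# " comment are wrapped.
    /^[ \t]*[#] / {
        match($0, /^[ \t]*[#] /)
        prefix = substr($0, 1, RLENGTH)      # indentation and "# "
        n = split(substr($0, RLENGTH + 1), word, " ")
        line = prefix
        for (i = 1; i <= n; i++) {
            # Start a new line when the next word would exceed the width.
            if (line != prefix && length(line) + 1 + length(word[i]) > width) {
                print line
                line = prefix
            }
            line = line (line == prefix ? "" : " ") word[i]
        }
        print line
        next
    }
    { print }    # everything else passes through unchanged
    '
}
```

Used as wrap_comments 33 < input > output, each continuation line repeats the indentation and the "# " of the original line.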

Related

How to find and replace with sed, except when between curly braces?

I have a command like this, it is marking words to appear in an index in the document:
sed -i "s/\b$line\b/\\\keywordis\{$line\}\{$wordis\}\{$definitionis\}/g" file.txt
The problem is that it finds matches within existing matches. For example, "hello" is replaced with \keywordis{hello}{a common greeting}, but then "greeting" might be searched for too, giving \keywordis{hello}{a common \keywordis{greeting}{a phrase used when meeting someone}}...
How can I tell sed to perform the replacement, but ignore text that is already inside curly brackets?
Curly brackets in this case will always appear on the same line.
How can I tell sed to perform the replacement, but ignore text that is already inside curly brackets?
First tokenize the input. Place something unique, like | or the byte \x01, around every \keywordis{hello}{a common greeting} and store that in hold space. Something along the lines of s/\\the regex to match{hello}{a common greeting}/\x01&\x01/g.
Then iterate over the elements in hold space. Use \n to separate the elements already parsed from those not yet parsed, i.e. input from output. If the element matches the format \keywordis{hello}{a common greeting}, just move it to the front, before the newline in hold space; if it does not, perform the replacement. Here's an example: Identify and replace selective space inside given text file. It uses a double newline \n\n as the input/output separator.
Because, as you noted, replacements can contain words that overlap with the patterns you are searching for, I believe the simplest approach is, after each replacement, to shuffle the pattern space as if it were ready for output and start the process all over for the current line.
Then at the end, shuffle the hold space to remove \x01, the newline and any leftovers, and output.
Overall, it's LaTeX. I believe it would be simpler to do it manually.
By "eating" the string from the back and placing it in front of input/output separator inside pattern space, I simplified the process. The following program:
sed '
# add our input/output separator - just a newline
s/^/\n/
: loop
# l1000
# Ignore any "\keywords" and "{stuff}"
/^\([^\n]*\)\n\(.*\)\(\\[^{}]*\|{[^{}]*}\)$/{
s//\3\1\n\2/
b loop
}
# Replace hello followed by anything not {}
# We match till the end because regex is greedy
# so that .* will eat everything.
/^\([^\n]*\)\n\(.*\)hello\([{}]*\)$/{
s//\\keywordis{hello}{a common greeting}\3\1\n\2/
b loop
}
# Hello was not matched - ignore anything irrelevant
# note - it has to match at least one character after newline
/^\([^\n]*\)\n\(.*\)\([^{}]\+\)$/{
s//\3\1\n\2/
b loop
}
s/\n//
' <<<'
\keywordis{hello}{hello} hello {some other hello} another hello yet
'
outputs:
\keywordis{hello}{hello} \keywordis{hello}{a common greeting} {some other hello} another \keywordis{hello}{a common greeting} yet
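The tokenizing idea can also be sketched in awk, which avoids the hold-space juggling. Treat this as a sketch, not a drop-in replacement for the sed command in the question: the function name mark_outside_braces is hypothetical, the word is matched literally (there is no \b word-boundary check), and nested braces are not handled beyond the innermost pair. The word and the replacement are passed through the environment so that backslashes in the LaTeX markup are not interpreted by awk.

```shell
# mark_outside_braces WORD REPLACEMENT: hypothetical helper, stdin to stdout.
mark_outside_braces() {
    WORD="$1" REPL="$2" awk '
    # Literal (non-regex) replacement of every occurrence of word in s.
    function replace_all(s, word, repl,    out, i) {
        out = ""
        while ((i = index(s, word)) > 0) {
            out = out substr(s, 1, i - 1) repl
            s = substr(s, i + length(word))
        }
        return out s
    }
    {
        out = ""; rest = $0
        # Peel off one {...} group at a time. The text in front of it lies
        # outside braces and is edited; the group itself is copied verbatim.
        while (match(rest, /[{][^{}]*[}]/)) {
            out = out replace_all(substr(rest, 1, RSTART - 1), ENVIRON["WORD"], ENVIRON["REPL"]) substr(rest, RSTART, RLENGTH)
            rest = substr(rest, RSTART + RLENGTH)
        }
        print out replace_all(rest, ENVIRON["WORD"], ENVIRON["REPL"])
    }'
}
```

Because inserted replacements go straight to the output and are never rescanned, a later keyword cannot match inside an earlier replacement on the same pass.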

substituting spaces for underscores using lookaheads in perl

I have files with many lines of the following form:
word -0.15636028 -0.2953045 0.29853472 ....
(one word preceding several hundreds floats delimited by blanks)
Due to some errors out of my control, the word sometimes has spaces in it.
a bbb c -0.15636028 -0.2953045 0.29853472 .... (several hundreds floats)
which I wish to substitute by underscores so to get:
a_bbb_c -0.15636028 -0.2953045 0.29853472 .... (several hundreds floats)
I have tried the following substitution on each line:
s/\s(?=(\s-?\d\.\d+)+)/_/g;
So lookarounds are apparently not the solution.
I'd be grateful for any clues.
Your idea for the lookahead is fine, but the question is how to replace only spaces in the part matched before the lookahead, when they are mixed with other things (the words, that is).
One way is to capture what precedes the first float (given by lookahead), and in the replacement part run another regex on what's been captured, to replace spaces
s{ (.*?) (?=\s+-?[0-9]+\.[0-9]) }{ $1 =~ s/\s+/_/gr }ex
Notes
Modifier /e makes the replacement part be evaluated as code; any valid Perl code goes
With s{}{} delimiters we can use s/// ones in the replacement part's regex
Regex in the replacement part, that changes spaces to _ in the captured text, has /r modifier so to return the modified string and leave the original unchanged. Thus we aren't attempting to change $1 (it's read only), and the modified string (being returned) is available as the replacement
Modifier /x allows use of spaces in patterns, for readability
Some assumptions must be made here. Most critical one is that the text to process is followed by a number in the given format, -?[0-9]+\.[0-9]+, and that there isn't such a number in the text itself. This follows the OP's sample and, more decidedly, the attempted solution
A couple of details with assumptions. (1) Leading digits are expected with [0-9]+\. -- if you can have numbers like .123 then use [0-9]*\. (2) The \s+ in the inner regex collapses multiple consecutive spaces into one _, so "a  b c" (two spaces after a) becomes a_b_c (and not a__b_c)
In the lookahead I scoop up all spaces preceding the first float with \s+ -- and so they'll stay in front of the first float. This is as wanted with one space but with multiple ones it may be awkward
If they were included in the .*? capture (if the lookahead only has one space, \s) then we'd get an _ trailing the word(s). I thought that'd be more awkward. The ideal solution is to run another regex and clean that up, if such a case is possible and if it's a bother
An example
echo "a bbb c -0.15636028 -0.2953045" |
perl -wpe's{(.*?)(?=\s+-?[0-9]+\.[0-9])}{ $1 =~ s/\s+/_/gr }e'
prints
a_bbb_c -0.15636028 -0.2953045
Then to process all lines in a file you can do either
perl -wpe'...' file > new_file
and get a new_file with changes, or
perl -i.bak -wpe'...' file
to change the file in-place (that's -i), where .bak makes it save a backup.
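For comparison, the same assumption (everything before the first float-looking field belongs to the word) can be sketched in awk, which has no lookaheads. The function name join_words_before_floats is made up for illustration, and unlike the Perl version this sketch normalizes every run of blanks to a single separator.

```shell
# join_words_before_floats: hypothetical helper, reads stdin, writes stdout.
join_words_before_floats() {
    awk '{
        # Find the first field that looks like a float such as -0.15636028.
        first = NF + 1
        for (i = 1; i <= NF; i++)
            if ($i ~ /^-?[0-9]+\.[0-9]+$/) { first = i; break }
        # Fields before it are joined with "_", the rest with a space.
        line = $1
        for (i = 2; i <= NF; i++)
            line = line (i < first ? "_" : " ") $i
        print line
    }'
}
```

It shares the limitation noted above: a word that itself contains something shaped like a float will be split in the wrong place.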
Would something like this work for you:
s/\s+/_/g;
s/_(-?\d+\.)/ $1/g;
Use a negative lookahead to replace any spaces not followed by a float:
echo "a bbb cc -0.123232 -0.3232" | perl -wpe 's/ +(?! *-?\d+\.)/_/g'
Assuming from your comments your files look like that:
name float1 float2 float3
a bbb c -0.15636028 -0.2953045 0.29853472
abbb c -0.15636028 -0.2953045 0.29853472
a bbbc -0.15636028 -0.2953045 0.29853472
ab bbc -0.15636028 -0.2953045 0.29853472
abbbc -0.15636028 -0.2953045 0.29853472
Since you said in comments that the first field may contain digits, you can't use a lookahead that searches the first float to solve the problem. (you can nevertheless use a lookahead that counts the number of floats until the end of the line but it isn't very handy).
What I suggest is a solution based on fields number defined by the header first line.
You can use the header line to know the number of fields and replace spaces at the beginning of other lines until the number of fields is the same.
You can use the perl command line like awk:
perl -MEnglish -pae'$c=scalar @F if ($NR==1);for($i=0;$i<scalar(@F)-$c;$i++){s/\s+/_/}' file
The for loop counts the difference between the number of fields in the first row (stored in $c) and in the current line (given by scalar(@F) where @F is the fields array), and repeats the substitution.
The -a switch puts the perl command line in autosplit mode and -MEnglish makes the row number variable available as $NR (like the NR variable in awk).
It's possible to shorten it like this:
perl -pae'$c=@F if $.<2;$i=@F-$c;s/\s+/_/ while $i--' file
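Since the one-liner uses perl as if it were awk, here is roughly the same logic in awk itself. It is a sketch under the same assumption as above: the first line is a header whose field count is correct. The function name join_name_fields is hypothetical.

```shell
# join_name_fields: hypothetical helper, reads stdin, writes stdout.
join_name_fields() {
    awk '
    NR == 1 { want = NF }           # the header fixes the expected field count
    {
        extra = NF - want           # surplus fields belong to the name
        for (i = 0; i < extra; i++)
            sub(/[ \t]+/, "_")      # join at the first remaining run of blanks
        print
    }'
}
```

Each sub() call works on the whole line and replaces only the leftmost run of blanks, so exactly as many separators are replaced as there are surplus fields.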

Append to non-empty line that doesn't start with whitespace AND is followed, two lines down, by a non-empty line that doesn't start with whitespace

I am converting several unruly, early 90's DOS-generated text files to something more usable. I need to append a set of characters to all of the non-empty lines in said text files that don't start with whitespace AND that are followed, two lines down, by another non-empty line that doesn't start with whitespace (I will refer to all single lines of text that meet these characteristics as "target" lines). BTW, irrelevant to the problem are the characteristics of the line directly below each of the target lines.
Of interest is the fact that all of the target lines in the above-mentioned text files end with the same character. Also, the command I'm looking for needs to slot into a rather long pipeline.
Suppose I have the following file:
foo
third line foo
fifth line foo
this line starts with a space foo
this line starts with a space foo
ninth line foo
eleventh line foo
this line starts with a space foo
last line foo
I want the output to look like this:
foobar
third line foobar
fifth line foo
this line starts with a space foo
this line starts with a space foo
ninth line foobar
eleventh line foo
this line starts with a space foo
last line foo
Although I'm looking for a sed solution, awk and perl are welcome as well. All solutions must be able to be used in a pipeline. Also welcomed are solutions which handle a more general case (e.g. able to append the desired text to target lines that end in various ways, including whitespace).
Now, for the backstory:
I recently asked a question similar to the subject question a few days ago (see here). As you can see, I got some great answers. It turned out, however, that I did not fully understand my problem, so I did not ask the correct question that would actually solve said problem.
Now, I'm asking the right question!
Based on what I learned by scrutinizing the answers to the question I linked to above, I've cobbled together the following sed command
sed '1N;N;/^[^[:space:]]/s/^\([^[:space:]].*\o\)\(\n\n[^[:space:]].*\)$/\1bar\2/;P;D' infile
Ugly, yes, but it works for my humble purposes. Indeed, as my original intent with this question was to post a question, then self-answer same, you can see this sed construct posted below as one of the answers (posted by me).
I'm sure there are better ways to solve this particular problem, however...any ideas, anyone?
From your posted expected output it looks like you meant to say "is followed, two lines down, by a line that DOES NOT start with whitespace" instead of "is followed, two lines down, by a line that DOES start with whitespace".
This produces the output you show:
$ cat tst.awk
NR>2 { print p2 ((p2 ~ /^[^[:blank:]]/) && /^[^[:blank:]]/ ? "bar" : "") }
{ p2=p1; p1=$0 }
END { print p2 ORS p1 }
$ awk -f tst.awk file
foobar
third line foobar
fifth line foo
this line starts with a space foo
this line starts with a space foo
ninth line foobar
eleventh line foo
this line starts with a space foo
last line foo
It simply keeps a 2 line buffer and adds "bar" to the end of the line being printed given whatever condition you need. It will work on all POSIX awks and any others that support POSIX character classes (for the rest, change [[:blank:]] to [ \t]).
You have over-analysed the problem so that your question now reads as a computer program, and you have got that program wrong. Requirements are best explained using examples and real data, so that we have some hope of rationalising the problem in our heads
This Perl program alters your algorithm so the output matches your required output
use strict;
use warnings 'all';
chomp(my @data = <>);
my $i = 0;
for ( @data ) {
    # append only if this line and the line two below both start with non-whitespace
    $_ .= 'bar' if /^\S/ and $data[$i+2] =~ /^\S/;
    ++$i;
    last if $i+2 > $#data;
}
print "$_\n" for @data;
output
foobar
third line foobar
fifth line foo
this line starts with a space foo
this line starts with a space foo
ninth line foobar
eleventh line foo
this line starts with a space foo
last line foo
This sed one-liner seems to do the trick for the specific case outlined in the OP:
sed '1N;N;/^[^[:space:]]/s/^\([^[:space:]].*\o\)\(\n\n[^[:space:]].*\)$/\1bar\2/;P;D' infile
Thanks to the excellent clarifying information given by Benjamin W. in his answer to one of my recent questions, I was able to cobble together this one-liner that solved my specific problem. Please refer to same if you wish to gain insight into said command.

Decipher this sed one-liner

I want to remove duplicate lines from a file, without sorting the file.
Example of why this is useful to me: removing duplicates from Bash's $HISTFILE without changing the chronological order.
This page has a one-liner to do that:
http://sed.sourceforge.net/sed1line.txt
Here's the one-liner:
sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P'
I asked a sysadmin and he told me "you just copy the script and it works, don't go philosophising about this", which is fine, so I am asking here as it's a developer forum and I trust people might be like me, suspicious about using things they don't understand:
Could you kindly provide a pseudo-code explanation of what that "black magic" script is doing, please? I tried parsing the incantation in my head but especially the central part is quite hard.
I'll note that this script does not appear to work with my copy of sed (GNU sed 4.1.5) in my current locale. If I run it with LC_ALL=C it works fine.
Here's an annotated version of the script. sed basically has two registers, one is called "pattern space" and is used for (basically) the current input line, and the other, the "hold space", can be used by scripts for temporary storage etc.
sed -n ' # -n: by default, do not print
G # Append hold space to current input line
s/\n/&&/ # Add empty line after current input line
/^\([ -~]*\n\).*\n\1/d # If the current input line is repeated in the hold space, skip this line
# Otherwise, clean up for storing all input in hold space:
s/\n// # Remove empty line after current input line
h # Copy entire pattern space back to hold space
P # Print current input line'
I guess the adding and removal of an empty line is there so that the central pattern can be kept relatively simple (you can count on there being a newline after the current line and before the beginning of the matching line).
So basically, the entire input file (sans duplicates) is kept (in reverse order) in the hold space, and if the first line of the pattern space (the current input line) is found anywhere in the rest of the pattern space (which was copied from the hold space when the script started processing this line), we skip it and start over.
The regex in the conditional can be further decomposed;
^ # Look at beginning of line (i.e. beginning of pattern space)
\( # This starts group \1
[ -~] # Any printable character (in the C locale)
* # Any number of times
\n # Followed by a newline
\) # End of group \1 -- it contains the current input line
.*\n # Skip any amount of lines as necessary
\1 # Another occurrence of the current input line, with newline and all
If this pattern matches, the script discards the pattern space and starts over with the next input line (d).
You can get it to work independently of locale by changing [ -~] to [[:print:]]
The code doesn't work for me, perhaps due to some locale setting, but this version, with [^\n] in place of [ -~], does:
sed -n 'G; s/\n/&&/; /^\([^\n]*\n\).*\n\1/d; s/\n//; h; P'
Let's first translate this by the book (i.e. sed info page), into something perlish.
# The standard sed loop
my $hold = "";
while (my $pattern = <>) {
    chomp $pattern;
    $pattern = "$pattern\n$hold";              # G
    $pattern =~ s/(\n)/$1$1/;                  # s/\n/&&/
    if ($pattern =~ /^([^\n]*\n).*\n\1/s) {    # /…/ (/s because in sed . also matches a newline)
        next;                                  # d
    }
    $pattern =~ s/\n//;                        # s/\n//
    $hold = $pattern;                          # h
    $pattern =~ /^([^\n]*\n?)/; print $1;      # P
}
OK, the basic idea is that the hold space contains all the lines seen so far.
G: At the beginning of each cycle, append that hold space to the current line. Now we have a single string consisting of the current line and all unique lines which preceded it.
s/\n/&&/: Turn the newline which separates them into a double newline, so that we can match consecutive and non-consecutive duplicates the same way, see the next step.
/^\([^\n]*\n\).*\n\1/: Look through the current text for the following: at the beginning of all the lines (^) look for a first line including trailing newline (\([^\n]*\n\)), then anything (.*), then a newline (\n), and then that same first line including newline repeated again (\1). If two consecutive lines are the same, then the .* in the regular expression will match the empty string, but the two \n will still match due to the newline duplication in the preceding step. So basically this asks whether the first line appears again among the other lines.
d: If there is a match, this is a duplicate line. We discard this input, keep the hold space as it is as a buffer of all unique lines seen so far, and continue with the next line of input.
s/\n//: Otherwise, we continue and next turn the double newline back into a single newline.
h: We include the current line in our list of all unique lines.
P: And finally print this new unique line, up to the newline character.
For the actual problem to resolve, here is a simpler solution (at least it looks so) with awk:
awk '!_[$0]++' FILE
In short, _[$0] counts how many times each distinct line has been seen. The first time a line ($0) appears, _[$0] is 0, so !_[$0] is true and the line is printed; the ++ then increments the counter. On any later appearance _[$0] >= 1, so !_[$0] evaluates to false and the line is not printed.
See https://gist.github.com/ryenus/5866268 (credit goes to a recent forum I visited.)
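A small runnable illustration of that awk idiom, with the array renamed from _ to seen for readability. The wrapper name dedupe_keep_order is only for this example.

```shell
# dedupe_keep_order: hypothetical wrapper name; reads stdin, writes stdout.
dedupe_keep_order() {
    # seen[$0] is 0 (false) the first time a line appears, so !seen[$0]++ is
    # true exactly once per distinct line, and the default action prints it.
    awk '!seen[$0]++'
}
```

For the $HISTFILE use case from the question, dedupe_keep_order < "$HISTFILE" > deduped.txt keeps the first occurrence of every command in chronological order. Note that it holds every distinct line in memory, just as the sed version keeps them in the hold space.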

How to restrict a find and replace to only one column within a CSV?

I have a 4-column CSV file, e.g.:
0001 # fish # animal # eats worms
I use sed to do a find and replace on the file, but I need to limit this find and replace to only the text found inside column 3.
How can I have a find and replace only occur on this one column?
Are you sure you want to be using sed? What about csvfix? Is your CSV nice and simple with no quotes or embedded commas or other nasties that make regexes...a less than satisfactory way of dealing with a general CSV file? I'm assuming that the # is the 'comma' in your format.
Consider using awk instead of sed:
awk -F# '$3 ~ /pattern/ { OFS= "#"; $3 = "replace"; } { print }'
Arguably, you should have a BEGIN block that sets OFS once. For one line of input, it didn't make any odds (and you'd probably be hard-pressed to measure a difference on a million lines of input, too):
$ echo "pattern # pattern # pattern # pattern" |
> awk -F# '$3 ~ /pattern/ { OFS= "#"; $3 = "replace"; } { print }'
pattern # pattern #replace# pattern
$
If sed still seems appealing, then:
sed '/^\([^#]*#[^#]*\)#pattern#\(.*\)/ s//\1#replace#\2/'
For example (and note the slightly different input and output – you can fix it to handle the same as the awk quite easily if need be):
$ echo "pattern#pattern#pattern#pattern" |
> sed '/^\([^#]*#[^#]*\)#pattern#\(.*\)/ s//\1#replace#\2/'
pattern#pattern#replace#pattern
$
The first regex looks for the start of a line, a field of non-# characters, a #, another field of non-# characters and remembers the lot; it looks for a #, the pattern (which must be in the third field since the first two fields have been matched already), another #, and then the residue of the line. When the line matches, then it replaces the line with the first two fields (unchanged, as required), then adds the replacement third field, and the residue of the line (unchanged, as required).
If you need to edit rather than simply replace the third field, then you think about using awk or Perl or Python. If you are still constrained to sed, then you explore using the hold space to hold part of the line while you manipulate the other part in the pattern space, and end up re-integrating your desired output line from the hold space and pattern space before printing the line. That's nearly as messy as it sounds; actually, possibly even messier than it sounds. I'd go with Perl (because I learned it long ago and it does this sort of thing quite easily), but you can use whichever non-sed tool you like.
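As a sketch of the awk route for editing (rather than wholesale replacing) one field: the function name replace_in_field and its three arguments are made up for illustration, # is assumed to be the separator as in the question, and the regex and replacement follow awk's gsub rules (so & in the replacement means the matched text).

```shell
# replace_in_field N REGEX REPLACEMENT: hypothetical helper, stdin to stdout.
replace_in_field() {
    awk -v n="$1" -v re="$2" -v repl="$3" '
    BEGIN { FS = OFS = "#" }        # set both separators once, up front
    {
        gsub(re, repl, $n)          # edit only field n; other fields untouched
        print
    }'
}
```

For example, replace_in_field 3 a b turns the third field " animal " into " bnimbl " and leaves the other three fields alone.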
Perl editing the third field. Note that the default output is $_ which had to be reassembled from the auto-split fields in the array @F.
$ echo "pattern#pattern#pattern#pattern" | sh -x xxx.pl
> perl -pa -F# -e '$F[2] =~ s/\s*pat(\w\w)rn\s*/ prefix-$1-suffix /; $_ = join "#", @F; ' "$@"
pattern#pattern# prefix-te-suffix #pattern
$
An explanation. The -p means 'loop, reading lines into $_ and printing $_ at the end of each iteration'. The -a means 'auto-split $_ into the array @F'. The -F# means the field separator is #. The -e is followed by the Perl program. Arrays are indexed from 0 in Perl, so the third field is split into $F[2] (the sigil, $ or @, changes depending on whether you're working with a value from the array or the array as a whole). The =~ is a match operator; it applies the regex on the RHS to the value on the LHS. The substitute pattern recognizes zero or more spaces \s* followed by pat then two 'word' characters which are remembered into $1, then rn and zero or more spaces again; maybe there should be a ^ and $ in there to bind to the start and end of the field. The replacement is a space, 'prefix-', the remembered pair of letters, and '-suffix' and a space. The $_ = join "#", @F; reassembles the input line $_ from the possibly modified separate fields, and then the -p prints that out. Not quite as tidy as I'd like (so there's probably a better way to do it), but it works. And you can do arbitrary transforms on arbitrary fields in Perl without much difficulty. Perl also has a module Text::CSV (and a high-speed C version, Text::CSV_XS) which can handle really complex CSV files.
Essentially break the line into three pieces, with the pattern you're looking for in the middle. Then keep the outer pieces and replace the middle.
/\([^#]*#[^#]*#[^#]*\)pattern\([^#]*#.*\)/s//\1replacement\2/
\([^#]*#[^#]*#[^#]*\) - gather everything before the pattern, including the 2nd # and any text before the match - this becomes \1
pattern - the thing you're looking for
\([^#]*#.*\) - gather everything after the pattern - this becomes \2
Then change that line into \1 then the replacement, then everything after pattern, which is \2
This might work for you:
echo '0001 # fish # animal # eats worms' |
sed 's/#/&\n/2;s/#/\n&/3;h;s/\n#.*//;s/.*\n//;y/a/b/;G;s/\([^\n]*\)\n\([^\n]*\).*\n/\2\1/'
0001 # fish # bnimbl # eats worms
Explanation:
Define the field to be worked on (in this case the 3rd) and insert a newline (\n) before it and directly after it. s/#/&\n/2;s/#/\n&/3
Save the line in the hold space. h
Delete the fields either side s/\n#.*//;s/.*\n//
Now process the field i.e. change all a's to b's. y/a/b/
Now append the original line. G
Substitute the new field for the old field (also removing any newlines). s/\([^\n]*\)\n\([^\n]*\).*\n/\2\1/
N.B. That in step 4 the pattern space only contains the defined field, so any number of commands may be carried out here and the result will not affect the rest of the line.