command line method to remove single line paragraphs from text file

command line method to remove single line paragraphs from text file - sed

I have a .txt file with two types of paragraphs:
Some statements and numbers (02) and such followed by a return
With some more stuff followed by two returns
Then a single line paragraph that is followed by two returns
Along with some more double line text return
some more text.
I want to remove all single line paragraphs from the text file. So that the result is:
Some statements and numbers (02) and such followed by a return
With some more stuff followed by two returns
Along with some more double line text return
some more text
I have been attempting to do this with sed and awk, but I keep running into problems coming up with a regex that will look for a newline followed by some characters and ending in two consecutive newlines \n\n.
Is there anyway way to do this with a one liner or am I going to have to write a script to read in line by line and determine the length of the paragraph and strip it out that way?
Thanks.

awk -F '\n' -v RS='' -v ORS='\n\n' 'NF>1' input.txt
When RS is set to the empty string, each record always ends at the first blank line encountered.
When RS is set to the empty string, and FS is set to a single character, the newline character always acts as a field separator.
[read more]

I tend to reach for Perl for paragraph-oriented parsing:
perl -00 -lne 'print if tr/\n/\n/ > 0'

Related

Replace block of text inside a file using the contents of another file using sed

I am looking to replace a block of text that is between markers with the contents of another file.
I came across this solution but it only works with one line
$ sed -n '/foo/{p;:a;N;/bar/!ba;s/.*\n/REPLACEMENT\n/};p' file
line 1
line 2
foo
REPLACEMENT
bar
line 6
line 7
I am trying to get the following working but it's not.
content=`cat file_content`
sed -n '/foo/{p;:a;N;/bar/!ba;s/.*\n/${content}\n/};p' file
output
line 1
line 2
foo
${content}
bar
line 6
line 7
How can I get ${content} to list the output of the file?

So I guess this should be a reasonably short way of doing it to replace text between foo and bar lines with content of file file_content:
sed -e '/^foo$/,/^bar$/{/^bar$/{x;r content_file
D};d}' file
For range of lines matching ^foo$ and ^bar$. If line matches ^bar$ swap (empty) hold space into pattern space, read and append content of content_file, then delete pattern space up to first newline and start next cycle with the reminder of the pattern space. For all other lines in that range... just drop the line (delete patter space and move to the next line of input).
Otherwise to the result of your question... any string enclosed in single quotes is taken literally by shell and without any expansion (also of variables) taking place. '${content}' means literally ${content} and that is also part of the argument passed to sed, whereas double quote text ("${content}") would still see shell expand variable to what its value before becoming part of the sed arguments. Since that could still see content tripping up sed, I would opt for the r method for being more generic / robust.
EDIT: Edit keeping the start and end lines in (since I've misread the question):
sed -e '/^foo$/,/^bar$/{/^foo$/{r content_file
p};/^bar$/!d}' file
This time for range between matched of ^foo$ and ^bar$... for opening line matching ^foo$ we it reads content from content_file appending it to pattern space and then prints it (because of delete that follow). Then for all line in the range not matching the closing line pattern ^bar$ it just drops it and moves on.

This might work for you (GNU sed):
sed '/foo/!b;:a;$b;N;/bar/!ba;P;s/.*\n//;e cat contentFile' file
Print all lines until one containing foo.
If this is the last line, then there will never be a line containing bar so break out and do not insert the contentFile.
Otherwise, append the next line and check for it containing bar, if not repeat.
The pattern space should now contain both foo and bar so, print the first line (containing foo), remove all other lines other than the one containing bar, print the file contentFile and then print the last line of the collection containing bar.
N.B. This does not insert the contentFile unless both foo and bar exist in file. Also the e command will evaluate the cat contentFile immediately and insert the result into the output stream before printing the line containing bar, whereas the r command always prints to the output stream after the implicit print of the sed cycle.
An alternative:
sed -ne '/foo/{p;:a;n;/bar/!ba;e cat contentFile' -e '};p' file
However this solution will only print lines before foo if file does not have a line containing bar.

sed '/foo/,/bar/{//!d;/foo/s//&\n'${content}'/}' file
From foo to bar, delete lines not matching previous match //!d.
On foo line, replace match & with match followed by \n${content}

How to find patterns across multiple lines using perl

I want to grep some string spread along multiple lines withing some begin and end pattern
Example:
MediaHelper->fetchStrings( names => [ //Here new line may or many not be
**'ubp-firstrun_heading',
'firstrun_text',
'_firstrun-or-start_search',
'installed'** //may end here also );
]);
using perl or grap how I can get list 4 strings here begin pattern is MediaHelper->fetchStrings(names => [ and end pattern is );
Or any other suggesting using other commands like grep or sed or awk ?

Try this:
sed -n '/MediaHelper->fetchStrings( names =>/,/);/ p' <yourfile>
Or, if you want to skip the delimiting lines, this:
sed -n '/MediaHelper->fetchStrings( names =>/,/);/ {/MediaHelper->fetchStrings( names =>/b; /^);/b; p}' <yourfile>

If I understand your question, you need to match all strings in all lines (and not just the MediaHelper thing).
If this is the case, then sed is the right tool, because it is by default line-oriented.
In our case, if you want to match the string in every line:
sed "s/.*\('.*'\).*/\1/" <your_file>
Hope it helps
Edit: To be more descriptive, first we need to match the whole line (that's the first and the last .*) and then we enclose in parenthesis the part of the line we want to print, which in our case is everything inside single quotes. The number 1 before the last delimiter denotes that we want to print the first (in our case it is the last also) parenthesis.

Just process the file in slurp mode instead of line by line:
perl -0777 -ne 'print $1 while m{MediaHelper->fetchStrings(names\s*=>\s*\[(.*?)\]}g' file
Explanation:
Switches:
-0777: Slurp mode instead of line by line
-n: Creates a while(<>){..} loop for each line in your input file.
-e: Tells perl to execute the code on command line.

Alternatives to grep/sed that treat new lines as just another character

Both grep and sed handle input line-by-line and, as far as I know, getting either of them to handle multiple lines isn't very straightforward. What I'm looking for is an alternative or alternatives to these two programs that treat newlines as just another character. Is there any tool that fits such a criteria

The tool you want is awk. It is record-oriented, not line-oriented, and you can specify your record-separator by setting the builtin variable RS. In particular, GNU awk lets you set RS to any regular expression, not just a single character.

Here is an example where awk uses one blank line to separate every record. If you show us what data you have, we can help you with it.
cat file
first line
second line
third line
fourth line
fifth line
sixth line
seventh line
eight line
more data
Running awk on this and reconstruct data using blank line as new record.
awk -v RS= '{$1=$1}1' file
first line second line third line
fourth line fifth line sixth line
seventh line eight line
more data
PS RS is not equal to file, is set to RS= blank, equal to RS=""

1) Sed can handle a block lines together, not always line by line.
In sed, normally I use :loop; $!{N; b loop}; to get all the lines available in pattern space delimited by newline.
Sample:
Productivity
Google Search\
Tips
"Web Based Time Tracking,
Web Based Todo list and
Reduce Key Stores etc"
result (remove the content between ")
sed -e ':loop; $!{N; b loop}; s/\"[^\"]*\"//g' thegeekstuff.txt
Productivity
Google Search\
Tips
You should read this URL (Unix Sed Tutorial: 6 Examples for Sed Branching Operation), it will give you detail how it works.
http://www.thegeekstuff.com/2009/12/unix-sed-tutorial-6-examples-for-sed-branching-operation/
2) For grep, check if your grep support -z option, which needn't handle input line by line.
-z, --null-data
Treat the input as a set of lines, each terminated by a zero
byte (the ASCII NUL character) instead of a newline. Like the
-Z or --null option, this option can be used with commands like
sort -z to process arbitrary file names.

put all separate paragraphs of a file into a separate line

I have a file that contains sequence data, where each new paragraph (separated by two blank lines) contain a new sequence:
#example
ASDHJDJJDMFFMF
AKAKJSJSJSL---
SMSM-....SKSKK
....SK
SKJHDDSNLDJSCC
AK..SJSJSL--HG
AHSM---..SKSKK
-.-GHH
and I want to end up with a file looking like:
ASDHJDJJDMFFMFAKAKJSJSJSL---SMSM-....SKSKK....SK
SKJHDDSNLDJSCCAK..SJSJSL--HGAHSM---..SKSKK-.-GHH
each sequence is the same length (if that helps).
I would also be looking to do this over multiple files stored in different directiories.
I have just tried
sed -e '/./{H;$!d;}' -e 'x;/regex/!d' ./text.txt
however this just deleted the entire file :S
any help would bre appreciated - doesn't have to be in sed, if you know how to do it in perl or something else then that's also great.
Thanks.

All you're asking to do is convert a file of blank-lines-separated records (RS) where each field is separated by newlines into a file of newline-separated records where each field is separated by nothing (OFS). Just set the appropriate awk variables and recompile the record:
$ awk '{$1=$1}1' RS= OFS= file
ASDHJDJJDMFFMFAKAKJSJSJSL---SMSM-....SKSKK....SK
SKJHDDSNLDJSCCAK..SJSJSL--HGAHSM---..SKSKK-.-GHH

awk '
/^[[:space:]]*$/ {if (line) print line; line=""; next}
{line=line $0}
END {if (line) print line}
'
perl -00 -pe 's/\n//g; $_.="\n"'
For multiple files:
# adjust your glob pattern to suit,
# don't be shy to ask for assistance
for file in */*.txt; do
newfile="/some/directory/$(basename "$file")"
perl -00 -pe 's/\n//g; $_.="\n"' "$file" > "$newfile"
done

A Perl one-liner, if you prefer:
perl -nle 'BEGIN{$/=""};s/\n//g;print $_' file
The $/ variable is the equivalent of awk's RS variable. When set to the empty sting ("") it causes two or more empty lines to be treated as one empty line. This is the so-called "paragraph-mode" of reading. For each record read, all newline characters are removed. The -l switch adds a newline to the end of each output string, thus giving the desired result.

just try to find those double linebreaks: \n or \r and replace first those with an special sign like :$:
after that you replace every linebreak with an empty string to get the whole file in one line.
next, replace your special sign with a simple line break :)

Can I use the sed command to replace multiple empty line with one empty line?

I know there is a similar question in SO How can I replace mutliple empty lines with a single empty line in bash?. But my question is can this be implemented by just using the sed command?
Thanks

Give this a try:
sed '/^$/N;/^\n$/D' inputfile
Explanation:
/^$/N - match an empty line and append it to pattern space.
; - command delimiter, allows multiple commands on one line, can be used instead of separating commands into multiple -e clauses for versions of sed that support it.
/^\n$/D - if the pattern space contains only a newline in addition to the one at the end of the pattern space, in other words a sequence of more than one newline, then delete the first newline (more generally, the beginning of pattern space up to and including the first included newline)

You can do this by removing empty lines first and appending line space with G command:
sed '/^$/d;G' text.txt
Edit2: the above command will add empty lines between each paragraph, if this is not desired, you could do:
sed -n '1{/^$/p};{/./,/^$/p}'
Or, if you don't mind that all leading empty lines will be stripped, it may be written as:
sed -n '/./,/^$/p'
since the first expression just evaluates the first line, and prints it if it is blank.
Here: -n option suppresses pattern space auto-printing, /./,/^$/ defines the range between at least one character and none character (i.e. empty space between newlines) and p tells to print this range.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

command line method to remove single line paragraphs from text file - sed

awk -F '\n' -v RS='' -v ORS='\n\n' 'NF>1' input.txt When RS is set to the empty string, each record always ends at the first blank line encountered. When RS is set to the empty string, and FS is set to a single character, the newline character always acts as a field separator. [read more]

I tend to reach for Perl for paragraph-oriented parsing: perl -00 -lne 'print if tr/\n/\n/ > 0'

Related

Replace block of text inside a file using the contents of another file using sed

How to find patterns across multiple lines using perl

Alternatives to grep/sed that treat new lines as just another character

put all separate paragraphs of a file into a separate line

Can I use the sed command to replace multiple empty line with one empty line?

Categories

Resources