How to exclude end of lines of textfiles via terminal?

Given a file ./wordslist.txt with <word> <number_of_occurrences> on each line, such as:
aš toto 39626
ir 35938
tai 33361
tu 28520
kad 26213
...
How can I strip the trailing digits from each line so that output.txt contains:
aš toto
ir
tai
tu
kad
...
Note:
sed, find, cut or grep preferred. I cannot use anything that only keeps [a-z], since my data can contain ASCII letters, non-ASCII letters, Chinese characters, digits, etc.

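I suggest:
cut -d " " -f 1 wordslist.txt > output.txt
Note that this keeps only the text before the first space, so a multi-word entry such as "aš toto" is cut down to "aš". To remove just the trailing number, use:
sed -E 's/ [0-9]+$//' wordslist.txt > output.txt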
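A minimal sketch of how the two commands differ on entries that contain a space. The function name strip_count is made up for illustration, and the basic-regex form [0-9][0-9]* is used instead of -E so that the sketch does not depend on extended-regex support:

```shell
# strip_count (hypothetical name): remove one trailing " <digits>" column
# from every line read on stdin, leaving the rest of the line untouched.
strip_count() {
    sed 's/ [0-9][0-9]*$//'
}

# With sed, multi-word entries keep their inner spaces ...
printf '%s\n' 'aš toto 39626' 'ir 35938' | strip_count

# ... whereas cut keeps only the text before the first space.
printf '%s\n' 'aš toto 39626' 'ir 35938' | cut -d ' ' -f 1
```

The sed version works on bytes around the final space, so it should not care which script the words are written in.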

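Use awk to print the first word of each line. Like cut -f 1, this keeps only the text before the first space, so an entry such as "aš toto" loses its second word.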
awk '{print $1}' your_file > your_new_file

awk solution to simply print input line excluding last column
$ awk '{NF--; print}' wordslist.txt
aš toto
ir
tai
tu
kad
Note:
This will only work in some awks. Per POSIX, incrementing NF adds a null field, but decrementing NF is undefined behavior (thanks @EdMorton for the info).
This doesn't check whether the last column is numeric, and fields in the output are separated by a single space only.
If there can be empty lines in input file, use awk 'NF{NF--}1'
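Since decrementing NF is not portable, here is a hedged alternative that removes the last whitespace-separated column with sub() instead. The function name drop_last_column is an assumption for illustration. Unlike NF--, it leaves the spacing between the remaining columns as it was, and a line with a single column is printed unchanged rather than emptied:

```shell
# drop_last_column (hypothetical name): delete the final run of
# non-blank characters, together with the blanks in front of it.
drop_last_column() {
    awk '{ sub(/[ \t]+[^ \t]+[ \t]*$/, ""); print }'
}

printf '%s\n' 'aš toto 39626' 'ir 35938' | drop_last_column
```

Like the NF-- version, this does not check that the removed column is numeric.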

The following works :
sed -r 's/ [0-9]+$//g' wordslist.txt

Related

sed - Replace comma after first regex match

I'm trying to perform the following substitution on lines of the general format:
BBBBBBB.2018_08,XXXXXXXXXXXXX,01/01/2014,"109,07",DF,CCCCCCCCCCC, .......
As you can see, the problem is that it's a comma-separated file in which one specific field contains a decimal comma. I would like to replace that comma with a dot.
I've tried the following to replace the first occurrence of a pattern after a match, but to no avail. Could someone help me?
sed -e '/,"/!b' -e "s/,/./"
sed -e '/"/!b' -e ':a' -e "s/,/\./"
Thanks in advance. An awk or perl solution would help me as well. Here's an awk effort:
gawk -F "," 'substr($10, 0, 3)==3 && length($10)==12 { gsub(/,/,".", $10); print}'
That yielded the same file unchanged.

CSV files should be parsed in awk with a proper FPAT variable (a gawk extension) that defines what constitutes a valid field in such a file. Once you do that, you can just iterate over the fields to do the substitution you need:
gawk 'BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")"; OFS="," }
{ for(i=1; i<=NF;i++) if ($i ~ /[,]/) gsub(/[,]/,".",$i);}1' file
See this answer of mine to understand how to define and parse CSV file content with FPAT variable. Also see Save modifications in place with awk to do in-place file modifications like sed -i''.

The following sed (GNU syntax) converts the decimal comma in quoted numeric fields. The comma is required in the pattern, so quoted integers are left alone:
sed 's/"\([-+]\?[0-9]*\),\([0-9]\+\([eE][-+]\?[0-9]\+\)\?\)"/"\1.\2"/g'
See: https://www.regular-expressions.info/floatingpoint.html

This might work for you (GNU sed):
sed -E ':a;s/^([^"]*("[^",]*"[^"]*)*"[^",]*),/\1./;ta' file
This regexp matches a , within a pair of "'s and replaces it with a .. The regexp is anchored to the start of the line, so it has to be repeated until nothing more matches; hence the :a and ta commands, which cause the substitution to be repeated for as long as it succeeds.
N.B. The solution expects that all double quotes are matched and that no double quotes are quoted i.e. \" does not appear in a line.
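Under the same assumption (balanced, unescaped double quotes), the idea can also be sketched in portable awk without FPAT: split each line on the double quote, so that every even-numbered piece is the inside of a quoted field. The function name fix_quoted_commas is hypothetical, and the sketch replaces every comma inside any quoted field, not only numeric ones:

```shell
# fix_quoted_commas (hypothetical name): with FS set to a double quote,
# pieces 2, 4, 6, ... are the contents of the quoted fields.
fix_quoted_commas() {
    awk 'BEGIN { FS = OFS = "\"" }
         { for (i = 2; i <= NF; i += 2) gsub(/,/, ".", $i); print }'
}

printf '%s\n' 'BBBBBBB.2018_08,XXXXXXXXXXXXX,01/01/2014,"109,07",DF,CCCCCCCCCCC' |
    fix_quoted_commas
```

Lines without any quotes are printed unchanged, because no field is modified and awk therefore does not rebuild the record.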

If your input always follows that format of only one quoted field containing 1 comma then all you need is:
$ sed 's/\([^"]*"[^"]*\),/\1./' file
BBBBBBB.2018_08,XXXXXXXXXXXXX,01/01/2014,"109.07",DF,CCCCCCCCCCC, .......
If it's more complicated than that then see What's the most robust way to efficiently parse CSV using awk?.

Assuming you have this:
BBBBBBB.2018_08,XXXXXXXXXXXXX,01/01/2014,"109,07",DF,CCCCCCCCCCC
Try this:
awk -F',' '{print $1,$2,$3,$4"."$5,$6,$7}' filename | awk '$1=$1' FS=" " OFS=","
Output will be:
BBBBBBB.2018_08,XXXXXXXXXXXXX,01/01/2014,"109.07",DF,CCCCCCCCCCC
You simply need to know the field numbers for replacing the field separator between them.

In order to use regexp as in perl you have to activate extended regular expression with -r.
So if you want to replace all numbers and omit the " sign, then you can use this:
echo 'BBBBBBB.2018_08,XXXXXXXXXXXXX,01/01/2014,"109,07",DF,CCCCCCCCCCC, .......'|sed -r 's/\"([0-9]+)\,([0-9]+)\"/\1\.\2/g'
If you want to replace first occurrence only you can use that:
echo 'BBBBBBB.2018_08,XXXXXXXXXXXXX,01/01/2014,"109,07",DF,CCCCCCCCCCC, .......'|sed -r 's/\"([0-9]+)\,([0-9]+)\"/\1\.\2/1'
https://www.gnu.org/software/sed/manual/sed.txt

Substitution of characters limited to part of each input line

Have a file eg. Inventory.conf with lines like:
Int/domain—home.dir=/etc/int
I need to replace / and — before the = but not after.
Result should be:
Int_domain_home_dir=/etc/int
I have tried several sed commands but none seem to fit my need.

Sed with a t loop (BRE):
$ sed ':a;s/[-/—.]\(.*=\)/_\1/;ta;' <<< "Int/domain—home.dir=/etc/int"
Int_domain_home_dir=/etc/int
When one of the -/—. character is found, it's replaced with a _. Following text up to = is captured and output using backreference. If the previous substitution succeeds, the t command loops to label :a to check for further replacements.
Edit:
If you're on BSD/macOS (thanks @mklement0):
sed -e ':a' -e 's/[-/—.]\(.*=\)/_\1/;ta'

You're asking for a sed solution, but an awk solution is simpler and performs better in this case, because you can easily split the line into 2 fields by = and then selectively apply gsub() to only the 1st field in order to replace the characters of interest:
$ awk -F= '{ gsub("[./-]", "_", $1); print $1 FS $2 }' <<< 'Int/domain-home.dir=/etc/int'
Int_domain_home_dir=/etc/int
-F= tells awk to split the input into fields by =, which with the input at hand results in $1 (1st field) containing the first half of the line, before the =, and $2 (2nd field) the 2nd half, after the =; using the -F option sets variable FS, the input field separator.
gsub("[./-]", "_", $1) globally replaces all characters in set [./-] with _ in $1 - i.e., all occurrences of either ., / or - in the 1st field are replaced with a _ each.
print $1 FS $2 prints the result: the modified 1st field ($1), followed by FS (which is =), followed by the (unmodified) 2nd field ($2).
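One caveat with print $1 FS $2 is that anything after a second = in the value is dropped. A hedged variant that splits only at the first =, using index() and substr(), might look like this; the function name fix_key is made up for illustration:

```shell
# fix_key (hypothetical name): replace ., / and - with _ in the part of
# each line before the first "="; everything from the "=" on is kept as is.
fix_key() {
    awk '{
        p = index($0, "=")                 # position of the first "=", 0 if none
        if (p > 0) {
            key = substr($0, 1, p - 1)
            gsub("[./-]", "_", key)
            $0 = key substr($0, p)
        }
        print
    }'
}

printf '%s\n' 'Int/domain-home.dir=/etc/int' 'a.b=c=d/e' | fix_key
```

Lines that contain no = at all are passed through untouched.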
Note that I've used ASCII char. - (HYPHEN-MINUS, codepoint 0x2d) in the awk script, even though your sample input contains the Unicode char. — (EM DASH, U+2014, UTF-8 encoding 0xe2 0x80 0x94).
If you really want to match that, simply substitute it in the command above, but note that the awk version on macOS won't handle that properly.
Another option is to use iconv with ASCII transliteration, which translates the em dash into a regular ASCII -:
iconv -f utf-8 -t ascii//translit <<< 'Int/domain—home.dir=/etc/int' |
awk -F= '{ gsub("[./-]", "_", $1); print $1 FS $2 }'
perl allows for an elegant solution too:
$ perl -F= -ane '$F[0] =~ tr|-/.|_|; print join("=", @F)' <<<'Int/domain-home.dir=/etc/int'
Int_domain_home_dir=/etc/int
-F=, just like with Awk, tells Perl to use = as the separator when splitting lines into fields
-ane activates field splitting (a), turns off implicit output (n), and e tells Perl that the next argument is an expression (command string) to execute.
The fields that each line is split into are stored in array @F, where $F[0] refers to the 1st field.
$F[0] =~ tr|-/.|_| translates (replaces) all occurrences of -, /, and . to _.
print join("=", @F) rebuilds the input line from the fields - with the 1st field now modified - and prints the result.
Depending on the Awk implementation used, this may actually be faster (see below).
That sed isn't the best tool for this job is also reflected in the relative performance of the solutions:
Sample timings from my macOS 10.12 machine (GNU sed 4.2.2, Mawk awk 1.3.4, perl v5.18.2, using input file file, which contains 1 million copies of the sample input line) - take them with a grain of salt, but the ratios of the numbers are of interest; fastest solutions first:
# This answer's awk answer.
# Note: Mawk is much faster here than GNU Awk and BSD Awk.
$ time awk -F= '{ gsub("[./-]", "_", $1); print $1 FS $2 }' file >/dev/null
real 0m0.657s
# This answer's perl solution:
# Note: On macOS, this outperforms the Awk solution when using either
# GNU Awk or BSD Awk.
$ time perl -F= -ane '$F[0] =~ tr|-/.|_|; print join("=", @F)' file >/dev/null
real 0m1.656s
# Sundeep's perl solution with tr///
$ time perl -pe 's#^[^=]+#$&=~tr|/.-|_|r#e' file >/dev/null
real 0m2.370s
# Sundeep's perl solution with s///
$ time perl -pe 's#^[^=]+#$&=~s|[/.-]|_|gr#e' file >/dev/null
real 0m3.540s
# Cyrus' solution.
$ time sed 'h;s/[^=]*//;x;s/=.*//;s/[/.-]/_/g;G;s/\n//' file >/dev/null
real 0m4.090s
# Kenavoz' solution.
# Note: The 3-byte UTF-8 em dash is NOT included in the char. set,
# for consistency of comparison with the other solutions.
# Interestingly, adding the em dash adds another 2 seconds or so.
$ time sed ':a;s/[-/.]\(.*=\)/_\1/;ta' file >/dev/null
real 0m9.036s
As you can see, the awk solution (run with Mawk) is fastest by far, with the line-internal-loop sed solution predictably performing worst: it is slower by a factor of about 14.

With GNU sed:
echo 'Int/domain—home.dir=/etc/int' | sed 'h;s/[^=]*//;x;s/=.*//;s/[/—.]/_/g;G;s/\n//'
Output:
Int_domain_home_dir=/etc/int
See: man sed. I assume you want to replace dots too.

If perl solution is okay:
$ echo 'Int/domain-home.dir=/etc/int' | perl -pe 's#^[^=]+#$&=~s|[/.-]|_|gr#e'
Int_domain_home_dir=/etc/int
^[^=]+ string matching from start of line up to but not including the first occurrence of =
$&=~s|[/.-]|_|gr perform another substitution on matched string
replace all / or . or - characters with _
the r modifier would return the modified string
the e modifier allows to use expression instead of string in replacement section
# is used as delimiter to avoid having to escape / inside the character class [/.-]
Also, as suggested by @mklement0, we can use transliteration (tr) instead of the inner substitution:
$ echo 'Int/domain-home.dir=/etc/int' | perl -pe 's#^[^=]+#$&=~tr|/.-|_|r#e'
Int_domain_home_dir=/etc/int
Note that I've changed sample input, - is used instead of — which is what OP seems to want based on comments

Using sed or awk, how can I alter the first field in a delimited line?

I have a delimited file whose first few fields look like this:
2774013300|184500|2012-01-04 23:00:00|
and I want to alter certain rows whose first field equals or exceeds 8 characters.
I want to truncate the value in the first column.
In the case of 2774013300 I want its value to become 27740133.
I would like to do this in sed, preferably, or awk.
Using sed, I can find any number that exceeds 8 digits at the beginning of the line, but am not quite sure how to truncate it, using, I would assume, substitute.
sed -n -e /'^[0-9]\{10,\}/p' infile
I am thinking I could use grouping for the first 8 characters and return those in a substitute command, but I'm not quite sure how to do that.
In awk, I can detect the first field, but am not quite sure how to use substr to alter the first field and then return the remaining fields, so a full line is preserved.
awk -F'|' '{ if (length($1) > 9) { print $1; print length($1);} }' infile

Depending on the subtleties of your situation, you can use
sed 's/^\([0-9]\{8\}\)[0-9]*/\1/' infile
or
sed 's/^\([0-9]\{8\}\)[0-9]\{1,\}/\1/' infile
which with GNU sed can be simplified to
sed -r 's/^([0-9]{8})[0-9]+/\1/' infile
or, if you need to, add -n and p.
Example:
$ sed 's/^\([0-9]\{8\}\)[0-9]*/\1/' <<<'2774013300|184500|2012-01-04 23:00:00|'
27740133|184500|2012-01-04 23:00:00|

Using awk (substr() positions start at 1, so substr($1,1,8) keeps the first 8 characters):
awk -F'|' 'BEGIN{OFS=FS}length($1)>8{$1=substr($1,1,8)}{print}'
Example:
$ echo "2774013300|184500|2012-01-04 23:00:00|" | awk -F'|' 'BEGIN{OFS=FS}length($1)>8{$1=substr($1,1,8)}{print}'
27740133|184500|2012-01-04 23:00:00|
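A small sketch that makes the width a parameter, assuming the same |-delimited layout as in the question. The function name truncate_first_field and its argument are hypothetical:

```shell
# truncate_first_field WIDTH (hypothetical name): shorten the first
# |-separated field to at most WIDTH characters; other fields are kept.
truncate_first_field() {
    awk -F'|' -v w="$1" '
        BEGIN { OFS = FS }
        length($1) > w { $1 = substr($1, 1, w) }   # substr() is 1-based
        { print }'
}

printf '%s\n' '2774013300|184500|2012-01-04 23:00:00|' '123|x|' |
    truncate_first_field 8
```

Rows whose first field is already short enough are printed unchanged.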

using sed to convert numbers 0-9 to hex values from a large list

I have a 10,000,000 digit string of numbers. Numbers are not separated by anything, they are all crammed together like this (its a long string of the first 10,000,000 digits in pi):
1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679
I'm trying to use sed to replace each number with a hex color value. Here's my code:
sed -e 's/0/#F5F5F5/' -e 's/1/#FFE4B5/' -e 's/2/#98FB98/' -e 's/3/#ADFF2F/' -e 's/4/#FF69B4/' -e 's/5/#BA55D3/' -e 's/6/#FF6347/' -e 's/7/#2E8B57/' -e 's/8/#8B4513/' -e 's/9/#000000/' < pi > pi2
trouble is, sed starts converting numbers from my hexcode substitutions. I want those conversions to stay untouched. How do I prevent that? Hope this is clear enough.
ending up with results like this:
#FFE#FF#FF634#2E#8B4513B57#000000B4B#BA55D34159#98FB986535897932384626433832795#F5F5F528841971693993751

You can stick with this approach if you add one sed command to the beginning and make a small change to each of your existing commands:
sed -e 's/\(.\)/\1,/g' -e 's/0,/#F5F5F5/g' -e 's/1,/#FFE4B5/g' ... < pi > pi2
1) Mark each digit in the input by following it with a comma.
2) When you do the substitution, only replace digits that are followed by a comma, and remove the comma as part of each replacement.
Also, as a bug fix, you should also add the g option on each substitution.
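Written out in full with the ten color codes from the question, the marker approach might look like the sketch below. It assumes the input contains nothing but digits, uses & rather than a capture group to append the comma, and wraps the command in a function whose name, digits_to_hex, is made up for illustration:

```shell
# digits_to_hex (hypothetical name): tag every digit with a trailing comma,
# then replace each "digit," pair. Digits inside the inserted color codes
# are never followed by a comma, so they are left alone.
digits_to_hex() {
    sed -e 's/[0-9]/&,/g' \
        -e 's/0,/#F5F5F5/g' -e 's/1,/#FFE4B5/g' -e 's/2,/#98FB98/g' \
        -e 's/3,/#ADFF2F/g' -e 's/4,/#FF69B4/g' -e 's/5,/#BA55D3/g' \
        -e 's/6,/#FF6347/g' -e 's/7,/#2E8B57/g' -e 's/8,/#8B4513/g' \
        -e 's/9,/#000000/g'
}

printf '%s\n' 1415 | digits_to_hex
```

Usage on the real data would then be digits_to_hex < pi > pi2.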

try:
LC_ALL=C sed -e 's/[0-9]/,_&_,/g' < pi | sed -e 's/,_0_,/#F5F5F5/g;s/,_1_,/#FFE4B5/g .....' > pi2
and change the "delimiters" ,_ and _, depending on the content of the file, so that they never clash with it.
You may also need to add a space or ; after each color code, and perhaps a newline once in a while.

Probably easier to use perl:
perl -pe 'BEGIN{ $/=\1 }; s/0/#F5F5F5/ or s/1/#FFE4B5/ or s/2/#98FB98/ ...'

$ cat tst.awk
BEGIN {
    codes = "F5F5F5 FFE4B5 98FB98 ADFF2F FF69B4 BA55D3 FF6347 2E8B57 8B4513 000000"
    split(codes, hex)
    FS = ""
}
{
    for (i = 1; i <= NF; i++)
        printf "#%s", hex[($i)+1]
    print ""
}
$ echo 1415 | awk -f tst.awk
#FFE4B5#FF69B4#FFE4B5#BA55D3
$ echo 1042 | awk -f tst.awk
#FFE4B5#F5F5F5#FF69B4#98FB98
The above just creates an array mapping numbers to the codes in the string and then for each digit on the input prints the string from the array. Awk arrays start at 1, not zero, hence adding 1 to $i.
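Splitting a record into single characters with an empty FS works in gawk and mawk, but POSIX leaves that behavior unspecified. A hedged variant that walks the line with substr() instead, again assuming digit-only input, could look like this (the function name digits_to_hex_awk is hypothetical):

```shell
# digits_to_hex_awk (hypothetical name): look up each character of the
# line in a 1-based array of color codes; digit d maps to hex[d + 1].
digits_to_hex_awk() {
    awk 'BEGIN { split("F5F5F5 FFE4B5 98FB98 ADFF2F FF69B4 BA55D3 FF6347 2E8B57 8B4513 000000", hex, " ") }
         {
             n = length($0)
             for (i = 1; i <= n; i++)
                 printf "#%s", hex[substr($0, i, 1) + 1]
             print ""
         }'
}

echo 1042 | digits_to_hex_awk
```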

How can I replace each newline (\n) with a space using sed?

How can I replace a newline ("\n") with a space (" ") using the sed command?
I unsuccessfully tried:
sed 's#\n# #g' file
sed 's#^$# #g' file
How do I fix it?

sed is intended to be used on line-based input, although it can do what you need.
A better option here is to use the tr command as follows:
tr '\n' ' ' < input_filename
or remove the newline characters entirely:
tr -d '\n' < input.txt > output.txt
or if you have the GNU version (with its long options)
tr --delete '\n' < input.txt > output.txt

Use this solution with GNU sed:
sed ':a;N;$!ba;s/\n/ /g' file
This will read the whole file in a loop (:a;N;$!ba), then replace the newline(s) with a space (s/\n/ /g). Additional substitutions can be simply appended if needed.
Explanation:
sed starts by reading the first line excluding the newline into the pattern space.
Create a label via :a.
Append a newline and next line to the pattern space via N.
If we are before the last line, branch to the created label $!ba ($! means not to do it on the last line. This is necessary to avoid executing N again, which would terminate the script if there is no more input!).
Finally the substitution replaces every newline with a space on the pattern space (which is the whole file).
Here is cross-platform compatible syntax which works with BSD and OS X's sed (as per @Benjie's comment):
sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/ /g' file
As you can see, using sed for this otherwise simple problem is problematic. For a simpler and adequate solution see this answer.

Fast answer
sed ':a;N;$!ba;s/\n/ /g' file
:a create a label 'a'
N append the next line to the pattern space
$! if not the last line, ba branch (go to) label 'a'
s substitute, /\n/ regex for new line, / / by a space, /g global match (as many times as it can)
sed loops through steps 1 to 3 until it reaches the last line, collecting all lines in the pattern space, where it then substitutes all \n characters
Alternatives
Unlike sed, none of these alternatives needs to reach the last line before it starts producing output
with bash, slow
while IFS= read -r line; do printf '%s ' "$line"; done < file
with perl, sed-like speed
perl -p -e 's/\n/ /' file
with tr, faster than sed, can replace by one character only
tr '\n' ' ' < file
with paste, tr-like speed, can replace by one character only
paste -s -d ' ' file
with awk, tr-like speed
awk 1 ORS=' ' file
Other alternative like "echo $(< file)" is slow, works only on small files and needs to process the whole file to begin the process.
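The alternatives also differ in how they treat the final newline, which matters when the result is fed to another tool. A minimal sketch of the difference between tr and paste; the function names join_tr and join_paste are made up for illustration:

```shell
# join_tr turns every newline, including the last one, into a space.
join_tr()    { tr '\n' ' '; }
# join_paste joins the lines with spaces and ends the output with a newline.
join_paste() { paste -s -d ' ' -; }

printf 'a\nb\nc\n' | join_tr; echo '|'   # prints: a b c |
printf 'a\nb\nc\n' | join_paste          # prints: a b c
```

The echo '|' is only there to make the trailing space and the missing newline of the tr output visible.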
Long answer from the sed FAQ 5.10
5.10. Why can't I match or delete a newline using the \n escape
sequence? Why can't I match 2 or more lines using \n?
The \n will never match the newline at the end-of-line because the
newline is always stripped off before the line is placed into the
pattern space. To get 2 or more lines into the pattern space, use
the 'N' command or something similar (such as 'H;...;g;').
Sed works like this: sed reads one line at a time, chops off the
terminating newline, puts what is left into the pattern space where
the sed script can address or change it, and when the pattern space
is printed, appends a newline to stdout (or to a file). If the
pattern space is entirely or partially deleted with 'd' or 'D', the
newline is not added in such cases. Thus, scripts like
sed 's/\n//' file # to delete newlines from each line
sed 's/\n/foo\n/' file # to add a word to the end of each line
will NEVER work, because the trailing newline is removed before
the line is put into the pattern space. To perform the above tasks,
use one of these scripts instead:
tr -d '\n' < file # use tr to delete newlines
sed ':a;N;$!ba;s/\n//g' file # GNU sed to delete newlines
sed 's/$/ foo/' file # add "foo" to end of each line
Since versions of sed other than GNU sed have limits to the size of
the pattern buffer, the Unix 'tr' utility is to be preferred here.
If the last line of the file contains a newline, GNU sed will add
that newline to the output but delete all others, whereas tr will
delete all newlines.
To match a block of two or more lines, there are 3 basic choices:
(1) use the 'N' command to add the Next line to the pattern space;
(2) use the 'H' command at least twice to append the current line
to the Hold space, and then retrieve the lines from the hold space
with x, g, or G; or (3) use address ranges (see section 3.3, above)
to match lines between two specified addresses.
Choices (1) and (2) will put an \n into the pattern space, where it
can be addressed as desired ('s/ABC\nXYZ/alphabet/g'). One example
of using 'N' to delete a block of lines appears in section 4.13
("How do I delete a block of specific consecutive lines?"). This
example can be modified by changing the delete command to something
else, like 'p' (print), 'i' (insert), 'c' (change), 'a' (append),
or 's' (substitute).
Choice (3) will not put an \n into the pattern space, but it does
match a block of consecutive lines, so it may be that you don't
even need the \n to find what you're looking for. Since GNU sed
version 3.02.80 now supports this syntax:
sed '/start/,+4d' # to delete "start" plus the next 4 lines,
in addition to the traditional '/from here/,/to there/{...}' range
addresses, it may be possible to avoid the use of \n entirely.

A shorter awk alternative:
awk 1 ORS=' '
Explanation
An awk program is built up of rules which consist of conditional code-blocks, i.e.:
condition { code-block }
If the code-block is omitted, the default is used: { print $0 }. Thus, the 1 is interpreted as a true condition and print $0 is executed for each line.
When awk reads the input it splits it into records based on the value of RS (Record Separator), which by default is a newline, thus awk will by default parse the input line-wise. The splitting also involves stripping off RS from the input record.
Now, when printing a record, ORS (Output Record Separator) is appended to it, default is again a newline. So by changing ORS to a space all newlines are changed to spaces.

GNU sed has an option, -z, for null-separated records (lines). You can just call:
sed -z 's/\n/ /g'

The Perl version works the way you expected.
perl -i -p -e 's/\n//' file
As pointed out in the comments, it's worth noting that this edits in place. -i.bak will give you a backup of the original file before the replacement in case your regular expression isn't as smart as you thought.

Who needs sed? Here is the bash way:
cat test.txt | while read line; do echo -n "$line "; done

In order to replace all newlines with spaces using awk, without reading the whole file into memory:
awk '{printf "%s ", $0}' inputfile
If you want a final newline:
awk '{printf "%s ", $0} END {printf "\n"}' inputfile
You can use a character other than space:
awk '{printf "%s|", $0} END {printf "\n"}' inputfile
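The commands above leave a separator after the last line as well. If that is unwanted, one possible sketch prints the separator before every line except the first. The function name join_lines and its argument are hypothetical, and note that awk -v interprets backslash escapes in the separator:

```shell
# join_lines SEP (hypothetical name): join all input lines with SEP,
# without a trailing SEP, and end the output with one newline.
join_lines() {
    awk -v sep="$1" '
        { printf "%s%s", (NR > 1 ? sep : ""), $0 }
        END { if (NR > 0) print "" }'
}

printf 'a\nb\nc\n' | join_lines ' '   # prints: a b c
printf 'a\nb\nc\n' | join_lines '|'   # prints: a|b|c
```

Like the printf versions above, this handles one line at a time and does not hold the whole file in memory.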

tr '\n' ' '
is the command.
Simple and easy to use.

Three things.
tr (or cat, etc.) is absolutely not needed. (GNU) sed and (GNU) awk, when combined, can do 99.9% of any text processing you need.
stream != line based. ed is a line-based editor. sed is not. See sed lecture for more information on the difference. Most people confuse sed to be line-based because it is, by default, not very greedy in its pattern matching for SIMPLE matches - for instance, when doing pattern searching and replacing by one or two characters, it by default only replaces on the first match it finds (unless specified otherwise by the global command). There would not even be a global command if it were line-based rather than STREAM-based, because it would evaluate only lines at a time. Try running ed; you'll notice the difference. ed is pretty useful if you want to iterate over specific lines (such as in a for-loop), but most of the times you'll just want sed.
That being said,
sed -e '{:q;N;s/\n/ /g;t q}' file
works just fine in GNU sed version 4.2.1. The above command will replace all newlines with spaces. It's ugly and a bit cumbersome to type in, but it works just fine. The {}'s can be left out, as they're only included for sanity reasons.

Why didn't I find a simple solution with awk?
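awk '{printf "%s", $0}' file
printf prints every line without a trailing newline. Passing the line through "%s", rather than using it as the format string, keeps any % characters in the data intact. If you want to separate the original lines with a space or another string:
awk '{printf "%s ", $0}' file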

The answer with the :a label ...
How can I replace a newline (\n) using sed?
... does not work in freebsd 7.2 on the command line:
( echo foo ; echo bar ) | sed ':a;N;$!ba;s/\n/ /g'
sed: 1: ":a;N;$!ba;s/\n/ /g": unused label 'a;N;$!ba;s/\n/ /g'
foo
bar
But does if you put the sed script in a file or use -e to "build" the sed script...
> (echo foo; echo bar) | sed -e :a -e N -e '$!ba' -e 's/\n/ /g'
foo bar
or ...
> cat > x.sed << eof
:a
N
$!ba
s/\n/ /g
eof
> (echo foo; echo bar) | sed -f x.sed
foo bar
Maybe the sed in OS X is similar.

Easy-to-understand Solution
I had this problem. The kicker was that I needed the solution to work on BSD's (Mac OS X) and GNU's (Linux and Cygwin) sed and tr:
$ echo 'foo
bar
baz

foo2
bar2
baz2' \
| tr '\n' '\000' \
| sed 's:\x00\x00.*:\n:g' \
| tr '\000' '\n'
Output:
foo
bar
baz
(has trailing newline)
It works on Linux, OS X, and BSD - even without UTF-8 support or with a crappy terminal.
Use tr to swap the newline with another character.
NUL (\000 or \x00) is nice because it doesn't need UTF-8 support and it's not likely to be used.
Use sed to match the NUL.
Use tr to swap back extra newlines if you need them.

You can use xargs:
seq 10 | xargs
or
seq 10 | xargs echo -n

cat file | xargs
for the sake of completeness

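If you are unfortunate enough to have to deal with Windows line endings, you need to handle both the \r and the \n. Deleting the \r first means that each line ending becomes a single space rather than two:
tr -d '\r' < "$input" | tr '\n' ' ' > "$output"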

I'm not an expert, but I guess in sed you'd first need to append the next line into the pattern space, by using "N". From the section "Multiline Pattern Space" in "Advanced sed Commands" of the book sed & awk (Dale Dougherty and Arnold Robbins; O'Reilly 1997; page 107 in the preview):
The multiline Next (N) command creates a multiline pattern space by reading a new line of input and appending it to the contents of the pattern space. The original contents of pattern space and the new input line are separated by a newline. The embedded newline character can be matched in patterns by the escape sequence "\n". In a multiline pattern space, the metacharacter "^" matches the very first character of the pattern space, and not the character(s) following any embedded newline(s). Similarly, "$" matches only the final newline in the pattern space, and not any embedded newline(s). After the Next command is executed, control is then passed to subsequent commands in the script.
From man sed:
[2addr]N
Append the next line of input to the pattern space, using an embedded newline character to separate the appended material from the original contents. Note that the current line number changes.
I've used this to search (multiple) badly formatted log files, in which the search string may be found on an "orphaned" next line.

In response to the "tr" solution above, on Windows (probably using the Gnuwin32 version of tr), the proposed solution:
tr '\n' ' ' < input
was not working for me, it'd either error or actually replace the \n w/ '' for some reason.
Using another feature of tr, the "delete" option -d did work though:
tr -d '\n' < input
or '\r\n' instead of '\n'

I used a hybrid approach to get around the newline thing by using tr to replace newlines with tabs, then replacing tabs with whatever I want. In this case, " <br> ", since I'm trying to generate HTML breaks.
echo -e "a\nb\nc\n" | tr '\n' '\t' | sed 's/\t/ <br> /g'

You can also use this method:
sed 'x;G;1!h;s/\n/ /g;$!d'
Explanation
x - exchanges the contents of the pattern space and the hold space.
G - appends the hold space to the pattern space (after a newline).
h - copies the pattern space to the hold space.
1!h - on every line except the first, copies the pattern space to the hold space. On the first line the pattern space starts with a stray \n, so it is not copied, and the plain first line stays in the hold space.
$!d - on every line except the last, deletes the pattern space, so nothing is printed until the last line.
Flow
When the first line is read, x moves it (1) to the hold space and leaves the pattern space empty. G then appends the hold space, giving \n1; the substitution turns that into " 1", and d deletes the pattern space.
On the second line, x puts 2 in the hold space and brings 1 back into the pattern space. G appends the hold space, giving 1\n2; h copies that to the hold space; the substitution produces "1 2", which d then deletes. This continues until the last line, where d is skipped and the accumulated result is printed.

Bullet-proof solution. Binary-data-safe and POSIX-compliant, but slow.
POSIX sed requires input according to the POSIX text file and POSIX line definitions, so NULL-bytes and too long lines are not allowed and each line must end with a newline (including the last line). This makes it hard to use sed for processing arbitrary input data.
The following solution avoids sed and instead converts the input bytes to octal codes and then to bytes again, but intercepts octal code 012 (newline) and outputs the replacement string in place of it. As far as I can tell the solution is POSIX-compliant, so it should work on a wide variety of platforms.
od -A n -t o1 -v | tr ' \t' '\n\n' | grep . |
while read x; do [ "0$x" -eq 012 ] && printf '<br>\n' || printf "\\$x"; done
POSIX reference documentation: sh, shell command language, od, tr, grep, read, [, printf.
read, [, and printf are all built-ins in at least bash, but that is probably not guaranteed by POSIX, so on some platforms it could be that each input byte will start one or more new processes, which will slow things down. Even in bash this solution only reaches about 50 kB/s, so it's not suited for large files.
Tested on Ubuntu (bash, dash, and busybox), FreeBSD, and OpenBSD.

In some situations maybe you can change RS to some other string or character. This way, \n is available for sub/gsub:
$ gawk 'BEGIN {RS="dn" } {gsub("\n"," ") ;print $0 }' file
The power of shell scripting is that if you do not know how to do something one way, you can do it another way. And often there is more to take into account than building a complex solution to a simple problem.
As for gawk being slow and reading the whole file into memory: I do not know about that, but to me gawk seems to work on one line at a time and is very fast (not as fast as some of the others, but the time to write and test also counts).
I process MB and even GB of data, and the only limit I found is line size.

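Find and replace with a pattern that contains \n (GNU sed). Note that -ie would be read as -i with the backup suffix "e", so the options are kept separate:
sed -i -z -e 's/Marker\n/# Marker Comment\nMarker\n/g' myfile.txt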
Marker
Becomes
# Marker Comment
Marker

You could use xargs — it will replace \n with a space by default.
However, it would have problems if your input has any case of an unterminated quote, e.g. if the quote signs on a given line don't match.

On Mac OS X (using FreeBSD sed):
# replace each newline with a space
printf "a\nb\nc\nd\ne\nf" | sed -E -e :a -e '$!N; s/\n/ /g; ta'
printf "a\nb\nc\nd\ne\nf" | sed -E -e :a -e '$!N; s/\n/ /g' -e ta

To remove empty lines:
sed -n "s/^$//;t;p;"

Using Awk:
awk "BEGIN { o=\"\" } { o=o \" \" \$0 } END { print o; }"

A solution I particularly like is to append all the file in the hold space and replace all newlines at the end of file:
$ (echo foo; echo bar) | sed -n 'H;${x;s/\n//g;p;}'
foobar
However, someone told me that the hold space can be limited in size in some sed implementations.
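To replace the newlines with a space rather than delete them, the same idea needs one extra step: H puts a newline in front of every line it appends, so the hold space starts with a newline that has to be removed first. A minimal sketch, with the hypothetical function name join_hold:

```shell
# join_hold (hypothetical name): collect all lines in the hold space, then
# on the last line drop the leading newline and turn the rest into spaces.
join_hold() {
    sed -n 'H;${x;s/^\n//;s/\n/ /g;p;}'
}

(echo foo; echo bar) | join_hold   # prints: foo bar
```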

Replace newlines with any string, and replace the last newline too
The pure tr solutions can only replace with a single character, and the pure sed solutions don't replace the last newline of the input. The following solution fixes these problems, and seems to be safe for binary data (even with a UTF-8 locale):
printf '1\n2\n3\n' |
sed 's/%/%p/g;s/#/%a/g' | tr '\n' '#' | sed 's/#/<br>/g;s/%a/#/g;s/%p/%/g'
Result:
1<br>2<br>3<br>
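The escaping step is what keeps literal # and % characters in the data safe: % is first rewritten as %p and # as %a, which frees # to stand in for the newline, and the last sed undoes the escaping in the reverse order. A small sketch with input that contains both characters; the function name nl_to_br is made up for illustration, and the # passed to tr is quoted so that the shell does not treat it as the start of a comment:

```shell
# nl_to_br (hypothetical name): replace every newline with "<br>".
#   1. escape:   %  -> %p,  #  -> %a   (no bare # is left in the data)
#   2. tr:       newline -> #
#   3. unescape: #  -> <br>, %a -> #, %p -> %
nl_to_br() {
    sed 's/%/%p/g;s/#/%a/g' | tr '\n' '#' | sed 's/#/<br>/g;s/%a/#/g;s/%p/%/g'
}

printf 'a#b\n50%%\n' | nl_to_br   # prints: a#b<br>50%<br>
```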

It is sed that introduces the new-lines after "normal" substitution. First, it trims the new-line char, then it processes according to your instructions, then it introduces a new-line.
Using sed you can replace "the end" of a line (not the new-line char) after being trimmed, with a string of your choice, for each input line; but, sed will output different lines. For example, suppose you wanted to replace the "end of line" with "===" (more general than a replacing with a single space):
PROMPT~$ cat <<EOF |sed 's/$/===/g'
first line
second line
3rd line
EOF
first line===
second line===
3rd line===
PROMPT~$
To replace the new-line char with the string, you can, inefficiently though, use tr, as pointed out before, to replace the newline chars with a "special char" and then use sed to replace that special char with the string you want.
For example:
PROMPT~$ cat <<EOF | tr '\n' $'\x01'|sed -e 's/\x01/===/g'
first line
second line
3rd line
EOF
first line===second line===3rd line===PROMPT~$
