Delete everything between the first tab and last semicolon


I have a file whose lines look like this:
EF457507|S000834932 Root;Bacteria;"Acidobacteria";Acidobacteria_Gp4;Gp4
EF457374|S000834799 Root;Bacteria;"Acidobacteria";Acidobacteria_Gp14;Gp14
AJ133184|S000323093 Root;Bacteria;Cyanobacteria/Chloroplast;Cyanobacteria;Family I;GpI
DQ490004|S000686022 Root;Bacteria;"Armatimonadetes";Armatimonadetes_gp7
AF268998|S000340459 Root;Bacteria;TM7;TM7_genera_incertae_sedis
I would like to delete everything between the first tab and the last semicolon, like this:
EF457507|S000834932 Gp4
EF457374|S000834799 Gp14
AJ133184|S000323093 GpI
DQ490004|S000686022 Armatimonadetes_gp7
AF268998|S000340459 TM7_genera_incertae_sedis
I tried to use a regex but it doesn't work. Is there any way to do it with Linux tools such as awk or Perl?

You could use sed:
sed 's/\t.*;/\t/' file
## This matches a tab character '\t', followed by any character '.' any number of
## times '*', followed by a semicolon, and replaces all of this with a tab
## character '\t'.
sed 's/[^\t]*;//' file
## Things inside square brackets become a character class. For example, '[0-9]'
## is a character class. Obviously, this would match any digit between zero and
## nine. However, when the first character in the character class is a '^', the
## character class becomes negated. So '[^\t]*;' means match anything not a tab
## character any number of times followed by a semicolon.
Or awk:
awk 'BEGIN { FS=OFS="\t" } { sub(/.*;/,"",$2) }1' file
awk '{ sub(/[^\t]*;/,"") }1' file
Results:
EF457507|S000834932 Gp4
EF457374|S000834799 Gp14
AJ133184|S000323093 GpI
DQ490004|S000686022 Armatimonadetes_gp7
AF268998|S000340459 TM7_genera_incertae_sedis
As per comments below, to 'remove everything after the last semicolon', with sed:
sed 's/[^;]*$//' file
## '[^;]*$' will match anything not a semicolon any number of times anchored to
## the end of the line.
Or awk:
awk 'BEGIN { FS=OFS="\t" } { sub(/[^;]*$/,"",$2) }1' file
awk '{ sub(/[^;]*$/,"") }1' file
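The \t escape inside a sed regex is a GNU extension, so the first sed command may behave differently on BSD or macOS sed. A minimal portable sketch, assuming a POSIX shell, builds a literal tab with printf and splices it into the expression. The function name strip_lineage is made up for illustration and is not part of the original answer.

```shell
# strip_lineage is a hypothetical helper name used only for this sketch.
# It deletes everything between the first tab and the last semicolon.
strip_lineage() {
  tab=$(printf '\t')                 # literal tab; avoids relying on \t in sed
  sed "s/${tab}.*;/${tab}/" "$@"     # reads the named files, or stdin if none
}

# Sample line with a real tab between the ID and the lineage.
printf 'EF457507|S000834932\tRoot;Bacteria;"Acidobacteria";Acidobacteria_Gp4;Gp4\n' |
  strip_lineage
```

Lines that contain no tab are passed through unchanged, because the pattern requires one.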

Related

Replacing all occurrence after nth occurrence in a line in perl

I need to replace all occurrences of a string after nth occurrence in every line of a Unix file.
My file data:
:account_id:12345:6789:Melbourne:Aus
:account_id:98765:43210:Adelaide:Aus
My output data:
:account_id:123456789MelbourneAus
:account_id:9876543210AdelaideAus
I tried using sed: sed 's/://3g' test.txt
Unfortunately, the g option combined with the occurrence number is not working as expected. Instead, it is replacing all the occurrences.

Another approach using awk
awk -v c=':' -v n=2 'BEGIN{
FS=OFS=""
}
{
j=0;
for(i=0; ++i<=NF;)
if($i==c && j++>=n)$i=""
}1' file
$ cat file
:account_id:12345:6789:Melbourne:Aus
:account_id:98765:43210:Adelaide:Aus
$ awk -v c=':' -v n=2 'BEGIN{FS=OFS=""}{j=0;for(i=0; ++i<=NF;)if($i==c && j++>=n)$i=""}1' file
:account_id:123456789MelbourneAus
:account_id:9876543210AdelaideAus
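Splitting a record into single characters with an empty FS works in gawk and mawk but is not guaranteed by POSIX. A sketch of the same idea that only uses index() and substr() follows. The helper name strip_after_nth and its argument order are assumptions made for this example, and the delimiter is assumed to be a single character.

```shell
# strip_after_nth is a hypothetical helper: strip_after_nth N CHAR [file...]
# It keeps the first N occurrences of CHAR on each line and deletes the rest.
strip_after_nth() {
  n=$1; c=$2; shift 2
  awk -v n="$n" -v c="$c" '{
    out = ""; rest = $0; seen = 0
    # Walk the line one delimiter at a time.
    while ((p = index(rest, c)) > 0) {
      seen++
      out = out substr(rest, 1, p - 1) (seen <= n ? c : "")
      rest = substr(rest, p + 1)     # assumes c is exactly one character
    }
    print out rest
  }' "$@"
}

printf ':account_id:12345:6789:Melbourne:Aus\n' | strip_after_nth 2 ':'
```

Passing n and the delimiter as arguments keeps the awk program itself free of shell interpolation.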

With GNU awk, you could use gensub as follows. This is based entirely on the shown samples, where the OP wants to remove every : from the 3rd occurrence onwards. gensub splits each line into the part up to and including the second colon and the rest of the line, and all colons are then removed from the rest.
awk -v regex="^([^:]*:)([^:]*:)(.*)" '
{
firstPart=restPart=""
firstPart=gensub(regex, "\\1\\2", "1", $0)
restPart=gensub(regex,"\\3","1",$0)
gsub(/:/,"",restPart)
print firstPart restPart
}
' Input_file

I have inferred based on the limited data you've given us, so it's possible this won't work. But I wouldn't use regex for this job. What you have there is colon delimited fields.
So I'd approach it using split to extract the data, and then some form of string formatting to reassemble exactly what you like:
#!/usr/bin/perl
use strict;
use warnings;
while (<DATA>) {
chomp;
my ( undef, $first, @rest ) = split /:/;
print ":$first:", join ( "", @rest ),"\n";
}
__DATA__
:account_id:12345:6789:Melbourne:Aus
:account_id:98765:43210:Adelaide:Aus
This gives you the desired result, whilst IMO being considerably clearer for the next reader than a complicated regex.

You can use the perl solution like
perl -pe 's~^(?:[^:]*:){2}(*SKIP)(?!)|:~~g if /^:account_id:/' test.txt
The ^(?:[^:]*:){2}(*SKIP)(?!)|: regex means:
^(?:[^:]*:){2}(*SKIP)(?!) - match
^ - start of string (here, a line)
(?:[^:]*:){2} - two occurrences of any zero or more chars other than a : and then a : char
(*SKIP)(?!) - skip the match and go on to search for the next match from the failure position
| - or
: - match a : char.
The replacement only runs if the current line starts with :account_id: (see the trailing if /^:account_id:/).
Or an awk solution like
awk 'BEGIN{OFS=FS=":"} /^:account_id:/ {result="";for (i=1; i<=NF; ++i) { result = result (i > 2 ? $i : $i OFS)}; print result}' test.txt
Details:
BEGIN{OFS=FS=":"} - sets the input/output field separator to :
/^:account_id:/ - line must start with :account_id:
result="" - sets result variable to an empty string
for (i=1; i<=NF; ++i) { result = result (i > 2 ? $i : $i OFS)}; print result} - iterates over the fields and if the field number is greater than 2, just append the current field value to result, else, append the value + output field separator; then print the result.

If n is fixed and equal to 2, I would use GNU AWK in the following way. Let the content of file.txt be
:account_id:12345:6789:Melbourne:Aus
:account_id:98765:43210:Adelaide:Aus
then
awk 'BEGIN{FS=":";OFS=""}{$2=FS $2 FS;print}' file.txt
output
:account_id:123456789MelbourneAus
:account_id:9876543210AdelaideAus
Explanation: use : as the field separator and the empty string as the output field separator. This by itself removes every :, so I add back the two that have to be preserved: the 1st (before the second column) and the 2nd (after the second column). Beware that I tested it solely on this data, so test it with more inputs before relying on it.
(tested in gawk 4.2.1)

This might work for you (GNU sed):
sed 's/:/\n/3;h;s/://g;H;g;s/\n.*\n//' file
Replace the third occurrence of : by a newline.
Make a copy of the line.
Delete all occurrences of :'s.
Append the amended line to the copy.
Join the two lines by removing everything from third occurrence of the copy to the third occurrence of the amended line.
N.B. A newline is the best delimiter to use with sed, as the lines presented to sed's commands are initially devoid of newlines. The important property of the delimiter, however, is that it is unique, so any character will do as long as it is not found anywhere in the data set.
An alternative solution uses a loop to remove all :'s after the first two:
sed -E ':a;s/^(([^:]*:){2}[^:]*):/\1/;ta' file
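The loop variant extends naturally to other values of n by interpolating the count into the interval expression. The following is a sketch under that assumption. keep_first_n_colons is a hypothetical wrapper name, and the label, substitution and branch are passed as separate -e expressions so the script does not depend on how a particular sed parses semicolons after a label.

```shell
# keep_first_n_colons is a hypothetical wrapper: keep_first_n_colons N [file...]
# It repeatedly deletes the colon that follows the first N colons until none is left.
keep_first_n_colons() {
  n=$1; shift
  sed -E -e ':a' -e "s/^(([^:]*:){${n}}[^:]*):/\\1/" -e 'ta' "$@"
}

printf ':account_id:12345:6789:Melbourne:Aus\n' | keep_first_n_colons 2
```

Each pass removes exactly one colon, so a line with k surplus colons takes k iterations of the loop.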

With GNU awk for the 3rd arg to match() and gensub():
$ awk 'match($0,/(:[^:]+:)(.*)/,a){ $0=a[1] gensub(/:/,"","g",a[2]) } 1' file
:account_id:123456789MelbourneAus
:account_id:9876543210AdelaideAus
and with any awk in any shell on every Unix box:
$ awk 'match($0,/:[^:]+:/){ tgt=substr($0,1+RLENGTH); gsub(/:/,"",tgt); $0=substr($0,1,RLENGTH) tgt } 1' file
:account_id:123456789MelbourneAus
:account_id:9876543210AdelaideAus

Eliminate duplicate words across lines

I'd like a sed script that eliminates repeated words in a text file on one or more lines. For example:
this is is is a text file file it is littered with duplicate words
words words on one or more lines lines
lines
lines
should transform to:
this is a text file it is littered with duplicate words
on one or more lines
This awk script produces the correct output:
{
for (i = 1; i <= NF; i++) {
word = $i
if (word != last) {
if (i < NF) {
next_word = $(i+1)
if (word != next_word) {
printf("%s ", word)
}
} else {
printf("%s\n", word)
}
}
}
last = word
}
but I'd really like a sed "one-liner".

This works with GNU sed, at least for the example input:
$ sed -Ez ':a;s/(\<\S+)(\s+)\1\s+/\1\2/g;ta' infile
This is a text file and is littered with duplicate words
on one or more lines
The -E option is just there to avoid having to escape the capture group parentheses and + quantifiers.
-z treats the input as null byte separated, i.e., as a single line.
The command is then structured as
:a # label
s///g # substitution
ta # jump to label if substitution did something
And the substitution is this:
s/(\<\S+)(\s+)\1\s+/\1\2/g
First capture group: (\<\S+) – a complete word (start-of-word boundary, then one or more non-space characters)
Second capture group: (\s+) – any number of blanks after that first word
\1\s+ – the first word again plus whatever blanks follow it
This preserves the whitespace after the first word and discards the whitespace after the duplicate.
Note that -E, -z, \<, \S and \s are all GNU extensions to POSIX sed.
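Where GNU sed is not available, the same result can be sketched in POSIX awk by remembering the last word seen across lines. This is an alternative technique, not the sed one-liner that was asked for, and dedupe_words is a hypothetical name. It assumes words are whitespace-separated, compares them case-sensitively, and normalises runs of whitespace to single spaces.

```shell
# dedupe_words is a hypothetical helper that drops a word when it equals the
# previous word, even if the previous word was on an earlier line.
dedupe_words() {
  awk '{
    line = ""
    for (i = 1; i <= NF; i++) {
      if ($i != last) line = (line == "" ? $i : line " " $i)
      last = $i                      # carried over to the next input line
    }
    if (line != "") print line       # skip lines that became empty
  }' "$@"
}

printf '%s\n' \
  'this is is is a text file file it is littered with duplicate words' \
  'words words on one or more lines lines' \
  'lines' \
  'lines' | dedupe_words
```

Because only one word of state is kept, this processes the input as a stream instead of holding the whole file in the pattern space.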

With sed, you can use
sed -E 's/([a-z]+) +\1/\1/g'
Note that it works for duplicates, but not for triplicates or for duplicates split across line breaks.
This can be fixed by joining all the lines and looping:
sed -E ':a;N;s/(\b[a-z]+\b)([ \n])[ \n]*\b\1\b */\1\2/g;ba'

sed -En '
H
${
g
s/^\n//
s/(\<([[:alnum:]]+)[[:space:]]+)(\2([[:space:]]+|$))+/\1/g
p
}
' file
This is a text file with duplicate words
on one or more lines
where
H -- append each line to the hold space
${...} -- on the last line, perform the enclosed commands
g -- replace pattern space with the contents of the hold space
s/^\n// -- remove leading newline (side-effect of H on first line)
s/(\<([[:alnum:]]+)[[:space:]]+)(\2([[:space:]]+|$))+/\1/g
..1..2............2............1..........................
The key here is to capture the text and the spaces separately so that the back reference can match with differing whitespace.
Captured expression #1 is the first word and its whitespace (which can contain newlines), and capture #2 is just the word.

Using sed to remove embedded newlines

What is a sed script that will remove the "\n" character but only if it is inside "" characters (delimited string), not the \n that is actually at the end of the (virtual) line?
For example, I want to turn this file
"lalala","lalalslalsa"
"lalalala","lkjasjdf
asdfasfd"
"lalala","dasdf"
(line 2 has an embedded \n ) into this one
"lalala","lalalslalsa"
"lalalala","lkjasjdf \\n asdfasfd"
"lalala","dasdf"
(Line 2 and 3 are now joined, and the real line feed was replaced with the character string \\n (or any other easy to spot character string, I'm not picky))
I don't just want to remove every other newline as a previous question asked, nor do I want to remove ALL newlines, just those that are inside quotes. I'm not wedded to sed, if awk would work, that's fine too.
The file being operated on is too large to fit in memory all at once.

sed is an excellent tool for simple substitutions on a single line, but for anything else you should use awk, e.g.:
$ cat tst.awk
{
if (/"$/) {
print prev $0
prev = ""
}
else {
prev = prev $0 " \\\\n "
}
}
$ awk -f tst.awk file
"lalala","lalalslalsa"
"lalalala","lkjasjdf \\n asdfasfd"
"lalala","dasdf"
Below is my original answer, but after seeing @NeronLeVelu's approach of just testing for a quote at the end of the line I realized I was doing this in a much too complicated way. You could just replace gsub(/"/,"&") % 2 below with /"$/ and it would work the same, but the above code is a simpler implementation of the same functionality and will now handle embedded escaped double quotes as long as they aren't at the end of a line.
$ cat tst.awk
{ $0 = saved $0; saved="" }
gsub(/"/,"&") % 2 { saved = $0 " \\\\n "; next }
{ print }
$ awk -f tst.awk file
"lalala","lalalslalsa"
"lalalala","lkjasjdf \\n asdfasfd"
"lalala","dasdf"
The above only stores 1 output line in memory at a time. It just keeps building up an output line from input lines while the number of double quotes in that output line is an odd number, then prints the output line when it eventually contains an even number of double quotes.
It will fail if you can have double quotes inside your quoted strings escaped as \", not "", but you don't show that in your posted sample input so hopefully you don't have that situation. If you have that situation you need to write/use a real CSV parser.
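To make the quote-parity idea concrete, here is a small sketch that wraps the counting script in a function and feeds it a field containing a doubled ("") quote at the end of a physical line, a case where testing only for a trailing quote would split the record too early. The name join_quoted and the sample data are made up for illustration.

```shell
# join_quoted is a hypothetical wrapper around the quote-counting awk script.
# A line is held back while the accumulated text has an odd number of quotes.
join_quoted() {
  awk '
    { $0 = saved $0; saved = "" }
    gsub(/"/, "&") % 2 { saved = $0 " \\\\n "; next }   # gsub returns the quote count
    { print }
  ' "$@"
}

# The first physical line ends in a quote, but its quote count (7) is odd,
# so it is still joined with the line that follows.
printf '%s\n' '"a","say ""hi""' 'there"' | join_quoted
```

Only the record currently being assembled is kept in memory, which matches the constraint that the file is too large to load at once.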

sed -n ':load
/"$/ !{N
b load
}
:cycle
s/^\(\([^"]*"[^"]*"\)*\)\([^"]*"[^"]*\)\n/\1\3 \\\\n /
t cycle
p' YourFile
Load lines into the pattern space until a closing line (one ending with ") is found or the end of input is reached.
Replace a newline that sits inside an open quoted string, i.e. one preceded by any number of complete "..." pairs followed by a single unmatched ", with the escaped form of the newline ( \\n ).
If a substitution occurred, try another one (:cycle and t cycle).
Print the result.
Continue until the end of the file.
Thanks to @Ed Morton for the remark about the escaped newline.

Remove newline depending on the format of the next line

I have a special file with this kind of format:
title1
_1 texthere
title2
_2 texthere
I would like all lines starting with "_" to be appended as a second column to the line before.
I tried to do that using sed with this command:
sed 's/_\n/ /g' filename
but it is not giving me what I want (it basically does nothing).
Can anyone point me to the right way of doing it?
Thanks

Try following solution:
In sed, the loop is built with a label (:a): while the current line is not the last one ($!), append the next line (N) and branch back to label a:
:a
$! {
N
b a
}
After this we have the whole file in memory, so do a global substitution for each _ preceded by a newline:
s/\n_/ _/g
p
All together is:
sed -ne ':a ; $! { N ; ba }; s/\n_/ _/g ; p' infile
That yields:
title1 _1 texthere
title2 _2 texthere

If your whole file is like your sample (pairs of lines), then the simplest answer is
paste - - < file
Otherwise
awk '
NR > 1 && /^_/ {printf "%s", OFS}
NR > 1 && !/^_/ {print ""}
{printf "%s", $0}
END {print ""}
' file

This might work for you (GNU sed):
sed ':a;N;s/\n_/ /;ta;P;D' file
This avoids slurping the file into memory.
or:
sed -e ':a' -e 'N' -e 's/\n_/ /' -e 'ta' -e 'P' -e 'D' file
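On sed implementations other than GNU, a bare N on the last line may end the script without printing the pattern space. A common guard is $!N, which skips the append when there is no next line. The sketch below applies that guard; join_underscore_lines is a hypothetical name, and unlike the commands above it keeps the leading underscore in the joined text.

```shell
# join_underscore_lines is a hypothetical helper: it appends every line that
# starts with "_" to the line before it, holding only a few lines at a time.
join_underscore_lines() {
  sed -e ':a' -e '$!N' -e 's/\n_/ _/' -e 'ta' -e 'P' -e 'D' "$@"
}

printf '%s\n' 'title1' '_1 texthere' 'title2' '_2 texthere' | join_underscore_lines
```

Changing the replacement to a single space drops the underscore instead.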

A Perl approach:
perl -00pe 's/\n_/ /g' file
Here, the -00 causes perl to read the file in paragraph mode where a "line" is defined by two consecutive newlines. In your example, it will read the entire file into memory and therefore, a simple global substitution of \n_ with a space will work.
That is not very efficient for very large files though. If your data is too large to fit in memory, use this:
perl -ne 'chomp;
s/^_// ? print "$l " : print "$l\n" if $. > 1;
$l=$_;
END{print "$l\n"}' file
Here, the file is read line by line (-n) and the trailing newline removed from all lines (chomp). At the end of each iteration, the current line is saved as $l ($l=$_). At each line, if the substitution is successful and a _ was removed from the beginning of the line (s/^_//), then the previous line is printed with a space in place of a newline print "$l ". If the substitution failed, the previous line is printed with a newline. The END{} block just prints the final line of the file.

Search for a particular multiline pattern using awk and sed

I want to read from the file /etc/lvm/lvm.conf and check for the below pattern that could span across multiple lines.
tags {
hosttags = 1
}
There could be any amount of whitespace between tags and {, between { and hosttags, and so forth. Also, { could follow tags on the next line instead of being on the same line.
I'm planning to use awk and sed to do this.
While reading the file lvm.conf, it should skip empty lines and comments.
I'm doing that using:
data=$(awk < cat `cat /etc/lvm/lvm.conf`
/^#/ { next }
/^[[:space:]]*#/ { next }
/^[[:space:]]*$/ { next }
.
.
How can I use sed to find the pattern I described above?

Are you looking for something like this
sed -n '/{/,/}/p' input
i.e. print lines between tokens (inclusive)?
To delete lines containing # and empty lines or lines containing only whitespace, use
sed -n '/{/,/}/p' input | sed '/#/d' | sed '/^[ ]*$/d'
(the bracket expression in the last sed command contains a space and a tab)
update
If empty lines are just empty lines (no ws), the above can be shortened to
sed -e '/#/d' -e '/^$/d' input
update2
To check if the pattern tags {... is present in file, use
$ tr -d '\n' < input | grep -o 'tags\s*{[^}]*}'
tags { hosttags = 1# this is a comment}
The tr part above removes all newlines, i.e. turns everything into one single line (this works well if the file isn't too large); grep then searches for the tags pattern and outputs all matches.
The return code from grep will be 0 if the pattern was found, 1 if not.
The return code is stored in the variable $?. Alternatively, pipe the above to wc -l to get the number of matches found.
update3
Regex for searching for tags { hosttags=1 } with any amount of whitespace anywhere:
'tags\s*{\s*hosttags\s*=\s*1[^}]*}'
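The pieces above can be combined into a yes/no check that does not depend on grep -o or \s, neither of which is part of POSIX grep. This is a sketch under stated assumptions: has_hosttags is a hypothetical name, every # starts a comment (so a # inside a quoted value would be mishandled), and hosttags = 1 is the only setting inside the tags block.

```shell
# has_hosttags is a hypothetical helper: exit status 0 if the block is present.
has_hosttags() {
  # Drop comments first so they are not glued onto settings, then flatten
  # the input to one line and test it with a POSIX extended regex.
  sed 's/#.*//' "$@" | tr '\n' ' ' |
    grep -Eq 'tags[[:space:]]*\{[[:space:]]*hosttags[[:space:]]*=[[:space:]]*1[[:space:]]*\}'
}

if printf 'tags\n{\n  # enable host tags\n  hosttags = 1\n}\n' | has_hosttags; then
  echo "hosttags enabled"
fi
```

Because grep -q prints nothing, the result is read from the exit status, for example in an if statement as shown.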

try this line:
awk '/^\s*#|^\s*$/{next}1' /etc/lvm/lvm.conf

One could try preprocessing the file first: remove comments and empty lines, and insert an empty line after each closing curly brace so that the second awk can process the file in paragraph mode.
awk 'NF && $1!~/^#/{print; if(/}/) print x}' file | awk '/pattern/' RS=
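With the /pattern/ placeholder filled in, the two-stage pipeline might look like the sketch below. find_tags_block is a hypothetical name, the regex is one possible choice that assumes hosttags = 1 is the only setting in the block, and the braces are written as bracket expressions so that they are taken literally by any awk.

```shell
# find_tags_block is a hypothetical helper that prints the record containing
# the tags { hosttags = 1 } block, if there is one.
find_tags_block() {
  # Stage 1: drop blank and comment lines; emit a blank line after each "}".
  awk 'NF && $1 !~ /^#/ { print; if (/}/) print "" }' "$@" |
    # Stage 2: RS= selects paragraph mode, so each record ends at a "}" line.
    awk -v RS= '/tags[ \t\n]*[{][ \t\n]*hosttags[ \t\n]*=[ \t\n]*1[ \t\n]*[}]/'
}

printf '# comment\ntags\n{\n\n  hosttags = 1\n}\nbar {\n  x = 2\n}\n' | find_tags_block
```

A record starts after the previous closing brace, so the printed text can include unrelated settings that precede the tags block.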