Reading line by line in perl

Suppose I have a text file with the following format
#ATDGGSGDTSG
NTCCCCC
+
#nddhdhnadn
#ATDGGSGDTSG
NTCCCCC
+
nddhdhnadn
Now it's a repeating pattern of 4 lines, and each time I want to print only the 2nd line, i.e. the line after the line starting with "#" (the 2nd line, 6th line, etc.).
How can I do it?

perl -ne 'print if $b and !/^#/; $b=/^#/' file

With awk:
$ awk 'NR%4==2' a
NTCCCCC
NTCCCCC
NR stands for number of record, which in this case is the line number. NR%4==2 then selects every line whose number leaves a remainder of 2 when divided by 4.
Update on your comment
And what if I want the output to be ">" on one line, then "NTCCCCC" on the next, then ">" again, then "NTCCCCC", i.e. I want to add ">" before that line while redirecting the output.
This way, for example:
$ awk 'NR%4==2 {print ">"; print $0}' a
>
NTCCCCC
>
NTCCCCC
Another example:
$ seq 30 | awk 'NR%4==2'
2
6
10
14
18
22
26
30
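Since the question asks about perl, a hedged perl equivalent of the same modulo idea ($. is perl's current input line number):
perl -ne 'print if $. % 4 == 2' file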

awk '/^\#/{getline;print}' your_file

You could have a variable like $printNextLine and loop over all your input, setting it to 1 whenever you see a line starting with "#", and printing the current line while setting the variable back to 0 if it is 1.
Not as effective and short as the other answers, but maybe more intuitive for someone new to perl.
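A minimal sketch of that idea (the variable name is illustrative; like the one-liner above it also skips candidate lines that themselves begin with "#", since the first record's quality line starts with "#"):
#!/usr/bin/perl
# Flag-based version of the approach described above.
use strict;
use warnings;

my $printNextLine = 0;
while (my $line = <>) {
    print $line if $printNextLine and $line !~ /^#/;   # the line right after a "#" line
    $printNextLine = ($line =~ /^#/) ? 1 : 0;          # remember whether this line started with "#"
}
Run it as perl script.pl file.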

awk '/^\#/{getline;print}' file

Related

How can i use sed to exclude patterns when joining lines together?

I am trying to use sed to look for lines that start with '1' and join them with the following line, while ignoring lines that start with '1.'
My source file looks like this:
name cat
1
7.75
2
1.27
X
5.10
The desired output is:
name cat
1 7.75
2
1.27
X
5.10
I have a command that looks for lines that start with 1 and joins them with the following line; however, I also have lines starting with 1. which I want to ignore. I tried the following sed command to ignore the decimals, but it does not work.
The command I am using is:
sed '/^\<1\>/N;s/\n/ /'
but it gives this output:
name cat
1 7.75
2
1.27 X
5.10
How can I join lines starting with '1' with the following line, while ignoring lines that start with 1.* ?
Edit:
I only want to join lines that contain '1' (nothing else on the line) with the following line.
Some lines start with a float, e.g. 1.2; I want to ignore these so the next line is not appended to them.
sed '/^1/{/^1\./!N;s/\n/ /}'
If a line starts with a 1, then if it does not start with 1., append the next line. Then replace the newline with a space.
Or just:
sed '/^1\([^\.]\|$\)/N;s/\n/ /'
# same without `\(\|\)`
sed '/^1[^\.]/N;/^1$/N;s/\n/ /'
If a line starts with a 1 followed by anything other than a dot, or the 1 is at the end of the line, then append the next line. Then replace the newline with a space.
I only want to join lines that contain '1' (nothing else on the line) with the following line
So just match the 1.
sed '/^1$/N;s/\n/ /'
Maybe you want to just match 1 followed by any whitespace?
sed '/^1[[:space:]]*$/N;s/\n/ /'
Or by spaces only?
sed '/^1 *$/N;s/\n/ /'
The Sed - An Introduction and Tutorial by Bruce Barnett is a great place to learn how to use sed. To learn regexes, I recommend playing with regex crosswords, they let you learn regexes fast and with fun.
You can begin your sed script by starting a new cycle when a line begins with 1.:
#!/bin/sed -f
/^1\./n # don't change 1.x
/^1\b/N # \b is GNU sed word-boundary
s/\n/ /
Thus, only lines not beginning 1. get the following line appended.
Example output:
name cat
1 7.75
2
1.27
X
5.10
According to later comments on the question, it seems you only want to join lines containing 1 and optional trailing spaces, which makes the script much simpler:
#!/bin/sed -f
/^1[[:space:]]*$/N # match the whole line
y/\n/ /
Could you please try the following.
awk '
$0==1{
prev=$0
next
}
prev{
$0=prev OFS $0
prev=""
}
1
END{
if(prev){
print prev
}
}
' Input_file
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program from here.
$0==1{ ##Checking condition: if the current line is exactly 1 then do following.
prev=$0 ##Creating variable prev and set its value to current line value.
next ##next will skip all further statements from here.
}
prev{ ##Checking condition if prev variable is NOT NULL then do following.
$0=prev OFS $0 ##Setting current line value to prev OFS and current line.
prev="" ##Nullifing variable prev here.
}
1 ##1 will print the edited/non-edited line here.
END{ ##Starting END block for this awk program here.
if(prev){ ##Checking condition if variable prev is NOT NULL then do following.
print prev ##Printing variable prev here.
} ##Closing block for if condition here.
} ##Closing END block of this awk program here.
' Input_file ##Mentioning Input_file here.
Output will be as follows.
1 7.75
2
1.27
X
5.10
Using any awk in any shell on every UNIX box:
$ awk '{printf "%s%s", $0, ($1==1"" ? OFS : ORS)}' file
name cat
1 7.75
2
1.27
X
5.10
FYI some (all?) of the sed solutions posted so far rely on non-POSIX functionality, so YMMV: what they actually do depends on which sed you use.

how to delete lines connected with "+" signs with sed

In this example, the "+" sign means the line is joined to (continues) the previous line. So I'd like to delete a specific group of lines that are connected by "+".
For example, I'd like to remove from the 1st line to the 4th line (.groupA ~ + G H I). Please help me on how to do it with sed.
To delete lines starting with .groupA and all consecutive +-prefixed lines, one easy to understand approach is:
sed '/\.groupA/,/^[^+]/ { /\.groupA/d; /\.groupA/!{/^\+/d} }' file
We first select everything between .groupA and the first non +-prefixed line (inclusive), then for that selection of lines, we delete the first line (containing .groupA), and of the remaining lines, we delete all with + prefix.
Note you need to escape regex metacharacters (like . and +) if you want to match them literally.
A slightly more advanced, but more elegant approach (only one use of the starting block pattern) uses a loop to skip the first line of the matched block, and all the following lines that start with +:
sed -n '/\.groupA/ { :a; n; s/^\+//; ta }; p' file
IMHO this is more readily done with awk, but kindly just ignore if that is not an option for you.
So, every time I see a line starting with .groupA, I set a flag d to say I am deleting, and then skip to the next line. If I see a line starting with a + and I am currently deleting, I skip to the next line. If I see anything else, I change the flag to say I am no longer deleting and print the line:
awk '/^\.groupA/ {d=1; next}
/^+/ && d==1 {next}
{d=0; print}' file
Sample Output
** Example **
abcdef ghijkl
.groupB abc def
+ JKL
+ MNO
+ GHI
opqrst vwxyz
You can cast it as a one-liner like this:
awk '/^\.groupA/{d=1; next} d==1 && /^\+/ {next} {d=0;print}' file

Printing lines if a specific column contains values >0

I have a tab-delimited .txt file in this format, containing numerous symbols, numerals and letters:
MUT 124 GET 288478 0 * = 288478 0
MUT 15 GET 514675 0 75MH = 514637 -113
MUT 124 GET 514637 0 75MH = 514675 113
I want to identify all lines that contain a >0 value in the 9th column (i.e. only the 3rd row above would be extracted) and then print column 4 + 9 from any matched lines.
Desired output (two column tab delimited .txt file):
514637 113
Is there a quick way to do this in the terminal/on the command line? If so, how?
I've only just begun to learn awk and perl so all my attempts so far have been nowhere near close. Not sure where to begin!
Easy in Perl
perl -lane 'print "$F[3]\t$F[8]" if $F[8] > 0' < input-file
-l appends a newline to everything you print
-a splits the input into the @F array
-n processes the input line by line
Can be done with the Perl one-liner:
$ perl -anE 'say join "\t", @F[3,8] if $F[8] > 0' data.txt
-n (non-autoprinting) - loop through lines, reading but not printing them
-a (auto-split) - split the input line stored in $_ into the @F array (space is the default separator, change it with -F, e.g. -F:)
-E 'CODE' (execute) - execute 'CODE' enabling feature bundle (like use 5.010) for your version of Perl
See perlrun for more.
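If it helps to see what those switches do, here is a rough, non-authoritative sketch of the same one-liner written out as a standalone script:
#!/usr/bin/perl
# Rough equivalent of: perl -anE 'say join "\t", @F[3,8] if $F[8] > 0'
use strict;
use warnings;
use feature 'say';                  # what -E enables (among other things)

while (my $line = <>) {             # -n: loop over input lines without printing them
    my @F = split ' ', $line;       # -a: autosplit the line into @F on whitespace
    # fields 4 and 9 are $F[3] and $F[8] (0-indexed); the length check just
    # guards against short lines and is not in the original one-liner
    say join "\t", @F[3, 8] if @F > 8 and $F[8] > 0;
}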
awk handles it almost automatically!
awk '$9>0 {print $4,$9}' file
If you need to specify the input and output separator, say:
awk 'BEGIN{FS=OFS="\t"} $9>0 {print $4,$9}' file

How to remove empty lines to one empty line between sentences in text files?

I have a text file with many empty lines between sentences. I used sed, gawk, and grep, but they don't work. :( What can I do? Thanks.
Myfile:        Desired file:
a              a
b              b
c              c
.              .

d              d
e              e
f              f
g              g
.              .

               h
               i
h              j
i              k
j              .
k
.
You can use awk for this:
awk 'BEGIN{prev="x"}
/^$/ {if (prev==""){next}}
{prev=$0;print}' inputFile
or the compressed one liner:
awk 'BEGIN{p="x"}/^$/{if(p==""){next}}{p=$0;print}' inFl
This is a simple state machine that collapses multi-blank-lines into a single one.
The basic idea is this. First, set the previous line to be non-empty.
Then, for every line in the file, if it and the previous one are blank, just throw it away.
Otherwise, set the previous line to that value, print the line, and carry on.
Sample transcript, the following command:
$ echo '1
2


3
4



5
6
7


8
9
10' | awk 'BEGIN{p="x"}/^$/{if(p==""){next}}{p=$0;print}'
outputs:
1
2

3
4

5
6
7

8
9
10
Keep in mind that this is for truly blank lines (no content). If you're trying to collapse lines that have an arbitrary number of spaces or tabs, that will be a little trickier.
In that case, you could pipe the file through something like:
sed 's/^\s*$//'
to ensure lines with just whitespace become truly empty.
In other words, something like:
sed 's/^\s*$//' infile | awk 'my previous awk command'
To suppress repeated empty output lines with GNU cat:
cat -s file1 > file2
Here's one way using sed:
sed ':a; N; $!ba; s/\n\n\+/\n\n/g' file
Otherwise, if you don't mind a trailing blank line, all you need is:
awk '1' RS= ORS="\n\n" file
The Perl solution is even shorter:
perl -00 -pe '' file
You could also do it like this:
awk -v RS="\0" '{gsub(/\n\n+/,"\n\n");}1' file
Explanation:
RS="\0" Once we set the null character as Record Seperator value, awk will read the whole file as single record.
gsub(/\n\n+/,"\n\n"); this replaces one or more blank lines with a single blank line. Note that \n\n regex matches a blank line along with the previous line's new line character.
Here is another awk approach:
awk -v p=1 'p=="" {p=1;next} 1; {p=$0}' file

How to use 'sed or gawk' to delete a text block until the third line previous the last one

Good day,
I was wondering how to delete a text block like this:
1
2
3
4
5
6
7
8
and delete from the second line until the third line before the last one, to obtain:
1
2
6
7
8
Thanks in advance!!!
BTW, this text block is just an example; the real text blocks I am working on are huge, and each differs in the number of lines.
Getting the number of lines with wc and using awk to print the requested range:
$ awk 'NR<M || NR>N-M' M=3 N="$(wc -l file)" file
1
2
6
7
8
This allows you to easily change the range by just changing the value of M.
This might work for you (GNU sed):
sed '3,${:a;$!{N;s/\n/&/3;Ta;D}}' file
or if you prefer:
sed '1,2b;:a;$!{N;s/\n/&/3;Ta;D}' file
These always print the first two lines, then build a running window of three lines.
Unless the end of file is reached, the first line is popped off the window and deleted. At the end of the file, the remaining 3 lines are printed.
Since you mentioned the files are huge and the line counts may differ, I would suggest this awk one-liner:
awk 'NR<3{print;next}{delete a[NR-3];a[NR]=$0}END{for(x=NR-2;x<=NR;x++)print a[x]}' file
It processes the input file only once, without pre-calculating the total number of lines.
It stores minimal data in memory: at any point during processing only 3 lines are kept.
If you want to change the filtering criteria, for example removing from line x to the y-th line from the end, you simply change the offsets in the one-liner (see the parameterised sketch after the test below).
add a test:
kent$ seq 8|awk 'NR<3{print;next}{delete a[NR-3];a[NR]=$0}END{for(x=NR-2;x<=NR;x++)print a[x]}'
1
2
6
7
8
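To change the filtering criteria as mentioned above, here is a hedged sketch of the same idea with the two offsets pulled out as variables (top = lines always printed at the start, bottom = lines kept at the end; the variable names are mine):
awk -v top=2 -v bottom=3 '
NR<=top {print; next}                  # print the first "top" lines immediately
{a[NR]=$0; delete a[NR-bottom]}        # rolling buffer of the last "bottom" lines
END {for (i=NR-bottom+1; i<=NR; i++) if (i in a) print a[i]}
' file
With top=2 and bottom=3 it reproduces the output above; changing either value moves the deleted block accordingly.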
Using sed:
sed -n '
## Append second line, print first two lines and delete them.
N;
p;
s/^.*$//;
## Read next three lines removing leading newline character inserted
## by the "N" command.
N;
s/^\n//;
N;
:a;
N;
## I will keep three lines in buffer until last line when I will print
## them and exit.
$ { p; q };
## Not last line yet, so remove one line of buffer based in FIFO algorithm.
s/^[^\n]*\n//;
## Goto label "a".
ba
' infile
It yields:
1
2
6
7
8