Remove a newline before a specific character in a txt file (Perl)

I have a text file containing several lines that follow a three-line pattern. For some reason I can't paste it here, so I'll describe it. The first line looks like this: ">#1M1U7:00204:00340". It can have any numbers after the : but always has a fixed number of characters. The next line looks like this: "_F_48_32.0416666667". It can have any number after the last underscore and can vary in length. The last line in the pattern is a DNA sequence. What I want is to join the first two lines together.
I want a Perl script that can do this for me.

Just chomp every first line of the three-line group:
perl -pe 'chomp if 1 == $. % 3' < input > output
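For comparison, the same idea can be sketched in Python. This is only an illustration of the modulo-3 trick, not part of the answer above; the function name join_header_lines is made up for this sketch, and it assumes the file consists purely of complete three-line groups.

```python
def join_header_lines(lines):
    """Join lines 1 and 2 of every three-line group; leave line 3 untouched.

    `lines` is any iterable of strings that still carry their newlines
    (for example an open file object).
    """
    out = []
    for number, line in enumerate(lines, start=1):
        if number % 3 == 1:
            # First line of a group: drop its newline so line 2 continues it.
            out.append(line.rstrip("\n"))
        else:
            out.append(line)
    return "".join(out)
```

Calling join_header_lines(open("input")) and writing the returned string to the output file would mirror the perl one-liner.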

Related

sed: remove lines containing any three characters in alphabetical order

I'm trying to remove all lines that contain any three consecutive characters in alphabetical order with sed. Is there an easy way to do this instead of writing a bunch of pattern lines?
sed -i '/abc/d
/bcd/d
....
/xyz/d' file.txt
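Based on your attempt, please try the following awk code, which avoids writing out every combination of consecutive letters. IMHO awk will be much more efficient than sed here. Note that, as written, it prints each run of three consecutive letters it finds rather than deleting the line.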
awk '
BEGIN{
FS=""
num=split("a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z",arr1,",")
for(i=1;i<=num;i++){ letters[arr1[i]]=i }
}
{
for(i=1;i<=NF;i++){
if(($i in letters) && ($(i+1) in letters) && ($(i+2) in letters)\
&& (letters[$i]+1==letters[$(i+1)]) && (letters[$i]+2==letters[$(i+2)])\
&& (letters[$(i+1)]+1==letters[$(i+2)])){
print $i $(i+1) $(i+2)
}
}
}
' Input_file
Explanation: a simple, detailed walkthrough of the whole awk program.
Explanation of the BEGIN block:
Set the field separator (FS) to the empty string for all lines, so that each character becomes its own field and three consecutive letters can be compared.
Then use awk's split function to create an array named arr1 holding all the lowercase letters, split on the , delimiter.
Then run a for loop up to the value of num (this could also be written as 26, since the number of letters is fixed), building an array named letters whose indices are the letters and whose values are their positions in the alphabet (e.g. 1 for a).
Explanation of the main block:
Run a for loop over every field of the current line, from the 1st field to NF.
Check the conditions: whether the current field and the next 2 fields are all in the letters array, AND whether their positions are consecutive.
If all conditions are met, print the current field and the next 2 fields (i.e. the 3 letters).
This might work for you (GNU sed):
sed -En '1{x;s/^/abcdefghijklmnopqrstuvwxyz/;x};G;/(...).*\n.*\1/!P' file
On the first line, introduce a literal alphabet in the hold space.
On each line, append the alphabet and, using a three-character back reference, compare it to the alphabet.
If there is a match, skip the line; otherwise, print the first line only.
N.B. The -n option turns off implicit printing, so a line is printed only when the match fails.
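The "compare against a literal alphabet" trick used by the sed answer can also be sketched in Python, which may make the logic easier to follow. This is an illustrative sketch only; the names has_alphabetical_run and drop_lines_with_runs are invented here, and it assumes lowercase ASCII letters as in the question.

```python
import string

ALPHABET = string.ascii_lowercase  # "abcdefghijklmnopqrstuvwxyz"


def has_alphabetical_run(line, length=3):
    """True if `line` contains `length` consecutive letters such as 'abc'.

    Every substring of the alphabet is by construction a run of consecutive
    letters, so it is enough to test each window of the line against it.
    """
    return any(line[i:i + length] in ALPHABET
               for i in range(len(line) - length + 1))


def drop_lines_with_runs(lines):
    """Keep only the lines that contain no such run."""
    return [line for line in lines if not has_alphabetical_run(line)]
```

The design choice is the same as in the sed script: one substring test against the alphabet replaces the 24 separate /abc/d ... /xyz/d patterns.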

How to find patterns across multiple lines using perl

I want to grep some strings spread across multiple lines within some begin and end pattern.
Example:
MediaHelper->fetchStrings( names => [ //Here new line may or many not be
**'ubp-firstrun_heading',
'firstrun_text',
'_firstrun-or-start_search',
'installed'** //may end here also );
]);
Using perl or grep, how can I get the list of 4 strings here? The begin pattern is MediaHelper->fetchStrings(names => [ and the end pattern is );
Or are there any other suggestions using commands like grep, sed or awk?
Try this:
sed -n '/MediaHelper->fetchStrings( names =>/,/);/ p' <yourfile>
Or, if you want to skip the delimiting lines, this:
sed -n '/MediaHelper->fetchStrings( names =>/,/);/ {/MediaHelper->fetchStrings( names =>/b; /^);/b; p}' <yourfile>
If I understand your question, you need to match all strings in all lines (and not just the MediaHelper thing).
If this is the case, then sed is the right tool, because it is by default line-oriented.
In our case, if you want to match the string in every line:
sed "s/.*\('.*'\).*/\1/" <your_file>
Hope it helps
Edit: To be more descriptive, first we need to match the whole line (that's the first and the last .*) and then we enclose in parentheses the part of the line we want to print, which in our case is everything inside single quotes. The \1 before the last delimiter denotes that we want to print the first (in our case it is also the last) parenthesized group.
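Just process the file in slurp mode instead of line by line:
perl -0777 -ne 'print $1 while m{MediaHelper->fetchStrings\(\s*names\s*=>\s*\[(.*?)\]}gs' file
Explanation:
The opening parenthesis is escaped so that it matches literally, and the s modifier lets . match newlines so the capture can span several lines.
Switches:
-0777: Slurp mode instead of line by line
-n: Wraps the code in a while(<>){..} loop; in slurp mode each iteration reads one whole input file.
-e: Tells perl to execute the code on command line.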
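The slurp-then-match approach can be sketched in Python as well, going one step further and pulling out the individual quoted strings. This is a hedged illustration, not the answerer's code: the function name fetch_string_names and the two compiled patterns are assumptions made for this sketch.

```python
import re

# Everything between "fetchStrings( names => [" and the first "]".
# re.DOTALL plays the role of perl's /s so that . also matches newlines.
BLOCK = re.compile(
    r"MediaHelper->fetchStrings\(\s*names\s*=>\s*\[(.*?)\]", re.DOTALL
)
# A single-quoted string inside such a block.
QUOTED = re.compile(r"'([^']*)'")


def fetch_string_names(text):
    """Return every single-quoted string found inside a fetchStrings block."""
    names = []
    for block in BLOCK.finditer(text):
        names.extend(QUOTED.findall(block.group(1)))
    return names
```

Reading the whole file with open(path).read() and passing the result to fetch_string_names corresponds to perl's -0777 slurp mode.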

Decipher this sed one-liner

I want to remove duplicate lines from a file, without sorting the file.
Example of why this is useful to me: removing duplicates from Bash's $HISTFILE without changing the chronological order.
This page has a one-liner to do that:
http://sed.sourceforge.net/sed1line.txt
Here's the one-liner:
sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P'
I asked a sysadmin and he told me "you just copy the script and it works, don't go philosophising about this", which is fine, so I am asking here as it's a developer forum and I trust people might be like me, suspicious about using things they don't understand:
Could you kindly provide a pseudo-code explanation of what that "black magic" script is doing, please? I tried parsing the incantation in my head but especially the central part is quite hard.
I'll note that this script does not appear to work with my copy of sed (GNU sed 4.1.5) in my current locale. If I run it with LC_ALL=C it works fine.
Here's an annotated version of the script. sed basically has two registers, one is called "pattern space" and is used for (basically) the current input line, and the other, the "hold space", can be used by scripts for temporary storage etc.
sed -n ' # -n: by default, do not print
G # Append hold space to current input line
s/\n/&&/ # Add empty line after current input line
/^\([ -~]*\n\).*\n\1/d # If the current input line is repeated in the hold space, skip this line
# Otherwise, clean up for storing all input in hold space:
s/\n// # Remove empty line after current input line
h # Copy entire pattern space back to hold space
P # Print current input line'
I guess the adding and removal of an empty line is there so that the central pattern can be kept relatively simple (you can count on there being a newline after the current line and before the beginning of the matching line).
So basically, the entire input file (sans duplicates) is kept (in reverse order) in the hold space, and if the first line of the pattern space (the current input line) is found anywhere in the rest of the pattern space (which was copied from the hold space when the script started processing this line), we skip it and start over.
The regex in the conditional can be further decomposed;
^ # Look at beginning of line (i.e. beginning of pattern space)
\( # This starts group \1
[ -~] # Any printable character (in the C locale)
* # Any number of times
\n # Followed by a newline
\) # End of group \1 -- it contains the current input line
.*\n # Skip any amount of lines as necessary
\1 # Another occurrence of the current input line, with newline and all
If this pattern matches, the script discards the pattern space and starts over with the next input line (d).
You can get it to work independently of locale by changing [ -~] to [[:print:]]
The code doesn't work for me, perhaps due to some locale setting, but this does (the only change is [^\n] in place of [ -~]):
sed -n 'G; s/\n/&&/; /^\([^\n]*\n\).*\n\1/d; s/\n//; h; P'
Let's first translate this by the book (i.e. sed info page), into something perlish.
# The standard sed loop
my $hold = "";
while (my $pattern = <>) {
    chomp $pattern;
    $pattern = "$pattern\n$hold"; # G
    $pattern =~ s/(\n)/$1$1/; # s/\n/&&/
    if ($pattern =~ /^([^\n]*\n).*\n\1/s) { # /…/ (the s flag lets . match newlines, as it does in sed)
        next; # d
    }
    $pattern =~ s/\n//; # s/\n//
    $hold = $pattern; # h
    $pattern =~ /^([^\n]*\n?)/; print $1; # P
}
OK, the basic idea is that the hold space contains all the lines seen so far.
G: At the beginning of each cycle, append that hold space to the current line. Now we have a single string consisting of the current line and all unique lines which preceded it.
s/\n/&&/: Turn the newline which separates them into a double newline, so that we can match adjacent and non-adjacent duplicates the same way; see the next step.
/^\([^\n]*\n\).*\n\1/: Look through the current text for the following: at the beginning of all the lines (^) look for a first line including trailing newline (\([^\n]*\n\)), then anything (.*), then a newline (\n), and then that same first line including newline repeated again (\1). If two adjacent lines are the same, then the .* in the regular expression will match the empty string, but the two \n will still match due to the newline duplication in the preceding step. So basically this asks whether the first line appears again among the other lines.
d: If there is a match, this is a duplicate line. We discard this input, keep the hold space as it is as a buffer of all unique lines seen so far, and continue with the next line of input.
s/\n//: Otherwise, we continue and next turn the double newline back into a single newline.
h: We include the current line in our list of all unique lines.
P: And finally print this new unique line, up to the newline character.
For the actual problem to resolve, here is a simpler solution (at least it looks so) with awk:
awk '!_[$0]++' FILE
In short, _[$0] is a counter of appearances for each unique line. For any line ($0) appearing for the second time or later, _[$0] >= 1, so !_[$0] evaluates to false and the line is not printed; only its first appearance is printed.
See https://gist.github.com/ryenus/5866268 (credit goes to a recent forum I visited.)
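The "remember what has been seen" idea behind both the sed hold-space script and the awk counter can be sketched in a few lines of Python. This is just an illustration under the assumption that all unique lines fit in memory; the name unique_lines is made up for the sketch.

```python
def unique_lines(lines):
    """Yield each line the first time it appears, preserving input order."""
    seen = set()  # plays the role of sed's hold space / awk's _[] array
    for line in lines:
        if line not in seen:
            seen.add(line)
            yield line
```

For example, list(unique_lines(open(histfile))) would give the history entries in chronological order with later repeats dropped, where histfile is whatever path $HISTFILE points to.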

Chopping lines in file for equal length but with new identifiers

I have a file in which every 2nd line has a different length. I want to make these lines equal (every 2nd line of the output should be exactly 10 characters long) but with a new identifier (every odd line).
FILE ->
>ZQMK36301EDYQE
ZHZHHEXZZHHZZHHZZXHHHEHHHZZZHHHZHXZHZ
>ZQMK36301EEMJ9
ZZZXHZHHXHHHEZZEEZZHZZZZXEZ
>ZQMK36301EOEM5
ZXHXHZZHEHHHXZEZHXXXHXHHHHXEHHHZHHHH
desired output ->
>ZQMK36301EDYQE
ZHZHHEXZZH
>ZQMK36301EDYQE#2
HZZHHZZXHH
>ZQMK36301EDYQE#3
HEHHHZZZHH
>ZQMK36301EEMJ9
ZZZXHZHHXH
>ZQMK36301EEMJ9#2
HHEZZEEZZH
>ZQMK36301EOEM5
ZXHXHZZHEH
>ZQMK36301EOEM5#2
HHXZEZHXXX
>ZQMK36301EOEM5#3
HXHHHHXEHH
Here, take the first line, which is the identifier (>ZQMK36301EDYQE); its 2nd line contains 37 characters. This should produce 3 sequences of equal length (i.e. 10), and if fewer than 10 characters remain, that part is thrown away. Each new line of equal length gets an identifier that is the same as that of the sequence it came from, followed by "#" and the number. I want to do this for the whole file. Please help.
Thanks and Best regards,
Vikas
As a one-liner:
perl -nwle '
$i=0;
for my $add (<>=~/.{10}/g) {
printf "%s%s\n%s\n", $_, $i++ ? "#$i":"", $add;
}' inputfile
-n read file line-by-line and store line in $_. -l autochomps the input. We assume first line is header, and second is data. $i is the counter, so it is reset for each new line pair. The for loop list is made on the fly by reading one line <>, then extracting 10-character long strings from it with a regex. Then we just print the stuff, and make sure not to show the zero counter.
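For readers who find the one-liner dense, here is a rough Python sketch of the same procedure. It is an illustration rather than a drop-in replacement; the function name chop_records and the width parameter are assumptions made for this sketch, and like the perl version it assumes the input strictly alternates header and sequence lines.

```python
def chop_records(lines, width=10):
    """Split each sequence into full `width`-character pieces.

    The first piece keeps the original header; later pieces get the header
    followed by "#2", "#3", ... A trailing piece shorter than `width` is
    discarded.
    """
    out = []
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        sequence = next(it, "").rstrip("\n")
        for n in range(len(sequence) // width):
            suffix = "" if n == 0 else f"#{n + 1}"
            out.append(header + suffix)
            out.append(sequence[n * width:(n + 1) * width])
    return out
```

Printing "\n".join(chop_records(open("inputfile"))) would give output in the same layout as the desired output shown in the question.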

SED: Operate on Last seven lines regardless of file length

I would like to operate on the last 7 lines of a file with sed regardless of the filelength.
According to a related question this type of range won't work: $-6,$ {..commands..}
What is the equivalent that will?
Pipe the output of tail -7 into sed.
tail -7 test.txt | sed -e "s/e/WWW/"
You could just switch from sed(1) to ed(1), the commands are about the same. In this case, the command is the same, except with no limitations on address range.
$ cat > fl7.ed
ed - $1 << \eof
1,7s/$/ (one of the first seven lines)/
$-6,$s/$/ (one of the last seven lines)/
w
q
eof
$ sh fl7.ed yourfile
perl -lne 'END{print join$\,@a,"-",@b}push@a,$_ if@a<6;push@b,$_;shift@b if@b>7'
In the END{} block you can do whatever is required; @a contains the first 6, @b the last 7 lines as requested.
This should work for you:
sed '1{N;N;N;N;N};N;$s/foo/bar/g;P;D' inputfile
Explanation:
1{N;N;N;N;N} - when the first line is read, load the pattern space with five more lines (total: 6 at this point)
N - append another line
$s/foo/bar/g - when the last line is read, perform some operation on the entire contents of pattern space (the last seven lines of the file). Operations can be more complex than shown here
P - print the text before the first newline in pattern space
D - delete the text just printed and loop to the beginning of the script (the "append another line" step - the first instruction is skipped since it only applies to the first line in the file)
This might work for you:
sed ':a;1,6{$!N;ba};${s/foo/bar/g;q};N;D' file
Explanation:
Create a loop label. :a
Gather lines 1 to 6 in the pattern space (PS). 1,6{$!N;ba}
If it's the last line, process the PS and quit, therefore printing out the last seven lines. ${s/foo/bar/g;q}
If it's not the last line, append the next line to the PS. N
Delete up to the first newline and begin a new cycle without reading a new line. D
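Both sed answers keep a sliding window of seven lines in the pattern space. The same sliding-window idea can be sketched in Python with a deque; this is only an illustration, and the name edit_last_lines and its transform parameter are invented for the sketch.

```python
from collections import deque


def edit_last_lines(lines, transform, count=7):
    """Yield all lines, applying `transform` only to the last `count` of them.

    A window of at most `count` lines is held back; a line is released
    unchanged as soon as enough later lines have arrived to prove it is not
    among the last `count`.
    """
    window = deque()
    for line in lines:
        window.append(line)
        if len(window) > count:
            yield window.popleft()  # definitely not one of the last `count`
    for line in window:  # end of input: the window is the tail of the file
        yield transform(line)
```

With transform set to lambda s: s.replace("foo", "bar"), this corresponds to the s/foo/bar/g step in the sed scripts above. Files shorter than seven lines are transformed in full.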