Find similar patterns in a row using awk or sed - perl

Need some help here.
I have a file with multiple rows, something like below:
'**FSR**+S:KSSSSS+20+14++20120401'FST+S:KSSSSS+2'**FSR**+S:KSSSSS+44+1++20160218'
'**FSR**+S:KSSSSS+20+14++20120401'FST+S:KSSSSS+2'**FSR**+S:KSSSSS+44+1++20160218'
'**FSR**+S:KSSSSS+20+14++20120401'FST+S:KSSSSS+2'**FSR**+S:KSSSSS+44+1++20160218'
'**FSR**+S:KSSSSS+20+14++20120401'
I am trying to get all the segments within a row that begin with FSR,
so the results should look something like this, with a pipe added every time it finds an FSR. Their position in the row is not constant, so I am not able to use cut here; in short, FSR may come at the beginning, in the middle, or at the end of the row.
'**FSR**+S:KSSSSS+20+14++20120401' | **FSR**+S:KSSSSS+44+1++20160218'
'**FSR**+S:KSSSSS+20+14++20120401' | **FSR**+S:KSSSSS+44+1++20160218'
'**FSR**+S:KSSSSS+20+14++20120401' | **FSR**+S:KSSSSS+44+1++20160218'
'**FSR**+S:KSSSSS+20+14++20120401'
Additionally, this is the code I had written in Perl; I am wondering if this could be done in a simpler way.
#!/usr/bin/perl
use strict;
use Data::Dumper;
my $filename = $ARGV[0];
chomp($filename);
open(FILE,$filename);
my ($FSR);
while(my $data = <FILE>) {
#print $data;
($FSR) = "";
if($data =~ /('FSR.*?)(.*?)(\')/) {
$FSR = "$1,$2";
}
print "$FSR\n";
}
close(FILE);
Please help
Thanks
Sandy

EDITED:
The following one-liner would seem to address your problem as described:
$ sed -i -e 's/FSR/ | FSR/g' YOURFILE
Is there something more to this or is that what you were looking for?
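If only the FSR segments are wanted and the others (such as FST) should be dropped, a plain substitution is not enough. Below is a minimal sketch, assuming the ** markers in the sample are just emphasis, that every segment ends with a single quote, and that losing the quotes in the output is acceptable. The function name fsr_segments is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: keep only the quote-delimited segments that start with
# FSR and join them with " | ". The quotes themselves are dropped.
fsr_segments() {
    awk -F"'" '{
        out = ""
        for (i = 1; i <= NF; i++)
            if ($i ~ /^FSR/)
                out = (out == "" ? "" : out " | ") $i
        print out
    }' "$@"
}

# Demo on one row from the question (with the ** emphasis removed).
printf '%s\n' "'FSR+S:KSSSSS+20+14++20120401'FST+S:KSSSSS+2'FSR+S:KSSSSS+44+1++20160218'" |
    fsr_segments
```

Splitting on the quote character means the position of FSR within the row does not matter, which is the part cut could not handle.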

Related

Extract preceding and trailing characters to a matched string from file in awk

I have a large string file seq.txt of letters, unwrapped, with over 200,000 characters. No spaces, numbers etc, just a-z.
I have a second file search.txt which has lines of 50 unique letters which will match once in seq.txt. There are 4000 patterns to match.
I want to be able to find each of the patterns (lines in file search.txt), and then get the 100 characters before and 100 characters after the pattern match.
I have a script which uses grep and works, but this runs very slowly, only does the first 100 characters, and is written out with echo. I am not knowledgeable enough in awk or perl to interpret scripts online that may be applicable, so I am hoping someone here is!
cat search.txt | while read p; do echo "grep -zoP '.{0,100}$p' seq.txt | sed G"; done > do.grep
Easier example with desired output:
>head seq.txt
abcdefghijklmnopqrstuvwxyz
>head search.txt
fgh
pqr
uvw
>head desiredoutput.txt
cdefghijk
mnopqrstu
rstuvwxyz
Best outcome would be a tab separated file of the 100 characters before \t matched pattern \t 100 characters after. Thank you in advance!
One way
use warnings;
use strict;
use feature 'say';

my $string;
# Read submitted files line by line (or STDIN if @ARGV is empty)
while (<>) {
    chomp;
    $string = $_;
    last;  # just in case, as we need ONE line
}
# $string = q(abcdefghijklmnopqrstuvwxyz);  # test

my $padding = 3;  # for the given test sample

my @patterns = do {
    my $search_file = 'search.txt';
    open my $fh, '<', $search_file or die "Can't open $search_file: $!";
    <$fh>;
};
chomp @patterns;
# my @patterns = qw(bcd fgh pqr uvw);  # test

foreach my $patt (@patterns) {
    if ( $string =~ m/(.{0,$padding}) ($patt) (.{0,$padding})/x ) {
        say "$1\t$2\t$3";
        # or
        # printf "%-3s\t%3s%3s\n", $1, $2, $3;
    }
}
Run as program.pl seq.txt, or pipe the content of seq.txt to it.†
The pattern .{0,$padding} matches any character (.) up to $padding times (3 above). I used it in case the pattern $patt is found at a position closer to the beginning of the string than $padding (like the first one, bcd, that I added to the example provided in the question). The same goes for the padding after the $patt.
For your problem, set $padding to 100. With the 100-wide "padding" before and after each pattern, when a pattern is found closer to the beginning than 100 characters the desired \t alignment could break, if the position falls short of 100 by more than the tab width (typically 8).
That's what the line with the formatted print (printf) is for, to ensure the width of each field regardless of the length of the string being printed. (It is commented out since we are told that no pattern ever gets into the first or last 100 chars.)
If there is indeed never a chance that a matched pattern breaches the first or the last 100 positions then the regex can be simplified to
/(.{$padding}) ($patt) (.{$padding})/x
Note that if a $patt is within the first/last $padding chars then this just won't match.
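The difference between the flexible and the strict padding can be tried out quickly from the shell. This is a small sketch using grep -oE rather than Perl, on the toy sequence from the question; the function names and the pad variable are invented for the illustration.

```shell
#!/bin/sh
seq='abcdefghijklmnopqrstuvwxyz'
pad=3

# Flexible context: up to $pad characters on each side.
context_flexible() {
    printf '%s\n' "$seq" | grep -oE ".{0,$pad}$1.{0,$pad}"
}

# Strict context: exactly $pad characters on each side, or no match at all.
context_strict() {
    printf '%s\n' "$seq" | grep -oE ".{$pad}$1.{$pad}"
}

context_flexible bcd    # bcd has only one character before it
context_strict bcd || echo "no match for bcd with strict padding"
context_strict fgh
```

With the flexible form bcd is still reported, just with a shorter left context, while the strict form silently skips it.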
The program starts the regex engine for each of @patterns, which in principle may raise performance issues (not for one run with the tiny number of 4000 patterns, but such requirements tend to change and generally grow). But this is by far the simplest way to go since
we have no clue how the patterns may be distributed in the string, and
one match may be inside the 100-char buffer of another (we aren't told otherwise)
If there is a performance problem with this approach please update.
† The input (and output) of the program can be organized in a better way using named command-line arguments via Getopt::Long, for an invocation like
program.pl --sequence seq.txt --search search.txt --padding 100
where each argument may be optional here, with defaults set in the file, and argument names may be shortened and/or given additional names, etc. Let me know if that is of interest.
One in awk. -v b=3 is the before-context length, -v a=3 is the after-context length and -v n=3 is the match length, which is always constant. It hashes every length-n substring of seq.txt, together with its context, into memory, so memory use depends on the size of seq.txt and you might want to follow the consumption with top. For example: abcdefghij -> s["def"]="abcdefghi", s["efg"]="bcdefghij", etc.
$ awk -v b=3 -v a=3 -v n=3 '
NR==FNR {
e=length()-(n+a-1)
for(i=1;i<=e;i++) {
k=substr($0,(i+b),n)
s[k]=s[k] (s[k]==""?"":ORS) substr($0,i,(b+n+a))
}
next
}
($0 in s) {
print s[$0]
}' seq.txt search.txt
Output:
cdefghijk
mnopqrstu
rstuvwxyz
You can tell grep to search for all the patterns in one go.
sed 's/.*/.{0,100}&.{0,100}/' search.txt |
grep -zoEf - seq.txt |
sed G >do.grep
4000 patterns should be easy peasy, though if you get to hundreds of thousands, maybe you will want to optimize.
There is no Perl regex here, so I switched from the nonstandard grep -P to the POSIX-compatible and probably more efficient grep -E.
The surrounding context will consume any text it prints, so any match within 100 characters from the previous one will not be printed.
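The consumption effect is easy to see on the small example from the question. The sketch below uses a padding of 3 instead of 100 and drops -z, since the toy file is a single ordinary line; the function name one_pass and the temporary directory are only there for the demonstration.

```shell
#!/bin/sh
# Throwaway copies of the small example from the question.
dir=$(mktemp -d)
printf '%s\n' 'abcdefghijklmnopqrstuvwxyz' > "$dir/seq.txt"
printf '%s\n' fgh pqr uvw > "$dir/search.txt"

# Wrap every search line in optional context and hand all of them to one grep.
one_pass() {
    sed 's/.*/.{0,3}&.{0,3}/' "$1" | grep -oEf - "$2"
}

one_pass "$dir/search.txt" "$dir/seq.txt"
rm -rf "$dir"
```

On this input only cdefghijk and mnopqrstu should come out: the match around pqr already swallows the u, so uvw can no longer be found in what remains.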
You can try following approach to your problem:
load string input data
load into an array patterns
loop through each pattern and look for it in the string
form an array from found matches
loop through matches array and print result
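NOTE: the code is not tested due to the absence of input data
use strict;
use warnings;
use feature 'say';

my $fname_s = 'seq.txt';
my $fname_p = 'search.txt';

# Slurp the whole sequence file into one string
open my $fh_s, '<', $fname_s
    or die "Couldn't open $fname_s: $!";
my $data = do { local $/; <$fh_s> };
close $fh_s;

# Read the patterns, one per line
open my $fh_p, '<', $fname_p
    or die "Couldn't open $fname_p: $!";
my @patterns = <$fh_p>;
close $fh_p;
chomp @patterns;

for ( @patterns ) {
    # Collect every match of the pattern with 100 characters of context on each side
    my @found = $data =~ /(.{100}$_.{100})/g;
    # Split each hit into before / pattern / after (patterns are 50 characters long)
    s/(.{100})(.{50})(.{100})/$1 $2 $3/ && say for @found;
}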
Test code for provided test data (added later)
use strict;
use warnings;
use feature 'say';

my @pat = qw/fgh pqr uvw/;
my $data = do { local $/; <DATA> };

for ( @pat ) {
    say $1 if $data =~ /(.{3}$_.{3})/;
}
__DATA__
abcdefghijklmnopqrstuvwxyz
Output
cdefghijk
mnopqrstu
rstuvwxyz

How do I extract lines between two strings

I am an absolute beginner in Perl and I am trying to extract lines of text between two strings on different lines, but without success. It looks like I'm missing something in my code. The code should print out the file name and the found strings. Do you have any idea where the problem could be? Many thanks indeed for your help or advice. Here is the example:
*****************
example:
START
new line 1
new line 2
new line 3
END
*****************
and my script:
use strict;
use warnings;
my $command0 = "";
opendir (DIR, "C:/Users/input/") or die "$!";
my @files = readdir DIR;
close DIR;
splice (@files,0,2);
open(MYOUTFILE, ">>output/output.txt");
foreach my $file (@files) {
open (CHECKBOOK, "input/$file")|| die "$!";
while ($record = <CHECKBOOK>) {
if (/\bstart\..\/bend\b/) {
print MYOUTFILE "$file;$_\n";
}
}
close(CHECKBOOK);
$command0 = "";
}
close(MYOUTFILE);
I suppose that you are trying to use a flip-flop here, which might work well for your input, but you've written it wrong:
if (/\bstart\..\/bend\b/) {
A flip-flop (the range operator) takes two operands, separated by either .. or .... What you want is two regexes joined with ..:
if (/\bSTART\b/ .. /\bEND\b/)
Of course, you also want to match the case (upper), or use the /i modifier to ignore case. You might even want to use beginning of line anchor ^ to only match at the beginning of a line, e.g.:
if (/^START\b/ .. /^END\b/)
You should also know that your entire program can be replaced with a one-liner, such as
perl -ne 'print if /^START\b/ .. /^END\b/' input/*
Alas, this only works as written on Linux, where the shell expands the glob. The cmd shell in Windows does not glob, so you must do that manually:
perl -ne "BEGIN { @ARGV = map glob, @ARGV }; print if /^START\b/ .. /^END\b/" input/*
If you are having troubles with the whole file printing no matter what you do, I think the problem lies with your input file. So take a moment to study it and make sure it is what you think it is, for example:
perl -MData::Dumper -ne"$Data::Dumper::Useqq = 1; print Dumper $_;" file.txt
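The same range idea exists outside Perl. As a rough sketch, awk's pattern range /^START/,/^END/ behaves much like the flip-flop and can also print the file name the question asked for. The function name print_blocks and the sample file are invented here, and this is awk rather than the Perl flip-flop itself.

```shell
#!/bin/sh
# Print "filename;line" for every line from a START line through the next END line.
print_blocks() {
    awk '/^START/,/^END/ { print FILENAME ";" $0 }' "$@"
}

# Demo on a throwaway file (hypothetical name and content).
dir=$(mktemp -d)
printf '%s\n' 'header' 'START' 'new line 1' 'new line 2' 'END' 'footer' > "$dir/sample.txt"
(cd "$dir" && print_blocks sample.txt)
rm -rf "$dir"
```

Like the Perl version, the match is case sensitive, so START and END have to appear in upper case at the beginning of a line.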
If you're matching a multi-line string, you might need to tell the regexp about it:
if (/\bstart\..\/bend\b/s) {
note the s after the regex.
Perldoc says:
s
Treat string as single line. That is, change "." to match any
character whatsoever, even a newline, which normally it would not
match.

Perl: Delete multiple lines from text file having a specific string

I have a text file having data in below mentioned format..
#rectype='ABC' #recname='123' #rec_id='1K2j' etc...
#rectype='DEF' #recname='matin' #rec_id='458i' etc...
#rectype='ABC' #recname='John' #rec_id='lom0' etc...
#rectype='GHI' #recname='Kalme, #rec_id='pl90' etc...
#rectype='KLM' #recname='Kitty' #rec_id='987k' etc...
#rectype='ABC' #recname='OMR' #rec_id='lo09' etc...
Now, I have to delete all the lines having #rectype='ABC'. There are multiple lines of this kind in the input file. It's kind of urgent, and as I am not a Perl coder, I am finding it difficult to figure out the way.
Please suggest!!!
NOTE: I need to make changes in the input file only. I don't need to create a separate output file.
You don't need to do it in Perl. You can use the grep tool.
grep -v "#rectype='ABC'" input_file > output_file
grep -v means "Print every line that does not match this expression."
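Since the question asks for the input file itself to be changed, the grep approach can be adapted by going through a temporary file and copying the result back. This is only a sketch: the function name delete_matching_lines is hypothetical, and -F is used on the assumption that the text should be matched literally.

```shell
#!/bin/sh
# Hypothetical helper: remove every line containing a fixed string from a file,
# rewriting the file in place via a temporary copy.
delete_matching_lines() {
    pattern=$1
    file=$2
    tmp=$(mktemp) || return 1
    grep -vF -- "$pattern" "$file" > "$tmp"
    # grep exits 0 or 1 on success (1 only means nothing was printed); 2 or more is an error
    if [ $? -le 1 ]; then
        cat "$tmp" > "$file"
        rm -f "$tmp"
    else
        rm -f "$tmp"
        return 1
    fi
}

# Demo on a throwaway file.
demo=$(mktemp)
printf '%s\n' "#rectype='ABC' #recname='123'" "#rectype='DEF' #recname='matin'" > "$demo"
delete_matching_lines "#rectype='ABC'" "$demo"
cat "$demo"
rm -f "$demo"
```

Copying back with cat instead of mv keeps the original file's permissions and ownership.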
perl -i -ne 'print if !/\#rectype = \047ABC\047/x' text_file
#!/usr/bin/perl
use warnings;
use strict;
use File::Slurp;
my $output = 'output.txt';
open my $outfile, '>', $output or die "Can't write to $output: $!";
my @array = read_file('input.txt');
for (@array){
next if ($_ =~ /^\#rectype='ABC'/);
print $outfile $_ ;
}
Output (saved to 'output.txt'):
#rectype='DEF' #recname='matin' #rec_id='458i' etc...
#rectype='GHI' #recname='Kalme, #rec_id='pl90' etc...
#rectype='KLM' #recname='Kitty' #rec_id='987k' etc...

How to grep a variable which stores a full text file and print matching lines

Hi, I have been trying to run code where I use a variable $logs to hold all my Linux logs.
Now I want to grep the variable for a pattern and print the whole line for every line that contains the pattern.
Anyway, here is my code.
my $logs = $ssh->exec("cat system_logs");
my $search = "pattern";
if(grep($search,$logs))
{
# this is where i want to print the lines matched.
# I want to print the whole line what command to use?
}
any help is greatly appreciated.
Try this:
foreach (grep(/$search/, split(/\n/, $logs))) {
print $_."\n";
}

Reading between the lines (perl or awk)

Hey all, sorry for posting this here I could not find an answer anywhere and my solution is not working. I have a log file being written in the following fashion (don't ask):
=============================
04-12-2011 11:37:10 SOMETHING_GOES_HERE
Variable1
Something Goes Here
=============================
04-12-2011 11:37:20 SOMETHING_GOES_HERE
Variable2
Anything different may be here
=============================
04-12-2011 11:37:30 SOMETHING_GOES_HERE
Variable3
is altogether different here
=============================
What I'd like to do (either in perl or awk as this is an RTOS) is:
Take a look at the file, if Variable1 exists, then start at Variable2 and print everything between the equal signs:
E.g.
=============================
04-12-2011 11:37:10 SOMETHING_GOES_HERE
Variable1
Mary had a little lamb
=============================
04-12-2011 11:37:20 SOMETHING_GOES_HERE
Variable2
The cow jumped over the moon
=============================
awk '/Mary had/{getline;getline;getline;print}'
will only print
04-12-2011 11:37:20 SOMETHING_GOES_HERE
but what I need is everything between the equal signs. I tried butchering up a perl script which isn't working either. Any thoughts?
Alright, this worked (sort of)
#!perl -w
use strict;
my $LOGFILE = "/home/mydir/MyTestFile";
open my $fh, "<$LOGFILE" or die("could not open log file: $!");
my $in = 0;
while(<$fh>)
{
$in = 1 if /Variable1/i;
print if($in);
$in = 0 if /Variable2/i;
}
In the sense that a lot was stripped out. Now another q I have is selective printing a-la awk. Typically, I can get the line before something using:
echo "
test
hello
foo
bar" | awk '/foo/{print x};{x=$0}'
This will print hello (the line just before foo); however, I haven't found a way to get the word test (it will always be a different word, but the word foo will always remain). Any takers? (By the way, many thanks in advance.)
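One way to reach further back is to remember the last two lines instead of one. A minimal sketch, assuming the wanted word always sits exactly two lines above foo; the function and variable names are made up:

```shell
#!/bin/sh
# Print the line that sits two lines above each line matching foo.
two_before_foo() {
    awk '/foo/ { print prev2 } { prev2 = prev1; prev1 = $0 }' "$@"
}

# Same input as the echo in the question: an empty line, then test, hello, foo, bar.
printf '%s\n' '' test hello foo bar | two_before_foo
```

The second rule runs on every line and shifts the history along, so prev2 always holds the line from two records earlier when foo is seen.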
local $/ = "\n=============================\n";
while (<>) {
chomp;
...
}
Alternative:
my $rec = '';
while (<>) {
if (!/^=============================$/) {
$rec .= $_;
next;
}
...
$rec = '';
}
See the Perl documentation for $/ and chomp.
I'm not totally clear on whether "Variable1/2" etc. are constant strings or actually varying quantities, but does this work:
if [ $(grep Variable1 "$file" | wc -l) -gt 0 ]; then
sed -n '/Variable2/,/=======/p' "$file"
fi