Join broken lines with perl/awk

I have a huge file with broken SQL statements like:
PP3697HB ####0
<<<<<<Record has been deleted as per PP3697HB>>>>>>
FROM sys.xtab_ref rc,sys.xtab_sys f,sys.domp ur WHE
RE rc.milf = ur.milf AND rc.molf = f.molf AND ur.dept = 'SWIT'AND ur
.department = 'IND' AND share = '2' AND ur.status = 'DONE' AND f.s
tatus = 'TRUE' AND rc.OPERATOR = '=' AND rc.VALUE = '261366'AND rc.r
unet IN (SELECT milf FROM sys.domp WHERE change = 'OVO' A
ND IND = 75);
I need all these broken lines to be recombined to a single line.
The line should look like:
PP3697HB ####0<<<<<<Record has been deleted as per PP3697HB>>>>>>FROM sys.xtab_ref rc,sys.xtab_sys f,sys.domp ur WHERE rc.milf = ur.milf AND rc.molf = f.molf AND ur.dept = 'SWIT'AND ur.department = 'IND' AND share = '2' AND ur.status = 'DONE' AND f.status = 'TRUE' AND rc.OPERATOR = '=' AND rc.VALUE = '261366'AND rc.runet IN (SELECT milf FROM sys.domp WHERE change = 'OVO' AND IND = 75);
How can I achieve this in perl/awk?
We can say that the start of the line must match ^PP(.*) and the end of the SQL statement must match (.*);$
Let me know if you have difficulty understanding the problem and I will try to explain again.

Try this one-liner:
awk '!/;$/{printf "%s",$0}/;$/{print}' file

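Using tr to remove the newlines and sed to split each SQL statement (tr -d deletes the newlines outright, so words broken across lines are rejoined without a stray space; \n in the sed replacement needs GNU sed):
tr -d '\n' < file | sed 's/;/;\n/g'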

Try this solution in Perl:
#!/usr/bin/perl -w
use strict;
use warnings;
use Data::Dumper;
## The raw string
my $str = "
PP3697HB ####0
<<<<<<Record has been deleted as per PP3697HB>>>>>>
FROM sys.xtab_ref rc,sys.xtab_sys f,sys.domp ur WHE
RE rc.milf = ur.milf AND rc.molf = f.molf AND ur.dept = 'SWIT'AND ur
.department = 'IND' AND share = '2' AND ur.status = 'DONE' AND f.s
tatus = 'TRUE' AND rc.OPERATOR = '=' AND rc.VALUE = '261366'AND rc.r
unet IN (SELECT milf FROM sys.domp WHERE change = 'OVO' A
ND IND = 75);
";
## Split the given string on newlines.
my @lines = split(/\n/, $str);
## Join every element of the resulting array with an empty string.
$str = join("", @lines);
print "$str\n";

Perl solution:
perl -ne 'chomp $last unless /^PP/; print $last; $last = $_ }{ print $last' FILE.SQL

Assuming there are other lines that are not split up like this, and that only the specified lines require re-joining:
awk '
/^PP/ {insql=1}
/;$/ {insql=0}
insql {printf "%s", $0; next}
{print}
' file
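For readers who want the same state-machine idea outside awk, here is a minimal Python sketch. It assumes, as the question states, that a statement starts on a line beginning with PP and ends on a line ending with a semicolon. The function name `join_statements` and its parameters are made up for illustration.

```python
def join_statements(lines, start="PP", end=";"):
    """Rejoin fragments of a record that opens on a line starting with
    `start` and closes on a line ending with `end`; other lines pass through."""
    out = []        # finished output lines
    buf = []        # fragments of the statement currently being rebuilt
    in_sql = False  # True while we are inside a broken statement
    for raw in lines:
        line = raw.rstrip("\n")
        if not in_sql and line.startswith(start):
            in_sql = True
        if in_sql:
            buf.append(line)
            if line.endswith(end):
                out.append("".join(buf))  # join with no separator
                buf, in_sql = [], False
        else:
            out.append(line)
    if buf:  # unterminated statement at end of input
        out.append("".join(buf))
    return out


if __name__ == "__main__":
    import sys
    for joined in join_statements(sys.stdin):
        print(joined)
```

Fragments are joined with an empty string rather than a space because the breaks fall in the middle of words (WHE / RE).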

Related

adding new line to an output file

I'm writing a script that compares a field in each line of one file with the same field in each line of another file, then writes the lines with equal values to a new file. However, the new file contains only the last matching line; the earlier lines are lost. I have searched for a solution but have not found one.
Thank you very much in advance.
Here is roughly what my script looks like:
for ($tmp1 = 1 ; $tmp1 <= $cnt1 ; $tmp1++) {
$line1 = `head -$tmp1 file1| tail -1`;
@str1 = split(/\s/, $line1);
for ($tmp2 = 1 ; $tmp2 <= $cnt2 ; $tmp2++) {
$line2 = `head -$tmp2 file2| tail -1`;
@str2 = split(/\s/, $line2);
open(OUT, ">out");
if ($str1[3] eq $str2[3]) {
print OUT "$line1";
}
}
}

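Opening the file with ">" truncates it each time, so every pass through the loop wipes what was written before. You should open the file before the loop starts, or use open(OUT, ">>out"); to open it in append mode.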

Want to extract the first letter of each word

I have a variable COUNTRY along with variables SUBJID and TREAT, and I want to concatenate them like this: ABC002-123 /NZ/ABC.
Suppose the COUNTRY variable has the value 'New Zealand'. I want to extract the first letter of each word, but when COUNTRY contains only one word I want the first two letters of the value. I would like to know how to simplify the code below, if possible in Perl.
If COUNTW(COUNTRY) GT 1 THEN
CAT_VAR=
UPCASE(SUBJID||"/"||CAT(SUBSTR(SCAN(COUNTRY,1,' '),1,1),
SUBSTR(SCAN(COUNTRY,2,' '),1,1))||"/"||TREAT);

my @COUNTRY = ("New Zealand", "Germany");
# 'NZ', 'GE'
my @two_letters = map {
my @r = /\s/ ? /\b(\w)/g : /(..)/;
uc(join "", @r);
} @COUNTRY;

The SAS Perl Regular Expression solution is to use CALL PRXNEXT along with PRXPOSN or CALL PRXPOSN (or a similar function, if you prefer):
data have;
infile datalines truncover;
input @1 country $20.;
datalines;
New Zealand
Australia
Papua New Guinea
;;;;
run;
data want;
set have;
length country_letter $5.;
prx_1 = prxparse('~(?:\b([a-z])[a-z]*\b)+~io');
length=0;
start=1;
stop = length(country);
position=0;
call prxnext(prx_1,start,stop,country,position,length);
do while (position gt 0);
matchletter = prxposn(prx_1,1,country);
country_letter = cats(country_letter,matchletter);
call prxnext(prx_1,start,stop,country,position,length);
put i= position= start= stop=;
end;
run;

I realize the OP might not be interested in another answer, but for other users browsing this thread and not wanting to use Perl expressions I suggest the following simple solution (for the original COUNTRY variable):
FIRST_LETTERS = compress(propcase(COUNTRY),'','l');
The propcase function capitalizes the first letter of each word and puts the other letters in lower case. The compress function with the 'l' modifier then deletes all lower case letters.
COUNTRY may have any number of words.
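As a cross-check of the rule stated in the question (initials for multi-word names, first two letters for single-word names), here is a minimal Python sketch. The function name `country_code` is made up for illustration. Unlike the propcase/compress approach, which yields a single letter for a one-word country, this follows the two-letter rule from the question.

```python
def country_code(country):
    """Return upper-cased initials for a multi-word name,
    or the first two letters for a single-word name."""
    words = country.split()
    if len(words) > 1:
        return "".join(word[0] for word in words).upper()
    return country.strip()[:2].upper()


if __name__ == "__main__":
    for name in ("New Zealand", "Germany", "Papua New Guinea"):
        print(name, "->", country_code(name))
```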

How about this:
#!/usr/bin/perl
use warnings;
use strict;
my @country = ('New Zealand', 'Germany', 'Tanzania', 'Mozambique', 'Irish Republic');
my ($one_word_letters, $two_word_letters, @initials);
foreach (@country){
if ($_ =~ /\s+/){ # Captures CAPs if 'country' contains a space
my ($first_letter, $second_letter) = ($_ =~ /([A-Z])/g);
my ($two_word_letters) = ($first_letter.$second_letter);
push @initials, $two_word_letters; # Add to array for later
}
else { ($one_word_letters) = ($_ =~ /([A-Z][a-z])/); # If 'country' is only one word long, then capture first two letters (CAP+noncap)
push @initials, $one_word_letters; # Add this to the same array
}
}
foreach (@initials){ # Print contents of the capture array:
print "$_\n";
}
Outputs:
NZ
Ge
Ta
Mo
IR
This should do the job provided there really are no 3 word countries. Easily fixed if there are though...

This should do.
#!/usr/bin/perl
$init = &getInitials($ARGV[0]);
if($init)
{
print $init . "\n";
exit 0;
}
else
{
print "invalid name\n";
exit 1;
}
1;
sub getInitials {
$name = shift;
$name =~ m/(^(\S)\S*?\s+(\S)\S*?$)|(^(\S\S)\S*?$)/ig;
if( defined($1) and $1 ne '' ) {
return uc($2.$3);
} elsif( defined($4) and $4 ne '' ) {
return uc($5);
} else {
return 0;
}
}

Split a perl string with a substring and a space

local_addr = sjcapp [value2]
How do you split this string so that I get 2 values in my array i.e.
array[0] = sjcapp and array[1] = value2.
If I do this
@array = split('local_addr =', $input)
then my array[0] has sjcapp [value2]. I want to be able to separate it into two in my split function itself.
I was trying something like this but it didn't work:
split(/local_addr= \s/, $input)

Untested, but maybe something like this?
@array = ($input =~ /local_addr = (\S+)\s\[(\S+)\]/);
Rather than split, this uses a regex match in list context, which gives you an array of the parts captured in parentheses.
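The same capture-instead-of-split idea can be sketched in Python. The pattern below is an illustrative variant that tolerates flexible whitespace around the equals sign; the names `PATTERN` and `parse_local_addr` are made up for this example.

```python
import re

# Capture the token after "local_addr =" and the text inside the brackets.
PATTERN = re.compile(r"local_addr\s*=\s*(\S+)\s*\[([^\]]+)\]")


def parse_local_addr(line):
    """Return [name, bracketed_value], or an empty list if the line does not match."""
    match = PATTERN.search(line)
    return list(match.groups()) if match else []


if __name__ == "__main__":
    print(parse_local_addr("local_addr = sjcapp [value2]"))
```

Returning an empty list on a non-match mirrors what the Perl match returns in list context when it fails.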

~/ cat data.txt
local_addr = sjcapp [value2]
other_addr = superman [value1492]
euro_addr = overseas [value0]
If the data really is as regularly structured as that, then you can just split on the whitespace. On the command line (see the perlrun(1) manual page) this is easiest with "autosplit" (-a), which creates an array of fields called @F from each input line:
perl -lane 'print "$F[2] $F[3]" ' data.txt
sjcapp [value2]
superman [value1492]
overseas [value0]
In your script you can change the name of the array, and the position of the elements within it, by shift-ing or splice-ing. There is possibly a more elegant way than this, but it works:
perl -lane 'my @array = ($F[2],$F[3]) ; print "$array[0], $array[1]" ' data.txt
Or, without using autosplit, as follows :
perl -lne 'my @arr=split(" ");splice(@arr,0,2); print "$arr[0] $arr[1]"' data.txt

Try:
if ( $input =~ /(=)(.+)(\[)(.+)(\])/ ) {
@array=($2,$4);
}

I would use a regexp rather than a split, since this is clearly a standard format config file line. How you construct your regexp will likely depend on the full line syntax and how flexible you want to be.
if( $input =~ /(\S+)\s*=\s*(\S+)\s*\[\s*(\S+)\s*\]/ ) {
@array = ($2,$3);
}

Renaming names in a file using another file without using loops

I have two files:
(one.txt) looks like this:
>ENST001
(((....)))
(((...)))
>ENST002
(((((((.......))))))
((((...)))
I have about 10000 more ENST entries.
(two.txt) looks like this:
>ENST001 110
>ENST002 59
and so on for the rest of all ENSTs
I basically would like to replace the ENSTs in the (one.txt) by the combination of the two fields in the (two.txt) so the results will look like this:
>ENST001_110
(((....)))
(((...)))
>ENST002_59
(((((((.......))))))
((((...)))
I wrote a MATLAB script to do this, but since it loops over all lines in (two.txt) for every ID, it takes about 6 hours to finish. I think awk, sed, grep, or even Perl could produce the result in a few minutes. This is what I did in MATLAB:
frf = fopen('one.txt', 'r');
frp = fopen('two.txt', 'r');
fw = fopen('result.txt', 'w');
while feof(frf) == 0
line = fgetl(frf);
first_char = line(1);
if strcmp(first_char, '>') == 1 % if the line in one.txt start by > it is the ID
id_fold = strrep(line, '>', ''); % Remove the > symbol
frewind(frp) % Rewind two.txt file after each loop
while feof(frp) == 0
raw = fgetl(frp);
scan = textscan(raw, '%s%s');
id_pos = scan{1}{1};
pos = scan{2}{1};
if strcmp(id_fold, id_pos) == 1 % if both ids are the same
id_new = ['>', id_fold, '_', pos];
fprintf(fw, '%s\n', id_new);
end
end
else
fprintf(fw, '%s\n', line); % if the line doesn't start by > print it to results
end
end

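One way using awk. The FNR == NR condition is true only while the first file on the command line (two.txt) is being read, so that block saves each number in an array keyed by the ID. The second condition applies to the second file (one.txt): when the first field matches a key in the array, the line is rewritten with the number appended.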
awk '
FNR == NR {
data[ $1 ] = $2;
next
}
FNR < NR && data[ $1 ] {
$0 = $1 "_" data[ $1 ]
}
{ print }
' two.txt one.txt
Output:
>ENST001_110
(((....)))
(((...)))
>ENST002_59
(((((((.......))))))
((((...)))
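The technique that makes this fast is a single pass over each file with a hash lookup in between, instead of rescanning two.txt for every header. A minimal Python sketch of that idea follows; the function name `rename_headers` is made up, and the file names in the usage part are taken from the question.

```python
def rename_headers(mapping_lines, data_lines):
    """Append the number from the mapping file to each matching header line."""
    suffix = {}
    for line in mapping_lines:       # first pass: build the lookup table
        fields = line.split()
        if len(fields) >= 2:
            suffix[fields[0]] = fields[1]
    out = []
    for line in data_lines:          # second pass: one dictionary lookup per line
        key = line.strip()
        if key in suffix:
            out.append(f"{key}_{suffix[key]}")
        else:
            out.append(line.rstrip("\n"))
    return out


if __name__ == "__main__":
    with open("two.txt") as mapping, open("one.txt") as data:
        print("\n".join(rename_headers(mapping, data)))
```

Headers with no entry in the mapping are passed through unchanged, which is also what the awk version does.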

With sed, you can first run over two.txt to generate the sed commands that perform the replacements you want, then run those commands on one.txt:
First way
sed "$(sed -n '/>ENST/{s=.*\(ENST[0-9]\+\)\s\+\([0-9]\+\).*=s/\1/\1_\2/;=;p}' two.txt)" one.txt
Second way
If the files are huge, the first way fails with a "too many arguments" error, because the generated script is passed on the command line. The second way avoids this by writing the script to a file. You need to execute all three commands one by one:
sed -n '1i#!/bin/sed -f
/>ENST/{s=.*\(ENST[0-9]\+\)\s\+\([0-9]\+\).*=s/\1/\1_\2/;=;p}' two.txt > script.sed
chmod +x script.sed
./script.sed one.txt
The first command generates the sed script that modifies one.txt as you want. chmod makes this new script executable, and the last command runs it. Each file is read only once and there are no loops.
Note that the first command spans two lines but is still one command. Deleting the newline character breaks the script, because of the way the i command in sed works. See the sed man page for details.

This Perl solution sends the modified one.txt file to STDOUT.
use strict;
use warnings;
open my $f2, '<', 'two.txt' or die $!;
my %ids;
while (<$f2>) {
$ids{$1} = "$1_$2" if /^>(\S+)\s+(\d+)/;
}
open my $f1, '<', 'one.txt' or die $!;
while (<$f1>) {
# Rewrite only known IDs, and leave the trailing newline in place
s/^>(\S+)/>$ids{$1}/ if /^>(\S+)/ && exists $ids{$1};
print;
}

Turn the problem on its head. In perl I would do something like this:
#!/usr/bin/perl
open(FH1, "one.txt");
open(FH2, "two.txt");
open(RESULT, ">result.txt");
my %data;
while (my $line = <FH2>)
{
chomp($line);
# Delete leading angle bracket
$line =~ s/^>//;
# split enst and pos
my ($enst, $pos) = split(/\s+/, $line);
# Store POS with ENST as key
$data{$enst} = $pos;
}
close(FH2);
while (my $line = <FH1>)
{
# Check line for ENST
if ($line =~ m/^>(ENST\d+)/)
{
my $enst = $1;
# Get pos for ENST
my $pos = $data{$enst};
# make new line
$line = '>' . $enst . '_' . $pos . "\n";
}
print RESULT $line;
}
close(FH1);
close(RESULT);

This might work for you (GNU sed):
sed -n '/^$/!s|^\(\S*\)\s*\(\S*\).*|s/^\1.*/\1_\2/|p' two.txt | sed -f - one.txt

Try this MATLAB solution (no loops):
%# read files as cell array of lines
fid = fopen('one.txt','rt');
C = textscan(fid, '%s', 'Delimiter','\n');
C1 = C{1};
fclose(fid);
fid = fopen('two.txt','rt');
C = textscan(fid, '%s', 'Delimiter','\n');
C2 = C{1};
fclose(fid);
%# use regexp to extract ENST numbers from both files
num = regexp(C1, '>ENST(\d+)', 'tokens', 'once');
idx1 = find(~cellfun(@isempty, num)); %# location of >ENST line
val1 = str2double([num{:}]); %# ENST numbers
num = regexp(C2, '>ENST(\d+)', 'tokens', 'once');
idx2 = find(~cellfun(@isempty, num));
val2 = str2double([num{:}]);
%# construct new header lines from file2
C2(idx2) = regexprep(C2(idx2), ' +','_');
%# replace headers lines in file1 with the new headers
[tf,loc] = ismember(val2,val1);
C1( idx1(loc(tf)) ) = C2( idx2(tf) );
%# write result
fid = fopen('three.txt','wt');
fprintf(fid, '%s\n',C1{:});
fclose(fid);

reformat text in perl

I have a file of 1000 lines, each line in the format
filename dd/mm/yyyy hh:mm:ss
I want to convert it to read
filename mmddhhmm.ss
I have been attempting to do this in perl and awk with no success, and would appreciate any help.
Thanks

You can do a simple regular expression replacement if the format is really fixed:
s|(..)/(..)/.... (..):(..):(..)$|$2$1$3$4.$5|
I used | as a separator so that I do not need to escape the slashes.
You can use this with Perl on the shell in place:
perl -pi -e 's|(..)/(..)/.... (..):(..):(..)$|$2$1$3$4.$5|' file
(Look up the option descriptions with man perlrun).

Another, somewhat ugly approach: for each line ($str here) you read from the file, do something like this:
my $str = 'filename 26/12/2010 21:09:12';
my @arr1 = split(' ',$str);
my @arr2 = split('/',$arr1[1]);
my @arr3 = split(':',$arr1[2]);
my $day = $arr2[0];
my $month = $arr2[1];
my $year = $arr2[2];
my $hours = $arr3[0];
my $minutes = $arr3[1];
my $seconds = $arr3[2];
# The year is not part of the target format mmddhhmm.ss
print $arr1[0].' '.$month.$day.$hours.$minutes.'.'.$seconds."\n";

Pipe your file to a perl script with:
while( my $line = <> ){
if ( $line =~ /(\S+)\s+(\d{2})\/(\d{2})\/\d{4}\s+(\d{2}):(\d{2}):(\d{2})/ ) {
print $1 . " " . $3 . $2 . $4 . $5 . '.' . $6 . "\n";
}
}
Redirect the output however you want.
This says match line to:
(non-whitespace>=1)whitespace>=1(2digits)/(2digits)/4digits
whitespace>=1(2digits):(2digits):(2digits)
Capture groups are in () numbered 1 to 6 left to right.

Using sed:
sed -r 's|([0-9]{2})/([0-9]{2})/[0-9]{4} |\2\1|; s/://; s/:/./' file.txt
swap the day and month, deleting the slashes and the year /yyyy
delete the first colon
change the remaining colon to a dot
Using awk (d[2] is the month and d[1] the day, so the output is mmdd):
awk '{split($2,d,"/"); split($3,t,":"); print $1, d[2] d[1] t[1] t[2] "." t[3]}' file.txt
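To check any of these variants against the requested mmddhhmm.ss format, a small Python sketch of the same transformation can help. It assumes the fixed layout "filename dd/mm/yyyy hh:mm:ss" from the question. The names `LINE` and `reformat` are made up for illustration.

```python
import re

# name, day, month, (year dropped), hours, minutes, seconds
LINE = re.compile(r"^(\S+)\s+(\d{2})/(\d{2})/\d{4}\s+(\d{2}):(\d{2}):(\d{2})\s*$")


def reformat(line):
    """Turn 'name dd/mm/yyyy hh:mm:ss' into 'name mmddhhmm.ss';
    lines that do not match are returned unchanged (minus the newline)."""
    match = LINE.match(line)
    if not match:
        return line.rstrip("\n")
    name, day, month, hours, minutes, seconds = match.groups()
    return f"{name} {month}{day}{hours}{minutes}.{seconds}"


if __name__ == "__main__":
    import sys
    for raw in sys.stdin:
        print(reformat(raw))
```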
