How to merge files with line-skipping - perl

I have two files.
File f1 has the following structure (everything after a # is a comment that is not in the file):
SomeText1 #Section name - one word [a-zA-Z]
acd:some text #code:text - the code contains only [a-z]
opo:some another text #variable number of code:text pairs
wed:text too #in the SomeText1 section are 3 pairs
SomeText2
xxx:textttt #here only 1 code:text pair
SomeText3
zzz:texxxxxxx #here only 1 code:text pair too
File f2 contains the following lines, in the same order as the sections in f1:
1000:acd:opo:wed:123.44:4545.23:1233.23 #3 codes - like in the above segment 1
304:xxx:10:11:12.12 #1 code - these lines contains only
4654:zzz:0 #codes and numbers
the desired output is
SomeText1:1000:acd:opo:wed:123.44:4545.23:1233.23
acd:some text:
opo:some another text:
wed:text too:
SomeText2:304:xxx:10:11:12
xxx:textttt:
SomeText3:4654:zzz:0
zzz:texxxxxxx:
So I need to append each line from f2 to the corresponding "section name" line. The codes in every line of f2 are the same as the codes in the code:text pairs of the matching section in f1.
I have no idea how to start, because
I can't use the paste command, since the two files don't have the same number of lines, and
I can't use join, because there are no common keys in both files.
So I would be really happy if someone could give me SOME ALGORITHM to start with, and I will program it myself.

I'm offering you a different approach: I provide the code, and you figure out how it works ;) :)
paste -d':' f1 <(perl -pe '$\="\n"x($c=()=/[a-z]+/g)' <f2)
produces exactly what you want from your inputs.
EDIT - Explanation:
The solution follows from your comment that the f2 lines contain only codes and numbers, so the codes are easy to extract from each line.
It is therefore enough to emit as many empty lines after each f2 line as that line has codes:
/[a-z]+/g matches every code and, in list context, returns all the matches
$c =()= is the "Rolex operator", which counts the elements of that list
the number of matched codes is the number of empty lines needed
$\ = "\n" x NUMBER repeats the string to the left of x NUMBER times, e.g. with 3 codes the "\n" (newline) character is repeated 3 times
the newlines are assigned to the variable $\, the output record separator
because the -p switch processes the file line by line and prints every line as if by "print $_;", and print appends $\ to its output, the output record separator (holding that number of newlines) is printed after every line
so we get the empty lines, and paste can then pair the two files line by line
I hope my English was good enough for the explanation.
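As a rough illustration of the counting idiom described above, here is a small sketch. The helper name pad_line is hypothetical and not part of the original one-liner; it returns an f2 line followed by one empty line per code, which is what the one-liner achieves through $\.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original answer): return the line
# followed by one empty line for every lowercase code found in it.
sub pad_line {
    my ($line) = @_;
    chomp $line;
    # Assigning the list of matches to () and that to a scalar
    # yields the number of matches.
    my $count = () = $line =~ /[a-z]+/g;
    return $line . "\n" . ( "\n" x $count );
}

print pad_line("1000:acd:opo:wed:123.44:4545.23:1233.23\n");
print pad_line("304:xxx:10:11:12.12\n");
```

Each padded line then occupies as many output lines as its section does in f1, so a line-by-line paste lines up.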

Or wholly in Perl:
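use strict;
use warnings;

# Assumes the two input files are named f1 and f2, as in the question.
open my $f1, '<', 'f1' or die "f1: $!";
open my $f2, '<', 'f2' or die "f2: $!";

my $skip = 0;    # code:text lines still to pass through unchanged
while (<$f1>) {
    chomp;
    my $suffix;
    if ($skip--) {
        $suffix = "\n";
    } else {
        # Section line: append the next f2 line and count its codes.
        $suffix = <$f2>;
        $skip = () = $suffix =~ /[a-z]+/g;
    }
    print "$_:$suffix";
}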

Related

Perl: how to format a string containing a tilde character "~"

I have run into an issue where a perl script we use to parse a text file is omitting lines containing the tilde (~) character, and I can't figure out why.
The sample below illustrates what I mean:
#!/usr/bin/perl
use warnings;
formline " testing1\n";
formline " ~testing2\n";
formline " testing3\n";
my $body_text = $^A;
$^A = "";
print $body_text
The output of this example is:
testing1
testing3
The line containing the tilde is dropped entirely from the accumulator. This happens whether there is any text preceding the character or not.
Is there any way to print the line with the tilde treated as a literal part of the string?
~ is special in forms (see perlform) and there's no way to escape it. But you can create a field for it and populate it with a tilde:
formline " \@testing2\n", '~';
The first argument to formline is the "picture" (template). That picture uses various characters to mean particular things. The ~ means to suppress output if the fields are blank. Since you supply no fields in your call to formline, your fields are blank and output is suppressed.
my @lines = ( '', 'x y z', 'x~y~z' );
foreach $line ( @lines ) { # forms don't use lexicals, so no my on control
    write;
}
format STDOUT =
~ ID: @*
$line
.
The output doesn't have a line for the blank field because the ~ in the picture told it to suppress output when $line doesn't have anything:
ID: x y z
ID: x~y~z
Note that tildes coming from the data are just fine; they are like any other character.
Here's probably something closer to what you meant. Create a picture, @* (variable-width multiline text), and supply it with values to fill it:
while( <DATA> ) {
    local $^A;
    formline '@*', $_;
    print $^A, "\n";
}
__DATA__
testing1
~testing2
testing3
The output shows the field with the ~:
testing1
~testing2
testing3
However, the question is very odd because the way you appear to be doing things seems like you aren't really doing what formats want to do. Perhaps you have some tricky thing where you're trying to take the picture from input data. But if you aren't going to give it any values, what are you really formatting? Consider that you may not actually want formats.

Perl hash while loop cannot find the key value

I am confused by one perl question, anyone has some idea?
I use one hash structure to store the keys and values like:
$hash{1} - > a;
$hash{2} - > b;
$hash{3} - > c;
$hash{4} - > d;
....
more than 1000 lines. I give a name like %hash
and then, I plan to have one loop statement to search for all keys to see whether it will match with the value from the file.
for example, below is the file content:
first line 1
second line 2
nothing
another line 3
my logic is:
while(read line){
    while (($key, $value) = each (%hash))
    {
        if ($line =~/$key/i){
            print "found";
        }
    }
}
so my expectation is :
first line 1 - > return found
second line 2 - > return found
nothing
another line 3 - > return found
....
However, during my testing, only the first and second lines return found; for 'another line 3', the
program does not return 'found'.
Note: the hash has more than 1000 records.
So I tried to debug it by adding a counter inside the loop. For the cases that are found, the loop runs around 600 or 700 times, but for the 'another line 3' case it runs only around 300 times, then exits the loop without returning found.
Any idea why this happens?
One more test I did: if my hash is small, with only 10 keys, the logic works.
I also tried foreach, and it looks like foreach does not have this issue.
The pseudo code you give should work fine, but there might be a subtle problem.
If you leave the while loop early once you have found your key and printed it, the next call to each will continue where you left off. In other words, each is an iterator that stores its state in the hash it iterates over.
In http://blogs.perl.org/users/rurban/2014/04/do-not-use-each.html the author explains this in more detail. His conclusion:
So each should be treated as in php: Avoid it like a plague. Only use it in optimized cases where you know what you are doing.
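The iterator behaviour described above can be sketched as follows. The helper visits_after_early_exit is a hypothetical name used only for this illustration; it leaves an each loop early, optionally resets the iterator by calling keys, and then counts how many pairs a second each loop sees.

```perl
use strict;
use warnings;

# Hypothetical helper: exit an each loop after one pair, optionally reset
# the hash iterator, then count the pairs a second each loop visits.
sub visits_after_early_exit {
    my ($h, $reset) = @_;
    keys %$h;                                   # start from a clean iterator
    while ( my ($k, $v) = each %$h ) { last }   # early exit, as after a match
    keys %$h if $reset;                         # calling keys resets the iterator
    my $n = 0;
    while ( my ($k, $v) = each %$h ) { $n++ }
    return $n;
}

my %hash = map { $_ => "value$_" } 1 .. 10;
print "without reset: ", visits_after_early_exit( \%hash, 0 ), "\n";   # 9
print "with reset:    ", visits_after_early_exit( \%hash, 1 ), "\n";   # 10
```

Without the reset, the second loop misses the pair consumed before the early exit, which matches the symptom of a search that stops short.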
The problem is not very well articulated by the OP, and the sample data provided are poor for demonstration purposes.
The following sample code is an attempt based on the OP's problem description.
It recreates the filter hash from the DATA block, composes $re_filter from the hash keys, and walks through the file given as a command-line argument, printing the lines that match $re_filter.
use strict;
use warnings;
my $data = do { local $/; <DATA> };
my %hash = split ' ', $data;
my $re_filter = join('|',keys %hash);
/$re_filter/ && print for <>;
__DATA__
1 a
2 b
3 c
4 d
Input data file content
first line 1
second line 2
nothing
another line 3
Output
first line 1
second line 2
another line 3

get column list using sed/awk/perl

I have different files like below format
Scenario 1 :
File1
no,name
1,aaa
20,bbb
File2
no,name,address
5,aaa,ghi
7,ccc,mn
I would like to get the column list of whichever file has more columns, if the columns are in the same order.
Expected output for scenario 1:
no,name,address
Scenario 2 :
File1
no,name
1,aaa
20,bbb
File2
no,age,name,address
5,2,aaa,ghi
7,3,ccc,mn
Expected Results :
Both file headers and positions are different as a message
I am interested in any short solution using bash / perl / sed / awk.
Perl solution:
perl -lne 'push @lines, $_;
           close ARGV;
           next if @lines < 2;
           @lines = sort { length $a <=> length $b } @lines;
           if (0 == index "$lines[1],", "$lines[0],") {
               print $lines[1];
           } else {
               print "Both file headers and positions are different";
           }' -- File1 File2
-n reads the input line by line and runs the code for each line
-l removes newlines from input and adds them to printed lines
closing the special file handle ARGV makes Perl open the next file and read from it instead of processing the rest of the currently opened file.
next makes Perl go back to the beginning of the code, it can continue once more than one input line has been read.
sort sorts the lines by length so that we know the longer one is in the second element of the array.
index is used to check whether the shorter header is a prefix of the longer one (including the comma after the first header, so e.g. no,names is correctly rejected)
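The prefix check from the last point can also be tried in isolation. This is only a sketch: longer_header is a hypothetical helper name, and it takes the two header lines directly instead of reading them from files.

```perl
use strict;
use warnings;

# Hypothetical helper: given two header lines, return the longer one if
# the shorter is a column-wise prefix of it, otherwise undef.
sub longer_header {
    my ($short, $long) = sort { length $a <=> length $b } @_;
    # A comma appended to both sides makes the comparison stop at a
    # column boundary, so "no,name" is not treated as a prefix of "no,names".
    return index( "$long,", "$short," ) == 0 ? $long : undef;
}

for my $pair ( [ 'no,name', 'no,name,address' ], [ 'no,name', 'no,age,name,address' ] ) {
    my $result = longer_header(@$pair);
    print defined $result ? $result : 'Both file headers and positions are different', "\n";
}
```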

delete previous and next lines in perl

I have the following file:
@TWEETY:150:000000000-ACFKE:1:2104:27858:17965
AAATTAGCAAAAAACAATAACAAAACTGGGAAAATGCAATTTAACAACGAAAATTTTCCGAGAACTTGAAAGCGTACGAAAACGATACGCTCC
+
D1FFFB11FDG00EE0FFFA1110FAA1F/ABA0FGHEGDFEEFGDBGGGGFEHBFDDG/FE/EGH1#GF#F0AEEEEFHGGFEFFCEC/>EE
@TWEETY:150:000000000-ACFKE:1:1105:22044:20029
AAAAAATATTAAAACTACGAATGCATAAATTATTTCGTTCGAAATAAACTCACACTCGTAACATTGAACTACGCGCTCC
+
CCFDDDFGGGGGGGGGGHGGHHHHGHHHHHHHHHHHHHHHGHHGHHHHHHHHHHHHHGHGHGGHHHHHHGHHEGGGGGG
@TWEETY:150:000000000-ACFKE:1:2113:14793:7182
TATATAAAGCGAGAGTAGAAACTTTTTAATTGACGCGGCGAGAAAGTATATAGCAACAAGCGAGCACCCGCTCC
+
BBFFFFFGGGGFFGGFGHHHHHHHHHHHHHHHHHGGAEEEAFGGGHHFEGHHGHHHHHGHHGGGGFHHGG?EEG
@TWEETY:150:000000000-ACFKE:1:2109:5013:22093
AAAAAAATAATTCATATCGCCATATCGACTGACAGATAATCTATCTATAATCATAACTTTTCCCTCGCTCC
+
DAFAADDGF1EAGG3EG3A00ECGDFFAEGFCHHCAGHBGEAGBFDEDGGHBGHGFGHHFHHHBDG?/FA/
@TWEETY:150:000000000-ACFKE:1:2106:25318:19875

+
CCCCCCCCCCCCGGGGGGGGGGGGGGGGGGGGGGGGFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
The lines are in groups of four (each time there is a name, starting with @TWEETY, a string of letters, a + character, and another string of letters).
The second and fourth lines should have the same number of characters.
But there are cases where the second line is empty, as in the last four lines.
In these cases, I would like to get rid of the whole block (the previous line before the empty line and the next two lines).
I have just started perl and have been trying to write a script for my problem, but am having a hard time. Does anyone have some feedback?
Thanks!
Keep an array buffer of the last four lines. When it's full, check the second line, print the lines or not, empty the buffer, repeat.
#!/usr/bin/perl
use warnings;
use strict;
my @buffer;
sub output {
    print @buffer unless 1 == length $buffer[1];
    @buffer = ();
}
while (<>) {
    if (4 == @buffer) {
        output();
    }
    push @buffer, $_;
}
output(); # Don't forget to process the last four lines.
Yes. Start with looking at $/ and set it so you can work on a chunk at a time. I would suggest you can treat @ as a record separator in your example.
Then iterate your records using a while loop. E.g. while ( <> ) {
Use split on \n to turn the current chunk into an array of lines.
Perform your test on the appropriate lines, and either print - or not - depending on whether it passed.
If you get stuck with that, then I'm sure a specific question including your code and where you're having problems will be well received here.
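The steps above can be sketched roughly as follows. This is an illustration under assumptions, not the answerer's code: the helper names keep_record and filter_records are hypothetical, and instead of setting $/ it splits an in-memory string before every line that starts with @TWEETY, which avoids cutting a record at an @ inside a quality line.

```perl
use strict;
use warnings;

# Hypothetical helper: keep a four-line record only if its second line
# (the sequence) is non-empty.
sub keep_record {
    my ($record) = @_;
    my @lines = split /\n/, $record, -1;
    return defined $lines[1] && length $lines[1];
}

# Hypothetical helper: split the text before every line that starts with
# "@TWEETY" and join the records that pass the test.
sub filter_records {
    my ($text) = @_;
    return join '', grep { keep_record($_) } split /^(?=\@TWEETY)/m, $text;
}

my $sample = "\@TWEETY:1\nACGT\n+\nFFFF\n\@TWEETY:2\n\n+\nGGGG\n";
print filter_records($sample);
```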
If you chunk the data correctly, this becomes almost trivial.
#!/usr/bin/perl
use strict;
use warnings;
# Use '@TWEETY' as the record separator to make it
# easy to chunk the data.
local $/ = '@TWEETY';
while (<DATA>) {
    # Remove the trailing record separator from the chunk.
    chomp;
    # The first entry will be empty (as the separator
    # is the first thing in the file). Skip that record.
    next unless /\S/;
    # Skip any records with two consecutive newlines
    # (as they will be the ones with the empty line 2)
    next if /\n\n/;
    # Print the remaining records
    # (with $/ stuck back on the front)
    print "$/$_";
}
__DATA__
@TWEETY:150:000000000-ACFKE:1:2104:27858:17965
AAATTAGCAAAAAACAATAACAAAACTGGGAAAATGCAATTTAACAACGAAAATTTTCCGAGAACTTGAAAGCGTACGAAAACGATACGCTCC
+
D1FFFB11FDG00EE0FFFA1110FAA1F/ABA0FGHEGDFEEFGDBGGGGFEHBFDDG/FE/EGH1#GF#F0AEEEEFHGGFEFFCEC/>EE
@TWEETY:150:000000000-ACFKE:1:1105:22044:20029
AAAAAATATTAAAACTACGAATGCATAAATTATTTCGTTCGAAATAAACTCACACTCGTAACATTGAACTACGCGCTCC
+
CCFDDDFGGGGGGGGGGHGGHHHHGHHHHHHHHHHHHHHHGHHGHHHHHHHHHHHHHGHGHGGHHHHHHGHHEGGGGGG
@TWEETY:150:000000000-ACFKE:1:2113:14793:7182
TATATAAAGCGAGAGTAGAAACTTTTTAATTGACGCGGCGAGAAAGTATATAGCAACAAGCGAGCACCCGCTCC
+
BBFFFFFGGGGFFGGFGHHHHHHHHHHHHHHHHHGGAEEEAFGGGHHFEGHHGHHHHHGHHGGGGFHHGG?EEG
@TWEETY:150:000000000-ACFKE:1:2109:5013:22093
AAAAAAATAATTCATATCGCCATATCGACTGACAGATAATCTATCTATAATCATAACTTTTCCCTCGCTCC
+
DAFAADDGF1EAGG3EG3A00ECGDFFAEGFCHHCAGHBGEAGBFDEDGGHBGHGFGHHFHHHBDG?/FA/
@TWEETY:150:000000000-ACFKE:1:2106:25318:19875

+
CCCCCCCCCCCCGGGGGGGGGGGGGGGGGGGGGGGGFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
Thanks everyone for the feedback!
It was all really useful. Thanks to your suggestions, I explored all the options and learned the unless statement.
The easiest solution given my existing code, was just to add an unless statement at the end.
### Write to output, but remove non-desired Gs
open OUT, ">$outfile";
my @accorder = @{$store0{"accorder"}};
foreach my $acc (@accorder){
    # retrieve seq(2nd line) and qual(4th line)
    my $seq = $store0{$acc}{"seq"};
    my $qual = $store0{$acc}{"qual"};
    # clean out polyG at end
    $seq =~ s/G{3,}.{0,1}$//;
    my $lenseq = length($seq);
    my $lenqual = length($qual);
    my $startqual = $lenqual - $lenseq;
    $qual = substr($qual, 0, $lenseq);
    #the above was in order to remove multiple G characters at the end of the
    #second line, which is what led to empty lines (lines that were made up of
    #only Gs got cut out)
    # print to output, unless sequence has become empty
    unless($lenseq == 0){ #this is the unless statement I added
        print OUT "\@$acc\n$seq\n+\n$qual\n";
    }
}
close(OUT);

sed, awk or perl: Pattern range match, print 45 lines then add record delimiter

I have a file containing records delimited by the pattern /#matchee/. These records are of varying lengths ...say 45 - 75 lines. They need to ALL be 45 lines and still maintain the record delimiter. Records can be from different departments, department name is on line 2 following a blank line. So record delimiter could be thought of as simply /^#matchee/ or /^matchee/ followed by \n. There is a Deluxe edition of this problem and a Walmart edition ...
DELUXE EDITION
Pull each record by pattern range so I can sort records by department. Eg., with sed
sed -n '/^DEPARTMENT NAME/,/^#matchee/{p;}' mess-o-records.txt
Then, Print only the first 45 lines of each record in the file to conform to
the 45 line constraint.
Finally, make sure the result still has the record delimiter on line 45.
WALMART EDITION
Same as above, but instead of using a range, just use the record delimiter.
STATUS
My attempt at this might clarify what I'm trying to do.
sed -n -e '/^DEPARTMENT-A/,/^#matchee/{p;}' -e '45q' -e '$s/.*/#matchee/' mess-o-records.txt
This doesn't work, of course, because sed is operating on the entire file at each command.
I need it to operate on each range match not the whole file.
SAMPLE INPUT - 80 Lines ( truncated for space )
<blank line>
DEPARTMENT-A
Office space 206
Anonymous, MI 99999
Harold O Nonymous
Buckminster Abbey
Anonymous, MI 99999
item A Socket B 45454545
item B Gizmo Z 76767676
<too many lines here>
<way too many lines here>
#matchee
SAMPLE OUTPUT - now only 45 lines
<blank line>
DEPARTMENT-A
Office space 206
Anonymous, MI 99999
Harold O Nonymous
Buckminster Abbey
Anonymous, MI 99999
item A Socket B 45454545
item B Gizmo Z 76767676
<Record now equals exactly 45 lines>
<yet record delimiter is maintained>
#matchee
CLARIFICATION UPDATE
I will never need more than the first 40 lines if this makes things easier. Maybe the process would be:
Match pattern(s)
Print first 40 lines.
Pad to appropriate length. Eg., 45 lines.
Tack delimiter back on. Eg., #matchee
I think this would be more flexible -- Ie., can handle record shorter than 45 lines.
Here's a riff based on @Borodin's Perl example below:
my $count = 0;
$/ = "#matchee";
while (<>) {
if (/^REDUNDANCY.*DEPT/) {
print;
$count = 0;
}
else {
print if $count++ < 40;
print "\r\n" x 5;
print "#matchee\r\n";
}
}
This adds 5 newlines to each record plus the delimiting pattern /#matchee/. So it's wrong -- but it illustrates what I want.
Print 40 lines based on department -- pad -- tack delimiter back on.
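The trim, pad and re-delimit steps from the clarification can be sketched as below. This is a minimal sketch under assumptions: normalize_record is a hypothetical helper, it works on a record already split into lines (without the delimiter), and the numbers 40 and 45 are passed in as parameters rather than fixed.

```perl
use strict;
use warnings;

# Hypothetical helper: trim a record body to at most $keep lines, pad it
# with empty lines up to $total - 1 lines, and append the delimiter as the
# final line. Assumes $keep is smaller than $total.
sub normalize_record {
    my ($lines, $keep, $total, $delim) = @_;
    my @out = @$lines;
    $#out = $keep - 1 if @out > $keep;          # keep only the first $keep lines
    push @out, ('') x ( $total - 1 - @out );    # pad with empty lines
    return ( @out, $delim );
}

my @record = normalize_record( [ map { "line $_" } 1 .. 60 ], 40, 45, '#matchee' );
print scalar(@record), "\n";    # 45
print "$record[-1]\n";          # #matchee
```

Because padding happens after trimming, a record shorter than 40 lines also comes out at exactly 45 lines.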
I think I understand what you want. Not sure about the bit about pull each record by pattern range. Is #matchee always followed by a blank line and then the department line? So in fact record number 2?
This Perl fragment does what I understand you need.
If you prefer you can put the input file on the command line and drop the open call. Then the loop would have to be while (<>) { ... }.
Let us know if this is right so far, and what more you need from it.
use strict;
use warnings;
open my $fh, '<', 'mess-o-records.txt' or die $!;
my $count = 0;
while (<$fh>) {
if (/^#matchee/) {
print;
$count = 0;
}
else {
print if $count++ < 45;
}
}
I know this has already had an accepted answer, but I figured I'd post an awk example for anyone interested. It's not 100%, but it gets the job done.
Note: This numbers the lines so you can verify the script is working as expected. Remove the i, from print i, current[i] to remove the line numbers.
dep.awk
BEGIN { RS = "#matchee\n\n" }
$0 ~ /[a-zA-Z0-9]+/ {
split($0, current, "\n")
for (i = 1; i <= 45; i++) {
print i, current[i];
}
print "#matchee\n"
}
In this example, you begin the script by setting the record separator (RS) to "#matchee\n\n". There are two newlines because the first ends the line on which #matchee occurs and the second is the blank line on its own.
The match validates that a record contains letters or numbers to be valid. You could also check that the match starts with 'DEPARTMENT-', but this would fail if there is a stray newline. Checking the content is the safest route. Because this uses a block record (i.e., DEPARTMENT-A through #matchee), you could either pass $0 through awk or sed again, or use the awk split function and loop through 45 lines. In awk, the arrays aren't zero-indexed.
The print function includes a newline, so the block ends with print "#matchee\n" only instead of the double \n in the record separator variable.
You could also drop the same awk script into a bash script and change the number of lines and field separator. Of course, you should add validations and whatnot, but here's the start:
dep.sh
#!/bin/bash
# prints the first n lines within every block of text delimited by splitter
splitter=$1
numlines=$2
awk 'BEGIN { RS="'$1'\n\n" }
$0 ~ /[a-zA-Z0-9]+/ {
split($0, current, "\n")
for(i=1;i<='$numlines';i++) {
print i, current[i]
}
print "'$splitter'", "\n"
}' $3
Make the script executable and run it.
./dep.sh '#matchee' 45 input.txt > output.txt
I added these files to a gist so you could also verify the output
This might work for you:
D="DEPARTMENT-A" M="#matchee"
sed '/'"$D/,/$M"'/{/'"$D"'/{h;d};H;/'"$M"'/{x;:a;s/\n/&'"$M"'/45;tb;s/'"$M"'/\n&/;ta;:b;s/\('"$M"'\).*/\1/;p};d}' file
Explanation:
Focus on range of lines /DEPARTMENT/,/#matchee/
At start of range move pattern space (PS) to hold space (HS) and delete PS /DEPARTMENT/{h;d}
All subsequent lines in the range append to HS and delete H....;d
At end of range:/#matchee/
Swap to HS x
Test for 45 lines in range and if successful append #matchee at the 45th line s/\n/&#matchee/45
If previous substitution was successful branch to label b. tb
If previous substitution was unsuccessful insert a linefeed before #matchee s/'"$M"'/\n&/ thus lengthening a short record to 45 lines.
Branch to label a and test for 45 lines etc . ta
Replace everything from the first occurrence of #matchee to the end of the pattern space with #matchee itself, s/\('"$M"'\).*/\1/, thus shortening a long record to 45 lines.
Print the range of records. p
All non-range records pass through untouched.
TXR Solution ( http://www.nongnu.org/txr )
For illustration purposes using the fake data, I shorten the requirement from 40 lines to 12 lines. We find records beginning with a department name, delimited by #matchee. We dump them, chopped to no more than 12 lines, with #matchee added again.
@(collect)
@  (all)
@dept
@  (and)
@    (collect)
@line
@    (until)
#matchee
@    (end)
@  (end)
@(end)
@(output)
@  (repeat)
@{line[0..12] "\n"}
#matchee
@  (end)
@(end)
Here, the dept variable is expected to come from a -D command line option, but of course the code can be changed to accept it as an argument and put out a usage if it is missing.
Run on the sample data:
$ txr -Ddept=DEPARTMENT-A trim-extract.txr mess-o-records.txt
DEPARTMENT-A
Office space 206
Anonymous, MI 99999
Harold O Nonymous
Buckminster Abbey
Anonymous, MI 99999
item A Socket B 45454545
item B Gizmo Z 76767676
<too many lines here>
#matchee
The blank lines before DEPARTMENT-A are gone, and there are exactly 12 lines, which happen to include one line of the <too many ...> junk.
Note that the semantics of @(until) is such that the #matchee is excluded from the collected material. So it is correct to unconditionally add it in the @(output) clause. This program will work even if a record happens to be shorter than 12 lines before #matchee is found.
It will not match a record if #matchee is not found.