Split my output into multiple files - perl

I have the following list in a CSV file, and my goal is to split this list into directories named YYYY-Month based on the date in each row.
NAME99;2018/06/13;12:27:30
NAME01;2018/06/13;13:03:59
NAME00;2018/06/15;11:33:01
NAME98;2018/06/15;12:22:00
NAME34;2018/06/15;16:58:45
NAME17;2018/06/18;15:51:10
NAME72;2018/06/19;10:06:37
NAME70;2018/06/19;12:44:03
NAME77;2018/06/19;16:36:55
NAME25;2018/06/11;16:32:57
NAME24;2018/06/11;16:32:57
NAME23;2018/06/11;16:37:15
NAME01;2018/06/11;16:37:15
NAME02;2018/06/11;16:37:15
NAME01;2018/06/11;16:37:18
NAME02;2018/06/05;09:51:17
NAME00;2018/06/13;15:04:29
NAME07;2018/06/19;10:02:26
NAME08;2018/06/26;16:03:57
NAME09;2018/06/26;16:03:57
NAME02;2018/06/27;16:58:12
NAME03;2018/07/03;07:47:21
NAME21;2018/07/03;10:53:00
NAMEXX;2018/07/05;03:13:01
NAME21;2018/07/05;15:39:00
NAME01;2018/07/05;16:00:14
NAME00;2018/07/08;11:50:10
NAME07;2018/07/09;14:46:00
What is the smartest way to achieve this result without having to hard-code a static list of output paths to append to?
Currently my program writes this list to a single YYYY-Month directory derived from localtime, rather than using the date in each line.
#!/usr/bin/perl
use strict;
use warnings 'all';
use feature qw(say);
use File::Path qw<mkpath>;
use File::Spec;
use File::Copy;
use POSIX qw<strftime>;

my $BASE_LOG_DIRECTORY = '.';    # base directory for the YYYY-Month subdirectories
my $OUTPUT_FILE        = 'output.csv';
my $OUTFILE            = 'splitted_output.csv';

# Output to file
open( GL_INPUT, $OUTPUT_FILE ) or die $!;

$/ = "\n\n";    # input record separator

while ( <GL_INPUT> ) {
    chomp;
    my @lines = split /\n/;
    my $i     = 0;

    foreach my $lines ( @lines ) {
        # Encapsulate Date/Time
        my ( $name, $y, $m, $d, $time ) =
            $lines[$i] =~ /\A(\w+);(\d+)\/(\d+)\/(\d+);(\d+:\d+:\d+)/;

        # Generate directory YYYY-Month, e.g. 2009-January
        my $dir = File::Spec->catfile( $BASE_LOG_DIRECTORY, "$y-$m" );

        unless ( -e $dir ) {
            mkpath $dir;
        }

        my $log_file_path = File::Spec->catfile( $dir, $OUTFILE );

        open( OUTPUT, '>>', $log_file_path ) or die $!;

        # Here I append value into files
        print OUTPUT join ';', "$y/$m/$d", $time, "$name\n";

        $i++;
    }
}

close( GL_INPUT );
close( OUTPUT );

There is no reason to care about the actual date, or to use date functions at all here. You want to split up your data based on a partial value of one of the columns in the data. That just happens to be the date.
NAME08;2018/06/26;16:03:57 # This goes to 2018-06/
NAME09;2018/06/26;16:03:57 #
NAME02;2018/06/27;16:58:12 #
NAME03;2018/07/03;07:47:21 # This goes to 2018-07/
NAME21;2018/07/03;10:53:00 #
NAMEXX;2018/07/05;03:13:01 #
NAME21;2018/07/05;15:39:00 #
The easiest way to do this is to iterate your input data, then stick it into a hash with keys for each year-month combination. But you're talking about log files, and they might be large, so that's inefficient.
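For small inputs, that in-memory approach could look like this sketch (directory creation is omitted, and the file name is a placeholder):

my %rows_by_dir;
while ( my $row = <DATA> ) {
    my ( undef, $date ) = split /;/, $row;    # e.g. "2018/06/26"
    my ( $y, $m ) = split m{/}, $date;
    push @{ $rows_by_dir{"$y-$m"} }, $row;    # everything accumulates in memory
}
for my $dir ( keys %rows_by_dir ) {
    open my $fh, '>', "$dir/filename.csv" or die $!;
    print {$fh} @{ $rows_by_dir{$dir} };
    close $fh;
}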
We should work with different file handles instead.
use strict;
use warnings;

my %months = ( '06' => 'June', '07' => 'July' );
my %handles;

while ( my $row = <DATA> ) {
    # no chomp, we don't actually care about reading the whole row
    my ( undef, $dir ) = split /;/, $row;    # discard the name and everything after the date

    # create the YYYY-Month key, e.g. "2018/06/26" becomes "2018-June"
    $dir =~ s[^(....)/(..).*][$1-$months{$2}];

    # open a new handle for this year/month if we don't have it yet
    unless ( exists $handles{$dir} ) {
        # create the directory (skipped here) ...
        open my $fh, '>', "$dir/filename.csv" or die $!;
        $handles{$dir} = $fh;
    }

    # write out the line to the correct directory
    print { $handles{$dir} } $row;
}
__DATA__
NAME08;2018/06/26;16:03:57
NAME09;2018/06/26;16:03:57
NAME02;2018/06/27;16:58:12
NAME03;2018/07/03;07:47:21
NAME21;2018/07/03;10:53:00
NAMEXX;2018/07/05;03:13:01
NAME21;2018/07/05;15:39:00
I've skipped the part about creating the directory as you already know how to do this.
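For reference, that step can be a single call to make_path from the core File::Path module (the same module whose older mkpath interface the question already loads), placed just before the open:

use File::Path qw(make_path);

make_path($dir) unless -d $dir;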
This code will also work if your rows of data are not sequential. It's not the most efficient approach, as the number of handles grows with the number of distinct year/month values in your data, but as long as you don't have hundreds of them open at the same time, that does not really matter.
Things of note:
You don't need chomp because you don't care about working with the last field.
You don't need to assign all of the values after split because you don't care about them.
You can discard values by assigning them to undef.
Always use three-argument open and lexical file handles.
the {} in print { ... } $row are needed to tell Perl that this expression is the handle we are printing to. See http://perldoc.perl.org/functions/print.html.


Parsing data from delimited blocks

I have a log file containing many blocks /begin CHECK ... /end CHECK, like below:
/begin CHECK
Var_AAA
"Description AAA"
DATATYPE UBYTE
Max_Value 255.
ADDRESS 0xFF0011
/end CHECK
/begin CHECK
Var_BBB
"Description BBB"
DATATYPE UBYTE
Max_Value 255.
ADDRESS 0xFF0022
/end CHECK
...
I want to extract each variable name and its address, then write them to a new file like this:
Name Address
Var_AAA => 0xFF0011
Var_BBB => 0xFF0022
My idea is to use ($start, $keyword, $end) to check each block and extract the data after the keyword only:
#!/usr/bin/perl
use strict;
use warnings;

my $input  = 'input.log';
my $output = 'output.out';

my ( $start, $keyword, $end ) = ( '^\/begin CHECK\n\n', 'ADDRESS ', '\/end CHECK' );
my @block;

# open input file for reading
open( my $in, '<', $input ) or die "Cannot open file '$input' for reading: $!";

# open destination file for writing
open( my $out, '>', $output ) or die "Cannot open file '$output' for writing: $!";

print( "copying variable name and its address from $input to $output \n" );

while ( $in ) {    # For each line of input
    if ( /$start/i .. /$end/i ) {    # Block matching
        push @block, $_;
    }
    if ( /$end/i ) {
        for ( @block ) {
            if ( /\s+ $keyword/ ) {
                print $out join( '', @block );
                last;
            }
        }
        @block = ();
    }
    close $in or die "Cannot close file '$input': $!";
}
close $out or die "Cannot close file '$output': $!";
But I get nothing after execution. Can anyone suggest what is wrong, ideally with a sample?
Almost everything looks good, but it's your start regex that's causing the first problem:
'^\/begin CHECK\n\n'
You are reading lines from the file but then looking for two newlines in a row. That's never going to match, because a line ends with exactly one newline (unless you change $/, but that's a different topic). If you want to match the end of a line, you can use the $ (or \z) anchor:
'^\/begin CHECK$'
Here's the program I pared down. You can adjust it to do all the rest of the stuff that you need to do:
use v5.10;
use strict;
use warnings;
use Data::Dumper;
my ($start, $keyword, $end) = (qr{^/begin CHECK$}, qr(^ADDRESS ), qr(^/end CHECK));
while (<DATA>) {    # For each line of input
    state @block;
    chomp;

    if (/$start/i .. /$end/i) {    # Block matching
        push @block, $_ unless /^\s*$/;
    }

    if (/$end/i) {
        print Dumper( \@block );
        @block = ();
    }
}
After that, you're not reading the data. You need to put the filehandle inside <> (the line input operator):
while ( <$in> )
The file handles will close themselves at the end of the program automatically. If you want to close them yourself, that's fine, but don't do it until you are done: don't close $in until the while loop has finished.
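Putting those two fixes together, the skeleton of the read loop becomes (a sketch; the block-matching body stays as you had it):

open( my $in, '<', $input ) or die "Cannot open file '$input' for reading: $!";

while ( <$in> ) {    # <$in> actually reads each line into $_
    # ... block matching and printing, as before ...
}

close $in or die "Cannot close file '$input': $!";    # close after the loop, not inside it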
Using the command prompt in Windows (macOS and Unix follow the same logic), you can do:
perl -wpe "$/='/end CHECK';s/^.*?(Var_\S+).*?(ADDRESS \S+).*$/$1 => $2\n/s" "your_file.txt" > "new.txt"
First we set the input record separator to $/ = '/end CHECK', so that each "line" read in is one whole block.
We then pick out only the first Var_ and the first ADDRESS, deleting everything else in single-line mode, i.e. with the /s modifier so that the dot matches line breaks: s/^.*?(Var_\S+).*?(ADDRESS \S+).*$/$1 => $2\n/s.
We then write the results into a new file via the > "new.txt" redirection.
Make sure to use -w -p -e, where -e executes the code, -p prints each processed record, and -w enables warnings.
If you leave off the > "new.txt" redirection, the result is printed to the screen so you can inspect it; if you include it, just open new.txt and everything will be printed there.
Here are some of the issues with your code
You have while ($in) instead of while ( <$in> ), so your program never reads from the input file
You close your input file handle inside the while read loop, so you can only ever read one record
Your $start regex pattern is '^\/begin CHECK\n\n'. The single quotes make your program search for backslash n backslash n instead of newline newline
Your test if (/\s+ $keyword/) looks for one or more whitespace characters, followed by a space, followed by ADDRESS (the contents of $keyword). There are no occurrences of ADDRESS preceded by whitespace anywhere in your data
You have also written far too much code without testing anything. You should start by writing your read loop on its own and making sure that the data is coming in correctly, and only then proceed by adding two or three lines of code at a time between tests. Writing 90% of the functionality before testing it is a very bad approach.
In future, to help you address problems like this, I would point you to the excellent resources linked on the Stack Overflow Perl tag information page.
The only slightly obscure thing here is that the range operator /$start/i .. /$end/i returns a useful value; I have copied it into $status. The first time the operator matches, the result will be 1, the second time 2, and so on. The last time is different because it is a string in engineering notation, like 9E0: it still evaluates to the correct count, but you can detect the last match by checking for /E/. I've used == 1 and /E/ to avoid pushing the begin and end lines onto @block
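Here is a tiny standalone demonstration of those return values, separate from the solution below:

while (<DATA>) {
    my $status = /begin/ .. /end/;
    print "$status\n";    # prints 1, then 2, then 3E0
}

__DATA__
begin
middle
end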
I don't think there's anything else overly complex here that you can't find described in the Perl language reference
use strict;
use warnings;

use autodie;    # Handle bad IO status automatically
use List::Util 'max';

my ($input, $output) = qw/ input.log output.txt /;

open my $in_fh, '<', $input;

my ( @block, @vars );

while ( <$in_fh> ) {

    my $status = m{^/begin CHECK}i .. m{^/end CHECK}i;

    if ( $status =~ /E/ ) {    # End line

        @block = grep /\S/, @block;
        chomp @block;

        my $var = $block[0];
        my $addr;
        for ( @block ) {
            if ( /^ADDRESS\s+(0x\w+)/ ) {
                $addr = $1;
                last;
            }
        }
        push @vars, [ $var, $addr ];
        @block = ();
    }
    elsif ( $status ) {
        push @block, $_ unless $status == 1;
    }
}

# Format and generate the output
open my $out_fh, '>', $output;

my $w = max map { length $_->[0] } @vars;
printf $out_fh "%-*s => %s\n", $w, @$_ for [qw/ Name Address /], @vars;

close $out_fh;
output
Name => Address
Var_AAA => 0xFF0011
Var_BBB => 0xFF0022
Update
For what it's worth, I would have written something like this. It produces the same output as above.
use strict;
use warnings;

use autodie;    # Handle bad IO status automatically
use List::Util 'max';

my ($input, $output) = qw/ input.log output.txt /;

my $data = do {
    open my $in_fh, '<', $input;
    local $/;
    <$in_fh>;
};

my @vars;

while ( $data =~ m{^/begin CHECK$(.+?)^/end CHECK$}gms ) {
    my $block = $1;
    next unless $block =~ m{(\w+).+?ADDRESS\s+(0x\w+)}ms;
    push @vars, [ $1, $2 ];
}

open my $out_fh, '>', $output;

my $w = max map { length $_->[0] } @vars;
printf $out_fh "%-*s => %s\n", $w, @$_ for [qw/ Name Address /], @vars;

close $out_fh;

Perl print to separate files

I have a text file which lists a service, device and a filter, here I list 3 examples only:
service1 device04 filter9
service2 device01 filter2
service2 device10 filter11
I have written a perl script that iterates through the file and should print device=device filter=filter to a file named according to the service the line belongs to; lines that share a service should have their devices appended to the same file, separated by semicolons. Looking at the above example, I then need this result:
service1.txt
device=device04 filter=filter9
service2.txt
device=device01 filter=filter2 ; device=device10 filter=filter11
Here is my code:
use strict;
use warnings qw(all);

open INPUT, "<", "file.txt" or die $!;
my @Input = <INPUT>;

foreach my $item (@Input) {
    my ($serv, $device, $filter) = split(/ /, $item);
    chomp ($serv, $device, $filter);

    push my @arr, "device==$device & filter==$filter";

    open OUTPUT, ">>", "$serv.txt" or die $!;
    print OUTPUT join(" ; ", @arr);
    close OUTPUT;
}
The problem I am having is that both service1.txt and service2.txt are created, but my results are all wrong, see my current result:
service1.txt
device==device04 filter==filter9
service2.txt
device==device04 filter==filter9 ; device==device01 filter==filter2device==device04 filter==filter9 ; device==device01 filter==filter2 ; device==device10 filter==filter11
I apologise, I know this is something stupid, but it has been a really long night and my brain cannot function properly, I believe.
For each service to have its own file where its data accumulates, you need to work out, for each line, which file to print it to.
Then open a new service file whenever a service without one is encountered. This is feasible since there aren't many services, as clarified in a comment, and it can be organized with a hash of service => filehandle.
use warnings;
use strict;
use feature 'say';

my $file = shift @ARGV || 'data.txt';

my %handle;

open my $fh, '<', $file or die "Can't open $file: $!";

while (<$fh>) {
    my ($serv, $device, $filter) = split;

    if (exists $handle{$serv}) {
        print { $handle{$serv} } " ; device==$device & filter==$filter";
    }
    else {
        open my $fh_out, '>', "$serv.txt" or do {
            warn "Can't open $serv.txt: $!";
            next;
        };
        print $fh_out "device==$device & filter==$filter";
        $handle{$serv} = $fh_out;
    }
}

say {$_} '' for values %handle;    # terminate the line in each file
close $_  for values %handle;
For clarity the code prints almost the same string in both branches, which could surely be made cleaner. This was tested only with the provided sample data and produces the desired output.
Note that when a filehandle is held in an expression that must be evaluated, we need { } around it. See this post, for example.
Comments on the original code (addressed in the code above)
Use lexical filehandles (my $fh) instead of typeglobs (FH)
Don't read the whole file at once unless there is a specific reason for that
split has nice defaults, split ' ', $_, where ' ' splits on whitespace and discards leading and trailing space as well. (And then there is no need to chomp in this case.)
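For example, with one of the input lines in $_, the bare split both splits on whitespace and leaves nothing to chomp:

$_ = "service2 device01 filter2\n";
my ($serv, $device, $filter) = split;    # same as split ' ', $_
print "<$filter>\n";                     # prints <filter2>, the newline is already gone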
Another option is to first collect the data for each service, just as the OP attempts, but again using a hash (service => arrayref/string with data), and to print it all at the end. But I don't see a reason not to print as you go, since you'd need the same logic to decide when the ; needs to be added.
Your code looks pretty perl4-ish, but that's not a problem. As MrTux has pointed out, you are confusing the collection of your data with the fanning out. I have refactored this to use a hash as an intermediate container, with the service names as keys. Please note that this will not accumulate results across multiple calls (as it uses ">" and not ">>").
use strict;
use warnings qw(all);
use File::Slurp qw/read_file/;

my @Input = read_file('file.txt', chomp => 1);

my %store = ();    # Global container

# Capture
foreach my $item (@Input) {
    my ($serv, $device, $filter) = split(/ /, $item);
    push @{ $store{$serv} }, "device==$device & filter==$filter";
}

# Write out a file for each service
foreach my $k (keys %store) {
    open(my $OUTPUT, ">", "$k.txt") or die $!;
    print $OUTPUT join(" ; ", @{ $store{$k} });
    close( $OUTPUT );
}

Multi-column minimal representation discovery in perl

I have a csv of data with about 20 columns, and each column will have more than one distinct value. Each row after the top one, which is the header, is an individual data sample. I want to narrow the list down programmatically so that I have the smallest number of data samples, but with each permutation of the column data still represented.
Example data
SERIAL,ACTIVE,COLOR,CLASS,SEASON,SEATS
.0xb468d47cc9749fb862990426ff79aafb,T,GREEN,BETA,SUMMER,3
.0x847129b35bad62f5837eec30dc07a8a4,T,VIOLET,DELTA,SUMMER,1
.0x14b8df88fd6d6547e387f4caa99e52fd,F,ORANGE,ALPHA,SUMMER,4
.0x0a07fb97224caf79ea73d3fdd5495b8f,T,YELLOW,DELTA,WINTER,1
.0x7d747e689bb27b60198283d7b86db409,F,READ,DELTA,SPRING,2
.0x8247524df49bd19c4c316ee070a2dd4a,T,BLUE,GAMA,WINTER,2
.0x4103ed42af6e8e463708a6c629907fb5,T,YELLOW,ALPHA,SPRING,5
.0xc38deea7f02fbfbcdde1d3718d6decb4,T,YELLOW,DELTA,FALL,5
.0xa3d562edcf64e151d7de08ff8f8e0a94,F,VIOLET,DELTA,SUMMER,3
.0x9da58b3b05603325c24629f700c25c97,T,YELLOW,OMEGA,SPRING,4
.0xef0c0e75083229d654c9b111e3af8726,T,BLUE,GAMA,FALL,1
.0xa9022c8713f0aba2a8e1d20475a3104a,T,YELLOW,BETA,SUMMER,2
.0x5bb5f73e6030730610866cee80cfc2fb,F,ORANGE,BETA,FALL,5
.0xc202e5b43dd65525754fdc52b89e7375,T,BLUE,OMEGA,SUMMER,3
.0xfac9145af33a74aedae7cc0442426432,F,READ,BETA,SPRING,1
.0x457949648053f710b4f2d55cb237a91d,T,GREEN,BETA,SPRING,3
.0xed94d4df300f10f5c4dc5d3ac76cf9e5,F,VIOLET,ALPHA,WINTER,15
.0x870130135beed4cbbe06478e368b40b3,F,YELLOW,ALPHA,SPRING,3
.0x3b6f17841edb9651e732e3ffbacbe14a,T,GREEN,OMEGA,SUMMER,3
.0xfb30e054466b9e4cf944c8e48ff74c93,F,VIOLET,DELTA,SUMMER,8
.0xf741ddc71b4a667585acaa35b67dc6c9,F,BLUE,BETA,FALL,4
.0x60257ad6c299e466086cc6e5bb0a9a33,F,VIOLET,OMEGA,SPRING,1
.0xa5d208bfee5a27a7619ba07dcbdaeea0,T,GREEN,OMEGA,FALL,1
.0x53bc78fa8863e53e8c9fb11c5f6d2320,F,GREEN,GAMA,SPRING,2
.0x5a01253ce5cb0a6aa5213f34f0b35416,T,READ,BETA,WINTER,3
.0xaed9a979ba9f6fbf39895b610dde80f4,T,ORANGE,DELTA,WINTER,1
.0xe7769918e36671af77b5d3d59ea15cfe,T,ORANGE,OMEGA,FALL,4
.0x9e5327a1583332e4c56d29c356dbc5d2,T,INDEGO,ALPHA,WINTER,5
.0x79c5c70732ff04b4d00e81ac3a07c3b7,T,READ,OMEGA,FALL,5
.0x55f54d3c9cd2552e286364894aeef62a,F,READ,GAMA,SPRING,15
Use a hash to record whether you have seen a particular column combination before, and then use that to decide whether to print a particular line.
Here is a rather basic example to demonstrate the idea:
filter.pl
#!/usr/bin/env perl
use warnings;
use strict;

die "usage: $0 file col1,col2,col3, ... coln\n" unless @ARGV;

my ($file, $columns) = @ARGV;
-f $file or die "$file does not exist!";
defined $columns or die "need to pass in columns!";

my @columns;
for my $col ( split /,/, $columns ) {
    die "Invalid column id $col" unless $col >= 1;    # 1-based
    push @columns, $col - 1;                          # 0-based
}
scalar @columns or die "No columns!";

open my $fh, "<", $file or die "Unable to open $file : $!";

my %uniq;
while (<$fh>) {
    chomp();
    next if $. == 1;                 # Skip header
    my (@data) = split /,/, $_;      # Use Text::CSV for any non-trivial csv file
    my $key = join '|', @data[ @columns ];    # key will look like 'foo|bar|baz'
    if (not defined $uniq{ $key } ) {
        print $_ . "\n";             # Print the whole line with the first unique set of columns
        $uniq{ $key } = 1;           # Now we have seen this combo
    }
}
data.csv
SERIAL,TRUTH,IN,PARALLEL
123,TRUE,YES,5
124,TRUE,YES,5
125,TRUE,YES,3
126,TRUE,NO,5
127,FALSE,YES,1
128,FALSE,YES,3
129,FALSE,NO,7
Output
perl filter.pl data.csv 2,3
123,TRUE,YES,5
126,TRUE,NO,5
127,FALSE,YES,1
129,FALSE,NO,7

"Out of memory" error from Perl when processing files

I'm using Perl with Mojo::DOM to process a large batch of text files. I need to count the occurrences of all the words that end with certain suffixes.
Running this code keeps returning out of memory error messages for batches of over, say, 40 files.
Is there any way to accomplish this task more efficiently (less memory usage) than what I'm doing below?
#!/software/perl512/bin/perl
use strict;
use warnings;
use autodie;
use Mojo::DOM;

my $path = "/data/10K/2012";
chdir($path) or die "Cant chdir to $path $!";

# This program counts the total number of suffixes of a form in a given document.
my @sequence;
my %sequences;
my $file;
my $fh;
my @output;

# Reading in the data.
for my $file (<*.txt>) {
    my %affixes;
    my %word_count;

    my $data = do {
        open my $fh, '<', $file;
        local $/;    # Slurp mode
        <$fh>;
    };

    my $dom  = Mojo::DOM->new($data);
    my $text = $dom->all_text();

    for (split /\s+/, $text) {
        if ($_ =~ /[a-zA-Z]+(ness|ship|dom|ance|ence|age|cy|tion|hood|ism|ment|ure|tude|ery|ity|ial)\b/ ) {
            ++$affixes{"affix_count"};
        }
        ++$word_count{"word_count"};
    }

    my $output = join ",", $file, $affixes{"affix_count"}, $word_count{"word_count"};
    push @output, ($output);
}

@output = sort @output;

open(my $fh3, '>', '/home/usr16/rcazier/PerlCode/affix_count.txt');
foreach (@output) {
    print $fh3 "$_\n ";
}
close $fh3;
This is as near as I can get to a solution. It incorporates all the points that have been made in the comments, and it solves the "Out of memory" error by leaving any HTML tags intact. It also leaves the result unsorted, as the original code doesn't really do any useful sorting.
Because of the way you are looking for suffixed words, I think it is very unlikely that leaving the HTML tags in your text files will distort your results significantly.
#!/software/perl512/bin/perl
use strict;
use warnings;
use 5.010;
use autodie;

# Build and compile a regex that will match any of the suffixes that interest
# us, for later use in testing each "word" in the input file
#
my $suffix_re = do {
    my @suffixes = qw/ ness ship dom ance ence age cy tion hood ism ment ure tude ery ity ial /;
    my $alternation = join '|', @suffixes;
    qr/ (?: $alternation ) /xi;
};

# Set the directory that we want to examine. `autodie` will check the success
# of `chdir` for us
#
my $path = '/data/10K/2012';
chdir $path;

# Process every file with a `txt` file type
#
for my $filename ( grep -f, glob('*.txt') ) {

    warn qq{Processing "$filename"\n};

    open my ($fh), '<', $filename;

    my ($suffixes, $word_count) = (0, 0);

    while (<$fh>) {
        for (split) {
            ++$word_count;
            ++$suffixes if /\A[a-z]+$suffix_re\z/i;
        }
    }

    say join ',', $filename, $suffixes, $word_count if $suffixes;
}

How do I push more than one matched group as the same element of an array in Perl?

I need to push all the matched groups into an array.
#!/usr/bin/perl
use strict;

open (FILE, "/home/user/name") || die $!;
my @lines = <FILE>;
close (FILE);

open (FH, ">>/home/user/new") || die $!;

foreach $_ (@lines) {
    if ($_ =~ /AB_(.+)_(.+)_(.+)_(.+)_(.+)_(.+)_(.+)_W.+txt/){
        print FH "$1 $2 $3 $4 $5 $6 $7\n";    # needs to be first element of array
    }
    elsif ($_ =~ /CD_(.+)_(.+)_(.+)_(.+)_(.+)_(.+)_W.+txt/){
        print FH "$1 $2 $3 $4 $5 $6\n";       # needs to be second element of array
    }
}
close (FH);
_ INPUT _
AB_ first--2-45_ Name_ is34_ correct_ OR_ not_W3478.txt
CD_ second_ input_ 89-is_ diffErnt_ 76-from_Wfirst6.txt
Instead of writing the matched groups to FILE, I want to push them into an array. I can't think of any command other than push, but that function does not seem to accept more than one argument. What is the best way to do this? The output should look like the following after pushing the matched groups into the array.
_ OUTPUT _
$array[0] = first--2-45 Name is34 correct OR not
$array[1] = second input 89-is diffErnt 76-from
Use the same argument for push that you use for print: A string in double quotes.
push @array, "$1 $2 $3 $4 $5 $6 $7";
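To make the difference concrete, here is a standalone sketch: handing push a list adds several elements, while interpolating the captures into one double-quoted string adds a single element, which is what you want here.

my @array;
"AB_x_y_z" =~ /AB_(.+)_(.+)_(.+)/;
push @array, $1, $2, $3;      # three separate elements: 'x', 'y', 'z'
push @array, "$1 $2 $3";      # one joined element: 'x y z'
print scalar @array, "\n";    # 4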
Take a look at perldoc -f grep, which returns a list of all elements of a list that match some criterion.
And incidentally, push does take more than one argument: see perldoc -f push.
push @matches, grep { /your regex here/ } @lines;
You didn't include the code leading up to this, though; some of it is a little odd, such as the use of $_ as a function call. Are you sure you want to do that?
If you are using Perl 5.10.1 or later, this is how I would write it.
#!/usr/bin/perl
use strict;
use warnings;
use 5.10.1; # or use 5.010;
use autodie;
my @lines = do {
    # don't need to check for errors, because of autodie
    open( my $file, '<', '/home/user/name' );
    grep { chomp } <$file>;
    # $file is automatically closed
};

# use 3 arg form of open
open( my $file, '>>', '/home/user/new' );

my @matches;
for ( @lines ) {
    if ( /(?:AB|CD)( (?:_[^_]+)+ )_W .+ txt/x ) {
        my @match = "$1" =~ /_([^_]+)/g;

        say {$file} "@match";

        push @matches, \@match;
        # or
        # push @matches, [ "$1" =~ /_([^_]+)/g ];
        # if you don't need to print it in this loop.
    }
}
close $file;
This is a little bit more permissive about its inputs, but the regex should be a little more "correct" than the original.
Remember that a capturing match in list context returns the captured fields, if any:
#!/usr/bin/perl
use strict; use warnings;

my $file = '/home/user/name';

open my $in, '<', $file
    or die "Cannot open '$file': $!";

my @matched;

while ( <$in> ) {
    my @fields;
    if (@fields = /AB_(.+)_(.+)_(.+)_(.+)_(.+)_(.+)_(.+)_W.+txt/
        or @fields = /CD_(.+)_(.+)_(.+)_(.+)_(.+)_(.+)_W.+txt/)
    {
        push @matched, "@fields";
    }
}

use Data::Dumper;
print Dumper \@matched;
Of course, you could also do
push @matched, \@fields;
depending on what you intend to do with the matches.
I wonder if using push and giant regexes is really the right way to go.
The OP says he wants lines starting with AB at index 0, and those with CD at index 1.
Also, those regexes look like an inside-out split to me.
In the code below I have added some didactic comments that point out why I am doing things differently than the OP and the other solutions offered here.
#!/usr/bin/perl
use strict;
use warnings;    # best use warnings too. strict doesn't catch everything

my $filename = "/home/user/name";

# Using 3 argument open addresses some security issues with 2 arg open.
# Lexical filehandles are better than global filehandles; they prevent
# most accidental filehandle name collisions, among other advantages.
# The low precedence 'or' operator helps prevent incorrect binding of die
# with open's args.
# An expanded error message is more helpful.
open( my $inh, '<', $filename )
    or die "Error opening input file '$filename': $!";

my @file_data;

# Process the file with a while loop.
# This is VERY important when dealing with large files.
# for would read the whole file into RAM.
# for/foreach is fine for small files.
while ( my $line = <$inh> ) {
    chomp $line;

    # Simple regex captures the data type indicator and the data.
    if ( $line =~ /(AB|CD)_(.*)_W.+txt/ ) {

        # Based on the type indicator we set variables
        # used for validation and data access.
        my ( $index, $required_fields ) = $1 eq 'AB' ? ( 0, 7 )
                                        : $1 eq 'CD' ? ( 1, 6 )
                                        :              ();
        next unless defined $index;

        # Why use a complex regex when a simple split will do the same job?
        my @matches = split /_/, $2;

        # Here we validate the field count, since split won't check that for us.
        unless ( @matches == $required_fields ) {
            warn "Incorrect field count found in line '$line'\n";
            next;
        }

        # Warn if we have already seen a line with the same data type.
        if ( defined $file_data[$index] ) {
            warn "Overwriting data at index $index: '@{$file_data[$index]}'\n";
        }

        # Store the data at the appropriate index.
        $file_data[$index] = \@matches;
    }
    else {
        warn "Found non-conformant line: $line\n";
    }
}
Be forewarned: I just typed this into the browser window, so while the code should be correct, there may be typos or missed semicolons lurking. It's untested; use it at your own peril.