How do I parse this file and store it in a table? - perl

I have to parse a file and store it in a table. I was asked to use a hash to implement this. Give me simple means to do that, only in Perl.
-----------------------------------------------------------------------
L1234| Archana20 | 2010-02-12 17:41:01 -0700 (Mon, 19 Apr 2010) | 1 line
PD:21534 / lserve<->Progress good
------------------------------------------------------------------------
L1235 | Archana20 | 2010-04-12 12:54:41 -0700 (Fri, 16 Apr 2010) | 1 line
PD:21534 / Module<->Dir,requires completion
------------------------------------------------------------------------
L1236 | Archana20 | 2010-02-12 17:39:43 -0700 (Wed, 14 Apr 2010) | 1 line
PD:21534 / General Page problem fixed
------------------------------------------------------------------------
L1237 | Archana20 | 2010-03-13 07:29:53 -0700 (Tue, 13 Apr 2010) | 1 line
gTr:SLC-163 / immediate fix required
------------------------------------------------------------------------
L1238 | Archana20 | 2010-02-12 13:00:44 -0700 (Mon, 12 Apr 2010) | 1 line
PD:21534 / Loc Information Page
------------------------------------------------------------------------
I want to read this file and I want to perform a split or whatever to extract the following fields in a table:
the id that starts with L should be the first field in a table
Archana20 must be in the second field
timestamp must be in the third field
PD must be in the fourth field
Type (content preceding / must be in the last field)
My questions are:
How to ignore the --------… (separator line) in this file?
How to extract the above?
How to split since the file has two delimiters (|, /)?
How to implement it using a hash and what is the need for this?
Please provide some simple means so that I can understand since I am a beginner to Perl.

You will probably be working through the file line by line in a loop. Take a look at perldoc -f next. You can use regular expressions or a simpler match in this case, to make sure that you only skip appropriate lines.
You need to split first and then handle each field as needed after, I would guess.
Split on the primary delimiter (which appears to be ' | ' - more on that in a minute), then split the final field on its secondary delimiter afterwards.
I'm not sure if you are asking whether you need a hash or not. If so, you need to pick which item will provide the best set of (unique) keys. We can't do that for you since we don't know your data, but the first field (at a glance) looks about right. As for how to get something like this into a more complex data structure, you will want to look at perldoc perldsc eventually, though it might only confuse you right now.
One other thing, your data above looks like it has a semi-important typo in the first line. In that line only, there is no space between the first field and its delimiter. Everywhere else it's ' | '. I mention this only because it can matter for split. I nearly edited this, but maybe the data itself is irregular, though I doubt it.
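The two-stage split described above can be sketched as follows. This is a minimal illustration, not a complete solution: the helper names `parse_header` and `parse_detail` are made up for this example, and the whitespace-tolerant patterns are an assumption chosen so that the irregular first line (`L1234|` with no space) still splits cleanly.

```perl
use strict;
use warnings;

# Hypothetical helper: split a header line on '|', tolerating missing spaces.
sub parse_header {
    my ($line) = @_;
    return split /\s*\|\s*/, $line;
}

# Hypothetical helper: split a detail line on its first '/' only.
sub parse_detail {
    my ($line) = @_;
    return split m{\s*/\s*}, $line, 2;
}

my @sample = (
    '------------------------------------------------------------------------',
    'L1234| Archana20 | 2010-02-12 17:41:01 -0700 (Mon, 19 Apr 2010) | 1 line',
    'PD:21534 / lserve<->Progress good',
);

for my $line (@sample) {
    next if $line =~ /^-+$/;    # skip separator lines
    my @fields = $line =~ /\|/ ? parse_header($line) : parse_detail($line);
    print join(' :: ', @fields), "\n";
}
```

Putting `\s*` on both sides of the delimiter means the fields come back already trimmed, so the missing space in the first record no longer matters.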
I don't know how much of a beginner you are to Perl, but if you are completely new to it, you should think about a book (online tutorials vary widely and many are terribly out of date). A reasonably good introductory book is freely available online: Beginning Perl. Another good option is Learning Perl and Intermediate Perl (they really go together).

When you say "This is not a homework... this will be a start to assess me in perl", I assume you mean that this is perhaps the first assignment you have at a new job or something, in which case it seems that if we just give you the answer it will actually harm you later, since they will assume you know more about Perl than you do.
However, I will point you in the right direction.
A. Don't use split, use regular expressions. You can learn about them by googling "perl regex"
B. Google "perl hash" to learn about perl hashes. The first result is very good.
Now to your questions:
regular expressions will help you ignore lines you don't want
regular expressions will extract items. Look up "capture variables"
Don't split, use regex
See point B above.
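To make the "capture variables" hint concrete, here is a small sketch of a regex that pulls the first three fields out of a header line. The sub name `parse_header_re` and the exact pattern are assumptions for illustration; the pattern only keeps the date and time part of the timestamp, which may or may not be what you want.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (id, name, timestamp) for a header line,
# or an empty list for separator lines and anything else that does not match.
sub parse_header_re {
    my ($line) = @_;
    my ($id, $name, $timestamp) =
        $line =~ /^(L\d+)\s*\|\s*(\S+)\s*\|\s*(\d{4}-\d\d-\d\d \d\d:\d\d:\d\d)/
        or return;
    return ($id, $name, $timestamp);
}

for my $line (
    '------------------------------------------------------------------------',
    'L1235 | Archana20 | 2010-04-12 12:54:41 -0700 (Fri, 16 Apr 2010) | 1 line',
) {
    my @fields = parse_header_re($line) or next;    # non-matching lines are skipped
    print join(', ', @fields), "\n";
}
```

Because the match is done in list context, the captures land directly in named variables and the separator lines fall out for free: they simply do not match.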

If this file is line based then you can do a line by line based read in a while loop. Then skip those lines that aren't formatted how you wish.
After that, you can use a regex as indicated in the other answer. I'd use that to split each line up into an array and build a hash of lists for the records. Either after that (or before), clean up each record by trimming whitespace etc. If you use a regex, use capture groups to add to your list in that fashion. It's up to you.
The hash key is the first column, the list contains everything else. If you are just doing a direct insert, you can get away with a list of lists and just put everything in that instead.
The key for the hash would allow you to look at particular records for fast lookup. But if you don't need that, then an array would be fine.
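The difference between the two structures might look like this in practice. The records below are shortened, made-up versions of the question's data, and `%by_id` is just an illustrative name.

```perl
use strict;
use warnings;

# List of lists: fine when records are only processed in order.
my @records = (
    [ 'L1234', 'Archana20', 'PD:21534' ],
    [ 'L1237', 'Archana20', 'gTr:SLC-163' ],
);

# Hash of lists: the first column becomes the key, the rest is the value.
my %by_id = map {
    my ($id, @rest) = @$_;
    $id => \@rest;
} @records;

# Direct lookup by id, with no need to scan the whole list.
print "L1237 -> @{ $by_id{L1237} }\n";
```

The hash only pays off if the ids are unique and you actually need to look records up by id; otherwise the plain list is simpler.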

You can try this one.
Points to know:
Read the file line by line.
Use a regular expression to skip the '----' lines.
After that, use the split function to populate a hash of arrays.
#!/usr/bin/perl
use strict;
use warnings;
my $test_file = 'test.txt';
open(my $in, '<', $test_file) or die "Cannot open $test_file: $!";
my (%seen, $id, $name, $timestamp, $PD, $type);
while (my $line = <$in>) {
    chomp $line;
    next if $line =~ m/^-/;    # skip the '---' separator lines
    if ($line =~ /\|/) {
        # header line: id | name | timestamp | line count
        ($id, $name, $timestamp) = split /\s*\|\s*/, $line, 4;
    } else {
        # detail line: PD / type
        ($PD, $type) = split /\s*\/\s*/, $line, 2;
        $seen{$id} = [$name, $timestamp, $PD, $type];    # hash of arrays
    }
}
close($in);
for my $test (sort keys %seen) {
    my $test1 = $seen{$test};
    print "$test:@{$test1}\n";
}

Related

Using Powershell to remove illegal CRLF from csv row

Gentle Reader,
I have a year's worth of vendor csv files sitting in a directory. My task is to load them into a SQL Server DB as a "Historical Load". The files are mal-formed and while we are working with the vendor to re-send 365 new, properly structured files, I have been tasked with trying to work with what we have.
I'm restricted to using either C# (as a script task in SSIS) or Powershell.
Each file has no header but the schema is known and built into the SSIS package connection.
Each file has approx 35k rows and roughly a few dozen mal-formed rows per file.
Each properly formed row consists of 122 columns, 121 commas.
Rows are NOT text qualified.
Example: (data cleaned of PII)
555222,555222333444,1,HN71232,1/19/2018 8:58:07 AM,3437,27.50,HECTOR EVERYMAN,25-Foot Garden Hose - ,1/03/2018 10:17:24 AM,,1835,,,,,online,,MERCH,1,MERCH,MI,,,,,,,,,,,,,,,,,,,,6611060033556677,2526677,,,,,,,,,,,,,,EVERYMAN,,,,,,,,,,,,,,,,,,,,,,,,,,,,,VWDEB,,,,,,,555666NA118855,2/22/2018 12:00:00 AM,,,,,,,,,,,,,,,,,,,,,2121,,,1/29/2018 9:50:56 AM,0,,,[CRLF]
555222,555222444888,1,CASUAL50,1/09/2018 12:00:00 PM,7000,50.00,JANE SMITH,$50 Casual Gift Card,1/19/2018 8:09:15 AM,1/29/2018 8:19:25 AM,1856,,,,,online,,FREE,1,CERT,GC,,,,,,,6611060033553311[CRLF]
,6611060033553311[CRLF]
,,,,,,,,,25,,,6611060033556677,2556677,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,CASUAL25,VWDEB,,,,,,,555222NA118065,1/22/2018 12:00:00 AM,,,,,,,,,,,,,,,,,,,,,,,,1/19/2018 12:00:15 PM,0,,,[CRLF]
555222,555222777666,1,CASHCS,1/12/2018 10:31:43 AM,2500,25.00,BOB SMITH,BIG BANK Rewards Cash Back Credit [...6S66],,,1821,,,,,online,,CHECK,1,CHECK,CK,,,,,,,,,,,,,,,,,,,,555222166446,5556677,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,VWDEB,,,1/23/2018 10:30:21 AM,,,,555666NA118844,1/22/2018 12:00:00 AM,,,,,,,,,,,,,,,,,,,,,,,,1/22/2018 10:31:26 AM,0,,,[CRLF]
Powershell Get-Content (I think...) reads the entire file into an array where each row is identified by the CRLF as the terminator. This means (again, I think) that mal-formed rows will be treated as elements of the array without respect to how many "columns" each holds.
C# Streamreader also uses CRLF as a marker but a streamreader object also has a few methods available like Peek and Read that may be useful.
Please, Oh Wise Ones, point me in the direction of least resistance. Using Powershell, as a script to process mal-formed csv files such that CRLFs that are not EOL are removed.
Thank you.
Based on @vonPryz's design but in (Native¹) PowerShell:
$Delimiters = 121
Get-Content .\OldFile.csv |ForEach-Object { $Line = '' } {
    if ($Line) { $Line += ',' + $_ } else { $Line = $_ }
    $TotalMatches = ($Line |Select-String ',' -AllMatches).Matches.Count
    if ($TotalMatches -ge $Delimiters ) {
        $Line
        $Line = ''
    }
} |Set-Content .\NewFile.Csv
1) I guess performance might be improved by avoiding += and using .NET methods along with stream readers and writers.
Honestly, your best bet is to get good data from the supplier. Trying to work around a mess will just cause problems later on. Garbage in, garbage out. Since it's you who wrote the garbage data in the database, congratulations, it's now your fault that the DB data is of poor quality. Please talk with your manager and the stakeholders first, so that you have in writing an agreement that you didn't break the data and it was broken to start with. I've seen such problems on ETL processing all too often.
A quick and dirty pseudocode without error handling, edge case processing, substring index assumptions, performance guarantees and whatnot goes like so,
while(dataInFile)
    line = readline()
    :parseLine
    commasInLine = countCommas(line)
    if commasInLine == rightAmount
        addLineInOKBuffer(line)
    else
        commasNeeded = rightAmount - commasInLine
        if commasNeeded < 0
            # too many commas, two lines are combined
            lastCommaLocation = getLastCommaIndex(line, commasNeeded)
            addLineInOKBuffer(line.substring(0, lastCommaLocation))
            line = line.substring(lastCommaLocation, line.end)
            goto :parseLine
        else
            # too few commas, need to read next line too
            line = line.removeCrLf() + readline()
            goto :parseLine
The idea is that first you look for a line and count how many commas there are. If the count matches what's expected, the row is not broken. Store it in a buffer containing good data.
If you have too many commas, then the row contains at least two different elements. Then find the index of where the first element ends, extract it and store it in the good data buffer. Then remove already processed part of the line and start again.
If you have too few commas, then the row has been split by a newline. Read the next line from the file, join it with the current line and start the parsing again from counting the commas.
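The too-few-commas branch of the pseudocode can be sketched in a few lines. The question asks for PowerShell or C#; the version below is written in Perl purely to illustrate the counting logic, and `rejoin_rows` is a hypothetical name. It assumes the lines are already stripped of their CRLF, that fragments are joined with no separator, and it does not attempt the too-many-commas case at all.

```perl
use strict;
use warnings;

# Hypothetical helper: joins physical lines until each logical row
# holds at least $expected commas. Lines are assumed to be chomped already.
sub rejoin_rows {
    my ($expected, @lines) = @_;
    my @rows;
    my $buffer = '';
    for my $line (@lines) {
        $buffer .= $line;
        my $commas = ($buffer =~ tr/,//);    # tr with empty replacement only counts
        if ($commas >= $expected) {
            push @rows, $buffer;
            $buffer = '';
        }
    }
    push @rows, $buffer if length $buffer;   # keep any incomplete tail for inspection
    return @rows;
}

# Toy data with 3 commas per good row; the second row is broken in two.
my @fixed = rejoin_rows(3, 'a,b,c,d', 'e,f', ',g,h', 'i,j,k,l');
print "$_\n" for @fixed;
```

Keeping the incomplete tail instead of dropping it makes it easier to spot rows that never reach the expected comma count.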

Getting Error of Modification of a read-only value attempted

I am trying to select the below value from database:
Reporting that one of #its many problems had been the recent# extended
sales slump in women's apparel, the seven-store retailer said it would
start a three-month liquidation sale in all of its stores.~(A) its
many problems had been the recent~(B) its many problems has been the
recently~(C) its many problems is the recently~(D) their many problems
is the recent~(E) their many problems had been the recent~
I am selecting this value into the variable $ques and then extracting a piece of text as below:
$ques=~s/^(.*?)\#(.*?)\#(.*?)$/$2/;
Now, while replacing the ~ character in the string by
$3=~s/~/\n/g; ---->line 171
and running the script, I am getting one error as:
Modification of a read-only value attempted at main.pl line 171
I want to replace all the ~ character with '\n' and print the final value. Please suggest how to do it.
I have researched this on the net, but I am confused about how to handle these read-only variables.
You've already got a good explanation of the problem from José Castro. But there's another solution if you're using a recent-ish version of Perl (Update: having checked more carefully, I find that means 5.14+). The /r modifier to the substitution operator will copy your string, make the substitution on the copy and then return that altered value.
So you could write:
my $new_value = $3 =~ s/~/\n/rg;
It sounds like what you really want in this case is split rather than regular expression capture groups:
my @parts = split(/#/, $ques);
$parts[2] =~ s/~/\n/g;
It makes the intent of your code clearer since you are, in fact, splitting on # symbols.
Just like you say, the special variables $1, $2, etc., are read-only, and that means that you can't perform that substitution on them.
Performing the substitution on $ques will do what you need:
$ques =~ s/~/\n/g;
print $ques;
Do note that in the earlier substitution that you're performing on $ques you're getting rid of all the ~ characters.
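Another way to sidestep the read-only capture variables is to copy the captures into ordinary variables by running the match in list context. The sketch below uses a shortened, made-up version of the question text, and `split_question` is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text between the '#' marks and the
# text after them, with every '~' in the latter turned into a newline.
sub split_question {
    my ($ques) = @_;
    my ($before, $marked, $after) = $ques =~ /^(.*?)#(.*?)#(.*)$/s
        or return;
    $after =~ s/~/\n/g;    # $after is a plain copy, so it can be modified
    return ($marked, $after);
}

my ($marked, $options) = split_question(
    'Reporting that #its many problems# was bad.~(A) its many~(B) their many~'
);
print "$marked\n$options";
```

The /s modifier is there so that `.` also matches newlines, in case the value from the database spans several lines.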

Convert date in YYYY-MM-DD in Perl?

I have searched the internet for some time now but I can't seem to find the right answer (maybe there is one but I don't understand it).
I have this code to read a file and get the date time (from Task Scheduler query).
File.txt holds the Task Scheduler query.
"TaskName","Next Run Time","Status"
"CheckFile","20:33:00, 17/1/2013",""
=======================================
Script to read the file and get the next run time value.
open (FILE, "<", $file) || print "WARN: Cannot open $file: $!";
@logLines = <FILE>;
if (@list = grep(/\b$keyword\b/, @logLines)) {
    foreach(@list){$result = $_;}
    my @sresult = split(/(?<="),(?=")/, $result);
    $name = $sresult[0];
    $name =~ tr/"//d;
    $next_run = $sresult[1];
    $next_run =~ tr/"//d;
    print $next_run;
}
@list=();
@dFormat=();
@logLines=();
close FILE;
Output will be:
20:33:00, 17/1/2013
I want to modify the output into:
20:33:00, 2013-1-17 #note that I can do this just by splitting up and rearranging the numbers.
But the problem is, 17/1/2013 in Task Scheduler query is locale dependent. It could be in the following:
1/17/2013
17/1/2013
2013/1/17
1/17/13
17/1/13
13/1/17
1-17-2013
17-1-2013
2013-1-17
1-17-13
17-1-13
13-1-17
1.17.2013
17.1.2013
2013.1.17
1.17.13
17.1.13
13.1.17
Is there any cpan module that could do what I want? Could you give a script on how to achieve this?
Please no harsh comment. Thanks.
The following should get you the format:
use Win32::OLE::NLS qw( GetLocaleInfo GetSystemDefaultLCID LOCALE_SSHORTDATE );
say GetLocaleInfo(GetSystemDefaultLCID(), LOCALE_SSHORTDATE); # yyyy-MM-dd
You could try to find a date parser that understands that format, or you could use something like the following to create a format that many parsers understand.
my %subs = (
'yyyy' => '%Y',
...
);
my $pat = join '|', map quotemeta, sort { length($b) <=> length($a) } keys %subs;
$format =~ s{($pat)}{ $subs{$1} // $1 }eg;
I can give you an answer (an algorithm) for your problem:
Once upon a time
Let's think about your problem:
You want to get date from every format.
But this is not possible, because it can be 12-11-10
Okay, now you must determine (somehow) which format is used.
If you can do that, you can rule the world.
Let's think about your problem more deeply:
First you must choose delimiter. (or not, if you don't want it)
it is possible to use something like this:
(\d{1,4})(\D)(\d{1,4})(\D)(\d{1,4})
After that, you have:
$1,$3,$5 # parts of data
$2,$4 # delimiters
I took 4 as the maximum digit count because I don't know where the year can be placed.
After that, you must understand, which format you have:
dd-mm-yy
mm-dd-yy
yy-mm-dd
yy-dd-mm
You can add checks for that like days or months, but, as I said before:
12-11-10 = -9 <- not possible to determine, right?
So, you must have some external info about date format, for example:
which country
which branch of science
which family
etc
it belongs to.
If you can do that, you can (probably can) determine format and 12-11-10 = 12 nov 2010
The End
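Once the field order is known from some external source, the rearranging itself is short. The sketch below assumes the order is passed in as a string such as 'dmy', 'mdy' or 'ymd', and that two-digit years belong to the 2000s; both are assumptions for illustration, and `normalize_date` is a hypothetical name. It does not try to guess the order.

```perl
use strict;
use warnings;

# Hypothetical helper: $order names the position of day, month and year,
# e.g. 'dmy' for 17/1/2013. Any single non-digit is accepted as delimiter.
sub normalize_date {
    my ($date, $order) = @_;
    my @parts = $date =~ /^(\d{1,4})\D(\d{1,4})\D(\d{1,4})$/
        or return;
    my %field;
    @field{ split //, $order } = @parts;     # map 'd', 'm', 'y' to the captures
    $field{y} += 2000 if $field{y} < 100;    # assumption: 13 means 2013
    return sprintf '%04d-%02d-%02d', @field{qw(y m d)};
}

print normalize_date('17/1/2013', 'dmy'), "\n";
print normalize_date('1.17.13',   'mdy'), "\n";
print normalize_date('13-1-17',   'ymd'), "\n";
```

All three calls describe the same day, so they should print the same zero-padded date.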
Update: see this discussion. It appears that the date format may be a regional setting based on the user who runs the task. If so, all you have to do is figure out how to get that regional setting...
As others have pointed out, there is no way of resolving the ambiguities in the potential date formats.
What I would explore is some way of querying the system with a known date and see what format it returns. For example, perhaps you could schedule a fake task at a (non-ambiguous) date far in the future, see what format that task is returned in, then later delete it.
That would be a bit of a messy solution, but perhaps there is a less kludgy way of doing something similar. Is task scheduler's date format the same as the system date format? If so, you could query a known date from the system.

perl sequence extraction loop

I have an existing perl one-liner (from the Edwards lab) that works wonderfully to read a text file (named ids.file) that contains one column of IDs and searches a second, specially formatted text file (named fasta.file in this example - in "fasta" format for those who know bioinformatics) and returns sequences that match the ID from the first file. I was hoping to expand this script to do two additional things:
The current perl one-liner only seems to work if the ids.file contains one column of data. I would like it to work on a file that contains two columns (separated by spaces), and act on the second column of data (well, really any column of data, but I assume that it will be obvious enough to adapt it if someone can give an example using a second column)
I would like to append any results returned by the search to a third column, instead of just to a new file.
If someone is kind enough to offer an example but only has time or inclination to work on one of these, I would prefer that you try to solve #2 - I have come close to solving #1 with a for loop that uses awk to only use the Perl code on the second column - I haven't gotten it yet, but am close, so #2 seems like the harder one to me.
The perl one liner is as follows:
perl -ne 'if(/^>(\S+)/){$c=$i{$1}}$c?print:chomp;$i{$_}=1 if @ARGV' ids.file fasta.file
I appreciate any help you can give!
Not quite sure but will this do?
perl -ne 'chomp; s/^>(\S+).*/$c=$i{$1}/e; print if $c;
$i{(/^\S*\s(\S*)$/)[0]}="$_ " if @ARGV' ids.file fasta.file
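For readers who find the one-liners hard to follow, the same idea can be spelled out as a small script. This is only a sketch under assumptions: the sub names `load_ids` and `sequences_for` are made up, the inline sample data stands in for ids.file and fasta.file, and each sequence is concatenated onto one line before being appended as a third column.

```perl
use strict;
use warnings;

# Collect IDs from the given zero-based column of whitespace-separated lines.
sub load_ids {
    my ($column, @lines) = @_;
    my %ids;
    for my $line (@lines) {
        my @cols = split ' ', $line;
        $ids{ $cols[$column] } = 1 if defined $cols[$column];
    }
    return %ids;
}

# Map each wanted ID to its concatenated sequence from FASTA-formatted lines.
sub sequences_for {
    my ($ids, @fasta) = @_;
    my (%seq, $current);
    for my $line (@fasta) {
        chomp(my $copy = $line);
        if ($copy =~ /^>(\S+)/) {
            $current = $ids->{$1} ? $1 : undef;    # track only wanted records
            $seq{$current} = '' if defined $current;
        } elsif (defined $current) {
            $seq{$current} .= $copy;
        }
    }
    return %seq;
}

my @id_lines = ('sampleA seq1', 'sampleB seq3');
my @fasta    = ('>seq1 desc', 'ACGT', 'TTAA', '>seq2', 'GGGG', '>seq3', 'CCCC');

my %ids = load_ids(1, @id_lines);              # 1 = second column
my %seq = sequences_for(\%ids, @fasta);

for my $line (@id_lines) {
    my $id = (split ' ', $line)[1];
    print join(' ', $line, $seq{$id} // ''), "\n";    # sequence as third column
}
```

Changing the first argument of `load_ids` selects a different column, which covers the first request; the final loop covers the second.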

How can I make $1 return alternatives without a substitution regex?

The project I recently joined abstracts logic into code and database elements. Business logic like xPaths, regular expressions and function names are entered in the database, while general code like reading files, creating xml from xpaths, etc are in the code base.
Most (if not all) of the methods that use regular expressions are structured thus:
if ( $entry =~ /$regex/ ) { $req_value = $1; }
This means that only $1 is available and you always have to write your regex to give you your desired result in $1.
The issue:
The result for the following strings should be either
'2.6.9-78.1.6.ELsmp (SMP)' or '2.6.9-78.1.6.ELsmp'
depending on the existence of SMP. $1 does not suffice for $entry[0].
$entry[0] = qq|Linux version 2.6.9-78.1.6.ELsmp (brewbuilder\@hs20-bc2-2.build.redhat.com) (gcc version 3.4.6 20060404 (Red Hat 3.4.6-10)) #1 SMP Wed Sep 24 05:41:12 EDT 2008|;
$entry[1] = qq|Linux version 2.6.9-78.0.5.ELsmp (brewbuilder\@hs20-bc2-2.build.redhat.com) (gcc version 3.4.6 20060404 (Red Hat 3.4.6-10)) #1 Wed Sep 24 05:41:12 EDT 2008|;
Hence my solution:
my $mutable = '';
my $regex = qr/((\d.*?)\s+(?:.*)?(SMP)((?{$mutable="$2 ($3)"}))|(\d.*?))\s+/;
if ($entry[$i] =~ /$regex/) {
$req_value = $1;
$req_value = $mutable if ($mutable ne '');
$mutable = '';
}
Unfortunately, the existence of a 'variable' in the database makes this solution unacceptable.
My questions are:
How can I clean up the above solution to make it acceptable with the structure available?
or
How can I use a substitution regex with the structure 'if ($entry =~ /$regex/)'?
Thanks.
You're stuck unless you can talk the folks who control the code you're using into generalizing it somehow. The good news is you need only a bit more, perhaps
if (my @fields = $_ =~ /$pat/) {
    $req_value = join " " => grep defined($_), @fields;
}
This works because a successful regular-expression match in list context returns all captured substrings, i.e., $1, $2, $3, and so on as appropriate.
With a single pattern,
qr/(\d+(?:[-.]\w+)*)(?:.*(SMP))?/
the code above yields 2.6.9-78.1.6.ELsmp SMP and 2.6.9-78.0.5.ELsmp in $req_value. The grep defined($_) filters out captures for subpatterns not taken. Without it, you get undefined value warnings for the non-SMP case.
The downside is every regular expression would need to be reviewed to be sure that all capturing groups really ought to go in $req_value. For example, say someone is using the pattern
qr/(XYZ) OS (version \d+|v-\d+)/
As it is now, only XYZ would go into $req_value, but using the above generalization would also include the version number. If that's undesired, the regular expression should be
qr/(XYZ) OS (?:version \d+|v-\d+)/
because (?:...) does not capture (that is, it does not produce a $2 for the pattern above): it's for grouping only.
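Put together, the generalized match might look like the sketch below. The sub name `req_value` is made up, and the two entries are shortened versions of the strings from the question; single quotes are used so that nothing in them is interpolated.

```perl
use strict;
use warnings;

my $pat = qr/(\d+(?:[-.]\w+)*)(?:.*(SMP))?/;

# Hypothetical wrapper around the generalized match: joins every capture
# that actually took part in the match.
sub req_value {
    my ($entry) = @_;
    if (my @fields = $entry =~ /$pat/) {
        return join ' ', grep { defined } @fields;
    }
    return;
}

my @entry = (
    'Linux version 2.6.9-78.1.6.ELsmp (gcc version 3.4.6) #1 SMP Wed Sep 24 05:41:12 EDT 2008',
    'Linux version 2.6.9-78.0.5.ELsmp (gcc version 3.4.6) #1 Wed Sep 24 05:41:12 EDT 2008',
);
print req_value($_), "\n" for @entry;
```

The second entry has no uppercase SMP, so its second capture is undefined and is filtered out by the grep.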
I don't fully understand your constraints. Are you limited to supplying a single regex that will always be processed using the code in your first excerpt? If so, you cannot do what you are trying to do. You are trying to extract two separate parts of the entry string, and you simply can't return two values in a single scalar return value unless you can add the code to concatenate them.
Can you add perl code at all? For example, can you define the logic to be:
if ( $entry =~ /$regex/ ) { $req_value = "$1 $2"; }
where your $regex = qr/(\d.*?)\s+(?:.*)?(SMP)/; ?
Barring the ability to define some new perl code, you can't accomplish this.
Regarding part two, substitutions: I interpret your question as asking whether you can compile both the PATTERN and REPLACEMENT parts of s/PATTERN/REPLACEMENT/ into a single qr//. If so, you cannot. qr// only compiles a matching pattern, and a qr variable can only be used in the PATTERN portion of a substitution. In other words, to use s///, you'll need to write perl code that runs s///. I'm guessing that if you could write new perl code, you'd use the above solution.
One more thought: In your current architecture, can you define fields in terms of other fields? In other words, could you extract the version string with one regex, the SMP string with another regex, and define a third field that combines the two?
As of 5.10.0, (?|pattern) is available to allow alternatives to use the same capture numbering. As you pointed out that you're still using 5.8, this may not be useful directly but perhaps as further incentive to your project to start moving to a modern Perl.
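As a small illustration of what the branch-reset group does (on Perl 5.10 or later), the sketch below reuses the made-up "XYZ OS" strings from the earlier answer. Whichever alternative matches, its capture lands in $1, which fits the existing `$req_value = $1` structure. It does not by itself concatenate two captures, so it is not a full answer to the SMP case.

```perl
use strict;
use warnings;
use 5.010;    # (?|...) needs Perl 5.10 or later

# Both alternatives number their capture group as 1.
my $pat = qr/OS (?|version (\d+)|v-(\d+))/;

# Hypothetical helper mirroring the "if match, take $1" structure.
sub version_of {
    my ($entry) = @_;
    return $entry =~ /$pat/ ? $1 : undef;
}

for my $entry ('XYZ OS version 7', 'XYZ OS v-12') {
    print "$entry => ", version_of($entry), "\n";
}
```

Without the `?|`, the second alternative's digits would end up in $2 and $1 would be undefined for the 'v-12' string.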