Output particular sections of txt file to csv? - perl
Say I have a txt file like below (it's obviously not 'text text text', I'm just showing that it's blocks of irrelevant text)
text text text
text text text
text text text
important section age=30
name=mike
text text text
text text text
text text text
I want to parse it and output only the 'important section' to csv so that my csv would look like below, i.e. age in one column and name in another
age name
30 mike
How should I go about this? Perl? Sed? I'm not that familiar with either but hoping there is a straightforward enough solution.
Choroba actually answered the above perfectly for me, but I fear I oversimplified my actual text file too much; it is more like below
Something:
this
Something else:
that
Something else:
etc.
Sales
2011 Sales:
€3,000
()
2010 Sales:
€2,000
()
2011 Growth Rate:
50.00%
Contact Details
And the output I would ideally like is
2011 Sales 2010 Sales 2011 Growth Rate
3,000 2,000 50.00%
This, unfortunately, greatly complicates things. The output doesn't have to be exactly like the above, but as close as possible.
Perl solution. It keeps a flag telling whether we are in the important section. Everything important is remembered in an array and printed at the end:
perl -nE '$i = 1 if s/important section //;
          push @t, [$1, $2] if $i and /(.*)=(.*)/;
          }{
          for my $i (0, 1) {
              say join "\t", map $_->[$i], @t
          }' file.txt
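The one-liner above targets the original key=value layout. For the second layout, where each label line ends in a colon and its value sits on the next line, a sketch along the following lines might work. It assumes the interesting block runs from the line reading Sales to the line reading Contact Details, that the "()" lines can be skipped, and that the leading € should be stripped; none of that comes from an actual answer, so treat it as a starting point:

#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';

my ($in_section, $key, @keys, %value);
while (<>) {
    chomp;
    $in_section = 1 if /^Sales$/;            # section starts here ...
    $in_section = 0 if /^Contact Details$/;  # ... and ends here
    next unless $in_section;
    if (/^(.+):$/) {                         # a label line, e.g. "2011 Sales:"
        $key = $1;
        push @keys, $key;
    }
    elsif (defined $key and /\S/ and !/^\(\)$/) {  # its value; skip "()" filler
        (my $v = $_) =~ s/^€//;              # drop the currency sign
        $value{$key} = $v;
        undef $key;
    }
}
say join "\t", @keys;                        # header row
say join "\t", map { $value{$_} } @keys;     # value row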
Related
Using Powershell to remove illegal CRLF from csv row
Gentle Reader, I have a year's worth of vendor csv files sitting in a directory. My task is to load them into a SQL Server DB as a "Historical Load". The files are mal-formed, and while we are working with the vendor to re-send 365 new, properly structured files, I have been tasked with trying to work with what we have. I'm restricted to using either C# (as a script task in SSIS) or PowerShell.

Each file has no header, but the schema is known and built into the SSIS package connection. Each file has approx 35k rows, with roughly a few dozen mal-formed rows per file. Each properly formed row consists of 122 columns, i.e. 121 commas. Rows are NOT text qualified. Example (data cleaned of PII; the middle record is broken across three physical lines):

555222,555222333444,1,HN71232,1/19/2018 8:58:07 AM,3437,27.50,HECTOR EVERYMAN,25-Foot Garden Hose - ,1/03/2018 10:17:24 AM,,1835,,,,,online,,MERCH,1,MERCH,MI,,,,,,,,,,,,,,,,,,,,6611060033556677,2526677,,,,,,,,,,,,,,EVERYMAN,,,,,,,,,,,,,,,,,,,,,,,,,,,,,VWDEB,,,,,,,555666NA118855,2/22/2018 12:00:00 AM,,,,,,,,,,,,,,,,,,,,,2121,,,1/29/2018 9:50:56 AM,0,,,[CRLF]
555222,555222444888,1,CASUAL50,1/09/2018 12:00:00 PM,7000,50.00,JANE SMITH,$50 Casual Gift Card,1/19/2018 8:09:15 AM,1/29/2018 8:19:25 AM,1856,,,,,online,,FREE,1,CERT,GC,,,,,,,6611060033553311[CRLF]
,6611060033553311[CRLF]
,,,,,,,,,25,,,6611060033556677,2556677,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,CASUAL25,VWDEB,,,,,,,555222NA118065,1/22/2018 12:00:00 AM,,,,,,,,,,,,,,,,,,,,,,,,1/19/2018 12:00:15 PM,0,,,[CRLF]
555222,555222777666,1,CASHCS,1/12/2018 10:31:43 AM,2500,25.00,BOB SMITH,BIG BANK Rewards Cash Back Credit [...6S66],,,1821,,,,,online,,CHECK,1,CHECK,CK,,,,,,,,,,,,,,,,,,,,555222166446,5556677,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,VWDEB,,,1/23/2018 10:30:21 AM,,,,555666NA118844,1/22/2018 12:00:00 AM,,,,,,,,,,,,,,,,,,,,,,,,1/22/2018 10:31:26 AM,0,,,[CRLF]

PowerShell's Get-Content (I think...) reads the file into an array, where each row is identified by a CRLF terminator. This means (again, I think) that a mal-formed row will be treated as an element of the array without respect to how many "columns" it holds. C#'s StreamReader also uses CRLF as a marker, but a StreamReader object also has a few methods available, like Peek and Read, that may be useful.

Please, Oh Wise Ones, point me in the direction of least resistance: a PowerShell script to process mal-formed csv files such that CRLFs that are not true end-of-line markers are removed. Thank you.
Based on @vonPryz's design, but in (native¹) PowerShell:

$Delimiters = 121
Get-Content .\OldFile.csv |ForEach-Object { $Line = '' } {
    if ($Line) { $Line += ',' + $_ } else { $Line = $_ }
    $TotalMatches = ($Line |Select-String ',' -AllMatches).Matches.Count
    if ($TotalMatches -ge $Delimiters) {
        $Line
        $Line = ''
    }
} |Set-Content .\NewFile.Csv

(The two script blocks passed to ForEach-Object bind to its -Begin and -Process parameters: the first initializes $Line once, the second runs for every input line.)

1) I guess performance might be improved by avoiding += and using .NET methods along with text streamers.
Honestly, your best bet is to get good data from the supplier. Trying to work around a mess will just cause problems later on. Garbage in, garbage out. And since it would be you who wrote the garbage data into the database, congratulations, it is now your fault that the DB data is of poor quality. Please talk with your manager and the stakeholders first, so that you have in writing an agreement that you didn't break the data and that it was broken to start with. I've seen such problems in ETL processing all too often.

Quick and dirty pseudocode, without error handling, edge-case processing, substring index assumptions, performance guarantees and whatnot, goes like so:

while(dataInFile)
    line = readline()
:parseLine
    commasInLine = countCommas(line)
    if commasInLine == rightAmount
        addLineInOKBuffer(line)
    else
        commasNeeded = rightAmount - commasInLine
        if commasNeeded < 0  # too many commas, two rows are combined
            lastCommaLocation = getLastCommaIndex(line, commasNeeded)
            addLineInOKBuffer(line.substring(0, lastCommaLocation))
            line = line.substring(lastCommaLocation, line.end)
            goto :parseLine
        else  # too few commas, need to read the next line too
            line = line.removeCrLf() + readline()
            goto :parseLine

The idea is that first you read a line and count how many commas it has. If the count matches what's expected, the row is not broken; store it in a buffer of good data. If there are too many commas, the row contains at least two different records: find the index where the first record ends, extract it and store it in the good-data buffer, then remove the already-processed part of the line and start the parse again. If there are too few commas, the row has been split by a stray newline: read the next line from the file, join it with the current line, and start the parsing again from counting the commas.
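For what it's worth, here is a minimal Perl rendering of the simple path through that pseudocode: keep appending physical lines until the logical row holds the expected number of commas, then emit it. The too-many-commas branch is omitted, and it assumes (as the question states) that fields are never quoted, so every comma is a real delimiter:

#!/usr/bin/perl
use strict;
use warnings;

my $want = 121;                    # commas in a well-formed row
my $row  = '';
while (my $line = <>) {
    $line =~ s/\r?\n\z//;          # strip the CR/LF that broke the row
    $row .= $line;                 # glue the fragment onto the pending row
    my $commas = () = $row =~ /,/g;
    if ($commas >= $want) {        # row is complete, write it out
        print "$row\n";
        $row = '';
    }
}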
Parsing a text file or an html file to create a table
I have a simple issue with a .msg file from Outlook. I discovered that the code someone helped me with was not working, since the HTML body of the .msg file varies between different emails, even though they are from the same source. So my next option was to save the email as a .txt and an .html file. Since I have no knowledge of HTML, I have no idea how to grab the table from the HTML structure, but in the text file I found something easy. For example, this is the data from one table:

Summary
Date
Good mail
Rule matches
Spam
Malware
2019-10-22
4927
4519
2078
0
2019-10-23
4783
4113
1934
0

Summary is the keyword: the next 5 lines after it are the columns of the table, and each group of 5 lines after that is a row. This goes up to 7 rows in total, so: headers, then 7 rows. What I want to do is create a table from this text, using the first 5 lines after Summary as my columns. Since each .msg is different, these 5 columns change order randomly in each file, so I want to avoid hard-coding them. My best attempt was to use ConvertFrom-String to create a table, but I have little idea how to shape the table with the conditions set above. There is also the condition that the email has more data after the table, so I need to stop reading there and grab just that part, which should be easy. How can I use ConvertFrom-String to create the table using those 5 columns, how can I set the delimiter to a new line, and how can I make the first 5 lines the column headers?
I think trying to make this work with ConvertFrom-StringData is adding more work than necessary. But here is an alternative that works with your sample set:

$text = Get-Content -Path File.txt

$formattedText = if ($text[0] -match '^Summary') {
    for ($i = 1; $i -lt $text.Count; $i += 5) {
        $text[$i..($i+4)] -join ','
    }
}

$formattedText | ConvertFrom-Csv | ConvertTo-Html

Explanation: assuming your text data is in File.txt, Get-Content reads the data as an array ($text). If the first line begins with Summary, the file is parsed. The for loop skips ahead 5 lines on each iteration until the end of the file: it begins by joining the $text values at indexes 1, 2, 3, 4, and 5 with a ','; then the index ($i) is increased by 5 and the next five values are joined. Each iteration creates a new line of comma-separated values. The reason for the ',' join is just to allow the simple ConvertFrom-Csv later: it converts those CSV lines into an array of objects, with the first line becoming those objects' property names. Finally, the array is piped to ConvertTo-Html, which outputs all of the objects in a table.

Note: if you want to resize the table or add extra formatting, you may need to do that after the HTML is generated. If your data contains commas, you will need a different delimiter when joining the strings, and you will then need to pass that delimiter to ConvertFrom-Csv via its -Delimiter parameter.

Adaptation: the code is fairly flexible. If you need to work with more than five properties, the $i += 5 will need to reflect the number of properties you are cycling through. The same change applies to $text[$i..($i+4)]: the .. should separate two values that differ by one less than your property count.
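Since most of this page is Perl, a rough Perl restatement of the same grouping step may help; it makes the same assumptions as the answer above (a leading Summary line, then groups of exactly five lines, in a file named File.txt):

#!/usr/bin/perl
use strict;
use warnings;

open my $fh, '<', 'File.txt' or die "File.txt: $!";
chomp(my @lines = <$fh>);
close $fh;

if (@lines and $lines[0] =~ /^Summary/) {
    # emit one CSV line per group of five input lines
    for (my $i = 1; $i + 4 <= $#lines; $i += 5) {
        print join(',', @lines[$i .. $i + 4]), "\n";
    }
}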
Using Perl to parse text from blocks
I have a file with multiple blocks of text. FOR EACH block of text, I want to be able to extract what is in the square brackets, the line containing the FIRST instance of the word "area", and what is to the right of the square brackets. Everything will be a string. Essentially, what I want to do is store each string as a variable in a hash so I can print it into a 3-column csv file. Here's a sample of what the file looks like:

Student-[K-6] Exceptional in Math
/home/area/kinder/mathadvance.txt, 12
Students in grade K-12 shown to be exceptional in math. Placed into special after school program. See /home/area/overall/performance.txt, 200

Student-[Junior] Weak Performance
Students with overall weak performance. Summer program services offered as shown in
"/home/area/services/summer.txt", 212

Student-[K-6] Physical Excerise Time Slots
/home/area/pe/schedule.txt, 303
Assigned time slots for PE based on student's grade level. Make reference to /home/area/overall/classtimes.txt, 90

I want to end up with a csv file that looks like:

Grade,Topic,Path
K-6, Exceptional in Math, /home/area/kinder/mathadvance.txt, 12
K-6, Physical Exercise Time Slots, /home/area/pe/schedule.txt, 303
Junior, Weak Performance, "/home/area/services/summer.txt", 212

Since it's a csv file, I know it will also split at the comma before the line number when imported into Excel, but I'm fine with that. I started off by putting the grade types into an array, because I want to be able to add more strings to it for different grade levels. My program looks like this so far:

#!/usr/bin/perl
use strict;
use warnings;

my @grades = ("K-6", "Junior", "Community-College", "PreK");

I was thinking that I would need some sort of system sed command to grab what is in the brackets and store it in a variable. Then I would grab everything to the right of the brackets on that line and store it in a variable. Then I would grep for a line containing "area" to get the path, store it as a string in a variable, put these in a hash, and then print to csv. I'm not sure if I'm thinking about this the right way. Also, I have NO IDEA how to do this for each BLOCK of text in the file. I need it by block, because each block has its own corresponding grade, topic, and path.
perl -000 -ne '($grade, $topic) = /\[(.*)\] (.*)/;
               ($path) = m{(.*/area/.*)};
               print "$grade, $topic, $path\n"' -- file.txt

-000 turns on paragraph mode: -n won't read line by line, but paragraph by paragraph.
/\[(.*)\] (.*)/ matches the square brackets and whatever follows them up to a newline. The inside of the square brackets and the following text are captured using the parentheses.
m{(.*/area/.*)} captures the first line containing "area". It uses the m{} syntax instead of // so we don't have to backslash the slashes (avoiding the so-called "leaning toothpick syndrome").
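The same one-liner written out as a script, in case that form is easier to adapt (a plain restatement, with a small guard added against blocks that don't match):

#!/usr/bin/perl
use strict;
use warnings;

local $/ = "";    # paragraph mode, the script equivalent of -000
while (<>) {
    my ($grade, $topic) = /\[(.*)\] (.*)/;
    my ($path) = m{(.*/area/.*)};
    next unless defined $grade and defined $path;   # skip odd blocks
    print "$grade, $topic, $path\n";
}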
I have a tab-separated file in Unix which has a data issue
I have to make sure each line has 4 columns, but the input data is quite a mess. The first line is the header. The second line is valid, as it has 4 columns. The third is also valid (it's ok if the Description field is null); the ID field and, god bless me, the last column, Phnumber, are not-null fields. As one can see, the 4th line is messed up: because of a newline character in the Description column, it spans multiple lines.

ID Name Description Phnumber
1051 John 5674 I am doing good, is this task we need to fix 908342
1065 Rohit 9876246
10402 rob I am doing good,
is this task we need to fix 908341
105552 "Julin rob hain" i know what to do just let me do it " " " " " " 908452
1051 Dave I am doing reporting this week 88889999

Maybe a screenshot would make it easier to see the problem. Each line will start with a number and end with a number, and each line should have 4 columns. Desired output:

ID Name Description Phnumber
1051 John 5674 I am doing good, is this task we need to fix 908342
1065 Rohit 9876246
10402 rob I am doing good, 563 is this task we need to fix 908341
105552 "Julin rob hain" i know what to do just let me do it 908452
1051 Dave I am doing reporting this week 88889999

This is sample data; the actual file has 12 columns. And yes, the in-between columns can contain numbers, and a few are date fields (like 2017-03-02).
This did the trick:

cat file_name | perl -0pe 's/\n(?!([0-9]{6}|$)\t)//g' | perl -0pe 's/\r(?!([0-9]{6}|$)\t)//g' | sed '/^$/d'
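For anyone reading along: -0 makes perl slurp the whole file as a single record, so the substitution can see across line breaks. It deletes every newline that is not followed by the start of a new record, recognized here as a six-digit ID plus a tab (the |$ alternative appears intended to preserve the very end of the data). The second pass repeats the job for stray carriage returns, and the final sed drops any blank lines left over. Here is the first pass spread out with /x for readability; this is only a restatement of the command above, not a tested improvement:

perl -0pe 's{
    \n                    # a newline ...
    (?!                   # ... that is NOT followed by
        ( [0-9]{6} | $ )  # a six-digit ID (or the end of the data)
        \t                # and then a tab
    )
}{}gx' file_name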
awk to the rescue! This assumes the all-digit fields don't appear except as the first and last fields:

awk 'NR==1; NR>1 {for(i=1;i<=NF;i++) {if($i~/[0-9]+/) s=!s; printf "%s", $i (s?OFS:RS)}}' file

ID Name Description Phnumber
1051 John I am doing good, is this task we need to fix 908342
10423 rob I am doing good, is this task we need to fix 908341
1052 Julin rob hain i know what to do just let me do it " " " " " " 908452
1051 Dave I am doing reporting this week 88889999

Every field containing a digit flips the flag s, so the ID field turns it on and the phone-number field turns it off again: while s is set, fields are printed joined by OFS, and the field that clears it is followed by the record separator (a newline) instead. Perhaps set OFS to \t to get more structure.
How do I parse this file and store it in a table?
I have to parse a file and store it in a table. I was asked to use a hash to implement this. Please give me a simple means to do that, in Perl only.

-----------------------------------------------------------------------
L1234| Archana20 | 2010-02-12 17:41:01 -0700 (Mon, 19 Apr 2010) | 1 line
PD:21534 / lserve<->Progress good
------------------------------------------------------------------------
L1235 | Archana20 | 2010-04-12 12:54:41 -0700 (Fri, 16 Apr 2010) | 1 line
PD:21534 / Module<->Dir,requires completion
------------------------------------------------------------------------
L1236 | Archana20 | 2010-02-12 17:39:43 -0700 (Wed, 14 Apr 2010) | 1 line
PD:21534 / General Page problem fixed
------------------------------------------------------------------------
L1237 | Archana20 | 2010-03-13 07:29:53 -0700 (Tue, 13 Apr 2010) | 1 line
gTr:SLC-163 / immediate fix required
------------------------------------------------------------------------
L1238 | Archana20 | 2010-02-12 13:00:44 -0700 (Mon, 12 Apr 2010) | 1 line
PD:21534 / Loc Information Page
------------------------------------------------------------------------

I want to read this file and perform a split (or whatever works) to extract the following fields into a table:
the id that starts with L should be the first field in the table
Archana20 must be the second field
the timestamp must be the third field
PD must be the fourth field
the type (the content following the /) must be the last field

My questions are:
How do I ignore the "--------" separator lines in this file?
How do I extract the fields above?
How do I split when the file has two delimiters (| and /)?
How do I implement it using a hash, and why is a hash needed?

Please provide some simple means so that I can understand, since I am a beginner to Perl.
My questions are: How do I ignore the separator lines in this file? How do I extract the fields above? How do I split when the file has two delimiters (| and /)? How do I implement it using a hash, and why is a hash needed?

You will probably be working through the file line by line in a loop. Take a look at perldoc -f next. You can use regular expressions, or a simpler match in this case, to make sure that you only skip the appropriate lines.

You need to split first and then handle each field as needed afterwards, I would guess. Split on the primary delimiter (which appears to be ' | '; more on that in a minute), then split the final piece on its secondary delimiter.

I'm not sure if you are asking whether you need a hash or not. If so, you need to pick whichever item will provide the best set of (unique) keys. We can't do that for you, since we don't know your data, but the first field (at a glance) looks about right. As for how to get something like this into a more complex data structure, you will want to look at perldoc perldsc eventually, though it might only confuse you right now.

One other thing: your data above looks like it has a semi-important typo in the first record. In that record only, there is no space between the first field and its delimiter; everywhere else it's ' | '. I mention this only because it can matter for split. I nearly edited it away, but maybe the data itself is irregular, though I doubt it.

I don't know how much of a beginner you are with Perl, but if you are completely new to it, you should think about a book (online tutorials vary widely, and many are terribly out of date). A reasonably good introductory book is freely available online: Beginning Perl. Another good option is Learning Perl together with Intermediate Perl (they really go together).
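Illustrating the split-then-split idea on one record from the sample data (the variable names are mine, not from the post):

use strict;
use warnings;

# one header line and its detail line, taken from the sample data
my $header = 'L1235 | Archana20 | 2010-04-12 12:54:41 -0700 (Fri, 16 Apr 2010) | 1 line';
my $detail = 'PD:21534 / Module<->Dir,requires completion';

# primary delimiter: the pipe (with any surrounding whitespace)
my ($id, $author, $timestamp) = split /\s*\|\s*/, $header;

# secondary delimiter: the first slash in the detail line
my ($pd, $type) = split m{\s*/\s*}, $detail, 2;

print "$id, $author, $timestamp, $pd, $type\n";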
When you say "This is not homework... it will be a start to assess me in Perl", I assume you mean that this is perhaps the first assignment you have at a new job or something, in which case it seems that if we just give you the answer, it will actually harm you later, since they will assume you know more about Perl than you do. However, I will point you in the right direction.

A. Don't use split; use regular expressions. You can learn about them by googling "perl regex".
B. Google "perl hash" to learn about Perl hashes. The first result is very good.

Now to your questions:
regular expressions will help you ignore the lines you don't want
regular expressions will extract the items; look up "capture variables"
don't split, use a regex
see point B above
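For instance, a single regex with capture variables can pull all five fields out of one record at once. A sketch against the sample data; the exact pattern is my own guess, not from the answer:

use strict;
use warnings;

my $record = "L1235 | Archana20 | 2010-04-12 12:54:41 -0700 (Fri, 16 Apr 2010) | 1 line\n"
           . "PD:21534 / Module<->Dir,requires completion";

# capture id, user, timestamp, PD tag, and type in one pass
if ($record =~ m{^(L\d+)\s*\|\s*(\S+)\s*\|\s*(.+?)\s*\|\s*\d+\s+lines?\s*(\w+:\d+)\s*/\s*(.+)$}) {
    my ($id, $user, $stamp, $pd, $type) = ($1, $2, $3, $4, $5);
    print "$id, $user, $stamp, $pd, $type\n";
}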
If this file is line-based, then you can read it line by line in a while loop and skip the lines that aren't formatted how you wish. After that, you can use a regex, as indicated in the other answer. I'd use it to split each record up into an array and build a hash of lists for the records. Either after that (or before), clean up each field by trimming whitespace etc. If you use a regex, use the capture expressions to add to your list in that fashion; it's up to you. The hash key is the first column, and the list contains everything else. If you are just doing a direct insert, you can get away with a list of lists and simply put everything in that instead. The key of the hash allows fast lookup of particular records, but if you don't need that, then an array would be fine.
You can try this one. Points to note: read the file line by line; use a regular expression to skip the '----' separator lines; after that, use the split function to populate a hash of arrays.

#!/usr/bin/perl
use strict;
use warnings;

my $test_file = 'test.txt';
open(IN, '<', $test_file) or die $!;

my (%seen, $id, $name, $timestamp, $PD, $type);
while (<IN>) {
    chomp;
    my $line = $_;
    if ($line =~ m/^-/) {
        # skip the '----' separator lines
    }
    else {
        if ($line =~ /\|/) {
            # header line: id | name | timestamp | "1 line"
            ($id, $name, $timestamp) = split /\|/, $line, 4;
        }
        else {
            # detail line: PD tag / type
            ($PD, $type) = split /\//, $line, 2;
        }
        $seen{$id} = [$name, $timestamp, $PD, $type];   # hash of arrays
    }
}
for my $test (sort keys %seen) {
    my $test1 = $seen{$test};
    print "$test: @{$test1}\n";
}
close(IN);