How to extract column-wise data from a PDF in PowerShell?
I want to read a .pdf file which has some data. How can I extract the complete data from one specific column only, using PowerShell? I am using iText 5 for .NET (aka iTextSharp) for PDF data extraction.
This is my current code, which extracts an entire line:
$reader = New-Object iTextSharp.text.pdf.PdfReader -ArgumentList 'testPOC.pdf'
$page = 2
$text = [iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($reader, $page).Split([char]0x000A)
Write-Host $text[5]
Output is shown as:
ID Working Agent Assistant Name Plan Gender Year Amount Comm.% Split% Commission
4169985061 Paul E. Ted Alskd, Ols fhghslhshsl+(0sdhsk) M 12 $1,234.00 0.45% 100.00% $32.78
How can I get data only from one single column (eg. only from salary column)?
This is just a blind stab at the answer, because we don't know what type of data $text is (unless we are iTextSharp experts). You could find that out for us by entering:
$text.gettype()
From the way it shows up on output, it almost appears that it's a PSCustomObject. If so, an approach like this might work:
$text | select-object ID, Commission
I used Commission because I couldn't see Salary in your output. I added ID for the sake of context.
Note: a real answer is going to have to wait for somebody who uses iTextSharp and might know the datatype of $text without being told. That could be a long wait.
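In the meantime, note that GetTextFromPage returns one plain string per page, so the .Split() call in the question leaves $text as an array of line strings. One pragmatic workaround is to split each data row on runs of whitespace and index fields from the end, which stays stable even though the name fields themselves contain spaces. A sketch only; the field offsets are assumptions based on the sample row in the question:

```powershell
# $text[5] holds one data row from the question, e.g.:
$row = '4169985061 Paul E. Ted Alskd, Ols fhghslhshsl+(0sdhsk) M 12 $1,234.00 0.45% 100.00% $32.78'

# Split on runs of whitespace; counting fields from the END is stable
# because only the left-hand name columns contain embedded spaces.
$fields = $row -split '\s+'
$commission = $fields[-1]   # Commission column -> $32.78
$split      = $fields[-2]   # Split% column     -> 100.00%
$amount     = $fields[-4]   # Amount column     -> $1,234.00
```

To collect a whole column, apply the same split to every data line in $text and pick the same index each time.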
Related
Powershell: Compare filenames to a list of formats
I am not looking for a working solution but rather an idea to push me in the right direction, as I was thrown a curveball I am not sure how to tackle. Recently I wrote a script that used the split command to check the first part of a file name against the folder name. This was successful, so now there is a new ask: check all the files against the naming matrix. The problem is there are like 50+ file formats on the list. So for example the format of a document would be ID-someID-otherID-date.xls, e.g. 12345678-xxxx-1234abc.xls, where the number of characters in each section can be checked to spot typos etc. Is there any other reasonable way of tackling that besides regex? I was thinking of using multiple splits on the hyphens, but I don't really have anything to reference the parts against other than the number of characters required in each part. As always, any (even vague) pointers are most welcome ;-)
Although I would use a regex (as also commented by zett42), there is indeed another way, which is using the ConvertFrom-String cmdlet with a template:

$template = @'
{[Int]Some*:12345678}-{[String]Other:xxxx}-{[DateTime]Date:2022-11-18}.xls
{[Int]Some*:87654321}-{[String]Other:yyyy}-{[DateTime]Date:18Nov2022}.xls
'@

'23565679-iRon-7Oct1963.xls' | ConvertFrom-String -TemplateContent $template

Some       : 23565679
Other      : iRon
Date       : 10/7/1963 12:00:00 AM
RunspaceId : 3bf191e9-8077-4577-8372-e77da6d5f38d
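For comparison, the regex approach suggested in the comments can encode each section's required character count directly in the pattern's quantifiers. The exact pattern below is an assumption based on the 12345678-xxxx-1234abc.xls example in the question:

```powershell
# One pattern per entry in the naming matrix; the quantifiers enforce
# the number of characters in each hyphen-separated section.
$patterns = @(
    '^\d{8}-[a-z]{4}-\d{4}[a-z]{3}\.xls$'   # ID(8 digits)-someID(4 letters)-otherID(4 digits + 3 letters)
)

foreach ($name in '12345678-xxxx-1234abc.xls', '1234567-xxxx-1234abc.xls') {
    $ok = $patterns | Where-Object { $name -match $_ }
    if ($ok) { "OK:   $name" } else { "Typo? $name" }
}
```

With 50+ formats, the patterns could live in a hashtable keyed by document type rather than one giant alternation, so each file is only checked against the formats valid for its folder.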
Using Powershell to remove illegal CRLF from csv row
Gentle Reader, I have a year's worth of vendor CSV files sitting in a directory. My task is to load them into a SQL Server DB as a "Historical Load". The files are mal-formed, and while we are working with the vendor to re-send 365 new, properly structured files, I have been tasked with trying to work with what we have. I'm restricted to using either C# (as a script task in SSIS) or PowerShell. Each file has no header, but the schema is known and built into the SSIS package connection. Each file has approx. 35k rows and roughly a few dozen mal-formed rows per file. Each properly formed row consists of 122 columns, i.e. 121 commas. Rows are NOT text-qualified. Example (data cleaned of PII; the second record is broken across three physical lines):

555222,555222333444,1,HN71232,1/19/2018 8:58:07 AM,3437,27.50,HECTOR EVERYMAN,25-Foot Garden Hose - ,1/03/2018 10:17:24 AM,,1835,,,,,online,,MERCH,1,MERCH,MI,,,,,,,,,,,,,,,,,,,,6611060033556677,2526677,,,,,,,,,,,,,,EVERYMAN,,,,,,,,,,,,,,,,,,,,,,,,,,,,,VWDEB,,,,,,,555666NA118855,2/22/2018 12:00:00 AM,,,,,,,,,,,,,,,,,,,,,2121,,,1/29/2018 9:50:56 AM,0,,,[CRLF]
555222,555222444888,1,CASUAL50,1/09/2018 12:00:00 PM,7000,50.00,JANE SMITH,$50 Casual Gift Card,1/19/2018 8:09:15 AM,1/29/2018 8:19:25 AM,1856,,,,,online,,FREE,1,CERT,GC,,,,,,,6611060033553311[CRLF]
,6611060033553311[CRLF]
,,,,,,,,,25,,,6611060033556677,2556677,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,CASUAL25,VWDEB,,,,,,,555222NA118065,1/22/2018 12:00:00 AM,,,,,,,,,,,,,,,,,,,,,,,,1/19/2018 12:00:15 PM,0,,,[CRLF]
555222,555222777666,1,CASHCS,1/12/2018 10:31:43 AM,2500,25.00,BOB SMITH,BIG BANK Rewards Cash Back Credit [...6S66],,,1821,,,,,online,,CHECK,1,CHECK,CK,,,,,,,,,,,,,,,,,,,,555222166446,5556677,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,VWDEB,,,1/23/2018 10:30:21 AM,,,,555666NA118844,1/22/2018 12:00:00 AM,,,,,,,,,,,,,,,,,,,,,,,,1/22/2018 10:31:26 AM,0,,,[CRLF]

PowerShell's Get-Content (I think...) reads the file into an array where each row is identified by the CRLF terminator. This means (again, I think) that mal-formed rows will be treated as elements of the array without respect to how many "columns" they hold. A C# StreamReader also uses CRLF as a marker, but a StreamReader object has a few methods available, like Peek and Read, that may be useful. Please, Oh Wise Ones, point me in the direction of least resistance: using PowerShell as a script to process mal-formed CSV files such that CRLFs that are not EOL are removed. Thank you.
Based on @vonPryz's design, but in (native¹) PowerShell:

$Delimiters = 121

Get-Content .\OldFile.csv | ForEach-Object { $Line = '' } {
    if ($Line) { $Line += ',' + $_ } else { $Line = $_ }
    $TotalMatches = ($Line | Select-String ',' -AllMatches).Matches.Count
    if ($TotalMatches -ge $Delimiters) {
        $Line
        $Line = ''
    }
} | Set-Content .\NewFile.Csv

¹) I guess performance might be improved by avoiding += and using .NET methods along with text streams.
Honestly, your best bet is to get good data from the supplier. Trying to work around a mess will just cause problems later on. Garbage in, garbage out. Since it's you who wrote the garbage data into the database, congratulations, it's now your fault that the DB data is of poor quality. Please talk with your manager and the stakeholders first, so that you have in writing an agreement that you didn't break the data and it was broken to start with. I've seen such problems in ETL processing all too often.

A quick and dirty pseudocode, without error handling, edge-case processing, substring index assumptions, performance guarantees and whatnot, goes like so:

while(dataInFile)
    line = readline()
    :parseLine
    commasInLine = countCommas(line)
    if commasInLine == rightAmount
        addLineInOKBuffer(line)
    else
        commasNeeded = rightAmount - commasInLine
        if commasNeeded < 0
            # too many commas, two lines are combined
            lastCommaLocation = getLastCommaIndex(line, commasNeeded)
            addLineInOKBuffer(line.substring(0, lastCommaLocation))
            line = line.substring(lastCommaLocation, line.end)
            goto :parseLine
        else
            # too few commas, need to read the next line too
            line = line.removeCrLf() + readline()
            goto :parseLine

The idea is that first you read a line and count how many commas there are. If the count matches what's expected, the row is not broken; store it in a buffer containing good data.

If you have too many commas, then the row contains at least two different records. Find the index where the first record ends, extract it and store it in the good-data buffer. Then remove the already-processed part of the line and start again.

If you have too few commas, then the row is split by a newline. Read the next line from the file, join it with the current line, and start the parsing again from counting the commas.
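For reference, the pseudocode above might translate into PowerShell roughly like this. This is a sketch only: the file names are placeholders, the 121 commas per record come from the question, and, like the pseudocode, it cannot recover a field boundary that a missing delimiter has destroyed:

```powershell
$rightAmount = 121   # a proper 122-column record contains 121 commas
$buffer = ''

Get-Content .\OldFile.csv | ForEach-Object {
    # Too few commas so far: keep appending physical lines. The stray
    # CRLF was inside a field, so no comma is added when rejoining.
    $buffer += $_
    $commas = [regex]::Matches($buffer, ',').Count

    while ($commas -gt $rightAmount) {
        # Too many commas: two records are fused. Cut the buffer so the
        # first part holds exactly 121 commas; keep the rest for re-parsing.
        $pos = -1
        for ($i = 0; $i -le $rightAmount; $i++) { $pos = $buffer.IndexOf(',', $pos + 1) }
        $buffer.Substring(0, $pos)             # emit the first record
        $buffer = $buffer.Substring($pos)
        $commas = [regex]::Matches($buffer, ',').Count
    }
    if ($commas -eq $rightAmount) {
        $buffer                                # exactly one good record: emit it
        $buffer = ''
    }
} | Set-Content .\NewFile.csv

# NOTE: a trailing incomplete record, if any, is silently dropped here.
```

The emitted records flow down the pipeline to Set-Content, so nothing but the fixed rows reaches the new file.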
Removing improper CR LF split lines in a pipe-delimited .txt flat file with a PowerShell script
Hope all is well! I came across a bit of a tricky issue with a flat file exported from Oracle PBCS with some carriage-return issues. End users, when inputting data into PBCS, will often press Enter in a specific data-field input screen. When the data gets exported, each record represents a specific data point (intersection) with all the data elements belonging to it, think of a SQL record; the element where the user pressed Enter causes that record to break at that point, shifting the rest of the data elements in that record to the next line. This is very bad, as each record must have the same number of elements, and it causes downstream issues in a mapping. In effect, one unique record becomes two broken records.

I need a PowerShell script that looks at the improper CR LF (Windows system) and reforms each unique record. However, the majority of the records in the flat file are fine, so the code will have to discern the "mostly good" from the "very bad" cases. My flat file is pipe-delimited and has a header record. The header may not need to be considered, as I am simply trying to fix the data; a solution could potentially count the header record's property values to determine how to reform broken records based on a pipe-delimited field count, but I'm not sure that is necessary.

I will be honest: there are Jython scripts I tried, to no avail, so given that I have employed a couple of PowerShell scripts for other reasons in the past, I felt I would use PowerShell again. I have the basis of a script for a CSV file, but it isn't quite working.
$file = Get-Content 'E:\EPM_Cloud\Exports\BUD_PLN\Data\EXPORT_DATA_BUD_PLN.txt'

$file | ForEach-Object {
    foreach ($property in $_.PSObject.Properties) {
        $property.Value = ($property.Value).Replace("`r", "").Replace("`n", "")
    }
}

$file | Out-File -Append 'E:\EPM_Cloud\Exports\BUD_PLN\Data\EXPORT_DATA_BUD_PLN_FINAL.txt'

Here are a few examples of what the before and after cases would be if I could get this code to work. This is supposed to be one record; as you can see, beginning with "$43K from", the user pressed Enter several times. The file is pipe-delimited; I refer to line numbers to show what I mean, since this isn't Notepad++. The idea is that this should all be on line 1:

Contract TBD|#missing|#missing|#missing|#missing|ORNL to Perform Radio-Chemical (RCA) Measurements|#missing|#missing|#missing|#missing|"$43K from above $92,903 $14,907 The current $150K to be reprogrammed to XXX, plus another $150K from Fuel Fac for this item to be reprogrammed to RES."|#missing|#missing|#missing|"Summary|X0200_FEEBASED|No_BOC|O-xxxx-B999|xx_xxx_xx_xxx|Plan|Active|FY19|BegBalance"|COMMIT

This is what the output should look like (I have attached screenshots instead), all on line 1:

Contract TBD|#missing|#missing|#missing|#missing|ORNL to Perform Radio-Chemical (RCA) Measurements|#missing|#missing|#missing|#missing|"$43K from above $92,903 $14,907 The current $150K to be reprogrammed to XXX, plus another $150K from Fuel Fac for this item to be reprogrammed to RES."|#missing|#missing|#missing|"Summary|X0200_FEEBASED|No_BOC|O-xxxx-B999|xx_xxx_xx_xxx|Plan|Active|FY19|BegBalance"|COMMIT

In other cases the line breaks just once; it is all defined by how many times the user pressed Enter. As you can see in the data image, the line splits; that is the point of the PowerShell. Other lines are just fine.
So after checking locally, you should be able to just import the file as a CSV, then loop through everything, remove CRLF from each property on each record, and output to a new file (or the same one, but it's safer to output to a new file):

$Records = Import-Csv C:\Path\To\File.csv -Delimiter '|'
$Properties = $Records[0].PSObject.Properties.Name

ForEach ($Record in $Records) {
    ForEach ($Property in $Properties) {
        $Record.$Property = $Record.$Property -replace "[\r\n]"
    }
}

$Records | Export-Csv C:\Path\To\NewFile.csv -Delimiter '|' -NoTypeInformation
Using PowerShell to extract data from a large tab-separated text file, mask it, and then merge the masked data back into the original file
I am a newbie to Windows PowerShell and wanted to know if it is possible to use PowerShell to extract specific data from a tab-delimited (.dat) file and merge it back into the original file. The reason behind the extraction is that the data is sensitive and requires masking. Upon extraction, I would need to mask the data and then merge the masked data back into its original places in the file. Please provide some pointers; any kind of help would be appreciated. Thank you in advance.
Solution

Here's a solution based on my limited understanding of your question (if you add more details I may be able to be more specific).

Code

Seems all you need to do is read all the data, modify it and write it to the file, so here it is!

$Columns = 2, 4   # column indexes to mask out (indexes start from 0)

# Parentheses force the whole file to be read before Out-File rewrites it
(Get-Content ./lol.dat) | ForEach-Object {
    $arr = $_.Split("`t")
    $Columns | ForEach-Object { $arr[$_] = '*' * $arr[$_].Length }
    $arr -join "`t"   # arrays have no .join() method; use the -join operator
} | Out-File ./lol.dat
How to parse logs and mask specific characters using Powershell
I have a problem that I really hope to get some help with. It's rather complex, but I will try to keep my explanation as simple and objective as possible. In a nutshell, I have log files that contain thousands of lines. Each line consists of information like date/time, source, type and message. In this case the message contains a variable-size ...999 password that I need to mask. Basically the message looks something like this (it's an ISO message):

year-day-month 00:00:00,computername,source, info,rx 0210 22222222222222333333333333333333444444444444444444444444455555008PASSWORD6666666666666666677777777777777777777777ccccdddddddddddffffffffffffff

For each line I need to zero in on the password-length identifier (008), do a count on it, and then proceed to mask that number of following characters, which would be PASSWORD in this case. I would change it to something like XXXXXXXX instead, so once done the line would look like this:

year-day-month 00:00:00,computername,source, info,rx 0210 22222222222222333333333333333333444444444444444444444444455555008XXXXXXXX6666666666666666677777777777777777777777ccccdddddddddddffffffffffffff

I honestly have no idea how to start doing this with PowerShell. I need to loop through each line in the log file and identify the number of characters to mask. I've kept this high level as a starting point; there are some other complexities that I hope to figure out at a later time, like the fact that there are different message types, and depending on the type the password length starts at a different character position. I might be able to build on my aforementioned question first, but if anyone understands what I mean then I would appreciate some help or tips about that too. Any help is appreciated. Thanks!

Additional information to the original post: Firstly, thank you to everyone for your answers thus far; it's been greatly appreciated.
Now that I have a baseline for how your answers are being formulated based on my information, I feel I need to provide some more details.

1) There was a question about whether or not the password starting position is fixed, and the logic behind it. The password position is not fixed. In an ISO message (which these are) the password, and all information in the message, depends on the data elements present in the message, which are in turn indicated by the bitmap. The bitmap is also part of the message. So in my case, I need to script additional logic above and beyond the answers provided to come full circle.

2) This is what I know and these are the steps I hope to accomplish with the script.

What I know:
- There are 3 different msg types that contain passwords. I've figured out where the starting position of the password is for each msg type, based on the bitmap and the data elements present. For example, 0210 contains one in this case:
year-day-month 00:00:00,computername,source, info,rx 0210 22222222222222333333333333333333444444444444444444444444455555008PASSWORD6666666666666666677777777777777777777777ccccdddddddddddffffffffffffff

What I need to do:
- Pass the log file to the script.
- For each line in the log, identify whether the line has a msg type that contains a password.
- If the message type contains a password, determine the length of the password by reading the 3 digits preceding it ("ans ...999", which means alphanumeric-special with a max length of 999 and 3 digits of length info). Let's say the character position of the password is 107 in this case, for argument's sake, so we know to read the 3 numbers before it.
- Starting at the character position of the password, mask the required number of characters with X's.
- Loop through the log until complete.
It does seem as though you're indicating that the position of the password and the length of the password will vary. As long as you have the '008' and something like '666' to indicate a starting and stopping point, something like this should work:

$filePath = '.\YourFile.log'

(Get-Content $filePath) | ForEach-Object {
    $startIndex = $_.IndexOf('008') + 3
    $endIndex = $_.IndexOf('666', $startIndex)
    $passwordLength = $endIndex - $startIndex
    $passwordToReplace = $_.Substring($startIndex, $passwordLength)
    $obfuscation = New-Object 'string' -ArgumentList 'X', $passwordLength
    $_.Replace($passwordToReplace, $obfuscation)
} | Set-Content $filePath

If the file is too large to load into memory, then you will have to use a StreamReader and StreamWriter to write the content to a new file and delete the old one.
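For that large-file case, a StreamReader/StreamWriter version might look like the sketch below. The paths are placeholders, and the same '008'/'666' start/stop markers are assumed:

```powershell
$inPath  = 'C:\Logs\YourFile.log'            # placeholder paths
$outPath = 'C:\Logs\YourFile.masked.log'

$reader = [System.IO.StreamReader]::new($inPath)
$writer = [System.IO.StreamWriter]::new($outPath)
try {
    while ($null -ne ($line = $reader.ReadLine())) {
        $startIndex = $line.IndexOf('008') + 3
        $endIndex   = $line.IndexOf('666', $startIndex)
        if ($startIndex -gt 2 -and $endIndex -gt $startIndex) {
            # both markers found: mask the characters between them by position
            $len  = $endIndex - $startIndex
            $line = $line.Remove($startIndex, $len).Insert($startIndex, 'X' * $len)
        }
        $writer.WriteLine($line)             # lines without markers pass through unchanged
    }
}
finally {
    $reader.Dispose()
    $writer.Dispose()
}
```

Unlike the Replace call above, Remove/Insert masks by position, so it cannot accidentally replace an identical substring that happens to appear elsewhere in the line.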
Assuming a fixed position where the password-length field starts, based on your sample line (if that position is variable, as you've hinted at, you need to tell us more):

$line = '22222222222222333333333333333333444444444444444444444444455555008PASSWORD6666666666666666677777777777777777777777ccccdddddddddddffffffffffffff'

$posStart = 62       # fixed 0-based position where the length-of-password field starts
$pwLenFieldLen = 3   # length of the length-of-password field

# extract the password length
$pwLen = [int] $line.SubString($posStart, $pwLenFieldLen)

# determine the password replacement string
$pwSubstitute = 'X' * $pwLen

# replace the password with all Xs
$line -replace "(?<=^.{$($posStart + $pwLenFieldLen)}).{$pwLen}(?=.*)", $pwSubstitute

Note: This is not the most efficient way to do it, but it is concise.