Remove Duplicate Group of Data in Text file - powershell

I have a text file formatted similar to the following:
Description1: Data-123
Description2: Data-ABC
Description3: Data-789
Description4: Data-EFG
Description5: Data-XYZ
Description1: Data-123
Description2: Data-ABC
Description3: Data-789
Description4: Data-EFG
Description5: Data-XYZ
Description1: Data-123
Description2: Data-ABC
Description3: Data-789
Description4: Data-EFG
Description5: Data-584
I need PowerShell to compare each group (5 lines of data) as a whole and remove any duplicate groups, leaving only the unique groups of data. I can get it to remove single duplicate lines with the code below, but no luck comparing each group.
get-content TextFile.txt | sort-object | get-unique > NewTextFile.txt

This might work. It assumes the groups are separated by blank lines, and the last line of code writes the unique groups to the output file. I give no further explanation because you haven't shown us any code you have so far.
$a = gc mylist.txt
$b = [string]::Empty
$c = @()
$a | % {if ( $_ -ne [string]::Empty )
{ $b += "$_`n" }
else
{ $c += $b
$b = [string]::Empty
}
}
$c += $b
$c | select -Unique | out-file .\mynew.txt

Split the file content on double new line characters (that should match the end of the line right before the empty line + the empty line right after it), split each object returned (remove the empty line) and then join it back, add new line and write the results to a new file.
(Get-Content TextFile.txt | Out-String) -split "`r`n`r`n" | ForEach-Object{
($_.Split("`r`n",[System.StringSplitOptions]::RemoveEmptyEntries) -join "`r`n") + "`n"
} | Select-Object -Unique | Out-File NewTextFile.txt
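If the groups are not separated by blank lines, fixed-size chunking is another option. The following is only a minimal sketch, assuming every group is exactly five lines as described in the question. The inline $lines array stands in for Get-Content TextFile.txt, and $groupSize, $seen and $uniqueGroups are illustrative names, not part of either answer.

```powershell
# Sample data standing in for: $lines = Get-Content TextFile.txt
$lines = @(
    'Description1: Data-123', 'Description2: Data-ABC', 'Description3: Data-789', 'Description4: Data-EFG', 'Description5: Data-XYZ',
    'Description1: Data-123', 'Description2: Data-ABC', 'Description3: Data-789', 'Description4: Data-EFG', 'Description5: Data-XYZ',
    'Description1: Data-123', 'Description2: Data-ABC', 'Description3: Data-789', 'Description4: Data-EFG', 'Description5: Data-584'
)
$groupSize = 5   # assumption: every group is exactly five lines long

# A HashSet remembers which groups have already been emitted (case-sensitive).
$seen = New-Object 'System.Collections.Generic.HashSet[string]'

$uniqueGroups = for ($i = 0; $i -lt $lines.Count; $i += $groupSize) {
    $end   = [Math]::Min($i + $groupSize, $lines.Count) - 1
    $group = $lines[$i..$end] -join "`r`n"
    # Add() returns $true only the first time a given group is seen.
    if ($seen.Add($group)) { $group }
}
```

With a real file, the result could then be written out with $uniqueGroups | Out-File NewTextFile.txt. Unlike the sort-object approach in the question, this keeps the groups in their original order.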

Related

How can I transpose and parse a large vertical text file into a CSV file with headers?

I have a large text file (*.txt) in the following format:
; KEY 123456
; Any Company LLC
; 123 Main St, Anytown, USA
SEC1 = xxxxxxxxxxxxxxxxxxxxx
SEC2 = xxxxxxxxxxxxxxxxxxxxx
SEC3 = xxxxxxxxxxxxxxxxxxxxx
SEC4 = xxxxxxxxxxxxxxxxxxxxx
SEC5 = xxxxxxxxxxxxxxxxxxxxx
SEC6 = xxxxxxxxxxxxxxxxxxxxx
This is repeated for about 350 - 400 keys. These are HASP keys and the SEC codes associated with them. I am trying to parse this file into a CSV file with KEY and SEC1 - SEC6 as the headers, with the rows being filled in. This is the format I am trying to get to:
KEY,SEC1,SEC2,SEC3,SEC4,SEC5,SEC6
123456,xxxxxxxxxx,xxxxxxxxxxx,xxxxxxxxxx,xxxxxxxxxx,xxxxxxxxxx,xxxxxxxxxx
456789,xxxxxxxxxx,xxxxxxxxxx,xxxxxxxxxx,xxxxxxxxxx,xxxxxxxxxx,xxxxxxxxxx
I have been able to get a script to export to a CSV with only one key in the text file (my test file), but when I try to run it on the full list, it only exports the last key and sec codes.
$keysheet = '.\AllKeys.txt'
$holdarr = @{}
Get-Content $keysheet | ForEach-Object {
if ($_ -match "KEY") {
$key, $value = $_.TrimStart("; ") -split " "
$holdarr[$key] = $value }
elseif ($_ -match "SEC") {
$key, $value = $_ -split " = "
$holdarr[$key] = $value }
}
$hash = New-Object PSObject -Property $holdarr
$hash | Export-Csv -Path '.\allsec.csv' -NoTypeInformation
When I run it on the full list, it also adds a couple of extra columns with what looks like properties instead of values.
Any help to get this to work would be appreciated.
Thanks.
Here's the approach I suggest:
$output = switch -Regex -File './AllKeys.txt' {
'^; KEY (?<key>\d+)' {
if ($o) {
[pscustomobject]$o
}
$o = @{
KEY = $Matches['key']
}
}
'^(?<sec>SEC.*?)\s' {
$o[$Matches['sec']] = ($_ | ConvertFrom-StringData)[$Matches['sec']]
}
default {
Write-Warning -Message "No match found: $_"
}
}
# catch the last object
$output += [pscustomobject]$o
$output | Export-Csv -Path './some.csv' -NoTypeInformation
This would be one approach.
& {
$entry = $null
switch -Regex -File '.\AllKeys.txt' {
"KEY" {
if ($entry ) {
[PSCustomObject]$entry
}
$entry = @{}
$key, $value = $_.TrimStart("; ") -split " "
$entry[$key] = [int]$value
}
"SEC" {
$key, $value = $_ -split " = "
$entry[$key] = $value
}
}
[PSCustomObject]$entry
} | sort KEY | select KEY,SEC1,SEC2,SEC3,SEC4,SEC5,SEC6 |
Export-Csv -Path '.\allsec.csv' -NoTypeInformation
Let's leverage the strength of ConvertFrom-StringData, which
Converts a string containing one or more key and value pairs to a hash table.
So what we will do is:
Split into blocks of text
Edit the "; KEY" line
Remove any blank lines or semicolon lines
Pass to ConvertFrom-StringData to create a hashtable
Convert that to a PowerShell object
$path = "c:\temp\keys.txt"
# Split the file into its key/sec collections. Drop any blank entries created in the split
(Get-Content -Raw $path) -split ";\s+KEY\s+" | Where-Object{-not [string]::IsNullOrWhiteSpace($_)} | ForEach-Object{
# Split the block into lines again
$lines = $_ -split "`r`n" | Where-Object{$_ -notmatch "^;" -and -not [string]::IsNullOrWhiteSpace($_)}
# Edit the first line so we have a full block of key=value pairs.
$lines[0] = "key=$($lines[0])"
# Use ConvertFrom-StringData to do the leg work after we join the lines back as a single string.
[pscustomobject](($lines -join "`r`n") | ConvertFrom-StringData)
} |
# Cannot guarantee column order, so we force it with this select statement.
Select-Object KEY,SEC1,SEC2,SEC3,SEC4,SEC5,SEC6
Use Export-Csv to your heart's content now.
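To make the ConvertFrom-StringData step concrete, here is a small sketch on a single hand-written block. The values aaa and bbb are placeholders, and $block, $table and $row are illustrative names.

```powershell
# One block after the "; KEY" line has been rewritten as key=value.
# The SEC values here are placeholders, not real codes.
$block = "KEY=123456`nSEC1 = aaa`nSEC2 = bbb"

# Each "name = value" line becomes one hashtable entry.
$table = $block | ConvertFrom-StringData

# Casting the hashtable yields an object whose properties Export-Csv can use as columns.
$row = [pscustomobject]$table
$row | Select-Object KEY, SEC1, SEC2
```

Because a plain hashtable does not keep insertion order, the explicit Select-Object is what fixes the column order.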

Parse a list line by line, create a new list in Powershell

I need to read in a file that contains lines of source/destination IPs and ports as well as a tag. I'm using Get-Content:
Get-Content $logFile -ReadCount 1 | % {
} | sort | get-unique | Out-File "C:\Log\logout.txt"
This is an example of the input file:
|10.0.0.99|345|195.168.4.82|58164|spam|
|10.0.0.99|345|195.168.4.82|58164|robot|
|10.0.0.99|231|195.168.4.82|58162|spam|
|195.168.4.82|58162|10.0.0.99|231|robot|
|10.0.0.99|345|195.168.4.82|58168|spam|
|10.0.0.99|345|195.168.4.82|58169|spam|
What I need to do is output a new list, but if the same source/destination IPs/ports are both 'spam' and 'robot' I just need to output that line as 'robot' (lines 1 and 2 above).
I need to do the same if the reverse direction of an existing connection is either 'spam' or 'robot', I just need one or the other and it would be 'robot' (lines 3 and 4 above). There will be plenty of 'spam' lines without a duplicate or reverse connection (the last couple lines above), they need to just stay the same.
This is what I've been using to create the reverse direction of the connection, but I haven't been able to figure out how to properly create the new list:
$reverse = '|' + ($_.Split("|")[3,4,1,2,5] -join '|') + '|'
Output of the above would be:
|10.0.0.99|345|195.168.4.82|58164|robot|
|195.168.4.82|58162|10.0.0.99|231|robot|
|10.0.0.99|345|195.168.4.82|58168|spam|
|10.0.0.99|345|195.168.4.82|58169|spam|
(except that second line didn't have to be the reversed direction)
Thanks for any help!
Since both direct and reverse connections are checked and their line order may not be sequential, I would use a hashtable to store the type of both directions and do everything algorithmically:
$checkPoints = @{}
$output = [ordered]@{}
$reader = [IO.StreamReader]'R:\1.txt'
while (!$reader.EndOfStream) {
$line = $reader.ReadLine()
$s = $line.split('|')
$direct = [string]::Join('|', $s[1..4])
$reverse = [string]::Join('|', ($s[3,4,1,2]))
$type = $s[5]
$known = $checkPoints[$direct]
if (!$known -or ($type -eq 'robot' -and $known -eq 'spam')) {
$checkPoints[$direct] = $checkPoints[$reverse] = $type
$output[$direct] = $line
$output.Remove($reverse)
} elseif ($type -eq 'spam' -and $known -eq 'robot') {
$output.Remove($reverse)
}
}
$reader.Close()
Set-Content r:\2.txt -Encoding utf8 -value @($output.Values)
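A different way to handle both directions, shown here only as a sketch and not as the method above, is to build one direction-independent key per connection by sorting its two endpoints, then keep a single line per key and let 'robot' win. It assumes the six-field layout from the question and PowerShell 3.0 or later for [ordered]. The names $best, $key and $result are illustrative.

```powershell
# Sample input from the question, standing in for Get-Content $logFile.
$lines = @(
    '|10.0.0.99|345|195.168.4.82|58164|spam|',
    '|10.0.0.99|345|195.168.4.82|58164|robot|',
    '|10.0.0.99|231|195.168.4.82|58162|spam|',
    '|195.168.4.82|58162|10.0.0.99|231|robot|',
    '|10.0.0.99|345|195.168.4.82|58168|spam|',
    '|10.0.0.99|345|195.168.4.82|58169|spam|'
)

$best = [ordered]@{}   # key: normalized connection, value: line to keep
foreach ($line in $lines) {
    $f = $line.Split('|')        # $f[1..4] = src ip, src port, dst ip, dst port; $f[5] = tag
    $a = "$($f[1])|$($f[2])"
    $b = "$($f[3])|$($f[4])"
    # Sorting the two endpoints gives the same key for A->B and B->A.
    $key = (@($a, $b) | Sort-Object) -join '|'
    $isNew     = -not $best.Contains($key)
    $upgrades  = $f[5] -eq 'robot' -and $best[$key] -notlike '*|robot|'
    if ($isNew -or $upgrades) {
        $best[$key] = $line      # a 'robot' line replaces a stored 'spam' line
    }
}
$result = @($best.Values)
```

On the sample input this keeps four lines, with the 'robot' variants of the first two connections, which matches the output the question asks for.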

Retrieving second part of a line when first part matches exactly

I used the below steps to retrieve a string from file
$variable = 'abc@yahoo.com'
$test = $variable.split('@')[0];
$file = Get-Content C:\Temp\file1.txt | Where-Object { $_.Contains($test) }
$postPipePortion = $file | Foreach-Object {$_.Substring($_.IndexOf("|") + 1)}
This results in all lines that contain $test as a substring. I just want the result to contain only the lines that exactly matches $test.
For example, If a file contains
abc_def|hf#23$
abc|ohgvtre
I just want the text ohgvtre
If I understand the question correctly you probably want to use Import-Csv instead of Get-Content:
Import-Csv 'C:\Temp\file1.txt' -Delimiter '|' -Header 'foo', 'bar' |
Where-Object { $_.foo -eq $test } |
Select-Object -Expand bar
To address the exact matching, you should be testing for equality (-eq) rather than for a substring (.Contains()). Also, there is no need to parse the data multiple times. Here is your code, rewritten to operate in one pass over the data using the -split operator.
$variable = 'abc@yahoo.com'
$test = $variable.split('@')[0];
$postPipePortion = (
# Iterate once over the lines in file1.txt
Get-Content C:\Temp\file1.txt | foreach {
# Split the string, keeping both parts in separate variables.
# Note the backslash - the argument to the -split operator is a regex
$first, $second = ($_ -split '\|')
# When the first half matches, output the second half.
if ($first -eq $test) {
$second
}
}
)
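The difference between the substring test and the equality test can be seen on the two sample lines from the question. This is a small sketch with the lines held in an array instead of a file; $substringHits and $exactHits are illustrative names.

```powershell
# The two sample lines from the question (single quotes keep the $ literal).
$lines = @('abc_def|hf#23$', 'abc|ohgvtre')
$test  = 'abc'

# Substring test: both lines contain 'abc', so both are returned.
$substringHits = @($lines | Where-Object { $_.Contains($test) })

# Equality test on the part before the first pipe: only the second line qualifies.
$exactHits = @(foreach ($line in $lines) {
    $first, $second = $line -split '\|', 2   # at most two parts, so later pipes stay in $second
    if ($first -eq $test) { $second }
})
```

Note that -eq compares strings case-insensitively; -ceq would be the case-sensitive variant.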

Powershell to count columns in a file

I need to test the integrity of file before importing to SQL.
Each row of the file should have exactly the same number of columns.
These are "|" delimited files.
I also need to ignore the first line as it is garbage.
If every row does not have the same number of columns, then I need to write an error message.
I have tried using something like the following with no luck:
$colCnt = "c:\datafeeds\filetoimport.txt"
$file = (Get-Content $colCnt -Delimiter "|")
$file = $file[1..($file.count - 1)]
Foreach($row in $file){
$row.Count
}
Counting rows is easy. Columns is not.
Any suggestions?
Yep: read the file, skipping the first line. Split each line on the pipe and count the results. If the count differs from that of the first data line, output an error message and stop.
$colCnt = "c:\datafeeds\filetoimport.txt"
[int]$LastSplitCount = $Null
Get-Content $colCnt | ?{$_} | Select -Skip 1 | %{if($LastSplitCount -and !($_.split("|").Count -eq $LastSplitCount)){"Process stopped at line number $($_.psobject.Properties.value[5]) for column count mis-match.";break}elseif(!$LastSplitCount){$LastSplitCount = $_.split("|").Count}}
That should do it, and if it finds a bad column count it will stop and output something like:
Process stopped at line number 5 for column count mis-match.
Edit: Added a Where catch to skip blank lines ( ?{$_} )
Edit2: Ok, if you know what the column count should be then this is even easier.
Get-Content $colCnt | ?{$_} | Select -Skip 1 | %{if(!($_.split("|").Count -eq 210)){"Process stopped at line number $($_.psobject.Properties.value[5]), incorrect column count of: $($_.split("|").Count).";break}}
If you want it to return all lines that don't have 210 columns just remove the ;break and let it run.
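For readability, the same check can be spelled out as a loop that collects every mismatching line instead of stopping at the first. This is only a sketch on inline sample data; $lines stands in for Get-Content $colCnt, and $expected, $lineNumber and $badLines are illustrative names. It takes the column count of the first data line as the reference, as the first one-liner does.

```powershell
# Sample data: line 1 is the garbage header, line 4 is one column short.
$lines = @('garbage header', 'a|b|c', 'd|e|f', 'g|h', 'i|j|k')

$expected   = $null   # column count of the first data line
$lineNumber = 0
$badLines = foreach ($line in $lines) {
    $lineNumber++
    # Skip the header line and any blank lines.
    if ($lineNumber -eq 1 -or [string]::IsNullOrEmpty($line)) { continue }

    $count = $line.Split('|').Count
    if ($null -eq $expected) {
        $expected = $count
    }
    elseif ($count -ne $expected) {
        # Emit one object per offending line.
        [pscustomobject]@{ LineNumber = $lineNumber; Columns = $count }
    }
}

if ($badLines) {
    $badLines | ForEach-Object { "Line $($_.LineNumber) has $($_.Columns) columns, expected $expected." }
}
```

When reading from a file, the lines emitted by Get-Content should also carry a ReadCount property holding the line number, which may be a clearer way to report it than indexing into the property list by position.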
A more generic approach, including a RegEx filter:
$path = "path\to\folder"
$regex = "regex"
$expValue = 450
$files= Get-ChildItem $path | Where-Object {$_.Name -match $regex}
Foreach( $f in $files) {
$filename = $f.Name
echo $filename
$a = Get-Content $f.FullName;
$i = 1;
$e = 0;
echo "Starting...";
foreach($line in $a)
{
if ($line.length -ne $expValue){
echo $filename
$a | Measure-Object -Line
echo "Long:"
echo $line.Length;
echo "Line Nº: "
echo $i;
$e = $e + 1;
}
$i = $i+1;
}
echo "Finished";
if ($e -ne 0){
echo $e "errors found";
}else{
echo "No errors"
echo ""
}
}
echo "All files examined"
Another possibility:
$colCnt = "c:\datafeeds\filetoimport.txt"
$DataLine = (Get-Content $colCnt -TotalCount 2)[1]
$DelimCount = ([char[]]$DataLine -eq '|').count
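# Escape the pipe (a bare | is regex alternation) and anchor the pattern so lines with too few or too many delimiters fail to match.
$MatchString = '^[^|]*' + ('\|[^|]*' * $DelimCount) + '$'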
$test = Select-String -Path $colCnt -Pattern $MatchString -NotMatch |
where { $_.linenumber -ne 1 }
That will find the number of delimiter characters in the second line, and build a regex pattern that can be used with Select-String.
The -NotMatch switch will make it return any lines that don't match that pattern as MatchInfo objects that will have the filename, line number and content of the problem lines.
Edit: Since the first line is "garbage" you probably don't care if it didn't match so I added a filter to the result to drop that out.
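How such a pattern behaves can be checked on a few strings before running it over a file. The sketch below assumes fields never contain the delimiter itself; $delimCount, $pattern and the sample rows are made up for illustration.

```powershell
$delimCount = 2   # two delimiters, i.e. a three-column row such as 'a|b|c'

# Anchored pattern: exactly $delimCount escaped pipes, with pipe-free fields between them.
$pattern = '^[^|]*' + ('\|[^|]*' * $delimCount) + '$'

$matchesExact   = 'a|b|c'   -match $pattern   # right number of columns
$matchesTooFew  = 'a|b'     -match $pattern   # one column short
$matchesTooMany = 'a|b|c|d' -match $pattern   # one column too many

# For comparison: with unescaped pipes the pattern is an alternation of '.*' and matches anything.
$looseMatches = 'a|b|c|d' -match ('.*' + ('|.*' * $delimCount))
```

Only the first test is true for the anchored pattern, while the loose pattern accepts even the four-column row.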

Extracting columns from text file using PowerShell

I have to extract columns from a text file explained in this post:
Extracting columns from text file using Perl one-liner: similar to Unix cut
but I have to do this also in a Windows Server 2008 which does not have Perl installed. How could I do this using PowerShell? Any ideas or resources? I'm PowerShell noob...
Try this:
Get-Content test.txt | Foreach {($_ -split '\s+',4)[0..2]}
And if you want the data in those columns printed on the same line:
Get-Content test.txt | Foreach {"$(($_ -split '\s+',4)[0..2])"}
Note that this requires PowerShell 2.0 for the -split operator. Also, the ,4 tells the split operator the maximum number of substrings to return; keep in mind that the last substring will contain everything that is left over, unsplit.
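A quick illustration of the limit argument, using a made-up input line:

```powershell
$line = 'alpha beta   gamma delta epsilon'

# Split on runs of whitespace, but return at most 4 substrings.
$parts = $line -split '\s+', 4
# $parts is 'alpha', 'beta', 'gamma', 'delta epsilon' (the remainder stays unsplit)

# Keep the first three columns.
$firstThree = $parts[0..2]

# Expanding the array inside a string joins the elements with a space by default.
$joined = "$firstThree"   # 'alpha beta gamma'
```

Asking for one substring more than the number of columns you need keeps the unsplit remainder out of the last column you actually use.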
For fixed width columns, here's one approach for column width equal to 7 ($w=7):
$res = Get-Content test.txt | Foreach {
$i=0;$w=7;$c=0; `
while($i+$w -lt $_.length -and $c++ -lt 2) {
$_.Substring($i,$w);$i=$i+$w-1}}
$res will contain each column for all rows. To set the max columns change $c++ -lt 2 from 2 to something else. There is probably a more elegant solution but don't have time right now to ponder it. :-)
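As a simpler sketch of fixed-width extraction, the loop below assumes each column starts exactly $width characters after the previous one, with no separator character in between. That layout is an assumption and may not match the file the snippet above was written for. The sample line and the names $width, $maxColumns and $columns are illustrative.

```powershell
$line       = 'AAAAAAABBBBBBBCCCCCCCDDDD'   # made-up row: three full 7-character columns plus a short tail
$width      = 7
$maxColumns = 3

$columns = for ($i = 0; $i -lt $line.Length -and ($i / $width) -lt $maxColumns; $i += $width) {
    # Clamp the length so a short final column does not make Substring throw.
    $len = [Math]::Min($width, $line.Length - $i)
    $line.Substring($i, $len)
}
# $columns is 'AAAAAAA', 'BBBBBBB', 'CCCCCCC'
```

Raising $maxColumns to 4 would also return the short tail 'DDDD'.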
Assuming it's white space delimited this code should do.
$fileName = "someFilePath.txt"
$columnToGet = 2
$columns = gc $fileName |
%{ $_.Split(" ",[StringSplitOptions]"RemoveEmptyEntries")[$columnToGet] }
The ordinary way:
type foo.bar | % { $_.Split(" ") | select -first 3 }
Try this. This will help to skip initial rows if you want, extract/iterate through columns, edit the column data and rebuild the record:
$header3 = @("Field_1","Field_2","Field_3","Field_4","Field_5")
Import-Csv $fileName -Header $header3 -Delimiter "`t" | select -skip 3 | Foreach-Object {
$record = $indexName
foreach ($property in $_.PSObject.Properties){
#doSomething $property.Name, $property.Value
if($property.Name -like '*CUSIP*'){
$record = $record + "," + '"' + $property.Value + '"'
}
else{
$record = $record + "," + $property.Value
}
}
$array.add($record) | out-null
#write-host $record
}