PowerShell - Remove duplicate lines in TXT based on ID

I have a TXT file with thousands of lines. The number after the first slash is the image ID.
I want to delete lines so that only one line remains for every ID. Which of the lines gets removed doesn't matter.
I tried to pipe the TXT to a CSV with PowerShell and work with the unique parameter, but it didn't work. Any ideas how I can iterate through the TXT and remove lines so that only one line per unique ID remains? :/
Status Today
thumbnails/4000896042746/2021-08-17_4000896042746_small.jpg
thumbnails/4000896042746/2021-08-17_4000896042746_smallX.jpg
thumbnails/4000896042333/2021-08-17_4000896042746_medium.jpg
thumbnails/4000896042444/2021-08-17_4000896042746_hugex.jpg
thumbnails/4000896042333/2021-08-17_4000896042746_tiny.jpg
After the script
thumbnails/4000896042746/2021-08-17_4000896042746_small.jpg
thumbnails/4000896042333/2021-08-17_4000896042746_medium.jpg
thumbnails/4000896042444/2021-08-17_4000896042746_hugex.jpg

Since this concerns a "TXT file with thousands of lines", I would use the PowerShell pipeline for this because (if correctly set up) it will perform about the same but use far less memory.
Performance improvements come from using a HashSet (or a hashtable), whose hash-based lookups are much faster than e.g. grouping.
(I am pleading to get an accelerated HashSet #16003 into PowerShell.)
$Unique = [System.Collections.Generic.HashSet[string]]::new()
Get-Content .\InFile.txt |ForEach-Object {
if ($Unique.Add(($_.Split('/'))[-2])) { $_ }
} | Set-Content .\OutFile.txt
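If the files get really big, the same streaming idea can be pushed down to raw .NET readers and writers. Here is a minimal sketch along those lines, using the same placeholder file names as above (note that .NET resolves relative paths against the process working directory, so full paths are safer):
$Unique = [System.Collections.Generic.HashSet[string]]::new()
$reader = [System.IO.File]::OpenText('.\InFile.txt')
$writer = [System.IO.StreamWriter]::new('.\OutFile.txt')
try {
    while ($null -ne ($line = $reader.ReadLine())) {
        # keep the line only the first time its ID (second-to-last '/' segment) is seen
        if ($Unique.Add($line.Split('/')[-2])) { $writer.WriteLine($line) }
    }
}
finally {
    $reader.Dispose()
    $writer.Dispose()
}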

To add to iRon's great answer, I've done a speed comparison of 6 variations (including iRon's original), using 250k lines of the OP's example.
Reading with Get-Content -Raw and writing with Set-Content is the fastest way to do it, at least in these examples, as it is nearly 3x faster than plain Get-Content with Set-Content.
I was curious to see how the HashSet method stacked up against the System.Collections.ArrayList one, and as you can see from the results below, they are not too dissimilar.
Edit note: got the -Raw switch to work, as the content needed splitting on newlines.
$fileIn = "C:\Users\user\Desktop\infile.txt"
$fileOut = "C:\Users\user\Desktop\outfile.txt"
# All examples below tested with 250,000 lines
# In order from fastest to slowest
#
# EXAMPLE 1 (Fastest)
#
# [Finished in 2.4s]
# Using the -Raw switch only with Get-Content
$Unique = [System.Collections.Generic.HashSet[string]]::new()
$fileInSplit = (Get-Content -Raw $fileIn).Split([Environment]::NewLine, [StringSplitOptions]::None)
$fileInSplit |ForEach-Object {
if ($Unique.Add(($_.Split('/'))[-2])) { $_ }
} | Set-Content $fileOut
#
# EXAMPLE 2 (2nd fastest)
#
# [Finished in 2.5s]
# Using the -Raw switch with Get-Content
# Using [IO.File] for write only
$Unique = [System.Collections.Generic.HashSet[string]]::new()
$fileInSplit = (Get-Content -Raw $fileIn).Split([Environment]::NewLine, [StringSplitOptions]::None)
$contentToWriteArr = New-Object System.Collections.ArrayList
$fileInSplit |ForEach-Object {
if ($Unique.Add(($_.Split('/'))[-2])) { [void]$contentToWriteArr.Add($_) }
}
[IO.File]::WriteAllLines($fileOut, $contentToWriteArr)
#
# EXAMPLE 3 (3rd fastest example)
#
# [Finished in 2.7s]
# Using [IO.File] for the read and write
$Unique = [System.Collections.Generic.HashSet[string]]::new()
$fileInSplit = [IO.File]::ReadAllLines($fileIn)
$contentToWriteArr = [Collections.Generic.HashSet[string]]::new()
$fileInSplit |ForEach-Object {
if ($Unique.Add(($_.Split('/'))[-2])) { $contentToWriteArr.Add($_) | out-null }
}
[IO.File]::WriteAllLines($fileOut, $contentToWriteArr)
#
# EXAMPLE 4 (4th fastest example)
#
# [Finished in 2.8s]
# Using [IO.File] for the read only
$Unique = [System.Collections.Generic.HashSet[string]]::new()
$fileInSplit = [IO.File]::ReadAllLines($fileIn)
$fileInSplit |ForEach-Object {
if ($Unique.Add(($_.Split('/'))[-2])) { $_ }
} | Set-Content $fileOut
#
# EXAMPLE 5 (5th fastest example)
#
# [Finished in 2.9s]
# Using [IO.File] for the read and write
# This is using a System.Collections.ArrayList instead of a HashSet
$Unique = [System.Collections.Generic.HashSet[string]]::new()
$fileInSplit = [IO.File]::ReadAllLines($fileIn)
$contentToWriteArr = New-Object System.Collections.ArrayList
$fileInSplit |ForEach-Object {
if ($Unique.Add(($_.Split('/'))[-2])) { $contentToWriteArr.Add($_) | out-null }
}
[IO.File]::WriteAllLines($fileOut, $contentToWriteArr)
#
# EXAMPLE 6 (Slowest example) - As per iRon's answer
#
# [Finished in 7.2s]
$Unique = [System.Collections.Generic.HashSet[string]]::new()
$fileInSplit = Get-Content $fileIn
$fileInSplit |ForEach-Object {
if ($Unique.Add(($_.Split('/'))[-2])) { $_ }
} | Set-Content $fileOut
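For reference, each [Finished in Ns] figure above can be reproduced by wrapping an example in Measure-Command; a minimal sketch, timing the Example 6 body:
$elapsed = Measure-Command {
    $Unique = [System.Collections.Generic.HashSet[string]]::new()
    Get-Content $fileIn | ForEach-Object {
        if ($Unique.Add(($_.Split('/'))[-2])) { $_ }
    } | Set-Content $fileOut
}
"Finished in $([math]::Round($elapsed.TotalSeconds, 1))s"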

Here's a solution that uses a calculated property to create an object that contains the ID and the FileName. Then I group the result by ID, iterate over each group, and select the first FileName:
$yourFileList = @(
'thumbnails/4000896042746/2021-08-17_4000896042746_small.jpg',
'thumbnails/4000896042746/2021-08-17_4000896042746_smallX.jpg',
'thumbnails/4000896042333/2021-08-17_4000896042746_medium.jpg',
'thumbnails/4000896042444/2021-08-17_4000896042746_hugex.jpg',
'thumbnails/4000896042333/2021-08-17_4000896042746_tiny.jpg'
)
$yourFileList |
Select-Object @{ Name = 'Id'; Expression = { ($_ -split '/')[1] } }, @{ Name = 'FileName'; Expression = { $_ } } |
Group Id |
ForEach-Object { $_.Group[0].FileName }
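With the sample list above, this outputs one line per ID:
thumbnails/4000896042746/2021-08-17_4000896042746_small.jpg
thumbnails/4000896042333/2021-08-17_4000896042746_medium.jpg
thumbnails/4000896042444/2021-08-17_4000896042746_hugex.jpg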

You can group by a custom property. So if you know what your ID is, then you just have to group by that and then take the first element from each group:
$content = Get-Content "path_to_your_file";
$content = ($content | group { ($_ -split "/")[1] } | % { $_.Group[0] });
$content | Out-File "path_to_your_result_file"
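If the order of the surviving lines doesn't matter, Sort-Object -Unique over the same calculated key is an even shorter variant of this idea; a sketch with the same placeholder paths:
Get-Content "path_to_your_file" |
    Sort-Object { ($_ -split '/')[1] } -Unique |
    Out-File "path_to_your_result_file"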

Related

Comparing two text files and output the differences in Powershell

So I'm new to the PowerShell scripting world and I'm trying to compare a list of IPs in a text file against a database of IPs. If an IP from (file) does not exist in the (database) file, put it in a new file, let's call it compared.txt. When I tried to run the script, I didn't get any result. What am I missing here?
$file = Get-Content "C:\Users\zack\Desktop\file.txt"
$database = Get-Content "C:\Users\zack\Desktop\database.txt"
foreach($line1 in $file){
$check = 0
foreach($line2 in $database)
{
if($line1 != $line2)
{
$check = 1
}
else
{
$check = 0
break
}
}
if ($check == 1 )
{
$line2 | Out-File "C:\Users\zack\Desktop\compared.txt"
}
}
There is a problem with your use of PowerShell comparison operators: unlike in C#, equality and inequality are -eq and -ne, and since PowerShell is a case-insensitive language, there are also -ceq and -cne for case-sensitive comparisons.
There is also a problem with your code's logic, a simple working version of it would be:
$database = Get-Content "C:\Users\zack\Desktop\database.txt"
# iterate each line in `file.txt`
$result = foreach($line1 in Get-Content "C:\Users\zack\Desktop\file.txt") {
# iterate each line in `database.txt`
# this happens on each iteration of the outer loop
$check = foreach($line2 in $database) {
# if this line of `file.txt` is the same as this line of `database.txt`
if($line1 -eq $line2) {
# we don't need to keep checking, output this boolean
$true
# and break the inner loop
break
}
}
# if above condition was NOT true
if(-not $check) {
# output this line, can be `$line1` or `$line2` (same thing here)
$line1
}
}
$result | Set-Content path\to\comparisonresult.txt
However, there are even more simplified ways you could achieve the same results:
Using containment operators:
$database = Get-Content "C:\Users\zack\Desktop\database.txt"
$result = foreach($line1 in Get-Content "C:\Users\zack\Desktop\file.txt") {
if($line1 -notin $database) {
$line1
}
}
$result | Set-Content path\to\comparisonresult.txt
Using Where-Object:
$database = Get-Content "C:\Users\zack\Desktop\database.txt"
Get-Content "C:\Users\zack\Desktop\file.txt" | Where-Object { $_ -notin $database } |
Set-Content path\to\comparisonresult.txt
Using a HashSet<T> and its ExceptWith method (note, this will also get rid of duplicates in your file.txt):
$file = [System.Collections.Generic.HashSet[string]]@(
Get-Content "C:\Users\zack\Desktop\file.txt"
)
$database = [string[]]@(Get-Content "C:\Users\zack\Desktop\database.txt")
$file.ExceptWith($database)
$file | Set-Content path\to\comparisonresult.txt
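Compare-Object is one more built-in that fits here; a sketch of the same comparison (the '<=' side indicator marks lines present only in file.txt):
$file = Get-Content "C:\Users\zack\Desktop\file.txt"
$database = Get-Content "C:\Users\zack\Desktop\database.txt"
Compare-Object -ReferenceObject $file -DifferenceObject $database |
    Where-Object SideIndicator -eq '<=' |
    Select-Object -ExpandProperty InputObject |
    Set-Content path\to\comparisonresult.txt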

Stuck with this PS Script

I have a text file that contains millions of records.
I want to find each line that does not start with the string, plus that line's number (the string starts with a double quote: "01/01/2019).
Can you help me modify this code?
Get-Content "(path).txt" | Foreach { if ($_.Split(',')[-1] -inotmatch "^01/01/2019") { $_; } }
Thanks
Based on your comments, the content will look something like the array below.
So you want to read the content, filter it, and get the resulting lines from that content:
# Get the content
# $content = Get-Content -Path 'pathtofile.txt'
$content = @('field1,field2,field3', '01/01/2019,b,c')
# Convert from csv
$csvContent = $content | ConvertFrom-Csv
# Add your filter based on the field
$results = $csvContent | Where-Object { $_.field1 -notmatch '01/01/2019' }
# Convert your results back to csv if needed
$results | ConvertTo-Csv
If performance is an issue, then .NET would handle millions of records with CsvHelper, just like Power BI does.
# install CsvHelper
nuget install CsvHelper
# import csvhelper
Import-Module CsvHelper.2.16.3.0\lib\net45\CsvHelper.dll
# write the content to the file just for this example
@('field1,field2,field3', '01/01/2019,b,c') | sc -path "c:\temp\text.csv"
$results = New-Object System.Collections.ArrayList
# open the file for reading
try {
$stream = [System.IO.File]::OpenRead("c:\temp\text.csv")
$sr = [System.IO.StreamReader]::new($stream)
$csv = [CsvHelper.CsvReader]::new($sr)
# read in the records
while($csv.Read()){
# add in the result
$result = @{}
[string] $value = "";
for($i = 0; $csv.TryGetField($i, [ref] $value ); $i++) {
$result.Add($i, $value);
}
# add your filter here for the results
[void]$results.Add($result)
}
# dispose of everything once we are done
}finally {
$stream.Dispose();
$sr.Dispose();
$csv.Dispose();
}
My .txt file looks like this...
date,col2,col3
"01/01/2019 22:42:00", "column2", "column3"
"01/02/2019 22:42:00", "column2", "column3"
"01/01/2019 22:42:00", "column2", "column3"
"02/01/2019 22:42:00", "column2", "column3"
This command does exactly what you are asking...
Get-Content -Path C:\myFile.txt | ? {$_ -notmatch "01/01/2019"} | Select -Skip 1
The output is:
"01/02/2019 22:42:00", "column2", "column3"
"02/01/2019 22:42:00", "column2", "column3"
I skipped the top row. If you want to deal with particular columns, change myFile.txt to a .csv and import it.
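One caveat: -notmatch "01/01/2019" also drops rows where that date happens to appear in a later column. Anchoring the pattern to the quoted start of the line avoids that; a sketch:
Get-Content -Path C:\myFile.txt | ? {$_ -notmatch '^"01/01/2019'} | Select -Skip 1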
Looking at the question and comments, it seems you are dealing with a headerless CSV file. Because the file contains millions of records, I think using Get-Content or Import-Csv would slow down too much. Using [System.IO.File]::ReadLines() would then be faster.
If indeed each line starts with a quoted date, you could use various methods to figure out whether the line starts with "01/01/2019 or not. Here, I use the -notlike operator:
$fileIn = "D:\your_text_file_which_is_in_fact_a_CSV_file.txt"
$fileOut = "D:\your_text_file_which_is_in_fact_a_CSV_file_FILTERED.txt"
foreach ($line in [System.IO.File]::ReadLines($fileIn)) {
if ($line -notlike '"01/01/2019*') {
# write to a NEW file
Add-Content -Path $fileOut -Value $line
}
}
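One caveat with the loop above: Add-Content opens and closes the output file for every matching line. Collecting the matches first and writing once should be noticeably faster; a sketch:
$kept = [System.Collections.Generic.List[string]]::new()
foreach ($line in [System.IO.File]::ReadLines($fileIn)) {
    if ($line -notlike '"01/01/2019*') { $kept.Add($line) }
}
[System.IO.File]::WriteAllLines($fileOut, $kept)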
Update
Judging from your comment, you are apparently using an older .NET Framework, as [System.IO.File]::ReadLines() only became available as of version 4.0.
In that case, the below code should work for you:
$fileIn = "D:\your_text_file_which_is_in_fact_a_CSV_file.txt"
$fileOut = "D:\your_text_file_which_is_in_fact_a_CSV_file_FILTERED.txt"
$reader = New-Object System.IO.StreamReader($fileIn)
$writer = New-Object System.IO.StreamWriter($fileOut)
while (($line = $reader.ReadLine()) -ne $null) {
if ($line -notlike '"01/01/2019*') {
# write to a NEW file
$writer.WriteLine($line)
}
}
$reader.Dispose()
$writer.Dispose()

Get a part of a text into a variable

I have a .txt file with a content that is delimited by a blank line.
Eg.
Question1
What is your favourite colour?
Question2
What is your hobby?
Question3
What kind of music do you like?
...and so on.
I would like to put each of the text questions into an array.
I tried this
$path=".\Documents\Questions.txt"
$shareArray= gc $path
But that gives me every line as a separate item in the variable.
Can someone give me a tip?
Thanks
This is a different approach which can handle multi-line questions and doesn't need a separating blank line. The split pattern is ^Question, but this text itself is not excluded from the chunks. Output is to Out-GridView.
## LotPings 2016-11-26
$InFile = ".\Questions.txt"
## prepare Table
$Table = New-Object system.Data.DataTable
$col = New-Object system.Data.DataColumn "QuestionNo",([string])
$Table.columns.add($col)
$col = New-Object system.Data.DataColumn "QuestionBody",([string])
$Table.columns.add($col)
## prepare RegEx for the split
$Delimiter = [regex]'Question'
$Split = "(?!^)(?=$Delimiter)"
(Get-Content $InFile -Raw) -split $Split |
ForEach-Object {
If ($_ -match '(?smi)^(?<QuestionNo>Question\d+)( *)(?<QuestionBody>.*)$') {
$Row = $Table.Newrow()
$Row.QuestionNo = $matches.QuestionNo.Trim()
$Row.QuestionBody = $matches.QuestionBody.Trim()
$Table.Rows.Add($Row)
} else {Write-Host "no Match"}
}
$Table | Out-Gridview
If you just want an array, you could filter on whether the lines have ? at the end or not:
$Questions = Get-Content $path |Where-Object {$_.Trim() -match '\?$'}
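With the example file that gives:
What is your favourite colour?
What is your hobby?
What kind of music do you like?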
If you want to store the questions by the preceding name, you could use a hashtable.
Start by reading the file as a single string, then split by two consecutive line breaks:
$Blocks = (Get-Content $path -Raw) -split '\r?\n\r?\n' |ForEach-Object { $_.Trim() }
If you want both lines in each array item, you can stop here.
Otherwise split each block into a "question name" and "question" part, use those to populate your hashtable:
$Questions = @{}
$Blocks |ForEach-Object {
$Name,$Question = $_ -split '\r?\n'
$Questions[$Name.Trim()] = $Question
}
Now you can access the questions like:
$Questions['Question1']
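With the sample file, that returns the question text:
What is your favourite colour?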
other solution 1
select-string -Path "C:\temp\test.txt" -Pattern "Question[0-1]*" -Context 0,1 |
% {$_.Context.PostContext} |
out-file "c:\temp\result.txt"
other solution 2
$template=#"
{Question*:Question1}
{Text:an example of question?}
{Question*:Question2}
{Text:other example of question with Upper and digits 12}
"#
gc "C:\temp\test.txt" | ConvertFrom-String -TemplateContent $template | select Text
other solution 3
gc "C:\temp\test.txt" | ?{$_ -notmatch 'Question\d+' -and $_ -ne "" }

How to modify contents of a pipe-delimited text file with PowerShell

I have a pipe-delimited text file. The file contains "records" of various types. I want to modify certain columns for each record type. For simplicity, let's say there are 3 record types: A, B, and C. A has 3 columns, B has 4 columns, and C has 5 columns. For example, we have:
A|stuff|more_stuff
B|123|other|x
C|something|456|stuff|more_stuff
B|78903|stuff|x
A|1|more_stuff
I want to append the prefix "P" to all desired columns. For A, the desired column is 2. For B, the desired column is 3. For C, the desired column is 4.
So, I want the output to look like:
A|Pstuff|more_stuff
B|123|Pother|x
C|something|456|Pstuff|more_stuff
B|78903|Pstuff|x
A|P1|more_stuff
I need to do this in PowerShell. The file could be very large, so I'm thinking about going with the File class of .NET. If it were a simple string replacement, I would do something like:
$content = [System.IO.File]::ReadAllText("H:\test_modify_contents.txt").Replace("replace_text","something_else")
[System.IO.File]::WriteAllText("H:\output_file.txt", $content)
But, it's not so simple in my particular situation. So, I'm not even sure if ReadAllText and WriteAllText is the best solution. Any ideas on how to do this?
I would use ConvertFrom-Csv so you can check each line as an object. In this code I did add a header, but mainly for code readability; the header is cut out of the output on the last line anyway:
$inFile = "H:\test_modify_contents.txt"
$output = "H:\output_file.txt"
$data = Get-Content -Path $inFile | ConvertFrom-Csv -Delimiter '|' -Header 'Column1','Column2','Column3','Column4','Column5'
$data | % {
If ($_.Column5) {
#type C:
$_.Column4 = "P$($_.Column4)"
} ElseIf ($_.Column4) {
#type B:
$_.Column3 = "P$($_.Column3)"
} Else {
#type A:
$_.Column2 = "P$($_.Column2)"
}
}
$data | Select Column1,Column2,Column3,Column4,Column5 | ConvertTo-Csv -Delimiter '|' -NoTypeInformation | Select-Object -Skip 1 | Set-Content -Path $output
It does add extra | for the type A and B lines. Output:
"A"|"Pstuff"|"more_stuff"||
"B"|"123"|"Pother"|"x"|
"C"|"something"|"456"|"Pstuff"|"more_stuff"
"B"|"78903"|"Pstuff"|"x"|
"A"|"P1"|"more_stuff"||
If your file sizes are large, then reading the complete file contents at once using Import-Csv or ReadAll is probably not a good idea. I would use the Get-Content cmdlet with the -ReadCount parameter, which will stream the file one row at a time, and then use a regex for the processing. Something like this:
Get-Content your_in_file.txt -ReadCount 1 | % {
$_ -replace '^(A\||B\|[^\|]+\||C\|[^\|]+\|[^\|]+\|)(.*)$', '$1P$2'
} | Set-Content your_out_file.txt
EDIT:
This version should output faster:
$d = Get-Date
Get-Content input.txt -ReadCount 1000 | % {
$_ | % {
$_ -replace '^(A\||B\|[^\|]+\||C\|[^\|]+\|[^\|]+\|)(.*)$', '$1P$2'
} | Add-Content output.txt
}
(New-TimeSpan $d (Get-Date)).TotalMilliseconds
For me this processed 50k rows in 350 milliseconds. You can probably get more speed by tweaking the -ReadCount value to find the ideal amount.
Given the large input file, I would not use either ReadAllText or Get-Content.
They actually read the entire file into memory.
Consider using something along the lines of
$filename = ".\input2.csv"
$outfilename = ".\output2.csv"
function ProcessFile($inputfilename, $outputfilename)
{
$reader = [System.IO.File]::OpenText($inputfilename)
$writer = New-Object System.IO.StreamWriter $outputfilename
$record = $reader.ReadLine()
while ($record -ne $null)
{
$writer.WriteLine(($record -replace '^(A\||B\|[^\|]+\||C\|[^\|]+\|[^\|]+\|)(.*)$', '$1P$2'))
$record = $reader.ReadLine()
}
$reader.Close()
$reader.Dispose()
$writer.Close()
$writer.Dispose()
}
ProcessFile $filename $outfilename
EDIT: After testing all the suggestions on this page, I have borrowed the regex from Dave Sexton and this is the fastest implementation. It processes a 1 GB+ file in 175 seconds. All other implementations are significantly slower on large input files.

PowerShell - reading ahead and While

I have a text file in the following format:
.....
ENTRY,PartNumber1,,,
FIELD,IntCode,123456
...
FIELD,MFRPartNumber,ABC123,,,
...
FIELD,XPARTNUMBER,ABC123
...
FIELD,InternalPartNumber,3214567
...
ENTRY,PartNumber2,,,
...
...
the ... indicates there is other data between these fields. The ONLY thing I can be certain of is that a line starting with ENTRY begins a new set of records. The rows starting with FIELD can be in any order, and not all of them may be present in each group of data.
I need to read in a chunk of data, search for any field matching the string ABC123, and if ABC123 is found, search for the existence of the InternalPartNumber field and return that row of data.
I have not seen a way to use Get-Content that can read in a variable number of rows as a set and be able to search it.
Here is the code I currently have, which will read a file, searching for a string and replacing it with another. I hope this can be modified to be used in this case.
$ftype = "*.txt"
$fnames = gci -Path $filefolder1 -Filter $ftype -Recurse|% {$_.FullName}
$mfgPartlist = Import-Csv -Path "C:\test\mfrPartList.csv"
foreach ($file in $fnames) {
$contents = Get-Content -Path $file
foreach ($partnbr in $mfgPartlist) {
$oldString = $mfgPartlist.OldValue
$newString = $mfgPartlist.NewValue
if (Select-String -Path $file -SimpleMatch $oldString -Debug -Quiet) {
$stringData = $contents -imatch $oldString
$stringData = $stringData -replace "[\n\r]","|"
foreach ($dataline in $stringData) {
$file +"|"+$stringData+"|"+$oldString+"|"+$newString|Out-File "C:\test\Datachanges.txt" -Width 2000 -Append
}
$contents = $contents -replace $oldString, $newString
Set-Content -Path $file -Value $contents
}
}
}
Is there a way to read & search a text file in "chunks" using PowerShell? Or to do a read-ahead & determine what to search?
Assuming your file isn't too big to read into memory all at once:
$Text = Get-Content testfile.txt -Raw
($Text -split '(?ms)^(?=ENTRY)') |
foreach {
if ($_ -match '(?ms)^FIELD\S+ABC123')
{$_ -replace '(?ms).+(^Field\S+InternalPartNumber.+?$).+','$1'}
}
FIELD,InternalPartNumber,3214567
That reads the entire file in as a single multiline string, and then splits it at the beginning of any line that starts with 'ENTRY'. Then it tests each segment for a FIELD line that contains 'ABC123', and if it does, removes everything except the FIELD line for the InternalPartNumber.
This is not my best work as I have just got back from vacation. You could use a while loop reading the text and set an entry flag to gobble up the text in chunks. However, if your files are not too big, then you could just read the text file at once and use regex to split up the chunks and then process accordingly.
$pattern = "ABC123"
$matchedRowToReturn = "InternalPartNumber"
$fileData = Get-Content "d:\temp\test.txt" | Where-Object{$_ -match '^(entry|field)'} | Out-String
$parts = $fileData | Select-String '(?smi)(^Entry).*?(?=^Entry|\Z)' -AllMatches | Select-Object -ExpandProperty Matches | Select-Object -ExpandProperty Value
$parts | Where-Object{$_ -match $pattern} | Select-String "$matchedRowToReturn.*$" | Select-Object -ExpandProperty Matches | Select-Object -ExpandProperty Value
What this will do is read in the text file, drop any lines that are not ENTRY or FIELD related, join the rest as one long string, and split it up into chunks that start with lines that begin with the word "Entry".
Then we drop those "parts" that do not contain the $pattern. From the remaining chunks that match, we extract the InternalPartNumber line and present it.