Two files: keep lines with identical first n characters only

There are two text files in the current working directory, a.txt and b.txt. From a.txt, I would like to delete all lines whose first 5 characters do NOT appear as the first 5 characters of any line in b.txt. (In other words, keep only those lines of a.txt whose first 5 characters also start some line in b.txt.) Anything after the 5th character is irrelevant.
For example, a.txt:
abcde000dsdsddsdsdsdsdsd
0123456xxx
kkk
xyzxyzxyzfeeeee
kkkkkkkkkkk
and b.txt:
012345aabbcc
kkkkkkkhhkkvv
nnnnnnn5777nnnn77567
Intended result (lines in a.txt whose first 5 characters are present in b.txt):
0123456xxx
kkkkkkkkkkk
When I run the following code, it gives me an empty results.txt but no error messages. What am I missing?
$pattern = "^[5]"
$set1 = Get-Content -Path a.txt
$results = New-Object -TypeName System.Text.StringBuilder
Get-Content -Path b.txt | foreach {
    if ($_ -match $pattern) {
        [void]$results.AppendLine($_)
    }
}
$results.ToString() | Out-File -FilePath .\results.txt -Encoding ascii

Your code doesn't work because your pattern doesn't match anything. The regular expression ^[5] means "the character '5' at the beginning of the string" (the square brackets define a character class), not "5 characters at the beginning of the string". The latter would be ^.{5}. Also, you never match the content of a.txt against the content of b.txt.
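The difference between the two patterns can be checked directly at the prompt. This is only a small sketch; the variable names are made up for illustration and the sample strings are taken from the question.

```powershell
# '^[5]' is a character class: it matches only a literal '5' at the start.
$classPattern = '^[5]'
# '^.{5}' is a quantifier: it matches any five characters at the start.
$countPattern = '^.{5}'

$sample = '012345aabbcc'        # a line from the question's b.txt

$sample -match $classPattern    # False: the line starts with '0', not '5'
'5abc'  -match $classPattern    # True: the line starts with '5'
$sample -match $countPattern    # True: the line has at least five characters
'kkk'   -match $countPattern    # False: the line has only three characters
```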
There are several ways to do what you want:
Extract the first 5 characters from each line of b.txt into an array and compare the lines of a.txt against that array. Esperento57's answer more or less uses this approach, but in a way that requires PowerShell v3 or newer. A variant that works on all PowerShell versions could look like this:
$pattern = '^(.{5}).*'
$ref = (Get-Content 'b.txt') -match $pattern -replace $pattern, '$1' |
    Get-Unique
Get-Content 'a.txt' | Where-Object {
    $ref -contains ($_ -replace $pattern, '$1')
} | Set-Content 'results.txt'
Since lookups in arrays are comparatively slow and don't scale well (they get significantly slower as the number of elements grows), you could also put the reference values in a hashtable so that you can do index lookups, which are significantly faster:
$pattern = '^(.{5}).*'
$ref = @{}
(Get-Content 'b.txt') -match $pattern -replace $pattern, '$1' |
    ForEach-Object { $ref[$_] = $true }
Get-Content 'a.txt' | Where-Object {
    $ref.ContainsKey(($_ -replace $pattern, '$1'))
} | Set-Content 'results.txt'
Another alternative would be to build a second regular expression from the substrings extracted from b.txt and compare the content of a.txt against that expression:
$pattern = '^(.{5}).*'
$list = (Get-Content 'b.txt') -match $pattern -replace $pattern, '$1' |
    Get-Unique |
    ForEach-Object { [regex]::Escape($_) }
$ref = '^({0})' -f ($list -join '|')
(Get-Content 'a.txt') -match $ref | Set-Content 'results.txt'
Note that each of these approaches will ignore lines shorter than 5 characters.
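To check the hashtable idea against the sample data from the question without touching any files, the lines can be held in arrays. The array names ($aLines, $bLines, $kept) are assumptions made for this sketch; the answer itself reads the files with Get-Content. The filter here tests the line length with -match and looks up the captured prefix, which is a slight variation on the -replace form used above.

```powershell
$pattern = '^(.{5}).*'

# Sample data from the question, held in arrays instead of files (illustration only).
$aLines = 'abcde000dsdsddsdsdsdsdsd', '0123456xxx', 'kkk', 'xyzxyzxyzfeeeee', 'kkkkkkkkkkk'
$bLines = '012345aabbcc', 'kkkkkkkhhkkvv', 'nnnnnnn5777nnnn77567'

# Index the 5-character prefixes of the b lines; -match drops lines shorter than 5 characters.
$ref = @{}
$bLines -match $pattern -replace $pattern, '$1' | ForEach-Object { $ref[$_] = $true }

# Keep the a lines that are long enough and whose prefix is in the index.
$kept = @($aLines | Where-Object { $_ -match $pattern -and $ref.ContainsKey($Matches[1]) })
$kept   # 0123456xxx and kkkkkkkkkkk
```

Note that a hashtable created with @{} compares string keys case-insensitively, so prefixes that differ only in case are treated as equal.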

Try something like this:
$listB=get-content "c:\temp\b.txt" | where {$_.Length -gt 4} | select @{N="First5";E={$_.Substring(0, 5)}}
get-content "c:\temp\a.txt" | where {$_.Length -gt 4 -and $_.Substring(0, 5) -in $listB.First5}

If performance is a concern, consider using hashtables as an index:
$Pattern = '^(.{5}).*'
$a = @{}; $b = @{}
Get-Content -Path a.txt | Where {$_ -Match $Pattern} | ForEach {$a[$Matches[1]] = @($a[$Matches[1]] + $_)}
Get-Content -Path b.txt | Where {$_ -Match $Pattern} | ForEach {$b[$Matches[1]] = @($b[$Matches[1]] + $_)}
$a.Keys | Where {$b.ContainsKey($_)} | ForEach {$a[$_]} | Set-Content results.txt

Related

How to strip out leading time stamp?

I have some log files.
Some of the UPDATE SQL statements are getting errors, but not all.
I need to know all the statements that are getting errors so I can find the pattern of failure.
I can sort all the log files and get the unique lines, like this:
$In = "C:\temp\data"
$Out1 = "C:\temp\output1"
$Out2 = "C:\temp\output2"
Remove-Item $Out1\*.*
Remove-Item $Out2\*.*
# Get the log files from the last 90 days
Get-ChildItem $In -Filter *.log | Where-Object {$_.LastWriteTime -gt (Get-Date).AddDays(-90)} |
    Foreach-Object {
        $content = Get-Content $_.FullName
        #filter and save content to a file
        $content | Where-Object {$_ -match 'STATEMENT'} | Sort-Object -Unique | Set-Content $Out1\$_
    }
# merge all the files, sort unique, write to output
Get-Content $Out1\* | Sort-Object -Unique | Set-Content $Out2\output.txt
Works great.
But some of the logs have a date-time stamp in the first 24 characters of each line. I need to strip that out; otherwise all of those lines are unique.
If it helps, all the files either have the leading timestamp or they don't. The lines are not mixed within a single file.
Here is what I have so far:
# Get the log files from the last 90 days
Get-ChildItem $In -Filter *.log | Where-Object {$_.LastWriteTime -gt (Get-Date).AddDays(-90)} |
    Foreach-Object {
        $content = Get-Content $_.FullName
        #filter and save content to a file
        $s = $content | Where-Object {$_ -match 'STATEMENT'}
        # strip datetime from front if exists
        If (Where-Object {$s.Substring(0,1) -Match '/d'}) { $s = $s.Substring(24) }
        $s | Sort-Object -Unique | Set-Content $Out1\$_
    }
# merge all the files, sort unique, write to output
Get-Content $Out1\* | Sort-Object -Unique | Set-Content $Out2\output.txt
But it just writes the lines out without stripping the leading characters.

The regex /d should be \d (\ is the escape character in general, and character-class shortcuts such as d for a digit must be prefixed with it; see the note at the end).
Use a single pipeline that passes the Where-Object output to a ForEach-Object call, where you can perform the conditional removal of the numeric prefix:
$content |
    Where-Object { $_ -match 'STATEMENT' } |
    ForEach-Object { if ($_[0] -match '\d') { $_.Substring(24) } else { $_ } } |
    Set-Content $Out1\$_
Note: Strictly speaking, \d matches everything that the Unicode standard considers a digit, not just the ASCII-range digits 0 to 9; to limit matching to the latter, use [0-9].
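The difference between \d and [0-9] can be seen with a digit from outside the ASCII range. The character used here (U+0663, ARABIC-INDIC DIGIT THREE) and the variable name are chosen only for illustration.

```powershell
# U+0663 is ARABIC-INDIC DIGIT THREE, a Unicode decimal digit outside the ASCII range.
$arabicThree = [string][char]0x0663

$arabicThree -match '\d'       # True: \d covers every Unicode decimal digit
$arabicThree -match '[0-9]'    # False: the class lists only ASCII 0 through 9
'7' -match '[0-9]'             # True
```

For log timestamps that are known to be plain ASCII, either form behaves the same.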

Need to output multiple rows to CSV file

I am using the following script, which iterates through hundreds of text files looking for matches of a regular expression. I need to add a second data point to the array, which tells me the object (file) the pattern matched in.
In the script below, the [Regex]::Matches($str, $Pattern) | % { $_.Value } piece returns multiple rows per file, which cannot easily be output to a file.
What I would like to know is how to output a two-column CSV file, with one column for the file name (which should be $_.FullName) and one column for the regex results. My current code is below.
$FolderPath = "C:\Test"
$Pattern = "(?i)(?<=\b^test\b)\s+(\w+)\S+"
$Lines = @()
Get-ChildItem -Recurse $FolderPath -File | ForEach-Object {
    $_.FullName
    $str = Get-Content $_.FullName
    $Lines += [Regex]::Matches($str, $Pattern) |
        % { $_.Value } |
        Sort-Object |
        Get-Unique
}
$Lines = $Lines.Trim().ToUpper() -replace '[\r\n]+', ' ' -replace ";", '' |
    Sort-Object |
    Get-Unique # Cleaning up data in array

I can think of two ways, but the simplest is to use a hashtable (dictionary). Another way is to create PSObjects to fill your $Lines variable. I am going with the simple way, so you only need one variable, the hashtable.
$FolderPath = "C:\Test"
$Pattern = "(?i)(?<=\b^test\b)\s+(\w+)\S+"
$Results = @{}
Get-ChildItem -Recurse $FolderPath -File |
    ForEach-Object {
        $str = Get-Content $_.FullName
        $Line = [regex]::matches($str,$Pattern) | % { $_.Value } | Sort-Object | Get-Unique
        $Line = $Line.Trim().ToUpper() -Replace '[\r\n]+', ' ' -Replace ";",'' | Sort-Object | Get-Unique # Cleaning up data in array
        $Results[$_.FullName] = $Line
    }
$Results.GetEnumerator() | Select @{L="Folder";E={$_.Key}}, @{L="Matches";E={$_.Value}} | Export-Csv -NoType -Path <Path to save CSV>
Your results will be in $Results. $Results.Keys contains the file paths (the "Folder" column) and $Results.Values contains the results from the expression. You can reference the results for a particular file by its key, $Results["Folder path"]. A key that does not exist returns $null rather than raising an error.
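The other route mentioned above, creating one object per match, could look roughly like the sketch below. It requires PowerShell v3 or newer for [pscustomobject] and [ordered]. Everything in it is an assumption for illustration: the file contents are held in a hypothetical in-memory table instead of being read from disk, the pattern is a simplified stand-in for the one in the question, and the column names File and Match are made up. Emitting one row per match also sidesteps a limitation of the hashtable version, where a file with several matches stores an array that Export-Csv writes as a type name (System.Object[]) instead of the values.

```powershell
# Hypothetical stand-in for file contents (path -> text); real code would use Get-Content.
$files = [ordered]@{
    'C:\Test\one.txt' = 'test alpha; test beta;'
    'C:\Test\two.txt' = 'test gamma;'
}

# Simplified pattern (an assumption, not the pattern from the question).
$pattern = '(?i)\btest\s+(\w+)'

# Emit one object per match so that each match becomes its own CSV row.
$rows = @(foreach ($path in $files.Keys) {
    [regex]::Matches($files[$path], $pattern) | ForEach-Object {
        [pscustomobject]@{
            File  = $path
            Match = $_.Groups[1].Value.ToUpper()
        }
    }
})

# ConvertTo-Csv prints the CSV text; Export-Csv -Path would write it to a file instead.
$rows | ConvertTo-Csv -NoTypeInformation
```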

Powershell Remove all lines except those containing certain strings

I can do this one by one with bookmarking and other Notepad++ features, but I will be doing this frequently to edit documents. I have used the PowerShell below to remove all lines except those containing a certain string, but how would I do it for, say, 50 strings?
$SourceFile = 'C:\PATH\TO\FILE.csv'
$Pattern = 'word||'
(Get-Content $SourceFile) | % {if ($_ -match $Pattern){$_}} | Set-Content $SourceFile

I guess $Match should be $Pattern in your example.
You can specify multiple keywords in your pattern, like this:
$SourceFile = 'C:\PATH\TO\FILE.csv'
$Pattern = 'word|excel|powerpoint'
(Get-Content $SourceFile) | Where-Object { $_ -match $Pattern } | Set-Content $SourceFile
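With 50 keywords it may be easier to build the pattern from a list than to type it by hand. The sketch below assumes a hypothetical keyword list and sample lines; [regex]::Escape keeps characters such as + and . from being interpreted as regex syntax. An empty alternative (as in 'word||') should be avoided, because an empty pattern matches every line.

```powershell
# Hypothetical keyword list; in practice it could be read from a file with Get-Content.
$keywords = 'word', 'excel', 'c++', 'file.csv'

# Escape each keyword so regex metacharacters are taken literally,
# then join the escaped keywords into a single alternation.
$pattern = ($keywords | ForEach-Object { [regex]::Escape($_) }) -join '|'

# Sample lines for illustration only.
$lines = 'uses c++ daily', 'nothing relevant', 'opened FILE.CSV today', 'filexcsv'
$kept  = @($lines | Where-Object { $_ -match $pattern })
$kept   # 'uses c++ daily' and 'opened FILE.CSV today' (-match is case-insensitive)
```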

Change and save .nc files

I have a massive number of .nc files (text files) in which I need to change different lines based on their line number and content.
Example:
So far I have:
Get-ChildItem I:\temp *.nc -recurse | ForEach-Object {
    $c = ($_ | Get-Content)
    $c = $c -replace "S355J2","S235JR2"
    $c = $c.GetType() | Format-Table -AutoSize
    $c = $c -replace $c[3],$c[4]
    [IO.File]::WriteAllText($_.FullName, ($c -join "`r`n"))
}
This is not working, however, since it returns only a few PowerShell lines to each file, instead of the original (changed) content.

I don't know what you expect $c = $c.GetType() | Format-Table -AutoSize to do, but it most likely doesn't do whatever it is you're expecting.
If I understand your question correctly you essentially want to
remove the line pos,
replace the code S355J2 with S235JR2, and
remove a section SI if it exists.
The following code should work:
Get-ChildItem I:\temp *.nc -Recurse | ForEach-Object {
    (Get-Content $_.FullName | Out-String) -replace 'pos\r\n\s+' -replace 'S355J2', 'S235JR2' -replace '(?m)^SI\r\n(\s+.*\n)+' |
        Set-Content $_.FullName
}
Out-String mangles the content of the input file into a single string, and the daisy-chained replacement operations modify that string before it's written back to the file. The expression (?m)^SI\r\n(\s+.*\n)+ matches a line beginning with SI and followed by one or more indented lines. The (?m) modifier is to allow matching start-of-line in a multiline string, otherwise ^ would only match the beginning of the string.
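The effect of the (?m) modifier can be checked on a small string. The sample text below is invented for illustration (the real .nc layout is not shown in the question) and uses CRLF line endings, which the expression expects.

```powershell
# Invented sample with CRLF line endings; the two indented lines belong to the SI section.
$text = "AK`r`nSI`r`n  10`r`n  20`r`nEN`r`n"

# With (?m), ^ matches at the start of every line, so the SI section is found and removed.
$withM = $text -replace '(?m)^SI\r\n(\s+.*\n)+'

# Without (?m), ^ matches only at the very start of the string, so nothing is removed.
$withoutM = $text -replace '^SI\r\n(\s+.*\n)+'

$withM -eq "AK`r`nEN`r`n"    # True
$withoutM -eq $text          # True
```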
Edit: If you need to replace variable text in the 3rd line with the text from the 4th line (thus duplicating the 4th line) you're indeed better off working with an array for that. Delay the mangling of the string array until after that replacement:
Get-ChildItem I:\temp *.nc -Recurse | ForEach-Object {
    $txt = @(Get-Content $_.FullName)
    $txt[3] = $txt[4]
    ($txt | Out-String) -replace 'S355J2', 'S235JR2' -replace '(?m)^SI\r\n(\s+.*\n)+' |
        Set-Content $_.FullName
}

Retrieve text from a file in ps script

I have a text file containing some data as follows:
test|wdthe$muce
check|muce6um#%
How can I check for a particular string like test and retrieve the text after the | symbol to a variable in a PowerShell script?
Also, suppose there is a variable $from = "test@abc.com". How can I search the file using the text before the "@"?

This may be one possible solution:
$filecontents = @'
test|wdthe$muce
check|muce6um#%
'@.split("`n")
# instead of the above, you would use this with the path of the file
# $filecontents = get-content 'c:\temp\file.txt'
$hash = @{}
$filecontents | ? {$_ -notmatch '^(?:\s+)?$'} | % {
    $split = $_.Split('|')
    $hash.Add($split[0], $split[1])
}
$result = [pscustomobject]$hash
$result
# and to get just what is inside 'test'
$result.test
Note: this only works if each key appears once in the file, because $hash.Add fails on a duplicate key. If you get an error, try this other method:
$search = 'test'
$filecontents | ? {$_ -match "^$search\|"} | % {
    $_.split('|')[1]
}

First you need to read the text from the file.
$content = Get-Content "c:\temp\myfile.txt"
Then you want to grab the post-pipe portion of each matching line.
$postPipePortion = $content | Foreach-Object {$_.Substring($_.IndexOf("|") + 1)}
And because it's PowerShell you could also daisy-chain it together instead of using variables:
Get-Content "C:\temp\myfile.txt" | Foreach-Object {$_.Substring($_.IndexOf("|") + 1)}
The above assumes that you happen to know every line will include a | character. If this is not the case, you need to select out only the lines that do have the character, like this:
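Get-Content "C:\temp\myfile.txt" | Select-String -SimpleMatch "|" | Foreach-Object {$_.Line.Substring($_.Line.IndexOf("|") + 1)}
(You need to use $_.Line instead of just $_ now because Select-String returns MatchInfo objects rather than strings. The -SimpleMatch switch is needed because Select-String treats its pattern as a regular expression by default, and a bare | is an alternation of two empty patterns, which matches every line.)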
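How Select-String treats a "|" pattern with and without -SimpleMatch can be checked on two sample lines. This is only a sketch; the second line and the variable names are invented for illustration.

```powershell
# One line from the question and one invented line without a separator.
$lines = 'test|wdthe$muce', 'no separator here'

# As a regular expression, "|" is an alternation of two empty patterns, so every line matches.
$asRegex = @($lines | Select-String '|')

# With -SimpleMatch the pattern is a literal string, so only lines containing | match.
$asLiteral = @($lines | Select-String -SimpleMatch '|')

$asRegex.Count      # 2
$asLiteral.Count    # 1

# Text after the first | of each line that really contains one.
$afterPipe = @($asLiteral | ForEach-Object { $_.Line.Substring($_.Line.IndexOf('|') + 1) })
$afterPipe          # wdthe$muce
```

Escaping the character instead (Select-String '\|') should have the same filtering effect as -SimpleMatch here.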
Hope that helps. Good luck.

gc input.txt |? {$_ -match '^test'} |% { $_.split('|') | select -Index 1 }
or
sls '^test' -Path input.txt |% { $_.Line.Split('|') | select -Index 1 }
or
sls '^test' input.txt |% { $_ -split '\|' | select -Ind 1 }
or
(gc input.txt).Where{$_ -match '^test'} -replace '.*\|'
or
# Borrowing @Anthony Stringer's answer shape, but different
# code, and guessing names for what you're doing:
$users = @{}
Get-Content .\input.txt | ForEach {
    if ($_ -match "(?<user>.*)\|(?<pwd>.*)") {
        $users[$matches.user]=$matches.pwd
    }
}
$users = [pscustomobject]$users
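The second part of the question, searching by the text in front of the "@" in $from, could be handled along these lines. This is a sketch under assumptions: $from is taken to be a quoted string, and the file lines are held in an array ($lines) rather than read with Get-Content.

```powershell
$from  = 'test@abc.com'                          # assumed value, quoted so that it is a string
$lines = 'test|wdthe$muce', 'check|muce6um#%'    # stands in for Get-Content on the data file

# Text before the first '@'.
$name = ($from -split '@', 2)[0]

# Find the line whose key equals that text and take everything after the first '|'.
$value = $lines |
    Where-Object { $_ -match ('^' + [regex]::Escape($name) + '\|') } |
    ForEach-Object { ($_ -split '\|', 2)[1] }

$value   # wdthe$muce
```

Splitting with a limit of 2 keeps any further | characters in the value intact.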
