Find Carriage Returns in Files Using PowerShell - powershell

I want to see all the instances of files containing the Windows-style crlf instead of Unix-style lf in a set of files. Here's what I have so far:
sls -Path src/*.cs -Pattern "`r`n" | group Path | select name
This works if I search for any normal text, but it's not finding the carriage returns, even though (according to everything I can find online) that's the proper Powershell escape sequence for carriage returns and newlines. For the record \r\n doesn't work either.

sls (an alias for Select-String) works line by line, so the line breaks are already consumed while the file is being read, before the regex matching ever sees them.
Use something that reads the entire file as a single string, and then search that for the sequence:
Get-ChildItem -Path src/*.cs | ForEach-Object {
    $contents = [System.IO.File]::ReadAllText($_.FullName)
    if ($contents -cmatch '\r\n') {
        $_
    }
} | Group-Object Directory | Select-Object Name
\r\n is used here instead of the backtick escapes because you're escaping for the regex engine, not for PowerShell.
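Alternatively (a sketch assuming PowerShell 3+, where Get-Content supports -Raw), you can let Get-Content read each file in full and filter on the match directly:
Get-ChildItem -Path src/*.cs |
    Where-Object { (Get-Content -LiteralPath $_.FullName -Raw) -cmatch '\r\n' } |
    Select-Object FullName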

Related

Is there a Powershell command that will print a specified file's EOL characters?

I have four text files in the following directory that have varying EOL characters:
C:\Sandbox 1.txt, 2.txt, 3.txt, 4.txt
I would like to write a powershell script that will loop through all files in the directory and find the EOL characters that are being used for each file and print them into a new file named EOL.txt
Sample contents of EOL.txt:
1.txt UNIX(LF)
2.txt WINDOWS(CRLF)
3.txt WINDOWS(CRLF)
4.txt UNIX(LF)
I know to loop through files I will need something like the following, but I'm not sure how to read the file EOL:
Get-ChildItem "C:\Sandbox" -Filter *.txt |
Foreach-Object {
}
OR
Get-Content "C:\Sandbox\*" -EOL | Out-File -FilePath "C:\Sandbox\EOL.txt"
##note that EOL is not a valid Get-Content command
Try the following:
Get-ChildItem C:\Sandbox\*.txt -Exclude EOL.txt |
    Get-Content -Raw |
    ForEach-Object {
        $newlines = [regex]::Matches($_, '\r?\n').Value | Select-Object -Unique
        $newLineDescr =
            switch ($newlines.Count) {
                0 { 'N/A' }
                2 { 'MIXED' }
                default { ('UNIX(LF)', 'WINDOWS(CRLF)')[$newlines -eq "`r`n"] }
            }
        # Construct and output a custom object for the file at hand.
        [pscustomobject] @{
            FileName      = $_.PSChildName
            NewlineFormat = $newLineDescr
        }
    } # | Out-File ... to save to a file - see comments below.
The above outputs something like:
FileName NewlineFormat
-------- -------------
1.txt    UNIX(LF)
2.txt    WINDOWS(CRLF)
3.txt    N/A
4.txt    MIXED
N/A means that no newlines are present, MIXED means that both CRLF and LF newlines are present.
You can save the output:
directly in the for-display format shown above by appending a > redirection or piping (|) to Out-File, as in your question.
alternatively, using a structured text format better suited to programmatic processing, such as CSV; e.g.:
Export-Csv -NoTypeInformation -Encoding utf8 C:\Sandbox\EOL.txt
Note:
Short of reading the raw bytes of a text file one by one or in batches, the only way to analyze the newline format is to read the file in full and search for newline sequences. Get-Content -Raw reads a given file in full.
[regex]::Matches($_, '\r?\n').Value extracts all newline sequences - whether CRLF or LF - from the file's content, and Select-Object -Unique reduces them to the set of distinct sequences.
('UNIX(LF)', 'WINDOWS(CRLF)')[$newlines -eq "`r`n"] is a convenient, but somewhat obscure emulation of the following ternary conditional:
$newlines -eq "`r`n" ? 'WINDOWS(CRLF)' : 'UNIX(LF)', which could be used as-is in PowerShell (Core) 7+, but unfortunately isn't supported in Windows PowerShell.
The technique relies on a [bool] value getting coerced to an [int] value when used as an array index ($true -> 1, $false -> 0), thereby selecting the appropriate element from the input array.
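For example, the coercion can be seen in isolation (illustrative values only):
('UNIX(LF)', 'WINDOWS(CRLF)')[$false]  # -> UNIX(LF), because $false coerces to index 0
('UNIX(LF)', 'WINDOWS(CRLF)')[$true]   # -> WINDOWS(CRLF), because $true coerces to index 1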
If you don't mind the verbosity, you can use a regular if statement as an expression (i.e., you can assign its output directly to a variable: $foo = if ...), which works in both PowerShell editions:
if ($newlines -eq "`r`n") { 'WINDOWS(CRLF)' } else { 'UNIX(LF)' }
Simpler alternative via WSL, if installed:
WSL comes with the file utility, which analyzes the content of files and reports summary information, including newline formats.
While you get no control over the output format, which invariably includes additional information, such as the file's character encoding, the command is much simpler:
Set-Location C:\Sandbox
wsl file *.txt
Caveats:
This approach is fundamentally limited to files on local drives.
If changing to the target directory is not an option, relative paths need their \ instances translated to /, and full paths additionally need drive specifiers such as C: translated to /mnt/c (lowercase!).
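For example, with a full path (a sketch, assuming WSL's default /mnt/<drive> automount and the sample directory above):
wsl file /mnt/c/Sandbox/1.txt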
Interpreting the output:
If the term line terminators (referring to newlines) is not mentioned in the output (for text files), Unix (LF) newlines only are implied.
Windows (CRLF) newlines only are implied if you see with CRLF line terminators
In case of a mix of LF and CRLF, you'll see with CRLF, LF line terminators
In the absence of newlines you'll see with no line terminators

How to replace a newline at the start of a txt file with PowerShell

I have tried a lot, but I cannot seem to replace the new line at the start of a txt file.
So my txt file starts with a blank line (shown as a screenshot in the original question).
I just want to remove the first newline character, but everything I try does not work:
replacing `n`r, \n\r, or any combination of these.
Try
(Get-Content -Path 'YourFile.txt' -Raw).TrimStart() | Set-Content -Path 'YourFile.txt' -Force
Or
(Get-Content -Path 'YourFile.txt' -Raw) -replace '^\s+' | Set-Content -Path 'YourFile.txt' -Force
Explanation:
The above removes all whitespace (tabs, spaces, newlines) from the top of the text, as it is impossible to see from the image if other whitespace characters are in that line or not.
If you are sure there is just the one newline: in your case \r\n won't work, because the file uses Unix newlines (\n only).
Better is to replace using ^\r?\n. The ^ anchors the match at the very beginning of the text, and the ? matches zero or one occurrence of the CR character \r, so the pattern handles both LF and CRLF.
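A sketch of that targeted replacement (reusing the file name from the commands above):
(Get-Content -Path 'YourFile.txt' -Raw) -replace '^\r?\n' | Set-Content -Path 'YourFile.txt' -Force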
Another option is to remove all blank lines instead. The parentheses make sure Get-Content finishes reading the file before Set-Content writes back to the same file.
(get-content file.txt) | where { $_ } | set-content file.txt
Or this way, the filename goes first.
set-content file.txt (get-content file.txt).where{$_}
A different approach
(Get-Content -Tail ((Get-Content testfile1).Count - 1) testfile1) | Set-Content testfile1
Count the number of lines in the file and then take one off the total. Use that to tail the file and write the output back to the file.
If you are certain that there will be an empty line (or you want to ignore it) you can use Skip.
Get-Content -Path 'YourFile.txt' | select -Skip 1 | Set-Content -Path 'YourFile.txt' -Force

Need to substitute \x0d\x0a to \x2c\x0d\x0a in a file using powershell

Need to replace \x0d\x0a with \x2c\x0d\x0a in a file.
I can do it relatively easy on Unix:
awk '(NR>1){gsub("\r$",",\r")}1' $file > "fixed_$file"
Need help with implementing this in PowerShell.
Thank you in advance.
Assuming that you're running this on Windows (where \r\n (CRLF) newlines are the default), the following command is the equivalent of your awk command:
Get-Content $file | ForEach-Object {
    if ($_.ReadCount -eq 1) { $_ } else { $_ -replace '$', ',' }
} | Set-Content "fixed_$file"
Caveat: The character encoding of the input file is not preserved, and
Set-Content uses a default, which you can override with -Encoding.
In Windows PowerShell, this default is the system's "ANSI" encoding, whereas in PowerShell Core it is BOM-less UTF-8.
Get-Content $file reads the input file line by line.
The ForEach-Object loop passes the 1st line ($_.ReadCount -eq 1) through as-is ($_), and appends , (which is what escape sequence \x2c in your awk command represents) to all others ($_ -replace '$', ',').
Note: $_ + ',' or "$_," are simpler alternatives for appending a comma; the regex-based -replace operator was used here to highlight the PowerShell feature that is similar to awk's gsub().
Set-Content then writes the resulting lines to the target file, terminating each with the platform-appropriate newline sequence, which on Windows is CRLF (\r\n).
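If reading the whole file at once is acceptable, a whole-file sketch is also possible (assumes PowerShell 3+ for -Raw and 5+ for -NoNewline); note that, unlike the awk command, it inserts a comma before every CRLF, including the one ending the first line:
# Read the file in full, insert a comma before each CRLF, and write it back without adding an extra trailing newline.
(Get-Content $file -Raw) -replace "`r`n", ",`r`n" | Set-Content -NoNewline "fixed_$file"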

Strip lines from text file based on content

I like to use one of the packaged HOSTS files (MVPS, etc.) to protect myself from some of the nastier domains. Unfortunately, sometimes these files are a bit overzealous for me (blocking googleadsservices is a pain sometimes). I want an easy way to strip certain lines out of these files. In Linux I use:
cat hosts |grep -v <pattern> >hosts.new
And the file is rewritten minus the lines referencing the pattern I specified in the grep. So I just set it up to replace hosts with hosts.new on reboot and I'm done.
Is there an easy way to do this in PowerShell?
In PowerShell you'd do
(Get-Content hosts) -notmatch $pattern | Out-File hosts.new
or
(cat hosts) -notmatch $pattern > hosts.new
for short.
Of course, since Out-File (and with it the redirection operator) default to Unicode format, you may actually want to use Set-Content instead of Out-File:
(Get-Content hosts) -notmatch $pattern | Set-Content hosts.new
or
(gc hosts) -notmatch $pattern | sc hosts.new
And since the input file is read in a grouping expression (the parentheses around Get-Content hosts) you could actually write the output back to the source file:
(Get-Content hosts) -notmatch $pattern | Set-Content hosts
To complement Ansgar Wiechers' helpful answer (which offers pragmatic and concise solutions based on reading the entire input file into memory up-front):
PowerShell's grep equivalent is the Select-String cmdlet and, just like grep, it directly accepts a filename argument (PSv3+ syntax):
Select-String -NotMatch <pattern> hosts | ForEach-Object Line | Set-Content hosts.new
Select-String -NotMatch <pattern> hosts is short for
Select-String -NotMatch -Pattern <pattern> -LiteralPath hosts and is the virtual equivalent of
grep -v <pattern> hosts
However, Select-String doesn't output strings, it outputs [Microsoft.PowerShell.Commands.MatchInfo] instances that wrap matching lines (stored in property .Line) along with metadata about the match.
ForEach-Object Line extracts just the matching lines (the value of property .Line) from these objects.
Set-Content hosts.new writes the matching lines to file hosts.new, using "ANSI" encoding in Windows PowerShell - i.e., it uses the legacy code page implied by the active system locale, typically a supranational 8-bit superset of ASCII - and UTF-8 encoding (without BOM) in PowerShell Core.
Use the -Encoding parameter to specify a different encoding.
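For example (a sketch; note that -Encoding utf8 writes a BOM in Windows PowerShell, but not in PowerShell Core):
Select-String -NotMatch <pattern> hosts | ForEach-Object Line | Set-Content -Encoding utf8 hosts.new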
>, by contrast (an effective alias of the Out-File cmdlet), creates:
UTF16-LE ("Unicode") files by default in Windows PowerShell.
UTF-8 files (without BOM) in PowerShell Core - in other words: in PowerShell Core, using
> hosts.new in lieu of | Set-Content hosts.new will do.
Note: While both > / Out-File and Set-Content are suitable for sending string inputs to an output file, they are not generally suitable for sending other data types to a file for programmatic processing: > / Out-File output objects the way they would print to the console / terminal, which is a representation meant for display only, whereas Set-Content stringifies (simply put: calls .ToString() on) the input objects, which often results in loss of information.
For non-string data, consider a (more) structured data format such as XML (Export-CliXml), JSON (ConvertTo-Json) or CSV (Export-Csv).
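As a quick illustration of the difference, using Get-Date as an arbitrary non-string object (hypothetical file names):
# Set-Content stringifies the object - only its .ToString() text is saved.
Get-Date | Set-Content date.txt
# Export-Clixml preserves the object's type and properties for later re-import with Import-Clixml.
Get-Date | Export-Clixml date.xml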

PowerShell mangling ASCII text

I'm getting extra characters and lines when trying to modify hosts files. For example, this select string does not take anything out, but the two files are different:
get-content -Encoding ascii C:\Windows\system32\drivers\etc\hosts |
select-string -Encoding ascii -notmatch "thereisnolinelikethis" |
out-file -Encoding ascii c:\temp\testfile
PS C:\temp> (get-filehash C:\windows\system32\drivers\etc\hosts).hash
C54C246D2941F02083B85CE2774D271BD574F905BABE030CC1BB41A479A9420E
PS C:\temp> (Get-FileHash C:\temp\testfile).hash
AC6A1134C0892AD3C5530E58759A09C73D8E0E818EC867C9203B9B54E4B83566
I can confirm that your commands inexplicably result in extra line breaks in the output file, at the start and at the end. PowerShell also converts the tabs in the original file into four spaces.
While I cannot explain why, the following commands do the same thing without these issues:
Get-Content -Path C:\Windows\System32\drivers\etc\hosts -Encoding Ascii |
Where-Object { -not $_.Contains("thereisnolinelikethis") } |
Out-File -FilePath "c:\temp\testfile" -Encoding Ascii
I think this is more of an issue with PowerShell's F&O (formatting & output) engine. Keep in mind that Select-String outputs a rich object called MatchInfo. When that object reaches the end of the pipeline, it needs to be rendered to a string, and I think it is that rendering/formatting that injects the extra lines. One of the properties on MatchInfo is the line that was matched (or not matched). If you pass just the Line property down the pipe, it seems to work better (hashes match):
Get-Content C:\Windows\system32\drivers\etc\hosts |
Select-String -notmatch "thereisnolinelikethis" |
Foreach {$_.Line} |
Out-File -Encoding ascii c:\temp\testfile
BTW you only need to specify ASCII encoding when outputting back to the file. Everywhere else in PowerShell, just let the string flow as Unicode.
All that said, I would use Where-Object instead of Select-String for this scenario. Where-Object is a filtering command which is what you want. Select-String takes input of one form (string) and converts it to a different object (MatchInfo).
Out-File adds a trailing NewLine ("`r`n") to the testfile file.
C:\Windows\System32\drivers\etc\hosts does not contain a trailing newline out of the box, which is why you get a different FileHash
If you open the files with a StreamReader, you'll see that the underlying stream differs in length (due to the trailing newline in the new file):
PS C:\> $Hosts = [System.IO.StreamReader]"C:\Windows\System32\drivers\etc\hosts"
PS C:\> $Tests = [System.IO.StreamReader]"C:\temp\testfile"
PS C:\> $Hosts.BaseStream.Length
822
PS C:\> $Tests.BaseStream.Length
824
PS C:\> $Tests.BaseStream.Position = 822; $Tests.Read(); $Tests.Read()
13
10
ASCII characters 13 (0x0D) and 10 (0x0A) correspond to [System.Environment]::NewLine on Windows, i.e. CR+LF.
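If you need the output to be byte-for-byte identical (i.e., without the extra trailing newline), here is a sketch assuming PowerShell 5+ (where Set-Content supports -NoNewline) and a CRLF-terminated hosts file with no trailing newline, as shown above:
$lines = Get-Content C:\Windows\System32\drivers\etc\hosts |
    Where-Object { $_ -notmatch 'thereisnolinelikethis' }
# Re-join with CRLF and write without appending a final newline.
Set-Content -Path C:\temp\testfile -Encoding Ascii -NoNewline -Value ($lines -join "`r`n")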