I have a file on my PC called test.ps1
I have a file hosted on my github called test.ps1
Both of them have the same contents: a string.
I am using the following script to try to compare them:
$fileA = Get-Content -Path "C:\Users\User\Desktop\test.ps1"
$fileB = (Invoke-webrequest -URI "https://raw.githubusercontent.com/repo/Scripts/test.ps1")
if(Compare-Object -ReferenceObject $fileA -DifferenceObject ($fileB -split '\r?\n'))
{"files are different"}
Else {"Files are the same"}
echo ""
Write-Host $fileA
echo ""
Write-Host $fileB
however my output is showing the exact same data for both but it says the files are different. The output:
files are different
a string
a string
is there some weird EOL thing going on or something?
tl;dr
# Remove a trailing newline from the downloaded file content
# before splitting into lines.
# Parameter names omitted for brevity.
Compare-Object $fileA ($fileB -replace '\r?\n\z' -split '\r?\n' )
If the files are truly identical (save for any character-encoding and newline-format differences, and whether or not the local file has a trailing newline), you'll see no output (because Compare-Object only reports differences by default).
If the lines look the same, it sounds like character encoding is not the problem, though it's worth pointing out that Get-Content in Windows PowerShell, in the absence of a BOM, assumes that a file is ANSI-encoded, so a UTF-8 file without BOM that contains characters outside the ASCII range will be misinterpreted - use -Encoding utf8 to fix that.
Assuming that the files are truly identical (including not having variations in whitespace, such as trailing spaces at the end of lines), the likeliest explanation is that the file being retrieved has a trailing newline, as is typical for text files.
Thus, if the downloaded file has a trailing newline, as is to be expected, applying -split '\r?\n' to the multi-line string representing the entire file content in order to split it into lines leaves you with an extra, empty array element at the end, which causes Compare-Object to report that element as a difference.
Compare-Object emitting an object is evaluated as $true in the implied Boolean context of your if statement's conditional, which is why files are different is output.
The above -replace operation, -replace '\r?\n\z' (\z matches the very end of a (multi-line) string), compensates for that, by removing the trailing newline before splitting into lines.
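To make the effect concrete, here is a minimal sketch that uses in-memory stand-ins instead of real files. The variable names ($local, $remote, $naive, $fixed) are illustrative assumptions, not part of the original script.

```powershell
# Hypothetical stand-ins: $local mimics Get-Content output (an array of lines),
# $remote mimics downloaded raw text that ends in a trailing newline.
$local  = @('a string')
$remote = "a string`n"

# Splitting as-is yields an extra, empty element at the end.
$naive = $remote -split '\r?\n'

# Removing the trailing newline first avoids the extra element.
$fixed = $remote -replace '\r?\n\z' -split '\r?\n'

$naive.Count                             # 2
$fixed.Count                             # 1
[bool] (Compare-Object $local $naive)    # True  (a difference is reported)
[bool] (Compare-Object $local $fixed)    # False (no output, so no difference)
```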
Related
folder name is: c:\home\alltext\
It contains 2 text files with different names (each file contains extra whitespace that I want to trim):
text1.txt
text2.txt
I don't want to use Notepad++ and process the text files one by one, especially if I have more than 2 of them.
I tried PowerShell, but it writes the contents of text1 and text2 together into the same text file.
How can I trim them in one command and return individual txt?
This is my command:
(get-content c:\home\alltext\*.txt).trim() -ne '' | Set-content c:\home\alltext\*.txt
You need to process the input files one by one:
Get-ChildItem c:\home\alltext\*.txt | ForEach-Object {
Set-Content -LiteralPath $_.FullName -Value (($_ | Get-Content).Trim() -ne '')
}
Note that PowerShell never preserves the original character encoding when reading text files, so you may have to use the -Encoding parameter with Set-Content.
As for what you tried:
(get-content c:\home\alltext\*.txt).trim() -ne '' streams the non-blank lines of all files matching wildcard expression c:\home\alltext\*.txt, across file boundaries.
Perhaps surprisingly, not only does Set-Content's (positionally implied) -Path parameter accept wildcard expressions too, it writes the same content (the stringified versions of whatever input it receives) to whatever files happen to match that wildcard expression.
This problematic behavior is discussed in GitHub issue #6729; unfortunately, it was decided to retain the current behavior.
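As a rough illustration of the per-file approach, the sketch below builds a throwaway folder under the temp directory (the folder name and sample contents are made up for the demo) and trims each file individually. The @(...) wrapper is an addition of this sketch: it keeps the -ne '' filter working as a filter even when a file has only one line.

```powershell
# Hypothetical scratch folder with two sample files.
$dir = Join-Path ([IO.Path]::GetTempPath()) ("trimdemo_" + [guid]::NewGuid().ToString('N'))
$null = New-Item -ItemType Directory -Path $dir
Set-Content -LiteralPath (Join-Path $dir 'text1.txt') -Value '  one  ', '', ' two'
Set-Content -LiteralPath (Join-Path $dir 'text2.txt') -Value 'three   ', '   '

# Process each file on its own, so content never crosses file boundaries.
Get-ChildItem -LiteralPath $dir -Filter *.txt | ForEach-Object {
    # Trim every line, then drop the lines that ended up empty.
    $trimmed = @(($_ | Get-Content).Trim()) -ne ''
    Set-Content -LiteralPath $_.FullName -Value $trimmed
}

$result1 = @(Get-Content -LiteralPath (Join-Path $dir 'text1.txt'))
$result2 = @(Get-Content -LiteralPath (Join-Path $dir 'text2.txt'))

# Clean up the scratch folder.
Remove-Item -LiteralPath $dir -Recurse -Force
```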
Environment: Windows 10 pro 20H2, PowerShell 5.1.19041.1237
In a .txt file, my following PowerShell code is not replacing the newline character(s) with " ". Question: What might I be missing here, and how can we make it work?
C:\MyFolder\Test.txt File:
This is first line.
This is second line.
This is third line.
This is fourth line.
Desired output [after replacing the newline characters with " " character]:
This is first line. This is second line. This is third line. This is fourth line.
PowerShell code:
PS C:\MyFolder\test.txt> $content = get-content "Test.txt"
PS C:\MyFolder\test.txt> $content = $content.replace("`r`n", " ")
PS C:\MyFolder\test.txt> $content | out-file "Test.txt"
Remarks
The above code works fine if I replace some other character(s) in file. For example, if I change the second line of the above code with $content = $content.replace("third", "3rd"), the code successfully replaces third with 3rd in the above file.
You need to pass -Raw parameter to Get-Content. By default, without the Raw parameter, content is returned as an array of newline-delimited strings.
Get-Content "Test.txt" -Raw
Quoting from the documentation,
-Raw
Ignores newline characters and returns the entire contents of a file in one string with the newlines preserved. By default, newline
characters in a file are used as delimiters to separate the input into
an array of strings. This parameter was introduced in PowerShell 3.0.
The simplest way of doing this is to not use the -Raw switch and then do a replacement on it, but make use of the fact that Get-Content splits the content on Newlines for you.
All it then takes is to join the array with a space character.
(Get-Content -Path "Test.txt") -join ' ' | Set-Content -Path "Test.txt"
As for what you have tried:
By using Get-Content without the -Raw switch, the cmdlet returns a string array of lines split on the Newlines.
That means there are no Newlines in the resulting strings anymore to replace and all that is needed is to 'stitch' the lines together with a space character.
If you do use the -Raw switch, the cmdlet returns a single, multiline string including the Newlines.
In your case, you then need to do the splitting or replacing yourself and for that, don't use the string method .Replace, but the regex operator -split or -replace with a search string '\r?\n'.
The question mark in there makes sure you split on newlines in Windows format (CRLF), but also works on *nix format (LF).
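The difference between the two reading modes can be sketched as follows. The temp file and variable names are assumptions made for the demo; both routes should produce the same single-line string.

```powershell
# Hypothetical sample file with CRLF newlines.
$path = Join-Path ([IO.Path]::GetTempPath()) ("joindemo_" + [guid]::NewGuid().ToString('N') + ".txt")
[IO.File]::WriteAllText($path, "This is first line.`r`nThis is second line.`r`nThis is third line.`r`n")

# Without -Raw: an array of lines, with the newlines already stripped.
$lines = Get-Content -LiteralPath $path

# With -Raw: one multi-line string, newlines included.
$raw = Get-Content -LiteralPath $path -Raw

# Route 1: join the line array with a space.
$viaJoin = $lines -join ' '

# Route 2: drop the trailing newline, then replace the remaining ones via regex.
$viaReplace = $raw.TrimEnd() -replace '\r?\n', ' '

Remove-Item -LiteralPath $path
```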
I found a nifty command here - http://www.stackoverflow.com/questions/27892957/merging-multiple-csv-files-into-one-using-powershell that I am using to merge CSV files -
Get-ChildItem -Filter *.csv | Select-Object -ExpandProperty FullName | Import-Csv | Export-Csv .\merged\merged.csv -NoTypeInformation -Append
Now this does what it says on the tin and works great for the most part. I have 2 issues with it however, and I am wondering if there is a way they can be overcome:
Firstly, the merged csv file has CRLF line endings, and I am wondering how I can make the line endings just LF, as the file is being generated?
Also, it looks like there are some shenanigans with quote marks being added/moved around. As an example:
Sample row from initial CSV:
"2021-10-05"|"00:00"|"1212"|"160477"|"1.00"|"3.49"LF
Same row in the merged CSV:
"2021-10-05|""00:00""|""1212""|""160477""|""1.00""|""3.49"""CRLF
As you can see, the first field has lost its trailing quote, other fields have doubled quotes, and the end of the row has an additional quote. I'm not quite sure what is going on here, so any help would be much appreciated!
For dealing with the quotes, the cause of the “problem” is that your CSV does not use the default field delimiter that Import-CSV assumes - the C in CSV stands for comma, and you’re using the vertical bar. Add the parameter -Delimiter "|" to both the Import-CSV and Export-CSV cmdlets.
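A small in-memory sketch of the delimiter fix, using ConvertFrom-Csv / ConvertTo-Csv (the string-based siblings of Import-Csv / Export-Csv). The header names date, time and qty are invented for the demo, since the question does not show a header row.

```powershell
# Hypothetical pipe-delimited input: a header line plus one data row.
$rows = '"date"|"time"|"qty"', '"2021-10-05"|"00:00"|"1212"'

# Parse with the matching delimiter so that each field lands in its own property.
$parsed = $rows | ConvertFrom-Csv -Delimiter '|'

# Serialize again with the same delimiter; the quoting comes out unchanged.
$roundTrip = $parsed | ConvertTo-Csv -Delimiter '|' -NoTypeInformation

$parsed.qty      # 1212
$roundTrip[1]    # "2021-10-05"|"00:00"|"1212"
```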
I don’t think you can do anything about the line-end characters (CRLF vs LF); that’s almost certainly operating-system dependent.
Jeff Zeitlin's helpful answer explains the quote-related part of your problem well.
As for your line-ending problem:
As of PowerShell 7.2, there are no PowerShell-native features that allow you to control the newline format of file-writing cmdlets such as Export-Csv.
However, if you use plain-text processing, you can use multi-line strings built with the newline format of interest and save / append them with Set-Content and its -NoNewLine switch, which writes the input strings as-is, without a (newline) separator.
In fact, to significantly speed up processing in your case, plain-text handling is preferable, since in essence your operation amounts to concatenating text files, the only twist being that the header lines of all but the first file should be skipped; using plain-text handling also bypasses your quote problem:
$tokenCount = 1
Get-ChildItem -Filter *.csv |
Get-Content -Raw |
ForEach-Object {
# Get the file content and replace CRLF with LF.
# Include the first line (the header) only for the first file.
$content = ($_ -split '\r?\n', $tokenCount)[-1].Replace("`r`n", "`n")
$tokenCount = 2 # Subsequent files should have their header ignored.
# Make sure that each file content ends in a LF
if (-not $content.EndsWith("`n")) { $content += "`n" }
# Output the modified content.
$content
} |
Set-Content -NoNewLine ./merged/merged.csv # add -Encoding as needed.
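To see what the header-skipping trick does without touching the file system, here is a sketch that runs the same logic over two in-memory strings. The sample contents are made up, and the final -join stands in for Set-Content -NoNewLine.

```powershell
# Hypothetical contents of two CSV files, as Get-Content -Raw would return them.
$files = "h1|h2`r`na|b`r`n", "h1|h2`r`nc|d"

$tokenCount = 1
$merged = -join ($files | ForEach-Object {
    # A limit of 1 returns the whole string; a limit of 2 splits off the header line.
    $content = ($_ -split '\r?\n', $tokenCount)[-1].Replace("`r`n", "`n")
    $tokenCount = 2   # from the second file on, drop the header
    if (-not $content.EndsWith("`n")) { $content += "`n" }
    $content
})

$merged -eq "h1|h2`na|b`nc|d`n"   # True
```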
I'm breaking my head :D
I am trying to encode a text file so that it is saved the same way Notepad saves it.
The result looks exactly the same, but it is not the same: it only works for me if I open the file in Notepad and save it again. What could be the problem with the encoding, and how can I solve it? Is there a command that opens Notepad and saves the file again?
I currently use
(Get-Content 000014.log) | Out-FileUtf8NoBom ddppyyyyy.txt
and after this
Get-ChildItem ddppyyyyy.txt | ForEach-Object {
# get the contents and replace line breaks by U+000A
$contents = [IO.File]::ReadAllText($_) -replace "`r`n?", "`n"
# create UTF-8 encoding without signature
$utf8 = New-Object System.Text.UTF8Encoding $false
# write the text back
[IO.File]::WriteAllText($_, $contents, $utf8)
}
When you open a file with notepad.exe it autodetects the encoding (or do you open the file explicitly via File->Open... as UTF-8?). If your file is actually not UTF-8 but something else, Notepad may be able to work around this and convert it to the required encoding when the file is resaved. So, if you do not specify the correct input encoding in your PoSh script, things will go wrong.
But that's not all; notepad also drops erroneous characters when the file is saved to create a regular text file. For instance, your text file might contain a NULL character that only gets removed when you use notepad. If this is the case it is highly unlikely that your input file is UTF-8 encoded (unless it is broken). So, it looks like your problem is your source file is UTF16 or similar; try to find the right input encoding and rewrite it, e.g. UTF-16 to UTF-8
Get-Content file.foo -Encoding Unicode | Set-Content -Encoding UTF8 newfile.foo
Try it like this:
Get-ChildItem ddppyyyyy.txt | ForEach-Object {
# get the contents and replace Windows line breaks by U+000A
$raw= (Get-Content -Raw $_ -Encoding UTF8) -replace "`r?`n", "`n" -replace "`0", ""
# create UTF-8 encoding without BOM signature
$utf8NoBom = New-Object System.Text.UTF8Encoding $false
# write the text back
[System.IO.File]::WriteAllText($_.FullName, $raw, $utf8NoBom)
}
If you are struggling with the Byte-order-mark it is best to use a hex editor to check the file header manually; checking your file after I have saved it like shown above and then opening it with Notepad.exe and saving it under a new name shows no difference anymore:
The hex-dumped beginning of a file with BOM looks like this instead:
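If no hex editor is at hand, the first bytes can also be inspected from PowerShell itself. The sketch below assumes a hypothetical helper named Test-Utf8Bom (not a built-in cmdlet) and two throwaway temp files; it checks for the UTF-8 signature bytes EF BB BF.

```powershell
# Hypothetical helper: returns $true if the file starts with the UTF-8 BOM (EF BB BF).
function Test-Utf8Bom {
    param([Parameter(Mandatory)][string] $LiteralPath)
    $bytes = [IO.File]::ReadAllBytes($LiteralPath)
    return ($bytes.Length -ge 3 -and
            $bytes[0] -eq 0xEF -and $bytes[1] -eq 0xBB -and $bytes[2] -eq 0xBF)
}

# Two sample files with the same text, one with and one without a BOM.
$tmp     = [IO.Path]::GetTempPath()
$withBom = Join-Path $tmp ("bom_"   + [guid]::NewGuid().ToString('N') + ".txt")
$noBom   = Join-Path $tmp ("nobom_" + [guid]::NewGuid().ToString('N') + ".txt")
[IO.File]::WriteAllText($withBom, 'abc', (New-Object System.Text.UTF8Encoding $true))
[IO.File]::WriteAllText($noBom,   'abc', (New-Object System.Text.UTF8Encoding $false))

$bomResult   = Test-Utf8Bom -LiteralPath $withBom   # True
$noBomResult = Test-Utf8Bom -LiteralPath $noBom     # False

Remove-Item -LiteralPath $withBom, $noBom
```

For a visual dump, the Format-Hex cmdlet (available from PowerShell 5 on) shows the same bytes.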
Also, as noted, while your regex pattern should work if you want to convert Windows newlines to Unix style, it is much more common and safer to make the CR optional: `r?`n
As noted by mklement0, reading the file using the correct encoding is important; if your file is actually in Latin1 or something similar, you will end up with a broken file if you carelessly convert it to UTF-8 in PoSh.
Thus, I have added the -Encoding UTF8 param to the Get-Content Cmdlet; adjust as needed.
Update: There is nothing wrong with the code in the question, the true problem was embedded NUL characters in the files, which caused problems in R, and which opening and resaving in Notepad implicitly removed, thereby resolving the problem (assuming that simply discarding these NULs works as intended) - see also: wp78de's answer.
Therefore, modifying the $contents = ... line as follows should fix your problem:
$contents = [IO.File]::ReadAllText($_) -replace "`r`n", "`n" -replace "`0"
Note: The code in the question uses the Out-FileUtf8NoBom function from this answer, which allows saving to BOM-less UTF-8 files in Windows PowerShell; it now supports a -UseLF switch, which would simplify the OP's command to (additional problems notwithstanding):
Get-Content 000014.log | Out-FileUtf8NoBom ddppyyyyy.txt -UseLF
There's a conceptual flaw in your regex, though it is benign in this case: instead of "`r`n?" you want "`r?`n" (or, expressed as a pure regex, '\r?\n') in order to match both CRLF ("`r`n") and LF-only ("`n") newlines.
Your regex would instead match CRLF and CR-only(!) newlines; however, as wp78de points out, if your input file contains only the usual CRLF newlines (and not also isolated CR characters), your replacement operation should still work.
In fact, you don't need a regex at all if all you need is to replace CRLF sequences with LF: -replace "`r`n", "`n"
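The three variants can be compared side by side on a made-up string that mixes CRLF, LF-only and CR-only newlines plus an embedded NUL. This is only a sketch; the sample text and variable names are assumptions.

```powershell
# Hypothetical sample: CRLF after x, LF after y, a lone CR after z, and a NUL before "!".
$sample = "x`r`ny`nz`rw`0!"

$crOptional = $sample -replace "`r?`n", "`n"   # CRLF and LF -> LF; the lone CR stays
$lfOptional = $sample -replace "`r`n?", "`n"   # CRLF and lone CR -> LF
$literal    = $sample -replace "`r`n", "`n"    # only CRLF -> LF; the lone CR stays

# Strip embedded NUL characters as a separate step.
$noNul = $crOptional -replace "`0"

$crOptional.Contains("`r")   # True
$lfOptional.Contains("`r")   # False
$noNul.IndexOf([char] 0)     # -1
```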
Assuming that your original input files are ANSI-encoded, you can simplify your approach as follows, without the need to call Out-FileUtf8NoBom first (assumes Windows PowerShell):
# NO need for Out-FileUtf8NoBom - process the ANSI-encoded files directly.
Get-ChildItem *SomePattern*.txt | ForEach-Object {
# Get the contents and make sure newlines are LF-only
# [Text.Encoding]::Default is the encoding for the active ANSI code page
# in Windows PowerShell.
$contents = [IO.File]::ReadAllText(
$_.FullName,
[Text.Encoding]::Default
) -replace "`r`n", "`n"
# Write the text back with BOM-less UTF-8 (.NET's default)
[IO.File]::WriteAllText($_.FullName, $contents)
}
Note that replacing the content of files in-place bears a risk of data loss, so it's best to create backup copies of the original files first.
Note: If you wanted to perform the same operation in PowerShell [Core] v6+, which is built on .NET Core, the code must be modified slightly, because [Text.Encoding]::Default no longer reflects the active ANSI code page and instead invariably returns a BOM-less UTF-8 encoding.
Therefore, the $contents = ... statement would have to change to (note that this would work in Windows PowerShell too):
$contents = [IO.File]::ReadAllText(
$_.FullName,
[Text.Encoding]::GetEncoding(
[cultureinfo]::CurrentCulture.TextInfo.AnsiCodePage
)
) -replace "`r`n", "`n"
I'm trying to delete the blank line at the bottom of each sqlcmd output file provided by another vendor.
$List=Get-ChildItem * -include *.csv
foreach($file in $List) {
$data = Get-Content $file
$name = $file.name
$length = $data.length -1
$data[$length] = $null
$data | Out-File $name -Encoding utf8
}
It takes quite a long time to remove the blank line. Does anyone know a more efficient way?
Using Get-Content -Raw to load files as a whole, as a single string into memory and operating on that string will give you the greatest speed boost.
While that isn't always an option depending on file size, you mention sqlcmd files, which can be assumed to be small enough.
Note:
By blank line I mean a line that is either completely empty or contains whitespace (other than newlines) only.
The trimmed string will not have a final terminating newline following the last line, but if you pass it to Set-Content (or Out-File), one will be appended by default; use -NoNewline to suppress that, but note that, especially on Unix-like platforms, even the last line of a text file is expected to have a trailing newline.
Trailing (or leading) whitespace on a non-blank line is by design not trimmed, except where noted.
The solutions use the -replace operator, which operates on regexes (regular expressions).
Remove all trailing blank lines:
Note: If you really want to remove only the last line if it happens to be blank, see the second-to-last solution below.
(Get-Content -Raw $file) -replace '\r?\n\s*$'
In the context of your command (slightly modified):
Get-ChildItem -Filter *.sqlcmd | ForEach-Object {
(Get-Content -Raw $_.FullName) -replace '\r?\n\s*$' |
Set-Content $_.FullName -Encoding utf8 -WhatIf # save back to same file
}
Note: The -WhatIf common parameter in the command above previews the operation. Remove -WhatIf once you're sure the operation will do what you want.
If it's acceptable / desirable to also trim trailing whitespace from the last non-blank line, you can more simply write:
(Get-Content -Raw $file).TrimEnd()
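The difference between the two commands shows up on a sample string whose last non-blank line has trailing spaces. The text below is made up for the demo and uses LF newlines.

```powershell
# Hypothetical content: "b" is followed by two spaces, then two blank lines.
$text = "a`nb  `n`n   `n"

# Regex version: removes the trailing blank lines but keeps the spaces after "b".
$regexTrimmed = $text -replace '\r?\n\s*$'

# TrimEnd(): removes all trailing whitespace, including the spaces after "b".
$fullyTrimmed = $text.TrimEnd()

$regexTrimmed -eq "a`nb  "   # True
$fullyTrimmed -eq "a`nb"     # True
```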
Remove all blank lines, wherever they occur in the file:
(Get-Content -Raw $file) -replace '(?m)^\s*\r?\n|\r?\n\s*\Z'
Here's a conceptually much simpler version that operates on the array of lines output by Get-Content without -Raw (and also returns an array), but it performs much worse.
@(Get-Content $file) -notmatch '^\s*$'
Do not combine this with Set-Content / Out-File -NoNewline, as that will concatenate the lines stored in the array elements directly, without line breaks between them. Without -NoNewline, you'll invariably get a terminating newline after the last line.
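A quick sketch of the array-based filter on in-memory lines (the sample array is made up). The last two statements contrast joining with newlines against direct concatenation, which is what -NoNewline effectively does to the array elements.

```powershell
# Hypothetical lines, as Get-Content (without -Raw) would return them.
$lines = 'a', '', "  `t", 'b', ''

# With an array on the left-hand side, -notmatch acts as a filter.
$nonBlank = @($lines) -notmatch '^\s*$'

$joinedWithNewlines = $nonBlank -join "`n"   # line breaks between elements
$concatenated       = -join $nonBlank        # no separators at all

$nonBlank.Count   # 2
$concatenated     # ab
```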
Remove only the last line if it is blank:
(Get-Content -Raw $file) -replace '\r?\n[ \t]*\Z'
Note:
[ \t] matches spaces and tabs, whereas \s more generally matches all forms of Unicode whitespace, including that outside the ASCII range.
An optional trailing newline at the very end of the file (to terminate the last line) is not considered a blank line in this case - whether such a newline is present or not does not make a difference.
Unconditionally remove the last line, whether it is blank or not:
(Get-Content -Raw $file) -replace '\r?\n[^\n]*\Z'
Note:
An optional trailing newline at the very end of the file (to terminate the last line) is not considered a blank line in this case - whether such a newline is present or not does not make a difference.
If you want to remove the last non-blank line, use
(Get-Content -Raw $file).TrimEnd() -replace '\r?\n[^\n]*\Z'
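The last-line variants can be tried on small in-memory strings. The samples below are assumptions for the demo and use LF newlines; note that in each case the result ends without a trailing newline, which Set-Content would add back by default.

```powershell
# Hypothetical samples.
$blankLast   = "a`nb`n  `n"   # last line contains only spaces
$noBlankLast = "a`nb`n"       # last line is not blank
$threeLines  = "a`nb`nc`n"    # three non-blank lines
$trailing    = "a`nb`nc`n`n"  # three lines plus a trailing empty line

# Remove the last line only if it is blank.
$r1 = $blankLast   -replace '\r?\n[ \t]*\Z'
$r2 = $noBlankLast -replace '\r?\n[ \t]*\Z'

# Remove the last line unconditionally.
$r3 = $threeLines -replace '\r?\n[^\n]*\Z'

# Remove the last non-blank line.
$r4 = $trailing.TrimEnd() -replace '\r?\n[^\n]*\Z'

$r1 -eq "a`nb"   # True: the blank last line is gone
$r2 -eq "a`nb"   # True: both lines are kept
$r3 -eq "a`nb"   # True: "c" is gone
$r4 -eq "a`nb"   # True: "c" is gone
```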
Try replacing the Get-Content line with the following; you will then not have blank lines in your $data array.
$data = Get-Content $file.FullName | Where-Object { $_.Trim() -ne "" }