I'm trying to figure out why there's a huge difference in output sizes when encoding a file in base64 with PowerShell vs GNU coreutils. Depending on options (UTF8 vs Unicode), the PowerShell output ranges from about 240 MB to 318 MB. Using coreutils base64 (in Cygwin, in this case), the output is about 80 MB. The original file size is about 58 MB. So, 2 questions:
Why is there such a drastic difference?
How can I get PowerShell to give the smaller output that the GNU tool gives?
Here are the specific commands I used:
PowerShell smaller output:
$input = "C:\Users\my.user\myfile.pdf"
$filecontent = get-content $input
$converted = [System.Text.Encoding]::UTF8.GetBytes($filecontent)
$encodedtext = [System.Convert]::ToBase64String($converted)
$encodedtext | Out-File "C:\Users\my.user\myfile.pdf.via_ps.base64"
The larger PowerShell output came from simply replacing "UTF8" with "Unicode". It will be obvious that I'm pretty new to PowerShell; I'm sure someone only slightly better with it could combine that into a couple of simple lines.
Coreutils (via Cygwin) base64:
base64.exe -w0 myfile.pdf > myfile.pdf.via_cygwin.base64
Why is there such a drastic difference?
Because you're doing something wildly different in PowerShell
How can I get PowerShell to give the smaller output that the GNU tool gives?
By doing what base64 does :)
Let's have a look at what base64 ... > ... actually does:
base64:
Opens file handle to input file
Reads raw byte stream from disk
Converts every 3-byte group to a 4-byte base64-encoded output string-fragment
>:
Writes raw byte stream to disk
Since the 4-byte output fragments only contain byte values that correspond to 64 printable ASCII characters, the command never actually does any "string manipulation" - the values it operates on just happen to also be printable as ASCII, and the resulting file is therefore indistinguishable from a "text file".
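To see the 3-to-4 expansion concretely, here's a quick sketch with three arbitrary bytes (not taken from the original file):
[convert]::ToBase64String([byte[]](0x48, 0x65, 0x6C))   # 3 bytes in -> 'SGVs', 4 characters out
That ~4/3 growth is exactly why a 58 MB input yields roughly an 80 MB base64 file.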
Your PowerShell script on the other hand does lots of string manipulation:
Get-Content $input:
Opens file handle to input file
Reads raw byte stream from disk
Decodes the byte stream according to some chosen encoding scheme (likely your system's ANSI codepage)
[Encoding]::UTF8.GetBytes():
Re-encodes the resulting string using UTF8
[Convert]::ToBase64String()
Converts every 3-byte group to a 4-byte base64-encoded output string-fragment
Out-File:
Encodes input string as little-endian UTF16
Writes to disk
The three additional string-encoding steps highlighted above result in a much-inflated byte stream, which is why your output is three to four times the size of the straight base64 encoding.
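To make the inflation visible, here is a minimal sketch (using a hypothetical small file path, not the original PDF) of what those extra decode/re-encode steps do to the byte count before base64 even runs:
$raw  = [System.IO.File]::ReadAllBytes('C:\temp\sample.bin')      # hypothetical input file
$text = [System.Text.Encoding]::Default.GetString($raw)           # roughly what Get-Content does to each line
$utf8 = [System.Text.Encoding]::UTF8.GetBytes($text)              # the re-encoding step from the question
'{0} raw bytes became {1} bytes before base64' -f $raw.Length, $utf8.Length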
How to base64-encode files then?
The trick here is to read the raw bytes from disk and pass those directly to [convert]::ToBase64String()
It is technically possible to just read the entire file into an array at once:
$bytes = Get-Content path\to\file.ext -Encoding Byte # Windows PowerShell only
# or
$bytes = [System.IO.File]::ReadAllBytes($(Convert-Path path\to\file.ext))
$b64String = [convert]::ToBase64String($bytes)
Set-Content path\to\output.base64 -Value $b64String -Encoding Ascii
... I'd strongly recommend against doing so for files larger than a few kilobytes.
Instead, for file transformation in general you'll want to use streams. In this particular case, you'll want to use a CryptoStream with a ToBase64Transform to re-encode a file stream as base64:
function New-Base64File {
[CmdletBinding(DefaultParameterSetName = 'ByPath')]
param(
[Parameter(Mandatory = $true, ParameterSetName = 'ByPath', Position = 0)]
[string]$Path,
[Parameter(Mandatory = $true, ParameterSetName = 'ByPSPath')]
[Alias('PSPath')]
[string]$LiteralPath,
[Parameter(Mandatory = $true, Position = 1)]
[string]$Destination
)
# Create destination file if it doesn't exist
if (-not(Test-Path -LiteralPath $Destination -PathType Leaf)) {
$outFile = New-Item -Path $Destination -ItemType File
}
else {
$outFile = Get-Item -LiteralPath $Destination
}
[void]$PSBoundParameters.Remove('Destination')
try {
# Open a writable file stream to the output file
$outStream = $outFile.OpenWrite()
# Wrap output file stream in a CryptoStream.
#
# Anything that we write to the crypto stream is automatically
# base64-encoded and then written through to the output file stream
$transform = [System.Security.Cryptography.ToBase64Transform]::new()
$cryptoStream = [System.Security.Cryptography.CryptoStream]::new($outStream, $transform, 'Write')
foreach ($file in Get-Item @PSBoundParameters) {
try {
# Open readable input file stream
$inStream = $file.OpenRead()
# Copy input bytes to crypto stream
# - which in turn base64-encodes and writes to output file
$inStream.CopyTo($cryptoStream)
}
finally {
# Clean up the input file stream
$inStream | ForEach-Object Dispose
}
}
}
finally {
# Clean up the output streams
$transform, $cryptoStream, $outStream | ForEach-Object Dispose
}
}
Now you can do:
$inputPath = "C:\Users\my.user\myfile.pdf"
New-Base64File $inputPath -Destination "C:\Users\my.user\myfile.pdf.via_ps.base64"
And expect output of the same size as with base64.
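As a quick sanity check (using the same paths as above), the encoded file should be roughly 4/3 the size of the original:
(Get-Item "C:\Users\my.user\myfile.pdf").Length / 1MB
(Get-Item "C:\Users\my.user\myfile.pdf.via_ps.base64").Length / 1MB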
Related
I have a function that I wrote to encode/decode strings in Base64 format.
I understand the end goal of what I am asking for might not make sense but it is something I have to do.
I have a snp.txt file with the contents start notepad
I need to convert that string in the file to Base64 and it should look like this:
cwB0AGEAcgB0ACAAbgBvAHQAZQBwAGEAZAA=
Then immediately turn around and decode it right back to what it was, so it looks like:
start notepad
However when I do that using the example below, when it is decoded back it returns:
s t a r t n o t e p a d
I am not sure why the text has the spaces in it.
function B64 {
[CmdletBinding(DefaultParameterSetName="encString")]
param(
[Parameter(Position=0, ParameterSetName="encString")]
[Alias("es")]
[string]$encString,
[Parameter(Position=0, ParameterSetName="decString")]
[Alias("ds")]
[string]$decString
)
if ($psCmdlet.ParameterSetName -eq "encString") {
$encoded = [Convert]::ToBase64String([System.Text.Encoding]::Unicode.GetBytes($encString))
return $encoded
}
elseif ($psCmdlet.ParameterSetName -eq "decString") {
$decoded = [System.Text.Encoding]::ASCII.GetString([System.Convert]::FromBase64String($decString))
return $decoded
}
}
This is where I call the functions to first encode the string and then decode it back again returning:
s t a r t n o t e p a d
$filePath = "C:\Users\User\Desktop\snp.txt"
$encData = Get-Content $filePath
$enc = B64 -encString $fp;$enc | Out-File -FilePath $fp
Sleep 1
$dec = B64 -ds $encData;$dec | Out-File -FilePath $fp
This happens because you're using different encodings to encode and decode your string. Use either ASCII for both or Unicode for both; preferably, use UTF8. The encoding of the file should also match the encoding used in your function. –
Santiago Squarzon
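A minimal sketch (separate from the B64 function above) illustrating the comment's point; decoding UTF-16LE ("Unicode") bytes as ASCII turns the interleaved null bytes into what appear as spaces:
$b64 = [Convert]::ToBase64String([System.Text.Encoding]::Unicode.GetBytes('start notepad'))
[System.Text.Encoding]::Unicode.GetString([Convert]::FromBase64String($b64))   # start notepad
[System.Text.Encoding]::ASCII.GetString([Convert]::FromBase64String($b64))     # s t a r t  n o t e p a d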
I have a binary log file that has text headers describing the file. The headers are of the form:
FileName: <filename>
Datetime: <dateTime>
NumberOfLines: <nnn>
DataForm: l,l,d,d,d,d
DataBlock:
After that there goes a binary portion of the file:
ウゥョ・ ` 0 ウゥョ゚?~・?ヨソフ>・・?Glfウチメ>-~JUッ羲ソ濂・x・-$>;ノPセ.・4ツヌ岐・セ:)胥篩・tシj~惞ケ劔劔劒 ウゥッ ` 0 ウゥッ?Gd$・フ>・)
and so on...
The headers I can read and parse into variables using this:
$streamReader = [System.IO.StreamReader]::New($FullPath)
while (($currentLine = $streamReader.ReadLine()) -ne 'DataBlock: ') {
$variableName, $variableValue = $currentLine -split ':'
New-Variable -Name $variableName -Value $variableValue.Trim() -Force
}
Now to the binary block.
This binary portion is basically a CSV-like data structure. DataForm describes how long each field is and what data type each field has. NumberOfLines says how many lines there are.
So, I know how to read and parse the binary portion using:
[Byte[]] $FileByteArray = Get-Content $FullPath -Encoding Byte
and knowing the start position of the data block. For instance, the first field in the example above is 'l', which is 4 bytes for the UInt32 data type. Assuming my data block starts at byte 800, I can read it like this:
$byteArray = $FileByteArray[800..803]
[BitConverter]::ToUInt32($byteArray, 0)
and so on.
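For illustration, a hedged sketch of how one full record could be walked, assuming 'l' is a 4-byte UInt32, 'd' is an 8-byte Double, and the data block starts at byte 800:
$offset = 800
$l1 = [BitConverter]::ToUInt32($FileByteArray, $offset); $offset += 4
$l2 = [BitConverter]::ToUInt32($FileByteArray, $offset); $offset += 4
$d1 = [BitConverter]::ToDouble($FileByteArray, $offset); $offset += 8
# ...repeat for the remaining 'd' fields, and for each of the NumberOfLines records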
Now, the question.
I'd like to use my existing StreamReader that I used to parse headers to keep advancing through the file (not by line now, but by chunks of bytes) and read it as bytes. How do I do that?
If I can only read characters using StreamReader, what other methods can I use?
I don't want to read the headers using StreamReader, calculating the starting position of my data block along the way from the length of each line plus two bytes for the newline characters, and then read the whole file again through [Byte[]] $FileByteArray = Get-Content $FullPath -Encoding Byte
How do I compare the output of Get-FileHash directly with the output of Properties.ContentMD5?
I'm putting together a PowerShell script that takes some local files from my system and copies them to an Azure Blob Storage Container.
The files change daily so I have added in a check to see if the file already exists in the container before uploading it.
I use Get-FileHash to read the local file:
$LocalFileHash = (Get-FileHash "D:\file.zip" -Algorithm MD5).Hash
Which results in $LocalFileHash holding this: 67BF2B6A3E6657054B4B86E137A12382
I use this code to get the checksum of the blob file already transferred to the container:
$BlobFile = "Path\To\file.zip"
$AZContext = New-AZStorageContext -StorageAccountName $StorageAccountName -SASToken "<token here>"
$RemoteBlobFile = Get-AzStorageBlob -Container $ContainerName -Context $AZContext -Blob $BlobFile -ErrorAction Ignore
if ($RemoteBlobFile) {
$cloudblob = [Microsoft.Azure.Storage.Blob.CloudBlockBlob]$RemoteBlobFile.ICloudBlob
$RemoteBlobHash = $cloudblob.Properties.ContentMD5
}
This value of $RemoteBlobHash is set to Z78raj5mVwVLS4bhN6Ejgg==
No problem, I thought, I'll just decode the Base64 string and compare:
$output = [System.Text.Encoding]::UTF8.GetString([System.Convert]::FromBase64String($RemoteBlobHash))
Which gives me g�+j>fWKK��7�#� so not directly comparable ☹
This question shows someone in a similar pickle but I don't think they were using Get-FileHash given the format of their local MD5 result.
Other things I've tried:
changing UTF8 in the System.Text.Encoding line above to UTF16 and ASCII, which changes the output, but not to anything recognisable.
dabbling with GetBytes to see if that helped:
$output = [System.Text.Encoding]::UTF8.GetBytes([System.Text.Encoding]::UTF16.GetString([System.Convert]::FromBase64String($RemoteBlobHash)))
Note: Using md5sum to compare the local file and a downloaded copy of file.zip results in the same MD5 string as Get-FileHash: 67BF2B6A3E6657054B4B86E137A12382
Thank you in advance!
ContentMD5 is a base64 representation of the binary hash value, not the resulting hex string :)
$md5sum = [convert]::FromBase64String('Z78raj5mVwVLS4bhN6Ejgg==')
$hdhash = [BitConverter]::ToString($md5sum).Replace('-','')
Here we convert base64 -> binary -> hexadecimal
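With the question's variables, the converted value can then be compared directly to the Get-FileHash result (a small usage sketch, assuming both variables are populated as above):
$RemoteBlobHexHash = [BitConverter]::ToString([convert]::FromBase64String($RemoteBlobHash)).Replace('-','')
$RemoteBlobHexHash -eq $LocalFileHash   # True when the blob matches the local file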
If you need to do it the other way around (i.e. for obtaining a local file hash, then using that to search for blobs in Azure), you'll first need to split the hexadecimal string into byte-sized chunks (two hex digits each), then convert the resulting byte array to base64:
$hdhash = '67BF2B6A3E6657054B4B86E137A12382'
$bytes = [byte[]]::new($hdhash.Length / 2)
for($i = 0; $i -lt $bytes.Length; $i++){
$offset = $i * 2
$bytes[$i] = [convert]::ToByte($hdhash.Substring($offset,2), 16)
}
$md5sum = [convert]::ToBase64String($bytes)
How can data like strings, byte arrays, and IO streams be hashed using common hashing algorithms like MD4, MD5, SHA1, etc.?
I am writing a script that makes backups of drives. To prevent unnecessary copies and to detect whether files have become corrupted, it needs to hash files quickly with some hashing algorithm like MD4.
If anyone has an idea how to hash files, IO streams, byte arrays, strings... using any hashing algorithm, please let me know. Also, the Get-FileHash cmdlet doesn't exist on all of the Windows installations I have encountered.
Create an instance of [System.Security.Cryptography.MD5], then pass a file stream to its ComputeHash() method:
function Get-MD5Sum
{
param(
[Parameter(Mandatory, ValueFromPipelineByPropertyName)]
[Alias('PSPath')]
[string[]]$Path
)
begin {
$md5 = [System.Security.Cryptography.MD5]::Create()
}
process {
foreach($filePath in $Path){
# Resolve filesystem item
$file = Get-Item -LiteralPath $filePath
# Skip if not a file
if($file -isnot [System.IO.FileInfo]){
continue
}
# Open a stream to read the file
$filestream = $file.OpenRead()
try {
# Calculate + format hash, then output
Write-Output $([pscustomobject]@{
File = $file.FullName
Hash = [BitConverter]::ToString($MD5.ComputeHash($filestream)) -replace '-'
})
}
finally {
# close file stream handle
$filestream.Dispose()
}
}
}
end {
# Dispose of the hash provider
$MD5.Dispose()
}
}
Now you can calculate MD5 file hashes without Get-FileHash:
PS C:\> $fileHashes = Get-ChildItem . | Get-MD5Sum
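The same MD5 provider also works for in-memory data; for example, hashing a string (via a byte array) directly looks something like this (a small sketch, not part of the function above):
$md5   = [System.Security.Cryptography.MD5]::Create()
$bytes = [System.Text.Encoding]::UTF8.GetBytes('hello world')              # string -> byte array
$hash  = [BitConverter]::ToString($md5.ComputeHash($bytes)) -replace '-'
$md5.Dispose()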
I'm trying to capture the output of ffmpeg in PowerShell(tm) to get some metadata on some ogg & mp3 files. But when I do:
ffmpeg -i file.ogg 2>&1 | sls GENRE
The output includes a bunch of lines without my matching string, "GENRE":
album_artist : Post Human Era
ARTIST : Post Human Era
COMMENT : Visit http://posthumanera.bandcamp.com
DATE : 2013
GENRE : Music
TITLE : Supplies
track : 1
At least one output file must be specified
I am guessing something is different in the encoding. ffmpeg's output is colored, so maybe there are color control characters in the output that are breaking things? Or maybe ffmpeg's output isn't playing nicely with PowerShell's default UTF-16? I can't figure out if there is another way to redirect stderr and remove the color characters, or to change the encoding of stderr.
EDIT:
Strangely, I also get indeterminate output. Sometimes the output is as shown above. Sometimes with precisely the same command the output is:
GENRE :
Which makes slightly more sense, but is still missing the part of the line I care about ('Music').
Somewhere, PowerShell is interpreting something that is not a newline as a newline.
I still see this behavior when I use the old Windows PowerShell, but I have since upgraded to PowerShell Core (7.0.2), and the problem seems to be solved. I read somewhere that PowerShell Core changed the default encoding to UTF-8, so perhaps it is something related to that.
My theory is that in the old version, whatever code combines the output streams would normally make sure that individual lines were preserved and interleaved rather than cut up. But I would guess that this code looks for newlines in the default encoding, not UTF-8, so when it receives two UTF-8 streams it doesn't parse the line delimiters correctly and you get weird splits. It seems like there should be a way to change the encoding before the output streams are mixed, but I'm not sure (and now it doesn't matter since it works). Why the output seems to change nondeterministically, I don't know, unless there is something nondeterministic about parsing UTF-8 bytes as if they were UTF-16 or whatever the default is.
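For anyone still on Windows PowerShell, one hedged workaround (assuming ffmpeg writes UTF-8 to stderr) is to change the console encoding PowerShell uses to decode native-command output before capturing it:
[Console]::OutputEncoding = [System.Text.Encoding]::UTF8   # how PowerShell decodes output from external programs
ffmpeg -i file.ogg 2>&1 | Select-String GENRE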
I got something working for my script by catching all the output with a regex and piping it to a custom object:
Function Rotate-Video {
param(
[STRING]$FFMPEGEXE = "P:\Video Editing\ffmpeg-4.3.1-2020-10-01-full_build\bin\ffmpeg.exe",
[parameter(ValueFromPipeline = $true)]
[STRING]$Source = "D:\Video\Source",
[STRING]$Destination = 'D:\Video\Destination',
[STRING]$DestinationExtention='mp4'
)
(Get-ChildItem $Source) | ForEach-Object {
$FileExist = $false
$Source = $_.fullname
$Name = $_.basename
$outputName = $name+'.'+$DestinationExtention
$Fullpath = Join-Path -Path $Destination -ChildPath $outputName
$Regex = "(\w+)=\s+(\d+)\s+(\w+)=(\d+.\d+)\s+(\w)=(\d+.\d+)\s+(\w+)=\s+(\d+)\w+\s+(\w+)=(\d+:\d+:\d+.\d+)\s+(\w+)=(\d+.\d+)\w+\/s\s+(\w+)=(\d+.\d+)"
&$FFMPEGEXE -i $Source -vf transpose=clock $Fullpath 2>&1 | Select-String -Pattern $Regex | ForEach-Object {
$output = ($_ | Select-String -Pattern $regex).Matches.Groups
[PSCUSTOMOBJECT]@{
Source = $source
Destination = $Fullpath
$output[1] = $output[2]
$output[3] = $output[4]
$output[5] = $output[6]
$output[7] = $output[8]
$output[9] = $output[10]
$output[11] = $output[12]
$output[13] = $output[14]
}
}
}
}
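Usage then looks something like this (the paths here are just the parameter defaults from above):
Rotate-Video -Source 'D:\Video\Source' -Destination 'D:\Video\Destination'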