I have a scenario where I need to obtain an installer that is base64-encoded and embedded within a JSON REST response. The JSON string is rather large (180 MB), so decoding the REST response with standard PowerShell tooling frequently throws OutOfMemoryException in limited-memory scenarios (such as hitting WinRM memory quotas).
It's not desirable to raise the memory quota in our environment over a single installation, and we don't have standard tooling to prepare a package whose payload does not exist at a simple HTTP endpoint (I don't have direct permissions to publish packages not performed through our build system). My solution in this case is to decode the base64 string in chunks. However, while I have this working, I am stuck on one last bit of optimization for this process.
Currently I am using a MemoryStream to read from the string, but I need to provide a byte[]:
# $Base64String is a [ref] type
$memStream = [IO.MemoryStream]::new([Text.Encoding]::UTF8.GetBytes($Base64String.Value))
This unsurprisingly results in copying the byte[] representation of the entire base64-encoded string, and is even less memory-efficient than built-in tooling in its current form. The code you don't see here reads from $memStream in chunks of 1024 bytes at a time, decoding the base64 string and writing the bytes to disk using BinaryWriter. This all works well, if slow since I'm forcing garbage collection fairly often. However, I want to extend this byte-counting to the initial MemoryStream and only read n bytes from the string at a time. My understanding is that base64 strings must be decoded in chunks of bytes divisible by 4.
The problem is that [string].Substring([int], [int]) works based on string length, not number of bytes per character. The JSON response can be assumed to be UTF-8 encoded, but even with this assumption UTF-8 characters vary between 1-4 bytes in length. How can I (directly or indirectly) substring a specific number of bytes in PowerShell so I can create the MemoryStream from this substring instead of the full $Base64String?
I will note that I have explored the use of the [Text.Encoding].GetBytes([string], [int], [int]) overload, however, I face the same issue in that the method expects a character count, not byte count, for the length of the string to get the byte[] for from the starting index.
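To make the character-versus-byte distinction concrete, here is a small sketch (the sample characters are chosen purely for illustration) that compares the .NET character count of a string with its UTF-8 byte count:

```powershell
$utf8 = [Text.Encoding]::UTF8

# Illustrative samples only: one code point each, taking 1 to 4 bytes in UTF-8.
$samples = @(
    'A'                                  # U+0041
    [string][char]0x00E9                 # U+00E9
    [string][char]0x20AC                 # U+20AC
    [char]::ConvertFromUtf32(0x1F600)    # U+1F600, stored as a surrogate pair in .NET
)

$rows = foreach ($s in $samples) {
    [pscustomobject]@{
        CodePoint = 'U+{0:X4}' -f [char]::ConvertToUtf32($s, 0)
        Chars     = $s.Length               # what Substring() counts
        Utf8Bytes = $utf8.GetByteCount($s)  # what ends up in the byte[]
    }
}
$rows
```

The Chars column is what Substring() and the GetBytes() overloads count, while Utf8Bytes is what a byte-oriented reader sees.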
To answer the base question "How can I substring a specific number of bytes from a string in PowerShell", I was able to write the following function:
function Get-SubstringByByteCount {
[CmdletBinding()]
Param(
[Parameter(Mandatory)]
[ValidateScript({ $null -ne $_ -and $_.Value -is [string] })]
[ref]$InputString,
[int]$FromIndex = 0,
[Parameter(Mandatory)]
[int]$ByteCount,
[ValidateScript({ [Text.Encoding]::$_ })]
[string]$Encoding = 'UTF8'
)
[long]$byteCounter = 0
[int]$i = $FromIndex   # honor -FromIndex as the starting position
[System.Text.StringBuilder]$sb = New-Object System.Text.StringBuilder $ByteCount
try {
while ( $byteCounter -lt $ByteCount -and $i -lt $InputString.Value.Length ) {
[char]$char = $InputString.Value[$i++]
[void]$sb.Append($char)
$byteCounter += [Text.Encoding]::$Encoding.GetByteCount($char)
}
$sb.ToString()
} finally {
if( $sb ) {
$sb = $null
[System.GC]::Collect()
}
}
}
Invocation works like so:
Get-SubstringByByteCount -InputString ( [ref]$someString ) -ByteCount 8
Some notes on this implementation:
Takes the string as a [ref] type since the original goal was to avoid copying the full string in a limited-memory scenario. This function could be re-implemented using the [string] type instead.
This function essentially adds each character to a StringBuilder until the specified number of bytes has been written.
The number of bytes of each character is determined by using one of the [Text.Encoding]::GetByteCount overloads. Encoding can be specified via a parameter, but the encoding value should match one of the static encoding properties available from [Text.Encoding]. Defaults to UTF8 as written.
$sb = $null and [System.GC]::Collect() are intended to forcibly clean up the StringBuilder in a memory-constrained environment, but could be omitted if this is not a concern.
-FromIndex takes the start position within -InputString to begin the substring operation from. Defaults to 0 to evaluate from the start of the -InputString.
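For the base64 case specifically, a simpler route may be possible: the base64 alphabet consists only of ASCII characters, each of which is exactly one byte in UTF-8, so character counts and byte counts coincide and plain Substring() can be used as long as every chunk length is a multiple of 4. The following is a minimal sketch under that assumption; the sample payload, the chunk size, and the MemoryStream standing in for the file on disk are all illustrative, not part of the original script.

```powershell
# Hypothetical sample payload; in the scenario above this would be the large
# base64 string taken from the JSON response.
$payload   = [byte[]](0..255)
$base64    = [Convert]::ToBase64String($payload)
$chunkSize = 64   # characters per chunk; must be a multiple of 4

$out = [IO.MemoryStream]::new()   # stand-in for a FileStream/BinaryWriter on disk
for ($pos = 0; $pos -lt $base64.Length; $pos += $chunkSize) {
    # The final chunk may be shorter, but it is still a multiple of 4 characters
    # because a padded base64 string always has a length divisible by 4.
    $len   = [Math]::Min($chunkSize, $base64.Length - $pos)
    $bytes = [Convert]::FromBase64String($base64.Substring($pos, $len))
    $out.Write($bytes, 0, $bytes.Length)
}
$decoded = $out.ToArray()
```

Each Substring() call still allocates a small temporary string, but only one chunk is alive at a time rather than a byte[] copy of the whole input.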
I have a "structured" file (logical fixed-length records) from a legacy program on a legacy (non-MS) operating system. I know how the records were structured in the original program, but the original O/S handled structured data as a sequence of bytes for file I/O, so a hex dump won't show you anything more than what the record length is (there are marker bytes and other record overhead imposed by the access method API used to generate the file originally).
Once I have the sequence of bytes in a PowerShell variable, with the overhead bytes "cut away", how can I convert this into a structured object? Some of the "fields" are 16-bit integers, some are strings of the form [s]data (where [s] is a byte giving the length of the "real" data in that field), some are BCD coded fixed-point numbers, some are IEEE floats.
(I haven't been specific about the structure, either on the PowerShell side or on the legacy side, because I am seeking a more-or-less 'generic' solution/technique, as I actually have several different files with different record structures to process.)
Initially, I tried to do it by creating a type that could take the buffer and overwrite a struct so that all the fields were nicely filled in. However, certain issues arose (regarding struct layout, fixed buffers and mixing fixed and managed members) and I also realised that there was no guarantee that the data in the buffer would be properly (or even legally) aligned. So I decided to try a more programmatic path.
"Manual" parsing is out, so how about automatic parsing? You're going to need to define the members of your PSobject at some point, why not do it in a way that can help programmatically parse the data. This method does not require the data in the buffer to be correctly aligned or even contiguous. You can also have fields overlap to separate raw unions into the individual members (though, typically, only one will contain a "correct" value).
First step, build a hash table to identify the members, the offset in the buffer, their data types and, if an array, the number of elements:
$struct = @{
field1 = 0,[int],0; # 0 means not an array
field2 = 4,[byte],16; # a C string maybe
field3 = 24,[char],32; # wchar_t[32] ? note: skipped over bytes 20-23
field4 = 56,[double],0
}
# the names field1/2/3/4 are arbitrary, any valid member name may be used (but not
# necessarily any valid hash key if you want a PSObject as the end result).
# also, the values could be hash tables instead of arrays. that would allow
# descriptive names for the values but doesn't affect the end result.
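As a sketch of the hash-table-valued variant mentioned in the comments above, the same layout could be written as follows. The key names Offset, Type and Count are hypothetical, chosen only for readability; the parsing loop would then read $struct[$_].Offset instead of $struct[$_][0], and so on.

```powershell
# Same layout as above, with each field described by a nested hash table.
# Offset/Type/Count are illustrative names, not required by anything.
$struct = @{
    field1 = @{ Offset = 0;  Type = [int];    Count = 0  }   # 0 means not an array
    field2 = @{ Offset = 4;  Type = [byte];   Count = 16 }
    field3 = @{ Offset = 24; Type = [char];   Count = 32 }
    field4 = @{ Offset = 56; Type = [double]; Count = 0  }
}

# Access is by name rather than by position in an array.
$offset = $struct['field3'].Offset
$count  = $struct['field3'].Count
```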
Next, use [BitConverter] to extract the required data. The problem here is that we need to call the correct method for all the varying types. Just use a (big) switch statement. The basic principle is the same for most values, get the type indicator and initial offset from the $struct definition then call the correct [BitConverter] method and supply the buffer and initial offset, update the offset to where the next element of an array would be and then repeat for as many array elements as are required. The only trap here is that the data in the buffer must have the same format as expected by [BitConverter], so for the [double] example, the bytes in the buffer must conform to IEEE-754 floating point format (assuming that [BitConverter]::ToDouble() is used). Thus, for example, raw data from a Paradox database will need some tweaking because it flips the high bit to simplify sorting.
$struct.keys | foreach {
# key order is undefined but that won't affect the final object's members
$hashobject = @{}
} {
$fieldoffs = $struct[$_][0]
$fieldtype = $struct[$_][1]
if (($arraysize = $struct[$_][2]) -ne 0) { # yes, I'm a C programmer from way back
$array = @()
} else {
$array = $null
}
:w while ($arraysize-- -ge 0) {
switch($fieldtype) {
([int]) {
$value = [bitconverter]::toint32($buffer, $fieldoffs)
$fieldoffs += 4
}
([byte]) {
$value = $buffer[$fieldoffs++]
}
([char]) {
$value = [bitconverter]::tochar($buffer, $fieldoffs)
$fieldoffs += 2
}
([string]) { # ANSI string, 1 byte per character
$array = new-object string (,[char[]]$buffer[$fieldoffs..($fieldoffs+$arraysize)])
# $arraysize has already been decremented so don't need to subtract 1
break w # "array size" was actually string length so don't loop
#
# description:
# first, get a slice of the buffer as a byte[] (assume single byte characters)
# next, convert each byte to a char in a char[]
# then, invoke the constructor String(Char[])
# finally, put the String into $array ready for insertion into $hashobject
#
# Note the convoluted syntax - New-Object expects the second argument to be
# an array of the constructor parameters but String(Char[]) requires only
# one argument that is itself an array. By itself,
# [char[]]$buffer[$fieldoffs..($fieldoffs+$arraysize)]
# is treated by PowerShell as an argument list of individual chars, corrupting the
# constructor call. The normal trick is to prepend a single comma to create an array
# of one element which is itself an array
# ,[char[]]$buffer[$fieldoffs..($fieldoffs+$arraysize)]
# but this won't work because of the way PowerShell parses the command line. The
# space before the comma is ignored so that instead of getting 2 arguments (a string
# "String" and the array of an array of char), there is only one argument, an array
# of 2 elements ("String" and array of array of char) thereby totally confusing
# New-Object. To make it work you need to ALSO isolate the single element array into
# its own expression. Hence the parentheses
# (,[char[]]$buffer[$fieldoffs..($fieldoffs+$arraysize)])
#
}
}
if ($null -ne $array) {
# must be in this order* to stop the -ne from enumerating $array to compare against
# $null. this would result in the condition being considered false if $array were
# empty ( (@() -ne $null) -> $null -> $false ) or contained only one element with
# the value 0 ( (@(0) -ne $null) -> (scalar) 0 -> $false ).
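$array += $value
# $array is not $null so must be an array to which $value is appended
if ($arraysize -eq 0) { break w }
# the post-decrement test would otherwise run one extra pass and read one element too many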
} else {
# $array is $null only if $arraysize -eq 0 before the loop (and is now -1)
$array = $value
# so the loop won't repeat thus leaving this one scalar in $array
}
}
$hashobject[$_] = $array
}
#*could have reversed it as
# if ($array -eq $null) { scalar } else { collect array }
# since the condition will only be true if $array is actually $null or contains at
# least 2 $null elements (but no valid conversion will produce $null)
At this point there is a hash table, $hashobject, with keys equal to the field names and values containing the bytes from the buffer arranged into single (or arrays of) numeric (inc. char/boolean) values or (ANSI) strings. To create a (proper) object, just invoke New-Object -TypeName PSObject -Property $hashobject or use [PSCustomObject]$hashobject.
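As a compact illustration of the same idea on a known buffer, the sketch below builds a hypothetical 12-byte record (an Int32 followed by a Double; the field names Id and Price are made up), extracts both values with [BitConverter] and turns the resulting hash table into an object:

```powershell
# Hypothetical record: Int32 at offset 0, Double at offset 4.
# BitConverter uses the machine's byte order for both writing and reading.
[byte[]]$buffer = [BitConverter]::GetBytes([int]1234) +
                  [BitConverter]::GetBytes([double]2.5)

$hashobject = @{
    Id    = [BitConverter]::ToInt32($buffer, 0)    # 4 bytes starting at offset 0
    Price = [BitConverter]::ToDouble($buffer, 4)   # 8 bytes starting at offset 4
}

# Convert the hash table into a proper object with Id and Price properties.
$record = New-Object -TypeName PSObject -Property $hashobject
$record
```

Note that the Double starts at offset 4, which would be misaligned for a native struct but is no problem here because [BitConverter] copies the bytes out.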
Of course, if the buffer actually contained structured data then the process would be more complicated but the basic procedure would be the same. Note also that the "types" used in the $struct hash table have no direct effect on the resultant types of the object members, they are only convenient selectors for the switch statement. It would work just as well with strings or numbers. In fact, the parentheses around the case labels are because switch parses them the same as command arguments. Without the parentheses, the labels would be treated as literal strings. With them, the labels are evaluated as a type object. Both the label and the switch value are then converted to strings (that's what switch does for values other than script blocks or $null) but each type has a distinct string representation so the case labels will still match up correctly. (Not really on point but still interesting, I think.)
Several optimisations are possible but increase the complexity slightly. E.g.
([byte]) { # already have a byte[] so why collect bytes one at a time
if ($arraysize -ge 0) { # was originally -gt 0 so want a byte[]
$array = [byte[]]$buffer[$fieldoffs..($fieldoffs+$arraysize)]
# slicing the byte array produces an object array (of bytes) so cast it back
} else { # $arraysize was 0 so just a single byte
$array = $buffer[$fieldoffs]
}
break w # $array ready for insertion into $hashobject, don't need to loop
}
But what if my strings are actually Unicode, you ask? Easy: just use the existing methods of the [Text.Encoding] class:
([string]) { # Unicode string, 2 (LE) bytes per character
$array = [text.encoding]::unicode.getstring([byte[]]$buffer[$fieldoffs..($fieldoffs+$arraysize*2+1)])
# $arraysize should be the string length so, initially, $arraysize*2 is the byte
# count and $arraysize*2-1 is the end index (relative to $fieldoffs) but $arraysize
# was decremented so the end index is now $arraysize*2+1, i.e. length*2-1 = (length-1)*2+1
break w # got $array, no loop
}
You could also have both ANSI and Unicode by utilising a different type indicator for the ANSI string, maybe [char[]]. Remember, the type indicators do not affect the result, they just have to be distinct (and hopefully meaningful) identifiers.
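The question also mentions strings of the form [s]data, where a leading byte holds the length. A minimal sketch of reading such a field follows, assuming the length byte is unsigned, the data is single-byte ASCII, and the field is padded to a fixed width; the sample bytes are invented, and a legacy code page would need a different [Text.Encoding] instance.

```powershell
# Hypothetical 9-byte field: length byte (5), the data "HELLO", then padding.
[byte[]]$buffer = @(5) + [Text.Encoding]::ASCII.GetBytes('HELLO') + @(0, 0, 0)
$fieldoffs = 0

$length = $buffer[$fieldoffs]   # the [s] byte: number of "real" data bytes
# GetString(byte[], index, count) decodes only the bytes that carry data,
# skipping the length byte and ignoring the padding.
$text = [Text.Encoding]::ASCII.GetString($buffer, $fieldoffs + 1, $length)
$text
```

In the switch statement this could sit behind its own type indicator and finish with break w, like the other string cases.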
I realise that this is not quite the "just dump the bytes into a union or variant record" solution mentioned in the OP's comment but PowerShell is based on .NET and uses managed objects where this sort of thing is largely prohibited (or difficult to get working, as I found). For example, assuming you could just dump raw chars (not bytes) into a String, how would the Length property get updated? This method also allows some useful preprocessing such as splitting up unions as noted above or converting raw byte or char arrays into the Strings they represent.
I am trying to write a multi-string (REG_MULTI_SZ) registry value using data gleaned from a REG file, so it's in hex format. I have managed to convert the string to a byte array using the Convert-HexStringToByteArray function, but that doesn't produce the same result in the registry as loading the REG, so I am thinking that is not actually the right data type to be casting to.
The initial data looks like this
"NavigatorLayoutOrder"=hex(7):31,00,30,00,00,00,31,00,00,00,32,00,00,00,33,00,00,00,30,00,00,00,34,00,00,00,35,00,00,00,36,00,00,00,37,00,00,00,38,00,00,00,39,00,00,00,31,00,31,00,00,00,31,00,32,00,00,00,31,00,33,00,00,00,31,00,34,00,00,00,31,00,35,00,00,00,31,00,36,00,00,00,31,00,37,00,00,00,31,00,38,00,00,00,31,00,39,00,00,00,32,00,30,00,00,00,32,00,31,00,00,00,32,00,32,00,00,00,00,00
and I have removed the hex(7): off the front, then tried it as a pure string and casting to a byte array, and neither seems to work.
I have found reference to REG_MULTI_SZ being UTF-16le, but my understanding is that this is also the default for PowerShell, so I shouldn't need to be changing the encoding, but perhaps I am wrong there?
EDIT: I also tried this, with again a successful write, but the wrong result.
$enc = [system.Text.Encoding]::UTF8
[byte[]]$bytes = $enc.GetBytes($string)
Also tried
$array = $string.Split(',')
$byte = [byte[]]$array
This also puts data into the registry, but the result is not the same as importing the REG. And, everything I am finding keeps pointing at the idea that the REG file is UTF16, so I tried
$enc = [system.Text.Encoding]::Unicode
[byte[]]$bytes = $enc.GetBytes($string)
both with BigEndianUnicode & Unicode. Not only did it not work, the result is the same which I find odd. Seems like changing the endian-ness SHOULD change the result.
EDIT: To clarify, the input string as taken from the REG file is shown above. I simply removed the hex(7): from the front of the data.
The results are seen here, where the second value is what results from PowerShell, while the first is what the REG file produced.
The code used to produce this was
$string = "31,00,30,00,00,00,31,00,00,00,32,00,00,00,33,00,00,00,30,00,00,00,34,00,00,00,35,00,00,00,36,00,00,00,37,00,00,00,38,00,00,00,39,00,00,00,31,00,31,00,00,00,31,00,32,00,00,00,31,00,33,00,00,00,31,00,34,00,00,00,31,00,35,00,00,00,31,00,36,00,00,00,31,00,37,00,00,00,31,00,38,00,00,00,31,00,39,00,00,00,32,00,30,00,00,00,32,00,31,00,00,00,32,00,32,00,00,00,00,00"
$enc = [system.Text.Encoding]::BigEndianUnicode
[byte[]]$bytes = $enc.GetBytes($string)
New-ItemProperty "HKCU:\Software\Synchro\Synchro\ProjectConfig" -name:"NavigatorLayoutOrder2" -value:$bytes -propertyType:MultiString -force
Using Unicode encoding produces a very slightly different, but still wrong, result.
For one thing, the multistring is little endian encoded, so you need [Text.Encoding]::Unicode, not [Text.Encoding]::BigEndianUnicode. Plus, using [Text.Encoding]::Unicode.GetBytes() on the string from the .reg file ("31,00,30,00,...") would give you a byte array of the characters of that string:
'3' → 51, 0
'1' → 49, 0
',' → 44, 0
'0' → 48, 0
…
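This is easy to check on a short fragment of the string. The sketch below (using '31,00' as a sample) prints the bytes produced by both encodings; every character of the text, including the comma, becomes two bytes, and only the byte order differs:

```powershell
$text = '31,00'   # first two values from the .reg data, as literal text

# Encode the *characters* '3', '1', ',', '0', '0' as UTF-16.
$le = [Text.Encoding]::Unicode.GetBytes($text)            # little endian
$be = [Text.Encoding]::BigEndianUnicode.GetBytes($text)   # big endian

$le -join ','   # 51,0,49,0,44,0,48,0,48,0
$be -join ','   # 0,51,0,49,0,44,0,48,0,48
```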
What you actually want is a byte array of the comma-separated hexadecimal values in that string:
31 → 49 (character '1')
00 → 0 (character NUL)
30 → 48 (character '0')
…
Split the string at commas, convert the hexadecimal number strings to integers, and cast the resulting list of integers to a byte array:
[byte[]]$bytes = $string -split ',' | ForEach-Object { [int]"0x$_" }
Then you can convert that (little endian encoded) byte array to a string:
$ms = [Text.Encoding]::Unicode.GetString($bytes)
and write that to the registry:
$key = 'HKCU:\Software\Synchro\Synchro\ProjectConfig'
$name = 'NavigatorLayoutOrder2'
New-ItemProperty $key -Name $name -Value $ms -PropertyType MultiString -Force
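To check the conversion without touching the registry, the decoded string can be split at the NUL separators into the individual values. This is a sketch on a shortened sample of the data; splitting into a string array is my own addition, on the assumption that a string[] is the usual shape for a MultiString value.

```powershell
# Shortened sample of the .reg data: "10", "1", "2" plus the final terminator.
$string = '31,00,30,00,00,00,31,00,00,00,32,00,00,00,00,00'

# Hex text -> bytes -> UTF-16LE string containing embedded NUL separators.
[byte[]]$bytes = $string -split ',' | ForEach-Object { [Convert]::ToByte($_, 16) }
$ms = [Text.Encoding]::Unicode.GetString($bytes)

# Split at NUL and drop the empty entries left by the terminators.
$values = $ms.Split([char[]]@([char]0), [StringSplitOptions]::RemoveEmptyEntries)
$values -join ' | '   # 10 | 1 | 2

# The array could then be written instead of $ms (not executed here):
# New-ItemProperty $key -Name $name -Value $values -PropertyType MultiString -Force
```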