Sqlite .dump handles utf8 characters differently than when manually running the command - powershell

So I'm working with a sqlite3 database using the sqlite3 command line.
I'm trying to dump the entire contents of the database so that I can import them into another database. The problem I'm having stems from this apostrophe, ’, which is the Unicode right single quotation mark (U+2019).
I've set up a Powershell script to dump the table and then import it in one go.
If I create the dump file in the following way:
Get-Content sql_commands -Raw | sqlite3 database.db
Where the file sql_commands looks like:
.output output.sql
.dump
Everything works perfectly and the unicode character is kept.
On the other hand if I try the following:
sqlite3 database.db .dump | Set-Content output.sql
The Unicode character is not kept and instead looks like "ΓÇÖ".
I'm confused as to why this is happening. I want to use the second command because that lets me easily set the filepath as a variable in the Powershell script.
I'd appreciate it if you could give some information as to what the difference between the two commands is and what is going on.

Re sqlite3 database.db .dump | Set-Content output.sql:
In order for PowerShell to interpret the textual output from an external program such as sqlite3 correctly, [Console]::OutputEncoding must match the actual character encoding used by that program.
E.g., if sqlite3 outputs UTF-8-encoded strings, first (temporarily) set [Console]::OutputEncoding = [System.Text.UTF8Encoding]::new()
Whatever .NET string PowerShell decodes the external program's output into is then re-encoded on output with Set-Content, using that cmdlet's default encoding (which is unrelated to the original encoding). Use the -Encoding parameter to control the output encoding.
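For example, a minimal sketch of the second command with both pieces in place (assuming sqlite3 emits UTF-8; $dumpPath is just an illustrative variable holding the output path):
$prev = [Console]::OutputEncoding
[Console]::OutputEncoding = [System.Text.UTF8Encoding]::new()  # match sqlite3's output encoding
try {
    sqlite3 database.db .dump | Set-Content $dumpPath -Encoding UTF8  # note: UTF8 includes a BOM in Windows PowerShell
} finally {
    [Console]::OutputEncoding = $prev  # restore the original setting
}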
Re Get-Content sql_commands -Raw | sqlite3 database.db:
Conversely, when PowerShell pipes (what is invariably) text to an external program, that text is encoded using the character encoding stored in the $OutputEncoding preference variable, so you may have to set that variable to the encoding that the external program expects.
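For instance, a sketch for the first command, assuming sqlite3 expects UTF-8-encoded input on stdin:
$OutputEncoding = [System.Text.UTF8Encoding]::new()  # encoding used when piping text TO external programs
Get-Content sql_commands -Raw | sqlite3 database.db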


Powershell string variable with UTF-8 encoding

I checked many related questions about this, but I couldn't find something that solves my problem. Basically, I want to store a UTF-8 encoded string in a variable and then use that string as a file name.
For example, I'm trying to download a YouTube video. If we print the video title, the non-English characters show up (ytd here is youtube-dl):
./ytd https://www.youtube.com/watch?v=GWYndKw_zbw -e
Output: [LEEPLAY] 시티팝 입문 City Pop MIX (Playlist)
But if I store this in a variable and print it, the Korean characters are ignored:
$vtitle = ./ytd https://www.youtube.com/watch?v=GWYndKw_zbw -e
$vtitle
Output: [LEEPLAY] City Pop MIX (Playlist)
For a comprehensive overview of how PowerShell interacts with external programs, which includes sending data to them, see this answer.
When PowerShell interprets output from external programs (such as ytd in your case), it assumes that the output uses the character encoding reflected in [Console]::OutputEncoding.
Note:
Interpreting refers to cases where PowerShell captures (e.g., $output = ...), relays (e.g., ... | Select-String ...), or redirects (e.g., ... > output.txt) the external program's output.
By contrast, printing directly to the display may not be affected, because PowerShell isn't involved then, and certain CLIs adjust their behavior when their stdout isn't redirected and print directly to the console with full Unicode support (which explains why the characters looked as expected when ytd's output printed directly to your console).
If the encoding reported by [Console]::OutputEncoding is not the same encoding used by the external program at hand, PowerShell misinterprets the output.
To fix that, you must (temporarily) set [Console]::OutputEncoding to match the encoding used by the external program.
For instance, let's assume an executable foo.exe that outputs UTF-8-encoded text:
# Save the current encoding and switch to UTF-8.
$prev = [Console]::OutputEncoding
[Console]::OutputEncoding = [System.Text.UTF8Encoding]::new()
# PowerShell now interprets foo's output correctly as UTF-8-encoded,
# and $output will correctly contain the CJK characters.
$output = foo https://example.org -e
# Restore the previous encoding.
[Console]::OutputEncoding = $prev
Important:
[Console]::OutputEncoding by default reflects the encoding associated with the legacy system locale's OEM code page, as reported by chcp (e.g. 437 on US-English systems).
Recent versions of Windows 10 now allow setting the system locale to code page 65001 (UTF-8) (the feature is still in beta as of Windows 10 version 1909), which is great, considering that most modern command-line utilities "speak" UTF-8 - but note that making this system-wide change has far-reaching consequences - see this answer.
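To inspect what your session currently assumes (a quick check only, not a fix):
chcp                          # the console's active (OEM) code page, e.g. 437
[Console]::OutputEncoding     # the encoding PowerShell uses to decode external programs' output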
With the specific program at hand, youtube-dl, js2010 has discovered that capturing in a variable works without extra effort if you pass --encoding utf-16.
The reason this works is that the resulting UTF-16LE-encoded output is preceded by a BOM (byte-order mark).
(Note that --encoding utf-8 does not work, because youtube-dl then does not emit a BOM.)
Windows PowerShell is capable of detecting and properly decoding UTF-16LE-encoded and UTF-8-encoded text irrespective of the effective [Console]::OutputEncoding IF AND ONLY IF the output is preceded by a BOM.
Caveats:
This does not work in PowerShell Core (v6+, on any of the supported platforms).
Even in Windows PowerShell you'll rarely be able to take advantage of this obscure behavior, because using a BOM in stdout output is atypical (it is typically only used in files).
This works for me in the ISE. youtube-dl is from ytdl-org.github.io. Actually, the ISE isn't strictly needed, but the filename will only show correctly in something like Explorer.
# get title
# utf-16 has a BOM (or use utf-8-sig); this program is Python-based
$a = .\youtube-dl -e https://www.youtube.com/watch?v=Qpy7N4oFQUQ --encoding utf-16
$a
Gacharic Spin - 赤裸ライアー教則映像(short ver.)TOMO-ZO編
You might have similar luck in VS Code (or on macOS/Linux).

Illegal character '?' when I create the JSON using ConvertTo-Json

I am not a PowerShell guy, so please excuse me if my question is confusing.
We are creating a JSON file using ConvertTo-Json, and it successfully creates the JSON file. However, when I cat the contents of the JSON it has '??' at the beginning of the file, but the same is not seen when I download the file or view it in the file system.
Below is the powershell code which is used to create the JSON File:
$packageJson = @{
packageName = "ABC.DEF.GHI"
version = "1.1.1"
branchName = "somebranch"
oneOps = @{
platform = "XYZ"
component = "JNL"
}
}
$packageJson | ConvertTo-Json -depth 100 | Out-File "$packageName.json"
The above code creates the file successfully, and when I view the file everything looks fine, but when I cat the file it has a leading '??' as shown below:
??{
"packageName": "ABC.DEF.GHI",
"version": "0.1.0-looper-poc0529",
"oneOps": {
"platform": "XYZ",
"component": "JNL"
},
"branchName": "somebranch"
}
Due to this I am unable to parse the JSON file, and it gives the following error:
com.jayway.jsonpath.InvalidJsonException: com.fasterxml.jackson.core.JsonParseException: Unexpected character ('?' (code 65533 / 0xfffd)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')
Those aren't ? characters. Those are two different unprintable characters that make up a Unicode byte order mark. You see ? because that's how the debugger, text editor, OS, or font in question renders unprintable characters.
To fix this, either change the output encoding, or use a character set on the other end that understands UTF-8. The former is a simpler fix, but the latter is probably better in the long run. Eventually you'll end up with data that needs an extended character.
tl;dr
It sounds like your Java code expects a UTF-8-encoded file without BOM, so direct use of the .NET Framework is needed:
[IO.File]::WriteAllText("$PWD/$packageName.json", ($packageJson | ConvertTo-Json))
As Tom Blodget points out, BOM-less UTF-8 is mandated by the IETF's JSON standard, RFC 8259.
Unfortunately, Windows PowerShell's default output encoding for Out-File and also for the redirection operator > is UTF-16LE ("Unicode"), in which:
(most) characters are represented as 2-byte units.
the file starts with a special 2-byte unit (0xff 0xfe, the UTF-16LE encoding of the Unicode character U+FEFF), the so-called BOM (byte-order mark) or Unicode signature, which serves to identify the encoding.
If target programs do not understand this encoding, they treat the BOM as data (and would subsequently misinterpret the actual data), which causes the problem you saw.
The specific symptom you saw - a complaint about character U+FFFD, which is used as the generic stand-in for an invalid character in the input - suggests that your Java code likely expects UTF-8 encoding.
Unfortunately, using Out-File -Encoding utf8 is not a solution, because PowerShell invariably writes a BOM for UTF-8 as well, which Java doesn't expect.
Workarounds:
If you can be sure that the JSON string contains only characters in the 7-bit ASCII range (no accented characters), you can get away with Out-File -Encoding Ascii, as TheIncorrigible1 suggests.
Otherwise, use the .NET framework directly for creating your output file with BOM-less UTF-8 encoding.
The answers to this question demonstrate solutions, one of which is shown in the "tl;dr" section at the top.
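If you prefer to make the BOM-less encoding explicit rather than relying on WriteAllText's default, here is a sketch (file name and depth taken from the question's code):
# $false = do not emit a UTF-8 BOM
$utf8NoBom = [System.Text.UTF8Encoding]::new($false)
[IO.File]::WriteAllText("$PWD/$packageName.json", ($packageJson | ConvertTo-Json -Depth 100), $utf8NoBom)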
If it's an option, use the cross-platform PowerShell Core edition instead, whose default encoding is sensibly BOM-less UTF-8, for compatibility with the rest of the world.
Note that not all Windows PowerShell functionality is available in PowerShell Core, however, and vice versa, but future development efforts will focus on PowerShell Core.
A more general solution that's not specific to Out-File is to set these before you call ConvertTo-Json:
$OutputEncoding = [Console]::OutputEncoding = [Text.UTF8Encoding]::UTF8;

How do I write UTF8 with no BOM to console (no file)?

I have a powershell script that returns some strings via Write-Output.
I would like those lines to be UTF-8 with no BOM. I do not want a global setting; I just want this to be effective for the particular few lines I write at that time.
This other question helped me get to a point: Using PowerShell to write a file in UTF-8 without the BOM
I took inspiration from one of the answers, and wrote the following code:
$mystr = "test 1 2 3"
$mybytes = [Text.Encoding]::UTF8.GetBytes($mystr)
$OutStream = [console]::OpenStandardOutput()
$OutStream.Write($mybytes, 0, $mybytes.Length)
$OutStream.Close()
However this code ONLY writes to stdout, and if I try to redirect it, it ignores my request. In other words, putting that code in test.ps1 and running test.ps1 >out.txt still prints to the console instead of to out.txt.
Could someone recommend how I could write this code so in case a user redirects the output of my PS to a file via >, that output is UTF8 with no BOM?
To add to Frode F.'s helpful answer:
What you were ultimately looking to achieve was to write a raw byte stream to PowerShell's success-output stream (the equivalent of stdout in traditional shells[0]), not to the console.
The success output stream is what commands in PowerShell use to pass data to each other, including to output-redirection operator >, at which point the console isn't involved.
(Data written to the success-output stream may end up displayed in the console, namely if the stream is neither captured in a variable nor redirected elsewhere.)
However, it is not possible to send raw byte streams to PowerShell's success output stream; only objects (instances of .NET types) can be sent, because PowerShell is fundamentally object-oriented.
Even data representing a stream of bytes must be sent as a .NET object, such as a [byte[]] array.
However, redirecting a [byte[]] array directly to a file with > does not write the array's raw bytes, because > creates a "Unicode" (UTF-16LE-encoded[1]) text representation of the array (as you would see if you printed the array to the console).
In order to encode objects as byte streams (that are often encoded text) for external sinks such as a file, you need the help of PowerShell cmdlets (e.g., Set-Content), > (the output redirection operator), or the methods of appropriate .NET types (e.g., [System.IO.File]), except in 2 special cases:
When piping to an external program, the encoding stored in preference variable $OutputEncoding is implicitly used.
When printing to the console, the encoding stored in [Console]::OutputEncoding is implicitly used; also, output from external programs is assumed to be encoded that way[2].
Generally, when it comes to text output, it is simpler to use the -Encoding parameter of output cmdlets such as Set-Content to let that cmdlet perform the encoding rather than trying to obtain a byte representation in a separate first step.
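For instance, a minimal illustration (note that in Windows PowerShell, -Encoding UTF8 still produces a BOM, which is the limitation discussed next):
'hü' | Set-Content -Encoding UTF8 file.txt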
However, a BOM-less UTF-8 encoding cannot be selected this way in Windows PowerShell (it can in PowerShell Core), so using an explicit byte representation is an option, in combination with Set-Content -Encoding Byte[3]; e.g.:
# Write string "hü" to a UTF-8-encoded file *without BOM*:
[Text.Encoding]::UTF8.GetBytes('hü') |
Set-Content -Encoding Byte file.txt
[0] Writing to stdout from within PowerShell, as you attempted, bypasses PowerShell's own system of output streams and prints directly to the console. (As an aside: Console.OpenStandardOutput() is designed to bypass redirections even in the context of traditional shells.)
[1] Up to PowerShell v5.0, you couldn't change the encoding used by >; in PSv5.1 and above, you can use something like $PSDefaultParameterValues['Out-File:Encoding']='UTF8' - that would still include a BOM, however. For background, see this answer of mine.
[2] There is a noteworthy asymmetry: on sending text to external programs, $OutputEncoding defaults to ASCII (7-bit only) encoding, which means that any non-ASCII characters get transliterated to literal ? characters; by contrast, on interpreting text from external programs, the applicable [Console]::OutputEncoding defaults to the system's active legacy OEM code page, which is an 8-bit encoding. See the list of code pages supported by Windows.
[3] Of course, passing bytes through is not really an encoding; perhaps for that reason -Encoding Byte was removed from PowerShell Core, where -AsByteStream must be used instead.
Encoding is used for saving text to a file, not for writing to the console. Your redirection operator > is the one saving the content, which means it decides the encoding. Redirection in PowerShell uses Unicode. If you need to use another encoding, you can't use redirection.
When you are writing to files, the redirection operators use Unicode encoding. If the file has a different encoding, the output might not be formatted correctly. To redirect content to non-Unicode files, use the Out-File cmdlet with its Encoding parameter.
Source: about_redirection
Normally you would use e.g. Out-File -FilePath test.txt -Encoding UTF8 inside your script, but it includes a BOM, so I'd recommend using WriteAllLines(path, contents), which uses UTF-8 without a BOM by default.
[System.IO.File]::WriteAllLines("c:\test.txt", $MyOutputArray)

How can I get powershell to write þ (lowercase thorn) to a file as 0xfe?

I am attempting to write a PS script that builds and executes a script file for Rocket Software's SBClient. The scripting language uses two different delimiters, þ (lowercase thorn) (0xFE) and ü (u with umlaut) (0xFC).
Each of these gets written to files as two characters. þ is written as Ã¾ (A with tilde and 3/4) (0xC3 0xBE). ü gets written as Ã¼ (A with tilde and 1/4) (0xC3 0xBC).
I have tried multiple different methods to write the file and it comes up the same way every time. I'm sure this is because these are extended ASCII characters.
Is there a way to write these to a text file with their proper two-character hex codes without converting the string to hex and writing a binary file? If not, what is the best way to convert the string to hex for this? I have seen a few different examples in other languages, but nothing really solid in PS.
It looks like I could convert the string to an array of bytes and then use io.file::WriteAllBytes() to write the file. I was just hoping there was a better way to do this.
Here is the pertinent code...
$ScriptFileContent = 'TUSCRIPTþþþ[Company Name] Logon Please:þ{enter}üPST{enter}þ2þ'
$ScriptFilePath = ([Environment]::GetFolderPath("ApplicationData")).ToString() + "\Rocket Software\SBClient\tuscript\NT"
out-file -filepath $ScriptFilePath -inputobject $ScriptFileContent -encoding ascii
Solution
$enc = [System.Text.Encoding]::GetEncoding("iso-8859-1")
$ScriptFileContent = 'TUSCRIPTþþþ[Company Name] Logon Please:þ{enter}üPST{enter}þ2þ'
$ScriptFileContent = $enc.GetBytes($ScriptFileContent)
$ScriptFilePath = ([Environment]::GetFolderPath("ApplicationData")).ToString() + "\Rocket Software\SBClient\tuscript\NT"
[io.file]::WriteAllBytes($ScriptFilePath, $ScriptFileContent)
Thanks for your help!
What you're seeing are your chars, outside ASCII, being encoded as UTF-8. You have two choices here:
either you use [System.Text.Encoding]::GetEncoding("iso-8859-1") to write your file as Latin1
or you use the FileStream.WriteByte() method on the result of [IO.File]::Open() to directly write the 0xFE and 0xFC bytes yourself (seems less overkill, but that depends on how you write the rest of the data)
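A rough sketch of that second option (variable names taken from the question; the "..." comment marks where the remaining script text would be written):
$fs = [IO.File]::Open($ScriptFilePath, [IO.FileMode]::Create)
try {
    $ascii = [Text.Encoding]::ASCII.GetBytes('TUSCRIPT')
    $fs.Write($ascii, 0, $ascii.Length)  # plain ASCII portion
    $fs.WriteByte(0xFE)                  # þ delimiter, written as a single raw byte
    # ... write the rest of the script text and the 0xFE / 0xFC delimiter bytes the same way ...
} finally {
    $fs.Close()
}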

How can I save Perl/Expect output that contains mixed ascii content?

I have a perl script that uses the expect library to login to a remote system. I'm getting the final output of the interaction with the before method:
$exp->before();
I'm saving this to a text file. When I use cat on the file it outputs fine in the terminal, but when I open the text file in an editor or try to process it the formatting is bizarre:
[H[2J[1;19HCIRCULATION ACTIVITY by TERMINAL (Nov 6,14)[11;1H
Is there a better way to save the output?
When I run enca it's identified as:
7bit ASCII characters
Surrounded by/intermixed with non-text data
You can remove non-ASCII chars:
$str1 =~ s/[^[:ascii:]]//g;
print "$str1\n";
I was able to remove the ANSI escape codes from my output by using the Text::ANSI::Util library's ta_strip() function:
use Text::ANSI::Util qw(ta_strip);  # provides ta_strip()
my $ansi_string  = $exp->before();
my $clean_string = ta_strip($ansi_string);