Invoke-WebRequest and Hebrew characters - powershell

I already tried the registry hack that makes PowerShell support Hebrew characters. I can type Hebrew without problems, but for some reason text containing Hebrew that is returned from Invoke-WebRequest comes out as gibberish (see the following screenshot).
Here's the site URL I'm attempting to query:
https://www.hometheater.co.il/vt278553.html
Update:
It looks like the Content-Type being returned ALWAYS has the charset Windows-1255, which is probably the issue.

This seems to be not only a matter of having to specify the encoding, but also that the shell cannot display the encoding correctly. If you write the response to a file, specifying the encoding, and open it with a decent text editor (not Notepad but, e.g., Notepad++), you will see that it has been parsed correctly.
Invoke-WebRequest -Uri "https://www.hometheater.co.il/vt278553.html" -ContentType "text/plain; charset=Windows-1255" -OutFile content.txt
We can also test that the in-memory representation is correct by reading the file and writing it to another file:
Get-Content .\content.txt | Set-Content test.txt
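To check the decoding without relying on what the console can display, the raw bytes can also be decoded explicitly as Windows-1255. This is only a minimal sketch: it assumes code page 1255 is available to .NET in your PowerShell version, and the four sample bytes (the Hebrew word "shalom") and the output file name are made up for illustration.

```powershell
# Windows-1255 is code page 1255 in .NET (assumed to be available here).
$hebrew = [System.Text.Encoding]::GetEncoding(1255)

# Illustrative sample: the four bytes of the Hebrew word "shalom" in Windows-1255.
[byte[]]$bytes = 0xF9, 0xEC, 0xE5, 0xED

# For a real response, the bytes could instead come from the file written by -OutFile:
#   $bytes = [System.IO.File]::ReadAllBytes((Resolve-Path .\content.txt).Path)
# or from the response object:
#   $bytes = $response.RawContentStream.ToArray()

# Decode the bytes into a regular .NET string.
$text = $hebrew.GetString($bytes)

# Write the result as UTF-8 so that most editors show it correctly.
$outPath = Join-Path ([System.IO.Path]::GetTempPath()) 'decoded.txt'
[System.IO.File]::WriteAllText($outPath, $text, [System.Text.Encoding]::UTF8)
```

If the decoded file looks right in an editor but the console still shows gibberish, the remaining problem is the console font or code page rather than the download.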

Related

Powershell Invoke-Webrequest encodes filename of uploaded file to Base64 when it contains german umlaut

When I upload a file using PowerShell's Invoke-WebRequest, the filename gets encoded to Base64 if it contains a German umlaut; otherwise it keeps its original encoding. Here's an example:
$path = "C:\test\Peter Müller.txt"
$uploadFormDict = @{}
$uploadFormDict['myfile'] = Get-Item -Path $path
Invoke-WebRequest -Uri "https://www.my-example-url.de/upload" -Method POST -Form $uploadFormDict
The filename that has been uploaded is '=?utf-8?B?UGV0ZXIgTcO8bGxlci50eHQ=?=', that is, the Base64-encoded string 'UGV0ZXIgTcO8bGxlci50eHQ=' of 'Peter Müller.txt' with a prepended '=?utf-8?B?' and an appended '?='.
If I upload a file named 'Peter Mueller.txt', the filename stays 'Peter Mueller.txt'.
How can I prevent the filename from being encoded to Base64?
Thank you!
I found a solution. To get there I had to debug PowerShell, which showed that it uses ContentDispositionHeaderValue: the Name property of the FileInfo still contains the original name, but the FileName property of the ContentDispositionHeaderValue contains the encoded string (seen here).
In the documentation of ContentDispositionHeaderValue I found that "The FileName property uses MIME encoding for non-ascii characters." Here is a description of MIME encoding, which matches exactly the format of the encoded string I encountered.
In my case, the solution was to decode that filename on the server side.
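The server-side code is not shown in this thread, so the following is only an illustration of what decoding such a filename involves, written in PowerShell for consistency. It is a minimal sketch that handles just the Base64 ("B") form of a single MIME encoded-word, like the one in the question; the variable names are hypothetical.

```powershell
# The encoded filename exactly as it arrived at the server (taken from the question).
$encoded = '=?utf-8?B?UGV0ZXIgTcO8bGxlci50eHQ=?='

# Default to the raw value in case it is not an encoded-word at all.
$decoded = $encoded

# Layout of a Base64 encoded-word: =?<charset>?B?<base64 data>?=
if ($encoded -match '^=\?(?<charset>[^?]+)\?[Bb]\?(?<data>[^?]+)\?=$') {
    # Look up the charset named in the encoded-word, then decode the Base64 payload with it.
    $charset = [System.Text.Encoding]::GetEncoding($Matches['charset'])
    $decoded = $charset.GetString([System.Convert]::FromBase64String($Matches['data']))
}

$decoded
```

A complete decoder would also have to deal with the quoted-printable ("Q") form and with names split over several encoded-words, which this sketch ignores.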

encoding issue in powershell search and replace

I'm running a PowerShell script on XML files recursively to search and replace text. The search and replace itself works fine. However, certain files contain text in other languages, such as fréquentes, which gets changed to frÃ©quentes after running the script. I've been using UTF8 encoding in the script. Any pointers on how to retain the encoding?
$content | ForEach-Object { $_ -replace 'test1', 'testing' `
    -replace 'test2', 'testing' } | Out-File $file.FullName -Encoding utf8
You seem to be ignoring the XML file's encoding, which seems to be Latin 1. XML files specify their encoding at the start (or, if they don't, they will be autodetected as UTF-8, UTF-16, or UTF-32):
<?xml version='1.0' encoding='utf-8'?>
So it seems to me like you read the content with the correct encoding, but write the file in UTF-8 which doesn't match the declared one.
You could use the XML APIs to change the file, which may be preferable, or simply change your Out-File to
Out-File -Encoding Default
However, that can cause the encoding to differ between computers, so be careful with that. I pretty much only use it for files I know are in the system's legacy codepage, or for quick one-off scripts.
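As a rough illustration of the XML API route: System.Xml.XmlDocument reads the encoding from the XML declaration when loading and writes with that same declared encoding when saving to a path, so the script does not need to know the encoding. This is only a sketch; the demo file, its content, and the XPath that visits every text node are assumptions made for the example.

```powershell
# Create a small Latin-1 demo file (hypothetical path and content).
$path   = Join-Path ([System.IO.Path]::GetTempPath()) 'encoding-demo.xml'
$latin1 = [System.Text.Encoding]::GetEncoding('iso-8859-1')
$xml    = "<?xml version='1.0' encoding='iso-8859-1'?><root>test1 fr$([char]0xE9)quentes</root>"
[System.IO.File]::WriteAllText($path, $xml, $latin1)

# Load through the XML API, which honours the declared encoding.
$doc = New-Object System.Xml.XmlDocument
$doc.PreserveWhitespace = $true
$doc.Load($path)

# Apply the replacements to text nodes only, leaving the markup untouched.
foreach ($node in $doc.SelectNodes('//text()')) {
    $node.Value = $node.Value -replace 'test1', 'testing' -replace 'test2', 'testing'
}

# Save(path) writes the file using the encoding named in the XML declaration.
$doc.Save($path)

# Read the file back as Latin-1 to confirm that the accented character survived.
$result = [System.IO.File]::ReadAllText($path, $latin1)
```

Unlike a plain text replace, this only touches text nodes, so attribute values would need their own pass if they also contain the search text.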

Parsing HTML with PowerShell on dynamic sites

Is there a way to parse HTML from http://www.pgatour.com site using Invoke-WebRequest cmdlet? When I try doing this, ParsedHtml does not contain elements that I need (because cmdlet incorrectly parses the page).
I tried getting data from this page by creating an IE COM object in PowerShell and it works, but it is very slow, so I'm wondering if there is another approach using Invoke-WebRequest (or even external parsers).
Thanks!
You could give the HtmlAgilityPack a try to parse the content returned by Invoke-WebRequest. In this scenario, I would use the -UseBasicParsing parameter.
Windows 10 64-bit. PowerShell 5.1
Parsing HTML with PowerShell 5.1 on dynamic sites using Invoke-WebRequest and a regex that returns everything between un-nested tags like <html>, <title>, <head>, and <body>. Nested tags will take some tweaking.
Invoke-WebRequest -Uri http://www.pgatour.com | sc golf.html
(gc -raw golf.html) -match '(?s)(<body[^>]*>)(.*?)(<\/body>)'
$matches[0]
Everything between <div class="success-message"> and the next </div>
Invoke-WebRequest -Uri http://www.pgatour.com | sc golf.html
(gc -raw golf.html) -match '(<div class="success-message">)(.*?|\n)*(<\/div>)'
$matches[0]
Greedy and lazy quantifiers explained
regex101.com is your friend.
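The difference between the lazy and the greedy quantifier is easiest to see on a small inline string instead of a downloaded page. A minimal sketch; the sample HTML is made up, and (?s) is the .NET inline option that lets the dot match newlines:

```powershell
# Hypothetical sample markup: two sibling <div> elements on separate lines.
$html = "<div class=`"success-message`">`nSaved`n</div>`n<div>other</div>"

# Lazy (.*?): stops at the first </div> after the opening tag.
$lazy = [regex]::Match($html, '(?s)<div class="success-message">.*?</div>').Value

# Greedy (.*): runs on to the last </div> in the input.
$greedy = [regex]::Match($html, '(?s)<div class="success-message">.*</div>').Value

$lazy
$greedy
```

Here $lazy holds only the first div, while $greedy also swallows the second one, which is why the lazy form is the one to use for "everything up to the next closing tag".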

Powershell Invoke-RestMethod incorrect character

I'm using Invoke-RestMethod to get page names from an application I'm using. I notice that when I do a GET on the page it returns the page name like so
This page â is working
However the actual page name is
This page – is working
Here's how my request looks
Invoke-WebRequest -Uri ("https://example.com/rest/api/content/123789") -Method Get -Headers $Credentials -ContentType "application/json; charset=utf-8"
The problem is with the en-dash, does anyone know how I can fix this?
If Invoke-WebRequest does not detect the response encoding correctly, you can use RawContentStream and decode it with the encoding you need:
$resp = Invoke-WebRequest -Uri ...
$html=[system.Text.Encoding]::UTF8.GetString($resp.RawContentStream.ToArray());
Invoke-RestMethod or Invoke-WebRequest?
The Invoke-RestMethod cmdlet decodes the result using the encoding given by the HttpWebResponse.CharacterSet property.
If that is not set, it falls back to a default encoding of ISO-8859-1 (afaik).
I'm assuming your server is sending a wrong charset in the response headers (or dropping it), hence the response is being decoded wrongly.
Do you know what charset/encoding is sent in the response from your server?
If you're trying Invoke-WebRequest, check the headers in your response, e.g.:
$r = invoke-webrequest http://example.com
$r.Headers
If you're dealing with an encoding issue, e.g. your server is not sending the right headers, you can always try to dump the response to a file and read it with a different encoding:
Invoke-WebRequest http://example.com -outfile .\test.txt
$content = get-content .\test.txt -Encoding utf8 -raw
In this case you will no longer be working with the HTTP response, but it might help you debug or find the encoding issues you're looking for.
One line solution (without files):
[system.Text.Encoding]::UTF8.GetString((Invoke-WebRequest "https://www.example.com").RawContentStream.ToArray())
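To see why the en dash turns into "â", it helps to reproduce the mismatch locally: the UTF-8 bytes of the page name are decoded as ISO-8859-1, which is the fallback mentioned above. A minimal sketch with no network access; the round trip at the end only works because ISO-8859-1 maps every byte to exactly one character.

```powershell
$utf8   = [System.Text.Encoding]::UTF8
$latin1 = [System.Text.Encoding]::GetEncoding('iso-8859-1')

# The real page name from the question; U+2013 is the en dash.
$original = "This page $([char]0x2013) is working"

# The server sends UTF-8 bytes...
$bytes = $utf8.GetBytes($original)

# ...but the client decodes them as ISO-8859-1, producing the garbled name.
$garbled = $latin1.GetString($bytes)

# Undo the wrong decoding: recover the original bytes, then decode them as UTF-8.
$repaired = $utf8.GetString($latin1.GetBytes($garbled))

$garbled
$repaired
```

The en dash is three bytes in UTF-8, so the garbled string contains "â" followed by two control characters that most consoles do not display.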

Invoke-WebRequest - issue with special characters in json

I'm trying to send special characters (Norwegian) using Invoke-WebRequest to an ASP.NET MVC4 API controller.
My problem is that the JSON object shows up as NULL when received by the controller if my JSON data contains characters like Æ Ø Å.
An example of my code:
$text = 'Æ Ø Å'
$jsondata = $text | ConvertTo-Json
Invoke-WebRequest -Method POST -Uri http://contoso.com/create -ContentType 'application/json; charset=utf8' -Body $jsondata
Also, when looking in Fiddler, the characters turn up as the usual weird UTF-8 boxes.
Sending JSON data from Fiddler to the same API controller works fine.
Any advice?
For the Body parameter try this:
... -Body ([System.Text.Encoding]::UTF8.GetBytes($jsondata))
The string in PowerShell is Unicode but you've specified a UTF8 encoding so I think you need to give it some help getting to UTF8.
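Put together, the suggestion amounts to something like the sketch below. The request itself is left commented out because the endpoint from the question is only an example; the round-trip check at the end is an addition for illustration, not part of the original answer.

```powershell
# Build the text from code points so the script file's own encoding does not matter.
$text     = "$([char]0xC6) $([char]0xD8) $([char]0xC5)"   # the three Norwegian letters from the question
$jsondata = $text | ConvertTo-Json

# Encode the JSON explicitly as UTF-8 bytes instead of passing the string itself.
$bodyBytes = [System.Text.Encoding]::UTF8.GetBytes($jsondata)

# The request from the question, now sending the byte array as the body:
# Invoke-WebRequest -Method POST -Uri http://contoso.com/create -ContentType 'application/json; charset=utf-8' -Body $bodyBytes

# Sanity check: decoding the bytes as UTF-8 gives back the same JSON text.
$roundTrip = [System.Text.Encoding]::UTF8.GetString($bodyBytes)
```

Passing bytes means the cmdlet no longer chooses an encoding for the body, so what goes over the wire matches the charset declared in the Content-Type header.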