Read html file as string in Powershell [duplicate] - powershell

This question already has answers here:
PowerShell: Store Entire Text File Contents in Variable
(5 answers)
Closed 5 years ago.
I need to read an HTML file and store its content in a string
From this
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta http-equiv="x-ua-compatible" content="ie=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Index</title>
</head>
<body>
Index
</body>
</html>
To an output like this
$stringValue = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\"\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"...
I've tried $stringValue = $htmlFile | ConvertTo-Json, but it converts some characters into escape codes (for example, > becomes \u003e), whereas I want to keep the special characters intact.
Any help is appreciated

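You can use the command below to read the content of the HTML file and store it in a string variable. Without -Raw, Get-Content returns an array of lines, and the [string] cast joins them with spaces; the -Raw switch (PowerShell 3.0 and later) returns the whole file as a single string with its line breaks intact.
[string]$Datas = Get-Content -Raw -Path [HTML_file_Location]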
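For comparison outside PowerShell, here is a minimal Python sketch of the same idea: read the whole file into one string, then build a quoted literal in which quotes, backslashes and control characters are escaped while < and > stay as they are. The file name index.html and its content are placeholders, and json.dumps is only one possible way to get the \" escaping shown in the question.

```python
import json
import tempfile
from pathlib import Path


def read_as_string(path):
    """Return the entire file as a single string (no per-line splitting)."""
    return Path(path).read_text(encoding="utf-8")


def to_quoted_literal(text):
    """Escape quotes, backslashes and control characters; leave < and > alone."""
    return json.dumps(text, ensure_ascii=False)


# Demo with a temporary stand-in for the HTML file (hypothetical content).
with tempfile.TemporaryDirectory() as tmp:
    html_path = Path(tmp) / "index.html"
    html_path.write_text(
        '<html xmlns="http://www.w3.org/1999/xhtml">\n</html>', encoding="utf-8"
    )
    string_value = read_as_string(html_path)
    print(to_quoted_literal(string_value))
```

Unlike the ConvertTo-Json attempt in the question, this does not rewrite > as \u003e.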

Try to read it as UTF-16 and see if output is passed through as desired. This answer shows how to read it as UTF-16.
Reading a "string in little-endian UTF-16 encoding" with BinaryReader

Related

Outlook 2016: Some emails arrive with the body in Chinese

Occasionally one of our users will receive an email from a known source, and the characters of the body of the email will be Chinese instead of English. They won't all be Chinese, but a mix of Chinese and some random characters, like this:
"格浴㹬਍†格慥㹤਍††䴼呅⁁瑨灴攭畱癩∽潃瑮湥⵴祔数•潣瑮湥㵴琢硥⽴瑨汭※档牡敳㵴瑵ⵦ㘱㸢਍††洼瑥⁡瑨灴攭畱癩∽潃瑮湥⵴祔数•潣瑮湥㵴琢硥⽴瑨汭※档牡敳㵴卉ⵏ㠸㤵ㄭ㸢਍††琼瑩敬刾捩敫⁹效潲慭獮䠠獡䐠汥癩牥摥夠畯⁲汆睯牥⁳牏䜠晩㱴琯瑩敬ാ 㰠栯慥㹤਍†戼摯⁹杢潣潬㵲⌢晦晦晦㸢਍††琼扡敬眠摩"
It only seems to be happening to one or two users, and it's not every sender - in fact, one of the emails from the sender could be fine, and the next could be like this. Encoding seems to be fine, but we're not sure where else to look. One other thing - we have Barracuda as our email filter. If we view one of the problem emails in Barracuda first, it's English. It seems to be changed to Chinese on the client side.
We have an on prem Exchange 2016 server with Outlook 2016 as the mail client, and the OS is Windows 10. Thanks!
I can tell you what has happened although I cannot tell you why.
I saved your string to a text file. I created a small Excel macro to read that file and display the characters in hexadecimal:
683C 6D74 3E6C 0A0D 2020 683C 6165 3E64 0A0D 2020 2020 4D3C 5445 2041 7468 7074 652D 7571
7669 223D 6F43 746E 6E65 2D74 7954 6570 2022 6F63 746E 6E65 3D74 7422 7865 2F74 7468 6C6D
203B 6863 7261 6573 3D74 7475 2D66 3631 3E22 0A0D 2020 2020 6D3C 7465 2061 7468 7074 652D
7571 7669 223D 6F43 746E 6E65 2D74 7954 6570 2022 6F63 746E 6E65 3D74 7422 7865 2F74 7468
6C6D 203B 6863 7261 6573 3D74 5349 2D4F 3838 3935 312D 3E22 0A0D 2020 2020 743C 7469 656C
523E 6369 656B 2079 6548 6F72 616D 736E 4820 7361 4420 6C65 7669 7265 6465 5920 756F 2072
6C46 776F 7265 2073 724F 4720 6669 3C74 742F 7469 656C 0D3E 200A 3C20 682F 6165 3E64 0A0D
2020 623C 646F 2079 6762 6F63 6F6C 3D72 2322 6666 6666 6666 3E22 0A0D 2020 2020 743C 6261
656C 7720 6469
Each pair of hexadecimal digits represents a valid ASCII character. The fourth character is “0A0D” or “linefeed carriage-return”. This should be “carriage-return linefeed”. Somehow a valid ASCII email body has been interpreted as a little-endian UTF-16 email body, so every two ASCII bytes were combined into one 16-bit character. If you split each character into its two bytes and swap them, you get:
<html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=utf-16">
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<title>Rickey Heromans Has Delivered Your Flowers Or Gift</title>
</head>
<body bgcolor="#ffffff">
<table wid
My knowledge of HTML does not extend to knowing the significance of having two charsets defined, although it would appear the first has been obeyed. All the other tags (html, head, meta, title, body and table) are lower case, so my guess is that the incorrect <META http-equiv="Content-Type" content="text/html; charset=utf-16"> has been added somewhere along the way.
Hope this helps.
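The misreading described in this answer can be reproduced with a short Python sketch. The function names are made up for illustration, and the sample text is only the first few characters of the decoded body.

```python
def misread_as_utf16le(ascii_text):
    """Decode ASCII bytes as if they were little-endian UTF-16 (the suspected fault)."""
    data = ascii_text.encode("ascii")
    if len(data) % 2:  # UTF-16 needs an even number of bytes
        data += b" "
    return data.decode("utf-16-le")


def recover_ascii(garbled):
    """Undo the misreading: turn each 16-bit character back into its two bytes."""
    return garbled.encode("utf-16-le").decode("ascii")


original = "<html>\r\n  <head>"
garbled = misread_as_utf16le(original)

# Code points of the first four "Chinese" characters
print([f"{ord(ch):04X}" for ch in garbled[:4]])  # ['683C', '6D74', '3E6C', '0A0D']
print(recover_ascii(garbled) == original)        # True
```

The four code points match the start of the hexadecimal dump above, which supports the diagnosis that the body itself is intact and only its declared encoding is wrong.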

negative regex with xidel + garbage-collect function

I currently use this command to extract URLs from a site
xidel https://www.website.com --extract "//h1//extract(@href, '.*')[. != '']"
This will extract all URLs (.*) but I would like to change this in a way that it would not extract URLs that contain specific strings in their URI path. For example, I would like to extract all URLs, except the ones that contain -text1- and -text2-
Also, xidel has a function called garbage-collect, but it's not clear to me how to use it. It could be
--extract garbage-collect()
or
--extract garbage-collect()[0]
or
x:extract garbage-collect()
or
x"extract garbage-collect()
But these didn't reduce the memory usage when extracting URLs from multiple pages using --follow.
Just noticed this old question. It looks like OP's account is suspended, so I hope the following answer will be helpful for other users.
Let's assume 'test.htm' :
<html>
<body>
<span class="a-text1-u">1</span>
<span class="b-text2-v">2</span>
<span class="c-text3-w">3</span>
<span class="d-text4-x">4</span>
<span class="e-text5-y">5</span>
<span class="f-text6-z">6</span>
</body>
</html>
To extract all "class"-nodes, except the ones that contain "-text1-" and "-text2-":
xidel -s test.htm -e "//span[not(contains(@class,'-text1-') or contains(@class,'-text2-'))]/@class"
# or
xidel -s test.htm -e "//@class[not(contains(.,'-text1-') or contains(.,'-text2-'))]"
c-text3-w
d-text4-x
e-text5-y
f-text6-z
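The same exclusion logic can be sketched in Python with the standard library. This is not how xidel works internally; it only mirrors the first expression (span elements whose class contains neither substring). The helper name classes_without is hypothetical, and ElementTree is used only because test.htm happens to be well-formed XML.

```python
import xml.etree.ElementTree as ET

TEST_HTM = """<html>
<body>
<span class="a-text1-u">1</span>
<span class="b-text2-v">2</span>
<span class="c-text3-w">3</span>
<span class="d-text4-x">4</span>
<span class="e-text5-y">5</span>
<span class="f-text6-z">6</span>
</body>
</html>"""


def classes_without(markup, excluded):
    """Return the class of every <span> whose class contains none of the excluded substrings."""
    root = ET.fromstring(markup)
    return [
        el.get("class")
        for el in root.iter("span")
        if el.get("class") and not any(s in el.get("class") for s in excluded)
    ]


print(classes_without(TEST_HTM, ["-text1-", "-text2-"]))
# ['c-text3-w', 'd-text4-x', 'e-text5-y', 'f-text6-z']
```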
xidel has a function called garbage-collect but it's not clear to me how to use these functions.
http://www.benibela.de/documentation/internettools/xpath-functions.html#x-garbage-collect:
x:garbage-collect (0 arguments)
Frees unused memory. Always call it as garbage-collect()[0], or it might garbage collect its own return value
and crash.
So that would be -e "garbage-collect()[0]".

ConvertTo-Html from hash table of html content to html document

I am trying to display some html encoded information on a document that is generated by a scheduled execution of a powershell script.
The following MVP illustrates my issue:
@{ a="<div style=""color:red;"">Hello</div>"; b="Hi"}.GetEnumerator() | Select Key, Value | ConvertTo-Html | Out-File -Encoding utf8 -FilePath C:\Scripts\Test.html
Which outputs:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>HTML TABLE</title>
</head><body>
<table>
<colgroup><col/><col/></colgroup>
<tr><th>Key</th><th>Value</th></tr>
<tr><td>a</td><td>&lt;div style=&quot;color:red;&quot;&gt;Hello&lt;/div&gt;</td></tr>
<tr><td>b</td><td>Hi</td></tr>
</table>
</body></html>
Which, when opened, looks like:
But I want my Hello to be red, and not to see the escaped html div code.
Is there any way to tell ConvertTo-Html not to escape my inputs?
Note: This MVP only illustrates the issue I'm facing. I actually have a very complex report that I would like to decorate for easier viewing (color coding, symbol, et al).
This is the report I am trying to configure:
The main purpose of the ConvertTo-Html cmdlet is to provide an easy-to-use tool for converting lists of objects into tabular HTML reports. The input for this conversion is expected to be non-HTML data, and characters that have a special meaning in HTML are automatically escaped. This cannot be turned off.
Unescaped HTML fragments can be inserted into the HTML report via the parameters -Body, -PreContent, and -PostContent before or after tabular data. However, for more complex reports this probably isn't versatile enough. The best approach in situations like that is to generate the individual parts of your report as fragments, e.g.
$ps = Get-Process | ConvertTo-Html -PreContent '<p>Process list</p>' -Fragment
and then combine all fragments with a here-string:
$html = @"
<html>
<head>
...
</head>
<body>
${ps}
<hr>
${other_fragment}
...
</body>
</html>
"@
As for individual formatting of particular parts of generated fragments: that is not supported. You need to modify the resulting HTML code yourself, either via search and replace (in fragments or the full HTML content) or by parsing and modifying the full HTML content.
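As an illustration of the search-and-replace route, here is a minimal Python sketch that un-escapes the content of each table cell in an already generated fragment. It assumes the cells contain escaped markup such as &lt;div ...&gt; and that this markup is trusted; the function name unescape_cells is hypothetical, and in PowerShell the equivalent would be a -replace over the fragment text.

```python
import html
import re


def unescape_cells(fragment):
    """Un-escape the contents of every <td>...</td> cell in an HTML fragment."""
    return re.sub(
        r"<td>(.*?)</td>",
        lambda m: "<td>" + html.unescape(m.group(1)) + "</td>",
        fragment,
        flags=re.S,
    )


row = "<tr><td>a</td><td>&lt;div style=&quot;color:red;&quot;&gt;Hello&lt;/div&gt;</td></tr>"
print(unescape_cells(row))
# <tr><td>a</td><td><div style="color:red;">Hello</div></td></tr>
```

Un-escaping every cell is only safe when all values come from a trusted source; otherwise restrict the replacement to the specific cells or markers you generate yourself.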

POEdit encodings issues

I'm using PoEdit to create my messages.mo file to be used in my php web app.
I checked my encoding is UTF-8 and still, my accents are not showing (e.g. "é", "è", ...). Actually, both source and target files are defined with UTF-8...
Here's the code I use to enable gettext:
<?php
$dir = "../locale";
$lang="fr_FR";
$domain="messages";
putenv("LANG=$lang");
setlocale(LC_ALL, $lang);
bindtextdomain ($domain, $dir);
textdomain ($domain);
echo gettext("TEST 1") . "\n";
echo __("Test 2"); // works if using gettext("Test 2");
?>
EDIT: I am also adding the header of my page, which declares that UTF-8 should be used:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
EDIT 2: Here is a link to the po file. I also tried to copy-paste the result here:
R�gle plurielle 1 accessible en �criture.
I should get
Règle plurielle 1 accessible en écriture.
Any idea how to resolve this?
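The question does not say which encoding the translated strings actually arrive in. One common cause of exactly this output, offered here only as an assumption, is ISO-8859-1 (Latin-1) bytes being sent to a page that declares UTF-8. A short Python sketch reproduces the symptom under that assumption:

```python
expected = "Règle plurielle 1 accessible en écriture."

# Assumption: the translation is emitted as Latin-1 bytes...
latin1_bytes = expected.encode("latin-1")

# ...but the browser decodes the page as UTF-8, as the meta tag declares.
shown = latin1_bytes.decode("utf-8", errors="replace")
print(shown)  # R�gle plurielle 1 accessible en �criture.
```

If that is what is happening, the fix is to make the bytes that gettext returns match the charset the page declares, rather than to change the page header.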

cfmailparam file and charset

I generate a CSV file as one string with semicolons as field separators and chr(13) chr(10) as the line separator. I attach this to an email with the cfmailparam tag using disposition="attachment", with the content attribute holding that string. The attachment seems to be encoded in UTF-8, which Excel does not like, so my umlauts are destroyed. Is it possible to give the cfmailparam tag a charset attribute to ensure the file is attached and sent Windows-1252 encoded?
Or is it better to write the string to disk with the cffile tag using the Windows-1252 encoding and attach that file to the mail with the cfmailparam file attribute?
<cfset arrTitel = [
"Titel"
, "Geschäftsbereich"
]>
<cfsavecontent variable="csv"><cfoutput><cfloop from="1" to="#ArrayLen( arrTitel )#" index="spaltentitel"><cfif spaltentitel gt 1>;</cfif>"#arrTitel[spaltentitel]#</cfloop>#chr(13)##chr(10)#</cfoutput></cfsavecontent>
<cfmail from="#mail#" to="#mailempf#" subject="subj" type="text/html">
<cfmailparam
content="#csv#"
disposition="attachment"
file="Report.csv"
>
</cfmail>
This is what it basically looks like.
Please don't advise changing the encoding of the cfm file. Other values containing umlauts are not hard-coded but come from a database.
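To make the symptom concrete, the following Python sketch (not CFML) builds the same kind of CSV string and shows what a reader that assumes Windows-1252 displays for UTF-8 bytes versus Windows-1252 bytes. The helper build_csv is hypothetical, and whether Excel really assumes Windows-1252 depends on the system locale.

```python
def build_csv(rows):
    """Join fields with ';' and end each row with CR LF, as in the question."""
    return "".join(";".join(f'"{field}"' for field in row) + "\r\n" for row in rows)


csv_text = build_csv([["Titel", "Geschäftsbereich"]])

utf8_bytes = csv_text.encode("utf-8")      # what the attachment appears to contain
cp1252_bytes = csv_text.encode("cp1252")   # what the question wants to send

# A reader that assumes Windows-1252 renders the two byte strings differently:
print(utf8_bytes.decode("cp1252").strip())    # "Titel";"GeschÃ¤ftsbereich"
print(cp1252_bytes.decode("cp1252").strip())  # "Titel";"Geschäftsbereich"
```

Either the bytes written into the attachment have to be Windows-1252, or the reader has to be told the file is UTF-8; changing the encoding of the cfm source file does not affect either.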