Puppeteer's `page.content()` always in UTF-8 or in page specific charset?

Does Puppeteer's page.content() return the string always in UTF-8 or in the page-specific charset?
I've seen that it uses document.documentElement.outerHTML internally (see the source code), but I'm not sure how it works.

Diving into outerHTML's documentation:
Reading the value of outerHTML returns a DOMString containing an HTML
serialization of the element and its descendants. Setting the value of
outerHTML replaces the element and all of its descendants with a new
DOM tree constructed by parsing the specified htmlString.
Diving into DOMString's documentation:
DOMString is a UTF-16 String. As JavaScript already uses such
strings, DOMString is mapped directly to a String.
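So it seems the mystery ends here: the value is an ordinary JavaScript string, which is UTF-16 in memory regardless of the charset the page was served in.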
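To make the distinction concrete, here is a minimal sketch that runs in Node without Puppeteer. The `html` value is only a stand-in for whatever `page.content()` would return (an assumption for illustration). It shows that the string's length counts UTF-16 code units, and that a byte encoding such as UTF-8 only comes into play when you explicitly encode the string.

```javascript
// Stand-in for the string page.content() would return (assumed sample markup).
const html = '<p>caf\u00e9 \u20ac</p>';

// A JavaScript string is a sequence of UTF-16 code units; length counts those units.
const utf16Units = html.length;

// Turning the string into bytes is a separate step. TextEncoder always produces UTF-8.
const utf8Bytes = new TextEncoder().encode(html);

console.log(utf16Units);        // number of UTF-16 code units
console.log(utf8Bytes.length);  // number of UTF-8 bytes, larger for non-ASCII text
```

So if the result is written to a file or sent over the network, the encoding is whatever you choose at that point, not something inherited from the page.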

Related

What do you call input class="gLFyf" vs input.gLFyf (vocabulary help needed)

In Chrome DevTools, the Elements tab shows the constructed DOM, and I can click on elements in the DOM, which also highlights the element on the page.
If the DOM shows:
<input class="gLFyf">
Then the page highlight will show:
input.gLFyf
I realise these are two ways of writing the same thing. I also realise the former is HTML style and the latter follows CSS conventions. However, I lack the vocabulary to properly refer to either.
What do I call each format?
E.g. would it make sense to refer to <input class="gLFyf"> as HTML syntax and input.gLFyf as CSS syntax? Is there a more widely accepted way to differentiate and name them?
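gLFyf is the name of the class. class is an HTML attribute, and a stylesheet can refer to its value to match styles with elements of that class on the page.
In a CSS selector, a class is written with a leading period (.), whereas an ID is written with a leading hash (#).
So .gLFyf is a class selector.
And #gLFyf would be an ID selector.
It is a class whether you are viewing the HTML markup or the DOM inspector; both refer to the same thing, as you already state. <input class="gLFyf"> is the element's HTML markup (a tag with a class attribute), and input.gLFyf is a CSS selector that matches it.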
This may be of some use/reference.

Appending element to body in VBScript

I found that you can append an element to another element, looked up by ID, in VBScript:
document.getElementById("td1").appendChild img
How can I append to the body? In JavaScript you would use document.body, but that throws an "object required" error for document.body in VBScript.
You could try using getElementsByTagName("body") instead (see MSDN docs).
Note that even though you'll likely have just one body tag, this function is designed to return a list of nodes, so you'll need to grab the first element before calling appendChild on it.
See also: does document.getElementsByTagName work in vbscript?

Why setHTML("<table><tr>..</tr></table>"); but then getHTML(); return "<table><tbody><tr>..</tr></tbody></table>" (Gwt)?

I don't understand how GWT's setHTML and getHTML work. They don't seem to be consistent.
Consider this example:
myInlineHtml.setHTML(SafeHtmlUtils.fromSafeConstant("<table><tr><td>Test</td></tr></table>"));
System.out.println(myInlineHtml.getHTML());
Output: "<table><tbody><tr><td>Test</td></tr></tbody></table>"
Clearly, when we set the HTML for myInlineHtml there is no <tbody></tbody>, but when we call getHTML() on myInlineHtml, GWT includes <tbody></tbody>.
Why does this happen? It is confusing: you expect getHTML() to return the same value that was passed to setHTML(), but it doesn't.
Does this happen independently of the browser, or does it depend on the browser? That would be a serious problem.
This is how HTML is parsed (how browsers are expected to parse it).
In HTML 4, TABLE was defined (in terms of SGML) as requiring a TBODY child element, and that TBODY is defined with both the start and end tags being optional.
In HTML5 (which codifies how browsers actually parse HTML), this is the same: when building a table, if the browser finds a tr, then it inserts a tbody element before parsing the tr as if there were a tbody initially.
Browsers try to produce well-formed HTML even if you omit certain tags. Most modern browsers will accept almost anything you pass them without complaining much, but instead of inserting exactly what you wrote, they interpret what you meant and build a valid DOM from it.
Therefore, it is perfectly valid to create a table without specifying a tbody node, but the browser will supply it for you. When you call getHTML(), you are reading back the parsed, normalized markup rather than the original string.

How to use unescape() function inside JavaScript?

I have a JSP page containing a JavaScript function that is called when a link is clicked. By the time the value reaches the JavaScript function, the apostrophe is encoded.
Example:
Name#039;s
Before # there is &, which originally should be:
Name's
I have used the unescape() decode function, but nothing seems to work. In the end, I had to delete the characters and add the apostrophe. Does anyone know a fix for this? Is it that JSP doesn't support encoding for &? When I was writing the same encode value in this page, it changed the symbol to the apostrophe, which is what I wanted in my code.
Built-in JavaScript functions such as unescape() and decodeURIComponent() will not help with this string: they decode URL (percent) escapes, whereas what you need to decode are HTML entities.
JavaScript has no built-in HTML entity decoder, but since you are working in a browser, and if the string is considered safe, you can do the following (with jQuery, for example):
var str = $('<p />').html(str).text();
It basically inserts the string as HTML into a <p> element and then extracts the text within.
Edit: I just realized that the JSP output you posted does not contain real HTML entities. To process the example given, add & before every #1234; to turn it into &#1234; first:
var str = $('<p />').html(str.replace(/#(\d+);/g, '&#$1;')).text();
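If jQuery or a DOM is not available, the decimal numeric references in the example can also be decoded with plain string handling. The following is a minimal sketch, not a full entity decoder: `decodeNumericEntities` is a hypothetical helper name, it handles only decimal references, and it deliberately also accepts the damaged form without the leading & that appears in the question.

```javascript
// Hypothetical helper: decodes decimal numeric character references such as
// &#039; and also the damaged form #039; (missing ampersand) from the question.
// Named entities (&amp;, &quot;, ...) and hex references are NOT handled.
function decodeNumericEntities(input) {
  return input.replace(/&?#(\d+);/g, (match, digits) => {
    const codePoint = Number(digits);
    // Leave anything outside the Unicode range untouched instead of throwing.
    if (codePoint > 0x10ffff) return match;
    return String.fromCodePoint(codePoint);
  });
}

console.log(decodeNumericEntities('Name#039;s'));
console.log(decodeNumericEntities('Name&#039;s'));
```

Because the pattern also matches a bare `#123;`, it would rewrite ordinary text that happens to look like that, so it is only suitable when the input is known to contain these references.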

Ignore CDATA while xml parsing

I am new to iPhone development. I want to ignore the CDATA tag while parsing, because the parser treats the HTML tags that follow it as text. Since I want to display the content alone, I want my parser to ignore the CDATA tag. My source is
[CDATA[<br /><p class="author"><span class="by">By: </span>By Sydney Ember</p><br><p>In the week since an </p>]].
Is there any way to ignore the CDATA tag?
Is there any way to parse my source twice so that it displays only the content?
Please give me some sample code. Thanks.
If you treat the CDATA content as XML instead of CDATA then your parser will throw an error (since your HTML is a weird mix of XHTML and HTML and is not well formed).
If you want to get the HTML, then parse the XML, extract the text content of the node, then parse that text as HTML.
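The two-stage idea can be sketched as follows. This is an illustration in JavaScript, not iPhone code, and it uses regular expressions as a stand-in for a real XML parser and a real HTML parser, so it only works for simple input like the sample. The `<description>` wrapper and the helper names `cdataPayload` and `visibleText` are assumptions made for the example.

```javascript
// Assumed sample: the question's fragment wrapped in a hypothetical <description> element.
const xml =
  '<description><![CDATA[<br /><p class="author"><span class="by">By: </span>' +
  'By Sydney Ember</p><br><p>In the week since an </p>]]></description>';

// Stage 1: pull out the raw text held in the first CDATA section
// (a real XML parser would hand you this as the node's text content).
function cdataPayload(source) {
  const match = /<!\[CDATA\[([\s\S]*?)\]\]>/.exec(source);
  return match ? match[1] : null;
}

// Stage 2: treat that text as HTML and keep only the visible text.
// Stripping tags with a regex is a simplification of parsing the HTML properly.
function visibleText(html) {
  return html
    .replace(/<[^>]*>/g, ' ') // replace each tag with a space
    .replace(/\s+/g, ' ')     // collapse runs of whitespace
    .trim();
}

console.log(visibleText(cdataPayload(xml)));
```

In a real application both stages would be done by proper parsers; the point is only the order of operations: XML first, then the extracted text as HTML.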
There is no way to ignore the CDATA tag: it's part of the XML spec, and parsers should honour it.
If you don't like the approach in this answer to your earlier question, you could get the contents of the CDATA section and parse them as XML again. However, this is strongly discouraged! You don't know that the contents of the CDATA section are going to be valid XML (they probably aren't).
If you can 100% guarantee that the CDATA section always has the form you show above, you could probably use some string manipulation to get the data out (e.g. replace the string '<span class="by">By: </span>' with ''), but again, this will almost certainly break if the CDATA contents change.
Where is the XML coming from? A better idea is to talk to the owner of the service and get them to send, instead of the current description, something like
<description>
<author>By Sydney Ember</author>
<text>In the week since an </text>
</description>