Beautiful Soup lxml Character Encoding Issue

I'm trying to parse a web page that has non-printable characters on it and write that to a file in python. I'm using Python 2.7 with requests and Beautiful Soup.
I get the page with requests, and parse it with the following-
for option in recon:
    data['opts'] = '/c' + option
    print "Getting: ",
    print option
    r = requests.post(url, data)
    print r.content
    page = bs4.BeautifulSoup(r.content, "lxml", from_encoding='utf-8')
    print page
    tag = page.pre.contents
    print tag[0]
When testing, the print r.content shows the page properly in all its unformatted glory. The page is a .cfm, and the text I'm looking for falls between "pre" tags. After running it through bs, though, bs interprets some of the non-printable text as "br" tags, so tag ends up as a list of 2 items instead of all of the text between the pre tags. Is there a way to either get just the text between the pre tags with requests, or do something differently with bs so that it doesn't misinterpret the characters?
I've read through the following trying to figure it out, plus the requests and Beautiful Soup docs, but have had no luck so far-
Joel on Software - Character Sets
SO utf-8 vs unicode
SO Getting text between tags

I overthought the problem. I just base64-encoded the data before transfer with certutil on Windows, removed the first and last line, and then decoded it on the far side.
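For anyone doing the same: certutil -encode wraps its Base64 output in -----BEGIN CERTIFICATE----- / -----END CERTIFICATE----- header and footer lines, which is why the first and last line have to go. A minimal Python sketch of the receiving side (the file names here are just placeholders):

import base64

# Assumes data.b64 was produced with `certutil -encode data.bin data.b64`
with open("data.b64") as f:
    lines = f.read().splitlines()

payload = "".join(lines[1:-1])  # drop the BEGIN/END header and footer lines
with open("data.out", "wb") as f:
    f.write(base64.b64decode(payload))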

Related

Unity 2019 - linebreak \n not working for UI text elements

I am having some difficulty getting linebreaks to work for my Unity UI elements. (Unity 2019.2.17f1 Personal)
What I'm doing is:
string twoLinesOfText = LanguagePack.getTextByID(ID);
result:
twoLinesOfText = "Text line 1\nText line 2"
Expected output:
Text line 1
Text line 2
Reality:
Text line 1\nText line 2
I have tried using "\n", "\\n" and "\r\n". None of these give the intended result.
I assign the text to the component using
UITextComponent.GetComponent<Text>().text = twoLinesOfText;
Can this direct assignment be a problem? Do I need to push my string through a ToString() or parse it somehow for the \n to be recognised?
Workaround:
I have a workaround. By using an XML file for my LanguagePack, and inserting (enter) linebreaks in the base file, I feed the linebreaks into my Unity UI elements. Obviously this is not ideal.
Reading back the strings in Debug.Log does not show which linebreak code was ultimately used: it just breaks the string according to the (enter) linebreaks in the XML file.
You can't import it through the Language Package. What you should do is:
string line1 = LanguagePackage.getTextByID(ID1);
string line2 = LanguagePackage.getTextByID(ID2);
string twoLinesOfText = line1 + "\n" + line2;
UITextComponent.GetComponent<Text>().text = twoLinesOfText;
I ran into this problem myself; a little investigation showed that what I thought was \n in the string had been converted to \\n, so it showed in the text box as \n.
Converting it during debugging to just \n got me the multiline text I wanted.
Now to investigate where in my data chain it got converted :-)
OK, investigation complete. A file was saved on my PC by a Visual Basic program using the File.WriteAllLines function, and one of those lines had a couple of instances of \n. A look at that file in Notepad shows it had correctly written that line. The problem came when I used File.ReadAllLines in my Unity program, as it converted those \n instances to \\n. As far as I can tell this is not a documented action; in fact, reading the MS docs, you might think it would have split that line into multiple lines, which it doesn't do.
I checked in my VB program, and File.ReadAllLines does not behave this way there. It's probably something to do with the environment: VB does not use \n, C# does. I fixed the problem by tagging a Replace onto the string, e.g. string.Replace("\\n", "\n"). It's entirely possible that attempting to write a string from C# with File.WriteAllLines could also mess with \n.
Geez, this was hard to write, as the editor here messes with \\n and converts it to \n, so I end up having to use \\\n.
For people who encounter this issue: you could try some HTML-like syntax and see whether it works or not.
E.g.: using <br> for a newline instead of \n

How to extract articles from Hindi Web pages with Goose?

I'm using Python Goose to extract articles from Web pages. It works fine for many languages, but fails for Hindi. I have tried adding Hindi stop words as stopwords-hi.txt and setting target_language to hi, without success.
Thanks, Eran
Yeah I had the same problem. I've been working on extracting articles in all Indian regional languages and I couldn't extract the content alone with Goose.
If you can work with the article description alone, the meta_description works perfectly. You can use that instead of cleaned_text which doesn't return anything.
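If it helps, here is a minimal sketch of that approach; it assumes python-goose's Goose()/extract() interface and that passing target_language in the config dict is enough (the URL is just the one reused from the snippet below):

from goose import Goose

url = "http://www.jagran.com/news/national-this-pay-scale-calculator-will-tell-your-new-salary-after-7th-pay-commission-14132357.html"
g = Goose({'target_language': 'hi'})
article = g.extract(url=url)
print(article.meta_description)  # populated even when article.cleaned_text comes back empty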
Another alternative, but more lines of code:
import urllib
from bs4 import BeautifulSoup

url = "http://www.jagran.com/news/national-this-pay-scale-calculator-will-tell-your-new-salary-after-7th-pay-commission-14132357.html"
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html, "lxml")

# remove all script, style and reference links to get only the article content
for script in soup(["script", "style", 'a', "href", "formfield"]):
    script.extract()

text = soup.get_text()
lines = (line.strip() for line in text.splitlines())
chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
text = '\n'.join(chunk for chunk in chunks if chunk)
print(text)
Full disclosure: I actually got the original code somewhere on Stack Overflow and modified it a tiny bit.

Why am I unable to parse non-proportional text using CAM::PDF?

While parsing page no. 22 of http://sfdoccentral.symantec.com/sf/5.1/linux/pdf/vxfs_admin.pdf, I am able to parse all the words except mount_vxfs, as its encoding style and/or font is different from the normal plain text.
Please find attached PDF Page for details.
Please find my code :-
#!/usr/bin/perl
use CAM::PDF;

my $file_name = "vxfs_admin_51sp1_lin.pdf";
my $pdf = CAM::PDF->new($file_name);
my $no_pages = $pdf->numPages();
print "$no_pages\n";
for (my $i = 1; $i <= $no_pages; $i++) {
    my $page = $pdf->getPageText($i);
    # for page no. 22 only:
    # if ($i == 22) {
    print $page;
    # }
}
PDF doesn't store the semantic text that you read but rather uses character codes which map to glyphs (the painted characters) in a particular font. Often, however, the code-glyph mapping matches common character sets (such as ISO-8859-1 or UTF-8) so that the codes are human-readable. That's the case for all of the text you have been able to parse, although sometimes the odd character, mostly punctuation, is also "wrong".
The text for "mount_vxfs" in your document is encoded completely differently, unfortunately, resulting in apparent garbage. If you're curious, you can see what's really there by substituting getPageText() with getPageContent() in your code.
In order to convert the PDF text back to meaningful characters, PDF readers have to jump through hoops with a number of conversion tables (including the so-called CMaps). Because this is a lot of programming work, many simpler libraries opt not to implement them. That's the case with CAM::PDF.
If you're just interested in parsing the text (not editing it), the following technique is something I use with success:
Obtain xpdf (http://foolabs.com/xpdf) or Poppler (http://poppler.freedesktop.org/). Poppler is a newer fork of xpdf. If you're using *nix, there will be a package available.
Use the command-line tool 'pdftotext' to extract the text from a file, either page-wise or all at once.
Example:
#!/usr/bin/perl
use English;

my $file_name = "vxfs_admin.pdf";
open my $text_fh, "/usr/bin/pdftotext -layout -q '$file_name' - 2>/dev/null |";

local $INPUT_RECORD_SEPARATOR = "\f";  # slurp a whole page at a time

while (my $page_text = <$text_fh>) {
    # this is here only for demo purposes
    print $page_text if $INPUT_LINE_NUMBER == 19;
}
close $text_fh;
(Note: The document I retrieved using your link is slightly different; the offending bit is on page 19 instead.)

Encoding problems in ASP when using English and Chinese characters

I am having problems with encoding Chinese in an ASP site. The file formats are:
translations.txt - UTF-8 (to store my translations)
test.asp - UTF-8 - (to render the page)
test.asp is reading translations.txt that contains the following data:
Help|ZH|帮助
Home|ZH|首页
test.asp splits on the pipe delimiter and, if the user has a cookie containing ZH, displays that translation; otherwise it just falls back to the Key value.
Now, I have tried the following things, which have not worked:
Add a meta tag
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
Set the Response.CharSet = "UTF-8"
Set the Response.ContentType = "text/html"
Set the Session.CodePage (and Response) to both 65001 (UTF-8)
I have confirmed that the text in translations.txt is definitely in UTF-8 and has no byte order mark
The browser is picking up that the page is Unicode UTF-8, but the page is displaying gobbledegook.
The Scripting.OpenTextFile(<file>,<create>,<iomode>,<encoding>) method returns the same incorrect text regardless of the Encoding parameter.
Here is a sample of what I want to be displayed in China (ZH):
首页
帮助
But the following is displayed:
é¦–é¡µ
帮助
This occurs in all tested browsers - Google Chrome, IE 7/8, and Firefox 4. The font definitely has a Chinese branch of glyphs. Also, I do have East Asian languages installed.
--
I have tried pasting the original value into the HTML, which did work (but note this is a hard-coded value).
首页
首页
However, this is odd.
首页 --(in hex)--> E9 A6 96 E9 A1 B5 --(as chars)--> é¦–é¡µ
Any ideas what I am missing?
In order to read the UTF-8 file, you'll probably need to use the ADODB.Stream object. I don't claim to be an expert on character encoding, but this test worked for me:
test.txt (saved as UTF-8 without BOM):
首页
帮助
test.vbs
Option Explicit
Const adTypeText = 2
Const adReadLine = -2
Dim stream : Set stream = CreateObject("ADODB.Stream")
stream.Open
stream.Type = adTypeText
stream.Charset = "UTF-8"
stream.LoadFromFile "test.txt"
Do Until stream.EOS
    WScript.Echo stream.ReadText(adReadLine)
Loop
stream.Close
Whatever part of the process is reading the translations.txt file does not seem to understand that the file is in UTF-8. It looks like it is reading it in as some other encoding. You should specify encoding in whatever process is opening and reading that file. This will be different from the encoding of your web page.
Inserting the byte order mark at the beginning of that file may also be a solution.
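To see why the file comes out as exactly that kind of garbage, here is a quick illustration (Python is used only for the demo, and Windows-1252 is an assumption about the code page being applied to the UTF-8 bytes):

# -*- coding: utf-8 -*-
s = u"首页"
# UTF-8 bytes E9 A6 96 E9 A1 B5, mis-decoded one byte at a time:
print(s.encode("utf-8").decode("windows-1252"))  # -> é¦–é¡µ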
Scripting.OpenTextFile does not understand UTF-8 at all. It can only read the system's default code page or Unicode (UTF-16). As you can see from the number of bytes used for some character sets, UTF-8 is quite inefficient, so I would recommend Unicode for this sort of data.
You should save the file as Unicode (in Windows parlance) and then open with:
Dim stream : Set stream = Scripting.OpenTextFile(yourFilePath, 1, false, -1)
Just use the following at the top of your page:
Response.CodePage=65001
Response.CharSet="UTF-8"

What kind of text code is %62%69%73%68%6F%70?

On a specific webpage, when I hover over a link, I can see the text as "bishop" but when I copy-and-paste the link to TextPad, it shows up as "%62%69%73%68%6F%70". What kind of code is this, and how can I convert it into text?
Thanks!
URL encoding, I think.
You can decode it here: http://meyerweb.com/eric/tools/dencoder/
Most programming languages will have functions to urlencode/decode too.
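For example, in Python (a small sketch; the percent escapes are just the hex ASCII codes of each letter):

from urllib.parse import unquote

print(unquote("%62%69%73%68%6F%70"))  # -> bishop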
This is URL encoding. It is designed to pass characters like <, /, or & through a URL using their ASCII values in hex after a %. However, you can also use it for characters that don't need encoding per se; it makes the URL harder to read, which is sometimes desirable.
URL encoding replaces characters outside the ASCII set.
More info about URL encoding is on the w3schools site.
As mentioned by others, this is simply a percent-encoded ASCII representation of the text so that it can be passed around in an HTTP request easily. If you've ever typed a website URL that has a space in it, you may have noticed the browser converts it to %20; that's the hexadecimal value of the space character in ASCII.
This used to be a way to trick old spam scrapers. One way spammers get email addresses is to scrape the source code of websites for strings matching the pattern "username@company.tld". By encoding just the username portion, or the whole string, as percent-escaped ASCII values, the string remains readable by humans but requires the scraper to convert it back to a literal string before it can be used to send emails. Of course, modern-day spamming tools account for these sorts of strings.