How to extract articles from Hindi Web pages with Goose?

I'm using Python Goose to extract articles from Web pages. It works fine for many languages, but fails for Hindi. I have tried adding a Hindi stopword list as stopwords-hi.txt and setting target_language to hi, without success.
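Roughly what I have, in case it helps (the URL is just a placeholder; the config keys are the standard python-goose ones as far as I can tell):

from goose import Goose

# stopwords-hi.txt was added next to goose's other stopword lists
g = Goose({'use_meta_language': False, 'target_language': 'hi'})
article = g.extract(url='http://example.com/some-hindi-article')  # placeholder URL
print(article.cleaned_text)  # comes back empty for Hindi pages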
Thanks, Eran

Yeah, I had the same problem. I've been working on extracting articles in all the Indian regional languages, and I couldn't get Goose to extract the article content on its own.
If you can work with the article description alone, meta_description works perfectly. You can use that instead of cleaned_text, which doesn't return anything.
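Something along these lines (a sketch using the same python-goose config keys as in the question; I'm assuming the attribute names from the versions I've used):

from goose import Goose

url = "http://www.jagran.com/news/national-this-pay-scale-calculator-will-tell-your-new-salary-after-7th-pay-commission-14132357.html"
g = Goose({'use_meta_language': False, 'target_language': 'hi'})
article = g.extract(url=url)
print(article.meta_description)  # the page's meta description does come through
# article.cleaned_text returns an empty string here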
Another alternative, but more lines of code:
import urllib
from bs4 import BeautifulSoup

url = "http://www.jagran.com/news/national-this-pay-scale-calculator-will-tell-your-new-salary-after-7th-pay-commission-14132357.html"
html = urllib.urlopen(url).read()  # Python 2; on Python 3 use urllib.request.urlopen
soup = BeautifulSoup(html, "lxml")

# remove script, style and anchor tags so only the article content remains
for tag in soup(["script", "style", "a"]):
    tag.extract()

text = soup.get_text()
# strip whitespace from each line, split phrases on double spaces, drop blanks
lines = (line.strip() for line in text.splitlines())
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
text = '\n'.join(chunk for chunk in chunks if chunk)
print(text)
Full disclosure: I originally found this code in another Stack Overflow answer and modified it a tiny bit.

Related

Unity 2019 - linebreak \n not working for UI text elements

I am having some difficulty getting linebreaks to work for my Unity UI elements. (Unity 2019.2.17f1 Personal)
What I'm doing is:
string twoLinesOfText = LanguagePack.getTextByID(ID);
result:
twoLinesOfText = "Text line 1\nText line 2"
Expected output:
Text line 1
Text line 2
Reality:
Text line 1\nText line 2
I have tried using "\n", "\\n" and "\r\n". None of these give the intended result.
I assign the text to the component using
UITextComponent.GetComponent<Text>().text = twoLinesOfText;
Can this direct assignment be a problem? Do I need to push my string through a toString() or parse it somehow for the \n to be recognised?
Workaround:
I have a workaround. By using an XML file for my LanguagePack, and inserting (enter) linebreaks in the base file, I feed the linebreaks into my Unity UI elements. Obviously this is not ideal.
Reading back the strings in Debug.Log does not show which linebreak code was ultimately used: it just breaks the string according to the (enter) linebreaks in the XML file.
You can't import it through the Language Package. What you should do is:
string line1 = LanguagePackage.getTextByID(ID1);
string line2 = LanguagePackage.getTextByID(ID2);
string twoLinesOfText = line1 + "\n" + line2;
UITextComponent.GetComponent<Text>().text = twoLinesOfText;
Ran into this problem myself; a little investigation showed that what I thought was \n in the string had been converted to \\n, so it showed in the text box as \n.
Converting it during debugging to just \n got me the multiline text I wanted.
Now to investigate where in my data chain it got converted :-)
Ok, investigation complete. A file was saved on my PC from a Visual Basic program using the File.WriteAllLines function, and one of its lines had a couple of instances of \n. A look at that file in Notepad shows the line had been written correctly. The problem came when I used File.ReadAllLines in my Unity program, as it converted those \n instances to \\n. As far as I can tell this is not a documented action; in fact, reading the MS docs you might think it would have split that line into multiple lines, which it doesn't do.
I checked in my VB program and File.ReadAllLines does not behave that way there. It's probably something to do with the environment: VB does not use \n, C# does. I fixed the problem by tagging a Replace onto the string, e.g. string.Replace("\\n", "\n"). It's entirely possible that writing a string from C# with File.WriteAllLines could also mess with \n.
Geez, this was hard to write, as the editor here messes with \\n and converts it to \n, so I ended up having to use \\\n
For people who encounter this issue: you could try some HTML-like syntax and see whether it works.
E.g.:
Using <br> for a newline instead of \n

Beautiful Soup lxml Character Encoding Issue

I'm trying to parse a web page that has non-printable characters on it and write that to a file in python. I'm using Python 2.7 with requests and Beautiful Soup.
I get the page with requests and parse it with the following:
import requests
import bs4

# recon, url and data are set up earlier in the script
for option in recon:
    data['opts'] = '/c' + option
    print "Getting: ",
    print option
    r = requests.post(url, data)
    print r.content
    page = bs4.BeautifulSoup(r.content, "lxml", from_encoding='utf-8')
    print page
    tag = page.pre.contents
    print tag[0]
When testing, print r.content shows the page properly in all its unformatted glory. The page is a .cfm, and the text I'm looking for falls between "pre" tags. After running it through bs4, though, some of the non-printable text gets interpreted as "br" tags, so tag ends up as a list of 2 items instead of all the text between the pre tags. Is there a way to either just get the text between the pre tags with requests, or do something differently with bs4 so it doesn't misinterpret the characters?
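To clarify what I mean by "just the text between the pre tags", something like this sketch is what I'm after (get_text() joins all the strings inside the pre element and skips the br tags bs4 inserted; I haven't verified it keeps the non-printable characters intact):

# page is the BeautifulSoup object from the loop above
pre_text = page.pre.get_text()
# or, equivalently, join the pieces by hand:
pre_text = ''.join(page.pre.strings)
print pre_text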
I've read through the following trying to figure it out, plus the requests and Beautiful Soup docs, but have had no luck so far:
Joel on Software - Character Sets
SO utf-8 vs unicode
SO Getting text between tags
Overthought the problem. I just base64 encoded the data before transfer with certutil on windows, removed the first and last line, and then decoded on the far side.
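In case it helps anyone going the same route, a rough sketch of the decode step on the receiving side (this assumes certutil -encode produced the file, so its first and last lines are the -----BEGIN/END CERTIFICATE----- markers; the filenames are placeholders):

import base64

with open('response.b64') as f:      # placeholder filename
    lines = f.read().splitlines()
payload = ''.join(lines[1:-1])       # drop the BEGIN/END marker lines
with open('response.bin', 'wb') as out:
    out.write(base64.b64decode(payload))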

Formatting PHP code for Epub in MS Word

I'm trying to format the PHP code sections of a 700+ page book for Epub conversion. If I use soft returns at the end of the code lines, they get eaten. If I use hard returns (making each line a paragraph), I either get too much space between the lines, or not enough before and after the code section. If I add an empty line before and after the code section, it gets eaten.
There are thousands of lines of code in the book. Is there some way to handle this without manually editing the html file?
Is there some common format for these code sections, like being wrapped in PHP tags?
If they are PHP tags, you can use this, which will wrap each <?php ... ?> block in a <p class="php_code"> element:
function fixPHPcode($matches)
{
return '<p class="php_code">' . $matches[0] . '</p>';
}
$data = preg_replace_callback('/<\?php(.|\s)+?\?>/i', 'fixPHPcode', $data);
I did try some fairly complex regex transforms, but I've found an easier method that actually works fairly well.
The secret is to create a style based on Word's "HTML Preformatted" style, or if you don't have that style, a style based on Normal that specifies Arial Unicode MS or Courier New as the font, with no proofing, left justified.
Indent with spaces and use soft returns (shift-enter) at the end of each line.
Calibre will produce acceptable Epub and Mobi versions of this. Courier is a crap font for code, but at least it's monospaced so the indents will line up, and people are used to seeing it as a code font.

Japanese characters in a latex \section{} cause an error

I am working on getting Japanese documents created with LaTeX. I have installed the latest version of texlive-2008, which includes CJK.
In my document I have the following:
\documentclass{class}
\usepackage{CJK}
\begin{document}
\begin{CJK*}{UTF8}{min}
\title{[Japanese Characters here 1]}
\maketitle
\section{[Japanese Characters here 2]}
[Japanese Characters here 3]
\end{CJK*}
\end{document}
In the above code there are 3 locations Japanese characters are used.
1 and 3 work fine, whereas 2, which contains Japanese characters in a \section{}, fails with the following error:
! Argument of \@sect has an extra }.
After some research it turns out this error manifests when you've put a fragile command inside a moving argument (a moving argument because the section title can be moved elsewhere, to the table of contents for example).
Does anyone know how to get this to work, and why LaTeX thinks Japanese characters are "fragile"?
Sorry to post this as an answer rather than a comment to your answer; I don't have enough rep yet to comment. (EDIT: Now I have enough rep to comment, but I'm not sorry anymore. Thanks Will.)
Your solution of replacing
\section{[Japanese Text]}
with
\section{\texorpdfstring{[Japanese Text]}{}}
suggests that you're using the hyperref package. When you use the hyperref package, any sort of not-totally-boring text (e.g. math) within \section causes a problem because \section is having trouble generating pdf bookmarks. \texorpdfstring allows you to specify how you want the section title to appear in the pdf bookmark. For example, I might write
\section{Calculation of \texorpdfstring{$H_2(\mathcal{X})$}{H\_2(X)}}
if I want the section title to be "Calculation of $H_2(\mathcal{X})$" but I want the pdf bookmark to be "Calculation of H_2(X)".
You should probably use XeTeX/XeLaTeX, as it was created to support Unicode. The change is sometimes not easy for already existing documents, though. (xelatex should be included in texlive; it is just a different executable to call -- this is how it is done in Debian.)
I have managed to get this working now!
Using Latex and CJK as before.
\section{[Japanese Text]}
was replaced with
\section{\texorpdfstring{[Japanese Text]}{}}
Now the contents pages and section titles work and update fine.

How can I extract text from a PDF file in Perl?

I am trying to extract text from PDF files using Perl. I have been using pdftotext.exe from the command line (i.e. via Perl's system function) to extract text from PDF files, and this method works fine.
The problem is that we have symbols like α, β and other special characters in the PDF files which are not being displayed in the generated txt file. Also, a few extra spaces are being added randomly in the text.
Is there a better and more reliable way to extract text from PDF files, such that the text will include all the symbols like α, β etc. and will exactly match the text in the PDF (i.e. without extra spaces)?
You can extract text from a PDF with these modules:
PDF::API2
CAM::PDF
CAM::PDF::PageText
From the CPAN documentation:
use CAM::PDF;
use CAM::PDF::PageText;

my $pdf = CAM::PDF->new($filename);
my $pageone_tree = $pdf->getPageContentTree(1);
print CAM::PDF::PageText->render($pageone_tree);
This module attempts to extract sequential text from a PDF page. This is not a robust process, as PDF text is graphically laid out in arbitrary order. This module uses a few heuristics to try to guess what text goes next to what other text, but may be fooled easily by, say, subscripts, non-horizontal text, changes in font, form fields etc.
All those disclaimers aside, it is useful for a quick dump of text from a simple PDF file.
You may never get an appropriate solution to your problem. The PDF format can encode text either as ASCII values with a font applied, or it can encode it as a bitmap. If the tool that created your PDF decided to encode the special characters as a bitmap, you will be out of luck (unless you want to get into OCR solutions, of course).
I'm not a Perl user but I imagine you'll struggle to find a better free text extractor than pdftotext.
pdftotext usually recognises non-ASCII characters fine; is it possible it's extracting them OK but the app you're using to view the text file isn't using the correct encoding? If pdftotext on Windows is the same as the one on my Linux system, then it defaults to exporting as UTF-8.
There is getpdftext.pl, which is part of CAM::PDF.
Well, I tried 2-3 Perl modules like CAM::PDF and PDF::API2, but the problem remains the same! I'm parsing a PDF file containing many pages. CAM::PDF or PDF::API2 parses the plain text very well. However, they are not able to parse the code snippets [code snippets are usually in a different font & encoding than the plain text].
James Healy is correct. After trying CAM::PDF and PDF::API2, the former of which I've had some success reading text with, downloading pdftotext worked great for a number of my implementations.
If on Windows, go here and download the xpdf precompiled binary:
http://www.foolabs.com/xpdf/download.html
Then, if you need to run this within Perl, use system, e.g.:
system("C:\\Utilities\\xpdfbin-win-3.04\\bin64\\pdftotext.exe $saveName");
where $saveName is the full path to your PDF file.
This hopefully leaves you with a text file you can open and parse in perl.
I tried this module, which works fine for the special characters in PDFs.
#!/usr/bin/perl
use strict;
use warnings;
use PDF::OCR::Thorough;
my $filename = "pdf.pdf";
my $pdf = PDF::OCR::Thorough->new($filename);
my $text = $pdf->get_text();
print "$text";
Take a look at PDFBox. It is a library, but I think it also comes with a tool for text extraction.