I'm looking for a way to take a standard mail message (RFC 822 et al.) in a text file (say, from a mail spool or maildir), format it nicely, and output a PostScript or PDF file suitable for printing. I'd prefer not to reinvent the wheel in terms of developing a pleasing layout, and I'm not familiar with PostScript or any graphics libraries anyway.
Are there any ready-made libraries or tools that can produce output similar to what most mail clients send to a printer? I've tried a couple of Linux command-line tools (like mp), but the output isn't very attractive.
You can solve your problem in two ways.
First:
Pass the e-mail through the "HTML Tidy" component or the "HTML Beautifier .Net" for formatting and cleanup, and then convert the result with "PDF Metamorphosis .Net" (www.sautinsoft.net):
Your HTML -> filter, clean up, modify HTML -> convert -> Your PDF
Second:
Send the message straight to "PDF Metamorphosis" to convert it to PDF:
Your HTML -> convert -> Your PDF
For example:
SautinSoft.PdfMetamorphosis p = new SautinSoft.PdfMetamorphosis();

// Use verbatim strings (@"...") for Windows paths.
string inputFile = @"C:\email.html";
string outputFile = @"C:\email.pdf";

// Returns 0 on success.
int result = p.HtmlToPdfConvertFile(inputFile, outputFile);
if (result == 0)
{
    System.Console.WriteLine("Converted successfully!");
    System.Diagnostics.Process.Start(outputFile);
}
I'm trying to parse a web page that has non-printable characters on it and write that to a file in Python. I'm using Python 2.7 with requests and Beautiful Soup.
I get the page with requests and parse it with the following:
for option in recon:
    data['opts'] = '/c' + option
    print "Getting: ",
    print option
    r = requests.post(url, data)
    print r.content
    page = bs4.BeautifulSoup(r.content, "lxml", from_encoding='utf-8')
    print page
    tag = page.pre.contents
    print tag[0]
When testing, the print r.content shows the page properly in all its unformatted glory. The page is a .cfm, and the text I'm looking for falls between <pre> tags. After running it through Beautiful Soup, though, some of the non-printable text gets interpreted as <br> tags, so tag ends up as a list of two items instead of all the text between the <pre> tags. Is there a way to either get just the text between the <pre> tags with requests, or do something differently with Beautiful Soup so that it doesn't misinterpret the characters?
I've read through the following trying to figure it out, plus the requests and Beautiful Soup docs, but have had no luck so far:
Joel on Software - Character Sets
SO utf-8 vs unicode
SO Getting text between tags
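For what it's worth, one way to sidestep the reinterpretation entirely is to slice the <pre> block out of the raw response body before any parser touches it. A minimal sketch, reusing r from the loop above (the output file name is hypothetical):

import re

# Grab everything between the first <pre>...</pre> pair, byte for byte,
# so nothing gets rewritten into <br> tags along the way.
match = re.search(r'<pre[^>]*>(.*?)</pre>', r.content, re.DOTALL | re.IGNORECASE)
if match:
    raw_text = match.group(1)  # the bytes exactly as the server sent them
    with open('output.txt', 'wb') as f:
        f.write(raw_text)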
Overthought the problem. I just base64-encoded the data before transfer with certutil on Windows, removed the first and last line, and then decoded on the far side.
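To make the decode step concrete: certutil -encode wraps the base64 payload in BEGIN/END CERTIFICATE header and footer lines, which is why the first and last line have to be removed. A minimal sketch of the far side in Python (file names are hypothetical):

import base64

# Drop certutil's header and footer lines, then decode the remainder.
with open('payload.b64') as f:
    lines = f.read().splitlines()
body = ''.join(lines[1:-1])
with open('payload.bin', 'wb') as out:
    out.write(base64.b64decode(body))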
I'm searching for a method to convert .msg files to .eml format. I have Outlook 2010, but it appears it can only save as .msg. I did find some third-party tools that could be used, but I'd prefer not to use a third-party tool if possible.
If you are looking for a quick and dirty VB script, Redemption (I am its author) is probably your only option. Other options are IConverterSession (C++ or Delphi only) or explicitly building the MIME file one line at a time from various message properties (MailItem can be returned by calling Namespace.OpenSharedItem in the Outlook Object Model).
set Session = CreateObject("Redemption.RDOSession")
Session.Logon 'not needed if you don't need to convert EX addresses to SMTP
set Msg = Session.GetMessageFromMsgFile("c:\temp\test.msg")
Msg.SaveAs "c:\temp\test.eml", 1031 '1031 tells Redemption to save in the EML (MIME) format
Both the MSG and EML formats store an email message together with its attachments, but they differ from each other: the EML extension is used by multiple email clients, while the MSG format is used only by the Outlook email client. In your scenario, you need an effective way to convert multiple MSG files to EML format. Using Outlook itself, you can convert an MSG file to EML via the Save As option, but that route does not convert the attachments.
I'm trying to take some data stored in a database and populate a Word template's Content Controls with it using the Open XML SDK. The data contains paragraphs and so there are carriage return and line feed characters in it. The data is stored in the database as nvarchar.
When I open the generated document, each CR+LF combination shows up as a question mark with a box around it (I'm not sure of the name of that character). There are actually two sequences back to back, so CR+LF CR+LF produces two of the strange characters.
If I unzip the .docx, take the Custom XML part, and do a hex dump, I can clearly see 0d0a 0d0a, so the CR+LF pairs are there. Word is just rendering them strangely.
I've tried enforcing UTF-8 encoding in my XmlWriter's settings, but that didn't seem to help:
Dim docStream As New MemoryStream
' Explicitly request UTF-8 without a byte-order mark.
Dim settings As XmlWriterSettings = New XmlWriterSettings()
settings.Encoding = New UTF8Encoding(False)
Dim docWriter As XmlWriter = XmlTextWriter.Create(docStream, settings)
Does anyone know how I can get Word to render these characters correctly when written to a .docx through the Open XML SDK?
To bind to a Word 2013 rich text control, your XML element has to contain a complete docx. See [MS-DOCX]:
the data stored in the XML element will be an escaped string comprised of a flattened WordprocessingML document representing the formatted data in the structured document tag range.
Earlier versions couldn't bind a rich text control.
Things should work, though (with the CR/LFs preserved, rather than w:br elements), if you bind to a plain text control with its multiline property set to true.
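For illustration, the multiline setting corresponds to the w:text element in the control's properties. A sketch of the relevant WordprocessingML (only the sdtPr fragment, not a complete control):

<w:sdtPr>
  <!-- ...other structured document tag properties... -->
  <!-- marks this as a plain text control that honors CR/LF -->
  <w:text w:multiLine="1"/>
</w:sdtPr>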
Friends,
I am preparing a TSV file from an Excel file containing Chinese (special) characters, such as: The Seonjeongneung ... Jeonghyeon (貞顯王后, 1462–1530) .....
I have tried using Perl's CPAN modules Spreadsheet::ParseExcel and Spreadsheet::ParseExcel::FmtJapan, but with no success: the characters appear as ?? in the TSV file when it is opened in Vim.
I also tried binmode STDOUT, ':utf8'; and binmode STDOUT, ':encoding(cp932)';
Please help me find a way to extract the information from the Excel sheets into TSV format.
PS: Excel allows a direct save-as to TSV, but the output was garbled there as well.
I just exported your sample text perfectly from OpenOffice Calc by choosing the "Save as .csv" option and picking UTF-8 as the format. I'd be very surprised if Excel can't do the same. Have you considered the possibility that Vim / your console doesn't support Chinese characters correctly, or that it's set to use a font that doesn't include Chinese characters? To check for this kind of error, open your .csv or .tsv file in your web browser. Web browsers will do anything to correctly display a file, including changing fonts as necessary.
If you want, send me the file you need to export and I'll check if there's anything weird about it. Could be one of the native Chinese encodings (gb or big5) instead of Unicode.
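If you do stay with Perl, a minimal Spreadsheet::ParseExcel sketch that writes the TSV through an explicit UTF-8 layer (file names are hypothetical) would look something like this:

use strict;
use warnings;
use Spreadsheet::ParseExcel;

my $parser   = Spreadsheet::ParseExcel->new();
my $workbook = $parser->parse('input.xls') or die $parser->error();

# Write the TSV through an explicit UTF-8 layer so wide characters survive.
open my $out, '>:encoding(UTF-8)', 'output.tsv' or die $!;

for my $worksheet ($workbook->worksheets()) {
    my ($row_min, $row_max) = $worksheet->row_range();
    my ($col_min, $col_max) = $worksheet->col_range();
    for my $row ($row_min .. $row_max) {
        my @cells;
        for my $col ($col_min .. $col_max) {
            my $cell = $worksheet->get_cell($row, $col);
            push @cells, defined $cell ? $cell->value() : '';
        }
        print {$out} join("\t", @cells), "\n";
    }
}
close $out;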
I am trying to extract text from PDF files using Perl. I have been using pdftotext.exe from the command line (i.e. via Perl's system function) to extract text from PDF files, and this method works fine.
The problem is that we have symbols like α, β, and other special characters in the PDF files which are not being rendered in the generated .txt file. Also, a few extra spaces are being added randomly in the text.
Is there a better and more reliable way to extract text from PDF files such that the text will include all the symbols like α, β, etc., and will exactly match the text in the PDF (i.e. without extra spaces)?
You can extract text from a PDF with these modules:
PDF::API2
CAM::PDF
CAM::PDF::PageText
From the CPAN documentation:
my $pdf = CAM::PDF->new($filename);
my $pageone_tree = $pdf->getPageContentTree(1); # content tree for page 1
print CAM::PDF::PageText->render($pageone_tree); # best-effort text rendering
This module attempts to extract sequential text from a PDF page. This is not a robust process, as PDF text is graphically laid out in arbitrary order. This module uses a few heuristics to try to guess what text goes next to what other text, but may be fooled easily by, say, subscripts, non-horizontal text, changes in font, form fields etc.
All those disclaimers aside, it is useful for a quick dump of text from a simple PDF file.
You may never get an appropriate solution to your problem. The PDF format can encode text either as ASCII values with a font applied, or it can encode it as a bitmap. If the tool that created your PDF decided to encode the special characters as a bitmap, you will be out of luck (unless you want to get into OCR solutions, of course).
I'm not a Perl user but I imagine you'll struggle to find a better free text extractor than pdftotext.
pdftotext usually recognises non-ASCII characters fine. Is it possible it's extracting them OK, but the app you're using to view the text file isn't using the correct encoding? If pdftotext on Windows is the same as the one on my Linux system, then it defaults to exporting as UTF-8.
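If in doubt, you can make the encoding explicit, since pdftotext accepts an -enc option. A minimal sketch calling it from Perl (paths are hypothetical):

# Force UTF-8 output rather than relying on the default.
system('pdftotext', '-enc', 'UTF-8', 'input.pdf', 'output.txt') == 0
    or die "pdftotext failed: $?";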
There is getpdftext.pl, which is part of CAM::PDF.
Well, I tried 2-3 Perl modules, like CAM::PDF and PDF::API2, but the problem remains the same! I'm parsing a PDF file containing many pages. CAM::PDF and PDF::API2 parse the plain text very well; however, they are not able to parse the code snippets (code snippets are usually in a different font and encoding than the plain text).
James Healy is correct. After trying CAM::PDF and PDF::API2, the former of which I've had some success reading text with, downloading pdftotext worked great for a number of my implementations.
If you're on Windows, go here and download the precompiled xpdf binary:
http://www.foolabs.com/xpdf/download.html
Then, if you need to run this from within Perl, use system, e.g.:
system("C:\Utilities\xpdfbin-win-3.04\bin64\pdftotext.exe $saveName");
where $saveName is the full path to your PDF file.
This hopefully leaves you with a text file you can open and parse in perl.
I tried this module, and it works fine for the special characters in PDFs:
#!/usr/bin/perl
use strict;
use warnings;
use PDF::OCR::Thorough;

my $filename = "pdf.pdf";
my $pdf  = PDF::OCR::Thorough->new($filename);
my $text = $pdf->get_text(); # extracts the text (OCR-based, so it copes with unusual encodings)
print $text;
Take a look at PDFBox. It is a Java library, but I think it also comes with a tool for extracting text.
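For reference, the standalone PDFBox app does expose text extraction as a command; a hedged example invocation (the jar version number is hypothetical):

java -jar pdfbox-app-2.0.27.jar ExtractText input.pdf output.txt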