I’m am trying to create a pdf file from matlab figure using cmyk colors, but facing a problem with umlauts and also some other special characters. Is there any other way to handle this than Latex? The following example demonstrates the issue.
plot(rand(199,1))
title_string = ['Some text:äö' char(228) ':2005' char(150) '2008:end text'];
title(title_string);
print(gcf,'-dpdf','cmykfile.pdf','-r600','-cmyk');
print(gcf,'-dpdf','rgbfile.pdf','-r600');
As you can see from the pdf-files the RGB-version handles umlauts, but not en-dash, and CMYK skips them all.
PDF is generated in Matlab using Ghostscript, but I have not found how to configure character encoding for GS.
I am using Windows and Matlab R2014.
I'm not completely sure this is the solution you was looking for.
Anyway, if you create an eps first and then convert it to pdf the output file doesn't have any issue with the special characters in the title, provided that you don't build your title string using char.
plot(rand(199,1))
title_string = 'Some text:äöä:2005—2008æ:end text';
title(title_string);
print(gcf,'-depsc','cmykfile.eps','-r600','-cmyk');
!ps2pdf cmykfile.eps cmykfile.pdf
The code above works if you have the ps2pdf utility in your system path. You already have ps2pdf on your computer if you have MiKTeX installed, but you might need to update your system path. Basically ps2pdf should be a shortcut to gs, therefore also if you have only gs and not MiKTeX installed, you should be able to achieve the same result.
EDIT
On my machine (Windows 7, MATLAB R2014b), also this code works well, without the need to use ps2pdf:
plot(rand(199,1))
title_string = 'Some text:äöä:2005—2008æ:end text';
title(title_string);
print(gcf,'-dpdf','cmykfile.pdf','-r600','-cmyk');
It seems that the issue happens when you build the title string using char.
Related
I run some matlab function and after upgrading to a new computer, MATLAB 2017b is not able to do string comparison containing umlaut because all the umlaut are dispalyed as ?. See below:
strcmp(factor_struct.conditions(i,k),'gr?sser als')
Chaning ? to ö does not work as matlab seems not able to display properly this character.
Is there a set-up to change in order to be able to read that type of charachters?
I managed to solve the problem by changing indeed the OS language format.
I'm trying to read a .csv file for my script, but the script contains the special danish characters "æ","ø" and "å". So far my idea was to define it like this:
CellM = readtable(csvdata,'Format','%æ%ø%å%s');
but it's not working.
You can specify the file encoding when you call readtable:
readtable(..., 'Encoding', 'UTF-8');
This should solve your problem, assuming you didn't use another function on csvdata, earlier, which already messed your data up.
The MATLAB Engine is a C interface to MATLAB. It provides a function engEvalString() which takes some MATLAB code as a C string (char *), evaluates it, then returns MATLAB's output as a C string again.
I need to be able to pass unicode data to MATLAB through engEvalString() and to retrieve the output as unicode. How can I do this? I don't care about the particular encoding (UTF-8, UTF-16, etc.), any will do. I can adapt my program.
More details:
To give a concrete example, if I send the following sting, encoded as, say, UTF-8,
s='Paul Erdős'
I would like to get back the following output, encoded again as UTF-8:
s =
Paul Erdős
I hoped to achieve this by sending feature('DefaultCharacterSet', 'UTF-8') (reference) before doing anything else, and this worked fine when working with MATLAB R2012b on OS X. It also works fine with R2013a on Ubuntu Linux. It does not work on R2013a on OS X though. Instead of the character ő in the output of engEvalString(), I get character code 26, which is supposed to mean "I don't know how to represent this". However, if I retrieve the contents of the variable s by other means, I see that MATLAB does correctly store the character ő in the string. This means that it's only the output that didn't work, but MATLAB did interpret the UTF-8 input correctly. If I test this on Windows with R2013a, neither input, nor output works correctly. (Note that the Windows and the Mac/Linux implementations of the MATLAB Engine are different.)
The question is: how can I get unicode input/output working on all platforms (Win/Mac/Linux) with engEvalString()? I need this to work in R2013a, and preferably also in R2012b.
If people are willing to experiment, I can provide some test C code. I'm not posting that yet because it's a lot of work to distill a usable small example from my code that makes it possible to experiment with different encodings.
UPDATE:
I learned about feature('locale') which returns some locale-related data. On Linux, where everything works correctly, all encodings it returns are UTF-8. But not on OS X / Windows. Is there any way I could set the various encodings returned by feature('locale')?
UPDATE 2:
Here's a small test case: download. The zip file contains a MATLAB Engine C program, which reads a file, passes it to engEvalString(), then writes the output to another file. There's a sample file included with the following contents:
feature('DefaultCharacterSet', 'UTF-8')
feature('DefaultCharacterSet')
s='中'
The (last part of the) output I expect is
>>
s =
中
This is what I get with R2012b on OS X. However, R2013 on OS X gives me character code 26 instead of the character 中. Outputs produces by R2012b and R2013a are included in the zip file.
How can I get the expected output with R2013a on all three platforms (Windows, OS X, Linux)?
I strongly urge you to use engPutVariable, engGetVariable, and Matlab's eval instead. What you're trying to do with engEvalString will not work with many unicode strings due to embedded NULL (\0) characters, among other problems. Due to how the Windows COM interface works, the Matlab engine can't really support unicode in interpreted strings. I can't speculate about how the engine works on other platforms.
Your other question had an answer about using mxCreateString_UTF16. Wasn't that sufficient?
I have a function within a custom CRM web application (old VB.Net circa 2003) that takes a set of fields from a database and merges them with palceholders in a set of RTF based template documents. These generate merged letters and documentation. The code essentially loops through each line of the RTF template file and replaces any instances of the placeholder values with text from a database record. The issue I'm having is that users have pasted a certain type of apostrophe into the web app (and therefore into the database) that is not rendering correctly in the resulting RTF file. It is rendering like this - ’.
I need a way to spot this invalid apostrophe in the code and replace it with a valid one. Unfortunately when I paste the invalid apostrophe into the Visual Studio editor it gets converted into the correct one. So I need another way to express this invalid apostrophe's value. Unfortunately I do not know a great deal about unicode and other encodings so I'm calling out for help with this.
Any ideas?
If you really just want to figure out what the character is you might want to try and paste it into a text editor like ultraedit. It has a hex mode that you can flip to to see the actual underlying bytes.
In order to do the replace once you've figured out the character you'd do something like this in Vb,
text.Replace(ChrW(2001), "'")
Note that you might not be able to figure it out easily using the text editor because it might also get mangled by paste from the clipboard. You might want to either print some debug of the ascii values from code. You can use the AscW function to do that.
I can't help but think that it may actually simply be a case of specifying the correct encoding to use when you write out the stream though. Assuming you're using a StreamWriter you can specify it on the constructor. I'm guessing you actually want ASCII given your requirement.
oWriter = New System.IO.StreamWriter(path, False, System.Text.Encoding.ASCII)
It looks like you probably want to encode characters out of the 8 bit range (>255).
You can do that using \uNNNN according to the wikipedia article.
I am trying to extract text from PDF files using Perl. I have been using pdftotext.exe from command line (i.e using Perl system function) for extracting text from PDF files, this method works fine.
The problem is that we have symbols like α, β and other special characters in the PDF files which are not being displayed in the generated txt file. Also few extra spaces are being added randomly in the text.
Is there a better and more reliable way to extract text from PDF files such that the text will include all the symbols like α, β etc and the text will exactly match the text in the PDF (i.e without extra spaces)?
These modules you can acheive the extract text from pdf
PDF::API2
CAM::PDF
CAM::PDF::PageText
From CPAN
my $pdf = CAM::PDF->new($filename);
my $pageone_tree = $pdf->getPageContentTree(1);
print CAM::PDF::PageText->render($pageone_tree);
This module attempts to extract sequential text from a PDF page. This is not a robust process, as PDF text is graphically laid out in arbitrary order. This module uses a few heuristics to try to guess what text goes next to what other text, but may be fooled easily by, say, subscripts, non-horizontal text, changes in font, form fields etc.
All those disclaimers aside, it is useful for a quick dump of text from a simple PDF file.
You may never get an appropriate solution to your problem. The PDF format can encode text either as ASCII values with a font applied, or it can encode it as a bitmap. If the tool that created your PDF decided to encode the special characters as a bitmap, you will be out of luck (unless you want to get into OCR solutions, of course).
I'm not a Perl user but I imagine you'll struggle to find a better free text extractor than pdftotext.
pdftotext usually recognises non-ASCII characters fine, is it possible it's extracting them ok but the app you're using to view the text file isn't using the correct encoding? If pdftoetxt on windows is the same as the one on my linux system, then it defaults to exporting as utf-8.
There is getpdftext.pl; part of CAM::PDF.
Well, I tried 2-3 perl modules like CAM::PDF, API2 but the problem remains the same! I'm parsing a pdf file containing main pages. Cam or API2 parses the plain text very well. However, they are not able to parse the code snippet [code snippet usually are in different font & encoding than plain text].
James Healy is correct. After trying CAM::PDF and PDF::API2, the former of which I've had some success reading text, downloading pdftotext worked great for a number of my implementations.
If on windows go here and download xpdf precompiled binary:
http://www.foolabs.com/xpdf/download.html
Then, if you need to run this within perl use system, e.g.,:
system("C:\Utilities\xpdfbin-win-3.04\bin64\pdftotext.exe $saveName");
where $saveName is the full path to your PDF file.
This hopefully leaves you with a text file you can open and parse in perl.
i tried this module which is working fine for special characters of pdf..
!/usr/bin/perl
use strict;
use warnings;
use PDF::OCR::Thorough;
my $filename = "pdf.pdf";
my $pdf = PDF::OCR::Thorough->new($filename);
my $text = $pdf->get_text();
print "$text";
Take a look at PDFBox. It is a library but i think that it also comes with some tool to do text extracting.