itext: how to tweak text extraction?

I'm using iText 5.5.8 for Java.
Following the default, straightforward text extraction procedures, i.e.
PdfTextExtractor.getTextFromPage(reader, pageNumber)
I was surprised to find several mistakes in the output; specifically, every letter 'd' comes out as an 'o'.
So how does text extraction in iText really work? Is it some kind of OCR?
I took a look under the hood, trying to grasp how TextExtractionStrategy works, but I couldn't figure out much. SimpleTextExtractionStrategy, for example, seems to just determine the presence of lines and spaces. It's TextRenderInfo that provides the text, by invoking some decode method on a GraphicsState's font field, and that's as far as I could go without getting a major migraine.
So who's my man? Which class should I override, or which parameter should I tweak, to be able to tell iText "hey, you're reading all ds wrong!"?
edit:
sample PDF can be found at http://www.fpozzi.com/stampastopper/download/ name of file is 0116_LR.pdf
Sorry, can't share a direct link.
This is some basic code for text extraction
import java.io.File;
import java.io.IOException;

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

public class Import
{
    public static void importFromPdf(final File pdfFile) throws IOException
    {
        PdfReader reader = new PdfReader(pdfFile.getAbsolutePath());
        try
        {
            for (int i = 1; i <= reader.getNumberOfPages(); i++)
            {
                System.out.println(PdfTextExtractor.getTextFromPage(reader, i));
                System.out.println("----------------------------------");
            }
        }
        catch (IOException e)
        {
            throw e;
        }
        finally
        {
            reader.close();
        }
    }

    public static void main(String[] args)
    {
        try
        {
            importFromPdf(new File("0116_LR.pdf"));
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
    }
}
edit after #blagae and #mkl answers
Before starting to fiddle with iText, I tried text extraction with Apache PDFBox (a project similar to iText I just discovered), but it exhibits the same issue.
Understanding how these programs treat text is way beyond my dedication, so I have written a simple method to extract text from the raw page content, that is, whatever stands between the BT and ET markers.
import java.io.File;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import com.itextpdf.text.io.RandomAccessSourceFactory;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.RandomAccessFileOrArray;
import com.itextpdf.text.pdf.parser.ContentByteUtils;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

public class Import
{
    private final static Pattern actualWordPattern = Pattern.compile("\\((.*?)\\)");

    public static void importFromPdf(final File pdfFile) throws IOException
    {
        PdfReader reader = new PdfReader(pdfFile.getAbsolutePath());
        Matcher matcher;
        String line, extractedText;
        boolean anyMatchFound;
        try
        {
            for (int i = 1; i <= 16; i++)
            {
                byte[] contentBytes = ContentByteUtils.getContentBytesForPage(reader, i);
                RandomAccessFileOrArray raf = new RandomAccessFileOrArray(new RandomAccessSourceFactory().createSource(contentBytes));
                while ((line = raf.readLine()) != null && !line.equals("BT"));
                extractedText = "";
                while ((line = raf.readLine()) != null && !line.equals("ET"))
                {
                    anyMatchFound = false;
                    matcher = actualWordPattern.matcher(line);
                    while (matcher.find())
                    {
                        anyMatchFound = true;
                        extractedText += matcher.group(1);
                    }
                    if (anyMatchFound)
                        extractedText += "\n";
                }
                System.out.println(extractedText);
                System.out.println("+++++++++++++++++++++++++++");
                String properlyExtractedText = PdfTextExtractor.getTextFromPage(reader, i);
                System.out.println(properlyExtractedText);
                System.out.println("---------------------------");
            }
        }
        catch (IOException e)
        {
            throw e;
        }
        finally
        {
            reader.close();
        }
    }

    public static void main(String[] args)
    {
        try
        {
            importFromPdf(new File("0116_LR.pdf"));
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
    }
}
It appears, at least in my case, that the characters are correct. However, the order of words or even letters is messy, super messy in fact, so this approach is unusable as well.
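The messiness makes sense: a content stream may position and draw text runs in any order, so a reader has to sort the extracted chunks by their page coordinates before joining them. Here is a rough, stdlib-only illustration of that idea; the chunk data is made up, and real strategies (such as iText's LocationTextExtractionStrategy) are far more careful about baselines, rotation, and word spacing:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ChunkSortDemo {
    // A text run plus the page coordinates at which it is drawn.
    record Chunk(String text, float x, float y) {}

    // Order chunks top-to-bottom (PDF y grows upward, so higher y first),
    // then left-to-right, and join runs that share a baseline into lines --
    // a toy version of what location-based extraction strategies do.
    public static String assemble(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingDouble((Chunk c) -> -c.y())
                              .thenComparingDouble(Chunk::x));
        StringBuilder sb = new StringBuilder();
        float lastY = Float.NaN;
        for (Chunk c : sorted) {
            if (!Float.isNaN(lastY) && c.y() != lastY) sb.append('\n');
            sb.append(c.text());
            lastY = c.y();
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The content stream may emit these in any order (hypothetical data).
        System.out.println(assemble(List.of(
                new Chunk("world", 60, 700),
                new Chunk("Second line", 20, 680),
                new Chunk("Hello ", 20, 700))));
    }
}
```

This is why grabbing raw strings between BT and ET loses the reading order: the positional operators (Td, TJ offsets) carry it.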
What really surprises me is that every method I have tried so far to retrieve text from PDFs, including copy/paste from Adobe Reader, screws something up.
I have come to the conclusion that the most reliable way to get some decent text extraction may also be the most unexpected: some good OCR.
I am now trying to:
1) transform the PDF into an image (PDFBox is great at doing that - do not even bother to try pdf-renderer)
2) OCR that image
I will post my results in a few days.

Your input document has been created in a strange (but 'legal') way. There is a ToUnicode mapping in the resources that maps arbitrary glyphs to Unicode code points. In particular, character code 0x64, d in ASCII, is mapped to Unicode code point 0x6F, which is o, in this font. This is not a problem per se - any PDF viewer can handle it - but it is strange, because all the other glyphs that are used are not "cross-mapped": character 0x63 is mapped to code point 0x63 (which is c), and so on.
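To illustrate what such a cross-mapped entry does to extraction, here is a hypothetical, stdlib-only sketch of the lookup a ToUnicode-driven extractor performs; the one-entry map stands in for the CMap in this document's font:

```java
import java.util.Map;

public class ToUnicodeDemo {
    // Hypothetical ToUnicode mapping mirroring the document's font:
    // every character code maps to itself except 0x64 ('d'),
    // which maps to code point 0x6F ('o').
    static final Map<Integer, Integer> TO_UNICODE = Map.of(0x64, 0x6F);

    // What a naive text extractor produces: it trusts the CMap blindly.
    public static String extract(String glyphCodes) {
        StringBuilder sb = new StringBuilder();
        for (char c : glyphCodes.toCharArray()) {
            sb.append((char) (int) TO_UNICODE.getOrDefault((int) c, (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The page shows "dd", but extraction yields "oo".
        System.out.println(extract("dd"));
    }
}
```

The rendered glyph is chosen by the font program, so the page still looks right; only the extracted text goes wrong.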
Now for the reason that Acrobat does the text extraction correctly (except for the space), and the others go wrong. We'll have to delve into the PDF syntax for this:
[(p) -17.9 (e) -15.1 (l) 1.4 (l) 8.4 (i) -20 (m) 5.8 (i) 14 (st) -17.5 (e) 31.2 (,) -20.1 (a)] TJ
<</ActualText <feff00640064> >> BDC
5.102 0 Td
[(d) -14.2 (d)] TJ
EMC
That tells a PDF viewer to print p-e-l-l-i- -m-i-st-e- -a on the first line of code, and d-d after that on the fourth line. However, d maps to o, which is apparently only a problem for text extraction. Acrobat does do the text extraction correctly, because there is a content marker /ActualText which says that whatever we write between the BDC and EMC markers must be parsed as dd (0x64,0x64).
So to answer your question: iText does this on the same level as a lot of well-respected viewers, which all ignore the /ActualText marker. Except for Acrobat, which does respect it and overrules the ToUnicode mapping.
And to really answer your question: iText is currently looking into parsing the /ActualText marker, but it will probably take a while before it gets into an official release.

This probably has to do with how the PDF was OCR'd in the first place, rather than with how iTextSharp parses the PDF's contents. Try copy/pasting the text from the PDF into Notepad, and see if the "ds -> os" transformation still occurs. If it does, you're going to have to do the following when parsing text from this particular PDF:
Identify all occurrences of the string "os".
Decide whether or not the word of which the given "os" instance is a constituent is a valid English/German/Spanish word.
If it IS a valid word, do nothing.
If it is NOT a valid word, perform the reverse "os -> ds" transformation, and check against the dictionary in the language of your choice again.
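The steps above could be sketched like this; the dictionary here is a tiny placeholder (a real one would be a full word list), and note that replace() naively substitutes every "os" occurrence in the word:

```java
import java.util.Set;

public class OsToDsFixer {
    // Placeholder dictionary; in practice load a real word list
    // for the language of your choice.
    static final Set<String> DICTIONARY = Set.of("words", "birds", "also", "close");

    // If a word containing "os" is not in the dictionary, try the
    // reverse "os" -> "ds" substitution and keep the result only
    // when it is a known word.
    public static String fix(String word) {
        if (!word.contains("os") || DICTIONARY.contains(word)) {
            return word;  // no "os", or already a valid word: leave it alone
        }
        String candidate = word.replace("os", "ds");
        return DICTIONARY.contains(candidate) ? candidate : word;
    }

    public static void main(String[] args) {
        System.out.println(fix("woros"));  // recovered as "words"
        System.out.println(fix("also"));   // valid as-is, untouched
    }
}
```

Words where both readings are valid ("rose" vs. a hypothetical "rdse" is safe, but some pairs won't be) would still need manual review.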

Related

dragging and dropping from one JavaFX Application to another

I am trying to move an element from one JavaFX application to another via drag-and-drop; as far as I understand, this shouldn't be a problem.
So I have an object of a class and drag it from one application to the other and then have its contents printed to the console. It's mostly looking good, I can get the drop to "accepted" or "not accepted" by playing around with transfer modes, which shows me that the mechanism itself seems to be working.
But when I drop the object on the other application, a bunch of what I believe to be mostly Chinese characters are printed to the console. This is apparently some encoding problem, but I can't really figure out what's happening; aside from the fact that both applications mainly use the same codebase, the "Chinese" characters are quite numerous. The object's toString merely prints one and a half lines in Latin characters, but upon dropping, several paragraphs of "Chinese" characters are printed.
Can anyone tell me what's happening here? Is it just a simple encoding f-up? Does the OS (Win7, btw) maybe interfere here? Have I uncovered long-lost ancient Chinese wisdom?
The code itself is rather simple, here is the code from the "sender"
setOnDragDetected(event ->
{
    Dragboard db = startDragAndDrop(TransferMode.ANY);
    ClipboardContent clipboardContent = new ClipboardContent();
    clipboardContent.put(DataFormat.PLAIN_TEXT, treeElement.getEntities());
    db.setContent(clipboardContent);
    System.out.println(db.getContent(DataFormat.PLAIN_TEXT));
    event.consume();
});
and here from the "receiver"
setOnDragDropped(event ->
{
    Dragboard db = event.getDragboard();
    if (db.hasContent(DataFormat.PLAIN_TEXT))
    {
        System.out.println(db.getContent(DataFormat.PLAIN_TEXT));
        System.out.println("Accept Drop");
    }
    event.consume();
});
I just don't really see anything that would explain my error.
The issue is using DataFormat.PLAIN_TEXT. This means JavaFX considers the data format to be just what it says on the tin: text, i.e. String data. This is not really the case. There is no static member of DataFormat that refers to a suitable DataFormat, so you need to create one on your own:
final String mimeType = "application/javafx-entrylist"; // TODO: choose properly
// use existing format or introduce new one
DataFormat f = DataFormat.lookupMimeType(mimeType);
final DataFormat format = f == null ? new DataFormat(mimeType) : f;
setOnDragDetected(event -> {
    Dragboard db = startDragAndDrop(TransferMode.ANY);
    ClipboardContent clipboardContent = new ClipboardContent();
    clipboardContent.put(format, treeElement.getEntities());
    db.setContent(clipboardContent);
    System.out.println(db.getContent(format));
    event.consume();
});

setOnDragDropped(event -> {
    Dragboard db = event.getDragboard();
    if (db.hasContent(format)) {
        System.out.println(db.getContent(format));
        System.out.println("Accept Drop");
    }
    event.consume();
});

Arabic problems with converting html to PDF using ITextRenderer

I am using ITextRenderer to convert HTML to PDF. This is my code:
ByteArrayOutputStream out = new ByteArrayOutputStream();
ITextRenderer renderer = new ITextRenderer();
String inputFile = "C://Users//Administrator//Desktop//aaa2.html";
String url = new File(inputFile).toURI().toURL().toString();
renderer.setDocument(url);
renderer.getSharedContext().setReplacedElementFactory(
        new B64ImgReplacedElementFactory());
// fix the Arabic rendering problem
ITextFontResolver fontResolver = renderer.getFontResolver();
try {
    fontResolver.addFont("C://Users//Administrator//Desktop//arialuni.ttf", BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
} catch (DocumentException e) {
    e.printStackTrace();
}
renderer.layout();
OutputStream outputStream = new FileOutputStream("C://Users//Administrator//Desktop//HTMLasPDF.pdf");
renderer.createPDF(outputStream, true);
/*PdfWriter writer = renderer.getWriter();
writer.open();
writer.setRunDirection(PdfWriter.RUN_DIRECTION_RTL);
OutputStream outputStream2 = new FileOutputStream("C://Users//Administrator//Desktop//HTMLasPDFcopy.txt");
renderer.createPDF(outputStream2);*/
renderer.finishPDF();
out.flush();
out.close();
Actual PDF Result:
Expected PDF Result:
How can I make Arabic ligatures work?
If you want to do this properly (I assume using iText, since your post is tagged as such), you should use
iText7
pdfHTML (to convert HTML to PDF)
pdfCalligraph (to handle Arabic ligatures properly)
a font that supports these features (as indicated by another answer)
For an example, please consult the HTML to PDF tutorial, more specifically the following FAQ item: How to convert HTML containing Arabic/Hebrew characters to PDF?
You need fonts that contain the glyphs you need, e.g.:
public static final String[] FONTS = {
    "src/main/resources/fonts/noto/NotoSans-Regular.ttf",
    "src/main/resources/fonts/noto/NotoNaskhArabic-Regular.ttf",
    "src/main/resources/fonts/noto/NotoSansHebrew-Regular.ttf"
};
And you need a FontProvider that knows how to find these fonts in the ConverterProperties:
public void createPdf(String src, String[] fonts, String dest) throws IOException {
    ConverterProperties properties = new ConverterProperties();
    FontProvider fontProvider = new DefaultFontProvider(false, false, false);
    for (String font : fonts) {
        FontProgram fontProgram = FontProgramFactory.createFont(font);
        fontProvider.addFont(fontProgram);
    }
    properties.setFontProvider(fontProvider);
    HtmlConverter.convertToPdf(new File(src), new File(dest), properties);
}
Note that the text will come out all wrong if you don't have the pdfCalligraph add-on. That add-on didn't exist at the time Flying Saucer was created, hence you can't use Flying Saucer for converting documents with text in Arabic, Hindi, Telugu, and so on. Read the pdfCalligraph white paper if you want to know more about ligatures.
Greek characters seemed to be omitted; they didn’t show up in the document.
"In Flying Saucer the generated PDF uses some kind of default (probably Helvetica) font, that contains a very limited character set, that obviously does not contain the Greek code page." link
I changed my approach and now convert to PDF using wkhtmltopdf.

Chapters in iText 7

I'm looking to create a PDF file with chapters and subchapters with iText 7. I've found examples for previous versions of iText using the Chapter class; however, this class does not seem to be included in iText 7.
How is that functionality implemented in iText 7?
The Chapter and Section classes in iText 5 were problematic. Already with iText 5, we advised people to use PdfOutline.
For an example on how to create chapters, and more specifically, the corresponding outlines in the bookmarks panel, please take a look at the iText 7: Building Blocks tutorial. This tutorial has a recurring theme: the novel "The Strange Case of Dr. Jekyll and Mr. Hyde."
We use that text and a database with movies based on this novel to explain how iText 7 works. If you don't have the time to read it, please jump to Chapter 6.
In this chapter, we create a document that looks like this:
You can download the full sample code here: TOC_OutlinesDestinations
BufferedReader br = new BufferedReader(new FileReader(SRC));
String name, line;
Paragraph p;
boolean title = true;
int counter = 0;
PdfOutline outline = null;
while ((line = br.readLine()) != null) {
    p = new Paragraph(line);
    p.setKeepTogether(true);
    if (title) {
        name = String.format("title%02d", counter++);
        outline = createOutline(outline, pdf, line, name);
        p.setFont(bold).setFontSize(12)
            .setKeepWithNext(true)
            .setDestination(name);
        title = false;
        document.add(p);
    }
    else {
        p.setFirstLineIndent(36);
        if (line.isEmpty()) {
            p.setMarginBottom(12);
            title = true;
        }
        else {
            p.setMarginBottom(0);
        }
        document.add(p);
    }
}
In this example, we loop over a text file that contains titles and chapters. Every time we encounter a title, we create a name (title01, title02, and so on), and we use this name as the named destination for the title paragraph: setDestination(name).
We create the outlines using the PdfOutline object for which we define a named destination like this: PdfDestination.makeDestination(new PdfString(name))
public PdfOutline createOutline(PdfOutline outline, PdfDocument pdf, String title, String name) {
    if (outline == null) {
        outline = pdf.getOutlines(false);
        outline = outline.addOutline(title);
        outline.addDestination(PdfDestination.makeDestination(new PdfString(name)));
        return outline;
    }
    PdfOutline kid = outline.addOutline(title);
    kid.addDestination(PdfDestination.makeDestination(new PdfString(name)));
    return outline;
}
There are other ways to achieve this result, but using named destinations is the simplest way. Just try the example; you'll discover that most of the complexity of this example is caused by the fact that we turn a simple text file into a document with chapter titles and chapter content.

How to recognize PDF watermark and remove it using PDFBox

I'm trying to extract text, except for watermark text, from PDF files with the Apache PDFBox library, so I want to remove the watermark first; the rest is what I want. Unfortunately, neither PDMetadata nor PDXObject can recognize the watermark. Any help will be appreciated. I found the code below.
// Open PDF document
PDDocument document = null;
try {
    document = PDDocument.load(PATH_TO_YOUR_DOCUMENT);
} catch (IOException e) {
    e.printStackTrace();
}

// Get all pages and loop through them
List pages = document.getDocumentCatalog().getAllPages();
Iterator iter = pages.iterator();
while (iter.hasNext()) {
    PDPage page = (PDPage) iter.next();
    PDResources resources = page.getResources();
    Map images = null;
    // Get all images on the page
    try {
        images = resources.getImages(); // How to specify watermark instead of images??
    } catch (IOException e) {
        e.printStackTrace();
    }
    if (images != null) {
        // Check all images for metadata
        Iterator imageIter = images.keySet().iterator();
        while (imageIter.hasNext()) {
            String key = (String) imageIter.next();
            PDXObjectImage image = (PDXObjectImage) images.get(key);
            PDMetadata metadata = image.getMetadata();
            System.out.println("Found an image: analyzing for metadata");
            if (metadata == null) {
                System.out.println("No metadata found for this image.");
            } else {
                InputStream xmlInputStream = null;
                try {
                    xmlInputStream = metadata.createInputStream();
                } catch (IOException e) {
                    e.printStackTrace();
                }
                try {
                    System.out.println("--------------------------------------------------------------------------------");
                    String mystring = convertStreamToString(xmlInputStream);
                    System.out.println(mystring);
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            // Export the images
            String name = getUniqueFileName(key, image.getSuffix());
            System.out.println("Writing image: " + name);
            try {
                image.write2file(name);
            } catch (IOException e) {
                //e.printStackTrace();
            }
            System.out.println("--------------------------------------------------------------------------------");
        }
    }
}
Contrary to your assumption, there is no explicit watermark object in a PDF by which to recognize watermarks in generic PDFs.
Watermarks can be applied to a PDF page in many ways; each PDF creating library or application has its own way to add watermarks, some even offer multiple ways.
Watermarks can be
anything (Bitmap graphics, vector graphics, text, ...) drawn early in the content and, therefore, forming a background on which the rest of the content is drawn;
anything (Bitmap graphics, vector graphics, text, ...) drawn late in the content with transparency, forming a transparent overlay;
anything (Bitmap graphics, vector graphics, text, ...) drawn in the content stream of a watermark annotation which shall be used to represent graphics that shall be printed at a fixed size and position on a page, regardless of the dimensions of the printed page (cf. section 12.5.6.22 of the PDF specification ISO 32000-1).
Sometimes even mixed forms are used; have a look at this answer for an example: at the bottom you find a 'watermark' drawn above graphics but beneath text (to allow for easy reading).
The latter choice (the watermark annotation) obviously is easy to remove, but it actually also is the least often used choice, most likely because it is so easy to remove; people applying watermarks generally don't want their watermarks to get lost. Furthermore, annotations are sometimes handled incorrectly by PDF viewers, and code copying page content often ignores annotations.
If you do not handle generic documents but a specific type of documents (all generated alike), on the other hand, the very manner in which the watermarks are applied in them, probably can be recognized and an extraction routine might be feasible. If you have such a use case, please share a sample PDF for inspection.

How to encode Chinese text in QR barcodes generated with iTextSharp?

I'm trying to draw QR barcodes in a PDF file using iTextSharp. If I'm using English text the barcodes are fine, they are decoded properly, but if I'm using Chinese text, the barcode is decoded as question marks. For example this character '测' (\u6D4B) is decoded as '?'. I tried all supported character sets, but none of them helped.
What combination of parameters should I use for the QR barcode in iTextSharp in order to encode correctly Chinese text?
iText and iTextSharp apparently don't natively support this, but you can write some code to handle it on your own. The trick is to get the QR code encoder to work with an arbitrary byte array instead of a string. What's really nice is that the iTextSharp code is almost ready for this, but doesn't expose the functionality. Unfortunately, many of the required classes are sealed, so you can't just subclass them; you'll have to recreate them. You can either download the entire source and add these changes, or just create separate classes with the same names. (Please check the license to make sure you are allowed to do this.) My changes below don't have any error correction, so make sure you add that, too.
The first class that you'll need to recreate is iTextSharp.text.pdf.qrcode.BlockPair and the only change you'll need to make is to make the constructor public instead of internal. (You only need to do this if you are creating your own code and not modifying the existing code.)
The second class is iTextSharp.text.pdf.qrcode.Encoder. This is where we'll make the most changes. Add an overload to Append8BitBytes that looks like this:
static void Append8BitBytes(byte[] bytes, BitVector bits) {
    for (int i = 0; i < bytes.Length; ++i) {
        bits.AppendBits(bytes[i], 8);
    }
}
The string version of this method converts text to a byte array and then uses the above, so we're just cutting out the middleman. Next, add an overload of the static Encode method that takes in a byte array instead of a string. We then just cut out the string-detection part and force the system into byte mode; otherwise the code below is pretty much the same.
public static void Encode(byte[] bytes, ErrorCorrectionLevel ecLevel, IDictionary<EncodeHintType, Object> hints, QRCode qrCode) {
    String encoding = DEFAULT_BYTE_MODE_ENCODING;

    // Step 1: Choose the mode (encoding).
    Mode mode = Mode.BYTE;

    // Step 2: Append "bytes" into "dataBits" in appropriate encoding.
    BitVector dataBits = new BitVector();
    Append8BitBytes(bytes, dataBits);

    // Step 3: Initialize QR code that can contain "dataBits".
    int numInputBytes = dataBits.SizeInBytes();
    InitQRCode(numInputBytes, ecLevel, mode, qrCode);

    // Step 4: Build another bit vector that contains header and data.
    BitVector headerAndDataBits = new BitVector();

    // Step 4.5: Append ECI message if applicable.
    if (mode == Mode.BYTE && !DEFAULT_BYTE_MODE_ENCODING.Equals(encoding)) {
        CharacterSetECI eci = CharacterSetECI.GetCharacterSetECIByName(encoding);
        if (eci != null) {
            AppendECI(eci, headerAndDataBits);
        }
    }
    AppendModeInfo(mode, headerAndDataBits);
    int numLetters = dataBits.SizeInBytes();
    AppendLengthInfo(numLetters, qrCode.GetVersion(), mode, headerAndDataBits);
    headerAndDataBits.AppendBitVector(dataBits);

    // Step 5: Terminate the bits properly.
    TerminateBits(qrCode.GetNumDataBytes(), headerAndDataBits);

    // Step 6: Interleave data bits with error correction code.
    BitVector finalBits = new BitVector();
    InterleaveWithECBytes(headerAndDataBits, qrCode.GetNumTotalBytes(), qrCode.GetNumDataBytes(),
        qrCode.GetNumRSBlocks(), finalBits);

    // Step 7: Choose the mask pattern and set it on "qrCode".
    ByteMatrix matrix = new ByteMatrix(qrCode.GetMatrixWidth(), qrCode.GetMatrixWidth());
    qrCode.SetMaskPattern(ChooseMaskPattern(finalBits, qrCode.GetECLevel(), qrCode.GetVersion(),
        matrix));

    // Step 8: Build the matrix and set it on "qrCode".
    MatrixUtil.BuildMatrix(finalBits, qrCode.GetECLevel(), qrCode.GetVersion(),
        qrCode.GetMaskPattern(), matrix);
    qrCode.SetMatrix(matrix);

    // Step 9: Make sure we have a valid QR code.
    if (!qrCode.IsValid()) {
        throw new WriterException("Invalid QR code: " + qrCode.ToString());
    }
}
The third class is iTextSharp.text.pdf.qrcode.QRCodeWriter, and once again we just need to add an overloaded Encode method that supports a byte array and calls our new Encode overload created above:
public ByteMatrix Encode(byte[] bytes, int width, int height, IDictionary<EncodeHintType, Object> hints) {
    ErrorCorrectionLevel errorCorrectionLevel = ErrorCorrectionLevel.L;
    if (hints != null && hints.ContainsKey(EncodeHintType.ERROR_CORRECTION))
        errorCorrectionLevel = (ErrorCorrectionLevel) hints[EncodeHintType.ERROR_CORRECTION];
    QRCode code = new QRCode();
    Encoder.Encode(bytes, errorCorrectionLevel, hints, code);
    return RenderResult(code, width, height);
}
The last class is iTextSharp.text.pdf.BarcodeQRCode, to which we once again add a new constructor overload:
public BarcodeQRCode(byte[] bytes, int width, int height, IDictionary<EncodeHintType, Object> hints) {
    newCode.QRCodeWriter qc = new newCode.QRCodeWriter();
    bm = qc.Encode(bytes, width, height, hints);
}
The last trick is to make sure, when calling this, that you include the byte order mark (BOM) so that decoders know to decode the data properly, in this case as UTF-8.
//Create an encoder that supports outputting a BOM
System.Text.Encoding enc = new System.Text.UTF8Encoding(true, true);
//Get the BOM
byte[] bom = enc.GetPreamble();
//Get the raw bytes for the string
byte[] bytes = enc.GetBytes("测");
//Combine the byte arrays
byte[] final = new byte[bom.Length + bytes.Length];
System.Buffer.BlockCopy(bom, 0, final, 0, bom.Length);
System.Buffer.BlockCopy(bytes, 0, final, bom.Length, bytes.Length);
//Create a barcode using our new constructor
var q = new BarcodeQRCode(final, 100, 100, null);
//Add it to the document
doc.Add(q.GetImage());
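For reference, the BOM-prepending step looks like this in plain Java (stdlib only). This is a hypothetical port of just the byte handling; the Java iText BarcodeQRCode API differs from the iTextSharp one, so only the data preparation is shown:

```java
import java.nio.charset.StandardCharsets;

public class BomDemo {
    // Build the UTF-8 bytes for the given text, prefixed with the
    // UTF-8 BOM (EF BB BF) -- the same data the C# snippet above
    // hands to its byte-array BarcodeQRCode constructor.
    public static byte[] withBom(String text) {
        byte[] bom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        byte[] result = new byte[bom.length + bytes.length];
        System.arraycopy(bom, 0, result, 0, bom.length);
        System.arraycopy(bytes, 0, result, bom.length, bytes.length);
        return result;
    }

    public static void main(String[] args) {
        // "\u6D4B" ('测') encodes as E6 B5 8B in UTF-8, so the
        // payload becomes EF BB BF E6 B5 8B.
        for (byte b : withBom("\u6D4B")) System.out.printf("%02X ", b);
        System.out.println();
    }
}
```
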
Looks like you may be out of luck. I tried too and got the same results as you did. Then I looked at the Java API:
"CHARACTER_SET: the values are strings and can be Cp437, Shift_JIS and ISO-8859-1 to ISO-8859-16. The default value is ISO-8859-1."
Lastly, I looked at the iTextSharp BarcodeQRCode class source code to confirm that only those character sets are supported. I'm by no means an authority on Unicode or encoding, but according to ISO/IEC 8859, the character sets above won't work for Chinese.
Essentially the same trick that Chris has done in his answer could be implemented by specifying the UTF-8 charset in the barcode hints.
var hints = new Dictionary<EncodeHintType, Object>() {{EncodeHintType.CHARACTER_SET, "UTF-8"}};
var q = new BarcodeQRCode("\u6D4B", 100, 100, hints);
If you want to be more safe, you can start your string with BOM character '\uFEFF', like Chris suggested, so it would be "\uFEFF\u6D4B".
UTF-8 is unfortunately not supported by the QR code specification, and there are a lot of discussions on this subject, but the fact is that most QR code readers will correctly read a code created by this method.