Japanese characters in a latex \section{} cause an error - unicode

I am trying to create Japanese documents with LaTeX. I have installed the latest version of TeX Live 2008, which includes the CJK package.
In my document I have the following:
\documentclass{class}
\usepackage{CJK}
\begin{document}
\begin{CJK*}{UTF8}{min}
\title{[Japanese Characters here 1]}
\maketitle
\section{[Japanese Characters here 2]}
[Japanese Characters here 3]
\end{CJK*}
\end{document}
In the above code there are 3 places where Japanese characters are used.
Places 1 and 3 work fine, whereas place 2, which puts Japanese characters inside a \section{}, fails with the following error.
! Argument of \@sect has an extra }.
After some research, it turns out that this error appears when you put a fragile command inside a moving argument. The argument of \section is a moving argument because the title can be moved elsewhere, for example to the contents page.
Does anyone know how to get this to work, and why LaTeX thinks Japanese characters are "fragile"?

Sorry to post this as an answer rather than a comment to your answer; I don't have enough rep yet to comment. (EDIT: Now I have enough rep to comment, but I'm not sorry anymore. Thanks Will.)
Your solution of replacing
\section{[Japanese Text]}
with
\section{\texorpdfstring{[Japanese Text]}{}}
suggests that you're using the hyperref package. When you use the hyperref package, any sort of not-totally-boring text (e.g. math) within \section causes a problem because \section is having trouble generating pdf bookmarks. \texorpdfstring allows you to specify how you want the section title to appear in the pdf bookmark. For example, I might write
\section{Calculation of \texorpdfstring{$H_2(\mathcal{X})$}{H\_2(X)}}
if I want the section title to be "Calculation of $H_2(\mathcal{X})$" but I want the pdf bookmark to be "Calculation of H_2(X)".

You should probably use XeTeX/XeLaTeX, as it was created to support Unicode. Switching is sometimes not easy for existing documents, though. (XeLaTeX should be included in TeX Live; it is just a different executable to call. That is how it is done in Debian.)

I have managed to get this working now!
Using Latex and CJK as before.
\section{[Japanese Text]}
was replaced with
\section{\texorpdfstring{[Japanese Text]}{}}
Now the contents pages and section titles work and update fine.


Show special characters in VS Code

I have a problem with my VS Code. When trying to modify a file that contains special characters like "á", "ñ", "ó" etc., the special characters are replaced with a question mark. (See image below.)
This can usually be solved easily from the bottom bar of Visual Studio Code by changing the encoding to "Windows 1252", and at first that worked for me. But now, even if I switch to that encoding, the question marks are still there.
The files that you opened before you changed the encoding have been automatically overwritten, and the original characters were replaced with the "unknown character" replacement character.
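To illustrate why switching the encoding afterwards no longer helps, here is a minimal Python sketch. It assumes the file was originally stored as Windows-1252 and was opened as UTF-8; the sample word is made up for the demonstration and nothing here is specific to VS Code.

```python
# Sketch: how a wrong-encoding open followed by a save destroys the original bytes.
# Assumption: the file on disk was Windows-1252 and the editor read it as UTF-8.

original = "año"                      # hypothetical file content
on_disk = original.encode("cp1252")   # 0xF1 is "ñ" in Windows-1252

# Opening as UTF-8: the lone byte 0xF1 is not valid UTF-8, so it is shown
# as U+FFFD, the replacement character.
misread = on_disk.decode("utf-8", errors="replace")

# Saving in that state writes the replacement character itself to disk.
saved = misread.encode("utf-8")

# Reopening as Windows-1252 cannot bring "ñ" back: the byte 0xF1 is gone.
reopened = saved.decode("cp1252")

print(repr(misread))
print(saved)
print(repr(reopened))
```

If the file was never saved while it was misread, reopening it with the right encoding should still work; otherwise the text probably has to come back from a backup or version control.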

Correct syntax for newline in Github Bio

Here is an example on my github profile - https://github.com/jack17529
I want to change this -
Silver Bullet in Issue KILLING.____
Master Mind to create Issues.______
My strongest language is Python not English.
I want to have newline instead of blanks.
like this -
Silver Bullet in Issue KILLING.
Master Mind to create Issues.
My strongest language is Python not English.
I have checked Bitbucket Bio is nowhere related to Github Bio.
Maybe they don't allow us to do it the normal way, but it is of course possible. We can exploit the automatic line-wrapping rule, which moves a word that is too long for the current line onto the next one. All we need to do is use other Unicode spaces instead of normal spaces within each line, and a normal space between lines, so that the wrapping rule works against the no-newline rule.
And if you want a free line, because of the character limitations, you can use the longer one;
" " instead of " "      (Try selecting spaces between quotes with your mouse)
This trick also lets me create extra spaces on Stack Overflow, as in the quote box above.
Here is the result: github.com/cosmicog:
I have tried other answers, html ways, but no, they handle html tricks of course.
Note: This causes a bad look in the list view and the profile overview tooltip:
Maybe that is why it is not allowed but I hope they will fix this in the future.
As told to me by github support there is no way !, see here -
According to Github Support
I just did it by simply copying and pasting the character corresponding to this codepoint | unicode-table.com | as many time as needed in order to align the text the way I wanted.
This is the procedure I followed: at the end of each line I pressed Enter, then I filled the new line with 7 instances of the character mentioned above; then I pressed Enter again and started the new line with its text.
This question is a little stale, but I found it before I solved this myself, so I thought I'd drop my solution.
The bio doesn't appear to honor markdown, but neither does it accept HTML entities or elements. I worked around this with non-breaking characters to create long "words" similar to how you've used "_".
You can see in my bio that I needed a " " and a "‑" to format mine. The long word will pop to the next line. If you have a real short line, you can extend it with a lot of non-breaking spaces, but this probably isn't necessary. Since you cannot enter " " you need to use copy/paste or ALT codes (not looked up, but someone might add these for you). Those are the real characters above, so you can take them from this answer.
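The non-breaking-space trick described in these answers can be sketched in Python. This is only an illustration: the helper name `glue_lines` is made up, and whether the result wraps where you want depends on how wide GitHub renders the bio, which the code cannot control.

```python
# Sketch: turn each intended line into one long unbreakable "word" by
# replacing its normal spaces with U+00A0 (no-break space), then join the
# lines with a normal space so the renderer may only wrap between them.

NBSP = "\u00a0"

LINES = [
    "Silver Bullet in Issue KILLING.",
    "Master Mind to create Issues.",
    "My strongest language is Python not English.",
]


def glue_lines(lines):
    """Return a single bio string whose only normal spaces separate lines."""
    return " ".join(line.replace(" ", NBSP) for line in lines)


bio = glue_lines(LINES)
print(bio)  # paste the printed text into the bio field
```

A short line may still share a row with the start of the next one unless it is padded with more non-breaking spaces, as the answers above point out.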
Refer: How to create newline in Github Bio
Just use   in HTML editor mode to new line is OK, This is my GitHub Bio

How do you display a section of plain text in GitHub markdown?

I'm having a hard time finding any real answer to this (really simple?) question on Google, and I'm starting to worry that there is no solution.
I am learning GitHub markdown. I would like to show some example code that contains a fake email address like user@example.com. But GitHub insists on auto-linking this text. I also have a large chunk of text that has many special characters.
Is there a way to escape blocks or sections so that no special characters are processed, and no auto-links are generated?
Wrap the block in backticks:
```text
code();
address@domain.example
```
You can wrap such text in pre tags.
<pre>Text I want left alone@donotlinkme.example</pre>
I just tested this out on github.
This is all part of the kramdown syntax. The last link shows every GitHub markdown trick.
So this will work also:
~~~text
code();
address@domain.example
~~~

Decoding Korean text files from the 90s

I have a collection of .html files created in the mid-90s, which include a significant amount of Korean text. The HTML lacks character set metadata, so of course none of the Korean text renders properly now. The following examples all use the same excerpt of text.
In text editors such as Coda and Text Wrangler the text displays as
╙╦ ╝№бя└К ▓щ╥НВь╕цль▒Ф ▓щ╥НВь╕цль▒Ф
Which, in the absence of character set metadata in <head>, is rendered by the browser as:
ÓË ¼ü¡ïÀŠ ²éÒ‚ì¸æ«ì±” ²éÒ‚ì¸æ«ì±”
Adding EUC-KR metadata to <head>:
<meta http-equiv="Content-Type" content="text/html; charset=euc-kr">
Yields the following, which is illegible nonsense (verified by a native speaker):
沓 숩∽핅 꿴�귥멩レ콛 꿴�귥멩レ콛
I have tried this approach with all historic Korean character sets, each yielding similarly unsuccessful results. I also tried parsing and upgrading to UTF-8, via Beautiful Soup, which also failed.
Viewing the files in Emacs seems promising, as it reveals the text encoding at a lower level. The following is the same sample of text:
\323\313 \274\374\241\357\300\212
\262\351\322\215\202\354\270\346\253\354\261\224 \262\351\322\215\202\354\270\346\253\354\261\224
How can I identify this text encoding and promote it to UTF-8?
All of those octal codes that Emacs revealed are less than 254 (or \376 in octal), so it looks like one of those old pre-Unicode fonts that just used its own mapping in the ASCII range. If this is right, you'll just have to try to figure out what font it was intended for, find it, and perhaps do the conversion yourself.
It's a pain. Many years ago I did something similar for some popular pre-Unicode Greek fonts: http://litot.es/unicode-converter/ (the code: https://github.com/seanredmond/Encoding-Converter)
In the end, it is about finding the correct character encoding and using iconv.
iconv --list
displays all available encodings. Grepping for "KR" reveals at least my system can do CSEUCKR, CSISO2022KR, EUC-KR, ISO-2022-KR and ISO646-KR. Korean is also BIG5HKSCS, CSKSC5636 and KSC5636 according to Wikipedia. Try them all until something reasonable pops out.
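The "try them all" step can also be sketched in Python instead of looping over iconv by hand. The candidate list below is an assumption (the Korean codecs that ship with CPython), and the helper name `try_encodings` is made up for the example. If the files really use a font-specific mapping, as suggested above, none of the candidates will produce readable text.

```python
# Sketch: decode the same bytes with several candidate encodings and print
# the results so a reader of Korean can judge which, if any, looks right.

CANDIDATES = ["euc_kr", "cp949", "johab", "iso2022_kr"]


def try_encodings(data, candidates=CANDIDATES):
    """Map each codec name to the decoded text, or None if decoding fails."""
    results = {}
    for name in candidates:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


# First two words of the excerpt, taken from the octal codes Emacs displayed.
sample = bytes([0o323, 0o313, 0x20, 0o274, 0o374, 0o241, 0o357, 0o300, 0o212])

for codec, text in try_encodings(sample).items():
    print(codec, "->", repr(text))
```

Once a codec gives sensible output, `data.decode(codec).encode("utf-8")` performs the conversion to UTF-8.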
Even though this thread is old, it's still an issue. I have not found a way to convert the files in bulk (other than using a Korean version of Windows 7), so for now I'm using Naver, which has a cloud service similar to Google Docs. If you upload those weirdly encoded files there, it handles them very well. I just edit and copy the text, and it's back to a standard encoding when I paste it elsewhere.
Not the kind of solution I like, but it might save a few passers-by.
By the way, you can register for the cloud account with an ID even if you do not live in South Korea; there's enough English to get by.

JMeter CSV Data Set is corrupting Japanese strings stored as proper UTF-8, I get Question Marks instead

I read in search terms from a simple text file to send to a search engine.
It works fine in English, but gives me ???? for any Japanese text.
Text with mixed English and Japanese does show the English text, so I know it's reading it.
What I'm seeing:
Input text:
Snow Leopard をインストールする場合、新しい
Turns into:
Snow Leopard ???????????????
This is in the POST field of an HTTP request.
If I set JMeter to encode the data, it just puts in the percent sequence for question marks.
About the Data:
The CSV file is very simple in structure.
There's only one field / one column, which I name TERM, and later use as ${TERM}
I don't really need full CSV because it's only one string per line.
There are no commas or quotes.
It's UTF-8 and when I run the Unix "file" command on the file, it says UTF-8 text.
I've also verified UTF-8 in command line and graphical mode on two machines.
Interesting note:
An interesting coincidence that I noticed: if there are 15 Japanese characters then I get 15 question marks, so at some point it's being seen as full characters and not just bytes.
JMeter CSV Dataset Config:
Filename: japanese-searches.csv
File encoding: UTF-8 (also tried without)
Variable names: TERM
Delimiter: ,
Allow Quoted Data: False (I also tried True, different, but still wrong)
Recycle at EOF: True
Stop at EOF: False
Sharing mode: All threads
A few things I've tried:
- Tried Allow quoted Data. It changed to other strange characters.
- Added -Dfile.encoding=UTF-8
- Tried encoding the POST stage, but it just turned into a bunch of %nn for question marks
And I'm not sure how to "debug" just after each line of the CSV is read in. I think it's corrupted right away, but I'm not sure.
If it's only mangled when I reference it, then instead of ${TERM} perhaps there's some other "to bytes" function call. I'll start checking into that. I haven't done anything with the JMeter functions yet.
Edited Dec 24:
Tweaks:
Changed formatting and added bullet points for more clarity.
Clarified that the file is UTF-8, and have verified that.
A new theory:
Is it possible that the Japanese characters are making it through, and the issue is that EVERY SINGLE place that shows them maps them to a "?" at DISPLAY TIME only. So even though I've checked in a bunch of places, they all have a display issue just in the UI?
Is there a way in JMeter to see the numeric value of a character or string? Actually, to tell JMeter to display the list of Unicode code points?
I'll look at my last log files... although I suppose even the server logs could have mis-mapped the characters.
Also, perhaps when doing variable expansion inside of the text field that I POST, where I reference the ${TERM}, maybe at that point it also maps to question marks, but that the corruption happens at that later point. If that happened, AND it was mis-displayed in the UI, then it might lead to a false conclusion.
What I'd really like to do is pause JMeter after the first CSV record, just after that line is loaded, and look at it with a "data scope" or byte editor or something. Not sure if this is possible.
Found the issue, there was another place the UTF-8 had to be specified.
In the HTTP Request, to the right of the Method, you have to also set Content Encoding to UTF-8
Yes, in hindsight, this seems obvious, but there were a number of reasons I didn't think this was needed. Some of my incorrect assumptions might be helpful for others who are debugging, so here goes - I would have thought that:
1: Once text has made it into Java as Unicode, it stays as Unicode, and goes in and out by UTF-8. Obviously not in this case.
2: I sort of thought HTTP defaulted to UTF-8 unless you say otherwise, but maybe I'm just used to XML. It's probably not good practice to assume that: maybe HTTP defaults to ISO-Latin1 or something, and even if there's a spec, maybe folks don't follow it.
3: And if I don't specify it, I'd think the "do no harm" approach would be to pass the characters on, and let the receiver on the other end deal with it. Wrong again!
(OK, so points 1, 2 and 3 overlap a bit)
4: Even though my HTTP Request was a POST, I did still try the Encode checkbox. I certainly thought that would have encoded it, but all I got was the repeating % hex for question marks, so it seemed to me that the data was already corrupted at that point. Wrong again. I suspect that WITHIN the HTTP phase there are TWO character transitions: first from Unicode to whatever encoding it thinks you have, and THEN a second encoding into the % signs, and my data was mis-encoded at the first step.
5: And I would have thought JMeter would say something or warn, but from my reading, apparently it's not helpful in that respect. You can do logging or whatever.
And the "?" is Java's way of reporting an encoding problem BY default; this started in the Java 1.4x timeframe. In my own Java code I prefer to have encoding errors reported as an exception, but again, that's not the default, and not what JMeter does.
So I learned my lesson.
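The two-step conversion suspected in point 4 can be imitated in Python. This is not JMeter's actual code, only a sketch of the same mechanism using `urllib.parse.quote`, with ISO-8859-1 assumed as the fallback encoding that gets used when no Content Encoding is set.

```python
# Sketch: percent-encoding happens AFTER the text is turned into bytes, so a
# wrong charset in step one yields "%3F" (an encoded "?") in step two.
from urllib.parse import quote

term = "を"  # first Japanese character of the sample search term

# Step one uses UTF-8: the three UTF-8 bytes are percent-encoded.
utf8_form = quote(term, encoding="utf-8")

# Step one uses ISO-8859-1 (assumed default): the character cannot be
# represented, becomes "?", and only then is percent-encoded.
latin1_form = quote(term, encoding="iso-8859-1", errors="replace")

print(utf8_form)    # %E3%82%92
print(latin1_form)  # %3F
```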
The HINT that the Unicode was at least starting out OK was that the number of question marks equaled the number of Japanese characters, instead of having 2 or 3 times as many question marks. If the length of "???" matches your Japanese (or Chinese) string, then Java DID see actual Unicode characters at some point along the journey. Whereas if you see 3 times as many ?'s as input text, then Java always saw them as bytes or ints or whatever, and NEVER as valid codepoints.
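That counting hint can be checked with a small Python sketch, again only mimicking the behaviour rather than reproducing JMeter. Note that Python marks undecodable bytes with U+FFFD where other tools may print "?", and ISO-8859-1 is an assumed stand-in for the wrong encoding.

```python
# Sketch: compare corruption at the character level with corruption at the
# byte level by counting the substitution marks each one produces.

text = "Snow Leopard をインストールする場合、新しい"

# Character level: real Unicode characters are squeezed into ISO-8859-1,
# so each unrepresentable character becomes exactly one "?".
char_level = text.encode("iso-8859-1", errors="replace").decode("iso-8859-1")

# Byte level: the UTF-8 bytes are never understood as characters, so each of
# the three bytes per Japanese character is flagged separately.
byte_level = text.encode("utf-8").decode("ascii", errors="replace")

print(char_level)
print(char_level.count("?"), byte_level.count("\ufffd"))  # 15 45
```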
I came across this topic when searching for a way to use parameters from a CSV file in which some columns were written in Hebrew.
I used Excel 2007 to create 1000 lines of data for user registrations. The first and last names had to be in Hebrew.
I exported the file as a "Unicode Text" file. It became tab-delimited.
"Unicode Text" saves in UTF-16 LE (Little Endian), not in UTF-8. That is important.
I opened the result in Notepad++. I could see the Hebrew letters properly. Notepad++ has an "Encoding" menu, where you can check the encoding or change it. So I changed it from Little Endian to UTF-8.
Then I replaced the tabs with commas (I just selected a tab and pasted it into the Find box).
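The same two manual steps (re-encode to UTF-8, then tabs to commas) could be scripted. The Python sketch below is one way to do it under the assumption that the export starts with a byte-order mark, which the "utf-16" codec relies on; the function name is made up for the example.

```python
# Sketch: convert a UTF-16, tab-delimited export into UTF-8, comma-delimited
# CSV. Works on bytes so it can be used with any file-reading code.
import csv
import io


def utf16_tsv_to_utf8_csv(raw):
    """Return UTF-8 CSV bytes for UTF-16 tab-separated input bytes."""
    text = raw.decode("utf-16")  # assumption: a BOM tells us the byte order
    rows = csv.reader(io.StringIO(text, newline=""), delimiter="\t")
    out = io.StringIO(newline="")
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue().encode("utf-8")


demo = "first\tlast\r\nדנה\tכהן\r\n".encode("utf-16")
print(utf16_tsv_to_utf8_csv(demo).decode("utf-8"))
```

Unlike a plain find-and-replace, the csv module quotes any field that itself contains a comma, so "Allow quoted data" would need to be enabled in JMeter if such fields exist.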
The parameters were substituted ok, but after running the script I saw the following:
In the "View Results Tree" listener I opened the "Result" tab of the "Http Request".
The parameters were substituted, but the HTTP view tab (on the bottom) of the Request showed me some gibberish.
But when I looked at the Raw view, I saw that the request parameters actually contained strings like %D7%A9%D7%A8%D7%9E%D7%95%D7%98%D7%94 that, when taken in pairs (%D7 %A9), corresponded properly to Hebrew letters.
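One way to confirm that such a percent-encoded string really is valid UTF-8 Hebrew, rather than judging by eye, is to decode it. This is a small Python sketch using the string from the Raw view; it is independent of JMeter.

```python
# Sketch: decode the percent-encoded parameter and list its code points.
from urllib.parse import quote, unquote

raw_param = "%D7%A9%D7%A8%D7%9E%D7%95%D7%98%D7%94"

decoded = unquote(raw_param, encoding="utf-8")
code_points = [f"U+{ord(ch):04X}" for ch in decoded]

# Each pair of bytes maps to one character in the Hebrew block (U+0590-U+05FF).
print(code_points)

# Re-encoding gives back the original string, so nothing was lost on the wire.
print(quote(decoded, encoding="utf-8") == raw_param)
```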
To my mind, JMeter has a bug and cannot properly display the Unicode characters. But it sends (POSTs) them out OK.
Hope I am right and hope it will help someone.
You can try using "SHIFT-JIS" as the Content encoding (it's next to the Method selection). Then you should uncheck "Encode?" for the parameter that includes Japanese.
Hope it works for you.