How to convert a TextMate Grammar (XML flavor) to either YAML or JSON flavor - visual-studio-code

TextMate grammars (.tmLanguage files) are sometimes expressed in XML format.
I would like to convert them to a more readable format (i.e. JSON or YAML) to integrate into a VS Code syntax highlighting extension.
To clarify what I mean, here are a few examples:
XML format
YAML format (equivalent to the previous one)
JSON format
I could write a script in Python to do that, but it would save me some time if such a converter already exists.
Thanks

The TextMate Languages extension has some commands built in for this (see the screenshot in its readme).
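If you prefer to script the conversion yourself, as the question suggests, here is a minimal Python sketch. It assumes the XML flavor is a plain XML property list; the file names and the use of the third-party PyYAML package for the YAML flavor are just illustrative choices, not part of the extension above.

import json
import plistlib

import yaml  # third-party: PyYAML, only needed for the YAML flavor

# The XML flavor of a .tmLanguage file is a property list; plistlib reads it
# into a plain dict (the file name is just an example).
with open("syntax.tmLanguage", "rb") as f:
    grammar = plistlib.load(f)

# JSON flavor, as used by VS Code syntax extensions.
with open("syntax.tmLanguage.json", "w", encoding="utf-8") as f:
    json.dump(grammar, f, indent=2, ensure_ascii=False)

# YAML flavor.
with open("syntax.tmLanguage.yaml", "w", encoding="utf-8") as f:
    yaml.safe_dump(grammar, f, sort_keys=False, allow_unicode=True)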

Related

How to programmatically get the fragments called in an RTFtemplate?

I need to programmatically find the fragments that are called by each rtftemplate.
So, for example in the figure, I would need to get the "GlossaryTermsAcronyms" fragment for the H2_terms_acronyms template.
I can't seem to find any query or script solution to do this. But this should be possible, right?
Unfortunately that is (almost) impossible.
The information is stored in the t_documents.bincontent column. It is binary encoded RTF.
Somewhere in that RTF there should be a reference to the template fragments that are used.
If you can figure out how to decode the bincontent to get to the actual RTF code of your template, you might have a chance.
Binary fields in EA are usually stored as a zipped text file.
If the field is included in an XML file (or an XML string in the database), it will additionally be base64 encoded.
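To illustrate that last point, here is a rough Python sketch of what decoding such a field could look like. The input file name is hypothetical, and the use of zlib is an assumption based on the "zipped" storage described above; it may well need adjusting for your EA version.

import base64
import zlib

# Hypothetical input: the base64 text of the bincontent field, as it would
# appear inside an exported XML file.
with open("bincontent_base64.txt", "r", encoding="ascii") as f:
    encoded = f.read()

raw = base64.b64decode(encoded)

# Assumption: the payload is zlib/deflate compressed, per the "zipped text
# file" storage mentioned above.
rtf = zlib.decompress(raw).decode("latin-1")

# Search the raw RTF for a fragment name you expect it to reference.
print("GlossaryTermsAcronyms" in rtf)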

Specifying file encoding while reading a file with sys.io.File.read in Haxe

I know how to read a file with Haxe by using sys.io.File.read (compare Reading lines from a file in Haxe; I also know that the sys module is not available for every target). However, how can I tell sys.io.File.read that my text file is encoded with a certain encoding (e.g. UTF-16, UTF-8, ISO-8859-1, ...)?
There is no way to do this at the File level, but you can encode / decode the String after reading the file. For instance, Utf8.encode() will convert an ISO-8859-1 string to a UTF-8 string:
// Read the ISO-8859-1 encoded file as a raw string.
var isoString = sys.io.File.getContent("iso_file.txt");
// Re-encode it as UTF-8 and write it back out.
var utf8String = haxe.Utf8.encode(isoString);
sys.io.File.saveContent("utf8_file.txt", utf8String);
The standard library currently doesn't support UTF-16, but it's coming in Haxe 4. In the meantime, you can use libraries such as unifill for that.
By the way, if you don't need to read the file line by line, File.getContent() is much more convenient than the File.read() approach you linked.

Script to convert dokuwiki syntax to mediawiki

Does someone know of a script to convert a file containing text formatted with DokuWiki syntax into text formatted with MediaWiki syntax?
I don't know of one, but (apparently; I have not used it) with this you can convert a DokuWiki document to Markdown, and then with Pandoc you can convert that to MediaWiki syntax.
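As a sketch of that second step, here is a minimal Python snippet driving Pandoc; Pandoc itself must be installed, and the file names are placeholders.

import subprocess

# Convert the intermediate Markdown file to MediaWiki syntax with Pandoc.
subprocess.run(
    ["pandoc", "-f", "markdown", "-t", "mediawiki", "page.md", "-o", "page.wiki"],
    check=True,
)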

Does Apache Tika do character set conversion?

I'm using org.apache.tika.Tika.parseToString() to convert documents into plain text (i.e., unformatted text) files. My application potentially needs to convert documents that don't use a Unicode character set. For instance, some documents may be encoded in the Chinese GB2312 character set. It would be great if Tika re-coded the output into UTF-8. This would require Tika to reference a mapping between many different character sets and Unicode in order to convert the characters.
Does Tika convert the non-Unicode character set text into Unicode as the output of parseToString()? There are a lot of character sets out there so I would be impressed if Tika did this for more than a few character sets.
Update: I was able to create a couple different files with some non-Latin charsets (GB2312 (Chinese) and KOI8-R (Russian)). Tika.parseToString() couldn't even detect the charset or encoding. I opened an issue on the Tika bug tracker here: https://issues.apache.org/jira/browse/TIKA-1262
When talking about character sets in Apache Tika, you need to consider two kinds of files differently. One kind is basically just plain text; the other is the more complex types (including binary ones).
With the more complex files, Tika mostly uses third-party libraries, and these libraries are responsible for returning Java Strings. The exact way of doing that will depend on the file format in question - sometimes the file format will include encoding information, other times it'll be fixed in what it supports. Either way, Tika gets Java Strings, and returns to you a Java String. How you choose to encode that for output is up to you. (For Windows users especially, check the encoding of your terminal, and the font used. There've been lots of "Tika Encoding Problems" which were actually people failing to correctly set the default Java encoding on output, or failing to have a Unicode-capable terminal!)
With plain text files, there's no encoding information in the file; all we have is a bunch of bytes. Here, Apache Tika uses one of a number of EncodingDetector instances to do the detection. These use hints, n-grams, language detection etc. to try to work out the most likely encoding of the file, based on the information given, the pattern of bytes in the file, etc.
The definition of EncodingDetector is held in the Tika-Core jar, but most of the implementations are held in the Tika-Parsers jar (and loaded by the service loader method, just like Detectors and Parsers). The main ones are here in SVN. If you check there, you'll see the main list of encodings that Tika can detect.
One final thing - encoding detection is only performed on files that are text files; it isn't done on the binary-type files. Depending on how you call Tika, you might need to tweak that and/or provide a hint that it's a text file, so that the EncodingDetector logic gets triggered.
This answer actually comes from a JIRA user on the Tika project. https://issues.apache.org/jira/browse/TIKA-1262
It turns out that if you tell Tika that the file extension is '.txt', it will treat the file as plain text, attempt to detect the encoding, and convert it to Unicode.
An easy way to do this is to pass an empty Metadata object to TikaInputStream.get(). This will fill out the resourceName field of the Metadata object. Then pass this object to parseToString(). With the resourceName field set to a file name that ends with .txt, the parser knows to treat this file as plain text and will do encoding detection to try to discover how to decode the file. The string returned from parseToString() is a Java UTF-16 String object. When written to a file you can see that it is Unicode and uses the UCS charset.
import java.io.File;
import org.apache.tika.Tika;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;

Tika tika = new Tika();
Metadata metadata = new Metadata();
TikaInputStream reader = TikaInputStream.get(new File(filepath), metadata);
String contents = tika.parseToString(reader, metadata);
So far this has worked for text files using either GB2312/GB18030 or KOI8-R. This is the expected behavior and it's perfect! I don't know what other charsets/encodings it can handle.

Correct file name extension for MIME 'text/enriched' file format?

Emacs offers the ability to use the MIME standard text/enriched for writing enriched text. What is the canonical file name extension for this format? Emacs seems to think it's .doc (see $EMACSDIR/24.1/etc/enriched.doc), but this could be confused with the more common Microsoft Word .doc format. Is there an alternative?
(I know it's not .rtf, which is a format different from both Microsoft's .doc and MIME's text/enriched formats)
EDIT:
It seems that .etf and .txte are some accepted extensions for this file type.
text/enriched is also known as Enriched Text Format or .etf.
http://users.starpower.net/ksimler/eudora/etf.html
http://filext.com/file-extension/ETF
Due to its intended use as an inline mail format, I don't think there is a standard file extension for it. .txte seems to be another extension that is used for it though, which does not clash with other well-known formats.