Using eyeD3 to change the encoding of MP3 tags

I found this question which is my exact starting point: Chinese-encoded metadata on mp3 files. I want to re-encode all my metadata as utf-8 so that Banshee can read it.
I can't figure out how to get eyeD3 to do that. I can decode individual tags as described in that link, but I can't make eyeD3 change the text encoding of the mp3 file itself, so that the tags can be rewritten in the proper encoding. I tried reading all the data into variables (below, 't' is the properly encoded title), then calling:
tag.clear()
tag.update(eyeD3.ID3_V2_4)
tag.setTitle(t)
That tells me: ValueError: ID3 vNone.None is not supported. Not what I was expecting.
I tried tag.setTextEncoding('utf-8'), but that tells me eyeD3.tag.TagException: Invalid encoding. All the other encodings I try give me the same error message.
eyeD3.TAGS2_2_TO_TAGS_2_3_AND_4 looks promising, but it's a dictionary of cryptic letter codes that mean nothing to me.
Can someone tell me how to change the version of the tags to something that supports utf-8, then change the file encoding to utf-8 and write the metadata back in?

Looks like somebody's already created something that does this:
http://code.google.com/p/id3-to-unicode/
It's pretty easy to use. Just download the latest version of the script from the website, make sure you have the eyeD3 and chardet Python modules installed (a quick sudo apt-get install python-eyed3 python-chardet did the trick for me on Ubuntu), and run the script with the -h flag to see how to use it.
My only complaint is that the script assumes that your music is organized like artist/album/01 track name.mp3, and uses path/file information to fill in missing tags. I disabled this in the latest version (http://id3-to-unicode.googlecode.com/files/id3_to_unicode_1.1.py) by commenting out lines 126-138.

Eric Abrahamsen figured out that setting the text encoding should look like
tag.setTextEncoding(eyeD3.UTF_8_ENCODING) instead of
tag.setTextEncoding('utf-8').
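Putting the pieces together, here is a minimal sketch of a full re-encoding pass with the old eyeD3 0.6 API used in the question. The file name and the GBK source encoding are assumptions, and the latin-1 round trip comes from the linked question; tag.clear() is deliberately skipped, since it seems to be what left the tag with no version and caused the ValueError above:
import eyeD3  # old 0.6.x API, as used in the question

tag = eyeD3.Tag()
tag.link(u"song.mp3")  # hypothetical file name

# Recover the properly encoded title: the bytes were stored as if they were
# latin-1 but are really GBK (assumption; adjust to your source encoding).
t = tag.getTitle().encode('latin-1').decode('gbk')

tag.setTextEncoding(eyeD3.UTF_8_ENCODING)  # Eric Abrahamsen's fix
tag.setTitle(t)
tag.update(eyeD3.ID3_V2_4)  # rewrite as ID3v2.4, which supports UTF-8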

Related

What can I do to recover a UTF-8 binary file?

I somehow had a script running on my company's server that basically did a mongodump and then, for some reason, used recode to encode all .bson files to UTF-8. Thanks to that, I can't use mongorestore, as it claims every single .bson file is 268 MB.
Is there anything one can do to get data back from a recoded to UTF-8 binary BSON file? There's apparently no way to recode it back. Thanks.
OK, this probably only works for MongoDB, but I'll post it as an answer because it may help people with this exact problem:
BSON files, while binary, are somewhat readable, depending on your need. In my case, I had a product collection, and most of what I had to update was descriptions and such.
While not a perfect solution, it is possible to use Notepad++ to turn the hex characters into newlines (or anything else) and try to parse the resulting file, if you know what you are doing.
Since all fields (name, _id, description) are still there, I recommend turning those into XML headers, for example.
That solved my problem. Thanks.
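The same idea can be scripted. Here is a rough Python sketch of that manual process; it is not a BSON parser, the dump file name is a placeholder, and the field names are just the ones mentioned above:
# Rough sketch: salvage readable fragments from a recoded .bson dump and
# turn the known field names into XML-style markers, as suggested above.
import re

with open('products.bson', 'rb') as f:  # hypothetical dump file
    data = f.read()

# Printable ASCII runs of 4+ bytes, similar to what `strings` would find.
runs = [r.decode('ascii') for r in re.findall(rb'[ -~]{4,}', data)]

FIELDS = {'name', '_id', 'description'}
for run in runs:
    if run in FIELDS:
        print('<%s>' % run)  # a field name becomes a marker
    else:
        print(run)           # a field value or other fragment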

TCL fileutil::magic::mimetype not recognising Microsoft documents or mp3

I'm wondering whether this is a limitation of fileutil::magic::mimetype or whether something has gotten messed up in my installation (Tcllib 1.15, Tcl 8.5).
Take an ordinary Microsoft Word .doc file and pass it to fileutil::magic::mimetype e.g.
package require fileutil
package require fileutil::magic::mimetype
set result [fileutil::magic::mimetype "/tmp/test.doc"]
It returns an empty string. The same goes for mp3 and various other file formats, although it does recognise GIF, PNG, TIFF and other image formats.
Calling fileutil::fileType returns binary for the Word document.
Standard Linux command file -i returns "application/msword" for the same file.
Can anyone confirm whether this is expected behaviour? I'm a little confused about the relationship between the fileutil and fumagic libraries, so maybe I've broken something in my install around that area.

Cleaning up text files with sed?

I have a bunch of text files that need cleaning up. For example:
`E..4B?#.#...
..9J5.....P0.z.n9.9.. ........
.k#a..5
E...y^#.r...J5..
E...y_#.r...J5..
..9.P..n9..0.z............
….2..3..9…n7…..#.yr`
Is there any way sed can do this? Like notice weird patterns?
For this answer, I will assume that you have access to standard unix/linux tools.
Your file might be in some word-processor format. If so, the best way to get rid of the junk is to open it with that program. You may be able to find out which with file:
$ file mysteryfile
mysteryfile: Composite Document File V2 Document, Little Endian, Os: Windows, Version 6.1 ....
If that doesn't work, there is a standard unix utility for extracting text from binary files. It is called strings:
$ strings mysteryfile
Some
Recovered Text
...
The behavior of strings can be fine-tuned with several options; see man strings.
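If you do want to script the cleanup rather than pull text out with strings, a small Python filter can do roughly what the question hoped sed would. This is a sketch that assumes the junk bytes carry no meaning and keeps only printable ASCII, tabs, and line breaks:
# Sketch: drop every byte that is not printable ASCII, tab, or a line break.
import sys

for line in sys.stdin.buffer:
    cleaned = bytes(b for b in line if 32 <= b <= 126 or b in (9, 10, 13))
    sys.stdout.buffer.write(cleaned)
Run it as, for example: python clean.py < messy.txt > cleaned.txt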

How can I force emacs (or any editor) to read a file as if it is in ASCII format?

I could not find this answer in the man or info pages, nor with a search here or on Google. I have a file which is, in essence, a text file, but it somehow got screwed up upon saving. (I think there are a few strange bytes at the front of the file accidentally.)
I am able to open the file, and it makes sense, using head or cat, but not using any sort of editor.
In the end, all I wish to do is open the file in emacs, delete the "messy" characters, and save it once cleaned up. The file, however, is huge, so I need something powerful like emacs to be able to open it.
Otherwise, I suppose I can try to create a script to read this in line by line, forcing the script to read it in text format, then write it. But I wanted something quick, since I won't be doing this over & over.
perl -i.bk -pe 's/[^[:ascii:]]//g;' file
Found this perl one liner here: http://www.perlmonks.org/?node_id=619792
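If the offending bytes really are just at the front of the file, as the question suspects, a streaming approach avoids rewriting every line of a huge file. A Python sketch, with placeholder file names, that drops leading non-printable bytes (such as a stray BOM) and copies the rest untouched:
# Sketch: strip leading non-ASCII junk from a huge file, streaming so the
# whole file never has to fit in memory.
import shutil

with open('bigfile.txt', 'rb') as src, open('bigfile.clean', 'wb') as dst:
    head = src.read(4096)  # assume the junk sits in the first block
    start = 0
    while start < len(head) and not (32 <= head[start] <= 126 or head[start] in (9, 10, 13)):
        start += 1
    dst.write(head[start:])
    shutil.copyfileobj(src, dst)  # copy the remainder unchanged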
Try M-x find-file-literally in Emacs.
You could edit the file using hexl-mode, which lets you edit the file in hexadecimal. That would let you see precisely what those offending characters are, and remove them.
It sounds like you either got a different line ending in the file (e.g. carriage returns on a *nix system) or it got saved in an unexpected encoding.
You could use strings to grab the printable characters in the file. You might have to play with the --encoding option, though; I have only ever used it to grab ASCII strings from executable files.

Unicode characters in MATLAB source files

I'd like to use Unicode characters in comments in a MATLAB source file. This seems to work when I write the text; however, if I close the file and reload it, "unusual" characters have been turned into question marks. I guess MATLAB is saving the file as ASCII.
Is there any way to tell MATLAB to use UTF-8 instead?
According to http://www.mathworks.de/matlabcentral/newsreader/view_thread/238995
feature('DefaultCharacterSet', 'UTF8')
will change the encoding to UTF-8. You can put the line above in your startup.m file.
How the MATLAB Process Uses Locale Settings shows how to set the encoding for different platforms. Use
feature('DefaultCharacterSet')
to query the current setting. You can read more about this undocumented function here; see also this MATLAB Central thread for other options.
Mac OS X only!
As I found a solution that worked in my case, I want to share it.
MathWorks advises here to use slCharacterEncoding(encoding) to change the encoding as desired, but on OS X this does not quite solve the issue, just as feature('DefaultCharacterSet') from the accepted answer does not. What helped me get UTF-8 used for opening and saving .m files was the following link on MATLAB Answers:
https://www.mathworks.com/matlabcentral/answers/12422-macosx-encoding-problem
MATLAB seems to ignore any value set via slCharacterEncoding(encoding) or feature('DefaultCharacterSet') and instead uses the region set in System Preferences -> Language & Region. After checking which region is selected, it is possible to define the actual encoding in the hidden configuration file at
$matlabroot/bin/lcdata.xml
This directory can be opened by going to Applications and, after right-clicking MATLAB, selecting Show Package Contents.
For example, for the German default ISO-8859-1, it is possible to adjust the encoding by changing the respective line in lcdata.xml:
<locale name="de_DE" encoding="ISO-8859-1" xpg_name="de_DE.ISO8859-1">
to:
<locale name="de_DE" encoding="UTF-8" xpg_name="de_DE.UTF-8">
If the region that is selected is not present in the lcdata.xml file, this will not work.
Hope this helps!
The solution provided here worked for me on Windows with R2018a.
In case the link doesn't work: the idea is to use the file matlabroot/bin/lcdata.xml to configure an alias for the encoding name (some explanation can be found in the comments of that very file):
<codeset>
  <encoding name="UTF-8">
    <encoding_alias name="windows-1252" />
  </encoding>
</codeset>
You would use your own value instead of windows-1252; the currently used encoding can be obtained by running feature('locale').
Note, though, that if you use Unicode characters in help comments, the help browser does not recognize them, and neither does console window output.
For Mac OS users, Jendker's solution really helps. Thanks a lot! A recap:
Check the default language in MATLAB by typing getenv('LANG') in the command window. Mine returned en_US.ISO8859-1.
In the Applications directory, find MATLAB and show its package contents. Go to bin, open lcdata.xml as an administrator, and locate the corresponding xpg_name, in my case en_US.ISO8859-1. Change the encoding in the same line to UTF-8 and save it.
Restart MATLAB, and it's all done!