I got this weird text output from a program and I can't figure out how to decode it. I'm not a coder and don't have much experience, but I did try a few online tools with no luck. All other text outputs worked fine except this one:
"bAпÑ-Ð…rbel" and "bAпїЅrbel"
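One observation that may help: the sequence "пїЅ" in the second string is what the UTF-8 bytes of the Unicode replacement character (U+FFFD) look like when they are read as Windows-1251. A minimal Python sketch of that round trip follows; the sample word is made up for illustration and is not the program's real output.

```python
# Assumption: the text passed through a UTF-8 -> Windows-1251 misread.
# "placeholder" is an illustrative string, not taken from the program.
REPLACEMENT = "\ufffd"  # U+FFFD, written when a decoder gives up on a byte

placeholder = "b" + REPLACEMENT + "rbel"

# Forward direction: UTF-8 bytes shown as if they were Windows-1251.
garbled = placeholder.encode("utf-8").decode("cp1251")
print(garbled)

# Reverse direction: undo the misread to get back to the replacement character.
repaired = garbled.encode("cp1251").decode("utf-8")
print(repaired == placeholder)
```

If that is what happened, the original letter was already replaced before the text was garbled, so undoing the misread only gets back to the placeholder, not to the original letter.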
I am running tesseract on windows 11 using the command prompt.
The text file is my training data. Words that I want to turn into images.
The output is the next step in the Tesseract process for training my font.
I am passing --find_fonts, but I only have one font in the folder.
text2image --text="C:\PythonProjects\DiabloTesseractTrainFont\text.txt" --outputbase="C:\PythonProjects\DiabloTesseractTrainFont\Output\Dia.font.exp0" --fontconfig_tmpdir="C:\PythonProjects\DiabloTesseractTrainFont" --find_fonts --fonts_dir="C:\PythonProjects\DiabloTesseractTrainFont\Diablo Fonts"
The result:
Total chars = 223645
Font Exocet Light failed with 223518 hits = 99.94%
Not sure why it fails. I have built something similar to this before. I have tried with a font file that I know has worked and it does the exact same thing.
Any help would be appreciated.
I solved it. Some characters in the text file had been changed when I read it into Python. I believe they used to be bullet points, but I had read the file with ASCII encoding and errors set to ignore, and I assumed those characters would simply be removed. I was wrong: the bullet points were replaced with something that showed up as PAD. I found them in Notepad++, highlighted one, and replaced them all with a space. Note that when I did the replace in Notepad++, the Find field looked empty, but it still replaced all of them. Now it compiles just fine. I was stuck for many hours, so I hope this helps someone.
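The manual Notepad++ cleanup can also be sketched in Python. This assumes the "PAD" marks were invisible control characters (Notepad++ labels U+0080 as PAD), which would also explain why the Find field looked empty. The function names and the sample string are hypothetical.

```python
import unicodedata

KEEP = "\n\r\t"  # ordinary whitespace controls that training text needs


def find_control_chars(text):
    """Return (index, code point) for every control character except KEEP."""
    return [
        (index, f"U+{ord(ch):04X}")
        for index, ch in enumerate(text)
        if unicodedata.category(ch) == "Cc" and ch not in KEEP
    ]


def replace_control_chars(text, replacement=" "):
    """Replace each stray control character with `replacement`."""
    return "".join(
        replacement if unicodedata.category(ch) == "Cc" and ch not in KEEP else ch
        for ch in text
    )


# Illustrative sample: U+0080 stands in for a damaged bullet point.
sample = "first item\n\x80 second item\n"
print(find_control_chars(sample))
print(repr(replace_control_chars(sample)))
```

Running the finder first shows where the stray characters are before anything is changed.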
I have an HTML file containing records shown in a table that somehow got encoded in the wrong way. A large part of the file is correct and shows the content as expected, but some parts of the file seem to be encoded in the wrong way. The HTML itself (all the elements etc.) is shown correctly, but the values within the cells of the table are sometimes encoded in the wrong way.
For example one cell contains:
<cell>»¿è²å¼æäºæ 线æ¥å¥ç½ç»ä¸çæ³¢ææå½¢ææ¯ç 究</cell>
While it should contain:
<cell>绿色异构云无线接入网络中的波束成形技术研究</cell>
I already tried figuring out what exactly went wrong, but I can't seem to find the correct solution to completely resolve this problem for the whole file. I tried tools such as FTFY, which didn't give me any meaningful result.
These websites gave me some direction and it seems that something went wrong between Windows-1252/1251 and UTF-8. The first website seems to fix the problem but still returns some unknown characters (UTF-8 displayed as Windows-1252).
Does anyone have an idea how to fix this for the whole file? Or give me any tips to further figure it out on my own.
Thanks in advance.
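For reference, the usual repair for "UTF-8 displayed as a single-byte encoding" is to re-encode the garbled text with the wrong encoding and decode the bytes as UTF-8. A minimal Python sketch, assuming the misread was lossless; it uses Latin-1 for the demonstration because Python's cp1252 codec leaves five byte values undefined, and the function name is hypothetical.

```python
def fix_mojibake(text, wrong_encoding="latin-1"):
    """Undo a UTF-8 -> single-byte misread; return the text unchanged on failure."""
    try:
        return text.encode(wrong_encoding).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in wrong_encoding, or the bytes are not valid UTF-8:
        # the text was probably not damaged in this particular way.
        return text


original = "绿色异构云"
# Simulate the damage: UTF-8 bytes interpreted as Latin-1.
garbled = original.encode("utf-8").decode("latin-1")

print(fix_mojibake(garbled) == original)    # repaired
print(fix_mojibake(original) == original)   # already-correct text is left alone
```

Because the function leaves text it cannot repair untouched, it could be applied cell by cell to a file in which only some values are damaged. If bytes were dropped during the misread, as the unknown characters mentioned above suggest, those characters cannot be recovered this way.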
I found this question which is my exact starting point: Chinese-encoded metadata on mp3 files. I want to re-encode all my metadata as utf-8 so that Banshee can read it.
I can't figure out how to get eyeD3 to do that. I can decode individual tags as per that previous link, but I can't make eyeD3 change the actual text encoding of the mp3 file itself, so those tags can be rewritten in the proper encoding. I tried reading all the data into variables (below, 't' is the properly encoded title), then calling:
tag.clear()
tag.update(eyeD3.ID3_V2_4)
tag.setTitle(t)
That tells me: ValueError: ID3 vNone.None is not supported. Not what I was expecting.
I tried tag.setTextEncoding('utf-8'), but that tells me eyeD3.tag.TagException: Invalid encoding. All the other encodings I try give me the same error message.
eyeD3.TAGS2_2_TO_TAGS_2_3_AND_4 looks promising, but it's a dictionary of cryptic letter codes that mean nothing to me.
Can someone tell me how to change the version of the tags to something that supports utf-8, then change the file encoding to utf-8 and write the metadata back in?
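For the "decode individual tags" step, a minimal sketch in plain Python, independent of eyeD3. It assumes the tag bytes were written in GBK but read as Latin-1 (the default text encoding in ID3v2); the actual encoding of a given file may differ, and the function name is hypothetical.

```python
def redecode_tag(garbled, actual_encoding="gbk"):
    """Recover a tag that was decoded as Latin-1 but stored in another encoding."""
    raw = garbled.encode("latin-1")      # Latin-1 maps every byte 1:1, so this is lossless
    return raw.decode(actual_encoding)   # assumption: the bytes are really GBK


# Illustrative title, garbled the same way a Latin-1 tag reader would garble it.
title = "中文歌曲"
garbled = title.encode("gbk").decode("latin-1")

print(redecode_tag(garbled) == title)
```

The recovered string still has to be written back with a UTF-8-capable tag version, which is the part the question is about.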
Looks like somebody's already created something that does this:
http://code.google.com/p/id3-to-unicode/
It's pretty easy to use. Just download the latest version of the script from the website, make sure you have the eyeD3 and chardet python modules installed (a quick sudo apt-get install python-eyed3 python-chardet did the trick for me in ubuntu), and run the script with the -h flag to see how to use it.
My only complaint is that the script assumes that your music is organized like artist/album/01 track name.mp3, and uses path/file information to fill in missing tags. I disabled this in the latest version (http://id3-to-unicode.googlecode.com/files/id3_to_unicode_1.1.py) by commenting out lines 126-138.
Eric Abrahamsen figured out that setting the text encoding should look like
tag.setTextEncoding(eyeD3.UTF_8_ENCODING) instead of
tag.setTextEncoding('utf-8').
My Localizable.strings file has somehow been corrupted and I don't know how to restore it.
If I open it as a Plain Text File it starts with weird characters that I can't copy here.
If I leave the file be the app builds. If I make any changes either the values aren't interpreted properly or I get an error at compile time.
Localizable.strings: Conversion of string failed. The string is empty.
Command /Developer/Library/Xcode/Plug-ins/CoreBuildTasks.xcplugin/Contents/Resources/copystrings failed with exit code 1
I suspect this is an encoding problem but I don't know how it happened (maybe SVN is to blame?) nor how to solve it. Any tips will be much more appreciated.
I have issues with the same file that sound very similar to your own. What happens for me is that Xcode doesn't know the correct file encoding. I often get this when I rearrange the project and remove and re-add this file. When I re-add the file, its encoding gets set to something like Western Roman, which can't seem to render anything other than ASCII.
Here's what I do to fix the problem:
In Xcode select the Localizable.strings file in the Groups & Files panel.
Do a Get Info on that file.
On the info panel select the General tab.
In that tab go to the File Encoding and change its value.
The last step is where the trick lies, as you now have to guess the right encoding. I find that "Unicode (UTF-8)" works for most European languages, and that "Unicode (UTF-16/32)" are the ones to try for Asian languages.
I just had that error because I forgot a semicolon. Took me a while to figure it out. Seems like a really ambiguous compiler error but the fix was simple.
Make sure in File-Get Info, that UTF-16 is selected. If it's set to none or UTF-8 as encoding then you need to change it. If your characters have spaces between them then you choose to "re-interpret" the file as UTF-16. If there are weird characters in the file, then you need to remove them.
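Rather than guessing the encoding, it is often possible to check for a byte order mark (BOM) at the start of the file, which may also be the "weird characters" seen at the top in a plain-text view. A minimal Python sketch, assuming the file starts with a BOM (not every file does); the function name is hypothetical.

```python
import codecs

# UTF-32 must be checked before UTF-16: the UTF-32 LE BOM starts with the UTF-16 LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(data):
    """Return a Python codec name based on the BOM, or None if there is no BOM."""
    for bom, codec_name in BOMS:
        if data.startswith(bom):
            return codec_name
    return None


# Illustrative data: Python's "utf-16" codec writes a BOM when encoding.
data = '"greeting" = "Hello";'.encode("utf-16")
codec_name = sniff_bom(data)
print(codec_name)
print(data.decode(codec_name))
```

For a real file, the bytes would come from open(path, "rb").read(); a result of None means the encoding still has to be determined some other way.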
Apart from the UTF-8 problem, you sometimes still have to check the content for syntax problems.
Use the following regular expression to verify your text line by line. If any line does not match, there must be a problem on that line.
"(.+?)"="(.+?)";
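The line-by-line check can be sketched in Python as follows. This version makes two assumptions beyond the regular expression above: it allows optional whitespace around the equals sign, and it skips blank lines and lines that start a comment. Multi-line comments and multi-line values are not handled, and the function name is hypothetical.

```python
import re

# Same idea as the expression above, with optional whitespace around "=".
ENTRY_RE = re.compile(r'^"(.+?)"\s*=\s*"(.+?)";\s*$')


def find_bad_lines(text):
    """Return (line number, line) for every entry line that fails the pattern."""
    bad = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        # Skip blank lines and single-line comments.
        if not stripped or stripped.startswith("//") or stripped.startswith("/*"):
            continue
        if not ENTRY_RE.match(stripped):
            bad.append((number, line))
    return bad


# Illustrative content: line 2 is missing its semicolon.
sample = '"hello" = "Hello";\n"bye" = "Goodbye"\n// a comment\n"ok"="OK";\n'
print(find_bad_lines(sample))
```

The reported line numbers point at the entries to inspect for a missing quote or semicolon.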
You can use the plutil command line tool. Without options or with the -lint option, it checks the syntax of the file given as argument. It will tell you more precisely where the error is.
This happens to me when there is a missing quote or something else wrong with the file. Most commonly, since my language files are done by another team member, he forgets a quote or something similar. Usually Xcode shows an error on that line, but sometimes it doesn't and just throws a "Corrupted data" error.
Double check if all your strings are properly closed in quotes
Open the file in Xcode.
Right click it in Project Navigator.
Select Open as -> ASCII Property List
After a lot of experiments, I still can't get the following script working. I need some guidance on how to diagnose this particular Perl problem. Thanks in advance.
This script is for testing the use of Office 2007 OCR API:
use warnings;
use strict;
use Win32::OLE;
use Win32::OLE::Const;
Win32::OLE::Const->Load("Microsoft Office Document Imaging 12\.0 Type Library")
or
die "Cannot use the Office 2007 OCR API";
my $miDoc = Win32::OLE->new('MODI.Document')
or die "Cannot create a MODI object";
#Loads an existing TIFF file
$miDoc->Create('OCR-test.tif');
#Performs OCR with the OCR language set to English
$miDoc->OCR(LangId => 'miLANG_ENGLISH');
#Get the OCR result
my $OCRresult = $miDoc->{Images}->Item(0)->{Layout}{Text};
print $OCRresult;
I did a small test. I loaded an .MDI file that already contained the OCR information, deleted the OCR method line, and ran the script, and I got the expected text output from "print $OCRresult". Otherwise, Perl throws this error:
Use of uninitialized value $OCRresult in print at E:\OCR-test.pl line 15
I'm suspecting that something's wrong with the line
$miDoc->OCR(LangId => 'miLANG_ENGLISH');
I tried leaving the parens empty or using three parameters, like 'miLANG_ENGLISH',1,1 etc., but without any luck.
I also tried using Microsoft Office Document Imaging to test whether the text in the TIF I'm experimenting with was recognizable, and the result was positive.
So what other diagnostic methods do I have?
Or can someone who happens to have Office 2007 test my code with a whatever jpg,bmp or tif pictures that have text content and see if something's wrong?
Thanks in advance.
UPDATE
Haha, I've finally figured out where the problem is and how I can solve it. @hobbs, thank you for leaving the comment :) Things are interesting. When I was trying to respond to your comment, I added the link to the Office Document Imaging 2003 VBA Language Reference and took yet another look at the material there. The following information caught my eye:
LangId can be one of the following MiLANGUAGES constants.
miLANG_CHINESE_SIMPLIFIED (2052, &H804)
I changed the following OCR method line:
$miDoc->OCR('miLANG_ENGLISH',1,1);
to this:
$miDoc->OCR(2052,1,1);
A few notes:
1. I'm running ActivePerl 5.10.0 on Windows XP (Chinese version)
2. Before this, I already tried $miDoc->OCR(9) but without luck
And suddenly and kind of magically that pesky ERROR saying "Use of uninitialized value $OCRresult in print at E:\OCR-test.pl line 15" disappeared completely and the OCRed text appeared on the screen. The OCR result was not satisfying but the parameter "2052" refers to Chinese and the TIF image contains all English. So I changed the parameter to
$miDoc->OCR(9,1,1) but this time without luck. Windows threw me this error:
unknown software exception (0x0000000d)
I changed the TIF image to one that contains all Chinese characters and changed the parameter to "$miDoc->OCR(2052,1,1);" again and this time everything worked just like expected. The OCR result was satisfying.
Now I think there's something weird about my Office 2007 OCR API. Someone who runs Windows XP (English version) with Office 2007 installed would probably not encounter that exception error with the parameter
$miDoc->OCR(9,1,1);
Anyway, I'm really happy that I've finally got things working :D
For starters I would try dumping the value of $miDoc->{Images} -- does it exist? If it exists and it's a collection does it contain anything? If it contains anything, what is it? An error? Or maybe just a different structure than you're expecting? warn, Dumper, and a little exploration can go a long way.
Incidentally, if you want to do the "modern" thing and don't mind grabbing a nifty tool off of CPAN, try Devel::Dwarn -- it makes dumping to stderr even more fun than it was already :)