Removing folders with "bad" names on GitHub

I did not realize that certain characters were not allowed in folder names on GitHub and named a couple of folders with the character ":". I cannot figure out how to rename or delete these folders. I don't care about the data inside; I can just reupload it.
Anyone know how to fix this?

In general, Git is capable of handling arbitrary byte sequences in file names because it's designed for Unix systems. That means any character except forward slash or NUL can appear in a path component, including characters such as 0xfe and 0xff, which are not valid UTF-8. Colons are one of those permitted characters.
GitHub also does not have a problem with arbitrary bytes. However, if a path isn't valid UTF-8, it might not render properly in the web interface, although it is still fully supported through normal Git operations.
However, there are some operating systems which are less capable. For example, Windows excludes many common punctuation characters from permissible file names. As a result, you may wish to be kind to users of those operating systems and not use file names that cause problems there.
Since you're on Windows, you'll have some trouble checking out the repository. The best thing to do is clone the repository on a Linux system or under Windows Subsystem for Linux, rename the files or directories with git mv, and then commit and push. macOS should also be able to handle colons in path names, although it requires that the path names be valid UTF-8.
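For example, from a Linux or WSL shell (the repository URL and the folder names here are placeholders for your own):

git clone https://github.com/you/yourrepo.git
cd yourrepo
git mv "foo:bar" "foo-bar"         # rename the offending folder
git rm -r "other:folder"           # or delete one outright, since the data is expendable
git commit -m "Remove colons from folder names"
git push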

Related

Google Cloud Storage not handling UTF-8 filenames

I am serving files from Google Cloud Storage and some of the filenames contain non-ASCII, UTF-8 encoded characters. For example, volvía.mp3.
If I request volvía.mp3, GCS throws an error.
If I percent encode the filename (í = %C3%AD) as volv%C3%ADa.mp3, it still fails.
If I percent encode the filename using the "combining acute accent" = %CC%81 as volvi%CC%81a.mp3, it succeeds.
Any ideas what is going on?
EDIT: The error it throws is an "Access Denied" error:
Anonymous users does not have storage.objects.get access to object. However, this seems to be the error one gets when requesting an object that's not found.
The problem is due to Mac OS's HFS+ file system, which enforces canonical decomposition (NFD) on filenames. This means it normalizes a character such as í into two code points (i + combining acute accent) rather than the single code point used in the "composed" form (i.e., NFC).
GCS treats these two forms as distinct filenames, despite the fact that they appear identical.
One solution is to convert NFD filenames to the more common NFC forms (using a utility such as convmv) before uploading to GCS. However, this can't be done on Mac OS because the file system itself enforces NFD.
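To see the difference concretely, here is a minimal Perl sketch (using the core Unicode::Normalize and Encode modules) that prints the UTF-8 bytes of both forms; it reproduces exactly the %C3%ADa-versus-i%CC%81a split from the question:

use Unicode::Normalize qw(NFC NFD);
use Encode qw(encode);

my $composed   = "volv\x{ED}a.mp3";   # í as the single code point U+00ED (NFC)
my $decomposed = NFD($composed);      # i followed by U+0301 (combining acute accent)

for my $form ($composed, $decomposed) {
    print join(" ", map { sprintf "%02X", ord } split(//, encode("UTF-8", $form))), "\n";
}
# the first line shows ... C3 AD 61 ... (í), the second ... 69 CC 81 61 ... (i + accent)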
I was not able to reproduce your issue. I uploaded an object named volvía.mp3 and was able to retrieve it as both http://storage.googleapis.com/bucketname/volvía.mp3 and http://storage.googleapis.com/bucketname/volv%C3%ADa.mp3
I suspect that you actually created an object with the "combining acute accent" character instead. How did you upload your object?

How to keep the BOM from being removed from Perforce unicode files

I have converted an entire branch of .NET and SQL sources to UTF-8 with BOM, changing their Perforce file type to Unicode in the same operation. (The encoding difference might sound confusing, but in Perforce, the Unicode file type denotes UTF-8 file content.) But later I found out that Perforce silently eliminates the BOM marker from UTF-8 files. Is it possible to set Perforce to keep UTF-8 BOM markers in files of Unicode file type? I can't find it documented.
The Perforce server is switched to Unicode mode, and the connection encoding is UTF-8 no BOM (but changing it to UTF-8 with BOM doesn't make any difference).
Example:
check out a source file from Perforce
change file type to Unicode
convert file content to format "UTF-8 with BOM"
submit the file (now the file still keeps the BOM in its first 3 bytes)
remove the file from the workspace
get the latest revision of the file (now the file no longer contains the BOM at the beginning)
OK, Hans Passant's comment encouraged me to re-examine P4CHARSET, and the answer turns out to have two parts:
For Perforce command-line access, the setting of the P4CHARSET variable controls the behavior. To enable adding a BOM to files of Unicode type, use the command
p4 set P4CHARSET=utf8-bom
To have these files without a BOM, use
p4 set P4CHARSET=utf8
For P4V, the Perforce Visual Client, the setting can be changed via the menu Connection > Choose Character Encoding.... Use the value Unicode (UTF-8) to enable adding a BOM and Unicode (UTF-8, no BOM) to suppress it.
If the menu item Choose Character Encoding... is disabled, ensure the following (and then check again):
P4V has a connection to the server open and working
the pane containing the depot/workspace tree is focused (click inside it to re-ensure this)
Notes:
if you usually combine both of the above ways to access Perforce, you need to apply both solutions, otherwise you will keep getting mixed results
if you want to instantly add/remove the BOM to/from existing files, adjust the above settings, then remove the files from the workspace and get them again (see steps 5 and 6 of the example posted in the question; a command-line sketch follows these notes). Other server actions that change the content of files (integrating, merging, etc.) will do something similar
for other encoding options and their impact on BOM, see the second table in Internationalization Notes for P4D, the Perforce Server and Perforce client applications
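As a concrete sketch of the remove-and-get-again step from the command line (the depot path is a placeholder):

p4 set P4CHARSET=utf8-bom
p4 sync //depot/project/Source.cs#none
p4 sync //depot/project/Source.cs

The sync to #none removes the file from the workspace; the second sync fetches the head revision again, which rewrites the file with a BOM under the new setting.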

How to use unicode characters in Eclipse File Search?

We have some XML file that contains an invalid character, and the program reports neither which file it is, nor the line number or character offset. It would be a few seconds' work to fix the problem if I could just search for exactly that character, but I cannot find how to express a Unicode character in the File Search (or at least I assume that's the issue, since the search returns nothing).
Neither 0x1e nor \u001e seems to match anything.
[EDIT] I mean, I can still change the code, eventually find which file it is by catching the exception, and use some kind of script/tool to find where exactly the character is, but I do believe it should be possible to search for Unicode characters in Eclipse, and that is what I am asking about in this question.
It may be a problem with the character encoding.
As you're going to need to perform a global / workspace-wide search to find the offending character, you'll probably need to set the global text file encoding:
Preferences -> Workspace -> Text file encoding
This option may be under the 'General' section in Eclipse, depending on your setup and installed plugins etc.
Ensure that the encoding is set to UTF-8.
You will also need to escape the Unicode character sequences (with the "Regular expression" option enabled in the search dialog), like so:
\u2665
(which I see you have tried)
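If the in-IDE search still comes up empty, a one-liner outside Eclipse can at least pinpoint the file and line. This Perl sketch assumes the XML files sit in the current directory; it reports the first hit in each file:

perl -ne 'if (/\x{1e}/) { print "$ARGV: line $.\n"; close ARGV }' *.xml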

How important is file encoding?

How important is file encoding? The default for Notepad++ is ANSI, but would it be better to use UTF-8 or what problems could occur if not using one or the other?
Yes, it would be better if everyone used UTF-8 for all documents always.
Unfortunately, they don't, primarily because Windows text editors (and many other Windows tools) default to “ANSI”. This is a misleading name, as it has nothing to do with ANSI X3.4 (aka ASCII) or any other ANSI standard; it actually means the system default code page of the current Windows machine. That default code page can vary between machines, or change on the same machine, at which point all text files in “ANSI” that contain non-ASCII characters like accented letters will break.
So you should certainly create new files in UTF-8, but you will have to be aware that text files other people give you are likely to be in a motley collection of crappy country-specific code pages.
Microsoft's position has been that users who want Unicode support should use UTF-16LE files; it even, misleadingly, calls this encoding simply “Unicode” in save box encoding menus. MS took this approach because in the early days of Unicode it was believed that this would be the cleanest way of doing it. Since that time:
Unicode was expanded beyond 16-bit code points, removing UTF-16's advantage of each code unit being a code point;
UTF-8 was invented, with the advantage that, as well as covering all of Unicode, it's backwards-compatible with 7-bit ASCII (which UTF-16 isn't, as it's full of zero bytes), and for this reason it's also typically more compact.
Most of the rest of the world (Mac, Linux, the web in general) has, accordingly, already moved to UTF-8 as a standard encoding, eschewing UTF-16 for file storage or network purposes. Unfortunately Windows remains stuck with the archaic and useless selection of incompatible code pages it had back in the early Windows NT days. There is no sign of this changing in the near future.
If you're sharing files between systems that use differing default encodings, then a Unicode encoding is the way to go. If you don't plan on it, or use only the ASCII set of characters and aren't going to work with encodings that, for whatever reason, modify those (I can't think of any at the moment, but you never know...), you don't really need it.
As an aside, this is the sort of stuff that happens when you don't use a Unicode encoding for files with non-ASCII characters on a system with a different encoding from the one the file was created with: http://en.wikipedia.org/wiki/Mojibake
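As a quick demonstration of that effect, here is a minimal Perl sketch: it encodes an accented character as UTF-8 and then wrongly decodes those bytes as cp1252, which is how "é" turns into "Ã©" when a UTF-8 file is opened by a tool assuming a Windows code page:

use Encode qw(encode decode);

binmode STDOUT, ':encoding(UTF-8)';      # so the demo itself prints correctly
my $bytes = encode("UTF-8", "\x{E9}");   # "é" encoded as the two bytes 0xC3 0xA9
my $wrong = decode("cp1252", $bytes);    # the same bytes misread as cp1252
print "$wrong\n";                        # prints "Ã©"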
It is very important, since your tool will show wrong characters if you use the wrong encoding. Try to load a Cyrillic file in Notepad without using UTF-8 or the right code page and watch a lot of "?" come up. :)

What problems should I expect when moving legacy Perl code to UTF-8?

Until now, the project I work on used only ASCII in the source code. Due to several upcoming changes in the I18N area, and also because we need some Unicode strings in our tests, we are thinking about biting the bullet and moving the source code to UTF-8, while using the utf8 pragma (use utf8;).
Since the code is in ASCII now, I don't expect to have any troubles with the code itself. However, I'm not quite aware of any side effects we might be getting, while I think it's quite probable that I will get some, considering our environment (perl5.8.8, Apache2, mod_perl, MSSQL Server with FreeTDS driver).
If you have done such migrations in the past: what problems can I expect? How can I manage them?
The utf8 pragma merely tells Perl that your source code is UTF-8 encoded. If you have only used ASCII in your source, you won't have any problems with Perl understanding the source code. You might want to make a branch in your source control just to be safe. :)
If you need to read UTF-8 data from files, or write UTF-8 to files, you'll need to set the encodings on your filehandles and encode your data as the outside world expects it. See, for instance, With a utf8-encoded Perl script, can it open a filename encoded as GB2312?.
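For instance, a minimal sketch of that filehandle discipline (the file names are hypothetical):

my ($infile, $outfile) = ('in.txt', 'out.txt');
open my $in,  '<:encoding(UTF-8)', $infile  or die "can't read $infile: $!";
open my $out, '>:encoding(UTF-8)', $outfile or die "can't write $outfile: $!";
while (my $line = <$in>) {
    print {$out} $line;    # $line holds decoded characters here, not raw bytes
}
close $in;
close $out;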
Check out the Perl documentation that tells you about Unicode:
perlunicode
perlunifaq
perlunitut
Also see Juerd's Perl Unicode Advice.
A few years ago I moved our in-house mod_perl platform (~35k LOC) to UTF-8. Here are the things which we had to consider/change:
despite the perl doc advice of 'only where necessary', go for using 'use utf8;' in every source file - it gives you consistency.
convert your database to UTF-8 and ensure your DB config sets the connection charset to UTF-8 (in MySQL, watch out for field length issues with VARCHARs when doing this)
use a recent version of DBI - older versions don't correctly set the utf8 flag on returned scalars
use the Encode module, avoid using perl's built-in utf8 functions unless you know exactly what data you're dealing with
when reading UTF-8 files, specify the layer - open($fh,"<:utf8",$filename)
on a RedHat-style OS (even 2008 releases) the included libraries won't like reading XML files stored in utf8 scalars - upgrade perl or just use the :raw layer
in older perls (even 5.8.x versions) some older string functions can be unpredictable - eg. $b=substr(lc($utf8string),0,2048) fails randomly but $a=lc($utf8string);$b=substr($a,0,2048) works!
remember to convert your input - eg. in a web app, incoming form data may need decoding
ensure all dev staff know which way around the terms encode/decode go - a 'utf8 string' in perl is in /de/-coded form; a raw byte string containing utf8 data is /en/-coded
handle your URLs properly - /en/-code a utf8 string into bytes and then do the %xx encoding to produce the ASCII form of the URL, and /de/-code it when reading it from mod_perl (eg. $uri=utf_decode($r->uri())) - see the sketch at the end of this answer
one more for web apps, remember the charset in the HTTP header overrides the charset specified with <meta>
I'm sure this one goes without saying - if you do any byte operations (eg. packet data, bitwise operations, even a MIME Content-Length header) make sure you're calculating with bytes and not chars
make sure your developers know how to ensure their text editors are set to UTF-8 even if there's no BOM on a given file
remember to ensure your revision control system (for google's benefit - subversion/svn) will correctly handle the files
where possible, stick to ASCII for filenames and variable names - this avoids portability issues when moving code around or using different dev tools
One more - this is the golden rule - don't just hack til it works, make sure you fully understand what's happening in a given en/decoding situation!
I'm sure you already had most of these sorted out, but hopefully all of that helps someone out there avoid the many hours of debugging we went through.
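To make the URL item above concrete, here is a hedged sketch of the boundary discipline described in the list: decode bytes on the way in, work with character strings internally, encode on the way out. It assumes the common URI::Escape module, and the parameter value is just an example:

use Encode qw(encode decode);
use URI::Escape qw(uri_escape uri_unescape);

# incoming: %xx-escaped URL fragment -> raw utf8 bytes -> decoded characters
my $raw    = uri_unescape("volv%C3%ADa");      # raw bytes containing utf8 data
my $string = decode("UTF-8", $raw);            # a decoded Perl character string
printf "%d chars\n", length $string;           # 6 chars, though it was 7 bytes

# outgoing: decoded characters -> utf8 bytes -> %xx-escaped ASCII URL
my $url_part = uri_escape(encode("UTF-8", $string));
print "$url_part\n";                           # back to volv%C3%ADa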