When and how might the operating system store a file under a different name than I gave it? - unicode

I found this statement under another SO question concerning Unicode and I'd like to ask for further elaboration of this rather surprising fact.
Code that believes once you successfully create a file by a given name, that when you run ls or readdir on its enclosing directory,
you'll actually find that file with the name you created it under is
buggy, broken, and wrong. Stop being surprised by this!
When does this happen and what to do about it?

The first example that comes to mind: if you create a file on OS X named é (the single code point U+00E9), the OS will actually store it as U+0065 U+0301 (Unicode decomposition). The file is still accessible under the original name, but it is listed in decomposed form.
How to avoid: don't look up your files manually unless you are sure their names are pure ASCII.
Second: on Windows, if you have a file called e and try to create (with overwriting enabled) a file called E, the OS will still list a file called e. If e didn't exist beforehand, a file called E would be created.
How to avoid: don't look up your files manually unless you are sure their names are pure ASCII, and take case into account. Try to use a consistent capitalisation style; I suggest going all lowercase.
Third: on Windows, if for example your system encoding is Windows-1250 and you want to create a file named ê via the narrow, char-based API, a file called e will be created instead. This of course is easy to avoid, but this exact problem bit me once: WinRAR extracted the files ê.png, è.png and e.png all into e.png, overwriting data. Similar problems can happen with other encoding mixups, too.
How to avoid: don't use APIs that take the filename as a char* on Windows.
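If you do have to look a file up by name, one way to cope with the first two cases is to compare names through a normalizing key instead of byte for byte. The following is only a minimal Python sketch: `name_key` and `find_entry` are hypothetical helper names, and the choice of NFC plus case folding is an assumption about what "the same name" should mean for your application.

```python
import os
import unicodedata


def name_key(name: str) -> str:
    """Comparison key: NFC-normalize the name, then case-fold it."""
    return unicodedata.normalize("NFC", name).casefold()


def find_entry(directory: str, wanted: str):
    """Return the name as the directory actually lists it, or None."""
    key = name_key(wanted)
    for entry in os.listdir(directory):
        if name_key(entry) == key:
            return entry
    return None
```

With this, a name written as U+00E9 and one written as U+0065 U+0301 produce the same key, and so do e and E. It does nothing for the third case, where the name has already been mangled by a lossy encoding conversion.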

Related

rename file when using DD name

In a 'C' language LP64 compiled program, which will run in Batch, TSO and z/OS UNIX, when opening a PDS(E) member using the following notation (recommended in order to allow file disposition to be used):-
hFile = fopen("DD:CONFIG(COPY)", "w");
fclose(hFile);
I am surprised to discover that the following does not appear to work:-
rename("DD:CONFIG(COPY)","DD:CONFIG(MAIN)");
Failing as it does with an errno of ENOENT (EDC5129I No such file or directory.)
The documentation for rename says:-
The rename() function renames memory files and DASD data sets. It also renames individual members of PDSs (and PDSEs)
If instead I do:-
rename("//'MYUSER.CONFIG(COPY)'","//'MYUSER.CONFIG(MAIN)'");
the rename() works.
Alternatively if I do:-
rename("//'MYUSER.CONFIG(COPY)'","DD:CONFIG(MAIN)");
it fails with an errno of EINVAL (EDC5121I Invalid argument.)
Why does it not accept the same file name notation that is used for fopen?
The reason this is important is because the rename() cannot succeed while the PDSE is being browsed by someone. Whereas, using the DD: notation allows an fopen() for write to succeed when the PDSE is being browsed because the DISP=SHR coded on the DD name in the JCL is adopted by the fopen().
So, I suppose the real question is - how can my program rename a PDSE member in a way that will succeed when the PDSE is also being browsed by someone?
The technique required to rename a dataset is different than the technique to rename a member inside a PDS/PDSE...I'd wager that the system rename() function you're calling is just getting this wrong. In z/OS, there are lots of combinations functions like "rename()" have to handle, and it's not unusual to find some that don't work as you expect.
Certainly it's worth a call to IBM Support to see if there's something else going on here...what you're trying to do seems like it should work, so I think there's something to be said for treating it like a bug or documentation error.
Beyond that, as you suggest, you can either use the form of rename that works, or you can replace the system's rename function with something that actually works properly.
One simple way would be to create the rename() as you show it:
rename("//'MYUSER.CONFIG(COPY)'","//'MYUSER.CONFIG(MAIN)'");
You can get the DSN for a DDNAME using the fldata() function, so it's not hard to create a rename like this on the fly given an open file handle. Beware that this form of rename may allocate the file you specify with DISP=OLD, and hence cause problems if some other task has the file allocated. Also, if this is supposed to be commercial quality code, as a customer, my eyebrows would go up if I found out you needed to launch some external program because you couldn't figure out how to rename a PDS/PDSE member - but that might just be me.
The other alternative is to write your own "rename()" function...unfortunately, it most likely would need to be assembler language if you want it to be efficient. As others suggest, you might spawn off a shell, REXX or TSO command, but of course, that means creating a new process, etc etc etc just to rename the PDS/PDSE member. Keep in mind also that some of these approaches might also have issues with trying to allocate the input file with DISP=OLD.
If that's too slow for your needs, the way to do what you want is to call a small assembler routine that invokes the system STOW service against your DDNAME to do your rename. The flow would be something like this:
1. You'd create a 16-byte area containing the old and new member names. They're 8 characters each and blank padded.
2. You'd need the address of an open DCB that describes the file you're looking at. You can get the DCB address from the FILE structure, I believe - or you could just open a second DCB to the DDNAME you have allocated.
3. You'd call the system STOW service with the parameters that tell it to rename a PDS/PDSE member:
STOW dcb,area_from_step1,C
In the STOW macro above, the "directory option" of "C" tells STOW that you want to rename an existing member. The area_from_step1 has the current and new member names - the system searches the directory for the current name and rewrites it with the new member name in place.
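To make the layout of that 16-byte area concrete, here is a small Python sketch of it. This only illustrates the byte layout; on z/OS you would build the area in your C or assembler code. The function name `stow_rename_area` is hypothetical, and both the uppercasing and the use of the cp037 EBCDIC code page are assumptions (letters, digits and the blank encode the same way in the common EBCDIC code pages).

```python
def stow_rename_area(old_member: str, new_member: str) -> bytes:
    """Build the 16-byte area: old name, then new name, 8 bytes each."""

    def field(name: str) -> bytes:
        name = name.upper()  # assumption: member names are upper case
        if not 1 <= len(name) <= 8:
            raise ValueError(f"member name must be 1-8 characters: {name!r}")
        # Pad with blanks to 8 characters; an EBCDIC blank is 0x40.
        return name.ljust(8).encode("cp037")

    return field(old_member) + field(new_member)
```

For the example in the question, `stow_rename_area("COPY", "MAIN")` gives the two names back to back, each followed by four EBCDIC blanks.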
To be honest, what I describe above is exactly what the system runtime should be doing, but if it's not and IBM doesn't want to fix it, then you might prefer to do this sort of thing "by hand".
Not sure if this will work, but since you have the dataset already allocated, perhaps you could "call" (for some value of call) IEHPROGM from your program, constructing the proper SYSIN before making the call?
Here's a link to the IBM example for IEHPROGM (mind any break):
https://www.ibm.com/support/knowledgecenter/en/SSLTBW_2.1.0/com.ibm.zos.v2r1.idau100/u1354.htm
--Scott

What is this symbol ``, and why can't some people see it?

The symbol is this one: 
It crops up at the end of folder names sometimes on our filesystem. The people who put the folder there swear that they couldn't see it. I think it might be being produced by one of our internal asset building tools, perhaps in response to something copied in from a Google docs spreadsheet?
Sorry that's so vague...
It looks like the culprit could be Windows CHKDSK (analogous to fsck) …
https://github.com/dw/scratch/blob/master/ntfs-3g-chkdsk-unicode-fix.py →
# … Windows
# CHKDSK, which will immediately replace all the invalid characters with
# Unicode ordinals in the private use plane.
This symbol is used by OS X if you add a trailing space to a folder name on a FAT file system.
Why? Because it is forbidden to have trailing spaces in folder names on Windows. As OS X allows trailing spaces in general, it somewhat extends this behavior to FAT drives. The private-use Unicode character is used as a means to keep the file system sane at the same time.
While you're on the Mac, you barely see the additional character, as it is rendered as a space there. Only after moving the drive to Windows or Linux does the odd Unicode character show up as unprintable.
This is the Unicode character U+F028, which falls in the private use area U+E000–U+F8FF. This means its use is application specific. You are probably right that some tool produces this character and fails to remove or replace it when copying. Wikipedia lists some vendor usage examples, but I doubt that this list will help much.
To find the cause of the problem, I would first ask the users who created the folders which operating systems they use and which applications they copied the text from. If you think one of your build tools is the cause, you could search for that character in all log files of those tools and in any databases that the tools use.
As for the actual question in the title, this character does not have a defined symbol. Some applications show a default symbol while other applications may just ignore it and show nothing.
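If it helps with the hunt, the affected folders can be located by scanning names for private-use code points. This is just a rough Python sketch; `private_use_chars` and `scan_tree` are hypothetical names, and it relies on the Unicode general category "Co" (private use), which covers U+E000–U+F8FF as well as the private-use planes.

```python
import os
import unicodedata


def private_use_chars(name: str):
    """Return (index, 'U+XXXX') for each private-use code point in name."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(name)
        if unicodedata.category(ch) == "Co"
    ]


def scan_tree(root: str):
    """Yield (path, hits) for every file or folder name with such characters."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            hits = private_use_chars(name)
            if hits:
                yield os.path.join(dirpath, name), hits
```

Running `scan_tree` over the share and noting who owns the reported folders should narrow down which tool or machine is producing them.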

Windows Installer Automation and Installshield Basic MSI: Mystery String during Chained MSI

EDIT: Turns out the mystery string was a simple MD5 hash of the name of the file (including the extension and capitalization).
I'm attempting to automate the process of creating a Chained MSI through InstallShield. In the GUI, this involves going to Releases, adding a chained package, linking to the MSI and streaming the file into the project.
I've reverse engineered what exactly happens behind the scenes by analyzing the project file as XML. It essentially just comes down to table edits. I understand you can use Windows Installer Automation to open an *.ism file and access the database tables (LINK).
Yet, there is a single field in the ISChainPackageData table which I cannot seem to generate or figure out how it was calculated. It is the column titled, File. It is a 32 character hex string preceded by an underscore. I have discovered that the only attribute that determines this field is the name of the MSI file being streamed. For example:
Linking to a chained MSI by the name of Test.msi, yields _29B31F67F21C9EE77CBF8C4C5D24ACE9.
Changing the name would change this. Changing the file, including replacing it with an empty file of the same name, does not.
I believe it is some kind of simple hash of the name, but I haven't had any luck guessing it.
Does anyone have any insight on what they might be using here?
Thanks!
Close. It's a hash-based GUID of a combination of a few things. I'd have to dig up the code to find out exactly what, but it's at least the relative path and filename, and possibly something related to the package in question (probably its primary key value).
This is used to generate a unique key for each file you include with a package, without allowing duplicate files. (Windows Installer doesn't like backslashes in its primary keys.) The actual value here isn't meaningful; if you're careful to avoid duplicate keys and don't overlap file path and name combinations, you can probably put in any valid key value you like. However that may prevent the IDE from detecting duplicates itself.
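Going by the asker's edit at the top (a plain MD5 of the file name, including extension and capitalization), generating such a key could look roughly like the Python sketch below. The function name `chained_package_key` is hypothetical, and the text encoding used before hashing is an assumption; I have not checked which encoding reproduces the value quoted in the question, so try UTF-8 and UTF-16LE against a known key first.

```python
import hashlib


def chained_package_key(file_name: str, encoding: str = "utf-8") -> str:
    """Underscore followed by the upper-case MD5 hex digest of the file name."""
    digest = hashlib.md5(file_name.encode(encoding)).hexdigest().upper()
    return "_" + digest
```

Because capitalization is part of the input, `Test.msi` and `test.msi` produce different keys.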

how can we identify notepad file?

How can we identify Notepad files that were created on two different computers? Is there any way to find out which computer a file was created on, or whether it was created on XP or Linux?
If you right click on the file, you should be able to see the permissions and attributes of the file.
Check the line endings. Under GNU/Linux, lines end with \n (ASCII 0x0A), while under Microsoft Windows they end with \r\n (ASCII 0x0D 0x0A).
Wikipedia: https://en.wikipedia.org/wiki/Newline
Found this: http://bit.ly/J258Mr
It is about identifying a Word document, but some of the information is relevant:
"To see on which computer the document had been created, open the Word
document in a hex editor and look for "PID_GUID". This is followed by
a globally unique identifier that, depending upon the version of Word
used, may contain the MAC address of the system on which the file was
created.
Checking the user properties (as already mentioned) is a good way to
see who the creator of the original file was...so, if the document was
not created from scratch and was instead originally created on another
system, then the user information will be for the original file.
Another way to locate the "culprit" in this case is to parse the
contents of the NTUSER.DAT files for each user on each computer. While
this sounds like a lot of work, it really isn't...b/c you're only
looking for a couple of pieces of information. Specifically, you're
interested in the MRU keys for the version of Word being used, as well
as perhaps the RecentDocs keys."
The one thing I can think of off the top of my head is inspecting the newline characters in your file - I'm assuming your files do have multiple lines. If the file was generated on Windows, a newline would be the combination of carriage return and line feed characters (CR+LF), whereas a lone line feed (LF) would be a hint that the file was generated on a Linux machine.
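A quick way to check this without a hex editor is to count the line endings in the raw bytes. Below is a minimal Python sketch; `count_line_endings` is a hypothetical helper name. Keep in mind the result is only a hint, since many editors can write either style on any platform.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF pairs, lone LF and lone CR bytes in raw file contents."""
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf  # LF bytes not preceded by CR
    cr = data.count(b"\r") - crlf  # CR bytes not followed by LF
    return {"CRLF": crlf, "LF": lf, "CR": cr}
```

Read the file in binary mode, for example `count_line_endings(open(path, "rb").read())`, so that Python does not translate the newlines before you can count them.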
Right-click on the file --> Details. You can see the computer name where it was created and the date.

Rename file containing '©' character

We received as input in our application (running on Windows) a list of files. These files were automatically extracted from a database with a script.
Apparently some of the names contain special characters (like accents), and these characters are rendered as '©' on our side.
How can we programmatically rename these text files (around 900'000) to get rid of this character?
We can neither change the source nor re-extract the files.
The problem is that because of this character another program involved with our system does not accept the files.
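Have a look at the Unix command rename. It lets you apply a Perl regex to the names of a bunch of files. In this case you might want something like the following (the g flag removes every offending character rather than just the first, and ., _ and - are kept so that extensions survive):
$ rename 's/[^a-zA-Z0-9._-]//g' *
On Debian the rename command is part of the perl package. It should also be available on CPAN.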
I ended up creating a new script that reads the input files and searches for special characters in their names.
It was quite easy indeed:
filename = filename.Replace("©", "e");
Since the '©' is in the filename, the script (in C#) is able to recognize it and replace the match accordingly. In this way I can loop through all the folders and subfolders, simply reading each filename and changing the special characters.
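For anyone wanting the whole loop rather than just the replace call, here is roughly the same idea as a Python sketch. It is not the script I actually used: `clean_name`, `rename_tree` and the `REPLACEMENTS` table are hypothetical, and the single '©' to 'e' mapping is just the example from above. The sketch skips a rename when the target name already exists instead of overwriting it.

```python
import os

# Hypothetical mapping; extend it with whatever characters show up.
REPLACEMENTS = {"©": "e"}


def clean_name(name: str, replacements=REPLACEMENTS) -> str:
    """Apply every replacement in the mapping to a single name."""
    for bad, good in replacements.items():
        name = name.replace(bad, good)
    return name


def rename_tree(root: str, replacements=REPLACEMENTS) -> int:
    """Rename files and folders under root; return how many were renamed."""
    renamed = 0
    # Walk bottom-up so children are renamed before their parent folder.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames + dirnames:
            new_name = clean_name(name, replacements)
            if new_name == name:
                continue
            src = os.path.join(dirpath, name)
            dst = os.path.join(dirpath, new_name)
            if os.path.exists(dst):
                continue  # never overwrite an existing file
            os.rename(src, dst)
            renamed += 1
    return renamed
```

With around 900'000 files it is worth running this on a copy of a small subfolder first and checking how many names collide after cleaning.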
Thank you all for the contributions!