How to read ASCII files with mixed line endings (Windows and Unix) and UTF-16 Big Endian files in SAP?
Background: our ABAP application must read some of our configuration files. Most of them are ASCII files (normal text files) and one is UTF-16 Big Endian. So far, the files were read in ASCII mode and everything was fine during our tests.
However, the following happened at customers: the configuration files are located on a Linux system, so they have Unix line endings. People fetch the configuration files via FTP or similar and transfer them to a Windows machine. On the Windows machine, they adapt some of the settings. Depending on the editor, our customers now end up with mixed line endings.
Those mixed line endings cause trouble when reading the file in ASCII mode in ABAP: the file is only read up to the point where the line endings change (plus a bit more), not to the end.
I suggested reading the file in BINARY mode, removing all CRs, and then replacing all remaining LFs with CR LF. That worked fine, except for the UTF-16 BE file, where this approach produces a mess, so the whole change was reverted.
I'm not an ABAP developer; I just have to test this. With my background in other programming languages I have to assume there is a solution, and I am inclined to reject a "CAN'T FIX" resolution of this bug.
You can use CL_ABAP_FILE_UTILITIES=>CHECK_FOR_BOM to determine which encoding the file has, and then use the constants of class CL_ABAP_CHAR_UTILITIES to process it further.
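A minimal sketch of how that could look once CHECK_FOR_BOM has told you the encoding (the lv_* names and the 'UTF-16BE' codepage literal are my assumptions; please verify the exact class signatures on your system): read the file as raw bytes so mixed line endings cannot truncate it, decode it, and then normalize the line endings with the CL_ABAP_CHAR_UTILITIES constants.

* Hypothetical sketch - the lv_* names and the codepage literal are assumptions.
DATA: lv_file TYPE string VALUE '/path/to/config.txt',
      lv_raw  TYPE xstring,
      lv_text TYPE string.

* Read the whole file as raw bytes, so mixed line endings cannot truncate it.
OPEN DATASET lv_file FOR INPUT IN BINARY MODE.
READ DATASET lv_file INTO lv_raw.
CLOSE DATASET lv_file.

* Decode according to the encoding detected via CHECK_FOR_BOM
* (shown here for the UTF-16 BE case; use 'UTF-8' or your single-byte
* codepage for the plain text files).
lv_text = cl_abap_codepage=>convert_from( source   = lv_raw
                                          codepage = 'UTF-16BE' ).

* Normalize line endings: CR LF -> LF first, then every LF -> CR LF,
* using the constants of CL_ABAP_CHAR_UTILITIES.
REPLACE ALL OCCURRENCES OF cl_abap_char_utilities=>cr_lf
  IN lv_text WITH cl_abap_char_utilities=>newline.
REPLACE ALL OCCURRENCES OF cl_abap_char_utilities=>newline
  IN lv_text WITH cl_abap_char_utilities=>cr_lf.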
Related
I'm having a little problem automating a test with UFT (VBScript). I need to open a CSV file, modify it, and then save it again. The problem is that when I open the file in Notepad++, it shows the encoding as "UCS-2 LE BOM". This file is then injected into our system for processing, and if I change the encoding to ANSI, the injection fails because the file seems to lose its column structure, and I'm not sure it is readable by the system anymore.
From what I understand, it's not possible to do this directly with VBScript, but any idea how I could do it with PowerShell, for example? Or is there a Notepad++ command line option to change the encoding of a file?
Thanks
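If PowerShell is acceptable, here is a minimal sketch (the path and the replacement values are placeholders): Get-Content detects the BOM when reading, and Set-Content -Encoding Unicode writes the file back as UTF-16 LE with a BOM, which is what Notepad++ displays as "UCS-2 LE BOM", so the round trip should keep the structure your system expects.

# Hypothetical sketch - path and replacement values are placeholders.
$path  = 'C:\data\input.csv'
$lines = Get-Content -Path $path                      # BOM is detected automatically
$lines = $lines -replace 'OLD_VALUE', 'NEW_VALUE'     # modify as needed
$lines | Set-Content -Path $path -Encoding Unicode    # UTF-16 LE with BOM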
I have a bunch of text files that need cleaning up. Example:
`E..4B?#.#...
..9J5.....P0.z.n9.9.. ........
.k#a..5
E...y^#.r...J5..
E...y_#.r...J5..
..9.P..n9..0.z............
….2..3..9…n7…..#.yr`
Is there any way sed can do this? Like notice weird patterns?
For this answer, I will assume that you have access to standard unix/linux tools.
Your file might be in some word-processor format. If so, the best way to get rid of the junk is to open it with that program. You may be able to find out which with file:
$ file mysteryfile
mysteryfile: Composite Document File V2 Document, Little Endian, Os: Windows, Version 6.1 ....
If that doesn't work, there is a standard unix utility for extracting text from binary files. It is called strings:
$ strings mysteryfile
Some
Recovered Text
...
The behavior of strings can be fine-tuned with several options. See man strings.
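For example (the values here are only illustrations): -n sets the minimum length of a printable run to report, and -e selects the character encoding to scan for, which helps if the text is stored as 16-bit characters.
$ strings -n 8 mysteryfile     # only report runs of 8 or more printable characters
$ strings -e l mysteryfile     # scan for 16-bit little-endian encoded text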
Here's the simplified version of my problem: I have two text files, with different data but an identical first line, generated by the same program, although possibly on different OSes. When Emacs reads one of them it says it is in DOS format, but it does not say so for the other.
I used several hex editors (Bless, GHex, Okteta on Kubuntu) and in all of them I see the same thing, which is that every line ends with the sequence 0D 0A (CR LF) in both files, including the last line.
So my question is: how does Emacs determine what is a DOS file and what is not, and is there something else in the file that the hex editor would not show, or add?
Both files have the same name, in different directories. Also, I came upon this problem because I have C++ code that parses strings and fails on the file that Emacs lists as DOS, so the issue is really with the file content.
Last note: you will notice there is no C/C++ tag. I'm not looking for advice on how to modify my C++ code to handle the situation. I know how to do it.
Thanks for your help
Emacs handles DOS files by converting the CRLF to LF when reading the file and then converting the LF back into CRLF when writing it out. So if there is a lone LF in the file, reading and writing it would end up adding a CR even if the buffer had not been modified. For this reason, if there is such a lone LF hidden in the middle of the file, Emacs will treat the file not as a DOS file but as a UNIX file.
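If you want to check for such a lone LF from the shell, here is an illustrative sketch: count the lines that end in CR and compare that with the total number of lines; if the two numbers differ, some lines end in a bare LF.
$ grep -c $'\r$' thefile     # lines that end in CR LF
$ wc -l < thefile            # all LF-terminated lines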
What can cause Notepad++ to write new lines as CRLF in one file but only LF in the other?
Both files were created in the same folder on the same OS, and no modifications to the Notepad++ preferences were made, AFAIK... Is there any option in Notepad++ that changes how new lines are written?
Go to Edit -> EOL Conversion and change the setting to Windows Format, Unix Format, or Old Mac Format (depending on your preference).
PowerShell's $OutputEncoding defaults to ASCII. PowerShell represents strings internally in Unicode. When I create scripts using the ISE, they are saved in Unicode.
The following command sends text to a file in Unicode:
echo Testing > test.txt
When I push these files to GitHub, the code view chokes on them because they aren't UTF-8.
I'm unsure what the best solution is here, ideally one with the least amount of work at commit time.
I know I could convert each file and then commit, but that seems cockeyed to me. Clearly, I can't change how PowerShell represents strings internally, nor would I want to.
What are others doing?
The ISE preserves an existing file's encoding, but when you create a new file with the ISE it always creates the file with Unicode encoding. This has nothing to do with $OutputEncoding. IIRC it was deemed a bug late in the dev cycle - too late to fix. Anyway, you can work around this by going to the command window the first time you save a file and executing:
$psISE.CurrentFile.Save([Text.Encoding]::ASCII)
After that, you can just press the Save button.
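For the files that already exist, here is a minimal sketch of a one-off conversion before committing (the path and the *.ps1 filter are my assumptions): read each script and write it back as UTF-8, which GitHub's code view renders correctly. Note that the > redirection operator is, as far as I know, essentially Out-File with its default Unicode encoding, so Out-File -Encoding utf8 is the equivalent fix for new output.

# Hypothetical sketch - adjust the path and the filter to your repository.
Get-ChildItem -Path . -Filter *.ps1 -Recurse | ForEach-Object {
    (Get-Content -Path $_.FullName) | Set-Content -Path $_.FullName -Encoding UTF8
}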