I have a COBOL program which generates a sequential file with this structure:
FD ALUMNOS-FILE.
01 ALUMNOS-DATA.
88 EOF VALUE HIGH-VALUES.
05 STUDENTID PIC 9(7).
05 STUDENTNAME PIC X(10).
05 FILLER PIC X(8).
05 COURSECODE PIC X(4).
05 FOO PIC S9(7)V USAGE COMP-3.
If I open the file in Notepad++, I see strange Unicode symbols, caused by the COMP-3 variable, which are difficult to read. Something similar to the image below (the image is from another file):
Is there any way, without using COBOL, to rewrite this sequential file to be readable? Maybe using a scripting language like VBS? Any tip or advice will be appreciated, and if you need more info let me know and I'll edit the post.
I would suggest having a look at the Last Cobol Questions.
But the RecordEditor will let you view/edit Cobol files using a Cobol copybook. In the RecordEditor you can export the file as Csv or Xml if you want.
As mentioned in the Last Cobol Questions, there are several solutions for reading Cobol files in Java, and probably some in other languages.
To import the Cobol Copybook into the RecordEditor,
Select: Record Layout >>> Import Cobol Copybook
The File-Structure controls how the file is read. Use a File-Structure of Fixed Length Binary if all the records are of the same length (no carriage return).
Other structures supported include:
Standard Text files (File Structures: **..Text..**)
Cobol Variable Record Length files. These typically have a record length followed by the data. There are versions for Mainframes, Open-Cobol and Fujitsu.
The Default File-Structure will choose the most likely File-Structure based on the record definition. In your case it should choose Fixed Length Binary because there is a binary field in the definition.
Note: From RecordEditor 0.94.4, with a File-Structure of Fixed Length Binary, you can edit fixed-length text files in a basic text editor if you want.
Note: I am the author of RecordEditor
Answer Updates 2017/08/09
Conversion Utilities
For simple Cobol files these conversion utilities (based on JRecord) could be used:
[CobolToCsv][5]
[CobolToXml][6]
[Cobol To Json][7]
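If you would rather script the conversion yourself (the question mentions VBS; the same idea in Python is shown below), unpacking COMP-3 by hand is not much code. This is a minimal sketch, not a drop-in tool: the 33-byte record length comes from the copybook above, while the file names and the EBCDIC (cp037) text encoding are assumptions to adjust for your system.

    # comp3_dump.py - minimal sketch; file names and text encoding are assumptions
    RECORD_LEN = 33          # 9(7)=7 + X(10)=10 + X(8)=8 + X(4)=4 + COMP-3=4
    TEXT_ENCODING = "cp037"  # EBCDIC; try "latin-1" if the file was written in ASCII

    def unpack_comp3(raw):
        """Packed decimal: two digits per byte, last nibble is the sign
        (0xD = negative, 0xC or 0xF = positive)."""
        value = 0
        for b in raw[:-1]:
            value = value * 100 + (b >> 4) * 10 + (b & 0x0F)
        value = value * 10 + (raw[-1] >> 4)
        return -value if (raw[-1] & 0x0F) == 0x0D else value

    # Assumes fixed-length records with no line terminators,
    # as the Fixed Length Binary structure implies.
    with open("ALUMNOS.DAT", "rb") as src, open("alumnos.csv", "w") as out:
        while True:
            rec = src.read(RECORD_LEN)
            if len(rec) < RECORD_LEN:
                break
            student_id = rec[0:7].decode(TEXT_ENCODING)
            name = rec[7:17].decode(TEXT_ENCODING).rstrip()
            course = rec[25:29].decode(TEXT_ENCODING).rstrip()  # FILLER at 17:25 skipped
            foo = unpack_comp3(rec[29:33])
            out.write("%s,%s,%s,%d\n" % (student_id, name, course, foo))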
RecordEditor
The RecordEditor has a Generate option for generating Java / JRecord code.
See [RecordEditor Code Generation notes][8]
Related
I basically work on subtitles, and I have this Arabic file. When I open it in Notepad, right-click and select SHOW UNICODE CONTROL CHARACTERS, it gives me some weird characters on the left of every line. I tried so many ways to remove them but failed, including:
Notepad++
Subtitle Edit
Excel
Word
288
00:24:41,960 --> 00:24:43,840
أتعلم، قللنا من شأنك فعلاً
289
00:24:44,000 --> 00:24:47,120
كان علينا تجنيدك لتكون جاسوساً
مكان (كاي سي)
290
00:24:47,280 --> 00:24:51,520
لا تعلمون كم أنا سعيد
لسماع ذلك
291
00:24:54,800 --> 00:24:58,160
لا تقلق، سيستيقظ نشيطاً غداً
292
00:24:58,320 --> 00:25:00,800
ولن يتذكر ما حصل
في الساعات الـ٦
The Unicode characters are not showing in this; the character is U+202B, which shows as a ¶ sign. After googling it, I think it's called a PILCROW.
The issue with this is that it doesn't display the subtitles correctly in the PS4 app.
I need this PILCROW sign to go away. With this website I can see the issue in this file: https://www.soscisurvey.de/tools/view-chars.php
The PILCROW ¶ is used by various software and publishers to show the end of a line in a document. That actual Unicode character does not exist in your file, so it is not what you need to remove.
The Unicode characters in these lines are RIGHT-TO-LEFT EMBEDDING (U+202B) and POP DIRECTIONAL FORMATTING (U+202C) - these are used in the text to indicate that the enclosed text should be rendered right-to-left instead of the Western left-to-right direction.
Now, these characters are included as hints to the application displaying the text, rather than actually performing the text reversal - so they can likely be removed without compromising the display of the text itself.
This is a programming Q&A site, but you did not indicate any programming language you are familiar with - at least enough to run a program - so it is very hard to know how to give an answer that is suitable for you.
Python can be used to create a small program to filter such characters from a file, but I am not willing to write a full-fledged GUI program or a web app just as an answer here. A program that works from the command line just to filter out a few characters is another thing - it is just a few lines of code.
You have to store the following listing in a file named, say, "fixsubtitles.py", and then, in a terminal ("cmd" if you are on Windows), type python3 fixsubtitles.py \path\to\subtitlefile.txt and press Enter.
That, of course, after installing the Python 3 runtime from http://python.org
(if you are on Mac or Linux it is already pre-installed)
import sys
from pathlib import Path

encoding = "utf-8"
# Translation table that maps U+202B and U+202C to nothing (i.e. deletes them)
remove_set = str.maketrans("", "", "\u202b\u202c")

if len(sys.argv) < 2:
    print("Usage: python3 fixsubtitles.py [filename]", file=sys.stderr)
    sys.exit(1)

path = Path(sys.argv[1])
data = path.read_text(encoding=encoding)
path.write_text(data.translate(remove_set), encoding=encoding)
print("Done")
You may need to adjust the encoding, as Windows does not always use UTF-8 (the file may be in, for example, "cp1256" - if you get a Unicode error when running the program, try that in place of "utf-8"), and maybe add more characters to the set of characters to be removed - the tool you linked in the question should show you other such characters, if any. Other than that, the program above should work.
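If you want to check for other format-control characters before deciding what to strip, a small sketch along the same lines (again assuming UTF-8 input; unicodedata is in the standard library) can list every such character in the file:

    # listcontrols.py - sketch; pass the subtitle file path as the first argument
    import sys
    import unicodedata
    from pathlib import Path

    text = Path(sys.argv[1]).read_text(encoding="utf-8")
    # Category "Cf" (format) covers U+202B, U+202C and their relatives
    controls = {ch for ch in text if unicodedata.category(ch) == "Cf"}
    for ch in sorted(controls):
        print("U+%04X  %s" % (ord(ch), unicodedata.name(ch, "<unnamed>")))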
I am familiar with the .zip file format and am able to read the internal file table content so far.
The problem occurs with non-English characters in the file name.
The specification states that file names use the OEM character set, yet sometimes I get a UTF-8 representation and sometimes I get an OEM representation.
The specification states the "version made by" field should be in the range 0-20, yet I get versions 31 and 63, which may or may not affect the character set.
Another related problem: when I read the "extra field" there is "up" (Unicode path, id=0x7075), which is supposed to store the UTF-8 representation of the filename. Well, it starts with 5 redundant bytes before the actual UTF-8 string (created by WinRAR), yet other software seems to read it correctly.
Any input about the issue?
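For what it's worth, the five leading bytes are not redundant: the Info-ZIP Unicode Path extra field (id 0x7075) is defined in the ZIP APPNOTE as a 1-byte version followed by a 4-byte CRC-32 of the standard file name, then the UTF-8 name. A minimal Python sketch of the parsing (assuming data holds the field body, without the 2-byte header id and 2-byte size):

    import struct

    def parse_unicode_path(data):
        """Parse an Info-ZIP Unicode Path extra field (id 0x7075) body."""
        version = data[0]                               # currently always 1
        (name_crc32,) = struct.unpack("<I", data[1:5])  # CRC-32 of the standard name
        utf8_name = data[5:].decode("utf-8")
        return version, name_crc32, utf8_name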
I have a file named "→Ψjohn.txt" and I want to remove those special characters from the file name, updating it to "john.txt". But Talend recognizes those characters as a thick vertical line, so it does not recognize the source file in the physical location. Can anyone please suggest a solution?
I have this file in the database as well as in the physical location, and when I read the file from the database it has to remove the special characters and update the name both in the database and in the physical location.
In the database the file looks like this:
[screenshot: database]
When I am reading it from the database using Talend, it looks like the following:
[screenshot: talend]
Thanks in advance
You don't have to specify the exact file name; you can use tFileList to get all files in a specific directory, and you can also use a regex to mask some names, for example to iterate over all *john.txt.
Once you get the actual filename, use a regex to remove the unwanted characters (for example, \W for non-word characters) and rename the file through a system command or using tFileCopy.
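Outside of Talend, the same idea can be prototyped in a few lines of Python (the directory name is hypothetical). Note that plain \W would also strip the dot in .txt, and that Ψ counts as a word character in Unicode, so the character class below is ASCII-only and keeps dots, dashes and spaces:

    import re
    from pathlib import Path

    for path in Path("incoming").glob("*john.txt"):  # hypothetical directory
        # ASCII-only \w: without re.ASCII, Greek letters like Ψ count as word chars
        clean = re.sub(r"[^\w.\- ]", "", path.name, flags=re.ASCII)
        if clean != path.name:
            path.rename(path.with_name(clean))       # "→Ψjohn.txt" -> "john.txt"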
After seeing the pictures, this is not a Talend issue but has to do with the font used in the Talend (Eclipse) console and the encoding setting of Java.
Those rectangles (better visible with a bigger font size) show that the font cannot represent your characters - it has no symbols for them.
Talend (Eclipse) settings
In Talend, navigate to Window / Preferences and choose General / Appearance / Colors and Fonts (as described in the Eclipse help). Check what font you are using for Debug / Console font. I've had good results with the font Consolas, which is set from Talend version 6 onwards. Beforehand it was Courier New.
Java encoding
You should check that Java uses UTF-8 encoding to display the characters. The console encoding has to be set to UTF-8. See this answer for an explanation of how to do this.
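For reference, a common way to do this (an assumption on my part; the linked answer has the details for your setup) is to pass the file.encoding property as a VM argument in the Run Configuration:

    -Dfile.encoding=UTF-8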
Alternative
Alternatively you could store all the log data into a file and open this file in e.g. Notepad++ to see if the output is generated correctly and only displayed wrong.
I've been searching around without any luck for an MSDN or any other official specification which describes how 2-digit years are interpreted in a date-format textbox. That is, when data is manually entered into a textbox on a form, with the format set to Short Date. (My current locale defines dates as yyyy/MM/dd.)
A few random observations (conversion from entered date)
29/12/31 --> 2029/12/31
30/1/1 --> 1930/01/01
So far it makes sense, the range for 2 digit dates is 1930 to 2029. Then as we go on,
1/2/32 --> 1932/01/02 (interpreted as M/d/yy)
15/2/28 --> 2015/02/28 (interpreted as yy/M/dd)
15/2/29 --> 2029/02/15 (interpreted as M/d/yy)
2/28/16 --> 2016/02/28 (interpreted as M/dd/yy)
2/29/15 --> 2029/02/15 (interpreted as M/yy/dd)
It tries to twist invalid dates around so that they are valid in some format, but seems to ignore the system locale setting for dates. Only entries that are invalid in any format (like 0/0/1) seem to generate an error. Is this behavior documented somewhere?
(I only want to refer the end user to this documentation, I have no problem with the actual behavior)
The 29/30 split was settled this way with Access 2.0 as of 1999-12-17 in the Acc2Date.exe Readme File as part of the last Y2K update:
Introduction
The Acc2Date.exe file contains three updated files that modify the way
Microsoft Access 2.0 interprets two-digit years. By default, Access
2.0 interprets all dates that are entered by the user or imported from a text file to fall within the 1900s. After you apply the updated
files, Access 2.0 will treat two-digit dates that are imported from
text in the following manner:
00 to 29 - resolve to the years 2000 to 2029
30 to 99 - resolve to the years 1930 to 1999
Years that are entered into object property sheets, the query design
grid, or expressions in Access modules will be interpreted based on a
100-year sliding date window as defined in the Win.ini on the computer
that is running Access 2.0.
The Acc2Date.exe file contains the following files:
File name Version Description
---------------------------------------------------------------------
MSABC200.DLL 2.03 The Updated Access Basic file
MSAJT200.DLL 2.50.2825 The Updated Access Jet Engine Library file
MSAJU200.DLL 2.50.2819 The Updated Access Jet Utilities file
Readme.txt n/a This readme file
For more information about the specific issues solved by this update,
see the following articles in the Microsoft Knowledge Base:
Article ID: Q75455
Title : ACC2: Years between 00 and 29 Are Interpreted as 1900 to 1929
That article can be found here as KB75455 (delayed page load):
ACC2: Years Between 00 and 29 Are Interpreted as 1900 to 1929
As for 2/29/15, it is not accepted here where the system default is dd-mm-yyyy, so there are limits to how much creativity Access/VBA puts into interpreting date expressions.
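The fixed 29/30 split itself is easy to restate in code; here is a one-function sketch of the documented rule (not of Access's full parsing, which as the question shows also tries alternative field orders):

    def expand_two_digit_year(yy):
        """00-29 -> 2000-2029, 30-99 -> 1930-1999 (the Acc2Date fixed window)."""
        if not 0 <= yy <= 99:
            raise ValueError("expected a two-digit year")
        return 2000 + yy if yy <= 29 else 1900 + yy

    assert expand_two_digit_year(29) == 2029
    assert expand_two_digit_year(30) == 1930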
I have a text file created in Java using UTF-16 encoding.
When I try to import it, I get a validation failure/error on the flat file source before it even begins to move data. The error is that a character is not in the specified code page:
[Flat File Source [908]] Error: Data conversion failed. The data conversion for column "ACTIVE_INGREDIENT" returned status value 4 and status text "Text was truncated or one or more characters had no match in the target code page."
In my Flat File connection, I don't have Unicode selected (as that struggles to find my CR LF line terminators), but have set the code page to 65001-UTF8.
In my flat file data source, I have changed all internal and external columns to DT_WSTR in the advanced editor (it seems I can't change the code page with this option; it is stuck on 0).
I am not doing a data conversion, as I am mapping to NVARCHAR tables (the SSIS job isn't even getting far enough to try to transfer data).
I can't even redirect the rows to a text file to identify them, as I have the same issue trying to output to a flat file destination.
Any help appreciated.
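There is no accepted answer here, but one workaround consistent with the symptoms (a UTF-16 file read against code page 65001) would be to re-encode the file to UTF-8 before the flat file source reads it. A hedged Python sketch, with hypothetical file names:

    from pathlib import Path

    src = Path("active_ingredient_utf16.txt")  # hypothetical input name
    dst = Path("active_ingredient_utf8.txt")   # hypothetical output name

    # Binary round-trip keeps the CR LF terminators intact;
    # decode("utf-16") also consumes the BOM.
    dst.write_bytes(src.read_bytes().decode("utf-16").encode("utf-8"))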