mainframe - what is the name of extension suffix on filenames that change? - db2

I recall that when doing an FTP get to copy files from the mainframe to Windows, there would always be some numeric suffix on the filenames that changed each day, e.g. abc.4328, which would later become abc.23595, etc. What is the concept/terminology for this changing suffix in the mainframe world?

Leaving aside mainframe files residing in the Unix file system (z/OS is a flavor of Unix and has been for some years now), mainframe files do not have an extension or suffix.
Mainframe file names (called DataSet Names or DSNs) take the form HLQ[.Q1[.Q2[.Qn]]] where HLQ is the High Level Qualifier and Q1...Qn are subsequent qualifiers separated from the HLQ and each other by full-stops. The entire DSN must be no more than 44 characters. Each qualifier must be composed of alphabetic, numeric, and what IBM calls "national" characters which (in the USA anyway) are @, # and $. Additionally, a qualifier may not begin with a numeric character. There are exceptions to this which, in my opinion, are best avoided.
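To make the naming rules above concrete, here is a minimal sketch of a checker in Python. It only encodes the rules described in this answer, with the national characters taken to be @, # and $. It also assumes the usual one-to-eight character length per qualifier and uppercase names, which the answer does not spell out. The function name is_valid_dsn is made up for illustration, and the exceptions mentioned above are deliberately ignored.

```python
import re

# One qualifier: first character alphabetic or national (@ # $),
# remaining characters alphabetic, numeric, or national.
# Assumption: 1 to 8 characters per qualifier, uppercase only.
QUALIFIER = re.compile(r"[A-Z@#$][A-Z0-9@#$]{0,7}")


def is_valid_dsn(dsn: str) -> bool:
    """Return True if dsn follows the simplified rules sketched above."""
    if not 1 <= len(dsn) <= 44:
        return False
    # Every dot-separated qualifier must match the pattern in full.
    return all(QUALIFIER.fullmatch(q) is not None for q in dsn.split("."))


print(is_valid_dsn("ABC.DEF.G0011V00"))  # a GDG-style name
print(is_valid_dsn("ABC.4328"))          # qualifier starts with a digit
```

Under these rules a name such as ABC.4328 is rejected, because its second qualifier begins with a numeric character.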
As Bruce Martin indicates in his comment, mainframes have the concept of Generation Data Groups (GDGs) which have a lowest level qualifier taking the form GnnnnVnn generated by the operating system where the four digits between the G and V are the "generation number" and the two digits following the V are the "version number." The generation number is incremented by the operating system each time a new instance of the file is created.
So it is possible you are thinking of a GDG. Be advised that the GDG lowest level qualifier is not dependent on date or time, it merely indicates the order in which the instances of the dataset were created.
GDGs are normally accessed not by absolute generation number but by relative generation number. If ABC.DEF is a GDG and there are four extant generations ABC.DEF.G0008V00, ABC.DEF.G0009V00, ABC.DEF.G0010V00, ABC.DEF.G0011V00 then a reference to ABC.DEF(0) would be shorthand for ABC.DEF.G0011V00. A reference to ABC.DEF(-1) would be shorthand for ABC.DEF.G0010V00. Referencing relative generation (0) is always a reference to the most recently created instance of the GDG.
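The relative-to-absolute mapping described above can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, not how the operating system implements it: the function name resolve_relative is hypothetical, the list of extant generations is passed in by hand, and wraparound of the generation number is ignored.

```python
import re

# Lowest-level qualifier of a GDG generation: GnnnnVnn
GDG_SUFFIX = re.compile(r"\.G(\d{4})V(\d{2})$")


def resolve_relative(generations, relative):
    """Map a relative generation number (0, -1, -2, ...) to an absolute name.

    generations: names of the extant generations, in any order.
    relative:    0 for the newest generation, -1 for the one before, etc.
    """
    if relative > 0:
        # This sketch only resolves generations that already exist.
        raise ValueError("relative generation must be 0 or negative")
    # Order by the four-digit generation number (wraparound ignored here).
    ordered = sorted(generations,
                     key=lambda name: int(GDG_SUFFIX.search(name).group(1)))
    index = len(ordered) - 1 + relative
    if index < 0:
        raise LookupError("no such generation")
    return ordered[index]


extant = ["ABC.DEF.G0008V00", "ABC.DEF.G0009V00",
          "ABC.DEF.G0010V00", "ABC.DEF.G0011V00"]
print(resolve_relative(extant, 0))   # ABC.DEF.G0011V00
print(resolve_relative(extant, -1))  # ABC.DEF.G0010V00
```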
A mainframe dataset may also be a PDS (Partitioned DataSet). Partitioned datasets have "members" and are conceptually slightly similar to (though implemented very differently from) directories on PC or Unix file systems. A PDS may contain many related members, such as utility control statements, where there is a desire to manage them as a group.
PDS names follow the same rules as normal DSNs, and member names follow the same rules as normal DSN qualifiers, but referring to a member requires specifying it in parentheses. If MY.DATA is a PDS and I wish to access a member whose name is XYZ I would specify MY.DATA(XYZ).
Note that the format of a dataset is not necessarily indicated in its name. That a dataset is, e.g. a PDS containing fixed 100 byte records is recorded as metadata in the file system.

Great response from @cschneid above. To add to it:
There's doc from IBM on GDGs on the z/OS Basic Skills page - https://www.ibm.com/support/knowledgecenter/zosbasics/com.ibm.zos.zconcepts/zconcepts_175.htm
There are several dataset types - a GDG isn't really a different dataset organization, it's just a special naming convention that indicates relative "age". There are sequential datasets ("flat files"), partitioned datasets (sort of like a collection of flat files surrounded by a directory), VSAM datasets (a very long topic), and a few other esoteric types that aren't used much these days.
GDGs are a pretty slick way of naming (non-VSAM) datasets with version numbers that can be referenced in JCL or line commands using those relative version numbers. But it's just naming.
Trivia: In places I've worked, systems programmers and operations staff members would often refer to those GnnnnVnn as "goovoo" numbers, b/c they often were numbered G00nnV00.. :-)

Classification of gender for given names

after some research I could not find yet a suitable open source library or software I can use to classify by most likely gender a long table of first names I have.
For my application I have a set of first names from many different countries, and many of them are also pretty exotic.
For example, when I tried to use Genderize I could get only 1/8 of the names classified, while the remaining are labeled as Unknown (I made sure that the format is correct, no lower/upper case ambiguity, etc..).
Any advice would be appreciated. Thank you in advance!
For the record, the best I could find was really just to do it manually, looking up names on Google or on dedicated websites such as https://namepedia.org. I am afraid there is no automated solution for my use case. This is mostly for the following reasons:
Many names are somewhat archaic (I could not even recognise several names of my own nationality)
Many names were truncated to form nicknames or had two adjacent letters swapped: here a lookup-table (LUT) approach would fail, and one would rather need a score from a model
There were several names not based on the Roman alphabet, and I suspect the mapping into Roman characters produced some ambiguities
For those curious of the original dataset, this is part of a Kaggle challenge (Spaceship Titanic, https://www.kaggle.com/competitions/spaceship-titanic).

Using ampersand in SSAS Tabular models

Some people in my company have gone to great lengths to remove & characters from data and measure names in our Tabular models. I wasn't around when they made this decision, but it destroys readability in our financial reporting. Instead of R&D and SG&A in our statements, we have RD and SGA.
The offenders are no longer around to answer for their crimes. I am trying to convince my co-workers to re-add the &, but they won't budge without some idea why this was done in the first place. My best guess is that a consultant told them not to use & in models. I think they meant in object names only, but our team got carried away. I've been able to find this page that says & was a reserved character in Compatibility Level 1100, but that goes back to SQL Server 2012! I think our lowest environment is SSAS 2017.
Am I missing anything or can we re-add & to our data and measure names? Is there any reason you would avoid the use of & anywhere in an SSAS tabular model? Links to documentation appreciated!
https://learn.microsoft.com/en-us/analysis-services/multidimensional-models/olap-physical/object-naming-rules-analysis-services?view=asallproducts-allversions
Exceptions: When Reserved Characters are Allowed
As noted, databases of a specific modality and compatibility level can have object names that include reserved characters. Dimension attribute, hierarchy, level, measure and KPI object names can include reserved characters, for tabular databases (1103 or higher) that allow the use of extended characters:
Server mode and database compatibility level Reserved characters allowed?
Databases can have a ModelType of default. Default is equivalent to multidimensional, and thus does not support the use of reserved characters in column names.

How to create a list of hex number in 8085 assembler?

I need to know how to create a list of numbers in 8085 assembler and store the list in successive memory locations.
Most assemblers will have a set of define statements and a way to specify different bases.
For example, the values zero, one and forty-two, along with a short nul-terminated string, may be created with something like:
some_vals: db 0, 1, 2Ah, 'hello', 0
How your assembler does it is probably in the documentation somewhere. Without more specific details on which assembler you're using, there is not much more help I can give.
Pseudo-ops like data definition (db, dw, ds), address specification (org) or label setting (mylabel:) do not generally form part of the processor documentation itself, rather they're a function of the assembler.
See, for example, chapter 4 of this document. I particularly love the fact that we used to buy these 200-page books for $3.95 whereas now you'll be shelling out a hundred bucks for a digital copy with no incremental cost of production :-)

Term for diff/delta on multiple files or data structures

I would like to know whether there is a proper term to describe "diffing" of / obtaining the delta between multiple files or data structures, such that the resulting "diff" contains first a description of the parts common to all files/structures, then descriptions of how this "base" file/structure must be modified to obtain the individual ones, ideally in a hierarchical fashion if some files/structures are more similar to each other than others.
There are some questions and answers about how to do this with certain tools (e.g. DIFF utility works for 2 files. How to compare more than 2 files at a time?), but as I want to do this for a specific type of data structure (namely JSON), I'm at a loss as to what I should even search for.
This type of problem seems to me like it should be common enough to have a name such as "hierarchical diff" (which however seems to be reserved for 2-way diffs on hierarchical data structures), "commonality finding", or something like that.
I guess a related concept about hierarchical ordering of commonalities and differences is formal concept analysis, but this operates on sets of properties rather than hierarchical data structures and won't help me much.
There are multiple valid terms:
Data comparison (or Sequence comparison)
Delta encoding
Delta compression (or Differential compression)
Algorithms:
An O(ND) Difference Algorithm and Its Variations (Eugene W. Myers)
A technique for isolating differences between files (Paul Heckel)
The String-to-String Correction Problem with Block Moves (Walter Tichy)
Good Wikipedia links
Longest common subsequence problem
Comparison of file comparison tools
Diff Unix Utility
Some implementations
diff-match-patch (Neil Fraser - Google)
jsdifflib
jsondiffpatch
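For the JSON case in the question, the "common base plus per-file deltas" idea can be sketched as below. This is a minimal illustration and not an established algorithm or the API of any library listed above. The names common_base and delta are made up, it assumes the top-level values are JSON objects, it treats lists as atomic values, and it produces a single flat base rather than the hierarchical grouping the question ideally wants.

```python
MISSING = object()  # sentinel: "no value shared by all documents"


def common_base(docs):
    """Return the part shared by every document in the non-empty list docs."""
    first, rest = docs[0], docs[1:]
    if all(isinstance(d, dict) for d in docs):
        base = {}
        for key in first:
            if all(key in d for d in rest):
                shared = common_base([d[key] for d in docs])
                if shared is not MISSING:
                    base[key] = shared
        return base
    # Non-dict values (numbers, strings, lists) are shared only if all equal.
    return first if all(d == first for d in rest) else MISSING


def delta(base, doc):
    """Return what must be added to base in order to obtain doc."""
    extra = {}
    for key, value in doc.items():
        if key not in base:
            extra[key] = value
        elif isinstance(value, dict) and isinstance(base[key], dict):
            nested = delta(base[key], value)
            if nested:
                extra[key] = nested
    return extra


docs = [
    {"name": "a", "opts": {"debug": True, "level": 1}},
    {"name": "b", "opts": {"debug": True, "level": 2}},
    {"name": "c", "opts": {"debug": True}},
]
base = common_base(docs)
print(base)                            # {'opts': {'debug': True}}
print([delta(base, d) for d in docs])
```

Because the base only ever contains values present in every document, each delta consists purely of additions. A hierarchical version could apply the same step recursively to clusters of similar documents.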

Fastest method of checking if multiple different strings are a substring of a 2nd string

Context:
I'm creating a program which will sort and rename my media files which are named e.g. The.Office.s04e03.DIVX.WaREZKiNG.avi into an organized folder structure, which will consist of a list of folders for each TV Series, each folder will have a list of folders for the seasons, and those folders will contain the media files.
The problem:
I am unsure what the best method is for reading a file name and determining which part of that name is the TV show. For example, in "The.Office.s04e03.DIVX.WaREZKiNG.avi", The Office is the name of the series. I decided to have a list of all TV shows and to check whether each TV show is a substring of the file name, but as far as I know this means I have to check every single series against the name for every file.
My question: How should I determine if a string contains one of many other strings?
Thanks
The Aho-Corasick algorithm[1] efficiently solves the "does this possibly long string exactly contain any of these many short strings" problem.
However, I suspect this isn't really the problem you want to solve. It seems to me that you want something to extract the likely components from a string that is in one of possibly many different formats. I suspect that having a few different regexps for likely providers, video formats, season/episode markers, perhaps a database of show names, etc, is really what you want. Then you can independently run these different 'information extractors' on your filenames to pull out their structure.
[1] http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm
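For reference, a compact Aho-Corasick matcher might look like the sketch below. It is a plain textbook-style implementation written for illustration (the function names build_automaton and find_all are made up), not the code of any particular library. The show list is hypothetical, and lowercasing the file name is an assumption made here to get case-insensitive matching.

```python
from collections import deque


def build_automaton(patterns):
    """Build the trie, failure links and output lists for the patterns."""
    goto, fail, out = [{}], [0], [[]]   # node 0 is the root
    for pat in patterns:
        node = 0
        for ch in pat:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(pat)
    # Breadth-first pass: depth-1 nodes keep fail = root.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit matches that end at the failure target.
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def find_all(text, automaton):
    """Return (start_index, pattern) for every occurrence in text."""
    goto, fail, out = automaton
    node, hits = 0, []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for pat in out[node]:
            hits.append((i - len(pat) + 1, pat))
    return hits


shows = ["the.office", "lost", "office"]   # hypothetical show list
automaton = build_automaton(shows)
print(find_all("The.Office.s04e03.DIVX.WaREZKiNG.avi".lower(), automaton))
# [(0, 'the.office'), (4, 'office')]
```

The automaton is built once and then reused for every file name, so each name is scanned in a single pass regardless of how many shows are in the list.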
It depends on the overall structure of the filenames. For instance, is the series name always first? If so, a tree structure would work well. Is there a standard separator between words (a period in your example)? If so, you can split the string on it and build a case-insensitive hashtable of interesting words to boost performance.
However, extracting seasons and episodes is more difficult. A simple solution would be to implement an algorithm to handle each format you uncover, although by using hints you could create an interesting parser if you wanted to (likely overkill, however).
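As a rough sketch of the split-and-lookup idea, assuming the series name always comes first, words are separated by periods, and the season/episode marker has the sXXeYY form from the example: the KNOWN_SHOWS table and the parse_filename function are hypothetical names, and other marker formats (such as 4x03) would need their own patterns.

```python
import re

# Hypothetical lookup table: lowercase key -> display name.
KNOWN_SHOWS = {"the office": "The Office", "lost": "Lost"}

# Season/episode marker in the sXXeYY form, matched case-insensitively.
EPISODE_MARKER = re.compile(r"^s(\d{1,2})e(\d{1,2})$", re.IGNORECASE)


def parse_filename(filename):
    """Return (show, season, episode), or None if nothing is recognised."""
    tokens = filename.split(".")
    for i, token in enumerate(tokens):
        match = EPISODE_MARKER.match(token)
        if match:
            # Everything before the marker is taken to be the series name.
            candidate = " ".join(tokens[:i]).lower()
            show = KNOWN_SHOWS.get(candidate)
            if show is None:
                return None
            return show, int(match.group(1)), int(match.group(2))
    return None


print(parse_filename("The.Office.s04e03.DIVX.WaREZKiNG.avi"))
# ('The Office', 4, 3)
```

With this layout each file name needs one dictionary lookup instead of a comparison against every show.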