InterSystems Caché - Maintaining Object Code to Ensure Data is Compliant with Object Definition

I am new to using InterSystems Caché and face an issue where I am querying data stored in Caché, exposed by classes which do not seem to accurately represent the data in the underlying system. The data stored in the globals is almost always larger than what is defined in the object code.
As such I get errors like the one below very frequently.
Msg 7347, Level 16, State 1, Line 2
OLE DB provider 'MSDASQL' for linked server 'cache' returned data that does not match expected data length for column '[cache]..[namespace].[tablename].columname'. The (maximum) expected data length is 5, while the returned data length is 6.
Does anyone have any experience with implementing some type of quality process to ensure that the object definitions (SQL mappings) are maintained in such a way that they can accommodate the data which is being persisted in the globals?
Property columname As %String(MAXLEN = 5, TRUNCATE = 1) [ Required, SqlColumnNumber = 2, SqlFieldName = columname ];
In this particular example the system has the column defined with a MAXLEN of 5; however, the data stored in the system is 6 characters long.
How can I proactively monitor and repair such situations?
/*
I did not create these object definitions in cache
*/

It's not completely clear what "monitor and repair" would mean for you, but:
How much control do you have over the database side? Caché runs code for a data type when converting from a global to ODBC, using the LogicalToODBC method of the data-type class. If you change the property types from %String to your own class, AppropriatelyNamedString, then you can override that method to automatically truncate, if that's what you want to do. It is possible to change all the %String property types programmatically using the %Library.CompiledClass class.
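For illustration, a minimal sketch of such a datatype class might look like the following (the class name is made up, and the length is hard-coded to 5 here; a production version would generate the truncation from each property's own MAXLEN parameter):
Class App.TruncatedString Extends %Library.String
{
/// Hypothetical datatype that truncates the value on its way out to ODBC,
/// so ODBC clients never receive more characters than the declared length.
ClassMethod LogicalToODBC(%val As %String = "") As %String
{
    Quit $EXTRACT(%val, 1, 5)
}
}
The affected properties would then be declared As App.TruncatedString instead of As %String.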
It is also possible to run code within Cache to find records with properties that are above the (somewhat theoretical) maximum length. This obviously would require full table scans. It is even possible to expose that code as a stored procedure.
Again, I don't know what exactly you are trying to do, but those are some options. They probably do require getting deeper into the Cache side than you would prefer.
As far as preventing the bad data in the first place, there is no general answer. Cache allows programmers to directly write to the globals, bypassing any object or table definitions. If that is happening, the code doing so must be fixed directly.
Edit: Here is code that might work in detecting bad data. It might not work if you are doing certain funny stuff, but it worked for me. It's kind of ugly because I didn't want to break it up into methods or tags. This is meant to run from a command prompt, so it would probably have to be modified for your purposes.
{
S ClassQuery=##CLASS(%ResultSet).%New("%Dictionary.ClassDefinition:SubclassOf")
I 'ClassQuery.Execute("%Library.Persistent") b q
While ClassQuery.Next(.sc) {
If $$$ISERR(sc) b Quit
S ClassName=ClassQuery.Data("Name")
I $E(ClassName)="%" continue
S OneClassQuery=##CLASS(%ResultSet).%New(ClassName_":Extent")
I '$IsObject(OneClassQuery) continue //may not exist
try {
I 'OneClassQuery.Execute() D OneClassQuery.Close() continue
}
catch
{
D OneClassQuery.Close()
continue
}
S PropertyQuery=##CLASS(%ResultSet).%New("%Dictionary.PropertyDefinition:Summary")
K Properties
s sc=PropertyQuery.Execute(ClassName) I 'sc D PropertyQuery.Close() continue
While PropertyQuery.Next()
{
s PropertyName=$G(PropertyQuery.Data("Name"))
S PropertyDefinition=""
S PropertyDefinition=##CLASS(%Dictionary.PropertyDefinition).%OpenId(ClassName_"||"_PropertyName)
I '$IsObject(PropertyDefinition) continue
I PropertyDefinition.Private continue
I PropertyDefinition.SqlFieldName=""
{
S Properties(PropertyName)=PropertyName
}
else
{
I PropertyName'="" S Properties(PropertyDefinition.SqlFieldName)=PropertyName
}
}
D PropertyQuery.Close()
I '$D(Properties) continue
While OneClassQuery.Next(.sc2) {
B:'sc2
S ID=OneClassQuery.Data("ID")
Set OneRowQuery=##class(%ResultSet).%New("%DynamicQuery:SQL")
S sc=OneRowQuery.Prepare("Select * FROM "_ClassName_" WHERE ID=?") continue:'sc
S sc=OneRowQuery.Execute(ID) continue:'sc
I 'OneRowQuery.Next() D OneRowQuery.Close() continue
S PropertyName=""
F S PropertyName=$O(Properties(PropertyName)) Q:PropertyName="" d
. S PropertyValue=$G(OneRowQuery.Data(PropertyName))
. I PropertyValue'="" D
.. S PropertyIsValid=$ZOBJClassMETHOD(ClassName,Properties(PropertyName)_"IsValid",PropertyValue)
.. I 'PropertyIsValid W !,ClassName,":",ID,":",PropertyName," has invalid value of "_PropertyValue
.. //I PropertyIsValid W !,ClassName,":",ID,":",PropertyName," has VALID value of "_PropertyValue
D OneRowQuery.Close()
}
D OneClassQuery.Close()
}
D ClassQuery.Close()
}

The simplest solution is to increase the MAXLEN parameter to 6 or larger. Caché only enforces MAXLEN and TRUNCATE when saving. Within other Caché code this is usually fine, but unfortunately ODBC clients tend to expect this to be enforced more strictly. The other option is to write your SQL like SELECT LEFT(columnname, 5)...
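With the first option, the property from the question would become, for example:
Property columname As %String(MAXLEN = 6, TRUNCATE = 1) [ Required, SqlColumnNumber = 2, SqlFieldName = columname ];
followed by recompiling the class so that the SQL/ODBC projection reflects the new length.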

The simplest solution, which I use for all Integration Services packages, for example, is to create a query that casts all nvarchar or char data to the correct length. In this way, my data never fails for truncation.
Optional:
First run a query like: SELECT MAX(DATALENGTH(mycolumnname)) FROM cachenamespace.tablename
Your new query: SELECT CAST(mycolumnname AS VARCHAR(6)) AS mycolumnname,
CONVERT(VARCHAR(8000), memo_field) AS memo_field
FROM cachenamespace.tablename
Your pain of getting the data will be lessened but not eliminated.
If you use any type of OLE DB provider, or if you use an OPENQUERY in SQL Server,
the casts must occur in the query sent to the InterSystems Caché database, not in the outer query that retrieves data from the inner OPENQUERY.
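For example, using the names from the question (adjust them to your schema), the cast has to live inside the pass-through text:
SELECT columname
FROM OPENQUERY(cache,
    'SELECT CAST(columname AS VARCHAR(6)) AS columname
     FROM namespace.tablename')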

How to use DISP=SHR on an fopen

With code like:
fopen("DD:LOGLIBY(L1234567)", "w");
and JCL like:
//LOGTEST EXEC PGM=LOGTEST
//LOGLIBY DD DSN=MYUSER.LOG.LIBY,DISP=SHR
I can create PDS(E) members while at the same time browsing the PDS(E) to look at existing members, as expected with DISP=SHR.
If instead I code:
fopen("//'MYUSER.LOG.LIBY(L1234567)'", "w");
The fopen fails if I am browsing the PDS(E) at the time, or the browse of the PDS(E) fails while I have the file open. In other words, there is no DISP=SHR. According to the fopen() documentation, DISP=SHR is the default when using file modes of "r" etc., but not "w".
How can I provide DISP=SHR in the second example?
There are two possibilities...
For anyone not familiar with the internal structure of partitioned datasets (that is, PDS or PDS/E), these datasets are logically divided into two parts: a "directory" having pointers to all the individual members, and a "data" area containing the actual records for the individual members:
PDS: <DIRECTORY BLOCKS>
<MEMBER1>: ADDRESS OF DATA FOR MEMBER1 (xxx)
<MEMBER2>: ADDRESS OF DATA FOR MEMBER2 (yyy)
...
<DIRECTORY FREESPACE>
...
<EOF - END OF THE PDS DIRECTORY>
<DATA PORTION>
+xxx = DATA FOR MEMBER1
...
<EOF - END OF MEMBER1>
+yyy = DATA FOR MEMBER2
...
<EOF - END OF MEMBER2>
...
FREE SPACE (ALLOCATED, BUT UNUSED)
...
END OF PDS
Throughout the next few paragraphs, keep in mind that you can either open the entire PDS/PDSE, which enables you to read/write whatever members you like, or you can allocate and open a single member, which then gets processed like any other sequential file.
First, if you actually have a DD statement coded as you show in the question, then you may simply need to change your open from fopen(dsn,...) to fopen("DD:ddname",...). If you're running under the UNIX Shell or you do something that results in your process running in a different address space (such as fork()), then this might not work, but it may be worth a try. If you do this with the JCL you show, the challenge would be managing the PDS/E directory - you'd need to issue your own "STOW" when you create or update a member, since the JCL allocates the entire dataset, not just a single member. The sequence would be:
Open the DD for output.
Write your data.
Update the PDS or PDS/E directory with new member information (this is where the STOW function comes in - it updates the directory of the PDS/PDSE to reflect the member you created or updated).
Close the file.
If you also need to read members, you'd need to issue FIND (or BLDL/POINT - which can be fseek() in C) to point to the correct member, then read the member. I'm sure it sounds like a hassle, but the advantage of this approach is that you can allocate/open the file once, and process as many individual members as you like.
A second workaround might be to dynamically allocate the file yourself, and then open it using DD:ddname syntax...if you only infrequently access the file, this is probably easier to code. The gory details of dynamic allocation are fully described here: https://www.ibm.com/support/knowledgecenter/SSLTBW_2.4.0/com.ibm.zos.v2r4.ieaa800/reqsvc.htm.
There are several ways to invoke dynamic allocation: you can write a small assembler program, you can use the z/OS UNIX Services BPXWDYN callable service, or you can use the C runtime "dynalloc()" or "svc99()" functions. The dynalloc() function is easy to use, but it only exposes a subset of what dynamic allocation can do...svc99() is more cumbersome to use, but it exposes more functionality.
However you do it, dynamic allocation takes "text units" that roughly correspond to the parameters you find on JCL DD statements. What you're describing sounds like you'd just need to pass DSN and DISP text units, and maybe DDNAME (you can either pass your own DDNAME, or let the system generate one for you).
The C runtime functions make all this easy, but be aware that there are a few oddities, such as the need to pad the parameters to their maximum length. For example, a DSN needs to be 44 characters and padded on the right with blanks - not a C-style null-terminated string.
Here's a small code snippet as an example:
#include <stdio.h>
#include <dynit.h>
. . .
int allocate(char *ddn, char *dsn, char *mem)
{
__dyn_t ip; // Parameters to dynalloc()
. . .
// Prepare the parameters to dynalloc()
dyninit(&ip); // Initialize the parameters
ip.__ddname = ddn; // 8-char blank-padded
ip.__dsname = dsn; // 44-char blank-padded
ip.__status = __DISP_SHR; // DISP=(SHR)
ip.__normdisp = __DISP_KEEP; // DISP=(...,KEEP)
ip.__misc_flags = __CLOSE; // FREE=CLOSE
if (*mem) // Optional PDS, PDS/E member
ip.__member = mem; // 8-char blank-padded
// Now we can call dynalloc()...
if (dynalloc(&ip)) // 0: Success, else error
{
// On error, the errcode/infocode explain why - values
// are detailed in z/OS Authorized Services Reference
printf("SVC99: Can't allocate %s - RC 0x%x, Info 0x%x\n",
dsn, ip.__errcode, ip.__infocode);
return FALSE;
}
// If dynalloc works, you can open the file with fopen("DD:ddname",...)
return TRUE;
}
Don't forget that when you're done with the file, you generally need to deallocate it.
The code snippet above uses "FREE=CLOSE" - this means that when the file is closed, z/OS will automatically free the allocation...if you only open and process the dataset once, this is a convenient approach. If you need to repeatedly open and close the file, then you wouldn't use FREE=CLOSE, but instead call dynamic allocation a second time after you're done with your processing and want to free the file.
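When you're not using FREE=CLOSE, the explicit unallocation can be done with the C runtime dynfree() function from <dynit.h>. Here's a minimal sketch (the helper name is just for illustration, and the DDNAME is assumed to be the one you allocated earlier):
#include <stdio.h>
#include <dynit.h>
/* Hypothetical helper: release an allocation by its DDNAME when processing is done */
int deallocate(char *ddn)
{
__dyn_t ip;
dyninit(&ip);              /* Initialize the parameter block        */
ip.__ddname = ddn;         /* 8-char blank-padded DDNAME to release */
if (dynfree(&ip))          /* 0: success, non-zero: error           */
{
printf("Can't free %s - RC 0x%x, Info 0x%x\n",
ddn, ip.__errcode, ip.__infocode);
return 1;
}
return 0;
}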
If you need to concurrently access multiple files, beware that you'll need to generate multiple unique DDNAMEs. You can either do this in your own code, or you can use the form of dynamic allocation that automatically builds and returns a usable DDNAME (of the form "SYSnnnnn").
Also, don't forget that updating a dataset under DISP=SHR can be dangerous in some situations, especially if the dataset involved can be a conventional PDS as well as a PDS/E. The big danger is that two applications open the dataset for output concurrently...both will write data into the same place, and the result will likely be a damaged PDS directory.
There are some other oddities in the UNIX Services environment, particularly if you use fork() or exec() and expect file handles to work in subprocesses, since allocations are generally tied to a particular z/OS address space. Services like spawn() can let the child process run in the same address space, so this is one possibility.

Spark caching strategy

I have a Spark driver that goes like this:
EDIT - earlier version of the code was different & didn't work
var totalResult = ... // RDD[(key, value)]
var stageResult = totalResult
do {
  stageResult = stageResult.flatMap(
    // Some code that returns zero or more outputs per input,
    // and updates `acc` to number of outputs
    ...
  ).reduceByKey((x, y) => x.sum(y))
  totalResult = totalResult.union(stageResult)
} while (stageResult.count() > 0)
I know from properties of my data that this will eventually terminate (I'm essentially aggregating up the nodes in a DAG).
I'm not sure of a reasonable caching strategy here - should I cache stageResult each time through the loop? Am I setting up a horrible tower of recursion, since each totalResult depends on all previous incarnations of itself? Or will Spark figure that out for me? Or should I put each RDD result in an array and take one big union at the end?
Suggestions will be welcome here, thanks.
I would rewrite this as follows:
do {
  stageResult = stageResult.flatMap(
    // Some code that returns zero or more outputs per input
  ).reduceByKey(_ + _).cache
  totalResult = totalResult.union(stageResult)
} while (stageResult.count > 0)
I am fairly certain (95%) that the stageResult DAG used in the union will be the correct reference (especially since count should trigger it), but this might need to be double-checked.
Then when you call totalResult.ACTION, it will put all of the cached data together.
ANSWER BASED ON UPDATED QUESTION
As long as you have the memory space, I would indeed cache everything along the way, since it stores the data of each stageResult, unioning all of those data points at the end. In fact, each union does not rely on the past, as that is not the semantics of RDD.union; it merely puts them together at the end. You could just as easily change your code to use a val due to RDD immutability.
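If you prefer the "array of RDDs plus one big union at the end" variant mentioned in the question, a sketch could look like this (assuming sc is your SparkContext and expand is the function you currently pass to flatMap):
import scala.collection.mutable.ArrayBuffer

val stages = ArrayBuffer(totalResult)   // keep every stage, starting with the seed RDD
var stageResult = totalResult
do {
  stageResult = stageResult
    .flatMap(expand)                    // zero or more outputs per input
    .reduceByKey((x, y) => x.sum(y))
    .cache()
  stages += stageResult
} while (stageResult.count() > 0)

val finalResult = sc.union(stages)      // a single union over all cached stages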
As a final note, the DAG visualization in the Spark UI may help in understanding why there would not be recursive ramifications.

Assigning a whole DataStructure its nullind array

Some context before the question.
Imagine file FileA having around 50 fields of different types. Instead of all programs using the file directly, I tried having a service program, so the file could only be accessed by that service program. The programs calling the service would then receive a DataStructure based on the file structure, as an ExtName. I use SQL to retrieve the information, so, basically, the procedure goes like this:
Datastructure shared by service program :
D FileADS E DS ExtName(FileA) Qualified
Procedure called by programs :
P getFileADS B Export
D PI N
D PI_IDKey 9B 0 Const
D PO_DS LikeDS(FileADS)
D LocalDS E DS ExtName(FileA) Qualified
D NullInd S 5i 0 Dim(50) <-- Since 50 fields in FileA
//Code
Clear LocalDS;
Clear PO_DS;
exec sql
SELECT *
INTO :LocalDS :nullind
FROM FileA
WHERE FileA.ID = :PI_IDKey;
If SqlCod <> 0;
Return *Off;
EndIf;
PO_DS = LocalDS;
Return *On;
P getFileADS E
So, that procedure will return a datastructure filled with a record from FileA if it finds it.
Now my question: is there any way I can assign %nullind(field) = *On without specifying each of the 50 fields of my file?
Something like a loop
i = 1;
DoW (i <= 50);
if nullind(i) = -1;
%nullind(datastructure.field) = *On;
endif;
i++;
EndDo;
Because, let's face it, it'd be a pain to look at each field of each file every time.
I know a simple chain(n) could do the trick
chain(n) PI_IDKey FileA FileADS;
but I really was looking to do it with SQL.
Thank you for your advice!
OS version: 7.1
First, you'll be better off in the long run by eliminating SELECT * and supplying a SELECT list of the 50 field names.
Next, consider these two web pages -- Meaningful Names for Null Indicators and Embedded SQL and null indicators. The first shows an example of assigning names to each null indicator to match the associated field names. It's just a matter of declaring a based DS with names, based on the address of your null indicator array. The second points out how a null indicator array can be larger than needed, so future database changes won't affect results. (Bear in mind that the page shows a null array of 1000 elements, and the memory is actually relatively tiny even at that size. You can declare it smaller if you think it's necessary for some reason.)
You're creating a proc that you'll only write once. It's not worth saving the effort of listing the 50 fields. Maybe if you had many programs using this proc and you had to create the list each time it'd be a slight help to use SELECT *, but even then it's not a great idea.
A matching template DS for the 50 data fields can be defined in the /COPY member that will hold the proc prototype. The template DS will be available in any program that brings the proc prototype in. Any program that needs to call the proc can simply specify LIKEDS referencing the template to define its version in memory. The template DS should probably include the QUALIFIED keyword, and programs would then use their own DS names as the qualifying prefix. The null indicator array can be handled similarly.
However, it's not completely clear what your actual question is. You show an example loop and ask if it'll work, but you don't say if you had a problem with it. It's an array, so a loop can be used much like you show. But it depends on what you're actually trying to accomplish with it.
For old school RPG, just include the nulls in the data structure populated with the SELECT statement.
select col1, case when col1 is null then -1 else 0 end, col2, case when col2 is null then -1 else 0 end, etc. into :dsfilewithnull from FileA f where f.id = :id;
For old school RPG that can't handle nulls, remove them with the SELECT statement.
select coalesce(col1, 0), coalesce(col2, ' '), coalesce(col3, :lowdate) into :dsfile from FileA f where f.id = :id;
The second method would be easier to use in a legacy environment.
Pass the key by value to the procedure so you can use it like a built-in function.
One answer to your question would be to make the array part of a data structure, and assign *all'0' to the data structure.
dcl-ds nullIndDs;
nullInd Ind Dim(50);
end-ds;
nullIndDs = *all'0';
The answer by jmarkmurphy is an example of assigning all zeros to an array of indicators. For the example that you show in your question, you can do it this way:
D NullInd S 5i 0 dim(50)
/free
NullInd(*) = 1 ;
Nullind(*) = 0 ;
*inlr = *on ;
return ;
/end-free
That's a complete program that you can compile and test. Run it in debug and stop at the first statement. Display NullInd to see the initial value of its elements. Step through the first statement and display it again to see how the elements changed. Step through the next statement to see how things changed again.
As for "how to do it in SQL", that part doesn't make sense. SQL sets the values automatically when you FETCH a row. Other than that, the array is used by the host language (RPG in this case) to communicate values back to SQL. When a SQL statement runs, it again automatically uses whatever values were set. So, it either is used automatically by SQL for input or output, or is set by your host language statements. There is nothing useful that you can do 'in SQL' with that array.

Preparing a command with Structured Parameters

I have this ADO.NET command object and I can set some parameters and execute it successfully.
_mergecommand.Parameters.Add(new SqlParameter("values", SqlDbType.Structured));
_mergecommand.Parameters["values"].TypeName = "strlist";
_mergecommand.Parameters["values"].Direction = ParameterDirection.Input;
_mergecommand.Parameters["values"].Value = valuelist;
_mergecommand.ExecuteNonQuery();
This works fine. But I want to prepare this command before executing it because I need to run this millions of times. I am using SQL Server 2008. I get this error if I try to prepare it
SqlCommand.Prepare method requires all variable length parameters to have an explicitly set non-zero Size.
Any idea how to do this?
This is old, but there does appear to be a correct answer which is to use -1 as the size, e.g.:
_mergecommand.Parameters.Add(new SqlParameter("values", SqlDbType.Structured, -1));
If you have to do it millions of times, using a command like this is probably not a good strategy.
Can you serialize your data into an XML string and pass that as a single argument? That will be considerably less load on your network and SQL Server... although it will probably hit your client a lot harder.
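As a rough sketch of that idea (this assumes the values are also available as a plain IEnumerable<string> named values, that the SQL side shreds the XML itself, e.g. with nodes(), and that using System.Linq, System.Xml.Linq and System.Data are in place):
// Build one XML string from the list and pass it as a single NVARCHAR(MAX) parameter.
var xml = new XElement("values", values.Select(v => new XElement("v", v))).ToString();

var p = _mergecommand.Parameters.Add("@values", SqlDbType.NVarChar, -1);
p.Value = xml;
_mergecommand.Prepare();        // -1 (MAX) satisfies Prepare's explicit-size requirement
_mergecommand.ExecuteNonQuery();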
If you are dead set on passing the parameter the way you are now, maybe what you are looking for is an overload of the SqlCommand.Parameters.Add method:
_mergecommand.Parameters.Add("@values", System.Data.SqlDbType.NVarChar, 100).Value = foo;
is that more like what you wanted?

Can the Sequence of RecordSets in a Multiple RecordSet ADO.Net resultset be determined, controlled?

I am using code similar to this Support / KB article to return multiple recordsets to my C# program.
But I don't want the C# code to be dependent on the physical sequence of the recordsets returned in order to do its job.
So my question is, "Is there a way to determine which set of records from a multiplerecordset resultset am I currently processing?"
I know I could probably decipher this indirectly by looking for a unique column name or something per resultset, but I think/hope there is a better way.
P.S. I am using Visual Studio 2008 Pro & SQL Server 2008 Express Edition.
No, because the SqlDataReader is forward-only. As far as I know, the best you can do is open the reader with KeyInfo and inspect the schema data table created with the reader's GetSchemaTable method (or just inspect the fields, which is easier, but less reliable).
I spent a couple of days on this. I ended up just living with the physical order dependency. I heavily commented both the code method and the stored procedure with !!!IMPORTANT!!!, and included an #If...#End If to output the result sets when needed to validate the stored procedure output.
The following code snippet may help you.
Helpful Code
Dim fContainsNextResult As Boolean
Dim oReader As DbDataReader = Nothing
oReader = Me.SelectCommand.ExecuteReader(CommandBehavior.CloseConnection Or CommandBehavior.KeyInfo)
#If DEBUG_ignore Then
'load method of data table internally advances to the next result set
'therefore, must check to see if reader is closed instead of calling next result
Do
Dim oTable As New DataTable("Table")
oTable.Load(oReader)
oTable.WriteXml("C:\" + Environment.TickCount.ToString + ".xml")
oTable.Dispose()
Loop While oReader.IsClosed = False
'must re-open the connection
Me.SelectCommand.Connection.Open()
'reload data reader
oReader = Me.SelectCommand.ExecuteReader(CommandBehavior.CloseConnection Or CommandBehavior.KeyInfo)
#End If
Do
Dim oSchemaTable As DataTable = oReader.GetSchemaTable
'!!!IMPORTANT!!! PopulateTable expects the result sets in a specific order
' Therefore, if you suddenly start getting exceptions that only a novice would make
' the stored procedure has been changed!
PopulateTable(oReader, oDatabaseTable, _includeHiddenFields)
fContainsNextResult = oReader.NextResult
Loop While fContainsNextResult
Because you're explicitly stating in which order to execute the SQL statements, the results will appear in that same order. In any case, if you want to programmatically determine which recordset you're processing, you still have to identify some columns in the result.
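For example, a sketch of that column-sniffing approach in C# (the marker column names and the Load... methods are placeholders for whatever identifies and consumes each result set in your procedure):
using (var reader = command.ExecuteReader())
{
    do
    {
        // Collect the column names of the current result set
        var columns = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        for (int i = 0; i < reader.FieldCount; i++)
            columns.Add(reader.GetName(i));

        if (columns.Contains("CustomerId"))       // placeholder marker column
            LoadCustomers(reader);
        else if (columns.Contains("OrderId"))     // placeholder marker column
            LoadOrders(reader);
        // otherwise: an unexpected result set - log it rather than guessing
    } while (reader.NextResult());
}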