How could I get offset of a field in flatbuffers binary file? - mmap

I am using a library, and the library requires me to provide the offset of the desired data in the file, so it can use mmap to read the data (I can't edit the code of this library but only provide the offset).
So I want to use flatbuffers to serialize my whole data because there isn't any packing and unpacking in flatbuffers, (I think) which means that it is easy to get the offset of the desired part in the binary file.
But I don't know how to get the offset. I have tried loading the binary file and calculate the offset of the pointer of the desired field, for example, the address of the root is 1111, the address of the desired field is 1222, so the offset of the field in the binary file is 1222 - 1111 = 111 (because there is no unpacking step). But in fact, the offset of the pointer is a huge negative number.
Could someone help me with this problem? Thanks in advance!

FlatBuffers is indeed very suitable for mmap. There are no offsets to be computed, since the generated code does that all for you. You should simply mmap the whole FlatBuffers file, and then use the field accessors as normal, starting from auto root = GetRoot<MyRootType>(my_mmapped_buffer). If you want to get a direct pointer to the data in a larger field such as a string or a vector, again simply use the provided API: root->my_string_field()->c_str() for example (which will point to inside your mmapped buffer).

Related

Convert PSS/E .raw file to Pandapower

I'm trying to find a possible way to convert PSS/E native .raw files to Pandapower format.
My objective is to take advantage of the network plotting capabilities that are available in Pandapower.
For that, I have to first be able to load my grid data into Pandapower.
For that, I have to somehow bridge the gap between PSSE .raw to Pandapower.
Literature says that a possible way of doing this is by using the 'psse2mpc' function available in Matpower.
I've tried to use it but I get the following error message:
(quote)
>> psse2mpc('RED1523.raw')
Reading file 'RED1523.raw' ............................................. done.
Splitting into individual lines ...error: regexp: the input string is invalid UTF-8
error: called from
psse_read at line 60 column 9
psse2mpc at line 68 column 21
(unquote)
I'was informed that maybe I should save my .raw file (natively generated with a PSSE/E v33 version) into an older .raw format (corresponding to previous PSS/E versions).
I've tried this as well but still have the same error message.
Apart from getting this error which so far impedes to reach my objective, I've been unable to guess the Pandapower "equivalent .raw" structure. Does anybody know how this input structure looks like in Pandapower?
If I would know how Pandapower needs to get the input data, I could even try to code a taylor-made python script that converts my .raw file into whatever is required from Pandapower.
If somebody could help me to get out of this labyrinth I would be most gratefull !!!
Thanks.
Eneko.
You need to check your .raw file to enter the other Inputs of the psse2mpc function. For instance, if I have the case39.raw file and I want to convert it to matpower format like case39mpc.m, then I must enter something like this:
psse2mpc ('case39.raw', 'case39mpc.m', '1', '29')

Get offset and length of a subset of a WAT archive from Common Crawl index server

I would like to download a subset of a WAT archive segment from Amazon S3.
Background:
Searching the Common Crawl index at http://index.commoncrawl.org yields results with information about the location of WARC files on AWS S3. For example, searching for url=www.celebuzz.com/2017-01-04/*&output=json yields JSON-formatted results, one of which is
{
"urlkey":"com,celebuzz)/2017-01-04/watch-james-corden-george-michael-tribute",
...
"filename":"crawl-data/CC-MAIN-2017-34/segments/1502886104631.25/warc/CC-MAIN-20170818082911-20170818102911-00023.warc.gz",
...
"offset":"504411150",
"length":"14169",
...
}
The filename entry indicates which archive segment contains the WARC file for this particular page. This archive file is huge; but fortunately the entry also contains offset and length fields, which can be used to request the range of bytes containing the relevant subset of the archive segment (see, e.g., lines 22-30 in this gist).
My question:
Given the location of a WARC file segment, I know how to construct the name of the corresponding WAT archive segment (see, e.g., this tutorial). I only need a subset of the WAT file, so I would like to request a range of bytes. But how do I find the corresponding offset and length for the WAT archive segment?
I have checked the API documentation for the Common Crawl index server, and it isn't clear to me that this is even possible. But in case it is, I'm posting this question.
The Common Crawl index does not contain offsets into WAT and WET files. So, the only way is to search the whole WAT/WET file for the desired record/URL. Eventually, it would be possible to estimate the offset because the record order in WARC and WAT/WET files is the same.
After many trial and error I had managed to get a range from a warc file in python and boto3 the following way:
# You have this form the index
offset, length, filename = 2161478, 12350, "crawl-data/[...].warc.gz"
import boto3
from botocore import UNSIGNED
from botocore.client import Config
# Boto3 anonymous login to common crawl
s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))
# Count the range
offset_end = offset + length - 1
byte_range = 'bytes={offset}-{end}'.format(offset=2161478, end=offset_end)
gzipped_text = s3.get_object(Bucket='commoncrawl', Key=filename, Range=byte_range)['Body'].read()
# The requested file in GZIP
with open("file.gz", 'w') as f:
f.write(gzipped_text)
The rest is optimisation... Hope it helps! :)

Matlab error saving containers.Map()

I try to save a map as a *.mat file which is quite big. (somewhere around 4 or 5gigs. I cannot be sure because I could never save the file...)
the map is generated by:
[amap, ~] = load_audio(config);
and saved later on by
save('audioMap', 'amap');
Now the generated file is only 218 bytes but no errors occur. Trying to read the contents of the file with whos('-file', 'audioMap.mat') results in the following error:
Warning: Unable to read some of the variables due to unknown MAT-file error.
every record of the map is a cell with 6 values. Now querying the size of the map in the Matlab workspace results in the following output:
Name Size Bytes Class Attributes
amap 2279x1 112 containers.Map
Now clearly the size is not correct but I am able to iterate through the map and all data is present. When querying the size of a record it is approximately 2.5MB.
I also tried to save the variable from the workspace with right-click and save-as with the same result.
Anyone got any ideas why Matlab is not able to properly save this map?
You where attempting to write a MAT-File version 7.0 which has a maximum variable size of 2^31 bytes=2GB
When you attempt to write Variables larger than the limit, the expected behaviour would be to receive a warning when saving the variable.
Warning: Variable 'varname' cannot be saved to a MAT-file whose version is older than 7.3. To save this variable, use the -v7.3 switch. Skipping...
For some reason the warning was not raised, but beeing unable to write such large objects is the expected behaviour.

Why does Open XML API Import Text Formatted Column Cell Rows Differently For Every Row

I am working on an ingestion feature that will take a strongly formatted .xlsx file and import the records to a temp storage table and then process the rows to create db records.
One of the columns is strictly formatted as "Text" but it seems like the Open XML API handles the columns cells differently on a row-by-row basis. Some of the values while appearing to be numeric values are truly not (which is why we format the column as Text) -
some examples are "211377", "211727.01", "209395.388", "209395.435"
what these values represent is not important but what happens is that some values (using the Open XML API v2.5 library) will be read in properly as text whether retrieved from the Shared Strings collection or simply from InnerXML property while others get sucked in as numbers with what appears to be appended rounding or precision.
For example the "211377", "211727.01" and "209395.435" all come in exactly as they are in the spreadsheet but the "209395.388" value is being pulled in as "209395.38800000001" (there are others that this happens to as well).
There seems to be no rhyme or reason to which values get messed up and which ones which import fine. What is really frustrating is that if I use the native Import feature in SQL Server Management Studio and ingest the same spreadsheet to a temp table this does not happen - so how is that the SSMS import can handle these values as purely text for all rows but the Open XML API cannot.
To begin the answer you main problem seems to be values,
"209395.388" value is being pulled in as "209395.38800000001"
Yes in .xlsx file value is stored as 209395.38800000001 instead of 209395.388. And it's the correct format to store floating point numbers; nothing wrong in it. You van simply confirm it by following code snippet
string val = "209395.38800000001"; // <= What we extract from Open Xml
Console.WriteLine(double.Parse(val)); // < = Simply pass it to double and print
The output is :
209395.388 // <= yes the expected value
So there's nothing wrong in the value you extract from .xlsx using Open Xml SDK.
Now to cells, yes cell can have verity of formats. Numbers, text, boleans or shared string text. And you can styles to a cell which would format your string to a desired output in Excel. (Ex - Date Time format, Forced strings etc.). And this the way Excel handle the vast verity of data. It need this kind of formatting and .xlsx file format had to be little complex to support all.
My advice is to use a proper parse method set at extracted values to identify what format it represent (For example to determine whether its a number or a text) and apply what type of parse.
ex : -
string val = "209395.38800000001";
Console.WriteLine(float.Parse(val)); // <= Float parse will be deduce a different value ; 209395.4
Update :
Here's how value is saved in internal XML
Try for yourself ;
Make an .xlsx file with value 209395.388 -> Change extention to .zip -> Unzip it -> goto worksheet folder -> open Sheet1
You will notice that value is stored as 209395.38800000001 as scene in attached image.. So nothing wrong on API for extracting stored number. It's your duty to decide what format to apply.
But if you make the whole column Text before adding data, you will see that .xlsx hold data as it is; simply said as string.

IDCAMS LISTCAT deleting VSAM file when next step is IEFBR14

I have a requirement wherein I need to check if a VSAM file exists or not. If it is not present then I need to create it like TEST.FILE2. My JCL is as :
//STEP01 EXEC PGM=IDCAMS
//SYSPRINT DD SYSOUT=*
//SYSIN DD *
LISTCAT ENTRIES('BRTEST.FILE1')
/*
//STEP02 EXEC PGM=IEFBR14,COND=(4,GT)
//DD01 DD DSN=BRTEST.FILE1,
// DISP=(,CATLG,DELETE),
// LIKE=BRTEST.FILE2
//SYSPRINT DD SYSOUT=*
//SYSOUT DD SYSOUT=*
But a stange thing is happening. Whenever I execute this JCL, STEP001 return a return code as 004 even though the file is already present, and a new file is created in STEP02. So if I submit this JCL twice, a new file is created both the times. I am not able to understand how the file is getting deleted. And the strange thing is if I run the JCL without STEP02 then it gives MAXCC as 0 saying that the file was found in catalog.
I was able to achieve my requirement by following code, but would still like to understand why and how my VSAM file gets deleted for LISTCAT.
//STEP02 EXEC PGM=IEFBR14,COND=(4,GT)
//DD01 DD DSN=BRTEST.FILE1,
// DISP=(MOD,CATLG,CATLG),
// LIKE=BRTEST.FILE2
//SYSPRINT DD SYSOUT=*
//SYSOUT DD SYSOUT=*
Here is the SYSPRINT when only STEP01 is executed:
IDCAMS SYSTEM SERVICES TIME: 03:47:44
LISTCAT ENTRIES('BRTEST.FILE1')
CLUSTER ------- BRTEST.FILE1
IN-CAT --- CATALOG.TEST03
DATA ------- BRTEST.FILE1.DATA
IN-CAT --- CATALOG.TEST03
INDEX ------ BRTEST.FILE1.INDEX
IN-CAT --- CATALOG.TEST03
IDCAMS SYSTEM SERVICES TIME: 03:47:44
THE NUMBER OF ENTRIES PROCESSED WAS:
AIX -------------------0
ALIAS -----------------0
CLUSTER ---------------1
DATA ------------------1
GDG -------------------0
INDEX -----------------1
NONVSAM ---------------0
PAGESPACE -------------0
PATH ------------------0
SPACE -----------------0
USERCATALOG -----------0
TAPELIBRARY -----------0
TAPEVOLUME ------------0
TOTAL -----------------3
THE NUMBER OF PROTECTED ENTRIES SUPPRESSED WAS 0
IDC0001I FUNCTION COMPLETED, HIGHEST CONDITION CODE WAS 0
IDC0002I IDCAMS PROCESSING COMPLETE. MAXIMUM CONDITION CODE WAS 0
And when both steps are executed:
IDCAMS SYSTEM SERVICES TIME: 03:48:35
LISTCAT ENTRIES('BRTEST.FILE1')
IDC3012I ENTRY BRTEST.FILE1 NOT FOUND
IDC3009I ** VSAM CATALOG RETURN CODE IS 8 - REASON CODE IS IGG0CLEG-42
IDC1566I ** BRTEST.FILE1 NOT LISTED
IDCAMS SYSTEM SERVICES TIME: 03:48:35
THE NUMBER OF ENTRIES PROCESSED WAS:
AIX -------------------0
ALIAS -----------------0
CLUSTER ---------------0
DATA ------------------0
GDG -------------------0
INDEX -----------------0
NONVSAM ---------------0
PAGESPACE -------------0
PATH ------------------0
SPACE -----------------0
USERCATALOG -----------0
TAPELIBRARY -----------0
TAPEVOLUME ------------0
TOTAL -----------------0
THE NUMBER OF PROTECTED ENTRIES SUPPRESSED WAS 0
IDC0001I FUNCTION COMPLETED, HIGHEST CONDITION CODE WAS 4
IDC0002I IDCAMS PROCESSING COMPLETE. MAXIMUM CONDITION CODE WAS 4
The value for ZOS390RL variable is z/OS 02.01.00 and ZENVIR is ISPF 7.1MVS TSO.
May have an answer for you. Didn't think of it because it is a VSAM dataset, and the way you are trying to do it is unusual (to me).
There is/was a product called UCC11. I think it is now marketed by Computer Associates, and called CA-11 (or somesuch). I think you are using this product or something similar at your site.
If executed at the beginning of a JOB it would look for files specified as NEW and CATLG, and look to see if there was an existing file of the same name in the catalog. If there was, the existing file would be deleted.
This would obviate the need for an initial IEFBR14 step to delete such files.
I think that you are using this product, or something similar. Your file is being automatically deleted when it exists, so your IDCAMS step, which is reading data from a file (even if it is SYSIN and DD *) is not known to the product, so your VSAM file is deleted before your IDCAMS step is run.
Changing the file to MOD as the initial disposition (MOD will add to an existing file and create a new file if none exists) will not cause such a product to delete the file.
Using the LIKE for a VSAM file will not obtain the CA-size and CI-SIZE from the model dataset. You will get default values for those, which may well impact on the performance of your programs. You cannot specify these values when defining a VSAM file in JCL. You also won't get buffer values from the model dataset, but you can specify those separately in the JCL (which you haven't).
Here is a description of what LIKE does for you: http://publibfp.dhe.ibm.com/cgi-bin/bookmgr/BOOKS/iea2b680/12.40?DT=20080604022956
The following attributes are copied from the model data set to the
new data set:
Data set organization
Record organization (RECORG) or
Record format (RECFM)
Record length (LRECL)
Key length (KEYLEN)
Key offset (KEYOFF)
Type, PDS, PDSE, basic format, extended format, large format, or HFS (DSNTYPE)
Space allocation (AVGREC and SPACE)
Unless you explicitly code the SPACE parameter for the new data set,
the system determines the space to be allocated for the new data
set by adding up the space allocated in the first three extents of the
model data set. Therefore, the space allocated for the new data set
will generally not match the space that was specified for the model
data set. Note that regardless of the units in which the model data
set was allocated, the new data set will be allocated in tracks. This
assumes that space was not specified on the JCL and is being picked up
from the model data set.
There are some other little "gotchas", like in the last paragraph, detailed in the link as well.
Unless you have strong reasons otherwise, I'd strongly suggest doing the whole thing in one IDCAMS step (as below).
I suspected it was going to be 1.12, 1.13 or 2.1 (2.01). IEFBR14 is, subtly, part of the OS now.
Exactly why you get this effect, I don't know. I don't have access to 2.1, so can't investigate myself.
IEFBR14 has changed, LIKE is not really intended for VSAM datasets (you'll get a lot of default values for things you may or may not want), it's not really a "usual" way to do this. See Suggestions below.
Try adding a DDname to your IDCAMS step which just references the VSAM dataset. See if that changes anything. Use that DDname in an IDCAMS statement. See if that changes anything.
Take all your results to your Sysprogs, and see if they can spot anything.
If not, it'll be PMR-time: http://www-01.ibm.com/support/docview.wss?uid=swg21507639
If you do raise a PMR, please update by adding an Answer with the resolution once you receive it.
Suggestions.
Find out how this task is done by other people at your site.
Have you tried using the VSAM file you have defined in that way? You should LISTCAT TEST.FILE1 and TEST.FILE2 and compare. If you look up LIKE in the JCL Reference, you will see that there are things which a VSAM DEFINE can do which you can't do for a VSAM file defined in the JCL using LIKE.
Unless there is some reason otherwise, I'd suggest you do the whole thing in one step with IDCAMS. See if the file exists, use IDCAM's IF to to test the CC from that and only DEFINE if the file does not exist. You can use a MODEL (for instance on your TEST.FILE2) to get everything which is similar to another file, and just override anything different that you need.
If you have a look here, http://pic.dhe.ibm.com/infocenter/zos/v1r13/index.jsp?topic=%2Fcom.ibm.zos.r13.idai200%2Fdefclu.htm, you will find a number of Modal Commands for IDCAMS which will give you everything you need to define if it is not there, and do something different (set the Condition Code, for instance) if it is.
Please still supply the requested information. It is an interesting question on the face of it, which may have a simple solution. But even with a solution, I don't think it is what you want.
What you want to do can be done entirely in the IDCAMS step. You can inspect the return code from a previous operation (ie, the LISTCAT) and do something (like define a new cluster) if the code is greater than 0. If the 2nd step works, then set the MAXCC to return 0 to tell your JCL that this step completed OK (and to let your Ops folks know this too).
Look for the IDCAMS 'IF'.