report progress during tplog row count using -11!(-2;`:tplog) - kdb

I'm counting the number of rows in a 1TB tplog using -11!(-2;`:tplog) internal function, which takes a long time as expected.
Is there a way to track progress (preferably as a percentage) of this operation?

No, I don't believe this is possible for the -2 option. It is possible with the other uses of -11! by defining a custom .z.ps, but .z.ps doesn't seem to be invoked when counting the number of chunks.
Your best bet might be to check the raw file size on disk, using a Linux system command or hcount, and then, based on previous timings measured on your machine, build a ballpark estimate of the duration (perhaps with linear interpolation) from the file size alone.
Of course, once you have the chunk count from the -2 option you can easily create a progress estimate for the actual replay, just not for the chunk-count step itself.
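As an illustration of that idea, here is a minimal sketch in q, assuming you have previously timed the -2 call on a smaller reference log on the same disk (`:ref.tplog and refSecs are hypothetical):
refBytes: hcount`:ref.tplog            / size of the reference log in bytes
refSecs: 120                           / seconds -11!(-2;`:ref.tplog) took on it
est: refSecs*(hcount`:tplog)%refBytes  / linear extrapolation from file size alone
-1"ballpark estimate for the chunk count: ",string[est]," seconds";
This only estimates the duration up front; it still cannot report live progress of the -2 call itself.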

Fundamentally, entries are variable length and there is no tracking built into the file, so there's no exact answer.
You might be able to do this:
Replay a short segment (say 1,000 entries) with a custom handler:
estimate:0
.z.ps:{estimate+:-22!x}    / -22! gives the serialized byte size of each replayed message
-11!(1000;`:tplog)
-1"estimate of entry size: ", string estimate%:1000
-1"estimate of number of entries: ", string floor n:hcount[`:tplog]%estimate
\x .z.ps
Then replay the full file with a standard upd handler.
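If you also want a progress readout during that full replay, here is a minimal sketch: done is a hypothetical counter, n is the total chunk count (the exact value from -11!(-2;`:tplog), or the estimate above), and value x carries out the normal dispatch to upd.
done:0
/ count each replayed message and print a percentage every 100000 messages
.z.ps:{done+:1;if[0=done mod 100000;-1 string[floor 100*done%n],"% replayed"];value x}
-11!`:tplog
\x .z.ps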

Related

how to optimise tUniqRow and tSortRow

Is it better to put tSortRow before tUniqRow or vice versa for the best performance?
And how can I optimize tUniqRow?
Even if I use the "use disk" option, the job crashes.
I'm working on a 3-million-line file.
In order to optimize your job, you can try the following:
Use the "use disk" option on tSortRow with a smaller buffer (the default 1-million-row buffer is too big, so start with a small number of rows, 50k for instance, then increase it to find better performance). This will use more (smaller) files on disk, so your job will run more slowly, but it will consume less memory.
Try a tSortRow (using disk) followed by a tAggregateSortedRow instead of tUniqRow: by specifying the unique columns in the Group By section it acts as a tUniqRow (the columns that are not part of the unique key must each be specified in the Operations tab with the 'First' function). Since it expects the rows to already be sorted, it doesn't sort them in memory first. Note that this component requires you to know the number of rows in your flow beforehand, which you can get from a previous subjob if you're processing your data in multiple steps.
Also, if the columns you're sorting by in tSortRow come from your database table, you can use an ORDER BY clause in your tOracleInput query. That way the sorting is done on the database side and your job won't consume memory for the sort.

Marc21 Binary Decoder with Akka-Stream

I'm trying to decode MARC 21 binary data records, which have the following specification concerning the field that provides the length of the record.
A Computer-generated, five-character number equal to the length of the
entire record, including itself and the record terminator. The number
is right justified and unused positions contain zeros.
I am trying to use Akka Stream's Framing.lengthField, but I don't know how to specify the size of that field. I imagine a character is 8 bits, maybe 16 for a number; I am not sure, and I wonder whether that depends on the platform or language. In short, the question is: is it possible to say what the size of that field is, knowing that I am in Scala/Java?
Also, what does this mean:
"The number is right justified and unused positions contain zeros"
Does that have implications for how one reads the value, assuming it is collected properly?
If anyone knows anything about this, please share.
EDIT1
Context:
I am trying to build a stream-processing graph where the first stage processes the result of a sys command run against a Symphony (vendor cataloging system) server, which is a stream of unstructured byte chunks that, as a whole, represent all the MARC 21 records requested (full dump or partial dump).
By processing I mean chunking that unstructured stream of bytes into a stream of frames, where the frames are the records.
In other words, reading the bytes for one record at a time and emitting each record individually to the next stage.
The next stage will consist of emitting that record (bytes) to Apache Kafka.
Obviously the emission stage would be fully parallelized to speed up the process.
The Symphony server does not have the capability to stream a dump when requested, especially over the network. Hence this Akka Streams based processing graph, which performs that work for fast ingestion/production and overall streaming processing of our dumps in our fast data infrastructure.
EDIT2
Based on @badcook's input, I wonder if computeFrameSize could be used here. I'm not sure; I am slightly confused by the function and by what it takes as parameters.
A little clarification would be much appreciated.
It looks like you're trying to parse MARC 21 records.
In that case I would recommend you just take a look at MARC4J and use that.
If you want to integrate it with Akka Streams, or even if you want to parse MARC records your own way, I would recommend breaking up your byte stream with Framing.delimiter using the MARC 21 record terminator (ASCII control character 0x1D) into complete MARC records, rather than trying to stream and work with fragments of MARC records. It'll be a lot easier.
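For example, here is a minimal sketch of that framing stage (the val/def names and the maximum frame length are assumptions, not part of the original answer; MARC 21 record lengths are five decimal digits, so no record can exceed 99999 bytes). It also shows how the zero-padded length field discussed at the end of this answer can be read, should you need it:
import akka.NotUsed
import akka.stream.scaladsl.{Flow, Framing}
import akka.util.ByteString

// Split the incoming byte stream into one frame per MARC 21 record,
// using the record terminator 0x1D as the delimiter. Emitted frames
// do not include the terminator itself.
val marcFraming: Flow[ByteString, ByteString, NotUsed] =
  Framing.delimiter(ByteString(0x1d.toByte), maximumFrameLength = 100000, allowTruncation = false)

// The first 5 bytes of each record hold its total length as zero-padded
// ASCII digits, e.g. "00714"; "00001".toInt == 1, so the padding is harmless.
def recordLength(record: ByteString): Int =
  record.take(5).utf8String.toInt
Each frame is then one complete record, which the next stage can hand to MARC4J or push to Kafka as-is.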
As for your specific questions: the MARC 21 specification uses characters rather than raw bytes when talking about its structure. It specifies two character encodings into raw bytes, UTF-8 and MARC-8, both of which are variable-width encodings. Hence, no, it is not true that every character is a byte; there is no single answer to how many bytes a character takes up.
"[R]ight justified and unused positions contain zeroes" is another way of saying that numbers are padded from the left with 0s. In this case the line comes from a larger quote stating that the numerical string must be 5 characters long. That means if you are trying to represent the number 1, you must represent it as 00001.

why are orientdb index sizes on disk so large

orientdb 2.0.5
I have a database in which we create a non-unique index on two properties of a class called indexstat.
The two properties which make up the index are a string identifier plus a long timestamp.
Data is created in batches of a few hundred records every 5 minutes. After a few hours, old records are deleted.
Below is a listing of the files related to that table.
Question:
Why is the .irs file, which according to the documentation is related to non-unique indexes, so monstrously huge after a few hours? It is 298056704 bytes larger than the actual data (.irs size minus .cpm size minus .pcl size).
I would think the index would be smaller than the actual data.
Second question:
What is best practice here? Should I be using unique indexes instead of non-unique? Should I find a way to make the data in the index smaller (e.g. use longs instead of strings as identifiers)?
Below are file names and the sizes of each.
indexstat.cpm 727778304
indexstatidx.irs 1799095296
indexstatidx.sbt 263168
indexstat.pcl 773260288
This is repeated for a few tables where the index size is larger than the database data.
The internals of *.irs files are organised in such a way that when you delete something from an index, an unused hole is left in the file. At some point, when about half of the file space is wasted, those unused holes come back into play and become available for reuse and allocation. This is done for performance reasons, to lower index data fragmentation. In your case it means that sooner or later the *.irs file will stop growing, and its maximum size should be around 2-3 times the maximum observed size of the corresponding *.pcl file, assuming the size of a single stat record is not much bigger than the size of the id-timestamp pair.
Regarding the second question, in the long run it is almost always better to use the most specific/strict data types to model the data and the most specific/strict index types to index it.
This link shows a discussion about the index file; maybe it can help you.
For the second question, the index should be chosen according to your purpose and your data (not vice versa). The data type (long, string) should be the one that best represents your fields (for example, if an integer is sufficient for your scope, it is pointless to use a long). The same goes for the choice of index: if you need to allow duplicates, the choice will be non-unique; if you need an index that supports range queries, choose an SB-tree instead of a hash index; and so on.

historic data storage and retrieval

I am using a standard splayed format for my trade data, where I have a directory for each date and each column as a separate file in there.
I am reading from CSV files and storing the data using the code below. I am using the 32-bit trial version on 64-bit Windows 7.
readDat: {[x]
trd: read data from csv file(x);       / pseudocode: the actual csv load is not shown
trd: `sym`time`trdId xasc trd;
/ trd: update `g#sym from trd;
trade:: trd;
.Q.dpft[`:/kdb/ndb; dt; `sym; `trade]; / dt is the partition date corresponding to file x
.Q.gc[];
};
\t readDat each 50#dtlist
I have tried both with `g#sym and without it. The data typically has 1.5MM rows per date. The select time for a day is from 0.5 to 1 second.
Is there a way to improve the times for either of the queries below?
\t select from trade where date=x
\t select from trade where date=x, sym=y
I have read the docs on segmentation, partitioning etc. but am not sure if anything would help here.
On second thought, will creating a table for each sym speed things up? I am trying that out, but wanted to know if there are memory/space tradeoffs I should be aware of.
Have you done any profiling to see what the actual bottleneck is? If you find that the problem has to do with disk read speed (using something like iostat), you can either get a faster disk (SSD), more memory (for a bigger disk cache), or use par.txt to shard your database across multiple disks, so that the query runs on multiple disks and cores in parallel.
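For reference, par.txt is just a plain text file in the database root that lists one segment directory per line; ideally each sits on a separate physical disk so the partitions can be read in parallel. The paths below are hypothetical:
/mnt/disk1/kdbseg
/mnt/disk2/kdbseg
/mnt/disk3/kdbseg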
As you are using .Q.dpft, you are already partitioning your DB. If your use case is always to pass one date in your queries, then segmenting by date will not provide any performance improvements. You could possibly segment by symbol range (see here), although this is never something I've tried.
One basic way to improve performance would be to select a subset of the columns. Do you really need to read all of the fields when querying? Depending on the width of your table, this can have a large impact, as kdb can then ignore some column files completely.
Another way to improve performance would be to apply `u# to the sym file. This will speed up your second query, as the lookup on the sym file will be faster, although this really depends on the size of your universe. I would imagine the benefit would be marginal compared with reducing the number of columns requested.
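A minimal sketch of the `u# suggestion, assuming the HDB root `:/kdb/ndb from the question (take a backup of the sym file first):
/ re-save the sym enumeration file with the unique attribute, so that
/ enumerating a queried symbol against it becomes a hash lookup
`:/kdb/ndb/sym set `u#get`:/kdb/ndb/sym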
As user1895961 mentioned, selecting only certain columns will be faster.
kdb+ splayed/partitioned tables are essentially just files on the filesystem: the smaller the files and the fewer you have to read, the faster it will be. The balance between the number of folders and the number of files is key. 1.5mln rows per partition is OK, but on the large side; perhaps you might want to partition by something else.
You may also want to normalise your data, splitting it into multiple tables and using linked columns to join it back together on the fly. Linked columns, if set up correctly, can be very powerful and can help avoid reading too much data back from disk when filtering is added in.
Also try storing your data as char instead of sym; I found big performance increases from doing so.

Find global subscript midpoint

In Caché ObjectScript (InterSystems' dialect of MUMPS), is there a way to efficiently skip to the approximate midpoint, or to evenly spaced points, of a global subscript range, based on the number of records?
I want to divide the subscript key range into approximately equal chunks and then process each chunk in parallel.
Knowing that the keys in a global are arranged in a binary tree of some kind, this should be a simple operation for the underlying data storage engine but I'm not sure if there is an interface to do this.
I can do it by scanning the global's whole keyspace but that would defeat the purpose of trying to run the operation in parallel. A sequential scan takes hours on this global. I need the keyspace divided up BEFORE I begin scanning.
I want each thread to have an approximately equal-sized contiguous chunk of the keyspace to scan individually; the problem is calculating what key range to give each thread.
You can use the second parameter, "direction" (1 or -1), of the $ORDER or $QUERY function.
For my particular need, I found that the application I'm using has what I would call an index global: another global maintained by the app with different keys, linking back to the main table. I can scan that in a fraction of the time and break up the keyset from there.
If someone comes up with a way to do what I want given only the main global, I'll change the accepted answer to that.