How to simulate a column addition and table reordering in KDB - kdb

I would like to know how much time it would take to add a column or a number of columns to my HDB and reorder them.
How could I go about doing this?
I thought about adding a dummy column to the HDB in question and then removing it, comparing the start and end times, but the problem is that this is a PROD HDB.
Is there a mathematical formula that would allow me to approximate how long such an operation would take?

The dbmaint library is available at https://github.com/KxSystems/kdb/blob/master/utils/dbmaint.md
It has an addcol function for adding an extra blank column.
Its speed is limited mainly by disk write speed. Since the new columns start out blank, writing them with compression on makes a lot of sense and will be faster. Testing the column addition on a subset of data is the best way to gauge its speed.
The reordercols function is also available. It is very fast to run, as it only has to edit the small .d files to perform its task.
https://code.kx.com/q/kb/splayed-tables/
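For illustration, a rough sketch of what the dbmaint calls might look like (the HDB path, table name, column names and compression settings below are hypothetical):
q)\l dbmaint.q                          / load the dbmaint utility script
q).z.zd:17 2 6                          / optional: compress newly written files by default (2^17 block size, gzip, level 6)
q)addcol[`:/data/hdb;`trade;`fee;0n]    / add a blank (null) float column `fee to every partition of trade
q)reordercols[`:/data/hdb;`trade;`time`sym`price`size`fee]   / rewrites only the small .d files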
If you need to re-sort the tables after the maintenance, this is a much slower task. How long it takes depends on:
The datatypes of columns involved
The number of rows
The number of columns
Whether compression is on or off and what settings are being used
The specs of the server involved (CPU and I/O speed of the disk system)
Rather than using a formula with high complexity and many input factors, it is usually easiest to test the task you wish to complete on a subset of the data, ideally on hardware as close to identical as possible. From that you can extrapolate how long the whole task will take.
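For example, a minimal sketch of such a test (the test HDB path, table name, column name and the idea of copying a single date partition to it are all assumptions here):
q)\l dbmaint.q
q)\t addcol[`:/data/testhdb;`trade;`fee;0n]   / elapsed milliseconds to add one column to the test copy
q)/ scale by the number of date partitions in the production HDB for a rough estimate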
Some notes also on:
https://kx.com/blog/kdb-q-insights-database-maintenance-with-q/

Related

Reducing sampling in simulation

Is there any way in Modelica to reduce the sampling during simulations? I have a DCDC converter with high frequency, consequently generating a huge dataset. I am wondering if there is any way to reduce the size of the dataset during simulation/export?
Thanks
Trying to create a smaller dataset from models that generate huge ones (models with high frequencies).
Basically, when you simulate, do not press the -> (simulate) button; press the S (simulation setup) button on the toolbar instead. You get several tabs, and the ones you care about are General and Output.
In the General tab you can specify the number of intervals to reduce the amount of data stored; results will only be stored at each interval.
In the Output tab you can, for example, choose not to store events. You can also filter out variables that you are not interested in to reduce the result file size. Note that "Equidistant Time Grid" is activated by default; if it were not, this would generate quite a lot of output, maybe even several values per interval.
See more here about the things you have in General/Output:
https://openmodelica.org/doc/OpenModelicaUsersGuide/1.20/omedit.html#simulating-a-model
By including only the desired variables in the Variable Filter in the Output tab, one can reduce the size of the output file without compromising the interval points. Each entry follows POSIX EXTENDED regular expression format; for example, for the variables x, y, z it is ^x|y|z$. Another good practice is unchecking the Store variables at Events flag. The best answer to this question has already been given by Adrian Pop.

tbl_regression (gtsummary) ordering covariables levels and processing time

Originally in my df, I had my BMI in numeric format (1-5), which I recoded (underweight to obese), factored, and chose a specific reference using relevel (Normal, originally 3). Then I ran a logistic regression: y ~ BMI + other covariates. My questions are the following:
1- When I plug my logistic regression into tbl_regression, the levels appear in an undesired order (underweight, obese 1, obese 2, overweight). Is there a way to rearrange the levels the way I want (underweight, overweight, obese 1, obese 2)?
2- I used tbl_regression on a small dataset, which went OK. My new model, however, is based on 3M observations and 13 variables (the database is 1 GB). This time tbl_regression takes about 1 h to process and output the table, which is not normal since I have a fast laptop. Is there a way to make this more efficient? I tried keeping only the model and removing the database while using tbl_regression, but it is still hellishly long. I tried with the trial data and it was OK.
1 - I recommend using contrasts() to set the reference level. The relevel() function just moves a factor level to the first position. Examples here: Is there a way to relevel a variable in gtsummary after generating the beautiful table?
2 - I suspect that with such a large model, the confidence interval calculation is what is slowing you down. If you see a big difference in the computation times of summary() and broom::tidy() with the CI calculation compared to tbl_regression(), please create an illustrative example (that anyone can run locally) and it can be looked into further.

How to pass a vector from tableau to R

I need to pass a vector of arguments to Rserve from Tableau. Specifically, I am using IRR calculations in R (on Rserve), and I want to pass a vector of cash flows that are columns in my table (instead of rows/measures). So, I want to collect all those CFs in a vector and pass it on to Rserve. Passing them one at a time slows down IO.
SCRIPT_REAL("r_func(c(.arg1, .arg2, .arg3))",sum(cf1), sum(cf2), sum(cf3))
cf1..cfn are cash flows corresponding to various periods. The above code works well when the CFs are few, but takes a long time when I have a few hundred. Further, the time is spent not in calculation but in IO when communicating with the remote Rserve. If I have a local Rserve, this calculation finishes in a few seconds, while on the remote one it takes well over a minute.
Also, I want to point out that Tableau / Rserve set one argument after another, and that takes time. My expectation is that once I have a vector, it would be just one transfer and one setting of arguments, and therefore this should speed things up.
The first step in understanding how Tableau interacts with R or Python is understanding how Tableau's table calcs work.
Tableau Script_XXX() functions are table calculations which means that you invoke them on a vector of aggregate query results and the corresponding R or Python code needs to return a vector usually of the same size. (I think you may be able to return a scalar or smaller vector which gets replicated to appear like a vector of the same size as the argument -- but not certain)
You can control how your data is partitioned into vectors, and also the ordering of data in the vectors, by editing the table calc to specify the partitioning and addressing for that calc.
Partitioning determines how your aggregate query results are broken up into vectors for calculation purposes. Addressing determines how the elements of each vector are ordered. You can either do that based on the physical layout of the table structure, or (better) based on the specific dimensions.
See the Tableau online help for table calcs for more info, and look for online training videos from Tableau or blog entries (especially from anyone named Bora).
One way to test your understanding of these concepts is to create a Tableau table (i.e., a viz with a mark type of text) with several dimensions on the row and column shelves. Then create calculated fields for INDEX() and SIZE() and display them on text. Finally, change the partitioning and addressing in different ways by editing those table calcs. Try several different permutations. When you can confidently predict what those functions will produce for different settings, you're ready to do more complex tasks - such as talking to R.
It is also instructive to experiment with FIRST(), LAST(), LOOKUP(), WINDOW_SUM() etc -- and finally dig into PREVIOUS_VALUE(). Warning, PREVIOUS_VALUE() is a bit odd, and does not behave the way you probably assume it does. Still, it is a useful technique that can implement a recursive calculation, and is about as close to a for loop as Tableau gets.

Optimizing compression using HDF5/H5 in Matlab

Using Matlab, I am going to generate several data files and store them in H5 format as 20x1500xN, where N is an integer that can vary, but is typically around 2300. Each file will have 4 different data sets with equal structure. Thus, I will quickly run into a storage problem. My two questions:
Is there any reason not to split the 4 different data sets, and just save them as 4x20x1500xN instead? I would prefer having them split, since they are different signal modalities, but if there is any computational/compression advantage to not having them separated, I will join them.
Using Matlab's built-in compression, I set deflate=9 (and DataType=single). However, I have now realized that using deflate multiplies my computational time by 5. I realize this could have something to do with my ChunkSize, which I just set to 20x1500x5, without any reasoning behind it. Is there a strategic way to optimize computational load w.r.t. deflation and compression time?
Thank you.
1- Splitting or merging? It won't make a difference in the compression procedure, since it is performed in blocks.
2- Your choice of chunk shape does indeed seem bad. The chunk size determines the shape and size of each block that will be compressed independently. The problem is that each chunk is 600 kB, which is much larger than the L2 cache, so your CPU is likely twiddling its thumbs, waiting for data to come in. Depending on the nature of your data and the usage pattern you will use the most (reading the whole array at once, random reads, sequential reads...), you may want to target the L1 or L2 sizes, or something in between. Here are some experiments done with a Python library that may serve you as a guide.
Once you have selected your chunk size (how many bytes your compression blocks will have), you have to choose a chunk shape. I'd recommend the shape that most closely fits your reading pattern if you are doing partial reads, or filling fastest-axis-first if you want to read the whole array at once. In your case, this will be something like 1x1500x10, I think (the second axis being the fastest, the last one the second fastest, and the first the slowest; adjust if I am mistaken).
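To make that concrete (a back-of-the-envelope check, assuming 4-byte single-precision values as in the question): the original 20x1500x5 chunk holds 20*1500*5 = 150,000 values, i.e. 150,000 * 4 B = 600 kB per chunk, whereas a 1x1500x10 chunk holds 15,000 values, i.e. about 60 kB, which sits much closer to typical L1/L2 cache sizes.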
Lastly, keep in mind that the details are quite dependent on the specific machine you run it on: the CPU, the quality and load of the hard drive or SSD, the speed of the RAM... so the fine tuning will always require some experimentation.

is kdb fast solely due to processing in memory

I've heard people talk quite a few times about KDB dealing with millions of rows in nearly no time. Why is it that fast? Is that solely because the data is all organized in memory?
Another thing: are there alternatives to this? Do any big database vendors provide in-memory databases?
A quick Google search came up with the answer:
Many operations are more efficient with a column-oriented approach. In particular, operations that need to access a sequence of values from a particular column are much faster. If all the values in a column have the same size (which is true, by design, in kdb), things get even better. This type of access pattern is typical of the applications for which q and kdb are used.
To make this concrete, let's examine a column of 64-bit, floating point numbers:
q).Q.w[] `used
108464j
q)t: ([] f: 1000000 ? 1.0)
q).Q.w[] `used
8497328j
q)
As you can see, the memory needed to hold one million 8-byte values is only a little over 8MB. That's because the data are being stored sequentially in an array. To clarify, let's create another table:
q)u: update g: 1000000 ? 5.0 from t
q).Q.w[] `used
16885952j
q)
Both t and u are sharing the column f. If q organized its data in rows, the memory usage would have gone up another 8MB. Another way to confirm this is to take a look at k.h.
Now let's see what happens when we write the table to disk:
q)`:t/ set t
`:t/
q)\ls -l t
"total 15632"
"-rw-r--r-- 1 kdbfaq staff 8000016 May 29 19:57 f"
q)
16 bytes of overhead. Clearly, all of the numbers are being stored sequentially on disk. Efficiency is about avoiding unnecessary work, and here we see that q does exactly what needs to be done when reading and writing a column - no more, no less.
OK, so this approach is space efficient. How does this data layout translate into speed?
If we ask q to sum all 1 million numbers, having the entire list packed tightly together in memory is a tremendous advantage over a row-oriented organization, because we'll encounter fewer misses at every stage of the memory hierarchy. Avoiding cache misses and page faults is essential to getting performance out of your machine.
Moreover, doing math on a long list of numbers that are all together in memory is a problem that modern CPU instruction sets have special features to handle, including instructions to prefetch array elements that will be needed in the near future. Although those features were originally created to improve PC multimedia performance, they turned out to be great for statistics as well. In addition, the same synergy of locality and CPU features enables column-oriented systems to perform linear searches (e.g., in where clauses on unindexed columns) faster than indexed searches (with their attendant branch prediction failures) up to astonishing row counts.
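A quick way to see the effect of this layout for yourself (a minimal sketch; the actual timings depend on your machine):
q)v:1000000?1.0    / one million random floats, stored contiguously in memory
q)\t sum v         / elapsed milliseconds to sum the whole column
q)\t:100 sum v     / run it 100 times for a more stable measurement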
Source(s): http://www.kdbfaq.com/kdb-faq/tag/why-kdb-fast
As for speed, the in-memory aspect does play a big part, but there are several other things: fast reads from disk for the HDB, splaying, etc. From personal experience I can say that you can get pretty good speeds from C++, provided you want to write that much code. With kdb you get all that and some more.
Another thing about speed is the speed of coding. It is a steep learning curve, but once you get it, complex problems can be coded in minutes.
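As an illustration of that terseness (assuming a typical trade table with sym, price and size columns):
q)select vwap:size wavg price by sym from trade   / volume-weighted average price per symbol, in one line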
For alternatives, you can look at OneTick or search for in-memory databases.
kdb is fast but really expensive. Plus, it's a pain to learn Q. There are a few alternatives such as DolphinDB, Quasardb, etc.