I have a table with ~3 billion rows in an HDB. One of the columns is a char list (string), and I want to cast this column to symbol after loading the HDB. But memory quickly climbs past 300GB, which I cannot afford. Can this be optimized in any way?
Are you trying to cast to symbol in-memory (temporary) or cast to symbol on-disk (permanent)? If in-memory, you shouldn't try to cast to symbol for all dates; you can just cast to symbol as you select from it (with a date filter), or build a wrapper function to handle this. You need to analyse how repetitive the strings are though, as every string you cast to symbol gets interned and consumes memory. If the strings are largely unique (e.g. long), then you may end up creating too many interned symbols, leading to your memory blowup.
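For example, a minimal in-memory sketch (the table name t, string column str, and the date used in the filter are placeholders for your own schema):
q)select date, sym:`$str from t where date=2024.01.01
q)f:{[d] select date, sym:`$str from t where date=d}   / wrapper that casts as it selects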
If on-disk you should be using Kx's dbmaint utility - it has a specific example of casting from char list (string) to enumerated symbol.
https://github.com/KxSystems/kdb/blob/master/utils/dbmaint.md#fncol
You have to be very careful though - again you need to analyse the string column to ensure that it is repetitive enough to warrant casting to symbol (adding as few new symbols to the sym file as possible). If the strings are largely unique then you should not cast to symbol, as you risk polluting the sym file with a lot of new symbols.
Ultimately the most efficient approach is to make the permanent on-disk change, assuming the strings are repetitive (e.g. short).
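For reference, a rough sketch of that on-disk route using dbmaint's fncol (the hdb path, table and column names are placeholders; check the linked fncol example for the exact pattern before running anything):
q)\l dbmaint.q
q)fncol[`:/path/to/hdb;`trade;`strcol;{`:/path/to/hdb/sym?`$x}]   / cast strings to enumerated symbols in every partition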
Related
I have a column in my table whose values are dictionaries. The type shown in the meta of that column is " " (blank).
I want to know how to splay this table. When I try to splay it, I get a type error. I am aware only vectors can be splayed, however, I have seen a table where a column holds dictionaries splayed before, so I know it's possible, but I am not sure how it is done.
Dictionary columns in splayed tables are only supported in kdb+ version 3.6 and later.
If you are running 3.6/4.0, double check you are enumerating the table for splay.
`:path/to/table/ set .Q.en[`:hdb;table]
If you are on a version earlier than 3.6, storing the dictionaries as JSON strings is a good alternative, although this is not recommended on large tables as .j.k (JSON parsing) is slow.
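A small sketch of that JSON workaround (the table, path and column names are made up):
q)t:([] dict:((`a`b!1 2);(`a`b!3 4)))
q)`:hdb/t/ set .Q.en[`:hdb] update .j.j each dict from t   / dictionaries stored as JSON char lists
q)update .j.k each dict from get `:hdb/t                   / parse back when reading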
How to avoid the unnecessary CPU cost?
See this historical question with failing tests. Example: j->'x' is a jsonb value representing a number and j->'y' a boolean. From the first versions of jsonb (released in 2014 with PostgreSQL 9.4) until today (6 years!), up to PostgreSQL v12, it seems that we need to enforce a double conversion:
Discard the j->'x' "binary jsonb number" information and transform it into the printable string j->>'x'; likewise discard the j->'y' "binary jsonb boolean" and transform it into the printable string j->>'y'.
Then parse the strings to obtain a "binary SQL float" by casting, (j->>'x')::float AS x, and a "binary SQL boolean" by casting, (j->>'y')::boolean AS y - as in the snippet below.
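In SQL that round trip looks like this (t and j are placeholder names for the table and its jsonb column):
SELECT (j->>'x')::float   AS x,  -- jsonb -> text -> float
       (j->>'y')::boolean AS y   -- jsonb -> text -> boolean
FROM t;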
Is there no syntax or optimized function that lets a programmer enforce the direct conversion?
I don't see one in the guide... or was it never implemented: is there a technical barrier to it?
NOTES about a typical scenario where we need it
(responding to comments)
Imagine a scenario where your system needs to store many, many small datasets (a real example!) with minimal disk usage, managing them all with centralized control/metadata/etc. jsonb is a good solution, and offers at least 2 good alternatives for storing them in the database:
Metadata (with a schema descriptor) and the whole dataset in an array of arrays;
Separating Metadata and table rows in two tables.
(and variations where the metadata is translated to a cache of text[], etc.) Alternative 1, monolithic, is the best for the "minimal disk usage" requirement, and faster for full information retrieval. Alternative 2 can be the choice for random access or partial retrieval, when the table Alt2_DatasetLine also has one more column, like time, for time series.
You can create all the SQL VIEWs in a separate schema, for example:
CREATE VIEW mydatasets.t1234 AS
  SELECT (j->>'d')::date AS d, j->>'t' AS t, (j->>'b')::boolean AS b,
         (j->>'i')::int AS i, (j->>'f')::float AS f
  FROM (
    SELECT jsonb_array_elements(j_alldata) j
    FROM Alt1_AllDataset
    WHERE dataset_id=1234
  ) t
  -- or FROM alt2...
;
And the CREATE VIEWs can all be automated, running the SQL string dynamically ... we can reproduce the above "stable schema casting" with simple formatting rules, extracted from the metadata:
SELECT string_agg( CASE
WHEN x[2]!='text' THEN format(E'(j->>\'%s\')::%s AS %s',x[1],x[2],x[1])
ELSE format(E'j->>\'%s\' AS %s',x[1],x[1])
END, ',' ) as x2
FROM (
SELECT regexp_split_to_array(trim(x),'\s+') x
FROM regexp_split_to_table('d date, t text, b boolean, i int, f float', ',') t1(x)
) t2;
... It's a "real life scenario", this (apparently ugly) model is surprisingly fast for small traffic applications. And other advantages, besides disk usage reduction: flexibility (you can change datataset schema without need of change in the SQL schema) and scalability (2, 3, ... 1 billion of different datasets on the same table).
Returning to the question: imagine a dataset with ~50 or more columns; the SQL VIEW would be faster if PostgreSQL offered a "binary to binary" casting.
Short answer: No, there is no better way in PostgreSQL to extract a jsonb number than (for example)
CAST(j ->> 'attr' AS double precision)
A JSON number happens to be stored as a PostgreSQL numeric internally, so that wouldn't work “directly” anyway. But there is no reason in principle why there could not be a more efficient way to extract such a value as numeric.
So, why don't we have that?
Nobody has implemented it. That is often an indication that nobody thought it worth the effort. I personally think that this would be a micro-optimization – if you want to go for maximum efficiency, you extract that column from the JSON and store it directly as a column in the table.
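For example, on PostgreSQL 12 or later you could materialize it once with a generated column (a sketch; the table t, jsonb column j and key 'x' are assumptions):
ALTER TABLE t
  ADD COLUMN x double precision
  GENERATED ALWAYS AS ((j->>'x')::double precision) STORED;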
It is not necessary to modify the PostgreSQL source to do this. It is possible to write your own C function that does exactly what you envision. If many people thought this was beneficial, I'd expect that somebody would already have written such a function.
PostgreSQL has just-in-time compilation (JIT). So if an expression like this is evaluated for a lot of rows, PostgreSQL will build executable code for that on the fly. That mitigates the inefficiency and makes it less necessary to have a special case for efficiency reasons.
It might not be quite as easy as it seems for many data types. JSON standard types don't necessarily correspond to PostgreSQL types in all cases. That may seem contrived, but look at this recent thread in the Hackers mailing list that deals with the differences between the numeric types of JSON and PostgreSQL.
None of the above are reasons that such a feature could never exist; I just wanted to give reasons why we don't have it.
I have an array that looks like this: [[(Double,Double)]]. It's a multi-dimensional array of tuples.
This is data that I will never query on, as it doesn't need to be queried. It only makes sense if it's like that on the client side. I'm thinking of storing the entire thing as a string and then parsing it back into a multi-dimensional array.
Would that be a good approach, and would the parsing be very expensive, considering I can have a max of 20 arrays, each with at most 4 inner arrays, each holding a tuple of 2 Doubles?
How would I check to see which is a better approach and if storing it as multi-dimensional array in PostgreSQL is the better approach?
How would I store it?
To store an array of composite type (with any nesting level), you need a registered base type to work with. You could have a table defining the row type, or just create the type explicitly:
CREATE TYPE dd AS (a float8, b float8);
Here are some ways to construct that 2-dimensional array of yours:
SELECT ARRAY [['(1.23,23.4)'::dd]]
, (ARRAY [['(1.23,23.4)']])::dd[]
, '{{"(1.23,23.4)"}}'::dd[]
, ARRAY[ARRAY[dd '(1.23,23.4)']]
, ARRAY(SELECT ARRAY (SELECT dd '(1.23,23.4)'));
Related:
How to pass custom type array to Postgres function
Pass array from node-postgres to plpgsql function
Note that the Postgres array type dd[] can store values with any level of nesting. See:
Mapping PostgreSQL text[][] type and Java type
Whether that's more efficient than just storing the string literal as text very much depends on details of your use case.
Array types have an overhead of 24 bytes plus the usual storage size of the element values.
float8 (= double precision) occupies 8 bytes. The text string '1' occupies 2 bytes on disk and 4 bytes in RAM. text '123.45678' occupies 10 bytes on disk and 12 bytes in RAM.
Simple text will be read and written a bit faster than an array type of equal size.
Large text values are compressed (automatically), which can benefit storage size (especially with repetitive patterns) - but adds compression / decompression cost.
An actual Postgres array is cleaner in any case, as Postgres does not allow illegal strings to be stored.
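To answer the "how would I check which is better" part empirically, one quick check is to compare the storage size of both representations with pg_column_size (the literals below are just illustrative, using the dd type defined above):
SELECT pg_column_size('{{"(1.23,23.4)","(5.67,8.9)"}}'::dd[]) AS array_bytes,
       pg_column_size('[[(1.23,23.4),(5.67,8.9)]]'::text)     AS text_bytes;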
A column has undergone a data type change, so the query has to be changed too:
Old query:
select from person where id = 100
New query:
select from person where id = `100
I'm new to Q and haven't been able to figure out how to do this:
Example:
I want to convert 100 to `100.
You would need to convert to string first and then cast to symbol:
q)`$string 100
`100
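So the new query can be built from the old value in the same way (a sketch, assuming person has a symbol id column):
q)select from person where id=`$string 100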
However, having a numerical column as symbols is a pretty bad idea. Is this table being written to disk? If so, this could possibly blow up your sym file and blow up your in-memory interned symbol list (increasing your memory usage)... assuming the numerical values are not very repetitive.
I am creating a new table in a KDB database as a parted splay (parted by date). The new table schema has a column called CCYY, which has a lot of repeating values. I am unsure if I should save it as char or symbol. My main goal is to use the least amount of memory.
As a result which one should I use? What is the benefit/disadvantage of saving repeating values as either a char array or a symbol in a parted splayed setup?
It sounds like you should use symbol.
There's a guide to symbols/enumerations here: http://www.timestored.com/kdb-guides/strings-symbols-enumeration#when-to-use quote:
Typically you should follow the guidelines:
If the column is used in where clause equality comparisons e.g.
select from t where sym in `AB -> Symbol
Short, often repeated strings -> Symbol
Else Long, Non-repeated strings -> String
When evaluating whether or not to use symbol for a column, the cardinality of that column is key. The length of individual values matters less and, if anything, longer values might be better off as symbols, as they will be stored only once in the sym file but repeated in the char vector. That consideration is pretty much moot if you compress your data on disk though.
If your values are short enough, don't forget about the possibility of using .Q.j10, .Q.x10, .Q.j12 and .Q.x12. This will use less space than a char vector. And it doesn't rely on a sym file, which in complex environments can save you from having to re-enumerate if you are, say, copying tables between hdbs whose sym files are not in sync.
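A rough illustration of that encoding, assuming CCYY values are short strings like "2023" (.Q.j10/.Q.x10 handle up to 10 characters, .Q.j12/.Q.x12 up to 12):
q)e:.Q.j10 each ("2021";"2022";"2023")   / encode each char list as a long
q).Q.x10 each e                          / decode the longs back to char lists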
If space is a concern, always compress the data on disk.