I need to store benchmark runs for each nightly build. To do this, I came up with the following data model.
BenchmarkColumnFamily= {
build_1: {
(Run1, TPS) : 1000K
(Run1, Latency) : 0.5ms
(Run2, TPS) : 1000K
(Run2, Latency) : 0.5ms
(Run3, TPS) : 1000K
(Run3, Latency) : 0.5ms
}
build_2: {
...
}
...
}
To create such a schema, I came up with the following command in cassandra-cli:
create column family BenchmarkColumnFamily with
comparator = 'CompositeType(UTF8Type,UTF8Type)' AND
key_validation_class=UTF8Type AND
default_validation_class=UTF8Type AND
column_metadata = [
{column_name: TPS, validation_class: UTF8Type},
{column_name: Latency, validation_class: UTF8Type}
];
Does the above command create the schema I intend to create? The reason for my confusion is that when I insert data into the above CF using:
set BenchmarkColumnFamily['1545']['TPS']='100';
it gets inserted successfully, even though the comparator type is composite. Furthermore, even the following command executes successfully:
set BenchmarkColumnFamily['1545']['Run1:TPS']='1000';
What is it that I'm missing?
I don't think you're doing anything wrong. The CLI parses the value strings based on the type, probably using org.apache.cassandra.db.marshal.AbstractType<T>.fromString(). For composite types it uses ':' as the field separator (not that I've seen this documented, but I've experimented with Java code to convince myself).
Without a ':', it seems to set only the first part of the composite and leave the second as null. To set only the second part, you can use:
set BenchmarkColumnFamily['1545'][':NOT_TPS']='999';
From the CLI, dump out the CF:
list BenchmarkColumnFamily;
and you should see all the names (for all the rows), e.g.
RowKey: 1545
=> (column=:NOT_TPS, value=999, timestamp=1342474086048000)
=> (column=Run1:TPS, value=1000, timestamp=1342474066695000)
=> (column=TPS, value=100, timestamp=1342474057824000)
There is no way (via the CLI) to constrain the composite elements to be non-null or specific values; that's something you'd have to do in code.
Also, the column_metadata option for the CF creation is unnecessary, since you've already listed the default validation as UTF8Type.
The cassandra-cli tool is very limited in dealing with composites. Also, some unexpected things can happen in Cassandra with respect to validation of explicit, user-supplied composites. I don't know the exact answer for your situation, but I can tell you that you'll find this sort of model vastly easier to work with using the CQL 3 engine.
For example, your model could be expressed as:
CREATE TABLE BenchmarkColumnFamily (
build text,
run int,
tps text,
latency text,
PRIMARY KEY (build, run)
);
INSERT INTO BenchmarkColumnFamily (build, run, tps, latency) VALUES ('1545', 1, '1000', '0.5ms');
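Reading a build's runs back out is then just a query against that table. A minimal sketch, using only the columns defined above:
-- Each run comes back as its own row, mirroring the (Run, metric) columns of the wide-row model.
SELECT run, tps, latency
FROM BenchmarkColumnFamily
WHERE build = '1545';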
See this post for more information about how that translates to the storage-engine layer.
I'm currently using jOOQ to build my SQL (with code generation via the mvn plugin).
Executing the created query is not done by jOOQ, though (I'm using Vert.x SqlClient for that).
Let's say I want to select all columns of two tables which share some identical column names, e.g. UserAccount(id, name, ...) and Product(id, name, ...). When executing the following code
val userTable = USER_ACCOUNT.`as`("u")
val productTable = PRODUCT.`as`("p")
create().select().from(userTable).join(productTable).on(userTable.ID.eq(productTable.AUTHOR_ID))
the build method query.getSQL(ParamType.NAMED) returns a query like
SELECT "u"."id", "u"."name", ..., "p"."id", "p"."name", ... FROM ...
The problem here is that the result set will contain the columns id and name twice, without the prefix "u." or "p.", so I can't map/parse it correctly.
Is there a way to tell jOOQ to alias these columns like the following, without any further manual effort?
SELECT "u"."id" AS "u.id", "u"."name" AS "u.name", ..., "p"."id" AS "p.id", "p"."name" AS "p.name" ...
I'm using the holy Postgres database :)
EDIT: My current approach would be something like
val productFields = productTable.fields().map { it.`as`(name("p.${it.name}")) }
val userFields = userTable.fields().map { it.`as`(name("u.${it.name}")) }
create().select(productFields,userFields,...)...
This feels really hacky, though.
How to correctly dereference tables from records
You should always use the column references that you passed to the query to dereference values from records in your result. If you didn't pass column references explicitly, then the ones from your generated table via Table.fields() are used.
In your code, that would correspond to:
userTable.NAME
productTable.NAME
So, in a resulting record, do this:
val rec = ...
rec[userTable.NAME]
rec[productTable.NAME]
Using Record.into(Table)
Since you seem to be projecting all the columns (do you really need all of them?) to the generated POJO classes, you can still do this intermediary step if you want:
val rec = ...
val userAccount: UserAccount = rec.into(userTable).into(UserAccount::class.java)
val product: Product = rec.into(productTable).into(Product::class.java)
Because the generated table has all the necessary meta data, it can decide which columns belong to it, and which ones don't. The POJO doesn't have this meta information, which is why it can't disambiguate the duplicate column names.
Using nested records
You can always use nested records directly in SQL as well in order to produce one of these 2 types:
Record2<Record[N], Record[N]> (e.g. using DSL.row(table.fields()))
Record2<UserAccountRecord, ProductRecord> (e.g. using DSL.row(table.fields()).mapping(...), or, starting from jOOQ 3.17, directly using a Table<R> as a SelectField<R>)
The second jOOQ 3.17 solution would look like this:
// Using an implicit join here, for convenience
create().select(productTable.userAccount(), productTable)
.from(productTable)
.fetch();
The above uses implicit joins, for additional convenience.
Auto aliasing all columns
There are a ton of flavours that users might like to have when "auto-aliasing" columns in SQL. Any solution offered by jOOQ would be no better than the one you've already found, so if you still want to auto-alias all columns, just do what you did.
But usually, the desire to auto-alias is a feature request derived from a misunderstanding of the best approach to do something in jOOQ (see the options above), so ideally, you don't go down the auto-aliasing road.
I have a table with a column named "data" that contains a JSON object of telemetry data.
The telemetry data is received from devices, some of which have received a firmware upgrade. After the upgrade, some of the properties have gained a "namespace" (e.g. "propertyname" is now "dsns:propertyname"), and a few of the properties are suddenly camel-cased (e.g. "propertyname" is now "propertyName").
Neither the meaning nor the number of the properties has changed.
When querying the table, I do not want to get the whole "data" column, as the JSON is quite large; I only want the properties that are needed.
For now I have simply used json_build_object, such as:
select json_build_object('requestedproperty', "data"->'requestedproperty', 'anotherrequestedproperty', "data"->'anotherrequestedproperty') as "data"
from device_data
where id ='f4ddf01fcb6f322f5f118100ea9a81432'
and timestamp >= '2020-01-01 08:36:59.698' and timestamp <= '2022-02-16 08:36:59.698'
order by timestamp desc
But it does not work for fetching data from after the firmware upgrade for the devices that have received it.
It still works for querying data from before the upgrade, and I would like to have only one query for this if possible.
Is there an easy way to "ignore" namespaces and be case-insensitive with json_build_object?
I'm obviously no PostgreSQL expert and would love all the help I can get.
This kind of insanity is exactly why device developers should not ever be allowed anywhere near a keyboard :-)
This is not an easy fix, but so long as you know the variations on the key names, you can use coalesce() to accomplish your goal and update your query with each new release:
json_build_object(
  'requestedproperty', coalesce(
    data->'requestedproperty',
    data->'dsns:requestedproperty',
    data->'requestedProperty'),
  'anotherrequestedproperty', coalesce(
    data->'anotherrequestedproperty',
    data->'dsns:anotherrequestedproperty',
    data->'anotherRequestedProperty')
) as "data"
Edit to add: You can also use jsonb_each_text(), treat key-value pairs as rows, and then use lower() and a regex split to doctor key names more generally, but I bet that the kinds of scatterbrains behind the inconsistencies you already see will eventually lead to a misspelled key name someday.
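For completeness, here is a hedged sketch of that jsonb_each_text() approach. The table, id and property names come from the question; the cast to jsonb and the key-normalization regex are my own assumptions:
-- Normalize each key by dropping a leading "namespace:" prefix and lower-casing it,
-- then keep only the requested properties and rebuild a small JSON object per row.
select d.id,
       d.timestamp,
       json_object_agg(lower(regexp_replace(kv.key, '^[^:]+:', '')), kv.value) as "data"
from device_data d
cross join lateral jsonb_each_text(d."data"::jsonb) as kv(key, value)
where d.id = 'f4ddf01fcb6f322f5f118100ea9a81432'
  and lower(regexp_replace(kv.key, '^[^:]+:', ''))
      in ('requestedproperty', 'anotherrequestedproperty')
group by d.id, d.timestamp
order by d.timestamp desc;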
How to avoid the unnecessary CPU cost?
See this historic question with failure tests. Example: j->'x' is a JSONb value representing a number and j->'y' a boolean. From the first versions of JSONb (released in 2014 with 9.4) until today (6 years!), with PostgreSQL v12, it seems that we need to enforce a double conversion:
Discard the "binary JSONb number" information in j->'x' and transform it into the printable string j->>'x'; discard the "binary JSONb boolean" information in j->'y' and transform it into the printable string j->>'y'.
Parse the string to obtain a "binary SQL float" by casting, (j->>'x')::float AS x; parse the string to obtain a "binary SQL boolean" by casting, (j->>'y')::boolean AS y.
Is there no syntax or optimized function that lets a programmer enforce the direct conversion?
I don't see it in the guide... Or was it never implemented: is there a technical barrier to it?
NOTES about a typical scenario where we need it
(responding to comments)
Imagine a scenario where your system needs to store many, many small datasets (a real example!) with minimal disk usage, managing them all with centralized control/metadata/etc. JSONb is a good solution, and offers at least 2 good alternatives for storing them in the database:
Metadata (with a schema descriptor) and the whole dataset in an array of arrays;
Separating metadata and dataset rows into two tables.
(and variations where the metadata is translated to a cache of text[], etc.) Alternative 1, monolithic, is best for the "minimal disk usage" requirement and is faster for full information retrieval. Alternative 2 can be the choice for random access or partial retrieval, when the table Alt2_DatasetLine also has one more column, like time, for time series. A sketch of both layouts follows.
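A hedged sketch of both layouts. The names Alt1_AllDataset, dataset_id and j_alldata come from the view example below; everything else is an assumption:
-- Alternative 1: metadata plus the whole dataset packed into one JSONb value.
CREATE TABLE Alt1_AllDataset (
  dataset_id int PRIMARY KEY,
  meta       jsonb NOT NULL,  -- schema descriptor, e.g. column names and types
  j_alldata  jsonb NOT NULL   -- all dataset rows in a single JSONb array
);

-- Alternative 2: metadata in one table, one row per dataset line in another.
CREATE TABLE Alt2_Dataset (
  dataset_id int PRIMARY KEY,
  meta       jsonb NOT NULL
);
CREATE TABLE Alt2_DatasetLine (
  dataset_id int NOT NULL REFERENCES Alt2_Dataset(dataset_id),
  time       timestamp,       -- optional extra column, e.g. for time series
  j          jsonb NOT NULL   -- one dataset row
);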
You can create all the SQL VIEWs in a separate schema, for example:
CREATE VIEW mydatasets.t1234 AS
SELECT (j->>'d')::date AS d, j->>'t' AS t, (j->>'b')::boolean AS b,
(j->>'i')::int AS i, (j->>'f')::float AS f
FROM (
select jsonb_array_elements(j_alldata) j FROM Alt1_AllDataset
where dataset_id=1234
) t
-- or FROM alt2...
;
And the CREATE VIEWs can all be automatic, running the SQL string dynamically... We can reproduce the above "stable schema casting" by simple formatting rules extracted from the metadata (a sketch of executing the generated string dynamically follows the query below):
SELECT string_agg( CASE
WHEN x[2]!='text' THEN format(E'(j->>\'%s\')::%s AS %s',x[1],x[2],x[1])
ELSE format(E'j->>\'%s\' AS %s',x[1],x[1])
END, ',' ) as x2
FROM (
SELECT regexp_split_to_array(trim(x),'\s+') x
FROM regexp_split_to_table('d date, t text, b boolean, i int, f float', ',') t1(x)
) t2;
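A hedged sketch of running that generated string dynamically. The schema name mydatasets and the hard-coded type descriptor are carried over from the examples above; a real system would read the descriptor from the metadata:
DO $$
DECLARE
  cols text;
BEGIN
  -- Build the casting expressions exactly as in the query above.
  SELECT string_agg( CASE
           WHEN x[2] <> 'text' THEN format(E'(j->>\'%s\')::%s AS %s', x[1], x[2], x[1])
           ELSE format(E'j->>\'%s\' AS %s', x[1], x[1])
         END, ', ' )
  INTO cols
  FROM (
    SELECT regexp_split_to_array(trim(x), '\s+') x
    FROM regexp_split_to_table('d date, t text, b boolean, i int, f float', ',') t1(x)
  ) t2;

  -- Create the view from the generated column list.
  EXECUTE format(
    'CREATE VIEW mydatasets.t1234 AS
       SELECT %s
       FROM (SELECT jsonb_array_elements(j_alldata) j
             FROM Alt1_AllDataset WHERE dataset_id = 1234) t',
    cols);
END $$;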
... It's a "real life scenario": this (apparently ugly) model is surprisingly fast for small-traffic applications. And it has other advantages besides the disk-usage reduction: flexibility (you can change the dataset schema without changing the SQL schema) and scalability (2, 3, ... 1 billion different datasets in the same table).
Returning to the question: imagine a dataset with ~50 or more columns; the SQL VIEW would be faster if PostgreSQL offered a "binary to binary" casting.
Short answer: No, there is no better way to extract a jsonb number in PostgreSQL than (for example)
CAST(j ->> 'attr' AS double precision)
A JSON number happens to be stored as a PostgreSQL numeric internally, so that wouldn't work "directly" anyway. But there is no reason in principle why there could not be a more efficient way to extract such a value as numeric.
So, why don't we have that?
Nobody has implemented it. That is often an indication that nobody thought it worth the effort. I personally think that this would be a micro-optimization: if you want to go for maximum efficiency, you extract that value from the JSON and store it directly as a column in the table (a sketch of this follows at the end of this answer).
It is not necessary to modify the PostgreSQL source to do this. It is possible to write your own C function that does exactly what you envision. If many people thought this was beneficial, I'd expect that somebody would already have written such a function.
PostgreSQL has just-in-time compilation (JIT). So if an expression like this is evaluated for a lot of rows, PostgreSQL will build executable code for that on the fly. That mitigates the inefficiency and makes it less necessary to have a special case for efficiency reasons.
It might not be quite as easy as it seems for many data types. JSON standard types don't necessarily correspond to PostgreSQL types in all cases. That may seem contrived, but look at this recent thread in the Hackers mailing list that deals with the differences between the numeric types of JSON and PostgreSQL.
None of the above means that such a feature could never exist; I just wanted to give reasons why we don't have it.
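As an illustration of the first point above (extracting the value into a real column), here is a hedged sketch; the table and key names (t, j, 'x') follow the question's example, and the generated column assumes PostgreSQL 12 or later:
-- The cast is evaluated once, on write; queries then read a plain float8 column.
ALTER TABLE t
  ADD COLUMN x double precision
  GENERATED ALWAYS AS (CAST(j ->> 'x' AS double precision)) STORED;

-- The extracted column can be indexed and filtered without per-row JSONb parsing.
CREATE INDEX ON t (x);
SELECT x FROM t WHERE x > 42;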
The environment for this question is PostgreSQL 9.6.5 on AWS RDS.
The question is about an optimal schema design and batch update strategy for a table with 300 million rows containing the following logical data model:
id: primary key, string up to 40 characters long
code: integer 1-999
year: integer year
flags: a variable number (1,000+) of flags, each associated with a name, with new flags added over time. Ideally, a flag should be thought of as having three values: absent (null), on (true/1) and off (false/0). It is possible, at the cost of additional updates (see below), to treat a flag as a simple bit (on or off, no absent). "On" values are typically very sparse: < 1/1000.
Queries typically involve boolean expressions on the presence or absence of one or more flags (by name) with code and year occasionally involved also.
The data is updated in batch via Apache Spark, i.e., updates can be represented as flat file(s), e.g., in COPY format, or as SQL operations. Only one update is active at any one time. Updates to code and year are very infrequent. Updates to flags affect 1-5% of rows per update (3-15 million rows). It is possible for the update rows to include all flags and their values, just the "on" flags to be updated, or just the flags whose values have changed. In the former case, Spark would need to query the data to get the current values of the flags.
There will be a small read load during updates.
The question is about an optimal schema and associated update strategy to support the query & updates as described.
Some comments from research so far:
Using 1,000+ boolean columns would create a very efficient row representation but, in addition to some DDL complexity, would require 1,000+ indexes.
Bit strings would be great if there was a way to index individual bits. Also, they do not offer a good way to represent absent flags. Using this approach would require maintaining a lookup table between flag names and bit IDs. Merging updates, if needed, works with ||, though, given PostgreSQL's MVCC there doesn't seem to be much benefit to updating just flags as opposed to replacing an entire row.
JSONB fields offer indexing. They also offer null representation, but that comes at a cost: all flags that are "off" would need to be explicitly set, which would make the fields quite large. If we ignore null representation, JSONB fields would be relatively small. To further shrink them, we could use short 1-3 character field names with a lookup table. Same comments re: merging as with bit strings. (See the index sketch after this list.)
tsvector/tsquery: I have no experience with this data type but, in theory, it seems to be an exact representation of a set of "on" flags by name. It would require a lookup table mapping flag names to tokens, with the additional requirement of ensuring there are no collisions due to stemming.
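A hedged sketch of the JSONB variant described above; all names here are illustrative, not taken from the question:
-- Store only the flags that are present; an absent key represents "absent (null)".
CREATE TABLE data_jsonb (
  id    text PRIMARY KEY,
  code  smallint,
  year  smallint,
  flags jsonb NOT NULL DEFAULT '{}'   -- e.g. {"f1": true, "f7": false}
);
CREATE INDEX data_jsonb_flags_gin ON data_jsonb USING gin (flags);

-- "flag f1 is on AND flag f7 is off": @> containment can use the GIN index.
SELECT id FROM data_jsonb
WHERE flags @> '{"f1": true}' AND flags @> '{"f7": false}';

-- "flag f9 is absent": the ? existence operator is GIN-indexable,
-- though this negated form will not use the index.
SELECT id FROM data_jsonb
WHERE NOT flags ? 'f9';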
Don't store the flags in the main table.
Assuming that the main table is called data, define something like the following:
CREATE TABLE flag_names (
id smallint PRIMARY KEY,
name text NOT NULL
);
CREATE TABLE flag (
flagname_id smallint NOT NULL REFERENCES flag_names(id),
data_id text NOT NULL REFERENCES data(id),
value boolean NOT NULL,
PRIMARY KEY (flagname_id, data_id)
);
If a new flag is created, insert a new row in flag_names.
If a flag is set to TRUE or FALSE, insert or update a row in the flag table.
Join flag with data to test if a certain flag is set. A short sketch of these statements follows.
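A hedged sketch of those three statements; the flag and row identifiers are illustrative, and the upsert relies on the primary key defined above:
-- A new flag is created:
INSERT INTO flag_names (id, name) VALUES (42, 'new_flag');

-- A flag is set to TRUE or FALSE for one data row (insert or update):
INSERT INTO flag (flagname_id, data_id, value)
VALUES (42, 'abc123', true)
ON CONFLICT (flagname_id, data_id) DO UPDATE SET value = EXCLUDED.value;

-- Test whether flag 42 is "on" for rows of data:
SELECT d.*
FROM data d
JOIN flag f ON f.data_id = d.id
WHERE f.flagname_id = 42
  AND f.value;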
Can someone advise me on the SQL data type that should be used for a DICOM UID, e.g. 1.2.840.113986.3.2702661254.20150220.144310.372.4424 as a sample? I would like to use it as a primary key as well.
There are two options available here: either use a less-than-ideal data type which already exists, of which "text" is almost certainly the best option, or implement a custom data type for this particular type of data.
While the best built-in option is "text", looking at the example provided, you would likely get significant performance and space benefits from using a custom data type, though it would require writing code to implement it.
A final option to consider is to use a surrogate key for that data. To do this, you would build a table which contains a "bigserial" column and a "text" column. The "text" column would hold the long form of the value as shown above, and the "bigserial" column would provide an integer (64-bit with bigserial, 32-bit if you use "serial" instead) which you would then use in all of your tables instead of the long form. A sketch follows.
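A hedged sketch of that surrogate-key layout; the table and column names are illustrative:
CREATE TABLE dicom_uid (
  id  bigserial PRIMARY KEY,  -- compact surrogate key used by all other tables
  uid text NOT NULL UNIQUE    -- e.g. '1.2.840.113986.3.2702661254.20150220.144310.372.4424'
);

-- Other tables reference the integer instead of the long text value:
CREATE TABLE study (
  uid_id bigint NOT NULL REFERENCES dicom_uid(id)
  -- ... other columns ...
);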