Publishing table back to tickerplant - kdb

I am trying to publish a table straight from a real-time engine. Basically, I have a real-time engine that connects to the tickerplant, subscribes to a raw version of a table, and adds some new columns. Now I want this enhanced version of the table to be pushed back to the tickerplant. I have a pub function which pushes the table in the following way:
neg[handle](`.u.upd;`tablename;tabledata)
The problem is that I get a type error. I looked at the schemas of the two tables and they are slightly different.
meta table1
c   | t f a
----| -----
time| p
sym | s
col1| c
col2| s
col3| i
meta table2
c   | t f a
----| -----
time| p
sym | s
col1| C
col2| s
col3| i
That capital C is most likely the problem. However, I cannot load the schema in the tickerplant with capital letters. Any idea how I should go about this?

You can define the schema with a generic list type and it will take its type from the first insert.
tab:([] col1:`int$();generic:();col3:`$())
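For example, a minimal sketch of how the generic column picks up its type on first insert (the inserted rows here are made up):
q)tab:([] col1:`int$();generic:();col3:`$())
q)`tab insert (1 2i;("ab";"cd");`x`y)   / first rows: strings land in the generic column
q)meta tab                              / generic now shows type C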
Another issue is that your tickerplant might expect a list of lists to be sent to its .u.upd rather than the table you may be sending, so you may want to value flip your table before sending it. (Note that the tickerplant will try to prepend a timestamp if the first column isn't already one.)

The capital C in the meta of your table is the result of the incoming data being nested (each row holds a list, e.g. a string, rather than an atom). To resolve this you should declare the schema with an untyped empty list.
table2:([] time:`timestamp$();sym:`$();col1:();col2:`$();col3:`int$())
Consequently, until a row is inserted, its meta is:
q)meta table2
c   | t f a
----| -----
time| p
sym | s
col1|
col2| s
col3| i
This will then be updated to match the first entry into the table.
Also, .u.upd requires the input to be a list of lists rather than a table; this can be resolved using:
neg[handle](`.u.upd;`tablename;value flip tabledata)
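To illustrate what value flip produces (a made-up table, not your actual schema):
q)t:([] time:2#.z.p;sym:`a`b;col1:("abc";"de"))
q)value flip t    / a list of column lists - the shape .u.upd consumes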

Related

KSQL create table with multi-column aggregation

So essentially I'd like to group by two columns, like this:
CREATE TABLE foo AS
SELECT a, b, SUM(a)
FROM whatever
GROUP BY a, b
whatever is a stream in Kafka format. When I issue the command, ksql returns:
Key format does not support schema.
format: KAFKA
schema: Persistence{columns=[`a` STRING KEY, `b` STRING KEY], features=[]}
reason: The 'KAFKA' format only supports a single field. Got: [`a` STRING KEY, `b` STRING KEY]
Caused by: The 'KAFKA' format only supports a single field. Got: [`a` STRING KEY, `b` STRING KEY]
The problem is that the Kafka format does not support multi-column keys. Is there a way to work around this, e.g. by creating an artificial key in this table? I did not manage to do this.
I saw someone post a similar question and the answer seemed to work. I suspect that's because of the format. https://stackoverflow.com/a/50193239/9632703
The documentation mentions that multi-column aggregations might not work, while also saying that ksqlDB applies a workaround in the background to make them work. Unfortunately ksql only returns the given error message. https://www.confluent.de/blog/ksqldb-0-10-updates-key-columns/#multi-column-aggregations
The funny part is that the query works if I omit the first line, CREATE TABLE foo AS. So when data comes in, the aggregation works, but of course it isn't persistent. If nothing else works, I would also be fine with having a table without a primary key defined, if that is possible in ksql, since I could still identify the data with {a, b} in my application.
Can someone help me? Thank you.
You can do this if you upgrade to ksqlDB 0.15. This version introduced multi-key support. You'll need to use a KEY_FORMAT that supports it.
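For example, a sketch reusing the (schematic) query from the question; the WITH (KEY_FORMAT='JSON') clause is one multi-field-capable choice, adjust to your setup:
CREATE TABLE foo WITH (KEY_FORMAT='JSON') AS
SELECT a, b, SUM(a)
FROM whatever
GROUP BY a, b;
With a supporting key format in place, the aggregation then works: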
ksql> CREATE TABLE FOO AS SELECT A, B, SUM(C) FROM TEST_STREAM GROUP BY A,B;
Message
-----------------------------------
Created query with ID CTAS_FOO_53
-----------------------------------
ksql> DESCRIBE FOO;
Name : FOO
Field | Type
---------------------------------------------
A | BIGINT (primary key)
B | VARCHAR(STRING) (primary key)
KSQL_COL_0 | DOUBLE
---------------------------------------------
For runtime statistics and query details run: DESCRIBE EXTENDED <Stream,Table>;
ksql> SELECT * FROM FOO EMIT CHANGES LIMIT 5;
+---------------------------+---------------------------+---------------------------+
|A |B |KSQL_COL_0 |
+---------------------------+---------------------------+---------------------------+
|220071000 |AIS |0.4 |
|257838000 |AIS |6.2 |
|538007854 |AIS |22.700000000000003 |
|257487000 |AIS |2.4 |
|257601800 |AIS |5.8999999999999995 |
Limit Reached
Query terminated

How to optimize inverse pattern matching in PostgreSQL?

I have Pg version 13.
CREATE TABLE test_schemes (
    pattern   TEXT NOT NULL,
    some_code TEXT NOT NULL
);
Example data
 pattern   | some_code
-----------+-----------
 __3_      | c1
 __34      | c2
 1_3_      | a12
 _7__      | a10
 7138      | a19
 _123|123_ | a20
 ___253    | a28
 253       | a29
 2_1       | a30
This table has about 300k rows. I want to optimize a simple query like
SELECT * FROM test_schemes where '1234' SIMILAR TO pattern
 pattern   | some_code
-----------+-----------
 __3_      | c1
 __34      | c2
 1_3_      | a12
 _123|123_ | a20
The problem is that this simple query does a full scan of 300k rows to find all the matches. Given this design, how can I make the query faster (perhaps with a special index)?
Internally, SIMILAR TO works like a regex, which is evident if you run EXPLAIN on the query. You may want to just switch to regexes outright, but it is also worth looking at text_pattern_ops indexes to see if you can improve the performance.
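For reference, such an index is declared like this (a sketch; note that it accelerates queries with the indexed column on the left-hand side of LIKE, so it will not by itself help the inverse match above):
CREATE INDEX test_schemes_pattern_idx ON test_schemes (pattern text_pattern_ops);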
If the pipe is the only feature of SIMILAR TO (other than those present in LIKE) which you use, then you could process it into a form you can use with the much faster LIKE.
SELECT * FROM test_schemes where '1234' LIKE any(string_to_array(pattern,'|'))
In my hands this is about 25 times faster, and gives the same answer as your example on your example data (augmented with a few hundred thousand rows of garbage to get the table row count up to about where you indicated). It does assume there is no escaping of any pipes.
If you store the data already broken apart, it is about 3 times faster still, but of course gives cosmetically different answers.
create table test_schemes2 as select unnest as pattern, some_code from test_schemes, unnest(string_to_array(pattern,'|'));
SELECT * FROM test_schemes2 where '1234' LIKE pattern;

PostgreSQL UPDATE doesn't seem to update some rows

I am trying to update a table from another table, but a few rows simply don't update, while the other million rows work just fine.
The statement I am using is as follows:
UPDATE lotes_infos l
SET quali_ambiental = s.quali_ambiental
FROM sirgas_lotes_centroid s
WHERE l.sql = s.sql AND l.quali_ambiental IS NULL;
It says 647 rows were updated, but I can't see the change.
I've also tried without the IS NULL clause; the results are the same.
If I do a join it seems to work as expected, the join query I used is this one:
SELECT sql, l.quali_ambiental, c.quali_ambiental FROM lotes_infos l
JOIN sirgas_lotes_centroid c
USING (sql)
WHERE l.quali_ambiental IS NULL;
It returns 787 rows (some are both null, that's OK). This is a sample of the result from the join:
    sql     | quali_ambiental | quali_ambiental
------------+-----------------+-----------------
 1880040001 |                 | PA 10
 1880040001 |                 | PA 10
 0863690003 |                 | PA 4
 0850840001 |                 | PA 4
 3090500003 |                 | PA 4
 1330090001 |                 | PA 10
 1201410001 |                 | PA 9
 0550620002 |                 | PA 6
 0430790001 |                 | PA 1
 1340180002 |                 | PA 9
I used QGIS to visualize the results and could not find any clue as to why this is happening. The sirgas_lotes_centroid table comes from the other table, with its geometry being the centroid of the polygon. I used the centroid to perform faster spatial joins and now need to place the information into the table with the original polygon.
The sql column is type text, quali_ambiental is varchar(6) for both.
If I directly update one row using the following query, it works just fine:
UPDATE lotes_infos
SET quali_ambiental = 'PA 1'
WHERE sql LIKE '0040510001';
If you don't see results of a seemingly sound data-modifying query, the first question to ask is:
Did you commit your transaction?
Many clients work with auto-commit by default, but some do not. And even in the standard client psql you can start an explicit transaction with BEGIN (or syntax variants) to disable auto-commit. Results are then not visible to other transactions until the transaction is actually committed with COMMIT. An open transaction might hang indefinitely (which creates additional problems) or be rolled back by some later interaction.
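A minimal sketch of the explicit-transaction case in psql:
BEGIN;
UPDATE lotes_infos l
SET    quali_ambiental = s.quali_ambiental
FROM   sirgas_lotes_centroid s
WHERE  l.sql = s.sql
AND    l.quali_ambiental IS NULL;
COMMIT;  -- without this, no other session ever sees the update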
That said, you mention: some are both null, that's ok. You'll want to avoid costly empty updates with something like:
UPDATE lotes_infos l
SET quali_ambiental = s.quali_ambiental
FROM sirgas_lotes_centroid s
WHERE l.sql = s.sql
AND l.quali_ambiental IS NULL
AND s.quali_ambiental IS NOT NULL; --!
Related:
How do I (or can I) SELECT DISTINCT on multiple columns?
The duplicate 1880040001 in your sample can have two explanations. Either lotes_infos.sql is not UNIQUE (even after filtering with l.quali_ambiental IS NULL). Or sirgas_lotes_centroid.sql is not UNIQUE. Or both.
If it's just lotes_infos.sql, your query should still work. But duplicates in sirgas_lotes_centroid.sql make the query non-deterministic (as @jjanes also pointed out). A target row in lotes_infos can have multiple candidates in sirgas_lotes_centroid. The outcome is arbitrary for lack of definition. If one of them has quali_ambiental IS NULL, it can explain what you observed.
My suggested query fixes the observed problem superficially, in that it excludes NULL values in the source table. But if there can be more than one non-null, distinct quali_ambiental for the same sirgas_lotes_centroid.sql, your query remains broken, as the result is arbitrary. You'll have to define which source row to pick and translate that into SQL.
Here is one example of how to do that (see the chapter "Multiple matches..."):
Updating the value of a column
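A minimal sketch of that technique using DISTINCT ON; the ORDER BY tie-break below is a placeholder you must replace with your actual rule for picking the right source row:
UPDATE lotes_infos l
SET    quali_ambiental = s.quali_ambiental
FROM  (
   SELECT DISTINCT ON (sql)
          sql, quali_ambiental
   FROM   sirgas_lotes_centroid
   WHERE  quali_ambiental IS NOT NULL
   ORDER  BY sql, quali_ambiental  -- placeholder tie-break: picks the smallest value
   ) s
WHERE  l.sql = s.sql
AND    l.quali_ambiental IS NULL;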
Always include exact table definitions (CREATE TABLE statements) with any such question. This would save a lot of time wasted on speculation.
Aside: Why are the sql columns type text? Values like 1880040001 strike me as integer or bigint. If so, text is a costly design error.

Postgresql: More efficient way of joining tables based on multiple address fields

I have a table that lists two connected values, ID and TaxNumber (TIN), that looks somewhat like this:
IDTINMap
ID         | TIN
-----------+--------
1234567890 | 654321
3456321467 | 986321
8764932312 | 245234
An ID can map to multiple TINs, and a TIN might map to multiple IDs, but there is a UNIQUE constraint on the table for the (ID, TIN) pair.
This list isn't complete, and the table has about 8000 rows. I have another table, IDListing that contains metadata for about 9 million IDs including name, address, city, state, postalcode, and the ID.
What I'm trying to do is build an expanded ID - TIN map. Currently I'm doing this by first joining the IDTINMap table with IDListing on the ID field, which gives something that looks like this in a CTE that I'll call Step1 right now:
ID         | TIN    | Name      | Address        | City          | State | Zip
-----------+--------+-----------+----------------+---------------+-------+------
1234567890 | 654321 | John Doe  | 123 Easy St    | Seattle       | WA    | 65432
3456321467 | 986321 | Tyler Toe | 874 W 84th Ave | New York      | NY    | 48392
8764932312 | 245234 | Jane Poe  | 984 Oak Street | San Francisco | CA    | 12345
Then I go through again and join the IDListing table again, joining Step1 on address, city, state, zip, and name all being equal. I know I could do something more complicated like fuzzy matching, but for right now we're just looking at exact matches. In the join I preserve the ID in step 1 as 'ReferenceID', keep the TIN, and then have another column of all the matching IDs. I don't keep any of the address/city/state/zip info, just the three numbers.
Then I can go back and insert all the distinct pairs into a final table.
I've tried this with a query and it works, giving me the desired result. However, the query is slower than desired. I'm used to joining on columns that I've indexed (like ID or TIN), but it's slow to join on all of the address fields. Is there a good way to improve this? Joining on each field individually is faster than joining on a CONCAT() of all the fields (I have tried this). I'm just wondering if there is another way to optimize it.
Make the final result a materialized view. Refresh it when you need to update the data (every night? every three hours?). Then use this view for your normal operations.
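A minimal sketch, assuming tables idtinmap(id, tin) and idlisting(id, name, address, city, state, postalcode); adapt the names and the join to your actual query:
CREATE MATERIALIZED VIEW expanded_id_tin_map AS
WITH step1 AS (
   SELECT m.id AS reference_id, m.tin,
          l.name, l.address, l.city, l.state, l.postalcode
   FROM   idtinmap  m
   JOIN   idlisting l USING (id)
   )
SELECT DISTINCT s.reference_id, s.tin, l2.id AS matched_id
FROM   step1 s
JOIN   idlisting l2 USING (name, address, city, state, postalcode);

-- on whatever schedule fits (nightly, every three hours, ...):
REFRESH MATERIALIZED VIEW expanded_id_tin_map;
With a unique index on the view you could use REFRESH MATERIALIZED VIEW CONCURRENTLY to avoid blocking readers during the refresh.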

Using filtered results as field for calculated field in Tableau

I have a table that looks like this:
+------------+-----------+---------------+
| Invoice_ID | Charge_ID | Charge_Amount |
+------------+-----------+---------------+
| 1 | A | $10 |
| 1 | B | $20 |
| 2 | A | $10 |
| 2 | B | $20 |
| 2 | C | $30 |
| 3 | C | $30 |
| 3 | D | $40 |
+------------+-----------+---------------+
In Tableau, how can I have a field that SUMs the Charge_Amount for the Charge_IDs B, C and D, where the invoice has a Charge_ID of A? The result would be $70.
My datasource is SQL Server, so I was thinking that I could add a field (called Has_ChargeID_A) to the SQL Server Table that tells if the invoice has a Charge_ID of A, and then in Tableau just do a SUM of all the rows where Has_ChargeID_A is true and Charge_ID is either B, C or D. But I would prefer if I can do this directly in Tableau (not this exactly, but anything that will get me to the same result).
Your intuition is steering you in the right direction. You do want to filter to only invoices that contain a row with a Charge_ID of A, and you can do this directly in Tableau.
First place Invoice_ID on the filter shelf, then select the Condition tab for the filter. Then select the "By formula" option on the condition tab and enter the formula you wish to use to determine which invoice_ids are included by the filter.
Here is a formula for your example:
count(if Charge_ID = 'A' then 'Y' end) > 0
For each data row, it will calculate the value of the expression inside the parentheses, and then only include invoice_ids with at least one non-null value for the internal expression. (The implicit else for the if statement "returns" null.)
The condition tab for a dimension field equates to a HAVING clause in SQL.
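In SQL terms, the condition filter behaves roughly like this (a sketch, assuming a single invoices table):
SELECT Invoice_ID
FROM   invoices
GROUP  BY Invoice_ID
HAVING COUNT(CASE WHEN Charge_ID = 'A' THEN 1 END) > 0;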
If condition formulas get complex, it's often a good idea to define them with a calculated field -- or a combination of several simpler calculated fields, just to keep things manageable.
Finally, if you end up working with sets of dimensions like this frequently, you can define them as sets. You can still drop sets on the filter shelf, but you can then reuse them in other ways: testing set membership in a calculated field (like a SQL IN clause), or creating new sets using intersection and union operators. You can think of sets as named filters, such as the set of invoices that contain a type A charge.