Should KSQL tables be showing multiple rows per key for aggregates? - apache-kafka

My understanding of KSQL tables is that they show an "as is" view of our data rather than all the data.
So if I have a simple aggregating query and I SELECT from my table, I should see the data as it is at this point in time.
My data (stream):
MY_TOPIC_STREAM:
15 | BEACH | Steven Ebb | over there
24 | CIRCUS | John Doe | an address
30 | CIRCUS | Alice Small | another address
35 | CIRCUS | Barry Share | a home
35 | CIRCUS | Garry Share | a home
40 | CIRCUS | John Mee | somewhere
45 | CIRCUS | David Three | a place
45 | CIRCUS | Mary Three | a place
45 | CIRCUS | Joffrey Three | a place
My table definition:
CREATE TABLE MY_TABLE WITH (VALUE_FORMAT='AVRO') AS
SELECT ROWKEY AS APPLICATION, COUNT(*) AS NUM_APPLICANTS
FROM MY_TOPIC_STREAM
WHERE header->eventType = 'CIRCUS'
GROUP BY ROWKEY;
I am confused as to why I see multiple rows in my table, even though the eventual aggregates are correct.
SELECT * FROM MY_TABLE;
APPLICATION NUM_APPLICANTS
24 1
30 1
--> 35 1 <-- why do I see this?
35 2
40 1
--> 45 1 <-- why do I see this?
--> 45 2 <-- why do I see this?
45 3
My sink topic also shows me the same as the table output - presumably this is correct?
I expected my table result to be:
APPLICATION NUM_APPLICANTS
24 1
30 1
35 2
40 1
45 3
Outputs abridged for brevity and readability above, but you get the gist.
So - are my expectations of the table and sink topic outputs off the mark?
UPDATE
Matthias' answer below correctly explains that the table and sink topic show changelog events, so it is normal to see intermediate values. However, what was confusing me was that I was seeing all of the intermediate rows. It turned out that this was because I was using the Confluent 5.2.1 docker-compose file, which sets the environment variable KSQL_STREAMS_CACHE_MAX_BYTES_BUFFERING=0. This disables caching of intermediate results in KSQL aggregations, so the table shows more rows than expected whilst eventually arriving at the correct aggregates. Setting this to e.g. 10MB caused the data to be output as expected.
This setting is not immediately obvious in the documentation for those starting to play with KSQL and using Docker to stand up the instances! This issue pointed me in the right direction, and this page documents the parameters. I had spent a long time on this and could not work out why it was not behaving as expected! I hope this helps someone.
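If I understand the mapping correctly, the same cache size can also be raised for a single KSQL CLI session by setting the underlying Kafka Streams property before creating the table (the exact property name here is my assumption based on the env var naming, so verify it against the docs for your version):
-- In the KSQL CLI, before issuing the CREATE TABLE ... AS SELECT:
-- roughly 10 MB of buffer for intermediate aggregation results
SET 'cache.max.bytes.buffering' = '10485760';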

Not sure what version you are using; however, SELECT * FROM MY_TABLE; does not return the current content of the table, but the table's changelog stream (this holds for older versions; in newer versions the query you show is not valid, as the syntax was changed).
Since the transition from KSQL to ksqlDB, the query you showed would be called a push query, expressed as SELECT * FROM my_table EMIT CHANGES;.
Furthermore, ksqlDB introduced pull queries that allow you to look up the current state. However, SELECT * FROM my_table; is not supported as a pull query yet (it will be added in the future). At the moment you can only do table lookups for a specific key, i.e., there must be a WHERE clause.
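For illustration, a key lookup would look something like this (just a sketch: it assumes the table's key is the application id shown in the question, and a ksqlDB version that already ships pull queries):
-- Pull query: point-in-time lookup for a single key
SELECT * FROM MY_TABLE WHERE ROWKEY = '45';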
Check out the docs for more details: https://docs.ksqldb.io/en/latest/concepts/queries/pull/

Related

Optimal relational design for groups and subgroup relationships

I have a bit of an intro-level relational database design question. I'm working on a project where I'm capturing information from scientific journal articles and storing that in a Postgres database. One of my primary goals is to define a schema that is flexible enough to cover most cases I might encounter in a broad set of papers. In reality, articles tend to report a semi-standard set of details, but there's definitely variance once you get into the details. These things are written for humans, not machines.
For the most part, defining the schema has been pretty straightforward, but one thing I'm stuck on is how to sensibly structure a set of tables to capture details about a study's subject groups and subsets of subjects.
Take for example a simple randomized control trial - you typically have a set of people identified as screened for eligibility, a set determined to be eligible, a set randomized into the control group, and a set randomized into the treatment group. Within each of those groups you can have subgroups defined in all sorts of specific ways, but generally by some sort of interval (e.g. Age 26-32) or a category (e.g. pregnant/not pregnant).
Currently, I've set this up so that a Study record can have many Subject records, and Subject records can have many Interval_Subgroup records and many Categorical_Subgroup records.
Subject
-----------------------------------------
id | groupType | measure | value | study
-----------------------------------------
13 | treatment | count | 578 | 17
14 | control | count | 552 | 17
Interval_Subgroup
---------------------------------------------------------------
id | factor | factorMin | factorMax | measure | value | subject
---------------------------------------------------------------
41 | age | 18 | 24 | count | 125 | 13
42 | age | 25 | 32 | count | 204 | 13
Categorical_Subgroup
-----------------------------------------------------
id | factor | factorValue | measure | value | subject
-----------------------------------------------------
74 | sex | male | count | 251 | 13
75 | sex | female | count | 327 | 13
This seems workable, but feels clunky because I have two tables for capturing the same type of information. Also it's limiting because it wouldn't allow me to capture any combination of subgroup sets like males of age 18-24. Some studies report that kind of detail, some don't, but I want to be able to capture any depth of subgroup info the paper offers.
What is a more flexible way to structure these tables than what I've described above? I'm trying to sketch out how I think this should work, and right now I have subject groups having many subgroups and subgroups having many subgroup definitions. There would just be one table capturing measurements about subgroups, and another table for defining what each subgroup is. I'm not sure if that is the right direction. Maybe there is a far simpler solution that you might know of.
Thanks for taking the time to help out - it's much appreciated!
Edit:
Fixed id to be unique in the example tables.
From your description it sounds like a factor is a thing, and that each subgroup has one or more factors. To me this implies that factor needs its own table. Factors can in turn be of type interval or categorical, which means single table inheritance might be in order.
Example tables might look something like this:
subgroups
------------------------------
id | measure | value | subject
------------------------------
41 | count | 125 | 13
42 | count | 204 | 13
factors
-----------------------------------------------------------------------------
id | type        | factor | category | interval_min | interval_max | subgroup
-----------------------------------------------------------------------------
68 | interval    | age    | NULL     | 18           | 24           | 41
69 | categorical | sex    | male     | NULL         | NULL         | 41
In this example subgroup 41 has two factors: age 18-24 and sex male.
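To make that concrete, a rough Postgres DDL sketch of the STI layout could look like this (column types are guesses, and the subject reference assumes your existing Subject table):
CREATE TABLE subgroups (
    id      serial  PRIMARY KEY,
    measure text    NOT NULL,
    value   numeric NOT NULL,
    subject integer NOT NULL REFERENCES subject (id)
);

CREATE TABLE factors (
    id           serial  PRIMARY KEY,
    type         text    NOT NULL CHECK (type IN ('interval', 'categorical')),
    factor       text    NOT NULL,  -- e.g. 'age', 'sex'
    category     text,              -- filled when type = 'categorical'
    interval_min numeric,           -- filled when type = 'interval'
    interval_max numeric,
    subgroup     integer NOT NULL REFERENCES subgroups (id)
);
A query joining subgroups to factors then lets you express any combination of factors per subgroup, e.g. males aged 18-24.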
It could also be that STI is overkill here, in which case you'd split factor into two tables, categorical_factors and interval_factors, and a subgroup could have zero or many of each.
As far as I'm aware, the complexity of using STI mostly depends on what ORM you're using. Rails / ActiveRecord has good support, other frameworks vary.
Hope that helps!

Issue with displaying information in a Table/Matrix visual - Power BI

Hi, I'm new to Power BI Desktop, but I have come across an issue when displaying information. Hopefully it's due to my lack of knowledge, but I can't seem to find a way to display values in rows one after the other, similar to pivot table functionality.
For example, if I had the following table:
Location | Salary | Number
A | 100 | 1
A | 200 | 2
B | 100 | 3
B | 400 | 4
C | 400 | 5
D | 800 | 6
What I'd like to produce is something like .....
A | B | C | D
300 | 500 | 400 | 800 <-- Salary Sum
3 | 7 | 5 | 6 <-- Number Sum
I have a direct connection to my data source; please suggest a way to display this with a table/matrix visual.
Thank you in advance
Unfortunately this is currently not supported in Power BI, but maybe there is some light at the end of the tunnel... The Power BI team have started working on this much requested feature. See here
As Tom said, this is available with the August release. You can check which version you have by going to File -> Help -> About. If you have an older version, you can go here to download the right one for you (32-bit vs 64-bit).
Once you have made sure you are running the August version, simply create a matrix with Location in the Columns field and Salary and Number in the Values field. Then go into the formatting pane and under Values, turn Show on rows to on.
Try this: go to the Query Editor, select the first column of the desired table (Location in your case), and from the Transform tab select Unpivot Other Columns.
That's it! Now go and drop your visual.

Is it possible to make multiple fields default to the same date, but also be individually editable?

I am VERY new to Access - I was sort of thrust into designing a database for a research project I'm involved in. So, please bear with me because I know next to nothing :) The problem I am having is thus:
My database is for a medical research project, and is very time and date dependent, by which I mean I need to capture the date and time for each piece of data so that we end up with a sort of timeline of events for each subject.
As is, I have something like the following for each piece of data (each in its own field):
ArrivalDate
ArrivalTime
HeartRateDate
HeartRateTime
HeartRateData
TemperatureDate
TemperatureTime
TemperatureData
BloodPressureDate
BloodPressureTime
BloodPressureData
There are around 200 similar pieces of data that I need to collect for each patient. To avoid having to re-enter the same data over and over, and also to reduce the potential for error, I would like to have all of the date fields in a given patient record default to the first one that is entered, in this case "Arrival Date". However, I also need each date field to be editable without affecting the others. The reason for this is so that, if a patient's visit spans a few days, we can record that accurately.
I have tried messing around with the default value setting, as well as setting the control source to reference the "Arrival Date" field, but then of course any changes to one field affect them all. I am not even sure that what I am trying to do is possible but I will appreciate any help and/or suggestions!
Thank you in advance
Having all this data in separate columns of a big table isn't going to work. You don't measure things like temperature or blood pressure only once per patient, do you?
This is a classic one-to-many relation.
You should have a separate Measurements table, looking e.g. like this:
+--------+-----------+---------------+------------------+-----------+
| MeasID | PatientID | MeasType | MeasDateTime | MeasValue |
+--------+-----------+---------------+------------------+-----------+
| 1 | 1 | Temperature | 2017-05-17 14:30 | 38.2 |
| 2 | 1 | BloodPressure | 2017-05-17 14:30 | 130/90 |
| 3 | 1 | Temperature | 2017-05-17 18:00 | 38.5 |
| 4 | 2 | Temperature | etc. | |
+--------+-----------+---------------+------------------+-----------+
As Barmar wrote, there is no reason to have separate columns for date and time.
In the form where measurements are entered, you can use the BeforeInsert event to set MeasDateTime to the current time, with the Now() function.
So the user never has to enter it manually, but they can edit it if the measurement was at a different time than entering the data.
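For reference, a rough DDL sketch of that Measurements table in Access SQL (the types and sizes are guesses, and the PatientID link assumes a separate Patients table; in practice you may prefer to create it in the table designer):
CREATE TABLE Measurements (
    MeasID       AUTOINCREMENT CONSTRAINT pkMeasurements PRIMARY KEY,
    PatientID    LONG NOT NULL CONSTRAINT fkPatient REFERENCES Patients (PatientID),
    MeasType     TEXT(50) NOT NULL,
    MeasDateTime DATETIME NOT NULL,
    MeasValue    TEXT(50)
);
Each measurement row then carries its own MeasDateTime, which is what the BeforeInsert/Now() suggestion fills in automatically while still leaving it editable.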

PostgreSQL Fuzzy Searching multiple words with Levenshtein

I am working out a PostgreSQL query to allow for fuzzy searching capabilities when searching for a company's name in an app that I am working on. I have found and have been working with Postgres' levenshtein function (part of the fuzzystrmatch module), and for the most part it is working. However, it only seems to work when the company's name is one word, for example:
With Apple (which is stored in the database as simply apple) I can run the following query and have it work near perfectly (it returns a levenshtein distance of 0):
SELECT * FROM contents
WHERE levenshtein(company_name, 'apple') < 4;
However when I take the same approach with Sony (which is stored in the database as Sony Electronics INC) I am unable to get any useful results (entering Sony gives a levenshtein distance of 16).
I have tried to remedy this problem by breaking the company's name down into individual words and inputting each one individually, resulting in something like this:
user input => 'sony'
SELECT * FROM contents
WHERE levenshtein('Sony', 'sony') < 4
OR levenshtein('Electronics', 'sony') < 4
OR levenshtein('INC', 'sony') < 4;
So my question is this: is there some way that I can accurately implement a multi-word fuzzy search with the current general approach that I have now, or am I looking in the complete wrong place?
Thanks!
Given your data and the following query with wild values for the Levenshtein Insertion (10000), Deletion (100) and Substitution (1) cost:
with sample_data as (
    select 101 as "id", 'Sony Entertainment Inc' as "name"
    union
    select 102 as "id", 'Apple Corp' as "name"
)
select sample_data.id, sample_data.name, components.part,
       levenshtein(components.part, 'sony', 10000, 100, 1) as ld_sony
from sample_data
inner join (
    select sd.id,
           lower(unnest(regexp_split_to_array(sd.name, E'\\s+'))) as part
    from sample_data sd
) components on components.id = sample_data.id;
The output looks like this:
id | name | part | ld_sony
-----+------------------------+---------------+---------
101 | Sony Entertainment Inc | sony | 0
101 | Sony Entertainment Inc | entertainment | 903
101 | Sony Entertainment Inc | inc | 10002
102 | Apple Corp | apple | 104
102 | Apple Corp | corp | 3
(5 rows)
Row 1 - no changes..
Row 2 - 9 deletions and 3 changes
Row 3 - 1 insertion and 2 changes
Row 4 - 1 deletion and 4 changes
Row 5 - 3 changes
I've found that splitting the words out causes a lot of false positives when you give a threshold. You can order by the Levenshtein distance to position the better matches close to the top. Maybe tweaking the Levenshtein variables will help you to order the matches better. Sadly, Levenshtein doesn't weight earlier changes differently than later changes.
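Building on that, here is a sketch that ranks companies by their best-matching word instead of filtering on a hard threshold (it assumes the contents table has an id primary key alongside company_name, and it uses the default Levenshtein costs):
-- Split each name into words, score every word against the search term,
-- and keep the best (lowest) score per company.
SELECT c.id,
       c.company_name,
       min(levenshtein(lower(w.part), 'sony')) AS best_distance
FROM contents c
CROSS JOIN LATERAL unnest(regexp_split_to_array(c.company_name, E'\\s+')) AS w(part)
GROUP BY c.id, c.company_name
ORDER BY best_distance
LIMIT 10;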

Will SELECT always start with the first (oldest) record in a table?

I made a table like
  record
---------+
1 | one   |
2 | two   |
3 | three |
4 | four  |
5 | five  |
---------+
There isn't an ID column; those are just the row numbers I see beside each row in DBVisualizer. I added the rows in the order 1, 2, 3, 4, 5. Is this:
SELECT *
FROM sch.test_table
LIMIT 1;
guaranteed always to return "one", i.e. to start with the "oldest" record? Or will that change with large data sets?
No; per the SQL specification, the order is indeterminate when not using ORDER BY. You're working with a set of data, and sets are not ordered. Also, the size of the set should not matter.
The Postgresql documentation says:
If ORDER BY is not given, the rows are returned in whatever order the
system finds fastest to produce.
Which means that the rows might come back in the expected order, or they might not - there are no guarantees.
The bottom line is that if you want deterministic results, you have to use ORDER BY.
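If you need a deterministic "first" row, the usual fix is to sort on an explicit column; for example (a sketch: the id column is an addition, since your table doesn't have one yet):
-- Note: the values assigned to rows that already exist won't necessarily
-- reflect the original insertion order.
ALTER TABLE sch.test_table ADD COLUMN id serial;

SELECT *
FROM sch.test_table
ORDER BY id
LIMIT 1;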