Select difference depending on criteria - PostgreSQL

I hope you can help me with one SELECT statement.
I need to select records based on a criterion, but only when duplicates are detected. The criterion is No_Type = 'Custom'.
Example of table data:
Product | Component | Detail | Stock | No_Type
-----------------------------------------------
Brick   | Powder    | Grain  |    12 | General
Brick   | Water     | Plain  |    34 | General
Brick   | Additives | A95    |    54 | General
Brick   | Powder    | Grain  |    67 | Custom
Brick   | Water     | Plain  |    55 | Custom
Brick   | Additives | A95    |    43 | Custom
Box     | Wood      | Oak    |     1 | General
Box     | Nails     | Steel  |     2 | General
The result is based on detecting duplicate values in the columns (Product, Component, Detail); when a duplicate is detected, only the record whose No_Type column is 'Custom' should be selected.
What I would need as a result is:
Product | Component | Detail | Stock | No_Type
-----------------------------------------------
Brick   | Powder    | Grain  |    67 | Custom
Brick   | Water     | Plain  |    55 | Custom
Brick   | Additives | A95    |    43 | Custom
Box     | Wood      | Oak    |     1 | General
Box     | Nails     | Steel  |     2 | General
There are no ID fields in the table, only what is shown here.
I'm using PostgreSQL.
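A minimal sketch of one way to do this, assuming the table is named products (the question doesn't name it): PostgreSQL's DISTINCT ON keeps one row per (product, component, detail) group, and sorting 'Custom' rows first within each group makes them win whenever a duplicate exists.
-- One row per (product, component, detail); when both a 'General' and a
-- 'Custom' row exist, the ORDER BY puts the 'Custom' row first so it is kept.
SELECT DISTINCT ON (product, component, detail)
       product, component, detail, stock, no_type
FROM   products                         -- hypothetical table name
ORDER  BY product, component, detail,
       (no_type = 'Custom') DESC;       -- true sorts before false under DESC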

Related

Getting breakdown of "Others" (the rest of Top N members) with SSAS MDX

How can I recursively get the breakdown of "Others" when Top N is applied to dimensions?
Imagine a measure Sales Amount is sliced by 3 dimensions, Region, Category and Product, and Top 1 is applied to each dimension. The result I want to see is a table like below. On each slice, the rest of members are grouped as "Others".
Region | Category | Product | Sales
============================================
Europe | Bikes | Mountain Bikes | $100
| |------------------------
| | Others | $ 30
|-----------------------------------
| Others | Gloves | $ 50
| |------------------------
| | Others | $120
--------------------------------------------
Others | Clothes | Jackets | $ 80
| |------------------------
| | Others | $130
|-----------------------------------
| Others | Shoes | $ 90
| |------------------------
| | Others | $110
--------------------------------------------
When an "Others" appears, I want to see the Top 1 of the next dimension within the scope of this "Others". This seems a little tricky. e.g. tuples like (North America, Clothes) and (Central America, Clothes) need to be aggregated as (Other Regions, Clothes). Is there a neat way to aggregate the measure based on the 2nd dimension, Category?
Alternatively, I think a sub cube that filters out Europe will easily provide the breakdown of Other Regions, Clothes and Other Categories. However, this is likely to result in creating many dependent queries. For an easy processing of the result set, it would be ideal if the query returns data in the above format.
Can this be possibly achieved by a single MDX query?
To get the breakdown of "Others", we can use a dynamic set, EXCEPT(), and the AGGREGATE() function.
In each of the three dimensions we need to create a named dynamic set that holds two members (Top 1 and Others).
As an example, in the Category dimension I have created a dynamic set that holds two members (Top 1 and Others), like this:
CREATE MEMBER CURRENTCUBE.[Product].[French Product Category Name].[ALL].[OTHERS] AS
    AGGREGATE(
        EXCEPT(
            [Product].[French Product Category Name].[French Product Category Name].MEMBERS,
            TOPCOUNT([Product].[French Product Category Name].[French Product Category Name].MEMBERS,
                     1, [Measures].[Sales Amount])));

CREATE DYNAMIC SET [TOP1 and Others] AS
    { TOPCOUNT([Product].[French Product Category Name].[French Product Category Name].MEMBERS,
               1, [Measures].[Sales Amount]),
      [OTHERS] };
Because the set is dynamic, the values of Top 1 and Others will change according to the filters and slicers that you apply.
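As a hedged usage sketch (the cube name [Adventure Works] is an assumption based on the dimension names; substitute your own cube), the dynamic set can then be queried directly:
SELECT [Measures].[Sales Amount] ON COLUMNS,
       [TOP1 and Others] ON ROWS
FROM [Adventure Works];
Because the set is evaluated in the query context, the same two rows (Top 1 and Others) re-aggregate under whatever slicers appear in the WHERE clause.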

Should KSQL tables be showing multiple rows per key for aggregates?

My understanding of KSQL tables is that they show an "as is" view of our data rather than all the data.
So if I have a simple aggregating query and I SELECT from my table, I should see the data as it is at this point in time.
My data (stream):
MY_TOPIC_STREAM:
15 | BEACH | Steven Ebb | over there
24 | CIRCUS | John Doe | an adress
30 | CIRCUS | Alice Small | another address
35 | CIRCUS | Barry Share | a home
35 | CIRCUS | Garry Share | a home
40 | CIRCUS | John Mee | somewhere
45 | CIRCUS | David Three | a place
45 | CIRCUS | Mary Three | a place
45 | CIRCUS | Joffrey Three | a place
My table definition:
CREATE TABLE MY_TABLE WITH (VALUE_FORMAT='AVRO') AS
SELECT ROWKEY AS APPLICATION, COUNT(*) AS NUM_APPLICANTS
FROM MY_TOPIC_STREAM
WHERE header->eventType = 'CIRCUS'
GROUP BY ROWKEY;
I am confused as to why I see multiple rows in my table, even though the eventual aggregates are correct.
SELECT * FROM MY_TABLE;
APPLICATION NUM_APPLICANTS
24 1
30 1
--> 35 1 <-- why do I see this?
35 2
40 1
--> 45 1 <-- why do I see this?
--> 45 2 <-- why do I see this?
45 3
My sink topic also shows me the same as the table output - presumably this is correct?
I expected my table result to be:
APPLICATION NUM_APPLICANTS
24 1
30 1
35 2
40 1
45 3
Outputs abridged for brevity and readability above, but you get the gist.
So - are my expectations of the table and sink topic outputs off the mark?
UPDATE
Matthias' answer below correctly explains that the table and sink topic show changelog events, so it is normal to see intermediate values. What was confusing me, however, was that I was seeing all intermediate rows. It turned out that this was because I was using the Confluent 5.2.1 docker-compose, which sets the environment variable KSQL_STREAMS_CACHE_MAX_BYTES_BUFFERING=0. This disables caching of all intermediate results in KSQL aggregations, so the table shows more rows than expected while eventually arriving at the correct aggregates. Setting this to e.g. 10 MB caused the data to be output as expected. This feature is not immediately obvious in the documentation for those starting to play with KSQL and using Docker to stand up the instances! This issue pointed me in the right direction, and this page documents the parameters. I had spent a long time on this and could not work out why it was not behaving as expected. I hope this helps someone.
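For reference, a hedged docker-compose excerpt showing the fix (the service and image names are assumptions based on the Confluent 5.2.1 setup described above):
ksql-server:
  image: confluentinc/cp-ksql-server:5.2.1
  environment:
    # ~10 MB cache, so intermediate aggregate updates are coalesced before being emitted
    KSQL_STREAMS_CACHE_MAX_BYTES_BUFFERING: 10000000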
Not sure what version you are using; however, SELECT * FROM MY_TABLE; does not return the current content of the table, but the table's changelog stream (this holds for older versions; in newer versions the query you show is not valid, as the syntax was changed).
Since the transition from KSQL to ksqlDB, the query you showed would be called a push query expressed as SELECT * FROM my_table EMIT CHANGES;.
Furthermore, ksqlDB introduced pull queries that allow you to look up the current state. However, SELECT * FROM my_table; is not supported as a pull query yet (it will be added in the future). You can only do table lookups for a specific key, i.e., there must be a WHERE clause at the moment.
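For illustration, a key lookup of the supported form would look like this (the key 45 is taken from the question's data, and treating ROWKEY as a string is an assumption):
SELECT * FROM MY_TABLE WHERE ROWKEY = '45';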
Check out the docs for more details: https://docs.ksqldb.io/en/latest/concepts/queries/pull/

PostgreSQL column generating

I'm making an application for a company. They need to divide a project's total hours into sections. For example, they are building a car and have a total of 100 hours for one project: 25 go to painting, 25 to montage, 25 to wheels, and 25 to electronics. These 4 columns are standard in the table, and I will have two tables, one for the total hours and one for the hours left on a project:
project_id | total_hours | painting | montage | wheels | electronics
12543      | 100         | 25       | 25      | 25     | 25

project_id | hours_left | painting | montage | wheels | electronics
12543      | 100        | 25       | 25      | 25     | 25
But in some cases they want to add a custom section column, like sunroof, give it extra hours, and open a column for it, like:
project_id | total_hours | painting | montage | wheels | electronics | sunroof
12543      | 125         | 25       | 25      | 25     | 25          | 25
But not all projects will have this column.
How can I create that elastic number of columns, with custom columns for each project? Should I split them into separate tables, or what? And how can I do that operation from an API? Examples are welcome in any language or pseudocode.
Thanks a lot!
Maybe you need a project master with details of the projects (including total time), an activity master which lists the possible activities and standard durations, and a table which links projects to activities, i.e. connects each project to the activities relevant for it. This table can also capture the time_consumed value.
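A minimal sketch of that design in PostgreSQL DDL (all table and column names besides time_consumed are assumptions):
CREATE TABLE project (
    project_id  integer PRIMARY KEY,
    total_hours numeric NOT NULL
);

CREATE TABLE activity (
    activity_id       serial PRIMARY KEY,
    name              text NOT NULL,   -- e.g. 'painting', 'sunroof'
    standard_duration numeric
);

-- Links each project to just the activities it uses, so any project can carry
-- any number of standard or custom sections without adding columns.
CREATE TABLE project_activity (
    project_id    integer NOT NULL REFERENCES project,
    activity_id   integer NOT NULL REFERENCES activity,
    planned_hours numeric NOT NULL,
    time_consumed numeric NOT NULL DEFAULT 0,
    PRIMARY KEY (project_id, activity_id)
);
Hours left then falls out of a query instead of a second wide table, e.g. SELECT project_id, sum(planned_hours - time_consumed) AS hours_left FROM project_activity GROUP BY project_id;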

Optimal relational design for groups and subgroup relationships

I have a bit of an intro-level relational database design question. I'm working on a project where I'm capturing information from scientific journal articles and storing that in a Postgres database. One of my primary goals is to define a schema that is flexible enough to cover most cases I might encounter in a broad set of papers. In reality, articles tend to report a semi-standard set of details, but there's definitely variance once you get into the details. These things are written for humans, not machines.
For the most part, defining the schema has been pretty straightforward, but one thing I'm stuck on is how to sensibly structure a set of tables to capture details about a study's subject groups and subsets of subjects.
Take for example a simple randomized control trial - you typically have a set of people identified as screened for eligibility, a set determined to be eligible, a set randomized into the control group, and a set randomized into the treatment group. Within each of those groups you can have subgroups defined in all sorts of specific ways, but generally by some sort of interval (e.g. Age 26-32) or a category (e.g. pregnant/not pregnant).
Currently, I've set this up so that a Study record can have many Subject records, and Subject records can have many Interval_Subgroup records and many Categorical_Subgroup records.
Subject
-----------------------------------------
id | groupType | measure | value | study
-----------------------------------------
13 | treatment | count | 578 | 17
14 | control | count | 552 | 17
Interval_Subgroup
---------------------------------------------------------------
id | factor | factorMin | factorMax | measure | value | subject
---------------------------------------------------------------
41 | age | 18 | 24 | count | 125 | 13
42 | age | 25 | 32 | count | 204 | 13
Categorical_Subgroup
-----------------------------------------------------
id | factor | factorValue | measure | value | subject
-----------------------------------------------------
74 | sex | male | count | 251 | 13
75 | sex | female | count | 327 | 13
This seems workable, but feels clunky because I have two tables for capturing the same type of information. Also it's limiting because it wouldn't allow me to capture any combination of subgroup sets like males of age 18-24. Some studies report that kind of detail, some don't, but I want to be able to capture any depth of subgroup info the paper offers.
What is a more flexible way to structure these tables than what I've described above? I'm trying to sketch out how I think this should work, and right now, I have subject groups having many subgroups and subgroups having many subgroup definitions. There would just be one table capturing measurements about subgroups, and another table for defining what each subgroup is. I'm not sure if that is in the right direction. Maybe there is a far more simple solution that you might know of.
Thanks for taking the time to help out - it's much appreciated!
Edit:
Fixed id to be unique in the example tables.
From your description it sounds like a factor is a thing, and that each subgroup has one or more factors. To me this implies that factor needs its own table. Factors can in turn be of type interval or categorical, which means single table inheritance might be in order.
Example tables might look something like this:
subgroups
------------------------------
id | measure | value | subject
------------------------------
41 | count | 125 | 13
42 | count | 204 | 13
factors
------------------------------------------------------------------------------
id | type        | factor | category | interval_min | interval_max | subgroup
------------------------------------------------------------------------------
68 | interval    | age    | NULL     | 18           | 24           | 41
69 | categorical | sex    | male     | NULL         | NULL         | 41
In this example subgroup 41 has two factors: age 18-24 and sex male.
It could also be that STI is overkill here, in which case you'd split factor into two tables, categorical_factors and interval_factors, and a subgroup could have zero or many of each.
As far as I'm aware, the complexity of using STI mostly depends on what ORM you're using. Rails / ActiveRecord has good support, other frameworks vary.
Hope that helps!
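A hedged sketch of this single-table-inheritance layout in PostgreSQL DDL (column types and constraint details are assumptions; the subject table is taken from the question):
CREATE TABLE subgroups (
    id      serial  PRIMARY KEY,
    measure text    NOT NULL,
    value   integer NOT NULL,
    subject integer NOT NULL REFERENCES subject (id)
);

CREATE TABLE factors (
    id           serial PRIMARY KEY,
    type         text   NOT NULL CHECK (type IN ('interval', 'categorical')),
    factor       text   NOT NULL,   -- e.g. 'age', 'sex'
    category     text,              -- populated when type = 'categorical'
    interval_min numeric,           -- populated when type = 'interval'
    interval_max numeric,
    subgroup     integer NOT NULL REFERENCES subgroups (id),
    -- each row carries exactly the fields its type requires
    CHECK ((type = 'categorical' AND category IS NOT NULL)
        OR (type = 'interval' AND interval_min IS NOT NULL
                              AND interval_max IS NOT NULL))
);
Finding, say, males aged 18-24 is then one join against factors with a predicate per factor, which covers arbitrarily deep subgroup definitions.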

PostgreSQL Fuzzy Searching multiple words with Levenshtein

I am working out a PostgreSQL query to allow for fuzzy searching capabilities when searching for a company's name in an app that I am working on. I have found and have been working with Postgres' levenshtein function (part of the fuzzystrmatch module), and for the most part it is working. However, it only seems to work when the company's name is one word, for example:
With Apple (which is stored in the database as simply apple) I can run the following query and have it work near perfectly (it returns a levenshtein distance of 0):
SELECT * FROM contents
WHERE levenshtein(company_name, 'apple') < 4;
However when I take the same approach with Sony (which is stored in the database as Sony Electronics INC) I am unable to get any useful results (entering Sony gives a levenshtein distance of 16).
I have tried to remedy this problem by breaking the company's name down into individual words and inputting each one individually, resulting in something like this:
user input => 'sony'
SELECT * FROM contents
WHERE levenshtein('Sony', 'sony') < 4
OR levenshtein('Electronics', 'sony') < 4
OR levenshtein('INC', 'sony') < 4;
So my question is this: is there some way that I can accurately implement a multi-word fuzzy search with the current general approach that I have now, or am I looking in the complete wrong place?
Thanks!
Given your data, and the following query with wild values for the Levenshtein insertion (10000), deletion (100) and substitution (1) costs:
WITH sample_data AS (
    SELECT 101 AS id, 'Sony Entertainment Inc' AS name
    UNION
    SELECT 102 AS id, 'Apple Corp' AS name
)
SELECT sample_data.id,
       sample_data.name,
       components.part,
       levenshtein(components.part, 'sony', 10000, 100, 1) AS ld_sony
FROM sample_data
INNER JOIN (SELECT sd.id,
                   lower(unnest(regexp_split_to_array(sd.name, E'\\s+'))) AS part
            FROM sample_data sd) components
        ON components.id = sample_data.id;
The output is:
id | name | part | ld_sony
-----+------------------------+---------------+---------
101 | Sony Entertainment Inc | sony | 0
101 | Sony Entertainment Inc | entertainment | 903
101 | Sony Entertainment Inc | inc | 10002
102 | Apple Corp | apple | 104
102 | Apple Corp | corp | 3
(5 rows)
Row 1 - no changes.
Row 2 - 9 deletions and 3 changes
Row 3 - 1 insertion and 2 changes
Row 4 - 1 deletion and 4 changes
Row 5 - 3 changes
I've found that splitting the words out causes a lot of false positives when you give a threshold. You can order by the Levenshtein distance to position the better matches close to the top. Maybe tweaking the Levenshtein costs will help you order the matches better. Sadly, Levenshtein doesn't weight earlier changes differently than later changes.
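Building on that, a hedged sketch that ranks rows by the best per-word distance instead of filtering on a threshold (contents and company_name come from the question; everything else is illustrative):
SELECT c.*,
       -- best (minimum) distance over the individual words of the name
       (SELECT min(levenshtein(lower(word), 'sony', 10000, 100, 1))
        FROM unnest(regexp_split_to_array(c.company_name, E'\\s+')) AS word
       ) AS best_distance
FROM contents c
ORDER BY best_distance
LIMIT 10;
Ordering by the minimum word distance surfaces Sony Electronics INC near the top for the input 'sony' without a hard cutoff that might discard near-misses.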