How do I generate a random sample of groups, including all people in the group, where the group_id (but not the person_id) changes across time? - postgresql

I have data that looks like this:
+----------+-----------+------------+------+
| group_id | person_id | is_primary | year |
+----------+-----------+------------+------+
| aaa1     | 1         | TRUE       | 2000 |
| aaa2     | 1         | TRUE       | 2001 |
| aaa3     | 1         | TRUE       | 2002 |
| aaa4     | 1         | TRUE       | 2003 |
| aaa5     | 1         | TRUE       | 2004 |
| bbb1     | 2         | TRUE       | 2000 |
| bbb2     | 2         | TRUE       | 2001 |
| bbb3     | 2         | TRUE       | 2002 |
| bbb1     | 3         | FALSE      | 2000 |
| bbb2     | 3         | FALSE      | 2001 |
+----------+-----------+------------+------+
The data design is such that:
* person_id uniquely identifies an individual across time
* group_id uniquely identifies a group within each year, but may change from year to year
* each group contains primary and non-primary individuals
My goal is three-fold:
1. Get a random sample, e.g. 10%, of primary individuals
2. Get the data on those primary individuals for all time periods they appear in the database
3. Get the data on any non-primary individuals that share a group with any of the primary individuals sampled in the first and second steps
I'm unsure where to start with this, since I need to first pull a random sample of primary individuals and get all observations for them. Presumably I can do this by generating a random number that's the same within any person_id, then sample based on that. Then, I need to get the list of group_id that contain any of those primary individuals, and pull all records associated with those group_id.
I don't know where to start with these queries and subqueries. Unfortunately, the interface I'm using to access this database can't link information across separate queries, so I can't pull a list of random person_ids for primary individuals and then use that text file to filter group_id in a second query; I have to do it all in one query.

A quick way to get this done is:
select
    data_result.*
from
    data as data_groups
    join (
        select person_id
        from data
        where is_primary
        group by person_id
        order by random()
        limit 1
    ) as selected_primary
        on data_groups.person_id = selected_primary.person_id
    join data as data_result
        on data_groups.group_id = data_result.group_id
       and data_groups.year = data_result.year
I even made a fiddle so you can test it.
The query is pretty straightforward: it first gets the sample of primary individuals, then their groups, and then all the people in those groups.
Please pay attention to the LIMIT 1 clause; it is only there because the example data set is so small. You can put a larger value there, or a subquery that computes the right percentage.
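For example, to sample roughly 10% of primary individuals instead of a fixed number, the inner query can filter on random() — a sketch (the exact sample size will vary from run to run, and the DISTINCT guards against duplicate rows when two sampled primaries share a group):
select distinct data_result.*
from (
    -- keep each primary person with ~10% probability
    select person_id
    from (select distinct person_id from data where is_primary) as primaries
    where random() < 0.10
) as selected_primary
join data as data_groups
  on data_groups.person_id = selected_primary.person_id
join data as data_result
  on data_groups.group_id = data_result.group_id
 and data_groups.year = data_result.year;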
If anyone has an answer using windowing functions I'd like to see that.
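(For the curious, one way such a window-function variant could look is sketched below — untested; percent_rank() over a random ordering keeps roughly the first 10% of primary persons.)
with ranked as (
    select person_id,
           percent_rank() over (order by random()) as rnd_rank
    from (select distinct person_id from data where is_primary) as primaries
)
select distinct data_result.*
from ranked
join data as data_groups
  on data_groups.person_id = ranked.person_id
join data as data_result
  on data_groups.group_id = data_result.group_id
 and data_groups.year = data_result.year
where ranked.rnd_rank <= 0.10;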
Note: next time please provide the schema and the insert statements so it is easier to answer.

What exactly is a wide column store?

Googling for a definition either returns results for a column oriented DB or gives very vague definitions.
My understanding is that wide column stores consist of column families which consist of rows and columns. Each row within said family is stored together on disk. This sounds like how row oriented databases store their data. Which brings me to my first question:
How are wide column stores different from a regular relational DB table? This is the way I see it:
* column family -> table
* column family column -> table column
* column family row -> table row
This image from Database Internals simply looks like two regular tables:
My guess as to what is different comes from the fact that "multi-dimensional map" is mentioned alongside wide column stores. So here is my second question:
Are wide column stores sorted from left to right? Meaning, in the above example, are the rows sorted first by Row Key, then by Timestamp, and finally by Qualifier?
Let's start with the definition of a wide column database.
Its architecture uses (a) persistent, sparse matrix, multi-dimensional mapping (row-value, column-value, and timestamp) in a tabular format meant for massive scalability (over and above the petabyte scale).
A relational database is designed to maintain the relationship between the entity and the columns that describe the entity. A good example is a Customer table. The columns hold values describing the Customer's name, address, and contact information. All of this information is the same for each and every customer.
A wide column database is one type of NoSQL database.
Maybe this is a better image of four wide column databases.
My understanding is that the first image at the top, the Column model, is what we called an entity/attribute/value table. It's an attribute/value table within a particular entity (column).
For Customer information, the first wide column database example might look like this.
Customer ID Attribute Value
----------- --------- ---------------
100001 name John Smith
100001 address 1 10 Victory Lane
100001 address 3 Pittsburgh, PA 15120
Yes, we could have modeled this for a relational database. The power of the attribute/value table comes with the more unusual attributes.
Customer ID Attribute Value
----------- --------- ---------------
100001 fav color blue
100001 fav shirt golf shirt
Any attribute that a marketer can dream up can be captured and stored in an attribute/value table. Different customers can have different attributes.
The Super Column model keeps the same information in a different format.
Customer ID: 100001
Attribute Value
--------- --------------
fav color blue
fav shirt golf shirt
You can have as many Super Column models as you have entities. They can be in separate NoSQL tables or put together as a Super Column family.
The Column Family and Super Column Family models simply give a row id to the first two models in the picture for quicker retrieval of information.
Most (if not all) wide-column stores are indeed row-oriented stores, in that every part of a record is stored together. You can see that as a two-dimensional key-value store: the first part of the key is used to distribute the data across servers, and the second part of the key lets you quickly find the data on the target server.
Wide-column stores will have different features and behaviors; Apache Cassandra, for example, allows you to define how the data will be sorted. Take this table, for example:
| id | country | timestamp | message |
|----+---------+------------+---------|
| 1 | US | 2020-10-01 | "a..." |
| 1 | JP | 2020-11-01 | "b..." |
| 1 | US | 2020-09-01 | "c..." |
| 2 | CA | 2020-10-01 | "d..." |
| 2 | CA | 2019-10-01 | "e..." |
| 2 | CA | 2020-11-01 | "f..." |
| 3 | GB | 2020-09-01 | "g..." |
| 3 | GB | 2020-09-02 | "h..." |
|----+---------+------------+---------|
If your partitioning key is (id) and your clustering key is (country, timestamp), the data will be stored like this:
[Key 1]
1:JP,2020-11-01,"b..." | 1:US,2020-09-01,"c..." | 1:US,2020-10-01,"a..."
[Key 2]
2:CA,2019-10-01,"e..." | 2:CA,2020-10-01,"d..." | 2:CA,2020-11-01,"f..."
[Key 3]
3:GB,2020-09-01,"g..." | 3:GB,2020-09-02,"h..."
Or in table form:
| id | country | timestamp | message |
|----+---------+------------+---------|
| 1 | JP | 2020-11-01 | "b..." |
| 1 | US | 2020-09-01 | "c..." |
| 1 | US | 2020-10-01 | "a..." |
| 2 | CA | 2019-10-01 | "e..." |
| 2 | CA | 2020-10-01 | "d..." |
| 2 | CA | 2020-11-01 | "f..." |
| 3 | GB | 2020-09-01 | "g..." |
| 3 | GB | 2020-09-02 | "h..." |
|----+---------+------------+---------|
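For reference, a hypothetical CQL definition producing this first layout might look like the following (a sketch; the table name and column types are assumed):
create table messages (
    id        int,
    country   text,
    timestamp date,
    message   text,
    primary key ((id), country, timestamp)   -- id = partition key; country, timestamp = clustering keys
);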
If you change the primary key (the composite of the partitioning and clustering keys) to (id, timestamp), so that id is the partitioning key and timestamp is the only clustering key (ascending by default), the result would be:
[Key 1]
1:US,2020-09-01,"c..." | 1:US,2020-10-01,"a..." | 1:JP,2020-11-01,"b..."
[Key 2]
2:CA,2019-10-01,"e..." | 2:CA,2020-10-01,"d..." | 2:CA,2020-11-01,"f..."
[Key 3]
3:GB,2020-09-01,"g..." | 3:GB,2020-09-02,"h..."
Or in table form:
| id | country | timestamp | message |
|----+---------+------------+---------|
| 1 | US | 2020-09-01 | "c..." |
| 1 | US | 2020-10-01 | "a..." |
| 1 | JP | 2020-11-01 | "b..." |
| 2 | CA | 2019-10-01 | "e..." |
| 2 | CA | 2020-10-01 | "d..." |
| 2 | CA | 2020-11-01 | "f..." |
| 3 | GB | 2020-09-01 | "g..." |
| 3 | GB | 2020-09-02 | "h..." |
|----+---------+------------+---------|
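Similarly, the second layout would correspond to a definition like this (again a sketch; note that country is no longer part of the primary key, so two rows with the same id and timestamp would overwrite each other):
create table messages_by_time (
    id        int,
    country   text,
    timestamp date,
    message   text,
    primary key ((id), timestamp)   -- id = partition key; timestamp = clustering key (ascending by default)
);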

Loop insert with select

I have the following structures
Tickets
+----+-------+---------+-----------+
| id | price | seat_id | flight_id |
+----+-------+---------+-----------+
Seats
+----+--------+--------+
| id | letter | number |
+----+--------+--------+
| 1  | A      | 1      |
| 2  | A      | 2      |
| 3  | A      | 3      |
+----+--------+--------+
I want to insert 2 tickets using only one query, where the letter is A and the number is between 1 and 2. I guess that to make more than one insert at a time I have to use some PL/pgSQL loop, but I don't know how to do it, and I don't know if this is the right approach.
Not sure what you are actually wanting to do, but from your description I'll assume you want 2 rows in tickets referencing ids 1 and 2 from seats.
SQL works in sets, NOT in individual rows and loops (yes, those are available via PL/pgSQL), but avoid loops whenever possible. Inserting 2 rows does not require one; in fact, it is almost exactly the same as inserting a single row. Since you did not specify values for price and flight_id, I'll just omit them. But to insert 2 rows:
insert into tickets (id, seat_id) values (1, 1), (2, 2);
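If you want the seats picked by a query rather than listed by hand, an INSERT ... SELECT does it in one statement (a sketch; it assumes tickets.id is generated automatically, e.g. by a serial/identity column, and again omits price and flight_id):
insert into tickets (seat_id)
select id
from seats
where letter = 'A'
  and number between 1 and 2;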

Selecting value for the latest two distinct columns

I am trying to write an SQL query which will return the latest data value for each distinct tag in my table.
Currently, I select the distinct values of the tag column and afterwards iterate through those values programmatically, ordering by timestamp and limiting to 1 for each. These tags can be any number and may not always be posted together (one time only tag 1 may be posted, whereas other times tags 1, 2 and 3 may).
Although it gives the expected outcome, this seems to be inefficient in a lot of ways, and because I don't have enough SQL experience, this was so far the only way I found of performing the task...
---------------------------------
| name | tag | timestamp | data |
---------------------------------
| aa   | 1   | 566       | 4659 |
| ab   | 2   | 567       | 4879 |
| ac   | 3   | 568       | 1346 |
| ad   | 1   | 789       | 3164 |
| ae   | 2   | 789       | 1024 |
| af   | 3   | 790       | 3346 |
---------------------------------
Therefore the expected outcome is {3164, 1024, 3346}
Currently what I'm doing is:
"select distinct tag from table"
Then I store all the distinct tag values and iterate through them programmatically using
"select data from table where '"+ tags[i] +"' in (tag) order by timestamp desc limit 1"
Thanks,
This comes close, but beware: if two rows with the same tag share the maximum timestamp, you will get duplicates in the result set.
select data
from table
join (select tag, max(timestamp) as maxtimestamp
      from table t1
      group by tag) as latesttags
  on table.tag = latesttags.tag
 and table.timestamp = latesttags.maxtimestamp
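If the database happens to be PostgreSQL, DISTINCT ON sidesteps that duplicate problem by keeping exactly one row per tag (a sketch; mytable is a placeholder for the real table name):
select distinct on (tag) data   -- one row per tag: the first row in the order below
from mytable
order by tag, timestamp desc;   -- within each tag, the latest timestamp comes first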

Redshift query to count metrics by 10 minute windows

About the PostgreSQL tag: as you may know, Redshift is based on PostgreSQL.
Amazon Redshift is based on PostgreSQL 8.0.2. Amazon Redshift and PostgreSQL have a number of very important differences that you must be aware of as you design and develop your data warehouse applications.
I have a table that was created like this:
create table purchase (
    user_id int,
    item_id int,
    t timestamp
)
diststyle even
interleaved sortkey(user_id, item_id, t);
And I want to execute a query which tells me the 3 most-active users (users with the most purchases) in a ten-minute window, and the 3 most-purchased items in the same ten-minute window.
So the results should look like this
+-item_id-|-user_id-|-window-+
| aaa | xxx | 0 |
+---------+---------+--------+
| bbb | yyy | 0 |
+---------+---------+--------+
| ccc | zzz | 0 |
+---------+---------+--------+
| ... | ... | 1 |
+---------+---------+--------+
| ... | ... | 1 |
+---------+---------+--------+
| ... | ... | 1 |
..............................
| ... | ... | 5 |
+---------+---------+--------+
| ... | ... | 5 |
+---------+---------+--------+
| ... | ... | 5 |
+---------+---------+--------+
where aaa is the most-purchased item in the first ten minute window, bbb is the second most-purchased item in the first ten minute window, and so on, and xxx is the user with the most purchases in the first ten minute window, and yyy is the user with the second most purchases in the first window, and so on. There are six 10-minute windows because I will be doing this over an hour-long date range.
I'm pretty new to Redshift, so unfortunately I don't have any existing SQL to show you what I've tried.
My requirements changed slightly, but I was able to write a query that met the new requirements, which were just to get a count of all the distinct item_ids and user_ids per 10-minute window:
select count(distinct item_id) as item_id_count, count(distinct user_id) as user_id_count, substring(t, 0, 16) as window from purchase group by window order by window asc;
Not sure if others will have the same date format, but mine was yyyy-MM-dd hh:mm:ss, so grouping by 10 minutes just required taking the substring up through the yyyy-MM-dd hh:m part (everything up to the tens digit of the minutes) and grouping on that.
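For the original top-3-per-window requirement, one possible shape is sketched below (untested, against the purchase table above; it covers items only — the user side would be an analogous query — and assumes the hour-long date range is filtered elsewhere):
with item_counts as (
    select item_id,
           floor(extract(minute from t) / 10) as win,   -- 10-minute window number 0..5 within the hour
           count(*) as purchases
    from purchase
    group by item_id, floor(extract(minute from t) / 10)
)
select item_id, win, purchases
from (
    select item_id, win, purchases,
           rank() over (partition by win order by purchases desc) as rnk
    from item_counts
) ranked
where rnk <= 3
order by win, rnk;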

Order by date AND id, sqldeveloper

I have some tables with date and id as two of the columns:
ID | DATE    | ITEMS
1  | 7/1/13  | More Apples
2  | 6/29/13 | Carrots
1  | 6/20/13 | Apples
2  | 6/10/13 | Broccoli
I would like to order them by DATE and group them by ID, so that all the 1's are together and each ID's rows are ordered by date, newest first:
ID | DATE    | ITEMS
1  | 7/1/13  | More Apples
1  | 6/20/13 | Apples
2  | 6/29/13 | Carrots
2  | 6/10/13 | Broccoli
How would I accomplish this?
I'm thinking my solution might be a sub-select, but I haven't gotten anywhere close to what I want to achieve. Note that the above tables are very simplified; I'm actually trying to accomplish this with many tables joined and many different fields being displayed. Thanks.
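For the simplified example, the ordering described is just a multi-column ORDER BY (a sketch; items_list and order_date are placeholders for the real table and date column, since DATE itself is a reserved word in Oracle):
select id, order_date, items       -- order_date stands in for the DATE column shown above
from items_list                    -- items_list is a placeholder table name
order by id asc, order_date desc;  -- all rows for an ID stay together, newest date first within each ID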