Doctrine: PostgreSQL GROUP BY different than SELECT

I have two tables: user and activityHistory (which has a foreign key to user).
I am trying to query activityHistory while grouping by user, and PostgreSQL does not allow me to do that, while SQLite does:
return $qb
    ->select('a.id')
    ->leftJoin('a.user', 'user')
    ->leftJoin(
        ActivityHistory::class,
        'b',
        'WITH',
        'a.id = b.id AND a.createdAt > b.createdAt'
    )
    ->groupBy('user.id')
    ->orderBy('a.createdAt', 'ASC')
    ->getQuery()->getArrayResult();
I am getting this error only with PostgreSQL: Grouping error: 7 ERROR: column "a0_.id" must appear in the GROUP BY clause or be used in an aggregate function
The point is I don't want to group by the activityHistory id, I only want it to be selected. How can I do this? (I heard about aggregates, but I think those only work with functions like SUM, etc.)

First of all, let's clarify how aggregation works. Aggregation is the act of grouping by certain field(s) and then selecting either those fields or the results of aggregate functions applied to the ungrouped fields.
You misunderstand how this works (hence the question), but let me provide a few very simple examples:
Example 1
Let's consider that there is a town and there are individuals living in that town. Each individual has an eye color, but if you ask what the eye color of the people of the town is, the question does not make sense: the group itself does not have an eye color, unless specified otherwise.
Example 2
Let's modify the example above by grouping the people of the town by eye color. Such an aggregation will have a row for each existing eye color, and you can select the eye color, along with the average age, number of individuals, etc. as you like, because you are grouping by eye color.
Your example
You have users and they are performing actions. An activity is performed by a single user, but a user may perform many activities. So, if you group by the user id, then the "eye color" that you are not grouping by here is the history id.
You will have a single record for each user: you are grouping multiple history items into the same row, and after the grouping a single history item id no longer exists to ask about.
But you can use string_agg(some_column, ',') which will take all the values you have and put them into a single string of values separated by commas.
You can then use explode(',', $yourValues) in PHP to convert such a value into an array.
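In plain SQL terms, a grouped query with string_agg could look like the sketch below. The table and column names (activity_history, user_id, created_at) are assumptions, since your entities aren't shown; on the Doctrine side you would run this as a native query (or register STRING_AGG as a custom DQL function, since DQL only ships with AVG, COUNT, MIN, MAX and SUM):

-- one row per user; all activity ids collapsed into one comma-separated string
SELECT u.id                        AS user_id,
       string_agg(a.id::text, ',') AS activity_ids,
       MAX(a.created_at)           AS latest_activity
FROM "user" u
JOIN activity_history a ON a.user_id = u.id
GROUP BY u.id
ORDER BY latest_activity ASC;

The activity_ids string can then be split back into a PHP array with explode(',', ...) as described above.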

Tableau Union Joins - Can you un-merge automatically merged fields?

I am attempting to union join 2 tables in my Microsoft NAV data source within Tableau. However, I have two fields named "No." that do not contain the same data.
When I apply a union join, Tableau automatically merges these fields and I cannot un-merge them.
Is there a way to un-merge these fields?
Or is there a way of doing a manual union join?
I have tried renaming the field before dragging the second table into the worksheet; however, I can see that the "Remote Field Name" still remains the same.
Thanks
One approach is to let Tableau merge the fields and then use the generated fields to distinguish between them.
When you perform a Union in Tableau, it adds a few fields to your data source so you can tell which data rows came from which tables. The most useful in your case is called [Table Name]. So when you build your visualizations, you can use the [Table Name] field to know how to interpret the [No.] field.
If that is awkward, you can create 2 calculated fields to represent only those [No.] values that have the same role. For example, define [No. Type 1] as if [Table Name] = “Table 1” then [No.] end, and define [No. Type 2] the same way for “Table 2”. Then you can hide the original [No.] field.
These new fields will only have values for the appropriate data rows, and will be null otherwise. Aggregate functions like SUM(), AVG() etc ignore nulls, so you can use those fields as measures easily.
If you want to use a calculation in a JOIN clause, say after making a UNION: first specify the tables (or unions of tables) to join; then, when you click on the Venn diagram to specify the join keys, select either the left or right list of fields and look at the bottom of the list, in small print, for the option to create or edit your Join Calculation.

Multiple optional query parameters with PostgreSQL

I use PostgreSQL 10 and I want to build queries that have multiple optional parameters.
A user must input an area name, but it is optional to pick none or any combination of the following: event, event date, category, category date, style.
So a full query could be "all the banks (category), constructed in 1990 (category date), with modern architecture (style), that got renovated in 1992 (event and event date), in the area of NYC (area)".
My problem is that all those are in different tables, connected by many-to-many tables, so I cannot do something like
SELECT * FROM mytable
WHERE (Event IS NULL OR Event = event)
I don't know if any good will come of just joining the four tables.
I can easily find the area id, since it is required, but I don't know what the user chose besides that.
Any suggestions on how to approach this with PostgreSQL?
Thanks
It might be optimal to build the entire query dynamically, joining in only the tables you know you need in order to apply the user's filters, but that is impractical. You're better off creating a view over the full set of tables. Use LEFT OUTER JOINs to ensure that you don't accidentally filter out valid combinations, and index your tables to ensure that the query planner can navigate the table graph quickly. Then query the view with a WHERE clause reflecting only the filters you want to apply.
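A sketch of that approach, assuming a central building table linked to event, category and style through many-to-many junction tables (all names here are hypothetical, since the schema isn't shown):

CREATE VIEW building_search AS
SELECT b.id,
       ar.name AS area_name,
       e.name  AS event_name,
       e.event_date,
       c.name  AS category_name,
       c.category_date,
       s.name  AS style_name
FROM building b
JOIN area ar ON ar.id = b.area_id
LEFT OUTER JOIN building_event be    ON be.building_id = b.id
LEFT OUTER JOIN event e              ON e.id = be.event_id
LEFT OUTER JOIN building_category bc ON bc.building_id = b.id
LEFT OUTER JOIN category c           ON c.id = bc.category_id
LEFT OUTER JOIN building_style bs    ON bs.building_id = b.id
LEFT OUTER JOIN style s              ON s.id = bs.style_id;

-- area is required; each optional filter collapses to TRUE when its
-- parameter is passed as NULL
SELECT *
FROM building_search
WHERE area_name = :area
  AND (:event IS NULL OR event_name = :event)
  AND (:event_date IS NULL OR event_date = :event_date)
  AND (:category IS NULL OR category_name = :category)
  AND (:category_date IS NULL OR category_date = :category_date)
  AND (:style IS NULL OR style_name = :style);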
If performance becomes a concern and you don't mind having non-realtime data, you could use a materialized view to cache the results. Materialized views can be indexed directly, but this is a pretty radical change so don't do this unless you have to.
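If you do reach that point, the materialized flavour of the view sketched above would look something like this:

CREATE MATERIALIZED VIEW building_search_mv AS
SELECT * FROM building_search;

CREATE INDEX ON building_search_mv (area_name);

-- the data is frozen at creation time; rerun this whenever it goes stale
REFRESH MATERIALIZED VIEW building_search_mv;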

How to best structure csv data for tableau that has "multiple categories"?

I have a set of 100 “student records”. I want to have checkboxes for each "favorite_food_type" and "favorite_food"; whichever is checked would filter a bar graph that counts the number of records containing that specific "favorite_food_type" and "favorite_food". The schema could be:
name
favorite_food_type (e.g. vegetable)
favorite_food (e.g. banana)
In the dashboard, I would like to be able to select via checkboxes, “Give me the COUNT OF DISTINCT students with favorite_food of banana, apple, pear“, and filter graphs for all records. My issue is that for a single student record, one student may like both banana and apple. How do I best capture that? Should I have:
CASE A: Duplicate Records (this captures the two different “favorite_food” values, but now I have to figure out how many students there are, which is one student)
NAME, FAVORITE_FOOD_TYPE,FRUIT
Charlie, Fruit, Apple
Charlie, Fruit, Pear
CASE B: Single Records (this captures the two different “favorite_food” values, but is there a way to pick them out from the delimiters?)
NAME, FAVORITE_FOOD_TYPE,FRUITS
Charlie, Fruit, Apple#Pear
CASE C: Column for Each Fruit (this captures one record per student, but needs a column for every fruit, many of which would be false)
NAME, FAVORITE_FOOD_TYPE, APPLE, BANANA, PINEAPPLE, PEAR
Charlie, Fruit, TRUE, FALSE, TRUE, FALSE
I want to do this as easily as possible.
Avoid Case B if at all possible. Repeating information is almost always best handled by repeating rows -- not by cramming multiple values into a single table cell, nor by creating multiple columns such as Favorite_1 and Favorite_2.
If you are provided data with multiple values in a field, Tableau does have functions and data connection features that can be used to split a single field into its constituent parts to form multiple fields. That works well with a fixed number of different kinds of information -- say, splitting a City, State field into separate fields for City and State.
Avoid Case C if at all possible. That cross tab structure makes it hard to analyze the data and make useful visualizations. Each value is treated as a separate field.
If you are provided data in crosstab format, Tableau allows you to pivot the data in the data connection pane to reshape into a form with fewer columns and many rows.
Case A is usually the best approach. You can simplify it further by factoring out repeating information into separate tables -- a process known as normalization. Then you can use a join to recombine the tables and see the repeating information when desired.
A normalized approach to your example would have two tables (or tabs in Excel). The first table would have exactly one row per student, with 2 columns: name and favorite_food_type. The second table would have a row per student/favorite-food combination, with 2 columns: name and favorite_food. Now each student can have as many favorite foods as you like, or none at all. Since both tables have a name column, that would be the key used to join (combine) the tables when needed. The two tables for Charlie are sketched below.
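For instance, reusing the Charlie rows from Case A above:

Student table:
NAME, FAVORITE_FOOD_TYPE
Charlie, Fruit

Favorite food table:
NAME, FAVORITE_FOOD
Charlie, Apple
Charlie, Pear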
Given that table design, you could have 2 data sources in Tableau. The first would point only to the student table and could be used to create visualizations involving only students and favorite_food_types. The second data source would use a (left) join to read from both tables and could be used to look at favorite foods. When working with the second data source, you would have to be careful when reporting on student names and favorite food types, to account for the duplicated information. So use the first data source when possible. Finally, you could put both kinds of visualizations on a dashboard and use filter and highlight actions to make interaction seamless despite the two sources -- getting the best of both worlds.

Filter and display database audit / changelog (activity stream)

I'm developing an application with SQLAlchemy and PostgreSQL. Users of the system modify data in 8 or so tables (users, products, orders and the like).
I want to add visible logging to the system to record what has changed, but not necessarily how it has changed. For example: "User A modified product Foo", "User A added user B" or "User C purchased product Bar". So basically I want to store:
Who made the change
A message describing the change
Enough information to reference the object that changed, e.g. the product_id and customer_id when an order is placed, so the user can click through to that entity
I want to show each user a list of recent and relevant changes when they log in to the application (a bit like the main timeline in Facebook etc). And I want to store subscriptions, so that users can subscribe to changes, e.g. "tell me when product X is modified", or "tell me when any products in store S are modified".
I have seen the audit trigger recipe, but I'm not sure it's what I want. That audit trigger might do a good job of recording changes, but how can I quickly filter it to show recent, relevant changes to the user? Options that I'm considering:
Have one column per ID type in the log and subscription tables, with an index on each column
Use full text search, combining the ID types as a tsvector
Use an hstore or json column for the IDs, and index the contents somehow
Store references as URIs (strings) without an index, and walk over the logs in reverse date order, using application logic to filter by URI
Any insights appreciated :)
Edit: It seems what I'm talking about is an activity stream. The suggestion in this answer to filter by time first sounds pretty good.
Since the objects all use uuid for the id field, I think I'll create the activity table like this:
Have a generic reference to the target object, with a uuid column with no foreign key, and an enum column specifying the type of object it refers to.
Have an array column that stores generic uuids (maybe as text[]) of the target object and its parents (e.g. parent categories, store and organisation), and search the array for matching subscriptions. That way a subscription for a parent category can match a child in one step (denormalised).
Put a btree index on the date column, and (maybe) a GIN index on the array UUID column.
I'll probably filter by time first to reduce the amount of searching required. Later, if needed, I'll look at using GIN to index the array column (this partially answers my question "Is there a trick for indexing an hstore in a flexible way?")
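A minimal DDL sketch of that design (the enum values and column names are assumptions; the array column could equally be text[] as mentioned above):

CREATE TYPE object_type AS ENUM ('product', 'category', 'store', 'organisation');

CREATE TABLE activity (
    id          uuid PRIMARY KEY,
    created     timestamptz NOT NULL DEFAULT now(),
    target_type object_type NOT NULL,  -- what kind of object the activity refers to
    object_ref  uuid[] NOT NULL,       -- the target object's id plus its parents
    title       text,
    message     text
);

CREATE INDEX activity_created_idx ON activity USING btree (created);
CREATE INDEX activity_object_ref_idx ON activity USING gin (object_ref);

CREATE TABLE subscription (
    user_id    uuid NOT NULL,
    object_id  uuid NOT NULL,
    subscribed boolean NOT NULL DEFAULT true,  -- false records an explicit unsubscribe
    PRIMARY KEY (user_id, object_id)
);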
Update: this is working well. The SQL to fetch a timeline looks something like this:
SELECT *
FROM (
    SELECT DISTINCT ON (activity.created, activity.id) *
    FROM activity
    LEFT OUTER JOIN unnest(activity.object_ref) WITH ORDINALITY AS act_ref
        ON true
    LEFT OUTER JOIN subscription
        ON subscription.object_id = act_ref.act_ref
    WHERE activity.created BETWEEN :lower_date AND :upper_date
      AND subscription.user_id = :user_id
    ORDER BY activity.created DESC,
             activity.id,
             act_ref.ordinality DESC
) AS sub
WHERE sub.subscribed = true;
Joining with unnest(...) WITH ORDINALITY, ordering by ordinality, and selecting DISTINCT ON the activity ID filters out activities that have been unsubscribed from at a deeper level. If you don't need to do that, then you can avoid the unnest and the subquery and just use the array containment @> operator:
SELECT *
FROM activity
JOIN subscription
    ON activity.object_ref @> ARRAY[subscription.object_id]
WHERE subscription.user_id = :user_id
  AND activity.created BETWEEN :lower_date AND :upper_date
ORDER BY activity.created DESC;
You could also join with the other object tables to get the object titles - but instead, I decided to add a title column to the activity table. This is denormalised, but it doesn't require a complex join with many tables, and it tolerates objects being deleted (which might be the action that triggered the activity logging).

Most efficient database schema for counting keywords

I'm working on an iPhone app with a GAE backend. I currently have a database of ~8000 products and each product has 5 keywords, mined from reviews, that are the words used most often to describe the product. Once I deploy the app, I'd like to allow users to add new products, and add their 5 keywords to existing products. So, when "reviewing" an existing product, they would add their 5 words, and these would be reflected in the Top 5 words if they push a word over into the Top 5. These keywords will be selected via a large whitelist with indirect selection so I can control the user input. I'd like the application to scale to thousands of users without hitting my backend too hard.
My question is:
What's the most efficient database schema for keeping track of all the words for a product and calculating the top 5 for each product once it's updated?
My two ideas (which may be terrible):
Have a "words" column which contains a 2d array, one dimension is the word, the other is the count for that word. They would then be incremented/decremented as needed.
Have a database with each word as a column and each product as a row and the corresponding row/column would contain the count.
The easiest way to do this would be to have a 'tags' kind, defined something like this (you haven't specified a backend language, so I'm assuming Python):
from google.appengine.ext import db

class Tag(db.Model):
    # Tags should be child entities of Products and have a key name based on
    # the tag, e.g. created with Tag(parent=a_product, key_name='awesome', ...)
    count = db.IntegerProperty(required=True, default=0)

    @classmethod
    def increment_tags(cls, product, tag_names):
        def _tx():
            tags = cls.get_by_key_name(tag_names, parent=product)
            for i, tag in enumerate(tags):
                if tag is None:
                    # New tag
                    tags[i] = tag = cls(key_name=tag_names[i], parent=product)
                tag.count += 1
            db.put(tags)
        return db.run_in_transaction(_tx)

    @classmethod
    def get_top_product_tags(cls, product, num=5):
        return [x.key().name() for x
                in cls.all().ancestor(product).order('-count').fetch(num)]
The increment_tags method increments the count property on all the relevant tags. Since they all have the same parent entity, they're in the same entity group, so it can update them all in a single transaction.
The get_top_product_tags method does a simple datastore query to find the num top ranked tags for a product.
You should use a normalized schema and let SQL and the database engine be your friend. Have a single table with a design like this:
create table KeywordUse
( AppID int
, UserID int
, Sequence int
, Word varchar(50) -- or whatever makes sense
)
You can also have an identity primary key if you like, but AppID + UserID + Sequence is a candidate key (i.e. the combination of these three must be unique).
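In SQL terms, the candidate key could be enforced with a unique constraint (the constraint name here is illustrative):

alter table KeywordUse
    add constraint UQ_KeywordUse unique (AppID, UserID, Sequence);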
To find the top 5 keywords for any app, do a SQL query like this:
select top 5
count(AppID) as Frequency -- If you have an identity PK count that instead.
, Word
from KeywordUse
where AppID = #AppIDVariable...
group by Word, AppID
order by count(AppID) desc
If you are really, really worried about performance you could denormalize the results of this query into a table that shows the words for each app. Then you'd have to work out how often to refresh that snapshot.
REVISED ANSWER:
As Nick Johnson so generously pointed out, aggregate functions are not available in GQL. However, the philosophy of my answer remains unchanged. Let the database engine do its job.
The table should be AppID, Word, and Frequency (AppID and Word are the PK). Each use of a word is added up as it is applied. Then, when you want to know the top five words for an app, you filter by AppID and order by Frequency (descending) with LIMIT 5, as sketched below.
You would need a separate table to track user keywords if that is important.
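A sketch of the revised design in relational terms -- illustrative only, since the actual backend is the GAE datastore (the upsert uses PostgreSQL syntax):

create table KeywordUse
( AppID     int         not null
, Word      varchar(50) not null
, Frequency int         not null default 0
, primary key (AppID, Word)
);

-- each time a user applies a word to an app:
insert into KeywordUse (AppID, Word, Frequency)
values (:app_id, :word, 1)
on conflict (AppID, Word) do update
    set Frequency = KeywordUse.Frequency + 1;

-- top five words for the app:
select Word, Frequency
from KeywordUse
where AppID = :app_id
order by Frequency desc
limit 5;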