Most efficient database schema for counting keywords - iPhone

I'm working on an iPhone app with a GAE backend. I currently have a database of ~8000 products and each product has 5 keywords, mined from reviews, that are the words used most often to describe the product. Once I deploy the app, I'd like to allow users to add new products, and add their 5 keywords to existing products. So, when "reviewing" an existing product, they would add their 5 words, and these would be reflected in the Top 5 words if they push a word over into the Top 5. These keywords will be selected via a large whitelist with indirect selection so I can control the user input. I'd like the application to scale to thousands of users without hitting my backend too hard.
My question is:
What's the most efficient database schema for keeping track of all the words for a product and calculating the top 5 for each product once it's updated?
My two ideas (which may be terrible):
Have a "words" column which contains a 2d array, one dimension is the word, the other is the count for that word. They would then be incremented/decremented as needed.
Have a database with each word as a column and each product as a row and the corresponding row/column would contain the count.

The easiest way to do this would be to have a 'tags' kind, defined something like this (you haven't specified a backend language, so I'm assuming Python):
from google.appengine.ext import db

class Tag(db.Model):
    # Tags should be child entities of Products and have a key name based on
    # the tag, e.g. created with Tag(parent=a_product, key_name='awesome', ...)
    count = db.IntegerProperty(required=True, default=0)

    @classmethod
    def increment_tags(cls, product, tag_names):
        def _tx():
            tags = cls.get_by_key_name(tag_names, parent=product)
            for i, tag in enumerate(tags):
                if tag is None:
                    # New tag
                    tags[i] = tag = cls(key_name=tag_names[i], parent=product)
                tag.count += 1
            db.put(tags)
        return db.run_in_transaction(_tx)

    @classmethod
    def get_top_product_tags(cls, product, num=5):
        return [x.key().name() for x
                in cls.all().ancestor(product).order('-count').fetch(num)]
The increment_tags method increments the count property on all the relevant tags. Since they all have the same parent entity, they're in the same entity group, so it can do all the updates in a single transaction.
The get_top_product_tags method does a simple datastore query to find the num top ranked tags for a product.

You should use a normalized schema and let SQL and the database engine be your friend. Have a single table with a design like this:
create table KeywordUse
( AppID int
, UserID int
, Sequence int
, Word varchar(50) -- or whatever makes sense
)
You can also have an identity primary key if you like, but AppID + UserID + Sequence is a candidate key (i.e. the combination of these three must be unique).
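If you keep the identity PK, the candidate key can still be enforced with a unique constraint. A sketch against the table above:
alter table KeywordUse
  add constraint UQ_KeywordUse unique (AppID, UserID, Sequence)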
To find the top 5 keywords for any app, do a SQL query like this:
select top 5
  count(AppID) as Frequency -- If you have an identity PK, count that instead.
, Word
from KeywordUse
where AppID = @AppIDVariable
group by Word, AppID
order by count(AppID) desc
If you are really, really worried about performance you could denormalize the results of this query into a table that shows the words for each app. Then you'd have to work out how often to refresh that snapshot.
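If you take that route, the snapshot could be as simple as this (a sketch; names are illustrative):
create table TopKeyword
( AppID int not null
, Rank int not null -- 1 through 5
, Word varchar(50) not null
, Frequency int not null
, primary key (AppID, Rank)
)
You would rebuild an app's five rows from the query above whenever its keywords change, or on a schedule.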
REVISED ANSWER:
As Nick Johnson so generously pointed out, aggregate functions are not available in GQL. However, the philosophy of my answer remains unchanged. Let the database engine do its job.
The table should be AppID, Word, and Frequency. (AppID and Word are the PK.) Then each use of a word is added up as it is applied. When you want the top five words for an app, select where AppID = @Value and order by Frequency (descending) with LIMIT 5.
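Expressed in SQL, a minimal sketch of that design (PostgreSQL-style upsert syntax; all names and values are illustrative):
create table KeywordCount
( AppID int not null
, Word varchar(50) not null
, Frequency int not null default 0
, primary key (AppID, Word)
);

-- Record one use of a word:
insert into KeywordCount (AppID, Word, Frequency)
values (42, 'awesome', 1)
on conflict (AppID, Word) do update
set Frequency = KeywordCount.Frequency + 1;

-- Top five words for an app:
select Word, Frequency
from KeywordCount
where AppID = 42
order by Frequency desc
limit 5;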
You would need a separate table to track user keywords if that is important.

Related

Feedback about my database design (multi tenancy)

The idea of the SaaS tool is to have dynamic tables with dynamic custom fields and values of different types. We were thinking of following the force.com/salesforce.com example, but it seems too complicated to maintain going forward, and it would add a huge level of abstraction to reporting, so we came up with a simpler idea; we just want to be sure it's a reasonably good approach.
This is the architecture we have today (in a few steps).
Each tenant has its own separate database on the cluster (Postgres 12).
The TABLE table keeps all of those tables as references; this entity has a ManyToOne relation to the META table and a OneToMany relation to the DATA table.
The META table holds metadata configuration and has a OneToMany relation to FIELDS (which holds the name of each field, its type, e.g. TEXT/INTEGER/BOOLEAN/DATETIME etc., and the attribute value as a string, used only as a reference).
The DATA table has a ManyToOne relation to TABLES and 50 character-varying columns named attribute1...attribute50, all NULL-able.
Example flow today:
When a user wants to open a TABLE's DATA, e.g. "CARS", we load the META table with all the FIELDS (to get the fields for this query). Say the user wants to query against the Brand, Class, Year and Price columns.
Our logic checks the references for Brand, Class, Year and Price in the META>FIELDS table, so we know that Brand = attribute2, Class = attribute5, Year = attribute6 and Price = attribute7.
We parse the request into a query, e.g. SELECT [attr...2,5,6,7] FROM DATA, and show the results to the user. If the user then decides to filter on this data, e.g. Year > 2017 AND Class = 'A', we use SQL's CAST() functionality, for example SELECT CAST(attribute6 AS int), attribute5 FROM DATA WHERE CAST(attribute6 AS int) > 2017 AND attribute5 = 'A'; so we can support most principles of SQL.
However, moving forward we are a bit scared about:
Managing such an environment for more tenants as the number of tables grows (e.g. 50 per customer, with roughly 1-5 million rows per TABLE; 5 million is the maximum we allow, and for bigger data we have BigQuery). That gives us 50-250 million rows in a single DATA_X table, which might hurt query performance, especially since we expose simple WHERE statements (less, equal, null, etc.) through an abstraction language, e.g. GET CARS [BRAND,CLASS,PRICE...] FILTER [EQ(CLASS,A),MT(YEAR,2017)], developed to be similar to JQL (Jira Query Language).
Transaction locks: we allow batch CSV uploads into DATA_X, so when a tenant loads e.g. 1 GB of data, the DATA table is effectively locked for other systems.
Keeping multiple NULL columns, which can affect space a bit. (For now we are not that scared: at TABLE creation the customer decides how many columns they want, and based on that we assign the TABLE to one of the hardcoded entities DATA_5, DATA_10, DATA_15, DATA_20, DATA_30 or DATA_50, where the number corresponds to the limit on attribute columns; those entities are distinct, and we also support migration if a customer decides to switch from 5 to 10 attributes, etc.)
We are at a super early stage, so we can and should make these changes before we scale. We knew this was most likely not the best approach, but we kept it to get the project running for small customers, and for now it's working just fine.
We were also thinking about JSONB objects, but that's not an option, as we want to keep getting the data simple.
What do you think about this solution? (FYI, DATA has a PRIMARY KEY composed of two columns, (ID, TABLEID), and a built-in CreatedAt column which is used in most queries, so there will be at most 3 indexes.)
If it seems bad, what would you recommend as an alternative, given the details I've shared (basically a schema-less RDBMS)?
IMHO, I anticipate issues when you want to join tables and also when using CAST etc.
We followed the approach below, which may be of help to you.
We have a table called Cars and also a couple of tables like CarsMeta and CarsExtension. The underlying Cars table has all the fields common to all tenants. The CarsMeta table describes the types of columns you can use to extend the Cars entity. In the CarsExtension table, you have columns like StringCol1...5, IntCol1...5, LongCol1...10.
This way, you can easily filter the data too:
If the filter is on the base table, perform the search; if results are found, match the ids against the CarsExtension table to get the extended rows for the entity.
If the filter is on the extended fields, search the extension table and match against the base entity ids.
We organize the extension table like this:
id - UniqueId
entityid - uniqueid (points to the primary key of the entity)
StringCol1 - string,
...
IntCol1 - int,
...
In this case, it is easy to join the entity and get the data along with the extension fields.
If the table metadata and the data have to be inferred from separate tables, it will be difficult to maintain over a long period of time and with a huge volume of data.
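A rough SQL sketch of that layout (PostgreSQL-style; all names, types and column counts are illustrative):
CREATE TABLE Cars (
    id        uuid PRIMARY KEY,
    tenant_id uuid NOT NULL,
    brand     text,   -- ...common fields shared by all tenants...
    year      int
);

CREATE TABLE CarsMeta (
    tenant_id   uuid NOT NULL,
    column_name text NOT NULL,   -- e.g. 'StringCol1'
    label       text NOT NULL,   -- the tenant-facing field name
    PRIMARY KEY (tenant_id, column_name)
);

CREATE TABLE CarsExtension (
    id         uuid PRIMARY KEY,
    entityid   uuid NOT NULL REFERENCES Cars (id),
    StringCol1 text,
    StringCol2 text,
    IntCol1    int,
    IntCol2    int
    -- ...more typed slots up to the agreed limits...
);

-- Filter on a base field, then pull the extended row in one join:
SELECT c.*, e.*
FROM Cars AS c
JOIN CarsExtension AS e ON e.entityid = c.id
WHERE c.brand = 'Brand A';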
HTH

Defining relevant indices for database indexing

I need to define and create indices for a PostgreSQL DB used for translation memory.
This is related to this (Database design question regarding performance) question I posted, and the oversimplified design follows this (How to design a database for translation dictionary?) answer. The only difference is that I have a Segment (basically a sentence instead of a word).
Tables:
I. languages
ID NAME
---------------
1 English
2 Slovenian
II. segments
ID CONTENT LANGUAGE_ID
-------------------------------
1 Hello World 1
2 Zdravo, svet 2
III. translation_records (TranslationRecord has more columns, omitted here, like domain, user etc.)
ID SOURCE_SEGMENT_ID TARGET_SEGMENT_ID
--------------------------------------
1 1 2
I want to index the segments table for searching existing translations and for searching combinations of words in the DB.
My question is: is it enough to create an index on the CONTENT column of the segments table, or should I also tokenize the CONTENT column into a new column TOKENS and index that as well?
Also, am I missing something else that might be important for creating such indices?
---EDIT---
Querying examples:
When a user enters a new text to translate, the app returns a predefined number of existing translation records whose source segment's content matches the entered text by a certain percentage.
When a user triggers a manual query, it lists a predefined number of existing translation records where the source segment's content includes the words marked by the user (i.e. the concordance search).
Since there is only one table for all language combinations, the first condition in any query is the language_combination (an attribute of translation_record).
---EDIT---
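For concreteness, here is a sketch of indexes that could serve those two query patterns in PostgreSQL, against the segments table above. The pg_trgm extension and the 'simple' text-search configuration are assumptions, not part of the design:
-- Fuzzy "matches by a certain percent" lookups (pattern 1),
-- using the pg_trgm extension:
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX segments_content_trgm_idx
    ON segments USING gin (content gin_trgm_ops);

SELECT id, content, similarity(content, 'Hello World') AS score
FROM segments
WHERE content % 'Hello World'   -- true above the similarity threshold
  AND language_id = 1
ORDER BY score DESC
LIMIT 10;

-- Word/concordance search (pattern 2), using full-text search:
CREATE INDEX segments_content_fts_idx
    ON segments USING gin (to_tsvector('simple', content));

SELECT id, content
FROM segments
WHERE to_tsvector('simple', content) @@ plainto_tsquery('simple', 'hello world')
  AND language_id = 1;
With an expression index like the second one, a separate TOKENS column isn't strictly necessary, since the tokenization lives in the index itself.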

Filter and display database audit / changelog (activity stream)

I'm developing an application with SQLAlchemy and PostgreSQL. Users of the system modify data in 8 or so tables. Consider this contrived example schema:
I want to add visible logging to the system to record what has changed, but not necessarily how it has changed. For example: "User A modified product Foo", "User A added user B" or "User C purchased product Bar". So basically I want to store:
Who made the change
A message describing the change
Enough information to reference the object that changed, e.g. the product_id and customer_id when an order is placed, so the user can click through to that entity
I want to show each user a list of recent and relevant changes when they log in to the application (a bit like the main timeline in Facebook etc). And I want to store subscriptions, so that users can subscribe to changes, e.g. "tell me when product X is modified", or "tell me when any products in store S are modified".
I have seen the audit trigger recipe, but I'm not sure it's what I want. That audit trigger might do a good job of recording changes, but how can I quickly filter it to show recent, relevant changes to the user? Options that I'm considering:
Have one column per ID type in the log and subscription tables, with an index on each column
Use full text search, combining the ID types as a tsvector
Use an hstore or json column for the IDs, and index the contents somehow
Store references as URIs (strings) without an index, and walk over the logs in reverse date order, using application logic to filter by URI
Any insights appreciated :)
Edit: It seems what I'm talking about is an activity stream. The suggestion in this answer to filter by time first is sounding pretty good.
Since the objects all use uuid for the id field, I think I'll create the activity table like this:
Have a generic reference to the target object, with a uuid column with no foreign key, and an enum column specifying the type of object it refers to.
Have an array column that stores generic uuids (maybe as text[]) of the target object and its parents (e.g. parent categories, store and organisation), and search the array for matching subscriptions. That way a subscription for a parent category can match a child in one step (denormalised).
Put a btree index on the date column, and (maybe) a GIN index on the array UUID column.
I'll probably filter by time first to reduce the amount of searching required. Later, if needed, I'll look at using GIN to index the array column (this partially answers my question "Is there a trick for indexing an hstore in a flexible way?")
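A sketch of the tables this design implies (the subscribed flag on subscription is assumed here to support the unsubscribe filtering used below; all other names are illustrative):
CREATE TABLE activity (
    id          uuid PRIMARY KEY,
    created     timestamptz NOT NULL DEFAULT now(),
    actor_id    uuid NOT NULL,      -- who made the change
    message     text NOT NULL,      -- e.g. 'User A modified product Foo'
    object_type text NOT NULL,      -- enum-like discriminator for the target
    object_ref  uuid[] NOT NULL     -- target object and its parents
);

CREATE TABLE subscription (
    user_id    uuid NOT NULL,
    object_id  uuid NOT NULL,
    subscribed boolean NOT NULL DEFAULT true,
    PRIMARY KEY (user_id, object_id)
);

CREATE INDEX activity_created_idx ON activity (created);   -- btree, filter by time first
CREATE INDEX activity_object_ref_idx ON activity USING gin (object_ref);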
Update: this is working well. The SQL to fetch a timeline looks something like this:
SELECT *
FROM (
SELECT DISTINCT ON (activity.created, activity.id)
*
FROM activity
LEFT OUTER JOIN unnest(activity.object_ref) WITH ORDINALITY AS act_ref
ON true
LEFT OUTER JOIN subscription
ON subscription.object_id = act_ref.act_ref
WHERE activity.created BETWEEN :lower_date AND :upper_date
AND subscription.user_id = :user_id
ORDER BY activity.created DESC,
activity.id,
act_ref.ordinality DESC
) AS sub
WHERE sub.subscribed = true;
Joining with unnest(...) WITH ORDINALITY, ordering by ordinality, and selecting distinct on the activity ID filters out activities that have been unsubscribed from at a deeper level. If you don't need to do that, then you could avoid the unnest and just use the array containment @> operator, with no subquery:
SELECT *
FROM activity
JOIN subscription ON activity.object_ref @> ARRAY[subscription.object_id]
WHERE subscription.user_id = :user_id
AND activity.created BETWEEN :lower_date AND :upper_date
ORDER BY activity.created DESC;
You could also join with the other object tables to get the object titles - but instead, I decided to add a title column to the activity table. This is denormalised, but it doesn't require a complex join with many tables, and it tolerates objects being deleted (which might be the action that triggered the activity logging).

Reporting Services and Dynamic Fields

I'm new to Reporting Services, so this question might be insane. I am looking for a way to create an empty 'template' report (that is basically a form letter) rather than having to create one for every client in our system. Part of this form letter is a section that can contain any number of the 25 specific fields. The section is arranged like this:
Name: Jesse James
Date of Birth: 1/1/1800
Address: 123 Blah Blah Street
Anywhere, USA 12345
Another Field: Data
Another Field2: More Data
Those (and any of the other fields the client specifies) could be arranged in any order, and the label on the left could be whatever the client decides (for example, 'DOB' instead of 'Date of Birth'). Ideally, I'd like a web interface where you can click on the fields you want, specify the order in which they'll appear, and specify the custom label. I figured out a way to specify the labels and order them (and load them 'dynamically' in the report), but I wanted to take it one step further if I could and allow dynamic field (right side) selection and ordering. The catch is, I want to do this without using dynamic SQL.
I went down the path of a configuration table containing an ordinal, custom label text, and the actual column name, and attempted to join that table with the table that actually contains the data via information_schema.columns. Maybe querying ALL of the potential fields and having an INNER JOIN do my filtering (if there's a match from the 'configuration' table, etc.). That doesn't work like I thought it would :) I guess I was thinking I could simulate the functionality of a dataset (which has the value and field name baked into the object). I realize that this isn't the optimal tool for such a feat; it's just what I'm forced to work with.
The configuration table would hold the configuration for many customers/reports, and I would be filtering by a customer ID. The config table would look something like this:
CustID  LabelText      ColumnName  Ordinal
1       First Name     FName       1
1       Last Name      LName       2
1       Date of Birth  DOBirth     3
2       Client ID      ClientID    1
2       Last Name      LName       2
2       Address 1      Address1    3
2       Address 2      Address2    4
All that to say:
Is there a way to pull off the above mentioned query?
Am I being too picky about not using dynamic SQL, given that the section in question will only pull back one row? (Though there are hundreds of clients running this report (letter) two or three times a day.)
Also, keep in mind I am not trying to dynamically create text boxes on the report. I will either concatenate the fields into a single string and dump that into a text box, or have multiple reports, each with a set number of text boxes expecting a generic field name ("field1", etc.). The more I type, the crazier this sounds...
If there isn't a way to do this I'll likely finagle something in custom code; but my OCD side wants to believe there is SQL beyond my current powers that can do this in a slicker way.
Not sure why you need this all returned in one row: it seems like SSRS would want this normalized further, returning a row for every row in the configuration table for the current report. If you really need to concatenate, do that in embedded code in the report, or consider just putting a table in the form letter. The query below makes some assumptions about your configuration table: does it hold only the configuration for the current report, or the config for many customers/reports at once? Also, you didn't give much info about how you'll filter to the appropriate record, so I just used a customer ID.
SELECT
    config.Ordinal,
    config.LabelText,
    CASE config.ColumnName
        WHEN 'FName' THEN DataRecord.FirstName
        WHEN 'LName' THEN DataRecord.LastName
        WHEN 'ClientID' THEN DataRecord.ClientID
        WHEN 'DOBirth' THEN DataRecord.DOB
        WHEN 'Address' THEN DataRecord.Address
        WHEN 'Field' THEN DataRecord.Field
        WHEN 'Field2' THEN DataRecord.Field2
        ELSE NULL
    END AS response
FROM ConfigurationTable AS config
LEFT OUTER JOIN DataTable AS DataRecord
    ON config.CustID = DataRecord.CustomerID
WHERE DataRecord.CustomerID = @CustID
ORDER BY config.Ordinal
There are other ways to do this, in SSRS or in SQL, depends on more details of your requirements.
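For instance, here is a sketch of one UNPIVOT-based alternative (T-SQL; the CASTs are needed because UNPIVOT requires a single common type, and all names are illustrative):
SELECT
    config.Ordinal,
    config.LabelText,
    unpvt.response
FROM ConfigurationTable AS config
JOIN (
    SELECT ColumnName, response
    FROM (
        SELECT CAST(FirstName AS varchar(200)) AS FName,
               CAST(LastName AS varchar(200)) AS LName,
               CAST(DOB AS varchar(200)) AS DOBirth
        FROM DataTable
        WHERE CustomerID = @CustID
    ) AS d
    UNPIVOT (response FOR ColumnName IN (FName, LName, DOBirth)) AS u
) AS unpvt
    ON unpvt.ColumnName = config.ColumnName
WHERE config.CustID = @CustID
ORDER BY config.Ordinal
This keeps the column list static (no dynamic SQL) while letting the config table drive which rows appear and in what order.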

Create a new FileMaker layout showing unique records based on one field and a count for each

I have a table like this:
Application,Program,UsedObject
It can have data like this:
A,P1,ZZ
A,P1,BB
A,P2,CC
B,F1,KK
I'd like to create a layout to show:
Application,# of Programs
A,2
B,1
The point is to count the distinct programs.
For the life of me I can't make this work in FileMaker. I've created a summary field to count programs resetting after each group, but because it doesn't eliminate the duplicate programs I get:
A,3
B,1
Any help much appreciated.
Create a summary field:
cntApplication = Count of Application
Do this by going into Define Fields and creating a field called cntApplication, of type Summary. In the options dialogue, make the summary field a count on Application.
Now create a new layout with a subsummary part and no body. The subsummary part should be sorted on Application. Put the Application and cntApplication fields in the subsummary. If you enter Browse mode and sort by Application, you ought to get the data you want.
You can also create a calc field with the formula
GetSummary(cntApplication; Application)
This will allow you to use the total number of Applications within a record.
Since I also generate the data in this form, the solution I've adopted is to fill two tables in FileMaker. One provides the summary view, the other the detailed view.
I think that your problem is down to duplicate records and an inadequate key.
Create a text field called "App_Prog". In the options box set it to an auto-enter calc, unchecking the 'Do not replace...' option, and use the following calc:
Application & "_" & Program
Now create a self join to the table using App_Prog as the field on both sides, and call this 'MatchingApps'.
Now, create (if you don't already have one) a unique serial number field, 'Counter' say, and make sure that you enter a value in each record. (Find all, click in the field, and use the serial number option in 'Replace Field Contents...')
Now add a new calc field - Is_Duplicate with the following calc...
If (Counter = MatchingApps::Counter; "Master Record" ; "Duplicate")
Finally, find all, click in the 'Application' field, and use 'Replace Field Contents...' with a calculation to force the auto-enter calc for 'App_Prog' to come up with a value.
Where does this get you? You should now have a set of records that are marked either "Master Record" or "Duplicate". Do a find on "Master Record", and then you can perform your summary (by Application) to count distinct application-program pairs.
If you have access to custom functions (you need FileMaker Pro Advanced), I'd do it like this:
Add the RemoveDuplicates function as found here (this is a recursive function that takes a list of strings and returns a list of unique values).
In the relationships graph, add another occurrence of your table and add an Application = Application relationship.
Create a calculated field in the table with the calculation looking something like this:
ValueCount(RemoveDuplicates(List(TABLE2::Program)))
You'll find that each record will contain the number of distinct programs for the given application. Showing a summary for each application should be relatively trivial from here.
I think the best way to do this is to create a separate applications table. So as you've given the data, it would have two records, one for A and one for B.
So, with the addition of an Applications table and your existing table, which I'll call Objects, create a relationship from Applications to Objects (through a table occurrence called ObjectsParent) based on the application name as the match field. Create a self-join relationship between Objects and itself with both Application and Program as the match fields; I'll call one of these "table occurrences" ObjectsParent and the other ObjectsChildren. Make sure that there's a primary key field in Objects that is set to auto-enter a serial number or uses some other method to ensure uniqueness. I'll call this ID.
So your relationship graph has three table occurrences:
Applications::Application = ObjectsParent::Application
ObjectsParent::Application = ObjectsChildren::Application, ObjectsParent::Program = ObjectsChildren::Program
Now create a calculation field in Objects, and calculating from the context of ObjectsParent, give it the following formula:
AppCount = Count( ObjectsChildren::ID )
Create a calculation field in Applications, calculating from the context of the table occurrence you used to relate it to ObjectsParent, with the following formula:
AppCount = ObjectsParent::AppCount
The count field in Objects will have the same value for every object with the same application, so it doesn't matter which one you get this data from.
If you now view the data in Applications in list view, you can place the Applications::Application and Applications::AppCount fields on the layout and you should get what you've requested.