What is the correct format for a database table structure? - database-normalization

Can anybody tell me whether the database structure below, where I create a single table to store multiple banners and templates for different event types, is a good database design? Bear in mind that we will keep adding, updating and removing banners and templates from the admin side. Will it satisfy relational database and normalization rules? Also, which approach gives the lower query execution time?
A) table1
Event_type_id  Key            Value
1              Small_banners  banner1;banner2;banner3
2              Template       temp1;temp2;temp3
...
Or is it a better approach to instead create two tables, like below?
B) Banner table
Id  event_type_id  value
1   1              banner_1.jpg
2   1              banner_3.jpg
3   2              banner_4.jpg
...
And a second table for templates:
Template table
Id  event_type_id  value
1   1              temp_1.jpg
2   1              temp_2.jpg
3   2              temp_3.jpg
...
C) One more thing: is having 50-odd rows in a single table better, or should we split them across multiple tables?
Please suggest with reasons.

Going by the rules of normalisation, it is always better to keep separate tables instead of inserting comma-separated values into a single field.
But newer databases such as Postgres also support storing JSON data in fields and querying it.
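For example, a minimal sketch of approach B with proper foreign keys might look like this (all names are assumptions based on the question):

-- Hypothetical schema for approach B: one row per banner / per template,
-- each linked to its event type by a foreign key.
CREATE TABLE event_type (
    id   INT PRIMARY KEY,
    name VARCHAR(100) NOT NULL
);

CREATE TABLE banner (
    id            INT PRIMARY KEY,
    event_type_id INT NOT NULL REFERENCES event_type(id),
    value         VARCHAR(255) NOT NULL  -- e.g. 'banner_1.jpg'
);

CREATE TABLE template (
    id            INT PRIMARY KEY,
    event_type_id INT NOT NULL REFERENCES event_type(id),
    value         VARCHAR(255) NOT NULL  -- e.g. 'temp_1.jpg'
);

-- Fetching all banners for one event type stays a simple indexed lookup:
SELECT value FROM banner WHERE event_type_id = 1;

Adding or removing a banner from the admin side is then a single-row INSERT or DELETE rather than an edit of a delimited string.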

Aggregate on Redshift SUPER type

Context
I'm trying to find the best way to represent and aggregate a high-cardinality column in Redshift. The source is event-based and looks something like this:
user  timestamp            event_type
1     2021-01-01 12:00:00  foo
1     2021-01-01 15:00:00  bar
2     2021-01-01 16:00:00  foo
2     2021-01-01 19:00:00  foo
Where:
the number of users is very large
a single user can have very large numbers of events, but is unlikely to have many different event types
the number of different event_type values is very large, and constantly growing
I want to aggregate this data into a much smaller dataset with a single record (document) per user. These documents will then be exported. The aggregations of interest are things like:
Number of events
Most recent event time
But also:
Number of events for each event_type
It is this latter case that I am finding difficult.
Solutions I've considered
The simple "columnar-DB-friendly" approach to this problem would be to have an aggregate column for each event type (sketched in SQL after the table below):
user  nb_events  ...  nb_foo  nb_bar
1     2          ...  1       1
2     2          ...  2       0
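For concreteness, a sketch of how that wide aggregate might be produced, assuming the source table is called events:

-- Hypothetical query: one hard-coded aggregate column per event type.
SELECT
    "user",
    COUNT(*)                                       AS nb_events,
    COUNT(CASE WHEN event_type = 'foo' THEN 1 END) AS nb_foo,
    COUNT(CASE WHEN event_type = 'bar' THEN 1 END) AS nb_bar
FROM events
GROUP BY "user";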
But I don't think this is an appropriate solution here, since the event_type field is dynamic and may have hundreds or thousands of values (and Redshift has an upper limit of 1,600 columns). Moreover, there may be multiple types of aggregation on this event_type field (not just count).
A second approach would be to keep the data in its vertical form, where there is not one row per user but rather one row per (user, event_type). However, this really just postpones the issue - at some point the data still needs to be aggregated into a single record per user to achieve the target document structure, and the problem of column explosion still exists.
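In SQL terms that vertical form is roughly the following (again assuming an events source table):

-- One row per (user, event_type) pair instead of one row per user.
SELECT "user", event_type, COUNT(*) AS nb_events
FROM events
GROUP BY "user", event_type;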
A much more natural (I think) representation of this data is as a sparse array/document/SUPER:
user  nb_events  ...  count_by_event_type (SUPER)
1     2          ...  {"foo": 1, "bar": 1}
2     2          ...  {"foo": 2}
This also pretty much exactly matches the intended SUPER use case described by the AWS docs:
When you need to store a relatively small set of key-value pairs, you might save space by storing the data in JSON format. Because JSON strings can be stored in a single column, using JSON might be more efficient than storing your data in tabular format. For example, suppose you have a sparse table, where you need to have many columns to fully represent all possible attributes, but most of the column values are NULL for any given row or any given column. By using JSON for storage, you might be able to store the data for a row in key:value pairs in a single JSON string and eliminate the sparsely-populated table columns.
So this is the approach I've been trying to implement. But I haven't quite been able to achieve what I'm hoping to, mostly due to difficulties populating and aggregating the SUPER column. These are described below:
Questions
Q1:
How can I insert into this kind of SUPER column from another SELECT query? The Redshift docs only really discuss SUPER columns in the context of initial data load (e.g. by using json_parse), but never discuss the case where this data is generated from another Redshift query. I understand that this is because the preferred approach is to load SUPER data but convert it to columnar data as soon as possible.
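For concreteness, the kind of statement meant here would be roughly the following (table names assumed; the SUPER construction is exactly the part I don't know how to write):

-- Hypothetical INSERT ... SELECT into a table with a SUPER column.
INSERT INTO my_aggregated_table ("user", nb_events, count_by_event_type)
SELECT
    "user",
    COUNT(*) AS nb_events,
    NULL     AS count_by_event_type  -- placeholder: how to build {"foo": 1, "bar": 1} here?
FROM events
GROUP BY "user";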
Q2:
How can I re-aggregate this kind of SUPER column, while retaining the SUPER structure? Until now, I've discussed a simplified example which only aggregates by user. In reality, there are other dimensions of aggregation, and some analyses of this table will need to re-aggregate the values shown in the table above. By analogy, the desired output might look something like (aggregating over all users):
nb_events  ...  count_by_event_type (SUPER)
4          ...  {"foo": 3, "bar": 1}
I can get close to achieving this re-aggregation with a query like (where the listagg of key-value string pairs is a stand-in for the SUPER type construction that I don't know how to do):
select
    sum(nb_events) nb_events,
    (
        select listagg(s)
        from (
            select
                k::text || ':' || sum(v)::text as s
            from my_aggregated_table inner_query,
                unpivot inner_query.count_by_event_type as v at k
            group by k
        ) a
    ) count_by_event_type
from my_aggregated_table outer_query
But Redshift doesn't support this kind of correlated query:
[0A000] ERROR: This type of correlated subquery pattern is not supported yet
Q3:
Are there any alternative approaches to consider? Normally I'd handle this kind of problem with Spark, which I find much more flexible for these kinds of problems. But if possible it would be great to stick with Redshift, since that's where the source data is.

Redirecting Data in Postgres

I have a project where all the data is going into one big Postgres table like:
Time        ItemID  ItemType  Value
2021-09-16  1       A         2
2021-09-16  2       B         3
2021-09-17  3       A         3
My issue is that this table is becoming very large. Since there are only 2 ItemTypes, I'd like to have MyTableA and MyTableB, and then have a 3rd table with a one-to-one mapping of ItemID to ItemType.
What is the most performant way to insert the data and redirect it to the respective table? I am currently thinking about creating a view with an INSTEAD OF trigger and then using two insert statements with joins to get the desired filtering. Is there a better way? Perhaps maintaining the ItemID_A/B in an array somewhere? Or should I figure out a way to do this client side?
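For reference, a minimal sketch of the view-plus-INSTEAD OF-trigger idea (all table and column names are assumed from the example above):

-- Hypothetical split tables, one per ItemType.
CREATE TABLE my_table_a (time timestamptz, item_id int, value int);
CREATE TABLE my_table_b (time timestamptz, item_id int, value int);

-- View that clients keep inserting into.
CREATE VIEW my_table AS
    SELECT time, item_id, 'A' AS item_type, value FROM my_table_a
    UNION ALL
    SELECT time, item_id, 'B' AS item_type, value FROM my_table_b;

-- Trigger function that redirects each inserted row to the right table.
CREATE FUNCTION redirect_insert() RETURNS trigger AS $$
BEGIN
    IF NEW.item_type = 'A' THEN
        INSERT INTO my_table_a (time, item_id, value)
        VALUES (NEW.time, NEW.item_id, NEW.value);
    ELSE
        INSERT INTO my_table_b (time, item_id, value)
        VALUES (NEW.time, NEW.item_id, NEW.value);
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER my_table_redirect
    INSTEAD OF INSERT ON my_table
    FOR EACH ROW EXECUTE FUNCTION redirect_insert();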

Combine data from several queries

We are looking into a more powerful way of collecting and processing data to be used in our reports. For one advanced report on a big database, we need to run two independent SQL queries (on the same data source) and combine the results afterwards.
Query 1 returns:
user id#1 ... 3 columns
user id#2 ... 3 columns
user id#4 ... 3 columns
Query 2 returns:
user id#1 ... 5 columns
user id#3 ... 5 columns
user id#4 ... 5 columns
What we want to show:
user id#1 ... 3 columns + 5 columns
user id#2 ... 3 columns
user id#3 ... 5 columns
user id#4 ... 3 columns + 5 columns
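(By combining in SQL we essentially mean a full outer join of the two result sets on the user id, roughly like the sketch below; query1_results and query2_results are stand-ins for our two prepared queries, and the column names are placeholders:)

-- Hypothetical combination of the two result sets on user_id.
SELECT COALESCE(q1.user_id, q2.user_id) AS user_id,
       q1.col_a, q1.col_b, q1.col_c,                      -- the 3 columns from Query 1
       q2.col_d, q2.col_e, q2.col_f, q2.col_g, q2.col_h   -- the 5 columns from Query 2
FROM query1_results q1
FULL OUTER JOIN query2_results q2
    ON q1.user_id = q2.user_id;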
Although it's counter-intuitive, we found that combining the results of both queries in SQL leads to a considerably worse runtime.
We have looked at subdatasets, but from my understanding it's not possible to mix the data from two subdatasets (or the main data+one subdataset) in a single table.
We have looked at subreports, but from my understanding a subreport will call the query once for each row in the report, if I put the subreport in the Details area as we intend to. But for performance reasons we want to run the two queries that we prepared, and each only once.
We think the most reasonable approach is for us to write such advanced reports in Java, and it's possible; however, the JavaBean data source cannot access the report parameters. Our database is huge, so we can't just run queries without a WHERE clause and filter afterwards; the Java code needs access to the report parameters.
We are currently looking into implementing JRQueryExecutor as recommended there and there (last comment), or even taking advantage of scriptlets.
But it sounds quite advanced, and we are wondering: are we thinking the wrong way or heading in the wrong direction? And if JRQueryExecutor is the correct way, any example or documentation would be welcome.
We are also considering trying to refactor our SQL to achieve the result with only one query, but we do feel that the reporting system ought to allow us to also manipulate the data in Java.
In the end we did it with a scriptlet. In afterReportInit, inheriting from JRDefaultScriptlet, you get the parameters and the data source from parametersMap, and you can then fill the data source from Java.

How to use Berkeley DB's non-SQL, Key/Value API to implement fuzzy query (LIKE key word)

I can understand this blog post, but it does not seem applicable to the case of using Berkeley DB's non-SQL Key/Value API to implement "SELECT * FROM table WHERE name LIKE '%abc%'".
Table structure
-------------------------------------------
key  data(name)
-------------------------------------------
0    abc
1    abcd
2    you
3    spring
4    sabcd
5    timeab
...
I guess iterating over all records is not an efficient way, but it does do the trick.
You're correct. Absent any other tables, you'd have to scan all the entries and test each data item. In many cases, it's as simple as this.
If you're using SQL LIKE, I doubt you'll be able to do better unless your data items have a well-defined structure.
However, if the "WHERE name LIKE %abc%" query you have is really WHERE name="abc", then you might choose to take a performance penalty on your db_put call to create a reverse index, in addition to your primary table:
-------------------------------------------
key(name)  data(index)
-------------------------------------------
abc        0
abcd       1
sabcd      4
spring     3
timeab     5
you        2
This table, sorted in alphabetical order, requires a lexical key comparison function, and uses support for duplicate keys in BDB. Now, to find the key for your entry, you could simply do a db_get ("abc"), or better, open a cursor with DB_SETRANGE on "abc".
Depending on the kinds of LIKE queries you need to do, you may be able to use the reverse index technique to narrow the search space.

Adding a new column to Table which contains live data

I have a large table consisting of over 60 million records, and I would like to add 2 new columns for data migration purposes. There are indexes on the table and some of them are large. So, by adding the 2 new columns, will I run the risk of slowing down the database while it attempts to add them, and maybe a time-out? Or will it just work?
I know that if I try to rearrange the columns, SQL Server will ask me to drop and re-create the table, so I definitely don't want that. Is this something everyone is challenged with?
We've had the same problem with column and index changes on larger tables.
I would simply add the columns using ALTER TABLE. The column order, though nice, is irrelevant.
If the columns are NULLable then the time is reasonable. If you want to add a default value and make them NOT NULL, then this is obviously more work. However, I would consider adding them as NULL, then setting the values, then changing them to NOT NULL, making it 3 steps you can do at different times. We do this to reduce the time window we need, even if the whole process takes longer.
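A rough T-SQL sketch of those 3 steps (table name, column names and values are placeholders):

-- Step 1: add the columns as NULLable; this is a quick metadata-only change.
ALTER TABLE dbo.BigTable ADD NewCol1 int NULL, NewCol2 int NULL;

-- Step 2: backfill in batches at a quiet time to limit locking and log growth.
-- Repeat until it affects 0 rows.
UPDATE TOP (10000) dbo.BigTable
SET NewCol1 = 0, NewCol2 = 0
WHERE NewCol1 IS NULL;

-- Step 3: once every row is populated, add defaults and enforce NOT NULL.
ALTER TABLE dbo.BigTable ADD CONSTRAINT DF_BigTable_NewCol1 DEFAULT (0) FOR NewCol1;
ALTER TABLE dbo.BigTable ADD CONSTRAINT DF_BigTable_NewCol2 DEFAULT (0) FOR NewCol2;
ALTER TABLE dbo.BigTable ALTER COLUMN NewCol1 int NOT NULL;
ALTER TABLE dbo.BigTable ALTER COLUMN NewCol2 int NOT NULL;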