How to use Berkeley DB's non-SQL, Key/Value API to implement fuzzy query (LIKE key word) - nosql

I can understand this Blog, but it seems unable to apply in such case that using Berkeley DB's non-SQL, Key/Value API to implement "SELECT * FROM table WHERE name LIKE '%abc%'"
Table structure
-------------------------------------------
key data(name)
-------------------------------------------
0 abc
1 abcd
2 you
3 spring
. sabcd
. timeab
.
I guess iterating all records is not an efficient way, but it really do a trick.

You're correct. Absent any other tables, you'd have to scan all the entries and test each data item. In many cases, it's as simple as this.
If you're using SQL LIKE, I doubt you'll be able to do better unless your data items have a well-defined structure.
However, if the "WHERE name LIKE %abc%" query you have is really WHERE name="abc", then you might choose to take a performance penalty on your db_put call to create a reverse index, in addition to your primary table:
-------------------------------------------
key(name) data(index)
-------------------------------------------
abc 0
abcd 1
sabcd 4
spring 3
timeab 5
you 2
This table, sorted in alphabetical order, requires a lexical key comparison function, and uses support for duplicate keys in BDB. Now, to find the key for your entry, you could simply do a db_get ("abc"), or better, open a cursor with DB_SETRANGE on "abc".
Depending on the kinds of LIKE queries you need to do, you may be able to use the reverse index technique to narrow the search space.

Related

Aggregate on Redshift SUPER type

Context
I'm trying to find the best way to represent and aggregate a high-cardinality column in Redshift. The source is event-based and looks something like this:
user
timestamp
event_type
1
2021-01-01 12:00:00
foo
1
2021-01-01 15:00:00
bar
2
2021-01-01 16:00:00
foo
2
2021-01-01 19:00:00
foo
Where:
the number of users is very large
a single user can have very large numbers of events, but is unlikely to have many different event types
the number of different event_type values is very large, and constantly growing
I want to aggregate this data into a much smaller dataset with a single record (document) per user. These documents will then be exported. The aggregations of interest are things like:
Number of events
Most recent event time
But also:
Number of events for each event_type
It is this latter case that I am finding difficult.
Solutions I've considered
The simple "columnar-DB-friendy" approach to this problem would simply be to have an aggregate column for each event type:
user
nb_events
...
nb_foo
nb_bar
1
2
...
1
1
2
2
...
2
0
But I don't think this is an appropriate solution here, since the event_type field is dynamic and may have hundreds or thousands of values (and Redshift has a upper limit of 1600 columns). Moreover, there may be multiple types of aggregations on this event_type field (not just count).
A second approach would be to keep the data in its vertical form, where there is not one row per user but rather one row per (user, event_type). However, this really just postpones the issue - at some point the data still needs to be aggregated into a single record per user to achieve the target document structure, and the problem of column explosion still exists.
A much more natural (I think) representation of this data is as a sparse array/document/SUPER:
user
nb_events
...
count_by_event_type (SUPER)
1
2
...
{"foo": 1, "bar": 1}
2
2
...
{"foo": 2}
This also pretty much exactly matches the intended SUPER use case described by the AWS docs:
When you need to store a relatively small set of key-value pairs, you might save space by storing the data in JSON format. Because JSON strings can be stored in a single column, using JSON might be more efficient than storing your data in tabular format. For example, suppose you have a sparse table, where you need to have many columns to fully represent all possible attributes, but most of the column values are NULL for any given row or any given column. By using JSON for storage, you might be able to store the data for a row in key:value pairs in a single JSON string and eliminate the sparsely-populated table columns.
So this is the approach I've been trying to implement. But I haven't quite been able to achieve what I'm hoping to, mostly due to difficulties populating and aggregating the SUPER column. These are described below:
Questions
Q1:
How can I insert into this kind of SUPER column from another SELECT query? All Redshift docs only really discuss SUPER columns in the context of initial data load (e.g. by using json_parse), but never discuss the case where this data is generated from another Redshift query. I understand that this is because the preferred approach is to load SUPER data but convert it to columnar data as soon as possible.
Q2:
How can I re-aggregate this kind of SUPER column, while retaining the SUPER structure? Until now, I've discussed a simplified example which only aggregates by user. In reality, there are other dimensions of aggregation, and some analyses of this table will need to re-aggregate the values shown in the table above. By analogy, the desired output might look something like (aggregating over all users):
nb_events
...
count_by_event_type (SUPER)
4
...
{"foo": 3, "bar": 1}
I can get close to achieving this re-aggregation with a query like (where the listagg of key-value string pairs is a stand-in for the SUPER type construction that I don't know how to do):
select
sum(nb_events) nb_events,
(
select listagg(s)
from (
select
k::text || ':' || sum(v)::text as s
from my_aggregated_table inner_query,
unpivot inner_query.count_by_event_type as v at k
group by k
) a
) count_by_event_type
from my_aggregated_table outer_query
But Redshift doesn't support this kind of correlated query:
[0A000] ERROR: This type of correlated subquery pattern is not supported yet
Q3:
Are there any alternative approaches to consider? Normally I'd handle this kind of problem with Spark, which I find much more flexible for these kinds of problems. But if possible it would be great to stick with Redshift, since that's where the source data is.

Statistics of all/many tables in FileMaker

I'm writing a kind of summary page for my FileMaker solution.
For this, I have define a "statistics" table, which uses formula fields with ExecuteSQL to gather info from most tables, such as number of records, recently changed records, etc.
This strangely takes a long time - around 10 seconds when I have a total of about 20k records in about 10 tables. The same SQL on any database system shouldn't take more than some fractions of a second.
What could the reason be, what can I do about it and where can I start debugging to figure out what's causing all this time?
The actual code is, like this:
SQLAusführen ( "SELECT COUNT(*) FROM " & _Stats::Table ; "" ; "" )
SQLAusführen ( "SELECT SUM(\"some_field_name\") FROM " & _Stats::Table ; "" ; "" )
Where "_Stats" is my statistics table, and it has a string field "Table" where I store the name of the other tables.
So each row in this _Stats table should have the stats for the table named in the "Table" field.
Update: I'm not using FileMaker server, this is a standalone client application.
We can definitely talk about why it may be slow. Usually this has mostly to do with the size and complexity of your schema. That is "usually", as you have found.
Can you instead use the DDR ( database design report ) instead? Much will depend on what you are actually doing with this data. Tools like FMPerception also will give you many of the stats you are looking for. Again, depends on what you are doing with it.
Also, can you post your actual calculation? Is the statistic table using unstored calculations? Is the statistics table related to any of the other tables? These are a couple things that will affect how ExecuteSQL performs.
One thing to keep in mind, whether ExecuteSQL, a Perform Find, or relationship, it's all the same basic query under-the-hood. So if it would be slow doing it one way, it's going to likely be slow with any other directly related approach.
Taking these one at a time:
All records count.
Placing an unstored calc in the target table allows you to get the count of the records through the relationship, without triggering a transfer of all records to the client. You can get the value from the first record in the relationship. Super light way to get that info vs using Count which requires FileMaker to touch every record on the other side.
Sum of Records Matching a Value.
using a field on the _Stats table with a relationship to the target table will reduce how much work FileMaker has to do to give you an answer.
Then having a Summary field in the target table so sum the records may prove to be more efficient than using an aggregate function. The summary field will also only sum the records that match the relationship. ( just don't show that field on any of your layouts if you don't need it )
ExecuteSQL is fastest when it can just rely on a simple index lookup. Once you get outside of that, it's primarily about testing to find the sweet-spot. Typically, I will use ExecuteSQL for retrieving either a JSON object from a user table, or verifying a single field value. Once you get into sorting and aggregate functions, you step outside of the optimizations of the function.
Also note, if you have an open record ( that means you as the current user ), FileMaker Server doesn't know what data you have on the client side, and so it sends ALL of the records. That's why I asked if you were using unstored calcs with ExecuteSQL. It can seem slow when you can't control when the calculations fire. Often I will put the updating of that data into a scheduled script.

PostgreSQL: Full text search multitenant site, plus only parts of site

I'm developing a multitenant web application, and I want to add full text search, so that people will be able to:
1) search only the site they are currently visiting (but not all sites), and
2) search only a section of that site (e.g. restrict search to a blog or a forum on the site), and
3) search a single forum thread only.
I wonder what indexes should I add?
Please assume the database is huge (so that e.g. index-scanning-by-site-ID and then filtering-by-full-text-search is too slow).
I can think of three approaches:
Create three indexes. 1) One that indexes everything on a per site basis.
And 2) one that indexes everything on a per-site plus site-section basis.
And 3) one that indexes everything on a per-site and page-id basis.
Create one single index, and insert into [the text to index] magic words like:
"site_<site-id>"
and "section_<section-id>" and "page_<page-id>", and then when I search
for section XX in site YYY I could prefix the search query like so:
"site_XX AND section_YYY AND ...".
Dynamically add database indexes when a new site or site section is created:
create index dw1_posts__search_site_YYY
on dw1_posts using gin(to_tsvector('english', approved_text))
where site_id = 'YYY';
Does any of these three approaches above make sense? Are there better alternatives?
(Details: However, perhaps approach 1 is impossible? Attempting to index-a-column and also index-for-full-text-searching at the same time, results in syntax errors:
> create index dw1_posts__search_site
on dw1_posts (site_id)
using gin(to_tsvector('english', approved_text));
ERROR: syntax error at or near "using"
LINE 1: ...dex dw1_posts__search_site on dw1_posts(site_id) using gin(...
^
> create index dw1_posts__search_site
on dw1_posts
using gin(to_tsvector('english', approved_text))
(site_id);
ERROR: syntax error at or near "("
LINE 1: ... using gin(to_tsvector('english', approved_text)) (site_id);
(If approach 1 was possible, then I could do queries like:
select ... from ... where site_id = ... and <full-text-search-column> ## <query>;
and have PostgreSQL first check site_id and then the full-text-search column, using one single index.)
)
/ End details.)
Update, one week later: I'm using ElasticSearch instead. I got the impression that no scalable solution exists, for faceted search, with relational databases / PostgreSQL. And integrating with ElasticSearch seems to be roughly as simple as implementing and testing and tweaking the approaches suggested here. (For example, PostgreSQL's stemmer/whatever-it's-called might split "section_NNN" into two words: "section" and "NNN" and thus index words that doesn't exist on the page! Tricky to fix such small annoying issues.)
The normal approach would be to create:
one full text index:
CREATE INDEX idx1
ON dw1_posts USING gin(to_tsvector('english', approved_text));
a simple index on the site_id:
CREATE INDEX idx2
on dw1_posts(page_id);
another simple index on the page_id:
CREATE INDEX idx3
on dw1_posts(site_id);
Then it's the SQL planner's business to decide which ones to use if any, and in what order depending on the queries and the distribution of values in the columns. There is no point in trying to outsmart the planner before you've actually witnessed slow queries.
Another alternative, which is similar to the "site_<site-id>" and "section_<section-id>" and "page_<page-id>" alternative, should be to prefix the text to index with:
SiteSectionPage_<site-id>_<section-id>_<subsection-id>_<page-id>
And then use prefix matching (i.e. :*) when searching:
select ... from .. where .. ## 'SiteSectionPage_NN_MMM:* AND (the search phrase)'
where NN is the site ID and MMM is the section ID.
But this won't work with Chinese? I think trigrams are appropriate when indexing Chinese, but then SiteSectionPage... will be split into: Sit, ite, teS, eSe, which makes no sense.

hbase rowkey design

I am moving from mysql to hbase due to increasing data.
I am designing rowkey for efficient access pattern.
I want to achieve 3 goals.
Get all results of email address
Get all results of email address + item_type
Get all results of particular email address + item_id
I have 4 attributes to choose from
user email
reverse timestamp
item_type
item_id
What should my rowkey look like to get rows efficiently?
Thanks
Assuming your main access is by email you can have your main table key as
email + reverse time + item_id (assuming item_id gives you uniqueness)
You can have an additional "index" table with email+item_type+reverse time+item_id and email+item_id as keys that maps to the first table (so retrieving by these is a two step process)
Maybe you are already headed in the right direction as far as concatenated row keys: in any case following comes to mind from your post:
Partitioning key likely consists of your reverse timestamp plus the most frequently queried natural key - would that be the email? Let us suppose so: then choose to make the prefix based on which of the two (reverse timestamp vs email) provides most balanced / non-skewed distribution of your data. That makes your region servers happier.
Choose based on better balanced distribution of records:
reverse timestamp plus most frequently queried natural key
e.g. reversetimestamp-email
or email-reversetimestamp
In that manner you will avoid hot spotting on your region servers.
.
To obtain good performance on the additional (secondary ) indexes, that is not "baked into" hbase yet: they have a design doc for it (look under SecondaryIndexing in the wiki).
But you can build your own a couple of ways:
a) use coprocessor to write the item_type as rowkey to separate tabole with a column containing the original (user_email-reverse timestamp (or vice-versa) fact table rowke
b) if disk space not issue and/or the rows are small, just go ahead and duplicate the entire row in the second (and third for the item-id case) tables.

I have a massive table that I need to optimize. I think I need to use indexes, but I was hoping for some more information about them

So I have a large table that I query (select only) quite frequently. The table is around 12,000 rows long. Since the advent of iOS, the time that it is taking to run these select queries has gone up 4-5x.
I was told that I need to add an index to my table. The query that I am using looks like this:
SELECT * FROM book_content WHERE book_id = ? AND chapter = ? ORDER BY verse ASC
How can I create an index for this table? Is it a command I just run once? What exactly is the index going to do? I didn't learn about these in school so they still seem like some sort of magic to me at this point, so I was hoping to get a little instruction.
Thanks!
You want an index on book_id and chapter. Without an index, a server would do a table scan and essentially load the entire table into memory to do its search. Do a quick search on the CREATE INDEX command for the RDBMS that you are using. You create the index once and every time you do an INSERT or DELETE or UPDATE, the server will update the index automatically. An index can be UNIQUE and it can be on multiple fields (in your case, book_id and chapter). If you make it UNIQUE, the database will not allow you to insert a second row with the same key (in this case, book_id and chapter). On most servers, having one index on two fields is different from having two individual indexes on single fields each.
A Mysql example would be:
CREATE INDEX id_chapter_idx ON book_content (book_id,chapter);
If you want only one record for each book_id, chapter combination, use this command:
CREATE UNIQUE INDEX id_chapter_idx ON book_content (book_id,chapter);
A PRIMARY INDEX is a special index that is UNIQUE and NOT NULL. Each table can only have one primary index. In fact, each table should have one primary index to ensure table integrity, especially during joins.
You don't have to think of indexes as "magic".
An index on an SQL table is much like the index in a printed book - it lets you find what you're looking for without reading the entire book cover-to-cover.
For example, say you have a cookbook, and you're looking for recipes that involve chicken. The index in the back of the book might say something like:
chicken: 30,34,72,84
letting you know that you will find chicken recipes on those 4 pages. It's much faster to find this information in the index than by reading through the whole book, because the index is shorter, and (more importantly) it's in alphabetical order, so you can quickly find the right place in the index.
So, in general you want to create indexes on columns that you will regularly need to query (book_id and chapter, in your example).
When you declare a column as primary key automatically generates an index on that column. In your case for using more often select an index is ideal, because they improve time of selection queries and degrade the time of insertion. So you can create the indexes you think you need without worrying about the performance
Indexes are a very sensitive subject. If you consider using them, you need to be very careful how many you make. The primary key, or id, of each table should have a clustered index. All the rest, it depends on how you plan to use them. I'm very fuzzy in the subject of indexes, and have actually never worked with them, but from a seminar I just watched actually yesterday, you don't want too many indexes - because they can actually slow things down when you don't need to use them.
Let's say you put an index on 5 out of 8 fields on a table. Each index is designated for a particular query somewhere in your software. Well, when 1 query is run, it uses that 1 index, and doesn't need the other 4. So that's unneeded weight on this 1 query. If you need an index, be sure that this is an index which could be useful in many places, not just 1 place.