Does Erlang Mnesia select on an ordered_set give a list in Erlang Term order? - select

In the documentation it isn't clear to me whether I need to iterate through in order with either next or perhaps foldl (it is mentioned that foldr goes in the opposite order to ordered_set so presuambly foldl goes in the same order) or if I can use select and rely upon it being ordered (assuming ordered_set table)

can I use select and rely upon it being ordered (assuming ordered_set table)
ets:select/2:
For tables of type ordered_set, objects are visited in the same order as in a first/next traversal. This means that the match
specification is executed against objects with keys in the first/next
order and the corresponding result list is in the order of that
execution.
ets:first/1:
Returns the first key Key in table Tab. For an ordered_set table, the
first key in Erlang term order is returned.
Table Traversal:
Traversals using match and select functions may not need to scan
the entire table depending on how the key is specified. A match
pattern with a fully bound key (without any match variables) will
optimize the operation to a single key lookup without any table
traversal at all. For ordered_set a partially bound key will limit the
traversal to only scan a subset of the table based on term order.
It would make no sense to me for a table of type ordered_set to return search results in a random order.

Related

DynamoDB - Most efficient way of deleting a partition?

Let's say I have a partition-key that is User:user#email.com and it has several sort-keys like Data, Sale:001, Contact:001.
Now, what if I want to delete this user?
I have thought of two possible ways using the API.
1 - Scan
First do a SCAN where partition-key=User:user#email, get the results and do a batch delete on each returned item with the respective sort-key.
2 - Query
For this I would first need to change all sort keys to have a common prefix, for example User|Data, User|Sale:001, User|Contact:001, and then do a query where
partition-key=User:user#email.com and sort_key.begins_with(User)
after getting the results I would then do a batch delete just like the scan option.
It isn't clear to me which option is the best because I'm not sure if the Scan has the "intelligence" to only scan inside that specific partition or it would scan every record in the table. Because in DynamoDB you pay for each kb of items that was "searched"
Because if it is intelligent then I think it would cost the same as the query option without needing to add a prefix to my sort keys.
Scan() doesn't support partition-key=User:user#email except as a filter expression.
So yes, the whole table would be read. Only the records that match would actually be returned.
Query() on the other hand requires partition-key=user:user#email as a key condition expression. You don't need to make any changes to your sort key design; as including a key condition for the sort key is optional.
The partition key equality test is required, and must be specified in
the following format:
partitionKeyName = :partitionkeyval
If you also want to provide a condition for the sort key, it must be
combined using AND with the condition for the sort key. Following is
an example, using the = comparison operator for the sort key:
partitionKeyName = :partitionkeyval AND sortKeyName = :sortkeyval

Convert to SARGable query

I want to write a query to search the containing string in the table.
Table:
Create table tbl_sarg
(
colname varchar(100),
coladdres varchar(500)
);
Note: I just want to use Index Seek for searching on 300 millions of records.
Index:
create nonclustered index ncidx_colname on tbl_sarg(colname);
Sample Records:
insert into tbl_sarg values('John A Mak','HNo 102 Street Road Uk');
insert into tbl_sarg values('Shawn A Meben','Church road USA');
insert into tbl_sarg values('Lee Decose','ShopNo 22 K Mark UK');
insert into tbl_sarg values('James Don','A Mall, 90 feet road UAE');
Query 1:
select * from tbl_sarg
where colname like '%ee%'
Actual Execution Plan:
Query 2:
select * from tbl_sarg
where charindex('ee',colname)>0
Actual Execution Plan:
Query 3:
select * from tbl_sarg
where patindex('%ee%',colname)>0
Actual Execution Plan:
How to force the query processor to use the index seek instead table/index scan on large data set?
All the queries that you have posted, by definition are not SARgable, for instance, the use of '%..%'' automatically force the Query Engine to do a Scan, the other case is the use of functions (as charindex or patindex) inside your column inside a predicate.
Here some post: https://bertwagner.com/2017/08/22/how-to-search-and-destroy-non-sargable-queries-on-your-server/
Kimberly Tripp has written very interesting articles about it if for you is mandatory to execute this kind of query with wildcards, maybe it is worth to check about the possibility of using FullTextSearch feature. My point is, or your limit and do a precise predicate into your queries or you will have to change of strategy, almost forget, don't try to force the use of Seek with HINT, I can't see that this medicine will be better than the illness.
A search argument, or SARG in short, is a filter predicate that enables the optimizer to rely on
index order. The filter predicate uses the following form (or a variant with two delimiters of a
range, or with the operand positions flipped):
WHERE <column> <operator> <expression>
Such a filter is sargable if:
You don’t apply manipulation to the filtered column.
The operator identifies a consecutive range of qualifying rows in the index. That’s the
case with operators like =, >, >=, <, <=, BETWEEN, LIKE with a known prefix, and so on.
That’s not the case with operators like <>, LIKE with a wildcard as a prefix.
In most cases, when you apply manipulation to the filtered column, the optimizer doesn’t
try to be too smart and understand the meaning of the calculation, and if index ordering
can still be relied on. It simply assumes that the result values might sort differently than the
source values, and therefore index ordering can’t be trusted.
So why doesn’t SQL Server use the index for the %ee% query? Pretend for a moment that you held a phone book in your hand, and I asked you to find everyone whose last name contains the letters %ee%. You would have to scan every single page in the phone book, because the results would include things like:
Anne Lee
Lee Yung
Kathlee
Aleen
When I asked you for all last names containing %ee% anywhere in the name, my query was not sargable – meaning, you couldn’t leverage the indexes to do an index seek.
That’s where SQL Server’s Full Text Search comes in.

Algebra Relational sql GROUP BY SORT BY ORDER BY

I wanted to know what is the equivalent in GROUP BY, SORT BY and ORDER BY in algebra relational ?
Neither is possible in relational algebra but people have been creating some "extensions" for these operations (Note: in the original text, part of the text is written as subscript).
GROUP BY, According to the book Fundamentals of Database Systems (Elmasri, Navathe 2011 6th ed):
Another type of request that cannot be expressed in the basic relational algebra is to
specify mathematical aggregate functions on collections of values from the database.
...
We can define an AGGREGATE FUNCTION operation, using the symbol ℑ (pronounced
script F)7, to specify these types of requests as follows:
<grouping attributes> ℑ <function list> (R)
where <grouping attributes> is a list of attributes of the relation specified in R, and <function list> is a list of (<function> <attribute>) pairs. In each such pair,
<function> is one of the allowed functions—such as SUM, AVERAGE, MAXIMUM,
MINIMUM,COUNT—and <attribute> is an attribute of the relation specified by R. The resulting relation has the grouping attributes plus one attribute for each element in the function list.
ORDER BY (SORT BY), John L. Donaldson's lecture notes* (not available anymore):
Since a relation is a set (or a bag), there is no ordering defined for a relation. That is, two relations are the same if they contain the same tuples, irrespective of ordering. However, a user frequently wants the output of a query to be listed in some particular order. We can define an additional operator τ which sorts a relation if we are willing to allow an operator whose output is not a relation, but an ordered list of tuples.
For example, the expression
τLastName,FirstName(Student)
generates a list of all the Student tuples, ordered by LastName (as the primary sort key) then FirstName (as a secondary sort key). (The secondary sort key is used only if two tuples agree on the primary sort key. A sorting operation can list any number of sort keys, from most significant to least significant.)
*John L. Donaldson's (Emeritus Professor) lecture notes from the course CSCI 311 Database Systems at the Oberlin College Computer Science. Referenced 2015. Checked 2022 and not available anymore.
You can use projection π for the columns that you want group the table by them without aggregating (The PROJECT operation removes any duplicate tuples)
as following:
π c1,c2,c3 (R)
where c1,c2,c3 are columns(attributes) and R is the table(the relation)
According to this SQL to relational algebra converter tool, we have:
SELECT agents.agent_code, agents.agent_name, SUM(orders.advance_amount)
FROM agents, orders
WHERE agents.agent_code = orders.agent_code
GROUP BY agents.agent_code, agents.agent_name
ORDER BY agents.agent_code
Written in functions sort of like:
τ agents.agent_code
γ agent_code, agent_name, SUM(advance_amount)
σ agents.agent_code = orders.agent_code (agents × orders)
With a diagram like:

what's the utility of array type?

I'm totally newbie with postgresql but I have a good experience with mysql. I was reading the documentation and I've discovered that postgresql has an array type. I'm quite confused since I can't understand in which context this type can be useful within a rdbms. Why would I have to choose this type instead of using a classical one to many relationship?
Thanks in advance.
I've used them to make working with trees (such as comment threads) easier. You can store the path from the tree's root to a single node in an array, each number in the array is the branch number for that node. Then, you can do things like this:
SELECT id, content
FROM nodes
WHERE tree = X
ORDER BY path -- The array is here.
PostgreSQL will compare arrays element by element in the natural fashion so ORDER BY path will dump the tree in a sensible linear display order; then, you check the length of path to figure out a node's depth and that gives you the indentation to get the rendering right.
The above approach gets you from the database to the rendered page with one pass through the data.
PostgreSQL also has geometric types, simple key/value types, and supports the construction of various other composite types.
Usually it is better to use traditional association tables but there's nothing wrong with having more tools in your toolbox.
One SO user is using it for what appears to be machine-aided translation. The comments to a follow-up question might be helpful in understanding his approach.
I've been using them successfully to aggregate recursive tree references using triggers.
For instance, suppose you've a tree of categories, and you want to find products in any of categories (1,2,3) or any of their subcategories.
One way to do it is to use an ugly with recursive statement. Doing so will output a plan stuffed with merge/hash joins on entire tables and an occasional materialize.
with recursive categories as (
select id
from categories
where id in (1,2,3)
union all
...
)
select products.*
from products
join product2category on...
join categories on ...
group by products.id, ...
order by ... limit 10;
Another is to pre-aggregate the needed data:
categories (
id int,
parents int[] -- (array_agg(parent_id) from parents) || id
)
products (
id int,
categories int[] -- array_agg(category_id) from product2category
)
index on categories using gin (parents)
index on products using gin (categories)
select products.*
from products
where categories && array(
select id from categories where parents && array[1,2,3]
)
order by ... limit 10;
One issue with the above approach is that row estimates for the && operator are junk. (The selectivity is a stub function that has yet to be written, and results in something like 1/200 rows irrespective of the values in your aggregates.) Put another way, you may very well end up with an index scan where a seq scan would be correct.
To work around it, I increased the statistics on the gin-indexed column and I periodically look into pg_stats to extract more appropriate stats. When a cursory look at those stats reveal that using && for the specified values will return an incorrect plan, I rewrite applicable occurrences of && with arrayoverlap() (the latter has a stub selectivity of 1/3), e.g.:
select products.*
from products
where arrayoverlap(cat_id, array(
select id from categories where arrayoverlap(parents, array[1,2,3])
))
order by ... limit 10;
(The same goes for the <# operator...)

How to query Cassandra by date range

I have a Cassandra ColumnFamily (0.6.4) that will have new entries from users. I'd like to query Cassandra for those new entries so that I can process that data in another system.
My sense was that I could use a TimeUUIDType as the key for my entry, and then query on a KeyRange that starts either with "" as the startKey, or whatever the lastStartKey was. Is this the correct method?
How does get_range_slice actually create a range? Doesn't it have to know the data type of the key? There's no declaration of the data type of the key anywhere. In the storage_conf.xml file, you declare the type of the columns, but not of the keys. Is the key assumed to be of the same type as the columns? Or does it do some magic sniffing to guess?
I've also seen reference implementations where people store TimeUUIDType in columns. However, this seems to have scale issues as this particular key would then become "hot" since every change would have to update it.
Any pointers in this case would be appreciated.
When sorting data only the column-keys are important. The data stored is of no consequence neither is the auto-generated timestamp. The CompareWith attribute is important here. If you set CompareWith as UTF8Type then the keys will be interpreted as UTF8Types. If you set the CompareWith as TimeUUIDType then the keys are automatically interpreted as timestamps. You do not have to specify the data type. Look at the SlicePredicate and SliceRange definitions on this page http://wiki.apache.org/cassandra/API This is a good place to start. Also, you might find this article useful http://www.sodeso.nl/?p=80 In the third part or so he talks about slice ranging his queries and so on.
Doug,
Writing to a single column family can sometimes create a hot spot if you are using an Order-Preserving Partitioner, but not if you are using the default Random Partitioner (unless a subset of users create vastly more data than all other users!).
If you sorted your rows by time (using an Order-Preserving Partitioner) then you are probably even more likely to create hotspots, since you will be adding rows sequentially and a single node will be responsible for each range of the keyspace.
Columns and Keys can be of any type, since the row key is just the first column.
Virtually, the cluster is a circular hash key ring, and keys get hashed by the partitioner to get distributed around the cluster.
Beware of using dates as row keys however, since even the randomization of the default randompartitioner is limited and you could end up cluttering your data.
What's more, if that date is changing, you would have to delete the previous row since you can only do inserts in C*.
Here is what we know :
A slice range is a range of columns in a row with a start value and an end value, this is used mostly for wide rows as columns are ordered. Known column names defined in the CF are indexed however so they can be retrieved specifying names.
A key slice, is a key associated with the sliced column range as returned by Cassandra
The equivalent of a where clause uses secondary indexes, you may use inequality operators there, however there must be at least ONE equals clause in your statement (also see https://issues.apache.org/jira/browse/CASSANDRA-1599).
Using a key range is ineffective with a Random Partitionner as the MD5 hash of your key doesn't keep lexical ordering.
What you want to use is a Column Family based index using a Wide Row :
CompositeType(TimeUUID | UserID)
In order for this not to become hot, add a first meaningful key ("shard key") that would split the data accross nodes such as the user type or the region.
Having more data than necessary in Cassandra is not a problem, it's how it is designed, so what you must ask yourself is "what do I need to query" and then design a Column Family for it rather than trying to fit everything in one CF like you'd do in an RDBMS.