sphinx: sorting by range of an association

Let's say I have a 1:many relation:
Tree:Apples
Tree and Apple each have a primary key ID column, and Apple has a date attribute/column (created_at).
Using Sphinx, I want to retrieve all trees, sorted by the number of apples created during a given period of time. So, for example:
All trees, sorted by the total number of apples created between 1/1/2010 and 1/1/2011.
Is that possible?

So you have two tables:
create table tree ( tree_id int unsigned primary key,...);
and
create table apple ( apple_id int unsigned primary key, tree_id int unsigned, created_at timestamp default current_timestamp,...);
So you can then just build an index on the apples table:
sql_query = select apple_id,tree_id,unix_timestamp(created_at) as created_at from apple
then run group-by queries on it:
$cl->setGroupBy('tree_id',SPH_GROUPBY_ATTR,'#count DESC');
The #count virtual attribute will give you the number of apples on that tree.
To set the date filter:
$cl->setFilterRange('created_at',strtotime('2010-01-01'),strtotime('2011-01-01'));
Also, because you're not using text matching, you can set the ranking mode to none:
$cl->setRankingMode(SPH_RANK_NONE);
To be clear, you just use a blank query:
$res = $cl->Query('',$index);
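For reference, the same search can be written via SphinxQL (Sphinx's MySQL-protocol interface) as a single statement; the index name apples and the literal unix timestamps (2010-01-01 and 2011-01-01, timezone-dependent) are illustrative:
SELECT tree_id, COUNT(*) AS apple_count
FROM apples
WHERE created_at BETWEEN 1262304000 AND 1293840000
GROUP BY tree_id
ORDER BY apple_count DESC;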

Related

How do you optimize 'less than' in WHERE?

Suppose we have a query like below:
select top 1 X from dbo.Tab
where SomeDateTime < @P1 and SomeID = @P2
order by SomeDateTime desc
What indexes or other techniques do you use to effectively optimize these types of queries?
The table may still have a primary key but on a different column like Id:
Id BIGINT IDENTITY PK
X something
SomeDateTime DateTime2
SomeID BIGINT FK to somewhere
Suppose there are millions of records for a given SomeDateTime or SomeID - table is "long".
Let's assume that data is read and written with a similar frequency.
For posterity, an article with pictures on the topic:
https://use-the-index-luke.com/sql/where-clause/searching-for-ranges/greater-less-between-tuning-sql-access-filter-predicates
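As that article explains, a composite index with the equality column first and the range/sort column second lets the engine seek straight to the newest qualifying row; a minimal sketch (the index name and INCLUDE column are illustrative):
-- Equality column first, range/sort column second: TOP 1 ... ORDER BY
-- SomeDateTime DESC becomes a single seek to the end of the matching range.
CREATE NONCLUSTERED INDEX IX_Tab_SomeID_SomeDateTime
    ON dbo.Tab (SomeID, SomeDateTime DESC)
    INCLUDE (X); -- covers the query, avoiding key lookups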

PostgreSQL indexed columns choice

I have these two tables:
CREATE TABLE ref_dates (
    date_id SERIAL PRIMARY KEY,
    month int NOT NULL,
    year int NOT NULL,
    month_name CHAR(255)
);
CREATE TABLE util_kpi (
    kpi_id SERIAL PRIMARY KEY,
    kpi_description int NOT NULL,
    kpi_value float,
    date_id int NOT NULL,
    dInsertion timestamp default CURRENT_TIMESTAMP,
    CONSTRAINT fk_ref_kpi FOREIGN KEY (date_id) REFERENCES ref_dates(date_id)
);
Usually, the type of query I'd run is:
Selecting kpi_description and kpi_value for a specified month and year:
SELECT kpi_description, kpi_value FROM util_kpi u JOIN ref_dates r ON u.date_id = r.date_id WHERE month=X AND year=XXXX
Selecting kpi_description and kpi_value for a specified kpi_description, month and year:
SELECT kpi_description, kpi_value FROM util_kpi u JOIN ref_dates r ON u.date_id = r.date_id WHERE month=X AND year=XXXX AND kpi_description='XXXXXXXXXXX'
I thought about creating these indexes:
CREATE INDEX idx_ref_date_year_month ON ref_dates(year, month);
CREATE INDEX idx_util_kpi_date ON util_kpi(date_id);
First of all, I want to know if it's a good idea to create these indexes.
Second and finally, I was wondering if it's a good idea to add kpi_description to the index on the util_kpi table.
Can you give me your opinion?
Regards
It's not possible to give an exact answer without looking at the data, so this is only an opinion.
A. ref_dates
This table looks very similar to a date dimension in ROLAP schemas.
So the first thing I would do is change date_id from SERIAL to:
a DATE datatype,
or even a "smart integer": an integer in the form YYYYMMDD, e.g. 20210430. It may look strange, but such identifiers are not uncommon in date dimensions.
The main point of using this form is that date_id in fact tables becomes informative even without joining to the date dimension.
B. util_kpi
I suppose that:
ref_dates is a date dimension, so it has ~365 * number-of-years rows. It could be populated once, 20-30 years into the future, and it still would not be really big.
util_kpi is a fact table, which has to be big, as in "really big": millions of records and more.
For util_kpi I expected an id referencing a time dimension but did not find one, so presumably no hourly stats are planned yet.
I see util_kpi.dInsertion, which I suppose is meant to serve as the time dimension. I would consider extracting it into a time_id holding hours, minutes, and seconds (if milliseconds are not needed).
C. Indexing
ref_dates: it does not matter much how you index ref_dates, because it's a relatively small table. Maybe a unique index on date_id with an INCLUDE clause covering all other fields would be best. Don't create individual indexes on low-selectivity fields like year or month: they will not help much, but they will not hurt a lot either.
util_kpi: you need an index on date_id (as for any foreign key to dimension tables, including ones that appear in the future).
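A minimal sketch of that indexing, including the composite variant the question asks about (whether appending kpi_description pays off depends on how selective it is in your data):
-- covering unique index on the small dimension table (INCLUDE needs PostgreSQL 11+)
CREATE UNIQUE INDEX idx_ref_dates_date_id ON ref_dates (date_id) INCLUDE (month, year, month_name);
-- foreign-key index on the fact table; kpi_description appended as the optional second column
CREATE INDEX idx_util_kpi_date_id ON util_kpi (date_id, kpi_description);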
Those are my thoughts, based on what I supposed above.

Order picking in warehouse

In implementing the warehouse management system for an ecommerce store, I'm trying to create a picking list for warehouse workers, who will walk around a warehouse picking products in orders from different shelves.
One type of product can be on different shelves, and on each shelf there can be many of the same type of product.
If there are many of the same product in one order, sometimes the picker has to pick from multiple shelves to get all the items in an order.
To make things trickier still, sometimes a product will run out of stock as well.
My data model looks like this (simplified):
CREATE TABLE order_product (
    id SERIAL PRIMARY KEY,
    product_id integer,
    order_id text
);
INSERT INTO "public"."order_product"("id","product_id","order_id")
VALUES
(1,1,'order1'),
(2,1,'order1'),
(3,1,'order1'),
(4,2,'order1'),
(5,2,'order2'),
(6,2,'order2');
CREATE TABLE warehouse_placement (
    id SERIAL PRIMARY KEY,
    product_id integer,
    shelf text,
    quantity integer
);
INSERT INTO "public"."warehouse_placement"("id","product_id","shelf","quantity")
VALUES
(1,1,E'A',2),
(2,2,E'B',2),
(3,1,E'C',2);
Is it possible, in Postgres, to generate a picking list of instructions like the following:
order_id product_id shelf quantity_left_on_shelf
order1 1 A 1
order1 1 A 0
order1 2 B 1
order1 1 C 1
order2 2 B 0
order2 2 NONE null
I currently do this in the application code, but that feels quite clunky, and somehow I feel like there should be a way to do this directly in SQL.
Thanks for any help!
Here we go:
WITH product_on_shelf AS (
    SELECT warehouse_placement.*,
           generate_series(1, quantity) AS order_on_shelf,
           quantity - generate_series(1, quantity) AS quantity_left_on_shelf
    FROM warehouse_placement
)
, product_on_shelf_with_product_order AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY product_id
               ORDER BY quantity, shelf, order_on_shelf
           ) AS order_among_product
    FROM product_on_shelf
)
, order_product_with_order_among_product AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY product_id
               ORDER BY id
           ) AS order_among_product
    FROM order_product
)
SELECT order_product_with_order_among_product.id,
       order_product_with_order_among_product.order_id,
       order_product_with_order_among_product.product_id,
       product_on_shelf_with_product_order.shelf,
       product_on_shelf_with_product_order.quantity_left_on_shelf
FROM order_product_with_order_among_product
LEFT JOIN product_on_shelf_with_product_order
       ON order_product_with_order_among_product.product_id = product_on_shelf_with_product_order.product_id
      AND order_product_with_order_among_product.order_among_product = product_on_shelf_with_product_order.order_among_product
ORDER BY order_product_with_order_among_product.id
;
Here's the idea:
We create a CTE product_on_shelf, which is the same as warehouse_placement except that each row is duplicated n times, n being the quantity of the product on the shelf.
We assign a number order_among_product to each row in product_on_shelf, so that each object on a shelf knows its order among the same products.
We assign a matching number order_among_product to each row in order_product.
For each row in order_product, we try to find the product on a shelf with the same order_among_product. If we can't find any, it means we've exhausted that product on every shelf.
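For instance, with the sample data above, the first CTE expands warehouse_placement into one row per physical item (row order illustrative):
product_id shelf order_on_shelf quantity_left_on_shelf
1 A 1 1
1 A 2 0
2 B 1 1
2 B 2 0
1 C 1 1
1 C 2 0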
Side note #1: Picking products off shelves is a concurrent action. You should make sure, either on the application side or on the DB side via smart locks, that any product on a shelf can be attributed to one single order. Treating each row of product_order on the application side might be the best option to deal with concurrency.
Side note #2: I've written this query using CTEs for clarity. To boost performance, consider using subqueries instead. Make sure to run EXPLAIN ANALYZE to compare the two.

Creating a many-to-many in PostgreSQL

I have two tables that I need to make a many-to-many relationship with. The one table, which we will call inventory, is populated via a form. The other table, sales, is populated by importing CSVs into the database weekly.
Example tables image
I want to step through the sales table and associate each sale row with a row with the same sku in the inventory table. Here's the kicker: I need to associate only the number of sales rows indicated in the Quantity field of each inventory row.
Example: Example image of linked tables
Now I know I can do this by creating a Perl script that steps through the sales table and creates links using the ItemIDUniqueKey field in a loop based on the Quantity field. What I want to know is: is there a way to do this using SQL commands alone? I've read a lot about many-to-many and I haven't found anyone doing this.
Assuming tables:
create table a (
    item_id integer,
    quantity integer,
    supplier_id text,
    sku text
);
and
create table b (
    sku text,
    sale_number integer,
    item_id integer
);
the following query seems to do what you want: a window sum computes the running total of inventory quantity per sku, and each sale row is assigned the first inventory row whose running total exceeds the count of earlier sales for the same sku:
update b b_updated set item_id = (
    select item_id
    from (
        select *, sum(quantity) over (partition by sku order by item_id) as sum
        from a
    ) a
    where a.sku = b_updated.sku
      and a.sum > (
          select count(1)
          from b b_counted
          where b_counted.sale_number < b_updated.sale_number
            and b_counted.sku = b_updated.sku
      )
    order by a.sum asc
    limit 1
);
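To inspect the intermediate running totals the subquery relies on, the window part can be run on its own:
select item_id, sku, quantity,
       sum(quantity) over (partition by sku order by item_id) as running_total
from a;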

Date Table/Dimension Querying and Indexes

I'm creating a robust date table and want to know the best way to link to it. The primary key clustered index will be on the smart date integer key (per Kimball spec) with a name of DateID. Until now I have been running queries against it like so:
select Foo.orderdate -- a bunch of fields from Foo
,DTE.FiscalYearName
,DTE.FiscalPeriod
,DTE.FiscalYearPeriod
,DTE.FiscalYearWeekName
,DTE.FiscalWeekName
FROM SomeTable Foo
INNER JOIN
DateDatabase.dbo.MyDateTable DTE
ON DTE.date = CAST(FLOOR(CAST(Foo.forderdate AS FLOAT)) AS DATETIME)
Keep in mind that Date is a nonclustered index field with values such as:
2000-01-01 00:00:00.000
It just occurred to me that since I have a clustered integer index (DateID), perhaps I should be converting the datetime in my database field to match it and joining on that field instead.
What do you folks think?
Also, depending on your first answer: if I am typically pulling those fields from the date table, what kind of index can I use to optimize the retrieval of those fields? A covering index?
Even without changing the database structure, you'd get much better performance using a date range join like this:
select Foo.orderdate -- a bunch of fields from Foo
,DTE.FiscalYearName
,DTE.FiscalPeriod
,DTE.FiscalYearPeriod
,DTE.FiscalYearWeekName
,DTE.FiscalWeekName
FROM SomeTable Foo
INNER JOIN
DateDatabase.dbo.MyDateTable DTE
ON Foo.forderdate >= DTE.date AND Foo.forderdate < DATEADD(dd, 1, DTE.date)
However, if you can change it so that your Foo table includes a DateID field then, yes, you'd get the best performance by joining with that instead of any converted date value or date range.
If you change it to join on DateID, and DateID is the first column of the clustered index of MyDateTable, then it's already covering (a clustered index always includes all other fields).
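If adding a DateID column to Foo isn't an option yet, the smart integer key can also be derived on the fly; a sketch assuming the usual Kimball YYYYMMDD form for DateID (style 112 is the ISO yyyymmdd format):
SELECT Foo.orderdate -- a bunch of fields from Foo
,DTE.FiscalYearName
FROM SomeTable Foo
INNER JOIN
DateDatabase.dbo.MyDateTable DTE
ON DTE.DateID = CONVERT(int, CONVERT(char(8), Foo.forderdate, 112))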