Group with levenshtein distance - postgresql

I have PostgreSQL 9.2.
My task is to find similar names in a table (limited by some Levenshtein distance).
For example, with a distance of 3, the table has this data:
| name          |
|---------------|
| Marcus Miller |
| Marcos Miller |
| Macus Miler   |
| David Bowie   |
| Dave Grohl    |
| Dav Grol      |
| ...           |
The result I want to get is like this:
| Marcus Miller, Marcos Miller, Macus Miler |
| Dave Grohl, Dav Grol |
| ... |
Or
| Marcus Miller, Marcos Miller |
| Marcus Miller, Macus Miler |
| Dave Grohl, Dav Grol |
| ... |
I tried this:
SELECT a.name, b.name
FROM my_table a
JOIN my_table b ON b.id < a.id AND levenshtein(b.name, a.name) < 3;
But it is too slow on my data.

There is a significant conceptual error in your question: GROUP BY takes an equivalence relation (in the mathematical sense) as an argument and uses it to partition a SQL relation into equivalence classes.
The problem is that the relation you describe, namely, "are two strings within a certain edit distance of each other", is not an equivalence relation. It's symmetric and reflexive, but not transitive. To illustrate, what should the answer be if I added a series of names to your dataset that morphed "Marcus Miller" into "Dave Grohl", with each name in the series being within that edit distance from the previous?
However, there are algorithms that partition a data set using relations that aren't equivalence relations, such as geometric distance. K-means clustering is one of the best-known examples. Perhaps there is a way to adapt k-means or something similar to this problem; I don't know.
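As for the raw speed of the pairwise query itself, one common technique (a sketch only, separate from the grouping discussion above) is to prefilter candidate pairs with the pg_trgm similarity operator, which can use a trigram index, and run the exact levenshtein() check only on the survivors:

CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;  -- provides levenshtein()
CREATE EXTENSION IF NOT EXISTS pg_trgm;        -- provides % and trigram index support

CREATE INDEX my_table_name_trgm_idx ON my_table USING gist (name gist_trgm_ops);

SELECT a.name, b.name
FROM   my_table a
JOIN   my_table b ON b.id < a.id
                 AND b.name % a.name                   -- cheap, index-assisted prefilter
                 AND levenshtein(b.name, a.name) < 3;  -- exact check on the survivors

Keep in mind that % filters on trigram similarity (tunable with set_limit()), so it is an approximation: pairs of very short or very differently spelled names that are still within edit distance 3 can be dropped before the exact check.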

Related

How to groupBy and perform data scaling over each and every group using MlLib Pyspark?

I have a dataset just like the example below and I am trying to group all rows by symbol and perform standard scaling of each group, so that in the end all my data is scaled by group. How can I do that with MLlib and PySpark? I could not find a single solution on the internet for it. Can anyone help here?
+------+------------------+------------------+------------------+------------------+
|symbol| open| high| low| close|
+------+------------------+------------------+------------------+------------------+
| AVT| 4.115| 4.115| 4.0736| 4.0736|
| ZEC| 365.6924715181936| 371.9164684545918| 364.8854025324053| 369.5950712239761|
| ETH| 647.220769018717| 654.6370842160561| 644.8942258095359| 652.1231757197687|
| XRP|0.3856343600456335|0.4042970302356221|0.3662228285447956|0.4016658006619401|
| XMR|304.97650674864144|304.98649644294267|299.96970554155274| 303.8663243145598|
| LTC|321.32437862304715| 335.1872636382617| 320.9704201234651| 334.5057757774086|
| EOS| 5.1171| 5.1548| 5.1075| 5.116|
| BCH| 1526.839255299505| 1588.106037653013|1526.8392543926366|1554.8447136830328|
| DASH| 878.00000003| 884.03769206| 869.22000004| 869.22000004|
| BTC|17042.224796462127| 17278.87984139109|16898.509289685637|17134.611038665582|
| REP| 32.50162799| 32.501628| 32.41062673| 32.50162799|
| DASH| 858.98413357| 863.01413927| 851.07145059| 851.17051529|
| ETH| 633.1390884474979| 650.546984589714| 631.2674221381849| 641.4566047907362|
| XRP|0.3912300406160967|0.3915937383961073|0.3480682353334925|0.3488616679337076|
| EOS| 5.11| 5.1675| 5.0995| 5.1674|
| BCH|1574.9602789966184|1588.6004569127992| 1515.3| 1521.0|
| BTC| 17238.0199449088| 17324.83886467445|16968.291408828714| 16971.12960974206|
| LTC| 303.3999614441217| 317.6966006615225|302.40702519057584| 310.971265429805|
| REP| 32.50162798| 32.50162798| 32.345677| 32.345677|
| XMR| 304.1618444641083| 306.2720324372592|295.38042671416935| 295.520097663825|
+------+------------------+------------------+------------------+------------------+
I suggest you import the following:
import pyspark.sql.functions as f
then you can do it like this (not fully tested code):
stats_df = df.groupBy('symbol').agg(
    f.mean('open').alias('open_mean'),
    f.stddev('open').alias('open_stddev'))
This is the principle of how you would do it (you could instead use the min and max functions for min-max scaling); then you just have to join stats_df back to the original rows and apply the standard scaling formula:
x' = (x - μ) / σ
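For comparison, the same per-group scaling can be written as a single Spark SQL query with window functions; this sketch assumes the DataFrame has been registered as a temporary view named prices (a made-up name):

-- Standard scaling per symbol via window functions (Spark SQL sketch).
-- Assumes df.createOrReplaceTempView('prices') was called first.
SELECT symbol,
       (open  - AVG(open)  OVER (PARTITION BY symbol))
              / STDDEV(open)  OVER (PARTITION BY symbol) AS open_scaled,
       (close - AVG(close) OVER (PARTITION BY symbol))
              / STDDEV(close) OVER (PARTITION BY symbol) AS close_scaled
FROM prices;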

How to get back aggregate values across 2 dimensions using Python Cubes?

Situation
Using Python 3, Django 1.9, Cubes 1.1, and Postgres 9.5.
These are my data tables in text format:
Store table
------------------------------
| id | code | address |
|-----|------|---------------|
| 1 | S1 | Kings Row |
| 2 | S2 | Queens Street |
| 3 | S3 | Jacks Place |
| 4 | S4 | Diamonds Alley|
| 5 | S5 | Hearts Road |
------------------------------
Product table
------------------------------
| id | code | name |
|-----|------|---------------|
| 1 | P1 | Saucer 12 |
| 2 | P2 | Plate 15 |
| 3 | P3 | Saucer 13 |
| 4 | P4 | Saucer 14 |
| 5 | P5 | Plate 16 |
| and many more .... |
|1000 |P1000 | Bowl 25 |
|----------------------------|
Sales table
----------------------------------------
| id | product_id | store_id | amount |
|-----|------------|----------|--------|
| 1 | 1 | 1 |7.05 |
| 2 | 1 | 2 |9.00 |
| 3 | 2 | 3 |1.00 |
| 4 | 2 | 3 |1.00 |
| 5 | 2 | 5 |1.00 |
| and many more .... |
| 1000| 20 | 4 |1.00 |
|--------------------------------------|
The relationships are:
Sales belongs to Store
Sales belongs to Product
Store has many Sales
Product has many Sales
What I want to achieve
I want to use Cubes to display the data, paginated, in the following manner:
Given the stores S1-S3:
-------------------------
| product | S1 | S2 | S3 |
|---------|----|----|----|
|Saucer 12|7.05|9 | 0 |
|Plate 15 |0 |0 | 2 |
| and many more .... |
|------------------------|
Note the following:
Even though there were no records in sales for Saucer 12 under store S3, I displayed 0 instead of null or none.
I want to be able to sort by store, say in descending order for S3.
The cells contain the SUM of the amounts spent on that particular product in that particular store.
I also want to have pagination.
What I tried
This is the configuration I used:
"cubes": [
{
"name": "sales",
"dimensions": ["product", "store"],
"joins": [
{"master":"product_id", "detail":"product.id"},
{"master":"store_id", "detail":"store.id"}
]
}
],
"dimensions": [
{ "name": "product", "attributes": ["code", "name"] },
{ "name": "store", "attributes": ["code", "address"] }
]
This is the code I used:
result = browser.aggregate(drilldown=['Store', 'Product'],
                           order=[("Product.name", "asc"),
                                  ("Store.name", "desc"),
                                  ("total_products_sale", "desc")])
I didn't get what I wanted.
I got this instead:
----------------------------------------------
| product_id | store_id | total_products_sale |
|------------|----------|---------------------|
| 1 | 1 | 7.05 |
| 1 | 2 | 9 |
| 2 | 3 | 2.00 |
| and many more .... |
|---------------------------------------------|
which is the whole table with no pagination, and products not sold in a store don't show up as zero.
My question
How do I get what I want?
Do I need to create another data table that aggregates everything by store and product before I use cubes to run the query?
Update
I have read more and realised that what I want is called dicing, as I need to go across 2 dimensions. See: https://en.wikipedia.org/wiki/OLAP_cube#Operations
Cross-posted at Cubes GitHub issues to get more attention.
This is a pure SQL solution using crosstab() from the additional tablefunc module to pivot the aggregated data. It typically performs better than any client-side alternative. If you are not familiar with crosstab(), read this first:
PostgreSQL Crosstab Query
And this about the "extra" column in the crosstab() output:
Pivot on Multiple Columns using Tablefunc
SELECT product_id, product
     , COALESCE(s1, 0) AS s1  -- 1. ... displayed 0 instead of null
     , COALESCE(s2, 0) AS s2
     , COALESCE(s3, 0) AS s3
     , COALESCE(s4, 0) AS s4
     , COALESCE(s5, 0) AS s5
FROM   crosstab(
   'SELECT s.product_id, p.name, s.store_id, s.sum_amount
    FROM   product p
    JOIN  (
       SELECT product_id, store_id
            , sum(amount) AS sum_amount  -- 3. SUM total of product spent in store
       FROM   sales
       GROUP  BY product_id, store_id
       ) s ON p.id = s.product_id
    ORDER  BY s.product_id, s.store_id;'
 , 'VALUES (1),(2),(3),(4),(5)'  -- desired store_id's
   ) AS ct (product_id int, product text  -- "extra" column
          , s1 numeric, s2 numeric, s3 numeric, s4 numeric, s5 numeric)
ORDER  BY s3 DESC;  -- 2. ... descending order for S3
Produces your desired result exactly (plus product_id).
To include products that have never been sold replace [INNER] JOIN with LEFT [OUTER] JOIN.
SQL Fiddle with base query.
The tablefunc module is not installed on sqlfiddle.
Major points
Read the basic explanation in the reference answer for crosstab().
I am including product_id because product.name is hardly unique. This might otherwise lead to sneaky errors conflating two different products.
You don't need the store table in the query if referential integrity is guaranteed.
ORDER BY s3 DESC works, because s3 references the output column where NULL values have been replaced with COALESCE. Else we would need DESC NULLS LAST to sort NULL values last:
PostgreSQL sort by datetime asc, null first?
For building crosstab() queries dynamically consider:
Dynamic alternative to pivot with CASE and GROUP BY
I also want to have pagination.
That last item is fuzzy. Simple pagination can be had with LIMIT and OFFSET:
Displaying data in grid view page by page
I would consider a MATERIALIZED VIEW to materialize results before pagination. If you have a stable page size I would add page numbers to the MV for easy and fast results.
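A minimal sketch of that idea, assuming the crosstab() query above has been wrapped in a (made-up) view v_sales_pivot:

CREATE MATERIALIZED VIEW mv_sales_pivot AS
SELECT * FROM v_sales_pivot;

SELECT *
FROM   mv_sales_pivot
ORDER  BY s3 DESC      -- 2. ... descending order for S3
LIMIT  20 OFFSET 40;   -- page 3 at a page size of 20

REFRESH MATERIALIZED VIEW mv_sales_pivot;  -- rerun after the underlying sales data changes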
To optimize performance for big result sets, consider:
SQL syntax term for 'WHERE (col1, col2) < (val1, val2)'
Optimize query with OFFSET on large table

How to decompose the relation into 5NF?

The example provided for Fifth Normal Form has the relation ACP(Agent, Company, Product) with the following data:
-----------------------------
| AGENT | COMPANY | PRODUCT |
|-------+---------+---------|
| Smith | Ford | car |
| Smith | Ford | truck |
| Smith | GM | car |
| Smith | GM | bus |
| Jones | Ford | car |
-----------------------------
The rule applied is: if an agent sells a certain product, and he represents a company making that product, then he sells that product for that company.
The relation is decomposed into 3 relations according to constraint 3D: AC(agent, company), AP(agent, product) and CP(company, product). Hence the join dependency is *{(agent, company), (agent, product), (company, product)}. According to the definition of 5NF, a table R is in fifth normal form (5NF) or project-join normal form (PJ/NF) if and only if every join dependency in R is implied by the keys of R. But none of the projections contains a key, since the key is {agent, company, product} itself.
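To make that concrete, here is an illustrative SQL sketch (the table names are mine) of the three projections and of the join whose result must equal ACP for the join dependency to hold:

-- Illustrative only: the 3D decomposition of ACP and its reconstruction.
CREATE TABLE ac AS SELECT DISTINCT agent, company FROM acp;   -- AC(agent, company)
CREATE TABLE ap AS SELECT DISTINCT agent, product FROM acp;   -- AP(agent, product)
CREATE TABLE cp AS SELECT DISTINCT company, product FROM acp; -- CP(company, product)

-- The join dependency *{AC, AP, CP} holds iff this join returns exactly the
-- rows of the original ACP relation (no spurious tuples):
SELECT ac.agent, ac.company, ap.product
FROM   ac
JOIN   ap ON ap.agent   = ac.agent
JOIN   cp ON cp.company = ac.company
         AND cp.product = ap.product;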
Another example, provided in the book by C.J. Date, contains a similar relation shipments(supplier_number, part_number, project_number), and the relation is decomposed similarly, citing constraint 3D. However, the definition of 5NF doesn't say anything about constraint 3D.
So, I have a few questions on the scenario presented above:
What is constraint 3D?
What if the relation has one more attribute "region" making ACPR(agent, company, product, region)?

Sane way to store different data types within same column in postgres?

I'm currently attempting to modify an existing API that interacts with a Postgres database. Long story short, it essentially stores descriptors/metadata to determine where an actual 'asset' (typically a file of some sort) is stored on the server's hard disk.
Currently, it's possible to 'tag' these 'assets' with any number of undefined key-value pairs (e.g. uploadedBy, addedOn, assetType, etc.). These tags are stored in a separate table with a structure similar to the following:
+---------------+----------------+-------------+
|assetid (text) | tagid(integer) | value(text) |
|---------------+----------------+-------------|
|someStringValue| 1234 | someValue |
|---------------+----------------+-------------|
|aDiffStringKey | 1235 | a username |
|---------------+----------------+-------------|
|aDiffStrKey | 1236 | Nov 5, 1605 |
+---------------+----------------+-------------+
assetid and tagid are foreign keys from other tables. Think of the assetid as representing a file, and the tagid/value pair as a map of descriptors.
Right now, the API (which is in Java) creates all these key-value pairs as a Map object. This includes things like timestamps/dates. What we'd like to do is somehow be able to store different types of data for the value in the key-value pair. Or at least store it differently within the database, so that if we needed to, we could run queries checking date ranges and the like on these tags. However, if they're stored as text items in the db, then we'd have to a) know that this is actually a date/time/timestamp item, and b) convert it into something that we could actually run such a query on.
There is only one idea I could think of thus far without changing the layout of the db too much:
It is to expand the assettag table (shown above) with additional columns for the various types (numeric, text, timestamp), allow them to be null, and then on insert, check the corresponding 'key' to figure out what type of data it really is. However, I can see a lot of problems with that sort of implementation.
Can any PostgreSQL-Ninjas out there offer a suggestion on how to approach this problem? I'm only recently getting thrown back into the deep-end of database interactions, so I admit I'm a bit rusty.
You've basically got two choices:
Option 1: A sparse table
Have one column for each data type, but only use the column that matches the data type you want to store. Of course this leads to most columns being null - a waste of space, but the purists like it because of the strong typing. It's a bit clunky having to check each column for null to figure out which datatype applies. Also, too bad if you actually want to store a null - then you must choose a specific value that "means null" - more clunkiness.
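For illustration, option 1 applied to the assettag layout could look like this (a sketch; the extra column names are made up):

-- One nullable column per data type; at most one of them is set per row.
CREATE TABLE assettag_sparse (
    assetid         text    NOT NULL,
    tagid           integer NOT NULL,
    value_text      text,
    value_numeric   numeric,
    value_timestamp timestamptz,
    CHECK ( (value_text      IS NOT NULL)::int
          + (value_numeric   IS NOT NULL)::int
          + (value_timestamp IS NOT NULL)::int <= 1 )
);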
Option 2: Two columns - one for content, one for type
Everything can be expressed as text, so have a text column for the value, and another column (int or text) for the type, so your app code can restore the correct value in the correct type of object. The good things are that you don't have lots of nulls and, importantly, you can easily extend the types beyond SQL data types to application classes by storing their value as JSON and their type as the class name.
I have used option 2 several times in my career and it was always very successful.
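For illustration, a minimal sketch of option 2 (again with made-up names):

-- Value serialized as text plus a type discriminator column.
CREATE TABLE assettag_typed (
    assetid    text    NOT NULL,
    tagid      integer NOT NULL,
    value      text,
    value_type text    NOT NULL  -- e.g. 'text', 'numeric', 'timestamp', or a class name
);

-- A date-range check casts only rows of the matching type; the CASE guards the
-- cast so it is never applied to non-timestamp values:
SELECT assetid
FROM   assettag_typed
WHERE  value_type = 'timestamp'
AND    (CASE WHEN value_type = 'timestamp' THEN value::timestamp END)
         BETWEEN '2018-01-01' AND '2018-12-31';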
Another option, depending on what you're doing, could be to just have one value column but store some JSON around the value...
This could look something like:
{
    "type": "datetime",
    "value": "2019-05-31 13:51:36"
}
That could even go a step further, using a JSON or XML column.
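For example, with a native jsonb column (PostgreSQL 9.4+; table and column names are made up) that could look like:

-- Type and value stored together in one jsonb column.
CREATE TABLE assettag_json (
    assetid text    NOT NULL,
    tagid   integer NOT NULL,
    value   jsonb
);

INSERT INTO assettag_json
VALUES ('aDiffStrKey', 1236, '{"type": "datetime", "value": "2019-05-31 13:51:36"}');

-- ->> returns text; the application (or a guarded cast) converts it based on "type":
SELECT assetid, value->>'type' AS type, value->>'value' AS raw_value
FROM   assettag_json
WHERE  value->>'type' = 'datetime';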
I'm not in any way a PostgreSQL ninja, but I think that instead of two columns (one for the content and one for the type) you could look at the hstore data type:
data type for storing sets of key/value pairs within a single
PostgreSQL value. This can be useful in various scenarios, such as
rows with many attributes that are rarely examined, or semi-structured
data. Keys and values are simply text strings.
Of course, you have to check how dates/timestamps convert into and out of this type and see if it works for you.
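A small sketch of what that could look like (hstore ships as a contrib extension; the names here are made up):

CREATE EXTENSION IF NOT EXISTS hstore;

CREATE TABLE asset_tags (
    assetid text NOT NULL,
    tags    hstore  -- key/value pairs, both stored as text
);

INSERT INTO asset_tags
VALUES ('someStringValue', 'uploadedBy => "a username", addedOn => "2018-11-05"');

-- hstore values come back as text, so dates need an explicit cast:
SELECT assetid
FROM   asset_tags
WHERE  (tags -> 'addedOn')::date >= '2018-01-01';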
You can use 2 different techniques:
If the type can vary for a given tagid
Define a main table with an ID for every tagid-assetid combination, plus the actual data tables:
maintable:
+---------------+----------------+-----------------+---------------+
|assetid (text) | tagid(integer) | tablename(text) | table_id(int) |
|---------------+----------------+-----------------+---------------|
|someStringValue| 1234 | tablebool | 123 |
|---------------+----------------+-----------------+---------------|
|aDiffStringKey | 1235 | tablefloat | 123 |
|---------------+----------------+-----------------+---------------|
|aDiffStrKey | 1236 | tablestring | 123 |
+---------------+----------------+-----------------+---------------+
tablebool
+-------------+-------------+
| id(integer) | value(bool) |
|-------------+-------------|
| 123 | False |
+-------------+-------------+
tablefloat
+-------------+--------------+
| id(integer) | value(float) |
|-------------+--------------|
| 123 | 12.345 |
+-------------+--------------+
tablestring
+-------------+---------------+
| id(integer) | value(string) |
|-------------+---------------|
| 123 | 'text' |
+-------------+---------------+
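A rough DDL sketch of this first variant (illustrative names and types only):

-- maintable records which per-type table holds the value and under which id;
-- the lookup is dispatched in application code based on tablename.
CREATE TABLE tablebool   (id integer PRIMARY KEY, value boolean);
CREATE TABLE tablefloat  (id integer PRIMARY KEY, value double precision);
CREATE TABLE tablestring (id integer PRIMARY KEY, value text);

CREATE TABLE maintable (
    assetid   text    NOT NULL,
    tagid     integer NOT NULL,
    tablename text    NOT NULL,  -- 'tablebool' | 'tablefloat' | 'tablestring'
    table_id  integer NOT NULL   -- id in the table named by tablename
);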
In case every tagid has a fixed type
create a tagid description table
tag descriptors
+---------------+----------------+-----------------+
|assetid (text) | tagid(integer) | tablename(text) |
|---------------+----------------+-----------------|
|someStringValue| 1234 | tablebool |
|---------------+----------------+-----------------|
|aDiffStringKey | 1235 | tablefloat |
|---------------+----------------+-----------------|
|aDiffStrKey | 1236 | tablestring |
+---------------+----------------+-----------------+
and corresponding data tables
tablebool
+-------------+----------------+-------------+
| id(integer) | tagid(integer) | value(bool) |
|-------------+----------------+-------------|
| 123 | 1234 | False |
+-------------+----------------+-------------+
tablefloat
+-------------+----------------+--------------+
| id(integer) | tagid(integer) | value(float) |
|-------------+----------------+--------------|
| 123 | 1235 | 12.345 |
+-------------+----------------+--------------+
tablestring
+-------------+----------------+---------------+
| id(integer) | tagid(integer) | value(string) |
|-------------+----------------+---------------|
| 123 | 1236 | 'text' |
+-------------+----------------+---------------+
All this is just the general idea. You should adapt it to your needs.

Query join result appears to be incorrect

I have no idea what's going on here. Maybe I've been staring at this code for too long.
The query I have is as follows:
CREATE VIEW v_sku_best_before AS
SELECT
sw.sku_id,
sw.sku_warehouse_id "A",
sbb.sku_warehouse_id "B",
sbb.best_before,
sbb.quantity
FROM SKU_WAREHOUSE sw
LEFT OUTER JOIN SKU_BEST_BEFORE sbb
ON sbb.sku_warehouse_id = sw.warehouse_id
ORDER BY sbb.best_before
I can post the table definitions if that helps, but I'm not sure it will. Suffice it to say that SKU_WAREHOUSE.sku_warehouse_id is an identity column, and SKU_BEST_BEFORE.sku_warehouse_id is a child column that uses that identity as a foreign key.
Here's the result when I run the query:
+--------+-----+----+-------------+----------+
| sku_id | A | B | best_before | quantity |
+--------+-----+----+-------------+----------+
| 20251 | 643 | 11 | <<null>> | 140 |
+--------+-----+----+-------------+----------+
(1 row)
The join specifies that the sku_warehouse_id columns have to be equal, but when I pull the ID from each table (labelled as A and B) they're different.
What am I doing wrong?
Perhaps just sw.sku_warehouse_id instead of sw.warehouse_id?
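That is, the view would read (the query from the question with only the ON clause changed):

CREATE VIEW v_sku_best_before AS
SELECT
    sw.sku_id,
    sw.sku_warehouse_id "A",
    sbb.sku_warehouse_id "B",
    sbb.best_before,
    sbb.quantity
FROM SKU_WAREHOUSE sw
LEFT OUTER JOIN SKU_BEST_BEFORE sbb
    ON sbb.sku_warehouse_id = sw.sku_warehouse_id  -- was: sw.warehouse_id
ORDER BY sbb.best_before;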