Finding a non-existing value in a column in PostgreSQL

I'm working on a DSpace 5.10 repository with PostgreSQL 9.x. The problem is that, when harvested, a lot of items lack metadata required by the regulating entity of my country. Is there a way to list which item IDs don't have a specific field?
For example:
I need a query that returns all the resource_ids that don't have a metadata_field_id = X. The same resource_id has many metadata_field_id entries.
Thanks a lot.

If I'm understanding you properly:
You're looking to return all resource_id that don't have X in the field metadatafield_id.
There are multiple rows per resource_id, and only some of them contain X in their metadata_field_id column.
If so, note that a simple filter is not enough. A query like
SELECT DISTINCT resource_id
FROM your_table_name
WHERE metadata_field_id != X
still returns every resource_id that has at least one other field, even if it also has field X. To return only the resource_ids that have no row at all for field X, use NOT EXISTS:
SELECT DISTINCT t.resource_id
FROM your_table_name t
WHERE NOT EXISTS (
    SELECT 1
    FROM your_table_name x
    WHERE x.resource_id = t.resource_id
      AND x.metadata_field_id = X
);
By using DISTINCT, you remove duplicate rows, so each matching resource_id is returned only once.
Here is the PostgreSQL documentation for DISTINCT.

You need the list of all items that don't have a specific metadata field, so the easiest way is to exclude from the complete list the ones that actually do have it (in DSpace 5 the metadata rows live in the metadatavalue table):
select item_id from item where item_id not in
(
    select resource_id from metadatavalue where
    resource_type_id = 2 and metadata_field_id = ?
);
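One caveat with NOT IN: if the subquery can ever produce a NULL resource_id, the outer query returns no rows at all. A NOT EXISTS sketch of the same idea avoids that pitfall (table and column names as above; treat this as an untested sketch against the assumed DSpace 5 schema):

```sql
-- Items with no metadatavalue row for the given field.
-- resource_type_id = 2 is the DSpace constant for Item.
SELECT i.item_id
FROM item i
WHERE NOT EXISTS (
    SELECT 1
    FROM metadatavalue mv
    WHERE mv.resource_type_id = 2
      AND mv.resource_id = i.item_id
      AND mv.metadata_field_id = ?   -- the field you are checking for
);
```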

Related

SQL restriction for join table with string similarity rule

My DB is built from several tables that are similar to each other and share the same column names. The reason is to compare data from each source.
table_A and table_B: id, product_id, capacitor_name, resistance
It is easy to join the tables by product_id and see the comparison,
but I need to compare by product_id when it exists in both tables; if not, I want to compare by name similarity, and restrict the similarity matches to at most 3 results.
The names are usually not exactly equal, which is why I'm using similarity.
SELECT * FROM table_a ta
JOIN table_b tb
ON
ta.product_id = tb.product_id
OR
similarity(ta.name,tb.name) > 0.8
It works fine, but sometimes I get more data than I need. How can I restrict it (and, moreover, order by similarity so the most similar names come first)?
If you want to benefit from a trigram index, you need to use the operator form (%), not the function form. Then you would order by two "columns": the first puts exact matches first, the second puts the most similar matches after them, in order. Use LIMIT to do the limiting. I've assumed you have some WHERE condition to restrict this to just one row of table_a. If not, then your question is not well formed: to what is this limit supposed to apply? Each what should be limited to just 3?
SELECT * FROM table_a ta
JOIN table_b tb
ON
ta.product_id = tb.product_id
OR
ta.name % tb.name
WHERE ta.id=$1
ORDER BY ta.product_id = tb.product_id desc, similarity(ta.name,tb.name) desc
LIMIT 3
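For the % operator to use an index at all, the pg_trgm extension must be installed and a trigram index created on the compared column. A setup sketch (the index name is an assumption):

```sql
-- One-time setup; requires appropriate privileges.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- A GIN trigram index lets "ta.name % tb.name" be satisfied via the index.
CREATE INDEX table_b_name_trgm_idx
    ON table_b
    USING gin (name gin_trgm_ops);
```

The match threshold used by % is the pg_trgm.similarity_threshold setting (0.3 by default), so it may need adjusting to match the 0.8 cutoff used above.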

Postgres: Optimization of query with simple where clause

I have a table with the following columns:
ID (VARCHAR)
CUSTOMER_ID (VARCHAR)
STATUS (VARCHAR) (4 different status possible)
other not relevant columns
I am trying to find all the rows with a given customer_id and either of two statuses.
The query looks like:
SELECT *
FROM my_table
WHERE customer_id = '12345678' AND status IN ('STATUS1', 'STATUS2');
The table contains about 1 million rows. I added two single-column indexes, on customer_id and on status, but the query still needs about 1 second to run.
The explain plan is:
Gather
  ->  Seq Scan on my_table
        Filter: (((status)::text = ANY ('{SUBMITTED,CANCELLED}'::text[])) AND ((customer_id)::text = '12345678'::text))
I ran 'analyze my_table' after creating the indexes. What could I do to improve the performance of this quite simple query?
You need a compound (multi-column) index to help satisfy your query.
Guessing, it seems like the most selective column (the one with the most distinct values) is customer_id; status probably has only a few distinct values. So customer_id should go first in the index. Try this (PostgreSQL syntax; ALTER TABLE ... ADD INDEX is MySQL):
CREATE INDEX customer_id_status ON my_table (customer_id, status);
This creates a B-tree index. A useful mental model for such an index is an old-fashioned telephone book: it is sorted, so you look up the first matching entry and then scan sequentially for the items you want.
You may also want to try running ANALYZE my_table; to update the statistics (about selectivity) used by the query planner to choose an appropriate index.
Pro tip Avoid SELECT * if you can. Instead name the columns you want. This can help performance a lot.
Pro tip Your question said some of your columns aren't relevant to query optimization. That's probably not true; index design is a weird art. SELECT * makes it less true.
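To confirm the new index is actually used, re-run the query under EXPLAIN (a sketch; your plan output will differ):

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM my_table
WHERE customer_id = '12345678'
  AND status IN ('STATUS1', 'STATUS2');
-- With the compound index in place you would expect an Index Scan
-- (or Bitmap Index Scan) on customer_id_status instead of the
-- Seq Scan shown above.
```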

SQL statement that returns exactly one row with columns

I'm having trouble creating a query for the following task: I want to return exactly one row with the columns region_id, region_name, province_name, province_code, country_name, country_code for any given regionid. The database has 3 tables: "countrylist", "provinces" and "regionlist".
the table countrylist has the following columns : countryid, language code, countryname, countrycode and continentid
provinces : country_code, country_name, province_code, province_name
regionlist: regionid, regiontype.
So I tried writing a query joining the tables, but I'm not sure if I'm doing it correctly.
exactly one row with columns: region_id, region_name, province_name, province_code, country_name, country_code for any given regionid.
I am not 100% aware of the differences between Postgres and MySQL, but I guess you get the idea at the very least.
One way to do it is to select your row with WHERE regionlist.regionid = ? and join the other tables. You can then use LIMIT to restrict the number of rows returned.
Apparently neither provinces nor countrylist has a column in common with regionlist, so I cannot tell where the link between them is. However, once you have one row from regionlist, you should have no trouble joining it with the others (if the links are trivial).
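Assuming, purely for illustration, that regionlist also carries a region_name and a province_code linking it to provinces (the question does not show such columns), the query could look like:

```sql
-- Sketch only: regionlist.region_name and regionlist.province_code
-- are assumed columns, not shown in the question.
SELECT r.regionid      AS region_id,
       r.region_name,
       p.province_name,
       p.province_code,
       p.country_name,
       p.country_code
FROM regionlist r
JOIN provinces p ON p.province_code = r.province_code
WHERE r.regionid = ?
LIMIT 1;
```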

Sphinx seems to force a Order on ID?

I added a new field to my index (weight) which is an integer based value I want to sort on.
I added it to the select and invoked it as sql_attr_uint
When I call it in my query it shows up. However, when I try to sort on it I get strange behavior: it always sorts on the record ID instead, so ordering on weight is identical to ordering on ID.
I've checked the index pretty thoroughly and can't find a reason why. Does Sphinx auto-sort on record ID somehow?
I know the details are fairly sparse yet I'm hoping there is some basic explanation I'm missing before asking anyone to delve in further.
As an update: I don't believe the ID sort has been "imposed" on the index inadvertently, since I can order by a number of other fields, both integer and text, and the results come back independent of the ID values (e.g. sorting on last name, record #100 Adams comes before record #1 Wyatt).
Yet ordering on weight always returns the same order as ordering by ID, whether asc or desc. There is no error about the field or index not existing or not being sortable, and the order request itself isn't ignored (desc and asc work); it just ignores that particular field value and uses the ID instead.
Further Update: The Weight value is indexed via a join to the main table indexed by sphinx in the following manner:
sql_attr_multi = uint value_Weight from ranged-query; \
SELECT j.id AS ID, IF(s.Weight > 0, 1, 0) AS Weight \
FROM Customers j \
INNER JOIN CustomerSources s ON j.customer_id = s.customer_id \
AND j.id BETWEEN $start AND $end \
ORDER BY s.id; \
SELECT MIN(id), MAX(id) FROM Customers
Once indexed, sorting on id and sorting on value_Weight return the same order, even though Weight and ID are unrelated.
Ah yes, from
http://sphinxsearch.com/docs/current/mva.html
Filtering and group-by (but not sorting) on MVA attributes is supported.
You can't sort by an MVA attribute (which, as noted in the comments, makes sense: MVAs usually contain many values, and sorting by many values is rather 'tricky').
When you try, it simply fails, so sorting falls back on the 'natural' order of the index, which is usually by ID.
Use sql_attr_uint instead:
http://sphinxsearch.com/docs/current/conf-sql-attr-uint.html
(but this will probably mean rewriting sql_query to perform the JOIN on CustomerSources)
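A sketch of what that might look like, with the join moved into the main sql_query and the flag declared as a plain integer attribute (names taken from the question; treat this as an untested, hypothetical sphinx.conf fragment):

```conf
# Hypothetical fragment, not a verified configuration.
sql_query = \
    SELECT j.id, j.customer_id, \
           IF(s.Weight > 0, 1, 0) AS value_weight \
    FROM Customers j \
    INNER JOIN CustomerSources s ON j.customer_id = s.customer_id

# Plain integer attribute -- sortable, unlike an MVA.
sql_attr_uint = value_weight
```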

PostgreSQL -must appear in the GROUP BY clause or be used in an aggregate function

I am getting this error in pg production mode, but it works fine in sqlite3 development mode.
ActiveRecord::StatementInvalid in ManagementController#index
PG::Error: ERROR: column "estates.id" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: SELECT "estates".* FROM "estates" WHERE "estates"."Mgmt" = ...
^
: SELECT "estates".* FROM "estates" WHERE "estates"."Mgmt" = 'Mazzey' GROUP BY user_id
@myestate = Estate.where(:Mgmt => current_user.Company).group(:user_id).all
If user_id is the PRIMARY KEY, then you need to upgrade PostgreSQL; newer versions (9.1 and later) correctly handle grouping by the primary key.
If user_id is neither unique nor the primary key for the 'estates' relation in question, then this query doesn't make much sense, since PostgreSQL has no way to know which value to return for each column of estates where multiple rows share the same user_id. You must use an aggregate function that expresses what you want, like min, max, avg, string_agg, array_agg, etc or add the column(s) of interest to the GROUP BY.
Alternately you can rephrase the query to use DISTINCT ON and an ORDER BY if you really do want to pick a somewhat arbitrary row, though I really doubt it's possible to express that via ActiveRecord.
Some databases - including SQLite and MySQL - will just pick an arbitrary row. This is considered incorrect and unsafe by the PostgreSQL team, so PostgreSQL follows the SQL standard and considers such queries to be errors.
If you have:
col1 col2
fred 42
bob 9
fred 44
fred 99
and you do:
SELECT col1, col2 FROM mytable GROUP BY col1;
then it's obvious that you should get the row:
bob 9
but what about the result for fred? There is no single correct answer to pick, so the database will refuse to execute such unsafe queries. If you wanted the greatest col2 for any col1 you'd use the max aggregate:
SELECT col1, max(col2) AS max_col2 FROM mytable GROUP BY col1;
I recently moved from MySQL to PostgreSQL and encountered the same issue. Just for reference, the best approach I've found is to use DISTINCT ON as suggested in this SO answer:
Elegant PostgreSQL Group by for Ruby on Rails / ActiveRecord
This will let you get one record for each unique value in your chosen column that matches the other query conditions:
MyModel.where(:some_col => value).select("DISTINCT ON (unique_col) *")
I prefer DISTINCT ON because I can still get all the other column values in the row. DISTINCT alone will only return the value of that specific column.
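In raw SQL, DISTINCT ON also needs a matching ORDER BY, starting with the DISTINCT ON expression, to control which row is kept. Table and column names below are placeholders:

```sql
-- Keeps one row per unique_col; the rest of the ORDER BY decides
-- WHICH row survives (here: the largest id, as an example tie-breaker).
SELECT DISTINCT ON (unique_col) *
FROM my_models
WHERE some_col = 'value'
ORDER BY unique_col, id DESC;
```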
After often receiving this error myself, I realised that Rails (I am using Rails 4) automatically adds an 'order by id' at the end of your grouping query, which often causes the error above. So make sure you append your own .order(:group_by_column) at the end of your Rails query. You will then have something like this:
@problems = Problem.select('problems.username, sum(problems.weight) as weight_sum').group('problems.username').order('problems.username')
@myestate1 = Estate.where(:Mgmt => current_user.Company)
@myestate = @myestate1.select("DISTINCT(user_id)")
This is what I did.