I have a Postgres 9.1 table called ngram_sightings. Each row is a record of seeing an ngram in a document. An ngram can appear multiple times in a given document.
CREATE TABLE ngram_sightings
(
ngram VARCHAR,
doc_id INTEGER
);
I want to summarize this table in another table called ngram_counts.
CREATE TABLE ngram_counts
(
ngram VARCHAR PRIMARY KEY,
-- the number of unique doc_ids for a given ngram
doc_count INTEGER,
-- the count of a given ngram in ngram_sightings
corpus_count INTEGER
);
What is the best way to do this?
ngram_sightings is ~1 billion rows.
Should I create an index on ngram_sightings.ngram first?
Give this a shot!
INSERT INTO ngram_counts (ngram, doc_count, corpus_count)
SELECT
ngram
, count(distinct doc_id) AS doc_count
, count(*) AS corpus_count
FROM ngram_sightings
GROUP BY ngram;
-- EDIT --
Here is a longer version using some temporary tables. First, count how many documents each ngram is associated with. I'm using 'tf' for "term frequency" and 'df' for "doc frequency", since you are heading in the direction of tf-idf vectorization and you may as well use the standard language; it will help with the next few steps.
CREATE TEMPORARY TABLE ngram_df AS
SELECT
ngram
, count(distinct doc_id) AS df
FROM ngram_sightings
GROUP BY ngram;
Now you can create a table for the total count of each ngram.
CREATE TEMPORARY TABLE ngram_tf AS
SELECT
ngram
, count(*) AS tf
FROM ngram_sightings
GROUP BY ngram;
Then join the two on ngram.
CREATE TABLE ngram_tfidf AS
SELECT
tf.ngram
, tf.tf
, df.df
FROM ngram_tf tf
INNER JOIN ngram_df df ON tf.ngram = df.ngram;
At this point, I expect you will be looking up ngram quite a bit, so it makes sense to index the last table on ngram (see the sketch below). Keep me posted!
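A minimal sketch of that index (the name is just illustrative):
CREATE INDEX ngram_tfidf_ngram_idx ON ngram_tfidf (ngram);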
Related
I am trying to insert the same row into my table x number of times in PostgreSQL. Is there a way of doing that without manually entering the same values x times? I am looking for the equivalent of SQL Server's GO [count] batch separator in Postgres... if that exists.
Use the function generate_series(), e.g.:
insert into my_table
select id, 'alfa', 'beta'
from generate_series(1,4) as id;
Test it in db<>fiddle.
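Applied directly to the original question, a minimal sketch for inserting one fixed record x times (the table and column names here are assumptions):
insert into my_table (col_a, col_b)
select 'alfa', 'beta'
from generate_series(1, 4); -- 4 is the value `x`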
Idea
Produce a resultset of a given size and cross join it with the record that you want to insert x times. What would still be missing is the generation of proper PK values. A specific suggestion would require more details on the data model.
Query
The sample query below presupposes that your PK values are autogenerated.
CREATE TABLE test ( id SERIAL, a VARCHAR(10), b VARCHAR(10) );
INSERT INTO test (a, b)
WITH RECURSIVE Numbers(i) AS (
SELECT 1
UNION ALL
SELECT i + 1
FROM Numbers
WHERE i < 5 -- This is the value `x`
)
SELECT adhoc.*
FROM Numbers n
CROSS JOIN ( -- This is the single record to be inserted multiple times
SELECT 'value_a' a
, 'value_b' b
) adhoc
;
See it in action in this db fiddle.
Note / Reference
The solution is adapted from here with minor modifications (there are a host of other solutions to generate x consecutive numbers with SQL hierarchical / recursive queries, so the choice of reference is somewhat arbitrary).
I am trying to insert multiple records obtained from a join into another table, user_to_property. In the user_to_property table, user_to_property_id is a primary key, not null, and not auto-incrementing. So I am trying to add user_to_property_id manually, incrementing by 1.
WITH selectedData AS
( -- selection of the data that needs to be inserted
SELECT t2.user_id as userId
FROM property_lines t1
INNER JOIN user t2 ON t1.account_id = t2.account_id
)
INSERT INTO user_to_property (user_to_property_id, user_id, property_id, created_date)
VALUES ((SELECT MAX( user_to_property_id )+1 FROM user_to_property),(SELECT
selectedData.userId
FROM selectedData),3,now());
The above query gives me the below error:
ERROR: more than one row returned by a subquery used as an expression
How do I insert multiple records into a table from a join of other tables? Since the user_to_property table should contain a unique record per user_id and property_id combination, there should be only 1 record for each.
Typically for INSERT you use either VALUES or SELECT. The structure VALUES( SELECT ... ) often (generally?) just causes more trouble than it's worth, and it is never necessary: you can always select a constant or an expression. In this case, convert to just SELECT. For generating your ID, get the max value from your table and then add the row_number of the row you are inserting (see demo):
insert into user_to_property(user_to_property_id
, user_id
, property_id
, created_date
)
with start_with(current_max_id) as
( select max(user_to_property_id) from user_to_property )
select current_max_id + id_incr, user_id, 3, now()
from (
select t2.user_id, row_number() over() id_incr
from property_lines t1
join users t2 on t1.account_id = t2.account_id
) js
join start_with on true;
A couple of notes:
DO NOT use user as a table name, or for any other object name. It is a
documented reserved word in both Postgres and the SQL standard (and has
been since Postgres v7.1 and the SQL-92 standard, at least).
You really should create another column, or change the column
user_to_property_id to be auto-generated. Using max()+1, or
anything based on that idea, is a virtual guarantee that you will generate
duplicate keys, much to the amusement of users and developers alike.
Consider what happens under MVCC when 2 users run the query concurrently.
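A minimal sketch of that fix, assuming PostgreSQL 10+ for identity columns (the setval call assumes the table is non-empty):
-- Convert the column to an auto-generated identity column (PostgreSQL 10+):
ALTER TABLE user_to_property
    ALTER COLUMN user_to_property_id ADD GENERATED BY DEFAULT AS IDENTITY;
-- Align the backing sequence with the data already in the table:
SELECT setval(pg_get_serial_sequence('user_to_property', 'user_to_property_id'),
              (SELECT max(user_to_property_id) FROM user_to_property));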
I have two tables that I need to make a many to many relationship with. The one table we will call inventory is populated via a form. The other table sales is populated by importing CSVs in to the database weekly.
Example tables image
I want to step through the sales table and associate each sale row with a row with the same sku in the inventory table. Here's the kicker: I need to associate only the number of sales rows indicated in the Quantity field of each inventory row.
Example: image of linked tables
Now I know I can do this by creating a Perl script that steps through the sales table and creates links using the ItemIDUniqueKey field in a loop based on the Quantity field. What I want to know is: is there a way to do this using SQL commands alone? I've read a lot about many-to-many and I've not found anyone doing this.
Assuming tables:
create table a(
item_id integer,
quantity integer,
supplier_id text,
sku text
);
and
create table b(
sku text,
sale_number integer,
item_id integer
);
the following query seems to do what you want:
update b b_updated set item_id = (
  select item_id
  from (
    -- running total of quantity per sku, ordered by item_id
    select *, sum(quantity) over (partition by sku order by item_id) as sum
    from a
  ) a
  where
    a.sku = b_updated.sku and
    -- pick the first inventory row whose running total exceeds the number
    -- of sale rows (for this sku) that precede the row being updated
    a.sum > (
      select count(1)
      from b b_counted
      where
        b_counted.sale_number < b_updated.sale_number and
        b_counted.sku = b_updated.sku
    )
  order by a.sum asc
  limit 1
);
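As a quick sanity check (just a sketch), you can compare how many sale rows were assigned to each inventory row against its quantity; each assigned count should be at most the quantity:
select a.item_id, a.quantity, count(b.item_id) as assigned
from a
left join b on b.item_id = a.item_id
group by a.item_id, a.quantity
order by a.item_id;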
I am new to PostgreSQL and I used the following query to retrieve all the fields from the database.
SELECT student.*,row_number() OVER () as rnum FROM student;
I don't know how to delete a particular row by row number. Please give me some idea.
This is my table:
Column | Type
------------+------------------
name | text
rollno | integer
cgpa | double precision
department | text
branch | text
with a as
(
    -- include ctid explicitly: student.* alone does not expose the system
    -- column, so the outer query could not reference it otherwise
    SELECT student.ctid, student.*, row_number() OVER () as rnum FROM student
)
delete from student where ctid in (select ctid from a where rnum = 1); -- rnum is the row_number you want to delete
Quoted from PostgreSQL - System Columns
ctid :
The physical location of the row version within its table. Note
that although the ctid can be used to locate the row version very
quickly, a row's ctid will change each time it is updated or moved by
VACUUM FULL. Therefore ctid is useless as a long-term row identifier.
The OID, or even better a user-defined serial number, should be used
to identify logical rows.
Note: I strongly recommend that you use a unique field in the student table.
As per Craig's comment, I'll give another way to solve the OP's issue; it's a bit tricky.
First, create a unique column for the student table; for this, use the query below:
alter table student add column stu_uniq serial
This will populate stu_uniq with a unique value for each row, so that you can easily DELETE any row(s) using stu_uniq.
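For example, to delete the row whose stu_uniq value is 3 (the value is hypothetical):
delete from student where stu_uniq = 3;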
I don't know whether it's a correct alternative for this problem, but it satisfies mine. My problem is that I need to delete a row without the help of any of its columns. I created the table WITH OIDS, and with the help of the oid I deleted the rows. (Note that WITH OIDS was removed in PostgreSQL 12, so this only works on older versions.)
CREATE TABLE Student(Name Text, RollNo Integer, Cgpa Float, Department Text, Branch Text) WITH OIDS;
DELETE FROM STUDENT WHERE oid=18789;
DELETE FROM STUDENT WHERE oid=18790;
Thanks to #WingedPanther for suggesting this idea.
You could try something like this.
create table t(id int,name varchar(10));
insert into t values(1,'a'),(2,'b'),(3,'c'),(4,'d');
with cte as
(
    select ctid, row_number() over (order by id) as rn from t
)
-- Postgres does not allow deleting from a CTE directly, so target the
-- base table and match rows by ctid
delete from t where ctid in (select ctid from cte where rn = 1);
Cte in Postgres
I've got two SQL Server 2008 tables: one is an "Import" table containing new data and the other a "Destination" table with the live data. Both tables are similar but not identical (there are more columns in the Destination table, updated by a CRM system), but both tables have three "phone number" fields - Tel1, Tel2 and Tel3. I need to remove all records from the Import table where any of the phone numbers already exist in the destination table.
I've tried knocking together a simple query (just a SELECT to test with just now):
select t2.account_id
from ImportData t2, Destination t1
where
(t2.Tel1!='' AND (t2.Tel1 IN (t1.Tel1,t1.Tel2,t1.Tel3)))
or
(t2.Tel2!='' AND (t2.Tel2 IN (t1.Tel1,t1.Tel2,t1.Tel3)))
or
(t2.Tel3!='' AND (t2.Tel3 IN (t1.Tel1,t1.Tel2,t1.Tel3)))
... but I'm aware this is almost certainly Not The Way To Do Things, especially as it's very slow. Can anyone point me in the right direction?
Writing this query efficiently requires a little more information: we need to know whether each load contains mostly duplicates or mostly new records. I assume that account_id is the primary key and has a clustered index.
I would use the temporary table approach: that is, create a normalized temp table #tmp with an index on the phone number and account_id, like:
SELECT account_id, Phone INTO #tmp
FROM
   (SELECT account_id, Tel1, Tel2, Tel3
   FROM Destination) p
UNPIVOT
   (Phone FOR TelName IN
      (Tel1, Tel2, Tel3)
)AS unpvt;
Create a nonclustered index on this table with the phone number as the first column and the account number as the second. You can't escape one full table scan, so I assume you can scan the import table (probably the smaller one). Then just join with this table and use the NOT EXISTS qualifier as explained; a sketch follows. Then, of course, drop the table after the processing.
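A minimal sketch of that final step, assuming the unpivoted #tmp above (the != '' guard keeps blank phone fields from matching each other):
DELETE i
FROM ImportData i
WHERE EXISTS (
    SELECT 1
    FROM #tmp t
    WHERE t.Phone IN (i.Tel1, i.Tel2, i.Tel3)
      AND t.Phone != ''
);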
I am not sure about the performance of this query, but since I made the effort of writing it I will post it anyway...
;with aaa(tel)
as
(
select Tel1
from Destination
union
select Tel2
from Destination
union
select Tel3
from Destination
)
,bbb(tel, id)
as
(
select Tel1, account_id
from ImportData
union
select Tel2, account_id
from ImportData
union
select Tel3, account_id
from ImportData
)
select distinct b.id
from bbb b
where b.tel in
(
select a.tel
from aaa a
intersect
select b2.tel
from bbb b2
)
EXISTS will short-circuit: it can stop at the first match rather than doing a full traversal of the table like a join. You could refactor the WHERE clause as well, if this still doesn't perform the way you want.
SELECT *
FROM ImportData t2
WHERE NOT EXISTS (
select 1
from Destination t1
where (t2.Tel1!='' AND (t2.Tel1 IN (t1.Tel1,t1.Tel2,t1.Tel3)))
or
(t2.Tel2!='' AND (t2.Tel2 IN (t1.Tel1,t1.Tel2,t1.Tel3)))
or
(t2.Tel3!='' AND (t2.Tel3 IN (t1.Tel1,t1.Tel2,t1.Tel3)))
)
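One possible refactor of that WHERE clause (a sketch: NULLIF turns an empty string into NULL, and NULL never matches an IN list, which removes the explicit != '' guards):
SELECT *
FROM ImportData t2
WHERE NOT EXISTS (
    SELECT 1
    FROM Destination t1
    WHERE NULLIF(t2.Tel1, '') IN (t1.Tel1, t1.Tel2, t1.Tel3)
       OR NULLIF(t2.Tel2, '') IN (t1.Tel1, t1.Tel2, t1.Tel3)
       OR NULLIF(t2.Tel3, '') IN (t1.Tel1, t1.Tel2, t1.Tel3)
)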