SQLAlchemy: Efficiently substitute integer code for string name when inserting data - postgresql

What is the most efficient way to substitute an integer key from a lookup table for a string in my input data?
For example, let's say I have a lookup table of country names in string form, with the primary key of that lookup table appearing as a foreign key column on a second table, "cities". I have a list of tuples containing data for the "cities" table, and one of those fields is the string name of the country.
So each time I insert a row for a new city, I must select the PK from the lookup table where the "country_name" string column is equal to the input string. Then that integer PK for the country row needs to be put into the FK "country_id" column of the row being added to cities.
Is there a canonical way to do this in SQLAlchemy? The most obvious way would be to write a function that gets the appropriate PK with something like select(Country.country_id).where(Country.country_name == 'Ruritania').
But I wonder if SQLAlchemy has a more efficient way to do it, especially for the bulk insertion of records.
"Association Proxies" sound like what I want, but I don't understand them well enough to know how to use them in the context of bulk inserts. From what I have gathered so far, an ENUM data type would be too constraining as it cannot be updated easily, but I would consider such a solution if there is a way around that caveat.
Are there ways to make sure that values are not repeatedly read from the lookup table in a batch of operations?
I am using Postgres for my database.
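One pattern worth sketching here in plain SQL (the table and column names below are assumed from the description above, not an actual schema): resolve the foreign key inside the INSERT itself with INSERT ... SELECT, so no separate lookup round trip is needed per row. SQLAlchemy Core can express the same shape with insert().from_select().
-- Hypothetical schema: countries(country_id, country_name), cities(city_name, country_id)
INSERT INTO cities (city_name, country_id)
SELECT 'Strelsau', c.country_id
FROM countries AS c
WHERE c.country_name = 'Ruritania';
For bulk loads, another common approach is to read the whole lookup table once into a dictionary in application code and map country names to ids before building the insert batch, so the lookup table is not queried repeatedly.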

Related

Is it bad for columns in composite keys to have mismatching types?

Problem:
I'd like to make a composite primary key from columns id and user_id for a postgres database table. Column user_id is a foreign key with an integer type, whereas id is a string. Will this cause a conflict because the types are different?
Edit: Also, are there combinations of types that would cause problems?
Context:
I obviously should match the type of the User.id field for its foreign key. And, the id for my table will be derived from a uuid to prevent data leaks. So I would prefer not to change the types of either field I want in this table.
Research:
I am using sqlalchemy. Their documentation mentions how to create a composite primary key, but it doesn't discuss dealing with different types for each column.
No, this won't be a problem.
Your question seems to indicate that you think the values of the indexed columns are somehow concatenated and then stored in the index as a single value. This is not the case. Each column value is stored independently, but together, similar to the way the column values are stored in the actual table.
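A minimal sketch of such a key (the table and column names here are invented): a text column and an integer foreign key combined into one composite primary key, which PostgreSQL accepts without complaint.
-- Hypothetical tables; the composite key mixes a text id and an integer FK
CREATE TABLE users (
id integer PRIMARY KEY
);
CREATE TABLE documents (
id text NOT NULL,                               -- string id derived from a UUID
user_id integer NOT NULL REFERENCES users (id),
PRIMARY KEY (id, user_id)                       -- mixed column types are fine here
);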

Extract fields from Postgres jsonb

I'm trying to find an efficient way to extract specific fields from a Postgres jsonb column.
CREATE TABLE foo (
id integer,
data jsonb
)
"data" contains a row with:
{
"firstname": "bob",
"lastname": "smith",
"tags": ["tag0","tag1"]
}
I want to extract a large number of fields from the data column. This select statement works, but it's cumbersome with large numbers of fields, it yields really long SQL statements, and I also don't know whether it traverses the jsonb repeatedly for each column:
SELECT data->>'firstname', data->'tags' FROM foo
I tried this:
SELECT jsonb_path_query(data, '$.[firstname,tags]') FROM foo
but it gave an error message: syntax error, unexpected '['. This syntax is, in fact, correct jsonpath per https://jsonpath.com/, but it appears that Postgres doesn't implement it.
Is there a way to extract jsonb fields efficiently, both in terms of execution speed and compactness of the SQL query command?
Yes, your query will read the complete data column for all rows of foo.
Even if you normalize the data model and turn your JSON attributes into regular columns, it will read the table row by row, but then your query becomes cheaper if you only access the first couple of columns in the table.
What you are looking for is a column store, but PostgreSQL doesn't have that built in.
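If the concern is mainly the length of the SQL rather than the amount of data read, one option (just a sketch; the column names and types are taken from the sample document above) is jsonb_to_record, which expands several keys in a single call:
-- Expand firstname, lastname and tags in one function call
SELECT f.id, r.firstname, r.lastname, r.tags
FROM foo AS f
CROSS JOIN LATERAL jsonb_to_record(f.data)
    AS r(firstname text, lastname text, tags jsonb);
As noted above, the whole data value is still read for every row; this only shortens the statement.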

Postgres JSONB unique constraint

I have a table like the following.
create table person (
firstname varchar,
lastname varchar,
person_info jsonb,
..
);
I already have a unique constraint on firstname + lastname. I recently noticed that there is always something different in the person_info jsonb, and I want rows to be uniquely identified by person_info as well.
Should I add person_info to the unique constraint, making it firstname + lastname + person_info? Is there any performance impact with such an implementation? I have heard that JSONB does not index well as the amount of data increases.
I am thinking of storing a hash value of person_info in a separate field and including this new hash field as part of the unique index.
I would appreciate some help from an expert on this.
This seems like the wrong idea.
A primary key should be immutable and uniquely identify a table row.
Names are not good for that, because
different people can have the same name
names can change
This is probably why you are tempted to add additional information to truly identify each individual row.
Unless you have some immutable attribute that uniquely identifies each person (such as a social security number), you should generate an artificial primary key for the table:
ALTER TABLE person
ADD id bigint
GENERATED ALWAYS AS IDENTITY
PRIMARY KEY;
Indexing a jsonb is possible, but you will get problems with long values since index entries are limited in size, and you will get an error if you exceed the limit.
I recommend that any attribute that you might want to index is not stored in a jsonb, but as a regular table column.
JSONB indexing, IMHO, refers to the ability to index fields inside the binary JSON rather than the whole block. Be aware also that key ordering is not kept! So you can obtain two different hashes for two JSON documents with exactly the same data but a different key order. Instead, if you can find which JSON fields give you uniqueness, then you can use those directly for indexing.
Try also to look at this page
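For completeness, a sketch of the hash-based idea raised in the question (the index name is made up): a unique index over an md5 hash of the jsonb keeps the index entries small no matter how large person_info grows, at the cost of an extra expression to compute on every insert.
-- Hypothetical functional unique index; hashing keeps index entries small
CREATE UNIQUE INDEX person_uniq_idx
ON person (firstname, lastname, md5(person_info::text));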

sqlite3 database help in improving performance and design

I have a sqlite3 database with this schema:
CREATE TABLE [dict] (
[Entry] [CHAR(209)],
[Definition] [CHAR(924975)]);
CREATE INDEX [i_dict_entry] ON [dict] ([Entry]);
it's a kind of dictionary with 260000 records and nearly 1GB of size; I have created an index for the Entry column to improve performance;
a sample of a row's entry column is like this:
|love|lovingly|loves|loved|loving|
All the words separated with | refer to the same definition; (I put them all in one string, separated with |, to prevent duplication of data in the Definition column)
and this is the command that I use to retrieve the results:
SELECT * FROM dict WHERE Entry like '%|loves|%'
execution time: ~1.7s
if I use = operator instead of LIKE operator, the execution is nearly instantaneous;
SELECT * FROM dict WHERE Entry='|love|lovingly|loves|loved|loving|'
but this way I can't search for words like: love,loves...(separately I mean)
My questions:
Although I have created an index for the Entry column, is indexing really effective when using the LIKE operator with % in it?
What about the idea of creating a separate row for each part of the composite Entry column (one for love, another for loves, ... then all would have the same definition) and then using the = operator? If yes, is there any way of referencing the data? I mean, rather than repeating the same Definition for each entry, create it once and have all the others point to it; is that possible?
Thanks in advance for any tips and suggestions.
Every entry should have a separate row in the database:
CREATE TABLE Definitions (
DefinitionID INTEGER PRIMARY KEY,
Definition TEXT
);
CREATE TABLE Entries (
EntryID INTEGER PRIMARY KEY,
DefinitionID INTEGER REFERENCES Definitions(DefinitionID),
Entry TEXT
);
CREATE INDEX i_entry ON Entries(Entry);
You can then query the definition by joining the two tables:
SELECT Definition
FROM Entries
JOIN Definitions USING (DefinitionID)
WHERE Entry = 'loves'
Also see Database normalization.
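To make the sharing explicit (the sample words and definition text here are invented), one definition row is referenced by many entry rows:
-- One definition row, many entries pointing at it
INSERT INTO Definitions (DefinitionID, Definition)
VALUES (1, 'to feel deep affection for');
INSERT INTO Entries (DefinitionID, Entry)
VALUES (1, 'love'), (1, 'loves'), (1, 'loved'), (1, 'loving'), (1, 'lovingly');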

How to use BULK INSERT when rows depend on foreign keys values?

My question is related to this one I asked on ServerFault.
Based on this, I've considered the use of BULK INSERT. I now understand that I have to prepare a file for each entity I want to save into the database. No matter what, I still wonder whether this BULK INSERT will avoid the memory issue on my system described in the referenced question on ServerFault.
As for the Streets table, it's quite simple! I have only two cities and five sectors to care about as the foreign keys. But then, how about the Addresses? The Addresses table is structured like this:
AddressId int not null identity(1,1) primary key
StreetNumber int null
NumberSuffix_Value int not null DEFAULT 0
StreetId int null references Streets (StreetId)
CityId int not null references Cities (CityId)
SectorId int null references Sectors (SectorId)
As I said on ServerFault, I have about 35,000 addresses to insert. Shall I memorize all the IDs? =P
And then, I have the citizens to insert, who have an association with the addresses.
PersonId int not null identity(1,1) primary key
Surname nvarchar not null
FirstName nvarchar not null
IsActive bit
AddressId int null references Addresses (AddressId)
The only thing I can think of is to force the IDs to static values, but then I lose any flexibility that I had with my former approach using the INSERT..SELECT strategy.
What are then my options?
If I force the IDs to always be the same, then I have to SET IDENTITY_INSERT ON so that I can force the values into the table; this way I always have the same IDs for each of my rows, just as suggested here.
How to BULK INSERT with foreign keys? I can't find any docs on this anywhere. =(
Thanks for your kind assistance!
EDIT
I edited in order to include the BULK INSERT SQL instruction that finally worked for me!
I had my Excel workbook ready with the information I needed to insert. So, I simply created a few supplemental worksheets and began to write formulas in order to "import" the information into these new sheets. I had one for each of my entities.
Streets;
Addresses;
Citizens.
As for the two other entities, it wasn't worth bulk inserting them, as I had only two cities and five sectors (city subdivisions) to insert. Once both the cities and sectors were inserted, I noted their respective IDs and began to ready my record sets for bulk insert. Using the power of Excel to compute the values and to "import" the foreign keys was a charm in itself, by the way. Afterwards, I saved each of the worksheets to a separate CSV file. My records were then ready to be bulked.
USE [DatabaseName]
GO
delete from Citizens
delete from Addresses
delete from Streets
BULK INSERT Streets
FROM N'C:\SomeFolder\SomeSubfolder\Streets.csv'
WITH (
FIRSTROW = 2
, KEEPIDENTITY
, FIELDTERMINATOR = N','
, ROWTERMINATOR = N'\n'
, CODEPAGE = N'ACP'
)
GO
FIRSTROW
Indicates the row number at which to begin the insert. In my situation, my CSVs contained the column headers, so the second row was the one to begin with. As an aside, one could want to start anywhere in the file, say at the 15th row.
KEEPIDENTITY
Allows one to bulk-insert the entity IDs specified in the file even though the table has an identity column. This parameter is the same as issuing SET IDENTITY_INSERT my_table ON before a row insert when you wish to insert with a precise ID.
As for the other parameters, they speak for themselves.
Now that this is explained, the same code was repeated for each of the two remaining entities to insert, Addresses and Citizens. And because KEEPIDENTITY was specified, all of my foreign keys remained intact, even though my primary keys were set as identities in SQL Server.
Only a few tweaks though: do exactly what marc_s said in his answer and import your data as fast as you can into a staging table with no restrictions at all. This way you're going to make your life much easier, while still following good practices. =)
The basic idea is to bulk insert your data into a staging table that doesn't have any restrictions, any constraints etc. - just bulk load the data as fast as you can.
Once you have the data in the staging table, then you need to start to worry about constraints etc. when you insert the data from the staging table into the real tables.
Here, you could e.g.
insert only those rows into your real work tables that match all the criteria (and mark them as "successfully inserted" in your staging table)
handle all rows that are left in the staging table that aren't successfully inserted by some error / recovery process - whatever that could be: printing a report with all the "problem" rows, tossing them into an "error bin" or whatever - totally up to you.
Key point is: the actual BULK INSERT should be into a totally unconstrained table - just load the data as fast as you can - and only then, in a second step, start to worry about constraints, lookup data, references and stuff like that.
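A sketch of what that two-step load could look like for the Addresses table (the staging layout, the name columns on Streets/Cities/Sectors, and the CSV file name are assumptions for illustration, not the actual schema):
-- Step 1: unconstrained staging table, loaded as fast as possible
CREATE TABLE Staging_Addresses (
StreetNumber int NULL,
StreetName nvarchar(100) NULL,   -- assumed lookup name columns
CityName nvarchar(100) NULL,
SectorName nvarchar(100) NULL
);
BULK INSERT Staging_Addresses
FROM N'C:\SomeFolder\SomeSubfolder\Addresses_raw.csv'
WITH (FIRSTROW = 2, FIELDTERMINATOR = N',', ROWTERMINATOR = N'\n');
-- Step 2: resolve the foreign keys against the lookup tables and
-- let the identity column generate AddressId on its own
INSERT INTO Addresses (StreetNumber, StreetId, CityId, SectorId)
SELECT s.StreetNumber, st.StreetId, c.CityId, se.SectorId
FROM Staging_Addresses AS s
LEFT JOIN Streets AS st ON st.StreetName = s.StreetName
JOIN Cities AS c ON c.CityName = s.CityName
LEFT JOIN Sectors AS se ON se.SectorName = s.SectorName;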