I am setting up a location-aware application, as mentioned here. I have since learned a lot more about GIS apps and have decided to change a few things about the setup I originally proposed -- I'm now going to use a PostgreSQL database with the PostGIS extension to allow for geometry fields, and use TIGER/Line data to fill it. The TIGER/Line data seems to offer different data sets at different resolutions (layers) -- there is data for states, counties, zip codes, blocks, etc. I need a way to associate a post with an address using the finest-grained resolution possible.
For instance, if possible, I would like to associate a post with a particular street (finest resolution). If not a street, then a particular zip code (less specific). If not a zip code, then a particular county (less specific), and so on. Sidenote: I want to eventually show these all on a map.
This is what I propose:
Locations
id -- int
street_name -- varchar -- NULL
postal_code_id -- int -- NULL
county_id -- int -- NULL
state_id -- int
Postal Codes
id -- int
code -- varchar
geom -- geometry
Counties
id -- int
name -- varchar
geom -- geometry
The states table is similar, and so on...
As you can see, the locations table decides the level of specificity by whichever fields are set. The postal codes, counties, and states tables are not tied together by foreign keys (it is too complex to determine a proper hierarchy that is valid everywhere). However, I believe there is a way to determine their relationships using the geometry field (e.g., query which state a certain zip code is contained in, or which zip codes belong to a certain state).
I think this is a good setup because if the database grows (let's say I decide to include data for districts or blocks), I can add another table for that data and then add another foreign key to the locations table (e.g., block_id).
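One way to read "the level of specificity is decided by whichever fields are set" is a small helper that checks the columns from finest to coarsest. This is only a sketch: the function name `finest_level`, the `LEVELS` list and the use of a plain dictionary for a row are assumptions made for illustration, not part of the proposed schema.

```python
from typing import Optional

# Columns of the proposed locations table, ordered from finest to coarsest.
# Adding a new layer later (e.g. a hypothetical block_id) would mean adding
# one more entry at the right position in this list.
LEVELS = ["street_name", "postal_code_id", "county_id", "state_id"]


def finest_level(location: dict) -> Optional[str]:
    """Return the name of the finest-grained column that is set, or None."""
    for column in LEVELS:
        if location.get(column) is not None:
            return column
    return None


# Example rows (made-up values).
street_post = {"street_name": "Main St", "postal_code_id": 7, "county_id": 3, "state_id": 1}
county_post = {"street_name": None, "postal_code_id": None, "county_id": 3, "state_id": 1}

print(finest_level(street_post))
print(finest_level(county_post))
```

The same ordering could later decide which geometry table to query when drawing the post on a map.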
Does anybody know of a better way to do this?
Is it possible for a street to belong to two different counties, or to two postal codes? In my country this is possible, especially in cities. If it is possible in your data, then your schema won't work.
That said, I would add the geometry of the streets (OpenStreetMap) without linking it to a postal code, a county or even a state. Then, with a simple query that intersects the geometry of the streets with the other tables, you can derive that information and fill another table that holds those relationships.
The idea of the SaaS tool is to have dynamic tables with dynamic custom fields and values of different types. We considered following the force.com/salesforce.com example, but it seems too complicated to maintain going forward, and its high level of abstraction also makes some reports hard to build. So we came up with a simpler idea, but we need to be sure it is a reasonably good approach.
This is the architecture we have today (in a few steps).
Each tenant has its own separate database on the cluster (Postgres 12).
The TABLE table keeps a reference to all of those tables. This entity has a ManyToOne relation to the META table and a OneToMany relation with the DATA table.
The META table is used for metadata configuration. It has a OneToMany relation with FIELDS, which holds the name of each field, its type (e.g. TEXT/INTEGER/BOOLEAN/DATETIME) and the attribute value (as a string, only as a reference).
The DATA table has a ManyToOne relation to TABLES and 50 nullable character varying columns named attribute1...50.
Example flow today:
When a user wants to open the DATA of a TABLE, e.g. "CARS", we load the META table with all the FIELDS (to get the fields for this query). The user specifies that he wants to query against the Brand, Class, Year and Price columns.
In our logic we look up the references for Brand, Class, Year and Price in META>FIELDS, so we know that Brand = attribute2, Class = attribute5, Year = attribute6 and Price = attribute7.
We parse the request into a query, e.g. SELECT [attr...2,5,6,7] FROM DATA, and show the results to the user. If the user then decides to filter this data, e.g. Year > 2017 AND Class = 'A', we use SQL's CAST(), for example SELECT CAST(attribute6 AS int), attribute5 FROM DATA WHERE CAST(attribute6 AS int) > 2017 AND attribute5 = 'A';, so we can still support most SQL features.
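The flow above can be sketched roughly as follows. This is a minimal illustration, not our production code: the `FIELDS` dictionary stands in for the META>FIELDS lookup, the helper names `column_expr` and `build_query` are made up, and SQLite (from Python's standard library) is used instead of Postgres only so that the example runs on its own.

```python
import sqlite3

# Hypothetical stand-in for META > FIELDS: field name -> (attribute column, SQL type).
FIELDS = {
    "Brand": ("attribute2", "text"),
    "Class": ("attribute5", "text"),
    "Year": ("attribute6", "integer"),
    "Price": ("attribute7", "integer"),
}
ALLOWED_OPS = {"=", ">", "<", ">=", "<="}


def column_expr(field):
    """Translate a user-facing field name into a (possibly CAST) SQL expression."""
    column, sql_type = FIELDS[field]  # unknown field names raise KeyError
    return column if sql_type == "text" else f"CAST({column} AS {sql_type})"


def build_query(select, filters):
    """Build SQL plus bound parameters; filters is a list of (field, op, value)."""
    cols = ", ".join(f'{column_expr(f)} AS "{f}"' for f in select)
    where, params = [], []
    for field, op, value in filters:
        if op not in ALLOWED_OPS:
            raise ValueError(f"unsupported operator: {op}")
        where.append(f"{column_expr(field)} {op} ?")
        params.append(value)  # values are bound, never interpolated into the SQL
    sql = f"SELECT {cols} FROM data"
    if where:
        sql += " WHERE " + " AND ".join(where)
    return sql, params


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data (id INTEGER PRIMARY KEY, attribute2 TEXT, "
    "attribute5 TEXT, attribute6 TEXT, attribute7 TEXT)"
)
conn.executemany(
    "INSERT INTO data (attribute2, attribute5, attribute6, attribute7) VALUES (?, ?, ?, ?)",
    [("Audi", "A", "2019", "30000"), ("Fiat", "B", "2020", "12000"), ("Ford", "A", "2015", "9000")],
)

sql, params = build_query(["Brand", "Year"], [("Year", ">", 2017), ("Class", "=", "A")])
rows = conn.execute(sql, params).fetchall()
print(sql)
print(rows)
```

Note that every filter on a non-text field goes through CAST, so a plain index on the attribute column cannot serve it directly.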
However, looking ahead we are a bit worried about:
Managing such an environment for more tenants as the number of tables grows (e.g. 50 per customer, with roughly 1-5 million rows per TABLE; 5 million is the maximum we allow, and for bigger data we have BigQuery). That gives us 50-250 million rows in a single DATA_X table, which might affect query performance, especially since we let users write simple WHERE conditions (less than, equal, null, etc.) through an abstraction language, e.g. GET CARS [BRAND,CLASS,PRICE...] FILTER [EQ(CLASS,A),MT(YEAR,2017)], designed to be similar to JQL (Jira Query Language).
Transaction locks: we allow batch uploads of CSV files into DATA_X, so when a customer loads e.g. 1GB of data, it more or less locks the DATA table for other systems.
Keeping many NULL columns, which can waste some space. For now we are not too worried: when creating a TABLE, the customer decides how many columns he wants, and based on that we assign the TABLE to one of the hardcoded entities DATA_5, DATA_10, DATA_15, DATA_20, DATA_30 or DATA_50, where the number is the limit on attribute columns. Those entities are separate, and we also support migration if a customer decides to switch from 5 to 10 attributes, etc.
We are at a very early stage, so we can (and should) make these changes before we scale. We knew this was most likely not the best approach, but we kept it to get the project running for small customers, for whom it works just fine so far.
We also thought about JSONB objects, but that is not an option, as we want to keep retrieving the data simple.
What do you think about this solution? (FYI, DATA has a composite PRIMARY KEY (ID, TABLEID) and a built-in CreatedAt column which is used for most of the queries, so there will be at most 3 indexes.)
If it seems bad, what would you recommend as an alternative, based on the details I shared (basically a schema-less RDBMS)?
IMHO, I anticipate issues when you want to join tables, and also with the use of CAST etc.
We followed the approach below, which may be of help to you.
We have a table called Cars, plus a couple of tables named CarsMeta and CarsExtension. The underlying Cars table has all the fields common to all tenants. The CarsMeta table describes which types of columns are available for extending the Cars entity. The CarsExtension table has columns like StringCol1...5, IntCol1....5, LongCol1...10.
This way you can easily filter the data as well:
If the filter is on the base table, perform the search there and, if results are found, match the ids against the CarsExtension table to get the extended rows for this entity.
If the filter is on the extended fields, search the extension table and match the results against the base entity ids.
The extension table is organized like this:
id - UniqueId
entityid - uniqueid (points to the primary key of the entity)
StringCol1 - string,
...
IntCol1 - int,
...
In this case, it will be easy to do a join for entity and then get the data along with the extension fields.
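As a rough sketch of that join, assuming a cut-down Cars table with a single `name` column and an extension table with only `StringCol1` and `IntCol1` (the sample rows and the SQLite stand-in are illustrative assumptions, not our real schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cars (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL               -- common field shared by all tenants
);
CREATE TABLE cars_extension (
    id         INTEGER PRIMARY KEY,
    entityid   INTEGER NOT NULL REFERENCES cars(id),
    StringCol1 TEXT,                 -- e.g. a tenant-defined "Class" field
    IntCol1    INTEGER               -- e.g. a tenant-defined "Year" field
);
INSERT INTO cars (id, name) VALUES (1, 'Car one'), (2, 'Car two');
INSERT INTO cars_extension (entityid, StringCol1, IntCol1)
VALUES (1, 'A', 2019), (2, 'B', 2015);
""")

# Filter on an extended field and pull the base entity in the same query.
# IntCol1 is already an integer column, so no CAST is needed.
rows = conn.execute(
    """
    SELECT c.id, c.name, e.StringCol1, e.IntCol1
    FROM cars AS c
    JOIN cars_extension AS e ON e.entityid = c.id
    WHERE e.IntCol1 > ?
    ORDER BY c.id
    """,
    (2017,),
).fetchall()
print(rows)
```

Because the extension columns are typed, they can also be indexed directly if a tenant filters on them often.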
If the table metadata and the data have to be inferred from separate tables, it will be difficult to maintain over a long period of time and with a huge volume of data.
HTH
I have a table (several tables, actually) which contains a textual item_id column, originally populated with an ID provided by a third-party data source. Unfortunately, the data source recently changed the format of their IDs. Meanwhile, my customers will call into my service using whatever format of the ID they most recently saw, meaning that they frequently get incomplete data because they're looking at an item_id which is either too old or too new.
Fortunately, the change in format was relatively straightforward, so it's easy for me to normalize both old and new item_id values into a consistent value, but I'd like to do this for ALL queries regardless of where they come from. Is it possible to set up some sort of trigger that intercepts any query against the item_id column and normalizes the queried value?
It may be a silly basic question but as described in the title, I am wondering how PostgreSQL deals with performance when having millions of entries (with the possibility of reaching a billion entries).
To put it more concretely, I want to store data (audio, photos and videos) in my database (I'm only storing their paths; the files are organised in the file system), but I have to decide whether to use a single table "data" to store all the different types of data, or multiple tables ("data_audio", "data_photos", "data_videos") to separate those types.
The reason I am asking is that I have something like 95% photos and 5% audio and videos, and if I query my database for an audio entry, I don't want it to be slowed down by all the photo entries (searching for a row among a thousand must be different from searching among a million). So I would like to know how PostgreSQL deals with this and whether there is some way to optimise it.
I have read this topic that is really interesting and seems relevant:
How does database indexing work?
Is this the approach I should take?
Recap of the core information I will store in my core tables:
1st option:
DATA TABLE (containing audio, photos and videos):
id of type bigserial
_timestamp of type timestamp
path_file of type text
USERS TABLE:
id of type serial
forename of type varchar(255)
surname of type varchar(255)
birthdate of type date
email_address of type varchar(255)
DATA USERS RELATION TABLE:
id_data of type bigserial
id_user of type serial
ACTIVITIES TABLE:
id of type serial
name of type varchar(255)
description of type text
DATA ACTIVITIES RELATION TABLE:
id_data of type bigserial
id_activity of type serial
(SEARCH queries are mainly done on DATA._timestamp and ACTIVITIES.name fields after filtering data by USERS.id)
2nd option (only switching the previous DATA TABLE with the following three tables and keeping all the other tables):
DATA_AUDIO TABLE
DATA_PHOTOS TABLE
DATA_VIDEOS TABLE
Additional question:
Is it a good idea to have a database per user? (In the storyline, being able to query the database for data depends on whether you have permission. If you want to retrieve data from two different users, you have to ask both users for permission. The permission process is a process in its own right and is not handled here, so let's say that a query will always concern a single user.)
I hope I have been clear. Thanks in advance for any help or advice!
Cyrille
Answers:
PostgreSQL is cool with millions and billions of rows.
If the different types of data all have the same attributes and are the same from the database perspective (have the same relationships to other tables etc.), then keep them in one table. If not, use different tables.
The speed of index access to a table hardly depends on the size of the table: a B-tree lookup grows only logarithmically with the number of rows.
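To illustrate the single-table option with an index, here is a small sketch. It assumes an extra `kind` column on the data table and an index named `data_kind_ts`, neither of which appears in the question, and it uses SQLite from Python's standard library so that it runs anywhere; in PostgreSQL you would inspect the plan with EXPLAIN instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE data (
    id         INTEGER PRIMARY KEY,
    kind       TEXT NOT NULL,   -- assumed column: 'audio', 'photo' or 'video'
    _timestamp TEXT NOT NULL,
    path_file  TEXT NOT NULL
);
CREATE INDEX data_kind_ts ON data (kind, _timestamp);
""")

# Roughly 95% photos and 5% audio, mirroring the ratio in the question.
sample = [
    ("photo" if i % 20 else "audio", f"2020-01-{1 + i % 28:02d}", f"/files/{i}")
    for i in range(2000)
]
conn.executemany("INSERT INTO data (kind, _timestamp, path_file) VALUES (?, ?, ?)", sample)

# Ask the planner how it would find audio rows from a given date onwards.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT path_file FROM data WHERE kind = ? AND _timestamp >= ?",
    ("audio", "2020-01-15"),
).fetchall()
details = [row[3] for row in plan]
print(details)
```

The plan should show a search through the index rather than a scan of the whole table, so the photo rows are never visited when looking for audio entries.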
If the data of different users have connections, like they use common base tables or you want to be able to join tables for different users, it is best to keep them in different schemas in one database. If it is important that they be separated no matter what, keep them in different databases.
It is also an option to keep data for different users in one table, if you use Row Level Security or let your application take care of it.
This decision depends strongly on your use case and architecture.
Warning: don't create clusters with thousands of databases and databases with thousands of schemas. That causes performance problems in the catalogs.
So I have a form I have vendors fill out when they want to ship to us. It's an Excel form that I then import into Access so I can run reports. Sometimes when they send the form back, it's in a format that forces me to enter the data into our database manually.
The form looks like this:
The middle section is just for example purposes so it's a rectangle with text in it.
So everything seemed simple enough until I got to the middle section. In my Excel form I have a section for multiple POs and units, so each shipment can have one to many POs and units. Currently I can approach this task with the redundant method of re-entering the information per PO on the form, but I want to make this simple.
So the task at hand is that I want a form field for POs and units where I can input multiple lines of information, so that when I hit a submit button they appear in the database on separate lines with the same vendor information.
So if my form had this in the middle section:
PO     | Units
111111 | 22
222222 | 33
333333 | 44
When I hit submit, I want it to attach the rest of the form's information to each PO on separate lines, so it'd be like:
Vendor | City    | State | PO     | Units
Nike   | Memphis | TN    | 111111 | 22
Nike   | Memphis | TN    | 222222 | 33
Nike   | Memphis | TN    | 333333 | 44
So how would I go about accomplishing this task?
From your description of the problem and your example of how the data appears to ultimately be stored in Access it looks to me like you are using Access as a spreadsheet and not as a database. This is ok, but you might want to consider normalizing the data to take advantage of the power of databases in general.
For example:
Create a Vendors table whose sole purpose is to keep details about each Vendor you work with. A very basic implementation would have an ID field to uniquely identify each vendor and a Name field for the vendor name.
If Vendors will only ever have a single location you could also store City, State, ZipCode and Email in this same Vendor table, but I suspect having a separate VendorLocation or VendorAddress table would be a better fit long term.
Create a VendorShipment table that tracks the higher level information on your mockup, such as:
ShipmentID (primary key of this table)
VendorID (foreign key back to Vendor table)
Ready Date
Carrier
Estimated Cost
FreightClass
Tracking #
Estimated Transit Time
Finally, create a VendorShipmentDetail table that tracks the information of each shipment, including:
ShipmentDetailID (primary key of this table)
ShipmentID (foreign key back to VendorShipment table)
PO
Units
Any other details that you want to or need to track
Organizing and storing the data in a normalized fashion would ultimately help simplify your data entry / data management process and potentially make for a better user experience.
For example, rather than having to enter the Vendor Name, Address information, etc. each time you could instead use a combo box control that is tied to the Vendor table. If the Vendor exists in the table you select it from the list and you already have the Address information, no need to re-enter it each time. If the Vendor did not already exist you enter it once (probably on a Vendor screen where you maintain the details for each Vendor) and draw upon the information in the future.
You would then use queries to tie the information back together for reporting purposes (de-normalize the information).
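A minimal sketch of that layout, with a query that ties the information back together, might look like this. It uses SQLite from Python's standard library as a stand-in for Access so that it can run on its own; the column subset and the carrier value are assumptions, and City and State are kept on the Vendor table only for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Vendor (
    VendorID INTEGER PRIMARY KEY,
    Name     TEXT NOT NULL,
    City     TEXT,
    State    TEXT
);
CREATE TABLE VendorShipment (
    ShipmentID INTEGER PRIMARY KEY,
    VendorID   INTEGER NOT NULL REFERENCES Vendor(VendorID),
    Carrier    TEXT
);
CREATE TABLE VendorShipmentDetail (
    ShipmentDetailID INTEGER PRIMARY KEY,
    ShipmentID       INTEGER NOT NULL REFERENCES VendorShipment(ShipmentID),
    PO               TEXT NOT NULL,
    Units            INTEGER NOT NULL
);
""")

# The vendor and the shipment header are entered once...
vendor_id = conn.execute(
    "INSERT INTO Vendor (Name, City, State) VALUES (?, ?, ?)",
    ("Nike", "Memphis", "TN"),
).lastrowid
shipment_id = conn.execute(
    "INSERT INTO VendorShipment (VendorID, Carrier) VALUES (?, ?)",
    (vendor_id, "Example Carrier"),
).lastrowid

# ...and each PO line becomes one detail row pointing at the shipment.
conn.executemany(
    "INSERT INTO VendorShipmentDetail (ShipmentID, PO, Units) VALUES (?, ?, ?)",
    [(shipment_id, "111111", 22), (shipment_id, "222222", 33), (shipment_id, "333333", 44)],
)

# De-normalize for reporting: one row per PO with the vendor details repeated.
report = conn.execute("""
    SELECT v.Name, v.City, v.State, d.PO, d.Units
    FROM VendorShipmentDetail AS d
    JOIN VendorShipment AS s ON s.ShipmentID = d.ShipmentID
    JOIN Vendor AS v ON v.VendorID = s.VendorID
    ORDER BY d.ShipmentDetailID
""").fetchall()
for row in report:
    print(row)
```

In Access the same idea would typically be a main form bound to the shipment with a subform bound to the detail table, but the table relationships are the same.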
The art of database design can take a while to pick up, but a good starting point might be to check out the Northwind database that Microsoft has maintained over the years. It has some examples you could draw from immediately to get a practical understanding of how to use normalization within Access. You can find more information here: http://office.microsoft.com/en-us/templates/northwind-sales-web-database-TC101114818.aspx
I've got a CSV with some data that looks like this:
A0A0A0,48.5674500000,-54.8432250000,Gander,NL
A0A1A0,47.0073470000,-52.9589210000,Aquaforte,NL
A0A1B0,47.3622800000,-53.2939930000,Avondale,NL
But my database is normalized such that Cities and Provinces are in separate tables, each with their own ID column.
So what's the easiest way to import this file into 3 separate tables and link the foreign keys properly?
To be more clear, the tables are
cities (id, name, province_id)
provinces (id, code, name, country_id)
postal_codes (id, code, city_id)
countries (id, code, name)
Use COPY to import the CSV into a temp table. Then use some INSERT INTO ... SELECT ... FROM ... to dump the data in the correct tables.
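A rough sketch of that staging approach follows. It runs against SQLite from Python's standard library so that it is self-contained, which means the CSV is loaded with Python's csv module where PostgreSQL would use COPY. The staging table name, the UNIQUE constraints and the trimmed-down columns (no countries table, no province name) are assumptions made for the example.

```python
import csv
import io
import sqlite3

CSV_TEXT = """A0A0A0,48.5674500000,-54.8432250000,Gander,NL
A0A1A0,47.0073470000,-52.9589210000,Aquaforte,NL
A0A1B0,47.3622800000,-53.2939930000,Avondale,NL
"""

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE provinces (id INTEGER PRIMARY KEY, code TEXT NOT NULL UNIQUE);
CREATE TABLE cities (
    id          INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    province_id INTEGER NOT NULL REFERENCES provinces(id),
    UNIQUE (name, province_id)
);
CREATE TABLE postal_codes (
    id      INTEGER PRIMARY KEY,
    code    TEXT NOT NULL UNIQUE,
    city_id INTEGER NOT NULL REFERENCES cities(id)
);
-- Staging table that mirrors the CSV columns one to one.
CREATE TEMP TABLE staging (code TEXT, lat REAL, lon REAL, city TEXT, province TEXT);
""")

# Step 1: load the raw file into the staging table.
conn.executemany(
    "INSERT INTO staging VALUES (?, ?, ?, ?, ?)",
    csv.reader(io.StringIO(CSV_TEXT)),
)

# Step 2: fill the real tables parent-first, resolving foreign keys with joins.
conn.executescript("""
INSERT INTO provinces (code)
    SELECT DISTINCT province FROM staging;
INSERT INTO cities (name, province_id)
    SELECT DISTINCT s.city, p.id
    FROM staging AS s
    JOIN provinces AS p ON p.code = s.province;
INSERT INTO postal_codes (code, city_id)
    SELECT s.code, c.id
    FROM staging AS s
    JOIN provinces AS p ON p.code = s.province
    JOIN cities AS c ON c.name = s.city AND c.province_id = p.id;
""")

result = conn.execute("""
    SELECT pc.code, c.name, p.code
    FROM postal_codes AS pc
    JOIN cities AS c ON c.id = pc.city_id
    JOIN provinces AS p ON p.id = c.province_id
    ORDER BY pc.code
""").fetchall()
print(result)
```

The latitude and longitude columns are read into staging but not copied anywhere, because the target tables listed in the question have no place for them.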
... my database is normalized
Doesn't appear to be. There are many issues, but the one that will trip you up in this question is that there do not seem to be correct PKs and there are no unique keys at all, so you will end up with duplicated data. Id "keys" do not prevent duplicate names; you need a unique index on name. It is not clear how you support two towns with the same name in the same province.
You know you have to load three tables from the one imported table. Due to FKs, which are a Good Thing, you need to load Provinces first, then Cities, then PostalCodes. But from the look of your import file, it is cities (or towns or localities or suburbs) ... that resolution needs to be clearly identified first. There are 360 kilometres and dozens of localities between Gander and Aquaforte. What exactly constitutes a record in the file?
It may help to understand the structure of the excellent Canadian postal code system.
Then you need to check what level of granularity you are storing in the Db. Apparently cities or towns, but not suburbs or localities. What about counties or parishes? E.g. _0A ___ means it is a rural area; since you are storing cities, not counties or municipalities, you can ignore them.
Once you are clear about the granularity or resolution of the source data, and the level of resolution you want in the target tables, you can then load the import file, most probably in several waves per table. The SQL is easy.