PostgreSQL - store real-time data of multiple subjects

Scenario:
I'm trying to build a real-time monitoring webpage for ship operations.
I have 1,000 - 10,000 ships operating.
All ships send real-time data to the DB 24 hours a day, for 30 days.
Each new insert is 1 row x 100 columns.
When the webpage loads, all historic data of the chosen ship is fetched and visualized.
The last row of the ship's real-time table is queried and sent to the webpage to update the live screen.
Each ship also has its own non-real-time data, such as ship dimensions, cargo, attendants, etc...
So far I've been thinking about creating a new schema for each ship. Something like this:
public_schema
ship1_schema
ship2_schema
ship3_schema
|--- realtime_table
|--- cargo_table
|--- dimensions_table
|--- attendants_table
ship4_schema
ship5_schema
Is this a good way to store each ship's real-time data and fetch it from a web server? What other approaches would you recommend?
On the time-series side, I'm already using the PostgreSQL extension TimescaleDB. My question is rather about how to store time-series data when I have many ships. Is it a good idea to separate each ship's real-time (RT) data by creating a new schema per ship?
++ I'm pretty new to PostgreSQL, and some of the advice I got from other people was too advanced for me... I would greatly appreciate it if you could suggest a method and briefly explain what it is.

Personally, this seems like the wrong way to go.
In this case I would keep all the ships' data in the same tables and include a ship_id column in:
realtime_table
cargo_table
dimensions_table
attendants_table
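A minimal sketch of that layout, with placeholder column names of my own choosing (ship_id, recorded_at, and the measurement columns are assumptions, not anything from the original post):

-- Hypothetical sketch: one set of shared tables, every row tagged with ship_id.
CREATE TABLE ships (
    ship_id bigint PRIMARY KEY,
    name    text NOT NULL
);

CREATE TABLE realtime_table (
    ship_id     bigint      NOT NULL REFERENCES ships (ship_id),
    recorded_at timestamptz NOT NULL,
    speed_knots numeric,      -- placeholder for the ~100 measurement columns
    heading_deg numeric
);

CREATE TABLE cargo_table (
    ship_id     bigint NOT NULL REFERENCES ships (ship_id),
    description text
);
-- dimensions_table and attendants_table follow the same pattern.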
From there on, if you believe your data will reach a large volume, you have the following options:
Create indexes on the fields that are important to query; the Postgres query planner is very good at using them.
Recent Postgres versions have declarative table partitioning based on criteria you provide, without having to use table inheritance.
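For example (a sketch only; if you go the declarative-partitioning route, the real-time table from the sketch above would instead be declared as a partitioned table up front):

-- Index serving "all history for one ship, newest first" queries:
CREATE INDEX ON realtime_table (ship_id, recorded_at DESC);

-- Declarative partitioning variant (PostgreSQL 10+): declare the table as
-- partitioned, then attach one partition per period.
CREATE TABLE realtime_partitioned (
    ship_id     bigint      NOT NULL,
    recorded_at timestamptz NOT NULL
    -- ... measurement columns ...
) PARTITION BY RANGE (recorded_at);

CREATE TABLE realtime_2024_01 PARTITION OF realtime_partitioned
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');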
Since you will need live data on the web page, you can use Postgres's LISTEN/NOTIFY mechanism to be told when data is received from a ship (unless you have another way of pushing this data to the web server, such as WebSockets).
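A hedged sketch of the NOTIFY side, reusing the hypothetical realtime_table above and a channel name I made up (new_reading); the EXECUTE FUNCTION syntax needs PostgreSQL 11+:

-- Fire a notification on every new reading so the web server can react.
CREATE OR REPLACE FUNCTION notify_new_reading() RETURNS trigger AS $$
BEGIN
    PERFORM pg_notify('new_reading', NEW.ship_id::text);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER realtime_table_notify
    AFTER INSERT ON realtime_table
    FOR EACH ROW EXECUTE FUNCTION notify_new_reading();

-- The web server's DB session subscribes with:
--   LISTEN new_reading;
-- and forwards the update to the browser (e.g. over a WebSocket) when notified.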

Adding a bit of color here - if you are already using the TimescaleDB extension, you won't need to use table partitioning, since TimescaleDB will handle that for you automatically.
The approach of storing all ship data in a single table with a metadata table outside of the time series table is a common practice. As long as you build the correct indexes, as others have suggested, you should be fine. An important thing to note is that if you (for example) build an index on time, you want to make sure to include time in your queries to benefit from constraint exclusion.
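To make that concrete, a minimal TimescaleDB sketch under the single-table layout (same placeholder names as the sketches above, not anything from the original post):

-- Turn the shared real-time table into a hypertable chunked on time.
SELECT create_hypertable('realtime_table', 'recorded_at');
-- The (ship_id, recorded_at DESC) index from earlier serves both queries below.

-- Full history of the chosen ship (the time predicate enables chunk exclusion):
SELECT * FROM realtime_table
WHERE  ship_id = 42
  AND  recorded_at > now() - interval '30 days'
ORDER BY recorded_at;

-- Latest reading for the live view:
SELECT * FROM realtime_table
WHERE  ship_id = 42
ORDER BY recorded_at DESC
LIMIT  1;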

Related

How to implement an optimistic (or pessimistic) locking when using two databases that need to be in sync?

I'm working on a solution in which we have two databases that are used with the following purposes:
An Elasticsearch used for search purposes
A Postgres database that acts as a source of truth for the data
Our application allows users to retrieve and update products, and a product has multiple attributes: name, price, description, etc... And two typical use cases are:
Retrieve products by name: a search is performed using elasticsearch, and then the IDs retrieved by ES are used on a secondary query against Postgres to obtain the actual and trustworthy data (so we get fast searches on big tables while getting trustworthy data)
Update product fields: We allow users to update any product information (kind of a collaborative wiki). First we store the data in Postgres, and then into Elasticsearch.
However, as I feared, as the number of people using the app increased we ran into race conditions: if user #1 changed the name of a product to "Banana" and user #2 changed it to "Apple" at the same time, sometimes the last record saved in Elasticsearch would be "Banana" while "Apple" would be the last value in Postgres, creating a serious inconsistency between the databases.
So I've ventured into reading about optimistic/pessimistic locking in order to solve my problem, but so far all the articles I find deal with the case where you only use one relational database, and the solutions offered rely on ORM implementations (e.g. Hibernate). But our combined storage solution of ES + Postgres requires more "ballet" than that.
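(For reference, the single-database version-column pattern those articles describe looks roughly like this; the table and column names are mine, and it clearly doesn't cover the Elasticsearch side on its own:)

-- Product table with a version counter (hypothetical names).
CREATE TABLE product (
    id      bigint  PRIMARY KEY,
    name    text    NOT NULL,
    version integer NOT NULL DEFAULT 0
);

-- Optimistic update: succeeds only if nobody changed the row since we read version 7.
UPDATE product
SET    name = 'Banana',
       version = version + 1
WHERE  id = 123
  AND  version = 7;
-- If 0 rows were updated, the application retries or reports a conflict.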
What techniques/options are available to me to solve my kind of problem?
Well, I may attract some criticism, but let me explain it the way I understand it. What I understand is that this problem/concern is more of an architectural matter than a design/code one.
Immediate consistency and, of course, eventual consistency
From the application layer
For immediate consistency between the two databases, the only way to achieve it is polyglot persistence done in a transactional way, so that either the same data gets updated in both Postgres and Elasticsearch or in neither. I wouldn't recommend this, purely because it would put a lot of pressure on the application and you would find it very difficult to scale/maintain.
So basically GUI --> Application Layer --> Postgres/Elasticsearch
Queue/real-time streaming mechanism
You need a messaging queue so that updates go to the queue in an event-based approach.
GUI --> Application Layer --> Postgres --> Queue --> Elasticsearch
Eventual consistency but not immediate consistency
Have a separate application; let's call it the indexer. The purpose of this tool is to pick up updates from Postgres and push them into Elasticsearch.
The indexer can have one configuration per source, with:
An option to do a select * and index everything into Elasticsearch, i.e. a full crawl.
This would be used when you want to delete/reindex the entire dataset in Elasticsearch.
The ability to detect only the updated rows in Postgres and push just those into Elasticsearch, i.e. an incremental crawl.
For this you would need a select query with a where clause based on a status column on your Postgres rows (e.g. pull records with status 0 for documents that were recently updated), or based on a timestamp to pull records updated in the last 30 secs / 1 min, depending on your needs. This is the incremental query (see the sketch after this section).
Once you perform the incremental crawl, if you implement it using a status column, you need to change the status to 1 (success) or -1 (failure) so that the same document doesn't get picked up again in the next crawl. This is the post-incremental query.
Basically, schedule jobs to run the above queries as part of the indexing operations.
So we would have GUI --> Application Layer --> Postgres --> Indexer --> Elasticsearch
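A rough sketch of the incremental and post-incremental queries described above, with hypothetical table/column names (product, index_status, updated_at):

-- Incremental query: rows that still need (re)indexing are marked index_status = 0.
SELECT id, name, price, description
FROM   product
WHERE  index_status = 0;
-- (or, timestamp-based:  WHERE updated_at > now() - interval '1 minute')

-- Post-incremental query: mark rows once they have been pushed to Elasticsearch.
UPDATE product
SET    index_status = 1          -- use -1 instead if the push failed
WHERE  id IN (101, 102, 103);    -- the ids that were just indexed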
Summary
I do not think it would be wise to aim for a fail-proof setup; rather, we should have a system that can recover as quickly as possible when it comes to keeping two different data sources consistent.
Keeping the systems decoupled helps greatly with scaling and with tracking down data correctness/quality issues, and at the same time helps you deal with frequent updates as well as with the growth of the data and of the update rate along with it.
Hope it helps!

Performance improvement for fetching records from a Table of 10 million records in Postgres DB

I have an analytics table that contains 10 million records, and to produce charts I have to fetch records from it. Several other tables are also joined to this table when the data is fetched, but it takes around 10 minutes even though I have indexed the joined columns and used materialized views in Postgres. Performance is still very low: the select query against the materialized view alone takes 5 minutes.
Please suggest some technique to get the result within 5 seconds. I don't want to change the DB storage structure, since that would require a lot of code changes. I would like to know if there are built-in methods for improving query speed.
Thanks in advance.
In general you can take care of this issue by creating a better data structure (most engines do this for you to an extent with keys).
If you create a sorting column of sorts and a tree-like structure over it - in other words, an index - lookups become O(log N) instead of the full scans you may be facing right now. This will ensure you always have a huge speed-up in your searches.
This is in regard to binary trees, red-black trees and so on.
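In PostgreSQL terms, that tree-like structure is simply a B-tree index. A hedged sketch of what to check (table and column names are invented for illustration):

-- Index the columns used in the joins and filters.
CREATE INDEX analytic_event_id_idx   ON analytic (event_id);
CREATE INDEX analytic_created_at_idx ON analytic (created_at);

-- Then verify the planner actually uses them:
EXPLAIN ANALYZE
SELECT a.*, e.name
FROM   analytic a
JOIN   events e ON e.id = a.event_id
WHERE  a.created_at >= now() - interval '30 days';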
Another option for a speed-up is to use something along the lines of Redis, i.e. a database caching layer.
For analytical workloads I have in the past also made use of technologies related to Hadoop, though that may be a larger migration in your case at this point.

Database design: Postgres or EAV to hold semi-structured data

I was given the task to decide whether our stack of technologies is adequate to complete the project we have at hand or should we change it (and to which technologies exactly).
The problem is that I'm just a SQL Server DBA and I have a few days to come up with a solution...
This is what our client wants:
They want a web application to centralize pharmaceutical researches separated into topics, or projects, in their jargon. These researches are sent as csv files and they are somewhat structured as follows:
Project (just a name for the project)
Segment (could be behavioral, toxicology, etc. There is a finite set of about 10 segments. Each csv file holds a segment)
Mandatory fixed fields (a small set of fields that are always present, like Date, subjects IDs, etc. These will be the PKs).
Dynamic fields (could be anything here, but always as key/value pairs, and there shouldn't be more than 200 fields)
Whatever files (images, PDFs, etc.) that are associated with the project.
At the moment, they just want to store these files and retrieve them through a simple search mechanism.
They don't want to crunch the numbers at this point.
98% of the files have a couple of thousand lines, but there are 2% with a couple of million rows (and around 200 fields).
This is what we are developing so far:
The back end is SQL Server 2008 R2. I've designed EAVs for each segment (before anything, please keep in mind that this is not our first EAV design; it worked well before with less data) and the mid-tier/front end is PHP 5.3 with the Laravel 4 framework and Bootstrap.
The issue we are experiencing is that PHP chokes on the big files. It can't insert into SQL in a timely fashion when there are more than 100k rows, because there's a lot of pivoting involved and, on top of that, PHP needs to fetch all the field IDs first before it can start inserting. I'll explain: this is necessary because the client wants some control over the field names. We created a repository of all the possible fields to try to minimize ambiguity problems; fields named, for instance, "Blood Pressure", "BP", "BloodPressure" or "Blood-Pressure" should all be stored under the same name in the database. So, to minimize the issue, the user has to insert his csv fields into another table first, which we called the properties table. This doesn't completely solve the problem, but as he's inserting the fields he sees possible matches that were already inserted: when the user types in "blood", a panel shows all the fields already used containing the word blood. If the user thinks it's the same thing, he has to change the csv header to that field. Anyway, all this is to explain that it's not a simple EAV structure and there's a lot of back and forth of IDs.
This issue is giving us second thoughts about our technologies stack choice, but we have limitations on our possible choices: I only have worked with relational DBs so far, only SQL Server actually and the other guys know only PHP. I guess a MS full stack is out of the question.
It seems to me that a non-SQL approach would be best. I read a lot about MongoDB, but honestly I think it would be a super steep learning curve for us, and if they ever want to start crunching the numbers or even have some reporting capabilities, I guess Mongo wouldn't be up to that. I'm reading about PostgreSQL, which is relational, and its famous HStore type. So here is where my questions start:
Would you guys think that Postgres would be a better fit than SQL Server for this project?
Would we be able to convert the csv files into JSON objects or whatever to be stored into HStore fields and be somewhat queryable?
Are there any issues with Postgres sitting on a Windows box? I don't think our client has Linux admins. Nor do we, for that matter...
Is its licensing free for commercial applications?
Or should we stick with what we have and try to sort the problem out with staging tables or bulk-insert or other technique that relies on the back-end to do the heavy lifting?
Sorry for the long post and thanks for your input guys, I appreciate all answers as I'm pulling my hair out here :)
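For what it's worth, a minimal sketch of the kind of table the HStore/JSON route implies, using jsonb (which plays the same role as HStore on recent PostgreSQL versions); all names here are invented for illustration:

-- Fixed mandatory fields as regular columns; the dynamic fields in one jsonb column.
CREATE TABLE research_rows (
    project        text  NOT NULL,
    segment        text  NOT NULL,
    subject_id     text  NOT NULL,
    recorded_on    date  NOT NULL,
    dynamic_fields jsonb NOT NULL,
    PRIMARY KEY (project, segment, subject_id, recorded_on)
);

-- Query a dynamic field by name:
SELECT subject_id, dynamic_fields ->> 'Blood Pressure' AS blood_pressure
FROM   research_rows
WHERE  segment = 'toxicology'
  AND  dynamic_fields ? 'Blood Pressure';

-- Optional GIN index to speed up key lookups:
CREATE INDEX research_rows_dynamic_idx ON research_rows USING gin (dynamic_fields);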

Caching EAV data - XML or NoSQL / MongoDB?

I'm building a web app that relies heavily on the EAV pattern for storing data. This basically means that each attribute of an object has its own row in a massive database table. I'm using MySQL to store everything. This is a very simplified example of what I'm storing...
OBJECTS                  ATTRIBUTES
objId | type             objId | attribute | value
=============            =========================
1     | fruit            1     | color     | green
2     | fruit            1     | shape     | round
3     | book             2     | color     | red
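Assuming DDL along these lines behind the example (a sketch reconstructed from the table above, not the actual schema):

CREATE TABLE objects (
    objId INT PRIMARY KEY,
    type  VARCHAR(50) NOT NULL
);

CREATE TABLE attributes (
    objId     INT          NOT NULL,
    attribute VARCHAR(50)  NOT NULL,
    value     VARCHAR(255),
    PRIMARY KEY (objId, attribute),
    FOREIGN KEY (objId) REFERENCES objects (objId)
);

-- Fetching many objects and their attributes in one query instead of one query per object:
SELECT o.objId, o.type, a.attribute, a.value
FROM   objects o
JOIN   attributes a ON a.objId = o.objId
WHERE  o.objId IN (1, 2, 3);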
I know some people hate EAV, but I need to be able to add new object attributes arbitrarily without modifying the database schema, and it's working very well for me so far.
As I think anyone else finds when building a system using an EAV data structure, the weakness of this approach is the retrieval of multiple objects together with each object's attributes. At the moment my app only displays 10 objects at a time, so I just query my EAV table 10 times (once for each object) and it's still very fast. However, I'd like to remove this limitation and allow hundreds of objects to be fetched in one go. I also want to be able to query objects in a more flexible way than I'm doing currently.
Doing this with SQL joins would be hideous, so I'm considering caching the data. On average the database gets about 300 reads for every 1 write, so I think it's a good candidate for caching.
So far these are the options I've come up with...
XML database column: Every time a write is performed, update an XML text column in the objects table containing all the object's attributes. This would work for reading the data quickly, but querying XML data hidden in a database table is messy.
XML file: Every time a write is performed, write an XML file to disk which contains each object and its attributes. This has the benefit that I can then use XQuery to query the objects.
NoSQL (eg. MongoDB): Perhaps I should have built the system on a schemaless database like MongoDB. Re-writing the entire app to use MongoDB would be quite time consuming, but it struck me that I could use it as a cache. So for example, every time data is written to the EAV store, the equivalent object would be updated in MongoDB which would then be used for reads and queries.
Originally I thought an XML file would be the best approach, but I can see the file getting really big and unmanageable. At the moment I'm leaning towards using MongoDB. I know it seems crazy running two database servers for one app, but I think it could work in my case.
I'd love to hear your thoughts on this.
I see only two ways; both of them were mentioned in the comments.
First, you can indeed migrate to a document-oriented DB like Mongo - this is a suitable alternative to EAV. Since there will be no JOINs and other such logic, it will be very fast and will scale easily. (So perhaps you'll be able to avoid using a cache at all.)
Second, you can use a dedicated caching tool like Redis, Mongo or Memcached to save each query result for some time.
But I want to turn our minds to the future of this system. What load and scaling are planned?
If you want to reduce system load, I think the best way is to migrate to a document-oriented DB.
Or, if you want results immediately (cached data for reads), that can be achieved with a caching tool, even [if possible] at the network level (for example, nginx supports memcached out of the box).
So, as usual, you should find a balance between one-time and ongoing costs.

Calculating price-drop apps or apps going free - App Store

I am working on a website that displays all the apps from the App Store. I get the App Store data via their EPF Data Feeds through the EPF Importer. In that database I get the pricing of each app for every store. There are dozens of rows in that set of data, and the table structure is like:
application_price
The retail price of an application.
Name            | Key | Description
export_date     |     | The date this application was exported, in milliseconds since the UNIX Epoch.
application_id  | Y   | Foreign key to the application table.
retail_price    |     | Retail price of the application, or null if the application is not available.
currency_code   |     | The ISO3A currency code.
storefront_id   | Y   | Foreign key to the storefront table.
This is the table I get. Now, my problem is that I can't find a way to calculate the price reductions of apps, and the newly free apps, from this particular dataset. Does anyone have an idea how I can calculate this?
Any idea or answer will be highly appreciated.
I tried to store the previous data and the current data and then match them. The problem is that the table itself is too large, and the comparison requires a JOIN, which pushes the query execution time to more than an hour, which I cannot afford. There are approximately 60,000,000 rows in the table.
With these fields you can't directly determine price drops or new applications. You'll have to insert them into your own database and determine the differences from there. In a relational database like MySQL this isn't too complex:
To determine which applications are new, you can add your own column "first_seen" and then query your database for all objects whose first_seen value is no more than a day old.
To calculate price drops you'll have to calculate the difference between the retail_price of the current import and that of the previous import.
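A hedged sketch of both ideas, assuming the daily imports land in tables named price_current and price_previous and that you track first_seen yourself (all names are invented):

-- New applications: first_seen is set on the first import and queried later.
SELECT application_id
FROM   applications
WHERE  first_seen >= NOW() - INTERVAL 1 DAY;

-- Price drops (and newly free apps): compare the two imports per storefront.
SELECT cur.application_id,
       prev.retail_price AS old_price,
       cur.retail_price  AS new_price
FROM   price_current  cur
JOIN   price_previous prev
       ON  prev.application_id = cur.application_id
       AND prev.storefront_id  = cur.storefront_id
WHERE  cur.retail_price < prev.retail_price;
-- Add "AND cur.retail_price = 0" to list only apps that just went free.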
Since you've edited your question, my edited answer:
It seems like you're having storage/performance issues, and you know what you want to achieve. To solve this you'll have to start measuring and debugging: with datasets this large you'll have to make sure you have the correct indexes. Profiling your queries should help you find out whether you do.
And probably your environment is "write once a day" and "read many times a minute" (I'm guessing you're building a website). So you could speed up the front end by processing the differences (price drops and new applications) at import time, rather than when displaying them on the website.
If you still are unable to solve this, I suggest you open a more specific question, detailing your DBMS, queries, etc, so the real database administrators will be able to help you. 60 million rows are a lot, but with the correct indexes it should be no real trouble for a normal database system.
Compare the table with one you've downloaded the previous day, and note the differences.
Added:
For only 60 million items, and on a contemporary PC, you should be able to store a sorted array of the store id numbers and previous prices in memory, and do an array lookup faster than the data is arriving from the network feed. Mark any differences found and double-check them against the DB in post-processing.
Actually, I'm also playing with this data, and I think the best approach for you is based on the data Apple provides.
You have two types of data: full and incremental (updated daily). So within the new incremental data (not nearly as big as the full export) you can work out which records were updated and insert them into another table to determine whether the pricing has changed.
So you end up with a daily list of records (apps, songs, videos...) whose price has changed; just read from the new table you created instead of comparing or joining across various tables.
Cheers