Why Parquet over an RDBMS like Postgres?

I'm working on building a data architecture for my company: a simple ETL pipeline over internal and external data, with the aim of building static dashboards and other outputs for trend analysis.
I'm trying to think through every step of the ETL process one by one, and now I'm questioning the Load part.
I plan to use Spark (LocalExecutor in dev and a managed service on Azure in production), so I started thinking about writing Parquet to a blob storage service. I know all the advantages of Parquet over CSV and other storage formats, and I really love this piece of technology. Most of the articles I read about Spark finish with a df.write.parquet(...).
But I can't figure out why I shouldn't just spin up a Postgres instance and save everything there. I understand that we are not producing 100 GB of data per day, but I want to build something future-proof in a fast-growing company that is going to produce exponentially more data from the business and from the logs and metrics we are starting to record.
Any pros/cons from more experienced devs?
EDIT: What also makes me question this is this tweet: https://twitter.com/markmadsen/status/1044360179213651968

The main trade-off is one of cost and transactional semantics.
Using a DBMS means you can load data transactionally. You also pay for both storage and compute on an ongoing basis, and the storage costs for the same amount of data are going to be higher in a managed DBMS than in a blob store.
It is also harder to scale out processing on a DBMS (it appears the largest Postgres server Azure offers has 64 vCPUs). By storing data in an RDBMS you are likely going to run up against IO or compute bottlenecks more quickly than you would with Spark + blob storage. However, for many datasets this might not be an issue, and as the tweet points out, if you can accomplish everything inside the DB with SQL then it is a much simpler architecture.
If you store Parquet files on a blob store, updating existing data is difficult without regenerating a large segment of your data (and, while I don't know the details of Azure, this generally can't be done transactionally). The compute costs are separate from the storage costs.
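To make the blob-store side concrete, here is a minimal PySpark sketch of the kind of partitioned Parquet load being discussed; the storage account, container, input path, and column names are all made up for illustration:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical landing zone with raw JSON events.
    df = spark.read.json("/landing/events/")

    # Partitioning by a date column keeps later scans cheap and makes
    # re-loading a single day a matter of rewriting one directory.
    (df.write
       .mode("overwrite")
       .partitionBy("event_date")
       .parquet("abfss://datalake@myaccount.dfs.core.windows.net/events/"))

This assumes the Azure storage connector and credentials are already configured on the cluster.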

Storing data in Hadoop using raw file formats is terribly inefficient. Parquet is a columnar file format well suited for querying large amounts of data quickly. As you said above, writing data to Parquet from Spark is pretty easy. Also, writing data with a distributed processing engine (Spark) to a distributed file system (Parquet + HDFS) makes the entire flow seamless. This architecture is well suited for OLAP-type data.
Postgres, on the other hand, is a relational database. While it is good for storing and analyzing transactional data, it cannot be scaled horizontally as easily as HDFS can. Hence, when writing or querying large amounts of data from Spark to or on Postgres, the database can become a bottleneck. But if the data you are processing is OLTP-type, then you can consider this architecture.
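For comparison, a rough sketch of the Spark-to-Postgres write path over JDBC; the connection details are placeholders, and options like batchsize and numPartitions only soften the bottleneck rather than remove it:

    from pyspark.sql import SparkSession

    # Assumes the PostgreSQL JDBC driver jar is on the Spark classpath.
    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet("/staging/events/")   # whatever DataFrame the ETL step produced

    (df.write
       .mode("append")
       .format("jdbc")
       .option("url", "jdbc:postgresql://pg.example.com:5432/analytics")
       .option("dbtable", "public.events")
       .option("user", "etl")
       .option("password", "...")
       .option("batchsize", 10000)      # fewer round-trips per partition
       .option("numPartitions", 8)      # parallel connections into the DB
       .save())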
Hope this helps

One of the issues I have with a dedicated Postgres server is that it's a fixed resource that's on 24/7. If it's idle for 22 hours per day and under heavy load 2 hours per day (in particular if those hours aren't contiguous and are unpredictable), then the server sizing during those 2 hours is going to be too low, whereas during the other 22 hours it's too high.
If you store your data as Parquet on Azure Data Lake Gen2 and then use serverless Synapse for SQL queries, then you don't pay for anything on a 24/7 basis. When under heavy load, everything scales automatically.
The other benefit is that Parquet files are compressed, whereas Postgres doesn't store table data compressed.
The downside is "latency" (probably not the right term, but it's how I think of it). If you want to query a small amount of data then, in my experience, it's slower with the file + Serverless approach compared to a well-indexed, clustered or partitioned Postgres table. Additionally, it's really hard to forecast your bill with the Serverless model when you're coming from the server model. There are definitely going to be usage patterns where Serverless is more expensive than a dedicated server, in particular if you run a lot of queries that have to read all or most of the data.
It's easier/faster to save a Parquet file than to do a lot of inserts. This is a double-edged sword, because the DB guarantees ACID properties whereas saving Parquet files doesn't.
Parquet storage optimization is its own task. Postgres has autovacuum. If the data you're consuming is published daily but you want it laid out on a node/attribute/feature partition scheme, then you need to do that yourself (perhaps with Spark pools).
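As a rough illustration of that re-layout step, something like the following PySpark job could rewrite a daily drop onto such a partition scheme; the paths and the node_id/feature column names are invented for this example:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical daily drop of Parquet files.
    daily = spark.read.parquet("abfss://raw@myaccount.dfs.core.windows.net/events/2023-01-15/")

    # Rewrite it into the layout the queries actually want.
    (daily
       .repartition("node_id")
       .write
       .mode("append")
       .partitionBy("node_id", "feature")
       .parquet("abfss://curated@myaccount.dfs.core.windows.net/events/"))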

Related

Exposing Data from BigQuery to Mobile/Web Apps Via Firestore

I am looking for an easy way (of course with good performance) to expose data in my BigQuery table to web applications.
The current solution, which is running, uses a Cloud Function and Firestore (in native mode) to expose the data in BigQuery. The implementation is as follows: as soon as the data is written to the final BigQuery table, we trigger Cloud Functions (500 records per commit) to update the data in our final Firestore table. The data in the Firestore table is then exposed to the app/web client.
And, to avoid the timeout issues associated with Cloud Functions, we divide the entire dataset into batches, and each Cloud Function instance handles a single batch of records only.
But soon after going live, we were hit with scalability issues for the writes as we were triggering the Cloud Function instances sequentially.
A simple way to improve the performance could be to do parallel writes from inside the Cloud Function, but as per the Firestore documentation, doing more than 1,000 writes/sec against a collection can reduce performance. So eventually the performance gains we get with this approach could be minimal. In our case, we have only one collection.
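For context, this is roughly the kind of batched write our Cloud Function performs today; the collection name and document fields are simplified placeholders:

    from google.cloud import firestore

    db = firestore.Client()

    def write_batch(records):
        """Write up to 500 documents in one atomic batch (Firestore's batch limit)."""
        batch = db.batch()
        for rec in records:
            ref = db.collection("final_table").document(rec["id"])
            batch.set(ref, rec)
        batch.commit()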
Does anyone here have experience dealing with high-volume writes and reads against Firestore? Firestore in Datastore mode can be used for high-volume writes, but what about the read latency?
Also, I am thinking of using Bigtable for this purpose (eventual consistency would be fine for us), but using Bigtable might add additional layers to expose the data, maybe through a web service.
We are expecting data size to be around GBs only.
PS: I don't need the offline capabilities offered by Firestore; the reason for choosing Firestore was ease of development only.
Based on the information you shared, Firestore does not seem like an appropriate choice of product for the amount of data you will be adding at once, and the costs might be heavier than the alternative if we're talking about TBs of data, which I assume is the case.
Generally speaking, Firestore is not recommended for very data-intensive apps or apps with too many writes, for pricing reasons, as reads are considerably cheaper than writes.
Personally, I would choose Bigtable for this case, for the following reasons:
It supports apps with high throughput.
It is easily scalable without loss of performance or instance downtime while scaling.
If kept in the same zone or region as BigQuery, there will be no additional costs for migrating the data to Bigtable.
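As a rough illustration, writing the exported rows into Bigtable with the Python client could look like the sketch below; the project, instance, table, and column-family names are placeholders, and records stands in for whatever the BigQuery export produces:

    import json
    from google.cloud import bigtable

    client = bigtable.Client(project="my-project")
    table = client.instance("analytics-instance").table("exported_rows")

    records = [{"id": "row-1", "value": 42}]   # placeholder for the exported rows

    rows = []
    for rec in records:
        row = table.direct_row(rec["id"].encode())
        row.set_cell("data", b"payload", json.dumps(rec).encode())
        rows.append(row)

    # Bulk mutation; inspect the returned statuses for per-row failures.
    statuses = table.mutate_rows(rows)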

Is this scenario a big data project?

I'm involved in a project with 2 phases, and I'm wondering if this is a big data project (I'm a newbie in this field).
In the first phase I have this scenario:
I have to collect a huge amount of data
I need to store it
I need to build a web application that shows the data to users
In the second phase I need to analyze the stored data, build reports, and do some analysis on it.
As an example of the data volume: in one day I may need to collect and store around 86,400,000 records.
Now I was thinking of this kind of architecture:
to collect the data, some asynchronous technology like ActiveMQ and the MQTT protocol
to store the data, I was thinking about a NoSQL DB (Mongo, HBase or another)
Now this would solve my first-phase problems.
But what about the second phase?
I was thinking about some big data software (like Hadoop or Spark) and some machine learning software, so I can retrieve the data from the DB, analyze it, and rebuild or store it in a better way in order to produce good reports and do some specific analysis.
I was wondering if this is the best approach.
How would you solve this kind of scenario? Am I on the right track?
Thank you
Angelo
As answered by siddhartha, whether your project can be tagged as a big data project or not depends on the context and the business domain/case of your project.
Coming to the tech stack, each of the technologies you mentioned has a specific purpose. For example, if you have structured data, you can use any new-age database with query support. NoSQL databases come in different flavours (columnar, document-based, key-value, etc.), so the technology choice again depends on the kind of data and the use case that you have. I suggest you do some POCs and analysis of the technologies before making final calls.
The definition of big data varies from user to user. For Google, 100 TB might be small data, but for me it is big data because of the difference in available commodity hardware. For example, Google can have a cluster of 50,000 nodes, each node having 64 GB of RAM, for analysing 100 TB of data, so for them this is not big data. But I cannot have a cluster of 50,000 nodes, so for me it is big data.
The same goes for your case: if you have the commodity hardware available, you can go ahead with Hadoop. As you have not mentioned the size of the files you are generating each day, I cannot be certain about your case. But Hadoop is always a good choice for processing your data because of newer projects like Spark, which can help you process data in much less time and, moreover, also gives you features for near real-time analysis. So, in my opinion, it is better if you can use Spark or Hadoop, because then you can play with your data. Moreover, since you want to use a NoSQL database, you can use HBase, which is available with Hadoop, to store your data.
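To make the ingestion phase concrete, here is a minimal sketch of an MQTT consumer that writes each message into HBase; the broker address, topic, table, and column-family names are all hypothetical, and it assumes the paho-mqtt (1.x callback style) and happybase packages plus an HBase Thrift server:

    import json
    import happybase
    import paho.mqtt.client as mqtt

    # HBase connection via the Thrift gateway (table "events" with column family "d" assumed to exist).
    hbase = happybase.Connection("hbase-thrift.example.com", port=9090)
    events = hbase.table("events")

    def on_message(client, userdata, msg):
        record = json.loads(msg.payload)
        # Row key: device id + timestamp, so per-device time-range scans stay cheap.
        row_key = f"{record['device_id']}#{record['ts']}".encode()
        events.put(row_key, {b"d:payload": msg.payload})

    client = mqtt.Client()
    client.on_message = on_message
    client.connect("mqtt-broker.example.com", 1883)
    client.subscribe("sensors/#")
    client.loop_forever()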
Hope this answers your question.

XML versus MongoDB

I have a problem...
I need to store a daily barrage of about 3,000 mid-sized XML documents (100 to 200 data elements).
The data is somewhat unstable in the sense that the schema changes from time to time and the changes are not announced with enough advance notice, but need to be dealt with retroactively on an emergency "hotfix" basis.
The consumption pattern for the data involves both a website and some simple analytics (some averages and pie charts).
MongoDB seems like a great solution except for one problem; it requires converting between XML and JSON. I would prefer to store the XML documents as they arrive, untouched, and shift any intelligent processing to the consumer of the data. That way any bugs in the data-loading code will not cause permanent damage. Bugs in the consumer(s) are always harmless since you can fix and re-run without permanent data loss.
I don't really need "massively parallel" processing capabilities. It's about 4GB of data which fits comfortably in a 64-bit server.
I have eliminated Cassandra from consideration (due to complex setup) and CouchDB (due to a lack of familiar features such as indexing, which I will need initially due to my RDBMS way of thinking).
So finally here's my actual question...
Is it worthwhile to look at native XML databases, which are not as mature as MongoDB, or should I bite the bullet and convert all the XML to JSON as it arrives and just use MongoDB?
You may have a look at BaseX (basex.org), with a built-in XQuery processor and Lucene text indexing.
That Data Volume is Small
If there is no need for parallel data processing, there is no need for MongoDB. Especially when dealing with small amounts of data like 4 GB, the overhead of distributing the work can easily get larger than the actual evaluation effort.
4 GB / 60k nodes is not large for XML databases, either. After some time getting into it, you will come to see XQuery as a great tool for XML document analysis.
Is it Really?
Or do you get 4 GB daily and have to evaluate that plus all the data you have already stored? Then you will reach an amount which you cannot store and process on one machine any more, and distributing the work will become necessary. Not within days or weeks, but a year will already bring you 1 TB.
Converting to JSON
What does your input look like? Does it adhere to any schema or even resemble tabular data? MongoDB's capabilities for analyzing semi-structured data are way worse than what XML databases provide. On the other hand, if you only want to pull a few fields on well-defined paths and you can analyze one input file after the other, MongoDB probably will not suffer much.
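If you do go the MongoDB route, one option is to store the raw XML untouched alongside a few cheap-to-extract fields for indexing; the connection string, database, and field names below are placeholders:

    import xml.etree.ElementTree as ET
    from datetime import datetime, timezone
    from pymongo import MongoClient, ASCENDING

    coll = MongoClient("mongodb://localhost:27017")["docs"]["raw_xml"]
    coll.create_index([("received_at", ASCENDING)])

    def store(xml_text):
        root = ET.fromstring(xml_text)
        coll.insert_one({
            "raw_xml": xml_text,                      # the original document, untouched
            "doc_type": root.tag,                     # a couple of indexable metadata fields
            "received_at": datetime.now(timezone.utc),
        })

This keeps any bugs in field extraction harmless, since the untouched XML can always be re-parsed later.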
Carrying XML into the Cloud
If you want to use both an XML database's capabilities in analyzing the data and some NoSQL's systems capabilities in distributing the work, you could run the database from that system.
BaseX is getting to the cloud with exactly the capabilities you need -- but it will probably still take some time for that feature to get production-ready.

Which NoSQL ... again :), but a different use case

Suggestions for a NoSQL datastore so that we can push data and generate real time Qlikview reports easily?
Easily means:
Qlikview support for reads (a MongoDB connector is available; otherwise we can maybe write a JDBC connector, or failing that a custom QVX connector to the datastore)
Easily adaptable to changes in schema, or schemaless. We change our schema quite frequently ...
Java support for writes
Super fast reads: real-time incremental access, as well as batch access to old data within a time range. I read that Cassandra excels at ranges.
Reasonably fast writes
Reasonably big data storage - 20 million rows stored per day, about 200 bytes each
Would be nice if it can scale to a year's worth of data; elasticity is not so important.
Easy to use, install, and run. Looking at minimal setup and configuration time.
Matlab support for ad hoc querying
Initially I don't think we need a distributed system; however, a cluster is a possibility.
I've looked at MongoDB, Cassandra and HBase. I don't think going over REST is a good idea due to (theoretically) slower performance.
I'm leaning towards MongoDB at the moment due to its ease of use, Matlab support, its schemaless nature, and Qlikview support (a beta connector is available). However, if anyone can suggest something better, that would be great!
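For what it's worth, here is a rough sketch of what the MongoDB write path and a time-range read might look like; the host, database, collection, and field names are just placeholders:

    from datetime import datetime
    from pymongo import MongoClient, ASCENDING

    coll = MongoClient("mongodb://localhost:27017")["metrics"]["samples"]
    coll.create_index([("ts", ASCENDING)])            # keeps the time-range reads fast

    def write(batch):
        # batch is a list of small dicts (~200 bytes each), one per row.
        coll.insert_many(batch, ordered=False)        # unordered inserts are faster

    def read_range(start: datetime, end: datetime):
        return list(coll.find({"ts": {"$gte": start, "$lt": end}}).sort("ts", ASCENDING))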
Depending on the server infrastructure you will use, I guess the best choice is Amazon's NoSQL service, available at aws.amazon.com.
The fact is, any DB will have poor performance on cloud infrastructure due to the way it stores data; Amazon EC2 with EBS, for instance, is VERY slow for this task, requiring you to connect up to 20 EBS volumes in RAID to achieve decent speed. They solved this issue by creating this NoSQL service, which I never used, but it seems nice.

Storing millions of log files - Approx 25 TB a year

As part of my work we get approximately 25 TB worth of log files annually; currently they are saved on an NFS-based filesystem. Some are archived as zipped/tar.gz files while others reside in plain text format.
I am looking for alternatives to the NFS-based system. I looked at MongoDB and CouchDB. The fact that they are document-oriented databases seems to make them the right fit. However, the log file content would need to be converted to JSON to be stored in the DB, which is something I am not willing to do. I need to retain the log file content as-is.
As for usage, we intend to put a small REST API in front and allow people to get file listings, the latest files, and the ability to download a file.
The proposed solutions/ideas need to be some form of distributed database or filesystem at the application level, where one can store log files and scale horizontally effectively by adding more machines.
Ankur
Since you don't want querying features, you can use Apache Hadoop.
I believe HDFS and HBase will be a nice fit for this.
You can see a lot of huge-storage stories on the Hadoop PoweredBy page.
Take a look at Vertica, a columnar database supporting parallel processing and fast queries. Comcast used it to analyze about 15 GB/day of SNMP data, running at an average rate of 46,000 samples per second, using five quad-core HP ProLiant servers. I heard some Comcast operations folks rave about Vertica a few weeks ago; they still really like it. It has some nice data compression techniques and "k-safety" redundancy, so they could dispense with a SAN.
Update: One of the main advantages of a scalable analytics database approach is that you can do some pretty sophisticated, quasi-real time querying of the log. This might be really valuable for your ops team.
Have you tried looking at gluster? It is scalable, provides replication and many other features. It also gives you standard file operations so no need to implement another API layer.
http://www.gluster.org/
I would strongly recommend against using a key/value or document-based store for this data (Mongo, Cassandra, etc.). Use a file system. This is because the files are so large and the access pattern is going to be linear scans. One problem that you will run into is retention: most of the "NoSQL" storage systems use logical deletes, which means that you have to compact your database to remove deleted rows. You'll also have a problem if your individual log records are small and you have to index each one of them: your index will be very large.
Put your data in HDFS with 2-3x replication, in 64 MB chunks, in the same format it's in now.
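If you go that route, here is a minimal sketch of pushing a day's files into HDFS over WebHDFS using the Python hdfs (HdfsCLI) package; the namenode address, user, and paths are hypothetical:

    from hdfs import InsecureClient

    # WebHDFS endpoint (port 9870 is the Hadoop 3 default; older clusters use 50070).
    client = InsecureClient("http://namenode.example.com:9870", user="logs")

    # Upload a local directory of log files as-is; HDFS splits them into blocks internally.
    client.upload("/logs/2023/01/15/", "/mnt/nfs/logs/2023-01-15/", overwrite=True)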
If you are to choose a document database:
On CouchDB you can use the attachments API to attach a file as-is to a document; the document itself could contain only metadata (like timestamp, locality, etc.) for indexing. Then you will have a REST API for the documents and the attachments.
A similar approach is possible with Mongo's GridFS, but you would build the API yourself.
Also HDFS is a very nice choice.
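For the GridFS route, a minimal sketch of storing a log file untouched along with some queryable metadata; the database name, file path, and metadata fields are placeholders:

    import gridfs
    from pymongo import MongoClient

    db = MongoClient("mongodb://localhost:27017")["logs"]
    fs = gridfs.GridFS(db)

    # Store the file as-is; extra keyword arguments become metadata fields on fs.files.
    with open("/mnt/nfs/logs/web01/app-2023-01-15.log.gz", "rb") as f:
        file_id = fs.put(f, filename="app-2023-01-15.log.gz", host="web01", day="2023-01-15")

    # Building blocks for the REST API: list the latest files, fetch one by id.
    latest = db.fs.files.find().sort("uploadDate", -1).limit(20)
    data = fs.get(file_id).read()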