From Mongodb to PostgresSQL - mongodb

We are scraping data online and the data is (relatively for us) fastly growing. The data consists of one big text (approx. 2000 chars) and a dozen of simple text fields (few words max).
We scrap around 1M or 2M rows per week and it will grow up to 5-10M rows per week probably.
Currently we use Mongodb Atlas to store the rows. Until very soon, we were adding all the infos available but now we defined a schema and only keep what we need. Now the flexibility of the document db is not necessary anymore. And since mongodb pricing is growing exponentially with storage and tier upgrade, we are looking for another storage solution to store that data.
Here is the pipeline : we send data from the scrapers to Mongodb -> using Airbyte we replicate data periodically to Bigquery -> we process data on Bigquery using Spark or Apache Beam -> We perform analysis on transformed data using Sisense
With those requirements in mind, could we use Postgres to replace Mongodb for the "raw" storage ?
Is postgres scaling well for that kind of data (we are not even close to big data but in a near futur we will have at least 1TB data) ? We don't plan to use relations in postgres, isn't it overkill then ? We will however use array and json types, that's why I selected postgres first. Is there other storage solution we could use ? Also could it be possible / good practice to store data directly in Bigquery ?
Thanks for the help

Related

How to export data from firestore in a cost-effective manner for data analysis?

I am looking for a way to get the data in my company's firestore database without it costing too many reads, it is a huge database and currently, it is not an option to export it regularly due to cost, I need a reasonable sample size if not all the content in the database to run some analytical work using either python or R.

Should I migrate to Redshift?

I'm currently struggling querying be chunk of data that is stored in partitioned table (partition per date)
the data looks like that:
date, product_id, orders
2019-11-01, 1, 100
2019-11-01, 2, 200
2019-11-02, 1, 300
I have hundreds of date-partitions and millions of rows per date.
Now, if I want to query, for instance, total orders for product id 1 and 2 for period of 2 weeks, and group by date (to show in a graph per date), the db has to go to 2 weeks of partitions and fetch the data for them.
That process might be taking a long time when the number of products is big or the time frame required is long.
I have read that AWS Redshift is suitable for this kind of tasks. I'm considering shifting my partitioned tables (aggregated analytics per date) to that technology but I wonder if that's really what I should do to make those queries to run much faster.
Thanks!
As per your use case Redshift is really a good choice for you.
To gain the best performance out of Redshift, it is very important to set proper distribution and sort key. In your case "date" column should be distribution key and "productid" should be sort key. Another important note, Do not encode "date" and "productid" column.
You should get better performance.
If you are struggling with traditional SQL databases, then Amazon Redshift is certainly an option. It can handle tables with billions of rows.
This would involve loading the data from Amazon S3 into Redshift. This will allow Redshift to optimize the way that the data is stored, making it much faster for querying.
Alternatively, you could consider using Amazon Athena, which can query the data directly from Amazon S3. It understands data that is partitioned into separate directories (eg based on date).
Which version of PostgreSQL are you using?
Are you using native partioning or inheritance partitioning trigger-based?
Latest version of postgresql improved partitioning management.
Considering your case Amazon Redshift can be a good choice, so does Amazon Athena. But it is also important to consider your application framework. Are you opt moving to Amazon only for Database or you have other Amazon services in the list too?
Also before making the decision please check the cost of Redshift.

Processing multiple concurrent read queries in Postgres

I am planning to use AWS RDS Postgres version 10.4 and above for storing data in a single table comprising of ~15 columns.
My use case is to serve:
1. Periodically (after 1 hour) store/update rows in to this table.
2. Periodically (after 1 hour) fetch data from the table say 500 rows at a time.
3. Frequently fetch small data (10 rows) from the table (100's of queries in parallel)
Does AWS RDS Postgres support serving all of above use cases
I am aware of Read-Replicas support, but is there any in built load balancer to serve the queries that come in parallel?
How many read queries can Postgres be able to process concurrently?
Thanks in advance
Your usecases seems to be a normal fit for all relational database systems. So I would say: yes.
The question is: how fast the DB can handle the 100 queries (3).
In general the postgresql documentation is one of the best I ever read. So give it a try:
https://www.postgresql.org/docs/10/parallel-query.html
But also take into consideration how big your data is!
That said, try w/o read replicas first! You might not need them.

Tableau extract using row store vs Column store DBs

I am creating a .TDE(tableau extract) from a table in sql server which has around 180+ columns and 60 million records and is taking around 4 hours in current infra of 16GB RAM and 12 cores
I was looking for any other way by which this can be done faster. I would like to know if I could load my data into any column store DB which can connect to tableau and then create a TDE from the data in the column store DB can make a bit better in performance.
If yes, please suggest any such column store DB
The Tableau SDK is a way to build TDE files without having to use Desktop. You can try it and see if you get better performance.
Does your TDE need all 180+ columns? You can get a noticeable performance improvement if your TDE contains only the columns you need.

MongoDB into AWS Redshift

We've got a pretty big MongoDB instance with sharded collections. It's reached a point where it's becoming too expensive to rely on MongoDB query capabilities (including aggregation framework) for insight to the data.
I've looked around for options to make the data available and easier to consume, and have settled on two promising options:
AWS Redshift
Hadoop + Hive
We want to be able to use a SQL like syntax to analyze our data, and we want close to real time access to the data (a few minutes latency is fine, we just don't want to wait for the whole MongoDB to sync overnight).
As far as I can gather, for option 2, one can use this https://github.com/mongodb/mongo-hadoop to move data over from MongoDB to a Hadoop cluster.
I've looked high and low, but I'm struggling to find a similar solution for getting MongoDB into AWS Redshift. From looking at Amazon articles, it seems like the correct way to go about it is to use AWS Kinesis to get the data into Redshift. That said, I can't find any example of someone that did something similar, and I can't find any libraries or connectors to move data from MongoDB into a Kinesis stream. At least nothing that looks promising.
Has anyone done something like this?
I ended up coding up our own migrator using NodeJS.
I got a bit irritated with answers explaining what redshift and MongoDB is, so I decided I'll take the time to share what I had to do in the end.
Timestamped data
Basically we ensure that all our MongoDB collections that we want to be migrated to tables in redshift are timestamped, and indexed according to that timestamp.
Plugins returning cursors
We then code up a plugin for each migration that we want to do from a mongo collection to a redshift table. Each plugin returns a cursor, which takes the last migrated date into account (passed to it from the migrator engine), and only returns the data that has changed since the last successful migration for that plugin.
How the cursors are used
The migrator engine then uses this cursor, and loops through each record.
It calls back to the plugin for each record, to transform the document into an array, which the migrator then uses to create a delimited line which it streams to a file on disk. We use tabs to delimit this file, as our data contained a lot of commas and pipes.
Delimited exports from S3 into a table on redshift
The migrator then uploads the delimited file onto S3, and runs the redshift copy command to load the file from S3 into a temp table, using the plugin configuration to get the name and a convention to denote it as a temporary table.
So for example, if I had a plugin configured with a table name of employees, it would create a temp table with the name of temp_employees.
Now we've got data in this temp table. And the records in this temp table get their ids from the originating MongoDB collection. This allows us to then run a delete against the target table, in our example, the employees table, where the id is present in the temp table. If any of the tables don't exist, it gets created on the fly, based on a schema provided by the plugin. And so we get to insert all the records from the temp table into the target table. This caters for both new records and updated records. We only do soft deletes on our data, so it'll be updated with an is_deleted flag in redshift.
Once this whole process is done, the migrator engine stores a timestamp for the plugin in a redshift table, in order to keep track of when the migration last run successfully for it. This value is then passed to the plugin the next time the engine decides it should migrate data, allowing the plugin to use the timestamp in the cursor it needs to provide to the engine.
So in summary, each plugin/migration provides the following to the engine:
A cursor, which optionally uses the last migrated date passed to it
from the engine, in order to ensure that only deltas are moved
across.
A transform function, which the engine uses to turn each document in the cursor into a delimited string, which gets appended to an export file
A schema file, this is a SQL file containing the schema for the table at redshift
Redshift is a data ware housing product and Mongo DB is a NoSQL DB. Clearly, they are not a replacement of each other and can co-exist and serve different purpose. Now how to save and update records at both places.
You can move all Mongo DB data to Redshift as a one time activity.
Redshift is not a good fit for real time write. For Near Real Time Sync to Redshift, you should Modify program that writes into Mongo DB.
Let that program also writes into S3 locations. S3 location to redshift movement can be done on regular interval.
Mongo DB being a document storage engine, Apache Solr, Elastic Search can be considered as possible replacements. But they do not support SQL type querying capabilities.They basically use a different filtering mechanism. For eg, for Solr, you might need to use the Dismax Filter.
On Cloud, Amazon's Cloud Search/Azure Search would be compelling options to try as well.
You can use AWS DMS to migrate data to redshift now easily , you can also realtime ongoing changes with it.