I am creating a .TDE(tableau extract) from a table in sql server which has around 180+ columns and 60 million records and is taking around 4 hours in current infra of 16GB RAM and 12 cores
I was looking for any other way by which this can be done faster. I would like to know if I could load my data into any column store DB which can connect to tableau and then create a TDE from the data in the column store DB can make a bit better in performance.
If yes, please suggest any such column store DB
The Tableau SDK is a way to build TDE files without having to use Desktop. You can try it and see if you get better performance.
Does your TDE need all 180+ columns? You can get a noticeable performance improvement if your TDE contains only the columns you need.
Related
I'm attempting to mock a database for testing purposes. What I'd like to do is given a connection to an existing Postgres DB, retrieve the schema, limit the data pulled to 1000 rows from each table, and persist both of these components as a file which can later be imported into a local database.
pg_dump doesn't seem to fullfill my requirements as theres no way to tell it to only retrieve a limited amount of rows from tables, its all or nothing.
COPY/\copy commands can help fill this gap, however, it doesn't seem like theres a way to copy data from multiple tables into a single file. I'd rather avoid having to create a single file per table, is there a way to work around this?
We are scraping data online and the data is (relatively for us) fastly growing. The data consists of one big text (approx. 2000 chars) and a dozen of simple text fields (few words max).
We scrap around 1M or 2M rows per week and it will grow up to 5-10M rows per week probably.
Currently we use Mongodb Atlas to store the rows. Until very soon, we were adding all the infos available but now we defined a schema and only keep what we need. Now the flexibility of the document db is not necessary anymore. And since mongodb pricing is growing exponentially with storage and tier upgrade, we are looking for another storage solution to store that data.
Here is the pipeline : we send data from the scrapers to Mongodb -> using Airbyte we replicate data periodically to Bigquery -> we process data on Bigquery using Spark or Apache Beam -> We perform analysis on transformed data using Sisense
With those requirements in mind, could we use Postgres to replace Mongodb for the "raw" storage ?
Is postgres scaling well for that kind of data (we are not even close to big data but in a near futur we will have at least 1TB data) ? We don't plan to use relations in postgres, isn't it overkill then ? We will however use array and json types, that's why I selected postgres first. Is there other storage solution we could use ? Also could it be possible / good practice to store data directly in Bigquery ?
Thanks for the help
I am planning to use AWS RDS Postgres version 10.4 and above for storing data in a single table comprising of ~15 columns.
My use case is to serve:
1. Periodically (after 1 hour) store/update rows in to this table.
2. Periodically (after 1 hour) fetch data from the table say 500 rows at a time.
3. Frequently fetch small data (10 rows) from the table (100's of queries in parallel)
Does AWS RDS Postgres support serving all of above use cases
I am aware of Read-Replicas support, but is there any in built load balancer to serve the queries that come in parallel?
How many read queries can Postgres be able to process concurrently?
Thanks in advance
Your usecases seems to be a normal fit for all relational database systems. So I would say: yes.
The question is: how fast the DB can handle the 100 queries (3).
In general the postgresql documentation is one of the best I ever read. So give it a try:
https://www.postgresql.org/docs/10/parallel-query.html
But also take into consideration how big your data is!
That said, try w/o read replicas first! You might not need them.
I have some set of tables which has 20 million records in a postgres server. As of now i m migrating some table data from one server to another server using insert and update queries with dependent tables in functions. It takes around 2 hours even after optimizing the query. I need a solution to migrate the data faster by using mongodb or cassandra. How?
Try putting your updates and inserts into a file and then load the file. I understand Postgresql will optimise loading the file contents. It's always worked for me although I haven't used that quantity.
I am new to Tableau, and having performance issues and need some help. I have a query that joins several large tables. I am using a live data connection to a MySQL db.
The issue I am having is that it is not applying the filter criteria before asking MySQL for the data. So it is essentially doing a SELECT * from my query and not applying the filter criteria to the where clause. It pulls all the data from MySQL db back to Tableau, then throws away the un-needed data based on my filter criteria. My two main filter criteria are on account_id and a date range.
I can cleanly get a list of the accounts from just doing a select from my account table to populate the filter list, then need to know how to apply that selection when it goes to pull the data from the main data query from MySQL.
To apply a filter at the data source first, try using context filters.
Performance can also be improved by using extracts.
I would personally use an extract, go into your MySQL DB Back-end, run the query, and a CREATE TABLE extract1 AS statement, or whatever you want to call your data table.
When you import this table into Tableau it will already have a SELECT * of your aggregate data in the workbook. From here your query efficiency will be increased ten fold.
Unfortunately, it's going to take awhile for Tableau processing time + mySQL backend DB query time = Ntime to process your data.
Try the extracts...
I've been struggling with the very same thing. I have found that the tableau extracts aren't any faster than pulling directly from a SQL table. What I have done is within SQL created tables that already have the filtered data in them, so the Select * will have only the needed data. The downside to this is it takes up more space on the server, but this isn't a problem on my side.
For the Large Data sets Tableau recommend using an Extract.
An extract will create a snapshot of the data that you are connected with and processing on this data will be faster than a live connection.
All the charts and visualization will load faster and saves your time, each time when you go to the Dashboard.
For the filters that you are using to filter the data-set will work faster in an extract connection. But to get the latest data you have to refresh the extract or schedule a refresh in the server ( if you are uploading the report to server).
There are multiple type of filters available in Tableau, the use of which depends on your application, context filters and global filters can be use to filter the whole set of data.