I am new to Google Cloud SQL and have created a schema on Cloud SQL. I have imported (using the import option in the Google Cloud console) a CSV file of 5M unique rows into this table, but only 0.5M rows show up. I am not sure if there is a limit or if I am missing something.
P.S. I also have enough free storage available.
Yes, MySQL has a row size limit of 65,535 bytes, even if your storage is capable of storing larger rows. This may be the reason why your table only displays a limited number of rows. Several factors affect the row size limit, such as the storage engine (InnoDB/MyISAM) and the page header and trailer data used by the storage engine. This is based on the MySQL documentation on row size limits.
Since Cloud SQL supports current and previous versions of MySQL (as well as PostgreSQL and SQL Server), the row size limits of those versions also apply.
We are scraping data online and the data is (relatively to us) growing fast. Each row consists of one big text field (approx. 2000 chars) and a dozen simple text fields (a few words each at most).
We scrape around 1M to 2M rows per week, and this will probably grow to 5-10M rows per week.
Currently we use MongoDB Atlas to store the rows. Until recently we were storing all the information available, but now we have defined a schema and only keep what we need, so the flexibility of a document database is no longer necessary. And since MongoDB pricing grows steeply with storage and tier upgrades, we are looking for another storage solution for that data.
Here is the pipeline: the scrapers send data to MongoDB -> Airbyte periodically replicates the data to BigQuery -> we process the data on BigQuery using Spark or Apache Beam -> we analyze the transformed data with Sisense.
With those requirements in mind, could we use Postgres to replace MongoDB for the "raw" storage?
Does Postgres scale well for that kind of data (we are not even close to big data, but in the near future we will have at least 1TB of data)? We don't plan to use relations in Postgres, so isn't it overkill? We will, however, use array and JSON types, which is why I considered Postgres first (see the schema sketch below). Are there other storage solutions we could use? Also, would it be possible / good practice to store the data directly in BigQuery?
Thanks for the help
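For concreteness, a Postgres table matching the shape described above (one large text field, a dozen short text fields, plus array and JSON columns) might look like the sketch below; every name in it is made up:

import psycopg2

conn = psycopg2.connect("dbname=scraping user=writer host=pg.example.com password=...")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS scraped_items (
            id          BIGSERIAL PRIMARY KEY,
            body        TEXT,                 -- the ~2000-char main text
            title       TEXT,
            source      TEXT,
            category    TEXT,
            -- ...the rest of the dozen short text fields...
            tags        TEXT[],               -- array type mentioned in the question
            extra       JSONB,                -- semi-structured leftovers
            scraped_at  TIMESTAMPTZ NOT NULL DEFAULT now()
        );
    """)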
I would like to automatically stream data from an external PostgreSQL database into BigQuery in my Google Cloud Platform (GCP) account. So far, I have seen that one can query external databases (MySQL or PostgreSQL) with the EXTERNAL_QUERY() function, e.g.:
https://cloud.google.com/bigquery/docs/cloud-sql-federated-queries
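For illustration, a federated query run from Python looks roughly like this; the project, region and connection name below are made-up placeholders:

from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()

# EXTERNAL_QUERY takes a Cloud SQL connection ID and the SQL to run on that
# Cloud SQL instance; both the connection and the inner query are hypothetical.
sql = """
SELECT *
FROM EXTERNAL_QUERY(
  'my-project.us.my-cloudsql-connection',
  'SELECT id, created_at FROM orders;'
)
"""

for row in client.query(sql).result():
    print(dict(row))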
But for that to work, the database has to be in GCP Cloud SQL. I tried to see what options there are for streaming from the external PostgreSQL into a Cloud SQL PostgreSQL database, but I could only find information about replicating it as a one-time copy, not streaming:
https://cloud.google.com/sql/docs/mysql/replication/replication-from-external
The reason I want this streamed into BigQuery is that I am using Google Data Studio to create reports from the external PostgreSQL, which works great, but GDS only accepts SQL query parameters if the data source is a Google BigQuery database. E.g. if we have a table with 1M entries and we want a Google Data Studio parameter to be supplied by the user, this turns into:
SELECT * from table WHERE id=#parameter;
which means that the query will be faster, and won't hit the 100K records limit in Google Data Studio.
What's the best way of creating a connection between an external PostgreSQL (read-only access) and Google BigQuery so that when querying via BigQuery, one gets the same live results as querying the external PostgreSQL?
Perhaps you missed the options stated in the Google Cloud user guide?
https://cloud.google.com/sql/docs/mysql/replication/replication-from-external#setup-replication
Notice that in this section it says:
"When you set up your replication settings, you can also decide whether the Cloud SQL replica should stay in-sync with the source database server after the initial import is complete. A replica that should stay in-sync is online. A replica that is only updated once, is offline."
I suspect online mode is what you are looking for.
What you are looking for will require some architecture design based on your needs, plus some coding. There isn't a feature to automatically sync your PostgreSQL database with BigQuery (apart from the EXTERNAL_QUERY() functionality, which has some limitations: one connection per database, performance, the total number of connections, etc.).
If you are not looking for the data in real time, what you can do is use Airflow, for instance: have a DAG that connects to all your DBs once per day (using KubernetesPodOperator, for example), extracts the data from the past day and loads it into BQ. A typical ETL process, though in this case more EL(T). You can run the process more often if you cannot wait a full day for the previous day's data.
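A minimal sketch of that daily job, using a plain PythonOperator instead of KubernetesPodOperator, Airflow 2.x imports, and entirely made-up connection, table and dataset names:

from datetime import datetime

import psycopg2
from google.cloud import bigquery
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_and_load(ds, **_):
    # Pull the previous day's rows from Postgres (table and columns are hypothetical).
    conn = psycopg2.connect(host="pg.example.com", dbname="app", user="reader", password="...")
    with conn, conn.cursor() as cur:
        cur.execute(
            "SELECT id, payload, created_at FROM events WHERE created_at::date = %s",
            (ds,),
        )
        rows = [
            {"id": r[0], "payload": r[1], "created_at": r[2].isoformat()}
            for r in cur.fetchall()
        ]

    # Append the rows into a BigQuery table (dataset/table are hypothetical).
    client = bigquery.Client()
    if rows:
        errors = client.insert_rows_json("my-project.raw.events", rows)
        if errors:
            raise RuntimeError(f"BigQuery insert errors: {errors}")

with DAG(
    dag_id="postgres_to_bq_daily",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="extract_and_load", python_callable=extract_and_load)

Note that insert_rows_json uses the streaming API; for a large daily batch, a load job (e.g. client.load_table_from_json, or loading files staged in GCS) is usually cheaper.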
On the other hand, if streaming is what you are looking for, then I can think of a Dataflow job. I guess you can connect using a JDBC connector.
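Roughly, a batch pipeline along those lines could use the cross-language JDBC transform from the Beam Python SDK (it starts a Java expansion service under the hood); the connection details, table and BigQuery destination below are assumptions:

import apache_beam as beam
from apache_beam.io.jdbc import ReadFromJdbc

# Runner options (DataflowRunner, project, temp_location, ...) would be passed
# via PipelineOptions when submitting to Dataflow; omitted here for brevity.
with beam.Pipeline() as p:
    (
        p
        | "ReadFromPostgres" >> ReadFromJdbc(
            table_name="events",
            driver_class_name="org.postgresql.Driver",
            jdbc_url="jdbc:postgresql://pg.example.com:5432/app",
            username="reader",
            password="...",
        )
        | "ToDicts" >> beam.Map(lambda row: row._asdict())  # Beam Rows -> plain dicts
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "my-project:raw.events",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,  # table assumed to exist
        )
    )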
In addition, depending on how your pipeline is structured, it might be easier to implement (but harder to maintain) if, at the same moment you write to your PostgreSQL DB, you also stream the data into BigQuery.
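As a sketch of that dual-write idea (table names and the surrounding function are hypothetical):

import psycopg2
from google.cloud import bigquery

bq = bigquery.Client()

def save_event(pg_conn, event: dict):
    # Write to the primary PostgreSQL store first.
    with pg_conn, pg_conn.cursor() as cur:
        cur.execute(
            "INSERT INTO events (id, payload, created_at) "
            "VALUES (%(id)s, %(payload)s, %(created_at)s)",
            event,
        )
    # Then stream the same row into BigQuery. Failures here need their own
    # retry/dead-letter handling, which is the maintenance cost mentioned above.
    errors = bq.insert_rows_json("my-project.raw.events", [event])
    if errors:
        raise RuntimeError(f"BigQuery streaming insert failed: {errors}")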
Not sure if you have tried this already, but instead of adding a parameter, if you add a dropdown filter based on a dimension, Data Studio will push that down to the underlying Postgres db in this form:
SELECT * from table WHERE id=$filter_value;
This should achieve the same results you want without going through BigQuery.
I have 3 questions for the following context:
I'm trying to migrate my historical data from RDS PostgreSQL to S3. I have about a billion rows of data in my database.
Q1) Is there a way for me to tell an AWS Glue job which rows to load? For example, I want it to load data from a certain date onwards. There is no bookmarking feature for a PostgreSQL data source.
Q2) Once my data is processed, the Glue job automatically creates names for the S3 output objects. I know I can specify the path in the DynamicFrame write, but can I specify the object name? If so, how? I cannot find an option for this.
Q3) I tried my Glue job on a sample table with 100 rows of data, and it automatically split the output into 20 files with 5 rows in each. How can I specify the batch size in a job?
Thanks in advance
This is a question I have also posted in the AWS Glue forum; here is a link to that thread: https://forums.aws.amazon.com/thread.jspa?threadID=280743
Glue supports a pushdown predicates feature; however, it currently works with partitioned data on S3 only. There is a feature request to support it for JDBC connections though.
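For data that is already partitioned on S3 and catalogued, the pushdown looks roughly like this (the catalog database, table and partition keys are assumptions):

from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Only partitions matching the predicate are listed and read, which is how you
# would load "from a certain date onwards" for S3-partitioned data.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_database",
    table_name="events",
    push_down_predicate="(year = '2021' and month >= '06')",
)

For a JDBC source such as RDS PostgreSQL you would instead filter after loading (e.g. with the Filter transform or a Spark SQL WHERE clause), which still reads the full table.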
It's not possible to specify the names of the output files. However, it looks like there is an option of renaming the files afterwards (note that renaming on S3 means copying a file from one location to another, so it's a costly and non-atomic operation).
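A rename on S3 is really a copy followed by a delete; with boto3 it would look something like this (bucket and keys are placeholders):

import boto3

s3 = boto3.client("s3")

def rename_object(bucket, old_key, new_key):
    # "Rename" = copy to the new key, then delete the original.
    # (Objects larger than 5 GB need the managed s3.copy / multipart copy instead.)
    s3.copy_object(Bucket=bucket, Key=new_key, CopySource={"Bucket": bucket, "Key": old_key})
    s3.delete_object(Bucket=bucket, Key=old_key)

rename_object("my-bucket", "output/part-00000-abc.parquet", "output/events-2021-06-01.parquet")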
You can't really control the size of the output files. There is an option to reduce the number of output files using coalesce though. Also, starting from Spark 2.2 it's possible to cap the number of records per file by setting the config spark.sql.files.maxRecordsPerFile.
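Both options in a PySpark sketch (paths and numbers are just examples):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Cap the number of records written to any single file (Spark 2.2+).
spark.conf.set("spark.sql.files.maxRecordsPerFile", 100000)

df = spark.read.parquet("s3://my-bucket/input/")  # placeholder input

# coalesce(10) reduces the number of partitions, and therefore output files, to at most 10.
df.coalesce(10).write.mode("overwrite").parquet("s3://my-bucket/output/")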
I am creating a .TDE (Tableau extract) from a table in SQL Server which has 180+ columns and 60 million records; it is taking around 4 hours on the current infrastructure of 16GB RAM and 12 cores.
I am looking for any other way this can be done faster. I would like to know whether loading my data into a column-store DB that can connect to Tableau, and then creating the TDE from the data in that column-store DB, would improve performance.
If yes, please suggest such a column-store DB.
The Tableau SDK is a way to build TDE files without having to use Desktop. You can try it and see if you get better performance.
Does your TDE need all 180+ columns? You can get a noticeable performance improvement if your TDE contains only the columns you need.
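As a rough sketch of both suggestions combined: pull only the needed columns with pyodbc and write them with the (legacy) Tableau SDK Extract API. Exact class and method names can differ between SDK versions, and all connection details, column names and types below are placeholders:

import pyodbc
from tableausdk.Types import Type
from tableausdk.Extract import ExtractAPI, Extract, TableDefinition, Row

ExtractAPI.initialize()

# Select only the columns the extract actually needs.
conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=mydb;UID=user;PWD=...")
cursor = conn.execute("SELECT id, customer_name, amount FROM dbo.sales")

extract = Extract("sales.tde")
schema = TableDefinition()
schema.addColumn("id", Type.INTEGER)
schema.addColumn("customer_name", Type.UNICODE_STRING)
schema.addColumn("amount", Type.DOUBLE)
table = extract.addTable("Extract", schema)  # the table in a TDE must be named "Extract"

for id_, name, amount in cursor:
    row = Row(schema)
    row.setInteger(0, id_)
    row.setString(1, name)
    row.setDouble(2, amount)
    table.insert(row)

extract.close()
ExtractAPI.cleanup()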
Google Cloud SQL has a price for I/O operations.
What is the cost of a DROP DATABASE operation? E.g., is it a function of the size of the database, or a fixed cost?
Similar questions for DROP TABLE as well as deleting an entire instance.
Google Cloud SQL currently uses innodb_file_per_table=OFF, so all the data is stored in the system tablespace. When a large database is dropped, all the associated InnoDB pages are put on the list of free pages. This only requires updating the InnoDB pages that hold the bitmap of free pages, so the number of I/O operations should be small. I just did a test, and dropping a database of 60GiB+ took about 18 seconds.
Dropping a table incurs the same kind of cost.
Deleting an instance doesn't cost anything. :-)
Note that, due to the use of innodb_file_per_table=OFF, the size of the database on disk will not decrease.