Power BI - Data Load Error- OLE DB or ODBC error: [DataSource.Error] PostgreSQL: Exception while reading from stream - postgresql

My question might look similar to some earlier posts, but none of the solutions explains the root cause of this behavior.
Let me explain what I have done so far:
I am connecting to a Postgres DB (running in our company's AWS environment) via my Power BI Desktop client. The connection setup was pretty easy and I am able to see all the tables in the DB.
For two of my tables, which are extremely big, I get the error message below when I try to load the data:
Data Load Error- OLE DB or ODBC error: [DataSource.Error] PostgreSQL: Exception while reading from stream.
I tried changing the CommandTimeout parameter in the initial M query -- didn't help.
I tried writing a native query with select * and a where clause (using a parameter) -- it worked.
Question:
When Power BI starts loading the data without any parameter, it does start extracting some thousands of records, but then gets interrupted and throws the error above. Is a limit on the database server side being hit, or is this a limitation of Power BI?
What can I change on the database server side if I don't want to pull the data using parameters (since in the end I need all the data for my reports)?
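For what it's worth, the server-side timeouts that could explain a read being cut off mid-stream can be checked directly. A minimal diagnostic sketch with psycopg2 (host, database and credentials are placeholders):
import psycopg2

conn = psycopg2.connect(
    host="my-postgres.example.com",  # placeholder
    dbname="mydb",
    user="report_user",
    password="secret",
)
# Settings that most often cut off long-running reads;
# 0 means the limit is disabled (or, for keepalives, that the OS default is used).
settings = [
    "statement_timeout",
    "idle_in_transaction_session_timeout",
    "tcp_keepalives_idle",
]
with conn, conn.cursor() as cur:
    for name in settings:
        cur.execute("SHOW " + name)
        print(name, "=", cur.fetchone()[0])
conn.close()
If statement_timeout is non-zero, or a load balancer or proxy between Power BI and the database has a short idle timeout, a long full-table load can be dropped mid-stream, which would match this symptom.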

Related

How to truncate the MySQL performance_schema in Google Cloud SQL without restarting DB instance?

According to the MySQL documentation we can truncate the performance_schema with the help of the following call:
CALL sys.ps_truncate_all_tables(FALSE);
Internally, this procedure simply executes TRUNCATE TABLE statements against the list of performance_schema tables whose names match the masks '%summary%' and '%history%'.
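For reference, the effect of the procedure can be approximated from a client session roughly like this (a sketch using PyMySQL; host and credentials are placeholders):
import pymysql

conn = pymysql.connect(host="my-cloudsql-ip", user="root", password="secret")
with conn.cursor() as cur:
    # list the summary/history tables, as the procedure does ...
    cur.execute(
        "SELECT table_name FROM information_schema.tables "
        "WHERE table_schema = 'performance_schema' "
        "AND (table_name LIKE '%summary%' OR table_name LIKE '%history%')"
    )
    # ... and truncate each one (this is the step that Cloud SQL blocks, see below)
    for (table_name,) in cur.fetchall():
        cur.execute(f"TRUNCATE TABLE performance_schema.`{table_name}`")
conn.close()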
The problem is that the root user isn't able to perform the TRUNCATE TABLE statement on the performance_schema database in Google Cloud SQL due to superuser restrictions:
mysql> truncate table performance_schema.events_statements_summary_by_digest ;
ERROR 1227 (42000): Access denied; you need (at least one of)
the SUPER privilege(s) for this operation
I didn't find any Cloud SQL Admin API or other method to do this.
Any advice on how to reset the MySQL performance_schema in Google Cloud SQL without restarting the DB instance?
UPD: I have found that this does not work for MySQL 5.7 but works well for MySQL 8.0 in Google Cloud SQL, so if you can migrate your Cloud SQL instance to MySQL 8.0, that would be a workaround.
Currently there are no workarounds other than restarting the DB instance. There is already a feature request raised for this; you can +1 and CC yourself on the request to show interest in it being implemented and to receive an email in case there are any updates.
In case you want to use performance_schema for SQL query analysis, you can use Cloud SQL Query Insights as an alternative. Query Insights helps you detect, diagnose, and prevent query performance problems for Cloud SQL databases. It supports intuitive monitoring and provides diagnostic information that helps you go beyond detection to identify the root cause of performance problems.
Or you can contact Google support; the Product Team may offer the stored procedure solution.

RDS Data API BatchExecute taking significantly longer than standard connection

I have an AWS Lambda function that needs to insert several thousand rows of data into an RDS PostgreSQL database within a Serverless cluster. Previously I used a normal database connection with psycopg2, but I switched to the RDS Data API in order to improve performance. However, using the Data API, BatchExecute exceeds the 5-minute Lambda limit and still fails to commit the transaction in that time. Meanwhile, the psycopg2 solution, which uses a different transfer protocol, inserts all the data in under 30 seconds.
How is this possible? Shouldn't the Data API give superior performance, since it doesn't need to establish a connection? Can I change any settings to make the RDS Data API perform suitably?
I don't believe I am reaching any of the data size limits, because the Lambda times out rather than explicitly throwing an error. Also, I know that the connection is succeeding, as other small queries execute successfully.
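For context, the two insert paths being compared look roughly like this (a sketch; the ARNs, connection details and table are placeholders):
import boto3
import psycopg2
from psycopg2.extras import execute_values

rows = [(i, f"name-{i}") for i in range(5000)]

def chunks(seq, size):
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

# 1) RDS Data API: one HTTPS call per BatchExecuteStatement, each carrying a
#    list of parameter sets (chunked here to keep individual requests small).
client = boto3.client("rds-data")
for batch in chunks(rows, 500):
    client.batch_execute_statement(
        resourceArn="arn:aws:rds:us-east-1:123456789012:cluster:my-cluster",
        secretArn="arn:aws:secretsmanager:us-east-1:123456789012:secret:my-secret",
        database="mydb",
        sql="INSERT INTO items (id, name) VALUES (:id, :name)",
        parameterSets=[
            [
                {"name": "id", "value": {"longValue": i}},
                {"name": "name", "value": {"stringValue": n}},
            ]
            for i, n in batch
        ],
    )

# 2) psycopg2: a direct wire-protocol connection; execute_values folds the rows
#    into a few multi-row INSERT statements.
conn = psycopg2.connect(host="my-cluster.cluster-xyz.us-east-1.rds.amazonaws.com",
                        dbname="mydb", user="app", password="secret")
with conn, conn.cursor() as cur:
    execute_values(cur, "INSERT INTO items (id, name) VALUES %s", rows)
conn.close()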

streaming PostgreSQL tables into Google BigQuery

I would like to automatically stream data from an external PostgreSQL database into a Google Cloud Platform BigQuery database in my GCP account. So far, I have seen that one can query external databases (MySQL or PostgreSQL) with the EXTERNAL_QUERY() function, e.g.:
https://cloud.google.com/bigquery/docs/cloud-sql-federated-queries
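The federated query itself looks roughly like this (a sketch using the google-cloud-bigquery client; the connection ID and the inner query are placeholders):
from google.cloud import bigquery

client = bigquery.Client()
sql = """
SELECT *
FROM EXTERNAL_QUERY(
  'my-project.us.my-cloudsql-connection',
  'SELECT id, name, created_at FROM orders;'
)
"""
for row in client.query(sql).result():
    print(dict(row))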
But for that to work, the database has to be in GCP Cloud SQL. I tried to see what options there are for streaming from the external PostgreSQL into a Cloud SQL PostgreSQL database, but I could only find information about replicating it as a one-time copy, not streaming:
https://cloud.google.com/sql/docs/mysql/replication/replication-from-external
The reason why I want this streamed into BigQuery is that I am using Google Data Studio to create reports from the external PostgreSQL, which works great, but GDS can only accept SQL query parameters if the data comes from a Google BigQuery database. E.g. if we have a table with 1M entries and we want a Google Data Studio parameter to be added by the user, this turns into:
SELECT * from table WHERE id=#parameter;
which means that the query will be faster, and won't hit the 100K records limit in Google Data Studio.
What's the best way of creating a connection between an external PostgreSQL (read-only access) and Google BigQuery so that when querying via BigQuery, one gets the same live results as querying the external PostgreSQL?
Perhaps you missed the options stated in the Google Cloud user guide?
https://cloud.google.com/sql/docs/mysql/replication/replication-from-external#setup-replication
Notice in this section, it says:
"When you set up your replication settings, you can also decide whether the Cloud SQL replica should stay in-sync with the source database server after the initial import is complete. A replica that should stay in-sync is online. A replica that is only updated once, is offline."
I suspect online mode is what you are looking for.
What you are looking for will require some architecture design based on your needs, plus some coding. There isn't a feature to automatically sync your PostgreSQL database with BigQuery (apart from the EXTERNAL_QUERY() functionality, which has some limitations: one connection per DB, performance, total number of connections, etc.).
In case you are not looking for the data in real time, what you can do is, with Airflow for instance, have a DAG that connects to all your DBs once per day (using KubernetesPodOperator for instance), extracts the data (from the past day) and loads it into BQ. A typical ETL process, but in this case more EL(T). You can run this process more often if you cannot wait one day for the previous day's data.
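A rough sketch of that daily EL(T) step, using a PythonOperator instead of KubernetesPodOperator to keep it short (connection details, table names and the BigQuery destination are placeholders):
from datetime import datetime

import psycopg2
from airflow import DAG
from airflow.operators.python import PythonOperator
from google.cloud import bigquery


def extract_and_load(ds, **_):
    # 'ds' is the logical date of the DAG run (YYYY-MM-DD), i.e. the past day.
    conn = psycopg2.connect(host="external-pg.example.com", dbname="appdb",
                            user="readonly", password="secret")
    with conn, conn.cursor() as cur:
        cur.execute("SELECT id, created_at, amount FROM orders "
                    "WHERE created_at::date = %s", (ds,))
        rows = [{"id": r[0], "created_at": r[1].isoformat(), "amount": float(r[2])}
                for r in cur.fetchall()]
    conn.close()

    bq = bigquery.Client()
    job_config = bigquery.LoadJobConfig(autodetect=True, write_disposition="WRITE_APPEND")
    bq.load_table_from_json(rows, "my-project.reporting.orders", job_config=job_config).result()


with DAG(dag_id="pg_to_bq_daily", start_date=datetime(2021, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    PythonOperator(task_id="extract_and_load", python_callable=extract_and_load)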
On the other hand, if streaming is what you are looking for, then I can think of a Dataflow job. I guess you can connect using a JDBC connector.
In addition, depending on how your pipeline is structured, it might be easier to implement (but harder to maintain) if, at the moment you write to your PostgreSQL DB, you also stream the same data into BigQuery.
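That dual-write pattern could look roughly like this (a sketch; names are placeholders, and it glosses over what to do when one of the two writes fails):
import psycopg2
from google.cloud import bigquery

pg = psycopg2.connect(host="external-pg.example.com", dbname="appdb",
                      user="app", password="secret")
bq = bigquery.Client()


def save_order(order):
    # 1) write to the operational PostgreSQL database
    with pg, pg.cursor() as cur:
        cur.execute("INSERT INTO orders (id, amount) VALUES (%s, %s)",
                    (order["id"], order["amount"]))
    # 2) stream the same row into BigQuery for reporting
    errors = bq.insert_rows_json("my-project.reporting.orders", [order])
    if errors:
        raise RuntimeError(f"BigQuery streaming insert failed: {errors}")


save_order({"id": 123, "amount": 9.99})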
Not sure if you have tried this already, but instead of adding a parameter, if you add a dropdown filter based on a dimension, Data Studio will push that down to the underlying Postgres db in this form:
SELECT * from table WHERE id=$filter_value;
This should achieve the same results you want without going through BigQuery.

Need to join oracle and sql server tables in oledb source without using linked server

My SSIS package has an OLE DB source which joins Oracle and SQL Server data and loads it into a SQL Server OLE DB destination. Earlier we were using a linked server for this purpose, but we cannot use a linked server anymore.
So I am taking the data from SQL Server and want to pass it into the IN clause of the Oracle query, which I am keeping as a SQL command in the OLE DB source.
I tried passing an object-type variable from SQL Server and putting it into the IN clause of the Oracle query in the OLE DB source, but I get an error that Oracle cannot have more than 1000 literals in the IN statement. So basically I think I have to do something like this:
select * from oracle.db where id in (select id from sqlserver.db).
Since I cannot use a linked server, I was wondering if I could have a temp table which can be used throughout the package.
I also tried using a Merge Join in SSIS, but my source data set is really large and the merge join returns fewer rows than expected. I am badly stuck at this point; I have tried a number of things and nothing seems to be working.
Can someone please help? Any help will be greatly appreciated.
A couple of options to try.
Lookup:
My first instinct was a Lookup task, but that might not be a great solution depending on the size of your data sets, since all of the records from both tables have to be pulled over the wire and stored in memory on the SSIS server. But if you were able to pull off a Merge Join, then a Lookup should also work, though it might be slow.
Set an OLE DB Source to pull the Oracle data, without the WHERE clause.
Set a Lookup to pull the id column from your SQL Server table.
On the General tab of the Lookup, under Specify how to handle rows with no matching entries, select Redirect rows to no-match output.
The output of the Lookup will just be the Oracle rows that found a matching row in your SQL Server query.
Working Table on the Oracle server
If you have the option of creating a table in the Oracle database, you could create a Data Flow Task to pipe the results of your SQL Server query into a working table on the Oracle box. Then, in a subsequent Data Flow, just construct your Oracle query to use that working table as a filter.
Probably follow that up with an Execute SQL Task to truncate that working table.
Although this requires write access to Oracle, it has the advantage of off-loading the heavy lifting of the query to the database machine, and only pulling the rows you care about over the wire.
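As an aside, the 1000-literal limit applies per IN list; outside of SSIS it is commonly worked around by splitting the IDs into chunks of at most 1000 and OR-ing the lists together, roughly like this sketch (cx_Oracle; connection details, table and column names are placeholders):
import cx_Oracle

ids = list(range(1, 25001))  # ids pulled from SQL Server (placeholder data)

def chunks(seq, size=1000):
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

# Build "id IN (...) OR id IN (...) OR ..." with at most 1000 literals per list,
# which stays under Oracle's ORA-01795 limit.
predicates = ["id IN (" + ",".join(str(i) for i in chunk) + ")" for chunk in chunks(ids)]
sql = "SELECT * FROM my_table WHERE " + " OR ".join(predicates)

conn = cx_Oracle.connect(user="scott", password="tiger", dsn="oraclehost/ORCLPDB1")
with conn.cursor() as cur:
    cur.execute(sql)
    rows = cur.fetchall()
conn.close()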

Viewing tableau server when one data source is missing

I have a dashboard in Tableau which pulls data from about 10 tables in a SQL database.
These tables are refreshed at various times of day. There are occasions when one of them is not available (or has been deleted and is awaiting a rebuild).
However, when I open my Tableau dashboard on the server it won't let me see any of it. Not seeing the data from the missing table is fine, but the majority of the data, which does not come from that table, is unavailable too.
I get this error
An unexpected error occurred. If you continue to receive this error please contact your Tableau Server Administrator.
TableauException: [Microsoft][SQL Server Native Client 11.0][SQL Server]Invalid object name 'dbo.survey_order_info_fy16_TV_L'. The table "[dbo].[survey_order_info_fy16_TV_L]" does not exist. Unable to connect to the server "dbedwro.vistaprint.net". Check that the server is running and that you have access privileges to the requested database.
"survey_order_info_fy16_TV_L" being the missing table but not one I'm bothered about right now.
Is there an option that might help me see all the other data?
I am not sure if it's possible to avoid this behavior.
If there isn't, there is a workaround: create extracts of these tables and store them on Tableau Server. You can then use these extracts instead of the tables in the DB and refresh them either on a schedule (if you know when the tables are available again) or from the SQL server (e.g. with SSIS, by triggering the refresh once the data is available again).
Advantages of that would be:
you can refresh them independently and always have the latest data
it performs better than an SQL connection
you don't jam your SQL server with connections (in case you have a lot of users accessing it)
you can filter and select fields if you don't want your users to have access to the full dataset
disadvantages:
you will have to create one extract per table, and replace all data sources in workbooks you already use
It's a matter of creating a workbook, connecting to the source (adding filters or hiding fields) and publishing it to the server. Details of that can be found here:
http://onlinehelp.tableau.com/current/pro/online/mac/en-us/publish_datasources.html
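If you want to trigger the extract refresh from code once the tables are rebuilt (as suggested above), a sketch using the tableauserverclient library might look like this (server URL, credentials and data source name are placeholders):
import tableauserverclient as TSC

auth = TSC.TableauAuth("refresh_user", "secret", site_id="")
server = TSC.Server("https://tableau.example.com", use_server_version=True)

with server.auth.sign_in(auth):
    datasources, _ = server.datasources.get()
    target = next(ds for ds in datasources if ds.name == "my_extracted_datasource")
    job = server.datasources.refresh(target)  # queues an extract refresh job
    print("Refresh job queued:", job.id)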