I am using Azure PostgreSQL, and I have a lot of files saved as the bytea datatype in a table. In my project, I execute SQL queries to fetch these files.
Sometimes a query will involve multiple files, so the result of a single SQL query can be large. My questions: is there a data size limit on the result of one SQL query? Should I impose a limit here? Any suggestion is appreciated.
There is no limit for the size of a result set in PostgreSQL.
However, many clients cache the whole result set in memory, which can easily lead to an out-of-memory condition on the client side.
There are ways around that:
Use cursors and fetch the result row by row or in batches. That should work with any client API.
With the C API (libpq), you could activate single-row mode.
With JDBC, you could set the fetch size.
Note that this means that you could get a runtime error from the database server in the middle of processing a result set.
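For example, here is a minimal sketch of batched fetching with a server-side (named) cursor in Python using psycopg2; the connection string, the files table, its id/file_data columns, and the handle_file function are placeholders for your own schema and processing code:

```python
import psycopg2

conn = psycopg2.connect("host=myserver.postgres.database.azure.com "
                        "dbname=mydb user=myuser password=secret")

with conn:
    # A named cursor is a server-side cursor: rows are pulled from the
    # server in batches of itersize instead of the whole result set
    # being cached in client memory.
    with conn.cursor(name="file_stream") as cur:
        cur.itersize = 100
        cur.execute("SELECT id, file_data FROM files WHERE id = ANY(%s)",
                    ([1, 2, 3],))
        for file_id, file_data in cur:
            handle_file(file_id, file_data)  # your own processing function
```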
I want to write a Spark pipeline to perform aggregation on my production DB data and then write the results back to the DB. My goal is to perform the aggregation without impacting the production DB while the pipeline runs, meaning I don't want users to experience lag, nor the DB to see heavy IOPS, while the aggregation is performed. For example, an equivalent aggregation query run as plain SQL would take a long time and also use up the RDS IOPS, leaving users unable to get data - this is what I am trying to avoid. A few questions:
How is data loaded into Spark (AWS Glue) in general? Is there query load on prod DB?
Is there a difference in using a custom SQL query vs custom Spark code to filter items initially (initial loading of data, e.g. load 30 days sales data)? For example, does using custom SQL query end up performing a query on the prod DB, resulting in large load on prod DB?
When writing data back to DB, does that incur load on DB as well?
I'm using a PostgreSQL database in case this matters.
How is data loaded into Spark (AWS Glue) in general? Is there query load on prod DB?
By default, Glue reads the whole table into a single partition. But you can configure parallel reads by partitioning the JDBC read on a column; make sure to choose a column that will not hurt DB performance when it is range-scanned.
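For illustration, a minimal PySpark sketch of such a partitioned JDBC read (this also works inside a Glue job); the connection details, the sales table, and the id bounds are made-up placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Spark issues numPartitions range queries over partitionColumn, so the read
# runs in parallel instead of as one big single-partition scan.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://prod-db:5432/mydb")
      .option("dbtable", "sales")
      .option("user", "report_user")
      .option("password", "secret")
      .option("partitionColumn", "id")   # an indexed numeric column
      .option("lowerBound", "1")
      .option("upperBound", "50000000")
      .option("numPartitions", "8")      # 8 parallel range queries
      .load())
```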
Is there a difference in using a custom SQL query vs custom Spark code to filter items initially (initial loading of data, e.g. load 30 days sales data)?
Yes. When you pass a query instead of a table, you read only the result of that query from the DB, which reduces the large network and I/O transfer. This means you are delegating the calculation of the result to the DB engine.
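As a hedged sketch of that pushdown in PySpark: the dbtable option accepts a parenthesized subquery, so only the filtered rows (here, a hypothetical 30-day window over a sales table) ever leave the database:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The subquery is executed by the DB engine; Spark only receives its result.
recent_sales = """
    (SELECT * FROM sales
     WHERE sale_date >= now() - interval '30 days') AS recent_sales
"""

df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://prod-db:5432/mydb")
      .option("dbtable", recent_sales)
      .option("user", "report_user")
      .option("password", "secret")
      .load())
```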
For example, does using custom SQL query end up performing a query on the prod DB, resulting in large load on prod DB?
Yes. Depending on the table size and the query complexity, this might affect DB performance; if you have a read replica, you can simply run the read against that instead.
When writing data back to DB, does that incur load on DB as well?
Yes, and it depends on how you write the result back to the DB. A moderate number of write partitions is best: too many means many concurrent connections hammering the DB, too few means a slow, serialized write.
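As a rough PySpark sketch of keeping the write-side load moderate (aggregated_df and the target table name are placeholders for your own aggregation result):

```python
# aggregated_df is the DataFrame produced by your aggregation step.
# Repartitioning to a small, fixed number of partitions means the DB sees
# only a few concurrent INSERT streams; batchsize controls rows per round trip.
(aggregated_df
 .repartition(4)
 .write
 .format("jdbc")
 .option("url", "jdbc:postgresql://prod-db:5432/mydb")
 .option("dbtable", "sales_daily_agg")
 .option("user", "report_user")
 .option("password", "secret")
 .option("batchsize", "10000")
 .mode("append")
 .save())
```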
I have a procedure in Oracle PL/SQL which fetches transactional data based on certain conditions and then performs some logical calculations. I used a cursor for the SQL, then FETCH (cursor) BULK COLLECT INTO (table type variable) LIMIT 10000, iterated over this table variable to perform the calculations, and ultimately stored the values in a DB table. Once 10000 rows have been processed, the fetch is executed again to get the next set of records.
This helped me limit the number of times the SQL is fetched via the cursor and the number of records loaded into memory at once.
I am trying to migrate this code to plpgsql. How can I achieve this functionality in plpgsql?
You cannot achieve this functionality in PostgreSQL.
I wrote an extension, https://github.com/okbob/dbms_sql . It can be used to reduce the work needed when migrating from Oracle to Postgres.
But you don't need this feature in Postgres. Although PL/pgSQL is similar to PL/SQL, the architecture is very different - and bulk collect operations are not necessary. A plain FOR ... IN SELECT loop in PL/pgSQL already uses a cursor internally and fetches rows in small batches, so memory use stays bounded without any explicit BULK COLLECT.
My SSIS package has an OLE DB source which joins Oracle and SQL Server data and loads it into a SQL Server OLE DB destination. Earlier we were using a linked server for this purpose, but we cannot use the linked server anymore.
So I am pulling the data from SQL Server and want to feed it into the IN clause of the Oracle query, which I am keeping as the SQL command of the OLE DB source.
I tried passing an object-type variable from SQL Server into the IN clause of the Oracle query in the OLE DB source, but I get an error that Oracle cannot have more than 1000 literals in an IN list. So basically I think I have to do something like this:
select * from oracle.db where id in (select id from sqlserver.db).
Since I cannot use a linked server, I was thinking I could have a temp table that can be used throughout the package.
I also tried using a Merge Join in SSIS, but my source data set is really large and the merge join is returning fewer rows than expected. I am badly stuck at this point. I have tried a number of things and nothing seems to be working.
Can someone please help? Any help will be greatly appreciated.
A couple of options to try.
Lookup:
My first instinct was a Lookup Task, but that might not be a great solution depending on the size of your data sets, since all of the records from both tables have to be pulled over the wire and stored in memory on the SSIS server. But if you were able to pull off a Merge Join, a Lookup should also work, though it might be slow.
Set an OLE DB Source to pull the Oracle data, without the WHERE clause.
Set a Lookup to pull the id column from your SQL Server table.
On the General tab of the Lookup, under Specify how to handle rows with no matching entries, select Redirect rows to no-match output.
The match output of the Lookup will then contain just the Oracle rows that found a matching row in your SQL Server query.
Working Table on the Oracle server
If you have the option of creating a table in the Oracle database, you could create a Data Flow Task to pipe the results of your SQL Server query into a working table on the Oracle box. Then, in a subsequent Data Flow, just construct your Oracle query to use that working table as a filter.
Probably follow that up with an Execute SQL Task to truncate that working table.
Although this requires write access to Oracle, it has the advantage of off-loading the heavy lifting of the query to the database machine, and only pulling the rows you care about over the wire.
When streaming large volumes of data out of PostgreSQL into C#, using Npgsql, does the command default to Single Row Mode or will it extract the entire result set before returning? I can find no mention of Single Row Mode in the Npgsql documentation, and nothing in the source code to suggest that it is optional one way or the other.
When Npgsql sends the SQL query you give it, PostgreSQL will immediately send back all the rows. If you pass CommandBehavior.SingleRow (or SingleResult) to NpgsqlCommand.ExecuteReader, Npgsql will simply not return those rows to the user; it will consume them internally, but they are still sent from the server. In other words, if you expect these options to reduce the network bandwidth used, that won't work; your only way to do that is to limit the resultset in the SQL itself, via a LIMIT clause. This is in general a better idea anyway.
See https://github.com/npgsql/npgsql/issues/410 for a bit more detail on why we didn't implement something more aggressive.
In my experience, the default in Npgsql is effectively a streaming read of the result set: when you invoke reader.Read(), a row is delivered from the server to the client driver as you process it. There might be some buffering taking place, but streaming the result is the norm.
I am new to Tableau, am having performance issues, and need some help. I have a query that joins several large tables. I am using a live data connection to a MySQL db.
The issue I am having is that Tableau is not applying the filter criteria before asking MySQL for the data. It is essentially doing a SELECT * for my query without applying the filter criteria to the WHERE clause. It pulls all the data from the MySQL db back to Tableau and then throws away the unneeded data based on my filter criteria. My two main filter criteria are on account_id and a date range.
I can cleanly get a list of accounts by doing a select from my account table to populate the filter list; I then need to know how to apply that selection when Tableau pulls the data for the main query from MySQL.
To apply a filter at the data source first, try using context filters.
Performance can also be improved by using extracts.
I would personally use an extract: go into your MySQL back end and run the query as a CREATE TABLE extract1 AS ... statement (or whatever you want to call your data table).
When you import this table into Tableau, the workbook will already have the SELECT * of your aggregated data. From here your query efficiency will be increased tenfold.
Unfortunately, it is still going to take a while overall: Tableau processing time plus the MySQL backend query time add up.
Try the extracts...
I've been struggling with the very same thing. I have found that Tableau extracts aren't any faster than pulling directly from a SQL table. What I have done is create tables within SQL that already contain the filtered data, so the SELECT * returns only the needed data. The downside is that this takes up more space on the server, but that isn't a problem on my side.
For large data sets, Tableau recommends using an extract.
An extract creates a snapshot of the data you are connected to, and processing that data will be faster than over a live connection.
All the charts and visualizations will load faster, saving you time each time you open the dashboard.
The filters you apply to the data set will also work faster over an extract connection. But to get the latest data, you have to refresh the extract or schedule a refresh on the server (if you are publishing the report to the server).
There are multiple types of filters available in Tableau, and which to use depends on your application; context filters and global filters can be used to filter the whole data set.