I'm developing a collection of foreign data wrappers using multicorn and I've run into an issue with batching data.
So, I have two foreign tables, search and data, that are each backed by a foreign data wrapper that I'm writing.
I need to do a basic join on these tables:
SELECT data.*
FROM search, data
WHERE search.data_id = data.id
AND search.term = 'search for this pls'
This works, but there's a hitch in the data fdw being able to batch queries to the server. If the search table returns 5 ids for a given search, then the data fdw is executed once for each of those ids. The API backing the data fdw is capable of processing many ids in one request.
The following works:
SELECT data.*
FROM data
WHERE id in ('2244', '31895')
In this case the data fdw receives an array of both ids and is able to perform one request.
Is there any way to make the join work where the data fdw has the opportunity to batch ids for a request?
Thanks!
You should look at the EXPLAIN output for your query, and then you'll probably see that PostgreSQL is performing a nested loop join, i.e. it scans search for the matching rows, and for each result row scans data for matching rows.
PostgreSQL has other join strategies like hash joins, but for that it would have to read the whole data table, which is probably not a win.
You might want to try it by setting enable_nestloop to off and testing query performance. If that is an improvement, you might want to adjust the cost values for the foreign table scan on data to reflect the high “startup costs” so that the planner becomes more reluctant to choose a nested loop join.
There is no such join strategy as you propose – while it may well be a win for FDW joins, it does not offer advantages in regular joins. So if the join strategy you envision is really the optimal one, you'd have to first fetch the data_ids from search, construct a query for data and implement the join in the application.
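A minimal sketch of that application-side join, in Python. It assumes a DB-API connection; sqlite3 and the payload column stand in for PostgreSQL and the real foreign tables purely so the example is self-contained, and the function names are hypothetical.

```python
import sqlite3


def fetch_data_for_term(conn, term):
    """Join search and data in the application using two queries."""
    # Step 1: fetch the matching data_ids from search.
    ids = [row[0] for row in conn.execute(
        "SELECT data_id FROM search WHERE term = ?", (term,))]
    if not ids:
        return []
    # Step 2: build a single IN query so data is asked for all ids at once.
    placeholders = ", ".join("?" for _ in ids)
    sql = f"SELECT id, payload FROM data WHERE id IN ({placeholders}) ORDER BY id"
    return conn.execute(sql, ids).fetchall()


def demo():
    # In-memory stand-in tables; the rows are made up for illustration.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE search (term TEXT, data_id TEXT);
        CREATE TABLE data (id TEXT PRIMARY KEY, payload TEXT);
        INSERT INTO search VALUES ('search for this pls', '2244'),
                                  ('search for this pls', '31895'),
                                  ('something else', '7');
        INSERT INTO data VALUES ('2244', 'a'), ('31895', 'b'), ('7', 'c');
    """)
    return fetch_data_for_term(conn, "search for this pls")
```

The second statement has the same shape as the IN query from the question, which the data fdw was able to serve with one request. Against PostgreSQL with psycopg2 the placeholders would be %s rather than ?.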
When should I use "eclipselink.join-fetch", when should I use "eclipselink.batch" (batch type = IN)?
Is there any limitations for join fetch, such as the number of tables being fetched?
The answer is always specific to your query, the specific use case, and the database, so there is no hard rule on when to use one over the other, or whether to use either at all. You cannot determine what to use unless you are serious about performance and willing to test both under production load conditions, just like any other query performance tuning.
Join-fetch is just what it says, causing all the data to be brought back in the one query. If your goal is to reduce the number of SQL statements, it is perfect. But it comes at costs, as inner/outer joins, cartesian joins etc can increase the amount of data being sent across and the work the database has to do.
Batch fetching is one extra query (1+1), and can be done a number of ways. IN collects all the foreign key values and puts them into one statement (more if you have >1000 on Oracle). Join is similar to fetch join, as it uses the criteria from the original query to select over the relationship, but won't return as much data, as it only fetches the required rows. EXISTS is very similar, using a subquery to filter.
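The IN strategy can be illustrated with a short Python sketch. This is not EclipseLink's implementation, only the idea: collect the foreign keys from the parent rows, de-duplicate them, and issue one IN query per chunk. The default of 1000 mirrors the Oracle limit mentioned above; the function, table, and column names are made up.

```python
def build_in_batches(table, column, keys, max_in=1000):
    """Return (sql, params) pairs, one per chunk of at most max_in keys."""
    unique = list(dict.fromkeys(keys))  # de-duplicate, keep first-seen order
    statements = []
    for start in range(0, len(unique), max_in):
        chunk = unique[start:start + max_in]
        placeholders = ", ".join(["?"] * len(chunk))
        sql = f"SELECT * FROM {table} WHERE {column} IN ({placeholders})"
        statements.append((sql, chunk))
    return statements
```

With 2500 distinct keys this yields three statements instead of 2500 single-row lookups.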
I'm a newbie on database design hoping to learn by experimenting and implementing, and I'm sure some version of this question has been asked within database design in general, but this is specific to Tableau.
I have a few dashboards that draw from a PostgreSQL database table containing several million rows of data. Rerendering views is quite slow (i.e., if I select a different parameter, Tableau's Executing SQL query popup appears and often takes several minutes to complete).
I debugged this by using the performance recording option within Tableau, exporting the SQL queries that Tableau is using to a text file, and then using EXPLAIN ANALYZE to figure out what exactly the bottlenecks were. Unfortunately, I can't post the SQL queries themselves, but I've created a contrived case below to be as helpful as possible. Here is what my table looks like currently. The actual values being rendered by Tableau are in green, and the columns that I have foreign key references on are in yellow:
I see in the query plan that there are a lot of costly bitmap heap scans implementing the filters I am using on the Tableau frontend on neighborhood_id, view, animal, date_updated, animal_name.
I've attempted to place a multicolumn index on these fields, but upon rerunning the queries, it does not look like the PG query planner is electing to use the index.
So my proposed solution is to create foreign key references for each of these fields (neighborhood_id, view, animal, date_updated, animal_name)- again, yellow represents a FK reference:
My hope is that these FK references will force the query planner to use an index scan instead of a sequential scan or bitmap heap scan. However, my questions are:
Before, all the data was more or less stored in this one table, with two joins to the shelter and age_of_animal tables. Now, this table will be joined to 8 smaller subtables. Will these joins drastically reduce performance? The subtables are quite small (e.g., the animal table will have only 40 entries).
I know the question is difficult to answer without seeing the actual query and query plan, but what are some high-level reasons why the query planner would elect not to use an index? I've read through some articles like "Why Postgres Won't Always Use An Index", but mostly they refer to cases where it's a small table and a simple query, where the cost of the index lookup is greater than simply traversing the rows. I don't think that applies to my case though: I have millions of rows and a complex filter on 5+ columns.
Is the PG query planner any more likely to use multicolumn indices on a collection of foreign key columns versus regular columns? I know that PG does not automatically add indices on foreign keys, so I imagine I'll still need to add indices after creating the foreign key references.
Of course, the answers to my questions could be "Why don't you just try it and see?", but in this case refactoring such a large table is quite costly and I'd like some intuition on whether it's worth the cost prior to undertaking it.
I have been evaluating migration of our datastore from MongoDB to DynamoDB, since it is a well established AWS service.
However, I am not sure if the DynamoDB data model is robust enough to support our use cases. I understand that DynamoDB added document support in 2014, but the examples I have seen do not seem to address queries that work across documents and do not specify a value for the partition key.
For instance if I have a document containing employee info,
{
"name": "John Doe",
"department": "sales",
"date_of_joining": "2017-01-21"
}
and I need to run a query like "give me all the employees who joined after 01-01-2016", then I can't do it with this schema.
I might be able to run this query after creating a secondary index that has a randomly generated partition key (say 0-99) and a sort key on "date_of_joining", then querying all the partitions with a condition on "date_of_joining". But this is too complex a way to do a simple query; doing something like this in MongoDB is quite straightforward.
Can someone help with understanding if there is a better way to do such queries in DynamoDB and is DynamoDB really suited for such use cases?
Actually, the partition key of the GSI need not be unique. You can have date_of_joining as a partition key of GSI.
However, when you query on the partition key, you cannot use greater than on the partition key field. Only equality is supported for the partition key. I am not sure why you wanted a random number as the partition key of the GSI and date_of_joining as the sort key. Even with that design, I don't think a single DynamoDB Query call will give you the expected result, because each Query must specify exactly one partition key value; you would have to issue one Query per partition value. Otherwise you may end up using the DynamoDB Scan API, which is a costly operation in DynamoDB.
GSI:
date_of_joining - as Partition key
Supported in Query API:-
If you have multiple items for the same DOJ, the result will have multiple items (i.e. when you query using the GSI).
KeyConditionExpression : 'date_of_joining = :doj'
Not supported in Query API:-
KeyConditionExpression : 'date_of_joining > :doj'
Conclusion:-
You need to use DynamoDB Scan. If you are going to use Scan, then GSI may not be required. You can directly scan the main table using FilterExpression.
FilterExpression : 'date_of_joining > :doj'
Disadvantage:-
Costly
Not efficient
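Why the Scan approach is costly can be seen in a small in-memory model, sketched here in Python. This is not the DynamoDB API; it only mimics the behaviour that a filter expression is applied after each item has been read, so the read cost grows with the table size rather than with the number of matches. The function name and the extra employees are made up.

```python
def scan_with_filter(items, doj_after):
    """Model of Scan + FilterExpression: read every item, then filter."""
    examined = 0
    matched = []
    for item in items:
        examined += 1  # the read is paid for whether or not the item matches
        # ISO-8601 dates (YYYY-MM-DD) compare correctly as plain strings.
        if item["date_of_joining"] > doj_after:
            matched.append(item)
    return matched, examined


EMPLOYEES = [
    {"name": "John Doe", "department": "sales", "date_of_joining": "2017-01-21"},
    {"name": "Jane Roe", "department": "sales", "date_of_joining": "2015-06-01"},
    {"name": "Sam Poe", "department": "hr", "date_of_joining": "2016-03-15"},
]
```

Calling scan_with_filter(EMPLOYEES, "2016-01-01") returns two employees but examines all three items.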
You might decide to support your range queries with an indexing backend. For example, you could stream your table updates in DynamoDB to AWS ElasticSearch with a Lambda function, and then query ES for records matching the range of join dates you choose.
I want to store a list of users in a Cassandra Column Family(Wide rows).
The columns in the CF will have Composite Keys of pattern id:updated_time:name:score
After inserting all the users, I need to query users in a different sorted order each time.
For example, if I specify updated_time, I should be able to fetch the 10 most recent users.
And, if I specify score, then I should be able to fetch the top 10 users based on score.
Does Cassandra support this?
Kindly help me in this regard...
I need to query users in a different sorted order each time...
Does Cassandra support this?
It does not. Unlike an RDBMS, you cannot make arbitrary queries and expect reasonable performance. Instead you must design your data model so that the queries you anticipate will be efficient:
The best way to approach data modeling for Cassandra is to start with your queries and work backwards from there. Think about the actions your application needs to perform, how you want to access the data, and then design column families to support those access patterns.
So rather than having one column family (table) for your data, you might want several with cross references between them. That is, you might have to denormalise your data.
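The idea of one table per query can be sketched with an in-memory model in Python. This is not Cassandra code; the class and method names are made up, and each sorted list plays the role of a column family whose ordering matches one of the two queries from the question.

```python
import bisect


class UserStore:
    """Keeps the same users in two orderings, one per anticipated query."""

    def __init__(self):
        self.users = {}        # user_id -> latest values
        self.by_updated = []   # sorted list of (updated_time, user_id)
        self.by_score = []     # sorted list of (score, user_id)

    def upsert(self, user_id, name, score, updated_time):
        old = self.users.get(user_id)
        if old is not None:
            # Denormalised data: every copy must be updated on each write.
            self.by_updated.remove((old["updated_time"], user_id))
            self.by_score.remove((old["score"], user_id))
        self.users[user_id] = {"name": name, "score": score,
                               "updated_time": updated_time}
        bisect.insort(self.by_updated, (updated_time, user_id))
        bisect.insort(self.by_score, (score, user_id))

    def recent(self, n=10):
        """Ids of the n most recently updated users, newest first."""
        return [uid for _, uid in reversed(self.by_updated[-n:])]

    def top(self, n=10):
        """Ids of the n highest-scoring users, best first."""
        return [uid for _, uid in reversed(self.by_score[-n:])]
```

Reads are cheap because each query reads the end of an already sorted structure; the price is that every write touches both structures.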
Greetings!
I have the following problem. I have a table with a huge number of rows which I need to search, and then group the search results by many parameters. Let's say the table is
id, big_text, price, country, field1, field2, ..., fieldX
And we run a request like this
SELECT .... WHERE
[use FULLTEXT index to MATCH() big_text] AND
[use some random clauses that anyway render indexes useless,
like: country IN (1,2,65,69) and price<100]
This will be displayed as search results, and then we need to take these search results and group them by a number of fields to generate search filters
(results) GROUP BY field1
(results) GROUP BY field2
(results) GROUP BY field3
(results) GROUP BY field4
This is a simplified case of what I need; the actual task at hand is even more problematic. For example, sometimes the first results query also does its own GROUP BY. An example of such functionality would be this site
http://www.indeed.com/q-sales-jobs.html
(search results plus filters on the left)
I've done, and am still doing, deep research on how MySQL functions, and at this point I don't see this being possible in MySQL at all. Roughly speaking, a MySQL table is just a heap of rows lying on the HDD, and indexes are tiny versions of these tables sorted by the index field(s) and pointing to the actual rows. That's a super oversimplification of course, but the point is I don't see how it is possible to fix this at all, i.e. how to use more than one index and be able to do fast GROUP BYs (by the time the query reaches GROUP BY, the index is completely useless because of range searches and other things). I know that MySQL (or similar databases) have various helpful things such as index merges, loose index scans and so on, but this is simply not adequate: the queries above will still take forever to execute.
I was told that the problem can be solved by NoSQL, which makes use of some radically new ways of storing and dealing with data, including aggregation tasks. What I want is a quick schematic explanation of how it does this. I just want a quick glimpse at it so that I can see that it really does that, because at the moment I can't understand how it is possible at all. Data is still data and has to be placed in memory, and indexes are still indexes with all their limitations. If this is indeed possible, I'll then start studying NoSQL in detail.
PS. Please don't tell me to go and read a big book on NoSQL. I've already done this for MySQL only to find out that it is not usable in my case :) So I wanted to have some preliminary understanding of the technology before getting a big book.
Thanks!
There are essentially 4 types of "NoSQL", but three of the four are actually similar enough that an SQL syntax could be written on top of them (including MongoDB and its crazy query syntax [and I say that even though Javascript is one of my favorite languages]).
Key-Value Storage
These are simple NoSQL systems like Redis, that are basically a really fancy hash table. You have a value you want to get later, so you assign it a key and stuff it into the database, you can only query a single object at a time and only by a single key.
You definitely don't want this.
Document Storage
This is one step up above Key-Value Storage and is what most people talk about when they say NoSQL (such as MongoDB).
Basically, these are objects with a hierarchical structure (like XML files, JSON files, and any other sort of tree structure in computer science), but the values of different nodes on the tree can be indexed. They have a higher "speed" relative to traditional row-based SQL databases on lookup because they sacrifice performance on joining.
If you're looking up data in your MySQL database from a single table with tons of columns (assuming it's not a view/virtual table), and assuming you have it indexed properly for your query (that may be your real problem, here), Document Databases like MongoDB won't give you any Big-O benefit over MySQL, so you probably don't want to migrate over for just this reason.
Columnar Storage
These are the most like SQL databases. In fact, some (like Sybase) implement an SQL syntax while others (Cassandra) do not. They store the data in columns rather than rows, so adding and updating are expensive, but most queries are cheap because each column is essentially implicitly indexed.
But, if your query can't use an index, you're in no better shape with a Columnar Store than a regular SQL database.
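The difference in layout can be shown with a toy Python model. This is not how any particular columnar database is implemented, and the function names are made up; it only shows that once the data is stored column by column, a filter on one field reads that column's values and nothing else.

```python
def to_columns(rows):
    """Turn a list of row dicts into a dict of column lists.

    Assumes every row has the same keys.
    """
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def count_where(columns, name, predicate):
    """Count matching values while reading only the named column."""
    return sum(1 for value in columns[name] if predicate(value))
```

For example, count_where(columns, "price", lambda p: p < 100) never looks at big_text or any other column.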
Graph Storage
Graph Databases expand beyond SQL. Anything that can be represented by Graph theory, including Key-Value, Document Database, and SQL database can be represented by a Graph Database, like neo4j.
Graph Databases make joins as cheap as possible (as opposed to Document Databases) to do this, but they have to, because even a simple "row" query would require many joins to retrieve.
A table-scan type query would probably be slower than a standard SQL database because of all of the extra joins to retrieve the data (which is stored in a disjointed fashion).
So what's the solution?
You've probably noticed that I haven't answered your question, exactly. I'm not saying "you're finished," but the real problem is how the query is being performed.
Are you absolutely sure you can't better index your data? There are things such as Multiple Column Keys that could improve the performance of your particular query. Microsoft's SQL Server has a full text key type that would be applicable to the example you provided, and PostgreSQL can emulate it.
The real advantage most NoSQL databases have over SQL databases is Map-Reduce -- specifically, the integration of a full Turing-complete language that runs at high speed that query constraints can be written in. The querying function can be written to quickly "fail out" of non-matching queries or quickly return with a success on records that meet "priority" requirements, while doing the same in SQL is a bit more cumbersome.
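As a rough illustration of that style, here is a map/reduce sketch in plain Python rather than in any particular database's API. The map function fails out early on records that do not match the filters, and the reduce step sums counts per (field, value) pair, which is the shape of the filter counts asked about in the question. The function names and record layout are assumptions.

```python
from collections import defaultdict


def map_record(record, countries, max_price):
    """Emit ((field, value), 1) pairs for records that pass the filters."""
    # Fail out as early as possible on non-matching records.
    if record["country"] not in countries:
        return
    if record["price"] >= max_price:
        return
    for field in ("field1", "field2"):
        yield (field, record[field]), 1


def reduce_counts(pairs):
    """Sum the emitted counts per key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


def facet_counts(records, countries, max_price):
    pairs = (pair for record in records
             for pair in map_record(record, countries, max_price))
    return reduce_counts(pairs)
```

A single pass over the records produces the counts for every grouped field at once, instead of one GROUP BY query per field.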
Finally, however, the exact problem you're trying to solve: text search with optional filtering parameters, is more generally known as a search engine, and there are very specialized engines to handle this particular problem. I'd recommend Apache Solr to perform these queries.
Basically, dump the text field, the "filter" fields, and the primary key of the table into Solr, let it index the text field, run the queries through it, and if you need the full record after that, query your SQL database for the specific primary keys you got from Solr. It uses some more memory and requires a second process, but will probably best suit your needs, here.
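The division of labour can be sketched in Python with a toy inverted index standing in for the search engine. This is not Solr's API, and the class, function, and field names are made up; the point is only that the index holds the text tokens, the filter fields, and the primary key, while the full rows stay in the database.

```python
from collections import defaultdict


class TinySearchIndex:
    """Toy stand-in for a search engine such as Solr."""

    def __init__(self):
        self.postings = defaultdict(set)  # token -> set of primary keys
        self.filters = {}                 # primary key -> filter fields

    def add(self, pk, text, **filter_fields):
        for token in text.lower().split():
            self.postings[token].add(pk)
        self.filters[pk] = filter_fields

    def search(self, word, **wanted):
        """Return sorted primary keys matching the word and all filters."""
        hits = self.postings.get(word.lower(), set())
        return sorted(
            pk for pk in hits
            if all(self.filters[pk].get(k) == v for k, v in wanted.items())
        )


def full_records(database, pks):
    """Second step: fetch the full rows by primary key."""
    return [database[pk] for pk in pks]
```

Here database is any mapping from primary key to row; in practice it would be a lookup by primary key against the SQL table.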
Why all of this text to get to this answer?
Because the title of your question doesn't really have anything to do with the content of your question, so I answered both. :)