Combine SQL with an external search - tsql

In my case, SQL Server holds the structured data and I'm considering Lucene for the text search.
Yes, MSSQL has Full-Text Search, but Lucene offers some features I want.
For the purpose of the question, assume any external search engine.
In SQL there is a main table with a PK.
In SQL there are a number of queries that use the main table and a number of other tables.
From the external search I will get a list of Main.PK values to filter by.
That list could contain anywhere from 1 to 1 million values.
The external search is the most expensive part of the search; the SQL part is very efficient. Passing the SQL PKs to the external search is not really a good option, as I need various data from the SQL query. The only things coming back from Lucene are the PK (term) and sometimes the score.
Is there a best practice?
Options I see are:
where Main.PK in (PK values from external search)
populate the external search PK values in a #TEMP and join to that
Since I sometimes need the score, the #TEMP option seems best, as I can put the score in the #temp (a sketch of what I mean follows below).
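A minimal T-SQL sketch of that #temp option, assuming an int PK and with made-up column names:

CREATE TABLE #SearchHits
(
    PK    int   NOT NULL PRIMARY KEY,
    Score float NULL
);

-- bulk-load the PK/score pairs returned by Lucene here
-- (e.g. SqlBulkCopy from the .NET side), then join:

SELECT m.Col1, m.Col2, h.Score
FROM Main AS m
JOIN #SearchHits AS h
    ON h.PK = m.PK
ORDER BY h.Score DESC;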
In an ideal world there would be a join like this:
join externalvirtualtable as evt
on evt.PK = Main.PK
and syntax specific to the external search
I get that this is asking a lot, but is there anything like that in general?
Is there a syntax/API to make an external search look like a table (or view) to MSSQL?
Is there anything like that for MSSQL to Lucene?
This is kind of a start: OLE DB Providers and OPENROWSET.
Ideally there would be a .NET Framework Data Provider for Lucene that mapped some SQL syntax to Lucene.
The app is .NET, in case there is a .NET-specific solution.
The product RavenDB combines structured and unstructured (Lucene) search and is very fast even when the Lucene side returns a lot of rows, so there has to be a way to do this short of putting the PKs in a #temp.
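One alternative I know of that avoids the #temp round trip is a table-valued parameter (SQL Server 2008+), which lets the .NET side stream the PK/score pairs straight into a stored procedure; a rough sketch with invented names:

CREATE TYPE dbo.SearchHitList AS TABLE
(
    PK    int   NOT NULL PRIMARY KEY,
    Score float NULL
);
GO

CREATE PROCEDURE dbo.SearchMain
    @Hits dbo.SearchHitList READONLY
AS
BEGIN
    -- the streamed-in hit list joins exactly like a table
    SELECT m.Col1, m.Col2, h.Score
    FROM Main AS m
    JOIN @Hits AS h
        ON h.PK = m.PK
    ORDER BY h.Score DESC;
END;

On the .NET side the parameter is passed as SqlDbType.Structured (for example from a DataTable), so the Lucene results never need to be written to a real table first.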

Is there a syntax/API to make an external search look like a table (or view) to MSSQL?
You can use Lucene's IndexSearcher class; it will give you a TopDocs object that contains the relevant documents (the PKs in your case). Then you can populate a SQL table based on that result.
You will need something like this:
// searcher is an open IndexSearcher, query the parsed Lucene query
TopDocs topDocs = searcher.search(query, MAX_HITS);
for (int i = 0; i < topDocs.scoreDocs.length; i++) {
    Document doc = searcher.doc(topDocs.scoreDocs[i].doc);
    String pk = doc.get("PK");
    float score = topDocs.scoreDocs[i].score; // keep the score if you need it
    // connect to the database and insert (pk, score) into the staging table
}
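The insertion itself would just be a parameterized INSERT into whatever staging table you later join against (names assumed here); for a large hit list you would batch or bulk-load these rather than inserting row by row:

INSERT INTO SearchHits (PK, Score) VALUES (@pk, @score);

Once the staging table is populated, the SQL side is an ordinary join on Main.PK.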

Statistics of all/many tables in FileMaker

I'm writing a kind of summary page for my FileMaker solution.
For this, I have defined a "statistics" table, which uses formula fields with ExecuteSQL to gather info from most tables, such as the number of records, recently changed records, etc.
This strangely takes a long time - around 10 seconds when I have a total of about 20k records in about 10 tables. The same SQL on any other database system shouldn't take more than a fraction of a second.
What could the reason be, what can I do about it, and where can I start debugging to figure out what's taking all this time?
The actual code is like this:
SQLAusführen ( "SELECT COUNT(*) FROM " & _Stats::Table ; "" ; "" )
SQLAusführen ( "SELECT SUM(\"some_field_name\") FROM " & _Stats::Table ; "" ; "" )
Where "_Stats" is my statistics table, and it has a string field "Table" where I store the name of the other tables.
So each row in this _Stats table should have the stats for the table named in the "Table" field.
Update: I'm not using FileMaker server, this is a standalone client application.
We can definitely talk about why it may be slow. Usually this has mostly to do with the size and complexity of your schema. That is only "usually", as you have found.
Can you use the DDR (Database Design Report) instead? Much will depend on what you are actually doing with this data. Tools like FMPerception will also give you many of the stats you are looking for. Again, it depends on what you are doing with it.
Also, can you post your actual calculation? Is the statistics table using unstored calculations? Is the statistics table related to any of the other tables? These are a couple of things that will affect how ExecuteSQL performs.
One thing to keep in mind: whether it's ExecuteSQL, a Perform Find, or a relationship, it's all the same basic query under the hood. So if it would be slow doing it one way, it's likely going to be slow with any other directly related approach.
Taking these one at a time:
Count of all records:
Placing an unstored calc in the target table lets you get the count of the records through the relationship without triggering a transfer of all records to the client. You can get the value from the first record in the relationship. It's a super light way to get that info vs. using Count(), which requires FileMaker to touch every record on the other side.
Sum of records matching a value:
Using a field on the _Stats table with a relationship to the target table will reduce how much work FileMaker has to do to give you an answer.
Then having a summary field in the target table to sum the records may prove more efficient than using an aggregate function. The summary field will also only sum the records that match the relationship. (Just don't show that field on any of your layouts if you don't need it.)
ExecuteSQL is fastest when it can just rely on a simple index lookup. Once you get outside of that, it's primarily about testing to find the sweet spot. Typically, I will use ExecuteSQL for retrieving either a JSON object from a user table, or verifying a single field value. Once you get into sorting and aggregate functions, you step outside of the optimizations of the function.
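As an illustration, the kind of call that stays on that fast path is a single lookup against a stored, indexed field, while an aggregate over the whole table does not (the table and field names below are hypothetical):

// fast: resolves against the index of a stored field
ExecuteSQL ( "SELECT id FROM Orders WHERE order_number = ?" ; "" ; "" ; $orderNumber )

// slower: aggregate over every row in the table
ExecuteSQL ( "SELECT COUNT(*) FROM Orders" ; "" ; "" )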
Also note, if you have an open record (that means you, as the current user), FileMaker Server doesn't know what data you have on the client side, and so it sends ALL of the records. That's why I asked if you were using unstored calcs with ExecuteSQL. It can seem slow when you can't control when the calculations fire. Often I will put the updating of that data into a scheduled script.

How can I prevent SQL injection with an arbitrary JSONB query string provided by an external client?

I have a basic REST service backed by a PostgreSQL database with a table that has various columns, one of which is a JSONB column containing arbitrary data. Clients can store data by filling in the fixed columns and providing any JSON as opaque data that is stored in the JSONB column.
I want to allow the client to query the database with constraints on both the fixed columns and the JSONB. It is easy to translate some query parameters like ?field=value and convert that into a parameterized SQL query for the fixed columns, but I want to add an arbitrary JSONB query to the SQL as well.
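For the fixed columns that translation is straightforward; for example, a request like ?company=EA might become the following (table and column names invented for illustration):

SELECT *
FROM documents
WHERE company = $1;   -- $1 bound to 'EA'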
This JSONB query string could contain SQL injection; how can I prevent this? I think that because the structure of the JSONB data is arbitrary I can't use a parameterized query for this purpose. All the documentation I can find suggests using parameterized queries, and I can't find any useful information on how to actually sanitize the query string itself, which seems like my only option.
For example a similar question is:
How to prevent SQL Injection in PostgreSQL JSON/JSONB field?
But I can't apply the same solution, as I don't know the structure of the JSONB or of the query: I can't assume the client wants to query a particular path using a particular operator; the entire JSONB query needs to be freely provided by the client.
I'm using golang, in case there are any existing libraries or code fragments that I can use.
edit: some example queries on the JSONB that the client might do:
(content->>'company') is NULL
(content->>'income')::numeric>80000
content->'company'->>'name'='EA' AND (content->>'income')::numeric>80000
content->'assets'#>'[{"kind":"car"}]'
(content->>'DOB')::TIMESTAMP<'2000-01-30T10:12:18.120Z'::TIMESTAMP
EXISTS (SELECT FROM jsonb_array_elements(content->'assets') asset WHERE (asset->>'value')::numeric > 100000)
Note that these don't cover all possible types of queries. Ideally I want any query that PostgreSQL supports on the JSONB data to be allowed. I just want to check the query to ensure it doesn't contain sql injection. For example, a simplistic and probably inadequate solution would be to not allow any ";" in the query string.
You could allow the users to specify a path within the JSON document, and then parameterize that path within a call to a function like jsonb_extract_path_text, or equivalently the #>> operator. That is, the WHERE clause would look like:
WHERE data #>> $1 = $2
The path argument is just a text array, easily parameterized, which lists the keys to traverse down to the given value, e.g. '{company,name}'. The right-hand side of the clause would be parameterized by the same rules you're using for fixed-column filtering.
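As a concrete sketch, a client request equivalent to content->'company'->>'name' = 'EA' could be turned into the following, assuming the table is called documents and the JSONB column data (both names are placeholders):

-- $1 = '{company,name}'  (path from the client, passed as a text[] parameter)
-- $2 = 'EA'              (value from the client)
SELECT *
FROM documents
WHERE data #>> $1 = $2;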

What is the proper way to translate a complicated SQL query with custom columns and rdbms-specific functions to Doctrine

I have been a Propel user for years and only recently started switching to Doctrine. It's still quite new to me and sometimes Propel habits kick in and make it hard for me to "think in Doctrine". Below is a specific case. You don't have to know Propel to answer my question - I also present my case in raw SQL.
Simplified structure of the tables that my query refers is like this:
Application table has FK to Admin which has FK to User (fos_user in the DB)
ApplicationUser table has FK to Application
My query gets all Application records with custom columns containing additional info retrieved from the related User records (through Admin) and some COUNTs of related ApplicationUser objects, one of which is additionally filtered (the adminname, usercount and usercountperiod columns added to the query).
I have a Propel query like this:
ApplicationQuery::create()
->leftJoinApplicationUser()
->useAdminQuery()
->leftJoinUser()
->endUse()
->withColumn('fos_user.username', 'adminname')
->withColumn('COUNT(application_user.id)', 'usercount')
->withColumn('COUNT(application_user.id) FILTER '
. '(WHERE score > 0 AND '
. ' application_user.created_at >= to_timestamp('.strtotime($users_scored['begin']).') and '
. ' application_user.created_at < to_timestamp('.strtotime($users_scored['end']).') )', 'usercountperiod')
->groupById()
->groupBy('User.Id')
->orderById('DESC')
->paginate( ....
This is how it translates to SQL (PostgreSQL):
SELECT application.id, application.name, ...,
fos_user.username AS "adminname",
COUNT(application_user.id) AS "usercount",
COUNT(application_user.id) FILTER (
WHERE score > 0 AND
application_user.created_at >= to_timestamp(1491004800) and
application_user.created_at < to_timestamp(1498780800) ) AS "usercountperiod"
FROM application
LEFT JOIN application_user ON (application.id=application_user.application_id)
LEFT JOIN admin ON (application.admin_id=admin.id)
LEFT JOIN fos_user ON (admin.id=fos_user.id)
GROUP BY application.id,fos_user.id
ORDER BY application.id DESC
LIMIT 15
As you can see it's quite complex (in terms of translating it to Doctrine ORM, when you're a Doctrine newbie like me :) ). It uses specific features of PostgreSQL:
being able to include only the primary key in the GROUP BY clause, while other columns from the same table can be used in the SELECT without an aggregate function or inclusion in the GROUP BY (because they are functionally dependent on the PK; a stripped-down example follows this list);
FILTER, which allows you to further filter the records that are fed into aggregate functions
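A stripped-down illustration of that first PostgreSQL behaviour (assuming id is the primary key of application):

-- name may appear un-aggregated because it is functionally
-- dependent on the grouped primary key
SELECT application.id, application.name
FROM application
GROUP BY application.id;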
It also uses some joins and adds custom columns (adminname, usercount, usercountperiod) which I can access on my resulting Propel model objects (with methods like $result->getAdminname()).
My question is: what is the "Doctrine way" to achieve as similar thing as possible as simply as possible (use some PostgreSQL-specific or any RDBMS-specific features, add some custom columns which will be accessible through ORM objects and so on)?
Thank you for help.

Small column and large table - should I use FTS or LIKE

At the moment I am using full-text search (SQL Server 2008 R2) on small columns like 'Client Name', 'PO Number', etc., but I was wondering whether it is really worth using FTS on small columns, or whether I could just use LIKE for searching.
The table has over 11k rows, which is not a lot, but it is growing.
If it is better to use LIKE, then do I have to remove the columns from the full-text catalog?
What is meant by unstructured text data here?
"In contrast to full-text search, the LIKE Transact-SQL predicate works on character patterns only. Also, you cannot use the LIKE predicate to query formatted binary data. Furthermore, a LIKE query against a large amount of unstructured text data is much slower than an equivalent full-text query against the same data. A LIKE query against millions of rows of text data can take minutes to return; whereas a full-text query can take only seconds or less against the same data, depending on the number of rows that are returned. "
If you're already using full text, why change it? LIKE queries may be suitable now but the performance will degrade sharply as your table grows, as stated in the MSDN article you quoted.
If it is better to use LIKE, then do I have to remove the columns from the catalog?
No, the full text catalog has no impact on LIKE queries.
What is meant by unstructured text data here?
Full text can be used on binary formats that SQL Server supports (imagine searching across Word files stored in SQL) and XML data. LIKE cannot do this.
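To make the difference concrete, here is a small sketch of the two query styles, assuming a Clients table with a full-text indexed [Client Name] column (all names invented):

-- LIKE with a leading wildcard: cannot use a normal index, scans the rows
SELECT ClientID, [Client Name]
FROM Clients
WHERE [Client Name] LIKE '%acme%';

-- full-text predicate: resolved through the full-text index
SELECT ClientID, [Client Name]
FROM Clients
WHERE CONTAINS([Client Name], '"acme*"');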

How to build a select using Zend with a DISTINCT specific column?

I'm using Zend Framework for my website and I'd like to retrieve some data from my PostgreSQL database.
I have a request like :
SELECT DISTINCT ON(e.id) e.*, f.*, g.* FROM e, f, g
WHERE e.id = f.id_e AND f.id = g.id_f
This query works well, but I don't know how to express the DISTINCT ON (e.id) part with Zend.
It seems that I can get DISTINCT rows but not distinct columns.
$select->distinct()->from("e")->join("f", "e.id = f.id_e")
->join("g", "f.id = g.id_f");
Any idea how to make a select with a distinct column?
Thanks for the help.
You probably can't do this with Zend Framework, since DISTINCT ON is not part of the SQL standard (see the end of the page in the Postgres documentation). Although Postgres supports it, I would assume it's not part of Zend Framework because you could in theory configure another database connection that does not support it.
If you know in advance that you're developing for a specific database (Postgres in this case), you could use manually written statements instead. You'll gain more flexibility within the queries and better performance, at the cost of no longer being able to switch databases.
You would then instantiate a Zend_Db_Adapter for Postgres. There are various methods available to get results for SQL queries, which are described in the framework's documentation starting at the section Reading Query Results. If you choose to go this route, I'd recommend creating your own subclass of the Zend_Db_Adapter_Pgsql class, so you can convert data types and throw exceptions in case of errors instead of returning ambiguous null values and hiding error causes.