MVA attributes in Sphinx

Can anybody help me understand the expected data format for creating MVA (multi-value) attributes in Sphinx?
I have a MySQL function which returns a row of comma-separated integers, collated with GROUP_CONCAT, as a blob. I have two further MVA attributes which collate the results of a JOIN statement, with GROUP_CONCAT, as a blob (as generated by ThinkingSphinx). These are all included in the sql_query in my sphinx.conf.
I've tried running the SQL on a small result set in the console, and it works: for all the MVA columns, the results are a blob containing data such as:
2432,35345,342347,8975,453645
and so on. The two MVA attributes generated with the JOIN/GROUP_CONCAT combination index correctly. However, the MVA attribute generated with the MySQL function causes the indexing to fail silently (seemingly little or no data is indexed), even though the query works absolutely fine in the console.
So the data format seems to be identical, but Sphinx is rejecting one of the columns. Does anybody know of any gotchas with defining MVA attributes which might help me debug this?

I've never used thinking-sphinx (being a PHP shop here), but I don't think you should be group_concat'ing your results. From a working example in one of my sphinx.conf files:
sql_attr_multi = uint categories from query; SELECT entry_id, cat_id FROM exp_category_posts
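To make the difference between the two data shapes concrete, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption made purely so the example is runnable; the sample rows are invented). As far as I know, the 'from query' form shown above expects one (document id, value) row per value, while the 'from field' form reads a delimited list from a column of the main sql_query, which is the shape GROUP_CONCAT produces.

```python
import sqlite3

# In-memory stand-in for the join table named in the answer's example.
# The rows are made-up sample data, not taken from the original post.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE exp_category_posts (entry_id INTEGER, cat_id INTEGER)")
conn.executemany(
    "INSERT INTO exp_category_posts VALUES (?, ?)",
    [(1, 2432), (1, 35345), (2, 8975), (2, 453645), (2, 342347)],
)

# Shape described in the question: one row per document, with all values
# packed into a single comma-separated string.
concat_rows = conn.execute(
    "SELECT entry_id, GROUP_CONCAT(cat_id) FROM exp_category_posts "
    "GROUP BY entry_id ORDER BY entry_id"
).fetchall()

# Shape returned by the answer's query: one (document id, value) pair per row.
pair_rows = conn.execute(
    "SELECT entry_id, cat_id FROM exp_category_posts ORDER BY entry_id, cat_id"
).fetchall()

print(concat_rows)
print(pair_rows)
```

Comparing the two printed results against what each sql_attr_multi declaration expects is a quick way to check that a column is declared with the matching source type.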

I solved this problem eventually. It was happening because of something
which seemed unrelated: a 'sql_attr_str2ordinal' attribute which seemed to affect
(or be affected by) the SQL query/indexing in ways I don't fully understand.
See: http://www.sphx.org/forum/view.html?id=2867
Fortunately, in my case I was able to remove it entirely, and indexing now seems to work.

Related

Redshift Spectrum table doesn't recognize array

I ran a crawler on a JSON S3 file to update an existing external table.
Once it finished, I checked SVL_S3LOG to see the structure of the external table and saw that it had been updated, with a new column of type Array<int> as expected.
When I tried to execute select * on the external table, I got this error: "Invalid operation: Nested tables do not support '*' in the SELECT clause.;"
So I tried listing all the column names in the select statement:
select name, date, books.... (books is the Array<int> type)
from external_table_a1
and got this error:
"Invalid operation: column "books" does not exist in external_table_a1;"
I also checked the table external_table_a1 in AWS Glue and saw that the column "books" is recognized and has the type Array<int>.
Can someone explain why my simple query is wrong?
What am I missing?
Querying JSON data is a bit of a hassle with Redshift: when parsing is enabled (e.g. using the appropriate SerDe configuration), the JSON is stored as a SUPER type. In your case that's the Array<int>.
The AWS documentation on querying semistructured data seems pretty straightforward, mentioning that PartiQL uses "dotted notation and array subscript for path navigation when accessing nested data". This doesn't work for me, although I can't find any reason why in their SUPER limitations documentation.
Solution 1
What I have to do is set two flags, set json_serialization_enable to true; and set json_serialization_parse_nested_strings to true;, which will serialize the SUPER type as JSON (i.e. back to JSON). I can then use JSON functions to query the data. Unnesting data gets even crazier, because you can only use the unnest syntax select item from table as t, t.items as item on SUPER types. I genuinely don't think that this is the intended way to query and unnest SUPER objects, but it's the only approach that worked for me.
This is described in an older version of the "Amazon Redshift Developer Guide".
Solution 2
When you write a query, Redshift will try to fit the output into one of the basic column data types. If the result of your query does not match any of those types, Redshift will not process the query. Hence, in order to convert a SUPER to a compatible type you will have to unnest it (using the rather peculiar Redshift unnest syntax).
For me, this works in certain cases, but I'm not always able to properly index arrays, nor can I access the array index (using the my_table.array_column as array_entry at array_index syntax).

Sphinx / Manticore - base one plain index off another?

I have a plain text index that pulls data from MySQL and inserts it into Manticore in the format I need (e.g. converting datetime strings to timestamps, CONCATing some fields, etc.).
I then want to create a second plain text index based off this data to group it further. This will save me having to either re-run the normalisation that's done to the first index on INSERT or make it easier for me to query in the future.
For example, my first index is a list of all phone calls that have been made / received (telephone number, duration, agent). The second index should group by Year-Month-Date in such a way that I can see how many calls each agent made on that day. This means I end up with idx_phone_calls and idx_phone_calls_by_date.
Currently, I generate the first index from MySQL, then get Manticore to query itself (by setting the MySQL host to localhost). It works, but it feels as though I should be able to query Manticore directly from within the index. However, I'm struggling to find out whether that's possible.
Is there a better way to do it?
Well, Sphinx/Manticore has its own GROUP BY. So maybe you can just run the final query against the original index anyway and avoid the need for the second index.
Sphinx's aggregation is (in some ways) more powerful than MySQL's, and can do some 'super aggregation' functions (like WITHIN GROUP ORDER BY).
But otherwise there is no direct way to create an index off another (e.g. there is no CREATE TABLE idx_phone_calls_by_date SELECT ... FROM idx_phone_calls ...).
Your 'solution' of directing indexer to query the data from searchd is good. In general this should be pretty efficient, particularly on localhost, where there is little overhead. It also maintains the logical separation of searchd being for queries and indexer being for, well, building indexes.
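As a rough illustration of the first suggestion (grouping at query time instead of building idx_phone_calls_by_date), here is a sketch of the per-day, per-agent aggregate. It runs against SQLite through Python's sqlite3 module purely so that it is executable; the column names (agent, started_at, duration) and the rows are hypothetical, and the date-bucketing expression would need to be replaced by whatever your Sphinx/Manticore version offers.

```python
import sqlite3

# SQLite stand-in for the first index; the columns and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE idx_phone_calls (agent TEXT, started_at INTEGER, duration INTEGER)"
)
conn.executemany(
    "INSERT INTO idx_phone_calls VALUES (?, ?, ?)",
    [
        ("alice", 1704103200, 120),  # 2024-01-01 10:00:00 UTC
        ("alice", 1704124800, 30),   # 2024-01-01 16:00:00 UTC
        ("bob", 1704103200, 300),    # 2024-01-01 10:00:00 UTC
        ("alice", 1704189600, 45),   # 2024-01-02 10:00:00 UTC
    ],
)

# Group by calendar day and agent at query time, so no second index is needed.
calls_by_day = conn.execute(
    "SELECT strftime('%Y-%m-%d', started_at, 'unixepoch') AS call_day, "
    "       agent, COUNT(*) AS calls, SUM(duration) AS total_duration "
    "FROM idx_phone_calls "
    "GROUP BY call_day, agent "
    "ORDER BY call_day, agent"
).fetchall()

for row in calls_by_day:
    print(row)
```

If a query like this is fast enough against the original index, the second index (and the indexer-queries-searchd step) may not be needed at all.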

Using arrays with pg-promise

I'm using pg-promise and don't understand how to run this query. The first query works, but I would like to use pg-promise's safe character escaping, and when I try the second query it doesn't work.
Works:
db.any(`SELECT title FROM books WHERE id = ANY ('{${ids}}') ORDER BY id`)
Doesn't work:
db.any(`SELECT title FROM books WHERE id = ANY ($1) ORDER BY id`, ids)
The example has 2 problems. First, it goes against what the documentation tells you:
IMPORTANT: Never use the reserved ${} syntax inside ES6 template strings, as those have no knowledge of how to format values for PostgreSQL. Inside ES6 template strings you should only use one of the 4 alternatives - $(), $<>, $[] or $//.
Manual query formatting, like in your first example, is a very bad practice, resulting in bad things, ranging from broken queries to SQL injection.
And the second issue is that after switching to the correct SQL formatting, you should use the CSV Filter to properly format the list of values:
db.any(`SELECT title FROM books WHERE id IN ($/ids:csv/) ORDER BY id`, {ids})
or via an index variable:
db.any(`SELECT title FROM books WHERE id IN ($1:csv) ORDER BY id`, [ids])
Note that I also changed from the ANY operator to IN, as we are providing a list of open values here.
You can also use the :list filter interchangeably with :csv, whichever you like.

Informatica SQ returns different result

I am trying to pull data from DB2 via Informatica. I have a Source Qualifier (SQ) query that pulls a few fields based on joins across 4 different tables.
When I run the query directly in the database, it returns the expected result; however, when I run it in Informatica with the debugger, I see something else.
Please note that the data in all columns matches perfectly, except for one single column.
The weird thing is that this is a calculated field based on a case statement:
CASE WHEN Column1='3' THEN 'N' ELSE 'Y' END.
Since this is a calculated field that is one character long, I connected a 1-character port from one of the sources to the SQ.
This returns 'Y' when executed in the database, but when I copy and paste the same query into the SQ in Informatica and run it, I get 'E', which should never be possible, as I expect only an N or a Y. I have verified the column order and that it is in the right place. This is very strange; is something going wrong because of the CASE statement?
Save yourself the hassle: put an expression transformation after the source qualifier, calculate the port value there, then forget about it.
I think I found the issue. We use Informatica PowerExchange to connect to an AS400 system (DB2), and it seems that when we set a flag in AS400 and pass it to Informatica via PowerExchange, it is converted to binary. To solve this, there needs to be an entry in the PowerExchange configuration file.
Unfortunately, I was not aware that it could be related to PowerExchange rather than PowerCenter itself.
Thanks for your assistance! Below is the KB about it.
https://kb.informatica.com/solution/4/Pages/17498.aspx

sqlalchemy group_by error

The following works
s = select([tsr.c.kod]).where(tsr.c.rr=='10').group_by(tsr.c.kod)
and this does not:
s = select([tsr.c.kod, tsr.c.rr, any fields]).where(tsr.c.rr=='10').group_by(tsr.c.kod)
Why?
thx.
It doesn't work because the query isn't valid like that.
Every selected column needs to be in the group_by or needs an aggregate (i.e. max(), min(), whatever) according to the SQL standard. Most databases have always complied with this, but there are a few exceptions.
MySQL has always been the odd one out in this regard; within MySQL this behaviour depends on the ONLY_FULL_GROUP_BY setting: https://dev.mysql.com/doc/refman/8.0/en/group-by-handling.html
I would personally recommend setting the sql_mode setting to ANSI. That way you're largely compliant with the SQL standard, which will help you in the future if you ever need to use (or migrate to) a standards-compliant database such as PostgreSQL.
What you are trying to do is somehow valid in mysql, but invalid in standard sql, postgresql and common sense. When you group rows by 'kod', each row in a group has the same 'kod' value, but different values for 'rr' for example. With aggregate functions you can get some aspect of the values in this column for each group, for example
select kod, max(rr) from table group by kod
will give you list of 'kod's and the max of 'rr's in each group (by kod).
That being said, in the select clause you can only put columns from the group by clause and/or aggregate functions over other columns. You can put whatever you like in where; this is used for filtering. You can also add a 'having' clause after the group by, containing an aggregate function expression, which acts as a post-group filter.
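The three clauses described above (where, group by with aggregates, having) can be sketched as follows. This uses Python's built-in sqlite3 module rather than SQLAlchemy so that it runs without extra dependencies; the tsr table and its kod and rr columns come from the question, while the sample rows are invented. Note that SQLite, like older MySQL defaults, tolerates bare non-grouped columns, so this only demonstrates the valid forms, not the error.

```python
import sqlite3

# Stand-in for the question's 'tsr' table; the rows are made-up sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tsr (kod TEXT, rr TEXT)")
conn.executemany(
    "INSERT INTO tsr VALUES (?, ?)",
    [("A", "10"), ("A", "20"), ("B", "10"), ("C", "30"), ("C", "05")],
)

# Every selected column is either in GROUP BY (kod) or wrapped in an aggregate.
max_rr_per_kod = conn.execute(
    "SELECT kod, MAX(rr) FROM tsr GROUP BY kod ORDER BY kod"
).fetchall()

# WHERE filters rows before grouping; HAVING filters groups after grouping.
filtered = conn.execute(
    "SELECT kod, COUNT(*) FROM tsr "
    "WHERE rr <> '05' "
    "GROUP BY kod "
    "HAVING COUNT(*) > 1 "
    "ORDER BY kod"
).fetchall()

print(max_rr_per_kod)
print(filtered)
```

In the question's failing query, wrapping the extra columns in an aggregate, or adding them to group_by, is the change that makes the statement valid standard SQL.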