SphinxSearch indextool dumpdict: what is the output format?

Executing indextool:
indextool --dumpdict myindex
gives output like this:
anymore,1,1,329756
baltimore,3,5,153685
Obviously the first column is the keyword; what are the other columns?

Umm, isn't there a header row :)
keyword,docs,hits,offset
(I suppose 'offset' might not be self-explanatory. AFAIK it is just the offset within the doclist file; Sphinx uses it when processing queries, the dict being a quick lookup table. It is probably of little use to external users and can be ignored.)

Related

Querying timestamp column in q

I want to count the number of records inserted in a kdb+ database using a q query.
Currently, I am using the query below:
count select from executionTable where ingestTimeStamp within 2019.09.07D00:00:00.000000000 2019.09.08D00:00:00.000000000
It works but is not very performant. Any recommendations to make it more efficient are highly appreciated.
Thank you for your help.
If you only want the count, then use 'count i' inside the select, like below:
q) select count i from executionTable where ingestTimeStamp within 2019.09.07D00:00:00.000000000 2019.09.08D00:00:00.000000000
This will only get the count instead of fetching the full data, which is what your query is doing and is one of the reasons it takes more time.
And if it is a partitioned database, then add 'date' to the filter, as @Callum Biggs mentioned.
Given the information you have provided, I'm assuming you're querying on-disk data, likely saved in a standard date-partitioned structure. In this case you should specify a date clause before the time clause; this will prevent searching all the date directories.
select from executionTable where date=2019.09.07, ingestTimeStamp within 2019.09.07D00:00:00.000000000 2019.09.08D00:00:00.000000000
I'd suggest reading through the whitepaper on query optimization; it gives guidance on good query structure and on how to take advantage of map-reduce in kdb+.
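Putting the two answers together, a sketch of the combined query against a date-partitioned table would be:
q) select count i from executionTable where date=2019.09.07, ingestTimeStamp within 2019.09.07D00:00:00.000000000 2019.09.08D00:00:00.000000000
The date clause prunes the partitions first, and count i avoids pulling the full column data into memory.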

Talend tAggregateRow put same grouping into different rows

[Screenshots: tLogRow output, tMap settings, tAggregateRow settings, and the workflow]
What I want to do is group by "fckd_operator" and "fdtgl_pinjam", but as you can see, "wiros" and "08-2015" aren't being grouped correctly.
Any ideas?
EDIT 11/29 -> added workflow, tMap, and tAggregateRow screenshots
Do you want to group or to sort the result on those columns?
It seems you want to sort.
In this case, use tSortRow instead of tAggregateRow.
Otherwise, share your input sample and the expected result.
Hope this helps.
TRF
I think you are dealing with an order issue.
AFAIK, to correctly use tAggregateRow you need to make sure the data is already ordered. In your case the order needs to be first operator, then pinjam.
This can be done with the database beforehand or with a tSortRow as TRF suggested.
The way Talend works is somewhat questionable sometimes...
I don't know if this only happens on my machine, but it is kinda stupid.
In my tMap settings, you can see that I use the pattern "yyyy-MM" to format fdtgl_pinjam, but it seems a date pattern in tMap only affects the output text and not the underlying value, which makes the grouping look wrong.
My workaround is:
remove the pattern in the tMap settings
map fdtgl_pinjam once more in the tMap, but with the value Talend.getFirstDayOfMonth(fdtgl_pinjam), and group on this field (see the sketch below)
It works now! :D
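As a rough sketch, the two tMap output expressions could look like this (assuming the input flow is called row1 and fdtgl_pinjam is a Date column; the standard Talend routine for this is usually TalendDate.getFirstDayOfMonth):
// pass the date through without a display pattern
row1.fdtgl_pinjam
// extra output column used as the grouping key in tAggregateRow
TalendDate.getFirstDayOfMonth(row1.fdtgl_pinjam)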

Pivot data in Talend

I have some data which I need to pivot in Talend. This is a sample:
brandname,metric,value
A,xyz,2
B,xyz,2
A,abc,3
C,def,1
C,ghi,6
A,ghi,1
Now I need this data to be pivoted on the metric column like this:
brandname,abc,def,ghi,xyz
A,3,null,1,2
B,null,null,null,2
C,null,1,6,null
Currently I am using tPivotToColumnsDelimited to pivot the data to a file and then reading it back from that file. However, having to store the data in an external file and read it back is messy and adds unnecessary overhead.
Is there a way to do this with Talend without writing to an external file? I tried to use tDenormalize but, as far as I understand, it returns the rows as a single column, which is not what I need. I also looked for a 3rd-party component on Talend Exchange but couldn't find anything useful.
Thank you for your help.
Assuming that your metrics are fixed, you can use their names as the columns of the output. The pivot then has two parts: first, a tMap that transposes the value of each input row in into the corresponding column of the output row out, and second, a tAggregateRow that groups the map's output rows by brandname.
For the tMap you'd have to fill the columns conditionally like this, for example for the output column named "abc":
out.abc = "abc".equals(in.metric)?in.value:null
In the tAggregateRow you'd have to group by out.brandname and aggregate each column with sum, ignoring nulls.
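For the sample data above, the full set of tMap output expressions would then look roughly like this (the in/out flow names are illustrative):
out.brandname = in.brandname
out.abc = "abc".equals(in.metric)?in.value:null
out.def = "def".equals(in.metric)?in.value:null
out.ghi = "ghi".equals(in.metric)?in.value:null
out.xyz = "xyz".equals(in.metric)?in.value:null
The tAggregateRow then groups on brandname with sum (ignore nulls) on abc, def, ghi and xyz, collapsing the per-metric rows for each brand into a single pivoted row.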

Transpose data using Talend

I have this kind of data:
I need to transpose this data into something like this using Talend:
Help would be much appreciated.
dbh's suggestion should work indeed, but I did not try it.
However, I have another solution which doesn't require changing the input format and is not too complicated to implement; the job has only 2 transformation components (tDenormalize and tMap).
The job looks like the following:
Explanation :
Your input is read from a CSV file (could be a database or any other kind of input)
The tDenormalize component will denormalize your value column (column 2), grouped on the id column (column 1) and separating fields with a specific delimiter (";" in my case), resulting in 2 rows as shown.
tMap: split the aggregated column into multiple columns by using Java's String.split() method and spreading the resulting array over multiple output columns. The tMap should look like this:
Since Talend schemas don't accept array types, make sure to store the split String as an Object. Then cast that Object back into an array on the right side of the tMap.
That approach should give you the expected result.
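As a rough sketch of those tMap expressions (the flow, variable, column and delimiter names here are illustrative, not taken from the actual job):
// tMap variable of type Object holding the split result
Var.parts = row1.value.split(";")
// output columns: cast the Object back to a String array and pick each element
out.col1 = ((String[]) Var.parts)[0]
out.col2 = ((String[]) Var.parts)[1]
out.col3 = ((String[]) Var.parts)[2]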
IMPORTANT:
tDenormalize might shuffle the rows, meaning that for bigger inputs you might encounter unsorted output. Make sure to sort it if needed, or use tDenormalizeSortedRow instead.
tDenormalize is similar to an aggregation component, meaning it scans the whole input before processing, which can lead to performance issues with particularly big inputs (tens of millions of records).
Your input is probably wrong (you have 5 entries with id 1 and 6 entries with id 2). 6 columns are expected, meaning you should always have 6 lines per id. If not, then you should implement dbh's solution, and you probably HAVE TO add a key column.
You can use Talend's tPivotToColumnsDelimited component to achieve this. You will most likely need an additional column in your data to represent the field name.
Like "Identifier, field name, value "
Then you can use this component to pivot the data and write a file as output. If you need to process the data further, read the resulting file with tFileInputDelimited.
See docs and an example at
https://help.talend.com/display/TalendOpenStudioComponentsReferenceGuide521EN/13.43+tPivotToColumnsDelimited

MVA attributes in Sphinx

Can anybody help me understand the expected format of data for creating MVA (multi-value) attributes in Sphinx?
I have a MySQL function which returns a row of comma-separated integers, collated with GROUP_CONCAT, as a blob. I have two further MVA attributes which collate the results of a JOIN statement, with GROUP_CONCAT, as a blob (as generated by ThinkingSphinx). These are all included in my sql_query in my sphinx.conf.
I've tried running the SQL on a small result set in the console, and it works: for all the MVA columns, the results are a blob containing data such as:
2432,35345,342347,8975,453645
and so on. The two MVA attributes generated with the JOIN/GROUP_CONCAT combination index correctly. However, the MVA attribute generated with the MySQL function causes the indexing to fail silently (seemingly little or no data is indexed). This is despite the query working absolutely fine in the console.
So the data format seems to be identical, but Sphinx is rejecting one of the columns. Does anybody know of any gotchas with defining MVA attributes which might help me debug this?
I've never used thinking-sphinx (being a PHP shop here), but I don't think you should be group_concat'ing your results. From a working example in one of my sphinx.conf files:
sql_attr_multi = uint categories from query; SELECT entry_id, cat_id FROM exp_category_posts
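If you do want to keep the GROUP_CONCAT'ed, comma-separated column in the main sql_query, the MVA can instead be declared with the 'field' source type, which reads the values straight from that column. A minimal sketch of the two declaration styles (the attribute name is illustrative):
sql_attr_multi = uint categories from field
sql_attr_multi = uint categories from query; SELECT entry_id, cat_id FROM exp_category_posts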
I solved this problem eventually. It was happening because of something which seemed unrelated: a 'sql_attr_str2ordinal' attribute which seemed to be affecting (or affected by) the SQL query/indexing in ways I don't fully understand.
See: http://www.sphx.org/forum/view.html?id=2867
Fortunately, in my case I was able to remove it entirely, and indexing now seems to work.