Find exact word in order in SphinxQL - sphinx

I've got an indexed column that contains some numbers (ids), and I need to extract the rows that match exactly some numbers in a given order.
For example:
"Give me rows that contain 1 followed by 1 followed by 25 followed by 30"
1 1 1 2 2 25 25 26 30 31 => is valid
1 1 1 2 2 25 25 26 31 32 => is not valid
1 1 1 2 2 2 2 2 25 30 30 => is valid
I'm trying with 1 >> 1 >> 2 >> 2 but it does not work (I think because it matches "1" as a single character and not as a "word").

The strict order operator is <<, so
1 << 1 << 25 << 30
should work.
Matching partial words or single characters (as opposed to whole words) would only work if it has been specifically enabled, e.g. with min_infix_len=1, and would probably only match if you have enable_keywords=1 (unless Sphinx is old enough to have enable_stat=0).
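To make the requirement concrete, here is a minimal Python sketch of what "1 followed by 1 followed by 25 followed by 30" means when every number is treated as a whole word. It only models the intended ordered matching on whitespace-separated tokens and is not how Sphinx evaluates a query; the function name matches_in_order is made up for illustration.

```python
def matches_in_order(text, wanted):
    """Return True if the tokens in `wanted` occur in `text` in that order.

    Tokens are whole whitespace-separated words, so "1" never matches
    inside "31". Other tokens may appear in between.
    """
    tokens = iter(text.split())
    # `w in tokens` consumes the iterator up to the first match, so each
    # wanted token has to be found after the previous one.
    return all(w in tokens for w in wanted)


wanted = ["1", "1", "25", "30"]
print(matches_in_order("1 1 1 2 2 25 25 26 30 31", wanted))  # True
print(matches_in_order("1 1 1 2 2 25 25 26 31 32", wanted))  # False
print(matches_in_order("1 1 1 2 2 2 2 2 25 30 30", wanted))  # True
```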

Related

Is there a simple way in Pyspark to find out number of promotions it took to convert someone into customer?

I have a date-level promotion data frame that looks something like this:
ID  Date    Promotions  Converted to customer
1   2-Jan   2           0
1   10-Jan  3           1
1   14-Jan  3           0
2   10-Jan  19          1
2   10-Jan  8           0
2   10-Jan  12          0
Now I want to see how many promotions it took to convert someone into a customer.
For example, it took (2+3) promotions to convert ID 1 into a customer and (19) to convert ID 2 into a customer.
E.g.
ID  Promotions
1   5
2   19
I am unable to think of an idea to solve it. Can you please help me?
@Corralien and @mozway have helped with the solution in Python, but I need to implement it in PySpark because of the huge dataframe size (>1 TB), and I am unable to do so.
You can use:
# Running total of promotions per ID, kept only on the row where the conversion happens
prom = (df.groupby('ID')['Promotions'].cumsum()
          .where(df['Converted to customer'].eq(1))
          .dropna().astype(int))
out = df.loc[prom.index, ['ID', 'Date']].assign(Promotion=prom)
print(out)
# Output
   ID    Date  Promotion
1   1  10-Jan          5
3   2  10-Jan         19
Use one groupby to generate a mask to hide the rows, then one groupby.sum for the sum:
mask = (df.groupby('ID', group_keys=False)['Converted to customer']
          .apply(lambda s: s.eq(1).shift(fill_value=False).cummax())
       )
out = df[~mask].groupby('ID')['Promotions'].sum()
Output:
ID
1     5
2    19
Name: Promotions, dtype: int64
Alternative output:
df[~mask].groupby('ID', as_index=False).agg(**{'Number': ('Promotions', 'sum')})
Output:
   ID  Number
0   1       5
1   2      19
If you potentially have groups without conversion to customer, you might want to also aggregate the "Converted to customer" column as an indicator:
mask = (df.groupby('ID', group_keys=False)['Converted to customer']
          .apply(lambda s: s.eq(1).shift(fill_value=False).cummax())
       )
out = (df[~mask]
       .groupby('ID', as_index=False)
       .agg(**{'Number': ('Promotions', 'sum'),
               'Converted': ('Converted to customer', 'max')
              })
      )
Output:
   ID  Number  Converted
0   1       5          1
1   2      19          1
2   3      39          0
Alternative input:
   ID    Date  Promotions  Converted to customer
0   1   2-Jan           2                      0
1   1  10-Jan           3                      1
2   1  14-Jan           3                      0
3   2  10-Jan          19                      1
4   2  10-Jan           8                      0
5   2  10-Jan          12                      0
6   3  10-Jan          19                      0  # this group has
7   3  10-Jan           8                      0  # no conversion
8   3  10-Jan          12                      0  # to customer
You want to compute something by ID, so a groupby on ID seems appropriate, e.g.
data.groupby("ID").apply(agg_fct)
Now write a separate function agg_fct which computes the result for a dataframe consisting of only one ID.
Assuming the data are ordered by Date, I guess that
def agg_fct(df):
    # position of the first row where the conversion flag is 1
    index_of_conv = df["Converted to customer"].argmax()
    # sum the promotions up to and including the conversion row
    return df.iloc[0:index_of_conv + 1, df.columns.get_loc("Promotions")].sum()
would be fine. You might want to make some adjustments in case of a customer who has never been converted, since argmax then returns 0.
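The rule shared by the answers above (sum the promotions of each ID up to and including the first row flagged as converted) can be sketched in plain Python. This is only an illustration of the per-ID logic, not a PySpark implementation; it assumes the rows arrive already ordered by date within each ID, and the function name promotions_until_conversion is hypothetical.

```python
def promotions_until_conversion(rows):
    """rows: iterable of (id, promotions, converted) tuples, ordered per ID.

    Returns {id: total promotions up to and including the first conversion}.
    IDs that never convert are left out of the result.
    """
    running = {}   # promotions seen so far for IDs not yet converted
    result = {}
    for id_, promotions, converted in rows:
        if id_ in result:
            continue  # already converted: later rows are ignored
        running[id_] = running.get(id_, 0) + promotions
        if converted == 1:
            result[id_] = running[id_]
    return result


rows = [
    (1, 2, 0), (1, 3, 1), (1, 3, 0),
    (2, 19, 1), (2, 8, 0), (2, 12, 0),
    (3, 19, 0), (3, 8, 0), (3, 12, 0),  # never converted
]
print(promotions_until_conversion(rows))  # {1: 5, 2: 19}
```

In a distributed setting the same rule would typically be expressed as a cumulative sum over a window partitioned by ID and ordered by date, followed by a filter on the conversion flag.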

KDB/Q How to implement moving rank efficiently?

I am trying to implement a moving rank function, taking parameters of n, the number of items, and m, the column name. Here is how I implement it:
mwindow: k){[y;x]$[y>0;x#(!#x)+\:!y;x#(!#x)+\:(!-y)+y+1]};
mrank: {[n;x] sum each x > prev mwindow[neg n;x]};
But this seems to take quite some time if n is moderately large, say 100.
I figure it is because it has to calculate everything from scratch for each window, unlike msum, which keeps a running variable and only calculates the difference between the newly added item and the dropped one.
There's a number of general sliding window functions here that you can use to generate rolling lists on which to apply your rank: https://code.kx.com/q/kb/programming-idioms/#how-do-i-apply-a-function-to-a-sequence-sliding-window
Those approaches seem to fill the lists out with zeros/nulls however which I think won't really suit your use of rank. Here's another possible approach which might be more suitable to rank (though I haven't tested this for performance on the large scale):
q)mwin:{x each (),/:{neg[x]sublist y,z}[y]\[z]}
q)update r:mwin[rank;4;c] from ([]c:10?100)
c  r
----------
84 ,0
25 1 0
31 2 0 1
0  3 1 2 0
51 1 2 0 3
29 2 0 3 1
25 0 3 2 1
73 2 1 0 3
0  2 1 3 0
6  2 3 0 1
q)update r:last each mwin[rank;4;c] from ([]c:10?100)
c  r
----
38 0
72 1
13 0
77 3
64 1
9  0
37 1
79 3
97 3
63 1
q)
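For comparison, the incremental idea the question alludes to (update the window instead of rebuilding it) can be sketched in Python. This is not a q implementation and has not been benchmarked; it assumes the rank wanted is the number of values in the trailing window of n items (current item included) that are strictly smaller than the current item, so ties are handled differently from q's rank. The name moving_rank is made up for this sketch.

```python
from bisect import bisect_left, insort
from collections import deque


def moving_rank(values, n):
    """Rank of each value within the trailing window of at most n values."""
    window = deque()   # window contents in arrival order
    ordered = []       # the same values, kept sorted
    ranks = []
    for v in values:
        window.append(v)
        insort(ordered, v)
        if len(window) > n:
            # Drop the oldest value from both views of the window.
            old = window.popleft()
            del ordered[bisect_left(ordered, old)]
        # Count of window values strictly smaller than v.
        ranks.append(bisect_left(ordered, v))
    return ranks


print(moving_rank([38, 72, 13, 77, 64, 9, 37, 79, 97, 63], 4))
# [0, 1, 0, 3, 1, 0, 1, 3, 3, 1]
```

The printed ranks agree with the second q session above for the same ten values. Each step does a binary search plus one insertion and one deletion in a list of length n, rather than re-ranking the whole window.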

Delete adjacent repeated terms

I have the following vector a:
a=[8,8,9,9,1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8]
From a I want to delete all "adjacent" repetitions to obtain:
b=[8,9,1,2,3,4,5,6,7,8]
However, when I do:
unique(a,'stable')
ans =
8 9 1 2 3 4 5 6 7
You see, unique only really gets the unique elements of a, whereas what I want is to delete the "duplicates"... How do I do this?
It looks like a run-length-encoding problem (check here). You can modify Mohsen's solution to get the desired output. (i.e. I claim no credit for this code, yet the question is not a duplicate in my opinion).
Here is the code:
a =[8,8,9,9,1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8]
F=find(diff([a(1)-1, a]));
Since diff(a) returns an array of length (length(a) - 1), we prepend a value (here a(1)-1) to get a vector the same size as a. We subtract 1 so that, as mentioned by @surgical_tubing, find also picks up the first element: find looks for nonzero elements, so the first difference has to be nonzero.
Hence diff([a(1)-1, a]) looks like this:
Columns 1 through 8
1 0 1 0 -8 0 1 0
Columns 9 through 16
1 0 1 0 1 0 1 0
Columns 17 through 20
1 0 1 0
Now, having found the positions where the value changes (i.e. the first element of each run), we index back into a with the positions found by find:
newa=a(F)
and output:
newa =
Columns 1 through 8
8 9 1 2 3 4 5 6
Columns 9 through 10
7 8
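The same idea can be sketched in Python for comparison. The first helper relies on the standard library's itertools.groupby, which groups consecutive equal items; the second mirrors the diff/find approach by keeping each position whose value differs from its predecessor. Both function names are made up for this sketch.

```python
from itertools import groupby


def drop_adjacent_repeats(values):
    """Keep one element per run of equal neighbours."""
    return [key for key, _run in groupby(values)]


def drop_adjacent_repeats_diff(values):
    """Same result: keep positions where the value changes (plus the first)."""
    return [v for i, v in enumerate(values) if i == 0 or v != values[i - 1]]


a = [8, 8, 9, 9, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8]
print(drop_adjacent_repeats(a))       # [8, 9, 1, 2, 3, 4, 5, 6, 7, 8]
print(drop_adjacent_repeats_diff(a))  # [8, 9, 1, 2, 3, 4, 5, 6, 7, 8]
```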

How to set an indexed value in a matrix based on another matrix's values

Say I have a matrix A
A =
0 1 2
2 1 1
3 1 2
and another matrix B
B =
0 42
1 24
2 32
3 12
I want to replace each value in A by the one associated to it in B.
I would obtain
A =
42 24 32
32 24 24
12 24 32
How can I do that without loops?
There are several ways to accomplish this, but here is a short one:
[~,ind]=ismember(A,B(:,1));
Anew = reshape(B(ind,2),size(A))
If you can assume that the first column of B is always 0:size(B,1)-1, then it is easier, becoming just reshape(B(A+1,2),size(A)).
arrayfun(@(x)(B(find((x)==B(:,1)),2)),A)
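For readers who think in terms of a lookup table, here is a rough Python equivalent of the replacement: the first column of B becomes the keys of a dictionary and every entry of A is looked up in it. This is a sketch using nested lists rather than a matrix type, and the function name remap is hypothetical.

```python
def remap(matrix, table):
    """Replace every entry of `matrix` by the value paired with it in `table`.

    `table` is a list of [key, value] rows, like B in the question.
    Raises KeyError if an entry of `matrix` has no row in `table`.
    """
    lookup = {key: value for key, value in table}
    return [[lookup[x] for x in row] for row in matrix]


A = [[0, 1, 2],
     [2, 1, 1],
     [3, 1, 2]]
B = [[0, 42], [1, 24], [2, 32], [3, 12]]
print(remap(A, B))  # [[42, 24, 32], [32, 24, 24], [12, 24, 32]]
```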

Converting Single dimensional matrix to two dimension in Matlab

Well I do not know if I used the exact term. I tried to find an answer on the net.
Here is what I need:
I have a matrix
a = 1 4 7
    2 5 8
    3 6 9
If I do a(4) the value is 4. So it is reading the first column top to bottom, then continuing to the next... I don't know why. However,
what I need is to call it using two indices, as row and column:
a(1,2)= 4
or even better if I can call it in the following way:
a{1}(2)=4
What is this process really called (I want to learn), and how do I perform it in MATLAB?
I thought of a loop. Is there a built-in function?
Thanks a lot
Check this:
a =
18 18 16 18 18 18 16 0 0 0
16 16 18 0 18 16 0 18 18 16
18 0 18 18 0 16 0 0 0 18
18 0 18 18 16 0 16 0 18 18
>> a(4)
ans =
18
>> a(5)
ans =
18
>> a(10)
ans =
18
I tried reshape. It is reshaping, not converting into 2 indices.
If you've already got a matrix, you already can access it with two indices:
If you've got
a = 1 4 7
    2 5 8
    3 6 9
you can access it as
a(3,2) = 6
However, the indexing goes from the top left, row first then column. If you want to get at the "4" in the matrix then do:
a(1,2)
To reshape a vector/matrix/array, use reshape().
Or you could leave it as a one-dimensional array and just use
((Column - 1) * 3) + (Row - 1) as the index; 3 because there are three rows in each column.
NB a(4) = 4 because of the way the columns and rows are arranged in the one-dimensional array; yours is "loaded" as
R1C1, R2C1, R3C1, R1C2 etc., where R is row and C is column.
If that's inconvenient then you just need to get whatever fills the array to fill it row then column, so the above mapping would be
((Row - 1) * 3) + (Column - 1), with 3 now being the number of columns.
I don't do MATLAB, so the formulas above assume the array starts at 0; MATLAB indexing starts at 1, so add 1 to the result.
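The mapping between a single (linear) index and a row/column pair in column-by-column storage can be written out as a small Python sketch. It uses 1-based indices to match the a(4) and a(1,2) examples in the question; the function names are made up for illustration, and n_rows is the number of rows in the matrix.

```python
def linear_to_subscript(k, n_rows):
    """1-based linear index -> 1-based (row, column), column-major order."""
    row = (k - 1) % n_rows + 1
    col = (k - 1) // n_rows + 1
    return row, col


def subscript_to_linear(row, col, n_rows):
    """1-based (row, column) -> 1-based linear index, column-major order."""
    return (col - 1) * n_rows + row


a = [[1, 4, 7],
     [2, 5, 8],
     [3, 6, 9]]

row, col = linear_to_subscript(4, n_rows=3)
print(row, col)                  # 1 2
print(a[row - 1][col - 1])       # 4 (Python lists are 0-based)
print(subscript_to_linear(3, 2, n_rows=3))  # 6
```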