Based on https://www.postgresql.org/docs/13/textsearch-features.html
tsvector || tsvector
The tsvector concatenation operator returns a vector which combines the lexemes and positional information of the two vectors given as arguments. Positions and weight labels are retained during the concatenation. Positions appearing in the right-hand vector are offset by the largest position mentioned in the left-hand vector, so that the result is nearly equivalent to the result of performing to_tsvector on the concatenation of the two original document strings. (The equivalence is not exact, because any stop-words removed from the end of the left-hand argument will not affect the result, whereas they would have affected the positions of the lexemes in the right-hand argument if textual concatenation were used.)
One advantage of using concatenation in the vector form, rather than concatenating text before applying to_tsvector, is that you can use different configurations to parse different sections of the document. Also, because the setweight function marks all lexemes of the given vector the same way, it is necessary to parse the text and do setweight before concatenating if you want to label different parts of the document with different weights.
Thus this query
select 'a:1 b:2'::tsvector || 'a:1 c:2 b:3'::tsvector;
will result in 'a':1,3 'b':2,5 'c':4
Please advise: is there a way to merge several tsvectors while preserving the original positions (something similar to this):
select concat_with_preserving('a:1 b:2'::tsvector, 'a:1 c:2 b:3'::tsvector);
so that the result equals 'a':1 'b':2,3 'c':2, i.e. identical positions are deduplicated and different positions are simply merged (without an offset).
Thanks!
Convert them to text, concatenate them with a space in between, then cast the result back to tsvector.
(a::text || ' ' || b::text)::tsvector
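To make the difference between the two behaviours concrete, here is a minimal Python model (not PostgreSQL code). It represents a tsvector as a dict from lexeme to positions, ignores weights, and the function names concat_with_offset and merge_preserving are made up for this illustration.

```python
def concat_with_offset(left, right):
    """Model of tsvector || tsvector: shift right-hand positions
    by the largest position found in the left-hand vector."""
    offset = max((p for ps in left.values() for p in ps), default=0)
    result = {lex: set(ps) for lex, ps in left.items()}
    for lex, ps in right.items():
        result.setdefault(lex, set()).update(p + offset for p in ps)
    return {lex: sorted(ps) for lex, ps in sorted(result.items())}


def merge_preserving(left, right):
    """Model of the requested merge: keep original positions and
    deduplicate identical (lexeme, position) pairs."""
    result = {lex: set(ps) for lex, ps in left.items()}
    for lex, ps in right.items():
        result.setdefault(lex, set()).update(ps)
    return {lex: sorted(ps) for lex, ps in sorted(result.items())}


a = {"a": [1], "b": [2]}
b = {"a": [1], "c": [2], "b": [3]}
print(concat_with_offset(a, b))  # {'a': [1, 3], 'b': [2, 5], 'c': [4]}
print(merge_preserving(a, b))    # {'a': [1], 'b': [2, 3], 'c': [2]}
```

The second result is the one the question asks for.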
I'm implementing an automation process in PowerShell, using the iTextSharp library, to extract the information I need from several PDF documents.
Based on this PDF content portion:
It returns this result:
[(1)-1688.21(1)-492.975(0)-493.019(0)]TJ
[(5)-493.019(0)-17728.1(2)]TJ
I can extract the literal values with some regex manipulation, but using only this method the result is:
$line -replace "^\[\(|\)\]TJ$", "" -split "\)\-?\d+\.?\d*\(" -join ""
1000
502
Of course, these results are not complete, and I need more detail on how to read and parse the content.
I suspect that the numbers between the literal characters (e.g. -1688.21, -492.975, ...) may be useful, but I didn't find an explanation of these parameters.
What do they represent?
When you are wondering about details of the PDF format, you should have a look into the PDF specification ISO 32000.
Operands: array
Operator: TJ
Description:
Show one or more text strings, allowing individual glyph positioning. Each element of array shall be either a string or a number. If the element is a string, this operator shall show the string. If it is a number, the operator shall adjust the text position by that amount; that is, it shall translate the text matrix, Tm. The number shall be expressed in thousandths of a unit of text space (see 9.4.4, "Text Space Details"). This amount shall be subtracted from the current horizontal or vertical coordinate, depending on the writing mode. In the default coordinate system, a positive adjustment has the effect of moving the next glyph painted either to the left or down by the given amount. Figure 46 shows an example of the effect of passing offsets to TJ.
(ISO 32000-1, Table 109 – Text-showing operators)
Thus,
I suspect that the numbers between the literal characters (e.g. -1688.21, -492.975, ...) may be useful, but I didn't find an explanation of these parameters.
What do they represent?
For each such number, the operator adjusts the text position by that amount. The number is expressed in thousandths of a unit of text space. This amount is subtracted from the current horizontal or vertical coordinate, depending on the writing mode.
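As an illustration only, here is a small Python sketch that splits a TJ operand array into its strings and numeric adjustments, then inserts a space wherever an adjustment moves the next glyph far to the right. The 1000-unit threshold (one full unit of text space) is an assumption chosen for this example and is not part of the specification. The function names are hypothetical, and the parser handles only simple literal strings like the ones shown above.

```python
import re

# Either a literal string "(...)" or a number such as -1688.21
TOKEN = re.compile(r"\(((?:\\.|[^\\)])*)\)|(-?(?:\d+\.?\d*|\.\d+))")


def parse_tj(line):
    """Return the elements of a TJ array: str for strings, float for adjustments."""
    body = line[line.index("[") + 1:line.rindex("]")]
    elements = []
    for match in TOKEN.finditer(body):
        if match.group(1) is not None:
            elements.append(match.group(1))
        else:
            elements.append(float(match.group(2)))
    return elements


def tj_to_text(line, gap_threshold=1000.0):
    """Join the strings, adding a space for large negative adjustments.

    A negative number is subtracted from the coordinate, so in horizontal
    writing mode it moves the next glyph to the right. gap_threshold is an
    assumed heuristic, in thousandths of a unit of text space.
    """
    parts = []
    for element in parse_tj(line):
        if isinstance(element, str):
            parts.append(element)
        elif -element > gap_threshold:
            parts.append(" ")
    return "".join(parts)


print(tj_to_text("[(1)-1688.21(1)-492.975(0)-493.019(0)]TJ"))  # 1 100
print(tj_to_text("[(5)-493.019(0)-17728.1(2)]TJ"))             # 50 2
```

Whether a given gap really separates two values depends on the font size and glyph widths, so the threshold would need tuning against the actual documents.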
I'm reading some Perl code to work out what it's doing, but I can't figure out what 1..$scalar_name does in these lines:
my $scalar_name = scalar @array_name;   # number of elements in @array_name
push @zeroes, 0 for (1..$scalar_name);  # append one 0 per element
Thank you!
Two dots .. is a range operator.
Binary ".." is the range operator, which is really two different operators depending on the context. In list context, it returns a list of values counting (up by ones) from the left value to the right value. If the left value is greater than the right value then it returns the empty list. The range operator is useful for writing foreach (1..10) loops and for doing slice operations on arrays.
Your code takes the number of elements of the array and creates a new array with the same number of zeros.
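For readers more familiar with Python, a rough analogue of those two lines could look like the sketch below. The function name zeroes_like is made up for this illustration. Python's range excludes its upper bound, so it is written as count + 1 to mirror Perl's inclusive 1..$scalar_name.

```python
def zeroes_like(items):
    count = len(items)               # like: scalar @array_name
    zeroes = []
    for _ in range(1, count + 1):    # like: for (1..$scalar_name)
        zeroes.append(0)             # like: push @zeroes, 0
    return zeroes


print(zeroes_like(["x", "y", "z"]))  # [0, 0, 0]
print(zeroes_like([]))               # [] (an empty range runs zero times)
```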
I have the following table:
Columns 'L' and 'U' of the table consist of cells that contain object names corresponding to the headers of columns 4-281.
Goal: For every date verify what objects are in 'L' (respectively 'U') and sum the aggregate of those objects' 4-point trailing moving sum and its standard deviation (going up in the table!) and store it in a new variable, e.g. LSum and LStd for 'L' as well as USum and UStd for 'U'. For dates with insufficient values, e.g. 15-Jul-2016 with only 3 instead of 4 time steps ahead, return NaN's.
How I would start:
for row=1:size(ABC,1)
row_values = ABC{row,:};
row_values = row_values(4:end);
% How to make the loop for columns L and U where there are multiple objects in one cell?
% How can I use 'movsum' and 'movstd' here to calculate values vertically going up?
end;
Thanks a lot for your help!
Maybe you could use the functions cell2mat and cellfun to achieve your goal.
With these functions you can:
Convert your cell matrix to a double matrix in order to perform numeric operations on it (cell2mat)
Perform a certain operation on all cell elements (cellfun)
Given a table with the following format in MATLAB:
itemids keywords
1 3D,children,anim,pixar,3D,3D pixar
2 3D,4D pixar,3D car
... ...
I want to count the number of times each keyword occurs in each item. The list of all unique keywords is available in keywords = {'3D';'Children';'anim';'pixar' ...}. The output is a matrix TF whose number of rows equals the number of items and whose number of columns equals length(keywords).
One of the difficulties here is to search for an exact match for each string. I am currently using strcmp() which seems to be giving all the entries with a given word, not exact match. In my case I would need to differentiate between 3D and 3D pixar.
This can be done using the ismember function in MATLAB. I am assuming that keywords for each item is actually a single string in which case you will need to split the keywords before doing ismember.
relevantKeywords = {'3D','Children','anim','pixar'};
% keywordsStr is the comma-separated keyword string of one item
keywordsInItem = strtrim(strsplit(keywordsStr,',')); % Split the words and trim each word
tmp = ismember(relevantKeywords,keywordsInItem);     % Exact, case-sensitive match
tmp will be of size 1 x length(relevantKeywords) indicating if the relevant keyword was found.
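Note that ismember reports only whether a keyword is present, while the question asks for counts. The following is a minimal Python sketch of the counting step, not MATLAB. The function names are hypothetical, and matching is exact and case-sensitive, so '3D' and '3D pixar' count as different keywords.

```python
from collections import Counter


def keyword_counts(keywords_str, keywords):
    """Count exact occurrences of each keyword in one comma-separated string."""
    tokens = [token.strip() for token in keywords_str.split(",")]
    counts = Counter(tokens)
    return [counts[keyword] for keyword in keywords]


def tf_matrix(items, keywords):
    """One row per item, one column per keyword."""
    return [keyword_counts(item, keywords) for item in items]


keywords = ["3D", "children", "anim", "pixar", "3D pixar"]
items = ["3D,children,anim,pixar,3D,3D pixar", "3D,4D pixar,3D car"]
print(tf_matrix(items, keywords))
# [[2, 1, 1, 1, 1], [1, 0, 0, 0, 0]]
```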
I'm trying to initialize
labels =['dh';'Dh';'gj';'Gj';'ll';'Ll';'nj';'Nj';'rr';'Rr';'sh';'Sh';'th';'Th';'xh';'Xh';'zh';'Zh';'ç';'Ç';'ë';'Ë'];
But it shows me the error in the title. When I try with numbers it all works perfectly, but not with characters. What could be the problem?
If you wish to eliminate any padding, you can also store it into a cell as follows.
labels = {'dh';'Dh';'gj';'Gj';
'll';'Ll';'nj';'Nj';
'rr';'Rr';'sh';'Sh';
'th';'Th';'xh';'Xh';
'zh';'Zh';'ç';'Ç';
'ë';'Ë'};
Then you can reference the "i"th element using labels{i} instead of labels(i,:) which is simpler. You can further run more string operations using cellfun and not interfere with any existing values that you've stored.
I agree with krisdestruction that using a cell array makes the code accessing the strings simpler and is generally more idiomatic. That is what I would also recommend unless there is a compelling reason to do something else.
For completeness, you could use the char function to add the padding automatically for you if you really want a character array:
>> char('aa','bb','c')
ans =
aa
bb
c
where the last row is 'c '. From the char documentation:
S = char(A1,...,AN) converts the arrays A1,...,AN into a single character array. After conversion to characters, the input arrays become rows in S. Each row is automatically padded with blanks as needed. An empty string becomes a row of blanks.
(Emphasis mine)
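The padding rule quoted above can be mimicked in a few lines of Python. This is only an illustration of the idea, not MATLAB code, and pad_rows is a made-up helper name.

```python
def pad_rows(rows):
    """Right-pad every string with blanks to the length of the longest one."""
    width = max((len(row) for row in rows), default=0)
    return [row.ljust(width) for row in rows]


print(pad_rows(["aa", "bb", "c"]))  # ['aa', 'bb', 'c ']
```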
From the MathWorks documentation:
Apply the MATLAB concatenation operator, []. Separate each row with a semicolon (;). Each row must contain the same number of characters. For example, combine three strings of equal length:
You can try padding like this to make every row 2 characters:
labels = ['dh';'Dh';'gj';'Gj';
'll';'Ll';'nj';'Nj';
'rr';'Rr';'sh';'Sh';
'th';'Th';'xh';'Xh';
'zh';'Zh';'ç ';'Ç ';
'ë ';'Ë '];