Sphinx: how to search for a phrase with wildcard?

I have a phrase "my name is bob". I want to match it by querying "my n".
What should my query look like, and what config do I need?
min_prefix_len and min_infix_len did not give the expected results.
I had min_word_len set to 2, but changing it to 1 did not help either.
Setting expand_keywords to 1 or 2 made no difference.
Here's my index config:
index track
{
source = track
path = /var/lib/sphinx/track
min_word_len = 1
docinfo = extern
mlock = 1
morphology = none
expand_keywords = 1
}
The queries I tried:
"my n*"
"my n"*
my n
"my n" | my n*
"my n" | "my n*" | my n*
No matter what, I cannot match "my name ...".

min_word_len = 1
min_prefix_len = 1
expand_keywords = 0
You need min_prefix_len to enable wildcard searches, but you want expand_keywords off, as that implicitly adds wildcards to every keyword. Then you can just do:
"my n*"

Related

Search special chars by space in sphinx

I have a problem with Sphinx search.
I have this string to index:
xyz a'qwerty
I need to find it when I search for:
xy - ok
xy a - ok
xyz a'qwerty - ok
xyz a qwerty - ok
xyz a qwe - not ok
I really can't get the right result. Does anyone know how to do this?
My index looks like this; the regexp_filter lines were experiments, so they can be removed.
index ProductSearch
{
source = ProductSearchSource
path = c:/wamp/sphinx/data/product
docinfo = extern
enable_star = 0
expand_keywords = 1
min_word_len = 2
min_prefix_len = 1
charset_type = utf-8
charset_table = 0..9, A..Z->a..z, _, a..z, U+0022, U+0026, U+0027, U+0060, U+00B4, U+002E, U+0e1->a, U+0c1->a, U+10d->c, U+10c->c, U+10f->d, U+10e->d, U+0e9->e, U+0c9->e, U+11b->e, U+11a->e, U+0ed->i, U+0cd->i, U+148->n, U+147->n, U+0f3->o, U+0d3->o, U+159->r, U+158->r, U+161->s, U+160->s, U+165->t, U+164->t, U+0fa->u, U+0da->u, U+16f->u, U+16e->u, U+0fd->y, U+0dd->y, U+17e->z, U+17d->z,
wordforms = c:/wamp/www/project/configs/sphinx/synonyms
regexp_filter = (\w*)'(\w*) => \1'\2
regexp_filter = (\w*)'(\w*) => \1 \2
regexp_filter = (\w*)'(\w*) => \1
regexp_filter = (\w*)'(\w*) => \2
}
Using SPH_MATCH_EXTENDED2
P.S.: Sorry for my bad English.
Problem solved: I had missed a synonym entry in wordforms that rewrote my test word, so it only looked like Sphinx wasn't working correctly. (Facepalm.)
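For reference, wordforms rewrite tokens at both indexing and query time, so a stray mapping can make a word unsearchable by its original form. A hypothetical entry illustrating the effect (made-up words, not from this thread):

qwerty > keyboard

With that entry the indexed token is keyboard, so a prefix query like qwe* finds nothing, which looks exactly like broken search.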

Email Signature - VBscript for Word, with tables

I'm trying to set a company signature and then implement it with GPO.
Here's what I'm trying to accomplish:
John Hancock | Paralegal | Company, PC
<Logo (to the left of text)> 60 Test Street | PO Box 1389 | Testing, PA 19820
Phone: 555.555.5555 | Fax: 555.555.5555 | Email: testing@testing.com (need this hyperlinked)
EDIT: Additional information from comments.
I'm trying to have different attributes (font size, font type, bold, etc) for the text in each particular line within the second row of the table. For example: Test text (this is bold and Calibri) - Test Text 2 (this is not bold and Arial). When I run the script as it stands, I get the logo on the left, in the first column, and a line of text to the right of the logo, in the second column. What I can't figure out is how to add another line of text, on the right, directly underneath the first line, and have that line of text show with different font attributes and such.
Here's the code I have so far:
Set objSysInfo = CreateObject("ADSystemInfo")
Set WshShell = CreateObject("WScript.Shell")
strUser = objSysInfo.UserName
Set objUser = GetObject("LDAP://" & strUser)
strName = objUser.FullName
strFirst = objUser.FirstName
strLast = objUser.LastName
strInitials = objUser.Initials
strOffice = objUser.physicalDeliveryOfficeName
strPOBox = objUser.postOfficeBox
strTitle = objUser.Description
strCred = objUser.info
strStreet = objUser.StreetAddress
strLocation = objUser.l
strPostCode = objUser.PostalCode
strPhone = objUser.TelephoneNumber
strMobile = objUser.Mobile
strFax = objUser.FacsimileTelephoneNumber
strEmail = objUser.mail
strCompany = objUser.Company
Const NUMBER_OF_ROWS = 1
Const NUMBER_OF_COLUMNS = 2
Set objWord = CreateObject("Word.Application")
Set objDoc = objWord.Documents.Add()
Set objSelection = objWord.Selection
Set objEmailOptions = objWord.EmailOptions
Set objSignatureObject = objEmailOptions.EmailSignature
Set objSignatureEntries = objSignatureObject.EmailSignatureEntries
Set objRange = objDoc.Range()
objDoc.Tables.Add objRange, NUMBER_OF_ROWS, NUMBER_OF_COLUMNS
Set objTable = objDoc.Tables(1)
Set objShape = objTable.Cell(1, 1).Range.Hyperlinks.Add(objSelection.InlineShapes.AddPicture("\\eg-fileserver\admin space\signature\logo.jpg"), "http://www.eastburngray.com",,,"")
objTable.Columns(1).Width = 20
objTable.Columns(2).Width = 320
objTable.Cell(1, 2).Range.Font.Bold = True
objTable.Cell(1, 2).Range.Font.Name = "Calibri"
objTable.Cell(1, 2).Range.Font.Size = 10
objTable.Range.ParagraphFormat.SpaceAfter = 0
objTable.Cell(1, 2).Range.Text = strFirst & strInitials & strLast & " | " & strOffice & " | " & strCompany
Set objSelection = objDoc.Range()
objSignatureEntries.Add "Full Signature", objSelection
objSignatureObject.NewMessageSignature = "Full Signature"
objDoc.Saved = True
objWord.Quit
The key to adding text with various formatting in Word is to work with a Range object. You can think of a Range like an invisible Selection, with the major difference that you can have as many Range objects as you need - there can be only one Selection. The trick to changing the formatting is to "collapse" the Range (think of it like pressing the Right- or Left-Arrow keys to a blinking "point", then continuing to type).
Edit Note: Based on bibadia's surmise that this is actually about VBScript and not VBA, I've changed the tags in your question and am editing my answer to fit VBScript. VBScript cannot use Word-specific object declarations and enumerations, so I've removed the "Dim As" declarations and replaced the wd* enums with their Integer equivalents.
Using your code as a starting point, the approach could look something like this:
Dim rngCell
Set rngCell = objTable.Cell(1,2).Range
rngCell.ParagraphFormat.SpaceAfter = 0
rngCell.Text = strFirst & strInitials & strLast & " | " & _
strOffice & " | " & strCompany & vbCr
rngCell.Font.Bold = True
rngCell.Font.Name = "Calibri"
rngCell.Font.Size = 10
rngCell.Collapse 0 'wdCollapseEnd
rngCell.MoveEnd 1, -1 'wdCharacter, -1
rngCell.Text = strPhone & " | " & strFax & " | " & strEmail
rngCell.Font.Bold = False
rngCell.Font.Size = 8
Note 1: The order in which you do things is usually reversed from that when typing as a user: First populate the Range, then apply the formatting.
Note 2: When collapsing at the end of a cell, Word will move the Range position to the beginning of the following cell. Thus, the code moves the point back one character, putting it at the end of the previous (original) cell: rngCell.MoveEnd wdCharacter, -1
Note 3: I added a vbCr at the end of the first rngCell.Text to create the new paragraph within the table cell.
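For the hyperlinked email address the question mentions, one possible approach is Word's standard Hyperlinks.Add method (a sketch, assuming the strEmail value from the script above; the arguments used are the anchor range, the address, and the text to display):

Dim rngEmail
Set rngEmail = objTable.Cell(1, 2).Range
rngEmail.Collapse 0 'wdCollapseEnd
rngEmail.MoveEnd 1, -1 'wdCharacter, -1: stay inside the original cell
objDoc.Hyperlinks.Add rngEmail, "mailto:" & strEmail, , , strEmail

Word applies its built-in Hyperlink character style to the new link, so apply any font overrides after adding it.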

Sphinx internal error/ query not send to searchd

I'm trying to use Sphinx with a service called Questasy (few people will know it). Our Dutch colleagues did this before, and the software definitely supports running searches via Sphinx.
So here's the problem:
I set up the Questasy portal, enabled the Questasy usage, and the portal runs perfectly.
I unpacked Sphinx to C:/Sphinx, created the /data and /log directories.
I set up the config file and ran the indexer. It works.
I installed searchd as a service with the config and it works and runs.
BUT now when I try to search in the portal, it shows me a message like "internal error. Please try again later". When I look into query.log there is nothing in it, so I think the query isn't sent to the searchd service. I checked the config and the port it is listening on, and everything matches what our colleagues have.
Does anybody know of a common bug or problem we might have missed?
Here is my .conf:
# Questasy configuration file for sphinx
#
# To handle the Sphinx requirement that every document have a unique 32-bit ID,
# use a unique number for each index as the first 8 bits, and then use
# the normal index from the database for the last 24 bits.
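# Example: English question text uses index id 1, so a QuestionItem with
# database id 5 gets document id (1<<24)|5 = 16777216 + 5 = 16777221.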
# Here is the list of "index ids"
# 1 - English Question Text
# 2 - Dutch Question Text
# 3 - Concepts
# 4 - Variables
# 5 - Study Units
# 6 - Publications
#
# The full index will combine all of these indexes
#
# COMMANDS
# To index all of the files (when searchd is not running), use the command:
# indexer.exe --config qbase.conf --all
# To index all of the files (when searchd is running), use the command:
# indexer.exe --config qbase.conf --all --rotate
# Set up searchd as a service with the command
# searchd.exe --install --config c:\full\path\to\qbase.conf
# Stop searchd service with the command
# searchd.exe --stop --config c:\full\path\to\qbase.conf
# Remove searchd service with the command
# searchd.exe --delete --config c:\full\path\to\qbase.conf
# To just run searchd for development/testing
# searchd.exe --config qbase.conf
# base class with basic connection information
source base_source
{
type = mysql
sql_host = localhost
sql_user = root
sql_pass =
sql_db = questasy
sql_port = 3306 # optional, default is 3306
}
# Query for English Question Text
source questions_english : base_source
{
sql_query = SELECT ((1<<24)|QuestionItem.id) as id, StudyUnit.id as study_unit_id, QuestionItem.lh_text_1 as question_text, GROUP_CONCAT(Code.lt_label_1 SEPARATOR ' ') as answer_text FROM `question_items` AS `QuestionItem` LEFT JOIN `question_schemes` AS `QuestionScheme` ON (`QuestionItem`.`question_scheme_id` = `QuestionScheme`.`id`) LEFT JOIN `data_collections` AS `DataCollection` ON (`DataCollection`.`id` = `QuestionScheme`.`data_collection_id`) LEFT JOIN `study_units` AS `StudyUnit` ON (`StudyUnit`.`id` = `DataCollection`.`study_unit_id`) LEFT JOIN `response_domains` AS `ResponseDomain` ON (`QuestionItem`.`response_domain_id` = `ResponseDomain`.`id`) LEFT JOIN `code_schemes` As `CodeScheme` ON (`ResponseDomain`.`code_scheme_id` = `CodeScheme`.`id` AND `ResponseDomain`.`domain_type`=4) LEFT JOIN `codes` AS `Code` ON (`Code`.`code_scheme_id` = `CodeScheme`.`id`) WHERE `StudyUnit`.`published` >= 20 GROUP BY QuestionItem.id
sql_attr_uint = study_unit_id
# sql_query_info = SELECT CONCAT('/question_items/view/',$id) AS URL
}
# Query for Dutch Question Text
source questions_dutch : base_source
{
sql_query = SELECT ((2<<24)|QuestionItem.id) as id, StudyUnit.id as study_unit_id, QuestionItem.lh_text_2 as question_text, GROUP_CONCAT(Code.lt_label_2 SEPARATOR ' ') as answer_text FROM `question_items` AS `QuestionItem` LEFT JOIN `question_schemes` AS `QuestionScheme` ON (`QuestionItem`.`question_scheme_id` = `QuestionScheme`.`id`) LEFT JOIN `data_collections` AS `DataCollection` ON (`DataCollection`.`id` = `QuestionScheme`.`data_collection_id`) LEFT JOIN `study_units` AS `StudyUnit` ON (`StudyUnit`.`id` = `DataCollection`.`study_unit_id`) LEFT JOIN `response_domains` AS `ResponseDomain` ON (`QuestionItem`.`response_domain_id` = `ResponseDomain`.`id`) LEFT JOIN `code_schemes` As `CodeScheme` ON (`ResponseDomain`.`code_scheme_id` = `CodeScheme`.`id` AND `ResponseDomain`.`domain_type`=4) LEFT JOIN `codes` AS `Code` ON (`Code`.`code_scheme_id` = `CodeScheme`.`id`) WHERE `StudyUnit`.`published` >= 20 GROUP BY QuestionItem.id
sql_attr_uint = study_unit_id
# sql_query_info = SELECT CONCAT('/question_items/view/',$id) AS URL
}
# Query for Concepts
source concepts : base_source
{
sql_query = SELECT ((3<<24)|Concept.id) as id, Concept.lt_label_1 as concept_label, Concept.lh_description_1 as concept_description FROM `concepts` AS `Concept`
# sql_query_info = SELECT CONCAT('/concepts/view/',$id) AS URL
}
# Query for Data Variable
source variables : base_source
{
sql_query = SELECT ((4<<24)|DataVariable.id) as id, StudyUnit.id as study_unit_id, DataVariable.name as variable_name, DataVariable.lh_label_1 as variable_label FROM `data_variables` AS `DataVariable` LEFT JOIN `variable_schemes` AS `VariableScheme` ON (`DataVariable`.`variable_scheme_id` = `VariableScheme`.`id`) LEFT JOIN `base_logical_products` AS `BaseLogicalProduct` ON (`BaseLogicalProduct`.`id` = `VariableScheme`.`base_logical_product_id`) LEFT JOIN `study_units` AS `StudyUnit` ON (`StudyUnit`.`id` = `BaseLogicalProduct`.`study_unit_id`) WHERE `StudyUnit`.`published` >= 15
sql_attr_uint = study_unit_id
# sql_query_info = SELECT CONCAT('/data_variables/view/',$id) AS URL
}
# Query for Study Units
source study_units : base_source
{
sql_query = SELECT ((5<<24)|StudyUnit.id) as id, StudyUnit.id as study_unit_id, StudyUnit.fulltitle as study_unit_name, StudyUnit.subtitle as study_unit_subtitle, StudyUnit.alternate_title AS study_unit_alternatetitle, StudyUnit.lh_note_1 as study_unit_note, StudyUnit.lh_purpose_1 as study_unit_purpose, StudyUnit.lh_abstract_1 as study_unit_abstract, StudyUnit.creator as study_unit_creator FROM study_units AS StudyUnit WHERE `StudyUnit`.`published` >= 10
sql_attr_uint = study_unit_id
# sql_query_info = SELECT CONCAT('/study_units/view/',$id) AS URL
}
# Query for Publications
source publications : base_source
{
sql_query = SELECT ((6<<24)|Publication.id) as id, Publication.id as publication_id, Publication.title as publication_name, Publication.subtitle as publication_subtitle, Publication.creator as publication_creator, Publication.contributor as publication_contributor, Publication.abstract as publication_abstract, Publication.lh_note_1 as publication_note, Publication.source as publication_source FROM publications AS Publication WHERE NOT(`Publication`.`accepted_timestamp` IS NULL)
# sql_query_info = SELECT CONCAT('/publications/view/',$id) AS URL
}
# Query for Hosted Files - Other materials
source other_materials : base_source
{
sql_query = SELECT ((7<<24)|HostedFile.id) as id, OtherMaterial.title as hosted_file_title, HostedFile.name as hosted_file_name, StudyUnit.id as study_unit_id FROM `hosted_files` as `HostedFile`, `other_materials` as OtherMaterial, `study_units` as `StudyUnit` WHERE OtherMaterial.hosted_file_id = HostedFile.id AND OtherMaterial.study_unit_id = StudyUnit.id AND `StudyUnit`.`published` >= 20
sql_attr_uint = study_unit_id
# sql_query_info = SELECT CONCAT('/hosted_files/download/',$id) AS URL
}
# Query for Hosted Files - Datasets
source physical_instances : base_source
{
sql_query = SELECT ((8<<24)|HostedFile.id) as id, PhysicalInstance.name as hosted_file_name, StudyUnit.id as study_unit_id FROM `hosted_files` as `HostedFile`, `physical_instances` as PhysicalInstance, `study_units` as `StudyUnit` WHERE PhysicalInstance.hosted_file_id = HostedFile.id AND PhysicalInstance.study_unit_id = StudyUnit.id AND `StudyUnit`.`published` >= 20
sql_attr_uint = study_unit_id
# sql_query_info = SELECT CONCAT('/hosted_files/download/',$id) AS URL
}
# Query for Physical Data Products (Variable Schemes)
source physical_data_products : base_source
{
sql_query = SELECT ((9<<24)| PhysicalDataProduct.id) as id, PhysicalDataProduct.name FROM `physical_data_products` AS `PhysicalDataProduct`, `study_units` as `StudyUnit` WHERE PhysicalDataProduct.study_unit_id = StudyUnit.id AND PhysicalDataProduct.deleted = 0 AND StudyUnit.published >= 20
}
# English Question Text Index
index questions_english_index
{
source = questions_english
path = C:\Sphinx\data\questions_english_index
docinfo = extern
mlock = 0
morphology = stem_en
min_word_len = 3
min_prefix_len = 0
min_infix_len = 3
# enable_star = 1
html_strip = 1
# charset_type = utf-8
}
# Dutch Question Text Index
index questions_dutch_index
{
source = questions_dutch
path = C:\Sphinx\data\questions_dutch_index
docinfo = extern
mlock = 0
morphology = stem_en
min_word_len = 3
min_prefix_len = 0
min_infix_len = 3
# enable_star = 1
html_strip = 1
# charset_type = utf-8
}
# Concept Index
index concepts_index
{
source = concepts
path = C:\Sphinx\data\concepts_index
docinfo = extern
mlock = 0
morphology = stem_en
min_word_len = 3
min_prefix_len = 0
min_infix_len = 3
# enable_star = 1
html_strip = 1
# charset_type = utf-8
}
# Variable Index
index variables_index
{
source = variables
path = C:\Sphinx\data\variables_index
docinfo = extern
mlock = 0
morphology = stem_en
min_word_len = 3
min_prefix_len = 0
min_infix_len = 3
# enable_star = 1
html_strip = 1
# charset_type = utf-8
}
# Study Unit Index
index study_units_index
{
source = study_units
path = C:\Sphinx\data\study_units_index
docinfo = extern
mlock = 0
morphology = stem_en
min_word_len = 3
min_prefix_len = 0
min_infix_len = 3
# enable_star = 1
html_strip = 1
# charset_type = utf-8
}
# Publication Index
index publications_index
{
source = publications
path = C:\Sphinx\data\publications_index
docinfo = extern
mlock = 0
morphology = stem_en
min_word_len = 3
min_prefix_len = 0
min_infix_len = 3
# enable_star = 1
html_strip = 1
# charset_type = utf-8
}
# Other Materials Index
index other_materials_index
{
source = other_materials
path = C:\Sphinx\data\other_materials_index
docinfo = extern
mlock = 0
morphology = stem_en
min_word_len = 3
min_prefix_len = 0
min_infix_len = 3
# enable_star = 1
html_strip = 1
# charset_type = utf-8
}
# Datasets file Index
index physical_instances_index
{
source = physical_instances
path = C:\Sphinx\data\physical_instances_index
docinfo = extern
mlock = 0
morphology = stem_en
min_word_len = 3
min_prefix_len = 0
min_infix_len = 3
# enable_star = 1
html_strip = 1
# charset_type = utf-8
}
# Datasets Index
index physical_data_products_index
{
source = physical_data_products
path = C:\Sphinx\data\physical_data_products_index
docinfo = extern
mlock = 0
morphology = stem_en
min_word_len = 3
min_prefix_len = 0
min_infix_len = 3
# enable_star = 1
html_strip = 1
# charset_type = utf-8
}
# Full Index - merge all of the other indexes
index full_index
{
type = distributed
local = questions_english_index
local = questions_dutch_index
local = concepts_index
local = variables_index
local = study_units_index
local = publications_index
local = other_materials_index
local = physical_instances_index
local = physical_data_products_index
}
indexer
{
# memory limit, in bytes, kilobytes (16384K) or megabytes (256M)
# optional, default is 32M, max is 2047M, recommended is 256M to 1024M
mem_limit = 256M
# maximum IO calls per second (for I/O throttling)
# optional, default is 0 (unlimited)
#
# max_iops = 40
# maximum IO call size, bytes (for I/O throttling)
# optional, default is 0 (unlimited)
#
# max_iosize = 1048576
}
# Settings for the searchd service
searchd
{
# port = 3312
log = C:\Sphinx\log\searchd.log
query_log = C:\Sphinx\log\query.log
pid_file = C:\Sphinx\log\searchd.pid
listen = 127.0.0.1
}
# C:\Sphinx\bin\searchd --config C:\xampp\htdocs\sphinx\vendors\questasy.conf
Thanks in advance
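A minimal way to check whether searchd itself answers, bypassing the portal entirely (a sketch using the standard PHP SphinxClient API; sphinxapi.php ships with Sphinx, and the port follows the commented-out "port = 3312" above, so adjust it to whatever your searchd actually listens on):

require('sphinxapi.php');
$cl = new SphinxClient();
$cl->SetServer('127.0.0.1', 3312);
$result = $cl->Query('test', 'full_index');
if ($result === false)
    echo 'Query failed: ' . $cl->GetLastError() . "\n"; // connection error: service/port problem
else
    echo 'Matches: ' . $result['total'] . "\n"; // searchd answers; the problem is in the portal

If this query shows up in query.log while the portal's searches still do not, the portal is pointing at the wrong host or port.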

Sphinx autocomplete search

I'm trying to do a Google-style autocomplete search with Sphinx and AJAX.
Say a user is looking for an iPhone. The goal is that input like "ip", "iph", or "ipho" should return the result, but it does not, while "iphon" or "iphone" does.
So, what am I doing wrong here?
index product
{
source = product
path = /var/lib/sphinx/product
docinfo = extern
mlock = 0
morphology = stem_enru
min_word_len = 2
charset_type = utf-8
charset_table = 0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F
min_prefix_len = 2
max_substring_len = 6
enable_star = 1
}
and the query
$sphinx = new SphinxClient();
$sphinx->SetLimits(0, 1500, 2500);
$sphinx->SetServer('localhost', 9312);
$sphinx->SetMatchMode(SPH_MATCH_EXTENDED);
$sphinx->SetSortMode(SPH_SORT_RELEVANCE);
$sphinx->SetFieldWeights(array ('name' => 30, 'brand' => 20, 'parent_name' => 10, 'description' => 5));
$result = $sphinx->Query($string, '*');
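One thing worth checking (a sketch, not from the thread): with enable_star = 1, prefix matching only applies to query terms that carry an explicit trailing star, so the partial input needs a * appended before it is sent to searchd:

$term = $sphinx->EscapeString(trim($string));
$result = $sphinx->Query($term . '*', '*'); // "ip*" can now match the indexed prefixes of "iphone"

The stemmer (morphology = stem_enru) may also be why "iphon" matches even without a star.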

Compute the Frequency of bigrams in Matlab

I am trying to compute and plot the distribution of bigram frequencies.
First I generate all possible bigrams, which gives 1296 bigrams.
Then I extract the bigrams from a given file and save them in words1.
My question is: how do I compute the frequency of these 1296 bigrams for the file a.txt?
If some bigrams do not appear in the file at all, their frequencies should be zero.
a.txt is any text file.
clear
clc
%************create bigrams 1296 ***************************************
chars ='1234567890abcdefghijklmnopqrstuvwxyz';
chars1 ='1234567890abcdefghijklmnopqrstuvwxyz';
bigram='';
for i=1:36
for j=1:36
bigram = sprintf('%s%s%s',bigram,chars(i),chars1(j));
end
end
temp1 = regexp(bigram, sprintf('\\w{1,%d}', 1), 'match');
temp2 = cellfun(@(x,y) [x '' y],temp1(1:end-1)', temp1(2:end)','un',0);
bigrams = temp2;
bigrams = unique(bigrams);
bigrams = rot90(bigrams);
bigram = char(bigrams(1:end));
all_bigrams_len = length(bigrams);
clear temp temp1 temp2 i j chars1 chars;
%****** 1. Cleaning Data ******************************
collection = fileread('e:\a.txt');
collection = regexprep(collection,'<.*?>','');
collection = lower(collection);
collection = regexprep(collection,'\W','');
collection = strtrim(regexprep(collection,'\s*',''));
%*******************************************************
temp = regexp(collection, sprintf('\\w{1,%d}', 1), 'match');
temp2 = cellfun(@(x,y) [x '' y],temp(1:end-1)', temp(2:end)','un',0);
words1 = rot90(temp2);
%*******************************************************
words1_len = length(words1);
vocab1 = unique(words1);
vocab_len1 = length(vocab1);
[vocab1,void1,index1] = unique(words1);
frequencies1 = hist(index1,vocab_len1);
I. Character counting problem for a string
bsxfun based solution for counting characters -
counts = sum(bsxfun(@eq,[string1-0]',65:90))
Output -
counts =
2 0 0 0 0 2 0 1 0 0 ....
If you would like to get a tabulate output of counts against each letter -
out = [cellstr(['A':'Z']') num2cell(counts)']
Output -
out =
'A' [2]
'B' [0]
'C' [0]
'D' [0]
'E' [0]
'F' [2]
'G' [0]
'H' [1]
'I' [0]
....
Please note that this was a case-sensitive counting for upper-case letters.
For lower-case letter counting, use this edit of the earlier code -
counts = sum(bsxfun(@eq,[string1-0]',97:122))
For case-insensitive counting, use this -
counts = sum(bsxfun(@eq,[upper(string1)-0]',65:90))
II. Bigram counting case
Let us suppose that you have all the possible bigrams saved in a 1D cell array bigrams1 and the incoming bigrams from the file are saved into another cell array words1. Let us also assume certain values in them for demonstration -
bigrams1 = {
'ar';
'de';
'c3';
'd1';
'ry';
't1';
'p1'}
words1 = {
'de';
'c3';
'd1';
'r9';
'yy';
'de';
'ry';
'de';
'dd';
'd1'}
Now, you can get the counts of the bigrams from words1 that are present in bigrams1 with this code -
[~,~,ind] = unique(vertcat(bigrams1,words1));
bigrams_lb = ind(1:numel(bigrams1)); %// label bigrams1
words1_lb = ind(numel(bigrams1)+1:end); %// label words1
counts = sum(bsxfun(@eq,bigrams_lb,words1_lb'),2)
out = [bigrams1 num2cell(counts)]
The output on code run is -
out =
'ar' [0]
'de' [3]
'c3' [1]
'd1' [2]
'ry' [1]
't1' [0]
'p1' [0]
The result shows that the first element ar from the list of all possible bigrams has no occurrence in words1; the second element de has three occurrences in words1, and so on.
Similar to Dennis' solution, you can just use histc():
string1 = 'ASHRAFF'
histc(string1,'ABCDEFGHIJKLMNOPQRSTUVWXYZ')
This counts the number of entries in the bins defined by the string 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', which is hopefully the alphabet (just wrote it fast, so no guarantee). The result is:
Columns 1 through 21
2 0 0 0 0 2 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0
Columns 22 through 26
0 0 0 0 0
Just a little modification of my solution:
string1 = 'ASHRAFF'
alphabet1='A':'Z'; %%// as stated by Oleg Komarov
data=histc(string1,alphabet1);
results=cell(2,26);
for k=1:26
results{1,k}= alphabet1(k);
results{2,k}= data(k);
end
If you look at results now, you can easily check whether it works or not :D
This answer creates all bigrams, loads in the file, does a little cleanup, and then uses a combination of unique and histc to count the rows.
Generate all Bigrams
Note the order here is important: unique will sort the array, so it is created presorted this way and the output matches expectations.
[y,x] = ndgrid(['0':'9','a':'z']);
allBigrams = [x(:),y(:)];
Read The File
This lower-cases the text, pulls out any 0-9 or a-z character, and creates a column vector of these.
fileText = lower(fileread('d:\loremipsum.txt'));
cleanText = regexp(fileText,'([a-z0-9])','tokens');
cleanText = cell2mat(vertcat(cleanText{:}));
Create bigrams from the file by shifting by one position and concatenating.
fileBigrams = [cleanText(1:end-1),cleanText(2:end)];
Get Counts
The set of all bigrams is appended to the bigrams from the file (so values are created for every possible bigram). Then a value ∈ {1,2,...,1296} is assigned to each unique row using unique's third output. Counts are then created with histc, with the bins equal to the set of values from unique's output; 1 is subtracted from each bin to remove the complete set of bigrams we added.
[~,~,c] = unique([fileBigrams;allBigrams],'rows');
counts = histc(c,1:1296)-1;
Display
To view counts against the text:
[allBigrams, counts+'0']
or for something potentially more useful...
[sortedCounts,sortInd] = sort(counts,'descend');
[allBigrams(sortInd,:), sortedCounts+'0']
ans =
or9
at8
re8
in7
ol7
te7
do6 ...
Did not look into the entire code fragment, but from the example at the top of your question, I think you are looking to make a histogram:
string1 = 'ASHRAFF'
nr = histc(string1,'A':'Z')
Will give you:
2 0 0 0 0 2 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0
(Got a working solution with hist, but as @The Minion shows, histc is easier to use here.)
Note that this solution only deals with upper case letters.
You may want to do something like this if you want to put lower-case letters in their correct bins:
string1 = 'ASHRAFF'
nr = histc(upper(string1),'A':'Z')
Or if you want them to be shown separately:
string1 = 'ASHRaFf'
nr = histc(upper(string1),['a':'z' 'A':'Z'])
% Map the counted frequencies of the file's unique bigrams (vocab1)
% onto the full list of 1296 possible bigrams; absent bigrams stay 0.
bi_freq1 = zeros(1,all_bigrams_len);
for k=1: vocab_len1
for i=1:all_bigrams_len
if char(vocab1(k)) == char(bigrams(i))
bi_freq1(i) = frequencies1(k);
end
end
end