Indexing Server problem - indexing-service

I've got an HTML file which I'm having a problem with through Index Server. Here is the text in question.
(B) $10,800 per linear mile for
(C) $40,000 per linear mile for any
My problem is with the amounts in bold ($10,800 and $40,000).
If I search for 800, it finds the document
If I search for $10, it finds the document
If I search for $40, it finds the document
If I search for $10,800, 10,800, $40,000 or 40,000 it will not find the document, regardless of whether I surround the text with double quotes.
I thought perhaps it was the comma, but other documents in the Index Server catalog contain 40,000 and a search for it finds them.
Any ideas?

NEVER MIND!!!
I finally figured it out (after about two days). There was a series of non-breaking spaces (&nbsp;) right before the text in question. I replaced the last one with a regular space and re-indexed the file - I guess the Indexing Service isn't smart enough to ignore those or to treat them as GASP - A SPACE!
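For anyone who hits the same thing, a quick clean-up along these lines would have saved me the two days (a sketch only; the file name and the decision to also handle the numeric form &#160; are my own assumptions):
# fix_nbsp.py - replace non-breaking space entities with plain spaces before re-indexing
# (sketch; the file name is hypothetical and encoding handling is simplified)
import re

path = "problem_page.html"
with open(path, "r") as f:
    html = f.read()

# &nbsp; and its numeric form &#160; both render as non-breaking spaces;
# swapping them for plain spaces gives the indexer a real word boundary
html = re.sub(r"&nbsp;|&#160;", " ", html)

with open(path, "w") as f:
    f.write(html)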

Related

How to remove words from a document on a column-by-column basis instead of whole lines in Word

Perhaps a stupid question, but I have a document with a large number of numerical values arranged in columns (although not in Word's actual column formatting), and I want to delete certain columns while leaving one intact. Here's a link to a part of my document.
Data
As can be seen there are four columns and I only want to keep the 3rd column, but when I select any of this in Word, it selects the whole line. Is there a way I can select data in Word as a column, rather than as whole lines? If not, can this be done in other word processing programs?
Generally, spreadsheet apps or subprograms are what you need for deleting and modifying data in column or row format.
Microsoft's spreadsheet equivalent is Excel, part of the same Microsoft Office suite that Word comes with. I believe Google Docs has a free spreadsheet tool online as well.
I have not looked at the uploaded file, but if it is small enough, you might be able to paste one row of data at a time into a spreadsheet, and then do your operation on the column data all at once.
There may be other solutions to this problem, but that's a start.
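If you'd rather not round-trip through a spreadsheet, here's a rough sketch of pulling out only the third column from a plain-text copy of the data (this assumes the columns are separated by spaces or tabs, which I haven't verified against the uploaded file):
# keep_third_column.py - keep only the 3rd whitespace-separated column
# (sketch; assumes the data was saved as plain text with space/tab separated columns)
with open("data.txt") as src, open("third_column.txt", "w") as dst:
    for line in src:
        fields = line.split()
        if len(fields) >= 3:
            dst.write(fields[2] + "\n")   # fields[2] is the 3rd column (0-based index)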

Do stopwords interfere with queries that use SENTENCE in Sphinx?

I do not have Sphinx installed yet; I am still reviewing whether it makes sense for my use case, so pardon a question I could probably answer on my own otherwise.
If I make the period a stopword (so I can match e.g. U.S. to US), will a query that uses SENTENCE still work? In other words, will the query ignore the period because it is a stopword and thus fail to recognize the end-of-sentence boundary?

How much data can I store per node in Neo4j?

I need to save big chunks of Unicode text strings in Neo4j nodes. The documentation does not mention anything about the size of data one can store per node.
Does anyone know that?
I just tried the following with the Neo4j web interface:
I wrote a line of 26 characters and duplicated it through 32000 lines, which makes a total of 832000 characters.
I created a node with a property "text" and copied my text in it, and it worked perfectly.
I tried again with 64000 lines with white space at the end of each line, for a total of 1728000 characters. I created a new node, then queried the node and copied the result back into a file to check the size (you never know), and wc gave me 1728001 (the extra one is presumably an artifact of the copy/paste process).
It didn't seem to complain.
FYI this is equivalent to a text of 345600 words with an average size of 4 letters plus a space (5 characters), i.e. more than a 1000-page book with 300 words per page.
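If you want to repeat the experiment outside the web interface, a minimal sketch with the official Neo4j Python driver might look like the following (the connection URI, credentials and the Document label are assumptions, not part of the original test):
# big_text_node.py - store a large text blob as a single node property (sketch)
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))  # assumed URI/credentials
big_text = "abcdefghijklmnopqrstuvwxyz\n" * 64000   # 27 characters per line = 1728000 characters

with driver.session() as session:
    session.run("CREATE (d:Document {text: $text})", text=big_text)
    record = session.run("MATCH (d:Document) RETURN d.text AS text LIMIT 1").single()
    print(len(record["text"]))   # sanity-check that the full text comes back

driver.close()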
I don't know, however, how this could impact performance if there are too many nodes. If it doesn't work well because of this, you could always consider using Neo4j only for storing information about relationships, with an ID property acting as a key into another, document-oriented database from which to retrieve the text (or simply the path of a file as a path property).
Neo4j is by default indexed using Lucene. Lucene was built as a full text search toolbox (with Solr being the de facto search engine implementation). Since Lucene was intended to search over large amounts of text, my suspicion is that you can put as much text into a node as you want and it will work just fine.
Neo4j is a very nice solution for managing relationships between objects. As you may already know, these relationships can have properties, as can the nodes themselves. But I think you cannot store "a big chunk" of data on these nodes. I think Neo4j is intended to be used alongside another database such as MongoDB or even MySQL: you get the information you need first "really fast", then look the rest up using another engine. On my projects I store usernames, names, dates of birth, IDs, and that kind of information, but not very large text strings.

Copying contents of one Word document into other documents with different names

I have Word 2003. I basically have over 100 documents with contents pertaining to a specific unit of a process, say Process 1. I have four other areas with different names, but the content will stay the same. How do I copy multiple contents from the first set of documents to the remaining 3 without changing the name?
There are two tables in the document, I only want to copy the second table from the first set of documents to each of the other documents. The first table has the file name and other info in it that needs to stay unique to that document.
Any help would be greatly appreciated as I have thousands of these that will need to be copied over eventually, and doing it manually would pretty much kill me.
Thank you,
David at Work
I'm assuming the documents are .doc binary files (since you say Word 2003).
Given this, your two most feasible options are to do something from within Word (e.g. a macro or add-in), or to automate Word externally.
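As a very rough sketch of the "automate Word" route (this assumes pywin32 on Windows; the folder paths are hypothetical, and it takes the question to mean that each target document keeps its own first table and has its second table replaced):
# copy_second_table.py - replace the 2nd table in target .doc files with the 2nd table of a source .doc
# Sketch only: paths are hypothetical and error handling is omitted.
import glob
import win32com.client

word = win32com.client.Dispatch("Word.Application")
word.Visible = False

src = word.Documents.Open(r"C:\process1\unit001.doc")        # document that holds the table to copy
for path in glob.glob(r"C:\process2\*.doc"):                  # documents that should receive it
    dst = word.Documents.Open(path)
    # Word collections are 1-based, so Tables(2) is the second table;
    # assigning FormattedText replaces the target table's content in place.
    dst.Tables(2).Range.FormattedText = src.Tables(2).Range.FormattedText
    dst.Save()
    dst.Close()

src.Close(False)   # close without saving the source
word.Quit()
The FormattedText assignment avoids the clipboard, which matters when you are churning through thousands of files.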

How to reduce the size of an sqlite3 database for iPhone?

edit: many thanks for all the answers. Here are the results after applying the optimisations so far:
Switching to sorting the characters and run-length encoding - new DB size 42M
Dropping the indexes on the booleans - new DB size 33M
The really nice part is that this hasn't required any changes in the iPhone code
I have an iPhone application with a large dictionary held in sqlite format (read only). I'm looking for ideas to reduce the size of the DB file, which is currently very large.
Here is the number of entries and resulting size of the sqlite DB:
franks-macbook:DictionaryMaker frank$ ls -lh dictionary.db
-rw-r--r-- 1 frank staff 59M 8 Oct 23:08 dictionary.db
franks-macbook:DictionaryMaker frank$ wc -l dictionary.txt
453154 dictionary.txt
...an average of about 135 bytes per entry.
Here is my DB schema:
create table words (word text primary key, sowpods boolean, twl boolean, signature text)
create index sowpods_idx on words(sowpods)
create index twl_idx on words(twl)
create index signature_idx on words(signature)
Here is some sample data:
photoengrave|1|1|10002011000001210101010000
photoengraved|1|1|10012011000001210101010000
photoengraver|1|1|10002011000001210201010000
photoengravers|1|1|10002011000001210211010000
photoengraves|1|1|10002011000001210111010000
photoengraving|1|1|10001021100002210101010000
The last field represents the letter frequencies for anagram retrieval (each position is in the range 0..9). The two booleans represent sub dictionaries.
I need to do queries such as:
select signature from words where word = 'foo'
select word from words where signature = '10001021100002210101010000' order by word asc
select word from words where word like 'foo' order by word asc
select word from words where word = 'foo' and (sowpods='1' or twl='1')
One idea I have is to encode the letter frequencies more efficiently, e.g. binary-encode them as a blob (perhaps with RLE, as there are many zeros?). Any ideas for how best to achieve this, or other ideas to reduce the size? I am building the DB in Ruby, and reading it on the phone in Objective-C.
Also is there any way to get stats on the DB so I can see what is using the most space?
Have you tried running the VACUUM command to make sure you don't have extra space in the db you forgot to reclaim?
Remove the indexes on sowpods and twl -- they are probably not helping your query times and are definitely taking lots of space.
You can get stats on the database using sqlite3_analyzer from the SQLite downloads page.
As a totally different approach, you could try using a bloom filter instead of a comprehensive database. Basically, a bloom filter consists of a bunch of hash functions, each of which is associated with a bitfield. For each legal word, each hash function is evaluated, and the corresponding bit in the corresponding bit field is set. Drawback is it's theoretically possible to get false positives, but those can be minimized/practically eliminated with enough hashes. Plus side is a huge space savings.
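To make that concrete, here's a toy sketch of a bloom filter for word membership (the bitfield size and hash count are arbitrary examples, not tuned values):
# bloom_sketch.py - toy bloom filter for word membership (illustrative sizes only)
import hashlib

BITS = 1 << 22          # ~4 million bits = 512 KB, an arbitrary example size
NUM_HASHES = 7          # arbitrary; more hashes lower the false-positive rate up to a point
bitfield = bytearray(BITS // 8)

def positions(word):
    # derive NUM_HASHES bit positions from independently salted hashes
    for i in range(NUM_HASHES):
        digest = hashlib.sha256(f"{i}:{word}".encode()).digest()
        yield int.from_bytes(digest[:8], "big") % BITS

def add(word):
    for pos in positions(word):
        bitfield[pos // 8] |= 1 << (pos % 8)

def probably_contains(word):
    # False means definitely absent; True means present (with a small false-positive chance)
    return all(bitfield[pos // 8] & (1 << (pos % 8)) for pos in positions(word))

add("photoengrave")
print(probably_contains("photoengrave"), probably_contains("xyzzyq"))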
I'm not clear on all the use cases for the signature field but it seems like storing an alphabetized version of the word instead would be beneficial.
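For example, an alphabetized (and optionally run-length-encoded) signature could look something like this (a sketch; "photoengrave" is taken from the sample data above):
# signature_rle.py - sketch of a sorted-characters-plus-run-length-encoded signature
from itertools import groupby

def signature(word):
    # sort the letters, then collapse runs: "photoengrave" -> "ae2ghno2prtv"
    parts = []
    for letter, run in groupby(sorted(word)):
        count = len(list(run))
        parts.append(letter if count == 1 else f"{letter}{count}")
    return "".join(parts)

print(signature("photoengrave"))   # anagram lookups compare these instead of 26-digit strings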
The creator of SQLite sells a version of SQLite that includes database compression (and encryption). This would be perfect.
Your best bet is to use compression, which unfortunately SQLite does not support natively at this point. Luckily, someone took the time to develop a compression extension for it which could be what you need.
Otherwise I'd recommend storing your data mostly in compressed format and uncompressing on the fly.
As a text field, signature is currently using at least 26 bytes per entry (26 characters * 8 bits = 208 bits), but if you were to pack the data into a bitfield, you could probably get away with only 3 bits per letter (reducing your maximum frequency per letter to 7). That would mean you could pack the entire signature into 26 * 3 bits = 78 bits, which rounds up to 10 bytes. Even if you used 4 bits per letter (for a maximum frequency of 15 per letter) you would only use 104 bits (13 bytes).
EDIT: After a bit more thought, I think 4 bits per letter (instead of 3) would be a better idea because it would make the binary math easier.
EDIT2: Reading through the docs on SQLite data types, it seems that you might be able to just make the "signature" field span 26 columns of type INTEGER and SQLite will do the right thing and only use as many bytes as required to store each value.
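Here's a sketch of the 4-bits-per-letter packing mentioned above (shown in Python for illustration, though the asker builds the DB in Ruby; it assumes no letter appears more than 15 times in a word and that the packed value is stored in a BLOB column):
# pack_signature.py - pack a 26-digit frequency signature into a 13-byte blob (4 bits per letter)
def pack_signature(sig):
    # sig is the existing 26-character string of digits, e.g. "10002011000001210101010000"
    counts = [int(c) for c in sig]           # assumes every count fits in 4 bits (0..15)
    packed = bytearray()
    for i in range(0, 26, 2):
        packed.append((counts[i] << 4) | counts[i + 1])   # two letters per byte
    return bytes(packed)                      # 13 bytes instead of 26

def unpack_signature(blob):
    counts = []
    for byte in blob:
        counts.extend([byte >> 4, byte & 0x0F])
    return "".join(str(c) for c in counts)

sig = "10002011000001210101010000"
assert unpack_signature(pack_signature(sig)) == sig
print(len(pack_signature(sig)))   # 13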
Do I reckon correctly that you have about 450K words like that in your database?
I've got no clue about the iPhone, and I'm not that serious about SQLite either, but... as long as SQLite doesn't offer a way to save the file gzipped right away (maybe it already does internally? no, it doesn't look like it when you say it's about 135 bytes per entry, not even with both indexes), I would move away from the table approach, save the dictionary "manually" in a compressed form, and build the rest on the fly and in memory. That should perform VERY well on your type of data.
Wait... Are you using that signature to allow full-text searching or mistyping recognition? Wouldn't full-text search in SQLite make that field obsolete?
As noted, storing "signature" more efficiently seems like a good idea.
However, it also seems like you could gain a ton of space by using some kind of lookup table for words. Since you seem to be taking a root word and then appending "er", "ed", "es", etc., why not have a column with a numeric ID that references a root word in a separate lookup table, and another numeric ID column that references a table of common word suffixes to be appended to the base word?
If there are any tricks for storing shorthand versions of signatures shared by multiple entries with a single root word, you could also employ those to reduce the size of stored signatures (I'm not sure what algorithm is producing those values).
This also seems to make a lot of sense to me: rather than keeping the "word" column as the primary key, just create a separate numeric column that is the primary ID for the table.
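A rough sketch of that normalized layout (the table and column names are made up, and it assumes the suffixes really do come from a small shared set):
# normalized_schema.py - sketch of a root-word / suffix lookup layout in SQLite
# Table and column names are hypothetical; adapt to the real data.
import sqlite3

db = sqlite3.connect("dictionary_normalized.db")
db.executescript("""
create table roots    (root_id integer primary key, root text unique);
create table suffixes (suffix_id integer primary key, suffix text unique);
create table words    (word_id integer primary key,
                       root_id integer references roots(root_id),
                       suffix_id integer references suffixes(suffix_id),
                       sowpods boolean, twl boolean, signature blob);
""")
db.executemany("insert into roots(root) values (?)", [("photoengrave",)])
db.executemany("insert into suffixes(suffix) values (?)", [("",), ("d",), ("r",), ("s",)])
db.commit()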
mhmm... an iPhone... doesn't it have a permanent data connection?
I think this is where a web application/web service can jump in snugly.
Move most of your business logic to the web server (it will have real SQL with FTS and looooots of memory) and fetch that info over the connection to the client on the device.
As mentioned elsewhere, lose the indexes on the boolean columns; they will almost certainly be slower (if used at all) than a table scan and are going to use space needlessly.
I'd consider applying simple compression to the words; Huffman coding is pretty good for this sort of thing. Also, I'd look at the signatures: sort the columns in letter-frequency order and don't bother storing trailing zeroes, which can be implied. I guess you could Huffman-encode those, too.
Always assuming your encoded strings don't upset SQLite, of course.
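To make the frequency-order-plus-implied-trailing-zeroes idea concrete, a tiny sketch (the particular letter ordering here is just an example):
# freq_order_signature.py - reorder signature positions by letter frequency and drop trailing zeroes
FREQ_ORDER = "etaoinshrdlcumwfgypbvkjxqz"   # an example English-frequency ordering (assumption)

def compact(sig):
    # sig is the original a..z ordered 26-digit string
    counts = {chr(ord('a') + i): sig[i] for i in range(26)}
    reordered = "".join(counts[letter] for letter in FREQ_ORDER)
    return reordered.rstrip("0")              # trailing zeroes are implied

print(compact("10002011000001210101010000"))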