Truncating a mergefield in Microsoft Word for mailmerge

I'm looking to create a formula and apply it to a mergefield in Word 2013. I have no access to the database and am only given merge fields. My goal is to truncate/shorten certain mergefields, e.g. Full Name to just the initial, or Age [27 years] to a short age [27].
Access and Excel have the formula 'left', which I've tried to use with no success. There seem to be a lot more options available for numbers.
{=left({ MERGEFIELD First_Name },1)}
However, this gives a syntax error. Is there a list of formulas that work with mergefields?
Outcome
'Steven' -> 'S'
'27 Years' -> '27'

The short answer to your question is that there is nothing in the Word field language that can reliably do string manipulation such as left(), mid() and so on. The {=} field only has numeric functions such as SUM, ABS, PRODUCT and so on. There are unreliable approaches, and in some cases they may be reliable enough for your requirement, but that really depends on how sure you can be that the data source will always contain values formatted as you expect.
As a simple example, let's take the "27 years" thing.
If every value in the relevant data source column is in the same general format, which I will describe as "something that Word recognises as an individual number, followed by an alpha string", then you can in fact use
{ SET dat { MERGEFIELD age } }{ =dat }
Notice that in that case, if you are merging to a new document, { =dat } fields will remain in the output, and updating those fields will cause errors. You can avoid that by nesting either the { =dat } field or all the fields in a QUOTE field:
{ QUOTE "{ SET dat { MERGEFIELD age } }{ =dat }" }
{ SET dat { MERGEFIELD age } }{ QUOTE { =dat } }
However, if your data source field could contain a value such as
4 years 2 months
then this will not work, because in that case { =dat } will evaluate to 6, not 4. Word will also evaluate anything that looks like an { = } field expression, e.g. if your data source contains
SUM(23,25)
then { =dat } will evaluate to 48. There are further oddities that I will not describe now.
The simplest unreliable approach to extracting the first letter from a field is to use a large number of IF fields to test for every possible initial letter, e.g.
{ IF "{ MERGEFIELD First_Name }" = "A*" "A" }{ IF "{ MERGEFIELD First_Name }" = "B*" "B" } etc.
If you don't need to distinguish between lower and upper case you can use
{ IF "{ MERGEFIELD First_Name \*Upper }" = "A*" "A" } etc.
That's OK if you know (for example) that the names can only start with A-Z or a-z (and you could obviously test for 0-9 etc. as well). But what if a name could start with any Unicode letter? I am not sure that inserting thousands of IF fields is a reliable approach.
There is a similarly "unreliable" - and resource-consuming - way to use functions such as left, mid etc., as long as you are using recent versions of Windows Word (not Mac Word).
What you can do is create a completely empty Access/Jet database .mdb (let's say it is at c:\i\i.mdb), then insert a DATABASE field nested in a QUOTE field like this:
{ QUOTE { DATABASE \d "c:\\i\\i.mdb" \s "SELECT left('{ MERGEFIELD First_Name }',1)" } }
Normally, a DATABASE field inserts a Word table (unless the data source has more columns than a Word table can contain), but when you only insert a single value with no headings, Word does not put the value in a cell. Unfortunately, these days Word does add a paragraph mark, but nesting the DATABASE field inside a QUOTE field appears to remove that again.
So why is that "unreliable"? Well, the main reason is that if the First_Name field contains any quotation marks (certainly single quotation marks, and OTTOMH I think double quotation marks too), then the query that Word sends to Jet will look like this:
SELECT left('a name containing a ' mark',1)
and Jet will return a syntax error.
There are other problems with the DATABASE field approach, including
Word restricts the SELECT statement to 255 characters (I think). If your data source field causes the SELECT statement length to exceed that, Jet will return an error.
You have to put the database somewhere. If you are just using this merge yourself, that may not be a problem, but if you have to distribute the Word document etc. for others to use, you also have to ensure they have the .mdb and that it's at the specified location.
Word sometimes gets confused between a Mail Merge data source and a data source introduced via a DATABASE field.
Even one DATABASE field will execute a query for every record in the data source. If you use this technique in several places, a very large number of queries will be issued. That could cause problems.
As far as "single letter extraction" is concerned, there is another approach, rather similar to the DATABASE one, that uses an external .XML file and a set of INCLUDETEXT fields to specify a node in the file and return its content. But there are also similar difficulties. I may modify this Answer to describe that approach at some point, but as far as I know it has never been used in a real-world scenario.
So what if you need something more reliable? Well, there are several approaches, but all of them suffer from shortcomings of one kind or another. The main approaches I know are:
1. Use Word VBA and the OpenDataSource method to open the data source. That allows you to specify a query in the SQL dialect understood by the data source.
2. Use a Query/View defined in an intermediate database to extract the data items you need, and use that Query/View as your data source.
3. Use Word VBA's MailMerge events to manipulate the data for each record in the data source as Word processes the mailmerge.
4. Use a manual intermediate step.
5. (More drastic) Ditch Word MailMerge and find another approach altogether, e.g. create a .docx using .NET, the relevant database provider, and the Office Open XML SDK.
If you are creating this merge for use by other people, two side-effects of all those approaches are that the overall process becomes more complicated or unfamiliar for the user, and in particular they may not be able to use Word's facilities for data source record filtering and so on. Another issue that some people encounter is that long text/memo fields (longer than 255 characters) have a tendency to be truncated by Jet whenever you do anything much more complicated than the default "SELECT * FROM TABLE".
(1) requires that you can write a suitable query to get the columns you need from your data source. Because the query is executed using OLE DB you don't actually need to create any permanent objects in your database. So it may be a viable approach as long as the backend database allows you to execute external queries. But Word also imposes a 255 or 511 character limit on the query, so if you have to manipulate a lot of fields or the functions you need are complicated, you may find that you exceed the character limit quite quickly.
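For illustration only - the table and column names below are invented, and the exact dialect depends on your backend - the kind of query you might pass via OpenDataSource's SQLStatement argument is:
SELECT Left([First_Name], 1) AS Initial, Val([Age]) AS ShortAge, [Last_Name]
FROM [Orders]
Word then sees Initial and ShortAge as ordinary merge fields, so no field-code tricks are needed in the document itself.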
(2) is rather similar to (1) but may allow you to specify a much more complex query. For example, if your data source is a Jet .accdb, you may be able to create your own .accdb and define a query in it that accesses the tables in the .accdb that you are not allowed to modify. You might either use "linked tables" to achieve that, or in certain cases you can specify the locations of the underlying tables/queries in the SQL.
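As a hedged sketch of (2), a query saved in your own intermediate .accdb could reach the protected database either through linked tables or via Jet's IN clause; the path, table and column names here are made up:
SELECT Left([First_Name], 1) AS Initial, Val([Age]) AS ShortAge
FROM [Customers] IN 'C:\data\orders.accdb'
You would then point the mail merge at this saved query rather than at the original tables.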
(3) means that you use VBA to intercept Word as it processes each data source record. I leave you to research that. You have to control the process from VBA to ensure that the MailMerge events are invoked. There have been reports of various unreliabilities. VBA can only access the first 255 characters of any memo fields.
(4), e.g. you create an Excel workbook and use it to query the database. In that case you may be able to issue a much longer SQL query than you can in Word, and you may be able to create new Excel columns that manipulate the data using Excel formulas (I have never tried that, though). Then use that workbook as your data source.
Finally, a web search should reveal a list of functions recognised by Word's "=" field, although recent Microsoft documentation tends to omit the IF() function. The ISO 29500 documents on the .docx standard omit it as well, but I think that was not the intention and it may be fixed in a future version of the standard. The functions are:
ABS, AND, AVERAGE, COUNT, DEFINED, FALSE, IF, INT, MIN, MAX, MOD, NOT, OR, PRODUCT, ROUND, SUM, TRUE.

Related

Postgres fuzzy array intersection

I'm using PostgreSQL 13 and my problem was easily solved with the @> operator, like this:
select id from documents where keywords @> '{"winter", "report", "2020"}';
meaning that the keywords array should contain all these elements. Also, I've created a GIN index on this column.
Is it possible to achieve similar behavior even if I provide my request like '{"re", "202", "w"}'? I heard that ngrams have semantics like this, but the "intersection" capabilities of arrays are crucial for me.
In your example, the matches are all prefixes. Is that the general rule here? If so, you would probably want to use the match feature of full text search, not trigrams. It would require you to reformat your data, or at least your query.
select * from
(values (to_tsvector('simple','winter report 2020'))) f(x)
where x @@ 're:* & 202:* & w:*'::tsquery;
If the strings can contain punctuation which you want preserved, you would need to take pains to properly format them into a quoted tsvector yourself rather than just letting to_tsvector deal with it. Using 'simple' config gets rid of the stemming and stop word removal features, which would interfere with what you want to do.
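For example, a hand-built tsvector literal with quoted lexemes (the values here are only illustrative) would look like this, and it still matches the prefix query above:
select $$'winter' 'report.' '2020'$$::tsvector @@ 're:* & 202:* & w:*'::tsquery;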

Use Postgresql full text search to fuzzy match all search terms

I have 2 tables (projects and tasks) that both contain a name field. I want users to be able to search both tables at the same time when entering a new item. I want to rank results based on all the terms entered. A user should be able to enter text in any order he/she chooses.
For example, searching on:
office bmt
should yield these results:
PR BMT Time - Office
BMT Office - Development
BMT Office - Development
...
The following search should also work:
BMT canter
should contain this result:
Canterburry - BMT time
So partial matches need to work too.
Ideally if the user would type a small error like:
ofice bmt
The results should still appear.
I now use something like this:
where to_tsvector(projects.name || ' - ' || tasks.name) @@ to_tsquery('OFF:*&BMT:*')
I build the search string itself in the Ruby backend by splitting the user entry according to its spaces.
This works fine, however in some cases it doesn't and I believe that's because it interprets it like English and ignores some words like of, off, in, etc...
For example searching for:
off bmt
Gives results that don't contain Off at all because off is ignored completely.
Is there a way to avoid this but still have good performance and fuzzy search? I'm not keen on having to sync my PG with ElasticSearch for this.
I could do it by building a list of AND statements in the WHERE clause with LIKE '% ... %' but that would probably hurt performance and doesn't support fuzzysearch.
Ideally if the user would type a small error like:
ofice bmt
The results should still appear.
This could be very hard to do on more than a best-effort basis. If someone enters "Canter", how should the system know if they meant a shortening of Canterburry, or a misspelling of "cancer", or of "cantor", or if they really meant a horse's gait? Perhaps you can create a dictionary of common typos for your specific field? Also, without the specific knowledge that time zones are expected and common, "bmt" seems like a misspelling of, well, something.
This works fine, however in some cases it doesn't and I believe that's because it interprets it like English and ignores some words like of, off, in, etc...
Don't just believe, check and see!
select to_tsquery('english','OFF:*&BMT:*');
to_tsquery
------------
'bmt':*
Yes indeed, to_tsquery does omit stop words, even with the :* thingy.
One option is to use 'simple' rather than 'english' as your configuration:
select to_tsquery('simple','OFF:*&BMT:*');
to_tsquery
-------------------
'off':* & 'bmt':*
Another option is to write tsquery directly rather than processing through to_tsquery. Note that in this case, you have to lower-case it yourself:
select 'off:*&bmt:*'::tsquery;
tsquery
-------------------
'off':* & 'bmt':*
Also note that if you do this with 'office:*', you will never get a match in an 'english' configuration, because 'office' in the document gets stemmed to 'offic', while no stemming occurs when you write 'office:*'::tsquery. So you could use 'simple' rather than 'english' to avoid both stemming and stop words. Or you could test each word in the query individually to see if it gets stemmed before deciding to add :* to it.
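For example, you can check directly how a word gets stemmed; with the 'english' configuration I would expect output along these lines:
select to_tsvector('english','office');
to_tsvector
-------------
'offic':1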
Is there a way to avoid this but still have good performance and fuzzy search? I'm not keen on having to sync my PG with ElasticSearch for this.
What do you mean by fuzzysearch? You don't seem to be using that now. You are just using prefix matching, and accidentally using stemming and stopwords. How large is your table to be searched, and what kind of performance is acceptable?
If you did use ElasticSearch, how would you then phrase your searches? If you explained how you would phrase the search in ES, maybe someone can help you do the same thing in PostgreSQL. I don't think we can take it as a given that switching to ES will just magically do the right thing.
I could do it by building a list of AND statements in the WHERE clause with LIKE '% ... %' but that would probably hurt performance and doesn't support fuzzysearch.
Have you looked into pg_trgm? It can make those types of queries quite fast. Also, LIKE '%...%' is a lot more fuzzy than what you are currently doing, so I don't understand how you would lose that. pg_trgm also provides the '<->' operator, which is even fuzzier and might be your best bet. It can deal with typos fairly well when they are embedded in long strings, but in short strings they can really be a problem.
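As a rough sketch of what a pg_trgm-backed search could look like (using the projects.name column from the question; the index name is made up, and I have not tuned any thresholds):
-- enable the extension and index the column for trigram matching
create extension if not exists pg_trgm;
create index projects_name_trgm_idx on projects using gist (name gist_trgm_ops);
-- '%' matches when trigram similarity exceeds the configured threshold
select name from projects where name % 'ofice';
-- '<->' is the trigram distance operator; smaller means more similar
select name from projects order by name <-> 'ofice bmt' limit 10;
The second query returns the closest matches first, which copes with typos such as "ofice" without any stemming or stop-word behaviour getting in the way.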
In your case, to_tsquery() needs to indicate that all words are required; you can use to_tsquery('english', 'off & bmt') and specify a particular dictionary that keeps the word 'off', as described in link 4 below.
Some tips to use tsvector:
Create a field on your table that concatenates all the fields with terms you want to search; this field should be of type tsvector.
Your search should use tsquery, as you mentioned in your question. In the search you can use some good tricks, such as the following:
2.a. Create a rank with ts_rank(), indicating the search priority; this shows how closely the tsquery approximates the original terms.
2.b. If you have domain-specific words (as in my case, searching chemical terms), you can create a dictionary with the commonly used words; these words can be used to extract radicals or parts of words to compare for similarity.
2.c. About performance: tsquery works very well with GIN and GiST indexes. I have used full text search on a table with 200k+ records and the search returns in under 0.4 seconds.
If you need fuzzier matching of individual words, you can also use fuzzy matching. Together with tsquery I used the levenshtein_less_equal search with a distance of 3. The function matches words that differ from the search term by 3 or fewer letters; for single words it is a good way to search.
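A compressed sketch of the tips above, reusing the projects table from the question (the search_tsv column, the index name and the id column are assumptions, not from the original post):
-- 1. materialize a tsvector column and index it
alter table projects add column search_tsv tsvector;
update projects set search_tsv = to_tsvector('simple', coalesce(name, ''));
create index projects_search_idx on projects using gin (search_tsv);
-- 2.a rank matches by how closely they fit the query
select id, ts_rank(search_tsv, q) as rank
from projects, to_tsquery('simple', 'off:* & bmt:*') q
where search_tsv @@ q
order by rank desc;
-- fuzzy comparison of single words (fuzzystrmatch extension)
create extension if not exists fuzzystrmatch;
select levenshtein_less_equal('ofice', 'office', 3);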
tsquery and tsvector: https://www.postgresql.org/docs/10/datatype-textsearch.html
text search: https://www.postgresql.org/docs/10/textsearch-controls.html#TEXTSEARCH-RANKING
Fuzzy: https://www.postgresql.org/docs/11/fuzzystrmatch.html#id-1.11.7.24.6
Lexize: https://www.postgresql.org/docs/10/textsearch-dictionaries.html#TEXTSEARCH-SIMPLE-DICTIONARY

Using NEsper to read LogFiles for reporting purposes

We are evaluating NEsper. Our focus is to monitor data quality in an enterprise context. In an application we are going to log every change on a lot of fields - for example in an "order". So we have fields like
Consignee name
Consignee street
Orderdate
...and a lot more fields. As you can imagine, the log files are going to grow big.
Because the data is sent by different customers and is imported in the application, we want to analyze how many (and which) fields are updated from "no value" to "a value" (just as an example).
I tried to build a test case with just the fields
order reference
fieldname
fieldvalue
For my test cases I added two statements with context-information. The first one should just count the changes in general per order:
epService.EPAdministrator.CreateEPL("create context RefContext partition by Ref from LogEvent");
var userChanges = epService.EPAdministrator.CreateEPL("context RefContext select count(*) as x, context.key1 as Ref from LogEvent");
The second statement should count updates from "no value" to "a value":
epService.EPAdministrator.CreateEPL("create context FieldAndRefContext partition by Ref,Fieldname from LogEvent");
var countOfDataInput = epService.EPAdministrator.CreateEPL("context FieldAndRefContext SELECT context.key1 as Ref, context.key2 as Fieldname,count(*) as x from pattern[every (a=LogEvent(Value = '') -> b=LogEvent(Value != ''))]");
To read the test-logfile I use the csvInputAdapter:
CSVInputAdapterSpec csvSpec = new CSVInputAdapterSpec(ais, "LogEvent");
csvInputAdapter = new CSVInputAdapter(epService.Container, epService, csvSpec);
csvInputAdapter.Start();
I do not want to use the update listener, because I am interested only in the result of all events (probably this is not possible and this is my failure).
So after reading the csv (csvInputAdapter.Start() returns), I read all events, which are stored in the statement's NewEvents stream.
Using 10 entries in the CSV file everything works fine. Using 1 million lines it takes way too long. I tried it without any EPL statement (so just the CSV import) - it took about 5 seconds. With the first statement (not the complex pattern statement) I always stopped it after 20 minutes - so I am not sure how long it would take.
Then I changed the EPL of my first statement: I introduced a group by instead of the context.
select Ref,count(*) as x from LogEvent group by Ref
Now it is really fast - but I do not have any results in my NewEvents Stream after the CSVInputAdapter comes back...
My questions:
Is the way I want to use NEsper a supported use case or is this the root cause of my failure?
If this is a valid use case: Where is my mistake? How can I get the results I want in a performant way?
Why are there no NewEvents in my EPL-statement when using "group by" instead of "context"?
To 1), yes
To 2), this is valid, but your EPL design is probably a little inefficient. You would want to understand how patterns work: they use filter indexes and index entries, which are more expensive to create but are extremely fast at discarding unneeded events.
Read:
http://esper.espertech.com/release-7.1.0/esper-reference/html_single/index.html#processingmodel_indexes_filterindexes and also
http://esper.espertech.com/release-7.1.0/esper-reference/html_single/index.html#pattern-walkthrough
Try the "previous" perhaps. Measure performance for each statement separately.
Also I don't think the CSV adapter is optimized for processing a large file. I think CSV may not stream.
To 3), check your code? Don't use a CSV file for large amounts of data. Make sure a listener is attached.

PostgreSQL Full Text Search: Can't get a partial match of tsvector

Here's the problem:
I have a table in PostgreSQL with addresses in plain text and as tsvectors, and I'm trying to find an address record with a query like this:
SELECT * FROM address_catalog
WHERE address_catalog.search_vector @@ to_tsquery('123456:* & Klingon:* & Empire:* & Kronos:* & city:* & Matrok:* & street:* & 789:*')
But the problem is that I don't know anything about the address in the query. I can't tell where a country, a city or a street is in the incoming string, what order the words of the address are in, or whether it contains extra words.
I can only search for countries and cities, but if the incoming string contains a street, an index or anything else, the search returns nothing because of the conjunction of all the vector tokens. At the same time, I simply can't delete some parts of the string or use disjunction, because I never know where in the string the extra words are.
So, is there any way to construct a tsquery that returns the best matches for the incoming string? Or maybe partial matches? When I tried to force it to use OR instead of AND everywhere in the tsquery, it returned nearly the whole database. I need a vector intersection... in PostgreSQL.
I'd recommend using the smlar (PDF) extension for this. It was written by the same guys who wrote text search. It lets you use the TF-IDF similarity measure, which allows for "extraneous" query terms.
Here's how to compile it (I haven't figured out how to compile it on Windows):
http://blog.databasepatterns.com/2014/07/postgresql-install-smlar-extension.html
And here's how to use it:
http://blog.databasepatterns.com/2014/08/tf-idf-text-search-in-postgres.html
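I have not verified the exact calls recently, so treat this as a very rough sketch: the idea, per the linked posts, is to compare arrays of tokens. Assuming smlar is installed and each address has been tokenized into a text[] column (here called tokens, which is an invented name), something like this ranks the candidates:
select *, smlar(tokens, string_to_array('123456 Klingon Empire Kronos', ' ')) as sim
from address_catalog
order by sim desc
limit 10;
See the second post for the extra setup (statistics table etc.) needed for the TF-IDF mode.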

Is there any way for Access 2016 to sort the numbers that are part of a "text" data type formatted field as though they are numeric values?

I am working on a database that (hopefully) will end up using a primary key with both numbers and letters in its values to track lots of agricultural product. Due to the way in which the weighing of product takes place at more than one facility, I have no other option but to keep the same base number and add letters to it to denote split portions of each lot of product. The problem is, after I create record number 99, the number 100 suddenly floats up and sorts underneath 10. This makes it difficult to maintain consistency and forces me to replace this alphanumeric lot ID with a strictly numeric value (for which I use "autonumber" as the data type) in order to keep it sorted. Either way, I need the alphanumeric lot ID, so having 2 IDs for the same lot can be confusing for anyone entering values into the form. Is there a way around this that I am just not seeing?
If you're using a query as the data source, you may try sorting it by the string converted to a number, something like:
SELECT id, field1, field2, ..
ORDER BY CLng(YourAlphaNumericField)
Edit: you may also try the Val function instead of CLng - it should not fail on non-numeric input.
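A minimal sketch of that sort (table and field names are placeholders):
SELECT LotID
FROM Lots
ORDER BY Val(LotID), LotID;
Val("100A") returns 100, so lot 99 and its split portions sort before lot 100; the second sort key keeps the letter suffixes in order within each base number.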
Why not properly format your key before saving? E.g. "0000099". You will avoid a costly conversion later.
Alternatively, you could use 2 fields as the composite PK. One with the Number (as Long) and one with the Location (as String).