How to a separate choices then insert it to a table without the separator, whilst forgiving any Unicode inputs? [duplicate] - unicode

I need to do a simple split of a string, but there doesn't seem to be a function for this, and the manual way I tested didn't seem to work. How would I do it?

Here is my really simple solution. Use the gmatch() function to capture strings which contain at least one character of anything other than the desired separator. The separator is any whitespace (%s in Lua) by default:
function mysplit (inputstr, sep)
if sep == nil then
sep = "%s"
end
local t={}
for str in string.gmatch(inputstr, "([^"..sep.."]+)") do
table.insert(t, str)
end
return t
end

If you are splitting a string in Lua, you should try the string.gmatch() or string.sub() methods. Use the string.sub() method if you know the index you wish to split the string at, or use the string.gmatch() if you will parse the string to find the location to split the string at.
Example using string.gmatch() from Lua 5.1 Reference Manual:
t = {}
s = "from=world, to=Lua"
for k, v in string.gmatch(s, "(%w+)=(%w+)") do
t[k] = v
end

If you just want to iterate over the tokens, this is pretty neat:
line = "one, two and 3!"
for token in string.gmatch(line, "[^%s]+") do
print(token)
end
Output:
one,
two
and
3!
Short explanation: the "[^%s]+" pattern matches to every non-empty string in between space characters.

Just as string.gmatch will find patterns in a string, this function will find the things between patterns:
function string:split(pat)
pat = pat or '%s+'
local st, g = 1, self:gmatch("()("..pat..")")
local function getter(segs, seps, sep, cap1, ...)
st = sep and seps + #sep
return self:sub(segs, (seps or 0) - 1), cap1 or sep, ...
end
return function() if st then return getter(st, g()) end end
end
By default it returns whatever is separated by whitespace.

Here is the function:
function split(pString, pPattern)
local Table = {} -- NOTE: use {n = 0} in Lua-5.0
local fpat = "(.-)" .. pPattern
local last_end = 1
local s, e, cap = pString:find(fpat, 1)
while s do
if s ~= 1 or cap ~= "" then
table.insert(Table,cap)
end
last_end = e+1
s, e, cap = pString:find(fpat, last_end)
end
if last_end <= #pString then
cap = pString:sub(last_end)
table.insert(Table, cap)
end
return Table
end
Call it like:
list=split(string_to_split,pattern_to_match)
e.g.:
list=split("1:2:3:4","\:")
For more go here:
http://lua-users.org/wiki/SplitJoin

Because there are more than one way to skin a cat, here's my approach:
Code:
#!/usr/bin/env lua
local content = [=[
Lorem ipsum dolor sit amet, consectetur adipisicing elit,
sed do eiusmod tempor incididunt ut labore et dolore magna
aliqua. Ut enim ad minim veniam, quis nostrud exercitation
ullamco laboris nisi ut aliquip ex ea commodo consequat.
]=]
local function split(str, sep)
local result = {}
local regex = ("([^%s]+)"):format(sep)
for each in str:gmatch(regex) do
table.insert(result, each)
end
return result
end
local lines = split(content, "\n")
for _,line in ipairs(lines) do
print(line)
end
Output:
Lorem ipsum dolor sit amet, consectetur adipisicing elit,
sed do eiusmod tempor incididunt ut labore et dolore magna
aliqua. Ut enim ad minim veniam, quis nostrud exercitation
ullamco laboris nisi ut aliquip ex ea commodo consequat.
Explanation:
The gmatch function works as an iterator, it fetches all the strings that match regex. The regex takes all characters until it finds a separator.

A lot of these answers only accept single-character separators, or don't deal with edge cases well (e.g. empty separators), so I thought I would provide a more definitive solution.
Here are two functions, gsplit and split, adapted from the code in the Scribunto MediaWiki extension, which is used on wikis like Wikipedia. The code is licenced under the GPL v2. I have changed the variable names and added comments to make the code a bit easier to understand, and I have also changed the code to use regular Lua string patterns instead of Scribunto's patterns for Unicode strings. The original code has test cases here.
-- gsplit: iterate over substrings in a string separated by a pattern
--
-- Parameters:
-- text (string) - the string to iterate over
-- pattern (string) - the separator pattern
-- plain (boolean) - if true (or truthy), pattern is interpreted as a plain
-- string, not a Lua pattern
--
-- Returns: iterator
--
-- Usage:
-- for substr in gsplit(text, pattern, plain) do
-- doSomething(substr)
-- end
local function gsplit(text, pattern, plain)
local splitStart, length = 1, #text
return function ()
if splitStart then
local sepStart, sepEnd = string.find(text, pattern, splitStart, plain)
local ret
if not sepStart then
ret = string.sub(text, splitStart)
splitStart = nil
elseif sepEnd < sepStart then
-- Empty separator!
ret = string.sub(text, splitStart, sepStart)
if sepStart < length then
splitStart = sepStart + 1
else
splitStart = nil
end
else
ret = sepStart > splitStart and string.sub(text, splitStart, sepStart - 1) or ''
splitStart = sepEnd + 1
end
return ret
end
end
end
-- split: split a string into substrings separated by a pattern.
--
-- Parameters:
-- text (string) - the string to iterate over
-- pattern (string) - the separator pattern
-- plain (boolean) - if true (or truthy), pattern is interpreted as a plain
-- string, not a Lua pattern
--
-- Returns: table (a sequence table containing the substrings)
local function split(text, pattern, plain)
local ret = {}
for match in gsplit(text, pattern, plain) do
table.insert(ret, match)
end
return ret
end
Some examples of the split function in use:
local function printSequence(t)
print(unpack(t))
end
printSequence(split('foo, bar,baz', ',%s*')) -- foo bar baz
printSequence(split('foo, bar,baz', ',%s*', true)) -- foo, bar,baz
printSequence(split('foo', '')) -- f o o

I like this short solution
function split(s, delimiter)
result = {};
for match in (s..delimiter):gmatch("(.-)"..delimiter) do
table.insert(result, match);
end
return result;
end

You can use this method:
function string:split(delimiter)
local result = { }
local from = 1
local delim_from, delim_to = string.find( self, delimiter, from )
while delim_from do
table.insert( result, string.sub( self, from , delim_from-1 ) )
from = delim_to + 1
delim_from, delim_to = string.find( self, delimiter, from )
end
table.insert( result, string.sub( self, from ) )
return result
end
delimiter = string.split(stringtodelimite,pattern)

a way not seen in others
function str_split(str, sep)
if sep == nil then
sep = '%s'
end
local res = {}
local func = function(w)
table.insert(res, w)
end
string.gsub(str, '[^'..sep..']+', func)
return res
end

You could use penlight library. This has a function for splitting string using delimiter which outputs list.
It has implemented many of the function that we may need while programming and missing in Lua.
Here is the sample for using it.
>
> stringx = require "pl.stringx"
>
> str = "welcome to the world of lua"
>
> arr = stringx.split(str, " ")
>
> arr
{welcome,to,the,world,of,lua}
>

Simply sitting on a delimiter
local str = 'one,two'
local regxEverythingExceptComma = '([^,]+)'
for x in string.gmatch(str, regxEverythingExceptComma) do
print(x)
end

I used the above examples to craft my own function. But the missing piece for me was automatically escaping magic characters.
Here is my contribution:
function split(text, delim)
-- returns an array of fields based on text and delimiter (one character only)
local result = {}
local magic = "().%+-*?[]^$"
if delim == nil then
delim = "%s"
elseif string.find(delim, magic, 1, true) then
-- escape magic
delim = "%"..delim
end
local pattern = "[^"..delim.."]+"
for w in string.gmatch(text, pattern) do
table.insert(result, w)
end
return result
end

Super late to this question, but in case anyone wants a version that handles the amount of splits you want to get.....
-- Split a string into a table using a delimiter and a limit
string.split = function(str, pat, limit)
local t = {}
local fpat = "(.-)" .. pat
local last_end = 1
local s, e, cap = str:find(fpat, 1)
while s do
if s ~= 1 or cap ~= "" then
table.insert(t, cap)
end
last_end = e+1
s, e, cap = str:find(fpat, last_end)
if limit ~= nil and limit <= #t then
break
end
end
if last_end <= #str then
cap = str:sub(last_end)
table.insert(t, cap)
end
return t
end

For those coming from the exercice 10.1 of the "Programming in Lua" book, it seems clear that we could not use notion explained later in the book (iterator) and that the function should take more than a single char seperator.
The split() is a trick to get pattern to match what is not wanted (the split) and return an empty table on empty string. The return of plainSplit() is more like the split in other language.
magic = "([%%%.%(%)%+%*%?%[%]%^%$])"
function split(str, sep, plain)
if plain then sep = string.gsub(sep, magic, "%%%1") end
local N = '\255'
str = N..str..N
str = string.gsub(str, sep, N..N)
local result = {}
for word in string.gmatch(str, N.."(.-)"..N) do
if word ~= "" then
table.insert(result, word)
end
end
return result
end
function plainSplit(str, sep)
sep = string.gsub(sep, magic, "%%%1")
local result = {}
local start = 0
repeat
start = start + 1
local from, to = string.find(str, sep, start)
from = from and from-1
local word = string.sub(str, start, from, true)
table.insert(result, word)
start = to
until start == nil
return result
end
function tableToString(t)
local ret = "{"
for _, word in ipairs(t) do
ret = ret .. '"' .. word .. '", '
end
ret = string.sub(ret, 1, -3)
ret = ret .. "}"
return #ret > 1 and ret or "{}"
end
function runSplit(func, title, str, sep, plain)
print("\n" .. title)
print("str: '"..str.."'")
print("sep: '"..sep.."'")
local t = func(str, sep, plain)
print("-- t = " .. tableToString(t))
end
print("\n\n\n=== Pattern split ===")
runSplit(split, "Exercice 10.1", "a whole new world", " ")
runSplit(split, "With trailing seperator", " a whole new world ", " ")
runSplit(split, "A word seperator", "a whole new world", " whole ")
runSplit(split, "Pattern seperator", "a1whole2new3world", "%d")
runSplit(split, "Magic characters as plain seperator", "a$.%whole$.%new$.%world", "$.%", true)
runSplit(split, "Control seperator", "a\0whole\1new\2world", "%c")
runSplit(split, "ISO Time", "2020-07-10T15:00:00.000", "[T:%-%.]")
runSplit(split, " === [Fails] with \\255 ===", "a\255whole\0new\0world", "\0", true)
runSplit(split, "How does your function handle empty string?", "", " ")
print("\n\n\n=== Plain split ===")
runSplit(plainSplit, "Exercice 10.1", "a whole new world", " ")
runSplit(plainSplit, "With trailing seperator", " a whole new world ", " ")
runSplit(plainSplit, "A word seperator", "a whole new world", " whole ")
runSplit(plainSplit, "Magic characters as plain seperator", "a$.%whole$.%new$.%world", "$.%")
runSplit(plainSplit, "How does your function handle empty string?", "", " ")
output
=== Pattern split ===
Exercice 10.1
str: 'a whole new world'
sep: ' '
-- t = {"a", "whole", "new", "world"}
With trailing seperator
str: ' a whole new world '
sep: ' '
-- t = {"a", "whole", "new", "world"}
A word seperator
str: 'a whole new world'
sep: ' whole '
-- t = {"a", "new world"}
Pattern seperator
str: 'a1whole2new3world'
sep: '%d'
-- t = {"a", "whole", "new", "world"}
Magic characters as plain seperator
str: 'a$.%whole$.%new$.%world'
sep: '$.%'
-- t = {"a", "whole", "new", "world"}
Control seperator
str: 'awholenewworld'
sep: '%c'
-- t = {"a", "whole", "new", "world"}
ISO Time
str: '2020-07-10T15:00:00.000'
sep: '[T:%-%.]'
-- t = {"2020", "07", "10", "15", "00", "00", "000"}
=== [Fails] with \255 ===
str: 'a�wholenewworld'
sep: ''
-- t = {"a"}
How does your function handle empty string?
str: ''
sep: ' '
-- t = {}
=== Plain split ===
Exercice 10.1
str: 'a whole new world'
sep: ' '
-- t = {"a", "whole", "new", "world"}
With trailing seperator
str: ' a whole new world '
sep: ' '
-- t = {"", "", "a", "", "whole", "", "", "new", "world", "", ""}
A word seperator
str: 'a whole new world'
sep: ' whole '
-- t = {"a", "new world"}
Magic characters as plain seperator
str: 'a$.%whole$.%new$.%world'
sep: '$.%'
-- t = {"a", "whole", "new", "world"}
How does your function handle empty string?
str: ''
sep: ' '
-- t = {""}

I found that many of the other answers had edge cases which failed (eg. when given string contains #, { or } characters, or when given a delimiter character like % which require escaping). Here is the implementation that I went with instead:
local function newsplit(delimiter, str)
assert(type(delimiter) == "string")
assert(#delimiter > 0, "Must provide non empty delimiter")
-- Add escape characters if delimiter requires it
delimiter = delimiter:gsub("[%(%)%.%%%+%-%*%?%[%]%^%$]", "%%%0")
local start_index = 1
local result = {}
while true do
local delimiter_index, _ = str:find(delimiter, start_index)
if delimiter_index == nil then
table.insert(result, str:sub(start_index))
break
end
table.insert(result, str:sub(start_index, delimiter_index - 1))
start_index = delimiter_index + 1
end
return result
end

Here is a routine that works in Lua 4.0, returning a table t of the substrings in inputstr delimited by sep:
function string_split(inputstr, sep)
local inputstr = inputstr .. sep
local idx, inc, t = 0, 1, {}
local idx_prev, substr
repeat
idx_prev = idx
inputstr = strsub(inputstr, idx + 1, -1) -- chop off the beginning of the string containing the match last found by strfind (or initially, nothing); keep the rest (or initially, all)
idx = strfind(inputstr, sep) -- find the 0-based r_index of the first occurrence of separator
if idx == nil then break end -- quit if nothing's found
substr = strsub(inputstr, 0, idx) -- extract the substring occurring before the separator (i.e., data field before the next delimiter)
substr = gsub(substr, "[%c" .. sep .. " ]", "") -- eliminate control characters, separator and spaces
t[inc] = substr -- store the substring (i.e., data field)
inc = inc + 1 -- iterate to next
until idx == nil
return t
end
This simple test
inputstr = "the brown lazy fox jumped over the fat grey hen ... or something."
sep = " "
t = {}
t = string_split(inputstr,sep)
for i=1,15 do
print(i, t[i])
end
Yields:
--> t[1]=the
--> t[2]=brown
--> t[3]=lazy
--> t[4]=fox
--> t[5]=jumped
--> t[6]=over
--> t[7]=the
--> t[8]=fat
--> t[9]=grey
--> t[10]=hen
--> t[11]=...
--> t[12]=or
--> t[13]=something.

Depending on the use case, this could be useful. It cuts all text either side of the flags:
b = "This is a string used for testing"
--Removes unwanted text
c = (b:match("a([^/]+)used"))
print (c)
Output:
string

Related

Can you Print the wavy lines generated by Spell check in writer?

As per google group, this macro can be used to print mis-spelled words in MS office.
https://groups.google.com/g/microsoft.public.word.spelling.grammar/c/OiFYPkLAbeU
Is there similar option in libre-office writer?
The following Subroutine replicates what the code in the Google group does. It is more verbose than the MS version but that is to be expected with LibreOffice / OpenOffice. It only does the spellchecker lines and not the green grammar checker ones, which is also the case with the MS version in the Google group.
Sub UnderlineMisspelledWords
' From OOME Listing 315 Page 336
GlobalScope.BasicLibraries.loadLibrary( "Tools" )
Dim sLocale As String
sLocale = GetRegistryKeyContent("org.openoffice.Setup/L10N", FALSE).getByName("ooLocale")
' ooLocale appears to return a string that consists of the language and country
' seperated by a dash, e.g. en-GB
Dim nDash As Integer
nDash = InStr(sLocale, "-")
Dim aLocale As New com.sun.star.lang.Locale
aLocale.Language = Left(sLocale, nDash - 1)
aLocale.Country = Right(sLocale, Len(sLocale) -nDash )
Dim oSpeller As Variant
oSpeller = createUnoService("com.sun.star.linguistic2.SpellChecker")
Dim emptyArgs() as new com.sun.star.beans.PropertyValue
Dim oCursor As Object
oCursor = ThisComponent.getText.createTextCursor()
oCursor.gotoStart(False)
oCursor.collapseToStart()
Dim s as String, bTest As Boolean
Do
oCursor.gotoEndOfWord(True)
s = oCursor.getString()
bTest = oSpeller.isValid(s, aLocale, emptyArgs())
If Not bTest Then
With oCursor
.CharUnderlineHasColor = True
.CharUnderlineColor = RGB(255, 0,0)
.CharUnderline = com.sun.star.awt.FontUnderline.WAVE
' Possible alternatives include SMALLWAVE, DOUBLEWAVE and BOLDWAVE
End With
End If
Loop While oCursor.gotoNextWord(False)
End Sub
This will change the actual formatting of the font to have a red wavy underline, which will print out like any other formatting. If any of the misspelled words in the document already have some sort of underlining then that will be lost.
You will probably want to remove the underlining after you have printed it. The following Sub removes underlining only where its style exactly matches that of the line added by the first routine.
Sub RemoveUnderlining
Dim oCursor As Object
oCursor = ThisComponent.getText.createTextCursor()
oCursor.gotoStart(False)
oCursor.collapseToStart()
Dim s as String, bTest As Boolean
Do
oCursor.gotoEndOfWord(True)
Dim bTest1 As Boolean
bTest1 = False
If oCursor.CharUnderlineHasColor = True Then
bTest1 = True
End If
Dim bTest2 As Boolean
bTest2 = False
If oCursor.CharUnderlineColor = RGB(255, 0,0) Then
bTest2 = True
End If
Dim bTest3 As Boolean
bTest3 = False
If oCursor.CharUnderline = com.sun.star.awt.FontUnderline.WAVE Then
bTest3 = True
End If
If bTest1 And bTest2 And bTest3 Then
With oCursor
.CharUnderlineHasColor = False
.CharUnderline = com.sun.star.awt.FontUnderline.NONE
End With
End If
Loop While oCursor.gotoNextWord(False)
End Sub
This will not restore any original underlining that was replaced by red wavy ones. Other ways of removing the wavy lines that would restore these are:
Pressing undo (Ctrl Z) but you will need to do that once for every word in your document, which could be a bit of a pain.
Running the subroutine UnderlineMisspelledWords on a temporary copy of the document and then discarding it after printing.
I hope this is what you were looking for.
In response to your above comment, it is straightforward to modify the above subroutine to do that instead of drawing wavy lines. The code below opens a new Writer document and writes into it a list of the misspelled words together with the alternatives that the spellchecker suggests:
Sub ListMisSpelledWords
' From OOME Listing 315 Page 336
GlobalScope.BasicLibraries.loadLibrary( "Tools" )
Dim sLocale As String
sLocale = GetRegistryKeyContent("org.openoffice.Setup/L10N", FALSE).getByName("ooLocale")
' ooLocale appears to return a string that consists of the language and country
' seperated by a dash, e.g. en-GB
Dim nDash As Integer
nDash = InStr(sLocale, "-")
Dim aLocale As New com.sun.star.lang.Locale
aLocale.Language = Left(sLocale, nDash - 1)
aLocale.Country = Right(sLocale, Len(sLocale) -nDash )
Dim oSource As Object
oSource = ThisComponent
Dim oSourceCursor As Object
oSourceCursor = oSource.getText.createTextCursor()
oSourceCursor.gotoStart(False)
oSourceCursor.collapseToStart()
Dim oDestination As Object
oDestination = StarDesktop.loadComponentFromURL( "private:factory/swriter", "_blank", 0, Array() )
Dim oDestinationText as Object
oDestinationText = oDestination.getText()
Dim oDestinationCursor As Object
oDestinationCursor = oDestinationText.createTextCursor()
Dim oSpeller As Object
oSpeller = createUnoService("com.sun.star.linguistic2.SpellChecker")
Dim oSpellAlternatives As Object, emptyArgs() as new com.sun.star.beans.PropertyValue
Dim sMistake as String, oSpell As Object, sAlternatives() as String, bTest As Boolean, s As String, i as Integer
Do
oSourceCursor.gotoEndOfWord(True)
sMistake = oSourceCursor.getString()
bTest = oSpeller.isValid(sMistake, aLocale, emptyArgs())
If Not bTest Then
oSpell = oSpeller.spell(sMistake, aLocale, emptyArgs())
sAlternatives = oSpell.getAlternatives()
s = ""
for i = LBound(sAlternatives) To Ubound(sAlternatives) - 1
s = s & sAlternatives(i) & ", "
Next i
s = s & sAlternatives(Ubound(sAlternatives))
oDestinationText.insertString(oDestinationCursor, sMistake & ": " & s & Chr(13), False)
End If
Loop While oSourceCursor.gotoNextWord(False)
End Sub
I don't know about the dictionaries but, in answer to your previous comment, if you paste the following code below Loop While and above End Sub it will result in the text in the newly opened Writer document being sorted without duplicates. It's not very elegant but it works on the text I've tried it on.
oDestinationCursor.gotoStart(False)
oDestinationCursor.gotoEnd(True)
Dim oSortDescriptor As Object
oSortDescriptor = oDestinationCursor.createSortDescriptor()
oDestinationCursor.sort(oSortDescriptor)
Dim sParagraphToBeChecked As String
Dim sThisWord As String
sThisWord = ""
Dim sPreviousWord As String
sPreviousWord = ""
oDestinationCursor.gotoStart(False)
oDestinationCursor.collapseToStart()
Dim k As Integer
Do
oDestinationCursor.gotoEndOfParagraph(True)
sParagraphToBeChecked = oDestinationCursor.getString()
k = InStr(sParagraphToBeChecked, ":")
If k <> 0 Then
sThisWord = Left(sParagraphToBeChecked, k-1)
End If
If StrComp(sThisWord, sPreviousWord, 0) = 0 Then
oDestinationCursor.setString("")
End If
sPreviousWord = sThisWord
Loop While oDestinationCursor.gotoNextParagraph(False)
Dim oReplaceDescriptor As Object
oReplaceDescriptor = oDestination.createReplaceDescriptor()
oReplaceDescriptor.setPropertyValue("SearchRegularExpression", TRUE)
oReplaceDescriptor.setSearchString("^$")
oReplaceDescriptor.setReplaceString("")
oDestination.replaceAll(oReplaceDescriptor)
It seems I didn't spot that because the text I tested it on contained only words that were either correct or had more than zero alternatives. I managed to replicate the error by putting in a word consisting of random characters for which the spellchecker was unable to suggest any alternatives. If no alternatives are found the function .getAlternatives() returns an array of size -1 so the error can be avoided by testing for this condition before the array is used. Below is a modified version of the first Do loop in the subroutine with such a condition added. If you replace the existing loop with that it should eliminate the error.
Do
oSourceCursor.gotoEndOfWord(True)
sMistake = oSourceCursor.getString()
bTest = oSpeller.isValid(sMistake, aLocale, emptyArgs())
If Not bTest Then
oSpell = oSpeller.spell(sMistake, aLocale, emptyArgs())
sAlternatives = oSpell.getAlternatives()
s = ""
If Ubound(sAlternatives) >= 0 Then
for i = LBound(sAlternatives) To Ubound(sAlternatives) - 1
s = s & sAlternatives(i) & ", "
Next i
s = s & sAlternatives(Ubound(sAlternatives))
End If
oDestinationText.insertString(oDestinationCursor, sMistake & ": " & s & Chr(13), False)
End If
Loop While oSourceCursor.gotoNextWord(False)
On re-reading the whole subroutine I think it would improve its readability if the variable sMistake were renamed to something like sWordToBeChecked, as the string this variable contains isn't always misspelled. This would of course need to be changed everywhere in the routine and not just in the above snippet.
Below is a modified version that uses the dispatcher as suggested by Jim K in his answer go to end of word is not always followed. I have written it out in its entirety because the changes are more extensive than just adding or replacing a block. In particular, it is necessary to get the view cursor before creating the empty destination document, otherwise the routine will spell check that.
Sub ListMisSpelledWords2
' From OOME Listing 315 Page 336
GlobalScope.BasicLibraries.loadLibrary( "Tools" )
Dim sLocale As String
sLocale = GetRegistryKeyContent("org.openoffice.Setup/L10N", FALSE).getByName("ooLocale")
' ooLocale appears to return a string that consists of the language and country
' seperated by a dash, e.g. en-GB
Dim nDash As Integer
nDash = InStr(sLocale, "-")
Dim aLocale As New com.sun.star.lang.Locale
aLocale.Language = Left(sLocale, nDash - 1)
aLocale.Country = Right(sLocale, Len(sLocale) -nDash )
Dim oSourceDocument As Object
oSourceDocument = ThisComponent
Dim nWordCount as Integer
nWordCount = oSourceDocument.WordCount
Dim oFrame As Object, oViewCursor As Object
With oSourceDocument.getCurrentController
oFrame = .getFrame()
oViewCursor = .getViewCursor()
End With
Dim oDispatcher as Object
oDispatcher = createUnoService("com.sun.star.frame.DispatchHelper")
oDispatcher.executeDispatch(oFrame, ".uno:GoToStartOfDoc", "", 0, Array())
Dim oDestinationDocument As Object
oDestinationDocument = StarDesktop.loadComponentFromURL( "private:factory/swriter", "_blank", 0, Array() )
Dim oDestinationText as Object
oDestinationText = oDestinationDocument.getText()
Dim oDestinationCursor As Object
oDestinationCursor = oDestinationText.createTextCursor()
Dim oSpeller As Object
oSpeller = createUnoService("com.sun.star.linguistic2.SpellChecker")
Dim oSpellAlternatives As Object, emptyArgs() as new com.sun.star.beans.PropertyValue
Dim sMistake as String, oSpell As Object, sAlternatives() as String, bTest As Boolean, s As String, i as Integer
For i = 0 To nWordCount - 1
oDispatcher.executeDispatch(oFrame, ".uno:WordRightSel", "", 0, Array())
sWordToBeChecked = RTrim( oViewCursor.String )
bTest = oSpeller.isValid(sWordToBeChecked, aLocale, emptyArgs())
If Not bTest Then
oSpell = oSpeller.spell(sWordToBeChecked, aLocale, emptyArgs())
sAlternatives = oSpell.getAlternatives()
s = ""
If Ubound(sAlternatives) >= 0 Then
for i = LBound(sAlternatives) To Ubound(sAlternatives) - 1
s = s & sAlternatives(i) & ", "
Next i
s = s & sAlternatives(Ubound(sAlternatives))
End If
oDestinationText.insertString(oDestinationCursor, sWordToBeChecked & ": " & s & Chr(13), False)
End If
oDispatcher.executeDispatch(oFrame, ".uno:GoToPrevWord", "", 0, Array())
oDispatcher.executeDispatch(oFrame, ".uno:GoToNextWord", "", 0, Array())
Next i
oDestinationCursor.gotoStart(False)
oDestinationCursor.gotoEnd(True)
' Sort the paragraphs
Dim oSortDescriptor As Object
oSortDescriptor = oDestinationCursor.createSortDescriptor()
oDestinationCursor.sort(oSortDescriptor)
' Remove duplicates
Dim sParagraphToBeChecked As String, sThisWord As String, sPreviousWord As String
sThisWord = ""
sPreviousWord = ""
oDestinationCursor.gotoStart(False)
oDestinationCursor.collapseToStart()
Dim k As Integer
Do
oDestinationCursor.gotoEndOfParagraph(True)
sParagraphToBeChecked = oDestinationCursor.getString()
k = InStr(sParagraphToBeChecked, ":")
If k <> 0 Then
sThisWord = Left(sParagraphToBeChecked, k-1)
End If
If StrComp(sThisWord, sPreviousWord, 0) = 0 Then
oDestinationCursor.setString("")
End If
sPreviousWord = sThisWord
Loop While oDestinationCursor.gotoNextParagraph(False)
' Remove empty paragraphs
Dim oReplaceDescriptor As Object
oReplaceDescriptor = oDestinationDocument.createReplaceDescriptor()
oReplaceDescriptor.setPropertyValue("SearchRegularExpression", TRUE)
oReplaceDescriptor.setSearchString("^$")
oReplaceDescriptor.setReplaceString("")
oDestinationDocument.replaceAll(oReplaceDescriptor)
End Sub
It looks like the problem is caused by one of the while loops in the function TrimWord, which removes punctuation from before and after a word before it is fed to the spell-check service. If the word is only one character long and if it is a valid punctuation character then the condition at the beginning of the loop is true, so the loop is entered and the counter n is decremented to zero. Then at the beginning of the next traversal of the loop, even though the condition is false anyway, it still asks the Mid function to return the 0th character of the word, which it can't do because the characters are numbered from 1, so it throws an error. Some languages would ignore the error if the truth value of the condition could be unambiguously determined from the other parts of the expression. It looks like BASIC doesn't do that.
The following modified version of the function gets round the problem in a rather inelegant way, but it seems to work:
Function TrimWord(sWord As String) As String
Dim n as Long
n = Len(sWord)
If n > 0 Then
Dim m as Long : m = 1
Dim bTest As Boolean
bTest = m <= n
Do While IsPermissiblePrefix( ASC(Mid(sWord, m, 1) ) ) And bTest
if (m < n) Then
m = m + 1
Else
bTest = False
End If
Loop
bTest = n > 0
Do While IsPermissibleSuffix( ASC(Mid(sWord, n, 1) ) ) And bTest
if (n > 1) Then
n = n - 1
Else
bTest = False
End If
Loop
If n > m Then
TrimWord = Mid(sWord, m, (n + 1) - m)
Else
TrimWord = sWord
End If
Else
TrimWord = ""
End If
End Function
This works for me.
Firstly, in response to your question about the bug, I'm not a maintainer so I can't fix that. However, as the bug concerns moving a text cursor to the start and end of a word it should be possible to get round it by searching for the white-space between words instead. Since the white-space characters are (I think) the same in all languages, any problems recognising certain characters from certain alphabets shouldn't matter. The easiest way to do it would be to first read the entire text of the document into a string but LibreOffice strings have a maximum length of 2^16 = 65536 characters and while this seems like a lot it could easily be too small for a reasonable sized document. The limit can be avoided by navigating through the text one paragraph at a time. According to Andrew Pitonyak (OOME Page 388): "I found gotoNextSentence() and gotoNextWord() to be unreliable, but the paragraph cursor worked well."
The code below is yet another modification of the subroutines in previous answers. This time it gets a string from a paragraph and splits it up into words by finding the white-space between the words. It then spell checks the words as before. The subroutine depends on some other functions that are listed below it. These allow you to specify which characters to designate as word separators (i.e. white-space) and which characters to ignore if they are found at the beginning or end of a word. This is necessary so that, for example, the quotes surrounding a quoted word are not counted as part of the word, which would lead to it being recognised as a spelling mistake even if the word inside the quotes is correctly spelled.
I am not familiar with non-latin alphabets and I don't have an appropriate dictionary installed, but I pasted the words from your question go to end of word is not always followed, namely testी, भारत and इंडिया and they all appeared unmodified in the output document.
On the question of looking up synonyms, as each misspelled word has multiple suggestions, and each of those will have multiple synonyms, the output could rapidly become very large and confusing. It may be better for your user to look them up individually if they want to use a different word.
Sub ListMisSpelledWords3
' From OOME Listing 315 Page 336
GlobalScope.BasicLibraries.loadLibrary( "Tools" )
Dim sLocale As String
sLocale = GetRegistryKeyContent("org.openoffice.Setup/L10N", FALSE).getByName("ooLocale")
' ooLocale appears to return a string that consists of the language and country
' seperated by a dash, e.g. en-GB
Dim nDash As Integer
nDash = InStr(sLocale, "-")
Dim aLocale As New com.sun.star.lang.Locale
aLocale.Language = Left( sLocale, nDash - 1)
aLocale.Country = Right( sLocale, Len(sLocale) - nDash )
Dim oSource As Object
oSource = ThisComponent
Dim oSourceCursor As Object
oSourceCursor = oSource.getText.createTextCursor()
oSourceCursor.gotoStart(False)
oSourceCursor.collapseToStart()
Dim oDestination As Object
oDestination = StarDesktop.loadComponentFromURL( "private:factory/swriter", "_blank", 0, Array() )
Dim oDestinationText as Object
oDestinationText = oDestination.getText()
Dim oDestinationCursor As Object
oDestinationCursor = oDestinationText.createTextCursor()
Dim oSpeller As Object
oSpeller = createUnoService("com.sun.star.linguistic2.SpellChecker")
Dim oSpellAlternatives As Object, emptyArgs() as new com.sun.star.beans.PropertyValue
Dim sWordToCheck as String, oSpell As Object, sAlternatives() as String, bTest As Boolean
Dim s As String, i as Integer, j As Integer, sParagraph As String, nWordStart As Integer, nWordEnd As Integer
Dim nChar As Integer
Do
oSourceCursor.gotoEndOfParagraph(True)
sParagraph = oSourceCursor.getString() & " " 'It is necessary to add a space to the end of
'the string otherwise the last word of the paragraph is not recognised.
nWordStart = 1
nWordEnd = 1
For i = 1 to Len(sParagraph)
nChar = ASC(Mid(sParagraph, i, 1))
If IsWordSeparator(nChar) Then '1
If nWordEnd > nWordStart Then '2
sWordToCheck = TrimWord( Mid(sParagraph, nWordStart, nWordEnd - nWordStart) )
bTest = oSpeller.isValid(sWordToCheck, aLocale, emptyArgs())
If Not bTest Then '3
oSpell = oSpeller.spell(sWordToCheck, aLocale, emptyArgs())
sAlternatives = oSpell.getAlternatives()
s = ""
If Ubound(sAlternatives) >= 0 Then '4
for j = LBound(sAlternatives) To Ubound(sAlternatives) - 1
s = s & sAlternatives(j) & ", "
Next j
s = s & sAlternatives(Ubound(sAlternatives))
End If '4
oDestinationText.insertString(oDestinationCursor, sWordToCheck & " : " & s & Chr(13), False)
End If '3
End If '2
nWordEnd = nWordEnd + 1
nWordStart = nWordEnd
Else
nWordEnd = nWordEnd + 1
End If '1
Next i
Loop While oSourceCursor.gotoNextParagraph(False)
oDestinationCursor.gotoStart(False)
oDestinationCursor.gotoEnd(True)
Dim oSortDescriptor As Object
oSortDescriptor = oDestinationCursor.createSortDescriptor()
oDestinationCursor.sort(oSortDescriptor)
Dim sParagraphToBeChecked As String
Dim sThisWord As String
sThisWord = ""
Dim sPreviousWord As String
sPreviousWord = ""
oDestinationCursor.gotoStart(False)
oDestinationCursor.collapseToStart()
Dim k As Integer
Do
oDestinationCursor.gotoEndOfParagraph(True)
sParagraphToBeChecked = oDestinationCursor.getString()
k = InStr(sParagraphToBeChecked, ":")
If k <> 0 Then
sThisWord = Left(sParagraphToBeChecked, k-1)
End If
If StrComp(sThisWord, sPreviousWord, 0) = 0 Then
oDestinationCursor.setString("")
End If
sPreviousWord = sThisWord
Loop While oDestinationCursor.gotoNextParagraph(False)
Dim oReplaceDescriptor As Object
oReplaceDescriptor = oDestination.createReplaceDescriptor()
oReplaceDescriptor.setPropertyValue("SearchRegularExpression", TRUE)
oReplaceDescriptor.setSearchString("^$")
oReplaceDescriptor.setReplaceString("")
oDestination.replaceAll(oReplaceDescriptor)
End Sub
'----------------------------------------------------------------------------
' From OOME Listing 360.
Function IsWordSeparator(iChar As Integer) As Boolean
' Horizontal tab \t 9
' New line \n 10
' Carriage return \r 13
' Space 32
' Non-breaking space 160
Select Case iChar
Case 9, 10, 13, 32, 160
IsWordSeparator = True
Case Else
IsWordSeparator = False
End Select
End Function
'-------------------------------------
' Characters to be trimmed off beginning of word before spell checking
Function IsPermissiblePrefix(iChar As Integer) As Boolean
' Symmetric double quote " 34
' Left parenthesis ( 40
' Left square bracket [ 91
' Back-tick ` 96
' Left curly bracket { 123
' Left double angle quotation marks « 171
' Left single quotation mark ‘ 8216
' Left single reversed 9 quotation mark ‛ 8219
' Left double quotation mark “ 8220
' Left double reversed 9 quotation mark ‟ 8223
Select Case iChar
Case 34, 40, 91, 96, 123, 171, 8216, 8219, 8220, 8223
IsPermissiblePrefix = True
Case Else
IsPermissiblePrefix = False
End Select
End Function
'-------------------------------------
' Characters to be trimmed off end of word before spell checking
Function IsPermissibleSuffix(iChar As Integer) As Boolean
' Exclamation mark ! 33
' Symmetric double quote " 34
' Apostrophe ' 39
' Right parenthesis ) 41
' Comma , 44
' Full stop . 46
' Colon : 58
' Semicolon ; 59
' Question mark ? 63
' Right square bracket ] 93
' Right curly bracket } 125
' Right double angle quotation marks » 187
' Right single quotation mark ‘ 8217
' Right double quotation mark “ 8221
Select Case iChar
Case 33, 34, 39, 41, 44, 46, 58, 59, 63, 93, 125, 187, 8217, 8221
IsPermissibleSuffix = True
Case Else
IsPermissibleSuffix = False
End Select
End Function
'-------------------------------------
Function TrimWord( sWord As String) As String
Dim n as Integer
n = Len(sWord)
If n > 0 Then
Dim m as Integer : m = 1
Do While IsPermissiblePrefix( ASC(Mid(sWord, m, 1) ) ) And m <= n
m = m + 1
Loop
Do While IsPermissibleSuffix( ASC(Mid(sWord, n, 1) ) ) And n >= 1
n = n - 1
Loop
If n > m Then
TrimWord = Mid(sWord, m, (n + 1) - m)
Else
TrimWord = sWord
End If
Else
TrimWord = ""
End If
End Function

Convert octal value to letters and join together - JAVA

I have a octal value, its - 0110 0145 0154 0154 0157 054 040 0110 0151, the result must be - Hello, Hi.
Here is my code :
String octal = "0110 0145 0154 0154 0157 054 040 0110 0151 ";
List<String> result = Arrays.asList(octal.split("\\s*,\\s*"));
long item = 1;
String res = "";
while(item < result.size()) {
char re = (char) Integer.parseInt(result.get((int) item), 8);
res = res + " "+ re;
item += 1;
}
System.out.println("Its" + res);
But the output :
Its e
Expected
Hello, Hi
I tried everything, but failed ):
Why did you think the split pattern "\\s*,\\s*" were a solution for your needs? To split at space, we can use "\\s+".
And in order to start at the first letter, you should initialize item = 0.

Alternative of ChrW function

Is there any alternative function/solution of the ChrW() which accepts value not in range is -32768–65535 like for character code 􀂇 which leads to "􀂇". Using ChrW() gives error
"Invalid procedure call or argument"
So I want an alternative solution to convert the charactercode to the actual character.
Code:
Function HTMLDecode(sText)
Dim regEx
Dim matches
Dim match
sText = Replace(sText, """, Chr(34))
sText = Replace(sText, "<" , Chr(60))
sText = Replace(sText, ">" , Chr(62))
sText = Replace(sText, "&" , Chr(38))
sText = Replace(sText, " ", Chr(32))
Set regEx= New RegExp
With regEx
.Pattern = "&#(\d+);" 'Match html unicode escapes
.Global = True
End With
Set matches = regEx.Execute(sText)
'Iterate over matches
For Each match In matches
'For each unicode match, replace the whole match, with the ChrW of the digits.
sText = Replace(sText, match.Value, ChrW(match.SubMatches(0)))
Next
HTMLDecode = sText
End Function
' https://en.wikipedia.org/wiki/UTF-16#Description
function Utf16Encode(byval unicode_code_point)
if (unicode_code_point >= 0 and unicode_code_point <= &hD7FF&) or (unicode_code_point >= &hE000& and unicode_code_point <= &hFFFF&) Then
Utf16Encode = ChrW(unicode_code_point)
else
unicode_code_point = unicode_code_point - &h10000&
Utf16Encode = ChrW(&hD800 Or (unicode_code_point \ &h400&)) & ChrW(&hDC00 Or (unicode_code_point And &h3FF&))
end if
end function
For Each match In matches
'For each unicode match, replace the whole match, with the ChrW of the digits.
sText = Replace(sText, match.Value, Utf16Encode(CLng(match.SubMatches(0))))
Next

Is there a MS Word wildcard for frequency?

I learning how to use Microsoft Word wildcards and codes to help me in my position as a medical editor. A big part of my job is submitting manuscripts to medical journals for review, and each journal has very specific requirements.
Most of the journals we submit manuscripts to require that medical terms/phrases be abbreviated only if they are used three or more times. For example, the term “Overall Survival” can be abbreviated to OS if the term is referenced at least three times in the text. If the text only mentions “Overall Survival” once or twice, it is preferred that the term remain expanded, and it should not be abbreviated to OS.
We have been using the PerfectIt system, by Intelligent Editing. This Word widget scans for abbreviations that are only used once and will flag them for our review, but does not pick up if an abbreviation is only used twice in the selected text. We are hoping to find some solution (my thought would be some sort of wildcard search or macro) that will be able to detect if an abbreviation is used only one or two times.
I saw this similar post on stackoverflow, but it seemed to do with code. I will need this to be on a company computer that I do not have administrative access to, and furthermore, I know nothing about code. I appreciate any help, guidance, or directions for further research!
Thank you!
Edit: I could use a wildcard search to make all of the two+ capitalized letters highlighted by using <[A-Z]{2,}>, then formatting them as highlighted, if this would help with any macros.
For any given abbreviation, you could use a macro like:
Sub Demo()
Application.ScreenUpdating = False
Dim i As Long
With ActiveDocument.Range
With .Find
.ClearFormatting
.Replacement.ClearFormatting
.Text = InputBox("What is the Text to Find")
.Replacement.Text = ""
.Forward = True
.Wrap = wdFindStop
.Format = False
.MatchCase = True
.MatchWholeWord = True
.MatchWildcards = False
.MatchSoundsLike = False
.MatchAllWordForms = False
.Execute
End With
Do While .Find.Found
i = i + 1
.Collapse wdCollapseEnd
.Find.Execute
Loop
End With
Application.ScreenUpdating = True
MsgBox i & " instances found."
End Sub
For PC macro installation & usage instructions, see: http://www.gmayor.com/installing_macro.htm
For Mac macro installation & usage instructions, see: https://wordmvp.com/Mac/InstallMacro.html
Provided there's at least one occurrence of the abbreviation in parens you could use a macro like the following. The macro checks the contents of a document for upper-case/numeric parenthetic abbreviations it then looks backwards to try to determine what term they abbreviate. For example:
World Wide Web (WWW)
Naturally, given the range of acronyms in use, it’s not foolproof and, if a match isn’t made, the preceding sentence (in VBA terms) is captured so the user can edit the output. A table is then built at the end of the document, which is then searched for all references to the acronym (other than for the definition) and the counts and page numbers added to the table.
Note that the macro won't tell you how many times 'World Wide Web' appears in the document, though. After all, given your criteria, it's impossible to know what terms should have been reduced to an acronym but weren't.
Sub AcronymLister()
Application.ScreenUpdating = False
Dim StrTmp As String, StrAcronyms As String, i As Long, j As Long, k As Long, Rng As Range, Tbl As Table
StrAcronyms = "Acronym" & vbTab & "Term" & vbTab & "Page" & vbTab & "Cross-Reference Count" & vbTab & "Cross-Reference Pages" & vbCr
With ActiveDocument
With .Range
With .Find
.ClearFormatting
.Replacement.ClearFormatting
.MatchWildcards = True
.Wrap = wdFindStop
.Text = "\([A-Z0-9]{2,}\)"
.Replacement.Text = ""
.Execute
End With
Do While .Find.Found = True
StrTmp = Replace(Replace(.Text, "(", ""), ")", "")
If (InStr(1, StrAcronyms, .Text, vbBinaryCompare) = 0) And (Not IsNumeric(StrTmp)) Then
If .Words.First.Previous.Previous.Words(1).Characters.First = Right(StrTmp, 1) Then
For i = Len(StrTmp) To 1 Step -1
.MoveStartUntil Mid(StrTmp, i, 1), wdBackward
.Start = .Start - 1
If InStr(.Text, vbCr) > 0 Then
.MoveStartUntil vbCr, wdForward
.Start = .Start + 1
End If
If .Sentences.Count > 1 Then .Start = .Sentences.Last.Start
If .Characters.Last.Information(wdWithInTable) = False Then
If .Characters.First.Information(wdWithInTable) = True Then
.Start = .Cells(.Cells.Count).Range.End + 1
End If
ElseIf .Cells.Count > 1 Then
.Start = .Cells(.Cells.Count).Range.Start
End If
Next
End If
StrTmp = Replace(Replace(Replace(.Text, " (", "("), "(", "|"), ")", "")
StrAcronyms = StrAcronyms & Split(StrTmp, "|")(1) & vbTab & Split(StrTmp, "|")(0) & vbTab & .Information(wdActiveEndAdjustedPageNumber) & vbTab & vbTab & vbCr
End If
.Collapse wdCollapseEnd
.Find.Execute
Loop
StrAcronyms = Replace(Replace(Replace(StrAcronyms, " (", "("), "(", vbTab), ")", "")
Set Rng = .Characters.Last
With Rng
If .Characters.First.Previous <> vbCr Then .InsertAfter vbCr
.InsertAfter Chr(12)
.Collapse wdCollapseEnd
.Style = "Normal"
.Text = StrAcronyms
Set Tbl = .ConvertToTable(Separator:=vbTab, NumRows:=.Paragraphs.Count, NumColumns:=5)
With Tbl
.Columns.AutoFit
.Rows(1).HeadingFormat = True
.Rows(1).Range.Style = "Strong"
.Rows.Alignment = wdAlignRowCenter
End With
.Collapse wdCollapseStart
End With
End With
Rng.Start = ActiveDocument.Range.Start
For i = 2 To Tbl.Rows.Count
StrTmp = "": j = 0: k = 0
With .Range
With .Find
.ClearFormatting
.Replacement.ClearFormatting
.Format = False
.Forward = True
.Text = "[!\(]" & Split(Tbl.Cell(i, 1).Range.Text, vbCr)(0) & "[!\)]"
.MatchWildcards = True
.Execute
End With
Do While .Find.Found
If Not .InRange(Rng) Then Exit Do
j = j + 1
If k <> .Duplicate.Information(wdActiveEndAdjustedPageNumber) Then
k = .Duplicate.Information(wdActiveEndAdjustedPageNumber)
StrTmp = StrTmp & k & " "
End If
.Collapse wdCollapseEnd
.Find.Execute
Loop
End With
Tbl.Cell(i, 4).Range.Text = j
StrTmp = Replace(Trim(StrTmp), " ", ",")
If StrTmp <> "" Then
'Add the current record to the output list (StrOut)
StrTmp = Replace(Replace(ParseNumSeq(StrTmp, "&"), ",", ", "), " ", " ")
End If
Tbl.Cell(i, 5).Range.Text = StrTmp
Next
End With
Set Rng = Nothing: Set Tbl = Nothing
Application.ScreenUpdating = True
End Sub
Function ParseNumSeq(StrNums As String, Optional StrEnd As String)
'This function converts multiple sequences of 3 or more consecutive numbers in a
' list to a string consisting of the first & last numbers separated by a hyphen.
' The separator for the last sequence can be set via the StrEnd variable.
Dim ArrTmp(), i As Long, j As Long, k As Long
ReDim ArrTmp(UBound(Split(StrNums, ",")))
For i = 0 To UBound(Split(StrNums, ","))
ArrTmp(i) = Split(StrNums, ",")(i)
Next
For i = 0 To UBound(ArrTmp) - 1
If IsNumeric(ArrTmp(i)) Then
k = 2
For j = i + 2 To UBound(ArrTmp)
If CInt(ArrTmp(i) + k) <> CInt(ArrTmp(j)) Then Exit For
ArrTmp(j - 1) = ""
k = k + 1
Next
i = j - 2
End If
Next
StrNums = Join(ArrTmp, ",")
StrNums = Replace(Replace(Replace(StrNums, ",,", " "), ", ", " "), " ,", " ")
While InStr(StrNums, " ")
StrNums = Replace(StrNums, " ", " ")
Wend
StrNums = Replace(Replace(StrNums, " ", "-"), ",", ", ")
If StrEnd <> "" Then
i = InStrRev(StrNums, ",")
If i > 0 Then
StrNums = Left(StrNums, i - 1) & Replace(StrNums, ",", " " & Trim(StrEnd), i)
End If
End If
ParseNumSeq = StrNums
End Function

Renaming a Word document and saving its filename with its first 10 letters

I have recovered some Word documents from a corrupted hard drive using a piece of software called photorec. The problem is that the documents' names can't be recovered; they are all renamed by a sequence of numbers. There are over 2000 documents to sort through and I was wondering if I could rename them using some automated process.
Is there a script I could use to find the first 10 letters in the document and rename it with that? It would have to be able to cope with multiple documents having the same first 10 letters and so not write over documents with the same name. Also, it would have to avoid renaming the document with illegal characters (such as '?', '*', '/', etc.)
I only have a little bit of experience with Python, C, and even less with bash programming in Linux, so bear with me if I don't know exactly what I'm doing if I have to write a new script.
How about VBScript? Here is a sketch:
FolderName = "C:\Docs\"
Set fs = CreateObject("Scripting.FileSystemObject")
Set fldr = fs.GetFolder(Foldername)
Set ws = CreateObject("Word.Application")
For Each f In fldr.Files
If Left(f.name,2)<>"~$" Then
If InStr(f.Type, "Microsoft Word") Then
MsgBox f.Name
Set doc = ws.Documents.Open(Foldername & f.Name)
s = vbNullString
i = 1
Do While Trim(s) = vbNullString And i <= doc.Paragraphs.Count
s = doc.Paragraphs(i)
s = CleanString(Left(s, 10))
i = i + 1
Loop
doc.Close False
If s = "" Then s = "NoParas"
s1 = s
i = 1
Do While fs.FileExists(s1)
s1 = s & i
i = i + 1
Loop
MsgBox "Name " & Foldername & f.Name & " As " & Foldername & s1 _
& Right(f.Name, InStrRev(f.Name, "."))
'' This uses copy, because it seems safer
f.Copy Foldername & s1 & Right(f.Name, InStrRev(f.Name, ".")), False
'' MoveFile will copy the file:
'' fs.MoveFile Foldername & f.Name, Foldername & s1 _
'' & Right(f.Name, InStrRev(f.Name, "."))
End If
End If
Next
msgbox "Done"
ws.Quit
Set ws = Nothing
Set fs = Nothing
Function CleanString(StringToClean)
''http://msdn.microsoft.com/en-us/library/ms974570.aspx
Dim objRegEx
Set objRegEx = CreateObject("VBScript.RegExp")
objRegEx.IgnoreCase = True
objRegEx.Global = True
''Find anything not a-z, 0-9
objRegEx.Pattern = "[^a-z0-9]"
CleanString = objRegEx.Replace(StringToClean, "")
End Function
Word documents are stored in a custom format which places a load of binary cruft on the beginning of the file.
The simplest thing would be to knock something up in Python that searched for the first line beginning with ASCII chars. Here you go:
#!/usr/bin/python
import glob
import os
for file in glob.glob("*.doc"):
f = open(file, "rb")
new_name = ""
chars = 0
char = f.read(1)
while char != "":
if 0 < ord(char) < 128:
if ord("a") <= ord(char) <= ord("z") or ord("A") <= ord(char) <= ord("Z") or ord("0") <= ord(char) <= ord("9"):
new_name += char
else:
new_name += "_"
chars += 1
if chars == 100:
new_name = new_name[:20] + ".doc"
print "renaming " + file + " to " + new_name
f.close()
break;
else:
new_name = ""
chars = 0
char = f.read(1)
if new_name != "":
os.rename(file, new_name)
NOTE: if you want to glob multiple directories you'll need to change the glob line accordingly. Also this takes no account of whether the file you're trying to rename to already exists, so if you have multiple docs with the same first few chars then you'll need to handle that.
I found the first chunk of 100 ASCII chars in a row (if you look for less than that you end up picking up doc keywords and such) and then used the first 20 of these to make the new name, replacing anything that's not a-z A-Z or 0-9 with underscores to avoid file name issues.