I'm using the following PowerShell script to open a few thousand HTML files and "save as..." Word documents:
param([string]$htmpath,[string]$docpath = $docpath)
$srcfiles = Get-ChildItem $htmPath -filter "*.htm*"
$saveFormat = [Enum]::Parse([Microsoft.Office.Interop.Word.WdSaveFormat], "wdFormatDocument");
$word = new-object -comobject word.application
$word.Visible = $False
function saveas-document
{
$opendoc = $word.documents.open($doc.FullName);
$opendoc.saveas([ref]"$docpath\$doc.FullName.doc", [ref]$saveFormat);
$opendoc.close();
}
ForEach ($doc in $srcfiles)
{
Write-Host "Processing :" $doc.FullName
saveas-document
$doc = $null
}
$word.quit();
The content converts splendidly, but my filename is not as expected.
$opendoc.saveas([ref]"$docpath\$doc.FullName.doc", [ref]$saveFormat); results in foo.htm saving as foo.htm.FullName.doc instead of foo.doc.
$opendoc.saveas([ref]"$docpath\$doc.BaseName.doc", [ref]$saveFormat); yields foo.htm.BaseName.doc
How do I set up a Save As... filename variable equal to a concatenation of BaseName and .doc?
Based on our comments above, it seems that moving the files is all you want to accomplish. The following works for me. In the current directory, it replaces .txt extensions with .py extensions. I found the command here.
PS C:\testing> dir *.txt | Move-Item -Destination {[IO.Path]::ChangeExtension( $_.Name, "py")}
You can also change *.txt to C:\path\to\file\*.txt so you don't need to execute this line from the location of the files. You should be able to define a destination in a similar manner, so I'll report back if I find a simple way to do that.
Also, I found Microsoft's TechNet Library while I was searching. It has many tutorials on scripting using PowerShell. Files and Folders, Part 3: Windows PowerShell should help you to find additional info on copying and moving files.
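The destination idea mentioned above could be sketched roughly like this. It is a minimal sketch under stated assumptions, not something taken from the linked command: `$source` and `$destination` are hypothetical folders created under the temp directory purely for illustration.

```powershell
# Hypothetical folders for illustration only; a GUID keeps reruns from colliding.
$root        = Join-Path ([IO.Path]::GetTempPath()) ("rename-demo-" + [guid]::NewGuid())
$source      = Join-Path $root "src"
$destination = Join-Path $root "dst"
New-Item -ItemType Directory -Path $source, $destination -Force | Out-Null
Set-Content -Path (Join-Path $source "foo.txt") -Value "sample"

# Move every .txt file into $destination, swapping the extension for .py.
# The script block is evaluated once per input file, with $_ bound to that file.
Get-ChildItem -Path $source -Filter *.txt |
    Move-Item -Destination { Join-Path $destination ([IO.Path]::ChangeExtension($_.Name, "py")) }
```

Using Copy-Item instead of Move-Item in the same position should leave the originals in place, which may be safer for a first run.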
I was having problems just converting the filename from .html to .docx. I took your code above and changed it to this:
function Convert-HTMLtoDocx {
param([string]$htmpath)
$srcfiles = Get-ChildItem $htmPath -filter "*.htm*"
$saveFormat = [Microsoft.Office.Interop.Word.WdSaveFormat]::wdFormatXMLDocument
$word = new-object -comobject word.application
$word.Visible = $False
ForEach ($doc in $srcfiles) {
Write-Host "Processing :" $doc.fullname
$name = Join-Path -Path $doc.DirectoryName -ChildPath $($doc.BaseName + ".docx")
$opendoc = $word.documents.open($doc.FullName)
$opendoc.saveas([ref]$name,[ref]$saveFormat)
$opendoc.close()
$doc = $null
} #End ForEach
$word.quit()
} #End Function
The problem was the save format. To save a document as .docx you need to specify the format as wdFormatXMLDocument, not wdFormatDocument, which is the older binary .doc format.
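On the original filename question, the likely cause is how PowerShell expands variables inside double-quoted strings: only `$doc` is expanded, and `.BaseName` is kept as literal text. A small sketch of the difference follows. `$doc` here is a stand-in object with hypothetical property values, not a real file from the script.

```powershell
# Stand-in for a FileInfo object; the property values are hypothetical.
$doc     = [pscustomobject]@{ BaseName = "foo"; FullName = "C:\in\foo.htm" }
$docpath = "C:\out"

# Inside double quotes only $doc is expanded; ".BaseName.doc" stays literal text.
$wrong = "$docpath\$doc.BaseName.doc"

# A subexpression $( ... ) evaluates the property access before it is inserted.
$right = "$docpath\$($doc.BaseName).doc"

$right   # C:\out\foo.doc
```

Building the name with Join-Path and explicit concatenation, as in the function above, avoids the interpolation question altogether.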
This recursively walks a root folder and saves each .doc file as filtered HTML (.htm):
$docpath = "\\sf-xyz-serverabc01\ChangeTheseDocuments"
$WdTypes = Add-Type -AssemblyName 'Microsoft.Office.Interop.Word, Version=14.0.0.0, Culture=neutral, PublicKeyToken=71e9bce111e9429c' -Passthru
$srcfiles = get-childitem $docpath -filter "*.doc" -rec | where {!$_.PSIsContainer} | select-object FullName
$saveFormat = $WdTypes | Where {$_.Name -eq 'WdSaveFormat'}
$word = new-object -comobject word.application
$word.Visible = $False
function saveas-filteredhtml
{
$opendoc = $word.documents.open($doc.FullName);
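# Swap only the extension; a plain .replace("doc","htm") would also rewrite "doc" elsewhere in the path
$Name=[IO.Path]::ChangeExtension($doc.FullName, "htm")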
$opendoc.saveas([ref]$Name, [ref]$saveFormat::wdFormatFilteredHTML);
$opendoc.close();
}
ForEach ($doc in $srcfiles)
{
Write-Host "Processing :" $doc.FullName
saveas-filteredhtml
$doc = $null
}
$word.quit();
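When deriving the output name it may help to compare string replacement with an extension-only change. The sketch below uses a made-up path (an assumption for illustration) that contains "doc" in more than one place.

```powershell
# Hypothetical path that happens to contain "doc" more than once.
$fullName = "C:\docs\doctor-notes.doc"

# String.Replace substitutes every occurrence, including folder and file names.
$viaReplace = $fullName.Replace("doc", "htm")

# ChangeExtension only touches the part after the final dot.
$viaChangeExtension = [IO.Path]::ChangeExtension($fullName, "htm")

$viaReplace           # C:\htms\htmtor-notes.htm
$viaChangeExtension   # C:\docs\doctor-notes.htm
```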
I know this is an older post, but I am posting this code here so that I can find it in the future.
This does a recursive walk of a root folder and converts .doc and .docx files to .txt.
Here is a LINK to the different formats you can save to.
$docpath = "C:\Temp"
$WdTypes = Add-Type -AssemblyName 'Microsoft.Office.Interop.Word, Version=14.0.0.0, Culture=neutral, PublicKeyToken=71e9bce111e9429c' -Passthru
$srcfiles = get-childitem $docpath -filter "*.doc" -rec | where {!$_.PSIsContainer} | select-object FullName
$saveFormat = $WdTypes | Where {$_.Name -eq 'WdSaveFormat'}
$word = new-object -comobject word.application
$word.Visible = $False
function saveas-filteredhtml
{
$opendoc = $word.documents.open($doc.FullName);
$Name=($doc.Fullname).replace(".docx",".txt").replace(".doc",".txt")
$opendoc.saveas([ref]$Name, [ref]$saveFormat::wdFormatDOSText); ##wdFormatDocument
$opendoc.close();
}
ForEach ($doc in $srcfiles)
{
Write-Host "Processing :" $doc.FullName
saveas-filteredhtml
$doc = $null
}
$word.quit();
I am looking for a way to compact and repair all the Access databases in a certain directory using a PowerShell script.
The VBA code below works, but I need an equivalent for PowerShell:
Find all Access databases, and Compact and Repair
I am new to Powershell so will be grateful for the assistance.
Thanks
You may try this.
Add-Type -AssemblyName Microsoft.Office.Interop.Access
$rootfolder = 'c:\some\folder'
$createlog = $true # change to false if no log desired
$access = New-Object -ComObject access.application
$access.Visible = $false
$access.AutomationSecurity = 1
Get-ChildItem -Path $rootfolder -File -Filter *.accdb -Recurse -PipelineVariable file | ForEach-Object {
$newname = Join-Path $file.Directory ("{0}_compacted{1}" -f $file.BaseName,$file.Extension)
$message = @"
Current file: {0}
Output file: {1}
"@ -f $file.FullName,$newname
Write-Host $message -ForegroundColor Cyan
$access.CompactRepair($file.fullname,$newname,$createlog)
}
$access.Quit()
This will output each compacted database as the name of the original file with _compacted appended to the name (before the extension.) I have tested this in every way except actually compacting databases.
Edit
Regarding your comment, a few minor changes should achieve the desired result. Keep in mind that this will put all new files in the same folder. This may not be an issue for your case but if there are duplicate file names you will have problems.
$rootfolder = 'c:\some\folder'
$destination = 'c:\some\other\folder'
$todaysdate = get-date -format '_dd_MM_yyyy'
Add-Type -AssemblyName Microsoft.Office.Interop.Access
$createlog = $true # change to false if no log desired
$access = New-Object -ComObject access.application
$access.Visible = $false
$access.AutomationSecurity = 1
Get-ChildItem -Path $rootfolder -File -Filter *.accdb -Recurse -PipelineVariable file | ForEach-Object {
$newname = Join-Path $destination ("{0}$todaysdate{1}" -f $file.BaseName,$file.Extension)
$message = @"
Current file: {0}
Output file: {1}
"@ -f $file.FullName,$newname
Write-Host $message -ForegroundColor Cyan
$access.CompactRepair($file.fullname,$newname,$createlog)
}
$access.Quit()
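If duplicate file names are a concern when everything lands in one folder, one possible approach is to pick an unused name before calling CompactRepair. The helper below is a hypothetical sketch (the name `Get-UniquePath` and the `_1`, `_2` suffix scheme are assumptions, not part of the script above).

```powershell
# Hypothetical helper: returns $Path unchanged if it is free, otherwise
# appends _1, _2, ... before the extension until an unused name is found.
function Get-UniquePath {
    param([Parameter(Mandatory)][string]$Path)

    $directory = [IO.Path]::GetDirectoryName($Path)
    $baseName  = [IO.Path]::GetFileNameWithoutExtension($Path)
    $extension = [IO.Path]::GetExtension($Path)

    $candidate = $Path
    $counter   = 1
    while (Test-Path -LiteralPath $candidate) {
        $candidate = Join-Path $directory ("{0}_{1}{2}" -f $baseName, $counter, $extension)
        $counter++
    }
    $candidate
}
```

In the loop this would be used as `$newname = Get-UniquePath (Join-Path $destination ("{0}$todaysdate{1}" -f $file.BaseName,$file.Extension))`, so a second database with the same base name gets a numbered suffix instead of colliding.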
I have a PowerShell script that automatically downloads attachments from Outlook and saves them to a folder I have already set. The script works fine, but I then realised that some of the downloaded attachments are corrupted. Here is the script that I use:
Function saveattachmentexcel
{
$Null = Add-type -Assembly "Microsoft.Office.Interop.Outlook"
#olFolders = "Microsoft.Office.Interop.Outlook.olDefaultFolders" -as [type]
#olFolderInbox = 6
$outlook = new-object -comobject outlook.application
$namespace = $outlook.GetNameSpace("MAPI")
$folder = $nameSpace.GetDefaultFolder([Microsoft.Office.Interop.Outlook.OlDefaultFolders]::olFolderInbox)
$filepath = "D:\DMR Folder\"
$folder.Items | Where {$_.UnRead -eq $True -and $($_.attachments).filename -match '.xlsm'} | ForEach-object {
$filename = $($_.attachments | where filename -match '.xlsm').filename
foreach($file in $filename)
{
$outpath = join-path $filepath $file
$($_.attachments).saveasfile($outpath)
}
$_.UnRead = $False
}
}
saveattachmentexcel
I do not know why this is happening. Could anyone please help me?
This is likely because you attempt to save every single attachment to the same file name on disk with the $($_.attachments).saveasfile($outpath) statement.
Change this:
$filename = $($_.attachments | where filename -match '.xlsm').filename
foreach($file in $filename)
{
$outpath = join-path $filepath $file
$($_.attachments).saveasfile($outpath)
}
to:
foreach($attachment in $_.attachments)
{
if($attachment.Filename -like '*.xlsm'){
$outpath = Join-Path $filepath $attachment.Filename
# Only save this particular attachment to disk - not all of them
$attachment.SaveAsFile($outpath)
}
}
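As a side note, the switch from `-match '.xlsm'` to `-like '*.xlsm'` also tightens the filter, since `-match` takes a regular expression. This is probably not the cause of the corruption, but the sketch below, using made-up file names, shows how the two operators differ.

```powershell
# Made-up attachment names for illustration.
$names = "report.xlsm", "report.xlsm.bak", "notes-xlsm.txt"

# -match treats '.xlsm' as a regex: '.' matches any character and nothing anchors it.
$regexHits = $names | Where-Object { $_ -match '.xlsm' }

# -like with a leading wildcard only accepts names that end in .xlsm.
$wildcardHits = $names | Where-Object { $_ -like '*.xlsm' }

# An escaped, anchored regex behaves like the wildcard test for these names.
$anchoredHits = $names | Where-Object { $_ -match '\.xlsm$' }
```

Here `$regexHits` contains all three names, while `$wildcardHits` and `$anchoredHits` contain only `report.xlsm`.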
I have drafted a PowerShell script that searches for a string among a large number of Word files. The script is working fine, but I have around 1 GB of data to search through and it is taking around 15 minutes.
Can anyone suggest any modifications I can do to make it run faster?
Set-StrictMode -Version latest
$path = "c:\Tester1"
$output = "c:\Scripts\ResultMatch1.csv"
$application = New-Object -comobject word.application
$application.visible = $False
$findtext = "Roaming"
$charactersAround = 30
$results = @()
Function getStringMatch
{
For ($i=1; $i -le 4; $i++) {
$j="D"+$i
$finalpath=$path+"\"+$j
$files = Get-Childitem $finalpath -Include *.docx,*.doc -Recurse | Where-Object { !($_.psiscontainer) }
# Loop through all *.doc files in the $path directory
Foreach ($file In $files)
{
$document = $application.documents.open($file.FullName,$false,$true)
$range = $document.content
If($range.Text -match ".{$($charactersAround)}$($findtext).{$($charactersAround)}"){
$properties = @{
File = $file.FullName
Match = $findtext
TextAround = $Matches[0]
}
$results += New-Object -TypeName PsCustomObject -Property $properties
$document.close()
}
}
}
If($results){
$results | Export-Csv $output -NoTypeInformation
}
$application.quit()
}
getStringMatch
import-csv $output
As mentioned in comments, you might want to consider using the OpenXML SDK library (you can also get the newest version of the SDK on GitHub), since it's way less overhead than spinning up an instance of Word.
Below I've turned your current function into a more generic one, using the SDK and with no dependencies on the caller/parent scope:
function Get-WordStringMatch
{
param(
[Parameter(Mandatory,ValueFromPipeline)]
[System.IO.FileInfo[]]$Files,
[string]$FindText,
[int]$CharactersAround
)
begin {
# import the OpenXML library
Add-Type -Path 'C:\Program Files (x86)\Open XML SDK\V2.5\lib\DocumentFormat.OpenXml.dll' |Out-Null
# make a "shorthand" reference to the word document type
$WordDoc = [DocumentFormat.OpenXml.Packaging.WordprocessingDocument] -as [type]
# construct the regex pattern
$Pattern = ".{$CharactersAround}$([regex]::Escape($FindText)).{$CharactersAround}"
}
process {
# loop through all the *.doc(x) files
foreach ($File In $Files)
{
# open document, wrap content stream in streamreader
$Document = $WordDoc::Open($File.FullName, $false)
$DocumentStream = $Document.MainDocumentPart.GetStream()
$DocumentReader = New-Object System.IO.StreamReader $DocumentStream
# read entire document
if($DocumentReader.ReadToEnd() -match $Pattern)
{
# got a match? output our custom object
New-Object psobject -Property @{
File = $File.FullName
Match = $FindText
TextAround = $Matches[0]
}
}
}
}
end{
# Clean up
$DocumentReader.Dispose()
$DocumentStream.Dispose()
$Document.Dispose()
}
}
Now that you have a nice function that supports pipeline input, all you need to do is gather your documents and pipe them to it!
# variables
$path = "c:\Tester1"
$output = "c:\Scripts\ResultMatch1.csv"
$findtext = "Roaming"
$charactersAround = 30
# gather the files
$files = 1..4|ForEach-Object {
$finalpath = Join-Path $path "D$_"
Get-Childitem $finalpath -Recurse | Where-Object { !($_.PsIsContainer) -and @('.docx','.doc') -contains $_.Extension}
}
# run them through our new function
$results = $files |Get-WordStringMatch -FindText $findtext -CharactersAround $charactersAround
# got any results? export it all to CSV
if($results){
$results |Export-Csv -Path $output -NoTypeInformation
}
Since all of our components now support pipelining, you could do it all in one go:
1..4|ForEach-Object {
$finalpath = Join-Path $path "D$_"
Get-Childitem $finalpath -Recurse | Where-Object { !($_.PsIsContainer) -and @('.docx','.doc') -contains $_.Extension}
} |Get-WordStringMatch -FindText $findtext -CharactersAround $charactersAround |Export-Csv -Path $output -NoTypeInformation
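The pattern built in the `begin` block of Get-WordStringMatch can be checked in isolation, without any documents. The sketch below uses a made-up search term and sample strings (assumptions for illustration) to show what `[regex]::Escape` contributes.

```powershell
$findText         = "a.b"   # hypothetical search term containing a regex metacharacter
$charactersAround = 3

# Same construction as in the function: N characters, the escaped term, N characters.
$pattern = ".{$charactersAround}$([regex]::Escape($findText)).{$charactersAround}"

$pattern                      # .{3}a\.b.{3}
"xxxa.byyy" -match $pattern   # True
$Matches[0]                   # xxxa.byyy
"xxxaXbyyy" -match $pattern   # False
```

Note that a pattern of this shape requires exactly N characters on each side, so a hit within the first or last N characters of the text is not reported. If that matters, a range such as `.{0,3}` in place of `.{3}` would be one way to relax it.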
I'm very inexperienced in Powershell - but through trial and error I have managed to get a .doc/.docx to .pdf conversion working well for a specified folder and all subfolders.
$wdFormatPDF = 17
$word = New-Object -ComObject word.application
$word.visible = $false
$fileTypes = "*.docx","*.doc"
Get-ChildItem -Recurse -path "C:\test-acrobat" -include $fileTypes |
foreach-object `
{
$path = ($_.fullname).substring(0,($_.FullName).lastindexOf("."))
"Converting $path to pdf ..."
$doc = $word.documents.open($_.fullname)
$doc.saveas( $path, $wdFormatPDF)
$doc.close()
}
$word.Quit()
Now I'd like to be able to delete the original .doc/.docx files once they've been converted. On doing some searching I've found what I think would work:
{
remove-item $fileTypes # delete file from file-system
}
But I'd rather check than throw in a command to delete files...
Any help is greatly appreciated.
Philip
I would add the delete inside the foreach loop.
So you would get:
$wdFormatPDF = 17
$word = New-Object -ComObject word.application
$word.visible = $false
$fileTypes = "*.docx","*.doc"
Get-ChildItem -Recurse -path "C:\test-acrobat" -include $fileTypes |
foreach-object `
{
$path = ($_.fullname).substring(0,($_.FullName).lastindexOf("."))
Write-Host "Converting $path to pdf ..."
$doc = $word.documents.open($_.fullname)
$doc.saveas( $path, $wdFormatPDF)
$doc.close()
Remove-Item $_.fullname
}
$word.Quit()
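Since deleting the originals is irreversible, one cautious variation is to delete a source file only after confirming that its PDF actually exists. The helper below is a hypothetical sketch (the name `Remove-IfConverted` is an assumption, not part of the script above).

```powershell
# Hypothetical helper: delete $SourcePath only if $PdfPath exists and is not empty.
function Remove-IfConverted {
    param(
        [Parameter(Mandatory)][string]$SourcePath,
        [Parameter(Mandatory)][string]$PdfPath
    )
    $pdf = Get-Item -LiteralPath $PdfPath -ErrorAction SilentlyContinue
    if ($pdf -and $pdf.Length -gt 0) {
        Remove-Item -LiteralPath $SourcePath
        return $true
    }
    Write-Warning "No PDF found at $PdfPath; keeping $SourcePath"
    return $false
}
```

Inside the loop this could replace the plain Remove-Item call with `Remove-IfConverted -SourcePath $_.FullName -PdfPath "$path.pdf"`, assuming Word appended the .pdf extension to `$path` when saving. For a dry run, adding `-WhatIf` to Remove-Item reports what would be deleted without deleting anything.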
Hi, I am trying to remove the 'hidden data' and personal information from '.doc, .docx, .pptx' documents through PowerShell:
Here is the PowerShell script I have written for this:
$path = "C:\Users\anisjain\Documents\GRR Production\HiddenProrerties"
Add-Type -AssemblyName Microsoft.Office.Interop.Word
$xlRemoveDocType = "Microsoft.Office.Interop.xlRDIRemovePersonalInformation" -as [type]
$wordFiles = Get-ChildItem -Path $path -include *.doc, *.docx -recurse
$objword = New-Object -ComObject word.application
foreach($obj in $wordFiles)
{
$documents = $MSWord.Documents.Open($obj.fullname)
"Removing document information from $obj"
$documents.RemoveDocumentInformation($xlRemoveDocType::xlRDIRemovePersonalInformation)
$documents.Save()
$objword.documents.close()
}
$objword.Quit()
This, however, doesn't work. Can someone please tell me where I am going wrong,
and whether there is some other way of doing it? I have around 2000 records from which I wish to remove the 'hidden document information'. Thanks in advance.
Here's the script that works for me, after some googling/copying/modifying:
$path = "d:\rubbish\myfolder\"
Add-Type -AssemblyName Microsoft.Office.Interop.Word
$WdRemoveDocType = "Microsoft.Office.Interop.Word.WdRemoveDocInfoType" -as [type]
$wordFiles = Get-ChildItem -Path $path -include *.doc, *.docx -recurse
$objword = New-Object -ComObject word.application
$objword.visible = $false
foreach($obj in $wordFiles)
{
$documents = $objword.Documents.Open($obj.fullname)
"Removing document information from $obj"
# WdRemoveDocInfoType Enumeration Reference
# http://msdn.microsoft.com/en-us/library/microsoft.office.interop.word.wdremovedocinfotype(v=office.14).aspx
# 99 = WdRDIAll
#$documents.RemoveDocumentInformation(99)
$documents.RemoveDocumentInformation($WdRemoveDocType::wdRDIAll)
$documents.Save()
$objword.documents.close()
}
$objword.Quit()