PowerShell Writing to .XLSX is Corrupting the Files

I have a PowerShell script that loops through .xlsx files in a folder and password protects them with the file name (for now). I have no problem looping through and writing to .xls, but when I try to open an .xlsx file after writing it with PowerShell, I get the error:
Excel cannot open the file 'abcd.xlsx' because the file format or file
extension is not valid. Verify that the file has not been corrupted
and that the file extension matches the format of the file.
Here's the script:
function Release-Ref ($ref) {
    ([System.Runtime.InteropServices.Marshal]::ReleaseComObject(
        [System.__ComObject]$ref) -gt 0)
    [System.GC]::Collect()
    [System.GC]::WaitForPendingFinalizers()
}
$e = $ErrorActionPreference
$ErrorActionPreference = "continue"
foreach ($f in Get-ChildItem "C:") {
    try {
        $ff = $f
        $xlNormal = -4143
        $s = [System.IO.Path]::GetFileNameWithoutExtension($f)
        $xl = New-Object -ComObject excel.application
        $xl.Visible = $False
        $xl.DisplayAlerts = $False
        $wb = $xl.Workbooks.Open($ff.FullName)
        $wb.sheets(1).columns("A:S").entirecolumn.AutoFit()
        $wb.sheets(1).columns("N").NumberFormat = "0.0%"
        $a = $wb.SaveAs("C:\Out\" + $s + ".xls", $xlNormal, $s) #works
        #$a = $wb.SaveAs("C:\Out\" + $s + ".xlsx", $xlNormal, $s) #doesn't work
        $a = $xl.Quit()
        $a = Release-Ref($ws)
        $a = Release-Ref($wb)
        $a = Release-Ref($xl)
    }
    catch {
        Write-Output "Exception"
        $ErrorActionPreference = $e
    }
}
I've searched other questions but can't find any other examples of the same issue when writing from PowerShell. Thank you.

The problem is that Xls is a different format from Xlsx. Excel versions before 2007 used binary formats. Office 2007 introduced new formats called Office Open XML, which Xlsx uses.
Excel is smart enough to check both the file extension and the file format. Since saving a binary file with the new versions' extension creates a conflict, the error message hints at this possibility too:
and that the file extension matches the format of the file.
Why doesn't Excel just open the file anyway? I guess it's a security feature that prevents unintentional opening of Office documents. Back in the day, Office macro viruses were the bane of many offices. One of the main infection vectors was tricking users into opening files without precautions. Unlike classic viruses, macro viruses infected application data (including default template files) instead of OS binaries, but that's another story.
Anyway, to save in the proper format, use the proper format value. That would be -4143 for Xls and 51 for Xlsx. What's more, Get-ChildItem returns a collection of FileInfo objects, and the file extension is available in the Extension property. Like so,
# Define Xls and Xlsx versions
$typeXls = -4143
$typeXlsx = 51
foreach ($f in Get-ChildItem "C:") {
    try {
        $ff = $f
        ...
        # Select the SaveAs type to match the original file extension
        if ($f.Extension -eq '.xls') { $fType = $typeXls }
        elseif ($f.Extension -eq '.xlsx') { $fType = $typeXlsx }
        $a = $wb.SaveAs("C:\Out\" + $s + $f.Extension, $fType, $s)

Working with COM objects can get complicated with Excel. I recommend the ImportExcel module.
Install-Module -Name ImportExcel
Then you can do something like this.
$e = $ErrorActionPreference
$ErrorActionPreference = "continue"
foreach ($f in Get-ChildItem $folder -Filter *.xlsx) { # $folder: directory containing the workbooks
    try {
        $filePass = $f.BaseName # password is the file name, as in the question
        $path = Split-Path $f.FullName
        $newFile = $path + "\" + $f.BaseName + "-protected.xlsx"
        Import-Excel $f.FullName | Export-Excel $newFile -Password $filePass -NoNumberConversion * -AutoSize
    }
    catch {
        Write-Output "Exception"
        $ErrorActionPreference = $e
    }
}
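One advantage of this route: ImportExcel writes the workbook directly rather than automating Excel over COM, so none of the Release-Ref cleanup from the question is needed. A one-off usage sketch, with a hypothetical path:
# Re-save a single workbook with its base name as the password (path is hypothetical)
Import-Excel 'C:\Out\report.xlsx' | Export-Excel 'C:\Out\report-protected.xlsx' -Password 'report' -AutoSize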

Related

Extracting text from Word documents

TASK
Extract text from .doc, .docx and .pdf files and upload the content to an Azure SQL database. It needs to be fast, as it's running over millions of documents.
ISSUES
The script starts to fail if one of the documents has an issue. Some that I have come across are:
This file failed to open last time you tried - Open readonly
File is corrupt
SCRIPT
First I generate lists of files, each containing 100 file paths. This is so I can continue execution if I need to stop it and/or it errors out:
## Word object
if (!($continue)) {
    $files = (Get-ChildItem -Force -Recurse $documentFolder -Include *.doc, *.docx).FullName
    $files | Out-File (Join-Path $PSScriptRoot "\documents.txt")
    $i = 0; Get-Content $documentFile -ReadCount 100 | % { $i++; $_ | Out-File (Join-Path $PSScriptRoot "\FileLists\documents_$i.txt") }
}
Then I create the COM object with the DisplayAlerts flag set to 0 (I thought this would fix it; it didn't):
$word = New-Object -ComObject word.application
$word.Visible = $false
$saveFormat = [Enum]::Parse([Microsoft.Office.Interop.Word.WdSaveFormat], "wdFormatText")
$word.DisplayAlerts = 0
After this, I loop through each file in each list, save the file as .txt to the temp folder, extract the text, and generate an SQL INSERT statement:
foreach ($file in (Get-Content $list)) {
    Try {
        if ($file -like "*-*") {
            Write-Output "Processing: $($file)"
            $doc = $word.Documents.Open($file)
            $fileName = [io.path]::GetFileNameWithoutExtension($file)
            $fileName = $filename + ".txt"
            $doc.SaveAs("$env:TEMP\$fileName", [ref]$saveFormat)
            $doc.Close()
            $4ID = $fileName.split('-')[-1].replace(' ', '').replace(".txt", "")
            $text = Get-Content -Raw "$env:TEMP\$fileName"
            $text = $text.replace("'", "")
            $query += "
            ('$text', $4ID),"
            Remove-Item -Force "$env:TEMP\$fileName"
        }
        <# Upload to azure #>
        $query = $query.Substring(0, $query.Length - 1)
        $query += ";"
        $params = @{
            'Database'       = $TRIS5DATABASENAME
            'ServerInstance' = $($AzureServerInstance.FullyQualifiedDomainName)
            'Username'       = $AdminLogin
            'Password'       = $InsecurePassword
            'query'          = $query
        }
        Invoke-Sqlcmd @params -ErrorAction "continue"
        $query = "INSERT INTO tmp_CachedText (tCachedText, tOID)
        VALUES "
    }
    Catch {
        Write-Host "$($file) failed to process" -ForegroundColor RED;
    }
}
Remove-Item -Force $list.FullName
Remove-Item -Force $list.FullName
ISSUES
As stated above, if something is wrong with one of the files, or a document failed to open properly on a previous run, the script starts failing. Everything in the loop throws errors, starting with:
You cannot call a method on a null-valued expression.
At D:\OneDrive\Scripts\Microsoft Cloud\CachedText-Extraction\CachedText-Extraction.ps1:226 char:13
+ $doc = $word.Documents.Open($file)
Basically what I want is a way to stop those errors from appearing by simply skipping the file if it has an error with the document. Alternatively, if there is a better way to extract text from document files using PowerShell without using Word, that would be good too.
An example of one of the error messages is the "this file failed to open last time" prompt mentioned above (the screenshot is omitted here). It causes the file to be locked and execution to pause. The only way to get around it is to kill Word, which then causes the rest of the script to fail.
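A minimal sketch of the skip-on-error approach the question asks for, assuming the same $word and $list variables as above. The third argument to Documents.Open is ReadOnly, which sidesteps some of the "failed to open last time" prompts; anything that still throws is caught and the file is skipped:
# Minimal sketch (assumes $word and $list from the snippets above)
# Open(FileName, ConfirmConversions, ReadOnly)
foreach ($file in (Get-Content $list)) {
    $doc = $null
    try {
        $doc = $word.Documents.Open($file, $false, $true)
    }
    catch {
        Write-Warning "Skipping $file : $($_.Exception.Message)"
        continue
    }
    if (-not $doc) { continue } # Open returned nothing; skip this file
    # ... process $doc as in the loop above ...
    $doc.Close($false) # close without saving changes
}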

Export or Print Outlook Emails to PDF

I am using PowerShell to loop through designated folders in Outlook and saving the attachments in a tree-like structure. This works wonders, but now management has requested the email itself be saved as a PDF as well. I found the PrintOut method on the mail item, but that prompts for a file name. I haven't been able to figure out what to pass to it to have it automatically save to a specific file name. I looked on the MSDN page and it was a bit too high for my current level.
I am using the COM object outlook.application.
Short of saving all of the emails to a temp file and using a third-party method, are there parameters I can pass to PrintOut? Or another way to accomplish this?
Here is the base of the code to get the emails. I loop through $Emails:
$Outlook = New-Object -comobject outlook.application
$Connection = $Outlook.GetNamespace("MAPI")
#Prompt which folder to process
$Folder = $Connection.PickFolder()
$Outlook_Folder_Path = ($Folder.FullFolderPath).Split("\",4)[3]
$BaseFolder += $Outlook_Folder_Path + "\"
$Emails = $Folder.Items
Looks like there are no built-in methods, but if you're willing to use a third-party binary, wkhtmltopdf can be used.
Get the precompiled binary (use the MinGW 32-bit build for maximum compatibility).
Install it, or extract the installer with 7-Zip, and copy wkhtmltopdf.exe to your script directory. It has no external dependencies and can be redistributed with your script, so you don't have to install a PDF printer on all PCs.
Use the HTMLBody property of the MailItem object in your script for PDF conversion.
Here is an example:
# Get path to wkhtmltopdf.exe
$ExePath = Join-Path -Path (
    Split-Path -Path $Script:MyInvocation.MyCommand.Path
) -ChildPath 'wkhtmltopdf.exe'
# Set PDF path
$OutFile = Join-Path -Path 'c:\path\to\emails' -ChildPath ($Email.Subject + '.pdf')
# Convert HTML string to PDF file
$ret = $Email.HTMLBody | & $ExePath @('--quiet', '-', $OutFile) 2>&1
# Check for errors
if ($LASTEXITCODE) {
    Write-Error $ret
}
Please note that I've no experience with Outlook and used MSDN to get the relevant properties of the MailItem object, so the code might need some tweaking.
Had this same issue. This is what I did to fix it, if anybody else is trying to do something similar.
You could start by taking your .msg file and converting it to .doc, then converting the .doc file to .pdf.
$outlook = New-Object -ComObject Outlook.Application
$word = New-Object -ComObject Word.Application
Get-ChildItem -Path $folderPath -Filter *.msg | ForEach-Object {
    $msgFullName = $_.FullName
    $docFullName = $msgFullName -replace '\.msg$', '.doc'
    $pdfFullName = $msgFullName -replace '\.msg$', '.pdf'
    $msg = $outlook.CreateItemFromTemplate($msgFullName)
    $msg.SaveAs($docFullName, 4) # 4 = olDoc
    $doc = $word.Documents.Open($docFullName)
    $doc.SaveAs([ref] $pdfFullName, [ref] 17) # 17 = wdFormatPDF
    $doc.Close()
}
Then just clean up the unwanted files afterwards, for example like this:
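A minimal cleanup pass, assuming $folderPath from the snippet above contains nothing but these conversions (otherwise filter more carefully before deleting):
# Remove the intermediate .doc files once the PDFs exist, then release the COM apps
Get-ChildItem -Path $folderPath -Filter *.doc | Remove-Item -Force
$word.Quit()
$outlook.Quit()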

PowerShell not sending the right path for a file as an argument

I'm trying to apply a hash function to all the files inside a folder as some kind of version control. The idea is to make a text file that lists the name of each file and its generated checksum. Digging online, I found some code that should do the trick (in theory):
$list = Get-ChildItem 'C:\users\public\documents\folder' -Filter *.cab
$sha1 = New-Object System.Security.Cryptography.SHA1CryptoServiceProvider
foreach ($file in $list) {
    $return = "" | Select Name, Hash
    $returnname = $file.Name
    $returnhash = [System.BitConverter]::ToString($sha1.ComputeHash([System.IO.File]::ReadAllBytes($file.Name)))
    $return = "$returnname,$returnhash"
    Out-File -FilePath .\mylist.txt -Encoding Default -InputObject ($return) -Append
}
When I run it, however, I get an error because it tries to read the files from c:\users\me\, the folder where I'm running the script, and the file c:\users\me\aa.cab does not exist and hence can't be reached.
I've tried everything that I could think of, but no luck. I'm using Windows 7 with PowerShell 2.0, if that helps in any way.
Try with .FullName instead of just .Name. ReadAllBytes resolves a bare file name against the process's current directory rather than the folder you listed, so pass the full path:
$returnhash = [System.BitConverter]::ToString($sha1.ComputeHash([System.IO.File]::ReadAllBytes($file.FullName)))
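As an aside, on PowerShell 4.0 and later the whole loop collapses into Get-FileHash. This is only a sketch, since the question is on PowerShell 2.0, where that cmdlet doesn't exist:
# PowerShell 4.0+: hash each .cab and write "name,hash" lines, as above
Get-ChildItem 'C:\users\public\documents\folder' -Filter *.cab |
    ForEach-Object {
        # The Hash property holds the hex digest
        "{0},{1}" -f $_.Name, (Get-FileHash $_.FullName -Algorithm SHA1).Hash
    } |
    Out-File .\mylist.txt -Encoding Default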

PowerShell automated deletion of specified SharePoint documents

We have a CSV file with approximately 8,000 SharePoint document file URLs. The files they refer to have to be downloaded to a file share location, then deleted from SharePoint. The files are not located in the same sites, but across several hundred in a server farm. We are looking to remove only the specified files, NOT the entire library.
We have the following script to effect the download, which creates the folder structure so that the downloaded files are separated.
param (
    [Parameter(Mandatory=$True)]
    [string]$base = "C:\Export\",
    [Parameter(Mandatory=$True)]
    [string]$csvFile = "c:\export.csv"
)
Write-Host "Commencing Download"
$date = Get-Date
Add-Content C:\Export\Log.txt "Commencing Download at $date"
$webclient = New-Object System.Net.WebClient
$webclient.UseDefaultCredentials = $true
$files = (Import-Csv $csvFile | Where-Object {$_.Name -ne ""})
$line = 1
foreach ($file in $files) {
    $line = $line + 1
    if (($file.SpURL -ne "") -and ($file.path -ne "")) {
        $lastSlash = $file.SpURL.LastIndexOf("/")
        if ($lastSlash -ne -1) {
            $fileName = $file.SpURL.substring(1 + $lastSlash)
            $filePath = $base + $file.path.replace("/", "\")
            New-Item -ItemType Directory -Force -Path $filePath.substring(0, $filePath.length - 1)
            $webclient.DownloadFile($file.SpURL, $filePath + $fileName)
            $url = $file.SpURL
            Add-Content C:\Export\Log.txt "INFO: Processing line $line in $csvFile, writing $url to $filePath$fileName"
        } else {
            $host.ui.WriteErrorLine("Exception: URL has no slash on line $line for filename $csvFile")
        }
    } else {
        $host.ui.WriteErrorLine("Exception: URL or Path is empty on line $line for filename $csvFile")
    }
}
Write-Host "Download Complete"
Is there a way we could get the versions for each file?
I have been looking for a means to carry out the deletion, using the same CSV file as a reference; all of the code I have seen refers to deleting entire libraries, which is not what we want.
I am very new to PowerShell and am getting lost. Can anyone shed some light?
Many thanks.
This looks like it might be useful. It's a different approach and would need to be modified to pull in the file list from your CSV, but it looks like it generally accomplishes what you are looking to do.
https://sharepoint.stackexchange.com/questions/6511/download-and-delete-documents-using-powershell
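If the script can run on a farm server, here is a rough, untested sketch of the deletion side, driven by the same CSV. It assumes the server-side Microsoft.SharePoint.PowerShell snap-in is available and that each SpURL is an absolute URL:
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue
foreach ($file in (Import-Csv $csvFile | Where-Object { $_.SpURL -ne "" })) {
    $site = New-Object Microsoft.SharePoint.SPSite($file.SpURL) # resolves to the containing site collection
    $web = $site.OpenWeb()
    $spFile = $web.GetFile($file.SpURL)
    if ($spFile.Exists) {
        # $spFile.Versions holds the version history, if you need it before deleting
        $spFile.Delete()
    }
    $web.Dispose()
    $site.Dispose()
}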

Slow PowerShell script for CSV modification

I'm using a PowerShell script to append data to the end of a bunch of files.
Each file is a CSV of around 50 MB (say, 2 million lines); there are about 50 files.
The script I'm using looks like this:
$MyInvocation.MyCommand.Path
$files = ls *.csv
foreach ($f in $files)
{
    $baseName = [System.IO.Path]::GetFileNameWithoutExtension($f)
    $year = $basename.substring(0,4)
    Write-Host "Starting" $Basename
    $r = [IO.File]::OpenText($f)
    while ($r.Peek() -ge 0) {
        $line = $r.ReadLine()
        $line + "," + $year | Add-Content $(".\DR_" + $basename + ".CSV")
    }
    $r.Dispose()
}
Problem is, it's pretty slow. It's taken about 12 hours to get through them.
It's not super complex, so I wouldn't expect it to take that long to run.
What could I do to speed it up?
Reading and writing a file row by row can be a bit slow, and maybe your antivirus is contributing to the slowness as well. Use Measure-Command to see which parts of the script are the slow ones.
As general advice, write a few large blocks rather than lots of small ones. You can achieve this by collecting content in a StringBuilder and appending its contents to the output file every, say, 1000 processed rows. Like so,
$sb = New-Object Text.StringBuilder # New StringBuilder for buffering rows
$i = 1 # Row counter
while ($r.Peek() -ge 0) {
    # Add a formatted row into the buffer
    [void]$sb.Append($("{0},{1}{2}" -f $r.ReadLine(), $year, [Environment]::NewLine))
    if (++$i % 1000 -eq 0) { # When 1000 rows are buffered, dump the contents into the file
        Add-Content $(".\DR_" + $basename + ".CSV") $sb.ToString()
        $sb = New-Object Text.StringBuilder # Reset the StringBuilder
    }
}
# Don't miss the tail of the contents
Add-Content $(".\DR_" + $basename + ".CSV") $sb.ToString()
Don't go for .NET Framework static methods and string building when there are cmdlets that can do the work on objects. Collect your data, add the year column, then export to your new file. You're also doing a ton of file I/O, and that'll slow you down too.
This will probably require a little more memory, but it reads the whole file at once and writes the whole file at once. It also assumes that your CSV files have column headings. And it's much easier for someone else to look at and understand exactly what's going on (write your scripts so they can be read!).
# Always use full cmdlet names in scripts, not aliases
$files = Get-ChildItem *.csv;
foreach ($f in $files)
{
    # BaseName is a property of the file object in PowerShell; there's no need to call a static method
    $basename = $f.BaseName;
    $year = $f.BaseName.Substring(0,4)
    # Every time you use Write-Host, a puppy dies
    "Starting $Basename";
    # If you've got CSV data, treat it as CSV data. PowerShell can import it into a collection natively.
    $data = Import-Csv $f;
    $exportData = @();
    foreach ($row in $data) {
        # Add a year "property" to each row object
        $row | Add-Member -MemberType NoteProperty -Name "Year" -Value $year;
        # Export the modified row to the output file
        $row | Export-Csv -NoTypeInformation -Path $("r:\DR_" + $basename + ".CSV") -Append -NoClobber
    }
}