Need to batch convert a large quantity of text files from ANSI to Unicode

I have a lot of ANSI text files that vary in size (from a few KB up to 1GB+) that I need to convert to Unicode.
At the moment, this has been done by loading the files into Notepad and then doing "Save As..." and selecting Unicode as the Encoding. Obviously this is very time consuming!
I'm looking for a way to convert all the files in one hit (in Windows). The files are in a directory structure so it would need to be able to traverse the full folder structure and convert all the files within it.
I've tried a few options but so far nothing has really ticked all the boxes:
The ansi2unicode command-line utility. This has been the closest to what I'm after, as it processes files recursively in a folder structure... but it keeps crashing partway through, before it's finished converting.
The CpConverter GUI utility. Works OK up to a point but struggles with files spread across a folder structure - it only seems to be able to handle files in one folder.
There's a DOS command that works OK on smaller files but doesn't seem to be able to cope with large files.
I tried the GnuWin sed utility, but it crashes every time I try to install it.
So I'm still looking! If anyone has any recommendations I'd be really grateful.
Thanks...

OK, so in case anyone else is interested, I found a way to do this using PowerShell:
Get-ChildItem "c:\some path\" -Filter *.csv -recurse |
Foreach-Object {
Write-Host (Get-Date).ToString() $_.FullName
Get-Content $_.FullName | Set-Content -Encoding unicode ($_.FullName + '_unicode.csv')
}
This recurses through the entire folder structure and converts all CSV files to Unicode; the converted files are written to the same locations as the originals, with "_unicode.csv" appended to the original filename. You can change the value of the -Encoding parameter if you want to convert to something different (e.g. UTF-8).
It also outputs a list of all the files converted, with a timestamp against each.
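For the very large files mentioned in the question, a slightly adapted version of the same command may help: adding -ReadCount to Get-Content pushes lines through the pipeline in batches instead of one at a time. This is only a sketch; the batch size of 5000 is an arbitrary assumption you can tune.
# Sketch: same conversion, but -ReadCount batches lines through the pipeline.
# The batch size (5000) is an assumption; adjust to taste.
Get-ChildItem "c:\some path\" -Filter *.csv -Recurse |
    ForEach-Object {
        Write-Host (Get-Date).ToString() $_.FullName
        Get-Content $_.FullName -ReadCount 5000 |
            Set-Content -Encoding Unicode ($_.FullName + '_unicode.csv')
    }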

Related

PowerShell `Select-String` selecting some but not others

I have a folder with a list of Excel files in .xls and .xlsx formats, and I have verified that the strings exist in some of the files. When I run the code below, I can find some string patterns but not others. For example, I can find 'randomword' - but not '5625555555' or 'P-888452'. When I run -NotMatch on '5625555555' or 'P-888452' I do get a list of file names that do not match (although they are returned duplicated in many rows), so I know the pattern is registering. What could be happening here? Why does it play nice with some strings (mostly letters, it seems) but not others (those that contain digits)?
gci "path" -Filter "*.xls" -Recurse -File | Select-String '\bANYTEXTORINT\b' | Select FileName
I also do not get an error when I run the code - it just completes with no results. I do get results for 'randomword', though: three files that contain that pattern are returned.
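(A hedged aside on what may be happening: Select-String scans the raw bytes of the files, and .xlsx workbooks are zipped XML while .xls is a binary format, so cell values - numbers in particular - are generally not stored as plain searchable text; purely alphabetic strings can sometimes surface in the raw data, which would explain the asymmetry. One way to search the actual cell values is to import the workbooks, for example with the community ImportExcel module. The sketch below assumes that module is installed; note Import-Excel reads .xlsx only, not legacy .xls.)
# Sketch assuming the ImportExcel module (Install-Module ImportExcel); .xlsx only.
$pattern = '5625555555'
Get-ChildItem "path" -Filter *.xlsx -Recurse -File | ForEach-Object {
    $file = $_
    # Each imported row is an object; check every column value against the pattern.
    Import-Excel -Path $file.FullName | ForEach-Object {
        foreach ($prop in $_.PSObject.Properties) {
            if ("$($prop.Value)" -match $pattern) {
                [pscustomobject]@{ FileName = $file.Name; Match = $prop.Value }
            }
        }
    }
}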

moving files into matching directories

Looking for some help please - I have no experience in writing code, so I've been looking for a question/answer that comes close, but no luck so far.
My huge movie database lives on NAS drive "Video Y", each movie in its own subdirectory; it has multiple video file types, most being .avi, and I wanted to convert all the .avi files to .mp4 (some devices will not play .avi).
So I filtered out all the .avi files and put them in one new directory, "0 temp holder for avi", so I could use VideoProc to convert them; this converted and placed the .mp4 files in another new directory, "00 temp holder MP4".
Now I want to move the .mp4 files back into their original subdirectories, which still contain various files related to each movie (.srt etc.).
I think the simplest way for me is lining up the files in alphabetical order and the directories in the same order (as directory names and file names are not necessarily exactly the same), checking for mismatches and correcting as needed, and then using some code to move the first file to the first directory and iterate from there. But I'm still stumped and not sure how to go about it.
I've put this under the Windows 10 and PowerShell tags, but someone may be able to assist with more accurate tags please.
Directory layout
At the beginning of the PowerShell script, shown below, the $arrVideoFolders variable loads all the folder names into an array, and the $arrFolder variable holds the filenames of all your video files in a separate array.
$arrVideoFolders = Get-ChildItem '<folder containing the movie subdirectories>' |
    Where-Object {$_.PSIsContainer} |
    ForEach-Object {$_.Name}

$arrFolder = Get-ChildItem '<folder containing the converted video files>' |
    Where-Object {-not $_.PSIsContainer} |
    ForEach-Object {$_.Name}
To give the logic of how I would write the rest of the PowerShell script: a foreach loop would go through all the folders, and for each folder a second, nested foreach loop would loop through your video files. An if statement would then test whether the video name is like the folder name - this works because your video file names are similar to the folder names. Once a match is found you can copy or move the file; a sketch of that loop is shown below.
For testing, maybe use spare folders and simple empty text files instead of copying large files, and test in different folders first.
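A minimal sketch of that matching logic, assuming placeholder paths you would swap for your real NAS folders; the -like comparison is an assumption about how the names overlap, and the -WhatIf switch previews the moves so nothing is touched until the matches look right:
# Sketch only: paths below are placeholders, adjust to your NAS layout.
$movieRoot = '<folder containing the movie subdirectories>'
$mp4Source = '<folder containing the converted .mp4 files>'

$folders = Get-ChildItem $movieRoot -Directory
$videos  = Get-ChildItem $mp4Source -Filter *.mp4 -File

foreach ($folder in $folders) {
    foreach ($video in $videos) {
        # Treat a video as belonging to a folder when its base name contains the folder name.
        if ($video.BaseName -like "*$($folder.Name)*") {
            # Remove -WhatIf once the matches look right.
            Move-Item -Path $video.FullName -Destination $folder.FullName -WhatIf
        }
    }
}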

Powershell: Go through all files (PDFs) in a directory and move them based on what's written in the first 6 bytes

I am currently trying to write a PowerShell script that does the following:
Go through all PDF files in the directory the script is in
Check the first few bytes of those PDF files
If those bytes say something along the lines of "PK", move the file to a different location
If the bytes say something else (e.g. PDF1.4), don't move it at all and go on to the next one.
Context: We have around 70k PDF files that can't be opened. After checking them with a certain tool, it looks like around 99% of those are damaged and the remaining 1% are zip files.
The first bytes of a zipped PDF file start with "PK"; the first bytes of a broken PDF file start with PDF1.4, for example.
I need to unzip all the zip files and relocate them. Going through 70k PDF files by hand is kind of painful, so I'm looking for a way to automate it.
I know I'm supposed to provide a code sample, but the truth is that I am absolutely lost. I have written a few PowerShell scripts before, but I have no idea how to do something like this.
So, if anyone could kindly point me in the right direction or give me a useful function, I would really appreciate it a lot.
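(As a hedged aside, one minimal way to sketch the byte check described above: read only the first two bytes of each file and test for the ZIP signature. It assumes the script sits alongside the PDFs, so $PSScriptRoot points at them, and $zipTarget is a placeholder destination.)
# Sketch only: read the first two bytes of each PDF and move ZIP-signature files.
$zipTarget = '<folder to move the zipped PDFs to>'   # placeholder path

Get-ChildItem -Path $PSScriptRoot -Filter *.pdf -File | ForEach-Object {
    # Read just the first bytes instead of loading the whole (possibly large) file.
    $stream = [System.IO.File]::OpenRead($_.FullName)
    try {
        $buffer = New-Object byte[] 2
        $null = $stream.Read($buffer, 0, 2)
    }
    finally {
        $stream.Dispose()
    }

    # 0x50 0x4B is the ASCII signature "PK" that ZIP archives start with.
    if ($buffer[0] -eq 0x50 -and $buffer[1] -eq 0x4B) {
        Move-Item -Path $_.FullName -Destination $zipTarget
    }
}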
You can use Get-Content to get your first 6 bytes as you asked.
We can then tie that into a loop on all the documents and configure a simple if statement to decide what to do next, e.g. move the file to another dir
EDITED BASED ON YOUR COMMENT:
$pdfDirectory = 'C:\Temp\struktur_id_1225\ext_dok'
$newLocation = 'C:\Path\To\New\Folder'

Get-ChildItem "$pdfDirectory" -Filter "*.pdf" | ForEach-Object {
    if ((Get-Content $_.FullName | Select-Object -First 1) -like "%PDF-1.5*") {
        $HL7 = $_.FullName.Replace("ext_dok", "MDM")
        $HL7 = $HL7.Replace(".pdf", ".hl7")
        Move-Item $_.FullName $newLocation
        Move-Item $HL7 $newLocation
    }
}
Try using the above, which is also a bit easier to edit.
$pdfDirectory will need to be set to the folder containing the PDF Files
$newLocation will obviously be the new directory!
And you will still need to change the -like "%PDF-1.5*" to suit your search!
It should do the rest for you, give it a shot
Another Edit
I have mimicked your folder structure on my computer, and placed a few PDF files and matching HL7 files and the script is working perfectly.
Get-Content is not suited to PDFs; you'd want to use iTextSharp to read them.
Download iTextSharp (found in its releases) and put itextsharp.dll somewhere easy to find (e.g. the folder your script is located in).
You can install the .nupkg by using Install-Package, or simply use an archive tool to extract the contents of the .nupkg file (it's basically a .zip file).
The code below adds every word on page 1 of each PDF, split on whitespace, to an array. You can then test whether the array contains your keyword.
Add-Type -Path "C:\path\to\itextsharp.dll"

$pdfs = Get-ChildItem "C:\path\to\pdfs" -Filter *.pdf
foreach ($pdf in $pdfs) {
    $reader = New-Object iTextSharp.text.pdf.PdfReader -ArgumentList $pdf.FullName
    # Extract the text of page 1 and split it on whitespace into individual words.
    $text = [iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($reader, 1).Split()
    foreach ($word in $text) {
        # do your test here
    }
}

How can I replace every occurrence of a String in a file with PowerShell?

This question is similar to the earlier question "How can I replace every occurrence of a String in a file with PowerShell?", except my challenge is to replace the text in multiple files. I tried using the solution from the earlier question with a command similar to the one below.
(Get-Content .\*.txt).replace("old text", "new text") | Set-Content .\*.txt
It seemed to work, but each file's size increased drastically - roughly to the combined size of all the files in the directory - although when I open any file it looks normal.
Does anyone have ideas on how to fix it? My litmus test is that if I revert my text changes, the file sizes shouldn't change at all.
You must process the files one at a time:
Get-Item *.txt |
    ForEach-Object {
        $f = $_.FullName
        (Get-Content $f).Replace("old text", "new text") | Set-Content $f
    }
Note that this will fail with completely empty (zero-byte) files.
Also, irrespective of what the encoding of the input files was, the output files will have Default encoding, according to the system's legacy code page (typically, a single-byte, extended-ASCII encoding).
As for what you tried:
(Get-Content .\*.txt) sends the lines from all *.txt files as a single array of lines through the pipeline.
Set-Content .\*.txt then sends that one array (with the replacements made) as a whole to every *.txt file in the current directory.
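If the original encoding matters, a hedged variant of the same per-file loop pins the output encoding explicitly; UTF-8 here is only an example, so substitute whatever encoding your files should end up with:
# Sketch: same per-file replacement, but with an explicit output encoding (UTF8 is an assumption).
Get-ChildItem *.txt -File | ForEach-Object {
    $f = $_.FullName
    (Get-Content $f) -replace 'old text', 'new text' | Set-Content $f -Encoding UTF8
}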

Powershell script write back to sources from drag and drop

I need to create a powershell script that removes quotes from CSV files in a user friendly drag and drop way. I have the basics of the script down courtesy of this page:
http://blogs.technet.com/b/heyscriptingguy/archive/2011/11/02/remove-unwanted-quotation-marks-from-csv-files-by-using-powershell.aspx
And I've already successfully made .ps1 files drag-and-droppable courtesy of this Stack Overflow question:
Drag and Drop to a Powershell script
The author of the answer implies that it's just as easy to drop a single file, many files, or folders with lots of files in them. However, I have yet to figure this out in a way that can also write back to the source file. Here's my current code:
Param([string[]]$file)
(gc $file) | % {$_ -replace '"', ""} | out-file C:\Users\pfoster\Desktop\Output\test.txt -Fo -En ascii
Currently, this will only accept a single file, and it outputs the result as a .txt to a specified file regardless of the source file type (I can change that to CSV easily, but I'd like the script to mirror the source). Ideally, I'd like it to accept files and folders, and to rewrite the source file. I have a feeling this would involve Get-ChildItem, but I'm not sure how to implement that in the current scenario. I've also tried out-file $file and that didn't work either.
Thanks for the help!
For writing the modified content back to the original files try something like this:
foreach ($file in $args) {
    (Get-Content $file) -replace '"', '' | Out-File $file -Encoding ASCII -Force
}
Use a foreach loop, because you need the file name in more than one place in the pipeline. Reading the content inside the parentheses first and then piping the modified content into the Out-File cmdlet makes sure that the output file is only written after the content has already been read.
Don't use a redirection operator ((Get-Content $file) >$file), because that would first open the file for writing (effectively truncating it) and afterwards read the content from the now empty file.
Beware that this approach may cause problems with large files, because each file is read completely into RAM before it is processed and written back to disk. If a file doesn't fit into the available RAM the computer will start swapping, causing significant performance degradation.
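Since the question also asks about dropping folders, here is a hedged sketch that expands each dropped argument - recursing into directories for CSV files - before stripping the quotes in place; the *.csv filter and the ASCII encoding are assumptions carried over from the code above:
# Sketch: handle dropped files and folders alike; directories are expanded to their CSV files.
foreach ($item in $args) {
    $files = if (Test-Path $item -PathType Container) {
        Get-ChildItem $item -Filter *.csv -Recurse -File
    } else {
        Get-Item $item
    }

    foreach ($file in $files) {
        (Get-Content $file.FullName) -replace '"', '' |
            Out-File $file.FullName -Encoding ASCII -Force
    }
}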