Merging multiple CSV files into one using PowerShell

Hello, I'm looking for a PowerShell script that merges all CSV files in a directory into one text file (.txt). All the CSV files have the same header, which is always stored in the first row of every file. So I need to take the header from the first file, but skip the first row in the rest of the files.
I was able to find a batch file that does exactly what I need, but I have more than 4000 CSV files in a single directory and it takes more than 45 minutes to do the job.
@echo off
ECHO Set working directory
cd /d %~dp0
ECHO Deleting existing combined file
del summary.txt
setlocal ENABLEDELAYEDEXPANSION
set cnt=1
for %%i in (*.csv) do (
    if !cnt!==1 (
        for /f "delims=" %%j in ('type "%%i"') do echo %%j >> summary.txt
    ) else (
        for /f "skip=1 delims=" %%j in ('type "%%i"') do echo %%j >> summary.txt
    )
    set /a cnt+=1
)
Any suggestions on how to create a PowerShell script that would be more efficient than this batch code?
Thank you.
John

If you're after a one-liner, you can pipe each CSV to Import-Csv and then immediately pipe that to Export-Csv. This will retain the initial header row and exclude the remaining files' header rows. It will also process each CSV one at a time rather than loading all of them into memory and then dumping them into your merged CSV.
Get-ChildItem -Filter *.csv | Select-Object -ExpandProperty FullName | Import-Csv | Export-Csv .\merged\merged.csv -NoTypeInformation -Append

This will append all the files together, reading them one at a time:
Get-ChildItem "YOUR_DIRECTORY\*.txt" | ForEach-Object {
    [System.IO.File]::AppendAllText("YOUR_DESTINATION_FILE",
        [System.IO.File]::ReadAllText($_.FullName))
} # Placed on separate lines for readability
This one will place a new line at the end of each file entry if you need it:
Get-ChildItem "YOUR_DIRECTORY\*.txt" | ForEach-Object {
    [System.IO.File]::AppendAllText("YOUR_DESTINATION_FILE",
        [System.IO.File]::ReadAllText($_.FullName) + [System.Environment]::NewLine)
}
Skipping the first line:
$getFirstLine = $true
Get-ChildItem "YOUR_DIRECTORY\*.txt" | ForEach-Object {
    $filePath = $_
    $lines = Get-Content $filePath
    $linesToWrite = switch ($getFirstLine) {
        $true  { $lines }
        $false { $lines | Select-Object -Skip 1 }
    }
    $getFirstLine = $false
    Add-Content "YOUR_DESTINATION_FILE" $linesToWrite
}

Try this; it worked for me:
Get-Content *.csv| Add-Content output.csv

This is pretty trivial in PowerShell.
$CSVFolder = 'C:\Path\to\your\files';
$OutputFile = 'C:\Path\to\output\file.txt';
$CSV = Get-ChildItem -Path $CSVFolder -Filter *.csv | ForEach-Object {
    Import-Csv -Path $_.FullName
}
$CSV | Export-Csv -Path $OutputFile -NoTypeInformation -Force;
The only drawback to this approach is that it parses every file. It also loads all the files into memory, so if we're talking about 4000 files that are 100 MB each, you'll obviously run into problems.
You might get better performance with System.IO.File and System.IO.StreamWriter.
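For illustration, here is a minimal sketch of that streaming approach (assuming all files share the same header row; the paths are placeholders):
$outFile = 'C:\Path\to\output\file.txt'
$writer = [System.IO.StreamWriter]::new($outFile)
$first = $true
foreach ($csv in Get-ChildItem -Path 'C:\Path\to\your\files' -Filter *.csv) {
    $reader = [System.IO.StreamReader]::new($csv.FullName)
    # The first line of every file is the (shared) header row.
    $header = $reader.ReadLine()
    if ($first) { $writer.WriteLine($header); $first = $false }
    while (-not $reader.EndOfStream) { $writer.WriteLine($reader.ReadLine()) }
    $reader.Close()
}
$writer.Close()
Because each file is streamed line by line, memory use stays flat no matter how many files are merged.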

Your Batch file is pretty inefficient! Try this one (you'll be surprised :)
@echo off
ECHO Set working directory
cd /d %~dp0
ECHO Deleting existing combined file
del summary.txt
setlocal
for %%i in (*.csv) do set /P "header=" < "%%i" & goto continue
:continue
(
    echo %header%
    for %%i in (*.csv) do (
        for /f "usebackq skip=1 delims=" %%j in ("%%i") do echo %%j
    )
) > summary.txt
How this is an improvement:
for /f ... in ('type "%%i"') requires loading and executing cmd.exe in order to run the type command, capturing its output in a temporary file, and then reading the data from it, and this is done for each input file. for /f ... in ("%%i") reads the data directly from the file.
The >> redirection opens the file, appends data at the end, and closes the file, and this is done for each output *line*. The > redirection keeps the file open the whole time.

If you need to scan the folder recursively, you can use the approach below:
Get-ChildItem -Recurse -Path .\data\*.csv | Get-Content | Add-Content output.csv
What this basically does is:
Get-ChildItem -Recurse -Path .\data\*.csv finds the requested files recursively
Get-Content gets the content of each file
Add-Content output.csv appends it to output.csv

Here is a version also using System.IO.File:
$result = "c:\temp\result.txt"
$csvs = get-childItem "c:\temp\*.csv"
#read and write CSV header
[System.IO.File]::WriteAllLines($result, [System.IO.File]::ReadAllLines($csvs[0].FullName)[0])
#read and append file contents minus header
foreach ($csv in $csvs) {
    $lines = [System.IO.File]::ReadAllLines($csv.FullName)
    [System.IO.File]::AppendAllText($result, ($lines[1..$lines.Length] | Out-String))
}

Get-ChildItem *.csv|select -First 1|Get-Content|select -First 1|Out-File -FilePath .\input.csv -Force #Get the header from one of the CSV Files, write it to input.csv
Get-ChildItem *.csv|foreach {Get-Content $_|select -Skip 1|Out-File -FilePath .\Input.csv -Append} #Get the content of each file, excluding the first line and append it to input.csv

stinkyfriend's helpful answer shows an elegant, PowerShell-idiomatic solution based on Import-Csv and Export-Csv.
Unfortunately, it is quite slow, because it involves an ultimately unnecessary round-trip conversion to and from objects.
Also, even though it shouldn't matter to a CSV parser, the specific format of the files can get altered in the process, because Export-Csv double-quotes all column values: invariably so in Windows PowerShell, and by default in PowerShell (Core) 7+, which now offers opt-in control via -UseQuotes and -QuoteFields.
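In PowerShell (Core) 7+, for instance, the quoting could be reduced like this (a sketch; the file names are placeholders):
Import-Csv .\input.csv | Export-Csv .\output.csv -UseQuotes AsNeeded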
When performance matters, a plain-text solution is required, which also avoids any inadvertent format alteration (just like the linked answer, it assumes that all input CSV files have the same column structure).
The following PSv5+ solution:
reads each input file's content into memory in full, as a single multi-line string, using Get-Content -Raw (which is much faster than the default line-by-line reading),
skips the header line for all but the first file with -replace '^.+\r?\n', using the regex-based -replace operator,
and saves the results to the target file with Set-Content -NoNewLine.
Character-encoding caveat:
PowerShell never preserves the input character encoding of files, so you may have to use the -Encoding parameter to override Set-Content's default encoding (the same applies to Export-Csv and any other file-writing cmdlets; in PowerShell (Core) 7+ all cmdlets now consistently default to BOM-less UTF-8; but not only do Windows PowerShell cmdlets not default to UTF-8, they use varying encodings - see the bottom section of this answer).
# Determine the output file and remove a preexisting one, if any.
$outFile = 'summary.csv'
if (Test-Path $outFile) { Remove-Item -ErrorAction Stop $outFile }
# Process all *.csv files in the current folder and merge their contents,
# skipping the header line for all but the first file.
$first = $true
Get-ChildItem -Filter *.csv |
  Get-Content -Raw |
  ForEach-Object {
    $content =
      if ($first) { # first file: output content as-is
        $_; $first = $false
      } else {      # subsequent file: skip the header line.
        $_ -replace '^.+\r?\n'
      }
    # Make sure that each file content ends in a newline
    if (-not $content.EndsWith("`n")) { $content += [Environment]::NewLine }
    $content # Output
  } |
  Set-Content -NoNewLine $outFile # add -Encoding as needed.

The modern PowerShell 7 answer:
(Assuming all CSV files are in the same directory and have the same number of fields.)
@(Get-ChildItem -Filter *.csv).FullName | Import-Csv | Export-Csv ./merged.csv -NoTypeInformation
The first part of the pipeline gets all the .csv files and extracts the FullName (path + filename + extension), then Import-Csv reads each one into objects, and Export-Csv merges those objects into a single CSV file with only one header.

I found the previous solutions quite inefficient for large CSV files in terms of performance, so here is a performant alternative that simply appends the files:
cmd /c copy ((gci "YOUR_DIRECTORY\*.csv" -Name) -join '+') "YOUR_OUTPUT_FILE.csv"
Thereafter, you probably want to get rid of the multiple CSV headers.
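One way to do that might be the following sketch, which assumes the header line appears verbatim at the start of every appended file and that no data line is identical to it:
# Keep the first line, then drop every later line that repeats the header.
$lines = Get-Content "YOUR_OUTPUT_FILE.csv"
$header = $lines[0]
@($header) + ($lines | Select-Object -Skip 1 | Where-Object { $_ -ne $header }) |
    Set-Content "YOUR_OUTPUT_FILE.csv"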

The following batch script is very fast. It should work well as long as none of your CSV files contain tab characters, and all source CSV files have fewer than 64k lines.
@echo off
set "skip="
>summary.txt (
    for %%F in (*.csv) do if defined skip (
        more +1 "%%F"
    ) else (
        more "%%F"
        set skip=1
    )
)
The reason for the restrictions is that MORE converts tabs into a series of spaces, and redirected MORE hangs at 64k lines.

#Input path
$InputFolder = "W:\My Documents\... input folder"
$FileType = "*.csv"
#Output path
$OutputFile = "W:\My Documents\... some folder\merged.csv"
#Read list of files
$AllFilesFullName = @(Get-ChildItem -LiteralPath $InputFolder -Filter $FileType | Select-Object -ExpandProperty FullName)
#Loop and write
Write-Host "Merging" $AllFilesFullName.Count $FileType "files."
foreach ($FileFullName in $AllFilesFullName) {
    Import-Csv $FileFullName | Export-Csv $OutputFile -NoTypeInformation -Append
    Write-Host "." -NoNewline
}
Write-Host
Write-Host "Merge Complete"

$pathin = 'c:\Folder\With\CSVs'
$pathout = 'c:\exported.txt'
$list = Get-ChildItem -Path $pathin | select FullName
foreach ($file in $list) {
    Import-Csv -Path $file.FullName | Export-Csv -Path $pathout -Append -NoTypeInformation
}

type *.csv >> folder\combined.csv

Related

How can I (efficiently) match content (lines) of many small files with content (lines) of a single large file and update/recreate them

I've tried solving the following case:
many small text files (in subfolders) need their content (lines) matched to lines that exist in another (large) text file. The small files then need to be updated or copied with those matching Lines.
I was able to come up with some running code for this, but I need to improve it or use a completely different method because it is extremely slow and would take >40h to get through all the files.
One idea I already had was to use a SQL Server to bulk-import all files in a single table with [relative path],[filename],[jap content] and the translation file in a table with [jap content],[eng content] and then join [jap content] and bulk-export the joined table as separate files using [relative path],[filename]. Unfortunately I got stuck right at the beginning due to formatting and encoding issues so I dropped it and started working on a PowerShell script.
Now in detail:
Over 40k txt files spread across multiple subfolders, with multiple lines each; every line can exist in multiple files.
Content:
UTF-8 encoded Japanese text that can also contain special characters like \\[*+(), each line ending with a tab character. They sound like CSV files, but they don't have headers.
One large file with >600k lines containing the translations for the small files. Every line is unique within this file.
Content:
Again UTF-8 encoded Japanese text. Each line is formatted like this (without brackets):
[Japanese Text][tabulator][English Text]
Example:
テスト[tabulator]Test
The end result should be a copy or an updated version of all these small files where their lines get replaced with the matching ones from the translation file, while maintaining their relative paths.
What I have at the moment:
$translationfile = 'B:\Translation.txt'
$inputpath = 'B:\Working'
$translationarray = [System.Collections.ArrayList]@()
$translationarray = @(Get-Content $translationfile -Encoding UTF8)
Get-ChildItem -Path $inputpath -Recurse -File -Filter *.txt | ForEach-Object -Parallel {
    $_.Name
    $filepath = ($_.Directory.FullName).Substring(2)
    $filearray = [System.Collections.ArrayList]@()
    $filearray = @(Get-Content -Path $_.FullName -Encoding UTF8)
    $filearray = $filearray | ForEach-Object {
        $result = $using:translationarray -match ("^$_" -replace '[[+*?()\\.]','\$&')
        if ($result) {
            $_ = $result
        }
        $_
    }
    If (!(Test-Path B:\output\$filepath)) { New-Item -ItemType Directory -Force -Path B:\output\$filepath }
    #$("B:\output\"+$filepath+"\")
    $filearray | Out-File -FilePath $("B:\output\" + $filepath + "\" + $_.Name) -Force -Encoding UTF8
} -ThrottleLimit 10
I would appreciate any help and ideas, but please keep in mind that I rarely write scripts, so anything too complex might fly right over my head.
Thanks
As zett42 states, using a hash table is your best option for mapping the Japanese-only phrases to the dual-language lines.
Additionally, use of .NET APIs for file I/O can speed up the operation noticeably.
# Be sure to specify all paths as full paths, not least because .NET's
# current directory usually differs from PowerShell's
$translationfile = 'B:\Translation.txt'
$inPath = 'B:\Working'
$outPath = (New-Item -Type Directory -Force 'B:\Output').FullName
# Build the hashtable mapping the Japanese phrases to the full lines.
# Note that ReadLines() defaults to UTF-8
$ht = @{ }
foreach ($line in [IO.File]::ReadLines($translationfile)) {
    $ht[$line.Split("`t")[0] + "`t"] = $line
}
Get-ChildItem $inPath -Recurse -File -Filter *.txt | ForEach-Object -Parallel {
    # Translate the lines to the matching lines including the $translation
    # via the hashtable.
    # NOTE: If an input line isn't represented as a key in the hashtable,
    # it is passed through as-is.
    $lines = foreach ($line in [IO.File]::ReadLines($_.FullName)) {
        ($using:ht)[$line] ?? $line
    }
    # Synthesize the output file path, ensuring that the target dir. exists.
    $outFilePath = (New-Item -Force -Type Directory ($using:outPath + $_.Directory.FullName.Substring(($using:inPath).Length))).FullName + '/' + $_.Name
    # Write to the output file.
    # Note: If you want UTF-8 files *with BOM*, use -Encoding utf8bom
    Set-Content -Encoding utf8 $outFilePath -Value $lines
} -ThrottleLimit 10
Note: Your use of ForEach-Object -Parallel implies that you're using PowerShell [Core] 7+, where BOM-less UTF-8 is the consistent default encoding (unlike in Windows PowerShell, where default encodings vary wildly).
Therefore, in lieu of the .NET [IO.File]::ReadLines() API in a foreach loop, you could also use the more PowerShell-idiomatic switch statement with the -File parameter for efficient line-by-line text-file processing.
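For illustration, the foreach loop in the script above could be swapped for something like this hypothetical variant (same hashtable lookup, same ?? fallback):
# switch -File reads the file line by line; $_ is the current line.
$lines = switch -File $_.FullName {
    default { ($using:ht)[$_] ?? $_ }
}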

For each Name from a CSV, write to each output file replacing the word 00000 with the name (for example, output 1 gets Henry, and so on)

I am trying to use every (Name) from the CSV to replace (00000) written in each file that I am exporting.
The file already contains the word 00000.
csv file contains:
FullName            Name   LastWriteTime
\\remotecomputer\   Henry  4/30/2020 3:44:57 PM
\\remotecompter\    Magy   12/7/2020 9:04:28 PM
first txt should look like this
@echo off
if /i "%UserName%" neq "Henry" exit
second txt should look like this
@echo off
if /i "%UserName%" neq "Magy" exit
original Config.txt looks like this
@echo off
if /i "%UserName%" neq "00000" exit
Code:
$source = Read-Host -Prompt 'Insert source path'
Import-Csv C:\$source-PREP.csv | ForEach-Object {$_.Name} |ForEach-Object {Get-Content c:\Config.txt | ForEach-Object {$_ -replace '00000',"$_.Name"} |Out-File c:\$_.txt -Force}
I think you want to do:
read a template file c:\Config.txt which contains "00000" as placeholder
read a csv file C:\Something-PREP.csv which contains among others a column called Name
use those names to replace the "00000" placeholders in the template
output new files C:\Henry.txt, C:\Magy.txt etcetera
If that assumption is correct, try
$template = Get-Content -Path 'C:\Config.txt'
(Import-Csv -Path 'C:\Whatever-PREP.csv').Name | ForEach-Object {
    # replace "00000" with the name (represented by $_) and use that name as the filename for the output
    $template -replace '"00000"', ('"{0}"' -f $_) | Set-Content -Path ('C:\{0}.txt' -f $_) -Force
}
P.S. Make sure the CSV you are importing uses the comma as delimiter character. If this is some other character, add -Delimiter '<yourCharacter>' to the cmdlet.
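For example, for a semicolon-separated file, the import would look like this (same hypothetical path as above):
Import-Csv -Path 'C:\Whatever-PREP.csv' -Delimiter ';'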

Powershell: Logging foreach changes

I have put together a script inspired by a number of sources. The purpose of the PowerShell script is to scan a directory for files (.SQL), copy all of them to a new directory (retaining the originals), scan each file against a list file (CSV format, containing two columns: OldValue,NewValue), and replace any strings that match. What works: moving, modifying, log creation.
What doesn't work:
Recording the changes made by the script in the .log.
Sample usage: .\ConvertSQL.ps1 -List .\EVar.csv -Files \SQLFiles\Rel_1
Param (
    [String]$List = "*.csv",
    [String]$Files = "*.sql"
)
function Get-TimeStamp {
    return "[{0:dd/MM/yyyy} {0:HH:mm:ss}]" -f (Get-Date)
}
$CustomFiles = "$Files\CUSTOMISED"
IF (-Not (Test-Path $CustomFiles))
{
    MD -Path $CustomFiles
}
Copy-Item "$Files\*.sql" -Recurse -Destination "$CustomFiles"
$ReplacementList = Import-Csv $List;
Get-ChildItem $CustomFiles |
ForEach-Object {
    $LogFile = "$CustomFiles\$_.$(Get-Date -Format dd_MM_yyyy).log"
    Write-Output "$_ has been modified on $(Get-TimeStamp)." | Out-File "$LogFile"
    $Content = Get-Content -Path $_.FullName;
    foreach ($ReplacementItem in $ReplacementList)
    {
        $Content = $Content.Replace($ReplacementItem.OldValue, $ReplacementItem.NewValue)
    }
    Set-Content -Path $_.FullName -Value $Content
}
Thank you very much.
Edit: I've cleaned up a bit and removed my test logging files.
Here's the snippet of code that I've been testing with little success. I put the following right under $Content = $Content.Replace($ReplacementItem.OldValue, $ReplacementItem.NewValue):
if ( $_.FullName -like '*TEST*' ) {
    "This is a test." | Add-Content $LogFile
}
I've also tried piping out the Set-Content using Out-File. The outputs I end up with are either a full copy of the contents of my CSV file or the SQL file itself. I'll continue reading up on different methods. I simply want, out of hundreds to a thousand or so lines, to be able to identify which variables in the SQL have been changed.
Instead of piping output to Add-Content, pipe the log output to Out-File -Append.
Edit: compare the content using the Compare-Object cmdlet and evaluate its output to identify where the content in each string object differs.
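A sketch of that idea, assuming the content is captured before and after the replacement loop (variable names follow the script above):
# Capture the original lines, apply the replacements, then log only the differences.
$Original = Get-Content -Path $_.FullName
$Content = $Original
foreach ($ReplacementItem in $ReplacementList)
{
    $Content = $Content.Replace($ReplacementItem.OldValue, $ReplacementItem.NewValue)
}
Compare-Object -ReferenceObject $Original -DifferenceObject $Content |
    ForEach-Object { "{0} {1}" -f $_.SideIndicator, $_.InputObject } |
    Out-File $LogFile -Append
Set-Content -Path $_.FullName -Value $Content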

Delete whole line in html file recursively with PowerShell

I'm trying to delete the "unwanted" class lines from an HTML file using a PowerShell script:
<a class="unwanted" href="http://www.mywebsite.com/rest/of/url1" target="_blank">my_file_name1</a><br>
<a class="mylink" href="http://www.mywebsite.com/rest/of/url2" target="_blank">my_file_name2</a><br>
<a class="unwanted" href="http://www.mywebsite.com/rest/of/url3" target="_blank">my_file_name3</a><br>
Currently I'm replacing strings using this script:
$s = "old string"
$r = "new string"
Get-ChildItem "C:\Users\User\Desktop\Folder" -Recurse -Filter *.html | % {
(Get-Content $_.FullName) `
| % { $_ -replace [regex]::Escape($s), $r } `
| Set-Content $_.FullName
}
Since you tagged your question also with cmd and batch-file, I want to contribute a related answer.
cmd.exe/batch scripting does not understand the HTML file format, but if your HTML file(s) look like the sample data you provided (the <a> tag and the corresponding </a> tag are on a single line, and there is nothing else on it other than <br>), the following command line could work for you, supposing the HTML file to process is called classes.html and the modified data is to be written to the file classes_new.html:
> "classes_new.html" findstr /V /I /L /C:"class=\"unwanted\"" "classes.html"
This only works if the string class="unwanted" occurs only in the <a> tags that need to be removed.
To process multiple files, the following batch script could be used, based on the above command line:
@echo off
setlocal EnableExtensions DisableDelayedExpansion
set "ARGS=%*"
setlocal EnableDelayedExpansion
for %%H in (!ARGS!) do (
    endlocal
    call :SUB "%%~H"
    setlocal
)
endlocal
endlocal
exit /B
:SUB file
if /I not "%~x1"==".html" if /I not "%~x1"==".htm" exit /B 1
findstr /V /I /L /C:"class=\"unwanted\"" "%~f1" | (> "%~f1" find /V "")
exit /B
The actual removal of lines is done in the sub-routine :SUB, unless the file name extension is something other than .html or .htm. The main script loops through all the given command line arguments and calls :SUB for every single file. Note that this script does not create new files for the modified HTML contents; it overwrites the given HTML files.
Removing lines is even easier than replacing them. When outputting to Set-Content, simply omit the lines that you want removed. You can do this with Where-Object in place of your Foreach.
Adapting your example:
$s = "unwanted regex"
Get-ChildItem "C:\Users\User\Desktop\Folder" -Recurse -Filter *.html | % {
(Get-Content $_.FullName) `
| where { $_ -notmatch $s } `
| Set-Content $_.FullName
}
If you want literal matching instead of regex, substitute the where clause
where { -not $_.Contains($s) } `
Note this is using the .NET function [String]::Contains(), and not the PowerShell operator -contains, as the latter doesn't work on strings.
Try using multiline strings for your $s and $r. I tested with the HTML examples you posted as well and that worked fine.
$s = #"
old string
"#
$r = #"
new string
"#
Get-ChildItem "C:\Users\User\Desktop\Folder" -Recurse -Filter *.html | % {
(Get-Content $_.FullName) `
| % { $_ -replace $s, $r } `
| Set-Content $_.FullName
}

Powershell Strip all directory paths from file

OK, what I am trying to do is make a script that will read each line of a text file, "directory.txt", and export every line that points to a file, not to a directory. Example below.
I'm just trying to remove the paths to directories like "C:\users\"
and keep any path that is to a file like "C:\users\file.txt"
In the test file, "directory.txt", there will be the following:
C:\path\path\folder\
C:\path\path\file.ext
C:\path\path\path\path\folder
The script will need to read the text file above and export the following line to a new text file.
C:\path\path\file.ext
The batch script equivalent would be the following:
@ECHO OFF
FOR /F %%A IN (directory.txt) DO CALL:NoDir "%%A"
pause
EXIT /B
:NoDir
IF "%~x1" NEQ "" ECHO %~1>>nodir.txt
EXIT /B
A batch script can't handle a 400 MB file, so I need to use PowerShell to do it o.0
FTR: The condition if "%~x1" neq "" does not do what you seem to expect. It will match not only folders, but also files without an extension.
Anyway, in PowerShell you'd probably do something like this to list only items that are not directories:
Get-Content \PATH\TO\directory.txt `
    | Get-Item `
    | Where-Object { -not $_.PSIsContainer } `
    | Select-Object FullName
I'm not sure I understand the question (the batch example shows all paths); maybe this is what you're looking for:
Get-Content directory.txt | Where-Object {$_ -eq 'C:\path\path\file.ext'}
You didn't say whether you're concerned about the paths being valid or not. The following outputs to the host the lines that are either valid files (using the -PathType Leaf parameter of the Test-Path cmdlet), or where the last 4 characters of the last item in the path are a dot followed by 3 letters.
$Lines = Get-Content C:\Path\to\file.txt
foreach ( $line in $Lines )
{
    if ( (Test-Path $line -PathType Leaf) -OR ($line -match "\.\w{3}$") )
    {
        Write-Host $line
    }
}
If you find your file extensions are longer than 3 letters, you can change the regex appropriately ("\.\w{3,4}$" for 4-character extensions, or "\.[\w\d]{3,4}$" to match extensions that are 3 or 4 characters long and might include numbers).
And the one-liner:
Get-Content C:\Path\to\file.txt | % { if ((Test-Path $_ -PathType Leaf) -OR ($_ -match "\.\w{3}$")) { $_ } }