Is there a "split" equivalent in PowerShell?

I am looking for a PowerShell equivalent of the *NIX "split" command, as described here: http://www.computerhope.com/unix/usplit.htm
split outputs fixed-size pieces of input INPUT to files named
PREFIXaa, PREFIXab, ...
This is NOT about .Split() for strings. The goal is to take a LARGE array from the pipeline and store it in X files, each with the same number of lines.
In my use case, the content being piped is a list of over 1 million files...
Get-ChildItem $rootPath -Recurse | select -ExpandProperty FullName | foreach{ $_.Trim()} | {...means of splitting file here...}

I don't think a cmdlet exists that does exactly what you want, but you can quickly build a function that does.
This is something of a duplicate of How can I split a text file using PowerShell?, and you will find more script solutions if you google "powershell split a text file into smaller files".
Here is a piece of code to begin with. My advice is to use the .NET class System.IO.StreamReader to handle big files more efficiently.
$sourcefilename = "D:\temp\theFiletosplit.txt"
$desFolderPathSplitFile = "D:\temp\TFTS"
$maxsize = 2 # The number of lines per file
$filenumber = 0
$linecount = 0
$reader = new-object System.IO.StreamReader($sourcefilename)
while(($line = $reader.ReadLine()) -ne $null)
{
Add-Content "$desFolderPathSplitFile$filenumber.txt" $line
$linecount ++
If ($linecount -eq $maxsize)
{
$filenumber++
$linecount = 0
}
}
$reader.Close()
$reader.Dispose()
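Since the question is about splitting pipeline input rather than an existing file, the same idea can be adapted to a pipeline-aware function. The following is only a minimal sketch, not a built-in cmdlet: the name Split-PipelineToFile and its -Prefix and -LinesPerFile parameters are made up for illustration, and a numeric suffix (000, 001, ...) stands in for split's aa, ab, ... naming. It assumes -Prefix is a full path, because .NET resolves relative paths against the process working directory rather than the current PowerShell location.

```powershell
function Split-PipelineToFile {
    [CmdletBinding()]
    param(
        [Parameter(Mandatory=$true, ValueFromPipeline=$true)]
        [AllowEmptyString()]
        [string]$InputLine,

        # Hypothetical parameter: full path prefix, e.g. D:\temp\part_
        [Parameter(Mandatory=$true)]
        [string]$Prefix,

        # Hypothetical parameter: number of lines written to each file
        [int]$LinesPerFile = 1000
    )
    begin {
        $writer = $null
        $lineCount = 0
        $fileNumber = 0
    }
    process {
        # Open the next output file lazily, so no empty trailing file is created.
        if ($null -eq $writer) {
            $path = '{0}{1:000}.txt' -f $Prefix, $fileNumber
            $writer = New-Object System.IO.StreamWriter $path
        }
        $writer.WriteLine($InputLine)
        $lineCount++
        # Close the current file once it holds the requested number of lines.
        if ($lineCount -ge $LinesPerFile) {
            $writer.Dispose()
            $writer = $null
            $lineCount = 0
            $fileNumber++
        }
    }
    end {
        # Close the last, possibly partial, file.
        if ($null -ne $writer) { $writer.Dispose() }
    }
}
```

With the pipeline from the question, a call might look like `... | foreach { $_.Trim() } | Split-PipelineToFile -Prefix 'D:\temp\files_' -LinesPerFile 100000` (the prefix is a placeholder). Keeping one StreamWriter open per output file avoids reopening the file for every line, which is what Add-Content does.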

Related

Compare the contents of two files and output the differences in contents along with line numbers

I ran into a problem where we need to compare the contents of two files, a.txt and b.txt, line by line and output any differences found along with the content and line number.
We should not use Compare-Object in this scenario. Is there an alternative?
I tried using for loops but was unable to get the desired result.
For example, a.txt:
Hello = "Required"
World = 5678
Environment = "new"
Available = 9080.90
b.txt:
Hello = "Required"
World = 5678.908
Environment = "old"
Available = 6780.90
I need to get the output as:
Line number 2:World is not matching
Line number 3:Environment is not matching
Line number 4:Available is not matching
I tried the following code snippet but was unsuccessful:
$file1 = Get-Content "C:\Users\Desktop\a.txt"
$file2 = Get-Content "C:\Users\Desktop\b.txt"
$result = "C:\Users\Desktop\result.txt"
$file1 | foreach {
$match = $file2 -match $_
if ( $match ){
$match | Out-File -Force $result -Append
}
}
As you seem to have an adverse reaction to Compare-Object, let's try this extremely janky setup. As you have listed little to no requirements, this gives you the bare minimum to meet your condition of 'any difference found'.
Copy and paste more If statements if you have more lines.
$a = get-content C:\a.txt
$b = get-content C:\b.txt
If($a[0] -ne $b[0]) {
"Line number 1:Hello is not matching" | Out-Host
}
If($a[1] -ne $b[1]) {
"Line number 2:World is not matching" | Out-Host
}
If($a[2] -ne $b[2]) {
"Line number 3:Environment is not matching" | Out-Host
}
If($a[3] -ne $b[3]) {
"Line number 4:Available is not matching" | Out-Host
}
Get-Content returns the file content as an array of strings with a zero-based index.
The array variable has an automatic property, .Count/.Length, that
you can use to iterate over the arrays with a simple counting for loop.
You need to split each line at the = to separate name and content.
Use the -f format operator to output the results.
## Q:\Test\2019\05\21\SO_56231110.ps1
$Desktop = [environment]::GetFolderPath('Desktop')
$File1 = Get-Content (Join-Path $Desktop "a.txt")
$File2 = Get-Content (Join-Path $Desktop "b.txt")
for ($i=0;$i -lt $File1.Count;$i++){
if($File1[$i] -ne $File2[$i]){
"Line number {0}:{1} is not matching" -f ($i+1),($File1[$i] -split ' = ')[0]
}
}
Sample output:
Line number 2:World is not matching
Line number 3:Environment is not matching
Line number 4:Available is not matching
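The loop above assumes both files have the same number of lines. As a hedged sketch of the same idea that also reports lines present in only one file, the comparison can be wrapped in a small function. The name Compare-FileLines and its -Left/-Right parameters are invented for this example; it takes the two line arrays (for instance from Get-Content) rather than file paths.

```powershell
function Compare-FileLines {
    param(
        [string[]]$Left,    # lines of the first file, e.g. from Get-Content
        [string[]]$Right    # lines of the second file
    )
    # Walk up to the length of the longer array so extra lines are reported too.
    $max = [Math]::Max($Left.Count, $Right.Count)
    for ($i = 0; $i -lt $max; $i++) {
        $l = if ($i -lt $Left.Count)  { $Left[$i] }  else { $null }
        $r = if ($i -lt $Right.Count) { $Right[$i] } else { $null }
        if ($l -ne $r) {
            # Take the name from whichever side has the line.
            $source = if ($null -ne $l) { $l } else { $r }
            $name = ($source -split '\s*=\s*', 2)[0]
            'Line number {0}:{1} is not matching' -f ($i + 1), $name
        }
    }
}
```

A call such as `Compare-FileLines -Left (Get-Content a.txt) -Right (Get-Content b.txt)` should produce the same three lines as the sample output for the files in the question. Note that -ne compares strings case-insensitively; -cne would make the comparison case-sensitive.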

Powershell to Break up CSV by Number of Row

So I am now tasked with handling constant reports that are more than 1 million lines long.
My last question did not explain everything, so I'm trying to ask a better question.
I'm getting a dozen or more daily reports that come in as CSV files. I don't know what the headers are or anything like that when I get them.
They are huge. I can't open them in Excel.
I basically want to break each one up into the same report, just with each piece maybe 100,000 lines long.
The code I wrote below does not work, as I keep getting:
Exception of type 'System.OutOfMemoryException' was thrown.
I am guessing I need a better way to do this.
I just need this file broken down to a more manageable size.
It does not matter how long it takes as I can run it over night.
I found this on the internet and tried to adapt it, but I can't get it to work.
$PSScriptRoot
write-host $PSScriptRoot
$loc = $PSScriptRoot
$location = $loc
# how many rows per CSV?
$rowsMax = 10000;
# Get all CSV under current folder
$allCSVs = Get-ChildItem "$location\Split.csv"
# Read and split all of them
$allCSVs | ForEach-Object {
Write-Host $_.Name;
$content = Import-Csv "$location\Split.csv"
$insertLocation = ($_.Name.Length - 4);
for($i=1; $i -le $content.length ;$i+=$rowsMax){
$newName = $_.Name.Insert($insertLocation, "splitted_"+$i)
$content|select -first $i|select -last $rowsMax | convertto-csv -NoTypeInformation | % { $_ -replace '"', ""} | out-file $location\$newName -fo -en ascii
}
}
The key is not to read large files into memory in full, which is what you're doing by capturing the output from Import-Csv in a variable ($content = Import-Csv "$location\Split.csv").
That said, while using a single pipeline would solve your memory problem, performance will likely be poor, because you're converting from and back to CSV, which incurs a lot of overhead.
Even reading and writing the files as text with Get-Content and Set-Content is slow, however.
Therefore, I suggest a .NET-based approach for processing the files as text, which should substantially speed up processing.
The following code demonstrates this technique:
Get-ChildItem $PSScriptRoot/*.csv | ForEach-Object {
$csvFile = $_.FullName
# Construct a file-path template for the sequentially numbered chunk
# files; e.g., "...\file_split_001.csv"
$csvFileChunkTemplate = $csvFile -replace '(.+)\.(.+)', '$1_split_{0:000}.$2'
# Set how many lines make up a chunk.
$chunkLineCount = 10000
# Read the file lazily and save every chunk of $chunkLineCount
# lines to a new file.
$i = 0; $chunkNdx = 0
foreach ($line in [IO.File]::ReadLines($csvFile)) {
if ($i -eq 0) { ++$i; $header = $line; continue } # Save header line.
if ($i++ % $chunkLineCount -eq 1) { # Create new chunk file.
# Close previous file, if any.
if (++$chunkNdx -gt 1) { $fileWriter.Dispose() }
# Construct the file path for the next chunk, by
# instantiating the template with the next sequence number.
$csvFileChunk = $csvFileChunkTemplate -f $chunkNdx
Write-Verbose "Creating chunk: $csvFileChunk"
# Create the next chunk file and write the header.
$fileWriter = [IO.File]::CreateText($csvFileChunk)
$fileWriter.WriteLine($header)
}
# Write a data row to the current chunk file.
$fileWriter.WriteLine($line)
}
$fileWriter.Dispose() # Close the last file.
}
Note that the above code creates BOM-less UTF-8 files; if your input contains ASCII-range characters only, these files will effectively be ASCII files.
Here's the equivalent single-pipeline solution, which is likely to be substantially slower.
Get-ChildItem $PSScriptRoot/*.csv | ForEach-Object {
$csvFile = $_.FullName
# Construct a file-path template for the sequentially numbered chunk
# files; e.g., ".../file_split_001.csv"
$csvFileChunkTemplate = $csvFile -replace '(.+)\.(.+)', '$1_split_{0:000}.$2'
# Set how many lines make up a chunk.
$chunkLineCount = 10000
$i = 0; $chunkNdx = 0
Get-Content -LiteralPath $csvFile | ForEach-Object {
if ($i -eq 0) { ++$i; $header = $_; return } # Save header line.
if ($i++ % $chunkLineCount -eq 1) { # Create new chunk file.
# Construct the file path for the next chunk.
$csvFileChunk = $csvFileChunkTemplate -f ++$chunkNdx
Write-Verbose "Creating chunk: $csvFileChunk"
# Create the next chunk file and write the header.
Set-Content -Encoding ASCII -LiteralPath $csvFileChunk -Value $header
}
# Write data row to the current chunk file.
Add-Content -Encoding ASCII -LiteralPath $csvFileChunk -Value $_
}
}
Another option, from the Linux world, is the split command. To get it on Windows, just install Git Bash; you'll then be able to use many Linux tools from CMD/PowerShell.
Below is the syntax to achieve your goal:
split -l 100000 --numeric-suffixes --suffix-length 3 --additional-suffix=.csv sourceFile.csv outputfile
It's very fast. If you want, you can wrap split.exe as a cmdlet.

Get all lines containing a string in a huge text file - as fast as possible?

In PowerShell, how can I read a huge text file (about 200,000 lines / 30 MB) and get the last line (or all the lines) containing a specific string as fast as possible?
I'm using:
get-content myfile.txt | select-string -pattern "my_string" -encoding ASCII | select -last 1
But it takes a very long time (about 16-18 seconds).
I ran tests without the last pipe, "select -last 1", but it takes the same time.
Is there a faster way to get the last occurrence (or all occurrences) of a specific string in a huge file?
Perhaps that's just the time it needs...
Or is there any possibility of reading the file faster from the end, since I want the last occurrence?
Thanks
Try this:
get-content myfile.txt -ReadCount 1000 |
foreach { $_ -match "my_string" }
That will read your file in chunks of 1000 records at a time, and find the matches in each chunk. This gives you better performance because you aren't wasting a lot of cpu time on memory management, since there's only 1000 lines at a time in the pipeline.
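Because the question asks for the last occurrence, the chunked read can be combined with a variable that only remembers the most recent hit. This is a rough sketch under stated assumptions: Get-LastMatch and its parameters are hypothetical names, the pattern is treated as a regular expression (as -match does), and no timing claims are made for it.

```powershell
function Get-LastMatch {
    param(
        [Parameter(Mandatory=$true)] [string]$Path,
        [Parameter(Mandatory=$true)] [string]$Pattern,
        [int]$ReadCount = 1000    # lines per chunk sent down the pipeline
    )
    $last = $null
    Get-Content -LiteralPath $Path -ReadCount $ReadCount | ForEach-Object {
        # Applied to an array, -match returns the matching elements.
        # @($_) guards against a chunk arriving as a single string.
        $hits = @(@($_) -match $Pattern)
        if ($hits.Count -gt 0) { $last = $hits[-1] }
    }
    # Emit only the final match, or nothing if there was none.
    $last
}
```

For example, `Get-LastMatch -Path myfile.txt -Pattern 'my_string'` would return the last matching line. Only one chunk of lines plus a single remembered line is held at a time, so memory use stays small however many lines match.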
Have you tried:
gc myfile.txt | % { if($_ -match "my_string") {write-host $_}}
Or, you can create a "grep"-like function:
function grep($f,$s) {
gc $f | % {if($_ -match $s){write-host $_}}
}
Then you can just issue: grep myfile.txt "my_string"
$reader = New-Object System.IO.StreamReader("myfile.txt")
$lines = @()
if ($reader -ne $null) {
while (!$reader.EndOfStream) {
$line = $reader.ReadLine()
if ($line.Contains("my_string")) {
$lines += $line
}
}
}
$lines | Select-Object -Last 1
Have you tried using [System.IO.File]::ReadAllLines()? This method is more "raw" than the PowerShell-esque approach, since we're plugging directly into the Microsoft .NET Framework types.
$Lines = [System.IO.File]::ReadAllLines((Resolve-Path 'myfile.txt').ProviderPath)
$Lines | Where-Object { [Regex]::IsMatch($_, 'my_string_pattern') }
I wanted to extract the lines that contained "failed" and also write these lines to a new file, so here is the full command for that:
get-content log.txt -ReadCount 1000 |
foreach { $_ -match "failed" } | Out-File C:\failes.txt

Remove Top Line of Text File with PowerShell

I am trying to just remove the first line of about 5000 text files before importing them.
I am still very new to PowerShell so not sure what to search for or how to approach this. My current concept using pseudo-code:
set-content file (get-content unless line contains amount)
However, I can't seem to figure out how to do something like contains.
While I really admire the answer from @hoge, both for a very concise technique and for a wrapper function to generalize it, and I encourage upvotes for it, I am compelled to comment on the other two answers that use temp files (it gnaws at me like fingernails on a chalkboard!).
Assuming the file is not huge, you can force the pipeline to operate in discrete sections--thereby obviating the need for a temp file--with judicious use of parentheses:
(Get-Content $file | Select-Object -Skip 1) | Set-Content $file
... or in short form:
(gc $file | select -Skip 1) | sc $file
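To see why the parentheses matter, here is a small self-contained demonstration on a throwaway file (the file name is made up for the example). The parenthesized part runs to completion and releases the file before Set-Content opens the same file for writing.

```powershell
# Create a small demo file in the temp folder (hypothetical name).
$file = Join-Path ([System.IO.Path]::GetTempPath()) 'remove-top-line-demo.txt'
Set-Content -LiteralPath $file -Value 'header', 'row 1', 'row 2'

# The parentheses force the whole file to be read before anything is written back.
(Get-Content -LiteralPath $file | Select-Object -Skip 1) | Set-Content -LiteralPath $file

Get-Content -LiteralPath $file    # row 1, row 2
```

The trade-off is that the entire file content is held in memory at once, which is why this only suits files that are not huge.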
It is not the most efficient in the world, but this should work:
get-content $file |
select -Skip 1 |
set-content "$file-temp"
move "$file-temp" $file -Force
Using variable notation, you can do it without a temporary file:
${C:\file.txt} = ${C:\file.txt} | select -skip 1
function Remove-Topline ( [string[]]$path, [int]$skip=1 ) {
if ( -not (Test-Path $path -PathType Leaf) ) {
throw "invalid filename"
}
ls $path |
% { iex "`${$($_.fullname)} = `${$($_.fullname)} | select -skip $skip" }
}
I just had to do the same task, and gc | select ... | sc took over 4 GB of RAM on my machine while reading a 1.6 GB file. It didn't finish for at least 20 minutes after reading the whole file in (as reported by Read Bytes in Process Explorer), at which point I had to kill it.
My solution was to use a more .NET approach: StreamReader + StreamWriter.
See this question for a great answer discussing the performance: In Powershell, what's the most efficient way to split a large text file by record type?
Below is my solution. Yes, it uses a temporary file, but in my case, it didn't matter (it was a freaking huge SQL table creation and insert statements file):
PS> (measure-command{
$i = 0
$ins = New-Object System.IO.StreamReader "in/file/pa.th"
$outs = New-Object System.IO.StreamWriter "out/file/pa.th"
while( !$ins.EndOfStream ) {
$line = $ins.ReadLine();
if( $i -ne 0 ) {
$outs.WriteLine($line);
}
$i = $i+1;
}
$outs.Close();
$ins.Close();
}).TotalSeconds
It returned:
188.1224443
Inspired by AASoft's answer, I set out to improve it a bit more:
Avoid the loop variable $i and the comparison with 0 in every loop
Wrap the execution into a try..finally block to always close the files in use
Make the solution work for an arbitrary number of lines to remove from the beginning of the file
Use a variable $p to reference the current directory
These changes lead to the following code:
$p = (Get-Location).Path
(Measure-Command {
# Number of lines to skip
$skip = 1
$ins = New-Object System.IO.StreamReader ($p + "\test.log")
$outs = New-Object System.IO.StreamWriter ($p + "\test-1.log")
try {
# Skip the first N lines, but allow for fewer than N, as well
for( $s = 1; $s -le $skip -and !$ins.EndOfStream; $s++ ) {
$ins.ReadLine()
}
while( !$ins.EndOfStream ) {
$outs.WriteLine( $ins.ReadLine() )
}
}
finally {
$outs.Close()
$ins.Close()
}
}).TotalSeconds
The first change brought the processing time for my 60 MB file down from 5.3s to 4s. The rest of the changes are more cosmetic.
$x = get-content $file
$x[1..$x.count] | set-content $file
Just that much. Long boring explanation follows. Get-content returns an array. We can "index into" array variables, as demonstrated in this and other Scripting Guys posts.
For example, if we define an array variable like this,
$array = @("first item","second item","third item")
so $array returns
first item
second item
third item
then we can "index into" that array to retrieve only its 1st element
$array[0]
or only its 2nd
$array[1]
or a range of index values from the 2nd through the last.
$array[1..$array.count]
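One detail worth knowing: the range 1..$array.count runs one index past the last element, which PowerShell tolerates by default when slicing. As a small illustrative sketch (variable names chosen for this example), here are two other ways to take everything after the first element.

```powershell
$array = @('first item', 'second item', 'third item')

# Everything except the first element, by explicit index range.
# Only valid when the array has at least two elements.
$byRange = $array[1..($array.Count - 1)]

# The same result via the pipeline; also fine for arrays with 0 or 1 elements.
$bySkip = @($array | Select-Object -Skip 1)

$byRange    # second item, third item
$bySkip     # second item, third item
```

The Select-Object form is the more forgiving of the two, since a descending range such as 1..0 would otherwise pick up elements you did not intend.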
I just learned from a website:
Get-ChildItem *.txt | ForEach-Object { (get-Content $_) | Where-Object {(1) -notcontains $_.ReadCount } | Set-Content -path $_ }
Or you can use the aliases to make it short, like:
gci *.txt | % { (gc $_) | ? { (1) -notcontains $_.ReadCount } | sc -path $_ }
Another approach to remove the first line from a file, using the multiple assignment technique:
$firstLine, $restOfDocument = Get-Content -Path $filename
$modifiedContent = $restOfDocument
$modifiedContent | Out-String | Set-Content $filename
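How multiple assignment distributes the values is easiest to see on an in-memory array. A quick sketch with made-up sample lines: the first variable receives the first element and the last variable receives everything that is left.

```powershell
$lines = 'header', 'row 1', 'row 2'

# $firstLine gets the first element, $rest gets all remaining elements.
$firstLine, $rest = $lines

$firstLine        # header
$rest             # row 1, row 2
@($rest).Count    # 2
```

If the input has only one line, $rest ends up as $null, so wrapping it in @() before counting or writing keeps the following steps predictable.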
`skip` didn't work, so my workaround is:
$LinesCount = $(get-content $file).Count
get-content $file |
select -Last $($LinesCount-1) |
set-content "$file-temp"
move "$file-temp" $file -Force
Following on from Michael Soren's answer:
if you want to edit all .txt files in the current directory and remove the first line from each, use this.
Get-ChildItem (Get-Location).Path -Filter *.txt |
Foreach-Object {
(Get-Content $_.FullName | Select-Object -Skip 1) | Set-Content $_.FullName
}
For smaller files you could use this:
& C:\windows\system32\more +1 oldfile.csv > newfile.csv | out-null
... but it's not very effective at processing my example file of 16MB. It doesn't seem to terminate and release the lock on newfile.csv.

How can I split a text file using PowerShell?

I need to split a large (500 MB) text file (a log4net exception file) into manageable chunks; something like 100 files of 5 MB each would be fine.
I would think this should be a walk in the park for PowerShell. How can I do it?
A word of warning about some of the existing answers - they will run very slow for very big files. For a 1.6 GB log file I gave up after a couple of hours, realising it would not finish before I returned to work the next day.
Two issues: the call to Add-Content opens, seeks and then closes the current destination file for every line in the source file. Reading a little of the source file each time and looking for the newlines also slows things down, but my guess is that Add-Content is the main culprit.
The following variant produces slightly less pleasant output: it will split files in the middle of lines, but it splits my 1.6 GB log in less than a minute:
$from = "C:\temp\large_log.txt"
$rootName = "C:\temp\large_log_chunk"
$ext = "txt"
$upperBound = 100MB
$fromFile = [io.file]::OpenRead($from)
$buff = new-object byte[] $upperBound
$count = $idx = 0
try {
do {
"Reading $upperBound"
$count = $fromFile.Read($buff, 0, $buff.Length)
if ($count -gt 0) {
$to = "{0}.{1}.{2}" -f ($rootName, $idx, $ext)
$toFile = [io.file]::OpenWrite($to)
try {
"Writing $count to $to"
$tofile.Write($buff, 0, $count)
} finally {
$tofile.Close()
}
}
$idx ++
} while ($count -gt 0)
}
finally {
$fromFile.Close()
}
Simple one-liner to split based on number of lines (100 in this case):
$i=0; Get-Content .....log -ReadCount 100 | %{$i++; $_ | Out-File out_$i.txt}
This is a somewhat easy task for PowerShell, complicated by the fact that the standard Get-Content cmdlet doesn't handle very large files too well. What I would suggest to do is use the .NET StreamReader class to read the file line by line in your PowerShell script and use the Add-Content cmdlet to write each line to a file with an ever-increasing index in the filename. Something like this:
$upperBound = 50MB # calculated by Powershell
$ext = "log"
$rootName = "log_"
$reader = new-object System.IO.StreamReader("C:\Exceptions.log")
$count = 1
$fileName = "{0}{1}.{2}" -f ($rootName, $count, $ext)
while(($line = $reader.ReadLine()) -ne $null)
{
Add-Content -path $fileName -value $line
if((Get-ChildItem -path $fileName).Length -ge $upperBound)
{
++$count
$fileName = "{0}{1}.{2}" -f ($rootName, $count, $ext)
}
}
$reader.Close()
Same as all the answers here, but using StreamReader/StreamWriter to split on new lines (line by line, instead of trying to read the whole file into memory at once). This approach can split big files in the fastest way I know of.
Note: I do very little error checking, so I can't guarantee it'll work smoothly for your case. It did for mine (1.7 GB TXT file of 4 million lines split in 100,000 lines per file in 95 seconds).
#split test
$sw = new-object System.Diagnostics.Stopwatch
$sw.Start()
$filename = "C:\Users\Vincent\Desktop\test.txt"
$rootName = "C:\Users\Vincent\Desktop\result"
$ext = "txt"
$linesperFile = 100000#100k
$filecount = 1
$reader = $null
try{
$reader = [io.file]::OpenText($filename)
try{
"Creating file number $filecount"
$writer = [io.file]::CreateText("{0}{1}.{2}" -f ($rootName,$filecount.ToString("000"),$ext))
$filecount++
$linecount = 0
while($reader.EndOfStream -ne $true) {
"Reading $linesperFile"
while( ($linecount -lt $linesperFile) -and ($reader.EndOfStream -ne $true)){
$writer.WriteLine($reader.ReadLine());
$linecount++
}
if($reader.EndOfStream -ne $true) {
"Closing file"
$writer.Dispose();
"Creating file number $filecount"
$writer = [io.file]::CreateText("{0}{1}.{2}" -f ($rootName,$filecount.ToString("000"),$ext))
$filecount++
$linecount = 0
}
}
} finally {
$writer.Dispose();
}
} finally {
$reader.Dispose();
}
$sw.Stop()
Write-Host "Split complete in " $sw.Elapsed.TotalSeconds "seconds"
Output splitting a 1.7 GB file:
...
Creating file number 45
Reading 100000
Closing file
Creating file number 46
Reading 100000
Closing file
Creating file number 47
Reading 100000
Closing file
Creating file number 48
Reading 100000
Split complete in 95.6308289 seconds
I often need to do the same thing. The trick is getting the header repeated into each of the split chunks. I wrote the following cmdlet (PowerShell v2 CTP 3) and it does the trick.
##############################################################################
#.SYNOPSIS
# Breaks a text file into multiple text files in a destination, where each
# file contains a maximum number of lines.
#
#.DESCRIPTION
# When working with files that have a header, it is often desirable to have
# the header information repeated in all of the split files. Split-File
# supports this functionality with the -rc (RepeatCount) parameter.
#
#.PARAMETER Path
# Specifies the path to an item. Wildcards are permitted.
#
#.PARAMETER LiteralPath
# Specifies the path to an item. Unlike Path, the value of LiteralPath is
# used exactly as it is typed. No characters are interpreted as wildcards.
# If the path includes escape characters, enclose it in single quotation marks.
# Single quotation marks tell Windows PowerShell not to interpret any
# characters as escape sequences.
#
#.PARAMETER Destination
# (Or -d) The location in which to place the chunked output files.
#
#.PARAMETER Count
# (Or -c) The maximum number of lines in each file.
#
#.PARAMETER RepeatCount
# (Or -rc) Specifies the number of "header" lines from the input file that will
# be repeated in each output file. Typically this is 0 or 1 but it can be any
# number of lines.
#
#.EXAMPLE
# Split-File bigfile.csv 3000 -rc 1
#
#.LINK
# Out-TempFile
##############################################################################
function Split-File {
[CmdletBinding(DefaultParameterSetName='Path')]
param(
[Parameter(ParameterSetName='Path', Position=1, Mandatory=$true, ValueFromPipeline=$true, ValueFromPipelineByPropertyName=$true)]
[String[]]$Path,
[Alias("PSPath")]
[Parameter(ParameterSetName='LiteralPath', Mandatory=$true, ValueFromPipelineByPropertyName=$true)]
[String[]]$LiteralPath,
[Alias('c')]
[Parameter(Position=2,Mandatory=$true)]
[Int32]$Count,
[Alias('d')]
[Parameter(Position=3)]
[String]$Destination='.',
[Alias('rc')]
[Parameter()]
[Int32]$RepeatCount
)
process {
# yeah! the cmdlet supports wildcards
if ($LiteralPath) { $ResolveArgs = @{LiteralPath=$LiteralPath} }
elseif ($Path) { $ResolveArgs = @{Path=$Path} }
Resolve-Path @ResolveArgs | %{
$InputName = [IO.Path]::GetFileNameWithoutExtension($_)
$InputExt = [IO.Path]::GetExtension($_)
if ($RepeatCount) { $Header = Get-Content $_ -TotalCount:$RepeatCount }
# get the input file in manageable chunks
$Part = 1
Get-Content $_ -ReadCount:$Count | %{
# make an output filename with a suffix
$OutputFile = Join-Path $Destination ('{0}-{1:0000}{2}' -f ($InputName,$Part,$InputExt))
# In the first iteration the header will be
# copied to the output file as usual
# on subsequent iterations we have to do it
if ($RepeatCount -and $Part -gt 1) {
Set-Content $OutputFile $Header
}
# write this chunk to the output file
Write-Host "Writing $OutputFile"
Add-Content $OutputFile $_
$Part += 1
}
}
}
}
I found this question while trying to split multiple contacts in a single vCard VCF file to separate files. Here's what I did based on Lee's code. I had to look up how to create a new StreamReader object and changed null to $null.
$reader = new-object System.IO.StreamReader("C:\Contacts.vcf")
$count = 1
$filename = "C:\Contacts\{0}.vcf" -f ($count)
while(($line = $reader.ReadLine()) -ne $null)
{
Add-Content -path $fileName -value $line
if($line -eq "END:VCARD")
{
++$count
$filename = "C:\Contacts\{0}.vcf" -f ($count)
}
}
$reader.Close()
Many of these answers were too slow for my source files. My source files were SQL files between 10 MB and 800 MB that needed to split into files of roughly equal line counts.
I found some of the previous answers which use Add-Content to be quite slow. Waiting many hours for a split to finish wasn't uncommon.
I didn't try Typhlosaurus's answer, but it looks to only do splits by file size, not line count.
The following has suited my purposes.
$sw = new-object System.Diagnostics.Stopwatch
$sw.Start()
Write-Host "Reading source file..."
$lines = [System.IO.File]::ReadAllLines("C:\Temp\SplitTest\source.sql")
$totalLines = $lines.Length
Write-Host "Total Lines :" $totalLines
$skip = 0
$count = 100000; # Number of lines per file
# File counter, with sort friendly name
$fileNumber = 1
$fileNumberString = $filenumber.ToString("000")
while ($skip -lt $totalLines) {
$upper = $skip + $count - 1
if ($upper -gt ($lines.Length - 1)) {
$upper = $lines.Length - 1
}
# Write the lines
[System.IO.File]::WriteAllLines("C:\Temp\SplitTest\result$fileNumberString.txt",$lines[($skip..$upper)])
# Increment counters
$skip += $count
$fileNumber++
$fileNumberString = $filenumber.ToString("000")
}
$sw.Stop()
Write-Host "Split complete in " $sw.Elapsed.TotalSeconds "seconds"
For a 54 MB file, I get the output...
Reading source file...
Total Lines : 910030
Split complete in 1.7056578 seconds
I hope others looking for a simple, line-based splitting script that matches my requirements will find this useful.
There's also this quick (and somewhat dirty) one-liner:
$linecount=0; $i=0; Get-Content .\BIG_LOG_FILE.txt | %{ Add-Content OUT$i.log "$_"; $linecount++; if ($linecount -eq 3000) {$I++; $linecount=0 } }
You can tweak the number of first lines per batch by changing the hard-coded 3000 value.
Do this:
FILE 1
Get-Content C:\TEMP\DATA\split\splitme.txt | Select -First 5000 | out-File C:\temp\file1.txt -Encoding ASCII
FILE 2
Get-Content C:\TEMP\DATA\split\splitme.txt | Select -Skip 5000 | Select -First 5000 | out-File C:\temp\file2.txt -Encoding ASCII
FILE 3
Get-Content C:\TEMP\DATA\split\splitme.txt | Select -Skip 10000 | Select -First 5000 | out-File C:\temp\file3.txt -Encoding ASCII
etc…
I've made a little modification to split files based on size of each part.
##############################################################################
#.SYNOPSIS
# Breaks a text file into multiple text files in a destination, where each
# file is limited to a maximum size.
#
#.DESCRIPTION
# When working with files that have a header, it is often desirable to have
# the header information repeated in all of the split files. Split-File
# supports this functionality with the -rc (RepeatCount) parameter.
#
#.PARAMETER Path
# Specifies the path to an item. Wildcards are permitted.
#
#.PARAMETER LiteralPath
# Specifies the path to an item. Unlike Path, the value of LiteralPath is
# used exactly as it is typed. No characters are interpreted as wildcards.
# If the path includes escape characters, enclose it in single quotation marks.
# Single quotation marks tell Windows PowerShell not to interpret any
# characters as escape sequences.
#
#.PARAMETER Destination
# (Or -d) The location in which to place the chunked output files.
#
#.PARAMETER Size
# (Or -s) The maximum size of each file. Size must be expressed in MB.
#
#.PARAMETER RepeatCount
# (Or -rc) Specifies the number of "header" lines from the input file that will
# be repeated in each output file. Typically this is 0 or 1 but it can be any
# number of lines.
#
#.EXAMPLE
# Split-File bigfile.csv -s 20 -rc 1
#
#.LINK
# Out-TempFile
##############################################################################
function Split-File {
[CmdletBinding(DefaultParameterSetName='Path')]
param(
[Parameter(ParameterSetName='Path', Position=1, Mandatory=$true, ValueFromPipeline=$true, ValueFromPipelineByPropertyName=$true)]
[String[]]$Path,
[Alias("PSPath")]
[Parameter(ParameterSetName='LiteralPath', Mandatory=$true, ValueFromPipelineByPropertyName=$true)]
[String[]]$LiteralPath,
[Alias('s')]
[Parameter(Position=2,Mandatory=$true)]
[Int32]$Size,
[Alias('d')]
[Parameter(Position=3)]
[String]$Destination='.',
[Alias('rc')]
[Parameter()]
[Int32]$RepeatCount
)
process {
# yeah! the cmdlet supports wildcards
if ($LiteralPath) { $ResolveArgs = @{LiteralPath=$LiteralPath} }
elseif ($Path) { $ResolveArgs = @{Path=$Path} }
Resolve-Path @ResolveArgs | %{
$InputName = [IO.Path]::GetFileNameWithoutExtension($_)
$InputExt = [IO.Path]::GetExtension($_)
if ($RepeatCount) { $Header = Get-Content $_ -TotalCount:$RepeatCount }
# get the input file one line at a time
$Part = 1
$buffer = ""
Get-Content $_ -ReadCount:1 | %{
# collect the current line in the buffer
$buffer += $_ + "`r`n"
# test buffer size and dump data only if buffer is greater than size
if ($buffer.length -gt ($Size * 1MB)) {
# make an output filename with a suffix
$OutputFile = Join-Path $Destination ('{0}-{1:0000}{2}' -f ($InputName,$Part,$InputExt))
# The first chunk already starts with the header;
# on subsequent chunks we have to write it ourselves
if ($RepeatCount -and $Part -gt 1) {
Set-Content $OutputFile $Header
}
# write this chunk to the output file (Add-Content appends the final newline)
Write-Host "Writing $OutputFile"
Add-Content $OutputFile $buffer.Substring(0, $buffer.Length - 2)
$Part += 1
$buffer = ""
}
}
# write whatever is left in the buffer as the last chunk
if ($buffer) {
$OutputFile = Join-Path $Destination ('{0}-{1:0000}{2}' -f ($InputName,$Part,$InputExt))
if ($RepeatCount -and $Part -gt 1) {
Set-Content $OutputFile $Header
}
Write-Host "Writing $OutputFile"
Add-Content $OutputFile $buffer.Substring(0, $buffer.Length - 2)
}
}
}
}
Sounds like a job for the UNIX command split:
split MyBigFile.csv
Just split my 55 GB csv file in 21k chunks in less than 10 minutes.
It's not native to PowerShell, though; it comes with, for instance, the Git for Windows package: https://git-scm.com/download/win
As line lengths in logs can vary, I thought it best to take a lines-per-file approach. The following code snippet processed a 4 million line log file in under 19 seconds (18.83.. seconds), splitting it into 500,000-line chunks:
$sourceFile = "c:\myfolder\mylargeTextyFile.csv"
$partNumber = 1
$batchSize = 500000
$pathAndFilename = "c:\myfolder\mylargeTextyFile part $partNumber file.csv"
[System.Text.Encoding]$enc = [System.Text.Encoding]::GetEncoding(65001) # utf8 this one
$fs=New-Object System.IO.FileStream ($sourceFile,"OpenOrCreate", "Read", "ReadWrite",8,"None")
$streamIn=New-Object System.IO.StreamReader($fs, $enc)
$streamout = new-object System.IO.StreamWriter $pathAndFilename
$line = $streamIn.readline()
$counter = 0
while ($line -ne $null)
{
$streamout.writeline($line)
$counter +=1
if ($counter -eq $batchsize)
{
$partNumber+=1
$counter =0
$streamOut.close()
$pathAndFilename = "c:\myfolder\mylargeTextyFile part $partNumber file.csv"
$streamout = new-object System.IO.StreamWriter $pathAndFilename
}
$line = $streamIn.readline()
}
$streamin.close()
$streamout.close()
This can easily be turned into a function or script file with parameters to make it more versatile. It uses a StreamReader and StreamWriter to achieve its speed and tiny memory footprint
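As a rough idea of what such a function could look like, here is a sketch built on the same StreamReader/StreamWriter loop. The function name Split-TextFile and its parameters are assumptions made for this example, not part of the original snippet, and the encoding handling is simplified to the .NET defaults. Unlike the snippet above, it opens each part file only when there is a line to write, so no empty trailing file is produced when the line count is an exact multiple of the batch size.

```powershell
function Split-TextFile {
    param(
        [Parameter(Mandatory=$true)] [string]$SourceFile,
        [Parameter(Mandatory=$true)] [string]$DestinationFolder,
        [int]$BatchSize = 500000    # lines per part file
    )
    # Resolve to full paths, because .NET ignores the current PowerShell location.
    $source = Get-Item -LiteralPath $SourceFile
    $folder = (Resolve-Path -LiteralPath $DestinationFolder).ProviderPath
    $reader = New-Object System.IO.StreamReader $source.FullName
    $writer = $null
    $partNumber = 0
    $counter = 0
    try {
        while ($null -ne ($line = $reader.ReadLine())) {
            # Open the next part file only when there is a line to write.
            if ($null -eq $writer) {
                $partNumber++
                $name = '{0} part {1} file{2}' -f $source.BaseName, $partNumber, $source.Extension
                $writer = New-Object System.IO.StreamWriter (Join-Path $folder $name)
            }
            $writer.WriteLine($line)
            $counter++
            if ($counter -eq $BatchSize) {
                $writer.Dispose()
                $writer = $null
                $counter = 0
            }
        }
    }
    finally {
        # Always release both files, even if something fails midway.
        if ($null -ne $writer) { $writer.Dispose() }
        $reader.Dispose()
    }
    # Return the number of part files written.
    $partNumber
}
```

A call might then look like `Split-TextFile -SourceFile c:\myfolder\mylargeTextyFile.csv -DestinationFolder c:\myfolder -BatchSize 500000`.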
My requirement was a bit different. I often work with Comma Delimited and Tab Delimited ASCII files where a single line is a single record of data. And they're really big, so I need to split them into manageable parts (whilst preserving the header row).
So, I reverted to my classic VBScript method and bashed together a small .vbs script that can be run on any Windows computer (it gets automatically executed by the WScript.exe script host engine on Windows).
The benefit of this method is that it uses Text Streams, so the underlying data isn't loaded into memory (or, at least, not all at once). The result is that it's exceptionally fast and it doesn't really need much memory to run. The test file I just split using this script on my i7 was about 1 GB in file size, had about 12 million lines of text and was split into 25 part files (each with about 500k lines each) – the processing took about 2 minutes and it didn’t go over 3 MB memory used at any point.
The caveat here is that it relies on the text file having "lines" (meaning each record is delimited with a CRLF) as the Text Stream object uses the "ReadLine" function to process a single line at a time. But hey, if you're working with TSV or CSV files, it's perfect.
Option Explicit
Private Const INPUT_TEXT_FILE = "c:\bigtextfile.txt"
Private Const REPEAT_HEADER_ROW = True
Private Const LINES_PER_PART = 500000
Dim oFileSystem, oInputFile, oOutputFile, iOutputFile, iLineCounter, sHeaderLine, sLine, sFileExt, sStart
sStart = Now()
sFileExt = Right(INPUT_TEXT_FILE,Len(INPUT_TEXT_FILE)-InstrRev(INPUT_TEXT_FILE,".")+1)
iLineCounter = 0
iOutputFile = 1
Set oFileSystem = CreateObject("Scripting.FileSystemObject")
Set oInputFile = oFileSystem.OpenTextFile(INPUT_TEXT_FILE, 1, False)
Set oOutputFile = oFileSystem.OpenTextFile(Replace(INPUT_TEXT_FILE, sFileExt, "_" & iOutputFile & sFileExt), 2, True)
If REPEAT_HEADER_ROW Then
iLineCounter = 1
sHeaderLine = oInputFile.ReadLine()
Call oOutputFile.WriteLine(sHeaderLine)
End If
Do While Not oInputFile.AtEndOfStream
sLine = oInputFile.ReadLine()
Call oOutputFile.WriteLine(sLine)
iLineCounter = iLineCounter + 1
If iLineCounter Mod LINES_PER_PART = 0 Then
iOutputFile = iOutputFile + 1
Call oOutputFile.Close()
Set oOutputFile = oFileSystem.OpenTextFile(Replace(INPUT_TEXT_FILE, sFileExt, "_" & iOutputFile & sFileExt), 2, True)
If REPEAT_HEADER_ROW Then
Call oOutputFile.WriteLine(sHeaderLine)
End If
End If
Loop
Call oInputFile.Close()
Call oOutputFile.Close()
Set oFileSystem = Nothing
Call MsgBox("Done" & vbCrLf & "Lines Processed:" & iLineCounter & vbCrLf & "Part Files: " & iOutputFile & vbCrLf & "Start Time: " & sStart & vbCrLf & "Finish Time: " & Now())
In case this helps, the following works perfectly for me.
The script checks a folder, parses all the CSV files in it and counts the number of lines in each file.
If a file contains more than 55000 lines, the script splits it into sub-files of 50000 lines each and names them "_1, _2, ...".
At the end of the script, the original file is renamed so that it is not loaded.
foreach ($MyFile in $MyFolder)
{
# Read parent CSV
$InputFilename = $MyFile
$InputFile = Get-Content $MyFile
$OutputFilenamePattern = "$MyFile"+"_"
Write-Host ".........."
Write-Host ". File to process"
Write-Host ".........."
WRITE-HOST "$MyVar_file_Path"
Write-Host "$InputFilename"
Write-Host "$OutputFilenamePattern"
Write-Host ".........."
$LineLimit = 50000
# Initialize
$line = 0
$i = 0
$file = 0
$start = 0
# Reuse the lines already read into $InputFile instead of reading the file a second time
$nb_lines = @($InputFile).Count
Write-Host ".........."
Write-Host "$nb_lines lines in the file"
Write-Host ".........."
if ($nb_lines -gt 55000)
{
# Loop all text lines
while ($line -le $InputFile.Length)
{
# Generate child CSVs
if ($i -eq $LineLimit -Or $line -eq $InputFile.Length)
{
$file++
$Filename = "$OutputFilenamePattern$file.csv"
# $InputFile[0] | Out-File $Filename -Force # Writes Header at the beginning of the line.
If ($file -ne 1) {$InputFile[0] | Out-File $Filename -Force}
$InputFile[$start..($line - 1)] | Out-File $Filename -Force -Append # Original line 19 with the addition of -Append so it doesn't overwrite the headers you just wrote.
# $InputFile[$start..($line-1)] | Out-File $Filename -Force
$start = $line;
$i = 0
Write-Host "$Filename"
}
# Increment counters
$i++;
$line++
}
$Source_name = $MyVar_file_Path2 + "\" + $InputFilename
$Destination_name = $MyVar_file_Path2 + "\" + "Splitted_" + $InputFilename
Write-Host ".........."
Write-Host ". File to rename"
Write-Host ".........."
Write-Host "$Source_name"
Write-Host "$Destination_name"
Write-Host ".........."
Rename-Item $Source_name -NewName $Destination_name
}
Write-Host "."
Write-Host "."
}
Here is my solution to split a file called patch6.txt (about 32,000 lines) into separate files of 1000 lines each. It's not quick, but it does the job.
$infile = "D:\Malcolm\Test\patch6.txt"
$path = "D:\Malcolm\Test\"
$lineCount = 1
$fileCount = 1
foreach ($computername in get-content $infile)
{
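# ${path} stops PowerShell from reading "$path_" as an (undefined) variable named "path_"
$computername | Out-File -Append "${path}_$fileCount.txt"
$lineCount++
# -gt rather than -eq: the counter starts at 1, so -eq 1000 would roll over after only 999 lines
if ($lineCount -gt 1000)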
{
$fileCount++
$lineCount = 1
}
}
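If the speed of that loop becomes a problem, one alternative worth trying is to let Get-Content hand over the lines in batches through its -ReadCount parameter, so that each output file is written in a single call instead of being reopened for every line. The sketch below is only an illustration of that idea: the function name Split-ByReadCount and its parameters are hypothetical, and the output encoding is left at the Set-Content default, which differs between Windows PowerShell and PowerShell 7.

```powershell
function Split-ByReadCount {
    param(
        [Parameter(Mandatory)][string]$InFile,     # hypothetical: file to split
        [Parameter(Mandatory)][string]$OutPrefix,  # hypothetical: "D:\out\part" gives part1.txt, part2.txt, ...
        [int]$LinesPerFile = 1000
    )
    $fileCount = 0
    # With -ReadCount N, each pipeline object is an array of up to N lines
    Get-Content $InFile -ReadCount $LinesPerFile | ForEach-Object {
        $fileCount++
        # Write the whole batch to its own file in one call
        $_ | Set-Content "$OutPrefix$fileCount.txt"
    }
    $fileCount   # emit the number of files written
}
```

For the example above, the call would look something like Split-ByReadCount -InFile "D:\Malcolm\Test\patch6.txt" -OutPrefix "D:\Malcolm\Test\patch6_" -LinesPerFile 1000.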