I have a PowerShell script to pull data from a database, but some of the fields contain commas, and that breaks up the fields because the StreamReader splits them on commas. How can I change the delimiter used to split the data into its fields?
$ConnectionString = "Data Source=server1; Database=Development; Trusted_Connection=True;";
$streamWriter = New-Object System.IO.StreamWriter ".\output.csv"
$sqlConn = New-Object System.Data.SqlClient.SqlConnection $ConnectionString
$sqlCmd = New-Object System.Data.SqlClient.SqlCommand
$sqlCmd.Connection = $sqlConn
$sqlCmd.CommandText = "SELECT * FROM Development.dbo.All_Opportunities WITH(NOLOCK)"
$sqlConn.Open();
$reader = $sqlCmd.ExecuteReader();
# Initialize the array to hold the values
$array = @()
for ( $i = 0 ; $i -lt $reader.FieldCount; $i++ )
{ $array += @($i) }
# Write Header
$streamWriter.Write($reader.GetName(0))
for ( $i = 1; $i -lt $reader.FieldCount; $i ++)
{ $streamWriter.Write($("," + $reader.GetName($i))) }
$streamWriter.WriteLine("") # Close the header line
while ($reader.Read())
{
# get the values;
$fieldCount = $reader.GetValues($array);
# add quotes if the values have a comma or double quote
for ($i = 0; $i -lt $array.Length; $i++)
{
if ($array[$i] -match "`"|,")
{
$array[$i] = '"' + $array[$i].Replace("`"", "`"`"").ToString() + '"';
}
}
$newRow = [string]::Join(",", $array);
$streamWriter.WriteLine($newRow)
}
$reader.Close();
$sqlConn.Close();
$streamWriter.Close();
Have you read this post to see if it helps your effort? It's for a text file, but it could open your creativity to what is possible:
stackoverflow.com/questions/14954437/streamreader-with-tab-delimited-text-file
FYI, there is no delimiter type called 'field'
Otherwise, for those columns that have a comma as part of the value, a common approach is either to double quote the value or escape it.
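For example, here is a minimal sketch of that quoting rule (ConvertTo-CsvField is just an illustrative name, not an existing cmdlet):
# wrap a value in double quotes if it contains a comma, a quote, or a line break,
# and double any embedded quotes (standard CSV escaping)
function ConvertTo-CsvField {
    param([string]$Value)
    if ($Value -match '[",\r\n]') {
        return '"' + $Value.Replace('"', '""') + '"'
    }
    return $Value
}
ConvertTo-CsvField 'Smith, John'   # -> "Smith, John"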
Related
I have a spreadsheet that has spaces in the column names; how do I go about replacing the spaces with underscores in the column headers?
Note: I am new at this, so bear with me.
I am using this code with no luck:
Powershell: search & replace in xlsx except first 3 columns
Theo's code works great!
Assuming your Excel file has the headers in the first row, this should work without using the ImportExcel module:
$sheetname = 'my Data'
$file = 'C:\Users\donkeykong\Desktop\1\booka.xlsx'
# create a COM Excel object
$objExcel = New-Object -ComObject Excel.Application
$objExcel.Visible = $false
$workbook = $objExcel.Workbooks.Open($file)
$sheet = $workbook.Worksheets.Item($sheetname)
$sheet.Activate()
# get the number of columns used
$colMax = $sheet.UsedRange.Columns.Count
# loop over the column headers and replace the whitespaces
for ($col = 1; $col -le $colMax; $col++) {
$header = $sheet.Cells.Item(1, $col).Value() -replace '\s+', '_'
$sheet.Cells.Item(1, $col) = $header
}
# close and save the changes
$workbook.Close($true)
$objExcel.Quit()
# IMPORTANT: clean-up used Com objects
$null = [System.Runtime.Interopservices.Marshal]::ReleaseComObject($sheet)
$null = [System.Runtime.Interopservices.Marshal]::ReleaseComObject($workbook)
$null = [System.Runtime.Interopservices.Marshal]::ReleaseComObject($objExcel)
[System.GC]::Collect()
[System.GC]::WaitForPendingFinalizers()
I'm using the code below to split a huge file into TSV UTF-8 files of 20K rows each.
However, I need every split file to include the header row on top of its 20K rows; how can I do that?
$sourceFile = "C:\Users\lingaguru.c3\Desktop\Test\DE.txt"
$partNumber = 1
$batchSize = 20000
$pathAndFilename = "C:\Users\lingaguru.c3\Desktop\Test\Temp part $partNumber file.tsv"
[System.Text.Encoding]$enc = [System.Text.Encoding]::GetEncoding(65001) # code page 65001 = UTF-8
$fs=New-Object System.IO.FileStream ($sourceFile,"OpenOrCreate", "Read", "ReadWrite",8,"None")
$streamIn=New-Object System.IO.StreamReader($fs, $enc)
$streamout = new-object System.IO.StreamWriter $pathAndFilename
$line = $streamIn.readline()
$counter = 0
while ($line -ne $null)
{
$streamout.writeline($line)
$counter +=1
if ($counter -eq $batchsize)
{
$partNumber+=1
$counter =0
$streamOut.close()
$pathAndFilename = "C:\Users\lingaguru.c3\Desktop\Test\Temp part $partNumber file.tsv"
$streamout = new-object System.IO.StreamWriter $pathAndFilename
}
$line = $streamIn.readline()
}
$streamin.close()
$streamout.close()
Without altering your code too much, you need to capture the header line in a variable and write that out first thing on every new file:
$sourceFile = "C:\Users\lingaguru.c3\Desktop\Test\DE.txt"
# using a template filename saves writing
$pathOut = "C:\Users\lingaguru.c3\Desktop\Test\Temp part {0} file.tsv"
$partNumber = 1
$batchSize = 20000 # max number of data lines to write in each part
# construct the output filename using the template $pathOut
$pathAndFilename = $pathOut -f $partNumber
$enc = [System.Text.Encoding]::UTF8
$fs = [System.IO.FileStream]::new($sourceFile,"Open", "Read") # don't need write access on source file
$streamIn = [System.IO.StreamReader]::new($fs, $enc)
$streamout = [System.IO.StreamWriter]::new($pathAndFilename)
# assuming the first line contains the headers
$header = $streamIn.ReadLine()
# write out the header on the first part
$streamout.WriteLine($header)
$counter = 0
while (($line = $streamIn.ReadLine()) -ne $null) {
$streamout.WriteLine($line)
$counter++
if ($counter -ge $batchsize) {
$partNumber++
$counter = 0
$streamOut.Flush()
$streamOut.Dispose()
$pathAndFilename = $pathOut -f $partNumber
$streamout = [System.IO.StreamWriter]::new($pathAndFilename)
# write the header on this new part
$streamout.WriteLine($header)
}
}
$streamin.Dispose()
$streamout.Dispose()
$fs.Dispose()
I want to fetch multiple data values from an Excel sheet, but I am getting the error "Index was outside the bounds of the array".
$Data = Read-Host "Enter the count of Datastore"
$ds = "-sm-"
$Range1 = $Worksheetx1.Range("B1","B1048570")
$Search1 = $Range1.find($ds)
$r = $Search1.row
for ($i=1; $i -le $Data; $i++)
{
$Datastore = @()
$Datastore[$i] = $Worksheetx1.Cells.Item($r, 2).Value2
$r = $r+1
}
$Total_Datastore = $Datastore1 + $Datastore2 + $Datastore3 + $Datastore4
$Total_Datastore
The problem resides in this code:
for ($i=1; $i -le $Data; $i++)
{
$Datastore = @()
$Datastore[$i] = $Worksheetx1.Cells.Item($r, 2).Value2
$r = $r+1
}
You're creating an empty array ($Datastore = @()) on every pass through the loop and then trying to store data at index $i, which starts at 1 (the second element, since array indexes start at zero). Because the array has no elements, this causes an IndexOutOfRangeException.
Also, $Total_Datastore = $Datastore1 + $Datastore2 + $Datastore3 + $Datastore4 doesn't make sense, since $Datastore1 (and 2, 3 and 4) aren't defined anywhere.
Try:
# Only integers are allowed
$Data = [int] (Read-Host "Enter the count of Datastore")
$ds = "-sm-"
$Range1 = $Worksheetx1.Range("B1","B1048570")
$Search1 = $Range1.find($ds)
$r = $Search1.row
$Datastore = @()
for ($i=1; $i -le $Data; $i++) {
# Todo: Check if $r is in a valid range or not !!!
$Datastore += $Worksheetx1.Cells.Item($r, 2).Value2
$r = $r+1
}
$Datastore
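If you still want a single combined value like your $Total_Datastore, the collected entries can simply be joined (this assumes a comma-separated string is what you were after):
$Total_Datastore = $Datastore -join ', '
$Total_Datastore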
I have an Excel file with 10 columns. I want to get the sum of the column with the header "Sales" and print it on the console.
How can this be done with PowerShell? I am using the code below, but I do not know how to replace H with $i in the following expression:
'=SUM(H1:H'+$RowCount+')'
where H is the "Sales" column.
$Excel = New-Object -ComObject Excel.Application
$Excel.Visible = $False
$NewWorkbook = $Excel.Workbooks.open("C:\Test.xlsx")
$NewWorksheet = $NewWorkbook.Worksheets.Item(1)
$NewWorksheet.Activate() | Out-Null
$NewWorksheetRange = $NewWorksheet.UsedRange
$RowCount = $NewWorksheetRange.Rows.Count
$ColumnCount = $NewWorksheetRange.Columns.Count
for ($i = 1; $i -le $ColumnCount; $i++)
{
if ($NewWorksheet.cells.Item(1,$i).Value2 -eq "Sales")
{
$NewWorksheet.Cells.Item($RowCount+2,$i)='=SUM(H1:H'+$RowCount+')'
Write-Host $NewWorksheet.Cells.Item($RowCount+1,$i).Value2
}
}
$Excel.Application.DisplayAlerts=$False
$NewWorkbook.SaveAs("C:\Test_New.xlsx")
$NewWorkbook.close($false)
$Excel.quit()
spps -n excel # alias for Stop-Process -Name excel
I have replaced:
$NewWorksheet.Cells.Item($RowCount+2,$i) ='=SUM(H1:H'+(1+$RowCount)+')'
Write-Host $NewWorksheet.Cells.Item($RowCount+2,$i).Value2
with:
$FirstCell = $NewWorksheet.Cells(2,$i).Address($true, $false)
$LastCell = $NewWorksheet.Cells(1+$RowCount,$i).Address($true, $false)
$NewWorksheet.Cells.Item($RowCount+2,$i)='=SUM('+$FirstCell+':'+$LastCell+')'
Write-Host $NewWorksheet.Cells.Item($RowCount+2,$i).Value2
I can highly recommend the PowerShell module "ImportExcel". This module enables you to import Excel files as easily as with Import-Csv.
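If the module isn't installed yet, it can be installed from the PowerShell Gallery first:
Install-Module -Name ImportExcel -Scope CurrentUser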
Without knowing much about your files/environment, you could try something like this:
$result = 0
foreach ($data in (Import-Excel "$PSScriptRoot\test.xlsx")) {
    $result += $data.Sales
}
Write-Host $result
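If you just need the number, the loop can also be replaced with a single pipeline using Measure-Object (this assumes the column header is exactly "Sales"):
(Import-Excel "$PSScriptRoot\test.xlsx" | Measure-Object -Property Sales -Sum).Sum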
I have a PowerShell script on Windows 2012 R2 that is used to export data from a database into a CSV file. I have a check in there to escape double quotes and text qualify necessary fields. I am looking to increase the performance of the script because it runs a little slower than I would like (exporting 20GB/20 million rows) and it only utilizes about 10% of the CPU. Does anyone have any suggestions for improvement?
$ConnectionString = "Data Source=server1; Database=Development; Trusted_Connection=True;";
$streamWriter = New-Object System.IO.StreamWriter ".\output.csv"
$sqlConn = New-Object System.Data.SqlClient.SqlConnection $ConnectionString
$sqlCmd = New-Object System.Data.SqlClient.SqlCommand
$sqlCmd.Connection = $sqlConn
$sqlCmd.CommandText = "SELECT * FROM Development.dbo.All_Opportunities WITH(NOLOCK)"
$sqlConn.Open();
$reader = $sqlCmd.ExecuteReader();
# Initialize the array to hold the values
$array = @()
for ( $i = 0 ; $i -lt $reader.FieldCount; $i++ )
{ $array += @($i) }
# Write Header
$streamWriter.Write($reader.GetName(0))
for ( $i = 1; $i -lt $reader.FieldCount; $i ++)
{ $streamWriter.Write($("," + $reader.GetName($i))) }
$streamWriter.WriteLine("") # Close the header line
while ($reader.Read())
{
# get the values;
$fieldCount = $reader.GetValues($array);
# add quotes if the values have a comma or double quote
for ($i = 0; $i -lt $array.Length; $i++)
{
if ($array[$i] -match "`"|,")
{
$array[$i] = '"' + $array[$i].Replace("`"", "`"`"").ToString() + '"';
}
}
$newRow = [string]::Join(",", $array);
$streamWriter.WriteLine($newRow)
}
$reader.Close();
$sqlConn.Close();
$streamWriter.Close();
You can split the work into jobs and start them in the background.
Try: https://learn.microsoft.com/en-us/powershell/module/Microsoft.PowerShell.Core/Start-Job?view=powershell-5.1
Hope it helps.
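A very rough sketch of that idea (everything here is hypothetical: you would need a way to partition the query, for example by an ID range, and merge the part files afterwards):
# hypothetical partition ranges for an Id column; adjust to your table
$ranges = @(
    @{ Min = 1;        Max = 10000000; Out = '.\part1.csv' },
    @{ Min = 10000001; Max = 20000000; Out = '.\part2.csv' }
)
$jobs = foreach ($r in $ranges) {
    Start-Job -ScriptBlock {
        param($Min, $Max, $Out)
        # put the same SqlDataReader/StreamWriter export loop as above here,
        # with "... WHERE Id BETWEEN $Min AND $Max" in the query and $Out as the output file
    } -ArgumentList $r.Min, $r.Max, $r.Out
}
$jobs | Wait-Job | Receive-Job   # wait for all parts, then combine the part files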
So, I had a similar issue about a year ago, albeit with a slightly smaller table (~1 GB). Initially I just used:
Import-Module -Name SqlServer -Cmdlet Read-SqlTableData;
Read-SqlTableData -ServerInstance $SqlServer -DatabaseName $Database -SchemaName $Schema -TableName $Table |
Export-Csv -Path $OutputFilePath -NoTypeInformation
It worked, but it used a ton of memory (5+ GB out of 16 GB) and took about 7-9 minutes to run. All of these tests were with a spinning metal disk in a laptop, so bear that in mind with what follows as well.
I wondered if I could get it to go faster. I initially wrote it like this, which took about half the time, and about 100 MB of RAM:
$SqlServer = '...';
$SqlDatabase = '...';
$OutputFilePath = '...';
$SqlQuery = '...';
$SqlConnectionString = 'Data Source={0};Initial Catalog={1};Integrated Security=SSPI' -f $SqlServer, $SqlDatabase;
$Utf8NoBOM = New-Object -TypeName System.Text.UTF8Encoding -ArgumentList $false;
$StreamWriter = New-Object -TypeName System.IO.StreamWriter -ArgumentList $OutputFilePath, $Utf8NoBOM;
$CsvDelimiter = '"';
$CsvDelimiterEscape = '""';
$CsvSeparator = ',';
$SQLConnection = New-Object -TypeName System.Data.SqlClient.SqlConnection -ArgumentList $SqlConnectionString;
$SqlCommand = $SQLConnection.CreateCommand();
$SqlCommand.CommandText = $SqlQuery;
$SQLConnection.Open();
$SqlDataReader = $SqlCommand.ExecuteReader();
for ($Field = 0; $Field -lt $SqlDataReader.FieldCount; $Field++) {
if ($Field -gt 0) { $StreamWriter.Write($CsvSeparator); }
$StreamWriter.Write($CsvDelimiter);
$StreamWriter.Write($SqlDataReader.GetName($Field).Replace($CsvDelimiter, $CsvDelimiterEscape));
$StreamWriter.Write($CsvDelimiter);
}
$StreamWriter.WriteLine();
while ($SqlDataReader.Read()) {
for ($Field = 0; $Field -lt $SqlDataReader.FieldCount; $Field++) {
if ($Field -gt 0) { $StreamWriter.Write($CsvSeparator); }
$StreamWriter.Write($CsvDelimiter);
$StreamWriter.Write($SqlDataReader.GetValue($Field).ToString().Replace($CsvDelimiter, $CsvDelimiterEscape));
$StreamWriter.Write($CsvDelimiter);
}
$StreamWriter.WriteLine();
}
$SqlDataReader.Close();
$SqlDataReader.Dispose();
$SQLConnection.Close();
$SQLConnection.Dispose();
$StreamWriter.Close();
$StreamWriter.Dispose();
As you can see, it's basically the same pattern as yours.
I wondered if I could improve it even more, so I tried adding a StringBuilder since I'd had success doing that with other projects. I still have the code, but I found that it didn't work any faster, and took about 200 MB of RAM:
$SqlServer = '...'
$SqlDatabase = '...'
$OutputFilePath = '...'
$SqlQuery = '...';
$SqlConnectionString = 'Data Source={0};Initial Catalog={1};Integrated Security=SSPI' -f $SqlServer, $SqlDatabase;
$StringBuilderBufferSize = 50MB;
$StringBuilder = New-Object -TypeName System.Text.StringBuilder -ArgumentList ($StringBuilderBufferSize + 1MB);
$Utf8NoBOM = New-Object -TypeName System.Text.UTF8Encoding -ArgumentList $false;
$StreamWriter = New-Object -TypeName System.IO.StreamWriter -ArgumentList $OutputFilePath, $Utf8NoBOM;
$CsvDelimiter = '"';
$CsvDelimiterEscape = '""';
$CsvSeparator = ',';
$SQLConnection = New-Object -TypeName System.Data.SqlClient.SqlConnection -ArgumentList $SqlConnectionString;
$SqlCommand = $SQLConnection.CreateCommand();
$SqlCommand.CommandText = $SqlQuery;
$SQLConnection.Open();
$SqlDataReader = $SqlCommand.ExecuteReader();
for ($Field = 0; $Field -lt $SqlDataReader.FieldCount; $Field++) {
if ($Field -gt 0) { [void]$StringBuilder.Append($CsvSeparator); }
[void]$StringBuilder.Append($CsvDelimiter);
[void]$StringBuilder.Append($SqlDataReader.GetName($Field).Replace($CsvDelimiter, $CsvDelimiterEscape));
[void]$StringBuilder.Append($CsvDelimiter);
}
[void]$StringBuilder.AppendLine();
while ($SqlDataReader.Read()) {
for ($Field = 0; $Field -lt $SqlDataReader.FieldCount; $Field++) {
if ($Field -gt 0) { [void]$StringBuilder.Append($CsvSeparator); }
[void]$StringBuilder.Append($CsvDelimiter);
[void]$StringBuilder.Append($SqlDataReader.GetValue($Field).ToString().Replace($CsvDelimiter, $CsvDelimiterEscape));
[void]$StringBuilder.Append($CsvDelimiter);
}
[void]$StringBuilder.AppendLine();
if ($StringBuilder.Length -ge $StringBuilderBufferSize) {
$StreamWriter.Write($StringBuilder.ToString());
[void]$StringBuilder.Clear();
}
}
$SqlDataReader.Close();
$SqlDataReader.Dispose();
$SQLConnection.Close();
$SQLConnection.Dispose();
$StreamWriter.Write($StringBuilder.ToString());
$StreamWriter.Close();
$StreamWriter.Dispose();
No matter what I tried, I couldn't seem to get it under ~4:30 for about 1 GB of data.
I never considered parallelism because you'd have to break your query up into 4 equal pieces such that you could be certain that you'd get the complete data set, or otherwise do some pretty difficult process management with Runspace Pools. Even then you'd have to write to four different files, and eventually combine the files back together. Maybe it would work out, but it wasn't an interesting problem to me anymore at that point.
Eventually I just built the export with the Import/Export Wizard, saved it as a package, and ran it with DTExec.exe. This takes about 45-60 seconds or so for 1 GB of data. The only drawbacks are that you need to specify the table when you build the package, it doesn't dynamically determine the columns, and it's an unreasonable pain to get the output file to be UTF-8.
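For reference, running the saved package from PowerShell looks roughly like this (the .dtsx path is made up, and DTExec.exe has to be on the PATH or called by its full path):
& DTExec.exe /File "C:\SSIS\ExportAllOpportunities.dtsx"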
I did find that bcp.exe and sqlcmd.exe were faster. BCP was extremely fast, and took 20-30 seconds. However, the output formats are extremely limited, and BCP in particular is needlessly difficult to use.
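For comparison, a BCP export of the table from the question would look something like this (note that bcp does not quote fields, so embedded commas would still be a problem):
bcp "SELECT * FROM Development.dbo.All_Opportunities" queryout .\output.csv -c -t, -S server1 -T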