spilt text files to tsv UTF-8 with headers in PowerShell

spilt text files to tsv UTF-8 with headers in PowerShell - powershell

Im using the below code for splitting the huge file into 20K TSV UTF-8 files..
However, i need every split files should have the header in the 20k count, how we can do it?
$sourceFile = "C:\Users\lingaguru.c3\Desktop\Test\DE.txt"
$partNumber = 1
$batchSize = 20000
$pathAndFilename = "C:\Users\lingaguru.c3\Desktop\Test\Temp part $partNumber file.tsv"
[System.Text.Encoding]$enc = [System.Text.Encoding]::GetEncoding(65001) # utf8 this one
$fs=New-Object System.IO.FileStream ($sourceFile,"OpenOrCreate", "Read", "ReadWrite",8,"None")
$streamIn=New-Object System.IO.StreamReader($fs, $enc)
$streamout = new-object System.IO.StreamWriter $pathAndFilename
$line = $streamIn.readline()
$counter = 0
while ($line -ne $null)
{
$streamout.writeline($line)
$counter +=1
if ($counter -eq $batchsize)
{
$partNumber+=1
$counter =0
$streamOut.close()
$pathAndFilename = "C:\Users\lingaguru.c3\Desktop\Test\Temp part $partNumber file.tsv"
$streamout = new-object System.IO.StreamWriter $pathAndFilename
}
$line = $streamIn.readline()
}
$streamin.close()
$streamout.close()

Without altering your code too much, you need to capture the header line in a variable and write that out first thing on every new file:
$sourceFile = "C:\Users\lingaguru.c3\Desktop\Test\DE.txt"
# using a template filename saves writing
$pathOut = "C:\Users\lingaguru.c3\Desktop\Test\Temp part {0} file.tsv"
$partNumber = 1
$batchSize = 20000 # max number of data lines to write in each part
# construct the output filename using the template $pathOut
$pathAndFilename = $pathOut -f $partNumber
$enc = [System.Text.Encoding]::UTF8
$fs = [System.IO.FileStream]::new($sourceFile,"Open", "Read") # don't need write access on source file
$streamIn = [System.IO.StreamReader]::new($fs, $enc)
$streamout = [System.IO.StreamWriter]::new($pathAndFilename)
# assuming the first line contains the headers
$header = $streamIn.ReadLine()
# write out the header on the first part
$streamout.WriteLine($header)
$counter = 0
while (($line = $streamIn.ReadLine()) -ne $null) {
$streamout.WriteLine($line)
$counter++
if ($counter -ge $batchsize) {
$partNumber++
$counter = 0
$streamOut.Flush()
$streamOut.Dispose()
$pathAndFilename = $pathOut -f $partNumber
$streamout = [System.IO.StreamWriter]::new($pathAndFilename)
# write the header on this new part
$streamout.WriteLine($header)
}
}
$streamin.Dispose()
$streamout.Dispose()
$fs.Dispose()

Related

powershell rename XLSX spreadsheet columns

I have a spreadsheet that has spaces in the column names, how do I go about replacing the space with underscores on the column headers?
Note: I am new at this so bear with me
using this code with no luck:
Powershell: search & replace in xlsx except first 3 columns
Theo's code works great!
$sheetname = 'my Data'
$file = 'C:\Users\donkeykong\Desktop\1\booka.xlsx'
# create a COM Excel object
$objExcel = New-Object -ComObject Excel.Application
$objExcel.Visible = $false
$workbook = $objExcel.Workbooks.Open($file)
$sheet = $workbook.Worksheets.Item($sheetname)
$sheet.Activate()
# get the number of columns used
$colMax = $sheet.UsedRange.Columns.Count
# loop over the column headers and replace the whitespaces
for ($col = 1; $col -le $colMax; $col++) {
$header = $sheet.Cells.Item(1, $col).Value() -replace '\s+', '_'
$sheet.Cells.Item(1, $col) = $header
}
# close and save the changes
$workbook.Close($true)
$objExcel.Quit()
# IMPORTANT: clean-up used Com objects
$null = [System.Runtime.Interopservices.Marshal]::ReleaseComObject($sheet)
$null = [System.Runtime.Interopservices.Marshal]::ReleaseComObject($workbook)
$null = [System.Runtime.Interopservices.Marshal]::ReleaseComObject($objExcel)
[System.GC]::Collect()
[System.GC]::WaitForPendingFinalizers()

Assuming your Excel file has the headers in the first row, this should work without using the ImportExcel module:
$sheetname = 'my Data'
$file = 'C:\Users\donkeykong\Desktop\1\booka.xlsx'
# create a COM Excel object
$objExcel = New-Object -ComObject Excel.Application
$objExcel.Visible = $false
$workbook = $objExcel.Workbooks.Open($file)
$sheet = $workbook.Worksheets.Item($sheetname)
$sheet.Activate()
# get the number of columns used
$colMax = $sheet.UsedRange.Columns.Count
# loop over the column headers and replace the whitespaces
for ($col = 1; $col -le $colMax; $col++) {
$header = $sheet.Cells.Item(1, $col).Value() -replace '\s+', '_'
$sheet.Cells.Item(1, $col) = $header
}
# close and save the changes
$workbook.Close($true)
$objExcel.Quit()
# IMPORTANT: clean-up used Com objects
$null = [System.Runtime.Interopservices.Marshal]::ReleaseComObject($sheet)
$null = [System.Runtime.Interopservices.Marshal]::ReleaseComObject($workbook)
$null = [System.Runtime.Interopservices.Marshal]::ReleaseComObject($objExcel)
[System.GC]::Collect()
[System.GC]::WaitForPendingFinalizers()

How to download only a single file from an online ZIP archive via Powershell?

I want to download only a single file from an online ZIP archive via Powershell.
For this I created a demo-code which is already working, but I am still struggling to get the correct parsing logic on the ZIP-directory. Here is the code I have so far:
# demo code downloading a single DLL file from an online ZIP archive
# and extracting the DLL into memory for mounting it to the main process.
cls
Remove-Variable * -ea 0
# definition for the ZIP archive, the file to be extracted and the checksum:
$url = 'https://github.com/sshnet/SSH.NET/releases/download/2020.0.1/SSH.NET-2020.0.1-bin.zip'
$sub = 'net40/Renci.SshNet.dll'
$md5 = '5B1AF51340F333CD8A49376B13AFCF9C'
# prepare HTTP client:
Add-Type -AssemblyName System.Net.Http
$handler = [System.Net.Http.HttpClientHandler]::new()
$client = [System.Net.Http.HttpClient]::new($handler)
# get the length of the ZIP archive:
$req = [System.Net.HttpWebRequest]::Create($url)
$req.Method = 'HEAD'
$length = $req.GetResponse().ContentLength
$zip = [byte[]]::new($length)
# get the last 10k:
# how to get the correct length of the central ZIP directory here?
$start = $length-10kb
$end = $length-1
$client.DefaultRequestHeaders.Add('Range', "bytes=$start-$end")
$result = $client.GetAsync($url).Result
$last10kb = $result.content.ReadAsByteArrayAsync().Result
$last10kb.CopyTo($zip, $start)
# get the block containing the DLL file:
# how to get the exact file-offset from the ZIP directory?
$start = $length-3537kb
$end = $length-3201kb
$client.DefaultRequestHeaders.Clear()
$client.DefaultRequestHeaders.Add('Range', "bytes=$start-$end")
$result = $client.GetAsync($url).Result
$block = $result.content.ReadAsByteArrayAsync().Result
$block.CopyTo($zip, $start)
# extract the DLL file from archive:
Add-Type -AssemblyName System.IO.Compression
$stream = [System.IO.Memorystream]::new()
$stream.Write($zip,0,$zip.Length)
$archive = [System.IO.Compression.ZipArchive]::new($stream)
$entry = $archive.GetEntry($sub)
$bytes = [byte[]]::new($entry.Length)
[void]$entry.Open().Read($bytes, 0, $bytes.Length)
# check MD5:
$prov = [Security.Cryptography.MD5CryptoServiceProvider]::new().ComputeHash($bytes)
$hash = [string]::Concat($prov.foreach{$_.ToString("x2")})
if ($hash -ne $md5) {write-host 'dll has wrong checksum.' -f y ;break}
# load the DLL:
[void][System.Reflection.Assembly]::Load($bytes)
# use the single demo-call from the DLL:
$test = [Renci.SshNet.NoneAuthenticationMethod]::new('test')
'done.'
Only open point in this code is the correct method to identify the length of the central directory at the end of the ZIP archive and how to get the correct file-offset for the single file to be extracted (in my code I just found the ranges by pure try&error).
I already checked this wiki https://en.wikipedia.org/wiki/ZIP_(file_format)#Structure and also the PKWARE definitions https://gist.github.com/steakknife/820b73ebf25146180198febdb6f0e183 but beside the block definitions I could not find a programmatical approach to get the offset for ethe EOCD and the individual file. Can someone help here, please?

After a couple of additional tests I came to this solution:
# demo code downloading a single DLL file from an online ZIP archive
# and extracting the DLL into memory to mount it finally to the main process.
cls
Remove-Variable * -ea 0
# definition for the ZIP archive, the file to be extracted and the checksum:
$url = 'https://github.com/sshnet/SSH.NET/releases/download/2020.0.1/SSH.NET-2020.0.1-bin.zip'
$sub = 'net40/Renci.SshNet.dll'
$md5 = '5B1AF51340F333CD8A49376B13AFCF9C'
'prepare HTTP client:'
Add-Type -AssemblyName System.Net.Http
$handler = [System.Net.Http.HttpClientHandler]::new()
$client = [System.Net.Http.HttpClient]::new($handler)
'get the length of the ZIP archive:'
# dont use System.Web.HttpRequest, it is frequently hanging:
$req = [System.Net.Http.HttpRequestMessage]::new('HEAD', $url)
$result = $client.SendAsync($req).Result
$zipLength = $result.Content.Headers.ContentLength
$zip = [byte[]]::new($zipLength)
$req.Dispose()
'get the last 10k:'
$start = $zipLength-10kb
$end = $zipLength-1
$client.DefaultRequestHeaders.Add('Range', "bytes=$start-$end")
$result = $client.GetAsync($url).Result
$last10kb = $result.content.ReadAsByteArrayAsync().Result
$last10kb.CopyTo($zip, $start)
"get the 'End of CD' block:"
$enc = [System.Text.Encoding]::GetEncoding(28591)
$end = $enc.GetString($last10kb, $last10kb.Length-256, 256)
$eocd = [regex]::Match($end, 'PK\x05\x06.*').value
$eocd = $enc.GetBytes($eocd)
'get the central directory:'
$cdLength = [bitconverter]::ToUInt32($eocd, 12)
$cdStart = [bitconverter]::ToUInt32($eocd, 16)
$cd = [byte[]]::new($cdLength)
[array]::Copy($zip, $cdStart, $cd, 0, $cdLength)
'search all file headers for correct file name:'
$fileHeaders = [regex]::Split($enc.GetString($cd),'PK\x01\x02')
foreach ($header in $fileHeaders) {
$len = $header.Length
if ($len -ge 42) {
$bytes = $enc.GetBytes($header)
$nameLength = [bitconverter]::ToUInt16($bytes, 24)
if ($nameLength -eq $sub.length -and ($nameLength + 42) -le $len) {
$name = $header.Substring(42, $nameLength)
if ($name -eq $sub) {
$size = [bitconverter]::ToUInt32($bytes, 16) + 256
$start = [bitconverter]::ToUInt32($bytes, 38)
break
}
}
}
}
if (!$start) {write-host 'we could not find file in the ZIP archive' -f y ;break}
'get the block containing the file:'
$end = $start+$size
$client.DefaultRequestHeaders.Clear()
$client.DefaultRequestHeaders.Add('Range', "bytes=$start-$end")
$result = $client.GetAsync($url).Result
$block = $result.content.ReadAsByteArrayAsync().Result
$block.CopyTo($zip, $start)
$client.dispose()
'extract the DLL file from archive:'
Add-Type -AssemblyName System.IO.Compression
$stream = [System.IO.Memorystream]::new()
$stream.Write($zip,0,$zip.Length)
$archive = [System.IO.Compression.ZipArchive]::new($stream)
$entry = $archive.GetEntry($sub)
$bytes = [byte[]]::new($entry.Length)
[void]$entry.Open().Read($bytes, 0, $bytes.Length)
'check MD5:'
$prov = [Security.Cryptography.MD5CryptoServiceProvider]::new().ComputeHash($bytes)
$hash = [string]::Concat($prov.foreach{$_.ToString("x2")})
if ($hash -ne $md5) {write-host 'dll has wrong checksum.' -f y ;break}
'load the DLL:'
[void][System.Reflection.Assembly]::Load($bytes)
'use the single demo-call from the DLL:'
$test = [Renci.SshNet.NoneAuthenticationMethod]::new('test')
'done.'

Extract sections from range found in word document

Below code works fine, we got start and end point which needs to be extracted but im not able to get range.set/select to work
I'm able to get the range from below, just need to extra and save it to CSV file...
$found = $paras2.Range.SetRange($startPosition, $endPosition) - this piece doesn't work.
$file = "D:\Files\Scan.doc"
$SearchKeyword1 = 'Keyword1'
$SearchKeyword2 = 'Keyword2'
$word = New-Object -ComObject Word.Application
$word.Visible = $false
$doc = $word.Documents.Open($file,$false,$true)
$sel = $word.Selection
$paras = $doc.Paragraphs
$paras1 = $doc.Paragraphs
$paras2 = $doc.Paragraphs
foreach ($para in $paras)
{
if ($para.Range.Text -match $SearchKeyword1)
{
Write-Host $para.Range.Text
$startPosition = $para.Range.Start
}
}
foreach ($para in $paras1)
{
if ($para.Range.Text -match $SearchKeyword2)
{
Write-Host $para.Range.Text
$endPosition = $para.Range.Start
}
}
Write-Host $startPosition
Write-Host $endPosition
$found = $paras2.Range.SetRange($startPosition, $endPosition)
# cleanup com objects
[System.Runtime.Interopservices.Marshal]::ReleaseComObject($doc) | Out-Null
[System.Runtime.Interopservices.Marshal]::ReleaseComObject($word) | Out-Null
[System.GC]::Collect()
[System.GC]::WaitForPendingFinalizers()

This line of code is the problem
$found = $paras2.Range.SetRange($startPosition, $endPosition)
When designating a Range by the start and end position it's necessary to do so relative to the document. The code above refers to a Paragraphs collection. In addition, it uses SetRange, but should only use the Range method. So:
$found = $doc.Range.($startPosition, $endPosition)

PowerShell StreamReader Change Delimiter

I have a PowerShell script to pull data from a database, but some of the fields contain commas and that is resulting in breaking up the fields because the StreamReader splits it up into fields by comma. How can I change the delimiter of how the data is split into it's fields?
$ConnectionString = "Data Source=server1; Database=Development; Trusted_Connection=True;";
$streamWriter = New-Object System.IO.StreamWriter ".\output.csv"
$sqlConn = New-Object System.Data.SqlClient.SqlConnection $ConnectionString
$sqlCmd = New-Object System.Data.SqlClient.SqlCommand
$sqlCmd.Connection = $sqlConn
$sqlCmd.CommandText = "SELECT * FROM Development.dbo.All_Opportunities WITH(NOLOCK)"
$sqlConn.Open();
$reader = $sqlCmd.ExecuteReader();
# Initialze the array the hold the values
$array = #()
for ( $i = 0 ; $i -lt $reader.FieldCount; $i++ )
{ $array += #($i) }
# Write Header
$streamWriter.Write($reader.GetName(0))
for ( $i = 1; $i -lt $reader.FieldCount; $i ++)
{ $streamWriter.Write($("," + $reader.GetName($i))) }
$streamWriter.WriteLine("") # Close the header line
while ($reader.Read())
{
# get the values;
$fieldCount = $reader.GetValues($array);
# add quotes if the values have a comma or double quote
for ($i = 0; $i -lt $array.Length; $i++)
{
if ($array[$i] -match "`"|\S")
{
$array[$i] = '"' + $array[$i].Replace("`"", "`"`"").ToString() + '"';
}
}
$newRow = [string]::Join(",", $array);
$streamWriter.WriteLine($newRow)
}
$reader.Close();
$sqlConn.Close();
$streamWriter.Close();

Have you read this post to see if it helps your effort. It's for a text fiel, but could open you creativity to what is possible.
'stackoverflow.com/questions/14954437/streamreader-with-tab-delimited-text-file'
FYI, there is no delimiter type called 'field'
Otherwise, for those columns that have a comma as part of the value, a common approach is either to double quote the value or escape it.

Document manipluation: powershell

I am attempting to take a large document, search for a "^m" (page break) and create a new text file for each page break I find.
Using:
$SearchText = "^m"
$word = new-object -ComObject "word.application"
$path = "C:\Users\me\Documents\Test.doc"
$doc = $word.documents.open("$path")
$doc.content.find.execute("$SearchText")
I am able to find text, but how do I save the text before the page break into a new file? In VBScript, I would just do a readline and save it to a buffer, but powershell is much different.
EDIT:
$text = $word.Selection.MoveUntil (cset:="^m")
returns an error:
Missing ')' in method call.

I think my solution is kinda stupid, but here is my own solution (please help me find a better one):
Param(
[string]$file
)
#$file = "C:\scripts\docSplit\test.docx"
$word = New-Object -ComObject "word.application"
$doc=$word.documents.open($file)
$txtPageBreak = "<!--PAGE BREAK--!>"
$fileInfo = Get-ChildItem $file
$folder = $fileInfo.directoryName
$fileName = $fileInfo.name
$newFileName = $fileName.replace(".", "")
#$findtext = "^m"
#$replaceText = $txtPageBreak
function Replace-Word ([string]$Document,[string]$FindText,[string]$ReplaceText) {
#Variables used to Match And Replace
$ReplaceAll = 2
$FindContinue = 1
$MatchCase = $False
$MatchWholeWord = $True
$MatchWildcards = $False
$MatchSoundsLike = $False
$MatchAllWordForms = $False
$Forward = $True
$Wrap = $FindContinue
$Format = $False
$Selection = $Word.Selection
$Selection.Find.Execute(
$FindText,
$MatchCase,
$MatchWholeWord,
$MatchWildcards,
$MatchSoundsLike,
$MatchAllWordForms,
$Forward,
$Wrap,
$Format,
$ReplaceText,
$ReplaceAll
)
$newFileName = "$folder\$newFileName.txt"
$Doc.saveAs([ref]"$newFileName",[ref]2)
$doc.close()
}
Replace-Word($file, "^m", $txtPageBreak)
$word.quit()
[System.Runtime.Interopservices.Marshal]::ReleaseComObject($word)
Remove-Variable word
#begin txt file manipulation
#add end of file marker
$eof = "`n<!--END OF FILE!-->"
Add-Content $newfileName $eof
$masterTextFile = Get-Content $newFileName
$buffer = ""
foreach($line in $masterTextFile){
if($line.compareto($eof) -eq 0){
#end of file, save buffer to new file, be done
}
else {
$found = $line.CompareTo($txtPageBreak)
if ($found -eq 1) {
$buffer = "$buffer $line `n"
}
else {
#save the buffer to a new file (still have to write this part)
}
}
}

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

spilt text files to tsv UTF-8 with headers in PowerShell - powershell

Related

powershell rename XLSX spreadsheet columns

How to download only a single file from an online ZIP archive via Powershell?

Extract sections from range found in word document

PowerShell StreamReader Change Delimiter

Document manipluation: powershell

Categories

Resources