Error: Bad character (ASCII 0) encountered in Data Fusion - google-cloud-data-fusion

I'm trying to import a file into Google BigQuery. When I import it using a BigQuery job, I get an error:
Error detected while parsing row starting at position: 0. Error: Bad character (ASCII 0) encountered.
I've solved this by replacing the ASCII 0 character in Notepad++, or with PowerShell using the following script:
$configFiles = Get-ChildItem -Path C:\InputPath\*
foreach ($file in $configFiles)
{
(Get-Content $file.PSPath) |
Foreach-Object { $_ -replace "`0", "" } |
Set-Content $file.PSPath
}
But I need to automate this, so I'm using Google Cloud Data Fusion; however, when I open this file with the wrapper, I get a screen with a square symbol (I couldn't copy the character, so I pasted an image):
What can I do to load this file with Data Fusion?
If I open this same file in Notepad/Notepad++, I can see the characters like any other txt file.
Thanks!


Find & replace values

Working code:
I'm able to find the word "STIG", remove those lines, and save the result to a .txt file using the following code:
$content | Where-Object { -not $_.Contains('STIG') } | set-content $file
Problem:
I'm not able to find the word "(non-R2)", replace it with "non-R2", and save the result to a text file.
SAMPLE:
CIS_Microsoft_Windows_Server_2008_R2_Benchmark_v3.2.0-xccdf.xml
CIS_Microsoft_Windows_Server_2012_(non-R2)_Benchmark_v2.2.0-xccdf.xml
CIS_Microsoft_Windows_Server_2012_R2_Benchmark_v2.4.0-xccdf.xml
CIS_Microsoft_Windows_Server_2016_RTM_(Release_1607)_Benchmark_v1.2.0-xccdf
CIS_Microsoft_Windows_Server_2016_STIG_Benchmark_v1.0.0-xccdf.xml
Desired result:
The output should not contain lines having "STIG", and "(non-R2)" should be replaced with "non-R2" (without brackets), saved to a txt file:
CIS_Microsoft_Windows_Server_2008_R2_Benchmark_v3.2.0-xccdf.xml
CIS_Microsoft_Windows_Server_2012_non-R2_Benchmark_v2.2.0-xccdf.xml
CIS_Microsoft_Windows_Server_2012_R2_Benchmark_v2.4.0-xccdf.xml
CIS_Microsoft_Windows_Server_2016_RTM_(Release_1607)_Benchmark_v1.2.0-xccdf
((Get-Content -Path "sample.txt" ) -replace '\(non-R2\)','non-R2' ) | Set-Content "sample2.txt"
This reads the content of sample.txt, replaces "(non-R2)" with "non-R2", and writes the result to sample2.txt.
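The essential detail is that parentheses are regex metacharacters, so they must be escaped in the pattern (as \( and \)). If you need to build such a pattern from an arbitrary literal string, an escape helper does this for you. Here is a minimal sketch of the same replacement in Python, using re.escape (the sample file name is taken from the question); PowerShell offers the same facility via [regex]::Escape():

```python
import re

line = "CIS_Microsoft_Windows_Server_2012_(non-R2)_Benchmark_v2.2.0-xccdf.xml"

# "(non-R2)" contains regex metacharacters, so escape the literal string
# before using it as a pattern.
pattern = re.escape("(non-R2)")
result = re.sub(pattern, "non-R2", line)
print(result)  # CIS_Microsoft_Windows_Server_2012_non-R2_Benchmark_v2.2.0-xccdf.xml
```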

How can I (efficiently) match content (lines) of many small files with content (lines) of a single large file and update/recreate them

I've been trying to solve the following case:
Many small text files (in subfolders) need their content (lines) matched against lines that exist in another, large text file. The small files then need to be updated or recreated with those matching lines.
I was able to come up with some working code for this, but I need to improve it or use a completely different method, because it is extremely slow and would take more than 40 hours to get through all the files.
One idea I already had was to use SQL Server: bulk-import all the small files into a single table with [relative path], [filename], [jap content], import the translation file into a table with [jap content], [eng content], join the two on [jap content], and bulk-export the joined table as separate files using [relative path] and [filename]. Unfortunately, I got stuck right at the beginning due to formatting and encoding issues, so I dropped that approach and started working on a PowerShell script.
Now in detail:
Over 40k txt files spread across multiple subfolders, with multiple lines each; any given line can exist in multiple files.
Content:
UTF-8 encoded Japanese text that can also contain special characters like \\[*+(), with each line ending in a tab character. They look like CSV files, but they don't have headers.
One large file with more than 600k lines containing the translations for the small files. Every line is unique within this file.
Content:
Again, UTF-8 encoded Japanese text. Each line is formatted like this (without the brackets):
[Japanese Text][tabulator][English Text]
Example:
テスト[tabulator]Test
The end result should be a copy or an updated version of all these small files, where their lines are replaced with the matching ones from the translation file, while maintaining their relative paths.
What I have at the moment:
$translationfile = 'B:\Translation.txt'
$inputpath = 'B:\Working'
$translationarray = [System.Collections.ArrayList]@()
$translationarray = @(Get-Content $translationfile -Encoding UTF8)
Get-Childitem -path $inputpath -Recurse -File -Filter *.txt | ForEach-Object -Parallel {
$_.Name
$filepath = ($_.Directory.FullName).substring(2)
$filearray = [System.Collections.ArrayList]@()
$filearray = @(Get-Content -path $_.FullName -Encoding UTF8)
$filearray = $filearray | ForEach-Object {
$result = $using:translationarray -match ("^$_" -replace '[[+*?()\\.]','\$&')
if ($result) {
$_ = $result
}
$_
}
If(!(test-path B:\output\$filepath)) {New-Item -ItemType Directory -Force -Path B:\output\$filepath}
#$("B:\output\"+$filepath+"\")
$filearray | Out-File -FilePath $("B:\output\" + $filepath + "\" + $_.Name) -Force -Encoding UTF8
} -ThrottleLimit 10
I would appreciate any help and ideas, but please keep in mind that I rarely write scripts, so anything too complex might fly right over my head.
Thanks
As zett42 states, using a hash table is your best option for mapping the Japanese-only phrases to the dual-language lines.
Additionally, use of .NET APIs for file I/O can speed up the operation noticeably.
# Be sure to specify all paths as full paths, not least because .NET's
# current directory usually differs from PowerShell's
$translationfile = 'B:\Translation.txt'
$inPath = 'B:\Working'
$outPath = (New-Item -Type Directory -Force 'B:\Output').FullName
# Build the hashtable mapping the Japanese phrases to the full lines.
# Note that ReadLines() defaults to UTF-8
$ht = @{ }
foreach ($line in [IO.File]::ReadLines($translationfile)) {
$ht[$line.Split("`t")[0] + "`t"] = $line
}
Get-ChildItem $inPath -Recurse -File -Filter *.txt | Foreach-Object -Parallel {
# Translate the lines to the matching lines including the $translation
# via the hashtable.
# NOTE: If an input line isn't represented as a key in the hashtable,
# it is passed through as-is.
$lines = foreach ($line in [IO.File]::ReadLines($_.FullName)) {
($using:ht)[$line] ?? $line
}
# Synthesize the output file path, ensuring that the target dir. exists.
$outFilePath = (New-Item -Force -Type Directory ($using:outPath + $_.Directory.FullName.Substring(($using:inPath).Length))).FullName + '/' + $_.Name
# Write to the output file.
# Note: If you want UTF-8 files *with BOM*, use -Encoding utf8bom
Set-Content -Encoding utf8 $outFilePath -Value $lines
} -ThrottleLimit 10
Note: Your use of ForEach-Object -Parallel implies that you're using PowerShell (Core) 7+, where BOM-less UTF-8 is the consistent default encoding (unlike in Windows PowerShell, where default encodings vary wildly).
Therefore, in lieu of the .NET [IO.File]::ReadLines() API in a foreach loop, you could also use the more PowerShell-idiomatic switch statement with the -File parameter for efficient line-by-line text-file processing.
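For readers more comfortable outside PowerShell, the core technique here — build a dictionary keyed by the Japanese prefix, then map every input line through it with a pass-through default — can be sketched in a few lines of Python (the sample strings below are made up for illustration):

```python
# Build a lookup from "Japanese text + tab" to the full dual-language line,
# then map each input line through it, passing unknown lines through as-is.
translation_lines = ["テスト\tTest", "例\tExample"]
table = {line.split("\t")[0] + "\t": line for line in translation_lines}

input_lines = ["テスト\t", "未知\t"]  # the second line has no translation
output = [table.get(line, line) for line in input_lines]
print(output)  # ['テスト\tTest', '未知\t']
```

Dictionary lookup is O(1) per line, which is what turns the original O(lines × translations) regex scan into something that finishes in minutes rather than hours.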

Remove white space from file - preserve one line

I'm new to PowerShell and I'm trying to remove whitespace from a file. The file contains some values and has whitespace (indentation):
Hostname=hostname1
Server=server1
Directory=C:\Program Files\Test
Database=db1
I am trying to remove the whitespace but preserve the "Directory" line, as it contains a space in the path C:\Program Files\Test, and removing it would break the build. This is the code I have so far:
foreach ($Line in (Get-Content -Path C:\File.txt) | Where-Object {$_ -notcontains "Directory"}) {
$line -replace " ", ""
Set-Content -Path C:\File.txt
}
But this produces an empty file. What am I doing wrong?
Set-Content receives input either via the pipeline or via the parameter -Value. Your code doesn't provide either, so the cmdlet is writing an empty file. Also, your processing would entirely remove all lines containing the string "Directory" from the output.
Change your code to something like this and the problem will disappear:
(Get-Content 'C:\File.txt') | ForEach-Object {
if ($_ -notlike 'Directory=*') {
$_ -replace ' ', ''
} else {
$_
}
} | Set-Content 'C:\File.txt'
If it's only a matter of removing indentation, you could use the Trim() method:
$fileContent = Get-Content C:\File.txt
$fileContent | ForEach-Object { $_.Trim() } | Set-Content C:\File.txt
Trim() removes only leading and trailing whitespace. There are also TrimStart() and TrimEnd() if you only want to remove leading or trailing whitespace, respectively.
Note: The content of the file is first stored in a variable to release the lock on the file before Set-Content is called. If you use a different output file, you can pipe the result of Get-Content directly into ForEach-Object.
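If it helps to pin down the exact semantics, Python's string methods behave the same way as the three .NET trim methods (the sample string here is invented):

```python
s = "  Hostname=hostname1  "

print(s.strip())   # like Trim(): removes leading and trailing whitespace
print(s.lstrip())  # like TrimStart(): removes leading whitespace only
print(s.rstrip())  # like TrimEnd(): removes trailing whitespace only
```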

PowerShell 2.0 command equivalent to PowerShell 3.0 Get-Content -Raw

I am currently working on a code snippet to convert AS3 files to JS.
Below is my script.
$source = "sample.as"
$dest = "modifiedScript.js"
$raw1 = Get-Content -Path $source | Out-String
$raw1 -replace "extends.*?({)", '{' | Set-Content $dest
Sample.as:
class EquivalentFraction extends MainClass
{
// other codes
function f1(){
}
}
I am trying to get output like this (replace whatever comes between extends and { with {):
class EquivalentFraction
{
// other codes
function f1(){
}
}
The above code does not work when the opening brace is on the next line.
As I am using PowerShell 2.0, I am unable to use Get-Content -Raw to get the contents as a single string instead of an array of lines.
After searching, I came to know that I have to use Out-String instead of the -Raw switch.
But it's not working.
The PowerShell v2 equivalent for Get-Content -Raw is Get-Content | Out-String. The reason why your code doesn't do what you expect has nothing to do with the data import.
You don't get the expected result because of the regular expression you're using. . matches any single character except newlines. Since your data has the opening curly bracket on the line after the line with the extends keyword you do have a newline between extends and {, meaning that extends.*?({) will never match.
You can resolve this by using [\s\S] (which matches any whitespace or non-whitespace character, i.e. anything, including newlines)
$raw1 -replace 'extends[\s\S]*?{', '{'
or by using the "single line mode" option (which makes . match newlines as well)
$raw1 -replace '(?s)extends.*?{', '{'
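Both variants are easy to verify in any regex engine. Here is a quick Python check of the same two patterns against the question's input (the file contents are inlined as a string):

```python
import re

source = "class EquivalentFraction extends MainClass\n{"

# Without single-line mode, '.' does not match the newline, so the
# pattern never reaches the '{' on the next line and nothing changes.
no_match = re.sub(r"extends.*?{", "{", source)

# The (?s) inline flag (DOTALL) makes '.' match newlines too,
# so "extends MainClass\n{" collapses to "{".
matched = re.sub(r"(?s)extends.*?{", "{", source)

print(no_match)  # unchanged
print(matched)   # class EquivalentFraction {
```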

find and delete lines without string pattern in text files

I'm trying to find out how to use PowerShell to find and delete lines without a certain string pattern in a set of files. For example, I have the following text file:
111111
22x222
333333
44x444
This needs to be turned into:
22x222
44x444
given that the string pattern 'x' is not in any of the other lines.
How can I issue such a command in PowerShell to process a bunch of text files?
Thanks.
dir | foreach { $out = cat $_ | select-string x; $out | set-content $_ }
The dir command lists the files in the current directory; the foreach goes through each file; cat reads the file and pipes it into select-string; select-string finds the lines that contain the specified pattern, which in this case is "x"; the result of select-string is stored in $out; and finally, $out is written back to the same file with set-content.
We need the temporary variable $out because you cannot read and write the same file at the same time.
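The filtering step itself translates directly to other languages; here is a minimal Python sketch of the same keep-only-matching-lines logic, with the question's sample lines inlined instead of read from a file:

```python
lines = ["111111", "22x222", "333333", "44x444"]

# Keep only the lines that contain the pattern, mirroring select-string x.
kept = [line for line in lines if "x" in line]
print(kept)  # ['22x222', '44x444']
```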
This will process all txt files in the working directory. Each file's content is checked, and only lines that have 'x' in them are allowed to pass. The result is written back to the file.
Get-ChildItem *.txt | ForEach-Object{
$content = Get-Content $_.FullName | Where-Object {$_ -match 'x'}
$content | Out-File $_.FullName
}