How can I get just a part of XML node text?
I have this piece of XML:
<CorpusLink>../Metadata/A_short_autobiography_of_Herculino_Alves.xml</CorpusLink>
<CorpusLink >../Metadata/Wordlist_and_phrases_-_modifiers.xml</CorpusLink>
<CorpusLink >../desano-silva-0151/Metadata/Wordlist_fruits_and_cultural_items.xml</CorpusLink>
<CorpusLink >../desano-silva-0151/Metadata/The_Turtle_and_the_Deer.xml</CorpusLink>
<CorpusLink >../desano-silva-0151/Metadata/Wordlist_and_phrases_parts_of_a_tree.xml</CorpusLink>
<CorpusLink >../desano-silva-0151/Metadata/Wordlist_and_phrases_.xml</CorpusLink>
I need to extract only this piece of text in each one:
../Metadata
../desano-silva-0151/Metadata
I have this code :
$j = 0
$TrgContent.METATRANSCRIPT.Corpus.CorpusLink | ForEach-Object {
[String]$_.'#text'= % {$alltext[$j] + "xml" $j++}}
But it gives me all the text:
../Metadata/A_short_autobiography_of_Herculino_Alves.xml
../desano-silva-0151/Metadata/Wordlist_fruits_and_cultural_items.xml
Thanks in advance for any help.
To achieve what you asked, I think there are two main steps:
Extract the content of XML nodes.
Trim the content and take what you need only.
I'm not really familiar with your existing script, so I will explain both steps here. The first step is optional for you.
Extract content of XML nodes
My example XML document:
<Corpus>
<CorpusLink>../Metadata/A_short_autobiography_of_Herculino_Alves.xml</CorpusLink>
<CorpusLink>../Metadata/Wordlist_and_phrases_-_modifiers.xml</CorpusLink>
<CorpusLink>../desano-silva-0151/Metadata/Wordlist_fruits_and_cultural_items.xml</CorpusLink>
<CorpusLink>../desano-silva-0151/Metadata/The_Turtle_and_the_Deer.xml</CorpusLink>
<CorpusLink>../desano-silva-0151/Metadata/Wordlist_and_phrases_parts_of_a_tree.xml</CorpusLink>
<CorpusLink>../desano-silva-0151/Metadata/Wordlist_and_phrases_.xml</CorpusLink>
</Corpus>
PS script to get the content:
[xml] $XmlDocument = Get-Content D:\Path_To_Your_File
$XmlDocument.Corpus.CorpusLink # Content of the nodes you need
Trim the content
There are many methods but I think I will go with regex. Simply loop through all the contents and run the regex.
$XmlDocument.Corpus.CorpusLink | ForEach-Object {
if ($_ -match "\.\.\/.*?\/") {
$Matches.Values
}
}
About the regex: it matches ../ followed by as few characters as possible (excluding line terminators), up to and including the next /:
\.\. # Escape for 2 dots `..`
\/ # Escape for slash `/`
.*? # Lazily takes any characters except line terminators between the tokens above and below
\/ # Escape for slash `/`
I assume the structure of these strings is stable like that, hence the regex.
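Note that the pattern above stops at the first slash after `../`, so for the longer links it returns `../desano-silva-0151/` rather than `../desano-silva-0151/Metadata`. If the goal is everything before the file name, one alternative is to strip the last path segment instead. This is only a minimal sketch, assuming every link ends in `/<file name>`; the `$links` array is a stand-in for `$XmlDocument.Corpus.CorpusLink`:

```powershell
# Sample values standing in for $XmlDocument.Corpus.CorpusLink (assumption).
$links = @(
    '../Metadata/A_short_autobiography_of_Herculino_Alves.xml'
    '../desano-silva-0151/Metadata/The_Turtle_and_the_Deer.xml'
)

# Remove the final "/<file name>" segment; the rest of the path is kept.
$folders = $links -replace '/[^/]*$', ''
$folders
```

With the two sample links this should print `../Metadata` and `../desano-silva-0151/Metadata`.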
I have a source file which is in .txt format. It looks like a semi-colon separated file:
100;200;ThisisastringcolumnA;4;
101;400;Thisisastringc;lumnA;5;
102;600;ThisisastringcolumnB;6;
104;600;Thisisa;;ringcolumnB;6;
However, it is determined by length. So it is a length-delimited file.
The first column, for example, runs from the first to the third character (100), and then a semicolon follows.
The second column starts at the 5th position and runs through the 7th position (both inclusive). A string column can contain a semicolon.
Now I want to import this length-delimited txt file with Powershell and export it as a csv file. This file should be really semi-colon separated. The result should look like
100;200;ThisisastringcolumnA;4;
101;400;"Thisisastringc;lumnA";5;
102;600;ThisisastringcolumnB;6;
104;600;"Thisisa;;ringcolumnB";6;
But I simply have no idea how to do it. I googled it, but I did not find many useful code examples for importing length-delimited txt files with PowerShell.
Unfortunately, I cannot use Python. I am not sure, if this task is generally possible using Powershell? Because when exporting, Powershell also needs to recognize that there are string values containing the separator, so it has to pay attention to the quoting: "Thisisa;;ringcolumnB". I think it would be also ok for me, if the whole column is quoted, so every entry in a string column gets quotes added.
You can use regex to describe a string in which the 3rd "column" contains a ; and then inject the quotation marks with the -replace operator:
$lines = Get-Content path\to\file.txt
@($lines) -replace '(.{3});(.{3});(.{20}(?<=;.{0,19}));(.);', '$1;$2;"$3";$4;'
The expression (.{20}(?<=;.{0,19})) is going to match the 20-char 3rd column value only if it contains at least one semi-colon - so lines with no semicolon in that column will be left alone:
# let's try it out with your test data
$lines = @'
100;200;ThisisastringcolumnA;4;
101;400;Thisisastringc;lumnA;5;
102;600;ThisisastringcolumnB;6;
104;600;Thisisa;;ringcolumnB;6;
'@ -split '\r?\n'
@($lines) -replace '(.{3});(.{3});(.{20}(?<=;.{0,19}));(.);', '$1;$2;"$3";$4;'
Which yields the following four strings:
100;200;ThisisastringcolumnA;4;
101;400;"Thisisastringc;lumnA";5;
102;600;ThisisastringcolumnB;6;
104;600;"Thisisa;;ringcolumnB";6;
To write the output back to file, use Set-Content:
@($lines) -replace '(.{3});(.{3});(.{20}(?<=;.{0,19}));(.);', '$1;$2;"$3";$4;' |Set-Content path\to\fixed_output.scsv
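If the column widths are fixed, another option is to cut each line by position and quote the string column unconditionally, which the question says would be acceptable. The following is only a sketch, assuming widths of 3, 3, 20 and 1 characters with a one-character separator after each field (inferred from the sample data, not confirmed by the asker):

```powershell
$lines = @(
    '100;200;ThisisastringcolumnA;4;'
    '104;600;Thisisa;;ringcolumnB;6;'
)

$fixed = foreach ($line in $lines) {
    # Substring arguments are (zero-based start, length).
    $c1 = $line.Substring(0, 3)
    $c2 = $line.Substring(4, 3)
    $c3 = $line.Substring(8, 20)
    $c4 = $line.Substring(29, 1)
    # Always quote the string column so embedded semicolons are safe.
    '{0};{1};"{2}";{3};' -f $c1, $c2, $c3, $c4
}
$fixed
```

Unlike the regex, this quotes every value in the third column, whether or not it contains a semicolon.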
I want to add text after exactly 20 characters, including blanks. Does someone have a short solution with Add-Content, or can someone post a link where I can read about a way to do so?
My file looks something like this:
/path1/path1/path1 /path2/path2/path2 /path3/path3/path3
Then an application will read these paths (it is not my application and I cannot edit it in any way). It reads them by position, so if the second path starts 10 characters later it won't recognize it. That means I cannot simply replace or edit a path, since the paths do not always have the same length. Don't ask me why the application reads it that way.
So I need to add a string at the start, then the next string at exactly character 20, and then the next at character 40.
You could use the regex -replace operator to inject a new substring after 20 characters:
PS ~> $inject = "Hello Manuel! ..."
PS ~> $string = "Injected text goes: and then there's more"
PS ~> $string -replace '(?<=^.{20})',$inject
Injected text goes: Hello Manuel! ...and then there's more
The regex pattern (?<=^.{20}) describes a position in the string where exactly 20 characters occur between the start of the string and the current position, and the -replace operator then replaces the empty string at said position with the value in $inject
This did it for me
$data.PadRight(20," ") | Out-File -FilePath F:\test\path.txt -NoNewline -Append
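For the layout described in the question (a path starting at position 0, 20 and 40), padding each path to a fixed width and joining the results is one way to keep the positions stable. A sketch, assuming no path is longer than 19 characters; the `$paths` values are placeholders, not the asker's real paths:

```powershell
# Placeholder paths of different lengths (assumption).
$paths = '/path1/a', '/path2/bb', '/path3/ccc'

# Pad every path to 20 characters, then concatenate into one line.
$line = -join ($paths | ForEach-Object { $_.PadRight(20) })

$line.IndexOf('/path2')   # index where the second path starts
$line.IndexOf('/path3')   # index where the third path starts
```

`PadRight` only pads and never truncates, so a path of 20 or more characters would still shift everything after it.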
Been struggling for a few hours. I'm trying to hit this link
That has these contents:
# Generated on: Python 3.6.12
# With snowflake-connector-python version: 2.4.6
asn1crypto==1.4.0
azure-common==1.1.27
azure-core==1.15.0
azure-storage-blob==12.8.1
boto3==1.17.98
botocore==1.20.98
certifi==2021.5.30
cffi==1.14.5
chardet==4.0.0
cryptography==3.4.7
dataclasses==0.8
idna==2.10
isodate==0.6.0
jmespath==0.10.0
msrest==0.6.21
oauthlib==3.1.1
oscrypto==1.2.1
pycparser==2.20
pycryptodomex==3.10.1
PyJWT==2.1.0
pyOpenSSL==20.0.1
python-dateutil==2.8.1
pytz==2021.1
requests==2.25.1
requests-oauthlib==1.3.0
s3transfer==0.4.2
six==1.16.0
urllib3==1.26.5
And pass each non-commented out line to Poetry (a command line tool for Python dependency management)
This is my first step
(iwr https://raw.githubusercontent.com/snowflakedb/snowflake-connector-python/v2.5.1/tested_requirements/requirements_36.reqs | Select-Object).Content > req.txt
Where I'm struggling is that I've tried ConvertFrom-String with various delimiters, and .Split(), and I can't seem to parse out the pieces I need. Poetry needs to take as input just the "packagename==", although the version number is optional. So I essentially want to ignore lines that start with a "#" and then pass each line through a pipe, or even save it as an array. It doesn't seem to respond to setting the delimiter to "\t" or carriage return "`r".
So next I would do something like
foreach($package in $package_array){poetry add $package}
Any help would be appreciated.
AdminOfThings provided a good pointer in a comment, but let me try to put it all together:
$url = 'https://raw.githubusercontent.com/snowflakedb/snowflake-connector-python/v2.5.1/tested_requirements/requirements_36.reqs'
foreach ($pkgLine in (irm $url).Trim() -split '\r?\n' -notmatch '^\s*#') {
# Remove `Write-Host` to perform the actual poetry call.
Write-Host poetry add ($pkgLine -replace '=.*')
}
irm is the built-in alias for Invoke-RestMethod, which is a simpler alternative to Invoke-WebRequest (iwr) in this case, because it directly returns the text of interest, as a multi-line string.
As an aside: the | Select-Object in your code is effectively a no-op and can be omitted.
.Trim() trims a trailing newline (all trailing whitespace).
-split '\r?\n' splits the string into individual lines.
-notmatch '^\s*#' filters out all lines that start with #, optionally preceded by whitespace.
-replace '=.*' removes everything starting with = from each package line.
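If you would rather keep the pinned versions and hold the lines in an array first, as the question suggests, the same operators can be chained without the `-replace`. A sketch using a short inline sample in place of the downloaded text (the sample content and the blank-line filter are assumptions added for illustration):

```powershell
# Sample text standing in for the downloaded file (assumption).
$text = @'
# Generated on: Python 3.6.12
asn1crypto==1.4.0
azure-common==1.1.27
'@

# Split into lines, then drop comment lines and blank lines.
$package_array = @($text.Trim() -split '\r?\n' -notmatch '^\s*(#|$)')
$package_array
```

Each element of `$package_array` can then be passed on in the `foreach` loop from the question.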
If I run the below code, $SRN can be written as output or added to another variable, but trying to include either another variable or regular text causes it to be overwritten from the beginning of the line. I'm assuming it's something to do with how I'm assigning $autocode and $SRN initially but can't tell what it's trying to do.
# Load the property set to allow us to get to the email body.
$item.load($psPropertySet) # Load the data.
$bod = ($item.Body.Text -creplace '(?m)^\s*\r?\n','') -split "\n" # Get the body text, remove blank lines, split on line breaks to create an array (otherwise it is a single string).
$autocode = $bod[4].split('-')[2] # Get line 4 (should be Title), split on dash, look for 3rd element, this should contain our automation code.
$SRN = $bod[1] -replace 'ID: ','' # Get line 2 (should be ID), find and replace the preceding text.
# Skip processing if autocode does not match our list of handled ones.
if ($autocode -cin $autocodes)
{
write-host "$SRN $autocode"
write-host "$autocode $SRN"
write-host "$SRN test"
$var = "$SRN $autocode"
$var
}
The code results in this, you can see if $SRN isn't at the start of the line it is fine. Unsure where the extra spaces come from either:
KRNE8385
KRNE SR1788385
test8385
KRNE8385
I would expect to see this:
SR1788385 KRNE
KRNE SR1788385
SR1788385 test
SR1788385 KRNE
LotPings pointed me down the right path, both variables still had either "0D" or "\r" in them. My regex replace was only getting rid of them on blank lines, and I split the array on "\n" only. Changing line 3 in the original code to the below appears to have resolved the issue. First time seeing Format-Hex, but it appears to be excellent for troubleshooting such issues.
$bod = ($item.Body.Text -creplace '(?m)^\s*\r?\n','') -split "\r\n"
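The effect can be reproduced without a mailbox. The sketch below uses a made-up body string with Windows line endings (an assumption about what the mail body contains) and compares the two ways of splitting:

```powershell
# Simulated body text with CRLF line endings (assumption).
$body = "ID: SR1788385`r`nTitle: a-b-KRNE`r`n"

# Splitting on "\n" alone leaves a trailing carriage return on each line.
$bad  = ($body -split "\n")[0] -replace 'ID: ', ''
# Splitting on an optional CR followed by LF removes it.
$good = ($body -split '\r?\n')[0] -replace 'ID: ', ''

$bad.Length    # one more than the visible characters, because of the hidden CR
$good.Length   # just the visible characters
```

When `$bad` is written to the console followed by more text, the carriage return moves the cursor back to the start of the line, which is why the later text overwrote the beginning of the SRN.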
I have a binary file that I need to process, but it contains no line breaks in it.
The data is arranged, within the file, into 104 character blocks and then divided into its various fields by character count alone (no delimiting characters).
I'd like to firstly process the file, so that there is a line break (`n) every 104 characters, but after much web searching and a lot of disappointment, I've found nothing useful yet. (Unless I ditch PowerShell and use awk.)
Is there a Split option that understands character counts?
Not only would it allow me to create the file with nice easy to read lines of 104 chars, but it would also allow me to then split these lines into their component fields.
Can anyone help please, without *nix options?
Cheers :)
$s = get-content YourFileName | Out-String
$a = $s.ToCharArray()
$a[0..103] # will return an array of first 104 chars
You can get your string back by joining the characters. (Casting the array to [string] joins the elements with spaces, and stripping those afterwards would also delete genuine spaces in the data.)
$ns = $a[0..103] -join ''
Using the V4 Where method with Split option:
$text = 'abcdefghi'
While ($text)
{
$x,$text = ([char[]]$text).where({$_},'Split',3)
$x -join ''
}
abc
def
ghi
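A plain `Substring` loop is another way to cut a string into fixed-size pieces, and it also works on older PowerShell versions. This is a sketch with a short sample string; `$size` would be 104 for the file described in the question:

```powershell
$text = 'abcdefghij'
$size = 3   # use 104 for the file described in the question

$chunks = for ($i = 0; $i -lt $text.Length; $i += $size) {
    # The last chunk may be shorter than $size.
    $text.Substring($i, [Math]::Min($size, $text.Length - $i))
}
$chunks
```

The same `Substring` call can then be applied to each chunk to pull out the individual fields by position.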