Find All Words Containing a String From a .txt Document - powershell

I'm cleaning up a few dozen old files, and pulling them into one format. I've extracted the headers for each of the sources to a text (.txt) file, and need to identify the different wording for similar column names.
To do this, I need to be able to identify common words in the headers. For example, I need to identify all column names that might be "firstname", "1stname", "first_name", "f_name", etc, etc, etc.
What PowerShell syntax would I use to find all words containing the string "name" (e.g. "First_Name"), and extract that entire word to a separate text file?

Split the content of the file into individual words, then select the words containing the particular string:
(Get-Content 'input.txt') -split '\s+' -match 'name'

Related

Import length-delimited file with PowerShell and export as csv file

I have a source file which is in .txt format. It looks like a semi-colon separated file:
100;200;ThisisastringcolumnA;4;
101;400;Thisisastringc;lumnA;5;
102;600;ThisisastringcolumnB;6;
104;600;Thisisa;;ringcolumnB;6;
However, it is determined by length. So it is a length-delimited file.
Fist column for example is from first value to the third (100), then a semi-colon follows.
Second column starts at 5th position (including), until (including) 7th position. A string column can contain a semi-colon.
Now I want to import this length-delimited txt file with Powershell and export it as a csv file. This file should be really semi-colon separated. The result should look like
100;200;ThisisastringcolumnA;4;
101;400;"Thisisastringc;lumnA";5;
102;600;ThisisastringcolumnB;6;
104;600;"Thisisa;;ringcolumnB";6;
But I have simply no idea how to do it? I googled it, but I did not find that much useful code examples for importing length-delimited txt files with PowerShell.
Unfortunately, I cannot use Python. I am not sure, if this task is generally possible using Powershell? Because when exporting, Powershell also needs to recognize that there are string values containing the separator, so it has to pay attention to the quoting: "Thisisa;;ringcolumnB". I think it would be also ok for me, if the whole column is quoted, so every entry in a string column gets quotes added.
You can use regex to describe a string in which the 3rd "column" contains a ; and then inject the quotation marks with the -replace operator:
$lines = Get-Content path\to\file.txt
#($lines) -replace '(.{3});(.{3});(.{20}(?<=;.{0,19}));(.);', '$1;$2;"$3";$4;'
The expression (.{20}(?<=;.{0,19})) is going to match the 20-char 3rd column value only if it contains at least one semi-colon - so lines with no semicolon in that column will be left alone:
# let's try it out with your test data
$lines = #'
100;200;ThisisastringcolumnA;4;
101;400;Thisisastringc;lumnA;5;
102;600;ThisisastringcolumnB;6;
104;600;Thisisa;;ringcolumnB;6;
'# -split '\r?\n'
#($lines) -replace '(.{3});(.{3});(.{20}(?<=;.{0,19}));(.);', '$1;$2;"$3";$4;'
Which yields the following four strings:
100;200;ThisisastringcolumnA;4;
101;400;"Thisisastringc;lumnA";5;
102;600;ThisisastringcolumnB;6;
104;600;"Thisisa;;ringcolumnB";6;
To write the output back to file, use Set-Content:
#($lines) -replace '(.{3});(.{3});(.{20}(?<=;.{0,19}));(.);', '$1;$2;"$3";$4;' |Set-Content path\to\fixed_output.scsv

Parsing a text file or an html file to create a table

I have a simple issue with a .msg file from outlook, but I discovered that with a code someone helped me with, it was not working since the htmlbody from the .msg file would vary between different emails even though they are from the same source, so my next option was to save the email as a .txt and .html file, since I have no knowledge of html I have no idea how to grab the table which is structured in the html with a . but on the text I found something easy, for example this is data from one table:
Summary
Date
Good mail
Rule matches
Spam
Malware
2019-10-22
4927
4519
2078
0
2019-10-23
4783
4113
1934
0
this is on the text file, Summary is the keyword, and after that key word, the next 5 lines are the columns of the table, after that ,each 5 lines following are the rows, this goes up to 7 rows in total, so headers and then 7 rows.
Now what I want to do is create a table from this text using the 5 first lines after summary as my columns. Since each .msg is different, this 5 columns will change order on each file randomly so I want to avoid this, my best attempt was to use convertfrom-string to create a table , but I have little idea on how to format the table with the conditions set above.
The problem I have is this simple, I have a table on the txt file shown as above, with 5 columns, each column besides the headers contains 7 rows, therei s also the condition that the email since it has more data, I need to stop there nad just grab that part which should be easy.
How can I use convertfrom-string to create the table using those 5 columns , how can I set the delimiter as a new line and how can I set the first 5 lines as the column headers?
I think trying to make this work with ConvertFrom-StringData is adding more work than necessary. But here is an alternative that works with your sample set.
$text = Get-Content -Path File.txt
$formattedText = if ($text[0] -match '^Summary') {
for ($i = 1; $i -lt $text.count; $i+=5 ) {
$text[$i..($i+4)] -join ','
}
}
$fomattedText | ConvertFrom-Csv | ConvertTo-Html
Explanation:
If we assume your text data is in File.txt, Get-Content is used to read the data as an array ($text). If the first line begins with Summary, the file will be parsed.
The for loop is used to skip 5 lines during each iteration until the end of the file. The for loop begins with $text values (indexes 1, 2, 3, 4, and 5) joined together by a ,. Then the index increment ($i) is increased by 5 and the next five index values are joined together. Each increment will create a new line of comma separated values. The reason for the , join is just to use the simple ConvertFrom-Csv later.
ConvertFrom-Csv converts the CSV data into an array of objects ($formattedText) with the first row becoming those objects' properties.
Finally, the array is piped to ConvertTo-Html, which will output all of the objects in a table.
Note: If you want to resize or add extra format to the table, you may need to do that after the code is generated. If your data has commas, you will need a different delimiter when joining the strings. You will then need to add the -Delimiter parameter to the ConvertFrom-Csv with the delimiter you choose.
Adaptation:
The code is fairly flexible. If you need to work with more than five properties, the $i+=5 will need to reflect the number of properties you need to cycle through. The same change needs to apply to $text[$i..($i+4)]. You want the .. to separate two values that differ by your property number.

Store user input as wildcard

I am having some trouble with a data processing function in MATLAB. The function takes the name of the file to be processed as an input, finds the desired files, and reads in the data.
However, several of the desired files are variants, such as Data_00.dat, Data.dat, or Data_1_March.dat. Within my function, I would like to search for all files containing Data and condense them into one usable file for processing.
To solve this, I would like desiredfile to be converted into a wildcard.
Here is the statement I would like to use.
selectedfiles = dir *desiredfile*.dat % Search for file names containing desiredfile
This returns all files containing the variable name desiredfile, rather than the user input.
The only solution that I can think of is writing a separate function that manually condenses all the variants into one file before my function is run, but I am trying to keep the number of files used down and would like to avoid this.
You could concatenate strings for that. Considering desiredFile as a variable.
desiredFile = input('Files: ');
selectedfiles = dir(['*' desiredfile '*.dat']) % Search for file names containing desiredfile
Enclosing strings between square brackets [string1 string2 ... stringN]concatenates them. Matlab's dir function receives a string.
I believe you can achieve that using the dir command.
dataSets = dir('/path/to/dir/containing/Data*.dat');
dataSets = {dataSets.name};
Now simply loop over them, more information here.
To quote the matlab help:
dir lists the files and folders in the MATLABĀ® current folder. Results appear in the order returned by the operating system.
dir name lists the files and folders that match the string name. When name is a folder, dir lists the contents of the folder. Specify name using absolute or relative path names. You can use wildcards (*).

Talend read file with comma separator

I have .csv file and it looks like:
"Account Number","Account","Payment Method","Account Status","State","Batch ID","Transaction Date","Payment Amount","Entered By"
"00010792-8","Max Little","Credit Card","Active","NC","9155","9/3/2014","$70.85","Diane Barr"
"00036360-0","Bill Miller","Cash","Active","NC","9164","9/3/2014","$181.46","Jennifer Lamar"
"00045576-9","Lsw, Inc","Credit Card","Active","NC","9152","9/3/2014","$173.98","Daniel Sheets"
I try to load it with tFileInputDelimited.
In Component->Basic Settings, I choose Field Separator: ","
Unfortunately, 3rd row Account column value looks like "Lsw, Inc", contains the delimiter ","
How to read this file correctly, without splitting text values, that contains symbols "," into columns.
Your CSV appears to be string quoted so it's just a matter of telling Talend that this is the case.
Thankfully this is pretty easy:
Ticking that "CSV options" box will bring up the options for escape characters and text enclosure. Your CSV appears to be fine with just using double quotes so I'd leave it as that unless you see any other peculiarities.

Determine if string exists in file

I have a list of strings such as:
John
John Doe
Peter Pan
in a .txt file.
I want to make a loop that checks if a certain name exists. However, I do not want it to be true if I search for "Peter" and only "Peter Pan" exists. Each line has to be a full match.
Ha ha, ep0's answer is very sophisticated!
However, you want to use a parsing loop something like this (this example expects that your names are separated by carriage returns). Consider that you have a text file with contents arranged like this:
John
Harry
Bob
Joe
Here is your script:
fileread, thistext, %whatfile% ;get the text from the file into a variable
;Now, loop through each line and see if it matches your results:
loop, parse, thistext, `r`n, `r`n
{
if(a_loopfield = "John")
msgbox, Hey! It's John!
else
msgbox, No, it's %a_loopfield%
}
If your names are arranged in a different order, you might have to either change the delimiter for the parsing loop, or use regex instead of just a simple comparison.
If you want to check for multiple names use a trie. If you have just one name, you can use KMP.
I'll explain this for multiple names you want to check that exist, since for only one, the example provided on Wikipedia is more than sufficient and you can apply the same idea.
Construct the said trie from your names you want to find, and for each line in file, traverse the trie character by character until you hit a final node.
BONUS: trie is used by Aho-Corasick algorithm, which is an extension of KMP to multiple patters. Read about it. It's very worthwhile.
UPDATE:
For checking if a single name exists, hash the name you want to find, then read the text file line by line. For each line, hash it with the same function and compare it to the one you want to find. If they are equal, compare the strings character by character. You need to do this to avoid false positives (see hash collisions)