Script to parse FROM email address from many text files

I have a collection of 338 .log files. These are just basic text files and no two files have the same file name (but all file names start with "rrm-"). Here is an example of the data they contain:
Receiving message #1 : OK (4480 bytes)
From: <djerry@domain.com>
Subject: 2-303-468-02
Message-ID: <PRODVAPP21XvCsLCXPI0035acee@prod.domain.com>
Forwarding to "Some User" <someuser@somedomain.com> : OK
I need a script that will open each file one at a time, parse only the "From:" lines (could be 10, could be 1000s) to extract only the email address between the < and > characters, and write the output to a single text file, one email address per line. The rest of the data I don't care about. I also don't care about validating the email addresses. The resulting text file would look like this:
djerry@domain.com
bob@domain.com
tom@blah.com
jerry@yada.com
I'm not a programmer, I only know how to break things when I try. I don't even know what software / utility I would need to use for this. I'm using a Windows 10 computer. So maybe a Powershell script? Sorry for such a n00b question, I really hate feeling stupid for not knowing how to or being able to google for a simple solution. Appreciate any help!

Try the following:
Select-String -Pattern '^From: .*?<(.+?)>' -Path rrm-* |
ForEach-Object { $_.Matches.Groups[1].Value } > output.txt
^From: .*?<(.+?)> is a regex (regular expression) that finds lines that start with From: and captures what follows between < and >.
The .*? part accounts for an (optional) display name preceding the <...>-enclosed email address, as is common; e.g., "Dana Jerry" <djerry@domain.com>. Thanks, TheMadTechnician
$_.Matches.Groups[1].Value retrieves what was captured.
> output.txt saves the results to a file.
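For readers without PowerShell at hand, the same idea can be sketched in Python. This is only an illustration under assumptions: the function names, the directory argument, and the UTF-8 encoding are not from the original answer; only the regex and the rrm-* file prefix are.

```python
import re
from pathlib import Path

# Same regex as above: a line starting with "From: ", then capture
# whatever sits between the first "<" and the next ">".
FROM_RE = re.compile(r"^From: .*?<(.+?)>")


def extract_from_addresses(lines):
    """Yield the address captured from each 'From:' line."""
    for line in lines:
        match = FROM_RE.match(line)
        if match:
            yield match.group(1)


def collect(log_dir, out_file):
    """Scan every rrm-* file in log_dir and write one address per line.

    The encoding is an assumption; undecodable bytes are replaced.
    """
    with open(out_file, "w", encoding="utf-8") as out:
        for path in sorted(Path(log_dir).glob("rrm-*")):
            with open(path, encoding="utf-8", errors="replace") as log:
                for address in extract_from_addresses(log):
                    out.write(address + "\n")
```

Calling collect(".", "output.txt") from the folder that holds the logs would mirror the one-liner above.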

Related

Powershell: Compare filenames to a list of formats

I am not looking for a working solution but rather an idea to push me in the right direction as I was thrown a curveball I am not sure how to tackle.
Recently I wrote a script that used the split command to check the first part of the file name against the folder name. This was successful, so now there is a new ask: check all the files against the naming matrix. The problem is that there are 50+ file formats on the list.
For example, the format of a document would be ID-someID-otherID-date.xls
A format such as 12345678-xxxx-1234abc.xls would then define the number of characters per section, so I can check that each section of a file name has the correct number of characters and spot typos etc.
Is there any other reasonable way of tackling this besides regex? I was thinking of using multiple splits on the hyphens, but I don't really have anything to reference the result against other than the number of characters required in each part.
As always, any (even vague) pointers are most welcome ;-)

Although I would use a regex (as also commented by zett42), there is indeed another way, which is to use the ConvertFrom-String cmdlet with a template:
$template = @'
{[Int]Some*:12345678}-{[String]Other:xxxx}-{[DateTime]Date:2022-11-18}.xls
{[Int]Some*:87654321}-{[String]Other:yyyy}-{[DateTime]Date:18Nov2022}.xls
'@
'23565679-iRon-7Oct1963.xls' | ConvertFrom-String -TemplateContent $template
Some : 23565679
Other : iRon
Date : 10/7/1963 12:00:00 AM
RunspaceId : 3bf191e9-8077-4577-8372-e77da6d5f38d
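The split-on-hyphens idea from the question can also be sketched without a regex. The following Python is a rough illustration, not the answerer's method: it assumes each allowed format is given as one example file name, and the helper names shape and matches_any are made up for this sketch.

```python
from pathlib import PurePath


def shape(name):
    """Reduce a file name to a tuple of (length, kind) per hyphen-separated part."""
    def kind(part):
        if part.isdigit():
            return "digits"
        if part.isalpha():
            return "letters"
        return "mixed"

    stem = PurePath(name).stem  # file name without the extension
    return tuple((len(part), kind(part)) for part in stem.split("-"))


def matches_any(name, example_names):
    """True if name has the same extension and shape as at least one example."""
    suffix = PurePath(name).suffix.lower()
    return any(
        PurePath(example).suffix.lower() == suffix and shape(example) == shape(name)
        for example in example_names
    )
```

With the example 12345678-xxxx-1234abc.xls as the only format, a name with a 7-digit first part would be rejected, which is the kind of typo the question wants to spot.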

Perl : How to extract unique entries of a text file

I am totally a beginner in Perl. I have a large file (around 100 G) which looks like this:
domain, ip
"www.google.ac.",173.194.33.111
"www.google.ac.",173.194.33.119
"www.google.ac.",173.194.33.120
"www.google.ac.",173.194.33.127
"www.google.ac.",173.194.33.143
"apple.com., 173.194.33.143
"studio.com.", 173.194.33.143
"www.google.ac.",101.78.156.201
"www.google.ac.",101.78.156.201
So basically I have: 1) duplicate lines, 2) one IP with different domains, and 3) one domain with different IPs. I would like to remove the duplicate lines from the file (the ones with the same domain,ip pair).
I have already reviewed other answers to the same question, but none of them address my problem with large files.
Does anybody have a clue how I can do this in Perl? Or any suggestion for a more suitable language?

The easiest thing to do is read the file a line at a time and use each line as the key of a hash. You have to have memory to store each unique line once, though. There's no getting around that.
Here's a one-liner as run from the shell:
perl -ne '$lines{$_}++; END { print keys %lines }' filename
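The same hash-based approach, sketched in Python for comparison. The function names and the UTF-8 encoding are assumptions for illustration. Like the one-liner, it has to hold every unique line in memory; unlike printing the hash keys, it writes lines in first-seen order.

```python
def unique_lines(lines):
    """Yield each distinct line once, in first-seen order."""
    seen = set()
    for line in lines:
        # Lines are compared verbatim, so a final line without a trailing
        # newline counts as different from the same text with one.
        if line not in seen:
            seen.add(line)
            yield line


def dedupe_file(src_path, dst_path):
    """Stream src_path line by line and write the unique lines to dst_path."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(unique_lines(src))
```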

Maildrop: Filter mail by Date: header

I'm using a getmail + maildrop + mutt + msmtp chain with messages stored in Maildir. My very big inbox bothers me, so I want to organize mail by date like this:
Maildir
|-2010.11->all messages with "Date: *, * Nov 2010 *"
|-2010.12->same as above...
|-2011.01
`-2011.02
I've googled a lot and read about the mailfilter language, but I still find it hard to write such a filter. Maildrop's mailing list archives have almost nothing on this (as far as I could tell from scanning them). There is a partial solution at https://unix.stackexchange.com/questions/3092/organize-email-by-date-using-procmail-or-maildrop, but I don't like it, because I want to use the "Date:" header and sort by month as "YEAR.MONTH" in digits.
Any help, thoughts, links, materials will be appreciated.

Using mostly man pages, I came up with the following solution for use on Ubuntu 10.04. Create a mailfilter file called, for example, mailfilter-archive with the following content:
DEFAULT="$HOME/mail-archive"
MAILDIR="$DEFAULT"
# Uncomment the following to get logging output
#logfile $HOME/tmp/maildrop-archive.log
# Create maildir folder if it does not exist
`[ -d $DEFAULT ] || maildirmake $DEFAULT`
if (/^date:\s+(.+)$/)
{
    datefile=`date -d "$MATCH1" +%Y-%m`
    to $DEFAULT/$datefile
}
# In case the message is missing a date header, send it to a default mail file
to $DEFAULT/inbox
This uses the date command, taking the date header content as input (assuming it is in RFC-2822 format) and producing a formatted date to use as the mail file name.
Then execute the following on existing mail files to archive your messages:
cat mail1 mail2 mail3 mail4 | reformail -s maildrop mailfilter-archive
If the mail-archive contents look good, you could remove the mail1, mail2, mail3, mail4, etc. mail files.
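The date-to-folder step can be illustrated in isolation with a small Python sketch. It is not part of the maildrop filter; the function name archive_folder and its parameters are assumptions. It uses the standard library's RFC 2822 date parser and falls back to a default folder when the header cannot be parsed, as the filter above does for a missing header.

```python
from email.utils import parsedate_to_datetime


def archive_folder(date_header, fmt="%Y-%m", fallback="inbox"):
    """Map a Date: header value to a folder name such as '2010-11'.

    Returns fallback when the header is not a parsable RFC 2822 date.
    """
    try:
        return parsedate_to_datetime(date_header).strftime(fmt)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError for unparsable input,
        # newer ones raise ValueError.
        return fallback
```

Passing fmt="%Y.%m" would give the "YEAR.MONTH" naming asked for in the question.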

Move emails with procmail if it matches from sender

As I'm using different email clients to read and send my mail, I want to set up procmail to move my emails to a folder, which is normally done by Thunderbird's filter feature.
I know that I can do it by using the following procmail code in my email user's .procmailrc file:
:0:
* ^From:.test@host.name.com
myfolder
But I have a list of about 50 email addresses whose mail I would like to move to that specific "myfolder".
So by using
:0:
* ^From:.first@mail.com
* ^From:.second@mail.com
jimsmail
doesn't help, because procmail combines the conditions with the AND operator. So the recipe above would only match if From is first@... AND second@..., which will never be true.
So how do I use the OR operator?
Actually, I have a simple text file that contains all the email addresses.
It would be cool to have a feature where procmail reads that file, checks whether From matches at least one of its lines, and then moves the email to "myfolder".
Something like
:0:
* ^From:file(email.txt)
myfolder
Does anybody know if this or something similar is possible?
I don't want to add these 3 lines 50 times to my procmailrc file.

Procmail uses regexps, so you can separate addresses with the | character.
:0:
* ^From:.((first|second|third)@mail.com|(fourth|fifth)@othermail.com)
myfolder
would work. Could get a little messy with fifty all on one line, mind...

I found the solution.
With this solution I'm able to use a simple text file holding all the email addresses, one per line.
The code in my .procmailrc is as follows:
EMAILFILE=/path/to/my/emailfile
FROM=`formail -xFrom: | sed -e 's/ *(.*)//; s/>.*//; s/.*[:<] *//'`
:0
* ? fgrep -qxis $FROM $EMAILFILE
myfolder
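What the recipe checks (is the sender's address one of the lines in the file, ignoring case?) can be sketched in Python as follows. This is an illustration only, not something procmail runs; the function names are made up, and the address file is assumed to hold one address per line.

```python
from email.utils import parseaddr


def load_addresses(path):
    """Read one address per line into a lowercase set, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip().lower() for line in handle if line.strip()}


def sender_is_listed(from_header, addresses):
    """True if the address in a From: header value is in the set.

    parseaddr strips the display name and angle brackets; the comparison
    is case-insensitive and on the whole address.
    """
    _name, address = parseaddr(from_header)
    return address.lower() in addresses
```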

PHP - How to identify e-mail addresses from input containing lines of misc data

Apologizing in advance for yet another email pattern matching query.
Here is what I have so far:
$text = strtolower($intext);
$lines = preg_split("/[\s]*[\n][\s]*/", $text);
$pattern = '/[A-Za-z0-9_-]+@[A-Za-z0-9_-]+\.([A-Za-z0-9_-][A-Za-z0-9_]+)/';
$pattern1= '/^[^@]+@[a-zA-Z0-9._-]+\.[a-zA-Z]+$/';
foreach ($lines as $email) {
    preg_match($pattern,$email,$goodies);
    $goodies[0]=filter_var($goodies[0], FILTER_SANITIZE_EMAIL);
    if(filter_var($goodies[0], FILTER_VALIDATE_EMAIL)){
        array_push($good,$goodies[0]);
    }
}
$pattern works fine, but .rr.com addresses (and more, I am sure) are stripped of the .com
$pattern1 only grabs emails that are on a line by themselves.
I am pasting a whole page of miscellaneous text into a textarea; it contains some emails from an old data file I am trying to recover.
Everything works great except for the emails with more than one "." either before or after the "@".
I am sure there must be more issues as well.
I have tried several patterns I found, as well as some I tried to write myself.
Can someone show me the light here before I pull my remaining hair out?

How about this?
/((?:\w+[.]*)*(?:\+[^@ \t]*)?@(?:\w+[.])+\w+)/
Explanation: (?:\w+[.]*)* recognizes 0 or more runs of word characters (alphanumeric + _), each optionally followed by a run of periods. Next, (?:\+[^@ \t]*)? optionally recognizes a plus sign followed by zero or more characters that are not an at sign, a space, or a tab. Then we have the @ sign, and finally (?:\w+[.])+\w+, which matches a sequence of word character strings separated by periods and ending in a word character string (i.e., [subdomain.]domain.topleveldomain).
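To show the "several dots on either side of the @" case in action, here is a small Python sketch. Note that it uses a simplified variant of the pattern, not the exact one above: a single character class for the local part avoids nested quantifiers. The names EMAIL_RE and find_emails are made up for this example, and like the question, it makes no attempt at full address validation.

```python
import re

# Local part: word characters, dots, plus signs, hyphens.
# Domain: one or more dot-terminated labels, then a final label.
EMAIL_RE = re.compile(r"[\w.+-]+@(?:[\w-]+\.)+[\w-]+")


def find_emails(text):
    """Return the distinct addresses found in text, lowercased, in order."""
    found = []
    for candidate in EMAIL_RE.findall(text.lower()):
        if candidate not in found:
            found.append(candidate)
    return found
```

Because the domain part repeats the "label plus dot" group, an address such as one ending in .rr.com keeps all of its labels.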
