Read message body of an email using Apache Nifi - email

Is it possible to retrieve the body content of email, email header details and email attachments in Single step using Apache Nifi.
If so Please help me how to achieve this.

It is not possible in a single step unless you write your own processor or script (using ExecuteScript or InvokeScriptedProcessor). However it is possible in a single flow with something like the following:
ConsumePOP3 -> ExtractEmailHeaders -> ExtractEmailAttachments -> ...
At the end of the flow above, you will have one flow file per attachment, each flow file containing the email headers as attributes and the attachment as the content.

You can use the processor "ExecuteScript", not developing custom processor.
import email
import mimetypes
from email.parser import Parser
from org.apache.commons.io import IOUtils
from java.nio.charset import StandardCharsets
from java.io import BufferedReader, InputStreamReader
from org.apache.nifi.processors.script import ExecuteScript
from org.apache.nifi.processor.io import InputStreamCallback
from org.apache.nifi.processor.io import StreamCallback
class PyInputStreamCallback(InputStreamCallback):
_text = None
def __init__(self):
pass
def getText(self) :
return self._text
def process(self, inputStream):
self._text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
flowFile = session.get()
if flowFile is not None :
reader = PyInputStreamCallback()
session.read(flowFile, reader)
msg = email.message_from_string(reader.getText())
body = ""
if msg.is_multipart():
for part in msg.walk():
ctype = part.get_content_type()
cdispo = str(part.get('Content-Disposition'))
if ctype == 'text/plain' and 'attachment' not in cdispo:
body = part.get_payload(decode=True) # decode
break
else:
body = msg.get_payload(decode=True)
flowFile = session.putAttribute(flowFile, 'msgbody', body.decode('utf-8', 'ignore'))
session.transfer(flowFile, ExecuteScript.REL_SUCCESS)
Screenshot

The gist above posted by #Thomas mostly worked for me.
However, there were cases where the multipart mail was multi-level, i.e. one of the parts was also multipart. To address that, you could use this version of the getTextFromMessage method instead:
private String getTextFromMessage(Part part) throws MessagingException, IOException {
String result = null;
if (part.isMimeType("text/plain")){
// If the part is a plaintext message, just return the content as a String
Object content = part.getContent();
if (content instanceof String) {
result = (String)content;
}
} else if (part.isMimeType("multipart/*")) {
// If the part is a multi-part message, iterate over the sub-parts
MimeMultipart mimeMultipart = (MimeMultipart)part.getContent();
int count = mimeMultipart.getCount();
for (int i = 0; i < count; i ++){
final BodyPart bodyPart = mimeMultipart.getBodyPart(i);
// Inserted from here...
final String text = getTextFromMessage( bodyPart );
if (text != null) {
result = text;
break;
}
}
}
// Just to be safe...
result = (result != null) ? result : "";
return result;
}
Note that it is possible that there is not even a "html/plain" part at all in the email, so you may want to think about how to handle that case as well. For instance, you could loop over the parts again to extract the "text/html" part (if it exists) and deal with that, and so on.

Related

MalformedInputException: Input length = 1 while reading text file with Files.readAllLines(Path.get("file").get(0);

Why am I getting this error? I'm trying to extract information from a bank statement PDF and tally different bills for the month. I write the data from a PDF to a text file so I can get specific data from the file (e.g. ASPEN HOME IMPRO, then iterate down to what the dollar amount is, then read that text line to a string)
When the Files.readAllLines(Path.get("bankData").get(0) code is run, I get the error. Any thoughts why? Encoding issue?
Here is the code:
public static void main(String[] args) throws IOException {
File file = new File("C:\\Users\\wmsai\\Desktop\\BankStatement.pdf");
PDFTextStripper stripper = new PDFTextStripper();
BufferedWriter bw = new BufferedWriter(new FileWriter("bankData"));
BufferedReader br = new BufferedReader(new FileReader("bankData"));
String pdfText = stripper.getText(Loader.loadPDF(file)).toUpperCase();
bw.write(pdfText);
bw.flush();
bw.close();
LineNumberReader lineNum = new LineNumberReader(new FileReader("bankData"));
String aspenHomeImpro = "PAYMENT: ACH: ASPEN HOME IMPRO";
String line;
while ((line = lineNum.readLine()) != null) {
if (line.contains(aspenHomeImpro)) {
int lineNumber = lineNum.getLineNumber();
int newLineNumber = lineNumber + 4;
String aspenData = Files.readAllLines(Paths.get("bankData")).get(0); //This is the code with the error
System.out.println(newLineNumber);
break;
} else if (!line.contains(aspenHomeImpro)) {
continue;
}
}
}
So I figured it out. I had to check the properties of the text file in question (I'm using Eclipse) to figure out what the actual encoding of the text file was.
Then, when creating the file in the program, encode the text file to UTF-8 so that Files.readAllLines could read and grab the data I wanted to get.

Using a Beakerx Custom Magic

I've created a custom Magic command with the intention of generating a spark query programatically. Here's the relevant part of my class that implements the MagicCommandFunctionality:
MagicCommandOutcomeItem execute(MagicCommandExecutionParam magicCommandExecutionParam) {
// get the string that was entered:
String input = magicCommandExecutionParam.command.substring(MAGIC.length())
// use the input to generate a query
String generatedQuery = Interpreter.interpret(input)
MIMEContainer result = Text(generatedQuery);
return new MagicCommandOutput(MagicCommandOutcomeItem.Status.OK, result.getData().toString());
}
This works splendidly. It returns the command that I generated. (As text)
My question is -- how do I coerce the notebook into evaluating that value in the cell? My guess is that a SimpleEvaluationObject and TryResult are involved, but I can't find any examples of their use
Rather than creating the MagicCommandOutput I probably want the Kernel to create one for me. I see that the KernelMagicCommand has an execute method that would do that. Anyone have any ideas?
Okay, I found one way to do it. Here's my solution:
You can ask the current kernelManager for the kernel you're interested in,
then call PythonEntryPoint.evaluate. It seems to do the job!
#Override
MagicCommandOutcomeItem execute(MagicCommandExecutionParam magicCommandExecutionParam) {
String input = magicCommandExecutionParam.command.substring(MAGIC.length() + 1)
// this is the Scala code I want to evaluate:
String codeToExecute = <your code here>
KernelFunctionality kernel = KernelManager.get()
PythonEntryPoint pep = kernel.getPythonEntryPoint(SCALA_KERNEL)
pep.evaluate(codeToExecute)
pep.getShellMsg()
List<Message> messages = new ArrayList<>()
//until there are messages on iopub channel available collect them into response
while (true) {
String iopubMsg = pep.getIopubMsg()
if (iopubMsg == "null") break
try {
Message msg = parseMessage(iopubMsg) //(I didn't show this part)
messages.add(msg)
String commId = (String) msg.getContent().get("comm_id")
if (commId != null) {
kernel.addCommIdManagerMapping(commId, SCALA_KERNEL)
}
} catch (IOException e) {
log.error("There was an error: ${e.getMessage()}")
return new MagicKernelResponse(MagicCommandOutcomeItem.Status.ERROR, messages)
}
}
return new MagicKernelResponse(MagicCommandOutcomeItem.Status.OK, messages)
}

Send Email fields not rendering in Sitecore Web Forms For Marketers

I have an issue with the WFFM Send Email Message save action (Sitecore 6.5.0). I'm trying to send an email that includes the form placeholders from the "Insert Field" dropdown in the Send Email editor. Sometimes the fields will render correctly, but most times the email will include the placeholder text instead of the field's actual value.
For example, this is the email that is coming through:
First Name: [First Name]
Last Name: [Last Name]
Email: [Email Address]
Company Name: [Company Name]
Phone Number: [Phone Number]
I think it has to do with the Send Email editor using a rich text editor for the email template, but I've tried adjusting the message's HTML to no avail. This is what the markup looks like: (the <p> tags and labels used to be inline, but that didn't work either)
<p>First Name:
[<label id="{F49F9E49-626F-44DC-8921-023EE6D7948E}">First Name</label>]
</p>
<p>Last Name:
[<label id="{9CE3D48C-59A0-432F-B6F1-3AFD03687C94}">Last Name</label>]
</p>
<p>Email:
[<label id="{E382A37E-9DF5-4AFE-8780-17169E687805}">Email Address</label>]
</p>
<p>Company Name:
[<label id="{9C08AC2A-4128-47F8-A998-12309B381CCD}">Company Name</label>]
</p>
<p>Phone Number:
[<label id="{4B0C5FAC-A08A-4EF2-AD3E-2B7FDF25AFA7}">Phone Number</label>]
</p>
Does anyone know what could be going wrong?
I have encountered this issue before, but was using a custom email action. I managed to fix it by not using the deprecated methods in the SendMail class and instead using the
Sitecore.Form.Core.Pipelines.ProcessMessage namespace's ProcessMessage and ProcessMessageArgs classes.
My use case was a little more complicated than yours, as we were also attaching a PDF brochure to our message (which is why we were using the custom email action in the first place), but here is the code:
public class SendBrochureEmail : SendMail, ISaveAction, ISubmit
{
public new void Execute(ID formId, AdaptedResultList fields, params object[] data)
{
try
{
var formData = new NameValueCollection();
foreach (AdaptedControlResult acr in fields)
{
formData[acr.FieldName] = acr.Value;
}
var senderName = formData["Your Name"];
var emailTo = formData["Recipient Email"];
var recipientName = formData["Recipient Name"];
var documentTitle = formData["Document Title"];
if (documentTitle.IsNullOrEmpty())
{
documentTitle = String.Format("Documents_{0}", DateTime.Now.ToString("MMddyyyy"));
}
Subject = documentTitle;
if (!String.IsNullOrEmpty(emailTo))
{
BaseSession.FromName = senderName;
BaseSession.CatalogTitle = documentTitle;
BaseSession.ToName = recipientName;
var tempUploadPath = Sitecore.Configuration.Settings.GetSetting("TempPdfUploadPath");
var strPdfFilePath =
HttpContext.Current.Server.MapPath(tempUploadPath + Guid.NewGuid().ToString() + ".pdf");
//initialize object to hold WFFM mail/message arguments
var msgArgs = new ProcessMessageArgs(formId, fields, MessageType.Email);
var theDoc = PdfDocumentGenerator.BuildPdfDoc();
theDoc.Save(strPdfFilePath);
theDoc.Clear();
FileInfo fi = null;
FileStream stream = null;
if (File.Exists(strPdfFilePath))
{
fi = new FileInfo(strPdfFilePath);
stream = fi.OpenRead();
//attach the file with the name specified by the user
msgArgs.Attachments.Add(new Attachment(stream, documentTitle + ".pdf", "application/pdf"));
}
//get the email's "from" address setting
var fromEmail = String.Empty;
var fromEmailNode = Sitecore.Configuration.Factory.GetConfigNode(".//sc.variable[#name='fromEmail']");
if (fromEmailNode != null && fromEmailNode.Attributes != null)
{
fromEmail = fromEmailNode.Attributes["value"].Value;
}
//the body of the email, as configured in the "Edit" pane for the Save Action, in Sitecore
msgArgs.Mail.Append(base.Mail);
//The from address, with the sender's name (specified by the user) in the meta
msgArgs.From = senderName + "<" + fromEmail + ">";
msgArgs.Recipient = recipientName;
msgArgs.To.Append(emailTo);
msgArgs.Subject.Append(Subject);
msgArgs.Host = Sitecore.Configuration.Settings.MailServer;
msgArgs.Port = Sitecore.Configuration.Settings.MailServerPort;
msgArgs.IsBodyHtml = true;
//initialize the message using WFFM's built-in methods
var msg = new ProcessMessage();
msg.AddAttachments(msgArgs);
msg.BuildToFromRecipient(msgArgs);
//change links to be absolute instead of relative
msg.ExpandLinks(msgArgs);
msg.AddHostToItemLink(msgArgs);
msg.AddHostToMediaItem(msgArgs);
//replace the field tokens in the email body with the user-specified values
msg.ExpandTokens(msgArgs);
msg.SendEmail(msgArgs);
//no longer need the file or the stream - safe to close stream and delete delete it
if (fi != null && stream != null)
{
stream.Close();
fi.Delete();
}
}
else
{
Log.Error("Email To is empty", this);
throw new Exception("Email To is empty");
}
}
catch (Exception ex)
{
Log.Error("Test Failed.", ex, (object) ex);
throw;
}
finally
{
BrochureItems.BrochureItemIds = null;
}
}
public void Submit(ID formid, AdaptedResultList fields)
{
Execute(formid, fields);
}
public void OnLoad(bool isPostback, RenderFormArgs args)
{
}
}
It is very possible that the Email Action that WFFM ships with is using the deprecated methods, which could be your problem. I do not have time to look into it, but you can decompile the DLL and look to see what their Email Action is doing. Regardless, the above code should work out of the box, save for updating the fields to those that you are using and removing the code for attaching the PDF, should you choose to not have attachments.
Good luck, and happy coding :)
If you change a field on the form in any way (caption, name, type, etc) the link will change and you need to re-insert the placeholder and move it up to its location in your expected email. This is also true if you duplicate a form. You'll have to reinsert all the fields in the email or you will just get the outcome you show above.
Reinserting upon a change will ensure the value is collected!

' ', hexadecimal value 0x1F, is an invalid character. Line 1, position 1

I am trying to read a xml file from the web and parse it out using XDocument. It normally works fine but sometimes it gives me this error for day:
**' ', hexadecimal value 0x1F, is an invalid character. Line 1, position 1**
I have tried some solutions from Google but they aren't working for VS 2010 Express Windows Phone 7.
There is a solution which replace the 0x1F character to string.empty but my code return a stream which doesn't have replace method.
s = s.Replace(Convert.ToString((byte)0x1F), string.Empty);
Here is my code:
void webClient_OpenReadCompleted(object sender, OpenReadCompletedEventArgs e)
{
using (var reader = new StreamReader(e.Result))
{
int[] counter = { 1 };
string s = reader.ReadToEnd();
Stream str = e.Result;
// s = s.Replace(Convert.ToString((byte)0x1F), string.Empty);
// byte[] str = Convert.FromBase64String(s);
// Stream memStream = new MemoryStream(str);
str.Position = 0;
XDocument xdoc = XDocument.Load(str);
var data = from query in xdoc.Descendants("user")
select new mobion
{
index = counter[0]++,
avlink = (string)query.Element("user_info").Element("avlink"),
nickname = (string)query.Element("user_info").Element("nickname"),
track = (string)query.Element("track"),
artist = (string)query.Element("artist"),
};
listBox.ItemsSource = data;
}
}
XML file:
http://music.mobion.vn/api/v1/music/userstop?devid=
0x1f is a Windows control character. It is not valid XML. Your best bet is to replace it.
Instead of using reader.ReadToEnd() (which by the way - for a large file - can use up a lot of memory.. though you can definitely use it) why not try something like:
string input;
while ((input = sr.ReadLine()) != null)
{
string = string + input.Replace((char)(0x1F), ' ');
}
you can re-convert into a stream if you'd like, to then use as you please.
byte[] byteArray = Encoding.ASCII.GetBytes( input );
MemoryStream stream = new MemoryStream( byteArray );
Or else you could keep doing readToEnd() and then clean that string of illegal characters, and convert back to a stream.
Here's a good resource for cleaning illegal characters in your xml - chances are, youll have others as well...
https://seattlesoftware.wordpress.com/tag/hexadecimal-value-0x-is-an-invalid-character/
What could be happening is that the content is compressed in which case you need to decompress it.
With HttpHandler you can do this the following way:
var client = new HttpClient(new HttpClientHandler
{
AutomaticDecompression = DecompressionMethods.GZip
| DecompressionMethods.Deflate
});
With the "old" WebClient you have to derive your own class to achieve the similar effect:
class MyWebClient : WebClient
{
protected override WebRequest GetWebRequest(Uri address)
{
HttpWebRequest request = base.GetWebRequest(address) as HttpWebRequest;
request.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;
return request;
}
}
Above taken from here
To use the two you would do something like this:
HttpClient
using (var client = new HttpClient(new HttpClientHandler { AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate }))
{
using (var stream = client.GetStreamAsync(url))
{
using (var sr = new StreamReader(stream.Result))
{
using (var reader = XmlReader.Create(sr))
{
var feed = System.ServiceModel.Syndication.SyndicationFeed.Load(reader);
foreach (var item in feed.Items)
{
Console.WriteLine(item.Title.Text);
}
}
}
}
}
WebClient
using (var stream = new MyWebClient().OpenRead("http://myrss.url"))
{
using (var sr = new StreamReader(stream))
{
using (var reader = XmlReader.Create(sr))
{
var feed = System.ServiceModel.Syndication.SyndicationFeed.Load(reader);
foreach (var item in feed.Items)
{
Console.WriteLine(item.Title.Text);
}
}
}
}
This way you also recieve the benefit of not having to .ReadToEnd() since you are working with the stream instead.
Consider using System.Web.HttpUtility.HtmlDecode if you're decoding content read from the web.
If you are having issues replacing the character
For me there were some issues if you try to replace using the string instead of the char. I suggest trying some testing values using both to see what they turn up. Also how you reference it has some effect.
var a = x.IndexOf('\u001f'); // 513
var b = x.IndexOf(Convert.ToString((byte)0x1F)); // -1
x = x.Replace(Convert.ToChar((byte)0x1F), ' '); // Works
x = x.Replace(Convert.ToString((byte)0x1F), " "); // Fails
I blagged this
I had the same issue and found that the problem was a  embedded in the xml.
The solution was:
s = s.Replace("", " ")
I'd guess it's probably an encoding issue but without seeing the XML I can't say for sure.
In terms of your plan to simply replace the character but not being able to, because you have a stream rather than a text, simply read the stream into a string and then remove the characters you don't want.
Works for me.........
string.Replace(Chr(31), "")
I used XmlSerializer to parse XML and faced the same exception.
The problem is that the XML string contains HTML codes of invalid characters
This method removes all invalid HTML codes from string (based on this thread - https://forums.asp.net/t/1483793.aspx?Need+a+method+that+removes+illegal+XML+characters+from+a+String):
public static string RemoveInvalidXmlSubstrs(string xmlStr)
{
string pattern = "&#((\\d+)|(x\\S+));";
Regex regex = new Regex(pattern, RegexOptions.IgnoreCase);
if (regex.IsMatch(xmlStr))
{
xmlStr = regex.Replace(xmlStr, new MatchEvaluator(m =>
{
string s = m.Value;
string unicodeNumStr = s.Substring(2, s.Length - 3);
int unicodeNum = unicodeNumStr.StartsWith("x") ?
Convert.ToInt32(unicodeNumStr.Substring(1), 16)
: Convert.ToInt32(unicodeNumStr);
//according to https://www.w3.org/TR/xml/#charsets
if ((unicodeNum == 0x9 || unicodeNum == 0xA || unicodeNum == 0xD) ||
((unicodeNum >= 0x20) && (unicodeNum <= 0xD7FF)) ||
((unicodeNum >= 0xE000) && (unicodeNum <= 0xFFFD)) ||
((unicodeNum >= 0x10000) && (unicodeNum <= 0x10FFFF)))
{
return s;
}
else
{
return String.Empty;
}
})
);
}
return xmlStr;
}
Nobody can answer if you don't show relevant info - I mean the Xml content.
As a general advice I would put a breakpoint after ReadToEnd() call. Now you can do a couple of things:
Reveal Xml content to this forum.
Test it using VS Xml visualizer.
Copy-paste the string into a txt file and investigate it offline.

A better way of representing File Attachment into a list(c#3.0)

I have written
List<Attachment> lstAttachment = new List<Attachment>();
//Check if any error file is present in which case it needs to be send
if (new FileInfo(Path.Combine(errorFolder, errorFileName)).Exists)
{
Attachment unprocessedFile = new Attachment(Path.Combine(errorFolder, errorFileName));
lstAttachment.Add(unprocessedFile);
}
//Check if any processed file is present in which case it needs to be send
if (new FileInfo(Path.Combine(outputFolder, outputFileName)).Exists)
{
Attachment processedFile = new Attachment(Path.Combine(outputFolder, outputFileName));
lstAttachment.Add(processedFile);
}
Working fine and is giving the expected output.
Basically I am attaching the file to the list based on whether the file is present or not.
I am looking for any other elegant solution than the one I have written.
Reason: Want to learn differnt ways of representing the same program.
I am using C#3.0
Thanks.
Is it looks better?
...
var lstAttachment = new List<Attachment>();
string errorPath = Path.Combine(errorFolder, errorFileName);
string outputPath = Path.Combine(outputFolder, outputFileName);
AddAttachmentToCollection(lstAttachment, errorPath);
AddAttachmentToCollection(lstAttachment, outputPath);
...
public static void AddAttachmentToCollection(ICollection<Attachment> collection, string filePath)
{
if (File.Exists(filePath))
{
var attachment = new Attachment(filePath);
collection.Add(attachment);
}
}
How about a little LINQ?
var filenames = new List<string>()
{
Path.Combine(errorFolder, errorFilename),
Path.Combine(outputFolder, outputFilename)
};
var attachments = filenames.Where(f => File.Exists(f))
.Select(f => new Attachment(f));