Sanitizing inputs with AEM - aem

Various people update our AEM website; however, when they copy and paste from Word or from web pages, the pasted content retains its HTML. I'm wondering if AEM has any built-in way of sanitizing the input so I don't need to build one.

If you are using a Rich Text Editor field in the dialog, the text will be parsed and some tags will be stripped. Take a look here for more information about how to configure it and how it works.

We had a rich text editor component with the same issue: authors were able to put HTML styling into the RTE, and those styles collided with the application styles and broke components. The fix was to strip out all HTML styling using the jsoup API before rendering the content back on screen.
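To illustrate the idea of stripping styling before rendering, here is a minimal sketch. It is not the jsoup code we used; it is a stand-in written with Python's standard library `html.parser`, and the names `StyleStripper` and `strip_styling` are hypothetical. It assumes that "styling" means `style` and `class` attributes plus whole `<style>` elements.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never get a closing tag in the rebuilt markup.
VOID_TAGS = {"br", "hr", "img"}
# Attributes treated as "styling" in this sketch (an assumption).
STYLING_ATTRS = {"style", "class"}


class StyleStripper(HTMLParser):
    """Rebuilds the markup while dropping inline styling."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.in_style = False  # True while inside a <style> element

    def handle_starttag(self, tag, attrs):
        if tag == "style":
            self.in_style = True
            return
        kept = [(k, v) for k, v in attrs if k not in STYLING_ATTRS]
        rendered = "".join(
            ' {}="{}"'.format(k, escape(v or "", quote=True)) for k, v in kept
        )
        self.out.append("<{}{}>".format(tag, rendered))

    def handle_endtag(self, tag):
        if tag == "style":
            self.in_style = False
            return
        if tag not in VOID_TAGS:
            self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        # Text arrives unescaped, so escape it again; skip CSS rules.
        if not self.in_style:
            self.out.append(escape(data, quote=False))


def strip_styling(html_text):
    parser = StyleStripper()
    parser.feed(html_text)
    parser.close()  # flushes any buffered text
    return "".join(parser.out)
```

Note that this only removes styling. It keeps every other tag and attribute, so it is not a security filter, and comments (including Word's conditional comments) are dropped as a side effect.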

The usual approach in AEM is to protect the user on output (i.e. take the input as-is and use the built-in XSS API when rendering that input).
https://docs.adobe.com/docs/en/cq/5-6-1/deploying/security_checklist.html#Protect%20against%20Cross-Site%20Scripting%20%28XSS%29
https://docs.adobe.com/content/docs/en/cq/5-6-1/developing/securitychecklist/_jcr_content/par/download/file.res/xss_cheat_sheet.pdf
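
The "protect on output" principle can be sketched in a few lines. This is not the AEM XSS API; it is a generic illustration using Python's standard `html.escape`, and `render_comment` is a hypothetical helper. The stored input is left as-is, and encoding happens only at the moment it is written into HTML.

```python
from html import escape


def render_comment(author, body):
    """Encode untrusted values at render time, not at input time."""
    # quote=True also encodes " and ', which matters inside attributes.
    safe_author = escape(author, quote=True)
    safe_body = escape(body, quote=True)
    return '<div class="comment" title="{}">{}</div>'.format(safe_author, safe_body)
```

With this approach a pasted `<script>` tag is shown to the reader as literal text instead of being executed. It also means pasted formatting is shown as text, so it suits plain-text fields better than rich text ones.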

Related

TinyMCE autocloses HTML tags - How to disable? 2

Same question as here
I have two TinyMCE editors, one for the header and the other for the footer (this needs to be done for an email template).
For example, I want to have
<div>abra in the header editor. After saving, it becomes <div>abra</div> (the tag is closed).
And
cadabra</div> in the footer editor. After saving, it becomes cadabra (the tag is removed),
so that in the end I could get <div>abracadabra</div>.
How can I disable this?
You cannot stop TinyMCE from trying to create valid, well-formed HTML. The engine that drives TinyMCE is designed to ensure that the content in any one editor is valid and well-formed. You know that the data is only meant to be correct once both editors are combined, but TinyMCE won't allow for that. You could certainly post-process the data when extracting it from TinyMCE to get your desired end result.
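One possible shape for that post-processing step, as a rough sketch only: it assumes that every closing tag at the very end of the saved header was meant to wrap the footer, which may not hold for real templates. The function name `join_header_footer` is hypothetical, and nothing here is a TinyMCE API.

```python
import re

# One or more closing tags (with optional whitespace between them)
# at the very end of the header markup.
TRAILING_CLOSERS = re.compile(r"(?:</[A-Za-z][\w-]*>\s*)+$")


def join_header_footer(header_html, footer_html):
    """Move the header's trailing closing tags to after the footer."""
    match = TRAILING_CLOSERS.search(header_html)
    if match is None:
        # Nothing was auto-closed at the end; plain concatenation.
        return header_html + footer_html
    opening = header_html[:match.start()]
    closers = match.group(0).strip()
    return opening + footer_html + closers
```

For the example in the question, `join_header_footer("<div>abra</div>", "cadabra")` gives `<div>abracadabra</div>`.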

Can birt support reading html tables from database and displaying them dynamically to a pdf report file?

I have come across a scenario where I have to read HTML data from a database and display it in PDF reports. This HTML data also contains table structures (<table></table> tags) with other HTML elements inside them. Previously we used JasperReports for our reporting needs, but we recently learned that it does not support this functionality, so I want to know which reporting tool can be used and incorporated with Servoy. Does BIRT provide this functionality?
AFAIK none of the well-known reporting tools supports this, although in BIRT it works "somehow", just not well enough to be usable.
The reason for this is simple, I think: A reporting tool would have to incorporate a complete browser engine like WebKit or others to achieve this, because it would have to "understand" the structure for its page-breaking algorithm.
Yes, BIRT has a text element whose display type can be set to HTML. If the HTML table is in a dataset field, you just have to include it in the expression of the text element using the "value-of" tag, something like this:
<VALUE-OF format="HTML">row["htmlTableField"]</VALUE-OF>
The PDF output takes such HTML elements into account, including most simple style settings such as background color, text-align, borders, etc.
Usually the reports render just fine with HTML.
There are some tricks to displaying html correctly in BIRT.
You may use a Dynamic Text element and set to html or auto.
Here are some tricks for handling free-form text.
Make sure your XML is valid. I recommend replacing line breaks, or you may hit a scenario where the rptdocument will not export.
Also, if possible, keep these in auto layout when using run + render. The page breaks may actually be calculated once on run and again on render, so you might experience page-breaking issues with fixed layout. During the RUN() phase, in the web viewer or the rptdocument, the page may attempt to display all the HTML before breaking a page. Then, when rendering to PDF with fixed layout, the breaks are applied differently.

Can I manage content structure with a rich text editor?

I'm developing a CMS, and I'm trying to figure out which rich text editor (if any) I want to use.
The content is stored in a structured form on the server. Let's call it the "canonical form". It is not a simple HTML or markdown page, but a multi-part structure where each part is stored as individual records in the database.
The server reads the canonical form and sends it to the client. The client transforms the canonical form into HTML. I now want to let the user edit the content, and save it back to the server in canonical form.
I'm not sure a rich text editor will do the trick. It seems most RTEs give you HTML, leaving it up to you to parse the HTML and save it. The problem is that the conversion from canonical form to HTML is one-way. The canonical form is different enough from HTML that the transformation can't be readily reversed.
So I need some kind of intimate interaction with the editor. I need to track all the things the editor does (select, copy, paste, drag-n-drop, splitting blocks, merging blocks, etc.) as the editor is doing it, so that I can maintain the canonical form in parallel with the displayed HTML.
Is there anything out there that will do this? I'm looking at TinyMCE, CKEditor, etc.
It sounds like you're probably going to need logic that converts content into canonical form on an editor get operation, and the inverse on an editor set operation.
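As a rough sketch of that set/get pairing: the example below invents a tiny canonical form (a list of `Part` records) and tags each rendered block with a `data-part-id` attribute so the editor's HTML can be mapped back. All names here are hypothetical, and it assumes the editor preserves `data-*` attributes and one block element per part, which you would need to verify for your editor.

```python
from dataclasses import dataclass
from html import escape
from html.parser import HTMLParser


@dataclass
class Part:
    part_id: str
    kind: str  # "heading" or "paragraph" in this sketch
    text: str


TAG_FOR_KIND = {"heading": "h2", "paragraph": "p"}
KIND_FOR_TAG = {tag: kind for kind, tag in TAG_FOR_KIND.items()}


def to_html(parts):
    """Editor 'set': canonical parts -> HTML with ids embedded."""
    return "".join(
        '<{tag} data-part-id="{pid}">{text}</{tag}>'.format(
            tag=TAG_FOR_KIND[p.kind],
            pid=escape(p.part_id, quote=True),
            text=escape(p.text, quote=False),
        )
        for p in parts
    )


class _PartReader(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._current = None  # the Part being read, if any

    def handle_starttag(self, tag, attrs):
        if tag in KIND_FOR_TAG and self._current is None:
            part_id = dict(attrs).get("data-part-id") or ""
            self._current = Part(part_id, KIND_FOR_TAG[tag], "")

    def handle_data(self, data):
        if self._current is not None:
            self._current.text += data

    def handle_endtag(self, tag):
        if self._current is not None and TAG_FOR_KIND[self._current.kind] == tag:
            self.parts.append(self._current)
            self._current = None


def from_html(html_text):
    """Editor 'get': HTML -> canonical parts (inline markup is dropped)."""
    reader = _PartReader()
    reader.feed(html_text)
    reader.close()
    return reader.parts
```

A block without a `data-part-id` comes back with an empty id, which the server could treat as a newly created part. Inline formatting inside a block is discarded here, so a real canonical form would need its own representation for it.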
Textbox.io supports the idea of filters for content. You could possibly tie this in with something like Markdown-js to get your canonical format.

How to edit content more easily?

I add content to my Confluence page as HTML inside {html} tags. This page will be changed every week in the future. It is very difficult for people who have never worked with HTML to understand it quickly.
Is there any way in Confluence to add a simple user interface form that helps to edit the information inside the HTML?
I know that Confluence has embedded jQuery. Can anybody advise how to do this better?
Thanks
Use the Scaffolding plugin to show only certain text fields for editing. That way you can hide the HTML code. However, Scaffolding is not ready for Confluence 4.
http://wiki.customware.net/repository/display/AtlassianPlugins/Scaffolding+Plugin
You could download the page with the Atlassian CLI, parse out the section of HTML you want to modify, put that in your WYSIWYG editor, and then inject it back into the downloaded HTML and post it back.
Of course, it is as fun as it sounds.
An example of the content would help to answer this question.
One option is to put your content in a Word .doc file, save it, and upload it to the page. Use the Office Connector macro to display the content of the .doc on the page. The Office Connector plugin is free.
Note that the Confluence V5 editor now has a basic set of the editing features found in Microsoft Word.

How safe is the data being parsed by RTF editors like TinyMCE?

I have a serious concern about deploying the TinyMCE editor on a website. Looking at the code parsed by the editor, it does a great job, and I leave the HTML button off the toolbar configuration so users cannot inject their own source.
However, from what I read in the TinyMCE docs, it claims to degrade nicely to a regular textarea should JavaScript be disabled in a user's browser, and therein lies my concern. If it does revert to a normal textarea, the user is able to easily inject their own HTML, and this leaves me with a security concern.
I just pass through data created with TinyMCE, and it is used within another page created by my script, so it poses no security risk to my server. The security concern arises over what malicious data may be passed to another user viewing the generated page.
I know many of you will tell me to just use regexes, or parse this data, but that itself could be a nightmare, as I would be trying to either...
a.) Use regexes to try and clean up the HTML without breaking the generated page, and it is better to parse the data for that anyway.
b.) Reparse data that has already been parsed by the RTF editor, which would also probably end up breaking the generated page.
Anyone with any previous experience with this type of scenario, I would really appreciate a 'heads-up' as to any other risks that using an RTF editor for user data could entail.
I would really like to provide this as a user option, but not if the risks outweigh the benefits, that is, if it gives the user of the RTF editor a chance to take a whack at another user viewing the page that is generated by the script.
My gut feeling is to steer a wide berth around use of the RTF at this point.
Thanks for any direction you can give me with your own experiences.
You cannot have client-side security on the web. You simply can't trust the browser, because it's easy for a malicious user to substitute a replacement browser that does whatever he wants.
If you accept HTML from users (using TinyMCE or through any other method) and display it to other users, you must sanitize or validate the HTML in some way on the server. If you're using Perl, the leading package seems to be HTML::Scrubber (along with various other modules that help you plug it in to various frameworks). I haven't had occasion to try it myself.
The TinyMCE Security page mentions some ways to make it harder for people to submit arbitrary HTML, but you still need server-side checks.
Regex is generally not considered good for parsing HTML (see "RegEx match open tags except XHTML self-contained tags"), but I have noted the "perl" tag :)
My advice when taking markup from users is to always pass it through a parser that can accept malformed HTML and return well-formed HTML. These parsers generally produce something that can be queried and updated with some form of XPath.
In Python there is a module called BeautifulSoup, Ruby has Nokogiri, and in ASP.NET there is a project called HtmlAgilityPack; they all do this sort of thing. I'm not sure what library Perl has, but I'm sure there would be something.
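To make the server-side check concrete, here is a minimal allowlist sketch. It uses only Python's standard library rather than any of the packages named above, and the allowlists, `AllowlistSanitizer`, and `sanitize` are assumptions chosen for illustration. Anything not explicitly allowed is dropped, and text is re-escaped on the way out.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urlparse

ALLOWED_TAGS = {"p", "br", "b", "i", "em", "strong", "ul", "ol", "li", "a"}
ALLOWED_ATTRS = {"a": {"href"}}
SAFE_SCHEMES = {"http", "https", "mailto"}
DROP_WITH_CONTENT = {"script", "style"}  # remove the element and its body
VOID_TAGS = {"br"}


def is_safe_url(value):
    try:
        scheme = urlparse(value.strip()).scheme.lower()
    except ValueError:
        return False
    # Relative URLs have an empty scheme and are rejected in this sketch.
    return scheme in SAFE_SCHEMES


class AllowlistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skipping = None  # name of the dropped element we are inside

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skipping = tag
            return
        if self.skipping or tag not in ALLOWED_TAGS:
            return
        rendered = ""
        for name, value in attrs:
            value = value or ""
            if name not in ALLOWED_ATTRS.get(tag, ()):
                continue  # drops onclick, style, and everything else
            if name == "href" and not is_safe_url(value):
                continue  # drops javascript: and similar schemes
            rendered += ' {}="{}"'.format(name, escape(value, quote=True))
        self.out.append("<{}{}>".format(tag, rendered))

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self.skipping = None
            return
        if not self.skipping and tag in ALLOWED_TAGS and tag not in VOID_TAGS:
            self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        if not self.skipping:
            self.out.append(escape(data, quote=False))


def sanitize(html_text):
    parser = AllowlistSanitizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

This sketch does not rebalance unclosed tags, so the output is filtered but not guaranteed to be well-formed. For production use, a maintained sanitizing library is the safer choice.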