How safe is the data being parsed by RTF editors like TinyMCE? - perl

I have a great concern in deploying the TinyMCE editor on a website. Looking at the code parsed by the editor it does a great job, and I leave the HTML button off the toolbar configuration so users can not inject their own source.
However, from what I read in the TinyMCE docs, it claims to degrade nicely to a regular textarea should javascript be disabled on a users browser... and therein lies my concern. If it does revert to a normal textarea, then the user is then able to easily inject their own HTML, and this leaves me with a security concern.
I just pass through data created with TinyMCE, and it is used within another page created by my script, so it poses no security risk to my server. The security concern arises over what malicious data may be passed to another user viewing the generated page.
I know many of you will tell me to just use regexes, or parse this data, but that itself could be a nightmare, as I would be trying to either...
a.) Use regexes to try and clean up the HTML without breaking the generated page,
and it is better to parse the data for that anyway.
b.) Reparsing data that has already been parsed by the RTF editor, which also
would probably end up breaking the generated page.
Anyone with any previous experience with this type of scenario, I would really appreciate a 'heads-up' as to any other risks that using an RTF editor for user data could entail.
I would really like to provide this as a user option, but not if the risks outweigh giving the user using the RTF a chance to take a wack at another user viewing the page that is generated by the script.
My gut feeling is to steer a wide berth around use of the RTF at this point.
Thanks for any direction you can give me with your own experiences.

You cannot have client-side security on the web. You simply can't trust the browser, because it's easy for a malicious user to substitute a replacement browser that does whatever he wants.
If you accept HTML from users (using TinyMCE or through any other method) and display it to other users, you must sanitize or validate the HTML in some way on the server. If you're using Perl, the leading package seems to be HTML::Scrubber (along with various other modules that help you plug it in to various frameworks). I haven't had occasion to try it myself.
The TinyMCE Security page mentions some ways to make it harder for people to submit arbitrary HTML, but you still need server-side checks.

Regex is generally not considered good for parsing HTML
RegEx match open tags except XHTML self-contained tags but I have noted the "perl" tag :)
My advice when taking markup from users is to always parse it through something that can accept mal-formed HTML and return well formed HTML. These parses generally produce something that can be queried and updated with some form of XPath.
In Python there is a module called BeautifulSoup, Ruby has Nokogiri and in ASP.NET there is a project called HtmlAgilityPack that all do this sort of thing. I'm not sure what library perl has, but I'm sure there would be something.

Related

How can I "drill down" into a website using Perl's WWW::Mechanize

I have used the WWW::Mechanize Perl module on a number of projects and it's helped me out a lot.
I am trying to use it on a different site and I can't "drill down" into the content of the site.
The site is https://customer.bookingbug.com/?client=hantsrecyclingcentres#/services
I have tried figure out what the URL would be to get content shown in the resulting HTML, such as bb.d570283b87c834518ba9.css, bb.d570283b87c834518ba9.js and version.js
I tried to copy the resulting HTML into this posting, but used all sorts of quote and code sample combinations and it wouldn't display properly.
Does anyone have any idea how I "navigate" this site using this Perl module please?
WWW::Mechanize is a web client with some HTML parsing capabilities. But as you clearly noticed, the information you want is not in the HTML document you requested. Either download the correct document (whatever that might be), or do what the browser does and execute the JavaScript. This would require a JavaScript engine. The simplest way to achieve that is to remote-control a web browser (e.g. using Selenium::Chrome).

Can I manage content structure with a rich text editor?

I'm developing a CMS, and I'm trying to figure out which rich text editor (if any) I want to use.
The content is stored in a structured form on the server. Let's call it the "canonical form". It is not a simple HTML or markdown page, but a multi-part structure where each part is stored as individual records in the database.
The server reads the canonical form and sends it to the client. The client transforms the canonical form into HTML. I now want to let the user edit the content, and save it back to the server in canonical form.
I'm not sure a rich text editor will do the trick. It seems most RTE's give you HTML, leaving it up to you to parse the HTML and save it. The problem is that the conversion of canonical to HTML is one-way. The canonical form is different enough from HTML that the transformation can't be readily reversed.
So I need some kind of intimate interaction with the editor. I need to track all the things the editor does (select, copy, paste, drag-n-drop, splitting blocks, merging blocks, etc.) as the editor is doing it, so that I can maintain the canonical form in parallel with the displayed HTML.
Is there anything out there that will do this? I'm looking at TinyMCE, CKEditor, etc.
It sounds like you're probably going to need logic that converts content into canonical form on an editor get operation, and the inverse on an editor set operation.
Textbox.io supports the idea of filters for content. You could possibly tie this in with something like Markdown-js to get your canonical format.

How to extract data from a web site and format to raw text - iPhone Dev

I have been looking around for a while and not found anything useful, also not sure if I have worded the question in the clearest fashion so apologies
I have a section of an app I am building called 'Company News'. The company in question has a news page on their website which displays a title, an excerpt of text and a read more option.
At the minute in the iPhone application I just have a UIWebView which links to that URL, displays an error if no connection is available. However, if my user clicks a story to read the news obviously it opens up a new page, I want to avoid having to build in 'back' and 'forward' buttons and stay away from it looking like a browser within the app.
With that said, I am looking for a way to just extract that data from the website and just display it in my app as raw text. I am not particularly bothered about rich text formatting or anything fancy. I would just like the title and body of text.
Is this possible?
In essence, then, you are looking for an HTML parser.
Assuming the HTML you wish to parse has a predictable format, the approach I would take is to load the HTML via whatever URL loading system you want - e.g. NSURLConnection, ASIHTTPRequest, etc.
Then you will need to parse the raw HTML. I use XPath. It requires that you learn the syntax, but it should work.
For more details about how you might use XPath for parsing HTML, see the second response to this question. You will need to link to libxml2 in your project then use XPath to extract the nodes of interest.
Scraping web pages in this way is fragile, though, because it depends on the structure of a page you don't control and which could be changed unpredictably.

How can I return a text file and an error log from a webpage separately

I have a perl script which when run from the command line generates a text file of data with a specific format for use by another application. The script also prints informational warning messages on stderr. I'm writing a web front end for this. In an ideal world when the user clicks 'submit' on the associated form, a page would be displayed in the browser containing the informational messages, and simultaneously a pop-up would appear allowing the user to save the text file of data to disk. I would like this to work on browsers without javascript enabled, so I think exactly what I want is probably not possible.
Some sites I have seen deal with this kind of thing by displaying the page with the informational messages, and a link to the file to be downloaded. This would seem to mean having to store the files and sorting out some sort of security so that another user cannot download your file (not that this is a big deal for the application in question).
I'm wondering if there is a more elegant way of dealing with this? e.g Is it possible to use multipart messages to somehow achieve returning both pieces of information in one go? Is it possible to pop-up a second window with the informational messages without using javascript? Apologies if these seem like basic questions - my programming knowledge is in the domain of DNA sequence manipulation algorithms rather than web page generation..
If (and only if) the data is quick and easy to generate, do it once for error messages and a second time for download. The link or button of the error-message page would regenerate the results and prompt for download.
This is a bit of a hack since you need to consider what to do if the underlying data changes before the user hits the download link. Be careful to set the header correctly for file download vs normal webpage, eg,
if($submit) {
print header(-type=>'application/octet-stream',
-Content_disposition=>'attachment; filename=foobar.dat');
Gen_Results();
}
To be honest, I'd just use a little javascript anyway since it's a pretty safe assumption now a days. Otherwise, use a "noscript" tag for some alternative.

How to build an inline translation system similar to Magento's

I am working on a Zend Framework, MVC, enterprise website project. I would like to develop a friendly translation system with the ability to translate each word according its context (sometimes same word have different translation).
Zend Framework uses Zend_Translate for i18n and localization. We have also seen Magento's (which uses ZF) inline translation system, where users can translate pages directly.
We want to know how this inline translation system works, so that we can build a similar system with improvements.
Where are translations stored: in the database or in CSV files?
How does the system know to fetch translations for the same word when tranlsated differently by the user on different pages?
How should we build a page to support inline translation?
How does the system handle static text vs. dynamic (database-driven) text?
Inline translation seems like it would make the site very slow. How does Magento solve this problem?
Please if you have more points that should be explained, write them. Thanks
Starting from the beginning here (in the future, this is probably more than one logical question):
Magento stores basic translations (provided by the programmer) in CSV files, but inline translations are stored in the database.
Magento's translations operate on entire strings, not words. By providing an entire sentence worth of context for translations, idiomatic translations are achievable. The tradeoff is obviously that every sentence must be translated, rather than every word.
Magento's answer to this is to wrap all localizable strings in a call to the localizer. Magento templates usually look something like this (the double-underscore function maps to the "translate into the current locale" function):
print $this->__("Please translate this string");
Dynamic text (as in product descriptions) in Magento is often not translated, but if you want to do so, it's as simple as passing the right string to the translator, like this:
print $this->__($someString);
It's unlikely that translation will make or break your site (look to your DB queries for most performance problems), but this is a legitimate question nonetheless. Magento does a few things to help. First, it stores serialized versions of the CSV files in a cache, so that reading CSVs is made more efficient. Secondly, Magento offers page caching so that an entire page's output can be stored (assuming that it will render identically), as well as block-level caching for smaller bits of a page. Between these you're in good shape for the most part.
Hope that helps!
Thanks,
Joe