We've been tasked with creating a MOSS workflow that, on its final step, will convert a document (most likely from Word 2003 or 2007) to PDF and watermark it with the current date.
So far I haven't seen a definitive way to do this. I've looked at using the MS Word Interop DLLs, but we will not be installing Word (or Office) on the server, so that's really not doable. Another option I've looked at is using the Aspose libraries for conversion.
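For context, the Aspose route would boil down to something roughly like this (just a sketch, untested; the file names are placeholders, and the date watermark would still need to be added before saving, e.g. by inserting a text shape into each section):

using Aspose.Words;

class ConvertToPdf
{
    static void Main()
    {
        // Load the Word document (2003 .doc or 2007 .docx both work)
        Document doc = new Document(@"C:\Drop\contract.docx");

        // TODO: insert the current-date watermark here before saving

        // Save straight to PDF - no Office installation required on the server
        doc.Save(@"C:\Drop\contract.pdf", SaveFormat.Pdf);
    }
}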
From a topology standpoint, I'm wondering whether using one server exclusively for document conversion is a good way to implement this. (I've read some material that recommends this approach for larger organizations.)
If anyone, preferably someone who has done this sort of thing, can give me some pointers or best practices on this, I'd really appreciate it.
Thanks
I would think that starting with one server is the best way to go. Then, just monitor the workload on the machine and if it gets to be too much for one, pop another in there. That's the beauty of MOSS.
I am building up a chart with 582 literal data points. When I call WordprocessingDocument.Close() I get an IsolatedStorageException.
This doesn't make sense to me because the OpenXML SDK, as I understand it, is totally self-contained, writing data to a stream. There are no calls to Office or anything else that could hit this issue.
Be that as it may, is there anything I can do to avoid this issue?
thanks - dave
I did some research into the IsolatedStorageException related to large XLSX files and found solutions to your problem.
According to Eric White's blog, when the legacy OpenXml generates a file larger than 10 MB it needs to take advantage of Isolated Storage. If multiple threads access the Isolated Storage during report generation, System.IO.Packaging will throw the IsolatedStorageException.
This is because the System.IO.Packaging that is baked into .NET was not written to handle these scenarios well, and that copy of System.IO.Packaging can't be changed.
To remedy this issue, you can try one of these solutions:
Refactor your code to use a new OpenXml built on the System.IO.Packaging that Eric White refactored to remove the Isolated Storage dependency. Check this chart for reference and use the correct NuGet command to bring in the new version without the WindowsBase dependency.
Don't refactor your OpenXml code, but change your report generation user interface to prohibit (if possible) or discourage generating files larger than 10 MB.
If your OpenXml code is embedded inside an IIS-based web solution and refactoring your code is not feasible, try one of the solutions provided by this blog. Those techniques aim to provide the permissions IIS needs to get around this exception, and might not be related to Eric White's concern.
Without more information about your solution architecture, these are the solutions I can recommend at this time. Hope they help.
We just finished a small project with the primary aim of giving Microsoft StreamInsight a try.
The technology looks fine, but I have a concern about its industry traction. When we ran into issues there were only a handful of resources on the web, and in general I miss having a vibrant community around it.
Should we expand our use of StreamInsight, or will it go down the drain in a few years like Silverlight did?
My opinion: it won't go the way of Silverlight.
Silverlight was, essentially, replaced by other technologies - specifically, HTML 5. StreamInsight doesn't have that. You are correct in that there isn't a whole lot around it. But that is because it's a relatively new technology that has a radically different paradigm, isn't very well known and has more limited use cases than something like ASP.NET MVC. But CEP is an initiative not just from Microsoft (with StreamInsight) but also Oracle, IBM and others. And as data volumes continue to increase, these technologies will be even more important.
I was at a SQL user group meeting about three months ago. I asked that exact question of the speaker (Microsoft Blue Badge). He didn't make a definitive statement but from what he did say it was clear he thought StreamInsight had a big future.
Although there have been quite a few posts on this topic, my question is a little more specific.
I need to parse a few websites and, once done, send some data to them. For example, say website A offers me a search tab; I need to programmatically feed data into it. The resulting page might differ based on the target site's updates.
I want to code such a crawler. So which tools/languages would be best to realize this?
I am already well-versed in Java and C, so anything based on these would be really helpful.
I would suggest using PhantomJS. It's completely free, and Windows, Linux, and Mac are supported.
It is very simple to install.
It is very simple to execute from the command line.
The community is pretty big, and solving straightforward problems is trivial.
It uses JavaScript as the scripting language so you'll be fine, I guess, with your Java background.
You'll have to get familiar with DOM structure. Well, you cannot write a crawler without knowing it (even in case you select completely visual solution).
Everything depends on how frequently the crawler should be executed: PhantomJS is great for long-term jobs. Use something else, something visual like iMacros, if you're looking for a one-time solution. It can be used inside Mozilla as an extension (free of charge), and there's a standalone version that costs money.
Cheers
I'm curious as to how many people are using Open XML (OOXML) these days (either pure or via the SDK) in closed and commercial environments. I'm fairly aware of what's going on on the 'public web' (MSDN, OpenXMLDeveloper.org, etc.), but am wondering about SO people's experience with it, both good and bad.
Are most people opting out of VBA and VSTO in favor of working directly with OOXML formats? What benefits are you getting from OOXML that you're not getting from the object model? I'd love to learn more about why you're using it or why you're not, what you're using it for, etc.
I'm just trying to get a feel from the community on OOXML as an approach toward document automation or other uses. I'm not finding community forums (this one or others) to be incredibly active with questions and users (check out the number of questions under this post's tags), so I'm wondering if I'm one of the very few who is using OOXML extensively.
Now that Microsoft Office 2007 (and especially Excel) supports Open XML, I'm finding it far easier to work with than Office Automation. A few major reasons for this are:
Better performance;
No flaky IPC issues (e.g. somebody left Excel open at the Save As dialog and everything crashes);
No dependency on Office itself, or any external components whatsoever;
Fairly easy to write Linq extensions and queries against in C# (see the sketch after this list);
Can be used in server environments without any problems or risks.
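To illustrate the Linq point above, here's roughly what a query looks like with the Open XML SDK (a minimal sketch; report.xlsx is a made-up file name, and shared strings would need an extra lookup):

using System;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Spreadsheet;

class Program
{
    static void Main()
    {
        // Open the workbook read-only and grab the first worksheet part
        using (var doc = SpreadsheetDocument.Open("report.xlsx", false))
        {
            var sheetPart = doc.WorkbookPart.WorksheetParts.First();

            // Pull the text of every cell whose reference starts with "A"
            // (simplified - a real query would also resolve shared strings)
            var columnA = from cell in sheetPart.Worksheet.Descendants<Cell>()
                          where cell.CellReference.Value.StartsWith("A")
                          select cell.InnerText;

            foreach (var value in columnA)
                Console.WriteLine(value);
        }
    }
}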
Given that Office XP/2003 users can open Office 2007 files with the Compatibility Pack, I don't see any reason to continue using the old automation or "OfficeML" methods. It's a bit of a learning curve, but it's arguably the best option today - it's free, it's reliable, and best of all it's the native format used by Office 2007 today and you don't need any stupid tricks to get it to work (like attaching the XLS content-type to HTML as we did for XL2003, and having XL2007 complain about an incorrect extension).
I wouldn't say it's an outright replacement for VBA/VSTO - the thing about those is that they're usually part of a solution where the requirement is to integrate with the Office environment itself. Using OOXML would generally require you to write an entire application around it. But for simple import/export, which is probably what 90% of automation has been used for in the past, definitely, OOXML is the way to go.
Libraries like Simple OOXML can also greatly help with the learning curve.
We are using the Open XML SDK for export to Excel. I have to say it is quite slow, so we had to do some caching on our own (for shared strings; a sketch of the idea is below). The library is just an object representation of the Open XML format. Sometimes that can be a good thing, sometimes not, especially because you have to know the Open XML standard very well; the SDK won't handle anything for you. You have to know all the restrictions the format brings, and which elements you cannot omit in xlsx or docx, etc. It allows you to create inconsistent Excel spreadsheets or Word documents, which is not good. Well, it's free at least :) Better than nothing.
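As an illustration of the kind of shared-string caching I mean (a simplified sketch, not our actual production code; it assumes every string goes through this helper):

using System.Collections.Generic;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Spreadsheet;

class SharedStringCache
{
    // Maps cell text to its index in the shared string table so each
    // distinct string is appended to the SharedStringTablePart only once.
    private readonly Dictionary<string, int> _index = new Dictionary<string, int>();
    private readonly SharedStringTablePart _part;

    public SharedStringCache(SharedStringTablePart part)
    {
        _part = part;
        if (_part.SharedStringTable == null)
            _part.SharedStringTable = new SharedStringTable();
    }

    // Returns the index to store in the cell, adding the string only if unseen.
    public int GetOrAdd(string text)
    {
        int i;
        if (_index.TryGetValue(text, out i))
            return i;

        _part.SharedStringTable.AppendChild(new SharedStringItem(new Text(text)));
        i = _index.Count;
        _index[text] = i;
        return i;
    }
}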
I have used the Open XML SDK for document generation in SharePoint, PPTX generation for custom presentations (pulling the data from XLSX), and for "building" composite documents from multiple document segments.
The format and the SDK are great. No worries on the server in ASP.NET or SharePoint scenarios, and great speed. I have not found too many scenarios where the SDK or "brute force" xml cannot accomplish a goal. One instance is password protection and DRM for documents, but these are more corner cases. I would agree with aaronaught that this is not an exclusive solution, but SharePoint, VSTO and others are tools in the tool belt for document generation solutions.
One scenario I am handling is delivering a country's sales performance in a PPT deck using Open XML.
I looked at this technology a lot; I program in VBA, but it was way, way too complex for my needs. There are some great things about using it that my e-learning work would benefit from, like Linq and XML, but the bar for going from VBA to the underlying formats is just too high, and I don't have the luxury of the kind of time and money it would take to invest in learning VS.NET and the Office Open XML formats.
But the one thing I think it would really help with is meta-tagging PowerPoint content for an LMS.
I've used it to parse pptx files, looking for special comments in shapes. These comments are links to other resources (URIs most often, PDF files, etc.). I then use Deep Zoom software to render the pptx and then render the URIs inside the shapes. Fun, but slow. I use it to help with research and as a "novel" way to look at posters. But this isn't LOB.
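To give an idea of the traversal involved, this is roughly how you walk the shapes with the SDK (a bare-bones sketch; deck.pptx is a placeholder name, and the handling of the URI comments themselves is left out):

using System;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using A = DocumentFormat.OpenXml.Drawing;
using P = DocumentFormat.OpenXml.Presentation;

class Program
{
    static void Main()
    {
        using (var doc = PresentationDocument.Open("deck.pptx", false))
        {
            foreach (var slide in doc.PresentationPart.SlideParts)
            {
                // Concatenate the text runs of each shape; annotations such as
                // URIs pointing at PDFs or other resources would show up here.
                foreach (var shape in slide.Slide.Descendants<P.Shape>())
                {
                    var text = string.Concat(shape.Descendants<A.Text>().Select(t => t.Text));
                    if (!string.IsNullOrEmpty(text))
                        Console.WriteLine(text);
                }
            }
        }
    }
}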
We migrated from Word Interop to Open XML very recently.
Our application is used for creating invoices in Word and PDF format. When Interop was used, a decent-sized invoice took about 5-10 minutes to generate.
Now, using Open XML and SSRS, that time has been reduced to approximately 30 seconds.
The only problem area with Word was that some features were not backward compatible from Word 2010 to 2007, and that took some time to fix and get up and running, such as creating the table of contents, merging documents, etc. (a minimal merge sketch is at the end of this answer).
Other than that, I think Linq, MSDN and Eric White's blog are enough to get you going in the right direction.
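For reference, one common way to merge documents with the SDK is the altChunk approach; here's a trimmed-down sketch (the file names are hypothetical):

using System.IO;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

class MergeDocs
{
    static void Main()
    {
        using (var target = WordprocessingDocument.Open("invoice.docx", true))
        {
            MainDocumentPart mainPart = target.MainDocumentPart;

            // Import the second document as an "alternative format" chunk
            string chunkId = "appendixChunk1";
            var chunk = mainPart.AddAlternativeFormatImportPart(
                AlternativeFormatImportPartType.WordprocessingML, chunkId);
            using (FileStream source = File.OpenRead("appendix.docx"))
                chunk.FeedData(source);

            // Reference the chunk from the body; Word merges it in when the file is opened
            mainPart.Document.Body.AppendChild(new AltChunk() { Id = chunkId });
            mainPart.Document.Save();
        }
    }
}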
I package our server releases into zip files using a batch file (Windows), running the command-line version of WinZip. Previously we did this sort of thing "by hand" but I developed the process of automating it with a batch file.
The batch file has become quite complicated because our product is complicated (i.e., Which sections are we releasing this time? Are we releasing the config files as well?) and I'm starting to run into some frustrating limitations with batch files.
Would PowerShell be a good thing to investigate as an "upgrade" to the batch file? Or is that complete overkill given that most of what it would be doing is firing off DOS commands?
Bonus: can PowerShell consume .NET assemblies? As in, could I start doing the zipping with SharpZip?
If you have a working solution, then you don't need to go to PowerShell. Having said that, if you plan to make changes or improve the process, then I would highly recommend PowerShell as the way to go. PowerShell can access .NET assemblies... mostly. Some assemblies are structured in a way that makes them more difficult to use than others.
You can check here for some resources if you decide to look at PowerShell.
Initially I was really excited about PowerShell. Finally a powerful native shell on Windows. However, I quickly realized that compared to your favorite unix shell PowerShell is just way too verbose. Even doing simple stuff takes way too much typing compared to what you can do with bash and GNU tools for Win32.
I like the idea that the shell knows about different types, but if I need to do that much additional work, I prefer just getting the necessary data with the various Unix stream editors.
EDIT: I just had another look at PowerShell, and I have to admit, that it does have some really useful features that are not available for the traditional unix style tools.
For one, PowerShell owns all the commands, which means that it can provide a much more coherent set of features. Parameters are treated uniformly, and you can search for commands, parameters and so forth using wildcards, which is really useful.
The second great feature is that PowerShell lets you enumerate sources that are normally not available to stream editors, such as the Windows registry, the certificate store and so forth. Of course you can have tools that do this for you and present it as text, but the PowerShell approach is just really elegant IMO.
Take a look at the PowerShell Community Extensions (PSCX); it's free and it has zip cmdlets:
Write-Zip
Write-BZip2
Write-GZip
http://www.codeplex.com/PowerShellCX
You should watch this presentation/discussion with Jeffrey Snover, PowerShell creator and architect. If you're not amazed by the technical details (lots of "wow" moments to be had), you'll be amazed by Jeffrey's enthusiasm :). Once you get the basics, it's easy to be very productive with PowerShell.
The answer is YES - PowerShell can use .NET assemblies. There is a bit of funny business involved in v1 if you need to wire up delegates, and v2 makes that much cleaner.
Just call LoadFile / LoadAssembly to get the appropriate libraries in memory and away you go:
[Reflection.Assembly]::LoadFile('/path/to/sharpzip.dll')  # load SharpZipLib from disk
$zip = new-object ICSharpCode.SharpZipLib.Zip.FastZip
# CreateZip(zipFileName, sourceDirectory, recurse, fileFilter)
$zip.CreateZip('C:\Sample.zip', 'C:\BuildFiles\', $true, '^au')
# note - I didn't actually test this code
# I don't have SharpZip downloaded - just read their reference.
Also note that the PowerShell Community Extensions include compression cmdlets such as Write-Zip.
I've tried to replace one of the lengthy build batch files I use with PowerShell. I found it a pain: at least at that time, the documentation focused on the funny verbiage and the cool, Perl-ish things you can do with it, but was lacking in the "getting simple things done" category. I got it working, but the error handling was too shaky.
YMMV; try PowerShell, you might enjoy it. But try it before updating your build batches.
My solution: use a C# console application. I've got serious logging, exception handling, can use my utility functions, and if something doesn't work I have a real debugger. It's the first solution I like to modify.
I'm not sure about powershell, but might I recommend using something like IronPython (if you want to have access to the .NET libraries) or plain python? You get a full-blown programming language with very few limitations.
On the one hand, if it works, just leave it. But it sounds like this is something you'll be adding to over time, and of course your eventual successor/coworker who needs to edit the batch file will also need to understand it. If you're from a programming background then you may well find the power of Powershell makes your script a lot shorter and easier to read/maintain (for example, even just having full if statements and for/while loops). On the other hand if you're not overly familiar with programming, a lot of people find Powershell a bit daunting at first glance.
Regarding the .NET part, Powershell is built on top of .NET so yes, you can access .NET assemblies (but you should always see if there's a cmdlet available first).
I would recommend a book called "The Powershell Cookbook" by Lee Holmes, published by O'Reilly. It provides "recipes" which you can use for common tasks; this will probably speed up your time to implement the script, and it'll teach you Powershell along the way.