Could someone please guide me on how to extract a .docx file and load it into a database using an ETL (Extract-Transform-Load) or ELT (Extract-Load-Transform) tool?
Assuming that the .docx file contains mostly unstructured data, isn't it an ELT tool I should go for instead of ETL?
The ETL and ELT tools I have found so far don't support MS Word as a source. What other way is there to extract the content of a .docx file and store it in a database?
My requirement is to:
Extract the data inside the .docx file,
Convert it into meaningful data, and
Store it in a data lake so I can perform data analysis and make productive decisions based on the results.
It's just like how e-commerce companies convert customer reviews into meaningful data so they can take decisions to boost their sales. In my case, it's Word files I need to analyze.
I'm asking this because I've searched for so many ETL and ELT tools but couldn't find anything that supported Word files. Maybe it's because I haven't been searching for the right tool or the right way to do it?
If somebody knows a way, please guide me through the process. What should I start looking for? A tool, or a way to code the entire thing?
I've been looking for an answer for weeks now but didn't find a helpful answer. And it's starting to get really frustrating to see all the tools supporting every other component like social media, MongoDB, or whatever EXCEPT Word files.
You have to do this in 2 steps:
Extract the data from the .docx file to txt or xml
Then use SSIS to import it (or Azure Data Factory if you are in the cloud).
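The first step can be done without Word itself, because a .docx file is a zip archive whose main body lives in `word/document.xml`. Below is a minimal sketch using only the Python standard library; the function name `docx_to_text` is made up for illustration, and the sketch ignores headers, footers, footnotes, and any formatting.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(source):
    """Return the body text of a .docx, one line per paragraph.

    `source` may be a file path or a binary file-like object.
    (Hypothetical helper, not part of any library.)
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is spread across one or more <w:t> elements.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        paragraphs.append(text)
    return "\n".join(paragraphs)
```

Calling `docx_to_text("report.docx")` (the file name is a placeholder) gives plain text that can be written to a .txt file and picked up by whichever import tool you use in step 2.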
I need to export courses from Moodle. Since it is a very closed application and the courses are in Moodle format, is there any way to extract the contents/metadata from that format to facilitate the migration to DSpace?
I know it is possible to do this by hand, but that would take a lot of time, because DSpace and Moodle use very different and complex databases.
Moodle exports courses with a .mbz extension. Simply rename it to .zip and you can extract the XML files from inside. These files will have all the information you need. You could potentially create a tool that programmatically extracts this information and imports it to DSpace.
Also, Moodle is open source, not a closed application. Source available here: https://github.com/moodle/moodle
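As a starting point for such a tool, here is a minimal sketch that lists the XML files inside a backup without renaming it first. The function name `list_backup_xml` is made up for illustration. The fallback branch is an assumption: as far as I know, some Moodle versions write .mbz backups as gzip-compressed tar archives rather than zip files, so the sketch tries both.

```python
import tarfile
import zipfile


def list_backup_xml(mbz_path):
    """Return the sorted names of the XML files inside a Moodle .mbz backup.

    (Hypothetical helper, not part of Moodle or DSpace.)
    """
    if zipfile.is_zipfile(mbz_path):
        with zipfile.ZipFile(mbz_path) as archive:
            names = archive.namelist()
    else:
        # Assumption: a backup that is not a zip is a gzip-compressed tar file.
        with tarfile.open(mbz_path, "r:gz") as archive:
            names = archive.getnames()
    return sorted(name for name in names if name.endswith(".xml"))
```

From there, each XML file can be parsed with `xml.etree.ElementTree` and mapped to whatever metadata fields the DSpace import needs.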
I am currently attempting to convert a couple of .NET desktop applications that I have developed into a web application harnessing AngularJS and RESTful services.
One of the key components of these applications is in their ability to generate Word documents on the fly using a .dotx Word template. I am currently exploring the possibility of using a third party library called DocX to generate these Word documents without resorting to using a template.
I guess my question is: Can I use this library to read an existing Word document in .docx format and generate a source code representation of the document? If this is possible could someone point me in the direction of any code samples that I could use? I have looked around and have been unable to find anything that could help me get started.
Generating a code representation of the document and using it with DocX seems like a time-consuming effort to me. Why not use a template instead and fill it with data at runtime?
I have some experience with Docentric, which is a third-party OpenXML toolkit. It features a Word add-in for template design and libraries for document generation and manipulation. It took me less than a week to generate pretty complex documents. If I were in your shoes, I would definitely try some third-party toolkits. They cost money but save time, so do some math and see if they can be useful for you.
It is possible to read an existing Word document in .docx format with the following code:
DocX document = DocX.Load(filename);
However, it is not possible to generate a source code representation of a document.
I have a Word document (a template) containing placeholders for data to be filled in, and there are several Word documents like this in a directory. When data arrives, I choose a template (based on some criteria) and fill in the data, and then the document has to be converted to PDF format.
I have been investigating Apache POI for this. If anyone has a good suggestion, it would be much appreciated.
As mbeckish mentioned, you should indicate how you are going to run/automate this. For example, is it a one-off, run by hand, or part of another program (and if so, which programming languages do you use)?
If you are trying to automate it, JODReports and Docmosis are tools that can use templates as you require and can produce PDF. JODReports is free. Docmosis is not, but it has several APIs. Please note that I work for the company that develops Docmosis.
Hope that helps.
I've just uploaded this presentation, which presents three approaches for doing this.
Why not use any of the existing PDF virtual printers?
I need to read Excel data using XmlDocument. Please help me.
You shouldn't.
It is better to use the Office Open XML libraries from Microsoft.
I can't provide code, but you would have to extract the .xlsx contents with System.IO.Packaging, find the sheet you need, and then load it into an XmlDocument.
Be advised, though, that doing so is quite tricky and has many caveats.
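To show what that manual route involves, here is a rough sketch of the same idea in Python rather than .NET, with the standard `zipfile` module standing in for System.IO.Packaging and `ElementTree` for XmlDocument. The function name `read_sheet` and the default part path `xl/worksheets/sheet1.xml` are assumptions made for illustration. A robust reader would resolve the sheet through `xl/workbook.xml` and its relationships, which is one of the caveats.

```python
import zipfile
import xml.etree.ElementTree as ET

# SpreadsheetML main namespace
S_NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def read_sheet(source, sheet_part="xl/worksheets/sheet1.xml"):
    """Return {cell reference: raw value as string} for one worksheet part.

    (Hypothetical helper; the default sheet path is an assumption.)
    """
    with zipfile.ZipFile(source) as archive:
        shared = []
        if "xl/sharedStrings.xml" in archive.namelist():
            sst = ET.fromstring(archive.read("xl/sharedStrings.xml"))
            for si in sst.iter(S_NS + "si"):
                # A shared string may be split across several <t> elements.
                shared.append("".join(t.text or "" for t in si.iter(S_NS + "t")))
        sheet = ET.fromstring(archive.read(sheet_part))

    cells = {}
    for cell in sheet.iter(S_NS + "c"):
        value = cell.find(S_NS + "v")
        if value is None or value.text is None:
            continue  # empty cells and inline strings are skipped in this sketch
        if cell.get("t") == "s":
            # Type "s" means the value is an index into the shared string table.
            cells[cell.get("r")] = shared[int(value.text)]
        else:
            cells[cell.get("r")] = value.text
    return cells
```

Even this small sketch has to deal with the shared string table, and it still ignores dates, number formats, formulas, and inline strings, which is why a dedicated library is usually the better choice.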
I'm trying to generate Word documents using the Open XML SDK. When the documents are small, this is no problem (and rather easy). When the documents become larger (500+ pages), I notice that performance (duration, memory usage, ...) degrades significantly.
Googling this problem, I came across some posts that point out the same issue. For Excel, there is a solution with SpreadsheetGear.
I would like to know if there is a Word alternative to this, or if there are other solutions for generating Word documents.
Thanks,
Jelle
I've written a blog post series on generating Open XML WordprocessingML documents. The approach that I take is that you create a template Word document, insert content controls, and then write XPath expressions in those content controls to specify the XML to pull from a source XML data file. I've also explored another approach where you write C# code in Open XML content controls. That approach also works.
http://ericwhite.com/blog/map/generating-open-xml-wordprocessingml-documents-blog-post-series/
-Eric
You might look at http://docx.codeplex.com/
On Java, you could use docx4j. If you were brave, you could create DLLs for it via IKVM...
I decided to go with Aspose Words. It is really fast and not very demanding on resources (CPU, memory). Its disadvantage is that it is quite expensive. I also investigated SoftArtisans OfficeWriter. The possibilities are the same, but because the company I'm currently working for already used other Aspose components, we decided to go with Aspose Words.