How to Read a Word Document in PostgreSQL

I am new to this and not sure whether it can be done. I am expecting a bunch of Word documents daily containing structured data that I need to process, storing the values in my PostgreSQL database. Searching the internet, all I could find was storing the Word document in blob/bytea format and doing encode, decode, etc., which just returns text that I cannot process. Can this be achieved? If so, can you please provide sample code that counts the words/characters/lines in a Word document? I can extend that to my needs and requirements. I am using Ubuntu on AWS, and the server encoding is:
show server_encoding;
UTF8
I have tried the following:
pg_read_file('/var/lib/postgresql/docs/testDoc.docx');
pg_read_binary_file('/var/lib/postgresql/docs/testDoc.docx')
encode(pg_read_binary_file('/var/lib/postgresql/docs/testDoc.docx'),'base64')
decode(encode(pg_read_binary_file('/var/lib/postgresql/docs/testDoc.docx'),'base64'),'base64')::text;
Regards
Bharat

You can check out the Open XML SDK from Microsoft.
This is a .NET-based open-source library that maps Office documents to objects.
With this library you can build, for example, a program that extracts the information and sends the data to your PostgreSQL database.
The library is available for .NET Core as well, so you can build a program that also runs on Ubuntu (the .NET Core package is on NuGet).
Another way is to write a Java program; the concept is the same.
In Java you can use the Apache POI library to read Office documents.
Remember that an Office document is a ZIP-compressed file containing XML data that represents the document.
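To illustrate that last point, here is a rough Python sketch (standard library only, rather than .NET or Java) that unzips the main document part and prints the word/character/line counts asked for in the question; the file name is a placeholder, and it only reads word/document.xml, so headers, footers, and tables are ignored:

import zipfile
import xml.etree.ElementTree as ET

W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'

# a .docx file is a ZIP archive; the body text lives in word/document.xml
with zipfile.ZipFile('testDoc.docx') as z:
    root = ET.fromstring(z.read('word/document.xml'))

# each w:p element is a paragraph; its w:t descendants hold the text runs
paragraphs = [''.join(t.text or '' for t in p.iter(W + 't'))
              for p in root.iter(W + 'p')]

text = '\n'.join(paragraphs)
print('lines:', len(paragraphs))
print('words:', len(text.split()))
print('characters:', len(text))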

One option is to use some front end language to read the docx files and upload them to Postgres.
In Ruby you could:
Install the docx and pg gems:
gem install docx pg
and then create a Ruby file along these lines:
require 'docx'
require 'pg'

doc  = Docx::Document.open('document.docx')
conn = PG.connect(dbname: 'postgres_db', user: 'username', password: 'password')

# insert each paragraph's text as one row in the paragraphs table
doc.paragraphs.each do |p|
  conn.exec_params('INSERT INTO paragraphs (paragraph) VALUES ($1)', [p.to_s])
end
I'm sure this could be done in Python or whatever language you know best; a rough Python equivalent is sketched below.
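For example, a minimal Python sketch using the python-docx and psycopg2 packages (installable with pip) might look like this; the database name, credentials, and the paragraphs table are placeholders to adapt, and error handling is omitted:

import docx              # pip install python-docx
import psycopg2          # pip install psycopg2-binary

doc = docx.Document('testDoc.docx')
conn = psycopg2.connect(dbname='postgres_db', user='username', password='password')

# store each paragraph's text as one row
with conn, conn.cursor() as cur:
    for p in doc.paragraphs:
        cur.execute('INSERT INTO paragraphs (paragraph) VALUES (%s)', (p.text,))
conn.close()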

Related

Can I use a SQL query or script to create format description files for multiple tables in an IBM DB2 for System i database?

I have an AS400 with an IBM DB2 database and I need to create a Format Description File (FDF) for each table in the DB. I can create the FDF file using the IBM Export tool but it will only create one file at a time which will take several days to complete. I have not found a way to create the files systematically using a tool or query. Is this possible or should this be done using scripting?
First of all, to correct a misunderstanding...
A Format Description File has nothing at all to do with the format of a Db2 table. It actually describes the format of the data in a stream file that you are uploading into the Db2 table. Sure you can turn on an option during the download from Db2 to create the FDF file, but it's still actually describing the data in the stream file you've just downloaded the data into. You can use the resulting FDF file to upload a modified version of the downloaded data or as the starting point for creating an FDF file that matches the actual data you want to upload.
Which explains why there's no built-in way to create an appropriate FDF file for every table on the system.
I question why you think you actually need to generate an FDF file for every table.
As I recall, the format of the FDF (or its newer variant, FDFX) is pretty simple; it shouldn't be all that difficult to generate one if you really wanted to. But I don't have one handy at the moment, and my Google-fu has failed me.

AS400 - Parse JSON and store the fields into DB2 table

I have a requirement to parse a JSON document that will be passed to a stored procedure inside a CLOB, and to store its details in a DB2 table.
I cannot make use of the JSON_TABLE function as I am still on IBM i V7R1.
Is there any way I can achieve this?
Before native built-in support for JSON arrived in 7.2, I used YAJL (Yet Another JSON Library), a third-party tool ported to the iSeries and maintained by Scott Klement.
I've used these APIs in production. Excellent work.
Another option is Mihael Schmidt's JSON Parser.

Retrieve data from mongodb with binary type

I have a site that I created using MongoDB, but now I want to create a new site with MySQL. I want to retrieve the data from my old site (the one using MongoDB). I use the Robomongo software to connect to the MongoDB server, but I don't see my old data (*.pdf, *.doc). I think the data is stored in binary, isn't it?
How can I retrieve this data?
The binary data you've highlighted is stored using a convention called GridFS. Robomongo 0.8.x doesn't support decoding GridFS binary data (see: issue #255).
In order to extract the files you'll either need to:
use the command-line mongofiles utility included with MongoDB, for example:
mongofiles list to see the files stored
mongofiles get filename to retrieve a specific file
or use a different program or driver that supports GridFS (a Python sketch of this follows below).
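As an illustration of the driver route, here is a rough Python sketch using pymongo's gridfs module; the connection string, the old_site database name, and the default fs bucket are all assumptions you would adapt:

import gridfs
from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017')
db = client['old_site']        # assumed database name
fs = gridfs.GridFS(db)         # uses the default 'fs' bucket

# list the stored filenames and dump each file to the current directory
for name in fs.list():
    data = fs.get_last_version(name).read()
    with open(name, 'wb') as out:
        out.write(data)
    print('extracted', name, '-', len(data), 'bytes')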

Importing AccessDB and Oracle directly into MongoDB

I am receiving .dmp and .mdb files from a customer and need to get that data into MongoDB.
Is there any way to straight import these file types into Mongo?
The goal is to programmatically ingest these into mongo in any way I can. The only rule is that customer will not change their method of data delivery, meaning I'm stuck with the .dmp and .mdb files as a source.
Any assistance would be greatly appreciated.
Here are a few options/ideas:
Convert the mdb to CSV, then use mongoimport --type csv to import it into MongoDB (see the Python sketch after this list for doing the same load with a driver).
Use an ETL tool, e.g. Pentaho, Informatica, etc. This will give you much more flexibility for doing any necessary transformation/conversion of data.
Write a custom ETL tool, using libraries that know how to read mdb and dmp files.
You don't mention how you plan to use this data, how many tables are in the database, and how normalized the tables are. Depending on the specifics of your use case, it's very possible that loading the data from Access "as is" will not be a good choice since normalized schemas are not a good fit for MongoDB and MongoDB does not natively support joins. This is where an ETL tool can help, by extracting the source data and transforming it into an appropriate JSON structure.
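As a rough sketch of the CSV route (it applies equally if you write a small custom loader instead of using mongoimport), here is a hedged Python example using pymongo; it assumes the .mdb tables have already been exported to CSV, and the file, database, and collection names are placeholders:

import csv
from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017')
collection = client['customer_data']['orders']   # assumed db/collection names

# each CSV row becomes one MongoDB document, keyed by the header row
with open('orders.csv', newline='') as f:
    docs = list(csv.DictReader(f))

if docs:
    result = collection.insert_many(docs)
    print('inserted', len(result.inserted_ids), 'documents')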
MongoDB has released ODBC drivers (see the MongoDB ODBC Drivers page); with these you can connect MS Access directly to MongoDB through ODBC. Voila!

Transferring data from Lotus Notes to DB2 using Agent

Before you say anything: I searched a lot but didn't find how to do this.
I have a database in .NSF format for use in Lotus Notes. I need to write an agent (I know how to) so that data from that database will be automatically transferred to a DB2 database.
So before I create the DB2 tables, how do I know which structure I need to use? How do I check exactly how the data in that .NSF file is stored?
Thanks
Notes documents are unstructured; there's no guarantee that any two documents in a database have the same structure. You will need to decide what data you want to transfer to a relational table, then check each document to see if it contains the corresponding fields (items). You didn't mention what language you're planning to use for your agent; in Java you would use NotesDocument.getItems() to enumerate all items in a document.
As mustaccio also said, since Notes/Domino is a NoSQL database, you don't have a schema.
You should talk to the developer of the application and get an understanding of what data is located where.
You could of course use the Design Synopsis function in Domino Designer to export the actual design, but documents can potentially contain data that does not show up in the design.
If you want to export the documents as XML, I have a tool I wrote available here: http://www.texasswede.com/home.nsf/Page/Notes%20XML%20Exporter
You can export all the documents and then look at the XML to see what data you have.