Is there a PostgreSQL foreign table wrapper that will access zipped CSV files? - postgresql

I'm interested in storing long-term static data outside of the database, ideally in compressed files that are dynamically uncompressed when accessed. I am currently using the existing file_fdw for some purposes, but would really like to be able to compress the data.
We currently use 9.3.

There seems to be such a wrapper here. It requires Multicorn, so you'll have to install that first.
I have not tried it and don't know how well it works.
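For a sense of what such a wrapper involves, here is a minimal sketch of a Multicorn foreign data wrapper that decompresses a gzipped CSV on the fly. The class name and the "filename" option are invented for illustration; the wrapper linked above will differ in its details.

from multicorn import ForeignDataWrapper
import csv
import gzip

class GzipCsvFdw(ForeignDataWrapper):
    # Hypothetical wrapper: reads rows from a gzipped CSV file.
    def __init__(self, options, columns):
        super(GzipCsvFdw, self).__init__(options, columns)
        self.filename = options["filename"]   # e.g. '/archive/data.csv.gz'
        self.columns = columns

    def execute(self, quals, columns):
        # Decompress on the fly and yield one dict per CSV row.
        with gzip.open(self.filename, "rt") as f:
            for row in csv.reader(f):
                yield dict(zip(self.columns, row))

You would then register the class with Multicorn's usual CREATE SERVER / CREATE FOREIGN TABLE options, passing the path to the .gz file as the filename option.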
Alternatively, did you consider using compression at the storage level?

Related

Redshift. COPY from invalid JSON on S3

I am trying to load data into Redshift from JSON file on S3.
But this file contains a format error: the lines are quoted with '$':
${"id":1,"title":"title 1"}$
${"id":2,"title":"title 2"}$
An error was made while exporting data from PostgreSQL.
Now when I try to load data into Redshift, I get the message "Invalid value" for raw_line "$".
Is there any way to escape these symbols using the Redshift COPY command and avoid re-uploading or transforming the data?
MY COMMANDS
-- CREATE TABLE
create table my_table (id BIGINT, title VARCHAR);
-- COPY DATA FROM S3
copy my_table from 's3://my-bucket/my-file.json'
credentials 'aws_access_key_id=***;aws_secret_access_key=***'
format as json 'auto'
Thanks in advance!
I don't think there is a simple "ignore this" option that will work in your case. You could try NULL AS '$' but I expect that will just confuse things in different ways.
Your best bet is to filter the files and replace the originals with the fixed versions. As you note in your comment, downloading them to your system, modifying them, and pushing them back is not a good option due to size; it would hurt you in transfer speed (over the internet) and data-out costs from S3. You want to do this "inside" AWS.
There are a number of ways to do this, and I expect the best choice will be based on what you can do quickly rather than the absolute best way. (It sounds like this is a one-time fix operation.) Here are a few:
Fire up an EC2 instance and do the download-modify-upload process on that system inside of AWS. Remember to have an S3 endpoint in your VPC.
Create a Lambda function to stream the data in, modify it, and push it back to S3. Do this as a streaming process, since you won't want to download very large files to Lambda in their entirety (a rough sketch of the modify step is shown after this answer).
Define a Glue process to strip out the unwanted characters. This will need some custom coding, as your files are not in valid JSON format.
Use CloudShell to download the files, modify them, and upload them. There's a 1 GB storage limit on CloudShell, so this will need to work on smallish chunks of your data, but it doesn't require you to start an EC2 instance. This is a new service, so there may be other issues with this path, but it could be an interesting choice.
There are other choices that are possible (EMR), but these seem like the likely ones. I like playing with new things (especially when they are free), so if it were me I'd try CloudShell.
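Whichever of these you pick (EC2, Lambda, or CloudShell), the modify step itself is small. Here is a rough Python sketch using boto3; the bucket and key names are placeholders, and it buffers the whole object in memory, so for very large files in Lambda you would replace the final put with a streaming/multipart upload.

import boto3

BUCKET = "my-bucket"              # placeholder bucket/keys
SRC_KEY = "my-file.json"
DST_KEY = "fixed/my-file.json"

s3 = boto3.client("s3")
obj = s3.get_object(Bucket=BUCKET, Key=SRC_KEY)

fixed_lines = []
for raw in obj["Body"].iter_lines():
    line = raw.decode("utf-8").strip()
    # ${"id":1,"title":"title 1"}$  ->  {"id":1,"title":"title 1"}
    fixed_lines.append(line.strip("$"))

s3.put_object(Bucket=BUCKET, Key=DST_KEY,
              Body="\n".join(fixed_lines).encode("utf-8"))

After the cleaned files are in place, the original COPY ... format as json 'auto' command should load them without the "Invalid value" error.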

How to store folder in PostgreSQL

We have a requirement to store mail templates in a database. These mail templates use some images, so we are thinking about storing a folder containing the images and the HTML file in the database. Each folder is about 200-300 KB, and we need to store 15-20 templates.
Which column type should we use to store a folder in PostgreSQL? What datatypes are available for this kind of storage, and which performs best?
For storing images you can use PostgreSQL's datatype BYTEA.
For the corresponding HTML of that mail template you could use TEXT, or BYTEA as well.
Just be aware that PG might compress really large chunks of data.
The performance of the TEXT datatype is around 15% better than that of BYTEA, as tested here.
I hope that helped ;)
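As a rough sketch of how that could look from application code (the table layout, folder path, and file names below are invented for illustration; psycopg2 is assumed):

import os
import psycopg2

conn = psycopg2.connect("dbname=mydb")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS mail_template (
        id    serial PRIMARY KEY,
        name  text,
        html  text          -- the template's HTML file
    );
    CREATE TABLE IF NOT EXISTS mail_template_image (
        template_id int REFERENCES mail_template(id),
        filename    text,
        data        bytea   -- raw image bytes
    );
""")

folder = "templates/welcome"
with open(os.path.join(folder, "index.html")) as f:
    cur.execute(
        "INSERT INTO mail_template (name, html) VALUES (%s, %s) RETURNING id",
        ("welcome", f.read()))
template_id = cur.fetchone()[0]

for fname in os.listdir(folder):
    if fname.lower().endswith((".png", ".jpg", ".gif")):
        with open(os.path.join(folder, fname), "rb") as img:
            cur.execute(
                "INSERT INTO mail_template_image VALUES (%s, %s, %s)",
                (template_id, fname, psycopg2.Binary(img.read())))

conn.commit()

At 200-300 KB per template and 15-20 templates, this is well within what BYTEA handles comfortably.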

Import Data to cassandra and create the Primary Key

I've got some CSV data to import into Cassandra. This would normally work with the COPY command. The problem is that the CSV doesn't provide a unique ID for the data, so I need to create a timeuuid on import.
Is it possible to do this via the COPY command, or do I need to write an external script for the import?
I would write a quick script to do it; the COPY command can really only handle small amounts of data anyway. Try the new Python driver. I find it quite fast to set up loading scripts with, especially if you need any sort of minor modification of the data before it is loaded.
If you have a really big set of data, bulk loading is still the way to go.
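A minimal loading script with the Python driver might look like the sketch below; the keyspace, table, and column names are made up, and uuid.uuid1() produces a time-based UUID that Cassandra accepts for timeuuid columns.

import csv
import uuid
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])          # contact point is a placeholder
session = cluster.connect("my_keyspace")  # keyspace/table names are invented

insert = session.prepare(
    "INSERT INTO my_table (id, col1, col2) VALUES (?, ?, ?)")

with open("data.csv", newline="") as f:
    for row in csv.reader(f):
        # Generate the missing unique ID on import as a time-based UUID.
        session.execute(insert, (uuid.uuid1(), row[0], row[1]))

cluster.shutdown()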

Download HTTP data with Postgres StoredProcedure

I am wondering if there is a way to import data from an HTTP source from within a PL/pgSQL function.
I am porting an old system that harvests data from a website. Rather than maintaining a separate set of files to manage the downloading of the data, I was hoping to put the import routines directly into stored procedures.
I do know how to import data with COPY, but that requires the data to already be available locally. Is there a way to download the data with PL/pgSQL? Am I out to lunch?
Related: How to import CSV file data into a PostgreSQL table?
Depending what you're after, the Postgres extension www_fdw might work for you: http://pgxn.org/dist/www_fdw/
If you want to download custom data over HTTP, PostgreSQL's extensive support for different procedural languages might be handy. Here is an example of connecting to the Google Translate service from a Postgres function written in Python:
https://wiki.postgresql.org/wiki/Google_Translate
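As a tiny sketch of that route, a set-returning PL/Python function can pull lines over HTTP so you can insert or parse them in SQL. This assumes the plpythonu language is installed; the function name, URL, and staging table are placeholders.

-- Hypothetical example: stream lines from a URL into the database.
CREATE OR REPLACE FUNCTION fetch_remote_lines(url text)
RETURNS SETOF text AS $$
    import urllib2                       # Python 2 on older installations
    for line in urllib2.urlopen(url):
        yield line.rstrip("\n")
$$ LANGUAGE plpythonu;

-- Usage: load the lines into a staging table, then parse them with SQL.
INSERT INTO staging_raw (line)
SELECT * FROM fetch_remote_lines('http://example.com/data.csv');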

Suggest a Postgres tool to find the difference between the schema and the data

Dear all,
Can anyone suggest a Postgres tool for Linux that can find the difference between two given databases?
I tried apgdiff 2.3, but it gives the difference only in terms of schema, not data,
and I need both!
Thanks in advance!
Comparing data is not easy, especially if your database is huge. I created a Python program that can dump the PostgreSQL schema to a file that can be easily compared with a third-party diff program: http://code.activestate.com/recipes/576557-dump-postgresql-db-schema-to-text/?in=user-186902
I think this program can be extended to dump all table data into separate CSV files, similar to those used by the PostgreSQL COPY command. Remember to add the same ORDER BY to the SELECT ... queries. I have created a tool that reads SELECT statements from a file and saves the results in separate files. This way I can manage which tables and fields I want to compare (not all fields can be used in ORDER BY, and not all are important to me). Such a configuration can easily be created using the "dump schema" utility.
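As a sketch of that dump-and-diff approach (table names, sort keys, and the connection string are placeholders):

import psycopg2

TABLES = [("customers", "id"), ("orders", "id")]   # (table, ORDER BY column)

conn = psycopg2.connect("dbname=db_a")
cur = conn.cursor()
for table, order_col in TABLES:
    with open("db_a_%s.csv" % table, "w") as out:
        # The stable ORDER BY makes both dumps line up row-for-row.
        cur.copy_expert(
            "COPY (SELECT * FROM %s ORDER BY %s) TO STDOUT WITH CSV HEADER"
            % (table, order_col), out)
conn.close()

Run the same script against the second database (db_b) and compare the resulting files with any diff program.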
Check out DBSolo. It does both object and data compares and can create a sync script based on the results. It's free to try and $99 to buy. My guess is the 99 bucks will be money well spent to avoid trying to come up with your own software to do this.
Data Compare
http://www.dbsolo.com/help/datacomp.html
Object Compare
http://www.dbsolo.com/help/compare.html
apgdiff https://www.apgdiff.com/
It's an open-source solution. I used it before for checking differences between dumps. Quite useful.
[EDIT]
It diffs by schema only.