I want to upload different CSV files into MySQL to use as JOINS. Should I upload it separately and use joins in the schema?
Related
We have files in ORC format stored in S3, and we want to load them into an AWS Aurora PostgreSQL database.
What we found on the internet was:
Postgres supports CSV, TXT and other formats, but not ORC.
INSERT OVERWRITE DIRECTORY '<Hdfs-Directory-Path>' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE SELECT * FROM default.foo;
Can anyone please help us find a solution?
To date, PostgreSQL on Aurora supports ingestion of data from S3 through the COPY command only from TXT and CSV files.
Since your files are in ORC format, you could convert these files to either CSV or TXT and then ingest the data. You could do this very easily with Athena, by simply creating a table for your original data and running a SELECT * FROM table query. As explained in the Working with Query Results, Output Files, and Query History page, this will automatically generate a CSV file containing the results.
This would not be optimal, as you'd pay not only the transform price but also for the storage twice (once for the original ORC and once for the converted CSV), but it would allow you to convert the data pretty easily.
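Once Athena has written the CSV result to S3, the load step could look roughly like the sketch below. It only builds the SQL text for the `aws_s3.table_import_from_s3` function that Aurora PostgreSQL provides through the `aws_s3` extension; it assumes the extension is installed and the cluster has an IAM role that can read the bucket. The table, bucket, key and region values are placeholders, not names from the question.

```python
def build_s3_import_sql(table, bucket, key, region, header=True):
    """Return a SQL statement that imports a CSV object from S3 into `table`.

    Assumes the aws_s3 extension is installed on the Aurora PostgreSQL cluster.
    """
    def quote(value):
        # Escape single quotes so the value is a valid SQL string literal.
        return "'" + value.replace("'", "''") + "'"

    # Athena result files start with a header row, so HEADER true is the default.
    options = "(FORMAT csv, HEADER true)" if header else "(FORMAT csv)"
    return (
        "SELECT aws_s3.table_import_from_s3("
        f"{quote(table)}, '', {quote(options)}, "
        f"aws_commons.create_s3_uri({quote(bucket)}, {quote(key)}, {quote(region)}))"
    )


# Placeholder names for illustration only.
print(build_s3_import_sql("foo", "my-bucket", "athena/results.csv", "us-east-1"))
```

The generated statement would then be run against the database with any PostgreSQL client; the target table has to exist already with columns matching the CSV.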
A better way to do it would be to use a service like AWS Glue, which supports S3 as a source and has an Aurora connector. Using this method would give you an actual ETL, and even if you only need the E(xtract) and L(oad) for now, it would still leave the door open for any kind of transform you might need in the future.
In this AWS Blog titled How to extract, transform, and load data for analytic processing using AWS Glue (Part 2) they show the opposite flow (Aurora->S3 via Glue), but it should still give you an idea of the process.
I have an AWS Glue crawler that creates a data catalog with all the tables from an S3 directory that contains Parquet files.
I need to copy the contents of these files/tables to a Redshift table.
I have a few tables where the data in the Parquet files is too large for the Redshift column; VARCHAR(6635) is not sufficient.
Ideally, I would like to truncate these tables.
How do I use the COPY command to load this data into Redshift?
If I use Spectrum, I can only use INSERT INTO from the external table to the Redshift table, which I understand is slower than a bulk copy?
You can use string instead of varchar(6635) (this can be edited in the catalog as well). If that does not work, can you elaborate more on the problem? If the files are in Parquet, then most of the data conversion parameters that COPY provides, such as ESCAPE and NULL AS, cannot be used.
https://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html
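If truncation is the goal and the Spectrum route is acceptable despite being slower, one possible approach is to cut the long columns while selecting from the external table. The sketch below only generates that INSERT statement; the table names, column names and the 6635 limit are taken from the question or invented for illustration, and this is not something the answer above prescribes. Note that `LEFT` counts characters while Redshift VARCHAR lengths are in bytes, so multi-byte data may need a smaller limit.

```python
def build_truncating_insert(target, external_table, columns, max_len):
    """Build an INSERT ... SELECT that truncates text columns to max_len characters.

    columns is a list of (name, is_text) pairs; names are hypothetical examples.
    """
    select_list = ", ".join(
        # Only text columns are wrapped in LEFT(); others pass through unchanged.
        f"LEFT({name}, {max_len}) AS {name}" if is_text else name
        for name, is_text in columns
    )
    return f"INSERT INTO {target} SELECT {select_list} FROM {external_table};"


print(build_truncating_insert(
    "public.events",          # hypothetical Redshift target table
    "spectrum.events",        # hypothetical external (Spectrum) table
    [("id", False), ("payload", True)],
    6635,
))
```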
I am planning a project that would involve hosting a very large CSV or JSON file on Github which I would want to be able to query.
The challenge is querying the database file in place, rather than downloading it for parsing.
How can I efficiently query a database file without server-side logic to parse the database for the client?
Currently we have a workbook developed in Tableau using an Oracle server as the data store, where we have all our tables and views. We are now migrating to Redshift for better performance. Redshift has the same table structure as Oracle, with the same table names and field names. The Tableau workbook is already developed, and we need to point it to the Redshift tables and views. How do we point the existing workbook to Redshift?
Also let me know any other inputs in this regard.
Thanks,
Raj
Use the Replace Data Source functionality of Tableau Desktop
You can bypass Replace Data Source and move data directly from Oracle to Redshift using bulk loaders.
A simple combo of SQL*Plus + Python + boto + psycopg2 will do the job.
It should:
Open read pipe from Oracle SQL*Plus
Compress data stream
Upload compressed stream to S3
Bulk append data from S3 to Redshift table.
You can check an example of how to extract table or query data from Oracle and then load it into Redshift using the COPY command from S3.
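The compress and bulk-append steps above can be sketched as follows. This is a minimal, self-contained illustration rather than the linked example: the rows are passed in directly instead of being read from a SQL*Plus pipe, and the S3 upload (boto) and statement execution (psycopg2) are left out. The table name, S3 URI and IAM role are placeholders.

```python
import csv
import gzip
import io


def gzip_rows(rows):
    """Serialize rows as CSV and gzip the result in memory."""
    text = io.StringIO()
    csv.writer(text).writerows(rows)
    return gzip.compress(text.getvalue().encode("utf-8"))


def build_copy_sql(table, s3_uri, iam_role):
    """Build a Redshift COPY statement for a gzipped CSV object in S3."""
    return (
        f"COPY {table} FROM '{s3_uri}' "
        f"IAM_ROLE '{iam_role}' "
        "FORMAT AS CSV GZIP;"
    )


# Rows would normally come from the Oracle extract; these are stand-ins.
payload = gzip_rows([(1, "alpha"), (2, "beta,gamma")])
print(len(payload), "compressed bytes ready for upload")
print(build_copy_sql(
    "public.orders",                               # placeholder table
    "s3://my-bucket/orders.csv.gz",                # placeholder object
    "arn:aws:iam::123456789012:role/RedshiftCopy", # placeholder role
))
```

For large tables the data would be streamed and uploaded in parts rather than held in memory as it is here.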
I am faced with a situation where we get a lot of CSV files from different clients, but there is always some issue with the column count and column length that our target table is expecting.
What is the best way to handle frequently changing CSV files? My goal is to load these CSV files into a Postgres database.
I checked the \COPY command in Postgres, but it does not have an option to create a table.
You could try creating a pg_dump compatible file instead which has the appropriate "create table" section and use that to load your data instead.
I recommend using an external ETL tool like CloverETL, Talend Studio, or Pentaho Kettle for data loading when you're having to massage different kinds of data.
\copy is really intended for importing well-formed data in a known structure.
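If a full ETL tool is more than you need, one lightweight option is to generate a staging table from each file's header and \copy into that. The sketch below is an assumption-laden illustration, not part of the answers above: it treats every column as TEXT so that length mismatches cannot fail the load, derives column names from the header row, and does not handle duplicate header names. The table name and sample data are made up.

```python
import csv
import io
import re


def create_table_sql(table, csv_text):
    """Build a CREATE TABLE statement with one TEXT column per CSV header field."""
    header = next(csv.reader(io.StringIO(csv_text)))
    columns = []
    for i, name in enumerate(header):
        # Normalize the header into a safe identifier; fall back to col_<i> if empty.
        clean = re.sub(r"\W+", "_", name.strip().lower()).strip("_") or f"col_{i}"
        columns.append(f'"{clean}" TEXT')
    return f'CREATE TABLE "{table}" ({", ".join(columns)});'


sample = "Client ID,Full Name,\n1,Bob,x\n"
print(create_table_sql("staging_client_a", sample))
```

After loading into the all-TEXT staging table, the data can be validated and cast into the real target table with ordinary SQL.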