Redshift: COPY from invalid JSON on S3

I am trying to load data into Redshift from JSON file on S3.
But the file contains a format error: each line is wrapped in the quote character '$'
${"id":1,"title":"title 1"}$
${"id":2,"title":"title 2"}$
An error was made while exporting data from PostgreSQL.
Now when I try to load data into Redshift, I get the message "Invalid value" for raw_line "$".
Is there any way to escape these symbols using the Redshift COPY command and avoid re-uploading or transforming the data?
MY COMMANDS
-- CREATE TABLE
create table my_table (id BIGINT, title VARCHAR);
-- COPY DATA FROM S3
copy my_table from 's3://my-bucket/my-file.json'
credentials 'aws_access_key_id=***;aws_secret_access_key=***'
format as json 'auto'
Thanks in advance!

I don't think there is a simple "ignore this" option that will work in your case. You could try NULL AS '$' but I expect that will just confuse things in different ways.
Your best bet is to filter the files and replace the originals with the fixed versions. As you note in your comment, downloading them to your own system, modifying them, and pushing them back is not a good option due to their size; it would cost you both in transfer speed (over the internet) and in data-out charges from S3. You want to do this "inside" AWS.
There are a number of ways to do this, and I expect the best choice will come down to what you can do quickly rather than the absolute best way. (It sounds like this is a one-time fix.) Here are a few:
Fire up an EC2 instance and do the download-modify-upload process on it, inside of AWS. Remember to have an S3 endpoint in your VPC.
Create a Lambda function to stream the data in, modify it, and push it back to S3. Do this as a streaming process, since you won't want to download very large files to Lambda in their entirety; a minimal sketch of this option is shown at the end of this answer.
Define a Glue process to strip out the unwanted characters. This will need some custom coding, as your files are not in a valid JSON format.
Use CloudShell to download the files, modify them, and upload them back. There's a 1GB storage limit on CloudShell, so this will need to work on smallish chunks of your data, but it doesn't require you to start an EC2 instance. This is a new service, so there may be other issues with this path, but it could be an interesting choice.
There are other possible choices (EMR, for example), but these seem like the likely ones. I like playing with new things (especially when they're free), so if it were me I'd try CloudShell.
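For illustration, here is a minimal sketch of the Lambda/streaming option, assuming boto3 and that every bad line is simply wrapped in a leading and trailing '$'; the bucket and key names below are placeholders, not your real ones.

import boto3

s3 = boto3.client("s3")
BUCKET = "my-bucket"            # placeholder bucket
SRC_KEY = "my-file.json"        # placeholder source key
DST_KEY = "fixed/my-file.json"  # cleaned copy written alongside

def fixed_lines():
    # Stream the object line by line instead of downloading it whole.
    body = s3.get_object(Bucket=BUCKET, Key=SRC_KEY)["Body"]
    for raw in body.iter_lines():
        line = raw.decode("utf-8").strip()
        if line:
            yield line.strip("$") + "\n"   # drop the stray '$' wrappers

# Buffering the cleaned output is fine for modest files; for very large
# objects you would switch this to a multipart upload.
s3.put_object(Bucket=BUCKET, Key=DST_KEY,
              Body="".join(fixed_lines()).encode("utf-8"))

Once the cleaned copies are in place, the COPY command from the question should work unchanged against the new prefix.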

Related

Insight into Redshift Spectrum query error

I'm trying to use Redshift Spectrum to query data in s3. The data has been crawled by Glue, I've run successful data profile jobs on the files with DataBrew (so I know Glue has correctly read it), and I can see the correct tables in the query editor after creating the schema. But when I try to run simple queries I get one of two errors: if it's a small file I get: "ERROR: Parsed manifest is not a valid JSON object...."; if it's a large file I get: "ERROR: Manifest too large Detail:...". I suspect it's looking for or believes that the file in the query is a manifest, but I have no idea why or how to address it. I've followed the documentation as rigorously as possible, and I've replicated the process via a screen share with an AWS tech support rep who is also stumped.
Discovered the issue: the error was happening because I had more than one type of file (i.e., files with differing layouts) in the same S3 folder. There may be other ways to solve the problem, but isolating one type of file in a given S3 folder solved it and allowed Redshift Spectrum to successfully execute queries against my file(s).
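As a rough illustration of that fix, here is a boto3 sketch that routes each file layout into its own prefix so the external table's LOCATION only ever sees one schema; the bucket, prefixes, and filename patterns are all placeholders.

import boto3

s3 = boto3.client("s3")
BUCKET = "my-bucket"   # placeholder

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix="raw/"):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        name = key.split("/")[-1]
        # Route each layout to its own folder, e.g. by filename pattern.
        if name.endswith("_orders.csv"):
            dest = "spectrum/orders/" + name
        elif name.endswith("_customers.csv"):
            dest = "spectrum/customers/" + name
        else:
            continue
        s3.copy_object(Bucket=BUCKET, Key=dest,
                       CopySource={"Bucket": BUCKET, "Key": key})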

Using Data compare to copy one database over another

I've used the Data Compare tool to update schema between the same DBs on different servers, but what if so many things have changed (including data) that I simply want to REPLACE the target database?
In the past I've just used T-SQL: take a backup, then restore onto the target with the REPLACE option (and/or MOVE if the data and log files are on different drives). I'd rather have an easier way to do this.
You can use Schema Compare (also by Red Gate) to compare the schema of your source database to a blank target database (and update), then use Data Compare to compare the data in them (and update). This should leave you with the target the same as the source. However, it may well be easier to use the backup/restore method in that instance.
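If the backup/restore route wins out, it can also be scripted rather than done by hand. A minimal sketch driven from Python via pyodbc, with the server, database, and file paths all as placeholders:

import pyodbc

# RESTORE cannot run inside a transaction, hence autocommit=True.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=target-server;"
    "DATABASE=master;Trusted_Connection=yes;",
    autocommit=True,
)
cur = conn.cursor()
cur.execute(r"""
RESTORE DATABASE MyDb
FROM DISK = N'\\backups\MyDb_source.bak'
WITH REPLACE,
     MOVE N'MyDb_Data' TO N'D:\Data\MyDb.mdf',
     MOVE N'MyDb_Log'  TO N'L:\Logs\MyDb_log.ldf';
""")
# Drain the informational result sets so the restore runs to completion.
while cur.nextset():
    pass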

Tableau TDE or connect to files directly?

I have a personal license for Tableau. I am using it to connect to .csv and .xlsx files currently but am running into some issues.
1) The .csv files are massive (10+ gig)
2) The Excel files are starting to reach the 1mil row limit
3) I need to add certain columns to the .csv files sometimes (like a unique ID and a few formulas), which means that I need to open sections of them in Excel, modify what I need to, then save a new file
Would it be better to create an extract for each of these files and then connect the Tableau Workbook to the extract instead of the file? Currently I am connected directly to files and then extract data from there and refresh everyday.
I don't know about others, but I'm using exactly that guideline. I'll have some workbooks that simply serve to extract data from some data source (be it SQL, xlsx, csv, mdb, or any other), and all analysis will be performed in other workbooks that connect only to TDEs.
The advantages are:
1) Whenever you need to update a data source, you only need to update it once (and replace the TDE file) and all your workbooks will be up to date. If you connect to the same data source and extract to different TDE files, you'll have to refresh every one of those TDE files (and worry about whether the extract in that specific workbook was updated). And even if you extract to the same TDE from several workbooks (which doesn't make much sense), it can be confusing (am I connected to the TDE or to the file? Did the extract I made in the other workbook update this one too? Well, yes it did, but it can be confusing)
2) You don't have to worry about replacing a datasource, especially when it's a csv, xlsx or mdb file. You can keep many different versions of those files, and choose which one is the best one. For instance, I'll have table_v1.mdb, table_v2.mdb, ..., and a single table_v1.tde, which will be the extract of one of those mdb files. And I still have the previous versions in case I need them.
3) When you have a SQL connection, or anything that is not a file (csv, xlsx, mdb), extracts are very handy for basically the same reasons above, with (at least) one upside. You don't need to connect to a server every time you want to perform an analysis. That means you can do everything offline, and the person using Tableau doesn't need to have access to the SQL table (or any other source).
One good practice is always keeping a back-up when updating a tde (because, well, shit happens)
10 gig csv, wow. Yes, you should absolutely use a data extract, that would be much quicker. For that much data you could look at other connections such as MS Access or a SQL instance.
If your data have that many rows, I would try to set up a small MySQL instance on your local machine and keep the data there instead. You would be able to connect Tableau directly to the MySQL instance and would be able to easily edit the source data.
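A minimal sketch of that suggestion, assuming a local MySQL server, the mysql-connector-python package, and placeholder file, table, and column names; loading in chunks keeps memory bounded even for a 10+ GB file.

import csv
import mysql.connector

conn = mysql.connector.connect(user="tableau", password="***",
                               host="localhost", database="tableau_src")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS big_csv (
        id BIGINT, title VARCHAR(255), amount DECIMAL(18,2)
    )
""")

with open("big_file.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)                       # skip the header row
    batch = []
    for row in reader:
        batch.append(row)
        if len(batch) == 10000:        # insert in chunks, not row by row
            cur.executemany("INSERT INTO big_csv VALUES (%s, %s, %s)", batch)
            conn.commit()
            batch.clear()
    if batch:
        cur.executemany("INSERT INTO big_csv VALUES (%s, %s, %s)", batch)
        conn.commit()

Tableau then connects to the MySQL table directly, and adding a unique ID or a computed column becomes an ALTER TABLE/UPDATE on the database instead of editing the CSV in Excel.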

Can COPY FROM tolerantly consume bad CSV?

I am trying to load text data into a postgresql database via COPY FROM. Data is definitely not clean CSV.
The input data isn't always consistent: sometimes there are excess fields (separator is part of a field's content) or there are nulls instead of 0's in integer fields.
The result is that PostgreSQL throws an error and stops loading.
Currently I am trying to massage the data into consistency via perl.
Is there a better strategy?
Can PostgreSQL be asked to be as tolerant as mysql or sqlite in that respect?
Thanks
PostgreSQL's COPY FROM isn't designed to handle bodgy data and is quite strict. There's little support for tolerance of dodgy data.
I thought there was little interest in adding any until I saw this proposed patch posted just a few days ago for possible inclusion in PostgreSQL 9.3. The patch has been resoundingly rejected, but shows that there's some interest in the idea; read the thread.
It's sometimes possible to COPY FROM into a staging TEMPORARY table that has all text fields with no constraints. Then you can massage the data using SQL from there. That'll only work if the CSV is at least well-formed and regular, though, and it doesn't sound like yours is.
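For what it's worth, a minimal sketch of that staging-table approach, assuming psycopg2 and placeholder table and column names; again, it only helps if the file parses as CSV in the first place.

import psycopg2

conn = psycopg2.connect("dbname=mydb")
cur = conn.cursor()

# All-text staging table: nothing can fail a type check on the way in.
cur.execute("""
    CREATE TEMPORARY TABLE staging (
        id   text,
        qty  text,
        note text
    )
""")
with open("dirty.csv") as f:
    cur.copy_expert("COPY staging FROM STDIN WITH (FORMAT csv)", f)

# Massage in SQL: blanks in integer columns become 0, the rest is cast.
cur.execute("""
    INSERT INTO target (id, qty, note)
    SELECT id::bigint,
           COALESCE(NULLIF(qty, '')::int, 0),
           note
    FROM staging
""")
conn.commit()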
If the data isn't clean, you need to pre-process it with a script in a suitable scripting language.
Have that script:
Connect to PostgreSQL and INSERT rows;
Connect to PostgreSQL and use the scripting language's Pg APIs to COPY rows in; or
Write out clean CSV that you can COPY FROM
Python's csv module can be handy for this. You can use any language you like: Perl, Python, PHP, Java, C, whatever.
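For example, a minimal sketch of the pre-processing route with Python's csv module; the expected column count and the repair rules are placeholders for whatever your data actually needs.

import csv

EXPECTED_FIELDS = 3   # hypothetical width of a good row

with open("dirty.csv", newline="") as src, \
     open("clean.csv", "w", newline="") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for row in reader:
        # Excess fields: assume the separator leaked into the last column.
        if len(row) > EXPECTED_FIELDS:
            row = row[:EXPECTED_FIELDS - 1] + [",".join(row[EXPECTED_FIELDS - 1:])]
        # Short rows are padded so the column count is always consistent.
        if len(row) < EXPECTED_FIELDS:
            row += [""] * (EXPECTED_FIELDS - len(row))
        # Blanks in an integer column (column 1 here, hypothetically) become 0.
        if row[1].strip() == "":
            row[1] = "0"
        writer.writerow(row)

The cleaned file can then go through COPY FROM as-is, or the script could use the driver's COPY support directly and skip the intermediate file.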
If you were enthusiastic you could write it in PL/Perlu or PL/Pythonu, inserting the data as you read it and clean it up. I wouldn't bother.

Extract Active Directory into SQL database using VBScript

I have written a VBScript to extract data from Active Directory into a record set. I'm now wondering what the most efficient way is to transfer the data into a SQL database.
I'm torn between:
Writing it to an Excel file and then firing an SSIS package to import it, or...
Within the VBScript, iterating through the dataset in memory and submitting 3000+ INSERT commands to the SQL database
Would the latter option result in 3000+ round trips communicating with the database and therefore be the slower of the two options?
Sending an insert row by row is always the slowest option. This is what is known as Row by Agonizing Row or RBAR. You should avoid that if possible and take advantage of set based operations.
Your other option, writing to an intermediate file, is a good one; I agree with @Remou in the comments that you should probably pick CSV rather than Excel if you go that route.
I would propose a third option. You already have the design in VB contained in your VBScript, so you should be able to convert it easily to a Script Component in SSIS. Create an SSIS package, add a Data Flow task, add a Script Component (as a data source) to the flow, write your fields out to the output buffer, and then add a SQL Server destination, saving yourself the step of writing to an intermediate file. This is also more secure, as you don't have your AD data sitting on disk in plaintext anywhere during the process.
You don't mention how often this will run or if you have to run it within a certain time window, so it isn't clear that performance is even an issue here. "Slow" doesn't mean anything by itself: a process that runs for 30 minutes can be perfectly acceptable if the time window is one hour.
Just write the simplest, most maintainable code you can to get the job done and go from there. If it runs in an acceptable amount of time then you're done. If it doesn't, then at least you have a clean, functioning solution that you can profile and optimize.
If you already have it in a dataset, and if it's SQL Server 2008+, create a user-defined table type and send the whole dataset in as an atomic unit.
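A minimal sketch of that idea from client code, assuming pyodbc's table-valued-parameter support and hypothetical type, procedure, table, and column names (the original post uses VBScript; Python is used here purely for illustration).

import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
    "DATABASE=Staging;Trusted_Connection=yes;"
)
cur = conn.cursor()

# One-time setup: a table type matching the AD fields and a procedure
# that inserts the whole batch with a single set-based statement.
cur.execute("""
    CREATE TYPE dbo.AdUserType AS TABLE (
        sam_account_name NVARCHAR(256),
        display_name     NVARCHAR(256),
        mail             NVARCHAR(256)
    )
""")
cur.execute("""
CREATE PROCEDURE dbo.InsertAdUsers @users dbo.AdUserType READONLY
AS
    INSERT INTO dbo.AdUsers (sam_account_name, display_name, mail)
    SELECT sam_account_name, display_name, mail FROM @users
""")
conn.commit()

# All 3000+ rows travel as a single parameter in one round trip.
rows = [("jdoe", "Jane Doe", "jdoe@example.com"),
        ("bsmith", "Bob Smith", "bsmith@example.com")]
cur.execute("{CALL dbo.InsertAdUsers (?)}", (rows,))
conn.commit()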
And if you go the SSIS route, I have a post covering Active Directory as an SSIS Data Source.