Insight into Redshift Spectrum query error - amazon-redshift

I'm trying to use Redshift Spectrum to query data in S3. The data has been crawled by Glue, I've run successful data profile jobs on the files with DataBrew (so I know Glue has read it correctly), and I can see the correct tables in the query editor after creating the external schema. But when I try to run simple queries I get one of two errors: for a small file it's "ERROR: Parsed manifest is not a valid JSON object...."; for a large file it's "ERROR: Manifest too large Detail:...". I suspect Spectrum is treating the file referenced in the query as a manifest, but I have no idea why or how to address it. I've followed the documentation as rigorously as possible, and I've replicated the process on a screen share with an AWS tech support rep, who is also stumped.

Discovered the issue: the error was happening because I had more than one type of file (i.e., files with differing layouts) in the same S3 folder. There may be other ways to solve the problem, but isolating one type of file per S3 folder solved it and allowed Redshift Spectrum to successfully execute queries against my file(s).
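If anyone needs to untangle an existing folder, below is a minimal sketch of how the cleanup could look with boto3; the bucket name, prefixes, and the filename test are all assumptions for illustration, not details from my actual setup. The point is just that each external table's location ends up containing a single layout.
import boto3

s3 = boto3.client("s3")
BUCKET = "my-spectrum-bucket"      # hypothetical bucket
MIXED_PREFIX = "raw/mixed/"        # folder that currently holds both layouts
ORDERS_PREFIX = "raw/orders/"      # new folder for one layout only

def isolate_layout(layout_suffix="_orders.csv"):
    # Copy every object matching one layout into its own prefix so the
    # external table's LOCATION only ever sees files with that layout.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=MIXED_PREFIX):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith(layout_suffix):            # assumed naming convention
                new_key = ORDERS_PREFIX + key.split("/")[-1]
                s3.copy_object(Bucket=BUCKET, Key=new_key,
                               CopySource={"Bucket": BUCKET, "Key": key})
After the copy you would point the external table (or re-run the crawler) at the new prefix and delete the originals once verified.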

Related

Why does PySpark throw "AnalysisException: `/path/to/adls/mounted/interim_data.delta` is not a Delta table" even though the file exists...?

I am using Databricks on Azure.
PySpark reads data that's dumped in Azure Data Lake Storage (ADLS).
Every now and then, when I try to read the data from ADLS like so:
spark.read.format('delta').load('/path/to/adls/mounted/interim_data.delta')
it throws the following error:
AnalysisException: `/path/to/adls/mounted/interim_data.delta` is not a Delta table.
The data definitely exists; the folder contents and files show up when I run
%fs ls /path/to/adls/mounted/interim_data.delta
Right now the only fix is to re-run the script that populated the above interim_data.delta table, which is not a viable fix.
Make sure you have copied the data in Delta format correctly.
Below is the standard command to do so:
df = spark.read.format(file_type).option("header", "true").option("inferSchema", "true").option("delimiter", '|').load(file_location)
df.write.format("delta").save(delta_path)  # e.g. "/mnt/delta/events"
You access data in Delta tables either by specifying the path on DBFS ("/mnt/delta/events") or the table name ("events"). Make sure the path or table name is in the correct format. Please refer to the example below:
val events = spark.read.format("delta").load("/mnt/delta/events")
Refer to https://learn.microsoft.com/en-us/azure/databricks/delta/quick-start#read-a-table to learn more about Delta Lake.
Feel free to ask in case you have any query.
I am answering my own question...
TL;DR: Root cause of the issue: frequent remounting of ADLS.
There was a section of the code that remounts the ADLS Gen2 storage to ADB. When other teams ran their scripts, the remount took 20-45 seconds, and as the number of scripts running on the high concurrency cluster increased, it was only a matter of time before one of us hit the issue, where a script tried to read data from ADLS while it was being mounted...
This is how it turned out to be intermittent...
Why was this remounting hack in place? It was put in place because we faced an issue with data not showing up in ADB even though it was visible in ADLS Gen2, and the only way to fix this back then was to force a remount to make that data visible in ADB.
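For anyone hitting the same race, one mitigation is to make the mount step idempotent so scripts stop remounting (and briefly unmounting) storage that another job is reading. Here is a minimal sketch for a Databricks notebook; the mount point, source URI, and configs are placeholders, not our actual values:
MOUNT_POINT = "/path/to/adls/mounted"                                # assumed mount point
SOURCE = "abfss://container@storageaccount.dfs.core.windows.net/"    # assumed ADLS Gen2 source

def ensure_mounted(source=SOURCE, mount_point=MOUNT_POINT, configs=None):
    # dbutils is available implicitly inside Databricks notebooks
    if any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
        return  # already mounted: skip the 20-45 second remount and the read-while-mounting race
    dbutils.fs.mount(source=source, mount_point=mount_point, extra_configs=configs or {})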

Redshift. COPY from invalid JSON on S3

I am trying to load data into Redshift from JSON file on S3.
But this file contains a format error: the lines are wrapped in '$' quote characters
${"id":1,"title":"title 1"}$
${"id":2,"title":"title 2"}$
An error was made while exporting data from PostgreSQL.
Now when I try to load data into Redshift, I get the message "Invalid value" for raw_line "$".
Is there any way how to escape these symbols using the Redshift COPY command and avoid data re-uploading or transforming?
MY COMMANDS
-- CREATE TABLE
create table my_table (id BIGINT, title VARCHAR);
-- COPY DATA FROM S3
copy my_table from 's3://my-bucket/my-file.json'
credentials 'aws_access_key_id=***;aws_secret_access_key=***'
format as json 'auto'
Thanks in advance!
I don't think there is a simple "ignore this" option that will work in your case. You could try NULL AS '$' but I expect that will just confuse things in different ways.
Your best bet is to filter the files and replace the originals with the fixed versions. As you note in your comment, downloading them to your system, modifying them, and pushing them back is not a good option due to size. It will cost you in transfer speed (over the internet) and data-out charges from S3. You want to do this "inside" of AWS.
There are a number of ways to do this and I expect the best choice will be based on what you can do quickly, not on what is the absolute best way. (It sounds like this is a one-time fix operation.) Here are a few:
Fire up an EC2 instance and do the download-modify-upload process on this system inside of AWS. Remember to have an S3 endpoint in your VPC.
Create a Lambda function to stream the data in, modify it, and push it back to S3. Do this as a streaming process, since you won't want to download very large files to Lambda in their entirety.
Define a Glue process to strip out the unwanted characters. This will need some custom coding as your files are not in a valid JSON format.
Use CloudShell to download the files, modify them, and upload them. There's a 1 GB storage limit on CloudShell, so this will need to work on smallish chunks of your data, but it doesn't require you to start an EC2 instance. This is a new service, so there may be other issues with this path, but it could be an interesting choice.
There are other choices that are possible (EMR), but these seem like the likely ones. I like playing with new things (especially when they are free), so if it were me I'd try CloudShell.
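For the Lambda or CloudShell routes, the modify step itself is small. Here's a minimal sketch with boto3; the bucket and key names are placeholders, and it buffers the cleaned lines in memory, so a truly large file would want a multipart/streaming upload instead:
import boto3

s3 = boto3.client("s3")
BUCKET = "my-bucket"              # placeholder bucket
SRC_KEY = "my-file.json"          # file with $-wrapped lines
DST_KEY = "fixed/my-file.json"    # cleaned copy for the COPY command to read

def strip_dollar_quotes(bucket=BUCKET, src_key=SRC_KEY, dst_key=DST_KEY):
    body = s3.get_object(Bucket=bucket, Key=src_key)["Body"]
    cleaned = []
    for raw in body.iter_lines():                  # streams the object line by line
        line = raw.decode("utf-8").strip()
        if line:
            cleaned.append(line.strip("$"))        # drop the leading/trailing $ quotes
    s3.put_object(Bucket=bucket, Key=dst_key,
                  Body=("\n".join(cleaned) + "\n").encode("utf-8"))
Point the COPY command at the cleaned key afterwards; the format as json 'auto' part should then work unchanged.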

COPY command runs but no data being copied from Teradata (on-prem)

I am running into an issue where I have set up a pipeline that gets a list of tables from Teradata using a Lookup activity and then passes those items to a ForEach activity that then copies the data in parallel and saves them as a gzipped file. The requirement is to essentially archive some tables that are no longer being used.
For this pipeline I am not using any partition options as most of the tables are small and I kept it to be flexible.
[Screenshot: the pipeline]
[Screenshot: the COPY activity within the ForEach activity]
99% of the tables ran without issues and were copied as gz files into blob storage, but two tables in particular run for a long time (approximately 4 to 6 hours) without any of the data being written into the blob storage account.
Note that the image above says "Cancelled", but that was done by me. Before that I had a run time as described above, but still no data being written. This is affecting only 2 tables.
I checked with our Teradata team and those tables are not being used by anyone (hence they are not locked). I also looked at "Teradata Viewpoint" (an admin tool) and at the query monitor, and saw that the query was running on Teradata without issues.
Any insight would be greatly appreciated.
From the issue described, it looks like the data size of the table is more than a single blob can hold (as you are not using any partition options).
Use a partition option to optimize performance and handle the data.
Just in case someone else comes across this, the way I solved this was to create a new data store connection called "TD_Prod_datasetname". The purpose of this dataset is not to point to a specific table, but to just accept an "@item().TableName" value.
This datasource contains two main values. The first is the @dataset().TeradataName:
[Screenshot: the dataset property]
I only came up with that after doing a little bit of digging in Google.
I then created a parameter called "TeradataTable" as String.
I then updated my pipeline. As above, the main two activities remain the same: I have a Lookup and then a ForEach activity (where the ForEach gets the item values):
However, in the COPY command inside the ForEach activity I updated the source. Instead of getting "item().Name" I am passing through @item().TableName:
This then enabled me to select the "Table" option, and because I am using Table instead of Query I can then use the "Hash" partition. I left it blank because, according to the Microsoft documentation, it will automatically find the Primary Key to use for this.
The only issue I ran into when using this was that if you hit a table that does not have a Primary Key, that item will fail and will need to be run through either a different process or manually outside of this job.
Because of this change, the files that previously just hung there and did not copy now copy successfully into our blob storage account.
Hope this helps someone else who wants to see how to create parallel copies using Teradata as a source and pass through multiple table values.

Ingest big local JSON file into Druid

It's my first Druid experience.
I have got a local setup of Druid on my local machine.
Now I'd like to do some query performance testing. My test data is a huge local JSON file of 1.2 GB.
The idea was to load it into Druid and run the required SQL queries. The file is getting parsed and successfully processed (I'm using the Druid web-based UI to submit an ingestion task).
The problem I run into is the datasource size. It doesn't make sense that 1.2 GB of raw JSON data results in a 35 MB datasource. Is there any limitation that a locally running Druid setup has? I think the test data is only partially processed, but unfortunately I didn't find any relevant config to change. I will appreciate it if someone is able to shed light on this.
Thanks in advance
With Druid, 80-90 percent compression is expected. I have seen a 2 GB CSV file reduced to a 200 MB Druid datasource.
Can you query the count to make sure all the data is ingested? Also, please disable the approximate HyperLogLog algorithm to get an exact count. Druid SQL will switch to exact distinct counts if you set "useApproximateCountDistinct" to "false", either through the query context or through broker configuration (refer to http://druid.io/docs/latest/querying/sql.html).
You can also check the logs for exceptions and error messages. If Druid has a problem ingesting a particular JSON record, it skips that record.
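To illustrate the count check above, here is a minimal sketch that posts a Druid SQL query to the broker's SQL endpoint; the broker URL and datasource name are placeholders. The context flag only changes behaviour for COUNT(DISTINCT ...) queries, but a plain COUNT(*) is enough to compare against the number of records in the source file:
import requests

BROKER_SQL_URL = "http://localhost:8082/druid/v2/sql"   # placeholder broker address
DATASOURCE = "my_datasource"                            # placeholder datasource name

payload = {
    "query": f'SELECT COUNT(*) AS row_count FROM "{DATASOURCE}"',
    "context": {"useApproximateCountDistinct": False},  # force exact distinct counts
}
resp = requests.post(BROKER_SQL_URL, json=payload)
resp.raise_for_status()
print(resp.json())  # compare row_count with the line count of the 1.2 GB source file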

AWS Glue, data filtering before loading into a frame, naming s3 objects

I have 3 questions, for the following context:
I'm trying to migrate my historical data from RDS PostgreSQL to S3. I have about a billion rows of data in my database.
Q1) Is there a way for me to tell an AWS Glue job which rows to load? For example, I want it to load data from a certain date onwards. There is no bookmarking feature for a PostgreSQL data source.
Q2) Once my data is processed, the Glue job automatically creates a name for the S3 output objects. I know I can specify the path in the DynamicFrame write, but can I specify the object name? If so, how? I cannot find an option for this.
Q3) I tried my Glue job on a sample table with 100 rows of data, and it automatically separated the output into 20 files with 5 rows in each of those files. How can I specify the batch size in a job?
Thanks in advance
This is a question I have also posted in AWS Glue forum as well, here is a link to that: https://forums.aws.amazon.com/thread.jspa?threadID=280743
Glue supports a pushdown predicates feature; however, it currently works only with partitioned data on S3. There is a feature request to support it for JDBC connections, though.
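As a stopgap for Q1 (this is not Glue's push_down_predicate, just a common workaround), you can go through Spark's JDBC reader inside the Glue job and push the date filter into a subquery; the connection details, table, and date column below are placeholders:
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# The filter runs on the PostgreSQL side, so only the matching rows cross the wire.
filtered_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://my-rds-host:5432/mydb")   # placeholder endpoint
    .option("driver", "org.postgresql.Driver")
    .option("user", "my_user")                                  # placeholder credentials
    .option("password", "my_password")
    .option("dbtable", "(SELECT * FROM history WHERE created_at >= DATE '2019-01-01') AS t")  # assumed table/column
    .load()
)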
It's not possible to specify the name of the output files. However, it looks like there is an option of renaming the files after they are written (note that a rename on S3 means copying the file from one location to another, so it's a costly and non-atomic operation).
You can't really control the size of the output files. There is an option to reduce the number of output files using coalesce, though. Also, starting from Spark 2.2 there is a possibility to set the maximum number of records per file by setting the config spark.sql.files.maxRecordsPerFile.
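For Q3, a rough sketch of both knobs mentioned above; the record cap, file count, format, and output path are placeholders, and df stands for the Spark DataFrame being written (e.g. dynamic_frame.toDF() in a Glue job):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Cap the number of records per output file (Spark 2.2+).
spark.conf.set("spark.sql.files.maxRecordsPerFile", 1000000)   # placeholder cap

def write_archive(df, path="s3://my-bucket/archive/my_table/"):  # placeholder path
    (df.coalesce(5)                # write at most 5 output files
       .write.mode("overwrite")
       .format("parquet")          # assumed output format
       .save(path))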