I found the website of SnappyData recently. I'm interested in SparkSQL query performance. Has anybody tried loading from and saving to S3 with SnappyData? I can't find any documentation on it.
I want to use pyspark and specify the 'com.databricks.spark.csv' format with various options.
Yes, you can. Here is an example.
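A minimal pyspark sketch of the idea, assuming the spark-csv and hadoop-aws packages are on the classpath; the bucket names and paths are placeholders (with SnappyData you would use its SnappySession in the same way):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-csv-example").getOrCreate()

# Read a CSV file from S3 using the Databricks CSV data source (placeholder path)
df = (spark.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("s3a://my-bucket/input/data.csv"))

# ... run SparkSQL queries / transformations on df here ...

# Write the result back to S3 as CSV (placeholder path)
(df.write
   .format("com.databricks.spark.csv")
   .option("header", "true")
   .save("s3a://my-bucket/output/"))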
Recently I faced an issue while writing DataFrame data into BigQuery using pyspark. Here it is:
pyspark.sql.utils.IllegalArgumentException: u'Temporary or persistent GCS bucket must be informed
After researching the issue, I found that a temporary GCS bucket has to be set in the Spark conf:
bucket = "temp_bucket"
spark.conf.set('temporaryGcsBucket', bucket)
I think BigQuery, unlike Hive, has no concept of a file backing a table.
I would like to know more about this: why do we need a temporary GCS bucket to write data into BigQuery?
I was searching for the reason behind this but couldn't find it.
Please clarify.
The Spark BigQuery connector has two write modes (writeMethod) for writing data into BigQuery: 1. Direct, 2. Indirect. This is an optional parameter; the default is Indirect.
Indirect
You can specify the indirect mode with option("writeMethod", "indirect"). It is optional, since Indirect is the default. This mode requires you to specify a temporary GCS bucket; if you don't, you get the error above.
The temporary bucket is needed because:
The connector writes the data to BigQuery by first buffering all the data into a Cloud Storage temporary table. Then it copies all data from Cloud Storage into BigQuery in one operation.
Taken from the GCS Spark example docs here.
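For illustration, a minimal pyspark sketch of an indirect write, assuming the spark-bigquery connector is on the classpath; the bucket and table names are placeholders:

# Indirect write: data is first staged in the temporary GCS bucket,
# then loaded into BigQuery. Bucket and table names are placeholders.
spark.conf.set("temporaryGcsBucket", "temp_bucket")

(df.write
   .format("bigquery")
   .option("writeMethod", "indirect")   # default, shown here for clarity
   .option("table", "my_dataset.my_table")
   .mode("append")
   .save())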
Direct
In this method the data is written directly to BigQuery using the BigQuery Storage Write API. In Scala you can specify it like this: option("writeMethod", "direct"), which eliminates the need for a temporary bucket.
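The equivalent in pyspark, as a sketch with a placeholder table name and assuming a connector version that supports the Storage Write API:

# Direct write: rows go straight to BigQuery via the Storage Write API,
# so no temporary GCS bucket is needed. Table name is a placeholder.
(df.write
   .format("bigquery")
   .option("writeMethod", "direct")
   .option("table", "my_dataset.my_table")
   .mode("append")
   .save())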
You can read more about the BigQuery connector here.
I have my Spark project in Scala and I want to use Redshift as my data warehouse. I found that the spark-redshift repo exists, but Databricks made it private a couple of years ago and no longer supports it publicly.
What's the best option right now to deal with Amazon Redshift from Spark (Scala)?
This is a partial answer, as I have only used Spark-to-Redshift writes in a real-world use case and have never benchmarked Spark reads from Redshift.
When it comes to writing from Spark to Redshift, by far the most performant way I could find was to write Parquet to S3 and then use Redshift COPY to load the data. Writing to Redshift through JDBC also works, but it is several orders of magnitude slower than the former method. Other storage formats could be tried as well, but I would be surprised if any row-oriented format could beat Parquet, since Redshift internally stores data in a columnar format. Another columnar format supported by both Spark and Redshift is ORC.
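As an illustration, a hedged pyspark sketch of that approach (the Scala DataFrame API is equivalent); the bucket, table, and IAM role below are placeholders:

# Step 1: write the DataFrame as Parquet to a staging location in S3 (placeholder path).
df.write.mode("overwrite").parquet("s3a://my-bucket/staging/my_table/")

# Step 2: load the staged files into Redshift with COPY, run through any SQL
# client or driver. Table, path, and IAM role are placeholders:
#
#   COPY my_schema.my_table
#   FROM 's3://my-bucket/staging/my_table/'
#   IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
#   FORMAT AS PARQUET;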
I never came across a use case of reading large amounts of data from Redshift using Spark, as it feels more natural to load all the data into Redshift and use it for joins and aggregations. It is probably not cost-efficient to use Redshift just as bulk storage and another engine for joins and aggregations. For reading small amounts of data, JDBC works fine. For large reads, my best guess is the UNLOAD command and S3.
Amazon gives very detailed documentation for copying data from EMR to Redshift (through S3), but there don't seem to be any docs on the other way around, which makes me wonder whether it's good practice at all to load data from Redshift to EMR (directly, or through some medium).
Theoretically I don't see why not, but I don't know the consequences of it.
I think you can use Redshift UNLOAD.
Export the data as Parquet, then read it from EMR Hadoop (Spark, Hive):
UNLOAD ('select-statement')
TO 's3://object-path/name-prefix'
authorization
FORMAT PARQUET
https://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html
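Once the files are unloaded, reading them from Spark on EMR is straightforward; a minimal pyspark sketch, with the S3 path a placeholder matching the UNLOAD prefix above:

# Read the Parquet files produced by UNLOAD (placeholder path)
df = spark.read.parquet("s3://object-path/name-prefix")
df.createOrReplaceTempView("unloaded_data")
spark.sql("SELECT count(*) FROM unloaded_data").show()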
I'm working on a project which needs to generate Parquet files from a huge PostgreSQL database. The data size can be gigantic (e.g. 10 TB). I'm very new to this topic and have done some research online, but did not find a direct way to convert the data to Parquet files. Here are my questions:
The only feasible solution I saw is to load the Postgres table into Apache Spark via JDBC and save it as a Parquet file. But I assume it will be very slow while transferring 10 TB of data.
Is it possible to generate a single huge Parquet file of 10 TB? Or is it better to create multiple Parquet files?
Hope my question is clear, and I really appreciate any helpful feedback. Thanks in advance!
Use the ORC format instead of the parquet format for this volume.
I assume the data is partitioned, so I think it's a good idea to extract in parallel, taking advantage of that partitioning (see the sketch below).
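A hedged pyspark sketch of a partitioned JDBC extract, assuming the Postgres JDBC driver is on the classpath; the connection details, table, partition column, and bounds are placeholders:

# Parallel JDBC read: Spark issues numPartitions queries, each covering a
# slice of the partition column's range. All names below are placeholders.
df = (spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/mydb")
      .option("dbtable", "public.big_table")
      .option("user", "spark_user")
      .option("password", "...")
      .option("partitionColumn", "id")    # numeric or date column to split on
      .option("lowerBound", "1")
      .option("upperBound", "1000000000")
      .option("numPartitions", "200")
      .option("fetchsize", "10000")
      .load())

# Write out as ORC (or Parquet); this naturally produces many files rather
# than one 10 TB file.
df.write.mode("overwrite").orc("s3a://my-bucket/extract/big_table/")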
Hi, I am starting to learn the SnappyData row store (link). I tried all the examples there and they work, but I need to store CSV and JSON data in a SnappyData table. In the examples they manually connect with snappy-shell, create the tables, and insert the records; another option is the JDBC client (link). I tried that way, but I don't know how to load a CSV and store it in a Snappy table. I then tried another method, direct query-based access to the SnappyData store (link). If anyone knows how to store CSV data in a SnappyData table using JDBC, please share. Thank you.
You should go through the SnappyData docs - http://snappydatainc.github.io/snappydata/. The "How-to" section has examples of loading from CSV: http://snappydatainc.github.io/snappydata/howto/#how-to-load-data-in-snappydata-tables
// Read the CSV into a DataFrame using the SnappySession
val someCSV_DF = snSession.read.csv("Myfile.csv")
// Insert the rows into an existing SnappyData table
someCSV_DF.write.insertInto("MyTable")