Loading data from Amazon Redshift to HDFS - Scala

I'm trying to load data from Amazon Redshift to HDFS.
val df = spark.read.format("com.databricks.spark.redshift")
  .option("forward_spark_s3_credentials", "true")
  .option("url", "jdbc:redshift://xxx1")
  .option("user", "xxx2")
  .option("password", "xxx3")
  .option("query", "xxx4")
  .option("driver", "com.amazon.redshift.jdbc.Driver")
  .option("tempdir", "s3n://xxx5")
  .load()
This is the Scala code I'm using. df.count() and df.printSchema() give me the right count and schema. But when I call df.show() or try to write the DataFrame to HDFS, it fails with:
S3ServiceException: The AWS Access Key Id you provided does not exist in our records., Status 403, Error InvalidAccessKeyId

You need to export the following environment variables to write to S3:
export AWS_SECRET_ACCESS_KEY=XXX
export AWS_ACCESS_KEY_ID=XXX
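To make the role of the two variables concrete, here is a minimal Python sketch (standard library only) that reads them and maps them to the Hadoop properties used by the s3n:// scheme. The helper name s3n_credentials_conf is hypothetical, and how the properties are applied to a running session depends on your setup, so that part is only indicated in a comment.

```python
import os

# Environment variable -> Hadoop property for the s3n:// filesystem.
ENV_TO_HADOOP = {
    "AWS_ACCESS_KEY_ID": "fs.s3n.awsAccessKeyId",
    "AWS_SECRET_ACCESS_KEY": "fs.s3n.awsSecretAccessKey",
}


def s3n_credentials_conf(env=None):
    """Return the s3n Hadoop properties, or raise if a variable is missing.

    Hypothetical helper, not part of Spark or the Redshift connector.
    """
    env = os.environ if env is None else env
    missing = [name for name in ENV_TO_HADOOP if not env.get(name)]
    if missing:
        raise ValueError("missing environment variables: " + ", ".join(missing))
    return {prop: env[name] for name, prop in ENV_TO_HADOOP.items()}


# Assumed usage from PySpark, before reading from Redshift:
# for key, value in s3n_credentials_conf().items():
#     spark._jsc.hadoopConfiguration().set(key, value)
```

Failing early when a variable is missing is a design choice for the sketch: it surfaces the problem before the job starts instead of as a 403 from S3.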

Related

How to make 'First Row' the Header when saving data to SQL Server with Databricks

Can someone let me know how to make the first row the header when saving to SQL Server with Databricks?
I am currently using the following code to upload / save to SQL in Azure:
jdbcUrl = f"jdbc:sqlserver://{DBServer}.database.windows.net:1433;database={DBDatabase};user={DBUser};password={DBPword};encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30;"
df = spark.read.csv("/mnt/lake/RAW/OptionsetMetadata.csv")
df.write.mode("overwrite") \
.format("jdbc") \
.option("url", jdbcUrl) \
.option("dbtable", 'UpdatedProducts')\
.save()
The table looks like the following after saving:
The JDBC driver creates the table according to the DataFrame's schema. It looks like you're reading the CSV file without specifying .option("header", "true"), so the first row is treated as data rather than as column names. Just add this option to your read operation.
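The effect of the header option can be illustrated without a cluster. The sketch below uses Python's csv module to mimic the behaviour: without a header the columns get positional names (Spark uses _c0, _c1, ...) and the first row stays in the data. The function infer_columns and the sample column names are made up for illustration; this is a model of the behaviour, not Spark's implementation.

```python
import csv
import io


def infer_columns(csv_text, header=False):
    """Mimic how a CSV reader names columns with and without a header row."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if header:
        # The first row supplies the column names; the rest is data.
        return rows[0], rows[1:]
    # No header: generate positional names and keep the first row as data.
    names = ["_c{}".format(i) for i in range(len(rows[0]))]
    return names, rows


# Hypothetical sample content, not taken from OptionsetMetadata.csv.
sample = "OptionSetName,Option\nstatus,Active\n"
print(infer_columns(sample))
print(infer_columns(sample, header=True))

# Assumed PySpark form of the fix:
# df = spark.read.option("header", "true").csv("/mnt/lake/RAW/OptionsetMetadata.csv")
```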

How to connect to MongoDB Atlas from a Databricks cluster using PySpark

How do I connect to MongoDB Atlas from a Databricks cluster using PySpark?
This is the simple code in my notebook:
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("myApp") \
.config("spark.mongodb.input.uri", "mongodb+srv://admin:<password>@mongocluster.fxilr.mongodb.net/TestDatabase.Events") \
.getOrCreate()
df = spark.read.format("mongo").load()
df.printSchema()
But I am getting this error:
IllegalArgumentException: Missing database name. Set via the 'spark.mongodb.input.uri' or 'spark.mongodb.input.database' property
What am I doing wrong?
I followed these steps and was able to connect.
Install the org.mongodb.spark:mongo-spark-connector_2.12:3.0.2 Maven library on your cluster (I was using Scala 2.12).
Go to the cluster detail page and, under Advanced Options on the Spark tab, add the two config parameters below:
spark.mongodb.output.uri connection-string
spark.mongodb.input.uri connection-string
Note that connection-string should look like this (with your own user, password and database name):
mongodb+srv://user:password@cluster1.s5tuva0.mongodb.net/my_database?retryWrites=true&w=majority
Use the following Python code in your notebook; it should load your sample collection into a DataFrame:
# Reading from MongoDB
df = spark.read\
.format("com.mongodb.spark.sql.DefaultSource")\
.option("uri", "mongodb+srv://user:password@cluster1.s5tuva0.mongodb.net/database?retryWrites=true&w=majority")\
.option("database", "my_database")\
.option("collection", "my_collection")\
.load()
You can use the following to write to MongoDB:
events_df.write\
.format('com.mongodb.spark.sql.DefaultSource')\
.mode("append")\
.option("uri", "mongodb+srv://user:password@cluster1.s5tuva0.mongodb.net/my_database.my_collection?retryWrites=true&w=majority") \
.save()
Hope this should work for you. Please do let others know if it worked.
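Since the error in the question is about the database name in the URI, it may help to assemble the connection string from its parts instead of typing it by hand. The sketch below is one way to do that with the Python standard library. The function name atlas_uri is hypothetical; it percent-encodes the user and password, which MongoDB connection strings require when they contain characters such as @ or /.

```python
from urllib.parse import quote_plus, urlencode


def atlas_uri(user, password, host, database, collection=None, **options):
    """Build a mongodb+srv:// URI with percent-encoded credentials.

    Hypothetical helper; the database (and optional collection) go in the
    path, and any keyword arguments become query-string options.
    """
    if not database:
        raise ValueError("database name must not be empty")
    namespace = database if collection is None else "{}.{}".format(database, collection)
    uri = "mongodb+srv://{}:{}@{}/{}".format(
        quote_plus(user), quote_plus(password), host, namespace
    )
    if options:
        uri += "?" + urlencode(options)
    return uri


# Placeholder credentials and host, as in the answer above.
uri = atlas_uri("user", "password", "cluster1.s5tuva0.mongodb.net",
                "my_database", "my_collection",
                retryWrites="true", w="majority")
print(uri)

# Assumed usage: pass the result as the "uri" option of the read or write.
```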

Save a DataFrame in postgresql

I'm trying to save my DataFrame in *.orc format to PostgreSQL using JDBC. I have an intermediate table created on localhost on the server I use, but the table is not saved in PostgreSQL.
I would like to find out which formats PostgreSQL works with (it may not be possible to create a *.orc table in it), and what it accepts: a Dataset or an SQL query from the created table.
I'm using Spark.
Properties conProperties = new Properties();
conProperties.setProperty("driver", "org.postgresql.Driver");
conProperties.setProperty("user", args[2]);
conProperties.setProperty("password", args[3]);
finalTable.write()
.format("orc")
.mode(SaveMode.Overwrite)
.jdbc(args[1], "dataTable", conProperties);
spark-submit --class com.pack.Main --master yarn --deploy-mode cluster /project/build/libs/mainC-1.0-SNAPSHOT.jar ha-cluster?jdbc:postgresql://MyHostame:5432/nameUser&user=nameUser&password=passwordUser
You cannot keep the ORC format in Postgres, and you wouldn't want to either: ORC is a file format, whereas a JDBC write inserts rows into a database table.
See the updated write below:
finalTable.write()
.format("jdbc")
.option("url", args[1])
.option("dbtable", "dataTable")
.option("user", args[2])
.option("password", args[3])
.save();
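The same set of options can be collected and checked before the write is attempted. The following Python sketch is only an illustration under assumptions: postgres_jdbc_options is a hypothetical helper, and the URL check simply rejects anything that does not start with jdbc:postgresql://, which would include a combined string such as the ha-cluster?jdbc:postgresql://... argument from the spark-submit line.

```python
def postgres_jdbc_options(url, table, user, password):
    """Collect the options for a Spark JDBC write to PostgreSQL.

    Hypothetical helper; it only validates the URL prefix and builds a dict.
    """
    if not url.startswith("jdbc:postgresql://"):
        raise ValueError("expected a jdbc:postgresql:// URL, got: " + url)
    return {
        "url": url,
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",
    }


# Placeholder values taken from the question.
opts = postgres_jdbc_options(
    "jdbc:postgresql://MyHostame:5432/nameUser", "dataTable", "nameUser", "passwordUser"
)

# Assumed PySpark usage:
# final_table.write.format("jdbc").options(**opts).mode("overwrite").save()
```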

How can I load a BigQuery table to a Dataproc cluster

I am new to Dataproc clusters and PySpark. While looking for code to load a table from BigQuery to the cluster, I came across the code below. I was unable to figure out what I am supposed to change in it for my use case, and what we are providing as input in the input directory.
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
import subprocess
sc = SparkContext()
spark = SparkSession(sc)
bucket = spark._jsc.hadoopConfiguration().get('fs.gs.system.bucket')
project = spark._jsc.hadoopConfiguration().get('fs.gs.project.id')
input_directory = 'gs://{}/hadoop/tmp/bigquery/pyspark_input'.format(bucket)
conf = {
'mapred.bq.project.id': project,
'mapred.bq.gcs.bucket': bucket,
'mapred.bq.temp.gcs.path': input_directory,
'mapred.bq.input.project.id': 'dataset_new',
'mapred.bq.input.dataset.id': 'retail',
'mapred.bq.input.table.id': 'market',
}
You are trying to use the Hadoop BigQuery connector; for Spark you should use the Spark BigQuery connector.
To read data from BigQuery, you can follow this example:
# Use the Cloud Storage bucket for temporary BigQuery export data used
# by the connector.
bucket = "[bucket]"
spark.conf.set('temporaryGcsBucket', bucket)
# Load data from BigQuery.
words = spark.read.format('bigquery') \
.option('table', 'bigquery-public-data:samples.shakespeare') \
.load()
words.createOrReplaceTempView('words')
# Perform word count.
word_count = spark.sql(
'SELECT word, SUM(word_count) AS word_count FROM words GROUP BY word')
word_count.show()
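To adapt the example to your own table, the three ids from the question's conf dictionary can be combined into the project:dataset.table form used for the 'table' option above. This is a small sketch, assuming the values from the question ('dataset_new', 'retail', 'market') really are the project, dataset and table ids; bigquery_table_ref is a hypothetical helper.

```python
def bigquery_table_ref(project, dataset, table):
    """Format a table reference as project:dataset.table."""
    for name, value in (("project", project), ("dataset", dataset), ("table", table)):
        if not value:
            raise ValueError(name + " must not be empty")
    return "{}:{}.{}".format(project, dataset, table)


# Ids copied from the question's conf dictionary (assumed to be correct).
table_ref = bigquery_table_ref("dataset_new", "retail", "market")
print(table_ref)

# Assumed PySpark usage:
# df = spark.read.format("bigquery").option("table", table_ref).load()
```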

Unable to write Spark's DataFrame to Hive using Presto

I'm writing some code to save a DataFrame to a Hive database using Presto:
df.write.format("jdbc")
.option("url", "jdbc:presto://myurl/hive?user=user/default")
.option("driver","com.facebook.presto.jdbc.PrestoDriver")
.option("dbtable", "myhivetable")
.mode("overwrite")
.save()
This should work, but it raises an exception:
java.lang.IllegalArgumentException: Can't get JDBC type for array<string>