I have a question: does anyone know how to connect SparkR to Redshift?
I am trying to use Spark against my Redshift cluster to do some querying and data wrangling.
Thank you
I'm also trying to do that. I'm using Databricks for my Spark cluster. I couldn't do it directly from SparkR, but something that works for me is to first load the data into the SQLContext with Scala; after that I can access the SQLContext from SparkR. It may not be the best solution, but it works with good performance.
This is the connector: Link
In Scala I do this:
val redshift_data = sqlContext.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc://....")
  .option("tempdir", "s3a://....")
  .option("query", "SELECT * FROM table_name")
  .load()

redshift_data.registerTempTable("redshift_data")
Then in R
data <- sql(sqlContext, "SELECT * FROM redshift_data")
I hope this helps.
Related
Scenario: Cassandra is hosted on a server a.b.c.d and Spark runs on a server, say w.x.y.z.
Assume I want to transform the data from a Cassandra table (say table) and write it to another Cassandra table (say tableNew) using Spark. The code that I write looks something like this:
val conf = new SparkConf(true)
.set("spark.cassandra.connection.host", "a.b.c.d")
.set("spark.cassandra.auth.username", "<UserName>")
.set("spark.cassandra.auth.password", "<Password>")
val spark = SparkSession.builder().master("yarn")
.config(conf)
.getOrCreate()
val dfFromCassandra = spark.read.format("org.apache.spark.sql.cassandra").options(Map( "table" -> "<table>", "keyspace" -> "<Keyspace>")).load()
val filteredDF = dfFromCassandra.filter(filterCriteria)
filteredDF.write.format("org.apache.spark.sql.cassandra").options(Map( "table" -> "<tableNew>", "keyspace" -> "<Keyspace>")).save()
Here filterCriteria represents the transformation/filtering that I do. I am not sure how the Spark Cassandra connector works internally in this case.
This is the confusion that I have:
1: Does Spark load the data from the Cassandra source table into memory, then filter it and write it back to the target table? Or
2: Does the Spark Cassandra connector convert the filter criteria into a WHERE clause, load only the relevant data to form the RDD, and write that back to the target table in Cassandra? Or
3: Does the entire operation happen as a CQL operation, where the query is converted into a SQL-like query and executed in Cassandra itself? (I am almost sure that this is not what happens.)
It is either 1 or 2, depending on your filterCriteria. Naturally Spark itself can't do any CQL filtering, but custom data sources can implement it using predicate pushdown. In the case of the Cassandra connector it is implemented here, and the answer depends on whether that covers the filterCriteria you use.
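One way to see which of these applies to your filterCriteria is to inspect the physical plan: predicates the connector can handle show up under PushedFilters. A minimal sketch (the table, keyspace and column names are hypothetical):

import org.apache.spark.sql.functions.col

val sourceDF = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "source_table", "keyspace" -> "my_keyspace")) // hypothetical names
  .load()

// If the predicate can be pushed down, it appears as a PushedFilters entry in the
// physical plan; otherwise Spark reads the rows and filters them in memory.
sourceDF.filter(col("some_column") === "some_value").explain(true)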
I am using Spark 2.1.0 and Kafka 0.9.0.
I am trying to push the output of a batch Spark job to Kafka. The job is supposed to run every hour, but not as a streaming job.
While looking for an answer on the net I could only find Kafka integration with Spark Streaming, and nothing about integration with a batch job.
Does anyone know if such a thing is feasible?
Thanks
UPDATE :
As mentioned by user8371915, I tried to follow what was done in Writing the output of Batch Queries to Kafka.
I used a Spark shell:
spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0
Here is the simple code that I tried:
import org.apache.spark.sql.functions._

val df = Seq(("Rey", "23"), ("John", "44")).toDF("key", "value")
val newdf = df.select(to_json(struct(df.columns.map(column): _*)).alias("value"))
newdf.write.format("kafka").option("kafka.bootstrap.servers", "localhost:9092").option("topic", "alerts").save()
But I get the error :
java.lang.RuntimeException: org.apache.spark.sql.kafka010.KafkaSourceProvider does not allow create table as select.
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:497)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
... 50 elided
Any idea what this is related to?
Thanks
tl;dr You are using an outdated Spark version. Writes are enabled in 2.2 and later.
Out of the box you can use the Kafka SQL connector (the same one used with Structured Streaming). Include spark-sql-kafka in your dependencies (an sbt example is shown below).
Convert the data to a DataFrame containing at least a value column of type StringType or BinaryType.
Write the data to Kafka:
df
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", server)
  .option("topic", topic)
  .save()
Follow Structured Streaming docs for details (starting with Writing the output of Batch Queries to Kafka).
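For the dependency step above, the connector artifact is spark-sql-kafka-0-10; with sbt it would look roughly like this (the version is an assumption for a Spark 2.2 / Scala 2.11 build, adjust to match yours):

// build.sbt (version assumed for Spark 2.2.x on Scala 2.11)
libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.2.0"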
If you have a DataFrame and you want to write it to a Kafka topic, you first need to convert the columns to a "value" column that contains the data in JSON format. In Scala it is:
import org.apache.spark.sql.functions._
val kafkaServer: String = "localhost:9092"
val topicSampleName: String = "kafkatopic"
df.select(to_json(struct("*")).as("value"))
.selectExpr("CAST(value AS STRING)")
.write
.format("kafka")
.option("kafka.bootstrap.servers", kafkaServer)
.option("topic", topicSampleName)
.save()
For this error
java.lang.RuntimeException: org.apache.spark.sql.kafka010.KafkaSourceProvider does not allow create table as select.
at scala.sys.package$.error(package.scala:27)
I think you need to convert the message into a key/value pair; your DataFrame should have a value column.
Say you have a DataFrame with student_id and score columns:
df.show()
>> student_id | score
   1          | 99.00
   2          | 98.00
Then you should modify your DataFrame to:
value
{"student_id":1,"score":99.00}
{"student_id":2,"score":98.00}
To convert, you can use code like this:
df.select(to_json(struct($"student_id",$"score")).alias("value"))
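Putting it together, a minimal sketch (assuming Spark 2.2+, a broker at localhost:9092 and a hypothetical scores topic) could look like this:

import org.apache.spark.sql.functions.{to_json, struct}
import spark.implicits._ // assumes the `spark` SparkSession is in scope (e.g. in spark-shell)

// toy DataFrame matching the example above
val df = Seq((1, 99.00), (2, 98.00)).toDF("student_id", "score")

// collapse the columns into a single JSON "value" column and write it to Kafka
df.select(to_json(struct($"student_id", $"score")).alias("value"))
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker address
  .option("topic", "scores")                           // hypothetical topic name
  .save()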
I have the following situation: I have a large Cassandra table (with a large number of columns) which I would like to process with Spark. I want only selected columns to be loaded into Spark (applying the selection and filtering on the Cassandra server itself).
val eptable = sc.cassandraTable("test", "devices")
  .select("device_ccompany", "device_model", "device_type")
The above statement gives a CassandraTableScanRDD, but how do I convert it into a Dataset/DataFrame?
Is there any other way I can do server-side filtering of columns and get DataFrames?
With the DataStax Spark Cassandra Connector, you would read Cassandra data as a Dataset and prune columns on the server side as follows:
val df = spark
.read
.format("org.apache.spark.sql.cassandra")
.options(Map( "table" -> "devices", "keyspace" -> "test" ))
.load()
val dfWithColumnPruned = df
.select("device_ccompany","device_model","device_type")
Note that the selection I do after reading is pushed to the server side by Catalyst optimizations. Refer to this document for further information.
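If you also need row-level filtering on the server side, a filter on the Dataset can be pushed down when the connector supports the predicate (typically on partition key, clustering or indexed columns). A rough sketch, using a hypothetical device_type value:

import org.apache.spark.sql.functions.col

// Whether this predicate is evaluated on the Cassandra side depends on the
// connector's pushdown rules for the column involved.
val filteredDF = df
  .select("device_ccompany", "device_model", "device_type")
  .filter(col("device_type") === "sensor")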
I need to use Athena in Spark, but Spark calls prepareStatement when using JDBC drivers, and it gives me the exception
"com.amazonaws.athena.jdbc.NotImplementedException: Method Connection.prepareStatement is not yet implemented"
Can you please let me know how I can connect to Athena from Spark?
I don't know how you'd connect to Athena from Spark, but you don't need to - you can very easily query the data that Athena contains (or, more correctly, "registers") from Spark.
There are two parts to Athena:
Hive Metastore (now called the Glue Data Catalog) which contains mappings between database and table names and all underlying files
Presto query engine which translates your SQL into data operations against those files
When you start an EMR cluster (v5.8.0 and later) you can instruct it to connect to your Glue Data Catalog. This is a checkbox in the 'create cluster' dialog. When you check this option your Spark SqlContext will connect to the Glue Data Catalog, and you'll be able to see the tables in Athena.
You can then query these tables as normal.
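For example, once the cluster is started with that option enabled, a table registered in Athena can be queried with plain Spark SQL (the database and table names below are hypothetical):

// The table must already be registered in the Glue Data Catalog (i.e. visible in Athena).
val events = spark.sql("SELECT * FROM my_database.my_table WHERE year = '2018'")
events.show()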
See https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-glue.html for more.
You can use this JDBC driver: SimbaAthenaJDBC
<dependency>
<groupId>com.syncron.amazonaws</groupId>
<artifactId>simba-athena-jdbc-driver</artifactId>
<version>2.0.2</version>
</dependency>
To use it:
SparkSession spark = SparkSession
.builder()
.appName("My Spark Example")
.getOrCreate();
Class.forName("com.simba.athena.jdbc.Driver");
Properties connectionProperties = new Properties();
connectionProperties.put("User", "AWSAccessKey");
connectionProperties.put("Password", "AWSSecretAccessKey");
connectionProperties.put("S3OutputLocation", "s3://my-bucket/tmp/");
connectionProperties.put("AwsCredentialsProviderClass",
"com.simba.athena.amazonaws.auth.PropertiesFileCredentialsProvider");
connectionProperties.put("AwsCredentialsProviderArguments", "/my-folder/.athenaCredentials");
connectionProperties.put("driver", "com.simba.athena.jdbc.Driver");
List<String> predicateList =
Stream
.of("id = 'foo' and date >= DATE'2018-01-01' and date < DATE'2019-01-01'")
.collect(Collectors.toList());
String[] predicates = new String[predicateList.size()];
predicates = predicateList.toArray(predicates);
Dataset<Row> data =
spark.read()
.jdbc("jdbc:awsathena://AwsRegion=us-east-1;",
"my_env.my_table", predicates, connectionProperties);
You can also use this driver in a Flink application:
TypeInformation[] fieldTypes = new TypeInformation[] {
BasicTypeInfo.STRING_TYPE_INFO,
BasicTypeInfo.STRING_TYPE_INFO
};
RowTypeInfo rowTypeInfo = new RowTypeInfo(fieldTypes);
JDBCInputFormat jdbcInputFormat = JDBCInputFormat.buildJDBCInputFormat()
.setDrivername("com.simba.athena.jdbc.Driver")
.setDBUrl("jdbc:awsathena://AwsRegion=us-east-1;UID=my_access_key;PWD=my_secret_key;S3OutputLocation=s3://my-bucket/tmp/;")
.setQuery("select id, val_col from my_env.my_table WHERE id = 'foo' and date >= DATE'2018-01-01' and date < DATE'2019-01-01'")
.setRowTypeInfo(rowTypeInfo)
.finish();
DataSet<Row> dbData = env.createInput(jdbcInputFormat, rowTypeInfo);
You can't directly connect Spark to Athena. Athena is simply an implementation of Prestodb targeting s3. Unlike Presto, Athena cannot target data on HDFS.
However, if you want to use Spark to query data in S3, then you are in luck with HUE, which will let you query data in S3 from Spark on Elastic MapReduce (EMR).
See Also:
Developer Guide for Hadoop User Experience (HUE) on EMR.
The answer from Kirk Broadhurst is correct if you want to use the data behind Athena.
If you want to use the Athena engine itself, there is a library on GitHub that works around the prepareStatement problem.
Note that I didn't succeed in using the library, due to my lack of experience with Maven, etc.
Actually you can use B2W's Spark Athena Driver.
https://github.com/B2W-BIT/athena-spark-driver
I'm running a query in Scala using Databricks with Spark 1.6.0 (Hadoop 1) to filter some URL data that I have in Redshift. The query finishes successfully, and if I run a count on the DataFrame it shows that there is data, but when I try to display the data or join it, the DataFrame appears to be empty: nothing is displayed and the join produces no results.
This is the code for getting the data into Databricks:
val df = sqlContext.read
.format("com.databricks.spark.redshift")
.option("url", jdbcUrl)
.option("tempdir", s"s3a://....")
.option("query", s"select * from table where column like '%word1%word2%word3%' )
.load()
The actual data is something like this
'https://www.asdfg.com/word1?word2=/word3/asdasdadasd'
or
'https://www.asdfg.com/word1?word2=%2Fword3%2Fasdasdadasd'
I can't understand why I get results if I run a count, but for any other operation the DataFrame appears to be empty. Any ideas why this is happening?