Spark Athena connector - PySpark

I need to use Athena in Spark, but Spark uses prepareStatement when using JDBC drivers, and it gives me the exception
"com.amazonaws.athena.jdbc.NotImplementedException: Method Connection.prepareStatement is not yet implemented"
Can you please let me know how I can connect to Athena from Spark?

I don't know how you'd connect to Athena from Spark, but you don't need to - you can very easily query the data that Athena contains (or, more correctly, "registers") from Spark.
There are two parts to Athena:
- the Hive Metastore (now the AWS Glue Data Catalog), which contains the mappings between database and table names and the underlying files, and
- the Presto query engine, which translates your SQL into data operations against those files.
When you start an EMR cluster (v5.8.0 and later) you can instruct it to connect to your Glue Data Catalog. This is a checkbox in the 'create cluster' dialog. When you check this option your Spark SQLContext will connect to the Glue Data Catalog, and you'll be able to see the tables in Athena.
You can then query these tables as normal.
See https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-glue.html for more.
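For example, a minimal PySpark sketch on such an EMR cluster might look like the following (my_db and my_table are placeholder names for a database and table that Athena already registers in the Glue Data Catalog):
from pyspark.sql import SparkSession

# On EMR with "Use AWS Glue Data Catalog for table metadata" enabled,
# the Glue catalog is exposed through the Hive metastore interface.
spark = (SparkSession.builder
    .appName("athena-tables-via-glue")
    .enableHiveSupport()
    .getOrCreate())

spark.sql("SHOW DATABASES").show()    # lists the Glue/Athena databases
df = spark.table("my_db.my_table")    # the same table you would query in Athena
df.show(10)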

You can use this JDBC driver: SimbaAthenaJDBC
<dependency>
    <groupId>com.syncron.amazonaws</groupId>
    <artifactId>simba-athena-jdbc-driver</artifactId>
    <version>2.0.2</version>
</dependency>
To use it:
import java.util.List;
import java.util.Properties;
import java.util.stream.Collectors;
import java.util.stream.Stream;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession
    .builder()
    .appName("My Spark Example")
    .getOrCreate();

// Register the Simba Athena JDBC driver
Class.forName("com.simba.athena.jdbc.Driver");

Properties connectionProperties = new Properties();
connectionProperties.put("User", "AWSAccessKey");
connectionProperties.put("Password", "AWSSecretAccessKey");
connectionProperties.put("S3OutputLocation", "s3://my-bucket/tmp/");
connectionProperties.put("AwsCredentialsProviderClass",
    "com.simba.athena.amazonaws.auth.PropertiesFileCredentialsProvider");
connectionProperties.put("AwsCredentialsProviderArguments", "/my-folder/.athenaCredentials");
connectionProperties.put("driver", "com.simba.athena.jdbc.Driver");

// Each predicate becomes one partition of the resulting Dataset
List<String> predicateList =
    Stream
        .of("id = 'foo' and date >= DATE'2018-01-01' and date < DATE'2019-01-01'")
        .collect(Collectors.toList());
String[] predicates = new String[predicateList.size()];
predicates = predicateList.toArray(predicates);

Dataset<Row> data =
    spark.read()
        .jdbc("jdbc:awsathena://AwsRegion=us-east-1;",
            "my_env.my_table", predicates, connectionProperties);
You can also use this driver in a Flink application:
TypeInformation[] fieldTypes = new TypeInformation[] {
    BasicTypeInfo.STRING_TYPE_INFO,
    BasicTypeInfo.STRING_TYPE_INFO
};
RowTypeInfo rowTypeInfo = new RowTypeInfo(fieldTypes);

JDBCInputFormat jdbcInputFormat = JDBCInputFormat.buildJDBCInputFormat()
    .setDrivername("com.simba.athena.jdbc.Driver")
    .setDBUrl("jdbc:awsathena://AwsRegion=us-east-1;UID=my_access_key;PWD=my_secret_key;S3OutputLocation=s3://my-bucket/tmp/;")
    .setQuery("select id, val_col from my_env.my_table WHERE id = 'foo' and date >= DATE'2018-01-01' and date < DATE'2019-01-01'")
    .setRowTypeInfo(rowTypeInfo)
    .finish();

// env is the Flink ExecutionEnvironment, e.g. ExecutionEnvironment.getExecutionEnvironment()
DataSet<Row> dbData = env.createInput(jdbcInputFormat, rowTypeInfo);

You can't directly connect Spark to Athena. Athena is essentially an implementation of Presto targeting S3; unlike Presto, Athena cannot target data on HDFS.
However, if you want to use Spark to query data in S3, then you are in luck with HUE, which will let you query data in S3 from Spark on Elastic MapReduce (EMR).
See Also:
Developer Guide for Hadoop User Experience (HUE) on EMR.

Kirk Broadhurst's answer is correct if you want to use the data that Athena registers.
If you want to use the Athena engine itself, there is a lib on GitHub that works around the prepareStatement problem.
Note that I didn't manage to use the lib myself, due to my lack of experience with Maven etc.

Actually, you can use B2W's Spark Athena Driver:
https://github.com/B2W-BIT/athena-spark-driver

Related

Load a table from another database in pyspark

I am currently working with AWS and PySpark. My tables are stored in S3 and queryable from Athena.
In my Glue jobs, I'm used to loading my tables like this:
my_table_df = sparkSession.table("myTable")
However, this time I want to access a table from another database in the same data source (AwsDataCatalog). So I do the following, which works well:
my_other_table_df = sparkSession.sql("SELECT * FROM anotherDatabase.myOtherTable")
I am just looking for a better way to write the same thing, without using a SQL query, in one line, just by specifying the database for this operation. Something that would look like:
sparkSession.database("anotherDatabase").table("myOtherTable")
Any suggestion would be welcome.
You can use the DynamicFrameReader for that. It returns a DynamicFrame, and you can just call .toDF() on that DynamicFrame to turn it into a native Spark DataFrame.
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)

# database/table_name refer to the Glue Data Catalog database and table to load
data_source = glue_context.create_dynamic_frame.from_catalog(
    database="database",
    table_name="table_name"
).toDF()

Spark Cassandra Connector in action: how does it work if Cassandra is hosted on a different server?

Scenario: Cassandra is hosted on a server a.b.c.d and Spark runs on a server, say, w.x.y.z.
Assume I want to transform the data from a table (say table) in Cassandra and rewrite it to another table (say tableNew) in Cassandra using Spark. The code that I write looks something like this:
val conf = new SparkConf(true)
  .set("spark.cassandra.connection.host", "a.b.c.d")
  .set("spark.cassandra.auth.username", "<UserName>")
  .set("spark.cassandra.auth.password", "<Password>")

val spark = SparkSession.builder().master("yarn")
  .config(conf)
  .getOrCreate()

val dfFromCassandra = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "<table>", "keyspace" -> "<Keyspace>"))
  .load()

val filteredDF = dfFromCassandra.filter(filterCriteria)
  .write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "<tableNew>", "keyspace" -> "<Keyspace>"))
  .save()
Here filterCriteria represents the transformation/filtering that I do. I am not sure how the Spark Cassandra connector works internally in this case.
This is the confusion that I have:
1: Does Spark load the data from the Cassandra source table into memory, then filter it and write it back to the target table? Or
2: Does the Spark Cassandra connector convert the filter criteria to a WHERE clause, load only the relevant data to form the RDD, and write that back to the target table in Cassandra? Or
3: Does the entire operation happen as a CQL operation, where the query is converted to a CQL-like query and executed in Cassandra itself? (I am almost sure that this is not what happens.)
It is either 1 or 2, depending on your filterCriteria. Naturally, Spark itself can't do any CQL filtering, but custom data sources can implement it using predicate pushdown. In the case of the Cassandra connector, it is implemented here, and the answer depends on whether the pushdown covers the used filterCriteria.
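A quick way to check which case applies is to look at the physical plan: predicates that are pushed down to Cassandra show up there. A minimal PySpark sketch, with placeholder table, keyspace and column names:
df = (spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(table="my_table", keyspace="my_keyspace")
    .load())

filtered = df.filter(df["id"] == "foo")
filtered.explain()  # predicates pushed to Cassandra typically appear as PushedFilters in the scan node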

How to write the result of a structured query with the hive format?

I'm trying to insert data into a Hive table via the DataStreamWriter class using the hive format.
rdf.writeStream.format("hive").partitionBy("date")
.option("checkpointLocation", "/tmp/")
.option("db", "default")
.option("table", "Daily_summary_data")
.queryName("socket-hive-streaming")
.start()
It throws the following error:
org.apache.spark.sql.AnalysisException: Hive data source can only be
used with tables, you can not write files of Hive data source
directly.;
How to resolve this?
It's not possible to write the result of a structured streaming query using the hive data source (as per the exception that you can find here).
Writing with the parquet format may be an option instead.
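For example, a sketch of the same stream written in parquet format instead (PySpark syntax; the output path and checkpoint location are placeholders, and rdf is the streaming DataFrame from the question):
query = (rdf.writeStream
    .format("parquet")
    .partitionBy("date")
    .option("checkpointLocation", "/tmp/checkpoint")
    .option("path", "/user/hive/warehouse/daily_summary_data")  # hypothetical output path
    .queryName("socket-parquet-streaming")
    .start())
If you still need the data visible through Hive, you can define an external Hive table over that output path.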

How to save data in parquet format and append entries

I am trying to follow this example to save some data in parquet format and read it back. If I use write.parquet("filename"), then the iterating Spark job gives the error that
"filename" already exists.
If I use the SaveMode.Append option, then the Spark job gives the error
".spark.sql.AnalysisException: Specifying database name or other qualifiers are not allowed for temporary tables".
Please let me know the best way to ensure new data is just appended to the parquet file. Can I define primary keys on these parquet tables?
I am using Spark 1.6.2 on a Hortonworks 2.5 system. Here is the code:
// Option 1: peopleDF.write.parquet("people.parquet")
//Option 2:
peopleDF.write.format("parquet").mode(SaveMode.Append).saveAsTable("people.parquet")
// Read in the parquet file created above
val parquetFile = spark.read.parquet("people.parquet")
//Parquet files can also be registered as tables and then used in SQL statements.
parquetFile.registerTempTable("parquetFile")
val teenagers = sqlContext.sql("SELECT * FROM people.parquet")
I believe that if you use .parquet("...."), you should use .mode("append") rather than SaveMode.Append:
df.write.mode("append").parquet("....")
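Putting that together, a minimal sketch (PySpark/Spark 1.6 syntax; people.parquet is a path rather than a table name, and the name/age columns follow the Spark documentation example the question is based on):
# append new rows to the existing parquet dataset at this path
peopleDF.write.mode("append").parquet("people.parquet")

# read it back and query it as a temp table
parquetFile = sqlContext.read.parquet("people.parquet")
parquetFile.registerTempTable("parquetFile")
teenagers = sqlContext.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")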

N1QL query to connect Databricks Spark 1.6 to Couchbase Server 4.5

I am trying to set up a connection from Databricks to Couchbase Server 4.5 and then run a N1QL query.
The Scala code below returns one record but fails when I introduce the N1QL query. Any help is appreciated.
import com.couchbase.client.java.CouchbaseCluster;
import scala.collection.JavaConversions._;
import com.couchbase.client.java.query.Select.select;
import com.couchbase.client.java.query.dsl.Expression;
import com.couchbase.client.java.query.Query
// Connect to a cluster on localhost
val cluster = CouchbaseCluster.create("http://**************")
// Open the default bucket
val bucket = cluster.openBucket("travel-sample", "password");
// Read it back out
//val streamsense = bucket.get("airline_1004546") - Works and returns one record
// Create a DataFrame with schema inference
val ev = sql.read.couchbase(schemaFilter = EqualTo("type", "airline"))
//Show the inferred schema
ev.printSchema()
//query using the data frame
ev
.select("id", "type")
.show(10)
//issue sql query for the same data (N1ql)
val query = "SELECT type, meta().id FROM `travel-sample` LIMIT 10"
sc
.couchbaseQuery(N1qlQuery.simple(query))
.collect()
.foreach(println)
In Databricks (and usually in any interactive Spark cloud environment) you do not define the cluster nodes, buckets or the sc variable; instead you need to set the configuration settings for Spark to use when setting up the Databricks cluster. Use the advanced settings option when creating the cluster.
I've only used this approach with Spark 2.0, so your mileage may vary.
You can remove your cluster and bucket variable initialisation as well.
You have a syntax error in the N1QL query. You have:
val query = "SELECT type, id FROM `travel-sample` WHERE LIMIT 10"
You need to either remove the WHERE, or add a condition.
You also need to change id to META().id.
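With both fixes applied, the query becomes:
SELECT type, META().id FROM `travel-sample` LIMIT 10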