Connecting AWS Glue to MongoDB Atlas Cluster - mongodb

Has anyone ever managed to get this to work? I've added a connection in AWS Glue to connect to my MongoDB cluster in Atlas, but I keep getting the following error in AWS:
Check that your connection definition references your Mongo database with correct URL syntax, username, and password Exiting with error code 30
I spun up an EC2 instance in the same subnet as the Glue connection in my VPC and it connects just fine. I've also allowed all traffic in my security group, but I still get the same error.

You might need to take a look at the authSource option in the MongoDB connection string URI.
Scala Example
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.MappingSpec
import com.amazonaws.services.glue.errors.CallSite
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.DynamicFrame
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._

object GlueApp {
  val DEFAULT_URI: String = "mongodb://<an_ip_from_atlas_project_ip_access_list>:27017"
  val WRITE_URI: String = "mongodb://<an_ip_from_atlas_project_ip_access_list>:27017"

  lazy val defaultJsonOption = jsonOptions(DEFAULT_URI)
  lazy val writeJsonOption = jsonOptions(WRITE_URI)

  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)

    // Get a DynamicFrame from MongoDB
    val resultFrame: DynamicFrame = glueContext.getSource("mongodb", defaultJsonOption).getDynamicFrame()

    // Write the DynamicFrame back to MongoDB
    glueContext.getSink("mongodb", writeJsonOption).writeDynamicFrame(resultFrame)

    Job.commit()
  }

  private def jsonOptions(uri: String): JsonOptions = {
    new JsonOptions(
      s"""{"uri": "${uri}",
         |"database": "test",
         |"collection": "coll",
         |"username": "username",
         |"password": "pwd",
         |"ssl": "true",
         |"ssl.domain_match": "false",
         |"partitioner": "MongoSamplePartitioner",
         |"partitionerOptions.partitionSizeMB": "10",
         |"partitionerOptions.partitionKey": "_id"}""".stripMargin)
  }
}
You may also need to set the authentication source (authSource) to a database in the cluster you intend to connect to:
mongodb://<an_ip_from_atlas_project_ip_access_list>:27017/?authSource=test
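For example, the same jsonOptions helper from the snippet above can carry the option by appending it to the connection URI (a minimal sketch; the host, database, collection, and credentials are placeholders):

import com.amazonaws.services.glue.util.JsonOptions

// A minimal sketch of the read options with authSource appended to the URI.
// Replace the placeholder host, database, collection, and credentials with your own values.
object MongoOptionsWithAuthSource {
  val READ_URI: String = "mongodb://<an_ip_from_atlas_project_ip_access_list>:27017/?authSource=test"

  def readOptions: JsonOptions = new JsonOptions(
    s"""{"uri": "${READ_URI}",
       |"database": "test",
       |"collection": "coll",
       |"username": "username",
       |"password": "pwd",
       |"ssl": "true",
       |"ssl.domain_match": "false"}""".stripMargin)
}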
References
docs.atlas.mongodb.com. 2021. Connect to a Cluster. [ONLINE] Available at: https://docs.atlas.mongodb.com/connect-to-cluster.
docs.aws.amazon.com. 2021. Examples: Setting Connection Types and Options. [ONLINE] Available at: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect-samples.html.
docs.mongodb.com. 2021. Configuration Options. [ONLINE] Available at: https://docs.mongodb.com/spark-connector/master/configuration#partitioner-conf.
docs.mongodb.com. 2021. Connection String URI Format. [ONLINE] Available at: https://docs.mongodb.com/manual/reference/connection-string/.

Related

Not able to read data from AWS S3 (ORC) through IntelliJ local (Spark/Scala)

We are reading data/tables from AWS (Hive) through Spark/Scala using IntelliJ (which is on a local machine). We are able to see the schema of the table, but not able to read the data.
Please find the flow below for a better understanding:
IntelliJ (Spark/Scala) ------> hive:9083 (remote) ------> S3 (ORC)
Note: IntelliJ is on the local machine; Hive and S3 are on AWS.
Please find the code below:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
//import org.apache.spark.sql.hive.HiveContext

object hiveconnect {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("SparkHiveExample")
      .config("hive.metastore.uris", "thrift://10.20.30.40:9083")
      .master("local[*]")
      .config("spark.sql.warehouse.dir", "s3://abc/test/main")
      .config("spark.driver.allowMultipleContexts", "true")
      .config("access-key", "key")
      .config("secret-key", "key")
      .enableHiveSupport()
      .getOrCreate()
    println("Start of SQL Session--------------------")
    spark.sql("show databases").show()
    spark.sql("select * from ace.visit limit 5").show()
  }
}
Error: Exception in thread "main" java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3 URL, or by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties (respectively).
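The exception itself names the Hadoop properties it expects. A rough sketch of wiring the credentials through those properties instead of the unrecognized "access-key"/"secret-key" keys (not from the original thread; the bucket, metastore address, and key values are placeholders):

import org.apache.spark.sql.SparkSession

object HiveConnectWithS3Creds {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("SparkHiveExample")
      .master("local[*]")
      .config("hive.metastore.uris", "thrift://10.20.30.40:9083")
      .config("spark.sql.warehouse.dir", "s3://abc/test/main")
      // spark.hadoop.* settings are copied into the Hadoop configuration,
      // which is where the fs.s3 credential properties from the error live.
      .config("spark.hadoop.fs.s3.awsAccessKeyId", "key")
      .config("spark.hadoop.fs.s3.awsSecretAccessKey", "key")
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("show databases").show()
    spark.sql("select * from ace.visit limit 5").show()
  }
}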

JDBC-HiveServer: 'client_protocol is unset!' - Both 1.1.1 in CS

Before asking this question, I had already read many, many articles via Google. Many answers say it is a version mismatch between the client side and the server side. So I decided to copy the jars from the server side to the client side directly, and the result is... as you know, the same exception:
org.apache.thrift.TApplicationException: Required field 'client_protocol' is unset! Struct:TOpenSessionReq(client_protocol:null, configuration:{use:database=default})
It works fine when I connect to hiveserver2 through beeline :)
So I thought it would work when I use JDBC too. But unfortunately it throws that exception. Below are the jars in my project:
hive-jdbc-1.1.1.jar
hive-jdbc-standalone.jar
hive-metastore-1.1.1.jar
hive-service-1.1.1.jar
Those Hive jars are copied from the server side.
import java.sql.DriverManager
import java.util.Properties

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

def connect_hive(master: String) {
  val conf = new SparkConf()
    .setMaster(master)
    .setAppName("Hive")
    .set("spark.local.dir", "./tmp")
  val sc = new SparkContext(conf)
  val sqlContext = new SQLContext(sc)
  val url = "jdbc:hive2://192.168.40.138:10000"
  val prop = new Properties()
  prop.setProperty("user", "hive")
  prop.setProperty("password", "hive")
  prop.setProperty("driver", "org.apache.hive.jdbc.HiveDriver")
  val conn = DriverManager.getConnection(url, prop)
  sc.stop()
}
The configuration of my server:
hadoop 2.7.3
spark 1.6.0
hive 1.1.1
Has anyone encountered the same situation when connecting to Hive through Spark JDBC?
Since beeline works, your program should be expected to execute correctly as well.
Print the current project classpath; you can try something like this to check it for yourself:
import java.net.URL
import java.net.URLClassLoader
import scala.collection.JavaConversions._

object App {
  def main(args: Array[String]) {
    val cl = ClassLoader.getSystemClassLoader
    val urls = cl.asInstanceOf[URLClassLoader].getURLs
    for (url <- urls) {
      println(url.getFile)
    }
  }
}
Also check hive.aux.jars.path=<file urls> to understand which jars are present in the classpath.
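If the server's hive-site.xml is on your classpath, a small sketch like the following (a hypothetical helper, not part of the original answer) prints what hive.aux.jars.path resolves to, so you can compare those jars with the ones in your client project:

import org.apache.hadoop.hive.conf.HiveConf

// Prints hive.aux.jars.path as seen by whatever hive-site.xml is on the classpath.
// Compare the listed jars with the hive-jdbc / hive-service jars bundled in the client.
object PrintHiveAuxJars {
  def main(args: Array[String]): Unit = {
    val conf = new HiveConf()
    val auxJars = Option(conf.get("hive.aux.jars.path")).getOrElse("<not set>")
    println(s"hive.aux.jars.path = $auxJars")
  }
}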

Running MLlib via Spark Job Server

I was practising developing a sample model using the online resources provided on the Spark website. I managed to create the model and run it on sample data using spark-shell, but how do you actually run the model in a production environment? Is it via Spark Job Server?
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile("hdfs://mycluster/user/Cancer.csv")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts.last.toDouble, Vectors.dense(parts.take(9).map(_.toDouble)))
}
var svm = new SVMWithSGD().setIntercept(true)
val model = svm.run(parsedData)
var predictedValue = model.predict(Vectors.dense(5,1,1,1,2,1,3,1,1))
println(predictedValue)
The above code works perfectly when I run it in spark-shell, but I have no idea how we actually run a model in a production environment. I tried to run it via Spark Job Server, but I get an error:
curl -d "input.string = 1, 2, 3, 4, 5, 6, 7, 8, 9" 'ptfhadoop01v:8090/jobs?appName=SQL&classPath=spark.jobserver.SparkPredict'
I am sure it's because I am passing a String value whereas the program expects vector elements. Can someone guide me on how to achieve this? Also, is this how data is passed to the model in a production environment, or is it some other way?
Spark Job Server is used in production use cases where you want to design pipelines of Spark jobs and also (optionally) reuse the SparkContext across jobs, over a REST API. Sparkplug is an alternative to Spark Job Server, providing similar constructs.
However, to answer your question on how to run a (singular) Spark job in production environments, the answer is that you do not need a third-party library to do so. You only need to construct a SparkContext object and use it to trigger Spark jobs. For instance, for your code snippet, all that is needed is:
package runner

import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import com.typesafe.config.{ConfigFactory, Config}
import org.apache.spark.{SparkConf, SparkContext}

object SparkRunner {

  def main(args: Array[String]) {
    val config: Config = ConfigFactory.load("app-default-config") /* Use a library to read a config file */
    val sc: SparkContext = constructSparkContext(config)

    val data = sc.textFile("hdfs://mycluster/user/Cancer.csv")
    val parsedData = data.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts.last.toDouble, Vectors.dense(parts.take(9).map(_.toDouble)))
    }
    var svm = new SVMWithSGD().setIntercept(true)
    val model = svm.run(parsedData)
    var predictedValue = model.predict(Vectors.dense(5,1,1,1,2,1,3,1,1))
    println(predictedValue)
  }

  def constructSparkContext(config: Config): SparkContext = {
    val conf = new SparkConf()
    conf
      .setMaster(config.getString("spark.master"))
      .setAppName(config.getString("app.name"))
    /* Set more configuration values here */
    new SparkContext(conf)
  }
}
Optionally, you can also use the SparkSubmit wrapper for the spark-submit script, provided in the Spark library itself.

Writing DataFrame to MemSQL Table in Spark

I'm trying to load a .parquet file into a MemSQL database with Spark and the MemSQL Connector.
package com.memsql.spark

import com.memsql.spark.context._
import org.apache.spark._
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import com.memsql.spark.connector._
import com.mysql.jdbc._

object readParquet {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("ReadParquet")
    val sc = new SparkContext(conf)
    sc.addJar("/data/applications/spark-1.5.1-bin-hadoop2.6/lib/mysql-connector-java-5.1.37-bin.jar")
    sc.addJar("/data/applications/spark-1.5.1-bin-hadoop2.6/lib/memsql-connector_2.10-1.1.0.jar")
    Class.forName("com.mysql.jdbc.Driver")

    val host = "xxxx"
    val port = 3306
    val dbName = "WP1"
    val user = "root"
    val password = ""
    val tableName = "rt_acc"

    val memsqlContext = new com.memsql.spark.context.MemSQLContext(sc, host, port, user, password)
    val rt_acc = memsqlContext.read.parquet("tachyon://localhost:19998/rt_acc.parquet")
    val func_rt_acc = new com.memsql.spark.connector.DataFrameFunctions(rt_acc)
    func_rt_acc.saveToMemSQL(dbName, tableName, host, port, user, password)
  }
}
I'm fairly certain that Tachyon is not causing the problem, as the same exceptions occur if the file is loaded from disk, and I can use SQL queries on the DataFrame.
I've seen people suggest df.saveToMemSQL(..); however, it seems this method is in DataFrameFunctions now.
Also, the table doesn't exist yet, but saveToMemSQL should do CREATE TABLE, as the documentation and source code tell me.
Edit: OK, I guess I misread something. saveToMemSQL doesn't create the table. Thanks.
Try using createMemSQLTableAs instead of saveToMemSQL.
saveToMemSQL loads a DataFrame into an existing table, whereas createMemSQLTableAs creates the table and then loads it.
It also returns a handy DataFrame wrapping that MemSQL table :).
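A rough sketch of the change, reusing the variables from the snippet above (this assumes createMemSQLTableAs takes the same connection arguments as saveToMemSQL in this connector version; check the connector's API docs for the exact signature):

// Hypothetical sketch: swap saveToMemSQL for createMemSQLTableAs so the table is
// created before the DataFrame is loaded; the argument list mirrors the original call.
val func_rt_acc = new com.memsql.spark.connector.DataFrameFunctions(rt_acc)
func_rt_acc.createMemSQLTableAs(dbName, tableName, host, port, user, password)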

Writing to HBase via Spark: Task not serializable

I'm trying to write some simple data to HBase (0.96.0-hadoop2) using Spark 1.0, but I keep getting serialization problems. Here is the relevant code:
import org.apache.hadoop.hbase.client._
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.rdd.NewHadoopRDD
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.mapred.JobConf
import org.apache.spark.SparkContext
import java.util.Properties
import java.io.FileInputStream
import org.apache.hadoop.hbase.client.Put

object PutRawDataIntoHbase {
  def main(args: Array[String]): Unit = {
    var propFileName = "hbaseConfig.properties"
    if (args.size > 0) {
      propFileName = args(0)
    }

    /** Load properties here **/
    val theData = sc.textFile(prop.getProperty("hbase.input.filename"))
      .map(l => l.split("\t"))
      .map(a => Array("%010d".format(a(9).toInt) + "-" + a(0), a(1)))

    val tableName = prop.getProperty("hbase.table.name")
    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set("hbase.rootdir", prop.getProperty("hbase.rootdir"))
    hbaseConf.addResource(prop.getProperty("hbase.site.xml"))

    val myTable = new HTable(hbaseConf, tableName)
    theData.foreach { a =>
      var p = new Put(Bytes.toBytes(a(0)))
      p.add(Bytes.toBytes(hbaseColFamily), Bytes.toBytes("col"), Bytes.toBytes(a(1)))
      myTable.put(p)
    }
  }
}
Running the code results in:
Failed to run foreach at putDataIntoHBase.scala:79
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException:org.apache.hadoop.hbase.client.HTable
Replacing the foreach with map doesn't crash, but it doesn't write anything either.
Any help will be greatly appreciated.
The class HBaseConfiguration represents a pool of connections to HBase servers. Obviously, it can't be serialized and sent to the worker nodes. Since HTable uses this pool to communicate with the HBase servers, it can't be serialized either.
Basically, there are three ways to handle this problem:
Open a connection on each of the worker nodes.
Note the use of the foreachPartition method:
val tableName = prop.getProperty("hbase.table.name")
// <......>
theData.foreachPartition { iter =>
  val hbaseConf = HBaseConfiguration.create()
  // <... configure HBase ...>
  val myTable = new HTable(hbaseConf, tableName)
  iter.foreach { a =>
    var p = new Put(Bytes.toBytes(a(0)))
    p.add(Bytes.toBytes(hbaseColFamily), Bytes.toBytes("col"), Bytes.toBytes(a(1)))
    myTable.put(p)
  }
}
Note that each of the worker nodes must have access to the HBase servers and must have the required jars preinstalled or provided via ADD_JARS.
Also note that since the connection pool is opened for each of the partitions, it would be a good idea to reduce the number of partitions to roughly the number of worker nodes (with the coalesce function; a short sketch follows below). It's also possible to share a single HTable instance on each of the worker nodes, but that is not so trivial.
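For example, a minimal sketch of coalescing before opening the per-partition connections (numWorkerNodes is a placeholder you would set for your cluster):

// Roughly the number of worker nodes in the cluster (hypothetical value).
val numWorkerNodes = 4

theData
  .coalesce(numWorkerNodes)  // one partition, and thus one HBase connection, per worker
  .foreachPartition { iter =>
    val hbaseConf = HBaseConfiguration.create()
    // <... configure HBase as above ...>
    val myTable = new HTable(hbaseConf, tableName)
    iter.foreach { a =>
      val p = new Put(Bytes.toBytes(a(0)))
      p.add(Bytes.toBytes(hbaseColFamily), Bytes.toBytes("col"), Bytes.toBytes(a(1)))
      myTable.put(p)
    }
    myTable.close()
  }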
Serialize all data to a single box and write it to HBase
It's possible to write all data from an RDD via a single computer, even if the data doesn't fit in memory. The details are explained in this answer: Spark: Best practice for retrieving big data from RDD to local machine.
Of course, it would be slower than distributed writing, but it's simple, doesn't bring painful serialization issues, and might be the best approach if the data size is reasonable.
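A minimal sketch of that approach using toLocalIterator, which pulls one partition at a time to the driver (it assumes the driver machine can reach the HBase servers and reuses the variables from the original snippet):

// toLocalIterator streams partitions to the driver one at a time, so the whole
// RDD never has to fit in driver memory; only the driver talks to HBase here.
val hbaseConf = HBaseConfiguration.create()
// <... configure HBase as above ...>
val myTable = new HTable(hbaseConf, tableName)

theData.toLocalIterator.foreach { a =>
  val p = new Put(Bytes.toBytes(a(0)))
  p.add(Bytes.toBytes(hbaseColFamily), Bytes.toBytes("col"), Bytes.toBytes(a(1)))
  myTable.put(p)
}
myTable.close()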
Use HadoopOutputFormat
It's possible to create a custom Hadoop OutputFormat for HBase or use an existing one. I'm not sure whether something exists that fits your needs exactly, but Google should help here.
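For what it's worth, one existing option is HBase's own TableOutputFormat together with saveAsNewAPIHadoopDataset. A minimal sketch (not from the original answer), reusing theData, tableName, and hbaseColFamily from the snippets above and assuming HBase itself is configured via hbase-site.xml or the properties shown earlier:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job

val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableOutputFormat.OUTPUT_TABLE, tableName)

// The Job object is only used to carry the output-format configuration.
val job = Job.getInstance(hbaseConf)
job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])
job.setOutputKeyClass(classOf[ImmutableBytesWritable])
job.setOutputValueClass(classOf[Put])

// Build (rowkey, Put) pairs on the executors; nothing non-serializable is
// captured in the closure, so the "Task not serializable" problem goes away.
val puts = theData.map { a =>
  val p = new Put(Bytes.toBytes(a(0)))
  p.add(Bytes.toBytes(hbaseColFamily), Bytes.toBytes("col"), Bytes.toBytes(a(1)))
  (new ImmutableBytesWritable(Bytes.toBytes(a(0))), p)
}

puts.saveAsNewAPIHadoopDataset(job.getConfiguration)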
P.S. By the way, the map call doesn't crash because it doesn't get evaluated: RDDs aren't evaluated until you invoke an action. For example, if you followed theData.map(....) with an action such as count(), it would crash too.