Read Data from MongoDB through Apache Spark with a query - mongodb

I am able to read data stored in MongoDB from Apache Spark using the conventional methods described in its documentation. I have a MongoDB query that I would like to apply while loading the collection. The query is simple, but I can't seem to find the correct way to specify it in the config() function on the SparkSession builder.
Following is my SparkSession builder
val conf = new SparkConf() // org.apache.spark.SparkConf
val confMap: Map[String, String] = Map(
  "spark.mongodb.input.uri" -> "mongodb://xxx:xxx@mongodb1:27017,mongodb2:27017,mongodb3:27017/?ssl=true&replicaSet=MongoShard-0&authSource=xxx&retryWrites=true&authMechanism=SCRAM-SHA-1",
  "spark.mongodb.input.database" -> "A",
  "spark.mongodb.input.collection" -> "people",
  "spark.mongodb.output.database" -> "B",
  "spark.mongodb.output.collection" -> "result",
  "spark.mongodb.input.readPreference.name" -> "primaryPreferred"
)
conf.setAll(confMap)
val spark: SparkSession =
  SparkSession.builder().master("local[1]").config(conf).getOrCreate()
Is there a way to specify the MongoDB query in the SparkConf object so that the SparkSession reads only the specific fields present in the collection?

Use the .withPipeline API.
Example Code:
import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.ReadConfig
import org.bson.Document

val readConfig = ReadConfig(Map("uri" -> MONGO_DEV_URI, "collection" -> MONGO_COLLECTION_NAME, "readPreference.name" -> "secondaryPreferred"))
MongoSpark
  .load(spark.sparkContext, readConfig)
  .withPipeline(Seq(Document.parse(query)))
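Here query is a single aggregation-pipeline stage supplied as a JSON string. A hypothetical definition matching the question's goal of reading only specific fields (the field names are illustrative, not from the original post):
// Hypothetical pipeline stage: project only the fields you need (adjust names to your schema).
val query = """{ $project: { name: 1, age: 1, _id: 0 } }"""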
As per comments:
sparkSession.read.format("com.mongodb.spark.sql.DefaultSource")
  .option("pipeline", "[{ $match: { name: { $exists: true } } }]")
  .option("uri", "mongodb://127.0.0.1/mydb.mycoll")
  .load()

Related

Scala Spark job to upload CSV file data to mongo DB fails due to CodecConfigurationException

I'm new to both Spark and Scala. I'm trying to upload a CSV file to MongoDB using a Spark job written in Scala.
On upload, I'm facing the following error during job execution:
org.bson.codecs.configuration.CodecConfigurationException: Can't find a codec for class Eligibility.
The path to the input file is passed as an argument at execution time.
I've been stuck on this issue for the past two days. Any help to overcome it is appreciated.
Thanks.
I have tried the same approach for uploading to Elasticsearch and it worked like a charm.
import org.apache.spark.sql.Row
import com.mongodb.spark._
import com.mongodb.spark.config.WriteConfig
import org.apache.spark.sql.{SaveMode, SparkSession}
import com.test.Config

object MongoUpload {
  val host = <host>
  val user = <user>
  val pwd = <password>
  val database = <db>
  val collection = <collection>
  val uri = s"mongodb://${user}:${pwd}@${host}/"
  val NOW = java.time.LocalDate.now.toString

  def main(args: Array[String]) {
    val spark = SparkSession
      .builder()
      .appName("Mongo-Test-Upload")
      .config("spark.mongodb.output.uri", uri)
      .getOrCreate()

    spark
      .read
      .format("csv")
      .option("header", "true")
      .load(args(0))
      .rdd
      .map(toEligibility)
      .saveToMongoDB(
        WriteConfig(
          Map(
            "uri" -> uri,
            "database" -> database,
            "collection" -> collection
          )
        )
      )
  }

  def toEligibility(row: Row): Eligibility =
    Eligibility(
      row.getAs[String]("DATE_OF_BIRTH"),
      row.getAs[String]("GENDER"),
      row.getAs[String]("INDIVIDUAL_ID"),
      row.getAs[String]("PRODUCT_NAME"),
      row.getAs[String]("STATE_CODE"),
      row.getAs[String]("ZIPCODE"),
      NOW
    )
}

case class Eligibility(
  dateOfBirth: String,
  gender: String,
  recordId: String,
  ProductIdentifier: String,
  stateCode: String,
  zipCode: String,
  updateDate: String
)
The Spark job fails with the following error: Caused by: org.bson.codecs.configuration.CodecConfigurationException: Can't find a codec for class Eligibility
You can either map each row to a Document of the desired format, or convert to a Dataset and then save it, e.g.:
import spark.implicits._

spark
  .read
  .format("csv")
  .option("header", "true")
  .load(args(0))
  .rdd
  .map(toEligibility)
  .toDS()
  .write
  .format("com.mongodb.spark.sql.DefaultSource")
  .options(Map("uri" -> uri, "database" -> database, "collection" -> collection))
  .save()
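For the other alternative (mapping each row to a Document instead of a case class), a minimal sketch reusing spark, args(0), NOW, uri, database and collection from the question; the target field names are illustrative:
import org.bson.Document
import com.mongodb.spark._
import com.mongodb.spark.config.WriteConfig

// Build org.bson.Document values directly, so no custom codec is needed.
val docs = spark.read
  .format("csv")
  .option("header", "true")
  .load(args(0))
  .rdd
  .map { row =>
    new Document()
      .append("dateOfBirth", row.getAs[String]("DATE_OF_BIRTH"))
      .append("gender", row.getAs[String]("GENDER"))
      .append("recordId", row.getAs[String]("INDIVIDUAL_ID"))
      .append("updateDate", NOW)
  }

docs.saveToMongoDB(WriteConfig(Map("uri" -> uri, "database" -> database, "collection" -> collection)))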

InvalidJobConfException. Output directory not set

I'm trying to write some data into Bigtable using a SparkSession:
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

val spark = SparkSession
  .builder
  .config(conf)
  .appName("my-job")
  .getOrCreate()

val hadoopConf = spark.sparkContext.hadoopConfiguration
import spark.implicits._

case class BestSellerRecord(skuNbr: String, slsQty: String, slsDollar: String, dmaNbr: String, productId: String)

val seq: DataFrame = Seq(("foo", "1", "foo1"), ("bar", "2", "bar1")).toDF("key", "value1", "value2")

val bigtablePuts = seq.toDF.rdd.map((row: Row) => {
  val put = new Put(Bytes.toBytes(row.getString(0)))
  put.addColumn(Bytes.toBytes(columnFamily), Bytes.toBytes("nbr"), Bytes.toBytes(row.getString(0)))
  (new ImmutableBytesWritable(), put)
})

bigtablePuts.saveAsNewAPIHadoopDataset(hadoopConf)
But this gives me the following exception.
Exception in thread "main" org.apache.hadoop.mapred.InvalidJobConfException: Output directory not set.
at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:138)
at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.assertConf(SparkHadoopWriter.scala:391)
at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:71)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1083)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1081)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1081)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1081)
which comes from this line:
bigtablePuts.saveAsNewAPIHadoopDataset(hadoopConf)
I also tried setting different configurations using hadoopConf.set, such as conf.set("spark.hadoop.validateOutputSpecs", "false"), but that gives me a NullPointerException.
How can I fix this issue?
Can you try upgrading to the mapreduce API? The mapred API is deprecated.
This question shows an example of rewriting this code segment: Output directory not set exception when save RDD to hbase with spark (a sketch of the same idea follows below).
Hope this is helpful.
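A minimal sketch of that rewrite under the mapreduce API, reusing the question's bigtablePuts and hadoopConf; it assumes the Bigtable/HBase connection settings are already present in hadoopConf, and the table name below is a placeholder:
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.mapreduce.Job

// Wrap the Hadoop configuration in a mapreduce Job so the output format and target table
// are set explicitly; this avoids the FileOutputFormat "output directory not set" check.
val job = Job.getInstance(hadoopConf)
job.getConfiguration.set(TableOutputFormat.OUTPUT_TABLE, "my-table") // placeholder table name
job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])

bigtablePuts.saveAsNewAPIHadoopDataset(job.getConfiguration)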

Query Cassandra table for every Kafka Message

I am trying to query a Cassandra table for every single Kafka message.
Below is the code that I have been working on:
def main(args: Array[String]) {
  val spark = SparkSession
    .builder()
    .master("local[*]")
    .appName("Spark SQL basic example")
    .config("spark.cassandra.connection.host", "localhost")
    .config("spark.cassandra.connection.port", "9042")
    .getOrCreate()

  val topicsSet = List("Test").toSet
  val kafkaParams = Map[String, Object](
    "bootstrap.servers" -> "localhost:9092",
    "key.deserializer" -> classOf[StringDeserializer],
    "value.deserializer" -> classOf[StringDeserializer],
    "group.id" -> "12345",
    "auto.offset.reset" -> "latest",
    "enable.auto.commit" -> (false: java.lang.Boolean)
  )

  val messages = KafkaUtils.createDirectStream[String, String](
    ssc,
    LocationStrategies.PreferConsistent,
    ConsumerStrategies.Subscribe[String, String](topicsSet, kafkaParams))

  val lines = messages.map(_.value)

  val lines_myobjects = lines.map(line =>
    new Gson().fromJson(line, classOf[myClass]) // myClass is a simple case class which extends Serializable
    // This changes every single message into an object
  )
Now things get complicated: I cannot work out how to query the Cassandra table using the value carried by each Kafka message. Every Kafka message object has a getter that returns the relevant user id.
I have tried multiple ways to get around this. For instance:
val transformed_data = lines_myobjects.map(myobject => {
  val forest = spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(Map("table" -> "mytable", "keyspace" -> "mydb"))
    .load()
    .filter("userid='" + myobject.getuserId + "'")
})
I have also tried ssc.cassandraTable, with no luck.
The main goal is to get all the rows from the database whose userid matches the userid that comes from the Kafka message.
One thing I would like to mention: even though loading or querying the Cassandra database for every message is not efficient, the Cassandra data changes all the time, so I need fresh results.
You can't call spark.read or ssc.cassandraTable inside .map(...), because that would mean creating a new DataFrame/RDD for every single message. It doesn't work like that.
Please consider the following options:
1 - If you can fetch the required data with one or two CQL queries, try using CassandraConnector inside .mapPartitions(...). Something like this:
import com.datastax.spark.connector._
import com.datastax.spark.connector.cql._

val connector = CassandraConnector(spark.sparkContext.getConf) // instantiate CassandraConnector once here

val transformed_data = lines_myobjects.mapPartitions(it =>
  connector.withSessionDo { session =>
    it.map(myobject =>
      session.execute("CQL QUERY TO GET YOUR DATA HERE", myobject.getuserId))
  }
)
2 - Otherwise (if you select by primary/partition key), consider .joinWithCassandraTable. Something like this:
import com.datastax.spark.connector._

val mytableRDD = sc.cassandraTable("mydb", "mytable")

val transformed_data = lines_myobjects
  .map(myobject => {
    Tuple1(myobject.getuserId) // you need to wrap ids in a tuple to join with Cassandra
  })
  .joinWithCassandraTable("mydb", "mytable")
// process results here
I would approach this a different way.
Route the data that is flowing into Cassandra through Kafka as well (and send it from Kafka to Cassandra with the Kafka Connect sink).
With your data in Kafka, you can then join between your streams of data, whether in Spark, with Kafka's Streams API, or with KSQL.
Both Kafka Streams and KSQL support the kind of stream-table join you're doing here. You can see it in action with KSQL here and here.
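For a rough idea of what such a stream-table join looks like, here is a sketch in the Kafka Streams Scala DSL (Kafka 2.x); the topic names are hypothetical: "Test" carries the incoming messages keyed by userid, and "users-from-cassandra" is a compacted topic holding the data that currently lives in Cassandra (fed, for example, by Kafka Connect):
import java.util.Properties
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.Serdes._
import org.apache.kafka.streams.scala.StreamsBuilder

val builder = new StreamsBuilder()
val messages = builder.stream[String, String]("Test")                 // incoming messages, keyed by userid
val users    = builder.table[String, String]("users-from-cassandra")  // table view of the Cassandra-bound data

// Join each message with the matching table row by key (userid) and write the result out.
val enriched = messages.join(users)((message, user) => s"$message | $user")
enriched.to("enriched-output")

val props = new Properties()
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "stream-table-join-example")
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
new KafkaStreams(builder.build(), props).start()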

How to add a MongoDB-specific query using mongo-spark connector?

I am using "mongo-spark" (https://github.com/mongodb/mongo-spark) in order to read MongoDB from a Spark 2.0 application.
Here is a code example:
val readConfig: ReadConfig = ReadConfig(Map(
  "spark.mongodb.input.uri" -> "mongodb://mongodb01.blabla.com/xqwer",
  "collection" -> "some_collection"),
  None)

sparkSession.read.format("com.mongodb.spark.sql").options(readConfig.asOptions).load()
Does anyone know how to add a MongoDB query (e.g. find({ uid: 'ZesSZY3Ch0k8nQtQUIfH' }))?
You can use filter() on the DataFrame:
import sparkSession.implicits._
import org.apache.spark.sql.functions.lit

val df = sparkSession.read.format("com.mongodb.spark.sql")
  .options(readConfig.asOptions).load()
df.filter($"uid".equalTo(lit("ZesSZY3Ch0k8nQtQUIfH"))).show()
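Alternatively, as in the first answer on this page, you can push the query down to MongoDB itself as an aggregation pipeline, so only matching documents reach Spark. A sketch reusing sparkSession and readConfig from above; the $match stage simply restates the uid query from the question:
import com.mongodb.spark.MongoSpark
import org.bson.Document

// Push the filter to MongoDB via an aggregation pipeline instead of filtering in Spark.
val rdd = MongoSpark
  .load(sparkSession.sparkContext, readConfig)
  .withPipeline(Seq(Document.parse("""{ $match: { uid: "ZesSZY3Ch0k8nQtQUIfH" } }""")))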

Connect to SQLite in Apache Spark

I want to run a custom function on all tables in a SQLite database. The function is more or less the same, but depends on the schema of the individual table. Also, the tables and their schemata are only known at runtime (the program is called with an argument that specifies the path of the database).
This is what I have so far:
val conf = new SparkConf().setAppName("MyApp")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// somehow bind sqlContext to DB

val allTables = sqlContext.tableNames

for (t <- allTables) {
  val df = sqlContext.table(t)
  val schema = df.columns
  sqlContext.sql("SELECT * FROM " + t + "...").map(x => myFunc(x, schema))
}
The only hint I have found so far requires knowing the table in advance, which is not the case in my scenario:
val tableData =
  sqlContext.read.format("jdbc")
    .options(Map("url" -> "jdbc:sqlite:/path/to/file.db", "dbtable" -> t))
    .load()
I am using the xerial sqlite-jdbc driver. So how can I connect solely to a database, not to a table?
Edit: Using Beryllium's answer as a start I updated my code to this:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

val metaData = sqlContext.read.format("jdbc")
  .options(Map("url" -> "jdbc:sqlite:/path/to/file.db",
    "dbtable" -> "(SELECT * FROM sqlite_master) AS t")).load()

val myTableNames = metaData.select("tbl_name").distinct()

for (t <- myTableNames) {
  println(t.toString)
  val tableData = sqlContext.table(t.toString)
  for (record <- tableData.select("*")) {
    println(record)
  }
}
At least I can read the table names at runtime, which is a huge step forward for me. But I can't read the tables themselves. I tried both
val tableData = sqlContext.table(t.toString)
and
val tableData = sqlContext.read.format("jdbc")
  .options(Map("url" -> "jdbc:sqlite:/path/to/file.db",
    "dbtable" -> t.toString)).load()
in the loop, but in both cases I get a NullPointerException. Although I can print the table names, it seems I cannot connect to them.
Last but not least I always get an SQLITE_ERROR: Connection is closed error. It looks to be the same issue described in this question: SQLITE_ERROR: Connection is closed when connecting from Spark via JDBC to SQLite database
There are two options you can try.
Use JDBC directly
Open a separate, plain JDBC connection in your Spark job
Get the table names from the JDBC metadata
Feed these into your for comprehension (a minimal sketch follows below)
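A minimal sketch of this first option, assuming the same jdbc:sqlite:/path/to/file.db URL used elsewhere in the question and the xerial driver on the classpath:
import java.sql.DriverManager
import scala.collection.mutable.ListBuffer

// Read the table names through plain JDBC metadata, outside of Spark.
val url = "jdbc:sqlite:/path/to/file.db"
val conn = DriverManager.getConnection(url)
val tableNames = ListBuffer[String]()
try {
  val rs = conn.getMetaData.getTables(null, null, "%", Array("TABLE"))
  while (rs.next()) tableNames += rs.getString("TABLE_NAME")
} finally {
  conn.close()
}

// Then feed the collected names into the Spark JDBC reader, one table at a time.
for (t <- tableNames) {
  val df = sqlContext.read.format("jdbc")
    .options(Map("url" -> url, "dbtable" -> t))
    .load()
  df.show()
}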
Use a SQL query for the "dbtable" argument
You can specify a query as the value of the dbtable argument. Syntactically this query must "look" like a table, so it must be wrapped in a subquery.
In that query, get the metadata from the database:
val df = sqlContext.read.format("jdbc").options(
  Map(
    "url" -> "jdbc:postgresql:xxx",
    "user" -> "x",
    "password" -> "x",
    "dbtable" -> "(select * from pg_tables) as t")).load()
This example works with PostgreSQL; you have to adapt it for SQLite.
Update
It seems that the JDBC driver only supports iterating over one result set at a time.
Anyway, when you materialize the list of table names using collect(), the following snippet should work:
val myTableNames = metaData.select("tbl_name").map(_.getString(0)).collect()

for (t <- myTableNames) {
  println(t.toString)
  val tableData = sqlContext.read.format("jdbc")
    .options(
      Map(
        "url" -> "jdbc:sqlite:/x.db",
        "dbtable" -> t)).load()
  tableData.show()
}