I am writing code to cache RDBMS data using a Spark SQLContext JDBC connection. Once the DataFrame is created, I want to cache that result set with Apache Ignite so that other applications can make use of it. Here is the code snippet.
import org.apache.hadoop.conf.Configuration
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.ignite.spark.IgniteContext

object test {
  def main(args: Array[String]): Unit = {
    val configuration = new Configuration()
    val config = "src/main/scala/config.xml"
    val sparkConf = new SparkConf().setAppName("test").setMaster("local[*]")
    val sc = new SparkContext(sparkConf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val sql_dump1 = sqlContext.read.format("jdbc")
      .option("url", "jdbc URL")
      .option("driver", "com.mysql.jdbc.Driver")
      .option("dbtable", mysql_table_statement)
      .option("user", "username")
      .option("password", "pass")
      .load()
    val ic = new IgniteContext[Integer, Integer](sc, config)
    val sharedrdd = ic.fromCache("hbase_metadata")
    // How to cache the sql_dump1 DataFrame?
  }
}
Now the question is how to cache a DataFrame. IgniteRDD has a savePairs method, but it accepts key and value as RDD[Integer], whereas I have a DataFrame; even if I convert it to an RDD I would only get RDD[Row]. The savePairs signature with RDD of Integer seems very specific: what if my values are an RDD of String? Is it a good idea to cache a DataFrame, or is there a better approach to caching the result set?
There is no reason to store a DataFrame in an Ignite cache (shared RDD), since you won't benefit from it much: at the very least, you won't be able to execute Ignite SQL over the DataFrame.
I would suggest doing the following:
provide a CacheStore implementation for the hbase_metadata cache that preloads all the data from your underlying database. Then you can preload all the data into the cache using the Ignite.loadCache method. Here you may find an example on how to use JDBC persistent stores along with an Ignite cache (shared RDD)
use the Ignite Shared RDD SQL API to query over the cached data.
Alternatively, you can get sql_dump1 as you're doing now, iterate over each row, and store each row individually in the shared RDD using IgniteRDD.savePairs (see the sketch below). After that you can query the data using the same Ignite Shared RDD SQL mentioned above.
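For illustration, a rough sketch of that row-by-row approach; the String key/value types and the "id" key column are assumptions that you would adapt to your actual schema:
// Re-create the Ignite context with types that fit row data (String/String here is an assumption).
val ic = new IgniteContext[String, String](sc, config)
val sharedRdd = ic.fromCache("hbase_metadata")

// Turn each DataFrame row into a (key, value) pair; "id" is a hypothetical key column.
val pairs = sql_dump1.rdd.map(row => (row.getAs[String]("id"), row.mkString(",")))
sharedRdd.savePairs(pairs)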
Use case: read protobuf messages from Kafka, deserialize them, apply some transformations (flatten out some columns), and write to DynamoDB.
Unfortunately, the Kafka Flink connector only supports the CSV, JSON, and Avro formats, so I had to use the lower-level DataStream API.
Problem: if I can create a table out of the DataStream object, then I can accept a query to run on that table. That would make the transformation part seamless and generic. Is it possible to run a SQL query over a DataStream object?
If you have a DataStream of objects, you can simply register that DataStream as a table using StreamTableEnvironment.
This would look more or less like below:
val myStream = ...
val env: StreamExecutionEnvironment = configureFlinkEnv(StreamExecutionEnvironment.getExecutionEnvironment)
val tEnv: StreamTableEnvironment = StreamTableEnvironment.create(env)
tEnv.registerDataStream("myTable", myStream, [Field expressions])
Then you should be able to query the dynamic table created from your DataStream, for example as sketched below.
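A minimal sketch of such a query; "col1" is a placeholder for whatever field names you actually registered for myTable:
import org.apache.flink.table.api.Table

// "col1" is a hypothetical field name; use the fields registered for myTable above.
val result: Table = tEnv.sqlQuery("SELECT col1, COUNT(*) AS cnt FROM myTable GROUP BY col1")

// Convert back to a DataStream with tEnv.toRetractStream / toAppendStream
// if the rest of the pipeline stays on the DataStream API.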
I'm connecting Spark to Cassandra, and I was able to print the lines of my CSV using the conventional COPY approach. However, if the CSV is very large, as usually happens with big data, how could one load the CSV file a few lines at a time in order to avoid freezing and related issues?
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import com.datastax.spark.connector._

object SparkCassandra {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("SparkCassandra")
      .setMaster("local")
      .set("spark.cassandra.connection.host", "localhost")
    val sc = new SparkContext(conf)
    val my_rdd = sc.cassandraTable("my_keyspace", "my_csv")
    my_rdd.take(20).foreach(println)
    sc.stop()
  }
}
Should one use a time variable or something of that nature?
If you just want to load data into Cassandra, or unload data from Cassandra, using the command line, I would recommend looking at the DataStax Bulk Loader (DSBulk); it's heavily optimized for loading data to/from Cassandra/DSE. It works with both open-source Cassandra and DSE.
In the simplest case, loading into and unloading from a table look like this (the default format is CSV):
dsbulk load -k keyspace -t table -url my_file.csv
dsbulk unload -k keyspace -t table -url my_file.csv
For more complex cases you may need to provide more options. You can find more information in the following series of blog posts.
If you want to do this with Spark, then I recommend using the DataFrame API instead of RDDs. In that case you just use the standard read and write functions.
To export data from Cassandra to CSV:
import org.apache.spark.sql.cassandra._
val data = spark.read.cassandraFormat("tbl", "ks").load()
data.write.format("csv").save("my_file.csv")
Or to read from CSV and store in Cassandra:
import org.apache.spark.sql.cassandra._
import org.apache.spark.sql.SaveMode
val data = spark.read.format("csv").load("my_file.csv")
data.write.cassandraFormat("tbl", "ks").mode(SaveMode.Append).save()
Situation
I'm currently writing a consumer/producer using AVRO and a schema repository.
From what I gather, my options for serializing this data are either to use Confluent's Avro serializer or to go with Twitter's Bijection.
Bijection seemed the most straightforward.
So I want to produce data in the form ProducerRecord[String, Array[Byte]], which comes down to [some string ID, serialized GenericRecord].
(Note: I'm going for generic records because this codebase has to handle thousands of schemas that get parsed from JSON/CSV/...)
Question:
The whole reason I serialize and use Avro is that you don't need to include the schema in the data itself (as you would with JSON/XML/...).
When checking the data in the topic, however, I see that the whole schema is contained together with the data. Am I doing something fundamentally wrong, is this by design, or should I use the Confluent serializer instead?
Code:
def jsonStringToAvro(jString: String, schema: Schema): GenericRecord = {
  val converter = new JsonAvroConverter
  val genericRecord = converter.convertToGenericDataRecord(jString.replaceAll("\\\\/", "_").getBytes(), schema)
  genericRecord
}

def serializeAsByteArray(avroRecord: GenericRecord): Array[Byte] = {
  //val genericRecordInjection = GenericAvroCodecs.toBinary(avroRecord.getSchema)
  val r: Array[Byte] = GenericAvroCodecs.toBinary(avroRecord.getSchema).apply(avroRecord)
  r
}

// schema comes from a REST call to the schema repository
val producerRecord = new ProducerRecord[String, Array[Byte]](topic, myStringKeyGoesHere, serializeAsByteArray(jsonStringToAvro(jsonObjectAsStringGoesHere, schema)))
producer.send(producerRecord, new Callback {...})
If you look at the Confluent source code, you'll see that the order of operations for interacting with a schema repository is:
Take the schema from the Avro record and compute its ID. Ideally, POST-ing the schema to the repository, or otherwise hashing it, should give you an ID.
Allocate a ByteBuffer
Write the returned ID to the buffer
Write the Avro object value (excluding the schema) as bytes into the buffer
Send that byte buffer to Kafka
Presently, your Bijection usage includes the schema in the bytes rather than replacing it with an ID.
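For illustration, a minimal sketch of that framing; the leading 0x0 magic byte plus a 4-byte schema ID is the Confluent wire format, and schemaId is assumed to come from registering the schema with the repository (Confluent's KafkaAvroSerializer does all of this for you):
import java.nio.ByteBuffer

// Frame an Avro payload the way the steps above describe:
// magic byte, 4-byte schema ID, then the schema-less Avro binary.
def confluentFrame(schemaId: Int, avroPayload: Array[Byte]): Array[Byte] = {
  val buffer = ByteBuffer.allocate(1 + 4 + avroPayload.length)
  buffer.put(0.toByte)      // magic byte
  buffer.putInt(schemaId)   // ID returned by the schema registry
  buffer.put(avroPayload)   // Avro-encoded record bytes, without the schema
  buffer.array()
}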
SparkSession
  .builder
  .master("local[*]")
  .config("spark.sql.warehouse.dir", "C:/tmp/spark")
  .config("spark.sql.streaming.checkpointLocation", "C:/tmp/spark/spark-checkpoint")
  .appName("my-test")
  .getOrCreate
  .readStream
  .schema(schema)
  .json("src/test/data")
  .cache
  .writeStream
  .start
  .awaitTermination
While executing this sample in Spark 2.1.0 I got an error.
Without the .cache call it worked as intended, but with .cache I got:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
FileSource[src/test/data]
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:196)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:35)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:33)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:128)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForBatch(UnsupportedOperationChecker.scala:33)
at org.apache.spark.sql.execution.QueryExecution.assertSupported(QueryExecution.scala:58)
at org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:69)
at org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:67)
at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:73)
at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:73)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:79)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:75)
at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:84)
at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:84)
at org.apache.spark.sql.execution.CacheManager$$anonfun$cacheQuery$1.apply(CacheManager.scala:102)
at org.apache.spark.sql.execution.CacheManager.writeLock(CacheManager.scala:65)
at org.apache.spark.sql.execution.CacheManager.cacheQuery(CacheManager.scala:89)
at org.apache.spark.sql.Dataset.persist(Dataset.scala:2479)
at org.apache.spark.sql.Dataset.cache(Dataset.scala:2489)
at org.me.App$.main(App.scala:23)
at org.me.App.main(App.scala)
Any idea?
Your (very interesting) case boils down to the following line (that you can execute in spark-shell):
scala> :type spark
org.apache.spark.sql.SparkSession
scala> spark.readStream.text("files").cache
org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
FileSource[files]
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:297)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:36)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:34)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForBatch(UnsupportedOperationChecker.scala:34)
at org.apache.spark.sql.execution.QueryExecution.assertSupported(QueryExecution.scala:63)
at org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:74)
at org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:72)
at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:78)
at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:78)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:84)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:89)
at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:89)
at org.apache.spark.sql.execution.CacheManager$$anonfun$cacheQuery$1.apply(CacheManager.scala:104)
at org.apache.spark.sql.execution.CacheManager.writeLock(CacheManager.scala:68)
at org.apache.spark.sql.execution.CacheManager.cacheQuery(CacheManager.scala:92)
at org.apache.spark.sql.Dataset.persist(Dataset.scala:2603)
at org.apache.spark.sql.Dataset.cache(Dataset.scala:2613)
... 48 elided
The reason for this turned out to be quite simple to explain (no pun on Spark SQL's explain intended).
spark.readStream.text("files") creates a so-called streaming Dataset.
scala> val files = spark.readStream.text("files")
files: org.apache.spark.sql.DataFrame = [value: string]
scala> files.isStreaming
res2: Boolean = true
Streaming Datasets are the foundation of Spark SQL's Structured Streaming.
As you may have read in Structured Streaming's Quick Example:
And then start the streaming computation using start().
Quoting the scaladoc of DataStreamWriter's start:
start(): StreamingQuery Starts the execution of the streaming query, which will continually output results to the given path as new data arrives.
So, you have to use start (or foreach) to start the execution of the streaming query. You knew it already.
But...there are Unsupported Operations in Structured Streaming:
In addition, there are some Dataset methods that will not work on streaming Datasets. They are actions that will immediately run queries and return results, which does not make sense on a streaming Dataset.
If you try any of these operations, you will see an AnalysisException like "operation XYZ is not supported with streaming DataFrames/Datasets".
That looks familiar, doesn't it?
cache is not in the list of the unsupported operations, but that's because it has simply been overlooked (I reported SPARK-20927 to fix it).
cache should have been in the list as it does execute a query before the query gets registered in Spark SQL's CacheManager.
Let's go deeper into the depths of Spark SQL...hold your breath...
cache simply calls persist, while persist requests the current CacheManager to cache the query:
sparkSession.sharedState.cacheManager.cacheQuery(this)
While caching a query, CacheManager does execute it:
sparkSession.sessionState.executePlan(planToCache).executedPlan
which, as we already know, is not allowed for a streaming Dataset, since only start (or foreach) may trigger execution.
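As a minimal sketch of the working variant, here is the pipeline from the question with the cache call dropped; spark and schema are the session and schema from the question, and the console sink is an assumption used only to keep the sketch self-contained:
// Same streaming pipeline as in the question, minus .cache.
// The console sink is only an illustration; any supported sink works.
spark.readStream
  .schema(schema)
  .json("src/test/data")
  .writeStream
  .format("console")
  .start()
  .awaitTermination()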
Problem solved!
I am trying to use Apache Spark to create an index in Elasticsearch (writing huge amounts of data to ES). I have written a Scala program to create the index using Apache Spark. I have to index huge data, which I get as product beans in a LinkedList. I then tried to traverse the product bean list and create the index. My code is given below.
val conf = new SparkConf().setAppName("ESIndex").setMaster("local[*]")
conf.set("es.index.auto.create", "true")
  .set("es.nodes", "127.0.0.1")
  .set("es.port", "9200")
  .set("es.http.timeout", "5m")
  .set("es.scroll.size", "100")
val sc = new SparkContext(conf)

// Returns my product beans in a LinkedList.
val list: util.LinkedList[product] = getData()

for (item <- list) {
  sc.makeRDD(Seq(item)).saveToEs("my_core/json")
}
The issue with this approach is that it takes too much time to create the index.
Is there a better way to create the index?
Don't pass data through the driver unless it is necessary. Depending on the source of the data returned from getData, you should use the relevant input method or create your own. If the data comes from MongoDB, use for example mongo-hadoop, Spark-MongoDB, or Drill with a JDBC connection. Then use map or a similar method to build the required objects and call saveToEs on the transformed RDD.
Creating an RDD with a single element doesn't make sense and doesn't benefit from the Spark architecture at all: you just start a potentially huge number of tasks that have nothing to do, with only a single active executor. At the very least, batch everything into a single RDD, as sketched below.
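If the data really has to start from that driver-side list, a rough sketch of that minimal batching, still not ideal for the reasons above and assuming product is serializable and getData() is the helper from the question:
import scala.collection.JavaConverters._
import org.elasticsearch.spark._

// Build one RDD from the whole list and issue a single saveToEs call,
// instead of one Spark job per element.
val products = getData().asScala.toSeq
sc.makeRDD(products).saveToEs("my_core/json")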