I am trying to use Apache Spark to create an index in Elasticsearch (writing a large amount of data to ES). I have written a Scala program that creates the index using Apache Spark. The data to index is large, and I receive it as product beans in a LinkedList. I then traverse the product bean list and create the index. My code is given below.
val conf = new SparkConf().setAppName("ESIndex").setMaster("local[*]")
conf.set("es.index.auto.create", "true").set("es.nodes", "127.0.0.1")
  .set("es.port", "9200")
  .set("es.http.timeout", "5m")
  .set("es.scroll.size", "100")
val sc = new SparkContext(conf)
// Returns my product beans in a LinkedList.
val list: util.LinkedList[product] = getData()
for (item <- list) {
  sc.makeRDD(Seq(item)).saveToEs("my_core/json")
}
The issue with this approach is that it takes too much time to create the index.
Is there a better way to create the index?
Don't pass data through the driver unless it is necessary. Depending on the source of the data returned from getData, you should use the relevant input method or create your own. If the data comes from MongoDB, for example, use mongo-hadoop, Spark-MongoDB or Drill with a JDBC connection. Then use map or a similar method to build the required objects and call saveToEs on the transformed RDD.
Creating an RDD with a single element doesn't make sense. It doesn't benefit from the Spark architecture at all. You just start a potentially huge number of tasks which have nothing to do, with only a single active executor.
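As a minimal sketch of the point above, the whole collection can be turned into one RDD and written with a single saveToEs call, so the connector can send documents in bulk per partition. Product and loadProducts are hypothetical stand-ins for the question's bean and getData(); the sketch assumes the elasticsearch-spark connector is on the classpath and that Elasticsearch is reachable at 127.0.0.1:9200 as in the question.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._ // adds saveToEs to RDDs (elasticsearch-spark connector)

object BulkIndex {
  // Hypothetical stand-in for the product bean from the question.
  case class Product(id: Long, name: String)

  // Hypothetical stand-in for getData(). A real job should preferably read
  // from the source system in parallel instead of building the data on the driver.
  def loadProducts(): Seq[Product] =
    (1L to 1000L).map(i => Product(i, s"product-$i"))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("ESIndex")
      .setMaster("local[*]")
      .set("es.index.auto.create", "true")
      .set("es.nodes", "127.0.0.1")
      .set("es.port", "9200")
    val sc = new SparkContext(conf)
    try {
      // One RDD for the whole data set and one save call,
      // instead of one single-element RDD (and one job) per item.
      sc.makeRDD(loadProducts()).saveToEs("my_core/json")
    } finally {
      sc.stop()
    }
  }
}
```

This still passes the data through the driver, so it only removes the per-item job overhead; reading from the source with a proper input method remains the better option.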
Here, we developed multiple services, each of which uses Akka actors, and communication between the services goes through Akka gRPC. One service fills an in-memory database, and another service, called Reader, applies a query, shapes the data, and then transfers it to the Elasticsearch service for insertion/update. The volume of data in each reading phase is about 1M rows.
The problem arises when Reader transfers a large amount of data: Elasticsearch cannot process it all and insert/update every row.
I used Akka Streams for the communication between these two services. I also use the ScalikeJDBC library and code like the following to read and insert the data in batches instead of all at once.
def applyQuery(query: String, mergeResult: Map[String, Any] => Unit) = {
  val publisher = DB readOnlyStream {
    SQL(s"${query}").map(_.toMap()).list().fetchSize(100000)
      .iterator()
  }
  Source.fromPublisher(publisher).runForeach(mergeResult)
}
////////////////////////////////////////////////////////
var batchRows: ListBuffer[Map[String, Any]] = new ListBuffer[Map[String, Any]]
val batchSize: Int = 100000
def mergeResult(row: Map[String, Any]): Unit = {
  batchRows :+= row
  if (batchRows.size == batchSize) {
    send2StorageServer(readyOutput(batchRows))
    batchRows.clear()
  }
}
def readyOutput(res: ListBuffer[Map[String, Any]]): ListBuffer[StorageServerRequest] = {
  // code to format res
}
Now, using the 'foreach' command makes the operation much slower. I tried different batch sizes, but it made no difference. Am I wrong to use foreach, or is there a better way to solve the speed problem using Akka Streams, Flow, etc.?
I found that the operator to use for appending to a ListBuffer is
batchRows += row
Using :+ does not produce a bug but is very inefficient, so with the correct operator foreach is no longer slow. The speed problem still exists, though: this time reading the data is fast, but writing to Elasticsearch is slow.
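To illustrate the difference: on a var holding a ListBuffer, batchRows :+= row expands to batchRows = batchRows :+ row, which builds a new buffer on every call, whereas += appends in place. Here is a small sketch of the batching logic with the in-place operator. Batcher and its flush parameter are hypothetical names, not part of the original code. It also adds a drain step, because the version in the question only sends a batch when it is exactly full, so a trailing partial batch would apparently never be sent.

```scala
import scala.collection.mutable.ListBuffer

// Collects rows into fixed-size batches and hands each full batch to `flush`.
// `Batcher` and `flush` are illustrative names, not part of the original code.
final class Batcher[A](batchSize: Int, flush: List[A] => Unit) {
  private val buffer = ListBuffer.empty[A]

  def add(row: A): Unit = {
    buffer += row // in-place append; no copy of the existing elements
    if (buffer.size >= batchSize) drain()
  }

  // Call once at the end of the input so a trailing partial batch is not lost.
  def drain(): Unit =
    if (buffer.nonEmpty) {
      flush(buffer.toList) // immutable list, unaffected by the clear() below
      buffer.clear()
    }
}
```

With Akka Streams, a call such as Source.fromPublisher(publisher).runForeach(batcher.add) followed by batcher.drain() on completion would be one way to wire this in, assuming the stream is consumed sequentially as in the question.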
After some searching, I came up with these solutions:
1. Using a queue as a buffer between the database and Elasticsearch may help.
2. Also, if blocking the read operation until the write is done is not too costly, that can be another solution.
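Both ideas can be combined in a bounded queue: the reader blocks as soon as the queue is full, so it can never run far ahead of the writer. The following is a rough sketch using only the JDK. BoundedBuffer, transfer and write are hypothetical names, and error handling is deliberately left out (if write throws, the producer thread may stay blocked). Akka Streams provides this kind of backpressure natively, so this is only meant to show the mechanism.

```scala
import java.util.concurrent.ArrayBlockingQueue

object BoundedBuffer {
  // Reads `rows` on a separate thread and writes them on the calling thread.
  // At most `capacity` rows are buffered between the two sides.
  def transfer[A](rows: Iterator[A], capacity: Int)(write: A => Unit): Unit = {
    val queue = new ArrayBlockingQueue[Option[A]](capacity)

    val producer = new Thread(() => {
      rows.foreach(r => queue.put(Some(r))) // put blocks while the queue is full
      queue.put(None)                       // end-of-input marker
    })
    producer.start()

    var next = queue.take() // take blocks while the queue is empty
    while (next.isDefined) {
      write(next.get)
      next = queue.take()
    }
    producer.join()
  }
}
```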
I am using Scala.
I tried to fetch all data from a table with about 4 million rows. I used a stream, and the code looks like this:
val stream: Stream[Record] = expression.stream().iterator().asScala.toStream
stream.map(println(_))
expression is a SelectFinalStep[Record] in jOOQ.
However, the first line is too slow. It takes minutes. Am I doing something wrong?
Use the Stream API directly
If you're using Scala 2.12, you don't have to transform the Java stream returned by expression.stream() to a Scala Iterator and then to a Scala Stream. Simply call:
expression.stream().forEach(println);
While jOOQ's ResultQuery.stream() method creates a lazy Java 8 Stream, which is discarded again after consumption, Scala's Stream keeps previously fetched records in memory for re-traversal. That's probably what's causing most performance issues, when fetching 4 million records.
A note on resources
Do note that expression.stream() returns a resourceful stream, keeping an open underlying ResultSet and PreparedStatement. Perhaps, it's a good idea to explicitly close the stream after consumption.
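One possible way to make sure the stream is always closed is a small helper that wraps consumption in try/finally. This is only a sketch: StreamResources and consume are hypothetical names, and the helper works with any java.util.stream.Stream, not only the one returned by jOOQ.

```scala
import java.util.stream.{Stream => JStream}

object StreamResources {
  // Applies `body` to every element and closes the stream afterwards,
  // even if `body` throws. Closing runs the stream's registered close handlers.
  def consume[A](stream: JStream[A])(body: A => Unit): Unit =
    try stream.forEach(a => body(a))
    finally stream.close()
}
```

With the query from the question, this would be called as StreamResources.consume(expression.stream())(println).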
Optimise JDBC fetch size
Also, you might want to look into calling expression.fetchSize(), which calls through to JDBC's Statement.setFetchSize(). This allows for the JDBC driver to fetch batches of N rows. Some JDBC drivers default to a reasonable fetch size, others default to fetching all rows into memory prior to passing them to the client.
Another solution would be to fetch the records lazily and construct a Scala Stream. For example:
def allRecords(): Stream[Record] = {
  val cur = expression.fetchLazy()
  def inner(): Stream[Record] = {
    if (cur.hasNext) {
      val next = cur.fetchOne
      next #:: inner()
    }
    else
      Stream.empty
  }
  inner()
}
I am writing code to cache RDBMS data using a Spark SQLContext JDBC connection. Once a DataFrame is created, I want to cache that result set using Apache Ignite so that other applications can make use of it. Here is the code snippet.
object test
{
  def main(args: Array[String])
  {
    val configuration = new Configuration()
    val config = "src/main/scala/config.xml"
    val sparkConf = new SparkConf().setAppName("test").setMaster("local[*]")
    val sc = new SparkContext(sparkConf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val sql_dump1 = sqlContext.read.format("jdbc")
      .option("url", "jdbc URL")
      .option("driver", "com.mysql.jdbc.Driver")
      .option("dbtable", mysql_table_statement)
      .option("user", "username")
      .option("password", "pass")
      .load()
    val ic = new IgniteContext[Integer, Integer](sc, config)
    val sharedrdd = ic.fromCache("hbase_metadata")
    // How to cache the sql_dump1 DataFrame?
  }
}
Now the question is how to cache a DataFrame. IgniteRDD has a savePairs method, but it accepts the key and value as RDD[Integer], whereas I have a DataFrame; even if I convert that to an RDD, I would only get an RDD[Row]. A savePairs method that takes an RDD of Integer seems very specific: what if I have an RDD of String as the value? Is it a good idea to cache the DataFrame, or is there a better approach to caching the result set?
There is no reason to store DataFrame in an Ignite cache (shared RDD) since you won't benefit from it too much: at least you won't be able to execute Ignite SQL over the DataFrame.
I would suggest doing the following:
1. Provide a CacheStore implementation for the hbase_metadata cache that will preload all the data from your underlying database. Then you can preload all the data into the cache using the Ignite.loadCache method. Here you may find an example on how to use JDBC persistent stores along with an Ignite cache (shared RDD).
2. Use the Ignite Shared RDD SQL API to query the cached data.
Alternatively, you can get sql_dump1 as you're doing, iterate over each row and store each row individually in the shared RDD using the IgniteRDD.savePairs method. Once this is done, you can query the data using the same Ignite Shared RDD SQL API mentioned above.
How can I access the state of all keys that has been built up over several micro-batches?
val stateSpec = StateSpec.function(stateUpdate _)
  .numPartitions(numPartitions)
  .timeout(Seconds(7200))
// ... multiple steps....
val sessionizedTuples = endTimedTuples.mapWithState(stateSpec)
// ..... multiple steps.....
I am successfully updating the state of keys per micro-batch and eventually end up with a lot of keys. How can I get all the keys and their state so I can apply some RDD function to them? All the methods I see work at the micro-batch level and not on the whole set built up over time.
Try
val state = sessionizedTuples.stateSnapshots()
stateSnapshots: returns a pair DStream where each RDD is a snapshot of the state of all the keys.
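For illustration, here is a rough sketch of how the snapshot stream might be obtained and typed. The key, value and state types, as well as merge, stateUpdate and allState, are assumptions made up for this example and do not come from the question. Note that mapWithState requires checkpointing to be enabled on the StreamingContext.

```scala
import org.apache.spark.streaming.{State, StateSpec}
import org.apache.spark.streaming.dstream.DStream

object SessionState {
  // Hypothetical pure merge step: adds the new value to the previous total.
  def merge(previous: Option[Long], value: Option[Int]): Long =
    previous.getOrElse(0L) + value.getOrElse(0)

  // Hypothetical update function keeping a running total per key.
  def stateUpdate(key: String, value: Option[Int], state: State[Long]): (String, Long) = {
    val total = merge(state.getOption, value)
    // The state must not be updated while the key is being timed out.
    if (!state.isTimingOut()) state.update(total)
    (key, total)
  }

  // Each RDD of the returned stream holds every key with its current state,
  // not only the keys touched in the latest micro-batch.
  def allState(events: DStream[(String, Int)]): DStream[(String, Long)] =
    events.mapWithState(StateSpec.function(stateUpdate _)).stateSnapshots()
}
```

A snapshot is still emitted once per batch interval, so RDD functions would typically be applied with something like allState(events).foreachRDD { rdd => ... }.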
I want to perform GeoIP lookups on my data in Spark. To do that I'm using MaxMind's GeoIP database.
What I want to do is to initialize a geoip database object once on each partition, and later use that to lookup the city related to an IP address.
Does spark have an initialization phase for each node, or should I instead check whether an instance variable is undefined, and if so, initialize it before continuing? E.g. something like (this is python but I want a scala solution):
class IPLookup(object):
    database = None
    def getCity(self, ip):
        if not self.database:
            self.database = self.initialise(geoipPath)
        ...
Of course, doing this requires that Spark serialise the whole object, something which the docs caution against.
In Spark, per-partition operations can be done using:
def mapPartitions[U](f: (Iterator[T]) ⇒ Iterator[U], preservesPartitioning: Boolean = false)
This mapper will execute the function f once per partition over an iterator of elements. The idea is that the cost of setting up resources (like DB connections) will be offset with the usage of such resources over a number of elements in the iterator.
Example:
val logsRDD = ???
logsRDD.mapPartitions { iter =>
  val geoIp = new GeoIPLookupDB(...)
  // this is a local map over the iterator - do not confuse with rdd.map
  iter.map(elem => (geoIp.resolve(elem.ip), elem))
}
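The "check whether it is undefined, then initialise" idea from the question can also be expressed in Scala with a lazy val inside an object. An object is initialised at most once per JVM and is referenced statically rather than serialised with the closure, so each executor would load the database on first use. The sketch below uses only the standard library; GeoIpDb, GeoIpHolder, Lookup and the file path are hypothetical placeholders and not the MaxMind API.

```scala
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical stand-in for a real GeoIP reader; the lookup logic is fake.
final class GeoIpDb(path: String) {
  def city(ip: String): String = if (ip.startsWith("10.")) "private" else "unknown"
}

object GeoIpHolder {
  // Counter only exists to make the single initialisation observable.
  val loads = new AtomicInteger(0)

  // Initialised on first access, at most once per JVM.
  // The path is a placeholder, not a real location.
  lazy val db: GeoIpDb = {
    loads.incrementAndGet()
    new GeoIpDb("/path/to/geoip.mmdb")
  }
}

object Lookup {
  // Shaped so that it can be passed to mapPartitions: Iterator in, Iterator out.
  def resolve(ips: Iterator[String]): Iterator[(String, String)] =
    ips.map(ip => (GeoIpHolder.db.city(ip), ip))
}
```

On an RDD[String] of IP addresses this could be used as rdd.mapPartitions(Lookup.resolve). Compared with creating the database inside mapPartitions, the instance is then shared by all partitions processed in the same executor JVM.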
This seems like a good use of a broadcast variable. Have you looked at the documentation for that functionality and, if you have, does it fail to meet your requirements in some way?
As @bearrito mentioned, you can load your GeoDB and then broadcast it from your driver.
Another option to consider is to provide an external service that you can use to do the lookup. It could be an in-memory cache such as Redis/Memcached/Tachyon or a regular datastore.