I followed This Unable to send data to MongoDB using Kafka-Spark Structured Streaming to send data from spark structured streaming to mongoDB and I implemented it successfully but there is one issue .
Like when function
override def process(record: Row): Unit = {
val doc: Document = Document(record.prettyJson.trim)
// lazy opening of MongoDB connection
ensureMongoDBConnection()
val result = collection.insertOne(doc)
if (messageCountAccum != null)
messageCountAccum.add(1)
}
code is executing without any problem but no data is being send to MongoDB
But if i add a print statement like this
override def process(record: Row): Unit = {
val doc: Document = Document(record.prettyJson.trim)
// lazy opening of MongoDB connection
ensureMongoDBConnection()
val result = collection.insertOne(doc)
result.foreach(println) //print statement
if (messageCountAccum != null)
messageCountAccum.add(1)
}
Data is getting inserted in MongoDB
I don't know why????
foreach initializes the writer sink. Without the foreach your dataframe is never calculated.
Try this :
val df = // your df here
df.map(r => process(r))
df.count()
Related
I'm reading data coming from a Kafka (100.000 line per second) using Structured Spark Streaming, and i'm trying to insert all the data in HBase.
I'm in Cloudera Hadoop 2.6 and I'm using Spark 2.3
I tried something like I've seen here.
eventhubs.writeStream
.foreach(new MyHBaseWriter[Row])
.option("checkpointLocation", checkpointDir)
.start()
.awaitTermination()
MyHBaseWriter looks like this :
class AtomeHBaseWriter[RECORD] extends HBaseForeachWriter[Row] {
override def toPut(record: Row): Put = {
override val tableName: String = "hbase-table-name"
override def toPut(record: Row): Put = {
// Get Json
val data = JSON.parseFull(record.getString(0)).asInstanceOf[Some[Map[String, Object]]]
val key = data.getOrElse(Map())("key")+ ""
val val = data.getOrElse(Map())("val")+ ""
val p = new Put(Bytes.toBytes(key))
//Add columns ...
p.addColumn(Bytes.toBytes(columnFamaliyName),Bytes.toBytes(columnName), Bytes.toBytes(val))
p
}
}
And the HBaseForeachWriter class looks like this :
trait HBaseForeachWriter[RECORD] extends ForeachWriter[RECORD] {
val tableName: String
def pool: Option[ExecutorService] = None
def user: Option[User] = None
private var hTable: Table = _
private var connection: Connection = _
override def open(partitionId: Long, version: Long): Boolean = {
connection = createConnection()
hTable = getHTable(connection)
true
}
def createConnection(): Connection = {
// I create HBase Connection Here
}
def getHTable(connection: Connection): Table = {
connection.getTable(TableName.valueOf(Variables.getTableName()))
}
override def process(record: RECORD): Unit = {
val put = toPut(record)
hTable.put(put)
}
override def close(errorOrNull: Throwable): Unit = {
hTable.close()
connection.close()
}
def toPut(record: RECORD): Put
}
So here I'm doing a put line by line, even if I allow 20 executors and 4 cores for each, I don't have the data inserted immediatly in HBase. So what I need to do is a bulk load ut I'm struggled because all what I find in the internet is to realize it with RDDs and Map/Reduce.
What I understand is slow rate of record ingestion in to hbase. I have few suggestions to you.
1) hbase.client.write.buffer .
the below property may help you.
hbase.client.write.buffer
Description Default size of the BufferedMutator write buffer in bytes. A bigger buffer takes more memory — on both the client and
server side since server instantiates the passed write buffer to
process it — but a larger buffer size reduces the number of RPCs made.
For an estimate of server-side memory-used, evaluate
hbase.client.write.buffer * hbase.regionserver.handler.count
Default 2097152 (around 2 mb )
I prefer foreachBatch see spark docs (its kind of foreachPartition in spark core) rather foreach
Also in your hbase writer extends ForeachWriter
open method intialize array list of put
in process add the put to the arraylist of puts
in close table.put(listofputs); and then reset the arraylist once you updated the table...
what it does basically your buffer size mentioned above is filled with 2 mb then it will flush in to hbase table. till then records wont go to hbase table.
you can increase that to 10mb and so....
In this way number of RPCs will be reduced. and huge chunk of data will be flushed and will be in hbase table.
when write buffer is filled up and a flushCommits in to hbase table is triggered.
Example code : in my answer
2) switch off WAL you can switch off WAL(write ahead log - Danger is no recovery) but it will speed up writes... if dont want to recover the data.
Note : if you are using solr or cloudera search on hbase tables you
should not turn it off since Solr will work on WAL. if you switch it
off then, Solr indexing wont work.. this is one common mistake many of
us does.
How to swtich off : https://hbase.apache.org/1.1/apidocs/org/apache/hadoop/hbase/client/Put.html#setWriteToWAL(boolean)
Basic architechture and link for further study :
http://hbase.apache.org/book.html#perf.writing
as I mentioned list of puts is good way... this is the old way (foreachPartition with list of puts) of doing before structured streaming example is like below .. where foreachPartition operates for each partition not every row.
def writeHbase(mydataframe: DataFrame) = {
val columnFamilyName: String = "c"
mydataframe.foreachPartition(rows => {
val puts = new util.ArrayList[ Put ]
rows.foreach(row => {
val key = row.getAs[ String ]("rowKey")
val p = new Put(Bytes.toBytes(key))
val columnV = row.getAs[ Double ]("x")
val columnT = row.getAs[ Long ]("y")
p.addColumn(
Bytes.toBytes(columnFamilyName),
Bytes.toBytes("x"),
Bytes.toBytes(columnX)
)
p.addColumn(
Bytes.toBytes(columnFamilyName),
Bytes.toBytes("y"),
Bytes.toBytes(columnY)
)
puts.add(p)
})
HBaseUtil.putRows(hbaseZookeeperQuorum, hbaseTableName, puts)
})
}
To sumup :
What I feel is we need to understand the psycology of spark and hbase
to make then effective pair.
I've a HBase table which look like following in a static Dataframe as HBaseStaticRecorddf
---------------------------------------------------------------
|rowkey|Name|Number|message|lastTS|
|-------------------------------------------------------------|
|266915488007398|somename|8759620897|Hi|1539931239 |
|266915488007399|somename|8759620898|Welcome|1540314926 |
|266915488007400|somename|8759620899|Hello|1540315092 |
|266915488007401|somename|8759620900|Namaskar|1537148280 |
--------------------------------------------------------------
Now I've a file stream source from which I'll get streaming rowkey. Now this timestamp(lastTS) for streaming rowkey's has to be checked whether they're older than one day or not. For this I've the following code where joinedDF is a streaming DataFrame, which is formed by joining another streaming DataFrame and HBase static dataframe as follows.
val HBaseStreamDF = HBaseStaticRecorddf.join(anotherStreamDF,"rowkey")
val newdf = HBaseStreamDF.filter(HBaseStreamDF.col("lastTS").cast("Long") < ((System.currentTimeMillis - 86400*1000)/1000))//records older than one day are eligible to get updated
Once the filter is done I want to save this record to the HBase like below.
newDF.writeStream
.foreach(new ForeachWriter[Row] {
println("inside foreach")
val tableName: String = "dummy"
val hbaseConfResources: Seq[String] = Seq("hbase-site.xml")
private var hTable: Table = _
private var connection: Connection = _
override def open(partitionId: Long, version: Long): Boolean = {
connection = createConnection()
hTable = getHTable(connection)
true
}
def createConnection(): Connection = {
val hbaseConfig = HBaseConfiguration.create()
hbaseConfResources.foreach(hbaseConfig.addResource)
ConnectionFactory.createConnection(hbaseConfig)
}
def getHTable(connection: Connection): Table = {
connection.getTable(TableName.valueOf(tableName))
}
override def process(record: Row): Unit = {
var put = saveToHBase(record)
hTable.put(put)
}
override def close(errorOrNull: Throwable): Unit = {
hTable.close()
connection.close()
}
def saveToHBase(record: Row): Put = {
val p = new Put(Bytes.toBytes(record.getString(0)))
println("Now updating HBase for " + record.getString(0))
p.add(Bytes.toBytes("messageInfo"),
Bytes.toBytes("ts"),
Bytes.toBytes((System.currentTimeMillis/1000).toString)) //saving as second
p
}
}
).outputMode(OutputMode.Update())
.start().awaitTermination()
Now when any record is coming HBase is getting updated for the first time only. If the same record comes afterwards, it's just getting neglected and not working. However if some unique record comes which has not been processed by the Spark application, then it works. So any duplicated record is not getting processed for the second time.
Now here's some interesting thing.
If I remove the 86400 sec subtraction from (System.currentTimeMillis - 86400*1000)/1000) then everything is getting processed even if there's redundancy among the incoming records. But it's not intended and useful as it doesn't filter 1 day older records.
If I do the comparison in the filter condition in milliseconds without dividing by 1000(this requires HBase data also in millisecond) and save the record as second in the put object then again everything is processed. But If I change the format to seconds in the put object then it doesn't work.
I tried testing individually the filter and HBase put and they both works fine. But together they mess up if System.currentTimeMillis in filter has some arithmetic operations such as /1000 or -864000. If I remove the HBase sink part and use
newDF.writeStream.format("console").start().awaitTermination()
then again the filter logic works. And If I remove the filter then HBase sink works fine. But together, the custom sink for the HBase only works for the first time for the unique records. I tried several other filter logic like below but issue remains the same.
val newDF = newDF1.filter(col("lastTS").lt(LocalDateTime.now().minusDays(1).toEpochSecond(ZoneOffset.of("+05:30"))))
or
val newDF = newDF1.filter(col("lastTS").cast("Long") < LocalDateTime.now().minusDays(1).toEpochSecond(ZoneOffset.of("+05:30")))
How do I make the filter work and save the filtered records to the HBase with updated timestamp? I took reference of several other posts. But the result is same.
I want to traverse the stream of data, run a query on it and return the results which should be written into ElasticSearch. I tried to use mapPartitions method for creation of the connection to the database, however, I get such an error, which indicates that partition returns None to the rdd (I guess, some action should be added after the transformations):
org.elasticsearch.hadoop.EsHadoopException: Could not write all entries for bulk operation [10/10]. Error sample (first [5] error messages)
What can be changed in the code to get the data into rdd and send it to ElasticSearch without any troubles?
Alos, I had a variant of the solution for this problem with flatMap in foreachRDD, however, I create a connection to the database on each rdd, which is not effective in terms of performance.
This is the code for streaming data processing:
wordsArrays.foreachRDD(rdd => {
rdd.mapPartitions { part => {
val neo4jConfig = neo4jConfigurations.getNeo4jConfig(args(1))
part.map(
data => {
val recommendations = execNeo4jSearchQuery(neo4jConfig, data)
val calendarTime = Calendar.getInstance.getTime
val recommendationsMap = convertDataToMap(recommendations, calendarTime)
recommendationsMap
})
}
}
}.saveToEs("rdd-timed/output")
)
The problem was that I tried to convert the iterator directly into the Array, although it holds multiple rows of my records. That is why ElasticSEarch was not able to map this collection of records to the defined single record schema.
Here is the code that works properly:
wordsArrays.foreachRDD(rdd => {
rdd.mapPartitions { partition => {
val neo4jConfig = neo4jConfigurations.getNeo4jConfig(args(1))
val result = partition.map( data => {
val recommendations = execNeo4jSearchQuery(neo4jConfig, data)
val calendarTime = Calendar.getInstance.getTime
convertDataToMap(recommendations, calendarTime)
}).toList.flatten
result.iterator
}
}.saveToEs("rdd-timed/output")
})
I am fetching data from Kafka and processing in Spark Streaming and writing Data into Cassandra
I am trying to Filter the DStream records but it doesn't filter the records and write the complete records in Cassandra,
Any suggestion with sample/example Code to filter multiple columns of records and any help will be highly appreciated i have done a research on this but not able to get any solution.
class SparkKafkaConsumer1(val recordStream : org.apache.spark.streaming.dstream.DStream[String], val streaming : StreamingContext) {
val internationalAddress = recordStream.map(line => line.split("\\|")(10).toUpperCase)
def timeToStr(epochMillis: Long): String =
DateTimeFormat.forPattern("YYYYMMddHHmmss").print(epochMillis)
if(internationalAddress =="INDIA")
{
print("-----------------------------------------------")
recordStream.print()
val riskScore = "1"
val timestamp: Long = System.currentTimeMillis
val formatedTimeStamp = timeToStr(timestamp)
var wc1 = recordStream.map(_.split("\\|")).map(r=>Row(r(0),r(1),r(2),r(3),r(4).toInt,r(5).toInt,r(6).toInt,r(7),r(8),r(9),r(10),r(11),r(12),r(13),r(14),r(15),r(16),riskScore.toInt,0,0,0,formatedTimeStamp))
implicit val rowWriter = SqlRowWriter.Factory
wc1.saveToCassandra("fraud", "fraudrating", SomeColumns("purchasetimestamp","sessionid","productdetails","emailid","productprice","itemcount","totalprice","itemtype","luxaryitem","shippingaddress","country","bank","typeofcard","creditordebitcardnumber","contactdetails","multipleitem","ipaddress","consumer1score","consumer2score","consumer3score","consumer4score","recordedtimestamp"))
}
(Note: I am have records with internationalAddress = INDIA in Kafka and I am very much new to Scala)
I'm not really sure what you're trying to do, but if you are simply trying to filter on records pertaining to India, you could do this:
implicit val rowWriter = SqlRowWriter.Factory
recordStream
.filter(_.split("\\|")(10).toUpperCase) == "INDIA")
.map(_.split("\\|"))
.map(r => Row(...))
.saveToCassandra(...)
As a side note, I think case classes would be really helpful for you.
I'm learning Akka Streams and as an exercise I would like to insert logs into Cassandra. The issue is that I could not manage to make the stream insert logs into the database.
I naively tried the following :
object Application extends AkkaApp with LogApacheDao {
// The log file is read line by line
val source: Source[String, Unit] = Source.fromIterator(() => scala.io.Source.fromFile(filename).getLines())
// Each line is converted to an ApacheLog object
val flow: Flow[String, ApacheLog, Unit] = Flow[String]
.map(rawLine => {
rawLine.split(",") // implicit conversion Array[String] -> ApacheLog
})
// Log objects are inserted to Cassandra
val sink: Sink[ApacheLog, Future[Unit]] = Sink.foreach[ApacheLog] { log => saveLog(log) }
source.via(flow).to(sink).run()
}
saveLog() is defined in LogApacheDao like this (I omitted the column values for a clearer code):
val session = cluster.connect()
session.execute(s"CREATE KEYSPACE IF NOT EXISTS $keyspace WITH replication = {'class':'SimpleStrategy', 'replication_factor':1};")
session.execute(s"DROP TABLE IF EXISTS $keyspace.$table;")
session.execute(s"CREATE TABLE $keyspace.$table (...)")
val preparedStatement = session.prepare(s"INSERT INTO $keyspace.$table (...) VALUES (...);")
def saveLog(logEntry: ApacheLog) = {
val stmt = preparedStatement.bind(...)
session.executeAsync(stmt)
}
The conversion from Array[String] to ApacheLog when entering in the sink happens without issue (verified with println). Also, the keyspace and table are both created, but when the execution comes to saveLog, it seems that something is blocking and no insertion is made.
I do not get any errors but Cassandra driver core (3.0.0) keeps giving me :
Connection[/172.17.0.2:9042-1, inFlight=0, closed=false] was inactive for 30 seconds, sending heartbeat
Connection[/172.17.0.2:9042-2, inFlight=0, closed=false] heartbeat query succeeded
I should add that I use a dockerized Cassandra.
Try using the Cassandra Connector in alpakka.