Connect scala-hbase to Dockerized HBase - Scala

I am trying the Play Framework template "play-hbase".
It's a template, so I expect it to work in most cases.
But in my case HBase is running under boot2docker on Windows 7 x64.
So I added some config details to the template:
object Application extends Controller {
  val barsTableName = "bars"
  val family = Bytes.toBytes("all")
  val qualifier = Bytes.toBytes("json")

  lazy val hbaseConfig = {
    val conf = HBaseConfiguration.create()
    // ADDED: point the client at the boot2docker VM
    conf.set("hbase.zookeeper.quorum", "192.168.59.103")
    conf.set("hbase.zookeeper.property.clientPort", "2181")
    conf.set("hbase.master", "192.168.59.103:60000")
    val hbaseAdmin = new HBaseAdmin(conf)
    // create a table in HBase if it doesn't exist
    if (!hbaseAdmin.tableExists(barsTableName)) {
      val desc = new HTableDescriptor(barsTableName)
      desc.addFamily(new HColumnDescriptor(family))
      hbaseAdmin.createTable(desc)
      Logger.info("bars table created")
    }
    // return the HBase config
    conf
  }
It compiles and runs, but returns a "bad request" error when the following code executes:
  def addBar() = Action(parse.json) { request =>
    // create a new row in the table that contains the JSON sent from the client
    val table = new HTable(hbaseConfig, barsTableName)
    val put = new Put(Bytes.toBytes(UUID.randomUUID().toString))
    put.add(family, qualifier, Bytes.toBytes(request.body.toString()))
    table.put(put)
    table.close()
    Ok
  }
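One way to see what is actually failing (not part of the original question, just a hedged debugging sketch that reuses the template's names) is to wrap the HBase call and log the underlying exception, instead of letting Play answer with a generic error:

  def addBar() = Action(parse.json) { request =>
    try {
      val table = new HTable(hbaseConfig, barsTableName)
      val put = new Put(Bytes.toBytes(UUID.randomUUID().toString))
      put.add(family, qualifier, Bytes.toBytes(request.body.toString()))
      table.put(put)
      table.close()
      Ok
    } catch {
      case e: Exception =>
        // surfaces connection/timeout problems (e.g. unreachable ZooKeeper or region server)
        Logger.error("HBase put failed", e)
        InternalServerError(e.getMessage)
    }
  }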

Related

Task not serializable - foreach function spark

I have a function getS3Object to get a JSON object stored in S3:
def getS3Object(s3ObjectName: String): Unit = {
  val bucketName = "xyz"
  val object_to_write = s3client.getObject(bucketName, s3ObjectName)
  val file = new File(filename)
  val fileWriter = new FileWriter(file)
  val bw = new BufferedWriter(fileWriter)
  bw.write(object_to_write)
  bw.close()
  fileWriter.close()
}
My dataframe (df) contains one column where each row is the S3ObjectName
S3ObjectName
a1.json
b2.json
c3.json
d4.json
e5.json
When I execute the logic below I get an error saying "task is not serializable":
Method 1: df.foreach(x => getS3Object(x.getString(0)))
I tried converting the df to an RDD but still get the same error:
Method 2: df.rdd.foreach(x => getS3Object(x.getString(0)))
However, it works with collect():
Method 3: df.collect.foreach(x => getS3Object(x.getString(0)))
I do not wish to use collect(), because it gathers all the elements of the dataframe on the driver and can result in an OutOfMemory error.
Is there a way to make the foreach() call in Method 1 work?
The problem with your s3Client can be solved as follows. But remember that these functions run on executor nodes (other machines), so the whole val file = new File(filename) part is probably not going to work there.
You should write the files to a distributed file system like HDFS or S3 instead.
import com.amazonaws.auth.{AWSStaticCredentialsProvider, BasicAWSCredentials}
import com.amazonaws.services.s3.AmazonS3ClientBuilder

object S3ClientWrapper extends Serializable {
  // the s3Client must be created here, so each executor builds its own instead of
  // trying to serialize one from the driver
  val s3Client = {
    val awsCreds = new BasicAWSCredentials("access_key_id", "secret_key_id")
    AmazonS3ClientBuilder.standard()
      .withCredentials(new AWSStaticCredentialsProvider(awsCreds))
      .build()
  }
}

def getS3Object(s3ObjectName: String): Unit = {
  val bucketName = "xyz"
  val object_to_write = S3ClientWrapper.s3Client.getObject(bucketName, s3ObjectName)
  // now you have to solve your file problem
}
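For the "file problem", one possibility (a hedged sketch, not part of the original answer; the target path /tmp/s3copies is made up) is to stream the object straight to HDFS via the Hadoop FileSystem API, so nothing depends on the executor's local disk:

import java.io.InputStream
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils

def copyS3ObjectToHdfs(s3ObjectName: String): Unit = {
  val bucketName = "xyz"
  // stream the object content instead of materializing it as a String
  val in: InputStream = S3ClientWrapper.s3Client.getObject(bucketName, s3ObjectName).getObjectContent
  val fs = FileSystem.get(new Configuration())
  val out = fs.create(new Path("/tmp/s3copies/" + s3ObjectName))
  try IOUtils.copyBytes(in, out, 4096, false)
  finally { in.close(); out.close() }
}

// With the client kept in the serializable wrapper, Method 2 then runs on the executors:
// df.rdd.foreach(row => copyS3ObjectToHdfs(row.getString(0)))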

Spark JobServer JobEnvironment

def main(args: Array[String]) {
  val conf = new SparkConf().setMaster("local[4]").setAppName("LongPiJob")
  val sc = new SparkContext(conf)
  val env = new JobEnvironment {
    def jobId: String = "abcdef"
    //scalastyle:off
    def namedObjects: NamedObjects = ???
    def contextConfig: Config = ConfigFactory.empty
  }
  val results = runJob(sc, env, 5)
  println("Result is " + results)
}
I took this code from the LongPiJob example for Spark JobServer, which uses the new API that is part of the GitHub repo. I don't understand what new JobEnvironment is, or any of the variables inside it, and my IDE complains about these default settings.
https://github.com/spark-jobserver/spark-jobserver/blob/spark-2.0-preview/job-server-tests/src/main/scala/spark/jobserver/LongPiJob.scala
JobEnvironment carries runtime information about the job, such as jobId, contextConfig and namedObjects.
It makes that information easy to access from runJob.
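As a rough sketch (the method signature follows the linked LongPiJob example; the package name and the sampling body below are only illustrative), a new-API job receives the environment as a parameter of runJob, while the anonymous new JobEnvironment { ... } in main() simply stubs it for local runs:

import org.apache.spark.SparkContext
import spark.jobserver.api.JobEnvironment // new-API package in the linked repo; adjust if your version differs

// Illustrative runJob: the job server supplies `runtime` when the job is submitted.
def runJob(sc: SparkContext, runtime: JobEnvironment, duration: Int): Double = {
  println(s"jobId = ${runtime.jobId}, contextConfig = ${runtime.contextConfig}")
  val samples = duration * 100000
  val inside = sc.parallelize(1 to samples).filter { _ =>
    val (x, y) = (math.random, math.random)
    x * x + y * y < 1.0
  }.count()
  4.0 * inside / samples // crude pi estimate, stands in for the real job logic
}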

How to add MongoClientOptions to MongoClient in the Casbah Mongo Scala driver

I want to add MongoClientOptions to MongoClient. Basically I want to raise the connectionsPerHost value from its default of 10 to 20, but I am getting errors in the code. I have tried two different ways.
val SERVER: ServerAddress = {
  val hostName = config.getString("db.hostname")
  val port = config.getString("db.port").toInt
  new ServerAddress(hostName, port)
}
val DATABASE: String = config.getString("db.dbname")
Method 1:
val options = MongoClientOptions.apply(connectionsPerHost = 20)
val connectionMongo = MongoConnection(SERVER).addOption(options.getConnectionsPerHost) // returns Unit instead of MongoClient
val collectionMongo = connectionMongo(DATABASE)("testdb")
I get an error on the last line: Unit does not take parameters.
Method 2:
val mongoOption = MongoClientOptions.builder()
  .connectionsPerHost(20)
  .build();
I get an error on the MongoClientOptions.builder() line:
value builder is not a member of object com.mongodb.casbah.MongoClientOptions
I want to set connectionsPerHost to 20. What is the right way to do this?
This seems to work:
val config = ConfigFactory.load();
val hostName = config.getString("db.hostname")
val port = config.getInt("db.port")
val server = new ServerAddress(hostName, port)
val database = config.getString("db.dbname")
val options = MongoClientOptions(connectionsPerHost = 20)
val connectionMongo = MongoClient(server, options)
val collectionMongo = connectionMongo(database)("testdb")
Note that MongoConnection is deprecated.
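If you need more than one setting, Casbah's MongoClientOptions factory takes them as named arguments as well. A small hedged sketch (the extra timeout values are made up, and the check through the underlying Java client assumes your driver version exposes getMongoClientOptions):

val options = MongoClientOptions(
  connectionsPerHost = 20,
  connectTimeout = 10000, // ms, illustrative value
  socketTimeout = 30000   // ms, illustrative value
)
val client = MongoClient(server, options)

// Optional sanity check through the wrapped Java driver client:
println(client.underlying.getMongoClientOptions.getConnectionsPerHost) // should print 20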

BasicDataSource to DataSource

I am trying to use Camel's sql: component and have googled myself into this:
import org.apache.commons.dbcp2.BasicDataSource
import org.apache.camel.impl.{SimpleRegistry, DefaultCamelContext}

object CamelApplication {
  val jdbcUrl = "jdbc:mysql://host:3306"
  val user = "test"
  val password = "secret"
  val driverClass = "com.mysql.jdbc.Driver"

  // code to create data source here
  val ds = new BasicDataSource
  ds.setUrl(jdbcUrl)
  ds.setUsername(user)
  ds.setPassword(password)
  ds.setDriverClassName(driverClass)

  val registry = new SimpleRegistry
  registry.put("dataSource", ds)

  def main(args: Array[String]) = {
    val context = new DefaultCamelContext(registry)
    context.setUseMDCLogging(true)
    context.addRoutes(new DlrToDb)
    context.start()
    Thread.currentThread.join()
  }
}
and my DlrToDb route is this:
import org.apache.camel.scala.dsl.builder.RouteBuilder

class DlrToDb extends RouteBuilder {
  """netty:tcp://localhost:12000?textline=true""" ==> {
    id("DlrToDb")
    log("sql insert coming up")
    to("sql:insert into camel_test (msgid, dlr_body) VALUES ('some_id','test')")
  }
}
In other words, when I telnet to localhost and press enter, I would like some data to be added to my database. However, it is a BasicDataSource and not a DataSource, so I get an error:
Failed to create route DlrToDb .....
.... due to: Property 'dataSource' is required
Do I need to change/convert the BasicDataSource, or do I need to do something to the registry to make it work?
You need to append the "dataSource" query option to the URI:
....
to("sql:insert into camel_test (msgid, dlr_body) VALUES ('some_id','test')?dataSource=dataSource")
....
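Put together, the route would look roughly like this (a hedged sketch: "dataSource" is the key that main() registered in the SimpleRegistry; on newer Camel versions the registry reference is usually written as #dataSource):

import org.apache.camel.scala.dsl.builder.RouteBuilder

class DlrToDb extends RouteBuilder {
  """netty:tcp://localhost:12000?textline=true""" ==> {
    id("DlrToDb")
    log("sql insert coming up")
    // the option tells the sql component which registry bean to use as javax.sql.DataSource
    to("sql:insert into camel_test (msgid, dlr_body) VALUES ('some_id','test')?dataSource=dataSource")
  }
}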

Spark Streaming using Scala to insert to Hbase Issue

I am trying to read records from a Kafka topic and put them into HBase. Although the Scala script runs without any issue, the inserts are not happening. Please help.
Input:
rowkey1,1
rowkey2,2
Here is the code which I am using:
object Blaher {
  def blah(row: Array[String]) {
    val hConf = new HBaseConfiguration()
    val hTable = new HTable(hConf, "test")
    val thePut = new Put(Bytes.toBytes(row(0)))
    thePut.add(Bytes.toBytes("cf"), Bytes.toBytes("a"), Bytes.toBytes(row(1)))
    hTable.put(thePut)
  }
}
object TheMain extends Serializable {
  def run() {
    val ssc = new StreamingContext(sc, Seconds(1))
    val topicmap = Map("test" -> 1)
    val lines = KafkaUtils.createStream(ssc, "127.0.0.1:2181", "test-consumer-group", topicmap).map(_._2)
    val words = lines.map(line => line.split(",")).map(line => (line(0), line(1)))
    val store = words.foreachRDD(rdd => rdd.foreach(Blaher.blah))
    ssc.start()
  }
}
TheMain.run()
From the API doc for HTable's flushCommits() method: "Executes all the buffered Put operations." You should call it at the end of your blah() method; right now the Puts are being buffered but never flushed, or flushed at some arbitrary time.
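A hedged sketch of that fix, reusing the question's Blaher object (a close() call is added as well, which releases resources on the classic HTable API):

object Blaher {
  def blah(row: Array[String]): Unit = {
    val hConf = new HBaseConfiguration()
    val hTable = new HTable(hConf, "test")
    val thePut = new Put(Bytes.toBytes(row(0)))
    thePut.add(Bytes.toBytes("cf"), Bytes.toBytes("a"), Bytes.toBytes(row(1)))
    hTable.put(thePut)
    hTable.flushCommits() // send the buffered Put to the region server now
    hTable.close()        // release the connection resources
  }
}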