ClickHouse Spark Connector - Scala Dependency - scala

I am using https://github.com/DmitryBe/clickhouse-spark-connector
After cloning the repo I build the jar with sbt assembly, and then I add my import statements:
import io.clickhouse.ext.ClickhouseConnectionFactory
import io.clickhouse.ext.spark.ClickhouseSparkExt._
This fails with the error: object clickhouse is not a member of package spark.jobserver.io
I can see that these paths exist, and the jar is added as a dependency the same way as all my other dependencies. I have cleaned and rebuilt, but it has made no difference. I am using Scala IDE (Eclipse).
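Going by the error message alone, one possible cause is that the importing code lives in package spark.jobserver, which has its own sub-package named io. Scala resolves the leading io relative to the enclosing package first, so the compiler looks for spark.jobserver.io.clickhouse. Prefixing the import with _root_ forces resolution from the top-level package. A minimal sketch of the idea, where every package and object below is an empty stand-in declared in one file (not the real connector or Spark Job Server classes) so that it compiles on its own:

```scala
// Stand-in for the connector's package (assumption: empty placeholder object).
package io.clickhouse.ext {
  object ClickhouseConnectionFactory
}

// Stand-in for Spark Job Server's own "io" sub-package, which shadows the root "io".
package spark.jobserver.io {
  object Placeholder
}

package spark.jobserver {
  // A plain `import io.clickhouse.ext...` here would be resolved against
  // spark.jobserver.io; _root_ makes the path absolute.
  import _root_.io.clickhouse.ext.ClickhouseConnectionFactory

  object Job {
    val factory = ClickhouseConnectionFactory
  }
}
```

If your code is not inside spark.jobserver (or another package that contains an io sub-package), this is not the cause and the jar itself should be checked instead.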

You can try using https://github.com/yandex/clickhouse-jdbc
Here is a snippet which you can use to write a DataFrame into ClickHouse using your own dialect. ClickhouseDialect is a class which extends JdbcDialect. You can create your own dialect and register it using JdbcDialects.registerDialect(clickhouse):
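import org.apache.spark.sql.{DataFrame, SaveMode}
import org.apache.spark.sql.jdbc.JdbcDialects

def write(data: DataFrame, jdbcUrl: String, tableName: String): Unit = {
  // Register the custom dialect so Spark uses it for this JDBC URL
  val clickhouse = new ClickhouseDialect()
  JdbcDialects.registerDialect(clickhouse)
  val props = new java.util.Properties()
  props.put("driver", "ru.yandex.clickhouse.ClickHouseDriver")
  props.put("connection", jdbcUrl)
  // Each partition is written over its own JDBC connection
  val repartitionedData = data.repartition(100)
  try {
    repartitionedData.write
      .mode(SaveMode.Append)
      .jdbc(jdbcUrl, tableName, props)
  } finally {
    // Unregister the dialect even if the write fails
    JdbcDialects.unregisterDialect(clickhouse)
  }
}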
You can check here to create your own dialect. You might have to override the canHandle, getJDBCType and getCatalystType methods of JdbcDialect for your use case.
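As a rough illustration of what such a dialect might look like, here is a minimal sketch. The class name matches the one used in the snippet above, but the URL prefix check and the particular type mappings are assumptions chosen for illustration; adjust them to the column types your tables actually use. It needs Spark SQL on the classpath.

```scala
import java.sql.Types

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcType}
import org.apache.spark.sql.types._

// Hypothetical dialect: the mappings below are illustrative, not exhaustive.
class ClickhouseDialect extends JdbcDialect {

  // Tell Spark which JDBC URLs this dialect applies to.
  override def canHandle(url: String): Boolean =
    url.startsWith("jdbc:clickhouse")

  // Map Spark SQL types to ClickHouse column type names for writes.
  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType  => Some(JdbcType("String", Types.VARCHAR))
    case IntegerType => Some(JdbcType("Int32", Types.INTEGER))
    case LongType    => Some(JdbcType("Int64", Types.BIGINT))
    case DoubleType  => Some(JdbcType("Float64", Types.DOUBLE))
    case _           => None // fall back to Spark's default mapping
  }
}
```

getCatalystType is left at its default here, which only matters when reading from ClickHouse; override it in the same way if the default read-side mapping does not fit.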

Related

MiniDFS cluster setup for multiple test classes throws java.net.BindException: Address already in use

I am writing unit tests for Spark code that reads and writes data from/to both HDFS files and Spark's catalog. For this I created a separate trait that initialises a MiniDFS cluster, and I use the generated HDFS URI as the value of spark.sql.warehouse.dir when creating the SparkSession object. Here is the code for it:
trait TestSparkSession extends BeforeAndAfterAll {
  self: Suite =>
  var hdfsCluster: MiniDFSCluster = _
  def nameNodeURI: String = s"hdfs://localhost:${hdfsCluster.getNameNodePort}/"
  def withLocalSparkSession(tests: SparkSession => Any): Any = {
    val baseDir = new File(PathUtils.getTestDir(getClass), "miniHDFS")
    val conf = new HdfsConfiguration()
    conf.set(MiniDFSCluster.HDFS_MINIDFS_BASEDIR, baseDir.getAbsolutePath)
    val builder = new MiniDFSCluster.Builder(conf)
    hdfsCluster = builder.nameNodePort(9000)
      .manageNameDfsDirs(true)
      .manageDataDfsDirs(true)
      .format(true)
      .build()
    hdfsCluster.waitClusterUp()
    val testSpark = SparkSession
      .builder()
      .master("local")
      .appName("Test App")
      .config("spark.sql.warehouse.dir", s"${nameNodeURI}spark-warehouse/")
      .getOrCreate()
    tests(testSpark)
  }
  def stopHdfs(): Unit = hdfsCluster.shutdown(true, true)
  override def afterAll(): Unit = stopHdfs()
}
While writing my tests, I inherit this trait and then write test cases like -
class SampleSpec extends FunSuite with TestSparkSession {
  withLocalSparkSession {
    testSpark =>
      import testSpark.implicits._
      // Test 1 Here
      // Test 2 Here
  }
}
Everything works fine when I run my test classes one at a time. But when I run them all at once, it throws java.net.BindException: Address already in use.
This should mean that the previously created hdfsCluster is not yet down when the next set of tests is executed, so the next cluster cannot bind to the same port. But I do stop the hdfsCluster in afterAll().
My question is: can I share a single instance of the HDFS cluster and Spark session instead of initialising them every time? I have tried moving the initialisation outside of the method, but it still throws the same exception. Even if I can't share it, how can I properly stop my cluster and re-initialise it for the next test class?
Also, please let me know if my approach to writing 'unit' tests that use SparkSession and HDFS storage is correct.
Any help will be greatly appreciated.
I resolved it by creating the HDFS cluster in a companion object instead, so that a single instance of it is created for all the test suites.
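The answer does not show its code, so the following is only a sketch of the general pattern it describes: hold the cluster in a singleton object behind a lazy val, so it is started on first use and shared by every suite in the same JVM. FakeCluster, SharedCluster, UsesSharedCluster, SuiteA and SuiteB are all hypothetical names, and FakeCluster merely stands in for MiniDFSCluster so the example runs without Hadoop.

```scala
// Hypothetical stand-in for MiniDFSCluster so the sketch runs without Hadoop.
final class FakeCluster(val port: Int) {
  private var running = true
  def isRunning: Boolean = running
  def shutdown(): Unit = { running = false }
}

// One cluster per JVM: the lazy val body runs on first access only,
// and every later access returns the same instance.
object SharedCluster {
  var startCount = 0

  lazy val cluster: FakeCluster = {
    startCount += 1
    val c = new FakeCluster(port = 9000) // port value is arbitrary in this stand-in
    sys.addShutdownHook(c.shutdown())    // stop once, when the JVM exits
    c
  }
}

// Suites mix this in instead of building their own cluster.
trait UsesSharedCluster {
  def cluster: FakeCluster = SharedCluster.cluster
}

class SuiteA extends UsesSharedCluster
class SuiteB extends UsesSharedCluster
```

With this shape no suite calls shutdown in afterAll, so a later suite never tries to bind a port that an earlier one is still releasing. It only helps when the suites run in a single JVM; forked test JVMs would each start their own cluster.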

How to convert RDD[GenericRecord] to dataframe in scala?

I get tweets from a Kafka topic with Avro (serializer and deserializer).
Then I create a Spark consumer which extracts the tweets into a DStream of RDD[GenericRecord].
Now I want to convert each RDD to a DataFrame to analyse these tweets via SQL.
Is there any way to convert an RDD[GenericRecord] to a DataFrame?
I spent some time trying to make this work (especially how to deserialize the data properly, but it looks like you have already covered this). UPDATED
import scala.collection.JavaConverters._
import com.databricks.spark.avro.SchemaConverters
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.types.StructType

// Define a function to convert from GenericRecord to Row
def genericRecordToRow(record: GenericRecord, sqlType: SchemaConverters.SchemaType): Row = {
  val fields = record.getSchema.getFields.asScala
  val objectArray = new Array[Any](fields.size)
  // Copy each Avro field value into the slot at the same position
  for (field <- fields) {
    objectArray(field.pos) = record.get(field.pos)
  }
  new GenericRowWithSchema(objectArray, sqlType.dataType.asInstanceOf[StructType])
}

// Inside your stream's foreachRDD
val yourGenericRecordRDD = ...
val schema = new Schema.Parser().parse(...) // your schema
val sqlType = SchemaConverters.toSqlType(schema)
val rowRDD = yourGenericRecordRDD.map(record => genericRecordToRow(record, sqlType))
val df = sqlContext.createDataFrame(rowRDD, sqlType.dataType.asInstanceOf[StructType])
As you can see, I am using SchemaConverters to get the DataFrame structure from the schema that you used to deserialize (this could be more painful with a schema registry). For this you need the following dependency:
<dependency>
  <groupId>com.databricks</groupId>
  <artifactId>spark-avro_2.11</artifactId>
  <version>3.2.0</version>
</dependency>
You will need to change the version to match your Spark version.
UPDATE: the code above only works for flat Avro schemas.
For nested structures I used something different. You can copy the class SchemaConverters (the copy has to live inside com.databricks.spark.avro, because it uses some protected classes from the Databricks package), or you can try the spark-bigquery dependency. The converter factory method is not accessible by default, so you will need a small wrapper inside the package com.databricks.spark.avro to reach it.
package com.databricks.spark.avro

import com.databricks.spark.avro.SchemaConverters.createConverterToSQL
import org.apache.avro.Schema
import org.apache.spark.sql.types.StructType

// Declared as an object so that it can be called as SchemaConverterUtils.converterSql(...)
object SchemaConverterUtils {
  def converterSql(schema: Schema, sqlType: StructType) = {
    createConverterToSQL(schema, sqlType)
  }
}
After that you should be able to convert the data like this:
val schema = .. // your schema
val sqlType = SchemaConverters.toSqlType(schema).dataType.asInstanceOf[StructType]
....
// Inside foreachRDD
val genericRecordRDD = deserializeAvroData(rdd)
val converter = SchemaConverterUtils.converterSql(schema, sqlType)
...
// Drop records that fail to convert (requires import scala.util.Try)
val rowRdd = genericRecordRDD.flatMap(record => {
  Try(converter(record).asInstanceOf[Row]).toOption
})
// To DataFrame
val df = sqlContext.createDataFrame(rowRdd, sqlType)
A combination of https://stackoverflow.com/a/48828303/5957143 and https://stackoverflow.com/a/47267060/5957143 works for me.
I used the following to create MySchemaConversions
package com.databricks.spark.avro

import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.DataType

object MySchemaConversions {
  def createConverterToSQL(avroSchema: Schema, sparkSchema: DataType): (GenericRecord) => Row =
    SchemaConverters.createConverterToSQL(avroSchema, sparkSchema).asInstanceOf[(GenericRecord) => Row]
}
And then I used
val myAvroType = SchemaConverters.toSqlType(schema).dataType
val myAvroRecordConverter = MySchemaConversions.createConverterToSQL(schema, myAvroType)
// unionedResultRdd is a UnionRDD[GenericRecord]
val rowRDD = unionedResultRdd.map(record => MyObject.myConverter(record, myAvroRecordConverter))
val df = sparkSession.createDataFrame(rowRDD, myAvroType.asInstanceOf[StructType])
The advantage of having myConverter in the object MyObject is that you will not encounter serialization issues (java.io.NotSerializableException).
object MyObject {
  def myConverter(record: GenericRecord,
                  myAvroRecordConverter: (GenericRecord) => Row): Row =
    myAvroRecordConverter.apply(record)
}
Something like this may help you:
val stream = ...
val dfStream = stream.transform((rdd: RDD[GenericRecord]) => {
  // col1, col2, .... are your column names
  val df = rdd.map(_.toSeq)
    .map(seq => Row.fromSeq(seq))
    .toDF(col1, col2, ....)
  df
})
I'd like to suggest an alternative approach. With Spark 2.x you can skip the whole process of creating DStreams and instead do something like this with Structured Streaming:
val df = ss.readStream
  .format("com.databricks.spark.avro")
  .load("/path/to/files")
This gives you a single DataFrame which you can query directly. Here, ss is the SparkSession instance and /path/to/files is the location where all your Avro files from Kafka are being dumped.
PS: You may need to add the spark-avro dependency:
libraryDependencies += "com.databricks" %% "spark-avro" % "4.0.0"
Hope this helped. Cheers
You can use createDataFrame(rowRDD: RDD[Row], schema: StructType), which is available in the SQLContext object. Example for converting an RDD of an old DataFrame:
import sqlContext.implicits._
val rdd = oldDF.rdd
val newDF = oldDF.sqlContext.createDataFrame(rdd, oldDF.schema)
Note that there is no need to explicitly set any schema column. We reuse the old DF's schema, which is of StructType class and can be easily extended. However, this approach sometimes is not possible, and in some cases can be less efficient than the first one.

JDBC-HiveServer:'client_protocol is unset!'-Both 1.1.1 in CS

Before asking this question I read many articles found through Google. Most answers say the cause is a version mismatch between the client side and the server side. So I decided to copy the jars from the server side to the client side directly, and the result is the same exception:
org.apache.thrift.TApplicationException: Required field 'client_protocol' is unset! Struct:TOpenSessionReq(client_protocol:null, configuration:{use:database=default})
Connecting to HiveServer2 through beeline works fine :)
So I thought it would work over JDBC too. Unfortunately it throws the exception above. These are the jars in my project:
hive-jdbc-1.1.1.jar
hive-jdbc-standalone.jar
hive-metastore-1.1.1.jar
hive-service-1.1.1.jar
These Hive jars were copied from the server side.
def connect_hive(master: String): Unit = {
  val conf = new SparkConf()
    .setMaster(master)
    .setAppName("Hive")
    .set("spark.local.dir", "./tmp")
  val sc = new SparkContext(conf)
  val sqlContext = new SQLContext(sc)
  val url = "jdbc:hive2://192.168.40.138:10000"
  val prop = new Properties()
  prop.setProperty("user", "hive")
  prop.setProperty("password", "hive")
  prop.setProperty("driver", "org.apache.hive.jdbc.HiveDriver")
  val conn = DriverManager.getConnection(url, prop)
  sc.stop()
}
The configuration of my server:
hadoop 2.7.3
spark 1.6.0
hive 1.1.1
Has anyone encountered the same situation when connecting to Hive through JDBC from Spark?
Since beeline works, your program should be able to connect as well. Print the classpath of your current project to see which jars are actually being loaded. You can try something like this:
import java.net.URLClassLoader

object App {
  def main(args: Array[String]): Unit = {
    // On Java 8 the system class loader is a URLClassLoader; on Java 9+ this cast fails
    val cl = ClassLoader.getSystemClassLoader
    val urls = cl.asInstanceOf[URLClassLoader].getURLs
    for (url <- urls) {
      println(url.getFile)
    }
  }
}
Also check hive.aux.jars.path=<file urls> to understand what jars are present in the classpath.
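If the URLClassLoader cast is not available on your JVM, a variation on the same idea is to read the java.class.path system property instead. This is only a sketch: ClasspathDump and hiveJars are hypothetical names, and the property lists the classpath the JVM was launched with, so jars attached later at runtime will not appear.

```scala
import java.io.File

object ClasspathDump {
  // Split the JVM's launch classpath on the platform path separator.
  def entries: Seq[String] =
    sys.props.getOrElse("java.class.path", "")
      .split(File.pathSeparator)
      .filter(_.nonEmpty)
      .toSeq

  // Keep only entries whose file name mentions "hive".
  def hiveJars: Seq[String] =
    entries.filter(e => new File(e).getName.toLowerCase.contains("hive"))

  def main(args: Array[String]): Unit = {
    entries.foreach(println)
    println(s"Hive-related entries: ${hiveJars.size}")
  }
}
```

Seeing two different hive-jdbc or hive-service versions in the output would be a hint that an older client jar is shadowing the 1.1.1 jars copied from the server.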

Issue with VectorUDT when using Spark ML

I am writing a UDAF to be applied to a Spark DataFrame column of type Vector (spark.ml.linalg.Vector). I rely on the spark.ml.linalg package so that I do not have to go back and forth between DataFrame and RDD.
Inside the UDAF, I have to specify a data type for the input, buffer, and output schemas:
def inputSchema = new StructType().add("features", new VectorUDT())
def bufferSchema: StructType =
  StructType(StructField("list_of_similarities", ArrayType(new VectorUDT(), true), true) :: Nil)
override def dataType: DataType = ArrayType(DoubleType,true)
VectorUDT is what I would use with spark.mllib.linalg.Vector:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala
However, when I try to import it from spark.ml instead: import org.apache.spark.ml.linalg.VectorUDT
I get a runtime error (no errors during the build):
class VectorUDT in package linalg cannot be accessed in package org.apache.spark.ml.linalg
Is this expected? Can you suggest a workaround?
I am using Spark 2.0.0
In Spark 2.0.0, the proper way to go is to use org.apache.spark.ml.linalg.SQLDataTypes.VectorType instead of VectorUDT. It was introduced in this issue.
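Applied to the schemas from the question, that would look roughly like the sketch below. UdafSchemas is a hypothetical holder object used only to keep the example self-contained; in a real UDAF these would be the inputSchema, bufferSchema and dataType members. It needs spark-mllib on the classpath.

```scala
import org.apache.spark.ml.linalg.SQLDataTypes.VectorType
import org.apache.spark.sql.types._

// Hypothetical holder object; VectorType is the public handle for the ML vector UDT.
object UdafSchemas {
  val inputSchema: StructType =
    new StructType().add("features", VectorType)

  val bufferSchema: StructType =
    StructType(
      StructField("list_of_similarities", ArrayType(VectorType, containsNull = true), nullable = true) :: Nil
    )

  val dataType: DataType = ArrayType(DoubleType, containsNull = true)
}
```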

(Play 2.4.2, Play Slick 1.0.0) How do I apply database Evolutions to a Slick managed database within a test?

I would like to write database integration tests against a Play Slick managed database, applying and unapplying Evolutions with the helper methods described in the Play documentation, namely Evolutions.applyEvolutions(database) and Evolutions.cleanupEvolutions(database). However, these require a play.api.db.Database instance, which as far as I can see is not possible to get hold of. The jdbc library conflicts with play-slick, so how do I get the database instance from Slick? I use the following to get a Slick database definition for running Slick queries:
val dbConfig = DatabaseConfigProvider.get[JdbcProfile]("my-test-db")(FakeApplication())
import dbConfig.driver.api._
val db = dbConfig.db
Thanks,
Leanne
Here is how I do it, injecting with Guice:
lazy val appBuilder = new GuiceApplicationBuilder()
lazy val injector = appBuilder.injector()
lazy val databaseApi = injector.instanceOf[DBApi] //here is the important line
(You have to import play.api.db.DBApi.)
And in my tests, I simply do the following (I actually use another database for my tests):
override def beforeAll() = {
  Evolutions.applyEvolutions(databaseApi.database("default"))
}
override def afterAll() = {
  Evolutions.cleanupEvolutions(databaseApi.database("default"))
}
(I'm using ScalaTest, but it is the same with any other testing framework.)