I am running tests in scala using spark and cassandra, for the test I am using testcontainers and for some work reasons we are not using the scala variant of testcontainers, the problem with testcontainers the ports are randomly assigned and I don't know what parameter to use to get the port so I can conect
//code block
#RunWith(classOf[JUnitRunner])
class ConnectorSpec extends AnyFlatSpec
with BeforeAndAfterAll{
val container = new CassandraContainer("cassandra:latest")
container.withExposedPorts(9042)
container.waitingFor(Wait.forListeningPort())
container.start()
val ip = container.getContainerIpAddress() //output localhost
val port = container.getMappedPort(9042)
val cluster = container.getCluster()
val session = cluster.connect()
session.execute("CREATE KEYSPACE IF NOT EXISTS test WITH replication = {'class':'SimpleStrategy','replication_factor':'1'};")
session.execute("CREATE TABLE IF NOT EXISTS test.eureka (id int PRIMARY KEY, city varchar, role varchar);")
session.execute("INSERT INTO test.tester (id, city, role) VALUES (1, 'Jakarta', 'Devops')")
session.execute("SELECT * FROM test.tester")
assert(container.isRunning)
val spark = SparkSession
.builder()
.appName("ReadCassandra")
.master("local[*]")
.getOrCreate()
spark.setCassandraConf(CassandraConnectorConf.KeepAliveMillisParam.option(10000))
spark.setCassandraConf(cluster.getClusterName(), CassandraConnectorConf.ConnectionHostParam.option("127.0.0.1"))
val df = spark.read
.format("org.apache.spark.sql.cassandra")
.options(Map( "table" -> "tester", "keyspace" -> "test"))
.load()
df.show()
#sbt dependency
libraryDependencies += "org.testcontainers" % "cassandra" % "1.15.3" % Test
Output showing Actual port
Testing started at 5:30 PM ...
Connected to the target VM, address: '127.0.0.1:53611', transport: 'socket'
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/kenneth/.cache/coursier/v1/https/repo1.maven.org/maven2/org/slf4j/slf4j-log4j12/1.7.16/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/kenneth/.cache/coursier/v1/https/repo1.maven.org/maven2/org/slf4j/slf4j-nop/1.7.30/slf4j-nop-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
You already have the local port in your code:
val port = container.getMappedPort(9042)
TestContainers provide getMappedPort specifically for this purpose: it will give you the local port mapped to the port specified as parameter in the container (the local port mapped to 9042 in Cassandra container in your case).
Note: you must start the container before calling getMappedPort but that is already your case.
Related
I have tried to read/write Parquet files from/to Azurite using Spark like this:
import com.holdenkarau.spark.testing.DatasetSuiteBase
import org.apache.spark.SparkConf
import org.apache.spark.sql.SaveMode
import org.scalatest.WordSpec
class SimpleAzuriteSpec extends WordSpec with DatasetSuiteBase {
val AzuriteHost = "localhost"
val AzuritePort = 10000
val AzuriteAccountName = "devstoreaccount1"
val AzuriteAccountKey = "Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw=="
val AzuriteContainer = "container1"
val AzuriteDirectory = "dir1"
val AzuritePath = s"wasb://$AzuriteContainer#$AzuriteAccountName.blob.core.windows.net/$AzuriteDirectory/"
override final def conf: SparkConf = {
val cfg = super.conf
val settings =
Map(
s"spark.hadoop.fs.azure.storage.emulator.account.name" -> AzuriteAccountName,
s"spark.hadoop.fs.azure.account.key.${AzuriteAccountName}.blob.core.windows.net" -> AzuriteAccountKey
)
settings.foreach { case (k, v) =>
cfg.set(k, v)
}
cfg
}
"Spark" must {
"write to/read from Azurite" in {
import spark.implicits._
val xs = List(Rec(1, "Alice"), Rec(2, "Bob"))
val inputDs = spark.createDataset(xs)
inputDs.write
.format("parquet")
.mode(SaveMode.Overwrite)
.save(AzuritePath)
val ds = spark.read
.format("parquet")
.load(AzuritePath)
.as[Rec]
ds.show(truncate = false)
val actual = ds.collect().toList.sortBy(_.id)
assert(actual == xs)
}
}
}
case class Rec(id: Int, name: String)
I have tried both Azurite 3.9.0 and Azurite 2.7.0 (both in Docker). I can transfer files to/from Azurite using az (dockerized as well).
The test above runs on the Docker host. Azurite is reachable from the Docker host.
I am using Spark 2.4.5, Hadoop 2.10.0, and this dependency:
libraryDependencies += "org.apache.hadoop" % "hadoop-azure" % "2.10.0"
When using az, this connection string works:
AZURE_STORAGE_CONNECTION_STRING="DefaultEndpointsProtocol=http;AccountName=devstoreaccount1;AccountKey=Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==;BlobEndpoint=http://azurite-3.9.0:10000/devstoreaccount1;QueueEndpoint=http://azurite-3.9.0:10001/devstoreaccount1;"
yet I do not know how to configure this in Spark.
My question: How can I configure the host, the port, credentials etc. (in the path or in SparkConf)?
Yes, that's possible, but azurite should be accessible via 127.0.0.1:10000 for wasb (so if it runs on another machine then port forwarding will help) and then specify following spark args as example:
./pyspark --conf "spark.hadoop.fs.defaultFS=wasb://container#azurite" --conf "spark.hadoop.fs.azure.storage.emulator.account.name=azurite"
Then default file system will be backed up by your instance of azurite.
Don't use Azurite, just add these Jars to your Spark Dockerfile:
# Set JARS env
ENV JARS=${SPARK_HOME}/jars/azure-storage-${AZURE_STORAGE_VER}.jar,${SPARK_HOME}/jars/hadoop-azure-${HADOOP_AZURE_VER}.jar,${SPARK_HOME}/jars/jetty-util-ajax-${JETTY_VER}.jar,${SPARK_HOME}/jars/jetty-util-${JETTY_VER}.jar
RUN echo "spark.jars ${JARS}" >> $SPARK_HOME/conf/spark-defaults.conf
Set your configuration:
spark = SparkSession.builder.config(conf=sparkConf).getOrCreate()
spark.sparkContext._jsc.hadoopConfiguration().set(f"fs.azure.account.key.{ os.environ['AZURE_STORAGE_ACCOUNT'] }.blob.core.windows.net", os.environ['AZURE_STORAGE_KEY'])
Then you can read it:
val df = spark.read.parquet("wasbs://<container-name>#<storage-account-name>.blob.core.windows.net/<directory-name>")
I am writing unit test cases for spark code that reads/writes data from/to both hdfs files and spark's catalog. For this I created a separate trait that provides initialisation of minidfs cluster and I am using the generated hdfs uri in value for - spark.sql.warehouse.dir while creating the SparkSession object. Here is the code for it -
trait TestSparkSession extends BeforeAndAfterAll {
self: Suite =>
var hdfsCluster: MiniDFSCluster = _
def nameNodeURI: String = s"hdfs://localhost:${hdfsCluster.getNameNodePort}/"
def withLocalSparkSession(tests: SparkSession => Any): Any = {
val baseDir = new File(PathUtils.getTestDir(getClass), "miniHDFS")
val conf = new HdfsConfiguration()
conf.set(MiniDFSCluster.HDFS_MINIDFS_BASEDIR, baseDir.getAbsolutePath)
val builder = new MiniDFSCluster.Builder(conf)
hdfsCluster = builder.nameNodePort(9000)
.manageNameDfsDirs(true)
.manageDataDfsDirs(true)
.format(true)
.build()
hdfsCluster.waitClusterUp()
val testSpark = SparkSession
.builder()
.master("local")
.appName("Test App")
.config("spark.sql.warehouse.dir", s"${nameNodeURI}spark-warehouse/")
.getOrCreate()
tests(testSpark)
}
def stopHdfs(): Unit = hdfsCluster.shutdown(true, true)
override def afterAll(): Unit = stopHdfs()
}
While writing my tests, I inherit this trait and then write test cases like -
class SampleSpec extends FunSuite with TestSparkSession {
withLocalSparkSession {
testSpark =>
import testSpark.implicits._
// Test 1 Here
// Test 2 Here
}
}
Everything works fine when I run my test classes one at a time. But when run them all at once it throws java.net.BindException: Address already in use.
It should mean that the already created hdfsCluster is not yet down when the next set of tests are executed. That is why it is unable to create another one that binds to the same port. But then in the afterAll() I stopped the hfdsCluster.
My question is can I share single instance of hdfs cluster and spark session instead of initialising it every time ? I have tried to extract out the initialisation outside of the method but it still throwing same exception. Even if I can't share it, how can I properly stop my cluster and re-initialise it on next test class execution ?
Also, please let me know if my approach for writing 'unit' test cases that uses SparkSession and HDFS storage is correct.
Any help will be greatly appreciated.
I resolved it by creating the hdfs cluster in companion object instead so that it creates a single instance of it for all the test suits.
I am trying to write classic sql query using scala to insert some information into a sql server database table.
The connection to my database works perfectly and I succeed to read data from JDBC, from a table recently created called "textspark" which has only 1 column called "firstname" create table textspark(firstname varchar(10)).
However, when I try to write data into the table , I get the following error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Table or view not found: textspark
this is my code:
//Step 1: Check that the JDBC driver is available
Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver")
//Step 2: Create the JDBC URL
val jdbcHostname = "localhost"
val jdbcPort = 1433
val jdbcDatabase ="mydatabase"
val jdbcUsername = "mylogin"
val jdbcPassword = "mypwd"
// Create the JDBC URL without passing in the user and password parameters.
val jdbcUrl = s"jdbc:sqlserver://${jdbcHostname}:${jdbcPort};database=${jdbcDatabase}"
// Create a Properties() object to hold the parameters.
import java.util.Properties
val connectionProperties = new Properties()
connectionProperties.put("user", s"${jdbcUsername}")
connectionProperties.put("password", s"${jdbcPassword}")
//Step 3: Check connectivity to the SQLServer database
val driverClass = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
connectionProperties.setProperty("Driver", driverClass)
//Read data from JDBC
val textspark_table = spark.read.jdbc(jdbcUrl, "textspark", connectionProperties)
textspark_table.show()
//the read operation works perfectly!!
//Write data to JDBC
import org.apache.spark.sql.SaveMode
spark.sql("insert into textspark values('test') ")
.write
.mode(SaveMode.Append) // <--- Append to the existing table
.jdbc(jdbcUrl, "textspark", connectionProperties)
//the write operation generates error!!
Can anyone help me please to fix this error?
You don't use insert statement in Spark. You specified the append mode what is ok. You shouldn't insert data, you should select / create it. Try something like this:
spark.sql("select 'text'")
.write
.mode(SaveMode.Append)
.jdbc(jdbcUrl, "textspark", connectionProperties)
or
Seq("test").toDS
.write
.mode(SaveMode.Append)
.jdbc(jdbcUrl, "textspark", connectionProperties)
I have a scenario to compare two different tables source and destination from two separate remote hive servers, can we able to use two SparkSessions something like I tried below:-
val spark = SparkSession.builder().master("local")
.appName("spark remote")
.config("javax.jdo.option.ConnectionURL", "jdbc:mysql://192.168.175.160:3306/metastore?useSSL=false")
.config("javax.jdo.option.ConnectionUserName", "hiveroot")
.config("javax.jdo.option.ConnectionPassword", "hivepassword")
.config("hive.exec.scratchdir", "/tmp/hive/${user.name}")
.config("hive.metastore.uris", "thrift://192.168.175.160:9083")
.enableHiveSupport()
.getOrCreate()
SparkSession.clearActiveSession()
SparkSession.clearDefaultSession()
val sparkdestination = SparkSession.builder()
.config("javax.jdo.option.ConnectionURL", "jdbc:mysql://192.168.175.42:3306/metastore?useSSL=false")
.config("javax.jdo.option.ConnectionUserName", "hiveroot")
.config("javax.jdo.option.ConnectionPassword", "hivepassword")
.config("hive.exec.scratchdir", "/tmp/hive/${user.name}")
.config("hive.metastore.uris", "thrift://192.168.175.42:9083")
.enableHiveSupport()
.getOrCreate()
I tried with SparkSession.clearActiveSession() and SparkSession.clearDefaultSession() but it isn't working, throwing the error below:
Hive: Failed to access metastore. This class should not accessed in runtime.
org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
is there any other way we can achieve accessing two hive tables using multiple SparkSessions or SparkContext.
Thanks
I use this way and working perfectly fine with Spark 2.1
val sc = SparkSession.builder()
.config("hive.metastore.uris", "thrift://dbsyz1111:10000")
.enableHiveSupport()
.getOrCreate()
// Createdataframe 1 from by reading the data from hive table of metstore 1
val dataframe_1 = sc.sql("select * from <SourcetbaleofMetaStore_1>")
// Resetting the existing Spark Contexts
SparkSession.clearActiveSession()
SparkSession.clearDefaultSession()
//Initialize Spark session2 with Hive Metastore 2
val spc2 = SparkSession.builder()
.config("hive.metastore.uris", "thrift://dbsyz2222:10004")
.enableHiveSupport()
.getOrCreate()
// Load dataframe 2 of spark context 1 into a new dataframe of spark context2, By getting schema and data by converting to rdd API
val dataframe_2 = spc2.createDataFrame(dataframe_1.rdd, dataframe_1.schema)
dataframe_2.write.mode("Append").saveAsTable(<targettableNameofMetastore_2>)
Look at SparkSession getOrCreate method
which state that
gets an existing [[SparkSession]] or, if there is no existing one,
creates a new one based on the options set in this builder.
This method first checks whether there is a valid thread-local
SparkSession, and if yes, return that one. It then checks whether
there is a valid global default SparkSession, and if yes, return
that one. If no valid global default SparkSession exists, the method
creates a new SparkSession and assigns the newly created
SparkSession as the global default.
In case an existing SparkSession is returned, the config options specified in this builder will be applied to the existing
SparkSession.
That's the reason its returning first session and its configurations.
Please go through the docs to find out alternative ways to create session..
I'm working on <2 spark version. So I am not sure how to create new session with out collision of configuration exactly..
But, here is useful test case i.e SparkSessionBuilderSuite.scala to do that-
DIY..
Example method in that test case
test("use session from active thread session and propagate config options") {
val defaultSession = SparkSession.builder().getOrCreate()
val activeSession = defaultSession.newSession()
SparkSession.setActiveSession(activeSession)
val session = SparkSession.builder().config("spark-config2", "a").getOrCreate()
assert(activeSession != defaultSession)
assert(session == activeSession)
assert(session.conf.get("spark-config2") == "a")
assert(session.sessionState.conf == SQLConf.get)
assert(SQLConf.get.getConfString("spark-config2") == "a")
SparkSession.clearActiveSession()
assert(SparkSession.builder().getOrCreate() == defaultSession)
SparkSession.clearDefaultSession()
}
How can I load a file from SFTP server into spark RDD. After loading this file I need to perform some filtering on the data. Also the file is csv file so could you please help me decide if I should use Dataframes or RDDs.
You can use spark-sftp library in your program in following ways:
For Spark 2.x
Maven Dependency
<dependency>
<groupId>com.springml</groupId>
<artifactId>spark-sftp_2.11</artifactId>
<version>1.1.0</version>
</dependency>
SBT Dependency
libraryDependencies += "com.springml" % "spark-sftp_2.11" % "1.1.0"
Using with Spark shell
This package can be added to Spark using the --packages command line option. For example, to include it when starting the spark shell:
$ bin/spark-shell --packages com.springml:spark-sftp_2.11:1.1.0
Scala API
// Construct Spark dataframe using file in FTP server
val df = spark.read.
format("com.springml.spark.sftp").
option("host", "SFTP_HOST").
option("username", "SFTP_USER").
option("password", "****").
option("fileType", "csv").
option("inferSchema", "true").
load("/ftp/files/sample.csv")
// Write dataframe as CSV file to FTP server
df.write.
format("com.springml.spark.sftp").
option("host", "SFTP_HOST").
option("username", "SFTP_USER").
option("password", "****").
option("fileType", "csv").
save("/ftp/files/sample.csv")
For Spark 1.x (1.5+)
Maven Dependency
<dependency>
<groupId>com.springml</groupId>
<artifactId>spark-sftp_2.10</artifactId>
<version>1.0.2</version>
</dependency>
SBT Dependency
libraryDependencies += "com.springml" % "spark-sftp_2.10" % "1.0.2"
Using with Spark shell
This package can be added to Spark using the --packages command line option. For example, to include it when starting the spark shell:
$ bin/spark-shell --packages com.springml:spark-sftp_2.10:1.0.2
Scala API
import org.apache.spark.sql.SQLContext
// Construct Spark dataframe using file in FTP server
val sqlContext = new SQLContext(sc)
val df = sqlContext.read.
format("com.springml.spark.sftp").
option("host", "SFTP_HOST").
option("username", "SFTP_USER").
option("password", "****").
option("fileType", "csv").
option("inferSchema", "true").
load("/ftp/files/sample.csv")
// Write dataframe as CSV file to FTP server
df.write().
format("com.springml.spark.sftp").
option("host", "SFTP_HOST").
option("username", "SFTP_USER").
option("password", "****").
option("fileType", "csv").
save("/ftp/files/sample.csv")
For more information on spark-sftp you can visit there github page springml/spark-sftp
Loading from SFTP is straight forward using the sftp-connector.
https://github.com/springml/spark-sftp
Remember it is single thread application and lands data into hdfs even you dont specify it. It Streams the data into hdfs and then creates an DataFrame on top of it
While Loading we need to specify couple of more parameters.
Normally with out specifying the location also it may work when your user sudo user of hdfs. It will create the temp file in / of hdfs and will delete it once the process is completed.
val data = sparkSession.read.format("com.springml.spark.sftp").
option("host", "host").
option("username", "user").
option("password", "password").
option("fileType", "json").
option("createDF", "true").
option("hdfsTempLocation","/user/currentuser/").
load("/Home/test_mapping.json");
All the available options are the following, Source code
https://github.com/springml/spark-sftp/blob/master/src/main/scala/com/springml/spark/sftp/DefaultSource.scala
override def createRelation(sqlContext: SQLContext, parameters: Map[String, String], schema: StructType) = {
val username = parameters.get("username")
val password = parameters.get("password")
val pemFileLocation = parameters.get("pem")
val pemPassphrase = parameters.get("pemPassphrase")
val host = parameters.getOrElse("host", sys.error("SFTP Host has to be provided using 'host' option"))
val port = parameters.get("port")
val path = parameters.getOrElse("path", sys.error("'path' must be specified"))
val fileType = parameters.getOrElse("fileType", sys.error("File type has to be provided using 'fileType' option"))
val inferSchema = parameters.get("inferSchema")
val header = parameters.getOrElse("header", "true")
val delimiter = parameters.getOrElse("delimiter", ",")
val createDF = parameters.getOrElse("createDF", "true")
val copyLatest = parameters.getOrElse("copyLatest", "false")
//System.setProperty("java.io.tmpdir","hdfs://devnameservice1/../")
val tempFolder = parameters.getOrElse("tempLocation", System.getProperty("java.io.tmpdir"))
val hdfsTemp = parameters.getOrElse("hdfsTempLocation", tempFolder)
val cryptoKey = parameters.getOrElse("cryptoKey", null)
val cryptoAlgorithm = parameters.getOrElse("cryptoAlgorithm", "AES")
val supportedFileTypes = List("csv", "json", "avro", "parquet")
if (!supportedFileTypes.contains(fileType)) {
sys.error("fileType " + fileType + " not supported. Supported file types are " + supportedFileTypes)
}