spark scala on windows machine - scala

I am learning from the class. I have run the code as shown in the class and i get below errors. Any idea what i should do?
I have spark 1.6.1 and Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_74)
val datadir = "C:/Personal/V2Maestros/Courses/Big Data Analytics with Spark/Scala"
//............................................................................
//// Building and saving the model
//............................................................................
val tweetData = sc.textFile(datadir + "/movietweets.csv")
tweetData.collect()
def convertToRDD(inStr : String) : (Double,String) = {
val attList = inStr.split(",")
val sentiment = attList(0).contains("positive") match {
case true => 0.0
case false => 1.0
}
return (sentiment, attList(1))
}
val tweetText=tweetData.map(convertToRDD)
tweetText.collect()
//val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
var ttDF = sqlContext.createDataFrame(tweetText).toDF("label","text")
ttDF.show()
The error is:
scala> ttDF.show()
[Stage 2:> (0 + 2) / 2]16/03/30 11:40:25 ERROR ExecutorClassLoader: Failed to check existence of class org.apache.spark.sql.catalyst.expressio
REPL class server at http://192.168.56.1:54595
java.net.ConnectException: Connection timed out: connect
at java.net.TwoStacksPlainSocketImpl.socketConnect(Native Method)
re/4729300

I'm no expert but the connection IP in the error message looks like a private node or even your router/modem local address.
As stated in the comment it could be that you're running the context with a wrong configuration that tries to spread the work to a cluster that's not there, instead of in your local jvm process.
For further information you can read here and experiment with something like
import org.apache.spark.SparkContext
val sc = new SparkContext(master = "local[4]", appName = "tweetsClass", conf = new SparkConf)
Update
Since you're using the interactive shell and the provided SparkContext available there, I guess you should pass the equivalent parameters to the shell command as in
<your-spark-path>/bin/spark-shell --master local[4]
Which instructs the driver to assign a master for the spark cluster on the local machine, on 4 threads.

I think the problem comes with connectivity and not from within the code.
Check if you can actually connect to this address and port (54595).

Probably your spark master is not accessible at the specified port. Use local[*] to validate using a smaller dataset and local master. Then, ckeck if the port is accessible or change it based on Spark port configuration (http://spark.apache.org/docs/latest/configuration.html)

Related

how to detect if your code is running under pyspark

For staging and production, my code will be running on PySpark. However, in my local development environment, I will not be running my code on PySpark.
This presents a problem from the standpoint of logging. Because one uses the Java library Log4J via Py4J when using PySpark, one will not be using Log4J for the local development.
Thankfully, the API for Log4J and the core Python logging module are the same: once you get a logger object, with either module you simply debug() or info() etc.
Thus, I wish to detect whether or not my code is being imported/run in PySpark or a non-PySpark environment: similar to:
class App:
def our_logger(self):
if self.running_under_spark():
sc = SparkContext(conf=conf)
log4jLogger = sc._jvm.org.apache.log4j
log = log4jLogger.LogManager.getLogger(__name__)
log.warn("Hello World!")
return log
else:
from loguru import logger
return logger
How might I implement running_under_spark()
Simply trying to import pyspark and seeing if it works is not a fail-proof way of doing this because I have pyspark in my dev environment to kill warnings about non-imported modules in the code from my IDE.
Maybe you can set some environment variable in your spark environment that you check for at runtime ( in $SPARK_HOME/conf/spark-env.sh):
export SPARKY=spark
Then you check if SPARKY exists to determine if you're in your spark environment.
from os import environ
class App:
def our_logger(self):
if environ.get('SPARKY') is not None:
sc = SparkContext(conf=conf)
log4jLogger = sc._jvm.org.apache.log4j
log = log4jLogger.LogManager.getLogger(__name__)
log.warn("Hello World!")
return log
else:
from loguru import logger
return logger

Spark ElasticSearch Configuration - Reading Elastic Search from Spark

I am trying to read data from ElasticSearch via Spark Scala. I see lot of post addressing this question, i have tried all the options they have mentioned in various posts but seems nothing is working for me
JAR Used - elasticsearch-hadoop-5.6.8.jar (Used elasticsearch-spark-5.6.8.jar too without any success)
Elastic Search Version - 5.6.8
Spark - 2.3.0
Scala - 2.11
Code:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.elasticsearch.spark._
val spark = SparkSession.builder.appName("elasticSpark").master("local[*]").getOrCreate()
val reader = spark.read.format("org.elasticsearch.spark.sql").option("es.index.auto.create", "true").option("spark.serializer", "org.apache.spark.serializer.KryoSerializer").option("es.port", "9200").option("es.nodes", "xxxxxxxxx").option("es.nodes.wan.only", "true").option("es.net.http.auth.user","xxxxxx").option("es.net.http.auth.pass", "xxxxxxxx")
val read = reader.load("index/type")
Error:
ERROR rest.NetworkClient: Node [xxxxxxxxx:9200] failed (The server xxxxxxxxxxxxx failed to respond); no other nodes left - aborting...
org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'
at org.elasticsearch.hadoop.rest.InitializationUtils.discoverEsVersion(InitializationUtils.java:294)
at org.elasticsearch.spark.sql.SchemaUtils$.discoverMappingAndGeoFields(SchemaUtils.scala:98)
at org.elasticsearch.spark.sql.SchemaUtils$.discoverMapping(SchemaUtils.scala:91)
at org.elasticsearch.spark.sql.ElasticsearchRelation.lazySchema$lzycompute(DefaultSource.scala:129)
at org.elasticsearch.spark.sql.ElasticsearchRelation.lazySchema(DefaultSource.scala:129)
at org.elasticsearch.spark.sql.ElasticsearchRelation$$anonfun$schema$1.apply(DefaultSource.scala:133)
at org.elasticsearch.spark.sql.ElasticsearchRelation$$anonfun$schema$1.apply(DefaultSource.scala:133)
at scala.Option.getOrElse(Option.scala:121)
at org.elasticsearch.spark.sql.ElasticsearchRelation.schema(DefaultSource.scala:133)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:432)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)
... 53 elided
Caused by: org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[xxxxxxxxxxx:9200]]
at org.elasticsearch.hadoop.rest.NetworkClient.execute(NetworkClient.java:149)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:461)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:425)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:429)
at org.elasticsearch.hadoop.rest.RestClient.get(RestClient.java:155)
at org.elasticsearch.hadoop.rest.RestClient.remoteEsVersion(RestClient.java:655)
at org.elasticsearch.hadoop.rest.InitializationUtils.discoverEsVersion(InitializationUtils.java:287)
... 65 more
Apart from this I have also tried below properties without any success:
option("es.net.ssl.cert.allow.self.signed", "true")
option("es.net.ssl.truststore.location", "<path for elasticsearch cert file>")
option("es.net.ssl.truststore.pass", "xxxxxx")
Please note elasticsearch node is within Unix edge node and is http://xxxxxx:9200 (mentioning it just in case if that makes any difference with the code)
What I am missing here? Any other properties? Please Help
Use below Jar which support spark 2+ version instead of Elastic-Hadoop or Elastic-Spark jar.
https://mvnrepository.com/artifact/org.elasticsearch/elasticsearch-spark-20_2.11/5.6.8

How to query flink's queryable state

I am using flink 1.8.0 and I am trying to query my job state.
val descriptor = new ValueStateDescriptor("myState", Types.CASE_CLASS[Foo])
descriptor.setQueryable("my-queryable-State")
I used port 9067 which is the default port according to this, my client:
val client = new QueryableStateClient("127.0.0.1", 9067)
val jobId = JobID.fromHexString("d48a6c980d1a147e0622565700158d9e")
val execConfig = new ExecutionConfig
val descriptor = new ValueStateDescriptor("my-queryable-State", Types.CASE_CLASS[Foo])
val res: Future[ValueState[Foo]] = client.getKvState(jobId, "my-queryable-State","a", BasicTypeInfo.STRING_TYPE_INFO, descriptor)
res.map(_.toString).pipeTo(sender)
but I am getting :
[ERROR] [06/25/2019 20:37:05.499] [bvAkkaHttpServer-akka.actor.default-dispatcher-5] [akka.actor.ActorSystemImpl(bvAkkaHttpServer)] Error during processing of request: 'org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /127.0.0.1:9067'. Completing with 500 Internal Server Error response. To change default exception handling behavior, provide a custom ExceptionHandler.
java.util.concurrent.CompletionException: org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /127.0.0.1:9067
what am I doing wrong ?
how and where should I define QueryableStateOptions
So If You want to use the QueryableState You need to add the proper Jar to Your flink. The jar is flink-queryable-state-runtime, it can be found in the opt folder in Your flink distribution and You should move it to the lib folder.
As for the second question the QueryableStateOption is just a class that is used to create static ConfigOption definitions. The definitions are then used to read the configurations from flink-conf.yaml file. So currently the only option to configure the QueryableState is to use the flink-conf file in the flink distribution.
EDIT: Also, try reading this]1 it provides more info on how does Queryable State works. You shouldn't really connect directly to the server port but rather You should use the proxy port which by default is 9069.

Error while using truncateTable method of JDBCUtils in PostgreTable using Spark

I am trying to truncate table postgre_table from Spark using JDBCUtils, but it is throwing below error
< console>:71: error: value truncateTable is not a member of object org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils
val trucate_table = JdbcUtils.truncateTable()
I am using the below code:
import org.apache.spark.sql.execution.datasources.jdbc._
import java.sql.DriverManager
import java.sql.Connection
val connection : Connection = DriverManager.getConnection(postgres_host + postgres_database,postgres_username,postgres_password)
val table_existing = JdbcUtils.tableExists(connection, postgres_host + postgres_database, postgre_table)
JdbcUtils.truncateTable(connection, postgres_host + postgres_database, postgre_table)
I am able to drop the table but not truncate it. I can see truncateTable method in https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala
Please suggest a solution and how to use it in databricks.
Looks like your compile time and runtime spark libraries are of different versions. Please make sure the runtime version match compile time version of spark. Seems the method is available from the version 2.1 alone.
available from this release :
https://github.com/apache/spark/blob/branch-2.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala

Kerberos Exception launching Spark locally

I am trying to set up a Spark Testng unit test:
#Test
def testStuff(): Unit = {
val sc = new SparkContext(new SparkConf().setAppName("test").setMaster("local"))
...
}
The code fails with: IllegalArgumentException: Can't get Kerberos realm
What am I missing?
The error suggests that your JVM is unable to locate the kerberos config (krb5.conf file).
Depending on your company's environment/infrastruture you have a few options:
Check if your company has standard library to set kerberos authentication.
Alternatively try:
set JVM property: -Djava.security.krb5.conf=/file-path/for/krb5.conf
Put the krb5.conf file into the <jdk-home>/jre/lib/security folder