Has anyone successfully run Apache Spark & Shark on Cassandra? - scala

I am trying to configure a 5-node Cassandra cluster to run Spark/Shark so I can test out some Hive queries.
I have installed Spark, Scala, and Shark, and configured them according to the AMPLab guide Running Shark on a Cluster (https://github.com/amplab/shark/wiki/Running-Shark-on-a-Cluster).
I am able to get into the Shark CLI, but when I try to create an EXTERNAL TABLE out of one of my Cassandra ColumnFamily tables, I keep getting this error:
Failed with exception org.apache.hadoop.hive.ql.metadata.HiveException: Error in loading storage handler.org.apache.hadoop.hive.cassandra.CassandraStorageHandler
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
I have configured HIVE_HOME, HADOOP_HOME, and SCALA_HOME. Perhaps I'm pointing HIVE_HOME and HADOOP_HOME to the wrong paths? HADOOP_HOME is set to my Cassandra Hadoop folder (/etc/dse/cassandra), HIVE_HOME is set to the unpacked AMPLab download of Hadoop1/hive, and I have also set HIVE_CONF_DIR to my Cassandra Hive path (/etc/dse/hive).
Am I missing any steps? Or have I configured these locations incorrectly? Any ideas, please? Any help will be very much appreciated. Thanks.

Yes, I have got it working.
Try https://github.com/2013Commons/hive-cassandra,
which works with Cassandra 2.0.4, Hive 0.11, and Hadoop 2.0.

Related

Use pyspark on yarn cluster without creating the context

I'll do my best to explain myself. I'm using JupyterHub to connect to my university's cluster and write some code. Basically I'm using pyspark, but since I've always used the "yarn kernel" (I'm not sure of what I'm saying), I've never defined the Spark context or the Spark session. Now, for some reason, it doesn't work anymore, and when I try to use Spark this error appears:
Code ->
df = spark.read.csv('file:///%s/.....
Error ->
name 'spark' is not defined
This has already happened to me before, but back then I solved it just by installing another version of pyspark. Now I don't know what to do.
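For reference, defining the session explicitly would look roughly like the sketch below. This is only a guess at the setup, since the original kernel configuration isn't shown; the app name and file path are made up.

from pyspark.sql import SparkSession

# Build (or reuse) a session explicitly instead of relying on the kernel to inject `spark`
spark = (
    SparkSession.builder
    .master("yarn")                     # assumption: the cluster runs YARN, as the title suggests
    .appName("jupyterhub-notebook")     # hypothetical application name
    .getOrCreate()
)

df = spark.read.csv("file:///path/to/file.csv", header=True)   # placeholder path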

Scala code execution on the master of a Spark cluster?

The Spark application makes some API calls that do not use the Spark session. I believe that when a piece of code doesn't use Spark, it gets executed on the master node!
Why do I want to know this?
I am getting a Java heap space error while trying to POST some files using API calls, and I believe that if I upgrade the master and increase the driver memory, it can be solved.
I want to understand how this type of application is executed on a Spark cluster.
Is my understanding right, or am I missing something?
It depends: closures/functions passed to the built-in transform function, any code in UDFs you create, and code in forEachBatch (and maybe a few other places) will run on the workers. Other code runs on the driver.
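A small illustration of that split, written as a PySpark sketch (the same rule applies to Scala; the names and values below are made up):

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

# Runs on the driver: ordinary code such as building payloads or calling external APIs
payload = {"job": "example"}                      # driver-side only

# Runs on the executors: the body of a UDF is serialized and shipped to the workers
@udf(returnType=StringType())
def tag_value(v):
    return f"row-{v}"                             # executed on the worker nodes

df = spark.range(5).withColumn("tag", tag_value("id"))
df.show()                                         # the action triggers the distributed part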

An error occurred while creating datasets: Dataset could not be created

I have a Kylin cluster running in Kubernetes, and Superset running in Kubernetes as well.
Kylin is already configured with a built cube, "kylin_sales_cube".
Superset is already configured with the Kylin driver, and the connection is established.
While trying to create a dataset from a Kylin table, I get the following error message:
An error occurred while creating datasets: Dataset could not be created.
On the other hand, I am able to run a query on the same table; but without a dataset, I cannot use charts.
Any ideas?
It seems to be a missing implementation of a method in kylinpy (or somewhere else), but until someone fixes it, I suggest that everyone who has this problem implement the has_table method in sqla_dialect.py from the kylinpy plugin. You will find it in kylinpy/sqla_dialect.py.
You should change that method's return to the next line:
return table_name in self.get_table_names(connection, schema)
And everything will be back to normal.
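A minimal sketch of what the patched method could look like (the signature is an assumption based on common SQLAlchemy dialect conventions; only the return line comes from the suggestion above):

# kylinpy/sqla_dialect.py (sketch; the exact signature may differ in the actual plugin)
def has_table(self, connection, table_name, schema=None):
    # Report the table as existing when it appears in the dialect's table listing
    return table_name in self.get_table_names(connection, schema)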

Running Scala module with databricks-connect

I've tried to follow the instructions here to set up databricks-connect with IntelliJ. My understanding is that I can run code from the IDE and it will run on the Databricks cluster.
I added the jar directory from the miniconda environment and moved it above all of the Maven dependencies in File -> Project Structure...
However, I think I did something wrong. When I tried to run my module, I got the following error:
21/07/17 22:44:24 ERROR SparkContext: Error initializing SparkContext.
java.lang.IllegalArgumentException: System memory 259522560 must be at least 471859200. Please increase heap size using the --driver-memory option or spark.driver.memory in Spark configuration.
at org.apache.spark.memory.UnifiedMemoryManager$.getMaxMemory(UnifiedMemoryManager.scala:221)
at org.apache.spark.memory.UnifiedMemoryManager$.apply(UnifiedMemoryManager.scala:201)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:413)
at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:262)
at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:291)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:495)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2834)
at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:1016)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:1010)
at com.*.sitecomStreaming.sitecomStreaming$.main(sitecomStreaming.scala:184)
at com.*.sitecomStreaming.sitecomStreaming.main(sitecomStreaming.scala)
The system memory being 259 GB makes me think it's trying to run locally on my laptop instead of on the dbx cluster? I'm not sure whether this is correct, or what I can do to get this up and running properly...
Any help is appreciated!
The driver in databricks-connect always runs locally; only the executors run in the cloud. Also, the reported memory is in bytes, so 259522560 is ~256 MB. You can increase it using the option that the error reports.
P.S. But if you're using Structured Streaming, then yes, it's a known limitation of databricks-connect.
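For example (illustrative values, not from the question; which form applies depends on how the driver JVM is launched):

--driver-memory 1g          # when the application is launched through spark-submit
spark.driver.memory 1g      # as a Spark configuration property set before the JVM starts
-Xmx1g                      # as a VM option when the IDE starts the driver JVM directly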

Crate JDBC driver load balancing issue

Q1. I have a Crate cluster of version 1.0.2, and I am using an older version of the Crate JDBC driver to connect to it from a Java program. I have specified all of the Crate nodes in the JDBC driver URL, separated by commas. When I fire queries from my Java program at Crate, I can see the memory and CPU usage of only one Crate node increase, and this node is the first in the comma-separated list given in the connection URL. After some time, that node runs out of memory. Could someone please explain why this happens? I remember reading documentation for the Crate driver which indicated that the driver load-balances queries across all specified client nodes. All my nodes are client-enabled.
Q2. I tried the same experiment with Crate 2.1.6 and JDBC driver 2.1.7, and I can see the same behavior. I have verified that all the queries are being fired against data which is spread across multiple nodes. In the latest documentation, I can see that a new property was added, viz. loadBalanceHosts: https://crate.io/docs/clients/jdbc/en/latest/connecting.html#jdbc-url-format
Right now I do not have this property set. Was this property present and required in JDBC driver version 2.1.7? Why do developers have to worry about load balancing when the Crate cluster and JDBC driver are supposed to provide it?
FYI, most of my queries have a GROUP BY clause, and I have a few billion records to experiment with. The memory configured is 30 GB per node.
This has been fixed in the latest driver: https://github.com/crate/crate-jdbc/blob/master/docs/connecting.txt#L62
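For what it's worth, the URL with the load-balancing flag would look something like the line below (the hosts and port are placeholders; loadBalanceHosts is the property referenced in the documentation link above):

jdbc:crate://node1.example.com:5432,node2.example.com:5432,node3.example.com:5432/?loadBalanceHosts=true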