metastore_db doesn't get created with apache spark 2.2.1 in windows 7 - scala

I want to read CSV files using the latest Apache Spark version, i.e. 2.2.1, on Windows 7 via cmd, but I am unable to do so because of a problem with the metastore_db. I tried the steps below:
1. spark-shell --packages com.databricks:spark-csv_2.11:1.5.0 // since my Scala version is 2.11
2. val df = spark.read.format("csv").option("header", "true").option("mode", "DROPMALFORMED").load("file:///D:/ResourceData.csv") // in the latest versions we use the SparkSession variable spark instead of sqlContext
but it throws the error below:
Caused by: org.apache.derby.iapi.error.StandardException: Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader
Caused by: org.apache.derby.iapi.error.StandardException: Another instance of Derby may have already booted the database
I am able to read CSV in version 1.6, but I want to do it in the latest version. Can anyone help me with this? I have been stuck for many days.

Open Spark Shell
spark-shell
Wrap the Spark context in an SQLContext and assign it to the sqlContext variable
val sqlContext = new org.apache.spark.sql.SQLContext(sc) // As Spark context available as 'sc'
Read the CSV file as per your requirement
val bhaskar = sqlContext.read.format("csv")
.option("header", "true")
.option("inferSchema", "true")
.load("/home/burdwan/Desktop/bhaskar.csv") // with a wildcard (...Desktop/*.csv) you can load multiple CSV files in a single call
Collect the results and print them
bhaskar.collect.foreach(println)
Output
_a1 _a2 Cn clr clarity depth aprx price x y z
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Good J VVS2 63 57 336 3.94 3.96 2.48

Finally, even this worked only on a Linux-based OS. Download Apache Spark from the official documentation and set it up using this link. Just verify that you are able to invoke spark-shell; then enjoy loading and performing actions on any type of file with the latest Spark version. I don't know why it's not working on Windows, even though I am running it for the first time.
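As a hedged follow-up on the Windows error itself: "Another instance of Derby may have already booted the database" usually means a stale metastore_db directory (with its db.lck lock file) left behind in the folder you launched spark-shell from, e.g. by a crashed session. Deleting that directory, or pointing each session at its own Derby home before the first Spark SQL action, tends to clear it. A minimal sketch; the path is an assumption, pick any writable directory:

```scala
// Sketch: give this session its own Derby home so it cannot collide with a
// lock left behind by another (possibly crashed) spark-shell instance.
// "C:/tmp/derby" is a hypothetical writable directory.
val derbyHome = "C:/tmp/derby"
System.setProperty("derby.system.home", derbyHome)

// Verify the property took effect before triggering any Spark SQL action.
println(System.getProperty("derby.system.home"))
```

The same property can also be passed at launch, e.g. spark-shell --driver-java-options "-Dderby.system.home=C:/tmp/derby", so it is set before anything boots Derby.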

Related

Read a huge Oracle table using Spark

Spark version: 2.4.7
OS: Linux RHEL Fedora
I have the following use case:
Read an Oracle table that contains ~150 million records (done daily), then write these records to 800 files (using repartition) on a shared file system, to be used by another application.
I can read the table into a data frame, but when trying to write it never finishes.
df
res38: org.apache.spark.sql.DataFrame = [ROW_ID: string, CREATED: string ... 276 more fields]
When I limit the number of retrieved records to 1 million and repartition to 6, the whole read-and-write process takes 2-3 minutes.
I searched for insights on the issue but couldn't figure it out. When running the process on the full data set, I see the following line in the UI logs (it keeps repeating):
INFO sort.UnsafeExternalSorter: Thread 135 spilling sort data of 5.2 GB to disk (57 times so far)
I submit the job using the following:
time spark-submit --verbose --conf spark.dynamicAllocation.enabled=false --conf spark.sql.broadcastTimeout=1000 --conf spark.sql.shuffle.partitions=1500 --conf "spark.ui.enabled=true" --master yarn --driver-memory 60G --executor-memory 10G --num-executors 40 --executor-cores 8 --jars ojdbc6.jar SparkOracleExtractor.jar
Please note there are enough resources on the cluster and the queue is only 5.0% used, and a constraint is to use Spark.
Appreciate any help on where to get information to resolve the issue and speed up the process.
This is the code:
val myquery = "select * from mytable"
val dt=20221023
val df = spark.read.format("jdbc").
option("url", s"jdbc:oracle:thin:#$db_connect_string").
option("driver", "oracle.jdbc.driver.OracleDriver").
option("query", myquery).
option("user", db_user).
option("password", db_pass).
option("fetchsize", 10000).
option("delimiter", "|").
load()
df.repartition(800).write.csv(s"file:///fs/extrat_path/${dt}")
These are the shuffle spill sizes after 2.5 hours:
Shuffle Spill (Memory) Shuffle Spill (Disk)
259.4 GB 22.0 GB
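One likely culprit in the job above: with only query and fetchsize set and no partitioning options, the entire ~150M-row result is pulled through a single JDBC connection into one partition, and repartition(800) then shuffles all of it, which matches the huge spills. Adding partitionColumn, lowerBound, upperBound and numPartitions to the reader lets Spark open parallel connections, each restricted by its own WHERE clause. A simplified, pure-Scala sketch of that stride logic (the strides function is illustrative, not Spark's internal API):

```scala
// Split [lower, upper) on a numeric column into numPartitions strides, the
// way Spark's JDBC source builds per-task WHERE clauses. The first and last
// clauses are open-ended so rows outside the bounds are still captured.
def strides(column: String, lower: Long, upper: Long, numPartitions: Int): Seq[String] = {
  val stride = (upper - lower) / numPartitions
  (0 until numPartitions).map { i =>
    val lo = lower + i * stride
    val hi = lower + (i + 1) * stride
    if (i == 0) s"$column < $hi"
    else if (i == numPartitions - 1) s"$column >= $lo"
    else s"$column >= $lo AND $column < $hi"
  }
}

strides("ID", 0L, 1000L, 4).foreach(println)
// ID < 250
// ID >= 250 AND ID < 500
// ID >= 500 AND ID < 750
// ID >= 750
```

If the table has a reasonably evenly distributed numeric column (e.g. a sequence-backed ID), reading with numPartitions near the target 800 may also let you drop the repartition(800) and its shuffle entirely.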

Spark DataFrame writes part files to _temporary instead of directly creating part files in the output directory [duplicate]

We are running Spark 2.3.0 on AWS EMR. The following DataFrame "df" is non-empty and of modest size:
scala> df.count
res0: Long = 4067
The following code for writing df to HDFS (and reading it back) works fine:
scala> val hdf = spark.read.parquet("/tmp/topVendors")
hdf: org.apache.spark.sql.DataFrame = [displayName: string, cnt: bigint]
scala> hdf.count
res4: Long = 4067
However, using the same code to write to a local parquet or CSV file ends up with empty results:
df.repartition(1).write.mode("overwrite").parquet("file:///tmp/topVendors")
scala> val locdf = spark.read.parquet("file:///tmp/topVendors")
org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:207)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:207)
at scala.Option.getOrElse(Option.scala:121)
We can see why it fails:
ls -l /tmp/topVendors
total 0
-rw-r--r-- 1 hadoop hadoop 0 Jul 30 22:38 _SUCCESS
So there is no parquet file being written.
I have tried this maybe twenty times, for both CSV and parquet, and on two different EMR servers: the same behavior is exhibited in all cases.
Is this an EMR-specific bug? A more general EC2 bug? Something else? This code works with Spark on macOS.
In case it matters - here is the versioning info:
Release label:emr-5.13.0
Hadoop distribution:Amazon 2.8.3
Applications:Spark 2.3.0, Hive 2.3.2, Zeppelin 0.7.3
That is not a bug; it is the expected behavior. Spark does not really support writes to non-distributed storage (it works in local mode only because you have a shared file system there).
A local path is interpreted not (only) as a path on the driver (that would require collecting the data) but as a local path on each executor. Therefore each executor writes its own chunk to its own local file system.
Not only is the output not readable back (to load data, each executor and the driver must see the same state of the file system), but depending on the commit algorithm, it might not even be finalized (moved out of the temporary directory).
This error usually occurs when you try to read an empty directory as parquet.
You could check:
1. whether the DataFrame is empty, with df.rdd.isEmpty(), before writing it;
2. whether the path you are giving is correct.
Also, in what mode are you running your application? Try client mode if you are running in cluster mode.
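Given the executor-local behavior described in the first answer, the usual workaround on EMR is to write to HDFS (or S3), which every executor can see, and then pull the result to local disk from one node. A sketch with illustrative paths:

```shell
# In spark-shell, write to HDFS instead of file://
#   df.repartition(1).write.mode("overwrite").parquet("/tmp/topVendors")
# then fetch the directory to the local file system of the current node:
hdfs dfs -get /tmp/topVendors /tmp/topVendors_local
```

hdfs dfs -getmerge is an alternative when a single concatenated file is wanted (for CSV, not parquet).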

SparkSession returns nothing with a HiveServer2 connection through JDBC

I have an issue reading data from a remote HiveServer2 using JDBC and SparkSession in Apache Zeppelin.
Here is the code.
%spark
import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession
val prop = new java.util.Properties
prop.setProperty("user","hive")
prop.setProperty("password","hive")
prop.setProperty("driver", "org.apache.hive.jdbc.HiveDriver")
val test = spark.read.jdbc("jdbc:hive2://xxx.xxx.xxx.xxx:10000/", "tests.hello_world", prop)
test.select("*").show()
When I run this, I get no errors but no data either; I just retrieve all the column names of the table, like this:
+--------------+
|hello_world.hw|
+--------------+
+--------------+
Instead of this:
+--------------+
|hello_world.hw|
+--------------+
+ data_here +
+--------------+
I'm running all of this on:
Scala 2.11.8,
OpenJDK 8,
Zeppelin 0.7.0,
Spark 2.1.0 ( bde/spark ),
Hive 2.1.1 ( bde/hive )
I run this setup in Docker; each of these has its own container, but they are all connected on the same network.
Furthermore, it works when I use Spark's beeline to connect to my remote Hive.
Have I forgotten something?
Any help would be appreciated.
Thanks in advance.
EDIT :
I've found a workaround: share a Docker volume or data container between Spark and Hive (more precisely, the Hive warehouse folder) and configure spark-defaults.conf accordingly. Then you can access Hive through SparkSession without JDBC. Here it is, step by step:
Share the Hive warehouse folder between Spark and Hive
Configure spark-defaults.conf like this:
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.memory Xg
spark.driver.cores X
spark.executor.memory Xg
spark.executor.cores X
spark.sql.warehouse.dir file:///your/path/here
Replace 'X' with your values.
Hope it helps.
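With the warehouse shared and configured as above, the session itself also needs Hive support enabled. A sketch of what that looks like in code (values illustrative; in Zeppelin the spark variable is usually preconfigured, so this mainly applies to standalone apps):

```scala
import org.apache.spark.sql.SparkSession

// Build a Hive-enabled session pointing at the shared warehouse directory.
val spark = SparkSession.builder()
  .appName("shared-warehouse-example")
  .config("spark.sql.warehouse.dir", "file:///your/path/here")
  .enableHiveSupport()
  .getOrCreate()

// Hive tables are now visible directly, no JDBC round-trip needed.
spark.sql("SELECT * FROM tests.hello_world").show()
```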

is there a way to optimize spark sql code?

updated:
I'm using Spark SQL 1.5.2, trying to read many parquet files and filter and aggregate rows. There are ~35M rows stored in ~30 files in my HDFS, and it takes more than 10 minutes to process.
val logins_12 = sqlContext.read.parquet("events/2015/12/*/login")
val l_12 = logins_12.where("event_data.level >= 90")
  .select("pid", "timestamp", "event_data.level")
  .withColumn("event_date", to_date(logins_12("timestamp")))
  .drop("timestamp")
  .toDF("pid", "level", "event_date")
  .groupBy("pid", "event_date")
  .agg(Map("level" -> "max"))
  .toDF("pid", "event_date", "level")
l_12.first()
My Spark runs on a two-node cluster with 8 cores and 16 GB of RAM each; the Scala output makes me think the computation runs in just one thread:
scala> x.first()
[Stage 1:=======> (50 + 1) / 368]
When I try count() instead of first(), it looks like two threads are doing the computation, which is still fewer than I was expecting, since there are ~30 files that could be processed in parallel:
scala> l_12.count()
[Stage 4:=====> (34 + 2) / 368]
I'm starting the Spark console with 14g for the executor and 4g for the driver in yarn-client mode:
./bin/spark-shell -Dspark.executor.memory=14g -Dspark.driver.memory=4g --master yarn-client
my default config for spark:
spark.executor.memory 2g
spark.logConf true
spark.eventLog.dir maprfs:///apps/spark
spark.eventLog.enabled true
spark.sql.hive.metastore.sharedPrefixes com.mysql.jdbc,org.postgresql,com.microsoft.sqlserver,oracle.jdbc,com.mapr.fs.shim.LibraryLoader,com.mapr.security.JNISecurity,com.mapr.fs.jni
spark.executor.extraClassPath
spark.yarn.historyServer.address http://test-01:18080
These are the partition counts of the underlying RDDs:
scala> logins_12.rdd.partitions.size
res2: Int = 368
scala> l_12.rdd.partitions.size
res0: Int = 200
is there a way to optimize this code?
thanks
Both behaviors are more or less expected. Spark is rather lazy: not only does it not execute transformations unless you trigger an action, it can also skip tasks that are not required for the output. Since first requires only a single element, it can compute only one partition, which is most likely why you see only one running thread at some point.
Regarding the second issue, it is most likely a matter of configuration. Assuming there is nothing wrong with the YARN configuration (I don't use YARN, but yarn.nodemanager.resource.cpu-vcores looks like a possible source of the problem), it is most likely down to Spark defaults. As you can read in the configuration guide, spark.executor.cores on YARN defaults to 1. Two workers give two running threads.
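Following that answer, the per-executor core count can be raised explicitly at launch. Note also that -D system properties are not reliably picked up by spark-shell; the dedicated flags are. A sketch with illustrative values (bounded by yarn.nodemanager.resource.cpu-vcores on each node):

```shell
./bin/spark-shell --master yarn-client \
  --driver-memory 4g \
  --executor-memory 14g \
  --num-executors 2 \
  --executor-cores 8
```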

How to parallelize loading data into an RDD in Spark

I am using the MLlib library and the MLUtils.loadLibSVMFile() method. I have a file that is 10 GB and a cluster with 5 slaves, each with 2 cores and 6 GB of memory. According to the documentation found here, the method has the following parameters:
loadLibSVMFile(sc, path, multiclass=False, numFeatures=-1, minPartitions=None)
I would like to have 10 partitions, and I don't know the number of features. When I run the method without specifying the minimum partitions, I get java.lang.OutOfMemoryError: Java heap space, as would be expected.
However, when I specify numFeatures as -1 (the documented default) and the number of partitions as 10, the WebUI shows the work being distributed, but after some time I get a java.lang.ArrayIndexOutOfBoundsException. The rest of the code looks identical to the example code written here.
I am brand new to Spark so please tell me if I am making some obvious mistake. Thanks!
EDIT 2: I ran the same thing with a sample dataset that was tiny and dense, and it works fine. Seems to be a problem with the number of features?
EDIT 3:
Here is the code I am trying to run:
import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.util.MLUtils

object MLLibExample {
  def main(args: Array[String]) {
    val sc = new SparkContext()

    // load the data as an RDD of LabeledPoint
    val data = MLUtils.loadLibSVMFile(sc, "s3n://file.libsvm", -1, 10)

    // run the training algorithm to build the model
    val numIterations = 10
    val model = SVMWithSGD.train(data, numIterations)
  }
}
Here is the exception:
15/07/20 20:42:43 WARN TaskSetManager: Lost task 0.0 in stage 4.0 (TID 35, 10.249.67.97): java.lang.ArrayIndexOutOfBoundsException: -1
at org.apache.spark.mllib.linalg.BLAS$.dot(BLAS.scala:136)
at org.apache.spark.mllib.linalg.BLAS$.dot(BLAS.scala:106)
at org.apache.spark.mllib.optimization.HingeGradient.compute(Gradient.scala:317)
at org.apache.spark.mllib.optimization.GradientDescent$$anonfun$runMiniBatchSGD$1$$anonfun$1.apply(GradientDescent.scala:192)
at org.apache.spark.mllib.optimization.GradientDescent$$anonfun$runMiniBatchSGD$1$$anonfun$1.apply(GradientDescent.scala:190)
- I have made sure the input file is correctly formatted (0/1 labels, space-delimited)
- Tried a 10 MB subsample of the original file
- Set both driver and executor memory
- Added this flag to my spark-submit command: --conf "spark.driver.maxResultSize=6G"
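An ArrayIndexOutOfBoundsException: -1 coming out of BLAS.dot often points at a zero feature index somewhere in the file: the LibSVM format is one-based, and MLlib shifts indices down to zero-based when loading, so a stray "0:" entry becomes index -1. A pure-Scala sketch of that shift (parseIndices is illustrative, not MLlib's actual parser):

```scala
// Extract the (zero-based) feature indices from one LibSVM line:
// "<label> <i1>:<v1> <i2>:<v2> ..." where indices in the file are one-based.
def parseIndices(line: String): Array[Int] =
  line.trim.split(' ').tail.map(_.split(':')(0).toInt - 1)

parseIndices("1 1:0.5 3:1.2")   // ok: zero-based indices 0 and 2
parseIndices("1 0:0.5 3:1.2")   // a zero-based file yields index -1, which
                                // later blows up inside BLAS.dot
```

A quick check on the input would be something like grep -c " 0:" file.libsvm; any hits mean the file uses zero-based indices and needs reindexing before loadLibSVMFile.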