I'm new to Spark and Scala and I'm learning with a notebook and cluster on Databricks.com. Here is my very simple code to load a file:
import sys.process._
val localpath="file:/tmp/myfile.json"
dbutils.fs.mkdirs("dbfs:/datasets/")
dbutils.fs.cp(localpath, "dbfs:/datasets/")
but I get an error like this:
java.io.FileNotFoundException: File file:/tmp/myfile.json does not exist.
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:402)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:255)
at com.databricks.backend.daemon.dbutils.FSUtils$.cp(DBUtilsCore.scala:82)
at com.databricks.dbutils_v1.impl.DbfsUtilsImpl.cp(DbfsUtilsImpl.scala:40)
I'm using a Mac and I've made sure the file exists at this absolute path. Is this a Spark error? Thanks!
The line:
val localpath="file:/tmp/myfile.json"
should be:
val localpath="file:///tmp/myfile.json"
Basically, URIs are of the form scheme://authority/path (see RFC 3986); for a local file the authority is empty, which gives three slashes before the absolute path.
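For reference, a minimal sketch of the corrected copy, assuming the file is actually present on the driver's local filesystem at /tmp/myfile.json (dbutils is only available inside a Databricks notebook):

// Copy a file from the driver's local filesystem into DBFS.
val localpath = "file:///tmp/myfile.json"
dbutils.fs.mkdirs("dbfs:/datasets/")
dbutils.fs.cp(localpath, "dbfs:/datasets/myfile.json")
dbutils.fs.ls("dbfs:/datasets/")  // verify the copy landed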
When I run the example code in cmd, everything is ok.
>>> import pyspark
>>> l = [('Alice', 1)]
>>> spark.createDataFrame(l).collect()
[Row(_1='Alice', _2=1)]
But when I execute the code in PyCharm, I get an error.
spark.createDataFrame(l).collect()
NameError: name 'spark' is not defined
Maybe something is wrong with how I linked PyCharm to pyspark.
(Screenshots of my PyCharm settings: Environment Variables, Project Structure, Project Interpreter.)
When you start pyspark from the command line, you have a SparkSession object and a SparkContext available to you as spark and sc, respectively.
To use them in PyCharm, you need to create these variables yourself first:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
EDIT:
Please have a look at: Failed to locate the winutils binary in the hadoop binary path
I'm reading metrics data from JSON files on S3. What is the right way to handle the case when the path to a file doesn't exist? Currently I'm getting an AnalysisException: Path does not exist when there is no file with the given $metricsData name.
I think one way is to throw an exception, but how should I correctly check whether the path to the file exists?
val metricsDataDF: DataFrame = spark.read.option("multiline", "true")
.json(s"$dataPath/$metricsData.json")
I wouldn't use java.nio.file; it doesn't have a proper binding to S3 or HDFS. If you want your code to work on all filesystems (local, inside Docker for CI/CD, S3, HDFS, etc.), try the Apache Hadoop utilities:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

val path = new Path("base/path/to/data")
// Resolve the FileSystem behind this path (local, HDFS, S3, ...).
// Passing sparkSession.sparkContext.hadoopConfiguration instead of new Configuration()
// reuses the filesystem settings/credentials Spark itself is configured with.
val fs = path.getFileSystem(new Configuration())
// applicable for local and remote FS
if (fs.exists(path)) {
  val df = sparkSession.read.option("multiline", "true").json(path.toString)
}
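Building on that, a short sketch that wires the existence check into the question's read; readIfExists is a hypothetical helper name, and it assumes the SparkSession is in scope as spark:

import org.apache.hadoop.fs.Path
import org.apache.spark.sql.DataFrame

// Hypothetical helper: read the JSON only if the path exists, otherwise return None.
def readIfExists(pathStr: String): Option[DataFrame] = {
  val path = new Path(pathStr)
  val fs = path.getFileSystem(spark.sparkContext.hadoopConfiguration)
  if (fs.exists(path)) Some(spark.read.option("multiline", "true").json(pathStr))
  else None
}

val metricsDataDF: Option[DataFrame] = readIfExists(s"$dataPath/$metricsData.json")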
You can use java.nio.file (note that this only checks paths on the local filesystem):
import java.nio.file.{Files, Paths}

if (Files.exists(Paths.get(s"$dataPath/$metricsData.json"))) {
  val metricsDataDF: DataFrame = spark.read.option("multiline", "true")
    .json(s"$dataPath/$metricsData.json")
}
I am running both Hadoop and Spark, and I want to use files from HDFS as an argument to spark-submit, so I made a folder in HDFS with the files,
e.g. /user/hduser/test/input
and I want to run spark-submit like this:
$SPARK_HOME/bin/spark-submit --master spark://admin:7077 ./target/scala-2.10/test_2.10-1.0.jar hdfs://user/hduser/test/input
but I can't make it work. What's the right way to do it?
The error I am getting is:
WARN FileInputDStream: Error finding new files
java.lang.NullPointerException
Check whether you are able to access HDFS from your Spark code. If yes, then you need to add the following imports in your Scala code:
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path
import org.apache.spark.SparkFiles
Then add the following lines in your code:
// Get a handle to the configured HDFS filesystem and build a Path from the
// first argument passed to spark-submit.
val hadoopConf = new org.apache.hadoop.conf.Configuration()
val fileSystem = FileSystem.get(hadoopConf)
val path = new Path(args(0))
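To actually verify the argument, a small sketch of the check; it assumes the HDFS URI is passed as args(0), as in the spark-submit command above:

// Fail fast if the input directory is not visible from the Spark driver.
if (fileSystem.exists(path)) {
  println(s"Found input at $path")
} else {
  sys.error(s"Path $path does not exist on HDFS")
}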
Actually, the problem was the path. I had to use hdfs://localhost:9000/user/hduser/...
I am reading a SAS file from Azure Blob storage, converting it to CSV, and trying to upload the CSV back to Azure Blob storage. For small files (MBs) I can do this successfully with the following Spark Scala code:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import com.github.saurfang.sas.spark._
val sqlContext = new SQLContext(sc)
val df = sqlContext.sasFile("wasbs://container@storageaccount/input.sas7bdat")
df.write.format("csv").save("wasbs://container@storageaccount/output.csv")
But for large files (GBs) it gives me an AnalysisException: wasbs://container@storageaccount/output.csv file already exists. I have tried overwrite also, but no luck. Any help would be appreciated.
Actually, you normally cannot overwrite an existing file on HDFS, even for small files in MBs.
Please try the code below to overwrite; check your Spark version, because the method differs slightly between Spark versions.
df.write.format("csv").mode("overwrite").save("wasbs://container@storageaccount/output.csv")
I don't know whether you had already tried the overwrite mode above, as you said.
Another way is to first delete the existing files before doing the write operation:
// Delete any existing output (recursively) before writing; ignore failures if it is missing.
val hadoopConf = new org.apache.hadoop.conf.Configuration()
val hdfs = org.apache.hadoop.fs.FileSystem.get(
  new java.net.URI("<hdfs://<namenodehost>/ or wasb[s]://<containername>@<accountname>.blob.core.windows.net/<path> >"),
  hadoopConf)
try { hdfs.delete(new org.apache.hadoop.fs.Path(filepath), true) } catch { case _: Throwable => }
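Putting the two pieces together, a rough sketch of delete-then-write; the output URI below is the same placeholder used above, and df is the DataFrame from your code:

// Remove any previous output, then write the CSV with overwrite enabled as well.
val outputPath = "wasbs://container@storageaccount/output.csv"
val conf = new org.apache.hadoop.conf.Configuration()
val fs = org.apache.hadoop.fs.FileSystem.get(new java.net.URI(outputPath), conf)
fs.delete(new org.apache.hadoop.fs.Path(outputPath), true)  // returns false if nothing to delete
df.write.format("csv").mode("overwrite").save(outputPath)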
There is also a Spark mailing-list thread that discusses a similar issue; please see http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-make-Spark-1-0-saveAsTextFile-to-overwrite-existing-file-td6696.html.
In Scala, as an experiment I create a sequence file on Tachyon using Spark and read it back in. I want to delete the file from Tachyon using the Spark script also.
val rdd = sc.parallelize(Array(("a",2), ("b",3), ("c",1)))
rdd.saveAsSequenceFile("tachyon://127.0.0.1:19998/files/123.sf2")
val rdd2 = sc.sequenceFile[String,Int]("tachyon://127.0.0.1:19998/files/123.sf2")
I don't understand the Scala language very well and I cannot find a reference about file path manipulation. I did find a way of somehow using Java in Scala to do this, but I cannot get it to work using Tachyon.
import java.io._
new File("tachyon://127.0.0.1:19998/files/123.sf2").delete()
There are different approaches, e.g.:
CLI:
./bin/tachyon tfs rm filePath
More info: http://tachyon-project.org/Command-Line-Interface.html
API:
TachyonFS sTachyonClient = TachyonFS.get(args[0]);
sTachyonClient.delete(filePath, true);
More info:
https://github.com/amplab/tachyon/blob/master/core/src/main/java/tachyon/examples/BasicOperations.java
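Since the question is about doing this from a Scala Spark script, here is a rough Scala sketch of the same API call; it is only a sketch, the exact TachyonFS signatures depend on your Tachyon version, and the master address and file path simply mirror the ones used above:

import tachyon.client.TachyonFS

// Connect to the Tachyon master and delete the sequence file recursively.
val client = TachyonFS.get("tachyon://127.0.0.1:19998")
client.delete("/files/123.sf2", true)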