When I try to read a file using Scala it is not working, but the file does exist at this location.
Is this a permission issue?
scala> val loc = "C:\\Users\\gvenk\\OneDrive\\Desktop\\Input\\accountdetails.txt"
loc: String = C:\Users\gvenk\OneDrive\Desktop\Input\accountdetails.txt
scala> Source.fromFile(loc, "UTF-8").getLines().toList
java.io.FileNotFoundException: C:\Users\gvenk\OneDrive\Desktop\Input\accountdetails.txt (The
system cannot find the file specified)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStr
The solution is in the error message: The system cannot find the file specified. This means either the file doesn't exist at that path or the path is misspelled.
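One quick check, using the same path from the question, is to ask the JVM directly what it sees there; a minimal REPL sketch:

import java.io.File

val f = new File("C:\\Users\\gvenk\\OneDrive\\Desktop\\Input\\accountdetails.txt")
println(s"exists=${f.exists}, canRead=${f.canRead}")

// List the parent directory exactly as the JVM sees it; this catches typos and hidden
// double extensions such as "accountdetails.txt.txt" (listFiles returns null if the
// directory itself is missing, hence the Option wrapper).
Option(f.getParentFile.listFiles).foreach(_.foreach(println))

If exists comes back false while the file shows up in Explorer, the usual culprit is a hidden file extension or a slightly different directory name.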
I'm trying to build the Hello World example Scala file on my Windows 10 desktop, following:
https://github.com/bazelbuild/rules_scala
My WORKSPACE file is as recommended in the rules_scala repo.
My BUILD file is as follows:
load("#io_bazel_rules_scala//scala:scala.bzl", "scala_test", "scala_binary")
scala_binary(
name = "helloworld",
srcs = ["HelloWorld.scala"],
main_class="HelloWorld",
)
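HelloWorld.scala is nothing special, just a plain main object along these lines:

object HelloWorld {
  def main(args: Array[String]): Unit = {
    println("Hello, world!")
  }
}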
I've set my environment variable JAVA_HOME to where my javac.exe file is located. But when I try to build it, I get the following:
EDIT: After enabling developer mode on Windows, the first half of the error disappeared, but I still get:
PS C:\Users\m80\Documents\GitHub\BotsForGames\HexAgony\src\hexagony> bazel build :helloworld
INFO: Repository local_jdk instantiated at:
/DEFAULT.WORKSPACE.SUFFIX:40:22: in <toplevel>
Repository rule local_java_repository defined at:
C:/users/m80/_bazel_m80/4fgq25dr/external/bazel_tools/tools/jdk/local_java_repository.bzl:66:40: in <toplevel>
ERROR: An error occurred during the fetch of repository 'local_jdk':
Traceback (most recent call last):
File "C:/users/m80/_bazel_m80/4fgq25dr/external/bazel_tools/tools/jdk/local_java_repository.bzl", line 46, column 35, in _local_java_repository_impl
repository_ctx.symlink(file, file.basename)
Error in symlink: java.io.IOException: Could not create symlink from C:/Users/m80/_bazel_m80/4fgq25dr/external/remotejdk11_win/BUILD.bazel to C:/users/m80/_bazel_m80/4fgq25dr/external/local_jdk/BUILD.bazel: C:/users/m80/_bazel_m80/4fgq25dr/external/local_jdk/BUILD.bazel (File exists)
ERROR: Error fetching repository: Traceback (most recent call last):
File "C:/users/m80/_bazel_m80/4fgq25dr/external/bazel_tools/tools/jdk/local_java_repository.bzl", line 46, column 35, in _local_java_repository_impl
repository_ctx.symlink(file, file.basename)
Error in symlink: java.io.IOException: Could not create symlink from C:/Users/m80/_bazel_m80/4fgq25dr/external/remotejdk11_win/BUILD.bazel to C:/users/m80/_bazel_m80/4fgq25dr/external/local_jdk/BUILD.bazel: C:/users/m80/_bazel_m80/4fgq25dr/external/local_jdk/BUILD.bazel (File exists)
ERROR: C:/users/m80/_bazel_m80/4fgq25dr/external/bazel_tools/tools/jdk/BUILD:336:6: @bazel_tools//tools/jdk:jdk depends on @local_jdk//:jdk in repository @local_jdk which failed to fetch. no such package '@local_jdk//': java.io.IOException: Could not create symlink from C:/Users/m80/_bazel_m80/4fgq25dr/external/remotejdk11_win/BUILD.bazel to C:/users/m80/_bazel_m80/4fgq25dr/external/local_jdk/BUILD.bazel: C:/users/m80/_bazel_m80/4fgq25dr/external/local_jdk/BUILD.bazel (File exists)
ERROR: Analysis of target '//hexagony/src/hexagony:helloworld' failed; build aborted: Analysis failed
INFO: Elapsed time: 0.588s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (3 packages loaded, 39 targets configured)
What does this mean, and how can I resolve it? I couldn't find many similar problems regarding Scala + Bazel.
I am running the code below in PyCharm. It works properly if I provide --jars through the command prompt:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pySparksqLite_test")\
    .config('spark.jars.packages', "C:/jars/DataVisualization/sqlite-jdbc-3.20.0.jar")\
    .getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", "5")

df_flight_info = spark.read.format("jdbc")\
    .options(url="jdbc:sqlite:C:/sqlite-tools-win32-x86-3290000/my-sqlite.db",
             driver="org.sqlite.JDBC",
             dbtable="(select DEST_COUNTRY_NAME,ORIGIN_COUNTRY_NAME,count from flight_info)")\
    .load()
but with PyCharm I am getting the error below:
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Provided Maven Coordinates must be in the form 'groupId:artifactId:version'. The coordinate provided is: C:/Users/jars/sqlite-jdbc-3.20.0.jar
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.deploy.SparkSubmitUtils$$anonfun$extractMavenCoordinates$1.apply(SparkSubmit.scala:1000)
at org.apache.spark.deploy.SparkSubmitUtils$$anonfun$extractMavenCoordinates$1.apply(SparkSubmit.scala:998)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
at org.apache.spark.deploy.SparkSubmitUtils$.extractMavenCoordinates(SparkSubmit.scala:998)
at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1220)
at org.apache.spark.deploy.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:49)
at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:350)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:170)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Traceback (most recent call last):
File "C:/...../proj1/pySparksqLite.py", line 4, in <module>
config('spark.jars.packages', "C:/Users/jars/sqlite-jdbc-3.20.0.jar").getOrCreate()
File "C:\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\sql\session.py", line 173, in getOrCreate
sc = SparkContext.getOrCreate(sparkConf)
File "C:\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\context.py", line 331, in getOrCreate
SparkContext(conf=conf or SparkConf())
File "C:\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\context.py", line 115, in __init__
SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
File "C:\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\context.py", line 280, in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway(conf)
File "C:\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\java_gateway.py", line 95, in launch_gateway
raise Exception("Java gateway process exited before sending the driver its port number")
Exception: Java gateway process exited before sending the driver its port number
Process finished with exit code 1
I have also tried providing the jar file path through an environment variable, setting it through os:
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars C:/Users/jars/sqlite-jdbc-3.27.2.jar'
but even this is not working
The equivalent of the --jars submit parameter is spark.jars, which lets you specify local jars to be transferred to the cluster. You used spark.jars.packages, which instead downloads packages from Maven by their Maven coordinates; the submit-parameter equivalent of that is --packages.
Have a look at the documentation for more information: configuration and submitting parameters.
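For example, a minimal sketch of the corrected builder using spark.jars, assuming the jar path from the question (shown in Scala here; the same config keys work from PySpark's SparkSession.builder):

import org.apache.spark.sql.SparkSession

// spark.jars takes a comma-separated list of local jar paths to ship to the cluster
// (the config equivalent of --jars). spark.jars.packages instead expects Maven
// coordinates of the form groupId:artifactId:version (the equivalent of --packages),
// e.g. "org.xerial:sqlite-jdbc:3.20.0" for this driver.
val spark = SparkSession.builder()
  .appName("pySparksqLite_test")
  .config("spark.jars", "C:/jars/DataVisualization/sqlite-jdbc-3.20.0.jar")
  .getOrCreate()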
While doing I/O in Scala using Cygwin, I copied data to this location:
/cygdrive/c/DataResearch/retail_db/order_items/part-00000
but when I am trying to access the file from the Scala prompt with the following command I get this error:
val orderItems = Source
.fromFile("/cygdrive/c/DataResearch/retail_db/order_items/part-00000")
Error:
java.io.FileNotFoundException: \c\DataResearch\retail_db\order_items\part-00000
(The system cannot find the path specified)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(Unknown Source)
at java.io.FileInputStream.<init>(Unknown Source)
at scala.io.Source$.fromFile(Source.scala:91)
at scala.io.Source$.fromFile(Source.scala:76)
at scala.io.Source$.fromFile(Source.scala:54)
What can I try to resolve this?
I'm new to Spark and trying to figure out how the pipe method works. I have the following code in Scala
sc.textFile(hdfsLocation).pipe("preprocess.py").saveAsTextFile(hdfsPreprocessedLocation)
The values hdfsLocation and hdfsPreprocessedLocation are fine. As proof, the following code works from the command line
hadoop fs -cat hdfsLocation/* | ./preprocess.py | head
When I run the above Spark code I get the following errors
14/11/25 09:41:50 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.io.IOException: Cannot run program "preprocess.py": error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1041)
at org.apache.spark.rdd.PipedRDD.compute(PipedRDD.scala:119)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: java.io.IOException: error=2, No such file or directory
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.<init>(UNIXProcess.java:135)
at java.lang.ProcessImpl.start(ProcessImpl.java:130)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1022)
... 12 more
In order to solve this for Hadoop streaming I would just use the --files attribute, so I tried the same thing for Spark. I start Spark with the following command
bin/spark-shell --files ./preprocess.py
but that gave the same error.
I couldn't find a good example of using Spark with an external process via pipe, so I'm not sure if I'm doing this correctly. Any help would be greatly appreciated.
Thanks
I'm not sure if this is the correct answer, so I won't finalize this, but it appears that the file paths differ when running Spark in local and cluster mode. When running Spark without --master, the path to the pipe command is relative to the local machine. When running Spark with --master, the path to the pipe command is ./
UPDATE:
This actually isn't correct. I was using SparkFiles.get() to get the file name. It turns out that when calling .pipe() on an RDD the command string is evaluated on the driver and then passed to the workers. Because of this, SparkFiles.get() is not the appropriate way to get the file name. The file name should be ./ because SparkContext.addFile() should put that file at ./ relative to where each worker is run from. But I'm so sour on .pipe now that I've taken it out of my code entirely in favor of .mapPartitions in combination with a PipeUtils object that I wrote here. This is actually more efficient because I only have to incur the script startup cost once per partition instead of once per example.
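For reference, a minimal sketch of the --files / addFile plus pipe pattern described above, written for spark-shell with sc, hdfsLocation and hdfsPreprocessedLocation defined as in the question, and assuming preprocess.py is executable and reads stdin / writes stdout:

// Ship the script to every executor (same effect as launching with --files ./preprocess.py).
sc.addFile("preprocess.py")

// The command string is assembled on the driver and run on each worker, so it refers to
// the script by a path relative to the worker's working directory, as noted above.
val processed = sc.textFile(hdfsLocation).pipe("./preprocess.py")
processed.saveAsTextFile(hdfsPreprocessedLocation)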
I have downloaded and installed NCTOOLBOX into MATLAB (2013a) to read netcdf and grb files. As a test, I copied a netcdf, grb and grb2 file to a directory on my computer. This is placed within my script as:
pathnc = 'c:\test\era40_moda_200205.nc'
pathgrb = 'c:\test\era40_moda_200205.grb'
pathgrb2 = 'c:\test\multi_1.at_4m.dp.200607.grb2'
I used the following code to read the *.nc file:
nc = ncdataset(pathnc);
nc.variables
The code works great, with no error messages and all variables listed, on netcdf files. However, when I run it for the grb files using:
nc = ncdataset(pathgrb);
nc.variables
I get this very long list of errors:
2014-03-05 08:40:15,744 [main] WARN ucar.nc2.grib.grib2.Grib2Index - Grib2Index bad size = -1 for c:/test/multi_1.at_4m.dp.200607.grb2 index = c:\test\multi_1.at_4m.dp.200607.grb2.gbx9
Warning: Escape sequence '\m' is not valid. See 'help
sprintf' for valid escape sequences.
> In ncdataset>ncdataset.ncdataset at 89
In GRIB_and_NC_Reader_Prog at 14
Error using ncdataset (line 91)
Failed to open c: est
Error in GRIB_and_NC_Reader_Prog (line 14)
nc = ncdataset(pathgrb2);
Caused by:
Error using ncdataset (line 75)
Java exception occurred:
java.lang.RuntimeException: java.lang.NoSuchFieldError:
alwaysUseFieldBuilders ... etc., etc., ad nauseam.
In case it was just a bad file, I tried the code on a different grb file and got the same results. Yes, I have read the previous posts on reading grb files with NCTOOLBOX, but I'm still dead in the water. I would greatly appreciate any insight into getting my script to read grb and grb2 files.
I was getting a similar Java error: java.lang.NoSuchFieldError: alwaysUseFieldBuilders. I tried running the same code in R2014a and it worked.