Apache Spark Multiple sources found for csv Error - scala

I'm trying to run my Spark program with the spark-submit command (I'm working with Scala). I specified the master address, the class name, the jar file with all dependencies, the input file, and the output file, but I'm getting an error:
Exception in thread "main" org.apache.spark.sql.AnalysisException:
Multiple sources found for csv
(org.apache.spark.sql.execution.datasources.v2.csv.CSVDataSourceV2,
org.apache.spark.sql.execution.datasources.csv.CSVFileFormat), please
specify the fully qualified class name.;
Here is a screenshot of the error. What does it mean, and how can I fix it?
Thank you

There are some warnings here as well.
If you run your fat jar with the correct permissions, you should get output like this from ./spark-submit.
Check whether the environment variables for Spark are set correctly (~/.bashrc). Also check the permissions of the source CSV file; that may be the problem.
If you are running in a Linux environment, set the permissions on the source CSV folder as follows:
sudo chmod -R 777 /source_folder
After that, try running ./spark-submit with your fat jar again.
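Separately, the error message itself suggests a workaround: name the CSV source by its fully qualified class name instead of the short name csv. A minimal sketch, assuming the CSV is read in application code (the path and header option are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("CsvExample").getOrCreate()
// Using the fully qualified class name reported in the error instead of "csv"
// avoids the ambiguity between the two CSV sources listed in the exception.
val df = spark.read
  .format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat")
  .option("header", "true")
  .load("/path/to/input.csv")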

Related

docker-compose cannot find the yaml file

I've placed a docker compose file project.yaml at the location /etc/project/project.yaml
The file as well as the project directory have the same permissions, i.e. -rxwrxxrwx,
but when I run docker-compose
sudo docker-compose -f ./project.yaml up -d
it errors out with the following:
Cannot find the file ./project.yaml
I have checked multiple times and it seems there is no permission issue. Can anyone tell why this problem occurs and what the solution would be?
Besides using the full path, as commented by quoc9x, double-check your current working directory when you call a command with a relative path like ./project.yaml.
If you are not in the right folder, that would explain the error message.
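For illustration, a hedged version of the same command with the absolute path given above, so the current working directory no longer matters:

sudo docker-compose -f /etc/project/project.yaml up -d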

Exception in thread "main" java.io.FileNotFoundException: t.txt (No such file or directory)

I wrote a Scala script named updateTables.scala and created a .txt file named t.txt under the directory com/sk/data in IntelliJ IDEA. One line in updateTables.scala tries to read t.txt: val wholeQuery = Source.fromFile("t.txt").
I use mvn clean package in Windows PowerShell to build a jar from the project SparkScalaExample and then use spark-submit to execute the jar, but a FileNotFoundException is thrown.
I have tried various paths of t.txt, such as
absolute path D:\SparkScalaExample\src\main\scala\com\sk\data\t.txt,
path from content root src/main/scala/com/sk/data/t.txt,
path from source root com/sk/data/t.txt, but none of them works. Putting t.txt directly under SparkScalaExample doesn't work either.
The following picture shows the directory structure of the project SparkScalaExample.
So, where should I put t.txt and which path of t.txt should I use?
Thanks very much for your help!
Because I run spark-submit on a server, two requirements have to be met if I want to read the file t.txt with Source.fromFile:
t.txt has to be on the server;
the path passed to Source.fromFile has to be the file's actual location on that server.
After meeting these two requirements, it works!
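A minimal sketch of what that looks like, assuming t.txt was copied to the server (the absolute path below is a placeholder for wherever the file actually lives there):

import scala.io.Source

// Source.fromFile resolves the path on the machine running the driver,
// not inside the jar, so it must point to t.txt's location on the server.
val wholeQuery = Source.fromFile("/home/user/data/t.txt").mkString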

install4j: Installation doesn't create an alternativeLogfile

When I invoke the installer with:
installerchecker_windows-x64_19_2_1_0-SNAPSHOT.exe
-q
-c
-varfile install.varfile
-Dinstall4j.alternativeLogfile=d:/tmp/logs/installchecker.log
-Dinstall4j.logToStderr=true
it creates and writes the standard log file installation.log in the .install4j directory, but doesn't create my custom log in d:/tmp/logs. As configured, there is an additional error.log with the correct content.
The installation.log shows the command-line config: install4j.alternativeLogfile=d:/tmp/logs/installchecker.log
The directory d:/tmp/logs has full access.
Where is the failure in my config?
The alternative log file is intended for debugging situations where the installer fails. To prevent the log file from being moved to its final destination in .install4j/installation.log, the VM parameter
-Dinstall4j.noPermanentLogFile=true
can be specified.
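For illustration, a hedged version of the full invocation that combines the parameters from the question with this one (installer name and paths exactly as in the question):

installerchecker_windows-x64_19_2_1_0-SNAPSHOT.exe
-q
-c
-varfile install.varfile
-Dinstall4j.alternativeLogfile=d:/tmp/logs/installchecker.log
-Dinstall4j.logToStderr=true
-Dinstall4j.noPermanentLogFile=true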

Scala Spark - overwrite parquet file failed to delete file or dir

I'm trying to create parquet files locally for several days of data. The first time I run the code, everything works fine. The second time, it fails to delete a file. The third time, it fails to delete another file. It's totally random which file cannot be deleted.
The reason I need this to work is that I want to create parquet files every day for the last seven days, so the parquet files that are already there should be overwritten with the updated data.
I use Project SDK 1.8, Scala version 2.11.8 and Spark version 2.0.2.
After running that line of code the second time:
newDF.repartition(1).write.mode(SaveMode.Overwrite).parquet(
OutputFilePath + "/day=" + DateOfData)
this error occurs:
WARN FileUtil:
Failed to delete file or dir [C:\Users\...\day=2018-07-15\._SUCCESS.crc]:
it still exists.
Exception in thread "main" java.io.IOException:
Unable to clear output directory file:/C:/Users/.../day=2018-07-15
prior to writing to it
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:91)
After the third time:
WARN FileUtil: Failed to delete file or dir
[C:\Users\day=2018-07-20\part-r-00000-8d1a2bde-c39a-47b2-81bb-decdef8ea2f9.snappy.parquet]: it still exists.
Exception in thread "main" java.io.IOException: Unable to clear output directory file:/C:/Users/day=2018-07-20 prior to writing to it
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:91)
As you see it's another file than the second time running the code.
And so on. After deleting the files manually, all parquet files can be created.
Does somebody know that issue and how to fix it?
Edit: It's always a crc-file that can't be deleted.
Thanks for your answers. :)
The solution is not to write in the Users directory. There seems to be a permission problem, so I created a new folder directly in the C: directory and it works perfectly.
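A minimal sketch of that workaround, assuming a new output folder directly under C: (the folder name is a placeholder; newDF and DateOfData are as in the question):

import org.apache.spark.sql.SaveMode

// Write outside the Users directory, e.g. to a dedicated folder on C:
val OutputFilePath = "C:/spark-output"  // placeholder path outside Users
newDF.repartition(1)
  .write
  .mode(SaveMode.Overwrite)
  .parquet(OutputFilePath + "/day=" + DateOfData)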
This problem occurs when you have the destination directory open in Windows. You just need to close the directory.
Perhaps another Windows process has a lock on the file so it can't be deleted.

input file or path doesn't exist when executing scala script

I just started learning Spark/Scala; here is a confusing issue I came across in my very first practice:
I created a test file: input.txt in /etc/spark/bin
I created a RDD
I started to do a word count but received the error saying Input path does not exist
Here is a screenshot:
Why is input.txt not picked up by Scala? If it is permission-related, the file was created by root, but I am also running Spark/Scala as root.
Thank you very much.
Spark reads files from HDFS. Copy the file there with:
hadoop fs -put /etc/spark/bin/input.txt /hdfs/path
If it's a local installation, the file will be copied into the Hadoop folder.
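For completeness, a hedged word-count sketch that reads the file from HDFS after the copy above (the path is a placeholder matching the hadoop fs -put command; sc is the SparkContext provided by spark-shell):

// Read from HDFS, where Spark looks by default in this setup,
// then run the classic word count.
val lines = sc.textFile("/hdfs/path/input.txt")
val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.collect().foreach(println)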