Reading a compressed file in Spark with Scala

I am trying to read the content of a .gz file into a DataFrame/RDD in Spark/Scala using the following code:
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
val sc = new SparkContext(conf)
// wholeTextFiles returns (filename, content) pairs, so the whole file is read as a single record
val data = sc.wholeTextFiles("path to gz file")
data.collect().foreach(println)
The .gz file is 28 MB, and I do the spark-submit using this command:
spark-submit --class sample --master local[*] target\spark.jar
It gives me a Java heap space error in the console.
Is this the best way of reading a .gz file, and if so, how can I solve the Java heap space error?
Thanks

Disclaimer: That code and description will purely read in a small compressed text file using Spark, collect it to an array of every line, and print every line in the entire file to the console. The number of ways and reasons to do this outside of Spark far outnumber those to do it in Spark.
1) Use SparkSession instead of SparkContext if you can swing it. sparkSession.read.text() is the command to use (it automatically handles a few compression formats, including gzip).
2) Or at least use sc.textFile() instead of wholeTextFiles(), so each line becomes a separate record instead of the whole file being a single record.
3) You're calling .collect on that data, which brings the entire file back to the driver (in this case, since you're running local, you're not network bound). Add the --driver-memory option to spark-submit (or spark-shell) to increase memory if you MUST do the collect. A sketch combining these points follows this list.
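A minimal sketch of points 1) to 3), assuming a hypothetical path and app name (not from the original question); the text reader decompresses .gz files transparently:

import org.apache.spark.sql.SparkSession

// Build a SparkSession instead of a raw SparkContext
val spark = SparkSession.builder()
  .appName("read-gz-sample") // hypothetical app name
  .getOrCreate()

// read.text handles the gzip codec automatically; each line becomes one row
val df = spark.read.text("path/to/file.gz") // hypothetical path

// Avoid collect() on anything large; inspect a few rows instead
df.show(20, truncate = false)

If you really must collect the whole file to the driver, give the driver more memory when submitting, for example:

spark-submit --class sample --master local[*] --driver-memory 4g target\spark.jar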

Related

Reading Excel (xlsx) with PySpark does not work above a certain medium size

I have the following cluster configuration in Databricks: 64 GB, 8 cores.
The tests were carried out with this as the only notebook in the cluster; there were no other notebooks running at the time.
I find that reading a simple 30 MB Excel file in Spark keeps loading and never finishes. I am using the following code for this purpose:
sdf = spark.read.format("com.crealytics.spark.excel")\
.option("header", True)\
.option("inferSchema", "true")\
.load(my_path)
display(sdf)
I have tried reducing the Excel file, and it works fine up to 15 MB.
As a workaround I am going to export the Excel file to CSV and read it from there, but I find it shocking that Spark can't even read 30 MB of Excel.
Or am I doing something wrong in the configuration?
You need to install these two libraries on your Databricks cluster to read Excel files. Follow these paths to install them:
Clusters -> select your cluster -> Libraries -> Install New -> Maven -> in Coordinates: com.crealytics:spark-excel_2.12:0.13.5
Clusters -> select your cluster -> Libraries -> Install New -> PyPI -> in Package: xlrd
Now you will be able to read your Excel file as follows:
sdf = spark.read.format("com.crealytics.spark.excel") \
.option("header", "true") \
.option("inferSchema", "true") \
.option("dataAddress", "'NameOfYourExcelSheet'!A1") \
.load(filePath)
Can you please try the option below, as shown in the spark-excel GitHub repo?
Based on your input you can modify the number of rows; the value 20 is just a sample value.
.option("maxRowsInMemory", 20) // Optional, default None. If set, uses a streaming reader which can help with big files (will fail if used with xls format files)
As mentioned above, the option does not work for .xls files.
In case the files are really big, consider the options shown in the linked issue #590.
Please validate before using any of the options specified.
Cheers...
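A short Scala sketch of a read using that option (the same options apply from PySpark); it assumes the com.crealytics:spark-excel library is installed on the cluster and that my_path points at the workbook, as in the question:

// maxRowsInMemory switches spark-excel to its streaming reader, which helps with large .xlsx files
val sdf = spark.read
  .format("com.crealytics.spark.excel")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("maxRowsInMemory", 20) // sample value; tune to your file
  .load("my_path")               // path from the question
sdf.show(5)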

How to provide a text file location in Spark when the file is on a server

I am learning Spark to use it in my project. I want to run this command in the spark-shell:
val rddFromFile = spark.sparkContext.textFile("abc");
where abc is the file location. My file is on a remote server, and I open the spark-shell from that remote server. How should I specify the file location?
I tried putting a text file on the local C drive and providing that location to read it; that also did not work. I am getting a similar error for every file location I try.
Error:
scala> val rddFromFile = spark.sparkContext.textFile("C:/Users/eee/Spark test/Testspark.txt")
rddFromFile: org.apache.spark.rdd.RDD[String] = C:/Users/eee/Spark test/Testspark.txt MapPartitionsRDD[1] at textFile at <console>:23
scala> rddFromFile.collect().foreach(f=>{
| println(f)
| })
org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "C"
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3281)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3301)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:268)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:239)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:325)
Spark is expecting the file to be present in the Hadoop FS, as it looks like that's the default file system set in your app.
To load a file from the local FS, you need to specify the path like this:
val rddFromFile = spark.sparkContext.textFile("file:///C:/Users/eee/Spark test/Testspark.txt")
That will work when you run Spark in local mode.
If you run Spark on a cluster, the file would have to be present on all executor nodes.
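As a quick check (not part of the original answer), you can inspect which default file system Spark resolves unqualified paths against by reading the Hadoop configuration it carries:

// Prints the default file system, e.g. an hdfs:// URI or file:///
val defaultFs = spark.sparkContext.hadoopConfiguration.get("fs.defaultFS")
println(defaultFs)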

Copy file from HDFS to HDFS in Scala

Is there a known way, using the Hadoop API / Spark Scala, to copy files from one directory to another on HDFS?
I have tried using copyFromLocalFile, but it was not helpful.
Try Hadoop's FileUtil.copy() method, as described here: https://hadoop.apache.org/docs/r2.8.5/api/org/apache/hadoop/fs/FileUtil.html#copy(org.apache.hadoop.fs.FileSystem,%20org.apache.hadoop.fs.Path,%20org.apache.hadoop.fs.FileSystem,%20org.apache.hadoop.fs.Path,%20boolean,%20org.apache.hadoop.conf.Configuration)
val conf = new org.apache.hadoop.conf.Configuration()
val srcPath = new org.apache.hadoop.fs.Path("hdfs://my/src/path")
val dstPath = new org.apache.hadoop.fs.Path("hdfs://my/dst/path")

org.apache.hadoop.fs.FileUtil.copy(
  srcPath.getFileSystem(conf),
  srcPath,
  dstPath.getFileSystem(conf),
  dstPath,
  true, // deleteSource: true removes the source after copying (a move); pass false for a plain copy
  conf
)
As I understand your question, the answer is quite simple. Fundamentally there is no difference between your OS file system and a distributed file system when it comes to concepts like copying files; each just has its own command syntax. For instance, when you want to copy a file from one directory to another you can do something like:
hdfs dfs -cp /dir_1/file_1.txt /dir_2/file_1_new_name.txt
The first part of the example command (hdfs dfs) just routes the command to HDFS rather than to the OS's own file system.
For further reading: copying data in hdfs

How to continuously monitor a directory by using Spark Structured Streaming

I want Spark to continuously monitor a directory and read CSV files with spark.readStream as soon as a file appears in that directory.
Please don't include a solution based on Spark Streaming (DStreams). I am looking for a way to do it using Spark Structured Streaming.
Here is the complete solution for this use case.
If you are running in standalone mode, you can increase the driver memory like this:
bin/spark-shell --driver-memory 4G
There is no need to set the executor memory, as in standalone mode the executor runs within the driver.
Completing the solution from @T.Gaweda, find the full example below:
val userSchema = new StructType().add("name", "string").add("age", "integer")
val csvDF = spark
.readStream
.option("sep", ";")
.schema(userSchema) // Specify schema of the csv files
.csv("/path/to/directory") // Equivalent to format("csv").load("/path/to/directory")
csvDF.writeStream.format("console").option("truncate", "false").start()
Now Spark will continuously monitor the specified directory, and as soon as you add any CSV file to it, the streaming query on csvDF will pick the file up and process it.
Note: If you want Spark to infer the schema, you first have to set the following configuration:
spark.sqlContext.setConf("spark.sql.streaming.schemaInference", "true")
where spark is your SparkSession.
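On newer Spark versions the same setting can also be applied through the session's runtime configuration (an equivalent call, not from the original answer):

// Equivalent runtime-config call on the SparkSession
spark.conf.set("spark.sql.streaming.schemaInference", "true")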
As written in the official documentation, you should use the "file" source:
File source - Reads files written in a directory as a stream of data. Supported file formats are text, csv, json, parquet. See the docs of the DataStreamReader interface for a more up-to-date list, and supported options for each file format. Note that the files must be atomically placed in the given directory, which in most file systems, can be achieved by file move operations.
Code example taken from documentation:
// Read all the csv files written atomically in a directory
val userSchema = new StructType().add("name", "string").add("age", "integer")
val csvDF = spark
.readStream
.option("sep", ";")
.schema(userSchema) // Specify schema of the csv files
.csv("/path/to/directory") // Equivalent to format("csv").load("/path/to/directory")
If you don't specify a trigger, Spark will read new files as soon as possible.
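If you do want to control how often the directory is polled, here is a small sketch adding a processing-time trigger to the query from the answer above (the 10-second interval is an arbitrary example):

import org.apache.spark.sql.streaming.Trigger

// Poll the directory roughly every 10 seconds instead of "as soon as possible"
csvDF.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("10 seconds")) // arbitrary example interval
  .start()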

Spark Tachyon: How to delete a file?

In Scala, as an experiment, I create a sequence file on Tachyon using Spark and read it back in. I also want to delete the file from Tachyon from the same Spark script.
val rdd = sc.parallelize(Array(("a",2), ("b",3), ("c",1)))
rdd.saveAsSequenceFile("tachyon://127.0.0.1:19998/files/123.sf2")
val rdd2 = sc.sequenceFile[String,Int]("tachyon://127.0.0.1:19998/files/123.sf2")
I don't know Scala very well, and I cannot find a reference on file path manipulation. I did find a way of using Java from Scala to do this, but I cannot get it to work with Tachyon.
import java.io._
new File("tachyon://127.0.0.1:19998/files/123.sf2").delete()
There are different approaches, e.g.:
CLI:
./bin/tachyon tfs rm filePath
More info: http://tachyon-project.org/Command-Line-Interface.html
API:
TachyonFS sTachyonClient = TachyonFS.get(args[0]);
sTachyonClient.delete(filePath, true);
More info:
https://github.com/amplab/tachyon/blob/master/core/src/main/java/tachyon/examples/BasicOperations.java
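Since Spark itself reads and writes tachyon:// paths through the Hadoop FileSystem layer, another option is to delete the path with that same API from your Spark/Scala script. A sketch, assuming the Tachyon Hadoop client is on the classpath so the tachyon:// scheme resolves:

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

val path = new Path("tachyon://127.0.0.1:19998/files/123.sf2")
// Resolve the file system for the tachyon:// scheme using Spark's Hadoop configuration
val fs = FileSystem.get(new URI(path.toString), sc.hadoopConfiguration)
fs.delete(path, true) // recursive = true, since saveAsSequenceFile writes a directory of part files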