Count files in HDFS directory with Scala - scala

In Scala, I am trying to count the files in an HDFS directory.
I tried to get a list of the files with val files = fs.listFiles(path, false) and count them or get the size of the list, but it doesn't work because files is of type RemoteIterator[LocatedFileStatus], which has no size or length method.
Any idea how I should proceed?
Thanks for helping.

This has been done before, but generally people use the FSImage (a copy of the NameNode metadata file).
They then load it into a Hive table, which you can query for information about your HDFS file system.
There are good tutorials that explain how to export the fsimage and load it into a Hive table. The approach I prefer is the following:
Fetch and copy the fsimage file into HDFS
#connect to any hadoop cluster node as hdfs user
#downloads the fsimage file from namenode
hdfs dfsadmin -fetchImage /tmp
#converts the fsimage file into tab delimited file
hdfs oiv -i /tmp/fsimage_0000000000450297390 -o /tmp/fsimage.csv -p Delimited
#remove the header and copy to HDFS
sed -i -e "1d" fsimage.csv
hdfs dfs -mkdir /tmp/fsimage
hdfs dfs -copyFromLocal /tmp/fsimage.csv /tmp/fsimage
#create the intermediate external table in impala
CREATE EXTERNAL TABLE HDFS_META_D (
PATH STRING ,
REPL INT ,
MODIFICATION_TIME STRING ,
ACCESSTIME STRING ,
PREFERREDBLOCKSIZE INT ,
BLOCKCOUNT DOUBLE,
FILESIZE DOUBLE ,
NSQUOTA INT ,
DSQUOTA INT ,
PERMISSION STRING ,
USERNAME STRING ,
GROUPNAME STRING)
row format delimited
fields terminated by '\t'
LOCATION '/tmp/fsimage';
Once it's in a table, you can do the rest in Scala/Spark.
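For instance, a Spark SQL query over that table can answer the original question of counting files. A minimal sketch, assuming the HDFS_META_D table above and a HiveContext named sqlContext (the names here are illustrative):

```scala
// Count files under each top-level directory, using the PATH column
// of the HDFS_META_D table created above. The SQL is built as a plain
// string here so it can be inspected; run it via sqlContext.sql.
val countsSql =
  """SELECT regexp_extract(PATH, '^(/[^/]+)', 1) AS top_dir,
    |       COUNT(*) AS file_count
    |FROM HDFS_META_D
    |GROUP BY regexp_extract(PATH, '^(/[^/]+)', 1)""".stripMargin

// val counts = sqlContext.sql(countsSql)  // uncomment inside spark-shell
// counts.show()
println(countsSql)
```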

I ended up using:
var count: Int = 0
while (files.hasNext) {
  files.next
  count += 1
}
As a Scala beginner, I didn't know how to write the equivalent of count++ (the answer is count += 1). This actually works quite well.
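The same count can also be done without a mutable loop by adapting the RemoteIterator to a Scala Iterator. The sketch below is self-contained: the RemoteIteratorLike trait is a stand-in for Hadoop's org.apache.hadoop.fs.RemoteIterator so it runs without Hadoop on the classpath; with the real API the wrapper is written identically.

```scala
// Stand-in for Hadoop's RemoteIterator; with Hadoop available, use
// org.apache.hadoop.fs.RemoteIterator[LocatedFileStatus] instead.
trait RemoteIteratorLike[T] {
  def hasNext: Boolean
  def next(): T
}

// Adapt the Hadoop-style iterator to a Scala Iterator, which has size.
def asScalaIterator[T](it: RemoteIteratorLike[T]): Iterator[T] =
  new Iterator[T] {
    override def hasNext: Boolean = it.hasNext
    override def next(): T = it.next()
  }

// In-memory example standing in for fs.listFiles(path, false)
def fromSeq[T](xs: Seq[T]): RemoteIteratorLike[T] = new RemoteIteratorLike[T] {
  private val underlying = xs.iterator
  def hasNext: Boolean = underlying.hasNext
  def next(): T = underlying.next()
}

val files = fromSeq(Seq("a.parquet", "b.parquet", "c.parquet"))
val count = asScalaIterator(files).size
println(count) // prints 3
```

Note that size consumes the iterator, so count first and list afterward (or re-list) if you need both.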

Related

Check if directory contains json files using org.apache.hadoop.fs.Path in HDFS

I'm following the steps indicated here: Avoid "Path does not exist" in dir based spark load, to filter which directories in an array contain JSON files before sending them to the spark.read method.
When I use
inputPaths.filter(f => fs.exists(new org.apache.hadoop.fs.Path(f + "/*.json*")))
it returns empty despite JSON files existing in one of the paths. One of the comments says this doesn't work with HDFS. Is there a way to make this work?
I'm running this in a Databricks notebook.
There is a method for listing files in dir:
fs.listStatus(dir)
Something like this:
inputPaths.filter(f => fs.listStatus(f).exists(file => file.getPath.getName.endsWith(".json")))
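The name-based filter in that one-liner can be factored out and checked without a live HDFS. A minimal pure-Scala sketch, where listNames is a hypothetical stand-in for fs.listStatus(f).map(_.getPath.getName):

```scala
// Keep only directories whose listing contains at least one .json file.
// listNames abstracts over the filesystem: given a directory path,
// it returns the file names inside that directory.
def dirsWithJson(listNames: String => Seq[String], dirs: Seq[String]): Seq[String] =
  dirs.filter(dir => listNames(dir).exists(_.endsWith(".json")))

// Fake listing standing in for HDFS contents (paths are illustrative).
val fakeListing: Map[String, Seq[String]] = Map(
  "/data/a" -> Seq("part-0000.json", "_SUCCESS"),
  "/data/b" -> Seq("part-0000.parquet")
)

val kept = dirsWithJson(p => fakeListing.getOrElse(p, Seq.empty), Seq("/data/a", "/data/b"))
println(kept) // prints List(/data/a)
```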

Facing issue on adding Parallelism feature in an Avroconvertor application

I have an application that takes a zip file and converts the text files inside it into Avro files.
It executes the process serially, as follows:
1) Pick a zip file and unzip it
2) Take each text file under that zip file and read its content
3) Take the avsc (schema) files from a different location
4) Merge the text file content with the respective schema, producing an Avro file
But this process handles one file at a time.
Now I want to execute this process in parallel. I have all the zip files under a folder.
folder/
A.zip
B.zip
C.zip
1) Under each zip file there are text files which only consist of data (without Schema/headers)
My text file looks like this:
ABC 1234
XYZ 2345
EFG 3456
PQR 4567
2) Secondly, I have avsc files which contain the schema for the same text files.
My avsc file looks like:
{
"Name": String,
"Employee Id" : Int
}
As an experiment I used
SparkContext.parallelize(folder containing all the zip files).map { eachFile => /* avro conversion code */ }
but inside the Avro-conversion code (under SparkContext.parallelize) I used Spark's SparkContext.newAPIHadoopFile, which also returns an RDD.
So when I run the application with these changes, I get a Task not serializable issue.
I suspect this issue arises for one of two reasons:
1) I have used SparkContext inside SparkContext.parallelize
2) I have made an RDD inside an RDD
org.apache.spark.SparkException: Task not serializable
Now I need parallelism, but I'm not sure whether there is an alternative approach for this use case, or how to resolve this Task not serializable issue.
I am using Spark 1.6 and Scala 2.10.5.
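One common way around this (a sketch, not tested against the asker's code) is to keep all SparkContext calls on the driver and do the unzipping with plain java.util.zip inside a single map, e.g. over sc.binaryFiles("folder/*.zip"). The extraction itself then needs no Spark at all:

```scala
import java.io.ByteArrayInputStream
import java.util.zip.ZipInputStream

// Extract all entries of a zip held in memory as (entryName -> content).
// Because this uses only java.util.zip, it can run inside an RDD map
// with no nested SparkContext and nothing non-serializable captured.
def extractEntries(zipBytes: Array[Byte]): Map[String, String] = {
  val zis = new ZipInputStream(new ByteArrayInputStream(zipBytes))
  Iterator
    .continually(zis.getNextEntry)
    .takeWhile(_ != null)
    .map(entry => entry.getName -> scala.io.Source.fromInputStream(zis).mkString)
    .toMap
}
```

In Spark 1.6 this would be driven from the driver with something like sc.binaryFiles("folder/*.zip").map { case (path, stream) => extractEntries(stream.toArray()) }, with the Avro encoding applied per entry inside the same map.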

Spark-SQL: access file in current worker node directory

I need to read a file using spark-sql, and the file is in the current directory.
I use this command to decompress a list of files I have stored on HDFS:
val decompressCommand = Seq(laszippath, "-i", inputFileName, "-o", "out.las").!!
The file is written to the current worker node directory, and I know this because executing "ls -a".!! through Scala shows that the file is there. I then try to access it with the following command:
val dataFrame = sqlContext.read.las("out.las")
I assumed that the SQL context would try to find the file in the current directory, but it doesn't. Also, it doesn't throw an error but a warning stating that the file could not be found (so Spark continues to run).
I attempted to add the file using sparkContext.addFile("out.las") and then access the location using val location = SparkFiles.get("out.las"), but this didn't work either.
I even ran the command val locationPt = "pwd".!! and then did val fullLocation = locationPt + "/out.las" and attempted to use that value, but it didn't work either.
The actual exception that gets thrown is the following:
User class threw exception: org.apache.spark.sql.AnalysisException: cannot resolve 'x' given input columns: [];
org.apache.spark.sql.AnalysisException: cannot resolve 'x' given input columns: []
And this happens when I try to access column "x" from a DataFrame. I know that column 'x' exists because I've downloaded some of the files from HDFS, decompressed them locally, and run some tests.
I need to decompress the files one by one because I have 1.6 TB of data, so I cannot decompress it all at once and access the files later.
Can anyone tell me what I can do to access files which are being written to the worker node directory? Or should I be doing it some other way?
So I managed to do it now. What I'm doing is saving the file to HDFS and then retrieving it using the SQL context through HDFS. I overwrite "out.las" each time in HDFS so that I don't take up too much space.
I have used the Hadoop API before to get at files; I don't know if it will help you here.
val filePath = "/user/me/dataForHDFS/"
val fs:FileSystem = FileSystem.get(new java.net.URI(filePath + "out.las"), sc.hadoopConfiguration)
And I've not tested the below (the original version declared readIn as an empty array, so readFully would read zero bytes, and `val fileIn.readFully(...)` is not valid Scala), but it gives an idea of what to do afterward:
val status = fs.getFileStatus(new org.apache.hadoop.fs.Path(filePath + "out.las"))
val readIn: Array[Byte] = new Array[Byte](status.getLen.toInt)
val fileIn: FSDataInputStream = fs.open(status.getPath)
fileIn.readFully(0, readIn)
fileIn.close()

How to delete multiple hdfs directories starting with some word in Apache Spark

I have persisted object files in Spark Streaming using the dstream.saveAsObjectFiles("/temObj") method, and it shows multiple files in HDFS:
temObj-1506338844000
temObj-1506338848000
temObj-1506338852000
temObj-1506338856000
temObj-1506338860000
I want to delete all temObj files after reading them all. What is the best way to do it in Spark? I tried
val hdfs = org.apache.hadoop.fs.FileSystem.get(new java.net.URI("hdfs://localhost:9000"), hadoopConf)
hdfs.delete(new org.apache.hadoop.fs.Path(Path), true)
But it can only delete one folder at a time.
Unfortunately, delete doesn't support globs.
You can use globStatus and iterate over the files/directories one by one and delete them.
val hdfs = FileSystem.get(sc.hadoopConfiguration)
val deletePaths = hdfs.globStatus(new Path("/tempObj-*") ).map(_.getPath)
deletePaths.foreach{ path => hdfs.delete(path, true) }
Alternatively, you can use sys.process to execute shell commands
import scala.sys.process._
"hdfs dfs -rm -r /tempObj*" !
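The globStatus-plus-delete pattern above can be checked locally with java.nio as a stand-in for Hadoop's FileSystem. A sketch (names are illustrative; with HDFS, Files.newDirectoryStream corresponds to globStatus and Files.delete to hdfs.delete(path, true)):

```scala
import java.nio.file.{Files, Path}
import scala.collection.JavaConverters._

// Delete every entry in dir whose name matches the given glob,
// returning how many were removed. Mirrors globStatus + delete on HDFS.
def deleteMatching(dir: Path, glob: String): Int = {
  val stream = Files.newDirectoryStream(dir, glob)
  try {
    val matched = stream.iterator().asScala.toList
    matched.foreach(p => Files.delete(p)) // use recursive delete for non-empty dirs
    matched.size
  } finally stream.close()
}
```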

How to access hive data using spark

I have a table stored as a text file, e.g. employee, in Hive, and I want to access it using Spark.
First I set up a SQL context object using
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
Then I created the table
scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS employee(
id INT, name STRING, age INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY
',' LINES TERMINATED BY '\n'")
Then I tried to load the contents of the text file by using
scala> sqlContext.sql("LOAD DATA LOCAL INPATH 'employee.txt' INTO TABLE employee")
I am getting error as
SET hive.support.sql11.reserved.keywords=false
FAILED: SemanticException Line 1:23 Invalid path ''employee.txt'': No files
matching path file:/home/username/employee.txt
If I have to place the text file in the current directory where the spark-shell is running, how do I do that?
Do you run Hive on Hadoop?
Try using an absolute path. If that doesn't work, try loading your file into HDFS and then giving the absolute path to your file (the HDFS location).
Try the steps below:
Start spark-shell in local mode, e.g. spark-shell --master local[*]
Give the full path when loading the file
Eg: file:///home/username/employee.txt
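Putting the second suggestion into code: build the LOAD DATA statement from an absolute file:// path instead of a relative one. A sketch, assuming the file sits in the user's home directory and sqlContext is a HiveContext inside spark-shell:

```scala
// Build an absolute file:// URI so Hive doesn't resolve the path
// relative to its own working directory.
val localFile = new java.io.File(sys.props("user.home"), "employee.txt")
val loadStmt =
  s"LOAD DATA LOCAL INPATH 'file://${localFile.getAbsolutePath}' INTO TABLE employee"

// sqlContext.sql(loadStmt)  // run inside spark-shell
println(loadStmt)
```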