I am trying to split my CSV files into multiple CSV part files using Spark with Scala. When I execute the code manually it works fine and I am able to see the part files, but when I run the same command from a jar submitted with spark-submit I get an error like:
split: cannot open 'file_location/filename' for reading: No such file or directory
Can someone please guide me on what the issue is here?
code:
import scala.sys.process._

val file1 = "filelocation/filename"      // path of the source CSV (placeholder)
val file2 = file1.replace(".csv", "_")   // prefix for the generated part files
if (fs.exists(new org.apache.hadoop.fs.Path(file1))) {
  s"split -l 20000000 -d --additional-suffix=.csv /hadoop$file1 /hadoop$file2".!
}
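The error from split typically means the /hadoop... path is not visible on the machine where the command actually runs. As a point of reference only, here is a minimal sketch (the local paths are hypothetical; it reuses the fs handle and file1 from above) that copies the HDFS file to a node-local directory before invoking split:
import org.apache.hadoop.fs.Path
import scala.sys.process._

// Pull the file out of HDFS to a local working directory first (illustrative paths)
val localDir  = "/tmp/split_work"
val localFile = s"$localDir/input.csv"
new java.io.File(localDir).mkdirs()
fs.copyToLocalFile(new Path(file1), new Path(localFile))

// Split the local copy; the part files are written alongside it
s"split -l 20000000 -d --additional-suffix=.csv $localFile $localDir/part_".!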
I have imported a table using sqoop and saved it as a sequence file.
How do I read this file into an RDD or Dataframe?
I have tried sc.sequenceFile(), but I'm not sure what to pass as keyClass and valueClass. I tried using org.apache.hadoop.io.Text and org.apache.hadoop.io.LongWritable for keyClass and valueClass,
but it did not work. I am using PySpark to read the files.
It is not working in Python; however, in Scala it works.
You need to do the following steps:
step1:
If you import the table as a sequence file from Sqoop, a jar file is generated, and you need to use that as the valueClass while reading the sequence file. This jar file is generally placed in the /tmp folder, but you can redirect it to a specific folder (i.e. a local folder, not HDFS) using the --bindir option.
example:
sqoop import --connect jdbc:mysql://ms.itversity.com/retail_export \
  --username retail_user --password itversity --table customers -m 1 \
  --target-dir '/user/srikarthik/udemy/practice4/problem2/outputseq' \
  --as-sequencefile --delete-target-dir --bindir /home/srikarthik/sqoopjars/
step2:
Also, you need to download the jar file from the link below:
http://www.java2s.com/Code/Jar/s/Downloadsqoop144hadoop200jar.htm
step3:
Suppose the customers table has been imported with Sqoop as a sequence file.
Run spark-shell --jars path-to-customers.jar,sqoop-1.4.4-hadoop200.jar
example:
spark-shell --master yarn --jars /home/srikarthik/sqoopjars/customers.jar,/home/srikarthik/tejdata/kjar/sqoop-1.4.4-hadoop200.jar
step4: Now run the commands below inside the spark-shell:
scala> import org.apache.hadoop.io.LongWritable
scala> val data = sc.sequenceFile[LongWritable,customers]("/user/srikarthik/udemy/practice4/problem2/outputseq")
scala> data.map(tup => (tup._1.get(), tup._2.toString())).collect.foreach(println)
You can use SeqDataSourceV2 package to read the sequence file with the DataFrame API without any prior knowledge of the schema (aka keyClass and valueClass).
Please note that the current version is only compatible with Spark 2.4.
$ pyspark --jars seq-datasource-v2-0.2.0.jar
df = spark.read.format("seq").load("data.seq")
df.show()
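The same data source should be usable from Scala as well; a minimal sketch, assuming the jar is on the classpath (e.g. via --jars) and reusing the format name and path from the PySpark example above:
// Read the sequence file through the SeqDataSourceV2 reader from Scala
val df = spark.read.format("seq").load("data.seq")
df.show()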
I am using Cucumber with Scala and the below jars:
cucumber-junit-1.2.0.jar
cucumber-core-1.2.0.jar
cucumber-html-0.2.3.jar
cucumber-jvm-deps-1.0.3.jar
cucumber-java-1.2.0.jar
I am using the Cucumber framework in my big data testing, and Spark to read/write/process the data.
I am using the cucumber cli.Main method to run my features:
import cucumber.api.cli.Main
import org.apache.spark.SparkFiles

val glue     = args(0)
val gluePath = args(1)
val tag      = args(2)
val tagName  = args(3)

val fileNames    = args(4)
val arrFileNames = fileNames.split(",")
arrFileNames.foreach(x => sqlContext.sparkContext.addFile(x))

val plugin                = "-p"
val pluginNameAndPath     = "com.cucumber.listener.ExtentCucumberFormatter:hdfs:///tmp/target/cucumber-reports/report.html"
val pluginNameAndPathJson = "json:hdfs:///tmp/target/cucumber-reports/report.json"

Main.main(Array(glue, gluePath, tag, tagName, plugin, pluginNameAndPath, plugin, pluginNameAndPathJson, SparkFiles.get("xxx.feature")))
In the above code, when I run in cluster mode it runs successfully, but the Cucumber report is not generated at the given HDFS location.
But when I run in client mode (without hdfs:///) it runs successfully and creates the Cucumber report on the local node.
It seems like Cucumber does not know about the HDFS file system, so it cannot create the file in HDFS.
Can anyone please help with how to create the Cucumber report at an HDFS path, or suggest any other way to achieve this?
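One possible workaround, offered only as a sketch (the local report path is an assumption): let the formatter write to a node-local path, then copy the finished report into HDFS with the Hadoop FileSystem API after Main.main returns.
import org.apache.hadoop.fs.{FileSystem, Path}

// Point the formatter at a node-local path instead of an hdfs:// URI
val localReport      = "/tmp/cucumber-reports/report.html"
val pluginHtmlLocal  = s"com.cucumber.listener.ExtentCucumberFormatter:$localReport"

// ... pass pluginHtmlLocal as the -p argument, run Main.main(...) as above,
// then push the finished local report to HDFS
val fs = FileSystem.get(sqlContext.sparkContext.hadoopConfiguration)
fs.copyFromLocalFile(new Path(localReport), new Path("hdfs:///tmp/target/cucumber-reports/report.html"))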
I need to read a file using spark-sql, and the file is in the current directory.
I use this command to decompress a list of files I have stored on HDFS.
val decompressCommand = Seq(laszippath, "-i", inputFileName , "-o", "out.las").!!
The file is output to the current worker node directory, and I know this because when I execute "ls -a".!! through Scala I can see that the file is there. I then try to access it with the following command:
val dataFrame = sqlContext.read.las("out.las")
I assumed that the sql context would try to find the file in the current directory, but it doesn't. Also, it doesn't throw an error but a warning stating that the file could not be found (so spark continues to run).
I attempted to add the file using: sparkContext.addFile("out.las") and then access the location using: val location = SparkFiles.get("out.las") but this didn't work either.
I even ran the command val locationPt = "pwd"!! and then did val fullLocation = locationPt + "/out.las" and attempted to use that value but it didn't work either.
The actual exception that gets thrown is the following:
User class threw exception: org.apache.spark.sql.AnalysisException: cannot resolve 'x' given input columns: [];
org.apache.spark.sql.AnalysisException: cannot resolve 'x' given input columns: []
And this happens when I try to access column "x" from a dataframe. I know that column 'X' exists because I've downloaded some of the files from HDFS, decompressed them locally and ran some tests.
I need to decompress files one by one because I have 1.6TB of data and so I cannot decompress it at one go and access them later.
Can anyone tell me what I can do to access files which are being outputted to the worker node directory? Or maybe should I be doing it some other way?
So I managed to do it now. What I'm doing is saving the file to HDFS and then retrieving it through the SQL context from HDFS. I overwrite "out.las" in HDFS each time so that I don't take up too much space.
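A minimal sketch of that workaround (the HDFS path is illustrative, and it assumes the same las reader used above):
import org.apache.hadoop.fs.{FileSystem, Path}

// After laszip has written out.las locally, push it to HDFS, overwriting any previous copy
val fs = FileSystem.get(sc.hadoopConfiguration)
fs.copyFromLocalFile(false, true, new Path("out.las"), new Path("/user/me/dataForHDFS/out.las"))

// Then read it back through the SQL context from the HDFS location
val dataFrame = sqlContext.read.las("/user/me/dataForHDFS/out.las")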
I have used the Hadoop API before to get at files; I don't know if it will help you here.
import org.apache.hadoop.fs.{FSDataInputStream, FileSystem, Path}

val filePath = "/user/me/dataForHDFS/"
val fs: FileSystem = FileSystem.get(new java.net.URI(filePath + "out.las"), sc.hadoopConfiguration)
And I've not tested the below, but it should give an idea of what to do afterward.
val path = new Path(filePath + "out.las")
val readIn = new Array[Byte](fs.getFileStatus(path).getLen.toInt)  // buffer sized to the file length
val fileIn: FSDataInputStream = fs.open(path)
fileIn.readFully(0, readIn)
fileIn.close()
I have persisted object files in Spark Streaming using the dstream.saveAsObjectFiles("/temObj") method, and it shows multiple files in HDFS:
temObj-1506338844000
temObj-1506338848000
temObj-1506338852000
temObj-1506338856000
temObj-1506338860000
I want to delete all the temObj files after reading them. What is the best way to do that in Spark? I tried
val hdfs = org.apache.hadoop.fs.FileSystem.get(new java.net.URI("hdfs://localhost:9000"), hadoopConf)
hdfs.delete(new org.apache.hadoop.fs.Path(path), true)
but it can only delete one folder at a time.
Unfortunately, delete doesn't support globs.
You can use globStatus and iterate over the files/directories one by one and delete them.
import org.apache.hadoop.fs.{FileSystem, Path}

val hdfs = FileSystem.get(sc.hadoopConfiguration)
val deletePaths = hdfs.globStatus(new Path("/temObj-*")).map(_.getPath)
deletePaths.foreach { path => hdfs.delete(path, true) }
Alternatively, you can use sys.process to execute shell commands:
import scala.sys.process._
"hdfs dfs -rm -r /temObj*".!
I am trying to read the contents of a .gz file into a DataFrame/RDD in Spark/Scala using the following code:
val conf = new SparkConf()
val sc = new SparkContext(conf)
val data = sc.wholeTextFiles("path to gz file")
data.collect().foreach(println);
The .gz file is 28 MB, and when I do the spark submit using this command
spark-submit --class sample --master local[*] target\spark.jar
it gives me a Java heap space issue in the console.
Is this the best way of reading a .gz file, and if yes, how could I solve the Java heap space issue?
Thanks
Disclaimer: that code and description will purely read in a small compressed text file using Spark, collect it to an array of every line, and print every line in the entire file to the console. The number of ways and reasons to do this outside of Spark far outnumber those to do it in Spark.
1) Use SparkSession instead of SparkContext if you can swing it. sparkSession.read.text() is the command to use (it automatically handles a few compression formats), as sketched after this list.
2) Or at least use sc.textFile() instead of wholeTextFiles.
3) You're calling .collect on that data, which brings the entire file back to the driver (in this case, since you're running local, you are not network bound). Add the --driver-memory option to spark-submit or the spark shell to increase memory if you MUST do the collect.
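A minimal sketch of points 1) and 3) combined (the path and application name are placeholders, and the .gz file is assumed to be compressed text):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("read-gz").getOrCreate()

// spark.read.text() decompresses .gz transparently; each line becomes one row
val lines = spark.read.text("path/to/file.gz")

// Avoid collect(): look at a sample instead of pulling everything to the driver
lines.show(20, truncate = false)
println(s"line count: ${lines.count()}")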