How to read data from hdfs using scala language [duplicate] - scala

This question already has answers here:
Spark - load CSV file as DataFrame?
(14 answers)
Closed 4 years ago.
How do I read the data from hdfs data sets using scala language? data is any "CSV" file with limited records.

You tagged the question with Spark, so I'm assuming you are trying to use that. I would recommend you start by reading through the Spark documentation here to get an idea of how to use Spark to interact with your data.
https://spark.apache.org/docs/latest/quick-start.html
https://spark.apache.org/docs/latest/sql-programming-guide.html
But, to answer your specific question, in Spark you would read in the CSV file using code like this:
val csvDf = spark.read.format("csv")
.option("sep", ",")
.option("header", "true")
.load("hdfs://some/path/to/data.csv/")
The path your provide will be to a CSV file on HDFS, or a folder containing multiple CSV files. Also, Spark will accept other types of file systems. For example you could also use "file://" to access the local file system, or "s3://" to use S3. Once you have loaded the data you will have a Spark DataFrame object with SQL like methods available to interact with it.
Note, I provided an option for separator just to show you how to do it, but it defaults to "," anyways, so it is not required. Also, if your CSV files do not include a header, you will need to specify the Schema yourself and set header to false instead.

You can read data from HDFS by following this approach :-
val hdfs = FileSystem.get(new URI("hdfs://hdfsUrl:port/"), new Configuration())
val path = new Path("/pathOfTheFileInHDFS/")
val stream = hdfs.open(path)
def readLines = Stream.cons(stream.readLine, Stream.continually( stream.readLine))
//This example checks line for null and prints every existing line consequentally
readLines.takeWhile(_ != null).foreach(line => println(line))
Also please have a look at this article https://blog.matthewrathbone.com/2013/12/28/reading-data-from-hdfs-even-if-it-is-compressed
Please let me know if this answers your question.

Related

Spark - Write DataFrame with custom file name [duplicate]

This question already has answers here:
Write single CSV file using spark-csv
(16 answers)
Closed 1 year ago.
I have a Spark (2.4) DataFrame that I want to write as a Pipe separated file. It should be pretty straightforward like so
val myDF = spark.table("mySchema.myTable")
myDF.coalesce(1).write.format("csv").options("header", "true").options("delimiter", "|").save("/tmp/myDF")
I get a part-*.csv file in /tmp/myDF.
So far, so good. But I actually want the file name to be something specific, e.g. /tmp/myDF.csv
But giving this String in save will just create a dir called myDF.csv and create the part*.csv file in there.
Is there a way to write the DataFrame with a specific name?
You can't do that with Spark
You can rename the file later accessing the fileSystem
val directory = new File(/tmp/myDF)
if (directory.exists && directory.isDirectory) {
val file = directory.listFiles.filter(_.getName.endsWith(".csv")).head
file.renameTo("myDF.csv")
}

Spark Spark Empty Json Files reading from Directory

I'm reading from a path say /json//myfiles_.json
I'm then flattening the json using explode. This causes an error since I have some empty files. How do I tell it to ignore empty files of somehow filter them out?
I can detect individual files checking if the head is empty but I need to do this on the collection of files iterated in the dataframe with the use of the wildcard path.
So the answer seems to be that I need to provide a schema explicitly because it can't infer one from empty file - as you would expect!
e.g.
val schemadf = sqlContext.read.json(schemapath) //infer schema from file with data or do manually
val schema = schemadf.schema
val raw = sqlContext.read.schema(schema).json(monthfile)
val prep = raw.withColumn("MyArray", explode($"MyArray"))
.select($"ID", $"name", $"CreatedAt")
display(prep)

Fast file writing in scala?

So I have a scala program that iterates through a graph and writes out data line by line to a text file. It is essentially an edge list file for use with graphx.
The biggest slow down is actually creating this text file, were talking maybe million records it writes to this text file. Is there a way I can somehow parallel this task or making faster in any way by somehow storing it in memory or anything?
More info:
I am using a hadoop cluster to iterate through a graph and here is my code snippet for my text file creation im doing now to write to HDFS:
val fileName = dbPropertiesFile + "-edgelist-" + System.currentTimeMillis()
val path = new Path("/home/user/graph/" + fileName + ".txt")
val conf = new Configuration()
conf.set("fs.defaultFS", "hdfs://host001:8020")
val fs = FileSystem.newInstance(conf)
val os = fs.create(path)
while (edges.hasNext) {
val current = edges.next()
os.write(current.inVertex().id().toString.getBytes())
os.write(" ".getBytes())
os.write(current.outVertex().id().toString.getBytes())
os.write("\n".toString.getBytes())
}
fs.close()
Writing files to HDFS is never fast. Your tags seem to suggest that you are already using spark anyway, so you could as well, take advantage of it.
sparkContext
.makeRDD(20, edges.toStream)
.map(e => e.inVertex.id -> e.outVertex.id)
.toDF
.write
.delimiter(" ")
.csv(path)
This splits your input into 20 partitions (you can control that number with the numeric parameter to makeRDD above), and writes them in parallel to 20 different chunks in hdfs, that represent your resulting file.

Specifying the filename when saving a DataFrame as a CSV [duplicate]

This question already has an answer here:
Spark dataframe save in single file on hdfs location [duplicate]
(1 answer)
Closed 5 years ago.
Say I have a Spark DF that I want to save to disk a CSV file. In Spark 2.0.0+, one can convert DataFrame(DataSet[Rows]) as a DataFrameWriter and use the .csv method to write the file.
The function is defined as
def csv(path: String): Unit
path : the location/folder name and not the file name.
Spark stores the csv file at the location specified by creating CSV files with name - part-*.csv.
Is there a way to save the CSV with specified filename instead of part-*.csv ? Or possible to specify prefix to instead of part-r ?
Code :
df.coalesce(1).write.csv("sample_path")
Current Output :
sample_path
|
+-- part-r-00000.csv
Desired Output :
sample_path
|
+-- my_file.csv
Note : The coalesce function is used to output a single file and the executor has enough memory to collect the DF without memory error.
It's not possible to do it directly in Spark's save
Spark uses Hadoop File Format, which requires data to be partitioned - that's why you have part- files. You can easily change filename after processing just like in this question
In Scala it will look like:
import org.apache.hadoop.fs._
val fs = FileSystem.get(sc.hadoopConfiguration)
val file = fs.globStatus(new Path("path/file.csv/part*"))(0).getPath().getName()
fs.rename(new Path("csvDirectory/" + file), new Path("mydata.csv"))
fs.delete(new Path("mydata.csv-temp"), true)
or just:
import org.apache.hadoop.fs._
val fs = FileSystem.get(sc.hadoopConfiguration)
fs.rename(new Path("csvDirectory/data.csv/part-0000"), new Path("csvDirectory/newData.csv"))
Edit: As mentioned in comments, you can also write your own OutputFormat, please see documents for information about this approach to set file name

spark scala issue uploading csv

i am trying to upload a csv file into a tempTable such that I can query on it and I am having two issues.
First: I tried uploading the csv to a DataFrame, and this csv has some empty fields.... and I didn't find a way to do it. I found someone posting in another post to use :
val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("cars.csv")
but it gives me an error saying "Failed to load class for data source: com.databricks.spark.csv"
Then I uploaded the file and read it as a text file, without the headings as:
val sqlContext = new org.apache.spark.sql.SQLContext(sc);
import sqlContext.implicits._;
case class cars(id: Int, name: String, licence: String);
val carsDF = sc.textFile("../myTests/cars.csv").map(_.split(",")).map(p => cars( p(0).trim.toInt, p(1).trim, p(2).trim) ).toDF();
carsDF.registerTempTable("cars");
val dgp = sqlContext.sql("SELECT * FROM cars");
dgp.show()
gives an error because one of the licence field is empty... I tried to control this issue when I build the data frame but did not work.
I can obviously go into the csv file but and fix by adding a null to it but U do not want to do this because of there are a lot of fields it could be problematic. I want to fix it programmatically either when i create the dataframe or the class...
any other thoughts please let me know as well
To be able to use spark-csv you have to make sure it is available. In an interactive mode the simplest solution is to use packages argument when you start shell:
bin/spark-shell --packages com.databricks:spark-csv_2.10:1.1.0
Regarding manual parsing working with csv files, especially malformed like cars.csv, requires much more work than simply splitting on commas. Some things to consider:
how to detect csv dialect, including method of string quoting
how to handle quotes and new line characters inside strings
how handle malformed lines
In case of example file you have to at least:
filter empty lines
read header
map lines to fields providing default value if field is missing
Here you go. Remember to check the delimiter for your CSV.
// create spark session
val spark = org.apache.spark.sql.SparkSession.builder
.master("local")
.appName("Spark CSV Reader")
.getOrCreate;
// read csv
val df = spark.read
.format("csv")
.option("header", "true") //reading the headers
.option("mode", "DROPMALFORMED")
.option("delimiter", ",")
.load("/your/csv/dir/simplecsv.csv")
// create a table from dataframe
df.createOrReplaceTempView("tableName")
// run your sql query
val sqlResults = spark.sql("SELECT * FROM tableName")
// display sql results
display(sqlResults)