Spark - load CSV file as DataFrame? - scala

I would like to read a CSV file in Spark, convert it to a DataFrame, and store it in HDFS, using df.registerTempTable("table_name")
I have tried:
scala> val df = sqlContext.load("hdfs:///csv/file/dir/file.csv")
Error which I got:
java.lang.RuntimeException: hdfs:///csv/file/dir/file.csv is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [49, 59, 54, 10]
at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:418)
at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$6.apply(newParquet.scala:277)
at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$6.apply(newParquet.scala:276)
at scala.collection.parallel.mutable.ParArray$Map.leaf(ParArray.scala:658)
at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:54)
at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
at scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:56)
at scala.collection.parallel.mutable.ParArray$Map.tryLeaf(ParArray.scala:650)
at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:165)
at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:514)
at scala.concurrent.forkjoin.RecursiveAction.exec(RecursiveAction.java:160)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
What is the right command to load CSV file as DataFrame in Apache Spark?

spark-csv is part of core Spark functionality (as of Spark 2.x) and doesn't require a separate library.
So you could just do, for example:
df = spark.read.format("csv").option("header", "true").load("csvfile.csv")
In Scala (this works for any delimited format: for the delimiter, pass "," for CSV, "\t" for TSV, etc.):
val df = sqlContext.read.format("com.databricks.spark.csv")
.option("delimiter", ",")
.load("csvfile.csv")

Parse CSV and load as DataFrame/DataSet with Spark 2.x
First, initialize a SparkSession object; by default it is available in the shells as spark.
val spark = org.apache.spark.sql.SparkSession.builder
.master("local") # Change it as per your cluster
.appName("Spark CSV Reader")
.getOrCreate;
Use any one of the following ways to load CSV as DataFrame/DataSet
1. Do it in a programmatic way
val df = spark.read
.format("csv")
.option("header", "true") //first line in file has headers
.option("mode", "DROPMALFORMED")
.load("hdfs:///csv/file/dir/file.csv")
Update: adding all the options from here, in case the link breaks in the future (a combined example follows this list).
path: location of files. As elsewhere in Spark, this can be a standard Hadoop globbing expression.
header: when set to true the first line of files will be used to name columns and will not be included in data. All types will be assumed string. The default value is false.
delimiter: by default columns are delimited using ",", but the delimiter can be set to any character
quote: by default the quote character is ", but can be set to any character. Delimiters inside quotes are ignored
escape: by default, the escape character is "\", but it can be set to any character. Escaped quote characters are ignored
parserLib: by default, it is "commons" that can be set to "univocity" to use that library for CSV parsing.
mode: determines the parsing mode. By default it is PERMISSIVE. Possible values are:
PERMISSIVE: tries to parse all lines: nulls are inserted for missing tokens and extra tokens are ignored.
DROPMALFORMED: drops lines that have fewer or more tokens than expected or tokens which do not match the schema
FAILFAST: aborts with a RuntimeException if encounters any malformed line
charset: defaults to 'UTF-8' but can be set to other valid charset names
inferSchema: automatically infers column types. It requires one extra pass over the data and is false by default
comment: skip lines beginning with this character. Default is "#". Disable comments by setting this to null.
nullValue: specifies a string that indicates a null value, any fields matching this string will be set as nulls in the DataFrame
dateFormat: specifies a string that indicates the date format to use when reading dates or timestamps. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to both DateType and TimestampType. By default, it is null which means trying to parse times and date by java.sql.Timestamp.valueOf() and java.sql.Date.valueOf().
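A combined sketch using several of these options on one read (the values shown are placeholders, not recommendations; adjust them to your file):
val df = spark.read
.format("csv")
.option("header", "true")
.option("delimiter", ";")
.option("quote", "\"")
.option("escape", "\\")
.option("nullValue", "NA")
.option("dateFormat", "yyyy-MM-dd")
.option("inferSchema", "true")
.option("mode", "DROPMALFORMED")
.load("hdfs:///csv/file/dir/file.csv")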
2. You can do it the SQL way as well
val df = spark.sql("SELECT * FROM csv.`hdfs:///csv/file/dir/file.csv`")
Dependencies:
"org.apache.spark" % "spark-core_2.11" % 2.0.0,
"org.apache.spark" % "spark-sql_2.11" % 2.0.0,
Spark version < 2.0
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true")
.option("mode", "DROPMALFORMED")
.load("csv/file/path");
Dependencies:
"org.apache.spark" % "spark-sql_2.10" % 1.6.0,
"com.databricks" % "spark-csv_2.10" % 1.6.0,
"com.univocity" % "univocity-parsers" % LATEST,

This is for those on Hadoop 2.6 and Spark 1.6, without the "databricks" package.
import org.apache.spark.sql.types.{StructType,StructField,StringType,IntegerType};
import org.apache.spark.sql.Row;
val csv = sc.textFile("/path/to/file.csv")
val rows = csv.map(line => line.split(",").map(_.trim))
val header = rows.first
val data = rows.filter(_(0) != header(0))
val rdd = data.map(row => Row(row(0),row(1).toInt))
val schema = new StructType()
.add(StructField("id", StringType, true))
.add(StructField("val", IntegerType, true))
val df = sqlContext.createDataFrame(rdd, schema)

With Spark 2.0, the following is how you can read a CSV:
val conf = new SparkConf().setMaster("local[2]").setAppName("my app")
val sc = new SparkContext(conf)
val sparkSession = SparkSession.builder
.config(conf = conf)
.appName("spark session example")
.getOrCreate()
val path = "/Users/xxx/Downloads/usermsg.csv"
val base_df = sparkSession.read.option("header","true").
csv(path)

In Java 1.8, this code snippet works perfectly for reading CSV files.
POM.xml
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.0.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql_2.11 -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.0.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.scala-lang/scala-library -->
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.11.8</version>
</dependency>
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-csv_2.11</artifactId>
<version>1.4.0</version>
</dependency>
Java
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkConf conf = new SparkConf().setAppName("JavaWordCount").setMaster("local");
// create Spark Context
SparkContext context = new SparkContext(conf);
// create spark Session
SparkSession sparkSession = new SparkSession(context);
Dataset<Row> df = sparkSession.read().format("com.databricks.spark.csv").option("header", true).option("inferSchema", true).load("hdfs://localhost:9000/usr/local/hadoop_data/loan_100.csv");
//("hdfs://localhost:9000/usr/local/hadoop_data/loan_100.csv");
System.out.println("========== Print Schema ============");
df.printSchema();
System.out.println("========== Print Data ==============");
df.show();
System.out.println("========== Print title ==============");
df.select("title").show();

Penny's Spark 2 example is the way to do it in Spark 2. There's one more trick: have the schema inferred for you by doing an initial scan of the data, by setting the option inferSchema to true.
Here, then, assuming that spark is a Spark session you have set up, is the operation to load the CSV index file of all the Landsat images which Amazon hosts on S3.
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
val csvdata = spark.read.options(Map(
"header" -> "true",
"ignoreLeadingWhiteSpace" -> "true",
"ignoreTrailingWhiteSpace" -> "true",
"timestampFormat" -> "yyyy-MM-dd HH:mm:ss.SSSZZZ",
"inferSchema" -> "true",
"mode" -> "FAILFAST"))
.csv("s3a://landsat-pds/scene_list.gz")
The bad news is: this triggers a scan through the file; for something large like this 20+MB zipped CSV file, that can take 30s over a long haul connection. Bear that in mind: you are better off manually coding up the schema once you've got it coming in.
(code snippet Apache Software License 2.0 licensed to avoid all ambiguity; something I've done as a demo/integration test of S3 integration)
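A sketch of the "code up the schema manually" advice: infer the schema once, print it, and then reuse it explicitly on later reads so Spark skips the extra inference pass (the names below are mine, not part of the original snippet):
val inferredSchema = spark.read
.options(Map("header" -> "true", "inferSchema" -> "true"))
.csv("s3a://landsat-pds/scene_list.gz")
.schema
println(inferredSchema) // copy this StructType into your code as an explicit schema
val csvdataFast = spark.read
.option("header", "true")
.schema(inferredSchema)
.csv("s3a://landsat-pds/scene_list.gz")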

There are a lot of challenges to parsing a CSV file, and they add up as the file gets bigger; non-English, escape, separator, or other special characters in the column values can all cause parsing errors.
The magic then is in the options that are used. The ones that worked for me and hope should cover most of the edge cases are in code below:
### Create a Spark Session
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local").appName("Classify Urls").getOrCreate()
### Note the options that are used. You may have to tweak these in case of error
html_df = spark.read.csv(html_csv_file_path,
header=True,
multiLine=True,
ignoreLeadingWhiteSpace=True,
ignoreTrailingWhiteSpace=True,
encoding="UTF-8",
sep=',',
quote='"',
escape='"',
maxColumns=2,
inferSchema=True)
Hope that helps. For more, refer to: Using PySpark 2 to read CSV having HTML source code
Note: The code above uses the Spark 2 API, where the CSV reading API comes bundled with the built-in packages of the Spark installation.
Note: PySpark is a Python wrapper for Spark and shares the same API as Scala/Java.

This applies if you are building a jar with Scala 2.11 and Apache Spark 2.0 or higher.
There is no need to create a sqlContext or sparkContext object; a single SparkSession object suffices for all needs.
The following is my code, which works fine:
import org.apache.spark.sql.{DataFrame, Row, SQLContext, SparkSession}
import org.apache.log4j.{Level, LogManager, Logger}
object driver {
def main(args: Array[String]) {
val log = LogManager.getRootLogger
log.info("**********JAR EXECUTION STARTED**********")
val spark = SparkSession.builder().master("local").appName("ValidationFrameWork").getOrCreate()
val df = spark.read.format("csv")
.option("header", "true")
.option("delimiter","|")
.option("inferSchema","true")
.load("d:/small_projects/spark/test.pos")
df.show()
}
}
If you are running on a cluster, just change .master("local") to .master("yarn") when defining the SparkSession builder object.
The Spark Doc covers this:
https://spark.apache.org/docs/2.2.0/sql-programming-guide.html

With Spark 2.4+, if you want to load a csv from a local directory, then you can use 2 sessions and load that into hive. The first session should be created with master() config as "local[*]" and the second session with "yarn" and Hive enabled.
The below one worked for me.
import org.apache.log4j.{Level, Logger}
import org.apache.spark._
import org.apache.spark.rdd._
import org.apache.spark.sql._
object testCSV {
def main(args: Array[String]) {
Logger.getLogger("org").setLevel(Level.ERROR)
val spark_local = SparkSession.builder().appName("CSV local files reader").master("local[*]").getOrCreate()
import spark_local.implicits._
spark_local.sql("SET").show(100,false)
val local_path="/tmp/data/spend_diversity.csv" // Local file
val df_local = spark_local.read.format("csv").option("inferSchema","true").load("file://"+local_path) // "file://" is mandatory
df_local.show(false)
val spark = SparkSession.builder().appName("CSV HDFS").config("spark.sql.warehouse.dir", "/apps/hive/warehouse").enableHiveSupport().getOrCreate()
import spark.implicits._
spark.sql("SET").show(100,false)
val df = df_local
df.createOrReplaceTempView("lcsv")
spark.sql(" drop table if exists work.local_csv ")
spark.sql(" create table work.local_csv as select * from lcsv ")
}
}
When run with spark2-submit --master "yarn" --conf spark.ui.enabled=false testCSV.jar it went fine and created the table in Hive.

Add the following Spark dependencies to the POM file:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.2.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.2.0</version>
</dependency>
Spark configuration:
val spark = SparkSession.builder().master("local").appName("Sample App").getOrCreate()
Read csv file:
val df = spark.read.option("header", "true").csv("FILE_PATH")
Display output:
df.show()

Try this if using spark 2.0+
For a non-HDFS file:
df = spark.read.csv("file:///csvfile.csv")
For an HDFS file:
df = spark.read.csv("hdfs:///csvfile.csv")
For an HDFS file (with a delimiter other than comma):
df = spark.read.option("delimiter","|").csv("hdfs:///csvfile.csv")
Note: this works for any delimited file. Just use option("delimiter", ...) to change the value.
Hope this is helpful.

To read from a relative path on the system, use the System.getProperty method to get the current directory, and then load the file using that relative path.
scala> val path = System.getProperty("user.dir").concat("/../2015-summary.csv")
scala> val csvDf = spark.read.option("inferSchema","true").option("header", "true").csv(path)
scala> csvDf.take(3)
spark:2.4.4 scala:2.11.12

The default file format with spark.read (and sqlContext.load) is Parquet, while the file you are reading is CSV; that is why you are getting the exception. Specify the CSV format with the API you are using.
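For example (a sketch mirroring the other answers here):
// Spark 1.x with the spark-csv package on the classpath
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true")
.load("hdfs:///csv/file/dir/file.csv")
// Spark 2.x with the built-in reader
val df2 = spark.read.option("header", "true").csv("hdfs:///csv/file/dir/file.csv")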

With the built-in Spark CSV reader, you can get it done easily with the new SparkSession object in Spark 2.0+.
val df = spark.
read.
option("inferSchema", "false").
option("header","true").
option("mode","DROPMALFORMED").
option("delimiter", ";").
schema(dataSchema).
csv("/csv/file/dir/file.csv")
df.show()
df.printSchema()
There are various options you can set.
header: whether your file includes a header line at the top
inferSchema: whether you want to infer the schema automatically or not. The default is false. I always prefer to provide a schema to ensure proper datatypes (see the dataSchema sketch after this list).
mode: parsing mode, PERMISSIVE, DROPMALFORMED or FAILFAST
delimiter: to specify delimiter, default is comma(',')
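Note that dataSchema in the snippet above has to be defined by you; a minimal sketch (the column names here are hypothetical):
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
// hypothetical columns; replace with the real ones from your file
val dataSchema = StructType(Seq(
StructField("id", StringType, true),
StructField("value", IntegerType, true)
))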

Related

Regarding org.apache.spark.sql.AnalysisException error when creating a jar file using Scala

I have the following simple Scala class, which I will later modify to fit some machine learning models.
I need to create a jar file out of this, as I am going to run these models in amazon-emr. I am a beginner in this process, so I first tested whether I can successfully import the following csv file and write it to another file by creating a jar file using the Scala class mentioned below.
The csv file looks like this, and it includes a Date column as one of the variables.
+-------------------+-------------+-------+---------+-----+
| Date| x1 | y | x2 | x3 |
+-------------------+-------------+-------+---------+-----+
|0010-01-01 00:00:00|0.099636562E8|6405.29| 57.06|21.55|
|0010-03-31 00:00:00|0.016645123E8|5885.41| 53.54|21.89|
|0010-03-30 00:00:00|0.044308936E8|6260.95|57.080002|20.93|
|0010-03-27 00:00:00|0.124928214E8|6698.46|65.540001|23.44|
|0010-03-26 00:00:00|0.570222885E7|6768.49| 61.0|24.65|
|0010-03-25 00:00:00|0.086162414E8|6502.16|63.950001|25.24|
Data set link : https://drive.google.com/open?id=18E6nf4_lK46kl_zwYJ1CIuBOTPMriGgE
I created a jar file out of this using IntelliJ IDEA, and it was successfully done.
import org.apache.spark.sql.SparkSession

object jar1 {
def main(args: Array[String]): Unit = {
val sc: SparkSession = SparkSession.builder()
.appName("SparkByExample")
.getOrCreate()
val data = sc.read.format("csv")
.option("header","true")
.option("inferSchema","true")
.load(args(0))
data.write.format("text").save(args(1))
}
}
After that I uploaded this jar file, along with the csv file mentioned above, to amazon-s3 and tried to run it on an amazon-emr cluster.
But it failed, and I got the following error message:
ERROR Client: Application diagnostics message: User class threw exception: org.apache.spark.sql.AnalysisException: Text data source does not support timestamp data type.;
I am sure this error has something to do with the Date variable in the data set, but I don't know how to fix it.
Can anyone help me figure this out?
Updated:
I tried to open the same csv file that I mentioned earlier, but without the date column. In this case I am getting this error:
ERROR Client: Application diagnostics message: User class threw exception: org.apache.spark.sql.AnalysisException: Text data source does not support double data type.;
Thank you
As I noticed later, you are going to write to a text file. Spark's .format("text") doesn't support any types except String/Text. So, to achieve the goal, you first need to convert all the types to String and then store:
df.rdd.map(_.toString().replace("[","").replace("]", "")).saveAsTextFile("textfilename")
If you can consider other file-based options to store the data, for example CSV or JSON, then you keep the benefit of the column types.
This is a working code example for CSV, based on your csv file.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
.appName("Simple Application")
.config("spark.master", "local")
.getOrCreate()
import spark.implicits._
import spark.sqlContext.implicits._
val df = spark.read
.format("csv")
.option("delimiter", ",")
.option("header", "true")
.option("inferSchema", "true")
.option("dateFormat", "yyyy-MM-dd")
.load("datat.csv")
df.printSchema()
df.show()
df.write
.format("csv")
.option("inferSchema", "true")
.option("header", "true")
.option("delimiter", "\t")
.option("timestampFormat", "yyyy-MM-dd HH:mm:ss")
.option("escape", "\\")
.save("another")
There is no need for a custom encoder/decoder.

Load a file from SFTP server into spark RDD

How can I load a file from an SFTP server into a Spark RDD? After loading this file I need to perform some filtering on the data. Also, the file is a CSV file, so could you please help me decide whether I should use DataFrames or RDDs?
You can use the spark-sftp library in your program in the following ways:
For Spark 2.x
Maven Dependency
<dependency>
<groupId>com.springml</groupId>
<artifactId>spark-sftp_2.11</artifactId>
<version>1.1.0</version>
</dependency>
SBT Dependency
libraryDependencies += "com.springml" % "spark-sftp_2.11" % "1.1.0"
Using with Spark shell
This package can be added to Spark using the --packages command line option. For example, to include it when starting the spark shell:
$ bin/spark-shell --packages com.springml:spark-sftp_2.11:1.1.0
Scala API
// Construct Spark dataframe using file in FTP server
val df = spark.read.
format("com.springml.spark.sftp").
option("host", "SFTP_HOST").
option("username", "SFTP_USER").
option("password", "****").
option("fileType", "csv").
option("inferSchema", "true").
load("/ftp/files/sample.csv")
// Write dataframe as CSV file to FTP server
df.write.
format("com.springml.spark.sftp").
option("host", "SFTP_HOST").
option("username", "SFTP_USER").
option("password", "****").
option("fileType", "csv").
save("/ftp/files/sample.csv")
For Spark 1.x (1.5+)
Maven Dependency
<dependency>
<groupId>com.springml</groupId>
<artifactId>spark-sftp_2.10</artifactId>
<version>1.0.2</version>
</dependency>
SBT Dependency
libraryDependencies += "com.springml" % "spark-sftp_2.10" % "1.0.2"
Using with Spark shell
This package can be added to Spark using the --packages command line option. For example, to include it when starting the spark shell:
$ bin/spark-shell --packages com.springml:spark-sftp_2.10:1.0.2
Scala API
import org.apache.spark.sql.SQLContext
// Construct Spark dataframe using file in FTP server
val sqlContext = new SQLContext(sc)
val df = sqlContext.read.
format("com.springml.spark.sftp").
option("host", "SFTP_HOST").
option("username", "SFTP_USER").
option("password", "****").
option("fileType", "csv").
option("inferSchema", "true").
load("/ftp/files/sample.csv")
// Write dataframe as CSV file to FTP server
df.write.
format("com.springml.spark.sftp").
option("host", "SFTP_HOST").
option("username", "SFTP_USER").
option("password", "****").
option("fileType", "csv").
save("/ftp/files/sample.csv")
For more information on spark-sftp you can visit their GitHub page: springml/spark-sftp
Loading from SFTP is straightforward using the sftp-connector.
https://github.com/springml/spark-sftp
Remember that it is a single-threaded application and lands data into HDFS even if you don't specify it; it streams the data into HDFS and then creates a DataFrame on top of it.
While loading, we need to specify a couple more parameters.
Normally it may also work without specifying the location, if your user is a superuser of HDFS. It will create the temp file in / of HDFS and delete it once the process is completed.
val data = sparkSession.read.format("com.springml.spark.sftp").
option("host", "host").
option("username", "user").
option("password", "password").
option("fileType", "json").
option("createDF", "true").
option("hdfsTempLocation","/user/currentuser/").
load("/Home/test_mapping.json");
All the available options are the following (source code):
https://github.com/springml/spark-sftp/blob/master/src/main/scala/com/springml/spark/sftp/DefaultSource.scala
override def createRelation(sqlContext: SQLContext, parameters: Map[String, String], schema: StructType) = {
val username = parameters.get("username")
val password = parameters.get("password")
val pemFileLocation = parameters.get("pem")
val pemPassphrase = parameters.get("pemPassphrase")
val host = parameters.getOrElse("host", sys.error("SFTP Host has to be provided using 'host' option"))
val port = parameters.get("port")
val path = parameters.getOrElse("path", sys.error("'path' must be specified"))
val fileType = parameters.getOrElse("fileType", sys.error("File type has to be provided using 'fileType' option"))
val inferSchema = parameters.get("inferSchema")
val header = parameters.getOrElse("header", "true")
val delimiter = parameters.getOrElse("delimiter", ",")
val createDF = parameters.getOrElse("createDF", "true")
val copyLatest = parameters.getOrElse("copyLatest", "false")
//System.setProperty("java.io.tmpdir","hdfs://devnameservice1/../")
val tempFolder = parameters.getOrElse("tempLocation", System.getProperty("java.io.tmpdir"))
val hdfsTemp = parameters.getOrElse("hdfsTempLocation", tempFolder)
val cryptoKey = parameters.getOrElse("cryptoKey", null)
val cryptoAlgorithm = parameters.getOrElse("cryptoAlgorithm", "AES")
val supportedFileTypes = List("csv", "json", "avro", "parquet")
if (!supportedFileTypes.contains(fileType)) {
sys.error("fileType " + fileType + " not supported. Supported file types are " + supportedFileTypes)
}

Parsing json in spark

I was using the Scala JSON library to parse JSON from a local drive in a Spark job:
val requestJson=JSON.parseFull(Source.fromFile("c:/data/request.json").mkString)
val mainJson=requestJson.get.asInstanceOf[Map[String,Any]].get("Request").get.asInstanceOf[Map[String,Any]]
val currency=mainJson.get("currency").get.asInstanceOf[String]
But when I try to use the same parser by pointing it to the HDFS file location, it doesn't work:
val requestJson=JSON.parseFull(Source.fromFile("hdfs://url/user/request.json").mkString)
and gives me an error:
java.io.FileNotFoundException: hdfs:/localhost/user/request.json (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at scala.io.Source$.fromFile(Source.scala:91)
at scala.io.Source$.fromFile(Source.scala:76)
at scala.io.Source$.fromFile(Source.scala:54)
... 128 elided
How can I use the JSON.parseFull library to get data from an HDFS file location?
Thanks
Spark has built-in support for parsing JSON documents, which is available in the spark-sql_${scala.version} jar.
In Spark 2.0+ :
import org.apache.spark.sql.SparkSession
val spark: SparkSession = SparkSession.builder.master("local").getOrCreate
val df = spark.read.format("json").json("json/file/location/in/hdfs")
df.show()
With the df object you can do all supported SQL operations, and its data processing will be distributed among the nodes, whereas requestJson will be computed on a single machine only.
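For instance, the currency lookup from the question could be expressed with the DataFrame API (a sketch; the nested Request.currency path is taken from the question's own parsing code):
val currency = df.select("Request.currency").first().getString(0)
// or with SQL
df.createOrReplaceTempView("request")
spark.sql("SELECT Request.currency FROM request").show()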
Maven dependencies
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.0.0</version>
</dependency>
Edit: (as per comment to read file from hdfs)
import java.io.{BufferedReader, InputStreamReader}
import org.apache.hadoop.fs.Path

val hdfs = org.apache.hadoop.fs.FileSystem.get(
new java.net.URI("hdfs://ITS-Hadoop10:9000/"),
new org.apache.hadoop.conf.Configuration()
)
val path=new Path("/user/zhc/"+x+"/")
val t=hdfs.listStatus(path)
val in =hdfs.open(t(0).getPath)
val reader = new BufferedReader(new InputStreamReader(in))
var l=reader.readLine()
code credits: from another SO
question
Maven dependencies:
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs</artifactId>
<version>2.7.2</version> <!-- you can change this as per your hadoop version -->
</dependency>
It is much easier in Spark 2.0:
val df = spark.read.json("json/file/location/in/hdfs")
df.show()
One can use the following in Spark to read the file from HDFS:
val jsonText = sc.textFile("hdfs://url/user/request.json").collect.mkString("\n")
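The collected text can then be handed to the same parser the question already uses (a sketch):
import scala.util.parsing.json.JSON
val requestJson = JSON.parseFull(jsonText)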

Merge Spark output CSV files with a single header

I want to create a data processing pipeline in AWS to eventually use the processed data for Machine Learning.
I have a Scala script that takes raw data from S3, processes it and writes it to HDFS or even S3 with Spark-CSV. I think I can use multiple files as input if I want to use AWS Machine Learning tool for training a prediction model. But if I want to use something else, I presume it is best if I receive a single CSV output file.
Currently, as I do not want to use repartition(1) nor coalesce(1) for performance purposes, I have used hadoop fs -getmerge for manual testing, but as it just merges the contents of the job output files, I am running into a small problem. I need a single row of headers in the data file for training the prediction model.
If I use .option("header","true") for the spark-csv, then it writes the headers to every output file and after merging I have as many lines of headers in the data as there were output files. But if the header option is false, then it does not add any headers.
Now I found an option to merge the files inside the Scala script with Hadoop API FileUtil.copyMerge. I tried this in spark-shell with the code below.
import org.apache.hadoop.fs.FileUtil
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
val configuration = new Configuration();
val fs = FileSystem.get(configuration);
FileUtil.copyMerge(fs, new Path("smallheaders"), fs, new Path("/home/hadoop/smallheaders2"), false, configuration, "")
But this solution still just concatenates the files on top of each other and does not handle headers. How can I get an output file with only one row of headers?
I even tried adding df.columns.mkString(",") as the last argument for copyMerge, but this added the headers still multiple times, not once.
You can work around it like this:
1. Create a new DataFrame (headerDF) containing the header names.
2. Union it with the DataFrame (dataDF) containing the data.
3. Output the unioned DataFrame to disk with option("header", "false").
4. Merge the partition files (part-0000**0.csv) using Hadoop FileUtil.
This way, no partition has a header, except for the single partition whose content is the row of header names from headerDF. When all partitions are merged together, there is a single header at the top of the file. Sample code follows:
import scala.collection.JavaConverters._
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
import org.apache.spark.sql.{Row, SaveMode}

//dataFrame is the data to save on disk
//cast types of all columns to String
val dataDF = dataFrame.select(dataFrame.columns.map(c => dataFrame.col(c).cast("string")): _*)
//create a new data frame containing only header names
val headerDF = sparkSession.createDataFrame(List(Row.fromSeq(dataDF.columns.toSeq)).asJava, dataDF.schema)
//merge header names with data
headerDF.union(dataDF).write.mode(SaveMode.Overwrite).option("header", "false").csv(outputFolder)
//use hadoop FileUtil to merge all partition csv files into a single file
val fs = FileSystem.get(sparkSession.sparkContext.hadoopConfiguration)
FileUtil.copyMerge(fs, new Path(outputFolder), fs, new Path("/folder/target.csv"), true, sparkSession.sparkContext.hadoopConfiguration, null)
Another approach:
1. Output the header using dataframe.schema (val header = dataDF.schema.fieldNames.reduce(_ + "," + _)).
2. Create a file with the header on DSEFS.
3. Append all the partition files (headerless) to the file from step 2 using the Hadoop FileSystem API (see the sketch below).
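Assuming spark is your SparkSession, a rough sketch of steps 2 and 3 with the Hadoop FileSystem API (the target path and the outputFolder variable are placeholders; header comes from step 1):
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val out = fs.create(new Path("/folder/target.csv"))
out.write((header + "\n").getBytes("UTF-8"))
fs.listStatus(new Path(outputFolder))
  .filter(_.getPath.getName.startsWith("part-"))
  .foreach { status =>
    val in = fs.open(status.getPath)
    IOUtils.copyBytes(in, out, 4096, false) // append each headerless part file
    in.close()
  }
out.close()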
We had a similar issue and followed the approach below to get a single output file:
Write dataframe to hdfs with headers and without using coalesce or repartition (after the transformations).
dataframe.write.format("csv").option("header", "true").save(hdfs_path_for_multiple_files)
Read the files from the previous step and write them back to a different location on HDFS with coalesce(1).
dataframe = spark.read.option('header', 'true').csv(hdfs_path_for_multiple_files)
dataframe.coalesce(1).write.format('csv').option('header', 'true').save(hdfs_path_for_single_file)
This way, you will avoid performance issues related to coalesce or repartition during the execution of the transformations (Step 1).
And the second step provides a single output file with one header line.
To merge files in a folder into one file:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._
def merge(srcPath: String, dstPath: String): Unit = {
val hadoopConfig = new Configuration()
val hdfs = FileSystem.get(hadoopConfig)
FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath), false, hadoopConfig, null)
}
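Usage (paths are placeholders):
merge("/path/to/csv/output/folder", "/path/to/merged.csv")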
If you want to merge all files into one file, but still in the same folder (note that this brings all data into a single partition):
dataFrame
.coalesce(1)
.write
.format("com.databricks.spark.csv")
.option("header", "true")
.save(out)
Another solution would be to use solution #2 then move the one file inside the folder to another path (with the name of our CSV file).
import java.io.File
import org.apache.spark.sql.DataFrame

def df2csv(df: DataFrame, fileName: String, sep: String = ",", header: Boolean = false): Unit = {
val tmpDir = "tmpDir"
df.repartition(1)
.write
.format("com.databricks.spark.csv")
.option("header", header.toString)
.option("delimiter", sep)
.save(tmpDir)
val dir = new File(tmpDir)
val tmpCsvFile = tmpDir + File.separatorChar + "part-00000"
(new File(tmpCsvFile)).renameTo(new File(fileName))
dir.listFiles.foreach( f => f.delete )
dir.delete
}
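Usage (a sketch; df is whatever DataFrame you already have):
df2csv(df, "/tmp/output.csv", ",", header = true)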
Try specifying the schema of the header and reading all the files from the folder using the DROPMALFORMED option of spark-csv. This should let you read all the files in the folder while keeping only the headers (because you drop the malformed rows).
Example:
val headerSchema = List(
StructField("example1", StringType, true),
StructField("example2", StringType, true),
StructField("example3", StringType, true)
)
val header_DF = sqlCtx.read
.option("delimiter", ",")
.option("header", "false")
.option("mode","DROPMALFORMED")
.option("inferSchema","false")
.schema(StructType(headerSchema))
.format("com.databricks.spark.csv")
.load("folder containing the files")
In header_DF you will have only the header rows; from this you can transform the dataframe the way you need.
// Convert JavaRDD to CSV and save as text file
outputDataframe.write()
.format("com.databricks.spark.csv")
// header => true will put the header row in each output part file
.option("header", "true")
.save("outputPath"); // "outputPath" is a hypothetical target directory
Please follow the link to an integration test on how to write a single header:
http://bytepadding.com/big-data/spark/write-a-csv-text-file-from-spark/

Reading TSV into Spark Dataframe with Scala API

I have been trying to get the databricks library for reading CSVs to work. I am trying to read a TSV created by Hive into a Spark DataFrame using the Scala API.
Here is an example that you can run in the spark shell (I made the sample data public so it can work for you)
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType};
val sqlContext = new SQLContext(sc)
val segments = sqlContext.read.format("com.databricks.spark.csv").load("s3n://michaeldiscenza/data/test_segments")
The documentation says you can specify the delimiter but I am unclear about how to specify that option.
All of the option parameters are passed in the option() function as below:
val segments = sqlContext.read.format("com.databricks.spark.csv")
.option("delimiter", "\t")
.load("s3n://michaeldiscenza/data/test_segments")
With Spark 2.0+, use the built-in CSV connector to avoid the third-party dependency and get better performance:
val spark = SparkSession.builder.getOrCreate()
val segments = spark.read.option("sep", "\t").csv("/path/to/file")
You may also try inferSchema and check the resulting schema:
val df = spark.read.format("csv")
.option("inferSchema", "true")
.option("sep","\t")
.option("header", "true")
.load(tmp_loc)
df.printSchema()