Parsing json in spark - scala

I was using the Scala JSON library to parse a JSON file from a local drive in a Spark job:
val requestJson=JSON.parseFull(Source.fromFile("c:/data/request.json").mkString)
val mainJson=requestJson.get.asInstanceOf[Map[String,Any]].get("Request").get.asInstanceOf[Map[String,Any]]
val currency=mainJson.get("currency").get.asInstanceOf[String]
But when I try to use the same parser by pointing it to an HDFS file location, it doesn't work:
val requestJson=JSON.parseFull(Source.fromFile("hdfs://url/user/request.json").mkString)
and gives me an error:
java.io.FileNotFoundException: hdfs:/localhost/user/request.json (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at scala.io.Source$.fromFile(Source.scala:91)
at scala.io.Source$.fromFile(Source.scala:76)
at scala.io.Source$.fromFile(Source.scala:54)
... 128 elided
How can I use JSON.parseFull to get data from an HDFS file location?
Thanks

Spark has built-in support for parsing JSON documents, available in the spark-sql_${scala.version} jar.
In Spark 2.0+ :
import org.apache.spark.sql.SparkSession
val spark: SparkSession = SparkSession.builder.master("local").getOrCreate
val df = spark.read.format("json").json("json/file/location/in/hdfs")
df.show()
With the df object you can run all supported SQL operations, and its data processing is distributed among the nodes, whereas requestJson is computed on a single machine only.
Maven dependencies
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.0.0</version>
</dependency>
Edit: (as per comment to read file from hdfs)
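import java.io.{BufferedReader, InputStreamReader}
import org.apache.hadoop.fs.Path
val hdfs = org.apache.hadoop.fs.FileSystem.get(
new java.net.URI("hdfs://ITS-Hadoop10:9000/"),
new org.apache.hadoop.conf.Configuration()
)
// x is a directory name defined elsewhere in the original snippet
val path=new Path("/user/zhc/"+x+"/")
val t=hdfs.listStatus(path) // list the files in the directory
val in =hdfs.open(t(0).getPath) // open the first file
val reader = new BufferedReader(new InputStreamReader(in))
var l=reader.readLine() // read the first line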
Code credits: from another SO question
Maven dependencies:
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs</artifactId>
<version>2.7.2</version> <!-- you can change this as per your hadoop version -->
</dependency>

It is much easier in Spark 2.0:
val df = spark.read.json("json/file/location/in/hdfs")
df.show()

One can use the following in Spark to read the file from HDFS:
val jsonText = sc.textFile("hdfs://url/user/request.json").collect.mkString("\n")
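For context on why the original call failed: scala.io.Source.fromFile passes the string to java.io.File, which treats it as a local filesystem path and knows nothing about the hdfs scheme. The following is a minimal sketch of that behavior (the object name LocalPathDemo is only for illustration). It shows how the URI is normalized into a path like the one reported in the FileNotFoundException:

```scala
import java.io.File

object LocalPathDemo {
  // java.io.File interprets its argument as a local filesystem path,
  // so the "hdfs:" prefix is treated as an ordinary path component.
  def asLocalPath(uri: String): String = new File(uri).getPath

  def main(args: Array[String]): Unit = {
    val path = asLocalPath("hdfs://localhost/user/request.json")
    // Path normalization collapses the duplicate slash after "hdfs:".
    println(path)
    // Expected to be false unless such a local path happens to exist.
    println(new File(path).exists())
  }
}
```

This is why the file has to be read through an HDFS-aware API such as sc.textFile or the Hadoop FileSystem class.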

Related

Scala: Error reading Kafka Avro messages from spark structured streaming

I have been trying to read Kafka's Avro-serialized messages from Spark structured streaming (2.4.4) with Scala 2.11.
For this purpose I have used spark-avro (dependency below).
I generate Kafka messages from Python using the confluent-kafka library.
Spark streaming is able to consume the messages with the schema, but it doesn't read the values of the fields correctly.
I have prepared a simple example to show the problem; the code is available here:
https://github.com/anigmo97/SimpleExamples/tree/master/Spark_streaming_kafka_avro_scala
I create records in python, the schema of the records is:
{
"type": "record",
"namespace": "example",
"name": "RawRecord",
"fields": [
{"name": "int_field","type": "int"},
{"name": "string_field","type": "string"}
]
}
And they are generated like this:
from time import sleep
from confluent_kafka.avro import AvroProducer, load, loads
def generate_records():
avro_producer_settings = {
'bootstrap.servers': "localhost:19092",
'group.id': 'groupid',
'schema.registry.url': "http://127.0.0.1:8081"
}
producer = AvroProducer(avro_producer_settings)
key_schema = loads('"string"')
value_schema = load("schema.avsc")
i = 1
while True:
row = {"int_field": int(i), "string_field": str(i)}
producer.produce(topic="avro_topic", key="key-{}".format(i),
value=row, key_schema=key_schema, value_schema=value_schema)
print(row)
sleep(1)
i+=1
The consumption from spark structured streaming (in Scala) is done like this:
import org.apache.spark.sql.{ Dataset, Row}
import org.apache.spark.sql.streaming.{ OutputMode, StreamingQuery}
import org.apache.spark.sql.avro._
...
try {
log.info("----- reading schema")
val jsonFormatSchema = new String(Files.readAllBytes(
Paths.get("./src/main/resources/schema.avsc")))
val ds:Dataset[Row] = sparkSession
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", kafkaServers)
.option("subscribe", topic)
.load()
val output:Dataset[Row] = ds
.select(from_avro(ds.col("value"), jsonFormatSchema) as "record")
.select("record.*")
output.printSchema()
var query: StreamingQuery = output.writeStream.format("console")
.option("truncate", "false").outputMode(OutputMode.Append()).start();
query.awaitTermination();
} catch {
case e: Exception => log.error("onApplicationEvent error: ", e)
//case _: Throwable => log.error("onApplicationEvent error:")
}
...
Printing the schema in spark, it's strange that the fields are nullable although the avro schema does not allow that.
Spark shows this:
root
|-- int_field: integer (nullable = true)
|-- string_field: string (nullable = true)
I have checked the messages with another consumer in python and the messages are fine but
independently of the message content spark shows this.
+---------+------------+
|int_field|string_field|
+---------+------------+
|0 | |
+---------+------------+
The main dependencies used are:
<properties>
<spark.version>2.4.4</spark.version>
<scala.version>2.11</scala.version>
</properties>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_${scala.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_${scala.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-avro_${scala.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_${scala.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql-kafka-0-10_${scala.version}</artifactId>
<version>${spark.version}</version>
</dependency>
Does anyone know why this might be happening?
Thanks in advance.
The code to reproduce the error is here:
https://github.com/anigmo97/SimpleExamples/tree/master/Spark_streaming_kafka_avro_scala
SOLUTION
The problem was that I was producing the messages with the confluent_kafka library in Python and reading them in Spark structured streaming with the spark-avro library.
The confluent_kafka library uses Confluent's Avro format, while spark-avro reads the standard Avro format.
The difference is that, in order to use the schema registry, Confluent Avro prepends each message with a short header (a magic byte followed by a four-byte schema ID) that indicates which schema should be used.
Source:
https://www.confluent.io/blog/kafka-connect-tutorial-transfer-avro-schemas-across-schema-registry-clusters/
To be able to use Confluent Avro and read it from Spark structured streaming, I replaced the spark-avro library with ABRiS (ABRiS allows integrating both Avro and Confluent Avro with Spark).
https://github.com/AbsaOSS/ABRiS
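To illustrate the framing described above, here is a minimal sketch in plain Scala that separates the Confluent header from the Avro payload. It assumes the Confluent wire format of one magic byte (0), a four-byte big-endian schema ID, and then the Avro-encoded bytes. The names ConfluentWireFormat, Framed, and split are hypothetical helpers, not part of spark-avro or ABRiS:

```scala
import java.nio.ByteBuffer

object ConfluentWireFormat {
  // Assumed layout: 1 magic byte (0), 4-byte big-endian schema ID, then the Avro payload.
  final case class Framed(schemaId: Int, payload: Array[Byte])

  // Returns None when the message is too short or does not start with the magic byte.
  def split(message: Array[Byte]): Option[Framed] = {
    if (message.length < 5 || message(0) != 0) None
    else {
      val buffer = ByteBuffer.wrap(message) // ByteBuffer is big-endian by default
      buffer.get()                          // skip the magic byte
      val schemaId = buffer.getInt()        // next four bytes: schema ID
      val payload = new Array[Byte](buffer.remaining())
      buffer.get(payload)                   // the rest is the Avro-encoded record
      Some(Framed(schemaId, payload))
    }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical message: magic byte, schema ID 42, two payload bytes.
    val message = Array[Byte](0, 0, 0, 0, 42, 2, 4)
    split(message).foreach { f =>
      println(s"schemaId=${f.schemaId}, payloadBytes=${f.payload.length}")
    }
  }
}
```

A plain Avro decoder that is handed the whole message starts decoding at the header bytes instead of the record, which is consistent with the wrong field values shown in the question. This sketch only demonstrates the framing; it does not look up the schema in the registry, which is the part ABRiS handles.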
My dependencies changed like this:
<properties>
<spark.version>2.4.4</spark.version>
<scala.version>2.11</scala.version>
</properties>
<!-- SPARK- AVRO -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-avro_${scala.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<!-- SPARK -AVRO AND CONFLUENT-AVRO -->
<dependency>
<groupId>za.co.absa</groupId>
<artifactId>abris_2.11</artifactId>
<version>3.1.1</version>
</dependency>
And here you can see a simple example that gets the message and deserializes its values as both Avro and Confluent Avro.
var input: Dataset[Row] = sparkSession.readStream
//.format("org.apache.spark.sql.kafka010.KafkaSourceProvider")
.format("kafka")
.option("kafka.bootstrap.servers", kafkaServers)
.option("subscribe", topicConsumer)
.option("failOnDataLoss", "false")
// .option("startingOffsets", "latest")
// .option("startingOffsets", "earliest")
.load();
// READ WITH spark-avro library (standard avro)
val jsonFormatSchema = new String(Files.readAllBytes(Paths.get("./src/main/resources/schema.avsc")))
var inputAvroDeserialized: Dataset[Row] = input
.select(from_avro(functions.col("value"), jsonFormatSchema) as "record")
.select("record.*")
// READ WITH ABRiS library (Confluent Avro)
val schemaRegistryConfig = Map(
SchemaManager.PARAM_SCHEMA_REGISTRY_URL -> "http://localhost:8081",
SchemaManager.PARAM_SCHEMA_REGISTRY_TOPIC -> topicConsumer,
SchemaManager.PARAM_VALUE_SCHEMA_NAMING_STRATEGY -> SchemaManager.SchemaStorageNamingStrategies.TOPIC_NAME, // choose a subject name strategy
SchemaManager.PARAM_VALUE_SCHEMA_ID -> "latest" // set to "latest" if you want the latest schema version to be used
)
// "input" is the Kafka stream defined above; its value column holds the Confluent-framed bytes
var inputConfluentAvroDeserialized: Dataset[Row] = input
.select(from_confluent_avro(functions.col("value"), schemaRegistryConfig) as "record")
.select("record.*")

How to write and update by kudu API in Spark 2.1

I want to write and update by Kudu API.
This is the maven dependency:
<dependency>
<groupId>org.apache.kudu</groupId>
<artifactId>kudu-client</artifactId>
<version>1.1.0</version>
</dependency>
<dependency>
<groupId>org.apache.kudu</groupId>
<artifactId>kudu-spark2_2.11</artifactId>
<version>1.1.0</version>
</dependency>
In the following code, I am not sure which parameters KuduContext expects.
My code in spark2-shell:
val kuduContext = new KuduContext("master:7051")
Also the same error in Spark 2.1 streaming:
import org.apache.kudu.spark.kudu._
import org.apache.kudu.client._
val sparkConf = new SparkConf().setAppName("DirectKafka").setMaster("local[*]")
val ssc = new StreamingContext(sparkConf, Seconds(2))
val messages = KafkaUtils.createDirectStream("")
messages.foreachRDD(rdd => {
val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
import spark.implicits._
val bb = spark.read.options(Map("kudu.master" -> "master:7051","kudu.table" -> "table")).kudu //good
val kuduContext = new KuduContext("master:7051") //error
})
Then the error:
org.apache.spark.SparkException: Only one SparkContext may be running
in this JVM (see SPARK-2243). To ignore this error, set
spark.driver.allowMultipleContexts = true. The currently running
SparkContext was created at:
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:860)
Update your version of Kudu to the latest one (currently 1.5.0). The KuduContext takes the SparkContext as an input parameter in later versions and that should prevent this problem.
Also, do the initial Spark initialization outside of the foreachRDD. In the code you provided, move both the spark and kuduContext out of the foreach. Also, you do not need to create a separate sparkConf, you can use the newer SparkSession only.
val spark = SparkSession.builder.appName("DirectKafka").master("local[*]").getOrCreate()
import spark.implicits._
val kuduContext = new KuduContext("master:7051", spark.sparkContext)
val bb = spark.read.options(Map("kudu.master" -> "master:7051", "kudu.table" -> "table")).kudu
val messages = KafkaUtils.createDirectStream("")
messages.foreachRDD(rdd => {
// do something with the bb table and messages
})

Load a file from SFTP server into spark RDD

How can I load a file from an SFTP server into a Spark RDD? After loading this file I need to perform some filtering on the data. Also, the file is a CSV file, so could you please help me decide whether I should use DataFrames or RDDs?
You can use the spark-sftp library in your program in the following ways:
For Spark 2.x
Maven Dependency
<dependency>
<groupId>com.springml</groupId>
<artifactId>spark-sftp_2.11</artifactId>
<version>1.1.0</version>
</dependency>
SBT Dependency
libraryDependencies += "com.springml" % "spark-sftp_2.11" % "1.1.0"
Using with Spark shell
This package can be added to Spark using the --packages command line option. For example, to include it when starting the spark shell:
$ bin/spark-shell --packages com.springml:spark-sftp_2.11:1.1.0
Scala API
// Construct Spark dataframe using file in FTP server
val df = spark.read.
format("com.springml.spark.sftp").
option("host", "SFTP_HOST").
option("username", "SFTP_USER").
option("password", "****").
option("fileType", "csv").
option("inferSchema", "true").
load("/ftp/files/sample.csv")
// Write dataframe as CSV file to FTP server
df.write.
format("com.springml.spark.sftp").
option("host", "SFTP_HOST").
option("username", "SFTP_USER").
option("password", "****").
option("fileType", "csv").
save("/ftp/files/sample.csv")
For Spark 1.x (1.5+)
Maven Dependency
<dependency>
<groupId>com.springml</groupId>
<artifactId>spark-sftp_2.10</artifactId>
<version>1.0.2</version>
</dependency>
SBT Dependency
libraryDependencies += "com.springml" % "spark-sftp_2.10" % "1.0.2"
Using with Spark shell
This package can be added to Spark using the --packages command line option. For example, to include it when starting the spark shell:
$ bin/spark-shell --packages com.springml:spark-sftp_2.10:1.0.2
Scala API
import org.apache.spark.sql.SQLContext
// Construct Spark dataframe using file in FTP server
val sqlContext = new SQLContext(sc)
val df = sqlContext.read.
format("com.springml.spark.sftp").
option("host", "SFTP_HOST").
option("username", "SFTP_USER").
option("password", "****").
option("fileType", "csv").
option("inferSchema", "true").
load("/ftp/files/sample.csv")
// Write dataframe as CSV file to FTP server
df.write.
format("com.springml.spark.sftp").
option("host", "SFTP_HOST").
option("username", "SFTP_USER").
option("password", "****").
option("fileType", "csv").
save("/ftp/files/sample.csv")
For more information on spark-sftp you can visit their GitHub page: springml/spark-sftp
Loading from SFTP is straightforward using the sftp-connector.
https://github.com/springml/spark-sftp
Remember that it is a single-threaded application and lands the data in HDFS even if you don't specify it. It streams the data into HDFS and then creates a DataFrame on top of it.
While loading, we need to specify a couple more parameters.
Normally it may also work without specifying the location, when your user is a sudo user of HDFS. It will create the temp file in / of HDFS and delete it once the process is completed.
val data = sparkSession.read.format("com.springml.spark.sftp").
option("host", "host").
option("username", "user").
option("password", "password").
option("fileType", "json").
option("createDF", "true").
option("hdfsTempLocation","/user/currentuser/").
load("/Home/test_mapping.json");
All the available options are the following (source code):
https://github.com/springml/spark-sftp/blob/master/src/main/scala/com/springml/spark/sftp/DefaultSource.scala
override def createRelation(sqlContext: SQLContext, parameters: Map[String, String], schema: StructType) = {
val username = parameters.get("username")
val password = parameters.get("password")
val pemFileLocation = parameters.get("pem")
val pemPassphrase = parameters.get("pemPassphrase")
val host = parameters.getOrElse("host", sys.error("SFTP Host has to be provided using 'host' option"))
val port = parameters.get("port")
val path = parameters.getOrElse("path", sys.error("'path' must be specified"))
val fileType = parameters.getOrElse("fileType", sys.error("File type has to be provided using 'fileType' option"))
val inferSchema = parameters.get("inferSchema")
val header = parameters.getOrElse("header", "true")
val delimiter = parameters.getOrElse("delimiter", ",")
val createDF = parameters.getOrElse("createDF", "true")
val copyLatest = parameters.getOrElse("copyLatest", "false")
//System.setProperty("java.io.tmpdir","hdfs://devnameservice1/../")
val tempFolder = parameters.getOrElse("tempLocation", System.getProperty("java.io.tmpdir"))
val hdfsTemp = parameters.getOrElse("hdfsTempLocation", tempFolder)
val cryptoKey = parameters.getOrElse("cryptoKey", null)
val cryptoAlgorithm = parameters.getOrElse("cryptoAlgorithm", "AES")
val supportedFileTypes = List("csv", "json", "avro", "parquet")
if (!supportedFileTypes.contains(fileType)) {
sys.error("fileType " + fileType + " not supported. Supported file types are " + supportedFileTypes)
}
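The fallback chain for the temporary locations in the excerpt above can be tried in isolation. This is a minimal sketch that copies only the two getOrElse lines for tempLocation and hdfsTempLocation into a hypothetical helper (SftpTempLocationDefaults.resolve is not part of spark-sftp):

```scala
object SftpTempLocationDefaults {
  // Mirrors the two getOrElse lines from the excerpt; returns (tempFolder, hdfsTemp).
  def resolve(parameters: Map[String, String]): (String, String) = {
    // Local temp folder: explicit option, else the JVM's java.io.tmpdir.
    val tempFolder = parameters.getOrElse("tempLocation", System.getProperty("java.io.tmpdir"))
    // HDFS temp folder: explicit option, else whatever tempFolder resolved to.
    val hdfsTemp = parameters.getOrElse("hdfsTempLocation", tempFolder)
    (tempFolder, hdfsTemp)
  }

  def main(args: Array[String]): Unit = {
    println(resolve(Map.empty))
    println(resolve(Map("hdfsTempLocation" -> "/user/currentuser/")))
  }
}
```

Going by the excerpt, when neither option is set both locations fall back to java.io.tmpdir, which may explain why setting hdfsTempLocation explicitly is recommended above.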

spark streaming Twitter- filter language

I know that this question has been treated in several posts, but I can't find out how to solve my problem. I am trying to filter tweets by language. I have read in this forum that I have to use the twitter4j API. I have already added it to my dependencies:
<dependency>
<groupId>org.twitter4j</groupId>
<artifactId>twitter4j-stream</artifactId>
<version>3.0.3</version>
</dependency>
My code is:
import twitter4j.api
[....]
val sc = new SparkContext("local", "Simple", "$SPARK_HOME", List("target/streamingTwitter-1.0.jar"))
val ssc = new StreamingContext(sc, Seconds(10))
var filter = Array("filter")
val tweets = TwitterUtils.createStream(ssc, None, filter).filter(status => _.getLang == "es")
The error is:
cannot resolve symbol getlang
Why doesn't the compiler recognize the getLang method? This method is supposed to be implemented in the twitter4j API, right? Isn't it enough to import twitter4j and set the dependencies in order to use its methods?

Spark - load CSV file as DataFrame?

I would like to read a CSV in spark and convert it as DataFrame and store it in HDFS with df.registerTempTable("table_name")
I have tried:
scala> val df = sqlContext.load("hdfs:///csv/file/dir/file.csv")
Error which I got:
java.lang.RuntimeException: hdfs:///csv/file/dir/file.csv is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [49, 59, 54, 10]
at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:418)
at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$6.apply(newParquet.scala:277)
at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$6.apply(newParquet.scala:276)
at scala.collection.parallel.mutable.ParArray$Map.leaf(ParArray.scala:658)
at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:54)
at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
at scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:56)
at scala.collection.parallel.mutable.ParArray$Map.tryLeaf(ParArray.scala:650)
at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:165)
at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:514)
at scala.concurrent.forkjoin.RecursiveAction.exec(RecursiveAction.java:160)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
What is the right command to load CSV file as DataFrame in Apache Spark?
Since Spark 2.0, CSV support (formerly the separate spark-csv package) is part of core Spark functionality and doesn't require a separate library.
So you could just do, for example:
df = spark.read.format("csv").option("header", "true").load("csvfile.csv")
In Scala (this works for any delimited format: set the delimiter to "," for CSV, "\t" for TSV, etc.):
val df = sqlContext.read.format("com.databricks.spark.csv")
.option("delimiter", ",")
.load("csvfile.csv")
Parse CSV and load as DataFrame/DataSet with Spark 2.x
First, initialize a SparkSession object; by default it is available in shells as spark.
val spark = org.apache.spark.sql.SparkSession.builder
.master("local") // change it as per your cluster
.appName("Spark CSV Reader")
.getOrCreate;
Use any one of the following ways to load CSV as DataFrame/DataSet
1. Do it in a programmatic way
val df = spark.read
.format("csv")
.option("header", "true") //first line in file has headers
.option("mode", "DROPMALFORMED")
.load("hdfs:///csv/file/dir/file.csv")
Update: Adding all options from here in case the link will be broken in future
path: location of files. Similar to Spark, it can accept standard Hadoop globbing expressions.
header: when set to true the first line of files will be used to name columns and will not be included in data. All types will be assumed string. The default value is false.
delimiter: by default columns are delimited using ",", but the delimiter can be set to any character
quote: by default the quote character is ", but can be set to any character. Delimiters inside quotes are ignored
escape: by default, the escape character is \, but can be set to any character. Escaped quote characters are ignored
parserLib: by default, it is "commons" that can be set to "univocity" to use that library for CSV parsing.
mode: determines the parsing mode. By default it is PERMISSIVE. Possible values are:
PERMISSIVE: tries to parse all lines: nulls are inserted for missing tokens and extra tokens are ignored.
DROPMALFORMED: drops lines that have fewer or more tokens than expected or tokens which do not match the schema
FAILFAST: aborts with a RuntimeException if encounters any malformed line
charset: defaults to 'UTF-8' but can be set to other valid charset names
inferSchema: automatically infers column types. It requires one extra pass over the data and is false by default
comment: skip lines beginning with this character. Default is "#". Disable comments by setting this to null.
nullValue: specifies a string that indicates a null value, any fields matching this string will be set as nulls in the DataFrame
dateFormat: specifies a string that indicates the date format to use when reading dates or timestamps. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to both DateType and TimestampType. By default, it is null which means trying to parse times and date by java.sql.Timestamp.valueOf() and java.sql.Date.valueOf().
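The three parsing modes listed above can be illustrated with a toy model in plain Scala. This is only a sketch of the behavior as described in the option list, not Spark's implementation: a line counts as malformed when its token count differs from the expected column count, and None stands in for a null. The names CsvParseModes and parse are hypothetical:

```scala
object CsvParseModes {
  sealed trait Mode
  case object Permissive extends Mode
  case object DropMalformed extends Mode
  case object FailFast extends Mode

  // Toy model: split on commas and compare the token count with the expected column count.
  def parse(lines: Seq[String], columns: Int, mode: Mode): Seq[Seq[Option[String]]] =
    lines.flatMap { line =>
      val tokens = line.split(",", -1).toSeq
      if (tokens.length == columns) Some(tokens.map(Option(_)))
      else mode match {
        // Missing tokens become None (null), extra tokens are ignored.
        case Permissive    => Some(tokens.map(Option(_)).padTo(columns, None).take(columns))
        // The whole line is dropped.
        case DropMalformed => None
        // The first malformed line aborts parsing.
        case FailFast      => throw new RuntimeException(s"Malformed line: $line")
      }
    }

  def main(args: Array[String]): Unit = {
    val lines = Seq("1,a", "2", "3,c,extra")
    println(parse(lines, 2, Permissive))
    println(parse(lines, 2, DropMalformed))
  }
}
```

In real use the mode is simply passed as .option("mode", ...) as in the snippet above; the sketch is only meant to make the difference between the three values concrete.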
2. You can do this SQL way as well
val df = spark.sql("SELECT * FROM csv.`hdfs:///csv/file/dir/file.csv`")
Dependencies:
"org.apache.spark" % "spark-core_2.11" % "2.0.0",
"org.apache.spark" % "spark-sql_2.11" % "2.0.0",
Spark version < 2.0
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true")
.option("mode", "DROPMALFORMED")
.load("csv/file/path");
Dependencies:
"org.apache.spark" % "spark-sql_2.10" % "1.6.0",
"com.databricks" % "spark-csv_2.10" % "1.6.0",
"com.univocity" % "univocity-parsers" % LATEST,
This is for those whose Hadoop is 2.6 and Spark is 1.6, without the "databricks" package.
import org.apache.spark.sql.types.{StructType,StructField,StringType,IntegerType};
import org.apache.spark.sql.Row;
val csv = sc.textFile("/path/to/file.csv")
val rows = csv.map(line => line.split(",").map(_.trim))
val header = rows.first
val data = rows.filter(_(0) != header(0))
val rdd = data.map(row => Row(row(0),row(1).toInt))
val schema = new StructType()
.add(StructField("id", StringType, true))
.add(StructField("val", IntegerType, true))
val df = sqlContext.createDataFrame(rdd, schema)
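The split, trim, and header-skipping steps above can be checked without a cluster by running the same logic on a plain Scala collection. This is a minimal sketch with a hypothetical Record case class standing in for the Row and schema; like the snippet above, it assumes no quoted commas inside fields:

```scala
object ManualCsvSketch {
  // Stand-in for Row plus schema: an id column and an integer value column.
  final case class Record(id: String, value: Int)

  def parse(lines: Seq[String]): Seq[Record] = {
    val rows = lines.map(_.split(",").map(_.trim))
    val header = rows.head
    // Same test as above: drop every row whose first column equals the header's first column.
    val data = rows.filter(_(0) != header(0))
    data.map(row => Record(row(0), row(1).toInt))
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample input with a header line.
    parse(Seq("id,val", "a, 1", "b,2")).foreach(println)
  }
}
```

Note that this header test also removes any data row whose first field happens to equal the header name, and that toInt throws on non-numeric values, so the approach suits clean, simple files.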
With Spark 2.0, the following is how you can read CSV:
val conf = new SparkConf().setMaster("local[2]").setAppName("my app")
val sc = new SparkContext(conf)
val sparkSession = SparkSession.builder
.config(conf = conf)
.appName("spark session example")
.getOrCreate()
val path = "/Users/xxx/Downloads/usermsg.csv"
val base_df = sparkSession.read.option("header","true").
csv(path)
With Java 1.8, this code snippet works perfectly for reading CSV files:
POM.xml
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.0.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql_2.11 -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.0.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.scala-lang/scala-library -->
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.11.8</version>
</dependency>
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-csv_2.11</artifactId>
<version>1.4.0</version>
</dependency>
Java
SparkConf conf = new SparkConf().setAppName("JavaWordCount").setMaster("local");
// create Spark Context
SparkContext context = new SparkContext(conf);
// create spark Session
SparkSession sparkSession = new SparkSession(context);
Dataset<Row> df = sparkSession.read().format("com.databricks.spark.csv").option("header", true).option("inferSchema", true).load("hdfs://localhost:9000/usr/local/hadoop_data/loan_100.csv");
//("hdfs://localhost:9000/usr/local/hadoop_data/loan_100.csv");
System.out.println("========== Print Schema ============");
df.printSchema();
System.out.println("========== Print Data ==============");
df.show();
System.out.println("========== Print title ==============");
df.select("title").show();
Penny's Spark 2 example is the way to do it in Spark 2. There's one more trick: have the schema generated for you by doing an initial scan of the data, by setting the option inferSchema to true.
Here, then, assuming that spark is a Spark session you have set up, is the operation to load the CSV index file of all the Landsat images which Amazon hosts on S3.
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
val csvdata = spark.read.options(Map(
"header" -> "true",
"ignoreLeadingWhiteSpace" -> "true",
"ignoreTrailingWhiteSpace" -> "true",
"timestampFormat" -> "yyyy-MM-dd HH:mm:ss.SSSZZZ",
"inferSchema" -> "true",
"mode" -> "FAILFAST"))
.csv("s3a://landsat-pds/scene_list.gz")
The bad news is: this triggers a scan through the file; for something large like this 20+MB zipped CSV file, that can take 30s over a long haul connection. Bear that in mind: you are better off manually coding up the schema once you've got it coming in.
(code snippet Apache Software License 2.0 licensed to avoid all ambiguity; something I've done as a demo/integration test of S3 integration)
There are a lot of challenges in parsing a CSV file, and they keep adding up as the file size grows or when there are non-English, escape, separator, or other special characters in the column values that could cause parsing errors.
The magic then is in the options that are used. The ones that worked for me, and that I hope cover most of the edge cases, are in the code below:
### Create a Spark Session
spark = SparkSession.builder.master("local").appName("Classify Urls").getOrCreate()
### Note the options that are used. You may have to tweak these in case of error
html_df = spark.read.csv(html_csv_file_path,
header=True,
multiLine=True,
ignoreLeadingWhiteSpace=True,
ignoreTrailingWhiteSpace=True,
encoding="UTF-8",
sep=',',
quote='"',
escape='"',
maxColumns=2,
inferSchema=True)
Hope that helps. For more refer: Using PySpark 2 to read CSV having HTML source code
Note: The code above is from Spark 2 API, where the CSV file reading API comes bundled with built-in packages of Spark installable.
Note: PySpark is a Python wrapper for Spark and shares the same API as Scala/Java.
In case you are building a jar with Scala 2.11 and Apache Spark 2.0 or higher:
There is no need to create a sqlContext or sparkContext object. A SparkSession object alone suffices for all needs.
The following is my code, which works fine:
import org.apache.spark.sql.{DataFrame, Row, SQLContext, SparkSession}
import org.apache.log4j.{Level, LogManager, Logger}
object driver {
def main(args: Array[String]) {
val log = LogManager.getRootLogger
log.info("**********JAR EXECUTION STARTED**********")
val spark = SparkSession.builder().master("local").appName("ValidationFrameWork").getOrCreate()
val df = spark.read.format("csv")
.option("header", "true")
.option("delimiter","|")
.option("inferSchema","true")
.load("d:/small_projects/spark/test.pos")
df.show()
}
}
In case you are running in cluster just change .master("local") to .master("yarn") while defining the sparkBuilder object
The Spark Doc covers this:
https://spark.apache.org/docs/2.2.0/sql-programming-guide.html
With Spark 2.4+, if you want to load a csv from a local directory, then you can use 2 sessions and load that into hive. The first session should be created with master() config as "local[*]" and the second session with "yarn" and Hive enabled.
The below one worked for me.
import org.apache.log4j.{Level, Logger}
import org.apache.spark._
import org.apache.spark.rdd._
import org.apache.spark.sql._
object testCSV {
def main(args: Array[String]) {
Logger.getLogger("org").setLevel(Level.ERROR)
val spark_local = SparkSession.builder().appName("CSV local files reader").master("local[*]").getOrCreate()
import spark_local.implicits._
spark_local.sql("SET").show(100,false)
val local_path="/tmp/data/spend_diversity.csv" // Local file
val df_local = spark_local.read.format("csv").option("inferSchema","true").load("file://"+local_path) // "file://" is mandatory
df_local.show(false)
val spark = SparkSession.builder().appName("CSV HDFS").config("spark.sql.warehouse.dir", "/apps/hive/warehouse").enableHiveSupport().getOrCreate()
import spark.implicits._
spark.sql("SET").show(100,false)
val df = df_local
df.createOrReplaceTempView("lcsv")
spark.sql(" drop table if exists work.local_csv ")
spark.sql(" create table work.local_csv as select * from lcsv ")
}
}
When run with spark2-submit --master "yarn" --conf spark.ui.enabled=false testCSV.jar, it went fine and created the table in Hive.
Add following Spark dependencies to POM file :
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.2.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.2.0</version>
</dependency>
Spark configuration:
val spark = SparkSession.builder().master("local").appName("Sample App").getOrCreate()
Read csv file:
val df = spark.read.option("header", "true").csv("FILE_PATH")
Display output:
df.show()
Try this if using spark 2.0+
For non-hdfs file:
df = spark.read.csv("file:///csvfile.csv")
For hdfs file:
df = spark.read.csv("hdfs:///csvfile.csv")
For hdfs file (with a delimiter other than comma):
df = spark.read.option("delimiter","|").csv("hdfs:///csvfile.csv")
Note: this works for any delimited file. Just use option("delimiter", ...) to change the value.
Hope this is helpful.
To read from a relative path on the system, use the System.getProperty method to get the current directory and then use it to build the path of the file to load.
scala> val path = System.getProperty("user.dir").concat("/../2015-summary.csv")
scala> val csvDf = spark.read.option("inferSchema","true").option("header", "true").csv(path)
scala> csvDf.take(3)
spark:2.4.4 scala:2.11.12
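As a variation on the string concatenation above, the relative path can be resolved and normalized with java.nio.file before it is handed to Spark. This is a minimal sketch; the helper name resolveFromWorkingDir is hypothetical and the file name is the one from the example:

```scala
import java.nio.file.{Path, Paths}

object RelativeCsvPath {
  // Resolve a path relative to the JVM's working directory and collapse any ".." segments.
  def resolveFromWorkingDir(relative: String): Path =
    Paths.get(System.getProperty("user.dir")).resolve(relative).normalize()

  def main(args: Array[String]): Unit = {
    val path = resolveFromWorkingDir("../2015-summary.csv")
    // The resulting string could then be passed to spark.read...csv(path.toString).
    println(path)
  }
}
```

Normalizing first yields an absolute path without ".." segments, which tends to be easier to read in logs and error messages.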
The default file format for the generic load API is Parquet, but the file you are reading is CSV, which is why you are getting the exception. Specify the csv format with the API you are trying to use.
With the built-in Spark CSV support, you can get it done easily with the new SparkSession object in Spark 2.0+.
val df = spark.
read.
option("inferSchema", "false").
option("header","true").
option("mode","DROPMALFORMED").
option("delimiter", ";").
schema(dataSchema).
csv("/csv/file/dir/file.csv")
df.show()
df.printSchema()
There are various options you can set.
header: whether your file includes header line at the top
inferSchema: whether you want to infer the schema automatically or not. Default is false. I always prefer to provide the schema to ensure proper datatypes.
mode: parsing mode, PERMISSIVE, DROPMALFORMED or FAILFAST
delimiter: to specify delimiter, default is comma(',')