Scala: Error reading Kafka Avro messages from Spark Structured Streaming

I have been trying to read Avro-serialized Kafka messages from Spark Structured Streaming (2.4.4) with Scala 2.11.
For this purpose I used spark-avro (dependency below).
I generate the Kafka messages from Python using the confluent-kafka library.
Spark Structured Streaming consumes the messages and applies the schema, but it does not read the field values correctly.
I have prepared a simple example that shows the problem; the code is available here:
https://github.com/anigmo97/SimpleExamples/tree/master/Spark_streaming_kafka_avro_scala
I create the records in Python. The schema of the records is:
{
  "type": "record",
  "namespace": "example",
  "name": "RawRecord",
  "fields": [
    {"name": "int_field", "type": "int"},
    {"name": "string_field", "type": "string"}
  ]
}
And they are generated like this:
from time import sleep
from confluent_kafka.avro import AvroProducer, load, loads

def generate_records():
    avro_producer_settings = {
        'bootstrap.servers': "localhost:19092",
        'group.id': 'groupid',
        'schema.registry.url': "http://127.0.0.1:8081"
    }
    producer = AvroProducer(avro_producer_settings)
    key_schema = loads('"string"')
    value_schema = load("schema.avsc")
    i = 1
    while True:
        row = {"int_field": int(i), "string_field": str(i)}
        producer.produce(topic="avro_topic", key="key-{}".format(i),
                         value=row, key_schema=key_schema, value_schema=value_schema)
        print(row)
        sleep(1)
        i += 1
The consumption from spark structured streaming (in Scala) is done like this:
import java.nio.file.{Files, Paths}
import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.sql.streaming.{OutputMode, StreamingQuery}
import org.apache.spark.sql.avro._
...
try {
  log.info("----- reading schema")
  // Load the Avro schema (JSON text) that from_avro will use
  val jsonFormatSchema = new String(Files.readAllBytes(
    Paths.get("./src/main/resources/schema.avsc")))
  val ds: Dataset[Row] = sparkSession
    .readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", kafkaServers)
    .option("subscribe", topic)
    .load()
  // Decode the Kafka value column as standard Avro and flatten the record
  val output: Dataset[Row] = ds
    .select(from_avro(ds.col("value"), jsonFormatSchema) as "record")
    .select("record.*")
  output.printSchema()
  val query: StreamingQuery = output.writeStream.format("console")
    .option("truncate", "false").outputMode(OutputMode.Append()).start()
  query.awaitTermination()
} catch {
  case e: Exception => log.error("onApplicationEvent error: ", e)
  //case _: Throwable => log.error("onApplicationEvent error:")
}
...
When I print the schema in Spark, it is strange that the fields are nullable even though the Avro schema does not allow nulls.
Spark shows this:
root
|-- int_field: integer (nullable = true)
|-- string_field: string (nullable = true)
I have checked the messages with another consumer in Python and they are fine, but
regardless of the message content Spark always shows this:
+---------+------------+
|int_field|string_field|
+---------+------------+
|0 | |
+---------+------------+
The main dependencies used are:
<properties>
<spark.version>2.4.4</spark.version>
<scala.version>2.11</scala.version>
</properties>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_${scala.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_${scala.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-avro_${scala.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_${scala.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql-kafka-0-10_${scala.version}</artifactId>
<version>${spark.version}</version>
</dependency>
Does anyone know why this might be happening?
Thanks in advance.
The code to reproduce the error is here:
https://github.com/anigmo97/SimpleExamples/tree/master/Spark_streaming_kafka_avro_scala
SOLUTION
The problem was that I was producing the messages with the confluent_kafka library in Python
and reading them in Spark Structured Streaming with the spark-avro library.
The confluent_kafka library uses Confluent's Avro wire format, while spark-avro
expects standard Avro binary.
The difference is that, in order to work with Schema Registry, Confluent Avro
prepends each message with a magic byte followed by a four-byte schema ID that
indicates which schema should be used.
Source:
https://www.confluent.io/blog/kafka-connect-tutorial-transfer-avro-schemas-across-schema-registry-clusters/
To be able to use Confluent Avro and read it from Spark Structured
Streaming, I replaced the spark-avro library with ABRiS (ABRiS integrates
both standard Avro and Confluent Avro with Spark).
https://github.com/AbsaOSS/ABRiS
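A small Python sketch of why a standard Avro reader would likely show 0 and an empty string for every message. It assumes the Confluent framing is one zero magic byte plus a big-endian four-byte schema ID, and it hand-rolls just enough Avro binary decoding (zigzag varint ints, length-prefixed strings) for the RawRecord schema above. The helper names and the schema ID 1 are made up for illustration; this is not how spark-avro or ABRiS are implemented.

```python
import struct


def write_long(n):
    """Encode an integer as an Avro zigzag varint."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def read_long(buf, pos):
    """Decode one Avro zigzag varint at pos; return (value, next_pos)."""
    shift = 0
    acc = 0
    while True:
        b = buf[pos]
        pos += 1
        acc |= (b & 0x7F) << shift
        if not (b & 0x80):
            break
        shift += 7
    return (acc >> 1) ^ -(acc & 1), pos


def encode_record(int_field, string_field):
    """Standard Avro binary for RawRecord: int, then length-prefixed UTF-8 string."""
    raw = string_field.encode("utf-8")
    return write_long(int_field) + write_long(len(raw)) + raw


def decode_record(buf):
    """Decode RawRecord from the start of buf; trailing bytes are ignored."""
    int_field, pos = read_long(buf, 0)
    length, pos = read_long(buf, pos)
    return int_field, buf[pos:pos + length].decode("utf-8")


def confluent_frame(schema_id, payload):
    """Prepend the assumed Confluent header: magic byte 0 + 4-byte schema ID."""
    return struct.pack(">bI", 0, schema_id) + payload


payload = encode_record(7, "7")
framed = confluent_frame(1, payload)  # schema ID 1 is hypothetical

print(decode_record(framed))      # (0, '')   header bytes misread as the fields
print(decode_record(framed[5:]))  # (7, '7')  after skipping the 5-byte header
```

With a small schema ID the first two header bytes are both zero, so the int decodes as 0 and the string length as 0, which matches the console output in the question.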

My dependencies changed like this:
<properties>
<spark.version>2.4.4</spark.version>
<scala.version>2.11</scala.version>
</properties>
<!-- SPARK- AVRO -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-avro_${scala.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<!-- SPARK -AVRO AND CONFLUENT-AVRO -->
<dependency>
<groupId>za.co.absa</groupId>
<artifactId>abris_2.11</artifactId>
<version>3.1.1</version>
</dependency>
And here is a simple example that reads the messages and deserializes their values both as standard Avro and as Confluent Avro.
val input: Dataset[Row] = sparkSession.readStream
  //.format("org.apache.spark.sql.kafka010.KafkaSourceProvider")
  .format("kafka")
  .option("kafka.bootstrap.servers", kafkaServers)
  .option("subscribe", topicConsumer)
  .option("failOnDataLoss", "false")
  // .option("startingOffsets", "latest")
  // .option("startingOffsets", "earliest")
  .load()

// READ WITH the spark-avro library (standard Avro)
val jsonFormatSchema = new String(Files.readAllBytes(Paths.get("./src/main/resources/schema.avsc")))
val inputAvroDeserialized: Dataset[Row] = input
  .select(from_avro(functions.col("value"), jsonFormatSchema) as "record")
  .select("record.*")

// READ WITH the ABRiS library (Confluent Avro)
val schemaRegistryConfig = Map(
  SchemaManager.PARAM_SCHEMA_REGISTRY_URL -> "http://localhost:8081",
  SchemaManager.PARAM_SCHEMA_REGISTRY_TOPIC -> topicConsumer,
  SchemaManager.PARAM_VALUE_SCHEMA_NAMING_STRATEGY -> SchemaManager.SchemaStorageNamingStrategies.TOPIC_NAME, // choose a subject name strategy
  SchemaManager.PARAM_VALUE_SCHEMA_ID -> "latest" // set to "latest" if you want the latest schema version to be used
)
// The original snippet read from inputConfluentAvroSerialized, which is not defined here;
// the raw Kafka stream `input` from above is assumed instead.
val inputConfluentAvroDeserialized: Dataset[Row] = input
  .select(from_confluent_avro(functions.col("value"), schemaRegistryConfig) as "record")
  .select("record.*")

Related

How to parse confluent avro messages in Spark

Currently I am using the ABRiS library to deserialize Confluent Avro messages read from Kafka. It works well when the topic contains messages with only one schema version. As soon as the topic has data with different versions, it starts giving me a "malformed data found" error, which makes sense because I am passing SchemaManager.PARAM_VALUE_SCHEMA_ID -> "latest" when creating the config.
My question is how to find out the schema ID at run time, for each record, and then pass it to the ABRiS config. Here is the sample code:
Spark version: Spark 2.4.0
Scala :2.11.12
Abris:5.0.0
def getTopicSchemaMap(topicNm: String): Map[String, String] = {
Map(
SchemaManager.PARAM_SCHEMA_REGISTRY_TOPIC -> topicNm,
SchemaManager.PARAM_SCHEMA_REGISTRY_URL -> schemaRegUrl,
SchemaManager.PARAM_VALUE_SCHEMA_NAMING_STRATEGY -> "topic.name",
SchemaManager.PARAM_VALUE_SCHEMA_ID -> "latest")
}
val kafkaDataFrameRaw = spark
.read
.format("kafka")
.option("kafka.bootstrap.servers", kafkaUrl)
.option("subscribe", topics)
.option("maxOffsetsPerTrigger", maxOffsetsPerTrigger)
.option("startingOffsets", "earliest")
.option("failOnDataLoss", false)
.load()
val df= kafkaDataFrameRaw.select(
from_confluent_avro(col("value"), getTopicSchemaMap(topicNm)) as 'value, col("offset").as("offsets"))
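For what it is worth, the schema ID travels inside every Confluent-framed message, so it can be read per record from the raw value bytes. The sketch below is plain Python and only shows that idea; it assumes the usual framing (one zero magic byte followed by a big-endian four-byte schema ID), the function names are hypothetical, and it does not show how to feed the ID back into ABRiS.

```python
import struct
from collections import defaultdict


def schema_id_of(message):
    """Return the schema ID stored in the assumed 5-byte Confluent header."""
    if len(message) < 5:
        raise ValueError("message shorter than the 5-byte header")
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != 0:
        raise ValueError("unexpected magic byte: %d" % magic)
    return schema_id


def group_by_schema_id(messages):
    """Bucket the Avro payloads (header stripped) by the schema ID they carry."""
    groups = defaultdict(list)
    for message in messages:
        groups[schema_id_of(message)].append(message[5:])
    return dict(groups)
```

Each bucket could then be decoded with the schema that Schema Registry returns for that ID, instead of decoding everything with "latest".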

Not able to perform transformations and extract JSON values from Flink DataStream and Kafka Topic

I am trying to read data from a Kafka topic, and I can read it successfully. However, I want to extract fields from each message and return them as a tuple. When I try to do that with a map operation, the compiler rejects it with "cannot resolve overloaded method 'map'". Below is my code:
package KafkaAsSource
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.datastream.DataStream
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
import java.util.Properties
object ReadAndValidateJSON {
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment()
//env.enableCheckpointing(5000)
val properties = new Properties()
properties.setProperty("bootstrap.servers", "localhost:9092")
properties.setProperty("zookeeper.connect", "localhost:2181")
val data:DataStream[String] = getDataFromKafkaTopic(properties,env)
val mappedData: DataStream[jsonData] = data.map(v => v)
.map {
v =>
val id = v["id"]
val category = v["category"]
val eventTime = v["eventTime"]
jsonData(id,category,eventTime)
}
data.print()
env.execute("ReadAndValidateJSON")
}
def getDataFromKafkaTopic(properties: Properties,env:StreamExecutionEnvironment): DataStream[String] = {
val consumer = new FlinkKafkaConsumer[String]("maddy1", new SimpleStringSchema(), properties)
consumer.setStartFromEarliest()
val src: DataStream[String] = env.addSource(consumer)
return src
}
}
Pom.xml
<dependencies>
<!-- https://mvnrepository.com/artifact/org.apache.flink/flink-core -->
<!-- https://mvnrepository.com/artifact/org.apache.flink/flink-streaming-scala -->
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-streaming-scala_2.11</artifactId>
<version>${flink-version}</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.flink/flink-clients -->
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-clients_2.11</artifactId>
<version>${flink-version}</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-kafka_2.11</artifactId>
<version>${flink-version}</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.flink/flink-core -->
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-core</artifactId>
<version>${flink-version}</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-cassandra_2.11</artifactId>
<version>${flink-version}</version>
</dependency>
</dependencies>
Kafka Topic Data:
{
"id":"7",
"Category":"Flink",
"eventTime":"2021-12-27 20:52:58.708"
}
{
"id":"9",
"Category":"Flink",
"eventTime":"2021-12-27 20:52:58.727"
}
{
"id":"10",
"Category":"Flink",
"eventTime":"2021-12-27 20:52:58.734"
}
Where am I exactly going wrong? Are the dependencies correct? My Flink version is 1.12.2
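Try adding the Scala API import. The code above imports the Java DataStream and StreamExecutionEnvironment, whereas the Scala API package provides the Scala DataStream and the implicit TypeInformation that its map method needs: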
import org.apache.flink.streaming.api.scala._
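Separately from the import, the mapping step itself (JSON string in, typed record out) can be sketched in a few lines. This is plain Python rather than Flink code, and the JsonData and parse_event names are only illustrative. Note that the sample topic data spells the key "Category" with a capital C, so a lookup for "category" would fail.

```python
import json
from typing import NamedTuple


class JsonData(NamedTuple):
    id: str
    category: str
    event_time: str


def parse_event(raw):
    """Parse one JSON message from the topic into a JsonData tuple."""
    obj = json.loads(raw)
    # Key names follow the sample data above, including the capitalized "Category".
    return JsonData(obj["id"], obj["Category"], obj["eventTime"])
```

In the Flink job the same logic would live inside the function passed to map, using whichever JSON parser the project already depends on.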

write into kafka topic using spark and scala

I am reading data from a Kafka topic and writing the received data back into another Kafka topic.
Below is my code:
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.sql.ForeachWriter
//loading data from kafka
val data = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", "*******:9092")
.option("subscribe", "PARAMTABLE")
.option("startingOffsets", "latest")
.load()
//Extracting value from Json
val schema = new StructType().add("PARAM_INSTANCE_ID",IntegerType).add("ENTITY_ID",IntegerType).add("PARAM_NAME",StringType).add("VALUE",StringType)
val df1 = data.selectExpr("CAST(value AS STRING)")
val dataDF = df1.select(from_json(col("value"), schema).as("data")).select("data.*")
//Insert into another Kafka topic
val topic = "SparkParamValues"
val brokers = "********:9092"
val writer = new KafkaSink(topic, brokers)
val query = dataDF.writeStream
.foreach(writer)
.outputMode("update")
.start().awaitTermination()
I am getting the below error,
<Console>:47:error :not found: type KafkaSink
val writer = new KafkaSink(topic, brokers)
I am very new to Spark. Could someone suggest how to resolve this, or verify whether the code above is correct? Thanks in advance.
In Spark Structured Streaming, you can write to a Kafka topic after reading from another topic either with the existing Kafka DataStreamWriter or with your own sink created by extending the ForeachWriter class.
Without using custom sink:
You can use the code below to write a DataFrame to Kafka, assuming df is the DataFrame produced by reading from the Kafka topic.
The DataFrame must have at least one column named value. If you have multiple columns, merge them into one column and name it value. If no key column is specified, the key is null in the destination topic.
df.select("key", "value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("topic", "<topicName>")
  // The Kafka sink needs a checkpoint location unless spark.sql.streaming.checkpointLocation is set
  .option("checkpointLocation", "<checkpointDir>")
  .start()
  .awaitTermination()
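To make the "merge the columns into one value column" step concrete, here is a minimal Python sketch that turns a row (a dict of column values) into the key/value pair that would be sent to Kafka. The function name and the choice of JSON as the value encoding are assumptions for illustration; in Spark itself this is typically expressed with to_json(struct(...)) on the DataFrame.

```python
import json


def to_kafka_record(row, key_field=None):
    """Merge all columns of a row into one JSON string named value.

    The key stays None (null in the destination topic) unless key_field is given.
    """
    key = None if key_field is None else str(row[key_field])
    value = json.dumps(row, sort_keys=True)
    return key, value


row = {"PARAM_INSTANCE_ID": 1, "ENTITY_ID": 7, "PARAM_NAME": "x", "VALUE": "42"}
print(to_kafka_record(row))
print(to_kafka_record(row, key_field="ENTITY_ID"))
```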
Using custom sink:
If you want to implement your own Kafka sink, create a class that extends ForeachWriter, override its three methods, and pass an instance of the class to the foreach() method.
// Using an anonymous class to extend ForeachWriter
df.writeStream.foreach(new ForeachWriter[Row] {
  // If you are writing a Dataset[String], use new ForeachWriter[String] instead
  override def open(partitionId: Long, version: Long): Boolean = {
    // open the connection
    true // return true so that the rows of this partition are processed
  }
  override def process(record: Row): Unit = {
    // write the row to the connection
  }
  override def close(errorOrNull: Throwable): Unit = {
    // close the connection
  }
}).start()
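The calling order of the three methods is easier to see in a tiny emulation. The sketch below is plain Python, not Spark: ListSink and run_partition are hypothetical names, and the driver reflects my understanding of the documented contract (open once per partition, process for each row when open returns true, close at the end with the error if one occurred).

```python
class ListSink:
    """Hypothetical writer with the same three-method shape as ForeachWriter."""

    def __init__(self):
        self.written = []
        self.events = []

    def open(self, partition_id, epoch_id):
        self.events.append(("open", partition_id, epoch_id))
        return True  # True means: go ahead and process this partition

    def process(self, row):
        self.written.append(row)

    def close(self, error):
        self.events.append(("close", error))


def run_partition(writer, partition_id, epoch_id, rows):
    """Emulate the lifecycle for one partition: open, process each row, close."""
    opened = writer.open(partition_id, epoch_id)
    error = None
    try:
        if opened:
            for row in rows:
                writer.process(row)
    except Exception as exc:
        error = exc
        raise
    finally:
        # close always runs once open has returned, and receives the error if any
        writer.close(error)
```

A real Kafka sink would create its producer in open, send one record per row in process, and close the producer in close.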
You can check this Databricks notebook for the implemented code (scroll down to the code under the Kafka Sink heading); I think this is the page you are referring to. To solve the issue, make sure the KafkaSink class is available to your Spark code. You can put the Spark code file and the class file in the same package. If you are running in spark-shell, paste the KafkaSink class before pasting the Spark code.
Read the Structured Streaming Kafka integration guide to explore more.

Parsing json in spark

I was using the Scala JSON library to parse a JSON file from a local drive in a Spark job:
val requestJson=JSON.parseFull(Source.fromFile("c:/data/request.json").mkString)
val mainJson=requestJson.get.asInstanceOf[Map[String,Any]].get("Request").get.asInstanceOf[Map[String,Any]]
val currency=mainJson.get("currency").get.asInstanceOf[String]
But when I try to use the same parser with an HDFS file location, it doesn't work:
val requestJson=JSON.parseFull(Source.fromFile("hdfs://url/user/request.json").mkString)
and gives me an error:
java.io.FileNotFoundException: hdfs:/localhost/user/request.json (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at scala.io.Source$.fromFile(Source.scala:91)
at scala.io.Source$.fromFile(Source.scala:76)
at scala.io.Source$.fromFile(Source.scala:54)
... 128 elided
How can I use JSON.parseFull to get data from an HDFS file location?
Thanks
Spark has built-in support for parsing JSON documents, available in the spark-sql_${scala.version} jar.
In Spark 2.0+ :
import org.apache.spark.sql.SparkSession
val spark: SparkSession = SparkSession.builder.master("local").getOrCreate
val df = spark.read.format("json").json("json/file/location/in/hdfs")
df.show()
With the df object you can run all supported SQL operations, and its data processing is distributed among the nodes, whereas requestJson is computed on a single machine only.
Maven dependencies
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.0.0</version>
</dependency>
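Edit: (as per comment, to read the file from HDFS)
import java.io.{BufferedReader, InputStreamReader}
import org.apache.hadoop.fs.Path

val hdfs = org.apache.hadoop.fs.FileSystem.get(
  new java.net.URI("hdfs://ITS-Hadoop10:9000/"),
  new org.apache.hadoop.conf.Configuration()
)
val path = new Path("/user/zhc/" + x + "/") // x is not defined in this snippet; it comes from the credited code
val t = hdfs.listStatus(path)
val in = hdfs.open(t(0).getPath) // open the first file in the directory
val reader = new BufferedReader(new InputStreamReader(in))
var l = reader.readLine()
Code credits: from another SO question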
Maven dependencies:
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs</artifactId>
<version>2.7.2</version> <!-- you can change this as per your hadoop version -->
</dependency>
It is much easier in Spark 2.0:
val df = spark.read.json("json/file/location/in/hdfs")
df.show()
You can use the following in Spark to read the file from HDFS:
val jsonText = sc.textFile("hdfs://url/user/request.json").collect.mkString("\n")

Spark - load CSV file as DataFrame?

I would like to read a CSV file in Spark, convert it to a DataFrame, and store it in HDFS with df.registerTempTable("table_name")
I have tried:
scala> val df = sqlContext.load("hdfs:///csv/file/dir/file.csv")
Error which I got:
java.lang.RuntimeException: hdfs:///csv/file/dir/file.csv is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [49, 59, 54, 10]
at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:418)
at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$6.apply(newParquet.scala:277)
at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$6.apply(newParquet.scala:276)
at scala.collection.parallel.mutable.ParArray$Map.leaf(ParArray.scala:658)
at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:54)
at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
at scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:56)
at scala.collection.parallel.mutable.ParArray$Map.tryLeaf(ParArray.scala:650)
at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:165)
at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:514)
at scala.concurrent.forkjoin.RecursiveAction.exec(RecursiveAction.java:160)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
What is the right command to load CSV file as DataFrame in Apache Spark?
Since Spark 2.0, CSV support is part of core Spark functionality and does not require a separate library.
So you could just do, for example:
df = spark.read.format("csv").option("header", "true").load("csvfile.csv")
In Scala (this works for any delimited format; pass "," as the delimiter for CSV, "\t" for TSV, and so on):
val df = sqlContext.read.format("com.databricks.spark.csv")
  .option("delimiter", ",")
  .load("csvfile.csv")
Parse CSV and load as DataFrame/DataSet with Spark 2.x
First, initialize a SparkSession object; by default it is available in the shells as spark.
val spark = org.apache.spark.sql.SparkSession.builder
  .master("local") // Change it as per your cluster
  .appName("Spark CSV Reader")
  .getOrCreate()
Use any one of the following ways to load CSV as DataFrame/DataSet
1. Do it in a programmatic way
val df = spark.read
.format("csv")
.option("header", "true") //first line in file has headers
.option("mode", "DROPMALFORMED")
.load("hdfs:///csv/file/dir/file.csv")
Update: Adding all the options from here in case the link breaks in the future.
path: location of files. As elsewhere in Spark, standard Hadoop globbing expressions are accepted.
header: when set to true, the first line of the files is used to name columns and is not included in the data. All types are assumed to be string. The default value is false.
delimiter: by default columns are delimited using ",", but the delimiter can be set to any character.
quote: by default the quote character is ", but it can be set to any character. Delimiters inside quotes are ignored.
escape: by default the escape character is \, but it can be set to any character. Escaped quote characters are ignored.
parserLib: by default it is "commons"; it can be set to "univocity" to use that library for CSV parsing.
mode: determines the parsing mode. By default it is PERMISSIVE. Possible values are:
  PERMISSIVE: tries to parse all lines; nulls are inserted for missing tokens and extra tokens are ignored.
  DROPMALFORMED: drops lines that have fewer or more tokens than expected, or tokens which do not match the schema.
  FAILFAST: aborts with a RuntimeException if it encounters any malformed line.
charset: defaults to 'UTF-8' but can be set to other valid charset names.
inferSchema: automatically infers column types. It requires one extra pass over the data and is false by default.
comment: skip lines beginning with this character. The default is "#". Disable comments by setting this to null.
nullValue: specifies a string that indicates a null value; any fields matching this string are set to null in the DataFrame.
dateFormat: specifies a string that indicates the date format to use when reading dates or timestamps. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to both DateType and TimestampType. By default it is null, which means times and dates are parsed with java.sql.Timestamp.valueOf() and java.sql.Date.valueOf().
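The three parsing modes are easiest to compare side by side. Below is a rough Python sketch of the behaviour described in the list above, restricted to the token-count rule (it does not check tokens against a schema). parse_csv is a hypothetical helper built on the standard csv module, not Spark's parser.

```python
import csv


def parse_csv(lines, n_columns, mode="PERMISSIVE"):
    """Parse CSV lines, handling rows with the wrong token count according to mode."""
    if mode not in ("PERMISSIVE", "DROPMALFORMED", "FAILFAST"):
        raise ValueError("unknown mode: %s" % mode)
    rows = []
    for tokens in csv.reader(lines):
        if len(tokens) != n_columns:
            if mode == "DROPMALFORMED":
                continue  # silently drop the malformed line
            if mode == "FAILFAST":
                raise RuntimeError("malformed line: %r" % (tokens,))
            # PERMISSIVE: pad missing tokens with None and ignore extra tokens
            tokens = (tokens + [None] * n_columns)[:n_columns]
        rows.append(tokens)
    return rows


lines = ["1,a", "2", "3,c,extra"]
print(parse_csv(lines, 2, "PERMISSIVE"))     # [['1', 'a'], ['2', None], ['3', 'c']]
print(parse_csv(lines, 2, "DROPMALFORMED"))  # [['1', 'a']]
```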
2. You can do this SQL way as well
val df = spark.sql("SELECT * FROM csv.`hdfs:///csv/file/dir/file.csv`")
Dependencies:
"org.apache.spark" % "spark-core_2.11" % "2.0.0",
"org.apache.spark" % "spark-sql_2.11" % "2.0.0",
Spark version < 2.0
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true")
.option("mode", "DROPMALFORMED")
.load("csv/file/path");
Dependencies:
"org.apache.spark" % "spark-sql_2.10" % "1.6.0",
"com.databricks" % "spark-csv_2.10" % "1.6.0",
"com.univocity" % "univocity-parsers" % LATEST,
This is for those running Hadoop 2.6 and Spark 1.6 without the "databricks" package.
import org.apache.spark.sql.types.{StructType,StructField,StringType,IntegerType};
import org.apache.spark.sql.Row;
val csv = sc.textFile("/path/to/file.csv")
val rows = csv.map(line => line.split(",").map(_.trim))
val header = rows.first
val data = rows.filter(_(0) != header(0))
val rdd = data.map(row => Row(row(0),row(1).toInt))
val schema = new StructType()
.add(StructField("id", StringType, true))
.add(StructField("val", IntegerType, true))
val df = sqlContext.createDataFrame(rdd, schema)
With Spark 2.0, following is how you can read CSV
val conf = new SparkConf().setMaster("local[2]").setAppName("my app")
val sc = new SparkContext(conf)
val sparkSession = SparkSession.builder
.config(conf = conf)
.appName("spark session example")
.getOrCreate()
val path = "/Users/xxx/Downloads/usermsg.csv"
val base_df = sparkSession.read.option("header","true").
csv(path)
With Java 1.8, this code snippet works well for reading CSV files:
POM.xml
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.0.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql_2.10 -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.10</artifactId>
<version>2.0.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.scala-lang/scala-library -->
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.11.8</version>
</dependency>
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-csv_2.10</artifactId>
<version>1.4.0</version>
</dependency>
Java
SparkConf conf = new SparkConf().setAppName("JavaWordCount").setMaster("local");
// create Spark Context
SparkContext context = new SparkContext(conf);
// create spark Session
SparkSession sparkSession = new SparkSession(context);
Dataset<Row> df = sparkSession.read().format("com.databricks.spark.csv").option("header", true).option("inferSchema", true).load("hdfs://localhost:9000/usr/local/hadoop_data/loan_100.csv");
//("hdfs://localhost:9000/usr/local/hadoop_data/loan_100.csv");
System.out.println("========== Print Schema ============");
df.printSchema();
System.out.println("========== Print Data ==============");
df.show();
System.out.println("========== Print title ==============");
df.select("title").show();
Penny's Spark 2 example is the way to do it in Spark 2. There's one more trick: have the schema generated for you by an initial scan of the data, by setting the option inferSchema to true.
Here, then, assuming that spark is a Spark session you have set up, is the operation to load the CSV index file of all the Landsat images which Amazon hosts on S3.
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
val csvdata = spark.read.options(Map(
"header" -> "true",
"ignoreLeadingWhiteSpace" -> "true",
"ignoreTrailingWhiteSpace" -> "true",
"timestampFormat" -> "yyyy-MM-dd HH:mm:ss.SSSZZZ",
"inferSchema" -> "true",
"mode" -> "FAILFAST"))
.csv("s3a://landsat-pds/scene_list.gz")
The bad news is: this triggers a scan through the file; for something large like this 20+MB zipped CSV file, that can take 30s over a long haul connection. Bear that in mind: you are better off manually coding up the schema once you've got it coming in.
(code snippet Apache Software License 2.0 licensed to avoid all ambiguity; something I've done as a demo/integration test of S3 integration)
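To see why inference costs a scan, here is a toy Python version of schema inference: it has to look at every value of every column before it can commit to a type. The function names and the three-type ladder (int, double, string) are simplifications for illustration and are not Spark's actual inference rules.

```python
import csv


def infer_type(values):
    """Return the narrowest of int, double, string that fits every value."""
    for caster, name in ((int, "int"), (float, "double")):
        try:
            for value in values:
                caster(value)
            return name
        except ValueError:
            continue
    return "string"


def infer_schema(lines):
    """Infer (column name, type) pairs from CSV lines with a header row."""
    reader = csv.reader(lines)
    header = next(reader)
    columns = list(zip(*reader))  # consumes every remaining row: the extra pass
    return [(name, infer_type(col)) for name, col in zip(header, columns)]


print(infer_schema(["id,score,label", "1,0.5,a", "2,3,b"]))
# [('id', 'int'), ('score', 'double'), ('label', 'string')]
```

Once the inferred schema is known, writing it out by hand and passing it to the reader avoids repeating that pass on every load.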
There are a lot of challenges in parsing a CSV file, and they add up as the file gets bigger or when column values contain non-English, escape, separator, or other special characters that can cause parsing errors.
The magic, then, is in the options that are used. The ones that worked for me, and that I hope cover most of the edge cases, are in the code below:
from pyspark.sql import SparkSession

### Create a Spark Session
spark = SparkSession.builder.master("local").appName("Classify Urls").getOrCreate()
### Note the options that are used. You may have to tweak these in case of error
html_df = spark.read.csv(html_csv_file_path,
                         header=True,
                         multiLine=True,
                         ignoreLeadingWhiteSpace=True,
                         ignoreTrailingWhiteSpace=True,
                         encoding="UTF-8",
                         sep=',',
                         quote='"',
                         escape='"',
                         maxColumns=2,
                         inferSchema=True)
Hope that helps. For more refer: Using PySpark 2 to read CSV having HTML source code
Note: The code above is from Spark 2 API, where the CSV file reading API comes bundled with built-in packages of Spark installable.
Note: PySpark is a Python wrapper for Spark and shares the same API as Scala/Java.
In case you are building a jar with Scala 2.11 and Apache Spark 2.0 or higher:
There is no need to create a sqlContext or sparkContext object. A SparkSession object alone is sufficient for all needs.
The following is my code, which works fine:
import org.apache.spark.sql.{DataFrame, Row, SQLContext, SparkSession}
import org.apache.log4j.{Level, LogManager, Logger}
object driver {
def main(args: Array[String]) {
val log = LogManager.getRootLogger
log.info("**********JAR EXECUTION STARTED**********")
val spark = SparkSession.builder().master("local").appName("ValidationFrameWork").getOrCreate()
val df = spark.read.format("csv")
.option("header", "true")
.option("delimiter","|")
.option("inferSchema","true")
.load("d:/small_projects/spark/test.pos")
df.show()
}
}
If you are running on a cluster, just change .master("local") to .master("yarn") when defining the SparkSession builder.
The Spark Doc covers this:
https://spark.apache.org/docs/2.2.0/sql-programming-guide.html
With Spark 2.4+, if you want to load a csv from a local directory, then you can use 2 sessions and load that into hive. The first session should be created with master() config as "local[*]" and the second session with "yarn" and Hive enabled.
The below one worked for me.
import org.apache.log4j.{Level, Logger}
import org.apache.spark._
import org.apache.spark.rdd._
import org.apache.spark.sql._
object testCSV {
def main(args: Array[String]) {
Logger.getLogger("org").setLevel(Level.ERROR)
val spark_local = SparkSession.builder().appName("CSV local files reader").master("local[*]").getOrCreate()
import spark_local.implicits._
spark_local.sql("SET").show(100,false)
val local_path="/tmp/data/spend_diversity.csv" // Local file
val df_local = spark_local.read.format("csv").option("inferSchema","true").load("file://"+local_path) // "file://" is mandatory
df_local.show(false)
val spark = SparkSession.builder().appName("CSV HDFS").config("spark.sql.warehouse.dir", "/apps/hive/warehouse").enableHiveSupport().getOrCreate()
import spark.implicits._
spark.sql("SET").show(100,false)
val df = df_local
df.createOrReplaceTempView("lcsv")
spark.sql(" drop table if exists work.local_csv ")
spark.sql(" create table work.local_csv as select * from lcsv ")
}
}
When run with spark2-submit --master "yarn" --conf spark.ui.enabled=false testCSV.jar, it went fine and created the table in Hive.
Add following Spark dependencies to POM file :
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.2.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.2.0</version>
</dependency>
Spark configuration:
val spark = SparkSession.builder().master("local").appName("Sample App").getOrCreate()
Read csv file:
val df = spark.read.option("header", "true").csv("FILE_PATH")
Display output:
df.show()
Try this if using spark 2.0+
For non-hdfs file:
df = spark.read.csv("file:///csvfile.csv")
For hdfs file:
df = spark.read.csv("hdfs:///csvfile.csv")
For hdfs file (with a delimiter other than comma):
df = spark.read.option("delimiter","|").csv("hdfs:///csvfile.csv")
Note: this works for any delimited file. Just use option("delimiter", ...) to change the value.
Hope this is helpful.
To read from a relative path, use System.getProperty to get the current directory and build the file path from it.
scala> val path = System.getProperty("user.dir").concat("/../2015-summary.csv")
scala> val csvDf = spark.read.option("inferSchema","true").option("header", "true").csv(path)
scala> csvDf.take(3)
spark:2.4.4 scala:2.11.12
The default file format is Parquet when loading with spark.read, but the file you are reading is CSV; that is why you are getting the exception. Specify the CSV format with the API you are trying to use.
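The numbers in the exception from the question can be decoded directly. A quick Python check of the two byte lists quoted in the error message (the looks_like_parquet helper is just an illustrative name):

```python
PARQUET_MAGIC = bytes([80, 65, 82, 49])  # "expected magic number at tail"
FOUND_TAIL = bytes([49, 59, 54, 10])     # "but found"


def looks_like_parquet(tail):
    """True if the last four bytes are the magic number the reader expected."""
    return tail[-4:] == PARQUET_MAGIC


print(PARQUET_MAGIC.decode("ascii"))   # PAR1
print(repr(FOUND_TAIL.decode("ascii")))  # '1;6\n'
print(looks_like_parquet(FOUND_TAIL))  # False
```

So the reader looked for the PAR1 marker and instead found plain text ending in "1;6" and a newline. The ";" also suggests that this particular file may be semicolon-delimited, in which case the delimiter option would need to be set accordingly.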
With the built-in Spark CSV support, you can get it done easily with the SparkSession object in Spark 2.0 and later.
val df = spark.
read.
option("inferSchema", "false").
option("header","true").
option("mode","DROPMALFORMED").
option("delimiter", ";").
schema(dataSchema).
csv("/csv/file/dir/file.csv")
df.show()
df.printSchema()
There are various options you can set.
header: whether your file includes header line at the top
inferSchema: whether to infer the schema automatically. The default is false. I always prefer to provide the schema to ensure proper data types.
mode: parsing mode, PERMISSIVE, DROPMALFORMED or FAILFAST
delimiter: to specify delimiter, default is comma(',')