Why is this error given while reading a text file in Spark - Scala?

The path given to the text file is correct, but I am still getting the error "Input path does not exist: file:/C:/Users/cmpil/Downloads/hunger_games.txt". Why is this happening?
import org.apache.spark.sql._
import org.apache.log4j._
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
object WordCountDataSet {
case class Book(value:String)
def main(args:Array[String]): Unit ={
Logger.getLogger("org").setLevel(Level.ERROR)
val spark = SparkSession
.builder()
.appName("WordCount")
.master("local[*]")
.getOrCreate()
import spark.implicits._
//Another way of doing it
val bookRDD = spark.sparkContext.textFile("C:/Users/cmpil/Downloads/hunger_games.txt")
val wordsRDD = bookRDD.flatMap(x => x.split("\\W+"))
val wordsDS = wordsRDD.toDS()
val lowercaseWordsDS = wordsDS.select(lower($"value").alias("word"))
val wordCountsDS = lowercaseWordsDS.groupBy("word").count()
val wordCountsSortedDS = wordCountsDS.sort("count")
wordCountsSortedDS.show(wordCountsSortedDS.count().toInt)
}
}

On Windows you have to use '\\' in place of '/'.
Try using "C:\\Users\\cmpil\\Downloads\\hunger_games.txt".
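For example, the read in the code above would become (a minimal sketch, assuming the file really does exist at that location):

// Escaped backslashes on Windows; a full URI such as "file:///C:/Users/cmpil/Downloads/hunger_games.txt" is another common form.
val bookRDD = spark.sparkContext.textFile("C:\\Users\\cmpil\\Downloads\\hunger_games.txt")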

Related

Convert Scala map to dataframe

I am trying to stream Twitter data using Apache Spark and I want to save it as a CSV file into HDFS. I understand that I have to convert it to a dataframe, but I am not able to do so.
Here is my full code:
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils
//import com.google.gson.Gson
import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
//import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
//import org.apache.spark.sql.functions._
import sentimentAnalysis.sentimentScore
case class twitterCaseClass (userID: String = "", user: String = "", createdAt: String = "", text: String = "", sentimentType: String = "")
object twitterStream {
//private val gson = new Gson()
def main(args: Array[String]) {
//Twitter API
Logger.getLogger("org").setLevel(Level.ERROR)
System.setProperty("twitter4j.oauth.consumerKey", "#######")
System.setProperty("twitter4j.oauth.consumerSecret", "#######")
System.setProperty("twitter4j.oauth.accessToken", "#######")
System.setProperty("twitter4j.oauth.accessTokenSecret", "#######")
val spark = SparkSession.builder().appName("twitterStream").master("local[*]").getOrCreate()
val sc: SparkContext = spark.sparkContext
val streamContext = new StreamingContext(sc, Seconds(5))
import spark.implicits._
val filters = Array("Singapore")
val filtered = TwitterUtils.createStream(streamContext, None, filters)
val englishTweets = filtered.filter(_.getLang() == "en")
englishTweets.print()
val tweets = englishTweets.map{ col => {
(
"userID" -> col.getId,
"user" -> col.getUser.getScreenName,
"createdAt" -> col.getCreatedAt.toInstant.toString,
"text" -> col.getText.toLowerCase.split(" ").filter(_.matches("^[a-zA-Z0-9 ]+$")).fold("")((a, b) => a + " " + b).trim,
"sentimentType" -> sentimentScore(col.getText).toString
)
}
}
//val tweets = englishTweets.map(gson.toJson(_))
//tweets.saveAsTextFiles("hdfs://localhost:9000/usr/sparkApp/test/")
streamContext.start()
streamContext.awaitTermination()
}
}
I am not sure where I went wrong. There is another way to go about this, which is using a case class. Is there a good example I can follow?
Update
The result of the map function, which is saved into HDFS, looks like this:
((userID,1345940003533312000),(user,rei_yang),(createdAt,2021-01-04T03:47:57Z),(text,just posted a photo singapore),(sentimentType,NEUTRAL))
Is there a way to convert it to a dataframe?
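One possible direction (a sketch only, not a tested answer): instead of mapping to nested tuples, map each status into the twitterCaseClass already defined above, then turn each micro-batch into a DataFrame inside foreachRDD and write it out. The HDFS path below is just the one from the commented-out saveAsTextFiles call and is only illustrative.

val tweetDS = englishTweets.map { status =>
  twitterCaseClass(
    userID = status.getId.toString,
    user = status.getUser.getScreenName,
    createdAt = status.getCreatedAt.toInstant.toString,
    text = status.getText,
    sentimentType = sentimentScore(status.getText).toString
  )
}
tweetDS.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    // spark.implicits._ is already imported, so an RDD of a top-level case class converts directly
    val df = rdd.toDF()
    df.write.mode("append").option("header", "true").csv("hdfs://localhost:9000/usr/sparkApp/test/")
  }
}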

Error: Using Spark Structured Streaming to read and write data to another topic in kafka

I am doing a small task of reading an access_logs file through a Kafka topic, then counting the statuses and sending the status counts to another Kafka topic.
But I keep getting errors like the following.
When I use no output mode or append mode:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark;;
When using complete mode:
Exception in thread "main" org.apache.spark.sql.streaming.StreamingQueryException: requirement failed: KafkaTable does not support Complete mode.
This is my code:
structuredStreaming.scala
package com.spark.sparkstreaming
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.sql._
import org.apache.log4j._
import org.apache.spark.sql.functions._
import java.util.regex.Pattern
import java.util.regex.Matcher
import java.text.SimpleDateFormat
import java.util.Locale
import Utilities._
object structuredStreaming {
case class LogEntry(ip:String, client:String, user:String, dateTime:String, request:String, status:String, bytes:String, referer:String, agent:String)
val logPattern = apacheLogPattern()
val datePattern = Pattern.compile("\\[(.*?) .+]")
def parseDateField(field: String): Option[String] = {
val dateMatcher = datePattern.matcher(field)
if (dateMatcher.find) {
val dateString = dateMatcher.group(1)
val dateFormat = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss", Locale.ENGLISH)
val date = (dateFormat.parse(dateString))
val timestamp = new java.sql.Timestamp(date.getTime());
return Option(timestamp.toString())
} else {
None
}
}
def parseLog(x:Row) : Option[LogEntry] = {
val matcher:Matcher = logPattern.matcher(x.getString(0));
if (matcher.matches()) {
val timeString = matcher.group(4)
return Some(LogEntry(
matcher.group(1),
matcher.group(2),
matcher.group(3),
parseDateField(matcher.group(4)).getOrElse(""),
matcher.group(5),
matcher.group(6),
matcher.group(7),
matcher.group(8),
matcher.group(9)
))
} else {
return None
}
}
def main(args: Array[String]) {
val spark = SparkSession
.builder
.appName("StructuredStreaming")
.master("local[*]")
.config("spark.sql.streaming.checkpointLocation", "/home/UDHAV.MAHATA/Documents/Checkpoints")
.getOrCreate()
setupLogging()
// val rawData = spark.readStream.text("/home/UDHAV.MAHATA/Documents/Spark/logs")
val rawData = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "testing")
.load()
import spark.implicits._
val structuredData = rawData.flatMap(parseLog).select("status")
val windowed = structuredData.groupBy($"status").count()
//val query = windowed.writeStream.outputMode("complete").format("console").start()
val query = windowed
.writeStream
.outputMode("complete")
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("topic", "sink")
.start()
query.awaitTermination()
spark.stop()
}
}
Utilities.scala
package com.spark.sparkstreaming
import org.apache.log4j.Level
import java.util.regex.Pattern
import java.util.regex.Matcher
object Utilities {
def setupLogging() = {
import org.apache.log4j.{Level, Logger}
val rootLogger = Logger.getRootLogger()
rootLogger.setLevel(Level.ERROR)
}
def apacheLogPattern():Pattern = {
val ddd = "\\d{1,3}"
val ip = s"($ddd\\.$ddd\\.$ddd\\.$ddd)?"
val client = "(\\S+)"
val user = "(\\S+)"
val dateTime = "(\\[.+?\\])"
val request = "\"(.*?)\""
val status = "(\\d{3})"
val bytes = "(\\S+)"
val referer = "\"(.*?)\""
val agent = "\"(.*?)\""
val regex = s"$ip $client $user $dateTime $request $status $bytes $referer $agent"
Pattern.compile(regex)
}
}
Can anyone help me with where I am making a mistake?
As the error message suggests, you need to add a watermark to your grouping.
Replace this line
val windowed = structuredData.groupBy($"status").count()
with
import org.apache.spark.sql.functions.{window, col}
val windowed = structuredData.groupBy(window(col("dateTime"), "10 minutes"), col("status")).count()
It is important that the column dateTime is of type timestamp, which you parse from the Kafka source anyway, if I understood your code correctly.
Without the window, Spark will not know how much data to aggregate.
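If you stick with append mode, the aggregation also needs a watermark on an event-time column. A sketch of how that could look, assuming structuredData keeps the dateTime column as well (e.g. .select("dateTime", "status")) and that it can be cast to a proper timestamp:

import org.apache.spark.sql.functions.{col, to_timestamp, window}

// Cast the parsed dateTime string into an event-time timestamp column.
val withEventTime = structuredData.withColumn("eventTime", to_timestamp(col("dateTime")))

// Watermark plus window makes the streaming aggregation valid for append output mode.
val windowed = withEventTime
  .withWatermark("eventTime", "10 minutes")
  .groupBy(window(col("eventTime"), "10 minutes"), col("status"))
  .count()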

Cannot convert an RDD to Dataframe

I've converted a dataframe to an RDD:
val rows: RDD[Row] = df.orderBy($"Date").rdd
And now I'm trying to convert it back:
val df2 = spark.createDataFrame(rows)
But I'm getting an error:
Edit:
rows.toDF()
Also produces an error:
Cannot resolve symbol toDF
Even though I included this line earlier:
import spark.implicits._
Full code:
import org.apache.spark._
import org.apache.spark.sql._
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import scala.util._
import org.apache.spark.mllib.rdd.RDDFunctions._
import org.apache.spark.rdd._
object Playground {
def main(args: Array[String]): Unit = {
val spark = SparkSession
.builder
.appName("Playground")
.config("spark.master", "local")
.getOrCreate()
import spark.implicits._
val sc = spark.sparkContext
val df = spark.read.csv("D:/playground/mre.csv")
df.show()
val rows: RDD[Row] = df.orderBy($"Date").rdd
val df2 = spark.createDataFrame(rows)
rows.toDF()
}
}
Your IDE is right, SparkSession.createDataFrame needs a second parameter: either a bean class or a schema.
This will fix your problem:
val df2 = spark.createDataFrame(rows, df.schema)
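As for the toDF error: the toDF brought in by import spark.implicits._ needs an Encoder, which exists for case classes, tuples and primitives but not for Row, so it cannot be resolved on an RDD[Row]; createDataFrame with an explicit schema is the usual route. The alternative, sketched here with a hypothetical case class (the column names are illustrative, not taken from mre.csv), is to map the rows into typed objects first:

// Define the case class at top level (outside main) so an Encoder can be derived.
case class Record(date: String, value: String)

val typedRows = rows.map(r => Record(r.getString(0), r.getString(1)))
val df3 = typedRows.toDF()  // resolves now, because an Encoder[Record] exists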

Testing a utility function by writing a unit test in Apache Spark (Scala)

I have a utility function written in Scala to read parquet files from an S3 bucket. Could someone help me in writing unit test cases for this?
Below is the function which needs to be tested.
def readParquetFile(spark: SparkSession,
locationPath: String): DataFrame = {
spark.read
.parquet(locationPath)
}
So far I have created a SparkSession with the master set to local:
import org.apache.spark.sql.SparkSession
trait SparkSessionTestWrapper {
lazy val spark: SparkSession = {
SparkSession.builder().master("local").appName("Test App").getOrCreate()
}
}
I am stuck with testing the function. Here is the code where I am stuck. The question is: should I create a real parquet file and load it to see if the dataframe is getting created, or is there a mocking framework to test this?
import com.github.mrpowers.spark.fast.tests.DataFrameComparer
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.scalatest.FunSpec
class ReadAndWriteSpec extends FunSpec with DataFrameComparer with SparkSessionTestWrapper {
import spark.implicits._
it("reads a parquet file and creates a dataframe") {
}
}
Edit:
Based on the inputs from the comments, I came up with the below, but I am still not able to understand how this can be leveraged.
I am using https://github.com/findify/s3mock
class ReadAndWriteSpec extends FunSpec with DataFrameComparer with SparkSessionTestWrapper {
import spark.implicits._
it("reads a parquet file and creates a dataframe") {
val api = S3Mock(port = 8001, dir = "/tmp/s3")
api.start
val endpoint = new EndpointConfiguration("http://localhost:8001", "us-west-2")
val client = AmazonS3ClientBuilder
.standard
.withPathStyleAccessEnabled(true)
.withEndpointConfiguration(endpoint)
.withCredentials(new AWSStaticCredentialsProvider(new AnonymousAWSCredentials()))
.build
/** Use it as usual. */
client.createBucket("foo")
client.putObject("foo", "bar", "baz")
val url = client.getUrl("foo","bar")
println(url.getFile())
val df = ReadAndWrite.readParquetFile(spark,url.getPath())
df.printSchema()
}
}
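To actually leverage the mock from Spark, the s3a filesystem would have to be pointed at the mock endpoint before reading, roughly like this (a sketch only; it assumes hadoop-aws and the AWS SDK are on the test classpath, and the credential values are arbitrary since S3Mock ignores them):

// Route the s3a connector to the local S3Mock endpoint.
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3a.endpoint", "http://localhost:8001")
hadoopConf.set("fs.s3a.path.style.access", "true")
hadoopConf.set("fs.s3a.connection.ssl.enabled", "false")
hadoopConf.set("fs.s3a.access.key", "foo")
hadoopConf.set("fs.s3a.secret.key", "bar")

// Write a small parquet file into the mocked bucket, then read it back through the utility.
Seq(("Michael", 29)).toDF("name", "age").write.parquet("s3a://foo/people.parquet")
val df = ReadAndWrite.readParquetFile(spark, "s3a://foo/people.parquet")
df.printSchema()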
I figured it out and kept it simple. I could complete some basic test cases.
Here is my solution. I hope it will help someone.
import org.apache.spark.sql
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.scalatest.{BeforeAndAfterEach, FunSuite}
import loaders.ReadAndWrite
class ReadAndWriteTestSpec extends FunSuite with BeforeAndAfterEach{
private val master = "local"
private val appName = "ReadAndWrite-Test"
var spark : SparkSession = _
override def beforeEach(): Unit = {
spark = new sql.SparkSession.Builder().appName(appName).master(master).getOrCreate()
}
test("creating data frame from parquet file") {
val sparkSession = spark
import sparkSession.implicits._
val peopleDF = spark.read.json("src/test/resources/people.json")
peopleDF.write.mode(SaveMode.Overwrite).parquet("src/test/resources/people.parquet")
val df = ReadAndWrite.readParquetFile(sparkSession,"src/test/resources/people.parquet")
df.printSchema()
}
test("creating data frame from text file") {
val sparkSession = spark
import sparkSession.implicits._
val peopleDF = ReadAndWrite.readTextfileToDataSet(sparkSession,"src/test/resources/people.txt").map(_.split(","))
.map(attributes => Person(attributes(0), attributes(1).trim.toInt))
.toDF()
peopleDF.printSchema()
}
test("counts should match with number of records in a text file") {
val sparkSession = spark
import sparkSession.implicits._
val peopleDF = ReadAndWrite.readTextfileToDataSet(sparkSession,"src/test/resources/people.txt").map(_.split(","))
.map(attributes => Person(attributes(0), attributes(1).trim.toInt))
.toDF()
peopleDF.printSchema()
assert(peopleDF.count() == 3)
}
test("data should match with sample records in a text file") {
val sparkSession = spark
import sparkSession.implicits._
val peopleDF = ReadAndWrite.readTextfileToDataSet(sparkSession,"src/test/resources/people.txt").map(_.split(","))
.map(attributes => Person(attributes(0), attributes(1).trim.toInt))
.toDF()
peopleDF.printSchema()
assert(peopleDF.take(1)(0)(0).equals("Michael"))
}
test("Write a data frame as csv file") {
val sparkSession = spark
import sparkSession.implicits._
val peopleDF = ReadAndWrite.readTextfileToDataSet(sparkSession,"src/test/resources/people.txt").map(_.split(","))
.map(attributes => Person(attributes(0), attributes(1).trim.toInt))
.toDF()
//header argument should be boolean to the user to avoid confusions
ReadAndWrite.writeDataframeAsCSV(peopleDF,"src/test/resources/out.csv",java.time.Instant.now().toString,",","true")
}
override def afterEach(): Unit = {
spark.stop()
}
}
case class Person(name: String, age: Int)
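The tests above call readTextfileToDataSet and writeDataframeAsCSV, which are not shown in the post. A rough sketch of what such helpers could look like, with signatures inferred from the calls above (treat this as an assumption, not the original code):

import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

object ReadAndWrite {

  def readParquetFile(spark: SparkSession, locationPath: String): DataFrame = {
    spark.read.parquet(locationPath)
  }

  // Inferred from the tests: returns the raw lines of a text file as a Dataset[String].
  def readTextfileToDataSet(spark: SparkSession, locationPath: String): Dataset[String] = {
    spark.read.textFile(locationPath)
  }

  // Inferred from the tests: writes a dataframe as CSV under outputPath/<suffix>.
  def writeDataframeAsCSV(df: DataFrame,
                          outputPath: String,
                          suffix: String,
                          delimiter: String,
                          header: String): Unit = {
    df.write
      .option("header", header)
      .option("delimiter", delimiter)
      .csv(s"$outputPath/$suffix")
  }
}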

Can't find temp table in Zeppelin

I receive an error when I try to do a select over my temp table. Can somebody help me, please?
object StreamingLinReg extends java.lang.Object{
val conf = new SparkConf(true)
.set("spark.cassandra.connection.host", "127.0.0.1").setAppName("Streaming Liniar Regression")
.set("spark.cassandra.connection.port", "9042")
.set("spark.driver.allowMultipleContexts", "true")
.set("spark.streaming.receiver.writeAheadLog.enable", "true")
val sc = new SparkContext(conf);
val ssc = new StreamingContext(sc, Seconds(1));
val sqlContext = new org.apache.spark.sql.SQLContext(sc);
import sqlContext.implicits._
val trainingData = ssc.cassandraTable[String]("features","consumodata").select("consumo", "consumo_mensal", "soma_pf", "tempo_gasto").map(LabeledPoint.parse)
trainingData.toDF.registerTempTable("training")
val dstream = new ConstantInputDStream(ssc, trainingData)
val numFeatures = 100
val model = new StreamingLinearRegressionWithSGD()
.setInitialWeights(Vectors.zeros(numFeatures))
.setNumIterations(1)
.setStepSize(0.1)
.setMiniBatchFraction(1.0)
model.trainOn(dstream)
model.predictOnValues(dstream.map(lp => (lp.label, lp.features))).foreachRDD { rdd =>
val metrics = new RegressionMetrics(rdd)
val MSE = metrics.meanSquaredError //Squared error
val RMSE = metrics.rootMeanSquaredError //Squared error
val MAE = metrics.meanAbsoluteError //Mean absolute error
val Rsquared = metrics.r2
//val Explained variance = metrics.explainedVariance
rdd.toDF.registerTempTable("liniarRegressionModel")
}
}
ssc.start()
ssc.awaitTermination()
//}
}
%sql
select * from liniarRegressionModel limit 10
When I do a select on the temporary table I get an error message. I run the first paragraph first, and then execute the select over the temp table.
org.apache.spark.sql.AnalysisException: Table not found: liniarRegressionModel; line 1 pos 14
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.getTable(Analyzer.scala:305)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$9.applyOrElse(Analyzer.scala:314)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$9.applyOrElse(Analyzer.scala:309)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:56)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:54)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:54)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:281)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
My output after executing the code:
import java.lang.Object
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._
import org.apache.spark.sql.cassandra._
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql
import org.apache.spark.streaming._
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.StreamingContext._
import com.datastax.spark.connector.streaming._
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.ConstantInputDStream
import org.apache.spark.mllib.evaluation.RegressionMetrics
defined module StreamingLinReg
FINISHED
Took 15 seconds