Convert scala map to dataframe - scala

I am trying to stream twitter data using Apache Spark and I want to save it as csv file into HDFS. I understand that I have to convert it to a dataframe but I am not able to do so.
Here is my full code:
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils
//import com.google.gson.Gson
import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
//import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
//import org.apache.spark.sql.functions._
import sentimentAnalysis.sentimentScore
case class twitterCaseClass (userID: String = "", user: String = "", createdAt: String = "", text: String = "", sentimentType: String = "")
object twitterStream {
//private val gson = new Gson()
def main(args: Array[String]) {
//Twitter API
Logger.getLogger("org").setLevel(Level.ERROR)
System.setProperty("twitter4j.oauth.consumerKey", "#######")
System.setProperty("twitter4j.oauth.consumerSecret", "#######")
System.setProperty("twitter4j.oauth.accessToken", "#######")
System.setProperty("twitter4j.oauth.accessTokenSecret", "#######")
val spark = SparkSession.builder().appName("twitterStream").master("local[*]").getOrCreate()
val sc: SparkContext = spark.sparkContext
val streamContext = new StreamingContext(sc, Seconds(5))
import spark.implicits._
val filters = Array("Singapore")
val filtered = TwitterUtils.createStream(streamContext, None, filters)
val englishTweets = filtered.filter(_.getLang() == "en")
englishTweets.print()
val tweets = englishTweets.map{ col => {
(
"userID" -> col.getId,
"user" -> col.getUser.getScreenName,
"createdAt" -> col.getCreatedAt.toInstant.toString,
"text" -> col.getText.toLowerCase.split(" ").filter(_.matches("^[a-zA-Z0-9 ]+$")).fold("")((a, b) => a + " " + b).trim,
"sentimentType" -> sentimentScore(col.getText).toString
)
}
}
//val tweets = englishTweets.map(gson.toJson(_))
//tweets.saveAsTextFiles("hdfs://localhost:9000/usr/sparkApp/test/")
streamContext.start()
streamContext.awaitTermination()
}
}
I am not sure where did I possibly went wrong. There is another way to go about which is using case class. Is there a good example I can follow?
Update
The result of the Map function which is save into HDFS is like this:
((userID,1345940003533312000),(user,rei_yang),(createdAt,2021-01-04T03:47:57Z),(text,just posted a photo singapore),(sentimentType,NEUTRAL))
Is there a way to code it to a dataframe?

Related

why the error is given while reading the text file in spark

The path given to the text file is correct still I am getting error " Input path does not exist: file:/C:/Users/cmpil/Downloads/hunger_games.txt". Why is it happening
import org.apache.spark.sql._
import org.apache.log4j._
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
object WordCountDataSet {
case class Book(value:String)
def main(args:Array[String]): Unit ={
Logger.getLogger("org").setLevel(Level.ERROR)
val spark = SparkSession
.builder()
.appName("WordCount")
.master("local[*]")
.getOrCreate()
import spark.implicits._
//Another way of doing it
val bookRDD = spark.sparkContext.textFile("C:/Users/cmpil/Downloads/hunger_games.txt")
val wordsRDD = bookRDD.flatMap(x => x.split("\\W+"))
val wordsDS = wordsRDD.toDS()
val lowercaseWordsDS = wordsDS.select(lower($"value").alias("word"))
val wordCountsDS = lowercaseWordsDS.groupBy("word").count()
val wordCountsSortedDS = wordCountsDS.sort("count")
wordCountsSortedDS.show(wordCountsSortedDS.count().toInt)
}
}
on windows you have to use '\\' in place of '/'
try using "C:\\Users\\cmpil\\Downloads\\hunger_games.txt"

Iceberg's FlinkSink doesn't update metadata file in streaming writes

I have been trying to use Iceberg's FlinkSink to consume the data and write to sink.
I was successful in fetching the data from kinesis and I see that the data is being written into the appropriate partition. However, I don't see the metadata.json being updated. Without which I am not able to query the table.
Any help or pointers are appreciated.
The following is the code.
package test
import java.util.{Calendar, Properties}
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer
import org.apache.flink.streaming.connectors.kinesis.config.{AWSConfigConstants, ConsumerConfigConstants}
import org.apache.flink.table.data.{GenericRowData, RowData, StringData}
import org.apache.hadoop.conf.Configuration
import org.apache.iceberg.catalog.TableIdentifier
import org.apache.iceberg.flink.{CatalogLoader, TableLoader}
import org.apache.iceberg.flink.TableLoader.HadoopTableLoader
import org.apache.iceberg.flink.sink.FlinkSink
import org.apache.iceberg.hadoop.HadoopCatalog
import org.apache.iceberg.types.Types
import org.apache.iceberg.{PartitionSpec, Schema}
import scala.collection.JavaConverters._
object SampleApp {
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
val warehouse = "file://<local folder path>"
val catalog = new HadoopCatalog(new Configuration(), warehouse)
val ti = TableIdentifier.of("test_table")
if (!catalog.tableExists(ti)) {
println("table doesnt exist. creating it.")
val schema = new Schema(
Types.NestedField.optional(1, "message", Types.StringType.get()),
Types.NestedField.optional(2, "year", Types.StringType.get()),
Types.NestedField.optional(3, "month", Types.StringType.get()),
Types.NestedField.optional(4, "date", Types.StringType.get()),
Types.NestedField.optional(5, "hour", Types.StringType.get())
)
val props = Map(
"write.metadata.delete-after-commit.enabled" -> "true",
"write.metadata.previous-versions-max" -> "5",
"write.target-file-size-bytes" -> "1048576"
)
val partitionSpec = PartitionSpec.builderFor(schema)
.identity("year")
.identity("month")
.identity("date")
.identity("hour")
.build();
catalog.createTable(ti, schema, partitionSpec, props.asJava)
} else {
println("table exists. not creating it.")
}
val inputProperties = new Properties()
inputProperties.setProperty(AWSConfigConstants.AWS_REGION, "us-east-1")
inputProperties.setProperty(ConsumerConfigConstants.STREAM_INITIAL_POSITION, "LATEST")
val stream: DataStream[RowData] = env
.addSource(new FlinkKinesisConsumer[String]("test_kinesis_stream", new SimpleStringSchema(), inputProperties))
.map(x => {
val now = Calendar.getInstance()
GenericRowData.of(
StringData.fromString(x),
StringData.fromString(now.get(Calendar.YEAR).toString),
StringData.fromString("%02d".format(now.get(Calendar.MONTH))),
StringData.fromString("%02d".format(now.get(Calendar.DAY_OF_MONTH))),
StringData.fromString("%02d".format(now.get(Calendar.HOUR_OF_DAY)))
)
})
FlinkSink
.forRowData(stream.javaStream)
.tableLoader(TableLoader.fromHadoopTable(s"$warehouse/${ti.name()}", new Configuration()))
.build()
env.execute("test app")
}
}
Thanks in Advance.
you should set checkpointing:
env.enableCheckpointing(1000)

Testing a utility function by writing a unit test in apache spark scala

I have a utility function written in scala to read parquet files from s3 bucket. Could someone help me in writing unit test cases for this
Below is the function which needs to be tested.
def readParquetFile(spark: SparkSession,
locationPath: String): DataFrame = {
spark.read
.parquet(locationPath)
}
So far i have created a SparkSession for which the master is local
import org.apache.spark.sql.SparkSession
trait SparkSessionTestWrapper {
lazy val spark: SparkSession = {
SparkSession.builder().master("local").appName("Test App").getOrCreate()
}
}
I am stuck with testing the function. Here is the code where I am stuck. The question is should i create a real parquet file and load to see if the dataframe is getting created or is there a mocking framework to test this.
import com.github.mrpowers.spark.fast.tests.DataFrameComparer
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.scalatest.FunSpec
class ReadAndWriteSpec extends FunSpec with DataFrameComparer with SparkSessionTestWrapper {
import spark.implicits._
it("reads a parquet file and creates a dataframe") {
}
}
Edit:
Basing on the inputs from the comments i came up with the below but i am still not able to understand how this can be leveraged.
I am using https://github.com/findify/s3mock
class ReadAndWriteSpec extends FunSpec with DataFrameComparer with SparkSessionTestWrapper {
import spark.implicits._
it("reads a parquet file and creates a dataframe") {
val api = S3Mock(port = 8001, dir = "/tmp/s3")
api.start
val endpoint = new EndpointConfiguration("http://localhost:8001", "us-west-2")
val client = AmazonS3ClientBuilder
.standard
.withPathStyleAccessEnabled(true)
.withEndpointConfiguration(endpoint)
.withCredentials(new AWSStaticCredentialsProvider(new AnonymousAWSCredentials()))
.build
/** Use it as usual. */
client.createBucket("foo")
client.putObject("foo", "bar", "baz")
val url = client.getUrl("foo","bar")
println(url.getFile())
val df = ReadAndWrite.readParquetFile(spark,url.getPath())
df.printSchema()
}
}
I figured out and kept it simple. I could complete some basic test cases.
Here is my solution. I hope this will help someone.
import org.apache.spark.sql
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.scalatest.{BeforeAndAfterEach, FunSuite}
import loaders.ReadAndWrite
class ReadAndWriteTestSpec extends FunSuite with BeforeAndAfterEach{
private val master = "local"
private val appName = "ReadAndWrite-Test"
var spark : SparkSession = _
override def beforeEach(): Unit = {
spark = new sql.SparkSession.Builder().appName(appName).master(master).getOrCreate()
}
test("creating data frame from parquet file") {
val sparkSession = spark
import sparkSession.implicits._
val peopleDF = spark.read.json("src/test/resources/people.json")
peopleDF.write.mode(SaveMode.Overwrite).parquet("src/test/resources/people.parquet")
val df = ReadAndWrite.readParquetFile(sparkSession,"src/test/resources/people.parquet")
df.printSchema()
}
test("creating data frame from text file") {
val sparkSession = spark
import sparkSession.implicits._
val peopleDF = ReadAndWrite.readTextfileToDataSet(sparkSession,"src/test/resources/people.txt").map(_.split(","))
.map(attributes => Person(attributes(0), attributes(1).trim.toInt))
.toDF()
peopleDF.printSchema()
}
test("counts should match with number of records in a text file") {
val sparkSession = spark
import sparkSession.implicits._
val peopleDF = ReadAndWrite.readTextfileToDataSet(sparkSession,"src/test/resources/people.txt").map(_.split(","))
.map(attributes => Person(attributes(0), attributes(1).trim.toInt))
.toDF()
peopleDF.printSchema()
assert(peopleDF.count() == 3)
}
test("data should match with sample records in a text file") {
val sparkSession = spark
import sparkSession.implicits._
val peopleDF = ReadAndWrite.readTextfileToDataSet(sparkSession,"src/test/resources/people.txt").map(_.split(","))
.map(attributes => Person(attributes(0), attributes(1).trim.toInt))
.toDF()
peopleDF.printSchema()
assert(peopleDF.take(1)(0)(0).equals("Michael"))
}
test("Write a data frame as csv file") {
val sparkSession = spark
import sparkSession.implicits._
val peopleDF = ReadAndWrite.readTextfileToDataSet(sparkSession,"src/test/resources/people.txt").map(_.split(","))
.map(attributes => Person(attributes(0), attributes(1).trim.toInt))
.toDF()
//header argument should be boolean to the user to avoid confusions
ReadAndWrite.writeDataframeAsCSV(peopleDF,"src/test/resources/out.csv",java.time.Instant.now().toString,",","true")
}
override def afterEach(): Unit = {
spark.stop()
}
}
case class Person(name: String, age: Int)

Putting Multiple column names from a HBase Table into one SparkRDD

I have to put multiple column families from a table in HBase into one sparkRDD. I am attempting this using the following code: (question edited after first aanswer)
import org.apache.hadoop.hbase.client.{HBaseAdmin, Result}
import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import scala.collection.JavaConverters._
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark._
import org.apache.hadoop.hbase.mapred.TableOutputFormat
import org.apache.hadoop.hbase.client._
object HBaseRead {
def main(args: Array[String]) {
val sparkConf = new SparkConf().setAppName("HBaseRead").setMaster("local").set("spark.driver.allowMultipleContexts","true").set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(sparkConf)
val conf = HBaseConfiguration.create()
val tableName = "TableName"
////setting up required stuff
System.setProperty("user.name", "hdfs")
System.setProperty("HADOOP_USER_NAME", "hdfs")
conf.set("hbase.master", "localhost:60000")
conf.setInt("timeout", 120000)
conf.set("hbase.zookeeper.quorum", "localhost")
conf.set("zookeeper.znode.parent", "/hbase-unsecure")
conf.set(TableInputFormat.INPUT_TABLE, tableName)
sparkConf.registerKryoClasses(Array(classOf[org.apache.hadoop.hbase.client.Result]))
val admin = new HBaseAdmin(conf)
if (!admin.isTableAvailable(tableName)) {
val tableDesc = new HTableDescriptor(tableName)
admin.createTable(tableDesc)
}
case class Model(Shoes: String,Clothes: String,T-shirts: String)
var hBaseRDD2 = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable], classOf[org.apache.hadoop.hbase.client.Result])
val transformedRDD = hBaseRDD2.map(tuple => {
val result = tuple._2
Model(Bytes.toString(result.getValue(Bytes.toBytes("Category"),Bytes.toBytes("Shoes"))),
Bytes.toString(result.getValue(Bytes.toBytes("Category"),Bytes.toBytes("Clothes"))),
Bytes.toString(result.getValue(Bytes.toBytes("Category"),Bytes.toBytes("T-shirts")))
)
})
val totalcount = transformedRDD.count()
println(totalcount)
}
}
What I want to do is to make a single rdd wherein values of first row (and subsequent rows later on) from these column families would be combined in a single array in the rdd. Any help would be appreciated. Thanks
You can do it couple of ways, inside rdd map you can get all the columns from the parent rdd[hBaseRDD2] and transform it and return it as another single rdd.
or you can create a case class and map it to that columns.
For example:
case class Model(column1: String,
column1: String,
column1: String)
var hBaseRDD2 = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable], classOf[org.apache.hadoop.hbase.client.Result])
val transformedRDD = hBaseRDD2.map(tuple => {
val result = tuple._2
Model(Bytes.toString(result.getValue(Bytes.toBytes("cf1"),Bytes.toBytes("Columnname1"))),
Bytes.toString(result.getValue(Bytes.toBytes("cf2"),Bytes.toBytes("Columnname2"))),
Bytes.toString(result.getValue(Bytes.toBytes("cf2"),Bytes.toBytes("Columnname2")))
)
})

Scala/Spark serialization error - streaming data to HBASE

I am a newbie to Scala/Spark. In the following code, I am extracting Twitter public stream content to the HBase.
On commenting the last four lines (put commands in HBase), I am able to print content of tweet on the terminal, however unable to dump it to the HBase table.
I need help in on the following regards:
1. How can I overcome the serialilzation error?
2. Are there efficient methods (may be useing Kryo serialilzation) to overcome this error?
Caused by: java.io.NotSerializableException:
org.apache.hadoop.conf.Configuration Serialization stack:
- object not serializable (class: org.apache.hadoop.conf.Configuration, value: Configuration:
core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml,
yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml)
import twitter4j.auth._
import twitter4j.conf._
import twitter4j._
import twitter4j.json._
import scala.io.Source
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.HColumnDescriptor
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.KeyValue
import org.apache.hadoop.hbase.mapred.TableOutputFormat
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}
import org.apache.hadoop.mapreduce.Job
import org.apache.spark._
import org.apache.spark.rdd.NewHadoopRDD
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.hadoop.hbase.util.Bytes
import java.io._
import org.apache.spark.streaming.twitter.TwitterUtils
////////////////////////////
val conf = new SparkConf().setAppName("model1").setMaster("local[*]")
// val sc = new SparkContext(conf)
val TABLE_NAME = "publicrd"
val CF_USER = "user"
val CF_TWEET = "tweet"
val CF_ENTITIES = "entities"
val CF_PLACES = "places"
val hadoopConf = new Configuration
val conf = HBaseConfiguration.create(hadoopConf)
val admin = new HBaseAdmin(conf)
val tableDesc = new HTableDescriptor(Bytes.toBytes(TABLE_NAME))
// Define column family descriptor
val ColumnFamilyDesc1 = new HColumnDescriptor(Bytes.toBytes(CF_USER))
val ColumnFamilyDesc2 = new HColumnDescriptor(Bytes.toBytes(CF_TWEET))
val ColumnFamilyDesc3 = new HColumnDescriptor(Bytes.toBytes(CF_ENTITIES))
val ColumnFamilyDesc4 = new HColumnDescriptor(Bytes.toBytes(CF_PLACES))
// Add column family in table descriptor
tableDesc.addFamily(ColumnFamilyDesc1)
tableDesc.addFamily(ColumnFamilyDesc2)
tableDesc.addFamily(ColumnFamilyDesc3)
tableDesc.addFamily(ColumnFamilyDesc4)
// Check if the table exists
if (admin.tableExists(TABLE_NAME)){
print(">>>>>" + TABLE_NAME + " already exists <<<<<")
admin.disableTable(TABLE_NAME)
admin.deleteTable(TABLE_NAME)
}
// Create HBASE table
admin.createTable(tableDesc)
val table = new HTable(conf, TABLE_NAME)
/////
val timewindow = 2 // seconds
val ssc = new StreamingContext(sc, Seconds(timewindow))
val cb = new ConfigurationBuilder
val ckey = "ckey"
val csecret = "csecret"
val atoken = "atoken"
val atokensecret = "atokensecret"
cb.setDebugEnabled(true).
setOAuthConsumerKey(ckey).
setOAuthConsumerSecret(csecret).
setOAuthAccessToken(atoken).
setOAuthAccessTokenSecret(atokensecret).
setJSONStoreEnabled(true)
val auth = new OAuthAuthorization(cb.build)
val tweets = TwitterUtils.createStream(ssc,Some(auth))
val status = tweets.filter(_.getLang()=="en")
status.foreachRDD(foreachFunc = rdd => {
rdd.foreachPartition {
records => while (records.hasNext) {
var record = records.next
print("\n\n>>>>"+record)
var tweetID = record.getUser().getId().toString//.isInstanceOf[Int]
print("\ntweetID : "+tweetID)
var tweetBody = record.getText()//.toString
print("\ntweetBody : "+tweetBody)
var favoritesCount = record.getFavoriteCount()//.toInt
print("\nfavoritesCount : "+favoritesCount)
var keyrow = "t_"+tweetID //"t_"+
print("\nkeyrow : "+keyrow+"\n")
var theput= new Put(Bytes.toBytes(keyrow))
theput.add(Bytes.toBytes(CF_TWEET),Bytes.toBytes("tweetid"),Bytes.toBytes(tweetID))
theput.add(Bytes.toBytes(CF_TWEET),Bytes.toBytes("tweetid"),Bytes.toBytes(tweetBody))
theput.add(Bytes.toBytes(CF_USER),Bytes.toBytes("tweetid"),Bytes.toBytes(favoritesCount))
table.put(theput)
}
}
}
)
The code is run on the terminal via:
spark-shell --driver-class-path /opt/hadoop/hbase-1.2.1/lib/hbase-server-1.1.4.jar:/opt/hadoop/hbase-1.2.1/lib/hbase-protocol-1.0.0-cdh5.5.0.jar:/opt/hadoop/hbase-1.2.1/lib/hbase-hadoop2-compat-1.0.0-cdh5.5.0.jar:/opt/hadoop/hbase-1.2.1/lib/hbase-client-1.0.0-cdh5.5.0.jar:/opt/hadoop/hbase-1.2.1/lib/hbase-common-1.0.0-cdh5.5.0.jar:/opt/hadoop/hbase-1.2.1/lib/htrace-core-3.2.0-incubating.jar:/home/cloudera/Desktop/hbase/twitter4jJARS/guava-19.0.jar:/home/cloudera/Desktop/hbase/twitter4jJARS/spark-streaming-twitter_2.10-1.6.1.jar:/home/cloudera/Desktop/hbase/twitter4jJARS/twitter4j-async-4.0.4.jar:/home/cloudera/Desktop/hbase/twitter4jJARS/twitter4j-core-4.0.4.jar:/home/cloudera/Desktop/hbase/twitter4jJARS/twitter4j-examples-4.0.4.jar:/home/cloudera/Desktop/hbase/twitter4jJARS/twitter4j-media-support-4.0.4.jar:/home/cloudera/Desktop/hbase/twitter4jJARS/twitter4j-stream-4.0.4.jar
It says the object org.apache.hadoop.conf.Configuration is not serialisable which mean it does not implement the Serializable interface while it's required. To get rid of that add #transient keyword.
#transient val hadoopConf = new Configuration