Spark Streaming - Twitter - Filtering tweet data - scala

I am new to Scala and Spark. I am working on Spark Streaming with Twitter data. I flatMapped the stream into individual words. Now I need to eliminate words that start with # and words like RT from the streaming data before processing it. I know this should be quite easy to do; I wrote a filter for it, but it is not working. Can anyone help with this? My code is:
val sparkConf = new SparkConf().setMaster("local[2]")
val ssc = new StreamingContext(sparkConf, Seconds(2))
val stream = TwitterUtils.createStream(ssc, None)
//val lanFilter = stream.filter(status => status.getLang == "en")
val RDD1 = stream.flatMap(status => status.getText.split(" "))
val filterRDD = RDD1.filter(word =>(word !=word.startsWith("#")))
filterRDD.print()
Also, the language filter is showing an error.
Thank you.

You can use the built-in word filter support:
TwitterUtils.createStream(ssc, None, Array("filter", "these", "words"))
But if you want to fix your code:
.filterNot(_.getText.startsWith("#"))
Regarding language, see this question.

Is your lambda expression correct? I think you want:
val filterRDD = RDD1.filter(word => !word.startsWith("#"))
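Putting the fixes together, a minimal sketch of the whole pipeline (assuming the standard twitter4j Status API; Status.getLang needs a reasonably recent twitter4j, which may be what is behind the language-filter error):
val sparkConf = new SparkConf().setMaster("local[2]").setAppName("TweetWordFilter")
val ssc = new StreamingContext(sparkConf, Seconds(2))
val stream = TwitterUtils.createStream(ssc, None)

// keep English tweets, split into words, then drop hashtags and the retweet marker
val words = stream
  .filter(_.getLang == "en")
  .flatMap(_.getText.split(" "))
  .filter(word => !word.startsWith("#") && word != "RT")

words.print()
ssc.start()
ssc.awaitTermination()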

Related

loading csv file to HBase through Spark

This is a simple "how to" question:
We can bring data into the Spark environment through com.databricks.spark.csv. I know how to create an HBase table through Spark and how to write data to HBase tables manually. But is it even possible to load text/CSV/JSON files directly into HBase through Spark? I cannot find anybody talking about it, so I am just checking. If it is possible, please point me to a good website that explains the Scala code for it in detail.
Thank you,
There are multiple ways you can do that.
Spark HBase connector:
https://github.com/hortonworks-spark/shc
You can find a lot of examples at that link.
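With shc you describe the HBase table in a catalog string and write a DataFrame to it directly. A rough sketch, assuming a SQLContext named sqlContext and a CSV file with a header row (the catalog, table name, file name and column names are made up for illustration, and the exact options can vary by shc version):
import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

// hypothetical catalog mapping DataFrame columns to an HBase table
val catalog = """{
  "table":{"namespace":"default", "name":"payments"},
  "rowkey":"key",
  "columns":{
    "PaymentNumber":{"cf":"rowkey", "col":"key", "type":"string"},
    "VendorName":{"cf":"cf", "col":"VendorName", "type":"string"},
    "Amount":{"cf":"cf", "col":"Amount", "type":"string"}
  }
}"""

// load the CSV with spark-csv, then write it to HBase through the connector
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("payments.csv")

df.write
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog, HBaseTableCatalog.newTable -> "5"))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .save()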
You can also use Spark core to load the data into HBase, using HBaseConfiguration.
Code Example:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes

val fileRDD = sc.textFile(args(0), 2)
val transformedRDD = fileRDD.map { line => convertToKeyValuePairs(line) }

val conf = HBaseConfiguration.create()
conf.set(TableOutputFormat.OUTPUT_TABLE, "tableName")
conf.set("hbase.zookeeper.quorum", "localhost:2181")
conf.set("hbase.master", "localhost:60000")
conf.set("fs.default.name", "hdfs://localhost:8020")
conf.set("hbase.rootdir", "/hbase")

val jobConf = new Configuration(conf)
// declare the types actually being written: (ImmutableBytesWritable, Put)
jobConf.set("mapreduce.job.output.key.class", classOf[ImmutableBytesWritable].getName)
jobConf.set("mapreduce.job.output.value.class", classOf[Put].getName)
jobConf.set("mapreduce.outputformat.class", classOf[TableOutputFormat[ImmutableBytesWritable]].getName)

transformedRDD.saveAsNewAPIHadoopDataset(jobConf)

// Builds the (rowkey, Put) pair for one pipe-delimited input line.
def convertToKeyValuePairs(line: String): (ImmutableBytesWritable, Put) = {
  val fields = line.split("\\|") // "|" must be escaped, otherwise split treats it as regex alternation
  val cfDataBytes = Bytes.toBytes("cf")
  val rowkey = Bytes.toBytes(fields(1))
  val put = new Put(rowkey)
  put.add(cfDataBytes, Bytes.toBytes("PaymentDate"), Bytes.toBytes(fields(0)))
  put.add(cfDataBytes, Bytes.toBytes("PaymentNumber"), Bytes.toBytes(fields(1)))
  put.add(cfDataBytes, Bytes.toBytes("VendorName"), Bytes.toBytes(fields(2)))
  put.add(cfDataBytes, Bytes.toBytes("Category"), Bytes.toBytes(fields(3)))
  put.add(cfDataBytes, Bytes.toBytes("Amount"), Bytes.toBytes(fields(4)))
  (new ImmutableBytesWritable(rowkey), put)
}
You can also use this one:
https://github.com/nerdammer/spark-hbase-connector
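That connector exposes an implicit toHBaseTable on tuple RDDs, where the first tuple element becomes the row key. A small sketch under the same column layout as above (the file name is a placeholder, and spark.hbase.host has to be set in the SparkConf):
import it.nerdammer.spark.hbase._

// split each pipe-delimited line and use the second field (PaymentNumber) as the row key
val payments = sc.textFile("payments.csv")
  .map(_.split("\\|"))
  .map(f => (f(1), f(0), f(2), f(3), f(4)))

payments.toHBaseTable("tableName")
  .toColumns("PaymentDate", "VendorName", "Category", "Amount")
  .inColumnFamily("cf")
  .save()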

How to filter TwitterInputDStream results by user

I'm trying to make a Scala + Spark application that listens only for tweets from specific users that I define in my filters array. Currently, editing the filters array only filters on the text of the tweets it receives; I want to filter on users instead.
I was originally using
val statuses = TwitterUtils.createStream(ssc, None, filter)
to get my stream; however, another post recommended using TwitterInputDStream. I just implemented that, but I don't see how to filter based on users.
Any help is MUCH appreciated, as this is something I've struggled with for quite a while.
My code thus far:
val config = new twitter4j.conf.ConfigurationBuilder()
.setOAuthConsumerKey("2pzru6Cd8aRumYHUumoCiKwYS")
.setOAuthConsumerSecret("rKKQfISDzk715OOKS7wvGkSKdkqGsLFOdTrZ9QdJOfAjMXekNB")
.setOAuthAccessToken("962554652-SU0aBc8Iyukka1gFJQY9Ux5XczDXSgczRpuih3ml")
.setOAuthAccessTokenSecret("k0DWDIGeJ6u2ijYJeIv6dfCuCmTPIZT8xPJ5AG4P9PV72")
.build
val twitter_auth = new TwitterFactory(config)
val a = new OAuthAuthorization(config)
val atwitter = twitter_auth.getInstance(a).getAuthorization()
//val auth = new Option[Authorization] {}
val auth = Option(atwitter)
val sparkConf = new SparkConf().setAppName("Main2")
.setMaster("local[4]")
val ssc = new StreamingContext(sparkConf, Seconds(2))
val test = new StorageLevel()
val twitter = new TwitterInputDStream(ssc, auth, filters, test )
val rec = twitter.getReceiver()
twitter.print()
Thanks!
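I'm not sure TwitterInputDStream buys you anything here: the filters argument only matches on tweet text, which is exactly what you observed. A minimal sketch that instead filters the plain createStream output by screen name, reusing the auth and ssc from your code (the user names are placeholders):
// hypothetical set of screen names to keep
val userFilter = Set("user_one", "user_two")

val statuses = TwitterUtils.createStream(ssc, auth)
val byUser = statuses.filter(status => userFilter.contains(status.getUser.getScreenName))

byUser.map(_.getText).print()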

Spark Streaming with Kafka

I want to do simple machine learning in Spark.
First, the application should learn from historical data in a file and train the machine learning model, and then read input from Kafka to give predictions in real time. To do that I believe I should use Spark Streaming. However, I'm afraid I don't really understand how Spark Streaming works.
The code looks like this:
def main(args: Array[String]) {
  val conf = new SparkConf().setAppName("test App")
  val sc = new SparkContext(conf)
  val fromFile = parse(sc, Source.fromFile("my_data_.csv").getLines.toArray)
  ML.train(fromFile)
  real_time(sc)
}
Where ML is a class with some machine learning things in it, and train gives it data to train on. There is also a method classify which computes predictions based on what was learned.
The first part seems to work fine, but real_time is a problem:
def real_time(sc: SparkContext): Unit = {
  val ssc = new StreamingContext(new SparkConf(), Seconds(1))
  val topic = "my_topic".split(",").toSet
  val params = Map[String, String](("metadata.broker.list", "localhost:9092"))
  val dstream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, params, topic)
  var lin = dstream.map(_._2)
  val str_arr = new Array[String](0)
  lin.foreach {
    str_arr :+ _.collect()
  }
  val lines = parse(sc, str_arr).map(i => i.features)
  ML.classify(lines)
  ssc.start()
  ssc.awaitTermination()
}
What I would like it to do is check the Kafka stream and process any new lines that arrive. That does not seem to happen: I added some prints and nothing gets printed.
How does Spark Streaming work, and how should I use it in my case?
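For what it's worth, the usual pattern is to describe the per-batch work inside foreachRDD and only then start the context; code outside foreachRDD runs once at setup, not per batch. A rough sketch of how real_time might look, reusing your parse and ML.classify and the existing SparkContext (treat it as a sketch under those assumptions, not a drop-in fix):
import kafka.serializer.StringDecoder
import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

def real_time(sc: SparkContext): Unit = {
  // reuse the existing SparkContext instead of building a second SparkConf
  val ssc = new StreamingContext(sc, Seconds(1))
  val topics = Set("my_topic")
  val params = Map("metadata.broker.list" -> "localhost:9092")
  val dstream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, params, topics)

  dstream.map(_._2).foreachRDD { rdd =>
    if (!rdd.isEmpty()) {
      // collect this batch's lines to the driver and classify them, as with the historical file
      val lines = parse(rdd.sparkContext, rdd.collect()).map(_.features)
      ML.classify(lines)
    }
  }

  ssc.start()
  ssc.awaitTermination()
}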

Saving twitter streams into a single file with spark streaming, scala

So, after help from this answer, Spark Streaming : Join Dstream batches into single output Folder, I was able to create a single file for my Twitter streams. However, now I don't see any tweets being saved in this file. Please find my code snippet for this below. What am I doing wrong?
val ssc = new StreamingContext(sparkConf, Seconds(5))
val stream = TwitterUtils.createStream(ssc, None, filters)
val tweets = stream.map(r => r.getText)
tweets.foreachRDD { rdd =>
  val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
  import sqlContext.implicits._
  val df = rdd.map(t => Record(t)).toDF()
  df.save("com.databricks.spark.csv", SaveMode.Append, Map("path" -> "tweetstream.csv"))
}
ssc.start()
ssc.awaitTermination()
}
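For reference, this snippet relies on a Record case class and a SQLContextSingleton helper along the lines of the one in the Spark Streaming programming guide; a minimal sketch of both is below. Beyond that, it is worth checking that the filters actually match live tweets, since with restrictive filters the stream can simply be empty.
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

// one-column schema holding a tweet's text
case class Record(text: String)

// lazily instantiated singleton SQLContext, as in the Spark Streaming guide
object SQLContextSingleton {
  @transient private var instance: SQLContext = _
  def getInstance(sparkContext: SparkContext): SQLContext = {
    if (instance == null) {
      instance = new SQLContext(sparkContext)
    }
    instance
  }
}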

Pushing Spark Streaming RDDs to Neo4j - Scala

I need to establish a connection from Spark Streaming to the Neo4j graph database. The RDDs are of type ((is,I),(am,Hello),(sam,happy),...). I need to create an edge between each pair of words in Neo4j.
In the Spark Streaming documentation I found:
dstream.foreachRDD { rdd =>
rdd.foreachPartition { partitionOfRecords =>
// ConnectionPool is a static, lazily initialized pool of connections
val connection = ConnectionPool.getConnection()
partitionOfRecords.foreach(record => connection.send(record))
ConnectionPool.returnConnection(connection) // return to the pool for future reuse
}
}
as the pattern for pushing data to an external database.
I am doing this in Scala, and I am a little confused about how to go about it. I found the AnormCypher and Neo4jScala wrappers. Can I use these to get the work done? If so, how? If not, are there any better alternatives?
Thank you all....
I did an experiment with AnormCypher, like this:
implicit val connection = Neo4jREST.setServer("localhost", 7474, "/db/data/")
val conf = new SparkConf().setAppName("Simple Application")
val sc = new SparkContext(conf)
val logData = sc.textFile(FILE, 4).cache()
val count = logData
.flatMap( _.split(" "))
.map( w =>
Cypher("CREATE(:Word {text:{text}})")
.on( "text" -> w ).execute()
).filter( _ ).count()
Neo4j 2.2.x has great concurrent write performance that you can exploit from Spark, so the more concurrent threads you have writing to Neo4j, the better. If you can group statements into batches of 100 to 1000 per request, even better.
Take a look at MazeRunner (http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html) as it will give you some ideas.
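Putting the two together, here is a sketch of writing the word pairs from the DStream with AnormCypher inside foreachPartition (wordPairs stands for your ((word, word)) stream, and the MERGE statements and relationship name are only illustrative):
import org.anormcypher.{Cypher, Neo4jREST}

wordPairs.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // open the REST connection on the worker, once per partition, like the ConnectionPool pattern above
    implicit val connection = Neo4jREST.setServer("localhost", 7474, "/db/data/")
    partition.foreach { case (w1, w2) =>
      Cypher(
        """MERGE (a:Word {text: {w1}})
          |MERGE (b:Word {text: {w2}})
          |MERGE (a)-[:NEXT_TO]->(b)""".stripMargin)
        .on("w1" -> w1, "w2" -> w2)
        .execute()
    }
  }
}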