How to filter TwitterInputDStream results by user - scala

I'm trying to build a Scala + Spark application that listens only for tweets from specific users that I define in my filters array. Currently, editing the filters array only changes which tweet text is matched; I want the same behavior, but for users.
I was originally using
val statuses = TwitterUtils.createStream(ssc, None, filter)
to get my stream. However, another post recommended using TwitterInputDStream, which I have now implemented, but I don't understand how to filter it based on users.
Any help is MUCH appreciated, as this is something I've struggled with for quite a while.
My code thus far:
val config = new twitter4j.conf.ConfigurationBuilder()
  .setOAuthConsumerKey("2pzru6Cd8aRumYHUumoCiKwYS")
  .setOAuthConsumerSecret("rKKQfISDzk715OOKS7wvGkSKdkqGsLFOdTrZ9QdJOfAjMXekNB")
  .setOAuthAccessToken("962554652-SU0aBc8Iyukka1gFJQY9Ux5XczDXSgczRpuih3ml")
  .setOAuthAccessTokenSecret("k0DWDIGeJ6u2ijYJeIv6dfCuCmTPIZT8xPJ5AG4P9PV72")
  .build
val twitter_auth = new TwitterFactory(config)
val a = new OAuthAuthorization(config)
val atwitter = twitter_auth.getInstance(a).getAuthorization()
//val auth = new Option[Authorization] {}
val auth = Option(atwitter)
val sparkConf = new SparkConf().setAppName("Main2")
  .setMaster("local[4]")
val ssc = new StreamingContext(sparkConf, Seconds(2))
val test = new StorageLevel()
val twitter = new TwitterInputDStream(ssc, auth, filters, test)
val rec = twitter.getReceiver()
twitter.print()
Thanks!
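For reference, one common way to do this is to create the stream without keyword filters and filter the statuses by screen name instead. A minimal sketch, assuming twitter4j's Status/User API and the auth value built above; the screen names are hypothetical placeholders:

val userScreenNames = Set("someUser", "anotherUser") // hypothetical screen names
val statuses = TwitterUtils.createStream(ssc, auth)
// Keep only tweets authored by the users of interest
val byUser = statuses.filter(status => userScreenNames.contains(status.getUser.getScreenName))
byUser.print()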

Related

How to apply several Indexers and Encoders without creating countless intermediate DataFrames?

Here's my code:
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

val workindexer = new StringIndexer().setInputCol("workclass").setOutputCol("workclassIndex")
val workencoder = new OneHotEncoder().setInputCol("workclassIndex").setOutputCol("workclassVec")
val educationindexer = new StringIndexer().setInputCol("education").setOutputCol("educationIndex")
val educationencoder = new OneHotEncoder().setInputCol("educationIndex").setOutputCol("educationVec")
val maritalindexer = new StringIndexer().setInputCol("marital_status").setOutputCol("maritalIndex")
val maritalencoder = new OneHotEncoder().setInputCol("maritalIndex").setOutputCol("maritalVec")
val occupationindexer = new StringIndexer().setInputCol("occupation").setOutputCol("occupationIndex")
val occupationencoder = new OneHotEncoder().setInputCol("occupationIndex").setOutputCol("occupationVec")
val relationindexer = new StringIndexer().setInputCol("relationship").setOutputCol("relationshipIndex")
val relationencoder = new OneHotEncoder().setInputCol("relationshipIndex").setOutputCol("relationshipVec")
val raceindexer = new StringIndexer().setInputCol("race").setOutputCol("raceIndex")
val raceencoder = new OneHotEncoder().setInputCol("raceIndex").setOutputCol("raceVec")
val sexindexer = new StringIndexer().setInputCol("sex").setOutputCol("sexIndex")
val sexencoder = new OneHotEncoder().setInputCol("sexIndex").setOutputCol("sexVec")
val nativeindexer = new StringIndexer().setInputCol("native_country").setOutputCol("native_countryIndex")
val nativeencoder = new OneHotEncoder().setInputCol("native_countryIndex").setOutputCol("native_countryVec")
val labelindexer = new StringIndexer().setInputCol("label").setOutputCol("labelIndex")
Is there any way to apply all these encoders and indexers without creating countless intermediate DataFrames?
I'd use RFormula:
import org.apache.spark.ml.feature.RFormula
val features = Seq("workclass", "education",
  "marital_status", "occupation", "relationship",
  "race", "sex", "native_country")
val formula = new RFormula().setFormula(s"label ~ ${features.mkString(" + ")}")
It will apply the same transformations as the indexers used in the example and assemble the features Vector.
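For example, a quick sketch of applying it (df stands for the question's input DataFrame, which isn't shown):

val model = formula.fit(df)            // fits the string indexers/encoders internally
val transformed = model.transform(df)  // adds the "features" and "label" columns
transformed.select("features", "label").show()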
Use the feature of Spark MLlib called ML Pipelines:
ML Pipelines provide a uniform set of high-level APIs built on top of DataFrames that help users create and tune practical machine learning pipelines.
With ML Pipelines you could "chain" (or "pipe") the "encoders and indexers without creating countless intermediate dataframes".
import org.apache.spark.ml._
val pipeline = new Pipeline().setStages(Array(
  workindexer, workencoder, educationindexer, educationencoder,
  maritalindexer, maritalencoder, occupationindexer, occupationencoder,
  relationindexer, relationencoder, raceindexer, raceencoder,
  sexindexer, sexencoder, nativeindexer, nativeencoder, labelindexer))
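Fitting the pipeline then produces all the indexed and encoded columns in a single pass (again, df is the question's input DataFrame):

val model = pipeline.fit(df)
val indexedAndEncoded = model.transform(df)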

How to use feature extraction with DStream in Apache Spark

I have data arriving from Kafka through a DStream. I want to perform feature extraction on it in order to obtain some keywords.
I do not want to wait for the arrival of all the data (it is intended to be a continuous stream that potentially never ends), so I hope to perform the extraction in chunks; it doesn't matter to me if the accuracy suffers a bit.
So far I have put together something like this:
def extractKeywords(stream: DStream[Data]): Unit = {
  val spark: SparkSession = SparkSession.builder.getOrCreate
  val streamWithWords: DStream[(Data, Seq[String])] = stream map extractWordsFromData
  val streamWithFeatures: DStream[(Data, Array[String])] = streamWithWords transform extractFeatures(spark) _
  val streamWithKeywords: DStream[DataWithKeywords] = streamWithFeatures map addKeywordsToData
  streamWithFeatures.print()
}
def extractFeatures(spark: SparkSession)
                   (rdd: RDD[(Data, Seq[String])]): RDD[(Data, Array[String])] = {
  val df = spark.createDataFrame(rdd).toDF("data", "words")
  val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(numOfFeatures)
  val rawFeatures = hashingTF.transform(df)
  val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
  val idfModel = idf.fit(rawFeatures)
  val rescaledData = idfModel.transform(rawFeatures)
  import spark.implicits._
  rescaledData.select("data", "features").as[(Data, Array[String])].rdd
}
However, I received java.lang.IllegalStateException: Haven't seen any document yet. I'm not surprised, as I was just trying to piece things together, and I understand that since I am not waiting for some data to arrive first, the generated model might be empty when I try to use it on the data.
What would be the right approach for this problem?
I used the advice from the comments and split the procedure into two runs:
one that computes the IDF model and saves it to a file:
def trainFeatures(idfModelFile: File, rdd: RDD[(String, Seq[String])]) = {
  val session: SparkSession = SparkSession.builder.getOrCreate
  val wordsDf = session.createDataFrame(rdd).toDF("data", "words")
  val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures")
  val featurizedDf = hashingTF.transform(wordsDf)
  val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
  val idfModel = idf.fit(featurizedDf)
  idfModel.write.save(idfModelFile.getAbsolutePath)
}
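A possible way to invoke this training pass (the corpus path and model location are hypothetical, and the whitespace tokenization is a naive stand-in):

import java.io.File
val spark = SparkSession.builder.getOrCreate
val trainingRdd = spark.sparkContext.textFile("hdfs:///training/corpus.txt") // hypothetical corpus, one document per line
  .map(line => (line, line.split("\\s+").toSeq))
trainFeatures(new File("/models/idf"), trainingRdd) // hypothetical model location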
and one that loads the IDF model from the file and simply runs it on all incoming data:
val idfModel = IDFModel.load(idfModelFile.getAbsolutePath)
val documentDf = spark.createDataFrame(rdd).toDF("update", "document")
val tokenizer = new Tokenizer().setInputCol("document").setOutputCol("words")
val wordsDf = tokenizer.transform(documentDf)
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures")
val featurizedDf = hashingTF.transform(wordsDf)
val extractor = idfModel.setInputCol("rawFeatures").setOutputCol("features")
val featuresDf = extractor.transform(featurizedDf)
featuresDf.select("update", "features")
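For completeness, a minimal sketch of running that second snippet per micro-batch; the foreachRDD wrapper is an assumption, and stream is taken to be a DStream[(String, String)] of (update, document) pairs:

val idfModel = IDFModel.load(idfModelFile.getAbsolutePath)
stream.foreachRDD { rdd =>
  val spark = SparkSession.builder.getOrCreate
  val documentDf = spark.createDataFrame(rdd).toDF("update", "document")
  val wordsDf = new Tokenizer().setInputCol("document").setOutputCol("words").transform(documentDf)
  val featurizedDf = new HashingTF().setInputCol("words").setOutputCol("rawFeatures").transform(wordsDf)
  val featuresDf = idfModel.setInputCol("rawFeatures").setOutputCol("features").transform(featurizedDf)
  featuresDf.select("update", "features").show() // or persist the keywords downstream
}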

Spark Dataframe content can be printed out but (e.g.) not counted

Strangely, this doesn't work. Can someone explain the background? I want to understand why it behaves this way.
The input files are Parquet files spread across multiple folders. When I print the results, they are structured as I want them to be. But when I call count() on the joined DataFrame, the job runs forever. Can anyone help with the details on that?
import org.apache.spark.{SparkContext, SparkConf}

object TEST {
  def main(args: Array[String]) {
    val appName = args(0)
    val threadMaster = args(1)
    val inputPathSent = args(2)
    val inputPathClicked = args(3)

    // Pass the Spark configuration
    val conf = new SparkConf()
      .setMaster(threadMaster)
      .setAppName(appName)

    // Create a new Spark context
    val sc = new SparkContext(conf)

    // Specify a SQL context and pass in the Spark context we created
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)

    // Create two DataFrames for the sent and clicked files
    val dfSent = sqlContext.read.parquet(inputPathSent)
    val dfClicked = sqlContext.read.parquet(inputPathClicked)

    // Join them on customer_id and campaign_id
    val dfJoin = dfSent.join(dfClicked, dfSent.col("customer_id")
      === dfClicked.col("customer_id") && dfSent.col("campaign_id") ===
      dfClicked.col("campaign_id"), "left_outer")

    dfJoin.show(20) // perfectly shows the first 20 rows
    dfJoin.count()  // Here we run into trouble and it runs forever
  }
}
Use println(dfJoin.count()).
You will then be able to see the count on your screen: count() returns a value to the driver instead of printing anything. Note also that show(20) only computes enough of the join to produce the first 20 rows, while count() has to process the entire join, so it can take much longer without actually being stuck.
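For reference, a minimal sketch (the cache() call is an optional addition, not part of the original answer, to avoid recomputing the join for the second action):

dfJoin.cache()          // optional: keep the join result around between actions
dfJoin.show(20)
println(dfJoin.count()) // count() is an action; print its result explicitly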

Spark and HBase Snapshots

Under the assumption that we could access data much faster if pulling directly from HDFS instead of using the HBase API, we're trying to build an RDD based on a Table Snapshot from HBase.
So, I have a snapshot called "dm_test_snap". I seem to be able to get most of the configuration stuff working, but my RDD is null (despite there being data in the Snapshot itself).
I'm having a hell of a time finding an example of anyone doing offline analysis of HBase snapshots with Spark, but I can't believe I'm alone in trying to get this working. Any help or suggestions are greatly appreciated.
Here is a snippet of my code:
object TestSnap {
  def main(args: Array[String]) {
    val config = ConfigFactory.load()
    val hbaseRootDir = config.getString("hbase.rootdir")
    val sparkConf = new SparkConf()
      .setAppName("testnsnap")
      .setMaster(config.getString("spark.app.master"))
      .setJars(SparkContext.jarOfObject(this))
      .set("spark.executor.memory", "2g")
      .set("spark.default.parallelism", "160")
    val sc = new SparkContext(sparkConf)
    println("Creating hbase configuration")
    val conf = HBaseConfiguration.create()
    conf.set("hbase.rootdir", hbaseRootDir)
    conf.set("hbase.zookeeper.quorum", config.getString("hbase.zookeeper.quorum"))
    conf.set("zookeeper.session.timeout", config.getString("zookeeper.session.timeout"))
    conf.set("hbase.TableSnapshotInputFormat.snapshot.name", "dm_test_snap")
    val scan = new Scan
    val job = Job.getInstance(conf)
    TableSnapshotInputFormat.setInput(job, "dm_test_snap",
      new Path("hdfs://nameservice1/tmp"))
    val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableSnapshotInputFormat],
      classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
      classOf[org.apache.hadoop.hbase.client.Result])
    hBaseRDD.count()
    System.exit(0)
  }
}
Update to include the solution
The trick was, as @Holden mentioned below, that the conf wasn't getting passed through. To remedy this, I was able to get it working by changing the call to newAPIHadoopRDD to this:
val hBaseRDD = sc.newAPIHadoopRDD(job.getConfiguration, classOf[TableSnapshotInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])
There was a second issue, also highlighted by @victor's answer: I was not passing in a Scan. To fix that, I added this line and method:
conf.set(TableInputFormat.SCAN, convertScanToString(scan))
def convertScanToString(scan: Scan) = {
  val proto = ProtobufUtil.toScan(scan)
  Base64.encodeBytes(proto.toByteArray())
}
This also let me remove this line from the conf.set calls:
conf.set("hbase.TableSnapshotInputFormat.snapshot.name", "dm_test_snap")
*NOTE: This was for HBase version 0.96.1.1 on CDH5.0
Final full code for easy reference:
object TestSnap {
  def main(args: Array[String]) {
    val config = ConfigFactory.load()
    val hbaseRootDir = config.getString("hbase.rootdir")
    val sparkConf = new SparkConf()
      .setAppName("testnsnap")
      .setMaster(config.getString("spark.app.master"))
      .setJars(SparkContext.jarOfObject(this))
      .set("spark.executor.memory", "2g")
      .set("spark.default.parallelism", "160")
    val sc = new SparkContext(sparkConf)
    println("Creating hbase configuration")
    val conf = HBaseConfiguration.create()
    conf.set("hbase.rootdir", hbaseRootDir)
    conf.set("hbase.zookeeper.quorum", config.getString("hbase.zookeeper.quorum"))
    conf.set("zookeeper.session.timeout", config.getString("zookeeper.session.timeout"))
    val scan = new Scan
    conf.set(TableInputFormat.SCAN, convertScanToString(scan))
    val job = Job.getInstance(conf)
    TableSnapshotInputFormat.setInput(job, "dm_test_snap",
      new Path("hdfs://nameservice1/tmp"))
    val hBaseRDD = sc.newAPIHadoopRDD(job.getConfiguration, classOf[TableSnapshotInputFormat],
      classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
      classOf[org.apache.hadoop.hbase.client.Result])
    hBaseRDD.count()
    System.exit(0)
  }

  def convertScanToString(scan: Scan) = {
    val proto = ProtobufUtil.toScan(scan)
    Base64.encodeBytes(proto.toByteArray())
  }
}
Looking at the Job implementation, it makes a copy of the conf object you supply to it ("The Job makes a copy of the Configuration so that any necessary internal modifications do not reflect on the incoming parameter."), so most likely the information you need to set on the conf object isn't getting passed down to Spark. You could instead use TableSnapshotInputFormatImpl, which has a similar method that works on conf objects. There might be additional things needed, but at a first pass through the problem this seems like the most likely cause.
As pointed out in the comments, another option is to use job.getConfiguration to get the updated config from the Job object.
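A minimal sketch of that second option, which mirrors the fix in the question's update (the snapshot name and restore path are the ones from the question):

val job = Job.getInstance(conf)
TableSnapshotInputFormat.setInput(job, "dm_test_snap",
  new Path("hdfs://nameservice1/tmp"))
// Pass the Job's updated configuration, not the original conf object:
val hBaseRDD = sc.newAPIHadoopRDD(job.getConfiguration, classOf[TableSnapshotInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])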
You have not configured your M/R job properly. Here is an example, in Java, of how to configure M/R over snapshots:
Job job = new Job(conf);
Scan scan = new Scan();
TableMapReduceUtil.initTableSnapshotMapperJob(snapshotName,
    scan, MyTableMapper.class, MyMapKeyOutput.class,
    MyMapOutputValueWritable.class, job, true);
You definitely skipped the Scan. I suggest taking a look at TableMapReduceUtil's initTableSnapshotMapperJob implementation to get an idea of how to configure the job in Spark/Scala.
Here is the complete configuration in MapReduce (Java):
TableMapReduceUtil.initTableSnapshotMapperJob(
    snapshotName,        // name of the snapshot
    scan,                // Scan instance to control CF and attribute selection
    DefaultMapper.class, // mapper class
    NullWritable.class,  // mapper output key
    Text.class,          // mapper output value
    job,
    true,
    restoreDir);
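For reference, a rough Scala rendering of the same call (the mapper, key, and value classes are the placeholders from the Java snippet; the snapshot name and restore path are taken from the question):

TableMapReduceUtil.initTableSnapshotMapperJob(
  "dm_test_snap",                       // name of the snapshot
  new Scan(),                           // Scan instance to control CF and attribute selection
  classOf[DefaultMapper],               // mapper class (placeholder)
  classOf[NullWritable],                // mapper output key
  classOf[Text],                        // mapper output value
  job,
  true,                                 // add dependency jars
  new Path("hdfs://nameservice1/tmp"))  // restore dir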

Spark Streaming - Twitter - Filtering tweet data

I am new to Scala and Spark. I am working on Spark Streaming with Twitter data. I flat-mapped the stream into individual words. Now I need to eliminate words that start with # or @, and words like RT, from the streaming data before processing it. I know it should be quite easy to do; I wrote a filter for this, but it is not working. Can anyone help with this? My code is:
val sparkConf = new SparkConf().setMaster("local[2]")
val ssc = new StreamingContext(sparkConf, Seconds(2))
val stream = TwitterUtils.createStream(ssc, None)
//val lanFilter = stream.filter(status => status.getLang == "en")
val RDD1 = stream.flatMap(status => status.getText.split(" "))
val filterRDD = RDD1.filter(word =>(word !=word.startsWith("#")))
filterRDD.print()
Also, the language filter is showing an error.
Thank you.
You can use the built-in word filter support:
TwitterUtils.createStream(ssc, None, Array("filter", "these", "words"))
But if you want to fix your code (note that getText is a Status method, so this filters whole tweets before the flatMap):
.filterNot(_.getText.startsWith("#"))
Regarding language, see this question.
Is your lambda expression correct? I think you want:
val filterRDD = RDD1.filter(word => !word.startsWith("#"))
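Putting the pieces together, a sketch of a word-level filter that drops hashtags, mentions, and retweet markers (the exact token set is an assumption based on the question):

val filterRDD = RDD1.filter { word =>
  !word.startsWith("#") && !word.startsWith("@") && word != "RT"
}
filterRDD.print()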