Filter Tweepy streaming on polarity

Using the streaming API, I am trying to get tweets with positive polarity on a particular topic. My filter statement is something like this:
from tweepy import Stream
from tweepy import OAuthHandler
from tweepy.streaming import StreamListener
.......
twitterStream = Stream(auth, listener())
try:
    twitterStream.filter(track=["Trump :)"], languages=['en'])
except:
    print("Streaming error")
The objective is to get positive tweets on a subject. However, I am not getting any response, and I am unable to resolve the problem.

Working fine now. The error was somewhere else; the filter code given above is correct.
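As a complement, here is a minimal sketch that scores polarity client-side with TextBlob instead of relying on the ":)" emoticon in the track term. It assumes tweepy < 4 (which still ships StreamListener) and that textblob is installed; the class name, polarity threshold, and placeholder credentials are mine, not from the question.

from tweepy import OAuthHandler, Stream
from tweepy.streaming import StreamListener
from textblob import TextBlob

class PositiveListener(StreamListener):
    def on_status(self, status):
        # Score the tweet text and keep only positive polarity (threshold is arbitrary)
        polarity = TextBlob(status.text).sentiment.polarity
        if polarity > 0:
            print(status.text, polarity)

    def on_error(self, status_code):
        if status_code == 420:
            return False  # disconnect when rate limited

# Placeholder credentials (elided in the question as well)
auth = OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

twitterStream = Stream(auth, PositiveListener())
try:
    twitterStream.filter(track=["Trump"], languages=['en'])
except Exception as e:
    print("Streaming error:", e)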


Scala: trouble getting data in a single JSON line

I am using Scala Spark streaming, and I need to push my results to a Kafka topic. I am using .selectExpr("to_json(struct(Column1,Column2,Column3,Column4)) as value").
And I get results like these:
{"Column1":"Value_Column1","Column4":"Value_Column4"}
{"Column1":"Value_Column1","Column2":"Value_Column2"}
{"Column1":"Value_Column1","Column3":"Value_COlumn3"}
How should I change the .selectExpr, or what steps do I need to take, to get output like this:
{"Column1":"Value_Column1","Column2":"Value_Column2","Column3":"Value_Column3","Column4":"Value_Column4"}
Thank you all in advance!
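One thing worth checking: by default, Spark's to_json omits fields whose value is null, which would produce exactly this kind of output with a different key set per row. If that is the cause, filling the nulls before calling to_json keeps all four keys. Below is a minimal sketch, shown in PySpark for consistency with the rest of this page (the same selectExpr string works in Scala); the example dataframe is made up.

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("to-json-demo").getOrCreate()

# Made-up rows with nulls, mimicking the symptom in the question
df = spark.createDataFrame(
    [("a", None, "c", "d"), ("a", "b", None, None)],
    ["Column1", "Column2", "Column3", "Column4"],
)

filled = df.na.fill("")  # replace nulls so to_json keeps every field

filled.selectExpr(
    "to_json(struct(Column1, Column2, Column3, Column4)) as value"
).show(truncate=False)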

Plot a graph with ipycytoscape (and networkx)

Following the instructions for ipycytoscape, I am not able to plot a graph.
According to https://github.com/QuantStack/ipycytoscape/blob/master/examples/Test%20NetworkX%20methods.ipynb, this should work:
import networkx as nx
import ipycytoscape
G2 = nx.Graph()
G2.add_nodes_from([*'ABCDEF'])
G2.add_edges_from([('A','B'),('B','C'),('C','D'),('E','F')])
print(G2.nodes)
print(G2.edges)
cytoscapeobj = ipycytoscape.CytoscapeWidget()
cytoscapeobj.graph.add_graph_from_networkx(nx_graph)
G2 is an example networkx graph, and it looks OK since print(G2) returns the networkx object, and G2.nodes and G2.edges can be printed.
The error:
ValueError: invalid literal for int() with base 10: 'A'
Why should a node be an integer?
More generally, what should one do if the starting point is a pandas dataframe with a million rows of edges, those edges being strings like ProcessA-ProcessB, processC-processD, etc.?
Also, looking at the examples, the list of nodes is composed of a data dictionary for every node, that data including an "id" per node and also an "Attribute". The surprise here is that the networkx graph is expected to have all those properties.
Thanks.
This problem was fixed (see the attachment). Please let me know if it's still happening. Feel free to open an issue: https://github.com/QuantStack/ipycytoscape/
I'm just playing around with ipycytoscape myself, so I could be way off-base, but shouldn't the line be:
cytoscapeobj.graph.add_graph_from_networkx(G2) # your graph name goes here
Trying to generate a cytoscape object built on a graph that doesn't exist might trigger a ValueError because it can't find any nodes.
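For the pandas follow-up in the question, here is a minimal sketch using nx.from_pandas_edgelist. The column names and the tiny dataframe are made up, and it assumes a recent ipycytoscape where string node ids are accepted, as the maintainer's reply above suggests.

import pandas as pd
import networkx as nx
import ipycytoscape

# Edge list with string endpoints, standing in for the million-row dataframe
edges = pd.DataFrame({"source": ["ProcessA", "processC"],
                      "target": ["ProcessB", "processD"]})

G = nx.from_pandas_edgelist(edges, source="source", target="target")

cytoscapeobj = ipycytoscape.CytoscapeWidget()
cytoscapeobj.graph.add_graph_from_networkx(G)
cytoscapeobj  # display the widget in a notebook cell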

I am getting a syntax error while computing the average number of friends in Apache Spark

I recently started Frank Kane's course, Taming Big Data with Apache Spark and Python.
In the line where I have to compute the average number of friends, I am getting a syntax error that I cannot figure out how to fix. Please refer to the code below. FYI, I am using Python 3. I have highlighted the line with the syntax error. Please help, as I am stuck here.
from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("local").setAppName("AverageAge")
sc = SparkContext(conf = conf)
def parseline(line):
    fields = line.split(',')
    friend_age = int(fields[2])
    friends_number = int(fields[3])
    return (friend_age, friends_number)
lines = sc.textFile("file:///Sparkcourse/SparkCourse/fakefriends.csv")
rdd=lines.map(parseline)
making_keys=rdd.mapByValues(lambda x:(x,1))
totalsByAge=making_keys.reduceByKeys(lambda x,y: (x[0]+y[0],x[1]+y[1])
averages_by_keys = totalsByAge.mapValues(lambda x: x[0] / x[1])  # <-- syntax error reported on this line
results=averageByKeys.collect()
for result in results:
    print result
Look at the line above that one: you're missing a closing parenthesis.
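Beyond the missing parenthesis, note that mapByValues and reduceByKeys are not PySpark RDD methods; the actual API calls are mapValues and reduceByKey. Here is a sketch of the corrected pipeline, keeping the question's file path and intent (not necessarily the course's exact code):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("AverageAge")
sc = SparkContext(conf=conf)

def parse_line(line):
    fields = line.split(',')
    age = int(fields[2])
    num_friends = int(fields[3])
    return (age, num_friends)

lines = sc.textFile("file:///Sparkcourse/SparkCourse/fakefriends.csv")
rdd = lines.map(parse_line)

totals_by_age = rdd.mapValues(lambda x: (x, 1)).reduceByKey(
    lambda x, y: (x[0] + y[0], x[1] + y[1])
)  # the missing ')' in the posted code was on this call
averages_by_age = totals_by_age.mapValues(lambda x: x[0] / x[1])

for result in averages_by_age.collect():
    print(result)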

Window function in pyspark - strange behavior

I'm using PySpark window functions extensively in my code, but they don't seem to be working properly: I get the correct result only for the last record (by the order-by column) in each partition.
The documentation says the API is experimental; can we use it in production systems?
http://spark.apache.org/docs/2.3.0/api/python/pyspark.sql.html#pyspark.sql.Window
Sample code:
invWindow = Window.partitionBy(masterDrDF["ResId"], masterDrDF["vrsn_strt_dts"]).orderBy(masterDrDF["vrsn_strt_dts"]).rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
max(when(invDetDF["InvoiceItemType"].like('ABD%'), 1).otherwise(0)).over(invWindow).alias("ABD_PKG_IN")
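One common cause of getting the correct value only on the last row of a partition is the window frame: when orderBy is present and no frame is given, Spark defaults to unboundedPreceding .. currentRow, so the aggregate only sees the full partition on its last row. The snippet above does set an explicit full frame, so this may not be the issue here; the standalone sketch below (made-up data and column names) just illustrates the difference.

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").appName("window-frame-demo").getOrCreate()

# Made-up rows: only the last row of the partition matches ABD%
df = spark.createDataFrame(
    [("r1", 1, "OTHER"), ("r1", 2, "OTHER"), ("r1", 3, "ABD-X")],
    ["ResId", "seq", "InvoiceItemType"],
)

# Default frame when orderBy is used: unboundedPreceding .. currentRow
default_win = Window.partitionBy("ResId").orderBy("seq")
# Explicit full-partition frame: every row sees the whole partition
full_win = default_win.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

flag = F.max(F.when(F.col("InvoiceItemType").like("ABD%"), 1).otherwise(0))
df.select(
    "ResId", "seq",
    flag.over(default_win).alias("default_frame"),  # 1 only on the last row
    flag.over(full_win).alias("full_frame"),        # 1 on every row of the partition
).show()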

Pyspark - Transfer control out of Spark Session (sc)

This is a follow-up question to
Pyspark filter operation on Dstream
To keep a count of how many error/warning messages have come through for, say, a day or an hour, how does one design the job?
What I have tried:
from __future__ import print_function
import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
def counts():
    counter += 1
    print(counter.value)

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: network_wordcount.py <hostname> <port>", file=sys.stderr)
        exit(-1)
    sc = SparkContext(appName="PythonStreamingNetworkWordCount")
    ssc = StreamingContext(sc, 5)
    counter = sc.accumulator(0)
    lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2]))
    errors = lines.filter(lambda l: "error" in l.lower())
    errors.foreachRDD(lambda e: e.foreach(counts))
    errors.pprint()
    ssc.start()
    ssc.awaitTermination()
This, however, has multiple issues. To start with, print doesn't work (it does not output to stdout; I have read about this, and the best I can do here is logging). Can I save the output of that function to a text file and tail that file instead?
I am not sure why the program just exits; there is no error/dump anywhere to look into further (Spark 1.6.2).
How does one preserve state? What I am trying to do is aggregate logs by server and severity; another use case is to count how many transactions were processed by looking for certain keywords.
Pseudocode for what I want to try:
foreachRDD(Dstream):
    if RDD.contains("keyword1 | keyword2 | keyword3"):
        dictionary[keyword] = dictionary.get(keyword, 0) + 1  # add the keyword if not present and increment its counter
    print dictionary  # or send this dictionary elsewhere
The last part, sending or printing the dictionary, requires switching out of the Spark streaming context. Can someone explain the concept, please?
print doesn't work
I would recommend reading the design patterns section of the Spark documentation. I think that roughly what you want is something like this:
def _process(iter):
    for item in iter:
        print(item)
lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2]))
errors = lines.filter(lambda l: "error" in l.lower())
errors.foreachRDD(lambda e : e.foreachPartition(_process))
This will get your print call to work (though it is worth noting that the print statement executes on the workers and not the driver, so if you're running this code on a cluster you will only see the output in the worker logs).
However, it won't solve your second problem:
How does one preserve state?
For this, take a look at updateStateByKey and the related example.
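For reference, here is a minimal sketch of the updateStateByKey approach applied to the keyword-count use case from the question; the host, port, keyword list, and checkpoint directory are placeholders.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="KeywordStateCount")
ssc = StreamingContext(sc, 5)
ssc.checkpoint("checkpoint")  # updateStateByKey requires a checkpoint directory

def update_count(new_values, running_count):
    # new_values: counts from the current batch; running_count: previous state (or None)
    return sum(new_values) + (running_count or 0)

keywords = ["keyword1", "keyword2", "keyword3"]

lines = ssc.socketTextStream("localhost", 9999)
hits = lines.flatMap(lambda l: [k for k in keywords if k in l.lower()]).map(lambda k: (k, 1))
running_totals = hits.updateStateByKey(update_count)
running_totals.pprint()  # running per-keyword totals, printed on the driver

ssc.start()
ssc.awaitTermination()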