Making RDD operations on sqlContext - scala

I am working through a tutorial on Apache Spark, making use of a Cassandra database, Spark 2.0, and Python.
I am trying to run an RDD operation on the result of a SQL query, following this tutorial:
https://spark.apache.org/docs/2.0.0-preview/sql-programming-guide.html
It says: "The results of SQL queries are RDDs and support all the normal RDD operations."
I currently have these lines of code:
sqlContext = SQLContext(sc)
results = sqlContext.sql("SELECT word FROM tweets where word like '%#%'").show(20, False)
df = sqlContext.read\
    .format("org.apache.spark.sql.cassandra")\
    .options(table="wordcount", keyspace="demo")\
    .load()
df.select("word")
df.createOrReplaceTempView("tweets")
usernames = results.map(lambda p: "User: " + p.word)
for name in usernames.collect():
    print(name)
AttributeError: 'NoneType' object has no attribute 'map'
If the variable results holds the result of the SQL query, why am I getting this error? Can anyone please explain this to me?
Everything works fine and the tables print; the only time I get an error is when I try to do an RDD operation.
Please bear in mind that sc is an existing SparkContext.

It's because show() only prints the content and returns None.
Use:
results = sqlContext.sql("SELECT word FROM tweets where word like '%#%'")
results.show(20, False)
Note that in Spark 2.0, sqlContext.sql returns a DataFrame rather than an RDD, so for the map step you may also need to go through results.rdd.map(...).

Related

Apache Spark Scala - data analysis - error

I am new to (and still learning) Apache Spark and Scala. I am trying to analyze a dataset and have loaded it in Scala. However, when I try to perform a basic analysis such as max, min, or average, I get an error:
error: value select is not a member of org.apache.spark.rdd.RDD[Array[String]]
Could anyone please shed some light on this? I am running Spark on the cloud lab of an organization.
Code:
// Reading in the csv file
val df = sc.textFile("/user/Spark/PortbankRTD.csv").map(x => x.split(","))
// Select Max of Age
df.select(max($"age")).show()
Error:
<console>:40: error: value select is not a member of org.apache.spark.rdd.RDD[Array[String]]
df.select(max($"age")).show()
Please let me know if you need any more information.
Thanks
Following up on my comment, the textFile method returns an RDD[String]. select is a method on DataFrame. You will need to convert your RDD[String] into a DataFrame. You can do this in a number of ways. One example is
import spark.implicits._
val rdd = sc.textFile("/user/Spark/PortbankRTD.csv")
val df = rdd.toDF()
There are also built-in readers for many types of input files:
spark.read.csv("/user/Spark/PortbankRTD.csv")
returns a DataFrame immediately.
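For instance, a minimal sketch (assuming a Spark 2.x spark-shell where spark is the SparkSession, and that PortbankRTD.csv has a header row with a numeric age column) could look like this:
import org.apache.spark.sql.functions.max
import spark.implicits._
// assumes the first line holds column names and lets Spark infer a numeric type for age
val bankDF = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/user/Spark/PortbankRTD.csv")
bankDF.select(max($"age")).show()
With a proper DataFrame in hand, select, max and the other column expressions from your original attempt become available.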

PySpark: SQLContext temp table is not returning any table

I am quite new to PySpark, so this question may appear quite elementary to others.
I am trying to export a data frame created via createOrReplaceTempView() to Hive. The steps are as follows:
sqlcntx = SQLContext(sc)
df = sqlcntx.read.format("jdbc").options(url="sqlserver://.....details of MS Sql server",dbtable = "table_name").load()
df_cv_temp = df.createOrReplaceTempView("df")
When I use df_cv_temp.show(5) it gives the following error:
NoneType Object has no attribute 'show'
Interestingly, when I try df.show(5) I get proper output.
Naturally, given the above error, I am not able to proceed further.
Now I have two questions.
How to fix the above issue?
Assuming the first issue is taken care of, what is the best way to export df_cv_temp to a Hive table?
P.S. I am using PySpark 2.0.
Update: Incorporating Jim's answer
Following the answer received from Jim, I have updated the code. Please see the revised code below.
from pyspark.sql import HiveContext,SQLContext
sql_cntx = SQLContext(sc)
df = sql_cntx.read.format("jdbc").options(url="sqlserver://.....details of MS Sql server",dbtable = "table_name").load()
df.createOrReplaceTempView("df_cv_temp")
df_cv_filt = sql_cntx.sql("select * from df_cv_temp where DeviceTimeStamp between date_add(current_date(),-1) and current_date()") # Retrieving just a day's record
hc = HiveContext(sc)
Now the problem begins. Please refer to my question 2.
df_cv_tbl = hc.sql("create table if not exists df_cv_raw as select * from df_cv_filt")
df_cv_tbl.write.format("orc").saveAsTable("df_cv_raw")
The above two lines produce the error shown below.
pyspark.sql.utils.AnalysisException: u'Table or view not found: df_cv_filt; line 1 pos 14'
So what is the right way of approaching this?
Instead of
df_cv_temp = df.createOrReplaceTempView("df")
you have to use:
df.createOrReplaceTempView("table1")
This is because df.createOrReplaceTempView(<name_of_the_view>) creates (or replaces, if that view name already exists) a lazily evaluated "view" that you can then use like a Hive table in Spark SQL. The call itself does not return anything, which is why you end up with a NoneType object.
Further, the temp view can be queried as below:
spark.sql("SELECT field1 AS f1, field2 as f2 from table1").show()
In case you are sure you have the space for it, you can persist it as a Hive table directly, like below. This physically creates a managed Hive table, which you can then query even from your Hive CLI.
df.write.saveAsTable("table1")

Applying transformations with filter or map which one is faster Scala spark

I am trying to do some transformations on a dataset with Spark using Scala. I am currently using Spark SQL but want to shift the code to native Scala code. I want to know whether to use filter or map for operations like matching the values in a column and getting a single column, after the transformation, into a different dataset.
SELECT * FROM TABLE WHERE COLUMN = ''
I used to write something like this in Spark SQL. Can someone tell me an alternative way to write the same using map or filter on the dataset, and which one is faster when compared?
You can read the documentation on the Apache Spark website. The API documentation is at https://spark.apache.org/docs/2.3.1/api/scala/index.html#package.
Here is a little example:
import spark.implicits._  // already in scope in spark-shell; needed for toDF and the Dataset encoder
val df = sc.parallelize(Seq((1,"ABC"), (2,"DEF"), (3,"GHI"))).toDF("col1","col2")
val df1 = df.filter("col1 > 1")
df1.show()
val df2 = df1.map(x => x.getInt(0) + 3)
df2.show()
If I understand your question correctly, you need to rewrite your SQL query with the DataFrame API. Your query reads all columns from table TABLE and filters rows where COLUMN is empty. You can do this with a DataFrame in the following way:
spark.read.table("TABLE")
  .where($"COLUMN".eqNullSafe(""))
  .show(10)
Performance will be the same as with your SQL. Use the dataFrame.explain(true) method to understand what Spark will do.
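To see this for yourself, here is a rough sketch (assuming a Spark 2.x spark-shell where spark is the SparkSession and the table TABLE is already registered) that builds both variants and prints their plans for comparison:
import spark.implicits._
val viaSql = spark.sql("SELECT * FROM TABLE WHERE COLUMN = ''")
val viaApi = spark.read.table("TABLE").where($"COLUMN".eqNullSafe(""))
viaSql.explain(true)  // parsed, analyzed, optimized and physical plans
viaApi.explain(true)  // both variants go through the same Catalyst optimizer
Both filter and where end up as the same filter operation in the plan, so the choice between SQL and the DataFrame API is mostly a matter of style.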

Why does my query fail with AnalysisException?

I am new to Spark Streaming. I am trying Spark Structured Streaming with local CSV files, and I am getting the exception below while processing.
Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
FileSource[file:///home/Teju/Desktop/SparkInputFiles/*.csv]
This is my code.
val df = spark
  .readStream
  .format("csv")
  .option("header", "false")  // Use first line of all files as header
  .option("delimiter", ":")   // Specifying the delimiter of the input file
  .schema(inputdata_schema)   // Specifying the schema for the input file
  .load("file:///home/Teju/Desktop/SparkInputFiles/*.csv")
val filterop = spark.sql("select tagShortID,Timestamp,ListenerShortID,rootOrgID,subOrgID,first(rssi_weightage(RSSI)) as RSSI_Weight from my_table where RSSI > -127 group by tagShortID,Timestamp,ListenerShortID,rootOrgID,subOrgID order by Timestamp ASC")
val outStream = filterop.writeStream.outputMode("complete").format("console").start()
I created a cron job so that every 5 minutes I get one new input CSV file, which I am trying to process with Spark Streaming.
(This is not a solution but more of a comment; given its length it ended up here. I'm going to make it an answer eventually, right after I've collected enough information for investigation.)
My guess is that you're doing something incorrect on df that you have not included in your question.
Since the error message is about a FileSource with the path below, and that is a streaming dataset, it must be df that's in play.
FileSource[file:///home/Teju/Desktop/SparkInputFiles/*.csv]
Given the other lines, my guess is that you register the streaming dataset as a temporary table (i.e. my_table), which you then use in spark.sql to execute SQL, and writeStream to the console.
df.createOrReplaceTempView("my_table")
If that's correct, the code you've included in the question is incomplete and does not show the reason for the error.
Add .writeStream.start to your df, as the Exception is telling you.
Read the docs for more detail.
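Putting the two answers together, here is a minimal end-to-end sketch of what such a pipeline could look like (the schema below is a placeholder, and the original rssi_weightage UDF and ordering are replaced by a plain count to keep the example self-contained):
import org.apache.spark.sql.types._
// placeholder schema roughly matching the columns used in the query
val inputdata_schema = new StructType()
  .add("tagShortID", StringType)
  .add("Timestamp", StringType)
  .add("ListenerShortID", StringType)
  .add("rootOrgID", StringType)
  .add("subOrgID", StringType)
  .add("RSSI", IntegerType)
val df = spark.readStream
  .format("csv")
  .option("delimiter", ":")
  .schema(inputdata_schema)
  .load("file:///home/Teju/Desktop/SparkInputFiles/*.csv")
// register the streaming dataset so spark.sql can resolve my_table
df.createOrReplaceTempView("my_table")
val filterop = spark.sql(
  "select tagShortID, Timestamp, count(*) as readings " +
  "from my_table where RSSI > -127 " +
  "group by tagShortID, Timestamp")
// every query over a streaming source must be started with writeStream ... start()
val outStream = filterop.writeStream
  .outputMode("complete")
  .format("console")
  .start()
outStream.awaitTermination()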

Spark DataFrame groupBy

I have Spark Java code that looks like this. The code pulls data from an Oracle table using JDBC and displays the groupBy output.
DataFrame jdbcDF = sqlContext.read().format("jdbc").options(options).load();
jdbcDF.show();
jdbcDF.groupBy("VA_HOSTNAME").count().show();
Long ll = jdbcDF.count();
System.out.println("ll="+ll);
When I run the code, jdbcDF.show(); works, whereas groupBy and count do not print anything, and no errors are thrown.
My column name is correct. I tried printing that column and it worked, but with groupBy it's not working.
Can someone help me with the DataFrame output? I am using Spark 1.6.3.
You can try
import org.apache.spark.sql.functions.count
jdbcDF.groupBy("VA_HOSTNAME").agg(count("*")).show()
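As a quick sanity check, here is a minimal sketch with made-up in-memory data standing in for the JDBC table (assuming a Spark 1.6 shell where sqlContext is available); the same pattern should print one row per hostname with its count:
import sqlContext.implicits._
import org.apache.spark.sql.functions.count
// hypothetical sample data in place of the Oracle table
val sample = Seq(("host1", 1), ("host1", 2), ("host2", 3)).toDF("VA_HOSTNAME", "ID")
sample.groupBy("VA_HOSTNAME").agg(count("*").as("cnt")).show()
If this prints but your JDBC DataFrame still does not, the issue is more likely with the data coming back from the JDBC source than with the groupBy itself.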