Saving query output in Spark (Scala)

I executed one simple SQL query and wanted to save the output as a text file on my server. It ran for almost 3 days and nothing happened.
Can you tell me what the problem is? Or what other code I can use to save the output and measure the run time of queries in Spark/Scala?
val in = spark.read.option("header","true").csv("mytablename.csv")
in.registerTempTable("mytablename")
val dn = spark.sql("select * from mytablename").show(10)
val df = spark.time(df.show(10))
df.write.text("file://hadoop/location/text.txt")
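Two things stand out in the snippet: val dn captures the result of show(10), which is Unit rather than a DataFrame, and on the next line df is used inside its own definition (it was presumably meant to be in). Also, DataFrameWriter.text can only write a DataFrame with a single string column, so write.text on a multi-column DataFrame fails. As for measuring run time: spark.time simply measures wall-clock time around a block and prints it. A minimal plain-Scala equivalent (no Spark required), with a hypothetical computation standing in for the query action:

```scala
// Minimal stand-in for spark.time: run a block, print the elapsed
// milliseconds, and return the block's result so it can still be used.
object Timing {
  def time[T](label: String)(block: => T): T = {
    val start = System.nanoTime()
    val result = block // force the computation
    val elapsedMs = (System.nanoTime() - start) / 1000000
    println(s"$label took $elapsedMs ms")
    result
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical work standing in for e.g. df.show(10)
    val sum = time("sum of 1..1000000") {
      (1 to 1000000).map(_.toLong).sum
    }
    println(sum)
  }
}
```

Wrapping the write action itself (rather than show) in such a block is what actually measures the query, since Spark evaluates lazily until an action runs.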

Related

Why does my query fail with AnalysisException?

I am new to Spark streaming. I am trying structured Spark streaming with local CSV files, and I am getting the exception below while processing.
Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
FileSource[file:///home/Teju/Desktop/SparkInputFiles/*.csv]
This is my code.
val df = spark
  .readStream
  .format("csv")
  .option("header", "false")  // the input files have no header line
  .option("delimiter", ":")   // delimiter of the input file
  .schema(inputdata_schema)   // schema for the input file
  .load("file:///home/Teju/Desktop/SparkInputFiles/*.csv")
val filterop = spark.sql("select tagShortID,Timestamp,ListenerShortID,rootOrgID,subOrgID,first(rssi_weightage(RSSI)) as RSSI_Weight from my_table where RSSI > -127 group by tagShortID,Timestamp,ListenerShortID,rootOrgID,subOrgID order by Timestamp ASC")
val outStream = filterop.writeStream.outputMode("complete").format("console").start()
I created a cron job so that every 5 minutes I get one input CSV file, which I am trying to parse with Spark streaming.
(This is more a comment than a solution, but given its length it ended up here. I'll turn it into a proper answer once I've collected enough information to investigate.)
My guess is that you're doing something with df that you have not included in your question.
Since the error message is about a FileSource with the path below, and that is a streaming dataset, it must be df that's in play.
FileSource[file:///home/Teju/Desktop/SparkInputFiles/*.csv]
Given the other lines, I guess that you register the streaming dataset as a temporary table (i.e. my_table), which you then use in spark.sql to execute SQL and writeStream to the console:
df.createOrReplaceTempView("my_table")
If that's correct, the code you've included in the question is incomplete and does not show the reason for the error.
Add .writeStream.start to your df, as the exception tells you.
Read the docs for more detail.

Recursively adding rows to a dataframe

I am new to Spark. I have some JSON data that comes as an HttpResponse, and I need to store it in Hive tables. Every HttpGet request returns a JSON document that will be a single row in the table. Because of this, I am having to write single rows as files in the Hive table directory.
But I feel that having too many small files will reduce speed and efficiency. So is there a way I can recursively add new rows to the DataFrame and write it to the Hive table directory all at once? I feel this would also reduce the runtime of my Spark code.
Example:
for (i <- 1 to 10) {
  newDF = hiveContext.read.json("path")
  df = df.union(newDF)
}
df.write()
I understand that the dataframes are immutable. Is there a way to achieve this?
Any help would be appreciated. Thank you.
You are mostly on the right track: what you want to do is obtain the single records as a Seq[DataFrame], then reduce that Seq[DataFrame] to a single DataFrame by unioning them.
Going from the code you provided:
val BatchSize = 100
val HiveTableName = "table"

(0 until BatchSize)
  .map(_ => hiveContext.read.json("path"))
  .reduce(_ union _)
  .write.insertInto(HiveTableName)
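The shape of this pattern is independent of Spark. As a plain-Scala sketch, with a hypothetical fetchRecord standing in for hiveContext.read.json and list concatenation standing in for union:

```scala
// Plain-collections sketch of the batch pattern above: fetch N
// single-record batches, then combine them in one reduce step.
object BatchUnion {
  // Hypothetical stand-in for reading one single-record DataFrame.
  def fetchRecord(i: Int): List[String] = List(s"record-$i")

  def combine(batchSize: Int): List[String] =
    (0 until batchSize)
      .map(fetchRecord) // Seq[List[String]] ~ Seq[DataFrame]
      .reduce(_ ++ _)   // one combined list ~ one unioned DataFrame

  def main(args: Array[String]): Unit =
    println(combine(5))
}
```

The point is that all the small pieces are combined in memory as one lineage and written out in a single action, instead of one write per record.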
Alternatively, if you want to perform the HTTP requests as you go, we can do that too. Let's assume you have a function that does the HTTP request and converts it into a DataFrame:
def obtainRecord(...): DataFrame = ???
You can do something along the lines of:
val HiveTableName = "table"
val OtherHiveTableName = "other_table"
val jsonArray = ???
val batched: DataFrame =
  jsonArray
    .map(parameter => obtainRecord(parameter))
    .reduce(_ union _)

batched.write.insertInto(HiveTableName)
batched.select($"...").write.insertInto(OtherHiveTableName)
You are clearly misusing Spark here. Apache Spark is an analytical engine, not a database API. There is no benefit to using Spark to modify a Hive database like this: it brings a severe performance penalty without gaining any of Spark's features, including distributed processing.
Instead, you should use a Hive client directly to perform transactional operations.
If you can batch-download all of the data first (for example with a script using curl or some other program) and store it in a file (or many files; Spark can load an entire directory at once), you can then load the file(s) into Spark in one go and do your processing there. I would also check whether the web API has any endpoints for fetching all the data you need, instead of one record at a time.
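A plain-Scala sketch of that batch-first idea, with simple local file writes standing in for the curl downloads and a directory read standing in for Spark loading the whole directory at once (all names here are hypothetical):

```scala
import java.io.File
import java.nio.file.Files

// Sketch of the batch-first approach: land each record in its own file
// first, then process the whole directory in one pass afterwards.
object BatchThenLoad {
  // Hypothetical stand-in for one HTTP GET returning one JSON row.
  def downloadRecord(dir: File, i: Int): Unit =
    Files.write(new File(dir, s"record-$i.json").toPath,
      s"""{"id": $i}""".getBytes("UTF-8"))

  // Stand-in for loading the entire directory at once.
  def loadAll(dir: File): Seq[String] =
    dir.listFiles().sorted.toSeq.map { f =>
      new String(Files.readAllBytes(f.toPath), "UTF-8")
    }

  def demo(n: Int): Seq[String] = {
    val dir = Files.createTempDirectory("batch").toFile
    (0 until n).foreach(i => downloadRecord(dir, i))
    loadAll(dir)
  }

  def main(args: Array[String]): Unit =
    demo(5).foreach(println)
}
```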

Making RDD operations on sqlContext

I am working through an Apache Spark tutorial, making use of the Cassandra database, Spark 2.0, and Python.
I am trying to do an RDD operation on a SQL query, using this tutorial:
https://spark.apache.org/docs/2.0.0-preview/sql-programming-guide.html
It says: "The results of SQL queries are RDDs and support all the normal RDD operations."
I currently have these lines of code:
sqlContext = SQLContext(sc)
results = sqlContext.sql("SELECT word FROM tweets where word like '%#%'").show(20, False)
df = sqlContext.read\
.format("org.apache.spark.sql.cassandra")\
.options(table="wordcount", keyspace= "demo")\
.load()
df.select("word")
df.createOrReplaceTempView("tweets")
usernames = results.map(lambda p: "User: " + p.word)
for name in usernames.collect():
    print(name)
AttributeError: 'NoneType' object has no attribute 'map'
If the variable results is the result of a SQL query, why am I getting this error? Can anyone explain this to me?
Everything works fine and the tables print; the only time I get an error is when I try to do an RDD operation.
Please bear in mind that sc is an existing SparkContext.
It's because show() only prints the content; it returns None, not a DataFrame.
Use:
results = sqlContext.sql("SELECT word FROM tweets where word like '%#%'")
results.show(20, False)

Writing a DataFrame to a CSV file takes too much time in Spark

I want to aggregate data based on intervals of a timestamp column.
I saw that the computation takes 53 seconds, but writing the result to the CSV file takes 5 minutes. It seems like df.write.csv() takes too long to write.
How can I optimize the code, please?
Here is my code snippet:
val df = spark.read.option("header",true).option("inferSchema", "true").csv("C:\\dataSet.csv\\inputDataSet.csv")
//convert all columns to numeric values in order to apply aggregation functions
df.columns.map { c => df.withColumn(c, col(c).cast("int")) }
//add a new column including the new timestamp column
val result2=df.withColumn("new_time",((unix_timestamp(col("_c0"))/300).cast("long") * 300).cast("timestamp")).drop("_c0")
val finalresult=result2.groupBy("new_time").agg(result2.drop("new_time").columns.map(mean(_)).head,result2.drop("new_time").columns.map(mean(_)).tail: _*).sort("new_time")
finalresult.coalesce(1).write.option("header", "true").csv("C:/result_with_time.csv") // <= this takes too much time to write
Here are some thoughts on optimization, based on your code.
inferSchema: it will be faster to supply a predefined schema than to use inferSchema.
Instead of writing to your local filesystem, you can try writing to HDFS and then scp the file to the local machine.
df.coalesce(1).write will take more time than just df.write, but with df.write you will get multiple files, which can be combined using different techniques; or you can simply leave them in one directory as multiple parts of the file.
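One concrete way to combine the part files afterwards (one of the "different techniques" mentioned above), sketched in plain Scala: when each part is written with option("header", "true"), every part file starts with its own header line, so keep the header from the first part and drop it from the others. The file names and demo contents here are hypothetical:

```scala
import java.io.File
import java.nio.file.Files
import scala.io.Source

// Merge CSV part files into a single file: keep the header line from
// the first part only, then append the data lines from every part.
object MergeParts {
  def mergeCsv(parts: Seq[File], out: File): Unit = {
    val lines = parts.zipWithIndex.flatMap { case (part, idx) =>
      val src = Source.fromFile(part)
      try {
        val all = src.getLines().toList
        if (idx == 0) all else all.drop(1) // drop the repeated header
      } finally src.close()
    }
    Files.write(out.toPath, lines.mkString("\n").getBytes("UTF-8"))
  }

  // Self-contained demo with two fake part files.
  def demo(): String = {
    val dir = Files.createTempDirectory("parts")
    val p1 = dir.resolve("part-00000.csv").toFile
    Files.write(p1.toPath, "a,b\n1,2".getBytes("UTF-8"))
    val p2 = dir.resolve("part-00001.csv").toFile
    Files.write(p2.toPath, "a,b\n3,4".getBytes("UTF-8"))
    val out = dir.resolve("merged.csv").toFile
    mergeCsv(Seq(p1, p2), out)
    new String(Files.readAllBytes(out.toPath), "UTF-8")
  }

  def main(args: Array[String]): Unit =
    println(demo())
}
```

This keeps the expensive write fully parallel (no coalesce(1)) and pushes the single-file concern into a cheap local post-processing step.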

Spark for loop with Rdd transformation

I am trying to accomplish the following:
For iterator i from 0 to n
Create data frames using i as one of the filter criteria in the select statement of sparksql
Create Rdd from dataframe
Perform multiple operations on rdd
How do I make sure that the for loop works? I am trying to run the Scala code on a cluster.
First, I would suggest running it locally in a test suite (e.g. with ScalaTest). If you are not the unit/integration-testing type, you could simply do a df.show() on your DataFrames as you iterate through them; this will print a sample from each DataFrame.
(0 until 5).foreach(i => {
  val df = [some data frame you use i in filtering]
  df.show()
  val df_rdd = df.rdd
})
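One thing to watch with this pattern: foreach discards each iteration's result. If you need the per-iteration data afterwards, build it with map instead. A plain-Scala sketch, with a hypothetical filtered(i) standing in for the DataFrame you would build from i:

```scala
// map keeps each iteration's result; foreach only runs side effects.
object LoopResults {
  // Hypothetical stand-in for a DataFrame filtered using i.
  def filtered(i: Int): Seq[Int] =
    (0 to 10).filter(_ % (i + 1) == 0)

  def main(args: Array[String]): Unit = {
    // One result per loop iteration, available after the loop ends.
    val perIteration: Seq[Seq[Int]] = (0 until 5).map(filtered)
    perIteration.foreach(s => println(s.mkString(",")))
  }
}
```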