Action in Scala/Spark is not getting executed after a transformation - scala

I am currently experimenting with Apache Spark through Scala, using version 2.4.3 of Spark Core (as defined in my build.sbt file). I am running a simple example: creating an RDD from a text file and filtering all the lines that contain the word "pandas". After that, I use an action to count the number of lines that actually contain that word. If I simply count the total number of lines in the file, everything works fine, but if I apply a filter transformation and then try to count the number of elements, the execution never finishes.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._
println("Creating Context")
val conf = new SparkConf().setMaster("local").setAppName("Test")
val sc = new SparkContext(conf)
val lines = sc.textFile("/home/lbali/example.txt")
val pandas = lines filter(line => line.contains("pandas"))
println("+++++ number of lines: " + lines.count()) // this works ok.
println("+++++ number of lines with pandas: " + pandas.count()) // This does not work
sc.stop()

Try persisting the RDD. When multiple actions are performed over the same RDD, it is better to persist it than to recompute the whole lineage for each action:
lines.persist(StorageLevel.MEMORY_AND_DISK)
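A minimal sketch of how that fits into the code from the question: the call needs the StorageLevel import, and it should come before the first action so the second count can reuse the cached partitions.
import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("/home/lbali/example.txt")
lines.persist(StorageLevel.MEMORY_AND_DISK)   // marked for caching; materialized by the first action

val pandas = lines.filter(line => line.contains("pandas"))
println("+++++ number of lines: " + lines.count())              // reads the file and caches it
println("+++++ number of lines with pandas: " + pandas.count()) // reuses the cached partitions of lines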

I think I found a solution: downgrading the Scala version from 2.12.8 to 2.11.12 solved the issue.
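For context, Spark 2.4.3 artifacts are published for both Scala 2.11 and 2.12, so the important part is that the Scala version in build.sbt matches the Spark build you actually run against. A minimal build.sbt sketch with aligned versions (the exact values mirror the question and are otherwise an assumption):
// %% appends the Scala binary version to the artifact name,
// so this resolves spark-core_2.11 to match scalaVersion
scalaVersion := "2.11.12"

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.3"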

Related

I am getting a syntax error while computing average number of friends in apache spark

I have recently started Frank Kane's course, Taming Big Data with Apache Spark using Python.
On the line where I compute the average number of friends, I am getting a syntax error, and I cannot understand how to fix it. Please refer to the code below. FYI, I am using Python 3. I have highlighted the code with the syntax error. Please help, as I am stuck here.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("AverageAge")
sc = SparkContext(conf=conf)

def parseline(line):
    fields = line.split(',')
    friend_age = int(fields[2])
    friends_number = int(fields[3])
    return (friend_age, friends_number)

lines = sc.textFile("file:///Sparkcourse/SparkCourse/fakefriends.csv")
rdd = lines.map(parseline)
making_keys = rdd.mapByValues(lambda x: (x, 1))
totalsByAge = making_keys.reduceByKeys(lambda x, y: (x[0] + y[0], x[1] + y[1])
averages_by_keys = totalsByAge.mapValues(lambda x: x[0] / x[1])  # <-- syntax error reported here
results = averageByKeys.collect()
for result in results:
    print result
Look at the line above the highlighted one: the reduceByKeys call is missing a closing parenthesis, which is why Python reports the syntax error on the next line. Once that is fixed you will run into a few more problems: the pair-RDD methods are called mapValues and reduceByKey (there is no mapByValues or reduceByKeys), collect() is called on averageByKeys but the variable is named averages_by_keys, and in Python 3 print is a function, so the last line needs to be print(result).

Write Header only CSV record from Spark Scala DataFrame

My requirement is to write only the header CSV record using a Spark Scala DataFrame. Can anyone help me with this?
val OHead1 = "/xxxxx/xxxx/xxxx/xxx/OHead1/"
val sc = sparkFile.sparkContext
val outDF = csvDF.select("col_01", "col_02", "col_03").schema
sc.parallelize(Seq(outDF.fieldNames.mkString("\t"))).coalesce(1).saveAsTextFile(s"$OHead1")
The above works and is able to create the header in the CSV with a tab delimiter. Since I am using a Spark session, I create the sparkContext in the second line. csvDF is my DataFrame, created before these statements.
Two things are outstanding; can one of you help me?
1. The above working code does not overwrite the files, so every time I need to delete them manually. I could not find an overwrite option; can you help me?
2. Since I am doing a select statement and reading the schema, will that be considered an action and start another lineage for this statement? If that is true, it would degrade performance.
If you need to output only the header you can use this code:
df.schema.fieldNames.reduce(_ + "," + _)
It will create a CSV line with the names of the columns.
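To actually write that single header line out, and to deal with the asker's first outstanding point (saveAsTextFile does not overwrite an existing directory), one common pattern is to delete the target path through the Hadoop FileSystem API before writing again. A sketch only, reusing the asker's placeholder path and column names:
import org.apache.hadoop.fs.{FileSystem, Path}

val outPath = new Path("/xxxxx/xxxx/xxxx/xxx/OHead1/")
val fs = FileSystem.get(sc.hadoopConfiguration)
if (fs.exists(outPath)) fs.delete(outPath, true)   // recursive delete so the job can be re-run

val header = csvDF.select("col_01", "col_02", "col_03").schema.fieldNames.mkString("\t")
sc.parallelize(Seq(header), 1).saveAsTextFile(outPath.toString)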
I tested it, and the solution below did not affect performance: selecting columns and reading .schema only inspects the DataFrame's metadata, so it does not trigger a job or start a new lineage.
val OHead1 = "/xxxxx/xxxx/xxxx/xxx/OHead1/"
val sc = sparkFile.sparkContext
val outDF = csvDF.select("col_01", "col_02", "col_03").schema
sc.parallelize(Seq(outDF.fieldNames.mkString("\t"))).coalesce(1).saveAsTextFile(s"$OHead1")
I found a solution to handle this situation: define the columns in the configuration file and write those columns to a file. Here is the snippet.
val Header = prop.getProperty("OUT_HEADER_COLUMNS").replaceAll("\"","").replaceAll(",","\t")
scala.tools.nsc.io.File(s"$HeadOPath").writeAll(s"$Header")

Fast file writing in scala?

So I have a Scala program that iterates through a graph and writes out data line by line to a text file. It is essentially an edge list file for use with GraphX.
The biggest slowdown is actually creating this text file; we're talking maybe a million records written to it. Is there a way I can parallelize this task or make it faster in any way, by storing it in memory or anything?
More info:
I am using a Hadoop cluster to iterate through a graph, and here is the code snippet for the text file creation I am doing now to write to HDFS:
val fileName = dbPropertiesFile + "-edgelist-" + System.currentTimeMillis()
val path = new Path("/home/user/graph/" + fileName + ".txt")
val conf = new Configuration()
conf.set("fs.defaultFS", "hdfs://host001:8020")
val fs = FileSystem.newInstance(conf)
val os = fs.create(path)
while (edges.hasNext) {
  val current = edges.next()
  os.write(current.inVertex().id().toString.getBytes())
  os.write(" ".getBytes())
  os.write(current.outVertex().id().toString.getBytes())
  os.write("\n".getBytes())
}
fs.close()
Writing files to HDFS is never fast. Your tags suggest that you are already using Spark anyway, so you might as well take advantage of it.
import spark.implicits._   // needed for .toDF; assumes `spark` is your SparkSession

sparkContext
  .makeRDD(edges.toStream, 20)                                          // data first, then the number of partitions
  .map(e => (e.inVertex().id().toString, e.outVertex().id().toString))
  .toDF("src", "dst")
  .write
  .option("sep", " ")                                                   // space-delimited, like the original loop
  .csv(path.toString)
This splits your input into 20 partitions (you can control that number with the second argument to makeRDD above) and writes them in parallel as 20 different part files in HDFS, which together represent your resulting output.
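If you still want a single output file like the original loop produced, a possible variant (under the same assumptions as the snippet above) is to collapse to one partition just before the write; the upstream work stays distributed, but the write itself becomes single-threaded again:
sparkContext
  .makeRDD(edges.toStream, 20)
  .map(e => (e.inVertex().id().toString, e.outVertex().id().toString))
  .toDF("src", "dst")
  .coalesce(1)            // one part file instead of 20; only the final write loses parallelism
  .write
  .option("sep", " ")
  .csv(path.toString)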

Pyspark - Transfer control out of Spark Session (sc)

This is a follow-up question to
Pyspark filter operation on Dstream
To keep a count of how many error/warning messages have come through for, say, a day or an hour, how does one design the job?
What I have tried:
from __future__ import print_function
import sys

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

def counts():
    counter += 1
    print(counter.value)

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: network_wordcount.py <hostname> <port>", file=sys.stderr)
        exit(-1)

    sc = SparkContext(appName="PythonStreamingNetworkWordCount")
    ssc = StreamingContext(sc, 5)
    counter = sc.accumulator(0)

    lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2]))
    errors = lines.filter(lambda l: "error" in l.lower())
    errors.foreachRDD(lambda e: e.foreach(counts))
    errors.pprint()

    ssc.start()
    ssc.awaitTermination()
This, however, has multiple issues. To start with, print doesn't work (it does not output to stdout; I have read about it, and the best I can use here is logging). Can I save the output of that function to a text file and tail that file instead?
I am also not sure why the program just exits; there is no error/dump anywhere to look into further (Spark 1.6.2).
How does one preserve state? What I am trying to do is aggregate logs by server and severity; another use case is to count how many transactions were processed by looking for certain keywords.
Pseudo Code for what I want to try:
foreachRDD(Dstream):
    if RDD.contains("keyword1 | keyword2 | keyword3"):
        dictionary[keyword] = dictionary.get(keyword, 0) + 1  # add the keyword if not present and increase the counter
    print dictionary  # or send this dictionary elsewhere
The last part, sending or printing the dictionary, requires switching out of the Spark streaming context. Can someone explain the concept, please?
print doesn't work
I would recommend reading the design patterns section of the Spark documentation. I think that roughly what you want is something like this:
def _process(iter):
    for item in iter:
        print(item)

lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2]))
errors = lines.filter(lambda l: "error" in l.lower())
errors.foreachRDD(lambda e: e.foreachPartition(_process))
This will get your print call to work (though it is worth noting that the print will execute on the workers and not on the driver, so if you're running this code on a cluster you will only see the output in the worker logs).
However, it won't solve your second problem:
How does one preserve state?
For this, take a look at updateStateByKey and the related example.

Read ORC files directly from Spark shell

I am having issues reading an ORC file directly from the Spark shell. Note: I am running Hadoop 1.2 and Spark 1.2, using the pyspark shell, but I can also use spark-shell (which runs Scala).
I have used this resource: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.2.4/Apache_Spark_Quickstart_v224/content/ch_orc-spark-quickstart.html
from pyspark.sql import HiveContext
hiveCtx = HiveContext(sc)
inputRead = sc.hadoopFile("hdfs://user#server:/file_path",
classOf[inputFormat:org.apache.hadoop.hive.ql.io.orc.OrcInputFormat],
classOf[outputFormat:org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat])
I generally get an error saying wrong syntax. One time the code seemed to work, when I used just the first of the three arguments passed to hadoopFile, but when I tried to use
inputRead.first()
the output was RDD[nothing, nothing]. I don't know whether this is because the inputRead variable did not get created as an RDD, or whether it was not created at all.
I appreciate any help!
In Spark 1.5, I'm able to load my ORC file as:
val orcfile = "hdfs:///ORC_FILE_PATH"
val df = sqlContext.read.format("orc").load(orcfile)
df.show
You can try this code; it works for me.
val LoadOrc = spark.read.option("inferSchema", true).orc("filepath")
LoadOrc.show()
You can also pass multiple paths to read from:
val df = sqlContext.read.format("orc").load("hdfs://localhost:8020/user/aks/input1/*","hdfs://localhost:8020/aks/input2/*/part-r-*.orc")