The code below is supposed to filter the data that contains the string x or y. It works fine in the Spark shell, but when I run the script from bash it only finds the data containing y and ignores x.
val targetData= Namedata.filter(x => (x(1).contains(x)||x(1).contains(y)))
Does anyone know how to fix this?
Thank you for your help.
This code works. In your version the lambda parameter is also named x, so it shadows the string x you want to search for, and contains(x) ends up testing against the row itself (which never matches); renaming the parameter fixes it:
val targetData = Namedata.filter(r => (r(1).contains(x) || r(1).contains(y)))
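For context, here is a minimal sketch of the fixed filter with made-up sample data; Namedata, x, and y are assumed to be an RDD[Array[String]] and two search strings (the values below are purely illustrative):
// Hypothetical values, only to illustrate the shadowing fix
val x = "foo"
val y = "bar"
val Namedata = sc.parallelize(Seq(
  Array("1", "this row has foo"),
  Array("2", "this row has bar"),
  Array("3", "this row has neither")
))
// Renaming the lambda parameter to r stops it from shadowing the outer x,
// so contains(x) now tests against the search string instead of the row itself
val targetData = Namedata.filter(r => r(1).contains(x) || r(1).contains(y))
targetData.collect().foreach(r => println(r.mkString(", ")))
// 1, this row has foo
// 2, this row has bar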
Recently I started Frank Kane's course, Taming Big Data with Apache Spark using Python.
In the line where I compute the average number of friends, I am getting a syntax error, and I cannot figure out how to fix it. Please refer to the code below. FYI, I am using Python 3. I have highlighted the line with the syntax error. Please help, as I am stuck here.
from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("local").setAppName("AverageAge")
sc = SparkContext(conf = conf)
def parseline(line):
    fields = line.split(',')
    friend_age = int(fields[2])
    friends_number = int(fields[3])
    return (friend_age, friends_number)
lines = sc.textFile("file:///Sparkcourse/SparkCourse/fakefriends.csv")
rdd=lines.map(parseline)
making_keys=rdd.mapByValues(lambda x:(x,1))
totalsByAge=making_keys.reduceByKeys(lambda x,y: (x[0]+y[0],x[1]+y[1])
**averages_by_keys= totalsByAge.mapValues(lambda x: x[0] / x[1])**(Syntax Error)
results=averageByKeys.collect()
for result in results:
    print result
Look at the line above the highlighted one: you're missing a closing parenthesis at the end of the reduceByKeys call.
I want to remove the header from a file. But since the file will be split into partitions, I can't just drop the first item, so I used a filter function to work around it. Here is the code I am using:
val noHeaderRDD = baseRDD.filter(line=>!line.contains("REPORTDATETIME"));
The error I am getting says "error: not found: value line". What could be the issue with this code?
I don't think anybody answered the obvious, whereby line.contains is also possible:
val noHeaderRDD = baseRDD.filter(line => !(line contains("REPORTDATETIME")))
You were nearly there, just a syntax issue, but that is significant of course!
Using textFile as below:
val rdd = sc.textFile(<<path>>)
rdd.filter(x => !x.startsWith(<<"Header Text">>))
Or
In Spark 2.0:
spark.read.option("header","true").csv("filePath")
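If there is no text you can reliably match in the header, another option (a minimal sketch, assuming the header is the very first line of the file; the path is a placeholder) is to drop the first line of the first partition only:
val rdd = sc.textFile(<<path>>)
// Drop the first line of partition 0 only; all other partitions pass through untouched
val noHeaderRDD = rdd.mapPartitionsWithIndex { (idx, iter) =>
  if (idx == 0) iter.drop(1) else iter
}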
buffer.slice(mouse,highlight).distinct
Now when I run this, it seems to apply .distinct to the whole string rather than to the selection I take with slice (mouse and highlight are just index positions, and buffer is a StringBuilder). I'm just wondering what the reason for this is.
Your approach is correct. Please see the code below for more clarification.
The slice() function gives you the sub-string, so in your approach it will first take the sub-string and then apply distinct to it.
Please follow the step-by-step walkthrough below for more understanding.
val buffer=new StringBuilder
buffer.append("bbbaabbbcccbdbcdbd")
val sl=buffer.slice(2,10)
The variable sl now contains:
sl = baabbbcc
Now you can apply distinct to the sl variable:
val result=sl.distinct
Finally, your output:
result = bac
This is how your single line of code works.
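As a quick check (a minimal sketch reusing the same sample string), chaining the two calls gives the same result, which confirms that distinct only sees the sliced portion:
val buffer = new StringBuilder
buffer.append("bbbaabbbcccbdbcdbd")
// slice(2, 10) is "baabbbcc"; distinct on that yields "bac"
println(buffer.slice(2, 10).distinct)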
So, I'm reading data from a JSON file and creating a DataFrame. Usually, I would use
sqlContext.read.json("//line//to//some-file.json")
The problem is that my JSON file isn't consistent: for each line in the file, there are two JSON objects. Each line looks like this:
{...data I don't need....}, {...data I need....}
I only need my DataFrame to be formed from the data I need, i.e. the second JSON object of each line. So I read each line as a string and take the substring that I need, like so:
val lines = sc.textFile(link, 2)
val part = lines.map( x => x.substring(x.lastIndexOf('{')).trim)
I want to get all the elements in part as an Array[String], then turn the Array[String] into one string and make the DataFrame, like so:
val strings = part.collect()   // doesn't work
val strings = part.take(1000)  // works
val jsonStr = "[".concat(strings.mkString(", ")).concat("]")
The problem is, if I call part.collect(), it doesn't work but if I call part.take(N) it works. However, I'd like to get all my data and not just the first N.
Also, if I try part.take(part.count().toInt) it still doesn't work.
Any ideas?
EDIT
I realized my problem after a good sleep. It was a silly mistake on my part. The very last line of my input file had a different format from the rest of the file.
So part.take(N) would work for all N less than part.count(). That's why part.collect() wasn't working.
Thanks for the help though!
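For what it's worth, here is a sketch of a more defensive version; it assumes the same lines RDD, and the guard filter plus the direct json(part) read are additions rather than the original code. It skips lines without a JSON object and lets Spark parse the strings straight from the RDD, avoiding the collect to the driver:
// Keep only lines that actually contain a JSON object, then take the last object on each line
val part = lines.filter(_.contains("{"))
                .map(x => x.substring(x.lastIndexOf('{')).trim)
// Parse the JSON strings directly from the RDD instead of collecting them to the driver
val df = sqlContext.read.json(part)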
I'm running a Spark job in Scala and I'm stuck on parsing the input file.
The input file (TAB-separated) is something like:
date=20160701 name=mike age=26
date=20160402 name=john age=33
I want to parse it and extract only the values, not the keys, like this:
20160701 mike 26
20160402 john 33
How can this be achieved in Scala?
I'm using Scala version 2.11.
You can use CSVParser(); since you know the location of each key, it will be easy and clean.
Test data
val data = "date=20160701\tname=mike\tage=26\ndate=20160402\tname=john\tage=33\n"
One statement to do what you asked
val rdd = sc.parallelize(data.split('\n'))
  .map(_.split('\t')          // split into tab-separated key=value fields
  .map(_.split('=')(1)))      // split each at "=" and select only the value
Display what we got
rdd.collect().foreach(r=>println(r.mkString(",")))
// 20160701,mike,26
// 20160402,john,33
But don't do this for real code. It's very fragile in the face of data format errors, etc. Use CSVParser or something instead as Narendra Parmar suggests.
val rdd = sc.textFile()
rdd.map(x => x.split("\t")).map(x => x.map(_.split("=")(1))).map(x => x.mkString("\t")).saveAsTextFile("")
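If the key order cannot be trusted, a slightly more defensive variant (a minimal sketch; the key names date/name/age come from the sample data above, the rest is illustrative) is to parse each field into a key-to-value map and pick the values by key:
// Parse each line into a Map(key -> value), then select values by key name,
// which keeps working even if the columns are reordered
val parsed = sc.parallelize(data.split('\n')).map { line =>
  val kv = line.split('\t').map { field =>
    val Array(k, v) = field.split('=')
    k -> v
  }.toMap
  Seq("date", "name", "age").flatMap(kv.get).mkString("\t")
}
parsed.collect().foreach(println)
// 20160701  mike  26
// 20160402  john  33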