To find out whether a given keyword exists in a huge text file, I came up with the two approaches below.
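from operator import add

def keywordExists(line):
    # 1 if the line contains the keyword, otherwise 0
    if line.find("my_keyword") > -1:
        return 1
    return 0

lines = sparkContext.textFile("test_file.txt")
# map each line to 0/1, then add the flags up
isExist = lines.map(keywordExists)
total = isExist.reduce(add)
print("Found" if total > 0 else "Not Found")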
val keyword = "my_keyword"
val rdd = sparkContext.textFile("test_file.txt")
val count = rdd.filter(line => line.contains(keyword)).count
println(if (count > 0) "Found" else "Not Found")
The main difference is that the first one uses map and then reduce, whereas the second one filters and then counts.
Could anyone suggest which is more efficient?
I would suggest:
val wordFound = !rdd.filter(line=>line.contains(keyword)).isEmpty()
Benefit: the search can stop as soon as one occurrence of the keyword is found.
See also: Spark: Efficient way to test if an RDD is empty
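The early-exit idea can be illustrated without a cluster. The sketch below is plain Python, not Spark: keyword_exists and counting_lines are hypothetical helpers and the sample lines are made up. It scans lazily and stops at the first match, which is roughly what filter(...).isEmpty() is able to do, whereas a count has to visit every line.

```python
def keyword_exists(lines, keyword):
    """Return True as soon as one line contains keyword (stops early)."""
    return any(keyword in line for line in lines)

def counting_lines(lines, seen):
    """Yield lines one by one, recording each consumed line in `seen`."""
    for line in lines:
        seen.append(line)
        yield line

# Made-up sample data for illustration
sample = ["nothing here", "my_keyword is here", "never read", "never read either"]
seen = []
found = keyword_exists(counting_lines(sample, seen), "my_keyword")
print("Found" if found else "Not Found")
print(len(seen), "of", len(sample), "lines were read")
```

Only the first two lines are consumed here, because any() stops at the first True.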
I am using Spark/Scala to attempt a "simple" query. I have a file which, after line 1 below runs, looks like this:
The first and third columns (EmpReg, RegPay) come from one source, and the second and fourth columns (EmpOT, OTPay) come from a second source. My objective is output that looks like this.
Here is the code that I have been trying, at least what I have saved.
var q2 = q.join(q1, q("EmpReg") === q1("EmpOT"), "fullouter")
//q2 = q2.select("EmpReg", ($"RegPay" + $"OTPay"))
//q2 = q2.groupBy($"EmpReg".sum($"RegPay" + $"OTPay"))
var add = q2.select(($"RegPay" + $"OTPay"))
//q2 = q2.sum("RegPay", "OTPay")
//q2 = q2.groupBy("EmpReg", "EmpOT")
//var q2 = q.join(q1).where("EmpReg") === "EmpOT"))
//q2 = q2.select("EmpReg").sum("RegPay", "OTPay")
[q] is the first file, which represents regular pay. [q1] is the second file, which represents overtime pay. [q2] is the combination shown in the first example above. The primary keys are [EmpReg] and [EmpOT]. I don't really need to combine [EmpReg] and [EmpOT] since they are the same, and it doesn't make any difference which one I use.
I really need to add [RegPay] and [OTPay] to get [Pay], but for the life of me I can't get it to work. The commented-out lines return various errors. I can add the two pay columns, and I can select an appropriate employee column, but I can't seem to do both in one query. I am constrained to use Scala on Databricks. Otherwise, I might do something like this.
select q.EmpReg as Emp, (q.RegPay + q1.OTPay) as Pay
from q join q1 on q.EmpReg = q1.EmpOT
(Why can't things ever be simple?)
You can use a similar approach as in your SQL query:
val q2 = q.join(q1, q("EmpReg") === q1("EmpOT"), "fullouter")
val add = q2.select(q("EmpReg").as("Emp"), (q("RegPay") + q1("OTPay")).as("Pay"))
Your code has this line:
q2.select("EmpReg", ($"RegPay" + $"OTPay"))
which should work if you add $ before "EmpReg". You can't mix strings and columns in the same select statement. This works in Python but not in Scala.
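One thing worth checking with a "fullouter" join: an employee present in only one source gets null for the other pay column, and in Spark SQL null plus a number is null. The sketch below is plain Python rather than the Spark API, with made-up sample data and a hypothetical helper full_outer_pay. It shows the result one would probably want when a missing side counts as 0. In Spark, the likely equivalent is wrapping each pay column in coalesce(..., lit(0)) before adding.

```python
def full_outer_pay(reg_pay, ot_pay):
    """Combine two {employee: pay} mappings, treating a missing side as 0."""
    employees = sorted(set(reg_pay) | set(ot_pay))
    return {emp: reg_pay.get(emp, 0) + ot_pay.get(emp, 0) for emp in employees}

# Made-up sample data: A has no overtime row, C has no regular-pay row
reg = {"A": 100, "B": 200}
ot = {"B": 50, "C": 30}

pay = full_outer_pay(reg, ot)
print(pay)
```

If every employee is guaranteed to appear in both sources, an inner join avoids the issue altogether.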
The lines of the document are as follows:
I am 12 year old.
I go to school.
I am playing.
Its 4 pm.
Two lines of the document contain numbers. I want to count how many lines in the document contain a number.
This is to be implemented in Scala Spark.
val lineswithnum=linesRdd.filter(line => (line.contains("[^0-9]"))).count()
I expect the output to be 2, but I am getting 0.
You can use the exists method:
val lineswithnum=linesRdd.filter(line => line.exists(_.isDigit)).count()
In line with your original approach and not discounting the other answer(s):
val textFileLines = sc.textFile("/FileStore/tables/so99.txt")
val linesWithNumCollect = textFileLines.filter(_.matches(".*[0-9].*")).count
The .* is added on both sides so that the digit can appear anywhere within the line, since matches has to match the whole string.
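The whole-string behaviour behind the .* padding can be seen in a small Python sketch. This is an analogy rather than the Scala code itself: re.fullmatch plays the role of String.matches, and re.search looks for the pattern anywhere in the line. The four lines are the ones from the question.

```python
import re

lines = ["I am 12 year old.", "I go to school.", "I am playing.", "Its 4 pm."]

# fullmatch must consume the entire line, so a bare [0-9] never matches
whole_line_bare = [l for l in lines if re.fullmatch(r"[0-9]", l)]
# padding with .* lets the digit sit anywhere in the line
whole_line_padded = [l for l in lines if re.fullmatch(r".*[0-9].*", l)]
# search needs no padding because it scans for the pattern anywhere
anywhere = [l for l in lines if re.search(r"[0-9]", l)]

print(len(whole_line_bare), len(whole_line_padded), len(anywhere))  # prints: 0 2 2
```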
I have directories/files in S3 in the below structure.
Now I want to pass a date range such as start_date=20180101 and end_date=20180102, and I want the PySpark code to read the files from the directories that fall within that range. How can I achieve this?
Note: the range is configurable, i.e. it can be 1 week, 30 days, or 90 days.
I created a list of paths for the date range and passed it to sc.textFile().
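import datetime

start = datetime.datetime.strptime(start_date, '%Y%m%d')
end = datetime.datetime.strptime(end_date, '%Y%m%d')
step = datetime.timedelta(days=1)
paths = []
while start <= end:
    # base_path is a placeholder for the S3 prefix, which is not shown here
    paths.append(base_path + start.strftime('%Y%m%d'))
    start += step
# comma-separated list of one path per day in the range
str1 = ','.join(paths)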
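Since the range is meant to be configurable (1 week, 30 days, 90 days), the same loop can be wrapped in a function that takes an end date and a number of days. This is only a sketch: date_range_paths is a hypothetical helper, and the "s3://example-bucket/data/" prefix is made up because the real directory layout is not shown.

```python
import datetime

def date_range_paths(base_path, end_date, days):
    """Return one path per day for the `days` days ending at end_date (inclusive)."""
    end = datetime.datetime.strptime(end_date, "%Y%m%d")
    start = end - datetime.timedelta(days=days - 1)
    paths = []
    current = start
    while current <= end:
        paths.append(base_path + current.strftime("%Y%m%d"))
        current += datetime.timedelta(days=1)
    return paths

# "s3://example-bucket/data/" is a made-up prefix for illustration only
week = date_range_paths("s3://example-bucket/data/", "20180107", 7)
print(",".join(week))
```

The comma-joined string can then be handed to sc.textFile, which accepts a comma-separated list of paths.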
meta_mueller <- search_tweets("mueller", n = 250000, retryonratelimit = TRUE)
Within the dataframe is a column "geo_coords". On a visual scan, the majority of its values are c(NA,NA).
I have dplyr installed (other packages are fine, too) and I want to identify any rows that do not equal c(NA,NA).
This did not work.
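meta_mueller_location = select(meta_mueller, place_full_name)
# is.na() tests for missing values directly; comparing against the string "NA" only works by accident
meta_mueller_location_filter = filter(meta_mueller_location,
                                      !is.na(place_full_name))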
Instead of geo_coords, I ran the command on the "place_full_name" column, which contains a plain NA rather than c(NA,NA). This was a better solution for my needs.