Read Delta table from multiple folders - pyspark

I'm working on Databricks and reading my Delta table like this:
path = "/root/data/foo/year=2021/"
df = spark.read.format("delta").load(path)
However, within the year=2021 folder there are sub-folders for each day: day=01, day=02, day=03, etc.
How can I read only the folders for days 4, 5 and 6, for example?
edit #1
Reading answers to other questions, it seems that the proper way to achieve this is to apply a filter on the partition columns.

It seems the better way to read partitioned Delta tables is to apply a filter on the partitions:
df = spark.read.format("delta").load('/whatever/path')
df2 = df.filter("year = '2021' and month = '01' and day in ('04','05','06')")
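For reference, a minimal sketch of the same filter written with column expressions instead of a SQL string; it assumes year and day are the table's partition columns and that their values are stored as strings, as in the filter above:

from pyspark.sql.functions import col

# Load from the table root (not a partition sub-folder); Spark prunes the
# partitions, so only the matching year=2021/day=04..06 folders are read
df = spark.read.format("delta").load("/root/data/foo")
df2 = df.filter((col("year") == "2021") & (col("day").isin("04", "05", "06")))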

List the days as comma-separated values enclosed in curly brackets:
path = "/root/data/foo/year=2021/day={04,05,06}/"
or use a character class for the last digit:
path = "/root/data/foo/year=2021/day=0[456]/"
path = "/root/data/foo/year=2021/day=0[4-6]/"

Remove .format("delta") from your read and use the helper function below to list the day folders:
def fileexists(filepath, FromDay, ToDay):
    # List the day=XX sub-folders (sorted by name) and keep days FromDay..ToDay;
    # assumes the folders start at day=01 with no gaps
    mylist = sorted(item.path for item in dbutils.fs.ls(filepath))
    return mylist[FromDay - 1:ToDay]
filepath = parent folder path, e.g. "/root/data/foo/year=2021/"
FromDay = start day folder
ToDay = end day folder
Note: Change the function as per your requirement.
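A hypothetical usage sketch, assuming the underlying files are Parquet (Spark's default source once .format("delta") is dropped):

paths = fileexists("/root/data/foo/year=2021/", 4, 6)
df = spark.read.load(paths)   # .load() also accepts a list of paths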

.load() accepts a list as well as a str. In your particular example, try this:
path = [f'/root/data/foo/year=2021/day={ea}' for ea in ['01', '02', '03']]
N.B. glob patterns are accepted, but not regex. I'm on Spark 3.2.1.
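If that works as described on your Spark version, the load call from the question would then simply be:

df = spark.read.format("delta").load(path)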

Related

spark program to check if a given keyword exists in a huge text file or not

To find out whether a given keyword exists in a huge text file or not, I came up with the two approaches below.
Approach 1:
def keywordExists(line):
    if line.find("my_keyword") > -1:
        return 1
    return 0

lines = sparkContext.textFile("test_file.txt")
isExist = lines.map(keywordExists)
total = isExist.reduce(lambda a, b: a + b)
print("Found" if total > 0 else "Not Found")
Approach 2:
val keyword = "my_keyword"
val rdd = sparkContext.textFile("test_file.txt")
val count = rdd.filter(line => line.contains(keyword)).count
println(if (count > 0) "Found" else "Not Found")
The main difference is that the first one maps and then reduces, whereas the second one filters and then counts.
Could anyone suggest which is more efficient?
I would suggest:
val wordFound = !rdd.filter(line=>line.contains(keyword)).isEmpty()
Benefit: the search can stop as soon as one occurrence of the keyword is found.
see also Spark: Efficient way to test if an RDD is empty
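The question's first approach is PySpark, so here is a minimal PySpark sketch of the suggested early-stopping check (assuming sc is the SparkContext):

keyword = "my_keyword"
rdd = sc.textFile("test_file.txt")
# isEmpty() can return as soon as one matching line is found
word_found = not rdd.filter(lambda line: keyword in line).isEmpty()
print("Found" if word_found else "Not Found")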

Spark, Scala, Databricks, combine and add columns

Using Spark/Scala to attempt a "simple" query. I have a file which, after the first line of code below runs, looks like this:
EmpReg,EmpOT,RegPay,OTPay
Alice,Alice,400,20
Bob,Bob,300,0
Carol,Carol,450,120
Dan,Dan,400,200
Ellen,Ellen,360,40
The first and third columns (EmpReg, RegPay) come from one source and the second and fourth columns (EmpOT, OTPay) come from a second source. My objective is output that looks like this:
Emp,Pay
Alice,420
Bob,300
Carol,570
Dan,600
Ellen,400
Here is the code that I have been trying, at least what I have saved.
var q2 = q.join(q1, q("EmpReg") === q1("EmpOT"), "fullouter")
//q2 = q2.select("EmpReg", ($"RegPay" + $"OTPay"))
//q2 = q2.groupBy($"EmpReg".sum($"RegPay" + $"OTPay"))
var add = q2.select(($"RegPay" + $"OTPay"))
//q2 = q2.sum("RegPay", "OTPay")
//q2 = q2.groupBy("EmpReg", "EmpOT")
//var q2 = q.join(q1).where("EmpReg") === "EmpOT"))
//q2 = q2.select("EmpReg").sum("RegPay", "OTPay")
//q2.show
add.show
[q] is the first file, which represents regular pay. [q1] is the second file, which represents overtime pay. [q2] is the combination shown in the first example above. The primary keys are [EmpReg] and [EmpOT]. I don't really need to combine [EmpReg] and [EmpOT] since they are the same, and it doesn't make any difference which I use.
I really need to add [RegPay] and [OTPay] to get [Pay], but for the life of me I can't get it to work. The lines commented out return various errors. I can add the two pay columns, and I can select an appropriate employee column, but I can't seem to do both in one query. I am constrained to use Scala on Databricks. Otherwise, I might do something like this:
select q.EmpReg as Emp, (q.RegPay + q1.OTPay) as Pay
from q join q1 on q.EmpReg = q1.EmpOT
(Why can't things ever be simple?)
You can use a similar approach as in your SQL query:
val q2 = q.join(q1, q("EmpReg") === q1("EmpOT"), "fullouter")
val add = q2.select(q("EmpReg").as("Emp"), (q("RegPay") + q1("OTPay")).as("Pay"))
Your code has this line:
q2.select("EmpReg", ($"RegPay" + $"OTPay"))
which should work if you add $ before "EmpReg", i.e. q2.select($"EmpReg", $"RegPay" + $"OTPay"). You can't mix plain strings and Column objects in the same select statement; that works in Python, but not in Scala.

How to get number of lines from RDD which contain any digits

Lines of the document as follows:
I am 12 year old.
I go to school.
I am playing.
Its 4 pm.
There are two lines of the document that contain numbers. I want to count how many lines in the document contain a number.
This is to be implemented in Scala Spark.
val lineswithnum=linesRdd.filter(line => (line.contains([^0-9]))).count()
I expect the output to be 2, but I am getting 0.
You can use the exists method:
val lineswithnum=linesRdd.filter(line => line.exists(_.isDigit)).count()
In line with your original approach and not discounting the other answer(s):
val textFileLines = sc.textFile("/FileStore/tables/so99.txt")
val linesWithNumCollect = textFileLines.filter(_.matches(".*[0-9].*")).count
The .* is added so the pattern matches a digit anywhere within the line.

How can I read data from multiple directories in pyspark

I have directories/files in S3 in the below structure.
root/
20180101/files.txt
20180102/files.txt
20180103/files.txt
Now I want to pass a date range as start_date=20180101 and end_date=20180102. I want the PySpark code to read the files from the directories included in that range. How can I achieve this?
The range is configurable, i.e. it can be 1 week, 30 days, or 90 days.
I created a list of paths for the date range and passed it to sc.textFile():
import datetime

start = datetime.datetime.strptime(start_date, '%Y%m%d')
end = datetime.datetime.strptime(end_date, '%Y%m%d')
step = datetime.timedelta(days=1)

paths = []
while start <= end:
    # One sub-directory per day, e.g. root/20180101/
    paths.append(s3_input_path + start.strftime("%Y%m%d") + "/")
    start += step

str1 = ','.join(paths)
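If sc is the SparkContext, the joined string can then be passed straight to sc.textFile, which accepts a comma-separated list of paths:

rdd = sc.textFile(str1)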

Filtering Data From Scraped Tweets Using rtweet Package

meta_mueller <- search_tweets("mueller", n = 250000, retryonratelimit = TRUE)
Within the dataframe is a column "geo_coords". Upon visual scan, a majority of the values are c(NA, NA).
I have dplyr installed (other packages are fine, too) and I want to identify any rows that do not equal c(NA,NA).
filter(!is.na(meta_mueller(geo_coords))
This did not work.
Solution:
meta_mueller_location = select(meta_mueller, place_full_name)
meta_mueller_location_filter = filter(meta_mueller_location,
                                      place_full_name != "NA")
Instead of geo_coords, I applied the filter to the place_full_name column, which contains plain NA values rather than c(NA, NA). This was a better solution for my needs.