How can I read data from multiple directories in pyspark

I have directories/files in S3 in the below structure.
root/
20180101/files.txt
20180102/files.txt
20180103/files.txt
Now I want to pass a date range, e.g. start_date=20180101 and end_date=20180102, and have the pyspark code read the files from every directory that falls within the range. How can I achieve this?
Note: the range is configurable, i.e. it can be 1 week, 30 days, or 90 days.

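I created a list of paths for the date range, joined them with commas, and passed the result to sc.textFile().
import datetime
start = datetime.datetime.strptime(start_date, '%Y%m%d')
end = datetime.datetime.strptime(end_date, '%Y%m%d')
step = datetime.timedelta(days=1)
paths = []
# Walk the range one day at a time, both ends inclusive
while start <= end:
    paths.append(s3_input_path + start.strftime('%Y%m%d') + '/')
    start += step
# sc.textFile() accepts a comma-separated list of paths
str1 = ','.join(paths)
rdd = sc.textFile(str1)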
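The path-building step can also be wrapped in a small function, which makes the 1 week / 30 days / 90 days cases easy to try out. This is only a sketch: the function name `date_range_paths` and the bucket name `my-bucket` are made up for illustration, and the Spark call is left as a comment because it needs a live SparkContext.

```python
from datetime import datetime, timedelta


def date_range_paths(root, start_date, end_date, fmt="%Y%m%d"):
    """Return one directory path per day from start_date to end_date, inclusive."""
    start = datetime.strptime(start_date, fmt)
    end = datetime.strptime(end_date, fmt)
    n_days = (end - start).days
    # range() is empty when end_date is before start_date
    return [
        f"{root.rstrip('/')}/{(start + timedelta(days=i)).strftime(fmt)}/"
        for i in range(n_days + 1)
    ]


# "s3://my-bucket/root" is a hypothetical location used only for this example.
paths = date_range_paths("s3://my-bucket/root", "20180101", "20180103")
print(paths)
# With a live SparkContext `sc` (not created here), the paths could be read with:
# rdd = sc.textFile(",".join(paths))
```

Counting days with `range()` instead of mutating `start` in a loop keeps the inputs unchanged and returns an empty list for a reversed range.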

Related

Read Delta table from multiple folders

I'm working on a Databricks. I'm reading my delta table like this:
path = "/root/data/foo/year=2021/"
df = spark.read.format("delta").load(path)
However, within the year=2021 folder there are sub-folders for each day: day=01, day=02, day=03, etc.
How can I read only the folders for days 4, 5, and 6, for example?
Edit #1
From reading answers to other questions, it seems the proper way to achieve this is to apply a filter on the partition column.

The better way to read a partitioned Delta table seems to be to load the table root and filter on the partition columns:
df = spark.read.format("delta").load('/whatever/path')
df2 = df.filter("year = '2021' and month = '01' and day in ('04','05','06')")
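If the day list changes from run to run, the filter string can be generated rather than typed by hand. A minimal sketch, assuming the partition columns hold zero-padded strings as in the filter above; `partition_filter` is a hypothetical helper name, and the DataFrame call stays in a comment because it needs a Spark session.

```python
def partition_filter(year, month, days):
    """Build a SQL filter string over string-typed year/month/day partition columns."""
    if not days:
        raise ValueError("days must contain at least one day number")
    day_list = ",".join(f"'{int(d):02d}'" for d in days)
    return f"year = '{int(year):04d}' and month = '{int(month):02d}' and day in ({day_list})"


expr = partition_filter(2021, 1, [4, 5, 6])
print(expr)
# With a DataFrame `df` loaded from the table root (not created here):
# df2 = df.filter(expr)
```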

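List the day values as comma-separated alternatives enclosed in curly brackets:
path = "/root/data/foo/year=2021/day={04,05,06}/"
or use a character class for the last digit (square brackets match a single character, so commas and pipes do not belong inside them):
path = "/root/data/foo/year=2021/day=0[456]/"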

Remove .format("delta") from the read call in your source code, and use the helper function below to select the day folders:
def fileexists(filepath, FromDay, ToDay):
    # List the entries under the parent folder (Databricks utility)
    mylist = dbutils.fs.ls(filepath)
    maxcount = len(mylist) - ToDay + 1
    maxcount1 = maxcount - FromDay
    # item[0] is the path of each listed entry
    return [item[0] for item in mylist][maxcount1:maxcount]
where:
filepath = parent folder path, e.g. "/root/data/foo/year=2021/"
FromDay = start day folder
ToDay = end day folder
Note: change the function as per your requirement.

.load() accepts a list as well as a str. In your particular example, try this:
path = [f'/root/data/foo/year=2021/day={ea}' for ea in ['01', '02', '03']]
N.B. A glob pattern is acceptable, but a regex is not. I'm on Spark 3.2.1.
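For a run of day numbers, the list of folder paths can be generated instead of written out. A small sketch, with `day_paths` as a made-up helper name; the read call is commented out since it needs a Spark session, and whether a given reader format accepts a list of sub-folders is something to check for your setup.

```python
def day_paths(base, days):
    """Return one 'day=DD' sub-folder path per requested day number."""
    return [f"{base.rstrip('/')}/day={int(d):02d}" for d in days]


# Days 4, 5 and 6, as asked in the question.
paths = day_paths("/root/data/foo/year=2021/", range(4, 7))
print(paths)
# With a SparkSession `spark` (not created here), the list can be passed to load():
# df = spark.read.load(paths)
```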

Copy data from one sheet, add current date to each new row, and paste

I've done some reading, but my limited knowledge of scripts is making things difficult. I want to:
1. Copy a data range with a variable number of rows and known columns from one sheet titled 'Download'
2. Paste that data in a new sheet titled 'Trade History', starting from Column B
3. In the new sheet, add today's date, formatted DD/MM/YYYY, in a new column A for each record copied
The data in worksheet 'Download' uses IMPORTHTML.
The data copied from 'Download' is stored as a historical record, so each row needs a date in Column A.
I've managed to get 1 and 2 working, but can't work out the 3rd. See my current script below.
function recordHistory() {
  var ss = SpreadsheetApp.getActive(),
      sheet = ss.getSheetByName('Trade_History');
  var source = sheet.getRange("a2:E2000");
  ss.getSheetByName('Download').getRange('A2:E5000').copyTo(sheet.getRange(sheet.getLastRow()+1, 2))
}

You need to use Utilities.formatDate() to format today's date to DD/MM/YYYY.
Because you're copying one set of values, and then next to it (in column A), pasting another, I altered your code a bit as well.
function recordHistory() {
  var ss = SpreadsheetApp.getActive(),
      destinationSheet = ss.getSheetByName('Trade_History');
  var sourceData = ss.getSheetByName('Download').getDataRange().getValues();
  for (var i=0; i<sourceData.length; i++) {
    var row = sourceData[i];
    var today = Utilities.formatDate(new Date(), 'GMT+10', 'dd/MM/yyyy'); // AEST is GMT+10
    row.unshift(today); // Places the date at the beginning of the row array
  }
  destinationSheet.getRange(destinationSheet.getLastRow()+1, // Append to existing data
    1, // Start at Column A
    sourceData.length, // Number of new rows to be added (determined from source data)
    sourceData[0].length // Number of new columns to be added (determined from source data)
  ).setValues(sourceData); // Print the values
}
Start by getting the values of the source data. This returns an array that can be looped through to add today's date. Once the date has been added to all of the source data, determine the range boundaries for where it will be printed. Rather than simply selecting the start cell as could be done with the copyTo() method, the full dimensions now have to be defined. Finally, print the values to the defined range.

VB Scripting - comparing time values, one coming from a text file?

I'm attempting to pull some information out of text file that is updated after I query a piece of equipment. The text file contains lines such as shown here (abbreviated):
05-Nov-13 11:11:54.3496 ( -1 7020 10244) scpeng.exe:Automation Server...
05-Nov-13 14:10:54.3496 ( -1 7020 10244) scpeng.exe:Automation Server...
05-Nov-13 14:10:54.3496 ( -1 7020 10244) scpeng.exe:Automation Server...
05-Nov-13 14:10:56.3496 ( -1 7020 10244) scpeng.exe:CServer.cpp,....
The text file can contain up to several weeks of information. I have a subroutine that runs a few seconds after I query the equipment, which should allow time for the reply and the applicable line to appear in the text file. In the routine, I am trying to step through the lines, examining the date until I reach the date of the subroutine call, and then the time (or a time ~10 seconds prior to the current time), to arrive at the lines where the information could be found.
do
    msg = msgstream.ReadLine
    logdate = mid(msg,1,9)
    logday = Cdate(logdate)
loop while logday < date
do
    msg = msgstream.Readline
    logtime = mid(msg,12,8)
    'logtime = CDate(logtime) This mod is not working
loop while logtime < time
The date loop appears to work; however, the time is giving me problems. It does not error out, but I can't get it to run beyond one line of text. Can anyone suggest a fix or a better option? I have read that the built-in Date function can include the time, but I do not believe the version I'm using does. Also, the text file contains times in a 24-hour format, whereas I believe the Time function returns values in a 12-hour format, i.e. "12:43:27 PM ST".

You're making this way too complicated. Simply parse the whole date string into a datetime value:
refdate = Now
Do
msg = msgstream.ReadLine
logdate = CDate(Mid(msg, 1, 19))
Loop While logdate < refdate
You can extract date and time portions from the value later, e.g. like this:
WScript.Echo DateValue(logdate)
WScript.Echo TimeValue(logdate)
Also, Time returns the current (unformatted) system time. Whether it's displayed in 12 hour or 24 hour format depends on your system's region settings. However, you can always get the hour (0-23) by using the Hour function.
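For comparison, here is the same "parse the whole stamp into one datetime" idea sketched in Python rather than VBScript, in case the log is ever post-processed outside the script host. It assumes the stamp layout shown in the question (DD-Mon-YY, then HH:MM:SS with a fractional part) and English month abbreviations; `parse_log_timestamp` is a made-up name.

```python
from datetime import datetime


def parse_log_timestamp(line):
    """Parse the leading 'DD-Mon-YY HH:MM:SS.ffff' stamp of a log line."""
    date_part, time_part = line.split()[:2]  # tolerates repeated spaces
    return datetime.strptime(f"{date_part} {time_part}", "%d-%b-%y %H:%M:%S.%f")


line = "05-Nov-13 14:10:56.3496 ( -1 7020 10244) scpeng.exe:CServer.cpp,...."
stamp = parse_log_timestamp(line)
# Date and time portions can be pulled out afterwards, as with DateValue/TimeValue.
print(stamp.date(), stamp.time())
```

A single datetime value compares correctly across midnight, which separate date and time string comparisons do not.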

Parse each line with a regex to get the correct date and time parts. I prefer a regexp over string manipulation functions because it lets you separate the format from the code.
Reassemble the date from the two parts and see if the date is earlier than yesterday at this time.
Option Explicit
dim strTest, re, matches, myDatePart, myTimePart, logDate
' Test string
strTest = "08-Nov-13 14:10:56.3496 ( -1 7020 10244) scpeng.exe:CServer.cpp,...."
Set re = new regexp
' This pattern extracts two parts: the date as (dd-mmm-yy) and the time as (hh:mm:ss)
re.pattern = "(\d{2}-\w{3}-\d{2}) +(\d{2}:\d{2}:\d{2})"
Set matches = re.Execute(strTest)
' Get the first and second submatch to define the date and time
myDatePart = matches(0).submatches(0)
myTimePart = matches(0).submatches(1)
' datevalue and timevalue automatically convert to the Date type
logDate = datevalue(myDatePart) + timevalue(myTimePart)
' See if the date is earlier than yesterday at exactly this time
msgbox (logDate < (DateAdd("d", -1, now))) ' Returns True, because 08 Nov is earlier than yesterday.

IDL and MatLab getting strange values from NetCDF file

I have a NetCDF file, which contains data representing total precipitation across the globe over several months (so it's stored in a three dimensional array). I first ensured that the data was sensible, and the way it was formed, both in XConv and ncdump. All looks sensible - values vary from very small (~10^-10 - this makes sense, as this is model data, and effectively represents zero) to about 5x10^-3.
The problems start when I try to handle this data in IDL or MatLab. The arrays generated in these programs are full of huge negative numbers such as -4x10^4, with occasional huge positive numbers, such as 5000. Strangely, looking at a plot of the data in MatLab with respect to latitude and longitude (at a specific time), the pattern of rainfall looks sensible, but the values are just completely wrong.
In IDL, I'm reading the file in to write it to a text file so it can be handled by some software that takes very basic text files. Here's the code I'm using:
PRO nao_heaps
address = '/Users/levyadmin/Downloads/'
file_base = 'output'
ncid = ncdf_open(address + file_base + '.nc')
MONTHS=['january','february','march','april','may','june','july','august','september','october','november','december']
varid_field = ncdf_varid(ncid, "tp")
varid_lon = ncdf_varid(ncid, "longitude")
varid_lat = ncdf_varid(ncid, "latitude")
varid_time = ncdf_varid(ncid, "time")
ncdf_varget,ncid, varid_field, total_precip
ncdf_varget,ncid, varid_lat, lats
ncdf_varget,ncid, varid_lon, lons
ncdf_varget,ncid, varid_time, time
ncdf_close,ncid
lats = reform(lats)
lons = reform(lons)
time = reform(time)
total_precip = reform(total_precip)
total_precip = total_precip*1000. ;put in mm
noLats=(size(lats))(1)
noLons=(size(lons))(1)
noMonths=(size(time))(1)
; the data may not be an integer number of years (otherwise we could make this next loop cleaner)
av_precip=fltarr(noLons,noLats,12)
for month=0, 11 do begin
  year = 0
  while ( (year*12) + month lt noMonths ) do begin
    av_precip(*,*,month) = av_precip(*,*,month) + total_precip(*,*, (year*12)+month )
    year++
  endwhile
  av_precip(*,*,month) = av_precip(*,*,month)/year
endfor
fname = address + file_base + '.dat'
OPENW,1,fname
PRINTF,1,'longitude'
PRINTF,1,lons
PRINTF,1,'latitude'
PRINTF,1,lats
for month=0,11 do begin
  PRINTF,1,MONTHS(month)
  PRINTF,1,av_precip(*,*,month)
endfor
CLOSE,1
END
Anyone have any ideas why I'm getting such strange values in MatLab and IDL?!

AH! Found the answer. NetCDF files can store data in packed form, using an offset and a scale factor, to keep the size of the file to a minimum. To get the correct values, I simply need to:
total_precip = offset + (scale_factor * total_precip) ;put into correct range
At present I'm getting the scale factor and offset from ncdump and hard-coding them into my IDL program, but does anyone know how I can get them dynamically in my IDL code?
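The unpacking rule itself is simple enough to show in a few lines. The sketch below is in plain Python with made-up numbers, not the values from the file in question; by convention the two variable attributes are named `scale_factor` and `add_offset`, which is where the ncdump values come from. (In IDL, variable attributes can be read with NCDF_ATTGET.)

```python
def unpack(packed_values, scale_factor, add_offset):
    """Apply the NetCDF packing convention: unpacked = packed * scale_factor + add_offset."""
    return [value * scale_factor + add_offset for value in packed_values]


# Hypothetical packed integers and attribute values, for illustration only.
packed = [-32000, 0, 16000]
values = unpack(packed, scale_factor=0.5, add_offset=16000.0)
print(values)
```

Any unit conversion, such as the multiplication by 1000 to get mm, has to come after this step, because the offset is defined in the unpacked units.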

How to read data in chunks from notepad file in Matlab?

My data is in following format:
TABLE NUMBER 1
FILE: name_1
name_2
TIME name_3
day name_4
-0.01 0
364.99 35368.4
729.99 29307
1094.99 27309.5
1460.99 26058.8
1825.99 25100.4
2190.99 24364
2555.99 23757.1
2921.99 23240.8
3286.99 22785
3651.99 22376.8
4016.99 22006.1
4382.99 21664.7
4747.99 21348.3
5112.99 21052.5
5477.99 20774.1
5843.99 20509.9
6208.99 20259.7
6573.99 20021.3
6938.99 19793.5
7304.99 19576.6
TABLE NUMBER 2
FILE: name_1
name_5
TIME name_6
day name_7
-0.01 0
364.99 43110.4
729.99 37974.1
1094.99 36175.9
1460.99 34957.9
1825.99 34036.3
2190.99 33293.3
2555.99 32665.8
2921.99 32118.7
3286.99 31626.4
3651.99 31175.1
4016.99 30758
4382.99 30368.5
4747.99 30005.1
5112.99 29663
5477.99 29340
5843.99 29035.2
6208.99 28752.4
6573.99 28489.7
6938.99 28244.2
7304.99 28012.9
TABLE NUMBER 3
Until now I was splitting this data into separate files and reading the variables (time and name_i) from each file in the following way:
[TIME(:,j), name_i(:,j)]=textread('filename','%f\t%f','headerlines',5);
But now I am combining the data of those files into one file, as shown at the beginning. For example, I want to read and store the TIME data in vectors TIME1, TIME2, TIME3, TIME4, TIME5 for name_3, name_6, name_9, etc. respectively, and similarly for the others.

First of all, I suggest you don't use variable names such as TIME1,TIME2 etc, since that gets messy quickly. Instead, you can e.g. use a cell array with five rows (one for each well), and one or two columns. In the sample code below, wellData{2,1} is the time for the second well, wellData{2,2} is the corresponding Oil Rate SC - Yearly.
There might be more elegant ways to do the reading; here's something quick:
%# open the file
fid = fopen('Reportq.rwo');
%# read it into one big array, row by row
fileContents = textscan(fid,'%s','Delimiter','\n');
fileContents = fileContents{1};
fclose(fid); %# don't forget to close the file again
%# find rows containing TABLE NUMBER
wellStarts = strmatch('TABLE NUMBER',fileContents);
nWells = length(wellStarts);
%# loop through the wells and read the numeric data
wellData = cell(nWells,2);
%# add an end marker one past the last line so the last table keeps its final row
wellStarts = [wellStarts;length(fileContents)+1];
for w = 1:nWells
    %# read lines containing numbers (skip the 5 header lines of each table)
    tmp = fileContents(wellStarts(w)+5:wellStarts(w+1)-1);
    %# convert strings to numbers
    tmp = cellfun(@str2num,tmp,'uniformOutput',false);
    %# catenate array
    tmp = cat(1,tmp{:});
    %# assign output
    wellData(w,:) = mat2cell(tmp,size(tmp,1),[1,1]);
end
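The chunking logic above (start a new table at each TABLE NUMBER line, skip the header lines, collect the numeric rows) can also be sketched outside MATLAB. The Python version below is only an illustration on a shortened copy of the sample data; `read_tables` is a made-up name, and instead of counting header lines it skips any line that is not two numbers.

```python
def read_tables(lines):
    """Split lines into tables at each 'TABLE NUMBER' marker; keep numeric rows only."""
    tables = []
    for line in lines:
        if line.startswith("TABLE NUMBER"):
            tables.append(([], []))  # (time values, data values)
            continue
        fields = line.split()
        if not tables or len(fields) != 2:
            continue  # e.g. the single-field 'name_2' header line
        try:
            time_value, data_value = float(fields[0]), float(fields[1])
        except ValueError:
            continue  # two-field header line such as 'TIME name_3'
        tables[-1][0].append(time_value)
        tables[-1][1].append(data_value)
    return tables


sample = """TABLE NUMBER 1
FILE: name_1
name_2
TIME name_3
day name_4
-0.01 0
364.99 35368.4
TABLE NUMBER 2
FILE: name_1
name_5
TIME name_6
day name_7
-0.01 0
364.99 43110.4
""".splitlines()

tables = read_tables(sample)
# tables[1][0] is the time column of the second table, tables[1][1] its data column.
print(tables[1][0], tables[1][1])
```

As with the cell array, indexing by table number avoids separate variables such as TIME1, TIME2, and so on.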
