Querying last row of sorted column where value is less than specific amount from parquet file - pyspark

I have a large parquet file where the data in one of the columns is sorted. A very simplified example is below.
    X  Y
0   1  Red
1   5  Blue
2   8  Green
3  12  Purple
4  15  Blue
5  17  Purple
I want to query, as efficiently as possible in Python, the last value of column Y among the rows where X is less than some amount.
I am guaranteed that column X is sorted in ascending order.
As an example, given that X is less than 11, I would expect a Y value of "Green".
I have tried the following:
import pandas as pd

columns = ['Y']
filters = [('X', '<', 11)]
pd.read_parquet('my_data.parquet', filters=filters, columns=columns).tail(1)
The code above "works" but I am hoping optimizations are possible as this query is run 1M+ times per day.
The parquet file is too large to be read into memory.
I cannot put a starting value for column "X", as there is no guarantee of the size of the gap between values of X. For example, if I were to require "X > 10 and X < 11" I would not get a value for Y returned.
I was hoping that, since the data is sorted, there is a way to optimize this.
I am open to using DuckDB or some other library to do this.

I think that's what Polars' .search_sorted() is for.
You can also use .scan_parquet() to lazily load the data instead of .read_parquet().
You may need to use when/then to handle the case where search_sorted returns 0, so that subtracting 1 gives index -1 (use index 0 instead), or the case where there is no match at all, if that's possible.
import polars as pl
(
    pl.scan_parquet("search.parquet")
    .select(
        pl.col("Y")  # note: newer Polars releases call this .gather instead of .take
        .take(pl.col("X").search_sorted(11, side="left") - 1)
    )
    .collect()
)
shape: (1, 1)
┌───────┐
│ Y     │
│ ---   │
│ str   │
╞═══════╡
│ Green │
└───────┘
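The index arithmetic in the answer above can be illustrated on a small in-memory list with Python's standard bisect module. This is only a sketch of the "search_sorted(..., side="left") - 1" logic, not a replacement for the lazy Parquet scan; the helper name last_y_below is hypothetical, and returning None when nothing matches is an assumption about the desired behaviour.

```python
from bisect import bisect_left

def last_y_below(xs, ys, limit):
    """Return the Y paired with the largest X strictly below limit, or None."""
    # bisect_left gives the first position whose X is >= limit,
    # so the row just before it is the last one with X < limit.
    pos = bisect_left(xs, limit)
    if pos == 0:
        return None  # every X is >= limit: no matching row
    return ys[pos - 1]

xs = [1, 5, 8, 12, 15, 17]
ys = ["Red", "Blue", "Green", "Purple", "Blue", "Purple"]
print(last_y_below(xs, ys, 11))  # Green
```

Without the pos == 0 check, index -1 would silently pick the last element of a Python list, which is the kind of edge case the when/then remark is about.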

tm to tidytext conversion

I am trying to learn tidytext. I can follow the examples on the tidytext website so long as I use the packages (janeaustenr, e.g.). However, most of my data are text files in a corpus. I can reproduce the tm to tidytext conversion example for sentiment analysis (ap_sentiments) on the tidytext website. I am having trouble, however, understanding how the tidytext data are structured. For example, the Austen novels are stored by "book" in the janeaustenr package. For my tm data, however, what is the equivalent for calling the vector for book? Here is the specific example for my data:
cname <- file.path(".", "greencomments" , "all")
I can then use tidytext successfully after running the tm preprocessing:
practice <- tidy(tdm)
practice
partysentiments <- practice %>%
inner_join(get_sentiments("bing"), by = c(term = "word"))
partysentiments
# A tibble: 170 x 4
term document count sentiment
<chr> <chr> <dbl> <chr>
1 benefit 1 1.00 positive
2 best 1 2.00 positive
3 better 1 7.00 positive
4 cheaper 1 1.00 positive
5 clean 1 24.0 positive
7 clear 1 1.00 positive
8 concern 1 2.00 negative
9 cure 1 1.00 positive
10 destroy 1 3.00 negative
But, I can't reproduce the simple ggplots of word frequencies in tidytext. Since my data/corpus are not arranged with a column for "book" in the dataframe, the code (and therefore much of the tidytext functionality) won't work.
Here is an example of the issue. This works fine:
practice %>%
count(term, sort = TRUE)
# A tibble: 989 x 2
term n
<chr> <int>
1 activ 3
2 air 3
3 altern 3
But how do I arrange the tm corpus to match the structure of the books in the janeaustenr package? Is "document" the equivalent of "book"? I have text files in folders for the corpus. I have tried replacing this in the code, and it doesn't work. Maybe I need to rename this? Apologies in advance - I am not a programmer.

KDB/Q : What is Vector operation?

I am learning KDB+ and Q programming and read the following statement:
"select performs vector operations on column lists". What does "vector operation" mean here? Could somebody please explain with an example? Also, how is it faster than standard SQL?
A vector operation is an operation that takes one or more vectors and produces another vector. For example + in q is a vector operation:
q)a:1 2 3
q)b:10 20 30
q)a + b
11 22 33
If a and b are columns in a table, you can perform vector operations on them in a select statement. Continuing with the previous example, let's put a and b vectors in a table as columns:
q)([]a;b)
a b
----
1 10
2 20
3 30
Now,
q)select c:a + b from ([]a;b)
c
--
11
22
33
The select statement performed the same a+b vector addition, but took input and returned output as table columns.
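For readers more familiar with Python, the same idea can be sketched with plain lists. This is only an analogy under stated assumptions: the table is modelled as a dict of column lists and vector_add is a hypothetical helper, neither of which reflects how kdb+ is implemented.

```python
def vector_add(xs, ys):
    """Element-wise sum of two equal-length vectors."""
    if len(xs) != len(ys):
        raise ValueError("length mismatch")
    return [x + y for x, y in zip(xs, ys)]

# A table modelled as a mapping from column name to column vector.
table = {"a": [1, 2, 3], "b": [10, 20, 30]}

# Analogue of: select c:a + b from ([]a;b)
result = {"c": vector_add(table["a"], table["b"])}
print(result)  # {'c': [11, 22, 33]}
```

The operation consumes two whole columns and yields one whole column, rather than being applied record by record.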
How is it faster than standard SQL?
"Standard" SQL implementations typically store data row by row. In a table with many columns the first element of a column and its second element can be separated in memory by the data from other columns. Modern computers operate most efficiently when the data is stored contiguously. In kdb+, this is achieved by storing tables column by column.
A vector is a list of atoms of the same type. Some examples:
2 3 4 5 / int
"A fine, clear day" / char
`ibm`goog`aapl`ibm`msft / symbol
2017.01 2017.02 2017.03m / month
Kdb+ stores and handles vectors very efficiently. Q operators – not just +-*% but e.g. mcount, ratios, prds – are optimised for vectors.
These operators can be even more efficient when vectors have attributes, such as u (no repeated items) and s (items are in ascending order).
When table columns are vectors, those same efficiencies are available. These efficiencies are not available to standard SQL, which views tables as unordered sets of rows.
Being column-oriented, kdb+ can splay large tables, storing each column as a separate file, which reduces file I/O when selecting from large tables.
The sentence means when you refer to a specific column of a table with a column label, it is resolved into the whole column list, rather than each element of it, and any operations on it shall be understood as list operations.
q)show t: flip `a`b!(til 3;10*til 3)
a b
----
0 0
1 10
2 20
q)select x: count a, y: type b from t
x y
---
3 7
q)type t[`b]
7h
q)type first t[`b]
-7h
count a in the above q-sql is equivalent to count t[`a], which is count 0 1 2, i.e. 3. The same goes for type b; the positive return value 7 means b is a list rather than an atom: http://code.kx.com/q/ref/datatypes/#primitive-datatypes

Comparing, matching and combining columns of data

I need some help matching data and combining it. I currently have four columns of data in an Excel sheet, similar to the following:
Column: 1 2 3 4
U 3 A 0
W 6 B 0
R 1 C 0
T 9 D 0
... ... ... ...
Column two is a data value that corresponds to the letter in column one. What I need to do is compare column 3 with column 1 and whenever it matches copy the corresponding value from column 2 to column 4.
You might ask why I don't do this manually. I have a spreadsheet with around 100,000 rows, so this really isn't an option!
I do have access to MATLAB and have the information imported, if this would be more easily completed within that environment, please let me know.
As mentioned by @bla:
a formula similar to =IF(A1=C1,B1,0)
should serve (Excel).
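If the matching letter can sit on a different row than the one being filled, a row-by-row comparison is not enough and a lookup is needed. A minimal sketch of such a lookup in Python, assuming the columns have been exported as lists: the values of col1 and col2 come from the question, while the order of col3 and the helper name fill_column4 are hypothetical, and unmatched entries are assumed to stay 0.

```python
def fill_column4(col1, col2, col3, default=0):
    """For each entry of col3, copy the col2 value whose col1 key matches."""
    lookup = dict(zip(col1, col2))  # letter -> value, built once
    return [lookup.get(key, default) for key in col3]

col1 = ["U", "W", "R", "T"]
col2 = [3, 6, 1, 9]
col3 = ["T", "A", "U", "R"]  # hypothetical order, to show cross-row matches

print(fill_column4(col1, col2, col3))  # [9, 0, 3, 1]
```

Building the dictionary once keeps the work roughly linear in the number of rows; if a letter appears more than once in column 1, the last occurrence wins in this sketch.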

printing FFT at multiple points in tabulated form (matrix form)

I've simulated the voltage at multiple points in the time domain using C++. The output is printed in tabulated form (i.e. time in the first column and the voltage at each point in the following columns).
I'm new to MATLAB, but I am using it to FFT that file. I need a table with frequency in the first column, followed by the frequency-domain values for each point (in the same file).
I've tried everything I can think of to produce such a table, but the result always ends up as a single-column matrix (i.e. all the data sits in one column).
I need it to be in the form:
╔═══════════════════════════════════════════════╗
║ *f V1(f) V2(f) ..... Vn(f)* ║
╠═══════════════════════════════════════════════╣
║ ║
║ f1 .. .. .. ║
║ ║
║ f2 .. .. .. ║
║ ║
║ f3 .. .. .. ║
╚═══════════════════════════════════════════════╝
Also, if I'm able to create such a matrix, how can I get its transpose (to FFT it once more with respect to space)?
The code is as follows:
itr=importdata('filename.itr');
L=length(itr);
T=itr(L,1);
dt=itr(2,1);
t=(0:dt:T-dt);
fs=1/dt;
FR_length=L;
[M,N]=size(itr(1:end,1:end));
f=-FR_length/2:FR_length/2-1;
f=f.*(fs/FR_length);
for n=2:N
FR=fft(itr(:,n),FR_length);
end
Can anyone help me with this?
Many Thanks :)
I changed some of your code based on what I assume is the structure of input itr:
[L,N]=size(itr);
dt=itr(2,1);
fs=1/dt;
f=-L/2:L/2-1;
f=f*(fs/L);
FR =fftshift(fft(itr(:,2:end)));
disp([f' FR])
The last line will display your data as a table on the command window.
Note that I removed the loop, since MATLAB allows vectorized notation. I also added fftshift so that your frequencies and amplitudes line up correctly.
You can use save or fprintf to write to file, for instance as follows (you'll need to change the format string to match the number of data columns):
fid=fopen('test.dat','w')
fprintf(fid,'%f %f %f \n',[f ; real(FR)' ; imag(FR)'])
fclose(fid)
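For comparison, the table layout (frequency in the first column, one spectrum per following column) can be sketched in Python using only the standard library. This is an illustration under stated assumptions: a naive O(n^2) DFT stands in for fft, the input length is assumed even, and the names dft, fftshift, spectrum_table and the sample signals are all hypothetical.

```python
import cmath
import math

def dft(samples):
    """Naive O(n^2) discrete Fourier transform of one column."""
    n = len(samples)
    return [
        sum(samples[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
        for j in range(n)
    ]

def fftshift(values):
    """Move the zero-frequency bin to the middle (even-length input assumed)."""
    half = len(values) // 2
    return values[half:] + values[:half]

def spectrum_table(columns, dt):
    """Return rows [f, V1(f), ..., Vn(f)] for equally sampled columns."""
    n = len(columns[0])
    fs = 1.0 / dt
    freqs = [(i - n // 2) * fs / n for i in range(n)]
    spectra = [fftshift(dft(col)) for col in columns]
    return [[f] + [s[i] for s in spectra] for i, f in enumerate(freqs)]

# Hypothetical input: two voltage traces, 8 samples each, dt = 1/8 s.
n, dt = 8, 1.0 / 8
v1 = [math.cos(2 * math.pi * 1 * k * dt) for k in range(n)]  # 1 Hz cosine
v2 = [1.0] * n                                               # constant (DC)
table = spectrum_table([v1, v2], dt)
for row in table:
    # Print the frequency followed by the magnitude of each spectrum.
    print(row[0], *(round(abs(v), 6) for v in row[1:]))
```

With the table held as a list of rows, its transpose is list(zip(*table)), which gives one sequence per column.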

Multiple Columns, way to select closest to a value

I'm trying to analyze data sets that are obtained from CSV files. After the data is read into MATLAB, I am left with a variable containing only my data. The number of columns and rows changes between files. Is there a way to average each column and then create a variable for the column whose average is closest to a certain value? I would also like to select the columns directly before and after this middle column and create variables for them, as well as create a variable for the column with the lowest average. Currently, I am selecting the columns manually and creating variables for them that way.
For example:
I have this table of numbers. (I used the same number in each column for the sake of easy averaging in this example.)
1 2 3 4 5
1 2 3 4 5
1 2 3 4 5
1 2 3 4 5
1 2 3 4 5
Let's say I want the column whose average is closest to 3.2
That column would be column 3, whose average is 3. Then I would want the code to select the column before (column 2) and the column after (column 4), as well as the column with the lowest average (column 1).
First get the averages (I assume the data matrix is in variable X):
Xmns = mean(X);
Then to find the minimum, use "min":
[val,ind] = min(Xmns);
"val" holds the minimum value, "ind" the corresponding index in Xmns, which is the corresponding column.
To find the column mean closest to a particular value, again you can use min:
[val,ind] = min(abs(Xmns-key_val));
Now "ind" holds the column index with mean closest to "key_val". The next column is just "ind+1" and the previous "ind-1" - just be sure to check you are not beyond the ends of the matrix (i.e. ind may already be 1 or size(X,2)).
Also, given the column index "ind", to create a new variable with that column, you just use:
sc= X(:,ind);
and if you want to remove that column from X:
X(:,ind) = [];
and that is all.
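The same steps, including the bounds check on the neighbouring columns, can be sketched in Python with the standard library. This is an illustration rather than a translation of the MATLAB code: the helper names are hypothetical, the matrix is assumed to be a list of rows, and indices are 0-based here whereas MATLAB's are 1-based.

```python
from statistics import mean

def column_means(matrix):
    """Mean of each column of a row-major list of lists."""
    return [mean(col) for col in zip(*matrix)]

def closest_column(means, key_val):
    """Index of the column whose mean is closest to key_val."""
    return min(range(len(means)), key=lambda i: abs(means[i] - key_val))

X = [[1, 2, 3, 4, 5] for _ in range(5)]
means = column_means(X)
ind = closest_column(means, 3.2)  # 0-based index of the "middle" column
lowest = min(range(len(means)), key=means.__getitem__)

# Neighbours, guarded so we never step outside the matrix.
before = ind - 1 if ind > 0 else None
after = ind + 1 if ind < len(means) - 1 else None

middle_col = [row[ind] for row in X]  # extract the selected column
print(ind, before, after, lowest)  # 2 1 3 0
```

Index 2 here corresponds to column 3 in the question's 1-based numbering.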