I have this DataFrame below:
Ref °    | indice_1 | Indice_2 | rank_1 | rank_2 | echelon_from    | section_from   | echelon_to      | section_to
---------|----------|----------|--------|--------|-----------------|----------------|-----------------|------------
70574931 | 19       | 37.1     | 32     | 62     | ["10032,20032"] | ["11/12","13"] | ["40062"]       | ["14A"]
70574931 | 18       | 36       | 32     | 62     | ["20032"]       | ["13"]         | ["30062,40062"] | ["14,14A"]
I want to merge the lines that have the same Ref ° number: concatenate the echelon_from values, section_from values, echelon_to values and section_to values, dropping duplicates among their values, as in the example below, without touching the rest of the columns.
Ref °    | Indice_1 | Indice_2 | rank_1 | rank_2 | echelon_from    | section_from   | echelon_to      | section_to
---------|----------|----------|--------|--------|-----------------|----------------|-----------------|------------
70574931 | 19       | 37.1     | 32     | 62     | ["10032,20032"] | ["11/12","13"] | ["30062,40062"] | ["14,14A"]
70574931 | 18       | 36       | 32     | 62     | ["10032,20032"] | ["11/12","13"] | ["30062,40062"] | ["14,14A"]
Some column values in my original DataFrame are duplicated; I shouldn't touch them, and I should keep their values so that my DataFrame keeps the same number of lines.
Can someone please help me with how to do it?
Thank you!
There are multiple ways of doing this. One way is to explode all the given lists and collect them back again as a set.
from pyspark.sql import functions as F

lists_to_concat = ['echelon_from', 'section_from', 'echelon_to', 'section_to']
columns_not_to_concat = [c for c in df.columns if c not in lists_to_concat]

# Explode each list column into one row per element ...
for c in lists_to_concat:
    df = df.withColumn(c, F.explode(c))

# ... then group on the untouched columns and collect the values back as sets.
df = (
    df
    .groupBy(*columns_not_to_concat)
    .agg(
        *[F.collect_set(c).alias(c) for c in lists_to_concat]
    )
)
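Since the question also asks to keep the original number of lines, a variant is to aggregate the merged lists per key and join them back. This is only a sketch, assuming the key column is literally named `Ref °` and that the four columns hold arrays of strings:
from pyspark.sql import functions as F

lists_to_concat = ['echelon_from', 'section_from', 'echelon_to', 'section_to']

# Collect every array per Ref °, flatten them into one array and drop duplicates ...
merged = (
    df
    .groupBy('Ref °')
    .agg(*[
        F.array_distinct(F.flatten(F.collect_list(c))).alias(c)
        for c in lists_to_concat
    ])
)

# ... then join the merged arrays back so every original row (and every other
# column) is kept unchanged.
result = df.drop(*lists_to_concat).join(merged, on='Ref °', how='left')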
Another more elegant way is to use flatten().
from pyspark.sql import functions as F

lists_to_concat = ['echelon_from', 'section_from', 'echelon_to', 'section_to']

for c in lists_to_concat:
    df = df.withColumn(c, F.flatten(c))
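Note that flatten() collapses an array of arrays into a single array, so this variant assumes those columns already contain nested arrays. A minimal sketch of its behaviour, assuming a SparkSession named spark:
from pyspark.sql import functions as F

# flatten() turns an array of arrays into one flat array.
nested = spark.createDataFrame(
    [(70574931, [["30062"], ["40062"]])],
    ["Ref", "echelon_to"],
)
nested.select(F.flatten("echelon_to").alias("echelon_to")).show()
# echelon_to -> [30062, 40062]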
References:
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.flatten
I want to explode a string column based on a specific delimiter (| in my case).
I have a dataset like this:
+-----+--------------+
|Col_1|Col_2 |
+-----+--------------+
| 1 | aa|bb |
| 2 | cc |
| 3 | dd|ee |
| 4 | ff |
+-----+--------------+
I want an output like this:
+-----+--------------+
|Col_1|Col_2 |
+-----+--------------+
| 1 | aa |
| 1 | bb |
| 2 | cc |
| 3 | dd |
| 3 | ee |
| 4 | ff |
+-----+--------------+
Use the explode and split functions, and escape | with \\ because split takes a regular expression.
import org.apache.spark.sql.functions.{col, explode, split}

val df1 = df.select(col("Col_1"), explode(split(col("Col_2"), "\\|")).as("Col_2"))
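For reference, a rough PySpark equivalent of the same idea (assuming the same column names) would be:
from pyspark.sql import functions as F

# Split Col_2 on "|" (escaped, because split expects a regex) and explode into rows.
df1 = df.select(
    F.col("Col_1"),
    F.explode(F.split(F.col("Col_2"), r"\|")).alias("Col_2"),
)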
Let me explain my problem.
I have a table with this shape:
+----+------+-----------+-----------+
| ID | A | B | W |
+----+------+-----------+-----------+
| 1 | 534 | [a,b,c] | [4,6,2] |
| 2 | 534 | [a,b,d,e] | [6,3,6,2] |
| … | … | … | … |
| 54 | 667 | [a,b,r,e] | [4,6,2,3] |
| 55 | 8789 | [d] | [9] |
| 56 | 8789 | [a,b,d] | [7,2,3] |
| 57 | 8789 | [d,e,f,g] | [4,2,2,8] |
| … | … | … | … |
+----+------+-----------+-----------+
The query that I need to perform is the following: given an input with A, B and W values (e.g. A=8789; B=[a,b]; W=[3,2]), I need to find the "closest" line in the table that has the same value of A.
I've already defined my custom distance function.
The naive approach would be something like (given the input in the example):
SELECT *
FROM my_table T, dist_function(T.B, T.W, ARRAY['a','b'], ARRAY[3,2]) AS dist
WHERE T.A = 8789
ORDER BY dist ASC
LIMIT 7;
In my understanding this is a classical KNN problem for which I realized something already exists:
KNN-GiST
GiST & SP-GiST
SP-GiST example
I'm just not sure about which is the best index to consider.
Thanks.
I have this DataFrame below:
Ref ° | Indice_1 | Indice_2 | 1  | 2  | indice_from     | indice_from  | indice_to       | indice_to
------|----------|----------|----|----|-----------------|--------------|-----------------|-------------
1     | 19       | 37.1     | 32 | 62 | ["20031,10031"] | ["13,11/12"] | ["40062,30062"] | ["14A,14"]
2     | 19       | 37.1     | 44 | 12 | ["40062,30062"] | ["13,11/12"] | ["40062,30062"] | ["14A,14"]
3     | 19       | 37.1     | 22 | 64 | ["20031,10031"] | ["13,11/12"] | ["20031,10031"] | ["13,11/12"]
4     | 19       | 37.1     | 32 | 98 | ["20032,10032"] | ["13,11/12"] | ["40062,30062"] | ["13,11/12"]
I want to sort the values of the columns indice_from, indice_from, indice_to, and indice_to in ascending order, without touching the rest of the columns of my DataFrame.
Note that the two columns indice_from and indice_to sometimes contain a number + letter, like ["14,14A"].
In a case like ["14,14A"], I should always get the same structure: for example, if the first value is the number 15, the second value should be 15 + letter, and 15 < 15 + letter; if the first value is 9, the second value should be 9 + letter, and 9 < 9 + letter.
New DataFrame:
Ref ° | Indice_1 | Indice_2 | 1  | 2  | indice_from     | indice_from  | indice_to       | indice_to
------|----------|----------|----|----|-----------------|--------------|-----------------|-------------
1     | 19       | 37.1     | 32 | 62 | ["10031,20031"] | ["11/12,13"] | ["30062,40062"] | ["14,14A"]
2     | 19       | 37.1     | 44 | 12 | ["30062,40062"] | ["11/12,13"] | ["30062,40062"] | ["14,14A"]
3     | 19       | 37.1     | 22 | 64 | ["10031,20031"] | ["11/12,13"] | ["10031,20031"] | ["11/12,13"]
4     | 19       | 37.1     | 32 | 98 | ["10031,20031"] | ["11/12,13"] | ["30062,40062"] | ["11/12,13"]
Can someone please help me sort the values of the columns indice_from, indice_from, indice_to, and indice_to to obtain a new DataFrame like the second one above?
Thank you
If I understand it correctly then
from pyspark.sql import functions as F

columns_to_sort = ['indice_from', 'indice_from', 'indice_to', 'indice_to']

for c in columns_to_sort:
    df = (
        df
        .withColumn(
            c,
            F.sort_array(c)
        )
    )
will do the trick. Let me know if it doesn't
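For what it's worth, sort_array orders string elements lexicographically, so "11/12" sorts before "13" and "14" before "14A", which matches the expected output. A small check, assuming a SparkSession named spark and a hypothetical two-column frame:
from pyspark.sql import functions as F

# Hypothetical example just to show the ordering of mixed number/letter strings.
demo = spark.createDataFrame(
    [(["13", "11/12"], ["14A", "14"])],
    ["indice_from", "indice_to"],
)
demo.select(
    F.sort_array("indice_from").alias("indice_from"),
    F.sort_array("indice_to").alias("indice_to"),
).show(truncate=False)
# indice_from -> [11/12, 13], indice_to -> [14, 14A]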
For example, say I have a database table of transactions done over the counter, and I would like to search for any time that was defined as extremely busy (more than 10 transactions processed in the span of 10 minutes). How would I go about querying it? Could I aggregate based on time ranges and count the number of transaction IDs within those ranges?
Adding an example to clarify my input and desired output:
+----+--------------------+
| Id | register_timestamp |
+----+--------------------+
| 25 | 08:10:50 |
| 26 | 09:07:36 |
| 27 | 09:08:06 |
| 28 | 09:08:35 |
| 29 | 09:12:08 |
| 30 | 09:12:18 |
| 31 | 09:12:44 |
| 32 | 09:15:29 |
| 33 | 09:15:47 |
| 34 | 09:18:13 |
| 35 | 09:18:42 |
| 36 | 09:20:33 |
| 37 | 09:20:36 |
| 38 | 09:21:04 |
| 39 | 09:21:53 |
| 40 | 09:22:23 |
| 41 | 09:22:42 |
| 42 | 09:22:51 |
| 43 | 09:28:14 |
+----+--------------------+
Desired output would be something like:
+-------+----------+
| Count | Min |
+-------+----------+
| 1 | 08:10:50 |
| 3 | 09:07:36 |
| 7 | 09:12:08 |
| 8 | 09:20:33 |
+-------+----------+
How about this:
SELECT c, time
FROM (
    SELECT count(*) AS c, min(time) AS time
    FROM transactions
    GROUP BY floor(extract(epoch from time)/600)
) AS buckets
WHERE c > 10;
This will find all ten-minute intervals in which more than ten transactions occurred. It assumes that the table is called transactions and that it has a column called time where the timestamp is stored.
Thanks to redneb, I ended up with the following query:
SELECT count(*) AS c, min(register_timestamp) AS register_timestamp
FROM trak_participants_data
GROUP BY floor(extract(epoch from register_timestamp)/600)
ORDER BY register_timestamp;
It works closely enough for me to be able to tell which time chunks are the busiest for the counter.
I have a data set like this, which I am reading from a CSV file and converting into an RDD using Scala.
+--------+------+---------+
| recent | Freq | Monitor |
+--------+------+---------+
|      1 | 1234 |  199090 |
|      4 | 2553 |  198613 |
|      6 | 3232 |  199090 |
|      1 | 8823 |  498831 |
|      7 | 2902 |  890000 |
|      8 | 7991 |  081097 |
|      9 | 7391 |  432370 |
|     12 | 6138 |  864981 |
|      7 | 6812 |  749821 |
+--------+------+---------+
How can I sort the data on all columns?
Thanks
Suppose your input RDD/DataFrame is called df.
To sort recent in descending order, and Freq and Monitor both in ascending order, you can do:
import org.apache.spark.sql.functions._
val sorted = df.sort(desc("recent"), asc("Freq"), asc("Monitor"))
You can use df.orderBy(...) as well; it's an alias of sort().
csv.sortBy(r => (r.recent, r.freq)) or equivalent should do it