The input dataframe has 4 columns: id (str), group (str), elapsed time in days (int), and label (int).
inp = spark.createDataFrame([
['1', "A", 23, 2],
['1', "A", 45, 2],
['1', "A", 73, 2],
['1', "A", 84, 3],
['1', "A", 95, 3],
['1', "A", 101, 2],
['1', "A", 105, 2],
['1', "B", 20, 1],
['1', "B", 40, 1],
['1', "B", 60, 2],
['2', "A", 10, 4],
['2', "A", 20, 4],
['2', "A", 30, 4]
], schema=["id","grp","elap","lbl"])
For every (id, grp) I need the output frame to contain the records at the first occurrence of each new label.
out = spark.createDataFrame([
['1', "A", 23, 2],
['1', "A", 84, 3],
['1', "A", 101, 2],
['1', "B", 20, 1],
['1', "B", 60, 2],
['2', "A", 10, 4],
], schema=["id","grp","elap","lbl"])
The dataframe has a billion rows, and I am looking for an efficient way to do this.
Check whether the current label differs from the previous label, using a window partitioned by id and grp:
from pyspark.sql.window import Window
import pyspark.sql.functions as f
inp.withColumn('prevLbl', f.lag('lbl').over(Window.partitionBy('id', 'grp').orderBy('elap')))\
.filter(f.col('prevLbl').isNull() | (f.col('prevLbl') != f.col('lbl')))\
.drop('prevLbl').show()
+---+---+----+---+
| id|grp|elap|lbl|
+---+---+----+---+
| 1| A| 23| 2|
| 1| A| 84| 3|
| 1| A| 101| 2|
| 1| B| 20| 1|
| 1| B| 60| 2|
| 2| A| 10| 4|
+---+---+----+---+
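For a quick sanity check, you can compare the result to the expected frame (a sketch, assuming the out dataframe defined in the question; exceptAll is available from Spark 2.4):
result = inp.withColumn('prevLbl', f.lag('lbl').over(Window.partitionBy('id', 'grp').orderBy('elap')))\
    .filter(f.col('prevLbl').isNull() | (f.col('prevLbl') != f.col('lbl')))\
    .drop('prevLbl')
# Both set differences empty => result matches the expected output exactly
assert result.exceptAll(out).count() == 0 and out.exceptAll(result).count() == 0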
I am interested in accessing the data attribute's values as rows, with each item inside a row assigned to the corresponding column name mentioned in the sample at the bottom of this question.
{
"meta": {
"a": {
"b": []
}
},
"data" : [ [ "row-r9pv-p86t.ifsp", "00000000-0000-0000-0838-60C2FFCC43AE", 0, 1574264158, null, 1574264158, null, "{ }", "2007", "ZOEY", "KINGS", "F", "11" ]
, [ "row-7v2v~88z5-44se", "00000000-0000-0000-C8FC-DDD3F9A72DFF", 0, 1574264158, null, 1574264158, null, "{ }", "2007", "ZOEY", "SUFFOLK", "F", "6" ]
, [ "row-hzc9-4kvv~mbc9", "00000000-0000-0000-562E-D9A0792557FC", 0, 1574264158, null, 1574264158, null, "{ }", "2007", "ZOEY", "MONROE", "F", "6" ]
, [ "row-3473_8cwy~3vez", "00000000-0000-0000-B19D-7B88FF2FB6A0", 0, 1574264158, null, 1574264158, null, "{ }", "2007", "ZOEY", "ERIE", "F", "9" ]
, [ "row-tyuh.nmy9.r2n3", "00000000-0000-0000-7D66-E7EC8F12BB8D", 0, 1574264158, null, 1574264158, null, "{ }", "2007", "ZOE", "ULSTER", "F", "5" ]
, [ "row-ct48~ui69-2zsn", "00000000-0000-0000-7ECC-F350540A8F92", 0, 1574264158, null, 1574264158, null, "{ }", "2007", "ZOE", "WESTCHESTER", "F", "24" ]
, [ "row-gdva~4v8k-vuwy", "00000000-0000-0000-30FB-CB5E36017AD5", 0, 1574264158, null, 1574264158, null, "{ }", "2007", "ZOE", "BRONX", "F", "13" ]
, [ "row-gzu3~a7hk~bqym", "00000000-0000-0000-E380-AAAB1FA5C7A7", 0, 1574264158, null, 1574264158, null, "{ }", "2007", "ZOE", "NEW YORK", "F", "55" ]
, [ "row-ekbw_tb7c.yvgp", "00000000-0000-0000-A7FF-8A4260B3A505", 0, 1574264158, null, 1574264158, null, "{ }", "2007", "ZOE", "NASSAU", "F", "15" ]
, [ "row-zk7s-r2ma_t8mk", "00000000-0000-0000-3F7C-4DECA15E0F5B", 0, 1574264158, null, 1574264158, null, "{ }", "2007", "ZOE", "ERIE", "F", "6" ]
, [ "row-ieja_864x~w2ki", "00000000-0000-0000-854E-D29D5B4D5636", 0, 1574264158, null, 1574264158, null, "{ }", "2007", "ZOE", "SUFFOLK", "F", "14" ]
, [ "row-8fp4.rjtj.h46h", "00000000-0000-0000-C177-43F52BFECC07", 0, 1574264158, null, 1574264158, null, "{ }", "2007", "ZOE", "KINGS", "F", "34" ]
]
}
I tried the following, but it only gives null values for each row. Can you help me get each item in the row into a specific field, so that each value is assigned to the attribute named to the right of it in the sample below?
val schema = new StructType()
.add(
"data", new ArrayType(new StructType(), false), false
)
val nestDF = spark.read.schema(schema).json("dbfs:/tmp/rows.json")
Here's the expected structure :
/* [
"row-r9pv-p86t.ifsp" <-- sid
"00000000-0000-0000-0838-60C2FFCC43AE" <-- id
0 <-- position
1574264158 <-- created_at
null <-- created_meta
1574264158 <-- updated_at
null <-- updated_meta
"{ }" <-- meta
"2007" <-- year of birth
"ZOEY" <-- child's first name
"KINGS" <-- county
"F" <-- gender
"11" <-- count
]
*/
Atharva, you can try this piece of code. I didn't cast the attributes to the expected datatypes, but that should be easy now :)
import sparkSession.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.sql.Row
val df = sparkSession.read.option("multiLine", true).json("src/main/resources/json.json")
val schema = StructType(Seq("sid","id","position","created_at","created_meta","updated_at","updated_meta","meta","yearOfBirth","childsFirstName","county","gender","count").map(c => StructField(c, StringType)))
val toStruct = udf({seq: Seq[String] => Row.fromSeq(seq)}, schema)
val newDF = df.select(explode($"data").as("dataRow"))
.select(toStruct($"dataRow").as("struct"))
.select("struct.*")
newDF.printSchema()
root
|-- sid: string (nullable = true)
|-- id: string (nullable = true)
|-- position: string (nullable = true)
|-- created_at: string (nullable = true)
|-- created_meta: string (nullable = true)
|-- updated_at: string (nullable = true)
|-- updated_meta: string (nullable = true)
|-- meta: string (nullable = true)
|-- yearOfBirth: string (nullable = true)
|-- childsFirstName: string (nullable = true)
|-- county: string (nullable = true)
|-- gender: string (nullable = true)
|-- count: string (nullable = true)
newDF.show(false)
+------------------+------------------------------------+--------+----------+------------+----------+------------+----+-----------+---------------+-----------+------+-----+
|sid |id |position|created_at|created_meta|updated_at|updated_meta|meta|yearOfBirth|childsFirstName|county |gender|count|
+------------------+------------------------------------+--------+----------+------------+----------+------------+----+-----------+---------------+-----------+------+-----+
|row-r9pv-p86t.ifsp|00000000-0000-0000-0838-60C2FFCC43AE|0 |1574264158|null |1574264158|null |{ } |2007 |ZOEY |KINGS |F |11 |
|row-7v2v~88z5-44se|00000000-0000-0000-C8FC-DDD3F9A72DFF|0 |1574264158|null |1574264158|null |{ } |2007 |ZOEY |SUFFOLK |F |6 |
|row-hzc9-4kvv~mbc9|00000000-0000-0000-562E-D9A0792557FC|0 |1574264158|null |1574264158|null |{ } |2007 |ZOEY |MONROE |F |6 |
|row-3473_8cwy~3vez|00000000-0000-0000-B19D-7B88FF2FB6A0|0 |1574264158|null |1574264158|null |{ } |2007 |ZOEY |ERIE |F |9 |
|row-tyuh.nmy9.r2n3|00000000-0000-0000-7D66-E7EC8F12BB8D|0 |1574264158|null |1574264158|null |{ } |2007 |ZOE |ULSTER |F |5 |
|row-ct48~ui69-2zsn|00000000-0000-0000-7ECC-F350540A8F92|0 |1574264158|null |1574264158|null |{ } |2007 |ZOE |WESTCHESTER|F |24 |
|row-gdva~4v8k-vuwy|00000000-0000-0000-30FB-CB5E36017AD5|0 |1574264158|null |1574264158|null |{ } |2007 |ZOE |BRONX |F |13 |
|row-gzu3~a7hk~bqym|00000000-0000-0000-E380-AAAB1FA5C7A7|0 |1574264158|null |1574264158|null |{ } |2007 |ZOE |NEW YORK |F |55 |
|row-ekbw_tb7c.yvgp|00000000-0000-0000-A7FF-8A4260B3A505|0 |1574264158|null |1574264158|null |{ } |2007 |ZOE |NASSAU |F |15 |
|row-zk7s-r2ma_t8mk|00000000-0000-0000-3F7C-4DECA15E0F5B|0 |1574264158|null |1574264158|null |{ } |2007 |ZOE |ERIE |F |6 |
|row-ieja_864x~w2ki|00000000-0000-0000-854E-D29D5B4D5636|0 |1574264158|null |1574264158|null |{ } |2007 |ZOE |SUFFOLK |F |14 |
|row-8fp4.rjtj.h46h|00000000-0000-0000-C177-43F52BFECC07|0 |1574264158|null |1574264158|null |{ } |2007 |ZOE |KINGS |F |34 |
+------------------+------------------------------------+--------+----------+------------+----------+------------+----+-----------+---------------+-----------+------+-----+
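For reference, a rough PySpark equivalent of the same idea, indexing the array positionally instead of going through a UDF (a sketch, assuming a path like dbfs:/tmp/rows.json from the question, and that Spark infers the mixed-type data column as array<array<string>>):
from pyspark.sql import functions as F

cols = ["sid", "id", "position", "created_at", "created_meta", "updated_at",
        "updated_meta", "meta", "yearOfBirth", "childsFirstName", "county",
        "gender", "count"]

df = spark.read.option("multiLine", True).json("dbfs:/tmp/rows.json")
newDF = df.select(F.explode("data").alias("r")) \
          .select(*[F.col("r").getItem(i).alias(c) for i, c in enumerate(cols)])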
I have two dataframes.
df1 :
+---------------+-------------------+-----+------------------------+------------------------+---------+
|id |dt |speed|stats |lag_stat |lag_speed|
+---------------+-------------------+-----+------------------------+------------------------+---------+
|358899055773504|2018-07-31 18:38:36|0 |[9, -1, -1, 13, 0, 1, 0]|null |null |
|358899055773504|2018-07-31 18:58:34|0 |[9, 0, -1, 22, 0, 1, 0] |[9, -1, -1, 13, 0, 1, 0]|0 |
|358899055773505|2018-07-31 18:54:23|4 |[9, 0, 0, 22, 1, 1, 1] |null |null |
+---------------+-------------------+-----+------------------------+------------------------+---------+
df2 :
+---------------+-------------------+-----+------------------------+
|id |dt |speed|stats |
+---------------+-------------------+-----+------------------------+
|358899055773504|2018-07-31 18:38:34|0 |[9, -1, -1, 13, 0, 1, 0]|
|358899055773505|2018-07-31 18:48:23|4 |[8, -1, 0, 22, 1, 1, 1] |
+---------------+-------------------+-----+------------------------+
I want to replace the null values in the lag_stat and lag_speed columns of df1 with the stats and speed values from df2 for the same id.
Desired output looks like this:
+---------------+-------------------+-----+------------------------+------------------------+---------+
|             id|                 dt|speed|                   stats|                lag_stat|lag_speed|
+---------------+-------------------+-----+------------------------+------------------------+---------+
|358899055773504|2018-07-31 18:38:36|    0|[9, -1, -1, 13, 0, 1, 0]|[9, -1, -1, 13, 0, 1, 0]|        0|
|358899055773504|2018-07-31 18:58:34|    0| [9, 0, -1, 22, 0, 1, 0]|[9, -1, -1, 13, 0, 1, 0]|        0|
|358899055773505|2018-07-31 18:54:23|    4|  [9, 0, 0, 22, 1, 1, 1]| [8, -1, 0, 22, 1, 1, 1]|        4|
+---------------+-------------------+-----+------------------------+------------------------+---------+
One possible way is to join the DataFrames and then apply when functions on those columns.
For example, this:
val output = df1.join(df2, df1.col("id")===df2.col("id"))
.select(df1.col("id"),
df1.col("dt"),
df1.col("speed"),
df1.col("stats"),
when(df1.col("lag_stat").isNull,df2.col("stats")).otherwise(df1.col("lag_stat")).alias("lag_stats"),
when(df1.col("lag_speed").isNull,df2.col("speed")).otherwise(df1.col("lag_speed")).alias("lag_speed")
)
will give you the expected output:
+---------------+-------------------+-----+------------------+------------------+---------+
|             id|                 dt|speed|             stats|         lag_stats|lag_speed|
+---------------+-------------------+-----+------------------+------------------+---------+
|358899055773504|2018-07-31 18:38:36|    0|[9,-1,-1,13,0,1,0]|[9,-1,-1,13,0,1,0]|        0|
|358899055773504|2018-07-31 18:58:34|    0| [9,0,-1,22,0,1,0]|[9,-1,-1,13,0,1,0]|        0|
|358899055773505|2018-07-31 18:54:23|    4|  [9,0,0,22,1,1,1]| [8,-1,0,22,1,1,1]|        4|
+---------------+-------------------+-----+------------------+------------------+---------+
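The same join can be written a bit more compactly with coalesce, which returns the first non-null argument. A PySpark sketch of the idea, assuming df1 and df2 as shown above:
from pyspark.sql import functions as F

# Rename df2's columns so they don't collide with df1's after the join
df2r = df2.select(F.col("id"),
                  F.col("speed").alias("speed2"),
                  F.col("stats").alias("stats2"))
output = df1.join(df2r, "id") \
    .select("id", "dt", "speed", "stats",
            F.coalesce("lag_stat", "stats2").alias("lag_stats"),
            F.coalesce("lag_speed", "speed2").alias("lag_speed"))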
I have a sparse matrix and want to divide the region into 4 parts, splitting x and y into 2 equidistant pieces each, and then calculate the sum of the corresponding values.
In the example below, the x and y coordinates each span [0,16], so the region is a square. The sparse matrix inside this square is symmetric. I would like to divide the region into smaller squares and sum up the sparse values in each. Region 0:8,0:8 has 2 elements, both equal to 8 (x(3,2) = x(2,3) = 8), so the sum is 16.
The summation of the 1st region should give 16, the 2nd and 3rd give 36, and the 4th gives 26.
x = sparse(16,16);
x (3,2) = 8;
x (10,2) = 8;
x (13,2) = 8;
x (14,2) = 4;
x (15,2) = 4;
x (2,3) = 8;
x (10,3) = 4;
x (13,3) = 4;
x (14,3) = 2;
x (15,3) = 2;
x (2,10) = 8;
x (3,10) = 4;
x (13,10) = 4;
x (14,10) = 2;
x (15,10) = 2;
x (2,13) = 8;
x (3,13) = 4;
x (10,13) = 4;
x (14,13) = 2;
x (15,13) = 2;
x (2,14) = 4;
x (3,14) = 2;
x (10,14) = 2;
x (13,14) = 2;
x (15,14) = 1;
x (2,15) = 4;
x (3,15) = 2;
x (10,15) = 2;
x (13,15) = 2;
x (14,15) = 1;
I would appreciate a shorter way, rather than writing a line for each sub-square. For, say, 6000 sub-squares, should one write 6000 lines?
Let's define the input in a more convenient way:
X = sparse([...
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
0, 0, 8, 0, 0, 0, 0, 0, 0, 8, 0, 0, 8, 4, 4
0, 8, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 4, 2, 2
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
0, 8, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 2, 2
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
0, 8, 4, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 2, 2
0, 4, 2, 0, 0, 0, 0, 0, 0, 2, 0, 0, 2, 0, 1
0, 4, 2, 0, 0, 0, 0, 0, 0, 2, 0, 0, 2, 1, 0]);
For convenience, we first make the array dimensions even. We don't use padarray() for this, because it would make the sparse matrix full!
sz = size(X);
newX = sparse(sz(1)+1,sz(2)+1);
padTopLeft = true; % < chosen arbitrarily
if padTopLeft
newX(2:end,2:end) = X;
else % bottom right
newX(1:sz(1),1:sz(2)) = X;
end
%% Preallocate results:
sums = zeros(2,2,2);
Method #1: accumarray
We create a mask of the form:
1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2
1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2
1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2
1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2
1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2
1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2
1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2
1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2
3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4
3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4
3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4
3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4
3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4
3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4
3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4
3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4
and then use it to sum the appropriate elements of newX:
sums(:,:,1) = reshape(...
accumarray(reshape(repelem([1,2;3,4], ceil(sz(1)/2), ceil(sz(2)/2)),[],1),...
reshape(newX, [],1),...
[], @sum), 2, 2);
Method #2: blockproc (requires the Image Processing Toolbox)
sums(:,:,2) = blockproc(full(newX), ceil(sz/2), @(x)sum(x.data(:)));
Several notes:
I also tried histcounts2, which is very short, but it only tells you the number of values in each quadrant, not their sum:
[r,c] = find(newX);
histcounts2(r,c,[2,2])
I might've overcomplicated the accumarray solution.
Although your question is not very precise and you didn't make much effort to find a solution, here is what you are asking:
clear;clc;close;
Matrix=rand(20,20);
Acc=zeros(1,4);
Acc(1)=sum(sum( Matrix(1:size(Matrix,1)/2,1:size(Matrix,2)/2) ));
Acc(2)=sum(sum( Matrix((size(Matrix,1)/2)+1:end,1:size(Matrix,2)/2)));
Acc(3)=sum(sum( Matrix(1:size(Matrix,1)/2,((size(Matrix,2)/2)+1):end )));
Acc(4)=sum(sum( Matrix((size(Matrix,1)/2)+1:end,((size(Matrix,2)/2)+1):end)));
% Verification
sum(sum(Matrix)) % <- is the same with
sum(Acc) % <- this
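The block sums can also be computed without any explicit indexing, via a reshape trick. A NumPy sketch of the same idea (assuming an even-sized dense matrix, like the 20x20 example above):
import numpy as np

Matrix = np.random.rand(20, 20)
h, w = Matrix.shape
# Split the rows and columns into 2 blocks each, then sum within each block
Acc = Matrix.reshape(2, h // 2, 2, w // 2).sum(axis=(1, 3))
# Verification: the four block sums add up to the total sum
assert np.isclose(Acc.sum(), Matrix.sum())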
You can define any rectangle within the matrix by defining the 4 corners of it. Then use a for loop to process all rectangles.
regions = [
1 8 1 8
9 16 1 8
1 8 9 16
9 16 9 16
];
regionsum = zeros(size(regions,1),1);
for rr = 1:size(regions,1)
submat = x(regions(rr,1):regions(rr,2),regions(rr,3):regions(rr,4));
regionsum(rr) = sum(submat(:));
end
>> regionsum
regionsum =
16
36
36
26
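For comparison, the same rectangle loop as a Python/SciPy sketch (assuming x is a scipy.sparse.csr_matrix version of the 16x16 matrix from the question; the 1-based inclusive corners are converted to 0-based slices):
regions = [(1, 8, 1, 8),
           (9, 16, 1, 8),
           (1, 8, 9, 16),
           (9, 16, 9, 16)]
regionsum = [x[r1 - 1:r2, c1 - 1:c2].sum() for r1, r2, c1, c2 in regions]
# -> [16, 36, 36, 26]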
If you mean that you want to divide the square matrix into 2^N-by-2^N sub-squares of the same size (2^(2N) regions in total), then you can build regions with a for loop.
N = 1; % 2^N-by-2^N sub-squares
L = size(x,1);
dL = L/(2^N);
assert(dL==int32(dL),'Too many divisions')
segments = zeros(2^N,2);
for nn = 1:2^N
segments(nn,:) = [1,dL]+dL*(nn-1);
end
regions = zeros(2^(2*N),4);
for ss = 1:2^N
for tt = 1:2^N
regions((2^N)*(ss-1) + tt,:) = [segments(ss,:),segments(tt,:)];
end
end
Example output when dividing into 16 (N=2) square submatrices:
>> regions
regions =
1 4 1 4
1 4 5 8
1 4 9 12
1 4 13 16
5 8 1 4
5 8 5 8
5 8 9 12
5 8 13 16
9 12 1 4
9 12 5 8
9 12 9 12
9 12 13 16
13 16 1 4
13 16 5 8
13 16 9 12
13 16 13 16
>> regionsum
regionsum =
16
0
12
24
0
0
0
0
12
0
0
8
24
0
8
10
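The generalized version also has a compact counterpart that avoids building the region list altogether: bucket each nonzero entry by its block index and accumulate. A Python/SciPy sketch, assuming a square matrix whose side is divisible by 2^N:
import numpy as np
import scipy.sparse as sp

def block_sums(X, N):
    # Sum an L-by-L sparse matrix over a 2^N-by-2^N grid of equal blocks
    L = X.shape[0]
    d = L // 2 ** N
    assert d * 2 ** N == L, 'Too many divisions'
    r, c, v = sp.find(X)                 # row, column, value of each nonzero
    out = np.zeros((2 ** N, 2 ** N))
    np.add.at(out, (r // d, c // d), v)  # accumulate values per block
    return out
With N = 1 this reproduces the four quadrant sums (16, 36, 36, 26) from above.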