Why is hive.merge.mapredfiles=true not working?

I use Hive 1.2.1, and the default block size is set to 128 MB.
The files in HDFS are saved in ORC format.
There are too many small files, so I set the hive.merge options below, but they don't seem to work. I would appreciate it if you could tell me the reason.
1. Setting Properties
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict ;
set hive.exec.max.dynamic.partitions=1000 ;
set hive.exec.max.dynamic.partitions.pernode=1000;
set hive.merge.mapredfiles=true;
set hive.merge.mapfiles=true;
set hive.merge.smallfiles.avgsize=268435456;
set hive.merge.size.per.task=536870912 ;
set hive.exec.compress.output=true;
set hive.exec.compress.intermediate=true;
set hive.intermediate.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
set hive.intermediate.compression.type=BLOCK;
set hive.execution.engine=mr;
2. Table Create DDL
CREATE EXTERNAL TABLE `temp.test_dst`(
  `colA` string COMMENT ' ',
  `colB` string COMMENT ' ',
  `colC` string COMMENT ' ')
PARTITIONED BY (
  `date` string COMMENT ' ')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
TBLPROPERTIES (
  'orc.compress'='ZLIB');
3. Insert Data
Insert overwrite table temp.test_dst partition(date)
select colA, colB, colC, date from temp.src_table
where date = '2021-11-01'
cluster by colA, colB;
4. Result
319.0 M hdfs://user/hive/warehouse/temp.db/test_dst/date=2021-11-01/000014_0
254.2 M hdfs://user/hive/warehouse/temp.db/test_dst/date=2021-11-01/000034_0
253.0 M hdfs://user/hive/warehouse/temp.db/test_dst/date=2021-11-01/000053_0
252.9 M hdfs://user/hive/warehouse/temp.db/test_dst/date=2021-11-01/000054_0
252.9 M hdfs://user/hive/warehouse/temp.db/test_dst/date=2021-11-01/000055_0
252.9 M hdfs://user/hive/warehouse/temp.db/test_dst/date=2021-11-01/000056_0
170.1 M hdfs://user/hive/warehouse/temp.db/test_dst/date=2021-11-01/000088_0
157.7 M hdfs://user/hive/warehouse/temp.db/test_dst/date=2021-11-01/000093_0
148.1 M hdfs://user/hive/warehouse/temp.db/test_dst/date=2021-11-01/000099_0
130.7 M hdfs://user/hive/warehouse/temp.db/test_dst/date=2021-11-01/000104_0
85.9 M hdfs://user/hive/warehouse/temp.db/test_dst/date=2021-11-01/000126_0
83.2 M hdfs://user/hive/warehouse/temp.db/test_dst/date=2021-11-01/000128_0
49.0 M hdfs://user/hive/warehouse/temp.db/test_dst/date=2021-11-01/000153_0
48.3 M hdfs://user/hive/warehouse/temp.db/test_dst/date=2021-11-01/000155_0
34.6 M hdfs://user/hive/warehouse/temp.db/test_dst/date=2021-11-01/000166_0
26.5 M hdfs://user/hive/warehouse/temp.db/test_dst/date=2021-11-01/000177_0
26.0 M hdfs://user/hive/warehouse/temp.db/test_dst/date=2021-11-01/000178_0
22.8 M hdfs://user/hive/warehouse/temp.db/test_dst/date=2021-11-01/000181_0
18.5 M hdfs://user/hive/warehouse/temp.db/test_dst/date=2021-11-01/000185_0
17.9 M hdfs://user/hive/warehouse/temp.db/test_dst/date=2021-11-01/000187_0
13.9 M hdfs://user/hive/warehouse/temp.db/test_dst/date=2021-11-01/000190_0
10.5 M hdfs://user/hive/warehouse/temp.db/test_dst/date=2021-11-01/000191_0
9.3 M hdfs://user/hive/warehouse/temp.db/test_dst/date=2021-11-01/000194_0
5.5 M hdfs://user/hive/warehouse/temp.db/test_dst/date=2021-11-01/000200_0
5.3 M hdfs://user/hive/warehouse/temp.db/test_dst/date=2021-11-01/000202_0
4.9 M hdfs://user/hive/warehouse/temp.db/test_dst/date=2021-11-01/000203_0
4.1 M hdfs://user/hive/warehouse/temp.db/test_dst/date=2021-11-01/000205_0
3.8 M hdfs://user/hive/warehouse/temp.db/test_dst/date=2021-11-01/000206_0
2.3 M hdfs://user/hive/warehouse/temp.db/test_dst/date=2021-11-01/000210_0
1.9 M hdfs://user/hive/warehouse/temp.db/test_dst/date=2021-11-01/000211_0
1.9 M hdfs://user/hive/warehouse/temp.db/test_dst/date=2021-11-01/000212_0
1.8 M hdfs://user/hive/warehouse/temp.db/test_dst/date=2021-11-01/000213_0
1.5 M hdfs://user/hive/warehouse/temp.db/test_dst/date=2021-11-01/000214_0
605.2 K hdfs://user/hive/warehouse/temp.db/test_dst/date=2021-11-01/000216_0
568.5 K hdfs://user/hive/warehouse/temp.db/test_dst/date=2021-11-01/000217_0
461.5 K hdfs://user/hive/warehouse/temp.db/test_dst/date=2021-11-01/000218_0
459.6 K hdfs://user/hive/warehouse/temp.db/test_dst/date=2021-11-01/000219_0
237.2 K hdfs://user/hive/warehouse/temp.db/test_dst/date=2021-11-01/000220_0
200.0 K hdfs://user/hive/warehouse/temp.db/test_dst/date=2021-11-01/000221_0
162.2 K hdfs://user/hive/warehouse/temp.db/test_dst/date=2021-11-01/000222_0
92.3 K hdfs://user/hive/warehouse/temp.db/test_dst/date=2021-11-01/000223_0
74.1 K hdfs://user/hive/warehouse/temp.db/test_dst/date=2021-11-01/000224_0

Related

Apply groupBy in a UDF from an increasing function in PySpark

I have the following function:
import copy

rn = 0

def check_vals(x, y):
    global rn
    if (y != None) & (int(x)+1) == int(y):
        return rn + 1
    else:
        # Using copy to deepcopy and not forming a shallow one.
        res = copy.copy(rn)
        # Increment so that the next value will start from +1
        rn += 1
        # Return the same value as we want to group using this
        return res + 1
    return 0

#pandas_udf(IntegerType(), functionType=PandasUDFType.GROUPED_AGG)
def check_final(x, y):
    return lambda x, y: check_vals(x, y)
I need to apply this function to the following df:
index  initial_range  final_range
1      1              299
1      300            499
1      500            699
1      800            1000
2      10             99
2      100            199
So I need the following output:
index  min_val  max_val
1      1        699
1      800      1000
2      10       199
Note that, within each index group, there are new ranges given by min(initial_range) and max(final_range), which extend until the sequence of ranges is broken; that is what the grouping should produce.
I tried:
w = Window.partitionBy('index').orderBy(sf.col('initial_range'))
df = (df.withColumn('nextRange', sf.lead('initial_range').over(w))
        .fillna(0, subset=['nextRange'])
        .groupBy('index')
        .agg(check_final("final_range", "nextRange").alias('check_1'))
        .withColumn('min_val', sf.min("initial_range").over(Window.partitionBy("check_1")))
        .withColumn('max_val', sf.max("final_range").over(Window.partitionBy("check_1")))
     )
But it didn't work.
Can anyone help me?
I think the pure Spark SQL API can solve this, and it doesn't need any UDF, which might otherwise hurt your Spark performance. Also, I think two window functions are enough to solve this question:
df.withColumn(
    'next_row_initial_diff',
    func.col('initial_range') - func.lag('final_range', 1).over(Window.partitionBy('index').orderBy('initial_range'))
).withColumn(
    'group',
    func.sum(
        func.when(func.col('next_row_initial_diff').isNull() | (func.col('next_row_initial_diff') == 1), func.lit(0))
        .otherwise(func.lit(1))
    ).over(
        Window.partitionBy('index').orderBy('initial_range')
    )
).groupBy(
    'group', 'index'
).agg(
    func.min('initial_range').alias('min_val'),
    func.max('final_range').alias('max_val')
).drop(
    'group'
).show(100, False)
Column next_row_initial_diff: just like the lead you used, it shifts/lags the row so you can check whether the current range is in sequence with the previous one.
Column group: a running sum that groups the consecutive sequences within each index partition.
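If you want to run the snippet above end to end, here is a minimal setup sketch; the local SparkSession, the sample data, and the func/Window import aliases are assumptions chosen to match the names used in the answer, not part of the original post.
from pyspark.sql import SparkSession, functions as func
from pyspark.sql.window import Window

# Assumed local session and sample data mirroring the question's table.
spark = SparkSession.builder.master("local[*]").appName("range-grouping").getOrCreate()
df = spark.createDataFrame(
    [(1, 1, 299), (1, 300, 499), (1, 500, 699), (1, 800, 1000),
     (2, 10, 99), (2, 100, 199)],
    ["index", "initial_range", "final_range"],
)
# Feeding this df into the chained expression above should yield
# (1, 1, 699), (1, 800, 1000) and (2, 10, 199).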

Simulating Ogata's Thinning Algorithm in R

I am trying to implement Ogata's Thinning Algorithm exactly as given in Algorithm 3 in https://www.math.fsu.edu/~ychen/research/Thinning%20algorithm.pdf with the parameters they specify to generate Figure 2.
This is the code that replicates it verbatim.
# Simulation of a Univariate Hawkes Poisson with
# Exponential Kernel γ(u) = αe^(−βu), on [0, T].
# Based on: https://www.math.fsu.edu/~ychen/research/Thinning%20algorithm.pdf
library(tidyverse)

# Initialize parameters that remain the same for all realizations
mu <- 1.2
alpha <- 0.6
beta <- 0.8
T <- 10

# Initialize other variables that change through realizations
bigTau <- vector('numeric')
bigTau <- c(bigTau, 0)
s <- 0
n <- 0
lambda_vec_accepted <- c(mu)

# Begin loop
while (s < T) {
  # -------------------------------
  # Compute lambda_bar
  # -------------------------------
  sum_over_big_Tau <- 0
  for (i in c(1:length(bigTau))) {
    sum_over_big_Tau <- sum_over_big_Tau + alpha*exp(-beta*(s-bigTau[i]))
  }
  lambda_bar <- mu + sum_over_big_Tau
  u <- runif(1)
  # so that w ∼ exponential(λ_bar)
  w <- -log(u)/lambda_bar
  # so that s is the next candidate point
  s <- s+w
  # Generate D ∼ uniform(0,1)
  D <- runif(1)
  # -------------------------------
  # Compute lambda_s
  # -------------------------------
  sum_over_big_Tau <- 0
  for (i in c(1:length(bigTau))) {
    sum_over_big_Tau <- sum_over_big_Tau + alpha*exp(-beta*(s-bigTau[i]))
  }
  lambda_s <- mu + sum_over_big_Tau
  # -------------------------------
  # Accepting with prob. λ_s/λ_bar
  # -------------------------------
  if (D*lambda_bar <= lambda_s) {
    lambda_vec_accepted <- c(lambda_vec_accepted, lambda_s)
    # updating the number of points accepted
    n <- n+1
    # naming it t_n
    t_n <- s
    # adding t_n to the ordered set bigTau
    bigTau <- c(bigTau, t_n)
  }
}
df <- data.frame(x = bigTau, y = lambda_vec_accepted)
ggplot(df) + geom_line(aes(x = bigTau, y = lambda_vec_accepted))
However, the plot I managed to get for lambda vs. time (running it several times) is nowhere near what they got in Figure 2 (exponentially decreasing).
I am not sure what mistake I am making. It would be great if anyone could help. This is needed for my research; I am from biology, so please excuse me if I am doing something silly. Thanks.

q/KDB - nprev function to get all the previous n elements

I am struggling to write an nprev function in KDB; the xprev function returns the nth previous element, but I need all of the previous n elements relative to the current element.
q)t:([] i:1+til 26; s:.Q.a)
q)update xp:xprev[3;]s,p:prev s from t
Any help is greatly appreciated.
You can achieve the desired result by applying prev repeatedly and flipping the result:
q)n:3
q)select flip 1_prev\[n;s] from t
s
-----
" "
"a "
"ba "
"cba"
"dcb"
"edc"
..
If n is much smaller than the row count, this will be faster than some of the more straightforward solutions.
The xprev function basically looks like this:
xprev1:{y til[count y]-x} //readable xprev
We can tweak it to get all n elements
nprev:{y til[count y]-\:1+til x}
Using nprev in the query:
q)update np: nprev[3;s] , xp1:xprev1[3;s] , xp: xprev[3;s], p:prev[s] from t
i s np    xp1 xp p
------------------
1 a "   "
2 b "a  "        a
3 c "ba "        b
4 d "cba" a   a  c
5 e "dcb" b   b  d
6 f "edc" c   c  e
The k equivalent of nprev:
k)nprev:{$[0h>#y;'`rank;y(!#y)-\:1+!x]}
and similarly nnext would look like:
k)nnext:{$[0h>#y;'`rank;y(!#y)+\:1+!x]}

q - apply function on table rowwise

Given a table and a function
t:([] c1:1 2 3; c2:`a`b`c; c3:13:00 13:01 13:02)
f:{[int;sym;date]
  symf:{$[x=`a;1;x=`b;2;3]};
  datef:{$[x=13:00;1;x=13:01;2;3]};
  r:int + symf[sym] + datef[date];
  r
  };
I noticed that when applying the function f to columns of t, the entire columns are passed into f; if they can be operated on atomically, the output will be of the same length as the inputs and a new column is produced. However, in this example that won't work:
update newcol:f[c1;c2;c3] from t / 'type error
because the inner functions symf and datef cannot be applied to the entire columns c2 and c3, respectively.
If I don't want to change the function f at all, how can I apply it row by row and collect the values into a new column of t?
What's the most q-style way to do this?
EDIT
If keeping f unchanged turns out to be too inconvenient, one could work around it like so:
f:{[arglist]
  int:arglist 0;
  sym:arglist 1;
  date:arglist 2;
  symf:{$[x=`a;1;x=`b;2;3]};
  datef:{$[x=13:00;1;x=13:01;2;3]};
  r:int + symf[sym] + datef[date];
  r
  };
f each (t`c1),'(t`c2),'(t`c3)
Still, I would be interested in how to get the same result when working with the original version of f.
Thanks!
You can use each-both for this, e.g.
q)update newcol:f'[c1;c2;c3] from t
c1 c2 c3    newcol
------------------
1  a  13:00 3
2  b  13:01 6
3  c  13:02 9
However, you will likely get better performance by modifying f to be "vectorised", e.g.
q)f2
{[int;sym;date]
  symf:3^(`a`b!1 2)sym;
  datef:3^(13:00 13:01!1 2)date;
  r:int + symf + datef;
  r
  }
q)update newcol:f2[c1;c2;c3] from t
c1 c2 c3    newcol
------------------
1  a  13:00 3
2  b  13:01 6
3  c  13:02 9
q)\ts:1000 update newcol:f2[c1;c2;c3] from t
4 1664
q)\ts:1000 update newcol:f'[c1;c2;c3] from t
8 1680
In general in KDB, if you can avoid using any form of each and stick to vector operations, you'll get much better efficiency.

Crystal Reports Cross-tab with mix of Sum, Percentages and Computed values

Being new to Crystal, I am unable to figure out how to compute rows 3 and 4 below.
Rows 1 and 2 are simple percentages of the sum of the data.
Row 3 is a computed value (see below).
Row 4 is a sum of the data points (NOT a percentage as in rows 1 and 2).
Can someone give me some pointers on how to generate the display below?
My data:
2010/01/01 A 10
2010/01/01 B 20
2010/01/01 C 30
2010/02/01 A 40
2010/02/01 B 50
2010/02/01 C 60
2010/03/01 A 70
2010/03/01 B 80
2010/03/01 C 90
I want to display:
                      2010/01/01  2010/02/01  2010/03/01
                      ==========  ==========  ==========
[ B/(A + B + C) ]     20/60       50/150      80/240      <=== percentage of sum
[ C/(A + B + C) ]     30/60       60/150      90/240      <=== percentage of sum
[ 1 - A/(A + B + C) ] 1 - 10/60   1 - 40/150  1 - 70/240  <=== computed
[ (A + B + C) ]       60          150         240         <=== sum
Assuming you are using a SQL data source, I suggest deriving each of the output rows' values (i.e. [ B/(A + B + C) ], [ C/(A + B + C) ], [ 1 - A/(A + B + C) ] and [ (A + B + C) ]) per date in the SQL query, then using Crystal's crosstab feature to pivot them into the desired output format, as sketched below.
Crystal's crosstabs aren't particularly suited to deriving different calculations on different rows of output.
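To make the "derive per-date measures, then pivot" step concrete outside Crystal, here is a small illustrative pandas sketch built on the sample data above; it is only an analogy for what the SQL query plus the crosstab would do, and the column names are assumptions, not Crystal or SQL syntax from the answer.
# Illustrative only: pandas analogue of "derive the measures per date, then pivot".
import pandas as pd

data = pd.DataFrame({
    "date": ["2010/01/01"] * 3 + ["2010/02/01"] * 3 + ["2010/03/01"] * 3,
    "cat":  ["A", "B", "C"] * 3,
    "val":  [10, 20, 30, 40, 50, 60, 70, 80, 90],
})

# One column per category, one row per date.
wide = data.pivot(index="date", columns="cat", values="val")
total = wide.sum(axis=1)

# The four requested measures, computed per date.
measures = pd.DataFrame({
    "B/(A+B+C)":     wide["B"] / total,
    "C/(A+B+C)":     wide["C"] / total,
    "1 - A/(A+B+C)": 1 - wide["A"] / total,
    "A+B+C":         total,
})

# Transpose so measures become rows and dates become columns, as in the target layout.
print(measures.T)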