I have a spark dataframe with this format:
opp_id__reference|oplin_status| stage| std_amount| std_line_amount|
+-----------------+------------+--------------------+----------------+----------------+
|OP-171102-67318| Won|7 - Deliver & Val...|6243.316662349|6243.31666234948|
|OP-180910-77114| Won|7 - Deliver & Val...|5014.57880858921|5014.57880858921|
|OP-180910-76544| Pending|7 - Deliver & Val...|5014.57880858921|5014.57880858921|
|OP-180910-76544| Pending|7 - Deliver & Val...|5014.57880858921|5614.57880858921|
|OP-180910-76544| Won|7 - Deliver & Val...|5014.57880858921|5994.57880858921|
I would like to extract the list of opp_id__reference that the sum of records which has oplin_status = "Pending" is bigger than std_amount
This hiw I did :
# select opp_line which stage =='7 - Deliver & Validate' and oplin_status =='Pending'
DF_BR8 = df.filter(df.stage.contains("7 - Deliver")).select('opp_id__reference', 'oplin_status', 'stage', 'std_amount', 'std_line_amount')
DF_BR8_1 = DF_BR8.groupby('opp_id__reference', 'std_amount', 'oplin_status').agg({'std_line_amount': 'sum'}).withColumnRenamed('sum(std_line_amount)','sum_column')
DF_res = DF_BR8_1.filter(DF_BR8_1.oplin_status.contains("Pending"))
DF_res1 =DF_res.filter(DF_res.sum_column <= 0.3*DF_BR8_1.std_amount)
My question is it : is it correct what i did? is there any other way more simple to do?
Thanks
Related
I am trying to consume data from Azure Event Hubs with Databricks PySpark and write it in an ADLS sink. Somehow, the spark jobis not able to finish and gets aborted after running for 2 hours. The error is Caused by: java.util.concurrent.RejectedExecutionException: ReactorDispatcher instance is closed.
here is a full error https://gist.github.com/kingindanord/a5f585c6ee7053c275c714d1b07c6538#file-spark_error-log
and here is my python script
import json
from datetime import date, timedelta, datetime
from pyspark.sql import functions as F
KEY_VAULT_NAME="KEY_VAULT_NAME"
EVENT_HUBS_SECRET_NAME="EVENT_HUBS_SECRET_NAME"
EVENT_HUBS_CONSUMER_NAME="EVENT_HUBS_CONSUMER_NAME"
BATCH_START_DATE = datetime.strptime("2022-03-22 23:00:00", "%Y-%m-%d %H:%M:%S")
BATCH_END_DATE = datetime.strptime("2022-03-23 00:00:00", "%Y-%m-%d %H:%M:%S")
CONTAINER_NAME = "CONTAINER_NAME_AZ"
HUB_NAME = "HUB_NAME"
ROOT_FOLDER = "ROOT_FOLDER"
SINK_URI = 'abfss://{CONTAINER_NAME}#.dfs.core.windows.net/{SINK_ROOT_FOLDER}'.format(CONTAINER_NAME=CONTAINER_NAME, SINK_ROOT_FOLDER=ROOT_FOLDER)
connection = dbutils.secrets.get(scope = KEY_VAULT_NAME, key = EVENT_HUBS_SECRET_NAME)
ehConf = {}
ehConf['eventhubs.connectionString'] = sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connection)
ehConf['eventhubs.consumerGroup'] = EVENT_HUBS_CONSUMER_NAME
# Create the positions
startingEventPosition = {
"offset": None,
"seqNo": -1, #not in use
"enqueuedTime": BATCH_START_DATE.strftime("%Y-%m-%dT00:00:00.000Z"),
"isInclusive": True
}
endingEventPosition = {
"offset": None,
"seqNo": -1,
"enqueuedTime": BATCH_END_DATE.strftime("%Y-%m-%dT00:00:00.000Z"),
"isInclusive": True
}
ehConf["eventhubs.startingPosition"] = json.dumps(startingEventPosition)
ehConf["eventhubs.endingPosition"] = json.dumps(endingEventPosition)
ehConf["eventhubs.MaxEventsPerTrigger"] = 1000
ehConf["eventhubs.UseExclusiveReceiver"] = True
df = spark.read.format("eventhubs").options(**ehConf).load()
df2 = df.withColumn("body", df["body"].cast("string")) \
.withColumn("year", F.date_format(df["enqueuedTime"], "yyyy")) \
.withColumn("month", F.date_format(df["enqueuedTime"], "MM")) \
.withColumn("day", F.date_format(df["enqueuedTime"], "dd"))\
.select("body", "year", "month", "day")
df2.write.partitionBy("year", "month", "day").mode("overwrite") \
.format("delta") \
.parquet(SINK_URI)
I am using a separate consumer group for this application. The Event hub has 3 partitions, Auto-inflate throughput units are enabled and it is set to 21 units.
Databricks Runtime Version: 9.1 LTS (includes Apache Spark 3.1.2, Scala 2.12) Worker type & Driver type are Standard_E16_v3 (128GB Memory, 16 Cores) Min workers: 1, Max workers, 3.
As you can see in the code, startingEventPosition and endingEventPosition are only one hour apart, so the size of data should be around 3 GB, I don't know why I am not able to consume them. Can you please help me with this issue.
You can try the 2 workarounds:
Set different Consumer Groups for each stream.
Restart databricks cluster and then try again.
Refer this github link
I'm trying to create an EDUCATIONAL PURPOSES ONLY virus. I do not plan on spreading it. It's purpose is to grow a file to the point your storage is full and slow your computer down. It prints the size of the file every 0.001 seconds. With that, I also want to know how fast it is growing the file. The following code doesn't seem to let it run:
class Vstatus():
def _init_(Status):
Status.countspeed == True
Status.active == True
Status.growingspeed == 0
import time
import os
#Your storage is at risk of over-expansion. Please do not let this file run forever, as your storage will fill continuously.
#This is for educational purposes only.
while Vstatus.Status.countspeed == True:
f = open('file.txt', 'a')
f.write('W')
fsize = os.stat('file.txt')
Key1 = fsize
time.sleep(1)
Key2 = fsize
Vstatus.Status.growingspeed = (Key2 - Key1)
Vstatus.Status.countspeed = False
while Vstatus.Status.active == True:
time.sleep(0.001)
f = open('file.txt', 'a')
f.write('W')
fsize = os.stat('file.txt')
print('size:' + fsize.st_size.__str__() + ' at a speed of ' + Vstatus.Status.growingspeed + 'bytes per second.')
This is for Educational Purposes ONLY
The main error I keep getting when running the file is here:
TypeError: unsupported operand type(s) for -: 'os.stat_result' and 'os.stat_result'
What does this mean? I thought os.stat returned an integer Can I get a fix on this?
Vstatus.Status.growingspeed = (Key2 - Key1)
You can't subtract os.stat objects. Your code also has some other problems. Your loops will run sequentially, meaning that your first loop will try to estimate how quickly the file is being written to without writing anything to the file.
import time # Imports at the top
import os
class VStatus:
def __init__(self): # double underscores around __init__
self.countspeed = True # Assignment, not equality test
self.active = True
self.growingspeed = 0
status = VStatus() # Make a VStatus instance
# You need to do the speed estimation and file appending in the same loop
with open('file.txt', 'a+') as f: # Only open the file once
start = time.time() # Get the current time
starting_size = os.fstat(f.fileno()).st_size
while status.active: # Access the attribute of the VStatus instance
size = os.fstat(f.fileno()).st_size # Send file desciptor to stat
f.write('W') # Writing more than one character at a time will be your biggest speed up
f.flush() # make sure the byte is written
if status.countspeed:
diff = time.time() - start
if diff >= 1: # More than a second has gone by
status.countspeed = False
status.growingspeed = (os.fstat(f.fileno()).st_size - starting_size)/diff # get rate of growth
else:
print(f"size: {size} at a speed of {status.growingspeed}")
I need to merge different large tables (up to 10Gb each) into a single one. To do so I am using a computer cluster with 50+ cores and 10+Gb Ram that runs on Linux.
I always end up with an error message like: "Cannot allocate vector of size X Mb".
Given that commands like memory.limit(size=X) are Windows-specific and not accepted, I cannot find a way around to merge my large tables.
Any suggestion welcome!
This is the code I use:
library(parallel)
no_cores <- detectCores() - 1
cl <- makeCluster(no_cores)
temp = list.files(pattern="*.txt$")
gc()
Here the error occurs:
myfiles = parLapply(cl,temp, function(x) read.csv(x,
header=TRUE,
sep=";",
stringsAsFactors=F,
encoding = "UTF-8",
na.strings = c("NA","99","")))
myfiles.final = do.call(rbind, myfiles)
You could just use merge, for example:
`
mergedTable <- merge(table1, table2, by = "dbSNP_RSID")
If your samples have overlapping column names, then you'll find that the mergedTable has (for example) columns called Sample1.x and Sample1.y. This can be fixed by renaming the columns before or after the merge.
Reproducible example:
x <- data.frame(dbSNP_RSID = paste0("rs", sample(1e6, 1e5)),
matrix(paste0(sample(c("A", "C", "T", "G"), 1e7, replace = TRUE),
sample(c("A", "C", "T", "G"), 1e7, replace = TRUE)), ncol = 100))
y <- data.frame(dbSNP_RSID = paste0("rs", sample(1e6, 1e5)),
matrix(paste0(sample(c("A", "C", "T", "G"), 1e7, replace = TRUE),
sample(c("A", "C", "T", "G"), 1e7, replace = TRUE)), ncol = 100))
colnames(x)[2:101] <- paste0("Sample", 1:100)
colnames(y)[2:101] <- paste0("Sample", 101:200)
mergedDf <- merge(x, y, by = "dbSNP_RSID")
`
One way to approach this is with python and dask. The dask dataframe is stored mostly on disk rather than in ram- allowing you to work with larger than ram data- and can help you do computations with clever parallelization. A nice tutorial of ways to work on big data can be found in this kaggle post which might also be helpful for you. I also suggest checking out the docs on dask performance here. To be clear, if your data can fit in RAM using regular R dataframe or pandas dataframe will be faster.
Here's a dask solution which will assume you have named columns in the tables to align the concat operation. Please add to your question if you have any other special requirements about the data we need to consider.
import dask.dataframe as dd
import glob
tmp = glob.glob("*.txt")
dfs= []
for f in tmp:
# read the large tables
ddf = dd.read_table(f)
# make a list of all the dfs
dfs.append(ddf)
#row-wise concat of the data
dd_all = dd.concat(dfs)
#repartition the df to 1 partition for saving
dd_all = dd_all.repartition(npartitions=1)
# save the data
# provide list of one name if you don't want the partition number appended on
dd_all.to_csv(['all_big_files.tsv'], sep = '\t')
if you just wanted to cat all the tables together you can do something like this in straight python. (you could also use linux cat/paste).
with open('all_big_files.tsv', 'w') as O:
file_number = 0
for f in tmp:
with open(f, 'rU') as F:
if file_number == 0:
for line in F:
line = line.rstrip()
O.write(line + '\n')
else:
# skip the header line
l = F.readline()
for line in F:
line = line.rstrip()
O.write(line + '\n')
file_number +=1
Kinesis Firehose continuously stream json files to S3 bucket. Using pyspark I need to convert into parquet and write to another bucket. Plannig to run this EMR job once in every 30 mins. I used the following code, but EMR job is running long(more than 4 hours and then i killed the job) to convert small files(Around 400K small files and total size is around 30GB).
Can someone give your input to increase the performance of the job?
Code:
# Loop through all folders
for k in sourceBucket.list():
source = k.bucket.name + '/' + k.key
srcsubfolder = source.split('/')
if srcsubfolder[2] != '':
sourcePath = 's3://' + k.bucket.name + '/' + k.key
# derive target path
removefilenamefromkey = k.key.split('/')[:-1]
targetPath = 's3://' + targetBucket.name +'/' + '/'.join(removefilenamefromkey)
#Covert Json into Parquet files
dfjson = spark.read.json(sourcePath)
dfjsoncollower = dfjson.toDF(*[x.lower() for x in dfjson.columns])
dfjsoncollower.write.mode("append").parquet(targetPath)
I am attaching image from spark UI, and i am asking what is causing the delay( represented by white space) based on the description of my code below
Description:
1) isEmpt: is a action triggered on a Dataset DS1. it takes fe milliseconds : 60ms.
2) The white space between "isEmpty" and " run at ThreadPool...".
3) "collect at graphUtil" : collection of Datasets created between 1) and 2)
The script is running on yarn cluster.
Between 1) and 2) i am declaring Datasets which uses sqlContext.implicits._, i am not collecting them here.so this is supposed to be work on Driver.Those Datasets contains Join/filter/....
Having that i am not collecting them between 1) and 2) what could be causing this delay.
Code between 1) and 2)
import sqlContext.implicits._
val intermediateInputFlowsIdsDS= intermediateInputFlowsDS
.map(x=>x.flow)
.toDF("flowid").distinct().as[Int].repartition($"flowid")
val df_exch_flow_interm_out=df_exch_flow.filter(df_exch_flow("flow_type")==="PRODUCT_FLOW"
&&df_exch_flow("is_input")==="0" )
val allproducersExchDS= intermediateInputFlowsIdsDS.join(df_exch_flow_interm_out,
intermediateInputFlowsIdsDS("flowid")===df_exch_flow_interm_out("f_flow") )
.repartition($"f_owner")
//proc{id,name,proctype}/inter{flowid}/df_exch{exch,proc,flow,direct,amount,provider,unit}/df_flow{id,name,type}/unit{id,src,factor,dest}
df_proc.join(allproducersExchDS,df_proc("Id")=== allproducersExchDS("f_owner"))
.map(row => {
/*(flowid,procid,value)*/
new FlowProducer( row.getInt(3),// flowid output of producer
row.getInt(0) ,// the process id of producer
row.getDouble(8),// value of the matrix A cell,
row.getDouble(16),//factor
row.getString(17),//destination unit
row.getString(2)//process type
)
}).repartition($"producer_flow")