Redshift Concurrent Transactions - postgresql

I'm having issues writing concurrently to a Redshift database. Writing to the DB using a single connection works well but is a little slow, so I am trying to use multiple concurrent connections. However, it looks like there can only be a single transaction at a time.
I investigated by running the following python script alone, and then running it 4 times simultaneously.
import psycopg2
import time
import os

if __name__ == "__main__":
    rds_conn = psycopg2.connect(host="www.host.com", port="5439", dbname='db_name',
                                user='db_user', password='db_pwd')
    cur = rds_conn.cursor()

    with open("tmp/test.query", 'r') as file:
        query = file.read().replace('\n', '')

    counter = 0
    start_time = time.time()
    try:
        while True:
            cur.execute(query)
            rds_conn.commit()  # first commit location
            print("sent counter: %s" % counter)
            counter += 1
    except KeyboardInterrupt:
        # rds_conn.commit()  # secondary commit location
        total_time = time.time() - start_time
        queries_per_sec = counter / total_time
        print("total queries/sec: %s" % queries_per_sec)
The test.query file being loaded is a multi-row insert file of ~16.8 MB that looks a little like:
insert into category_stage values
(default, default, default, default),
(20, default, 'Country', default),
(21, 'Concerts', 'Rock', default);
(Just a lot longer)
The results of the scripts showed:
---------------------------------------------------
| process count | queries/sec | total queries/sec |
---------------------------------------------------
| 1 | 0.1786 | 0.1786 |
---------------------------------------------------
| 8 | 0.0359 | 0.2872 |
---------------------------------------------------
...which is far from the increase I'm looking for. Watching the counters increase across the scripts, there's a clear round-robin pattern where each script waits for the previous script's query to finish.
When the commit is moved from the first commit location to the second commit location (so it commits only when the script is interrupted), only one script advances at a time. If that isn't a clear indication of some sort of transaction lock, I don't know what is.
As far as I can tell from searching, there's no documentation that says concurrent transactions aren't allowed, so what could the problem be? It has crossed my mind that the query size is so large that only one can be performed at a time, but I would have expected Redshift to allow much more than ~17 MB per transaction.

In line with Guy's comment, I ended up using a COPY from an S3 bucket. This turned out to be an order of magnitude faster, requiring only a single thread to issue the query, with AWS then processing the files from S3 in parallel. I used the guide detailed here and managed to insert about 120 GB of data in just over an hour.
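For anyone landing here, a rough sketch of what issuing that COPY from Python looked like; the bucket path, IAM role, delimiter and connection details below are placeholders, not the ones I actually used:

import psycopg2

# Placeholders: substitute your own S3 prefix, IAM role and connection details.
copy_sql = """
    copy category_stage
    from 's3://my-bucket/category_stage/'
    iam_role 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
    gzip
    delimiter '|';
"""

rds_conn = psycopg2.connect(host="www.host.com", port="5439",
                            dbname='db_name', user='db_user', password='db_pwd')
cur = rds_conn.cursor()
cur.execute(copy_sql)   # one statement; Redshift pulls the S3 files in parallel
rds_conn.commit()
rds_conn.close()

Splitting the data into multiple compressed files under the S3 prefix is what lets Redshift spread the load across its slices.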


Skewed Window Function & Hive Source Partitions?

The data I am reading via Spark is a highly skewed Hive table with the following stats.
(MIN, 25TH, MEDIAN, 75TH, MAX) via the Spark UI:
1506.0 B / 0,  232.4 KB / 27288,  247.3 KB / 29025,  371.0 KB / 42669,  269.0 MB / 27197137
I believe it is causing problems downstream in the job when I perform some window functions and pivots.
I tried exploring this parameter to limit the partition size; however, nothing changed and the partitions are still skewed on read.
spark.conf.set("spark.sql.files.maxPartitionBytes")
Also, when I cache this DataFrame (with the Hive table as source) it takes a few minutes and even causes some GC in the Spark UI, most likely because of the skew as well.
Does spark.sql.files.maxPartitionBytes work on Hive tables, or only on files?
What is the best course of action for handling this skewed Hive source?
Would something like a stage-barrier write to Parquet, or salting, be suitable for this problem?
I would like to avoid .repartition() on read, as it adds another layer to an already data roller-coaster of a job.
Thank you
==================================================
After further research it appears the window function is producing skewed data too, and this is where the Spark job hangs.
I am performing some time-series filling via a double window function (forward then backward fill to impute all the null sensor readings) and am trying to follow this article to try a salt method to distribute the data evenly ... however the following code produces all null values, so the salt method is not working.
I'm not sure why I am getting skews after the window, since each measure item I am partitioning by has roughly the same number of records (checked via .groupBy()) ... so why would salt be needed?
+--------------------+-------+
| measure | count|
+--------------------+-------+
| v1 |5030265|
| v2 |5009780|
| v3 |5030526|
| v4 |5030504|
...
salt post => https://medium.com/appsflyer/salting-your-spark-to-scale-e6f1c87dd18
from pyspark.sql import functions as F
from pyspark.sql.window import Window

nSaltBins = 300  # based off number of "measure" values
df_fill = df_fill.withColumn("salt", (F.rand() * nSaltBins).cast("int"))

# FILLS [FORWARD + BACKWARD]
window = Window.partitionBy('measure')\
    .orderBy('measure', 'date')\
    .rowsBetween(Window.unboundedPreceding, 0)

# FORWARD FILLING IMPUTER
ffill_imputer = F.last(df_fill['new_value'], ignorenulls=True)\
    .over(window)
fill_measure_DF = df_fill.withColumn('value_impute_temp', ffill_imputer)\
    .drop("value", "new_value")

window = Window.partitionBy('measure')\
    .orderBy('measure', 'date')\
    .rowsBetween(0, Window.unboundedFollowing)

# BACKWARD FILLING IMPUTER
bfill_imputer = F.first(df_fill['value_impute_temp'], ignorenulls=True)\
    .over(window)
df_fill = df_fill.withColumn('value_impute_final', bfill_imputer)\
    .drop("value_impute_temp")
Salting might be helpful in the case where a single partition is big enough to not fit in memory on a single executor. This might happen even if all the keys are equally distributed (as in your case).
You have to include the salt column in the partitionBy clause you are using to create the window:
window = Window.partitionBy('measure', 'salt')\
.orderBy('measure', 'date')\
.rowsBetween(Window.unboundedPreceding, 0)
Then you have to create another window which will operate on the intermediate result:
window1 = Window.partitionBy('measure')\
.orderBy('measure', 'date')\
.rowsBetween(Window.unboundedPreceding, 0)
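A rough sketch of how those two windows might be chained for the forward fill, reusing the column names from the question (measure, date, new_value, salt); whether a two-stage fill preserves exact forward-fill semantics depends on how the random salt buckets split each series:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Stage 1: partial forward fill inside each (measure, salt) bucket.
w_salted = Window.partitionBy('measure', 'salt')\
    .orderBy('measure', 'date')\
    .rowsBetween(Window.unboundedPreceding, 0)

# Stage 2: finish the fill across the whole measure partition.
w_final = Window.partitionBy('measure')\
    .orderBy('measure', 'date')\
    .rowsBetween(Window.unboundedPreceding, 0)

df_stage1 = df_fill.withColumn(
    'value_partial', F.last('new_value', ignorenulls=True).over(w_salted))

df_filled = df_stage1.withColumn(
    'value_impute_temp', F.last('value_partial', ignorenulls=True).over(w_final))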
Hive-based solution:
You can enable skew join optimization using Hive configuration. The applicable settings are:
set hive.optimize.skewjoin=true;
set hive.skewjoin.key=500000;
set hive.skewjoin.mapjoin.map.tasks=10000;
set hive.skewjoin.mapjoin.min.split=33554432;
See the Databricks tips for this: skew hints may work in this case.

How to update a local table remotely?

I have a large table on a remote server with an unknown number (millions) of rows of data. I'd like to fetch the data in batches of 100,000 rows at a time, update my local table with those fetched rows, and continue until all rows have been fetched. Is there a way I can update a local table remotely?
Currently I have a dummy table called t on the server along with the following variables...
t:([]sym:1000000?`A`B`Ab`Ba`C`D`Cd`Dc;id:1+til 1000000)
selector:select from t where sym like "A*"
counter:count selector
divy:counter%100000
divyUP:ceiling divy
and the below function on the client along with the variables index set to 0 and normTable, which is a copy of the remote table...
index:0
normTable:h"0#t"
batches:{[idx;divy;anty;seltr]
  if[not idx=divy;
    batch:select[(anty;100000)] from seltr;
    `normTable upsert batch;
    idx+::1;
    divy:divy;
    anty+:100000;
    seltr:seltr;
    batches[idx;divy;anty;seltr]];
  idx::0}
I call that function using the following command...
batches[index;h"divyUP";0;h"selector"]
The problem with this approach, though, is that h"selector" fetches all the rows of data at the same time (and multiple times: once for each batch of 100,000 that it upserts to my local normTable).
I could move the batches function to the remote server, but then how would I update my local normTable remotely?
Alternatively, I could break the rows up into batches on the server and then pull each batch individually. But if I don't know how many rows there are, how do I know how many variables are required? For example, the following would work, but only up to the first 400k rows...
batch1:select[100000] from t where symbol like "A*"
batch2:select[100000 100000] from t where symbol like "A*"
batch3:select[200000 100000] from t where symbol like "A*"
batch4:select[300000 100000] from t where symbol like "A*"
Is there a way to generate batchX variables so that the number of variables created equals divyUP?
I would suggest a few changes, since you are connecting to a remote server:
Do not run synchronous requests, as that would slow down the server's processing. Make asynchronous requests using callbacks instead.
Do not do a full table scan (with a heavy comparison such as a regex) in each call. It is possible that most of the data will be in the cache on the next call, but that is not guaranteed, and it will again impact the server's normal operations.
Do not make data requests in bursts. Either use a timer or request the next batch only once the last batch has arrived.
The approach below is based on these suggestions. It avoids scanning the full table on columns other than the index column (which is lightweight) and makes the next request only when the last batch has arrived.
Create a batch processing function
This function will run on the server, read a small batch of data from the table using indices, and return the required data.
q) batch:{[ind;s] ni:ind+s; d:select from t where i within (ind;ni), sym like "A*";
neg[.z.w](`upd;d;$[ni<count t;ni+1;0]) }
It takes 2 arguments: the starting index and the batch size to work on.
This function will finally call the upd function on the local machine asynchronously and pass 2 arguments:
Data from the current batch request
The table index to start the next batch from (0 if all rows are done, which stops further batch processing)
Create a callback function
The result from the batch processing function arrives in this function.
If the index is > 0, that means there is more data to process and the next batch should start from this index.
q) upd:{[data;ind] t::t,data;if[ind>0;fetch ind]}
Create a main function to start the process
q)fetch:{[ind] h (batch;ind;size)}
Finally, open the connection, create the table variable and run the fetch function.
q) h:hopen `:server:port
q) t:()
q) size:100
q) fetch 0
Now, the above method is based on the assumption that the server table is static. If it is being updated in real time, then changes would be required depending on how the table is being updated on the server.
Also, other optimizations can be done depending on the attributes set on the remote table, which can improve performance.
If you're ok sending sync messages it can be simplified to something like:
{[h;i]`mytab upsert h({select from t where i in x};i)}[h]each 0N 100000#til h"count t"
And you can easily change it to control the number of batches (rather than the batch size) by instead using 10 0N# (that would do it in 10 batches).
Rather than having individual variables, the cut function can split the result of the select into chunks of 100,000 rows. Indexing into each element gives a table.
batches:100000 cut select from t where symbol like "A*"

"The requested reader was not valid. The reader either does not exist or has expired" Error while fetching Performance data in SCOM

Snippet of the script that I am executing:
$reader = $managementgroupobj.GetMonitoringPerformanceDataReader()
while ($reader.Read())   # << Error on this line.
{
$perfData = $reader.GetMonitoringPerformanceData()
$valueReader = $perfData.GetValueReader($starttime,$endtime)
while ($valueReader.Read())
{
$perfValue = $valueReader.GetMonitoringPerformanceDataValue()
}
}
Here, $managementgroupobj is an instance of class ManagementGroup.
The difference between $starttime and $endtime varies from 15 minutes to 1 hour, depending on when the script last ran.
The snippet collects the performance data successfully for a long time, but then, out of nowhere, it throws the following error:
"The requested reader was not valid. The reader either does not exist or has expired"
[ log_level=WARN pid=2716 ] Execute command 'get-scomallperfdata' failed. The requested reader was not valid. The reader either does not exist or has expired.
at GetSCOMPerformanceData, E:\perf\scom_command_loader.ps1: line 628
at run, E:\perf\scom_command_loader.ps1: line 591
at <ScriptBlock>, E:\perf\scom_command_loader.ps1: line 815
at <ScriptBlock>, <No file>: line 1
at <ScriptBlock>, <No file>: line 46
at Microsoft.EnterpriseManagement.Common.Internal.ServiceProxy.HandleFault(String methodName, Message message)
at Microsoft.EnterpriseManagement.Common.Internal.EntityObjectsServiceProxy.GetObjectsFromReader(Guid readerId, Int32 count)
at Microsoft.EnterpriseManagement.Common.DataReader.Read()
at CallSite.Target(Closure , CallSite , Object )
What is the cause of the mentioned error?
It would also be great to understand the mechanism of the PerformanceDataReader.
Note:
The amount of data fetched before the error was 100k+ rows, and it took almost an hour to fetch it.
I think the issue may be the amount of data it has to fetch; it might be a kind of timeout exception.
Any insight into either of the questions mentioned above would be appreciated.
Thanks.
Since the end goal is to offload ALL performance data to another tool, the SCOM API will not provide enough performance, so direct SQL queries are recommended.
A bit of background:
SCOM has two DBs. The Operational DB holds all current status, including almost "real-time" performance data. The Data Warehouse DB holds historical data, including aggregated (hourly and daily) performance data. All the queries below are for the Operational DB.
SCOM as a platform can monitor absolutely anything -- it's implemented in Management Packs, so each MP can introduce new classes (types) of monitored entities, and/or new performance counters for existing classes. Say, you can create an MP for a SAN appliance and start collecting its perf data. Or you can create another MP which adds a "Number of Files" counter to the "Windows Logical Disk" class.
Keeping the above bits in mind, the queries below are for the "Windows Computer" class (so they won't work if you monitor Unix servers; you'll need to change the class) and all associated objects.
Step 1: Find all available counters for a Windows Computer by its name.
NB! Results may differ depending on the OS version and the MPs installed in your SCOM.
declare @ServerName as nvarchar(200) = 'server1.domain.local'

select pc.*
from PerformanceCounterView pc
join TypedManagedEntity tme on tme.TypedManagedEntityId = pc.ManagedEntityId
join BaseManagedEntity bme on tme.BaseManagedEntityId = bme.BaseManagedEntityId
where (bme.TopLevelHostEntityId = (select BaseManagedEntityId from BaseManagedEntity where FullName = 'Microsoft.Windows.Computer:'+@ServerName))
order by ObjectName, CounterName, InstanceName
Step 2: Retrieve actual performance data for each counter found in step 1.
The @SrcID parameter is the PerformanceSourceInternalId column from the previous query.
NB! All timestamps in SCOM are in UTC. The query below accepts input in local time and produces output in local time as well.
declare @SrcID as int = XXXX
declare @End as datetime = GETDATE()
declare @Start as datetime = DATEADD(HOUR, -4, @End)
declare @TZOffset as int = DATEDIFF(MINUTE, GETUTCDATE(), GETDATE())

SELECT SampleValue, DATEADD(MINUTE, @TZOffset, TimeSampled) as TS
FROM PerformanceDataAllView
where (PerformanceSourceInternalId = @SrcID)
and (TimeSampled > DATEADD(MINUTE, -@TZOffset, @Start))
and (TimeSampled < DATEADD(MINUTE, -@TZOffset, @End))
By default SCOM keeps only the last 7 days of "real-time" performance data; after that it gets aggregated and offloaded to the Data Warehouse.
Don't call these queries too often, and consider the NOLOCK hint, to avoid blocking SCOM itself.
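If it helps, here is a minimal sketch of running the step 2 query from Python; pyodbc, the connection string, the 4-hour window and the hard-coded source ID are my assumptions and placeholders, not part of SCOM itself:

import pyodbc

# Placeholders: SQL Server host, database name and source ID are examples only.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=scom-sql.domain.local;DATABASE=OperationsManager;"
    "Trusted_Connection=yes;")

sql = """
SELECT SampleValue, TimeSampled
FROM PerformanceDataAllView WITH (NOLOCK)
WHERE PerformanceSourceInternalId = ?
  AND TimeSampled > DATEADD(HOUR, -4, GETUTCDATE())
ORDER BY TimeSampled
"""

cursor = conn.cursor()
for sample_value, time_sampled in cursor.execute(sql, 1234):  # 1234 = example @SrcID
    print(time_sampled, sample_value)
conn.close()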
Hope that helps.
Cheers
Max
The reader call will return true if the reader moved to the next result and false if not, according to the method's documentation. If you are getting an exception, it couldn't do either of those. I'd assume something broke the connection between you and the SCOM instance.
If it's a timeout issue, I'm not sure it's a SCOM timeout. The error doesn't say anything about a timeout. As far as I know, this is an RPC call under the hood, and RPC doesn't have a timeout:
There are two ways your client can hang: network connectivity can
cause server requests to become lost, or the server itself can crash.
With default options, RPC will never time out a call, and your client
thread will wait forever for a response.
Maybe a firewall is closing your connection after a certain period of time?
If you want to dial in your performance, consider caching. It looks like you've got a much larger script than the snippet we see. Tossing this out just to make you aware it's an option.

multiple cron jobs on the same postgres table

I have a cron job that runs every 2 minutes. It takes 10 records from a Postgres table, works on them, and then sets a flag when it is finished. I want to make sure that if the first run takes more than 2 minutes, the next run will work on different data in the DB, not on the same data.
Is there any way to handle this case?
This can be solved using a Database Transaction.
BEGIN;
SELECT
id,status,server
FROM
table_log
WHERE
(direction = '2' AND status_log = '1')
LIMIT 100
FOR UPDATE SKIP LOCKED;
What are we doing?
We are selecting all the rows that are available (not locked) by other cron jobs that might be running, and selecting them FOR UPDATE. This means that everything this query grabs is unlocked, and all the results will be locked for this cron job only.
How do I update my locked rows?
Simple: use a for loop in your processing language (Python, Ruby, PHP) and concatenate each update; remember, we are building one single update.
UPDATE table_log SET status_log = '6' ,server = '1' WHERE id = '1';
Finally we use
COMMIT;
and all the locked rows will be updated. This prevents other queries from touching the same data at the same time. Hope it helps.
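A rough sketch of the whole pattern from Python with psycopg2; the table and column names come from the queries above, while the connection details are placeholders:

import psycopg2

conn = psycopg2.connect(host="localhost", dbname="db_name",
                        user="db_user", password="db_pwd")  # placeholders

with conn:                        # one transaction for the whole batch
    with conn.cursor() as cur:
        cur.execute("""
            SELECT id, status, server
            FROM table_log
            WHERE direction = '2' AND status_log = '1'
            LIMIT 100
            FOR UPDATE SKIP LOCKED;
        """)
        for row_id, status, server in cur.fetchall():
            # ... process the row here ...
            cur.execute(
                "UPDATE table_log SET status_log = '6', server = '1' "
                "WHERE id = %s;", (row_id,))
# leaving the `with conn` block commits and releases the locks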
Turn your "finished" flag from binary to ternary ("needs work", "in process", "finished"). You also might want to store the pid of the "in process" process, in case it dies and you need to clean it up, and a timestamp for when it started.
Or use a queueing system that someone already wrote and debugged for you.

postgres trigger to check for data

I need to write a trigger that will check a table column to see whether data is there or not. The trigger needs to run all the time and log a message every hour.
Basically it will run a SELECT statement; if a result is found, sleep for an hour, else log and sleep for an hour.
What you want is a scheduled job. With pgAgent (http://www.pgadmin.org/docs/1.4/pgagent.html) you can create an hourly job that checks for that row and then logs as required.
Edit to add:
I'm curious whether you've considered writing a SQL script that generates the log on the fly by reading the table, instead of a job. If you have a timestamp field, it is quite possible to have a script that returns all hourly periods that don't have a corresponding entry within that time frame (assuming the timestamp isn't updated). Why store a second log when you can generate it directly against the data?
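As a sketch of that idea (assuming a hypothetical table called events with a created_at timestamp column), something like the following would list every hour in the last week that has no rows at all:

import psycopg2

# Hypothetical names: the table "events" and column "created_at" are placeholders.
GAP_QUERY = """
    SELECT h.hour_start
    FROM generate_series(now() - interval '7 days',
                         now(),
                         interval '1 hour') AS h(hour_start)
    LEFT JOIN events e
           ON e.created_at >= h.hour_start
          AND e.created_at <  h.hour_start + interval '1 hour'
    WHERE e.created_at IS NULL
    ORDER BY h.hour_start;
"""

conn = psycopg2.connect("dbname=db_name user=db_user")  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute(GAP_QUERY)
    for (hour_start,) in cur.fetchall():
        print("no data during the hour starting at", hour_start)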
Triggers (in Postgres and in every DBMS I know of) execute before or after events such as insert, update or delete. What you probably want here is a script launched every hour via something like cron (if you are on a Unix system), redirecting its output to the log file.
I have used something like this many times, and it looked like this (written in Python):
#!/usr/bin/python
import sys
import logging
import psycopg2

# dbname, user, hostname and passwd are placeholders: fill in your own values.
try:
    conn = psycopg2.connect("dbname='" + dbname + "' user='" + user +
                            "' host='" + hostname + "' password='" + passwd + "'")
except:
    # Get the most recent exception
    exceptionType, exceptionValue, exceptionTraceback = sys.exc_info()
    # Exit the script and log an error telling what happened.
    logging.debug("I am unable to connect to the database!\n ->%s" % (exceptionValue))
    exit(2)

cur = conn.cursor()
query = "SELECT whatever FROM wherever WHERE yourconditions"
try:
    cur.execute(query)
except psycopg2.ProgrammingError:
    print("Programming error, no result produced")

result = cur.fetchone()
if result is None:
    # do whatever you need; if result is None the data is not in your table column
    pass
I used to launch my script via cron every 10 minutes; you can easily configure it to launch the script every hour, redirecting its output to the log file of your choice.
If you're working in a Windows environment, you'll be looking for a cron equivalent such as Task Scheduler.
I don't think a trigger can help you with this; triggers fire only on certain events (you could use a trigger to check after every insert whether the inserted data is what you want to check every hour, but that's not the same thing; doing it via a script is the best solution in my experience).