Query for null records where filestream is enabled - sql-server-2008-r2

I have a table on which we just enabled FILESTREAM. We created a new varbinary column set to store its data in the filestream, then copied everything from the existing column to the new one in order to get the file data pushed out to the file system.
So far so good.
However, we weren't able to take the DB offline while doing this (uptime SLA), and there were 2 records out of 7400 that came in after the update statement ran but before we renamed the columns. We currently have 2 columns, FileData and FileDataOld, where FileData is the one tied to the filestream.
The average file size is a little over 2 MB. So I decided to run a very simple select statement to find the records that didn't get copied:
select DocumentId, FileName
from docslist
where FileData is null
When I ran this query, the CPU spiked to 80% and sat there for quite a while. Ultimately I killed the select after 2 minutes because that was just insane.
If I run something like:
select DocumentId, FileName from docslist
It returns almost instantly.
However, as soon as I try to query where FileData or FileDataOld is null it spins off into forever land.
Watching Resource Monitor while I query for 'FileData is null', I can see it pulling every byte of every single document off the file system, which is pretty odd; you'd think that information would be stored within the table itself.
If I query for FileDataOld is null, it looks like it's trying to load the entire table (16 GB) into memory.
How can I improve this?? I just need to find the 2 records that came in after the update statement and force those documents to move over.

Can't you do:
select DocumentId, FileName from docslist WHERE DATALENGTH(FileData) IS NULL
On MSDN it says:
DATALENGTH is especially useful with varchar, varbinary, text, image,
nvarchar, and ntext data types because these data types can store
variable-length data.
The DATALENGTH of NULL is NULL.
Reference: the MSDN documentation for DATALENGTH.
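Once those rows are found, the same predicate should let you push the remaining documents into the filestream column; an untested sketch, reusing the column names from the question:
UPDATE docslist
SET FileData = FileDataOld
WHERE DATALENGTH(FileData) IS NULL;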


Spark Delta Table Updates

I am working in a Microsoft Azure Databricks environment using Spark SQL and PySpark.
I have a delta table on a lake where data is partitioned by, say, file_date. Every partition contains files storing millions of records per day, with no primary/unique key. All these records have a "status" column which is either NULL (if everything looks good on that specific record) or not null (say, if a particular lookup mapping for a particular column is not found). Additionally, my process has another folder called "mapping", which gets refreshed on a periodic basis, let's say nightly to keep it simple, and is where the mappings come from.
On a daily basis, there is a good chance that about 100-200 rows get errored out (status column containing not-null values). From these files, on a daily basis (hence the partition by file_date), a downstream job pulls all the valid records and sends them for further processing, ignoring those 100-200 errored records while waiting for the correct mapping file to be received. In addition to the valid-status records, the downstream job should also check whether a mapping is now available for the errored records and, if so, take those down further as well (after, of course, updating the data lake with the appropriate mapping and status).
What is the best way to go? The ideal approach would be to first update the delta table/lake directly with the correct mapping, set the status column to something like "available_for_reprocessing", and have my downstream job pull the valid data for the day plus the "available_for_reprocessing" data, then update the status back to "processed" after processing. But this seems to be super difficult using delta.
I was looking at "https://docs.databricks.com/delta/delta-update.html" and the update example there only shows a simple update with constants, not updates driven by other tables.
The other, but most inefficient, option is to pull ALL the data (both processed and errored) for, say, the last 30 days, get the mapping for the errored records and write the dataframe back into the delta lake using the replaceWhere option. This is super inefficient, as we are reading everything (hundreds of millions of records) and writing everything back just to process at most about a thousand records. If you search for deltaTable = DeltaTable.forPath(spark, "/data/events/") at "https://docs.databricks.com/delta/delta-update.html", the example provided is again only for very simple updates. Without a unique key, it seems impossible to update specific records as well. Can someone please help?
I use pyspark, or can use sparksql, but I am lost.
If you want to update one column ('status') on the condition that all lookups are now correct for rows where they weren't correct before (where 'status' is currently incorrect), I think the UPDATE command along with EXISTS can help you solve this. It isn't mentioned in the update documentation, but it works for both delete and update operations, effectively allowing you to update/delete records on joins.
For your scenario, I believe the SQL command would look something like this:
UPDATE your_db.table_name AS a
SET status = 'correct'
WHERE EXISTS
(
SELECT *
FROM your_db.table_name AS b
JOIN lookup_table_1 AS t1 ON t1.lookup_column_a = b.lookup_column_a
JOIN lookup_table_2 AS t2 ON t2.lookup_column_b = b.lookup_column_b
-- ... add further lookups if needed
WHERE
b.status = 'incorrect' AND
a.lookup_column_a = b.lookup_column_a AND
a.lookup_column_b = b.lookup_column_b
)
Merge did the trick...
MERGE INTO deptdelta AS maindept
USING updated_dept_location AS upddept
ON upddept.dno = maindept.dno
WHEN MATCHED THEN UPDATE SET maindept.dname = upddept.updated_name, maindept.location = upddept.updated_location
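Applied to the scenario in the question, the same merge against the refreshed mapping data might look roughly like this (the mapping table/view name and the lookup column names are placeholders):
MERGE INTO your_db.table_name AS t
USING mapping AS m
ON t.lookup_column_a = m.lookup_column_a
WHEN MATCHED AND t.status IS NOT NULL THEN UPDATE SET t.status = 'available_for_reprocessing'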

Do logical or physical deletes perform better in Spanner?

Based on the example below, which of the following deletion strategies is recommended in Spanner?
CREATE TABLE Singers (
SingerId INT64 NOT NULL,
FirstName STRING(1024),
LastName STRING(1024),
SingerInfo BYTES(MAX),
BirthDate DATE,
Status STRING(1024)
) PRIMARY KEY(SingerId);
Logical Delete
Status field = 'DELETED'
Status field = null
Physical Delete
Singer record deleted from the table
From my understanding of the way Spanner works, logically deleted records will need to be shifted to the end of the scan so that the relevant data is read first and it's not needlessly scanning through deleted records.
Physically deleted records would cause Spanner to have to re-index or split the data.
I'm not sure which is preferred, or if my understanding of data modification in Spanner is truly correct.
Physical deletion is preferred over logical deletion, since you'll have less data to scan in the end, and it helps you avoid full scans, which are more time consuming.
As for the splits, it's true that with fewer splits the reads are faster, but splits are created as you add more rows anyway, so I would go with physical deletion.
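For example, the physical delete is just a DML statement:
DELETE FROM Singers WHERE SingerId = @SingerId; -- @SingerId is a bound query parameter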

Returning JSON from Postgres is slow

I have a table in Postgres with a JSONB column; each row of the table contains a large JSONB object (~4500 keys, around 110 KB as a JSON string in a text file). I want to query these rows and get the entire JSONB object.
The query is fast -- when I run EXPLAIN ANALYZE, or omit the JSONB column, it returns in 100-300 ms. But when I execute the full query, it takes on the order of minutes. The exact same query on a previous version of the data was also fast (each JSONB was about half as large).
Some notes:
This ends up in Python (via SQLAlchemy/psycopg2). I'm worried that the query executor is converting JSONB to JSON, which then gets encoded to text for transfer over the wire and then gets JSON-decoded again on the Python end.
Is this correct? If so, how could I mitigate the issue? When I select the JSONB column as ::text, the query is roughly twice as fast.
I only need a small subset of the JSON (around 300 keys or 6% of keys). I tried methods of filtering the JSON output in the query but they caused a substantial further performance hit -- it ended up being faster to return the entire object.
This is not necessarily a solution, but here is an update:
By casting the JSONB column to text in the Postgres query, I was able to substantially cut down both query execution and result fetching on the Python end.
On the Python end, doing json.loads for every single row in the result set brings me right back to the same timing as the regular query. However, with the ujson library I was able to obtain a significant speedup. Casting to text in the query and then calling ujson.loads on the Python end is roughly 3x faster than simply returning JSON in the query.
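As a sketch (the table and column names here are made up), the cast in the query is simply:
SELECT id, payload::text AS payload_text
FROM docs
WHERE id = 42;
The returned text is then parsed on the Python side with ujson.loads, as described above.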

postgres memory issue in transaction

We have a postgres process that works as follows:
1) a CSV file is split into a table with a row per record
2) pgAgent runs a postgres function that reads each record and writes it to a new table as either a new record or an update
3) a trigger runs on the new table and, depending on the record value, runs a plv8 function to update its data (there's a fair bit of JSON processing involved and plv8 was the easiest way to code it). The second update comes from plv8, and we've used the pattern below:
query = plv8.prepare('...');
query.execute(<params>);
query.free();
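For context, the wiring for step 3 looks roughly like the sketch below; table and column names are made up and it's heavily simplified (the real function does far more JSON work), and the WHEN guard is only there to keep the sketch from re-firing itself:
CREATE OR REPLACE FUNCTION enrich_row() RETURNS trigger AS $$
  // hypothetical JSON processing on a text column holding JSON
  var doc = JSON.parse(NEW.payload);
  doc.enriched = true;

  // the prepare / execute / free pattern from above
  var query = plv8.prepare(
    'UPDATE target_table SET payload = $1, needs_processing = false WHERE id = $2',
    ['text', 'int']);
  query.execute([JSON.stringify(doc), NEW.id]);
  query.free();
  return NEW;
$$ LANGUAGE plv8;

CREATE TRIGGER target_table_enrich
AFTER INSERT OR UPDATE ON target_table
FOR EACH ROW
WHEN (NEW.needs_processing)
EXECUTE PROCEDURE enrich_row();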
When we monitor this, we see that processing 5000 records uses 14 GB of virtual memory, so something is awry, as each CSV record is < 1 KB in size. This became acute after we added a new index to the table.
Where should we look for solutions to this? Is it normal, and is it linked to the indexes being updated in the transaction, or to another factor?

How to tell if record has changed in Postgres

I have a bit of an "upsert" type of question... but I want to throw it out there because it's a little bit different from any that I've read on Stack Overflow.
Basic problem.
I'm working on moving from MySQL to PostgreSQL 9.1.5 (hosted on Heroku). As a part of that, I need to import multiple CSV files every day. Some of the data is sales information and is almost guaranteed to be new and to need inserting. But other parts of the data are almost guaranteed to be the same. For example, the CSV files (note: plural) will have POS (point of sale) information in them. This rarely changes (and most likely only via additions). Then there is product information. There are about 10,000 products (the vast majority will be unchanged, but it's possible to have both additions and updates).
The final item (an important one) is that I have a requirement to be able to provide an audit trail/information for any given item. For example, if I add a new POS record, I need to be able to trace it back to the file it was found in. If I change a UPC code or the description of a product, then I need to be able to trace the change back to the import (and file) it came from.
Solution that I'm contemplating.
Since the data is provided to me via CSV, I'm working from the idea that COPY will be the best/fastest way. The structure of the data in the files is not exactly what I have in the database (i.e. the final destination), so I'm copying them into tables in a staging schema that match the CSV (note: one schema per data source). The tables in the staging schemas will have BEFORE INSERT row triggers. These triggers can decide what to do with the data (insert, update or ignore).
For the tables that are most likely to contain new data, the trigger will try the insert first. If the record is already there, it will return NULL (and stop the insert into the staging table). For the tables that rarely change, it will query the table and see if the record is found. If it is, then I need a way to see whether any of the fields have changed (because, remember, I need to show that the record was modified by import x from file y). I obviously could just boilerplate out the code and test each column, but I was looking for something a little more "eloquent" and more maintainable than that.
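Roughly the kind of trigger I have in mind for the "mostly new data" case (all names here are made up, and it's stripped down):
CREATE OR REPLACE FUNCTION staging.pos_import_bi() RETURNS trigger AS $$
BEGIN
    -- try the insert into the real table first
    BEGIN
        INSERT INTO public.pos (pos_id, name)
        VALUES (NEW.pos_id, NEW.name);
    EXCEPTION WHEN unique_violation THEN
        RETURN NULL;  -- already there: cancel the insert into the staging table
    END;
    RETURN NEW;       -- genuinely new: keep the row in staging for the audit trail
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER pos_import_bi
BEFORE INSERT ON staging.pos_import
FOR EACH ROW EXECUTE PROCEDURE staging.pos_import_bi();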
In a way, what I'm doing is combining an importing system with an audit trail system. So, in researching audit trails, I reviewed the following wiki.postgresql.org article. It seems like hstore might be a nice way to get at the changes (and to easily ignore some columns in the table that aren't important - e.g. "last_modified").
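For example (made-up names again, and assuming the hstore extension is installed), something like this would give me just the columns whose incoming value differs, minus the ones I don't care about:
SELECT s.upc,
       hstore(s) - hstore(d) - ARRAY['last_modified'] AS changed_cols
FROM   staging.products s
JOIN   public.products  d USING (upc)
WHERE  hstore(s) - hstore(d) - ARRAY['last_modified'] <> ''::hstore;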
I'm about 90% sure it will all work... I've created some testing tables etc and played around with it.
My question?
Is there a better, more maintainable way of accomplishing this task of finding the maybe 3 records out of 10K that require a change to the database? I could certainly write a Python script (or something else) that reads the files and tries to figure out what to do with each record, but that feels horribly inefficient and would lead to lots of round trips.
A few final things:
I don't have control over the input files. I would love it if they only sent me the deltas, but they don't and it's completely outside of my control or influence.
The system is growing, and new data sources are likely to be added that will greatly increase the amount of data being processed (so I'm trying to keep things efficient).
I know this is not a nice, simple SO question (like "how to sort a list in python"), but I believe one of the great things about SO is that you can ask hard questions and people will share their thoughts on the best way to solve them.
I have lots of similar operations. What I do is COPY to temporary staging tables:
CREATE TEMP TABLE target_tmp AS
SELECT * FROM target_tbl LIMIT 0; -- only copy structure, no data
COPY target_tmp FROM '/path/to/target.csv';
For performance, run ANALYZE - temp. tables are not analyzed by autovacuum!
ANALYZE target_tmp;
Also for performance, maybe even create an index or two on the temp table, or add a primary key if the data allows for that.
ALTER TABLE target_tmp ADD CONSTRAINT target_tmp_pkey PRIMARY KEY (target_id);
You don't need the performance stuff for small imports.
Then use the full scope of SQL commands to digest the new data.
For instance, if the primary key of the target table is target_id ..
Maybe DELETE what isn't there any more?
DELETE FROM target_tbl t
WHERE NOT EXISTS (
SELECT 1 FROM target_tmp t1
WHERE t1.target_id = t.target_id
);
Then UPDATE what's already there:
UPDATE target_tbl t
SET col1 = t1.col1
FROM target_tmp t1
WHERE t.target_id = t1.target_id
To avoid empty UPDATEs, simply add:
...
AND t.col1 IS DISTINCT FROM t1.col1; -- repeat for relevant columns
Or, if the whole row is relevant:
...
AND t IS DISTINCT FROM t1; -- check the whole row
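Assembled, the single-column variant reads:
UPDATE target_tbl t
SET    col1 = t1.col1
FROM   target_tmp t1
WHERE  t.target_id = t1.target_id
AND    t.col1 IS DISTINCT FROM t1.col1;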
Then INSERT what's new:
INSERT INTO target_tbl(target_id, col1)
SELECT t1.target_id, t1.col1
FROM target_tmp t1
LEFT JOIN target_tbl t USING (target_id)
WHERE t.target_id IS NULL;
Clean up if your session goes on (temp tables are dropped at end of session automatically):
DROP TABLE target_tmp;
Or use ON COMMIT DROP or similar with CREATE TEMP TABLE.
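For example (only useful if the COPY and the follow-up statements run inside the same transaction):
CREATE TEMP TABLE target_tmp ON COMMIT DROP AS
SELECT * FROM target_tbl LIMIT 0;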
Code untested, but should work in any modern version of PostgreSQL except for typos.