I am creating a DMN with around 43 input fields in a single data type. But when I create two DMN tables of 26 rows each, validation takes a very long time (10+ minutes). What can we do to resolve this issue?
Related
In my project I store a high number of rows (~300 million records) in one table in a Postgres database. One row contains 77 columns of String and Integer types, a few booleans, and dates. My issue is that when I try to browse records in the GUI it is extremely slow, even though my DB contains only 3% of its final capacity.
For the DB connection I'm using Spring JPA, and the findAll() method with pagination; 20 records per page takes almost 3.5 s (after adding indexes, before it was 7.5 s) when I have ~10 million records in the DB (3.3 GB).
If I run the same type of query from the psql CLI:
select * from myTable order by column1 limit 20;
I get the same result set immediately, so it is not an issue with Postgres performance.
Overriding the default implementation of findAll() with a native query reduced the time to about 2.8 s, which is still too long. I also tried fetching only a few columns, using a view containing 6 out of the 77 columns, but it didn't change much.
Any suggestions on where to look, or how I can improve performance so that this app will still be usable with a database containing 30 times more records?
P.S. The app is nothing more than a GUI for browsing these records; in addition, every 15 minutes ~50K new records are inserted. My VM has 8 vCPUs and 12 GB of RAM assigned.
I think I found what is causing this delay: it is the count() method. With a mocked count, findAll() takes about 300 ms for the first query, and all subsequent queries take around 12-20 ms.
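For anyone hitting the same wall: a common workaround (not from the original post; the table name my_table is illustrative) is to replace the exact COUNT(*) that backs the page metadata with Postgres's planner estimate, which autovacuum/ANALYZE keeps reasonably fresh and which returns instantly:

-- Approximate row count from the catalog; assumes the table is named my_table
-- and has been analyzed recently (autovacuum normally takes care of this).
select reltuples::bigint as estimated_rows
from pg_class
where relname = 'my_table';

On the application side you could then feed that estimate (or a cached total) into the page metadata instead of letting the repository issue a full COUNT(*) on every request.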
Bit of a challenge here
I have around 45,000 historic .parquet files
partitioned like yyyy/mm/dd (e.g. 2021/08/19); at the dd level I have 24 files (one for each hour).
The columns in each day's files are pretty wide, anything up to 250 columns. The column set has increased and decreased over time, hence the schema drift when trying to load into SQL using mapping data flows, which made the files larger.
I require around 200 of those columns and I know what they are; I even have them in a schema template. The rest are legacy or unwanted.
I'd like to retain the original files in blob as they are, but load only those 200 columns from each file into SQL.
What is the best way to achieve this?
How do I iterate over every file but only take the columns I need?
I tried using a wildcard path
'2021/**/*.parquet'
within mapping data flows to pick up all files in blob, so I don't have to iterate with a ForEach and spin up multiple clusters.
I'm not even sure how to handle this, or whether it should be a copy activity or a mapping data flow;
both have their benefits, but I think I can only use a mapping data flow if I need to transform parts of these files in depth.
Should I be combining the months or even years into a single file and then reading from that, so I can exclude the unwanted columns and take only the ones I want into SQL Server?
Ideally this is a bulk load that needs some refinement when it lands.
Thanks in advance.
Add a data flow to the pipeline and use a Select transformation to choose the columns you wish to propagate. In the Select transformation you can create pattern-based (rule-based) mappings to pick the columns you need from each file's schema.
I have a Core Data model with a TrainingDiary entity that contains Day entities. I have set up a couple of test Training Diaries: one with 32 Days in it, and the other a full import of a live Training Diary which has 5,103 Days in it. Days have relationships to each other for yesterday and tomorrow.
I have a derived property on Day which takes a value from yesterday and a value from today to calculate a value for today. This worked, but scrolling through the table of days was relatively slow. Since it would be very rare for this value to change after a day has passed, I decided it would be better to store the value and calculate it once.
Thus I have a calculation that can be done on a selected set of days. When a diary is first imported it would be performed on all days, but after that it would, in the vast majority of cases, only be calculated on the most recent Day.
When I check the performance of this on my test data I get the following:
Calculating all 32 days in the 32-day Training Diary takes 0.4916 seconds, of which 0.4866 seconds is spent setting the values on the Core Data model.
Calculating 32 days in the 5,103-day Training Diary takes 47.9568 seconds, of which 47.9560 seconds is spent setting the values on the Core Data model.
The Core Data model is not being saved. I wondered whether there were loads of notifications going on, so I removed the observer I had set on NSNotification.Name.NSManagedObjectContextObjectsDidChange and removed my implementation of keyPathsForValuesAffectingValue(forKey:), but it made no difference.
I can import the whole diary from JSON in far less time - that involves creating all the objects and setting every value.
I am at a loss as to why this is. Does anyone have any suggestions on what it could be, or on steps to investigate it?
[I'm using Xcode 9 and Swift 4]
I have this legacy Job that runs every 5 minutes (it runs a stored procedure that does a MERGE).
It ran perfectly for the last couple of years… But the issue started when the SOURCE for the MERGE changed from 1,000 rows to 200,000 rows…
Did anyone go through something similar? This job used to take at most 2 minutes… Now sometimes it takes 2-3 hours to run…
Below is a screenshot of the execution plan for the biggest merge (the SP has 4 merges in total…). I tried to use the execution plan to troubleshoot, but it was misleading… Even though it says (relative to the batch) = 0%, this is the portion of the code that takes 90% of the time.
Thanks for all your replies; you have been really helpful…
The Job (stored procedure) involves 9 tables, with 4 merges and lots of joins and updates after the merges…
(Publishing it all would be way too long).
Finally, I have narrowed the issue down to the complexity of the SP (overall it does 4 merges and 9 joins, and updates 3 more tables at the end):
And it all depends on this:
declare @TableVariable as SF_OpportunityMerge
insert @TableVariable
select (20 columns)
from [Remote_Server].[Salesforce].[dbo].[Opportunity]
where UpdateDate > @LastUpdateDate
When this select (which is the base for most of the merges) returns 10-20 rows, the Job runs in seconds; when it returns half a million rows it takes around 2 hours, and it makes perfect sense: there is no blocking and no missing indexes (the table variable has an index on opportunity)…
Inside the job, the @TableVariable is used several times (including in the merges), so the more rows it returns, the more its duration increases, seemingly exponentially…
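Not from the original thread, but a common mitigation for this exact pattern is to swap the table variable for a temp table (or add OPTION (RECOMPILE) to the statements that consume it), because SQL Server keeps statistics on temp tables and will then cost the downstream merges for 500K rows instead of an assumed handful. A rough sketch, with illustrative column names:

-- Stand-in for the SP parameter; in the real procedure @LastUpdateDate already exists.
declare @LastUpdateDate datetime = '2021-01-01';

-- A temp table gets real statistics, unlike a table variable.
create table #Opportunity
(
    OpportunityId varchar(18) not null primary key,  -- illustrative key column
    UpdateDate    datetime    not null
    -- ... the remaining columns from SF_OpportunityMerge
);

insert #Opportunity (OpportunityId, UpdateDate /* , ... */)
select OpportunityId, UpdateDate /* , ... */
from [Remote_Server].[Salesforce].[dbo].[Opportunity]
where UpdateDate > @LastUpdateDate;

-- Alternative: keep @TableVariable but append OPTION (RECOMPILE) to each
-- MERGE/UPDATE that reads from it, so each plan is built with the actual row count.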
If processing 1,000 rows takes 2 minutes, then processing 200,000 rows should take roughly 6-7 hours if it scaled linearly. So performance has not really changed dramatically - it was just bad to begin with. Dig into the real issue: why does your SP take up to 2 minutes to process a mere 1,000 rows? What is it doing? How many indexes do you have on your tables? Are there any clustered indexes on GUIDs? Any indexed views dependent on these tables? Triggers?
We use Postgres for analytics (star schema).
Every few seconds we get reports on ~500 metrics types.
The simplest schema would be:
timestamp metric_type value
78930890 FOO 80.9
78930890 ZOO 20
Our DBA has come up with a suggestion to flatten all reports from the same 5-second window into:
timestamp metric1 metric2 ... metric500
78930890 90.9 20 ...
Some developers push back on this, saying it adds huge complexity to development (batching the data so it is written in one shot) and to maintainability (just looking at the table or adding fields is more complex).
Is the DBA's model standard practice in such systems, or only a last resort once the original model is clearly not scalable enough?
EDIT: the end goal is to draw a line chart for the users. So queries will mostly be selecting a few metrics, folding them by hour, and selecting min/max/avg per hour (or any other time period).
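For illustration only (the table and column names are assumptions based on the sample schema above), a typical chart query against the narrow schema might look like:

-- Hourly min/max/avg for two metrics; assumes an epoch-seconds integer timestamp
-- and a table named metrics(timestamp, metric_type, value).
select
    (timestamp / 3600) * 3600 as hour_bucket,
    metric_type,
    min(value) as min_value,
    max(value) as max_value,
    avg(value) as avg_value
from metrics
where metric_type in ('FOO', 'ZOO')
  and timestamp >= 78900000 and timestamp < 78990000
group by 1, 2
order by 1, 2;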
EDIT: the DBA arguments are:
This is relevant from day 1 (see below), but even if it were not, it is something the system will eventually need to do, and migrating from another schema will be a pain
Reducing the number of rows by up to 500x will allow more efficient indexes and memory usage (the table will contain hundreds of millions of rows before this optimization)
When selecting multiple metrics, the suggested schema will allow one pass over the data instead of a separate query for each metric (or some complex combination of OR and GROUP BY)
EDIT: 500 metrics is an "upper bound", but in practice most of the time only ~40 metrics are reported per 5 seconds (not the same 40 each time, though)
The DBA's suggestion isn't totally unreasonable if the metrics are fairly fixed, and make sense to group together. A couple of problems you'll likely face, though:
Postgres has a limit of between 250 and 1,600 columns (depending on data type)
The table will be hard for developers to work with, especially if you often want to query for only a subset of the attributes
Adding new columns will be slow
Instead, you might want to consider using an HSTORE column:
CREATE EXTENSION IF NOT EXISTS hstore;

CREATE TABLE metrics (
    timestamp INTEGER,
    "values"  HSTORE  -- quoted because VALUES is a reserved word
);
This will give you some flexibility in storing attributes, and allows for indices. For example, to index just one of the metrics:
CREATE INDEX metrics_metric3 ON metrics (("values"->'metric3'));
One drawback of this is that hstore values can only be text strings, so if you need to do numeric comparisons, a JSON column might also be worth considering:
CREATE TABLE metrics (
    timestamp INTEGER,
    "values"  JSON
);

-- ->> extracts the value as text, which can be indexed directly
CREATE INDEX metrics_metric3 ON metrics (("values"->>'metric3'));
The drawback here is that you'll need to use Postgres 9.3, which is still reasonably new.
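As a usage sketch (again assuming the table and column names above), the extracted values come back as text, so numeric aggregation goes through a cast:

-- Hourly average of one metric stored in the JSON column; ->> returns text,
-- so cast to numeric before aggregating.
SELECT
    (timestamp / 3600) * 3600 AS hour_bucket,
    avg(("values"->>'metric3')::numeric) AS avg_metric3
FROM metrics
GROUP BY 1
ORDER BY 1;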