Save result for multiple requests - OrientDB

I made a Python batch job that compares a filtered dataset of vertices with a specific vertex. My issue is that I need to execute this batch more than 35,000 times (with the same filtered dataset).
At first, I was querying this filtered dataset every time (30 seconds/request):
SELECT #rid FROM Expression WHERE OUT("identifie").asList().size() > 3
Then, I decided to save all the RIDs from this dataset in a Python variable and select from those RIDs in every request (5 seconds/request):
LET $filtered_expressions = (SELECT FROM [92:0, 92:1, ... , 98:2])
So I'm wondering: can I save this filtered dataset in the database's memory, or something like that, to improve my batch? This query is what costs the most time in every request.
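For reference, here is a minimal sketch of the "run the filter once, reuse the RIDs everywhere" approach described above, assuming the pyorient driver; the class Expression and the edge label identifie come from the question, everything else is a placeholder:
# Minimal sketch, assuming the pyorient driver; host, credentials and the loop body are placeholders.
import pyorient

client = pyorient.OrientDB("localhost", 2424)
client.connect("root", "root_password")              # placeholder credentials
client.db_open("mydb", "reader", "reader_password")  # placeholder database/user

# Run the expensive filter once and keep only the record IDs
# (check your client's default result limit if the dataset is large).
records = client.command(
    "SELECT FROM Expression WHERE out('identifie').asList().size() > 3"
)
rids = [rec._rid for rec in records]                 # e.g. ['#92:0', '#92:1', ...]
rid_list = ", ".join(rids)

# Every subsequent request selects from the cached RIDs instead of re-filtering.
for target_rid in ["#10:0", "#10:1"]:                # hypothetical vertices to compare against
    rows = client.command("SELECT FROM [" + rid_list + "]")
    # ... compare rows against target_rid here, as in the original batch ...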

Related

Scala object to unpivot the data fields

Read RAW_us_deaths.csv and RAW_us_confirmed.csv and develop Scala objects
to unpivot the date fields from the raw files, populating the metrics by date in rows for deaths
and confirmed cases, following the given schema.
Merge case count & death count: join the deaths and confirmed COVID cases datasets
that arrived from the previous step, prepare one dataset that holds both Case_count
and death_count for both US and global data, and write it to a local path.
Remove the records where both case_count and death_count = 0 and write the final output
to a Parquet file.
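The question asks for Scala objects; purely to illustrate the unpivot step, here is a minimal sketch using Spark's stack() expression (shown in PySpark, but the same expression works from Scala; the column names are hypothetical):
# Minimal unpivot sketch with hypothetical column names: the RAW_* files have one
# column per date, and stack() turns those date columns into (date, value) rows.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("unpivot-sketch").getOrCreate()

wide = spark.createDataFrame(
    [("US", "New York", 10, 12)],
    ["Country", "Province", "1/22/20", "1/23/20"],   # hypothetical wide layout
)

long_df = wide.selectExpr(
    "Country",
    "Province",
    "stack(2, '1/22/20', `1/22/20`, '1/23/20', `1/23/20`) as (report_date, death_count)",
)
long_df.show()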

Spark Delta Table Updates

I am working in a Microsoft Azure Databricks environment using Spark SQL and PySpark.
I have a Delta table on a lake where data is partitioned by, say, file_date. Every partition contains files storing millions of records per day, with no primary/unique key. All these records have a "status" column, which contains either NULL (if everything looks good for that specific record) or a not-null value (say, if a particular lookup mapping for a particular column is not found). Additionally, my process has another folder called "mapping", refreshed periodically (let's say nightly, to keep it simple), which is where mappings are looked up.
On a daily basis, there is a good chance that about 100-200 rows get errored out (the status column containing not-null values). From these files, on a daily basis (hence the partitioning by file_date), a downstream job pulls all the valid records and sends them for further processing, ignoring those 100-200 errored records while waiting for the correct mapping file to arrive. In addition to the valid-status records, the downstream job should also check whether a mapping is now available for the errored records and, if so, take them downstream as well (after, of course, updating the data lake with the appropriate mapping and status).
What is the best way to go? Ideally, I would first update the Delta table/lake with the correct mapping and set the status column to "available_for_reprocessing"; my downstream job would then pull the valid data for the day plus the "available_for_reprocessing" data and, after processing, update the status back to "processed". But this seems very difficult to do with Delta.
I was looking at https://docs.databricks.com/delta/delta-update.html, and the update example there only covers a simple update with constant values, not updates driven by multiple tables.
The other, but most inefficient, option is to pull ALL the data (both processed and errored) for, say, the last 30 days, get the mapping for the errored records, and write the dataframe back into the delta lake using the replaceWhere option. This is super inefficient, as we would be reading everything (hundreds of millions of records) and writing everything back just to process, at most, about a thousand records. If you search for deltaTable = DeltaTable.forPath(spark, "/data/events/") at https://docs.databricks.com/delta/delta-update.html, the example provided is only for very simple updates. Without a unique key, it also seems impossible to update specific records. Can someone please help?
I use PySpark, or can use Spark SQL, but I am lost.
If you want to update one column ('status') on the condition that all lookups are now correct for rows where they weren't correct before (i.e. where 'status' is currently incorrect), I think the UPDATE command along with EXISTS can help you solve this. It isn't mentioned in the update documentation, but it works for both DELETE and UPDATE operations, effectively allowing you to update/delete records based on joins.
For your scenario, I believe the SQL command would look something like this:
UPDATE your_db.table_name AS a
SET status = 'correct'
WHERE EXISTS
(
    SELECT *
    FROM your_db.table_name AS b
    JOIN lookup_table_1 AS t1 ON t1.lookup_column_a = b.lookup_column_a
    JOIN lookup_table_2 AS t2 ON t2.lookup_column_b = b.lookup_column_b
    -- ... add further lookups if needed
    WHERE
        b.status = 'incorrect' AND
        -- correlate the subquery with the row being updated
        a.lookup_column_a = b.lookup_column_a AND
        a.lookup_column_b = b.lookup_column_b
)
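Since the question mentions PySpark, the same statement can also be submitted programmatically through spark.sql; a minimal sketch, keeping the placeholder table and column names from the SQL above:
# Minimal PySpark sketch: run the UPDATE ... WHERE EXISTS statement above via spark.sql.
# your_db.table_name and the lookup table/column names are placeholders from this answer.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sql("""
    UPDATE your_db.table_name AS a
    SET status = 'correct'
    WHERE EXISTS (
        SELECT *
        FROM your_db.table_name AS b
        JOIN lookup_table_1 AS t1 ON t1.lookup_column_a = b.lookup_column_a
        WHERE b.status = 'incorrect'
          AND a.lookup_column_a = b.lookup_column_a
    )
""")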
Merge did the trick...
MERGE INTO deptdelta AS maindept
USING updated_dept_location AS upddept
ON upddept.dno = maindept.dno
WHEN MATCHED THEN UPDATE SET maindept.dname = upddept.updated_name, maindept.location = upddept.updated_location
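Since the question says PySpark is also an option: the same MERGE can be written with the Delta Lake Python API. A minimal sketch, assuming deptdelta lives at a placeholder path and updated_dept_location is available as a table or DataFrame:
# Minimal PySpark sketch of the MERGE above, using the Delta Lake Python API.
# The table path and the updated_dept_location source are placeholders.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
updated_dept_location = spark.table("updated_dept_location")   # any DataFrame with dno, updated_name, updated_location

maindept = DeltaTable.forPath(spark, "/mnt/lake/deptdelta")    # placeholder path to the deptdelta table
(maindept.alias("maindept")
    .merge(updated_dept_location.alias("upddept"), "upddept.dno = maindept.dno")
    .whenMatchedUpdate(set={
        "dname": "upddept.updated_name",
        "location": "upddept.updated_location",
    })
    .execute())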

How to update a local table remotely?

I have a large table on a remote server with an unknown number (millions) of rows of data. I'd like to fetch the data in batches of 100,000 rows at a time, update my local table with each fetched batch, and repeat until all rows have been fetched. Is there a way I can update a local table remotely?
Currently I have a dummy table called t on the server along with the following variables...
t:([]sym:1000000?`A`B`Ab`Ba`C`D`Cd`Dc;id:1+til 1000000)
selector:select from t where sym like "A*"
counter:count selector
divy:counter%100000
divyUP:ceiling divy
and the below function on the client, along with the variable index set to 0 and normTable, which is an empty copy of the remote table (schema only)...
index:0
normTable:h"0#t"
batches:{[idx;divy;anty;seltr]
if[not idx=divy;
batch:select[(anty;100000)] from seltr;
`normTable upsert batch;
idx+::1;
divy:divy;
anty+:100000;
seltr:seltr;
batches[idx;divy;anty;seltr]];
idx::0}
I call that function using the following command...
batches[index;h"divyUP";0;h"selector"]
The problem with this approach, though, is that h"selector" fetches all the rows of data at once (and multiple times: once for each batch of 100,000 that it upserts to my local normTable).
I could move the batches function to the remote server, but then how would I update my local normTable remotely?
Alternatively, I could break the rows into batches on the server and then pull each batch individually. But if I don't know how many rows there are, how do I know how many variables are required? For example, the following would work, but only up to the first 400k rows...
batch1:select[100000] from t where sym like "A*"
batch2:select[100000 100000] from t where sym like "A*"
batch3:select[200000 100000] from t where sym like "A*"
batch4:select[300000 100000] from t where sym like "A*"
Is there a way to generate the batchX variables automatically, so that the number of variables created equals divyUP?
I would suggest a few changes, as you are trying to connect to a remote server:
Do not run synchronous requests, as that slows down the server's processing. Make asynchronous requests using callbacks instead.
Do not do a full table scan (for heavy comparisons) in each call, especially for regex matching. It is possible that most of the data will still be in the cache on the next call, but that is not guaranteed, and the scans will again impact the server's normal operations.
Do not make data requests in bursts. Either use a timer or issue the next data request only when the last batch has arrived.
The approach below is based on the above suggestions. It avoids scanning the full table for columns other than the index column (which is lightweight) and makes the next request only when the last batch has arrived.
Create Batch processing function
This function will run on the server; it reads a small batch of data from the table using row indices and returns the required data.
q) batch:{[ind;s] ni:ind+s; d:select from t where i within (ind;ni), sym like "A*";
neg[.z.w](`upd;d;$[ni<count t;ni+1;0]) }
It takes two arguments: the starting index and the batch size to work on.
This function finally calls the upd function on the local machine asynchronously, passing two arguments:
the data from the current batch request, and
the table index from which to start the next batch (0 is returned once all rows are done, to stop further batch processing).
Create Callback function
The result from the batch processing function arrives in this function.
If index > 0, it means there is more data to process and the next batch should start from this index.
q) upd:{[data;ind] t::t,data;if[ind>0;fetch ind]}
Create Main function to start process
q)fetch:{[ind] h (batch;ind;size)}
Finally, open the connection, create the table variable, and run the fetch function.
q) h:hopen `:server:port
q) t:()
q) size:100
q) fetch 0
Now, the above method is based on the assumption that the server table is static. If it is being updated in real time, then changes would be required depending on how the table is updated on the server.
Also, other optimizations can be made depending on the attributes set on the remote table, which can improve performance.
If you're OK with sending sync messages, it can be simplified to something like:
{[h;i]`mytab upsert h({select from t where i in x};i)}[h]each 0N 100000#til h"count t"
You can easily change it to control the number of batches (rather than the batch size) by instead using 10 0N# (that would do it in 10 batches).
Rather than having individual variables, the cut function can split the result of the select into chunks of 100,000 rows. Indexing into each element gives a table.
batches:100000 cut select from t where sym like "A*"

Java 8 Entity List Iterator consume too much time to process the query

I have a table "items" in PostgreSQL containing information such as id, name, desc, config, etc.
It contains 1.6 million records.
I need to make a query to get all results, something like "select id, name, description from items".
What is the proper pattern for iterating over large result sets?
I used EntityListIterator:
EntityListIterator iterator = EntityQuery.use(delegator)
.select("id", "name", "description")
.from("items")
.cursorScrollInsensitive()
.queryIterator();
int total = iterator.getResultsSizeAfterPartialList();
List<GenericValue> items = iterator.getPartialList(start+1, length);
iterator.close();
The start here is 0 and the length is 10.
I implemented this so I can do pagination with Datatables.
The problem with this is that I have millions of records and it takes like 20 seconds to complete.
What can I do to improve the performance?
If you are implementing pagination, you shouldn't load all 1.6 million records into memory at once. Use ORDER BY id in your query and restrict the id range (0 to 10, 10 to 20, etc.) in the WHERE clause. Keep a counter that records up to which id you have traversed, as in the sketch below.
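To illustrate just the pagination idea (not the OFBiz API): a minimal keyset-pagination sketch, shown in Python with psycopg2 as an assumed client; the items table and its columns come from the question, the connection details are placeholders.
# Keyset pagination sketch: each page continues from the last id already returned,
# so the database never has to materialize the full 1.6M-row result set.
import psycopg2

def fetch_page(cur, last_id, page_size=10):
    cur.execute(
        "SELECT id, name, description FROM items"
        " WHERE id > %s ORDER BY id LIMIT %s",
        (last_id, page_size),
    )
    return cur.fetchall()

conn = psycopg2.connect("dbname=mydb user=me password=secret")   # placeholder connection details
cur = conn.cursor()
first_page = fetch_page(cur, last_id=0)
next_start = first_page[-1][0] if first_page else None            # start of the next page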
If you really want to pull all records in memory, then just load the first few pages' records (e.g. from id=1 to id=100), return it to the client, and then use something like CompletableFuture to asynchronously retrieve the rest of the records in the background.
Another approach is to run multiple small queries in separate threads, depending on how many parallel reads your database supports, and then merge the results.
What about CopyManager? You could fetch your data as a text/CSV output stream; it might be faster to retrieve that way.
CopyManager cm = new CopyManager((BaseConnection) conn);
String sql = "COPY (SELECT id, name, description FROM items) TO STDOUT WITH DELIMITER ';'";
try (Writer out = new BufferedWriter(new FileWriter("C:/export_transaction.csv"))) {
    cm.copyOut(sql, out);   // stream the result straight to a CSV file and close the writer
}

Data Lake Analytics - Large vertex query

I have a simple query which makes a GROUP BY using two fields:
#facturas =
SELECT a.CodFactura,
Convert.ToInt32(a.Fecha.ToString("yyyyMMdd")) AS DateKey,
SUM(a.Consumo) AS Consumo
FROM #table_facturas AS a
GROUP BY a.CodFactura, a.DateKey;
#table_facturas has 4100 rows, but the query takes several minutes to finish. Looking at the graph explorer, I see it uses 2500 vertices because I have 2500 unique CodFactura+DateKey rows. I don't know if this is normal ADLA behaviour. Is there any way to reduce the number of vertices and execute this query faster?
First: I am not sure your query will actually compile. You would need the Convert expression in your GROUP BY, or compute it in a previous SELECT statement.
Secondly: In order to answer your question, we would need to know how the full query is defined. Where does #table_facturas come from? How was it produced?
Without this information, I can only give some wild speculative guesses:
If #table_facturas is coming from an actual U-SQL table, your table may be over-partitioned/fragmented. This could be because:
you originally inserted a lot of data with a distribution on the grouping columns, and you either have a predicate that reduces the number of rows per partition and/or you do not have up-to-date statistics (run CREATE STATISTICS on the columns).
you ran a lot of INSERT statements, each inserting a small number of rows into the table, thus creating a large number of individual files. This will "scale out" the processing as well. Use ALTER TABLE REBUILD to recompact.
If it is coming from a fileset, you may have too many small files in the input. See if you can merge them into fewer, larger files.
If the above does not help, you can also try hinting a small number of rows in the query that creates #table_facturas by adding OPTION(ROWCOUNT=4000).