SQLite vs Memory - iphone

I have a situation with my app.
Suppose I have 6 users, and each user can have up to 9 score entries (e.g. a score of 1000 points at 8:00pm with 3 gold and 4 silver collected, etc.), say one score per stage across 9 stages.
All these scores come from an API call, so they can be updated at intervals of 3+ minutes.
The operations I need to perform on this data are:
find the nearest min and max records for stage 4,
and some more operations, like adding or subtracting two scores, etc.
All 6 users and their score records are already in the database, and are updated as needed after the API call.
Now my question is:
Is it better, for this kind of data (scores here), to keep all the data for all 6 users in memory in an NSArray or NSDictionary, and find the min and max in that array with a min-max algorithm?
OR
Should it instead be fetched from the database with queries like " WHERE score<=200 " and " WHERE score>=200 ", in short, 2 database queries that each return the nearest min or max record, without keeping all the data in memory?
We are focusing on both speed and memory usage. The point is: would a DB call be fast and efficient for finding the min and max, or would searching for the min and max in an array of all the records from the DB be better?
The total number of records is at most 6 users * 9 scores each = 54.
Records are updated every 3+ minutes.
Min/max lookups for particular values happen frequently.
Please ask, if any more details are required.
Thanks in advance.

You're working with such a small amount of data that I wouldn't imagine it would be worth worrying about. Do whichever method makes your development process easiest!
Edit:
If I had a lot of data (hundreds of competitors) I'd use SQLite. You can do queries like the following:
SELECT MIN(`score`) FROM `T_SCORE` WHERE `stage` = '4';
That way you can let the database handle doing the calculation for you, so you never have to fetch all the results.
My SQL-fu isn't the most awesome, but I think you can also do this:
SELECT `stage`, MIN(`score`) AS min, MAX(`score`) AS max FROM `T_SCORE` GROUP BY `stage`
That would do all the calculations in one single query.
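To make the comparison concrete, here is a minimal sketch in Python using the standard sqlite3 module rather than the iOS APIs. It shows both options from the question: two queries for the nearest records around a target score, and the same lookup over an in-memory list. The `user_id` column, the sample data, and the function names are assumptions made up for illustration; only `T_SCORE`, `stage` and `score` come from the queries above.

```python
import sqlite3

# Hypothetical schema: only T_SCORE, stage and score appear in the answer above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T_SCORE (user_id INTEGER, stage INTEGER, score INTEGER)")
# 6 users * 9 stages = 54 made-up rows.
rows = [(u, s, 100 * u + 10 * s) for u in range(1, 7) for s in range(1, 10)]
conn.executemany("INSERT INTO T_SCORE VALUES (?, ?, ?)", rows)

def nearest_scores_db(conn, stage, target):
    """Closest score <= target and closest score >= target, via two queries."""
    below = conn.execute(
        "SELECT MAX(score) FROM T_SCORE WHERE stage = ? AND score <= ?",
        (stage, target),
    ).fetchone()[0]
    above = conn.execute(
        "SELECT MIN(score) FROM T_SCORE WHERE stage = ? AND score >= ?",
        (stage, target),
    ).fetchone()[0]
    return below, above

def nearest_scores_in_memory(scores, target):
    """Same lookup over a plain list; None when no score qualifies."""
    below = max((s for s in scores if s <= target), default=None)
    above = min((s for s in scores if s >= target), default=None)
    return below, above

# Per-stage min and max in one query, as in the GROUP BY example.
stage_ranges = conn.execute(
    "SELECT stage, MIN(score), MAX(score) FROM T_SCORE GROUP BY stage"
).fetchall()

stage4 = [score for _, stage, score in rows if stage == 4]
db_result = nearest_scores_db(conn, 4, 300)
mem_result = nearest_scores_in_memory(stage4, 300)
```

With only 54 rows, both versions do a trivial amount of work, which is why either approach is fine here.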

Is it possible to limit the number of rows in the output of a Dataprep flow?

I'm using Dataprep on GCP to wrangle a large file with a billion rows. I would like to limit the number of rows in the output of the flow, as I am prototyping a Machine Learning model.
Let's say I would like to keep one million rows out of the original billion. Is it possible to do this with Dataprep? I have reviewed the documentation on sampling, but that only applies to the input of the Transformer tool and not the outcome of the process.
You can do this, but it takes a bit of extra work in your Recipe. Set up a formula in a new column using something like RANDBETWEEN to give you a random integer between 1 and 1,000 (in this million-out-of-a-billion case). From there, filter the rows to keep only those that match one chosen integer between 1 and 1,000, and your output will contain only that randomized subset. Have the last step of the recipe remove this temporary column.
So indeed there are 2 approaches to this.
As Courtney Grimes said, you can use one of the 2 functions that create random-number out of a range.
randbetween :
rand :
These methods can be used to slice an "even" portion of your data. As suggested, compute randbetween(1,1000), then pick a single value x (1 <= x <= 1000) to filter on, because that keeps about 1/1000 of the data (a million out of a billion).
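The random-bucket idea is not specific to Dataprep. As a rough illustration in plain Python (this is not Dataprep recipe syntax; the helper name `sample_one_in_n`, its parameters and the seed are assumptions for this sketch), each row gets a random bucket between 1 and n, and only the rows that land in one chosen bucket are kept:

```python
import random

def sample_one_in_n(rows, n=1000, keep=1, seed=None):
    """Keep roughly 1/n of the rows by giving each a random bucket in 1..n."""
    rng = random.Random(seed)
    kept = []
    for row in rows:
        bucket = rng.randint(1, n)  # plays the role of RANDBETWEEN(1, n)
        if bucket == keep:          # filter on a single bucket value
            kept.append(row)        # the bucket is never stored with the row
    return kept

# 100,000 stand-in rows; about 100 of them are expected to survive.
subset = sample_one_in_n(range(100_000), n=1000, seed=0)
```

The output size is only approximately rows/n, so this suits prototyping where an exact row count does not matter.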
Alternatively, if you just want to have a million records in your output, but either
don't want to rely on knowing the size of the entire table, or
just want the first million rows, regardless of how many rows there are -
You can just use 2 of these 3 row filtering methods: (top rows\ range)
P.S
By understanding the $sourcerownumber metadata parameter (see the in-product documentation), you can filter\keep a portion of the data (as per the first scenario) in 1 step (i.e. without creating an additional column).
BTW, an easy way to discover how-to's in Trifacta is to type what you're looking for in the "search-transformation" pane (accessed via ctrl-k). By searching "filter", you'll get most of the relevant options for your problem.
Cheers!

Statistical query to loop through different date periods

I have a massive query log table in PostgreSQL. I have been asked to get statistical data from it, but the table is sooooo massive. It has about 170000000 rows in it.
I've been asked for statistics for the last 6 months, showing the count of each service per day.
The issue is that since the table is so big, it will take forever to get this data.
Here's the current query I use:
SELECT ql.query_time::timestamp::date,count(ql.query_name),ql.query_name
FROM query_log ql
WHERE ql.query_time BETWEEN '2017-12-20 14:00:00.000'::timestamp AND '2018-06-20 14:00:00.000'::timestamp AND success=TRUE
GROUP BY ql.query_time::timestamp::date, ql.query_name;
Please suggest how to make this query faster and more efficient. I want to save the output to a CSV file.
I've been thinking of looping through each day for the past 6 months, but I don't know how to do it.
OH, ql.query_time is indexed.
Thx!

Query distinct values from historical database

If I run this query on a large historical database without specifying a date, will KDB be smart enough to retrieve the status values from an index and not bring the database down?
select distinct status from trades
The only way kdb can possibly tell all the distinct statuses is by reading from every partition. Yes, this will take a lot of memory, but unless you want to maintain a cache of all distinct statuses yourself, there is nothing else you can do. As previously mentioned, an attribute will speed the query up, but the query time will still scale with the number of partitions.
To retrieve using an index, kdb provides the `g# attribute. Distinct alone can take more time, depending on the size of your table (it will be a linear search without the `g# attribute).
Check this-> http://code.kx.com/q4m3/8_Tables/#88-attributes
Let's look at simple example:
q) a: 10000000#1 2 3 5
q) b:`g#a
q) \ts distinct a
68 134217888
q) \ts distinct b
0 288
The difference shows that the `g# attribute greatly reduces the time and space taken by the search. This is because the `g# attribute creates and maintains an index on the vector.
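As a rough analogy for why a grouped index helps, here is a small Python sketch (not q, and not kdb's actual implementation). It assumes the index behaves like a map from each distinct value to the positions where it occurs; `build_group_index` is a made-up name for illustration. Once such a map exists, the distinct values are simply its keys, so no pass over the full vector is needed:

```python
from collections import defaultdict

def build_group_index(values):
    """Map each distinct value to the list of positions where it appears."""
    index = defaultdict(list)
    for pos, value in enumerate(values):
        index[value].append(pos)
    return index

def distinct_linear(values):
    """Distinct values without an index: must visit every element."""
    seen = {}
    for value in values:
        seen.setdefault(value, None)
    return list(seen)

values = [1, 2, 3, 5] * 5          # small stand-in for 10000000#1 2 3 5
index = build_group_index(values)  # built once, then kept up to date

distinct_from_index = list(index)  # only touches the keys of the map
distinct_from_scan = distinct_linear(values)
```

The cost moves to building and storing the index, which is worth it when the same lookup is repeated often.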

Best way to query 4 B+ records in Tableau

I am looking for the best way to analyse 4B records (1TB of data) stored in Vertica using Tableau. I tried using an extract of 1M records, which works perfectly, but I don't know how to manage 4B records, because querying all 4B records takes too long.
I have following dataset :
timestamp id url domain keyword nor_word cat_1 cat_2 cat_3
I need to create descending Top 10 lists for id, url, domain, keyword, nor_word, cat_1, cat_2 and cat_3, ranked by the count of each field value, each in a separate worksheet, and then combine all the worksheets in one dashboard.
There is no primary key. This dataset covers 1 month, so I want to add a global start date and end date filter to reduce the query size. But I don't know how to create a global date filter and display it on the dashboard.
You have two questions, one about Vertica and one about Tableau. You should split these up.
Regarding Vertica, you need to know that Vertica stores data in ascending sort order in physical storage. This means that an additional step will always be required anytime you want to get a descending sort order.
I would suggest creating a partition on the date, and subsequently running Database Designer (DBD) in incremental mode and using your queries as samples. By partitioning the data, Vertica can eliminate the partitions during optimization.
Running the DBD will generate some better optimized projections. You should consider the trade-off between how often you will need this data and whether it's worth creating these additional projections as it will impact your load performance.

MongoDB performance issue: Single Huge collection vs Multiple Small Collections

I tested two scenarios, a single huge collection vs multiple small collections, and found a huge difference in query performance. Here is what I did.
Case 1: I created a product collection containing 10 million records for 10 different product types, with exactly 1 million records per product type, and I created an index on ProductType. When I ran a sample query with the conditions ProductType=1 and ProductPrice>100 and limit(10), to return 10 records of ProductType=1 whose price is greater than 100, it took about 35 milliseconds when the collection had a lot of products priced above 100, and the same query took about 8000 milliseconds (8 seconds) when very few products in ProductType=1 were priced above 100.
Case 2: I created 10 separate product collections, one per ProductType, each containing 1 million records. In collection 1, which contains the records for ProductType 1, I ran the same sample query with the conditions ProductPrice>100 and limit(10), to return 10 products whose price is greater than 100. It took about 2.5 milliseconds when the collection had a lot of products priced above 100, and about 1500 milliseconds (1.5 seconds) when very few products were priced above 100.
So why is there so much difference? The only difference between case 1 and case 2 is one huge collection vs multiple smaller collections, and I created an index on ProductType for the single huge collection in case 1. I guess the performance difference is caused by the index in the first case, but I need that index there, otherwise performance would be even worse. I expected the first case to be somewhat slower because of the index, but I didn't expect it to be about 10 times slower.
So 8000 milliseconds vs 1500 milliseconds for one huge collection vs multiple small collections. Why?
Separating the collections gives you a free index without any real overhead. There is overhead for an index scan, especially if the index is not really helping you cut down on the number of results it has to scan (if you have a million results in the index, but you have to scan them all and inspect them, it's not going to help you much).
In short, separating them out is a valid optimization, but you should make your indexes better for your queries before you actually decide to take that route, which I consider a drastic measure (an index on product price might help you more in this case).
Using explain() can help you understand how queries work. Some basics are: You want a low nscanned to n ratio, ideally. You don't want scanAndOrder = true, and you don't want BasicCursor, usually (this means you're not using an index at all).
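To illustrate the nscanned point, here is a toy Python simulation (it does not use MongoDB or its drivers). It counts how many index entries are visited for the ProductType=1, ProductPrice>100, limit 10 query, first with an index on type only, then with an index ordered by type and price. The second index layout is an assumption for illustration, as are the function names and the made-up data; the answer above only suggests that an index involving price might help.

```python
import bisect

def scan_type_index(docs, type_index, product_type, min_price, limit):
    """Index on type only: every entry for that type may need inspecting."""
    hits, scanned = [], 0
    for pos in type_index.get(product_type, []):
        scanned += 1
        if docs[pos]["price"] > min_price:
            hits.append(docs[pos])
            if len(hits) == limit:
                break
    return hits, scanned

def scan_type_price_index(docs, sorted_index, product_type, min_price, limit):
    """Index sorted by (type, price): jump straight to the first match."""
    key = (product_type, min_price, float("inf"))
    start = bisect.bisect_right(sorted_index, key)
    hits, scanned = [], 0
    for i in range(start, len(sorted_index)):
        ptype, _price, pos = sorted_index[i]
        if ptype != product_type or len(hits) == limit:
            break
        scanned += 1
        hits.append(docs[pos])
    return hits, scanned

# Made-up data: type 1 has very few products priced above 100.
docs = (
    [{"type": 1, "price": 50} for _ in range(990)]
    + [{"type": 1, "price": 150} for _ in range(10)]
    + [{"type": 2, "price": 200} for _ in range(1000)]
)
type_index = {}
for pos, doc in enumerate(docs):
    type_index.setdefault(doc["type"], []).append(pos)
sorted_index = sorted((d["type"], d["price"], pos) for pos, d in enumerate(docs))

hits_a, scanned_a = scan_type_index(docs, type_index, 1, 100, 10)
hits_b, scanned_b = scan_type_price_index(docs, sorted_index, 1, 100, 10)
```

In this toy data both scans return the same 10 products, but the type-only index inspects all 1000 entries for type 1 while the (type, price) ordering inspects just 10, which is the kind of ratio explain() helps you spot.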