Image and File store - HBase, MongoDB or Cassandra

I want to build a distributed (across continents), fault-tolerant and fast image and file store. There would be a REST endpoint in front of the storage which would serve the images and/or files.
The images or files are stored/inserted from a central location but served from a web server installed on the local intranet, which authenticates and authorises the user.
One object can have multiple sizes of the same image and probably files related to it. Using the mentioned storage gives me the ability to choose the column family and/or column qualifier to fetch the requested entity.
I did consider the file system; however, to retrieve the requested entity I would either need to know the correct path from the DB or the path would have to be intelligently designed, which also means creating folders when a new year begins.
One entity can have different sizes (thumbnail, grid, preview, etc.) for different years.
The request to get the image would look like -
entityId 123
year 2017
size thumbnail
The request to get all available images for a given entity and year would look like -
entityId 123
year 2017
I am open to any other storage solution as long as the above is achievable. Thank you for your help and suggestions.

You could do as you suggest and build a filesystem table like this:
cqlsh> use keyspace1;
cqlsh:keyspace1> create table filesystem(
... entitiyId int,
... year int,
... size text,
... payload blob,
... primary key (entitiyId, year, size));
cqlsh:keyspace1> insert into filesystem (entitiyId, year, size, payload) values (1,2017,'small',textAsBlob('payload'));
cqlsh:keyspace1> insert into filesystem (entitiyId, year, size, payload) values (1,2017,'big',textAsBlob('payload'));
cqlsh:keyspace1> insert into filesystem (entitiyId, year, size, payload) values (1,2016,'small',textAsBlob('payload'));
cqlsh:keyspace1> insert into filesystem (entitiyId, year, size, payload) values (1,2016,'big',textAsBlob('payload'));
cqlsh:keyspace1> insert into filesystem (entitiyId, year, size, payload) values (2,2016,'small',textAsBlob('payload'));
cqlsh:keyspace1>
cqlsh:keyspace1>
cqlsh:keyspace1> select * from filesystem where entitiyId=1 and year=2016;
 entitiyid | year | size  | payload
-----------+------+-------+------------------
         1 | 2016 |   big | 0x7061796c6f6164
         1 | 2016 | small | 0x7061796c6f6164

(2 rows)
cqlsh:keyspace1>
and
cqlsh:keyspace1> select * from filesystem where entitiyId=1 and year=2016 and size='small';
 entitiyid | year | size  | payload
-----------+------+-------+------------------
         1 | 2016 | small | 0x7061796c6f6164

(1 rows)
cqlsh:keyspace1>
What you can't do with this approach is select images for a specific size and id without specifying the year.
For related files you could build a list with the foreign entitiyIds or a separate grouping table to keep them together.
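A grouping table like that could be keyed by a group id; a minimal sketch (the table and column names here are assumptions, not from the question):
create table entity_group(
    groupId int,
    entityId int,
    primary key (groupId, entityId));
-- all entities that belong to one group; each entityId can then be looked up in the filesystem table
select entityId from entity_group where groupId = 42;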
The Cassandra blob type has a theoretical limit of 2 GB, but if you need performance the practical limit is about 1 MB, or in rare cases a few MB (performance degrades in many ways with bigger blobs). If that's not a problem, go ahead and just try it out.
Another idea is to use something like AWS S3 with cross-region replication enabled for the actual data, and Cassandra for the metadata. But if someone goes to AWS anyway, they also have EFS with cross-region replication.
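If you went the S3 route, the Cassandra side could shrink to a metadata table that only stores pointers into the bucket; a rough sketch with made-up column names and an illustrative key layout:
create table filesystem_meta(
    entityId int,
    year int,
    size text,
    s3_key text,          -- e.g. 'images/<entityId>/<year>/<size>.jpg' in the replicated bucket
    content_type text,
    primary key (entityId, year, size));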
MongoDB is also easily deployed with cross-region replication (https://docs.mongodb.com/manual/tutorial/deploy-geographically-distributed-replica-set/). In MongoDB you could keep all your data in one document and just query the relevant parts of it. In my opinion MongoDB requires more housekeeping than Cassandra does (there is more configuration and planning necessary).

Related

PostgreSQL Table size and partition consideration

I am working on a use case where the initial data load for the name table in a PostgreSQL DB will be around 650 million rows, with an average row size of 0.6 KB, bringing the table size up to roughly 400 GB. After that there could be up to 20,000 inserts or updates on a daily basis.
I am new to PostgreSQL and want to check whether I should consider partitioning given the table size.
Updating with some information from the comments section:
It is an OLTP application for identity resolution of business names. This is one specific table where all the business names are stored along with metadata such as start date and end date, and any incoming name is matched against existing names to identify whether it is related to another business. This table is updated throughout the day using batch files from different data sources.
Also, we are not planning to expire or remove any data from this table.
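If partitioning does look worthwhile later, declarative range partitioning in PostgreSQL (10+) would be roughly along these lines; the layout and the choice of start_date as partition key are illustrative assumptions, not a recommendation:
create table business_name (
    name        text not null,
    start_date  date not null,
    end_date    date
) partition by range (start_date);
create table business_name_2023 partition of business_name
    for values from ('2023-01-01') to ('2024-01-01');
Since nothing is ever purged, the benefit would mainly be smaller per-partition indexes and partition pruning on filtered queries, rather than easier data removal.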

Can I build a blob in Postgres?

I have a scenario where a db has a few hundred million rows, keeping a history of a few weeks only.
In this database are different products (tradable instruments), each with about 72k rows of data per hour, and there are roughly 30 products.
The data is always requested in blocks of 1 hour, aligned on the hour. For example, "I want data for X from 2pm to 3pm."
This data is processed by several tools and the requests are very demanding for the database.
Each tool does its own disk caching, building a binary blob for each hour.
But I was wondering if it would be possible to build these directly in Postgres?
Data is indexed by timestamp and is written in a linear fashion as the writes represent live data.
So it would be possible to detect with a trigger that we just crossed an hour.
Would it be possible, when we detect this, to get all this data, build a binary blob out of it and save it in its own table? The data is simply all the columns one after another in binary form. They're all numbers, no strings, etc., so the alignment/format is very simple and rigid.
In practice the rows are like this:
instrument VARCHAR NOT NULL,
ts TIMESTAMP WITHOUT TIME ZONE NOT NULL,
quantity FLOAT8 NOT NULL,
price FLOAT8 NOT NULL,
direction INTEGER NOT NULL
and I would like, at the end of each hour, to build a byte array like this:
0 4 12 20 24
|ts|quantity|price|direction|ts|quantity|price|direction...
with every row of the hour. Build one blob per instrument and write it in a table like this:
instrument VARCHAR
ts TIMESTAMP
blob BYTEA
My questions are:
Is this possible? Or would it be very inefficient to aggregate and save 30 (products) * 72k rows each at the end of every hour?
Is there anywhere I could find an example leading me toward this?
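As a starting point, one rough sketch of the hourly packing done entirely in SQL; the table names ticks and hourly_blob are placeholders, the 4-byte timestamp is stored as epoch seconds to fit the layout above, and the *send functions emit network byte order:
insert into hourly_blob (instrument, ts, blob)
select instrument,
       date_trunc('hour', ts) as ts,
       string_agg(
           int4send(extract(epoch from ts)::int)   -- 4-byte timestamp (epoch seconds)
           || float8send(quantity)                 -- 8-byte quantity
           || float8send(price)                    -- 8-byte price
           || int4send(direction),                 -- 4-byte direction
           ''::bytea order by ts) as blob
from ticks
where ts >= date_trunc('hour', now()::timestamp) - interval '1 hour'
  and ts <  date_trunc('hour', now()::timestamp)
group by instrument, date_trunc('hour', ts);
Whether this is kicked off by a trigger that notices the hour rollover or by a scheduled job (cron or pg_cron) is a separate choice; either way, one set-based statement over roughly 30 x 72k rows per hour is usually much cheaper than row-by-row processing.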

Union with Redshift native tables and external tables (Spectrum)

If I have a view that contains a union between a native table and an external table, like so:
create view vwPageViews as
select * from PageViews
union all
select * from PageViewsHistory;
PageViews holds data for the last 2 years. The external table holds data older than 2 years.
If a user selects from the view with filters for the last 6 months, how does Redshift Spectrum handle it? Does it read the entire external table even though none of its rows will be returned (and accordingly cost us money for all of it)? (Assuming the S3 files are Parquet based.)
e.g.
select * from vwPageViews where MyDate >= '2021-01-01';
What's the best approach for querying both cold and historical data using RS and Spectrum? Thanks!
How this will happen on Spectrum will depend on whether or not you have provided partitions for the data in S3. Without partitions (and a where clause on the partition) the Spectrum engines in S3 will have to read every file to determine if the needed data is in any of them. The cost of this will depend on the number and size of the files AND what format they are in. (CSV is more expensive than Parquet for example.)
The way around this is to partition the data in S3 and to have a WHERE clause on the partition value. This will exclude files from needing to be read when they don't match on the partition value.
The rub is in providing the WHERE clause for the partition, as this will likely be less granular than the date or timestamp you are using in your base data. For example, if you partition on YearMonth (YYYYMM) and want a day-level WHERE clause, you will need two parts to the WHERE clause: WHERE date_col >= '2015-07-12' AND part_col >= 201507. How to produce both WHERE conditions will depend on your solution around Redshift.
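Concretely, the partitioned external table and the two-part predicate might look like this; the bucket, column and partition names are assumptions, and it presumes the view also exposes the partition column (the native side can derive it from MyDate):
create external table spectrum.pageviewshistory (
    mydate  date,
    url     varchar(1024),
    views   bigint
)
partitioned by (yearmonth int)
stored as parquet
location 's3://my-bucket/pageviews-history/';
select *
from vwPageViews
where MyDate >= '2021-01-01'
  and yearmonth >= 202101;   -- partition predicate lets Spectrum skip non-matching S3 prefixes
Each yearmonth value also has to be registered with ALTER TABLE ... ADD PARTITION (or discovered through the Glue catalog) before Spectrum can prune on it.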

Oracle Gather Statistics after Partition Exchange

Database: Oracle 12c
I have a process that selects data from a fact table, summarizes it and pushes it to a summary table.
The summary table is range partitioned (trade date) and list subpartitioned (file id).
The process picks up data from the fact table (where file_id=<> for all trade dates), summarizes it in a temp table and uses partition exchange to move the data from the temp table to one of the subpartitions in the summary table (as the process works at a file id level).
The summary table is completely refreshed every day (100% of the data is exchanged).
Before the data is exchanged at the subpartition level, statistics are gathered and exchanged along with the data.
After the process is completed, we run dbms_stats.gather_table_stats at the partition level (in a for loop, once per partition) with granularity set to "APPROX_GLOBAL AND PARTITION".
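(For reference, each iteration of that loop presumably amounts to something like the following; the owner, table and partition names are placeholders.)
begin
  dbms_stats.gather_table_stats(
    ownname     => 'APP_OWNER',
    tabname     => 'SUMMARY_TABLE',
    partname    => 'P_CURRENT',        -- the partition being processed in the loop
    granularity => 'APPROX_GLOBAL AND PARTITION',
    cascade     => TRUE);
end;
/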
Even though we collect stats at the global level, user_tab_statistics shows STALE_STATS = 'YES' for the summary table; however, partition and subpartition stats are available.
When we run a query against the summary table (for a date range of 3 years), the query spins for a long time, spiking the CPU to 90%, but never returns any data.
I checked the explain plan for the query; the cardinality shows as 1.
I read about incremental stats, but it seems incremental gathering helps when only a few partitions change; it may not be the best option in my case, where the data across all partitions changes completely.
I am looking for a strategy to gather statistics on the summary table without running a full gather stats.
Thanks.

Best way to query 4 B+ records in Tableau

I am looking for the best way to analyse 4B records (1 TB of data) stored in Vertica using Tableau. I tried using an extract of 1M records, which works perfectly, but I don't know how to manage 4B records, because querying 4B records takes too long.
I have the following dataset:
timestamp id url domain keyword nor_word cat_1 cat_2 cat_3
So here I need to create a descending top-10 list for each of id, url, domain, keyword, nor_word, cat_1, cat_2 and cat_3, based on the count of each field value, each in a separate worksheet, and then combine all the worksheets in one dashboard.
There is no primary key. This dataset covers 1 month, so I want to add a global filter on start date and end date to reduce the query size, but I don't know how to create a global date filter and display it on the dashboard.
You have two questions, one about Vertica and one about Tableau. You should split these up.
Regarding Vertica, you need to know that Vertica stores data in ascending sort order in physical storage. This means that an additional step will always be required anytime you want to get a descending sort order.
I would suggest creating a partition on the date, and subsequently running Database Designer (DBD) in incremental mode and using your queries as samples. By partitioning the data, Vertica can eliminate the partitions during optimization.
Running the DBD will generate some better optimized projections. You should consider the trade-off between how often you will need this data and whether it's worth creating these additional projections as it will impact your load performance.
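A minimal sketch of the Vertica side, with made-up table and column names; the second query is the kind of top-10 sample you would feed to the DBD:
alter table clickstream partition by "timestamp"::date reorganize;
select domain, count(*) as cnt
from clickstream
where "timestamp" >= '2017-01-01' and "timestamp" < '2017-02-01'
group by domain
order by cnt desc
limit 10;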