Best approach to build a dashboard [closed] - tableau-api

Looking for some advice... I've just finished an ETL pipeline where all data ends up in Amazon Athena. The data is produced by the click stream of high-volume mobile apps (so essentially it's lots and lots of raw events). I want to build a number of dashboards for the business that show different metrics/KPIs depending on the requirements. However, since we're talking about huge volumes of data, I'm not sure of the best way to do this. Here's an example:
I want a dashboard that shows all the MAUs (monthly active users), along with certain pages that perform particularly well and the most popular navigation routes through the app. My thinking is that I'd want a custom query per graph, i.e. one query that counts the distinct IDs each day (refreshing every 24 hours), another query for a graph that produces a breakdown of counts per page and truncates the list, and so on.
The main reason for thinking this is that otherwise I'd be pulling in huge amounts of raw data just to calculate a simple metric like MAU (I'm not even sure an extract would work, and it certainly wouldn't be efficient).
Is this completely the wrong approach? Any suggestions/feedback?
Thanks in advance!

It sounds like you have multiple unrelated SQL queries that you want to run once per day, and update in Tableau once per day.
There's always a push-pull between processing at the data source and processing in the visualization engine.
Set up a Tableau server extract for each Athena SQL query. Build your dashboards, and schedule your extracts to refresh daily. Like an OLAP cube, this will process all the aggregates your dashboards need with the refresh, for better dashboard performance.
Alternatively, if you feel you don't need all the detail in Tableau, then build your aggregates in SQL, so that your Tableau data sources are smaller.
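
If you go the aggregate-in-SQL route, each graph can be backed by a small summary query that you run against Athena on a schedule, so only a few hundred pre-aggregated rows ever reach Tableau. Below is a minimal sketch of that idea using boto3; the clickstream_events table, its user_id/event_time/page columns, the clickstream database and the S3 results bucket are hypothetical placeholders, not part of your actual pipeline.

# One pre-aggregated query per dashboard graph, submitted to Athena daily.
# Table, column, database and bucket names are hypothetical.
import boto3

MAU_QUERY = """
SELECT date_trunc('month', event_time) AS month,
       COUNT(DISTINCT user_id)         AS monthly_active_users
FROM   clickstream_events
WHERE  event_time >= date_add('month', -12, current_date)
GROUP  BY 1
ORDER  BY 1
"""

PAGE_BREAKDOWN_QUERY = """
SELECT page, COUNT(*) AS views
FROM   clickstream_events
WHERE  event_time >= date_add('day', -30, current_date)
GROUP  BY page
ORDER  BY views DESC
LIMIT  50
"""

def run_athena_query(sql: str) -> str:
    """Submit a query to Athena and return its execution id."""
    athena = boto3.client("athena")
    response = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "clickstream"},                  # hypothetical database
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # hypothetical bucket
    )
    return response["QueryExecutionId"]

if __name__ == "__main__":
    for query in (MAU_QUERY, PAGE_BREAKDOWN_QUERY):
        print("started:", run_athena_query(query))

Each summary returns at most a few hundred rows, so the corresponding Tableau extract stays tiny no matter how much raw event data sits behind it.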

Related

High performant and reliable database [closed]

I'm working on a project that has to handle a huge amount of data.
For this we are looking for a data store to save and fetch large amounts of data. The schema is simple: there is one object for vouchers and a one-to-many relation to transactions. One voucher has roughly 10-100 transactions.
Sometimes the system has to generate several thousand vouchers in a short time, and it may also write or delete several thousand transactions.
It is also very important that the application quickly returns whether a voucher is valid or not (a simple lookup).
I have looked at several blogs to find the best database for this, and the shortlist is:
MongoDB
Elastic Search
Cassandra
My favourite is Elastic Search, but I found several blogs saying ES is not reliable enough to be used as a primary data store.
I also read some blogs saying that MongoDB has problems running in a cluster.
Do you have experience with Cassandra for a job like this? Or do you prefer any other database?
I have some experience with MongoDB, but I'll stay agnostic here.
There are MANY factors that come into play when you say you want a fast database. You have to think about indexing, vertical or horizontal scaling, relational or NoSQL, write performance vs. read performance, and whichever you choose you should also think about read preferences, balancing, networking... The topics range from the DB down to the hardware.
I'd suggest going for a database you know, and that you can scale, administer, and tune well.
In my personal experience, I've had no problems running MongoDB in a cluster (sharding); problems usually come from bad administration or planning, which is why I suggest going with a database you know well.
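
To make the "returns quickly whether a voucher is valid" requirement concrete for the MongoDB option: a unique index on the voucher code turns the validity check into a single index seek regardless of collection size. A minimal sketch with pymongo; the collection and field names (vouchers, code, valid_until) are hypothetical.

# Fast "is this voucher valid?" lookups backed by a unique index.
# Collection and field names are hypothetical.
from datetime import datetime, timezone

from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["shop"]

# Unique index on the voucher code; the validity check becomes an index seek.
db.vouchers.create_index([("code", ASCENDING)], unique=True)

def is_voucher_valid(code: str) -> bool:
    """Return True if a voucher with this code exists and has not expired."""
    voucher = db.vouchers.find_one(
        {"code": code, "valid_until": {"$gte": datetime.now(timezone.utc)}},
        projection={"_id": 1},  # existence is enough, skip the full document
    )
    return voucher is not None

The same principle, an index on the lookup key, applies equally to Cassandra or Elastic Search; the index design matters more than the engine choice.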
The selection of the database is the least of your concerns when designing a huge database that needs high performance. Most NoSQL and relational databases can be made to run this type of application effectively. The hardware is critical, the actual design of your database and your indexing are critical, and the queries you run need to be performant.
If I were to take on a project that required a very large database with high performance, the first and most critical thing to do would be to hire a database expert who has worked with those types of systems for several years. This is not something an application developer should EVER do. It is not a job for a beginner, or even for someone like me who has worked only with medium-sized databases, albeit for over 20 years. You get what you pay for. In this case, you need to pay for real expertise at the design stage, because database design mistakes are difficult to fix once they contain data. Hire a contractor if you don't want a permanent employee, but hire expertise.

How to implement version control on Firebase? [closed]

I'm currently using Firebase as a prototyping tool to showcase a front-end design for a documentation tool. In the process we've come to really like the real-time power of Firebase, and we are exploring the potential of using it for the production instance of an open-source/community version.
The first challenge is version control. Our legacy project used Hibernate/Envers in a Java stack, and we were previously looking at Gitlab as a way to move into a more "familiar" git environment.
This way?
Is there a way to timestamp and version-control the data being saved? And any thoughts on how best to recall this data without reinventing the wheel (e.g. any open-source modules)?
The real-time aspect of something like Firepad is great for documentation, but we need a way to commit, or otherwise distinctly timestamp, the saved state of a document.
Or?
Or is it best to use Firebase only for the real-time functionality, and use Gitlab to commit the instance to a non-real-time database? In other words, abstract the version control entirely out to a more traditional relationship with a DB?
Thoughts?
Both options you offer are valid and feasible. In general, I'd suggest using Firebase only as your real-time stack (data sync) and connecting it to your own backend (Gitlab or a custom DB).
I've gone down that path, and I find the best solution is to integrate your own backend DB with Firebase on top. Depend on Firebase exclusively for everything and you'll hit walls sooner or later.
The best solution is to keep full control over your data structure, security, and access models, and use Firebase where needed to keep clients in sync (online and offline). The integration is simple.
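
One way to get commit-like semantics while staying inside Firebase is to keep the live document at one path and append immutable, timestamped revisions under another, treating your own backend as the system of record if you need more. A minimal sketch using the firebase_admin Python SDK; the project URL, service-account file, and all paths and field names are hypothetical.

# Append-only revision history next to the live document in the
# Firebase Realtime Database. Paths and field names are hypothetical.
import time

import firebase_admin
from firebase_admin import credentials, db

cred = credentials.Certificate("service-account.json")  # hypothetical key file
firebase_admin.initialize_app(cred, {"databaseURL": "https://example-project.firebaseio.com"})

def commit_document(doc_id: str, content: str, author: str) -> str:
    """Save the live copy and push an immutable, timestamped revision."""
    db.reference(f"documents/{doc_id}/live").set({"content": content, "author": author})
    revision = db.reference(f"documents/{doc_id}/revisions").push({
        "content": content,
        "author": author,
        "committed_at": int(time.time() * 1000),
    })
    return revision.key  # push() keys sort chronologically

def list_revisions(doc_id: str):
    """Return all revisions, oldest first (push keys are chronological)."""
    return db.reference(f"documents/{doc_id}/revisions").get()

Whether the revision history lives in Firebase or in your own backend (the Gitlab route) then becomes mostly a question of who owns retention, diffing, and access control.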

MongoDB + Neo4J vs OrientDB vs ArangoDB [closed]

I am currently in the design phase of an MMO browser game. The game will include tilemaps for some real-time locations (so tile data for each cell) and a general world map. The game engine I prefer uses MongoDB for the persistent data world.
I will also implement a shipping simulation (which I explain more below), which is basically a Dijkstra module. I had decided to use a graph database, hoping it would make things easier, and found Neo4j, as it is quite popular.
I was happy with the MongoDB + Neo4j setup but then noticed OrientDB, which apparently acts like both MongoDB and Neo4j (best of both worlds?); they even have "vs." pages for MongoDB and Neo4j.
The point is, I have heard some horror stories about MongoDB losing data (though I'm not sure it still does), and I don't have that luxury. As for Neo4j, I am not a big fan of the €12K-per-year "startup friendly" cost, although I'll probably not have a DB of millions of vertices. OrientDB seems a viable option, as there may also be some benefit in using a single database solution.
In that case, a logical move might be jumping to OrientDB, but it has a small community and, to be honest, I didn't find many reviews about it. MongoDB and Neo4j are popular, widely used tools, so I have concerns that OrientDB might be an adventure.
My first question is whether you have any experience/opinion regarding these databases.
My second question is which graph database is better for a shipping simulation. The database is expected to calculate the cheapest route from any vertex to any vertex and traverse it (classic Dijkstra), but it also has to change weights depending on situations like "country B has an embargo on country A, so any item originating from country A can't pass through B" or "there is a flood in region XYZ, so no land transport is possible". The database is also expected to cache results. I expect no more than 1000 vertices but many edges.
Thanks in advance, and apologies if the questions are a bit ambiguous.
PS: I added ArangoDB to the title but, to be honest, I haven't had much chance to take a look at it.
Late edit as of 18-Apr-2016: After evaluating the responses to my questions and my development strategies, I decided to use ArangoDB, as their roadmap is more promising for me; they are apparently not trying to add tons of half-baked hype features.
Disclaimer: I am the author and owner of OrientDB.
As a developer, in general, I don't like companies that hide costs, let you play with their technology for a while, and then start asking for money as soon as you're tied to it. Once you have invested months developing an application that uses a non-standard language or API, you're stuck: pay, or migrate the application at huge cost.
OrientDB, by contrast, is FREE for any usage, even commercial. Furthermore, OrientDB supports standards like SQL (with extensions), and the main Java API is TinkerPop Blueprints, the "JDBC" standard for graph databases. OrientDB also supports Gremlin.
The OrientDB project is growing every day with new contributors and users. The Community Group (a free support channel) is the most active community in the graph-database market.
If you have doubts about which graph database to use, my suggestion is to pick the one closest to your needs, but then use standards as much as you can. That way, an eventual switch would have a low impact.
It sounds as if your use case is exactly what ArangoDB is designed for: you seem to need different data models (documents and graphs) in the same application and might even want to mix them in a single query. This is where a multi-model database such as ArangoDB shines.
If MongoDB has served you well so far, then you will immediately feel comfortable with ArangoDB, since it is very similar in look and feel. Additionally, you can model graphs by storing your vertices in one (or multiple) collections, and your edges in one or more so-called "edge-collections". This means that individual edges are simply documents in their own right and can hold arbitrary JSON data. The database then offers traversals, customizable with JavaScript to match any needs you might have.
For your query variations, you could for example add attributes about these embargoes to your vertices and program the queries/traversals to take them into account.
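
Whichever database ends up holding the graph, the embargo and flood rules boil down to a weight-or-skip function evaluated per edge during the traversal. Here is a minimal, database-agnostic sketch of that idea (plain Dijkstra over an in-memory adjacency list; all node names and attributes are made up).

# Database-agnostic Dijkstra with situational rules applied per edge.
# Graph contents and rule names are hypothetical.
import heapq

# adjacency: node -> list of (neighbour, base_cost, edge_attributes)
GRAPH = {
    "port_a": [("port_b", 4, {"origin": "A", "mode": "sea"})],
    "port_b": [("port_c", 2, {"origin": "B", "mode": "land"})],
    "port_c": [],
}

def edge_cost(base_cost, attrs, embargoed_origins, flooded_modes):
    """Return the effective cost, or None if the edge is currently unusable."""
    if attrs["origin"] in embargoed_origins or attrs["mode"] in flooded_modes:
        return None  # embargo or flood: this edge cannot be used at all
    return base_cost

def cheapest_route(start, goal, embargoed_origins=frozenset(), flooded_modes=frozenset()):
    """Classic Dijkstra; returns (total_cost, path) or None if unreachable."""
    queue = [(0, start, [start])]
    best = {start: 0}
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        for neighbour, base, attrs in GRAPH.get(node, []):
            effective = edge_cost(base, attrs, embargoed_origins, flooded_modes)
            if effective is None:
                continue
            new_cost = cost + effective
            if new_cost < best.get(neighbour, float("inf")):
                best[neighbour] = new_cost
                heapq.heappush(queue, (new_cost, neighbour, path + [neighbour]))
    return None

print(cheapest_route("port_a", "port_c"))                           # (6, ['port_a', 'port_b', 'port_c'])
print(cheapest_route("port_a", "port_c", embargoed_origins={"A"}))  # None: the only route is embargoed

With roughly 1000 vertices the whole structure fits comfortably in memory, so the database mainly has to persist the edges and their attributes; the rule evaluation can live either in the query layer (AQL/Cypher filters) or in application code as above.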
The ArangoDB database is licensed under the Apache 2 license, and community as well as professional support is readily available.
If you have any more specific questions, do not hesitate to ask in the Google group (https://groups.google.com/forum/#!forum/arangodb) or contact hackers (at) arangodb.org directly.
Neo4j's pricing is actually quite flexible, so don't be put off by the prices on the website.
You can also get started with the community edition or personal edition for a long time.
The Neo4j community is very active and helpful and quickly provides support and answers to your questions. I think that's the biggest plus, besides the performance and convenience of using a graph model in general.
Regarding your use-case:
Neo4j is used for exactly this route-calculation scenario by one of the largest logistics companies in the world, where it routes up to 4000 packages per second across the country.
It is also used in game engines, for example at GameSys for game-economy simulation, and in another engine for routing (not in earth coordinates but in game-world coordinates, using Neo4j Spatial).
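
To make the routing part concrete on the Neo4j side, here is a minimal sketch of an unweighted shortestPath query with an embargo filter, run through the official Python driver. The Hub label, ROUTE relationship type and origin_country property are hypothetical, and a cost-weighted shortest path would need APOC or the Graph Data Science library rather than plain shortestPath.

# Hop-count shortest path in Neo4j with an embargo filter.
# Labels, relationship types and properties are hypothetical.
from neo4j import GraphDatabase

QUERY = """
MATCH (a:Hub {name: $src}), (b:Hub {name: $dst})
MATCH p = shortestPath((a)-[:ROUTE*..20]-(b))
WHERE none(r IN relationships(p) WHERE r.origin_country = $embargoed)
RETURN [n IN nodes(p) | n.name] AS route
"""

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def find_route(src: str, dst: str, embargoed: str):
    """Return the node names along a shortest embargo-free route, or None."""
    with driver.session() as session:
        result = session.run(QUERY, src=src, dst=dst, embargoed=embargoed)
        record = next(iter(result), None)
        return record["route"] if record else None

print(find_route("port_a", "port_c", embargoed="A"))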
I'm curious why you have so few nodes. Are they something like transport portals? I also wonder where you store the details and dynamics of the routes (like the criteria you mentioned): do they come from outside, i.e. the in-memory state of the game engine?
You should probably share some more details about your model and the concrete use-case.
And it might help to know that both Emil, one of the founders of Neo4j, and I are old-time players of multi-user dungeons (MUDs), so it is definitely a use case close to our hearts :)

Moving quickly between aggregate and record-level data [closed]

I am using postgreSQL to store and process data for a research project. I can program in SQL, R, and Python but am not a software developer or system administrator. I find myself constantly aggregating data and then wanting to see the individual records contributing to a single cell in the aggregation. The records contain text fields and I use CASE and LIKE statements to determine how these will be counted. I'm looking for a GUI that will allow me to quickly move between different levels and kinds of aggregation so I don't lose access to details when looking at the big picture. I believe the answer to my question involves OLAP and/or faceted search but would like recommendations for specific products, open source and turnkey if possible.
thank you,
-david
icCube is not open source, but it allows going from the big picture down to the details (either via drilldown or drillthrough). Depending on your PostgreSQL model, the work to set up the cube model might be minimal. Note that once the model has been set up, you have the full power of MDX analysis for more challenging requests.
Maybe Power Pivot from Microsoft is the right tool for you. For Excel 2010 it is a plugin that you can download free of charge from Microsoft; for Excel 2013 and Excel as part of Office 365 (the cloud-based MS Office), it is already included. Older versions of Excel are not supported. The tool is an OLAP solution aimed at business users who don't need support from IT staff. Data is saved in the Excel workbook in an internal, compressed format optimized for fast analysis (millions of rows are not a problem), and you use a formula language very much like the one in standard Excel to define calculations, while you analyze the data script-free with point-and-click pivot tables.
Basically, you don't want to lose any of your detailed data, to allow for the drill-down OLAP operation.
In a data warehouse, the grain of, say, customer orders would be the order line item, i.e. the most detailed level.
What you should do is figure out which aggregates to pre-calculate, and use a tool to automate that for you. The aggregated data goes into its own tables.
A smart OLAP cube will recognize when an aggregate should be used and rewrite your query to hit the aggregated data instead.
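
One lightweight way to do that directly in PostgreSQL is a materialized view that bakes the CASE-style bucketing into its definition, with a drill-through query that reuses the exact same expression, so an aggregate cell and its underlying records can never drift apart. A minimal sketch via psycopg2; the records table, the note/created_at columns and the bucket labels are hypothetical, and a case-insensitive regex stands in for the CASE/LIKE logic described in the question.

# A pre-aggregated materialized view plus a matching drill-through query.
# Table, column and bucket names are hypothetical.
from datetime import date

import psycopg2

BUCKET_SQL = """
CASE WHEN note ~* 'refund'    THEN 'refund'
     WHEN note ~* 'complaint' THEN 'complaint'
     ELSE 'other'
END
"""

with psycopg2.connect("dbname=research") as conn, conn.cursor() as cur:
    # Aggregate level: one row per (month, bucket), refreshed on demand.
    cur.execute(f"""
        CREATE MATERIALIZED VIEW IF NOT EXISTS records_by_bucket AS
        SELECT date_trunc('month', created_at) AS month,
               {BUCKET_SQL} AS bucket,
               COUNT(*) AS n
        FROM   records
        GROUP  BY 1, 2
    """)
    cur.execute("REFRESH MATERIALIZED VIEW records_by_bucket")

    # Drill-through: the individual records behind one aggregate cell,
    # built from the very same bucketing expression.
    cur.execute(f"""
        SELECT id, created_at, note
        FROM   records
        WHERE  date_trunc('month', created_at) = %s
          AND  {BUCKET_SQL} = %s
    """, (date(2016, 1, 1), "refund"))
    for row in cur.fetchmany(20):
        print(row)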
Check out Pentaho Aggregation Designer, as well as Mondrian OLAP server/Saiku pivot tables. All FOSS.

Technology behind the Facebook Search [closed]

If you just start typing a person's name in the Facebook search box (displayed in the blue upper bar on your profile), search results appear within a fraction of a second, and they come back even faster if that person is already connected to your profile.
So I just want to know what is behind this search, i.e. which software tools and algorithms they are using for it.
I know that no one other than Facebook can explain it exactly; that's why I'm just asking for a general idea.
I am sure they are using something that is open source.
The front end is an autocomplete Ajax form, but that is not really the question. The key question is how fast you can search a text field. Facebook splits this into two parts. First, they search the list of your friends, which is a cached and relatively rarely changed data set containing 100 to 1000 entries. This is quite fast. The other part is searching for a name across ALL of Facebook, which means, I guess, around a billion names. This is a little trickier, but I guess they have the names split up and indexed by letters or letter combinations. For example:
// search query: "Alice Cooper"
// "A"  -> a list of A-names: Alina, Ana, Alice, ...
// "Al" -> narrowed to Alicia, Alice, Alina, ...
// and so on
Probably after about 3 letters they start an actual search, not across the billion rows but in a limited subset such as your 3rd-level friends, probably widening it each time.
Your query is probably never compared against the whole table, and there are surely cached levels or precalculated queries for the most common names.
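
The friends-list half of that guess is easy to prototype: keep the cached names sorted and answer every keystroke with a binary search for the typed prefix. A minimal sketch below; the names are obviously made up, and a real system would rank by affinity rather than alphabetically.

# Prefix autocomplete over a small cached friend list using binary search.
# The data is made up.
from bisect import bisect_left

FRIENDS = sorted(["alice cooper", "alicia keys", "alina smith", "ana gomez", "bob stone"])

def autocomplete(prefix: str, limit: int = 5):
    """Return up to `limit` cached names starting with the typed prefix."""
    prefix = prefix.lower()
    start = bisect_left(FRIENDS, prefix)   # index of the first name >= prefix
    matches = []
    for name in FRIENDS[start:]:
        if not name.startswith(prefix):
            break                          # sorted order: no further matches possible
        matches.append(name)
        if len(matches) == limit:
            break
    return matches

print(autocomplete("al"))   # ['alice cooper', 'alicia keys', 'alina smith']
print(autocomplete("ali"))  # same three names; the list only narrows once a name drops out

The global, all-of-Facebook half is the same idea at scale: a prefix (or n-gram) index sharded across many machines, with heavy caching of the most common prefixes.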
That covers the technique. In terms of technology, take a look at Solandra, a search engine built on top of Cassandra, which Facebook is using; I cannot confirm that this is exactly what powers their search, but it gives you a research direction.
I assume the underlying technology is AJAX, with some caching mechanism that increases performance for profiles in your friends list.