Apache Drill - Aggregation SUM query gives exponential result - mongodb

When we run an aggregate SUM query in Drill against MongoDB storage, the result is displayed in exponential (scientific) notation.
Is there a setting in Drill that lets us get the output without exponential notation?
We don't want the result in exponential form.
Thanks in advance.

Drill provides two tools that display query results (and so would format numbers): the Drill web UI and the Sqlline command-line tool. Are you using one of these? These tools are often used for experimental queries. I'm not aware of any way to customize the display in either of them.
That said, if you use the ODBC or JDBC driver, then numbers are returned in binary format, and so any formatting would be done by the tool that uses the driver to run queries. The xDBC drivers are more suited to production use, as they handle large result sets better than the UI tools.
Now, a third possibility is that something in the Mongo plugin converts numbers to strings (VARCHAR). In that case, you might be able to cast the string back to a number.
If you can provide a bit more detail on the tool you are using, perhaps we can provide a bit more focused answer.
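To illustrate the second and third points: once a client holds the value as a number (or has turned a string back into one), the exponential form is only a display choice made on the client side. A minimal Python sketch, assuming the SUM arrives either as a float or as a string such as "1.2345678E8" (the values and the helper name `plain_decimal` are made up for illustration, and nothing here is Drill-specific):

```python
from decimal import Decimal

def plain_decimal(value):
    """Render a number (or a numeric string) without exponent notation."""
    # Going through str() keeps the digits that would normally be displayed
    # instead of exposing binary floating-point noise.
    return format(Decimal(str(value)), "f")

print(plain_decimal(1.2345678e8))    # a float, as a driver might return it
print(plain_decimal("1.2345678E8"))  # a string in exponential form
print(plain_decimal(2.5e-7))         # a small value
```

The same idea applies in any client language: format the number where it is displayed rather than expecting the query engine to do it.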

Related

Sorting in Data Fusion

I have very simple queries around Data Fusion and the ETL transformation capabilities:
How do you sort a file in Data Fusion by a particular column or columns? I couldn't find any plugin or any Wrangler directive for this.
How do you perform a cumulative aggregation?
I'm not sure I completely understood the use case, but I'll answer to the best of my understanding and knowledge.
If you're looking to sort records from an input set, then hopefully the TopN plugin (which can be deployed from the Hub) will be useful to you. You could try using it in conjunction with other plugins (for example, if you wish to output the records to a file). As for Wrangler, I also couldn't find a sorting operation currently available (see cheatsheet here, ‘develop’ branch).
As Sakshi also suggested, you could use the Group by plugin in the Analytics section of the Data Fusion UI to perform aggregations.
For reference, the list of Cloud Data Fusion plugins is available here.
If this is not what you are looking for, then it’ll be helpful if you could share more about your use case and maybe provide an example of the input vs output.
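To pin down what the two operations in the question mean, independent of any Data Fusion plugin: a sort orders records by one or more columns, and a cumulative aggregation carries a running total across the ordered records. A small plain-Python sketch (the records and column names are made up, and this is not Data Fusion or Wrangler code):

```python
from itertools import accumulate
from operator import itemgetter

# Hypothetical input records.
records = [
    {"region": "west", "day": 2, "sales": 30},
    {"region": "east", "day": 1, "sales": 10},
    {"region": "west", "day": 1, "sales": 20},
]

# Sort by several columns: region first, then day.
ordered = sorted(records, key=itemgetter("region", "day"))

# Cumulative aggregation: running total of sales over the sorted rows.
running = list(accumulate(row["sales"] for row in ordered))

for row, total in zip(ordered, running):
    print(row["region"], row["day"], row["sales"], total)
```

An example of input versus expected output in this shape would make it easier to say which plugins fit.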

How to get datasource query From PowerBI Datasets

We have hundreds of datasets in which one or more tables are sourced from Analysis Services Tabular (Import mode). On Analysis Services we have set up a log (extended events), where we can see how long each query took to process.
Now I wonder how to find out which Power BI report/dataset certain queries come from, so that I can point the business users to the badly performing queries and have them changed to better ones. I can't find a way to do this.
Is there a way to do that? Can we list a dataset together with its queries?
No, you can't do that, or at least not exactly. One option, if you have Power BI Premium, is to use the Metrics app to find the reports with the highest query wait times. Another option is to use the Scanner API, which can give you the tables used in each model.

MongoDB + Elasticsearch or only Elasticsearch?

We have a new project that needs to index a large amount of data and serve it in real time. I also have complex searches with facets, full text, geospatial...
The first prototype indexes the data in MongoDB first and then in Elasticsearch, because I had read that Elasticsearch does not apply a checksum to stored files and that the index can't be fully trusted.
But in recent versions (since version 1.5) there is now a checksum, and I'm wondering whether we can use Elasticsearch as the primary data store. And what is the benefit of using MongoDB in addition to Elasticsearch?
I can't find an up-to-date answer about those features in Elasticsearch.
Thanks a lot
Here are some arguments for using Mongo instead of, or together with, ES:
User/role management.
Built into MongoDB. It may not fit all your needs and may be clumsy in places, but it exists and was implemented a long time ago.
The only thing for security in ES is Shield, but for production use it ships only with a Gold/Platinum subscription.
Schema
ES is schemaless, but it's built on top of Lucene and written in Java. The core idea of this tool is to index and search documents, and working this way requires index consistency. At the back end, all documents have to fit into a flat Lucene index, which requires some understanding of how ES deals with your nested documents and values, and of how you should organize your indexes to balance speed against data completeness/consistency. Working with ES requires you to keep some things about schema in mind constantly. For example, you can index almost anything into ES without defining the corresponding mapping in advance, and ES can "guess" the mapping on the fly, but sometimes it guesses wrong, and sometimes an implicit mapping is evil, because once it is set it can't be changed without reindexing the whole index. So it's better not to treat ES as a schemaless store, because you can step on a rake at some point (and this will be painful :) ), but rather to treat it as schema-intensive, at least when you work with documents that can be sliced into concrete fields.
Mongo, on the other hand, can "chew and leave no crumbs" out of almost anything you put in it. Most of your queries will work fine, as long as you remember how Mongo deals with your data from a JavaScript perspective. And as JS is weakly typed, you can work with a truly schemaless workflow (if you need one, of course).
Handling non-table-like data.
ES is limited to storing data without putting it into the search index. This solution is good enough when you need to store and retrieve some extra data (in addition to the data you want to search against).
MongoDB supports GridFS. This gives you the ability to handle large chunks of data behind the same interface. That is, you can store binary data in Mongo and retrieve it through the same interface, from your code's perspective.
Well, choose the right tool for the right job. If you require search capabilities such as full-text search, faceting, etc., then nothing can beat a full-fledged search engine. Elasticsearch (ES) or Solr is just a matter of choice.
You can actually feed (index) documents into ES for searching and then fetch the complete details for a particular entry from MongoDB or any other database.
To make your task easier, take a look at my open source work that uses MongoDB, ES, Redis and RabbitMQ, all integrated in one place, here on GitHub.
Please note that the application is built in .NET C#.
After having used Elasticsearch in production, I can add a few notes to this thread:
We secured our Elasticsearch cluster via a reverse proxy that checks client certificate authenticity at request time before letting the query in. It proves that there are multiple ways to add authentication anyway. (If you need finer-grained security, such as roles, there are a few plugins that can be added to manage permissions.)
Elasticsearch mapping and settings (tuning) are really important concepts to fully understand before going to production with it, and it's not that easy to quickly grasp how everything works.
Clustering and horizontal scaling are very flexible and easy to set up.
The suite of tools (Kibana, Beats, etc.) is a very convenient way to gather logs, expose key data, etc.
Search features are extremely advanced; you can really do amazing things once you have mastered a bit of how full-text search works (fuzziness, boosting, scoring, stemming, tokenizers, analyzers, and so on).
The APIs are a bit scattered and there is no single way to achieve something. And some APIs are really WTF to use, like the bulk insert API: you need to pass binary data in JSON format (and of course don't forget the end-of-line characters), repeating some fields multiple times. This is very verbose, and I guess it's legacy code like we all have in our projects ;).
Last thing: if you develop a Java project, do not use Hibernate Search to duplicate data from a datasource to your ES cluster. We had so many issues with Hibernate Search that if we had to do it again, we'd do it manually.
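For readers who haven't met the bulk format mentioned in the notes above, here is a rough sketch of how such a request body can be assembled: newline-delimited JSON, with one action line followed by one source line per document, and a trailing newline at the end. The index name `articles`, the sample documents and the helper name `build_bulk_body` are made up for illustration, and older Elasticsearch versions may expect additional fields in the action line, so check the documentation for the version you run.

```python
import json

def build_bulk_body(index_name, docs):
    """Build a newline-delimited JSON body: an action line, then a source line, per document."""
    lines = []
    for doc_id, source in docs:
        # Action line: says what to do and repeats the target index and id.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        # Source line: the document itself, on a single line.
        lines.append(json.dumps(source))
    # The body has to end with a newline character.
    return "\n".join(lines) + "\n"

body = build_bulk_body(
    "articles",  # hypothetical index name
    [("1", {"title": "first"}), ("2", {"title": "second"})],
)
print(body)
```

This also shows where the verbosity comes from: the index name and id are repeated in every action line.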
Now, about the real question:
To my mind, using only Elasticsearch is sufficient and may reduce the complexity of running multiple NoSQL storage systems.
I think the combination is worthwhile when you pair a relational, transactional database with a NoSQL search engine, but having two systems that serve roughly the same purpose is a bit overkill.
I recently developed a feature at my company
where we wanted to perform some searches and rank the results by relevance across multiple factors and conditions.
In my application, we were already using MongoDB as the database,
so I exported to an Elasticsearch index the fields from MongoDB that I wanted to search and filter on.
According to the required conditions, I prepared both my Mongo query and my Elasticsearch query and performed the search. Then I filtered and sorted the results according to my needs.
The whole flow was designed in such a way that,
even if there is an error from ES, Mongo will still fetch the records.
If I do get a result from ES, then the Mongo result depends on the ES result.
This is how I used Mongo and ES in combination.
Also, don't forget to properly handle all updates, deletes and new record insertions.
And just so you know, the results for me were really good.
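The flow described above could be sketched roughly as follows. This is only an illustration under assumptions: `es_search` and `mongo_find` are hypothetical callables standing in for the real Elasticsearch and MongoDB queries, and the in-memory `DOCS` list replaces both stores so the sketch runs on its own.

```python
def search_with_fallback(es_search, mongo_find, query):
    """Return full records, ranked by ES when possible, else straight from Mongo."""
    try:
        ranked_ids = es_search(query)  # ids ordered by relevance
    except Exception:
        # ES failed: fall back to the database query alone.
        return mongo_find(query, ids=None)
    # ES answered: fetch only those ids from Mongo and keep the ES ranking.
    by_id = {rec["_id"]: rec for rec in mongo_find(query, ids=ranked_ids)}
    return [by_id[i] for i in ranked_ids if i in by_id]

# In-memory stand-ins (hypothetical data, not real client calls).
DOCS = [
    {"_id": 1, "name": "red shoe"},
    {"_id": 2, "name": "red hat"},
    {"_id": 3, "name": "blue hat"},
]

def fake_mongo_find(query, ids=None):
    hits = [d for d in DOCS if query in d["name"]]
    return hits if ids is None else [d for d in hits if d["_id"] in ids]

def fake_es_search(query):
    return [2, 1]  # pretend relevance order

def broken_es_search(query):
    raise ConnectionError("ES is down")

print(search_with_fallback(fake_es_search, fake_mongo_find, "red"))
print(search_with_fallback(broken_es_search, fake_mongo_find, "red"))
```

Keeping the two stores in sync on updates, deletes and inserts is a separate concern that this sketch does not cover.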

Can we use loop functions in Tableau

Can we use loop functions (for, while, do while) in Tableau calculated fields? If we can, how do we use these functions in calculated fields, and how do we initialise the variables declared in them?
No, we can't. There are some hacks for doing calculations like that, using PREVIOUS_VALUE and other table calculations, but there are no loop functions in Tableau.
Why? Because Tableau isn't meant to be a data processing tool, but rather a data visualization tool. Don't get me wrong, the Tableau engine is very good at processing data, but only for "query-like" operations.
So why don't you post exactly what you are trying to achieve, and we can work out whether it can be accomplished with Tableau or whether your data requires some pre-processing?

DATASTAGE capabilities

I'm a Linux programmer.
I'm used to writing code to get things done: Java, Perl, PHP, C.
I need to start working with DataStage.
All I see is that DataStage works on table/CSV-style data and processes it line by line.
I want to know if DataStage can work on files that are not table/CSV-like. Can it load data into data structures and run functions on them, or is it limited to working on only one line at a time?
Thank you for any information that you can give on the capabilities of DataStage.
IBM (formerly Ascential) DataStage is an ETL platform that, indeed, works on data sets by applying various transformations.
This does not necessarily mean that you are constrained to applying only single-line transformations (you can also aggregate, join, split, etc.). Also, DataStage has its own programming language, BASIC, that allows you to modify the design of your jobs as needed.
Lastly, you are still free to call external scripts from within DataStage (using the DSExecute function, the Before Job property, the After Job property or the Command stage).
Please check the IBM Information Center for comprehensive documentation on BASIC programming.
You could also check the DSXchange forums for DataStage specific topics.
Yes it can. As Razvan said, you can join, aggregate and split. It can use loops and external scripts, and it can also handle XML.
My advice is this: if you have large quantities of data to work on, then DataStage is your friend; if the data you have to load is not very big, then it's going to be easier to use Java, C, or any other programming language that you know.
You can use all types of functions and conversions, and manipulate the data. DataStage is mainly used for its ease of use when handling huge volumes of data from a data mart or data warehouse.
The main process in DataStage is ETL: Extraction, Transformation, Loading.
Where a programmer would use 100 lines of code to connect to some database, here we can do it with one click.
Anything can be done here, even C or C++ coding in a routine activity.
If you are talking about hierarchical files, like XML or JSON, the answer is yes.
If you are talking about complex files, such as are produced by COBOL, the answer is yes.
All of this uses built-in functionality (e.g. the Hierarchical Data stage and the Complex Flat File stage). Review the DataStage palette to find other examples.