REST API for processing data stored in HBase

I have a lot of records (millions) in an HBase store, like this:
key = user_id:service_id:usage_timestamp value = some_int
That means a user used service_id for some_int at usage_timestamp.
Now I want to provide a REST API for aggregating that data, for example "find the sum of all values for the requested user" or "find the max of them", and so on. So I'm looking for best practices. A simple Java application doesn't meet my performance expectations.
My current approach aggregates the data via an Apache Spark application. It looks good enough, but it is awkward to use from a Java REST API because Spark doesn't support a request-response model (I have also taken a look at spark-job-server, which seems raw and unstable).
Thanks,
Any ideas?

I would suggest HBase + Solr if you are using Cloudera (i.e. Cloudera Search), with the SolrJ API for aggregating the data (instead of Spark) and for interacting with your REST services.
Solr solution (in Cloudera it's Cloudera Search):
Create a collection (similar to an HBase table) in Solr.
Indexing: use the NRT Lily indexer or a custom MapReduce Solr document creator to load the data as Solr documents.
If you don't like the NRT Lily indexer, you can use a Spark or MapReduce job with SolrJ to do the indexing. For example, Spark Solr:
Tools for reading data from Solr as a Spark RDD and indexing objects from Spark into Solr using SolrJ.
Data retrieval: use SolrJ to get the Solr docs from your web service call.
In SolrJ:
There is FieldStatsInfo, through which sum, max, etc. can be computed
There are facets and facet pivots to group data
Pagination is supported for REST API calls
You can integrate the Solr results with Jersey or some other web service framework, as we have already implemented it this way.
/**
 * This method returns the records for the specified rows from the Solr server,
 * which you can integrate with any REST framework such as Jersey.
 */
public SolrDocumentList getData(int start, int pageSize, SolrQuery query) throws SolrServerException {
    query.setStart(start);   // start of your page
    query.setRows(pageSize); // number of rows per page
    LOG.info(ClientUtils.toQueryString(query, true));
    // POST is important if you are querying a huge result set; GET will fail for huge results
    final QueryResponse queryResponse = solrCore.query(query, METHOD.POST);
    final SolrDocumentList solrDocumentList = queryResponse.getResults();
    if (isResultEmpty(solrDocumentList)) { // check if the list is empty
        LOG.info("hmm.. No records found for this query");
    }
    return solrDocumentList;
}
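For the aggregation part ("sum of all values for the requested user", "max of them", etc.), a minimal SolrJ sketch using field statistics could look like the following. The collection name "usage" and the field names "user_id" and "value" are assumptions; adapt them to your schema.

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.FieldStatsInfo;
import org.apache.solr.client.solrj.response.QueryResponse;

public class UsageStats {
    public static void main(String[] args) throws Exception {
        // Hypothetical Solr URL and collection name.
        try (SolrClient solr = new HttpSolrClient.Builder("http://solr-host:8983/solr/usage").build()) {
            SolrQuery query = new SolrQuery("*:*");
            query.addFilterQuery("user_id:12345"); // restrict to the requested user
            query.setRows(0);                      // we only need the stats, not the documents
            query.setGetFieldStatistics("value");  // adds stats=true&stats.field=value

            QueryResponse response = solr.query(query);
            FieldStatsInfo stats = response.getFieldStatsInfo().get("value");
            System.out.println("sum = " + stats.getSum() + ", max = " + stats.getMax());
        }
    }
}

If you need grouped aggregations instead of a single number, facets and facet pivots can be requested on the same SolrQuery (query.addFacetField(...), query.addFacetPivotField(...)).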
Also look at my answer to "Create indexes in solr on top of HBase" and
https://community.hortonworks.com/articles/7892/spark-dataframe-to-solr-cloud-runs-on-sandbox-232.html
Note: I think the same can be achieved with Elasticsearch as well, but from my experience I'm confident with Solr + SolrJ.

I see two possibilities:
Livy REST Server - a new REST server created by Cloudera. You can submit Spark jobs over REST. It is new and developed by Cloudera, one of the biggest Big Data / Spark companies, so it is quite likely to keep being developed rather than abandoned.
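To give an idea, a minimal sketch of submitting a pre-built Spark job through Livy's batch API; the Livy host, jar path and class name are placeholders, and Java 11's java.net.http.HttpClient is assumed:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class LivySubmit {
    public static void main(String[] args) throws Exception {
        // Hypothetical jar path, main class and argument; replace with your own job.
        String body = "{ \"file\": \"hdfs:///jobs/usage-aggregator.jar\", "
                    + "\"className\": \"com.example.UsageAggregator\", "
                    + "\"args\": [\"user_12345\"] }";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://livy-host:8998/batches")) // 8998 is Livy's default port
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // contains the batch id
    }
}

The response contains a batch id that you can poll via GET /batches/{id} to track the job state; the job itself has to write its result somewhere your REST layer can read it.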
You can run the Spark Thrift Server and connect to it just like to a normal database via JDBC. Here you've got the documentation. Workflow: read the data, preprocess it, and then expose it via the Spark Thrift Server.
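A minimal sketch of querying the Spark Thrift Server over JDBC (it speaks the HiveServer2 protocol, so the Hive JDBC driver is used; the host, credentials and the "usage" table are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ThriftServerQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://thrift-host:10000/default", "user", ""); // 10000 is the default port
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT user_id, SUM(value) AS total FROM usage GROUP BY user_id")) {
            while (rs.next()) {
                System.out.println(rs.getString("user_id") + " -> " + rs.getLong("total"));
            }
        }
    }
}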
If you want to isolate third-party apps from Spark, you can create a simple application that exposes a user-friendly endpoint and translates each query received by that endpoint into Livy Spark jobs or into SQL run against the Spark Thrift Server.

Related

Issue in MongoDB document search

I am new to MongoDB, and I have the following issue in a web application we are currently developing.
We use MongoDB to store the data, and we have an API that searches for documents via text search.
As an example: if the user types "New York", the request should return all the available data in the collection for the keyword "New York". (We call the API for each letter typed.) We have nearly 200,000 documents in the DB, and a search can return nearly 4,000 documents for some keywords. We tried limiting the results to 5, so only the top 5 documents are returned and the rest are dropped. We also tried without a limit, and then hundreds or thousands of documents come back as mentioned, which slows the request down.
On the frontend we bind the search results to a dropdown (Next.js).
My questions:
Is there a way to optimize the document search?
Are there any suggestions for a suitable way to implement this requirement using MongoDB and .NET 5.0?
Or any other implementation approaches for this requirement?
The following code segment shows the query that retrieves the data for the incoming keyword.
var hotels = await _hotelsCollection
    .Find(Builders<HotelDocument>.Filter.Text(keyword))
    .Project<HotelDocument>(hotelFields)
    .ToListAsync();

var terminals = await _terminalsCollection
    .Find(Builders<TerminalDocument>.Filter.Text(keyword))
    .Project<TerminalDocument>(terminalFeilds)
    .ToListAsync();

var destinations = await _destinationsCollection
    .Find(Builders<DestinationDocument>.Filter.Text(keyword))
    .Project<DestinationDocument>(destinationFields)
    .ToListAsync();
So this is a classic "autocomplete" feature, and there are some known best practices you should follow:
On the client side you should use debouncing; this is a must. There is no reason to execute a request for each letter typed. This is the most critical part of an autocomplete feature.
On the backend things can get a bit more complicated. Naturally you want to be using a database suited for this task; specifically, MongoDB has a service called Atlas Search, which is a Lucene-based text search engine.
That will get you autocomplete support out of the box. However, if you don't want to make big changes to your infrastructure, here are some suggestions:
Make sure the field you're searching on is indexed (text search requires a text index on that field).
I see you're executing 3 separate requests; consider using something like Task.WhenAll to execute them concurrently instead of one by one. I am not sure how the client side is built, but if all 3 entities are shown in the same list then ideally you would merge the labels into one collection so you could paginate the search properly.
As mentioned in #2, you must add server-side pagination; no search engine can exist without it. I can't give specifics on how you should implement it, as you have 3 separate entities and that could make the pagination implementation harder; I'd consider whether or not you need all 3 of these in the same API route. A minimal pagination sketch is shown below.
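For illustration, here is a minimal sketch of the indexing and pagination points using the MongoDB Java driver (the .NET driver exposes equivalent Filters/Projections builders); the connection string, database, collection and field names are assumptions:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Indexes;
import com.mongodb.client.model.Projections;
import org.bson.Document;

public class HotelSearch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> hotels =
                    client.getDatabase("travel").getCollection("hotels");

            // One-time setup: text search requires a text index on the searched field.
            hotels.createIndex(Indexes.text("name"));

            int page = 0, pageSize = 5;
            // Text search, projecting only the fields the dropdown needs,
            // with server-side pagination via skip/limit.
            for (Document d : hotels.find(Filters.text("New York"))
                                    .projection(Projections.include("name", "city"))
                                    .skip(page * pageSize)
                                    .limit(pageSize)) {
                System.out.println(d.toJson());
            }
        }
    }
}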

How to create a H2OFrame using H2O REST API

Is it possible to create an H2OFrame using H2O's REST API, and if so, how?
My main objective is to utilize models stored inside H2O to make predictions on external H2OFrames.
I need to be able to generate those H2OFrames externally from JSON (I suppose by calling an endpoint).
I read the API documentation but couldn't find any clear explanation.
I believe the closest endpoints are /3/CreateFrame, which creates random data, and /3/ParseSetup, but I couldn't find any reliable tutorial.
Currently there is no REST API endpoint to directly convert some JSON record into a Frame object. Thus, the only way forward for you would be to first write the data to a CSV file, then upload it to h2o using POST /3/PostFile, and then parse using POST /3/Parse.
(Note that POST /3/PostFile endpoint is not in the documentation. This is because it is handled separately from the other endpoints. Basically, it's an endpoint that takes an arbitrary file in the body of the post request, and saves it as "raw data file").
The same job is much easier to do in Python or in R: for example, in order to upload some dataset into h2o for scoring, you only need to say
df = h2o.H2OFrame(plaindata)
I am already doing something similar in my project. Since there is no REST API endpoint to directly convert a JSON record into a Frame object, I am doing the following:
1. For model building: first transfer and write the data into a CSV file on the machine where the h2o server or cluster is running. Then import the data into h2o using POST /3/ImportFiles, and then parse it, build a model, etc. I am using the h2o-bindings APIs (RESTful APIs) for this. Since I have large data (hundreds of MBs to a few GBs), I use /3/ImportFiles instead of POST /3/PostFile, as the latter is slow for uploading large data.
2. For model scoring or prediction: I am using the model MOJO and POJO. In your case, you can use POST /3/PostFile as suggested by @Pasha if your data is not large. But per the h2o documentation it's advisable to use the MOJO or POJO for model scoring or prediction in a production environment rather than calling the h2o server/cluster directly. MOJOs and POJOs are thread-safe, so you can scale scoring with multithreading for concurrent requests (see the sketch below).
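For reference, a minimal MOJO scoring sketch using the h2o-genmodel library, assuming a regression model exported as model.zip; the feature names are hypothetical:

import hex.genmodel.MojoModel;
import hex.genmodel.easy.EasyPredictModelWrapper;
import hex.genmodel.easy.RowData;
import hex.genmodel.easy.prediction.RegressionModelPrediction;

public class MojoScorer {
    public static void main(String[] args) throws Exception {
        // Load the MOJO exported from h2o and wrap it for easy row-by-row scoring.
        EasyPredictModelWrapper model =
                new EasyPredictModelWrapper(MojoModel.load("model.zip"));

        // Build one input row; column names must match the training frame.
        RowData row = new RowData();
        row.put("feature_1", "3.5");
        row.put("feature_2", "red");

        RegressionModelPrediction prediction = model.predictRegression(row);
        System.out.println("prediction: " + prediction.value);
    }
}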

How do we do a select query using the phantom driver without a table definition

I have data streaming in from Spark Streaming, which I need to process and finally store in Cassandra. Earlier I was trying to use the Spark Cassandra connector, but it doesn't give access to the SparkStreaming context object on the workers, so I have to use a separate Cassandra Scala driver. Hence I ended up with phantom. Now, my question is: I have already defined the column family in Cassandra, so how do I do select and update queries from Scala?
I have followed this documentation (link1), but I don't understand why we need to give the table definition on the client (Scala code) side. Why can't we just give the keyspace, contact points, and column family and be done with it?
object CustomConnector {
  val hosts = Seq("IP1", "IP2")
  val Connector = ContactPoints(hosts).keySpace("KEYSPACE_NAME")
}

realTimeAgg.foreachRDD { x =>
  if (x.toLocalIterator.nonEmpty) {
    x.foreachPartition { partition =>
      // How to achieve select/insert into the Cassandra table here using phantom?
    }
  }
}
This is not yet possible using phantom. We are actively working on phantom-spark to allow you to do this, but at this point in time it is still a few months away.
In the interim, you will have to rely on the Spark Cassandra connector and use the non-type-safe API to achieve this. It's an unfortunate setup, but in the very near future this will be resolved.

Conditional routing in Apache NiFi

I'm using NiFi to get data from an Oracle database and put some of this data in Kafka (using the processor PutKafka).
Example: send the data to Kafka only if the attribute "id" contains "aaabb".
Is that possible in Apache NiFi? How can I do it?
This should definitely be possible; the flow might be something like this...
1) ExecuteSQL or QueryDatabaseTable to get the data from the database; these produce Avro
2) ConvertAvroToJSON processor to convert the Avro to JSON
3) EvaluateJsonPath to extract the id field into an attribute
4) RouteOnAttribute to route flow files where the id attribute contains "aaabb" (for example, a dynamic property whose value is the expression ${id:contains('aaabb')})
5) PutKafka to deliver any of the matching results from RouteOnAttribute
To add on to Bryan's example flow, I wanted to point you to some great documentation that should help introduce you to Apache NiFi.
Firstly, I would suggest checking out the NiFi documentation. It is very good and should help a lot. In addition to providing details on each of the processors Bryan mentioned, it also has general documentation for every type of user.
For a basic introduction to building a NiFi flow, check out this video.
For example templates, check out this repo. It has an Excel file at its root level with a description and a list of processors for each template.

Make Orion fetch data from Cosmos and publish

I have set up a subscription between Orion ContextBroker and Cosmos BigData using Cygnus, and data is properly persisted in Cosmos when an update is made to Orion.
But I want to analyze the data in Cosmos and return the results to Orion, and finally access the result data in Orion from "outside".
How would one do this? Of course, I would like the solution I build to be as "automated" as possible, but mostly I just want to solve this problem.
Any advice is much appreciated!
As a general response (since the question is also very general ;), what you need is a process that accesses the information stored in Cosmos (either using HDFS APIs such as WebHDFS or HttpFS, Hive queries, general MapReduce jobs on top of Hadoop, etc.) and then implements the client side of the NGSI API that Orion exposes, in order to inject context elements into Orion based on the information you retrieved from Cosmos. The key operation for doing so in the Orion API is updateContext.
The degree of automation depends on how you implement that process. It can be as automated as you want.
EDIT: considering the comments on this answer, I will try to add more detail.
What I mean is to develop a piece of software (let's call it APOS, "A Piece Of Software") implementing the following behaviour:
APOS will grab data from Cosmos through any of the interfaces Cosmos provides, i.e. WebHDFS/HttpFS, Hive, MapReduce jobs, etc.
APOS will process the data to produce some result.
APOS will inject that result into Orion using the Orion REST API described in the Orion user manual. The updateContext operation is particularly useful for that task (a minimal sketch of that call is shown at the end of this answer). From a client-server point of view, Orion is a server exposing a REST API and APOS is the client interacting with that server.
It is completely up to you how to implement this APOS and how to orchestrate the flow from 1 to 3 (e.g. it can run in batch mode every midnight, be triggered by user interaction on a web portal, etc.).
At the present moment, FI-WARE doesn't provide any generic enabler to convert Cosmos data to NGSI, given that each particular realization of steps 1 to 3 above is different and depends on the use case. However, note that there is a software component named Cygnus which implements the other direction: from NGSI to Cosmos.
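As an illustration of step 3, here is a minimal sketch of the updateContext call against Orion's NGSIv1 REST API, using Java 11's java.net.http.HttpClient; the entity id/type and the attribute are hypothetical placeholders for whatever result APOS computes:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class OrionUpdater {
    public static void main(String[] args) throws Exception {
        // APPEND creates the entity/attribute if it does not exist, or updates it otherwise.
        String payload = "{"
                + "\"contextElements\": [{"
                + "  \"type\": \"AnalysisResult\", \"isPattern\": \"false\", \"id\": \"Result1\","
                + "  \"attributes\": [{ \"name\": \"average\", \"type\": \"float\", \"value\": \"42.3\" }]"
                + "}],"
                + "\"updateAction\": \"APPEND\""
                + "}";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://orion-host:1026/v1/updateContext")) // 1026 is Orion's default port
                .header("Content-Type", "application/json")
                .header("Accept", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(payload))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}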