I'm new to Azure Databricks and Scala. I'm trying to consume an HTTP REST API that returns JSON. I went through the Databricks docs but I don't see any data source that would work with a REST API. Is there any library or tutorial on how to work with REST APIs in Databricks? If I make multiple API calls (because of pagination), it would be nice to get them done in parallel (the Spark way).
I would be glad if you could point me to a Databricks or Spark way to consume a REST API, as I was surprised that there is no information about an API data source in the docs.
Here is a simple implementation.
The basic idea is that spark.read.json can read an RDD of JSON strings.
So just build the JSON string from a GET call, wrap it in an RDD, and read it as a regular DataFrame.
%spark
// Fetch the response body as a String using only the Scala standard library
def get(url: String) = scala.io.Source.fromURL(url).mkString

val myUrl = "https://<abc>/api/v1/<xyz>"
val result = get(myUrl)

// Strip the trailing newline, wrap the JSON string in a single-element RDD,
// and let Spark infer the schema
val jsonResponseStrip = result.stripLineEnd
val jsonRdd = sc.parallelize(jsonResponseStrip :: Nil)
val jsonDf = spark.read.json(jsonRdd)
That's it.
It sounds to me like what you want is to import a library for making HTTP requests into Scala. I suggest working at the HTTP level rather than through a higher-level REST interface, because pagination may be handled inside the REST library and may or may not support parallelism.
Working at the lower HTTP level lets you decouple pagination from fetching. Then you can use the parallelism mechanism of your choice.
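For instance, here is a rough sketch of that decoupling, using Spark itself as the parallelism mechanism and only the standard library for the HTTP calls (the URL pattern and page count are assumptions):

// Assumes a notebook/session where `spark` and `sc` are already defined, as in the answer above.
// Build the page URLs up front, then let the executors fetch the pages in parallel.
val pageUrls = (1 to 20).map(p => s"https://<abc>/api/v1/<xyz>?page=$p")

// Each partition fetches its share of the pages, so the HTTP calls run in parallel
// on the workers rather than one by one on the driver.
val pagesRdd = sc.parallelize(pageUrls).map(url => scala.io.Source.fromURL(url).mkString)

// Parse every page into a single DataFrame.
import spark.implicits._
val pagesDf = spark.read.json(pagesRdd.toDS())
pagesDf.show()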
There are a number of libraries out there, but recommending a specific one is out of scope.
If you do not want to import a library, you could have your Scala notebook call another notebook running a language whose standard library includes HTTP support. That notebook would then return the data to your Scala notebook.
Related
I am trying to implement a simple reasoning operator using Apache Flink in Scala. I can already read data as a stream from a .csv file, but I cannot work out how to process RDF and OWL data.
Here is my code to load data from .csv:
val csvTableSource = CsvTableSource
.builder
.path("src/main/resources/data.stream")
.field("subject", Types.STRING)
.field("predicate", Types.STRING)
.field("object", Types.STRING)
.fieldDelimiter(";")
.build()
Could anyone show me an example of loading this data with Flink using RDF and OWL? As I understand it, an RDF stream contains the dynamic data, while OWL describes the static data. I have to create a simple reasoning operator that I can query for information, e.g. who is a friend of a friend.
Any help will be appreciated.
Is it possible to create an H2OFrame using H2O's REST API, and if so, how?
My main objective is to use models stored inside H2O to make predictions on external H2OFrames.
I need to be able to generate those H2OFrames externally from JSON (I suppose by calling an endpoint).
I read the API documentation but couldn't find any clear explanation.
I believe the closest endpoints are /3/CreateFrame, which creates random data, and /3/ParseSetup, but I couldn't find any reliable tutorial.
Currently there is no REST API endpoint to directly convert a JSON record into a Frame object. Thus, the only way forward is to first write the data to a CSV file, then upload it to h2o using POST /3/PostFile, and then parse it using POST /3/Parse.
(Note that the POST /3/PostFile endpoint is not in the documentation. This is because it is handled separately from the other endpoints: it takes an arbitrary file in the body of the POST request and saves it as a "raw data file".)
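A rough Scala sketch of the upload step, following the raw-body behaviour described above (the host, port, destination_frame query parameter, and file path are all assumptions; check them against your h2o version):

// Upload a local CSV to the h2o cluster via POST /3/PostFile (sketch, not production code).
import java.net.{HttpURLConnection, URL}
import java.nio.file.{Files, Paths}

val h2oBase = "http://localhost:54321"   // assumed h2o REST endpoint
val csvBytes = Files.readAllBytes(Paths.get("scoring_data.csv"))

// destination_frame is assumed to name the resulting raw frame.
val url = new URL(s"$h2oBase/3/PostFile?destination_frame=scoring_data.csv")
val conn = url.openConnection().asInstanceOf[HttpURLConnection]
conn.setRequestMethod("POST")
conn.setDoOutput(true)
conn.setRequestProperty("Content-Type", "application/octet-stream")
val out = conn.getOutputStream
out.write(csvBytes)
out.close()
println(s"PostFile returned HTTP ${conn.getResponseCode}")
// Next: call POST /3/ParseSetup and POST /3/Parse to turn the raw file into a parsed Frame.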
The same job is much easier to do in Python or in R: for example in order to upload some dataset into h2o for scoring, you only need to say
df = h2o.H2OFrame(plaindata)
I am already doing something similar in my project. Since there is no REST API endpoint to directly convert a JSON record into a Frame object, I do the following:
1. For model building: first transfer and write the data into a CSV file on the machine where the h2o server or cluster is running. Then import the data into h2o using POST /3/ImportFiles, and then parse it, build a model, and so on. I am using the h2o-bindings APIs (RESTful APIs) for this. Since I have a lot of data (hundreds of MBs to a few GBs), I use /3/ImportFiles instead of POST /3/PostFile, as the latter is slow for uploading large data.
2. For model scoring or prediction: I am using the model MOJO and POJO. In your case, you can use POST /3/PostFile as suggested by @Pasha if your data is not large. But, per the h2o documentation, it is advisable to use the MOJO or POJO for model scoring or prediction in a production environment rather than calling the h2o server/cluster directly. MOJOs and POJOs are thread-safe, so you can scale scoring with multithreading for concurrent requests; a small scoring sketch follows below.
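For reference, a minimal sketch of MOJO scoring with the h2o-genmodel library (the model file name, the column names, and the assumption that the model is a binomial classifier are all hypothetical):

// Score one record against an exported MOJO without calling the h2o cluster.
import hex.genmodel.MojoModel
import hex.genmodel.easy.{EasyPredictModelWrapper, RowData}

object MojoScoring {
  def main(args: Array[String]): Unit = {
    // Load the MOJO zip exported from h2o (path is hypothetical).
    val model = new EasyPredictModelWrapper(MojoModel.load("model.zip"))

    // Build one input row from your JSON record (field names are hypothetical).
    val row = new RowData()
    row.put("feature1", "42.0")
    row.put("feature2", "red")

    // For a binomial classifier; use predictRegression/predictMultinomial/... otherwise.
    val prediction = model.predictBinomial(row)
    println(s"label=${prediction.label}, probabilities=${prediction.classProbabilities.mkString(",")}")
  }
}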
I have a lot of records (millions) in an HBase store, like this:
key = user_id:service_id:usage_timestamp, value = some_int
That means a user used service_id for some_int at usage_timestamp.
Now I want to provide a REST API for aggregating that data, for example "find the sum of all values for the requested user" or "find the max of them", and so on. I'm looking for best practices here; a simple Java application doesn't meet my performance expectations.
My current approach aggregates the data via an Apache Spark application. That looks good enough, but there are issues using it behind a Java REST API, since Spark doesn't support a request-response model (I also took a look at spark-job-server, which seems raw and unstable).
Any ideas?
Thanks.
I would suggest HBase + Solr if you are using Cloudera (i.e. Cloudera Search), with the SolrJ API for aggregating the data (instead of Spark) and for interacting with your REST services.
Solr solution (in Cloudera this is Cloudera Search):
Create a collection (similar to an HBase table) in Solr.
Indexing: use the NRT Lily indexer or a custom MapReduce Solr document creator to load the data as Solr documents.
If you don't like the NRT Lily indexer, you can use a Spark or MapReduce job with SolrJ to do the indexing. For example, Spark-Solr provides tools for reading data from Solr as a Spark RDD and indexing objects from Spark into Solr using SolrJ.
Data retrieval: use SolrJ to get the Solr docs from your web service call.
In SolrJ:
there is FieldStatsInfo, through which sum, max, etc. can be obtained (see the stats sketch after the code below);
there are facets and facet pivots for grouping data;
pagination is supported for REST API calls;
you can integrate the Solr results with Jersey or some other web service framework, as we have already done. For example:
/**
 * Returns the records for the specified page from the Solr server. You can integrate this
 * with any REST framework such as Jersey.
 * solrCore is assumed to be an already configured SolrJ client for the collection.
 */
public SolrDocumentList getData(int start, int pageSize, SolrQuery query) throws SolrServerException {
    query.setStart(start);   // start of your page
    query.setRows(pageSize); // number of rows per page
    LOG.info(ClientUtils.toQueryString(query, true));
    // POST is important if you are querying a huge result set; GET will fail for huge results.
    final QueryResponse queryResponse = solrCore.query(query, METHOD.POST);
    final SolrDocumentList solrDocumentList = queryResponse.getResults();
    if (isResultEmpty(solrDocumentList)) { // check if the list is empty
        LOG.info("hmm.. no records found for this query");
    }
    return solrDocumentList;
}
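To get the aggregations mentioned above (sum, max, and so on) via FieldStatsInfo, a small SolrJ sketch in Scala might look like this (the Solr URL, collection, and field names are assumptions):

// Ask Solr to compute the stats on the value field for one user, instead of aggregating in Spark.
import org.apache.solr.client.solrj.SolrQuery
import org.apache.solr.client.solrj.impl.HttpSolrClient

object UsageStats {
  def main(args: Array[String]): Unit = {
    val solr = new HttpSolrClient.Builder("http://localhost:8983/solr/usage").build()

    val query = new SolrQuery("*:*")
    query.addFilterQuery("user_id:12345")  // restrict to the requested user
    query.setGetFieldStatistics("value")   // enables stats=true&stats.field=value
    query.setRows(0)                       // only the aggregates are needed, not the docs

    val response = solr.query(query)
    val stats = response.getFieldStatsInfo.get("value")
    println(s"sum=${stats.getSum}, max=${stats.getMax}, count=${stats.getCount}")
    solr.close()
  }
}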
Also look at
my answer in "Create indexes in solr on top of HBase"
https://community.hortonworks.com/articles/7892/spark-dataframe-to-solr-cloud-runs-on-sandbox-232.html
Note: I think the same can be achieved with Elasticsearch as well, but from my own experience I'm more confident with Solr + SolrJ.
I see two possibilities:
Livy REST Server - a new REST server created by Cloudera. You can submit Spark jobs in a REST way. It is new and developed by Cloudera, one of the biggest Big Data / Spark companies, so it is likely to keep being developed rather than abandoned.
You can run the Spark Thrift Server and connect to it just like a normal database via JDBC. Here you've got the documentation. Workflow: read the data, preprocess it, and then share it through the Spark Thrift Server (see the JDBC sketch below).
If you want to isolate third-party apps from Spark, you can create a simple application that exposes a user-friendly endpoint and translates the queries it receives into Livy Spark jobs or SQL to be run against the Spark Thrift Server.
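For the Thrift Server route, a minimal sketch of the JDBC side (the host, port, table, and column names are assumptions; the Hive JDBC driver needs to be on the classpath):

// Query the Spark Thrift Server over JDBC like any other database.
import java.sql.DriverManager

object ThriftServerQuery {
  def main(args: Array[String]): Unit = {
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "user", "")
    val stmt = conn.createStatement()

    // Aggregate usage per user, as in the question (table/column names are hypothetical).
    val rs = stmt.executeQuery("SELECT user_id, SUM(value) AS total FROM usage GROUP BY user_id")
    while (rs.next()) {
      println(s"${rs.getString("user_id")} -> ${rs.getLong("total")}")
    }
    conn.close()
  }
}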
I have data streaming in from Spark Streaming, which I need to process and finally store in Cassandra. Earlier I was trying to use the Spark Cassandra connector, but it doesn't give access to the Spark streaming context object on the workers, so I had to use a separate Cassandra Scala driver. Hence I ended up with phantom. Now my question is: I have already defined the column family in Cassandra, so how do I do the select and update queries from Scala?
I have followed this documentation link1, but I don't understand why we need to give the table definition on the client (Scala code) side. Why can't we just give the keyspace, contact points, and column family and be done with it?
object CustomConnector {
  val hosts = Seq("IP1", "IP2")
  val Connector = ContactPoints(hosts).keySpace("KEYSPACE_NAME")
}

realTimeAgg.foreachRDD { x =>
  if (x.toLocalIterator.nonEmpty) {
    x.foreachPartition { partition =>
      // How do I do the select/insert against the Cassandra table here using phantom?
    }
  }
}
This is not yet possible using phantom. We are actively working on phantom-spark to allow you to do this, but at this stage it is still a few months away.
In the interim, you will have to rely on the Spark Cassandra connector and use its non type-safe API to achieve this. It's an unfortunate setup, but in the very near future this will be resolved.
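A rough sketch of that interim approach with the Spark Cassandra connector's untyped API (the table and column names are hypothetical, and spark.cassandra.connection.host is assumed to be set in the SparkConf):

// Write the aggregated stream directly into the existing column family;
// no table definition is needed on the client side.
import com.datastax.spark.connector._

case class UserCount(userId: String, count: Long)

// Assuming realTimeAgg is the DStream[UserCount] from the question's snippet:
realTimeAgg.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    // camelCase fields map to snake_case columns by the connector's default column mapper
    rdd.saveToCassandra("KEYSPACE_NAME", "user_counts", SomeColumns("user_id", "count"))
  }
}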
When it comes to creating a REST web service with 60+ APIs on akka-http, how do I choose whether to go with akka streams or akka actors?
In his post, Jos shows two ways to create an API on akka-http, but he doesn't say when to select one over the other.
This is a difficult question. Obviously, both approaches work. So to a certain degree it is a matter of taste/familiarity. So everything following now is just my personal opinion.
When possible, I prefer using akka-stream due to its more high-level nature and type safety. But whether this is a viable approach depends very much on the task of the REST API.
Akka-stream
If your REST API is a service that e.g. answers questions based on external data (e.g. a currency exchange rate API), it is preferable to implement it using akka-stream.
Another example where akka-stream would be preferable would be some kind of database frontend where the task of the REST API is to parse query parameters, translate them into a DB query, execute the query and translate the result according to the content-type requested by the user. In both cases, the data flow maps easily to akka-stream primitives.
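As an illustration of this stream-first style, here is a small sketch of a route whose response body is backed directly by a Source (the endpoint path and the data are assumptions):

// The handler stays a stream transformation; nothing is buffered in an actor.
import akka.http.scaladsl.model.{ContentTypes, HttpEntity}
import akka.http.scaladsl.server.Directives._
import akka.http.scaladsl.server.Route
import akka.stream.scaladsl.Source
import akka.util.ByteString

// Hypothetical data source, e.g. rows fetched from a DB rendered as CSV lines.
val rows: Source[ByteString, Any] =
  Source(1 to 100).map(i => ByteString(s"row-$i\n"))

val exportRoute: Route =
  path("export") {
    get {
      // The chunked response streams straight from the Source, with back-pressure.
      complete(HttpEntity(ContentTypes.`text/plain(UTF-8)`, rows))
    }
  }
// Bind with Http().bindAndHandle(exportRoute, ...) or Http().newServerAt(...).bind(exportRoute),
// depending on your akka-http version.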
Actors
An example where using actors would be preferable might be if your API allows querying and updating a number of persistent actors on a cluster. In that case either a pure actor-based solution or a mixed solution (parsing query parameters and translating results using akka-stream, do the rest using actors) might be preferable.
Another example where an actor-based solution might be preferable would be if you have a REST API for long-running requests (e.g. websockets), and want to deploy the processing pipeline of the REST API itself on a cluster. I don't think something like this is currently possible at all using akka-stream.
Summary
So to summarize: look at the data flow of each API and see if it maps cleanly to the primitives offered by akka-stream. If this is the case, implement it using akka-stream. Otherwise, implement using actors or a mixed solution.
Don't Forget Futures!
One addendum I would make to Rüdiger Klaehn's fine answer is to also consider the use case of a Future. The composability of Futures and the resource management of an ExecutionContext make Futures ideal for many, if not most, situations.
There is an excellent blog post describing when Futures are a better choice than Actors. Further, the back-pressure provided by Streams comes with some pretty hefty overhead.
Just because you're down the rabbit hole using akka-http does not mean all concurrency within your request handler has to be confined to Actors or Streams.
Route
Route inherently accommodates Futures in its type definition:
type Route = (RequestContext) ⇒ Future[RouteResult]
Therefore you can bake a Future directly into your Route using only functions and Futures, no Directives:
val requestHandler : RequestContext => HttpResponse = ???

// An implicit ExecutionContext must be in scope for Future.apply and map.
val route : Route =
  (requestContext) => Future(requestHandler(requestContext)) map RouteResult.Complete
onComplete Directive
The onComplete Directive allows you to "unwrap" a Future within your Route:
val route =
  get {
    val future : Future[HttpResponse] = ???

    onComplete(future) {
      case Success(httpResponse) => complete(httpResponse)
      case Failure(exception)    => complete(InternalServerError -> exception.toString)
    }
  }
}