HDFS file system, get latest folders using Scala API - scala

Our application reads data from several HDFS data folders. The folders get updated weekly/daily/monthly, so based on the update period we need to find the latest path and then read the data.
We would like to do this programmatically using Scala, so are there libraries available for it?
We could only find the package below, but we're wondering whether better libraries are available:
https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/package-summary.html

The linked package is the recommended way to use the HDFS API programmatically without going through the hadoop fs CLI scripts. Any other library you may find would be built on top of the same package.
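As a minimal sketch of that approach, assuming hadoop-client is on the classpath (the base path /data/input is hypothetical), you can list the children of a folder and pick the most recently modified directory:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object LatestFolder {
      def main(args: Array[String]): Unit = {
        // Picks up core-site.xml / hdfs-site.xml from the classpath
        val fs = FileSystem.get(new Configuration())

        // List the immediate children of the base data folder
        val statuses = fs.listStatus(new Path("/data/input"))

        // Keep only directories and take the most recently modified one
        val latest = statuses
          .filter(_.isDirectory)
          .maxBy(_.getModificationTime)

        println(s"Latest folder: ${latest.getPath}")
      }
    }

If your folder names encode dates (e.g. /data/input/2021-05-01), sorting by name may be more robust than relying on modification times.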

Related

Distribute a third-party jar dependency on a large-scale Spark application

We have a third-party jar file on which our Spark application depends. The jar file is ~15 MB. Since we want to deploy our Spark application on a large-scale cluster (~500 workers), we are concerned about distributing this third-party jar file. According to the Apache Spark documentation (https://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management), we have options such as HDFS, an HTTP server, the driver's HTTP server, and a local path for distributing the file.
We would prefer not to use a local path because it requires copying the jar file into every worker's Spark libs directory. On the other hand, if we use HDFS or an HTTP server, the workers fetching the jar file may effectively mount a DoS attack against our Spark driver server. So, what is the best way to address this challenge?
If you put the third-party jar in HDFS, why would it affect the Spark driver server? Each node should fetch the additional jar directly from HDFS, not from the Spark driver.
After examining the different proposed methods, we found that the best way to distribute the jar files is to copy them to all nodes (as @egor mentioned, too). Based on our deployment tools, we released a new Spark Ansible role that supports external jars. At the moment, Apache Spark does not provide an optimized solution: if you give a remote URL (HTTP, HTTPS, HDFS, FTP) as an external jar path, the Spark workers fetch the jar file every time a new job is submitted, so it is not optimal from a network perspective.
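For illustration, a hedged sketch of what this looks like on the application side, assuming the jar has been pre-copied to the same path on every worker (the path and app name are hypothetical). The local: scheme tells Spark the file already exists on each node, so it is used in place rather than shipped with each job:

    import org.apache.spark.sql.SparkSession

    object JarExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("third-party-jar-example")
          // "local:" means the jar is already present at this path on every
          // node, so no per-job network transfer takes place
          .config("spark.jars", "local:/opt/spark/extra-jars/thirdparty.jar")
          .getOrCreate()

        // ... application logic using classes from the third-party jar ...

        spark.stop()
      }
    }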

$CFG->dataroot in Moodle clustering

I'm trying to launch several instances of Moodle on a Kubernetes-like container platform to improve performance and make my installation reliable. I came across the following requirement:
$CFG->dataroot This MUST be a shared directory where each cluster node
is accessing the files directly. It must be very reliable,
administrators cannot manipulate files directly.
Which tool can be used to transparently sync this directory across several containers? What is the best way to meet this requirement?
I successfully resolved the issue by using the ObjectFS plugin for S3 storage and by moving sessions to the database instead of the file system.

OpenShift cloud computing configuration: is it possible completely on the cloud?

Is it possible to build a big data application on the cloud with Red Hat's PaaS OpenShift? I'm looking into how to build a Scala application with Hadoop (HDFS), Spark, and Apache Mahout on the cloud, but I can't find anything about it. I've seen something with Hortonworks, but nothing clear about how to install it in an OpenShift environment, or how to add an HDFS node in the cloud either. Is it possible with OpenShift?
It's possible on Amazon, but my question is: is it possible on OpenShift?
It really depends on what you're ultimately trying to achieve. I know you mention building a big data application on OpenShift with Scala, but what will the application ultimately be doing?
I've gotten Hadoop running in a gear before, but if you want a better example, check out this quickstart to get an idea of how it's done: https://github.com/ryanj/flask-hbase-todos. I know it's not Scala, but here's a good article that shows how to put together a Scala app: https://www.openshift.com/blogs/building-distributed-and-event-driven-applications-in-java-or-scala-with-akka-on-openshift.
What will the application ultimately be doing?
Forecasting football match results for several football leagues: a web application (Ruby), plus statistical computation and data mining calculations in Scala with Apache frameworks (Spark & Mahout).
We get the data via CSV files, process it, and save it in a NoSQL DB (Cassandra). And all of this on the cloud (OpenShift); that's the idea.
I've seen the info at https://github.com/ryanj/flask-hbase-todos. I'll try it this way, but with Scala.
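A rough sketch of the CSV-to-Cassandra leg of that pipeline, assuming Spark with the DataStax spark-cassandra-connector on the classpath (the file path, keyspace, and table names are hypothetical):

    import org.apache.spark.sql.SparkSession

    object CsvToCassandra {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("football-forecast-ingest")
          .config("spark.cassandra.connection.host", "127.0.0.1")
          .getOrCreate()

        // Read the raw match results from CSV
        val matches = spark.read
          .option("header", "true")
          .csv("/data/matches.csv")

        // Persist them to Cassandra for later statistical processing
        matches.write
          .format("org.apache.spark.sql.cassandra")
          .options(Map("keyspace" -> "football", "table" -> "matches"))
          .mode("append")
          .save()

        spark.stop()
      }
    }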

Search Engine for MongoDB?

I'm using MongoDB to store data, but for search I would prefer to use Elasticsearch or something similar. However, I haven't found a solution yet, because I have read about some problems and issues with the RIVER plugin.
What are your experiences and recommendations?
I'm using Elasticsearch with MongoDB. I tried Solr, but I couldn't get the integration working. Both tools are built on Lucene, so they have approximately the same query syntax.
There are some tutorials, but they didn't work for me. I believe the reason is that GitHub no longer allows uploading and downloading binary files, so you cannot use the ./plugin command. To overcome this problem, you have to git clone the repositories and build the .jar files yourself. To do that, use Apache Maven and run mvn package to create the packages.
Add both the MongoDB river and the Mapper Attachments plugin to Elasticsearch, and make sure you follow the compatible versions according to the river's version table.
After that, everything will work fine.
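Once the river is indexing, querying works like any other Elasticsearch index. A minimal Scala sketch over the REST API (the index name mongoindex and the title field are hypothetical; 9200 is Elasticsearch's default HTTP port):

    import scala.io.Source

    object SearchExample {
      def main(args: Array[String]): Unit = {
        // URI search against the index populated by the MongoDB river
        val url = "http://localhost:9200/mongoindex/_search?q=title:test"
        val response = Source.fromURL(url).mkString
        println(response) // raw JSON hits from Elasticsearch
      }
    }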

HDFS web interface alternative

Alright, this is annoying! I am new to Hadoop, and I am trying to find a decent alternative to the basic HDFS web interface. I tried the Hadoop Eclipse plugin, but it seems it's outdated already and it's a pain to set up correctly. I have Cloudera's distribution installed and I heard about Cloudera Desktop, but it's no longer available. Can anybody suggest a decent alternative to the HDFS web interface where I can easily upload and download files to HDFS via a GUI? P.S. I am running everything locally, no cluster involved. I've searched a lot, but nothing seems to point in the right direction.
You can use WebHDFS, whose REST API supports the complete FileSystem interface for HDFS: http://hadoop.apache.org/docs/r1.0.4/webhdfs.html
OR
You can integrate Hadoop with Hoop (HDFS over HTTP), which is used to access HDFS via the HTTP protocol. Hoop provides access to all Hadoop Distributed File System (HDFS) operations (read and write) over HTTP/S.
For more details, please refer to:
http://bigobject.blogspot.in/2013/03/hoop-https-over-hdfs.html
Or you can also use HttpFS as an alternative to Hoop:
http://bigobject.blogspot.in/2013/03/apache-hadoop-httpfs-service-that.html
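As a quick illustration of the WebHDFS route, a hedged Scala sketch that lists a directory over the REST API (the host and path are assumptions; 50070 was the default NameNode HTTP port in Hadoop 1.x):

    import scala.io.Source

    object WebHdfsList {
      def main(args: Array[String]): Unit = {
        // LISTSTATUS returns a JSON array of FileStatus objects for the path
        val url = "http://localhost:50070/webhdfs/v1/user/hadoop?op=LISTSTATUS"
        val json = Source.fromURL(url).mkString
        println(json)
      }
    }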