Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 6 years ago.
I am using the MeteorJS framework for one of my projects.
I have built a basic web app once before using MeteorJS, and it works perfectly fine when it's just client, server and MongoDB.
In this project, I want the MongoDB instance (which comes built in with MeteorJS) to be populated with data from Apache Spark.
Basically, Apache Spark will process some data and inject it into MongoDB.
Is this doable?
Could you please point me to the right tutorial for this?
How complex is this to implement?
Thanks in advance for your help.
Yes, this is very possible and quite easy. That said, it won't be done via MeteorJS; it would be part of the Apache Spark job and would be configured there.
Using the MongoDB Spark Connector, taking data from a DataFrame or an RDD and saving it to MongoDB is easy.
First you would configure how and where the data is written:
// Configure where to save the data
import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.WriteConfig
val writeConfig = WriteConfig(Map("uri" -> "mongodb://localhost/databaseName.collectionName"))
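With RDDs, you should convert the records into Documents via a map function, e.g.:
import org.bson.Document
// Assumes rdd is an RDD[String] of JSON strings; adapt the mapping to your own record type
val documentRDD = rdd.map(data => Document.parse(data))
MongoSpark.save(documentRDD, writeConfig)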
If you are using DataFrames it's much easier as you can just provide a DataFrameWriter and writeConfig:
MongoSpark.save(dataFrame.write, writeConfig)
There is more information in the documentation, and there are examples in the GitHub repo.
Closed. This question is opinion-based. It is not currently accepting answers.
Closed 2 years ago.
What is the best way to insert the content of the file (.csv, up to 800 MBytes) uploaded by a web-application user into the PostgreSQL database?
I see three options:
Insert statement for each file row
Insert statement for multiple rows (insert batches containing e.g. 1000 rows)
Store a temp file and load it using the PostgreSQL COPY command (I have a shared directory between the servers where the application and the database are located)
Which way is better? Or is there maybe another way?
Additional details:
I use Java 8 and JSP
Database: PostgreSQL 9.5
To handle multipart data I use Apache Commons FileUpload and Apache Commons CSV to parse the file
Definitely NOT a single insert for each row. Relying on the PostgreSQL COPY command should be the fastest option.
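For comparison, here is a minimal sketch of the second option from the question (batches of e.g. 1000 rows). It is in Python and uses the standard-library sqlite3 module as a stand-in for PostgreSQL, so it runs without a server; the table name upload, its columns, the generated CSV content and the helper insert_in_batches are all made up for illustration. The batching pattern is the point here, not the driver, and COPY should still be the faster route on PostgreSQL.

```python
import csv
import io
import sqlite3
from itertools import islice


def insert_in_batches(conn, rows, batch_size=1000):
    """Insert rows in fixed-size batches, one transaction per batch."""
    rows = iter(rows)
    total = 0
    while True:
        # Pull at most batch_size rows without loading the whole file.
        batch = list(islice(rows, batch_size))
        if not batch:
            break
        with conn:  # commits the batch, or rolls it back on error
            conn.executemany(
                "INSERT INTO upload (id, name) VALUES (?, ?)", batch
            )
        total += len(batch)
    return total


# Hypothetical uploaded file: 2500 lines of "id,name".
uploaded = io.StringIO("".join(f"{i},user{i}\n" for i in range(2500)))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE upload (id INTEGER, name TEXT)")

# Parse lazily and convert the id column to int before inserting.
parsed = ((int(r[0]), r[1]) for r in csv.reader(uploaded))
inserted = insert_in_batches(conn, parsed)
stored = conn.execute("SELECT COUNT(*) FROM upload").fetchone()[0]
print(inserted, stored)
```

Because the rows are consumed through an iterator, memory use stays bounded by the batch size even for a large upload.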
Closed. This question is opinion-based. It is not currently accepting answers.
Closed 5 years ago.
I am looking into using Apache Spark with a Cassandra database, together with Scala and Akka, and I have been trying to find out whether I could drop my existing Cassandra driver and use Spark exclusively. Does it have a means of finding records by partition key and so on, or can it only take the entire table and filter it? I know you could filter down to a single record, but that means iterating through a potentially massive table. I want Spark to essentially issue CQL WHERE clauses and let me fetch only a single row, or a set of rows, if I choose. If this is not possible, then I need to stick with my existing driver for the normal DB operations and use Spark for the analytics.
It is possible to issue a CQL WHERE clause in Spark with CassandraRDD.where().
To filter rows, you can use the filter transformation provided by Spark. However, this approach causes all rows to be fetched from Cassandra and then filtered by Spark. Also, some CPU cycles are wasted serializing and deserializing objects that wouldn't be included in the result. To avoid this overhead, CassandraRDD offers the where method, which lets you pass arbitrary CQL condition(s) to filter the row set on the server.
Here is a simple example of how to use CassandraRDD.where().
If you have a table
CREATE TABLE test.data (
id int PRIMARY KEY,
data text
);
You can use Spark to select and filter by primary key:
import com.datastax.spark.connector._ // adds cassandraTable to the SparkContext
sc.cassandraTable("test", "data").select("id", "data").where("id = ?", 1).collect().foreach(println)
More on : https://github.com/datastax/spark-cassandra-connector/blob/master/doc/3_selection.md
However, with the Cassandra driver you have more flexibility and control over your queries, and Spark will cost you more CPU, time and memory than the Cassandra driver.
As RussS says:
"While this is correct and the where clause allows you to run a single partition request, This is orders of magnitude more expensive than running analogous queries directly through the Java Driver"
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 3 years ago.
We have terabytes of data stored in HDFS, comprising customer data and behavioral information. Business analysts want to slice and dice this data using filters.
These filters are similar to Spark RDD filters. Some examples of the filter are:
age > 18 and age < 35, date between 10-02-2015, 20-02-2015, gender=male, country in (UK, US, India), etc. We want to integrate this filter functionality in our JSF (or Play) based web application.
Analysts would like to experiment by applying/removing filters, and verifying if the count of the final filtered data is as desired. This is a repeated exercise, and the maximum number of people using this web application could be around 100.
We are planning to use Scala as a programming language for implementing the filters. The web application would initialize a single SparkContext at the load of the server, and every filter would reuse the same SparkContext.
Is Spark good for this use case of interactive querying through a web application? Also, is the idea of sharing a single SparkContext a workaround? The other alternative we have is Apache Hive with the Tez engine, using the ORC compressed file format and querying over JDBC/Thrift. Is this option better than Spark for the given job?
It's not the best use case for Spark, but it is completely possible. The latency can be high though.
You might want to check out Spark Jobserver, it should offer most of your required features. You can also get an SQL view over your data using Spark's JDBC Thrift server.
In general I'd advise using SparkSQL for this, it already handles a lot of the things you might be interested in.
Another option would be to use Databricks Cloud, but it's not publicly available yet.
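To make the SQL suggestion a bit more concrete, here is a rough sketch of how "apply/remove a filter, then check the count" maps onto a WHERE clause. It is written in Python against an in-memory sqlite3 table purely so it can run on its own; with Spark you would send the same kind of query through SparkSQL or the Thrift server. The customers table, its rows and the helper count_matching are invented for the example.

```python
import sqlite3


def count_matching(conn, filters):
    """Count rows matching every active filter (combined with AND).

    Each filter is a (clause, params) pair. The clause text is fixed by
    the application; only the values are bound as query parameters.
    """
    clauses = [clause for clause, _ in filters]
    params = [p for _, ps in filters for p in ps]
    where = " AND ".join(clauses) if clauses else "1 = 1"
    sql = f"SELECT COUNT(*) FROM customers WHERE {where}"
    return conn.execute(sql, params).fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (age INTEGER, gender TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        (25, "male", "UK"),
        (40, "male", "US"),
        (30, "female", "India"),
        (19, "male", "France"),
    ],
)

# Filters taken from the question's examples.
age_filter = ("age > ? AND age < ?", [18, 35])
gender_filter = ("gender = ?", ["male"])
country_filter = ("country IN (?, ?, ?)", ["UK", "US", "India"])

all_three = count_matching(conn, [age_filter, gender_filter, country_filter])
without_country = count_matching(conn, [age_filter, gender_filter])
no_filters = count_matching(conn, [])
print(all_three, without_country, no_filters)  # prints: 1 2 4
```

Adding or removing a filter is then just adding or removing one entry from the list, which fits the repeated experimentation the analysts want.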
Analysts would like to experiment by applying/removing filters, and verifying if the count of the final filtered data is as desired. This is a repeated exercise, and the maximum number of people using this web application could be around 100.
Apache Zeppelin provides a framework for interactively ingesting and visualizing data (via a web application) using Apache Spark as the back end. Here is a video demonstrating the features.
Also, the idea of sharing a single SparkContext, is this a work-around approach?
It looks like that project uses a single SparkContext for low-latency query jobs.
I'd like to know which solution you chose in the end.
I have two propositions:
Following the Zeppelin idea of #quickinsights, there is also the interactive notebook Jupyter, which is well established now. It was first designed for Python, but specialized kernels can be installed. I tried using Toree a couple of months ago. The basic installation is simple:
pip install jupyter
pip install toree
jupyter toree install
But at the time I had to do a couple of low-level tweaks to make it work (such as editing /usr/local/share/jupyter/kernels/toree/kernel.json). It worked, though, and I could use a Spark cluster from a Scala notebook. Check this tuto; it fits what I remember.
Most (all?) docs on Spark talk about running an app with spark-submit or using spark-shell for interactive usage (sorry, but the Spark and Scala shells are so disappointing...). They never talk about using Spark in an interactive app, such as a web app. It is possible (I tried), but there are indeed some issues to check, such as sharing the SparkContext as you mentioned, and also some issues with managing dependencies. You can check the two getting-started prototypes I made to use Spark in a Spring web app. They are in Java, but I would strongly recommend using Scala. I did not work with this long enough to learn a lot. However, I can say that it is possible, and it works well (tried on a 12-node cluster with the app running on an edge node).
Just remember that the Spark driver, i.e. where the code using RDDs runs, should be physically on the same cluster as the Spark nodes: there is a lot of communication between the driver and the workers.
Apache Livy enables programmatic, fault-tolerant, multi-tenant submission of Spark jobs from web/mobile apps (no Spark client needed). So, multiple users can interact with your Spark cluster concurrently.
We had a similar problem at our company. We have ~2-2.5 TB of data in the form of logs and had some basic analytics to do on that data.
We used the following:
Apache Flink for streaming data from the source to HDFS via Hive.
Zeppelin configured on top of HDFS.
A SQL interface for joins, and a JDBC connection to HDFS via Hive.
Spark for putting batches of data offline.
You can use Flink + Hive-HDFS.
Filters can be applied via SQL (yes, everything is supported in the latest releases).
Zeppelin can automate report generation, and its ${sql-variable} feature lets you apply filters without actually modifying the SQL queries.
Check it out. I am sure you'll find your answer :)
Thanks.
Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 8 years ago.
I will ask my question by way of an example.
If we use Oracle as a database and want to get data from it, what we need to know is SQL. With the help of SQL we can get the data from Oracle.
If we use MongoDB as a database, do we have to know NoSQL?
In simpler terms:
SQL for Oracle, and NoSQL for MongoDB? Am I right?
There is no such thing as the NoSQL query language. All the databases usually grouped under the catch-all label "NoSQL" are completely different technologies which are used in completely different ways.
MongoDB has a query language which is based on JavaScript object notation. It doesn't have much to do with SQL, nor with the query languages of most other NoSQL databases. An interactive tutorial can be found on the MongoDB website. It should give you a basic understanding of how the query language works. The full documentation is a good source of in-depth knowledge.
Keep in mind that even when you have learned everything about MongoDB and its query language, you still know absolutely nothing about other NoSQL databases like Redis, Neo4j, CouchDB, etc. These are as different from MongoDB (and as different from each other) as MongoDB is different from SQL databases.
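To give a feel for what "based on JavaScript object notation" means, here is a small sketch in Python. The query dict has the same shape as a MongoDB filter document ($gt and $lt are real MongoDB comparison operators), and the SQL in the comment is the rough equivalent. The matches function is only a toy written for this example so the snippet runs without a database; it is not how MongoDB evaluates queries, and the users data is made up.

```python
import json

# Rough SQL equivalent: SELECT * FROM users WHERE age > 18 AND country = 'UK'
query = {"age": {"$gt": 18}, "country": "UK"}


def matches(doc, query):
    """Toy matcher for equality and $gt/$lt only; not MongoDB's real engine."""
    for field, cond in query.items():
        value = doc.get(field)
        if isinstance(cond, dict):
            # Operator form, e.g. {"$gt": 18}
            if "$gt" in cond and not (value is not None and value > cond["$gt"]):
                return False
            if "$lt" in cond and not (value is not None and value < cond["$lt"]):
                return False
        elif value != cond:
            # Plain value means an equality test
            return False
    return True


users = [
    {"name": "Ann", "age": 30, "country": "UK"},
    {"name": "Bob", "age": 17, "country": "UK"},
    {"name": "Cid", "age": 45, "country": "US"},
]
selected = [u["name"] for u in users if matches(u, query)]
print(json.dumps(query))
print(selected)  # prints: ['Ann']
```

The query is itself a piece of structured data rather than a string in a separate language, which is the main practical difference from writing SQL.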
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
Does anybody know a good solution for export/import in Redis?
Generally I need to dump the DB from a server (and possibly edit the dump) and load it onto another one (e.g. localhost).
Maybe some scripts?
Redis supports two persistence file formats: RDB and AOF.
RDB is a dump like the one you asked for. You can call SAVE to force an RDB dump. It will be stored under the name given by your dbfilename setting, or as dump.rdb in the current working directory if that setting is missing.
More Info:
http://redis.io/topics/persistence
If you want a server to load the content from another server, no dump is required. You can use slaveof to sync the data and, once it's up to date, call slaveof no one.
More information on replication can be found in this link: http://redis.io/topics/replication
You can try my dump utility, rdd. It extracts data from or inserts data into Redis, and can split, merge, filter and rename.