I am looking into getting into Apache Spark to use with a Cassandra database, with Scala and Akka, and I have been trying to find out whether I could actually drop my existing Cassandra driver and use Spark exclusively. Does it have a means to find records by partition key and so on, or can it only take the entire table and filter it? I know you could filter down to a single record, but that means iterating through a potentially massive table. I want Spark to essentially issue CQL WHERE clauses and allow me to fetch only a single row if I choose, or a set of rows. If this is not possible, then I need to stick with my existing driver for the normal database operations and use Spark for the analytics.
It is possible to issue a CQL WHERE clause in Spark with CassandraRDD.where().
To filter rows, you can use the filter transformation provided by Spark. However, this approach causes all rows to be fetched from Cassandra and then filtered by Spark; some CPU cycles are also wasted serializing and deserializing objects that wouldn't be included in the result. To avoid this overhead, CassandraRDD offers the where method, which lets you pass arbitrary CQL condition(s) to filter the row set on the server.
Here is a simple example of how to use CassandraRDD.where(). If you have a table:
CREATE TABLE test.data (
id int PRIMARY KEY,
data text
);
you can use Spark to select and filter by the primary key:

import com.datastax.spark.connector._ // provides sc.cassandraTable

sc.cassandraTable("test", "data").select("id", "data").where("id = ?", 1).collect.foreach(println)
More on this: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/3_selection.md
With the Cassandra driver, however, you have more flexibility and control over your queries, and Spark will cost you more CPU, time, and memory than the driver.
As RussS says:
"While this is correct, and the where clause allows you to run a single-partition request, this is orders of magnitude more expensive than running analogous queries directly through the Java Driver."
I am not an expert on the Spark SQL API, nor on the underlying RDD one.
But, knowing the Catalyst optimization engine, I would expect Spark to try to minimize the in-memory effort.
This is my situation:
I have, let's say, two tables:
TABLE GenericOperation (ID, CommonFields...)
TABLE SpecificOperation (OperationID, SpecificFields...)
They are both quite huge (~500M each; not big data, but unfeasible to hold as a whole in memory on a standard application server).
That said, suppose I have to retrieve using Spark (part of a larger use case) all the SpecificOperation instances that match some particular condition on fields that belong to GenericOperation.
This is the code that I am using:
val gOps = spark.read.jdbc(db.connection, "GenericOperation", db.properties)
val sOps = spark.read.jdbc(db.connection, "SpecificOperation", db.properties)
val joined = sOps.join(gOps).where("ID = OperationID")
joined.where("CommonField= 'SomeValue'").select("SpecificField").show()
The problem is, when I run the above, I can see from SQL Profiler that Spark does not execute the join on the database; rather, it retrieves all the OperationID values from SpecificOperation, and I assume it then runs the whole merge in memory. Since no filter is applicable on SpecificOperation, such a retrieval would bring far too much data to the end system.
Is it possible to write the above so that the join is delegated directly to the DBMS?
Or does it depend on some magic Spark configuration I am not aware of?
Of course, I could simply hardcode the join as a subquery when retrieving, but that's not feasible in my case: statements have to be created at runtime starting from simple building blocks. Hence, I need to implement this starting from two spark.sql.DataFrames that are already built up.
As a side note, I am running this with Spark 2.3.0 for Scala 2.11, against a SQL Server 2016 database instance.
Is it possible to write the above so that the join is delegated directly to the DBMS? Or does it depend on some magic Spark configuration I am not aware of?
Excluding statically generated queries (see "In Apache Spark 2.0.0, is it possible to fetch a query from an external database (rather than grab the whole table)?"), Spark doesn't support join pushdown. Only predicates and selection can be delegated to the source.
There is no magic configuration or code that could even support this type of process.
In general, if the server can handle the join, the data is usually not large enough to benefit from Spark.
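For completeness, the statically generated query approach mentioned above looks like this: spark.read.jdbc accepts an aliased subquery in place of a table name, so SQL Server performs the join and only the result comes back. A rough sketch, reusing the table and column names from the question (the alias "pushed" is arbitrary):

// The JDBC source treats the parenthesized subquery as the "table" to read,
// so the database executes the join and the filter.
val pushedJoin =
  """(SELECT s.SpecificField
    |   FROM SpecificOperation s
    |   JOIN GenericOperation g ON g.ID = s.OperationID
    |  WHERE g.CommonField = 'SomeValue') AS pushed""".stripMargin

val result = spark.read.jdbc(db.connection, pushedJoin, db.properties)
result.show()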
I have a bunch of data in files stored in Amazon S3 and am planning to use it to build a Data Vault in Redshift. My first question is whether the right approach is to build the DV and Data Marts all in Redshift, or whether I should treat S3 as my Data Lake and have only the Data Marts in Redshift.
In my architecture I'm currently considering the former (i.e. S3 Data Lake + Redshift Vault and Marts). However, I don't know if I can create ETL processes directly in Redshift to populate the Marts with data from the Vault, or if I'll have to, for example, use Amazon EMR to process the raw data in S3, generate new files there, and finally load them into the Marts.
So, my second question is: What should the ETL strategy be? Thanks.
Apologies! I don't have the reputation to comment, which is why I am writing in the answer section. I am in exactly the same boat as you: trying to perform my ETL operations in Redshift, and as of now I have 3 billion rows and expect that to grow drastically. Right now I am loading data into the data marts in Redshift using DML statements that are called at regular intervals from AWS Lambda. In my opinion, it is very difficult to create a Data Vault in Redshift.
I'm a bit late to the party and no doubt you've resolved this, but it might still be relevant. Just thought I'd share my opinion on this. One solution is to use S3 and Hive as a Persistent Staging Area (a Data Lake, if you will) to land the data from the sources. Construct your DV entirely in Redshift. You will still need a staging area in Redshift in order to ingest the files from S3, ensuring the hashes are calculated on the way into the Redshift staging tables (that's where EMR/Hive comes in). You could add the hashes directly in Redshift, but it could put Redshift under duress depending on volume. Push data from staging into the DV via plain old bulk insert and update statements, and then virtualise your marts in Redshift using views.
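A minimal sketch of that hash-calculation step, assuming Spark on EMR; the S3 paths and the business key column ("customer_id") are placeholders:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat_ws, md5}

// Sketch only: compute a Data Vault hash key and hash diff on EMR before staging.
// Bucket paths and the "customer_id" business key are placeholders.
val spark = SparkSession.builder().appName("dv-staging-hashes").getOrCreate()

val raw = spark.read.parquet("s3://my-bucket/raw/customers/")
val staged = raw
  .withColumn("hub_customer_hk", md5(col("customer_id").cast("string")))
  .withColumn("hash_diff", md5(concat_ws("||", raw.columns.map(col): _*)))

// Land the hashed files back in S3 so Redshift can COPY them into the staging tables.
staged.write.mode("overwrite").parquet("s3://my-bucket/staging/customers/")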
You could use any data pipeline tool to achieve this; Lambda could be a candidate for you, as could another workflow/pipeline tool.
I strongly recommend you check out Matillion for Redshift: https://redshiftsupport.matillion.com/customer/en/portal/articles/2775397-building-a-data-vault
It's fantastic and affordable for Redshift ETL and has a Data Vault sample project.
I recommend that you read this article and consider following the design explained there in much detail, in case AWS is your potential tech stack.
I do not want to copy-paste the article here, but it is really a recipe for how to implement a Data Vault, and I believe it addresses your requirements.
S3 is just a key-value store for files; you can't create a DV or DW there. So you can use Redshift or EMR to process the data into the relational format for your DV. Which one you choose is up to you; EMR has specific use cases, IMO.
I am using the MeteorJS framework for one of my projects.
I have built a basic web app with MeteorJS before, and it works perfectly fine when it's just Client, Server, and MongoDB.
In this project, I want MongoDB (which comes built in with MeteorJS) to be populated with data from Apache Spark.
Basically, Apache Spark will process some data and inject it into MongoDB.
Is this doable?
Can you please point me to the right tutorial for this?
How complex is this to implement?
Thanks in advance for your help.
Yes, this is very possible and quite easy. That said, it won't be via MeteorJS; it would be part of the Apache Spark job and would be configured there.
Using the MongoDB Spark Connector, taking data from a DataFrame or an RDD and saving it to MongoDB is easy.
First you would configure how and where the data is written:
// Configure where to save the data
val writeConfig = WriteConfig(Map("uri" -> "mongodb://localhost/databaseName.collectionName"))
With RDDs you should convert the elements into Documents via a map function, e.g.:

val documentRDD = rdd.map(data => new Document("value", data)) // wrap each element in a BSON Document (org.bson.Document)
MongoSpark.save(documentRDD, writeConfig)
If you are using DataFrames it's much easier, as you can just provide a DataFrameWriter and the writeConfig:
MongoSpark.save(dataFrame.write, writeConfig)
There is more information in the documentation, and there are examples in the GitHub repo.
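Putting it together, a minimal end-to-end job might look like the sketch below. The URI and the database/collection names ("meteor", "results") are placeholders for whatever your Meteor app actually reads from.

import org.apache.spark.sql.SparkSession
import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.WriteConfig

// Sketch of a Spark job that writes its output into the MongoDB collection the
// Meteor app reads from. URI, database and collection names are placeholders.
object SaveToMeteorMongo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("spark-to-meteor-mongo").getOrCreate()
    import spark.implicits._

    // Stand-in for whatever processing the real Spark job does.
    val results = Seq((1, "alpha"), (2, "beta")).toDF("id", "value")

    val writeConfig = WriteConfig(Map("uri" -> "mongodb://localhost/meteor.results"))
    MongoSpark.save(results.write.mode("append"), writeConfig)

    spark.stop()
  }
}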
Please help me to setup connectivity from PostgreSQL to Informix (latest versions for both). I would like to be able to perform a query on Informix from PostgreSQL. I am looking for a solution that will not require data exports (from Informix) and imports (to PostgreSQL) for every query.
I am very new to PostgreSQL and need detailed instructions.
As Chris Travers said, what you're seeking to do is not easy to do.
In theory, if you were working with Informix and needed to access PostgreSQL, you could (buy and) use the Enterprise Gateway Manager (EGM) and use the ODBC driver for PostgreSQL to allow Informix to connect to PostgreSQL. The EGM would do its utmost to appear to be another Informix database while actually accessing PostgreSQL. (I've not validated that PostgreSQL is supported, but EGM basically needs an ODBC driver to work, so there shouldn't be any problem — 'famous last words', probably.) This will include an emulation of 2PC (two-phase commit); not perfect, but moderately close.
For the converse connection (working with PostgreSQL and connecting to Informix), you will need to look to the PostgreSQL tool suite — or other sources.
You haven't said which version you are using. There are some limitations to be aware of but there are a wide range of choices.
Since you say this is import/export, I will assume that read-only options are not sufficient. That rules out PostgreSQL 9.1's foreign data wrapper system.
Depending on your version David Fetter's DBI-Link may suit your needs since it can execute queries on remote tables (see https://github.com/davidfetter/DBI-Link). It hasn't been updated in a while but the implementation should be pretty stable and usable across versions. If that fails you can write stored procedures in an untrusted language (PL/PythonU, PL/PerlU, etc) to connect to Informix and run the queries there. Note there are limits regarding transaction handling you will run into in this case so you may want to run any queries on the other tables using deferred constraint triggers so everything gets run at commit time.
Edit: A cleaner way occurred to me: use foreign data wrappers for import and a separate client app for export.
In this approach, you are going to have four basic components but this will be loosely coupled and subject to proper transactional controls. You can even use two-phase commit if you want. The four components are (not providing a complete working example here but at least a roadmap to one):
Foreign data wrappers for data import, allowing you to see data from Informix.
Views of data to be exported.
An external application which manages the export aspect, written in a language of your choice. It listens on a channel with LISTEN export_informix; (a sketch of such a listener follows below).
Triggers on the underlying tables that make up the view of data to be exported, which raise a NOTIFY export_informix.
The notifications are raised on commit, so basically you have two stages to your transaction this way:
Write data in PostgreSQL, flag data to be exported. Commit.
Read data from PostgreSQL, export to Informix. Commit on both sides (TPC?).
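A minimal sketch of the external listener (component 3), assuming plain JDBC with the PostgreSQL driver from Scala; the connection details and the "export_view" name are placeholders:

import java.sql.DriverManager
import org.postgresql.PGConnection

// Sketch only: wait for NOTIFY export_informix, then read the flagged rows and
// push them to Informix. Connection string, credentials and "export_view" are placeholders.
val conn = DriverManager.getConnection("jdbc:postgresql://localhost/mydb", "user", "password")
conn.createStatement().execute("LISTEN export_informix")
val pgConn = conn.unwrap(classOf[PGConnection])

while (true) {
  val notifications = pgConn.getNotifications(5000) // block for up to 5 seconds
  if (notifications != null && notifications.nonEmpty) {
    val rs = conn.createStatement().executeQuery("SELECT * FROM export_view")
    while (rs.next()) {
      // write the row to Informix via its JDBC driver, then flag it as exported
    }
    rs.close()
  }
}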
I know almost nothing about HBase. Sorry for the basic questions.
Imagine I have a table of 100 billion rows with 10 int columns, one datetime column, and one string column.
Does HBase allow querying this table and grouping the results by key (even a composite key)?
If so, does it have to run a MapReduce job to do it?
How do you feed it the query?
Can HBase, in general, perform real-time-like queries on a table?
Data aggregation in HBase intersects with the "real-time analytics" need. While HBase is not built for this type of functionality, there is a lot of demand for it, so a number of ways to do it are being / will be developed.
1) Register the HBase table as an external table in Hive and do the aggregations there. Data will be accessed via the HBase API, which is not that efficient. "Configuring Hive with HBase" is a discussion about how it can be done.
This is the most powerful way to group by HBase data. It does imply running MR jobs, but by Hive, not by HBase.
2) You can write your own MR job working with the HBase data sitting in HFiles in HDFS. This will be the most efficient way, but it is not simple, and the data you process would be somewhat stale. It is the most efficient because the data will not be transferred via the HBase API; instead it will be accessed straight from HDFS in a sequential manner.
3) The next version of HBase will contain coprocessors, which will be able to do aggregations inside specific regions. You can think of them as a kind of stored procedure in the RDBMS world.
4) An in-memory, inter-region MR job parallelized within one node is also planned for future HBase releases. It will enable somewhat more advanced analytical processing than coprocessors.
FAST RANDOM READS = PRE-PREPARED data sitting in HBase!
Use HBase for what it is...
1. A place to store a lot of data.
2. A place from which you can do super fast reads.
3. A place where SQL is not going to do you any good (use Java).
Although you can read data from HBase and do all sorts of aggregates right in Java data structures before you return your aggregated result, it's best to leave the computation to MapReduce. From your questions, it seems as if you want the source data for the computation to sit in HBase. If that is the case, the route you want to take is to have HBase as the source data for a MapReduce job, do the computations on that, and return the aggregated data. But then again, why would you read from HBase to run a MapReduce job? Just leave the data sitting in HDFS/Hive tables, run MapReduce jobs on them, and THEN load the data into HBase tables "pre-prepared" so that you can do super fast random reads from it.
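To make the first sentence concrete, a client-side scan-and-aggregate over the HBase Java client (used here from Scala) might look roughly like the sketch below; the table name, column family, and qualifier are placeholders, and for anything non-trivial the MapReduce route above is preferable:

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Scan}
import org.apache.hadoop.hbase.util.Bytes
import scala.collection.JavaConverters._

// Sketch only: scan an HBase table and sum one numeric column on the client side.
// "mytable", "cf" and "amount" are placeholder names.
val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
val table = connection.getTable(TableName.valueOf("mytable"))
val scanner = table.getScanner(new Scan())

var total = 0L
for (result <- scanner.asScala) {
  val cell = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("amount"))
  if (cell != null) total += Bytes.toLong(cell)
}

scanner.close()
table.close()
connection.close()
println(s"Sum of cf:amount: $total")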
Once you have the pre-aggregated data in HBase, you can use Crux (http://github.com/sonalgoyal/crux) to further drill, slice, and dice your HBase data. Crux supports composite and simple keys, with advanced filters and group by.