Apache Spark and MongoDB integration using pymongo

I have a problem working with Apache Spark and MongoDB using the PyMongo library. I am processing thousands of records, and for each record I need to read its corresponding data from the database, update certain info, and save it back to the database. Because of these reads and writes, I chose to use PyMongo instead of the Spark-Mongo Connector, which apparently is not well suited for this task. Unfortunately, when performing writes, MongoDB always reports the write as successful, but when I check the database, some updates were not performed. After debugging for over a week, I realized that by restricting the server to a single core, all writes were successfully written to the database, but the application became tremendously slow.
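For reference, here is a stripped-down sketch of the per-record cycle each worker runs; the database, collection, and field names below are placeholders, not my real schema.

    # Sketch of the read/update/write cycle described above (placeholder names).
    from pymongo import MongoClient
    from pymongo.write_concern import WriteConcern

    def process_partition(records):
        # one client per Spark partition, not per record
        client = MongoClient("mongodb://localhost:27017")
        coll = client["mydb"].get_collection(
            "mycoll",
            write_concern=WriteConcern(w=1, j=True),  # acknowledged, journaled writes
        )
        for record in records:
            doc = coll.find_one({"_id": record["_id"]})
            if doc is None:
                continue
            new_total = doc.get("total", 0) + record["amount"]  # "update certain info"
            result = coll.update_one({"_id": doc["_id"]},
                                     {"$set": {"total": new_total}})
            # result.acknowledged == True only means the server accepted the write;
            # with several workers touching the same _id, this read-then-$set cycle
            # can still silently lose one worker's change to another's.
        client.close()

    # rdd.foreachPartition(process_partition)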
I would like to know if anyone knows how to solve this issue. Thanks in advance

Related

How to process and insert millions of MongoDB records into Postgres using Talend Open Studio

I need to process millions of records coming from MongoDB and build an ETL pipeline to insert that data into a PostgreSQL database. However, with every method I've tried, I keep getting an out-of-memory (heap space) exception. Here's what I've already tried -
Tried connecting to MongoDB using tMongoDBInput, adding a tMap to process the records, and outputting them through a connection to PostgreSQL. tMap could not handle it.
Tried loading the data into a JSON file and then reading from the file into PostgreSQL. The data got loaded into the JSON file, but from there on I got the same memory exception.
Tried increasing the RAM for the job in the settings and tried the above two methods again; still no change.
I specifically wanted to know if there's any way to stream this data or process it in batches to counter the memory issue.
Also, I know that there are some components dealing with BulkDataLoad. Could anyone please confirm whether that would be helpful here, since I want to process the records before inserting, and if so, point me to the right kind of documentation to get that set up.
Thanks in advance!
Since you have already tried all of those options, the only way I can see to meet this requirement is to break the job down into multiple sub-jobs, or to go with an incremental load based on key or date columns, treating this as a one-time activity for now.
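If a plain script outside Talend is an option, one way to keep memory flat is to page through the Mongo cursor and bulk-insert each batch into Postgres. A minimal sketch, assuming pymongo and psycopg2, with placeholder connection strings, collection, table, and column names:

    # Stream the collection in fixed-size batches; only one batch is in memory at a time.
    from pymongo import MongoClient
    import psycopg2
    from psycopg2.extras import execute_values

    BATCH_SIZE = 5000

    mongo = MongoClient("mongodb://localhost:27017")
    cursor = mongo["sourcedb"]["records"].find({}, batch_size=BATCH_SIZE)

    pg = psycopg2.connect("dbname=target user=etl password=etl host=localhost")
    pg_cur = pg.cursor()

    def flush(batch):
        execute_values(pg_cur,
                       "INSERT INTO target_table (mongo_id, name, value) VALUES %s",
                       batch)
        pg.commit()

    batch = []
    for doc in cursor:
        # per-record processing step goes here
        batch.append((str(doc["_id"]), doc.get("name"), doc.get("value")))
        if len(batch) >= BATCH_SIZE:
            flush(batch)
            batch = []

    if batch:          # flush the final partial batch
        flush(batch)

    pg_cur.close()
    pg.close()
    mongo.close()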
Please let me know if it helps.

Inject big local json file into Druid

It's my first Druid experience.
I have a local setup of Druid on my machine.
Now I'd like to run some query performance tests. My test data is a large local JSON file (1.2 GB).
The idea was to load it into Druid and run the required SQL queries. The file is parsed and successfully processed (I'm using the Druid web-based UI to submit an ingestion task).
The problem I run into is the datasource size. It doesn't make sense that 1.2 GB of raw JSON data results in a 35 MB datasource. Is there any limitation in a locally running Druid setup? I think the test data is only partially processed, but unfortunately I didn't find any relevant config to change. I would appreciate it if someone could shed light on this.
Thanks in advance
With Druid, 80-90 percent compression is expected. I have seen a 2 GB CSV file reduced to a 200 MB Druid datasource.
Can you query the count to make sure all of the data was ingested? Also, please disable the approximate HyperLogLog algorithm to get an exact count: Druid SQL will switch to exact distinct counts if you set "useApproximateCountDistinct" to "false", either through the query context or through the broker configuration (refer to http://druid.io/docs/latest/querying/sql.html).
Also check the logs for exceptions and error messages. If Druid has a problem ingesting a particular JSON record, it skips that record.
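A small sketch of that count check through Druid's SQL HTTP API, assuming the broker listens on localhost:8082 and the datasource is named test_data (both are placeholders):

    # Send SQL to the broker's /druid/v2/sql/ endpoint with the context flag
    # that disables approximate (HyperLogLog) distinct counts.
    import json
    import requests

    queries = [
        # raw row count in the datasource (rollup can make this lower than the file's record count)
        "SELECT COUNT(*) FROM test_data",
        # exact distinct count of a hypothetical id column
        "SELECT COUNT(DISTINCT some_id) FROM test_data",
    ]

    for sql in queries:
        body = {"query": sql, "context": {"useApproximateCountDistinct": False}}
        resp = requests.post("http://localhost:8082/druid/v2/sql/",
                             data=json.dumps(body),
                             headers={"Content-Type": "application/json"})
        resp.raise_for_status()
        print(sql, "->", resp.json())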

HBase or Mongo for an Analytics DB if already using Hadoop?

I currently have a Hadoop cluster where I store tons of logs over which I run pig scripts for calculating aggregated analytics. I also have a Mongo cluster where I store production data.
I've recently been put in a position where I need to do a lot of one-off analytics queries, or enable others to do them. These queries frequently need to use both production data and log data together, so whatever I go with, I'd like to have everything in one place. My log data is in json and about 10x the size of my prod data. Here are the pros/cons of Mongo and HBase I'm seeing:
Mongo Pros/ HBase Cons:
Since log data is in JSON, I can get it into Mongo pretty easily, and I can do this in real time as it comes in through something like FluentD.
Most people I work with already have experience writing Mongo queries from needing to work with prod data, so getting an analytics db up on Mongo would be very simple for everyone to use.
I know much less about Hbase than Mongo.
No idea how easy/difficult it would be to get data in JSON or from Mongo into Hbase. I imagine this isn't so bad, but I don't see much documentation.
HBase Pros/Mongo Cons:
My log data is much bigger than my prod data, so storing it in both hadoop and mongo would be way more expensive than storing my prod data in both hadoop and mongo.
I can build HBase on top of my already running Hadoop cluster and fit my prod data in there without adding many extra machines. If I went with Mongo, I'd need a whole new Mongo cluster.
I could use Phoenix on top of HBase to allow a simple SQL syntax for accessing all our data, but I'm not sure how unwieldy this would be for multi-level document-based data.
I know very little about Hbase currently, and I wouldn't consider myself a Mongo expert, so I'm probably missing a lot.
So, what am I missing, and which is right for my situation?
First of all, you should use something you can already handle. From that angle, MongoDB seems like a good choice, especially since the data is already in JSON format.
On the other hand, I used HBase for quite a while and the read performance is amazing even with a lot of rows, and I really don't know whether there is any good, fast integration of MongoDB with Hadoop.
HBase is the Hadoop database, so it is predestined to work together with Hadoop.
If the logs can be indexed (in the HBase rowkey) by:
producing_program_identifier, timestamp, ...
HBase could work quite well for this query pattern.
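For example, with a composite rowkey like that, each query becomes a cheap contiguous range scan. A rough illustration using the happybase client, where the table name, column family, and key layout are made up:

    # Rowkey = "<producing_program_identifier>|<timestamp>", so all rows for one
    # program over a time window sit next to each other and can be range-scanned.
    import happybase

    connection = happybase.Connection("hbase-thrift-host")   # HBase Thrift gateway
    table = connection.table("logs")

    start = b"billing-service|2014-01-01T00:00:00"
    stop  = b"billing-service|2014-01-31T23:59:59"

    for rowkey, columns in table.scan(row_start=start, row_stop=stop):
        print(rowkey, columns[b"d:message"])          # "d" = hypothetical column family

    connection.close()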
But if you decide on HBase, use the Phoenix framework; it will save you time by giving you familiar interfaces like JDBC and SQL-like queries. It also provides simple aggregation functions (count, avg, max, min), which may be sufficient.
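To give a feel for that, here is a rough sketch of a Phoenix aggregation run through the phoenixdb Python client against the Phoenix Query Server; the host, table, and column names are hypothetical:

    # SQL-over-HBase via Phoenix: count and average events per program.
    import phoenixdb

    conn = phoenixdb.connect("http://phoenix-queryserver:8765/", autocommit=True)
    cursor = conn.cursor()
    cursor.execute(
        "SELECT program_id, COUNT(*) AS events, AVG(duration_ms) AS avg_duration "
        "FROM logs "
        "WHERE event_time >= TO_DATE('2014-01-01 00:00:00') "
        "GROUP BY program_id"
    )
    for program_id, events, avg_duration in cursor.fetchall():
        print(program_id, events, avg_duration)
    conn.close()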
From what you're saying, it seems a MongoDB-based solution would work best for you.
HBase is extremely versatile, and you can get it to serve both your prod needs and your analytics needs. However, the general-purpose SQL capabilities (in Phoenix, Cloudera's Impala and others) are in their infancy, and the standard HBase way to get high query performance (designing the data structure for reads) will take a lot of effort (especially since you don't have experience with HBase).
By the way, it may make sense for you to pre-aggregate the data with map/reduce and then load the results into MongoDB, and thus make better use of your current setup rather than change it either way.
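A toy sketch of that last idea, assuming the aggregated output has been pulled down from HDFS as a tab-separated file (field names are made up):

    # Bulk-load pre-aggregated results (e.g. a Pig/MapReduce part file) into MongoDB.
    import csv
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    daily_stats = client["analytics"]["daily_stats"]

    with open("part-r-00000") as f:
        reader = csv.reader(f, delimiter="\t")
        docs = [{"day": day, "program": program, "events": int(count)}
                for day, program, count in reader]

    if docs:
        daily_stats.insert_many(docs)
    client.close()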

Job scheduling in MongoDB

The project I am working on requires synchronization of data between MongoDB and SQL Server, so what is the best way to do this? The synchronization should be handled on the MongoDB side; does MongoDB support job scheduling?
I am new to MongoDB and want to know a few things about it. I know it is not an RDBMS, but if this is possible, then how?
Just as we can write programs in Oracle, can we write them in MongoDB? I mean directly in MongoDB; if not, which client-side language can be used?
As @Vasanth said, MongoDB has no native job management.
As to which client-side language: well, that's entirely up to you.
You can always get a prebuilt replicator like http://code.google.com/p/tungsten-replicator/ that might do the job for you.
For further reference, this question may be of help to you: https://softwareengineering.stackexchange.com/questions/125980/can-sql-server-and-mongo-be-used-together
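If you do end up scripting the sync yourself on the client side, a bare-bones polling sketch could look like this (Python with pymongo and pyodbc; the connection strings, collection, table, and updated_at field are all placeholders):

    # Incremental sync loop: every few minutes, copy documents modified since the
    # previous pass from MongoDB into SQL Server.
    import datetime
    import time
    import pyodbc
    from pymongo import MongoClient

    POLL_SECONDS = 300                      # run every 5 minutes, cron-style

    mongo = MongoClient("mongodb://localhost:27017")
    orders = mongo["appdb"]["orders"]

    sql = pyodbc.connect("DRIVER={SQL Server};SERVER=dbhost;DATABASE=appdb;UID=sync;PWD=secret")
    sql_cur = sql.cursor()

    last_sync = datetime.datetime.min
    while True:
        cutoff = datetime.datetime.utcnow()
        for doc in orders.find({"updated_at": {"$gt": last_sync, "$lte": cutoff}}):
            sql_cur.execute(
                "MERGE orders AS t USING (SELECT ? AS id, ? AS total) AS s "
                "ON t.id = s.id "
                "WHEN MATCHED THEN UPDATE SET total = s.total "
                "WHEN NOT MATCHED THEN INSERT (id, total) VALUES (s.id, s.total);",
                str(doc["_id"]), doc.get("total", 0),
            )
        sql.commit()
        last_sync = cutoff
        time.sleep(POLL_SECONDS)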

Is there a way to persist HSQLDB data?

We have all of our unit tests written so that they create and populate tables in HSQL. I want the developers who use this to be able to write queries against this HSQL DB ((1) because by writing queries they can better understand the data model, and those not as familiar with SQL can play with the data before writing the runtime statements, and (2) because they don't have access to the test DB, for security reasons). Is there a way to persist the test data so that it can be examined and analyzed with an SQL client?
Right now I am jury-rigging it by switching the data source to a different DB (like DB2/MySQL, then connecting to that DB on my machine so I can play with persistent data); however, it would be easier for me if HSQL supported persisting this than to explain how to do it to every new developer.
Just to be clear, I need an SQL client to interact with the persistent data, so debugging and inspecting memory won't do. This has more to do with initial development than with debugging/maintenance/testing.
If you use an HSQLDB Server instance for your tests, the data will survive the test run.
If the server uses a jdbc:hsqldb:mem:aname (all-in-memory) URL for its database, then the data will be available while the server is running. Alternatively, the server can use a jdbc:hsqldb:file:filepath URL and the data is persisted to files.
The latest HSQLDB docs explain the different options. Most of the observations also apply to older (1.8.x) versions. However, the latest version 2.0.1 supports starting a server and creating databases dynamically upon the first connection, which can simplify testing a lot.
http://hsqldb.org/doc/2.0/guide/deployment-chapt.html#N13C3D
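As a rough sketch of what that buys you: with the server started against a file: database, any SQL client (or a small script) can connect over the network and poke at the test data. The example below uses Python's jaydebeapi bridge; the jar path, database name, and test table are placeholders, and SA with an empty password is just the HSQLDB default.

    # Connect to a running HSQLDB Server from outside the test JVM.
    # The server itself would be started with something like (HSQLDB 2.x):
    #   java -cp hsqldb.jar org.hsqldb.server.Server \
    #        --database.0 file:/tmp/testdb --dbname.0 testdb
    import jaydebeapi

    conn = jaydebeapi.connect(
        "org.hsqldb.jdbc.JDBCDriver",
        "jdbc:hsqldb:hsql://localhost/testdb",
        ["SA", ""],                      # default HSQLDB user, empty password
        "/path/to/hsqldb.jar",
    )
    cur = conn.cursor()
    cur.execute("SELECT COUNT(*) FROM customers")   # 'customers' = hypothetical test table
    print(cur.fetchall())
    cur.close()
    conn.close()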