Scrapy, MongoDB and Elasticsearch Synchronization

I'm using Scrapy to fetch data from websites, MongoDB for persistence, and Elasticsearch for searching purposes.
My problem is that when Scrapy inserts data into MongoDB, Elasticsearch is not aware of it, even with the listener set to inserts, updates, and deletes.
Should I add a new plugin for Scrapy to communicate directly with Elasticsearch? If so, why doesn't the listener pick up what happens to the database? Thanks!

Rivers in Elasticsearch are deprecated.
Instead, you can use Transporter to sync data between MongoDB and Elasticsearch.
How To Sync Transformed Data from MongoDB to Elasticsearch with Transporter
Installing Go
In order to install the Compose Transporter we need to install the Go language.
sudo apt-get install golang
Create a folder for Go from your $HOME directory:
mkdir ~/go; echo "export GOPATH=$HOME/go" >> ~/.bashrc
Update your path:
echo "export PATH=$PATH:$HOME/go/bin:/usr/local/go/bin" >> ~/.bashrc
Now go to the $GOPATH directory and create the subdirectories src, pkg and bin. These directories constitute a workspace for Go.
cd $GOPATH
mkdir src pkg bin
Installing Transporter
Now create and move into a new directory for Transporter. Since the utility was developed by Compose, we'll call the directory compose.
mkdir -p $GOPATH/src/github.com/compose
cd $GOPATH/src/github.com/compose
This is where compose/transporter will be installed.
Clone the Transporter GitHub repository:
git clone https://github.com/compose/transporter.git
Move into the new directory:
cd transporter
Take ownership of the /usr/lib/go directory:
sudo chown -R $USER /usr/lib/go
Make sure build-essential is installed for GCC:
sudo apt-get install build-essential
Run the go get command to get all the dependencies:
go get -a ./cmd/...
This step might take a while, so be patient. Once it's done you can build Transporter.
go build -a ./cmd/...
If all goes well, it will complete without any errors or warnings. Check that Transporter is installed correctly by running this command:
transporter
So the installation is complete.
Create some sample data in MongoDB, for example as sketched below.
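A hypothetical couple of documents (the database foo and collection bar match the namespace used in the config and application files below, and the firstName/lastName fields are what the transform file expects):
mongo localhost/foo
db.bar.insert({ "firstName": "John", "lastName": "Doe" })
db.bar.insert({ "firstName": "Jane", "lastName": "Smith" })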
Then we have to configure the transporter.
Transporter requires a config file (config.yaml), a transform file (myTransformation.js), and an application file (application.js) to migrate our data from MongoDB to Elasticsearch.
Move to the transporter directory:
cd ~/go/src/github.com/compose/transporter
Config File
You can take a look at the example config.yaml file if you like. We're going to back up the original and then replace it with our own contents.
mv test/config.yaml test/config.yaml.00
The new file is similar but updates some of the URIs and a few of the other settings to match what's on our server. Copy the contents from here and paste them into the new config.yaml file, using the nano editor again.
nano test/config.yaml
Copy the contents below into the file. Once done, save the file as described earlier.
# api:
#   interval: 60s
#   uri: "http://requestb.in/13gerls1"
#   key: "48593282-b38d-4bf5-af58-f7327271e73d"
#   pid: "something-static"
nodes:
  localmongo:
    type: mongo
    uri: mongodb://localhost/foo
    tail: true
  es:
    type: elasticsearch
    uri: http://localhost:9200/
  timeseries:
    type: influx
    uri: influxdb://root:root@localhost:8086/compose
  debug:
    type: file
    uri: stdout://
  foofile:
    type: file
    uri: file:///tmp/foo
Application File
Now, open the application.js file in the test directory.
nano test/application.js
Replace the sample contents of the file with the contents shown below:
Source({name:"localmongo", namespace:"foo.bar"})
.transform({filename: "transformers/addFullName.js", namespace: "foo.bar"})
.save({name:"es", namespace:"foo.bar"});
Transformation File
Let's say we want the documents being stored in Elasticsearch to have another field called fullName. For that, we need to create a new transform file, test/transformers/addFullName.js.
nano test/transformers/addFullName.js
Paste the contents below into the file. Save and exit as described earlier.
module.exports = function(doc) {
  console.log(JSON.stringify(doc)); // If you are curious you can listen in on what's changed and being copied.
  doc._id = doc.data._id['$oid'];
  doc["fullName"] = doc["firstName"] + " " + doc["lastName"];
  return doc;
}
The line that rewrites doc._id is necessary to tackle the way Transporter handles MongoDB's ObjectId() field. The next line tells Transporter to concatenate the firstName and lastName from MongoDB to form the fullName field in Elasticsearch.
This is a simple transformation for the example, but with a little JavaScript you can do more complex data manipulation as you prepare your data for searching.
Executing Transporter:
If you have a simple standalone instance of MongoDB, it isn't replicated, there is no oplog, and Transporter won't be able to detect changes. To convert a standalone MongoDB into a single-node replica set, start the server with --replSet rs0 (rs0 is just a name for the set) and, once it is running, log in with the mongo shell and run rs.initiate() to get the server to configure itself.
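A minimal sketch of that conversion (the data path is just an example):
mongod --replSet rs0 --dbpath /data/db
Then, from another terminal:
mongo
rs.initiate()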
Make sure you are in the transporter directory:
cd ~/go/src/github.com/compose/transporter
Execute the following command to sync the data:
transporter run --config ./test/config.yaml ./test/application.js

The synchronizing mechanism that you are looking for is called rivers in Elasticsearch.
In this specific case, you should sync the specific Mongodb collection that you are using to save your Scrapy data with the Elasticsearch index.
For details about how to proceed you should check out the following links:
setting up the mongodb river
mongodb river plugin
Also, I recommend checking out the answers under the Elasticsearch tag here on Stack Overflow. I have found detailed answers for most of the common problems regarding implementation details.

Related

PostgreSQL data directory configuration

I am using PostgreSQL on CentOS, and I changed the data directory to store PostgreSQL data on a different disk.
nano /usr/lib/systemd/system/postgresql.service
#Environment=PGDATA=/var/lib/pgsql/data
Environment=PGDATA=/data/pgsql/data
However, after installing the package update, the contents of the configuration file were changed back to the default settings.
Do I need to check the configuration file every time I install a package update later? Or is there a way to preserve the config file?
There are two ways to deal with this:
the old way:
You create a file /etc/systemd/system/postgresql.service that contains
.include /usr/lib/systemd/system/postgresql.service
[Service]
Environment=PGDATA=/data/pgsql/data
the new way:
You create a directory /etc/systemd/system/postgresql.service.d that contains a file named (for example) pgdata.conf with the contents
[Service]
Environment=PGDATA=/data/pgsql/data
Then notify systemd with
systemctl daemon-reload
This configuration change will override the corresponding value from /usr/lib/systemd/system/postgresql.service, so the change will survive an upgrade.
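As a quick check (a sketch; systemctl cat is available on reasonably recent systemd versions), you can view the merged unit definition and the resulting environment:
systemctl cat postgresql.service
systemctl show postgresql.service --property=Environment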

"mount" a PostgreSQL database from files not Backup

I've been given a project to extract data from a PostgreSQL database. I've no previous experience with PostgreSQL but the project I have is to bug fix existing code, so all the logic to connect to the engine and get data is already in place.
The problem I have is that the database has been given to me in the form of the folders and files straight from the source HDD, not a backup (which isn't going to happen, so "get the customer to give you a backup instead" isn't an option here).
The folders also contained the actual PostgreSQL binaries, so I looked at the version (9.4.14), downloaded the nearest one (9.4.18) from the PostgreSQL site and installed it. Now all I have to do is somehow get it to look at the data files I was given.
I tried the obvious approach of copying the contents of the data folder into the installed data folder, but after that the PostgreSQL service won't start.
I did find a option in the conf file:
#data_directory = 'ConfigDir'
I changed this to:
data_directory = 'C:\customer\data'
But again the service won't start after this.
The data directory used by the service is defined through the service command line, which overrides any value defined in postgresql.conf.
You need to re-create the service in order to change the data directory, e.g.:
Remove the service:
pg_ctl unregister -N postgresql-9.1
postgresql-9.1 is the "real" name of the service, not the "Display Name". You can see that in the properties of the service inside the "services" app.
Then re-create the service with the correct data directory:
pg_ctl register -D c:\customer\data -N postgresql-9.1
Another way of "debugging" startup errors in Windows, is to start Postgres from the command line (not through the service) because some errors during startup are not logged in the Postgres logfile but they are displayed on the command line. You can do that with e.g.:
pg_ctl start -D c:\customer\data
If the bin directory is not in your PATH you need to specify the full path to it on the command line, e.g.: c:\Postgres9.1\bin\pg_ctl

How to migrate/shift/copy/move data in Neo4j

Does anyone know how to migrate data from one instance of Neo4j to another? To be more precise, I want to know how to move the data from one instance of Neo4j on my local machine to another one on a remote machine. Does anyone have any idea how to do this?
I'm working on my Windows machine with Eclipse and embedded Neo4j. I need to transfer this data to a remote Neo4j instance on a CentOS machine. Please help me with this.
Not sure how to do it for an embedded Neo4j DB.
But for a standalone server, and in case you have something like the command-line tool PuTTY on your Windows machine, this should work. Instead of $NEO4J_HOME you can also use the normal path without the env variable.
$NEO4J_HOME/bin/neo4j stop
cd $NEO4J_HOME/data
tar -cvf graph.db.tar graph.db
gzip graph.db.tar
scp -i ~/some_path/key_for_remote_server.pem ./graph.db.tar.gz username@your_remote_domain.com:~/
ssh -i ~/some_path/key_for_remote_server.pem username@your_remote_domain.com
On your remote server (at least this works for ubuntu):
Maybe you need to use "sudo" (prefix the commands with sudo).
mv ./graph.db.tar.gz /some_path/
cd /some_path/
gunzip graph.db.tar.gz
tar -xvf graph.db.tar
$NEO4J_HOME/bin/neo4j start
$NEO4J_HOME/bin/neo4j status
You can migrate the data by using the apoc procedure by running the below query in the cypher shell from where the data needs to be exported:
CALL apoc.export.cypher.all('myfilename.cypher');
This will create a file with Cypher queries in the import folder.
Go to the database instance where the data needs to be imported and copy the file into its import folder. Run the below command using the cypher shell:
CALL apoc.cypher.runFile("myfilename.cypher",{}) YIELD row, result;
For more advanced options follow the below links:
https://neo4j.com/docs/labs/apoc/current/export/cypher/
http://neo4j-contrib.github.io/neo4j-apoc-procedures/3.4/cypher-execution/run-cypher-scripts/
I found out the following workaround for copying the data from a server in the cluster to all others, after using the neo4j-import tool:
Stop all nodes.
On the new node/server, where you need your data to be copied, you have to create the database folder for that graph (in my case loadTest):
/neo4j-enterprise-3.1.0/data/databases/loadTest.db
Then, from the source node/server that is holding the data, copy the neostore.id file to the destination server's db folder (loadTest.db from the previous step).
Start all nodes. In the background neo4j will copy data from other cluster servers to the new node.
For embedded mode, you would just need to locate the graph's neo4j-db folder, then zip it and send it to the remote system.
In your code, wherever you called GraphDatabaseService, you would have given the target location.
Check whether it is a relative path; if so, the database might be in your project folder.
Now, for running the DB instance in the browser, you will need to use the Neo4j community server and point it to the folder containing the index folder. So if your neo4j-db is located at $project/tmp/neo4j-db, then you give the file path up to this folder (the index folder will be inside it).
Edit
The folder that will contain the schema and index folders needs to be zipped. You can upload and unzip the folder at a suitable location on your standalone server using PuTTY. Then just change the org.neo4j.server.database.location setting in the conf/neo4j-server.properties file.
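For example (a sketch; /home/user/neo4j-db is a hypothetical location where you unzipped the folder):
org.neo4j.server.database.location=/home/user/neo4j-db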

Getting Used to MongoDB

I am just getting familiar with MongoDB and it's my first time using it. I am using an Ubuntu environment for development. I installed MongoDB as per the instructions in the tutorial available on the MongoDB website. They said that the data will be stored in /data/db. Now I have two questions about this:
1) Where do I need to make this folder? That is, in which directory?
2) I made one directory in the root directory / and then I ran the Mongo server. I made one database with the name movies using the use movies command, put some collections in it like action and comedy, and then saved some documents into each collection with the db.action.insert and db.comedy.insert commands. When I tried to find it in the /data/db folder I found three files and one folder, but there is no file with the name movies, so my question is: under what name is my database saved?
Please guide me in this aspect.
Thanks
By default, Mongo stores all of its DB files in the root of the configured dbpath directory. Look into the directoryperdb configuration option if you want Mongo to create a separate directory for each database and place that database's files within it.
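For example, a sketch of starting mongod so that each database gets its own subdirectory:
mongod --dbpath /data/db --directoryperdb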
The default data directory /data/db needs to be created by you if it hasn't been already.
You can do this in Ubuntu via the shell:
CTL+ALT+T to bring up a Terminal window:
cd /
sudo mkdir /data
cd data
sudo mkdir db
Note: you may also need to change ownership of the new directory to be owned by you:
sudo chown your_user_name /data
Note: my Mongo installation defaulted to a dbpath of /var/lib/mongodb
You can also choose a different data directory when you start Mongo:
mongod --dbpath some/dir
If Mongo is installed as a service, you should be checking the dbpath configuration option in your mongodb.conf file (again, when I installed Mongo, which I believe I did via the Software Center, it placed mongodb.conf at /etc/mongodb.conf)
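For reference, the relevant line in that legacy-format mongodb.conf is just (the path shown is an example):
dbpath=/var/lib/mongodb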
Hope this helps
What are the 3 files you see? test*?
Here are the 2 most likely cases:
a) your DB files are in another location
b) your data is in the test database
Go back to your shell and run:
use local
db.startup_log.find({},{startTimeLocal:1,cmdLine:1}).sort({$natural:-1}).limit(1).pretty()
{
    "_id" : "mycomputer",
    "startTimeLocal" : "Mon Oct 7 09:44:25.353",
    "cmdLine" : {
        "config" : "mongodb.conf",
        "dbpath" : "./db",
        "fork" : "1",
        "journal" : "1",
        "logpath" : "./logs/mongo.log",
        "port" : 27017,
        "rest" : "1"
    }
}
=> dbpath tells you the location that was provided on the command line or config file
Again in the shell, run the following:
use test
show collections
use movies
show collections
If you see your collection names under test, you may not have used 'use movies' before creating the data.
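A quick sketch of the expected order in the mongo shell (the documents here are hypothetical examples); the movies database only shows up in show dbs and gets files on disk once it contains data:
use movies
db.action.insert({ "title": "Die Hard" })
db.comedy.insert({ "title": "Airplane!" })
show dbs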

Moving MongoDB's data folder?

I have 2 computers in different places (so it's impossible to use the same wifi network).
One contains about 50GBs of data (MongoDB files) that I want to move to the second one which has much more computation power for analysis. But how can I make MongoDB on the second machine recognize that folder?
When you start the mongod process you provide an argument to it, --dbpath /directory, which is how it knows where the data folder is.
All you need to do is:
stop the mongod process on the old computer. wait till it exits.
copy the entire /data/db directory to the new computer
start mongod process on the new computer giving it --dbpath /newdirectory argument.
The mongod on the new machine will use the folder you indicate with --dbpath. There is no need to "recognize" as there is nothing machine specific in that folder, it's just data.
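As a concrete sketch of that last step (assuming the data was copied to /newdirectory on the new machine):
mongod --dbpath /newdirectory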
I did this myself recently, and I wanted to provide some extra considerations to be aware of, in case readers (like me) run into issues.
The following information is specific to *nix systems, but it may be applicable with very heavy modification to Windows.
If the source data is in a mongo server that you can still run (preferred)
Look into and make use of mongodump and mongorestore. That is probably safer, and it's the official way to migrate your database.
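A minimal sketch of that route (the database name and paths are examples):
mongodump --db mydb --out /tmp/dump
Copy /tmp/dump to the new machine, then:
mongorestore --db mydb /tmp/dump/mydb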
If you never made a dump and can't anymore
Yes, the data directory can be directly copied; however, you also need to make sure that the mongodb user has complete access to the directory after you copy it.
My steps are as follows. On the machine you want to transfer an old database to:
Edit /etc/mongod.conf and change the dbPath field to the desired location.
Use the following script as a reference, or tailor it and run it on your system, at your own risk.
I do not guarantee this works on every system --> please verify it manually.
I also cannot guarantee it works perfectly in every case.
WARNING: will delete everything in the target data directory you specify.
I can say, however, that it worked on my system, and that it passes shellcheck.
The important part is simply copying over the old database directory, and giving mongodb access to it through chown.
#!/bin/bash
TARGET_DATA_DIRECTORY=/path/to/target/data/directory # modify this
SOURCE_DATA_DIRECTORY=/path/to/old/data/directory # modify this too
echo shutting down mongod...
sudo systemctl stop mongod
if test "$TARGET_DATA_DIRECTORY"; then
    echo removing existing data directory...
    sudo rm -rf "$TARGET_DATA_DIRECTORY"
fi
echo copying backed up data directory...
sudo cp -r "$SOURCE_DATA_DIRECTORY" "$TARGET_DATA_DIRECTORY"
sudo chown -R mongodb "$TARGET_DATA_DIRECTORY"
echo starting mongod back up...
sudo systemctl start mongod
sudo systemctl status mongod # for verification
Quite easy for Windows: just move the data folder to the target location.
Then run in cmd:
"C:\your\mongodb\bin-path\mongod.exe" --dbpath="c:\what\ever\path\data\db"
On Windows, if you just need to configure a new path for the data, all you need to do is create a new folder, for example D:\dev\mongoDb-data, open C:\Program Files\MongoDB\Server\6.0\bin\mongod.cfg, and change the path there.
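A sketch of the relevant mongod.cfg section (YAML format; the path is the example folder from above):
storage:
  dbPath: D:\dev\mongoDb-data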
Then restart your PC. Check the folder; it should contain new files/folders with data.
Maybe what you didn't do was export or dump the database.
Databases aren't portable and therefore must be exported or created as a dump file.
Here is another question where the answer is further explained