Scheduled sequential jobs on an Ubuntu server

I want to write two scheduled jobs for my Ubuntu 14.04.4 server. The jobs need to be sequential.
The first job should unzip a .gz file (SQL Dump) and then import the table "myTable" into MySQL Database (localhost).
The second job (written using the Pentaho Data Integration tool) extracts data from the table "myTable", transforms it, and loads it into a new database.
I could have accomplished the first task using Pentaho PDI (Spoon), but it doesn't provide any function to unzip a .gz file. After some research and coming across these posts:
http://forums.pentaho.com/showthread.php?82566-How-to-use-the-content-of-a-tar-gz-file-in-Kettle
How to uncompress and import a .tar.gz file in kettle?
I have gathered that I should manually write a job to accomplish the first task, i.e. unzip the .gz file and then import the table "myTable" into the MySQL database.
My question is: how do I create a cron job that executes the two sequentially, i.e. the first job completes and only then is the second executed?
If there is a better alternative approach, please suggest it.

You can make use of the "Shell" step in a PDI job. Code the unzip-and-import portion in the Shell step, followed sequentially by your transformation entry, for example along the lines of the script below.
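A minimal sketch of what the Shell step's script might contain; the dump path, database name and credentials are placeholders, not taken from your setup:

#!/bin/bash
# Stop at the first error so the PDI job entry reports failure
set -e

DUMP_GZ=/data/dumps/myTable.sql.gz   # hypothetical location of the gzipped SQL dump
DB_NAME=mydb                         # hypothetical target database on localhost

# Unzip the .gz dump and pipe it straight into MySQL
gunzip -c "$DUMP_GZ" | mysql -h localhost -u myuser -p'mypassword' "$DB_NAME"

Because the Shell entry and the transformation entry are linked sequentially in the job, the transformation only runs after the import has finished successfully.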
Now you can schedule this complete job in CRON or any other scheduler. No need for separate scripts.
Note: this works only in a Linux environment, which I assume you are using.
Hope this helps :)

Related

Best practice for importing bulk data to AWS RDS PostgreSQL database

I have a big AWS RDS database that needs to be updated with data on a periodic basis. The data is in JSON files stored in S3 buckets.
This is my current flow:
Download all the JSON files locally
Run a ruby script to parse the JSON files to generate a CSV file matching the table in the database
Connect to RDS using psql
Use the \copy command to append the data to the table (roughly as sketched below)
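For reference, that \copy step boils down to something like the following (the connection details, table name and file name are placeholders):

# Placeholder connection details -- append the CSV to the existing table
psql "host=<rds-endpoint> user=<user> dbname=<db>" -c "\copy my_table FROM 'data.csv' WITH (FORMAT csv, HEADER)"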
I would like to switch this to an automated approach (maybe using an AWS Lambda). What would be the best practices?
Approach 1:
Run a script (Ruby / JS) that parses all folders for the past period (e.g., a week) and, while parsing each file, connects to the RDS DB and executes INSERT commands. I feel this would be a very slow process with constant writes to the database and wouldn't be optimal.
Approach 2:
I already have a Ruby script that parses local files to generate a single CSV. I can modify it to parse the S3 folders directly and create a temporary CSV file in S3. The question is - how do I then use this temporary file to do a bulk import?
Are there any other approaches that I have missed and might be better suited for my requirement?
Thanks.

How can I import a large (multi-GB) sql file into postgres using dotnet core?

My database needs to mirror another, to which I have no access except for a nightly export of the sql file. I could script the import using psql.exe, but would prefer everything to be under the control of the dotnet core application.
I can't use the COPY command, because the file contains ALL the sql to set up the schemas and tables, as well as all the sql commands to insert/alter/copy the data.
I can't use \i because that is a postgresql console command, not something I can run through npgsql.
Is what I'm trying to do possible? Is it inherently a bad idea, and should I run a script to import it outside of the dotnet application? Should the dotnet application run and talk to the psql.exe program directly?
You could theoretically parse the SQL file in .NET and send it to PostgreSQL, but this is a very non-trivial thing to do, since you'd need to understand where statements end (identify semicolons) in order to send chunks.
You could, of course, send the entire file as a single chunk, but if it's huge, that may be a bad idea.
At the end of the day, I don't think there's any particular issue with launching psql.exe as an external process from .NET, and properly inspecting its exit code for error handling. Any reason you think you need to avoid that?
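For what it's worth, the external command the .NET code would launch (and whose exit code it would inspect) is roughly the following; the host, user and file name are placeholders:

# Placeholder connection details; -v ON_ERROR_STOP=1 makes psql exit non-zero on the first SQL error
psql -h <host> -U <user> -d <database> -v ON_ERROR_STOP=1 -f nightly_export.sql
echo "psql exited with status $?"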

Easy way to get all tables out to S3 on a nightly basis?

I need to be able to dump the contents of each table in my redshift data warehouse each night to S3.
The outcome that I want to achieve is the same as if I were manually issuing an UNLOAD command for each table.
For something this simple, I assumed I could use something like Data Pipeline or Glue, but these don't seem to make this easy.
Am I looking at this problem wrong? This seems like it should be simple.
I had this process, but in reverse, recently. My solution: a Python script that queried pg_schema (to grab eligible table names) and then looped through the results, using the table name as a parameter in an INSERT query. I ran the script as a cron job on an EC2 instance.
In theory, you could set up the script via Lambda or as a ShellCommand in Pipeline. But I could never get that to work, whereas a cron job was super simple.
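A rough shell equivalent of the loop described above (the catalog query, credentials and bucket name are assumptions, not the original script) could look like:

# List user tables from the pg_tables catalog, then UNLOAD each one to S3
TABLES=$(psql -h <redshift-endpoint> -U <user> -d <db> -p 5439 -At \
  -c "select schemaname || '.' || tablename from pg_tables where schemaname = 'public'")
for t in $TABLES; do
  psql -h <redshift-endpoint> -U <user> -d <db> -p 5439 \
    -c "unload ('select * from $t') to 's3://<bucket>/$t/' credentials 'aws_access_key_id=XXXXXXXXXX;aws_secret_access_key=XXXXXXXXXX';"
done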
Do you have a specific use case for explicitly UNLOADing data to S3? Like being able to use that data with Spark/Hive?
If not, you should be scheduling snapshots of your Redshift cluster to S3 every day. This happens by default anyway.
Snapshots are stored in S3 as well.
Snapshots are incremental and fast. You can restore entire clusters using snapshots.
You can also restore individual tables from snapshots.
Here is the documentation about it: https://docs.aws.amazon.com/redshift/latest/mgmt/working-with-snapshots.html
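If you do want to trigger a snapshot on your own schedule rather than relying on the automated ones, the AWS CLI call is roughly (the identifiers are placeholders):

# Create a manual snapshot of the cluster
aws redshift create-cluster-snapshot --cluster-identifier <my-cluster> --snapshot-identifier nightly-$(date +%F)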
This is as simple as creating a script (shell/python/...) and putting it in a crontab. Something along the lines of (snippet from a shell script):
psql -U$username -p $port -h $hostname $database -f path/to/your/unload_file.psql
and your unload_file.psql would contain the standard Redshift unload statement:
unload ('select * from schema.tablename') to 's3://scratchpad_bucket/filename.extension'
credentials 'aws_access_key_id=XXXXXXXXXX;aws_secret_access_key=XXXXXXXXXX'
[options];
Put your shell script in a crontab and execute it daily at the time when you want to take the backup.
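For example, a crontab entry that runs the script every night at 02:30 might look like this (the script path is a placeholder):

# m h dom mon dow  command
30 2 * * * /home/ec2-user/scripts/nightly_unload.sh >> /var/log/nightly_unload.log 2>&1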
However, remember:
While taking backups is indispensable, daily full backups will generate a mammoth bill for S3. You should rotate the backups / log files, i.e. regularly delete them or pull a backup down from S3 and store it locally.
A full daily backup might not be the best thing to do. Check whether you can do it incrementally.
It would be better to tar and gzip the files and then send them to S3, rather than storing raw Excel or CSV files.
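As a rough illustration of that last point, if you do end up with local CSV exports (the paths and bucket are placeholders):

# Compress local exports before pushing them to S3
tar -czf backup_$(date +%F).tar.gz /path/to/local/exports/
aws s3 cp backup_$(date +%F).tar.gz s3://<backup-bucket>/redshift/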

Data Migration from Java Hibernate SQL Server to Python Mongo Stack

I have a live website with around 30K active users, each of whom has their own configuration to render their homepage. The current stack of the portal is Java Spring Hibernate with SQL Server. We have now rewritten the code in a Python/MongoDB stack and want to migrate our users to the new system. The issue is that the old and new code will be deployed on separate machines, and we want to run this migration for a few users as part of beta testing. Once the beta testing is done, we will migrate all the users.
What would be the best approach to achieve this? We are thinking about dumping the data in an intermediate format like XML/JSON on a remote server and then reading it in the new code.
Please suggest the best way to accomplish this task.
Import CSV, TSV or JSON data into MongoDB.
It will be faster and more practical to dump the data in a format like JSON, TSV or CSV, copy the file to the new server, and then import the data using mongoimport from the command line.
Example
mongoimport -d databasename -c collectionname < users.json
Refer to the link below for more information on mongoimport if you need it:
http://docs.mongodb.org/manual/reference/mongoimport/
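If you dump to CSV rather than JSON, the equivalent call would be along these lines (the database, collection and file names are placeholders):

# --headerline tells mongoimport to take field names from the first row of the CSV
mongoimport -d databasename -c collectionname --type csv --headerline --file users.csv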

HBase Export/Import: Unable to find output directory

I am using HBase for my application and I am trying to export the data using org.apache.hadoop.hbase.mapreduce.Export, as directed here. The issue I am facing is that once the command is executed, there are no errors while creating the export, but the specified output directory does not appear where expected. The command I used was:
$ bin/hbase org.apache.hadoop.hbase.mapreduce.Export table_name db_dump/
I found the solution, so I am answering my own question.
You must have the following two lines in hadoop-env.sh in the conf directory of Hadoop:
export HBASE_HOME=/home/sitepulsedev/hbase/hbase-0.90.4
export HADOOP_CLASSPATH=$HBASE_HOME/hbase-0.90.4.jar:$HBASE_HOME/conf:$HBASE_HOME/hbase-0.90.4-test.jar:$HBASE_HOME/lib/zookeeper-3.3.2.jar:$HBASE_HOME
Save it and restart MapReduce with ./stop-mapred.sh and ./start-mapred.sh.
Now run, in the bin directory of Hadoop:
./hadoop jar ~/hbase/hbase-0.90.4/hbase-0.90.4.jar export your_table /export/your_table
Now you can verify the dump by running:
./hadoop fs -ls /export
Finally, you need to copy the whole thing into your local file system, for which you run:
./hadoop fs -copyToLocal /export/your_table ~/local_dump/your_table
Here are the references that helped me out with export/import and with the Hadoop shell commands.
Hope this one helps you out!!
As you noticed, the HBase export tool will create the backup in HDFS; if you instead want the output to be written to your local FS, you can use a file URI. In your example it would be something similar to:
bin/hbase org.apache.hadoop.hbase.mapreduce.Export table_name file:///tmp/db_dump/
Related to your own answer, this would also avoid going through HDFS. Just be very careful if you are running this in a cluster of servers, because each server will write the result files to its own local file system.
This is true for HBase 0.94.6 at least.
Hope this helps
I think the previous answer needs some modification:
Platform: AWS EC2
OS: Amazon Linux
Hbase Version: 0.96.1.1
Hadoop Distribution: Cloudera CDH5.0.1
MR engine: MRv1
To export data from an HBase table to the local filesystem:
sudo -u hdfs /usr/bin/hbase org.apache.hadoop.hbase.mapreduce.Export -Dmapred.job.tracker=local "table_name" "file:///backups/"
This command will dump the data in HFile format, with the number of files equal to the number of regions of that table in HBase.
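For the import side, the matching MapReduce tool is org.apache.hadoop.hbase.mapreduce.Import; under the same setup, a sketch would be (the table must already exist, and the table name and path are placeholders):

# Import the previously exported files back into an existing table
sudo -u hdfs /usr/bin/hbase org.apache.hadoop.hbase.mapreduce.Import -Dmapred.job.tracker=local "table_name" "file:///backups/"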