Exporting data from Mongo/Cassandra to HDFS using Apache Sqoop

I need to read data incrementally from multiple data sources, both RDBMS (MySQL, Oracle) and NoSQL (MongoDB, Cassandra), into HDFS via Hive.
Apache Sqoop works perfectly for RDBMS sources, but it does not work for NoSQL, or at least I was not able to use it successfully. (I tried the JDBC driver for Mongo; it connected to Mongo but could not push data to HDFS.)
If anyone has done work related to this and can share it, that would be really helpful.

I used an example from the web and was able to transfer files from Mongo to HDFS and the other way round. I can't recall the exact web page right now, but the program looks like the one below.
You can use it as a starting point.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.bson.BSONObject;
import org.bson.types.ObjectId;
import com.mongodb.hadoop.MongoInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import com.mongodb.hadoop.util.MongoConfigUtil;
public class CopyFromMongodbToHDFS {
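public static class ImportWeblogsFromMongo extends
Mapper<Object, BSONObject, Text, Text> {
// The input type parameters must match map()'s signature, otherwise this method is never called.
@Override
public void map(Object key, BSONObject value, Context context)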
throws IOException, InterruptedException {
System.out.println("Key: " + key);
System.out.println("Value: " + value);
String md5 = value.get("md5").toString();
String url = value.get("url").toString();
String date = value.get("date").toString();
String time = value.get("time").toString();
String ip = value.get("ip").toString();
String output = "\t" + url + "\t" + date + "\t" + time + "\t" + ip;
context.write(new Text(md5), new Text(output));
}
}
public static void main(String[] args) throws IOException,
InterruptedException, ClassNotFoundException {
Configuration conf = new Configuration();
MongoConfigUtil.setInputURI(conf,
"mongodb://127.0.0.1:27017/test.mylogs");
System.out.println("Configuration: " + conf);
@SuppressWarnings("deprecation")
Job job = new Job(conf, "Mongo Import");
Path out = new Path("/user/cloudera/test1/logs.txt");
FileOutputFormat.setOutputPath(job, out);
job.setJarByClass(CopyFromMongodbToHDFS.class);
job.setMapperClass(ImportWeblogsFromMongo.class);
// Declare the Text/Text pairs that the mapper actually emits.
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setInputFormatClass(MongoInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
job.setNumReduceTasks(0);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}

For MongoDB, create a mongodump of the collection you want to export to HDFS:
cd < /dir_name >
mongodump -h < IP_address > -d < db_name > -c < collection_name >
This creates a dump in .bson format, e.g. "file.bson", stored by default in the "dump" folder under your specified < dir_name >.
To convert it to .json format:
bsondump file.bson > file.json
Then copy the file to HDFS using "copyFromLocal".


Connecting Scala with Hive Database using sbt for dependencies using IntelliJ

I am having a very difficult time connecting to a Hive database from Scala, using IntelliJ or the plain command line (I would be happy with Java too). In the past I was able to connect to a MySQL database by adding the mysql-connector library, but I cannot work out how to add a jar file to the project structure so that it works.
To make things a bit more difficult, I have installed Ubuntu with Hive, Spark, and Hadoop, and I am connecting to it over the network.
Is there some way I can add a dependency in the sbt file?
Lastly, I know there are similar questions, but they do not show in detail how to connect to a Hive database from Scala.
`import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;
object HiveJdbcClient extends App {
val driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";
Class.forName(driverName);
val con=DriverManager.getConnection("jdbc:hive://http://192.168.43.64:10000/default", "", "");
val stmt = con.createStatement();
val tableName = "testHiveDriverTable";
stmt.executeQuery("drop table " + "wti");
var res = stmt.executeQuery("create table " + tableName + " (key int, value string)");
// select * query
var sql = "select * from " + tableName;
res = stmt.executeQuery(sql);
while (res.next()) {System.out.println(String.valueOf(res.getInt(1)) + "\t" + res.getString(2));
}
// regular hive query
sql = "select count(1) from " + tableName;
res = stmt.executeQuery(sql);
while (res.next()) {
System.out.println(res.getString(1));
}
}`
The driver name is not correct for Hive 3.1.2; it should be
org.apache.hive.jdbc.HiveDriver
See https://hive.apache.org/javadocs/r3.1.2/api/org/apache/hive/jdbc/HiveDriver.html
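As a rough sketch of what the corrected client might look like in Java (the asker said Java would be fine too): the driver class is the one named above, and the URL uses the HiveServer2 scheme jdbc:hive2:// with no http:// in front of the host. The class name HiveJdbcSketch and the helper url() are made up for illustration, the host and port are taken from the question, and it is an assumption that HiveServer2 is what listens on that port. The driver itself comes from the hive-jdbc artifact, which in sbt would be something like libraryDependencies += "org.apache.hive" % "hive-jdbc" % "3.1.2".

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical helper class, not part of Hive.
final class HiveJdbcSketch {
    // Driver class named in the answer above.
    static final String DRIVER = "org.apache.hive.jdbc.HiveDriver";

    // Builds a HiveServer2 URL: jdbc:hive2://host:port/database, without an http:// prefix.
    static String url(String host, int port, String database) {
        return "jdbc:hive2://" + host + ":" + port + "/" + database;
    }

    public static void main(String[] args) throws ClassNotFoundException, SQLException {
        // Throws ClassNotFoundException unless the hive-jdbc jar is on the classpath.
        Class.forName(DRIVER);
        String url = url("192.168.43.64", 10000, "default");
        try (Connection con = DriverManager.getConnection(url, "", "");
             Statement stmt = con.createStatement();
             ResultSet res = stmt.executeQuery("show tables")) {
            while (res.next()) {
                System.out.println(res.getString(1));
            }
        }
    }
}
```

Running main needs a reachable HiveServer2 instance; url() on its own can be checked without one.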

AWS Device Farm: how can I save the custom report generated after a test case to my local machine?

I am working with AWS Device Farm. When run on my local system, my test script works as expected and generates a report at the specified local path. When I run the code in Device Farm, the report does not get generated. Am I missing something?
This is my test code, which writes the test cases to an HTML report.
package testOutput;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.Writer;
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.List;
import net.dongliu.apk.parser.ApkFile;
import net.dongliu.apk.parser.bean.ApkMeta;
import report.TestReportSteps;
public class TestResultHtml {
public static void WriteResultToHtml(List<TestReportSteps> items, String getCurrentDateTime, String getCurrentTime) {
try {String filePath="C:\\\\Appium\\\\app-qc-debug.apk";
ApkFile apkFile = new ApkFile(new File(filePath));
ApkMeta apkMeta = apkFile.getApkMeta();
String Version=apkMeta.getVersionName();
DateFormat df = new SimpleDateFormat("dd/MM/yy, HH:mm:ss");
Date dateobj = new Date();
String currentDateTime = df.format(dateobj);
StringBuilder color = new StringBuilder();
StringBuilder status = new StringBuilder();
// define a HTML String Builder
StringBuilder actualResult = new StringBuilder();
StringBuilder htmlStringBuilder = new StringBuilder();
// append html header and title
htmlStringBuilder.append(
"<html><head><link rel=\"stylesheet\" href=\"https://maxcdn.bootstrapcdn.com/bootstrap/4.0.0/css/bootstrap.min.css\" integrity=\"sha384-Gn5384xqQ1aoWXA+058RXPxPg6fy4IWvTNh0E263XmFcJlSAwiGgFAW/dAiS6JXm\" crossorigin=\"anonymous\">\r\n"
+ "<script src=\"https://maxcdn.bootstrapcdn.com/bootstrap/4.0.0/js/bootstrap.min.js\" integrity=\"sha384-JZR6Spejh4U02d8jOt6vLEHfe/JQGiRRSQQxSfFWpi1MquVdAyjUar5+76PVCmYl\" crossorigin=\"anonymous\"></script><title>Appium Test </title></head>");
// append body
htmlStringBuilder.append("<body>");
// append table
//if (count == 0)
{
htmlStringBuilder.append("<table class=\"table table-striped table-bordered\">");
htmlStringBuilder.append("<tr><th style=\"background-color:#a6a6a6\">Date Time</th><td>"
+ currentDateTime
+ "</td><th style=\"background-color:#a6a6a6\">Environment Tested</th><td>QC</td></tr>"
+ "<tr><th style=\"background-color:#a6a6a6\">OS</th><td>Android</td><th style=\"background-color:#a6a6a6\">Application</th><td>app-qc-debug.apk</td></tr>"
+ "<tr><th style=\"background-color:#a6a6a6\">Script Name</th><td colspan=\""
+ 3
+ "\">Cityvan Workflow</td>"
+ "<th style=\"background-color:#a6a6a6\">Build Version</th><td>"+Version+"</td></tr><tr><th style=\"background-color:#a6a6a6\">Objective</th><td colspan=\""
+ 3 + "\">To verify that cityvan app is working as expected</td><tr><tr></table>");
}
// append row
htmlStringBuilder.append("<table class=\"table table-striped\">");
htmlStringBuilder.append(
"<thead style=\"background-color:#007acc\"><tr><th><b>TestObjective</b></th><th><b>StepName</b></th><th><b>StepDescription</b></th><th><b>ExpectedResult</b></th><th><b>ActualResult</b></th><th><b>Status</b></th><th><b>Screenshot</b></th></tr></thead><tbody>");
// append row
for (TestReportSteps a : items) {
if (!a.getActualResultFail().isEmpty()) {
status.append("Fail");
color.append("red");
actualResult.append(a.getActualResultFail());
} else {
status.append("Pass");
color.append("green");
actualResult.append(a.getActualResultPass());
}
if (a.getScreenshotPath()!=null)
{
htmlStringBuilder.append("<tr><td>" + a.getTestObjective() + "</td><td>" + a.getStepName()
+ "</td><td>" + a.getStepDescription() + "</td><td>" + a.getExpectedResult() + "</td><td>"
+ actualResult + "</td><td style=\"color:" + color + ";font-weight:bolder;\">" + status
+ "</td><td>Click here</td></tr>");
}
else
{
htmlStringBuilder.append("<tr><td style=\"font-weight:bold\">" + a.getTestObjective() + "</td><td>"
+ a.getStepName() + "</td><td>" + a.getStepDescription() + "</td><td>"
+ a.getExpectedResult() + "</td><td>" + actualResult + "</td><td style=\"color:" + color
+ ";font-weight:bolder;\">" + status + "</td><td></td></tr>");
}
actualResult.delete(0, actualResult.length());
color.delete(0, color.length());
status.delete(0, status.length());
}
// close html file
htmlStringBuilder.append("</tbody></table></body></html>");
// write html string content to a file
String htmlFilepath = "";
htmlFilepath = "D:\\FinalAppiumWorkspace\\AppiumMavenProject2\\src\\test\\java\\testOutput\\HtmlReport\\" + getCurrentDateTime + "\\testfile"
+ getCurrentTime + "\\testfile.html";
WriteToFile(htmlStringBuilder.toString(), htmlFilepath);
} catch (IOException e) {
e.printStackTrace();
}
}
public static void WriteToFile(String fileContent, String fileName) throws IOException, FileNotFoundException {
File file = new File(fileName);
file.getParentFile().mkdirs();
PrintWriter out = null;
if (file.exists() && !file.isDirectory())
{
out = new PrintWriter(new FileOutputStream(new File(fileName), true));
out.append(fileContent);
out.close();
} else
{
// write to file with OutputStreamWriter
OutputStream outputStream = new FileOutputStream(file.getAbsoluteFile(), false);
Writer writer = new OutputStreamWriter(outputStream);
writer.write(fileContent);
writer.close();
}
}
}
The path the code references doesn't exist on the Device Farm device host. The device host for Android tests is a Linux machine, and in my experience we have access to the tmp directory. Using the custom artifacts feature of Device Farm together with the tmp directory, this should be possible. Try changing the path to the HTML file to:
htmlFilepath = "/tmp/reports/testfile.html";
Then, using the web console, explicitly mark that directory to be exported.
Once the tests finish, you should see a link for customer artifacts.
Additionally, you may be interested in other options for test reports like
extent reports
Allure reports
rather than writing your own from scratch.
HTH
-James
For users who get a java.io.FileNotFoundException: (Permission denied) exception while trying to save test logs or reports in $WORKING_DIRECTORY or any other path: you can use $DEVICEFARM_LOG_DIR to save your artifacts. This worked for me.
String artifactsDir = System.getenv("DEVICEFARM_LOG_DIR");
Use this directory to store any artifacts.
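A minimal sketch of how the report path could be chosen so the same code runs locally and in Device Farm. DEVICEFARM_LOG_DIR is the variable from the answer above; the class ReportLocation, its methods, and the fallback to a "reports" folder under the JVM temp directory are assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of any Device Farm SDK.
final class ReportLocation {
    // Uses the Device Farm log directory when it is set, else <java.io.tmpdir>/reports.
    static Path resolveDir(String deviceFarmLogDir) {
        if (deviceFarmLogDir != null && !deviceFarmLogDir.isEmpty()) {
            return Paths.get(deviceFarmLogDir);
        }
        return Paths.get(System.getProperty("java.io.tmpdir"), "reports");
    }

    // Creates the directory if needed, writes the HTML, and returns the file written.
    static Path write(Path dir, String fileName, String html) throws IOException {
        Files.createDirectories(dir);
        Path target = dir.resolve(fileName);
        Files.write(target, html.getBytes(StandardCharsets.UTF_8));
        return target;
    }

    public static void main(String[] args) throws IOException {
        Path dir = resolveDir(System.getenv("DEVICEFARM_LOG_DIR"));
        Path report = write(dir, "testfile.html", "<html><body>report</body></html>");
        System.out.println("Report written to " + report);
    }
}
```

In the question's code this would replace the hard-coded D:\ path passed to WriteToFile.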

Get "holes" in dates in a MongoDB collection

I have a MongoDB collection that stores data for each hour since 2011.
For example:
{
"dateEntity" : ISODate("2011-01-01T08:00:00Z"),
"price" : 0.3
}
{
"dateEntity" : ISODate("2011-01-01T09:00:00Z"),
"price" : 0.35
}
I'd like to know if there are "holes" in those dates, for example a missing entry for an hour.
Unfortunately, there is no gap-marking aggregator in MongoDB.
I checked whether one could write a custom gap aggregator for MongoDB with JavaScript functions in a map-reduce pipeline, by creating a time raster in the first map stage and then mapping it to the corresponding values. However, database reads are discouraged while mapping and reducing, so that would be bad design. So it is not possible to achieve this with MongoDB's own tools.
I think there are two possible solutions.
Solution one: Use a driver such as the Java driver
I suggest using an idiomatic driver, such as the Java driver, for your MongoDB data and creating a raster of hours as in the test below.
import com.mongodb.BasicDBObject;
import com.mongodb.MongoClient;
import com.mongodb.ServerAddress;
import com.mongodb.client.MongoCollection;
import org.bson.Document;
import org.junit.Test;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
public class HourGapsTest {
@Test
public void testHourValues() {
String host = "127.0.0.1:27017";
ServerAddress addr = new ServerAddress(host);
MongoClient mongoClient = new MongoClient(addr);
MongoCollection<Document> collection = mongoClient.getDatabase("sotest").getCollection("hourhole");
LocalDateTime start = LocalDateTime.of(2011, 1, 1, 8, 0, 0);
LocalDateTime end = LocalDateTime.of(2011, 1, 2, 0, 0, 0);
List<LocalDateTime> allHours = new ArrayList<>();
for (LocalDateTime hour = start; hour.isBefore(end); hour = hour.plusHours(1L)) {
allHours.add(hour);
}
List<LocalDateTime> gaps = new ArrayList<>();
for (LocalDateTime hour : allHours) {
BasicDBObject filter = new BasicDBObject("dateEntity", new Date(hour.toInstant(ZoneOffset.UTC).toEpochMilli()));
if (!collection.find(filter).iterator().hasNext()) {
gaps.add(hour);
}
}
gaps.forEach(System.out::println);
}
}
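The test above issues one query per hour. A possible variant, sketched here under assumptions, is to read all dateEntity values once and compare them to the hour raster in memory. The class HourGaps and the method findGaps are invented for this illustration, and the hard-coded set stands in for values that would come from the collection.

```java
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: the raster comparison only, with no database access.
final class HourGaps {
    // Returns every full hour in [start, end) that is missing from `present`.
    static List<LocalDateTime> findGaps(LocalDateTime start, LocalDateTime end, Set<LocalDateTime> present) {
        List<LocalDateTime> gaps = new ArrayList<>();
        for (LocalDateTime hour = start; hour.isBefore(end); hour = hour.plusHours(1)) {
            if (!present.contains(hour)) {
                gaps.add(hour);
            }
        }
        return gaps;
    }

    public static void main(String[] args) {
        // Stand-in for the dateEntity values read from the collection (as UTC).
        Set<LocalDateTime> present = new HashSet<>(Arrays.asList(
                LocalDateTime.of(2011, 1, 1, 8, 0),
                LocalDateTime.of(2011, 1, 1, 9, 0),
                LocalDateTime.of(2011, 1, 1, 11, 0)));
        List<LocalDateTime> gaps = findGaps(
                LocalDateTime.of(2011, 1, 1, 8, 0),
                LocalDateTime.of(2011, 1, 1, 12, 0),
                present);
        System.out.println(gaps); // prints [2011-01-01T10:00]
    }
}
```

This trades many small queries for one scan plus a set that holds every timestamp in the range, which stays small for hourly data.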
Solution two: Use a timeseries database
Time-series databases such as KairosDB provide this functionality. Consider storing this time-value data in a time-series database.

Unable to get any data when a Spark Streaming program is run with textFileStream as the source

I am running the following code in the Spark shell:
> spark-shell
scala> import org.apache.spark.streaming._
import org.apache.spark.streaming._
scala> import org.apache.spark._
import org.apache.spark._
scala> object sparkClient{
| def main(args : Array[String])
| {
| val ssc = new StreamingContext(sc,Seconds(1))
| val Dstreaminput = ssc.textFileStream("hdfs:///POC/SPARK/DATA/*")
| val transformed = Dstreaminput.flatMap(word => word.split(" "))
| val mapped = transformed.map(word => if(word.contains("error"))(word,"defect")else(word,"non-defect"))
| mapped.print()
| ssc.start()
| ssc.awaitTermination()
| }
| }
defined object sparkClient
scala> sparkClient.main(null)
Output is blank as follows. No file is read and no streaming took place.
Time: 1510663547000 ms
Time: 1510663548000 ms
Time: 1510663549000 ms
Time: 1510663550000 ms
Time: 1510663551000 ms
Time: 1510663552000 ms
Time: 1510663553000 ms
Time: 1510663554000 ms
Time: 1510663555000 ms
The path which I have given as input in the above code is as follows:
[hadoopadmin@master ~]$ hadoop fs -ls /POC/SPARK/DATA/
17/11/14 18:04:32 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 3 items
-rw-r--r-- 2 hadoopadmin supergroup 17881 2017-09-21 11:02 /POC/SPARK/DATA/LICENSE
-rw-r--r-- 2 hadoopadmin supergroup 24645 2017-09-21 11:04 /POC/SPARK/DATA/NOTICE
-rw-r--r-- 2 hadoopadmin supergroup 845 2017-09-21 12:35 /POC/SPARK/DATA/confusion.txt
Could anyone please explain where I am going wrong? Or is there anything wrong with the syntax (although I did not encounter any error)? I am new to Spark.
textFileStream won't read pre-existing data. It will include only new files:
created in the dataDirectory by atomically moving or renaming them into the data directory.
https://spark.apache.org/docs/latest/streaming-programming-guide.html#basic-sources
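The "atomically moving or renaming" requirement quoted above could be met roughly as follows: write the file somewhere else first, then move it into the watched directory in a single step once it is complete. This sketch works on a local filesystem with java.nio; the class AtomicDrop, its drop() method, and the staging directory are assumptions for illustration. For a directory on HDFS, as in the question, the equivalent would be to upload to a staging path and then rename it into place, for example with hadoop fs -mv.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: stage a file, then move it into the watched directory.
final class AtomicDrop {
    // Writes the lines to stagingDir first, then moves the finished file into watchedDir.
    // ATOMIC_MOVE fails if the two directories are on different filesystems.
    static Path drop(Path stagingDir, Path watchedDir, String fileName, List<String> lines) throws IOException {
        Files.createDirectories(stagingDir);
        Files.createDirectories(watchedDir);
        Path staged = stagingDir.resolve(fileName + ".tmp");
        Files.write(staged, lines, StandardCharsets.UTF_8);
        return Files.move(staged, watchedDir.resolve(fileName), StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        // Both directories live under one temp root so the move stays on one filesystem.
        Path root = Files.createTempDirectory("stream-demo");
        Path moved = drop(root.resolve("staging"), root.resolve("input"),
                "batch1.txt", Arrays.asList("error in module a", "all fine"));
        System.out.println("Moved to " + moved);
    }
}
```

The streaming job then never sees a half-written file, because the file only appears in the watched directory after it is complete.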
Everyone on the earth has a right to be happy, be it spark itself or a spark developer.
Spark Streaming's textFileStream() needs files to be modified after the streaming process has started. This means Spark Streaming will not read existing files.
So you might think you can simply copy new files in. But this is a problem, because a copy does not affect the modified time of the file.
The last option is to create new files on the fly. That is tedious, though, and has to happen while the Spark job is running.
I wrote a simple Java program that creates the files on the fly, so everyone is happy now. :-) (You just need the commons-io library on the classpath; it is a single jar.)
import java.awt.Button;
import java.awt.FlowLayout;
import java.awt.event.ActionEvent;
import java.awt.event.ActionListener;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import javax.swing.JFrame;
import org.apache.commons.io.IOUtils;
public class CreateFileMain extends JFrame {
private static final long serialVersionUID = 1L;
Button b;
public CreateFileMain() {
b = new Button("Create New File");
b.addActionListener(new ActionListener() {
@Override
public void actionPerformed(ActionEvent e) {
String dir = "C:/Users/spratapw/workspace/batchload1/spark-streaming-poc/input/";
deleteExistingFiles(dir);
Random r = new Random();
File f = new File(dir+r.nextInt()+".txt");
createNewFile(f);
}
private void createNewFile(File f) {
try {
f.createNewFile();
List<String> lines = new ArrayList<>();
lines.add("Hello World");
FileOutputStream fos = new FileOutputStream(f);
IOUtils.writeLines(lines, "\n", fos, Charset.defaultCharset());
fos.close();
} catch (IOException e2) {
e2.printStackTrace();
}
}
private void deleteExistingFiles(String dir) {
File filetodelete = new File(dir);
File[] allContents = filetodelete.listFiles();
if (allContents != null) {
for (File file : allContents) {
file.delete();
}
}
}
});
this.add(b);
this.setLayout(new FlowLayout());
}
public static void main(String[] args) throws IOException {
CreateFileMain m = new CreateFileMain();
m.setVisible(true);
m.setSize(200, 200);
m.setLocationRelativeTo(null);
m.setDefaultCloseOperation(EXIT_ON_CLOSE);
}
}

Creating a chart in apache poi

I need to create, in Java, a Microsoft Word document containing charts. I'm trying out Apache POI but haven't found a way to do it. Are there any examples of how to do this?
You can create the chart using a temporary MS Word file.
Create the charts in the temporary Word file, read them using this customised POI jar, and write them back to your actual Word file:
https://github.com/sandeeptiwari32/POI_ENHN/blob/master/POI3.14.jar
This code is also available in the official POI version 4.0.
Code example:
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import org.apache.poi.POIXMLDocumentPart;
import org.apache.poi.openxml4j.exceptions.InvalidFormatException;
import org.apache.poi.xwpf.usermodel.XWPFChart;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.openxmlformats.schemas.drawingml.x2006.chart.CTChart;
import org.openxmlformats.schemas.drawingml.x2006.chart.CTTitle;
import org.openxmlformats.schemas.drawingml.x2006.chart.CTTx;
import org.openxmlformats.schemas.drawingml.x2006.main.CTRegularTextRun;
import org.openxmlformats.schemas.drawingml.x2006.main.CTTextBody;
import org.openxmlformats.schemas.drawingml.x2006.main.CTTextParagraph;
public class TestXWPFChart {
public static void main(String[] args) throws Exception {
FileInputStream inpuFile=new FileInputStream("input.docx");
FileOutputStream outFile = new FileOutputStream("output.docx");
@SuppressWarnings("resource")
XWPFDocument document = new XWPFDocument(inpuFile);
XWPFChart chart=null;
for (POIXMLDocumentPart part : document.getRelations()) {
if (part instanceof XWPFChart) {
chart = (XWPFChart) part;
break;
}
}
//change chart title from "Chart Title" to XWPF CHART
CTChart ctChart = chart.getCTChart();
CTTitle title = ctChart.getTitle();
CTTx tx = title.addNewTx();
CTTextBody rich = tx.addNewRich();
rich.addNewBodyPr();
rich.addNewLstStyle();
CTTextParagraph p = rich.addNewP();
CTRegularTextRun r = p.addNewR();
r.addNewRPr();
r.setT("XWPF CHART");
//write modified chart in output docx file
document.write(outFile);
}
}