Connecting Scala with a Hive database using sbt for dependencies in IntelliJ - scala

I am having a very difficult time connecting to a Hive database using IntelliJ or the basic command line with Scala (I would be happy with Java too). In the past I have been able to connect to a MySQL database by adding mysql-connector as a library, but somehow I am unable to add a jar file to the project structure in a way that works.
And to make things a bit more difficult, I have installed Ubuntu with Hive, Spark and Hadoop, and I am connecting to it over the network.
Is there some way I can add a dependency in the sbt file?
Lastly, I know there are similar questions, but they do not show in detail how to connect to a Hive database from Scala.
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

object HiveJdbcClient extends App {
  val driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";
  Class.forName(driverName);
  val con = DriverManager.getConnection("jdbc:hive://http://192.168.43.64:10000/default", "", "");
  val stmt = con.createStatement();
  val tableName = "testHiveDriverTable";
  stmt.executeQuery("drop table " + "wti");
  var res = stmt.executeQuery("create table " + tableName + " (key int, value string)");
  // select * query
  var sql = "select * from " + tableName;
  res = stmt.executeQuery(sql);
  while (res.next()) {
    System.out.println(String.valueOf(res.getInt(1)) + "\t" + res.getString(2));
  }
  // regular hive query
  sql = "select count(1) from " + tableName;
  res = stmt.executeQuery(sql);
  while (res.next()) {
    System.out.println(res.getString(1));
  }
}

The driver name is not correct for Hive 3.1.2; it should be
org.apache.hive.jdbc.HiveDriver
See https://hive.apache.org/javadocs/r3.1.2/api/org/apache/hive/jdbc/HiveDriver.html
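For the sbt part of the question, a minimal sketch of what the dependency and a corrected connection could look like. The versions, host and port below are assumptions taken from the question (Hive 3.1.2, HiveServer2 on 192.168.43.64:10000); adjust them to your cluster. In build.sbt:

// build.sbt - pull in the Hive JDBC driver (plus Hadoop client classes it commonly needs)
libraryDependencies ++= Seq(
  "org.apache.hive" % "hive-jdbc" % "3.1.2",
  "org.apache.hadoop" % "hadoop-common" % "3.1.0"
)

Note that with the HiveServer2 driver the URL scheme is jdbc:hive2:// and should not contain http://:

// Sketch of a connection check against HiveServer2 (not the asker's exact setup)
import java.sql.DriverManager

object HiveConnectionCheck extends App {
  Class.forName("org.apache.hive.jdbc.HiveDriver")
  val con = DriverManager.getConnection("jdbc:hive2://192.168.43.64:10000/default", "", "")
  val rs = con.createStatement().executeQuery("show tables")
  while (rs.next()) println(rs.getString(1))
  con.close()
}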

Related

Apache Spark Data Generator Function on Databricks Not working

I am trying to execute the Data Generator function provided by Microsoft to test streaming data to Event Hubs.
Unfortunately, I keep getting the error
Processing failure: No such file or directory
when I try to execute the function:
%scala
DummyDataGenerator.start(15)
Can someone take a look at the code and help decipher why I'm getting the error:
class DummyDataGenerator:
  streamDirectory = "/FileStore/tables/flight"
None # suppress output
I'm not sure how the above cell gets called into the function DummyDataGenerator
%scala
import scala.util.Random
import java.io._
import java.time._

// Notebook #2 has to set this to 8, we are setting
// it to 200 to "restore" the default behavior.
spark.conf.set("spark.sql.shuffle.partitions", 200)

// Make the username available to all other languages.
// WARNING: use of the "current" username is unpredictable
// when multiple users are collaborating and should be replaced
// with the notebook ID instead.
val username = com.databricks.logging.AttributionContext.current.tags(com.databricks.logging.BaseTagDefinitions.TAG_USER);
spark.conf.set("com.databricks.training.username", username)

object DummyDataGenerator extends Runnable {
  var runner : Thread = null;
  val className = getClass().getName()
  val streamDirectory = s"dbfs:/tmp/$username/new-flights"

  val airlines = Array( ("American", 0.17), ("Delta", 0.12), ("Frontier", 0.14), ("Hawaiian", 0.13), ("JetBlue", 0.15), ("United", 0.11), ("Southwest", 0.18) )
  val reasons = Array("Air Carrier", "Extreme Weather", "National Aviation System", "Security", "Late Aircraft")

  val rand = new Random(System.currentTimeMillis())
  var maxDuration = 3 * 60 * 1000 // default to three minutes

  def clean() {
    System.out.println("Removing old files for dummy data generator.")
    dbutils.fs.rm(streamDirectory, true)
    if (dbutils.fs.mkdirs(streamDirectory) == false) {
      throw new RuntimeException("Unable to create temp directory.")
    }
  }

  def run() {
    val date = LocalDate.now()
    val start = System.currentTimeMillis()

    while (System.currentTimeMillis() - start < maxDuration) {
      try {
        val dir = s"/dbfs/tmp/$username/new-flights"
        val tempFile = File.createTempFile("flights-", "", new File(dir)).getAbsolutePath()+".csv"
        val writer = new PrintWriter(tempFile)

        for (airline <- airlines) {
          val flightNumber = rand.nextInt(1000)+1000
          val deptTime = rand.nextInt(10)+10
          val departureTime = LocalDateTime.now().plusHours(-deptTime)
          val (name, odds) = airline
          val reason = Random.shuffle(reasons.toList).head
          val test = rand.nextDouble()

          val delay = if (test < odds)
            rand.nextInt(60)+(30*odds)
          else rand.nextInt(10)-5

          println(s"- Flight #$flightNumber by $name at $departureTime delayed $delay minutes due to $reason")
          writer.println(s""" "$flightNumber","$departureTime","$delay","$reason","$name" """.trim)
        }
        writer.close()

        // wait a couple of seconds
        //Thread.sleep(rand.nextInt(5000))

      } catch {
        case e: Exception => {
          printf("* Processing failure: %s%n", e.getMessage())
          return;
        }
      }
    }
    println("No more flights!")
  }

  def start(minutes:Int = 5) {
    maxDuration = minutes * 60 * 1000

    if (runner != null) {
      println("Stopping dummy data generator.")
      runner.interrupt();
      runner.join();
    }
    println(s"Running dummy data generator for $minutes minutes.")
    runner = new Thread(this);
    runner.run();
  }

  def stop() {
    start(0)
  }
}
DummyDataGenerator.clean()

displayHTML("Imported streaming logic...") // suppress output
You should be able to use the Databricks Labs Data Generator on the Databricks community edition. I'm providing the instructions below:
Running Databricks Labs Data Generator on the community edition
The Databricks Labs Data Generator is a Pyspark library so the code to generate the data needs to be Python. But you should be able to create a view on the generated data and consume it from Scala if that's your preferred language.
You can install the framework on the Databricks community edition by creating a notebook with the cell
%pip install git+https://github.com/databrickslabs/dbldatagen
Once it's installed, you can use the library to define a data generation spec and then, by calling build, generate a Spark dataframe from it.
The following example shows generation of batch data similar to the data set you are trying to generate. This should be placed in a separate notebook cell.
Note: here we generate 10 million records to illustrate the ability to create larger data sets; it can be used to generate datasets much larger than that.
%python
import dbldatagen as dg  # Databricks Labs Data Generator, installed via the %pip cell above

num_rows = 10 * 1000000  # number of rows to generate
num_partitions = 8  # number of Spark dataframe partitions

delay_reasons = ["Air Carrier", "Extreme Weather", "National Aviation System", "Security", "Late Aircraft"]

# will have implied column `id` for ordinal of row
flightdata_defn = (dg.DataGenerator(spark, name="flight_delay_data", rows=num_rows, partitions=num_partitions)
                   .withColumn("flightNumber", "int", minValue=1000, uniqueValues=10000, random=True)
                   .withColumn("airline", "string", minValue=1, maxValue=500, prefix="airline", random=True, distribution="normal")
                   .withColumn("original_departure", "timestamp", begin="2020-01-01 01:00:00", end="2020-12-31 23:59:00", interval="1 minute", random=True)
                   .withColumn("delay_minutes", "int", minValue=20, maxValue=600, distribution=dg.distributions.Gamma(1.0, 2.0))
                   .withColumn("delayed_departure", "timestamp", expr="cast(original_departure as bigint) + (delay_minutes * 60) ", baseColumn=["original_departure", "delay_minutes"])
                   .withColumn("reason", "string", values=delay_reasons, random=True)
                   )
df_flight_data = flightdata_defn.build()
display(df_flight_data)
You can find information on how to generate streaming data in the online documentation at https://databrickslabs.github.io/dbldatagen/public_docs/using_streaming_data.html
You can create a named temporary view over the data so that you can access it from SQL or Scala using one of two methods:
1: use createOrReplaceTempView
df_flight_data.createOrReplaceTempView("delays")
2: use options for build. In this case the name passed to the data generator initializer will be the name of the view, i.e.
df_flight_data = flightdata_defn.build(withTempView=True)
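Either way, once the view exists, a minimal sketch of consuming it from Scala might look like the following (the view name "delays" comes from method 1 above; with method 2 it would be "flight_delay_data"):

%scala
// Query the temp view registered by the Python cell and work with it as a DataFrame
val delays = spark.sql("select * from delays where delay_minutes > 60")
delays.show(10)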
This code will not work on the community edition because of this line:
val dir = s"/dbfs/tmp/$username/new-flights"
as there is no DBFS fuse mount on Databricks community edition (it's supported only on full Databricks). It's potentially possible to make it work by:
Changing that directory to a local directory, such as /tmp or similar
Adding code (after writer.close()) to list the flights-* files in that local directory and move them into streamDirectory with dbutils.fs.mv, as sketched below.
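A rough sketch of that second step, assuming the generator is changed to write into a hypothetical local directory /tmp/new-flights (streamDirectory stays the dbfs:/ path already defined in the object):

%scala
// Hypothetical adaptation: after writer.close(), move locally written CSVs into DBFS
val localDir = new java.io.File("/tmp/new-flights") // assumed local output directory
for (f <- localDir.listFiles() if f.getName.startsWith("flights-")) {
  // the "file:" scheme points dbutils at the driver's local filesystem
  dbutils.fs.mv("file:" + f.getAbsolutePath, s"$streamDirectory/${f.getName}")
}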

Odoo - Creating a sequence based on a PostgreSQL sequence

I am working with Odoo 14 and I want to customize sale.order number generation, so I want to create a new sequence (ir.sequence) based on a PostgreSQL database sequence object.
Do you have any idea?
Thank you for your help.
SAAD
from odoo import api, fields, models
import psycopg2

class ventes(models.Model):
    _inherit = ['sale.order']
    company = fields.Char()
    name = fields.Char(string='Order Reference')

    # Connect to the database
    def open_conn(self):
        try:
            connection = psycopg2.connect(user="user",
                                          password="xxxxxxxxxxxxxx",
                                          host="192.168.1.1",
                                          port="5432",
                                          database="ventes")
            print("Using Python variable in PostgreSQL select Query")
            cursor = connection.cursor()
            postgreSQL_select_Query = "select nextval('myOdoo')"
            cursor.execute(postgreSQL_select_Query)
            row = cursor.fetchone()
            return row[0]
        except (Exception, psycopg2.Error) as error:
            print("Error fetching data from PostgreSQL table", error)
        finally:
            # closing database connection
            if connection:
                cursor.close()
                connection.close()
                print("PostgreSQL connection is closed \n")

    @api.model
    def create(self, vals):
        num = self.open_conn()
        vals['name'] = num
        result = super(ventes, self).create(vals)
        return result

I am trying to delete data from Postgres using Spark but am unable to; the same code works for select statements

Class.forName("org.postgresql.Driver")
val conn = Url
val del = s"(delete from db.table where timestamp = '1950-09-08 00:00:00.000')"
val db = DriverManager.getConnection(conn)
println("delete query :" + del)
val pstdel = db.prepareStatement(del)
try {
  pstdel.execute()
}
I am getting the below error:
org.postgresql.util.PSQLException: ERROR: syntax error at or near "delete"
The same code is working for select statements. I do have delete permissions.
Try this
val del = s"delete from db.table where timestamp = '1950-09-08 00:00:00.000'"

Cannot access data after full text search using sqlalchemy, postgres and flask

I would like to search my Postgres database using Postgres' built-in full text search capability. In my app I have a set of posts stored with a title, content and date.
I think I can search the database using tsvector, but I cannot retrieve the data from the results, i.e. the title, the content and the date. Could anyone help me, please?
import json, sys
from flask import Flask, render_template
from flask_sqlalchemy import SQLAlchemy
from sqlalchemy.dialects import postgresql
from sqlalchemy.sql.expression import cast, func
from sqlalchemy import Index
def create_tsvector(*args):
    exp = args[0]
    for e in args[1:]:
        exp += ' ' + e
    return func.to_tsvector('english', exp)
app = Flask(__name__)
app.config['SECRET_KEY'] = 'some_key'
app.config["SQLALCHEMY_DATABASE_URI"] = 'postgresql:somedb'
db = SQLAlchemy(app)
class Post(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    title = db.Column(db.Text, nullable=False)
    content = db.Column(db.Text, nullable=False)
    date = db.Column(db.Text,unique=False)

    __ts_vector__ = create_tsvector(
        cast(func.coalesce(content, ''), postgresql.TEXT)
    )

    __table_args__ = (Index('idx_post_fts', __ts_vector__, postgresql_using='gin'), )

    def __repr__(self):
        return f"Post('{self.title}', '{self.date}')"
if len(sys.argv) > 1:
    filename1 = sys.argv[1]
    infile=open(filename1,'r')
    posts=json.load(infile)
    infile.close()

    List=list(posts)
    art = 0
    for j in range(0,len(List)):
        if j % 10 == 0:
            print(j)
        title= posts[List[art]]['Title']
        date = posts[List[art]]['Posted']
        content=posts[List[art]]['Text']
        post = Post(title=title, date=date, content=content)
        db.session.add(post)
        db.session.commit()
        art+=1
from sqlalchemy.dialects.postgresql import TSVECTOR
from sqlalchemy import select, cast
posts = Post.__ts_vector__.match("bicycle", postgresql_regconfig='english')
print(posts)

Exporting data from Mongo/Cassandra to HDFS using Apache Sqoop

I have a problem where I have to read data from multiple data sources, i.e. RDBMS (MySQL, Oracle) and NoSQL (MongoDB, Cassandra), into HDFS via Hive (incrementally).
Apache Sqoop works perfectly for RDBMS, but it does not work for NoSQL; at least I was not able to use it successfully (I tried to use the JDBC driver for Mongo... it was able to connect to Mongo but could not push to HDFS).
If anyone has done any work related to this and can share it, it would be really very helpful.
I have used an example from the web and was able to transfer files from Mongo to HDFS and the other way round. I can't recall the exact web page right now, but the program looks like below.
You can use this as a starting point and move on.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.bson.BSONObject;
import org.bson.types.ObjectId;
import com.mongodb.hadoop.MongoInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import com.mongodb.hadoop.util.MongoConfigUtil;

public class CopyFromMongodbToHDFS {

    // Turns each BSON document read from MongoDB into a tab-separated text line keyed by md5.
    public static class ImportWeblogsFromMongo extends
            Mapper<Object, BSONObject, Text, Text> {

        public void map(Object key, BSONObject value, Context context)
                throws IOException, InterruptedException {
            System.out.println("Key: " + key);
            System.out.println("Value: " + value);
            String md5 = value.get("md5").toString();
            String url = value.get("url").toString();
            String date = value.get("date").toString();
            String time = value.get("time").toString();
            String ip = value.get("ip").toString();
            String output = "\t" + url + "\t" + date + "\t" + time + "\t" + ip;
            context.write(new Text(md5), new Text(output));
        }
    }

    public static void main(String[] args) throws IOException,
            InterruptedException, ClassNotFoundException {

        Configuration conf = new Configuration();
        MongoConfigUtil.setInputURI(conf,
                "mongodb://127.0.0.1:27017/test.mylogs");
        System.out.println("Configuration: " + conf);

        @SuppressWarnings("deprecation")
        Job job = new Job(conf, "Mongo Import");

        Path out = new Path("/user/cloudera/test1/logs.txt");
        FileOutputFormat.setOutputPath(job, out);

        job.setJarByClass(CopyFromMongodbToHDFS.class);
        job.setMapperClass(ImportWeblogsFromMongo.class);

        // The mapper emits Text/Text pairs, so the job output classes must match.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        job.setInputFormatClass(MongoInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setNumReduceTasks(0);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
In the case of MongoDB, create a mongodump of the collection you want to export to HDFS.
cd <dir_name>
mongodump -h <IP_address> -d <db_name> -c <collection_name>
This creates a dump in .bson format, e.g. "file.bson", stored by default in the "dump" folder inside the specified <dir_name>. To convert it to .json format:
bsondump file.bson > file.json
Then copy the .json file to HDFS using "copyFromLocal".