HTTP Post Request using import org.apache.http.client._ in Scala

I am trying to post an alert message to a Google Chat space via an HTTP POST request.
import org.apache.http.client.config.RequestConfig
import org.apache.http.client.methods.{CloseableHttpResponse, HttpPost}
import org.apache.http.entity.StringEntity
import org.apache.http.impl.client.HttpClientBuilder

def speak(text: String, url: String): Unit = {
  val timeout = 1800 // seconds
  val requestConfig = RequestConfig.custom()
    .setConnectTimeout(timeout * 1000)
    .setConnectionRequestTimeout(timeout * 1000)
    .setSocketTimeout(timeout * 1000)
    .build()
  val client = HttpClientBuilder.create().setDefaultRequestConfig(requestConfig).build()
  val post = new HttpPost(url)
  post.addHeader("Content-Type", "application/json")
  post.setEntity(new StringEntity(text))
  val response: CloseableHttpResponse = client.execute(post)
  response.close()
}
The issue is that I am only able to post a single-line alert message, for example:
speak("Hi this is code", url)
But I need to post a string that is in DataFrame format:
+-----+-----+-----+
|colna|colnb|colnc|
+-----+-----+-----+
|    1|    2|    3|
|    4|    5|    6|
+-----+-----+-----+
I need to either post the DataFrame directly or post a string built from the DataFrame using some function.
Can anyone help?
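One possible sketch (not from the question itself): collect the DataFrame on the driver and render the rows as a plain-text table, then pass the resulting string to speak. The renderTable helper below is illustrative, not a Spark API, and assumes the table is small enough to collect.

```scala
// Sketch: render column names and rows as an ASCII table similar to df.show().
// With Spark you could call it roughly as:
//   speak(renderTable(df.columns.toSeq, df.collect().map(_.toSeq.map(String.valueOf))), url)
def renderTable(headers: Seq[String], rows: Seq[Seq[String]]): String = {
  val table = headers +: rows
  // Width of each column is the length of the longest cell in that column.
  val widths = table.transpose.map(col => col.map(_.length).max)
  val divider = widths.map("-" * _).mkString("+", "+", "+")
  def fmt(cells: Seq[String]) =
    cells.zip(widths).map { case (c, w) => c.padTo(w, ' ') }.mkString("|", "|", "|")
  (Seq(divider, fmt(headers), divider) ++ rows.map(fmt) :+ divider).mkString("\n")
}
```

Note that this only works for small alert-sized DataFrames; collecting a large DataFrame to the driver will not scale.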

Related

Spark Send DataFrame as body of HTTP Post request

I have a DataFrame that I want to send as the body of an HTTP POST request. What is the best Spark way to do it?
How can I control the number of HTTP requests?
If the number of records gets bigger, is there any way to split the DataFrame into multiple HTTP POST calls?
Let's say my DataFrame looks like this:
+--------------------------------------+------------+------------+------------------+
| user_id | city | user_name | facebook_id |
+--------------------------------------+------------+------------+------------------+
| 55c3c59d-0163-46a2-b495-bc352a8de883 | Toronto | username_x | 0123482174440907 |
| e2ddv22d-4132-c211-4425-9933aa8de454 | Washington | username_y | 0432982476780234 |
+--------------------------------------+------------+------------+------------------+
I want to send user_id and facebook_id in the body of an HTTP POST request to this endpoint: localhost:8080/api/spark
You can achieve this using the foreachPartition method on a DataFrame. I am assuming here that you want to make an HTTP call for each row of the DataFrame in parallel; foreachPartition operates on each partition of the DataFrame in parallel. If you want to batch multiple rows together into a single HTTP POST call, that is also possible by changing the signature of the makeHttpCall method from Row to Iterator[Row].
import org.apache.spark.sql.{DataFrame, Row}
import play.api.libs.json.Json

def test(): Unit = {
  val df: DataFrame = null // your DataFrame here
  df.foreachPartition(_.foreach(row => makeHttpCall(row)))
}

def makeHttpCall(row: Row) = {
  val json = Json.obj("user_name" -> row.getString(2), "facebook_id" -> row.getString(3))
  // code to make the HTTP call
}
For making bulk HTTP requests with makeHttpCall, make sure you have a sufficient number of partitions in the DataFrame so that each partition is small enough for a single HTTP POST request:
import org.apache.spark.sql.{DataFrame, Row}
import play.api.libs.json.Json

def test(): Unit = {
  val df: DataFrame = null // your DataFrame here
  df.foreachPartition(rows => makeHttpCall(rows))
}

def makeHttpCall(rows: Iterator[Row]) = {
  // build a single JSON array for the whole partition
  val json = Json.toJson(rows.map(x => Json.obj("user_name" -> x.getString(2), "facebook_id" -> x.getString(3))).toSeq)
  // code to make the HTTP call
}
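If you also need to cap how many rows go into each HTTP call, one sketch (the batch size is an assumption you tune, not from the answer) is to chunk each partition's iterator with grouped before building the JSON body, e.g. df.foreachPartition(rows => rows.grouped(100).foreach(makeHttpCall)). The chunking itself is plain Scala:

```scala
// Sketch: split an iterator of records into fixed-size batches, one HTTP call per batch.
// Inside Spark this would be: df.foreachPartition(rows => rows.grouped(batchSize).foreach(makeHttpCall))
def batches[A](rows: Iterator[A], batchSize: Int): Iterator[Seq[A]] =
  rows.grouped(batchSize).map(_.toSeq)
```

To control the total number of requests, you can also call df.repartition(n) first, since foreachPartition issues work per partition.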

How to call a method based on a JSON object in Scala Spark?

I have two functions like below:
def method1(ip: String, r: Double, op: String) = {
  val data = spark.read.option("header", true).csv(ip).toDF()
  val r3 = data.select("c", "S").dropDuplicates("C", "S").withColumn("R", lit(r))
  r3.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").save(op)
}

def method2(ip: String, op: String) = {
  val data = spark.read.option("header", true).csv(ip).toDF()
  val r3 = data.select("c", "S").dropDuplicates("C", "StockCode")
  r3.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").save(op)
}
I want to call these methods based on a JSON object parameter.
For example, if my input JSON is like below:
{"name":"method1","ip":"Or.csv","r":1.0,"op":"oppath"}
It has to call method1 with "Or.csv", 1.0, "oppath" as parameters, i.e. in the JSON object, name indicates the method name and the remaining fields are the parameters.
Please help me with this.
First we need to read the JSON through Spark into a DataFrame:
val df = sqlContext.read.json("path to the json file")
which should give you a DataFrame like:
scala> df.show()
+------+-------+------+---+
| ip| name| op| r|
+------+-------+------+---+
|Or.csv|method1|oppath|1.0|
+------+-------+------+---+
Next
scala> def method1(ip:String,r:Double,op:String)={
| val data = spark.read.option("header", true).csv(ip).toDF()
| val r3= data.select("c", "S").dropDuplicates("C", "S").withColumn("R", lit(r))
| r3.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").save(op)
| }
method1: (ip: String, r: Double, op: String)Unit
next
scala> def method2(ip:String,op:String)={
| val data = spark.read.option("header", true).csv(ip).toDF()
| val r3= data.select("c", "S").dropDuplicates("C", "StockCode")
| r3.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").save(op)
| }
method2: (ip: String, op: String)Unit
next
scala> df.withColumn("methodCalling", when($"name" === "method1", method1(df.first().getString(0), df.first().getDouble(3), df.first().getString(2))).otherwise(when($"name" === "method2", method2(df.first().getString(0), df.first().getString(2)))))
Note that spark.read.json sorts the columns alphabetically (ip, name, op, r), so the field indexes are 0 = ip, 2 = op, 3 = r.
It will call method1 or method2 based on the JSON object.
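Calling side-effecting methods inside withColumn is fragile, because when builds a column expression rather than invoking Scala code. A plainer alternative (an illustrative sketch, not from the answer) is to read the first row and dispatch with a pattern match on the name field. Here the JSON object is modeled as a plain Map so the logic stands alone; with a Spark Row you would use row.getAs[String]("name") and friends instead:

```scala
// Sketch: dispatch on the "name" field of a parsed JSON object (modeled as a Map here).
// method1/method2 are passed in as functions so the dispatcher itself has no Spark dependency.
def dispatch(params: Map[String, Any],
             method1: (String, Double, String) => Unit,
             method2: (String, String) => Unit): Unit =
  params("name") match {
    case "method1" =>
      method1(params("ip").toString, params("r").toString.toDouble, params("op").toString)
    case "method2" =>
      method2(params("ip").toString, params("op").toString)
    case other =>
      sys.error(s"unknown method: $other")
  }
```

With the DataFrame from above you would build the Map from df.first() and then call dispatch once, instead of embedding the calls in a column expression.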

How to open TCP connection with TLS in scala using akka

I want to write a Scala client that talks a proprietary protocol over a TCP connection with TLS.
Basically, I want to rewrite the following code from Node.js in Scala:
var conn_options = {
    host: endpoint,
    port: port
};
tlsSocket = tls.connect(conn_options, function() {
    if (tlsSocket.authorized) {
        logger.info('Successfully established a connection');
        // Now that the connection has been established, let's perform the handshake
        // Identification frame:
        // 1 | I | id_size | id
        var idFrameTypeAndVersion = "1I";
        var clientIdString = "foorbar";
        var idDataBuffer = new Buffer(idFrameTypeAndVersion.length + 1 + clientIdString.length);
        idDataBuffer.write(idFrameTypeAndVersion, 0, idFrameTypeAndVersion.length);
        idDataBuffer.writeUIntBE(clientIdString.length, idFrameTypeAndVersion.length, 1);
        idDataBuffer.write(clientIdString, idFrameTypeAndVersion.length + 1, clientIdString.length);
        // Send the identification frame to Logmet
        tlsSocket.write(idDataBuffer);
    }
    ...
});
From the Akka documentation I found a good example with Akka over plain TCP, but I have no clue how to enhance the example with a TLS socket connection. Some older versions of the documentation show an example with SSL/TLS, but that is missing in the newer versions.
I have found documentation about a TLS object in Akka, but I did not find any good example around it.
Many thanks in advance!
Got it working with the following code and want to share.
Basically, I started looking at the TcpTlsEcho.java that I got from the akka community.
I followed the documentation of akka-streams. Another very good example that shows and illustrates the usage of akka-streams can be found in the following blog post.
The connection setup and flow looks like:
/**
+---------------------------+ +---------------------------+
| Flow | | tlsConnectionFlow |
| | | |
| +------+ +------+ | | +------+ +------+ |
| | SRC | ~Out~> | | ~~> O2 -- I1 ~~> | | ~O1~> | | |
| | | | LOGG | | | | TLS | | CONN | |
| | SINK | <~In~ | | <~~ I2 -- O2 <~~ | | <~I2~ | | |
| +------+ +------+ | | +------+ +------+ |
+---------------------------+ +---------------------------+
**/
// the tcp connection to the server
val connection = Tcp().outgoingConnection(address, port)
// ignore the received data for now; there are other options for implementing the Sink
val sink = Sink.ignore
// create a source as an actor reference
val source = Source.actorRef(1000, OverflowStrategy.fail)
// join the TLS BidiFlow (see below) with the connection
val tlsConnectionFlow = tlsStage(TLSRole.client).join(connection)
// run the source with the TLS connection flow, joined with a logging step that prints the bytes sent and/or received over the connection
val sourceActor = tlsConnectionFlow.join(logging).to(sink).runWith(source)
// send a message to the sourceActor that will be send to the Source of the stream
sourceActor ! ByteString("<message>")
The TLS connection flow is a BidiFlow. My first simple example ignores all certificates and avoids managing trust and key stores. Examples of how that is done can be found in the .java example above.
def tlsStage(role: TLSRole)(implicit system: ActorSystem) = {
val sslConfig = AkkaSSLConfig.get(system)
val config = sslConfig.config
// create a ssl-context that ignores self-signed certificates
implicit val sslContext: SSLContext = {
object WideOpenX509TrustManager extends X509TrustManager {
override def checkClientTrusted(chain: Array[X509Certificate], authType: String) = ()
override def checkServerTrusted(chain: Array[X509Certificate], authType: String) = ()
override def getAcceptedIssuers = Array[X509Certificate]()
}
val context = SSLContext.getInstance("TLS")
context.init(Array[KeyManager](), Array(WideOpenX509TrustManager), null)
context
}
// protocols
val defaultParams = sslContext.getDefaultSSLParameters()
val defaultProtocols = defaultParams.getProtocols()
val protocols = sslConfig.configureProtocols(defaultProtocols, config)
defaultParams.setProtocols(protocols)
// ciphers
val defaultCiphers = defaultParams.getCipherSuites()
val cipherSuites = sslConfig.configureCipherSuites(defaultCiphers, config)
defaultParams.setCipherSuites(cipherSuites)
val firstSession = new TLSProtocol.NegotiateNewSession(None, None, None, None)
.withCipherSuites(cipherSuites: _*)
.withProtocols(protocols: _*)
.withParameters(defaultParams)
// withClientAuth returns a new session, so keep the result instead of discarding it
val clientAuth = getClientAuth(config.sslParametersConfig.clientAuth)
val negotiatedSession = clientAuth.map(firstSession.withClientAuth).getOrElse(firstSession)
val tls = TLS.apply(sslContext, negotiatedSession, role)
val pf: PartialFunction[TLSProtocol.SslTlsInbound, ByteString] = {
case TLSProtocol.SessionBytes(_, sb) => ByteString.fromByteBuffer(sb.asByteBuffer)
}
val tlsSupport = BidiFlow.fromFlows(
Flow[ByteString].map(TLSProtocol.SendBytes),
Flow[TLSProtocol.SslTlsInbound].collect(pf))
tlsSupport.atop(tls)
}
def getClientAuth(auth: ClientAuth) = {
if (auth.equals(ClientAuth.want)) {
Some(TLSClientAuth.want)
} else if (auth.equals(ClientAuth.need)) {
Some(TLSClientAuth.need)
} else if (auth.equals(ClientAuth.none)) {
Some(TLSClientAuth.none)
} else {
None
}
}
And for completeness, here is the logging stage, which has been implemented as a BidiFlow as well.
def logging: BidiFlow[ByteString, ByteString, ByteString, ByteString, NotUsed] = {
// function that takes a string, prints it with some fixed prefix in front and returns the string again
def logger(prefix: String) = (chunk: ByteString) => {
println(prefix + chunk.utf8String)
chunk
}
val inputLogger = logger("> ")
val outputLogger = logger("< ")
// create BidiFlow with a separate logger function for each of both streams
BidiFlow.fromFunctions(outputLogger, inputLogger)
}
I will further try to improve and update the answer. Hope that helps.
I really liked Jeremias Werner's answer as it got me where I needed to be. However, I would like to offer the code below (heavily influenced by his answer) as a "one cut and paste" solution that hits an actual TLS server
using as little code as I had time to produce.
import javax.net.ssl.SSLContext
import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.TLSProtocol.NegotiateNewSession
import akka.stream.scaladsl.{BidiFlow, Flow, Sink, Source, TLS, Tcp}
import akka.stream.{ActorMaterializer, OverflowStrategy, TLSProtocol, TLSRole}
import akka.util.ByteString
object TlsClient {
// Flow needed for TLS as well as mapping the TLS engine's flow to ByteStrings
def tlsClientLayer = {
// Default SSL context supporting most protocols and ciphers. Embellish this as you need
// by constructing your own SSLContext and NegotiateNewSession instances.
val tls = TLS(SSLContext.getDefault, NegotiateNewSession.withDefaults, TLSRole.client)
// Maps the TLS stream to a ByteString
val tlsSupport = BidiFlow.fromFlows(
Flow[ByteString].map(TLSProtocol.SendBytes),
Flow[TLSProtocol.SslTlsInbound].collect {
case TLSProtocol.SessionBytes(_, sb) => ByteString.fromByteBuffer(sb.asByteBuffer)
})
tlsSupport.atop(tls)
}
// Very simple logger
def logging: BidiFlow[ByteString, ByteString, ByteString, ByteString, NotUsed] = {
// function that takes a string, prints it with some fixed prefix in front and returns the string again
def logger(prefix: String) = (chunk: ByteString) => {
println(prefix + chunk.utf8String)
chunk
}
val inputLogger = logger("> ")
val outputLogger = logger("< ")
// create BidiFlow with a separate logger function for each of both streams
BidiFlow.fromFunctions(outputLogger, inputLogger)
}
def main(args: Array[String]): Unit = {
implicit val system: ActorSystem = ActorSystem("sip-client")
implicit val materializer: ActorMaterializer = ActorMaterializer()
val source = Source.actorRef(1000, OverflowStrategy.fail)
val connection = Tcp().outgoingConnection("www.google.com", 443)
val tlsFlow = tlsClientLayer.join(connection)
val srcActor = tlsFlow.join(logging).to(Sink.ignore).runWith(source)
// I show HTTP here but send/receive your protocol over this actor
// Should respond with a 302 (Found) and a small explanatory HTML message
srcActor ! ByteString("GET / HTTP/1.1\r\nHost: www.google.com\r\n\r\n")
}
}

Sparksql cut String after special position

Hello, what I want to do is cut a URL so that it is all in a specific format.
At the moment my URL looks like this:
[https://url.com/xxxxxxx/xxxxx/xxxxxx]
I just want to cut off everything after the third / and then count my data, so that I have an overview of how many URLs I have in my data.
I hope someone can help me.
User-defined functions (UDFs) are what you need. Assume you have the following input:
case class Data(url: String)
val urls = sqlContext.createDataFrame(Seq(Data("http://google.com/q=dfsdf"), Data("https://fb.com/gsdgsd")))
urls.registerTempTable("urls")
Now you can define UDF that gets only hostname from URL:
def getHost(url: String) = url.split('/')(2) //naive implementation, for example only
sqlContext.udf.register("getHost", getHost _)
And get your data transformed using SQL:
val hosts = sqlContext.sql("select getHost(url) as host from urls")
hosts.show()
Result:
+----------+
| host|
+----------+
|google.com|
| fb.com|
+----------+
If you prefer the Scala DSL, you can use your UDF too:
import org.apache.spark.sql.functions.udf
val getHostUdf = udf(getHost _)
val hosts = urls.select(getHostUdf($"url") as "host")
Result will be exactly the same.
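If the inputs are full URLs, a sturdier alternative to the naive index-2 split (my suggestion, not part of the answer above) is java.net.URI, which also copes with ports and userinfo and fails gracefully on malformed input:

```scala
import java.net.URI
import scala.util.Try

// Sketch: extract the hostname with java.net.URI instead of splitting on '/'.
// Returns None for strings that URI cannot parse into something with a host.
def getHostSafe(url: String): Option[String] =
  Try(Option(new URI(url).getHost)).toOption.flatten
```

You can wrap it in a Spark UDF the same way as getHost; Spark treats an Option[String] result as a nullable string column.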

Using Spark to extract part of the string in CSV format

Spark newbie here, and hopefully you guys can give me some help. Thanks!
I am trying to extract a URL from a CSV file; the URL is located in the 16th column. The problem is that the URLs were written in a strange format, as you can see from the printout of the code below. What is the best approach to get the URL into the correct format?
case class log(time_stamp: String, url: String )
val logText = sc.textFile("hdfs://...").map(s => s.split(",")).map( s => log(s(0).replaceAll("\"", ""),s(15).replaceAll("\"", ""))).toDF()
logText.registerTempTable("log")
val results = sqlContext.sql("SELECT * FROM log")
results.map(s => "URL: " + s(1)).collect().foreach(println)
URL: /XXX/YYY/ZZZ/http/www.domain.com/xyz/xyz
URL: /XX/YYY/ZZZ/https/sub.domain.com/xyz/xyz/xyz/xyz
URL: /XXX/YYY/ZZZ/http/www.domain.com/
URL: /VV/XXXX/YYY/ZZZ/https/sub.domain.com/xyz/xyz/xyz
You can try regexp_replace:
import org.apache.spark.sql.functions.regexp_replace
val df = sc.parallelize(Seq(
(1L, "/XXX/YYY/ZZZ/http/www.domain.com/xyz/xyz"),
(2L, "/XXX/YYY/ZZZ/https/sub.domain.com/xyz/xyz/xyz/xyz")
)).toDF("id", "url")
df
.select(regexp_replace($"url", "^(/\\w+){3}/(https?)/", "$2://").alias("url"))
.show(2, false)
// +--------------------------------------+
// |url |
// +--------------------------------------+
// |http://www.domain.com/xyz/xyz |
// |https://sub.domain.com/xyz/xyz/xyz/xyz|
// +--------------------------------------+
In Spark 1.4 you can try Hive UDF:
df.selectExpr("""regexp_replace(url, '^(/\w+){3}/(https?)/','$2://') AS url""")
If the number of sections before http(s) can vary, you can adjust the regexp by replacing {3} with * or a range.
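The same substitution can be checked in plain Scala before wiring it into Spark. The sketch below uses + in place of {3}, so any number of leading word-character sections is accepted; note that greedy backtracking means the rightmost http/https segment is the one that wins:

```scala
// Sketch: rewrite "/XXX/YYY/ZZZ/http/host/..." into "http://host/..." with a plain regex.
// (/\w+)+ greedily eats the leading sections; backtracking stops at the http|https segment.
def fixUrl(raw: String): String =
  raw.replaceAll("^(/\\w+)+/(https?)/", "$2://")
```

Once the regex behaves as expected here, the identical pattern can be dropped into regexp_replace.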
The question comes down to parsing the long strings and extracting the domain name. This solution will work as long as none of the random strings (XXX, YYYYY, etc.) are "http" or "https":
def getUrl(data: String): Option[String] = {
  val slidingPairs = data.split("/").sliding(2)
  slidingPairs.flatMap {
    case Array(x, y) if x == "http" || x == "https" => Some(y)
    case _ => None
  }.toList.headOption
}
Here are some examples in the REPL:
scala> getUrl("/XXX/YYY/ZZZ/http/www.domain.com/xyz/xyz")
res8: Option[String] = Some(www.domain.com)
scala> getUrl("/XXX/YYY/ZZZ/https/sub.domain.com/xyz/xyz/xyz/xyz")
resX: Option[String] = Some(sub.domain.com)
scala> getUrl("/XXX/YYY/ZZZ/a/asdsd/asdase/123123213/https/sub.domain.com/xyz/xyz/xyz/xyz")
resX: Option[String] = Some(sub.domain.com)