I have this function which uses InetAddress, but the output is occasionally wrong. (example: "::ffff:49e7:a9b2" will give an incorrect result.)
def IPv6ToBigInteger(ip: String): BigInteger = {
val i = InetAddress.getByName(ip)
val a: Array[Byte] = i.getAddress
new BigInteger(1, a)
}
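A likely cause of the wrong result: according to the JDK documentation, `InetAddress` never returns an IPv4-mapped IPv6 address, so a literal such as "::ffff:49e7:a9b2" comes back as an `Inet4Address` whose `getAddress` holds only 4 bytes. The following is a minimal sketch, not a definitive fix. The name `ipv6LiteralToBigInteger` is made up for illustration, and it assumes the input is always a literal address and that any 4-byte result should be re-expanded to its IPv4-mapped form.

```scala
import java.math.BigInteger
import java.net.InetAddress

// Hypothetical helper; assumes `ip` is a literal address, not a host name.
def ipv6LiteralToBigInteger(ip: String): BigInteger = {
  val raw: Array[Byte] = InetAddress.getByName(ip).getAddress
  val bytes =
    if (raw.length == 4) {
      // Assumption: a 4-byte result came from an IPv4-mapped literal, so
      // restore the ::ffff:0:0/96 prefix (10 zero bytes, then 0xff 0xff).
      Array.fill[Byte](10)(0) ++ Array.fill[Byte](2)(0xff.toByte) ++ raw
    } else raw
  new BigInteger(1, bytes)
}
```

Note that under this assumption a plain dotted IPv4 string is also widened to the mapped form, which may or may not be what you want.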
And then I also have this function:
def IPv6ToBigInteger(ip: String): BigInteger = {
val fragments = ip.split(":|\\.|::").filter(_.nonEmpty)
require(fragments.length <= 8, "Bad IPv6")
var ipNum = new BigInteger("0")
for (i <-fragments.indices) {
val frag2Long = new BigInteger(s"${fragments(i)}", 16)
ipNum = frag2Long.or(ipNum.shiftLeft(16))
}
ipNum
}
which appears to have a parsing error, because it gives the wrong output unless the address is in the full 0:0:0:0:0:0:0:0 format. It is based on my IPv4ToLong function:
def IPv4ToLong(ip: String): Long = {
val fragments = ip.split('.')
var ipNum = 0L
for (i <- fragments.indices) {
val frag2Long = fragments(i).toLong
ipNum = frag2Long | ipNum << 8L
}
ipNum
}
This
ipNum = frag2Long | ipNum << 8L
is
ipNum = frag2Long | (ipNum << 8L)
not
ipNum = (frag2Long | ipNum) << 8L
because Scala ranks an infix operator by its first character, and < binds more tightly than |.
[ And please use foldLeft rather than a var and a loop. ]
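The foldLeft suggestion could look like the sketch below, with the shift parenthesized explicitly. The names `ipv4ToLongFold` and `fullIpv6ToBigInt` are made up for illustration. The IPv6 variant assumes the address is written out as eight hex groups (no "::" compression, no dotted IPv4 tail), and neither function validates its input.

```scala
// Hypothetical names; no input validation is attempted here.
def ipv4ToLongFold(ip: String): Long =
  ip.split('.').foldLeft(0L)((acc, frag) => (acc << 8) | frag.toLong)

// Assumes the fully written-out form, e.g. "0:0:0:0:0:ffff:49e7:a9b2".
def fullIpv6ToBigInt(ip: String): BigInt =
  ip.split(':').foldLeft(BigInt(0))((acc, frag) => (acc << 16) | BigInt(frag, 16))
```

Each step shifts the accumulated value left by one fragment width and ORs in the next fragment, so no mutable variable is needed.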
Interesting challenge: transform IP address strings into BigInt values, allowing for all legal IPv6 address forms.
Here's my try.
import scala.util.Try
def iPv62BigInt(ip: String): Try[BigInt] = Try{
val fill = ":0:" * (8 - ip.split("[:.]").count(_.nonEmpty))
val fullArr =
raw"((?<=\.)(\d+)|(\d+)(?=\.))".r
.replaceAllIn(ip, _.group(1).toInt.toHexString)
.replace("::", fill)
.split("[:.]")
.collect{case s if s.nonEmpty => s"000$s".takeRight(4)}
if (fullArr.length == 8) BigInt(fullArr.mkString, 16)
else throw new NumberFormatException("wrong number of elements")
}
This is, admittedly, a bit lenient in that it won't catch all non-IPv6 forms, but that's not a trivial task using tools like regex.
Related
I'm getting logs from a firewall in CEF Format as a string which looks as:
ABC|XYZ|F123|1.0|DSE|DSE|4|externalId=e705265d0d9e4d4fcb218b cn2=329160 cn1=3053998 dhost=SRV2019 duser=admin msg=Process accessed NTDS fname=ntdsutil.exe filePath=\\Device\\HarddiskVolume2\\Windows\\System32 cs5="C:\\Windows\\system32\\ntdsutil.exe" "ac i ntds" ifm "create full ntdstest3" q q fileHash=80c8b68240a95 dntdom=adminDomain cn3=13311 rt=1610948650000 tactic=Credential Access technique=Credential Dumping objective=Gain Access patternDisposition=Detection. outcome=0
How can I create a DataFrame from this kind of string where I'm getting key-value pairs separated by = ?
My objective is to infer the schema from this string dynamically, i.e., extract the keys from the left side of each = and build a schema from them.
What I am doing currently is pretty lame (IMHO) and not very dynamic, because the number of key-value pairs can change with the type of log.
val a: String = "ABC|XYZ|F123|1.0|DSE|DCE|4|externalId=e705265d0d9e4d4fcb218b cn2=329160 cn1=3053998 dhost=SRV2019 duser=admin msg=Process accessed NTDS fname=ntdsutil.exe filePath=\\Device\\HarddiskVolume2\\Windows\\System32 cs5="C:\\Windows\\system32\\ntdsutil.exe" "ac i ntds" ifm "create full ntdstest3" q q fileHash=80c8b68240a95 dntdom=adminDomain cn3=13311 rt=1610948650000 tactic=Credential Access technique=Credential Dumping objective=Gain Access patternDisposition=Detection. outcome=0"
val ttype: String = "DCE"
type parseReturn = (String,String,List[String],Int)
def cefParser(a: String, ttype: String): parseReturn = {
val firstPart = a.split("\\|")
var pD = new ListBuffer[String]()
var listSize: Int = 0
if (firstPart.size == 8 && firstPart(4) == ttype) {
pD += firstPart(0)
pD += firstPart(1)
pD += firstPart(2)
pD += firstPart(3)
pD += firstPart(4)
pD += firstPart(5)
pD += firstPart(6)
val secondPart = parseSecondPart(firstPart(7), ttype)
pD ++= secondPart
listSize = pD.toList.length
(firstPart(2), ttype, pD.toList, listSize)
} else {
val temp: List[String] = List(a)
(firstPart(2), "IRRELEVANT", temp, temp.length)
}
}
The method parseSecondPart is:
def parseSecondPart(m:String, ttype:String): ListBuffer[String] = ttype match {
case auditActivity.ttype=>parseAuditEvent(m)
Another function just replaces some text in the logs:
def parseAuditEvent(msg: String): ListBuffer[String] = {
val updated_msg = msg.replace("cat=", "metadata_event_type=")
.replace("destinationtranslatedaddress=", "event_user_ip=")
.replace("duser=", "event_user_id=")
.replace("deviceprocessname=", "event_service_name=")
.replace("cn3=", "metadata_offset=")
.replace("outcome=", "event_success=")
.replace("devicecustomdate1=", "event_utc_timestamp=")
.replace("rt=", "metadata_event_creation_time=")
parseEvent(updated_msg)
}
Final function to get only the values:
def parseEvent(msg: String): ListBuffer[String] = {
val newMsg = msg.replace("\\=", "$_equal_$")
val pD = new ListBuffer[String]()
val splitData = newMsg.split("=")
val mSize = splitData.size
for (i <- 1 until mSize) {
if(i < mSize-1) {
val a = splitData(i).split(" ")
val b = a.size-1
val c = a.slice(0,b).mkString(" ")
pD += c.replace("$_equal_$","=")
} else if(i == mSize-1) {
val a = splitData(i).replace("$_equal_$","=")
pD += a
} else {
logExceptions(newMsg)
}
}
pD
}
The returned tuple holds the parsed values (collected in a ListBuffer[String]) in its third position, and I use them to create a DataFrame as follows:
val df = ss.sqlContext
.createDataFrame(tempRDD.filter(x => x._1 != "IRRELEVANT")
.map(x => Row.fromSeq(x._3)), schema)
People of Stack Overflow, I really need your help improving my code, both its performance and its approach.
Any kind of help and/or suggestions will be highly appreciated.
Thanks in advance.
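One way to extract the keys dynamically, as asked above, is to let a regular expression find each key=value pair instead of replacing and splitting by hand. This is only a sketch in plain Scala. `parseExtension` is a hypothetical name, and it assumes that a key is a run of word characters directly followed by = and that a value runs up to the next whitespace-separated key.

```scala
// Hypothetical helper: turns the CEF extension part (the text after the
// seventh '|') into key -> value pairs.
def parseExtension(ext: String): Map[String, String] = {
  // The value is matched lazily and stops right before the next " key="
  // or at the end of the input.
  val pair = """(\w+)=(.*?)(?=\s+\w+=|$)""".r
  pair.findAllMatchIn(ext).map(m => m.group(1) -> m.group(2)).toMap
}

// Splitting with a limit of 8 keeps any '|' inside the extension intact.
val sample = "ABC|XYZ|F123|1.0|DSE|DSE|4|externalId=e705 cn2=329160 msg=Process accessed NTDS outcome=0"
val fields = sample.split("\\|", 8)
val extension = parseExtension(fields(7))
```

The key set of the resulting map could then drive the schema, and the values could fill a row in the same key order.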
I was looking for a basic utility with two functions to convert IPv4 addresses to and from Long in Scala, such as "10.10.10.10" to its Long representation of 168430090 and back. A basic utility like this exists in many languages (such as Python), but on the JVM everyone appears to end up rewriting the same code.
What is the recommended approach on unifying IPv4ToLong and LongToIPv4 functions?
Combining the ideas from leifbatterman and Elesin Olalekan Fuad and avoiding multiplication and power operations:
import scala.util.Try
def ipv4ToLong(ip: String): Option[Long] = Try(
ip.split('.').ensuring(_.length == 4)
.map(_.toLong).ensuring(_.forall(x => x >= 0 && x < 256))
.reverse.zip(List(0,8,16,24)).map(xi => xi._1 << xi._2).sum
).toOption
To convert Long to String in dotted format:
def longToipv4(ip: Long): Option[String] = if ( ip >= 0 && ip <= 4294967295L) {
Some(List(0x000000ff, 0x0000ff00, 0x00ff0000, 0xff000000).zip(List(0,8,16,24))
.map(mi => ((mi._1 & ip) >> mi._2)).reverse
.map(_.toString).mkString("."))
} else None
import java.net.InetAddress
def IPv4ToLong(dottedIP: String): Long = {
val addrArray: Array[String] = dottedIP.split("\\.")
var num: Long = 0
var i: Int = 0
while (i < addrArray.length) {
val power: Int = 3 - i
num = num + ((addrArray(i).toInt % 256) * Math.pow(256, power)).toLong
i += 1
}
num
}
def LongToIPv4 (ip : Long) : String = {
val bytes: Array[Byte] = new Array[Byte](4)
bytes(0) = ((ip & 0xff000000) >> 24).toByte
bytes(1) = ((ip & 0x00ff0000) >> 16).toByte
bytes(2) = ((ip & 0x0000ff00) >> 8).toByte
bytes(3) = (ip & 0x000000ff).toByte
InetAddress.getByAddress(bytes).getHostAddress()
}
scala> IPv4ToLong("10.10.10.10")
res0: Long = 168430090
scala> LongToIPv4(168430090L)
res1: String = 10.10.10.10
Try the ipaddr Scala library. Create an IpAddress and get its long value like this:
val ip1: IpAddress = IpAddress("192.168.0.1")
val ip1Long = ip1.numerical // returns 3232235521L
This is pretty straightforward for ipv4:
def ipToLong(ip:String) = ip.split("\\.").foldLeft(0L)((c,n)=>c*256+n.toLong)
def longToIP(ip:Long) = (for(a<-3 to 0 by -1) yield ((ip>>(a*8))&0xff).toString).mkString(".")
I have a GitHub gist that solves this. The gist contains code that converts from IP to Long and back. Visit https://gist.github.com/OElesin/f0f2c69530a315177b9e0227a140f9c1
Here is the code:
def ipToLong(ipAddress: String): Long = {
ipAddress.split("\\.").reverse.zipWithIndex.map(a=>a._1.toInt*math.pow(256,a._2).toLong).sum
}
def longToIP(long: Long): String = {
(0 until 4).map(a=>long / math.pow(256, a).floor.toInt % 256).reverse.mkString(".")
}
Enjoy
Adding to Elesin Olalekan Fuad's answer it can be made a little more robust like this:
import scala.util.Try
def ipToLong(ip: String): Option[Long] = {
Try(ip.split('.').ensuring(_.length == 4)
.map(_.toLong).ensuring(_.forall(x => x >= 0 && x < 256))
.zip(Array(256L * 256L * 256L, 256L * 256L, 256L, 1L))
.map { case (x, y) => x * y }
.sum).toOption
}
def longToIp(ip: Long): Option[String] = {
if (ip >= 0 && ip <= 4294967295L)
Some((0 until 4)
.map(a => ip / math.pow(256, a).floor.toInt % 256)
.reverse.mkString("."))
else
None
}
I like @jwvh's comment on ipv4ToLong. As for longToIpv4, how about simply:
def longToIpv4(v:Long):String = (for (i <- 0 to 3) yield (v >> (i * 8)) & 0x000000FF ).reverse.mkString(".")
I need to write code that does the following:
Connect to a tcp socket
Read a line ending in "\r\n" that contains a number N
Read N bytes
Use those N bytes
I am currently using the following code:
val socket = new Socket(InetAddress.getByName(host), port)
val in = socket.getInputStream;
val out = new PrintStream(socket.getOutputStream)
val reader = new DataInputStream(in)
val baos = new ByteArrayOutputStream
val buffer = new Array[Byte](1024)
out.print(cmd + "\r\n")
out.flush
val firstLine = reader.readLine.split("\\s")
if(firstLine(0) == "OK") {
def read(written: Int, max: Int, baos: ByteArrayOutputStream): Array[Byte] = {
if(written >= max) baos.toByteArray
else {
val count = reader.read(buffer, 0, buffer.length)
baos.write(buffer, 0, count)
read(written + count, max, baos)
}
}
read(0, firstLine(1).toInt, baos)
} else {
// RAISE something
}
baos.toByteArray()
The problem with this code is that the use of DataInputStream#readLine raises a deprecation warning, but I can't find a class that implements both read(...) and readLine(...). BufferedReader for example, implements read but it reads Chars and not Bytes. I could cast those chars to bytes but I don't think it's safe.
Are there any other ways to write something like this in Scala?
Thank you
Be aware that on the JVM a char has 2 bytes, so "\r\n" is 4 bytes. This is generally not true for Strings stored outside of the JVM.
I think the safest way would be to read your input as raw bytes until you reach the binary representation of "\r\n". You can then create a Reader (which turns bytes into JVM-compatible chars) over those first bytes, where you can be sure there is only text, parse it, and continue safely with the rest of the binary data.
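The raw-bytes approach described above can be sketched as follows. `readCrLfLine` and `readExactly` are hypothetical helper names, and the sketch assumes the header line is plain ASCII and that both steps read from the very same InputStream object, so nothing is lost in an intermediate buffer.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, EOFException, InputStream}
import java.nio.charset.StandardCharsets

// Reads bytes up to and including "\r\n"; returns the line without the terminator.
def readCrLfLine(in: InputStream): String = {
  val CR = 13
  val LF = 10
  val line = new ByteArrayOutputStream
  var prev = -1
  var cur = in.read()
  while (cur != -1 && !(prev == CR && cur == LF)) {
    if (prev != -1) line.write(prev) // emit the byte one step behind the cursor
    prev = cur
    cur = in.read()
  }
  if (cur == -1) throw new EOFException("stream ended before \\r\\n")
  new String(line.toByteArray, StandardCharsets.US_ASCII)
}

// Reads exactly n bytes, looping because read may return fewer than requested.
def readExactly(in: InputStream, n: Int): Array[Byte] = {
  val out = new Array[Byte](n)
  var off = 0
  while (off < n) {
    val count = in.read(out, off, n - off)
    if (count == -1) throw new EOFException(s"expected $n bytes, got $off")
    off += count
  }
  out
}

// Usage with an in-memory stream standing in for socket.getInputStream.
val in = new ByteArrayInputStream("OK 2\r\nab".getBytes(StandardCharsets.US_ASCII))
val header = readCrLfLine(in).split("\\s")
val payload = readExactly(in, header(1).toInt)
```

Because no Reader is involved, the payload bytes are never decoded as characters and arrive unchanged.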
You can achieve the goal of using read(...) and readLine(...) in one class. The idea is to use BufferedReader.read(): Int. The BufferedReader class buffers the content, so you can read one character at a time without a performance penalty. Converting each character to a byte is only safe when every character is encoded as a single byte.
The change could be (without Scala-style optimization):
import java.io.BufferedInputStream
import java.io.BufferedReader
import java.io.ByteArrayOutputStream
import java.io.PrintStream
import java.net.InetAddress
import java.net.Socket
import java.io.InputStreamReader
object ReadLines extends App {
val host = "127.0.0.1"
val port = 9090
val socket = new Socket(InetAddress.getByName(host), port)
val in = socket.getInputStream;
val out = new PrintStream(socket.getOutputStream)
// val reader = new DataInputStream(in)
val bufIns = new BufferedInputStream(in)
val reader = new BufferedReader(new InputStreamReader(bufIns, "utf8"));
val baos = new ByteArrayOutputStream
val buffer = new Array[Byte](1024)
val cmd = "get:"
out.print(cmd + "\r\n")
out.flush
val firstLine = reader.readLine.split("\\s")
if (firstLine(0) == "OK") {
def read(written: Int, max: Int, baos: ByteArrayOutputStream): Array[Byte] = {
if (written >= max) {
println("get: " + new String(baos.toByteArray))
baos.toByteArray()
} else {
// val count = reader.read(buffer, 0, buffer.length)
var count = 0
var b = reader.read()
while(b != -1){
buffer(count) = b.toByte
count += 1
if (count < max){
b = reader.read()
}else{
b = -1
}
}
baos.write(buffer, 0, count)
read(written + count, max, baos)
}
}
read(0, firstLine(1).toInt, baos)
} else {
// RAISE something
}
baos.toByteArray()
}
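For testing, here is the server code:
import java.io.PrintStream
import java.net.ServerSocket
object ReadLinesServer extends App {
val serverSocket = new ServerSocket(9090)
while(true){
// accept() blocks until a client connects, so log only afterwards
val socket = serverSocket.accept()
println("accepted a connection.")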
val ops = socket.getOutputStream()
val printStream = new PrintStream(ops, true, "utf8")
printStream.print("OK 2\r\n") // 1 byte for alpha-number char
printStream.print("ab")
}
}
Seems this is the best solution I can find:
val reader = new BufferedReader(new InputStreamReader(in))
val buffer = new Array[Char](1024)
out.print(cmd + "\r\n")
out.flush
val firstLine = reader.readLine.split("\\s")
if(firstLine(0) == "OK") {
def read(readCount: Int, acc: List[Byte]): Array[Byte] = {
if(readCount <= 0) acc.toArray
else {
val count = reader.read(buffer, 0, buffer.length)
val asBytes = buffer.slice(0, count).map(_.toByte)
read(readCount - count, acc ++ asBytes)
}
}
read(firstLine(1).toInt, List[Byte]())
} else {
// RAISE
}
That is, use buffer.map(_.toByte).toArray to transform a char Array into a Byte Array without caring about the encoding.
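Whether the char-to-byte mapping above is lossless depends on the charset the InputStreamReader decodes with. A small sketch, assuming you are free to pick the charset: with ISO-8859-1 every byte value maps to the char with the same code, so `_.toByte` restores the original bytes. With a multi-byte charset such as UTF-8, bytes of 0x80 and above would generally not survive.

```scala
import java.io.{BufferedReader, ByteArrayInputStream, InputStreamReader}
import java.nio.charset.StandardCharsets

// Sample bytes covering both ends of the range, including values >= 0x80.
val original: Array[Byte] = Array(0x00, 0x7f, 0x80, 0xff).map(_.toByte)

// Decoding as ISO-8859-1 yields exactly one char per byte.
val latin1Reader = new BufferedReader(
  new InputStreamReader(new ByteArrayInputStream(original), StandardCharsets.ISO_8859_1))

val chars = new Array[Char](original.length)
val charCount = latin1Reader.read(chars, 0, chars.length)
val roundTripped: Array[Byte] = chars.take(charCount).map(_.toByte)
```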
I'm looking to round-trip bytes through Java's Deflater and running into issues. First the output, then the code. What am I doing wrong here, and how can I properly round-trip through these streams?
Output:
scala> new String(decompress(compress("face".getBytes)))
(crazy output string of length 20)
Code:
def compress(bytes: Array[Byte]): Array[Byte] = {
val deflater = new java.util.zip.Deflater
val baos = new ByteArrayOutputStream
val dos = new DeflaterOutputStream(baos, deflater)
dos.write(bytes)
baos.close
dos.finish
dos.close
baos.toByteArray
}
def decompress(bytes: Array[Byte]): Array[Byte] = {
val deflater = new java.util.zip.Deflater
val baos = new ByteArrayOutputStream(512)
val bytesIn = new ByteArrayInputStream(bytes)
val in = new DeflaterInputStream(bytesIn, deflater)
var go = true
while (go) {
val b = in.read
if (b == -1)
go = false
else
baos.write(b)
}
baos.close
in.close
baos.toByteArray
}
You're (re-)Deflater-ing the result of the original deflation when you should be Inflater-ing it...
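A sketch of what that change could look like, using `InflaterInputStream` from `java.util.zip` on the read side. It keeps the question's function names and reads in chunks rather than byte by byte. Treat it as one possible arrangement, not the only correct one.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.util.zip.{DeflaterOutputStream, InflaterInputStream}

def compress(bytes: Array[Byte]): Array[Byte] = {
  val baos = new ByteArrayOutputStream
  val dos = new DeflaterOutputStream(baos)
  // Closing the deflater stream finishes the compressed data before we copy it.
  try dos.write(bytes) finally dos.close()
  baos.toByteArray
}

def decompress(bytes: Array[Byte]): Array[Byte] = {
  // Inflate (not deflate) on the way back.
  val in = new InflaterInputStream(new ByteArrayInputStream(bytes))
  val baos = new ByteArrayOutputStream(512)
  val buffer = new Array[Byte](1024)
  try {
    var count = in.read(buffer)
    while (count != -1) {
      baos.write(buffer, 0, count)
      count = in.read(buffer)
    }
  } finally in.close()
  baos.toByteArray
}
```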
I'm seeing some strange behavior with Scala's collection.mutable.PriorityQueue. I'm performing an external sort and testing it with 1M records. Each time I run the test and verify the results, between 10 and 20 records are not sorted properly. When I replace the Scala PriorityQueue implementation with a java.util.PriorityQueue, it works 100% of the time. Any ideas?
Here's the code (sorry it's a bit long...). I test it using the tools gensort -a 1000000 and valsort from http://sortbenchmark.org/
def externalSort(inFileName: String, outFileName: String)
(implicit ord: Ordering[String]): Int = {
val MaxTempFiles = 1024
val TempBufferSize = 4096
val inFile = new java.io.File(inFileName)
/** Partitions input file and sorts each partition */
def partitionAndSort()(implicit ord: Ordering[String]):
List[java.io.File] = {
/** Gets block size to use */
def getBlockSize: Long = {
var blockSize = inFile.length / MaxTempFiles
val freeMem = Runtime.getRuntime().freeMemory()
if (blockSize < freeMem / 2)
blockSize = freeMem / 2
else if (blockSize >= freeMem)
System.err.println("Not enough free memory to use external sort.")
blockSize
}
/** Sorts and writes data to temp files */
def writeSorted(buf: List[String]): java.io.File = {
// Create new temp buffer
val tmp = java.io.File.createTempFile("external", "sort")
tmp.deleteOnExit()
// Sort buffer and write it out to tmp file
val out = new java.io.PrintWriter(tmp)
try {
for (l <- buf.sorted) {
out.println(l)
}
} finally {
out.close()
}
tmp
}
val blockSize = getBlockSize
var tmpFiles = List[java.io.File]()
var buf = List[String]()
var currentSize = 0
// Read input and divide into blocks
for (line <- io.Source.fromFile(inFile).getLines()) {
if (currentSize > blockSize) {
tmpFiles ::= writeSorted(buf)
buf = List[String]()
currentSize = 0
}
buf ::= line
currentSize += line.length() * 2 // 2 bytes per char
}
if (currentSize > 0) tmpFiles ::= writeSorted(buf)
tmpFiles
}
/** Merges results of sorted partitions into one output file */
def mergeSortedFiles(fs: List[java.io.File])
(implicit ord: Ordering[String]): Int = {
/** Temp file buffer for reading lines */
class TempFileBuffer(val file: java.io.File) {
private val in = new java.io.BufferedReader(
new java.io.FileReader(file), TempBufferSize)
private var curLine: String = ""
readNextLine() // prep first value
def currentLine = curLine
def isEmpty = curLine == null
def readNextLine() {
if (curLine == null) return
try {
curLine = in.readLine()
} catch {
case _: java.io.EOFException => curLine = null
}
if (curLine == null) in.close()
}
override protected def finalize() {
try {
in.close()
} finally {
super.finalize()
}
}
}
val wrappedOrd = new Ordering[TempFileBuffer] {
def compare(o1: TempFileBuffer, o2: TempFileBuffer): Int = {
ord.compare(o1.currentLine, o2.currentLine)
}
}
val pq = new collection.mutable.PriorityQueue[TempFileBuffer](
)(wrappedOrd)
// Init queue with item from each file
for (tmp <- fs) {
val buf = new TempFileBuffer(tmp)
if (!buf.isEmpty) pq += buf
}
var count = 0
val out = new java.io.PrintWriter(new java.io.File(outFileName))
try {
// Read each value off of queue
while (pq.size > 0) {
val buf = pq.dequeue()
out.println(buf.currentLine)
count += 1
buf.readNextLine()
if (buf.isEmpty) {
buf.file.delete() // don't need anymore
} else {
// re-add to priority queue so we can process next line
pq += buf
}
}
} finally {
out.close()
}
count
}
mergeSortedFiles(partitionAndSort())
}
My tests don't show any bugs in PriorityQueue.
import org.scalacheck._
import scala.collection.mutable.PriorityQueue
import Prop._
object PriorityQueueProperties extends Properties("PriorityQueue") {
def listToPQ(l: List[String]): PriorityQueue[String] = {
val pq = new PriorityQueue[String]
l foreach (pq +=)
pq
}
def pqToList(pq: PriorityQueue[String]): List[String] =
if (pq.isEmpty) Nil
else { val h = pq.dequeue; h :: pqToList(pq) }
property("Enqueued elements are dequeued in reverse order") =
forAll { (l: List[String]) => l.sorted == pqToList(listToPQ(l)).reverse }
property("Adding/removing elements doesn't break sorting") =
forAll { (l: List[String], s: String) =>
(l.size > 0) ==>
((s :: l.sorted.init).sorted == {
val pq = listToPQ(l)
pq.dequeue
pq += s
pqToList(pq).reverse
})
}
}
scala> PriorityQueueProperties.check
+ PriorityQueue.Enqueued elements are dequeued in reverse order: OK, passed
100 tests.
+ PriorityQueue.Adding/removing elements doesn't break sorting: OK, passed
100 tests.
If you could somehow reduce the input enough to make a test case, it would help.
I ran it with five million inputs several times, output matched expected always. My guess from looking at your code is that your Ordering is the problem (i.e. it's giving inconsistent answers.)
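One thing worth checking when comparing the two queue classes: Scala's `mutable.PriorityQueue` dequeues the greatest element under its Ordering, whereas `java.util.PriorityQueue` returns the least. The sketch below is a small in-memory k-way merge, not the poster's code. `mergeSorted` is a hypothetical name, the inputs are assumed to be already sorted, and the queue entries are immutable (head, rest) pairs so that a key never changes while it sits in the queue.

```scala
import scala.collection.mutable

// Hypothetical helper: merges already-sorted iterators into one ascending list.
def mergeSorted(sources: Seq[Iterator[String]])
               (implicit ord: Ordering[String]): List[String] = {
  type Entry = (String, Iterator[String])
  // Reverse the ordering so that dequeue() yields the smallest head first.
  val byHead: Ordering[Entry] = Ordering.by[Entry, String](_._1)(ord.reverse)
  val pq = mutable.PriorityQueue.empty[Entry](byHead)
  for (it <- sources if it.hasNext) pq.enqueue((it.next(), it))

  val out = List.newBuilder[String]
  while (pq.nonEmpty) {
    val (head, rest) = pq.dequeue()
    out += head
    // Re-enqueue the source with its next line as the new key.
    if (rest.hasNext) pq.enqueue((rest.next(), rest))
  }
  out.result()
}
```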