I have a CSV file containing data like:
key1,value1,value2,value3
key2,value1,value2,value3
key3,value1,value2,value3
I am able to read the file in Scala, but I cannot do much after that.
val file = scala.io.Source.fromFile("filepath").getLines.toList
file: List[String] = List(key1,value1,value2,value3, key2,value1,value2,value3, key3,value1,value2,value3)
I want the output to be like:
Map(key1->value1), Map(key1->value2), Map(key1->value3), Map(key2->value1) ... and so on
Assuming that this is a fixed layout, you can turn your file's contents into key/value pairs like this:
val kv = file
  .flatMap(_.split(","))   // split each line into its four fields first
  .grouped(4)
  .flatMap {
    case List(k, v1, v2, v3) => List(k -> v1, k -> v2, k -> v3)
  }.toList
This gives
List((key1,value1), (key1,value2), (key1,value3), (key2,value1), (key2,value2), (key2,value3), (key3,value1), (key3,value2), (key3,value3))
Your final output looks odd because it is just a list of single-entry Maps, but if this is really what you want, then it just needs a simple map call:
kv.map(x => Map(x))
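For the pairs above this produces (first three entries shown):

kv.map(x => Map(x)).take(3)
// List(Map(key1 -> value1), Map(key1 -> value2), Map(key1 -> value3))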
Update
If there is a variable number of values on each line then you need to process each line separately, something like this:
val src = scala.io.Source.fromFile("filepath")
val res = src.getLines.toList.flatMap { line =>
  line.split(",").toList match {
    case key :: values => values.map(v => key -> v)
    case _             => Nil
  }
}
src.close()
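As a quick sanity check, the same logic run on in-memory sample lines (the names here are purely illustrative) copes with rows of different lengths:

val sample = List("key1,value1,value2", "key2,value1,value2,value3,value4")
val pairs = sample.flatMap { line =>
  line.split(",").toList match {
    case key :: values => values.map(v => key -> v)
    case _             => Nil
  }
}
// pairs: List((key1,value1), (key1,value2), (key2,value1), (key2,value2), (key2,value3), (key2,value4))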
I have a DataFrame 'regexDf' like the one below:
id,regex
1,(.*)text1(.*)text2(.*)text3(.*)text4(.*)|(.*)text2(.*)text5(.*)text6(.*)
2,(.*)text1(.*)text5(.*)text6(.*)|(.*)text2(.*)
If the length of the regex exceeds some maximum length, for example 50, then I want to remove the last text token from each part of the regex string (the parts being separated by '|') for the offending id. In the DataFrame above, the regex for id 1 is longer than 50, so the last tokens 'text4(.*)' and 'text6(.*)' should be removed from the two parts. Even after that removal the regex string for id 1 is still longer than 50, so the then-last tokens 'text3(.*)' and 'text5(.*)' should be removed as well. So the final DataFrame will be:
id,regex
1,(.*)text1(.*)text2(.*)|(.*)text2(.*)
2,(.*)text1(.*)text5(.*)text6(.*)|(.*)text2(.*)
I am able to trim the last tokens using the following code:
val reducedStr = regex.split("\\|").foldLeft(List[String]()) { // "|" must be escaped: unescaped, split treats it as regex alternation
  (regexStr, eachRegex) => {
    regexStr :+ eachRegex.replaceAll("\\(\\.\\*\\)\\w+\\(\\.\\*\\)$", "\\(\\.\\*\\)")
  }
}.mkString("|")
I tried using a while loop to check the length and trim the text tokens in each iteration, but it is not working. I also want to avoid using var and while. Is it possible to achieve this without a while loop?
val optimizeRegexString = udf((regex: String) => {
  if (regex.length >= maxLength) {
    var len = regex.length
    var resultStr: String = ""
    while (len >= maxLength) {
      // note: this splits the original `regex` on every pass rather than the
      // shrunken `resultStr`, so `len` can never decrease past the first step
      val reducedStr = regex.split("\\|").foldLeft(List[String]()) {
        (regexStr, eachRegex) => {
          regexStr :+ eachRegex
            .replaceAll("\\(\\.\\*\\)\\w+\\(\\.\\*\\)$", "\\(\\.\\*\\)")
        }
      }.mkString("|")
      len = reducedStr.length
      resultStr = reducedStr
    }
    resultStr
  } else {
    regex
  }
})
regexDf.withColumn("optimizedRegex", optimizeRegexString(col("regex")))
As per SathiyanS's and Pasha's suggestions, I changed the recursive method to a function.
def optimizeRegex(regexDf: DataFrame): DataFrame = {
  val shrinkString = (s: String) => {
    if (s.length > 50) {
      val extractedString: String = shrinkString(s.split("\\|")
        .map(s => s.substring(0, s.lastIndexOf("text"))).mkString("|"))
      extractedString
    }
    else s
  }
  def shrinkUdf = udf((regex: String) => shrinkString(regex))
  regexDf.withColumn("regexString", shrinkUdf(col("regex")))
}
Now I am getting a compile error: "recursive value shrinkString needs type".
Error:(145, 39) recursive value shrinkString needs type
val extractedString: String = shrinkString(s.split("\\|")
.map(s => s.substring(0, s.lastIndexOf("text"))).mkString("|"));
Recursion:
def shrink(s: String): String = {
  if (s.length > 50)
    shrink(s.split("\\|").map(s => s.substring(0, s.lastIndexOf("text"))).mkString("|"))
  else s
}
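For reference, a quick check of shrink against the id 1 regex from the question:

val r = "(.*)text1(.*)text2(.*)text3(.*)text4(.*)|(.*)text2(.*)text5(.*)text6(.*)"
shrink(r)
// res: String = (.*)text1(.*)text2(.*)|(.*)text2(.*)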
It looks like there are issues with how the function is defined and called; here is some additional info.
It can be called as a static function:
object ShrinkContainer {
  def shrink(s: String): String = {
    if (s.length > 50)
      shrink(s.split("\\|").map(s => s.substring(0, s.lastIndexOf("text"))).mkString("|"))
    else s
  }
}
Linking it with the DataFrame:
def shrinkUdf = udf((regex: String) => ShrinkContainer.shrink(regex))
df.withColumn("regex", shrinkUdf(col("regex"))).show(truncate = false)
Drawbacks: just a basic example (approach) is provided. Some edge cases (if the regexp does not contain "text"; if there are too many parts separated by "|", e.g. 100; etc.) have to be resolved by the author of the question in order to avoid an infinite recursion loop.
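For example, one possible guard against the infinite recursion (my sketch, not part of the original answer; shrinkSafe is a made-up name) is to recurse only while a "text" token is still present:

def shrinkSafe(s: String): String =
  if (s.length > 50 && s.contains("text"))
    shrinkSafe(s.split("\\|").map { part =>
      val i = part.lastIndexOf("text")
      if (i >= 0) part.substring(0, i) else part
    }.mkString("|"))
  else s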
This is how I would do it.
First, a function for removing the last token from a regex:
def deleteLastToken(s: String): String =
  s.replaceFirst("""[^)]+\(\.\*\)$""", "")
Then, a function that shortens the entire regex string by deleting the last token from all the |-separated fields:
def shorten(r: String) = {
  val items = r.split("[|]").toSeq
  val shortenedItems = items.map(deleteLastToken)
  shortenedItems.mkString("|")
}
Then, for a given input regex string, create the stream of all the shortened strings you get by applying the shorten function repeatedly. This is an infinite stream, but it's lazily evaluated, so only as many elements as required will actually be computed:
val regex = "(.*)text1(.*)text2(.*)text3(.*)text4(.*)|(.*)text2(.*)text5(.*)text6(.*)"
val allShortened = Stream.iterate(regex)(shorten)
Finally, you can treat allShortened as any other sequence. For solving our problem, you can drop all elements while they don't satisfy the length requirement, and then keep only the first one of the remaining ones:
val result = allShortened.dropWhile(_.length > 50).head
You can see all the intermediate values by printing some elements of allShortened:
allShortened.take(10).foreach(println)
// Prints:
// (.*)text1(.*)text2(.*)text3(.*)text4(.*)|(.*)text2(.*)text5(.*)text6(.*)
// (.*)text1(.*)text2(.*)text3(.*)|(.*)text2(.*)text5(.*)
// (.*)text1(.*)text2(.*)|(.*)text2(.*)
// (.*)text1(.*)|(.*)
// (.*)|(.*)
// (.*)|(.*)
// (.*)|(.*)
// (.*)|(.*)
// (.*)|(.*)
// (.*)|(.*)
Just to add to @pasha701's answer. Here is the solution that works in Spark.
val df = sc.parallelize(Seq(
  (1, "(.*)text1(.*)text2(.*)text3(.*)text4(.*)|(.*)text2(.*)text5(.*)text6(.*)"),
  (2, "(.*)text1(.*)text5(.*)text6(.*)|(.*)text2(.*)")
)).toDF("ID", "regex")
df.show()
//prints
+---+------------------------------------------------------------------------+
|ID |regex                                                                   |
+---+------------------------------------------------------------------------+
|1  |(.*)text1(.*)text2(.*)text3(.*)text4(.*)|(.*)text2(.*)text5(.*)text6(.*)|
|2  |(.*)text1(.*)text5(.*)text6(.*)|(.*)text2(.*)                           |
+---+------------------------------------------------------------------------+
Now you can use @pasha701's shrink function in a UDF:
val shrink: String => String = (s: String) =>
  if (s.length > 50)
    shrink(s.split("\\|").map(s => s.substring(0, s.lastIndexOf("text"))).mkString("|"))
  else s
def shrinkUdf = udf((regex: String) => shrink(regex))
df.withColumn("regex", shrinkUdf(col("regex"))).show(truncate = false)
//prints
+---+---------------------------------------------+
|ID |regex                                        |
+---+---------------------------------------------+
|1  |(.*)text1(.*)text2(.*)|(.*)text2(.*)         |
|2  |(.*)text1(.*)text5(.*)text6(.*)|(.*)text2(.*)|
+---+---------------------------------------------+
I'm trying to implement Huffman compression. After encoding the characters into 0s and 1s, how do I write them to a file so that it is actually compressed? Obviously, simply writing the characters '0' and '1' will only make the file larger.
Let's say I have a string "0101010" which represents bits.
I wish to write the string to a file, but as binary, i.e. not the char '0' but the bit 0.
I tried using getBytes() on the string, but it seems to make no difference compared to simply writing the string.
Although Sarvesh Kumar Singh's code will probably work, it looks so Java-ish to me that I think this question needs one more answer. The way I imagine Huffman coding being implemented in Scala is something like this:
import scala.collection._
import scala.io.Source
type Bit = Boolean
type BitStream = Iterator[Bit]
type BitArray = Array[Bit]
type ByteStream = Iterator[Byte]
type CharStream = Iterator[Char]
case class EncodingMapping(charMapping: Map[Char, BitArray], eofCharMapping: BitArray)
def buildMapping(src: CharStream): EncodingMapping = {
  def accumulateStats(src: CharStream): Map[Char, Int] = ???
  def buildMappingImpl(stats: Map[Char, Int]): EncodingMapping = ???
  val stats = accumulateStats(src)
  buildMappingImpl(stats)
}
def convertBitsToBytes(bits: BitStream): ByteStream = {
  bits.grouped(8).map(bits => {
    val res = bits.foldLeft((0.toByte, 0))((acc, bit) => ((acc._1 * 2 + (if (bit) 1 else 0)).toByte, acc._2 + 1))
    // last byte might be less than 8 bits
    if (res._2 == 8)
      res._1
    else
      (res._1 << (8 - res._2)).toByte
  })
}
def encodeImpl(src: CharStream, mapping: EncodingMapping): ByteStream = {
  val mainData = src.flatMap(ch => mapping.charMapping(ch))
  val fullData = mainData ++ mapping.eofCharMapping
  convertBitsToBytes(fullData)
}
// can be used to encode String as src. Thanks to StringLike/StringOps extension
def encode(src: Iterable[Char]): (EncodingMapping, ByteStream) = {
  val mapping = buildMapping(src.iterator)
  val encoded = encodeImpl(src.iterator, mapping)
  (mapping, encoded)
}
def wrapClose[A <: java.io.Closeable, B](resource: A)(op: A => B): B = {
  try {
    op(resource)
  }
  finally {
    resource.close()
  }
}
def encodeFile(fileName: String): (EncodingMapping, ByteStream) = {
  // note in real life you probably want to specify the file encoding as well
  val mapping = wrapClose(Source.fromFile(fileName))(buildMapping)
  val encoded = wrapClose(Source.fromFile(fileName))(file => encodeImpl(file, mapping))
  (mapping, encoded)
}
where in accumulateStats you find out how often each Char is present in the src, and in buildMappingImpl (which is the main part of the whole Huffman encoding) you first build a tree from those stats and then use it to create a fixed EncodingMapping. eofCharMapping is a mapping for the pseudo-EOF char as mentioned in one of the comments. Note that the high-level encode methods return both the EncodingMapping and the ByteStream, because in any real-life scenario you want to save both.
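For completeness, accumulateStats could be as simple as this (a sketch with no attention to performance):

def accumulateStats(src: CharStream): Map[Char, Int] =
  src.foldLeft(Map.empty[Char, Int].withDefaultValue(0)) { (acc, ch) =>
    acc.updated(ch, acc(ch) + 1)
  }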
The piece of logic specifically being asked about is located in the convertBitsToBytes method. Note that I use Boolean to represent a single bit rather than Char, and thus Iterator[Bit] (effectively Iterator[Boolean]) rather than String to represent a sequence of bits. The idea of the implementation is based on the grouped method, which converts a BitStream into a stream of Bits collected into byte-sized groups (except possibly for the last one).
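A quick check of convertBitsToBytes illustrates both branches (the second byte comes from the left-shift padding of the 2-bit tail):

val bits: BitStream = Iterator(true, false, true, false, true, false, true, false, true, true)
convertBitsToBytes(bits).toList
// List(-86, -64), i.e. 10101010 and 11 padded to 11000000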
IMHO the main advantage of this stream-oriented approach compared to Sarvesh Kumar Singh's answer is that you don't need to load the whole file into memory at once or store the whole encoded file in memory. Note however that in this case you'll have to read the file twice: first to build the EncodingMapping and a second time to apply it. Obviously, if the file is small enough, you can load it into memory first and then convert the ByteStream to an Array[Byte] using a .toArray call. But if your file is big, you can stick with the stream-based approach and easily save the ByteStream into a file using something like .foreach(b => out.write(b)).
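For instance, saving the stream could look like this (a minimal sketch reusing wrapClose from above; the helper name is made up):

import java.io.FileOutputStream

def saveByteStream(bytes: ByteStream, fileName: String): Unit =
  wrapClose(new FileOutputStream(fileName))(out => bytes.foreach(b => out.write(b)))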
I don't think this will help you achieve your Huffman compression goal, but in answer to your question:
string-to-value
Converting a String to the value it represents is pretty easy.
val x: Int = "10101010".foldLeft(0)(_*2 + _.asDigit)
Note: You'll have to check for formatting (only ones and zeros) and overflow (strings too long) before conversion.
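Such a guard could look like this (parseBits is a hypothetical helper; the length limit keeps the result within a non-negative Int):

def parseBits(s: String): Option[Int] =
  if (s.nonEmpty && s.length <= 31 && s.forall(c => c == '0' || c == '1'))
    Some(s.foldLeft(0)(_ * 2 + _.asDigit))
  else None

parseBits("10101010") // Some(170)
parseBits("1021")     // None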
value-to-file
There are a number of ways to write data to a file. Here's a simple one.
import java.io.{FileOutputStream, File}
val fos = new FileOutputStream(new File("output.dat"))
fos.write(x)
fos.flush()
fos.close()
Note: You'll want to catch any errors thrown.
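For example, on Scala 2.13+ you could lean on scala.util.Using, which closes the stream and captures any exception in the returned Try:

import java.io.{File, FileOutputStream}
import scala.util.Using

Using(new FileOutputStream(new File("output.dat")))(_.write(x)) // Try[Unit]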
I will specify all the required imports first:
import java.io.{ File, FileInputStream, FileOutputStream}
import java.nio.file.Paths
import scala.collection.mutable.ArrayBuffer
Now, we are going to need the following smaller units to achieve the whole thing:
1 - We need to be able to convert our binary string (e.g. "01010") to an Array[Byte]:
def binaryStringToByteArray(binaryString: String) = {
  val byteBuffer = ArrayBuffer.empty[Byte]
  var byteStr = ""
  for (binaryChar <- binaryString) {
    if (byteStr.length < 7) {
      byteStr = byteStr + binaryChar
    }
    else {
      try {
        // try to parse 8 bits; anything above "01111111" overflows Byte
        val byte = java.lang.Byte.parseByte(byteStr + binaryChar, 2)
        byteBuffer += byte
        byteStr = ""
      }
      catch {
        case ex: java.lang.NumberFormatException =>
          // fall back to 7 bits and carry the current char into the next chunk
          val byte = java.lang.Byte.parseByte(byteStr, 2)
          byteBuffer += byte
          byteStr = "" + binaryChar
      }
    }
  }
  if (!byteStr.isEmpty) {
    val byte = java.lang.Byte.parseByte(byteStr, 2)
    byteBuffer += byte
    byteStr = ""
  }
  byteBuffer.toArray
}
2 - We need to be able to open the file that will serve in our little play:
def openFile(filePath: String): File = {
  val path = Paths.get(filePath)
  val file = path.toFile
  if (file.exists()) file.delete()
  if (!file.exists()) file.createNewFile()
  file
}
3 - We need to be able to write bytes to a file:
def writeBytesToFile(bytes: Array[Byte], file: File): Unit = {
  val fos = new FileOutputStream(file)
  fos.write(bytes)
  fos.close()
}
4 - We need to be able to read bytes back from the file:
def readBytesFromFile(file: File): Array[Byte] = {
  val fis = new FileInputStream(file)
  val bytes = new Array[Byte](file.length().toInt)
  fis.read(bytes)
  fis.close()
  bytes
}
5 - We need to be able to convert bytes back to our binaryString:
def byteArrayToBinaryString(byteArray: Array[Byte]): String = {
  // note: toBinaryString drops each byte's leading zeros
  byteArray.map(b => b.toBinaryString).mkString("")
}
Now, we are ready to do everything we want:
// lets say we had this binary string,
scala> val binaryString = "00101110011010101010101010101"
// binaryString: String = 00101110011010101010101010101
// Now, we need to "pad" this with a leading "1" to avoid byte related issues
scala> val paddedBinaryString = "1" + binaryString
// paddedBinaryString: String = 100101110011010101010101010101
// The file which we will use for this,
scala> val file = openFile("/tmp/a_bit")
// file: java.io.File = /tmp/a_bit
// convert our padded binary string to bytes
scala> val bytes = binaryStringToByteArray(paddedBinaryString)
// bytes: Array[Byte] = Array(75, 77, 85, 85)
// write the bytes to our file,
scala> writeBytesToFile(bytes, file)
// read bytes back from file,
scala> val bytesFromFile = readBytesFromFile(file)
// bytesFromFile: Array[Byte] = Array(75, 77, 85, 85)
// so now, we have our padded string back,
scala> val paddedBinaryStringFromFile = byteArrayToBinaryString(bytesFromFile)
// paddedBinaryStringFromFile: String = 1001011100110110101011010101
// remove that "1" from the front and we have our binaryString back,
scala> val binaryStringFromFile = paddedBinaryString.tail
// binaryStringFromFile: String = 00101110011010101010101010101
NOTE: you may have to make a few changes if you want to deal with very large "binary strings" (more than a few million characters long) to improve performance, or even for this to be usable. For example, you will need to start using Streams or Iterators instead of Array[Byte].
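As one illustration of that note, the byte conversion itself can be made streaming with an Iterator (a sketch of the idea only; unlike the code above it uses fixed 8-bit chunks, so it is not a drop-in replacement for the 7-bit fallback scheme):

def bitsToBytes(bits: Iterator[Char]): Iterator[Byte] =
  bits.grouped(8).map(g => java.lang.Integer.parseInt(g.mkString, 2).toByte)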
I have a variable listMap1 of type List[Map[String, String]], and I want all values associated with key 'k1' as one string of comma-separated values.
import fiddle.Fiddle, Fiddle.println
import scalajs.js
@js.annotation.JSExport
object ScalaFiddle {
var m1:Map[String,String] = Map(("k1"->"v1"), ("k2"->"vv1"))
var m2:Map[String,String] = Map(("k1"->"v2"),("k2"->"vv2"))
var m3:Map[String,String] = Map(("k1"->"v3"),("k2"->"vv3"))
var listMap1 = List(m1,m2,m3)
var valList = ?? // need all values associated with k1 like --> v1,v2,v3...
}
A simple approach would be:
listMap1.flatMap(_.get("k1")).mkString(",")
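Since Map.get returns an Option, flatMap simply skips any map that lacks the key; with the three maps above this yields:

listMap1.flatMap(_.get("k1")).mkString(",") // "v1,v2,v3"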
Be warned that this will not work if you're generating CSV data and the associated values contain commas, e.g. Map("k1" -> "some, string").
Is that OK?
val r = listMap1.filter(l => l.contains("k1")).map(r => r("k1")).mkString(",")
Does Kotlin have a function like .zipAll in Scala?
In Scala I can sum two arrays with different lengths using the zipAll function.
Scala:
val arrA = Array(1,2,3)
val arrB = Array(4, 5)
arrA.zipAll(arrB, 0, 0).map(x => x._1 + x._2)
Or what is the correct way to do this in Kotlin?
There is no built-in analog in Kotlin 1.0. It might be a good idea to add it to the stdlib. Feel free to file an issue on YouTrack.
Here is zipAll for Kotlin:
fun <T1: Any, T2: Any> List<T1>.zipAll(other: List<T2>, emptyValue: T1, otherEmptyValue: T2): List<Pair<T1, T2>> {
    val i1 = this.iterator()
    val i2 = other.iterator()
    return generateSequence {
        if (i1.hasNext() || i2.hasNext()) {
            Pair(if (i1.hasNext()) i1.next() else emptyValue,
                 if (i2.hasNext()) i2.next() else otherEmptyValue)
        } else {
            null
        }
    }.toList()
}
And a unit test:
@Test fun sumTwoUnevenLists() {
    val x = listOf(1, 2, 3, 4, 5)
    val y = listOf(10, 20, 30)
    assertEquals(listOf(11, 22, 33, 4, 5), x.zipAll(y, 0, 0).map { it.first + it.second })
}
And the same could be applied to arrays, other collection types, sequences, etc. An array-only version would be easier since you can index into the arrays. The array version could be:
fun <T1: Any, T2: Any> Array<T1>.zipAll(other: Array<T2>, emptyValue: T1, otherEmptyValue: T2): List<Pair<T1, T2>> {
    val largest = this.size.coerceAtLeast(other.size)
    val result = arrayListOf<Pair<T1, T2>>()
    (0 until largest).forEach { i ->
        result.add(Pair(if (i < this.size) this[i] else emptyValue, if (i < other.size) other[i] else otherEmptyValue))
    }
    return result
}
It returns a List because the map function is going to turn it into a list anyway.
I made a quick tail-recursive version for fun. Not very efficient though, due to the list appends.
fun <T, U> List<T>.zipAll(that: List<U>, elem1: T, elem2: U): List<Pair<T, U>> {
    tailrec fun helper(first: List<T>, second: List<U>, acc: List<Pair<T, U>>): List<Pair<T, U>> {
        return when {
            first.isEmpty() && second.isEmpty() -> acc
            first.isEmpty() -> helper(first, second.drop(1), acc + listOf(elem1 to second.first()))
            second.isEmpty() -> helper(first.drop(1), second, acc + listOf(first.first() to elem2))
            else -> helper(first.drop(1), second.drop(1), acc + listOf(first.first() to second.first()))
        }
    }
    return helper(this, that, emptyList())
}
This does not yet exist in the Kotlin stdlib, but this is the approach I suggested in the YouTrack ticket about this.
Here is a potential implementation modeled after the current zip function https://kotlinlang.org/api/latest/jvm/stdlib/kotlin.collections/zip.html.
/**
 * Returns a list of values built from the elements of `this` collection and the [other] collection with the same
 * index, using the provided [transform] function applied to each pair of elements. The returned list has the
 * length of the longest collection.
 */
fun <T, R, V> Iterable<T>.zipAll(
    other: Iterable<R>,
    thisDefault: T,
    otherDefault: R,
    transform: (a: T, b: R) -> V,
): List<V> {
    val first = iterator()
    val second = other.iterator()
    val list = ArrayList<V>(maxOf(collectionSizeOrDefault(10), other.collectionSizeOrDefault(10)))
    while (first.hasNext() || second.hasNext()) {
        val thisValue = if (first.hasNext()) first.next() else thisDefault
        val otherValue = if (second.hasNext()) second.next() else otherDefault
        list.add(transform(thisValue, otherValue))
    }
    return list
}
// Copying this from kotlin.collections where it is an Internal function
fun <T> Iterable<T>.collectionSizeOrDefault(default: Int): Int =
if (this is Collection<*>) this.size else default
And here is how I use it
/**
 * Takes two multiline strings and combines them into a two-column view.
 */
fun renderSideBySide(
    leftColumn: String,
    rightColumn: String,
    divider: String = " | ",
): String {
    val leftColumnWidth: Int = leftColumn.lines().map { it.length }.maxOrNull() ?: 0
    return leftColumn.lines()
        .zipAll(rightColumn.lines(), "", "") { left, right ->
            left.padEnd(leftColumnWidth) + divider + right
        }
        .reduce { acc, nextLine -> acc + "\n" + nextLine }
}
Example of how I am using this:
val left = """
Left Column
with some data
""".trimIndent()
val right = """
Right Column
also with some data
but not the same length
of data as the left column.
""".trimIndent()
println(left)
Left Column
with some data
println(right)
Right Column
also with some data
but not the same length
of data as the left column.
println(renderSideBySide(left,right))
Left Column    | Right Column
with some data | also with some data
               | but not the same length
               | of data as the left column.
println(renderSideBySide(right,left))
Right Column                | Left Column
also with some data         | with some data
but not the same length     |
of data as the left column. |