I'm new to Scala, so I was wondering if anyone can help me with a function that takes an arbitrary list of labels and a delimited text string, and returns something like a Map or Dictionary.
val labels = Seq("color", "cost", "name")
val data = ("blue|$9.99|smurf")
private def getData(data: String, labels: Seq[String]) = {
  val values = data.split("|")
  // now how do I map these split values with the labels to create a nice map or dictionary?
}
val labels = Seq("color", "cost", "name")
val values = "blue|$9.99|smurf".split("\\|")
// Array(blue, $9.99, smurf)
val map = labels.zip(values).toMap
// Map(color -> blue, cost -> $9.99, name -> smurf)
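If you want this wrapped back into the function from the question, here is a minimal sketch (assuming the labels and the split values always line up one-to-one; zip silently drops any extras):
def getData(data: String, labels: Seq[String]): Map[String, String] =
  labels.zip(data.split("\\|")).toMap
getData("blue|$9.99|smurf", Seq("color", "cost", "name"))
// Map(color -> blue, cost -> $9.99, name -> smurf)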
My current DataFrame looks like this:
{"id":"1","inputs":{"values":{"0.2":[1,1],"0.4":[1,1],"0.6":[1,1]}},"id1":[1,2]}
I want to transform this DataFrame into the one below:
{"id":"1", "v20":[1,1],"v40":[1,1],"v60":[1,1],"id1":[1,2]}
This means that each key under 'values' (0.2, 0.4 and 0.6) should be multiplied by 100, prefixed with the letter 'v', and its array extracted into a separate column.
How would the code look in order to achieve this? I have tried withColumn but couldn't achieve it.
Try the code below; the inline comments explain each step.
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StructType

object DynamicCol {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    val df = spark.read.json("src/main/resources/dyamicCol.json") // Load the JSON file
    val dfTemp = df.select(col("inputs.values").as("values")) // Temp DataFrame for fetching the nested values
    val index = dfTemp.schema.fieldIndex("values")
    val propSchema = dfTemp.schema(index).dataType.asInstanceOf[StructType]
    val dfFinal = propSchema.fields.foldLeft(df)((df, field) => { // Fold over the nested fields, adding one column per field
      val colNameInt = (field.name.toDouble * 100).toInt
      val colName = s"v$colNameInt"
      df.withColumn(colName, col("inputs.values.`" + field.name + "`")) // Add the nested column mapping
    }).drop("inputs") // Drop the extra column
    dfFinal.write.mode(SaveMode.Overwrite).json("src/main/resources/dyamicColOut.json") // Output the JSON file
  }
}
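For a quick sanity check on the sample row from the question, you can inspect the result (expecting columns along the lines of id, id1, v20, v40, v60):
dfFinal.printSchema()
dfFinal.show(false)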
I would split the column-renaming logic into two parts: one for names that are numeric values, and one for names that stay unchanged.
def stringDecimalToVNumber(colName:String): String =
"v" + (colName.toFloat * 100).toInt.toString
and then form a single function that transforms the name according to each case:
val floatRegex = """(\d+\.?\d*)""".r
def transformColumnName(colName: String): String = colName match {
  case floatRegex(v) => stringDecimalToVNumber(v) // it's a float, transform it
  case x => x // keep it as-is
}
Now that we have the function to transform the column names, let's pick up the schema dynamically.
val flattenDF = df.select("id", "inputs.values.*")
val finalDF = flattenDF
  .schema.names
  .foldLeft(flattenDF)((dfacum, x) => {
    val newName = transformColumnName(x)
    if (newName == x)
      dfacum // the name didn't need to be changed
    else
      dfacum.withColumnRenamed(x, newName)
  })
This will dynamically rename all the columns inside inputs.values and place them next to id.
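Note that the desired output in the question also keeps id1; if you need it, a small tweak to the initial select (assuming id1 is a top-level column, as in the sample JSON) carries it through:
val flattenDF = df.select("id", "id1", "inputs.values.*")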
I'm a beginner here. I'm working with Spark 2.4.4 and Scala.
I have an RDD with three columns with the first entry like this:
(String, Double, String) = (100,10,neg)
The RDD has thousands of entries. I want to change the value of the double to a negative value when there is 'neg' in the same row and do nothing when there is any other phrase. I want to get the following output:
(String, Double) = (100,-10)
I figured the map function would work for this to create a new RDD, but if there is another option, please let me know.
When all the required data is available on the same item, you can use map to perform your transformation.
val yourRDD = spark.sparkContext.parallelize(Seq(
  ("10", 2.0, "neg"),
  ("50", 6.0, "other"),
  ("40", -5.0, "neg"),
  ("100", 1.0, ""))) // Sample data
// org.apache.spark.rdd.RDD[(String, Double, String)]
val updatedRDD = yourRDD.map(item => {
  val tag = item._3 // position of your tag
  val outputValue =
    if (tag.equals("neg") && item._2 > 0) item._2 * -1 // only if the tag is 'neg' and the value is positive
    else item._2
  (item._1, outputValue)
})
// Output data: ((10,-2.0), (50,6.0), (40,-5.0), (100,1.0))
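An equivalent sketch using pattern matching on the tuple, if you prefer to avoid the _1/_2 accessors:
val updatedRDD2 = yourRDD.map {
  case (id, value, "neg") if value > 0 => (id, -value) // flip only positive values tagged 'neg'
  case (id, value, _) => (id, value) // leave everything else untouched
}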
In this code we have two files: athletes.csv, which contains names, and twitter.test, which contains the tweet messages. We want to find, for every line in twitter.test, the name that matches a name in athletes.csv. We applied a map function to store the names from athletes.csv and want to check every name against every line in the test file.
import java.nio.charset.CodingErrorAction
import scala.io.{Codec, Source}
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkContext

object twitterAthlete {

  def loadAthleteNames(): Map[String, String] = {
    // Handle character encoding issues:
    implicit val codec = Codec("UTF-8")
    codec.onMalformedInput(CodingErrorAction.REPLACE)
    codec.onUnmappableCharacter(CodingErrorAction.REPLACE)

    // Create a Map of Strings to Strings, and populate it from athletes.csv
    var athleteInfo: Map[String, String] = Map()
    val lines = Source.fromFile("../athletes.csv").getLines()
    for (line <- lines) {
      val fields = line.split(',')
      if (fields.length > 1) {
        athleteInfo += (fields(1) -> fields(7))
      }
    }
    athleteInfo
  }

  def parseLine(line: String): String = {
    val athleteInfo = loadAthleteNames()
    var hello = new String
    for ((k, v) <- athleteInfo) {
      if (line.contains(k)) {
        hello = k
      }
    }
    hello
  }

  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.ERROR)
    val sc = new SparkContext("local[*]", "twitterAthlete")
    val lines = sc.textFile("../twitter.test")
    val athleteInfo = loadAthleteNames()
    val splitting = lines.map(x => x.split(";")).map(x => if (x.length == 4 && x(2).length <= 140) x(2))
    val container = splitting.map(x => for ((key, value) <- athleteInfo) if (x.toString().contains(key)) { key }).cache
    container.collect().foreach(println)
    // val mapping = container.map(x => (x, 1)).reduceByKey(_ + _)
    // mapping.collect().foreach(println)
  }
}
The first file looks like:
id,name,nationality,sex,height........
001,Michael,USA,male,1.96 ...
002,Json,GBR,male,1.76 ....
003,Martin,female,1.73 . ...
The second file looks like:
time, id , tweet .....
12:00, 03043, some message that contain some athletes names , .....
02:00, 03023, some message that contain some athletes names , .....
something like this ...
But I got an empty result after running this code; any suggestions are much appreciated.
The result I got is empty:
()....
()...
()...
But the result I expected is something like:
(name,1)
(other name,1)
You need to use yield to return a value from the for loop inside your map:
val container = splitting.map(x => for((key,value) <- athleteInfo ; if(x.toString().contains(key)) ) yield (key, 1)).cache
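If you then want the (name, 1) style counts from the expected output, a sketch that flattens the matches and applies the commented-out reduceByKey could look like this:
val counts = splitting
  .flatMap(x => for ((key, _) <- athleteInfo if x.toString.contains(key)) yield (key, 1))
  .reduceByKey(_ + _)
counts.collect().foreach(println) // prints one (name, count) pair per matched athlete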
I think you should just start with the simplest option first...
I would use DataFrames so you can use the built-in CSV parsing and leverage Catalyst, Tungsten, etc.
Then you can use the built-in Tokenizer to split the tweets into words, explode, and do a simple join. Depending on how big/small the athlete-names data is, you may end up with a more optimized broadcast join and avoid a shuffle.
import org.apache.spark.sql.functions._
import org.apache.spark.ml.feature.Tokenizer
import spark.implicits._ // for the 'symbol column syntax (already in scope in spark-shell)
val tweets = spark.read.format("csv").load(...)
val athletes = spark.read.format("csv").load(...)
val tokenizer = new Tokenizer()
tokenizer.setInputCol("tweet")
tokenizer.setOutputCol("words")
val tokenized = tokenizer.transform(tweets)
val exploded = tokenized.withColumn("word", explode('words))
val withAthlete = exploded.join(athletes, 'word === 'name)
withAthlete.select(exploded("id"), 'name).show()
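If the athlete-names data is small, you could also make the broadcast explicit with the broadcast hint from org.apache.spark.sql.functions (a sketch):
val withAthleteBroadcast = exploded.join(broadcast(athletes), 'word === 'name)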
I have a ResultSet object returned from Hive using JDBC.
I am trying to store the values from the ResultSet in an immutable Scala Map.
How can I add these values to an immutable Map as I iterate over the ResultSet using a while loop?
val m : Map[String, String] = null
while ( resultSet.next() ) {
val col = resultSet.getString("col_name")
val data = resultSet.getString("data_type")
m += (col -> data) // This Gives Reassignment error
}
I propose:
Iterator
  .continually(resultSet)
  .takeWhile(_.next()) // advance the cursor and stop when there are no more rows
  .map { rs =>
    val col = rs.getString("col_name")
    val data = rs.getString("data_type")
    col -> data
  }
  .toMap
Instead of thinking "let's init an empty collection and fill it", which is imho the mutable way to think, this proposal rather thinks in terms of "let's declare how to build a collection with those elements in it and be done" :-)
You might want to use scala.collection.Iterator[A] so that you can create an immutable Map out of your Java ResultSet.
val myMap: Map[String, String] = new Iterator[(String, String)] {
  override def hasNext = resultSet.next()
  override def next() = {
    val col = resultSet.getString("col_name")
    val data = resultSet.getString("data_type")
    col -> data
  }
}.toMap
Otherwise you have to use a mutable scala.collection.mutable.Map.
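For completeness, a minimal sketch of that mutable alternative (assuming the same resultSet and column names as above), converted back to an immutable Map at the end:
import scala.collection.mutable

val buffer = mutable.Map[String, String]()
while (resultSet.next()) {
  buffer += (resultSet.getString("col_name") -> resultSet.getString("data_type"))
}
val myMap: Map[String, String] = buffer.toMap // freeze into an immutable Map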
How can I export all these maps to a single CSV file using Scala? mapKey contains all keys from all the maps, but not every (language) map has a value for every key.
The header row should contain "key", "default", "de", "fr", "it", "en".
val mapDefault: Map[String, String] = messages.getOrElse("default",Map())
val mapDe: Map[String, String] = messages.getOrElse("de", Map())
val mapFr: Map[String, String] = messages.getOrElse("fr", Map())
val mapEn: Map[String, String] = messages.getOrElse("en", Map())
val mapIt: Map[String, String] = messages.getOrElse("it", Map())
var mapKey: Set[String] = mapDefault.keySet ++ mapDe.keySet ++
mapFr.keySet ++ mapEn.keySet ++ mapIt.keySet
As mentioned in the comments, it's best to use a library that constructs the CSV for you, especially to deal with special characters (commas or newlines) in the input, which would break your CSV if you're not careful.
Either way - you first have to transform your data into a sequence of records with a constant order and number of columns.
Below is an implementation that does not use any library, just to show the gist of how it's done in Scala - feel free to replace the actual CSV creation with a proper library:
// create headers row:
val headers = Seq("key", "default", "de", "fr", "it", "en")
// data rows: for each key in mapKey, create a Seq[String] with the values for this
// key (in correct order - matching the headers), or with empty String for
// missing values
val records: Seq[Seq[String]] = mapKey.toSeq.map(key => Seq(
  key,
  mapDefault.getOrElse(key, ""),
  mapDe.getOrElse(key, ""),
  mapFr.getOrElse(key, ""),
  mapIt.getOrElse(key, ""),
  mapEn.getOrElse(key, "")
))
// add the headers as the first record
val allRows: Seq[Seq[String]] = headers +: records
// create CSV (naive implementation - assumes no special chars in input!)
val csv: String = allRows.map(_.mkString(",")).mkString("\n")
// write to file:
import java.io.PrintWriter
new PrintWriter("filename.csv") { write(csv); close() }
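If you do stay library-free, a minimal RFC 4180-style quoting helper (a sketch, reusing allRows from above) removes the special-character caveat:
// Quote any field containing a comma, double quote, or newline, doubling embedded quotes
def escape(field: String): String =
  if (field.exists(c => c == ',' || c == '"' || c == '\n' || c == '\r'))
    "\"" + field.replace("\"", "\"\"") + "\""
  else field

val safeCsv: String = allRows.map(_.map(escape).mkString(",")).mkString("\n")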