I have the following piece of code:
var splitDf = fullCertificateSourceDf.map(row => {
  val ID = row.getAs[String]("ID")
  val CertificateID = row.getAs[String]("CertificateID")
  val CertificateTag = row.getAs[String]("CertificateTag")
  val CertificateDescription = row.getAs[String]("CertificateDescription")
  val WorkBreakdownUp1Summary = row.getAs[String]("WorkBreakdownUp1Summary")
  val ProcessBreakdownSummaryList = row.getAs[String]("ProcessBreakdownSummaryList")
  val ProcessBreakdownUp1SummaryList = row.getAs[String]("ProcessBreakdownUp1SummaryList")
  val ProcessBreakdownUp2Summary = row.getAs[String]("ProcessBreakdownUp2Summary")
  val ProcessBreakdownUp3Summary = row.getAs[String]("ProcessBreakdownUp3Summary")
  val ActualStartDate = row.getAs[java.sql.Date]("ActualStartDate")
  val ActualEndDate = row.getAs[java.sql.Date]("ActualEndDate")
  val ApprovedDate = row.getAs[java.sql.Date]("ApprovedDate")
  val CurrentState = row.getAs[String]("CurrentState")
  val DataType = row.getAs[String]("DataType")
  val PullDate = row.getAs[String]("PullDate")
  val PullTime = row.getAs[String]("PullTime")
  val split_ProcessBreakdownSummaryList = ProcessBreakdownSummaryList.split(",")
  val split_ProcessBreakdownUp1SummaryList = ProcessBreakdownUp1SummaryList.split(",")
  val Pattern = "^.*?(?= - *[a-zA-Z])".r
  for{
    subSystem : String <- split_ProcessBreakdownSummaryList
  } yield(ID,
    CertificateID,
    CertificateTag,
    CertificateDescription,
    WorkBreakdownUp1Summary,
    subSystem,
    for{ system: String <- split_ProcessBreakdownUp1SummaryList if(system contains subSystem.trim().substring(0,11))}yield(system),
    ProcessBreakdownUp2Summary,
    ProcessBreakdownUp3Summary,
    ActualStartDate,
    ActualEndDate,
    ApprovedDate,
    CurrentState,
    DataType,
    PullDate,
    PullTime
  )
}).flatMap(identity(_))
display(splitDf)
How can I get the first matching element from the following portion of the above statement:
for{ system: String <- split_ProcessBreakdownUp1SummaryList if(system contains subSystem.trim().substring(0,11))}yield(system)
At the moment it returns an array with one element in it. I don't want the array; I just want the element.
Thank you in advance.
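One possible approach, sketched under assumptions: a for/yield over an Array always produces another Array, whereas the standard collection method find stops at the first match and returns an Option[String]. The object name, helper name, and sample values below are hypothetical and only illustrate the idea. take(11) is used in place of substring(0,11) because it does not throw when the trimmed string is shorter than 11 characters.

```scala
object FirstMatch {
  // Returns the first system that contains the first 11 characters of the
  // trimmed subSystem, or None when nothing matches.
  def firstMatching(systems: Array[String], subSystem: String): Option[String] = {
    // take(11) keeps at most 11 characters and is safe on short strings.
    val prefix = subSystem.trim.take(11)
    systems.find(_.contains(prefix))
  }
}
```

Inside the yield, the inner for expression could then become something like FirstMatch.firstMatching(split_ProcessBreakdownUp1SummaryList, subSystem).getOrElse(""), where the empty-string fallback is an assumption about what should be stored when no system matches.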
Spark: I need to save the predictions from the Scala code below to a file instead of just showing them. How can I do that?
val sameModel = PipelineModel.load("/tmp/spark-logistic-regression-model")
val rawDataTest = sc.textFile("/home/hduser1/Pictures/LogTextTest.txt")
val header1 = rawDataTest.first()
val rawDataTest1 = rawDataTest.filter(line => line != header1)
val records1 = rawDataTest1.map(line => line.split(","))
records1.first()
case class StumbleUpon1(id: String, text: String)
val data1 = records1.map { r1 =>
  val trimmed1 = r1.map(_.replaceAll("\"", ""))
  //val label1 = trimmed(r1.size - 1).toInt
  val features = trimmed1.slice(0, r1.size)
  assert(features.size == 2)
  StumbleUpon1(features(0), features(1))
}
data1.first()
val Test = data1.toDF("id", "text")
Test.show(15)
val predictions = sameModel.transform(Test)
predictions.show(5)
I have two strings in Scala
Input 1 : "a,c,e,g,i,k"
Input 2 : "b,d,f,h,j,l"
How do I join the two Strings in Scala?
Required output = "ab,cd,ef,gh,ij,kl"
I tried something like:
var columnNameSetOne:Array[String] = Array(); //v1 = "a,c,e,g,i,k"
var columnNameSetTwo:Array[String] = Array(); //v2 = "b,d,f,h,j,l"
After I get the input data as mentioned above
columnNameSetOne = v1.split(",")
columnNameSetTwo = v2.split(",");
val newColumnSet = IntStream.range(0, Math.min(columnNameSetOne.length, columnNameSetTwo.length)).mapToObj(j => (columnNameSetOne(j) + columnNameSetTwo(j))).collect(Collectors.joining(","));
println(newColumnSet)
But I am getting an error on j.
Also, I am not sure if this would work!
object Solution1 extends App {
  val input1 = "a,c,e,g,i,k"
  val input2 = "b,d,f,h,j,l"
  val i1 = input1.split(",")
  val i2 = input2.split(",")
  val x = i1.zipAll(i2, "", "").map {
    case (a, b) => a + b
  }
  println(x.mkString(","))
}
// output: ab,cd,ef,gh,ij,kl
This is easy to do using the zip function on lists.
val v1 = "a,c,e,g,i,k"
val v2 = "b,d,f,h,j,l"
val list1 = v1.split(",").toList
val list2 = v2.split(",").toList
list1.zip(list2).map { case (a, b) => a + b }.mkString(",") // res0: String = ab,cd,ef,gh,ij,kl
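The two answers differ only when the inputs have different numbers of elements: zip stops at the shorter side, while zipAll pads the shorter side with the filler you pass in. A small sketch of both, with hypothetical object and method names:

```scala
object JoinColumns {
  // zip pairs elements by position and drops the unmatched tail.
  def joinTruncating(a: String, b: String): String =
    a.split(",").zip(b.split(",")).map { case (x, y) => x + y }.mkString(",")

  // zipAll pads the shorter side with "" so no element is lost.
  def joinPadding(a: String, b: String): String =
    a.split(",").zipAll(b.split(","), "", "").map { case (x, y) => x + y }.mkString(",")
}
```

With equal-length inputs such as the ones in the question, both methods give the same string, so the choice only matters if the column lists can get out of step.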
I am new to Scala and Spark and am trying to build on some samples I found. Essentially, I am trying to call a function from within a DataFrame to get the state from a zip code using the Google API.
I have the code working separately but not together ;(
Here is the error from the piece of code that is not working:
Exception in thread "main" java.lang.UnsupportedOperationException: Schema for type Unit is not supported
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:716)
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:654)
at org.apache.spark.sql.functions$.udf(functions.scala:2837)
at MovieRatings$.getstate(MovieRatings.scala:51)
at MovieRatings$$anonfun$4.apply(MovieRatings.scala:48)
at MovieRatings$$anonfun$4.apply(MovieRatings.scala:47)...
Line 51 starts with def getstate = udf {(zipcode:String)...
...
code:
userDF.createOrReplaceTempView("Users")
// SQL statements can be run by using the sql methods provided by Spark
val zipcodesDF = spark.sql("SELECT distinct zipcode, zipcode as state FROM Users")
// zipcodesDF.map(zipcodes => "zipcode: " + zipcodes.getAs[String]("zipcode") + getstate(zipcodes.getAs[String]("zipcode"))).show()
val colNames = zipcodesDF.columns
val cols = colNames.map(cName => zipcodesDF.col(cName))
val theColumn = zipcodesDF("state")
val mappedCols = cols.map(c =>
if (c.toString() == theColumn.toString()) getstate(c).as("transformed") else c)
val newDF = zipcodesDF.select(mappedCols:_*).show()
}
def getstate = udf { (zipcode: String) => {
    val url = "http://maps.googleapis.com/maps/api/geocode/json?address=" + zipcode
    val result = scala.io.Source.fromURL(url).mkString
    val address = parse(result)
    val shortnames = for {
      JObject(address_components) <- address
      JField("short_name", short_name) <- address_components
    } yield short_name
    val state = shortnames(3)
    //return state.toString()
    val stater = state.toString()
  }
}
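The "Schema for type Unit is not supported" message is consistent with how Scala types a block: its value is its last expression, and a block that ends with a val definition has type Unit. The function passed to udf above ends with val stater = ..., so it is a String => Unit function. A minimal sketch of the difference in plain Scala, with no Spark involved (the object and value names are hypothetical, and trim stands in for the real lookup):

```scala
object UnitPitfall {
  // The body ends with a `val` definition, so the function returns Unit.
  val endsWithVal = (zipcode: String) => {
    val stater = zipcode.trim
  }

  // Ending with an expression makes the function return that String.
  val endsWithExpr = (zipcode: String) => {
    val stater = zipcode.trim
    stater
  }
}
```

Applied to the UDF, that would mean finishing the body with stater (or state.toString()) on its own line rather than with an assignment.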
Thanks for the responses. I think I figured it out. Here is the code that works. One thing to note is that the Google API has restrictions, so some valid zip codes don't have state info; that's not an issue for me, though.
private def loaduserdata(spark: SparkSession): Unit = {
  import spark.implicits._
  // Create an RDD of User objects from a text file, convert it to a Dataframe
  val userDF = spark.sparkContext
    .textFile("examples/src/main/resources/users.csv")
    .map(_.split("::"))
    .map(attributes => users(attributes(0).trim.toInt, attributes(1), attributes(2).trim.toInt, attributes(3), attributes(4)))
    .toDF()
  // Register the DataFrame as a temporary view
  userDF.createOrReplaceTempView("Users")
  // SQL statements can be run by using the sql methods provided by Spark
  val zipcodesDF = spark.sql("SELECT distinct zipcode, substr(zipcode,1,5) as state FROM Users ORDER BY zipcode desc")
  // zipcodesDF.map(zipcodes => "zipcode: " + zipcodes.getAs[String]("zipcode") + getstate(zipcodes.getAs[String]("zipcode"))).show()
  val colNames = zipcodesDF.columns
  val cols = colNames.map(cName => zipcodesDF.col(cName))
  val theColumn = zipcodesDF("state")
  val mappedCols = cols.map(c =>
    if (c.toString() == theColumn.toString()) getstate(c).as("state") else c)
  val geoDF = zipcodesDF.select(mappedCols:_*)//.show()
  geoDF.createOrReplaceTempView("Geo")
}
val getstate = udf { (zipcode: String) =>
  val url = "http://maps.googleapis.com/maps/api/geocode/json?address=" + zipcode
  val result = scala.io.Source.fromURL(url).mkString
  val address = parse(result)
  val statenm = for {
    JObject(statename) <- address
    JField("types", JArray(types)) <- statename
    JField("short_name", JString(short_name)) <- statename
    if types.toString().equals("List(JString(administrative_area_level_1), JString(political))")
    // if types.head.equals("JString(administrative_area_level_1)")
  } yield short_name
  // The last expression is the UDF's return value; ending on a `val` would return Unit.
  if (statenm.isEmpty) "N/A" else statenm.head
}
I want to map an RDD by looking up values in another RDD, using this code:
val product = numOfT.map { case ((a, b), c) =>
  val h = keyValueRecords.lookup(b).take(1).mkString.toInt
  (a, (h * c))
}
a and b are Strings and c is an Int. keyValueRecords is an RDD[(String, String)].
I get a type mismatch error. What is my mistake, and how can I fix it?
This is a sample of data:
userId,movieId,rating,timestamp
1,16,4.0,1217897793
1,24,1.5,1217895807
1,32,4.0,1217896246
2,3,2.0,859046959
3,7,3.0,8414840873
I'm trying with this code:
val lines = sc.textFile("ratings.txt").map(s => {
  val substrings = s.split(",")
  (substrings(0), (substrings(1),substrings(1)))
})
val shoppingList = lines.groupByKey()
val coOccurence = shoppingList.flatMap { case (k, v) =>
  val arry1 = v.toArray
  val arry2 = v.toArray
  val pairs = for (pair1 <- arry1; pair2 <- arry2) yield ((pair1, pair2), 1)
  pairs.iterator
}
val numOfT = coOccurence.reduceByKey((a,b)=>(a+b)) // (((item,rate),(item,rate)),cooccurrence)
// produce recommendations for a specific user
val keyValueRecords = sc.textFile("ratings.txt").map(s => {
  val substrings = s.split(",")
  (substrings(0), (substrings(1),substrings(2)))
}).filter { case (k, v) => k == "1" }.groupByKey().flatMap { case (k, v) =>
  val arry1 = v.toArray
  val arry2 = v.toArray
  val pairs = for (pair1 <- arry1; pair2 <- arry2) yield ((pair1, pair2), 1)
  pairs.iterator
}
val numOfTForaUser = keyValueRecords.reduceByKey((a,b)=>(a+b))
val joined = numOfT.join(numOfTForaUser).map{case(k,v)=>(k._1._1,(k._2._2.toFloat*v._1.toFloat))}.collect.foreach(println)
The last RDD is not produced. Is something wrong?
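Regarding the keyValueRecords.lookup(b) call inside numOfT.map: independent of the exact types, Spark does not support invoking RDD operations such as lookup from inside another RDD's transformation. A common workaround, assuming keyValueRecords is small enough to fit on the driver, is to collect it first (for example with collectAsMap()) and look keys up in the resulting plain Map; a join is the usual alternative when it is not small. The sketch below shows only the lookup logic on ordinary Scala collections; the object name, the Record alias, and the decision to skip missing or non-numeric values are assumptions.

```scala
object LookupSketch {
  // Stand-in for one element of numOfT: ((a, b), c).
  type Record = ((String, String), Int)

  // `table` stands in for the collected contents of keyValueRecords.
  // Records whose key is missing, or whose value is not an integer, are skipped.
  def product(records: Seq[Record], table: Map[String, String]): Seq[(String, Int)] =
    records.flatMap { case ((a, b), c) =>
      table
        .get(b)
        .flatMap(v => scala.util.Try(v.trim.toInt).toOption)
        .map(h => (a, h * c))
    }
}
```

The same function body would work inside numOfT.map or flatMap once table is a local Map (optionally wrapped in a broadcast variable) rather than an RDD.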
I have used the cogroup function and obtained the following RDD:
org.apache.spark.rdd.RDD[(Int, (Iterable[(Int, Long)], Iterable[(Int, Long)]))]
Before the map operation, the joined object looks like this:
RDD[(Int, (Iterable[(Int, Long)], Iterable[(Int, Long)]))]
(-2095842000,(CompactBuffer((1504999740,1430096464017), (613904354,1430211912709), (-1514234644,1430288363100), (-276850688,1430330412225)),CompactBuffer((-511732877,1428682217564), (1133633791,1428831320960), (1168566678,1428964645450), (-407341933,1429009306167), (-1996133514,1429016485487), (872888282,1429031501681), (-826902224,1429034491003), (818711584,1429111125268), (-1068875079,1429117498135), (301875333,1429121399450), (-1730846275,1429131773065), (1806256621,1429135583312))))
(352234000,(CompactBuffer((1350763226,1430006650167), (-330160951,1430320010314)),CompactBuffer((2113207721,1428994842593), (-483470471,1429324209560), (1803928603,1429426861915))))
Now I want to do the following:
val globalBuffer = ListBuffer[Double]()
val joined = data1.cogroup(data2).map(x => {
  val listA = x._2._1.toList
  val listB = x._2._2.toList
  for (tupleB <- listB) {
    val localResults = ListBuffer[Double]()
    val itemToTest = Set(tupleB._1)
    val tempList = ListBuffer[(Int, Double)]()
    for (tupleA <- listA) {
      val tValue = someFunctionReturnDouble(tupleB._2, tupleA._2)
      val i = (tupleA._1, tValue)
      tempList += i
    }
    val sortList = tempList.sortWith(_._2 > _._2).slice(0,20).map(i => i._1)
    val intersect = sortList.toSet.intersect(itemToTest)
    if (intersect.size > 0)
      localResults += 1.0
    else localResults += 0.0
    val normalized = sum(localResults.toList)/localResults.size
    globalBuffer += normalized
  }
})
// method sum
def sum(xs: List[Double]): Double = xs.sum
At the end of this I was expecting joined to be a list of Double values, but when I looked at it, it was Unit. Also, I suspect this is not the Scala way of doing it. How do I obtain globalBuffer as the final result?
Hmm, if I understood your code correctly, it could benefit from these improvements:
val joined = data1.cogroup(data2).map(x => {
  val listA = x._2._1.toList
  val listB = x._2._2.toList
  val localResults = listB.map {
    case (intBValue, longBValue) =>
      val itemToTest = intBValue // it's always one element
      val tempList = listA.map {
        case (intAValue, longAValue) =>
          (intAValue, someFunctionReturnDouble(longBValue, longAValue))
      }
      // sortBy with a negated key sorts by score in descending order
      val sortList = tempList.sortBy(-_._2).slice(0, 20).map(i => i._1)
      if (sortList.toSet.contains(itemToTest)) { 1.0 } else { 0.0 }
      // no real need to convert to a set for 20 elements, by the way
  }
  sum(localResults) / localResults.size
})
Transformations of RDDs are not going to modify globalBuffer. Copies of globalBuffer are made and sent out to each of the workers, but any modifications to these copies on the workers will never modify the globalBuffer that exists on the driver (the one you have defined outside the map on the RDD.) Here's what I do (with a few additional modifications):
val joined = data1.cogroup(data2) map { x =>
  val iterA = x._2._1
  val iterB = x._2._2
  var count, positiveCount = 0
  val tempList = ListBuffer[(Int, Double)]()
  for (tupleB <- iterB) {
    tempList.clear
    for (tupleA <- iterA) {
      val tValue = someFunctionReturnDouble(tupleB._2, tupleA._2)
      tempList += ((tupleA._1, tValue))
    }
    val sortList = tempList.sortWith(_._2 > _._2).iterator.take(20)
    if (sortList.exists(_._1 == tupleB._1)) positiveCount += 1
    count += 1
  }
  positiveCount.toDouble/count
}
At this point you can obtain a local copy of the proportions by using joined.collect.
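If it helps to check the per-key logic without a cluster, it can be pulled out into a plain function over ordinary collections. This is only a sketch: the object name, the method name, and the score parameter (a stand-in for someFunctionReturnDouble) are hypothetical, and returning 0.0 for an empty second group is an assumption, since the version above would yield NaN in that case.

```scala
object HitRate {
  // Fraction of items in iterB whose id appears among the 20 best-scoring
  // items of iterA, where each item of iterA is scored against that item of iterB.
  def hitRate(
      iterA: Iterable[(Int, Long)],
      iterB: Iterable[(Int, Long)],
      score: (Long, Long) => Double
  ): Double = {
    val hits = iterB.count { case (idB, valB) =>
      iterA.toSeq
        .map { case (idA, valA) => (idA, score(valB, valA)) }
        .sortBy { case (_, s) => -s } // descending by score
        .take(20)
        .exists { case (idA, _) => idA == idB }
    }
    // Assumption: an empty iterB yields 0.0 instead of NaN.
    if (iterB.isEmpty) 0.0 else hits.toDouble / iterB.size
  }
}
```

On the cluster, something like data1.cogroup(data2).mapValues { case (a, b) => HitRate.hitRate(a, b, someFunctionReturnDouble) } would keep the key next to each proportion, whereas the map above returns the proportions alone.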