My objective is to read data from a CSV file and convert my RDD to a DataFrame in Scala/Spark. This is my code:
package xxx.DataScience.CompensationStudy
import org.apache.spark._
import org.apache.log4j._
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types._
import org.apache.spark.sql.SQLContext
object CompensationAnalysis {
case class GetDF(profil_date:String, profil_pays:String, param_tarif2:String, param_tarif3:String, dt_titre:String, dt_langues:String,
dt_diplomes:String, dt_experience:String, dt_formation:String, dt_outils:String, comp_applications:String,
comp_interventions:String, comp_competence:String)
def main(args: Array[String]) {
Logger.getLogger("org").setLevel(Level.ERROR)
val conf = new SparkConf().setAppName("CompensationAnalysis ")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val lines = sc.textFile("C:/Users/../Downloads/CompensationStudy.csv").flatMap { l =>
l.split(",") match {
case field: Array[String] if field.size > 13 => Some(field(0), field(1), field(2), field(3), field(4), field(5), field(6), field(7), field(8), field(9), field(10), field(11), field(12))
case field: Array[String] if field.size == 1 => Some((field(0), "default value"))
case _ => None
}
}
At this stage, I get the error: Product with Serializable does not take parameters
val summary = lines.collect().map(x => GetDF(x("profil_date"), x("profil_pays"), x("param_tarif2"), x("param_tarif3"), x("dt_titre"), x("dt_langues"), x("dt_diplomes"), x("dt_experience"), x("dt_formation"), x("dt_outils"), x("comp_applications"), x("comp_interventions"), x("comp_competence")))
val sum_df = summary.toDF()
sum_df.printSchema
}
}
Help, please?
You have several things you should improve. The most urgent problem, which causes the exception, is, as @CyrilleCorpet points out, that "the three different lines in the pattern matching return values of types Some[Tuple13], Some[Tuple2] and None.type. The least-upper-bound is then Option[Product with Serializable] which complies with flatMap's signature (where the result should be an Iterable[T]) modulo some implicit conversion."
Basically, if you had Some[Tuple13], Some[Tuple13], and None or Some[Tuple2], Some[Tuple2], and None, you would be better off.
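To see the same inference at work outside Spark, here is a small illustration (plain Scala 2, hypothetical values):

// Mixing tuples of different arities under Option: the compiler can only
// infer their least upper bound, which is not a usable element type.
val mixed = List(Some((1, "a")), Some(("b", 2, 3)), None)
// inferred type: List[Option[Product with Serializable]]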
Also, pattern matching on types is generally a bad idea because of type erasure, and pattern matching is not a great fit for your situation anyway.
So you could set default values in your case class:
case class GetDF(profil_date: String,
                 profil_pays: String = "default",
                 param_tarif2: String = "default",
                 ...
                )
Then in your lambda:
val tokens = l.split(",")
if (tokens.length > 13) {
  Some(GetDF(tokens(0), tokens(1), tokens(2), ...))
} else if (tokens.length == 1) {
  Some(GetDF(tokens(0)))
} else {
  None
}
Now in all cases you are returning Option[GetDF]. You can flatMap the RDD to get rid of all the Nones and keep only GetDF instances.
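Putting it together, a minimal sketch under those assumptions (comma-separated lines with no quoted commas, the path from your question, and GetDF with the default values suggested above):

val rows = sc.textFile("C:/Users/../Downloads/CompensationStudy.csv").flatMap { l =>
  val tokens = l.split(",")
  if (tokens.length > 13) // you may actually want >= 13 if valid rows have exactly 13 columns
    Some(GetDF(tokens(0), tokens(1), tokens(2), tokens(3), tokens(4), tokens(5), tokens(6),
               tokens(7), tokens(8), tokens(9), tokens(10), tokens(11), tokens(12)))
  else if (tokens.length == 1)
    Some(GetDF(tokens(0)))
  else
    None
}

// rows is an RDD[GetDF], so toDF() from sqlContext.implicits._ applies directly.
val sum_df = rows.toDF()
sum_df.printSchema()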
Related
I am trying to reproduce the Devign experiment. After joern was updated, some errors appeared in the graph-for-funcs.sc script that parses the .bin file into JSON. I modified parts of it, but there are still some errors, as shown below. Did you encounter similar errors, and how did you solve them?
graph-for-funcs.sc
import scala.jdk.CollectionConverters._
import io.circe.syntax._
import io.circe.generic.semiauto._
import io.circe.{Encoder, Json}
import io.shiftleft.semanticcpg.language.types.expressions.generalizations.CfgNode
import io.shiftleft.codepropertygraph.generated.EdgeTypes
import io.shiftleft.codepropertygraph.generated.NodeTypes
import io.shiftleft.codepropertygraph.generated.nodes
import io.shiftleft.dataflowengineoss.language._
import io.shiftleft.semanticcpg.language._
import io.shiftleft.semanticcpg.language.types.expressions.Call
import io.shiftleft.semanticcpg.language.types.structure.Local
import io.shiftleft.codepropertygraph.generated.nodes.MethodParameterIn
import overflowdb._
import overflowdb.traversal._
final case class GraphForFuncsFunction(function: String,
file: String,
id: String,
AST: List[nodes.AstNode],
CFG: List[nodes.AstNode],
PDG: List[nodes.AstNode])
final case class GraphForFuncsResult(functions: List[GraphForFuncsFunction])
implicit val encodeEdge: Encoder[Edge] =
(edge: Edge) =>
Json.obj(
("id", Json.fromString(edge.toString)),
("in", Json.fromString(edge.inNode.toString)),
("out", Json.fromString(edge.outNode.toString))
)
implicit val encodeNode: Encoder[nodes.AstNode] =
(node: nodes.AstNode) =>
Json.obj(
("id", Json.fromString(node.toString)),
("edges",
Json.fromValues((node.inE("AST", "CFG").l ++ node.outE("AST", "CFG").l).map(_.asJson))),
("properties", Json.fromValues(node.propertyMap.asScala.toList.map { case (key, value) =>
Json.obj(
("key", Json.fromString(key)),
("value", Json.fromString(value.toString))
)
}))
)
implicit val encodeFuncFunction: Encoder[GraphForFuncsFunction] = deriveEncoder
implicit val encodeFuncResult: Encoder[GraphForFuncsResult] = deriveEncoder
@main def main(): Json = {
GraphForFuncsResult(
cpg.method.map { method =>
val methodName = method.fullName
val methodId = method.toString
val methodFile = method.location.filename
val astChildren = method.astMinusRoot.l
val cfgChildren = method.out(EdgeTypes.CONTAINS).asScala.collect { case node: nodes.CfgNode => node }.toList
val local = new NodeSteps(
method
.out(EdgeTypes.CONTAINS)
.hasLabel(NodeTypes.BLOCK)
.out(EdgeTypes.AST)
.hasLabel(NodeTypes.LOCAL)
.cast[nodes.Local])
val sink = local.evalType(".*").referencingIdentifiers.dedup
val source = new NodeSteps(method.out(EdgeTypes.CONTAINS).hasLabel(NodeTypes.CALL).cast[nodes.Call]).nameNot("<operator>.*").dedup
val pdgChildren = sink
.reachableByFlows(source)
.l
.flatMap { path =>
path.elements
.map {
case trackingPoint @ (_: MethodParameterIn) => trackingPoint.start.method.head
case trackingPoint => trackingPoint.cfgNode
}
}
.filter(_.toString != methodId)
GraphForFuncsFunction(methodName, methodFile, methodId, astChildren, cfgChildren, pdgChildren.distinct)
}.l
).asJson
}
error
graph-for-funcs.sc:92: value evalType is not a member of io.shiftleft.semanticcpg.language.NodeSteps[io.shiftleft.codepropertygraph.generated.nodes.Local]
val sink = local.evalType(".*").referencingIdentifiers.dedup
^
graph-for-funcs.sc:93: value nameNot is not a member of io.shiftleft.semanticcpg.language.NodeSteps[io.shiftleft.codepropertygraph.generated.nodes.Call]
val source = new NodeSteps(method.out(EdgeTypes.CONTAINS).hasLabel(NodeTypes.CALL).cast[nodes.Call]).nameNot("<operator>.*").dedup
^
java.lang.RuntimeException: Compilation Failed
io.shiftleft.console.scripting.AmmoniteExecutor.$anonfun$runScript$7(AmmoniteExecutor.scala:50)
cats.effect.internals.IORunLoop$.liftedTree3$1(IORunLoop.scala:229)
cats.effect.internals.IORunLoop$.step(IORunLoop.scala:229)
cats.effect.IO.unsafeRunTimed(IO.scala:320)
cats.effect.IO.unsafeRunSync(IO.scala:239)
io.shiftleft.console.scripting.ScriptManager.runScript(ScriptManager.scala:130)
io.shiftleft.console.scripting.ScriptManager$CpgScriptRunner.runScript(ScriptManager.scala:64)
io.shiftleft.console.scripting.ScriptManager$CpgScriptRunner.runScript(ScriptManager.scala:54)
ammonite.$sess.cmd8$.<clinit>(cmd8.sc:1)
My objective is to run a number of Spark ML regression models (thousands of runs) on one dataset, and I want to do this using ZIO instead of Future because it is running too slowly. Below is a working example using Future.
A distinct list of keys is used to filter the partitioned dataset by key and run the model on each subset. I've set up a thread pool with 8 executors to manage it, but performance quickly degrades.
import scala.concurrent.{Await, ExecutionContext, ExecutionContextExecutorService, Future}
import java.util.concurrent.{Executors, TimeUnit}
import scala.concurrent.duration._
import org.apache.spark.sql.SaveMode
val pool = Executors.newFixedThreadPool(8)
implicit val xc: ExecutionContextExecutorService = ExecutionContext.fromExecutorService(pool)
case class Result(key: String, coeffs: String)
try {
import spark.implicits._
val tasks = {
for (x <- keys)
yield Future {
Seq(
Result(
x.group,
runModel(input.filter(col("group")===x)).mkString(",")
)
).toDS()
.write.mode(SaveMode.Overwrite).option("header", false).csv(
s"hdfs://namenode:8020/results/$x.csv"
)
}
}.toSeq
Await.result(Future.sequence(tasks), Duration.Inf)
}
finally {
pool.shutdown()
pool.awaitTermination(Long.MaxValue, TimeUnit.NANOSECONDS)
}
I've tried to implement this in ZIO, but I don't know how to implement queues and limit the number of concurrent executors the way I did with Futures.
Below is my failed attempt so far...
import zio._
import zio.console._
import zio.stm._
import org.apache.spark.sql.{Dataset, SaveMode, SparkSession}
import org.apache.spark.sql.functions.col
//example data/signatures
case class ModelResult(key: String, coeffs: String)
case class Data(key: String, sales: Double)
val keys: Array[String] = Array("100_1", "100_2")
def runModel[T](ds: Dataset[T]): Vector[Double] = ???
object MyApp1 extends App {
val spark = SparkSession
.builder()
.getOrCreate()
import spark.implicits._
val input: Dataset[Data] = Seq(Data("100_1", 1d), Data("100_2", 2d)).toDS
def run(args: List[String]): ZIO[ZEnv, Nothing, Int] = {
for {
queue <- Queue.bounded[Int](8)
_ <- ZIO.foreach(1 to 8) (i => queue.offer(i)).fork
_ <- ZIO.foreach(keys) { k => queue.take.flatMap(_ => readWrite(k, input, queue)) }
} yield 0
}
def writecsv(k: String, v: String) = {
Seq(ModelResult(k, v))
.toDS
.write
.mode(SaveMode.Overwrite).option("header", value = false)
.csv(s"hdfs://namenode:8020/results/$k.csv")
}
def readWrite[T](key: String, ds: Dataset[T], queue: Queue[Int]): ZIO[ZEnv, Nothing, Int] = {
(for {
result <- runModel(ds.filter(col("key")===key)).mkString(",")
_ <- writecsv(key, result)
_ <- queue.offer(1)
_ <- putStrLn(s"successfully wrote output for $key")
} yield 0)
}
}
//to run
MyApp1.run(List[String]())
What is the best way to handle this computation in ZIO?
To parallelize some workload across, say, 8 threads, all you need is:
ZIO.foreachParN(8)(1 to 100)(id => zio.blocking.blocking(Task{yourClusterJob(id)}))
But don't expect much of a boost from switching from Futures to ZIO here:
1) The actual workload dominates the coordination overhead, so the difference between ZIO and Future should be marginal.
2) You may not get any boost at all, because the 8 tasks will be competing for the same resource pool in the Spark cluster.
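Applied to the code in the question, a minimal sketch (ZIO 1.x) could look like the following; keys, input, runModel and writecsv are assumed to be the definitions from the attempt above:

import zio._
import zio.blocking._
import org.apache.spark.sql.functions.col

// Run at most 8 models at a time; each Spark call is wrapped in `blocking`
// so it executes on ZIO's blocking thread pool rather than the default one.
val program: ZIO[Blocking, Throwable, Unit] =
  ZIO.foreachParN(8)(keys.toList) { key =>
    blocking(Task {
      val coeffs = runModel(input.filter(col("key") === key)).mkString(",")
      writecsv(key, coeffs)
    })
  }.unit

// From non-ZIO code it can be executed with Runtime.default.unsafeRun(program).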
I'm trying to order my list and get the biggest 5 tuples in it, which will then be printed out. Here's the code that I have been working with:
import scala.io.Codec.string2codec
import scala.io.Source
import scala.reflect.io.File
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
object SimpleWordCount {
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("Simple Word Count")
val sc = new SparkContext(conf)
val test = scala.io.Source.fromFile("/home/cloudera/Books/book1.txt").getLines
val wordCount =
test
.flatMap(_.split("\\W+"))
.foldLeft(Map.empty[String, Int]) {
(count, word) =>
count + (word -> (count.getOrElse(word, 0) + 1))
}
val formatteWordCount =
filtered
.map(tuple => s"${tuple._1} -> ${tuple._2}")
.mkString("\n", "\n", "\n")
When trying to launch the code, the following lines give the error:
diverging implicit expansion for type scala.math.Ordering[B]
starting with method Tuple9 in object Ordering
.sortBy(x => (x._2))
I also tried using .stableSort(k, (x, y) => x._2 < y._2), which gave the error value stableSort is not a member of String,
and .maxBy(_._2), which gave the error diverging implicit expansion for type Ordering[B]
starting with method Tuple9 in object Ordering
println(s"Final Word Count: $formatteWordCount")
}
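For reference, a minimal sketch of the ordering step, assuming wordCount is the Map[String, Int] built above:

// Turn the map into (word, count) pairs, sort by count descending, keep the top five.
val top5 = wordCount.toSeq.sortBy { case (_, count) => -count }.take(5)

val formattedTop5 = top5
  .map { case (word, count) => s"$word -> $count" }
  .mkString("\n", "\n", "\n")

println(s"Top 5 words: $formattedTop5")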
I have been playing around with classes, but I keep getting the error above when I try to implement them. This is my code:
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
class EdgeProperties()
case class WriterWriterProperties(weight: String, edgeType: String) extends EdgeProperties
object GraphXAnalysis2 {
val edgeWeightedWriterWriterCollaborated = "in/Graphs/Graph4_WriterWriter/EdgesWeightedWriterWriter_writerscollaborated.csv"
val vertexWriterWriter = "in/Graphs/Graph4_WriterWriter/Vertices.csv"
val conf = new SparkConf().setAppName("Music Graph Application").setMaster("local[1]")
val sc = new SparkContext(conf)
val WriterWriter: RDD[(VertexId, String)] = sc.textFile(vertexWriterWriter).map {
line =>
val row = line.split(",")
(row(0).toLong, row(2))
}
val edgesWriterWriterCollaborated: RDD[Edge[EdgeProperties]] = sc.textFile(edgeWeightedWriterWriterCollaborated).map {
line =>
val row = line.split(",")
Edge(row(0).toLong, row(1).toLong, WriterWriterProperties(row(2), row(3)): EdgeProperties)
}
val graph4 = Graph(WriterWriter, edgesWriterWriterCollaborated)
Am I declaring the class incorrectly, using it incorrectly, or putting it in the wrong place? Thank you so much, as I am completely new to this.
This is my idea
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD
import java.io.File
object pizD {
def filePath = {
new File(this.getClass.getClassLoader.getResource("wikipedia/wikipedia.dat").toURI).getPath
}
def regex(line: String): pichA = {
......
......
pichA(t1, t2)
}
}
case class pichA(t1: String, t2: String)
object dushP {
val conf = new SparkConf()
val sc = new SparkContext(conf)
val mirdd: RDD[pichA] = ???
How can I integrate sc.textFile with my methods filePath and regex? I want to combine them in order to get a new RDD.
val baseRDD = sc.textFile(pizD.filePath).filter { line =>
  val value = pizD.regex(line)
  value != null
}
Assuming pizD.filePath gives you the file path as a string and regex() returns a null value when the regex doesn't match. If that understanding is correct, then the code above should do the trick.
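If you also want the parsed values rather than the raw lines (the mirdd: RDD[pichA] from the question), one way, still assuming regex returns null on a non-match, is to map first and then filter:

// Parse every line, keep only the successfully parsed pichA instances.
val mirdd: RDD[pichA] = sc.textFile(pizD.filePath)
  .map(pizD.regex)
  .filter(_ != null)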