Not able to see RDD contents - Scala

I am using Scala to create an RDD, but when I try to see its contents I get the following result:
MapPartitionsRDD[25] at map at <console>:96
How can I see the actual contents of the RDD?
Below is my Scala code:
object WordCount {
  def main(args: Array[String]): Unit = {
    val textfile = sc.textFile("/user/cloudera/xxx/File")
    val word = textfile.filter(x => x.length > 0).map(_.split('|'))
    println(word)
  }
}

println(word) only prints the RDD's toString. You need to call an action to materialize the contents, e.g. use RDD.collect:
object WordCount {
  def main(args: Array[String]): Unit = {
    val textfile = sc.textFile("/user/cloudera/xxx/File")
    val word = textfile.filter(x => x.length > 0).map(_.split('|'))
    word.collect().foreach(println)
  }
}
Since split produces an Array per line, collect() here gives an Array[Array[String]]; flatten it before foreach if you want one element per output line:
word.collect().flatten.foreach(println)
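If the file is large, collect() pulls the entire RDD to the driver; for a quick look, a minimal sketch using take on the same word RDD:
// Bring only the first few records to the driver instead of the whole RDD
word.take(5).foreach(arr => println(arr.mkString("|")))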

Related

Efficient way to collect HashSet during map operation on some Dataset

I have a big dataset to transform from one structure to another. During that phase I also want to collect some info about a computed field (quadkeys for the given lat/longs). I don't want to attach this info to every result row, since that would duplicate a lot of information and add memory overhead. All I need to know is which particular quadkeys are touched by the given coordinates. Is there any way to do this within one job, so the dataset isn't iterated twice?
def load(paths: Seq[String]): (Dataset[ResultStruct], Dataset[String]) = {
  val df = sparkSession.sqlContext.read.format("com.databricks.spark.csv")
    .option("header", "true")
    .schema(schema)
    .option("delimiter", "\t")
    .load(paths: _*)
    .as[InitialStruct]

  val qkSet = mutable.HashSet.empty[String]
  val result = df.map(c => {
    val id = c.id
    val points = toPoints(c.geom)
    points.foreach(p => qkSet.add(Quadkey.get(p.lat, p.lon, 6).getId))
    createResultStruct(id, points)
  })
  return (result, ???) // some Dataset created from the qkSets collected on all executors
}
You could use an accumulator:
import java.util.concurrent.ConcurrentHashMap
import org.apache.spark.util.AccumulatorV2

class SetAccumulator[T] extends AccumulatorV2[T, Set[T]] {
  import scala.collection.JavaConverters._

  // backing store for the accumulated elements
  private val items = new ConcurrentHashMap[T, Boolean]

  override def isZero: Boolean = items.isEmpty

  override def copy(): AccumulatorV2[T, Set[T]] = {
    val other = new SetAccumulator[T]
    other.items.putAll(items)
    other
  }

  override def reset(): Unit = items.clear()

  override def add(v: T): Unit = items.put(v, true)

  override def merge(other: AccumulatorV2[T, Set[T]]): Unit = other match {
    case setAccumulator: SetAccumulator[T] => items.putAll(setAccumulator.items)
  }

  override def value: Set[T] = items.keys().asScala.toSet
}
import org.apache.spark.sql.Row
import spark.implicits._ // needed for toDF and the String encoder used by map

val df = Seq("foo", "bar", "foo", "foo").toDF("test")
val acc = new SetAccumulator[String]
spark.sparkContext.register(acc)

df.map {
  case Row(str: String) =>
    acc.add(str)
    str
}.count()

println(acc.value)
Prints
Set(bar, foo)
Note that map itself is lazy, so something like count is needed to actually force the calculation. Depending on the real use case, another option would be to cache the data frame and just use plain DataFrame operations such as df.select("test").distinct().
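For completeness, a minimal sketch of that cache-and-distinct alternative, reusing the df defined above (the column name "test" comes from the example; the rest is an assumption about the surrounding code):
// Cache the DataFrame so the main transformation and the distinct pass reuse the same data
df.cache()

// Collect the distinct values of the "test" column to the driver as a Set[String]
val distinctValues: Set[String] =
  df.select("test").distinct().collect().map(_.getString(0)).toSet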

type mismatch; found : Unit required: Array[org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]]

Why does the following code have a compilation error at the return statement?
def getData(queries: Array[String]): Dataset[Row] = {
  val res = spark.read.format("jdbc").jdbc(jdbcUrl, "", props).registerTempTable("")
  return res
}
Error,
type mismatch; found : Unit required: Array[org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]]
Scala version 2.11.11
Spark version 2.0.0
EDIT:
Actual case
def getDataFrames(queries: Array[String]) = {
  val jdbcResult = queries.map(query => {
    val tablename = extractTableName(query)
    if (tablename.contains("1")) {
      spark.sqlContext.read.format("jdbc").jdbc(jdbcUrl1, query, props)
    } else {
      spark.sqlContext.read.format("jdbc").jdbc(jdbcUrl2, query, props)
    }
  })
}
Here I want to return the combined output from the iteration as an Array[Dataset[Row]] or Array[DataFrame] (but DataFrame is not available as a separate class in 2.0.0). Does the above code do that, or how can I do it?
You can return a list of DataFrames as List[DataFrame]:
def getData(queries: Array[String]): List[DataFrame] = {
  val res = spark.read.format("jdbc").jdbc(jdbcUrl, "", props)
  // create multiple DataFrames from your queries
  val df1 = ???
  val df2 = ???
  val list = List(df1, df2)
  // you can build the list dynamically from the list of queries
  list
}
registerTempTable returns Unit, so you should remove the registerTempTable call and return the DataFrame itself, or a list of DataFrames.
UPDATE:
Here is how you can return a list of DataFrames from a list of queries:
def getDataFrames(queries: Array[String]): Array[DataFrame] = {
  val jdbcResult = queries.map(query => {
    val tablename = extractTableName(query)
    val dataframe = if (tablename.contains("1")) {
      spark.sqlContext.read.format("jdbc").jdbc("", query, prop)
    } else {
      spark.sqlContext.read.format("jdbc").jdbc("", query, prop)
    }
    dataframe
  })
  jdbcResult
}
I hope this helps!
It's clear from the error message that there is a type mismatch in your function.
The registerTempTable() API creates an in-memory table scoped to the current session; it stays accessible as long as the SparkSession is active.
Its return type is Unit, which you can confirm in the API docs.
Change your code to the following to remove the error:
def getData(queries: Array[String]): Unit = {
  val res = spark.read.format("jdbc").jdbc(jdbcUrl, "", props).registerTempTable("")
}
An even better way would be to write the code as follows:
val tempName: String = "Name_Of_Temp_View"
spark.read.format("jdbc").jdbc(jdbcUrl, "", props).createOrReplaceTempView(tempName)
Use createOrReplaceTempView(), since registerTempTable() has been deprecated since Spark 2.0.0.
An alternate solution, per your requirement:
def getData(queries: Array[String], spark: SparkSession): Array[DataFrame] = {
  spark.read.format("jdbc").jdbc(jdbcUrl, "", props).createOrReplaceTempView("Name_Of_Temp_Table")
  val result: Array[DataFrame] = queries.map(query => spark.sql(query))
  result
}
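A hypothetical usage of this alternate version (the query strings and some_column are placeholders, not from the original question); each query must refer to the registered temp view:
val queries = Array(
  "SELECT * FROM Name_Of_Temp_Table WHERE some_column IS NOT NULL", // some_column is a placeholder
  "SELECT COUNT(*) AS cnt FROM Name_Of_Temp_Table"
)
val frames: Array[DataFrame] = getData(queries, spark)
frames.foreach(_.show())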

How to convert my PySpark code into Scala?

I am a Scala beginner. Now I have to convert some code I wrote in PySpark to Scala. The code just extracts fields for modeling.
Could someone show me how to write the following code in Scala, or at least point me to where I could find a quick answer? Thanks so much!
Here is my previous code:
val records = rawdata.map(x => x.split(","))
val data = records.map(r => LabeledPoint(extract_label(r), extract_features(r)))
...

def extract_features(record):
    return np.array(map(float, record[2:16]))

def extract_label(record):
    return float(record[16])
It goes like this:
scala> def extract_label(record: Array[String]): Float = { record(16).toFloat }
extract_label: (record: Array[String])Float
scala> def extract_features(record: Array[String]): Array[Float] = { val newArray = new Array[Float](14); for(i <- 2 until 16) newArray(i-2)=record(i).toFloat; newArray;}
extract_features: (record: Array[String])Array[Float]
There may be a more direct way to express the above logic.
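For example, a sketch of a more compact version using slice and map (not tested against the asker's data):
// Elements 2 to 15 inclusive, matching Python's record[2:16]
def extract_features(record: Array[String]): Array[Float] =
  record.slice(2, 16).map(_.toFloat)

def extract_label(record: Array[String]): Float =
  record(16).toFloat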
Test:
scala> records.map(x => extract_label(x)).take(5).foreach(println)
4.9
scala> records.map(x => extract_features(x).mkString(",")).take(5).foreach(println)
6.4,2.5,4.5,2.8,4.7,2.5,6.4,8.5,3.5,6.4,2.9,10.5,6.4,2.2

Why does mapPartitions print nothing to stdout?

I have this code in Scala:
import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {

  def myf(x: Iterator[(String, Int)]): Iterator[(String, Int)] = {
    while (x.hasNext) {
      println(x.next)
    }
    x
  }

  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)

    val tx1 = sc.textFile("/home/paourissi/Desktop/MyProject/data/testfile1.txt")
    val file1 = tx1.flatMap(line => line.split(" ")).map(word => (word, 1))

    val s = file1.mapPartitions(x => myf(x))
  }
}
I am trying to figure out why it doesn't print anything to the output. I run this on a local machine, not on a cluster.
You only have transformations, no actions. Spark will not execute until an action is called. Add this line to print the first 10 of your results:
s.take(10).foreach(println)
mapPartitions is a transformation, and thus lazy.
If you add an action at the end, the whole expression will be evaluated. Try adding s.count at the end.
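A minimal sketch, reusing the s defined in the question:
// count() is an action, so it forces the partitions to be processed and the println
// calls inside myf to run (in local mode they appear on the driver's console).
// Note that myf exhausts the iterator before returning it, so the count itself is 0.
s.count()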

sortByKey in Spark

New to Spark and Scala. Trying to sort a word counting example. My code is based on this simple example.
I want to sort the results alphabetically by key. If I add the key sort to an RDD:
val wordCounts = names.map((_, 1)).reduceByKey(_ + _).sortByKey()
then I get a compile error:
error: No implicit view available from java.io.Serializable => Ordered[java.io.Serializable].
[INFO] val wordCounts = names.map((_, 1)).reduceByKey(_ + _).sortByKey()
I don't know what the lack of an implicit view means. Can someone tell me how to fix it? I am running the Cloudera 5 Quickstart VM. I think it bundles Spark version 0.9.
Source of the Scala job
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SparkWordCount {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("Spark Count"))
    val files = sc.textFile(args(0)).map(_.split(","))

    def f(x: Array[String]) = {
      if (x.length > 3)
        x(3)
      else
        Array("NO NAME")
    }

    val names = files.map(f)
    val wordCounts = names.map((_, 1)).reduceByKey(_ + _).sortByKey()

    System.out.println(wordCounts.collect().mkString("\n"))
  }
}
Some (unsorted) output
("INTERNATIONAL EYELETS INC",879)
("SHAQUITA SALLEY",865)
("PAZ DURIGA",791)
("TERESSA ALCARAZ",824)
("MING CHAIX",878)
("JACKSON SHIELDS YEISER",837)
("AUDRY HULLINGER",875)
("GABRIELLE MOLANDS",802)
("TAM TACKER",775)
("HYACINTH VITELA",837)
"No implicit view" means there is no Scala function like this defined:
implicit def SerializableToOrdered(x: java.io.Serializable) = new Ordered[java.io.Serializable](x) // note: this function doesn't actually work
The reason this error appears is that your function returns two different types whose common supertype is java.io.Serializable (one branch returns a String, the other an Array[String]). Also, sortByKey requires an implicit Ordering on the key, which java.io.Serializable does not have. Fix it like this:
object SparkWordCount {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("Spark Count"))
    val files = sc.textFile(args(0)).map(_.split(","))

    def f(x: Array[String]) = {
      if (x.length > 3)
        x(3)
      else
        "NO NAME"
    }

    val names = files.map(f)
    val wordCounts = names.map((_, 1)).reduceByKey(_ + _).sortByKey()

    System.out.println(wordCounts.collect().mkString("\n"))
  }
}
Now the function returns only Strings instead of two different types, so the key type is String and sortByKey can find its implicit Ordering.
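To make that concrete, a sketch of the fixed pipeline with the type spelled out (same code as above, only the annotation is added):
// names is RDD[String] after the fix, so the pairs are RDD[(String, Int)]
// and sortByKey can resolve the implicit Ordering[String]
val wordCounts: org.apache.spark.rdd.RDD[(String, Int)] =
  names.map((_, 1)).reduceByKey(_ + _).sortByKey()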