Passing a list of paths to sc.textFile (Scala)

I'm looking for a way to pass a list of paths to sc.textFile (in Scala) without using foreach.
Example:
myList: Seq[String] = ArrayBuffer(path1, path2, path3)
Is there a way to do:
var data = sc.textFile(myList)

Try
var data = sc.textFile(myList.mkString(","))
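This works because the path string is passed down to Hadoop's FileInputFormat, which accepts a comma-separated list of paths (and globs). A minimal sketch, using made-up paths:

// Hypothetical paths; any mix of files, directories and globs should work
val myList: Seq[String] = Seq("/data/path1", "/data/path2", "/data/path3")

// Equivalent to sc.textFile("/data/path1,/data/path2,/data/path3")
val data = sc.textFile(myList.mkString(","))
data.count()  // textFile is lazy, so a missing path only fails once an action runs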
Alternatively, we can read each text file, then union the resulting RDDs:
import scala.util.{Try, Success}

val rdds = myList.flatMap { f =>
  Try(sc.textFile(f)) match {
    case Success(rdd) => Some(rdd)
    case _ => None
  }
}
val rdd = sc.union(rdds)

Related

How can I split a string of a DataFrame schema into individual StructTypes

I want to split the schema of a DataFrame into a collection. I am trying this, but the schema is printed out as a string. Is there any way I can split it into a collection per StructType so that I can manipulate it (for example, take only the array columns from the output)? I am trying to flatten a complex multi-level struct + array DataFrame.
import org.apache.spark.sql.functions.explode
import org.apache.spark.sql._

val test = sqlContext.read.json(sc.parallelize(Seq("""{"a":1,"b":[2,3],"d":[2,3]}""")))
test.printSchema

val flattened = test.withColumn("b", explode($"d"))
flattened.printSchema

def identifyArrayColumns(dataFrame: DataFrame) = {
  val output = for (d <- dataFrame.collect()) yield {
    d.schema
  }
  output.toList
}
identifyArrayColumns(test)
Output currently is
identifyArrayColumns: (dataFrame: org.apache.spark.sql.DataFrame)List[org.apache.spark.sql.types.StructType]
res58: List[org.apache.spark.sql.types.StructType] = List(StructType(StructField(a,LongType,true), StructField(b,ArrayType(LongType,true),true), StructField(d,ArrayType(LongType,true),true)))
It is one full string, so I cannot filter only the array columns. For example, if I do a foreach(println), I get only one line:
scala> output.foreach(println)
StructType(StructField(a,LongType,true), StructField(b,ArrayType(LongType,true),true), StructField(d,ArrayType(LongType,true),true))
What I want is each StructType as a separate element in a collection.
You can simply filter the fields of the DataFrame's schema for fields with type array - no need to inspect the DataFrame's data for this:
def identifyArrayColumns(schema: StructType): List[StructField] = {
  schema.fields.filter(_.dataType.typeName == "array").toList
}
NOTE that this is a "shallow" solution that only returns the array fields directly under "root". If you also want to find arrays nested within arrays / maps / structs, you'd need to recursively traverse the schema and produce the filtered result, something like:
// can be converted into a tail-recursive method by adding another argument to accumulate results
def identifyArrayColumns(schema: StructType): List[StructField] = {
  val arrays = schema.fields.filter(_.dataType.typeName == "array").toList
  val deeperArrays = schema.fields.flatMap {
    case StructField(_, s: StructType, _, _) => identifyArrayColumns(s)
    case _ => List()
  }
  arrays ++ deeperArrays
}
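For completeness, a quick usage sketch against the test DataFrame from the question (either version gives the same result here, since this schema has no nested structs):

val arrayFields = identifyArrayColumns(test.schema)
// List(StructField(b,ArrayType(LongType,true),true), StructField(d,ArrayType(LongType,true),true))
arrayFields.map(_.name)  // List(b, d)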

Looping through Map Spark Scala

Within this code we have two files: athletes.csv, which contains names, and twitter.test, which contains the tweet messages. We want to find, for every single line in twitter.test, the name that matches a name in athletes.csv. We applied a map function to store the names from athletes.csv and want to check every name against every line in the test file.
object twitterAthlete {

  def loadAthleteNames(): Map[String, String] = {
    // Handle character encoding issues:
    implicit val codec = Codec("UTF-8")
    codec.onMalformedInput(CodingErrorAction.REPLACE)
    codec.onUnmappableCharacter(CodingErrorAction.REPLACE)

    // Create a Map of Ints to Strings, and populate it from u.item.
    var athleteInfo: Map[String, String] = Map()
    //var movieNames:Map[Int, String] = Map()
    val lines = Source.fromFile("../athletes.csv").getLines()
    for (line <- lines) {
      var fields = line.split(',')
      if (fields.length > 1) {
        athleteInfo += (fields(1) -> fields(7))
      }
    }
    return athleteInfo
  }

  def parseLine(line: String): (String) = {
    var athleteInfo = loadAthleteNames()
    var hello = new String
    for ((k, v) <- athleteInfo) {
      if (line.toString().contains(k)) {
        hello = k
      }
    }
    return (hello)
  }

  def main(args: Array[String]) {
    Logger.getLogger("org").setLevel(Level.ERROR)
    val sc = new SparkContext("local[*]", "twitterAthlete")
    val lines = sc.textFile("../twitter.test")
    var athleteInfo = loadAthleteNames()
    val splitting = lines.map(x => x.split(";")).map(x => if (x.length == 4 && x(2).length <= 140) x(2))
    var hello = new String()
    val container = splitting.map(x => for ((key, value) <- athleteInfo) if (x.toString().contains(key)) { key }).cache
    container.collect().foreach(println)
    // val mapping = container.map(x => (x,1)).reduceByKey(_+_)
    //mapping.collect().foreach(println)
  }
}
The first file looks like:
id,name,nationality,sex,height........
001,Michael,USA,male,1.96 ...
002,Json,GBR,male,1.76 ....
003,Martin,female,1.73 . ...
The second file looks like:
time, id , tweet .....
12:00, 03043, some message that contain some athletes names , .....
02:00, 03023, some message that contain some athletes names , .....
something like this ...
But I got an empty result after running this code; any suggestions are much appreciated.
The result I got is empty:
()....
()...
()...
But the result I expected is something like:
(name,1)
(other name,1)
You need to use yield to return a value from the for comprehension inside your map:
val container = splitting.map(x => for ((key, value) <- athleteInfo if x.toString().contains(key)) yield (key, 1)).cache
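To get from there to the (name, 1) counts you expected, a rough sketch (reusing the names from your code; the for comprehension yields a collection of pairs per tweet, so flatten before counting):

// Each element of `container` is a (possibly empty) collection of (key, 1) pairs
val counts = container
  .flatMap(identity)
  .reduceByKey(_ + _)

counts.collect().foreach(println)  // e.g. (Michael,1), (Json,2), ...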
I think you should just start with the simplest option first...
I would use DataFrames so you can use the built-in CSV parsing and leverage Catalyst, Tungsten, etc.
Then you can use the built-in Tokenizer to split the tweets into words, explode, and do a simple join. Depending on how big or small the athlete-name data is, you'll end up with a more optimized broadcast join and avoid a shuffle.
import org.apache.spark.sql.functions._
import org.apache.spark.ml.feature.Tokenizer
val tweets = spark.read.format("csv").load(...)
val athletes = spark.read.format("csv").load(...)
val tokenizer = new Tokenizer()
tokenizer.setInputCol("tweet")
tokenizer.setOutputCol("words")
val tokenized = tokenizer.transform(tweets)
val exploded = tokenized.withColumn("word", explode('words))
val withAthlete = exploded.join(athletes, 'word === 'name)
withAthlete.select(exploded("id"), 'name).show()
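If the athletes DataFrame is small but the optimizer doesn't pick a broadcast join on its own, you can also hint it explicitly (broadcast comes from org.apache.spark.sql.functions, already imported above); this is just an optional variation on the join line:

val withAthlete = exploded.join(broadcast(athletes), 'word === 'name)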

Scala : map Dataset[Row] to Dataset[Row]

I am trying to use Scala to transform a dataset with arrays into a dataset with a label and vectors, before feeding it into a machine learning algorithm.
So far, I have succeeded in adding a double label, but I'm stuck on the vectors part. Below is the code to create the vectors:
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.linalg.SQLDataTypes.VectorType
import org.apache.spark.sql.types.{DataTypes, StructField}
import org.apache.spark.sql.{Dataset, Row, _}
import spark.implicits._

def toVectors(withLabelDs: Dataset[Row]) = {
  val allLabel = withLabelDs.count()
  var countLabel = 0
  val newDataset: Dataset[Row] = withLabelDs.map((line: Row) => {
    println("schema line {}", line.schema)
    //StructType(
    //  StructField(label,DoubleType,false),
    //  StructField(code,ArrayType(IntegerType,true),true),
    //  StructField(score,ArrayType(IntegerType,true),true))
    val label = line.getDouble(0)
    val indicesList = line.getList(1)
    val indicesSize = indicesList.size
    val indices = new Array[Int](indicesSize)
    val valuesList = line.getList(2)
    val values = new Array[Double](indicesSize)
    var i = 0
    while (i < indicesSize) {
      indices(i) = indicesList.get(i).asInstanceOf[Int] - 1
      values(i) = valuesList.get(i).asInstanceOf[Int].toDouble
      i += 1
    }
    var r: Row = null
    try {
      r = Row(label, Vectors.sparse(195, indices, values))
      countLabel += 1
    } catch {
      case e: IllegalArgumentException =>
        println("something went wrong with label {} / indices {} / values {}", label, indices, values)
        println("", e)
    }
    println("Still {} labels to process", allLabel - countLabel)
    r
  })
  newDataset
}
With this code, I got this error :
Unable to find encoder for type stored in a Dataset.
Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._
Support for serializing other types will be added in future releases.
val newDataset: Dataset[Row] = withLabelDs.map((line: Row) => {
So naturally, I changed my code
def toVectors(withLabelDs: Dataset[Row]) = {
...
}, Encoders.bean(Row.getClass))
newDataset
}
But I got this error :
error: overloaded method value map with alternatives:
[U](func: org.apache.spark.api.java.function.MapFunction[org.apache.spark.sql.Row,U],
encoder: org.apache.spark.sql.Encoder[U])org.apache.spark.sql.Dataset[U]
<and>
[U](func: org.apache.spark.sql.Row => U)
(implicit evidence$6: org.apache.spark.sql.Encoder[U])org.apache.spark.sql.Dataset[U]
cannot be applied to (org.apache.spark.sql.Row => org.apache.spark.sql.Row, org.apache.spark.sql.Encoder[?0])
val newDataset: Dataset[Row] = withLabelDs.map((line: Row) => {
How can I make this work? That is, how do I get back a Dataset[Row] containing Vectors?
Two things:
.map has the signature (T => U)(implicit Encoder[U]): Dataset[U], but it looks like you are calling it as though it were (T => U, Encoder[U]): Dataset[U], which is slightly different. Instead of .map(f, encoder), try .map(f)(encoder).
Also, I doubt Encoders.bean(Row.getClass) will work, since Row is not a bean. Some quick googling turned up RowEncoder, which looks like it should work, but I couldn't find much documentation about it.
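For what it's worth, a rough sketch of the RowEncoder route (assuming Spark 2.x, where RowEncoder lives in org.apache.spark.sql.catalyst.encoders and is built from the output schema; the column names here are illustrative):

import org.apache.spark.ml.linalg.{SQLDataTypes, Vectors}
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}
import org.apache.spark.sql.{Dataset, Row}

// Schema of the rows produced inside map: a double label plus a Vector column
val outputSchema = StructType(Seq(
  StructField("label", DoubleType, nullable = false),
  StructField("features", SQLDataTypes.VectorType, nullable = true)
))

def toVectors(withLabelDs: Dataset[Row]): Dataset[Row] =
  withLabelDs.map((line: Row) => {
    val label = line.getDouble(0)
    // Same conversion as the question's while loop, written with getSeq
    val indices = line.getSeq[Int](1).map(_ - 1).toArray
    val values = line.getSeq[Int](2).map(_.toDouble).toArray
    Row(label, Vectors.sparse(195, indices, values))
  })(RowEncoder(outputSchema))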
The error message is unfortunately quite poor. import spark.implicits._ is only correct in the spark-shell. What it actually means is to import <Spark Session object>.implicits._; spark just happens to be the variable name used for the SparkSession object in the spark-shell.
You can access the SparkSession from a Dataset, so at the top of your method you can add the import:
def toVectors(withLabelDs: Dataset[Row]) = {
  val sparkSession = withLabelDs.sparkSession
  import sparkSession.implicits._
  // rest of the method code
}

Add scoped variable per row iteration in Apache Spark

I'm reading multiple HTML files into a DataFrame in Spark.
I'm converting elements of the HTML to columns in the DataFrame using a custom UDF:
val dataset = spark
  .sparkContext
  .wholeTextFiles(inputPath)
  .toDF("filepath", "filecontent")
  .withColumn("biz_name", parseDocValue(".biz-page-title")('filecontent))
  .withColumn("biz_website", parseDocValue(".biz-website a")('filecontent))
  ...

def parseDocValue(cssSelectorQuery: String) =
  udf((html: String) => Jsoup.parse(html).select(cssSelectorQuery).text())
This works perfectly; however, each withColumn call results in parsing the HTML string again, which is redundant.
Is there a way (without using lookup tables or the like) to generate one parsed Document (Jsoup.parse(html)) from the "filecontent" column per row and make it available to all withColumn calls on the DataFrame?
Or shouldn't I even try using DataFrames and just use RDDs?
So the final answer was in fact quite simple:
Just map over the rows and create the objects once there:
def docValue(cssSelectorQuery: String, attr: Option[String] = None)(implicit document: Document): Option[String] = {
  val domObject = document.select(cssSelectorQuery)
  val domValue = attr match {
    case Some(a) => domObject.attr(a)
    case None => domObject.text()
  }
  domValue match {
    case x if x == null || x.isEmpty => None
    case y => Some(y)
  }
}

val dataset = spark
  .sparkContext
  .wholeTextFiles(inputPath, minPartitions = 265)
  .map {
    case (filepath, filecontent) => {
      implicit val document = Jsoup.parse(filecontent)
      val customDataJson = docJson(filecontent, customJsonRegex)

      DataEntry(
        biz_name = docValue(".biz-page-title"),
        biz_website = docValue(".biz-website a"),
        url = docValue("meta[property=og:url]", attr = Some("content")),
        ...
        filename = Some(fileName(filepath)),
        fileTimestamp = Some(fileTimestamp(filepath))
      )
    }
  }
  .toDS()
I'd probably rewrite it as follows, to do the parsing and selecting in one go and put them in a temporary column:
val dataset = spark
  .sparkContext
  .wholeTextFiles(inputPath)
  .toDF("filepath", "filecontent")  // convert to a DataFrame first, as in the question
  .withColumn("temp", parseDocValue(Array(".biz-page-title", ".biz-website a"))('filecontent))
  .withColumn("biz_name", col("temp")(0))
  .withColumn("biz_website", col("temp")(1))
  .drop("temp")

def parseDocValue(cssSelectorQueries: Array[String]) =
  udf((html: String) => {
    val j = Jsoup.parse(html)
    cssSelectorQueries.map(query => j.select(query).text())
  })
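If the positional lookups like col("temp")(0) feel fragile, one possible variation (just a sketch; SelectorResult and its field names are made up here) is to have the UDF return a case class, so the temporary column becomes a struct with named fields:

case class SelectorResult(bizName: String, bizWebsite: String)

val parseDocValues = udf((html: String) => {
  val doc = Jsoup.parse(html)
  SelectorResult(doc.select(".biz-page-title").text(), doc.select(".biz-website a").text())
})

val withNamedFields = spark.sparkContext
  .wholeTextFiles(inputPath)
  .toDF("filepath", "filecontent")
  .withColumn("temp", parseDocValues('filecontent))
  .withColumn("biz_name", col("temp.bizName"))
  .withColumn("biz_website", col("temp.bizWebsite"))
  .drop("temp")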

Creating Dataframe from XML parsed by scalaxb

I can successfully parse XML data dropped into a directory by using the Spark streaming fileStream method, and I can write the resulting RDDs out to a text file just fine:
val fStream = {
  ssc.fileStream[LongWritable, Text, XmlInputFormat](
    WATCHDIR, xmlFilter _, newFilesOnly = false, conf = hadoopConf)
}

fStream.foreachRDD(rdd =>
  if (rdd.count() == 0) {
    logger.info("No files..")
  })

val dStream = fStream.map { case (x, y) =>
  logger.info("Hello from the dStream")
  logger.info(y.toString)
  scalaxb.fromXML[Music](scala.xml.XML.loadString(y.toString))
}

dStream.foreachRDD(rdd => rdd.saveAsTextFile("file:///tmp/xmlout"))
The trouble is when I want to convert the RDDs to DataFrames in order to either register them as a temp table or saveAsParquetFile.
This code:
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
dStream.foreachRDD(rdd => rdd.distinct().toDF().printSchema())
Results in this error:
java.lang.UnsupportedOperationException: Schema for type scalaxb.DataRecord[scala.Any] is not supported
I would have thought that since scalaxb generates case classes for my records, it would be simple for Spark to infer the schema using reflection, and I can see that this is what it's trying to do, except that Spark doesn't support the scalaxb.DataRecord type. Are there any Spark or scalaxb experts who have ideas on how to make the case classes generated by scalaxb compatible with Spark?
BTW, here are the generated classes from scalaxb:
package generated
case class Song(attributes: Map[String, scalaxb.DataRecord[Any]] = Map()) {
lazy val title = attributes.get("#title") map { _.as[String] }
lazy val length = attributes.get("#length") map { _.as[String] }
}
case class Album(song: Seq[generated.Song] = Nil,
description: String,
attributes: Map[String, scalaxb.DataRecord[Any]] = Map()) {
lazy val title = attributes.get("#title") map { _.as[String] }
}
case class Artist(album: Seq[generated.Album] = Nil,
attributes: Map[String, scalaxb.DataRecord[Any]] = Map()) {
lazy val name = attributes.get("#name") map { _.as[String] }
}
case class Music(artist: Seq[generated.Artist] = Nil)
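One direction that might work (a hedged sketch only, not a tested answer): since it's the Map[String, scalaxb.DataRecord[Any]] attributes that Spark can't encode, map the parsed objects into flat case classes containing only plain fields before calling toDF. The FlatSong class below is made up for illustration:

// Illustrative flat shape; keep only the fields you actually need
case class FlatSong(artist: Option[String], album: Option[String],
                    title: Option[String], length: Option[String])

val flatStream = dStream.flatMap { music =>
  for {
    artist <- music.artist
    album  <- artist.album
    song   <- album.song
  } yield FlatSong(artist.name, album.title, song.title, song.length)
}

flatStream.foreachRDD { rdd =>
  // requires the `import sqlContext.implicits._` from above
  val df = rdd.toDF()
  df.printSchema()
}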