I have these two lines (among all the others):
import scala.io.Source
val source = Source.fromFile(filename)
As I understand it, this is a way to read file content. I have read
http://www.scala-lang.org/api/2.12.x/scala/io/Source.html#iter:Iterator[Char]
I still do not get what Source.fromFile represents. Is it one of the Type Members, or something else?
As stated in the Scala API linked above, fromFile is a method defined on the Source companion object. It is a curried method: the first parameter list takes a single String representing the path of the file to be read, and the second parameter list takes a single implicit codec argument of type scala.io.Codec. The method returns a BufferedSource object.
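For illustration, a small sketch (the file name data.txt is a placeholder) showing how those two parameter lists line up in practice:

import scala.io.{Codec, Source}

// Signature, as in the Scala API:
//   def fromFile(name: String)(implicit codec: Codec): BufferedSource
val source = Source.fromFile("data.txt")(Codec.UTF8) // codec supplied explicitly instead of implicitly
val text = source.mkString                           // read the whole file into a String
source.close()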
Based on this description of datasets and dataframes I wrote this very short test code which works.
import org.apache.spark.sql.functions._
val thing = Seq("Spark I am your father", "May the spark be with you", "Spark I am your father")
val wordsDataset = sc.parallelize(thing).toDS()
If that works... why does running this give me a
error: value toDS is not a member of org.apache.spark.rdd.RDD[org.apache.spark.sql.catalog.Table]
import org.apache.spark.sql.functions._
val sequence = spark.catalog.listDatabases().collect().flatMap(db =>
spark.catalog.listTables(db.name).collect()).toSeq
val result = sc.parallelize(sequence).toDS()
toDS() is not a member of RDD[T]. Welcome to the bizarre world of Scala implicits, where nothing is what it seems to be.
toDS() is a member of DatasetHolder[T]. In SparkSession, there is an object called implicits. When brought into scope with an expression like import spark.implicits._, an implicit method called rddToDatasetHolder becomes available for resolution:
implicit def rddToDatasetHolder[T](rdd: RDD[T])(implicit arg0: Encoder[T]): DatasetHolder[T]
When you call rdd.toDS(), the compiler first searches the RDD class and all of its superclasses for a method called toDS(). It doesn't find one, so it starts searching the compatible implicits in scope. While doing so, it finds the rddToDatasetHolder method, which accepts an RDD instance and returns an object of a type that does have a toDS() method. Basically, the compiler rewrites:
sc.parallelize(sequence).toDS()
into
spark.implicits.rddToDatasetHolder(sc.parallelize(sequence)).toDS()
Now, if you look at rddToDatasetHolder itself, it has two argument lists:
(rdd: RDD[T])
(implicit arg0: Encoder[T])
Implicit arguments in Scala are optional and if you do not supply the argument explicitly, the compiler searches the scope for implicits that match the required argument type and passes whatever object it finds or can construct. In this particular case, it looks for an instance of the Encoder[T] type. There are many predefined encoders for the standard Scala types, but for most complex custom types no predefined encoders exist.
So, in short: The existence of a predefined Encoder[String] makes it possible to call toDS() on an instance of RDD[String], but the absence of a predefined Encoder[org.apache.spark.sql.catalog.Table] makes it impossible to call toDS() on an instance of RDD[org.apache.spark.sql.catalog.Table].
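One possible workaround (a sketch, not part of the original question's code) is to map each Table to a tuple of standard types first, since encoders for those are predefined:

import spark.implicits._

val sequence = spark.catalog.listDatabases().collect().flatMap(db =>
  spark.catalog.listTables(db.name).collect()).toSeq

// database, name and tableType are all Strings, so Encoder[(String, String, String)] is predefined
val result = sc.parallelize(sequence.map(t => (t.database, t.name, t.tableType))).toDS()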
By the way, SparkSession.implicits contains the implicit class StringToColumn which has a $ method. This is how the $"foo" expression gets converted to a Column instance for column foo.
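For example (someDf and the column name origin are placeholders here), these two selections are equivalent once the implicits are in scope:

import spark.implicits._
import org.apache.spark.sql.functions.col

someDf.select($"origin")     // $"origin" goes through StringToColumn to build a Column
someDf.select(col("origin")) // equivalent, without the $ interpolator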
Resolving all the implicit arguments and implicit transformations is why compiling Scala code is so dang slow.
I am new to Scala and trying to grasp the language fundamentals. I have working knowledge of Spark with the Java API.
I am having a hard time understanding some Scala code, and therefore I am not able to write the same in Java. I got this piece of code from https://learn.microsoft.com/en-us/azure/cosmos-db/spark-connector
// Import Necessary Libraries
import com.microsoft.azure.cosmosdb.spark.schema._
import com.microsoft.azure.cosmosdb.spark._
import com.microsoft.azure.cosmosdb.spark.config.Config
// Read Configuration
val readConfig = Config(Map(
"Endpoint" -> "https://doctorwho.documents.azure.com:443/",
"Masterkey" -> "YOUR-KEY-HERE",
"Database" -> "DepartureDelays",
"Collection" -> "flights_pcoll",
"query_custom" -> "SELECT c.date, c.delay, c.distance, c.origin, c.destination FROM c WHERE c.origin = 'SEA'" // Optional
))
// Connect via azure-cosmosdb-spark to create Spark DataFrame
val flights = spark.read.cosmosDB(readConfig)
flights.count()
As far as I know, the read method returns an object of type org.apache.spark.sql.DataFrameReader, and this does not have any method cosmosDB(), so how is this code working? Also, how do I convert this code to Java?
Thank You
What you are seeing is the magic of Scala implicit conversions. The compiler sees that you intend to call the cosmosDB method of a DataFrameReader and that there's no method of that name with the proper signature, as you note.
When you
import com.microsoft.azure.cosmosdb.spark.schema._
you also import the contents of the package object (current git commit as of this writing, last updated in 2017 so it's stable code). The relevant bit that gets imported is
implicit def toDataFrameReaderFunctions(dfr: DataFrameReader): DataFrameReaderFunctions
An implicit def which takes one argument signals to the compiler that, if this def is in scope, it may insert a call to this method when:
it has a DataFrameReader
a method is being called which is not a member of DataFrameReader
com.microsoft.azure.cosmosdb.spark.schema.DataFrameReaderFunctions has a member with the desired name and signature
Since DataFrameReaderFunctions has a method cosmosDB, the compiler then translates your code to
toDataFrameReaderFunctions(spark.read).cosmosDB(readConfig)
This general approach of using an implicit conversion to make it look like you're adding methods to a type without modifying the type is called enrichment or an extension method. Implicit conversions in general should probably be avoided: they very often make code hard to follow, and an errant implicit conversion in scope can make code you didn't intend to compile actually compile. For an enrichment like this, there's an alternative: use an implicit class, where the compiler essentially autogenerates the implicit conversion for you, but only to add methods; it can't be abused as a general conversion, e.g. to pass an Int where a String is expected.
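As a rough sketch of the implicit class alternative (the class and method names below are made up for illustration, not from the connector):

import org.apache.spark.sql.{DataFrame, DataFrameReader}

object ReaderEnrichment {
  // Hypothetical enrichment: adds a csvWithHeader method to DataFrameReader
  implicit class RichDataFrameReader(val dfr: DataFrameReader) extends AnyVal {
    def csvWithHeader(path: String): DataFrame =
      dfr.option("header", "true").csv(path)
  }
}

// usage (spark is a SparkSession, the path is a placeholder):
// import ReaderEnrichment._
// val df = spark.read.csvWithHeader("/data/flights.csv")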
I have a folder on HDFS which, for whatever reason, contains part-files with commas in their names. For instance:
hdfs://namespace/mypath/1-1,123
hdfs://namespace/mypath/1-2,124
hdfs://namespace/mypath/1-3,125
The issue is that I want to read only some of the part-files at a time, to avoid overloading my cluster, meaning that I want to read the 1-1,123 and 1-2,124 files.
However, when the path is fed to Spark as:
sc.textFile("hdfs://namespace/mypath/1-1,123,hdfs://namespace/mypath/1-2,124")
Spark obviously seems to just tokenize on ",", thereby assuming I'm looking for 4 separate files.
Is there a way to escape the commas in the path?
Is the only option to rename the source files?
SparkContext.textFile eventually calls FileInputFormat.setInputPaths(Job job, String commaSeparatedPaths), which simply splits the input String of comma-separated paths on ,:
Sets the given comma separated paths as the list of inputs for the map-reduce job.
One way to bypass this limitation is to use the alternative signature of setInputPaths: FileInputFormat.setInputPaths(Job job, Path... inputPaths), which takes a vararg of Path objects. This way, there is no need to split on , and thus no confusion is possible.
To do that, we have to create our own textFile method which does the exact same thing as SparkContext.textFile: building a HadoopRDD, but this time with the input provided as a Seq of Strings instead of a single String:
package org.apache.spark
import org.apache.spark.rdd.{RDD, HadoopRDD}
import org.apache.spark.util.SerializableConfiguration
import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.fs.Path
object TextFileOverwrite {

  implicit class SparkContextExtension(val sc: SparkContext) extends AnyVal {

    def textFile(
        paths: Seq[String],
        minPartitions: Int = sc.defaultMinPartitions
    ): RDD[String] = {

      val confBroadcast =
        sc.broadcast(new SerializableConfiguration(sc.hadoopConfiguration))

      val setInputPathsFunc =
        (jobConf: JobConf) =>
          FileInputFormat.setInputPaths(jobConf, paths.map(p => new Path(p)): _*)

      new HadoopRDD(
        sc,
        confBroadcast,
        Some(setInputPathsFunc),
        classOf[TextInputFormat],
        classOf[LongWritable],
        classOf[Text],
        minPartitions
      ).map(pair => pair._2.toString)
    }
  }
}
which can be used this way:
import org.apache.spark.TextFileOverwrite.SparkContextExtension
sc.textFile(Seq("path/hello,world.txt", "path/hello_world.txt"))
Compared to SparkContext.textFile, the only difference in the implementation is the call to FileInputFormat.setInputPaths which takes Paths in input instead of a comma-separated String.
Note that I use the package org.apache.spark to store this function, because SerializableConfiguration has the visibility private[spark] in spark's code base.
Also note the use of an implicit class on SparkContext which allows us to implicitly attach this additional textFile method directly to the SparkContext object and thus to call it using sc.textFile() instead of having to pass the sparkContext as a parameter of the method.
Also note that I would have preferred giving Seq[Path] instead of Seq[String] as an input of this method, but Path is not yet Serializable in the current version of hadoop-common used by Spark (it will become Serializable starting version 3 of hadoop-common).
Use filename globbing, assuming that this gives you unique files:
sc.textFile("hdfs://namespace/mypath/1-1?123,hdfs://namespace/mypath/1-2?124")
Doesn't work if you only want the first one of these and not the other two:
hdfs://namespace/mypath/1-1,123
hdfs://namespace/mypath/1-1:123
hdfs://namespace/mypath/1-1.123
I was going to suggest this:
sc.textFile("hdfs://namespace/mypath/1-1[,]123, ...
And I think that's supposed to work. Looking at the code for org.apache.hadoop.mapred.FileInputFormat#getPathStrings though makes me suspicious. It looks like that function specifically looks for commas inside curly braces, and will fail if you put a comma inside [,].
Syntax problem when recursively deleting files in Scala
Files.walk(path, FileVisitOption.FOLLOW_LINKS)
.sorted(Comparator.reverseOrder())
.forEach(Files.deleteIfExists)
The issue is that you're trying to pass a Scala-style function to a method expecting a Java-8-style function. There are a couple of libraries out there that can do the conversion, or you could write it yourself (it's not complicated), but probably the simplest is to just convert the Java collection to a Scala collection that has a foreach method expecting a Scala-style function as an argument:
import scala.collection.JavaConverters._
Files.walk(path, FileVisitOption.FOLLOW_LINKS)
.sorted(Comparator.reverseOrder())
.iterator().asScala
.foreach(Files.deleteIfExists)
In Scala 2.12 I expect this should work:
...forEach(Files.deleteIfExists(_: Path))
The reason you need to specify the argument type is that the expected type is Consumer[_ >: Path], not Consumer[Path] as it would be in Scala.
If it doesn't work (can't test at the moment), try
val deleteIfExists: Consumer[Path] = Files.deleteIfExists(_)
...forEach(deleteIfExists)
Before Scala 2.12, Joe K's answer is the correct one.
I am new to Scala programming and I wanted to read a properties file in Scala.
I can't find any APIs to read a property file in Scala.
Please let me know if there is an API for this, or another way to read properties files in Scala.
Besides the Java API, there is a library by Typesafe called config with a good API for working with configuration files of different types.
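A minimal sketch of how that library is typically used (the file application.conf and the key app.name are placeholders):

import com.typesafe.config.ConfigFactory

val conf = ConfigFactory.load()          // by default loads application.conf from the classpath
val appName = conf.getString("app.name") // throws ConfigException.Missing if the key is absent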
You will have to do it in a similar way to how you would convert a Scala Map to a java.util.Map. java.util.Properties extends java.util.Hashtable, which extends java.util.Dictionary.
scala.collection.JavaConverters has functions to convert to and from Dictionary and Scala's mutable.Map:
import java.util.Properties

val x = new Properties
//load from .properties file here.
import scala.collection.JavaConverters._
scala> x.asScala
res4: scala.collection.mutable.Map[String,String] = Map()
You can then use the Map above to get and set values. But if you wish to convert it back to a Properties instance (to store it back, etc.), you might have to do that manually.
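A rough sketch of doing that manually, continuing from the x above (the output file name is a placeholder):

// copy the converted mutable.Map back into a fresh Properties before storing it
val copy = new java.util.Properties()
x.asScala.foreach { case (k, v) => copy.setProperty(k, v) }
copy.store(new java.io.FileOutputStream("app-copy.properties"), "copied settings")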
You can just use the Java API.
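For instance (a sketch; the file name and key are placeholders), without any Scala conversion at all:

import java.io.FileInputStream
import java.util.Properties

val props = new Properties()
props.load(new FileInputStream("app.properties"))
val url = props.getProperty("db.url", "fallback-value") // second argument is a default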
Consider something along these lines:
import scala.io.Source

def getPropertyX: Option[String] = Source.fromFile(fileName)
.getLines()
.find(_.startsWith("propertyX="))
.map(_.replace("propertyX=", ""))