Spark/Scala n00bie here...
I've got a large set of documents whose pages are stored as individual TIF images. I need to convert the individual TIFs and concatenate them into a single PDF document, i.e. 1.tif, 2.tif, 3.tif -> 123.pdf.
I've been investigating using Spark to accomplish this. I create the initial RDD like this:
val inputTifRDD = sc.binaryFiles("file:///some/path/to/lots/of/*tif")
with inputTifRDD comprised of tuples of the form:
(fullFilePath: String, data:org.apache.spark.input.PortableDataStream )
I then apply a custom map function that converts each tif into a pdf and returns an RDD comprised of tuples of the form:
(fullFilePath: String, data:com.itextpdf.text.Document)
I now want to apply an action to this RDD to concatenate the PDFs into a single PDF. I don't think reducing is possible, because the concatenation is not commutative - the order of pages matters.
My question is: how do I perform the concatenation of the elements in the RDD in the correct order? The filenames contain the page number, so this information can be used. Alternatively, is there another, better/more efficient way to do the conversion/concatenation with Spark?
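For reference, the kind of ordering I have in mind looks roughly like this (just a sketch; it assumes the page number is the numeric part of the file name, and inputPdfRDD is a hypothetical name for the mapped RDD described above):
val pageNumber = (path: String) => path.split("/").last.stripSuffix(".tif").toInt
val orderedPages = inputPdfRDD
  .sortBy { case (path, _) => pageNumber(path) } // order pages by the number parsed from the file name
  .collect()                                     // bring the pages to one place to build the single PDF
But I'm not sure whether this is the idiomatic way to do it.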
I have a use case where I need to create a DataFrame from an array.
I've created a DataFrame by reading a CSV, and then I am using a map to process/transform it further.
var mapTransform = df1.collect.map( line => {
  // line.split(",") logic for field separation
  // transformation logic here for various fields
  (field1 + "," + field2 + "," + field3)
})
From this I get an array (Array[String]) holding the transformed result.
I want to further convert it into a DataFrame with separate columns, so that it can later be written to a DB or a file; however, I am facing an issue. Is it possible to do this? Any solutions?
This does your job:
spark.sparkContext.parallelize(mapTransform.toSeq)
But note that you should avoid methods that produce a non-RDD result (such as collect), because they load all of the contents onto a single node, which is inefficient in the general case.
Also, by convention, turn vars into vals wherever possible.
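If what you ultimately need is a DataFrame with separate columns rather than a plain RDD of strings, a minimal sketch (assuming each string carries three comma-separated fields and a SparkSession named spark) could look like:
import spark.implicits._

val resultDF = spark.sparkContext
  .parallelize(mapTransform.toSeq)
  .map(_.split(","))
  .map(a => (a(0), a(1), a(2)))       // adjust to the actual number of fields
  .toDF("field1", "field2", "field3") // hypothetical column names
resultDF can then be written to a database or a file with the usual DataFrame writers.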
I am using Spark to read multiple parquet files into a single RDD, using standard wildcard path conventions. In other words, I'm doing something like this:
val myRdd = spark.read.parquet("s3://my-bucket/my-folder/**/*.parquet")
However, sometimes these Parquet files will have different schemas. When I'm doing my transforms on the RDD, I can try to differentiate between them in the map functions by looking for the existence (or absence) of certain columns. However, a surefire way to know which schema a given row in the RDD uses - and the way I'm asking about specifically here - is to know which file path I'm looking at.
Is there any way, on an RDD level, to tell which specific parquet file the current row came from? So imagine my code looks something like this, currently (this is a simplified example):
val mapFunction = new MapFunction[Row, (String, Row)] {
  override def call(row: Row): (String, Row) = myJob.transform(row)
}
val pairRdd = myRdd.map(mapFunction, encoder = kryo[(String, Row)])
Within the myJob.transform( ) code, I'm decorating the result with other values, converting it to a pair RDD, and doing some other transforms as well.
I make use of the row.getAs( ... ) method to look up particular column values, and that's a really useful method. I'm wondering if there are any similar methods (e.g. row.getInputFile( ) or something like that) to get the name of the specific file that I'm currently operating on?
Since I'm passing in wildcards to read multiple parquet files into a single RDD, I don't have any insight into which file I'm operating on. If nothing else, I'd love a way to decorate the RDD rows with the input file name. Is this possible?
You can add a new column for the file name as shown below
import org.apache.spark.sql.functions._
val myDF = spark.read.parquet("s3://my-bucket/my-folder/**/*.parquet")
  .withColumn("inputFile", input_file_name())
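Once the column is there, the file path can be read per row like any other value, e.g. with the row.getAs method mentioned in the question (a sketch, reusing the names from above):
val rowsWithSource = myDF.rdd.map { row =>
  val sourceFile = row.getAs[String]("inputFile") // full path of the parquet file this row came from
  (sourceFile, row)
}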
I have a Dataset/DataFrame with an mllib.linalg.Vector (of Doubles) as one of its columns. I would like to add another column of type ml.linalg.Vector to this dataset (so I will have both types of vectors). The reason is that I am evaluating a few algorithms, some of which expect an mllib vector and some an ml vector. Also, I have to feed the output of one algorithm into another, and each uses a different type.
Can someone please help me convert an mllib.linalg.Vector to an ml.linalg.Vector and append it as a new column to the dataset at hand? I tried using MLUtils.convertVectorColumnsToML() inside a UDF and in regular functions, but was not able to get it working. I am trying to avoid creating a new dataset and then doing an inner join and dropping the columns, as the dataset will eventually be huge and joins are expensive.
You can use the asML method to convert from an mllib to an ml vector. A UDF and usage example can look like this:
val convertToML = udf((mllibVec: org.apache.spark.mllib.linalg.Vector) => {
  mllibVec.asML
})
val df2 = df.withColumn("mlVector", convertToML($"mllibVector"))
Here df is assumed to be the original dataframe, and the column holding the mllib vector is assumed to be named mllibVector.
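If you would rather stick with MLUtils.convertVectorColumnsToML (which converts the columns you pass in, in place, rather than adding new ones), one possible alternative is to duplicate the column first and convert only the duplicate, so both vector types stay in the dataset. A sketch, assuming the same names as above:
import org.apache.spark.mllib.util.MLUtils

// copy the mllib column, then convert only the copy to an ml vector
val df2 = MLUtils.convertVectorColumnsToML(
  df.withColumn("mlVector", $"mllibVector"),
  "mlVector"
)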
I am trying to retrieve the value of a DataFrame column and store it in a variable. I tried this:
val name=df.select("name")
val name1=name.collect()
But none of the above returns the value of the column "name".
Spark version: 2.2.0
Scala version: 2.11.11
There are a couple of things here. If you want to see all the data, collect is the way to go. However, if your data is too huge, it will cause the driver to fail.
So the alternative is to check a few items from the dataframe. What I generally do is
df.limit(10).select("name").as[String].collect()
This will output 10 elements, but the output doesn't look very nice.
So, the second alternative is
df.select("name").show(10)
This will print the first 10 elements. Sometimes, if the column values are big, it puts "..." instead of the actual value, which is annoying.
Hence there is a third option:
df.select("name").take(10).foreach(println)
This takes 10 elements and prints them.
Now, in all these cases you won't get a fair sample of the data, as the first 10 rows will always be picked. So, to sample truly at random from the dataframe, you can use
df.select("name").sample(true, 0.2).show(10)
or
df.select("name").sample(true, 0.2).take(10).foreach(println)
You can check the "sample" function on the dataframe.
The first will do :)
val name = df.select("name") will return another DataFrame. You can, for example, do name.show() to show the content of the DataFrame. You can also do collect or collectAsMap to materialize the results on the driver, but be aware that the amount of data should not be too big for the driver.
You can also do:
val names = df.select("name").as[String].collect()
This will return an array of the names in this DataFrame.
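And if the goal is literally a single value stored in a variable (say, the first name in the column), a small sketch:
val firstName: String = df.select("name").head().getString(0)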
When I retrieve data from a text file using an RDD in Spark, it looks like the retrieved lines are separated from each other.
What I want is for the RDD to combine them and treat them as if I had parallelized a single string.
e.g., from:
sc.textFile("sample.txt") // content: List("abc", "def") - one element per line
To:
sc.parallelize("abcdef") // content: "abcdef"
This should be done because the whole data is too big to fit in memory with reduce, but it still needs to be processed as a whole (in parallel of course, but without the line separations).
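For reference, the naive version I'm trying to avoid looks like this (a sketch; besides needing the joined string to fit on the driver, reduce assumes a commutative operator, so the line order isn't even guaranteed):
val joined = sc.textFile("sample.txt").reduce(_ + _) // entire contents materialized as one driver-side string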