What is a LabeledPoint RDD? How to print data in it? - pyspark

I create a LabeledPoint RDD by mapping labels and feature sets. Now I want to print out the data in LabeledPoint format; how can I do that?
What is a LabeledPoint? Is it a list of tuples containing a key (label) and a list (features)?
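For reference, a LabeledPoint (pyspark.mllib.regression.LabeledPoint) is not a tuple but an object with two attributes: a double label and a features vector. A minimal sketch of building such an RDD and printing its contents, assuming an existing SparkContext sc (the sample data here is made up):

from pyspark.mllib.regression import LabeledPoint

# Each element is LabeledPoint(label, features); features can be a plain list,
# a NumPy array, or a pyspark.mllib.linalg vector.
labeledRDD = sc.parallelize([
    LabeledPoint(1.0, [0.5, 1.2, 3.4]),
    LabeledPoint(0.0, [0.1, 0.0, 2.2]),
])

# Bring a few elements back to the driver and print them.
for lp in labeledRDD.take(2):
    print(lp.label, lp.features)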

Related

Extract info from some columns, store into dataframe after flattening

I have one dataframe. My use case is to extract info from some columns, flatten it based on the column type, and store the result in a dataframe. What is an efficient way to do this, and how?
Example: we have a dataframe with, say, 5 columns: a (string), b (string), c (string), d (JSON value), e (string). In the transformation I want to extract, or flatten, some records from column d (from the JSON value), so that each value from the JSON becomes one row in the resultant dataframe.
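One possible approach, sketched in PySpark and assuming Spark 2.1+ and that the JSON in column d is an array of objects with a known layout (the field names k and v and the dataframe name df below are illustrative, not from the question):

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StructType, StructField, StringType

# Assumed shape of the JSON stored in column d: an array of {"k": ..., "v": ...} objects.
json_schema = ArrayType(StructType([
    StructField("k", StringType()),
    StructField("v", StringType()),
]))

flattened = (
    df.withColumn("d_parsed", F.from_json("d", json_schema))   # parse the JSON string
      .withColumn("d_item", F.explode("d_parsed"))             # one row per JSON array element
      .select("a", "b", "c", "e", "d_item.k", "d_item.v")
)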

Fetch distinct values from column in file to create RDD

I am new to Pyspark. I need to find distinct values from a certain column in an RDD.
I have a comma delimited .txt file with no column headers on S3.
rddDistinct = sc.textFile(fileLocation).map(lambda x: x[2])
print(rddDistinct.take(10))
What am I doing wrong? Eventually, I would like to store the resulting RDD in S3 (haven't gotten there yet). If the file exists in S3, I would like to overwrite it.
You need to add .distinct() after your map function.
rddDistinct = sc.textFile(fileLocation).map(lambda x: x[2]).distinct()
print(rddDistinct.take(10))
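One more note: each element produced by sc.textFile is a whole line (a string), so x[2] selects the third character, not the third column; if the goal is the third comma-delimited column, split the line first. A sketch covering that plus the overwrite-to-S3 part of the question, assuming an active SparkSession named spark and a hypothetical output path outputPath:

rddDistinct = (sc.textFile(fileLocation)
               .map(lambda line: line.split(",")[2])   # third comma-delimited column
               .distinct())
print(rddDistinct.take(10))

# RDD.saveAsTextFile cannot overwrite an existing path, so one option is to go
# through the DataFrame writer, which supports mode("overwrite").
df = spark.createDataFrame(rddDistinct.map(lambda v: (v,)), ["value"])
df.write.mode("overwrite").csv(outputPath)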

Getting the value of a DataFrame column in Spark

I am trying to retrieve the value of a DataFrame column and store it in a variable. I tried this:
val name=df.select("name")
val name1=name.collect()
But none of the above is returning the value of column "name".
Spark version: 2.2.0
Scala version: 2.11.11
There are a couple of things here. If you want to see all the data, collect is the way to go. However, if your data is too large, it will cause the driver to fail.
So the alternative is to check a few items from the dataframe. What I generally do is
df.limit(10).select("name").as[String].collect()
This gives 10 elements as output, but the raw array doesn't read well.
So the second alternative is
df.select("name").show(10)
This prints the first 10 elements. Sometimes, if the column values are long, it shows "..." instead of the actual value, which is annoying.
Hence there is a third option:
df.select("name").take(10).foreach(println)
This takes 10 elements and prints them.
In all of these cases you won't get a fair sample of the data, since only the first 10 rows are picked. To truly pick randomly from the dataframe you can use
df.select("name").sample(withReplacement = true, fraction = 0.2).show(10)
or
df.select("name").sample(.2, true).take(10).foreach(println)
You can check the "sample" function on dataframe
The first will do :)
val name = df.select("name") will return another DataFrame. You can do, for example, name.show() to show the content of the DataFrame. You can also do collect or collectAsMap to materialize the results on the driver, but be aware that the amount of data should not be too big for the driver.
You can also do:
val names = df.select("name").as[String].collect()
This will return an array of the names in this DataFrame.

Add list as column to Dataframe in pyspark

I have a list of integers and a sqlcontext dataframe with the number of rows equal to the length of the list. I want to add the list as a column to this dataframe maintaining the order. I feel like this should be really simple but I can't find an elegant solution.
You cannot simply add a list as a dataframe column, since a list is a local object and a dataframe is distributed. You can try one of the following approaches (a sketch of the second one is shown below):
convert the dataframe to a local collection with collect() or toLocalIterator() and, for each row, add the corresponding value from the list, OR
convert the list to a dataframe with an extra key column (matching keys from the dataframe) and then join the two
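A minimal sketch of the second approach in PySpark, assuming a SparkSession named spark, an existing dataframe df, and a Python list my_list of the same length (these names, and the column name new_col, are placeholders). Both sides are keyed by position with zipWithIndex and then joined; note this only "maintains the order" if the dataframe's current ordering is itself deterministic:

# (position, Row) pairs from the dataframe, (position, value) pairs from the list.
df_keyed = df.rdd.zipWithIndex().map(lambda kv: (kv[1], kv[0]))
list_keyed = spark.sparkContext.parallelize(list(enumerate(my_list)))

# Join on the positional key and rebuild rows with the extra column appended.
joined = df_keyed.join(list_keyed).sortByKey().values()
result = spark.createDataFrame(
    joined.map(lambda rv: tuple(rv[0]) + (rv[1],)),
    df.columns + ["new_col"],
)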

Bulk File Conversion with Apache Spark

Spark/Scala n00bie here...
I've got a large set of documents whose pages are stored as individual tif images. I need to convert and concatenate the individual tifs into a single PDF document, e.g. 1.tif, 2.tif, 3.tif -> 123.pdf
I've been investigating using Spark to accomplish this. I create the initial RDD like this:
val inputTifRDD = sc.binaryFiles("file:///some/path/to/lots/of/*.tif")
with inputTifRDD comprised of tuples of the form:
(fullFilePath: String, data:org.apache.spark.input.PortableDataStream )
I then apply a custom map function that converts each tif into a pdf and returns an RDD comprised of tuples of the form:
(fullFilePath: String, data:com.itextpdf.text.Document)
I now want to apply an action to this RDD to concatenate the PDFs into a single PDF. I don't think reducing is possible, because the concatenation is not commutative - the order of pages matters.
My question is: how do I effect the concatenation of elements in the RDD in the correct order? The filenames contain the page number, so this information can be used. Alternatively - is there another, better/more efficient way to do the conversion/concatenation with Spark?