How to import and read data from a dataset without using transform or transform_df in Palantir Foundry? - pyspark

I want to know whether there are any ways to import a file without using transform_df or transform in a code repository.
Basically I want to extract the data from the dataset and return all the values as a list. If I use the transform or transform_df decorators, then I won't be able to access that input file when calling the function that returns it.

Are you trying to access the raw files in the dataset? That is possible using the filesystem API. Search your stack's documentation for "Raw File Access", where you can find example Python code. You still use the @transform decorator, except instead of calling .dataframe() you call .filesystem(). Here's some example code:
import csv

# `hair_eye_color` is the TransformInput argument of your @transform-decorated function
with hair_eye_color.filesystem().open('students.csv') as f:
    reader = csv.reader(f, delimiter=',')
    next(reader)  # ['id', 'hair', 'eye', 'sex']
    next(reader)  # ['1', 'brown', 'brown', 'M']
You can then create a Spark dataframe from the file data and write it to the output.
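For example, a minimal end-to-end sketch might look like the following; the dataset paths and the use of the header row as the schema are assumptions for illustration, and ctx gives you the Spark session inside the transform:

from transforms.api import transform, Input, Output
import csv

@transform(
    processed=Output("/examples/hair_eye_color_processed"),    # placeholder path
    hair_eye_color=Input("/examples/students_hair_eye_color"), # placeholder path
)
def example_computation(ctx, hair_eye_color, processed):
    with hair_eye_color.filesystem().open('students.csv') as f:
        reader = csv.reader(f, delimiter=',')
        header = next(reader)                  # e.g. ['id', 'hair', 'eye', 'sex']
        rows = [tuple(row) for row in reader]  # remaining data rows
    # Build a Spark dataframe from the parsed rows and write it to the output dataset
    df = ctx.spark_session.createDataFrame(rows, schema=header)
    processed.write_dataframe(df)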

Related

Read one-line file into dataframe

I have the task of reading a one-line JSON file into Spark. I've thought about either modifying the input file so that it fits spark.read.json(path), or reading the whole file and modifying it in memory to make it fit the previous line, as shown below:
import spark.implicits._
val file = sc.textFile(path).collect()(0)
val data = file.split("},").map(json => s"$json}")
val ds = data.toSeq.toDF()
Is there a way of directly reading the JSON, or of reading the one-line file into multiple rows?
Edit:
Sorry, I didn't clearly explain the JSON format; all the JSON objects are on the same line:
{"key":"value"},{"key":"value2"},{"key":"value2"}
If imported with spark.read.json(path) it would only take the first value.
Welcome to SO, HugoDife! I believe a single-line load is what spark.read.json() does, and you are perhaps looking for this answer. If not, maybe you want to adjust your question with a data example.
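If the file really is one line of comma-separated objects as shown in the edit, a rough sketch of one approach (assuming Spark 2.2+, a SparkSession named spark, and the path from the question) is to rebuild one JSON string per object and hand them to spark.read.json, which accepts a Dataset[String]:

import spark.implicits._

val line = spark.read.textFile(path).first()    // the single line of the file
val objects = line.split("\\},\\{")             // split between the objects
  .map(_.stripPrefix("{").stripSuffix("}"))     // drop the leftover braces
  .map(s => s"{$s}")                            // rebuild each object
val df = spark.read.json(objects.toSeq.toDS())  // one row per JSON object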

How to write back to a dataframe using transform_df in Palantir Foundry?

I created a library for updating the descriptions of the columns of the input dataset. This function takes three parameters as input (input_dataset, output_dataset, config file) and eventually writes back the descriptions of the output dataset. We now want to import this library across various use cases. How do we handle the cases where we are writing a Spark transformation, i.e. taking inputs through transform_df, given that there we can't assign the output to an output variable? In that situation, how can I call my description library function? How should I proceed in Palantir Foundry? Any suggestions?
This method isn't currently supported using the @transform_df decorator; you'll have to use the @transform decorator for the moment.
The reasoning behind this comes from recognizing the need for broader access to metadata APIs, which the @transform decorator already allows. It therefore seemed more in line with this pattern to keep that access there, since the @transform_df decorator is inherently higher-level.
You can always simply move over your transformations from...
from transforms.api import transform_df, Input, Output
@transform_df(
    Output("/my/output"),
    my_input=Input("/my/input"),
)
def my_compute_function(my_input):
    df = my_input
    # ... logic ...
    return df
...to...
from transforms.api import transform, Input, Output
@transform(
    my_output=Output("/my/output"),
    my_input=Input("/my/input"),
)
def my_compute_function(my_input, my_output):
    df = my_input.dataframe()
    # ... logic ...
    my_output.write_dataframe(df)
...in which only 6 lines of code need be changed.

How to read a CSV file in Scala

I have a CSV file and I want to read that file and store it in a case class. As I know, a CSV is a comma-separated values file. But in the case of my CSV file, some of the data already contains commas itself, and that creates a new column for every comma. So the problem is how to split the data correctly.
1st row:
04/20/2021 16:20 (1st column)  Here a bunch of basic techniques that suit most businesses, and easy-to-follow steps that can help you create a strategy for your social media marketing goals. (2nd column)
2nd row:
11-07-2021 12:15 (1st column)  Focus on attracting real followers who are genuinely interested in your content, and make the most of your social media marketing efforts. (2nd column)
import scala.io.Source

val data = Source.fromFile(file) // `file` is the path to the CSV
for (line <- data.getLines) {
  val cols = line.split(",").map(_.trim)
  for (col <- cols) {
    // println(col)
  }
}
data.close()
If you are reading a complex CSV file then the ideal solution is to use an existing library. Here is a link to the ScalaDex search results for CSV.
ScalaDex CSV Search
However, based on the comments, it appears that you might actually want to read data stored in a Google Sheet. If that is the case, you can take advantage of the fact that you have some flexibility in how you save the data to a text file yourself. When I want to read data from a Google Sheet in Scala, my first approach is to save the file in a format that isn't hard to read. If the fields have embedded commas but no tabs, which is common, I save the file as a TSV and parse it with split("\t").
A simple bit of code that only uses the standard library might look like the following:
val source = scala.io.Source.fromFile("data.tsv")
val data = source.getLines.map(_.split("\t")).toArray
source.close
After this, data will be an Array[Array[String]] with your data in it that you can process as you desire.
Of course, if your data includes both tabs and commas then you'll really want to use one of those more robust external libraries.
You could use the univocity CSV parser for faster parsing. You can also use it for writing CSV files as well.
Univocity parsers
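A minimal sketch of that, assuming the univocity-parsers library is on the classpath and that fields with embedded commas are quoted as in a normal CSV export; the file name is a placeholder:

import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}
import java.io.FileReader

val settings = new CsvParserSettings()
settings.getFormat.setLineSeparator("\n")
val parser = new CsvParser(settings)

// parseAll returns a java.util.List[Array[String]]; quoted fields with embedded
// commas come back as a single column instead of being split apart
val rows = parser.parseAll(new FileReader("data.csv"))
for (i <- 0 until rows.size) {
  println(rows.get(i).mkString(" | "))
}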

What is the fastest way to transform a very large JSON file with Spark?

I have a rather large JSON file (Amazon product data) with a lot of single JSON objects. Those JSON objects contain text that I want to preprocess for a specific training task, but it is the preprocessing that I need to speed up here. One JSON object looks like this:
{
"reviewerID": "A2SUAM1J3GNN3B",
"asin": "0000013714",
"reviewerName": "J. McDonald",
"helpful": [2, 3],
"reviewText": "I bought this for my husband who plays the piano. He is having a wonderful time playing these old hymns. The music is at times hard to read because we think the book was published for singing from more than playing from. Great purchase though!",
"overall": 5.0,
"summary": "Heavenly Highway Hymns",
"unixReviewTime": 1252800000,
"reviewTime": "09 13, 2009"
}
The task would be to extract reviewText from each JSON object and perform some preprocessing like lemmatizing etc.
My problem is that I don't know how I could use Spark in order to speed this task up on a cluster. I am actually not even sure whether I can read that JSON file as a stream, object by object, and parallelize the main task.
What would be the best way to get started with this?
As you have a single JSON object per line, you can use the SparkContext's textFile to get an RDD[String] of lines. Then use map to parse each JSON object using something like json4s and extract the necessary field.
Your whole code will look as simple as this (assuming you have a SparkContext as sc):
import org.json4s._
import org.json4s.jackson.JsonMethods._
implicit val formats: Formats = DefaultFormats
val r = sc.textFile("input_path").map(l => (parse(l) \ "reviewText").extract[String])
You can use a JSON dataset and then execute a simple SQL query to retrieve the reviewText column's values:
// A JSON dataset is pointed to by path.
// The path can be either a single text file or a directory storing text files.
val path = "path/reviews.json"
val reviews = sqlContext.read.json(path)
// Register this DataFrame as a table.
reviews.registerTempTable("reviews")
val reviewTexts = sqlContext.sql("SELECT reviewText FROM reviews")
Built from examples at the SparkSQL docs (http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets)
I would load the JSON data into a DataFrame and then select the field that I need; you can also use map to apply preprocessing like lemmatising.
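A rough sketch of that approach, reusing the sqlContext and path from the answer above; toLowerCase is only a stand-in for whatever lemmatisation you actually apply:

val df = sqlContext.read.json("path/reviews.json")
val texts = df.select("reviewText").rdd.map(_.getString(0)) // RDD[String] of review texts
val processed = texts.map(text => text.toLowerCase)         // placeholder preprocessing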

Spark/Scala read Hadoop file

In a Pig script I saved a table using PigStorage('|').
In the corresponding Hadoop folder I have files like
part-r-00000
etc.
What is the best way to load it in Spark/Scala? In this table I have 3 fields: Int, String, Float.
I tried:
val text = sc.hadoopFile("file", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], sc.defaultMinPartitions)
But then I would need to somehow split each line. Is there a better way to do it?
If I were coding in Python I would create a DataFrame indexed by the first field, whose columns are the values found in the string field and whose coefficients are the float values. But I need to use Scala to use the PCA module, and Scala's dataframes don't seem that close to Python's.
Thanks for the insight.
PigStorage creates a text file without schema information, so you need to do that work yourself, something like:
sc.textFile("file") // or directory where the part files are
val data = csv.map(line => {
vals=line.split("|")
(vals(0).toInt,vals(1),vals(2).toDouble)}
)
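If you then want a DataFrame (you mention the PCA module), a minimal follow-up sketch, assuming a SQLContext named sqlContext is in scope; the column names are placeholders:

import sqlContext.implicits._

val df = data.toDF("id", "name", "value") // pick names that match your table
df.printSchema()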