How can I union and repartition DataFrames stored in parquet files using fugue? - fugue

Say I have these two parquet files:
import pandas as pd
pd.DataFrame([[0]], columns=["a"]).to_parquet("/tmp/1.parquet")
pd.DataFrame([[0],[2]], columns=["a"]).to_parquet("/tmp/2.parquet")
I would like to have a new parquet file that is a row-wise union of the two.
The resulting DataFrame should look like this:
   a
0  0
1  0
2  2
I would also like to repartition that new file with a pre-determined number of partitions.

You can certainly solve this problem in Pandas, Spark, or other computing frameworks, but each of them will require a different implementation. Using Fugue, you can have one implementation for different computing backends; more importantly, the logic is unit testable without using any heavy backend.
from fugue import FugueWorkflow

def merge_and_save(file1, file2, file3, partition_num):
    dag = FugueWorkflow()
    df1 = dag.load(file1)
    df2 = dag.load(file2)
    df3 = df1.union(df2, distinct=False)
    df3.partition(num=partition_num).save(file3)
    return dag
To unit test this logic, just use small local files and the default execution engine. Assume you have a function assert_eq:
merge_and_save(f1, f2, f3, 4).run()
assert_eq(pd.read_parquet(f3), expected_df)
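assert_eq is not provided by Fugue; a minimal sketch of such a helper, assuming both sides are small pandas DataFrames, could be:
import pandas as pd

def assert_eq(actual: pd.DataFrame, expected: pd.DataFrame) -> None:
    # Compare ignoring row order and index labels.
    pd.testing.assert_frame_equal(
        actual.sort_values(list(actual.columns)).reset_index(drop=True),
        expected.sort_values(list(expected.columns)).reset_index(drop=True),
    )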
And in real production, if the input files are large, you can switch to Spark:
merge_and_save(f4, f5, f6, 100).run(spark_session)
It's worth pointing out that partition_num is not respected by the default local execution engine, so we can't assert on the number of output files. But it takes effect when the backend is Spark or Dask.
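As a rough spot-check (assuming f6 is a local directory path and that the Spark backend writes it as a directory of part files), you could count the part files after the Spark run:
import glob

merge_and_save(f4, f5, f6, 100).run(spark_session)
# The part-file count should roughly reflect partition_num on Spark/Dask;
# empty partitions may be skipped, so treat this as a sanity check, not an exact assert.
print(len(glob.glob(f6 + "/part-*.parquet")))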

Related

Apply scikit-learn model to pyspark dataframe column

I have a trained Scikit-learn LogisticRegression model in a sklearn.pipeline.Pipeline. This is an NLP task. The model is saved as a pkl file (actually in ML Studio models, but I download it to databricks dbfs).
I have a Hive table (delta-backed) containing some 1 million rows. The rows have, amongst other things, an id, a keyword_context column (containing the text), a modelled column (boolean, indicates the model has been run on this row), and a prediction column, which is an integer for the class output by the logistic regression.
My problem is how to update the prediction column.
Running locally, I can do:
def generatePredictions(data: pd.DataFrame, model: Pipeline) -> pd.DataFrame:
    data.loc[:, 'keyword_context'] = data.keyword_context.apply(lambda x: x.replace("\n", " "))
    data['prediction'] = model.predict(data.keyword_context)
    data['modelled'] = True
    return data
This actually runs fast enough (~20s), but running the UPDATEs back to Databricks via the databricks.sql.connector takes many hours. So I want to do the same in a PySpark notebook to bypass the lengthy upload.
The trouble is that it is generally suggested to use built-in functions (which this isn't), or, if there must be a UDF, the examples all use built-in types, not Pipelines. I'm wondering whether the model should be loaded within the function, and I presume the function takes a single row, which would mean a lot of loading. I'm really not sure how to code the function, or how to call it.
I work on the Fugue project, which aims to provide a simpler interface than Spark's for porting Python/Pandas code. This is actually the first use case in our tutorial. Fugue will use the underlying Spark call (pandas_udf, udf, mapPartitions, applyInPandas, mapInPandas) based on the arguments you provide, with minimal overhead.
Here is what the code looks like.
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
X = pd.DataFrame({"x_1": [1, 1, 2, 2], "x_2":[1, 2, 2, 3]})
y = np.dot(X, np.array([1, 2])) + 3
reg = LinearRegression().fit(X, y)
def predict(df: pd.DataFrame, model: LinearRegression) -> pd.DataFrame:
    return df.assign(predicted=model.predict(df))
input_df = pd.DataFrame({"x_1": [3, 4, 6, 6], "x_2":[3, 3, 6, 6]})
sdf = spark.createDataFrame(input_df)
# This is the start of the Fugue portion. It's minimal
from fugue import transform
result = transform(
    sdf,
    predict,
    schema="*,predicted:double",
    params=dict(model=reg),
    engine=spark
)
print(type(result))
result.show()
This code will be applied per partition. The schema is a requirement for Spark. I am not sure, but it sounds like you were using a row-wise UDF, so I think this will be faster. It also keeps your logic easy to unit test because you don't need Spark to unit test it.
On loading the file inside the function
If you load the file inside the function, the load gets executed on the workers. If you pass it in, it gets passed through the scheduler, which can create a lot of redundant passing of data. Loading it inside might speed things up.
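For example, here is a minimal sketch of loading the model inside the function (the pkl path is a placeholder, and it assumes the file is readable from every worker, e.g. on DBFS):
import pickle
import pandas as pd

def predict(df: pd.DataFrame) -> pd.DataFrame:
    # Loading here means each worker reads the model itself,
    # instead of the driver serializing it into every task.
    with open("/dbfs/models/model.pkl", "rb") as f:  # placeholder path
        model = pickle.load(f)
    return df.assign(
        prediction=model.predict(df["keyword_context"]),
        modelled=True,
    )

# then, roughly:
# transform(sdf, predict, schema="*,prediction:long,modelled:bool", engine=spark)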

Read file content per row of Spark DataFrame

We have an AWS S3 bucket with millions of documents in a complex hierarchy, and a CSV file with (among other data) links to a subset of those files. I estimate this file will be about 1,000 to 10,000 rows. I need to join the data from the CSV file with the contents of the documents for further processing in Spark. In case it matters, we're using Scala and Spark 2.4.4 on an Amazon EMR 6.0.0 cluster.
I can think of two ways to do this. The first is to add a transformation on the CSV DataFrame that adds the content as a new column:
val df = spark.read.format("csv").load("<csv file>")
val attempt1 = df.withColumn("raw_content", spark.sparkContext.textFile($"document_url"))
This, or variations thereof (for example, wrapping it in a UDF), doesn't seem to work, I think because sparkContext.textFile returns an RDD, so I'm not sure it's even supposed to work this way. Even if I get it working, is this the best way to keep it performant in Spark?
An alternative I tried to think of is to use spark.sparkContext.wholeTextFiles upfront and then join the two dataframes together:
val df = spark.read.format("csv").load("<csv file>")
val contents = spark.sparkContext.wholeTextFiles("<s3 bucket>").toDF("document_url", "raw_content");
val attempt2 = df.join(contents, df("document_url") === contents("document_url"), "left")
but wholeTextFiles doesn't go into subdirectories and the needed paths are hard to predict. I'm also unsure of the performance impact of trying to build an RDD of the entire bucket of millions of files when I only need a small fraction of it, since the S3 API probably doesn't make it very fast to list all the objects in the bucket.
Any ideas? Thanks!
I did figure out a solution in the end:
val df = spark.read.format("csv").load("<csv file>")
val allS3Links = df.map(row => row.getAs[String]("document_url")).collect()
val joined = allS3Links.mkString(",")
val contentsDF = spark.sparkContext.wholeTextFiles(joined).toDF("document_url", "raw_content");
The downside to this solution is that it pulls all the urls to the driver, but it's workable in my case (100,000 * ~100 char length strings is not that much) and maybe even unavoidable.

Recursively adding rows to a dataframe

I am new to Spark. I have some JSON data that comes as an HttpResponse. I need to store this data in Hive tables. Every HttpGet request returns a JSON which will be a single row in the table. Due to this, I am having to write single rows as files in the Hive table directory.
But I feel that having too many small files will reduce speed and efficiency. So is there a way I can recursively add new rows to the DataFrame and write it to the Hive table directory all at once? I feel this will also reduce the runtime of my Spark code.
Example:
for(i <- 1 to 10){
  newDF = hiveContext.read.json("path")
  df = df.union(newDF)
}
df.write()
I understand that the dataframes are immutable. Is there a way to achieve this?
Any help would be appreciated. Thank you.
You are mostly on the right track: what you want to do is obtain multiple single records as a Seq[DataFrame], and then reduce that Seq[DataFrame] to a single DataFrame by unioning them.
Going from the code you provided:
val BatchSize = 100
val HiveTableName = "table"
(0 until BatchSize).
  map(_ => hiveContext.read.json("path")).
  reduce(_ union _).
  write.insertInto(HiveTableName)
Alternatively, if you want to perform the HTTP requests as you go, we can do that too. Let's assume you have a function that does the HTTP request and converts it into a DataFrame:
def obtainRecord(...): DataFrame = ???
You can do something along the lines of:
val HiveTableName = "table"
val OtherHiveTableName = "other_table"
val jsonArray = ???
val batched: DataFrame =
  jsonArray.
    map { parameter =>
      obtainRecord(parameter)
    }.
    reduce(_ union _)
batched.write.insertInto(HiveTableName)
batched.select($"...").write.insertInto(OtherHiveTableName)
You are clearly misusing Spark. Apache Spark is an analytical system, not a database API. There is no benefit to using Spark to modify a Hive database like this. It will only bring a severe performance penalty without benefiting from any of the Spark features, including distributed processing.
Instead, you should use a Hive client directly to perform transactional operations.
If you can batch-download all of the data (for example with a script using curl or some other program) and store it in a file first (or many files; Spark can load an entire directory at once), you can then load that file (or files) all at once into Spark to do your processing. I would also check whether the web API has any endpoints to fetch all the data you need instead of just one record at a time.
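For instance, a small PySpark sketch of that idea (the directory and table names are placeholders; it assumes the batch-downloaded responses sit as JSON files in one directory):
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Read every JSON file in the directory in one job, then write to Hive once.
df = spark.read.json("/data/http_responses/")          # placeholder directory
df.write.mode("append").saveAsTable("my_hive_table")   # placeholder table name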

Spark dataframe write method writing many small files

I've got a fairly simple job converting log files to parquet. It's processing 1.1TB of data (chunked into 64MB - 128MB files - our block size is 128MB), which is approximately 12 thousand files.
The job works as follows:
val events = spark.sparkContext
.textFile(s"$stream/$sourcetype")
.map(_.split(" \\|\\| ").toList)
.collect{case List(date, y, "Event") => MyEvent(date, y, "Event")}
.toDF()
df.write.mode(SaveMode.Append).partitionBy("date").parquet(s"$path")
It collects the events with a common schema, converts to a DataFrame, and then writes out as parquet.
The problem I'm having is that this can create a bit of an IO explosion on the HDFS cluster, as it's trying to create so many tiny files.
Ideally I want to create only a handful of parquet files within the partition 'date'.
What would be the best way to control this? Is it by using 'coalesce()'?
How will that affect the number of files created in a given partition? Is it dependent on how many executors I have working in Spark? (currently set at 100).
You have to repartition your DataFrame to match the partitioning of the DataFrameWriter.
Try this:
df
  .repartition($"date")
  .write.mode(SaveMode.Append)
  .partitionBy("date")
  .parquet(s"$path")
In Python you can rewrite Raphael Roth's answer as:
(df
  .repartition("date")
  .write.mode("append")
  .partitionBy("date")
  .parquet("{path}".format(path=path)))
You might also consider adding more columns to .repartition to avoid problems with very large partitions:
(df
  .repartition("date", another_column, yet_another_column)
  .write.mode("append")
  .partitionBy("date")
  .parquet("{path}".format(path=path)))
The simplest solution would be to replace your actual partitioning by:
df
  .repartition(to_date($"date"))
  .write.mode(SaveMode.Append)
  .partitionBy("date")
  .parquet(s"$path")
You can also use more precise partitioning for your DataFrame, i.e. the day and maybe the hour of an hour range, and then be less precise for the writer.
That actually depends on the amount of data.
You can reduce entropy by partitioning the DataFrame and then writing with a partition by clause.
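For example, in PySpark, assuming the DataFrame has a timestamp column ts from which the day and hour can be derived:
from pyspark.sql import functions as F

(df
  .withColumn("date", F.to_date("ts"))   # ts is an assumed timestamp column
  .withColumn("hour", F.hour("ts"))
  .repartition("date", "hour")           # finer-grained shuffle partitioning
  .write.mode("append")
  .partitionBy("date")                   # coarser on-disk layout
  .parquet(path))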
I came across the same issue, and using coalesce solved my problem.
df
.coalesce(3) // number of parts/files
.write.mode(SaveMode.Append)
.parquet(s"$path")
For more information on using coalesce or repartition, you can refer to the following: Spark: coalesce or repartition.
Duplicating my answer from here: https://stackoverflow.com/a/53620268/171916
This is working for me very well:
data.repartition(n, "key").write.partitionBy("key").parquet("/location")
It produces N files in each output partition (directory), and is (anecdotally) faster than using coalesce and (again, anecdotally, on my data set) faster than only repartitioning on the output.
If you're working with S3, I also recommend doing everything on local drives (Spark does a lot of file creation/rename/deletion during write outs) and once it's all settled use hadoop FileUtil (or just the aws cli) to copy everything over:
import java.net.URI
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
// ...
def copy(
  in: String,
  out: String,
  sparkSession: SparkSession
) = {
  FileUtil.copy(
    FileSystem.get(new URI(in), sparkSession.sparkContext.hadoopConfiguration),
    new Path(in),
    FileSystem.get(new URI(out), sparkSession.sparkContext.hadoopConfiguration),
    new Path(out),
    false,
    sparkSession.sparkContext.hadoopConfiguration
  )
}
How about running a script like this as a map job, consolidating all the parquet files into one:
$ hadoop jar /usr/hdp/2.3.2.0-2950/hadoop-mapreduce/hadoop-streaming-2.7.1.2.3.2.0-2950.jar \
-Dmapred.reduce.tasks=1 \
-input "/hdfs/input/dir" \
-output "/hdfs/output/dir" \
-mapper cat \
-reducer cat

How to use QuantileDiscretizer across groups in a DataFrame?

I have a DataFrame with the following columns.
scala> show_times.printSchema
root
|-- account: string (nullable = true)
|-- channel: string (nullable = true)
|-- show_name: string (nullable = true)
|-- total_time_watched: integer (nullable = true)
This is data about how many times a customer has watched a particular show. I'm supposed to categorize the customers for each show based on total time watched.
The dataset has 133 million rows in total with 192 distinct show_names.
For each individual show I'm supposed to bin the customer into 3 categories (1,2,3).
I use Spark MLlib's QuantileDiscretizer.
Currently I loop through every show and run QuantileDiscretizer sequentially, one show at a time.
What I'd like to have in the end is for the following sample input to get the sample output.
Sample Input:
account,channel,show_name,total_time_watched
acct1,ESPN,show1,200
acct2,ESPN,show1,250
acct3,ESPN,show1,800
acct4,ESPN,show1,850
acct5,ESPN,show1,1300
acct6,ESPN,show1,1320
acct1,ESPN,show2,200
acct2,ESPN,show2,250
acct3,ESPN,show2,800
acct4,ESPN,show2,850
acct5,ESPN,show2,1300
acct6,ESPN,show2,1320
Sample Output:
account,channel,show_name,total_time_watched,Time_watched_bin
acct1,ESPN,show1,200,1
acct2,ESPN,show1,250,1
acct3,ESPN,show1,800,2
acct4,ESPN,show1,850,2
acct5,ESPN,show1,1300,3
acct6,ESPN,show1,1320,3
acct1,ESPN,show2,200,1
acct2,ESPN,show2,250,1
acct3,ESPN,show2,800,2
acct4,ESPN,show2,850,2
acct5,ESPN,show2,1300,3
acct6,ESPN,show2,1320,3
Is there a more efficient and distributed way to do it using some groupBy-like operation, instead of looping through each show_name and binning it one after another?
I know nothing about QuantileDiscretizer, but I think you're mostly concerned with the dataset to apply QuantileDiscretizer to. I think you want to figure out how to split your input dataset into smaller datasets per show_name (you said that there are 192 distinct show_names in the input dataset).
Solution 1: Partition Parquet Dataset
I've noticed that you use parquet as the input format. My understanding of the format is very limited, but I've noticed that people use a partitioning scheme to split large datasets into smaller chunks that they can then process however they like (per that partitioning scheme).
In your case the partitioning scheme could include show_name.
That would make your case trivial, as the splitting would be done at write time (aka not my problem anymore).
See How to save a partitioned parquet file in Spark 2.1?
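A minimal PySpark sketch of this idea (the path is a placeholder, spark is the active SparkSession, and show_times is the input DataFrame from the question):
# Write once, partitioned by show_name ...
show_times.write.partitionBy("show_name").parquet("/data/shows_partitioned")  # placeholder path

# ... then each show can be read back (and discretized) as a small standalone dataset.
show1 = spark.read.parquet("/data/shows_partitioned/show_name=show1")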
Solution 2: Scala's Future
Given your iterative solution, you could wrap every iteration into a Future that you'd submit to process in parallel.
Spark SQL's SparkSession (and Spark Core's SparkContext) are thread-safe.
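A rough Python analog of the same idea, using a thread pool instead of Scala Futures (fit_one_show is a hypothetical function that runs QuantileDiscretizer on one show's rows):
from concurrent.futures import ThreadPoolExecutor

show_names = [r["show_name"] for r in show_times.select("show_name").distinct().collect()]

# SparkSession is thread-safe, so the per-show jobs can be submitted concurrently.
with ThreadPoolExecutor(max_workers=8) as pool:
    binned = list(pool.map(
        lambda s: fit_one_show(show_times.where(show_times.show_name == s)),
        show_names))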
Solution 3: Dataset's filter and union operators
I would think twice before following this solution since it puts a burden on your shoulders that I think could easily be sorted out by solution 1.
Given you've got one large 133-million-row parquet file, I'd first build the 192 datasets per show_name using the filter operator (as you did to build show_rdd, whose name is misleading since it's a DataFrame, not an RDD) and then union them (again, as you did).
See Dataset API.
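In outline (PySpark, reusing show_names from the sketch above; bin_one_show is a hypothetical wrapper around QuantileDiscretizer for a single show's DataFrame):
from functools import reduce

# One small DataFrame per show, binned, then unioned back together.
per_show = [bin_one_show(show_times.where(show_times.show_name == s)) for s in show_names]
result = reduce(lambda a, b: a.union(b), per_show)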
Solution 4: Use Window Functions
That's something I think could work, but I didn't check it out myself.
You could use window functions (see WindowSpec and Column's over operator).
Window functions would give you partitioning (windows), while over would somehow apply QuantileDiscretizer to a window/partition. That would however require "destructuring" QuantileDiscretizer into an Estimator to train a model and somehow fit the resulting model to the window again.
I think it's doable, but haven't done it myself. Sorry.
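A hedged sketch of the window idea, using ntile instead of QuantileDiscretizer (ntile bins by rank within each show rather than by fitted quantile splits, but it happens to reproduce the sample output above):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Rank rows within each show by time watched, then split the ranking into 3 tiles.
w = Window.partitionBy("show_name").orderBy("total_time_watched")
binned = show_times.withColumn("Time_watched_bin", F.ntile(3).over(w))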
This is an older question; however, I'm answering it to help someone in the same situation in the future.
It can be achieved using a pandas UDF. Both the input and the output of a grouped-map pandas UDF are DataFrames. We need to provide the schema of the output DataFrame, as shown in the annotation in the code sample below, which achieves the required result.
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, IntegerType

output_schema = StructType(df.schema.fields + [StructField('Time_watched_bin', IntegerType(), True)])

@pandas_udf(output_schema, PandasUDFType.GROUPED_MAP)
def get_buckets(pdf):
    # pdf: pandas DataFrame holding all rows of one show_name group
    pdf['Time_watched_bin'] = pd.cut(pdf['total_time_watched'], 3, labels=False)
    return pdf

df = df.groupby('show_name').apply(get_buckets)
df will have a new column 'Time_watched_bin' with the bucket information.