I have a fairly large dataframe (a million rows), and the requirement is to store each row in a separate JSON file.
For this dataframe:
root
|-- uniqueID: string
|-- moreData: array
The output should be stored as below for all the rows:
s3://.../folder[i]/<uniqueID>.json
where i is the first letter of the uniqueID
I have looked at other questions and solutions, but they don't satisfy my requirements.
I'm trying to do this in a time-optimized way, and from what I have read so far, repartition is not a good option.
I tried writing the df with the maxRecordsPerFile option, but I can't seem to control the naming of the files.
df.write.mode("overwrite")
  .option("maxRecordsPerFile", 1)
  .json(outputPath)
I am fairly new to spark, any help is much appreciated.
I don't think there is really an optimized (if we take that to mean "much faster than any other") method of doing this. It's fundamentally an inefficient operation, and one that I can't really see a good use case for. But, assuming you really have thought this through and decided this is the best way to solve the problem at hand, I would suggest you reconsider using the repartition method on the dataframe; it can take a column to be used as the partitioning expression. The only thing it won't do is split files across directories the way you want.
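For reference, a minimal sketch of what repartitioning by a column looks like (df here stands for your input dataframe):
import org.apache.spark.sql.functions.col

// Hash-partition the dataframe on uniqueID, so rows with the same uniqueID
// end up in the same partition (and hence in the same output file):
val partitioned = df.repartition(col("uniqueID"))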
I suppose something like this might work:
import org.apache.spark.sql.functions.col

// dummy data; BAZ plays the role of your uniqueID column
val df = Seq(("A", "B", "XC"), ("D", "E", "YF"), ("G", "H", "ZI"), ("J", "K", "ZL"), ("M", "N", "XO")).toDF("FOO", "BAR", "BAZ")
// List of all possible prefixes for the index column. If you need to generate this
// from the data, replace this with a query against the input dataframe to do that.
val prefixes = List("X", "Y", "Z")
// replace with your path
val basePath = "/.../data"
prefixes.foreach { p =>
  val data = df.filter(col("BAZ").startsWith(p))
  // repartition with one record per partition, so each row becomes its own file
  data.repartition(data.count.toInt)
    .write.format("json")
    .save(s"$basePath/$p")
}
The above doesn't quite meet the requirement since you can't control the output file name [1]. We can use a shell script to
fix the file names afterward. This assumes you are running in an environment with bash and jq available.
#!/usr/bin/env bash
# replace with the path that contains the directories to process
cd /.../data
for sub_data_dir in ./*; do
    cd "${sub_data_dir}"
    rm _SUCCESS
    for f in ./part-*.json; do
        uuid="$(jq -r '.uniqueID' "${f}")"
        mv "${f}" "${uuid}.json"
    done
    cd ..
done
[1]: Spark doesn't give you an option to control individual file names when using Dataframe.write, because that isn't how it is meant to be used. The intended usage is on a multi-node Hadoop cluster where data may be distributed arbitrarily between the nodes. The write operation is coordinated among all nodes and targets a path on the shared HDFS. In that case it makes no sense to talk about individual files, because the operation is performed at the dataframe level; you can only control the naming of the directory where the output files will be written (as the argument to the save method).
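To illustrate that last point, a minimal sketch (the bucket path is a placeholder):
// The argument to save/json names the output directory; the part file names
// inside it are chosen by Spark and cannot be overridden here:
df.write.json("s3://bucket/output_dir") // yields output_dir/part-00000-<uuid>.json, _SUCCESS, ...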
I would like to apply spaCy NLP to my PySpark dataframe. I am using mapPartitions on the dataframe to apply Python logic that uses spaCy.
Spark version: 3.2.0
Below is the sample pyspark dataframe:
   token                     id
0  [This, is, java, world]   0
1  [This, is, spark, world]  1
Below is the code where I am passing the data to the Python function and returning a dictionary:
def get_spacy_doc_parallel_map_func(partitionData):
    import spacy
    from tqdm import tqdm
    import pandas as pd

    nlp = spacy.load('en_core_web_sm')
    nlp.tokenizer = nlp.tokenizer.tokens_from_list
    from spacy.tokens import Doc
    if not Doc.has_extension("text_id"):
        Doc.set_extension("text_id", default=None)

    columnNames = broadcasted_source_columns.value
    partitionData = pd.DataFrame(partitionData, columns=columnNames)

    def get_spacy_doc_parallel(data):
        '''
        This function creates a mapper of review id and review spacy.doc.Doc type
        '''
        text_tuples = []
        dodo = data[['token', 'id']].drop_duplicates(['id'])
        for _, i in dodo.iterrows():
            text_tuples.append((i['token'], {'text_id': i['id']}))
        doc_tuples = nlp.pipe(text_tuples, as_tuples=True, n_process=8, disable=['tagger', 'parser', 'ner'])
        docsf = []
        for doc, context in tqdm(doc_tuples, total=len(text_tuples)):
            doc._.text_id = context["text_id"]
            docsf.append(doc)
        vv = {}
        for doc in docsf:
            vv[doc._.text_id] = doc
        return vv

    id_spacy_doc_mapper = get_spacy_doc_parallel(partitionData)
    partitionData['spacy_doc'] = id_spacy_doc_mapper
    partitionData.reset_index(inplace=True)
    partitionData_dict = partitionData.to_dict("index")
    result = []
    for key in partitionData_dict:
        result.append(partitionData_dict[key])
    return iter(result)


resultDF_tokens = data.rdd.mapPartitions(get_spacy_doc_parallel_map_func)
df = spark.createDataFrame(resultDF_tokens)
The issue I am getting here is that the length of the dictionary values does not match the length of the dataframe. Below is the error.
Error:
ValueError: Length of values (954) does not match length of index (1438)
Output:
{0: This is java world, 1: This is spark world}
The above dictionary is assigned as a column of the pandas dataframe after applying spaCy (partitionData['spacy_doc'] = id_spacy_doc_mapper).
I don't have enough experience with spaCy to figure out what the intent is here, and I'm quite confused by the input and output because the input already looks tokenized, but I'll take a stab at it and list my assumptions and the problems I ran into.
First off, I think Fugue can make this type of transformation significantly easier. It will use the underlying Spark UDF, pandas_udf, mapPartitions, or mapInPandas depending on what parameters you supply. The point is that Fugue handles that boilerplate. For you, it seems you have a Pandas DataFrame in (that part is clear), but the output is less clear. I think you are passing some iterable of lists to make Spark happy, but Pandas DataFrame output might be simpler. I'm guessing here.
So first we set some stuff up. This is all native Python. The tokens_from_list portion was removed from the original because it seems the latest spaCy versions deprecated it. It shouldn't matter for the example.
import pandas as pd
from typing import List, Any, Iterable, Dict
import spacy

nlp = spacy.load('en_core_web_sm')

from spacy.tokens import Doc
if not Doc.has_extension("text_id"):
    Doc.set_extension("text_id", default=None)

data = pd.DataFrame({"token": ["This is java world", "This is spark world"],
                     "id": [0, 1]})
and then you define your logic for one partition. I am assuming Pandas DataFrame in and Pandas DataFrame out, but Fugue can actually support many other types, such as Pandas DataFrame in and Iterable[List] out. The important thing is just that you annotate your logic so Fugue knows how to handle it. Note this code is still native Python. I edited the logic a bit just to get it to work. Again, I am pretty sure I butchered the logic somewhere, but the example still works; I really couldn't find a way to make the original work (because I don't know spacy well enough).
def get_spacy_doc(data: pd.DataFrame) -> pd.DataFrame:
    text_tuples = []
    dodo = data[['token', 'id']].drop_duplicates(['id'])
    for _, i in dodo.iterrows():
        text_tuples.append((i['token'], {'text_id': i['id']}))
    doc_tuples = nlp.pipe(text_tuples, as_tuples=True, n_process=1, disable=['tagger', 'parser', 'ner'])
    docsf = []
    for doc, context in doc_tuples:
        doc._.text_id = context["text_id"]
        docsf.append(doc)
    vv = {}
    for doc in docsf:
        vv[doc._.text_id] = doc
    id_spacy_doc_mapper = vv.copy()
    data['space_doc'] = id_spacy_doc_mapper
    return data
Now to bring this to Spark, all you have to do with Fugue is:
from fugue import transform
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(data)
sdf = transform(sdf, get_spacy_doc, schema="*, space_doc:int", engine=spark)
sdf.show()
and the Fugue transform will handle it. This runs on Spark, but you can also run on Pandas if you don't supply an engine, like this:
df = transform(data, get_spacy_doc, schema="*, space_doc:int")
This allows you to test the logic clearly without relying on Spark. It will then work when you bring it to Spark. Schema is a requirement because it is a requirement for Spark.
On partitioning
The Fugue transform can take a partitioning strategy. For example:
transform(df, func, schema="*", partition={"by":"col1"}, engine=spark)
but for this case, I don't think you partition on anything so you can just use the default partitions, which is what will happen.
On parallelization
You have code like:
nlp.pipe(text_tuples, as_tuples=True,n_process=8,disable=['tagger','parser','ner'])
This is two-stage parallelism. The first stage is Spark mapping over partitions, and the second stage is this pipe operation (I assume). Two-stage parallelism is an anti-pattern in distributed computing because the first stage will already occupy all the available resources on the cluster. Parallelism should be handled at the partition level. If you do something like this, it's very common to run into resource deadlocks when the second stage also tries to claim resources. I would recommend setting n_process=1.
On tqdm
I may be wrong on this one, but I don't think tqdm plays well with Spark, because I don't think you can get a real-time progress bar for work that happens on worker machines. It can only work on the driver machine. The workers don't send logs to the driver for the functions they run.
If the example is clearer, I can certainly help you port this logic to Spark. Feel free to reach out. I hope at least some bit of this was useful.
I am trying to read file contents from HDFS, for which I am using Source.fromFile(). It works fine when my file is on the local filesystem, but it throws an error when I try to read the file from HDFS.
import scala.io.Source

object CheckFile {
  def main(args: Array[String]) {
    for (line <- Source.fromFile("/user/cloudera/xxxx/File").getLines()) {
      println(line)
    }
  }
}
Error:
java.io.FileNotFoundException: hdfs:/quickstart.cloudera:8080/user/cloudera/xxxx/File (No such file or directory)
I searched but am not able to find any solution to this.
Please help.
If you are using Spark, you should use SparkContext to load the files; Source.fromFile uses the local file system.
Say you have your SparkContext as sc,
val fromFile = sc.textFile("hdfs://path/to/file.txt")
Should do the trick. You might have to specify the node address, though.
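For example, with the namenode host from your error message (the 8020 port is an assumption; adjust it to your cluster's fs.defaultFS):
// Fully-qualified HDFS URI including the namenode host and port:
val fromFile = sc.textFile("hdfs://quickstart.cloudera:8020/user/cloudera/xxxx/File")
fromFile.collect().foreach(println)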
UPDATE:
To add to the comment: you want to read some data from HDFS and store it as a Scala collection. This is bad practice, as the file might contain millions of lines and it will crash due to insufficient memory; you should use RDDs and not built-in Scala collections. Nevertheless, if this is what you want, you could do:
val fromFile = sc.textFile("hdfs://path/to/file.txt").toLocalIterator.toArray
Which would produce a local collection of the desired type (Array in this case).
sc.textFile("hdfs://path/to/file.txt").toLocalIterator.toArray.mkString will give the result as a string.
I'm having a hard time figuring out why Spark is not accessing a file that I add to the context. Below is my code in the repl:
scala> sc.addFile("/home/ubuntu/my_demo/src/main/resources/feature_matrix.json")
scala> val featureFile = sc.textFile(SparkFiles.get("feature_matrix.json"))
featureFile: org.apache.spark.rdd.RDD[String] = /tmp/spark/ubuntu/spark-d7a13d92-2923-4a04-a9a5-ad93b3650167/feature_matrix.json MappedRDD[1] at textFile at <console>:60
scala> featureFile.first()
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: cfs://172.30.26.95/tmp/spark/ubuntu/spark-d7a13d92-2923-4a04-a9a5-ad93b3650167/feature_matrix.json
The file does in fact exist at /tmp/spark/ubuntu/spark-d7a13d92-2923-4a04-a9a5-ad93b3650167/feature_matrix.json
Any help appreciated.
If you are using addFile, then you need to use SparkFiles.get to retrieve it. Also, the addFile mechanism is lazy, so it is very possible the file was not put in the location you are looking at until you actually call first, so you are creating this kind of circle.
All that being said, I don't know that relying on SparkFiles for the first action is ever going to be a smart idea. Use something like --files with spark-submit and the files will be put in your working directory.
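A minimal sketch of that approach (the jar name is a placeholder):
// Submit with the file attached (shell command, shown here as a comment):
//   spark-submit --files /home/ubuntu/my_demo/src/main/resources/feature_matrix.json my_app.jar
// The file is then localized into the working directory, so it can be read
// with ordinary file APIs rather than going through SparkFiles:
val json = scala.io.Source.fromFile("feature_matrix.json").mkString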
I have a RichPipe with 3 fields: name: String, time: Long, and value: Int. I need to get the value for a specific (name, time) pair. How can I do it? I can't figure it out from the Scalding documentation, as it is very cryptic, and I can't find any examples that do this.
Well, a RichPipe is not a key-value store, which is why there is no documentation on using it as a key-value store :) A RichPipe should be thought of as a pipe, so you can't get at data in the middle without first going in at one end and traversing the pipe until you find the element you're looking for. Furthermore, this is a little painful in Scalding because you have to write your results to disk (since it's built on top of Hadoop) and then read the result back from disk in order to use it in your application. So the code will be something like:
myPipe.filter[(String, Long)](('name, 'time))(_ == (specificName, specificTime))
  .write(Tsv("tmp/location"))
Then you'll need some higher level code to run the job and read the data back into memory to get at the result. Rather than write out all the code to do this (it's pretty straightforward), why don't you give some more context about what your use case is and what you are trying to do - maybe you can solve your problem under the Map-Reduce programming model.
Alternatively, use Spark. You'll have the same problem of having to traverse a distributed dataset, but you don't have the faff of writing to disk and reading back again. Furthermore, you can use a custom partitioner in Spark, which could give near key-value-store-like behaviour. Anyway, naively the code would be:
val theValueYouWant =
  myRDD.filter {
    case (`specificName`, `specificTime`, _) => true
    case _ => false
  }.collect().head._3
When using Scala in Spark, whenever I dump the results out using saveAsTextFile, it seems to split the output into multiple parts. I'm just passing a parameter (the path) to it.
val year = sc.textFile("apat63_99.txt").map(_.split(",")(1)).flatMap(_.split(",")).map((_,1)).reduceByKey((_+_)).map(_.swap)
year.saveAsTextFile("year")
Does the number of output files correspond to the number of reducers it uses?
Does this mean the output is compressed?
I know I can combine the output together using bash, but is there an option to store the output in a single text file, without splitting? I looked at the API docs, but they don't say much about this.
The reason it saves it as multiple files is that the computation is distributed. If the output is small enough that you think you can fit it on one machine, then you can end your program with
val arr = year.collect()
and then save the resulting array as a file. Another way would be to use a custom partitioner via partitionBy and make it so everything goes to one partition, though that isn't advisable because you won't get any parallelization.
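For instance, a sketch of the collect-then-write approach, assuming the output fits in driver memory:
import java.io.PrintWriter

// Bring the (small) result to the driver and write it out as one local file:
val arr = year.collect()
val out = new PrintWriter("year.txt")
arr.foreach(out.println)
out.close()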
If you require the file to be saved with saveAsTextFile, you can use coalesce(1, true).saveAsTextFile(). This basically means do the computation, then coalesce to 1 partition. You can also use repartition(1), which is just a wrapper for coalesce with the shuffle argument set to true. Looking through the source of RDD.scala is how I figured most of this stuff out; you should take a look.
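Concretely, with the year RDD from the question (output paths are placeholders):
// Coalesce to a single partition after the computation, then write one part file:
year.coalesce(1, true).saveAsTextFile("year_single")
// repartition(1) is equivalent (it is coalesce with shuffle = true under the hood):
year.repartition(1).saveAsTextFile("year_single_2")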
For those working with a larger dataset:
rdd.collect() should not be used in this case, as it will collect all the data as an Array in the driver, which is the easiest way to run out of memory.
rdd.coalesce(1).saveAsTextFile() should also not be used, as the parallelism of upstream stages will be lost: they will be performed on the single node where the data ends up being stored.
rdd.coalesce(1, shuffle = true).saveAsTextFile() is the best simple option as it will keep the processing of upstream tasks parallel and then only perform the shuffle to one node (rdd.repartition(1).saveAsTextFile() is an exact synonym).
rdd.saveAsSingleTextFile(), as provided below, additionally allows one to store the rdd in a single file with a specific name, while keeping the parallelism properties of rdd.coalesce(1, shuffle = true).saveAsTextFile().
Something that can be inconvenient with rdd.coalesce(1, shuffle = true).saveAsTextFile("path/to/file.txt") is that it actually produces a file whose path is path/to/file.txt/part-00000 and not path/to/file.txt.
The following solution rdd.saveAsSingleTextFile("path/to/file.txt") will actually produce a file whose path is path/to/file.txt:
package com.whatever.package

import org.apache.spark.rdd.RDD
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
import org.apache.hadoop.io.compress.CompressionCodec

object SparkHelper {

  // This is an implicit class so that saveAsSingleTextFile can be attached to
  // RDD[String] and be called like this: rdd.saveAsSingleTextFile("path/to/file.txt")
  implicit class RDDExtensions(val rdd: RDD[String]) extends AnyVal {

    def saveAsSingleTextFile(path: String): Unit =
      saveAsSingleTextFileInternal(path, None)

    def saveAsSingleTextFile(path: String, codec: Class[_ <: CompressionCodec]): Unit =
      saveAsSingleTextFileInternal(path, Some(codec))

    private def saveAsSingleTextFileInternal(
        path: String, codec: Option[Class[_ <: CompressionCodec]]
    ): Unit = {

      // The interface with hdfs:
      val hdfs = FileSystem.get(rdd.sparkContext.hadoopConfiguration)

      // Classic saveAsTextFile in a temporary folder:
      hdfs.delete(new Path(s"$path.tmp"), true) // to make sure it's not there already
      codec match {
        case Some(codec) => rdd.saveAsTextFile(s"$path.tmp", codec)
        case None        => rdd.saveAsTextFile(s"$path.tmp")
      }

      // Merge the folder of resulting part-xxxxx files into one file:
      hdfs.delete(new Path(path), true) // to make sure it's not there already
      FileUtil.copyMerge(
        hdfs, new Path(s"$path.tmp"),
        hdfs, new Path(path),
        true, rdd.sparkContext.hadoopConfiguration, null
      )
      // Working with Hadoop 3?: https://stackoverflow.com/a/50545815/9297144

      hdfs.delete(new Path(s"$path.tmp"), true)
    }
  }
}
which can be used this way:
import com.whatever.package.SparkHelper.RDDExtensions
rdd.saveAsSingleTextFile("path/to/file.txt")
// Or if the produced file is to be compressed:
import org.apache.hadoop.io.compress.GzipCodec
rdd.saveAsSingleTextFile("path/to/file.txt.gz", classOf[GzipCodec])
This snippet:
First stores the rdd with a classic rdd.saveAsTextFile in a temporary folder, path/to/file.txt.tmp, as if we didn't want to store the data in one file (which keeps the processing of upstream tasks parallel)
Only then, using the Hadoop FileSystem API, do we proceed with the merge (FileUtil.copyMerge()) of the different output files to create our final single output file, path/to/file.txt.
You could call coalesce(1) and then saveAsTextFile(), but it might be a bad idea if you have a lot of data. Separate files per split are generated just like in Hadoop, in order to let separate mappers and reducers write to different files. Having a single output file is only a good idea if you have very little data, in which case you could do collect() as well, as @aaronman said.
As others have mentioned, you can collect or coalesce your data set to force Spark to produce a single file. But this also limits the number of Spark tasks that can work on your dataset in parallel. I prefer to let it create a hundred files in the output HDFS directory, then use hadoop fs -getmerge /hdfs/dir /local/file.txt to extract the results into a single file in the local filesystem. This makes the most sense when your output is a relatively small report, of course.
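As a sketch, using the year RDD from the question and the paths from this answer:
// Let Spark write the part files to HDFS in parallel:
year.saveAsTextFile("/hdfs/dir")
// Then, from a shell, merge them into a single local file:
//   hadoop fs -getmerge /hdfs/dir /local/file.txt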
In Spark 1.6.1 the format is as shown below. It creates a single output file. It is best practice to use it if the output is small enough to handle. Basically, it returns a new RDD that is reduced into numPartitions partitions. If you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1).
pair_result.coalesce(1).saveAsTextFile("/app/data/")
You can call repartition() and do it this way:
val year = sc.textFile("apat63_99.txt").map(_.split(",")(1)).flatMap(_.split(",")).map((_,1)).reduceByKey((_+_)).map(_.swap)
var repartitioned = year.repartition(1)
repartitioned.saveAsTextFile("C:/Users/TheBhaskarDas/Desktop/wc_spark00")
You will be able to do it in the next version of Spark; in the current version 1.0.0 it's not possible unless you do it manually somehow, for example, as you mentioned, with a bash script call.
I also want to mention that the documentation clearly states that users should be careful when calling coalesce with a really small number of partitions, as this can cause upstream partitions to inherit this number of partitions.
I would not recommend using coalesce(1) unless really required.
Here's my answer for outputting a single file; I just added coalesce(1). The original:
val year = sc.textFile("apat63_99.txt")
.map(_.split(",")(1))
.flatMap(_.split(","))
.map((_,1))
.reduceByKey((_+_)).map(_.swap)
year.saveAsTextFile("year")
With coalesce(1):
year.coalesce(1).saveAsTextFile("year")