New to PySpark
I am loading a JSON file from HDFS. It reads the data from the logs one at a time, say picking up date and config1d from each log and loading them into a JSON file.
Is there a way to load only 5 or 10 percent of the data using random sampling, without loading the whole JSON file into memory? Loading the whole JSON file takes more than an hour for me.
Please let me know if you have more questions about this.
For a DataFrame df you can use df.sample(fraction=0.05, seed=3) to sample 5 percent of the rows. fraction is a number between 0 and 1; seed is optional, and a random seed is used if you leave it out.
In Spark it's not possible to do this without first loading all the data into memory. You have to load it first and then apply sample (a transformation), as #firtree said.
I have a pyspark.sql.DataFrame that I would like to save as .csv. This is what I am doing.
df.toPandas().to_csv('myDF.csv')
Is it possible to partition the data into different chunks and save them as separate files?
You can achieve this using repartition or coalesce, for example:
df.repartition(10).write.csv('myDF.csv')   # e.g. 10 output part files
df.coalesce(1).write.csv('myDF.csv')       # or a single part file
Do not convert the Spark DataFrame to pandas; save it to files directly.
I am trying to measure how long it takes to read and write parquet files in Amazon S3 (under a specific partition).
For that I wrote a script that simply reads the files and then writes them back:
val df = sqlContext.read.parquet(path + "p1.parquet/partitionBy=partition1")
df.write.mode("overwrite").parquet(path + "p1.parquet/partitionBy=partition1")
However, I get a NullPointerException. I tried adding df.count in between, but got the same error.
The reason for the error is that Spark reads the data lazily, only when it is actually used. Here that means Spark is still reading from the files while trying to overwrite them, and data can't be overwritten while it is being read.
Since this is for timing purposes, I'd recommend saving to a temporary location instead. An alternative is to .cache() the data when reading, perform an action to force the read (and actually populate the cache), and then overwrite the files.
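A minimal sketch of the cache-then-overwrite variant, reusing the paths from the question (the cache is best effort, so the temporary-location approach is the safer choice):
val df = sqlContext.read.parquet(path + "p1.parquet/partitionBy=partition1")
df.cache()   // mark the DataFrame for caching
df.count()   // action: forces the read and populates the cache
df.write.mode("overwrite").parquet(path + "p1.parquet/partitionBy=partition1")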
I am stuck with the following problem. I have around 30,000 JSON files stored in S3 inside a particular bucket. Each file is very small, only 400-500 KB, but there are a lot of them.
I want to create a DataFrame based on all these files. I am reading the JSON files using a wildcard as follows:
var df = sqlContext.read.json("s3n://path_to_bucket/*.json")
I also tried this approach since json(...) is deprecated:
var df = sqlContext.read.format("json").load("s3n://path_to_bucket/*.json")
The problem is that it takes a very long time to create df. I was waiting 4 hours and the Spark job was still running.
Is there any more efficient approach to collect all these JSON files and create a DataFrame based on them?
UPDATE:
Or at least, is it possible to read the last 1000 files instead of all of them? I found out that one can pass options via sqlContext.read.format("json").options, but I cannot figure out how to read only the N newest files.
If you can get the last 1000 modified file names into a simple list you can simply call:
sqlContext.read.format("json").json(filePathsList: _*)
Please note that the .option call(s) are used to configure read options (such as schema-related settings), not to select which files are read.
Unfortunately, I haven't used S3 before, but I think you can use the same logic as in the answer to this question to get the last-modified file names:
How do I find the last modified file in a directory in Java?
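A rough sketch of that idea using the Hadoop FileSystem API (the bucket path is the placeholder from the question, and I haven't verified this against S3):
import org.apache.hadoop.fs.{FileSystem, Path}

// list the JSON files in the bucket, sort them newest first, and keep the 1000 newest paths
val fs = FileSystem.get(new java.net.URI("s3n://path_to_bucket"), sc.hadoopConfiguration)
val newestPaths = fs.listStatus(new Path("s3n://path_to_bucket/"))
  .filter(_.getPath.getName.endsWith(".json"))
  .sortBy(-_.getModificationTime)
  .take(1000)
  .map(_.getPath.toString)

// pass the selected paths, as in the call above
val df = sqlContext.read.json(newestPaths: _*)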
You are loading roughly 13 GB of information. Are you sure it is just creating the DF that takes so long? Maybe the rest of the application is running and the UI only makes it look like the read is the slow part.
Try just loading and printing the first row of the DF (see the short sketch below).
Anyway, what is the configuration of the cluster?
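For example, a quick check of that, reusing the wildcard read from the question:
val df = sqlContext.read.json("s3n://path_to_bucket/*.json")
df.show(1)   // prints the first row of the DataFrame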
I am writing an ETL process where I will need to read hourly log files, partition the data, and save it. I am using Spark (in Databricks).
The log files are CSV so I read them and apply a schema, then perform my transformations.
My problem is, how can I save each hour's data in parquet format but append to the existing data set? When saving, I need to partition by 4 columns present in the dataframe.
Here is my save line:
data
.filter(validPartnerIds($"partnerID"))
.write
.partitionBy("partnerID","year","month","day")
.parquet(saveDestination)
The problem is that if the destination folder exists the save throws an error.
If the destination doesn't exist then I am not appending my files.
I've tried using .mode("append"), but I find that Spark sometimes fails midway through, so I lose track of how much of my data has been written and how much I still need to write.
I am using parquet because the partitioning substantially speeds up my querying in the future. As well, I must write the data as some file format on disk and cannot use a database such as Druid or Cassandra.
Any suggestions for how to partition my dataframe and save the files (either sticking with parquet or using another format) are greatly appreciated.
If you need to append the files, you definitely have to use the append mode. I don't know how many partitions you expect it to generate, but I find that if you have many partitions, partitionBy will cause a number of problems (memory- and IO-issues alike).
If you think that your problem is caused by write operations taking too long, I recommend that you try these two things:
1) Use snappy by adding to the configuration:
conf.set("spark.sql.parquet.compression.codec", "snappy")
2) Disable generation of the metadata files in the hadoopConfiguration on the SparkContext like this:
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
The metadata files are somewhat time consuming to generate (see this blog post), but according to this they are not actually important. Personally, I always disable them and have no issues.
If you generate many partitions (> 500), I'm afraid the best I can do is suggest that you look into a solution that does not use append mode - I simply never managed to get partitionBy to work with that many partitions.
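For reference, a sketch of where those two settings go, assuming a plain SparkConf/SparkContext setup (the app name is a placeholder; in Databricks you would set them on the already existing context instead):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf().setAppName("hourly-etl")
conf.set("spark.sql.parquet.compression.codec", "snappy")              // 1) snappy compression
val sc = new SparkContext(conf)
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false") // 2) no summary metadata
val sqlContext = new SQLContext(sc)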
If you're using unsorted partitioning, your data is going to be split across all of your partitions. That means every task will generate and write data to each of your output files.
Consider repartitioning your data according to your partition columns before writing, so that all the data for each output file lands on the same partition:
data
.filter(validPartnerIds($"partnerID"))
.repartition($"partnerID", $"year", $"month", $"day") // optionally pass a number of partitions as the first argument
.write
.partitionBy("partnerID","year","month","day")
.parquet(saveDestination)
See: DataFrame.repartition
I'm a bit confused about the right way to save data into HDFS after processing it with Spark.
This is what I'm trying to do. I'm calculating the min, max and SD of numeric fields. My input files have millions of rows, but the output will have only around 15-20 fields. So, the output is a single value (a scalar) for each field.
For example: I will load all the rows of FIELD1 into an RDD, and at the end I will get 3 single values for FIELD1 (MIN, MAX, SD). I concatenate these three values into a temporary string. In the end, I will have 15 to 20 rows, containing 4 columns, in the following format:
FIELD_NAME_1 MIN MAX SD
FIELD_NAME_2 MIN MAX SD
This is a snippet of the code:
//create rdd
val data = sc.textFile("hdfs://x.x.x.x/"+args(1)).cache()
//take one column from each row (index 1 of the comma-separated fields)
val values = data.map(_.split(",", -1)(1))
//treat empty values as 0 and convert to Double
val data_double = values.map(x => if (x == "") 0.0 else x.toDouble)
val min_value= data_double.map((_,1)).reduceByKey((_+_)).sortByKey(true).take(1)(0)._1
val max_value= data_double.map((_,1)).reduceByKey((_+_)).sortByKey(false).take(1)(0)._1
val SD = data_double.stdev
So, I have 3 variables, min_value, max_value and SD, that I want to store back to HDFS.
Question 1:
Since the output will be rather small, should I just save it locally on the server, or should I dump it to HDFS? It seems to me that saving the file locally makes more sense.
Question 2:
In Spark, I can just call the following to save an RDD into a text file:
some_RDD.saveAsTextFile("hdfs://namenode/path")
How do I accomplish the same thing for a String variable that is not an RDD, in Scala? Should I parallelize my result into an RDD first and then call saveAsTextFile?
To save locally just do
some_RDD.collect()
Then save the resulting array with something like in this question. And yes, if the data set is small and can easily fit in memory, you should collect it and bring it to the driver of the program. Another option, if the data is a little too large to store in memory, is some_RDD.coalesce(numPartitionsToStoreOn). Keep in mind that coalesce also takes a boolean shuffle parameter; if you are doing calculations on the data before coalescing, you should set this to true to get more parallelism for those calculations. Coalesce will reduce the number of nodes that store data when you call some_RDD.saveAsTextFile("hdfs://namenode/path"). If the file is very small but you need it on HDFS, call repartition(1), which is the same as coalesce(1, true); this will ensure that your data is saved in only one partition.
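A short sketch of those two save paths (the partition count is a placeholder):
// reduce the number of partitions before writing; shuffle = true keeps the upstream calculations parallel
some_RDD.coalesce(8, shuffle = true).saveAsTextFile("hdfs://namenode/path")
// or force a single output file
some_RDD.repartition(1).saveAsTextFile("hdfs://namenode/path")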
UPDATE:
So if all you want to do is save three values in HDFS you can do this.
sc.parallelize(List((min_value,max_value,SD)),1).saveAsTextFile("pathTofile")
Basically you are just putting the 3 vars in a tuple, wrapping that in a List, and setting the parallelism to one since the data is very small.
Answer 1: Since you just need a few scalars, I'd suggest storing them in the local file system. You can first do val localValue = rdd.collect(), which will collect all the data from the workers to the driver, and then use java.io to write it to disk.
Answer 2: You can do sc.parallelize(Seq(yourString)).saveAsTextFile("hdfs://host/yourFile"). This will write the output to part-000* files. If you want everything in one file, hdfs dfs -getmerge is there to help you.
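A minimal sketch of Answer 1, writing the collected scalars to a local file on the driver (the output path and field name are placeholders):
import java.io.PrintWriter

// write one "FIELD_NAME MIN MAX SD" line per field to a local file
val writer = new PrintWriter("/tmp/field_stats.txt")
writer.println(s"FIELD_NAME_1 $min_value $max_value $SD")
writer.close()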