I have a program written in Python that expects a CSV file. The CSV data is supposed to come from a Scala job that uses Spark to read the data from the source and store it in a temp table like below.
abb.createOrReplaceTempView("tempt")
tempt is the temp view created by the Spark command above.
I want to store the tempt data in a CSV file, /tmp/something.csv.
But I did not find anything in Scala with Spark that serves this purpose.
Please suggest the best way to store tempt in a CSV file.
Declaring tempt as a temp view allows you to reference it when you write SQL commands in Spark.
If you want to save the DataFrame itself, use abb.write.csv("file_name").
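A minimal sketch of that, writing the DataFrame behind the temp view to the path from the question (the coalesce and header options below are optional extras, not required by the API):

// Write abb out as CSV under /tmp/something.csv.
// Note: Spark creates a directory with that name containing part files;
// coalesce(1) keeps it to a single part file inside that directory.
abb.coalesce(1)
  .write
  .option("header", "true")
  .mode("overwrite")
  .csv("/tmp/something.csv")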
Hi, I have 90 GB of data in a CSV file. I load this data into a temp table and then from the temp table into an ORC table using an insert-select, but converting and loading the data into ORC format takes 4 hours in Spark SQL. Is there any optimization technique I can use to reduce this time? As of now I am not using any optimization technique; I just use Spark SQL to load the data from the CSV file into a table (text format) and then from this temp table into the ORC table (using insert-select).
I am using spark-submit as:
spark-submit \
  --class class-name \
  --jar file
Can I add any extra parameters to spark-submit to improve the performance?
Scala code (sample):
// all imports
object sample_1 {
  def main(args: Array[String]): Unit = {
    // SparkSession with Hive support enabled
    val sparkSession = SparkSession.builder().enableHiveSupport().getOrCreate()
    val a1 = sparkSession.sql("load data inpath 'filepath' overwrite into table table_name")
    val b1 = sparkSession.sql("insert into tablename (all_column) select 'ALL_COLUMNS' from source_table")
  }
}
First of all, you don't need to store the data in a temp table in order to write to the Hive table later. You can read the file directly and write the output using the DataFrameWriter API. This removes one step from your code.
You can write as follows:
val spark = SparkSession.builder.enableHiveSupport().getOrCreate()
val df = spark.read.csv(filePath) // add header or delimiter options if needed
df.write.mode("append").format(outputFormat).saveAsTable(outputDB + "." + outputTableName)
Here, outputFormat will be "orc", outputDB will be your Hive database, and outputTableName will be your Hive table name.
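For illustration, those placeholders might be bound to concrete (hypothetical) values like:

val outputFormat = "orc"          // write the table as ORC
val outputDB = "mydb"             // hypothetical Hive database name
val outputTableName = "orc_table" // hypothetical Hive table name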
With the above approach, your write time should reduce significantly. Also, please mention the resources your job is using and I may be able to suggest further optimizations.
Another optimization is to partition your DataFrame while writing. This will make the write operation faster. However, you need to choose the partition columns carefully so that you don't end up creating too many partitions.
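A minimal sketch of that, assuming a hypothetical low-cardinality column named load_date exists in the data:

// Partition the ORC output by a hypothetical "load_date" column;
// replace it with a real column that has a manageable number of distinct values.
df.write
  .mode("append")
  .format("orc")
  .partitionBy("load_date")
  .saveAsTable(outputDB + "." + outputTableName)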
In Spark Streaming, when the input source is a CSV file that I read through a socket (Java), a Dataset<Row> is created with only a single string column, and the value of each row contains one line sent through the socket.
When I know the format of each line, e.g. the first two values of the CSV line are strings, the next is an integer, and so on, is it possible to declare my schema, create another Dataset<Row> based on that schema, and place the data accordingly?
Thank you in advance.
First of all, if it is a CSV file I don't see any point in using Spark Streaming for it. It is historical data and the data is not changing, so you should just use Spark SQL to read and process the CSV.
You can create your schema by creating StructField objects and declaring the data types.
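A minimal sketch of that in Scala, assuming the line layout described in the question (two strings followed by an integer); the column names and file path are placeholders:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("CsvWithSchema").getOrCreate()

// Hypothetical schema matching the described line format.
val schema = StructType(Seq(
  StructField("first",  StringType,  nullable = true),
  StructField("second", StringType,  nullable = true),
  StructField("count",  IntegerType, nullable = true)
))

// Read the CSV directly with the declared schema instead of streaming it over a socket.
val df = spark.read.schema(schema).csv("/path/to/file.csv")
df.printSchema()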
I am processing more than 1,000,000 records from a JSON file. I read the file line by line and extract the required key values (the JSON has a mixed, non-fixed structure, so I parse it and generate the required JSON elements), generate a JSON string similar to the json_string variable below, and push it to a Hive table. The data is stored properly, but the hadoop apps/hive/warehouse/jsondb.myjson_table folder contains small part files: every insert query creates a new part file (0.1 to 0.20 KB). Because of that, even a simple query on Hive takes more than 30 minutes. A sample of my logic is shown below; it iterates multiple times to insert new records into Hive.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SparkSessionZipsExample").enableHiveSupport().getOrCreate()
import spark.implicits._ // needed for .toDS

var json_string = """{"name":"yogesh_wagh","education":"phd" }"""
val df = spark.read.json(Seq(json_string).toDS)
//df.write.format("orc").saveAsTable("bds_data1.newversion");
df.write.mode("append").format("orc").insertInto("bds_data1.newversion");
I have also tried adding the Hive property to merge the files, but it does not work.
I have also tried creating a table from the existing table to combine the small part files into single 256 MB files.
Please share sample code to insert multiple records and append records to an existing part file.
I think each of those individual inserts is creating a new part file.
You could create a Dataset/DataFrame from a batch of these JSON strings and then save it to the Hive table in a single write, as sketched below.
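A minimal sketch of that batching idea, assuming the JSON strings can be accumulated in a collection before a single append (the second sample string is a placeholder):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
import spark.implicits._

// Hypothetical batch: collect many JSON strings first instead of inserting one at a time.
val jsonStrings: Seq[String] = Seq(
  """{"name":"yogesh_wagh","education":"phd"}""",
  """{"name":"other_name","education":"msc"}"""
)

// One write for the whole batch produces far fewer part files than one insert per record.
val df = spark.read.json(jsonStrings.toDS)
df.write.mode("append").insertInto("bds_data1.newversion")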
You could also merge the existing small files using the Hive DDL statement ALTER TABLE table_name CONCATENATE;
I am trying to follow this example to save some data in Parquet format and read it back. If I use write.parquet("filename"), then the iterating Spark job gives the error that
"filename" already exists.
If I use the SaveMode.Append option, then the Spark job gives the error
"org.apache.spark.sql.AnalysisException: Specifying database name or other qualifiers are not allowed for temporary tables".
Please let me know the best way to ensure that new data is simply appended to the Parquet file. Can I define primary keys on these Parquet tables?
I am using Spark 1.6.2 on a Hortonworks 2.5 system. Here is the code:
// Option 1: peopleDF.write.parquet("people.parquet")
//Option 2:
peopleDF.write.format("parquet").mode(SaveMode.Append).saveAsTable("people.parquet")
// Read in the parquet file created above
val parquetFile = spark.read.parquet("people.parquet")
//Parquet files can also be registered as tables and then used in SQL statements.
parquetFile.registerTempTable("parquetFile")
val teenagers = sqlContext.sql("SELECT * FROM people.parquet")
I believe that if you use .parquet("...."), you should use .mode("append"),
not SaveMode.Append:
df.write.mode("append").parquet("....")
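A slightly fuller sketch of that pattern, writing to a plain Parquet path rather than a table name (the path is a placeholder, and spark refers to the same session object used in the question's code):

// Append new rows to an existing Parquet directory (or create it on the first run).
peopleDF.write.mode("append").parquet("/tmp/people_parquet")

// Read the Parquet data back and register it as a temp table for SQL queries.
val parquetFile = spark.read.parquet("/tmp/people_parquet")
parquetFile.registerTempTable("parquetFile")
val teenagers = spark.sql("SELECT * FROM parquetFile")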
I use this method to write a CSV file, but it generates a directory containing multiple part files. That is not what I want; I need it in one file. I also found another post that uses Scala to force everything to be computed on one partition, so that only one file is produced.
First question: how do I achieve this in Python?
The second post also says that a Hadoop function could merge multiple files into one.
Second question: is it possible to merge two files in Spark?
You can use:
df.coalesce(1).write.csv('result.csv')
Note:
when you use the coalesce function you will lose your parallelism.
You can do this by using the cat command line utility as below. This will concatenate all of the part files into one CSV. There is no need to repartition down to one partition.
import os
test.write.csv('output/test')
os.system("cat output/test/p* > output/test.csv")
The requirement is to save an RDD to a single CSV file by bringing the RDD to one executor, which means the RDD partitions present across executors will be shuffled to that executor. We can use coalesce(1) or repartition(1) for this purpose. In addition, we can add a column header to the resulting CSV file.
First, we can define a utility function to make the data CSV-compatible.
def toCSVLine(data):
    return ','.join(str(d) for d in data)
Let's suppose MyRDD has five columns and needs 'ID', 'DT_KEY', 'Grade', 'Score', 'TRF_Age' as column headers. So I create a header RDD and union it with MyRDD as below, which most of the time keeps the header at the top of the CSV file.
unionHeaderRDD = sc.parallelize([('ID', 'DT_KEY', 'Grade', 'Score', 'TRF_Age')]) \
    .union(MyRDD)
unionHeaderRDD.coalesce(1).map(toCSVLine).saveAsTextFile("MyFileLocation")
The saveAsPickleFile RDD API method can be used to serialize the data that is saved, in order to save space. Use sc.pickleFile to read the pickled file back.
I needed my CSV output as a single file with headers, saved to an S3 bucket with the filename I provided. The currently accepted answer, when I run it (Spark 3.3.1 on a Databricks cluster), gives me a folder with the desired filename, and inside it there is one CSV file (due to coalesce(1)) with a random name and no headers.
I found that converting to pandas as an intermediate step produced just a single file with headers, exactly as expected.
my_spark_df.toPandas().to_csv('s3_csv_path.csv',index=False)