I have a file which has a transaction ID and XML data separated by a comma (,). I want to keep only the XML by removing the transaction ID - pyspark

I have a file which has a transactionId and XML data in it, separated by a comma (,).
I want to keep only the XML by removing the transaction ID and then process the XML data, but I am not able to do so in PySpark.
Method 1 - I followed:
I tried to read the file as CSV, then drop the first column, concatenate the remaining columns, and write the file out in text format.
from pyspark.sql.functions import concat_ws, col

df_spike = spark.read.format('csv').option('delimiter', ',').load(readLocation)
df_spike = df_spike.drop("_c0")
columns = df_spike.columns
df_spike = df_spike.withColumn('fresh', concat_ws(",", *[col(x) for x in columns]))
df_final = df_spike.select('fresh')
df_final.write.text(location)
After writing this data in text format, when I try to read the data back as XML, not all the rows are reflected.
Method 2 - I followed:
I read the data as a text file, collected the value of the column row by row, and stripped the leading transaction ID characters from each row.
list_data = []
for i in range(df_spike.count()):
    df_collect = df_spike.collect()[i][0]
    df_list_data = df_collect[12:]
    list_data.append(df_list_data)
This method works fine, but it takes excessive time because it traverses the data row by row.
Is there any efficient method to achieve this?
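One more efficient option, assuming the transaction ID itself never contains a comma, is to keep everything on the executors: read each line as plain text and split on the first comma only, keeping the second part. A minimal PySpark sketch (readLocation and location as in the question; the limit argument of split needs Spark 3.0+, the commented expr line does the same on older versions):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# read every line as a single string column named "value"
df_raw = spark.read.text(readLocation)

# split on the first comma only (limit=2) and keep the XML part
df_xml = df_raw.select(F.split(F.col("value"), ",", 2).getItem(1).alias("value"))

# Spark < 3.0 alternative:
# df_xml = df_raw.select(F.expr("substring(value, instr(value, ',') + 1)").alias("value"))

df_xml.write.text(location)

This avoids the driver-side collect() loop entirely, so it scales with the cluster instead of with the row count.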

Related

How to export statistics_log data cumulatively to the desired Excel sheet in AnyLogic?

I have selected to export tables at the end of model execution to an Excel file, and I would like that data to accumulate on the same Excel sheet after every stop and start of the model. As of now, every stop and start just exports that 1 run's data and overwrites what was there previously. I may be approaching the method of exporting multiple runs wrong/inefficiently but I'm not sure.
The best method is to export the raw data, as you do (if it is not too large).
However, two improvements:
manage your output data yourself, i.e. do not rely on the standard export tables but only write data that you really need. Check this help article to learn how to write your own data
in your custom output data tables, add additional identification columns such as date_of_run. I often use iteration and replication columns to also identify from which of those the data stems.
Custom CSV approach
An alternative approach is to create your own CSV file programmatically, which is possible with Java code. Then, you can create a new one (with a custom filename) after any run:
First, define a “Text file” element as below:
Then, use this code below to create your own csv with a custom name and write to it:
File outputDirectory = new File("outputs");
outputDirectory.mkdir();
String outputFileNameWithExtension = outputDirectory.getPath()+File.separator+"output_file.csv";
file.setFile(outputFileNameWithExtension, Mode.WRITE_APPEND);
// create header
file.println("col_1" + "," + "col_2");
// write data from dbase table
List<Tuple> rows = selectFrom(my_dbase_table).list();
for (Tuple row : rows) {
    file.println(row.get(my_dbase_table.col_1) + "," +
                 row.get(my_dbase_table.col_2));
}
file.close();

Google Cloud Data Fusion is appending a column to original data

When I am loading encrypted data from a GCS source to a GCS sink, one additional column is getting added.
Original data
Employee ID,Employee First Name,Employee Last Name,Employee Joining Date,Employee location
1,Vinay,Argekar,01/01/2017,India
2,Thirukkumaran,Haridass,02/02/2017,USA
3,David,Wu,03/04/2000,Canada
4,Vinod,Kumar,04/02/2002,India
5,Joshua,Abraham,04/15/2010,France
6,Allaudin,Dastigar,09/24/2012,UK
7,Senthil,Kumar,08/15/2009,Germany
8,Sudha,Narayanan,12/14/2016,India
9,Ravi,Prasad,11/11/2011,Costa Rica
Data in the file after running the pipeline
0,Employee ID,Employee First Name,Employee Last Name,Employee Joining Date,Employee location
91,1,Vinay,Argekar,01/01/2017,India
124,2,Thirukkumaran,Haridass,02/02/2017,US
164,3,David,Wu,03/04/2000,Canada
193,4,Vinod,Kumar,04/02/2002,India
224,5,Joshua,Abraham,04/15/2010,France
259,6,Allaudin,Dastigar,09/24/2012,UK
293,7,Senthil,Kumar,08/15/2009,Germany
328,8,Sudha,Narayanan,12/14/2016,India
363,9,Ravi,Prasad,11/11/2011,Costa Rica
The first column (starting with 0) was not present in the original file.
When you were configuring the GCS source, did you specify the Format to be CSV or was it left as Text? When the Format is Text, the output schema actually contains an offset, which is the first column that you see in the output data. When you specify the format to be CSV, you have to specify the output schema of the file.

Retrieve the fields we are interested in and write the fields + count result to a result file using Scala Spark

I need to import a CSV file that contains several fields; I must later loop over some fields of interest to recover the data contained in them.
In the file there is a field named query that contains SQL queries that must be executed, with the results stored in another CSV file that will contain the fields to retrieve as well as the result of each query.
Below is my code so far:
// step 1: read the file
val table_requete = spark.read.format("com.databricks.spark.csv").option("header", "true").option("delimiter", ";").load("/user/swychowski/ClientAnlytics_Controle/00_Params/filtre.csv")
table_requete.registerTempTable("req")
// step 2: loop over the queries and write the results
However, I don't know how to loop over the queries and store the results to another file at the same time.
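One common pattern is to collect the query column to the driver, run each query with spark.sql, and build a small result dataframe that is written out as CSV. Below is a hedged sketch in PySpark (the Scala DataFrame API has the same calls); the column name query comes from the question, while the result columns and the output path are only illustrative:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

params = (spark.read.format("csv")
          .option("header", "true")
          .option("delimiter", ";")
          .load("/user/swychowski/ClientAnlytics_Controle/00_Params/filtre.csv"))

results = []
for row in params.select("query").collect():
    q = row["query"]
    results.append((q, spark.sql(q).count()))  # run each stored query and keep its count

(spark.createDataFrame(results, ["query", "count"])
    .write.option("header", "true")
    .csv("/user/swychowski/ClientAnlytics_Controle/result"))  # illustrative output path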

Scala - Writing dataframe to a file as binary

I have a Hive table stored as Parquet, with a column Content storing various documents as base64-encoded strings.
Now, I need to read that column and write it into a file in HDFS, so that the base64 column is converted back into a document for each row.
val profileDF = sqlContext.read.parquet("/hdfspath/profiles/");
profileDF.registerTempTable("profiles")
val contentsDF = sqlContext.sql("select unbase64(contents) as contents from profiles where file_name = 'file1'")
Now contentsDF stores the binary form of a document in each row, which I need to write to a file. I tried different options but couldn't get the dataframe content out to a file.
Appreciate any help regarding this.
I would suggest saving as parquet:
https://spark.apache.org/docs/1.6.3/api/java/org/apache/spark/sql/DataFrameWriter.html#parquet(java.lang.String)
Or converting to an RDD and saving as an object file:
https://spark.apache.org/docs/1.6.3/api/java/org/apache/spark/rdd/RDD.html#saveAsObjectFile(java.lang.String)
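For illustration, a hedged sketch of both suggestions, written against the current PySpark API rather than the 1.6 Scala one in the question (the method names are the same; the output paths are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.read.parquet("/hdfspath/profiles/").createOrReplaceTempView("profiles")
contentsDF = spark.sql("select unbase64(contents) as contents from profiles where file_name = 'file1'")

# option 1: keep the binary column and write it back as parquet
contentsDF.write.mode("overwrite").parquet("/hdfspath/profiles_decoded")

# option 2: drop to the RDD API and save as an object file
contentsDF.rdd.saveAsObjectFile("/hdfspath/profiles_objects")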

Issue: inserting data into Hive creates small part files

I am processing more than 1,000,000 records from a JSON file. I read the file line by line and extract the required key values (the JSON has a mixed structure that is not fixed, so I parse it and generate the required JSON elements), generate a JSON string similar to the json_string variable below, and push it to a Hive table. The data is stored properly, but the hadoop apps/hive/warehouse/jsondb.myjson_table folder contains small part files. With every insert query a new (.1 to .20 kb) part file is created. Because of that, even a simple query on Hive takes more than 30 minutes. Sample code of my logic is shown below; it iterates multiple times as new records are inserted into Hive.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SparkSessionZipsExample").enableHiveSupport().getOrCreate()
import spark.implicits._   // needed for .toDS on Seq

var json_string = """{"name":"yogesh_wagh","education":"phd" }"""
val df = spark.read.json(Seq(json_string).toDS)
//df.write.format("orc").saveAsTable("bds_data1.newversion");
df.write.mode("append").format("orc").insertInto("bds_data1.newversion");
I have also tried adding the Hive property to merge the files, but it won't work.
I have also tried creating a table from the existing table to combine the small part files into 256 MB files.
Please share sample code to insert multiple records and append records to an existing part file.
I think each of those individual inserts is creating a new part file.
You could create a dataset/dataframe of these JSON strings and then save it to the Hive table.
You could merge the existing small files using the Hive DDL ALTER TABLE table_name CONCATENATE;
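As a hedged illustration of the batching idea, sketched in PySpark (the Scala code in the question maps one-to-one onto these calls; the extra record below is made up):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("SparkSessionZipsExample")
         .enableHiveSupport()
         .getOrCreate())

# accumulate many generated JSON strings first instead of inserting them one at a time
json_strings = [
    '{"name":"yogesh_wagh","education":"phd"}',
    '{"name":"another_record","education":"msc"}',   # illustrative second record
]

# one read + one insert writes one batch of part files instead of one per record
df = spark.read.json(spark.sparkContext.parallelize(json_strings))
df.write.mode("append").format("orc").insertInto("bds_data1.newversion")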