How to process large amounts of data in Scala fs2?

We have a Scala utility which reads data from a database and writes it to a text file in CSV format, using the fs2 library. It then does some processing on a few columns and creates the final file. So it is a 2-step process:
Read the data from the db and create a data_tmp CSV file.
Process a few columns from the _tmp file and create the final data_final CSV file.
We use code similar to the example at this link:
https://levelup.gitconnected.com/how-to-write-data-processing-application-in-fs2-2b6f84e3939c
Stream.resource(Blocker[IO]).flatMap { blocker =>
  val inResource = getClass.getResource(in)   // data_tmp file location
  val outResource = getClass.getResource(out) // data_final file location
  io.file
    .readAll[IO](Paths.get(inResource.toURI), blocker, 4096)
    .through(text.utf8Decode)
    .through(text.lines)
    ..... // our processing logic here
    .through(text.utf8Encode)
    .through(io.file.writeAll(Paths.get(outResource.toURI), blocker))
}
Until now this worked, since we did not have more than 5k records.
Now we have a new requirement where we expect the data returned by the db query to be in the range of 50k to 1000k records.
So we want to create multiple data_final files like data_final_1, data_final_2, ... and so on.
Each output file should not be more than a specific size, let's say 2 MB.
So data_final should be created in chunks of 2 MB.
How can I modify the above code snippet so that we can create multiple output files from a single large data_tmp CSV file?
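One possible direction, assuming your fs2 2.x version already ships io.file.writeRotate (worth checking in your release): instead of writeAll, writeRotate takes an effect that computes the next output path plus a byte limit, and it starts a new file every time that limit is reached. The sketch below keeps the read side of your snippet; the hard-coded paths, the counter and the 2 MB limit are only illustrative, and the same IOApp/ContextShift context as in your original code is assumed. Note that rotation is driven purely by byte counts, so if each output file must end on a complete CSV row you may need to regroup the stream into chunks of whole lines before encoding.

import java.nio.file.{Path, Paths}
import java.util.concurrent.atomic.AtomicInteger

import cats.effect.{Blocker, IO}
import fs2._

// Illustrative naming: produces data_final_1.csv, data_final_2.csv, ...
val counter = new AtomicInteger(0)
val nextPath: IO[Path] =
  IO(Paths.get(s"data_final_${counter.incrementAndGet()}.csv"))

val limitBytes: Long = 2L * 1024 * 1024 // roughly 2 MB per output file

Stream.resource(Blocker[IO]).flatMap { blocker =>
  io.file
    .readAll[IO](Paths.get("data_tmp.csv"), blocker, 4096)
    .through(text.utf8Decode)
    .through(text.lines)
    // ..... same processing logic on each line as before
    .intersperse("\n") // restore line separators before encoding
    .through(text.utf8Encode)
    // writeRotate closes the current file and opens the next path produced
    // by nextPath once limitBytes bytes have been written to it
    .through(io.file.writeRotate[IO](nextPath, limitBytes, blocker))
}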

Related

I have a file which has a transaction ID and XML data separated by a comma (,). I want to keep only the XML by removing the transaction ID

I have a file which has a transactionId and XML data in it, separated by a comma (,).
I want to keep only the XML by removing the transaction ID and then process the XML data, but I am not able to do so in PySpark.
Method 1 - I followed:
I tried to read the file as CSV, dropped the first column, concatenated the remaining columns, and wrote the file out in text format.
from pyspark.sql.functions import col, concat_ws

df_spike = spark.read.format('csv').option('delimiter', ',').load(readLocation)
df_spike = df_spike.drop("_c0")
columns = df_spike.columns
df_spike = df_spike.withColumn('fresh', concat_ws(",", *[col(x) for x in columns]))
df_final = df_spike.select('fresh')
df_final.write.text(location)
After writing this data in text format, when I try to read the data back as XML, not all the rows are reflected.
Method 2 - I followed:
I read the data as a text file, collected the value of the column row by row, and removed the transaction ID from each one.
list_data = []
for i in range(df_spike.count()):
    df_collect = df_spike.collect()[i][0]
    df_list_data = df_collect[12:]
    list_data.append(df_list_data)
This method works, but it takes excessive time because it traverses the data row by row.
Is there any efficient method to achieve this?
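One efficient alternative, sketched here with Spark's Scala API (PySpark exposes the same read.text and regexp_replace functions, so the idea carries over directly), is to read each line as plain text and strip everything up to and including the first comma with a single column expression, so nothing is ever collected to the driver. The paths and the app name below are placeholders.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, regexp_replace}

val spark = SparkSession.builder().appName("StripTransactionId").getOrCreate()

// Placeholder paths, stand-ins for the real input/output locations
val readLocation = "/path/to/input"
val writeLocation = "/path/to/output"

// Read each line of the file as a single string column named "value"
val lines = spark.read.text(readLocation)

// Drop everything up to and including the first comma, keeping only the XML
val xmlOnly = lines.select(
  regexp_replace(col("value"), "^[^,]*,", "").alias("value")
)

// Write the XML payload back out as plain text, one record per line
xmlOnly.write.text(writeLocation)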

write only when all tables are valid with databricks and delta table

I'm looping through some CSV files in a folder. I want to write these CSV files as Delta tables only if they are all valid. Each CSV file in the folder has a different name and schema. I want to reject the entire folder and all the files it contains until the data is fixed. I'm running a lot of tests, but ultimately I have to actually write the files as Delta tables with the following loop (simplified for this question):
for f in files:
    # read csv
    df = spark.read.csv(f, header=True, schema=schema)
    # writing to already existing delta table
    df.write.format("delta").save('path/' + f)
Is there a callback mechanism so the write method is executed only if none of the dataframes returns any errors? Delta table schema enforcement is pretty rigid, which is great, but errors can pop up at any time despite all the tests I'm running before passing these files to this loop.
union is not an option because I want to handle this by date and each file has a different schema and name.
You can use df.union() or df.unionByName() to read all of your files into a single dataframe. Then that one is either written fully or fails.
# Create empty dataframe with schema to fill up
emptyRDD = spark.sparkContext.emptyRDD()
df = spark.createDataFrame(emptyRDD, schema)

for f in files:
    # read csv
    dfNext = spark.read.csv(f, header=True, schema=schema)
    df = df.unionByName(dfNext)

df.write.format("delta").save(path)

How to export statistics_log data cumulatively to the desired Excel sheet in AnyLogic?

I have selected to export tables at the end of model execution to an Excel file, and I would like that data to accumulate on the same Excel sheet after every stop and start of the model. As of now, every stop and start just exports that 1 run's data and overwrites what was there previously. I may be approaching the method of exporting multiple runs wrong/inefficiently but I'm not sure.
The best method is to export the raw data, as you do (if it is not too large).
However, there are 2 improvements:
Manage your output data yourself, i.e. do not rely on the standard export tables but write only the data that you really need. Check this help article to learn how to write your own data.
In your custom output data tables, add additional identification columns such as date_of_run. I often use iteration and replication columns to also identify which of those the data stems from.
Custom CSV approach
An alternative approach is to create your own CSV file programmatically; this is possible with Java code. Then, you can create a new one (with a custom filename) after any run:
First, define a “Text file” element as below:
Then, use the code below to create your own CSV with a custom name and write to it:
File outputDirectory = new File("outputs");
outputDirectory.mkdir();
String outputFileNameWithExtension = outputDirectory.getPath() + File.separator + "output_file.csv";
file.setFile(outputFileNameWithExtension, Mode.WRITE_APPEND);

// create header
file.println("col_1" + "," + "col_2");

// Write data from dbase table
List<Tuple> rows = selectFrom(my_dbase_table).list();
for (Tuple row : rows) {
    file.println(row.get(my_dbase_table.col_1) + "," +
                 row.get(my_dbase_table.col_2));
}
file.close();

Hive SaveAsTable creates a new Parquet table file for every run

I have the following Scala code that I use to write data from a json file to a table in Hive.
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive._

val conf = new SparkConf().setAppName("App").setMaster("local")
val hiveContext = new HiveContext(sc)
val stg_comments = hiveContext.read.schema(buildSchema()).json(<path to json file>)
stg_comments.write.mode("append").saveAsTable(<table name>)
My JSON data has newline and carriage return characters in its field values, so I cannot simply insert records into Hive (because Hive tables by default do not store newline and carriage return characters in the data values), and hence I need to use the saveAsTable option. The issue here is that every time a JSON file is read and new records are appended to the existing table, a new parquet file is created in the table directory in the Hive warehouse directory. This leads to really small parquet files in the directory. I would like the data to be appended to the existing parquet file. Do we know how to do that? Thanks!
This is expected behavior. There is no append-to-existing-file option here. Each job has its own set of tasks, and each task has its own output file. Repartitioning before the write can reduce the number of files written, but it will not prevent new files from being created.
If the number of files becomes a problem, you have to run a separate job that reads the existing small files and merges them into larger chunks.
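As a rough sketch of such a merge job, assuming Spark with Hive support is available; the table names and the target file count are placeholders, not values from the question:

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("CompactSmallParquetFiles")
  .enableHiveSupport()
  .getOrCreate()

// Read the table that has accumulated many small parquet files
val df = spark.table("my_db.my_table")            // placeholder table name

// coalesce reduces the number of output files; pick a count that gives
// reasonably sized files for your data volume
df.coalesce(4)
  .write
  .mode(SaveMode.Overwrite)
  .saveAsTable("my_db.my_table_compacted")        // placeholder staging table

// Afterwards, swap the compacted table in place of the original
// (writing back into a table while reading from it is not safe).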

issue insert data in hive create small part files

I am processing more than 1,000,000 records from a JSON file. I read the file line by line and extract the required key values
(the JSON structure is mixed and not fixed, so I parse it and generate the required JSON elements), generate a JSON string similar to the json_string variable below, and push it to a Hive table. The data is stored properly, but the hadoop apps/hive/warehouse/jsondb.myjson_table folder contains small part files. Every insert query creates a new small (.1 to .20 kb) part file. Because of that, even a simple query on Hive takes more than 30 minutes. A sample of my logic is shown below; it iterates multiple times to insert new records into Hive.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SparkSessionZipsExample").enableHiveSupport().getOrCreate()
import spark.implicits._

var json_string = """{"name":"yogesh_wagh","education":"phd" }"""
val df = spark.read.json(Seq(json_string).toDS)

//df.write.format("orc").saveAsTable("bds_data1.newversion");
df.write.mode("append").format("orc").insertInto("bds_data1.newversion");
I have also tried adding the Hive properties to merge the files, but it doesn't work.
I have also tried to create a table from the existing table to combine the small part files into 256 MB files.
Please share sample code for inserting multiple records and appending records to an existing part file.
I think each of those individual inserts creates a new part file.
You could create a dataset/dataframe of these JSON strings and then save it to the Hive table.
You could merge the existing small files using the Hive DDL: ALTER TABLE table_name CONCATENATE;
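A minimal sketch of the first suggestion, assuming the generated JSON strings can be accumulated into a collection before writing, so one read and one append replace many per-record inserts (the table name reuses the one from the question; the sample records are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("BatchedJsonInsert")
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._

// Accumulate the generated JSON strings instead of inserting them one at a time
val jsonStrings: Seq[String] = Seq(
  """{"name":"yogesh_wagh","education":"phd"}""",
  """{"name":"another_name","education":"msc"}"""
)

// One read and one append produce far fewer part files than one insert per record
val df = spark.read.json(jsonStrings.toDS)
df.write.mode("append").format("orc").insertInto("bds_data1.newversion")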