I have really large tab delimited files (10GB-70GB) and need to do some read, data manipulation, and write to a separate file. The files can range from 100 to 10K columns with 2 million to 5 million rows.
The first x columns are static which are required for reference. Sample file format:
#ProductName Brand Customer1 Customer2 Customer3
Corolla Toyota Y N Y
Accord Honda Y Y N
Civic Honda 0 1 1
I need to use the first 2 columns to get a product id then generate an output file similar to:
ProductID1 Customer1 Y
ProductID1 Customer2 N
ProductID1 Customer3 Y
ProductID2 Customer1 Y
ProductID2 Customer2 Y
ProductID2 Customer3 N
ProductID3 Customer1 N
ProductID3 Customer2 Y
ProductID3 Customer3 Y
Current sample code:
val fileNameAbsPath = filePath + fileName
val outputFile = new PrintWriter(filePath+outputFileName)
var customerList = Array[String]()
for(line <- scala.io.Source.fromFile(fileNameAbsPath).getLines()) {
if(line.startsWith("#")) {
customerList = line.split("\t")
}
else {
val cols = line.split("\t")
val productid = getProductID(cols(0), cols(1))
for (i <- (2 until cols.length)) {
val rowOutput = productid + "\t" + customerList(i) + "\t" + parser(cols(i))
outputFile.println(rowOutput)
outputFile.flush()
}
}
}
outputFile.close()
One of tests that I ran took about 12 hours to read a file (70GB) that has 3 million rows and 2500 columns. The final output file generated 250GB with about 800+ million rows.
My question is: is there anything in Scala other than what I'm already doing that can offer quicker performance?
Ok, some ideas ...
As mentioned in the comments, you don't want to flush after every line. So, yeah, get rid of it.
Moreover, PrintWriter by default flushes after every newline anyway (so, currently, you are actually flushing twice :)). Use a two-argument constructor, when creating PrintWriter, and make sure the second parameter is false
You don't need to create BufferedWriter explicitly, PrintWriter is already buffering by default. The default buffer size is 8K, you might want to try to play around with it, but it will probably not make any difference, because, last I checked, the underlying FileOutputStream ignores all that, and flushes kilobyte-sized chunks either way.
Get rid of gluing rows together in a variable, and just write each field straight to the output.
If you do not care about the order in which lines appear in the output, you can trivially parallelize the processing (if you do care about the order, you still can, just a little bit less trivially), and write several files at once. That would help tremendously, if you place your output chunks on different disks and/or if you have multiple cores to run this code. You'd need to rewrite your code in (real) scala to make it thread safe, but that should be easy enough.
Compress data as it is being written. Use GZipOutputStream for example. That not only lets you reduce the physical amount of data actually hitting the disks, but also allows for a much larger buffer
Check out what your parser thingy is doing. You didn't show the implementation, but something tells me it is likely not free.
split can get prohibitively expensive on huge strings. People often forget, that its parameter is actually a regular expression. You are probably better off writing a custom iterator or just using good-old StringTokenizer to parse the fields out as you go, rather than splitting up-front. At the very least, it'll save you one extra scan per line.
Finally, last, but by no measure least. Consider using spark and hdfs. This kind of problems is the very area where those tools really excel.
Related
I'm just getting started with using spark, I've previously used python with pandas. One of the common things I do very regularly is compare datasets to see which columns have differences. In python/pandas this looks something like this:
merged = df1.merge(df2,on="by_col")
for col in cols:
diff = merged[col+"_x"] != merged[col+"_y"]
if diff.sum() > 0:
print(f"{col} has {diff.sum()} diffs")
I'm simplifying this a bit but this is the gist of it, and of course after this I'd drill down and look at for example:
col = "col_to_compare"
diff = merged[col+"_x"] != merged[col+"_y"]
print(merged[diff][[col+"_x",col+"_y"]])
Now in spark/scala this is turning out to be extremely inefficient. The same logic works, but this dataset is roughly 300 columns long, and the following code takes about 45 minutes to run for a 20mb dataset, because it's submitting 300 different spark jobs in sequence, not in parallel, so I seem to be paying the startup cost of spark 300 times. For reference the pandas one takes something like 300ms.
for(col <- cols){
val cnt = merged.filter(merged("dev_" + col) <=> merged("prod_" + col)).count
if(cnt != merged.count){
println(col + " = "+cnt + "/ "+merged.count)
}
}
What's the faster more spark way of doing this type of thing? My understanding is I want this to be a single spark job where it creates one plan. I was looking at transposing to a super tall dataset and while that could potentially work it ends up being super complicated and the code is not straightforward at all. Also although this example fits in memory, I'd like to be able to use this function across datasets and we have a few that are multiple terrabytes so it needs to scale for large datasets as well, whereas with python/pandas that would be a pain.
We are evaluating NEsper. Our focus is to monitor data quality in an enterprise context. In an application we are going to log every change on a lot of fields - for example in an "order". So we have fields like
Consignee name
Consignee street
Orderdate
....and a lot of more fields. As you can imagine the log files are going to grow big.
Because the data is sent by different customers and is imported in the application, we want to analyze how many (and which) fields are updated from "no value" to "a value" (just as an example).
I tried to build a test case with just with the fields
order reference
fieldname
fieldvalue
For my test cases I added two statements with context-information. The first one should just count the changes in general per order:
epService.EPAdministrator.CreateEPL("create context RefContext partition by Ref from LogEvent");
var userChanges = epService.EPAdministrator.CreateEPL("context RefContext select count(*) as x, context.key1 as Ref from LogEvent");
The second statement should count updates from "no value" to "a value":
epService.EPAdministrator.CreateEPL("create context FieldAndRefContext partition by Ref,Fieldname from LogEvent");
var countOfDataInput = epService.EPAdministrator.CreateEPL("context FieldAndRefContext SELECT context.key1 as Ref, context.key2 as Fieldname,count(*) as x from pattern[every (a=LogEvent(Value = '') -> b=LogEvent(Value != ''))]");
To read the test-logfile I use the csvInputAdapter:
CSVInputAdapterSpec csvSpec = new CSVInputAdapterSpec(ais, "LogEvent");
csvInputAdapter = new CSVInputAdapter(epService.Container, epService, csvSpec);
csvInputAdapter.Start();
I do not want to use the update listener, because I am interested only in the result of all events (probably this is not possible and this is my failure).
So after reading the csv (csvInputAdapter.Start() returns) I read all events, which are stored in the statements NewEvents-Stream.
Using 10 Entries in the CSV-File everything works fine. Using 1 Million lines it takes way to long. I tried without EPL-Statement (so just the CSV import) - it took about 5sec. With the first statement (not the complex pattern statement) I always stop after 20 minutes - so I am not sure how long it would take.
Then I changed my EPL of the first statement: I introduce a group by instead of the context.
select Ref,count(*) as x from LogEvent group by Ref
Now it is really fast - but I do not have any results in my NewEvents Stream after the CSVInputAdapter comes back...
My questions:
Is the way I want to use NEsper a supported use case or is this the root cause of my failure?
If this is a valid use case: Where is my mistake? How can I get the results I want in a performant way?
Why are there no NewEvents in my EPL-statement when using "group by" instead of "context"?
To 1), yes
To 2) this is valid, your EPL design is probably a little inefficient. You would want to understand how patterns work, by using filter indexes and index entries, which are more expensive to create but are extremely fast at discarding unneeded events.
Read:
http://esper.espertech.com/release-7.1.0/esper-reference/html_single/index.html#processingmodel_indexes_filterindexes and also
http://esper.espertech.com/release-7.1.0/esper-reference/html_single/index.html#pattern-walkthrough
Try the "previous" perhaps. Measure performance for each statement separately.
Also I don't think the CSV adapter is optimized for processing a large file. I think CSV may not stream.
To 3) check your code? Don't use CSV file for large stuff. Make sure a listener is attached.
I need to join 2 RDDs which are too big to load and join in a single processing. So I fetch some records from source RDD and destination RDD and join them iteratively. But I found with time went by the join speed became slower and slower and at last the program stopped at a certain stage. The stage's status was staging and never changed. It seems that the RDDs declared in the iteration had not been freed and thus the system has no enough memory for new RDDs. How to fix it?
var A: RDD[(Int, UUID)] = …
var B: RDD[(Int, UUID)] = …
for (i <- 0 until 64) {
var tmpA = A.filter(x => x._1%64 == i)
var tmpB = B.filter(x => x._1%64 == i)
var C = A.join(B)
println(C.count)
}
I believe at least part of your issue is that the original RDDs can't get out of memory. In each step you basically load them to memory in order to do the filter.
Instead you can split them into multiple RDDS and save each to disk. Then load only the relevant pair when you do the join. This means that by the next join it can release everything else.
That said, assuming your sample represents your actual code, it seems you have a schema for your RDD. I would try either using datasets or even better dataframes which take less memory (from my experience this can easily be a factor of 10) and then everything might fit in memory.
I have a table of distinct users, which has 400,000 users. I would like to split it into 4 parts, and expected each user located in one part only.
Here is my code:
val numPart = 4
val size = 1.0 / numPart
val nsizes = Array.fill(numPart)(size)
val data = userList.randomSplit(nsizes)
Then I write each data(i), i from 0 to 3, into parquet files. Select the directory, group by user id and count by part, there are some users that located in two or more parts.
I still have no idea why?
I have found the solution: cache the DataFrame before you split it.
Should be
val data = userList.cache().randomSplit(nsizes)
Still have no idea why. My guess, each time the randomSplit function "fill" the data, it reads records from userList which is re-evaluate from the parquet file(s), and give a different order of rows, that's why some users are lost and some users are duplicated.
That's what I thought. If some one have any answer or explanation, I will update.
References:
(Why) do we need to call cache or persist on a RDD
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-rdd-caching.html
http://159.203.217.164/using-sparks-cache-for-correctness-not-just-performance/
If your goal is to split it to different files you can use the functions.hash to calculate a hash, then mod 4 to get a number between 0 to 4 and when you write the parquet use partitionBy which would create a directory for each of the 4 values.
Let's assume that I have a very large dataset that can not be fit into the memory, there are millions of records in the dataset and I want to remove duplicate rows (actually keeping one row from the duplicates)
What's the most efficient approach in terms of space and time complexity ?
What I thought :
1.Using bloom filter , I am not sure about how it's implemented , but I guess the side effect is having false-positives , in that case how can we find if it's REALLY a duplicate or not ?
2.Using hash values , in this case if we have a small number of duplicate values, the number of unique hash values would be large and again we may have problem with memory ,
Your solution 2: using hash value doesn't force a memory problem. You just have to partition the hash space into slices that fits into memory. More precisely:
Consider a hash table storing the set of records, each record is only represented by its index in the table. Say for example that such a hash table will be 4GB. Then you split your hash space in k=4 slice. Depending on the two last digits of the hash value, each record goes into one of the slice. So the algorithm would go roughly as follows:
let k = 2^M
for i from 0 to k-1:
t = new table
for each record r on the disk:
h = hashvalue(r)
if (the M last bit of h == i) {
insert r into t with respect to hash value h >> M
}
search t for duplicate and remove them
delete t from memory
The drawback is that you have to hash each record k times. The advantage is that is it can trivially be distributed.
Here is a prototype in Python:
# Fake huge database on the disks
records = ["askdjlsd", "kalsjdld", "alkjdslad", "askdjlsd"]*100
M = 2
mask = 2**(M+1)-1
class HashLink(object):
def __init__(self, idx):
self._idx = idx
self._hash = hash(records[idx]) # file access
def __hash__(self):
return self._hash >> M
# hashlink are equal if they link to equal objects
def __eq__(self, other):
return records[self._idx] == records[other._idx] # file access
def __repr__(self):
return str(records[self._idx])
to_be_deleted = list()
for i in range(2**M):
t = set()
for idx, rec in enumerate(records):
h = hash(rec)
if (h & mask == i):
if HashLink(idx) in t:
to_be_deleted.append(idx)
else:
t.add(HashLink(idx))
The result is:
>>> [records[idx] for idx in range(len(records)) if idx not in to_be_deleted]
['askdjlsd', 'kalsjdld', 'alkjdslad']
Since you need deletion of duplicate item, without sorting or indexing, you may end up scanning entire dataset for every delete, which is unbearably costly in terms of performance. Given that, you may think of some external sorting for this, or a database. If you don't care about ordering of output dataset. Create 'n' number of files which stores a subset of input dataset according to hash of the record or record's key. Get the hash and take modulo by 'n' and get the right output file to store the content. Since size of every output file is small now, your delete operation would be very fast; for output file you could use normal file, or a sqlite/ berkeley db. I would recommend sqlite/bdb though. In order to avoid scanning for every write to output file, you could have a front-end bloom filter for every output file. Bloom filter isn't that difficult. Lot of libraries are available. Calculating 'n' depends on your main memory, I would say. Go with pessimistic, huge value for 'n'. Once your work is done, concatenate all the output files into a single one.