Make filling with fillna permanent in PySpark

I have this data frame. I want to change null to zero, but the change does not persist; the data frame with zeros is not permanent. I tried these:
df5.na.fill("0").show()
df5.fillna( { 'rc3':0, 'rc5':0, 'rc7':0 } ).show()
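Note that Spark DataFrames are immutable: na.fill and fillna return a new DataFrame, and .show() only prints it without saving anything, which is why the zeros do not persist. A minimal sketch of the fix, assuming df5 and the rc3/rc5/rc7 columns from the question (the CSV path is hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# hypothetical source file; any DataFrame with rc3/rc5/rc7 columns works
df5 = spark.read.csv("path/to/file.csv", header=True, inferSchema=True)

# fillna returns a NEW DataFrame; assign it back to keep the zeros
df5 = df5.fillna({'rc3': 0, 'rc5': 0, 'rc7': 0})
df5.show()

Also note that na.fill("0") with a string argument only fills string columns; pass a numeric 0 to fill numeric columns.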

Related

Make absolute work inside filtering in Scala

I want to return a percentage of results from a dataset. Being a noob in Scala, I tried the following:
ds.filter(abs(hash(col("source"))) % 100 < percentage)
but I am getting the error abs cannot be applied to (org.apache.spark.sql.Column). I don't want to sample the data; I want to select rows based on the hash of a column so that the result is deterministic even when the dataset changes.
This works just fine:
ds.filter(abs(hash(col("source"))) % 100 < percentage)
Probably you have multiple abs functions in your namespace (e.g. from imports like import math._). To be sure, use
ds.filter(org.apache.spark.sql.functions.abs(hash(col("source"))) % 100 < percentage)
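Alternatively, Scala lets you rename a symbol at import time, which avoids the collision without fully qualified names; a small sketch (sqlAbs is just a local alias):

import org.apache.spark.sql.functions.{abs => sqlAbs, col, hash}

val filtered = ds.filter(sqlAbs(hash(col("source"))) % 100 < percentage)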
But I think this will not guarantee that you get the exact percentage, because hash values may not be equally distributed (think of a dataframe with only one unique value of source: the hash values will all be the same, so you get either all records or none). To get the exact percentage, you would need something like:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{count, lit, row_number}

val newDF = df
  .withColumn("rnb", row_number().over(Window.orderBy($"source"))) // or order by hash if you wish
  .withColumn("count", count("*").over())
  .where($"rnb" < lit(fraction) * $"count")

Customized Android GraphView x-axis date labels not displaying as per setNumHorizontalValues()

I am attempting to show a tersely formatted date/time on the x-axis of a GraphView chart. As per the API code examples, I set HumanRounding to false when using a date formatter on that axis. I'm also setting the NumHorizontalLabels to 3 so the labels display reasonably well in both orientations.
This results in, for example, the date labels showing as a black shape, and the LineChart background being different. I'm speculating that the black shape is the result of all my date data points overwriting each other.
With HumanRounding set to true (commented out below), I get labels showing, but instead of the expected 3 evenly distributed labels they are unpredictably spread out and/or not equal to 3; sometimes the labels overwrite each other, sometimes they are bunched on the left...
The number of date data-points on the x-axis can vary depending on how much history the user has selected. Note that this can vary from 60 to thousands of minutes.
Here's the code that receives the data and charts it. Note that the unixtime retrieved from wxList elements has already been converted to a Java date (by multiplying by 1000) by the time it is used here (the time portions on the x-axis are in fact correct when they do show up in a reasonably distributed manner):
protected void onPostExecute(List<WxData> wxList) {
    // We will display MM/dd HH:mm on the x-axes on all graphs...
    SimpleDateFormat shortDateTime = new SimpleDateFormat("MM/dd HH:mm");
    shortDateTime.setTimeZone(TimeZone.getTimeZone("America/Toronto"));
    DateAsXAxisLabelFormatter xAxisFormat =
            new DateAsXAxisLabelFormatter(parentContext, shortDateTime);
    if (wxList == null || wxList.isEmpty()) {
        makeText(parentContext,
                "Could not retrieve data from server",
                Toast.LENGTH_LONG).show();
    } else {
        // Temperature Celsius
        GraphView tempGraph = findViewById(R.id.temp_graph);
        tempGraph.removeAllSeries();
        tempGraph.setTitle(parentContext.getString(R.string.temp_graph_label));
        DataPoint[] tempCArray = new DataPoint[wxList.size()];
        for (int i = 0; i < wxList.size(); i++) {
            tempCArray[i] = new DataPoint(wxList.get(i).getUnixtime(),
                    wxList.get(i).getTempC().doubleValue());
        }
        LineGraphSeries<DataPoint> tempCSeries = new LineGraphSeries<>(tempCArray);
        tempGraph.addSeries(tempCSeries);
        tempGraph.getGridLabelRenderer().invalidate(false, false);
        tempGraph.getGridLabelRenderer().setLabelFormatter(xAxisFormat);
        tempGraph.getGridLabelRenderer().setNumHorizontalLabels(3);
        tempGraph.getViewport().setMinX(wxList.get(0).getUnixtime());
        tempGraph.getViewport().setMaxX(wxList.get(wxList.size() - 1).getUnixtime());
        tempGraph.getViewport().setXAxisBoundsManual(true);
        // Code below seems buggy - with humanRounding, X-axis turns black
        // tempGraph.getGridLabelRenderer().setHumanRounding(false);
        ...
I have tried many variations, but I cannot get the graph to consistently display 3 datetimes evenly spread out, for both orientations and for varying sample sizes. Any help is appreciated.
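One workaround worth trying is to bypass DateAsXAxisLabelFormatter and format the x values directly; a sketch, assuming GraphView 4.x (where DefaultLabelFormatter exposes formatLabel(double, boolean)) and that shortDateTime is effectively final:

GridLabelRenderer renderer = tempGraph.getGridLabelRenderer();
renderer.setLabelFormatter(new DefaultLabelFormatter() {
    @Override
    public String formatLabel(double value, boolean isValueX) {
        if (isValueX) {
            // x values are epoch milliseconds, so format them directly
            return shortDateTime.format(new java.util.Date((long) value));
        }
        return super.formatLabel(value, isValueX); // keep default y formatting
    }
});
renderer.setNumHorizontalLabels(3);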

(Spark/Scala) What would be the most effective way to compare specific data in one RDD to a line of another?

Basically, I have two sets of data in two text files. One set of data is in the format:
a,DataString1,DataString2 (One line) (The first character is in every entry but not relevant)
.... (and so on)
The second set of data is in format:
Data, Data Data Data, Data Data, Data, Data Data Data (One line)(separated by either commas or spaces, but I'm able to use a regular expression to handle this, so that's not the problem)
.... (And so on)
So what I need to do is check if DataString1 AND DataString2 are both present on any single line of the second set of data.
Currently I'm doing this like so:
// spark context is defined above, imported java.util.regex.Pattern above as well
case class test(data_one: String, data_two: String)
// case class is used to just more simply organize data_one to work with
val data_one = sc.textFile("path")
val data_two = sc.textFile("path")
val rdd_one = data_one.map(_.split(",")).map(c => test(c(1), c(2))) // closing parenthesis was missing here
val rdd_two = data_two.map(_.split("[,\\s]+")) // split on commas or runs of whitespace
val data_two_array = rdd_two.collect()
// this causes data_two_array to be an array of array of strings.
rdd_one.foreach { line =>
  for (array <- data_two_array) {
    for (string <- array) {
      // comparison logic here checks whether both dataString1 and dataString2
      // happen to be on the same line
    }
  }
}
How could I make this process more efficient? At the moment it works correctly, but as the data sizes grow it becomes very inefficient.
The double for loop scans all elements, doing m*n work where m and n are the sizes of the two sets. You can start with a join to eliminate rows. Since you have two columns to verify, make sure the join takes care of both.
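As an illustration, one way to restructure the nested loops as joins (a sketch reusing rdd_one and rdd_two from the question; the query and line ids are introduced here purely for bookkeeping):

// tag each (data_one, data_two) pair with a query id
val queries = rdd_one.zipWithIndex()
val firsts  = queries.map { case (t, qid) => (t.data_one, qid) }
val seconds = queries.map { case (t, qid) => (t.data_two, qid) }

// build a (token, lineId) index over the second data set
val tokenIndex = rdd_two.zipWithIndex().flatMap {
  case (tokens, lineId) => tokens.distinct.map(tok => (tok, lineId))
}

// join each string against the index, keeping (queryId, lineId) hits
val hits1 = firsts.join(tokenIndex).values
val hits2 = seconds.join(tokenIndex).values

// a line satisfies a query only when BOTH strings hit the same line
val matches = hits1.intersection(hits2)

This replaces the m*n scan with joins that Spark can distribute, and it avoids collecting the second data set to the driver.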

Cassandra: get_range_slices of TimeUUID super column?

I have a schema of row keys 1-n. In each row there is a variable number of supercolumns with a TimeUUID 'name'. I'm hoping to be able to query this data by a time range.
Two issues have come up:
In KeyRange, the values that I put in for 'start_key' and 'end_key' are getting misunderstood (for lack of a better term) by Thrift. Experimenting with different groups of values, I'm not seeing what I expect and often get back something completely unexpected.
Example: my row keys run from 1-1000 with lots of random gaps. I put start_key = 50 and end_key = 20... and I get back rows with keys ranging from 99 to 414.
Example: I have a known row with key = 13. Putting this value into start_key and end_key gives me no results.
Second issue: even when I do get results, the 'columns' portion of the 'keyslice' is always empty. I have checked via cassandra-cli and I know there is data.
I'm using Perl as follows:
my $slice_range = new Cassandra::SliceRange();
$slice_range->{ start } = create_UUID( UUID::Tiny::UUID_TIME, "2010-12-24 00:00:00" );
$slice_range->{ finish } = create_UUID( UUID::Tiny::UUID_TIME, "2011-12-25 00:00:00" );
my $slice_predicate = new Cassandra::SlicePredicate();
$slice_predicate->{ slice_range } = $slice_range;
my $key_range = new Cassandra::KeyRange();
$key_range->{ start_key } = 13;
$key_range->{ end_key } = 13;
my $result = $client->get_range_slices( $column_parent, $slice_predicate, $key_range, $consistency_level );
print Dumper( $result );
Clearly I'm misunderstanding some basic precept.
EDIT: It turns out that the Perl library I'm using is not properly documented. The UUID creation was not working as advertised. I opened it up, fixed it, and now it's all going a bit more as I was expecting. I can slice my supercolumns by date/time range. I'm still working on getting the key range portion to work.
http://wiki.apache.org/cassandra/FAQ#range_rp covers why you're not seeing what you expect with key ranges: under the RandomPartitioner, rows are stored in the order of the hash of their keys, so get_range_slices returns ranges in hash order rather than key order.
You need to specify a SlicePredicate that contains the actual range of what you're trying to select. The default of no column_names and no slice_range will result in the empty columns list that you see.

Regarding BigDecimal

I have a CSV file where amount and quantity fields are present in each detail record (but not in the header and trailer records). The trailer record has a total charge value, which is the total over the detail records of quantity multiplied by amount. I need to check whether the trailer's total charge value equals the value I calculate from the amount and quantity fields. I am using the double data type for all these calculations. From the link below I understand that using double for comparisons involving decimal points can cause issues, and it suggests using BigDecimal:
http://epramono.blogspot.com/2005/01/double-vs-bigdecimal.html
Will I get issues if I use the double data type? How can I do the calculations using BigDecimal? I am also not sure how many digits will appear after the decimal point in the CSV file, and the amount can be positive or negative.
In csv file
H,ABC.....
"D",....,"1","12.23"
"D",.....,"3","-13.334"
"D",......,"2","12"
T,csd,123,12.345
---------- While validating, I have the below code ----------
double detChargeCount = 0;
// From the csv file, read the trailer record's total charge value
String totChargeValue = items[3].replaceAll("\"", "").trim();
if (totChargeValue != null && !totChargeValue.equals("")) {
    detChargeCount = Double.parseDouble(totChargeValue);
    if (detChargeCount == calChargeCount) {
        validflag = true;
    }
}
---------- While reading the CSV file, I have the below code ----------
if (chargeQuan != null && !chargeQuan.equals("")) {
    tmpChargeQuan = Long.parseLong(chargeQuan); // the original was missing 'new' before Long(...)
}
if (chargeAmount != null && !chargeAmount.equals("")) {
    tmpChargeAmt = Double.parseDouble(chargeAmount);
    calChargeCount = calChargeCount + (tmpChargeQuan * tmpChargeAmt);
}
I have declared the variables tmpChargeQuan, tmpChargeAmt, and calChargeCount as double.
Especially for anything with financial data, but in general for everything dealing with human readable numbers, BigDecimal is what you want to use instead of double, just as that source says.
The documentation on BigDecimal is pretty straight-forward, and should provide everything you need.
It has int, double, and String constructors, so you can simply have:
BigDecimal detChargeCount = new BigDecimal(0);
...
detChargeCount = new BigDecimal(totChargeValue);
The operators are implemented as functions, so you'd have to do things like
tmpChargeQuan.multiply(tmpChargeAmt)
instead of simply tmpChargeQuan * tmpChargeAmt, but that shouldn't be a big deal; the methods are defined with all the overloads you could need.
It is very possible that you will have issues with doubles, by which I mean the precomputed value and the newly computed value may differ by .000001 or less.
If you don't know how the value you are comparing to was computed, I think the best solution is to define "equal" as having a difference of less than epsilon, where epsilon is a very small number such as .0001.
I.e. rather than using the test A == B, use abs(A - B) < .0001.
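To make both suggestions concrete, here is a small self-contained sketch (values borrowed from the sample CSV in the question):

import java.math.BigDecimal;

public class ChargeCheck {
    public static void main(String[] args) {
        // BigDecimal's String constructor keeps "13.334" exact, unlike
        // new BigDecimal(13.334), which inherits double's binary rounding error
        BigDecimal qty = new BigDecimal("3");
        BigDecimal amt = new BigDecimal("-13.334");
        BigDecimal calChargeCount = qty.multiply(amt); // quantity * amount

        BigDecimal trailerTotal = new BigDecimal("-40.002");

        // use compareTo, not equals: equals also compares scale, so 2.0 != 2.00
        boolean validflag = calChargeCount.compareTo(trailerTotal) == 0;
        System.out.println(validflag); // true

        // the double-based alternative: replace == with an epsilon test
        double calculated = 3 * -13.334;
        boolean closeEnough = Math.abs(calculated - (-40.002)) < 0.0001;
        System.out.println(closeEnough); // true
    }
}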