Why won't my factor column value change to a date value?

I know this is elementary but I can't seem to figure it out, even after reading other posts.
In a dataset, I want to convert an entire column into a date. The current class is factor.
The value in the field looks like this: 12/25/2012
This is what I've tried:
C$DateofDeath=as.Date(C$DateofDeath,'%m/%d/%Y')
Error in as.Date.default(C$DateofDeath, "%m/%d/%Y") :
do not know how to convert 'C$DateofDeath' to class “Date”
C$DateofDeath=as.Date(C$DateofDeath,"%m/%d/%Y")
Error in as.Date.default(C$DateofDeath, "%m/%d/%Y") :
do not know how to convert 'C$DateofDeath' to class “Date”
Claims$DateofDeath=strptime(as.character(Claims$DateofDeath),format= '%m/%d/%Y')
Error in `$<-.data.frame`(`*tmp*`, "DateofDeath", value = list(sec = numeric(0), :
replacement has 0 rows, data has 71616
Claims$DateofDeath=strptime(as.character(Claims$DateofDeath),format= "%m/%d/%Y")
Error in `$<-.data.frame`(`*tmp*`, "DateofDeath", value = list(sec = numeric(0), :
replacement has 0 rows, data has 71616

Use as.POSIXct:
C$DateofDeath <- as.POSIXct(as.character(C$DateofDeath), format = "%m/%d/%Y")

There are lots of R experts here but you have to specify R as one of your tags to get them to notice your question.
Looks like you have tried a bunch of combinations but not the right one.
> C <- data.frame(DateofDeath="12/25/2012",other=TRUE)
> as.Date(as.character(C$DateofDeath),format="%m/%d/%Y")
[1] "2012-12-25"
Notice that as.Date() takes a character input, not a factor. So you need to convert to character, then to Date.
Your strptime() versions seem fine to me, except that you are referring to the data frame Claims instead of C. Actually, strptime() should convert the factor to character for you, so you don't need the as.character() part with those.
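To double-check both routes, here is a minimal sketch (the toy data frame and the stringsAsFactors flag are assumptions, used to reproduce a factor column):
C <- data.frame(DateofDeath = c("12/25/2012", "01/03/2013"), stringsAsFactors = TRUE)
class(C$DateofDeath)                          # "factor"
strptime(C$DateofDeath, format = "%m/%d/%Y")  # POSIXlt; the factor is coerced for you
C$DateofDeath <- as.Date(as.character(C$DateofDeath), format = "%m/%d/%Y")
class(C$DateofDeath)                          # "Date"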

Related

How to convert a type Any List to a type Double (Scala)

I am new to Scala and I would like to understand some basic stuff.
First of all, I need to calculate the average of a certain column of a DataFrame and use the result as a double type variable.
After some Internet research I was able to calculate the average and, at the same time, store it in a List of type Any by using the following command:
val avgX_List = mainDataFrame.groupBy().agg(mean("_c1")).collect().map(_(0)).toList
where "_c1" is the second column of my dataframe. This line of code returns a List with type List[Any].
To pass the result into a variable I used the following command:
var avgX = avgX_List(0)
hoping that the var avgX would automatically be of type Double, but that obviously didn't happen.
So now let the questions begin:
What does map(_(0)) do? I know the basic definition of the map() transformation, but I can't find an explanation of this exact argument.
I know that by using the .toList method at the end of the command my result will be a List of type Any. Is there a way to change this into a List that contains Double elements, or to convert this one directly?
Do you think it would be more appropriate to pull the column of my DataFrame into a List[Double] and then calculate the average of its elements?
Is the solution I showed above correct in any respect, given my problem? I know that "it is working" is different from "correct solution".
Summing up, I need to calculate the average of a certain column of a DataFrame and have the result as a Double variable.
Note: I am Greek, and I sometimes find it hard to understand some English coding "slang".
map(_(0)) is a shortcut for map( (r: Row) => r(0) ), which is in turn a shortcut for map( (r: Row) => r.apply(0) ). The apply method returns Any, and so you are losing the right type. Try using map(_.getAs[Double](0)) or map(_.getDouble(0)) instead.
Collecting all entries of the column and then computing the average would be highly counterproductive, because you'd have to send huge amounts of data to the master node, and then do all the calculations on this single central node. That would be the exact opposite of what Spark is good for.
You also don't need collect(...).toList, because you can access the 0-th entry directly (it doesn't matter whether you get it from an Array or from a List). Since you are collapsing everything into a single Row anyway, you could get rid of the map step entirely by reordering the methods a little bit:
val avgX = mainDataFrame.groupBy().agg(mean("_c1")).collect()(0).getDouble(0)
It can be written even shorter using the first method:
val avgX = mainDataFrame.groupBy().agg(mean("_c1")).first().getDouble(0)
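If you'd rather have Spark hand you a typed value directly, a roughly equivalent sketch (assuming a SparkSession named spark is in scope, so its implicits supply the Double encoder):
import spark.implicits._
import org.apache.spark.sql.functions.mean
val avgX: Double = mainDataFrame.select(mean("_c1")).as[Double].first()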
Any in Scala can't be directly converted to Double. Use toString and then toDouble on the final captured result, e.g.:
scala> x
res22: Any = 1.0
scala> x.toString.toDouble
res23: Double = 1.0
Note: instead of using map(...).toList, use (0)(0) directly to get the final value from your result set.
A test sample (Scala):
val wa = Array("one","two","two")
val wrdd = sc.parallelize(wa,3).map(x=>(x,1))
val wdf = wrdd.toDF("col1","col2")
val x = wdf.groupBy().agg(mean("col2")).collect()(0)(0).toString.toDouble
Output:
scala> val x = wdf.groupBy().agg(mean("col2")).collect()(0)(0).toString.toDouble
x: Double = 1.0

Parse Query containsAllObjectsInArray in Swift

In my Parse class "Challenge" I have a column "status" which contains a number between 0 and 5.
When I'm loading the data from Parse, I only want objects that contain the number 1 or 2 in the column "status".
query.whereKey("status", containsAllObjectsInArray: [1,2])
This gives me a result of 0 objects.
This, on the other hand, gives me the right result:
query.whereKey("status", lessThan: 2)
but I don't want to use this line, since I will need different numbers later (for example, only 3 and 5).
What am I doing wrong?
Try containedIn instead:
query.whereKey("status", containedIn: [1,2])

Convert numerous date and time values into serial number

I need to convert date and time values into numerical values. For example:
>> num = datenum('2011-05-07 11:52:23')
num =
7.3463e+05
How would I write a script to do this for numerous values without inputting the date and time manually?
You can store your date strings in a cell array first (or a matrix, provided they have a fixed format) and feed it straight to datenum. For example:
C = {'2011-05-07 11:52:23'
     '2011-03-01 20:30:01'};
vals = datenum(C)
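If every string really does share one fixed format, you can also pass the format specifier explicitly, which tends to be faster because datenum can skip format detection:
vals = datenum(C, 'yyyy-mm-dd HH:MM:SS')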

Cassandra: get_range_slices of TimeUUID super column?

I have a schema of row keys 1-n. In each row there is a variable number of supercolumns with a TimeUUID 'name'. I'm hoping to be able to query this data by a time range.
Two issues have come up:
In KeyRange, the values that I put in for 'start_key' and 'end_key' are getting misunderstood (for lack of a better term) by Thrift. Experimenting with different groups of values, I'm not seeing what I expect and often get back something completely unexpected.
Example: my row keys run from 1-1000 with lots of random gaps. I put start_key = 50 and end_key = 20 .. and I get back rows with keys ranging from 99 to 414.
Example: I have a known row with key = 13. Putting this value into both start_key and end_key gives me no results.
Second issue: even when I do get results, the 'columns' portion of the 'keyslice' is always empty. I have checked via cassandra-cli and I know there is data.
I'm using Perl as follows:
my $slice_range = new Cassandra::SliceRange();
$slice_range->{ start } = create_UUID( UUID::Tiny::UUID_TIME, "2010-12-24 00:00:00" );
$slice_range->{ finish } = create_UUID( UUID::Tiny::UUID_TIME, "2011-12-25 00:00:00" );
my $slice_predicate = new Cassandra::SlicePredicate();
$slice_predicate->{ slice_range } = $slice_range;
my $key_range = new Cassandra::KeyRange();
$key_range->{ start_key } = 13;
$key_range->{ end_key } = 13;
my $result = $client->get_range_slices( $column_parent, $slice_predicate, $key_range, $consistency_level );
print Dumper( $result );
Clearly I'm misunderstanding some basic precept.
EDIT: It turns out that the Perl library I'm using is not properly documented. The UUID creation was not working as advertised. I opened it up, fixed it, and now it's all going a bit more as I was expecting. I can slice my supercolumns by date/time range. I'm still working on getting the key range portion to work.
http://wiki.apache.org/cassandra/FAQ#range_rp covers why you're not seeing what you expect with key ranges.
You need to specify a SlicePredicate that contains the actual range of what you're trying to select. The default of no column_names and no slice_range will result in the empty columns list that you see.
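For reference, a sketch of a predicate with an explicit column range, in the style of the question's Thrift bindings (the reversed and count fields are assumptions, shown with the Thrift SliceRange defaults):
my $slice_range = new Cassandra::SliceRange();
$slice_range->{ start }    = $start_uuid;  # packed TimeUUID lower bound
$slice_range->{ finish }   = $finish_uuid; # packed TimeUUID upper bound
$slice_range->{ reversed } = 0;            # ascending column order
$slice_range->{ count }    = 100;          # max columns returned per row
my $slice_predicate = new Cassandra::SlicePredicate();
$slice_predicate->{ slice_range } = $slice_range;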

Regarding BigDecimal

I have a CSV file where amount and quantity fields are present in each detail record, except the header and trailer records. The trailer record has a total charge value, which is the sum of quantity multiplied by amount over the detail records. I need to check whether the trailer's total charge value equals the value I calculate from the amount and quantity fields. I am using the double data type for all these calculations. From the link below I understand that comparing values with decimal points using the double data type can cause issues, and that BigDecimal is suggested instead:
http://epramono.blogspot.com/2005/01/double-vs-bigdecimal.html
Will I get issues if I use the double data type? How can I do the calculations using BigDecimal? Also, I am not sure how many digits will come after the decimal point in the CSV file, and the amount can have a positive or negative value.
In the CSV file:
H,ABC.....
"D",....,"1","12.23"
"D",.....,"3","-13.334"
"D",......,"2","12"
T,csd,123,12.345
While validating, I have the below code:
double detChargeCount = 0;
// From the CSV file, read the trailer record's charge value
String totChargeValue = items[3].replaceAll("\"", "").trim();
if (null != totChargeValue && !totChargeValue.equals("")) {
    detChargeCount = new Double(totChargeValue).doubleValue();
    if (detChargeCount == calChargeCount)
        validflag = true;
}
While reading the CSV file, I have the below code:
if (null != chargeQuan && !chargeQuan.equals("")) {
    tmpChargeQuan = new Long(chargeQuan).longValue();
}
if (null != chargeAmount && !chargeAmount.equals("")) {
    tmpChargeAmt = new Double(chargeAmount).doubleValue();
    calChargeCount = calChargeCount + (tmpChargeQuan * tmpChargeAmt);
}
I declared the variables tmpChargeQuan, tmpChargeAmt, and calChargeCount as double.
Especially for financial data, but in general for anything dealing with human-readable numbers, BigDecimal is what you want to use instead of double, just as that source says.
The documentation on BigDecimal is pretty straightforward and should provide everything you need.
It has int, double, and String constructors, so you can simply write:
BigDecimal detChargeCount = new BigDecimal(0);
...
detChargeCount = new BigDecimal(totChargeValue);
The operators are implemented as methods, so you'd have to write things like
tmpChargeQuan.multiply(tmpChargeAmt)
instead of simply tmpChargeQuan * tmpChargeAmt, but that shouldn't be a big deal, and the methods come with all the overloads you could need.
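Putting that together, a minimal sketch of the validation with BigDecimal (variable names follow the question; note the use of compareTo rather than equals, since equals also compares scale, so 12.340 would not equal 12.34):
import java.math.BigDecimal;

BigDecimal calChargeCount = BigDecimal.ZERO;
// for each detail record:
BigDecimal quantity = new BigDecimal(chargeQuan);   // e.g. "3"
BigDecimal amount   = new BigDecimal(chargeAmount); // e.g. "-13.334"
calChargeCount = calChargeCount.add(quantity.multiply(amount));
// trailer check:
boolean validflag = new BigDecimal(totChargeValue).compareTo(calChargeCount) == 0;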
It is very possible that you will have issues with doubles, by which I mean the precomputed value and the newly computed value may differ by .000001 or less.
If you don't know how the value you are comparing to was computed, I think the best solution is to define "equal" as having a difference of less than epsilon, where epsilon is a very small number such as .0001.
I.e. rather than using the test A == B, use abs(A - B) < .0001.
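In Java that test might look like this (the epsilon value itself is an arbitrary assumption, as noted above):
static final double EPSILON = 0.0001;
boolean equal = Math.abs(detChargeCount - calChargeCount) < EPSILON;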