Scala GraphLoader.edgeListFile (NumberFormatException) - scala

Im loding edges of a graph from file
val graph = GraphLoader.edgeListFile(sc, "comb.txt")
Hoverwe its throwing error.
java.lang.NumberFormatException: For input string: "116374117927631468606"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
I think it accepts only integer node values. How do i . fix this
https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/graphx/GraphLoader.html

the api document clearly states the following comments
/**
* Loads a graph from an edge list formatted file where each line contains two integers: a source
* id and a target id. Skips lines that begin with#.
*
* If desired the edges can be automatically oriented in the positive
* direction (source Id is less than target Id) by settingcanonicalOrientationto
* true.
And 116374117927631468606 value is certainly big for it to be an integer as the official site says
final val MaxValue: Int(2147483647)
The largest value representable as a Int.

Related

Postgres and tables internal organization

I found an explanation of how things work internally in postgresql. There was the following picture:
and the following explanation:
Items after the headers is an array identifier composed of (offset,
length) pairs pointing to the actual items.
Because an item identifier is never moved until it is freed, its index
can be used on a long-term basis to reference an item, even when the
item itself is moved around on the page to compact free space. A
Pointer to an item is called CTID (ItemPointer), created by
PostgreSQL, it consists of a page number and the index of an item
identifier.
Could you be so kind to clear a couple of things out here?
Am I right that items near the page header are CTIDs themselves or Items and CTIDs are different things?
Do CTIDs never move around or rows?
Depending on the answers, maybe I'll understand what the following means exactly "Because an item identifier is never moved until it is freed, its index can be used on a long-term basis to reference an item, even when the item itself is moved around on the page to compact free space."
However, additional more detailed explanation would be nice.
What is called “item” in the picture is a “line pointer” in PostgreSQL jargon. It is defined in src/include/storage/itemid.h:
/*
* A line pointer on a buffer page. See buffer page definitions and comments
* for an explanation of how line pointers are used.
*
* In some cases a line pointer is "in use" but does not have any associated
* storage on the page. By convention, lp_len == 0 in every line pointer
* that does not have storage, independently of its lp_flags state.
*/
typedef struct ItemIdData
{
unsigned lp_off:15, /* offset to tuple (from start of page) */
lp_flags:2, /* state of line pointer, see below */
lp_len:15; /* byte length of tuple */
} ItemIdData;
typedef ItemIdData *ItemId;
These line pointers are stored in an array right after the page header.
See the excellent documentation in src/include/storage/bufpage.h:
/*
* A postgres disk page is an abstraction layered on top of a postgres
* disk block (which is simply a unit of i/o, see block.h).
*
* specifically, while a disk block can be unformatted, a postgres
* disk page is always a slotted page of the form:
*
* +----------------+---------------------------------+
* | PageHeaderData | linp1 linp2 linp3 ... |
* +-----------+----+---------------------------------+
* | ... linpN | |
* +-----------+--------------------------------------+
* | ^ pd_lower |
* | |
* | v pd_upper |
* +-------------+------------------------------------+
* | | tupleN ... |
* +-------------+------------------+-----------------+
* | ... tuple3 tuple2 tuple1 | "special space" |
* +--------------------------------+-----------------+
* ^ pd_special
*
* NOTES:
*
* linp1..N form an ItemId (line pointer) array. ItemPointers point
* to a physical block number and a logical offset (line pointer
* number) within that block/page. Note that OffsetNumbers
* conventionally start at 1, not 0.
*
* tuple1..N are added "backwards" on the page. Since an ItemPointer
* offset is used to access an ItemId entry rather than an actual
* byte-offset position, tuples can be physically shuffled on a page
* whenever the need arises. This indirection also keeps crash recovery
* relatively simple, because the low-level details of page space
* management can be controlled by standard buffer page code during
* logging, and during recovery.
Answers to your questions:
The ctid of a tuple is the physical address, consisting of the block number (starting at 0) and the line pointer (starting at 1). You can identify the line pointer from the ctid of a table row: it is the second number. For example, (321,5) would be the fifth line pointer on the 322th page.
The location of the actual tuple in the block is not fixed: it is stored in lp_off. That allows PostgreSQL to move the data around in a block without changing the physical address (tid) of the tuples. The line pointer itself never changes.
As explained above, the actual data can move in the block, but the line pointer doesn't change. The ctid of a tuple is what is stored in the index. The statement should be clear now.

How to convert a type Any List to a type Double (Scala)

I am new to Scala and I would like to understand some basic stuff.
First of all, I need to calculate the average of a certain column of a DataFrame and use the result as a double type variable.
After some Internet research I was able to calculate the average and at the same time pass it into a List type Any by using the following command:
val avgX_List = mainDataFrame.groupBy().agg(mean("_c1")).collect().map(_(0)).toList
where "_c1" is the second column of my dataframe. This line of code returns a List with type List[Any].
To pass the result into a variable I used the following command:
var avgX = avgX_List(0)
hoping that the var avgX would be type double automatically but that didn't happen obviously.
So now let the questions begin:
What does map(_(0)) do? I know the basic definition of the map() transformation but I can't find an explanation with this exact argument
I know that by using .toList method in the end of the command my result will be a List with type Any. Is there a way that I could change this into List which contains type Double elements? Or even convert this one
Do you think that it would be much more appropriate to pass the column of my Dataframe into a List[Double] and then calculate the average of its elements?
Is the solution I showed above at any point of view correct based on my problem? I know that "it is working" is different from "correct solution"?
Summing up, I need to calculate the average of a certain column of a Dataframe and have the result as a double type variable.
Note that: I am Greek and I find it hard sometimes to understand some English coding "slang".
map(_(0)) is a shortcut for map( (r: Row) => r(0) ), which is in turn a shortcut for map( (r: Row) => r.apply(0) ). The apply method returns Any, and so you are losing the right type. Try using map(_.getAs[Double](0)) or map(_.getDouble(0)) instead.
Collecting all entries of the column and then computing the average would be highly counterproductive, because you'd have to send huge amounts of data to the master node, and then do all the calculations on this single central node. That would be the exact opposite of what Spark is good for.
You also don't need collect(...).toList, because you can access the 0-th entry directly (it doesn't matter whether you get it from an Array or from a List). Since you are collapsing everything into a single Row anyway, you could get rid of the map step entirely by reordering the methods a little bit:
val avgX = mainDataFrame.groupBy().agg(mean("_c1")).collect()(0).getDouble(0)
It can be written even shorter using the first method:
val avgX = mainDataFrame.groupBy().agg(mean("_c1")).first().getDouble(0)
#Any dataType in Scala can't be directly converted to Double.
#Use toString & then toDouble on final captured result.
#Eg-
#scala> x
#res22: Any = 1.0
#scala> x.toString.toDouble
#res23: Double = 1.0
#Note- Instead of using map().toList() directly use (0)(0) to get the final value from your resultset.
#TestSample(Scala)-
val wa = Array("one","two","two")
val wrdd = sc.parallelize(wa,3).map(x=>(x,1))
val wdf = wrdd.toDF("col1","col2")
val x = wdf.groupBy().agg(mean("col2")).collect()(0)(0).toString.toDouble
#O/p-
#scala> val x = wdf.groupBy().agg(mean("col2")).collect()(0)(0).toString.toDouble
#x: Double = 1.0

Adding edge attribute causes TypeError: 'AtlasView' object does not support item assignment

Using networkx 2.0 I try to dynamically add an additional edge attribute by looping through all the edges. The graph is a MultiDiGraph.
According to the tutorial it seems to be possible to add edge attributes the way I do in the code below:
g = nx.read_gpickle("../pickles/" + gname)
yearmonth = gname[:7]
g.name = yearmonth # works
for source, target in g.edges():
g[source][target]['yearmonth'] = yearmonth
This code throws the following error:
TypeError: 'AtlasView' object does not support item assignment
What am I doing wrong?
That should happen if your graph is a nx.MultiGraph. From which case you need an extra index going from 0 to n where n is the number of edges between the two nodes.
Try:
for source, target in g.edges():
g[source][target][0]['yearmonth'] = yearmonth
The tutorial example is intended for a nx.Graph.

Postgresql earth_box algorithm

I found this tutorial how to find something in specified the radius. My question is what algorithm was used to implement it?
If you mean the earth_box, the idea is to come up with a data type that can be useful with a GIST index (inverted search tree):
http://www.postgresql.org/docs/current/static/gist-intro.html
See in particular the links at the bottom of the maintainers' page:
http://www.sai.msu.su/~megera/postgres/gist/
One leads to:
The GiST is a balanced tree structure like a B-tree, containing pairs. But keys in the GiST are not integers like the keys in a B-tree. Instead, a GiST key is a member of a user-defined class, and represents some property that is true of all data items reachable from the pointer associated with the key. For example, keys in a B+-tree-like GiST are ranges of numbers ("all data items below this pointer are between 4 and 6"); keys in an R-tree-like GiST are bounding boxes, ("all data items below this pointer are in Calfornia"); keys in an RD-tree-like GiST are sets ("all data items below this pointer are subsets of {1,6,7,9,11,12,13,72}"); etc. To make a GiST work, you just have to figure out what to represent in the keys, and then write 4 methods for the key class that help the tree do insertion, deletion, and search.
http://gist.cs.berkeley.edu/gist1.html
If you mean the earth distance itself, the meaty part of source is:
/* compute difference in longitudes - want < 180 degrees */
longdiff = fabs(long1 - long2);
if (longdiff > M_PI)
longdiff = TWO_PI - longdiff;
sino = sqrt(sin(fabs(lat1 - lat2) / 2.) * sin(fabs(lat1 - lat2) / 2.) +
                cos(lat1) * cos(lat2) * sin(longdiff / 2.) * sin(longdiff / 2.));
if (sino > 1.)
        sino = 1.;
return 2. * EARTH_RADIUS * asin(sino);
https://github.com/postgres/postgres/blob/master/contrib/earthdistance/earthdistance.c#L50
My math is too rusty to be affirmative on what the above does exactly, but my guess would be that it's computing the distance between two points on the surface of a sphere (without considering the height of the two points). In other words, nautical miles.

AChartEngine Messed up labels

My chart displays fine but as soon as I scroll to the side, I have random time appearing and it messes up the dates, see this picture:
http://img14.imageshack.us/img14/8329/statqs.jpg
I'd like to only display the date and nothing else, I don't know how the renderer comes up with time that I never entered.
Also I'd like to know how I can prevent scrolling to the left (x axis) and down (negative y), I can no longer use SetPanLimits because my x values are dates and not numbers.
Any help would be greatly appreciated!
I know this is very old, but for the next user, it may help to have the solution.
You can specify the date format to use
/**
* Creates a time chart intent that can be used to start the graphical view
* activity.
*
* #param context the context
* #param dataset the multiple series dataset (cannot be null)
* #param renderer the multiple series renderer (cannot be null)
* #param format the date format pattern to be used for displaying the X axis
* date labels. If null, a default appropriate format will be used.
* #return a time chart intent
* #throws IllegalArgumentException if dataset is null or renderer is null or
* if the dataset and the renderer don't include the same number of
* series
*/
public static final Intent getTimeChartIntent(Context context, XYMultipleSeriesDataset dataset,
XYMultipleSeriesRenderer renderer, String format) {
return getTimeChartIntent(context, dataset, renderer, format, "");
}
To show only day and month, use something like the following:
Intent intent = ChartFactory.getTimeChartIntent(context, dataset, mRenderer, "dd-MMM");