Tuesday, July 7, 2009

Delivery: PageRank, Hits and Modularity

I'm delivering PageRank, HITS and Modularity (Louvain algorithm) metrics. Modularity was a bit trickier than I would have figured. I'm still not convinced that my code represents the best way to implement the Louvain algorithm, however, it is fast and accurate.



To-date I have implemented all of the metrics from the orignal proposal (allbeit with some lingering issues). Hits appears to have some hiccups that need to be ironed out. Both hits and pagerank need some (minor) adjustments for running on undirected graphs.

There is still, of course, a lot to do. My objective for the rest of this week is to iron out issues with the metrics and complete the reports (add parameter list, date, network revision, etc.).

The next major task is to begin work on the Report Center, which will monitor and inform users of the progress of reports.

Tuesday, June 23, 2009

PageRank and HITS

I have implemented a first go at PageRank and the HITS algorithm. These two will be harder to verify as I'll have to look around for other known solutions to verify our work. The results so far seem to be very small.

So far the algorithms are only setup to run on directed algorithms, I will need to go through and add a directed flag and then iterate over different sets of edges based on that flag.

I am not thrilled with the parmaterization forms as they are now. They are rather unimpressive, there are very few parameters to get from the user. I'm thinking of adding some explanation for what these algorithms compute on the forms (eventually)

Monday, June 22, 2009

Delivery #3

I've just pushed delivery #3 which includes working versions of Clustering Coefficient, Degree Distribution, and Graph Distance metrics. I keep meaning to go into detail on the wiki on the design details (diagrams, etc.); I haven't found the time as yet, will get to it this week.

I had to spend some time resolving packaging issues and getting everything up-to-date and merged.

Some good advances today. Created package for handling the gui and an interface to allow gui parametrization to run smoothly.



Aside from continuing to work on the framework, beefing up the parametrization and the reporting, there remains only three main statistics: HITS, Pagerank, and Modularity.





In case anyone other then my mentors want to try and play with what's done so far, the bzr repository can be reached via
bzr branch lp:~pjmcswee/gephi/Statistics

Below are some notes concerning the current state of affairs.

Accomplished:
(1) Statistics are now run by the LongTaskExecutor and implement
LongTask (thank you Mathieu).
(2) I've checked everything in that I think you'll need for running
the project. Thank you for your patience with me on this issue :)
(3) Clustering Coefficient, Degree Distribution & Graph Distance
metrics all run successfully (see issues).
(4) I've created a StatisticsUI similar to GenerateUI that is used to
parametrize the statistics.


To Do:
(1) Continue to move the GUI out of StatisticsControllerImpl and
StatisticsReporterImpl
(2) Add onto the parametrization & reports of the existing algorithms
(3) Get the 'Print' functionality working correctly in the
StatisticsReporterImpl.
(4) Add a 'Save' option to save the
(5) Increase the reporting, time stamp, network basics, network name, etc.

Issues:
(1) There is a bug in the Triangle version of clustering coefficient
when the network is looked at from a directed point of view. Notice
that if you run as undirected you will get the same result for either
brute force or triangles. If you run directed they disagree. I will resolve this week.
(2) The combo box menu on the toplevel component uses the actual
Statistics objects. I'm not a huge fan of this as then we need to
make all of them implement toString(), when we already have getName().
Seems like there should be a cleaner solution which uses the existing
getName().


Suggestions:
(1) In the data panel, it may not be a bad idea to add a section for
per Network attributes. Many of the statistics return results that
could be placed there for easy referral: average clustering
coefficient, diameter, mean shortest path, degree distribution
(scalefree-ness), modularity, etc.
(2) I am seeing a lot of similarity between the Generator and the
Statistics Modules. Perhaps the GeneratorUI can be generalized and
then used for both packages, as the LongTask module has been.

Thursday, June 18, 2009

Apologies

Sorry for not updating my blog sooner. I have two draft posts saved that I meant to complete but I've been busy working on the project code and completing project deliveries. I'll try to recap where I left off.

After the Brandes algorithm (the last post) I worked on the first delivery which included pseudo-code for all of the metrics we are going to implement along with making some suggestions about possible GUI approaches. (6/8)

I had alread implemented a rough draft of some the metrics which needed to be updated with the new Graph API, which really did not take that much time.


On (6/15) the second delivery was due which implemented the graph-distance metrics. The actual metric turned out to be the easier part of this delivery, as there are still some framework issues that need to be ironed out (see the wiki page for some discussion).



On (6/22) the third delivery is due which has the implementations for ClusteringCoefficient and DegreeDistribution. Both of these are close to being finished as is, so this gives me some time to work on the framework issues that are arising, which all basically have to do with inputing parameters or displaying results to/from the user.



Here is a list of the main points which need to be ironed out as of now:

1) Parameterization: I think this is issue is mostly resolved. We create a JPanel (using Matisse) and have the Statistics return that particular JPanel when 'getPanel' is called. If the Statistics method 'isParameterizable' returns true, the controller displays the JPanel and cancel/next buttons to the user. One issue that arises in my mind is should the JPanel tie directly into the Statistics so that on each Gui event the Statistics Object is automatically updated? Or should the values in the Statistics object only be updated when the 'Next' button is pressed in which case we may need another function in the Statistics interface which tells the object to grab its parameters from the JPanel.

2) Progress monitoring: In my second delivery I had the Statistics object taking a ProgressMonitor object so it could update how much work has been completed. This is not a great solution as it ties down ProgressMonitor to the Statistics Interface. I've put some comments on the WikiPage about this issue. My understanding is that the Statistics Object will need to tell the rest of the system how much work it has completed to advance the progress meter. We want to do so in such a way that Statistics implementations do not HAVE to do this but they can so if they choose.



3) Statistics Cancellation: Mathieu has suggested using the Executors API for threading off each new Statistics implementation to a new thread. I'm spending some time investigating the Executor API. We can certainly call cancel on the Statistics implementations, however, we may want to kill the Statistics implementation if it is not responding. To do this I am under the impression that we may need to implement our own StatisticsExecutor class, as otherwise we may not have access to the particular thread?



4) Printing: I think that we can include a JButton on the report page which allows the HTML report to be directly sent to a printner. I've got this working to some degree, I need to finish the way it formats the page. Of course we will also want a way (JButton) to save the HTML report as well.

5) Parameterization & Reporting: For each metric I'll have to come up with a list of parameters and report items for discussion. What reports will be most helpful, should we make the list of reports selectable in the Parameterization, etc?

Saturday, May 23, 2009

Brandes Algorithm

I've implemented Brandes algorithm http://www.inf.uni-konstanz.de/algo/publications/b-fabc-01.pdf which calculates:

Characteristic path length (mean shortest path)
Diameter
Betweenness centrality
Closeness centrality

The version I have implemented is really for unweighted networks so future adjustments may be necessary.

I should be able to easily add some tweaks to get us the # of connected components for free (more or less).


I've realized that we probably need some mechanism to start pulling the GUI stuff together. We will probably want a way for the user to examine the results from different viewpoints: global metrics view, per-node view, per metric view, global node-metric view.

I'll still need a way to query the user for parameters (undirected/directed, weighted/unweighted, estimate/pure-calculations).

I'll make a to-do list (I'll have to see if there is way to do this easily in blog-spot).

Friday, May 22, 2009

Degree Distribution

I've begun some preliminary work on developing some of the metrics for this project. I've realized that the actual metrics will not be hard, but the integration, display and parametrization will be challenging. As I mentioned last time we've chosen JFreeChart for display purposes. I've coded up a Least-Squares fit for the degree distribution metric.

Scale free networks are those in which the degree distribution of the network follows a power-law (sometimes called the 80-20 law) which means that there are very few nodes with very high-degree and very few nodes with small degree. When we look at the the logarithm of the degree distribution we expect to see a straight-line (more or less) for scale free networks.

I'd like to see if there is any advantage to threading this model when it comes to large networks... java is pretty easy to add threads to. The run time of the algorithm probably isn't that bad when compared to the other metrics.


Possible user parameters:
1) Separate into in and out degree or combined. (The below is combined)
2) Least-Squares or other method for computing alpha value.
3) Display logarithm or non-logarithm.


This chart is from the Jazz Musicians social network with < 200 nodes. The axises below are logarithmic.




I've also coded some of the clustering coefficient, both the standard method and the triangles approach from Dr. Lattapy. There is some small error still in the triangles version that I must hunt down as the two are disagreeing slightly (1%). The current version is really only for undirected so this should be changed as well.

Sunday, May 10, 2009

Design Elements

Here I am listing several design questions/suggestions. Feel free to comment.

Design elements:
  1. I am proposing a GUI dialog which will allow the users to select which network metrics they want to run on the network, see the first post for a list of metrics. The user can select to run all or some (possibly only one) metric to run on their network.

  2. Two modes: Estimation or Exact.
    Given that some of the metrics run in N*N*N (graph distances) and that the some of the networks Gephi users may be interested in may be quite large (100,000+ nodes) users may not always want to run the metrics to their full extent and instead be satisfied with estimations. For example there is a graph diameter estimation that users will most likely be interested in running.

  3. We can also provide a mechanism to parallelize the metrics using Java threads, to improve the run time performance.

  4. We will probably want to present a Task Monitor to the users while the metric(s) are running.