Monday, July 15, 2013

My Experiments with JUNG 2.0 (Java Universal Network/Graph Framework)

My first minor project was a major disaster. In hindsight, though, I can salvage a few nifty tools from the graveyard of code on my laptop. I had to settle for a simple "Simpletron: A Simulated Microprocessor in Java." (I've blogged about it earlier here.) All it took was firing one of our three team members, after which the remaining duo pair-programmed into the night... or at least for a few hours, until he too quit and went off to see a movie. We ended up showing this nifty little tool and another tool that Chitransh had designed to go with it. I then finally completed and perfected Simpletron on my own.

Nexus Grapher was a tool meant to be part of a data-mining project that we failed to complete because of broken team dynamics and pressing deadlines. I coded this small tool using the open-source JUNG 2.0 API, that is, the "Java Universal Network/Graph Framework", version 2.0.

This project had a lot of potential and many planned features that were never realized. Nexus, as we named it, was supposed to be a suite of data-mining tools that could be run on a directory full of web pages (preferably from Wikipedia and news sites), text files and similar documents.

Initially we planned to adopt the Stanford NLP Parser to perform NER (Named Entity Recognition) tagging. This would produce a Penn Treebank structure, which a method of my nexus-grapher would parse to generate a visible, clickable, interactive graph of interrelated facts and texts from across the web pages and text files saved in the directory.

Selecting a node labelled with a named entity would invoke another method, which I wrote with my teammate Harshdeep Sokhey, that opens a window displaying a consolidated text file of passages related to that label, collected from all over the mined directory (like a dynamically generated Wikipedia page).

But this consolidation and compaction method was in its nascent state. It lacked intelligent features such as context-aware automatic updates (by comparison with the most recent file save) and even basic ones such as removal of redundant text. All it could do was gather the paragraphs and lines containing the keywords from all over the directory and order them by file creation date.

The connecting edges of the graph were also meant to be clickable, pulling up a consolidated file with the paragraphs in which both named entities appear together. This was another feature we failed to implement. I tried switching to Apache Lucene, but with the deadline approaching I didn't have enough time to learn it.

Currently, the nexus-grapher reads from a CSV of preselected named entities to generate the graph.

If you want to know more about my plans for "Nexus: An interactive data-mining & visualization suite", send me an email. If you manage to implement such a suite, do let me know.

Directions/Code Walkthrough:
----------------------------

This small tool is built with the open-source JUNG 2.0. My knowledge of and expertise with it are... limited. If you run into trouble or have questions about the graph and its visualization, please contact the JUNG developers. Here's a tutorial on JUNG 2.0, as a PDF, which I found handy; it should be enough for a Java newcomer to get their feet wet.


Before I get into the details of my Java project, let's first go over the basics of my CSV formatting and the rules for what I call "nexusft": the nexus-format text file.

Each line of the CSV starts with a root token, one of the names that can be searched for; all the tokens that follow it on that line are its children.
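To make the format concrete, here is a minimal sketch of splitting one nexusft line into its root and children. The class name NexusLine and the sample entities are made up for illustration and do not come from the original project; the sketch also assumes plain comma separation with no quoting or escaping.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not in the original project): holds the root
// (first token) and the children (remaining tokens) of one nexusft line.
class NexusLine {
    final String root;
    final List<String> children;

    NexusLine(String root, List<String> children) {
        this.root = root;
        this.children = children;
    }

    // Assumption: tokens are separated by plain commas, no quoting rules.
    static NexusLine parse(String line) {
        // The -1 limit keeps empty tokens, so tokens[0] always exists.
        String[] tokens = line.split(",", -1);
        String root = tokens[0].trim();
        List<String> children = new ArrayList<>();
        for (int i = 1; i < tokens.length; i++) {
            String child = tokens[i].trim();
            if (!child.isEmpty()) {   // ignore blank tokens
                children.add(child);
            }
        }
        return new NexusLine(root, children);
    }

    public static void main(String[] args) {
        // Made-up sample line: one root followed by three children.
        NexusLine parsed = NexusLine.parse("Einstein, Relativity, Physics, Princeton");
        System.out.println(parsed.root + " -> " + parsed.children);
    }
}
```

With the sample line, "Einstein" becomes the searchable root and the other three tokens become the nodes attached to it.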

Now, onto Java! The project package is split into three classes:

1) naseemgraphexp2.java
2) ReadFile.java
3) search.java

"naseemgraphexp2.java" is the core class, with a main method that runs a simple graph visualization. The code is self-explanatory, with comments.

When the program runs, it asks the user for a name (the search token) and calls the readnexus() method. The path of the nexusft file to be searched is passed as a parameter to the ReadFile class.

ReadFile returns the lines of text in the nexusft file, which are tokenized and searched against the entered keyword. Only the first token of a CSV line (if matched) is considered the root, and all others are considered children. Otherwise the method returns an error message.

The constructor "naseemgraphexp2()" is where my loop adds the tokens as vertices, with their names as labels. Only the searched central (root) node of the graph is connected to its relatives.
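The resulting shape is a star: one root in the middle with an edge to each child and no edges between children. The sketch below illustrates that shape with a plain adjacency map from the standard library rather than JUNG, so that it runs without extra jars. It is not the project's code, and the name StarGraphSketch is hypothetical. In JUNG 2.0 the same loop would presumably call addVertex and addEdge on a Graph implementation such as SparseMultigraph.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the JUNG graph: vertex label -> neighbour labels.
class StarGraphSketch {

    // Adds the root and every child as a vertex, and connects each
    // child to the root only (children are never linked to each other).
    static Map<String, List<String>> build(String root, List<String> children) {
        Map<String, List<String>> adjacency = new LinkedHashMap<>();
        adjacency.put(root, new ArrayList<>());
        for (String child : children) {
            if (adjacency.containsKey(child)) {
                continue; // skip duplicates and a child equal to the root
            }
            adjacency.put(child, new ArrayList<>());
            adjacency.get(root).add(child);   // root -> child
            adjacency.get(child).add(root);   // child -> root (undirected)
        }
        return adjacency;
    }

    public static void main(String[] args) {
        // Made-up labels, for illustration only.
        Map<String, List<String>> graph =
                build("Einstein", Arrays.asList("Relativity", "Physics", "Princeton"));
        for (Map.Entry<String, List<String>> entry : graph.entrySet()) {
            System.out.println(entry.getKey() + " : " + entry.getValue());
        }
    }
}
```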

The rest of the code in the main method is self-explanatory and JUNG-specific: it builds a graph and visualizes it. The geniuses behind JUNG kept very clean and clear examples for me to hack on with my shoddy programming skills. I've left in some original code from the JUNG examples, commented out. I know it's bad practice (I read that in one of those Pragmatic Programmer/Publishers books), but it is there so that those new to JUNG 2.0 can see the available features I did not use.

"ReadFile.java" is a straightforward file-reading program that reads each line of text into a String[] array, which is returned to the calling method in naseemgraphexp2.java.
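For readers who want to reproduce this step, a minimal sketch of such a reader could look like the following. It is not the original ReadFile.java; the class and method names are hypothetical, and it assumes the file is UTF-8 text and small enough to load into memory at once.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical re-creation of the file-reading step, not the original class.
class ReadFileSketch {

    // Reads every line of the file at the given path into a String[].
    // Assumption: UTF-8 encoding.
    static String[] readLines(String path) throws IOException {
        List<String> lines = Files.readAllLines(Paths.get(path), StandardCharsets.UTF_8);
        return lines.toArray(new String[0]);
    }

    public static void main(String[] args) throws IOException {
        // Write a small made-up nexusft file to a temp location, then read it back.
        Path tmp = Files.createTempFile("sample-nexusft", ".csv");
        Files.write(tmp, Arrays.asList("Einstein,Relativity,Physics", "Curie,Radium,Polonium"));
        for (String line : readLines(tmp.toString())) {
            System.out.println(line);
        }
        Files.delete(tmp);
    }
}
```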

Finally, "search.java" is the small program that tokenizes each string previously stored in our String[] array and matches it against our search token. If a match is found, a Boolean true is returned to the "readnexus()" method in naseemgraphexp2.java, which then proceeds to tokenize that string to create the nodes of the graph.
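The matching step can be sketched roughly as below. This is an illustration under assumptions, not the original search.java: the names SearchSketch, matchesRoot and findRootLine are hypothetical, the comparison is an exact, case-sensitive match on the first token, and a null return stands in for the error message mentioned earlier.

```java
// Hypothetical sketch of the search step, not the original class.
class SearchSketch {

    // True when the first comma-separated token of the line equals the
    // search token. Assumption: exact, case-sensitive comparison.
    static boolean matchesRoot(String line, String searchToken) {
        // The -1 limit keeps empty tokens, so index 0 always exists.
        String firstToken = line.split(",", -1)[0].trim();
        return firstToken.equals(searchToken.trim());
    }

    // Returns the first line whose root matches, or null if none does.
    static String findRootLine(String[] lines, String searchToken) {
        for (String line : lines) {
            if (matchesRoot(line, searchToken)) {
                return line;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Made-up sample lines in the nexusft layout.
        String[] lines = {
            "Einstein,Relativity,Physics",
            "Curie,Radium,Polonium"
        };
        String hit = findRootLine(lines, "Curie");
        System.out.println(hit != null ? "Found: " + hit : "No root named Curie");
    }
}
```

Note that "Physics" would not match here even though it appears in the file, because only the first token of a line counts as a searchable root.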

I hope you find it useful despite the poor hacking. If you polish it and go on to implement better features, do e-mail me and credit me for this shoddy hack.

Yours,
Md.Naseem Ashraf
iam.legend.n@gmail.com

Source Available at:


https://docs.google.com/file/d/0B4e1TZA7mwrNVTYyTk9fOGVuT1E/edit?usp=sharing

Screen Shot: (image not included)

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 United States License.
