Visualising a Journal Article

This tutorial demonstrates how to visualise a journal article. In general, these steps can be applied to any kind of document.

TextNet

Convert Document to Plain Text

TextNet has a PDF upload option. In many cases this is a suitable choice, but it can often fail. This is typically the result of the PDF parser producing characters that create inconsistencies in the GEXF file, which is used to capture the graph structure.

To avoid these issues it's preferable to upload plain text to the textarea field on the home page.

The article used in this tutorial is titled Humpback whale “super-groups” – A novel low-latitude feeding behaviour of Southern Hemisphere humpback whales (Megaptera novaeangliae) in the Benguela Upwelling System. It can be downloaded as a PDF or viewed online in HTML. For our purposes, and for generality, we will start by downloading the PDF. We will then copy the text into a text editor and paste it into TextNet to create a visualisation.

The proper noun graph is usually much smaller than the full graph, so it's usually a good idea to start with that one. It's also a good idea to name the project; let's say the name is name of project. It's important not to include underscores (_) when creating a project name.

The graph will not render.

Troubleshooting

The first task is to find the cause of the error. This requires navigating to the GEXF file, which can be found at the path /files/name_of_project/graph.gexf. Note that the URL includes underscores while the actual name of the project does not; the substitution is done automatically.
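As a rough illustration of how the project name maps to that path, here is a minimal Python sketch. It assumes the only transformation is replacing spaces with underscores; TextNet's real rule may differ, and gexf_path is a hypothetical helper, not part of TextNet.

```python
def gexf_path(project_name: str) -> str:
    """Build the relative path to a project's GEXF file.

    Assumption: spaces in the project name become underscores and
    nothing else changes. This is inferred from the example path in
    the tutorial, not from TextNet's source.
    """
    slug = project_name.strip().replace(" ", "_")
    return f"/files/{slug}/graph.gexf"


# Append the result to the address of the TextNet site you are using.
print(gexf_path("name of project"))
```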

The error will read something along the lines of: Error on line # at column ##: Entity ‘###’ not defined. The entity in question will typically be ‘&Atilde’.
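If you have saved the GEXF file locally, a similar error can be reproduced outside the browser. The sketch below uses Python's standard xml.etree.ElementTree parser, which reports the line and column of the first problem it hits. The sample string is a made-up fragment, not real TextNet output, and the wording of the message differs from the browser's.

```python
import xml.etree.ElementTree as ET


def find_parse_error(gexf_text: str):
    """Return (line, column, message) for the first XML error, or None."""
    try:
        ET.fromstring(gexf_text)
    except ET.ParseError as err:
        line, column = err.position  # position of the failure in the text
        return line, column, str(err)
    return None


# Hypothetical fragment containing an entity that plain XML does not define.
sample = '<gexf><graph><nodes><node id="23&Atilde;" label="23"/></nodes></graph></gexf>'
print(find_parse_error(sample))
```

For a real file, read its contents into a string first and pass that to find_parse_error.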

This tells us where the error is but not which character in the original text caused it. This can be answered by opening up the file. There are two ways to do this:

  • right click on the webpage showing the GEXF file and select view source
  • right click on the webpage showing the GEXF file and select save target as...
    • subsequently you can open the GEXF file in a text editor. If the original text was large (several pages) then this option will be faster.

A search for the undefined entity (‘###’) will lead to specific nodes. When those node names are searched for within the original document, the offending characters can typically be found close to them. Some examples are included below:

  • '... node id="Reference numbersÃ"...'
  • '... node id="23Ã"...'
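For a large file, the search for undefined entities can be scripted. The following is a sketch only: it assumes the problem shows up as a named entity such as &Atilde; in the saved GEXF text, and it treats every named entity other than the five that XML predefines as suspect. The sample text is invented for illustration.

```python
import re

# The only named entities XML defines without a DTD.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}
ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def undefined_entities(gexf_text):
    """List (line number, entity name, line) for each suspect entity."""
    hits = []
    for line_no, line in enumerate(gexf_text.splitlines(), start=1):
        for match in ENTITY_RE.finditer(line):
            if match.group(1) not in XML_PREDEFINED:
                hits.append((line_no, match.group(1), line.strip()))
    return hits


# Hypothetical fragment; in practice pass in the saved file's contents.
sample = '<nodes>\n<node id="Reference numbers&Atilde;" label="x &amp; y"/>\n</nodes>'
for hit in undefined_entities(sample):
    print(hit)
```

Each reported line contains a node id that can then be searched for in the original text.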

A search for Reference numbers will return "Reference numbersÐEC020", with "Ð" the offending character. This should be changed to "D". The other offending characters are:

  • Ë to E
  • ± to +-
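Rather than editing each occurrence by hand, the substitutions above can be applied to the plain text in one pass before pasting it into TextNet. This is a minimal sketch; the REPLACEMENTS table covers only the characters found so far in this article, and other documents will likely need a different table.

```python
# Characters found to break the graph for this article, and their stand-ins.
REPLACEMENTS = {"Ð": "D", "Ë": "E", "±": "+-"}


def clean_text(text: str) -> str:
    """Replace each known offending character with its stand-in."""
    return text.translate(str.maketrans(REPLACEMENTS))


print(clean_text("Reference numbersÐEC020"))  # prints: Reference numbersDEC020
```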

After making these changes the proper noun graph should work. At this point it might be a good idea to try the full graph. It will fail, as there is one final unaccounted-for character, "È", which can be converted to "E". This character was found using the same process as described above. The process can be tedious; the alternative is to search for various accented characters, which might be easier to automate but might not catch all the offending characters. I have created 6 graphs from this paper, 3 proper noun graphs and 3 full graphs; links are available below.
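The automated alternative could look something like the sketch below, which lists every non-ASCII character in the plain text along with how often it appears. It is an assumption that all troublesome characters are non-ASCII, and not every character reported will necessarily break the graph, so treat the output as a list of candidates to inspect rather than a list of confirmed errors.

```python
from collections import Counter
import unicodedata


def non_ascii_report(text: str):
    """Return (character, Unicode name, count) for each non-ASCII character,
    most frequent first."""
    counts = Counter(ch for ch in text if ord(ch) > 127)
    return [
        (ch, unicodedata.name(ch, "UNKNOWN"), n)
        for ch, n in counts.most_common()
    ]


# Hypothetical snippet; in practice pass in the full article text.
for row in non_ascii_report("aÈb ± È"):
    print(row)
```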

Finishing Off

At this point the proper noun graph is complete. It can be exported in GEXF file format and finished off in Gephi.