Extract funding metadata from PDF documents
1) "An approach to automatically gather funding information about scientific research projects from published papers" (Bachelor’s Thesis in Computer Science by Dimitri Kohler Locarno, Switzerland, 2016) - https://www.merlin.uzh.ch/contributionDocument/download/10065
Two methods were developed, both of which can extract funding entities from papers. One method uses RegEx to recognize entities in the funding section of a paper. The other uses a machine learning algorithm for NER developed by the Stanford NLP team. The RegEx implementation delivers accurate results if the funding text is well structured and contains clearly separated entity names. To extract entities from more complex structures, the RegEx has to be tweaked to match the patterns of those structures, which happens directly in the program code. In comparison, the NER implementation delivers similarly accurate results for simple structures it has learned before. If it encounters a new structure that was not included in the training set, the results become inconsistent. To extract entity names from structures that were not trained, the training set can be extended. This way the model can be trained on the new structure without changing the program code, which improves the method’s accuracy for the new and similar structures.
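To illustrate the RegEx idea, here is a minimal Python sketch. It is not the code from the thesis: the patterns, the names AGENCY_RE and GRANT_RE, and the sample sentence are assumptions chosen for illustration, and they only cover the simple "by <agency> under contract <number>" structure.

```python
import re

# Hypothetical sample: a well-structured funding sentence.
TEXT = (
    "The work was supported by the Russian Ministry of Science and Education "
    "under contract No.14.Z50.31.0024 and by the Russian Science Foundation "
    "under contract №16-11-10280."
)

# Assumed pattern: an agency name is a capitalized phrase between
# "by (the)" and "under contract|grant".
AGENCY_RE = re.compile(r"\bby\s+(?:the\s+)?([A-Z][A-Za-z ]+?)\s+under\s+(?:contract|grant)")

# Assumed pattern: a grant number follows "No." or "№" and consists of
# uppercase letters, digits, dots and hyphens, ending on a letter or digit.
GRANT_RE = re.compile(r"(?:No\.|№)\s*([A-Z0-9][A-Z0-9.\-]*[A-Z0-9])")

agencies = AGENCY_RE.findall(TEXT)
grants = GRANT_RE.findall(TEXT)

print(agencies)
print(grants)
```

As the thesis summary notes, a sentence with a different structure (for example "funded through grant X of agency Y") would not match, and the patterns would have to be changed in the code.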
2) GROBID has a branch with an acknowledgements parser:
https://github.com/kermitt2/grobid/tree/AcknowledgmentParser_Desir2ndCodeSprint
Example paper: http://inspirehep.net/record/1699835/files/10.1088_1742-6596_1085_3_032013.pdf
Acknowledgements The work was supported by the Russian Ministry of Science and Education under contract No.14.Z50.31.0024 and by the Russian Science Foundation under contract №16-11-10280.
Results in TEI:
<div type="acknowledgement">
<div
xmlns="http://www.tei-c.org/ns/1.0">
<head>Acknowledgements</head>
<p>The work was supported by
<rs type="fundingAgency" coords="6,247,70,268,19,34,80,9,69;6,208,70,280,79,71,75,9,69">the Russian Ministry of Science and Education</rs> under contract No.
<rs type="grantNumber" coords="6,86,28,280,79,15,23,9,69;6,116,73,280,79,20,30,9,69;6,466,69,280,79,5,40,9,69">14.Z50.31.0024</rs> and by
<rs type="fundingAgency" coords="6,247,70,268,19,34,80,9,69;6,208,70,280,79,124,58,9,69">the Russian Science Foundation</rs> under contract
<rs type="grantNumber" coords="6,401,89,280,79,64,80,9,69">№16-11-10280</rs>.
</p>
</div>
</div>
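The TEI result can be turned into structured metadata by collecting the rs elements by their type attribute. The following is a minimal Python sketch using only the standard library. The function name extract_funding is an assumption, and the embedded snippet is a copy of the output above with the coords attributes left out for brevity.

```python
import xml.etree.ElementTree as ET

# Copy of the TEI output above, with the coords attributes omitted.
TEI_SNIPPET = """<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0">
<head>Acknowledgements</head>
<p>The work was supported by
<rs type="fundingAgency">the Russian Ministry of Science and Education</rs> under contract No.
<rs type="grantNumber">14.Z50.31.0024</rs> and by
<rs type="fundingAgency">the Russian Science Foundation</rs> under contract
<rs type="grantNumber">№16-11-10280</rs>.
</p>
</div>
</div>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def extract_funding(tei_xml):
    """Return funding agencies and grant numbers found in <rs> elements."""
    root = ET.fromstring(tei_xml)
    result = {"fundingAgency": [], "grantNumber": []}
    for element in root.iter():
        if local_name(element.tag) != "rs":
            continue
        kind = element.get("type")
        if kind in result:
            # Join the element's text and collapse whitespace and line breaks.
            text = " ".join("".join(element.itertext()).split())
            result[kind].append(text)
    return result


funding = extract_funding(TEI_SNIPPET)
print(funding)
```

Matching on the local tag name keeps the sketch working whether or not the rs elements carry the TEI namespace. Note that the grant number keeps the "№" prefix exactly as it was annotated, so some normalization may still be needed.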
We can test this branch on all downloaded PDF documents. It will not work correctly for every document, but if it can extract at least 50% of the funding information, it will be very helpful.