SciBoard issues
https://gitlab.at.ispras.ru/Nedumov/sciboard/-/issues
2019-12-10T11:59:08Z

https://gitlab.at.ispras.ru/Nedumov/sciboard/-/issues/8
Extract funding metadata from PDF documents
2019-12-10T11:59:08Z, Мария Григорьева

1) "An approach to automatically gather funding information about scientific research projects from published papers" (Bachelor’s Thesis in Computer Science by Dimitri Kohler, Locarno, Switzerland, 2016) - https://www.merlin.uzh.ch/contributionDocument/download/10065
> Two methods were developed that can both extract funding entities from papers. One method is using RegEx to recognize entities in the funding section of a paper. The other method uses a machine learning algorithm for NER developed by the Stanford NLP team. The RegEx implementation delivers accurate results if the funding text is well structured and contains clearly separated entity names. To extract entities from more complex structures the developed RegEx has to be tweaked, so it can match the patterns of the structure. This happens directly in the program code. In comparison the NER implementation delivers similarly accurate results for simple structures it has learned before. If it encounters a new structure that was not included in the training set, the results become inconsistent. In order to be able to extract entity names from structures that were not trained, the training set can be extended. This way it is possible to train the model on the new structure without changing the program code. This improves the method’s accuracy for the new and similar structures.

![Screenshot 2019-10-28 at 12.10.33](/uploads/042dcb34b0be9026d3260389dd20bffe/Снимок_экрана_2019-10-28_в_12.10.33.png)
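
To make the RegEx side of that trade-off concrete, here is a minimal sketch in Python. It is not the thesis implementation: the patterns `AGENCY_RE` and `GRANT_RE` and the function `extract_funding` are hypothetical names written for this note, and they assume sentences shaped like "by <agency> under contract <number>".

```python
import re

# Hypothetical patterns (not taken from the thesis). They only cover
# sentences of the form "... by [the] <Agency> under contract|grant <number>".
AGENCY_RE = re.compile(
    r"\bby\s+(?:the\s+)?(?P<agency>[A-Z][A-Za-z&' -]+?)\s+under\s+(?:contract|grant)\b"
)
GRANT_RE = re.compile(
    r"\b(?:contract|grant)s?\s*(?:No\.?|№|#)?\s*(?P<number>\d[0-9A-Za-z.-]*[0-9A-Za-z])"
)


def extract_funding(text):
    """Return agency names and grant numbers found by the two patterns."""
    agencies = [m.group("agency") for m in AGENCY_RE.finditer(text)]
    grants = [m.group("number") for m in GRANT_RE.finditer(text)]
    return {"agencies": agencies, "grants": grants}


# Sample sentence in the well-structured form the patterns expect.
ACK = (
    "The work was supported by the Russian Ministry of Science and Education "
    "under contract No.14.Z50.31.0024 and by the Russian Science Foundation "
    "under contract №16-11-10280."
)

if __name__ == "__main__":
    print(extract_funding(ACK))
```

A sentence with a different shape (for example "grant 123 from agency X") would return nothing until the patterns are edited, which is the "tweaked directly in the program code" limitation described in the quote.
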
**2) GROBID has a branch with an acknowledgements parser:**
https://github.com/kermitt2/grobid/tree/AcknowledgmentParser_Desir2ndCodeSprint
Example paper: http://inspirehep.net/record/1699835/files/10.1088_1742-6596_1085_3_032013.pdf
> Acknowledgements
The work was supported by the Russian Ministry of Science and Education under contract
No.14.Z50.31.0024 and by the Russian Science Foundation under contract №16-11-10280.
**Results in TEI:**
```
<div type="acknowledgement">
<div
xmlns="http://www.tei-c.org/ns/1.0">
<head>Acknowledgements</head>
<p>The work was supported by
<rs type="fundingAgency" coords="6,247,70,268,19,34,80,9,69;6,208,70,280,79,71,75,9,69">the Russian Ministry of Science and Education</rs> under contract No.
<rs type="grantNumber" coords="6,86,28,280,79,15,23,9,69;6,116,73,280,79,20,30,9,69;6,466,69,280,79,5,40,9,69">14.Z50.31.0024</rs> and by
<rs type="fundingAgency" coords="6,247,70,268,19,34,80,9,69;6,208,70,280,79,124,58,9,69">the Russian Science Foundation</rs> under contract
<rs type="grantNumber" coords="6,401,89,280,79,64,80,9,69">№16-11-10280</rs>.
</p>
</div>
</div>
```
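
The TEI result above can be turned back into plain metadata with a few lines of Python. This is a minimal sketch, assuming the `rs` elements sit in the TEI namespace and carry `type="fundingAgency"` or `type="grantNumber"` as in the sample; the function name `funding_entities` and the trimmed `SAMPLE` string (coordinates omitted) are illustrative only.

```python
import xml.etree.ElementTree as ET

# Namespace declared on the inner <div> of the sample output.
TEI_NS = "{http://www.tei-c.org/ns/1.0}"


def funding_entities(tei_xml):
    """Collect the text of <rs> elements, grouped by their type attribute."""
    root = ET.fromstring(tei_xml)
    found = {"fundingAgency": [], "grantNumber": []}
    for rs in root.iter(TEI_NS + "rs"):
        kind = rs.get("type")
        if kind in found:
            found[kind].append("".join(rs.itertext()).strip())
    return found


# Trimmed copy of the sample result (coords attributes left out).
SAMPLE = """<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0">
<head>Acknowledgements</head>
<p>The work was supported by
<rs type="fundingAgency">the Russian Ministry of Science and Education</rs> under contract No.
<rs type="grantNumber">14.Z50.31.0024</rs> and by
<rs type="fundingAgency">the Russian Science Foundation</rs> under contract
<rs type="grantNumber">№16-11-10280</rs>.
</p>
</div>
</div>"""

if __name__ == "__main__":
    print(funding_entities(SAMPLE))
```

Counting how many processed documents return at least one `fundingAgency` entry would give a rough coverage figure for the branch.
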
We can test this branch on all downloaded PDF documents. It will certainly not work correctly for every document, but if it extracts at least 50% of the funding information, it will be very helpful.

https://gitlab.at.ispras.ru/Nedumov/sciboard/-/issues/2
Choose PDF parser for HEP papers
2019-10-28T10:07:28Z, Мария Григорьева

* CERMINE - http://cermine.ceon.pl (online service)
* science-parse v1 - https://github.com/allenai/science-parse (docker)
* GROBID - http://cloud.science-miner.com/grobid/ (online service)

Google doc with their results: https://docs.google.com/document/d/1F8JJxIxJFFlSUOHtGTZaRRodL3VHgdld49Rkcoss1v4/edit?usp=sharing