SciBoard issues - https://gitlab.at.ispras.ru/Nedumov/sciboard/-/issues

https://gitlab.at.ispras.ru/Nedumov/sciboard/-/issues/9
**InspireHEP data issues** (2019-12-26, Евгений Третьяков)

I noticed that **Sternberg Astron. Inst.** has coordinates somewhere in **Murom (Russia)**. I looked at our data in ElasticSearch: the error is in the source data. Then I checked the corresponding record on inspirehep.net ([https://labs.inspirehep.net/api/institutions/903568](https://labs.inspirehep.net/api/institutions/903568)). The information there is already corrected, but there is a **legacy_ICN** field that still says **Sternberg Astron. Inst.**
**Conclusions**:
1. Our data is stale, but it is being cleaned up on the inspirehep.net side.
2. We can either refresh the data or run the demo with the data we have.
3. We definitely need to keep in mind that things like this may pop up during a demo.

https://gitlab.at.ispras.ru/Nedumov/sciboard/-/issues/8
**Extract funding metadata from PDF documents** (2019-12-10, Мария Григорьева)

1) "An approach to automatically gather funding information about scientific research projects from published papers" (Bachelor's Thesis in Computer Science by Dimitri Kohler, Locarno, Switzerland, 2016) - https://www.merlin.uzh.ch/contributionDocument/download/10065
> Two methods were developed that can both extract funding entities from papers.
> One method is using RegEx to recognize entities in the funding section of a paper. The other method uses a machine learning algorithm for NER developed by the Stanford NLP team. The RegEx implementation delivers accurate results if the funding text is well structured and contains clearly separated entity names.
> To extract entities from more complex structures the developed RegEx has to be tweaked, so it can match the patterns of the structure. This happens directly in the program code. In comparison the NER implementation delivers similarly accurate results for simple structures it has learned before. If it encounters a new structure that was not included in the training set, the results become inconsistent. In order to be able to extract entity names from structures that were not trained, the training set can be extended. This way it is possible to train the model on the new structure without changing the program code. This improves the method's accuracy for the new and similar structures.
![Снимок_экрана_2019-10-28_в_12.10.33](/uploads/042dcb34b0be9026d3260389dd20bffe/Снимок_экрана_2019-10-28_в_12.10.33.png)
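For comparison, the RegEx approach can be sketched roughly as follows. The patterns below are illustrative, not the ones from the thesis:

```python
import re

# Illustrative patterns (not the thesis's actual RegEx): look for phrases
# like "supported by <agency> under contract/grant <number>".
AGENCY_RE = re.compile(
    r"supported by (?:the )?(?P<agency>[A-Z][\w .]*?)\s+under (?:contract|grant)"
)
GRANT_RE = re.compile(
    r"(?:contract|grant)\s+(?:No\.?\s*|№\s*)?(?P<grant>\w[\w\-]*(?:\.[\w\-]+)*)"
)

def extract_funding(text):
    """Return (funding agencies, grant numbers) found in a funding sentence."""
    agencies = [m.group("agency").strip() for m in AGENCY_RE.finditer(text)]
    grants = [m.group("grant") for m in GRANT_RE.finditer(text)]
    return agencies, grants

sentence = ("The work was supported by the Russian Science Foundation "
            "under contract No. 16-11-10280.")
print(extract_funding(sentence))
# → (['Russian Science Foundation'], ['16-11-10280'])
```

As the thesis notes, this works only while the funding text is well structured; any new phrasing means editing the patterns in code.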
**2) GROBID has the branch with Acknowledgements parser:**
https://github.com/kermitt2/grobid/tree/AcknowledgmentParser_Desir2ndCodeSprint
Example of Paper: http://inspirehep.net/record/1699835/files/10.1088_1742-6596_1085_3_032013.pdf
> Acknowledgements
> The work was supported by the Russian Ministry of Science and Education under contract No.14.Z50.31.0024 and by the Russian Science Foundation under contract №16-11-10280.
**Results in TEI:**
```xml
<div type="acknowledgement">
<div
xmlns="http://www.tei-c.org/ns/1.0">
<head>Acknowledgements</head>
<p>The work was supported by
<rs type="fundingAgency" coords="6,247,70,268,19,34,80,9,69;6,208,70,280,79,71,75,9,69">the Russian Ministry of Science and Education</rs> under contract No.
<rs type="grantNumber" coords="6,86,28,280,79,15,23,9,69;6,116,73,280,79,20,30,9,69;6,466,69,280,79,5,40,9,69">14.Z50.31.0024</rs> and by
<rs type="fundingAgency" coords="6,247,70,268,19,34,80,9,69;6,208,70,280,79,124,58,9,69">the Russian Science Foundation</rs> under contract
<rs type="grantNumber" coords="6,401,89,280,79,64,80,9,69">№16-11-10280</rs>.
</p>
</div>
</div>
```
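Post-processing such TEI output is mechanical; a minimal sketch that collects the `<rs>` funding entities (namespace and structure as in the sample above, `coords` attributes omitted for brevity):

```python
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"

def funding_entities(tei_xml):
    """Collect (type, text) pairs from <rs> elements in a GROBID
    acknowledgement fragment."""
    root = ET.fromstring(tei_xml)
    return [(rs.get("type"), (rs.text or "").strip())
            for rs in root.iter(TEI_NS + "rs")]

sample = """
<div type="acknowledgement">
  <div xmlns="http://www.tei-c.org/ns/1.0">
    <p>The work was supported by
      <rs type="fundingAgency">the Russian Science Foundation</rs> under contract
      <rs type="grantNumber">16-11-10280</rs>.</p>
  </div>
</div>"""

print(funding_entities(sample))
# → [('fundingAgency', 'the Russian Science Foundation'), ('grantNumber', '16-11-10280')]
```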
We can test this branch on all downloaded PDF documents. Obviously, it will not work correctly for every document, but if it can process at least 50% of the funding info, it will be very helpful.

https://gitlab.at.ispras.ru/Nedumov/sciboard/-/issues/7
**InspireHEP Beta (new JSON API)** (2019-12-10, Мария Григорьева)

It turned out that InspireHEP is moving to a new platform and the old recjson API has not been actively maintained for a while.
InspireHEP Beta - https://labs.inspirehep.net
GitHub - https://github.com/inspirehep/inspire-next
Docs - https://inspirehep.readthedocs.io/en/latest/
Read about it in this blog post: https://blog.inspirehep.net/2019/02/introducing-inspire-beta/
And a new, undocumented search API with rich JSON:

```
$ curl -H "accept: application/json" \
    'https://labs.inspirehep.net/api/literature/1747615'
```
The schema for HEP records: https://github.com/inspirehep/inspire-schemas/blob/master/inspire_schemas/records/hep.yml
Example of JSON record for URL: https://labs.inspirehep.net/api/literature/1747615
[1747615.json](/uploads/b3996e5070059a57a11c457c04797225/1747615.json)
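A minimal sketch of fetching and reading such a record in Python; the `metadata.titles` layout is taken from the hep.yml schema, so treat the field path as an assumption:

```python
import json
import urllib.request

API = "https://labs.inspirehep.net/api/literature/{recid}"

def fetch_record(recid):
    """Fetch one literature record from the (beta) JSON API."""
    req = urllib.request.Request(API.format(recid=recid),
                                 headers={"accept": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def first_title(record):
    """Pull the first title out of a record; the metadata.titles
    layout is assumed from the hep.yml schema."""
    return record.get("metadata", {}).get("titles", [{}])[0].get("title")

# Offline demonstration on a record shaped like the API response:
sample = {"metadata": {"titles": [{"title": "Example HEP paper"}]}}
print(first_title(sample))  # → Example HEP paper
```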
https://gitlab.at.ispras.ru/Nedumov/sciboard/-/issues/5
**Searching for RFBR Grants Information** (2019-10-15, Мария Григорьева)

https://gitlab.at.ispras.ru/Nedumov/sciboard/-/issues/4
**MARCXML, Enhanced MARCXML, JSON. What is happening here?** (2019-10-15, Евгений Третьяков)

Inspire exposes an [API](http://inspirehep.net/info/hep/api?ln=ru) for querying most aspects of its holdings and provides responses in either XML, Enhanced MARCXML or JSON.
Also, for bulk download, Inspire takes snapshots in [JSON format](http://inspirehep.net/hep_records.json.gz) and [Enhanced MARCXML format](http://inspirehep.net/dumps/inspire-dump.html).
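For working with the JSON dump locally, a small streaming reader can help. This assumes the dump is one JSON record per line (JSON Lines); adjust if it turns out to be a single large array:

```python
import gzip
import json

def iter_dump(path):
    """Stream records from a gzipped JSON dump without loading it all
    into memory.  Assumes one JSON document per line."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)

# Usage (path is illustrative):
# for record in iter_dump("hep_records.json.gz"):
#     process(record)
```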
It should be mentioned that with the Inspire API it's possible to query specific MARC fields and download the information directly in JSON or MARCXML format.
For example, the request `http://inspirehep.net/record/1757461?of=xm&ot=100,200` returns
![Screenshot_from_2019-10-06_13-15-05](/uploads/05dc99bd1488079c5b5402742b298ab6/Screenshot_from_2019-10-06_13-15-05.png)
The JSON API operates similarly, **with named fields instead of MARC tags**. *(So MARC tags and JSON fields may not be mapped to each other correctly.)* Since the field names are evolving, a comprehensive list is currently best found in the source: [https://github.com/inspirehep/invenio/blob/prod/modules/bibfield/etc/atlantis.cfg](https://github.com/inspirehep/invenio/blob/prod/modules/bibfield/etc/atlantis.cfg)
MARCXML is the native format used to store metadata in INSPIRE. All bibliographic metadata that can be hand-curated is stored in MARCXML. Links from authors in papers to the corresponding authors in HepNames, author-disambiguation links and reference links are available in the Enhanced MARCXML format. It is based on the original **MARCXML**, but **with additional subfields** that express relations across records.
For more information on the additional relations see the detailed MARCXML description in: [records markup](https://twiki.cern.ch/twiki/bin/view/Inspire/DevelopmentRecordMarkup).
To get an Enhanced MARCXML record via the API, *of=xm* has to be replaced by *of=xme*. So this request: `http://inspirehep.net/record/1757461?of=xme` returns the full Enhanced MARCXML record.
If we compare the full *of=xm* and *of=xme* records, we see additional info under code 100:
**XM**
![Screenshot_from_2019-10-06_13-45-46](/uploads/9705ca0d1148b8a8d75d6915972bd56b/Screenshot_from_2019-10-06_13-45-46.png)
**XME**
![Screenshot_from_2019-10-06_13-44-41](/uploads/0760d0f4eb5849a215fb3049bff36f2c/Screenshot_from_2019-10-06_13-44-41.png)
But this doesn't combine with field selection: the request `http://inspirehep.net/record/1757461?of=xme&ot=100` returns *of=xm* data.
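A tiny helper for building these legacy-API URLs (just string construction; the `of`/`ot` parameters are the ones discussed above):

```python
from urllib.parse import urlencode

BASE = "http://inspirehep.net/record/{recid}"

def record_url(recid, fmt="xm", tags=None):
    """Build an INSPIRE legacy-API URL; fmt is 'xm' (MARCXML) or
    'xme' (Enhanced MARCXML).  Note that combining of=xme with ot=...
    silently falls back to plain xm data, as observed above."""
    params = {"of": fmt}
    if tags:
        params["ot"] = ",".join(tags)
    return BASE.format(recid=recid) + "?" + urlencode(params)

print(record_url(1757461, "xme"))
# → http://inspirehep.net/record/1757461?of=xme
```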
Let's draw some conclusions:
1. There are API and available data in JSON, MARCXML and Enhanced MARCXML.
2. There are dumps for JSON and Enhanced MARCXML.
3. Data from JSON API is larger than from JSON dump.
4. Enhanced MARCXML has additional custom fields that might coincide with this [mapping](https://twiki.cern.ch/twiki/bin/view/Inspire/DevelopmentRecordMarkup).
5. The mapping between JSON and MARC is described [here](https://github.com/inspirehep/invenio/blob/prod/modules/bibfield/etc/atlantis.cfg).
Also, we are dealing not with plain MARC but with MARCXML extended into Enhanced MARCXML via (possibly correct) mappings.
There are many MARC/MARCXML-to-JSON converters in Python, Go, Scala, Perl, ... All of them simply read the MARC/MARCXML formats and link MARC codes to some JSON fields, so I suppose there is no universal converter that covers all custom mappings.
* Python converter: [pymarc](https://github.com/edsu/pymarc)
* Go converters: [marc21](https://github.com/miku/marc21) and [marctools](https://github.com/miku/marctools)
* Scala converter: [scala-marc](https://github.com/diegododero/scala-marc)
* Perl converter: [Catmandu](https://github.com/LibreCat/Catmandu)
To me, the variety of these tools in different languages means that they all just read the XML and convert MARC codes (XML tags) to JSON fields according to some mapping.
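To illustrate that point: a converter is essentially a tag-to-field mapping applied while reading the XML. A minimal sketch (the mapping below is made up for illustration, not any official one):

```python
import xml.etree.ElementTree as ET

MARC_NS = "{http://www.loc.gov/MARC21/slim}"

# An illustrative (not official) mapping from (MARC tag, subfield code)
# to a JSON field name -- exactly the kind of per-project mapping the
# tools above hard-code.
MAPPING = {("100", "a"): "first_author", ("245", "a"): "title"}

def marcxml_to_json(xml_text, mapping=MAPPING):
    """Flatten a MARCXML record to a dict using a custom tag mapping."""
    record = {}
    root = ET.fromstring(xml_text)
    for field in root.iter(MARC_NS + "datafield"):
        for sub in field.iter(MARC_NS + "subfield"):
            key = mapping.get((field.get("tag"), sub.get("code")))
            if key:
                record[key] = sub.text
    return record

sample = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="100" ind1=" " ind2=" ">
    <subfield code="a">Doe, Jane</subfield>
  </datafield>
  <datafield tag="245" ind1=" " ind2=" ">
    <subfield code="a">An Example Title</subfield>
  </datafield>
</record>"""

print(marcxml_to_json(sample))
# → {'first_author': 'Doe, Jane', 'title': 'An Example Title'}
```

Swapping in a different `MAPPING` gives a different converter; nothing here is universal, which matches the conclusion above.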
So, what do we want to do and what do we need to do?

https://gitlab.at.ispras.ru/Nedumov/sciboard/-/issues/3
**Importing data from inspire for visualization** (2019-09-19, Yaroslav)

We need a simple script for importing data from Elastic to SciNoon.

https://gitlab.at.ispras.ru/Nedumov/sciboard/-/issues/2
**Choose PDF parser for HEP papers** (2019-10-28, Мария Григорьева)

* CERMINE - http://cermine.ceon.pl (online service)
* science-parse v1 - https://github.com/allenai/science-parse (docker)
* GROBID - http://cloud.science-miner.com/grobid/ (online service)
Google doc with the results: https://docs.google.com/document/d/1F8JJxIxJFFlSUOHtGTZaRRodL3VHgdld49Rkcoss1v4/edit?usp=sharing
https://gitlab.at.ispras.ru/Nedumov/sciboard/-/issues/1
**possible use-cases and filters** (2019-09-24, Алексей Климентов)

A list of possible use-cases and filters for the initial stage of working with Inspire-HEP data (based on the discussion at ISP RAS on 29.08.19).

Assumptions:
* information source ("inspire"): inspire (http://inspirehep.net)
* inspire dump ("dump") - a catalogue containing 1,300,000 records
* inspire meta ("meta") - meta-information for each of the records in the dump
* PDF docs ("pdf") - full-text information for the records

A possible (two-step) scenario:
* "dump" and "meta" are exported from "inspire" in full and loaded into ES / kibana
* "pdf" is downloaded only for the records selected after filtering

Description of possible filters and use-cases for selecting papers about data obtained at the LHC.

Filtering levels (in order of increasing complexity):
1. all records in inspire matching a search for the word 'LHC' (~69.5k records)
2. selection of dissertation abstracts (theses)
3. selection of papers published by the experiments
   assumptions:
   1. the meta-information contains the words CMS/ATLAS/LHCb/ALICE Collaboration
   2. the title contains the experiment name: CMS/ATLAS/LHCb/ALICE
4. selection of papers published by Russian scientists (universities) not covered by 1.1.3 and 1.1.2
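As a sketch, the first filtering levels could be expressed as Elasticsearch bool queries. Field names like `fulltext`, `document_type` and `collaborations` are assumptions about our local index schema, not the actual INSPIRE one:

```python
def lhc_query(collaborations=None, thesis_only=False):
    """Build an Elasticsearch query body for the filtering levels above:
    full-text 'LHC', optionally theses only, optionally restricted to
    given collaborations (field names are assumed, adjust to our schema)."""
    must = [{"match": {"fulltext": "LHC"}}]
    if thesis_only:
        must.append({"term": {"document_type": "thesis"}})
    if collaborations:
        must.append({"terms": {"collaborations": collaborations}})
    return {"query": {"bool": {"must": must}}}

# Level 3: papers by the big four LHC experiments.
print(lhc_query(collaborations=["ATLAS", "CMS", "LHCb", "ALICE"]))
```

The query body can be passed as-is to the ES `_search` endpoint or to a client library.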
Use-cases:
1. for the records from (3), based on bibliographic data (assuming the meta contains the full bibliography of the paper, from the bibliography or references section): obtain cross-references between papers and identify the most frequently cited ones
2. for the records from (2): the number of theses per collaboration for Russia
3. for the records from (2, 3, 4):
   3.1 the number of records about physics results
   3.2 the number of records related to IT, DAQ, computing, SW
   3.3 the number of records related to detector development
   3.4 search for the most cited papers (those with the largest number of references to them)

One possible "export" scenario:
1. estimate the volume of (2, 3 and 4)
   1.1 if the volume is too large ("too large": TBD), apply additional filtering by year, e.g. records from the last N years
   1.2 for 1.1.3 and 1.1.4 (after the additional selection in 5.1), obtain the full texts and search for references to RF grants