Training model (PDF processing)
To train Grobid (instruction), we need two things: PDFs and the generated TEI XML files.
There are two open issues:
- How many PDFs to download, and which ones? Downloading 1.3 million PDFs is very time-consuming and, when done through proxies, very unstable.
- How to prepare the TEI XML files: either with a direct MARC XML -> TEI XML converter, or through an intermediate format (MARC XML <-> JSON <-> TEI XML). If we choose the second option, we need a common structure for the JSON format.
Also, what exactly do we want to train?
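To make the second issue more concrete, here is a minimal sketch of the intermediate-JSON route (MARC XML -> JSON -> TEI XML). Everything in it is an assumption for discussion: the JSON keys ("title", "authors"), the function names marc_to_record and record_to_tei, and the choice of MARC fields (245$a for the title, 100$a and 700$a for authors). The TEI it emits is a bare teiHeader for illustration only; it is not claimed to match the layout Grobid expects for its training files, and MARC punctuation cleanup is not handled.

```python
import json
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"
TEI_NS = "http://www.tei-c.org/ns/1.0"
NS = {"marc": MARC_NS}


def marc_to_record(marc_xml: str) -> dict:
    """Extract a small, JSON-serializable record from MARC XML.

    The key names ("title", "authors") are a hypothetical common structure.
    """
    root = ET.fromstring(marc_xml)
    # Accept either a bare <record> or a <collection> wrapping one.
    if root.tag == f"{{{MARC_NS}}}record":
        record = root
    else:
        record = root.find("marc:record", NS)
    if record is None:
        raise ValueError("no MARC <record> element found")

    def subfields(tag: str, code: str) -> list:
        path = f"marc:datafield[@tag='{tag}']/marc:subfield[@code='{code}']"
        return [el.text.strip() for el in record.findall(path, NS) if el.text]

    titles = subfields("245", "a")
    return {
        "title": titles[0] if titles else None,
        # Main author (100$a) followed by added authors (700$a).
        "authors": subfields("100", "a") + subfields("700", "a"),
    }


def record_to_tei(record: dict) -> str:
    """Render the intermediate record as a minimal TEI header (illustrative)."""
    ET.register_namespace("", TEI_NS)

    def q(name: str) -> str:
        return f"{{{TEI_NS}}}{name}"

    tei = ET.Element(q("TEI"))
    header = ET.SubElement(tei, q("teiHeader"))
    file_desc = ET.SubElement(header, q("fileDesc"))
    title_stmt = ET.SubElement(file_desc, q("titleStmt"))
    ET.SubElement(title_stmt, q("title")).text = record.get("title") or ""
    for name in record.get("authors", []):
        ET.SubElement(title_stmt, q("author")).text = name
    return ET.tostring(tei, encoding="unicode")


if __name__ == "__main__":
    sample = (
        f'<record xmlns="{MARC_NS}">'
        '<datafield tag="245" ind1=" " ind2=" ">'
        '<subfield code="a">An example title</subfield></datafield>'
        '<datafield tag="100" ind1=" " ind2=" ">'
        '<subfield code="a">Doe, Jane</subfield></datafield>'
        "</record>"
    )
    rec = marc_to_record(sample)
    print(json.dumps(rec, indent=2))   # the intermediate JSON
    print(record_to_tei(rec))          # the generated TEI XML
```

The point of the intermediate record is that each converter only has to agree on the JSON keys, so the MARC side and the TEI side can change independently. A direct MARC XML -> TEI XML converter would be shorter but would couple the two formats.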