Training model (PDF processing)

To train Grobid (instruction), we need two things: PDFs and the corresponding generated TEI XML files.

There are two issues:

How many and which PDFs to download. Downloading 1.3 million PDFs is quite time-consuming and very unstable when using proxies.
How to prepare the TEI XML files: either with a direct MARC XML -> TEI XML converter, or through a MARC XML <-> JSON <-> TEI XML chain. If we choose the second option, then we need a common structure for the JSON format.
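
To make the second option more concrete, here is a minimal sketch in Python of what a MARC XML -> JSON -> TEI XML chain could look like. Everything specific in it is an assumption for illustration: the JSON keys (title, authors, abstract), the function names, the choice of MARC fields (245$a for the title, 100$a and 700$a for authors, 520$a for the abstract), and the MARC21 slim namespace on the input. The TEI produced is only a header skeleton. It is not schema-complete and is not claimed to match the training format Grobid expects, which still has to be checked against the Grobid instruction.

```python
import json
import xml.etree.ElementTree as ET

# Assumption: input records use the standard MARC21 slim namespace.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}
TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical sample record, used only to exercise the sketch.
SAMPLE_MARC = """<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <datafield tag="100" ind1=" " ind2=" "><subfield code="a">Doe, Jane</subfield></datafield>
    <datafield tag="245" ind1=" " ind2=" "><subfield code="a">A sample title</subfield></datafield>
    <datafield tag="520" ind1=" " ind2=" "><subfield code="a">A short abstract.</subfield></datafield>
    <datafield tag="700" ind1=" " ind2=" "><subfield code="a">Roe, Richard</subfield></datafield>
  </record>
</collection>"""


def subfields(record, tag, code):
    """Return the text of every subfield `code` inside datafields with `tag`."""
    path = f"marc:datafield[@tag='{tag}']/marc:subfield[@code='{code}']"
    return [el.text.strip() for el in record.findall(path, MARC_NS) if el.text]


def marc_to_json(marc_xml):
    """Map one MARC XML record to the (hypothetical) common JSON structure."""
    root = ET.fromstring(marc_xml)
    record_tag = f"{{{MARC_NS['marc']}}}record"
    # Accept either a bare <record> or a <collection> wrapping one.
    record = root if root.tag == record_tag else root.find("marc:record", MARC_NS)
    if record is None:
        raise ValueError("no MARC record found")
    titles = subfields(record, "245", "a")
    return {
        "title": titles[0] if titles else None,
        "authors": subfields(record, "100", "a") + subfields(record, "700", "a"),
        "abstract": " ".join(subfields(record, "520", "a")) or None,
    }


def json_to_tei(doc):
    """Build a minimal TEI header skeleton from the common JSON structure."""
    ET.register_namespace("", TEI_NS)  # serialize TEI as the default namespace

    def tei(parent, name, text=None):
        el = ET.SubElement(parent, f"{{{TEI_NS}}}{name}")
        el.text = text
        return el

    root = ET.Element(f"{{{TEI_NS}}}TEI")
    header = tei(root, "teiHeader")
    file_desc = tei(header, "fileDesc")
    tei(tei(file_desc, "titleStmt"), "title", doc.get("title"))
    analytic = tei(tei(tei(file_desc, "sourceDesc"), "biblStruct"), "analytic")
    for name in doc.get("authors", []):
        tei(tei(analytic, "author"), "persName", name)
    if doc.get("abstract"):
        tei(tei(tei(header, "profileDesc"), "abstract"), "p", doc["abstract"])
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    common = marc_to_json(SAMPLE_MARC)
    print(json.dumps(common, indent=2))
    print(json_to_tei(common))
```

The point of the intermediate dictionary is that each converter only has to agree on the JSON keys, so the MARC side and the TEI side can change independently. A direct MARC XML -> TEI XML converter would simply merge the two functions and skip that shared structure.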

Also, what exactly do we want to train?