BLEU compares n-grams (contiguous sequences of n words) between a candidate translation (output by an MT system) and one or more human reference translations. It relies on exact string matching: if the candidate says “The cat sits” and the reference says “The cat sit,” the score drops, because “sits” and “sit” count as different tokens.
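To make the exact-matching behaviour concrete, here is a minimal sketch of the clipped n-gram precision that BLEU builds on. It is not full BLEU (there is no brevity penalty and no geometric mean over n-gram orders), it assumes plain whitespace tokenization, and the function names are illustrative only.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of candidate against a single reference."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return overlap / total if total else 0.0


# "sits" vs "sit" is a mismatch under exact string matching.
print(modified_precision("The cat sits", "The cat sit", 1))  # 2 of 3 unigrams match
print(modified_precision("The cat sits", "The cat sit", 2))  # 1 of 2 bigrams match
```

A single differing word costs one unigram match and every higher-order n-gram that contains it, which is why small extraction errors in a PDF can move the score noticeably.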
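BLEU requires sentence-level or document-level alignment between candidate and reference. For PDF work, this means the text extracted from both sides must be segmented the same way before scoring.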
Not all PDF extractors are equal. For BLEU evaluation, you need layout-aware extraction.
| Tool | Best for | Handling of BLEU-sensitive elements |
|------|----------|--------------------------------------|
| Adobe Acrobat Pro (Export to Word) | Small documents with complex layouts | Good for columns, poor for hyphenation |
| pdfplumber (Python) | Programmatic, multilingual text | Excellent; can detect line breaks and table structures |
| Tesseract + OCR (for scanned PDFs) | Image-based PDFs | Required but introduces OCR errors |
| Grobid | Scientific papers (double columns) | Superior for multi-column text ordering |
Recommendation for BLEU work: Use pdfplumber for digital PDFs. For scanned PDFs, apply OCR cleanup.
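As a minimal sketch of the pdfplumber route, assuming a digital PDF with a text layer: `extract_pdf` and `page_texts` are illustrative names, and the file name in the usage note is a placeholder. pdfplumber is a third-party package, so it is imported inside the function that needs it.

```python
def page_texts(pages):
    """Return one string per page; a page with no extractable text yields ''."""
    return [(page.extract_text() or "") for page in pages]


def extract_pdf(pdf_path):
    """Open a PDF with pdfplumber and return its text, one string per page."""
    import pdfplumber  # third-party: pip install pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        return page_texts(pdf.pages)
```

Calling `extract_pdf("reference.pdf")` would give a list of page strings. Pages that come back empty are often scanned images, which is a hint that the OCR path is needed instead.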
If you prefer a terminal-based approach:
Example:
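```bash
# Extract text while preserving the physical layout
pdftotext -layout reference.pdf ref_raw.txt
pdftotext -layout candidate.pdf cand_raw.txt
# clean_pdf.sh is your own normalization script
./clean_pdf.sh ref_raw.txt > ref_clean.txt
./clean_pdf.sh cand_raw.txt > cand_clean.txt
# --tokenize zh is for Chinese text; omit it to use sacrebleu's default tokenizer
cat cand_clean.txt | sacrebleu ref_clean.txt --tokenize zh
```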
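sacrebleu pairs line i of the candidate with line i of the reference, so it can help to confirm that the two cleaned files line up before scoring. The following is a minimal sketch of such a check. `check_alignment` is a hypothetical helper, not part of sacrebleu.

```python
def check_alignment(ref_lines, cand_lines):
    """Return a list of human-readable alignment problems (empty if none found)."""
    ref = [line.strip() for line in ref_lines]
    cand = [line.strip() for line in cand_lines]
    problems = []
    if len(ref) != len(cand):
        problems.append(
            f"line count differs: {len(ref)} reference vs {len(cand)} candidate"
        )
    # An empty segment on only one side usually means the segmentation drifted.
    for i, (r, c) in enumerate(zip(ref, cand), start=1):
        if bool(r) != bool(c):
            problems.append(f"line {i}: one side is empty")
    return problems
```

For the files above, this could be called as `check_alignment(open("ref_clean.txt", encoding="utf-8"), open("cand_clean.txt", encoding="utf-8"))`. An empty result only says the line counts agree, not that each pair is a true translation pair.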
Page boundaries are arbitrary for BLEU. Concatenate all extracted text from the PDF into a single string, then segment by punctuation. This avoids penalizing valid line breaks.
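A minimal sketch of this concatenate-then-segment step follows. It assumes that sentence-final punctuation is a good enough boundary: `.`, `!` or `?` followed by whitespace, or the CJK marks `。！？` with or without whitespace. Abbreviations such as "e.g. this" will be split incorrectly, and `segment_pages` is an illustrative name.

```python
import re

# Western marks need trailing whitespace (so "3.14" stays intact); CJK marks do not.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+|(?<=[。！？])\s*")


def segment_pages(pages):
    """Join per-page text into one string, then split it into sentences."""
    text = " ".join(p for p in pages if p)
    # Line and page breaks become ordinary spaces, so they no longer split sentences.
    text = re.sub(r"\s+", " ", text).strip()
    return [s for s in _BOUNDARY.split(text) if s]


print(segment_pages(["The cat sits on the", "mat. The dog barks."]))
```

The sentence that straddles the page break comes out as a single segment. Apply the same function to both reference and candidate so the two sides are segmented by the same rule.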
After extraction, you must normalize the text to match the reference format. Write a script to:
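A normalization script could look like the minimal sketch below. The steps are an assumption about what typically goes wrong in `pdftotext` output (ligatures and full-width characters, form feeds at page breaks, end-of-line hyphenation, irregular spacing), not a fixed recipe, and `normalize_extracted` is an illustrative name.

```python
import re
import unicodedata


def normalize_extracted(text):
    """Apply common cleanup steps to raw text extracted from a PDF."""
    # Fold compatibility characters (e.g. the "fi" ligature) into plain equivalents.
    text = unicodedata.normalize("NFKC", text)
    # pdftotext marks page breaks with a form feed; treat it as a line break.
    text = text.replace("\f", "\n")
    # Rejoin words hyphenated across a line break ("trans-\nlation").
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs, then trim spaces around line breaks.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r" ?\n ?", "\n", text)
    return text.strip()
```

The de-hyphenation rule is deliberately naive: it also joins genuine compounds that happen to break at a line end ("well-\nknown" becomes "wellknown"). Whatever rules you choose, run the same script over reference and candidate so both sides are transformed identically.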