Benchmarking LLMs for Safety Data Sheet…

Imagine trying to assemble a complex IKEA wardrobe, but instead of one manual, you have fifty different versions written by fifty different people who all have wildly different ideas of what a “step” is. Some use bullet points, some use narrative prose, and some just put a picture of a screw and hope you figure it out. This is essentially the nightmare of dealing with Safety Data Sheets (SDS). They are the legally required paperwork of the chemical world, meant to tell you if a substance will melt your floor or blow up your lab, but they are formatted with a chaotic energy that would make a postmodern poet blush.

The recent paper on Benchmarking Large Language Models for Safety Data Extraction tackles this exact mess. The goal is to see if LLMs can actually pull structured information out of these documents better than the old-school, rule-based systems. For those who have spent any time in the trenches of industrial data, “rule-based” is a polite way of saying “a ten-thousand-line regex file that breaks every time a vendor changes their font to Arial.” The study basically confirms what we suspected: LLMs are significantly better at handling the heterogeneity of these documents because they don’t panic when a table is slightly offset or a section header is misspelled.
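To make the brittleness concrete, here is a minimal sketch of a rule-based extractor. The rule, the function name, and both vendor strings are made up for illustration; they are not taken from the paper.

```python
import re

# Hypothetical rule written against one vendor's layout: "Flash point: 23 °C".
FLASH_POINT_RULE = re.compile(r"Flash point:\s*(-?\d+(?:\.\d+)?)\s*°C")

def extract_flash_point(text):
    """Return the flash point in °C as a float, or None if the rule does not match."""
    match = FLASH_POINT_RULE.search(text)
    return float(match.group(1)) if match else None

# Same fact, two (invented) vendor layouts.
vendor_a = "Flash point: 23 °C (closed cup)"
vendor_b = "Flashpoint (closed cup) ....... 23 deg C"

print(extract_flash_point(vendor_a))  # 23.0
print(extract_flash_point(vendor_b))  # None: the rule silently misses the value
```

The second result is the dangerous one: nothing crashes, the field is simply empty, and every new layout means another pattern in the file.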

But here is the problem that the benchmarking community loves to ignore: the “PDF tax.” You can have the most capable model on the planet, but if your ingestion pipeline turns a multi-column PDF into a linear string of gibberish, the model is just guessing from garbage input (and PDFs, a crime against humanity, produce plenty of it). Why do we keep pretending that the model’s reasoning capability is the bottleneck when the real struggle is simply getting the text into the prompt without losing the relationship between a chemical property and its value? If the text extraction or OCR trips on a line break, the “structured extraction” becomes a guessing game.
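A toy illustration of how a property gets separated from its value. It assumes a layout tool has already produced text fragments with page coordinates; the Word type, the coordinates, and the fixed column split are all hypothetical, and real pages need proper column detection.

```python
from typing import NamedTuple

class Word(NamedTuple):
    text: str
    x: float  # left edge of the fragment
    y: float  # top edge, growing downward

def naive_order(words):
    """Read strictly row by row across the whole page, ignoring columns."""
    return " ".join(w.text for w in sorted(words, key=lambda w: (w.y, w.x)))

def column_aware_order(words, column_split):
    """Read the left column top to bottom, then the right column."""
    left = sorted((w for w in words if w.x < column_split), key=lambda w: (w.y, w.x))
    right = sorted((w for w in words if w.x >= column_split), key=lambda w: (w.y, w.x))
    return " ".join(w.text for w in left + right)

# Invented two-column page: each property name sits above its value.
page = [
    Word("Flash point", 50, 100), Word("Boiling point", 320, 100),
    Word("23 °C", 50, 112), Word("78 °C", 320, 112),
]

print(naive_order(page))              # Flash point Boiling point 23 °C 78 °C
print(column_aware_order(page, 300))  # Flash point 23 °C Boiling point 78 °C
```

In the naive output nothing says which temperature belongs to which property, and no amount of model reasoning can reliably put that back.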

The move from brittle rules to probabilistic LLMs is a necessary evil, but it introduces a new kind of friction. In a safety context, a 95% accuracy rate is a failure. If a model misses a “highly flammable” warning because it was buried in a weirdly formatted footer, the result isn’t a bug report—it’s a fire. This is where the cost of ownership hits. To get that last 5% of accuracy, you aren’t just paying for tokens; you’re paying for a human-in-the-loop verification system that probably costs more than the LLM implementation itself. Plus, the latency of running a massive context window just to find one flash-point value is a hard pill to swallow for real-time industrial applications.
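A back-of-the-envelope sketch of why 95% is not good enough, plus one cheap way to decide what a human has to look at. The 20-field document, the independence of errors, the field names, and the review rule are all assumptions made for illustration, not figures or methods from the paper.

```python
def all_fields_correct_probability(per_field_accuracy, n_fields):
    """Chance that every field in one document is right, assuming independent errors."""
    return per_field_accuracy ** n_fields

# Hypothetical set of fields where a miss is a safety problem, not a typo.
SAFETY_CRITICAL = {"flash_point", "hazard_statements", "signal_word"}

def fields_for_review(extracted, source_text):
    """Flag safety-critical fields, plus any value not found verbatim in the source."""
    flagged = []
    for name, value in extracted.items():
        if name in SAFETY_CRITICAL or str(value) not in source_text:
            flagged.append(name)
    return sorted(flagged)

# With 20 fields at 95% each, only about a third of documents come out fully correct.
p_clean = all_fields_correct_probability(0.95, 20)
print(round(p_clean, 3))

source = "Product name: Acetone. Flash point: 23 °C."
extracted = {"product_name": "Acetone", "flash_point": "23 °C", "supplier": "Acme"}
print(fields_for_review(extracted, source))  # ['flash_point', 'supplier']
```

Under this rule every safety-critical field gets a human check no matter how confident the model looks, which is exactly why the verification layer can end up costing more than the extraction itself.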

I suspect we are currently in the “brute force” phase of this problem. We are throwing general-purpose models at niche industrial documents and acting surprised when they struggle with the specific jargon of chemical safety. It is like using a Swiss Army knife to perform surgery: it technically has the tools, but it’s not the right instrument. My bet is that by Q1 of next year, we will see the emergence of a specialized, open-source “Safety-LLM” or a highly tuned adapter that outperforms GPT-4 on SDS extraction by at least 15%, because it has been trained on the actual visual layout of these documents, not just the cleaned-up text.

The tools are better, but the documents are still a disaster.

Until then, we are just benchmarking how well a model can guess the contents of a digital scavenger hunt. We can celebrate the benchmark numbers, but the real win happens when we stop treating the PDF as a text file and start treating it as a visual map.
