Unstructured data doesn't have to be this painful

Working with unstructured data is painful. Have you tried to get Claude to extract data from a 200-page PDF? Have you tried to do it at scale? Have you worried that you’re not technically allowed to feed that document to Claude in the first place?

I’ve been in this situation myself, and I’ve talked to plenty of people in the exact same one. A finance team that needs a pipeline to pull line items out of invoices. Quant traders who want to analyse thousands of SEC filings. Customer service centres that need to read call transcripts to understand why people churn.

Most end up in one of two places. They don’t do it, because it’s too painful. Or they build something convoluted to make it happen: they export the files to blob storage, stand up a Python service they now have to run, wire in an OCR library or a vision API to parse the docs, write chunking logic, push it into a vector database, schedule the whole thing in an orchestrator, add retries for when the rate limits hit. Then a week of security review explaining why the customer contracts are being sent to some outside endpoint. New file type? Start over.

Take something as ordinary as an invoice.

A sample vendor invoice: vendor name, invoice number, dates, line items, total due.

Vendor, invoice number, dates, line items, a total. Obvious to you. To a pipeline, every one of those is a field you have to find, label, and pull out, across hundreds of layouts that never quite match.

This is the problem Snowflake fixes with AI functions.

Snowflake has a set of AI functions, called AISQL, that do the parsing and extraction for you, in plain SQL, right where your data already lives. AI_PARSE_DOCUMENT turns a PDF into clean text and layout. AI_EXTRACT pulls named fields out of it. AI_TRANSCRIBE handles audio and video, AI_CLASSIFY labels, AI_SENTIMENT scores, and AI_EMBED with Cortex Search lets you ask questions over the results. They’re functions you call in a SELECT. The files never leave Snowflake, and everything runs under the access controls you already have. That quietly answers the “am I even allowed to send this to a model” question.
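To make "functions you call in a SELECT" concrete, here is a minimal sketch over call transcripts. The table call_transcripts, its columns, and the category labels are all hypothetical, and the output shape I drill into for AI_CLASSIFY (a "labels" array) is how I understand it to return results, so check it against the docs for your account.

```sql
-- Sketch only: call_transcripts, transcript_id and transcript_text are hypothetical names.
SELECT
    transcript_id,
    -- AI_CLASSIFY is assumed to return an object with a "labels" array; take the first label.
    AI_CLASSIFY(
        transcript_text,
        ['billing', 'product quality', 'support experience']
    ):labels[0]::STRING AS churn_reason,
    -- Keep the raw AI_SENTIMENT result rather than guessing at its nested structure.
    AI_SENTIMENT(transcript_text) AS sentiment_result
FROM call_transcripts;
```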

Step one is parsing. AI_PARSE_DOCUMENT reads every file in the stage and turns it into clean text. LAYOUT mode keeps the tables intact, so a line item grid doesn’t collapse into a wall of words.

AI_PARSE_DOCUMENT running over a stage of invoice PDFs, with a preview of the parsed text on the right.

That’s the whole parse step. One SELECT over a folder of PDFs, and on the right you can see the result: the invoice came back as readable text, total and all. No OCR library, no Python service, no files leaving Snowflake.
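If you can't see the screenshot, the parse step looks roughly like this. It's a sketch, not the exact query from the demo: the stage name @invoice_stage is hypothetical, and I'm assuming the stage has a directory table enabled so DIRECTORY() can list the files.

```sql
-- Sketch only: @invoice_stage is a hypothetical stage with a directory table enabled.
SELECT
    relative_path,
    -- LAYOUT mode is the option that preserves tables; the parsed text is assumed
    -- to come back under the "content" key of the returned JSON.
    AI_PARSE_DOCUMENT(
        TO_FILE('@invoice_stage', relative_path),
        {'mode': 'LAYOUT'}
    ):content::STRING AS parsed_text
FROM DIRECTORY(@invoice_stage)
WHERE relative_path ILIKE '%.pdf';
```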

Then you extract the fields you actually want. Vendor, date, total, named in a single call and returned as one JSON value per invoice that you can split into columns:

SELECT AI_EXTRACT(file, ['vendor', 'invoice_date', 'total_amount']) AS fields
FROM my_invoices;
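The query above gives you one value per invoice in the fields column. Here is one way to turn that into typed columns. It's a sketch under a few assumptions: the named-argument form and the "response" wrapper reflect my reading of the AI_EXTRACT docs, and the questions I attach to each field are purely illustrative.

```sql
-- Sketch only: my_invoices and its "file" column come from the query above;
-- the questions and the "response" key are assumptions to verify against the docs.
WITH extracted AS (
    SELECT
        AI_EXTRACT(
            file => file,
            responseFormat => {
                'vendor': 'What is the vendor name?',
                'invoice_date': 'What is the invoice date?',
                'total_amount': 'What is the total amount due?'
            }
        ) AS fields
    FROM my_invoices
)
SELECT
    fields:response:vendor::STRING                    AS vendor,
    -- TRY_TO_DATE returns NULL instead of failing on dates it cannot parse.
    TRY_TO_DATE(fields:response:invoice_date::STRING) AS invoice_date,
    -- Left as text: totals may carry currency symbols or thousands separators.
    fields:response:total_amount::STRING              AS total_amount
FROM extracted;
```

Naming each field with an explicit question makes the output keys predictable, which keeps the column paths in the outer SELECT stable.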

No template, no rules per vendor, no service to babysit. A folder of messy PDFs becomes a clean table you can query:

The INVOICE_EXTRACTIONS table: one row per invoice, with columns for vendor, invoice number, date, PO number, total due, and line items.

One row per invoice. Vendor, number, date, PO, total, and the line items pulled out as structured JSON. Notice it handles the messy ones too: a scanned .jpg, a smudged file, a German invoice in euros. That’s now a table you can join, sum, and report on like any other.
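As a rough illustration of "join, sum, and report", here are two queries against that table. The column names (vendor, invoice_number, total_due, line_items) and the keys inside each line item (description, amount) are my guesses at the schema, not something the screenshot guarantees, so adjust them to match your table.

```sql
-- Sketch only: column names and line-item keys below are hypothetical.

-- Spend per vendor. With mixed currencies (like the euro invoice),
-- you would want to group by currency or convert before summing.
SELECT
    vendor,
    COUNT(*)       AS invoice_count,
    SUM(total_due) AS total_billed
FROM invoice_extractions
GROUP BY vendor
ORDER BY total_billed DESC;

-- One row per line item, assuming line_items is a VARIANT array of objects.
SELECT
    i.invoice_number,
    li.value:description::STRING   AS item,
    li.value:amount::NUMBER(12, 2) AS amount
FROM invoice_extractions AS i,
     LATERAL FLATTEN(input => i.line_items) AS li;
```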

The tool I use to build these is Cortex Code, Snowflake’s coding agent. I point it at the folder, tell it what I want out of it, and it writes the SQL with me: staging the files, choosing the functions, building the table. I still read every line it produces. I just don’t start from an empty file.

Next week I’m running a live demo on building unstructured data pipelines with Cortex Code. We’ll go end to end (parse, extract, operationalize) and build three pipelines from scratch: invoice processing, call center audio analytics, and contract intelligence you can ask questions of in plain English.

It’s free and live on June 30. If that sounds like your kind of thing, come hang out.