How AI Revolutionized My Hreflang XML Sitemap Creation

[Image: Abstract digital representation of data flow from URLs to organized code blocks.]

I’ve witnessed AI tools become indispensable in automating complex processes that traditionally demanded a lot of manual effort. However, I’ve also seen them used without any real benefit just because they are available.

That’s why I prefer focusing on AI applications that save time and address genuine challenges.

Recently, I was tasked with aligning the SEO architecture for over a dozen websites across three separate businesses, eight regional domains, and numerous languages, including three English dialects, Italian, Japanese, Spanish, Thai, French, and Korean.

Mapping thousands of URLs to create seamless hreflang XML sitemaps traditionally required specialized software or extensive spreadsheet work. Instead, I used Google Gemini to develop a custom Python script to handle the heavy lifting.

Here’s how an initial prompt evolved into a fully customized automation tool and what it taught me about utilizing AI for technical SEO.

Where AI Delivers the Most Value

I leverage AI primarily for practical, time-saving tasks, including:

  • Generating regex patterns when I need quick solutions without researching syntax from scratch.
  • Creating complex spreadsheet formulas for reporting workflows that depend on manual data exports.
  • Speeding up research and planning for projects requiring competitive analysis across business lines.
  • Building custom automation tools for recurring SEO and data-processing tasks.

The hreflang project I discuss here fits perfectly into the last category.

Mapping hreflang at Scale

The challenge was straightforward: accurately map thousands of URLs across multiple multilingual websites into cohesive hreflang XML sitemaps.

I chose not to tackle this manually. Instead, Google Gemini helped me build a custom Python solution.

Here’s a walkthrough of how the process unfolded.

Phase 1: Asking for an Approach, Not Just a Script

One common pitfall of using generative AI for coding is asking it to sprint before understanding the course. Typing, “Write a Python script to create an hreflang sitemap,” will yield generic code prone to breaking with real-world data.

Instead, I started by asking for an approach. I detailed the scenario: multiple regional domains, organic growth over several years leading to mismatched URL slugs, translated subfolders, and appended revision years.

Gemini suggested a multi-step, data-driven approach:

  • Crawl the websites to collect live URLs and their metadata.
  • Use Python in Google Colab to process the raw data.
  • Run an exact match cluster to group identical slugs.
  • Use a semantic embedding model (for example, via the SentenceTransformers library) to fuzzy match translated pages based on their titles and normalized URLs.
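The exact-match step in the list above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the script Gemini produced: the domains, the `path_key` helper, and the rule of keeping only groups that span two or more hosts are all assumptions.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def path_key(url):
    """Normalize a URL to its lowercased path without leading/trailing slashes."""
    return urlsplit(url).path.strip("/").lower()


def exact_match_clusters(urls):
    """Group URLs that share an identical normalized path on different hosts."""
    groups = defaultdict(list)
    for url in urls:
        groups[path_key(url)].append(url)
    # Keep only groups that span at least two hosts (assumed rule).
    return {
        key: members
        for key, members in groups.items()
        if len({urlsplit(u).netloc for u in members}) > 1
    }


# Hypothetical URLs, for illustration only.
urls = [
    "https://example.com/blog/widget-guide/",
    "https://example.co.uk/blog/widget-guide",
    "https://example.it/blog/guida-ai-widget/",
]
clusters = exact_match_clusters(urls)
```

The Italian URL stays unmatched here because its slug is translated, which is the kind of page the fuzzy-matching step is meant to pick up.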

Phase 2: Crawling and Data Collection

Following the recommended strategy, I crawled every regional website and exported a unified CSV file with live URLs, status codes, title tags, and H1s. Screaming Frog proved ideal for this task.

The quality of AI output relates directly to the quality of your crawl data, so make sure it’s robust.

An AI script can miss an obvious “exact match” if a target URL is a 404 or a 301 redirect. Before feeding data into the script, filter your CSV to include only indexable content.
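As a rough sketch of that filtering step, the snippet below keeps only rows that return a 200 and are flagged as indexable. The column names (`Address`, `Status Code`, `Indexability`) are assumptions modeled on a typical crawler export, so check them against your own file; the sample rows are invented.

```python
import csv
import io

# Invented sample data standing in for a crawl export.
SAMPLE_CSV = """Address,Status Code,Indexability,Title 1
https://example.com/a/,200,Indexable,Page A
https://example.com/old/,301,Non-Indexable,Old page
https://example.com/gone/,404,Non-Indexable,Not found
"""


def indexable_rows(csv_file):
    """Return rows with a 200 status that the crawler marked as indexable."""
    reader = csv.DictReader(csv_file)
    return [
        row
        for row in reader
        if row["Status Code"].strip() == "200"
        and row["Indexability"].strip().lower() == "indexable"
    ]


rows = indexable_rows(io.StringIO(SAMPLE_CSV))
```

With a real export, you would pass an opened file (for example, `open("crawl.csv", newline="", encoding="utf-8")`) instead of the in-memory sample.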


Phase 3: The Google Colab Sandbox

Google Colab offers a free, cloud-based Jupyter notebook environment for coding, bypassing local installations or environment variable issues. I used Google Drive to access it. The free version sufficed for this project.

After uploading the CSV to Colab, Gemini provided an initial Python script that utilized a domain-mapping routine to assign language codes, clean the URLs, and generate an XML tree. The initial results required refinement.
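To give a sense of what a domain-mapping routine and XML tree might look like, here is a minimal sketch using Python's built-in `xml.etree.ElementTree`. It is not the script Gemini wrote: the `DOMAIN_LANG` mapping, the domains, and the `build_sitemap` function are hypothetical, and the input is assumed to be a list of already-matched URL clusters.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical domain-to-hreflang mapping.
DOMAIN_LANG = {
    "example.com": "en-US",
    "example.co.uk": "en-GB",
    "example.it": "it-IT",
}


def build_sitemap(clusters):
    """Emit one <url> per page, each listing every alternate in its cluster."""
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("xhtml", XHTML_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for cluster in clusters:
        for url in cluster:
            node = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
            ET.SubElement(node, f"{{{SITEMAP_NS}}}loc").text = url
            # Each page references all alternates, including itself.
            for alt in cluster:
                ET.SubElement(node, f"{{{XHTML_NS}}}link", {
                    "rel": "alternate",
                    "hreflang": DOMAIN_LANG[urlsplit(alt).netloc],
                    "href": alt,
                })
    return ET.tostring(urlset, encoding="unicode")


xml_text = build_sitemap([["https://example.com/a/", "https://example.it/a/"]])
```

Listing every alternate on every page, including a self-reference, keeps the annotations reciprocal, which hreflang requires.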

Phase 4: The Iteration (Where the Real Work Happens)

If you expect AI to produce a flawless script on the first try, you’ll be disappointed. Like an intern, AI requires oversight. The true value lies in iteration.

After I ran the initial script, several URLs remained unmatched, leaving orphaned pages instead of groups that included their international counterparts. Here’s how I iteratively guided the AI through the complexities of human-managed websites.

The Directory Flattening Problem

The U.S. site had recently reorganized its blog into topical folders, unlike the Mexican and Italian sites. I presented these mismatches to Gemini, leading to a script adjustment that flattened directories, allowing slugs to align.
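One simple way to flatten directories is to compare only the final path segment. The sketch below assumes that approach; the `flattened_slug` helper and both URLs are hypothetical, and the actual adjustment Gemini made may have differed.

```python
from urllib.parse import urlsplit


def flattened_slug(url):
    """Return the last non-empty path segment, ignoring parent folders."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    return segments[-1].lower() if segments else ""


# Hypothetical URLs: the U.S. post sits in a topical folder, the Mexican one does not.
us_url = "https://example.com/blog/manufacturing/widget-guide/"
mx_url = "https://example.mx/blog/widget-guide/"

paths_differ = urlsplit(us_url).path != urlsplit(mx_url).path
slugs_match = flattened_slug(us_url) == flattened_slug(mx_url)
```

The trade-off is that two unrelated pages with the same slug in different folders would also collide, so flattened matches are worth spot-checking.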

The Aggressive Semantic Trap

The concept traps we implemented were initially too strict. A UK article about manufacturing wouldn’t match its Italian counterpart because its title differed slightly. Loosening the traps for general industry topics while still enforcing them for critical terms produced better matches.
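The article doesn't show how the traps were coded, so the following is only one possible reading: a veto that blocks a match when a critical concept appears in one title but not the other, while general terms are ignored. The `CRITICAL_CONCEPTS` list, its terms, and the function names are all invented for illustration.

```python
# Hypothetical critical concepts with per-language variants.
CRITICAL_CONCEPTS = {
    "warranty": {"warranty", "garanzia", "garantía"},
    "recall": {"recall", "richiamo", "retiro"},
}


def concepts_in(title):
    """Return the critical concepts whose variants appear as words in the title."""
    words = set(title.lower().split())
    return {name for name, variants in CRITICAL_CONCEPTS.items() if words & variants}


def passes_concept_trap(title_a, title_b):
    """Allow a match only if both titles mention the same critical concepts."""
    return concepts_in(title_a) == concepts_in(title_b)


# General topic with no critical term: the trap stays out of the way.
general_ok = passes_concept_trap("Manufacturing trends", "Tendenze della produzione")
# Different critical concepts: the trap blocks the match.
critical_blocked = not passes_concept_trap("Warranty policy", "Politica di richiamo")
```

In a fuller pipeline, a check like this would sit on top of a similarity score rather than replace it.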

The Translated Slug Epiphany

The pivotal insight came while I was examining orphaned Mexican blog posts. A Spanish URL, /detras-de-escenas-historias..., was clearly the counterpart of the English /behind-the-scenes-stories..., yet the script hadn’t paired them. I pointed this out to Gemini, which updated the script to build a “Combined Semantic Signature” that translates slugs dynamically and bridges the language gap.
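To illustrate the idea of a combined signature, the sketch below merges slug and title tokens, maps them through a tiny hand-written glossary, and compares the results with Jaccard overlap. These are stand-ins chosen to keep the example self-contained: the real script translated slugs dynamically and used a semantic model, and the `GLOSSARY`, `STOPWORDS`, and function names here are assumptions. Only the visible portion of the slugs from the example above is used.

```python
# Hypothetical stand-ins for a real translation step and stopword list.
GLOSSARY = {"detras": "behind", "escenas": "scenes", "historias": "stories"}
STOPWORDS = {"de", "of", "the"}


def signature(slug, title=""):
    """Combine slug and title tokens into one translated, stopword-free set."""
    tokens = slug.lower().split("-") + title.lower().split()
    return {GLOSSARY.get(t, t) for t in tokens if t and t not in STOPWORDS}


def jaccard(a, b):
    """Share of tokens two signatures have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


es_slug = "detras-de-escenas-historias"
en_slug = "behind-the-scenes-stories"

raw_score = jaccard(set(es_slug.split("-")), set(en_slug.split("-")))
translated_score = jaccard(signature(es_slug), signature(en_slug))
```

Without translation the two slugs share no tokens, so `raw_score` is zero; after translation their signatures overlap completely.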


Lessons from Building an AI-Assisted SEO Tool

This project reinforced a simple truth: AI excels as a collaborator rather than a shortcut.

  • Be the strategist, let AI be the coder: Rather than demanding a finished product, discuss architecture and logic first, treating AI as a junior developer needing guidance.
  • Provide concrete examples: Don’t simply state, “It’s broken.” Give specific failed URL examples or mismatches to help AI refine its logic.
  • Embrace the iterative loop: Run the code, identify issues, and iterate. Each pass helps the tool handle more real-world edge cases.
  • Leverage Google Colab: You don’t need to be a Python guru to apply Python in SEO. Colab bridges the gap, providing access to complex data science libraries in your browser.

In the end, I had a fully customized Python script capable of processing a massive CSV to generate a cross-referenced hreflang XML sitemap in minutes.

Though AI isn’t replacing technical SEOs, those who collaborate with AI to build scalable tools will have a significant edge.



Inspired by this post on Search Engine Land.



FAQs

What is this post about?

It explains how the author used AI, Python, and Google Colab to automate hreflang XML sitemap creation. It also covers iterative AI development and its impact on technical SEO.

Which tools and technologies are highlighted?

Google Gemini was used to develop a Python script in Google Colab, and Screaming Frog was used for crawling. The post emphasizes using AI to handle practical, time-saving tasks.

What are the main phases of the workflow?

It outlines four phases: Phase 1 focuses on asking for an approach rather than a finished script, Phase 2 covers crawling and data collection, Phase 3 describes the Google Colab sandbox, and Phase 4 covers iterative refinement.

What is the core takeaway about AI's role?

AI is a collaborator, not a shortcut. The post emphasizes guiding AI with architecture and concrete examples, and iterating for better results.

What was the final outcome?

A fully customized Python script capable of processing a massive CSV to generate a cross-referenced hreflang XML sitemap in minutes.

What is the 'Translated Slug Epiphany'?

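The ‘Translated Slug Epiphany’ is the realization that some orphaned pages were counterparts whose slugs had simply been translated (for example, Spanish and English versions of the same post). Translating slugs as part of a “Combined Semantic Signature” let the script match them.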
