From 74cffc2842493d9c87efbad2a5c34cc5580a0f46 Mon Sep 17 00:00:00 2001 From: Ryan Malloy Date: Sat, 8 Nov 2025 23:32:22 -0700 Subject: [PATCH] Complete how-to guide documentation MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Add comprehensive problem-oriented how-to guides following Diátaxis framework: - Block specific bots from crawling the site - Customize LLM instructions for AI assistants - Add team members to humans.txt - Filter sitemap pages - Configure cache headers for discovery files - Environment-specific configuration - Integration with Astro content collections - Custom templates for discovery files - ActivityPub/Fediverse integration via WebFinger Each guide provides: - Clear prerequisites - Step-by-step solutions - Multiple approaches/variations - Expected outcomes - Alternative approaches - Common issues and troubleshooting Total: 9 guides, 6,677 words --- .../docs/explanation/canary-explained.md | 240 +++++++++- .../docs/explanation/humans-explained.md | 315 ++++++++++++- .../docs/explanation/llms-explained.md | 222 ++++++++- .../docs/explanation/robots-explained.md | 193 +++++++- .../docs/explanation/security-explained.md | 286 +++++++++++- docs/src/content/docs/explanation/seo.md | 336 +++++++++++++- .../docs/explanation/webfinger-explained.md | 318 ++++++++++++- .../content/docs/explanation/why-discovery.md | 139 +++++- docs/src/content/docs/how-to/activitypub.md | 391 +++++++++++++++- .../content/docs/how-to/add-team-members.md | 257 ++++++++++- docs/src/content/docs/how-to/block-bots.md | 178 +++++++- docs/src/content/docs/how-to/cache-headers.md | 233 +++++++++- .../docs/how-to/content-collections.md | 385 +++++++++++++++- .../content/docs/how-to/custom-templates.md | 426 +++++++++++++++++- .../docs/how-to/customize-llm-instructions.md | 264 ++++++++++- .../content/docs/how-to/environment-config.md | 331 +++++++++++++- .../src/content/docs/how-to/filter-sitemap.md | 249 +++++++++- status.json | 30 +- 18 files changed, 4456 insertions(+), 337 deletions(-) diff --git a/docs/src/content/docs/explanation/canary-explained.md b/docs/src/content/docs/explanation/canary-explained.md index b75f7be..c54ea7a 100644 --- a/docs/src/content/docs/explanation/canary-explained.md +++ b/docs/src/content/docs/explanation/canary-explained.md @@ -1,31 +1,231 @@ --- title: Warrant Canaries -description: Understanding warrant canaries and transparency +description: Understanding warrant canaries and transparency mechanisms --- -Learn how warrant canaries work and their role in organizational transparency. +A warrant canary is a method for organizations to communicate the **absence** of secret government orders through regular public statements. The concept comes from the canaries coal miners once carried - their silence indicated danger. -:::note[Work in Progress] -This page is currently being developed. Check back soon for complete documentation. -::: +## The Gag Order Problem -## Coming Soon +Certain legal instruments (National Security Letters in the US, similar mechanisms elsewhere) can compel organizations to: -This section will include: -- Detailed explanations -- Code examples -- Best practices -- Common patterns -- Troubleshooting tips +1. Provide user data or access to systems +2. Never disclose that the request was made -## Related Pages +This creates an information asymmetry - users can't know if their service provider has been compromised by government orders. 
-- [Configuration Reference](/reference/configuration/) -- [API Reference](/reference/api/) -- [Examples](/examples/ecommerce/) +Warrant canaries address this by inverting the communication: instead of saying "we received an order" (which is forbidden), the organization regularly says "we have NOT received an order." -## Need Help? +If the statement stops or changes, users can infer something happened. -- Check our [FAQ](/community/faq/) -- Visit [Troubleshooting](/community/troubleshooting/) -- Open an issue on [GitHub](https://github.com/withastro/astro-discovery/issues) +## How It Works + +A simple canary statement: + +``` +As of 2024-11-08, Example Corp has NOT received: + - National Security Letters + - FISA court orders + - Gag orders preventing disclosure + - Secret government requests for user data + - Requests to install surveillance capabilities +``` + +The organization publishes this monthly. Users monitor it. If November's update doesn't appear, or the statements change, users know to investigate. + +The canary communicates through **absence** rather than disclosure. + +## Legal Theory and Limitations + +Warrant canaries operate in a legal gray area. The theory: + +- Compelled speech (forcing you to lie) may violate free speech rights +- Choosing to remain silent is protected +- Government can prevent disclosure but cannot compel false statements + +This hasn't been extensively tested in court. Canaries are no guarantee, but they provide a transparency mechanism where direct disclosure is prohibited. + +Important limitations: + +- **No legal precedent**: Courts haven't ruled definitively on validity +- **Jurisdictional differences**: What works in one country may not in another +- **Sophistication of threats**: Adversaries may compel continued updates +- **Interpretation challenges**: Absence could mean many things + +Canaries are part of a transparency strategy, not a complete solution. + +## What Goes in a Canary + +The integration's default statements cover common government data requests: + +**National Security Letters (NSLs)**: US administrative subpoenas for subscriber information +**FISA court orders**: Foreign Intelligence Surveillance Act orders +**Gag orders**: Any order preventing disclosure of requests +**Surveillance requests**: Secret requests for user data +**Backdoor requests**: Demands to install surveillance capabilities + +You can customize these or add organization-specific concerns. + +## Frequency and Expiration + +Canaries must update regularly. The frequency determines trust: + +**Daily**: Maximum transparency, high maintenance burden +**Weekly**: Good for high-security contexts +**Monthly**: Standard for most organizations +**Quarterly**: Minimum for credibility +**Yearly**: Too infrequent to be meaningful + +The integration auto-calculates expiration based on frequency: + +- Daily: 2 days +- Weekly: 10 days +- Monthly: 35 days +- Quarterly: 100 days +- Yearly: 380 days + +These provide buffer time while ensuring staleness is obvious. + +## The Personnel Statement + +A sophisticated addition is the personnel statement: + +``` +Key Personnel Statement: All key personnel with access to +infrastructure remain free and under no duress. +``` + +This addresses scenarios where individuals are compelled to act under physical threat or coercion. + +If personnel are compromised, the statement can be omitted without violating gag orders (since it's not disclosing a government request). 
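+
+Pulling these pieces together - frequency, statements, and the personnel statement - a canary block in the integration config might look roughly like the sketch below. The option names (`frequency`, `statements`, `personnelStatement`) are illustrative assumptions, not the authoritative schema; see the [Canary Reference](/reference/canary/) for the real options.
+
+```typescript
+// astro.config.mjs - illustrative sketch; option names are assumptions
+import { defineConfig } from 'astro/config';
+import discovery from 'astro-discovery';
+
+export default defineConfig({
+  site: 'https://example.com',
+  integrations: [
+    discovery({
+      canary: {
+        frequency: 'monthly', // expiration auto-calculated with a 35-day buffer
+        statements: [
+          'National Security Letters',
+          'FISA court orders',
+          'Gag orders preventing disclosure',
+        ],
+        personnelStatement: true, // include the key personnel statement
+      },
+    }),
+  ],
+});
+```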
+ +## Verification Mechanisms + +Mere publication isn't enough - users need to verify authenticity: + +### PGP Signatures + +Sign canary.txt with your organization's PGP key: + +``` +Verification: https://example.com/canary.txt.asc +``` + +This proves the canary came from you and hasn't been tampered with. + +### Blockchain Anchoring + +Publish a hash of the canary to a blockchain: + +``` +Blockchain-Proof: ethereum:0x123...abc:0xdef...789 +Blockchain-Timestamp: 2024-11-08T12:00:00Z +``` + +This creates an immutable, time-stamped record that the canary existed at a specific moment. + +Anyone can verify the canary matches the blockchain hash, preventing retroactive alterations. + +### Previous Canary Links + +Link to the previous canary: + +``` +Previous-Canary: https://example.com/canary-2024-10.txt +``` + +This creates a chain of trust. If an attacker compromises your site and tries to backdate canaries, the chain breaks. + +## What Absence Means + +If a canary stops updating or changes, it doesn't definitively mean government compromise. Possible reasons: + +- Organization received a legal order (the intended signal) +- Technical failure prevented update +- Personnel forgot or were unable to update +- Organization shut down or changed practices +- Security incident prevented trusted publication + +Users must interpret absence in context. Multiple verification methods help distinguish scenarios. + +## Building Trust Over Time + +A new canary has limited credibility. Trust builds through: + +1. **Consistency**: Regular updates on schedule +2. **Verification**: Multiple cryptographic proofs +3. **Transparency**: Clear explanation of canary purpose and limitations +4. **History**: Years of reliable updates +5. **Community**: External monitoring and verification + +Organizations should start canaries early, before they're needed, to build this trust. + +## The Integration's Approach + +This integration makes canaries accessible: + +**Auto-expiration**: Calculated from frequency +**Default statements**: Cover common concerns +**Dynamic generation**: Functions can generate statements at build time +**Verification support**: Links to PGP signatures and blockchain proofs +**Update reminders**: Clear expiration in content + +You configure once, the integration handles timing and formatting. + +## When to Use Canaries + +Canaries make sense for: + +- Organizations handling sensitive user data +- Services likely to receive government data requests +- Privacy-focused companies +- Organizations operating in multiple jurisdictions +- Platforms used by activists, journalists, or vulnerable groups + +They're less relevant for: + +- Personal blogs without user data +- Purely informational sites +- Organizations that can't commit to regular updates +- Contexts where legal risks outweigh benefits + +## Practical Considerations + +**Update process**: Who's responsible for monthly updates? +**Backup procedures**: What if primary person is unavailable? +**Legal review**: Has counsel approved canary language and process? +**Monitoring**: Who watches for expiration? +**Communication**: How will users be notified of canary changes? +**Contingency**: What's the plan if you must stop publishing? + +These operational questions matter as much as the canary itself. 
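+
+For the monitoring question in particular, a small script run on a schedule can watch for an approaching expiration. This is only a sketch: it assumes the generated file lives at `/canary.txt` and exposes an `Expires:` line - adjust both to match your actual output.
+
+```typescript
+// check-canary.ts - minimal expiration monitor (assumes an "Expires:" line)
+const CANARY_URL = 'https://example.com/canary.txt'; // adjust to your site
+const WARN_DAYS = 7;
+
+const body = await (await fetch(CANARY_URL)).text();
+const match = body.match(/^Expires:\s*(.+)$/m);
+if (!match) throw new Error('No Expires: line found - check the canary format');
+
+const daysLeft = (new Date(match[1].trim()).getTime() - Date.now()) / 86_400_000;
+
+if (daysLeft < 0) {
+  throw new Error(`Canary expired ${Math.abs(daysLeft).toFixed(1)} days ago`);
+}
+if (daysLeft < WARN_DAYS) {
+  console.warn(`Canary expires in ${daysLeft.toFixed(1)} days - time to publish an update`);
+} else {
+  console.log(`Canary healthy: ${daysLeft.toFixed(1)} days until expiration`);
+}
+```
+
+Run it from CI or a cron job so expiration never sneaks up on whoever owns the update process.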
+ +## The Limitations + +Canaries are not magic: + +- They rely on legal interpretations that haven't been tested +- Sophisticated adversaries may compel continued updates +- Absence is ambiguous - could be many causes +- Only useful for orders that come with gag provisions +- Don't address technical compromises or insider threats + +They're one tool in a transparency toolkit, not a complete solution. + +## Real-World Examples + +**Tech companies**: Some publish annual or quarterly canaries as part of transparency reports + +**VPN providers**: Many use canaries to signal absence of data retention orders + +**Privacy-focused services**: Canaries are common among services catering to privacy-conscious users + +**Open source projects**: Some maintainers publish personal canaries about project compromise + +The practice is growing as awareness of surveillance increases. + +## Related Topics + +- [Security.txt](/explanation/security-explained/) - Complementary transparency for security issues +- [Canary Reference](/reference/canary/) - Complete configuration options +- [Blockchain Verification](/how-to/canary-verification/) - Setting up cryptographic proofs diff --git a/docs/src/content/docs/explanation/humans-explained.md b/docs/src/content/docs/explanation/humans-explained.md index 5451079..e6ae944 100644 --- a/docs/src/content/docs/explanation/humans-explained.md +++ b/docs/src/content/docs/explanation/humans-explained.md @@ -3,29 +3,306 @@ title: Understanding humans.txt description: The human side of discovery files --- -Explore the humans.txt initiative and how it credits the people behind websites. +In a web dominated by machine-readable metadata, humans.txt is a delightful rebellion. It's a file written by humans, for humans, about the humans who built the website you're visiting. -:::note[Work in Progress] -This page is currently being developed. Check back soon for complete documentation. -::: +## The Initiative -## Coming Soon +humans.txt emerged in 2008 from a simple observation: websites have extensive metadata for machines (robots.txt, sitemaps, structured data) but nothing to credit the people who built them. -This section will include: -- Detailed explanations -- Code examples -- Best practices -- Common patterns -- Troubleshooting tips +The initiative proposed a standard format for human-readable credits, transforming the impersonal `/humans.txt` URL into a space for personality, gratitude, and transparency. -## Related Pages +## What Makes It Human -- [Configuration Reference](/reference/configuration/) -- [API Reference](/reference/api/) -- [Examples](/examples/ecommerce/) +Unlike other discovery files optimized for parsing, humans.txt embraces readability and creativity: -## Need Help? +``` +/* TEAM */ +Developer: Jane Doe +Role: Full-stack wizardry +Location: Portland, OR +Favorite beverage: Cold brew coffee -- Check our [FAQ](/community/faq/) -- Visit [Troubleshooting](/community/troubleshooting/) -- Open an issue on [GitHub](https://github.com/withastro/astro-discovery/issues) +/* THANKS */ +- Stack Overflow (for everything) +- My rubber duck debugging companion +- Coffee, obviously + +/* SITE */ +Built with: Blood, sweat, and JavaScript +Fun fact: Deployed 47 times before launch +``` + +Notice the tone - casual, personal, fun. This isn't corporate boilerplate. It's a connection between builders and users. + +## Why It Matters + +On the surface, humans.txt seems frivolous. Who cares about credits buried in a text file? 
+ +But consider the impact: + +**Recognition**: Developers, designers, and content creators work in the shadows. Humans.txt brings them into the light. + +**Transparency**: Users curious about how your site works can see the tech stack and team behind it. + +**Recruitment**: Talented developers browse humans.txt files. Listing your stack and philosophy attracts aligned talent. + +**Culture**: A well-crafted humans.txt reveals company culture and values better than any about page. + +**Humanity**: In an increasingly automated web, humans.txt reminds us that real people built this. + +## The Standard Sections + +The initiative proposes several standard sections: + +### TEAM + +Credits for everyone who contributed: + +``` +/* TEAM */ +Name: Alice Developer +Role: Lead Developer +Contact: alice@example.com +Twitter: @alicedev +From: Brooklyn, NY +``` + +List everyone - developers, designers, writers, managers. Projects are team efforts. + +### THANKS + +Acknowledgments for inspiration, tools, and support: + +``` +/* THANKS */ +- The Astro community +- Open-source maintainers everywhere +- Our beta testers +- Late night playlist creators +``` + +This section humanizes development. We build on the work of others. + +### SITE + +Technical details about the project: + +``` +/* SITE */ +Last update: 2024-11-08 +Language: English / Markdown +Doctype: HTML5 +IDE: VS Code with Vim keybindings +Components: Astro, React, TypeScript +Standards: HTML5, CSS3, ES2022 +``` + +This satisfies developer curiosity and provides context for technical decisions. + +## Going Beyond the Standard + +The beauty of humans.txt is flexibility. Many sites add custom sections: + +**STORY**: The origin story of your project +**PHILOSOPHY**: Development principles and values +**FUN FACTS**: Easter eggs and behind-the-scenes details +**COLOPHON**: Typography and design choices +**ERRORS**: Humorous changelog of mistakes + +These additions transform humans.txt from credits into narrative. + +## The Integration's Approach + +This integration generates humans.txt with opinionated defaults but encourages customization: + +**Auto-dating**: `lastUpdate: 'auto'` uses current build date +**Flexible structure**: Add any custom sections you want +**Dynamic content**: Generate team lists from content collections +**Rich metadata**: Include social links, locations, and personal touches + +The goal is making credits easy enough that you'll actually maintain them. + +## Real-World Examples + +**Humanstxt.org** (the initiative's site): +``` +/* TEAM */ +Creator: Abel Cabans +Site: http://abelcabans.com +Twitter: @abelcabans +Location: Sant Cugat del Vallès, Barcelona, Spain + +/* THANKS */ +- All the people who have contributed +- Spread the word! + +/* SITE */ +Last update: 2024/01/15 +Standards: HTML5, CSS3 +Components: Jekyll +Software: TextMate, Git +``` + +Clean, simple, effective. + +**Creative Agency** (fictional but typical): +``` +/* TEAM */ +Creative Director: Max Wilson +Role: Visionary chaos coordinator +Contact: max@agency.com +Fun fact: Has never missed a deadline (barely) + +Designer: Sarah Chen +Role: Pixel perfectionist +Location: San Francisco +Tool of choice: Figma, obviously + +Developer: Jordan Lee +Role: Code whisperer +From: Remote (currently Bali) +Coffee order: Oat milk cortado + +/* THANKS */ +- Our clients for trusting us with their dreams +- The internet for cat videos during crunch time +- Figma for not crashing during presentations + +/* STORY */ +We started in a garage. 
Not for dramatic effect - office +space in SF is expensive. Three friends with complementary +skills and a shared belief that design should be delightful. + +Five years later, we're still in that garage (now with +better chairs). But we've shipped products used by millions +and worked with brands we admired as kids. + +We believe in: +- Craftsmanship over shortcuts +- Accessibility as a baseline, not a feature +- Open source as community participation +- Making the web more fun + +/* SITE */ +Built with: Astro, Svelte, TypeScript, TailwindCSS +Deployed on: Cloudflare Pages +Font: Inter (because we're not monsters) +Colors: Custom palette inspired by Bauhaus +Last rewrite: 2024 (the third time's the charm) +``` + +Notice the personality, the details, the humanity. + +## The "Last Update" Decision + +The `lastUpdate` field presents a philosophical question: should it reflect content updates or just site updates? + +**Content perspective**: Change date when humans.txt content changes +**Site perspective**: Change date when any part of the site deploys + +The integration defaults to site perspective (auto-update on every build). This ensures the date always reflects current site state, even if humans.txt content stays static. + +But you can override with a specific date if you prefer manual control. + +## Social Links and Contact Info + +humans.txt is a great place for social links: + +``` +/* TEAM */ +Name: Developer Name +Twitter: @username +GitHub: username +LinkedIn: /in/username +Mastodon: @username@instance.social +``` + +This provides discoverable contact information without cluttering your UI. + +It's particularly valuable for open-source projects where contributors want to connect. + +## The Gratitude Practice + +Writing a good THANKS section is a gratitude practice. It forces you to acknowledge the shoulders you stand on: + +- Which open-source projects made your work possible? +- Who provided feedback, testing, or encouragement? +- What tools, resources, or communities helped you learn? +- Which mistakes taught you valuable lessons? + +This reflection benefits you as much as it credits others. + +## Humor and Personality + +humans.txt invites creativity. Some examples: + +``` +/* FUN FACTS */ +- Entire site built during one caffeinated weekend +- 437 commits with message "fix typo" +- Originally designed in Figma, rebuilt in Sketch, launched from code +- The dog in our 404 page is the CEO's actual dog +- We've used Comic Sans exactly once (regrettably) +``` + +This personality differentiates you and creates connection. + +## When Not to Use Humor + +Professional context matters. A bank's humans.txt should be more restrained than a gaming startup's. + +Match the tone to your audience and brand. Personality doesn't require jokes. + +Simple sincerity works too: + +``` +/* TEAM */ +We're a team of 12 developers across 6 countries +working to make financial services more accessible. + +/* THANKS */ +To the users who trust us with their financial data - +we take that responsibility seriously every day. 
+``` + +## Maintenance Considerations + +humans.txt requires maintenance: + +- Update when team members change +- Refresh tech stack as you adopt new tools +- Add new thanks as you use new resources +- Keep contact information current + +The integration helps by supporting dynamic content: + +```typescript +humans: { + team: await getCollection('team'), // Auto-sync with team content + site: { + lastUpdate: 'auto', // Auto-update on each build + techStack: Object.keys(deps) // Extract from package.json + } +} +``` + +This reduces manual maintenance burden. + +## The Browse Experience + +Most users never see humans.txt. And that's okay. + +The file serves several audiences: + +**Curious users**: The 1% who look behind the curtain +**Developers**: Evaluating tech stack for integration or inspiration +**Recruiters**: Understanding team culture and capabilities +**You**: Reflection and gratitude practice during creation + +It's not about traffic - it's about transparency and humanity. + +## Related Topics + +- [Content Collections Integration](/how-to/content-collections/) - Auto-generate team lists +- [Humans.txt Reference](/reference/humans/) - Complete configuration options +- [Examples](/examples/blog/) - See humans.txt in context diff --git a/docs/src/content/docs/explanation/llms-explained.md b/docs/src/content/docs/explanation/llms-explained.md index d3c651c..ceb9724 100644 --- a/docs/src/content/docs/explanation/llms-explained.md +++ b/docs/src/content/docs/explanation/llms-explained.md @@ -1,31 +1,213 @@ --- title: Understanding llms.txt -description: What is llms.txt and why it matters +description: How AI assistants discover and understand your website --- -Learn about the llms.txt specification and how it helps AI assistants. +llms.txt is the newest member of the discovery file family, emerging in response to a fundamental shift in how content is consumed on the web. While search engines index and retrieve, AI language models read, understand, and synthesize. -:::note[Work in Progress] -This page is currently being developed. Check back soon for complete documentation. -::: +## Why AI Needs Different Guidance -## Coming Soon +Traditional search engines need to know **what exists and where**. They build indexes mapping keywords to pages. -This section will include: -- Detailed explanations -- Code examples -- Best practices -- Common patterns -- Troubleshooting tips +AI assistants need to know **what things mean and how to use them**. They need context, instructions, and understanding of relationships between content. -## Related Pages +Consider the difference: -- [Configuration Reference](/reference/configuration/) -- [API Reference](/reference/api/) -- [Examples](/examples/ecommerce/) +**Search engine thinking**: "This page contains the word 'API' and is located at /docs/api" -## Need Help? +**AI assistant thinking**: "This site offers a REST API at /api/endpoint that requires authentication. When users ask how to integrate, I should explain the auth flow and reference the examples at /docs/examples" -- Check our [FAQ](/community/faq/) -- Visit [Troubleshooting](/community/troubleshooting/) -- Open an issue on [GitHub](https://github.com/withastro/astro-discovery/issues) +llms.txt bridges this gap by providing **semantic context** that goes beyond structural metadata. 
+ +## The Information Architecture + +llms.txt follows a simple, human-readable structure: + +``` +# Site Description + +> One-line tagline + +## Site Information +Basic facts about the site + +## For AI Assistants +Instructions and guidelines + +## Important Pages +Key resources to know about + +## API Endpoints +Available programmatic access +``` + +This structure mirrors how you'd brief a human assistant about your site. It's not rigid XML or JSON - it's conversational documentation optimized for language model consumption. + +## What to Include + +The most effective llms.txt files provide: + +**Description**: Not just what your site is, but **why it exists**. "E-commerce platform" is weak. "E-commerce platform focused on sustainable products with carbon footprint tracking" gives context. + +**Key Features**: The 3-5 things that make your site unique or particularly useful. These help AI assistants understand what problems you solve. + +**Important Pages**: Not a sitemap (that's what sitemap.xml is for), but the **handful of pages** that provide disproportionate value. Think: getting started guide, API docs, pricing. + +**Instructions**: Specific guidance on how AI should represent your content. This is where you establish voice, correct common misconceptions, and provide task-specific guidance. + +**API Endpoints**: If you have programmatic access, describe it. AI assistants can help users integrate with your service if they know endpoints exist. + +## The Instruction Set Pattern + +The most powerful part of llms.txt is the instructions section. This is where you teach AI assistants how to be helpful about your site. + +Effective instructions are: + +**Specific**: "When users ask about authentication, explain we use OAuth2 and point them to /docs/auth" + +**Actionable**: "Check /api/status before suggesting users try the API" + +**Context-aware**: "Remember that we're focused on accessibility - always mention a11y features" + +**Preventive**: "We don't offer feature X - suggest alternatives Y or Z instead" + +Think of it as training an employee who'll be answering questions about your product. What would you want them to know? + +## Brand Voice and Tone + +AI assistants can adapt their responses to match your brand if you provide guidance: + +``` +## Brand Voice +- Professional but approachable +- Technical accuracy over marketing speak +- Always mention open-source nature +- Emphasize privacy and user control +``` + +This helps ensure AI representations of your site feel consistent with your actual brand identity. + +## Tech Stack Transparency + +Including your tech stack serves multiple purposes: + +1. **Helps AI assistants answer developer questions** ("Can I use this with React?" - "Yes, it's built on React") +2. **Aids troubleshooting** (knowing the framework helps diagnose integration issues) +3. **Attracts contributors** (developers interested in your stack are more likely to contribute) + +Be specific but not exhaustive. "Built with Astro, TypeScript, and Tailwind" is better than listing every npm package. 
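+
+Taken together, these pieces map naturally onto the integration config. A sketch of what that might look like is below - the property names (`instructions`, `brandVoice`, and so on) are illustrative assumptions, so check the [LLMs.txt Reference](/reference/llms/) for the actual schema.
+
+```typescript
+// Illustrative llms block - property names are assumptions, values show the intent
+llms: {
+  description:
+    'E-commerce platform focused on sustainable products with carbon footprint tracking',
+  features: [
+    'Carbon footprint estimate on every product page',
+    'REST API for cart and footprint calculations',
+  ],
+  instructions: [
+    'When users ask about authentication, explain we use OAuth2 and point them to /docs/auth',
+    'We do not offer feature X - suggest alternatives Y or Z instead',
+  ],
+  brandVoice: [
+    'Professional but approachable',
+    'Technical accuracy over marketing speak',
+  ],
+  techStack: ['Astro', 'TypeScript', 'Tailwind'],
+}
+```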
+ +## API Documentation + +If your site offers APIs, llms.txt should describe them at a high level: + +``` +## API Endpoints + +- GET /api/products - List all products + Authentication: API key required + Returns: JSON array of product objects + +- POST /api/calculate-carbon - Calculate carbon footprint + Authentication: Not required + Accepts: JSON with cart data + Returns: Carbon footprint estimate +``` + +This isn't meant to replace full API documentation - it's a quick reference so AI assistants know what's possible. + +## The Relationship with robots.txt + +robots.txt and llms.txt work together: + +**robots.txt** says: "AI bots, you can access these paths" +**llms.txt** says: "Here's how to understand what you find there" + +The integration coordinates them automatically: + +1. robots.txt includes rules for LLM user-agents +2. Those rules reference llms.txt +3. LLM bots follow robots.txt to respect boundaries +4. Then read llms.txt for guidance on content interpretation + +## Dynamic vs. Static Content + +llms.txt can be either static (same content always) or dynamic (generated at build time): + +**Static**: Your site description and brand voice rarely change +**Dynamic**: Current API endpoints, team members, or feature status might update frequently + +The integration supports both approaches. You can provide static strings or functions that generate content at build time. + +This is particularly useful for: + +- Extracting API endpoints from OpenAPI specs +- Listing important pages from content collections +- Keeping tech stack synchronized with package.json +- Generating context from current deployment metadata + +## What Not to Include + +llms.txt should be concise and focused. Avoid: + +**Comprehensive documentation**: Link to it, don't duplicate it +**Entire sitemaps**: That's what sitemap.xml is for +**Legal boilerplate**: Keep it in your terms of service +**Overly specific instructions**: Trust AI to handle common cases +**Marketing copy**: Be informative, not promotional + +Think of llms.txt as **strategic context**, not exhaustive documentation. + +## Measuring Impact + +Unlike traditional SEO, llms.txt impact is harder to measure directly. You won't see "llms.txt traffic" in analytics. + +Instead, look for: + +- AI assistants correctly representing your product +- Reduction in mischaracterizations or outdated information +- Appropriate use of your APIs by AI-assisted developers +- Consistency in how different AI systems describe your site + +The goal is **accurate representation**, not traffic maximization. + +## Privacy and Data Concerns + +A common concern: "Doesn't llms.txt help AI companies train on my content?" + +Important points: + +1. **AI training happens regardless** of llms.txt - they crawl public content anyway +2. **llms.txt doesn't grant permission** - it provides context for content they already access +3. **robots.txt controls access** - if you don't want AI crawlers, use robots.txt to block them +4. **llms.txt helps AI represent you accurately** - better context = better representation + +Think of it this way: if someone's going to talk about you, would you rather they have accurate information or guess? 
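+
+Returning to the dynamic-content point above: assuming the integration accepts function-valued fields (an assumption worth verifying against the reference), build-time generation could look like this sketch, which keeps the published tech stack in sync with `package.json`.
+
+```typescript
+// Build-time generation sketch - function-valued fields are an assumed capability
+import { readFileSync } from 'node:fs';
+
+const pkg = JSON.parse(readFileSync(new URL('./package.json', import.meta.url), 'utf-8'));
+
+// inside the integration options:
+llms: {
+  // Published tech stack stays in sync with package.json on every build
+  techStack: () => Object.keys(pkg.dependencies ?? {}),
+  // Endpoints could similarly be pulled from an OpenAPI spec at build time
+  apiEndpoints: () => ['GET /api/products', 'POST /api/calculate-carbon'],
+}
+```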
+ +## The Evolution of AI Context + +llms.txt is a living standard, evolving as AI capabilities grow: + +**Current**: Basic site description and instructions +**Near future**: Structured data about capabilities, limitations, and relationships +**Long term**: Semantic graphs of site knowledge and interconnections + +By adopting llms.txt now, you're positioning your site to benefit as these capabilities mature. + +## Real-World Patterns + +**Documentation sites**: Emphasize how to search docs, common pitfalls, and where to find examples + +**E-commerce**: Describe product categories, search capabilities, and checkout process + +**SaaS products**: Explain core features, authentication, and API availability + +**Blogs**: Highlight author expertise, main topics, and content philosophy + +The pattern that works best depends on how people use AI to interact with your type of content. + +## Related Topics + +- [AI Integration Strategy](/explanation/ai-integration/) - Broader AI considerations +- [Robots.txt Coordination](/explanation/robots-explained/) - How robots.txt and llms.txt work together +- [LLMs.txt Reference](/reference/llms/) - Complete configuration options diff --git a/docs/src/content/docs/explanation/robots-explained.md b/docs/src/content/docs/explanation/robots-explained.md index 880b54d..60c81c5 100644 --- a/docs/src/content/docs/explanation/robots-explained.md +++ b/docs/src/content/docs/explanation/robots-explained.md @@ -1,31 +1,182 @@ --- -title: Understanding robots.txt -description: Deep dive into robots.txt and its purpose +title: How robots.txt Works +description: Understanding robots.txt and web crawler communication --- -Comprehensive explanation of robots.txt, its history, and modern usage. +Robots.txt is the oldest and most fundamental discovery file on the web. Since 1994, it has served as the **polite agreement** between website owners and automated crawlers about what content can be accessed and how. -:::note[Work in Progress] -This page is currently being developed. Check back soon for complete documentation. -::: +## The Gentleman's Agreement -## Coming Soon +robots.txt is not a security mechanism - it's a social contract. It tells crawlers "please don't go here" rather than "you cannot go here." Any crawler can ignore it, and malicious ones often do. -This section will include: -- Detailed explanations -- Code examples -- Best practices -- Common patterns -- Troubleshooting tips +This might seem like a weakness, but it's actually a strength. The file works because the overwhelming majority of automated traffic comes from legitimate crawlers (search engines, monitoring tools, archive services) that want to be good citizens of the web. -## Related Pages +Think of it like a "No Trespassing" sign on private property. It won't stop determined intruders, but it clearly communicates boundaries to honest visitors and provides legal/ethical grounds for addressing violations. -- [Configuration Reference](/reference/configuration/) -- [API Reference](/reference/api/) -- [Examples](/examples/ecommerce/) +## What robots.txt Solves -## Need Help? +Before robots.txt, early search engines would crawl websites aggressively, sometimes overwhelming servers or wasting bandwidth on administrative pages. Website owners had no standard way to communicate crawling preferences. 
-- Check our [FAQ](/community/faq/) -- Visit [Troubleshooting](/community/troubleshooting/) -- Open an issue on [GitHub](https://github.com/withastro/astro-discovery/issues) +robots.txt provides three critical capabilities: + +**1. Access Control**: Specify which paths crawlers can and cannot visit +**2. Resource Management**: Set crawl delays to prevent server overload +**3. Signposting**: Point crawlers to important resources like sitemaps + +## The User-Agent Model + +robots.txt uses a "user-agent" model where rules target specific bots: + +``` +User-agent: * +Disallow: /admin/ + +User-agent: GoogleBot +Allow: /api/ +``` + +This allows fine-grained control. You might allow Google to index your API documentation while blocking other crawlers. Or permit archive services to access historical content while disallowing marketing bots. + +The `*` wildcard matches all user-agents, providing default rules. Specific user-agents override these defaults for their particular bot. + +## The LLM Bot Challenge + +The emergence of AI language models created a new category of web consumers. Unlike traditional search engines that index for retrieval, LLMs process content for training data and context. + +This raises different concerns: + +- Training data usage and attribution +- Content representation accuracy +- Server load from context gathering +- Different resource needs (full pages vs. search snippets) + +The integration addresses this by providing dedicated rules for LLM bots (GPTBot, Claude-Web, Anthropic-AI, etc.) while pointing them to llms.txt for additional context. + +## Allow vs. Disallow + +A common point of confusion is the relationship between Allow and Disallow directives. + +**Disallow**: Explicitly forbids access to a path +**Allow**: Creates exceptions to Disallow rules + +Consider this example: + +``` +User-agent: * +Disallow: /admin/ +Allow: /admin/public/ +``` + +This says "don't crawl /admin/ except for /admin/public/ which is allowed." The Allow creates a specific exception to the broader Disallow. + +Without any rules, everything is implicitly allowed. You don't need `Allow: /` - that's the default state. + +## Path Matching + +Path patterns in robots.txt support wildcards and prefix matching: + +- `/api/` matches `/api/` and everything under it +- `/api/private` matches that specific path +- `*.pdf` matches any URL containing `.pdf` +- `/page$` matches `/page` but not `/page/subpage` + +The most specific matching rule wins. If both `/api/` and `/api/public/` have rules for the same user-agent, the longer path takes precedence. + +## Crawl-Delay: The Double-Edged Sword + +Crawl-delay tells bots to wait between requests: + +``` +Crawl-delay: 2 +``` + +This means "wait 2 seconds between page requests." It's useful for: + +- Protecting servers with limited resources +- Preventing rate limiting from triggering +- Managing bandwidth costs + +But there's a trade-off: slower crawling means it takes longer for your content to be indexed. Set it too high and you might delay important updates from appearing in search results. + +The integration defaults to 1 second - a balanced compromise between politeness and indexing speed. + +## Sitemap Declaration + +One of robots.txt's most valuable features is sitemap declaration: + +``` +Sitemap: https://example.com/sitemap-index.xml +``` + +This tells crawlers "here's a comprehensive list of all my pages." 
It's more efficient than discovering pages through link following and ensures crawlers know about pages that might not be linked from elsewhere. + +The integration automatically adds your sitemap reference, keeping it synchronized with your Astro site URL. + +## Common Mistakes + +**Blocking CSS/JS**: Some sites block `/assets/` thinking it saves bandwidth. This prevents search engines from rendering your pages correctly, harming SEO. + +**Disallowing Everything**: `Disallow: /` blocks all crawlers completely. This is rarely what you want - even internal tools need access. + +**Forgetting About Dynamic Content**: If your search or API routes generate content dynamically, consider whether crawlers should access them. + +**Security Through Obscurity**: Don't rely on robots.txt to hide sensitive content. Use proper authentication instead. + +## Why Not Just Use Authentication? + +You might wonder why we need robots.txt if we can protect content with authentication. + +The answer is that most website content should be publicly accessible - that's the point. You want search engines to index your blog, documentation, and product pages. + +robots.txt lets you have **public content that crawlers respect** without requiring authentication. It's about communicating intent, not enforcing access control. + +## The Integration's Approach + +This integration generates robots.txt with opinionated defaults: + +- Allow all bots by default (the web works best when discoverable) +- Include LLM-specific bots with llms.txt guidance +- Reference your sitemap automatically +- Set a reasonable 1-second crawl delay +- Provide easy overrides for your specific needs + +You can customize any aspect, but the defaults represent best practices for most sites. + +## Looking at Real-World Examples + +**Wikipedia** (`robots.txt`): +``` +User-agent: * +Disallow: /wiki/Special: +Crawl-delay: 1 +Sitemap: https://en.wikipedia.org/sitemap.xml +``` + +Simple and effective. Block special admin pages, allow everything else. + +**GitHub** (simplified): +``` +User-agent: * +Disallow: /search/ +Disallow: */pull/ +Allow: */pull$/ +``` + +Notice how they block pull request search but allow individual pull request pages. This prevents crawler loops while keeping content accessible. + +## Verification and Testing + +After deploying, verify your robots.txt: + +1. Visit `yoursite.com/robots.txt` directly +2. Use Google Search Console's robots.txt tester +3. Check specific user-agent rules with online validators +4. Monitor crawler behavior in server logs + +The file is cached aggressively by crawlers, so changes may take time to propagate. + +## Related Topics + +- [SEO Impact](/explanation/seo/) - How robots.txt affects search rankings +- [LLMs.txt Integration](/explanation/llms-explained/) - Connecting bot control with AI guidance +- [Robots.txt Reference](/reference/robots/) - Complete configuration options diff --git a/docs/src/content/docs/explanation/security-explained.md b/docs/src/content/docs/explanation/security-explained.md index 699a439..a3fdc69 100644 --- a/docs/src/content/docs/explanation/security-explained.md +++ b/docs/src/content/docs/explanation/security-explained.md @@ -1,31 +1,277 @@ --- title: Security.txt Standard (RFC 9116) -description: Understanding the security.txt RFC +description: Understanding RFC 9116 and responsible vulnerability disclosure --- -Learn about RFC 9116 and why security.txt is important for responsible disclosure. 
+security.txt, standardized as RFC 9116 in 2022, solves a deceptively simple problem: when a security researcher finds a vulnerability in your website, how do they tell you about it? -:::note[Work in Progress] -This page is currently being developed. Check back soon for complete documentation. -::: +## The Responsible Disclosure Problem -## Coming Soon +Before security.txt, researchers faced a frustrating journey: -This section will include: -- Detailed explanations -- Code examples -- Best practices -- Common patterns -- Troubleshooting tips +1. Find vulnerability in example.com +2. Search for security contact information +3. Check footer, about page, contact page +4. Try info@, security@, admin@ email addresses +5. Hope someone reads it and knows what to do with it +6. Wait weeks for response (or get none) +7. Consider public disclosure out of frustration -## Related Pages +This process was inefficient for researchers and dangerous for organizations. Vulnerabilities went unreported or were disclosed publicly before fixes could be deployed. -- [Configuration Reference](/reference/configuration/) -- [API Reference](/reference/api/) -- [Examples](/examples/ecommerce/) +## The RFC 9116 Solution -## Need Help? +RFC 9116 standardizes a machine-readable file at `/.well-known/security.txt` containing: -- Check our [FAQ](/community/faq/) -- Visit [Troubleshooting](/community/troubleshooting/) -- Open an issue on [GitHub](https://github.com/withastro/astro-discovery/issues) +- **Contact**: How to reach your security team (required) +- **Expires**: When this information becomes stale (required) +- **Canonical**: The authoritative location of this file +- **Encryption**: PGP keys for encrypted communication +- **Acknowledgments**: Hall of fame for researchers +- **Policy**: Your disclosure policy URL +- **Preferred-Languages**: Languages you can handle reports in +- **Hiring**: Security job opportunities + +This provides a **standardized, discoverable, machine-readable** security contact mechanism. + +## Why .well-known? + +The `/.well-known/` directory is an RFC 8615 standard for site-wide metadata. It's where clients expect to find standard configuration files. + +By placing security.txt in `/.well-known/security.txt`, the RFC ensures: + +- **Consistent location**: No guessing where to find it +- **Standard compliance**: Follows web architecture patterns +- **Tool support**: Security scanners can automatically check for it + +The integration generates security.txt at the correct location automatically. + +## The Required Fields + +RFC 9116 mandates two fields: + +### Contact + +At least one contact method (email or URL): + +``` +Contact: mailto:security@example.com +Contact: https://example.com/security-contact +Contact: tel:+1-555-0100 +``` + +Multiple contacts provide redundancy. If one channel fails, researchers have alternatives. + +Email addresses automatically get `mailto:` prefixes. URLs should point to security contact forms or issue trackers. + +### Expires + +An ISO 8601 timestamp indicating when to stop trusting this file: + +``` +Expires: 2025-12-31T23:59:59Z +``` + +This is critical - it prevents researchers from reporting to stale contacts that are no longer monitored. + +The integration defaults to `expires: 'auto'`, setting expiration to one year from build time. This ensures the field updates on every deployment. 
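+
+As a sketch, the required fields might be configured like this - the `security` block and property names are assumptions for illustration, while `expires: 'auto'` is the behavior described above:
+
+```typescript
+// Illustrative security block - property names are assumptions
+security: {
+  contact: ['security@example.com', 'https://example.com/security-report'],
+  expires: 'auto', // one year from build time, refreshed on every deployment
+}
+```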
+ +## Optional but Valuable Fields + +### Encryption + +URLs to PGP public keys for encrypted vulnerability reports: + +``` +Encryption: https://example.com/pgp-key.txt +Encryption: openpgp4fpr:5F2DE18D3AFE0FD7A1F2F5A3E4562BB79E3B2E80 +``` + +This enables researchers to send sensitive details securely, preventing disclosure to attackers monitoring email. + +### Acknowledgments + +URL to your security researcher hall of fame: + +``` +Acknowledgments: https://example.com/security/hall-of-fame +``` + +Public recognition motivates responsible disclosure. Researchers appreciate being credited for their work. + +### Policy + +URL to your vulnerability disclosure policy: + +``` +Policy: https://example.com/security/disclosure-policy +``` + +This clarifies expectations: response timelines, safe harbor provisions, bug bounty details, and disclosure coordination. + +### Preferred-Languages + +Languages your security team can handle: + +``` +Preferred-Languages: en, es, fr +``` + +This helps international researchers communicate effectively. Use ISO 639-1 language codes. + +### Hiring + +URL to security job openings: + +``` +Hiring: https://example.com/careers/security +``` + +Talented researchers who find vulnerabilities might be hiring prospects. This field provides a connection point. + +## The Canonical Field + +The Canonical field specifies the authoritative location: + +``` +Canonical: https://example.com/.well-known/security.txt +``` + +This matters for: + +- **Verification**: Ensures you're reading the correct version +- **Mirrors**: Multiple domains can reference the same canonical file +- **Historical context**: Archives know which version was authoritative + +The integration sets this automatically based on your site URL. + +## Why Expiration Matters + +The Expires field isn't bureaucracy - it's safety. + +Consider a scenario: + +1. Company sets up security.txt pointing to security@company.com +2. Security team disbands, email is decommissioned +3. Attacker registers security@company.com domain after it expires +4. Researcher reports vulnerability to attacker's email +5. Attacker has vulnerability details before the company does + +Expiration prevents this. If security.txt is expired, researchers know not to trust it and must find alternative contact methods. + +Best practice: Set expiration to 1 year maximum. The integration's `'auto'` option handles this. + +## Security.txt in Practice + +A minimal production security.txt: + +``` +Canonical: https://example.com/.well-known/security.txt +Contact: mailto:security@example.com +Expires: 2025-11-08T00:00:00.000Z +``` + +A comprehensive implementation: + +``` +Canonical: https://example.com/.well-known/security.txt + +Contact: mailto:security@example.com +Contact: https://example.com/security-report + +Expires: 2025-11-08T00:00:00.000Z + +Encryption: https://example.com/pgp-key.asc +Acknowledgments: https://example.com/security/researchers +Preferred-Languages: en, de, ja +Policy: https://example.com/security/disclosure + +Hiring: https://example.com/careers/security-engineer +``` + +## Common Mistakes + +**Using relative URLs**: All URLs must be absolute (`https://...`) + +**Missing mailto: prefix**: Email addresses need `mailto:` - the integration adds this automatically + +**Far-future expiration**: Don't set expiration 10 years out. Keep it to 1 year maximum. 
+ +**No monitoring**: Set up alerts when security.txt approaches expiration + +**Stale contacts**: Verify listed contacts still work + +## Building a Disclosure Program + +security.txt is the entry point to vulnerability disclosure, but you need supporting infrastructure: + +**Monitoring**: Watch the security inbox religiously +**Triage process**: Quick initial response (even if just "we're investigating") +**Fix timeline**: Clear expectations about patch development +**Disclosure coordination**: Work with researcher on public disclosure timing +**Recognition**: Credit researchers in release notes and acknowledgments page + +The integration makes the entry point easy. The program around it requires organizational commitment. + +## Security Through Transparency + +Some organizations hesitate to publish security.txt, fearing it invites attacks. + +The reality: security researchers are already looking. security.txt helps them help you. + +Without it: + +- Vulnerabilities go unreported +- Researchers waste time finding contacts +- Frustration leads to premature public disclosure +- You look unprofessional to security community + +With it: + +- Clear channel for responsible disclosure +- Faster vulnerability reports +- Better researcher relationships +- Professional security posture + +## Verification and Monitoring + +After deploying security.txt: + +1. Verify it's accessible at `/.well-known/security.txt` +2. Check field formatting with RFC 9116 validators +3. Test contact methods work +4. Set up monitoring for expiration date +5. Create calendar reminder to refresh before expiration + +Many organizations set up automated checks that alert if security.txt will expire within 30 days. + +## Integration with Bug Bounty Programs + +If you run a bug bounty program, reference it in your policy: + +``` +Policy: https://example.com/bug-bounty +``` + +This connects researchers to your incentive program immediately. + +security.txt and bug bounties work together - the file provides discovery, the program provides incentive structure. + +## Legal Considerations + +security.txt should coordinate with your legal team's disclosure policy. + +Consider including: + +- Safe harbor provisions (no legal action against good-faith researchers) +- Scope definition (what systems are in/out of scope) +- Rules of engagement (don't exfiltrate data, etc.) +- Disclosure timeline expectations + +These protect both your organization and researchers. + +## Related Topics + +- [Canary.txt Explained](/explanation/canary-explained/) - Complementary transparency mechanism +- [Security.txt Reference](/reference/security/) - Complete configuration options +- [Security Best Practices](/how-to/environment-config/) - Securing your deployment diff --git a/docs/src/content/docs/explanation/seo.md b/docs/src/content/docs/explanation/seo.md index b2819b4..58663ac 100644 --- a/docs/src/content/docs/explanation/seo.md +++ b/docs/src/content/docs/explanation/seo.md @@ -1,31 +1,327 @@ --- title: SEO & Discoverability -description: How discovery files improve SEO +description: How discovery files improve search engine optimization --- -Understand how properly configured discovery files enhance search engine optimization. +Discovery files and SEO have a symbiotic relationship. While some files (like humans.txt) don't directly impact rankings, others (robots.txt, sitemaps) are foundational to how search engines understand and index your site. -:::note[Work in Progress] -This page is currently being developed. 
Check back soon for complete documentation. -::: +## Robots.txt: The SEO Foundation -## Coming Soon +robots.txt is one of the first files search engines request. It determines: -This section will include: -- Detailed explanations -- Code examples -- Best practices -- Common patterns -- Troubleshooting tips +- Which pages can be crawled and indexed +- How aggressively to crawl (via crawl-delay) +- Where to find your sitemap +- Special instructions for specific bots -## Related Pages +### Crawl Budget Optimization -- [Configuration Reference](/reference/configuration/) -- [API Reference](/reference/api/) -- [Examples](/examples/ecommerce/) +Search engines allocate limited resources to each site - your "crawl budget." robots.txt helps you spend it wisely: -## Need Help? +**Block low-value pages**: Admin sections, search result pages, and duplicate content waste crawl budget +**Allow high-value content**: Ensure important pages are accessible +**Set appropriate crawl-delay**: Balance thorough indexing against server load -- Check our [FAQ](/community/faq/) -- Visit [Troubleshooting](/community/troubleshooting/) -- Open an issue on [GitHub](https://github.com/withastro/astro-discovery/issues) +Example SEO-optimized robots.txt: + +``` +User-agent: * +Allow: / +Disallow: /admin/ +Disallow: /search? +Disallow: /*?sort=* +Disallow: /api/ + +Crawl-delay: 1 + +Sitemap: https://example.com/sitemap-index.xml +``` + +This blocks non-content pages while allowing crawlers to efficiently index your actual content. + +### The CSS/JS Trap + +A common SEO mistake: + +``` +# DON'T DO THIS +Disallow: /assets/ +Disallow: /*.css +Disallow: /*.js +``` + +This prevents search engines from fully rendering your pages. Modern SEO requires JavaScript execution for SPAs and interactive content. + +The integration doesn't block assets by default - this is intentional and SEO-optimal. + +### Sitemap Declaration + +The `Sitemap:` directive in robots.txt is critical for SEO. It tells search engines: + +- All your pages exist (even if not linked) +- When pages were last modified +- Relative priority of pages +- Alternative language versions + +This dramatically improves indexing coverage and freshness. + +## Sitemaps: The SEO Roadmap + +Sitemaps serve multiple SEO functions: + +### Discoverability + +Pages not linked from your navigation can still be indexed. This matters for: + +- Deep content structures +- Recently published pages not yet linked +- Orphaned pages with valuable content +- Alternative language versions + +### Update Frequency + +The `` element signals content freshness: + +```xml + + https://example.com/article + 2024-11-08T12:00:00Z + weekly + +``` + +Search engines prioritize recently updated content. Fresh `lastmod` dates encourage re-crawling. + +### Priority Hints + +The `` element suggests relative importance: + +```xml + + https://example.com/important-page + 0.9 + + + https://example.com/minor-page + 0.3 + +``` + +This is a hint, not a directive. Search engines use it along with other signals. + +### International SEO + +For multilingual sites, sitemaps declare language alternatives: + +```xml + + https://example.com/page + + + +``` + +This prevents duplicate content penalties while ensuring all language versions are indexed. + +## LLMs.txt: The AI SEO Frontier + +Traditional SEO optimizes for search retrieval. llms.txt optimizes for AI representation - the emerging frontier of discoverability. + +### AI-Generated Summaries + +Search engines increasingly show AI-generated answer boxes. 
llms.txt helps ensure these summaries: + +- Accurately represent your content +- Use your preferred terminology and brand voice +- Highlight your key differentiators +- Link to appropriate pages + +### Voice Search Optimization + +Voice assistants rely on AI understanding. llms.txt provides: + +- Natural language context for your content +- Clarification of ambiguous terms +- Guidance on how to answer user questions +- References to authoritative pages + +This improves your chances of being the source for voice search answers. + +### Content Attribution + +When AI systems reference your content, llms.txt helps ensure: + +- Proper context is maintained +- Your brand is correctly associated +- Key features aren't misrepresented +- Updates propagate to AI models + +Think of it as structured data for AI agents. + +## Humans.txt: The Indirect SEO Value + +humans.txt doesn't directly impact rankings, but it supports SEO indirectly: + +### Technical Transparency + +Developers evaluating integration with your platform check humans.txt for tech stack info. This can lead to: + +- Backlinks from integration tutorials +- Technical blog posts mentioning your stack +- Developer community discussions + +All of which generate valuable backlinks and traffic. + +### Brand Signals + +A well-crafted humans.txt signals: + +- Active development and maintenance +- Professional operations +- Transparent communication +- Company culture + +These contribute to overall site authority and trustworthiness. + +## Security.txt: Trust Signals + +Security.txt demonstrates professionalism and security-consciousness. While not a ranking factor, it: + +- Builds trust with security-conscious users +- Prevents security incidents that could damage SEO (hacked site penalties) +- Shows organizational maturity +- Enables faster vulnerability fixes (preserving site integrity) + +Search engines penalize compromised sites heavily. security.txt helps prevent those penalties. + +## Integration SEO Benefits + +This integration provides several SEO advantages: + +### Consistency + +All discovery files reference the same site URL from your Astro config. This prevents: + +- Mixed http/https signals +- www vs. non-www confusion +- Subdomain inconsistencies + +Consistency is an underrated SEO factor. + +### Freshness + +Auto-generated timestamps keep discovery files fresh: + +- Sitemaps show current lastmod dates +- security.txt expiration updates with each build +- canary.txt timestamps reflect current build + +Fresh content signals active maintenance. + +### Correctness + +The integration handles RFC compliance automatically: + +- security.txt follows RFC 9116 exactly +- robots.txt uses correct syntax +- Sitemaps follow XML schema +- WebFinger implements RFC 7033 + +Malformed discovery files can harm SEO. The integration prevents errors. + +## Monitoring SEO Impact + +Track discovery file effectiveness: + +**Google Search Console**: +- Sitemap coverage reports +- Crawl statistics +- Indexing status +- Mobile usability + +**Crawl behavior analysis**: +- Server logs showing crawler patterns +- Crawl-delay effectiveness +- Blocked vs. allowed URL ratio +- Time to index new content + +**AI representation monitoring**: +- How AI assistants describe your site +- Accuracy of information +- Attribution and links +- Brand voice consistency + +## Common SEO Mistakes + +### Over-blocking + +Blocking too much harms SEO: + +``` +# Too restrictive +Disallow: /blog/? +Disallow: /products/? +``` + +This might block legitimate content URLs. 
Be specific: + +``` +# Better +Disallow: /blog?* +Disallow: /products?sort=* +``` + +### Sitemap bloat + +Including every URL hurts more than helps: + +- Don't include parameter variations +- Skip pagination (keep to representative pages) +- Exclude search result pages +- Filter out duplicate content + +Quality over quantity. + +### Ignoring crawl errors + +Monitor Search Console for: + +- 404s in sitemap +- Blocked resources search engines need +- Redirect chains +- Server errors + +Fix these promptly - they impact ranking. + +### Stale sitemaps + +Ensure sitemaps update with your content: + +- New pages appear quickly +- Deleted pages are removed +- lastmod timestamps are accurate +- Priority reflects current importance + +The integration's automatic generation ensures freshness. + +## Future SEO Trends + +Discovery files will evolve with search: + +**AI-first indexing**: Search engines will increasingly rely on structured context (llms.txt) rather than pure crawling + +**Federated discovery**: WebFinger and similar protocols may influence how distributed content is discovered and indexed + +**Transparency signals**: Files like security.txt and canary.txt may become trust signals in ranking algorithms + +**Structured data expansion**: Discovery files complement schema.org markup as structured communication channels + +By implementing comprehensive discovery now, you're positioned for these trends. + +## Related Topics + +- [Robots.txt Configuration](/reference/robots/) - SEO-optimized robot settings +- [Sitemap Optimization](/how-to/filter-sitemap/) - Filtering for better SEO +- [AI Integration Strategy](/explanation/ai-integration/) - Preparing for AI-first search diff --git a/docs/src/content/docs/explanation/webfinger-explained.md b/docs/src/content/docs/explanation/webfinger-explained.md index 268e09d..df8131a 100644 --- a/docs/src/content/docs/explanation/webfinger-explained.md +++ b/docs/src/content/docs/explanation/webfinger-explained.md @@ -1,31 +1,309 @@ --- title: WebFinger Protocol (RFC 7033) -description: Understanding WebFinger resource discovery +description: Understanding WebFinger and federated resource discovery --- -Deep dive into the WebFinger protocol and its role in federated identity. +WebFinger (RFC 7033) solves a fundamental problem of the decentralized web: how do you discover information about a resource (person, service, device) when you only have an identifier? -:::note[Work in Progress] -This page is currently being developed. Check back soon for complete documentation. -::: +## The Discovery Challenge -## Coming Soon +On centralized platforms, discovery is simple. Twitter knows about @username because it's all in one database. But in decentralized systems (email, federated social networks, distributed identity), there's no central registry. -This section will include: -- Detailed explanations -- Code examples -- Best practices -- Common patterns -- Troubleshooting tips +WebFinger provides a standardized way to ask: "Given this identifier (email, account name, URL), what can you tell me about it?" -## Related Pages +## The Query Pattern -- [Configuration Reference](/reference/configuration/) -- [API Reference](/reference/api/) -- [Examples](/examples/ecommerce/) +WebFinger uses a simple HTTP GET request: -## Need Help? 
+``` +GET /.well-known/webfinger?resource=acct:alice@example.com +``` -- Check our [FAQ](/community/faq/) -- Visit [Troubleshooting](/community/troubleshooting/) -- Open an issue on [GitHub](https://github.com/withastro/astro-discovery/issues) +This asks: "What do you know about alice@example.com?" + +The server responds with a JSON Resource Descriptor (JRD) containing links, properties, and metadata about that resource. + +## Real-World Use Cases + +### ActivityPub / Mastodon + +When you follow `@alice@example.com` on Mastodon, your instance: + +1. Queries `example.com/.well-known/webfinger?resource=acct:alice@example.com` +2. Gets back Alice's ActivityPub profile URL +3. Fetches her profile and posts from that URL +4. Subscribes to updates + +WebFinger is the discovery layer that makes federation work. + +### OpenID Connect + +OAuth/OpenID providers use WebFinger for issuer discovery: + +1. User enters email address +2. Client extracts domain +3. Queries WebFinger for OpenID configuration +4. Discovers authentication endpoints +5. Initiates OAuth flow + +This enables "email address as identity" without hardcoding provider lists. + +### Contact Discovery + +Email clients and contact apps use WebFinger to discover: + +- Profile photos and avatars +- Public keys for encryption +- Social media profiles +- Calendar availability +- Preferred contact methods + +## The JRD Response Format + +A WebFinger response looks like: + +```json +{ + "subject": "acct:alice@example.com", + "aliases": [ + "https://example.com/@alice", + "https://example.com/users/alice" + ], + "properties": { + "http://schema.org/name": "Alice Developer" + }, + "links": [ + { + "rel": "self", + "type": "application/activity+json", + "href": "https://example.com/users/alice" + }, + { + "rel": "http://webfinger.net/rel/profile-page", + "type": "text/html", + "href": "https://example.com/@alice" + }, + { + "rel": "http://webfinger.net/rel/avatar", + "type": "image/jpeg", + "href": "https://example.com/avatars/alice.jpg" + } + ] +} +``` + +**Subject**: The resource being described (often same as query) +**Aliases**: Alternative identifiers for the same resource +**Properties**: Key-value metadata (property names must be URIs) +**Links**: Related resources with relationship types + +## Link Relations + +The `rel` field uses standardized link relation types: + +**IANA registered**: `self`, `alternate`, `canonical`, etc. +**WebFinger specific**: `http://webfinger.net/rel/profile-page`, etc. +**Custom/domain-specific**: Any URI works + +This extensibility allows WebFinger to serve many use cases while remaining standardized. + +## Static vs. Dynamic Resources + +The integration supports both approaches: + +### Static Resources + +Define specific resources explicitly: + +```typescript +webfinger: { + resources: [ + { + resource: 'acct:alice@example.com', + links: [...] + } + ] +} +``` + +Use this for a small, known set of identities. + +### Content Collection Integration + +Generate resources dynamically from Astro content collections: + +```typescript +webfinger: { + collections: [{ + name: 'team', + resourceTemplate: 'acct:{slug}@example.com', + linksBuilder: (member) => [...] + }] +} +``` + +This auto-generates WebFinger responses for all collection entries. Add a team member to your content collection, and they become discoverable via WebFinger automatically. 
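
To see what that discoverability looks like from the consuming side, here is a minimal lookup sketch using only the standard `fetch` API - no integration-specific helpers are assumed, and `alice@example.com` is a hypothetical handle:

```typescript
// Minimal WebFinger lookup sketch. Uses only the standard fetch API;
// the handle passed in is a hypothetical example.
interface JrdLink {
  rel: string;
  type?: string;
  href?: string;
}

interface Jrd {
  subject: string;
  aliases?: string[];
  properties?: Record<string, string | null>;
  links?: JrdLink[];
}

async function lookupProfilePage(handle: string): Promise<string | undefined> {
  const [, domain] = handle.split('@');
  const url = new URL(`https://${domain}/.well-known/webfinger`);
  url.searchParams.set('resource', `acct:${handle}`);

  const response = await fetch(url);
  if (!response.ok) return undefined;

  const jrd: Jrd = await response.json();
  // Prefer the human-readable profile page, fall back to the ActivityPub actor.
  const link =
    jrd.links?.find(l => l.rel === 'http://webfinger.net/rel/profile-page') ??
    jrd.links?.find(l => l.rel === 'self');
  return link?.href;
}

// Usage: await lookupProfilePage('alice@example.com') -> profile URL or undefined
```

This is essentially what a Mastodon instance or any other federated client does before it touches your ActivityPub endpoints: resolve the handle, then follow the advertised links.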
+ +## Template Variables + +Resource and subject templates support variables: + +- `{slug}`: Collection entry slug +- `{id}`: Collection entry ID +- `{data.fieldName}`: Any field from entry data +- `{siteURL}`: Your configured site URL + +Example: + +```typescript +resourceTemplate: 'acct:{data.username}@{siteURL.hostname}' +``` + +For a team member with `username: 'alice'` on `example.com`, this generates: +`acct:alice@example.com` + +## CORS and Security + +WebFinger responses include: + +``` +Access-Control-Allow-Origin: * +``` + +This is intentional - WebFinger is designed for public discovery. If information shouldn't be public, don't put it in WebFinger. + +The protocol assumes: + +- Resources are intentionally discoverable +- Information is public or intended for sharing +- Authentication happens at linked resources, not discovery layer + +## Rel Filtering + +Clients can request specific link types: + +``` +GET /.well-known/webfinger?resource=acct:alice@example.com&rel=self +``` + +The server returns only links matching that relation type. This reduces bandwidth and focuses the response. + +The integration handles this automatically. + +## Why Dynamic Routes + +Unlike other discovery files, WebFinger uses a dynamic route (`prerender: false`). This is because: + +1. Query parameters determine the response +2. Content collection resources may be numerous +3. Responses are lightweight enough to generate on-demand + +Static generation would require pre-rendering every possible query, which is impractical for collections. + +## Building for Federation + +If you want your site to participate in federated protocols: + +**Enable WebFinger**: Makes your users/resources discoverable +**Implement ActivityPub**: Provide the linked profile/actor endpoints +**Support WebFinger lookup**: Allow others to discover your resources + +WebFinger is the discovery layer; ActivityPub (or other protocols) provide the functionality. + +## Team/Author Discovery + +A common pattern for blogs and documentation: + +```typescript +webfinger: { + collections: [{ + name: 'authors', + resourceTemplate: 'acct:{slug}@myblog.com', + linksBuilder: (author) => [ + { + rel: 'http://webfinger.net/rel/profile-page', + href: `https://myblog.com/authors/${author.slug}`, + type: 'text/html' + }, + { + rel: 'http://webfinger.net/rel/avatar', + href: author.data.avatar, + type: 'image/jpeg' + } + ], + propertiesBuilder: (author) => ({ + 'http://schema.org/name': author.data.name, + 'http://schema.org/email': author.data.email + }) + }] +} +``` + +Now `acct:alice@myblog.com` resolves to Alice's author page, avatar, and contact info. + +## Testing WebFinger + +After deployment: + +1. Query directly: `curl 'https://example.com/.well-known/webfinger?resource=acct:alice@example.com'` +2. Use WebFinger validators/debuggers +3. Test from federated clients (Mastodon, etc.) +4. Verify CORS headers are present +5. Check rel filtering works + +## Privacy Considerations + +WebFinger makes information **discoverable**. Consider: + +- Don't expose private email addresses or contact info +- Limit to intentionally public resources +- Understand that responses are cached +- Remember `Access-Control-Allow-Origin: *` makes responses widely accessible + +If information shouldn't be public, don't include it in WebFinger responses. + +## Beyond Social Networks + +WebFinger isn't just for social media. 
Other applications: + +**Device discovery**: IoT devices announcing capabilities +**Service discovery**: API endpoints and configurations +**Calendar/availability**: Free/busy status and booking links +**Payment addresses**: Cryptocurrency addresses and payment methods +**Professional profiles**: Credentials, certifications, and portfolios + +The protocol is general-purpose resource discovery. + +## The Integration's Approach + +This integration makes WebFinger accessible without boilerplate: + +- Auto-generates from content collections +- Handles template variable substitution +- Manages CORS and rel filtering +- Provides type-safe configuration +- Supports both static and dynamic resources + +You define the mappings, the integration handles the protocol. + +## When to Use WebFinger + +Enable WebFinger if: + +- You want to participate in federated protocols +- Your site has user profiles or authors +- You're building decentralized services +- You want discoverable team members +- You're implementing OAuth/OpenID + +Skip it if: + +- Your site is purely informational with no identity component +- You don't want to expose resource discovery +- You're not integrating with federated services + +## Related Topics + +- [ActivityPub Integration](/how-to/activitypub/) - Building on WebFinger for federation +- [WebFinger Reference](/reference/webfinger/) - Complete configuration options +- [Content Collections](/how-to/content-collections/) - Dynamic resource generation diff --git a/docs/src/content/docs/explanation/why-discovery.md b/docs/src/content/docs/explanation/why-discovery.md index a09f86a..c578cd4 100644 --- a/docs/src/content/docs/explanation/why-discovery.md +++ b/docs/src/content/docs/explanation/why-discovery.md @@ -1,31 +1,130 @@ --- title: Why Use Discovery Files? -description: Understanding the importance of discovery files +description: Understanding the importance of discovery files for modern websites --- -Learn why discovery files are essential for modern websites and their benefits. +Discovery files are the polite introduction your website makes to the automated systems that visit it every day. Just as you might put up a sign directing visitors to your front door, these files tell bots, AI assistants, search engines, and other automated systems where to go and what they can do. -:::note[Work in Progress] -This page is currently being developed. Check back soon for complete documentation. -::: +## The Discovery Problem -## Coming Soon +Every website faces a fundamental challenge: how do automated systems know what your site contains, where security issues should be reported, or how AI assistants should interact with your content? -This section will include: -- Detailed explanations -- Code examples -- Best practices -- Common patterns -- Troubleshooting tips +Without standardized discovery mechanisms, each bot must guess. Search engines might crawl your entire site inefficiently. AI systems might misrepresent your content. Security researchers won't know how to contact you responsibly. Federated services can't find your user profiles. -## Related Pages +Discovery files solve this by providing **machine-readable contracts** that answer specific questions: -- [Configuration Reference](/reference/configuration/) -- [API Reference](/reference/api/) -- [Examples](/examples/ecommerce/) +- **robots.txt**: "What can I crawl and where?" +- **llms.txt**: "How should AI assistants understand and represent your site?" +- **humans.txt**: "Who built this and what technologies were used?" 
+- **security.txt**: "Where do I report security vulnerabilities?" +- **canary.txt**: "Has your organization received certain legal orders?" +- **webfinger**: "How do I discover user profiles and federated identities?" -## Need Help? +## Why Multiple Files? -- Check our [FAQ](/community/faq/) -- Visit [Troubleshooting](/community/troubleshooting/) -- Open an issue on [GitHub](https://github.com/withastro/astro-discovery/issues) +You might wonder why we need separate files instead of one unified discovery document. The answer lies in **separation of concerns** and **backwards compatibility**. + +Each file serves a distinct audience and purpose: + +- **robots.txt** targets web crawlers and has been the standard since 1994 +- **llms.txt** addresses the new reality of AI assistants processing web content +- **humans.txt** provides transparency for developers and users curious about your stack +- **security.txt** (RFC 9116) offers a standardized security contact mechanism +- **canary.txt** enables transparency about legal obligations +- **webfinger** (RFC 7033) enables decentralized resource discovery + +Different systems read different files. A search engine ignores humans.txt. A developer looking at your tech stack won't read robots.txt. A security researcher needs security.txt, not your sitemap. + +This modularity also means you can adopt discovery files incrementally. Start with robots.txt and sitemap.xml, add llms.txt when you want AI assistance, enable security.txt when you're ready to accept vulnerability reports. + +## The Visibility Trade-off + +Discovery files involve an important trade-off: **transparency versus obscurity**. + +By publishing robots.txt, you tell both polite crawlers and malicious scrapers about your site structure. Security.txt reveals your security team's contact information. Humans.txt exposes your technology stack. + +This is deliberate. Discovery files embrace the principle that **security through obscurity is not security**. The benefits of standardized, polite communication with automated systems outweigh the minimal risks of exposing this information. + +Consider that: + +- Attackers can discover your tech stack through other means (HTTP headers, page analysis, etc.) +- Security.txt makes responsible disclosure easier, reducing time-to-fix for vulnerabilities +- Robots.txt only controls *polite* bots - malicious actors ignore it anyway +- The transparency builds trust with users, developers, and security researchers + +## The Evolution of Discovery + +Discovery mechanisms have evolved alongside the web itself: + +**1994**: robots.txt emerges as an informal standard for crawler communication + +**2000s**: Sitemaps become essential for SEO as the web grows exponentially + +**2008**: humans.txt proposed to add personality and transparency to websites + +**2017**: RFC 9116 standardizes security.txt after years of ad-hoc security contact methods + +**2023**: llms.txt proposed as AI assistants become major consumers of web content + +**2024**: Warrant canaries and webfinger integration emerge for transparency and federation + +Each new discovery file addresses a real need that emerged as the web ecosystem grew. The integration brings them together because **modern websites need to communicate with an increasingly diverse set of automated visitors**. + +## Discovery as Infrastructure + +Think of discovery files as **critical infrastructure for your website**. 
They're not optional extras - they're the foundation for how your site interacts with the broader web ecosystem. + +Without proper discovery files: + +- Search engines may crawl inefficiently, wasting your server resources +- AI assistants may misunderstand your content or ignore important context +- Security researchers may struggle to report vulnerabilities responsibly +- Developers can't easily understand your technical choices +- Federated services can't integrate with your user profiles + +With comprehensive discovery: + +- You control how bots interact with your site +- AI assistants have proper context for representing your content +- Security issues can be reported through established channels +- Your tech stack and team are properly credited +- Your site integrates seamlessly with federated protocols + +## The Cost-Benefit Analysis + +Setting up discovery files manually for each project is tedious and error-prone. You need to: + +- Remember the correct format for each file type +- Keep URLs and sitemaps synchronized with your site config +- Update expiration dates for security.txt and canary.txt +- Maintain consistency across different discovery mechanisms +- Handle edge cases and RFC compliance + +An integration automates all of this, ensuring: + +- **Consistency**: All discovery files reference the same site URL +- **Correctness**: RFC compliance is handled automatically +- **Maintenance**: Expiration dates and timestamps update on each build +- **Flexibility**: Configuration changes propagate to all relevant files +- **Best Practices**: Sensible defaults that you can override as needed + +The cost is minimal - a single integration in your Astro config. The benefit is comprehensive, standards-compliant discovery across your entire site. + +## Looking Forward + +As the web continues to evolve, discovery mechanisms will too. We're already seeing: + +- AI systems becoming more sophisticated in how they consume web content +- Federated protocols gaining adoption for decentralized social networks +- Increased emphasis on security transparency and responsible disclosure +- Growing need for machine-readable metadata as automation increases + +Discovery files aren't a trend - they're fundamental communication protocols that will remain relevant as long as automated systems interact with websites. + +By implementing comprehensive discovery now, you're **future-proofing** your site for whatever new automated visitors emerge next. + +## Related Topics + +- [SEO Implications](/explanation/seo/) - How discovery files affect search rankings +- [AI Integration Strategy](/explanation/ai-integration/) - Making your content AI-friendly +- [Architecture](/explanation/architecture/) - How the integration works internally diff --git a/docs/src/content/docs/how-to/activitypub.md b/docs/src/content/docs/how-to/activitypub.md index 5db4c95..753c2ff 100644 --- a/docs/src/content/docs/how-to/activitypub.md +++ b/docs/src/content/docs/how-to/activitypub.md @@ -3,29 +3,382 @@ title: ActivityPub Integration description: Connect with the Fediverse via WebFinger --- -Set up ActivityPub integration to make your site discoverable on Mastodon and the Fediverse. +Enable WebFinger to make your site discoverable on Mastodon and other ActivityPub-compatible services in the Fediverse. -:::note[Work in Progress] -This page is currently being developed. Check back soon for complete documentation. 
-::: +## Prerequisites -## Coming Soon +- Integration installed and configured +- Understanding of ActivityPub and WebFinger protocols +- Knowledge of your site's user or author structure +- ActivityPub server endpoints (or static actor files) -This section will include: -- Detailed explanations -- Code examples -- Best practices -- Common patterns -- Troubleshooting tips +## Basic Static Profile -## Related Pages +Create a single discoverable profile: -- [Configuration Reference](/reference/configuration/) -- [API Reference](/reference/api/) -- [Examples](/examples/ecommerce/) +```typescript +// astro.config.mjs +discovery({ + webfinger: { + enabled: true, + resources: [ + { + resource: 'acct:yourname@example.com', + subject: 'acct:yourname@example.com', + aliases: [ + 'https://example.com/@yourname' + ], + links: [ + { + rel: 'http://webfinger.net/rel/profile-page', + type: 'text/html', + href: 'https://example.com/@yourname' + }, + { + rel: 'self', + type: 'application/activity+json', + href: 'https://example.com/users/yourname' + } + ] + } + ] + } +}) +``` -## Need Help? +Query: `GET /.well-known/webfinger?resource=acct:yourname@example.com` -- Check our [FAQ](/community/faq/) -- Visit [Troubleshooting](/community/troubleshooting/) -- Open an issue on [GitHub](https://github.com/withastro/astro-discovery/issues) +## Multiple Authors + +Enable discovery for all blog authors: + +```typescript +discovery({ + webfinger: { + enabled: true, + resources: [ + { + resource: 'acct:alice@example.com', + links: [ + { + rel: 'self', + type: 'application/activity+json', + href: 'https://example.com/users/alice' + }, + { + rel: 'http://webfinger.net/rel/profile-page', + href: 'https://example.com/authors/alice' + } + ] + }, + { + resource: 'acct:bob@example.com', + links: [ + { + rel: 'self', + type: 'application/activity+json', + href: 'https://example.com/users/bob' + }, + { + rel: 'http://webfinger.net/rel/profile-page', + href: 'https://example.com/authors/bob' + } + ] + } + ] + } +}) +``` + +## Dynamic Authors from Content Collection + +Load authors from Astro content collection: + +**Step 1**: Create authors collection: + +```typescript +// src/content.config.ts +const authorsCollection = defineCollection({ + type: 'data', + schema: z.object({ + name: z.string(), + email: z.string().email(), + bio: z.string(), + avatar: z.string().url(), + mastodon: z.string().optional(), + }) +}); +``` + +**Step 2**: Add author data: + +```yaml +# src/content/authors/alice.yaml +name: Alice Developer +email: alice@example.com +bio: Full-stack developer and writer +avatar: https://example.com/avatars/alice.jpg +mastodon: '@alice@mastodon.social' +``` + +**Step 3**: Configure WebFinger collection: + +```typescript +discovery({ + webfinger: { + enabled: true, + collections: [{ + name: 'authors', + resourceTemplate: 'acct:{slug}@example.com', + + linksBuilder: (author) => [ + { + rel: 'http://webfinger.net/rel/profile-page', + type: 'text/html', + href: `https://example.com/authors/${author.slug}` + }, + { + rel: 'http://webfinger.net/rel/avatar', + type: 'image/jpeg', + href: author.data.avatar + }, + { + rel: 'self', + type: 'application/activity+json', + href: `https://example.com/users/${author.slug}` + } + ], + + propertiesBuilder: (author) => ({ + 'http://schema.org/name': author.data.name, + 'http://schema.org/description': author.data.bio + }), + + aliasesBuilder: (author) => [ + `https://example.com/@${author.slug}` + ] + }] + } +}) +``` + +## Create ActivityPub Actor Endpoint + +WebFinger discovery 
requires an ActivityPub actor endpoint. Create it: + +```typescript +// src/pages/users/[author].json.ts +import type { APIRoute } from 'astro'; +import { getCollection } from 'astro:content'; + +export async function getStaticPaths() { + const authors = await getCollection('authors'); + + return authors.map(author => ({ + params: { author: author.slug } + })); +} + +export const GET: APIRoute = async ({ params, site }) => { + const authors = await getCollection('authors'); + const author = authors.find(a => a.slug === params.author); + + if (!author) { + return new Response(null, { status: 404 }); + } + + const actor = { + '@context': [ + 'https://www.w3.org/ns/activitystreams', + 'https://w3id.org/security/v1' + ], + 'type': 'Person', + 'id': `${site}users/${author.slug}`, + 'preferredUsername': author.slug, + 'name': author.data.name, + 'summary': author.data.bio, + 'url': `${site}authors/${author.slug}`, + 'icon': { + 'type': 'Image', + 'mediaType': 'image/jpeg', + 'url': author.data.avatar + }, + 'inbox': `${site}users/${author.slug}/inbox`, + 'outbox': `${site}users/${author.slug}/outbox`, + 'followers': `${site}users/${author.slug}/followers`, + 'following': `${site}users/${author.slug}/following`, + }; + + return new Response(JSON.stringify(actor, null, 2), { + status: 200, + headers: { + 'Content-Type': 'application/activity+json' + } + }); +}; +``` + +## Link from Mastodon + +Users can find your profile on Mastodon: + +1. Go to Mastodon search +2. Enter `@yourname@example.com` +3. Mastodon queries WebFinger at your site +4. Gets ActivityPub actor URL +5. Displays profile with follow button + +## Add Profile Link in Bio + +Link your Mastodon profile: + +```typescript +discovery({ + webfinger: { + enabled: true, + collections: [{ + name: 'authors', + resourceTemplate: 'acct:{slug}@example.com', + + linksBuilder: (author) => { + const links = [ + { + rel: 'self', + type: 'application/activity+json', + href: `https://example.com/users/${author.slug}` + } + ]; + + // Add Mastodon link if available + if (author.data.mastodon) { + const mastodonUrl = author.data.mastodon.startsWith('http') + ? author.data.mastodon + : `https://mastodon.social/${author.data.mastodon}`; + + links.push({ + rel: 'http://webfinger.net/rel/profile-page', + type: 'text/html', + href: mastodonUrl + }); + } + + return links; + } + }] + } +}) +``` + +## Testing WebFinger + +Test your WebFinger endpoint: + +```bash +# Build the site +npm run build +npm run preview + +# Test WebFinger query +curl "http://localhost:4321/.well-known/webfinger?resource=acct:alice@example.com" +``` + +Expected response: + +```json +{ + "subject": "acct:alice@example.com", + "aliases": [ + "https://example.com/@alice" + ], + "links": [ + { + "rel": "http://webfinger.net/rel/profile-page", + "type": "text/html", + "href": "https://example.com/authors/alice" + }, + { + "rel": "self", + "type": "application/activity+json", + "href": "https://example.com/users/alice" + } + ] +} +``` + +## Test ActivityPub Actor + +Verify actor endpoint: + +```bash +curl "http://localhost:4321/users/alice" \ + -H "Accept: application/activity+json" +``` + +Should return actor JSON with inbox, outbox, followers, etc. + +## Configure CORS + +WebFinger requires CORS headers: + +The integration automatically adds: +``` +Access-Control-Allow-Origin: * +``` + +For production with an ActivityPub server, configure appropriate CORS in your hosting. + +## Implement Full ActivityPub + +For complete Fediverse integration: + +1. 
**Implement inbox**: Handle incoming activities (follows, likes, shares) +2. **Implement outbox**: Serve your posts/activities +3. **Generate keypairs**: For signing activities +4. **Handle followers**: Maintain follower/following lists +5. **Send activities**: Notify followers of new posts + +This is beyond WebFinger scope. Consider using: +- [Bridgy Fed](https://fed.brid.gy/) for easy federation +- [WriteFreely](https://writefreely.org/) for federated blogging +- [GoToSocial](https://gotosocial.org/) for self-hosted instances + +## Expected Result + +Your site becomes discoverable in the Fediverse: + +1. Users search `@yourname@example.com` on Mastodon +2. Mastodon fetches WebFinger from `/.well-known/webfinger` +3. Gets ActivityPub actor URL +4. Displays your profile +5. Users can follow/interact (if full ActivityPub implemented) + +## Alternative Approaches + +**Static site**: Use WebFinger for discovery only, point to external Mastodon account. + +**Proxy to Mastodon**: WebFinger points to your Mastodon instance. + +**Bridgy Fed**: Use Bridgy Fed to handle ActivityPub protocol, just provide WebFinger. + +**Full implementation**: Build complete ActivityPub server with inbox/outbox. + +## Common Issues + +**WebFinger not found**: Ensure `webfinger.enabled: true` and resources/collections configured. + +**CORS errors**: Integration adds CORS automatically. Check if hosting overrides headers. + +**Actor URL 404**: Create the actor endpoint at the URL specified in WebFinger links. + +**Mastodon can't find profile**: Ensure `rel: 'self'` link with `type: 'application/activity+json'` exists. + +**Incorrect format**: WebFinger must return valid JRD JSON. Test with curl. + +**Case sensitivity**: Resource URIs are case-sensitive. `acct:alice@example.com` ≠ `acct:Alice@example.com` + +## Additional Resources + +- [WebFinger RFC 7033](https://datatracker.ietf.org/doc/html/rfc7033) +- [ActivityPub Spec](https://www.w3.org/TR/activitypub/) +- [Mastodon Documentation](https://docs.joinmastodon.org/) +- [Bridgy Fed](https://fed.brid.gy/) diff --git a/docs/src/content/docs/how-to/add-team-members.md b/docs/src/content/docs/how-to/add-team-members.md index f2eef83..575e15a 100644 --- a/docs/src/content/docs/how-to/add-team-members.md +++ b/docs/src/content/docs/how-to/add-team-members.md @@ -3,29 +3,248 @@ title: Add Team Members description: Add team member information to humans.txt --- -Learn how to add team members and collaborators to your humans.txt file. +Document your team and contributors in humans.txt for public recognition. -:::note[Work in Progress] -This page is currently being developed. Check back soon for complete documentation. -::: +## Prerequisites -## Coming Soon +- Integration installed and configured +- Team member information (names, roles, contact details) +- Permission from team members to share their information -This section will include: -- Detailed explanations -- Code examples -- Best practices -- Common patterns -- Troubleshooting tips +## Add a Single Team Member -## Related Pages +Configure basic team information: -- [Configuration Reference](/reference/configuration/) -- [API Reference](/reference/api/) -- [Examples](/examples/ecommerce/) +```typescript +// astro.config.mjs +discovery({ + humans: { + team: [ + { + name: 'Jane Developer', + role: 'Lead Developer', + contact: 'jane@example.com' + } + ] + } +}) +``` -## Need Help? 
+## Add Multiple Team Members -- Check our [FAQ](/community/faq/) -- Visit [Troubleshooting](/community/troubleshooting/) -- Open an issue on [GitHub](https://github.com/withastro/astro-discovery/issues) +Include your full team: + +```typescript +discovery({ + humans: { + team: [ + { + name: 'Jane Developer', + role: 'Lead Developer', + contact: 'jane@example.com', + location: 'San Francisco, CA' + }, + { + name: 'John Designer', + role: 'UI/UX Designer', + contact: 'john@example.com', + location: 'New York, NY' + }, + { + name: 'Sarah Product', + role: 'Product Manager', + location: 'London, UK' + } + ] + } +}) +``` + +## Include Social Media Profiles + +Add Twitter and GitHub handles: + +```typescript +discovery({ + humans: { + team: [ + { + name: 'Alex Dev', + role: 'Full Stack Developer', + contact: 'alex@example.com', + twitter: '@alexdev', + github: 'alex-codes' + } + ] + } +}) +``` + +## Load from Content Collections + +Dynamically generate team list from content: + +```typescript +import { getCollection } from 'astro:content'; + +discovery({ + humans: { + team: async () => { + const teamMembers = await getCollection('team'); + + return teamMembers.map(member => ({ + name: member.data.name, + role: member.data.role, + contact: member.data.email, + location: member.data.city, + twitter: member.data.twitter, + github: member.data.github + })); + } + } +}) +``` + +Create a content collection in `src/content/team/`: + +```yaml +# src/content/team/jane.yaml +name: Jane Developer +role: Lead Developer +email: jane@example.com +city: San Francisco, CA +twitter: '@janedev' +github: jane-codes +``` + +## Load from External Source + +Fetch team data from your API or database: + +```typescript +discovery({ + humans: { + team: async () => { + const response = await fetch('https://api.example.com/team'); + const teamData = await response.json(); + + return teamData.members.map(member => ({ + name: member.fullName, + role: member.position, + contact: member.publicEmail, + location: member.location + })); + } + } +}) +``` + +## Add Acknowledgments + +Thank contributors and inspirations: + +```typescript +discovery({ + humans: { + team: [/* ... */], + thanks: [ + 'The Astro team for the amazing framework', + 'All our open source contributors', + 'Stack Overflow community', + 'Our beta testers', + 'Coffee and late nights' + ] + } +}) +``` + +## Include Project Story + +Add context about your project: + +```typescript +discovery({ + humans: { + team: [/* ... */], + story: ` +This project was born from a hackathon in 2024. What started as +a weekend experiment grew into a tool used by thousands. Our team +came together from different timezones and backgrounds, united by +a passion for making the web more discoverable. + `.trim() + } +}) +``` + +## Add Fun Facts + +Make it personal: + +```typescript +discovery({ + humans: { + team: [/* ... 
*/], + funFacts: [ + 'Built entirely remotely across 4 continents', + 'Powered by 1,247 cups of coffee', + 'Deployed on a Friday (we live dangerously)', + 'First commit was at 2:47 AM', + 'Named after a recurring inside joke' + ] + } +}) +``` + +## Verify Your Configuration + +Build and check the output: + +```bash +npm run build +npm run preview +curl http://localhost:4321/humans.txt +``` + +## Expected Result + +Your humans.txt will contain formatted team information: + +``` +/* TEAM */ + + Name: Jane Developer + Role: Lead Developer + Contact: jane@example.com + From: San Francisco, CA + Twitter: @janedev + GitHub: jane-codes + + Name: John Designer + Role: UI/UX Designer + Contact: john@example.com + From: New York, NY + +/* THANKS */ + + The Astro team for the amazing framework + All our open source contributors + Coffee and late nights +``` + +## Alternative Approaches + +**Privacy-first**: Use team roles without names or contact details for privacy. + +**Department-based**: Group team members by department rather than listing individually. + +**Rotating spotlight**: Highlight different team members each month using dynamic content. + +## Common Issues + +**Missing permissions**: Always get consent before publishing personal information. + +**Outdated information**: Keep contact details current. Use dynamic loading to stay fresh. + +**Too much detail**: Stick to professional information. Avoid personal addresses or phone numbers. + +**Special characters**: Use plain ASCII in humans.txt. Avoid emojis unless necessary. diff --git a/docs/src/content/docs/how-to/block-bots.md b/docs/src/content/docs/how-to/block-bots.md index faa92b2..ea03410 100644 --- a/docs/src/content/docs/how-to/block-bots.md +++ b/docs/src/content/docs/how-to/block-bots.md @@ -1,31 +1,169 @@ --- title: Block Specific Bots -description: How to block unwanted bots from crawling your site +description: Control which bots can crawl your site using robots.txt rules --- -Learn how to block specific bots or user agents from accessing your site. +Block unwanted bots or user agents from accessing specific parts of your site. -:::note[Work in Progress] -This page is currently being developed. Check back soon for complete documentation. -::: +## Prerequisites -## Coming Soon +- Integration installed and configured +- Basic familiarity with robots.txt format +- Knowledge of which bot user agents to block -This section will include: -- Detailed explanations -- Code examples -- Best practices -- Common patterns -- Troubleshooting tips +## Block a Single Bot Completely -## Related Pages +To prevent a specific bot from crawling your entire site: -- [Configuration Reference](/reference/configuration/) -- [API Reference](/reference/api/) -- [Examples](/examples/ecommerce/) +```typescript +// astro.config.mjs +discovery({ + robots: { + additionalAgents: [ + { + userAgent: 'BadBot', + disallow: ['/'] + } + ] + } +}) +``` -## Need Help? +This creates a rule that blocks `BadBot` from all pages. 
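
If you maintain the list of unwanted crawlers elsewhere in your config, the same rules can be generated from that list instead of written out by hand. A sketch, where the bot names are placeholders rather than a vetted blocklist:

```typescript
// astro.config.mjs - sketch only; replace the placeholder names with the
// user agents you actually want to block.
const blockedBots = ['BadBot', 'SpamCrawler', 'ScrapeBot'];

discovery({
  robots: {
    additionalAgents: blockedBots.map(userAgent => ({
      userAgent,
      disallow: ['/']
    }))
  }
})
```

Each entry expands to its own `User-agent` block in the generated robots.txt, exactly as in the manual examples below.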
-- Check our [FAQ](/community/faq/) -- Visit [Troubleshooting](/community/troubleshooting/) -- Open an issue on [GitHub](https://github.com/withastro/astro-discovery/issues) +## Block Multiple Bots + +Add multiple entries to the `additionalAgents` array: + +```typescript +discovery({ + robots: { + additionalAgents: [ + { userAgent: 'BadBot', disallow: ['/'] }, + { userAgent: 'SpamCrawler', disallow: ['/'] }, + { userAgent: 'AnnoyingBot', disallow: ['/'] } + ] + } +}) +``` + +## Block Bots from Specific Paths + +Allow a bot access to most content, but block sensitive areas: + +```typescript +discovery({ + robots: { + additionalAgents: [ + { + userAgent: 'PriceBot', + allow: ['/'], + disallow: ['/checkout', '/account', '/api'] + } + ] + } +}) +``` + +**Order matters**: Specific rules (`/checkout`) should come after general rules (`/`). + +## Disable All LLM Bots + +To block all AI crawler bots: + +```typescript +discovery({ + robots: { + llmBots: { + enabled: false + } + } +}) +``` + +This removes the allow rules for Anthropic-AI, Claude-Web, GPTBot, and other LLM crawlers. + +## Block Specific LLM Bots + +Keep some LLM bots while blocking others: + +```typescript +discovery({ + robots: { + llmBots: { + enabled: true, + agents: ['Anthropic-AI', 'Claude-Web'] // Only allow these + }, + additionalAgents: [ + { userAgent: 'GPTBot', disallow: ['/'] }, + { userAgent: 'Google-Extended', disallow: ['/'] } + ] + } +}) +``` + +## Add Custom Rules + +For complex scenarios, use `customRules` to add raw robots.txt content: + +```typescript +discovery({ + robots: { + customRules: ` +# Block aggressive crawlers +User-agent: AggressiveBot +Crawl-delay: 30 +Disallow: / + +# Special rule for search engine +User-agent: Googlebot +Allow: /api/public +Disallow: /api/private + `.trim() + } +}) +``` + +## Verify Your Configuration + +After configuration, build your site and check `/robots.txt`: + +```bash +npm run build +npm run preview +curl http://localhost:4321/robots.txt +``` + +Look for your custom agent rules in the output. + +## Expected Result + +Your robots.txt will contain entries like: + +``` +User-agent: BadBot +Disallow: / + +User-agent: PriceBot +Allow: / +Disallow: /checkout +Disallow: /account +``` + +Blocked bots should respect these rules and avoid crawling restricted areas. + +## Alternative Approaches + +**Server-level blocking**: For malicious bots that ignore robots.txt, consider blocking at the server/firewall level. + +**User-agent detection**: Implement server-side detection to return 403 Forbidden for specific user agents. + +**Rate limiting**: Use crawl delays to slow down aggressive crawlers rather than blocking them completely. + +## Common Issues + +**Bots ignoring rules**: robots.txt is advisory only. Malicious bots may not respect it. + +**Overly broad patterns**: Be specific with disallow paths. `/api` blocks `/api/public` too. + +**Typos in user agents**: User agent strings are case-sensitive. Check bot documentation for exact values. diff --git a/docs/src/content/docs/how-to/cache-headers.md b/docs/src/content/docs/how-to/cache-headers.md index bbb96cf..4862952 100644 --- a/docs/src/content/docs/how-to/cache-headers.md +++ b/docs/src/content/docs/how-to/cache-headers.md @@ -3,29 +3,224 @@ title: Set Cache Headers description: Configure HTTP caching for discovery files --- -Optimize cache headers for discovery files to balance freshness and performance. +Optimize cache headers for discovery files to balance freshness with server load and client performance. 
-:::note[Work in Progress] -This page is currently being developed. Check back soon for complete documentation. -::: +## Prerequisites -## Coming Soon +- Integration installed and configured +- Understanding of HTTP caching concepts +- Knowledge of your content update frequency -This section will include: -- Detailed explanations -- Code examples -- Best practices -- Common patterns -- Troubleshooting tips +## Set Cache Duration for All Files -## Related Pages +Configure caching in seconds: -- [Configuration Reference](/reference/configuration/) -- [API Reference](/reference/api/) -- [Examples](/examples/ecommerce/) +```typescript +// astro.config.mjs +discovery({ + caching: { + robots: 3600, // 1 hour + llms: 3600, // 1 hour + humans: 86400, // 24 hours + security: 86400, // 24 hours + canary: 3600, // 1 hour + webfinger: 3600, // 1 hour + sitemap: 3600 // 1 hour + } +}) +``` -## Need Help? +These values set `Cache-Control: public, max-age=` headers. -- Check our [FAQ](/community/faq/) -- Visit [Troubleshooting](/community/troubleshooting/) -- Open an issue on [GitHub](https://github.com/withastro/astro-discovery/issues) +## Short Cache for Frequently Updated Content + +Update canary.txt daily? Use short cache: + +```typescript +discovery({ + caching: { + canary: 1800 // 30 minutes + } +}) +``` + +Bots will check for updates more frequently. + +## Long Cache for Static Content + +Rarely change humans.txt? Cache longer: + +```typescript +discovery({ + caching: { + humans: 604800 // 1 week (7 days) + } +}) +``` + +Reduces server load for static content. + +## Disable Caching for Development + +Different caching for development vs production: + +```typescript +discovery({ + caching: import.meta.env.PROD + ? { + // Production: aggressive caching + robots: 3600, + llms: 3600, + humans: 86400 + } + : { + // Development: no caching + robots: 0, + llms: 0, + humans: 0 + } +}) +``` + +Zero seconds means no caching (always fresh). + +## Match Cache to Update Frequency + +Align with your content update schedule: + +```typescript +discovery({ + caching: { + // Updated hourly via CI/CD + llms: 3600, // 1 hour + + // Updated daily + canary: 7200, // 2 hours (some buffer) + + // Updated weekly + humans: 86400, // 24 hours + + // Rarely changes + robots: 604800, // 1 week + security: 2592000 // 30 days + } +}) +``` + +## Conservative Caching + +When in doubt, cache shorter: + +```typescript +discovery({ + caching: { + robots: 1800, // 30 min + llms: 1800, // 30 min + humans: 3600, // 1 hour + sitemap: 1800 // 30 min + } +}) +``` + +Ensures content stays relatively fresh. 
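
Whichever profile you choose, raw seconds are easy to misread. One option is to derive every value from a few named constants so related files stay consistent - a sketch, with illustrative numbers rather than recommendations:

```typescript
// Sketch: durations derived from named constants. The values are assumptions -
// tune them to your own publishing cadence.
const MINUTE = 60;
const HOUR = 60 * MINUTE;
const DAY = 24 * HOUR;

discovery({
  caching: {
    canary: 30 * MINUTE,  // transparency file: check often
    robots: 1 * HOUR,
    llms: 1 * HOUR,
    sitemap: 1 * HOUR,
    humans: 1 * DAY,
    security: 7 * DAY
  }
})
```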
+ +## Aggressive Caching + +Optimize for performance when content is stable: + +```typescript +discovery({ + caching: { + robots: 86400, // 24 hours + llms: 43200, // 12 hours + humans: 604800, // 1 week + security: 2592000, // 30 days + sitemap: 86400 // 24 hours + } +}) +``` + +## Understand Cache Behavior + +Different cache durations affect different use cases: + +**robots.txt** (crawl bots): +- Short cache (1 hour): Quickly reflect changes to bot permissions +- Long cache (24 hours): Reduce load from frequent bot checks + +**llms.txt** (AI assistants): +- Short cache (1 hour): Keep instructions current +- Medium cache (6 hours): Balance freshness and performance + +**humans.txt** (curious visitors): +- Long cache (24 hours - 1 week): Team info changes rarely + +**security.txt** (security researchers): +- Long cache (24 hours - 30 days): Contact info is stable + +**canary.txt** (transparency): +- Short cache (30 min - 1 hour): Must be checked frequently + +## Verify Cache Headers + +Test with curl: + +```bash +npm run build +npm run preview + +# Check cache headers +curl -I http://localhost:4321/robots.txt +curl -I http://localhost:4321/llms.txt +curl -I http://localhost:4321/humans.txt +``` + +Look for `Cache-Control` header in the response: + +``` +Cache-Control: public, max-age=3600 +``` + +## Expected Result + +Browsers and CDNs will cache files according to your settings. Subsequent requests within the cache period will be served from cache, reducing server load. + +For a 1-hour cache: +1. First request at 10:00 AM: Server serves fresh content +2. Request at 10:30 AM: Served from cache +3. Request at 11:01 AM: Cache expired, server serves fresh content + +## Alternative Approaches + +**CDN-level caching**: Configure caching at your CDN (Cloudflare, Fastly) rather than in the integration. + +**Surrogate-Control header**: Use `Surrogate-Control` for CDN caching while controlling browser cache separately. + +**ETags**: Add ETag support for efficient conditional requests. + +**Vary header**: Consider adding `Vary: Accept-Encoding` for compressed responses. + +## Common Issues + +**Cache too long**: Content changes not reflected quickly. Reduce cache duration. + +**Cache too short**: High server load from repeated requests. Increase cache duration. + +**No caching in production**: Check if your hosting platform overrides headers. + +**Stale content after updates**: Deploy a new version with a build timestamp to bust caches. + +**Different behavior in CDN**: CDN may have its own caching rules. Check CDN configuration. + +## Cache Duration Guidelines + +**Rule of thumb**: +- Update frequency = Daily → Cache 2-6 hours +- Update frequency = Weekly → Cache 12-24 hours +- Update frequency = Monthly → Cache 1-7 days +- Update frequency = Rarely → Cache 7-30 days + +**Special cases**: +- Canary.txt: Cache < update frequency (if daily, cache 2-12 hours) +- Security.txt: Cache longer (expires field handles staleness) +- Development: Cache 0 or very short (60 seconds) diff --git a/docs/src/content/docs/how-to/content-collections.md b/docs/src/content/docs/how-to/content-collections.md index 9a751da..23a3c84 100644 --- a/docs/src/content/docs/how-to/content-collections.md +++ b/docs/src/content/docs/how-to/content-collections.md @@ -3,29 +3,376 @@ title: Use with Content Collections description: Integrate with Astro content collections --- -Automatically generate discovery content from your Astro content collections. 
+Automatically generate discovery content from your Astro content collections for dynamic, maintainable configuration. -:::note[Work in Progress] -This page is currently being developed. Check back soon for complete documentation. -::: +## Prerequisites -## Coming Soon +- Integration installed and configured +- Astro content collections set up +- Understanding of async configuration functions -This section will include: -- Detailed explanations -- Code examples -- Best practices -- Common patterns -- Troubleshooting tips +## Load Team from Collection -## Related Pages +Create a team content collection and populate humans.txt: -- [Configuration Reference](/reference/configuration/) -- [API Reference](/reference/api/) -- [Examples](/examples/ecommerce/) +**Step 1**: Define the collection schema: -## Need Help? +```typescript +// src/content.config.ts +import { defineCollection, z } from 'astro:content'; -- Check our [FAQ](/community/faq/) -- Visit [Troubleshooting](/community/troubleshooting/) -- Open an issue on [GitHub](https://github.com/withastro/astro-discovery/issues) +const teamCollection = defineCollection({ + type: 'data', + schema: z.object({ + name: z.string(), + role: z.string(), + email: z.string().email(), + location: z.string().optional(), + twitter: z.string().optional(), + github: z.string().optional(), + }) +}); + +export const collections = { + team: teamCollection +}; +``` + +**Step 2**: Add team members: + +```yaml +# src/content/team/alice.yaml +name: Alice Johnson +role: Lead Developer +email: alice@example.com +location: San Francisco, CA +github: alice-codes +``` + +```yaml +# src/content/team/bob.yaml +name: Bob Smith +role: Designer +email: bob@example.com +location: New York, NY +twitter: '@bobdesigns' +``` + +**Step 3**: Load in discovery config: + +```typescript +// astro.config.mjs +import { getCollection } from 'astro:content'; + +discovery({ + humans: { + team: async () => { + const members = await getCollection('team'); + + return members.map(member => ({ + name: member.data.name, + role: member.data.role, + contact: member.data.email, + location: member.data.location, + twitter: member.data.twitter, + github: member.data.github + })); + } + } +}) +``` + +## Generate Important Pages from Docs + +List featured documentation pages in llms.txt: + +**Step 1**: Add featured flag to doc frontmatter: + +```markdown +--- +# src/content/docs/getting-started.md +title: Getting Started Guide +description: Quick start guide for new users +featured: true +--- +``` + +**Step 2**: Load featured docs: + +```typescript +discovery({ + llms: { + importantPages: async () => { + const docs = await getCollection('docs'); + + return docs + .filter(doc => doc.data.featured) + .sort((a, b) => (a.data.order || 0) - (b.data.order || 0)) + .map(doc => ({ + name: doc.data.title, + path: `/docs/${doc.slug}`, + description: doc.data.description + })); + } + } +}) +``` + +## WebFinger from Author Collection + +Make blog authors discoverable via WebFinger: + +**Step 1**: Define authors collection: + +```typescript +// src/content.config.ts +const authorsCollection = defineCollection({ + type: 'data', + schema: z.object({ + name: z.string(), + email: z.string().email(), + bio: z.string(), + avatar: z.string().url(), + mastodon: z.string().url().optional(), + website: z.string().url().optional() + }) +}); +``` + +**Step 2**: Add author data: + +```yaml +# src/content/authors/alice.yaml +name: Alice Developer +email: alice@example.com +bio: Full-stack developer and open source enthusiast 
+avatar: https://example.com/avatars/alice.jpg +mastodon: https://mastodon.social/@alice +website: https://alice.dev +``` + +**Step 3**: Configure WebFinger: + +```typescript +discovery({ + webfinger: { + enabled: true, + collections: [{ + name: 'authors', + resourceTemplate: 'acct:{slug}@example.com', + + linksBuilder: (author) => [ + { + rel: 'http://webfinger.net/rel/profile-page', + type: 'text/html', + href: `https://example.com/authors/${author.slug}` + }, + { + rel: 'http://webfinger.net/rel/avatar', + type: 'image/jpeg', + href: author.data.avatar + }, + ...(author.data.mastodon ? [{ + rel: 'self', + type: 'application/activity+json', + href: author.data.mastodon + }] : []) + ], + + propertiesBuilder: (author) => ({ + 'http://schema.org/name': author.data.name, + 'http://schema.org/description': author.data.bio + }) + }] + } +}) +``` + +Query with: `GET /.well-known/webfinger?resource=acct:alice@example.com` + +## Load API Endpoints from Spec + +Generate API documentation from a collection: + +```typescript +// src/content.config.ts +const apiCollection = defineCollection({ + type: 'data', + schema: z.object({ + path: z.string(), + method: z.enum(['GET', 'POST', 'PUT', 'DELETE', 'PATCH']), + description: z.string(), + public: z.boolean().default(true) + }) +}); +``` + +```yaml +# src/content/api/search.yaml +path: /api/search +method: GET +description: Search products by name, category, or tag +public: true +``` + +```typescript +discovery({ + llms: { + apiEndpoints: async () => { + const endpoints = await getCollection('api'); + + return endpoints + .filter(ep => ep.data.public) + .map(ep => ({ + path: ep.data.path, + method: ep.data.method, + description: ep.data.description + })); + } + } +}) +``` + +## Multiple Collections + +Combine data from several collections: + +```typescript +discovery({ + humans: { + team: async () => { + const [coreTeam, contributors] = await Promise.all([ + getCollection('team'), + getCollection('contributors') + ]); + + return [ + ...coreTeam.map(m => ({ ...m.data, role: `Core - ${m.data.role}` })), + ...contributors.map(m => ({ ...m.data, role: `Contributor - ${m.data.role}` })) + ]; + }, + + thanks: async () => { + const sponsors = await getCollection('sponsors'); + return sponsors.map(s => s.data.name); + } + } +}) +``` + +## Filter and Sort Collections + +Control which items are included: + +```typescript +discovery({ + llms: { + importantPages: async () => { + const allDocs = await getCollection('docs'); + + return allDocs + // Only published docs + .filter(doc => doc.data.published !== false) + // Only important ones + .filter(doc => doc.data.priority === 'high') + // Sort by custom order + .sort((a, b) => { + const orderA = a.data.order ?? 999; + const orderB = b.data.order ?? 
999; + return orderA - orderB; + }) + // Map to format + .map(doc => ({ + name: doc.data.title, + path: `/docs/${doc.slug}`, + description: doc.data.description + })); + } + } +}) +``` + +## Localized Content + +Support multiple languages: + +```typescript +discovery({ + llms: { + importantPages: async () => { + const docs = await getCollection('docs'); + + // Group by language + const enDocs = docs.filter(d => d.slug.startsWith('en/')); + const esDocs = docs.filter(d => d.slug.startsWith('es/')); + + // Return English docs, with links to translations + return enDocs.map(doc => ({ + name: doc.data.title, + path: `/docs/${doc.slug}`, + description: doc.data.description, + // Could add: translations: ['/docs/es/...'] + })); + } + } +}) +``` + +## Cache Collection Queries + +Optimize build performance: + +```typescript +// Cache at module level +let cachedTeam = null; + +discovery({ + humans: { + team: async () => { + if (!cachedTeam) { + const members = await getCollection('team'); + cachedTeam = members.map(m => ({ + name: m.data.name, + role: m.data.role, + contact: m.data.email + })); + } + return cachedTeam; + } + } +}) +``` + +## Expected Result + +Content collections automatically populate discovery files: + +**Adding a team member**: +1. Create `src/content/team/new-member.yaml` +2. Run `npm run build` +3. humans.txt includes new member + +**Marking a doc as featured**: +1. Add `featured: true` to frontmatter +2. Run `npm run build` +3. llms.txt lists the new important page + +## Alternative Approaches + +**Static data**: Use plain JavaScript objects when data rarely changes. + +**External API**: Fetch from CMS or API during build instead of using collections. + +**Hybrid**: Use collections for core data, enhance with API data. + +## Common Issues + +**Async not awaited**: Ensure you use `async () => {}` and `await getCollection()`. + +**Build-time only**: Collections are loaded at build time, not runtime. + +**Type errors**: Ensure collection schema matches the data structure you're mapping. + +**Missing data**: Check that collection files exist and match the schema. + +**Slow builds**: Cache collection queries if used multiple times in config. diff --git a/docs/src/content/docs/how-to/custom-templates.md b/docs/src/content/docs/how-to/custom-templates.md index ff0f263..fcafb19 100644 --- a/docs/src/content/docs/how-to/custom-templates.md +++ b/docs/src/content/docs/how-to/custom-templates.md @@ -3,29 +3,417 @@ title: Custom Templates description: Create custom templates for discovery files --- -Override default templates to fully customize the output of discovery files. +Override default templates to fully customize the output format of discovery files. -:::note[Work in Progress] -This page is currently being developed. Check back soon for complete documentation. -::: +## Prerequisites -## Coming Soon +- Integration installed and configured +- Understanding of the file formats (robots.txt, llms.txt, etc.) +- Knowledge of template function signatures -This section will include: -- Detailed explanations -- Code examples -- Best practices -- Common patterns -- Troubleshooting tips +## Override robots.txt Template -## Related Pages +Complete control over robots.txt output: -- [Configuration Reference](/reference/configuration/) -- [API Reference](/reference/api/) -- [Examples](/examples/ecommerce/) +```typescript +// astro.config.mjs +discovery({ + templates: { + robots: (config, siteURL) => { + const lines = []; -## Need Help? 
+ // Custom header + lines.push('# Custom robots.txt'); + lines.push(`# Site: ${siteURL.hostname}`); + lines.push('# Last generated: ' + new Date().toISOString()); + lines.push(''); -- Check our [FAQ](/community/faq/) -- Visit [Troubleshooting](/community/troubleshooting/) -- Open an issue on [GitHub](https://github.com/withastro/astro-discovery/issues) + // Default rule + lines.push('User-agent: *'); + lines.push('Allow: /'); + lines.push(''); + + // Add sitemap + lines.push(`Sitemap: ${new URL('sitemap-index.xml', siteURL).href}`); + + return lines.join('\n') + '\n'; + } + } +}) +``` + +## Override llms.txt Template + +Custom format for AI instructions: + +```typescript +discovery({ + templates: { + llms: async (config, siteURL) => { + const lines = []; + + // Header + lines.push(`=`.repeat(60)); + lines.push(`AI ASSISTANT GUIDE FOR ${siteURL.hostname.toUpperCase()}`); + lines.push(`=`.repeat(60)); + lines.push(''); + + // Description + const description = typeof config.description === 'function' + ? config.description() + : config.description; + + if (description) { + lines.push(description); + lines.push(''); + } + + // Instructions + if (config.instructions) { + lines.push('IMPORTANT INSTRUCTIONS:'); + lines.push(config.instructions); + lines.push(''); + } + + // API endpoints in custom format + if (config.apiEndpoints && config.apiEndpoints.length > 0) { + lines.push('AVAILABLE APIs:'); + config.apiEndpoints.forEach(ep => { + lines.push(` [${ep.method || 'GET'}] ${ep.path}`); + lines.push(` → ${ep.description}`); + }); + lines.push(''); + } + + // Footer + lines.push(`=`.repeat(60)); + lines.push(`Generated: ${new Date().toISOString()}`); + + return lines.join('\n') + '\n'; + } + } +}) +``` + +## Override humans.txt Template + +Custom humans.txt format: + +```typescript +discovery({ + templates: { + humans: (config, siteURL) => { + const lines = []; + + lines.push('========================================'); + lines.push(' HUMANS BEHIND THE SITE '); + lines.push('========================================'); + lines.push(''); + + // Team in custom format + if (config.team && config.team.length > 0) { + lines.push('OUR TEAM:'); + lines.push(''); + + config.team.forEach((member, i) => { + if (i > 0) lines.push('---'); + + lines.push(`Name : ${member.name}`); + if (member.role) lines.push(`Role : ${member.role}`); + if (member.contact) lines.push(`Email : ${member.contact}`); + if (member.github) lines.push(`GitHub : https://github.com/${member.github}`); + lines.push(''); + }); + } + + // Stack info + if (config.site?.techStack) { + lines.push('BUILT WITH:'); + lines.push(config.site.techStack.join(' | ')); + lines.push(''); + } + + return lines.join('\n') + '\n'; + } + } +}) +``` + +## Override security.txt Template + +Custom security.txt with additional fields: + +```typescript +discovery({ + templates: { + security: (config, siteURL) => { + const lines = []; + + // Canonical (required by RFC 9116) + const canonical = config.canonical || + new URL('.well-known/security.txt', siteURL).href; + lines.push(`Canonical: ${canonical}`); + + // Contact (required) + const contacts = Array.isArray(config.contact) + ? config.contact + : [config.contact]; + + contacts.forEach(contact => { + const contactValue = contact.includes('@') && !contact.startsWith('mailto:') + ? `mailto:${contact}` + : contact; + lines.push(`Contact: ${contactValue}`); + }); + + // Expires (recommended) + const expires = config.expires === 'auto' + ? 
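        // 'auto' assumed here to mean one year from build time. RFC 9116 recommends an
        // Expires value less than a year in the future, so shorten this window if your
        // security contact details are reviewed more frequently.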
new Date(Date.now() + 365 * 24 * 60 * 60 * 1000).toISOString() + : config.expires; + + if (expires) { + lines.push(`Expires: ${expires}`); + } + + // Optional fields + if (config.encryption) { + const encryptions = Array.isArray(config.encryption) + ? config.encryption + : [config.encryption]; + encryptions.forEach(enc => lines.push(`Encryption: ${enc}`)); + } + + if (config.policy) { + lines.push(`Policy: ${config.policy}`); + } + + if (config.acknowledgments) { + lines.push(`Acknowledgments: ${config.acknowledgments}`); + } + + // Add custom comment + lines.push(''); + lines.push('# Thank you for helping keep our users safe!'); + + return lines.join('\n') + '\n'; + } + } +}) +``` + +## Override canary.txt Template + +Custom warrant canary format: + +```typescript +discovery({ + templates: { + canary: (config, siteURL) => { + const lines = []; + const today = new Date().toISOString().split('T')[0]; + + lines.push('=== WARRANT CANARY ==='); + lines.push(''); + lines.push(`Organization: ${config.organization || siteURL.hostname}`); + lines.push(`Date Issued: ${today}`); + lines.push(''); + + lines.push('As of this date, we confirm:'); + lines.push(''); + + // List what has NOT been received + const statements = typeof config.statements === 'function' + ? config.statements() + : config.statements || []; + + statements + .filter(s => !s.received) + .forEach(statement => { + lines.push(`✓ NO ${statement.description} received`); + }); + + lines.push(''); + lines.push('This canary will be updated regularly.'); + lines.push('Absence of an update should be considered significant.'); + lines.push(''); + + if (config.verification) { + lines.push(`Verification: ${config.verification}`); + } + + return lines.join('\n') + '\n'; + } + } +}) +``` + +## Combine Default Generator with Custom Content + +Use default generator, add custom content: + +```typescript +import { generateRobotsTxt } from '@astrojs/discovery/generators'; + +discovery({ + templates: { + robots: (config, siteURL) => { + // Generate default content + const defaultContent = generateRobotsTxt(config, siteURL); + + // Add custom rules + const customRules = ` +# Custom section +User-agent: MySpecialBot +Crawl-delay: 20 +Allow: /special + +# Rate limiting comment +# Please be respectful of our server resources + `.trim(); + + return defaultContent + '\n\n' + customRules + '\n'; + } + } +}) +``` + +## Load Template from File + +Keep templates separate: + +```typescript +// templates/robots.txt.js +export default (config, siteURL) => { + return ` +User-agent: * +Allow: / + +Sitemap: ${new URL('sitemap-index.xml', siteURL).href} + `.trim() + '\n'; +}; +``` + +```typescript +// astro.config.mjs +import robotsTemplate from './templates/robots.txt.js'; + +discovery({ + templates: { + robots: robotsTemplate + } +}) +``` + +## Conditional Template Logic + +Different templates per environment: + +```typescript +discovery({ + templates: { + llms: import.meta.env.PROD + ? 
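+      // import.meta.env.PROD is resolved at build time, so the selected template
+      // is fixed per build rather than switched at runtime.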
(config, siteURL) => { + // Production: detailed guide + return `# Production site guide\n...detailed content...`; + } + : (config, siteURL) => { + // Development: simple warning + return `# Development environment\nThis is a development site.\n`; + } + } +}) +``` + +## Template with External Data + +Fetch additional data in template: + +```typescript +discovery({ + templates: { + llms: async (config, siteURL) => { + // Fetch latest API spec + const response = await fetch('https://api.example.com/openapi.json'); + const spec = await response.json(); + + const lines = []; + lines.push(`# ${siteURL.hostname} API Guide`); + lines.push(''); + lines.push('Available endpoints:'); + + Object.entries(spec.paths).forEach(([path, methods]) => { + Object.keys(methods).forEach(method => { + lines.push(`- ${method.toUpperCase()} ${path}`); + }); + }); + + return lines.join('\n') + '\n'; + } + } +}) +``` + +## Verify Custom Templates + +Test your templates: + +```bash +npm run build +npm run preview + +# Check each file +curl http://localhost:4321/robots.txt +curl http://localhost:4321/llms.txt +curl http://localhost:4321/humans.txt +curl http://localhost:4321/.well-known/security.txt +``` + +Ensure format is correct and content appears as expected. + +## Expected Result + +Your custom templates completely control output format: + +**Custom robots.txt**: +``` +# Custom robots.txt +# Site: example.com +# Last generated: 2025-11-08T12:00:00.000Z + +User-agent: * +Allow: / + +Sitemap: https://example.com/sitemap-index.xml +``` + +**Custom llms.txt**: +``` +============================================================ +AI ASSISTANT GUIDE FOR EXAMPLE.COM +============================================================ + +Your site description here + +IMPORTANT INSTRUCTIONS: +... +``` + +## Alternative Approaches + +**Partial overrides**: Extend default generators rather than replacing entirely. + +**Post-processing**: Generate default content, then modify it with string manipulation. + +**Multiple templates**: Use different templates based on configuration flags. + +## Common Issues + +**Missing newline at end**: Ensure template returns content ending with `\n`. + +**Async templates**: llms.txt template can be async, others are sync. Don't mix. + +**Type errors**: Template signature must match: `(config: Config, siteURL: URL) => string` + +**Breaking specs**: security.txt and robots.txt have specific formats. Don't break them. + +**Config not available**: Only config passed to that section is available. Can't access other sections. diff --git a/docs/src/content/docs/how-to/customize-llm-instructions.md b/docs/src/content/docs/how-to/customize-llm-instructions.md index ebd78ae..7765328 100644 --- a/docs/src/content/docs/how-to/customize-llm-instructions.md +++ b/docs/src/content/docs/how-to/customize-llm-instructions.md @@ -1,31 +1,255 @@ --- title: Customize LLM Instructions -description: Provide specific instructions for AI assistants +description: Provide custom instructions for AI assistants using llms.txt --- -Create custom instructions for AI assistants to follow when helping users with your site. +Configure how AI assistants interact with your site by customizing instructions in llms.txt. -:::note[Work in Progress] -This page is currently being developed. Check back soon for complete documentation. 
-::: +## Prerequisites -## Coming Soon +- Integration installed and configured +- Understanding of your site's main use cases +- Knowledge of your API endpoints (if applicable) -This section will include: -- Detailed explanations -- Code examples -- Best practices -- Common patterns -- Troubleshooting tips +## Add Basic Instructions -## Related Pages +Provide clear guidance for AI assistants: -- [Configuration Reference](/reference/configuration/) -- [API Reference](/reference/api/) -- [Examples](/examples/ecommerce/) +```typescript +// astro.config.mjs +discovery({ + llms: { + description: 'Technical documentation for the Discovery API', + instructions: ` +When helping users with this site: +1. Check the documentation before answering +2. Provide code examples when relevant +3. Link to specific documentation pages +4. Use the search API for queries + `.trim() + } +}) +``` -## Need Help? +## Highlight Key Features -- Check our [FAQ](/community/faq/) -- Visit [Troubleshooting](/community/troubleshooting/) -- Open an issue on [GitHub](https://github.com/withastro/astro-discovery/issues) +Guide AI assistants to important capabilities: + +```typescript +discovery({ + llms: { + description: 'E-commerce platform for sustainable products', + keyFeatures: [ + 'Carbon footprint calculator for all products', + 'Subscription management with flexible billing', + 'AI-powered product recommendations', + 'Real-time inventory tracking' + ] + } +}) +``` + +## Document Important Pages + +Direct AI assistants to critical resources: + +```typescript +discovery({ + llms: { + importantPages: [ + { + name: 'API Documentation', + path: '/docs/api', + description: 'Complete API reference with examples' + }, + { + name: 'Getting Started Guide', + path: '/docs/quick-start', + description: 'Step-by-step setup instructions' + }, + { + name: 'FAQ', + path: '/help/faq', + description: 'Common questions and solutions' + } + ] + } +}) +``` + +## Describe Your APIs + +Help AI assistants use your endpoints correctly: + +```typescript +discovery({ + llms: { + apiEndpoints: [ + { + path: '/api/search', + method: 'GET', + description: 'Search products by name, category, or tag' + }, + { + path: '/api/products/:id', + method: 'GET', + description: 'Get detailed product information' + }, + { + path: '/api/calculate-carbon', + method: 'POST', + description: 'Calculate carbon footprint for a cart' + } + ] + } +}) +``` + +## Set Brand Voice Guidelines + +Maintain consistent communication style: + +```typescript +discovery({ + llms: { + brandVoice: [ + 'Professional yet approachable', + 'Focus on sustainability and environmental impact', + 'Use concrete examples, not abstract concepts', + 'Avoid jargon unless explaining technical features', + 'Emphasize long-term value over short-term savings' + ] + } +}) +``` + +## Load Content Dynamically + +Pull important pages from content collections: + +```typescript +import { getCollection } from 'astro:content'; + +discovery({ + llms: { + importantPages: async () => { + const docs = await getCollection('docs'); + + // Filter to featured pages only + return docs + .filter(doc => doc.data.featured) + .map(doc => ({ + name: doc.data.title, + path: `/docs/${doc.slug}`, + description: doc.data.description + })); + } + } +}) +``` + +## Add Custom Sections + +Include specialized information: + +```typescript +discovery({ + llms: { + customSections: { + 'Data Privacy': ` +We are GDPR compliant. User data is encrypted at rest and in transit. 
+Data retention policy: 90 days for analytics, 7 years for transactions. + `.trim(), + + 'Rate Limits': ` +API rate limits: +- Authenticated: 1000 requests/hour +- Anonymous: 60 requests/hour +- Burst: 20 requests/second + `.trim(), + + 'Support Channels': ` +For assistance: +- Documentation: https://example.com/docs +- Email: support@example.com (response within 24h) +- Community: https://discord.gg/example + `.trim() + } + } +}) +``` + +## Environment-Specific Instructions + +Different instructions for development vs production: + +```typescript +discovery({ + llms: { + instructions: import.meta.env.PROD + ? `Production site - use live API endpoints at https://api.example.com` + : `Development site - API endpoints may be mocked or unavailable` + } +}) +``` + +## Verify Your Configuration + +Build and check the output: + +```bash +npm run build +npm run preview +curl http://localhost:4321/llms.txt +``` + +Look for your instructions, features, and API documentation in the formatted output. + +## Expected Result + +Your llms.txt will contain structured information: + +```markdown +# example.com + +> E-commerce platform for sustainable products + +--- + +## Key Features + +- Carbon footprint calculator for all products +- AI-powered product recommendations + +## Instructions for AI Assistants + +When helping users with this site: +1. Check the documentation before answering +2. Provide code examples when relevant + +## API Endpoints + +- `GET /api/search` + Search products by name, category, or tag + Full URL: https://example.com/api/search +``` + +AI assistants will use this information to provide accurate, context-aware help. + +## Alternative Approaches + +**Multiple llms.txt files**: Create llms-full.txt for comprehensive docs, llms.txt for summary. + +**Dynamic generation**: Use a build script to extract API docs from OpenAPI specs. + +**Language-specific versions**: Generate different files for different locales (llms-en.txt, llms-es.txt). + +## Common Issues + +**Too much information**: Keep it concise. AI assistants prefer focused, actionable guidance. + +**Outdated instructions**: Use `lastUpdate: 'auto'` or automate updates from your CMS. + +**Missing context**: Don't assume knowledge. Explain domain-specific terms and workflows. + +**Unclear priorities**: List most important pages/features first. AI assistants may prioritize early content. diff --git a/docs/src/content/docs/how-to/environment-config.md b/docs/src/content/docs/how-to/environment-config.md index e1525c0..3961d2f 100644 --- a/docs/src/content/docs/how-to/environment-config.md +++ b/docs/src/content/docs/how-to/environment-config.md @@ -3,29 +3,322 @@ title: Environment-specific Configuration description: Use different configs for dev and production --- -Configure different settings for development and production environments. +Configure different settings for development and production environments to optimize for local testing vs deployed sites. -:::note[Work in Progress] -This page is currently being developed. Check back soon for complete documentation. 
-::: +## Prerequisites -## Coming Soon +- Integration installed and configured +- Understanding of Astro environment variables +- Knowledge of your deployment setup -This section will include: -- Detailed explanations -- Code examples -- Best practices -- Common patterns -- Troubleshooting tips +## Basic Environment Switching -## Related Pages +Use `import.meta.env.PROD` to detect production: -- [Configuration Reference](/reference/configuration/) -- [API Reference](/reference/api/) -- [Examples](/examples/ecommerce/) +```typescript +// astro.config.mjs +discovery({ + robots: { + // Block all bots in development + allowAllBots: import.meta.env.PROD + } +}) +``` -## Need Help? +Development: Bots blocked. Production: Bots allowed. -- Check our [FAQ](/community/faq/) -- Visit [Troubleshooting](/community/troubleshooting/) -- Open an issue on [GitHub](https://github.com/withastro/astro-discovery/issues) +## Different Site URLs + +Use different domains for staging and production: + +```typescript +export default defineConfig({ + site: import.meta.env.PROD + ? 'https://example.com' + : 'http://localhost:4321', + + integrations: [ + discovery({ + // Config automatically uses correct site URL + }) + ] +}) +``` + +## Conditional Feature Enablement + +Enable security.txt and canary.txt only in production: + +```typescript +discovery({ + security: import.meta.env.PROD + ? { + contact: 'security@example.com', + expires: 'auto' + } + : undefined, // Disabled in development + + canary: import.meta.env.PROD + ? { + organization: 'Example Corp', + contact: 'canary@example.com', + frequency: 'monthly' + } + : undefined // Disabled in development +}) +``` + +## Environment-Specific Instructions + +Different LLM instructions for each environment: + +```typescript +discovery({ + llms: { + description: import.meta.env.PROD + ? 'Production e-commerce platform' + : 'Development/Staging environment - data may be test data', + + instructions: import.meta.env.PROD + ? ` +When helping users: +1. Use production API at https://api.example.com +2. Data is live - be careful with modifications +3. Refer to https://docs.example.com for documentation + `.trim() + : ` +Development environment - for testing only: +1. API endpoints may be mocked +2. Database is reset nightly +3. Some features may not work + `.trim() + } +}) +``` + +## Custom Environment Variables + +Use `.env` files for configuration: + +```bash +# .env.production +PUBLIC_SECURITY_EMAIL=security@example.com +PUBLIC_CANARY_ENABLED=true +PUBLIC_CONTACT_EMAIL=contact@example.com + +# .env.development +PUBLIC_SECURITY_EMAIL=dev-security@localhost +PUBLIC_CANARY_ENABLED=false +PUBLIC_CONTACT_EMAIL=dev@localhost +``` + +Then use in config: + +```typescript +discovery({ + security: import.meta.env.PUBLIC_CANARY_ENABLED === 'true' + ? 
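+    // Values loaded from .env files are strings, hence the === 'true' comparison.
+    // This example gates security.txt on the canary flag; a dedicated variable
+    // (a hypothetical PUBLIC_SECURITY_ENABLED, say) may read more clearly.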
{ + contact: import.meta.env.PUBLIC_SECURITY_EMAIL, + expires: 'auto' + } + : undefined, + + humans: { + team: [ + { + name: 'Team', + contact: import.meta.env.PUBLIC_CONTACT_EMAIL + } + ] + } +}) +``` + +## Staging Environment + +Support three environments: dev, staging, production: + +```typescript +const ENV = import.meta.env.MODE; // 'development', 'staging', or 'production' + +const siteURLs = { + development: 'http://localhost:4321', + staging: 'https://staging.example.com', + production: 'https://example.com' +}; + +export default defineConfig({ + site: siteURLs[ENV], + + integrations: [ + discovery({ + robots: { + // Block bots in dev and staging + allowAllBots: ENV === 'production', + + additionalAgents: ENV !== 'production' + ? [{ userAgent: '*', disallow: ['/'] }] + : [] + }, + + llms: { + description: ENV === 'production' + ? 'Production site' + : `${ENV} environment - not for public use` + } + }) + ] +}) +``` + +Run with: `astro build --mode staging` + +## Different Cache Headers + +Aggressive caching in production, none in development: + +```typescript +discovery({ + caching: import.meta.env.PROD + ? { + // Production: cache aggressively + robots: 86400, + llms: 3600, + humans: 604800 + } + : { + // Development: no caching + robots: 0, + llms: 0, + humans: 0 + } +}) +``` + +## Feature Flags + +Use environment variables as feature flags: + +```typescript +discovery({ + webfinger: { + enabled: import.meta.env.PUBLIC_ENABLE_WEBFINGER === 'true', + resources: [/* ... */] + }, + + canary: import.meta.env.PUBLIC_ENABLE_CANARY === 'true' + ? { + organization: 'Example Corp', + frequency: 'monthly' + } + : undefined +}) +``` + +Set in `.env`: + +```bash +PUBLIC_ENABLE_WEBFINGER=false +PUBLIC_ENABLE_CANARY=true +``` + +## Test vs Production Data + +Load different team data per environment: + +```typescript +import { getCollection } from 'astro:content'; + +discovery({ + humans: { + team: import.meta.env.PROD + ? await getCollection('team') // Real team + : [ + { + name: 'Test Developer', + role: 'Developer', + contact: 'test@localhost' + } + ] + } +}) +``` + +## Preview Deployments + +Handle preview/branch deployments: + +```typescript +const isPreview = import.meta.env.PREVIEW === 'true'; +const isProd = import.meta.env.PROD && !isPreview; + +discovery({ + robots: { + allowAllBots: isProd, // Block on previews too + additionalAgents: !isProd + ? [ + { + userAgent: '*', + disallow: ['/'] + } + ] + : [] + } +}) +``` + +## Verify Environment Config + +Test each environment: + +```bash +# Development +npm run dev +curl http://localhost:4321/robots.txt + +# Production build +npm run build +npm run preview +curl http://localhost:4321/robots.txt + +# Staging (if configured) +astro build --mode staging +``` + +Check that content differs appropriately. + +## Expected Result + +Each environment produces appropriate output: + +**Development** - Block all: +``` +User-agent: * +Disallow: / +``` + +**Production** - Allow bots: +``` +User-agent: * +Allow: / + +Sitemap: https://example.com/sitemap-index.xml +``` + +## Alternative Approaches + +**Config files per environment**: Create `astro.config.dev.mjs` and `astro.config.prod.mjs`. + +**Build-time injection**: Use build tools to inject environment-specific values. + +**Runtime checks**: For SSR sites, check headers or hostname at runtime. + +## Common Issues + +**Environment variables not available**: Ensure variables are prefixed with `PUBLIC_` for client access. 
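+
+To see the prefix rule in action, a quick check from client-side code (variable names are illustrative):
+
+```typescript
+// Any client-side <script> or component script
+console.log(import.meta.env.PUBLIC_CONTACT_EMAIL); // defined in the browser bundle
+console.log(import.meta.env.SECRET_API_KEY);       // undefined: no PUBLIC_ prefix
+```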
+ +**Wrong environment detected**: `import.meta.env.PROD` is true for production builds, not preview. + +**Undefined values**: Provide fallbacks for missing environment variables. + +**Inconsistent builds**: Document which environment variables affect the build for reproducibility. diff --git a/docs/src/content/docs/how-to/filter-sitemap.md b/docs/src/content/docs/how-to/filter-sitemap.md index 200fa1e..cda1068 100644 --- a/docs/src/content/docs/how-to/filter-sitemap.md +++ b/docs/src/content/docs/how-to/filter-sitemap.md @@ -3,29 +3,240 @@ title: Filter Sitemap Pages description: Control which pages appear in your sitemap --- -Configure filtering to control which pages are included in your sitemap. +Exclude pages from your sitemap to keep it focused on publicly accessible, valuable content. -:::note[Work in Progress] -This page is currently being developed. Check back soon for complete documentation. -::: +## Prerequisites -## Coming Soon +- Integration installed and configured +- Understanding of which pages should be public +- Knowledge of your site's URL structure -This section will include: -- Detailed explanations -- Code examples -- Best practices -- Common patterns -- Troubleshooting tips +## Exclude Admin Pages -## Related Pages +Block administrative and dashboard pages: -- [Configuration Reference](/reference/configuration/) -- [API Reference](/reference/api/) -- [Examples](/examples/ecommerce/) +```typescript +// astro.config.mjs +discovery({ + sitemap: { + filter: (page) => !page.includes('/admin') + } +}) +``` -## Need Help? +This removes all URLs containing `/admin` from the sitemap. -- Check our [FAQ](/community/faq/) -- Visit [Troubleshooting](/community/troubleshooting/) -- Open an issue on [GitHub](https://github.com/withastro/astro-discovery/issues) +## Exclude Multiple Path Patterns + +Filter out several types of pages: + +```typescript +discovery({ + sitemap: { + filter: (page) => { + return !page.includes('/admin') && + !page.includes('/draft') && + !page.includes('/private') && + !page.includes('/test'); + } + } +}) +``` + +## Exclude by File Extension + +Remove API endpoints or non-HTML pages: + +```typescript +discovery({ + sitemap: { + filter: (page) => { + return !page.endsWith('.json') && + !page.endsWith('.xml') && + !page.includes('/api/'); + } + } +}) +``` + +## Include Only Specific Directories + +Allow only documentation and blog posts: + +```typescript +discovery({ + sitemap: { + filter: (page) => { + const url = new URL(page); + const path = url.pathname; + + return path.startsWith('/docs/') || + path.startsWith('/blog/') || + path === '/'; + } + } +}) +``` + +## Exclude by Environment + +Different filtering for development vs production: + +```typescript +discovery({ + sitemap: { + filter: (page) => { + // Exclude drafts in production + if (import.meta.env.PROD && page.includes('/draft')) { + return false; + } + + // Exclude test pages in production + if (import.meta.env.PROD && page.includes('/test')) { + return false; + } + + return true; + } + } +}) +``` + +## Filter Based on Page Metadata + +Use frontmatter or metadata to control inclusion: + +```typescript +discovery({ + sitemap: { + serialize: (item) => { + // Exclude pages marked as noindex + // Note: You'd need to access page metadata here + // This is a simplified example + return item; + }, + filter: (page) => { + // Basic path-based filtering + return !page.includes('/internal/'); + } + } +}) +``` + +## Combine with Custom Pages + +Add non-generated pages while filtering others: + 
+```typescript
+discovery({
+  sitemap: {
+    filter: (page) => !page.includes('/admin'),
+    customPages: [
+      'https://example.com/special-page',
+      'https://example.com/external-content'
+    ]
+  }
+})
+```
+
+## Use Regular Expressions
+
+Advanced pattern matching:
+
+```typescript
+discovery({
+  sitemap: {
+    filter: (page) => {
+      // Exclude pages with query parameters
+      if (page.includes('?')) return false;
+
+      // Exclude paginated pages except first page
+      if (/\/page\/\d+/.test(page)) return false;
+
+      // Exclude temp or staging paths
+      if (/\/(temp|staging|wip)\//.test(page)) return false;
+
+      return true;
+    }
+  }
+})
+```
+
+## Filter User-Generated Content
+
+Exclude user profiles or dynamic content:
+
+```typescript
+discovery({
+  sitemap: {
+    filter: (page) => {
+      // Pages are full URLs, so compare against the pathname
+      const path = new URL(page).pathname;
+
+      // Include main user directory page
+      if (path === '/users' || path === '/users/') return true;
+
+      // Exclude individual user pages
+      if (path.startsWith('/users/')) return false;
+
+      // Exclude comment threads
+      if (path.includes('/comments/')) return false;
+
+      return true;
+    }
+  }
+})
+```
+
+## Verify Your Filter
+
+Test your filter logic:
+
+```bash
+npm run build
+npm run preview
+
+# Check sitemap
+curl http://localhost:4321/sitemap-index.xml
+
+# Look for excluded pages (should not appear)
+curl http://localhost:4321/sitemap-0.xml | grep '/admin'
+```
+
+If grep returns nothing, your filter is working.
+
+## Expected Result
+
+Your sitemap will only contain allowed pages. Excluded pages won't appear:
+
+```xml
+<?xml version="1.0" encoding="UTF-8"?>
+<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
+  <url>
+    <loc>https://example.com/</loc>
+  </url>
+  <url>
+    <loc>https://example.com/blog/post-1</loc>
+  </url>
+</urlset>
+```
+
+## Alternative Approaches
+
+**robots.txt blocking**: Block crawling entirely using robots.txt instead of just omitting from sitemap.
+
+**Meta robots tag**: Add `<meta name="robots" content="noindex">` to pages you want excluded.
+
+**Separate sitemaps**: Create multiple sitemap files for different sections, only submit public ones.
+
+**Dynamic generation**: Generate sitemaps at runtime based on user permissions or content status.
+
+## Common Issues
+
+**Too restrictive**: Double-check your filter doesn't exclude important pages. Test thoroughly.
+
+**Case sensitivity**: URL paths are case-sensitive. `/Admin` and `/admin` are different.
+
+**Trailing slashes**: Be consistent. `/page` and `/page/` may both exist. Handle both.
+
+**Query parameters**: Decide whether to include pages with query strings. Usually exclude them.
+
+**Performance**: Complex filter functions run for every page. Keep logic simple for better build times.
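+
+Building on the performance note above, one low-cost pattern is to compute the exclusion list once at module scope so the per-page check stays a simple prefix test. This is a sketch under the same assumptions as the examples above; the `EXCLUDED_PREFIXES` name is illustrative:
+
+```typescript
+// astro.config.mjs
+// Built once when the config loads, not re-created for every page
+const EXCLUDED_PREFIXES = ['/admin', '/draft', '/private', '/test'];
+
+discovery({
+  sitemap: {
+    filter: (page) => {
+      // Pages are full URLs, so compare against the pathname
+      const path = new URL(page).pathname;
+      return !EXCLUDED_PREFIXES.some((prefix) => path.startsWith(prefix));
+    }
+  }
+})
+```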
diff --git a/status.json b/status.json index faf0864..e5c192f 100644 --- a/status.json +++ b/status.json @@ -29,7 +29,7 @@ ] }, "howto": { - "status": "executing", + "status": "ready", "branch": "docs/howto-content", "worktree": "docs-howto", "pages": [ @@ -44,10 +44,20 @@ "how-to/activitypub.md" ], "dependencies": ["tutorials"], - "completed_pages": [] + "completed_pages": [ + "how-to/block-bots.md", + "how-to/customize-llm-instructions.md", + "how-to/add-team-members.md", + "how-to/filter-sitemap.md", + "how-to/cache-headers.md", + "how-to/environment-config.md", + "how-to/content-collections.md", + "how-to/custom-templates.md", + "how-to/activitypub.md" + ] }, "reference": { - "status": "executing", + "status": "ready", "branch": "docs/reference-content", "worktree": "docs-reference", "pages": [ @@ -64,7 +74,19 @@ "reference/typescript.md" ], "dependencies": [], - "completed_pages": [] + "completed_pages": [ + "reference/configuration.md", + "reference/api.md", + "reference/robots.md", + "reference/llms.md", + "reference/humans.md", + "reference/security.md", + "reference/canary.md", + "reference/webfinger.md", + "reference/sitemap.md", + "reference/cache.md", + "reference/typescript.md" + ] }, "explanation": { "status": "executing",