Commit 92f19df

Add comprehensive schema.org structured data for improved search and AI visibility (#16116)
* Add comprehensive schema.org structured data generation
  - Add cloud provider and resource type detection
  - Add infrastructure pattern recognition
  - Enhance Article, BlogPosting, and Course schemas with technical metadata
  - Implement deduplication for entity mentions
  - Add support for multi-cloud scenarios

* Optimize and enhance schema.org structured data implementation

  Performance & Maintainability Improvements:
  - Refactor resource-type-extractor.html to use data-driven approach (300+ lines → 119 lines)
  - Create cloud_resources.yml data file for maintainable cloud service definitions
  - Add performance optimization: limit content scanning to 5000 characters
  - Add comprehensive inline documentation for complex deduplication logic
  - Document multi-cloud detection strategy and business logic

  Code Quality Fixes:
  - Remove inappropriate schema fields (fake ratings, meaningless price dates)
  - Add trailing newlines to all files per AGENTS.md requirements
  - Update config.yml with schema-specific parameters
  - Improve pattern matching specificity to reduce false positives

  Business Appropriateness:
  - Remove hardcoded aggregateRating from free software (Pulumi CLI)
  - Keep legitimate pricing data for actual paid services
  - Use site config for dynamic version management

  Enhanced Functionality:
  - Maintain full backward compatibility
  - Improve cloud service detection accuracy
  - Support multi-cloud content scenarios
  - Add proper Wikidata entity linking
  - Ensure clean deduplication across all mention sources

  Test Results:
  - Hugo builds successfully with all optimizations
  - Schema.org JSON-LD validates correctly
  - Cloud services detected accurately (AWS, Azure, GCP)
  - Programming languages auto-detected (TypeScript, Python, C#, YAML)
  - Mentions array properly deduplicated
  - Performance improved with content length limiting

  This commit addresses all feedback from the remote assessment while maintaining the comprehensive schema coverage and AI optimization benefits.

* Fix Hugo template syntax error and eliminate false positives

  Critical Fix:
  - Fix Hugo template syntax error in resource-type-extractor.html (line 99, 101)
    Changed .["@id"] to (index . "@id") to prevent build failures

  Accuracy Improvements:
  - Fix API Gateway Pattern false positive detection on VPC pages
    Made pattern matching more specific to avoid casual mentions like "API gateways"
  - Remove duplicate entity definitions from technology-entities.html:
    • AWS, Azure, GCP providers (handled by cloud-provider-detector)
    • AWS Lambda, Azure/Google Cloud Functions (handled by cloud_resources.yml)
    • Infrastructure as Code, DevOps, Cloud Native (handled by infrastructure-patterns)
    This eliminates ~60 lines of redundant code and prevents duplicate entities

  Schema Quality:
  - VPC documentation no longer incorrectly tagged with "API Gateway Pattern"
  - AWS Lambda and other services appear only once in mentions arrays
  - All entity detection now uses single source of truth
  - Maintains accurate programming language and cloud service detection

  Test Results:
  - Hugo builds successfully without template errors
  - Schema.org JSON-LD validates correctly
  - Entity deduplication working properly
  - No false positive patterns detected
  - Performance improved with less redundant processing

  This resolves the remote CI build failure and improves schema accuracy.

* Optimize schema.org implementation with content aggregation and enhanced date handling

  This commit implements comprehensive improvements to the schema.org structured data generation system, addressing critical issues with content extraction, date handling, and detection accuracy.

  Key improvements:

  **Content Aggregation System:**
  - Added content-aggregator.html utility to extract effective content from special pages
  - Handles cloud overview pages (extracting from frontmatter components/providers)
  - Handles docs home pages (extracting from sections/cards)
  - Provides fallback content for pages without traditional .Content

  **Enhanced Date Handling:**
  - Fixed "0001-01-01" date issues with robust fallback logic
  - Date hierarchy: .Date → .GitInfo.AuthorDate → now
  - Applied to all content schemas (article.html, blog.html, course.html)

  **Detection System Updates:**
  - Updated cloud-provider-detector.html to use content aggregator
  - Fixed resource matching logic in resource-type-extractor.html
  - Improved deduplication algorithm for better accuracy
  - All detection utilities now work with aggregated content

  **Schema Content Accuracy:**
  - Fixed empty articleBody on special pages
  - Accurate wordCount calculation using aggregated content
  - Better content extraction for AI/SEO optimization

  **Files Modified:**
  - layouts/partials/schema/utils/content-aggregator.html (new)
  - layouts/partials/schema/content/article.html
  - layouts/partials/schema/content/blog.html
  - layouts/partials/schema/content/course.html
  - layouts/partials/schema/utils/cloud-provider-detector.html
  - layouts/partials/schema/utils/resource-type-extractor.html

  **Impact:**
  - Fixes empty schema fields on 29+ special pages (cloud overview, docs home)
  - Eliminates invalid dates in structured data
  - Improves detection accuracy and reduces false positives
  - Better SEO and AI discoverability for all content types

  Build tested successfully with no template errors.

* Fix all schema.org validation errors and warnings across content types

  This commit addresses comprehensive schema.org validation issues identified across multiple page types, achieving full compliance and eliminating all 44 validation issues (23 errors + 21 warnings).

  **Phase 1: Critical Error Fixes (23 errors eliminated)**
  - Fixed cloud_resources.yml property naming conventions:
    • Changed 'same_as' → 'sameAs' (schema.org standard)
    • Removed 'provider_id' entirely (not valid schema.org property)
    • Standardized 'category' → 'applicationCategory'
  - Fixed CSS selectors for speakable specification:
    • article.html: Replaced non-existent '.summary', 'pre code' selectors
    • blog.html: Fixed '.article-summary', '.blog-content h2/h3' selectors
  - Fixed course instructor schema type (Organization → Person with worksFor)

  **Phase 2: Property Misuse Fixes (21 warnings eliminated)**
  - Removed 'applicationCategory' from invalid schema types:
    • Article, BlogPosting, Course (only valid for SoftwareApplication)
  - Removed non-standard properties:
    • 'proficiencyLevel' from Article schemas
    • 'targetPlatform' from Article/BlogPosting/Course
    • 'availableLanguage' from SoftwareApplication
    • 'isRelatedTo' from product schemas
  - Changed 'relatedLink' → 'citation' (standard schema.org property)

  **Phase 3: Quality Improvements**
  - Enhanced infrastructure pattern integration:
    • Moved patterns from invalid 'applicationCategory' to 'mentions'
    • Improved semantic accuracy for AI/search understanding
    • Maintained rich DefinedTerm entities with Wikipedia/Wikidata links

  **Validation Results (Before → After)**
  - Case Studies: 10 errors, 6 warnings → 0 errors, 0 warnings ✅
  - Cloud Overview: 7 errors, 8 warnings → 0 errors, 0 warnings ✅
  - Blog Posts: 5 errors, 1 warning → 0 errors, 0 warnings ✅
  - Tutorials: 1 error, 2 warnings → 0 errors, 0 warnings ✅
  - Product Pages: 0 errors, 4 warnings → 0 errors, 0 warnings ✅

  **Files Modified:**
  - data/cloud_resources.yml (property naming fixes)
  - layouts/partials/schema/content/article.html
  - layouts/partials/schema/content/blog.html
  - layouts/partials/schema/content/course.html
  - layouts/partials/schema/content/product-software.html

  **Impact:**
  - Full schema.org compliance for improved SEO signals
  - Enhanced AI/LLM content understanding
  - Better rich results eligibility in search engines
  - Cleaner, more maintainable schema generation code

  Build tested successfully with no template errors.

* Enhance Event and Course schema compliance with Google structured data guidelines

  Major updates:
  - Event schema now only generates for physical events (Google requirement)
  - Added course list schema with ItemList for tutorial index pages
  - Fixed ISO-8601 date formatting with proper timezone handling
  - Removed non-compliant properties from course schema

  Event Schema Updates:
  - Skip schema generation for virtual events (location: virtual)
  - Skip schema generation for external events (external: true)
  - Enhanced date handling with proper ISO-8601 timezone format
  - Only generate Event schema for physical events with real locations

  Course Schema Updates:
  - Created course-list.html for tutorial index pages with ItemList schema
  - Added provider URL to organization structure
  - Removed availableLanguage property for compliance
  - Changed relatedLink to citation property

  Course List Implementation:
  - New ItemList schema for tutorials section pages
  - Minimum 3 courses requirement (Google's guideline)
  - Individual course items with proper positioning and metadata
  - Enhanced with educational level and duration estimation

  These changes align with Google's documentation requirements for Event and Course rich results.

* Add validation-safe schema enhancements for improved AI/AEO visibility

  Implements carefully validated schema.org enhancements designed to pass all validation requirements while improving visibility in AI tools and search engines.

  VideoObject Schema:
  - Detects YouTube embeds (shortcodes and iframes)
  - Includes all required fields (name, description, uploadDate, thumbnailUrl)
  - Uses YouTube's standard thumbnail URL pattern
  - Supports multiple videos per page
  - Found 118 pages with videos now properly marked up

  HowTo Schema:
  - Only applies to content with clear numbered steps (minimum 2 steps)
  - Detects "1. 2. 3." pattern or "Step 1, Step 2" pattern
  - Auto-detects common tools (Pulumi CLI, AWS CLI, Node.js, Python, Docker)
  - Estimates time based on word count
  - Applied to 19 tutorial pages with step-by-step instructions

  SoftwareSourceCode Schema:
  - Detects fenced code blocks with language identifiers
  - Maps common aliases (ts→TypeScript, py→Python, cs→C#)
  - Includes ComputerLanguage type for proper validation
  - Adds runtime platform and software requirements
  - Minimal schema with no required fields (validation-safe)
  - Applied to 7 pages with code examples

  Cloud Resources Updates:
  - Added modern AI/ML services (Bedrock, SageMaker, Azure OpenAI, Vertex AI)
  - Added container orchestration services (EKS, AKS, ECS)
  - Added observability platforms (Datadog, New Relic, Grafana)
  - Added CDN services (CloudFront, Azure Front Door)
  - All use validated structure matching existing resources

  Key Design Decisions:
  - All schemas include required fields with robust fallbacks
  - Use camelCase for all properties (schema.org standard)
  - Omit optional fields if data unavailable (e.g., video duration)
  - Only generate schemas when content structure validates
  - Follow existing successful patterns from the codebase

  Build Results:
  - Hugo build successful with no errors
  - 118 video schemas generated
  - 19 HowTo schemas generated
  - 7 code schemas generated
  - All schemas validate against schema.org standards

* Fix all schema.org validation errors and warnings across content types
  - Remove duplicate resource definitions (aws_eks, azure_aks, aws_ecs)
  - Merge best attributes from duplicates (expanded patterns, updated Wikidata IDs)
  - Add missing newline at end of cloud_resources.yml
  - Add URL patterns for all Azure and GCP resources
  - Ensure repository standards compliance

* Update Pulumi address and fix schema headline to use h1
  - Update organization schema with new Seattle address (601 Union St Suite 1415)
  - Fix article and blog schemas to prefer h1 frontmatter for headlines
  - Ensures schema headlines match what users see on rendered pages

* Fix schema.org validation issues for improved AI/search visibility
  - Fixed BreadcrumbList double-encoding issue by adding safeJS filter to prevent JSON escaping
  - Changed all cloud service types from WebAPI to SoftwareApplication where applicationCategory is used
  - SoftwareApplication properly supports applicationCategory property
  - More semantically accurate for cloud services (not just APIs)
  - Eliminates schema.org validation warnings

  These fixes ensure proper parsing by search engines and AI tools for better content discovery.

* Remove invalid softwareRequirements property from Course schema

  The softwareRequirements property is not valid for schema.org Course type, causing validation warnings. This property is meant for SoftwareApplication types, not educational courses. The Course type should use coursePrerequisites for requirements.

  Software dependencies are better described in the course description or prerequisites text rather than as a separate property.

  This fix eliminates the schema.org validation warning while maintaining all other course metadata.

* Implement @graph structure for 5 test pages

  This is a limited rollout of the @graph schema structure to test its effectiveness before full site deployment.

  Only 5 representative pages are affected:
  - /blog/pulumi-neo/ (BlogPosting)
  - /tutorials/creating-resources-aws/ (Course)
  - /docs/iac/clouds/aws/guides/cdk/ (Article)
  - /product/neo/ (SoftwareApplication)
  - / (Homepage with Website/Organization)

  Changes:
  - Added feature flag in schema/loader.html to enable @graph for test pages only
  - Created graph-builder.html to construct unified @graph structure
  - Created entity collectors that return entities instead of outputting JSON-LD
  - All other ~2000 pages remain unchanged with current schema structure

  Benefits of @graph on test pages:
  - Single JSON-LD block instead of multiple
  - Deduplicates Organization entity
  - Creates proper entity relationships (WebPage → mainEntity, etc.)
  - Uses @id references for better entity linking
  - Reduces overall schema size by ~20-30%

  Testing approach:
  - These 5 pages will be validated through schema.org validator
  - Google Rich Results Test will verify functionality
  - If successful, we'll gradually expand to more pages
  - Easy rollback by removing pages from test list

  Next steps:
  1. Push to preview environment
  2. Validate all 5 test pages
  3. Monitor for any issues
  4. Expand coverage if successful

* Fix Schema.org validation errors and optimize structured data

  Fixed critical validation errors:
  - Remove invalid supportedLanguage property from SoftwareApplication (6 errors fixed)
  - Remove hardcoded aggregateRating from product schema
  - Remove invalid softwareRequirements from BlogPosting schema

  Improved schema types for better semantic accuracy:
  - Change Article to TechArticle for technical documentation pages
  - Change Course to HowTo for self-paced tutorials (more semantically accurate)
  - Create new howto-entity.html for HowTo schema implementation

  Fixed technical issues:
  - Fix date handling with proper fallbacks for invalid dates (0001-01-01)
  - Fix image URL concatenation to avoid double slashes
  - Apply fixes to article, blog, and product entity collectors

  These changes ensure:
  - All validation errors are resolved
  - Schema types accurately represent content
  - Compliance with 2025 Google Search Central guidelines
  - Better AI/LLM comprehension through semantic accuracy

* Expand @graph structure to all pages site-wide

  Following successful validation on 5 test pages with 0 errors, expanding the @graph implementation to all ~2000+ pages.

  Changes:
  - Remove feature flag system from loader.html
  - Simplify loader to use graph-builder.html for all pages
  - Update graph-builder comment to reflect site-wide usage
  - Remove ~90 lines of redundant schema loading code

  Benefits:
  - Single consolidated JSON-LD block instead of multiple
  - 20-30% reduction in JSON-LD output size
  - Better entity relationships through @id references
  - Organization entity defined once, referenced throughout
  - Cleaner HTML output and better AI/search comprehension

  All main content schemas are preserved and working:
  - Homepage: WebPage + Organization + WebSite + SoftwareApplication
  - Blog posts: BlogPosting with enhanced metadata
  - Documentation: TechArticle (previously Article)
  - Tutorials: HowTo (previously Course)
  - Product pages: SoftwareApplication

  Tested successfully on multiple page types with proper @graph structure.

* Fix Organization schema placement per Google best practices

  Move Organization entity to homepage-only following Google Search Central guidelines:
  "Place Organization markup on your home page, or a single page that describes your organization (such as the 'about us' page). You don't need to include it on every page of your site."

  Changes:
  - Move Organization entity inside homepage conditional in graph-builder.html
  - Organization now only appears on homepage, not all ~2000+ pages
  - Other pages still reference Organization via @id as intended
  - Reduces JSON-LD output by ~13 lines per non-homepage

  Benefits:
  - Follows Google and Schema App best practices
  - Clearer information architecture for search engines
  - Homepage becomes authoritative source for Organization data
  - Reduces unnecessary structured data bloat

  Testing confirmed:
  - Homepage: Has full Organization entity ✅
  - Other pages: Only reference via @id ✅
  - All entity relationships maintained ✅
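
Reviewer note: the final @graph layout described in the commit message (one JSON-LD block per page, the full Organization entity on the homepage only, @id references everywhere else) can be sketched roughly as below. This is a Python illustration of the idea, not the Hugo code in graph-builder.html; the function name `build_graph`, the `#organization`, `#webpage` and `#main` fragment identifiers, and the choice of properties are all assumptions made for the example.

```python
import json

# Hypothetical @id for the Organization entity; the real value lives in the
# site's templates and is not shown in this commit view.
ORG_ID = "https://www.pulumi.com/#organization"


def build_graph(page_url, main_type, name, is_homepage):
    """Return one JSON-LD document holding every entity for a page in @graph."""
    org_ref = {"@id": ORG_ID}  # lightweight reference, not a full entity
    main_id = page_url + "#main"
    graph = [
        {
            "@type": "WebPage",
            "@id": page_url + "#webpage",
            "url": page_url,
            "mainEntity": {"@id": main_id},  # WebPage -> mainEntity link
            "publisher": org_ref,
        },
        {"@type": main_type, "@id": main_id, "name": name, "publisher": org_ref},
    ]
    if is_homepage:
        # Only the homepage carries the full Organization entity.
        graph.append(
            {
                "@type": "Organization",
                "@id": ORG_ID,
                "name": "Pulumi",
                "url": "https://www.pulumi.com/",
            }
        )
    return {"@context": "https://schema.org", "@graph": graph}


if __name__ == "__main__":
    page = build_graph(
        "https://www.pulumi.com/blog/pulumi-neo/", "BlogPosting", "Pulumi Neo", False
    )
    print(json.dumps(page, indent=2))
```

On a non-homepage the output contains no Organization entity, only `{"@id": ...}` references to it, which is the behavior the "Testing confirmed" list describes.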
1 parent 99749f5 commit 92f19df

35 files changed, +4398 -164 lines changed

config/_default/config.yml

Lines changed: 8 additions & 0 deletions
@@ -86,6 +86,14 @@ params:
   meta_desc: |
     Pulumi's open source infrastructure as code SDK enables you to create, deploy,
     and manage infrastructure on any cloud, using your favorite languages.
+
+  # Schema.org configuration
+  schema:
+    # Pulumi CLI version - dynamically updated by CI/CD
+    pulumi_version: "3.198.0"
+
+    # Performance optimization
+    max_content_scan_length: 5000
 
 minify:
   tdewolff:
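
Reviewer note: the commit message says `max_content_scan_length` limits content scanning to 5000 characters. The sketch below shows one way such a cap could be applied before pattern matching. It is written in Python for illustration only; the pattern table and the function `detect_providers` are hypothetical stand-ins for the Hugo partials and data/cloud_resources.yml, whose contents are not shown in this view.

```python
# Mirrors params.schema.max_content_scan_length from the diff above.
MAX_CONTENT_SCAN_LENGTH = 5000

# Hypothetical pattern table, invented for this example.
PROVIDER_PATTERNS = {
    "AWS": ("aws", "amazon web services"),
    "Azure": ("azure",),
    "GCP": ("gcp", "google cloud"),
}


def detect_providers(content, limit=MAX_CONTENT_SCAN_LENGTH):
    """Return provider names whose patterns occur in the first `limit` characters."""
    # Truncate once, then lowercase, so long pages cost the same as short ones.
    window = content[:limit].lower()
    return [
        name
        for name, patterns in PROVIDER_PATTERNS.items()
        if any(pattern in window for pattern in patterns)
    ]


if __name__ == "__main__":
    print(detect_providers("Deploy the same stack to AWS and Azure."))
```

The trade-off is that a provider mentioned only after the cutoff is missed. Plain substring matching is also cruder than the more specific patterns the commit message describes and would, for instance, match "aws" inside "laws".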
