Comments (24)

straight-shoota commented on May 24, 2024

@RX14 The problem with canonical links is that docs for older versions vanish from the search results because the search algorithm treats them as duplicate content. But in fact, they're not duplicate and users might have a need to find documentation for older versions as well. For example, when upgrading to a new version, you may need to read the documentation of deprecated features not available in the current version in order to find a suitable replacement.

Using a sitemap is a superior solution because it allows assigning priorities to individual pages, and outdated pages don't vanish completely; they just won't be as prominent as more recent ones. I think it should eventually replace the canonical link.

from crystal-website.

straight-shoota commented on May 24, 2024

The compiler supports generating a sitemap now. We can proceed to get this integrated into the docs generation process.

  • Add DOCS_OPTIONS to distribution-scripts:
    • For nightly: --sitemap-base-url=https://crystal-lang.org/api/master --sitemap-changefreq=daily --sitemap-priority=0.3
    • For latest release: --sitemap-base-url=https://crystal-lang.org/api/$(version) --sitemap-changefreq=never --sitemap-priority=1.0
  • Pass the build type from .circleci/config.yml to the distribution-scripts workflow.
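
A minimal sketch of how the two option sets above could be selected from a build type. The flag values are taken from the list above; the `docs_options` helper and the build-type names `nightly` and `release` are assumptions for illustration and are not taken from distribution-scripts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the DOCS_OPTIONS value for a build type.
# The build-type names ("nightly", "release") are assumptions.
docs_options() {
  local build_type=$1 version=$2
  local base=https://crystal-lang.org/api
  case $build_type in
    nightly)
      echo "--sitemap-base-url=$base/master --sitemap-changefreq=daily --sitemap-priority=0.3"
      ;;
    release)
      echo "--sitemap-base-url=$base/$version --sitemap-changefreq=never --sitemap-priority=1.0"
      ;;
    *)
      echo "unknown build type: $build_type" >&2
      return 1
      ;;
  esac
}

# Example: DOCS_OPTIONS="$(docs_options release "$version")"
```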

straight-shoota commented on May 24, 2024

This is essentially discussed in crystal-lang/crystal#5952.
The idea was to use canonical references to the latest URL. But that also has some issues. Most importantly, it currently doesn't cover doc pages < 0.25.0, so these still show up in the search results (but there are none between 0.25.0 and 0.30.0 because they all point to latest).

Using a sitemap seems like a smart alternative to solve this issue. 👍

straight-shoota commented on May 24, 2024

I've put together a simple program to automatically generate sitemaps for https://crystal-lang.org/api

It's available at https://github.com/straight-shoota/crystal_docs_sitemap

Generated output: output.tar.gz

The output contents should be published at https://crystal-lang.org/api/ and search engines need to be informed about the sitemap (see https://www.sitemaps.org/protocol.html#informing).
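
One of the methods described on that page is a `Sitemap:` line in robots.txt. A minimal sketch, assuming the index is published as sitemapindex.xml under /api/; that file name and the `add_sitemap_line` helper are assumptions, not something decided in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append a "Sitemap:" directive to a robots.txt file
# unless the exact line is already present.
add_sitemap_line() {
  local robots_file=$1 sitemap_url=$2
  local line="Sitemap: $sitemap_url"
  touch "$robots_file"
  # -x: match the whole line, -F: fixed string, -q: quiet
  grep -qxF "$line" "$robots_file" || printf '%s\n' "$line" >> "$robots_file"
}

# Example (the URL is an assumption):
# add_sitemap_line robots.txt https://crystal-lang.org/api/sitemapindex.xml
```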

bcardiff commented on May 24, 2024

Ok, let's make the doc tool generate the sitemap if instructed to do so. But it will need to know about the base URL: $ crystal docs --sitemap-base-url=https://crystal-lang.org/api/VERSION/ or something similar.

Then the maintenance of the root site map is more scriptable as proposed.

bcardiff commented on May 24, 2024

Does someone know if there is a tool for generating the sitemap that does some recursive checks over directories? Unless it can be automated it won't happen.

Doing a pass over old docs to set the canonical seems more likely if that fixes the issue.

straight-shoota commented on May 24, 2024

This should be fairly simple to automate. Essentially you only need to extract all local links from each version's index.html in the API docs. Those are already the URLs for the sitemap. All links except those from the most recent version get a lower priority.
Each release's API docs should be in a distinct sitemap file, and all of them can be combined in a sitemap index.
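
A rough sketch of that idea, assuming each index.html links to its pages with plain relative `href="….html"` attributes and that paths need no XML escaping. The helper names `build_sitemap` and `build_sitemap_index` are hypothetical; the element names follow the sitemaps.org protocol.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a sitemap for one API version.
# Keeps only local (relative) links to .html pages found in index.html.
build_sitemap() {
  local index_file=$1 base_url=$2 priority=$3
  echo '<?xml version="1.0" encoding="UTF-8"?>'
  echo '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
  # Extract href values, drop absolute URLs (scheme:) and links that
  # start with "/", then de-duplicate.
  grep -o 'href="[^"]*\.html"' "$index_file" |
    sed -e 's/^href="//' -e 's/"$//' |
    grep -v '^[a-zA-Z][a-zA-Z0-9+.-]*:' |
    grep -v '^/' |
    sort -u |
    while IFS= read -r path; do
      echo "  <url><loc>$base_url/$path</loc><priority>$priority</priority></url>"
    done
  echo '</urlset>'
}

# Hypothetical helper: print a sitemap index that references the
# sitemap.xml of every version passed after the API base URL.
build_sitemap_index() {
  local api_url=$1 version
  shift
  echo '<?xml version="1.0" encoding="UTF-8"?>'
  echo '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
  for version in "$@"; do
    echo "  <sitemap><loc>$api_url/$version/sitemap.xml</loc></sitemap>"
  done
  echo '</sitemapindex>'
}
```

Each version would get its own `build_sitemap` call with a priority that depends on its age, and the index only lists where those files live.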

I can try to put this together if nobody else is interested.

ukd1 commented on May 24, 2024

@straight-shoota nice, didn't think to check for the issue on the main repo somehow...lol. I'd be down for helping on this, but tbh, I have no idea how the docs or old versions are built - so it might be easier if you do it. If you'd like a hand / pair / lmk.

RX14 commented on May 24, 2024

> Doing a pass over old docs to set the canonical seems more likely if that fixes the issue.

this can be done with a simple sed script to adjust the header for old versions. The doc pages don't need to be regenerated.
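
A sketch of what such a pass could look like, assuming GNU sed, that every page contains a literal `</head>`, and that each old page maps to the same relative path under the canonical base URL. The `add_canonical_links` helper is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: add a canonical link to every HTML page of one
# already-generated version directory. Pass the directory without a
# trailing slash. Requires GNU sed (-i without a suffix).
add_canonical_links() {
  local docs_dir=$1 canonical_base=$2
  local file rel
  find "$docs_dir" -name '*.html' -print0 |
    while IFS= read -r -d '' file; do
      # Path of the page relative to the version directory.
      rel=${file#"$docs_dir"/}
      # Skip pages that already declare a canonical URL.
      grep -q 'rel="canonical"' "$file" && continue
      sed -i "s|</head>|<link rel=\"canonical\" href=\"$canonical_base/$rel\" /></head>|" "$file"
    done
}

# Example (the directory name is an assumption):
# add_canonical_links api/0.24.0 https://crystal-lang.org/api/latest
```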

RX14 commented on May 24, 2024

I'd suggest redoing these old pages before working on a sitemap.

straight-shoota commented on May 24, 2024

@bcardiff WDYT?

bcardiff commented on May 24, 2024

I agree the sitemap is worth having and is a good solution. I don't think having the sitemap checked in the repo is the right thing.

From a workflow point of view, what would make sense is to have a tool that changes an existing sitemap with some operations:

  1. Add dir content as it will be reached from specific url prefix
  2. Set the priority for all routes matching a specific prefix

At least that workflow will play well with the release process, where we have a local dir with the new api documentation to upload.

I am unsure how to keep the sitemap up to date with respect to the content from jekyll itself regarding lastmod of existing pages and new posts. Maybe there is a 3rd action:

  3. Update lastmod for a subset of dirs as it will be reached from specific url prefix.

That way we can update those params without iterating the whole content.

So, in essence, is having an approach to update rather than create a sitemap.
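
As an illustration of the second operation (set the priority for all routes matching a prefix), here is a minimal sketch using awk. It assumes `<loc>` comes before `<priority>` inside each `<url>` element, and the `set_priority_for_prefix` name is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a sitemap on stdin and write it to stdout,
# rewriting <priority> for every <url> whose <loc> starts with a prefix.
set_priority_for_prefix() {
  local prefix=$1 priority=$2
  awk -v prefix="$prefix" -v priority="$priority" '
    /<loc>/ {
      # Remember whether the current URL starts with the prefix.
      loc = $0
      sub(/.*<loc>/, "", loc)
      sub(/<\/loc>.*/, "", loc)
      matched = (index(loc, prefix) == 1)
    }
    matched && /<priority>/ {
      sub(/<priority>[^<]*<\/priority>/, "<priority>" priority "</priority>")
      matched = 0
    }
    { print }
  '
}

# Example:
# set_priority_for_prefix https://crystal-lang.org/api/0.30.0/ 0.5 < sitemap.xml > sitemap.new.xml
```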

straight-shoota commented on May 24, 2024

> I don't think having the sitemap checked in the repo is the right thing.

Agreed. It just needs to be generated and put into an S3 bucket. Ideally, a rebuild should be triggered after the nightly API docs have been updated from master.

> I am unsure how to keep the sitemap up to date with respect to the content from jekyll itself regarding lastmod of existing pages and new posts.

Currently, these sitemaps are only for /api, so Jekyll is not even involved. This is perfectly fine, the sitemap doesn't need to incorporate all pages on the domain. Getting the priorities right for the API versions is the main issue here, and I'd like to get that fixed before considering other parts of the website. They can be tackled individually (for example, Jekyll can simply build its own sitemap), we just need to reference all sitemaps from the sitemapindex.

> So, in essence, is having an approach to update rather than create a sitemap.

Sure, we can do that. I just figured it would be easier to simply run the generator and push the result to S3 without having to synchronize first.

In practice, there are two events that would require an update to the sitemaps:

  1. Every day the updated nightly API docs are published for master. This only needs a rebuild of the sitemap for /api/master.
  2. When a new Crystal version is released, we need to build the sitemap for the new release and rebuild for the last x releases in order to update the priority. x is currently 3: The last 2 versions get priorities (0.5, 0.3) and the one after that needs to be set to the default (0.1)
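
The priority scheme described in the second point can be written down as a small lookup. This is only a sketch: the `priority_for_age` name is hypothetical, and the value 1.0 for the newest release is taken from the defaults mentioned elsewhere in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a release's age (0 = newest release,
# 1 = the one before it, ...) to its sitemap priority.
priority_for_age() {
  case $1 in
    0) echo 1.0 ;;
    1) echo 0.5 ;;
    2) echo 0.3 ;;
    *) echo 0.1 ;;
  esac
}

# Example: versions listed newest first (the version numbers are placeholders).
# age=0
# for version in 0.31.0 0.30.1 0.30.0 0.29.0; do
#   echo "$version $(priority_for_age "$age")"
#   age=$((age + 1))
# done
```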

The priority adjustments could actually just be implemented with a simple sed substitution. The contents don't change when a release ages, thus there is no need to actually rebuild the sitemap.

Considering all this, it might actually be the best solution to integrate the sitemap generation into the doc generator. This problem is not specific to the stdlib and this way all shards API docs could benefit.
This is really trivial to implement: it just spits out another file. And it won't require additional configuration, since there is already --canonical-base-url, and priority could just be 1.0 by default. Maybe a --sitemap-priority option could be useful, but it's not necessary.

To build the sitemaps for legacy releases, we can just use https://github.com/straight-shoota/crystal_docs_sitemap That's a one-time thing.

With this, updates to master sitemap don't need any additional action because the updated sitemap is already provided by the doc generator.
When a new release is added, we need to add it to the sitemapindex and update the sitemaps for the last releases, but this could just be s|<priority>1.0</priority>|<priority>0.5</priority>| etc.

bcardiff commented on May 24, 2024

> Currently, these sitemaps are only for /api, so Jekyll is not even involved

Wouldn't that prevent indexing other pages?

> Every day the updated nightly API docs are published for master.

I thought we didn't want sitemap for master. Is mostly used for preview (edit: sorry you mention it at the end)

> When a new Crystal version is released,

I'm ok downloading the whole docs for a first-time generation (edit: or using the proposed script), but upon a Crystal version release I don't have the whole bucket of docs locally. And I don't want to have to download it. What I do have is the -doc.tar.gz artifact that is pushed. I was thinking of injecting the new paths there, without actually retrieving them from HTTP or the bucket. Hence the proposed transformations 1 and 2.

straight-shoota commented on May 24, 2024

> Wouldn't that prevent indexing other pages?

No, sitemaps are not used as an exclusive source. Search engines still employ their regular crawling. They just augment the results or help discover pages that would otherwise not be discovered. See https://webmasters.stackexchange.com/questions/114425/if-i-remove-urls-from-an-xml-sitemap-will-google-still-index-them

straight-shoota commented on May 24, 2024

> I thought we didn't want sitemap for master. Is mostly used for preview

I guess it's not strictly necessary, but when the doc generator puts out the sitemap anyway, this requires no extra effort at all.

> What I do have is the -doc.tar.gz artifact that is pushed.

My suggestion is that the sitemap is generated directly by the docs generator, thus it would already be included in the doc.tar.gz.
Each API version has its own sitemap (sitemap.xml), which would be located at /api/{{version}}/sitemap.xml.

When publishing a new release, you would just push the contents of doc.tar.gz and the new sitemap is online. It needs to be referenced in the sitemap index, so that's adding one line to that file:

<sitemap><loc>https://crystal-lang.org/api/{{version}}/sitemap.xml</loc><lastmod>{{`date --rfc-3339=date`}}</lastmod></sitemap>

And you would need to grab /api/{{version-1}}/sitemap.xml, /api/{{version-2}}/sitemap.xml, /api/{{version-3}}/sitemap.xml, replace the priorities and push them back up.

This could all be placed in a simple shell script which could automatically retrieve the files from S3, apply the changes and push them back up. I haven't tested this but the general idea looks like this:

# Arguments: the new version first, then the previous versions, newest first.
# Needs GNU sed and GNU date; S3_BUCKET must be set in the environment.
CURRENT_VERSION=$1

# Register the new sitemap in the index: insert an entry before the last
# line, which is expected to hold the closing </sitemapindex> tag.
aws s3 cp "$S3_BUCKET/sitemapindex.xml" sitemapindex.xml

entry="<sitemap><loc>https://crystal-lang.org/api/$CURRENT_VERSION/sitemap.xml</loc><lastmod>$(date --rfc-3339=date)</lastmod></sitemap>"
sed -i "\$ i\\  $entry" sitemapindex.xml

aws s3 cp sitemapindex.xml "$S3_BUCKET/sitemapindex.xml"

# Lower the priority of the previous releases (positional arguments 2 and up).
for (( i=2; i <= $#; i++ )); do
  version=${!i}
  case $i in
    2)
      priority=0.5
      ;;
    3)
      priority=0.3
      ;;
    *)
      priority=0.1
      ;;
  esac

  aws s3 cp "$S3_BUCKET/$version/sitemap.xml" "sitemap-$version.xml"

  # Replace whatever priority is currently set with the new value.
  sed -i "s|<priority>[0-9.]*</priority>|<priority>$priority</priority>|" "sitemap-$version.xml"

  aws s3 cp "sitemap-$version.xml" "$S3_BUCKET/$version/sitemap.xml"
done

straight-shoota commented on May 24, 2024

We don't need another CLI option for this. That's exactly the same intent as --canonical-base-url.

bcardiff commented on May 24, 2024

The canonical-base-url is /latest always.
If there is no need to have a canonical-base then that setting might go away.
And they are different concerns.

straight-shoota commented on May 24, 2024

Oh yes, I mixed that up, sorry. It should go, because using canonical completely hides all older versions. So we can simply replace it.

oprypin commented on May 24, 2024

I think that^ was not done yet

oprypin commented on May 24, 2024

I commented at crystal-lang/crystal#5952 (comment).

Should we dedupe the two issues?

oprypin commented on May 24, 2024

The summary of that is:
The <priority> tag, which is the crux of this suggestion, is explicitly documented as ignored by Google.

Also per another documentation page, sitemaps seem to not affect things much because the API docs site is already comprehensively linked internally (the highlighted part is a quote)

szabgab commented on May 24, 2024

Solving the Google ranking issue might be hard, but at least there could be a prominent warning and link at the top of each old page linking to the latest page, e.g. like this one: https://flask.palletsprojects.com/en/1.1.x/patterns/appfactories/

If re-generating the old docs is an issue then I think this could be done even with JavaScript. These links will be placed on the top of the page dynamically without the need to regenerate the old pages. This won't solve the ranking issue, but will help visitors reach the latest version of the documentation.

straight-shoota commented on May 24, 2024

It looks like this can finally be considered resolved. Google search consistently ranks search results for the latest release (https://crystal-lang.org/api/latest) highest.
