Automating sitemaps for large CMS websites


kflanagan28

New Member
Hey,

First post so go easy ...

I am wondering whether anyone has tried to tackle the problem that large CMS-driven websites have with duplicate URLs pointing to the same content.

I know there is no duplicate content penalty in this scenario (as Google recently posted about), but from reading posts by Aaron Wall, there are definitely issues around page dilution.

I am wondering whether it is worth creating an automated sitemap that is updated on a daily basis; I assume this could be done by querying the database. I was also thinking of adding a couple of fields to the database to track page popularity, page age, etc., so that only the most popular and freshest content remains in the sitemap.
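
For illustration only, here is a minimal sketch of the kind of nightly job I mean, written in Python rather than any particular CMS's language, and assuming a hypothetical pages table with url, views and updated_at columns (the table name, thresholds and file paths are all made up):

import sqlite3
from datetime import datetime, timedelta
from xml.sax.saxutils import escape

DB_PATH = "cms.db"          # assumption: the CMS data is reachable via SQLite
SITEMAP_PATH = "sitemap.xml"
MAX_URLS = 50000            # the sitemap protocol caps a single file at 50,000 URLs

def build_sitemap():
    conn = sqlite3.connect(DB_PATH)
    cutoff = (datetime.utcnow() - timedelta(days=90)).isoformat()
    # Keep only pages that are either recently updated or demonstrably popular.
    rows = conn.execute(
        "SELECT url, updated_at FROM pages "
        "WHERE updated_at > ? OR views > 1000 "
        "ORDER BY views DESC LIMIT ?",
        (cutoff, MAX_URLS),
    ).fetchall()
    conn.close()

    entries = [
        "  <url><loc>%s</loc><lastmod>%s</lastmod></url>" % (escape(url), lastmod[:10])
        for url, lastmod in rows
    ]
    with open(SITEMAP_PATH, "w", encoding="utf-8") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        f.write("\n".join(entries) + "\n")
        f.write("</urlset>\n")

if __name__ == "__main__":
    build_sitemap()  # run from a daily cron job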

I could achieve the same thing by using robots.txt and excluding certain dynamic pages.

Really I am just looking for some feedback, and I am interested to know whether I am making any sense!
 

kflanagan28

New Member
Thanks for the reply.

Actually, looking back at my question, it's incorrect. Each CMS has its own method of generating a sitemap that includes the most current content. I built a blog site with Joomla to check this functionality out.

The site searchbrat.com uses a plugin that generates a sitemap and pings Google.
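
For anyone wondering what the "ping" step amounts to, it is essentially a GET request to Google's sitemap ping endpoint with the sitemap's URL as a parameter. A rough Python equivalent, with the sitemap location assumed, looks like this:

import urllib.parse
import urllib.request

# Assumed sitemap location; the plugin builds and submits this automatically.
SITEMAP_URL = "http://searchbrat.com/sitemap.xml"

ping_url = ("http://www.google.com/ping?sitemap="
            + urllib.parse.quote(SITEMAP_URL, safe=""))
with urllib.request.urlopen(ping_url) as response:
    print("Google responded with HTTP", response.getcode())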

Some of my clients have bespoke CMSs in place, so really this is a developer issue and a similar plugin would be required.

Thanks for the input

Kieran
 

paul

Ninja
Duplicate copies of pages (like category listings) should be marked noindex; then they won't be in the index.

A sitemap won't help with telling Google not to index pages.
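
In case it helps, a noindex directive typically lives in a robots meta tag emitted in the <head> of the duplicate templates, something like:

<!-- emitted only on duplicate views such as category listings;
     "follow" keeps the links on the page crawlable -->
<meta name="robots" content="noindex, follow">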
 

kflanagan28

New Member
Hey Paul,

Thanks for the reply. The original post was written in a hurry and, as I said, isn't correct. What I need is a dynamic sitemap and a robots.txt built on the fly. I know a sitemap doesn't stop pages from getting indexed; the first post was just written poorly.

Thanks
 

mneylon

Administrator
Staff member

Why would you want to change your robots.txt so often?
 

kflanagan28

New Member
A couple of the clients I work for are large publishing sites with up to 100,000 pages indexed, and they have complex bespoke CMSs. It isn't enough to just have a static robots.txt file and exclude certain content, e.g.:

Disallow: /?q=gallery/
Disallow: /?q=search/

etc etc

This is a good approach and one I have looked at, but the problem comes when the dynamic queries are not as predictable as that. You cannot exclude them all unless you list the IDs coming from the database (think galleries with lots of similar pages and URLs).
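
One partial workaround, short of generating robots.txt on the fly, is the wildcard matching that Googlebot (though not every crawler) supports in robots.txt, which can cover ID-driven URLs without listing every one. The patterns below are invented purely for illustration; the real parameter names depend on the CMS:

User-agent: Googlebot
# block any URL whose query string starts a gallery view, whatever the id
Disallow: /*?q=gallery
# hypothetical parameters that spawn near-duplicate pages
Disallow: /*&print=1
Disallow: /*&image_id=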

This problem is covered in a bit more detail in "Professional Search Engine Optimization with PHP: A Developer's Guide to SEO".

But as I said, when I originally made that post, I wasn't thinking straight. So it doesn't make much sense.

Thanks

Kieran
 