php.allstar

New Member
Hi,

We have been advised to create/add an XML Sitemap for our site.

Can anyone recommend a good program to automatically run, generate the sitemap and upload it to our server?

I was thinking about GSiteCrawler but I'm not sure if that can be scheduled.

Any thoughts?

Thanks.
 

Kieran

Guest
Would be interested to see such a tool but have never been comfortable with offline services doing the work.

For a WordPress blog you can use a plugin that does it automatically, and I guess there are similar tools for other CMSes.

How often will you be adding pages to the site? If it's only once a week or so, it is often easier to just do it manually and upload the sitemap. So when you add a page you just update the XML and away you go. Easy peasy... (as he sneaks off to update a couple of his own that this post reminded him to do :))
 

php.allstar

New Member
Thanks Kieran,

I'll be adding brand new pages and mini-apps about once or twice a week. But seeing that this is a dynamic site with 4000+ pages, we have pages that are modified/removed/added on a daily basis, so I think the sitemap has to be created and uploaded daily? Am I wrong on this?

It turns out GSiteCrawler can be scheduled to create and upload XML sitemaps automatically.

I would have liked to run it from our bare-bones Linux development box, but it seems I'll have to run it on my Windows workstation as it's a Windows exe.

I don't want it consuming CPU and RAM on my workstation while I'm working during the day, so I'll have to schedule it for every night, which means leaving my workstation powered on, something that I never do.

Maybe if I put my workstation into standby mode, the Windows scheduler will wake it, run GSiteCrawler, save a log file on my desktop for the following morning and then power my workstation off.
 

hydrosylator

New Member
I can't actually post URLs just yet, but I highly recommend looking at Google's list of sitemap generators.
Enter "Sitemap Generators A collection of links to tools and code snippets that generate Sitemap files" into Google and click on the first link, which should be on the code[dot]google[dot]com site.

There are free and commercial ones that can be used both online and offline. I'd be inclined to use a server-side one.
 

Kieran

Guest
If the content of the pages is being updated then you shouldn't have to update daily/nightly, as the sitemap just says "I have a page on my site and this is the name of it".

In my usual contradictory way, of course it would be ideal to have it done automatically if you are the forgetful type, but again only if the new page has something that you want picked up immediately.

hope this helps

Kieran
 

link8r

New Member
If you have so many pages that it will chew up lots of processor time creating the sitemap, then I suggest using multiple sitemaps - split off different sections of the site, or if you have sub-sites within your site for different languages/regions, put them into different sitemaps.

Bear in mind that sitemaps are there to help both Google and your hosting bandwidth, so having a really huge sitemap that takes a while to download may defeat the purpose.

While you are at it, include a custom 404 error page with the Google widget code that lets it remove missing URLs too.

You could also create a custom sitemap crawler on the server and use a cron job to schedule it?
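
Something along these lines would do it as a rough sketch - the section names, the getUrlsForSection() helper and the example.com domain are all placeholders, so you would swap in however you actually pull your own URLs:

PHP:
<?php
// build_sitemaps.php - rough sketch of a cron-driven, server-side generator.
// Example crontab entry to rebuild the files at 3am each night:
//   0 3 * * * /usr/bin/php /var/www/scripts/build_sitemaps.php

// Hypothetical helper: replace with real queries against your own data.
function getUrlsForSection($section)
{
    return array(); // e.g. pull the active URLs for this section from a database
}

$sections = array('courses', 'articles', 'offers'); // placeholder section names
$baseUrl  = 'http://www.example.com';               // placeholder domain

$index  = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
$index .= '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";

foreach ($sections as $section) {
    $xml  = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
    $xml .= '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";

    foreach (getUrlsForSection($section) as $url) {
        $xml .= '  <url><loc>' . htmlspecialchars($url) . '</loc></url>' . "\n";
    }

    $xml .= '</urlset>' . "\n";

    // One gzipped sitemap per section keeps each file small.
    file_put_contents("sitemap-$section.xml.gz", gzencode($xml, 9));

    $index .= "  <sitemap><loc>$baseUrl/sitemap-$section.xml.gz</loc></sitemap>\n";
}

$index .= '</sitemapindex>' . "\n";
file_put_contents('sitemap-index.xml', $index);

Run from cron it rebuilds one gzipped sitemap per section plus a sitemap index every night, so nothing depends on a workstation being left switched on.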
 

php.allstar

New Member
Thanks guys

If the content of the pages is being updated then you shouldn't have to update daily/nightly, as the sitemap just says "I have a page on my site and this is the name of it".

This is a UK golfing website. We have about 6 different information pages for around 320 golf courses in the UK.

Now this may be flawed in terms of SEO, but if a course becomes inactive on our web services it becomes inactive on our site (the 6 unique pages for that course are no longer available) and the user requesting the page is given a custom 404.

When that course is set to active again, the 6 unique pages for that course are available again, no more 404 for the user.

This also happens when courses are removed from our site or have just joined.

This is so sporadic: 1 course a month might leave, 2 courses a week might be set to inactive, 3 courses a week may join. Because this activity is all over the place, I feel as if I have to run GSiteCrawler every night. I don't want to have to generate and upload the file on an as-it-happens basis; I think this would be too much work (I'm a developer, not an SEO'er!)

If you have so many pages that it will chew up lots of processor time creating the sitemap, then I suggest using multiple sitemaps - split off different sections of the site, or if you have sub-sites within your site for different languages/regions, put them into different sitemaps.

Bear in mind that sitemaps are there to help both Google and your hosting bandwidth, so having a really huge sitemap that takes a while to download may defeat the purpose.

I could let it run during the day, like I have done today on my first run, which took about 30 mins. You know yourself, I'm greedy with my CPU and RAM, I just don't like other applications slowing my workstation down. (Not that it was too noticeable today!) Running at night was just an idea, but in hindsight, that would be bad for the environment!

I wouldn't call the sitemap huge. It's just the time it takes to generate the XML sitemap and the fact that it consumes some RAM and CPU that are the issues! (Beggars can't be choosers!) The raw XML version of the sitemap (4000+ pages) is 858KB, the gzipped version is 41KB.

Does Google use the gzipped version?

Is there an optimum file size for the XML sitemap that won't banjax Google and our bandwidth?
 

link8r

New Member
The idea is that lots of small files download easier than a big file - simple timeout concept.

Making them inactive - is that because they are no longer a client or because the course isn't accessible? Why not just keep the URL/page and forward to the home page, or display a message that the course is no longer active?

Why not group the courses by region for the purposes of a sitemap?

That way your trigger is when a course becomes active/inactive: you rebuild the sitemap.

BUT REMEMBER: just because you create a sitemap doesn't mean Google will index your site - your site is on a Google-dictated crawl cycle, which could be weekly or monthly... so you could be generating 4 sitemaps for every 1 that Google actually reads, which is why you need that 404 widget so much...

Official Google Webmaster Central Blog: Make your 404 pages more useful
 

jmcc

Active Member
Does Google use the gzipped version?
Yes. So does Yahoo (I think). You could write your own script for a sitemap generator instead of using an off-the-shelf one. If you think that a 4000-page sitemap is bad, I've just finished working on a preliminary one for 9.39 million pages. Google has downloaded it, but spidering it will take a while. Yahoo is downloading the gzipped sitemap files at the moment. Microsoft's Bing is missing in action as usual.

It might be possible to set up a database table with the page name, page URL and state (active/deleted) and use this to generate your sitemap via a PHP script or similar. I think that WordPress might have the last modified date of a page in its database schema. Most of the server load is probably due to all the database calls made by WordPress for each individual page; crawling a site page by page is a very inefficient way of generating a sitemap, and some of those online sitemap generators are better suited to simple, static websites.
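
A rough sketch of what I mean, assuming a hypothetical pages table with url, state and last_modified columns (the table, column and connection details below are illustrative only):

PHP:
<?php
// sitemap.php - rough sketch only; table and column names are illustrative.
// Assumes a table `pages` with columns: url, state, last_modified.

$pdo = new PDO('mysql:host=localhost;dbname=mysite', 'dbuser', 'dbpass');

$stmt = $pdo->query("SELECT url, last_modified FROM pages WHERE state = 'active'");

$xml  = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
$xml .= '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";

while ($row = $stmt->fetch(PDO::FETCH_ASSOC)) {
    $xml .= "  <url>\n";
    $xml .= '    <loc>' . htmlspecialchars($row['url']) . "</loc>\n";
    $xml .= '    <lastmod>' . date('Y-m-d', strtotime($row['last_modified'])) . "</lastmod>\n";
    $xml .= "  </url>\n";
}

$xml .= '</urlset>' . "\n";

// Write both the plain and gzipped versions.
file_put_contents('sitemap.xml', $xml);
file_put_contents('sitemap.xml.gz', gzencode($xml, 9));

Run it from cron, or whenever a page changes state, and the sitemap stays in step with the database without any crawling at all.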

Is there an optimum file size for the XML sitemap that won't banjax Google and our bandwidth?
41KB is smaller than a lot of web pages these days.

Regards...jmcc
 

php.allstar

New Member
The idea is that lots of small files download easier than a big file - simple timeout concept.

I'm not sure if I can agree with you on that one... I've been using the YSlow addon for Firebug on Firefox for over a year now, and it suggests combining files of the same filetype into one single file to cut down on HTTP requests, which in turn improves load speed.

YSlow Addon For Firebug said:
Decreasing the number of components on a page reduces the number of HTTP requests required to render the page, resulting in faster page loads. Some ways to reduce the number of components include: combine files, combine multiple scripts into one script, combine multiple CSS files into one style sheet, and use CSS Sprites and image maps.

Granted, this is with respect to JavaScript and CSS, but I assume the same thinking can be applied to XML sitemaps?

Making them inactive - is that because they are no longer a client or because the course isn't accessible? Why not just keep the URL/page and forward to the home page, or display a message that the course is no longer active?

Good idea - how's this page: Inactive Course! - (346) | teetimes.co.uk

We set them to inactive when they do not have any tee times to offer for a few weeks, due to being booked out or not giving us inventory.

Why not group the courses by region for the purposes of a sitemap?

That way your trigger is when a course becomes active/inactive: you rebuild the sitemap.

Yeah, I guess I could create a batch file that triggers GSiteCrawler, which our admin staff can run once they set a course to active/inactive.

We used to have the courses grouped into 11 UK regions, now we have them grouped into 41 counties/areas.

...you need that 404 widget so much...

I'm quite proud of our 404 page, http://www.teetimes.co.uk/404 - maybe I could add that Google search box in there too, I just don't want the "closest match" part or the "other things to try" part...

...You could write your own script for a sitemap generator instead of using an off-the-shelf one.

I had thought of that, but because the site uses the web services from our administration application (which was written in ASPX by another developer), we don't have an operation in there to generate a list of inactive courses. I guess I should request that as a feature enhancement to our web services.

Thanks for all the help, this has been a real eye opener!
 

MickRegan

New Member
to JMCC...

In Google's words: "A Sitemap should contain a list of your site's URLs—up to 50,000 of them. If you have a large site with more than 50,000 URLs, you should create multiple Sitemaps and submit a 'Sitemap Index File'"

Maybe that's what you're doing, but just in case!
 

jmcc

Active Member
to JMCC...

In Google's words: "A Sitemap should contain a list of your site's URLs—up to 50,000 of them. If you have a large site with more than 50,000 URLs, you should create multiple Sitemaps and submit a 'Sitemap Index File'"

Maybe that's what you're doing, but just in case!
Yep. I read the docs. :) So far it is around 202 sitemap files. The domain data hasn't been submitted yet as it is approximately 236M pages of domain history. Google seems to be the fastest at picking up the sitemaps. Yahoo tends to take a bit longer. Microsoft Bing has yet to act on them.

Regards...jmcc
 

jmcc

Active Member
Granted, this is with respect to JavaScript and CSS, but I assume the same thinking can be applied to XML sitemaps?
Gzip reduces the size of sitemap files, and your number of URLs wouldn't max out a single sitemap file anyway.


I had thought of that, but because the site uses the web services from our administration application (which was written in ASPX by another developer), we don't have an operation in there to generate a list of inactive courses. I guess I should request that as a feature enhancement to our web services.
Probably the best way to do it, as it would not require the entire site being whacked to generate the sitemap. The trick with any database-backed website is to automate anything that can be automated.

Regards...jmcc
 

php.allstar

New Member
Gzip reduces the size of sitemap files, and your number of URLs wouldn't max out a single sitemap file anyway.

LOL, thanks but I know how Gzip works!;)


...The trick with any database-backed website is to automate anything that can be automated...

I'm all about automation - I've been creating PHP scripts to replace humans for donkey's years now! The thing is, we have one central admin application written in ASPX (I haven't got a clue about ASPX!) by another developer that the company brought in before I joined. This admin application stores all the info in an MSSQL database. This developer also created our web services, which in turn serve data to teetimes.co.uk (along with loads of other websites we own and third-party websites).

TeeTimes.co.uk is made up of lots of PHP pages and scripts that I created, which all pull info in from the web services. If this were all stored in MySQL, I'd have created my own XML sitemap generator. My hands are tied in what I can get out of the web services at the moment.

But hey, thanks for all the comments, I've picked up quite a few things.
 

jmcc

Active Member
LOL, thanks but I know how Gzip works!;)
When someone mentions ASP, you can never be sure. :)

This admin application stores all the info in an MSSQL database.
The main thing is to get the schema for this database and look at how it is constructed. There may be something there that you can use as the basis for the sitemap generator.
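
For example, if there is something like a courses table with an active flag in there, a short PHP script could read it directly through PDO (the sqlsrv or dblib driver, depending on the platform). The table and column names below (dbo.Courses, Slug, IsActive) are pure guesses, so check them against the real schema first:

PHP:
<?php
// Rough sketch only: lists the URLs an active-courses sitemap would need,
// straight from the admin database. Table/column names are guesses, and
// the DSN assumes the PDO sqlsrv driver is available.

$pdo = new PDO('sqlsrv:Server=dbserver;Database=AdminDB', 'dbuser', 'dbpass');

$stmt = $pdo->query("SELECT Slug FROM dbo.Courses WHERE IsActive = 1");

while ($row = $stmt->fetch(PDO::FETCH_ASSOC)) {
    // Each active course has six pages; echo whichever paths apply on the site.
    echo 'http://www.teetimes.co.uk/' . $row['Slug'] . "\n";
}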

Regards...jmcc
 

markkh

New Member
I would recommend xml-sitemaps for fewer than 500 pages, or this WonderWebWare one: wonderwebware.com/sitemap-generator/ Sitemap Generator. The good thing about the latter is that it creates an HTML sitemap as well, which you can publish on your site. Helpful for your visitors as well ;)
 

php.allstar

New Member
Yeah, I must admit I'm quite impressed by GSiteCrawler. It too has the ability to create plain old HTML sitemaps.

Another nice feature is the ability to export to csv with numerous layout options available.

I had reported that the site consisted of over 4000 pages. It turns out that there were some pages in there which were getting crawled with search query parameters on the end of the URL. I was able to add a filter to GSiteCrawler to stop it from crawling these pages and including them in the sitemap. In the end, the site now has ~2,500 pages.

Because GSiteCrawler was able to find pages with search query parameters on the end of the URL, I'm quite concerned that Google will also find these and we could be penalised for duplicate content!

Does anyone know if Google penalises something like this...

Normal link - Poulton Park Golf Club Tee Times | teetimes.co.uk

Parameter Link - Poulton Park Golf Club Tee Times | teetimes.co.uk

Both links point to the same page, though the content is slightly different on the latter.

Is duplicate content where the same page is accessible at different URLs, or where the content within the two pages is exactly the same?

Sorry I don't know if I've even worded this correctly, I'm a complete n00b at SEO, and sorry for hijacking my own thread!
 