Quote:
Originally Posted by dazzling
Thank you for all that info. It would be for a small country, something like your next door neighbour NZ. I would not index porn, so that should reduce the load. Something where I can add the websites myself, I think. You're right, I might need as much as 100 million pages. I did see some open source such as Xapian but have no idea. As you said, it's all mind boggling, but I need to start somewhere I guess.
You may not index porn, but you will be crawling some of it to a certain extent. I assume you will just crawl a ccTLD namespace, e.g. .nz or .uk. In that case you also need to think about sites which fall outside the namespace: some large sites have ccTLD domains that just redirect you to something like nz.bigcompany.com. You'd need some way of dealing with that, or your index would be fairly incomplete at the top end.
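Just to make that concrete, here is a minimal sketch of a crawl-scope check, assuming a hypothetical .nz crawl. The EXTRA_HOSTS whitelist and nz.bigcompany.com are invented placeholders you'd populate yourself, not a real list.

```python
# Minimal crawl-scope check for a ccTLD-focused crawler (illustrative only).
from urllib.parse import urlparse

CCTLD_SUFFIX = ".nz"

# Hosts outside the ccTLD that still carry country-specific content.
# This would be built up by hand or from seed data; the entry below is made up.
EXTRA_HOSTS = {
    "nz.bigcompany.com",
}

def in_crawl_scope(url: str) -> bool:
    """Return True if the URL belongs to the index's target namespace."""
    host = (urlparse(url).hostname or "").lower()
    if host.endswith(CCTLD_SUFFIX):
        return True
    return host in EXTRA_HOSTS

if __name__ == "__main__":
    print(in_crawl_scope("https://www.example.co.nz/page"))      # True
    print(in_crawl_scope("https://nz.bigcompany.com/products"))  # True
    print(in_crawl_scope("https://www.bigcompany.com/us/"))      # False
```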
While thinking about architecture, a good way to look at things is that you have various components which go together to form a search platform: a crawler, an indexer, a database management component, and the data set itself. One program won't handle the whole thing, so it won't be a script as suggested in your original question; it will be a series of programs running independently within a system.
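To show what "a series of programs" means in practice, here is a rough sketch of a crawler step and an indexer step that share nothing except a common data store. The table and function names are invented for illustration, and SQLite stands in for whatever storage you'd actually use; in a real platform each piece would be its own long-running program.

```python
# Sketch of two independent components handing data off through a shared store.
import sqlite3

DB_PATH = "search_platform.db"  # hypothetical shared data set

def init_store() -> None:
    with sqlite3.connect(DB_PATH) as db:
        db.execute("""CREATE TABLE IF NOT EXISTS pages (
                          url TEXT PRIMARY KEY,
                          body TEXT,
                          indexed INTEGER DEFAULT 0)""")

def crawler_store_page(url: str, body: str) -> None:
    """Would run inside the crawler program: persist a fetched page."""
    with sqlite3.connect(DB_PATH) as db:
        db.execute(
            "INSERT OR REPLACE INTO pages (url, body, indexed) VALUES (?, ?, 0)",
            (url, body),
        )

def indexer_pass() -> dict:
    """Would run as a separate indexer program: build a toy inverted index
    from pages the crawler has stored but not yet indexed."""
    index: dict = {}
    with sqlite3.connect(DB_PATH) as db:
        rows = db.execute("SELECT url, body FROM pages WHERE indexed = 0").fetchall()
        for url, body in rows:
            for word in body.lower().split():
                index.setdefault(word, set()).add(url)
        db.execute("UPDATE pages SET indexed = 1 WHERE indexed = 0")
    return index

if __name__ == "__main__":
    init_store()
    crawler_store_page("https://www.example.co.nz/", "kiwi search engine example")
    print(indexer_pass().get("kiwi"))
```

The point of the sketch is the separation: the crawler only writes pages, the indexer only reads pages and builds the index, and either one can be stopped, restarted or scaled without touching the other.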