One observation: 10 million pages is not many. If your engine is country specific, surely you would need to index more than that? Keep in mind that your published index is not necessarily the same size as your crawl.
We run a porn-only search engine: we index millions of pages, but we crawl many times that number and discard the pages that are not porn sites. The web space is really big. Netcraft estimated there were 152 million sites at the start of 2008 and 182 million later that same year, and the net has grown a lot since then, so according to Netcraft there would be about 255 million sites by the end of 2010. That number is sites; some sites have hundreds of pages or more, so the crawling effort is mind-bogglingly huge.
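To make the crawl-versus-index distinction concrete, here is a minimal sketch of that kind of crawl-and-filter loop. The keyword set and the `is_on_topic` check are placeholders I made up for illustration; a real engine would use a proper topical classifier, but the shape is the same: count everything you fetch, publish only the pages that pass the filter.

```python
# Hypothetical keyword list standing in for a real topical classifier.
TOPIC_KEYWORDS = {"keyword1", "keyword2"}


def is_on_topic(page_text: str) -> bool:
    """Crude stand-in for a classifier: keep a page only if it
    mentions at least one topic keyword."""
    words = set(page_text.lower().split())
    return bool(words & TOPIC_KEYWORDS)


def crawl(pages):
    """Walk over (url, text) pairs, count everything crawled,
    but add only on-topic pages to the published index."""
    crawled_count = 0
    index = {}
    for url, text in pages:
        crawled_count += 1
        if is_on_topic(text):
            index[url] = text  # the published index is a subset of the crawl
    return crawled_count, index


if __name__ == "__main__":
    sample = [
        ("http://example.com/a", "keyword1 and other things"),
        ("http://example.com/b", "nothing relevant here"),
    ]
    crawled, index = crawl(sample)
    print(f"crawled {crawled} pages, indexed {len(index)}")
```

The gap between those two numbers is exactly why "pages indexed" understates the crawling effort.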
You will also need to think about what tools you will need to manage such a huge data set. We use 3D modelling to map out the web space as we see it; it lets us spot holes in our crawling coverage and then address them with updates to our crawler code.
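I can't speak for how the 3D map works, but a much simpler text-based version of the same idea is to flag hosts that your crawled pages link to often yet you have never fetched. A rough sketch, with the threshold and example URLs being assumptions of mine:

```python
from collections import Counter
from urllib.parse import urlparse


def find_crawl_holes(crawled_urls, discovered_links, min_refs=3):
    """Return hosts that crawled pages link to at least `min_refs` times
    but that the crawler has never visited, most-referenced first."""
    crawled_hosts = {urlparse(u).netloc for u in crawled_urls}
    referenced = Counter(urlparse(u).netloc for u in discovered_links)
    holes = [
        (host, count)
        for host, count in referenced.items()
        if host not in crawled_hosts and count >= min_refs
    ]
    return sorted(holes, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    crawled = ["http://siteA.example/page1", "http://siteB.example/page2"]
    links = ["http://siteC.example/x"] * 4 + ["http://siteA.example/y"]
    print(find_crawl_holes(crawled, links))  # siteC.example shows up as a hole
```

Whatever form it takes, you want some tooling that tells you where your crawler isn't going.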
You will probably find that once you choose a tool set, you will want to alter it to suit your specific needs. We started with an existing crawler and indexer code base and changed it to fit our needs; what we ended up with bears little resemblance to what we started with. So you will need programmer support to keep your tools doing what you need them to do.