04-21-2011, 08:13 AM
dazzling
Confirmed User
 
Join Date: Nov 2002
Posts: 579
Quote:
Originally Posted by AdultKing View Post
One observation: 10 million pages is not many. If it's country-specific, then surely you would need to index more than that? Keep in mind that your published index does not necessarily equal the number of sites you crawl.

We run a porn-only search engine. We index millions of pages, but we crawl many times that number and discard the rest from the index because they are not porn sites. The web space is really big: Netcraft estimated in 2008 that there were 152 million sites at the start of the year and 182 million later in the same year. The net has grown exponentially since then, so according to Netcraft there would be about 255 million at the end of 2010. That number counts sites, and some sites have hundreds of pages or more, so the crawling effort is mind-bogglingly huge.

You will also need to think about what tools you need to manage such a huge data set. We use 3D modelling to map out the web space as we see it; it enables us to detect holes in our crawling efforts and then address them with updates to our crawler code.

You will probably find that once you choose a tool set you will want to alter it to suit your specific needs. We started with a code base for a crawler and indexer and then changed it to suit our needs; what we ended up with bears little resemblance to what we started out with. So you'd need programmer support to keep your tools doing what you need them to.
Thank you for all that info. It would be for a small country, something like your next-door neighbour NZ. I would not index porn, so that should reduce the load. I think it would be something where I can add the websites myself. You're right, I might need as many as 100 million pages. I did see some open source options such as Xapian, but I have no idea about them. As you said, it's all mind-boggling, but I need to start somewhere, I guess.

I want to keep it as simple as possible, with the minimum load on a server. At this stage it may be something smaller: a state, a city, or a subject. I'm not really sure, as I'm still in the early stages of this. Any advice is appreciated.
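
To make the "crawl a lot, keep only what is in scope" idea concrete, here is a rough Python sketch of how a country-specific filter might look. It is only an illustration under my own assumptions: the function names `in_scope` and `filter_crawl` are made up, and treating "ends in .nz" as "is an NZ site" is a simplification (it would miss NZ sites hosted on .com, for example).

```python
from urllib.parse import urlsplit


def in_scope(url, suffix=".nz"):
    """Return True if the URL's host ends with the given country suffix.

    Assumption: the TLD alone decides whether a page belongs in the index.
    """
    host = urlsplit(url).hostname or ""  # hostname is lowercased, or None
    return host.endswith(suffix)


def filter_crawl(urls, suffix=".nz"):
    """Split crawled URLs into (kept, discarded) lists, preserving order."""
    kept, discarded = [], []
    for url in urls:
        (kept if in_scope(url, suffix) else discarded).append(url)
    return kept, discarded


if __name__ == "__main__":
    crawled = [
        "http://example.co.nz/page",
        "http://example.com/nz",
        "https://Example.NZ/",
    ]
    kept, discarded = filter_crawl(crawled)
    print("kept:", kept)
    print("discarded:", discarded)
```

The point is just that the crawler can see far more URLs than the index ever stores, so the index stays small even if the crawl is large.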

Do you know of any examples of anyone using Terrier?
__________________
DuckDuckGo Search Engine

We Dont Track You

Last edited by dazzling; 04-21-2011 at 08:23 AM..