Welcome to the GoFuckYourself.com - Adult Webmaster Forum forums.

Old 04-20-2011, 10:47 PM   #1
dazzling
Confirmed User
Join Date: Nov 2002
Posts: 579
Any good search engine scripts?

Anyone know of any good search engine scripts?
__________________
DuckDuckGo Search Engine

We Dont Track You
Old 04-20-2011, 10:53 PM   #2
AdultKing
Raise Your Weapon
Join Date: Jun 2003
Location: Outback Australia
Posts: 15,601
Do you want to run a real search engine or a meta search site?
Old 04-21-2011, 04:33 AM   #3
Zorgman
Confirmed User
Join Date: Aug 2002
Location: Sydney, Australia
Posts: 6,103
Like AdultKing said, your own Google or just spider your own pages?
Old 04-21-2011, 04:48 AM   #4
u-Bob
there's no $$$ in porn
Join Date: Jul 2005
Location: icq: 195./568.-230 (btw: not getting offline msgs)
Posts: 33,063
http://www.google.com/_private/cvs/p...gle-STABLE.tgz
Old 04-21-2011, 04:58 AM   #5
sarettah
see you later, I'm gone
Join Date: Oct 2002
Posts: 14,063
Quote:
Originally Posted by u-Bob View Post
__________________
All cookies cleared!
Old 04-21-2011, 05:52 AM   #6
HarryMuff
Confirmed User
Join Date: Dec 2005
Posts: 271
thanks to everyone who participated
Old 04-21-2011, 06:12 AM   #7
dazzling
Confirmed User
Join Date: Nov 2002
Posts: 579
My own real search engine, country specific. I'm guessing that for what I'm thinking I might need to index up to 10 million pages. Any suggestions?
Old 04-21-2011, 07:44 AM   #8
AdultKing
Raise Your Weapon
Join Date: Jun 2003
Location: Outback Australia
Posts: 15,601
There are several ways to go about this.

Unless you are just making a raw index, you'd need pretty hefty bandwidth and hardware. You would want to cache pages so you can compare them for updates, and you'd also want some kind of cross-word index, which requires fast SQL. You also need a crawler unless you have some other way of obtaining an index.

One option is Lucene, now part of the Apache project.

http://lucene.apache.org/java/docs/index.html

It's all Java, but you need to supply your own crawler agents. Hardware requirements pretty much depend on how you scale it. Typically you'd build your crawlers on separate boxes from the index. However, if you use tunnels, you can run everything from one box and just farm out the jobs to specific boxes by rerouting traffic using iptables.

http://webglimpse.net/

Webglimpse is good, but it's not free and you really need to know what you are doing. It is probably better suited to document management or perhaps data mining. "The search engine (written in C) and webglimpse is the spider and indexer (primarily in Perl)". So it would be handy to know Perl, as crawler agents rarely do what you want out of the box.

Zebra, a tool used by many search engine researchers, is free to use, has its source available and can handle huge databases.

https://www.indexdata.com/zebra


There are other options if you have Java or C# development capability on a moderate scale. I can provide more options if none I have provided suit you.

It's a really big subject. Without knowing exactly what you want to achieve, it's hard to say what tool set you should be looking at.

The biggest barrier to entry here is the hardware you will need, and working out an optimal configuration is difficult. If you want to cut down on resources and still be able to scale, I would suggest having a main server to handle traffic in and out of your search platform, then redirecting particular types of traffic to slave servers not visible to the internet, e.g. one to crawl, one to index, one to run your database, and a NAS for your disk storage, which you will need a lot of: terabytes just to start with.
Old 04-21-2011, 07:54 AM   #9
AdultKing
Raise Your Weapon
Join Date: Jun 2003
Location: Outback Australia
Posts: 15,601
Here are some others

http://www.htdig.org/

http://openfts.sourceforge.net/

http://www.namazu.org/


I have plenty more. Tell us a little bit more about your expectations: what does the system need to do, and what type of index do you want to create? Do you need a crawler, or will you write one? Do you want full-text search, or keyword-based cross searching?
Old 04-21-2011, 08:01 AM   #10
HarryMuff
Confirmed User
Join Date: Dec 2005
Posts: 271
I would use Terrier. http:// terrier . org
Lucene is OK but not as easy to extend.
Although you probably won't be able to get very far using anything unless you have studied IR in depth.
Also, Links pulled.

Last edited by HarryMuff; 04-21-2011 at 08:02 AM.. Reason: Links pulled
Old 04-21-2011, 08:04 AM   #11
AdultKing
Raise Your Weapon
Join Date: Jun 2003
Location: Outback Australia
Posts: 15,601
One observation: 10 million pages is not many. If it's country specific then surely you would need to index more than that? Keep in mind that your published index does not necessarily equal the number of sites you crawl.

We run a porn-only search engine. We index millions of pages; however, we are crawling many times that number and discarding them from the index because they are not porn sites. The web space is really big: Netcraft estimated in 2008 that there were 152 million sites at the start of the year and then 182 million later in the same year. The net has grown exponentially since then, so according to Netcraft there would be about 255 million at the end of 2010. Now that number is sites, and some sites have hundreds of pages or more, so the crawling effort is mind-bogglingly huge.

You will also need to think about what tools you will need to manage such a huge data set. We use 3D modelling to map out the web space as we see it, which enables us to detect holes in our crawling efforts and then address them with updates to our crawler code.

You will probably find that once you choose a tool set you will want to alter it to suit your specific needs. We started with a code base for a crawler and indexer, then changed it to suit our needs, and what we ended up with bears little resemblance to what we started out with. So you'd need programmer support to keep your tools doing what you need them to.
Old 04-21-2011, 08:08 AM   #12
AdultKing
Raise Your Weapon
Join Date: Jun 2003
Location: Outback Australia
Posts: 15,601
Quote:
Originally Posted by HarryMuff View Post
I would use Terrier. http:// terrier . org
Yes, I agree Terrier is also a good place to start. There is a fairly wide user base, so getting help is not so difficult, and it's also quite extensible. It's all Java, and you would want to have a pretty good grip on Java to implement your own Terrier-based system.
Old 04-21-2011, 08:13 AM   #13
dazzling
Confirmed User
Join Date: Nov 2002
Posts: 579
Quote:
Originally Posted by AdultKing View Post
One observation: 10 million pages is not many. If it's country specific then surely you would need to index more than that? Keep in mind that your published index does not necessarily equal the number of sites you crawl.

We run a porn-only search engine. We index millions of pages; however, we are crawling many times that number and discarding them from the index because they are not porn sites. The web space is really big: Netcraft estimated in 2008 that there were 152 million sites at the start of the year and then 182 million later in the same year. The net has grown exponentially since then, so according to Netcraft there would be about 255 million at the end of 2010. Now that number is sites, and some sites have hundreds of pages or more, so the crawling effort is mind-bogglingly huge.

You will also need to think about what tools you will need to manage such a huge data set. We use 3D modelling to map out the web space as we see it, which enables us to detect holes in our crawling efforts and then address them with updates to our crawler code.

You will probably find that once you choose a tool set you will want to alter it to suit your specific needs. We started with a code base for a crawler and indexer, then changed it to suit our needs, and what we ended up with bears little resemblance to what we started out with. So you'd need programmer support to keep your tools doing what you need them to.
Thank you for all that info. It would be for a small country, something like your next-door neighbour NZ. I would not index porn, so that should reduce the load. Something where I can add the websites myself, I think. You're right, I might need as much as 100 million pages. I did see some open source options such as Xapian but have no idea about them. As you said, it's all mind-boggling, but I need to start somewhere, I guess.

I want to keep it as simple as possible with the minimum load on a server. At this stage it may be something smaller, a state or a city, or a subject. Not really sure. Still in the early stage of this. Any advice is appreciated.

You know any examples of anyone using Terrier?
Last edited by dazzling; 04-21-2011 at 08:23 AM..
Old 04-21-2011, 08:23 AM   #14
AdultKing
Raise Your Weapon
Join Date: Jun 2003
Location: Outback Australia
Posts: 15,601
Quote:
Originally Posted by dazzling View Post
Thank you for all that info. It would be for a small country, something like your next-door neighbour NZ. I would not index porn, so that should reduce the load. Something where I can add the websites myself, I think. You're right, I might need as much as 100 million pages. I did see some open source options such as Xapian but have no idea about them. As you said, it's all mind-boggling, but I need to start somewhere, I guess.
You may not index porn, but you will be crawling it to a certain extent. I assume you will just crawl a ccTLD name space, e.g. .nz or .uk. In that case you also need to think about sites which fall outside the namespace: for example, some large sites have ccTLD domains but end up at nz.bigcompany.com. You'd need to have some way of dealing with that or your index would be fairly incomplete at the top end.

While thinking about architecture, a good way to look at things is that you have various components which go together to form a search platform. You have a crawler, indexer, database management component and the data set itself. One program won't handle the whole thing, so it won't be a script as suggested in your original question, it will be a series of programs running independently within a system.
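
The ccTLD scoping problem raised in this post (a .nz crawl that also has to catch hosts like nz.bigcompany.com) could be handled by a URL filter in the crawler. What follows is only a rough sketch in Python: the function name `in_scope`, the `allowlist` parameter and the three rules are assumptions made for illustration, not part of any tool mentioned in the thread.

```python
from urllib.parse import urlsplit

def in_scope(url, cctld="nz", allowlist=frozenset()):
    """Return True if a URL belongs in a country-specific crawl (hypothetical rules)."""
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    if not host:
        return False
    labels = host.split(".")
    # Rule 1: the host sits under the country-code TLD, e.g. example.co.nz
    if labels[-1] == cctld:
        return True
    # Rule 2: the country code is used as a subdomain, e.g. nz.bigcompany.com
    if len(labels) > 2 and labels[0] == cctld:
        return True
    # Rule 3: hosts that were added by hand
    return host in allowlist

for candidate in ("http://www.example.co.nz/page",
                  "http://nz.bigcompany.com/",
                  "http://www.bigcompany.com/"):
    print(candidate, in_scope(candidate))
```

Rule 2 will let through some false positives, so in practice the hand-maintained allowlist would probably carry most of the out-of-namespace sites.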
Old 04-21-2011, 08:33 AM   #15
AdultKing
Raise Your Weapon
Join Date: Jun 2003
Location: Outback Australia
Posts: 15,601
Quote:
Originally Posted by dazzling View Post

You know any examples of anyone using terrier?
You might want to look at a fork of Terrier (well, a branch rather than a fork) called Terraneo: http://distterr.wordpress.com/2010/0...search-engine/

Read this, it's probably one of the best comparisons of open source search engines you could read:

http://wrg.upf.edu/WRG/dctos/Middleton-Baeza.pdf
Old 04-21-2011, 09:01 AM   #16
AdultKing
Raise Your Weapon
Join Date: Jun 2003
Location: Outback Australia
Posts: 15,601
Quote:
Originally Posted by dazzling View Post
I want to keep it as simple as possible with the minimum load on a server.
Just to save you time: you will need more than one server. Even if you use a homogeneous search engine/crawler, you won't get away with even a 100 million page index on just one server, unless you plan on only crawling, say, 10,000 pages a day. Think of the processing needed for things like clone detection, caching, etc.
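
To put rough numbers on that, here is a back-of-the-envelope sketch in Python. The 100 million page index and the 10,000 pages a day rate come from this thread; the 30-day refresh cycle is purely an assumption for illustration. At 10,000 pages a day, a single pass over 100 million pages takes 10,000 days, which is roughly 27 years.

```python
def days_for_full_pass(total_pages, pages_per_day):
    """Days needed to fetch every page once at a steady crawl rate."""
    return total_pages / pages_per_day

def rate_for_refresh(total_pages, refresh_days):
    """Pages per day needed to revisit every page within refresh_days."""
    return total_pages / refresh_days

INDEX_SIZE = 100_000_000   # the 100 million page figure from the thread
SLOW_RATE = 10_000         # the 10,000 pages a day figure from the thread
REFRESH_DAYS = 30          # assumption: revisit everything about once a month

days = days_for_full_pass(INDEX_SIZE, SLOW_RATE)
print(f"one full pass: {days:,.0f} days (about {days / 365:.0f} years)")

per_day = rate_for_refresh(INDEX_SIZE, REFRESH_DAYS)
print(f"monthly refresh: {per_day:,.0f} pages/day "
      f"(about {per_day / 86_400:.0f} pages/second)")
```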

Last edited by AdultKing; 04-21-2011 at 09:03 AM..
Old 04-21-2011, 09:44 AM   #17
MrGusMuller
Confirmed User
Join Date: Oct 2010
Location: Portugal
Posts: 1,262
bookmarked
__________________
StagCMS - Adult CMS - user friendly adult content management system - speed up your websites with no SQL connections
ICQ: 63*23*43*113

Old 04-21-2011, 10:16 AM   #18
Agent 488
Registered User
Join Date: Feb 2006
Posts: 22,511
Sorry for the threadjack, but what about a search engine that would only search certain sites (not one's own) or maybe certain file types? Something off the shelf?
Old 04-21-2011, 10:22 AM   #19
nextri
Confirmed User
Join Date: May 2004
Location: Norway
Posts: 1,661
http://www.sphinxsearch.com is awesome!
__________________
DivaTraffic - Traffic for Models
Old 04-21-2011, 10:27 AM   #20
AdultKing
Raise Your Weapon
Join Date: Jun 2003
Location: Outback Australia
Posts: 15,601
Quote:
Originally Posted by Agent 488 View Post
Sorry for the threadjack, but what about a search engine that would only search certain sites (not one's own) or maybe certain file types? Something off the shelf?
Good question. There are search platforms suited to searching certain file types; for example, mnoGoSearch was good at finding mp3 files. However, scalability is a huge problem with that architecture, given that mapping the entire web and finding all mp3 files would be a herculean task.

Searching a group of sites is easier: if you know the sites you want to index then you don't have to discover the sites in the first place. If, however, you mean searching types of sites, this is the effort that our search engine is working on, simply indexing porn sites and almost nothing else. The crawling effort is huge, but the indexing effort is less of an issue as many sites are discarded because they are not porn sites.

Off the shelf is where things get difficult, because the differences in search efforts mean that anything off the shelf needs a certain amount of customization.

Last edited by AdultKing; 04-21-2011 at 10:28 AM..
Old 04-21-2011, 10:31 AM   #21
AdultKing
Raise Your Weapon
Join Date: Jun 2003
Location: Outback Australia
Posts: 15,601
Quote:
Originally Posted by nextri View Post
For a large subset of the web, though? I always considered Sphinx better for more localized search efforts.
Old 04-21-2011, 10:40 AM   #22
Agent 488
Registered User
Join Date: Feb 2006
Posts: 22,511
Well, stay with just searching certain sites. There are cheap scripts that search certain tube sites and embed the videos on your site.

Quote:
Originally Posted by AdultKing View Post
Good question. There are search platforms suited to searching certain file types; for example, mnoGoSearch was good at finding mp3 files. However, scalability is a huge problem with that architecture, given that mapping the entire web and finding all mp3 files would be a herculean task.

Searching a group of sites is easier: if you know the sites you want to index then you don't have to discover the sites in the first place. If, however, you mean searching types of sites, this is the effort that our search engine is working on, simply indexing porn sites and almost nothing else. The crawling effort is huge, but the indexing effort is less of an issue as many sites are discarded because they are not porn sites.

Off the shelf is where things get difficult, because the differences in search efforts mean that anything off the shelf needs a certain amount of customization.
Old 04-21-2011, 10:44 AM   #23
AdultKing
Raise Your Weapon
Join Date: Jun 2003
Location: Outback Australia
Posts: 15,601
Quote:
Originally Posted by Agent 488 View Post
Well, stay with just searching certain sites. There are cheap scripts that search certain tube sites and embed the videos on your site.
That's pretty easy to do though, because you're not having to deal with massive amounts of data, well not as massive as indexing a subset of the web.

That would be the type of task you could employ Sphinx or Zebra to do quite trivially.
Old 04-21-2011, 10:45 AM   #24
Agent 488
Registered User
Join Date: Feb 2006
Posts: 22,511
Yeah, not looking for a programmer atm, just wondering if there is anything off the shelf.
Old 04-21-2011, 10:47 AM   #25
AdultKing
Raise Your Weapon
Join Date: Jun 2003
Location: Outback Australia
Posts: 15,601
Quote:
Originally Posted by Agent 488 View Post
Yeah, not looking for a programmer atm, just wondering if there is anything off the shelf.
Sphinx + wget

http://www.ivinco.com/blog/using-wge...search-engine/
Old 04-21-2011, 09:51 PM   #26
dazzling
Confirmed User
Join Date: Nov 2002
Posts: 579
Thank you again for all this information, AdultKing, you really know your stuff. I was thinking about this and think I'll start with something small as a trial-and-error sort of thing. What I want to do is add the websites to the engine myself, not have any sort of crawler. What do you recommend for that?
Old 04-21-2011, 10:06 PM   #27
Serge Litehead
Confirmed User
Join Date: Dec 2002
Location: Behind the scenes
Posts: 5,190
dazzling, maybe you need something like a directory script then, and just add and organize links. With no sorting and filtering of links you could probably get away with a basic WP install.
Old 04-21-2011, 10:07 PM   #28
cam_girls
So Fucking Banned
Join Date: Apr 2009
Posts: 2,968
Google, Yahoo, etc. use economies of scale.

You can't search 5 billion pages in 0.02 seconds on 1 server. The query is split over the whole network, distributed processing.

It takes the same total computer power to handle 200 million queries a day, but by using parallel processing you get the results 1000 times quicker. Each server tackles under a million web pages.

10 million pages is about the limit for a single server search engine.
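
The "query is split over the whole network" idea above is usually called scatter-gather: each server searches only its own slice of the index and a front end merges the partial results. A toy sketch in Python follows. The in-memory dictionaries standing in for servers, the function names and the scores are all made up for illustration, and threads stand in for network calls.

```python
from concurrent.futures import ThreadPoolExecutor
import heapq

def search_shard(shard, term, k):
    """Return the top-k (score, url) pairs from one shard's slice of the index."""
    postings = shard.get(term, {})
    return heapq.nlargest(k, ((score, url) for url, score in postings.items()))

def search_all(shards, term, k=3):
    """Scatter the query to every shard in parallel, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        partials = list(pool.map(lambda shard: search_shard(shard, term, k), shards))
    merged = [hit for part in partials for hit in part]
    return heapq.nlargest(k, merged)

# Hypothetical shards: term -> {url: relevance score}
shards = [
    {"rugby": {"a.example/1": 0.9, "a.example/2": 0.2}},
    {"rugby": {"b.example/7": 0.7}},
    {"cricket": {"c.example/3": 0.5}},
]

print(search_all(shards, "rugby", k=2))
```

Each shard only has to return its own top k, so the merge step stays small no matter how many pages sit behind each shard.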
Old 04-22-2011, 12:39 AM   #29
AdultKing
Raise Your Weapon
Join Date: Jun 2003
Location: Outback Australia
Posts: 15,601
Quote:
Originally Posted by dazzling View Post
Thank you again for all this information, AdultKing, you really know your stuff. I was thinking about this and think I'll start with something small as a trial-and-error sort of thing. What I want to do is add the websites to the engine myself, not have any sort of crawler. What do you recommend for that?
There is a PHP search engine called Sphider that you could use to add sites yourself. However, this seems a lot more limited than what you were originally proposing to do. The link is http://www.sphider.eu/

Sphider of course won't scale. You will be able to index a few thousand sites reliably, but beyond that it will begin to break down. Sphider certainly won't index a whole country.

At this micro level, it would be possible to add affiliate links to the search results to potentially earn money from the links clicked on from search results. The question, though, is whether people are likely to use something so limited at an end-user level.
Old 04-22-2011, 12:53 AM   #30
AdultKing
Raise Your Weapon
Join Date: Jun 2003
Location: Outback Australia
Posts: 15,601
Quote:
Originally Posted by cam_girls View Post
Google, Yahoo, etc. use economies of scale.

You can't search 5 billion pages in 0.02 seconds on 1 server. The query is split over the whole network, distributed processing.
Actually, you will find that the data sets are built in such a way that searches across big data sets can be achieved by relatively little processing power. This is all to do with how the data set is built at the indexing level. Very clever algorithms allow you to search very massive data sets very quickly using just one CPU as the back end systems do the hard work.

Quote:
It takes the same total computer power to handle 200 million queries a day, but by using parallel processing you get the results 1000 times quicker. Each server tackles under a million web pages.
That's an overly simplistic explanation. The data set is one complete entity. Preprocessing of the data set is what is key here, and there are different approaches such as tokenization, term stemming, crossword indexing, weighting and ranking. The list goes on.

While a query on the index can consume the processing power of 1000 machines, according to early reports of the Google platform, this is misleading, as the machines are built such that you are really just dealing with component systems which may be built from many "servers".

There is a very good paper about using GPUs within a search architecture at http://koala.poly.edu/GPU.pdf
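
The point about preprocessing can be illustrated with a tiny inverted index: tokenizing and stemming happen once at index time, so answering a query is just a few set lookups. The Python below is only a toy sketch. The suffix-stripping `stem` function is deliberately crude (a real system would use a proper stemmer), and the function names and sample documents are invented for the example.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def stem(token):
    """Crude suffix stripping, standing in for a real stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def build_index(docs):
    """Map each stemmed term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[stem(token)].add(doc_id)
    return index

def query(index, text):
    """Return the ids of documents containing every term in the query."""
    terms = [stem(token) for token in tokenize(text)]
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

docs = {
    1: "Crawling the web",
    2: "Web crawlers and indexes",
    3: "Indexing pages",
}
index = build_index(docs)
print(query(index, "indexing"))
print(query(index, "web indexes"))
```

Weighting and ranking would sit on top of this by storing a score alongside each document id instead of a bare set.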
Old 04-22-2011, 12:55 AM   #31
bbobby86
partners.sexier.com
Join Date: Jan 2007
Location: San Francisco, CA
Posts: 11,926
useful thread...
Old 04-22-2011, 01:00 AM   #32
dazzling
Confirmed User
Join Date: Nov 2002
Posts: 579
A few years ago I used a really nice search engine script from Fluid Dynamics:
http://www.xav.com/scripts/search/

The script was great but very limited in the number of websites you could add, and it used up way too much CPU. It was good, though, for what I was doing at the time. The problem was that the guy stopped development on the script.

I think what I would like to do is move in stages, start with a small project so I can learn, then move into something bigger later on.
Old 04-22-2011, 01:03 AM   #33
dazzling
Confirmed User
Join Date: Nov 2002
Posts: 579
Quote:
Originally Posted by AdultKing View Post
There is a PHP search engine called Sphider that you could use to add sites yourself. However, this seems a lot more limited than what you were originally proposing to do. The link is http://www.sphider.eu/

Sphider of course won't scale. You will be able to index a few thousand sites reliably, but beyond that it will begin to break down. Sphider certainly won't index a whole country.

At this micro level, it would be possible to add affiliate links to the search results to potentially earn money from the links clicked on from search results. The question, though, is whether people are likely to use something so limited at an end-user level.
Sphider looks good as a starting point for a smaller search engine before I try something bigger. Would it be comfortable with indexing 10,000 pages from say several hundred websites?
Old 04-22-2011, 01:09 AM   #34
AdultKing
Raise Your Weapon
Join Date: Jun 2003
Location: Outback Australia
Posts: 15,601
Quote:
Originally Posted by dazzling View Post
A few years ago I used a really nice search engine script from Fluid Dynamics:
http://www.xav.com/scripts/search/

The script was great but very limited in the number of websites you could add, and it used up way too much CPU. It was good, though, for what I was doing at the time. The problem was that the guy stopped development on the script.

I think what I would like to do is move in stages, start with a small project so I can learn, then move into something bigger later on.

You'll find Sphider comparable, although it is PHP-based rather than Perl.

The problem with these types of scripts is that they are scripts, and a script won't become a real search engine. You just can't do the type of things you need to do to run a search engine from a script. You need several programs running with one or more databases: at a minimum, a basic search engine platform will consist of a crawler, an indexer and a query engine.

It is possible to create a real search engine on one server, but only on a limited scale. One of our test/development servers for PornoBug indexes a realm of web space, approximately 2.5 million sites, on one Xeon server with 6TB of disk. However, the machine is running under constant load and only runs as a search engine. It does have a crawler, indexer and query interface all on the one machine. It crawls 100,000 pages a day, and sites within the realm are typically visited every 2 to 3 days.
Old 04-22-2011, 01:13 AM   #35
AdultKing
Raise Your Weapon
Join Date: Jun 2003
Location: Outback Australia
Posts: 15,601
Quote:
Originally Posted by dazzling View Post
Sphider looks good as a starting point for a smaller search engine before I try something bigger. Would it be comfortable with indexing 10,000 pages from say several hundred websites?
Try it and see. I do know that the Sphider crawler gets really slow as the data set grows.

A crawler really needs to be able to work quickly. Ours are all written in C and are very nimble; not one line of code is included unless it is absolutely necessary for the crawler to work.