#1
Confirmed User
Industry Role:
Join Date: Nov 2002
Posts: 579

Anyone know of any good search engine scripts?
#2
Raise Your Weapon
Industry Role:
Join Date: Jun 2003
Location: Outback Australia
Posts: 15,601

Do you want to run a real search engine or a meta-search site?
#3
Confirmed User
Join Date: Aug 2002
Location: Sydney, Australia
Posts: 6,103

Like AdultKing said, your own Google or just spider your own pages?
#6
Confirmed User
Join Date: Dec 2005
Posts: 271

Thanks to everyone who participated.
#7
Confirmed User
Industry Role:
Join Date: Nov 2002
Posts: 579

My own real search engine, country-specific. I'm guessing that for what I have in mind, I might need to index up to 10 million pages. Any suggestions?
#8
Raise Your Weapon
Industry Role:
Join Date: Jun 2003
Location: Outback Australia
Posts: 15,601

There are several ways to go about this.

Unless you are just making a raw index, you'll need pretty hefty bandwidth and hardware. You would want to cache pages so you can compare them for updates, and you'd also want some kind of cross-word (inverted) index, which requires fast SQL. You also need a crawler, unless you have some other way of obtaining an index.

Lucene is part of the Apache project now: http://lucene.apache.org/java/docs/index.html. It's all Java, but you need to supply your own crawler agents. Hardware requirements pretty much depend on how you scale it. Typically you'd build your crawlers on separate boxes from the index; however, if you use tunnels, you can run everything from one box and just farm the jobs out to specific boxes by rerouting traffic using iptables.

Webglimpse (http://webglimpse.net/) is good, but it's not free and you really need to know what you are doing; it's probably better as a document management solution, or perhaps data mining. "The search engine (written in C) and webglimpse is the spider and indexer (primarily in Perl)." So it would be handy to know Perl, as crawler agents rarely do what you want out of the box.

Zebra (https://www.indexdata.com/zebra), a tool used by many search engine researchers, is free to use, the source is available, and it can handle huge databases.

There are other options if you have Java or C# development capability on a moderate scale. I can provide more options if none of these suit you. It's a really big subject; without knowing exactly what you want to achieve, it's hard to say what tool set you should be looking at.

The biggest barrier to entry here is the hardware you will need. Working out an optimal configuration is difficult; however, if you want to cut down on resources and be able to scale, I would suggest having a main server to handle traffic in and out of your search platform, then redirecting particular types of traffic to slave servers not visible to the internet, e.g. one to crawl, one to index, one to run your database, plus a NAS for your disk storage, of which you will need a lot: terabytes just to start with.
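The cache-and-compare step described above can be sketched with content hashing. This is a toy illustration (the function names and data are made up, not any particular engine's implementation):

```python
import hashlib

def content_hash(html: str) -> str:
    """Stable fingerprint of a page body, used to detect changes."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()

def needs_reindex(cache: dict, url: str, html: str) -> bool:
    """Compare a freshly crawled page against its cached fingerprint.
    Only pages whose content actually changed go back to the indexer."""
    new_hash = content_hash(html)
    if cache.get(url) == new_hash:
        return False          # unchanged since last crawl: skip it
    cache[url] = new_hash     # remember the new fingerprint
    return True

cache = {}
```

In practice the fingerprints would live in your database alongside crawl timestamps, so the crawler can skip re-indexing pages that haven't changed.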
#9
Raise Your Weapon
Industry Role:
Join Date: Jun 2003
Location: Outback Australia
Posts: 15,601

Here are some others:

http://www.htdig.org/
http://openfts.sourceforge.net/
http://www.namazu.org/

I have plenty more. Tell us a little bit more about your expectations, what the system needs to do, and what type of index you want to create. Do you need a crawler, or will you write one? Do you want full-text search, or keyword-based cross searching?
#10
Confirmed User
Join Date: Dec 2005
Posts: 271

I would use Terrier: http://terrier.org

Lucene is OK but not as easy to extend, although you probably won't get very far with anything unless you have studied IR in depth.
#11
Raise Your Weapon
Industry Role:
Join Date: Jun 2003
Location: Outback Australia
Posts: 15,601

One observation: 10 million pages is not many. If it's country-specific, then surely you would need to index more than that? Keep in mind that your published index does not necessarily equal the number of sites you crawl.

We run a porn-only search engine; we index millions of pages, but we are crawling many times that number and discarding them from the index because they are not porn sites.

The web space is really big. Netcraft estimated there were 152 million sites at the start of 2008 and 182 million later that same year, and the net has grown exponentially since then, so according to Netcraft there were about 255 million at the end of 2010. That number is sites; some sites have hundreds of pages or more, so the crawling effort is mind-bogglingly huge.

You will also need to think about what tools you'll need to manage such a huge data set. We use 3D modelling to map out the web space as we see it; it enables us to detect holes in our crawling efforts and then address them with updates to our crawler code.

You will probably find that once you choose a tool set, you will want to alter it to suit your specific needs. We started with a code base for a crawler and indexer, then changed it to suit our needs; what we ended up with bears little resemblance to what we started with. So you'd need programmer support to keep your tools doing what you need them to.
#12
Raise Your Weapon
Industry Role:
Join Date: Jun 2003
Location: Outback Australia
Posts: 15,601

Yes, I agree Terrier is also a good place to start. There is a fairly wide user base, so getting help is not difficult, and it's quite extensible. It's all Java, and you would want a pretty good grip on Java to implement your own Terrier-based system.
#13
Confirmed User
Industry Role:
Join Date: Nov 2002
Posts: 579

I want to keep it as simple as possible, with the minimum load on a server. At this stage it may be something smaller: a state, a city, or a subject. I'm not really sure; this is still at an early stage. Any advice is appreciated. Do you know of any examples of anyone using Terrier?
#14
Raise Your Weapon
Industry Role:
Join Date: Jun 2003
Location: Outback Australia
Posts: 15,601

While thinking about architecture, a good way to look at things is that you have various components which go together to form a search platform: a crawler, an indexer, a database management component, and the data set itself. One program won't handle the whole thing, so it won't be a script, as suggested in your original question; it will be a series of programs running independently within a system.
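The division of labour between those components can be sketched as separate stages wired together. A toy Python sketch (the fetch function is injected so the example runs without a network; all names and data are illustrative, and in production each stage would be its own program):

```python
def crawl(seed_urls, fetch):
    """Crawler stage: fetch raw documents for each seed URL."""
    return {url: fetch(url) for url in seed_urls}

def index_docs(docs):
    """Indexer stage: build a word -> set-of-URLs inverted index."""
    idx = {}
    for url, text in docs.items():
        for word in set(text.lower().split()):
            idx.setdefault(word, set()).add(url)
    return idx

def query(idx, word):
    """Query-engine stage: look up a single keyword in the index."""
    return idx.get(word.lower(), set())

# A fake fetcher standing in for real HTTP requests.
fake_fetch = {"u1": "adult search engine", "u2": "video search"}.__getitem__
idx = index_docs(crawl(["u1", "u2"], fake_fetch))
```

Because each stage only talks to the next through plain data (documents in, index out), the stages can be split onto separate boxes without changing the logic.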
#15
Raise Your Weapon
Industry Role:
Join Date: Jun 2003
Location: Outback Australia
Posts: 15,601

You might like to think about a branch of Terrier (a branch rather than a fork) called Terraneo: http://distterr.wordpress.com/2010/0...search-engine/

Also read this; it's probably one of the best comparisons of open-source search engines you could read: http://wrg.upf.edu/WRG/dctos/Middleton-Baeza.pdf
#16
Raise Your Weapon
Industry Role:
Join Date: Jun 2003
Location: Outback Australia
Posts: 15,601

Just to save you time: you will need more than one server. Even if you use a homogeneous search engine/crawler, you won't get away with even a 100-million-page index on just one server, unless you plan on crawling only, say, 10,000 pages a day. Think of the processing needed for things like clone detection, caching, etc.
#17
Confirmed User
Industry Role:
Join Date: Oct 2010
Location: Portugal
Posts: 1,262

Bookmarked.
__________________
StagCMS - Adult CMS - user friendly adult content management system - speed up your websites with no SQL connections ICQ: 63*23*43*113
#18
Registered User
Industry Role:
Join Date: Feb 2006
Posts: 22,511

Sorry for the threadjack, but what about a search engine that would only search certain sites (not one's own), or maybe certain file types? Something off the shelf?
#19
Confirmed User
Industry Role:
Join Date: May 2004
Location: Norway
Posts: 1,661

http://www.sphinxsearch.com is awesome!
__________________
DivaTraffic - Traffic for Models
#20
Raise Your Weapon
Industry Role:
Join Date: Jun 2003
Location: Outback Australia
Posts: 15,601

Searching a group of sites is easier: if you know the sites you want to index, then you're not having to discover the sites in the first place.

If, however, you mean searching types of sites, this is the effort our search engine is working on: simply indexing porn sites and almost nothing else. The crawling effort is huge, but the indexing effort is less of an issue, as many sites are discarded because they are not porn sites.

Off the shelf is where things get difficult, because the differences between search efforts mean that anything off the shelf needs a certain amount of customization.
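Restricting a crawl to a known group of sites mostly comes down to filtering the crawl frontier by host. A minimal sketch (the hostnames are made up for illustration):

```python
from urllib.parse import urlparse

# Hypothetical fixed list of sites to index; nothing else is crawled.
ALLOWED_HOSTS = {"site-a.example", "site-b.example"}

def enqueue(frontier, seen, url):
    """Admit a URL to the crawl frontier only if its host is on the
    fixed site list and it has not been queued before."""
    host = urlparse(url).netloc
    if host in ALLOWED_HOSTS and url not in seen:
        seen.add(url)
        frontier.append(url)

frontier, seen = [], set()
enqueue(frontier, seen, "http://site-a.example/page1")
enqueue(frontier, seen, "http://elsewhere.example/x")   # off-list: dropped
enqueue(frontier, seen, "http://site-a.example/page1")  # duplicate: dropped
```

With the discovery problem removed, the crawler only ever re-visits the fixed site list, which is exactly why this case is so much cheaper than open-web crawling.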
#22
Registered User
Industry Role:
Join Date: Feb 2006
Posts: 22,511

Well, to stay with just searching certain sites: there are cheap scripts that search certain tube sites and embed the videos on your site.
#23
Raise Your Weapon
Industry Role:
Join Date: Jun 2003
Location: Outback Australia
Posts: 15,601

That would be the type of task you could employ Sphinx or Zebra to do quite trivially.
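As a stand-in for what a Sphinx- or Zebra-backed setup would do here, this toy uses SQLite's FTS5 module to full-text index tube-video metadata (the URLs and titles are invented; a real deployment would use Sphinx's own indexer and configuration, not SQLite):

```python
import sqlite3

# In-memory full-text index over video metadata via SQLite FTS5.
# The url column is stored but not tokenized for matching.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE videos USING fts5(url UNINDEXED, title)")
conn.executemany(
    "INSERT INTO videos (url, title) VALUES (?, ?)",
    [
        ("http://tube-a.example/1", "amateur beach video"),
        ("http://tube-b.example/2", "beach volleyball highlights"),
    ],
)

def search_videos(term):
    """Full-text match against the indexed title column."""
    rows = conn.execute(
        "SELECT url FROM videos WHERE videos MATCH ?", (term,)
    ).fetchall()
    return {row[0] for row in rows}
```

The point is the shape of the task: the video titles and URLs already exist as structured metadata, so there is no crawling problem, just an indexing and querying one.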
#24
Registered User
Industry Role:
Join Date: Feb 2006
Posts: 22,511

Yeah, I'm not looking for a programmer at the moment, just wondering if there is anything off the shelf.
#25
Raise Your Weapon
Industry Role:
Join Date: Jun 2003
Location: Outback Australia
Posts: 15,601

http://www.ivinco.com/blog/using-wge...search-engine/
#26
Confirmed User
Industry Role:
Join Date: Nov 2002
Posts: 579

Thank you again for all this information, AdultKing; you really know your stuff. I was thinking about this, and I think I'll start with something small, as a trial-and-error sort of thing. What I want to do is add the websites to the engine myself, without any sort of crawler. What do you recommend for that?
#27
Confirmed User
Industry Role:
Join Date: Dec 2002
Location: Behind the scenes
Posts: 5,190

dazzling, maybe you need something like a directory script then, and just add and organize links. With no sorting and filtering of links, you could probably get away with a basic WP install.
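A hand-maintained directory of this kind is just a list of entries with a simple matcher over titles and tags. A toy sketch (all entries and names are invented):

```python
# Hand-maintained directory: the operator adds entries by hand,
# no crawler involved.
directory = [
    {"url": "http://example-tube.example", "title": "Example Tube",
     "tags": ["video", "tube"]},
    {"url": "http://example-pics.example", "title": "Example Pics",
     "tags": ["pictures", "galleries"]},
]

def directory_search(entries, term):
    """Case-insensitive substring match over title and tags."""
    term = term.lower()
    hits = []
    for entry in entries:
        haystack = " ".join([entry["title"]] + entry["tags"]).lower()
        if term in haystack:
            hits.append(entry["url"])
    return hits
```

At this scale a linear scan is fine; it's only when entries number in the tens of thousands that you'd want a real index behind the search box.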
#28
So Fucking Banned
Industry Role:
Join Date: Apr 2009
Posts: 2,968

Google, Yahoo, etc. use economies of scale.

You can't search 5 billion pages in 0.02 seconds on one server. The query is split over the whole network: distributed processing. It takes the same total computing power to handle 200 million queries a day, but by using parallel processing you get the results 1000 times quicker. Each server tackles under a million web pages.

10 million pages is about the limit for a single-server search engine.
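The fan-out-and-merge pattern described above can be sketched in miniature: each shard indexes its own slice of the corpus, a query is sent to every shard in parallel, and the partial results are merged. In this toy, threads stand in for separate servers and the data is tiny and invented:

```python
from concurrent.futures import ThreadPoolExecutor

def make_shard(pages):
    """Build one shard's inverted index over its slice of the corpus
    (under a million pages per box in the scheme above; tiny here)."""
    idx = {}
    for url, text in pages.items():
        for word in set(text.lower().split()):
            idx.setdefault(word, set()).add(url)
    return idx

def fan_out(shards, word):
    """Send the query to every shard in parallel, then merge the
    partial result sets into one answer."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda s: s.get(word.lower(), set()), shards))
    return set().union(*partials)

shards = [
    make_shard({"u1": "adult search engine"}),
    make_shard({"u2": "search scripts", "u3": "tube videos"}),
]
```

Because each shard answers over its own slice, query latency is governed by the slowest shard rather than the total corpus size, which is the whole point of the design.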
#29
Raise Your Weapon
Industry Role:
Join Date: Jun 2003
Location: Outback Australia
Posts: 15,601

Sphider, of course, won't scale: you will be able to index a few thousand sites reliably, but then it will begin to break down. Sphider certainly won't index a whole country.

At this micro level, it would be possible to add affiliate links to the search results and potentially earn money from the links clicked. The question remains, though: are people likely to use something so limited at the end-user level?
#30
Raise Your Weapon
Industry Role:
Join Date: Jun 2003
Location: Outback Australia
Posts: 15,601

While a query on the index can consume the processing power of 1000 machines, according to early reports of the Google platform, this is misleading: the machines are built such that you are really just dealing with component systems, which may themselves be built from many "servers".

There is a very good paper on using GPUs within a search architecture at http://koala.poly.edu/GPU.pdf
#31
partners.sexier.com
Industry Role:
Join Date: Jan 2007
Location: San Francisco, CA
Posts: 11,926

Useful thread...
#32
Confirmed User
Industry Role:
Join Date: Nov 2002
Posts: 579

A few years ago I used a really nice search engine script from Fluid Dynamics: http://www.xav.com/scripts/search/

The script was great, but very limited in the number of websites you could add, and it used way too much CPU. It was good for what I was doing at the time; the problem was that the guy stopped development on the script.

I think what I would like to do is move in stages: start with a small project so I can learn, then move on to something bigger later.
#34
Raise Your Weapon
Industry Role:
Join Date: Jun 2003
Location: Outback Australia
Posts: 15,601

You'll find Sphider comparable, although PHP-based rather than Perl. The problem with these types of scripts is that they are scripts, and a script won't become a real search engine; you just can't do the kinds of things you need to do to run a search engine from a script. You need several programs running with one or more databases: a basic search engine platform will consist of a crawler, an indexer and a query engine at a minimum.

It is possible to create a real search engine on one server, though on a limited scale. One of our test/development servers for PornoBug indexes a realm of web space, approximately 2.5 million sites, on a single Xeon server with 6TB of disk. The machine runs under constant load and does nothing but run the search engine, but it does have a crawler, indexer and query interface all on the one machine. It crawls 100,000 pages a day, and sites within the realm are typically visited every 2 to 3 days.
#35
Raise Your Weapon
Industry Role:
Join Date: Jun 2003
Location: Outback Australia
Posts: 15,601

A crawler really needs to be able to work quickly. Ours are all written in C and are very nimble; not one line of code is included unless it is absolutely necessary for the crawler to work.