#1
Confirmed User
Industry Role:
Join Date: Nov 2002
Posts: 579

Anyone know of any good search engine scripts?
#2
Raise Your Weapon
Industry Role:
Join Date: Jun 2003
Location: Outback Australia
Posts: 15,601

Do you want to run a real search engine or a meta-search site?
#3
Confirmed User
Join Date: Aug 2002
Location: Sydney, Australia
Posts: 6,103

Like AdultKing said, your own Google or just spider your own pages?
#6
Confirmed User
Join Date: Dec 2005
Posts: 271

Thanks to everyone who participated.
#7
Confirmed User
Industry Role:
Join Date: Nov 2002
Posts: 579

My own real search engine, country-specific. I'm guessing that for what I have in mind, I might need to index up to 10 million pages. Any suggestions?
#8
Raise Your Weapon
Industry Role:
Join Date: Jun 2003
Location: Outback Australia
Posts: 15,601

There are several ways to go about this.

Unless you are just making a raw index, you'll need pretty hefty bandwidth and hardware. You would want to cache pages so you can compare them for updates, and you'd also want some kind of cross-word (inverted) index, which requires fast SQL. You also need a crawler, unless you have some other way of obtaining an index.

Lucene is part of the Apache project now: http://lucene.apache.org/java/docs/index.html. It's all Java, but you need to supply your own crawler agents. Hardware requirements pretty much depend on how you scale it. Typically you'd build your crawlers on separate boxes from the index; however, if you use tunnels, you can run everything from one box and just farm the jobs out to specific boxes by rerouting traffic using iptables.

Webglimpse (http://webglimpse.net/) is good, but it's not free and you really need to know what you are doing; it's probably better as a document management solution, or perhaps data mining. "The search engine (written in C) and webglimpse is the spider and indexer (primarily in Perl)." So it would be handy to know Perl, as crawler agents rarely do what you want out of the box.

Zebra (https://www.indexdata.com/zebra), a tool used by many search engine researchers, is free to use, the source is available, and it can handle huge databases.

There are other options if you have Java or C# development capability on a moderate scale. I can provide more options if none of these suit you. It's a really big subject; without knowing exactly what you want to achieve, it's hard to say what tool set you should be looking at.

The biggest barrier to entry here is the hardware you will need. Working out an optimal configuration is difficult; however, if you want to cut down on resources and be able to scale, I would suggest having a main server to handle traffic in and out of your search platform, then redirecting particular types of traffic to slave servers not visible to the internet, e.g. one to crawl, one to index, one to run your database, plus a NAS for your disk storage, of which you will need a lot: terabytes just to start with.
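The cache-and-compare step described above can be sketched with content hashing. This is a toy illustration (the function names and data are made up, not any particular engine's implementation):

```python
import hashlib

def content_hash(html: str) -> str:
    """Stable fingerprint of a page body, used to detect changes."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()

def needs_reindex(cache: dict, url: str, html: str) -> bool:
    """Compare a freshly crawled page against its cached fingerprint.
    Only pages whose content actually changed go back to the indexer."""
    new_hash = content_hash(html)
    if cache.get(url) == new_hash:
        return False          # unchanged since last crawl: skip it
    cache[url] = new_hash     # remember the new fingerprint
    return True

cache = {}
```

In practice the fingerprints would live in your database alongside crawl timestamps, so the crawler can skip re-indexing pages that haven't changed.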
#9
Raise Your Weapon
Industry Role:
Join Date: Jun 2003
Location: Outback Australia
Posts: 15,601

Here are some others:

http://www.htdig.org/
http://openfts.sourceforge.net/
http://www.namazu.org/

I have plenty more. Tell us a little bit more about your expectations, what the system needs to do, and what type of index you want to create. Do you need a crawler, or will you write one? Do you want full-text search, or keyword-based cross searching?
#10
Confirmed User
Join Date: Dec 2005
Posts: 271

I would use Terrier: http://terrier.org

Lucene is OK but not as easy to extend, although you probably won't get very far with anything unless you have studied IR in depth.
#11
Raise Your Weapon
Industry Role:
Join Date: Jun 2003
Location: Outback Australia
Posts: 15,601

One observation: 10 million pages is not many. If it's country-specific, then surely you would need to index more than that? Keep in mind that your published index does not necessarily equal the number of sites you crawl.

We run a porn-only search engine; we index millions of pages, but we are crawling many times that number and discarding them from the index because they are not porn sites.

The web space is really big. Netcraft estimated there were 152 million sites at the start of 2008 and 182 million later that same year, and the net has grown exponentially since then, so according to Netcraft there were about 255 million at the end of 2010. That number is sites; some sites have hundreds of pages or more, so the crawling effort is mind-bogglingly huge.

You will also need to think about what tools you'll need to manage such a huge data set. We use 3D modelling to map out the web space as we see it; it enables us to detect holes in our crawling efforts and then address them with updates to our crawler code.

You will probably find that once you choose a tool set, you will want to alter it to suit your specific needs. We started with a code base for a crawler and indexer, then changed it to suit our needs; what we ended up with bears little resemblance to what we started with. So you'd need programmer support to keep your tools doing what you need them to.
#12
Raise Your Weapon
Industry Role:
Join Date: Jun 2003
Location: Outback Australia
Posts: 15,601

Yes, I agree Terrier is also a good place to start. There is a fairly wide user base, so getting help is not difficult, and it's quite extensible. It's all Java, and you would want a pretty good grip on Java to implement your own Terrier-based system.
#13
Confirmed User
Industry Role:
Join Date: Nov 2002
Posts: 579

I want to keep it as simple as possible, with the minimum load on a server. At this stage it may be something smaller: a state, a city, or a subject. I'm not really sure; this is still at an early stage. Any advice is appreciated. Do you know of any examples of anyone using Terrier?
#14
Raise Your Weapon
Industry Role:
Join Date: Jun 2003
Location: Outback Australia
Posts: 15,601

While thinking about architecture, a good way to look at things is that you have various components which go together to form a search platform: a crawler, an indexer, a database management component, and the data set itself. One program won't handle the whole thing, so it won't be a script, as suggested in your original question; it will be a series of programs running independently within a system.
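The division of labour between those components can be sketched as separate stages wired together. A toy Python sketch (the fetch function is injected so the example runs without a network; all names and data are illustrative, and in production each stage would be its own program):

```python
def crawl(seed_urls, fetch):
    """Crawler stage: fetch raw documents for each seed URL."""
    return {url: fetch(url) for url in seed_urls}

def index_docs(docs):
    """Indexer stage: build a word -> set-of-URLs inverted index."""
    idx = {}
    for url, text in docs.items():
        for word in set(text.lower().split()):
            idx.setdefault(word, set()).add(url)
    return idx

def query(idx, word):
    """Query-engine stage: look up a single keyword in the index."""
    return idx.get(word.lower(), set())

# A fake fetcher standing in for real HTTP requests.
fake_fetch = {"u1": "adult search engine", "u2": "video search"}.__getitem__
idx = index_docs(crawl(["u1", "u2"], fake_fetch))
```

Because each stage only talks to the next through plain data (documents in, index out), the stages can be split onto separate boxes without changing the logic.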
#15
Raise Your Weapon
Industry Role:
Join Date: Jun 2003
Location: Outback Australia
Posts: 15,601

You might like to think about a branch of Terrier (a branch rather than a fork) called Terraneo: http://distterr.wordpress.com/2010/0...search-engine/

Also read this; it's probably one of the best comparisons of open-source search engines you could read: http://wrg.upf.edu/WRG/dctos/Middleton-Baeza.pdf
#16
Raise Your Weapon
Industry Role:
Join Date: Jun 2003
Location: Outback Australia
Posts: 15,601

Just to save you time: you will need more than one server. Even if you use a homogeneous search engine/crawler, you won't get away with even a 100-million-page index on just one server, unless you plan on crawling only, say, 10,000 pages a day. Think of the processing needed for things like clone detection, caching, etc.
#17
Confirmed User
Industry Role:
Join Date: Oct 2010
Location: Portugal
Posts: 1,262

Bookmarked.
__________________
StagCMS - Adult CMS - user friendly adult content management system - speed up your websites with no SQL connections ICQ: 63*23*43*113
#18
Registered User
Industry Role:
Join Date: Feb 2006
Posts: 22,511

Sorry for the threadjack, but what about a search engine that would only search certain sites (not one's own), or maybe certain file types? Something off the shelf?
#19
Confirmed User
Industry Role:
Join Date: May 2004
Location: Norway
Posts: 1,661

http://www.sphinxsearch.com is awesome!
__________________
DivaTraffic - Traffic for Models
#20
Raise Your Weapon
Industry Role:
Join Date: Jun 2003
Location: Outback Australia
Posts: 15,601

Searching a group of sites is easier: if you know the sites you want to index, then you're not having to discover the sites in the first place.

If, however, you mean searching types of sites, this is the effort our search engine is working on: simply indexing porn sites and almost nothing else. The crawling effort is huge, but the indexing effort is less of an issue, as many sites are discarded because they are not porn sites.

Off the shelf is where things get difficult, because the differences between search efforts mean that anything off the shelf needs a certain amount of customization.
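Restricting a crawl to a known group of sites mostly comes down to filtering the crawl frontier by host. A minimal sketch (the hostnames are made up for illustration):

```python
from urllib.parse import urlparse

# Hypothetical fixed list of sites to index; nothing else is crawled.
ALLOWED_HOSTS = {"site-a.example", "site-b.example"}

def enqueue(frontier, seen, url):
    """Admit a URL to the crawl frontier only if its host is on the
    fixed site list and it has not been queued before."""
    host = urlparse(url).netloc
    if host in ALLOWED_HOSTS and url not in seen:
        seen.add(url)
        frontier.append(url)

frontier, seen = [], set()
enqueue(frontier, seen, "http://site-a.example/page1")
enqueue(frontier, seen, "http://elsewhere.example/x")   # off-list: dropped
enqueue(frontier, seen, "http://site-a.example/page1")  # duplicate: dropped
```

With the discovery problem removed, the crawler only ever re-visits the fixed site list, which is exactly why this case is so much cheaper than open-web crawling.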
#22
Registered User
Industry Role:
Join Date: Feb 2006
Posts: 22,511

Well, to stay with just searching certain sites: there are cheap scripts that search certain tube sites and embed the videos on your site.
#23
Raise Your Weapon
Industry Role:
Join Date: Jun 2003
Location: Outback Australia
Posts: 15,601

That would be the type of task you could employ Sphinx or Zebra to do quite trivially.
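As a stand-in for what a Sphinx- or Zebra-backed setup would do here, this toy uses SQLite's FTS5 module to full-text index tube-video metadata (the URLs and titles are invented; a real deployment would use Sphinx's own indexer and configuration, not SQLite):

```python
import sqlite3

# In-memory full-text index over video metadata via SQLite FTS5.
# The url column is stored but not tokenized for matching.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE videos USING fts5(url UNINDEXED, title)")
conn.executemany(
    "INSERT INTO videos (url, title) VALUES (?, ?)",
    [
        ("http://tube-a.example/1", "amateur beach video"),
        ("http://tube-b.example/2", "beach volleyball highlights"),
    ],
)

def search_videos(term):
    """Full-text match against the indexed title column."""
    rows = conn.execute(
        "SELECT url FROM videos WHERE videos MATCH ?", (term,)
    ).fetchall()
    return {row[0] for row in rows}
```

The point is the shape of the task: the video titles and URLs already exist as structured metadata, so there is no crawling problem, just an indexing and querying one.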
#24
Registered User
Industry Role:
Join Date: Feb 2006
Posts: 22,511

Yeah, I'm not looking for a programmer at the moment, just wondering if there is anything off the shelf.
#25
Raise Your Weapon
Industry Role:
Join Date: Jun 2003
Location: Outback Australia
Posts: 15,601

http://www.ivinco.com/blog/using-wge...search-engine/
#26
Confirmed User
Industry Role:
Join Date: Nov 2002
Posts: 579

Thank you again for all this information, AdultKing; you really know your stuff. I was thinking about this, and I think I'll start with something small, as a trial-and-error sort of thing. What I want to do is add the websites to the engine myself, without any sort of crawler. What do you recommend for that?
#27
Confirmed User
Industry Role:
Join Date: Dec 2002
Location: Behind the scenes
Posts: 5,190

dazzling, maybe you need something like a directory script then, and just add and organize links. With no sorting and filtering of links, you could probably get away with a basic WP install.
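A hand-maintained directory of this kind is just a list of entries with a simple matcher over titles and tags. A toy sketch (all entries and names are invented):

```python
# Hand-maintained directory: the operator adds entries by hand,
# no crawler involved.
directory = [
    {"url": "http://example-tube.example", "title": "Example Tube",
     "tags": ["video", "tube"]},
    {"url": "http://example-pics.example", "title": "Example Pics",
     "tags": ["pictures", "galleries"]},
]

def directory_search(entries, term):
    """Case-insensitive substring match over title and tags."""
    term = term.lower()
    hits = []
    for entry in entries:
        haystack = " ".join([entry["title"]] + entry["tags"]).lower()
        if term in haystack:
            hits.append(entry["url"])
    return hits
```

At this scale a linear scan is fine; it's only when entries number in the tens of thousands that you'd want a real index behind the search box.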
#28
So Fucking Banned
Industry Role:
Join Date: Apr 2009
Posts: 2,968

Google, Yahoo, etc. use economies of scale.

You can't search 5 billion pages in 0.02 seconds on one server. The query is split over the whole network: distributed processing. It takes the same total computing power to handle 200 million queries a day, but by using parallel processing you get the results 1000 times quicker. Each server tackles under a million web pages.

10 million pages is about the limit for a single-server search engine.
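The fan-out-and-merge pattern described above can be sketched in miniature: each shard indexes its own slice of the corpus, a query is sent to every shard in parallel, and the partial results are merged. In this toy, threads stand in for separate servers and the data is tiny and invented:

```python
from concurrent.futures import ThreadPoolExecutor

def make_shard(pages):
    """Build one shard's inverted index over its slice of the corpus
    (under a million pages per box in the scheme above; tiny here)."""
    idx = {}
    for url, text in pages.items():
        for word in set(text.lower().split()):
            idx.setdefault(word, set()).add(url)
    return idx

def fan_out(shards, word):
    """Send the query to every shard in parallel, then merge the
    partial result sets into one answer."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda s: s.get(word.lower(), set()), shards))
    return set().union(*partials)

shards = [
    make_shard({"u1": "adult search engine"}),
    make_shard({"u2": "search scripts", "u3": "tube videos"}),
]
```

Because each shard answers over its own slice, query latency is governed by the slowest shard rather than the total corpus size, which is the whole point of the design.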
#29
Raise Your Weapon
Industry Role:
Join Date: Jun 2003
Location: Outback Australia
Posts: 15,601

Sphider, of course, won't scale: you will be able to index a few thousand sites reliably, but then it will begin to break down. Sphider certainly won't index a whole country.

At this micro level, it would be possible to add affiliate links to the search results and potentially earn money from the links clicked. The question remains, though: are people likely to use something so limited at the end-user level?
#30
Raise Your Weapon
Industry Role:
Join Date: Jun 2003
Location: Outback Australia
Posts: 15,601

While a query on the index can consume the processing power of 1000 machines, according to early reports of the Google platform, this is misleading: the machines are built such that you are really just dealing with component systems, which may themselves be built from many "servers".

There is a very good paper on using GPUs within a search architecture at http://koala.poly.edu/GPU.pdf
#31
partners.sexier.com
Industry Role:
Join Date: Jan 2007
Location: San Francisco, CA
Posts: 11,926

Useful thread...
#32
Confirmed User
Industry Role:
Join Date: Nov 2002
Posts: 579

A few years ago I used a really nice search engine script from Fluid Dynamics: http://www.xav.com/scripts/search/

The script was great, but very limited in the number of websites you could add, and it used way too much CPU. It was good for what I was doing at the time; the problem was that the guy stopped development on the script.

I think what I would like to do is move in stages: start with a small project so I can learn, then move on to something bigger later.
#34
Raise Your Weapon
Industry Role:
Join Date: Jun 2003
Location: Outback Australia
Posts: 15,601

You'll find Sphider comparable, although PHP-based rather than Perl. The problem with these types of scripts is that they are scripts, and a script won't become a real search engine; you just can't do the kinds of things you need to do to run a search engine from a script. You need several programs running with one or more databases: a basic search engine platform will consist of a crawler, an indexer and a query engine at a minimum.

It is possible to create a real search engine on one server, though on a limited scale. One of our test/development servers for PornoBug indexes a realm of web space, approximately 2.5 million sites, on a single Xeon server with 6TB of disk. The machine runs under constant load and does nothing but run the search engine, but it does have a crawler, indexer and query interface all on the one machine. It crawls 100,000 pages a day, and sites within the realm are typically visited every 2 to 3 days.
#35
Raise Your Weapon
Industry Role:
Join Date: Jun 2003
Location: Outback Australia
Posts: 15,601

A crawler really needs to be able to work quickly. Ours are all written in C and are very nimble; not one line of code is included unless it is absolutely necessary for the crawler to work.