Welcome to the GoFuckYourself.com - Adult Webmaster Forum forums. You are currently viewing our boards as a guest which gives you limited access to view most discussions and access our other features. By joining our free community you will have access to post topics, communicate privately with other members (PM), respond to polls, upload content and access many other special features. Registration is fast, simple and absolutely free so please, join our community today! If you have any problems with the registration process or your account login, please contact us. |
Discuss what's fucking going on, and which programs are best and worst. One-time "program" announcements from "established" webmasters are allowed. |
#1 |
Too lazy to set a custom title
Industry Role:
Join Date: Aug 2002
Posts: 55,372
|
Best way to do this? python/ruby/perl/sed/awk/php/etc?
I have a list of links.
Example:
Code:
<h3>search engine links</h3>
<a href="http://google.com">google</a>
<a href="http://www.bing.com">bing</a>
<a href="http://www.yahoo.com">yahoo</a>
<h3>payment links</h3>
<a href="http://www.paypal.com">paypal</a>
<a href="http://www.paxum.com">paxum</a>
I only want to get the search engine links:
Code:
<a href="http://google.com">google</a>
<a href="http://www.bing.com">bing</a>
<a href="http://www.yahoo.com">yahoo</a>
I was using sed, but it's printing the 2nd h3:
Code:
sed -n '/<h3>/,/<\/h3>/p' test.txt
__________________
Since 1999: 69 Adult Industry awards for Best Hosting Company and professional excellence. ![]() WP Stuff |
#2 |
Confirmed User
Industry Role:
Join Date: May 2012
Posts: 124
|
Have you tried using xpath?
I don't know exactly what you're doing, but in this case //a[position()<4] will return the search engine links. I guess you want everything after the first h3 tag, probably with a dynamic number of links, so that was just an example; I'm sure xpath can handle whatever you want to do.
#3 |
Confirmed User
Industry Role:
Join Date: Oct 2011
Location: Munich
Posts: 411
|
Code:
$ cat test.txt
<h3>search engine links</h3>
<a href="http://google.com">google</a>
<a href="http://www.bing.com">bing</a>
<a href="http://www.yahoo.com">yahoo</a>
<h3>payment links</h3>
<a href="http://www.paypal.com">paypal</a>
<a href="http://www.paxum.com">paxum</a>
$ sed -e '/<h3>payment/,/<\/h3>/ d' -e '/<h3>/ d' test.txt
<a href="http://google.com">google</a>
<a href="http://www.bing.com">bing</a>
<a href="http://www.yahoo.com">yahoo</a>
__________________
http://kazymjir.com/ |
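For anyone who would rather do this in Python (it is on the list in the thread title), here is a minimal sketch of the same line-oriented idea. It assumes, like the sed command above, that every tag sits on its own line. `group_by_heading` and `HEADING` are made-up names for illustration, not part of any library.

```python
import re
from collections import defaultdict

# Matches an <h3> heading on a line and captures its text.
HEADING = re.compile(r"<h3[^>]*>(.*?)</h3>", re.IGNORECASE)

def group_by_heading(lines):
    """Map each <h3> heading text to the non-empty lines that follow it."""
    groups = defaultdict(list)
    current = None                      # no heading seen yet
    for line in lines:
        line = line.strip()
        match = HEADING.search(line)
        if match:
            current = match.group(1)    # a new block starts here
            groups[current]             # register the heading even if its block is empty
        elif current is not None and line:
            groups[current].append(line)
    return dict(groups)

# Sample input copied from the first post.
sample = """\
<h3>search engine links</h3>
<a href="http://google.com">google</a>
<a href="http://www.bing.com">bing</a>
<a href="http://www.yahoo.com">yahoo</a>
<h3>payment links</h3>
<a href="http://www.paypal.com">paypal</a>
<a href="http://www.paxum.com">paxum</a>
""".splitlines()

groups = group_by_heading(sample)
print("\n".join(groups["search engine links"]))
```

Grouping everything in one pass means any heading can be looked up afterwards, for example `groups["payment links"]`, without hard-coding the name of the block to delete.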
#4 |
Confirmed User
Industry Role:
Join Date: Feb 2011
Location: Ontario, Canada
Posts: 1,026
|
I would first extract all the "a href" tags with regex, xpath, or this: http://simplehtmldom.sourceforge.net/
Then detect which urls contain search engine keywords or domains. Something like this (typed out fast, did not test):
Code:
$href_array = array(
    '<a href="http://google.com">google</a>',
    '<a href="http://www.bing.com">bing</a>',
    'etc..',
);
$search_engines = array('bing.com', 'google.com', 'etc...');
$final = array();
foreach ($href_array as $link) {
    foreach ($search_engines as $site) {
        if (strpos($link, $site) !== FALSE) {
            // SE link found
            $final[] = $link;
            break; // one match per link is enough
        }
    }
}
echo '<pre>';
print_r($final);
__________________
[email protected] ICQ: 269486444 ZoxEmbedTube - Build unlimited "fake" tubes with this easy 100% unencoded CMS! |
#5 | |
Confirmed User
Industry Role:
Join Date: Oct 2011
Location: Munich
Posts: 411
|
Quote:
If you know all the search engine links ($search_engines array), why do you search for them? It's like "I *know* that a lightbulb and a toy car are inside this box, but I will check it anyway". Also, what happens if test.txt contains a link that isn't in $search_engines? And why are you firing up PHP and doing DOM/regexp processing when it can be done with a single sed command?
#6 |
Confirmed User
Industry Role:
Join Date: May 2012
Posts: 124
|
if you wanted to use xpath u could use //a[following-sibling::h3[1]]
but kazymjir's method is probably what you are looking for |
#7 | ||
Confirmed User
Industry Role:
Join Date: Feb 2011
Location: Ontario, Canada
Posts: 1,026
|
To you.
Quote:
Quote:
I code with php, not sed, so obviously my help would be provided with php. I didn't see your post so chill the fuck out allstar.
#8 | |
It's 42
Industry Role:
Join Date: Jun 2010
Location: Global
Posts: 18,083
|
Quote:
In a foreach loop (Perl):
Code:
foreach (@_) {
    if ($_ =~ /href/i) {
        chomp $_;
        print FILE "$_\n";   # FILE must already be open for writing
    }
}
#9 |
Confirmed User
Industry Role:
Join Date: Oct 2011
Location: Munich
Posts: 411
|
Zoxxa, notice that Fris provided an example input. It doesn't have to be search engines, it can be anything.
He wants to get the content between given <h3>s. He may not know what content is between them, so in this case your code is useless. Search engines are only an example, as Fris said. The links can be totally random. Zoxxa, you should chill out. I didn't mean to attack you, I was just expressing my opinion.
#10 |
frc
Industry Role:
Join Date: Jul 2003
Location: Bitcoin wallet
Posts: 4,663
|
__________________
Crazy fast VPS for $10 a month. Try with $20 free credit |
#11 | |
Confirmed User
Industry Role:
Join Date: Feb 2011
Location: Ontario, Canada
Posts: 1,026
|
Quote:
I apologize, I misread his post where it says "i only want to get the search engine links being". I thought he was being literal and actually meant only urls being search engines, I did not read the text between the h3 tag or his last sed example which shows what he wants.
#12 |
It's 42
Industry Role:
Join Date: Jun 2010
Location: Global
Posts: 18,083
|
http://search.cpan.org/dist/libwww-perl/lwpcook.pod |
#13 | |
Too lazy to set a custom title
Industry Role:
Join Date: Aug 2002
Posts: 55,372
|
Quote:
#14 |
there's no $$$ in porn
Industry Role:
Join Date: Jul 2005
Location: icq: 195./568.-230 (btw: not getting offline msgs)
Posts: 33,063
|
ugly, but it'll work... and less memory intensive than splitting the file:
Code:
open(FILE, '<', 'stuff.txt') or die "Could not open stuff.txt: $!\n";
$h3 = 0;
while (<FILE>) {
    chomp;
    if ($_ =~ "<h3>") {
        $h3++;
        last if $h3 > 1;   # stop reading at the second <h3>
    }
    else {
        print "$_\n";
    }
}
close FILE;
#15 |
Too lazy to set a custom title
Industry Role:
Join Date: Aug 2002
Posts: 55,372
|
Forgot to mention: I wanna specify the links based on the h3 tag, so <h3>payment links</h3> would grab the links under that h3 element ;)
#16 |
Confirmed User
Industry Role:
Join Date: May 2005
Location: UK
Posts: 1,201
|
cut and paste into a file editor, replace all <a href=" with \t<a href="
cut and paste into excel, select column, done |
#18 | |
there's no $$$ in porn
Industry Role:
Join Date: Jul 2005
Location: icq: 195./568.-230 (btw: not getting offline msgs)
Posts: 33,063
|
Quote:
Code:
open(FILE, 'stuff.txt');
$h3 = 0;
$h3str = "payment links";
while(<FILE>) {
    chomp;
    if($_ =~ "<h3>$h3str</h3>"){$h3++;}
    elsif($_ =~ "<h3>"){$h3 = 0;}
    elsif($h3>0){print "$_\n";}
}
close FILE;
#19 |
It's 42
Industry Role:
Join Date: Jun 2010
Location: Global
Posts: 18,083
|
Code:
if ($_ =~ m{<h3>\Q$h3str\E</h3>}i)
Might work better
#20 |
Beer Money Baron
Industry Role:
Join Date: Jan 2001
Location: brujah / gmail
Posts: 22,157
|
Maybe something along this line:
Code:
echo preg_replace( '|.*</h3>(.*)<h3>.*|s', '$1', $input );
#21 |
Beer Money Baron
Industry Role:
Join Date: Jan 2001
Location: brujah / gmail
Posts: 22,157
|
I guess if multiple h3 blocks continue this will work:
Code:
echo preg_replace( '|.*?</h3>(.*?)<h3>.*|s', '$1', $input );
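The difference between the greedy `(.*)` and the lazy `(.*?)` is easier to see side by side. Below is a minimal Python sketch using the standard `re` module, with `re.search` standing in for the `preg_replace` trick and `re.DOTALL` playing the role of the `s` modifier. The three-block `data` string is a shortened sample and `block_for` is a hypothetical helper name, so treat it as an illustration, not a drop-in replacement.

```python
import re

# Shortened sample with three heading blocks (an assumption for illustration).
data = (
    '<h3>search engine links</h3>\n'
    '<a href="http://google.com">google</a>\n'
    '<h3>payment links</h3>\n'
    '<a href="http://www.paypal.com">paypal</a>\n'
    '<h3>block three</h3>\n'
    '<a href="http://php.net">php</a>\n'
)

# Greedy: (.*) runs up to the LAST <h3>, so it swallows the payment block too.
greedy = re.search(r"</h3>(.*)<h3>", data, re.DOTALL).group(1)

# Lazy: (.*?) stops at the FIRST <h3> after the heading.
lazy = re.search(r"</h3>(.*?)<h3>", data, re.DOTALL).group(1)

def block_for(heading, text):
    """Return the text under one named heading, up to the next <h3> or end of input."""
    pattern = r"<h3>" + re.escape(heading) + r"</h3>(.*?)(?=<h3>|\Z)"
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1).strip() if match else None

print(lazy.strip())
print(block_for("payment links", data))
```

The lookahead `(?=<h3>|\Z)` lets the last block match as well, since no further `<h3>` follows it.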
#23 | |
Beer Money Baron
Industry Role:
Join Date: Jan 2001
Location: brujah / gmail
Posts: 22,157
|
Quote:
Code:
$link_block = preg_replace( '|.*?</h3>(.*?)<h3>.*|s', '$1', $input );
#24 | |
Too lazy to set a custom title
Industry Role:
Join Date: Aug 2002
Posts: 55,372
|
Quote:
Code:
$data = file_get_contents('links.txt');
$block = preg_replace( '|.*?</h3>(.*?)<h3>.*|s', '$1', $data );
echo $block;
#25 |
Beer Money Baron
Industry Role:
Join Date: Jan 2001
Location: brujah / gmail
Posts: 22,157
|
It displays the first block for me, but all I had to go on was your sample links.txt code above.
Code:
# cat links.txt
<h3>search engine links</h3>
<a href="http://google.com">google</a>
<a href="http://www.bing.com">bing</a>
<a href="http://www.yahoo.com">yahoo</a>
<h3>payment links</h3>
<a href="http://www.paypal.com">paypal</a>
<a href="http://www.paxum.com">paxum</a>
<h3>block three</h3>
<a href="https://gfy.com">gfy</a>
<a href="http://php.net">php</a>
# php test.php
<a href="http://google.com">google</a>
<a href="http://www.bing.com">bing</a>
<a href="http://www.yahoo.com">yahoo</a>
# cat test.php
<?php
$data = file_get_contents('links.txt');
$block = preg_replace( '|.*?</h3>(.*?)<h3>.*|s', '$1', $data );
echo $block;
#26 | |
Too lazy to set a custom title
Industry Role:
Join Date: Aug 2002
Posts: 55,372
|
Quote:
It's Chrome bookmarks, so each folder has an h3 heading for the folder name. I just wanna get the links for a given h3 folder name.
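With a real bookmarks export, plain `<h3>` string matching may miss headings, because exported files typically write the tags in upper case and put attributes on them. A hedged Python sketch using the standard `html.parser` module sidesteps both issues. The sample markup and its `ADD_DATE` values are assumptions about what an export looks like, and `FolderLinks` is a hypothetical class name.

```python
from html.parser import HTMLParser

class FolderLinks(HTMLParser):
    """Collect (title, url) pairs that appear under one <h3> folder heading."""

    def __init__(self, folder):
        super().__init__()
        self.folder = folder
        self.in_h3 = False      # currently inside an <h3> element
        self.active = False     # the most recent <h3> named our folder
        self.href = None        # href of the <a> being read, if any
        self.text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <H3> and <A HREF> work too.
        if tag == "h3":
            self.in_h3, self.text = True, []
        elif tag == "a" and self.active:
            self.href, self.text = dict(attrs).get("href"), []

    def handle_data(self, data):
        if self.in_h3 or self.href is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "h3" and self.in_h3:
            self.in_h3 = False
            self.active = "".join(self.text).strip() == self.folder
        elif tag == "a" and self.href is not None:
            self.links.append(("".join(self.text).strip(), self.href))
            self.href = None

# Assumed shape of an exported bookmarks file; the ADD_DATE values are invented.
sample = """
<DT><H3 ADD_DATE="1340000000">search engine links</H3>
<DL><p>
    <DT><A HREF="http://google.com" ADD_DATE="1340000001">google</A>
    <DT><A HREF="http://www.bing.com">bing</A>
</DL><p>
<DT><H3 ADD_DATE="1340000002">payment links</H3>
<DL><p>
    <DT><A HREF="http://www.paypal.com">paypal</A>
</DL><p>
"""

parser = FolderLinks("payment links")
parser.feed(sample)
parser.close()
print(parser.links)
```

Like the line-based answers in this thread, the sketch treats the file as flat: a nested subfolder heading switches collection off, and links that follow the subfolder in the parent folder are not picked up.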
#27 |
It's 42
Industry Role:
Join Date: Jun 2010
Location: Global
Posts: 18,083
|
One idea... |
#28 |
Beer Money Baron
Industry Role:
Join Date: Jan 2001
Location: brujah / gmail
Posts: 22,157
|
Ah ok. If you're still interested in a php solution, maybe this?
Code:
if ( empty( $argv[1] ) )
    die( 'Usage: php test.php keyword' . PHP_EOL );

$fp = fopen( 'links.txt', 'r' );
while ( $line = fgets( $fp ) ) {
    if ( strpos( $line, '<h3>' ) !== false AND strpos( $line, $argv[1] ) !== false ) {
        do {
            $line = fgets( $fp );
            if ( strpos( $line, '<h3>' ) !== false )
                break 2;
            else
                echo $line;
        } while ( ! feof( $fp ) );
    }
}
fclose( $fp );
Code:
~ $ php test.php
Usage: php test.php keyword
~ $ php test.php search
<a href="http://google.com">google</a>
<a href="http://www.bing.com">bing</a>
<a href="http://www.yahoo.com">yahoo</a>
~ $ php test.php pay
<a href="http://www.paypal.com">paypal</a>
<a href="http://www.paxum.com">paxum</a>
~ $ php test.php bleh
<a href="http://php.net">php</a>
<a href="http://nginx.org">nginx</a>
~ $ php test.php 'search engine links'
<a href="http://google.com">google</a>
<a href="http://www.bing.com">bing</a>
<a href="http://www.yahoo.com">yahoo</a>
~ $ php test.php 'payment links'
<a href="http://www.paypal.com">paypal</a>
<a href="http://www.paxum.com">paxum</a>
#29 |
Confirmed User
Industry Role:
Join Date: May 2005
Location: UK
Posts: 1,201
|
you could have done it in excel by now or registered a website that parses chrome book marks with php preg_match
is it just chrome bookmarks? I'll make a damn site to stop reading this ![]() |
#30 | |
Beer Money Baron
Industry Role:
Join Date: Jan 2001
Location: brujah / gmail
Posts: 22,157
|
Quote:
#32 |
Confirmed User
Industry Role:
Join Date: Dec 2009
Location: Texas
Posts: 1,643
|
This is untested and you already have a lot of good examples, but I believe you have more options on the command line this way.
Code:
#!/usr/bin/perl
die "Usage is $0 <start> <stop> <filename>\n" unless $ARGV[2];
$start = shift;
$stop = shift;
$file = shift;
open(FILE, "$file") or die "Could not open $file $!\n";
while(<FILE>){
    chomp;
    $sp = 1 if $_ =~ /$start/;
    last if $_ =~ /$stop/;      # stop reading once the stop pattern shows up
    next if $_ =~ /<h3>/;       # skip heading lines
    if($sp == 1){
        print "$_\n";
    }
}
close FILE;
__________________
Go Fuck Yourself |
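The start/stop idea can also be sketched in Python as a small generator. This is a rough equivalent under the same assumption of one tag per line; `between` and its `skip` parameter are hypothetical names, and all three patterns are treated as regular expressions.

```python
import re

def between(lines, start, stop, skip=r"<h3"):
    """Yield lines after the first match of `start` until a line matches `stop`."""
    started = False
    for line in lines:
        line = line.rstrip("\n")
        if not started:
            started = re.search(start, line) is not None
            continue                 # never emit the start line itself
        if re.search(stop, line):
            break                    # stop pattern reached, done
        if re.search(skip, line):
            continue                 # drop heading lines inside the range
        yield line

# Sample input copied from the first post.
sample = [
    '<h3>search engine links</h3>',
    '<a href="http://google.com">google</a>',
    '<a href="http://www.bing.com">bing</a>',
    '<a href="http://www.yahoo.com">yahoo</a>',
    '<h3>payment links</h3>',
    '<a href="http://www.paypal.com">paypal</a>',
    '<a href="http://www.paxum.com">paxum</a>',
]

for link in between(sample, "search engine", "payment"):
    print(link)
```

Because it is a generator, it can be fed an open file object directly and stops reading as soon as the stop pattern matches.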