Best way to do this? python/ruby/perl/sed/awk/php/etc? - GoFuckYourself.com

fris · 08-02-2012, 10:48 AM

I have a list of links.

example

Code:

<h3>search engine links</h3>
<a href="http://google.com">google</a>
<a href="http://www.bing.com">bing</a>
<a href="http://www.yahoo.com">yahoo</a>
<h3>payment links</h3>
<a href="http://www.paypal.com">paypal</a>
<a href="http://www.paxum.com">paxum</a>

i only want to get the search engine links being

Code:

<a href="http://google.com">google</a>
<a href="http://www.bing.com">bing</a>
<a href="http://www.yahoo.com">yahoo</a>

best way to go about this?

i was using sed, but its printing the 2nd h3

Code:

sed -n '/<h3>/,/<\/h3>/p' test.txt

any input would be great ;)

alcstrategy · 08-02-2012, 11:14 AM

have you tried using xpath?

i don't know exactly what you're doing but in this case //a[position()<4] will bring the search engines in this case, but i'm sure xpath could handle whatever you wanted to do

i guess u just want everything after the first h3 tag and probably has dynamic number of links, but it was just an example of using xpath

kazymjir · 08-02-2012, 11:37 AM

Code:

$ cat test.txt
<h3>search engine links</h3>
<a href="http://google.com">google</a>
<a href="http://www.bing.com">bing</a>
<a href="http://www.yahoo.com">yahoo</a>
<h3>payment links</h3>
<a href="http://www.paypal.com">paypal</a>
<a href="http://www.paxum.com">paxum</a>
$ sed -e '/<h3>payment/,/<\/h3>/ d' -e '/<h3>/ d' test.txt
<a href="http://google.com">google</a>
<a href="http://www.bing.com">bing</a>
<a href="http://www.yahoo.com">yahoo</a>
$
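
For comparison, here is a minimal Python sketch of the same result. On why the sed range in the first post prints the second heading: sed only starts looking for the end pattern on the line after the one that opened the range, so the range closes on (and includes) `<h3>payment links</h3>`. The names `SAMPLE`, `is_heading` and `first_block` are made up for this illustration, and it assumes one tag per line as in the sample.

```python
from itertools import dropwhile, takewhile

# Sample input copied from the first post, one tag per line.
SAMPLE = """<h3>search engine links</h3>
<a href="http://google.com">google</a>
<a href="http://www.bing.com">bing</a>
<a href="http://www.yahoo.com">yahoo</a>
<h3>payment links</h3>
<a href="http://www.paypal.com">paypal</a>
<a href="http://www.paxum.com">paxum</a>
""".splitlines()


def is_heading(line):
    """True for lines that contain an <h3> opening tag."""
    return "<h3>" in line


def first_block(lines):
    """Return the lines between the first <h3> line and the next one."""
    rest = dropwhile(lambda line: not is_heading(line), lines)
    next(rest, None)  # discard the first heading itself
    return list(takewhile(lambda line: not is_heading(line), rest))


if __name__ == "__main__":
    print("\n".join(first_block(SAMPLE)))
```

Like the sed one-liner, this only returns the first block; it does not select a block by heading text.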

Zoxxa · 08-02-2012, 11:57 AM

I would first extract all the "a href" tags with regex, xpath, or this: http://simplehtmldom.sourceforge.net/

Then detect which urls contain search engine keywords or domains.

Something like this (Typed out fast, did not test):

Code:

$href_array = array('<a href="http://google.com">google</a>', '<a href="http://www.bing.com">bing</a>', 'etc..');

$search_engines = array('bing.com', 'google.com', 'etc...');

$i = 0;
foreach($href_array as $link) {
	
	foreach($search_engines as $site){
		if(strpos($link, $site) !== FALSE){
			
			// SE link found
			$final[$i] = $link;
			$i++;
		}
	}

}

echo '<pre>';
print_r($final);
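
A rough Python sketch of the same filter-by-domain idea, with one change that is not in the PHP above: it compares the hostname of each href instead of doing a substring search on the whole tag, so a URL that merely mentions "google.com" in its query string is not picked up. `SEARCH_ENGINES`, `is_search_engine` and `search_engine_links` are hypothetical names, and the href is pulled out with a simple regex that assumes double-quoted attributes.

```python
import re
from urllib.parse import urlparse

# Hypothetical allow-list; extend as needed.
SEARCH_ENGINES = {"google.com", "bing.com", "yahoo.com"}


def is_search_engine(url):
    """True if the URL's host is a listed domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in SEARCH_ENGINES)


def search_engine_links(anchors):
    """Keep only the <a ...> strings whose href points at a listed domain."""
    kept = []
    for anchor in anchors:
        match = re.search(r'href="([^"]+)"', anchor)
        if match and is_search_engine(match.group(1)):
            kept.append(anchor)
    return kept


if __name__ == "__main__":
    links = [
        '<a href="http://google.com">google</a>',
        '<a href="http://www.bing.com">bing</a>',
        '<a href="http://www.paypal.com">paypal</a>',
    ]
    print(search_engine_links(links))
```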

kazymjir · 08-02-2012, 12:09 PM

Quote:

Originally Posted by Zoxxa

I would first extract all the "a href" tags with regex, xpath, or this: http://simplehtmldom.sourceforge.net/

Then detect which urls contain search engine keywords or domains.

Something like this (Typed out fast, did not test):

Code:

$href_array = array('<a href="http://google.com">google</a>', '<a href="http://www.bing.com">bing</a>', 'etc..');

$search_engines = array('bing.com', 'google.com', 'etc...');

$i = 0;
foreach($href_array as $link) {
	
	foreach($search_engines as $site){
		if(strpos($link, $site) !== FALSE){
			
			// SE link found
			$final[$i] = $link;
			$i++;
		}
	}

}

echo '<pre>';
print_r($final);

Zoxxa, sorry, but this makes completely no sense.

If you know all search engine links ($search_engines array), why do you search them?
It's like "I *know* that lightbulb and toy car is inside this box, but I will check it anyway".

Also, what will be if you don't have a link in $search_engines that exists in test.txt ?

And, why you are firing up PHP, performing DOM/regexp processing, while it can be done with single sed command?

alcstrategy · 08-02-2012, 12:20 PM

if you wanted to use xpath u could use //a[following-sibling::h3[1]]
but kazymjir's method is probably what you are looking for

Zoxxa · 08-02-2012, 12:20 PM

Quote:

Originally Posted by kazymjir

Zoxxa, sorry, but this makes completely no sense.

To you.

Quote:

Originally Posted by kazymjir

If you know all search engine links ($search_engines array), why do you search them? It's like "I *know* that lightbulb and toy car is inside this box, but I will check it anyway".

Because from his example list he doesn't. He had links with paypal / paxum as well. I suppose he could just select all links between <h3>search engine links</h3> and <h3>payment links</h3> with regex or with xpath, but my code would work with all links he grabs from any section anywhere. If he is only concerned with that block, then regex that part out or xpath.

Quote:

Originally Posted by kazymjir

Also, what will be if you don't have a link in $search_engines that exists in test.txt ?

Like I said, he could use regex to grab the block, or just keep a simple list of search engines he wants to extract.

Quote:

Originally Posted by kazymjir

And, why you are firing up PHP, performing DOM/regexp processing, while it can be done with single sed command?

I code with php, not sed, so obviously my help would be provided with php. I didn't see your post so chill the fuck out allstar.

Barry-xlovecam · 08-02-2012, 12:32 PM

Quote:

Originally Posted by fris

I have a list of links.

example

Code:

<h3>search engine links</h3>
<a href="http://google.com">google</a>
<a href="http://www.bing.com">bing</a>
<a href="http://www.yahoo.com">yahoo</a>
<h3>payment links</h3>
<a href="http://www.paypal.com">paypal</a>
<a href="http://www.paxum.com">paxum</a>

i only want to get the search engine links being

Code:

<a href="http://google.com">google</a>
<a href="http://www.bing.com">bing</a>
<a href="http://www.yahoo.com">yahoo</a>

best way to go about this?

i was using sed, but its printing the 2nd h3

Code:

sed -n '/<h3>/,/<\/h3>/p' test.txt

any input would be great ;)

Fast and dirty -- should be in strict
in a foreach loop

Code:

Perl

foreach(@_){if ($_=~/href/i)   {chomp $_; print FILE "$_\n";}}

kazymjir · 08-02-2012, 12:37 PM

Zoxxa, notice, that Fris provided an example input. example input. It doesn't have to be search engines, it can be anything.
He wants to get the content between given <h3>s. He may not know what content is between them, so your code is in this case useless.
Search engines are only example, as Fris said. There can be totally random links.

Quote:

Originally Posted by Zoxxa

I didn't see your post so chill the fuck out allstar.

Zoxxa, you should chill out. I didn't have on mind attacking you, I just express my opinion.

shake · 08-02-2012, 12:39 PM

I'd use the scrapy python framework

http://scrapy.org/

Zoxxa · 08-02-2012, 12:45 PM

Quote:

Originally Posted by kazymjir

Zoxxa, notice, that Fris provided an example input. example input. It doesn't have to be search engines, it can be anything.
He wants to get the content between given <h3>s. He may not know what content is between them, so your code is in this case useless.
Search engines are only example, as Fris said. There can be totally random links.

Zoxxa, you should chill out. I didn't have on mind attacking you, I just express my opinion.

I apologize, I misread his post where it says "i only want to get the search engine links being".

I thought he was being literal and actually meant only urls being search engines, I did not read the text between the h3 tag or his last sed example which shows what he wants.

Barry-xlovecam · 08-02-2012, 12:47 PM

http://search.cpan.org/dist/libwww-perl/lwpcook.pod

If you need to get the file you can extract at the source ;)

or

#!/usr/bin/perl

use LWP::Simple qw(!head);

use HTML::SimpleLinkExtor;
my @links = HTML::SimpleLinkExtor->new->parse(get $page)->a;

fris · 08-02-2012, 01:20 PM

Quote:

Originally Posted by Zoxxa

I apologize, I misread his post where it says "i only want to get the search engine links being".

I thought he was being literal and actually meant only urls being search engines, I did not read the text between the h3 tag or his last sed example which shows what he wants.

i dont want whats in the h3 tags, i just want to get the links after the h3 tags, but only those links, not the <h3> block of links after those.

u-Bob · 08-02-2012, 01:40 PM

ugly, but it'll work... and less memory intensive than splitting the file:

Code:

open(FILE, 'stuff.txt');
$h3 = 0;
while(<FILE>)
{
  chomp;
  if($_ =~ "<h3>"){$h3++; if($h3 > 1){last;} }
  else{print "$_\n";}
  
}
close FILE;

fris · 08-02-2012, 05:17 PM

Quote:

Originally Posted by u-Bob

ugly, but it'll work... and less memory intensive than splitting the file:

Code:

open(FILE, 'stuff.txt');
$h3 = 0;
while(<FILE>)
{
  chomp;
  if($_ =~ "<h3>"){$h3++; if($h3 > 1){last;} }
  else{print "$_\n";}
  
}
close FILE;

forgot to mention i wanna specify the links based on the h3 tag so <h3>payment links</h3> would grab those links under that h3 element ;)

livexxx · 08-02-2012, 06:40 PM

cut and paste into a file editor, replace all <a href=" with \t<a href="
cut and paste into excel, select column, done

fris · 08-02-2012, 08:32 PM

Quote:

Originally Posted by livexxx

cut and paste into a file editor, replace all <a href=" with \t<a href="
cut and paste into excel, select column, done

ya thats what i would end up doing ;)

u-Bob · 08-03-2012, 07:40 AM

Quote:

Originally Posted by fris

forgot to mention i wanna specify the links based on the h3 tag so <h3>payment links</h3> would grab those links under that h3 element ;)

quick mod:

Code:

open(FILE, 'stuff.txt');
$h3 = 0;
$h3str = "payment links";
while(<FILE>)
{
  chomp;
  if($_ =~ "<h3>$h3str</h3>"){$h3++;}
  elsif($_ =~ "<h3>"){$h3 = 0;}
  elsif($h3>0){print "$_\n";}
}
close FILE;

this way it will even get the links if they are spread over multiple <h3>payment links</h3> blocks.
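
The same state-machine approach could look like this in Python. It is only a sketch: `links_under` is a made-up name, and it assumes one tag per line and a heading written exactly as `<h3>name</h3>` with no attributes.

```python
def links_under(lines, heading):
    """Collect the lines that follow every <h3>heading</h3> line,
    up to the next <h3> line, across all matching blocks."""
    wanted = f"<h3>{heading}</h3>"
    inside = False
    found = []
    for line in lines:
        line = line.rstrip("\n")
        if "<h3>" in line:
            # Every heading switches collection on or off.
            inside = wanted in line
        elif inside and line.strip():
            found.append(line)
    return found


if __name__ == "__main__":
    sample = [
        "<h3>search engine links</h3>",
        '<a href="http://google.com">google</a>',
        "<h3>payment links</h3>",
        '<a href="http://www.paypal.com">paypal</a>',
        "<h3>block three</h3>",
        '<a href="http://php.net">php</a>',
        "<h3>payment links</h3>",
        '<a href="http://www.paxum.com">paxum</a>',
    ]
    print("\n".join(links_under(sample, "payment links")))
```

Because the flag is reset on every heading, links from a second `payment links` block further down the file are collected too.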

Barry-xlovecam · 08-03-2012, 12:05 PM

Code:

  if($_ =~ m{<h3>\Q$h3str\E</h3>}i)

Might work better

Thank the gods -- a biz oriented thread

Brujah · 08-03-2012, 03:52 PM

Maybe something along this line:

Code:

echo preg_replace( '|.*</h3>(.*)<h3>.*|s', '$1', $input );

Brujah · 08-03-2012, 04:22 PM

I guess if multiple h3 blocks continue this will work:

Code:

echo preg_replace( '|.*?</h3>(.*?)<h3>.*|s', '$1', $input );

fris · 08-04-2012, 03:33 AM

Quote:

Originally Posted by Brujah

I guess if multiple h3 blocks continue this will work:

Code:

echo preg_replace( '|.*?</h3>(.*?)<h3>.*|s', '$1', $input );

where is the value for the link block stored to grab the certain block?

Brujah · 08-04-2012, 03:43 AM

Quote:

Originally Posted by fris

where is the value for the link block stored to grab the certain block?

assign it to a variable, ex. $link_block

Code:

$link_block = preg_replace( '|.*?</h3>(.*?)<h3>.*|s', '$1', $input );

fwiw, I prefer using xpath too like alcstrategy mentioned, but probably overkill depending on this need.

fris · 08-04-2012, 05:11 AM

Quote:

Originally Posted by Brujah

assign it to a variable, ex. $link_block

Code:

$link_block = preg_replace( '|.*?</h3>(.*?)<h3>.*|s', '$1', $input );

fwiw, I prefer using xpath too like alcstrategy mentioned, but probably overkill depending on this need.

cause the input is coming from file_get_contents

Code:

$data = file_get_contents('links.txt');
$block = preg_replace( '|.*?</h3>(.*?)<h3>.*|s', '$1', $data );
echo $block;

that displays the 3rd link block everytime

Brujah · 08-04-2012, 06:08 AM

It displays the first block for me, but all I had to go on was your sample links.txt code above.

Code:

# cat links.txt
<h3>search engine links</h3>
<a href="http://google.com">google</a>
<a href="http://www.bing.com">bing</a>
<a href="http://www.yahoo.com">yahoo</a>
<h3>payment links</h3>
<a href="http://www.paypal.com">paypal</a>
<a href="http://www.paxum.com">paxum</a>
<h3>block three</h3>
<a href="https://gfy.com">gfy</a>
<a href="http://php.net">php</a>

# php test.php

<a href="http://google.com">google</a>
<a href="http://www.bing.com">bing</a>
<a href="http://www.yahoo.com">yahoo</a>


# cat test.php
<?php

$data = file_get_contents('links.txt');
$block = preg_replace( '|.*?</h3>(.*?)<h3>.*|s', '$1', $data );
echo $block;
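
Note that the pattern above keys on the first `</h3>` it finds, not on the text of a heading, so it cannot pick a block by name. A minimal Python sketch of a regex that does select by heading text follows; `block_for` and `DATA` are names invented for this example, and it assumes headings are written exactly as `<h3>name</h3>`.

```python
import re

# Same content as the links.txt shown above.
DATA = """<h3>search engine links</h3>
<a href="http://google.com">google</a>
<a href="http://www.bing.com">bing</a>
<a href="http://www.yahoo.com">yahoo</a>
<h3>payment links</h3>
<a href="http://www.paypal.com">paypal</a>
<a href="http://www.paxum.com">paxum</a>
<h3>block three</h3>
<a href="https://gfy.com">gfy</a>
<a href="http://php.net">php</a>
"""


def block_for(data, heading):
    """Return the text between <h3>heading</h3> and the next <h3>
    (or the end of the data), or None if the heading is absent."""
    pattern = r"<h3>" + re.escape(heading) + r"</h3>(.*?)(?=<h3>|\Z)"
    match = re.search(pattern, data, flags=re.DOTALL)
    return match.group(1).strip() if match else None


if __name__ == "__main__":
    print(block_for(DATA, "payment links"))
```

The lazy `.*?` plus the lookahead stops at the next heading without consuming it, and `re.escape` keeps characters in the heading text from being read as regex syntax.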

fris · 08-04-2012, 07:59 AM

Quote:

Originally Posted by Brujah

It displays the first block for me, but all I had to go on was your sample links.txt code above.

Code:

# cat links.txt
<h3>search engine links</h3>
<a href="http://google.com">google</a>
<a href="http://www.bing.com">bing</a>
<a href="http://www.yahoo.com">yahoo</a>
<h3>payment links</h3>
<a href="http://www.paypal.com">paypal</a>
<a href="http://www.paxum.com">paxum</a>
<h3>block three</h3>
<a href="https://gfy.com">gfy</a>
<a href="http://php.net">php</a>

# php test.php

<a href="http://google.com">google</a>
<a href="http://www.bing.com">bing</a>
<a href="http://www.yahoo.com">yahoo</a>


# cat test.php
<?php

$data = file_get_contents('links.txt');
$block = preg_replace( '|.*?</h3>(.*?)<h3>.*|s', '$1', $data );
echo $block;

ya see the file is full of h3 sections, i wanna specify which one and it will get those links, so its not just 1 h3, u-bobs works for this

its chrome bookmarks, so each folder has a h3 heading for the folder name, just wanna get those links for the h3 folder name.
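
Since the real input is a bookmarks export, a parser-based sketch may be more forgiving than matching a literal `<h3>`. The example below uses Python's standard `html.parser` and groups hrefs by the text of the preceding h3. It rests on assumptions: that the export writes folder headings as H3 tags (possibly uppercase and with attributes) followed by their links, and that folders are not nested. `FolderLinks` and `SAMPLE` are hypothetical, and the sample markup is made up for illustration, not taken from a real export.

```python
from html.parser import HTMLParser


class FolderLinks(HTMLParser):
    """Group link hrefs under the text of the most recent <h3>."""

    def __init__(self):
        super().__init__()
        self.folders = {}      # heading text -> list of hrefs
        self._current = None   # heading the parser is currently under
        self._in_h3 = False
        self._title = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names.
        if tag == "h3":
            self._in_h3 = True
            self._title = []
        elif tag == "a" and self._current is not None:
            href = dict(attrs).get("href")
            if href:
                self.folders[self._current].append(href)

    def handle_data(self, data):
        if self._in_h3:
            self._title.append(data)

    def handle_endtag(self, tag):
        if tag == "h3":
            self._in_h3 = False
            self._current = "".join(self._title).strip()
            self.folders.setdefault(self._current, [])


# Made-up sample in the style of a bookmarks export.
SAMPLE = """<DT><H3 ADD_DATE="1343900000">search engine links</H3>
<DL><p>
<DT><A HREF="http://google.com" ADD_DATE="1343900001">google</A>
<DT><A HREF="http://www.bing.com">bing</A>
</DL><p>
<DT><H3>payment links</H3>
<DL><p>
<DT><A HREF="http://www.paypal.com">paypal</A>
</DL><p>
"""

if __name__ == "__main__":
    parser = FolderLinks()
    parser.feed(SAMPLE)
    parser.close()
    print(parser.folders.get("payment links"))
```

With nested folders, links in a subfolder would be filed under the subfolder's heading only; handling that would need a stack of headings driven by the enclosing list tags.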

Barry-xlovecam · 08-04-2012, 09:08 AM

One idea...

http://www.perlmonks.org/?node_id=507660

what is the next starting <tag>text</tag> and always so?

Brujah · 08-04-2012, 04:56 PM

Ah ok. If you're still interested in a php solution, maybe this?

Code:

if ( empty( $argv[1] ) ) die( 'Usage: php test.php keyword' . PHP_EOL );
$fp = fopen( 'links.txt', 'r' );
while( $line = fgets( $fp ) )
{
    if ( strpos( $line, '<h3>' ) !== false AND strpos( $line, $argv[1] ) !== false )
    {
        do {
            $line = fgets( $fp );
            if ( strpos( $line, '<h3>' ) !== false ) break 2;
            else echo $line;
        } while ( ! feof( $fp ) );
    }

}
fclose( $fp );

Output usage:

Code:

~ $ php test.php
Usage: php test.php keyword
~ $ php test.php search
<a href="http://google.com">google</a>
<a href="http://www.bing.com">bing</a>
<a href="http://www.yahoo.com">yahoo</a>
~ $ php test.php pay   
<a href="http://www.paypal.com">paypal</a>
<a href="http://www.paxum.com">paxum</a>
~ $ php test.php bleh
<a href="http://php.net">php</a>
<a href="http://nginx.org">nginx</a>

~ $ php test.php 'search engine links'
<a href="http://google.com">google</a>
<a href="http://www.bing.com">bing</a>
<a href="http://www.yahoo.com">yahoo</a>
~ $ php test.php 'payment links'      
<a href="http://www.paypal.com">paypal</a>
<a href="http://www.paxum.com">paxum</a>

livexxx · 08-04-2012, 06:51 PM

you could have done it in excel by now or registered a website that parses chrome book marks with php preg_match

is it just chrome bookmarks? I'll make a damn site to stop reading this

Brujah · 08-04-2012, 07:24 PM

Quote:

Originally Posted by livexxx

you could have done it in excel by now or registered a website that parses chrome book marks with php preg_match

is it just chrome bookmarks? I'll make a damn site to stop reading this

I think it's more than just needing a quick solution, and not having one yet. It's a way to learn from others code, and to think of different approaches. Maybe one will suddenly be more elegant or preferable or faster or show an approach that you might not have chosen. Maybe a much better regex is presented. That's usually how I view these code threads.

fris · 08-05-2012, 07:37 PM

great posts guys ;)

pornsprite · 08-05-2012, 10:56 PM

This is untested and you already have a lot of good examples but I believe you have more options on the command line this way.

#!/usr/bin/perl

die "Usage is $0 <start> <stop> <filename>\n" unless $ARGV[2];

$start = shift;
$stop = shift;
$file = shift;

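open(FILE, "$file") or die "Could not open $file $!\n";
while(<FILE>){
chomp;
$sp = 1 if $_ =~ /$start/;
last if $_ =~ /$stop/;   # stop reading quietly instead of dying with an error
next if $_ =~ /<h3>/;    # skip heading lines (the sample data uses <h3>, not <h2>)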
if($sp == 1){
print "$_\n";
}
}
