GoFuckYourself.com - Adult Webmaster Forum - Best way to do this? python/ruby/perl/sed/awk/php/etc?

fris	08-02-2012 10:48 AM

Best way to do this? python/ruby/perl/sed/awk/php/etc?

I have a list of links.

example

Code:

<h3>search engine links</h3>

<a href="http://google.com">google</a>

<a href="http://www.bing.com">bing</a>

<a href="http://www.yahoo.com">yahoo</a>

<h3>payment links</h3>

<a href="http://www.paypal.com">paypal</a>

<a href="http://www.paxum.com">paxum</a>

i only want to get the search engine links being

Code:

<a href="http://google.com">google</a>

<a href="http://www.bing.com">bing</a>

<a href="http://www.yahoo.com">yahoo</a>

best way to go about this?

i was using sed, but its printing the 2nd h3

Code:

sed -n '/<h3>/,/<\/h3>/p' test.txt

any input would be great ;)

alcstrategy

08-02-2012 11:14 AM

have you tried using xpath?

i don't know exactly what you're doing but in this case //a[position()<4] will bring the search engines in this case, but i'm sure xpath could handle whatever you wanted to do

i guess u just want everything after the first h3 tag and probably has dynamic number of links, but it was just an example of using xpath

kazymjir

08-02-2012 11:37 AM

Code:

$ cat test.txt

<h3>search engine links</h3>

<a href="http://google.com">google</a>

<a href="http://www.bing.com">bing</a>

<a href="http://www.yahoo.com">yahoo</a>

<h3>payment links</h3>

<a href="http://www.paypal.com">paypal</a>

<a href="http://www.paxum.com">paxum</a>

$ sed -e '/<h3>payment/,/<\/h3>/ d' -e '/<h3>/ d' test.txt

<a href="http://google.com">google</a>

<a href="http://www.bing.com">bing</a>

<a href="http://www.yahoo.com">yahoo</a>

$
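For anyone wondering why the original command prints the second heading: a sed range `/a/,/b/` begins on a line matching the first pattern and only starts looking for the second pattern on the following line. Below is a rough Python sketch of that rule. The function name `sed_range` and the `SAMPLE` list are made up for illustration, and this only mimics the behaviour for two regex addresses; it is not how sed is implemented.

```python
import re

# The sample input from the first post, one entry per line.
SAMPLE = [
    '<h3>search engine links</h3>',
    '<a href="http://google.com">google</a>',
    '<a href="http://www.bing.com">bing</a>',
    '<a href="http://www.yahoo.com">yahoo</a>',
    '<h3>payment links</h3>',
    '<a href="http://www.paypal.com">paypal</a>',
    '<a href="http://www.paxum.com">paxum</a>',
]


def sed_range(lines, start, end):
    """Return the lines a sed address range /start/,/end/ would select."""
    selected = []
    in_range = False
    for line in lines:
        if in_range:
            selected.append(line)
            # The end pattern is only tested on lines after the start line.
            if re.search(end, line):
                in_range = False
        elif re.search(start, line):
            selected.append(line)
            in_range = True
    return selected


if __name__ == "__main__":
    for line in sed_range(SAMPLE, r"<h3>", r"</h3>"):
        print(line)
```

Under this rule the `</h3>` on the first heading line is never tested, so the range runs through the second heading. The delete range `/<h3>payment/,/<\/h3>/` never finds a later `</h3>` in this input and so runs to the end of the file. If the file had a third block, that range would stop at the third heading and the third block's links would be printed too.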

Zoxxa

08-02-2012 11:57 AM

I would first extract all the "a href" tags with regex, xpath, or this: http://simplehtmldom.sourceforge.net/

Then detect which urls contain search engine keywords or domains.

Something like this (Typed out fast, did not test):

Code:



$href_array = array('<a href="http://google.com">google</a>', '<a href="http://www.bing.com">bing</a>', 'etc..');



$search_engines = array('bing.com', 'google.com', 'etc...');



$i = 0;

foreach($href_array as $link) {

        

        foreach($search_engines as $site){

                if(strpos($link, $site) !== FALSE){

                        

                        // SE link found

                        $final[$i] = $link;

                        $i++;

                }

        }



}



echo '<pre>';

print_r($final);
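A plain substring check like `strpos` also matches things such as `notgoogle.com`. If filtering by domain is really what is wanted, a sketch in Python could compare hostnames instead. The names `SEARCH_ENGINES`, `is_search_engine` and `search_engine_links` are hypothetical, and the domain list is only an example.

```python
import re
from urllib.parse import urlparse

# Example allow-list; an assumption, not taken from the thread.
SEARCH_ENGINES = {"google.com", "bing.com", "yahoo.com"}


def is_search_engine(url, domains=SEARCH_ENGINES):
    """True if the URL's host is one of the domains or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in domains)


def search_engine_links(anchors):
    """Keep only the <a> tags whose href points at a listed domain."""
    kept = []
    for anchor in anchors:
        match = re.search(r'href="([^"]+)"', anchor)
        if match and is_search_engine(match.group(1)):
            kept.append(anchor)
    return kept


if __name__ == "__main__":
    anchors = [
        '<a href="http://google.com">google</a>',
        '<a href="http://www.bing.com">bing</a>',
        '<a href="http://www.paypal.com">paypal</a>',
    ]
    for anchor in search_engine_links(anchors):
        print(anchor)
```

This still needs the list of domains up front, so it only helps when the goal is "links to known sites" rather than "links under a given heading".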

kazymjir

08-02-2012 12:09 PM

Quote:

Originally Posted by Zoxxa (Post 19100194)

Code:



$href_array = array('<a href="http://google.com">google</a>', '<a href="http://www.bing.com">bing</a>', 'etc..');



$search_engines = array('bing.com', 'google.com', 'etc...');



$i = 0;

foreach($href_array as $link) {

        

        foreach($search_engines as $site){

                if(strpos($link, $site) !== FALSE){

                        

                        // SE link found

                        $final[$i] = $link;

                        $i++;

                }

        }



}



echo '<pre>';

print_r($final);

Zoxxa, sorry, but this makes completely no sense.

If you know all search engine links ($search_engines array), why do you search them?
It's like "I *know* that a lightbulb and a toy car are inside this box, but I will check it anyway".

Also, what will be if you don't have a link in $search_engines that exists in test.txt ?

And, why you are firing up PHP, performing DOM/regexp processing, while it can be done with single sed command?

alcstrategy

08-02-2012 12:20 PM

if you wanted to use xpath u could use //a[following-sibling::h3[1]]
but kazymjir's method is probably what you are looking for
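To see what `//a[following-sibling::h3[1]]` selects, here is a small Python sketch. Python's standard-library ElementTree only implements a subset of XPath and has no `following-sibling` axis, so the sketch walks the siblings by hand; a full XPath 1.0 engine should accept the expression as written. The wrapper `<root>` element and the function name are assumptions for the example, and the input has to be well-formed XML for this to parse.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<h3>search engine links</h3>
<a href="http://google.com">google</a>
<a href="http://www.bing.com">bing</a>
<a href="http://www.yahoo.com">yahoo</a>
<h3>payment links</h3>
<a href="http://www.paypal.com">paypal</a>
<a href="http://www.paxum.com">paxum</a>
"""


def links_with_following_heading(fragment):
    """Return the href of every <a> that has a later <h3> sibling."""
    # Wrap the fragment so it has a single root element.
    root = ET.fromstring("<root>" + fragment + "</root>")
    children = list(root)
    hrefs = []
    for index, element in enumerate(children):
        if element.tag != "a":
            continue
        # Same test as the predicate [following-sibling::h3].
        if any(sibling.tag == "h3" for sibling in children[index + 1:]):
            hrefs.append(element.get("href"))
    return hrefs


if __name__ == "__main__":
    print(links_with_following_heading(SAMPLE))
```

On the two-block sample this gives the three search engine links. With three or more blocks it would return every link except those in the last block, so it is not a general "first block only" selector.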

Zoxxa

08-02-2012 12:20 PM

Quote:

Originally Posted by kazymjir (Post 19100212)

Zoxxa, sorry, but this makes completely no sense.

To you.

Quote:

Originally Posted by kazymjir (Post 19100212)

If you know all search engine links ($search_engines array), why do you search them? It's like "I *know* that a lightbulb and a toy car are inside this box, but I will check it anyway".

Because from his example list he doesn't. He had links with paypal / paxum as well. I suppose he could just select all links between <h3>search engine links</h3> and <h3>payment links</h3> with regex or with xpath, but my code would work with all links he grabs from any section anywhere. If he is only concerned with that block, then regex that part out or xpath.

Quote:

Originally Posted by kazymjir (Post 19100212)

Also, what will be if you don't have a link in $search_engines that exists in test.txt ?

Like I said,he could use regex to grab the block, or just keep a simple list of search engines he wants to extract.

Quote:

Originally Posted by kazymjir (Post 19100212)

And, why you are firing up PHP, performing DOM/regexp processing, while it can be done with single sed command?

I code with php, not sed, so obviously my help would be provided with php. I didn't see your post so chill the fuck out allstar.

Barry-xlovecam

08-02-2012 12:32 PM

Quote:

Originally Posted by fris (Post 19100063)

I have a list of links.

example

Code:

<h3>search engine links</h3>

<a href="http://google.com">google</a>

<a href="http://www.bing.com">bing</a>

<a href="http://www.yahoo.com">yahoo</a>

<h3>payment links</h3>

<a href="http://www.paypal.com">paypal</a>

<a href="http://www.paxum.com">paxum</a>

i only want to get the search engine links being

Code:

<a href="http://google.com">google</a>

<a href="http://www.bing.com">bing</a>

<a href="http://www.yahoo.com">yahoo</a>

best way to go about this?

i was using sed, but its printing the 2nd h3

Code:

sed -n '/<h3>/,/<\/h3>/p' test.txt

any input would be great ;)

Fast and dirty -- should be in strict
in a foreach loop

Code:

Perl



foreach (@_) { if ($_ =~ /href/i) { chomp $_; print FILE "$_\n"; } }

kazymjir

08-02-2012 12:37 PM

Zoxxa, notice that Fris provided an example input. It doesn't have to be search engines, it can be anything.
He wants to get the content between given <h3>s. He may not know what content is between them, so your code is in this case useless.
Search engines are only example, as Fris said. There can be totally random links.

Quote:

Originally Posted by Zoxxa (Post 19100238)

I didn't see your post so chill the fuck out allstar.

Zoxxa, you should chill out. I didn't mean to attack you, I was just expressing my opinion.

shake

08-02-2012 12:39 PM

I'd use the scrapy python framework

http://scrapy.org/

Zoxxa

08-02-2012 12:45 PM

Quote:

Originally Posted by kazymjir (Post 19100272)

I apologize, I misread his post where it says "i only want to get the search engine links being".

I thought he was being literal and actually meant only urls being search engines, I did not read the text between the h3 tag or his last sed example which shows what he wants.

Barry-xlovecam

08-02-2012 12:47 PM

http://search.cpan.org/dist/libwww-perl/lwpcook.pod

If you need to get the file you can extract at the source ;)

or

#!/usr/bin/perl

use LWP::Simple qw(!head);

use HTML::SimpleLinkExtor;
my @links = HTML::SimpleLinkExtor->new->parse(get $page)->a;

fris	08-02-2012 01:20 PM

Quote:

Originally Posted by Zoxxa (Post 19100285)

i dont want whats in the h3 tags, i just want to get the links after the h3 tags, but only those links, not the <h3> block of links after those.

:thumbsup

u-Bob

08-02-2012 01:40 PM

ugly, but it'll work... and less memory intensive than splitting the file:

Code:

open(FILE, 'stuff.txt');

$h3 = 0;

while(<FILE>)

{

  chomp;

  if($_ =~ /<h3>/){$h3++; last if $h3 > 1; }  # stop reading at the second <h3>

  else{print "$_\n";}

  

}

close FILE;

fris	08-02-2012 05:17 PM

Quote:

Originally Posted by u-Bob (Post 19100387)

ugly, but it'll work... and less memory intensive than splitting the file:

Code:

open(FILE, 'stuff.txt');

$h3 = 0;

while(<FILE>)

{

  chomp;

  if($_ =~ /<h3>/){$h3++; last if $h3 > 1; }  # stop reading at the second <h3>

  else{print "$_\n";}

  

}

close FILE;

forgot to mention i wanna specify the links based on the h3 tag so <h3>payment links</h3> would grab those links under that h3 element ;)

livexxx

08-02-2012 06:40 PM

cut and paste into a file editor, replace all <a href=" with \t<a href="
cut and paste into excel, select column, done

fris	08-02-2012 08:32 PM

Quote:

Originally Posted by livexxx (Post 19100715)

cut and paste into a file editor, replace all <a href=" with \t<a href="
cut and paste into excel, select column, done

ya thats what i would end up doing ;)

u-Bob

08-03-2012 07:40 AM

Quote:

Originally Posted by fris (Post 19100630)

forgot to mention i wanna specify the links based on the h3 tag so <h3>payment links</h3> would grab those links under that h3 element ;)

quick mod:

Code:

open(FILE, 'stuff.txt');

$h3 = 0;

$h3str = "payment links";

while(<FILE>)

{

  chomp;

  if($_ =~ "<h3>$h3str</h3>"){$h3++;}

  elsif($_ =~ "<h3>"){$h3 = 0;}

  elsif($h3>0){print "$_\n";}

}

close FILE;

this way it will even get the links if they are spread over multiple <h3>payment links</h3> blocks.
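The same on/off flag idea, sketched in Python for comparison. The function name `links_under` and the sample lines are made up for the example. Unlike the Perl version it also skips blank lines, which is a choice made here and not something from the post above.

```python
def links_under(lines, heading):
    """Collect the lines that follow <h3>heading</h3>, up to the next <h3>.

    If the same heading occurs more than once, the lines from every
    matching block are collected.
    """
    wanted = "<h3>" + heading + "</h3>"
    collecting = False
    found = []
    for raw in lines:
        line = raw.rstrip("\n")
        if wanted in line:
            collecting = True       # entering a block we want
        elif "<h3>" in line:
            collecting = False      # some other heading: stop collecting
        elif collecting and line.strip():
            found.append(line)      # keep non-blank lines inside the block
    return found


if __name__ == "__main__":
    sample = [
        "<h3>search engine links</h3>\n",
        '<a href="http://google.com">google</a>\n',
        "<h3>payment links</h3>\n",
        '<a href="http://www.paypal.com">paypal</a>\n',
        "<h3>search engine links</h3>\n",
        '<a href="http://www.bing.com">bing</a>\n',
    ]
    for link in links_under(sample, "search engine links"):
        print(link)
```

Because `lines` can be any iterable, passing an open file object reads the file line by line instead of loading it all into memory.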

Barry-xlovecam

08-03-2012 12:05 PM

Code:

if($_ =~ m{<h3>\Q$h3str\E</h3>}i)

Might work better

Thank the gods -- a biz oriented thread :1orglaugh

Brujah

08-03-2012 03:52 PM

Maybe something along this line:

Code:

echo preg_replace( '|.*</h3>(.*)<h3>.*|s', '$1', $input );

Brujah

08-03-2012 04:22 PM

I guess if multiple h3 blocks continue this will work:

Code:

echo preg_replace( '|.*?</h3>(.*?)<h3>.*|s', '$1', $input );

fris	08-04-2012 03:33 AM

Quote:

Originally Posted by Brujah (Post 19102043)

I guess if multiple h3 blocks continue this will work:

Code:

echo preg_replace( '|.*?</h3>(.*?)<h3>.*|s', '$1', $input );

where is the value for the link block stored to grab the certain block?

Brujah

08-04-2012 03:43 AM

Quote:

Originally Posted by fris (Post 19103060)

where is the value for the link block stored to grab the certain block?

assign it to a variable, ex. $link_block

Code:

$link_block = preg_replace( '|.*?</h3>(.*?)<h3>.*|s', '$1', $input );

fwiw, I prefer using xpath too like alcstrategy mentioned, but probably overkill depending on this need.
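A sketch of the same non-greedy idea in Python, extended so the heading can be chosen. The function names `first_block` and `block_for` and the `SAMPLE` string are assumptions for illustration. The lookahead `(?=<h3>|\Z)` is a swapped-in detail so that the last block, which has no `<h3>` after it, can still be captured.

```python
import re

SAMPLE = """<h3>search engine links</h3>
<a href="http://google.com">google</a>
<a href="http://www.bing.com">bing</a>
<a href="http://www.yahoo.com">yahoo</a>
<h3>payment links</h3>
<a href="http://www.paypal.com">paypal</a>
<a href="http://www.paxum.com">paxum</a>
"""


def first_block(html):
    """Text between the first </h3> and the next <h3>, or None."""
    # .*? is non-greedy, so the capture stops at the nearest <h3>.
    match = re.search(r"</h3>(.*?)<h3>", html, re.DOTALL)
    return match.group(1).strip() if match else None


def block_for(html, heading):
    """Text after <h3>heading</h3> up to the next <h3> or end of input."""
    pattern = r"<h3>" + re.escape(heading) + r"</h3>(.*?)(?=<h3>|\Z)"
    match = re.search(pattern, html, re.DOTALL)
    return match.group(1).strip() if match else None


if __name__ == "__main__":
    print(block_for(SAMPLE, "payment links"))
```

`re.escape` keeps any regex metacharacters in the heading text from being interpreted as part of the pattern.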

fris	08-04-2012 05:11 AM

Quote:

Originally Posted by Brujah (Post 19103067)

assign it to a variable, ex. $link_block

Code:

$link_block = preg_replace( '|.*?</h3>(.*?)<h3>.*|s', '$1', $input );

fwiw, I prefer using xpath too like alcstrategy mentioned, but probably overkill depending on this need.

cause the input is coming from file_get_contents

Code:



$data = file_get_contents('links.txt');

$block = preg_replace( '|.*?</h3>(.*?)<h3>.*|s', '$1', $data );

echo $block;

that displays the 3rd link block every time

Brujah

08-04-2012 06:08 AM

It displays the first block for me, but all I had to go on was your sample links.txt code above.

Code:

# cat links.txt

<h3>search engine links</h3>

<a href="http://google.com">google</a>

<a href="http://www.bing.com">bing</a>

<a href="http://www.yahoo.com">yahoo</a>

<h3>payment links</h3>

<a href="http://www.paypal.com">paypal</a>

<a href="http://www.paxum.com">paxum</a>

<h3>block three</h3>

<a href="https://gfy.com">gfy</a>

<a href="http://php.net">php</a>



# php test.php



<a href="http://google.com">google</a>

<a href="http://www.bing.com">bing</a>

<a href="http://www.yahoo.com">yahoo</a>





# cat test.php

<?php



$data = file_get_contents('links.txt');

$block = preg_replace( '|.*?</h3>(.*?)<h3>.*|s', '$1', $data );

echo $block;

fris	08-04-2012 07:59 AM

Quote:

Originally Posted by Brujah (Post 19103154)

It displays the first block for me, but all I had to go on was your sample links.txt code above.

Code:

# cat links.txt

<h3>search engine links</h3>

<a href="http://google.com">google</a>

<a href="http://www.bing.com">bing</a>

<a href="http://www.yahoo.com">yahoo</a>

<h3>payment links</h3>

<a href="http://www.paypal.com">paypal</a>

<a href="http://www.paxum.com">paxum</a>

<h3>block three</h3>

<a href="https://gfy.com">gfy</a>

<a href="http://php.net">php</a>



# php test.php



<a href="http://google.com">google</a>

<a href="http://www.bing.com">bing</a>

<a href="http://www.yahoo.com">yahoo</a>





# cat test.php

<?php



$data = file_get_contents('links.txt');

$block = preg_replace( '|.*?</h3>(.*?)<h3>.*|s', '$1', $data );

echo $block;

ya see the file is full of h3 sections, i wanna specify which one and it will get those links, so its not just 1 h3, u-bobs works for this

its chrome bookmarks, so each folder has a h3 heading for the folder name, just wanna get those links for the h3 folder name.
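Since the input is a bookmarks export, the headings may carry attributes (for example `<H3 ADD_DATE="...">`) and upper-case tag names, which a literal `<h3>` string match would miss. A hedged sketch using Python's standard `html.parser`, which lower-cases tag and attribute names while parsing. The class and function names and the `EXPORT` sample are made up, and the sample only approximates what an exported file looks like.

```python
from html.parser import HTMLParser

# Rough approximation of a bookmarks export; the real file may differ.
EXPORT = """<DL><p>
<DT><H3 ADD_DATE="1">search engine links</H3>
<DL><p>
<DT><A HREF="http://google.com" ADD_DATE="1">google</A>
<DT><A HREF="http://www.bing.com">bing</A>
</DL><p>
<DT><H3 ADD_DATE="2">payment links</H3>
<DL><p>
<DT><A HREF="http://www.paypal.com">paypal</A>
</DL><p>
</DL><p>
"""


class FolderLinks(HTMLParser):
    """Collect hrefs that appear after a given <h3> until the next <h3>."""

    def __init__(self, folder):
        super().__init__()
        self.folder = folder
        self.in_h3 = False
        self.h3_text = []
        self.collecting = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            # A new heading starts: reset and wait for its text.
            self.in_h3 = True
            self.h3_text = []
            self.collecting = False
        elif tag == "a" and self.collecting:
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        if self.in_h3:
            self.h3_text.append(data)

    def handle_endtag(self, tag):
        if tag == "h3":
            self.in_h3 = False
            name = "".join(self.h3_text).strip()
            self.collecting = (name == self.folder)


def folder_links(html, folder):
    parser = FolderLinks(folder)
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    print(folder_links(EXPORT, "search engine links"))
```

This treats the file as flat: a nested subfolder heading ends the collection, so links that come after a subfolder but still belong to the parent folder would be missed.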

Barry-xlovecam

08-04-2012 09:08 AM

One idea...

http://www.perlmonks.org/?node_id=507660

What is the next starting <tag>text</tag>, and is it always the same?

Brujah

08-04-2012 04:56 PM

Ah ok. If you're still interested in a php solution, maybe this?

Code:

if ( empty( $argv[1] ) ) die( 'Usage: php test.php keyword' . PHP_EOL );

$fp = fopen( 'links.txt', 'r' );

while( $line = fgets( $fp ) )

{

    if ( strpos( $line, '<h3>' ) !== false AND strpos( $line, $argv[1] ) !== false )

    {

        do {

            $line = fgets( $fp );

            if ( strpos( $line, '<h3>' ) !== false ) break 2;

            else echo $line;

        } while ( ! feof( $fp ) );

    }



}

fclose( $fp );

Output usage:

Code:

~ $ php test.php

Usage: php test.php keyword

~ $ php test.php search

<a href="http://google.com">google</a>

<a href="http://www.bing.com">bing</a>

<a href="http://www.yahoo.com">yahoo</a>

~ $ php test.php pay   

<a href="http://www.paypal.com">paypal</a>

<a href="http://www.paxum.com">paxum</a>

~ $ php test.php bleh

<a href="http://php.net">php</a>

<a href="http://nginx.org">nginx</a>



~ $ php test.php 'search engine links'

<a href="http://google.com">google</a>

<a href="http://www.bing.com">bing</a>

<a href="http://www.yahoo.com">yahoo</a>

~ $ php test.php 'payment links'      

<a href="http://www.paypal.com">paypal</a>

<a href="http://www.paxum.com">paxum</a>

livexxx

08-04-2012 06:51 PM

you could have done it in excel by now or registered a website that parses chrome book marks with php preg_match

is it just chrome bookmarks? I'll make a damn site to stop reading this :)

Brujah

08-04-2012 07:24 PM

Quote:

Originally Posted by livexxx (Post 19103907)

you could have done it in excel by now or registered a website that parses chrome book marks with php preg_match

is it just chrome bookmarks? I'll make a damn site to stop reading this :)

I think it's more than just needing a quick solution, and not having one yet. It's a way to learn from others' code, and to think of different approaches. Maybe one will suddenly be more elegant or preferable or faster or show an approach that you might not have chosen. Maybe a much better regex is presented. That's usually how I view these code threads. :2 cents:

fris	08-05-2012 07:37 PM

great posts guys ;)

pornsprite

08-05-2012 10:56 PM

This is untested and you already have a lot of good examples, but I believe this gives you more options on the command line.

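#!/usr/bin/perl
use strict;
use warnings;

die "Usage is $0 <start> <stop> <filename>\n" unless @ARGV == 3;

my $start = shift;
my $stop = shift;
my $file = shift;
my $sp = 0;

open(my $fh, '<', $file) or die "Could not open $file: $!\n";
while(<$fh>){
chomp;
$sp = 1 if $_ =~ /$start/;   # start printing once the start pattern is seen
last if $_ =~ /$stop/;       # stop reading at the stop pattern
next if $_ =~ /<h3>/;        # skip heading lines (the thread's input uses <h3>)
if($sp == 1){
print "$_\n";
}
}
close($fh);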

All times are GMT -7.