Best way to do this? python/ruby/perl/sed/awk/php/etc?
I have a list of links.
Example:
Code:
<h3>search engine links</h3>
Code:
<a href="http://google.com">google</a>
I was using sed, but it's printing the 2nd h3:
Code:
sed -n '/<h3>/,/<\/h3>/p' test.txt |
Have you tried using XPath?
I don't know exactly what you're doing, but in this case //a[position()<4] will return the search engine links. I guess you just want everything after the first h3 tag, and the number of links is probably dynamic, so this was just an example of using XPath. I'm sure XPath could handle whatever you want to do. |
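A rough Python sketch of the //a[position()<4] idea, using only the standard library. ElementTree supports just a subset of XPath and has no position() function, so a list slice stands in for the predicate here. The sample markup (the wrapping <root> element and every link except google.com) is made up for illustration, and it assumes the input is well-formed enough to parse as XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; wrapped in <root> so it parses as well-formed XML.
SAMPLE = """<root>
<h3>search engine links</h3>
<a href="http://google.com">google</a>
<a href="http://yahoo.com">yahoo</a>
<a href="http://bing.com">bing</a>
<h3>other links</h3>
<a href="http://example.com">example</a>
</root>"""


def first_links(markup, count=3):
    """Return the href of the first `count` <a> elements in document order."""
    root = ET.fromstring(markup)
    # ".//a" finds every <a> below the root; the slice replaces position()<4.
    return [a.get("href") for a in root.findall(".//a")[:count]]


if __name__ == "__main__":
    print(first_links(SAMPLE))
```

As noted in the post, a fixed count only works when you already know how many links sit under the first heading.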
Code:
$ cat test.txt |
I would first extract all the "a href" tags with a regex, XPath, or this: http://simplehtmldom.sourceforge.net/
Then detect which URLs contain search engine keywords or domains. Something like this (typed out fast, did not test): Code:
|
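The extract-then-filter idea suggested above could look roughly like this in Python. This is only a sketch, not the poster's code: the SEARCH_ENGINES set and the function name are assumptions made up for the example, and the regex only handles simple double-quoted href attributes.

```python
import re
from urllib.parse import urlparse

# Assumed list of search engine domains, purely for illustration.
SEARCH_ENGINES = {"google.com", "yahoo.com", "bing.com"}


def search_engine_links(html, engines=SEARCH_ENGINES):
    """Return hrefs whose host is one of `engines` or a subdomain of one."""
    hrefs = re.findall(r'<a\s+href="([^"]+)"', html, flags=re.IGNORECASE)
    hits = []
    for href in hrefs:
        host = (urlparse(href).hostname or "").lower()
        if any(host == d or host.endswith("." + d) for d in engines):
            hits.append(href)
    return hits
```

Note that this only finds links you already listed, so it does not help when the block under the heading can contain arbitrary links.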
Quote:
If you already know all the search engine links (the $search_engines array), why do you search for them? It's like "I *know* that a lightbulb and a toy car are inside this box, but I will check it anyway". Also, what happens if a link exists in test.txt but is not in $search_engines? And why fire up PHP and do DOM/regexp processing when it can be done with a single sed command? |
If you wanted to use XPath, you could use //a[following-sibling::h3[1]],
but kazymjir's method is probably what you are looking for. |
Quote:
in a foreach loop Code:
Perl |
Zoxxa, notice that Fris provided an example input. It doesn't have to be search engines, it can be anything.
He wants to get the content between given <h3>s. He may not know what content is between them, so in this case your code is useless. Search engines are only an example, as Fris said. The links could be totally random.
|
Quote:
I apologize, I misread his post where it says "i only want to get the search engine links being". I thought he was being literal and actually meant only URLs that are search engines. I did not read the text between the h3 tags or his last sed example, which shows what he wants. |
http://search.cpan.org/dist/libwww-perl/lwpcook.pod |
Quote:
:thumbsup |
ugly, but it'll work... and less memory intensive than splitting the file:
Code:
open(FILE, 'stuff.txt'); |
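The line-by-line approach mentioned above (read the file once, never hold it all in memory) can be sketched in Python like this. It is a sketch under assumptions, not the Perl from the post: it assumes each tag sits on its own line, and the function name and heading argument are made up for the example.

```python
import io


def links_under_heading(lines, heading):
    """Yield the lines after <h3>heading</h3> up to, not including, the next <h3>."""
    target = "<h3>{}</h3>".format(heading).lower()
    inside = False
    for line in lines:
        lowered = line.lower()
        if not inside:
            # Keep skipping until the wanted heading shows up.
            inside = target in lowered
            continue
        if "<h3>" in lowered:
            break  # the next heading ends the block
        if line.strip():
            yield line.rstrip("\n")


if __name__ == "__main__":
    # Stand-in for an open file; with a real file use: open("stuff.txt")
    sample = io.StringIO(
        '<h3>search engine links</h3>\n'
        '<a href="http://google.com">google</a>\n'
        '<h3>other links</h3>\n'
        '<a href="http://example.com">example</a>\n'
    )
    for link in links_under_heading(sample, "search engine links"):
        print(link)
```

Because the function is a generator over any iterable of lines, it stops reading as soon as it hits the next heading.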
cut and paste into a file editor, replace all <a href=" with \t<a href="
cut and paste into excel, select column, done |
Quote:
Code:
open(FILE, 'stuff.txt'); |
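Code:
# m{...} delimiters, so the "/" in </h3> needs no escaping; \Q...\E quotes any metacharacters in $h3str
if ($_ =~ m{<h3>\Q$h3str\E</h3>}i)
Might work better. |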
Maybe something along this line:
Code:
echo preg_replace( '|.*</h3>(.*)<h3>.*|s', '$1', $input ); |
I guess if there are multiple h3 blocks, this will work:
Code:
echo preg_replace( '|.*?</h3>(.*?)<h3>.*|s', '$1', $input ); |
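To see the difference between the two patterns above, here is a small Python sketch using the same regexes (Python's re.DOTALL plays the role of the PHP "s" modifier, and \1 replaces $1). The three-heading input is made up for the demonstration.

```python
import re

# Hypothetical input with more than one h3 block.
INPUT = (
    '<h3>first</h3>\n<a href="http://google.com">google</a>\n'
    '<h3>second</h3>\n<a href="http://example.com">example</a>\n'
    '<h3>third</h3>\n'
)

GREEDY = re.compile(r".*</h3>(.*)<h3>.*", re.DOTALL)
LAZY = re.compile(r".*?</h3>(.*?)<h3>.*", re.DOTALL)


def block(pattern, text):
    """Replace the whole match with capture group 1, as preg_replace does."""
    return pattern.sub(r"\1", text, count=1).strip()


if __name__ == "__main__":
    print(block(GREEDY, INPUT))  # greedy: ends up with the LAST block
    print(block(LAZY, INPUT))    # lazy: ends up with the FIRST block
```

The greedy version backtracks from the end of the input, so with several headings it keeps the last block. The lazy version stops at the first </h3> and the first following <h3>.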
Quote:
Code:
$link_block = preg_replace( '|.*?</h3>(.*?)<h3>.*|s', '$1', $input ); |
It displays the first block for me, but all I had to go on was your sample links.txt code above.
Code:
# cat links.txt |
Quote:
It's Chrome bookmarks, so each folder has an h3 heading with the folder name. I just want to get the links under a given h3 folder name. |
One idea... |
Ah ok. If you're still interested in a php solution, maybe this?
Code:
if ( empty( $argv[1] ) ) die( 'Usage: php test.php keyword' . PHP_EOL );
Code:
~ $ php test.php |
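For the Chrome bookmarks case, a parser-based take on the same keyword idea might look like this in Python. It is only a sketch using the standard library's html.parser: the class and function names are made up, and it assumes every h3 starts a new folder, so links in a nested subfolder are counted under the subfolder's heading, not the parent's.

```python
from html.parser import HTMLParser


class FolderLinks(HTMLParser):
    """Collect hrefs that follow an <h3> whose text contains `keyword`."""

    def __init__(self, keyword):
        super().__init__()
        self.keyword = keyword.lower()
        self.in_h3 = False
        self.h3_text = []
        self.collecting = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names (H3 -> h3, HREF -> href).
        if tag == "h3":
            self.in_h3 = True
            self.h3_text = []
            self.collecting = False  # a new heading ends the previous block
        elif tag == "a" and self.collecting:
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        if self.in_h3:
            self.h3_text.append(data)

    def handle_endtag(self, tag):
        if tag == "h3":
            self.in_h3 = False
            self.collecting = self.keyword in "".join(self.h3_text).lower()


def folder_links(html, keyword):
    """Return the hrefs listed under the first-level match for `keyword`."""
    parser = FolderLinks(keyword)
    parser.feed(html)
    parser.close()
    return parser.links
```

Usage would be something like folder_links(open("links.txt").read(), "search"), where links.txt is the exported bookmarks file.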
You could have done it in Excel by now, or registered a website that parses Chrome bookmarks with PHP preg_match.
Is it just Chrome bookmarks? I'll make a damn site to stop reading this :) |
great posts guys ;)
|
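This is untested and you already have a lot of good examples, but I believe you have more options on the command line this way.
#!/usr/bin/perl
use strict;
use warnings;

die "Usage is $0 <start> <stop> <filename>\n" unless @ARGV == 3;
my ($start, $stop, $file) = @ARGV;
open(my $fh, '<', $file) or die "Could not open $file: $!\n";
my $sp = 0;    # set once the start pattern has been seen
while (my $line = <$fh>) {
    chomp $line;
    if (!$sp) {
        $sp = 1 if $line =~ /$start/;    # start printing at the start line
    }
    elsif ($line =~ /$stop/) {
        last;                            # first stop match after the start ends the block
    }
    next if $line =~ /<h2>/;             # skip h2 lines
    print "$line\n" if $sp;
}
close($fh);
|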