Best way to scrape this site?
https://bitinfocharts.com/top-100-ri...addresses.html

I tried to scrape it with Beautiful Soup and got a connection refused. I may copy the site locally, then scrape it with Beautiful Soup. I want to keep an updated list of 10,000 bitcoin addresses. I manually entered the top 600 bitcoin addresses into a text file and uploaded it to MySQL. I have my bitcoin collider working now; I decided that instead of searching every bitcoin address in the blockchain, I want to search a smaller database locally. This coding is so much fun.

Code:
#!/usr/bin/env python
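Since the connection was refused, one likely cause is the default Python user agent being blocked. A minimal sketch of the approach, using only the standard library and a forged browser User-Agent (the regex and the UA string are my assumptions, not from the original script):

```python
import re
import urllib.request

def fetch(url):
    # Forge a browser User-Agent; the bare urllib default is often refused.
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")

def extract_addresses(html):
    # Address links on this site look like /bitcoin/address/<address>
    return re.findall(r'/bitcoin/address/([A-Za-z0-9]+)', html)

if __name__ == "__main__":
    page = fetch("https://bitinfocharts.com/top-100-richest-bitcoin-addresses.html")
    print(extract_addresses(page)[:10])
```

If the site still refuses the connection, the block is probably on the IP rather than the user agent.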
Use phantomjs with a forged user agent and paginate in the script.
See sig.
PHP Advanced HTML DOM
Wow, somebody is pretty rich:
124,178 BTC ($251,612,773 USD)
Actually it was this fuckin' easy for my IP:

Code:
curl "https://bitinfocharts.com/top-100-richest-bitcoin-addresses.html" > bitcoin.html

this returned the first 100

Code:
curl "https://bitinfocharts.com/top-100-richest-bitcoin-addresses-2.html" > bitcoin-2.html

this returned the next 100

So you have a banned IP or user-agent.
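The pagination scheme above follows a simple pattern: page 1 has no suffix, later pages append "-N" before ".html". A small sketch generating those URLs in Python (the page count is whatever you need; the site layout is otherwise assumed unchanged):

```python
def page_url(n):
    # Page 1 has no suffix; pages 2 and up use "-N" before ".html".
    base = "https://bitinfocharts.com/top-100-richest-bitcoin-addresses"
    return base + ".html" if n == 1 else base + "-{}.html".format(n)

if __name__ == "__main__":
    for n in range(1, 4):
        print(page_url(n))
```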
Quote:
As thanks for your help, I have found a picture from your favorite TV show: http://i.imgur.com/1NpdOgZ.jpg
I wish I were a bitcoin :(
Quote:
What I'd recommend is that you use selenium with phantomjs. The reason is that with Python, as you have there, you can do all your writing, parsing, and database work within the script, and do whatever you need to do without scripting phantomjs in JavaScript.

Code:
from selenium import webdriver

driver = webdriver.PhantomJS("fill this in with a /path/to/phantomjs if not set")
driver.set_window_size(1120, 550)
driver.get("https://duckduckgo.com/")
driver.find_element_by_id("search_form_input_homepage").send_keys("realpython")
driver.find_element_by_id("search_button_homepage").click()
print(driver.current_url)
driver.quit()

The reason I'd use something like phantomjs, or selenium to control Firefox, is that the browser just takes care of it. If you use other libraries with Python you'll possibly run into small errors with https or other things. You can always test using selenium with Firefox so you can watch your browser do the work.
I've seen my posts here scraped onto other forums, word for word, video for video. It's been going on for years. There are clones of me making someone money 👁️👃👁️
Quote:
PHP is a great language (I don't care what anyone says) and works the easiest with most web servers, but Python in my experience is one of the best general languages: it's fast to complete a task and forces people toward convention. I love Python, listen to me!
This is ugly, but it works. It extracts bitcoin addresses from the page that has been downloaded. I could do better and add curl to the Python script to make it faster.
Code:
import sys
Put the above code in a loop for the files I download. Just run curl 80 times, once for each page. Run my file splitter and upload each file to SQL. http://i.imgur.com/Vh41yUx.png
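The loop described above might look something like this; a hedged reconstruction, since everything after `import sys` was lost from the thread (the 1.html..80.html filenames and the address regex are assumptions based on the posts):

```python
import re
import sys

def addresses_in_file(path):
    # Pull the address component out of /bitcoin/address/... links.
    with open(path, encoding="utf-8", errors="replace") as f:
        return re.findall(r'/bitcoin/address/([A-Za-z0-9]+)', f.read())

def main(n_pages=80):
    found = []
    for i in range(1, n_pages + 1):
        try:
            found.extend(addresses_in_file("{}.html".format(i)))
        except FileNotFoundError:
            break  # stop at the first missing page
    # One address per line, ready for the file splitter / SQL upload
    sys.stdout.write("\n".join(found) + "\n")

if __name__ == "__main__":
    main()
```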
step 1: curl the page and > save
step 2: one-liner to parse and save the data

Code:
sed 's/>/>\n/g' bitcoin2.html | egrep '/bitcoin/address/' | cut -d'/' -f6 | cut -d'"' -f1 | less

Code:
$ sed 's/>/>\n/g' bitcoin2.html | egrep '/bitcoin/address/' | cut -d'/' -f6 | cut -d'"' -f1 >> wallets.csv

then:

mysql> LOAD DATA LOCAL INFILE
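The LOAD DATA statement above was cut off in the thread. As a stand-in, here is the same bulk-load step done with the stdlib sqlite3 module instead of MySQL (the `wallets` table name and one-address-per-line file format are my assumptions):

```python
import sqlite3

def load_wallets(db_path, csv_path):
    # Bulk-load a one-address-per-line file into a wallets table,
    # ignoring duplicates so the load can be re-run safely.
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS wallets (address TEXT PRIMARY KEY)")
    with open(csv_path) as f:
        rows = [(line.strip(),) for line in f if line.strip()]
    conn.executemany("INSERT OR IGNORE INTO wallets VALUES (?)", rows)
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM wallets").fetchone()[0]
```

With MySQL the equivalent is the LOAD DATA LOCAL INFILE statement the post starts to show; the idea (one file, one bulk insert) is the same.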
Quote:
Process:
1. Use curl to download the first 80 pages, naming them 1.html, 2.html, 3.html and so on.
2. Run my program that will parse all 80 files.

Code:
import sys

Up arrow and change to 2.txt, and on and on. Thank you for your help, my friend. You are the Kirk to my Khan. http://i.imgur.com/nnzl3ty.gif
One mistake in the above code: where I add 1 to filecounter, the line should be moved over to the left (dedented one level).
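The indentation bug described above, sketched on hypothetical code (the thread's actual parser was not posted): incrementing `filecounter` inside the inner loop counts lines, while dedenting it one level counts files, which is what was intended.

```python
def count_files(files):
    filecounter = 0
    for f in files:
        for _line in f:
            pass                # per-line parsing would go here
        filecounter += 1        # dedented: incremented once per file, not per line
    return filecounter
```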
Kind of odd today: I notice the site is kind of messed up. I want to download the next 40 pages to add to my database. Maybe the site owner noticed me scraping every page up to 80? I need to find a new source for my database. And maybe I will use your sed command next.
Powered by vBulletin® Version 3.8.8
Copyright ©2000 - 2025, vBulletin Solutions, Inc.
©2000-, AI Media Network Inc