Selenium is a robust, high-level browser automation tool. It uses the WebDriver API, can drive just about any browser on any OS, and is compatible with many different testing frameworks. Selenium can be used for testing or for web scraping, and Google recently released a headless mode for Chrome that works with ChromeDriver. Up until now, the leader in this space was PhantomJS, which is used by companies such as Twitter and Netflix for testing purposes. But headless Chrome is much faster, more stable, and uses considerably less memory than PhantomJS.
To get started, the first thing we need is a version of Chrome that has the headless capabilities. Download Chrome if you don't already have it for some strange reason, and download the ChromeDriver binary for your system as well. If you're on Linux, Chrome should be in /usr/bin/ as google-chrome-stable. I like to keep the chromedriver binary separate in /usr/local/bin. This will be important when we switch between the regular and headless version.
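Before going further, it can help to confirm where both binaries actually live. Here is a minimal sketch using only the standard library; the function name find_browser_binaries and the fallback name google-chrome are my own assumptions, so adjust them for your system.

```python
import shutil


def find_browser_binaries(chrome_names=("google-chrome-stable", "google-chrome"),
                          driver_name="chromedriver"):
    """Return the first Chrome binary and the chromedriver found on PATH (or None)."""
    # The candidate names are assumptions; add whatever your distro calls Chrome.
    chrome = next((path for path in map(shutil.which, chrome_names) if path), None)
    return {"chrome": chrome, "chromedriver": shutil.which(driver_name)}


print(find_browser_binaries())
```

If either value comes back as None, that binary is not on your PATH, and you will need to pass its full path to Selenium yourself.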
First let's make a directory for the project to be in. You can create a virtualenv if you want to, but for the sake of this article, I won't be. Then install selenium.
$ mkdir -p ~/scraping/twitter
$ cd ~/scraping/twitter
# If you want to use a virtualenv
$ virtualenv env
$ . env/bin/activate
$ pip install selenium
Let's open our text editor and start writing the configuration. I use Visual Studio Code, but you can use whatever you want. Atom and Sublime are two other really good ones, but I really recommend VSC: the extensions are great and easy to use, it's open source, etc. First, let's leave headless out of this. We'll get to it in a minute, but doing it without headless first is a good way to show you what the program is actually doing. And who doesn't like seeing a computer open up browsers and type things by itself? If you don't think that's cool, you're lying.
We're going to import Selenium and get the driver location:
from selenium import webdriver

driver = webdriver.Chrome('/usr/local/bin/chromedriver')
First, we need to tell our program to go to a website of our choosing; in this case we want https://twitter.com/login. Remember to include the https:// or http:// prefix, or it won't work:
driver.get('https://twitter.com/login')
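If your URLs come from a list or from user input, a tiny helper can guard against a missing scheme. This is only a sketch; ensure_scheme is a hypothetical name of mine, not part of Selenium.

```python
from urllib.parse import urlparse


def ensure_scheme(url, default="https"):
    """Prefix the URL with a scheme if it does not already start with http or https."""
    if urlparse(url).scheme in ("http", "https"):
        return url
    # Assumption: anything without an http(s) scheme is a bare host/path.
    return f"{default}://{url}"
```

You would then call driver.get(ensure_scheme('twitter.com/login')) instead of passing the raw string.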
Now we're going to need to find the selectors for the username and password boxes. You can use just about any kind of selector with Selenium: CSS, ID, name, etc. But for now I'm going to stick to CSS.
Open the site you want to log into, open the developer tools, and select the input boxes. Twitter's username selector is actually pretty long for some reason.
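All of these selector types go through the same generic call, find_element(by, value), where the first argument names the strategy. As far as I know, newer Selenium releases (4.x) expect this form and have dropped the find_element_by_* shortcuts. The sketch below uses the plain string values that Selenium's By class defines; the LOCATORS table and the find helper are hypothetical names of mine.

```python
# String values behind selenium.webdriver.common.by.By
# (By.CSS_SELECTOR, By.ID and By.NAME respectively).
LOCATORS = {"css": "css selector", "id": "id", "name": "name"}


def find(driver, kind, value):
    """Look up one element through the generic find_element(by, value) call."""
    return driver.find_element(LOCATORS[kind], value)
```

For example, find(driver, 'css', '.js-password-field') would look up the password box by its CSS class.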
Great, so now we're going to use driver.find_element_by_css_selector() and .send_keys() to make Selenium log in for us. Also, we can add driver.quit() at the end to have the browser close automatically. Try it once with and once without to see how everything works and looks. If you do use it, you'll be surprised by how quickly it logs you in and then flashes closed.
userfield = driver.find_element_by_css_selector('.js-username-field.email-input.js-initial-focus')
userfield.send_keys('JustinFormentin')
passwordfield = driver.find_element_by_css_selector('.js-password-field')
passwordfield.send_keys('hunter2')
passwordfield.submit()
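The snippet above hard-codes the username and password, which is fine for a demo but not something you want to commit. One option is to read them from environment variables. This is a sketch; the variable names TWITTER_USERNAME and TWITTER_PASSWORD and the helper read_credentials are my own inventions.

```python
import os


def read_credentials(env=os.environ):
    """Return (username, password) from environment variables, or raise if one is missing."""
    try:
        # Hypothetical variable names; pick whatever suits your setup.
        return env["TWITTER_USERNAME"], env["TWITTER_PASSWORD"]
    except KeyError as missing:
        raise RuntimeError(f"Set {missing.args[0]} before running the script") from None
```

With that in place, the two send_keys() calls can take the values returned by read_credentials() instead of string literals.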
See how easy that was? 9 lines of code to log you in. Now let's move on to the headless version, and add a few more things.
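Comment out or delete the driver line at the top. We're going to point Selenium at the Chrome binary instead of passing the chromedriver path, and then add the headless argument. With no path given, Selenium looks for chromedriver on your PATH, which is why keeping it in /usr/local/bin helps.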
from selenium import webdriver

#driver = webdriver.Chrome('/usr/local/bin/chromedriver')
options = webdriver.ChromeOptions()
options.binary_location = '/usr/bin/google-chrome-stable'
options.add_argument('headless')
driver = webdriver.Chrome(chrome_options=options)
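Since we'll be flipping between the regular and headless browser, it can be handy to build the argument list in one place. The sketch below is one way to do it. The chrome_arguments helper and the default window size are assumptions of mine, while --headless and --window-size are regular Chrome command-line switches.

```python
def chrome_arguments(headless=True, window_size=(1280, 1024)):
    """Build the list of command-line switches to hand to options.add_argument()."""
    width, height = window_size
    # A fixed window size keeps headless screenshots a predictable size.
    args = [f"--window-size={width},{height}"]
    if headless:
        args.insert(0, "--headless")
    return args
```

Each entry can then be passed to options.add_argument() in a loop, and switching back to a visible browser is a matter of calling chrome_arguments(headless=False).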
So you see we got rid of the chromedriver path and added four lines that let us use the headless version of Chrome. Now we're going to test this out and get some proof it's working, even though we can't see the browser. After passwordfield.submit(), we're going to go to the page that lists all the people I'm following, scroll down to load more, and then take a screenshot as proof.
driver.get('https://twitter.com/JustinFormentin/following')
driver.execute_script("window.scrollTo(0, 10000);")
driver.get_screenshot_as_file('followers-page.png')
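A single scrollTo(0, 10000) only loads whatever fits in that one jump. If you want to keep loading until the page stops growing, a loop like the following might do. It is a rough sketch: scroll_to_bottom, max_rounds and pause are names I made up, and the fixed sleep is a crude stand-in for a proper wait.

```python
import time


def scroll_to_bottom(driver, max_rounds=10, pause=1.0):
    """Scroll down repeatedly until the page height stops growing."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # crude wait for lazily loaded content
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so we are at the bottom
        last_height = new_height
    return last_height
```

The max_rounds cap is there so an endlessly growing feed cannot keep the script running forever.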
We run it, we wait for it to finish, and then if we look in our directory, we can see that we now have a screenshot!
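If the script runs unattended, you may want it to check the file instead of you eyeballing the directory. A minimal sketch, assuming the screenshot is saved as a PNG; is_png is a hypothetical helper that only checks the file's first eight bytes.

```python
from pathlib import Path

# Every PNG file starts with this fixed 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def is_png(path):
    """Return True if the file exists and starts with the PNG signature."""
    file = Path(path)
    if not file.is_file():
        return False
    with file.open("rb") as handle:
        return handle.read(8) == PNG_SIGNATURE
```

Calling is_png('followers-page.png') after the run tells you whether a real image was written, though not whether it shows the page you expected.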
Yes, it's a pretty silly screenshot, but it proves that the script could log in, navigate to my following page, scroll down, and take a screenshot. The full code should now look like this:
from selenium import webdriver

options = webdriver.ChromeOptions()
options.binary_location = '/usr/bin/google-chrome-stable'
options.add_argument('headless')
driver = webdriver.Chrome(chrome_options=options)

driver.get('https://twitter.com/login')
userfield = driver.find_element_by_css_selector('.js-username-field.email-input.js-initial-focus')
userfield.send_keys('JustinFormentin')
passwordfield = driver.find_element_by_css_selector('.js-password-field')
passwordfield.send_keys('hunter2')
passwordfield.submit()

driver.get('https://twitter.com/JustinFormentin/following')
driver.execute_script("window.scrollTo(0, 10000);")
driver.get_screenshot_as_file('followers-page.png')
driver.quit()
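One thing to watch with headless runs: if any step raises an exception, driver.quit() never executes and the invisible Chrome process keeps running. A small context manager is one way around that. This is a sketch, and quitting is a name I picked, not something Selenium ships.

```python
from contextlib import contextmanager


@contextmanager
def quitting(driver):
    """Yield the driver and always call driver.quit() afterwards, even on errors."""
    try:
        yield driver
    finally:
        driver.quit()
```

The body of the script would then sit inside a block such as with quitting(webdriver.Chrome(chrome_options=options)) as driver:, and the explicit driver.quit() at the end becomes unnecessary.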