Scraping Hacker News using scrapex.ai

Goal:

To scrape all the posts from the first 10 pages of https://news.ycombinator.com/ according to the following specification.

How to scrape hackernews

Specifications

Tags      DataType  Extractor
Title     Text      Text
Points    Text      Text
Comments  Integer   Text
URL       Text      Prop

Sample Data

POST_TITLE                                AUTHOR_URL                                        COMMENTS  POINTS
About Offline First                       https://news.ycombinator.com/user?id=thunderbong  84        131 points
Atari ST in daily use since 1985 [video]  https://news.ycombinator.com/user?id=pmarin       37        98 points
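
For readers who plan to consume the scraped data in code, the target record can be written down as a small Python type. This is only a sketch of the specification above: the class name Post and the field names are illustrative choices taken from the sample column headers, not something scrapex.ai defines.

```python
from dataclasses import dataclass


@dataclass
class Post:
    """One scraped post, mirroring the specification table (illustrative only)."""
    post_title: str  # Text extractor, Text datatype
    author_url: str  # Prop extractor (href), Text datatype
    comments: int    # Text extractor, Integer datatype
    points: str      # Text extractor, Text datatype, e.g. "131 points"


# The two sample rows shown above, expressed as records.
SAMPLE = [
    Post("About Offline First",
         "https://news.ycombinator.com/user?id=thunderbong", 84, "131 points"),
    Post("Atari ST in daily use since 1985 [video]",
         "https://news.ycombinator.com/user?id=pmarin", 37, "98 points"),
]
```

Note that comments is the only numeric field; points stays as text because the specification gives it the Text datatype.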

Tutorial:

Step 1: Create a new project

Creating a project

  1. Head over to https://app.scrapex.ai/ and click on the New Project button
  2. Enter a project name and enter https://news.ycombinator.com/ as the URL
  3. Leave the proxy settings unchanged and click on the Create button.

Step 2: Configure the scraper

Creating a project

  1. Click on the scraper titled Default Scraper to open up the Configurator

Configuring a Scraper

  2. The Configurator opens with the Hacker News website loaded in the cloud browser. The panel on the left contains five tools (DOM Single, DOM Multiple, DOM Nested, Builtin and Meta) and a preview button.

Configuring a Scraper

  3. As we are looking to obtain data in a grouped manner, DOM Nested is the right tool for this job.

    • Click on the DOM Nested tool (3rd tool in the left toolbar)
    • In the popup that opens up, type in the group name as “posts”
    • Click the confirm button to create a group.
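
To see why grouping matters, the following sketch compares flat extraction with grouped extraction in plain Python. It is not how scrapex.ai works internally; the markup is a simplified, hypothetical stand-in for the real page, and the function names extract_flat and extract_nested are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical markup. The real Hacker News page is laid out differently.
HTML = """
<div>
  <div class="post"><a class="title">About Offline First</a><span class="score">131 points</span></div>
  <div class="post"><a class="title">Atari ST in daily use since 1985 [video]</a></div>
</div>
"""


def extract_flat(root, selectors):
    """One independent list per tag: values lose their link to a post."""
    return {name: [el.text for el in root.findall(path)]
            for name, path in selectors.items()}


def extract_nested(root, group_path, selectors):
    """One record per group element; a missing field becomes None."""
    records = []
    for group in root.findall(group_path):
        record = {}
        for name, path in selectors.items():
            el = group.find(path)
            record[name] = el.text if el is not None else None
        records.append(record)
    return records


root = ET.fromstring(HTML)
flat = extract_flat(root, {"post_title": ".//a[@class='title']",
                           "points": ".//span[@class='score']"})
posts = extract_nested(root, ".//div[@class='post']",
                       {"post_title": "a[@class='title']",
                        "points": "span[@class='score']"})
```

The second post has no score. In the flat result the two lists have different lengths, so titles and points can no longer be paired reliably, while the grouped result keeps one record per post and marks the missing value.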

Using DOM Nested

  4. A temporary Untitled DOM Nested tag is created and displayed on the right panel. As shown in the above figure, click on the post title.

Using DOM Nested

  5. As soon as you click on the post title, all similar titles are selected automatically.

Making a selection

Renaming a tag

  6. Let us now rename the tag to give it a descriptive name.
    • Click on the options icon situated on the right side of the tag name
    • Click on the edit option.
    • Enter the tag name post_title and click Save

Using DOM Nested

  7. Let us now create another tag to extract the points associated with each post.
    • Click on the DOM Nested tool again
    • In the popup, select posts from the dropdown in the Select Existing Collection Section.
    • This would automatically create a new temporary tag under the posts group.

Using DOM Nested

  8. Rename the tag to points. Click on one of the points values in the cloud browser, as shown in the above figure. All similar elements are selected automatically.

Using DOM Nested

  9. As the next task, let us extract the number of comments associated with each post. Repeat step 7 to create an Untitled DOM Nested tag. Now click on the comments link of a post, as illustrated in the above figure.

Using DOM Nested

  10. As you will have noticed, extraneous elements were selected along with the comments. This is because the algorithm tries to predict the most general CSS selector from the inputs given by the user. Let us provide more inputs to refine the prediction.

    • Click X on the most irrelevant element. In the above example, we have chosen to click X on the login button. Immediately, the number of highlighted elements decreases.

Improving selector prediction

  • Once more, as the above screenshot illustrates, click X to remove the author element, as it is irrelevant.

Improving selector prediction

  • The prediction has improved considerably. Make one more click on the remaining extraneous element, as illustrated above.

Improving selector prediction

  • This is how we can iteratively refine the prediction. Rename this tag to comments.
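
The refinement loop can be pictured with a toy model. The sketch below is not the scrapex.ai algorithm, which is not documented here; it only illustrates the general idea of picking the most general selector that still covers every clicked element and none of the rejected ones. The Element type, the class names and the two-level selector vocabulary (tag, then tag.class) are all assumptions made for the example.

```python
from typing import List, NamedTuple, Optional


class Element(NamedTuple):
    tag: str
    cls: str   # a single class name, "" if the element has none
    text: str


def candidates(el: Element) -> List[str]:
    """Selectors that match `el`, ordered from most general to most specific."""
    out = [el.tag]
    if el.cls:
        out.append(f"{el.tag}.{el.cls}")
    return out


def matches(selector: str, el: Element) -> bool:
    tag, _, cls = selector.partition(".")
    return el.tag == tag and (not cls or el.cls == cls)


def predict(clicked: List[Element], rejected: List[Element]) -> Optional[str]:
    """Most general selector covering every clicked element and no rejected one."""
    for selector in candidates(clicked[0]):
        if all(matches(selector, e) for e in clicked) and \
           not any(matches(selector, e) for e in rejected):
            return selector
    return None


def select(page: List[Element], selector: str) -> List[str]:
    """Texts of all elements on the page that the selector highlights."""
    return [e.text for e in page if matches(selector, e)]


# Hypothetical page: a login link, an author link and two comments links.
page = [
    Element("a", "login", "login"),
    Element("a", "hnuser", "thunderbong"),
    Element("a", "comments", "84 comments"),
    Element("a", "comments", "37 comments"),
]

first_guess = predict([page[2]], [])       # one click on a comments link
refined = predict([page[2]], [page[0]])    # the same click plus X on the login link
```

With a single click the toy model settles on the bare tag and highlights all four links. Rejecting the login link rules that selector out, so the next, more specific candidate is chosen and only the two comments links remain highlighted.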

Improving selector prediction

  11. Click on the preview button in the bottom left corner. A popup opens with the data preview for the posts collection.

Configurator Preview

  12. Let us now introduce another constraint: the number of comments associated with each post has to be extracted as a number rather than as text, i.e. 100 rather than "100 comments". This can be accomplished by changing the DataType from Text to Integer.

    • Click on the options button on the black bar on the right panel.
    • Click on the DataType dropdown and select Integer.
    • Click confirm.
    • Comments are now extracted as integers rather than as text.
    • Click on the preview button to view the extracted data.
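
For comparison, the same text-to-integer conversion can be done by hand when post-processing scraped text. The helper below is a minimal sketch; the name to_integer is hypothetical, and how scrapex.ai performs the conversion internally is not stated in this tutorial.

```python
import re
from typing import Optional


def to_integer(text: str) -> Optional[int]:
    """Return the first run of digits in `text` as an int, or None if there is none."""
    # Drop thousands separators so that "1,024 comments" yields 1024.
    match = re.search(r"\d+", text.replace(",", ""))
    return int(match.group()) if match else None


count = to_integer("100 comments")   # 100
missing = to_integer("discuss")      # None, since the text holds no digits
```

Returning None for text without digits keeps a post with no comment count distinguishable from a post with zero comments.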

Improving selector prediction

  13. The last task is to extract the author's profile URL. Repeat step 7 to create an Untitled DOM Nested tag. Rename this tag to author_url. Click on the author name as illustrated in the above image.

Improving selector prediction

  14. All authors are selected automatically. However, the information we are interested in is the href property of the author tag rather than the text itself.

Improving selector prediction

  15. This can be accomplished by changing the extractor type to Prop and setting the attribute to href. Once again:

    • Click on the options button on the black bar on the right panel.
    • Click on the Extractor dropdown and select Prop
    • Enter href in the Attributes Input box.
    • Click confirm.
    • Click on the preview button to view the extracted data.
    • Click on the submit button in the bottom right corner to save the configuration.
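
The difference between the Text and Prop extractors can be sketched in a few lines of Python. The anchor markup below is a hypothetical example of an author link, and extract_text and extract_prop are illustrative names, not scrapex.ai functions. The sample data lists absolute profile URLs, which is what a browser reports for a link's href property, so the sketch resolves the attribute value against the page URL to match.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

PAGE_URL = "https://news.ycombinator.com/"

# Hypothetical markup for one author link.
link = ET.fromstring('<a href="user?id=thunderbong" class="hnuser">thunderbong</a>')


def extract_text(el):
    """Text extractor: the visible text of the element."""
    return (el.text or "").strip()


def extract_prop(el, attribute, base_url=PAGE_URL):
    """Prop extractor: the value of an attribute, with URLs made absolute."""
    value = el.get(attribute)
    if value is None:
        return None
    # Resolve relative links against the page they were found on.
    return urljoin(base_url, value) if attribute in ("href", "src") else value


author_name = extract_text(link)
author_url = extract_prop(link, "href")
```

Here author_name holds the link text, while author_url holds the resolved profile URL, which is the value the author_url tag is meant to capture.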

Improving selector prediction

Step 4: Extracting data from more URLs

Improving selector prediction

  1. On clicking submit, we are redirected to the scrapers dashboard page. The actual scraping and extraction happen in the background, and the extracted data is displayed as shown in the below image.

Improving selector prediction

  2. Let us now see how to add more URLs and extract data from them.
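
Since the goal is the first 10 pages, the list of URLs to add can be generated rather than typed by hand. The sketch below assumes that the listing pages follow the news?p=N pattern used by the site's More link; the helper name page_urls is hypothetical, and how the URLs are then entered into scrapex.ai is not covered here.

```python
from typing import List

BASE_URL = "https://news.ycombinator.com/"


def page_urls(pages: int = 10) -> List[str]:
    """URLs of the first `pages` listing pages; page 1 is the front page itself."""
    # Assumption: later pages are addressed as news?p=2, news?p=3, and so on.
    return [BASE_URL if n == 1 else f"{BASE_URL}news?p={n}"
            for n in range(1, pages + 1)]


urls = page_urls()
for url in urls:
    print(url)
```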

Content - To be updated.

Improving selector prediction

Step 5: Modifying existing configuration

Content - To be updated.

Step 6: Programmatically extracting the next page using Scripts

Content - To be updated.