Blog Entry 9 years, 6 months ago

Django Scraper 0.3.8 - Application for scraping online content

There are already a number of solutions for scraping online web data with Python, or you can simply write a Python script to do the work. I started developing and maintaining django-scraper out of my own needs, and I keep improving it so that it serves a wider range of requirements. In brief, given a starting page, it can crawl other linked sources and then scrape/extract almost any data present in those pages (including text, HTML, images, or other media files). The output is represented as JSON; for complex results that include other downloaded files, everything can be placed inside a Zip file.

Django Scraper on GitHub: https://github.com/zniper/django-scraper

Because of my consistent laziness, I copied this below from its repo README file.

Features

  • Extract content from given websites/pages and store it as JSON data
  • Crawl, then extract content from multiple pages, up to a given depth
  • Download media files present in a page
  • Option to store data in a ZIP file
  • Support for standard file system and AWS S3 storage
  • Customisable crawling requests for different scenarios
  • Process can be started from a Django management command (e.g. as a cron job) or from Python code
  • Support for extracting multiple kinds of content (text, HTML, images, binary files) from the same page
  • Content refinement (replacement) rules and black-word filtering
  • Support for custom proxy servers and user agents
  • Support for Django 1.6, 1.7, and 1.8

Samples

This is a sample result from scraping https://news.ycombinator.com/ask: ZIP Result

Installation

This application requires some other tools to be installed first:

lxml
requests

django-scraper can be installed using pip:

$ pip install django-scraper

To use django-scraper, add it to the installed applications in your Django settings:

INSTALLED_APPS = (
    ...
    'scraper',
)

If South is present in the current Django project, use the migrate command to create the database tables:

$ python manage.py migrate scraper

Otherwise, use the standard syncdb command:

$ python manage.py syncdb

Configuration

Two important configuration values should also be added to the settings.py file:

SCRAPER_CRAWL_ROOT = '/path/to/local/storage'
SCRAPER_TEMP_DIR = '/your_temp_dir/'

Some optional settings are:

SCRAPER_COMPRESS_RESULT = True # or False

When SCRAPER_COMPRESS_RESULT is set to True, the application compresses the crawled data and stores it in a Zip file.
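
To give an idea of what a compressed result amounts to, here is a rough sketch using only the standard library. It is not django-scraper's actual code: the function name pack_result, the archive member name index.json, and the sample data are all assumptions made for illustration.

```python
import io
import json
import zipfile


def pack_result(data, files):
    """Bundle a JSON result and downloaded files into one in-memory Zip archive.

    `data` is any JSON-serialisable object; `files` maps archive names to bytes.
    Both the function and the "index.json" name are hypothetical.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # The extracted content goes in as a JSON document...
        archive.writestr("index.json", json.dumps(data, indent=2))
        # ...and every downloaded media file sits next to it.
        for name, content in files.items():
            archive.writestr(name, content)
    return buffer.getvalue()


zip_bytes = pack_result(
    {"title": "Ask HN", "images": ["logo.gif"]},
    {"logo.gif": b"GIF89a"},
)

# Read the archive back to check what ended up inside.
with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
    names = sorted(archive.namelist())
    restored = json.loads(archive.read("index.json").decode("utf-8"))
```

The point is only that one archive can carry both the JSON description and the files it refers to; the real layout inside django-scraper's Zip files may differ.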

SCRAPER_NO_TASK_ID_PREFIX = 'any-prefix'

This is a custom value that is added to the front of the task ID (or download location) of each crawled result.
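
Put together, the four settings could sit in settings.py roughly as follows. This is only a sketch: the setting names come from the README, but DATA_ROOT and the directory names are placeholders I made up, so point them at whatever locations suit your project.

```python
import os
import tempfile

# Hypothetical base directory; any location writable by the Django process works.
DATA_ROOT = os.path.join(os.path.expanduser("~"), "scraper-data")

# Required: local storage for crawled results, and a temporary directory.
SCRAPER_CRAWL_ROOT = os.path.join(DATA_ROOT, "crawl")
SCRAPER_TEMP_DIR = os.path.join(tempfile.gettempdir(), "scraper")

# Optional: store each crawled result in a Zip file.
SCRAPER_COMPRESS_RESULT = True

# Optional: custom value added to the front of each task ID / download location.
SCRAPER_NO_TASK_ID_PREFIX = "any-prefix"
```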

Model description

Spider: Crawls from one page to another to collect links, then extracts data from those pages using the methods of its Collectors.

url - URL to the start page of source (website, entry list,...)
name - Name of the crawl session
target_links - List of XPaths to links pointing to pages that have content to be grabbed (entries, articles,...)
expand_links - List of XPaths to links pointing to pages that contain target pages. This relates to the crawl_depth value.
crawl_depth - Maximum depth of the scraping session. This relates to the expand_links rules.
crawl_root - Option to extract the starting page as well, or to bypass it
collectors - List of collectors that will extract data from target pages
proxy - Proxy server to be used when crawling the current source
user_agent - User-Agent value set in the header of every request
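
The interplay between target_links, expand_links and crawl_depth is easier to see in a toy example. The sketch below is my own illustration, not the Spider implementation: it walks a made-up in-memory "website" with the standard library's ElementTree (which supports only a subset of XPath and needs well-formed markup, unlike lxml), and it assumes that a depth of 1 means "the start page only". PAGES, find_links and collect_targets are hypothetical names.

```python
import xml.etree.ElementTree as ET

# A tiny in-memory "website": URL -> well-formed markup. Entirely made up.
PAGES = {
    "/list/1": "<html><body>"
               "<a class='entry' href='/post/a'>A</a>"
               "<a class='entry' href='/post/b'>B</a>"
               "<a class='next' href='/list/2'>next</a>"
               "</body></html>",
    "/list/2": "<html><body>"
               "<a class='entry' href='/post/c'>C</a>"
               "<a class='next' href='/list/3'>next</a>"
               "</body></html>",
    "/list/3": "<html><body>"
               "<a class='entry' href='/post/d'>D</a>"
               "</body></html>",
}


def find_links(markup, paths):
    """Return the href values of all elements matched by any of the paths."""
    root = ET.fromstring(markup)
    links = []
    for path in paths:
        for element in root.findall(path):
            href = element.get("href")
            if href and href not in links:
                links.append(href)
    return links


def collect_targets(start_url, target_links, expand_links, crawl_depth):
    """Breadth-first walk: gather target links, follow expand links level by level."""
    targets, seen = [], {start_url}
    level = [start_url]
    for _ in range(crawl_depth):
        next_level = []
        for url in level:
            markup = PAGES.get(url)
            if markup is None:
                continue
            # Pages holding the wanted content.
            for link in find_links(markup, target_links):
                if link not in targets:
                    targets.append(link)
            # Pages that only lead to more target pages.
            for link in find_links(markup, expand_links):
                if link not in seen:
                    seen.add(link)
                    next_level.append(link)
        level = next_level
    return targets


TARGET = [".//a[@class='entry']"]
EXPAND = [".//a[@class='next']"]
shallow = collect_targets("/list/1", TARGET, EXPAND, 1)
deep = collect_targets("/list/1", TARGET, EXPAND, 3)
```

With a depth of 1 only the two entries on the first list page are found; with a depth of 3 the walk follows the "next" links twice and picks up all four.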

Collector: Contains a list of Selectors which extract data from a given page. It also has replace rules, black words, and an option for downloading images.

name - Name of the collector
get_image - Option to download all images present in the extracted HTML content
selectors - List of selectors pointing to the data portions
replace_rules - List of regular expressions applied to the content to remove redundant data. Example: [('<br/?>', ''), ('&nbsp;', '')]
black_words - Set of words separated by commas. A page will not be downloaded if it contains one of those words.
proxy - Proxy server to be used when crawling the current source
user_agent - User-Agent value set in the header of every request
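
As an illustration of what replace_rules and black_words do, here is a small stand-alone sketch. The rule list is the example from the README; the two helper functions are hypothetical, and treating black words as case-insensitive substring matches is my assumption rather than documented behaviour.

```python
import re


def apply_replace_rules(content, replace_rules):
    """Apply each (pattern, replacement) pair in order using re.sub."""
    for pattern, replacement in replace_rules:
        content = re.sub(pattern, replacement, content)
    return content


def has_black_word(content, black_words):
    """Return True if any comma-separated word occurs in the content.

    Matching is case-insensitive and substring-based (an assumption).
    """
    words = [word.strip().lower() for word in black_words.split(",") if word.strip()]
    lowered = content.lower()
    return any(word in lowered for word in words)


rules = [("<br/?>", ""), ("&nbsp;", "")]
cleaned = apply_replace_rules("Hello<br/>&nbsp;world<br>!", rules)

blocked = has_black_word("Sponsored post about casinos", "casino, lottery")
allowed = has_black_word("A post about Django", "casino, lottery")
```

Here cleaned ends up as "Helloworld!", since both forms of the line break tag and the non-breaking space entity are stripped; the first page would be skipped and the second one kept.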

Selector: Definition of a single data portion, which contains a key (name), an XPath to the wanted content, and a data type.

key - Name of the selector
xpath - XPath to the content to be extracted
data_type - Type of the content, which can be: text, html, binary
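
To make the key / xpath / data_type triple concrete, the sketch below runs two selectors over a made-up page. It is only an approximation of the idea: it uses the standard library's ElementTree (limited XPath, well-formed markup only) instead of lxml, it covers just the text and html types, and PAGE, SELECTORS and extract are names invented for this example.

```python
import xml.etree.ElementTree as ET

# A made-up, well-formed page.
PAGE = (
    "<html><body>"
    "<h1 id='title'>Hello <em>world</em></h1>"
    "<div id='body'><p>First paragraph.</p></div>"
    "</body></html>"
)

# Each selector names a data portion, where to find it, and how to return it.
SELECTORS = [
    {"key": "title", "xpath": ".//h1[@id='title']", "data_type": "text"},
    {"key": "body", "xpath": ".//div[@id='body']", "data_type": "html"},
]


def extract(markup, selectors):
    """Return a dict mapping each selector key to the extracted content."""
    root = ET.fromstring(markup)
    result = {}
    for selector in selectors:
        element = root.find(selector["xpath"])
        if element is None:
            # Nothing matched this selector on the page.
            result[selector["key"]] = None
        elif selector["data_type"] == "text":
            # Text only: concatenate all text nodes below the element.
            result[selector["key"]] = "".join(element.itertext()).strip()
        elif selector["data_type"] == "html":
            # Markup: serialise the matched element itself.
            result[selector["key"]] = ET.tostring(element, encoding="unicode")
        else:
            raise ValueError("unsupported data_type: %s" % selector["data_type"])
    return result


extracted = extract(PAGE, SELECTORS)
```

The result is a plain dict keyed by selector name, which is the kind of structure that serialises naturally to the JSON output mentioned earlier.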

Usage

a_collector.get_content('http://....')
a_spider.crawl_content()

Alternatively, from the console, run the run_scraper management command:

$ python manage.py run_scraper

With this command, all active spiders in the current Django instance are processed one after another.

In conclusion, if you have any trouble using the tool, don't hesitate to let me know. It's always great to have feedback from real users. Any contributions, issue reports, and pull requests on the repo are always welcome.
