Name: Jesús Manuel Mager Hois
University: National Autonomous University of Mexico
Major: Computer Science
Expected graduation date: 2012
Degree: undergraduate
Home Page/Blog: http://www.h1n1-al.blogspot.com/
Email: fongog@gmail.com
Telephone: +525553600709
Freenode IRC handle: jmagerh
Pyder is a web crawler project that aims to be complete, flexible, multi-threaded, and distributed, using the Pydra and Django frameworks. The idea is to make a web crawler integrated with Lucene that is completely flexible in content extraction, file-format parsing, and clustering. Pyder will be written entirely in Python to be easy to maintain.
About Me
My name is Jesús Manuel Mager Hois, and I am a computer science student at the UNAM (National Autonomous University of Mexico). I have been a Tux4Kids[1] contributor since 2008, was a successful GSoC student (the Factoroids activity) in 2008, and am a hobby game programmer. My favorite language is definitely Python, combined with C. Other languages I use on a regular basis are C (mainly GNU C), C++, Lua, Java, and PHP, and I have experience with Git, SVN, gettext, autotools, regular expressions, etc. For my work I use GNU/Linux on the Ubuntu distribution.
Web crawlers are a topic that interests me a lot. A few months ago I wrote a very basic web crawler in Python, based on BeautifulSoup, to extract information from MySpace[2], and I plan to write my bachelor's thesis about crawlers. I hope to learn a great deal from this project, and from the whole team working on Pydra!
The Plan
I want to write a web crawler that uses Pydra to run as parallel tasks, so it can scale to cloud installations. But why write a new crawler when we can find plenty of them on the Internet? Many of the free or open source web crawlers don't take advantage of parallel power. Nutch[4] is currently the most advanced web crawler, but it lacks a good administration interface and regular maintenance. Also, a crawler written in Python will be easier to maintain. Additionally, I will build an administration interface that uses the power of Django. All the information should be inserted into Lucene, but should be structured so that it can also be serialized in other ways.
I know that a new project is a big thing to code, but maintaining it is just as big an issue. Of course I will participate in the maintenance of the end result. ;)
Structure
In the library we will need:
admin/ Crawl control code. Used mainly to interact with the administration web interface.
protocol/
Handling of different local, network, and Internet protocols. The main aim is to get the base class ready and to write the HTTP implementation. The idea here is to make Pyder ready to expand to other purposes.
ftp, http, https, file, irc, ...
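The protocol layer described above could be sketched as an abstract base class plus one concrete handler per scheme. This is only an illustration of the planned design, not Pyder code; the class and registry names (Protocol, HttpProtocol, HANDLERS) are hypothetical.

```python
from abc import ABC, abstractmethod
from urllib.request import urlopen


class Protocol(ABC):
    """Base class every protocol handler (http, ftp, file, ...) derives from."""

    scheme = None  # e.g. "http"

    @abstractmethod
    def fetch(self, url):
        """Return the raw bytes found at *url*."""


class HttpProtocol(Protocol):
    scheme = "http"

    def fetch(self, url):
        # Standard-library fetch; a real crawler would add timeouts,
        # redirect handling, and politeness delays here.
        with urlopen(url) as response:
            return response.read()


# A registry mapping URL schemes to handlers, so the search code can
# pick the right protocol object for each queued URL.
HANDLERS = {cls.scheme: cls for cls in (HttpProtocol,)}
```

New schemes (ftp, file, irc) would then be added just by subclassing Protocol and registering the class.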
search/
The basic download routines, the robots.txt parser, and handling of the main queue.
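The two core pieces of this module could look like the sketch below: a duplicate-suppressing FIFO queue, and a robots.txt check built on Python's standard-library robot parser. The names (CrawlQueue, allowed) are illustrative assumptions, not the final API.

```python
from collections import deque
from urllib.robotparser import RobotFileParser


class CrawlQueue:
    """Main URL queue: FIFO order, with duplicate suppression."""

    def __init__(self, seeds):
        self._queue = deque(seeds)
        self._seen = set(seeds)

    def push(self, url):
        # A URL is queued at most once over the crawl's lifetime.
        if url not in self._seen:
            self._seen.add(url)
            self._queue.append(url)

    def pop(self):
        return self._queue.popleft()

    def __len__(self):
        return len(self._queue)


def allowed(robots_txt, url, agent="pyder"):
    """Check *url* against an already-downloaded robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)
```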
urlfilter/
Link restrictions and a link depth restriction.
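A minimal sketch of such a filter, assuming it receives each candidate link together with its depth in the crawl tree; the class name and parameters are hypothetical:

```python
from urllib.parse import urlparse


class UrlFilter:
    """Reject links outside the allowed domains or beyond the maximum depth."""

    def __init__(self, allowed_domains=None, max_depth=5):
        self.allowed_domains = allowed_domains  # None means "any domain"
        self.max_depth = max_depth

    def accept(self, url, depth):
        if depth > self.max_depth:
            return False
        if self.allowed_domains is None:
            return True
        host = urlparse(url).netloc
        # Accept the domain itself and any of its subdomains.
        return any(host == d or host.endswith("." + d)
                   for d in self.allowed_domains)
```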
indexer/
Routines to identify fields in the text (using parse) and interact with Lucene to store them.
Serialize the data for ranking and queueing.
parse/
This class will try to be compatible with many different document formats. The base class will include the basic routines to be accessible from outside. The default parsers will support plain text and HTML.
Other formats the base class will aim to support are XML, OpenDocument (OpenOffice.org), Microsoft Office (Word, Excel, PowerPoint), PDF, JavaScript, RSS, RTF, MP3 (ID3 tags), and JSON.
Content-specific parsers: the possibility of making plugins that parse specific websites, for example MySpace, interfaces with the Facebook API, hi5, LinkedIn, and other important portals.
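The base class and the two default parsers could look like the sketch below. It uses the standard-library html.parser purely to stay self-contained; the real HTML parser would be built on BeautifulSoup as planned. All class names here are illustrative assumptions.

```python
from html.parser import HTMLParser


class BaseParser:
    """Every format parser exposes the same two hooks."""

    def text(self, raw):
        """Plain text extracted for indexing."""
        raise NotImplementedError

    def links(self, raw):
        """Outgoing links to feed back into the crawl queue."""
        return []


class PlainTextParser(BaseParser):
    def text(self, raw):
        return raw


class _Extractor(HTMLParser):
    """Collects text chunks and <a href=...> targets while parsing."""

    def __init__(self):
        super().__init__()
        self.chunks, self.hrefs = [], []

    def handle_data(self, data):
        self.chunks.append(data)

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs += [v for k, v in attrs if k == "href"]


class HtmlParser(BaseParser):
    def text(self, raw):
        e = _Extractor()
        e.feed(raw)
        return " ".join(c.strip() for c in e.chunks if c.strip())

    def links(self, raw):
        e = _Extractor()
        e.feed(raw)
        return e.hrefs
```

A format plugin (PDF, RSS, ...) would then only need to subclass BaseParser and implement text() and, if applicable, links().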
scoring/
A scoring system that counts the number of links pointing to every page, so we can determine the relevance of each site. The counts are also aggregated per domain, identifying high-activity domains and sites, so we can increase the "hot" score of each site.
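The per-page and per-domain counting described above amounts to two counters fed by every discovered link; a minimal sketch, with hypothetical names:

```python
from collections import Counter
from urllib.parse import urlparse


class Scorer:
    """Count inbound links per page, and aggregate them per domain."""

    def __init__(self):
        self.page_score = Counter()
        self.domain_score = Counter()

    def add_link(self, target_url):
        # Each discovered link raises the target page's score by one,
        # and the target domain's aggregate score along with it.
        self.page_score[target_url] += 1
        self.domain_score[urlparse(target_url).netloc] += 1
```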
clustering/
Some classes that will help cluster the sites. I propose a naive Bayes classification algorithm[5][6] for this.
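A minimal multinomial naive Bayes classifier over word counts, of the kind references [5][6] describe, could look like this sketch (names and smoothing choice are assumptions, not the final design):

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial naive Bayes over word counts, with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # class -> word -> count
        self.class_docs = Counter()              # class -> training docs
        self.vocab = set()

    def train(self, label, text):
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.class_docs[label] += 1
        self.vocab.update(words)

    def classify(self, text):
        words = text.lower().split()
        total_docs = sum(self.class_docs.values())
        best, best_lp = None, float("-inf")
        for label in self.class_docs:
            # Log prior plus log likelihood of each word.
            lp = math.log(self.class_docs[label] / total_docs)
            total = sum(self.word_counts[label].values())
            for w in words:
                # Laplace smoothing so unseen words do not zero the product.
                lp += math.log((self.word_counts[label][w] + 1)
                               / (total + len(self.vocab)))
            if lp > best_lp:
                best, best_lp = label, lp
        return best
```

The clusters configured in the web interface would map to labels here, with the user-supplied example texts as the training data.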
statistics/
Compile the data from the current crawl and save a compact, readable summary of the work the crawler has done. This summary will be shown in the web administration site.
I will also include a pyder-admin.py script that uses Django's django-admin.py to generate the web interface, so the user only needs to configure the desired jobs via the web. The interface will be based on Pydra's and extend it. The settings and actions in the web interface should, for the moment, be:
Start the job
Set the starting URLs
Domain restrictions with exclude, include, and restrict options, for example: "exclude": "all", "include": "foo.com".
Establish the desired clusters, and introduce example text for these groups.
Cluster restrictions. Determine what kind of thematic pages the crawler should save and add their links to the queue.
View the existing plugins.
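The domain-restriction setting from the list above could be applied by a small helper like this sketch, assuming the settings arrive as a dictionary of the form shown in the example ("exclude": "all", "include": ["foo.com"]); the function name is hypothetical:

```python
from urllib.parse import urlparse


def domain_allowed(url, settings):
    """Apply the admin domain settings: explicit includes always win,
    then excluded domains (or "all") are dropped."""
    host = urlparse(url).netloc
    include = settings.get("include", [])
    exclude = settings.get("exclude", [])
    if host in include:
        return True
    if exclude == "all" or host in exclude:
        return False
    return True
```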
Installation Schema
Use distutils in a setup.py that installs the classes and the pyder-admin.py executable script. The user then runs this script to copy the needed Django files, and fills in the needed configuration files.
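The setup.py could look roughly like this sketch; the package list simply mirrors the module structure proposed above, and every name in it is an assumption about the final layout:

```python
# setup.py -- sketch of the planned distutils install script
from distutils.core import setup

setup(
    name="pyder",
    version="0.1",
    description="Distributed web crawler built on Pydra and Django",
    packages=["pyder", "pyder.protocol", "pyder.search", "pyder.urlfilter",
              "pyder.indexer", "pyder.parse", "pyder.scoring",
              "pyder.clustering", "pyder.statistics"],
    scripts=["pyder-admin.py"],
)
```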
Timeline.
April 30 – May 23: Set up the necessary repositories, identify and clone some existing (GPL compatible) code that will be useful for the project.
May 24 – May 31: Write the main handler, the protocol class, and the search class.
June 1 –June 7: I will work on the parser base class and the plain text, and mainly on the BeautifulSoup[7] HTML parser.
June 8 – June 14: Create the scoring class code.
June 15 – June 21: Clustering code, and integrate it with the scoring system.
June 22 – June 28: Write the statistics code.
June 29 – July 5: Integrate all the code into one unit, so it works well together, and debug the whole.
July 6 – July 19: Write the web interface.
July 20 – July 31: Make the install scripts, and document it.
August 1 – end: Documentation, and I will use this time if some task needs more time than estimated.
Here I should justify why some dates seem short: some of the code is trivial, for example the FTP and HTTP protocol implementations, since Python's standard libraries already cover them. But some tasks will need more work than others, so I may finish some tasks early and need more time for others.
[2] http://kickapoo.svn.sourceforge.net/viewvc/kickapoo/
[3] http://lucene.apache.org/pylucene/index.html
[4] http://lucene.apache.org/nutch/
[5] http://www.statsoft.com/textbook/naive-bayes-classifier/
[6] http://www.autonlab.org/tutorials/naive.html
[7] http://www.crummy.com/software/BeautifulSoup/