Student Info



Name: Jesús Manuel Mager Hois

University: National Autonomous University of Mexico

Major: Computer Science

Expected graduation date: 2012

Degree: undergraduate

Home Page/Blog: http://www.h1n1-al.blogspot.com/

Email: fongog@gmail.com

Telephone: +525553600709

Freenode IRC handle: jmagerh



Project Proposal Info

Title: Web Crawler (pyder)

Abstract



Pyder is a web crawler project that aims to be complete, flexible, multi-threaded, and distributed, built on the Pydra and Django frameworks. The idea is a web crawler integrated with Lucene that is fully flexible in content extraction, file format parsing, and clustering. Pyder will be written entirely in Python so that it is easy to maintain.

Detailed Description



About Me


My name is Jesus Manuel Mager Hois, and I am a computer science student at the UNAM (National Autonomous University of Mexico). I have been a Tux4Kids[1] contributor since 2008, was a successful GSoC student in 2008 (the Factoroids activity), and am a hobby game programmer. My favorite language is definitely Python, combined with C. I also use other languages on a regular basis: C (mainly GNU C), C++, Lua, Java, and PHP. I have experience with Git, SVN, gettext, autotools, regular expressions, and similar tools. For my work I use GNU/Linux, on the Ubuntu distribution.


Web crawlers are a topic that interests me a lot. A few months ago I wrote a very basic web crawler in Python, based on BeautifulSoup, to extract information from MySpace[2], and I plan to write my bachelor's thesis on crawlers. I hope to learn a great deal from this project, and from the whole team working on Pydra!
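The core of such a basic crawler is just link extraction from fetched pages. A minimal sketch of that step, using only the standard library's html.parser in place of BeautifulSoup so it is self-contained (the real crawler would use BeautifulSoup for its tolerance of broken HTML):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href targets from <a> tags; these feed the crawl frontier."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# A tiny in-memory page stands in for a fetched document.
page = '<html><body><a href="/a">A</a> <a href="http://example.com/b">B</a></body></html>'
parser = LinkExtractor()
parser.feed(page)
print(parser.links)
```

The extracted links would then be normalized against the page's base URL and queued for the next crawl round.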



The Plan



I want to write a web crawler that uses Pydra to run its work as parallel tasks, so that it can scale to cloud installations. But why write a new crawler when we can find plenty of them on the Internet? Most of the web crawlers that are free or open source do not take advantage of parallel processing. Nutch[4] is currently the most advanced web crawler, but it lacks a good administration interface and regular maintenance. Moreover, a crawler written in Python will be easier to maintain. Additionally, I will build an administration interface that uses the power of Django. All extracted information should be inserted into Lucene[3], but the code should be structured so that it can also be serialized in other ways.
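The basic unit of parallelism is a fetch job over a batch of URLs. A minimal local sketch of that idea, using a thread pool and a stand-in fetcher (in the real design, this unit of work would be wrapped as a Pydra task so the cluster scheduler distributes the batches; the function names here are illustrative, not Pydra API):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    """Stand-in for an HTTP fetch; the real worker would download the page
    and hand the body to the parser/scoring pipeline."""
    return (url, "<html>...</html>")

# A batch of seed URLs, crawled concurrently.
seeds = ["http://example.com/page%d" % i for i in range(4)]
with ThreadPoolExecutor(max_workers=2) as pool:
    results = dict(pool.map(fetch, seeds))

print(sorted(results))
```

Pydra would replace the local thread pool with cluster-wide task distribution, but the shape of the work unit stays the same.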

I know that a new project means a lot of code, but maintaining it is just as big an issue. Of course I will participate in the maintenance of the end result. ;)



Structure

In the library we will need



I will also include a pyder-admin.py script that uses Django's django-admin.py to generate the web interface, so the user only needs to set up the required jobs through the web. The interface will be reused and adapted from Pydra. The settings and actions in the web interface should, at the moment, be:
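The script itself can be a thin command dispatcher. A hypothetical sketch (command names and messages are assumptions; the real script would delegate to Django's management machinery, e.g. django.core.management, which is stubbed out here so the sketch stands alone):

```python
#!/usr/bin/env python
"""pyder-admin.py sketch: dispatch subcommands to Django's admin tooling."""
import sys

def main(argv):
    # In the real script, this branch would call Django's
    # execute_from_command_line() to create/manage the web interface project.
    if len(argv) == 2 and argv[0] == "startproject":
        return "would create pyder project: %s" % argv[1]
    return "usage: pyder-admin.py startproject <name>"

if __name__ == "__main__":
    print(main(sys.argv[1:]))
```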





Installation Schema



Use distutils in setup.py to install the classes and the pyder-admin.py executable script. The user then runs that script to copy the needed Django files and fills in the required configuration files.
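A minimal setup.py for this layout might look as follows (package and script names are the proposal's; the metadata values are placeholders, and on newer Pythons setuptools provides the same setup() interface):

```python
# setup.py sketch using distutils, as proposed.
from distutils.core import setup

setup(
    name="pyder",
    version="0.1",
    description="Distributed web crawler built on Pydra and Django",
    packages=["pyder"],          # the crawler's library classes
    scripts=["pyder-admin.py"],  # the admin script installed on PATH
)
```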



Timeline



April 30 – May 23: Set up the necessary repositories, identify and clone some existing (GPL compatible) code that will be useful for the project.

May 24 – May 31: Write the main handler, the protocol class, and the search class.

June 1 – June 7: Work on the parser base class and the plain-text parser, and mainly on the BeautifulSoup[7] HTML parser.

June 8 – June 14: Create the scoring class code.

June 15 – June 21: Write the clustering code and integrate it with the scoring system.

June 22 – June 28: Write the statistics code.

June 29 – July 5: Integrate all the code into one unit so that it works well together, and debug the whole.

July 6 – July 19: Write the web interface.

July 20 – July 31: Make the install scripts, and document it.

August 1 – end: Documentation; I will also use this time in case some task needs more time than estimated.

Here I should justify why some periods seem short: some of the code is trivial, for example the FTP and HTTP protocol implementations, since Python's standard library already provides them. But some tasks will need more work than others, so I may finish some early and need extra time for others.
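To illustrate why the protocol layer stays trivial: urllib in the standard library already speaks HTTP and FTP, so the crawler's protocol classes reduce to a scheme check plus a call to urllib (Python 3's urllib is shown; the helper name is illustrative):

```python
from urllib.parse import urlparse

# Schemes handled natively by urllib.request.urlopen().
SUPPORTED = {"http", "https", "ftp"}

def protocol_for(url):
    """Pick the protocol handler for a URL; unsupported schemes are rejected
    before any network work happens."""
    scheme = urlparse(url).scheme
    if scheme not in SUPPORTED:
        raise ValueError("unsupported scheme: %s" % scheme)
    return scheme

print(protocol_for("http://example.com/index.html"))
print(protocol_for("ftp://ftp.gnu.org/gnu/README"))
```

The actual fetch is then a single urllib.request.urlopen(url) call regardless of scheme, which is why those timeline slots are short.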

Links to additional information

[1] http://www.tux4kids.com

[2] http://kickapoo.svn.sourceforge.net/viewvc/kickapoo/

[3] http://lucene.apache.org/pylucene/index.html

[4] http://lucene.apache.org/nutch/

[5] http://www.statsoft.com/textbook/naive-bayes-classifier/

[6] http://www.autonlab.org/tutorials/naive.html

[7] http://www.crummy.com/software/BeautifulSoup/