[SATLUG] Security Issues
guy at gchconsult.com
Mon Jun 30 08:09:53 CDT 2014
New to the list so first things first. Hello, I hope to make it to a SATLUG meeting in the near future.
As far as Python-based spiders go, they are pretty prevalent. At my last position in Higher Ed., we routinely saw urllib-based crawlers in the logs, mostly from Eastern European, Asian, and Mexican IPs.
When creating one you would probably want to use a framework like Scrapy, or at the very least, BeautifulSoup to simplify things on your end.
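To give a feel for the link-extraction step those libraries take care of, here is a minimal sketch using only the standard library's html.parser in place of Scrapy or BeautifulSoup. The names LinkCollector and extract_links are made up for illustration, and a real crawler would want the more forgiving parsing those frameworks provide.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Only anchor tags are of interest here.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return every link found in html, in document order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links
```

Calling extract_links(page_source, page_url) returns a list of absolute URLs that the spider can then filter before following.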
I would definitely add a DNS check before attempting to follow links. I remember the bad old days of email farming, when we would add a hidden link on sites that went to a script that created recursive pages of randomly generated email addresses and links, in an attempt to hinder crawlers that wouldn't respect robots.txt. It was not foolproof, but it kept a fair number of crawlers busy for days.
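A rough sketch of what that check could look like on the crawler side, assuming a plain socket.getaddrinfo lookup is good enough. The function names, the max_depth default of 5, and the injectable resolver argument are my own illustrative choices, not anything from a particular framework. The depth cap and the seen set are there because a trap like the one described above generates fresh links forever.

```python
import socket
from urllib.parse import urlparse


def host_resolves(url, resolver=socket.getaddrinfo):
    """Return True if the URL's hostname resolves in DNS."""
    try:
        host = urlparse(url).hostname
    except ValueError:
        # Malformed URL, for example a broken IPv6 literal.
        return False
    if not host:
        # mailto:, javascript: and similar links have no hostname.
        return False
    try:
        resolver(host, None)
    except (socket.gaierror, UnicodeError):
        return False
    return True


def links_to_follow(links, seen, depth, max_depth=5,
                    resolver=socket.getaddrinfo):
    """Filter links down to new, resolvable URLs within the depth cap."""
    if depth >= max_depth:
        # Base case: stop descending so generated pages cannot trap us.
        return []
    fresh = []
    for url in links:
        if url in seen:
            continue
        seen.add(url)  # remember it even if DNS fails, to avoid retries
        if host_resolves(url, resolver):
            fresh.append(url)
    return fresh
```

The crawler would keep a single seen set for the whole run and pass the current depth along with each page's links. A depth cap alone will not defeat a determined trap, but it bounds the time wasted on one.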
> On Jun 29, 2014, at 1:44 PM, Christopher Lemire <christopher.lemire at gmail.com> wrote:
>> On Thu, Jun 26, 2014 at 3:09 PM, Jonathan Kelley <jonkelley at gmail.com> wrote:
>> I'm assuming you write a reactor pattern or something for spidering in
>> python? I'd think the barrier to threading etc would make it harder to code
>> a spider in python.
> What's computationally expensive and the overhead involved when you
> are thinking of programming that? What's the results of too much
> recursion? Think of traversing a maze with recursion and infinite
> recursion, need your recursive base case to be correct.
>> * JonK gives kris a colourful gift box with question marks and tux's
>> printed all over. "Open it!"
>> * Kris looks around nervously and slowly opens it
>> <JonK> it's empty!
>> <JonK> the box represents what an open-source linux programmer makes in a
> There are plenty who contribute to open source here that may be able
> to better answer that. However, I've found that the little I shared
> through open source such as JSpeak seemed to result in none if any
> contributions back.
>> On Thu, Jun 26, 2014 at 5:43 AM, Daniel Villarreal <youcanlinux at gmail.com>
>>> Can one really count on robots.txt being adhered to?
> On a larger scale, patterns can be observed, but to answer your
> question, not always.
> Tell Cameron Marez to speak for himself.