Turning RSS Feeds into Movable Type Entries

A while ago, I created the site PsychicProgrammer.com as a place to gather up programming-related stories from various corners of the Internet. I whipped up some code to automatically gather RSS feeds from various programming-related websites and pull the stories to turn them into Movable Type postings. I’ve had a few people ask how I did this, so I thought I’d post the code with some explanation of what’s going on.

#!/usr/bin/perl

use strict;
use XML::RSS::Parser;
use RPC::XML;
use RPC::XML::Client;
use Date::Manip;
use LWP::Simple;
use DBI;

The script is written in Perl, and I’m using a few nifty Perl modules available from CPAN.
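XML::RSS::Parser – a great module for dealing with RSS feeds.
RPC::XML – used to communicate with Movable Type via RPC.
LWP::Simple – used to retrieve the RSS feed document.
Date::Manip – used to parse the dates found in the feeds.
DBI – used to talk to the database that records which articles have been seen.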

my @info=({'url'      => "http://www.oreillynet.com/pub/feed/16?format=rss2",
           'name'     => "Perl.com",
           'category' => "Perl.com",
           'datetype' => 2},
          {'url'      => "http://www.digg.com/rss/indexprogramming.xml",
           'name'     => "Digg.com",
           'category' => "Digg.com",
           'datetype' => 1},
          {'url'      => "http://www.oreillynet.com/pub/feed/20?format=rss2",
           'name'     => "Xml.com",
           'category' => "Xml.com",
           'datetype' => 2},
          {'url'      => "http://www.dotnetjunkies.com/WebLog/saasheim/rss.aspx",
           'name'     => "Steinar Aasheim's Blog",
           'category' => "Steinar Aasheim's Blog",
           'datetype' => 1},
          {'url'      => "http://tomcopeland.blogs.com/juniordeveloper/rss.xml",
           'name'     => "Junior Developer",
           'category' => "Junior Developer",
           'datetype' => 1},
          {'url'      => "http://programming.newsforge.com/programming.rss",
           'name'     => "Newsforge.com",
           'category' => "Newsforge.com",
           'datetype' => 2});

Here, I set up a structure containing the RSS feeds that I’m going to retrieve. Of course, to scale this, it would be better to store this information in a SQL table.
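As a rough sketch of the SQL-table idea, the feed list could be loaded at startup instead of being hard-coded. The table name feeds, its columns, and the helper load_feeds are assumptions made for illustration; none of them are part of the original script. The function expects an already-connected DBI handle, such as the $dbh created further down.

```perl
use strict;
use warnings;

# Hypothetical helper: load the feed definitions from a table instead of @info.
# Assumes a table like:
#   CREATE TABLE feeds (url text, name text, category text, datetype integer);
sub load_feeds {
    my ($dbh) = @_;

    # Slice => {} asks DBI to return each row as a hash reference,
    # which gives the same shape as the hard-coded @info entries.
    my $rows = $dbh->selectall_arrayref(
        "SELECT url, name, category, datetype FROM feeds ORDER BY name",
        { Slice => {} }
    ) or die "Can't load feeds: " . $dbh->errstr;

    return @$rows;
}
```

With something like this in place, the hard-coded list could be replaced by my @info=load_feeds($dbh); once the database connection is open.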

my $DEBUG=0;   # set to 1 for verbose output; must be declared under "use strict"
my $username='user';
my $password='password';
my %category;
my $i;
my $dbh;
my $sth;
my $q;
my $seencount;
my $feed;
my $site;
my $xmldoc;

Some variable definitions.

# Set up database connection
$dbh=DBI->connect("dbi:Pg:dbname=p","psy","password") or die "Can't open database: $DBI::errstr";

# Set up XML-RPC interface
my $cli=RPC::XML::Client->new('http://www.psychicprogrammer.com/mt/mt-xmlrpc.cgi');

# Set up XML parser
my $p=XML::RSS::Parser->new;

# Get category list
my $req=RPC::XML::request->new('mt.getCategoryList','1',$username,$password);
my $resp=$cli->simple_request($req);
foreach $i (@$resp)
{
  $category{$i->{categoryName}}=$i->{categoryId};
}

Here, we set up a database connection. The database is used to record which articles have been seen before, so that we don’t have any duplicates. Next, we set up the RPC interface to the Movable Type blog. Make sure you put your correct URL in here for your blog. Then, we talk to MT to get a list of the categories. We do this because we need to map the RSS feed name to the appropriate MT category.

if($DEBUG)
{
  # Iterate over the keys only; a bare %category would yield keys and values
  foreach $i (sort keys %category)
  {
    print "$category{$i} $i\n";
  }
}

Some debugging code to dump the category information. This is a good check to make sure that the RPC interface is working. This script assumes that there is a pre-existing category defined for each RSS feed. Use the MT interface to create new categories.
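Since the script relies on every feed having a matching MT category, a small check up front can catch a missing one before anything is posted. This is only a sketch: the helper name missing_categories is made up for illustration, and it assumes the same shapes as the script's %category map and @info list.

```perl
use strict;
use warnings;

# Hypothetical helper: return the names of feed categories that are
# missing from the name => id map built from mt.getCategoryList.
sub missing_categories {
    my ($category, @feeds) = @_;
    my %missing;
    foreach my $site (@feeds) {
        my $name = $site->{'category'};
        $missing{$name} = 1 unless exists $category->{$name};
    }
    return sort keys %missing;
}
```

It could be called as my @missing=missing_categories(\%category,@info); followed by a die listing the names if the list is not empty.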

foreach $site(@info)
{
  printf("*** Processing for site %s\n\n",$site->{'name'}) if $DEBUG;

  $xmldoc=get $site->{'url'};
  next unless defined $xmldoc;   # LWP::Simple's get returns undef on failure
  $feed=$p->parse($xmldoc);

This starts a loop for each RSS feed defined above.

  foreach my $i ( $feed->query('//item') )
  {
    my $datenode;
    my $date;
    my $titlenode = $i->query('title');
    my $linknode = $i->query('link');
    my $descnode = $i->query('description');

    if(($site->{'datetype'})==1)
    {
      $datenode = $i->query('pubDate');
      $date=UnixDate($datenode->text_content,"%Y-%m-%dT%H:%M:%S");
    }

    if(($site->{'datetype'})==2)
    {
      $datenode = $i->query('dc:date');
      $date=UnixDate($datenode->text_content,"%Y-%m-%dT%H:%M:%S");
    }

    my $dd = $descnode->text_content .
      "<br>Link: <a href=\"" . $linknode->text_content . "\">" .
      $linknode->text_content . "</a>";

For each entry in the RSS file, start pulling out the information we’re interested in. I came across a problem with recording the date: I found both pubDate and dc:date tags in the feeds. I use a variable called datetype to determine which one to look for in each RSS feed. I add a line of HTML to the end of the article content that includes a link back to the original article.
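To show what the two date styles look like, here is a minimal pure-Perl sketch that normalizes either one to the format used above. It is not what the script does (the script uses Date::Manip's UnixDate), and unlike Date::Manip it does no time zone handling at all. The helper name normalize_feed_date is an assumption for illustration.

```perl
use strict;
use warnings;

my %MONTH = (Jan => 1, Feb => 2, Mar => 3, Apr => 4,  May => 5,  Jun => 6,
             Jul => 7, Aug => 8, Sep => 9, Oct => 10, Nov => 11, Dec => 12);

# Hypothetical helper: turn a pubDate (RFC 822 style) or dc:date (ISO 8601
# style) string into "YYYY-MM-DDTHH:MM:SS". Returns undef if neither matches.
# The time zone is ignored, so the wall-clock time is kept as written.
sub normalize_feed_date {
    my ($text) = @_;
    return undef unless defined $text;

    # dc:date style, e.g. 2006-01-02T15:04:05-08:00
    if ($text =~ /^\s*(\d{4})-(\d\d)-(\d\d)T(\d\d):(\d\d)(?::(\d\d))?/) {
        return sprintf("%04d-%02d-%02dT%02d:%02d:%02d",
                       $1, $2, $3, $4, $5, defined $6 ? $6 : 0);
    }

    # pubDate style, e.g. Mon, 02 Jan 2006 15:04:05 GMT
    if ($text =~ /^\s*(?:[A-Za-z]{3},\s*)?(\d{1,2})\s+([A-Za-z]{3})\s+(\d{4})\s+(\d\d):(\d\d)(?::(\d\d))?/) {
        my $month = $MONTH{ucfirst lc $2};
        return undef unless defined $month;
        return sprintf("%04d-%02d-%02dT%02d:%02d:%02d",
                       $3, $month, $1, $4, $5, defined $6 ? $6 : 0);
    }

    return undef;
}
```

Because the helper recognizes both styles, a caller could try the pubDate text first and fall back to the dc:date text, which would make the per-feed datetype setting unnecessary.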

    # Check to see if we've seen this one yet
    $q="SELECT count(source) FROM seen WHERE index=" . $dbh->quote($linknode->text_content);
    $sth=$dbh->prepare($q);
    $sth->execute();
    ($seencount)=$sth->fetchrow_array();
    $sth->finish();
    if($seencount==0)
    {

This is where we check our SQL table to see if we’ve seen this article URL before.
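The same check can also be written with a placeholder, which leaves the quoting to the database driver instead of building the SQL string by hand. This is a sketch only: the helper name already_seen is made up, and it assumes the same seen table and a connected DBI handle.

```perl
use strict;
use warnings;

# Hypothetical helper: true if this article URL is already in the "seen" table.
# $dbh is a connected DBI handle; the ? placeholder is filled in by the driver.
sub already_seen {
    my ($dbh, $url) = @_;
    my ($count) = $dbh->selectrow_array(
        "SELECT count(source) FROM seen WHERE index = ?",
        undef,      # no statement attributes
        $url        # bound to the ? placeholder
    );
    return ($count || 0) > 0;
}
```

In the loop this would be used as unless(already_seen($dbh,$linknode->text_content)) in place of the prepare, execute and fetch sequence.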

      # Post article
      printf("Posting %s\n",$titlenode->text_content) if $DEBUG;
      my $req=RPC::XML::request->new('metaWeblog.newPost',
                                     '1',
                                     $username,
                                     $password,
                                     RPC::XML::struct->new(
                                       'title' => RPC::XML::string->new($titlenode->text_content),
                                       'description' => RPC::XML::string->new($dd),
                                       'dateCreated' => RPC::XML::string->new($date),
                                       'mt_tb_ping_urls' => RPC::XML::array->new(
                                         $linknode->text_content)
                                     ),
                                     RPC::XML::boolean->new(1)
                                    );
      my $resp=$cli->simple_request($req);

This executes the RPC call to MT to actually post the article. Note that we attempt a trackback ping to the original article. This array can also be populated with other tracking sites, such as Technorati.
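The script above does not check whether the post actually succeeded. As far as I understand RPC::XML::Client, simple_request returns undef on a transport error and hands back an XML-RPC fault as a hash with faultCode and faultString keys; treat that as an assumption to verify against the module's documentation. Under that assumption, a small hypothetical helper could look like this:

```perl
use strict;
use warnings;

# Hypothetical helper: describe what went wrong with a simple_request result,
# or return an empty string if the result looks like a normal value.
# Assumes undef means a transport error and a hash with faultCode is a fault.
sub rpc_problem {
    my ($resp) = @_;
    return "no response (transport error)" unless defined $resp;
    if (ref $resp eq 'HASH' && exists $resp->{faultCode}) {
        return "fault $resp->{faultCode}: $resp->{faultString}";
    }
    return "";
}
```

Calling it right after simple_request and skipping to the next item when it returns a message would leave the article unrecorded in the seen table, so the next run would try it again.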

      # Change category
      $req=RPC::XML::request->new('mt.setPostCategories',
                                  $resp,
                                  $username,
                                  $password,
                                  RPC::XML::array->new(
                                    RPC::XML::struct->new(
                                      'categoryId' => $category{$site->{'category'}},
                                      'isPrimary' => RPC::XML::boolean->new(1)
                                    )
                                  )
                                 );
      $resp=$cli->simple_request($req);

Now we change the article’s category to be the same as the feed’s name.

      $q="INSERT INTO seen (index, source) VALUES (" . $dbh->quote($linknode->text_content) .
         ", " . $dbh->quote($site->{'name'}) . ")";
      $sth=$dbh->prepare($q);
      $sth->execute();
      $sth->finish();
    }
  }
}

# Close database
$dbh->disconnect();

We write a row into the database to record that we’ve now seen this article. We loop back for the rest of the articles, and then for the rest of the feeds. Finally, we close the database connection.

That’s it! I run this code from a crontab entry every hour or so. As soon as new articles are discovered in the RSS feeds, they will be magically turned into postings on a Movable Type blog, thanks to the wonders of XML-RPC.

I’d appreciate any comments or feedback if you decide to use this code in your own projects. Have fun!

Web Site Log Files

I’ve spent the entire day converting log files. To be more truthful, I spent an hour or so crafting a Perl program to do the actual conversion, but I’ve spent the rest of the day watching my machine crunch away at the log files.

We had an ‘incident’ on our Packeteer AppCelera box. We are running beta software (we’re in the same building as their Canadian office), and there was an issue with a recent beta that caused the log files to be exported in some weird internal format instead of W3C format. I had to convert two and a half months of mangled log files. At 4pm on a Friday, it’s finally done.

Once we discovered the problem (why did it take me two and a half months to notice?), Packeteer immediately got the programmers involved, found the bug, corrected it, and compiled a new build for us. Now that’s service!