CVS to Subversion

I am now in the process of converting the CVS repository we use at work into a Subversion repository. I'm using, naturally, the Subversion packages I've built and published on my iDisk.

The setup I am starting with is a dedicated CVS server which is accessed via SSH. The CVS repository is stored on a mounted NFS volume, which is served by a fancy server box that has backups and so on.

The Subversion repository cannot live on an NFS volume, so the target setup is a Subversion repository that is exposed via HTTP/WebDAV using the Apache HTTPd 2.1 web server. (I used the in-development HTTPd 2.1 instead of the "stable" 2.0 because I wanted to build using the 1.0 version of APR…)

The Subversion repository is kept on the local disk (at /var/subversion/repository) which is backed up to the NFS volume. On the NFS volume, I have a directory set up like so:

drwxr-xr-x  4 www  svn  backups/
drwxr-xr-x  3 www  svn  backups/full/
drwxr-xr-x  2 www  svn  backups/revisions/
drwxr-xr-x  2 svn  svn  bin/
-rwxr-xr-x  1 svn  svn  bin/hot-backup
-rwxr-xr-x  1 svn  svn  bin/mailer
drwxr-xr-x  2 svn  svn  conf/
-rw-r--r--  1 svn  svn  conf/mailer.conf
-rw-r--r--  1 svn  svn  conf/svn-access.conf
lrwxrwxrwx  1 svn  svn  repository/ -> /var/subversion/repository/

The repository at /var/subversion/repository and its contents are owned by user www and group www. The backups directory is also owned by www. Everything else is owned by user svn. The www user must own the repository because it is managed by the mod_dav_svn module in the Apache httpd process, which runs as www. The backups directory is owned by www so that a cron job for the www account, which runs at midnight twice a week, can both read the repository and write to the backups directory:

# Back up the Subversion repository twice a week (Sundays and Thursdays at midnight)
0 0 * * 0,4     /jingle/svn/bin/hot-backup /jingle/svn/repository /jingle/svn/backups/full

In the bin directory, I've added the hot-backup.py and mailer.py scripts. These scripts are in the Subversion source tree, but are not installed with Subversion.

In the conf directory, I have the configuration file for mailer.py and the AuthzSVNAccessFile access control file for the Subversion repository, which allows finer-grained access control. The httpd.conf file includes this configuration:

<Location /svn>
    AuthType Digest
    AuthName "Subversion"
    AuthDigestDomain http://svn-server/
    AuthDigestProvider file
    AuthUserFile  /etc/httpd/auth/users
    AuthzSVNAccessFile /nfs/svn/conf/svn-access.conf

    Require valid-user

    DAV svn
    SVNPath /var/subversion/repository
    RemoveHandler .cgi
    RemoveOutputFilter .html
</Location>
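
For illustration, here is a minimal sketch of what the file named by AuthzSVNAccessFile might contain, wrapped in a shell function that writes it out. The group name, the user names (alice, bob), and the /trunk path are hypothetical and not taken from the real repository; only the section and rule syntax is the standard mod_authz_svn format. Since the configuration uses SVNPath for a single repository, the section names are plain repository paths.

```shell
#!/bin/sh
# Write a sample access file to the path given as $1.
# The group, users, and paths below are hypothetical placeholders.
write_sample_access_file() {
    cat > "$1" <<'EOF'
[groups]
developers = alice, bob

# Every authenticated user may read the whole repository.
[/]
* = r

# Only members of the developers group may commit under /trunk.
[/trunk]
@developers = rw
EOF
}

# Example: write_sample_access_file ./svn-access.conf.sample
```

The user names have to match the names in the AuthUserFile, since the access file only decides what an already authenticated user may do.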

The remaining step is populating the repository itself.

The last time I ran a repository conversion using cvs2svn, it took all day. A likely factor is that the CVS repository is on an NFS volume. A local volume would be considerably faster. It probably doesn't help that the CVS server (733 MHz G4, 512 MB RAM) isn't a top-of-the-line machine by today's measures.

So I'm going to want to do this on a faster machine with plenty of RAM using the local disk. Fortunately, I have that on my desk (2 GHz G5x2, 4 GB RAM). With that much RAM, I can also use a RAM disk to help things along. The steps I need to take, therefore, are:

  1. Cache the CVS repository to my local disk (in case I need to do this multiple times)
  2. Create a mondo RAM disk to store the working data
  3. Copy the CVS repository to the RAM disk
  4. Run cvs2svn, telling it to use the RAM disk for scratch space
  5. Load the dumpfile from cvs2svn into a new repository on the RAM disk

It turns out that the largest RAM disk I can make on my computer without having malloc() barf is 4632520 blocks (2.2 GB) even though I have about 3 GB of RAM free. Obviously, if you have less RAM, you may need to make a smaller RAM disk.
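
The block count is in 512-byte sectors, which is the unit the ram:// device URL takes. The arithmetic behind the 2.2 GB figure can be sketched with two small helper functions; the function names are made up for this example and are not part of the script further down.

```shell
#!/bin/sh
# Convert a count of 512-byte blocks to whole mebibytes (rounded down).
# 2048 blocks of 512 bytes make one MiB.
blocks_to_mib() {
    echo $(( $1 / 2048 ))
}

# Convert mebibytes to the 512-byte block count used in ram://<blocks>.
mib_to_blocks() {
    echo $(( $1 * 2048 ))
}

blocks_to_mib 4632520   # prints 2261 (about 2.2 GB)
mib_to_blocks 1024      # prints 2097152 (a 1 GB RAM disk)
```

Working backwards from the amount of memory you can spare gives the number to try for a smaller RAM disk.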

The temporary files created by cvs2svn while it is working can be quite large, and I had trouble getting both the CVS repository and the temp files into smaller RAM disks, so I used the largest size I could get away with. If you need a smaller RAM disk, Fitz tells me that it's probably best to put CVS on the RAM disk and the temp files on your local disk. Your CVS repository will be of a known size, whereas the temp files are not. Also, cvs2svn will hit the CVS files multiple times, so the benefit of the RAM disk is probably best spent on those files. I haven't done any metrics, since I managed to get it all into the RAM disk.
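
Because the CVS repository has a known size, it is easy to check up front whether it fits. The following is a rough sketch, not something the conversion script does: the function name is hypothetical, and the default of keeping half the RAM disk free for temp files is an arbitrary assumption, since the size of the cvs2svn temp files is not known in advance.

```shell
#!/bin/sh
# Succeed if directory $1 fits on a RAM disk of $2 512-byte blocks while
# leaving $3 percent of the disk free for scratch files.
# The default reserve of 50 percent is an arbitrary assumption.
fits_on_ram_disk() {
    dir="$1"
    blocks="$2"
    reserve_pct="${3:-50}"

    # Size of the directory tree in kilobytes.
    dir_kb="$(du -sk "${dir}" | awk '{print $1}')"

    # Two 512-byte blocks make one kilobyte.
    disk_kb=$(( blocks / 2 ))
    usable_kb=$(( disk_kb * (100 - reserve_pct) / 100 ))

    [ "${dir_kb}" -le "${usable_kb}" ]
}

# Example:
#   if fits_on_ram_disk ./repository-cvs 4632520; then
#       echo "CVS repository and temp files can share the RAM disk"
#   else
#       echo "Put only the CVS repository on the RAM disk"
#   fi
```

If the check fails, the fallback is the arrangement described above: CVS files on the RAM disk and the cvs2svn temp directory on the local disk.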

I have cvs2svn generate a dump file rather than load straight into a repository. This has a couple of advantages. First, if something goes wrong with loading, I don't need to re-run cvs2svn. Second, it gives me an opportunity to unmount the RAM disk I used for cvs2svn and start with a new one for the repository work.
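
Keeping the dump file around also makes it possible to do a cheap sanity check before loading it. A Subversion dump file starts with a line of the form "SVN-fs-dump-format-version: N", so a small helper can confirm that a file at least looks like a dump. This is only a sketch with a made-up function name; it checks the header line and cannot tell whether the dump is complete.

```shell
#!/bin/sh
# Succeed if $1 exists and its first line is a Subversion dump header.
# This does not detect a truncated or otherwise damaged dump.
is_svn_dumpfile() {
    [ -f "$1" ] &&
        head -n 1 "$1" | grep -q '^SVN-fs-dump-format-version: [0-9]'
}

# Example:
#   is_svn_dumpfile repository.svndump || echo "Not a dump file?"
```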

Here is my script:

#!/bin/sh

##
# Configuration
##

wd="$(pwd -L)";

cvs_repo_remote="svn-server:/nfs/cvs/root/jingle/Jingle";
cvs_repo_local="${wd}/repository-cvs";
svn_dumpfile="${wd}/repository.svndump";
svn_repo_remote="svn-server:/var/subversion/repository";
svn_repo_local="${wd}/repository-svn";

# 4632520 blocks (2.2GB) appears to be as big as I can get away with
# without malloc errors
ram_disk_size="4632520";
ram_disk_name="cvs2svn-scratch";

##
# Do The Right Thing
##

echo "Starting up at $(date)";

if [ ! -f "${svn_dumpfile}" ] && [ ! -d "${svn_repo_local}" ]; then
    if [ -d cvs2svn ]; then
        echo "Updating cvs2svn...";
        svn update cvs2svn;
    else
        echo "Checking out cvs2svn...";
        svn checkout http://svn.collab.net/repos/cvs2svn/trunk cvs2svn;
    fi;

    if [ -n "${cvs_repo_remote}" ]; then
        echo "Copying CVS repository locally...";
        if ! rsync                      \
          --recursive                   \
          --delete                      \
          --verbose --progress --stats  \
          "${cvs_repo_remote}/"         \
          "${cvs_repo_local}/"; then
            echo "FATAL: copy failed.";
            exit 1;
        fi;
    fi;

    raw_device="$(echo $(hdid -nomount "ram://${ram_disk_size}"))";

    if [ -z "${raw_device}" ]; then
        echo "Unable to create RAM disk raw device.";
        exit 1;
    fi;

    echo "Created RAM disk raw device: ${raw_device}";

    echo "Formatting as case-sensitive HFS+...";
    newfs_hfs -s -v "${ram_disk_name}" "${raw_device}";

    echo -n "Mounting filesystem... ";
    fs_device="$(hdiutil mountvol "${raw_device}" | tail -1 | awk '{print $1}')";

    if [ -z "${fs_device}" ]; then
        echo "";
        echo "FATAL: Unable to create RAM disk.";
        hdiutil detach "${raw_device}";
        exit 1;
    fi;

    mount="$(df -lk | grep "^${fs_device}" | awk '{print $6}')";
    if [ -z "${mount}" ]; then
        echo "";
        echo "FATAL: Unable to locate RAM disk mount point.";
        hdiutil detach "${fs_device}";
        hdiutil detach "${raw_device}";
        exit 1;
    fi;
    echo "${mount}";

    df -lk | grep "^${fs_device}";

    echo "Copying CVS repository to RAM disk...";
    cd "$(dirname "${cvs_repo_local}")" && pax -rvw "$(basename "${cvs_repo_local}")" "${mount}";

    mkdir "${mount}/tmp";

    echo "Starting conversion at $(date)...";

    ./cvs2svn/cvs2svn                                   \
      --tmpdir="${mount}/tmp"                           \
      --mime-types=/usr/local/apache/conf/mime.types    \
      --no-default-eol --keywords-off                   \
      --encoding=UTF-8                                  \
      --dumpfile="${svn_dumpfile}"                      \
      --dump-only                                       \
      "${mount}/$(basename "${cvs_repo_local}")";

    echo "Conversion completed at $(date).";

    echo "Detaching RAM volume device ${fs_device}...";
    hdiutil detach "${fs_device}";

    echo "Detaching RAM raw device ${raw_device}...";
    hdiutil detach "${raw_device}";
fi;

if [ ! -f "${svn_dumpfile}" ]; then
    echo "No dumpfile?";
    exit 1;
fi;

if [ ! -d "${svn_repo_local}" ]; then
    raw_device="$(echo $(hdid -nomount ram://2000000))";

    if [ -z "${raw_device}" ]; then
        echo "Unable to create RAM disk raw device.";
        exit 1;
    fi;

    echo "Created RAM disk raw device: ${raw_device}";

    echo "Formatting as case-sensitive HFS+...";
    newfs_hfs -s -v "${ram_disk_name}" "${raw_device}";

    echo -n "Mounting filesystem... ";
    fs_device="$(hdiutil mountvol "${raw_device}" | tail -1 | awk '{print $1}')";

    if [ -z "${fs_device}" ]; then
        echo "";
        echo "FATAL: Unable to create RAM disk.";
        hdiutil detach "${raw_device}";
        exit 1;
    fi;

    mount="$(df -lk | grep "^${fs_device}" | awk '{print $6}')";
    if [ -z "${mount}" ]; then
        echo "";
        echo "FATAL: Unable to locate RAM disk mount point.";
        hdiutil detach "${fs_device}";
        hdiutil detach "${raw_device}";
        exit 1;
    fi;
    echo "${mount}";

    df -lk | grep "^${fs_device}";

    echo "Creating subversion repository...";
    svnadmin create --fs-type "fsfs" "${mount}/$(basename "${svn_repo_local}")";

    echo "Loading data into subversion repository...";
    svnadmin load "${mount}/$(basename "${svn_repo_local}")" < "${svn_dumpfile}";

    echo "Copying repository...";
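    # Copy from a subshell so the script's working directory stays off the
    # RAM disk; otherwise the volume may be busy when it is detached below.
    (cd "${mount}" && pax -rvw "$(basename "${svn_repo_local}")" "$(dirname "${svn_repo_local}")");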

    echo "Load completed at $(date).";

    echo "Detaching RAM volume device ${fs_device}...";
    hdiutil detach "${fs_device}";

    echo "Detaching RAM raw device ${raw_device}...";
    hdiutil detach "${raw_device}";
fi;

echo "Finished at $(date)";
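
One fragile spot in the script is how it finds the mount point: grep "^${fs_device}" also matches longer device names (for example /dev/disk1s10 when looking for /dev/disk1s1), and awk '{print $6}' depends on df printing exactly six columns. A slightly more defensive version could look like the sketch below. The function name is made up, and it assumes the mount point contains no spaces, which holds for the volume name used here.

```shell
#!/bin/sh
# Read `df -lk` output on stdin and print the mount point of the device
# named exactly $1. Assumes the mount point is the last field on the line
# and contains no spaces.
mount_point_for_device() {
    awk -v dev="$1" '$1 == dev { print $NF }'
}

# Example, as it would be used in the script above:
#   mount="$(df -lk | mount_point_for_device "${fs_device}")"
```

The same function would serve both RAM disk sections, since they repeat the identical lookup.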

Many thanks to Fitz and the Subversion developers and fans on #svn for lots of help.

Comments

Isn't one of the points of the FSFS that it can be put on an nfs volume? Not that it changes any of what you're doing, but I think worth noting.
