
Subversion (SVN) to Git (GitHub) Migrations

TL;DR: Migration of source for 5 products from SVN to Git. One product required full history from 7 branches, including merge history. It also had to be split from one SVN repo into 4 GitHub repos due to size limitations, and it required LFS. The other 4 products (spread across 6 SVN locations) just needed Trunk with minimal history, but they had a complex non-standard SVN structure that ultimately resulted in 32 Git repos, which were then consolidated into 6 GitHub repos. We required 2 approaches for speed and efficiency: Git-SVN for the products that didn’t require full history, and Subgit for full history. BFG Repo Cleaner was used to clean the new repos before pushing to GitHub.

Takeaways:

  • Subgit took an age (due to size of repo) but brought over FULL history
  • Git-SVN was quicker but missed the merge history
  • BFG Repo Cleaner is awesome.

Introduction

As part of a consolidation and modernisation exercise, we are migrating our Engineering org to company-wide unified processes and toolsets, which involves migrating from multiple SCM tools (Subversion (SVN), TFS, file shares and Perforce) to GitHub. This is the writeup of the migration of the source for 5 products from SVN to GitHub. Ownership of these 5 products is split between 2 teams: Team A looks after 4 of the products (which are heavily related) and Team B looks after the other product. The transformation project involves more than just technological migrations, with wholesale process change for many teams, but this article concentrates purely on the technological side of the migration.

The requirements for the migration were slightly different between the 2 teams. Team A required minimal history and only the Trunk branch migrating across to GitHub, whereas Team B required 7 branches, each with full history including merge history. The reason for the latter is that Team B offers LTS on a number of versions, and its process flow involves merging code from older versions to the newest when implementing changes. Knowing the merge history is essential for the history to be of use, and bringing it across was deemed a better route than switching back and forth endlessly between SVN and GitHub in the future.

Approach

After designing the new development process for the teams on GitHub, the first task was to perform a test migration from SVN to the desired end result, both to prove out the migration concept and to give the teams a familiar and tangible sandbox in which to learn Git and familiarise themselves with the new ideas and techniques compared with SVN. It also gave us time to look at what changes to the CI/CD tooling would be required post transition. This brought about some challenges, which shaped the plan for the final migration. The steps taken for the sandbox creation would be documented and repeated exactly across all repos for the final migration, planned for 2-3 weeks after sandbox creation.

Team A

Following on from previous migrations, we went down the route of git-svn for the migration. We know this is relatively fast, brings over history that is usable and is simple to use. The first problem we encountered was the setup of SVN. Most SVN repos are set up following a standard and expected layout, which looks as follows:

SVN Repo
  - Project
     - Trunk
     - Branches
     - Tags

With this structure it is straightforward to complete a migration: point git-svn at the project level of SVN, tell it the location of trunk, branches and tags, then perform the migration. Our structure was different to this, and on top of that the team had different locations for what would constitute ‘trunk’ as part of the migration. The 4 products were split across 6 different “repos”, each with a similar layout to the below:

SVN Repo 
   - Java Code
     - Subdir
        - Subdir 
           - Trunk
           - Branches
              - Branch Name
                 - Patch
                    - Patch Name
           - Tags
   - DB Code
     - Subdir
        - Trunk
        - Branches
           - Branch Name
              - Patch
                 - Patch Name
        - Tags
   - Legacy Java Code
     - Subdir
        - Trunk
        - Branches
           - Branch Name
        - Tags
           - Version
              - Patch
                 - Version

Each of the parent folders, such as Java Code, contains multiple subdirectories, each with the directory structure above, but not all subdirectories were needed in the newly migrated world, for multiple reasons. As a result, one of the new Git repos was to consist of the following:

Product A
GitHub Folder Name    SVN Location of Source
UI-1                  svnrepo/Java Code/Subdir1/Subdir2/trunk
DBS-1                 svnrepo/DB Code/Subdir/trunk
UI-3                  svnrepo/Legacy Java Code/Subdir/tags/TagID/Patch/PatchID
DBS-2                 svnrepo/Legacy Java Code/Subdir/branches/BranchName/Patch/PatchID
UI-2                  svnrepo/Java Code/Subdir/trunk
DBS-3                 svnrepo/DB Code/Subdir/trunk

This posed a slight challenge, but a welcome one. Given that we only had a Master branch required for each of these, it made life a little easier. After much research we settled on the approach below.

  • Migrate each folder into a new Git repo – git svn clone https://192.168.24.121/svn/product_java/web_wfs/ --no-metadata --authors-file=netmindauthors.txt
  • Create a temporary GitHub repo for each folder and push to GitHub – git remote add origin https://github.com/companyname/tempreponame.git && git push --all
  • Create a new blank Git repo locally – mkdir productA && cd productA && git init
  • Add a remote for each of the folders – git remote add TEMP-UI-1 https://github.com/CompanyName/TEMP-UI-1.git
  • Fetch the code from the remotes into the new blank repo – git fetch --all
  • Merge each remote into the new repo – git merge TEMP-UI-1/master --allow-unrelated-histories (without the --allow-unrelated-histories flag, Git refuses to merge histories that share no common ancestor)
  • Ensure the repo folder layout looks as expected. This required some `mkdir` and `git mv` commands followed by committing the changes, so the single repo looked like the first column in the above table.
  • Push the new repo to GitHub – git remote add origin https://github.com/CompanyName/ProductA.git && git push --all
  • Remove the TEMP repos from GitHub.
  • TEST!!!

This worked. It gave us a newly consolidated repo, consisting of all the components required to build the product, pulled from different areas of SVN, complete with history. It was all relatively quick too. I am UK based, the SVN repo is based in India, and I connect via a VPN that backhauls through corporate HQ in the US (so hardly the quickest route!!), although the repos are not the largest. Each product took approximately 4 hours from start to finish.
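The consolidation steps above can be rehearsed locally before touching SVN or GitHub. The following is a minimal sketch, not the exact procedure we ran: it assumes throwaway local repos in place of the git-svn clones and temporary GitHub repos, uses made-up file names, moves each component into its folder straight after its merge (to avoid name clashes at the root), and starts the target repo with an empty commit so that every merge joins two unrelated histories.

```shell
#!/usr/bin/env bash
# Local rehearsal of the consolidation approach. All repo and file names are
# placeholders; local paths stand in for the temporary GitHub remotes.
set -euo pipefail

work=$(mktemp -d)
cd "$work"

# Identity used only for the demo commits (hypothetical values).
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

# Stand-in for a per-folder repo produced by `git svn clone`.
make_source() {
  git init -q "$1"
  git -C "$1" symbolic-ref HEAD refs/heads/master
  echo "$1 source" > "$1/$1.txt"
  git -C "$1" add .
  git -C "$1" commit -q -m "Import $1"
}
make_source TEMP-UI-1
make_source TEMP-DBS-1

# New repo that will become the consolidated product repo.
git init -q productA
cd productA
git symbolic-ref HEAD refs/heads/master
git commit -q --allow-empty -m "Initial commit"

for name in TEMP-UI-1 TEMP-DBS-1; do
  folder=${name#TEMP-}                      # TEMP-UI-1 -> UI-1
  git remote add "$name" "$work/$name"
  git fetch -q "$name"
  # The histories share no ancestor, so the flag is required here.
  git merge -q --allow-unrelated-histories -m "Merge $name" "$name/master"
  # Move the merged content into its own top-level folder.
  mkdir "$folder"
  git mv "$name.txt" "$folder/"
  git commit -q -m "Move $name into $folder/"
done

git log --oneline --graph
```

Running `git log --follow` on a moved file afterwards is a quick way to confirm that its pre-merge history is still reachable.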

Team B

The second team had a less complicated migration requirement, but had a much more stringent need for complete history, including merge history. The existing SVN repo structure for this product was slightly simpler, although certainly not without issues of its own. The SVN server was an old, long-lived implementation that was not structured in an optimal way. Rather than the SVN – Project – Trunk/Branch/Tag structure, the SVN structure had Trunk/Branch/Tag at the top level and all products listed under each folder, and there are MANY products homed on this repo. For Team B, the structure was as follows:

SVN Repo
   -  Trunk
     - Product Name
        - Subdir 1
        - Subdir 2
        - Subdir 3
        - Subdir n
   - Branches
     - Product Name
        - Branch Name
           - Subdir 1
           - Subdir 2
           - Subdir 3
           - Subdir n
   - Tags
     - Product Name
        - Tag Name
           - Subdir 1
           - Subdir 2
           - Subdir 3
           - Subdir n

The solution for Team B was to split their product into 4 Git repos. 3 of the new repos would each contain a single Subdir from the SVN repo, with the final repo containing everything else. Upon migrating the repo for the sandbox, we discovered that git-svn had not brought over the merge history, which led us to research alternatives and settle on Subgit. Subgit is similar to git-svn in principle; on our repo it was much slower, but it brought over the full merge history, which is what the team required. Subgit uses similar command options to git-svn, including an authors file for mapping SVN users to GitHub users. All we needed to do was identify the folder locations for each component (Trunk, Branches and Tags) and Subgit would perform the slow but thorough migration. Once it completed, we could use BFG Repo Cleaner to remove the history of the folders we had put in their own repos. For the first 3 repos, the process was straightforward and looked similar to the below:

  • Migrate repo using Subgit (this took 26 hours on my Mac and 9 hours on a HUGE AWS instance) – subgit import --default-domain company.com --authors-file ~/svn/authors.txt --trunk trunk/product --branches branches/product --tags tags/product --username USERNAME --password PASSWORD --non-interactive --trust-server-cert --svn-url svn://svnrepo/ product.git
  • Convert the newly created bare repo to a working directory – git config --local --bool core.bare false and git reset --hard
  • Remove unneeded folders – git rm -rf dq+tests/ dq+doc/ dq+deployment/
  • Rewrite history (quote the pattern so the shell passes it to BFG unexpanded) – java -jar bfg.jar --delete-folders "{dq+tests,dq+doc,dq+deployment}" --no-blob-protection
  • Remove JARs from history (this repo had a lot of compiled binaries, which have since been migrated to Artifactory) – java -jar bfg.jar -D '*.jar' --no-blob-protection
  • Refresh repo after rewrite – git reflog expire --expire=now --all && git gc --prune=now --aggressive
  • Commit changes – git commit -m "Consolidating Repos"
  • Add new remote – git remote add origin https://github.com/company/product.git
  • Upload repo – git push --all
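Both the git-svn and Subgit commands above take an authors file, which maps each SVN username to a Git identity, one `svnuser = Name <email>` entry per line. As a hedged sketch of one way to bootstrap such a file, the helper below turns `svn log --quiet` output into draft entries. The function name, the sample users and the assumption that the email is simply `user@domain` are all illustrative; real names and addresses would need filling in by hand.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Reads `svn log --quiet` output on stdin and prints one draft authors-file
# line per distinct SVN user. The email domain is passed in by the caller.
make_authors() {
  local domain=$1
  awk -F'|' -v domain="$domain" '
    /^r[0-9]+ \|/ {
      user = $2
      gsub(/^ +| +$/, "", user)   # trim the padding around the username
      if (user != "") print user " = " user " <" user "@" domain ">"
    }' | sort -u
}

# Sample input in the shape `svn log --quiet` prints (made-up users).
sample_log='------------------------------------------------------------------------
r3 | asmith | 2019-03-01 10:15:00 +0000 (Fri, 01 Mar 2019)
------------------------------------------------------------------------
r2 | bjones | 2019-02-27 09:00:00 +0000 (Wed, 27 Feb 2019)
------------------------------------------------------------------------
r1 | asmith | 2019-02-26 16:42:00 +0000 (Tue, 26 Feb 2019)
------------------------------------------------------------------------'

printf '%s\n' "$sample_log" | make_authors company.com
```

Against a live server the input would come from something like `svn log --quiet <repo-url> | make_authors company.com > authors.txt`.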

The final repo included some large video files and many PDF files. For this we decided to utilise Git LFS to reduce the size of the converted repo. Git LFS was completely new to me, but with the help of BFG Repo Cleaner it was pretty easy to set up. The migration was still the same as above, but without the final step of pushing the code to GitHub. To implement LFS we took the following steps:

  • Install Git LFS on the host – https://help.github.com/en/articles/installing-git-large-file-storage
  • Install LFS in the repo – git lfs install
  • Track the required files with Git LFS (quote the patterns so the shell does not expand them) – git lfs track "*.m4v" and git lfs track "*.pdf"
  • Add the files and commit. This will add a .gitattributes file containing the LFS info.
  • Use BFG Repo Cleaner to rewrite history for LFS files – java -jar /gitrepos/bfg.jar --convert-to-git-lfs "*.{pdf,m4v}" --no-blob-protection
  • Refresh repo and expire dead links – git reflog expire --expire=now --all && git gc --prune=now --aggressive
  • Push to repo – git push --all (this will upload the files to LFS and push the code to the repo. The PDF and M4V files will be replaced with pointer files).
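To make the tracking step a little less opaque: `git lfs track` records its patterns in .gitattributes, and plain Git can report which paths those patterns catch. The sketch below writes the two entries by hand in a scratch repo (so it runs without Git LFS installed) and queries them with `git check-attr`. The file paths and the `lfs_filter` helper name are made up for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Scratch repo, used only to evaluate attribute patterns.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

# The entries that `git lfs track "*.pdf"` and `git lfs track "*.m4v"` record.
cat > .gitattributes <<'EOF'
*.pdf filter=lfs diff=lfs merge=lfs -text
*.m4v filter=lfs diff=lfs merge=lfs -text
EOF

# Print the filter Git would apply to a path; the file need not exist.
lfs_filter() {
  git check-attr filter -- "$1" | awk -F': ' '{print $3}'
}

for path in docs/manual.pdf media/intro.m4v src/Main.java; do
  printf '%s -> %s\n' "$path" "$(lfs_filter "$path")"
done
```

Paths that report `lfs` will be stored as pointer files; anything reporting `unspecified` stays in the normal Git object store.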

After this was complete we had 4 repos with clean history, one of which utilised LFS for large file storage.
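One way to sanity-check that a history rewrite really removed a folder or file type is to ask Git whether any commit on any ref still touches that path. This is a sketch of such a check rather than something from our original runbook; the `path_in_history` helper and the demo repo contents are hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Succeeds (exit 0) if any commit reachable from any ref touches the path.
path_in_history() {
  local repo=$1 path=$2
  [ -n "$(git -C "$repo" log --all --format=%H -- "$path")" ]
}

# Throwaway repo to exercise the check (names are made up).
demo=$(mktemp -d)
git init -q "$demo"
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"
mkdir "$demo/src"
echo code > "$demo/src/app.txt"
git -C "$demo" add .
git -C "$demo" commit -q -m "Add src"

for path in src lib/old.jar; do
  if path_in_history "$demo" "$path"; then
    echo "$path: still in history"
  else
    echo "$path: not in history"
  fi
done
```

On a migrated repo, the same function could be pointed at the folders passed to `--delete-folders`, or at a pathspec such as `'*.jar'`, before pushing.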

When cloning a repo containing Git LFS files, by default all the files will be pulled down locally as part of the clone. If you do not want this, you can use the command GIT_LFS_SKIP_SMUDGE=1 git clone https://github.com/companyname/repo.git.

