The Challenge:

A chemical and digital photo lab enlisted our help in updating their existing backup system to make it more cost-effective and efficient. They had roughly 20 TB of data spread across several different servers with multiple RAIDs. The backup system was scanning each of the RAIDs and copying any new or modified files and folders across the network to a tape library. There were several issues with this system. First, the lab was using up to 10 tapes per week at roughly $45 per tape, or up to $450 per week. Second, when a large influx of data arrived, the backup was too slow to finish overnight and would continue running into the next business day, compromising network and server speed. Third, to ensure the integrity of the backup system, the lab would start a new set of tapes every three months; this process consumed a large number of tapes and took almost five days to complete, which meant there was no regular backup for four of those days.1

The Solution:

First off, we worked with the lab team to determine that there were three distinct kinds of data coming from three different types of clients, each with a different possible “shelf life.” The first was very transient, short-lived data from walk-in customers looking to have images or entire rolls of film scanned, usually with some additional photo-retouching needs. The workflow for this kind of data meant it was usually in the lab for less than a week, and the customer would not be paying for long-term storage. The second type was from customers bringing in 300 GB to 1 TB of data from larger-scale photo shoots that needed to be batch processed and possibly color corrected and/or retouched. Storage time for this type could be up to 30 days in the lab. Finally, the last kind of data was from customers paying for the long-term storage of millions of images, allowing them to call up and order prints on demand.

With the life cycle of the data understood, we proceeded to design a backup method that treated each of the three data types specifically, rather than the existing approach of pulling all of it through a single network connection to the backup server and then writing it all to expensive tape. With our new system, far less tape would be consumed and the data would no longer have to cross the network, since each server would back up the data it was hosting.

To achieve this, we developed new policies for the three data types. The first kind of data, from customers who did not have long-term storage needs, was now backed up to three rotating RAID sets instead of tape. Each set would be used for a week in a rotating pattern of Backup A, Backup B, and Backup C. On the 22nd day, the system would go back to Backup A and reuse it, overwriting the existing backup data. The end result was fully automated backup retention of 14 to 21 days, which is more than enough time given the nature of the data and its expected life cycle. Additionally, since no tapes needed to be swapped, this cut overall tape usage by roughly 50 percent.
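For illustration, the weekly rotation can be expressed as a simple date calculation. The following Python sketch assumes the cycle simply advances by calendar day from a fixed start date; the set names and the start date are placeholders for the example, not the lab's actual configuration.

```python
from datetime import date

# Illustrative sketch of the weekly Backup A / B / C rotation.
# ROTATION_START marks day 1 of the 21-day cycle; the date is an
# assumption for the example, not the lab's real schedule.
ROTATION_START = date(2024, 1, 1)
BACKUP_SETS = ["Backup A", "Backup B", "Backup C"]
CYCLE_DAYS = 21

def active_backup_set(today: date) -> str:
    """Return the RAID set that receives (and is overwritten by) tonight's backup."""
    day_in_cycle = (today - ROTATION_START).days % CYCLE_DAYS  # 0..20
    return BACKUP_SETS[day_in_cycle // 7]                      # one set per week

if __name__ == "__main__":
    # The set being reused always holds the oldest copies, so data remains
    # recoverable for at least 14 and at most 21 days before it is overwritten.
    print(active_backup_set(date.today()))
```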

The second kind of data, from larger photo shoots, was also backed up with rotating A-B-C RAIDs, but because the customers often wanted to go back and revisit this data later, it was also written out to tape at the end of the A-B-C rotation. Again, the vast majority of this was completely automated, and tape usage was further reduced by roughly 25 percent.2
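Conceptually, this second policy is the same rotation with one extra step when the cycle completes. A minimal sketch follows, again with a placeholder start date and hypothetical backup and tape-export hooks standing in for the lab's actual tooling.

```python
from datetime import date

# Same 21-day A-B-C rotation as above, plus a tape export when the
# rotation completes.  The start date and the hooks below are
# hypothetical placeholders, not the lab's actual tooling.
ROTATION_START = date(2024, 1, 1)
CYCLE_DAYS = 21

def end_of_rotation(today: date) -> bool:
    """True on the last day of Backup C's week (day 21 of the cycle)."""
    return (today - ROTATION_START).days % CYCLE_DAYS == CYCLE_DAYS - 1

def run_disk_backup(today: date) -> None:
    print(f"{today}: copying new and changed files to the active RAID set")

def queue_tape_export(today: date) -> None:
    print(f"{today}: writing the completed rotation out to the attached tape library")

def nightly_job(today: date) -> None:
    run_disk_backup(today)
    if end_of_rotation(today):
        queue_tape_export(today)

if __name__ == "__main__":
    nightly_job(date.today())
```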

The final data set, requiring long-term storage, was handled in a slightly different manner. Because this kind of data was live in the lab for months on end and didn't change very often, it was critical to ensure it was protected in case of fire, earthquake, and so on. Therefore, it needed to be copied to tape and taken off-site on a regular basis. Since scanning such large volumes of data and writing it to tape can take a significant amount of time, the main concern was that the process would not finish while the lab was closed and would then impact the workflow the following morning. To avoid this, we designed a staged backup system in which, each night and in only a few hours, the RAIDs containing the live data were synchronized to a matching set of RAIDs.3 We would then have the backup system scan the secondary set of RAIDs and, from there, copy the data onto tape. This meant the lab could be working off the primary RAIDs at full speed while the backup system was making a copy of the data from the secondary RAIDs.4
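For the geekishly inclined, here is a minimal sketch of how such a staged nightly job might be scripted, using rsync as described in note 3. The mount points and the tape-backup step are illustrative placeholders, not the lab's actual setup.

```python
import subprocess

# Minimal sketch of the staged backup described above: the live (primary)
# RAIDs are mirrored to matching secondary RAIDs each night, and the tape
# job then reads only from the secondaries.  The mount points below are
# illustrative placeholders.
RAID_PAIRS = [
    ("/Volumes/Primary-01/", "/Volumes/Secondary-01/"),
    ("/Volumes/Primary-02/", "/Volumes/Secondary-02/"),
]

def sync_pair(source: str, destination: str) -> None:
    """Copy only new and changed files to the secondary RAID (see note 3)."""
    subprocess.run(
        ["rsync", "--archive", "--delete", source, destination],
        check=True,
    )

def write_secondary_to_tape(destination: str) -> None:
    # Placeholder: in the real setup, the backup application driving the
    # directly attached tape library (see note 4) would be pointed here.
    print(f"Queue tape backup of {destination}")

def nightly_staged_backup() -> None:
    # Stage 1: bring the secondary RAIDs up to date in a few hours.
    for source, destination in RAID_PAIRS:
        sync_pair(source, destination)
    # Stage 2: the tape job scans only the secondaries, so the lab can keep
    # working off the primary RAIDs at full speed.
    for _, destination in RAID_PAIRS:
        write_secondary_to_tape(destination)

if __name__ == "__main__":
    nightly_staged_backup()
```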

The End Result:

Because our new design was far more automated and based primarily on hard drives rather than tape, the lab's system was now more efficient and more reliable, required far fewer man-hours to manage, and no longer burdened the network and file servers during business hours. In the end, and although a significant initial investment was needed, the data was now being backed up in a manner appropriate for its type, tape consumption was reduced by roughly 75 percent, and the projected cost savings across a four-year expected hardware life span meant the new system would pay for itself in roughly 18 months.

Additional Notes for the Geekishly Inclined

1. They were creating a new set each quarter to ensure that, in the event of a large-scale restore, they would not be trying to restore data spanning hundreds of tapes, which would result in the restore process taking days.

2. This tape library was directly attached to the server hosting this data, ensuring the data could be written to tape during the hours the lab was closed, thus preventing the backup process from impeding the workflow during business hours.

3. We used rsync-based scripts to scan the live data set and sync only the new and changed data to the matching backup RAIDs.

4. In this scenario, we again attached the tape library directly to the server to ensure the highest possible speed when writing the data to tape. The tapes were then taken off-site on a weekly basis.