Friday, 28 April 2017

How can we preserve Google Documents?

Last month I asked (and tried to answer) the question How can we preserve our wiki pages?

This month I am investigating the slightly more challenging issue of how to preserve native Google Drive files, specifically documents*.

Why?

At the University of York we work a lot with Google Drive. We have the G Suite for Education (formerly known as Google Apps for Education) and as part of this we have embraced Google Drive and it is now widely used across the University. For many (me included) it has become the tool of choice for creating documents, spreadsheets and presentations. The ability to share documents and collaborate directly is key.

So of course it is inevitable that at some point we will need to think about how to preserve them.

How hard can it be?

Quite hard actually.

The basic problem is that documents created in Google Drive are not really "files" at all.

The majority of the techniques and models that we use in digital preservation are based around the fact that you have a digital object that you can see in your file system, copy from place to place and package up into an Archival Information Package (AIP).

In the digital preservation community we're all pretty comfortable with that way of working.

The key challenge with stuff created in Google Drive is that it doesn't really exist as a file.

Always living in hope that someone has already solved the problem, I asked the question on Twitter and that really helped with my research.

Isn't the digital preservation community great?

Exporting Documents from Google Drive

I started off testing the different download options available within Google docs. For my tests I used 2 native Google documents. One was the working version of our Phase 1 Filling the Digital Preservation Gap report. This report was originally authored as a Google doc, was 56 pages long and consisted of text, tables, images, footnotes, links, formatted text, page numbers, colours etc (i.e. lots of significant properties I could assess). I also used another, simpler document for testing - this one was just basic text and tables but also included comments by several contributors.

I exported both of these documents into all of the different export formats that Google supports and assessed the results, looking at each characteristic of the document in turn and establishing whether or not I felt it was adequately retained.

Here is a summary of my findings, looking specifically at the Filling the Digital Preservation Gap phase 1 report document:

  • docx - This was a pretty good copy of the original. It retained all of the key features of the report that I was looking for (images, tables, footnotes, links, colours, formatting etc), however, the 56 page report was now only 55 pages (in the original, page 48 was blank, but in the docx version this blank page wasn't there).
  • odt - Again, this was a good copy of the original and much like the docx version in terms of the features it retained. However, the 56 page report was now only 54 pages long. Again it omitted page 48, which was blank in the Google version, but slightly more words were also squeezed on to each page, which meant that it comprised fewer pages. Initially I thought the quality of the images was degraded slightly but this turned out to be just the way they appeared to render in LibreOffice. Looking inside the actual odt file structure and viewing the images as files demonstrated to me that they were actually OK.
  • rtf - First of all it is worth saying that the Rich Text Format file was *massive*. The key features of the document were retained, although the report document was now 60 pages long instead of 56!
  • txt - Not surprisingly this produces a tiny file that retains only the text of the original document. Obviously the original images, tables, colours, formatting etc were all lost. About the only other notable feature that was retained was the footnotes, and these appeared together right at the end of the document. Also a txt file does not have a number of 'pages'... not until you print it at least.
  • pdf - This was a good copy of the original report and retained all the formatting and features that I was looking for. This was also the only copy of the report that had the right number of pages. However, it seems that this is not something we can rely on. A close comparison of the pages of the pdf compared with the original shows that there are some differences regarding which words fall on to which page - it isn't exact!
  • epub - Many features of the report were retained but like the text file it was not paginated and the footnotes were all at the end of the document. The formatting was partially retained - the images were there, but were not always placed in the same positions as in the original. For example on the title page, the logos were not aligned correctly. Similarly, the title on the front page was not central.
  • html - This was very similar to the epub file regarding what was and wasn't retained. It included footnotes at the end and had the same issues with inconsistent formatting.
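The check I describe for the odt export (looking inside the file structure and viewing the images as files) can be sketched in a few lines of Python, since an odt file is a zip package. This is only a minimal sketch: the function name is my own, and it assumes the export stores its images under a Pictures/ folder, which is the usual OpenDocument convention but not something I have confirmed for every Google export.

```python
import zipfile


def list_odt_images(odt_path, image_prefix="Pictures/"):
    """Return (name, size_in_bytes) for each embedded image in an odt package.

    Assumption: images sit under `image_prefix` inside the zip. Adjust the
    prefix if the package you are inspecting is laid out differently.
    """
    with zipfile.ZipFile(odt_path) as package:
        return [
            (info.filename, info.file_size)
            for info in package.infolist()
            if info.filename.startswith(image_prefix) and not info.is_dir()
        ]
```

Calling list_odt_images("report.odt") (a hypothetical file name) gives a quick inventory of the embedded images, which can then be extracted and viewed outside LibreOffice to judge their quality.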

...but what about the comments?

My second test document was chosen so I could look specifically at the comments feature and how these were retained (or not) in the exported version.

  • docx - Comments are exported. On first inspection they appear to be anonymised, however this seems to be just how they are rendered in Microsoft Word. Having unzipped and dug into the actual docx file and looked at the XML file that holds the comments, it is clear that a more detailed level of information is retained - see images below. The placement of the comments is not always accurate. In one instance the reply to a comment is assigned to text within a subsequent row of the table rather than to the same row as the original comment.
  • odt - Comments are included, are attributed to individuals and have a date and time. Again, matching up of comments with the right section of text is not always accurate - in one instance a comment and its reply are linked to the table cell underneath the one that they referenced in the original document.
  • rtf - Comments are included but appear to be anonymised when displayed in MS Word...I haven't dug around enough to establish whether or not this is just a rendering issue.
  • txt - Comments are retained but appear at the end of the document with an [a], [b] etc prefix - these letters appear in the main body text to show where the comments appeared. No information about who made the comment is preserved.
  • pdf - Comments not exported
  • epub - Comments not exported
  • html - Comments are present but appear at the end of the document with a code which also acts as a placeholder in the text where the comment appeared. References to the comments in the text are hyperlinks which take you to the right comment at the bottom of the document. There is no indication of who made the comment (not even hidden within the html tags).

A comment in original Google doc

The same comment in docx as rendered by MS Word

...but in the XML buried deep within the docx file structure - we do have attribution and date/time
(though clearly in a different time zone)
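For anyone who wants to repeat this without unzipping the docx by hand, here is a minimal Python sketch of the same check. It relies on the standard layout of a docx package, where comments live in word/comments.xml and each comment element carries author and date attributes. The function name is my own and I have only sketched the basics: it reads the plain text of each comment and does not try to work out which part of the document the comment is attached to.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements and attributes.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def read_docx_comments(docx_path):
    """Return a list of dicts with the author, date and text of each comment."""
    with zipfile.ZipFile(docx_path) as package:
        if "word/comments.xml" not in package.namelist():
            return []  # the document was exported without any comments
        root = ET.fromstring(package.read("word/comments.xml"))

    comments = []
    for comment in root.iter(f"{{{W_NS}}}comment"):
        # Join all text runs inside the comment into a single string.
        text = "".join(t.text or "" for t in comment.iter(f"{{{W_NS}}}t"))
        comments.append({
            "author": comment.get(f"{{{W_NS}}}author"),
            "date": comment.get(f"{{{W_NS}}}date"),
            "text": text,
        })
    return comments
```

The date comes back exactly as it is stored in the XML, so any time zone difference from what Google Docs displays will still need to be interpreted by a person.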

What about bulk export options?

Ed Pinsent pointed me to the Google Takeout Service which allows you to:
"Create an archive with your data from Google products"
[Google's words not mine - and perhaps this is a good time to point you to Ed's blog post on the meaning of the term 'Archive']

This is really useful. It allows you to download Google Drive files in bulk and to select which formats you want to export them as.

I tested this a couple of times and was surprised to discover that if you select pdf or docx (and perhaps other formats that I didn't test) as your export format of choice, the takeout service creates the file in the format requested and an html file which includes all comments within the document (even those that have been resolved). The content of the comments/responses including dates and times is all included within the html file, as are names of individuals.

The downside of the Google Takeout Service is that it only allows you to select folders and not individual files. There is another incentive for us to organise our files better! The other issue is that it will only export documents that you are the owner of - and you may not own everything that you want to archive!

What's missing?

Quite a lot actually.

The owner, creation and last modified dates of a document in Google Drive are visible when you click on Document details... within the File menu. Obviously this is really useful information for the archive but is lost as soon as you download it into one of the available export formats.

Creation and last modified dates as visible in Document details

Update: I was pleased to see that if using the Google Takeout Service to bulk export files from Drive, the last modified dates are retained, however on single file export/download these dates are lost and the last modified date of the file becomes the date that you carried out the export. 
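One possible workaround for single file downloads, sketched below in Python, is to note the last modified date from Document details by hand and then stamp it back on to the exported file. This is an assumption-laden sketch rather than a recommendation: the function name and the example date are hypothetical, the date has to be typed in ISO 8601 form, and a file system timestamp set this way is only as trustworthy as the person who recorded it.

```python
import os
from datetime import datetime, timezone


def restore_modified_time(path, modified_iso):
    """Set a file's last modified time from an ISO 8601 date string.

    `modified_iso` is the date recorded manually from Document details,
    for example "2017-03-14T10:30:00+00:00" (a made-up value). Dates given
    without a time zone are assumed to be UTC.
    """
    modified = datetime.fromisoformat(modified_iso)
    if modified.tzinfo is None:
        modified = modified.replace(tzinfo=timezone.utc)
    timestamp = modified.timestamp()
    # os.utime takes (access time, modified time); both are set to the same value.
    os.utime(path, (timestamp, timestamp))
    return timestamp
```

It would be sensible to record the same date in a separate metadata file too, since file system dates are easily overwritten when files are copied or moved.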

But of course in a Google document there is more metadata. Similar to the 'Page History' that I mentioned when talking about preserving wiki pages, a Google document has a 'Revision history'.

Part of the revision history of my Google doc

Again, this *could* be useful to the archive. Perhaps not so much so for my document which I worked on by myself in March, but I could see more of a use case for mapping and recording the creative process of writing a novel for example. 

Having this revision history would also allow you to do some pretty cool stuff such as that described in this blog post: How I reverse engineered Google Docs to play back any document's Keystrokes (thanks to Nick Krabbenhoft for the link).

It would seem that the only obvious way to retain this information would be to keep the documents in their original native Google format within Google Drive but how much confidence do we have that it will be safe there for the long term?

Conclusions

If you want to preserve a Google Drive document there are several options but no one-size-fits-all solution.

As always it boils down to what the significant properties of the document are. What is it we are actually trying to preserve?

  • If we want a fairly accurate but non-interactive digital 'print' of the document, pdf might be the most accurate representation, though even the pdf export can't be relied on to retain the exact pagination. Note that I didn't try to validate the pdf files that I exported and sadly there is no pdf/a export option.
  • If comments are seen to be a key feature of the document then docx or odt will be a good option but again this is not perfect. With the test document I used, comments were not always linked to the correct point within the document.
  • If it is possible to get the owner of the files to export them, the Google Takeout Service could be used, perhaps creating a pdf version of the static document along with a separate html file to capture the comments.

A key point to note is that all export options are imperfect so it would be important to check the exported document against the original to ensure it accurately retains the important features.

Another option would be simply keeping them in their native format but trying to get some level of control over them - taking ownership and managing sharing and edit permissions so that they can't be changed. I've been speaking to one of our Google Drive experts in IT about the logistics of this. A Google Team Drive belonging to the Archives could be used to temporarily store and lock down Google documents of archival value whilst we wait and see what happens next. 

...I live in hope that export options will improve in the future.

This is a work in progress and I'd love to find out what others think.




* note, I've also been looking at Google Sheets and that may be the subject of another blog post

10 comments:

  1. I had some good results web-archiving Google Docs with Webrecorder, which retains lots of detail. https://webrecorder.io/ At least for "public" documents this is fine, I am not sure about private ones though. Check the introduction video, it shows a session with a Google Doc being archived.

    If you want to give Webrecorder a try, I'm happy to help.

    1. Thanks Dragan - I think I saw you talk about webrecorder at PASIG? Looked really impressive. Will take another look.

    2. So am I right in thinking you could capture the revision history and all versions of the document, but only if you clicked on each link within the revision history and could you capture the document details (author, creation date) if you clicked on File...document details whilst you were recording? Thanks.

    3. Yes, you would need to direct the Google Docs web app into the states you'd like to access again later.

      Right now this means manually clicking the links.

  2. Hi Jen

    A very thorough and practical piece of work...

    One issue that concerns me is the loss of Creation Date and Last Modified Date.

    I believe a very similar issue can also happen in SharePoint, the MS system that enables better collaboration and sharing of content. If you copy or move content, there’s a tendency for the new copy to pick up the migration date and lose the original creation date. I’m not sure if it’s resolved yet; see for instance https://support.share-gate.com/hc/en-us/articles/115000601427-SharePoint-objects-Creation-Date-is-not-migrated

    As you point out, these dates are important for the archivist. For the records manager, I would argue they are crucial metadata elements. I mention it in the context of SharePoint because (a) some documents managed in SharePoint stand a chance of becoming electronic records at some point, and (b) there are some records managers using SharePoint, or intending to use SharePoint, as an electronic records management system.

    I say they are crucial for a records manager, because creation / modified dates are evidential elements which increase the authenticity of the record. They also have a part to play in the audit trail of the record.

    I mention all this just to widen out the scope of your excellent blog post. Does this date loss aspect affect other types of cloud-based software? If it does, perhaps electronic records managers need to take note.

    1. Yes I agree. Retaining the original dates is key.

      Actually, one thing I didn't mention in the blog post but I should have done is that when you do an export using Google Takeout Service LAST MODIFIED DATES *ARE* RETAINED. Sorry for shouting but I thought that was worth highlighting. So, if Google can do it for a bulk export then they should be able to do it for single file download? I have raised a new feature request with them about this so we will see what happens.

    2. I am late to the party on this I realise but I have a bit of a sinking feeling that the date loss issue is also happening with files we get deposited into Pure (our research information system which we are using as a repository). It seems OK with zipped files but if they aren't then I think it is defaulting to date of upload. I suspect as Ed suggests it's a widespread problem of electronic information systems so we might need to consider what the implications of this are...

  3. Off the cuff idea. Have you tried round tripping a document with comments from Google Docs to .docx and then reimporting it back into Google Docs? I'm curious if it will retain the comment attributes.

    1. I hadn't ...but that is an interesting idea. Just tried it now. Imported docx and odt with comments back into Google Drive. On first inspection it seems to work very well. Comments appear as they did in the original Google document, with a line that says "From imported document" underneath each one. Even the comment that had been assigned to the wrong row of the table when displayed in MS Word or LibreOffice was back in the right place. However interesting to note that something has gone wrong with the date/time stamp of all the comments so the "Will do" comment that had the date 6th Dec at 15:21 in the original Google doc, now has the date 15th Feb at 17:41. All comments in that particular thread have now been assigned this date. On the original document the final comment on this thread is from 15th Feb at 17:41 so I can see what has happened here. I will report this as a bug to Google!
