Archive for category libraries

[2b2k] Crowdsourcing transcription

[This article is also posted at Digital Scholarship@Harvard.]

Marc Parry has an excellent article at the Chronicle of Higher Ed about using crowdsourcing to make archives more digitally useful:

Many people have taken part in crowdsourced science research, volunteering to classify galaxies, fold proteins, or transcribe old weather information from wartime ship logs for use in climate modeling. These days humanists are increasingly throwing open the digital gates, too. Civil War-era diaries, historical menus, the papers of the English philosopher Jeremy Bentham—all have been made available to volunteer transcribers in recent years. In January the National Archives released its own cache of documents to the crowd via its Citizen Archivist Dashboard, a collection that includes letters to a Civil War spy, suffrage petitions, and fugitive-slave case files.

Marc cites an article [full text] in Literary & Linguistic Computing that found that team members could have completed the transcription of works by Jeremy Bentham faster if they had devoted themselves to that task instead of managing the crowd of volunteer transcribers. Here are some more details about the project and its negative finding, based on the article in L&LC.

The project was supported by a grant of £262,673 from the Arts and Humanities Research Council, for 12 months, which included the cost of digitizing the material and creating the transcription tools. The end result was text marked up with TEI-compliant XML that can be easily interpreted and rendered by other apps.

During a six-month period, 1,207 volunteers registered, who together transcribed 1,009 manuscripts. 21% of those registered users actually did some transcribing. 2.7% of the transcribers produced 70% of all the transcribed manuscripts. (These numbers refer to the period before the New York Times publicized the project.)

Of the manuscripts transcribed, 56% were “deemed to be complete.” But the team was quite happy with the progress the volunteers made:

Over the testing period as a whole, volunteers transcribed an average of thirty-five manuscripts each week; if this rate were to be maintained, then 1,820 transcripts would be produced every twelve months. Taking Bentham’s difficult handwriting, the complexity and length of the manuscripts, and the text-encoding into consideration, the volume of work carried out by Transcribe Bentham volunteers is quite remarkable

Still, as Marc points out, two Research Associates spent considerable time moderating the volunteers and providing the quality control required before certifying a document as done. The L&LC article estimates that RAs could have transcribed 400 transcripts per month, 2.5x faster than the pace of the volunteers. But the volunteers got better as they gained experience, and improvements to the transcription software might make quality control less of an issue.
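As a rough check on those figures, the arithmetic can be sketched as below. The only inputs are the two rates quoted above (35 manuscripts a week for the volunteers, 400 a month for the RAs). The variable names, and the assumption of a 52-week year split into 12 equal months, are mine rather than the article's.

```python
# Back-of-the-envelope comparison of the two transcription rates quoted above.
# Assumption (mine, not the article's): a 52-week year divided into 12 equal months.

VOLUNTEER_PER_WEEK = 35   # average volunteer output during the testing period
RA_PER_MONTH = 400        # estimated Research Associate output per month

volunteer_per_year = VOLUNTEER_PER_WEEK * 52
volunteer_per_month = volunteer_per_year / 12
speedup = RA_PER_MONTH / volunteer_per_month

print(f"Volunteers: {volunteer_per_year} per year, about {volunteer_per_month:.0f} per month")
print(f"RAs: {RA_PER_MONTH} per month, about {speedup:.1f}x the volunteer pace")
```

Under those assumptions the volunteers come out at 1,820 a year, matching the article's own figure, or about 152 a month. That puts the RA estimate at roughly 2.6 times the volunteer pace, consistent with the "2.5x" above.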

The L&LC article suggests two additional reasons why the project might be considered a success. First, it generated lots of publicity about the Bentham collection. Second, “no funding body would ever provide a grant for mere transcription alone.” But both of these reasons depend upon crowdsourcing being a novelty. At some point, it will not be.

Based on the Bentham project’s experience, it seems to me there are a few plausible possibilities for crowdsourcing transcription to become practical: First, as the article notes, if the project had continued, the volunteers might have gotten substantially more productive and more accurate. Second, better software might drive down the need for extensive moderation, as the article suggests. Third, there may be a better way to structure the crowd’s participation. For example, it might be practical to use Amazon Mechanical Turk to pay the crowd to do two or three independent passes over the content, which can then be compared for accuracy. Fourth, algorithmic transcription might get good enough that there’s less for humans to do. Fifth, someone might invent something incredibly clever that increases the accuracy of the crowdsourced transcriptions. In fact, someone already has: reCAPTCHA transcribes tens of millions of words every day. So you never know what our clever species will come up with.
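To make the third possibility a little more concrete, here is a minimal sketch of how two or three independent passes might be compared. It is my illustration, not anything the Bentham project or Mechanical Turk provides: the `reconcile` function, the `min_agree` threshold, and the sample text are all hypothetical, and it assumes the passes line up word for word, which real transcriptions often would not.

```python
from collections import Counter
from itertools import zip_longest


def reconcile(passes, min_agree=2):
    """Merge independent transcriptions of one manuscript, word by word.

    Returns (merged_words, flagged_positions). A position is flagged for
    human review when fewer than `min_agree` passes agree on the word there.
    Hypothetical helper: assumes the passes align word for word.
    """
    merged, flagged = [], []
    tokenized = [p.split() for p in passes]
    # zip_longest pads shorter passes with None, so a missing word
    # shows up as a disagreement rather than being silently dropped.
    for i, words in enumerate(zip_longest(*tokenized)):
        word, votes = Counter(words).most_common(1)[0]
        if word is None or votes < min_agree:
            flagged.append(i)
            merged.append("[?]")
        else:
            merged.append(word)
    return merged, flagged


# Made-up sample input: three passes, one with a misspelling.
passes = [
    "the greatest happiness of the greatest number",
    "the greatest happiness of the greatest number",
    "the greatest happyness of the greatest number",
]
merged, flagged = reconcile(passes)
print(" ".join(merged))
print("positions needing review:", flagged)
```

With three passes, the lone misspelling is outvoted and nothing is flagged. With only two passes, any disagreement falls below the threshold and is marked "[?]" for a moderator. The point is that moderators would only need to look at the flagged positions instead of checking every transcript in full.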

For now, though, the results of the Bentham project cannot be encouraging for those looking for a pragmatic way to generate high-quality transcriptions rapidly.


[2b2k][eim]Digital curation

I’m at the “Symposium on Digital Curation in the Era of Big Data” held by the Board on Research Data and Information of the National Research Council. These liveblog notes cover (in some sense — I missed some folks, and have done my usual spotty job on the rest) the morning session. (I’m keynoting in the middle of it.)

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.

Alan Blatecky [pdf] from the National Science Foundation says science is being transformed by Big Data. [I can't see his slides from the panel at front.] He points to the increase in the volume of data, but we haven’t paid enough attention to the longevity of the data. And, he says, some data is centralized (LHC) and some is distributed (genomics). And, our networks are unable to transport large amounts of data [see my post], making where the data is located quite significant. NSF is looking at creating data infrastructures. “Not one big cloud in the sky,” he says. Access, storage, services — how do we make that happen and keep it leading edge? We also need a “suite of policies” suitable for this new environment.

He closes by talking about the Data Web Forum, a new initiative to look at a “top-down governance approach.” He points positively to the IETF’s “rough consensus and running code.” “How do we start doing that in the data world?” How do we get a balanced representation of the community? This is not a regulatory group; everything will be open source, and progress will be through rough consensus. They’ve got some funding from gov’t groups around the world. (Check for more info.)

Now Josh Greenberg from the Sloan Foundation. He points to the opportunities presented by aggregated Big Data: the effects on social science, on libraries, etc. But the tools aren’t keeping up with the computational power, so researchers are spending too much time mastering tools, plus it can make reproducibility and provenance trails difficult. Sloan is funding some technical approaches to increasing the trustworthiness of data, including in publishing. But Sloan knows that this is not purely a technical problem. Everyone is talking about data science. Data scientist defined: Someone who knows more about stats than most computer scientists, and can write better code than typical statisticians :) But data science needs to better understand stewardship and curation. What should the workforce look like so that the data-based research holds up over time? The same concerns apply to business decisions based on data analytics. The norms that have served librarians and archivists of physical collections now apply to the world of data. We should be looking at these issues across the boundaries of academics, science, and business. E.g., economics work now rests on data from Web businesses, US Census, etc.

[I couldn't liveblog the next two — Michael and Myron — because I had to leave my computer on the podium. The following are poor summaries.]

Michael Stebbins, Assistant Director for Biotechnology in the Office of Science and Technology Policy in the White House, talked about the Administration’s enthusiasm for Big Data and open access. It’s great to see this degree of enthusiasm coming directly from the White House, especially since Michael is a scientist and has worked for mainstream science publishers.

Myron Gutmann, Ass’t Dir of the National Science Foundation, likewise expressed commitment to open access, and said that there would be an announcement in Spring 2013 that in some ways will respond to the recent UK and EC policies requiring the open publishing of publicly funded research.

After the break, there’s a panel.

Anne Kenney, Dir. of Cornell U. Library, talks about the new emphasis on digital curation and preservation. She traces this back at Cornell to 2006 when an E-Science task force was established. She thinks we now need to focus on e-research, not just e-science. She points to Walters and Skinner’s “New Roles for New Times: Digital Curation for Preservation.” When it comes to e-research, Anne points to the need for metadata stabilization, harmonizing applications, and collaboration in virtual communities. Within the humanities, she sees more focus on curation, the effect of the teaching environment, and more of a focus on scholarly products (as opposed to the focus on scholarly process, as in the scientific environment).

She points to Youngseek Kim et al. “Education for eScience Professionals”: digital curators need not just subject domain expertise but also project management and data expertise. [There's lots of info on her slides, which I cannot begin to capture.] The report suggests an increasing focus on people-focused skills: project management, bringing communities together.

She very briefly talks about Mary Auckland’s “Re-Skilling for Research” and Williford and Henry, “One Culture: Computationally Intensive Research in the Humanities and Sciences.”

So, what are research libraries doing with this information? The Association of Research Libraries has a job announcements database. And Tito Sierra did a study last year analyzing 2011 job postings. He looked at 444 job descriptions. 7.4% of the jobs were “newly created or new to the organization.” New mgt level positions were significantly higher, while subject specialist jobs were under-represented.

Anne went through Tito’s data and found 13.5% have “digital” in the title. There were more digital humanities positions than e-science. She posts a list of the new titles jobs are being given, and they’re digilicious. 55% of those positions call for a library science degree.

Anne concludes: It’s a growth area, with responsibilities more clearly defined in the sciences. There’s growing interest in serving the digital humanists. “Digital curation” is not common in the qualifications nomenclature. MLS or MLIS is not the only path. There’s a lot of interest in post-doctoral positions.

Margarita Gregg of the National Oceanic and Atmospheric Administration begins by talking about challenges in the era of Big Data. They produce about 15 petabytes of data per year. It’s not just about Big Data, though. They are very concerned with data quality. They can’t preserve all versions of their datasets, and it’s important to keep track of the provenance of that data.

Margarita directs one of NOAA’s data centers that acquires, preserves, assembles, and provides access to marine data. They cannot preserve everything. They need multi-disciplinary people, and they need to figure out how to translate this data into products that people need. In terms of personnel, they need: Data miners, system architects, developers who can translate proprietary formats into open standards, and IP and Digital Rights Management experts so that credit can be given to the people generating the data. Over the next ten years, she sees computer science and information technology becoming the foundations of curation. There is no currently defined job called “digital curator” and that needs to be addressed.

Vicki Ferrini at the Lamont-Doherty Earth Observatory at Columbia University works on data management, metadata, discovery tools, educational materials, best practice guidelines for optimizing acquisition, and more. She points to the increased communication between data consumers and producers.

As data producers, the goal is scientific discovery: data acquisition, reduction, assembly, visualization, integration, and interpretation. And then you have to document the data (= metadata).

Data consumers: They want data discoverability and access. Increasingly they are concerned with the metadata.

The goal of data providers is to provide access, preservation and reuse. They care about data formats, metadata standards, interoperability, the diverse needs of users. [I've abbreviated all these lists because I can't type fast enough.]

At the intersection of these three domains is the data scientist. She refers to this as the “data stewardship continuum” since it spans all three. A data scientist needs to understand the entire life cycle, have domain experience, and have technical knowledge about data systems. “Metadata is key to all of this.” Skills: communication and organization, understanding the cultural aspects of the user communities, people and project management, and a balance between micro- and macro perspectives.

Challenges: Hard to find the right balance between technical skills and content knowledge. Also, data producers are slow to join the digital era. Also, it’s hard to keep up with the tech.

Andy Maltz, Dir. of the Science and Technology Council of the Academy of Motion Picture Arts and Sciences. AMPAS is about arts and sciences, he says, not about The Business.

The Science and Technology Council was formed in 2005. They have lots of data they preserve. They’re trying to build the pipeline for next-generation movie technologists, but they’re falling behind, so they have an internship program and a curriculum initiative. He recommends we read their study The Digital Dilemma. It says that there’s no digital solution that meets film’s requirement to be archived for 100 years at a low cost. It costs $400/yr to archive a film master vs $11,000 to archive a digital master (as of 2006) because of labor costs. [Did I get that right?] He says collaboration is key.

In January they released The Digital Dilemma 2. It found that independent filmmakers, documentarians, and nonprofit audiovisual archives are loosely coupled, widely dispersed communities. This makes collaboration more difficult. The efforts are also poorly funded, and people often lack technical skills. The report recommends the next gen of digital archivists be digital natives. But the real issue is technology obsolescence. “Technology providers must take archival lifetimes into account.” Also system engineers should be taught to consider this.

He highly recommends the Library of Congress’ “The State of Recorded Sound Preservation in the United States,” which rings an alarm bell. He hopes there will be more doctoral work on these issues.

Among his controversial proposals: Require higher math scores for MLS/MLIS students since they tend to score lower than average on that. Also, he says that the new generation of content creators have no curatorial awareness. Executives and managers need to know that this is a core business function.

Demand side data points: 400 movies/year at 2PB/movie. CNN has 1.5M archived assets, and generates 2,500 new archive objects/wk. YouTube: 72 hours of video uploaded every minute.


  • Show business is a business.

  • Need does not necessarily create demand.

  • The nonprofit AV archive community is poorly organized.

  • Next gen needs to be digital natives with strong math and sci skills.

  • The next gen of executive leaders needs to understand the importance of this.

  • Digital curation and long-term archiving need a business case.


Q: How about linking the monetary value of the metadata to the metadata? That would encourage the generation of metadata.

Q: Weinberger paints a picture of a flexible world of flowing data, and now we’re back in the academic, scientific world where you want good data that lasts. I’m torn.

A: Margarita: We need to look at how that data are being used. Maybe in some circumstances the quality of the data doesn’t matter. But there are other instances where you’re looking for the highest quality data.

A: [audience] In my industry, one person’s outtakes are another person’s director cuts.

A: Anne: In the library world, we say if a little metadata would be great, a lot of it would be great. We need to step away from trying to capture the most to capturing the most useful (since we can’t capture the most). And how do you produce data in a way that’s opened up to future users, as well as being useful for its primary consumers? It’s a very interesting balance that needs to be played. Maybe short-term need is a higher thing and long-term is lower.

A: Vicki: The scientists I work with use discrete data sets, spreadsheets, etc. As we get along we’ll have new ways to check the quality of datasets so we can use the messy data as well.

Q: Citizen curation? E.g., a lot of antiques are curated by being put into people’s attics…Not sure what that might imply as model. Two parallel models?

A: Margarita: We’re going to need to engage anyone who’s interested. We need to incorporate citizen corporation.

Anne: That’s already underway where people have particular interests. E.g., Cornell’s Lab of Ornithology where birders contribute heavily.

Q: What one term will bring people info about this topic?

A: Vicki: There isn’t one term, which speaks to the linked data concept.

Q: How will you recruit people from all walks of life to have the skills you want?

A: Andy: We need to convince people way earlier in the educational process that STEM is cool.

A: Anne: We’ll have to rely to some degree on post-hire education.

Q: My shop produces and integrates lots of data. We need people with domain and computer science skills. They’re more likely to come out of the domains.

A: Vicki: As long as you’re willing to take the step across the boundary, it doesn’t matter which side you start from.

Q: 7 yrs ago in library school, I was told that you need to learn a little programming so that you understand it. I didn’t feel like I had to add a whole other profession on to the one I was studying.


[2b2k] Libraries are platforms?

I’m at the DPLA Plenary meeting, heading toward the first public presentation — a status report — on the prototype DPLA platform we’ve been building at Berkman and the Library Innovation Lab. So, tons of intellectual stimulation, as well as a fair bit of stress.

The platform we’ve been building is a software platform, i.e., a set of data and services offered through an API so that developers can use it to build end-user applications, and so other sites can integrate DPLA data into their sites. But I’ve been thinking for the past few weeks about ways in which libraries can (and perhaps should) view themselves as platforms in a broader sense. I want to write about this more, but here’s an initial set of draft-y thoughts about platforms as a way of framing the library issue.

Libraries are attached to communities, whether local towns, universities, or other institutions. Traditionally, much of their value has been in providing access to knowledge and cultural objects of particular sorts (you know, like books and stuff). Libraries thus have been platforms for knowledge and culture: they provide a reliable, open resource that enables knowledge and culture to be developed and pursued.

As the content of knowledge and culture changes from physical to digital (over time and never completely), perhaps it’s helpful to think about libraries in their abstract sense as platforms. What might a library platform look like in the age of digital networks? (An hour later: Note that this type of platform would be very different from what we’re working on for the DPLA.)

It would give its community open access to the objects of knowledge and culture. It would include physical spaces as a particularly valuable sort of node. But the platform would do much more. If the mission is to help the community develop and pursue knowledge and culture, it would certainly provide tools and services that enable communities to form around these objects. The platform would make public the work of local creators, and would provide contexts within which these works can be found, discussed, elaborated, and appropriated. It would provide an ecosystem in which ideas and conversations flow out and in, weaving objects into local meanings and lives. Of course it would allow the local culture to flourish while simultaneously connecting it with the rest of the world — ideally by beginning with linking it into other local library platforms.

This is obviously not a well-worked out idea. It also contains nothing that hasn’t been discussed for decades now. What I like about it (at least for now) is that a platform provides a positive metaphor for thinking about the value of libraries that helps explain both their traditional value and their opportunity facing the future.

DPLA session beginning. Will post without rereading… (Hat tip to Tim O’Reilly who has been talking about government as a platform for a few years now.) (Later: Also, my friend and DPLA colleague Nate Hill blogged a couple of months ago about libraries as local publishing platforms.)


Library News

Did I ever mention the really useful site Matt Phillips and Jeff Goldenson at the Library Innovation Lab put up a couple of weeks ago? If you are interested in libraries and tech, Library News is a community-supported news site where you’ll find a steady stream of interesting articles. Or, put differently, it’s the Hacker News code redirected at library tech articles.

I have it open all day. Try it. Contribute to it. Go library hacker nuts!


Physical libraries in a digital world

I’m at the final meeting of a Harvard course on the future of libraries, led by John Palfrey and Jeffrey Schnapp. They have three guests in to talk about physical library space.

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.

David Lamberth lays out an idea as a provocation. He begins by pointing out that until the beginning of the 20th century, a library was not a place but only a collection of books. He gives a quick history of Harvard Library. After the library burned down in 1764, the libraries lived in fear of fire, until electric lights came in. The replacement library (Gore Hall) was built out of stone because brick structures need wood on the inside. But stone structures are dank, and many books had to be re-bound every 30 years. Once it filled up, 25-30 of Harvard libraries derived from the search for fireproof buildings, which helps explain the large distribution of libraries across campus. They also developed more than 40 different classification systems. At the beginning of the 20th C, Harvard’s collection was just over one million. Now it adds up to around 18M. [David's presentation was not choppy, the way this paraphrase is.]

In the 1980s, there was continuing debate about what to do about the need for space. The big issue was open or closed stacks. The faculty wanted the books on site so they could be browsed. But stack space is expensive and you tend to outgrow it faster than you think. So, it was decided not to build any more stack space. There already was an offsite repository (New England Book Depository), but it was decided to build a high density storage facility to remove the non-active parts of the collection to a cheaper, off-site space: The Harvard Depository (HD).

Now more than 40% of the physical collections are at HD. The Faculty of Arts and Sciences started out hostile to the idea, but “soon became converted.” The notion faculty had of browsing the shelves was based on a fantasy: Harvard had never had all the books on a subject on a shelf in a single facility. E.g., search on “Shakespeare” in the Harvard library system: 18,000 hits. Widener Library is where you’d expect to find Shakespeare books. But 8,000 of the volumes aren’t in Widener. Of Widener’s 10K Shakespeare volumes, 4,500 are in HD. So, 25% of what you meant to browse is there. “Shelf browsing is a waste of time” if you’re trying to do thorough research. It’s a little better in the smaller libraries, but the future is not in shelf browsing. Open and closed stacks isn’t the question any more. “It’s just not possible any longer to do shelf browsing, unless we develop tools for browsing in a non-physical fashion.” E.g., catalog browsers, and ShelfLife (with StackView).

There’s nobody in the stacks any more. “It’s like the zombies have come and cleared people out.” People have new alternatives, and new habits. “But we have real challenges making sure they do as thorough research as possible, and that we leverage our collection.” About 12M of the 18M items are barcoded.

A task force saw that within 40 years, over 70% of the physical collection will be off site. HD was not designed to hold the part of the collection most people want to use. So, what can we do that will give us pedagogical and intellectual benefit, and realizes the incredible resource that our collection is?

Let me present one idea, says David. The Library Task Force said emphatically that Harvard’s collection should be seen as one collection. It makes sense intellectually and financially. But that idea is in contention with the 56 physical libraries at Harvard. Also, most of our collection doesn’t circulate. Only some of it is digitally browsable, and some of that won’t change for a long long long time. E.g., our Arabic journals in Widener aren’t indexed, don’t publish cumulative indexes, and are very hard to index. Thus scholars need to be able to pull them off the shelves. Likewise for big collections of manuscripts that haven’t even been sorted yet.

One idea would be to say: Let’s treat physical libraries as one place as well. Think of them as contiguous, even though they’re not. What if bar-coded books stayed in the library you returned them to? Not shelved by a taxonomy. Random access via the digital, and it tells you where the work is. And build perfect shelves for the works that need to be physically organized. Let’s build perfect Shakespeare shelves. Put them in one building. The other less-used works will be findable, but not browsable. This would require investing in better findability systems, but it would let us get past the arbitrariness of classification systems. Already David will usually go to Amazon to decide if he wants a book rather than take the 5 mins to walk to the library. By focusing on perfect shelves for what is most important to be browsable, resources would be freed up. This might make more space in the physical libraries, so “we could think about what the people in those buildings want to be doing,” so people would come in because there’s more going on. (David notes that this model will not go over well with many of his colleagues.)

53% of library space at Harvard is stack space. The other 47% is split between patron space and staff space. About 20-25% is staff space. Comparatively, Harvard is lower on patron space size than typical. The HD is holding half the collection in 20% of the space. It’s 4x as expensive to store a work on a stack on campus as off.

David responds to a question: The perfect shelves should be dynamic, not permanent. That will better serve the evolution of research. There are independent variables: Classification and shelf location. We certainly need classification, but it may not need to map to shelf locations. Widener has bibliographic lists and shelf lists. Barcodes give us more freedom; we don’t have to constantly return works to fixed locations.

Mike Barker: Students already build their own perfect shelves with carrels.

Q: What’s the case for ownership and retention if we’re only addressing temporal faculty needs?

A lot of the collecting in the first half of the 20th C was driven by faculty requests. Not now. The question of retention and purchase splits on the basis of how uncommon the piece of info is. If it’s being sold by Amazon, I don’t think it really matters if we retain it, because of the number of copies and the archival steps already in place. The more rare the work, the more we should think about purchase and retention. But under a third of the stack space on campus has ideal environmental conditions. We shouldn’t put works we buy into those circumstances unless they’re being used.

Q: At the Law Library, we’re trying to spread it out so that not everyone is buying the same stuff. E.g., we buy Peruvian materials because other libraries aren’t. And many law books are not available digitally, so we buy them … but we only buy one copy.

Yes, you’re making an assessment. In the Divinity library, Mike looked at the duplication rate. It was 53%. That is, 53% of our works are duplicated in other Harvard libraries.

Mike: How much do we spend on classification? To create call numbers? We annually spend about 1.5-2M on it, plus another million shelving it. So, $3M-3.5M total. (Mike warns that this is a “very squishy” number.) We circulate about 700,000 items a year. The total operating budget of the Library is about $152M. (He derived this number by asking catalogers how long it takes to classify an item without one, divided into salary.)

David: Scanning in tables of contents, indexes, etc., lets people find things without having to anticipate what they’re going to be interested in.

Q: Where does serendipity fall in this? What about when you don’t know what you’re looking for?

David: I agree completely. My dissertation depended on a book that no one had checked out since 1910. I found it on the stacks. But it’s not on the shelves now. Suppose I could ask a research librarian to bring me two shelves worth of stuff because I’m beginning to explore some area.

Q: What you’re suggesting won’t work so well for students. How would not having stacks affect students?

David: I’m being provocative but concrete. The status quo is not delivering what we think it does, and it hasn’t for the past three decades.

Q: [jeff goldenson] Public librarians tell us that the recently returned trucks are the most interesting place to go. We don’t really have the ability to see what’s moving in the Harvard system. Yes, there are privacy concerns, but just showing what books have been returned would be great.

Q: [palfrey] How much does the rise of the digital affect this idea? Also, you’ve said that the storage cost of a digital object may be more than that of physical objects. How does that affect this idea?

David: Copyright law is the big If. It’s not going away. But what kind of access do you have to digital objects that you own? That’s a huge variable. I’ve premised much of what I’ve said on the working notion that we will continue to build physical collections. We don’t know how much it will cost to keep a physical object for a long time. And computer scientists all say that digital objects are not durable. My working notion here is that the parts that are really crucial are the metadata pieces, which are more easily re-buildable if you have the physical objects. We’re not going to buy physical objects for all the digital items, so the selection principle goes back to how grey or black the items are. It depends on whether we get past the engineering question about digital durability — which depends a lot on electromagnetism as a storage medium, which may be a flash in the pan. We’re moving incrementally.

Q: [me] If we can identify the high value works that go on perfect shelves, why not just skip the physical shelves and increase the amount of metadata so that people can browse them looking for the sort of info they get from going to the physical shelf?

A: David: Money. We can’t spend too much on the present at the expense of the next century or two. There’s a threshold where you’d say that it’s worth digitizing them to the degree you’d need to replace physical inspection entirely. It’s a considered judgment, which we make, for example, when we decide to digitize exhibitions. You’d want to look at the opportunity costs.

David suggests that maybe the Divinity library (he’s in the Phil Dept.) should remove some stacks to make space for in-stack work and discussion areas. (He stresses that he’s just thinking out loud.)

Matthew Sheehy, who runs HD, says they’re thinking about how to keep books 500 years. They spend $300K/year on electricity to create the right environment. They’ve invested in redundancy. But, the walls of the HD will only last 100 years. [Nov. 25: I may have gotten the following wrong:] He thinks it costs about $1/year to store a book, not the usual figure of $0.45.

Jeffrey Schnapp: We’re building a library test kitchen. We’re interested in building physical shelves that have digital lives as well.

[Nov. 25: Changed Philosophy school to Divinity, in order to make it correct. Switched the remark about the cost of physical vs. digital in the interest of truth.]


[avignon] [2b2k] Robert Darnton on the history of copyright, open access, the dpla…

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.

We begin with a report on a Ministerial meeting yesterday here on culture — a dialogue among the stakeholders on the Internet. [No users included, I believe.] All agreed on the principles proposed at Deauville: It is a multi-stakeholder ecosystem that complies with law. In this morning’s discussion, I was struck by the convergence: we all agree about remunerating copyright holders. [Selection effect. I favor copyright and remunerating rights holders, but not as the supreme or exclusive value.] We agree that there are more legal alternatives. We agree that the law needs to be enforced. No one argued with that. [At what cost?] And we all agree we need international cooperation, especially to fight piracy.

Now Robert Darnton, Harvard Librarian, gives an invited talk about the history of copyright.

Darnton: I am grateful to be here. And especially grateful you did not ask me to talk about the death of the book. The book is not dead. More books are being produced in print and online every year than in the previous year. This year, more than 1 million new books will be produced. China has doubled its production of books in the past ten years. Brazil has a booming book industry. Even old countries like the US find book production is increasing. We should not bemoan the death of the book.

Should we conclude that all is well in the world of books? Certainly not. Listen to the lamentations of authors, publishers, booksellers. They are clearly frightened and confused. The ground is shifting beneath their feet and they don’t know where to stake a claim. The pace of tech is terrifying. What took millennia, then centuries, then decades, now happens all the time. Homesteading in the new info ecology is made difficult by uncertainty about copyright and economics.

Throughout early modern Europe, publishing was dominated by guilds of booksellers and printers. Modern copyright did not exist, but booksellers accumulated privileges, which Condorcet objected to. These privileges (AKA patents) gave them the exclusive rights to reproduce texts, with the support of the state. The monarchy in the 17th century eliminated competitors, especially ones in the provinces, reinforcing the guild, thus gaining control of publishing. But illegal production throve. Avignon was a great center of piracy in the 18th century because it was not French. It was surrounded by police intercepting the illegal books. It took a revolution to break the hegemony of the Parisian guild. For two years after the Bastille, the French press enjoyed liberty. Condorcet and others had argued for the abolition of constraints on the free exchange of ideas. It was a utopian vision that didn’t last long.

Modern copyright began with the 1793 French copyright law that established a new model in Europe. The exclusive right to sell a text was limited to the author for lifetime + 10 years. Meanwhile, the British Statute of Anne in 1710 created copyright. Background: The stationers’ monopoly required booksellers — and all had to be members — to register. The oligarchs of the guild crushed their competitors through monopolies. They were so powerful that they provoked revolts even within the book trade. Parliament rejected the guild’s attempt to secure the licensing act in 1695. The British celebrate this as the beginning of the end of pre-publication censorship.

The booksellers lobbied for the modern concept of copyright. For new works: 14 years, renewable once. At its origin, copyright law tried to strike a balance between the public good and the private benefit of the copyright owner. According to a liberal view, Parliament got the balance right. But the publishers refused to comply, invoking a general principle inherent in common law: When an author creates work, he acquires an unlimited right to profit from his labor. If he sold it, the publisher owned it in perpetuity. This was Diderot’s position. The same argument occurred in France and England.

In England, the argument culminated in the 1774 case Donaldson vs. Beckett, which reaffirmed 14 years renewable once. Then we Americans followed in our Constitution and in the first copyright law in 1790 (“An act for the encouragement of learning”, echoing the British 1710 Act): 14 years renewable once.

The debate is still alive. The 1998 copyright extension act in the US was considerably shaped by Jack Valenti and the Hollywood lobby. It extended copyright to life + 70 (or, for corporate works, 95 years from publication). We are thus putting most literature out of the public domain and into copyright that seems perpetual. Valenti was asked if he favored perpetual copyright and said “No. Copyright should last forever minus one day.”

This history is meant to emphasize the interplay of two elements that go right through the copyright debate: A principle directed toward the public gain vs. self-interest for private gain. It would be wrong-headed and naive to only assert the former. But to assert only the latter would be cynical. So, do we have the balance right today?

Consider knowledge and power. We all agree that patents help, but no one would want the knowledge of DNA to be exploited as private property. The privatization of knowledge has become an enclosure movement. Consider academic periodicals. Most knowledge first appears in digitized periodicals. The journal article is the principal outlet for the sciences, law, philosophy, etc. Journal publishers therefore control access to most of the knowledge being created, and they charge a fortune. The price of academic journals rose ten times faster than the rate of inflation in the 1990s. The J of Comparative Neurology is $29,113/year. The Brain costs $23,000. The average list price in chemistry is over $3,000. Most of the research was subsidized by taxpayers. It belongs in the public domain. But commercial publishers have fenced off parts of that domain and exploited it. Their profit margins run as high as 40%. Why aren’t they constrained by the laws of supply and demand? Because they have crowded competitors out, and the demand is not elastic: Research libraries cannot cancel their subscriptions without an uproar from the faculty. Of course, professors and students produced the research and provided it for free to the publishers. Academics are therefore complicit. They advance their prestige by publishing in journals, but they fail to understand the damage they’re doing to the Republic of Letters.

How to reverse this trend? Open access journals. Journals that are subsidized at the production end and are made free to consumers. They get more readers, too, which is not surprising since search engines index them and it’s easy for readers to get to them. Open Access is easy access, and the ease has economic consequences. Doctors, journalists, researchers, housewives, nearly everyone wants information fast and costless. Open Access is the answer. It is a little simple, but it’s the direction we have to take to address this problem at least in academic journals.

But the Forum is thinking about other things. I admire Google for its technical prowess, but also because it demonstrated that free access to info can be profitable. But it ran into problems when it began to digitize books and make them available. It got sued for alleged breach of copyright. It tried to settle by turning it into a gigantic business and sharing the profits with the authors and publishers who sued them. Libraries had provided the books. Now they’d have to buy them back at a price set by Google. Google was fencing off access to knowledge. A federal judge rejected it because, among other points, it threatened to create a monopoly. By controlling access to books, Google occupied a position similar to that of the guilds in London and Paris.

So why not create a library as great as anything imagined by Google, but that would make works available to users free of charge? Harvard held a workshop on Oct. 1, 2010 to explore this. Like Condorcet, a utopian fantasy? But it turns out to be eminently reasonable. A steering committee, a secretariat, 6 workgroups were established. A year later we launched the Digital Public Library of America at a conference hosted by the major cultural institutions in DC, and in April 2013 we’ll have a preliminary version of it.

Let me emphasize two points. 1. The DPLA will serve a wide and varied constituency throughout the US. It will be a force in education, and will provide a stimulus to the economy by putting knowledge to work. 2. It will spread to everyone on the globe. The DPLA’s technical infrastructure is being designed to be interoperable with Europeana, which is aggregating the digital collections of 27 countries. National digital libraries are sprouting up everywhere, even Mongolia. We need to bring them together. Books have never respected boundaries. Within a few decades, we’ll have worldwide access to all the books in the world, and images, recordings, films, etc.

Of course a lot remains to be done. But, the book is dead? Long live the book!

Q: It is patronizing to think that the USA and Europe will set the policy here. India and China will set this policy.

A: We need international collaboration. And we need an infrastructure that is interoperable.


[2b2k] Interview with Kevin Kelly on What Libraries Want

Dan Jones just posted my Library Lab Podcast conversation with Kevin Kelly, of whom I’m a great admirer.


[2b2k] Will digital scholarship ever keep up?

Scott F. Johnson has posted a dystopic provocation about the present of digital scholarship and possibly about its future.

Here’s the crux of his argument:

… as the deluge of information increases at a very fast pace — including both the digitization of scholarly materials unavailable in digital form previously and the new production of journals and books in digital form — and as the tools that scholars use to sift, sort, and search this material are increasingly unable to keep up — either by being limited in terms of the sheer amount of data they can deal with, or in terms of becoming so complex in terms of usability that the average scholar can’t use it — then the less likely it will be that a scholar can adequately cover the research material and write a convincing scholarly narrative today.

Thus, I would argue that in the future, when the computational tools (whatever they may be) eventually develop to a point of dealing profitably with the new deluge of digital scholarship, the backward-looking view of scholarship in our current transitional period may be generally disparaging. It may be so disparaging, in fact, that the scholarship of our generation will be seen as not trustworthy, or inherently compromised in some way by comparison with what came before (pre-digital) and what will come after (sophisticatedly digital).

Scott tentatively concludes:

For the moment one solution is to read less, but better. This may seem a luddite approach to the problem, but what other choice is there?

First, I should point out that the rest of Scott’s post makes it clear that he’s no Luddite. He understands the advantages of digital scholarship. But I look at this a little differently.

I agree with most of Scott’s description of the current state of digital scholarship and with the inevitability of an ever increasing deluge of scholarly digital material. But, I think the issue is not that the filters won’t be able to keep up with the deluge. Rather, I think we’re just going to have to give up on the idea of “keeping up” — much as newspapers and half hour news broadcasts have to give up the pretense that they are covering all the day’s events. The idea of coverage was always an internalization of the limitation of the old media, as if a newspaper, a broadcast, or even the lifetime of a scholar could embrace everything important there is to know about a field. Now the Net has made clear to us what we knew all along: most of what knowledge wanted to do was a mere dream.

So, for me the question is what scholarship and expertise look like when they cannot attain a sense of mastery by artificially limiting the material with which they have to deal. It was much easier when you only had to read at the pace of the publishers. Now you’d have to read at the pace of the writers…and there are so many more writers! So, lacking a canon, how can there be experts? How can you be a scholar?

I’m bad at predicting the future, and I don’t know if Scott is right that we will eventually develop such powerful search and filtering tools that the current generation of scholars will look like betwixt-and-between fools (or like an “asterisk,” as Scott says). There’s an argument that even if the pace of growth slows, the pace of complexification will increase. In any case, I’d guess that deep scholars will continue to exist because that’s more a personality trait than a function of the available materials. For example, I’m currently reading Armies of Heaven, by Jay Rubenstein. The depth of his knowledge about the First Crusade is astounding. Astounding. As more of the works he consulted come on line, other scholars of similar temperament will find it easier to pursue their deep scholarship. They will read less and better not as a tactic but because that’s how the world beckons to them. But the Net will also support scholars who want to read faster and do more connecting. Finally (and to me most interestingly) the Net is already helping us to address the scaling problem by facilitating the move of knowledge from books to networks. Books don’t scale. Networks do. Although, yes, that fundamentally changes the nature of knowledge and scholarship.

[Note: My initial post embedded one draft inside another and was a total mess. Ack. I've cleaned it up - Oct. 26, 2011, 4:03pm edt.]