Archive for March, 2010

[2b2k] The Web style of nonfiction story-telling

The challenge of the writer of non-fiction intended for a general readership is that usually the general readership doesn’t care about your topic. The big, obviously interesting topics have already been written about, by writers with bigger advances than you. So, you’ve probably picked something somewhat quirky. Readers don’t think they’re interested in your topic. Your job is to convince them that they’re wrong by pulling them through the book.

A current way of doing so is to introduce chapters and sections with a bit of nonfiction narrative that is quirkily interesting in itself, and then reveal that it is relevant to the book in some unexpected way. The reader begins happily surprised to find herself interested in the history of quail shoots or the discovery of floor wax, and then gets an extra squeeze of joy when she finds out that the digression – promisingly short – enlightens the overall topic.

This sort of writing has the structure of a joke: the sudden revelation of meaning. And I find it quite enjoyable as a reader when in the hands of a master such as Malcolm Gladwell, whose brilliance at it I think has driven the use of this style. And, yes, I use the technique myself, albeit lamely.

My hypothesis – and it is nothing more than that – is that the Web has abetted the spread of this literary form in two ways. First, the technique is a way of sustaining interest across the attention spans the Web has fragmented. Second, the Web makes it astoundingly easy for a writer to find digressive anecdotes and stories. You think it might be fun for the reader to begin a chapter with an account of the superstar team that analyzed why the Challenger shuttle exploded? Ten seconds later, you’ve got a rich set of materials listed for you. The reader would enjoy an account of the origins of the phrase “turtles all the way down”? The Web considers it done.

Of course, using this technique effectively is an entirely different story. But, the Web gives us both the motive and the means.


[2b2k] Jon Orwant of Google Books

Jon Orwant is an Engineering Manager at Google, with Google Books under him. He used to be CTO at O’Reilly, and was educated at MIT Media Lab. He’s giving a talk to Harvard’s librarians about his perspective on how libraries might change, a topic he says puts him out on a limb. Title of his talk: “Deriving the library from first principles.” If we were to start from scratch, would libraries look like today’s? He says no.

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.

Part I: Trends.

He says it’s not controversial that patrons are accessing more info online. Foot traffic to libraries is going down. Library budgets are being squeezed. “Public libraries are definitely feeling the pinch” exactly when people have less discretionary money and thus are spending more time at libraries.

At MIT, Nicholas Negroponte contended in the early 1990s that telephones would switch from wired to wireless, and televisions would go from wireless to wired. “It seems obvious in retrospect.” At that time, Jon was doing his work using a Connection Machine, which consisted of 64K little computers. The wet-bar size device he shows provided a whopping 5GB of storage. The Media Lab lost its advantage of being able to provide high-end computers since computing power has become widespread. So, Media Lab had to reinvent itself, to provide value as a physical location.

Is there an analogy to the Negroponte switch of telephone and TV, Jon asks? We used to use the library to search for books and talk about them at home. In the future, we’ll use our computer to search for books, and talk about them at our libraries.

What is the mission of libraries, he asks. Select and preserve info, or disseminate it? Might libraries redefine themselves? But this depends on the type of library.

1. University libraries. U of Michigan moved its academic press into the library system, even though the press is the money-making arm.

2. Research libraries. Harvard’s Countway Medical Library incorporates a lab into it, the Center for Bioinformatics. This puts domain expertise and search experts together. And they put in the Warren Anatomical Museum (AKA Harvard’s Freak Museum). Maybe libraries should replicate this, adopting information-driven departments. The ideal learning environment might be a great professor’s office. That 1:1 instruction isn’t generally tenable, but why is it that the higher the level of education, the fewer books are in the learning environment? I.e., kindergarten classes are filled with books, but grad student classrooms have few.

3. Public libraries. They tend to be big open rooms, which is why you have to be quiet in them. What if the architecture were a series of smaller, specialized rooms? Henry Jenkins said about newspapers, Jon says, that it’s strange that hundreds of reporters cover the Super Bowl, all writing basically the same story; newspapers should differentiate by geography. Might this notion of specialization apply to libraries, reflecting community interests at a more granular level? Too often, public libraries focus on the lowest common denominator, but suppose unusual book collections could rotate like exhibits in museums, with local research experts giving advice and talks. [Turn public libraries into public non-degree based universities?]

Part 2: Software architecture

Google Books wants to scan all books. Has done 12M out of the 120M works (which have 174M manifestations: different versions and editions, etc.). About 4B pages, 40+ libraries, 400 languages (“Three in Klingon”). Google Books is in the first stage: Scanning. Second: Scaling. Third: What do we do with all this? 20% are public domain.

He talks a bit about the scanning tech, which tries to correct for the inner curve of spines, keeps marginalia while removing dirt, doing OCR, etc. At O’Reilly, the job was to synthesize the elements; at Google, the job is to analyze them. They’re trying to recognize frontispieces, index pages, etc. He gives as a sample of the problem of recognizing italics: “Copyright is way too long to strike the balance between benefits to the author and the public. The entire raison d’etre of copyright is to strike a balance between benefits to the author and the public. Thus, the optimal copyright term is c(x) = 14(n + 1).” In each of these, italics indicates a different semantic point. Google is trying to algorithmically catch the author’s intent.

Physical proximity is good for low-latency apps, local caching, high-bandwidth communication, and immersive environments. So, maybe we’ll see books as applications (e.g., good for physics text that lets you play with problems, maybe not so useful for Plato), real-time video connections to others reading the same book, snazzy visualizations, presentation of lots of data in parallel (reviews, related books, commentary, and annotations).

“We’ll be paying a lot more attention to annotations” as a culture. He shows a scan of a Chinese book that includes a fold-out piece that contains an annotation; that page is not a single rectangle. “What could we do with persistent annotations?” What could we do with annotations that have not gone through the peer review process? What if undergrads were able to annotate books in ways that their comments persisted for decades? Not everyone would choose to do this, he notes.

We can do new types of research now. If you want to know what the past tense of “sneak” is: 50 yrs ago people would have said “snuck,” but in 50 years it’ll be “sneaked.” There is a trend toward regularization of verbs (i.e., away from irregular verbs) over time, which you can see by examining the corpus of books Google makes available to researchers. Or, you can look at triplets of words and ask what are the distinctive trigrams. E.g., it was: oxide of lead, vexation of spirit, a striking proof. Now: lesbian and gay, the power elite, the poor countries. Steve Pinker is going to use the corpus to test the “Great man” theory. E.g., when Newton and Leibniz both invented the calculus, was the calculus in the air? Do a calculus word cloud in multiple languages and test against the word configurations of the time. The usage of the phrases “World War I” and “The Great War” cross around 1938, but there were some people calling it “WWI” in 1932, which is a good way to discover a new book (wouldn’t you want to read the person who foresaw WWII?). This sort of research is one of the benefits of the Google Books settlement, he says. (He also says that he was both a plaintiff and defendant in the case because as an author, his book was scanned without authorization.)
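For the curious, here is a minimal sketch of the two kinds of corpus questions Jon describes: finding trigrams that show up in one body of text but not another, and finding the year in which one phrase overtakes another. It is not Google's method. The function names, the two tiny sample texts, and the yearly counts are all made up for illustration; real research would run over the actual corpus.

```python
from collections import Counter


def trigrams(text):
    """Count every run of three consecutive words in a text."""
    words = text.lower().split()
    return Counter(zip(words, words[1:], words[2:]))


def distinctive(corpus_a, corpus_b):
    """Return the trigrams that occur in corpus_a but never in corpus_b."""
    counts_a, counts_b = trigrams(corpus_a), trigrams(corpus_b)
    return sorted(t for t in counts_a if t not in counts_b)


def crossover_year(series_a, series_b):
    """Return the first shared year in which series_b exceeds series_a, else None."""
    for year in sorted(set(series_a) & set(series_b)):
        if series_b[year] > series_a[year]:
            return year
    return None


# Placeholder texts and counts, invented for illustration only.
old_text = "the oxide of lead was a striking proof"
new_text = "the power elite was a striking example"
print(distinctive(old_text, new_text))

great_war = {1930: 9, 1934: 7, 1938: 4, 1942: 2}
world_war_i = {1930: 1, 1934: 3, 1938: 5, 1942: 8}
print(crossover_year(great_war, world_war_i))
```

Comparing whole sets of trigrams is the crudest possible notion of "distinctive"; a real analysis would presumably compare relative frequencies rather than simple presence or absence.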

The images of all the world’s books are about 100 petabytes. You could put terminals in libraries so anyone can access out-of-print books, and you could let patrons print on demand. “Does that have an impact on collections” and budgets? Once that makes economic sense, then every library will “have” every single book.

How can we design a library for serendipity? The fact that books look different is appealing, Jon says. Maybe a library should buy lots and lots of different e-readers, in different form factors. The library could display info-rich electronic spines (graphics of spines) [Jon doesn't know that this is an idea the Harvard Law Library, with whom I'm working, is working on]. We could each have our own virtual rooms and bookshelves, with books that come through various analytics, including books that people I trust are reading. We could also generalize this by having the bookshelves change if more than one person is in the room; maybe the topics get broader to find shared interests. We could have bookshelves for a community in general. Analytics of multifactor classification (subject, tone, bias, scholarliness, etc.) can increase “deep” serendipity.


Q: One of the concerns in the research and univ libraries is the ability to return to the evidence you’ve cited. Having many manifestations (= editions, etc.) lets scholars return. We need permanent ways of getting back to evidence at a particular time. E.g., Census Dept. makes corrections, which means people who ran analyses of the data get different answers afterward.
A: The glib answer: You just need better citation mechanisms. The more sophisticated answer: Anglo-Saxon scholars will hold up a palimpsest. I don’t have an answer, except for a pointer to George Mason conf where they’re trying to come up with a protocol for expressing uncertainty [I think I missed this point -- dw]. What are all the ways to point into a work? You want to think of the work as a container, with all the annotations that come up with it. The ideal container has the text itself, info extracted from it, the programs needed to do the extraction, and the annotations. This raises the issue of the persistence of digital media in general. “We need to get into the mindset of bundling it all together”: PDFs and TIFFs + the programs for reading them. [But don't the programs depend upon operating systems? - dw]

Q: Centralized vs. distributed repository models?
A: It gets into questions of rights. I’d love to see it as distributed to as many places and in as many formats as possible. It shouldn’t just be Google digitizing books. You can get 100 petabytes in a single room, and of course much smaller in the future. There are advantages to keeping things local. But for the in-copyright works, it’ll come down to how comfortable the holders feel that it’s “too annoying” for people to copy what they shouldn’t.


[2b2k] Big problems

I’m liking a line from Brian Behlendorf that came through my Twitter stream: “The only way I know to solve big problems anymore is to do it in public.” (@brianbehlendorf)

The human brain being an ornery thing, our first response is probably to find exceptions. And there are undoubtedly lots of them…although it can be a little hard to tell because when you think of some big problem you’d want solved behind closed doors — The recent US/Russia nuclear arms reduction? A path to peace in the Middle East? — you can’t tell if maybe it were done in public, we would have come up with something better or faster. Still, there have to be places where this aphorism is wrong.

So what? The amazing thing is that it even looks plausible, much less that it’s proven itself to be true in some important instances. That is a big BIG change.


[2b2k] History of facts

I just started reading Mary Poovey’s A History of the Modern Fact and I’m excited. I’m still in the intro, but she’s way deepening what I’ve been thinking and writing about facts. (If she uses Malthus as an example I’ll be pissed, since I thought I’d come up with the differences between the first and sixth editions of his book as an example of the birth of modern facts in the early 1800s. I’m afraid to look in the index. Yes, writing makes me petty.)

The intro makes it clear that she’s taking facts back to the invention of double entry bookkeeping and “mercantile accommodation.” Mercantile accommodation was, she says, the system of informal agreements that enabled merchants to accept each other’s bills of exchange. So, is she going to tie credentials to mercantile credits? Does our system of trust in authorities of knowledge come from our system of commercial trust? I’ll have to read more to find out if I’m just making this up because it’s 2AM and I’m in an airport.


[2b2k] Distributed decision-making at Valve

The title of this post is the subtitle of an article in Game Developer (March 2010) by Matthew S. Burns about the production methods used by various leading game developers. (I have no idea why I’ve started receiving copies of this magazine for software engineers in the video game industry, which I’m enjoying despite — because — it’s over my head.) According to the article, Valve — the source of some of the greatest games ever, including Half-life, Portal, and Left4Dead — “works in a cooperative, adaptable way that is difficult to explain to people who are used to the top-down, hierarchical management at most other large game developers.” Valve trusts its employees to make good decisions, but it is not a free-for-all. Decisions are made in consultation with others (“relevant parties”) because, as Erik Johnson says, “…we know you’re going to be wrong more often than if you made decisions together.” In addition, what Matthew calls “a kind of decision market” develops because people who design a system also build it, so you “‘vote’ by spending time on the features most important” to you. Vote with your code.

Valve also believes in making incremental decisions. Week by week. But what does that do to long-term planning? Robin Walker says that one of the ways she (he?) judges how far they are from shipping is by “how many pages of notes I’m taking from each session.” That means Valve “can’t plan more than three months out,” but planning out further than that increases the chances of being wrong.

Interesting approach. Interesting article. Great games.


[2b2k] Harry Lewis on ways the Net is making us stupider

Harry Lewis has begun a series of posts on how the Net makes us know less, rather than more. Apparently, it’s not just Google that’s making us stupid. His first post is about the astounding French criminal libel prosecution in which the editor of a scholarly journal could conceivably go to jail for publishing a negative review of a book.

I’ve read the editor’s reasonable and reasoned response [pdf], and that the case has gotten as far as it has is bottomlessly awful.


[2b2k] Authority as having the first word

Because of some talks I’m giving, I’ve been thinking about how to express the concrete effects that the change in expertise has on the authority of business. I want to say that in the old days, we took expertise and authority as the last word about a topic. Increasingly, the value of expertise and authority is as the first word, the word that frames and initiates the discussion.

I realize that this sounds better than it thinks, so to speak. But there are some aspects of it that I like. (1) I do think that we are moving away in some areas from thinking that we have to settle issues; we are finding much value in the unsettling of ideas, for that allows for more nuance, more complexity, and more recognition that our ability to know our world is quite limited. (2) And I do think that there is a type of expertise that has value as the first word: think about some of your favorite bloggers who throw an idea out into the world so the world can plumb it for meaning, veracity, and relevance. (3) Finally, I do think that insisting on having the last word, and thus closing the conversation, often will be seen as counter-productive and arrogant.

Unfortunately, that maps imperfectly to the snappy aphorism that expertise is moving from having the last word to being the first word.


[berkman] John Wilbanks on making science generative

John Wilbanks of Creative Commons (and head of Science Commons) is giving a Berkman lunchtime talk about the threats to science’s generativity. He takes Jonathan Zittrain’s definition of generativity: “a system’s capacity to produce unanticipated change through unfiltered contributions from broad and varied audiences.”

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.

[NOTE: Ethan Zuckerman has posted his far superior bloggage]

Science Commons tries to spark the sort of creativity and innovation in science that we find in the broader cultural Net. Scientists often resist the factors that have increased generativity in other realms: Science isn’t very accessible, it’s hard to master, and it’s not very transferable because the sciences exist as guild-disciplines. He says MIT had to build a $400M building to put scientists into the same room so they’d collaborate. There’s a tension, he says, between getting credit for your work and sharing your work. People think that it ought to be easy to build a science commons, but it’s not.

To build a commons and increase generativity, John looks at three key elements: data, tools, and text. First, he looks at these from the standpoint of law. Text is copyrighted, but we can change the law and we can use Creative Commons. Tools include contracts and patents. Contracts govern the moving of ideas around, and they are between institutions, not between scientists. Data is mainly governed by secrecy.

The resistance turns out not to be from the law but from incentives, infrastructure, and institutions. E.g., the National Institutes of Health Public Access policy requires scientists to make their work available online within 12 months if the scientist has taken any NIH money. Before it was required, only 4% of scientists posted their work. Now it’s up over 70%, and it’s rising. Without this, scientists are incented to withhold info until the moment of maximum impact.

To open up data, you need incentives and infrastructure if you’re going to make it useful to others. People need incentives to label their data, put it into useful formats, to take care of the privacy issues, to carefully differentiate attribution and citation (copy vs. inspiration). So far, data doesn’t have the right set of incentives.

To open up tools, we’re talking about physical stuff, e.g., recombinant DNA. Scientists don’t get funded to make copies. “The resistance is almost fractal,” he says, at each level of opening up these materials.

We need a “domain name system for data” if we’re going to get Net effects. But there’s no accepted data infrastructure on the Web for doing this, unlike Google’s role for text pages.

Science is heading back to the garage, in the Eric Von Hippel sense. [He's sitting next to me at the table!] You can buy a gene sequencer on eBay for under $1,000. You can go to People around the world are doing this. In SF, a group is doing DIY sequencing, creating toxin detectors, etc. The prices of parts and materials are dropping the way memory prices and printer prices did. We need an open system, including a registry, in part because that’s the most responsive way to respond to bad genes made by bad people.

“PC or TiVo for science?” John asks. PCs are ugly, but they give us more control over our tools and will let us innovate faster.

Q: [salil] You focus on experimental sciences. Are these obstacles present in mathematical and computer sciences? Data and tools are not a big part of math. Not making one’s work available right now in my field counts as a disadvantage. Specialization is an issue (what you call a guild)…
A: Math and physics are at the extreme of the gradient of openness, while chemistry probably sits at the other end. The lower the cost of publishing, the more disclosure there is. So, in math there isn’t as much institutional, systemic resistance because you don’t need a lot of support from an institution to be a great mathematician.
A: Guilds serve a purpose. But when you think about the competency of a system overall, it comes from the abstraction of expertise into tools. In the research sciences, microspecialization has come at the expense of abstraction. But it’s easier and easier to put knowledge into the tools because we can put lots into computers; that won’t revolutionize math, but it will have more of an effect on sciences with physical components. Science Commons stays away from math because it’s working.

Q: [Eric Von Hippel] State of patents?
A: Most of the time in science, patents are trading cards; they’re more about leverage and negotiations than about keeping people from using them. If we think about data as prior art, if we funnel it correctly, it becomes harder to get stupid patents. Biotech patents should be dealt with through a robust public domain strategy. “We tend to get wound up about IP, but then you go out in the field and people are just doing stuff.” Copyright is more stressful because patents time out after 20 yrs.

Q: [ethanz] Clearly, the legal response is a tiny part of a larger equation. If you were coming into it now, not trying to put forward this novel legal framework, where would you start?
A: Funders. Starting with the law lets us engage everyone in the conversation, because as the legal group we don’t create text, tools, or data. But we’re focusing on the funder-institution relation. We want funders to write clauses that reserve the right to put stuff into the commons. “If the funders mandate, the universities tend to accept.” Also, it gets easier to do high-quality research outside the big universities. Which means the small schools can do deals with the funders to make their faculty more attractive to the funder. The funder can also specify that the scientists will annotate their data. The funder has the biggest interest in making sure that science is generative.

Q: Then why aren’t funders requiring the data be open?
A: Making data legally open is easy. Making it useful to others is difficult. Curating it with enough metadata, publishing it on the Web, making it machine readable, making it persistent — none of those infrastructures exist for that, with some exceptions (e.g., the genome). So, the Web has to become capable of handling data.

Q: [ethanz] One reason that orgs like CC have been successful is that they put into law something that is a norm on the Web. The reason math and physics are so open is that they’re open; it’s the norm. The institutional culture within these disciplines has a lot to do with it. How do you shape norms?
A: Carolina Rossini and I have been working on a paper about the university as a curator of norms. CC lets you waive all your rights. We’ve thought about writing a series of machine readable norms like CC contracts but with no law in the middle. E.g., citation is a norm. E.g., non-endorsement is a norm that says that if you use my data, you can’t imply that I agree with you. But the norm that I should mark my data clearly, should have a persistent URL, are things laws can’t govern but should be norms. We use Eric’s ideas here. E.g., branding something with an open trademark.
A: [carolina] We need a bottom up approach based on norms and a top down approach based on law and policy. If you don’t work with both, they will clash.
A: Our lawyer Tim says that norms scale far better than the law. You can’t enforce the law all the time.

Q: [me] “Making the Web capable of handling data”? How? Semantic Web? What scale?
A: It’s a religious question. My sect says that ontologies are human. We should be using standard formats, e.g., OWL, RDF. Some ontologies will be used by communities, and if they are expressed in standard ways, they can be stitched together. From my view: 1. Name things in clear and distinct ways. 2. Put them into OWL or other languages in the correct way. 3. Let smart people who need connected data do so, and let them publish. It’ll be a mix of top down standards setting and bottom up hacking. I’m a big SemWeb fan, but I get very scared of people saying that they have THE ontology. It’ll be messy. It won’t be beautiful. The main thing is to make it easy for people to wire data sets together. Standard URIs and standard formats are the only way to do this. We’ve seen this in the life sciences. Communities that need to write big data together treat it the way Linux packages get rolled together into a release. You’ll see data distributions emerge that represent different religions. If it works, people will use it. There’ll be flame wars, license wars, and forking, and chaos, and 99% of the projects will die. You should be able to boot your databases into a single operating system that understands it.
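To make the "standard URIs let you wire data sets together" point concrete, here is a small sketch of my own, not anything John showed. It models RDF-style statements as plain (subject, predicate, object) tuples rather than using a real RDF library, and every example.org URI, the two "lab" data sets, and the helper names are hypothetical. The idea it illustrates: if two groups use the same URI for the same thing, merging their data is just a set union.

```python
# All URIs below are hypothetical example.org placeholders.
GENE = "http://example.org/gene/G1"

# Two independently published data sets that happen to name the gene identically.
lab_a = {
    (GENE, "http://example.org/vocab/label", "gene G1"),
    (GENE, "http://example.org/vocab/locatedOn", "http://example.org/chromosome/4"),
}
lab_b = {
    (GENE, "http://example.org/vocab/associatedWith", "http://example.org/condition/C7"),
}


def merge(*datasets):
    """Union several sets of (subject, predicate, object) triples."""
    merged = set()
    for triples in datasets:
        merged |= triples
    return merged


def describe(triples, subject):
    """Map each predicate to the sorted objects recorded for one subject."""
    facts = {}
    for s, p, o in triples:
        if s == subject:
            facts.setdefault(p, []).append(o)
    return {p: sorted(objs) for p, objs in facts.items()}


combined = merge(lab_a, lab_b)
for predicate, objects in sorted(describe(combined, GENE).items()):
    print(predicate, objects)
```

If the two labs had coined different names for the same gene, the union would still succeed but the facts would never meet, which is presumably why naming things "in clear and distinct ways" comes first on his list.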

Q: Researchers are incented to make their work available and open. Frequently, institutions get in the way of that. Are you looking at CC-style MTA’s [material transfer agreements]?
A: We published some last year. The first adopter was the Cure Huntingtons Disease and then the Personal Genome Project. We’re going to foundations. We want to get the institutions out of the way, but only the funders can change the experience. NIH requires you to provide a breeding pair of genetically altered mice, kept in a storage facility in Maine [I think]. NIH is moving away from MTAs, going with a you-agree-by-opening agreement.

Q: Privacy?
A: Big issue. Sometimes used as an excuse for not sharing data, but privacy makes the issues we’ve been talking about look simple. It’s a long-term problem. Genomes are not considered as personally identifying, although your license plate is. “There will be a reckoning.” JW’s advice: If you’re dealing with humans, be careful.

Q: Scientists are already overwhelmed by requests. More open, more tagged, means more requests.
A: Yes, we have to design with the negative impacts in mind. We need social filtering, etc. I worry about the scientist in eastern Tennessee or Botswana who’s a genius and can’t get access. If enough of the data is available, maybe you can get a community that answers many of the questions. People generally get into science because they like to talk with people. They’re more likely than most to share. But you have to make it part of the culture that it’s easy. One of the ideas behind the open source trademark concept is that you have to build up a certain amount of karma before I’ll read your email. People are the answer. Most of the time.

Q: Incentives to motivate institutions, but what about incentives for individuals to move them in this direction?
A: PLOS was created because Mike Eisen was so pissed at closed journals that he created a business to compete with them. In anthropology, the Society is trying to go more closed, but groups of scientists are trying to go more open access. There’s a battle for the discipline’s soul. Individuals in these institutions are driving it. The key is to get the first big adopters going. Everyone wants to be in the top ten, especially when the first three are Harvard, Yale and MIT. American Chemical Society is not going to go open any time soon because they make lots of money selling abstracts.

Q: [eric von hippel] I hope you realize how wonderful you all are.


[2b2k] Data-intensive science book available for free on up

The Fourth Paradigm: Data-Intensive Scientific Discovery looks like a rich anthology about how the combination of massive amounts of data and powerful computers is changing our ideas about the nature of science. It’s available as a paperback, but also for free as a set of PDFs. Or, for $0.99 you can get it on your Kindle. ([kvetch] Of course, on the Kindle you won’t be able to underline or annotate it (except in theory), or create citations with page numbers.)

Anyway, I’m looking forward to it, even though the folder I’m building for the chapter on science I’m planning in Too Big to Know is (appropriately, I suppose) getting over-stuffed.


[ahole] [2b2k] Me having tea with The Economist

I have to say that Tea with the Economist was a fun experience. The Economist has been videoing tea-time discussions with various folks. In line with that magazine’s tradition of anonymous authoring, the interviewer is unnamed, but I can assure you that he is as astute as he is delightful.

We talk about what people will do with the big loads of data that some governments are releasing, and the general problem of the world being too big to know.