Archive for category big data

[2b2k] Social Science in the Age of Too Big to Know

Gary King [twitter:kinggarry], Director of Harvard’s Institute for Quantitative Social Science, has published an article (Open Access!) on the current status of this branch of science. Here’s the abstract:

The social sciences are undergoing a dramatic transformation from studying problems to solving them; from making do with a small number of sparse data sets to analyzing increasing quantities of diverse, highly informative data; from isolated scholars toiling away on their own to larger scale, collaborative, interdisciplinary, lab-style research teams; and from a purely academic pursuit focused inward to having a major impact on public policy, commerce and industry, other academic fields, and some of the major problems that affect individuals and societies. In the midst of all this productive chaos, we have been building the Institute for Quantitative Social Science at Harvard, a new type of center intended to help foster and respond to these broader developments. We offer here some suggestions from our experiences for the increasing number of other universities that have begun to build similar institutions and for how we might work together to advance social science more generally.

In the article, Gary argues that Big Data requires Big Collaboration to be understood:

Social scientists are now transitioning from working primarily on their own, alone in their offices (a style that dates back to when the offices were in monasteries), to working in highly collaborative, interdisciplinary, larger scale, lab-style research teams. The knowledge and skills necessary to access and use these new data sources and methods often do not exist within any one of the traditionally defined social science disciplines and are too complicated for any one scholar to accomplish alone.

He begins by giving three excellent examples of how quantitative social science is opening up new possibilities for research.

1. Latanya Sweeney [twitter:LatanyaSweeney] found “clear evidence of racial discrimination” in the ads served up by newspaper websites.

2. A study of all 187M registered voters in the US showed that a third of those listed as “inactive” in fact cast ballots, “and the problem is not politically neutral.”

3. A study of 11M social media posts from China showed that the Chinese government is not censoring speech but is censoring “attempts at collective action, whether for or against the government…”

Studies such as these “depended on IQSS infrastructure, including access to experts in statistics, the social sciences, engineering, computer science, and American and Chinese area studies.”

Gary also points to “the coming end of the quantitative-qualitative divide” in the social sciences, as new techniques enable massive amounts of qualitative data to be quantified, enriching purely quantitative data and extracting additional information from the qualitative reports.

Instead of quantitative researchers trying to build fully automated methods and qualitative researchers trying to make do with traditional human-only methods, now both are heading toward using or developing computer-assisted methods that empower both groups.

We are seeing a redefinition of social science, he argues:

We instead use the term “social science” more generally to refer to areas of scholarship dedicated to understanding, or improving the well-being of, human populations, using data at the level of (or informative about) individual people or groups of people.

This definition covers the traditional social science departments in faculties of schools of arts and science, but it also includes most research conducted at schools of public policy, business, and education. Social science is referred to by other names in other areas but the definition is wider than use of the term. It includes what law school faculty call “empirical research,” and many aspects of research in other areas, such as health policy at schools of medicine. It also includes research conducted by faculty in schools of public health, although they have different names for these activities, such as epidemiology, demography, and outcomes research.

The rest of the article reflects on pragmatic issues, including what this means for the sorts of social science centers to build, since community is “by far the most important component leading to success…” If academic research became part of the X-games, our competitive event would be “extreme cooperation.”

Tags:

[liveblog][2b2k] Saskia Sassen

The sociologist Saskia Sassen is giving a plenary talk at Engaging Data 2013. [I had a little trouble hearing some of it. Sorry. And in the press of time I haven't had a chance to vet this for even obvious typos, etc.]

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.

1. The term Big Data is ambiguous. “Big Data” implies we’re in a technical zone. It becomes a “technical problem,” as when morally challenging technologies are developed by scientists who think they are just dealing with a technical issue. Big Data comes with a neutral charge. “Surveillance” brings in the state, the logics of power, how citizens are affected.

Until recently, citizens could not relate to a map that came out in 2010 that shows how much surveillance there is in the US. It was published by the Washington Post, but it didn’t register. 1,271 govt orgs and 1,931 private companies work on programs related to counterterrorism, homeland security and intelligence. There are more than 1 million people with top-secret clearance, and maybe a third are private contractors. In DC and environs, 33 building complexes are under construction or have been built for top-secret intelligence since 9/11. Together they are 22x the size of Congress. Inside these environments, the govt regulates everything. By 2010, DC had 4,000 corporate office buildings that handle classified info, all subject to govt regulation. “We’re dealing with a massive material apparatus.” We should not be distracted by the small individual devices.

Cisco lost 28% of its sales, in part as a result of its being tainted by the NSA’s taking of its data. This is alienating citizens and foreign govts. How do we stop this? We’re dealing with a kind of assemblage of technical capabilities, tech firms that sell the notion that for security we all have to be surveilled, and people. How do we get a handle on this? I ask: Are there spaces where we can forget about them? Our messy, nice complex cities are such spaces. All that data cannot be analyzed. (She notes that she did a panel that included the brother of a Muslim who has been indefinitely detained, so now her name is associated with him.)

3. How can I activate large, diverse spaces in cities? How can we activate local knowledges? We can “outsource the neighborhood.” The language of “neighborhood” brings me pleasure, she says.

If you think of institutions, they are codified, and they notice when there are violations. Every neighborhood has knowledge about the city that is different from the knowledge at the center. The homeless know more about rats than the center does. Make open access networks available to them, as a reverse wiki, so that local knowledge can find a place. Leak that knowledge into those codified systems. That’s the beginning of activating a city. From this you’d get a Big Data set, capturing the particularities of each neighborhood. [A knowledge network. I agree! :)]

The next step is activism, a movement. In my fantasy, at one end it’s big city life and at the other it’s neighborhood residents enabled to feel that their knowledge matters.

Q&A

Q: If local data is being aggregated, could that become Big Data that’s used against the neighborhoods?

A: Yes, that’s why we need neighborhood activism. The politicizing of the neighborhoods shapes the way the knowledge is used.

Q: Disempowered neighborhoods would be even less able to contribute this type of knowledge.

A: The problem is to value them. The neighborhood has knowledge at ground level. That’s a first step of enabling a devalued subject. The effect of digital networks on formal knowledge creates an informal network. Velocity itself has the effect of informalizing knowledge. I’ve compared environmental activists and financial traders. The environmentalists pick up knowledge on the ground. So, the neighborhoods may be powerless, but they have knowledge. Digital interactive open access makes it possible to bring together those bits of knowledge.

Q: Those who control the pipes seem to control the power. How does Big Data avoid the world being dominated by brainy people?

A: The brainy people at, say, Goldman Sachs are part of a larger institution. These institutions have so much power that they don’t know how to govern it. The US govt has been the most powerful in the world, with the result that it doesn’t know how to govern its own power. It has engaged in disastrous wars. So “brainy people” running the world through the Ciscos, etc., I’m not sure. I’m talking about a different idea of Big Data sets: distributed knowledges. E.g., Forest Watch uses indigenous people who can’t write, but they can tell before the trained biologists when there is something wrong in the ecosystem. There’s lots of data embedded in lots of places.

[She's aggregating questions] Q1: Marginalized neighborhoods live with being surveilled: stop and frisk, background checks, etc. Why did it take tapping Angela Merkel’s telephone to bring awareness? Q2: How do you convince policy makers to incorporate citizen data? Q3: There are strong disincentives to being out of the mainstream, so how can we incentivize difference?

A: How do we get the experts to use the knowledge? For me that’s not the most important aim. More important is activating the residents. What matters is that they become part of a conversation. A: About difference: Neighborhoods are pretty average places, unlike forest watchers. And even they’re not part of the knowledge-making circuit. We should bring them in. A: The participation of the neighborhoods isn’t just a utility for the central govt but is a first step toward mobilizing people who have been reduced to thinking that they don’t count. I think it is one of the most effective ways to contest the huge apparatus with the 10,000 buildings.

Tags:

[2b2k] Big Data and the Commons

I’m at the Engaging Big Data 2013 conference put on by Senseable City Lab at MIT. After the morning’s opener by Noam Chomsky (!), I’m leading one of 12 concurrent sessions. I’m supposed to talk for 15-20 mins and then lead a discussion. Here’s a summary of what I’m planning on saying:

Overall point: To look at the end state of the knowledge network/Commons we want to get to

Big Data started as an Info Age concept: magnify the storage and put it on a network. But you can see how the Net is affecting it:

First, there are a set of values that are being transformed:
- From accuracy to scale
- From control to innovation
- From ownership to collaboration
- From order to meaning

Second, the Net is transforming knowledge, which is changing the role of Big Data
- From filtered to scaled
- From settled to unsettled and under discussion
- From orderly to messy
- From done in private to done in public
- From a set of stopping points to endless links

If that’s roughly the case, then we can see a larger Net effect. The old Info Age hope (naive, yes, but it still shows up at times) was that we’d be able to create models that ultimately interoperate and provide an ever-increasing and ever-more detailed integrated model of the world. But in the new Commons, we recognize that not only won’t we ever derive a single model, but that there is tremendous strength in the diversity of models. This Commons then is enabled if:

  • All have access to all
  • There can be social engagement to further enrich our understanding
  • The conversations default to public

So, what can we do to get there? Maybe:

  • Build platforms and services
  • Support Open Access (and, as Lewis Hyde says, “beat the bounds” of the Commons regularly)
  • Support Linked Open Data
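
To make “Support Linked Open Data” a bit more concrete, here is a minimal, hedged sketch of what querying a Linked Open Data source can look like. It uses DBpedia’s public SPARQL endpoint; the endpoint URL and the dbo:populationTotal property are my assumptions about that service, not anything from the session itself.

```python
# Minimal sketch: querying a Linked Open Data endpoint (DBpedia) with SPARQL.
# Assumptions: the public endpoint at https://dbpedia.org/sparql and the
# dbo:populationTotal property; adjust if the service has changed.
import requests

SPARQL = """
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?city ?population WHERE {
  ?city a dbo:City ;
        dbo:populationTotal ?population .
  FILTER(?population > 1000000)
}
LIMIT 10
"""

resp = requests.get(
    "https://dbpedia.org/sparql",
    params={"query": SPARQL, "format": "application/sparql-results+json"},
    timeout=30,
)
resp.raise_for_status()

for row in resp.json()["results"]["bindings"]:
    print(row["city"]["value"], row["population"]["value"])
```

The same pattern (SPARQL over HTTP, results back as JSON) applies to most Linked Open Data endpoints.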

Questions if the discussion needs kickstarting:

  • What Big Data policies would help the Commons to flourish?
  • How can we improve the diversity of those who access and contribute to the Commons?
  • What are the personal and institutional hesitations that are hindering the further development of the Commons?
  • What role can and should Big Data play in knowledge-focused discussions? With participants who are not mathematically or statistically inclined?
  • Does anyone have experience with Linked Data? Tell us about it.

 


I just checked the agenda, which of course I should have done earlier, and discovered that of the 12 sessions today, 11 are being led by men. Had I done that homework, I would not have accepted their invitation.

Tags:

[2b2k] Is big data degrading the integrity of science?

Amanda Alvarez has a provocative post at GigaOm:

There’s an epidemic going on in science: experiments that no one can reproduce, studies that have to be retracted, and the emergence of a lurking data reliability iceberg. The hunger for ever more novel and high-impact results that could lead to that coveted paper in a top-tier journal like Nature or Science is not dissimilar to the clickbait headlines and obsession with pageviews we see in modern journalism.

The article’s title points especially to “dodgy data,” and the item in this list that’s by far the most interesting to me is the “data reliability iceberg,” and its tie to the rise of Big Data. Amanda writes:

…unlike in science…, in big data accuracy is not as much of an issue. As my colleague Derrick Harris points out, for big data scientists the ability to churn through huge amounts of data very quickly is actually more important than complete accuracy. One reason for this is that they’re not dealing with, say, life-saving drug treatments, but with things like targeted advertising, where you don’t have to be 100 percent accurate. Big data scientists would rather be pointed in the right general direction faster — and course-correct as they go – than have to wait to be pointed in the exact right direction. This kind of error-tolerance has insidiously crept into science, too.

But the rest of the article contains no evidence that the last sentence’s claim is true because of the rise of Big Data. In fact, even if we accept that science is facing a crisis of reliability, the article doesn’t pin this on an “iceberg” of bad data. Rather, it seems to be a melange of bad data, faulty software, unreliable equipment, poor methodology, undue haste, and o’erweening ambition.

The last part of the article draws some of the heat out of the initial paragraphs. For example: “Some see the phenomenon not as an epidemic but as a rash, a sign that the research ecosystem is getting healthier and more transparent.” It makes the headline and the first part seem a bit overstated — not unusual for a blog post (not that I would ever do such a thing!) but at best ironic given this post’s topic.

I remain interested in Amanda’s hypothesis. Is science getting sloppier with data?

Tags:

Big Data on broadband

Google commissioned the compiling of

an international dataset of retail broadband Internet connectivity prices. The result was an international dataset of 3,655 fixed and mobile broadband retail price observations, with fixed broadband pricing data for 93 countries and mobile broadband pricing data for 106 countries. The dataset can be used to make international comparisons and evaluate the efficacy of particular public policies—e.g., direct regulation and oversight of Internet peering and termination charges—on consumer prices.

The links are here. WARNING: a knowledgeable friend of mine says that he has already found numerous errors in the data, so use them with caution.
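
If you do download the files, here is a hedged sketch of the kind of international comparison the dataset is meant to support. The file name and the column names below are hypothetical placeholders, not the dataset’s actual schema; substitute whatever the files really use.

```python
# Hedged sketch: comparing retail broadband prices across countries.
# The file name and columns ("country", "price_usd_ppp", "service_type")
# are hypothetical placeholders for whatever the dataset actually uses.
import pandas as pd

df = pd.read_csv("broadband_prices.csv")  # hypothetical local copy of the dataset

fixed = df[df["service_type"] == "fixed"]
summary = (
    fixed.groupby("country")["price_usd_ppp"]
    .median()
    .sort_values()
)
print(summary.head(10))   # ten cheapest countries by median fixed-broadband price
print(summary.tail(10))   # ten most expensive
```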

Tags:

[2b2k] Big Data needs Big Pipes

A post by Stacy Higginbotham at GigaOm talks about the problems moving Big Data across the Net so that it can be processed. She draws on an article by Mari Silbey at SmartPlanet. Mari’s example is a telescope being built on Cerro Pachon, a mountain in Chile, that will ship many high-resolution sky photos every day to processing centers in the US.

Stacy discusses several high-speed networks, and the possibility of compressing the data in clever ways. But a person on a mailing list I’m on (who wishes to remain anonymous) pointed to GLIF, the Global Lambda Integrated Facility, which rather surprisingly is not a cover name for a nefarious organization out to slice James Bond in two with a high-energy laser pointer.

The title of its “informational brochure” [pdf] is “Connecting research worldwide with lightpaths,” which helps some. It explains:

GLIF makes use of the cost and capacity advantages offered by optical multiplexing, in order to build an infrastructure that can take advantage of various processing, storage and instrumentation facilities around the world. The aim is to encourage the shared use of resources by eliminating the traditional performance bottlenecks caused by a lack of network capacity.

Multiplexing is the carrying of multiple signals at different wavelengths on a single optical fiber. And these wavelengths are known as … wait for it … lambdas. Boom!

My mailing list buddy says that GLIF provides “100 gigabit optical waves”, which compares favorably to your pathetic earthling (um, American) 3-20 megabit broadband connection (maybe 50Mb if you have FiOS), and he notes that GLIF is available in Chile.
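
A quick back-of-the-envelope calculation shows why those numbers matter. The 15 TB/night figure below is my own assumed example of a large sky survey’s nightly output, not a number from either article:

```python
# Back-of-the-envelope: how long does one night of telescope data take to move?
# The 15 TB/night figure is an assumed example, not from the articles cited.
NIGHTLY_DATA_TB = 15
data_bits = NIGHTLY_DATA_TB * 1e12 * 8          # terabytes -> bits

def transfer_hours(bits, rate_bits_per_sec):
    return bits / rate_bits_per_sec / 3600

for label, rate in [
    ("20 Mbps US broadband", 20e6),
    ("50 Mbps FiOS-class link", 50e6),
    ("100 Gbps GLIF lightpath", 100e9),
]:
    print(f"{label}: {transfer_hours(data_bits, rate):,.1f} hours")
```

At 20 Mbps the night’s haul takes on the order of a couple of months to move; on a 100 Gbps lightpath it takes about twenty minutes.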

To sum up: 1. Moving Big Data is an issue. 2. We are not at the end of innovating. 3. The bandwidth we think of as “high” in the US is a miserable joke.


By the way, you can hear an uncut interview about Big Data I did a few days ago for Breitband, a German radio program that edited, translated, and broadcast it.

Tags:

[2b2k] The commoditizing and networking of facts

Ars Technica has a post about Wikidata, a proposed new project from the folks that brought you Wikipedia. From the project’s introductory page:

Many Wikipedia articles contain facts and connections to other articles that are not easily understood by a computer, like the population of a country or the place of birth of an actor. In Wikidata you will be able to enter that information in a way that makes it processable by the computer. This means that the machine can provide it in different languages, use it to create overviews of such data, like lists or charts, or answer questions that can hardly be answered automatically today.

Because I had some questions not addressed in the Wikidata pages that I saw, I went onto the Wikidata IRC chat (http://webchat.freenode.net/?channels=#wikimedia-wikidata) where Denny_WMDE answered some questions for me.

[11:29] hi. I’m very interested in wikidata and am trying to write a brief blog post, and have a n00b question.

[11:29] go ahead!

[11:30] When there’s disagreement about a fact, will there be a discussion page where the differences can be worked through in public?

[11:30] two-fold answer

[11:30] 1. there will be a discussion page, yes

[11:31] 2. every fact can always have references accompanying it. so it is not about “does berlin really have 3.5 mio people” but about “does source X say that berlin has 3.5 mio people”

[11:31] wikidata is not about truth

[11:31] but about referenceable facts

When I asked which fact would make it into an article’s info box when the facts are contested, Denny_WMDE replied that they’re working on this, and will post a proposal for discussion.

So, on the one hand, Wikidata is further commoditizing facts: making them easier and thus less expensive to find and “consume.” Historically, this is a good thing. Literacy did this. Tables of logarithms did it. Almanacs did it. Wikipedia has commoditized a level of knowledge one up from facts. Now Wikidata is doing it for facts in a way that not only will make them easy to look up, but will enable them to serve as data in computational quests, such as finding every city with a population of at least 100,000 that has an average temperature below 60F.
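
As a hedged illustration of such a computational quest, here is a sketch of how a question like that can be asked today against Wikidata’s SPARQL endpoint (a service that postdates the proposal described above). The endpoint URL and the identifiers used (P31 = instance of, Q515 = city, P1082 = population) are my assumptions about the current service, and I’ve left out the temperature condition since I don’t know a property for it:

```python
# Hedged sketch of the kind of query Wikidata came to support: cities with
# population >= 100,000. The endpoint and the IDs (P31, Q515, P1082) are
# assumptions about today's service, not part of the proposal this post covers.
import requests

SPARQL = """
SELECT ?city ?cityLabel ?population WHERE {
  ?city wdt:P31/wdt:P279* wd:Q515 ;
        wdt:P1082 ?population .
  FILTER(?population >= 100000)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 20
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": SPARQL, "format": "json"},
    headers={"User-Agent": "example-script/0.1"},
    timeout=60,
)
resp.raise_for_status()

for row in resp.json()["results"]["bindings"]:
    print(row["cityLabel"]["value"], row["population"]["value"])
```

Each returned item is itself a link back into Wikidata, where the population claim can carry its own sources, i.e., the “referenceable facts” Denny describes.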

On the other hand, because Wikidata is doing this commoditizing in a networked space, its facts are themselves links — “referenceable facts” are both facts that can be referenced, and simultaneously facts that come with links to their own references. This is what Too Big to Know calls “networked facts.” Those references serve at least three purposes: 1. They let us judge the reliability of the fact. 2. They give us a pointer out into the endless web of facts and references. 3. They remind us that facts are not where the human responsibility for truth ends.

Tags:

[tech@state][2b2k] Real-time awareness

At the Tech@State conf, a panel is starting up. Participants: Linton Wells (National Defense U), Robert Bectel (CTO, Office of Energy Efficiency), Robert Kirkpatrick (Dir., UN Global Pulse), Ahmed Al Omran (NPR and Saudi blogger), and Clark Freifeld (HealthMap.org).

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.

Robert Bectel brought in Netvibes.com [I use NetVibes as my morning newspaper.] to bring real-time info to his group’s desktop. It’s customized to who they are and what they do. They use Netvibes as their portal. They bring in streaming content, including YouTube and Twitter. What happens when people get too much info? So, they’re building analytics so people get info summarized in bar charts, etc. Even video analytics, analyzing video content. They asked what people wanted and built a food cart tracker. Or the shuttle bus. Widgets bring functionality within the window. They’re working on single sign-on. There’s some gamification. They plan on adding doc mgt, SharePoint access, links to Federal Social Network.

Even better, he says, is that the public now can get access to the “wicked science” the DOE does. Make the data available. Go to IMBY, put in your zip code, and it will tell you what your solar resource potential is and the tax breaks you’ll get. “We’re going to put that in your phone.” “We’re creating leads for solar installers.” And geothermal heat pumps.

Robert Kirkpatrick works in the UN Secretary General’s office, in a unit called Global Pulse, which is an R&D lab trying to learn to take advantage of Big Data to improve human welfare. Now “We’re swimming in an ocean of real time data.” This data is generated passively and actively. If you look at what people say to one another and what people actually do, “we have the opportunity to look at these as sensor networks.” Businesses have been doing this for a long time. Can we begin to look at the patterns of data when people lose their job, get sick, pull their kids out of school to make ends meet? What patterns appear when our programs are working? Global Pulse is working with the private sector as well. Robert hopes that big data and real-time awareness will enable them to move from waterfall development (staged, slow) to agile (iterative, fast).

Ahmed Al Omran says last year was a moment he and so many in the Middle East had been hoping for. He started blogging (SaudiJeans) seven years ago, even though the gov’t tried to silence him. “I wasn’t afraid because I knew I wasn’t alone.” He was part of a network of activists. Arab Spring did not happen overnight. “Activists and bloggers had been working together for ten years to make it happen.” “There’s no question in my mind that the Internet and social media played a huge role in what happened.” But there is much debate. E.g., Malcolm Gladwell argued that these revolutions would have happened anyway. But no one debates whether the Net changed how journalists covered the story. E.g., Andy Carvin live-tweeted the revolutions (aggregating and disseminating). Others, too. On Feb. 2, 2011, Andy tweeted 1,400 times over 20 hours.

So, do we call this journalism? Probably. It’s a real-time news gathering operation happening in an open source newsroom. “The people who follow us are not our audience. They are part of an open newsroom. They are potential sources and fact-checkers.” E.g., the media carried a story during the war in Libya that the Libyan forces were using Israeli weapons. Andy and his followers debunked that in real time.

There is still a lot of work to do, he says.

Clark Freifeld is a cofounder of HealthMap, doing real time infectious disease tracking. He shows a chart of the stock price of a Chinese pharma that makes a product that’s believed to have antiviral properties. In Jan 2003, there was an uptick because of the beginning of SARS, which was not identified until Feb 2003. In traditional public health reporting, there’s a hierarchy. In the new model, the connections are much flatter. And there are many more sources of info, from tweets that are fast but tend to have more noise, to slower but more validated sources.

To organize the info better, in 2006 they created a real-time mapping dashboard (free and open to the public). They collect 2000 reports a day, geotagged to 10,000 locations. They use named entity extraction to find diseases and locations. A Bayesian filtering system categorizes reports with 91% accuracy. They assign significance to each event. The ones that make it through this filter make it to the map. Humans help to train the system.
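
The talk doesn’t include HealthMap’s actual code, so here is only a generic sketch of what a Bayesian text filter for candidate reports can look like, using scikit-learn; the example reports and labels are invented placeholders, not HealthMap data:

```python
# Generic sketch of Bayesian filtering of disease reports (not HealthMap's code).
# Training reports and labels here are invented placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

reports = [
    "Officials confirm new cholera cases after flooding",
    "Hospital reports cluster of unexplained respiratory illness",
    "Local festival draws record crowds this weekend",
    "City council debates new parking rules",
]
labels = ["outbreak", "outbreak", "not_relevant", "not_relevant"]

# Bag-of-words features feeding a naive Bayes classifier.
classifier = make_pipeline(CountVectorizer(), MultinomialNB())
classifier.fit(reports, labels)

new_report = "Clinic sees spike in patients with high fever and rash"
print(classifier.predict([new_report])[0])
```

In a real pipeline the human corrections mentioned above would be fed back in as additional labeled examples to retrain the filter.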

During the H1N1 outbreak, they decided to create participatory epidemiology. They launched an iPhone app called “Outbreaks Near Me” that let people submit reports as well as get alerts; it became the #1 health and fitness app. They found that the rate of submissions tracked well with the CDC’s info. See also FluNearYou.org.

Linton Wells now moderates a discussion:

Robert Bectel: DOE is getting a serious fire hose of info from the grid, and they don’t yet know what to do with it. So they’re thinking about releasing the 89B data points and asking the public what they want to do with it.

Robert Kirkpatrick: You need the wisdom of crowds, the instinct of experts, and the power of algorithms [quoting someone I missed]. And this flood of info is no longer a one-way stream; it’s interactive.

Ahmed: It helps to have people who speak the language and know the culture. But tech counts too: How about a Twitter client that can detect tweets coming from a particular location? It’s a combo of both.

Clark: We use this combined approach. One initiative we’re working on builds on our smartphone app by letting us push questions out to people in a location where we have a suspicion that something is happening.

Linton: Security and verification?

Robert K: Info can be exploited, so this year we’re bringing together advisers on privacy and security.

Ahmed: People ask how you can trust random people to tell the truth, but many of them are well known to us. We use standard tools of trust, and we’ll also see who they’re following on Twitter, who’s following them, etc. It’s real-time verification.

Clark: In public health, the ability to get info is much better with an open Net than the old hierarchical flow of info.

Q: Are people trying to game the system?
A: Ahmed: Sure. GayGirlInDamascus turned out to be a guy in Moscow. But using the very same tools we managed to figure out who he was. But gov’ts will always try to push back. The gov’ts in Syria and Bahrain hired people to go online to change the narrative and discredit people. It’s always a challenge to figure out what’s the truth. But if you’ve worked in the field for a while, you can identify trusted sources. We call this “news sense.”
A: Clark: Not so much in public health. When there have been imposters and liars, there’s been a rapid debunking using the same tools.

Q:What incentives can we give for opening up corporate data?
A: Robert K: We call this data philanthropy but the private sector doesn’t see it that way. They don’t want their markets to fall into poverty; it’s business risk mitigation insurance. So there are some incentives there already.
A: Robert B: We need to make it possible for people to create apps that use the data.

Q: How about that Twitter censorship policy?
A: Ahmed: It’s censorship, but the way Twitter approached this was transparent, and some say it is good for activists because Twitter could have gone for a broader censorship policy; instead, Twitter will only block in the country that demands it. In fact, Twitter lets you get around it by changing your location.

Q: How do we get Netvibes past the security concerns?
A: Robert B.: I’m a security geek. But employees need tools to be smarter. But we can define what tools you have access to.

Q: Clark, do you run into privacy issues?
A: Clark: Most of the data in HealthMap comes from publicly available sources.
A: Robert K: There are situations arising for which we do not have a framework. A child protection expert had just returned from a crisis where young kids on a street were tweeting about being abused at home. “We’re not even allowed to ask that question,” she said, “but if they’re telling the entire world, can we use that to begin to advocate for their rescue?” Our frameworks have not yet adapted to this new reality.

Linton: After the Arab Spring, how do we use data to help build enduring value?
A: Ahmed: It’s not the tech but how we use it.
A: Robert K: Real time analytics and visualizations provide many-to-many communications. Groups can see their beliefs, enabling a type of self-awareness not possible before. These tools have the possibility of creating new types of identity.
A: Robert B: To get twitter or Facebook smarter, you have to find different ways to use it. “Break it!” Don’t get stuck using today’s tech.

Linton: A 26-year-old Al Jazeera reporter was asked at a conference, “What’s the next big thing?” She replied, “I’m too old. Ask a high school student.”

Tags:

[2b2k] Big data, big apps

From Gigaom, five apps that could change Big Data.

Tags: