DMC3 05: Data Sustainability

Wed 6 Oct, 13:00–15:30

Intro to Data & Information R/evolution

Sustainability: the capacity to endure.

(1) ecological, (2) economic, (3) social, and (4) cultural sustainability.

“pattern of resource use that aims to meet human needs while preserving the environment so that these needs can be met not only in the present, but also for generations to come.”

“It is people and computers who collect data and impose patterns on it. These patterns are seen as information which can be used to enhance knowledge. These patterns can be interpreted as truth, and are authorized as aesthetic and ethical criteria. Events that leave behind perceivable physical or virtual remains can be traced back through data. Marks are no longer considered data once the link between the mark and observation is broken.”

“Raw data refers to a collection of numbers, characters, images or other outputs from devices that convert physical quantities into symbols, that are unprocessed. Such data is typically further processed by a human or input into a computer, stored and processed there, or transmitted (output) to another human or computer (possibly through a data cable).”
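The distinction between raw device output and processed data can be sketched in a few lines. The sensor values and byte format below are invented for illustration, not taken from any real device:

```python
import struct

# Hypothetical example: a device emits temperature readings as raw,
# unprocessed bytes -- little-endian 16-bit integers in tenths of a degree.
raw = b"\x0b\x01\xf2\x00\x18\x01"

# Processing imposes a pattern on the marks: every two bytes become
# one symbol (a number) that a human or another program can interpret.
readings = [n / 10.0 for (n,) in struct.iter_unpack("<h", raw)]
print(readings)  # [26.7, 24.2, 28.0]
```

The bytes alone carry no meaning; only the agreed-upon pattern (two bytes per reading, tenths of a degree) turns them into information.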

Internet World Stats

Internet Traffic


Information R/evolution by Mike Wesch


Digital Formats

Sustainability factors given by Digital Preservation (Library of Congress, USA):

“Sustainability factors apply across digital formats for all categories of information.

The seven factors listed below influence the feasibility and cost of preserving content in the face of future changes to the technological environment in which users and archiving institutions operate.

These factors are significant whatever strategy is adopted as the basis for future preservation actions: migration to new formats, emulation of current software on future computers, or a hybrid approach.

Some important additional considerations, e.g., matters pertaining to the authenticity of a digital item, are attributes of the systems used to manage digital content and not of the content format itself.

Disclosure
Degree to which complete specifications and tools for validating technical integrity exist and are accessible to those creating and sustaining digital content. A spectrum of disclosure levels can be observed for digital formats. What is most significant is not approval by a recognized standards body, but the existence of complete documentation.

Adoption
Degree to which the format is already used by the primary creators, disseminators, or users of information resources. This includes use as a master format, for delivery to end users, and as a means of interchange between systems.

Transparency
Degree to which the digital representation is open to direct analysis with basic tools, such as human readability using a text-only editor.
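A rough illustration of that transparency test, as a heuristic sketch only (the UTF-8 assumption and the printable-character threshold are my own, not part of the Library of Congress factors):

```python
def looks_human_readable(data: bytes, threshold: float = 0.95) -> bool:
    """Crude transparency check: does this byte stream decode as text?"""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return False  # opaque binary: fails the basic-tools test
    if not text:
        return True
    # Count characters a text-only editor would display sensibly.
    printable = sum(ch.isprintable() or ch in "\n\r\t" for ch in text)
    return printable / len(text) >= threshold

print(looks_human_readable(b"name,year\nMARC,1968\n"))   # True  (plain text)
print(looks_human_readable(b"\x89PNG\r\n\x1a\n\x00"))    # False (binary header)
```

A CSV file passes the check with any text editor; a PNG fails immediately, which is exactly the difference the transparency factor is pointing at.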

Self-documentation
Self-documenting digital objects contain basic descriptive, technical, and other administrative metadata.

External Dependencies
Degree to which a particular format depends on particular hardware, operating system, or software for rendering or use and the predicted complexity of dealing with those dependencies in future technical environments.

Impact of Patents
Degree to which the ability of archival institutions to sustain content in a format will be inhibited by patents.

Technical Protection Mechanisms
Implementation of mechanisms such as encryption that prevent the preservation of content by a trusted repository”


Internet Archive

Wayback Machine
“Browse through over 150 billion web pages archived from 1996 to a few months ago.”
K-12 Web Archiving Program:
“If you were a K-12 student, which websites would you want to save for future generations? What would you want people to look at 50 or even 500 years from now?”


Google Data farms

“Estimates of the power required for over 450,000 servers range upwards of 20 megawatts, which cost on the order of US$2 million per month in electricity charges. The combined processing power of these servers might reach from 20 to 100 petaflops.[9]
Upwards of 15,000 servers[2] ranging from 533 MHz Intel Celeron to dual 1.4 GHz Intel Pentium III (as of 2003[update]). A 2005 estimate by Paul Strassmann has 200,000 servers,[10] while unspecified sources claimed this number to be upwards of 450,000 in 2006.[11]
One or more 80 GB hard disks per server (2003)
2–4 GB of memory per machine (2004)
The exact size and whereabouts of the data centers Google uses are unknown, and official figures remain intentionally vague. In a 2000 estimate, Google's server farm consisted of 6,000 processors, 12,000 common IDE disks (2 per machine, and one processor per machine), at four sites: two in Silicon Valley, California and one in Virginia”

“In February 2009, Stora Enso announced that they had sold the Summa paper mill in Hamina, Finland to Google for 40 million euros.[18][19] Google plans to invest 200 million euros in the site to build a data center. For Google, the reason to choose this location was the availability of renewable energy close by”


Centralising Personal Data & Social Networks

OpenID: decentralised, “not owned by anyone”

"You may choose to associate information with your OpenID that can be shared with the websites you visit, such as a name or email address. With OpenID, you control how much of that information is shared with the websites you visit.”

“With OpenID, your password is only given to your identity provider, and that provider then confirms your identity to the websites you visit. Other than your provider, no website ever sees your password, so you don’t need to worry about an unscrupulous or insecure website compromising your identity.”

“Accelerate Sign Up Process at Your Favorite Websites
Reduce Frustration Associated with Maintaining Multiple Usernames and Passwords
Gain Greater Control Over Your Online Identity
Minimize Password Security Risks”


DataPortability Project

“Data portability enables a borderless experience, where people can move easily between network services, reusing data they provide while controlling their privacy and respecting the privacy of others.”

For the user:

With data portability, you can bring your identity, friends, conversations, files and histories with you, without having to manually add them to each new service. Each of the services you use can draw on this information relevant to the context. As your experiences accumulate and you add or change data, this information will update on other sites and services if you permit it, without having to revisit others to re-enter it.

For the Service Provider:

With cross-system data access, interoperability, and portability, people can bring their identities, friends, conversations, files, and histories with them to your service, cutting down on the need for form-filling which can drive people away. With minimal effort on the part of new customers, you can tailor services to suit them. When your customers browse networked services and accumulate experiences, this information can update on your service, if people permit it. Your relationship remains up-to-date and you can adapt your services in response, even when they don't visit. With mutual control and mutual benefit, your relationships remain relevant, encouraging continued usage.
Data portability is a new approach, where it is easier to use and deliver services. This frictionless movement through the network of services fosters stronger relationships between people and service providers and helps build a healthy networked ecosystem.


WATCH [09.58]
The Social Network Privacy Mess: Why we need the Social Web

Data Silos – Difficult to join and connect.

“Your data in someone else's hands”


Open Graph Protocol (Facebook, 2010)

“The Open Graph protocol enables you to integrate your Web pages into the social graph. It is currently designed for Web pages representing profiles of real-world things — things like movies, sports teams, celebrities, and restaurants. Including Open Graph tags on your Web page makes your page equivalent to a Facebook Page. This means when a user clicks a Like button on your page, a connection is made between your page and the user. Your page will appear in the "Likes and Interests" section of the user's profile, and you have the ability to publish updates to the user. Your page will show up in the same places that Facebook pages show up around the site (e.g. search), and you can target ads to people who like your content. The structured data you provide via the Open Graph Protocol defines how your page will be represented on Facebook.”
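Open Graph data is carried as <meta> tags in a page's <head>. A minimal sketch of reading them with Python's standard-library HTML parser — the sample page below is invented for illustration, though og:title, og:type, and og:url are genuine properties from the protocol:

```python
from html.parser import HTMLParser

class OGTagParser(HTMLParser):
    """Collect Open Graph <meta property="og:..." content="..."> tags."""
    def __init__(self):
        super().__init__()
        self.og = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        d = dict(attrs)
        prop = d.get("property", "")
        if prop.startswith("og:") and "content" in d:
            self.og[prop] = d["content"]

page = """<html><head>
<meta property="og:title" content="The Rock" />
<meta property="og:type" content="movie" />
<meta property="og:url" content="http://example.com/rock" />
</head><body></body></html>"""

parser = OGTagParser()
parser.feed(page)
print(parser.og["og:title"])  # The Rock
```

This is the same structured data Facebook's crawler reads to decide how a liked page is represented in the social graph.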

Zuckerberg: “We Are Building A Web Where The Default Is Social”

Facebook Further Reduces Your Control Over Personal Information (April 2010)



Semantic Web

web 3.0

WATCH [14.25]


Decentralised social networking alternatives...

TPB AFK Documentary


The Pirate Bay (May 2010)


Kiosk of Piracy (September 2009)

“Dear users and abusers, dear Elders of the Internet,

the Kiosk of Piracy is proud to announce the launch of “The Pirate Kiosk”! From last night on, a copy of the infamous Pirate Bay is available to the public, but – here comes the catch – offline-only. Yes, offline: the Kiosk is not connected to the Internet in any way, but the interested public is invited to use the service in a wifi radius around it.
With our newest project, we are joining the work of the dear people and groups which managed to duplicate the contents of The Pirate Bay in other places on the Net. We want to show in a very physical way that the Internet is neither a machine nor controllable in any way – it is just a system of agreements which work in any circumstances. We don’t need the Internet – the magic can happen anywhere.”


“The privacy aware, personally controlled, do-it-all, open source social network.”

“free, open source, distributed micro-blogging platform. If you're tired of being locked in to one micro-blogging platform or a single social network, or you're weary of corporations hijacking your updates in the pursuit of money, then thimbl is for you.”

“a way for people and organisations to publish richer information themselves, without having to rely upon centralized services”


Analog media as (anti-) Social Networking

Florian Cramer:

'Analog media' is, strictly speaking, a colloquialism, since all storage and transmission media are analog (electricity, conductors, waves, light, magnetized metal, etc.) and only information can be digital. What we commonly call 'analog media' are systems that do not transmit or store information by coding it into countable, discrete entities.


Open data

“Open Data is a philosophy and practice requiring that certain data are freely available to everyone, without restrictions from copyright, patents or other mechanisms of control. It has a similar ethos to a number of other "Open" movements and communities such as Open Source and Open access.”

“Arguments made on behalf of Open Data include:

"Data belong to the human race." Typical examples are genomes, data on organisms, medical science, environmental data.
Public money was used to fund the work, and so it should be universally available.
It was created by or at a government institution (this is common in US National Laboratories and government agencies).
Facts cannot legally be copyrighted.
Sponsors of research do not get full value unless the resulting data are freely available.
Restrictions on data re-use create an anticommons.
Data are required for the smooth process of running communal human activities (map data, public institutions).
In scientific research, the rate of discovery is accelerated by better access to data.”

Several intentional or unintentional mechanisms exist for restricting access to or re-use of data. They include:
compilation in databases or websites to which only registered members or customers can have access;
use of a proprietary or closed technology or encryption which creates a barrier for access;
copyright forbidding (or obfuscating) re-use of the data;
license forbidding (or obfuscating) re-use of the data (such as share-alike or non-commercial);
patent forbidding re-use of the data (for example, the 3-dimensional coordinates of some experimental protein structures have been patented);
restriction of robots to websites, with preference to certain search engines;
aggregating factual data into "databases" which may be covered by "database rights" or "database directives" (e.g. the Directive on the legal protection of databases);
time-limited access to resources such as e-journals (which on traditional print were available to the purchaser indefinitely);
"webstacles", or the provision of single data points as opposed to tabular queries or bulk downloads of data sets.”


Open Knowledge Foundation

Example of cartographic map data & practice of gathering alternative data sources

Open Street Map


Data Longevity

Quote by Ian Davis:
"data outlasts code which lead[s] me to then assert that therefore open data is more important than open source.”

[He] did not say that code does not last nor that algorithms do not last, but

“Of course they last, but data lasts longer. My point was that code is tied to processes usually embodied in hardware whereas data is agnostic to the hardware it resides on.

The audience at the conference understand this already: they are archivists and librarians and they deal with data formats like MARC which has had superb longevity. Many of them deal with records every day that are essentially the same as they were two or three decades ago. Those records have gone through multiple generations of code to parse and manipulate the data.

It’s true that you need code to access data, but critically it doesn’t have to be the same code from year to year, decade to decade, century to century.”


Data Useful Now in Finland

Open Gov Finland

Apps for Democracy (Mindtrek 10.2009)

Jyrki Kasvi's greetings to Apps for Democracy Finland participants (2009)

Finnish Open Data Ecosystem Facebook Group


Journalism in the age of Data

“Journalists are coping with the rising information flood by borrowing data visualization techniques from computer scientists, researchers and artists.

Some newsrooms are already beginning to retool their staffs and systems to prepare for a future in which data becomes a medium.

But how do we communicate with data, and how can traditional narratives be fused with sophisticated, interactive information displays?”

WATCH [53.57]


Net Neutrality

Long Live the Web: A Call for Continued Open Standards and Neutrality
"The Web is critical not merely to the digital revolution but to our continued prosperity—and even our liberty. Like democracy itself, it needs defending"
By Tim Berners-Lee November 22, 2010

Tim Berners-Lee: Facebook could fragment web

'Humanity Lobotomy - Second Draft' by Arin Crumley (2006)

Net Neutrality Watchdog Group Uses Google, Facebook Ads To Attack Google

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License