ICTlogy


Data Analysis and Visualization: The 15M Movement and Other Case Studies

By Ismael Peña-López (@ictlogist), 01 October 2012
Main categories: ICT4D
Other tags: 15mdata, big data

Notes from the research seminar Data Analysis and Visualization: The 15M Movement and Other Case Studies, organized by the Internet Interdisciplinary Institute in Barcelona, Spain, on October 1st 2012.

Presenters: Javier Toret, Pablo Aragón and Oscar Marín, members of the group Datanalysis15M.

The research group #datanalysis15m was created to analyse the new movements that emerged in 2011: Arab Spring, Spanish Indignants, Occupy, etc. The main questions being: How can we measure augmented events? How can we measure new ways of organization, of communication, of engagement? How do ideas spread (virally)? How can we characterize network-systems?

About data

In 1969 ARPANET was born as a packet-switching network, which implied a major improvement in communications. With the World Wide Web in 1990, users could consume information passively online through web browsers, and circa 2004 the Web 2.0 was born, where the consumer also becomes a producer. All this activity is increasingly being traced and produces huge amounts of data. This is yet another evolution of the Internet, which has been called Big Data.

There are many implications in the generation of such a big amount of data: privacy, security, commoditization of uses and users’ behavior, dematerialization of the economy, information overload, economics of attention, neuromarketing, etc.

Michael Cooley (Architect or Bee?) distinguishes between data; information, which is organized data; knowledge, which is comprehended and applied information; and wisdom, which is knowledge put at the service of achieving some specific goals. Wisdom cannot be transmitted and always carries an ethical connotation.

After data acquisition, data analysis is crucial to be able to transform data into information: understanding and structuring data is the core of the information-building process. Last, but still very important, information can be presented in several ways, in what has been called information visualization.

How to organize information:

Location: maps, dynamic maps and animations.
Alphabet: lists of words, tag clouds.
Time: overlapping layers of data along time (to infer correlation or even causality, depending on what comes first in time), timelines (how data evolves along time).
Category: allows clustering of information and identification of groups.
Hierarchy: relationships of power/importance between different sets of data.

Technology shapes moods that engage people to act. If we can tell how mood is shaped by technology, or how technology can help in mood-shaping, then technology can help in choosing the appropriate time to invite and engage people to participate.

Engagement is also related to language: the use of the first person plural is much more engaging and viral than the alternatives. “We are”, “we can”, etc. have way more punch than “I am” or “they can”.

Network or data laws:

First law of preferential connection: the more connections a node has, the more likely it is to gain more connections.
Metcalfe’s law: the value of a network increases in proportion to the square of the number of its users. Behind this law we can find the concept of critical mass: how many propagations have to take place in a network before it explodes.
The power of histograms: while populations usually follow a normal distribution, histograms usually do not, as people lie in surveys. Thus, adjustments have to be made and caveats must be taken into account.
Dunbar’s number: the cognitive limit on the number of people with whom one can maintain stable relationships is circa 150. Beyond this number, it is very difficult to sustain quality relationships between people. We can see this in social networking sites: even if people have several hundreds (or thousands) of acquaintances, stable relationships happen in the 100-200 contacts range.
Power law: big head vs. long tail. Popularity, power, etc. are not evenly distributed, but follow a power-law curve. This complements Pareto’s law (20% of products represent 80% of sales): the long tail can get thicker depending on the density of connections in the network, reaching up to 50% of total sales (in this case).
Zipf’s law: the distribution of the words in a text also follows a power law. 80% of the words in a generic text are irrelevant from a semantic point of view.
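Two of the laws above can be illustrated with a few lines of Python. This is only a minimal sketch: the function names and the sample sentence are hypothetical, and a toy sentence is far too short to show an actual power law; it only shows how a rank-frequency table is built.

```python
from collections import Counter


def metcalfe_value(users):
    """Value proxy under Metcalfe's law: proportional to users squared."""
    return users ** 2


def rank_frequency(text):
    """Return (word, count) pairs, most frequent first."""
    words = text.lower().split()
    return Counter(words).most_common()


# Doubling the number of users quadruples the value proxy.
ratio = metcalfe_value(200) / metcalfe_value(100)

# Hypothetical toy text, used only to build a rank-frequency table.
sample = "we are here and we can and we will"
ranking = rank_frequency(sample)

print(ratio)
print(ranking[:2])
```

On a real corpus, plotting the counts in `ranking` against their rank on log-log axes is the usual way to check whether the distribution looks like a power law.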

Network Analysis

Network Analysis is deeply rooted in Graph Theory models.

Types of social relationships:

Directed: the social relationship is not bi-directional.
Non-directed: the social relationship is bi-directional.
Explicit: relationships are explicitly stated.
Implicit: relationships are built through data analysis.

Average distance: the average number of intermediaries between any two nodes.

The diameter (or effective diameter) of the network is the distance between the two nodes that are farthest apart. The diameter usually decreases as the network grows (Leskovec, 2007).
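As a rough illustration, both measures can be computed with a breadth-first search from every node. This sketch assumes a small, connected, non-directed network stored as an adjacency dictionary (the toy network and the function names are hypothetical). It counts distance in links, which is the number of intermediaries plus one, and it computes the plain diameter, not the effective diameter.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest distance, in links, from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adj[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def average_distance_and_diameter(adj):
    """Average and maximum shortest distance over all reachable pairs."""
    total, pairs, diameter = 0, 0, 0
    for source in adj:
        for target, d in bfs_distances(adj, source).items():
            if target != source:
                total += d
                pairs += 1
                diameter = max(diameter, d)
    return total / pairs, diameter


# Hypothetical toy network: a chain a - b - c - d.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
avg, diam = average_distance_and_diameter(adj)
print(avg, diam)
```

Running one search per node is fine for small examples; for networks with millions of nodes, sampling a subset of sources is a common compromise.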

Density: the proportion of actual links in a network relative to the total number of possible links.
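A minimal sketch of the calculation, assuming no self-links and at most one link per pair of nodes (the function name and the example figures are hypothetical):

```python
def density(num_nodes, num_links, directed=False):
    """Share of possible links that actually exist in the network."""
    possible = num_nodes * (num_nodes - 1)
    if not directed:
        # In a non-directed network each pair of nodes counts only once.
        possible //= 2
    return num_links / possible if possible else 0.0


# A non-directed network with 4 nodes and 3 links.
print(density(4, 3))
# The same counts read as a directed network.
print(density(4, 3, directed=True))
```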

Giant component: the biggest connected component in a network. Outside of the giant component, groups are very small.
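One way to find the giant component, sketched here under the assumption of a non-directed network stored as an adjacency dictionary (the toy network is hypothetical), is to collect all connected components and keep the largest:

```python
def connected_components(adj):
    """Return the connected components of a non-directed network as sets."""
    seen = set()
    components = []
    for start in adj:
        if start in seen:
            continue
        component = set()
        stack = [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbour in adj[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(component)
    return components


# Hypothetical toy network: a triangle, a pair and an isolated node.
adj = {1: [2, 3], 2: [1, 3], 3: [1, 2], 4: [5], 5: [4], 6: []}
components = connected_components(adj)
giant = max(components, key=len)
print(giant, len(components))
```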

Clustering coefficient: measures the density of connections between the neighbours of a node, that is, the probability that two connections of a node are also connected to each other. Clusters are linked to one another through weak ties (Granovetter, 1983). Weak ties foster serendipity: they have a higher potential to expose people to information that they would otherwise never discover.
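The local version of this measure can be sketched as follows, assuming a non-directed network stored as a dictionary of neighbour sets (the names and the toy network are hypothetical). Nodes with fewer than two neighbours are given a coefficient of zero here, which is a convention rather than a necessity.

```python
from itertools import combinations


def local_clustering(adj, node):
    """Fraction of pairs of a node's neighbours that are linked together."""
    neighbours = adj[node]
    k = len(neighbours)
    if k < 2:
        return 0.0
    linked = sum(1 for u, v in combinations(neighbours, 2) if v in adj[u])
    return linked / (k * (k - 1) / 2)


# Hypothetical toy network: a triangle a-b-c plus d hanging from a.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
print(local_clustering(adj, "a"))
print(local_clustering(adj, "b"))
```

Averaging this value over all nodes gives one common network-level clustering coefficient.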

Reciprocity of a directed network measures how many of these relationships are really bi-directional.
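A minimal sketch, assuming reciprocity is taken as the share of directed links whose reverse link also exists (one common definition) and that there are no self-links; the user names are hypothetical:

```python
def reciprocity(edges):
    """Share of directed links (u, v) for which (v, u) also exists."""
    edge_set = set(edges)
    if not edge_set:
        return 0.0
    mutual = sum(1 for (u, v) in edge_set if (v, u) in edge_set)
    return mutual / len(edge_set)


# Hypothetical follower links: only ana and ben follow each other.
edges = [("ana", "ben"), ("ben", "ana"), ("ana", "cat"), ("cat", "dan")]
print(reciprocity(edges))
```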

Assortativity, associated with homophily, is the preference for relationships between users with the same or different characteristics. If assortativity (r) is bigger than zero, the network is assortative (endogamic); if r < 0, the network is disassortative. Social networking sites tend to be assortative.
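The coefficient r can be computed for any characteristic of the nodes. The sketch below assumes the characteristic is the node degree and a non-directed network, and computes r as the Pearson correlation between the degrees found at the two ends of each link; the function name and the toy network are hypothetical.

```python
from math import sqrt


def degree_assortativity(edges):
    """Pearson correlation of the degrees at both ends of each link."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    # Count each non-directed link in both directions.
    xs, ys = [], []
    for u, v in edges:
        xs += [degree[u], degree[v]]
        ys += [degree[v], degree[u]]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Undefined when every node has the same degree.
        return float("nan")
    return cov / sqrt(var_x * var_y)


# Hypothetical star network: a hub linked to three low-degree nodes.
star = [("hub", "a"), ("hub", "b"), ("hub", "c")]
r = degree_assortativity(star)
print(r)
```

A star is the extreme disassortative case, since every link joins the best-connected node to a poorly connected one.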

Degree distribution: how connections are distributed among nodes. In scale-free networks, a small group of nodes have a high number of connections, followed by a long tail of nodes with a small number of connections.
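As an illustration, the degree distribution of a non-directed network can be tabulated by counting how many nodes have each degree. The names and the toy network below are hypothetical, and five nodes are of course far too few to say anything about scale-free behaviour.

```python
from collections import Counter


def degree_distribution(edges):
    """Map each degree to the number of nodes that have that degree."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return Counter(degree.values())


# Hypothetical toy network: one hub and a few weakly connected nodes.
edges = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("hub", "d"), ("a", "b")]
distribution = degree_distribution(edges)
for k in sorted(distribution):
    print(k, distribution[k])
```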

Discussion

There are different social networks that are but different layers of the same reality. The purpose of social network analysis is to try to understand one of these layers and how that specific layer feeds back into the other layers and into reality. One can usually find correlations between different networks and how they sync in emotions, contents, bodies.

It is also interesting to note that we increasingly see online behaviours being translated/transposed into “real life”. It is not that online networks (their composition) are replicated offline, but that the practices of sharing, communication, decentralization, etc. are also put into practice even without digital technologies, thus reshaping traditionally organized networks.

We also see how information, communications and contents in online networks transcend the platform and permeate other (offline) media, such as newspapers or TV news. Thus, even if the original network was not significantly representative of reality, the final message does get to a significantly representative share of people.

Another aspect to take into account is whether the behaviour of the users of a specific social networking site can be a good proxy to predict the general behaviour of the whole population, even if those users are not representative of it. While this might sound like a contradiction (a non-representative sample predicting the whole population), the key could be in how this sample shapes the agenda of the whole population and thus, in the short or medium term, ends up being a good proxy for prediction.
