Searching information on the Internet: legal implications
Julià Minguillón, Department of IT, Multimedia and Telecommunications
Tim Berners-Lee created the World Wide Web on a structure and protocols that rely on linking. URLs (or URIs) identify documents that can be found on the Internet, and the links between them form a directed graph: A points to B, but we (usually) cannot walk the inverse way. The link is not reversible: going from B to A requires another link, because the initial A-to-B link does not serve this purpose.
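The one-way nature of links can be illustrated with a small sketch. The page names and the `links` dictionary below are hypothetical; the point is only that following a link is a direct lookup, while finding who links to a page means inspecting every other page.

```python
# Toy link graph: each page maps to the pages it links to (hypothetical names).
links = {
    "A": ["B"],
    "B": ["C"],
    "C": [],
}

def outgoing(page):
    """Pages reachable by following one link from `page`."""
    return links.get(page, [])

def incoming(page):
    """Pages that link to `page`; this requires scanning every page's links."""
    return [src for src, targets in links.items() if page in targets]

print(outgoing("A"))  # ['B']
print(outgoing("B"))  # ['C']: nothing here leads back to A
print(incoming("B"))  # ['A']: only known after inspecting the whole graph
```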
There are two main strategies to explore the Internet and find information within: browsing and searching.
One of the “problems” of the Internet is that, as a graph, it has no centre: there is no place that can be considered its beginning.
There are some initiatives to map the Internet and index it (like the Open Directory Project), but the speed of growth of the Internet has made them difficult to maintain… and even to use.
A search engine relies on three components:
- a web crawler explores the Internet, retrieving information about the content and the structure of a web site;
- an index is created where the information is listed and categorized, and
- a query manager enables the user to ask the index and retrieve the desired information.
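The three components listed above can be sketched on a toy scale. This is a minimal illustration, not how any real search engine works: the `PAGES` dictionary stands in for the web, and all URLs and texts are made up.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (text, outgoing links).
PAGES = {
    "site/home": ("welcome to the museum", ["site/hours", "site/art"]),
    "site/hours": ("the museum closes at 20:00", ["site/home"]),
    "site/art": ("paintings and sculpture", []),
    "site/hidden": ("unlinked page about paintings", []),
}

def crawl(start):
    """Crawler: breadth-first visit of every page reachable from `start`."""
    seen, queue, fetched = {start}, deque([start]), {}
    while queue:
        url = queue.popleft()
        text, links = PAGES[url]
        fetched[url] = text
        for link in links:
            if link in PAGES and link not in seen:
                seen.add(link)
                queue.append(link)
    return fetched

def build_index(fetched):
    """Index: map each word to the set of URLs whose text contains it."""
    index = {}
    for url, text in fetched.items():
        for word in text.split():
            index.setdefault(word, set()).add(url)
    return index

def query(index, word):
    """Query manager: look a single word up in the index."""
    return sorted(index.get(word.lower(), set()))

index = build_index(crawl("site/home"))
print(query(index, "museum"))     # ['site/home', 'site/hours']
print(query(index, "paintings"))  # ['site/art']
```

Note that `site/hidden` never shows up in the results: no page links to it, so the crawler never reaches it.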
Web crawlers can only visit pages that are linked. Besides unlinking, ways to prevent web crawlers from exploring a web site include protection by username/password, CAPTCHAs, exclusion protocols (e.g. robots.txt files), etc.
Protocol of exclusion (robots.txt):
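As a rough sketch of how an exclusion file works, the snippet below uses Python's standard `urllib.robotparser` module. The robots.txt content, the paths, the `ExampleBot` agent name and the `example.org` URLs are all hypothetical. The check only has an effect if the crawler chooses to run it before fetching a page.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the paths are made up for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /drafts/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(url, agent="ExampleBot"):
    """True if the parsed rules permit `agent` to fetch `url`."""
    return parser.can_fetch(agent, url)

print(allowed("http://example.org/index.html"))      # True
print(allowed("http://example.org/private/report"))  # False

# The file itself is readable by anyone, so the excluded paths are visible.
disclosed = [
    line.split(":", 1)[1].strip()
    for line in ROBOTS_TXT.splitlines()
    if line.lower().startswith("disallow:")
]
print(disclosed)  # ['/private/', '/drafts/']
```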
- Has to be public;
- Indication, not compulsory;
- Discloses sensitive information;
- Google hack: intitle:index.of robots.txt
- Search engines find sensitive information.
- Content and links are different things. A linked content might not be in the same place as the source content where the link is published.
- Users can link to sensitive information/contents.
- Broken links and permalinks: content might be moved but engines/users might track and re-link that content.
- Outdated versions (cache): to avoid repeated visits, search engines save old versions of sites (caches), which remain available for some time even if some content has been deleted.
- Software vulnerabilities:
- Browsing patterns (case of AOL): what a user does on the Internet can be tracked and reveal personal information.
Nowadays, most ways to remain anonymous on the Internet consist of opting out of services like web crawling by search engines.
With the Web 2.0 things become more complicated. Initially, “all” content originated from the “owner” of a website: you needed hosting and had to manage the site directly. When everyone can create or share content in a very easy and immediate way, the relationship server/hosting-manager-content is not as straightforward as it used to be.
Linking and tagging complicate the landscape even more. And with the upcoming semantic web, cross-searching and crossing data from different sources can make it easy to retrieve complex information and find out really sensitive information.
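A minimal sketch of what crossing data means in practice: two sources that look harmless on their own are joined on a shared field. All names, tags and records below are invented for illustration only.

```python
# Two hypothetical, individually harmless data sets sharing a username.
photo_tags = [
    {"user": "jdoe", "tag": "hospital", "date": "2009-03-02"},
    {"user": "asmith", "tag": "beach", "date": "2009-03-05"},
]
profiles = [
    {"user": "jdoe", "employer": "Acme Corp"},
    {"user": "asmith", "employer": "Example Ltd"},
]

def cross(tags, people):
    """Join both sources on the shared `user` field."""
    by_user = {p["user"]: p for p in people}
    return [{**t, **by_user[t["user"]]} for t in tags if t["user"] in by_user]

for row in cross(photo_tags, profiles):
    print(row)
```

Each joined row now ties a tag and a date to an employer, which neither source revealed by itself.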
- Users demand more and more services and are willing to give their privacy away for a handful of candies.
- Personalization is often in a trade-off relationship with privacy, and people demand more personalization.
- Opt-in should be the default, but it raises barriers to quick access to sites/services, hence opt-out is the default.
- An increasing trend in egosurfing and the pursuit of e-stardom is accompanied by an increasing trail of data left behind by users.
- The creator of content
- The uploader
- The one who links
- The one who tags
- Search engines
- End users
- Social networking sites
Ramon Casas points at Google's cache: while not strictly necessary to run the search engine, it represents an illegal copy of and/or access to content that (in many cases) has been removed from its original website. In his example:
the museum closes at 20:00 but Google leaves the back door open until 22:00.
Bruce Kasanoff (2001). Making It Personal: How to Profit from Personalization without Invading Privacy. See a review by Julià Minguillón at UOC Papers.