© April 2006 François Bry. This article can be freely reproduced provided its author and source http://www.pms.ifi.lmu.de/ are cited.

Footprints in Cyberspace -- Research and Researchers' Visibility in the Google Age

François Bry
University of Munich, Germany
April 2006


This paper was triggered by questions frequently posed by fellow researchers, such as: "How can a researcher, or a research team, get a good Google rank?", "How can research results be made visible on the Web?", and "What should scientists publish on the Web?"
This paper aims at giving practical answers to these questions that are understandable without technical knowledge. It first argues that the cyberspace is a premier place in today's making of science. Then, it discusses the importance of scientific publishing in the cyberspace. Finally, it describes ways, in the author's opinion not unethical, to improve the Google rankings of scientific Web pages.


  1. Cyberspace: A Place where Nowadays Science is Made
  2. Importance of Scientific Publishing in the Cyberspace
  3. Research and Researchers' Visibility in the Cyberspace: Possibilities and Pitfalls
  4. Concluding Remarks

1. Cyberspace: A Place where Nowadays Science is Made

The Internet and the Web, as well as already existing 1 and forthcoming similar infrastructures, in short the 'cyberspace', have not only transformed our culture by revolutionizing written communication 2 and creating virtual marketplaces 3. They have also, and most prominently, radically transformed a researcher's work.

In disciplines such as Computer Science and Molecular Biology, to cite only two, most research papers published since the 1980s or 1990s are nowadays available online. Most of these papers are available at no or low cost and are easily found using general-purpose search engines like Google or Web portals specialized in a discipline or a research area.4 Furthermore, active researchers can easily be located, once again using general-purpose or specialized search engines, making direct, person-to-person communication possible within minutes or, in the case of locations in different time zones, within a few hours. Web-based communication platforms considerably help in reviewing conference 5 and journal papers, in holding virtual meetings, and in complementing telephone conversations with synchronous written texts. Data and scholarly texts needed for research purposes can be collected on the Web.

In disciplines where researchers are less addicted to techniques and computers, the cyberspace is steadily gaining in importance. Descriptions and/or photographs of art works are found on the Web pages not only of museums but also of auction houses and antiquarians; ongoing research is published on the Web, overcoming an isolation that, in the past, was a salient trait of some disciplines; papers too specialized, too 'heretical', or too innovative for traditional scientific media are made available by their authors on the Web, giving the authors and the papers a visibility they would otherwise hardly have.

Thus, in all disciplines, the Internet and the Web now make it possible to perform within minutes research tasks that, twenty years ago, would have required weeks or even months and, in some cases, would have made it necessary to travel to remote places. Undoubtedly, tools such as Web-based communication platforms, commonly used today by, among others, computer scientists, will sooner or later be widespread in other disciplines as well because they are easy to use and extremely helpful. Most certainly, new tools are going to emerge that will further enhance a researcher's working environment.

For scientists, communication in the cyberspace is not optional because timely communication in today's science is more important than it has ever been. Nowadays, it is as important for a scientist to communicate in the cyberspace as it was in the 19th century to attend the meetings of the local scientific society. Nowadays, it is as important for a scientist to leave footprints in the cyberspace as it is to regularly publish scholarly papers in scientific journals or conference proceedings. Footprints in the cyberspace are today the premier means for research contributions and for researchers to be noticed and acknowledged.

2. Importance of Scientific Publishing in the Cyberspace

The cyberspace offers researchers means for both collecting research material and getting feedback from fellow researchers with similar interests. Indeed, as soon as someone has made their research interests and results visible on the Web, it is possible for everyone everywhere to know about these interests and results, to establish contacts, and to join efforts. Tim Berners-Lee's seminal vision of the Web as a platform for exchanging research documents [1] has become reality.

What should researchers publish in the cyberspace? Basically, everything relevant to their research endeavors, e.g. research goals, working hypotheses, research project descriptions, sample data, research results including research papers, etc., and of course addresses.6

2.1 Timeliness and Impact

The publication of research papers in the cyberspace is desirable for two reasons: timeliness and impact.

Timeliness. It is often preferable to publish research papers on the Web or elsewhere in the cyberspace before publishing them in traditional media such as journals, books, or conference proceedings.7 Indeed, immediate publication in the cyberspace makes research papers available to the research community at once. In fast-evolving disciplines, Computer Science among others, immediate publication in the cyberspace is, except in rare cases, highly recommended because publication in traditional media often takes too long -- the more renowned the medium, the longer it might take. Furthermore, we should not forget that not all submissions -- not even all outstanding submissions -- are accepted for publication in peer-reviewed media. Immediate publication in the cyberspace, today on the Web, is therefore often the only means to ensure both personal recognition for achievements and contributions to fast-progressing, collective scientific endeavors.

Impact. Free online availability substantially increases a paper's impact. This statement is the title of a paper in Nature [2] by Steve Lawrence, a contributor to CiteSeer, cf. http://citeseer.ist.psu.edu/, a digital library of Computer Science literature. Considering 119,924 articles in Computer Science and related disciplines, Steve Lawrence found: "The mean number of citations to offline articles is 2.74, and the mean number of citations to online articles is 7.03, an increase of 157%." It is worth stressing that, as pointed out in the aforementioned paper, mere availability on the Web is not sufficient for ensuring impact. Impact depends as well, and maybe prominently, on appropriate search services. The importance of search in scientific publishing is discussed below in more detail.
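The percentage quoted from [2] can be checked with a line of arithmetic:

```python
# mean citation counts per article as reported in [2]
offline_mean, online_mean = 2.74, 7.03
# relative increase of online over offline citations, in percent
increase = (online_mean - offline_mean) / offline_mean * 100
# rounding to the nearest percent yields the 157% stated in the paper
```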

2.2 Copyrights and Plagiarism

Scientific publishing on the Web, however, raises two problems: copyrights and plagiarism.

Copyrights. In most cases, publishing in traditional scientific media (journals, books, and conference proceedings) is indispensable. The reason is that established and renowned media still mostly are of this kind,8 that publishing in renowned media is needed for visibility, and that visibility is needed both for a scientist's career and for scientific contributions to be noticed (the two being, by the way, two sides of the same coin). In many disciplines, online media of quality are emerging. For the time being, however, in most disciplines, including Computer Science, traditional media cannot be ignored: scientists still have to publish in them.

When a research paper is accepted for publication in a traditional medium (journal, book, or proceedings), the copyright on the paper is usually transferred from the paper's author(s) to the medium's publisher. As a consequence, the paper might have to be removed from the cyberspace. Whether this is the case depends on the publisher. Many publishers, especially in Computer Science, explicitly or tacitly accept that papers they publish are made available on the Web on a non-commercial basis.9 In such cases, it is wise, if not a legal obligation, to publish together with the papers on the Web, firstly, a disclaimer clearly stating that the papers are made available on a non-commercial basis and, secondly, a mention of the copyright holders.

Transferring the copyrights on scientific publications to traditional publishing companies still restrains their distribution, for two reasons. First, established publishing companies apparently have no interest in contributing to the emergence of low-cost distribution on the Web. Second, the Web still has no data-transfer protocol supporting payment and/or credential verification. This makes subscriptions, and a verification of the credentials of new customers (in general through a check of a credit card's validity), necessary.

Plagiarism. Many scientists worry more about plagiarism than about copyright violations when considering scientific publishing on the Web. Indeed, plagiarism is a real problem, and online publishing seems, at first glance, to make it easier. Plagiarism in science, however, did not appear together with the Web. Furthermore, scanners and OCR (optical character recognition) being nowadays available at a few keystrokes, it makes little difference to plagiarists whether they use online or paper-printed sources.

More importantly, the Google age provides means of unprecedented effectiveness for detecting plagiarism and fighting against it. Indeed, plagiarism only makes sense if it has good chances of remaining undetected. Clearly, this is why in Iron Curtain times plagiarism was a considerable temptation and, as is often said, prospered: the east-west information flow, even in the sciences, was so limited that plagiarism had good chances of remaining undetected.

In the Google age, things are completely different. Search engines make it easy to track scientific publications on every subject, from everywhere, and written in whatever language. Querying a search engine for the title, a sentence, or significant expressions of a paper yields not only (versions of) the paper but also references to papers containing similar expressions or sentences. Even if a plagiarized paper is not available on the Web, references to and citations of both the source paper and the plagiarized paper will be found on the Web. This will undoubtedly raise suspicion and eventually unveil the plagiarism. If the source paper is on the Web, the plagiarism is even easier to detect, since checks against the source paper are then much easier. Thus, publishing a paper on the Web protects it against plagiarism.

Furthermore, plagiarism-tracking software 10 can be, and is beginning to be, developed, applied, and made public. [8], e.g., reports on software of this kind used for tracking plagiarism among customer reviews at Amazon.com aimed at promoting products, ideas, and/or persons. [9] reports on several such types of software, their applications, and what they have already uncovered. [10] reports on applying a (freely available) plagiarism-tracking software to detecting self-plagiarism in Computer Science. Undoubtedly, the systematic use of plagiarism-tracking software will sooner or later be widespread in science.
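The core idea behind such plagiarism-tracking software can be illustrated by comparing the overlapping word sequences ('shingles') of two texts. The sketch below is a deliberately simplistic illustration of the idea, not a description of any of the systems reported on in [8], [9], or [10]:

```python
def shingles(text, n=5):
    """The set of overlapping n-word sequences ('shingles') of a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap(suspect, source, n=5):
    """Fraction of the suspect text's shingles that also occur in the
    source text; values near 1.0 suggest copying."""
    sa, sb = shingles(suspect, n), shingles(source, n)
    return len(sa & sb) / len(sa) if sa else 0.0
```

Real systems refine this idea with normalization, hashing, and large-scale indexing, but the principle -- shared word sequences betray copied text -- is the same.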

3. Research and Researchers' Visibility in the Cyberspace: Possibilities and Pitfalls

What makes Web pages or sites visible in the cyberspace is much the same in the sciences as in any other field. The sciences, however, offer specific ways to make Web pages (or sites) visible. Today, visibility in the cyberspace amounts to visibility on the Web. The visibility of scientific Web pages is considered in the following from three complementary angles: contents, language, and search engines.

3.1 Contents

It is in the sciences as it is in, e.g., the media: interesting, up-to-date, quality contents are worth reading and referring to. In the author's opinion, the following four actions are essential to the success of a scientific Web site:

  1. Join efforts
  2. Keep Web pages
  3. Provide what readers expect
  4. Carefully curate the published data

These four actions are discussed in the remainder of this section.

Join Efforts. Joining efforts is often key to building up an interesting scientific Web site. Indeed, a research group's Web site is likely to be more interesting and more useful to fellow researchers than the personal (professional) Web pages of the team members. The same holds for larger scientific organizations like university departments and universities. The potential of well-conceived and informative Web sites for increasing a scientific organization's visibility is often neglected: most universities and university departments worldwide apparently do not invest much effort in high-quality scientific Web sites, even though the return on investment, in particular in terms of research fundraising, would be certain.

Keep Web Pages. Keeping Web pages is important, too. The cyberspace is used not only as a repository of current information but also as an archive. This aspect is often neglected in science, which is regrettable because many scientific sources once made available in the cyberspace are never archived elsewhere. In the Google age, many texts and data are no longer catalogued and stored at different places but instead at one single place and, if needed, repeatedly downloaded. If such sources are removed from the cyberspace, they are lost forever. Apparently, this is often ignored, especially by those librarians who have kept a pre-Google-age conception of a library, i.e. that of a "keeper of hard-printed books", while in the Google age libraries should be information brokers. Electronic archives slowly, too slowly, begin to emerge.11 Electronic archives are as needed in the Google age as traditional libraries were needed in the pre-Google age.12 Besides, keeping Web pages is an easy means to ensure visibility, while removing pages often significantly reduces the visibility of a scientific Web site: especially in science, a significant number of past publications remain valuable to many.

Provide What Readers Expect. A scientific Web site should provide what its users expect from it: research papers, publication lists, announcements of conferences, etc.

As already mentioned, publishing full papers on the Web is highly desirable. Indeed, scientific papers available online are a considerable help in research.

A publication list is worth publishing on the Web because fellow researchers surely are interested in knowing what a researcher or a research group publishes. Including abstracts, or hyperlinks to abstracts, in a publication list makes it much more useful to fellow researchers: titles, as good as they might be, rarely describe sufficiently what a research paper is about; furthermore, titles, in contrast to (good) abstracts, rarely include all the relevant keywords that search engines need for properly indexing a paper (cf. below). Publishing paper abstracts on the Web is always possible, even if, for copyright reasons, the full paper cannot be made available on the Web.

Publishing research project descriptions on the Web is a very convenient way to become known to researchers working on similar issues. Surprisingly -- or, considering how researchers tend to focus on publications, unsurprisingly -- this possibility is poorly exploited.

Calls for papers and conference programmes are, of course, worth publishing on the Web. They not only contribute to informing a scientific community. They also are a source of highly valuable information on which issues were considered important at some point in time and on who contributed to a scientific field, and, of course, they provide useful hints at published articles. Thus, keeping the calls for papers and programmes of past conferences makes much sense.

Depending on the field, much other information can be published on the Web. In Molecular Biology, for example, data collected and data-analysis software developed in research are made available on the Web [3]. In Computer Science, software developed in research is often made available as so-called "open source software" via specialized Web-based servers such as SourceForge, cf. http://sourceforge.net/. In Computer Science, it is also common practice to add to a publication list citation entries in the widely used BibTeX format: this makes it easier for authors to cite the papers and helps citations of a paper be as complete as the paper's authors wish.
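Such BibTeX entries can even be generated from raw citation data. The helper below is a hypothetical illustration (not a standard tool), shown here with the Lawrence paper [2] as sample data:

```python
def bibtex_entry(key, author, title, journal, volume, pages, year):
    """Format citation data as a BibTeX @article entry."""
    return (f"@article{{{key},\n"
            f"  author  = {{{author}}},\n"
            f"  title   = {{{title}}},\n"
            f"  journal = {{{journal}}},\n"
            f"  volume  = {{{volume}}},\n"
            f"  pages   = {{{pages}}},\n"
            f"  year    = {{{year}}}\n"
            f"}}")

entry = bibtex_entry(
    "lawrence2001",
    "Steve Lawrence",
    "Free online availability substantially increases a paper's impact",
    "Nature", "411", "521", "2001",
)
```

Offering such ready-made entries costs the list's author little and spares every citing author the error-prone retyping that leads to the incorrectly named journals and conferences deplored below.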

Carefully Curate the Published Data. Data curation is too often neglected by researchers. Up-to-date and carefully written publication lists, for example, are more useful than incomplete or imprecise ones. Incorrectly named journals, books, or conferences are frequently encountered; this leads to confusion and makes it harder to find papers. Data curation can be very time-consuming, but the reward in terms of visibility is certain: good data sources, in the cyberspace as elsewhere, are remembered, frequently accessed, and often cited.

3.2 Language

The Web, like many scientific disciplines, is dominated by English. A study [7] estimated that by the end of 2001, 73% of Web pages were written in English, 7% in German, 5% in Japanese, and 3% each in French and Spanish.13

These percentages not only mean that English Web pages are a must if international contacts and visibility on the Web are sought.

They also signify that Web pages in other languages are a good means to achieve visibility. Indeed, good visibility among the few percent of Web pages written in a language other than English is easier to achieve than among the predominant number of Web pages written in English. Good "language-local" visibility in turn increases worldwide visibility: if a Web site is the first German site for, say, "Thomas Mann in American exile", then this will be acknowledged by search engines and will boost this Web site's rank among all Web sites on Thomas Mann.

3.3 Search Engines

The working principle of search engines is simple: a search engine systematically and regularly crawls (portions of) the Web and indexes the Web pages accessed according to their content; it then answers queries using its index. What makes search engines difficult to build is

  1. coping with the sheer size of the ever-growing Web that has to be crawled, and
  2. indexing and ranking the Web pages accessed.

The first issue is highly technical and out of the scope of this paper. The second one explains how to make Web pages visible.
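The indexing half of this working principle can be sketched in a few lines. The toy inverted index below illustrates the idea only; it is a hypothetical sketch, not how any real search engine is implemented:

```python
from collections import defaultdict

def build_index(pages):
    """Map each word to the set of page ids whose text contains it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return index

def search(index, query):
    """Return the pages containing every word of the query."""
    hits = [index.get(word, set()) for word in query.lower().split()]
    return set.intersection(*hits) if hits else set()

pages = {"p1": "PageRank ranks Web pages",
         "p2": "Web pages about plagiarism"}
index = build_index(pages)
```

Answering a query then amounts to intersecting precomputed sets, which is why indexed search is fast; the hard part, ranking the resulting pages, is what distinguishes the engines discussed next.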

Basically, and simplifying, as far as indexing and ranking are concerned, there are two brands of search engines: directory-based search engines and Google. How search engines of either brand rank Web pages is a well-kept secret. Basic principles, however, are known, and they suffice to understand how to increase a search engine's rank of a Web page. In the following, the basic principles of the two above-mentioned brands of search engines are briefly described. Then it is discussed how search engines rank Web pages and, finally, which (ethical) ways exist to improve a Web page's Google rank and which pitfalls to avoid in doing so.

At first, it seems natural to build an index according to a logical structuring, possibly a taxonomy, of concepts. Such a taxonomy could, for example, have categories 'entertainment' and 'sciences', respectively including, among others, the sub-categories 'cinema halls' and 'Humanities'. The search engine Yahoo, for example, makes use of such a directory, cf. http://dir.yahoo.com/. The drawback of directory-based search engines is that building directories requires human work and therefore does not scale up: as the Web grows in size and its uses become more varied, directories must be adjusted and/or extended. The difficulty can be well illustrated with the sciences: a directory for a discipline, say Computer Science, structured along a classification system well established in the discipline, like the ACM Computing Classification System, cf. http://www.acm.org/class, would surely both offer extremely valuable services and poorly account for some recent developments (like the emergence of search engines) or for marginal, nonetheless important, views. For example, while "H. Information Systems" of the ACM Computing Classification System is an appropriate class for "search engines", it is not quite clear which of its subclasses is convenient for this concept.

Google's ranking is not based on directories but instead on assigning to a Web page A a value between 1 and 10, called the PageRank of A. PageRank is named after Larry Page, co-author of the algorithm, called the PageRank algorithm, used to compute Web pages' PageRanks. The PageRank algorithm by Google's founders Sergey Brin and Larry Page is described in [4] and has been further analyzed in several research papers. PageRank is a Markov model: it expresses a traversal of the Web that starts at a randomly selected page and moves from page to page, either following a hyperlink or jumping to a page selected at random. Thus, PageRank interprets a hyperlink from a page A to a page B as a (ranked) vote of A for B with the following properties: a page receiving votes from many pages ranks high; votes of pages that themselves rank high count more; and a page's vote is divided among all the pages it links to.

The PageRank PR(B) of a page B is defined as

PR(B) = (1-d) + d*PR(A1)/L(A1) + ... + d*PR(An)/L(An)

where A1, ..., An are the pages registered as linking to B, d is a so-called damping factor (of 0.80 or 0.85), and L(Ai) is the number of hyperlinks page Ai contains to other pages.
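This definition suggests a simple iterative computation: start with arbitrary ranks and repeatedly re-evaluate the formula until the values stabilize. The toy sketch below follows the published formula only, on a hand-made three-page graph; it is an illustration, not Google's actual implementation:

```python
def pagerank(links, d=0.85, iterations=50):
    """Iteratively approximate the PageRanks of the pages in 'links',
    a dict mapping each page to the list of pages it links to.
    (Toy sketch of the published formula; pages without outgoing
    links are not handled.)"""
    pages = list(links)
    pr = {p: 1.0 for p in pages}  # arbitrary initial ranks
    for _ in range(iterations):
        # each page's vote is divided by its number of outgoing links
        pr = {b: (1 - d) + d * sum(pr[a] / len(links[a])
                                   for a in pages if b in links[a])
              for b in pages}
    return pr

# toy graph: A and C vote for B, B votes for A, nobody votes for C
ranks = pagerank({"A": ["B"], "B": ["A"], "C": ["B"]})
```

In this graph, B collects two votes and ranks highest, A inherits much of B's rank through B's single vote, and C, receiving no votes at all, settles at the minimum value 1-d.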

Consequences of this definition are that a page's PageRank grows with the number and the PageRank of the pages linking to it, that a link from a page with many outgoing links counts less than a link from a page with few, and that every page has a PageRank of at least 1-d.

PageRank, as it is defined in [4] and in other research papers, is not all of a Web page's Google rank. Apparently, Google's ranking is also based on pragmatics, of which very little is known, and this little only empirically. Some of Google's actual ranking principles might become apparent when Google's ranking is modified, which happens from time to time.

3.4 Limitations of Google

Directory-based search engines might, however, in some cases be preferable to Google, which has limitations, too. These limitations are worth knowing because they might impact a Web page's visibility. The following limitations of Google are discussed below in more detail: text-based rather than semantics-based search; hyperlinks that sometimes improperly reflect the importance of Web pages; relative positions in returned results that are sometimes undesirable; and reputation and trust not being considered. Search engines are, however, a fast-evolving field: one might expect Google to soon overcome some of its limitations.

Text-based vs. Semantics-based Search. Google is text-based, not semantics-based. As a consequence, different meanings of words or expressions are not recognized. A search for 'white house', for example, does not distinguish between the seat of the US government and the house of the 18th-century naturalist Gilbert White.14

Hyperlinks Improperly Reflecting a Web Page's Importance. As of March 2006, a search for 'cinema Munich' at Google did not return the Web page of the yearly 'Munich International Film Festival' among the first 100 results (of 2,250,000). This film festival, however, is a cultural event of importance for Munich and its region. In contrast, the directory-based search engine Ask.com, cf. http://www.ask.com/, returned the Web page of the 'Munich International Film Festival' at position 3 (of 277,000).

Relative Positions Sometimes Undesirable. Google returns only relative positions. If all Web pages of some set are equally relevant or important, then this is hidden by the ranking delivered by Google, which in such a case is irrelevant. As of March 2006, a search for 'consulate Washington', for example, returned a list with many repetitions and an unnatural ordering.

Reputation and Trust. Reputation and trust are not (yet) considered by any search engine. A search for 'man on moon' returns, among other results, pages claiming that NASA's Moon landings were a hoax.

It is worth stressing that the limitations mentioned above are highly relevant for search in science. One might expect that search services targeted at the scientific community, like Google Scholar, cf. http://scholar.google.com, SciFinder Scholar, cf. http://www.cas.org/SCIFINDER/SCHOLAR/, and Microsoft's "Windows Live Academic", cf. http://academic.live.com/, will contribute to further developments of search engines.

3.5 Enhancing the Google Rank of Scientific Web Pages

Quite a number of things can be done to increase a scientific Web page's Google rank. Some are considered unethical and may lead to Web sites being removed from the Google index, cf. Google's Quality Guidelines [5]. The following simple actions are perfectly ethical, do not violate Google's Quality Guidelines, and will significantly improve the Google rank of scientific Web pages and/or of research teams' Web sites:

The following is empirically known to positively affect Google's ranking of HTML Web pages:

As well as for Google's ranking of Web sites:

The following is known to have a negative influence on a Web page's Google rank:

Currently, the following has an undesirable or negative influence on a Web page's ranking at all search engines:

4. Concluding Remarks

This article does not address several important questions raised by scientific publishing in the Google age, in particular:


The author thanks Tim Furche, Norbert Eisinger, and the members of the REWERSE project, cf. http://rewerse.net, especially Michael Schroeder, for useful hints and/or feedback on a preliminary version of this article.


1 Like e.g. 'grids' cf. http://www.gridforum.org/.
2 Cf. e.g. wikis http://www.mediawiki.org/wiki/MediaWiki, http://wiki.org/, wikipedia http://wikipedia.org, blogs http://www.blogsearchengine.com/, newsclouds http://www.revsys.com/newscloud/.
3 Cf. e.g. Amazon http://amazon.com and eBay http://ebay.com.
4 Like e.g. DBLP http://dblp.uni-trier.de/ in Computer Science.
5 Cf. e.g. Confious http://www.confious.com/ and EasyChair http://www.easychair.org/.
6 For reducing the risk of undesirable, so-called spam emails, it is advisable, though, not to publish email addresses on the Web in machine-readable form (such as: john.smith@abcd.ef) but as pictures (e.g. hand-written scanned text) or in some other "encoded" manner (such as: john. smith at abcd. ef).
7 Provided, of course, that the journal, book or proceedings policy does not preclude publication in the cyberspace.
8 There are noticeable exceptions such as the online journals of the Public Library of Science (PLoS) http://www.plos.org/.
9 Since 2001, organizations such as 'Creative Commons' http://creativecommons.org/ -- with its project 'Science Commons', cf. http://sciencecommons.org/ -- have been promoting copyright licenses for so-called "open access contents".
Initiatives launched by scientists, such as the 'Bethesda Statement on Open Access Publishing', cf. http://www.earlham.edu/~peters/fos/bethesda.htm, and the 'Berlin Declaration' of the German Max Planck Society, cf. http://www.zim.mpg.de/openaccess-berlin/berlindeclaration.html, advocate a new form of scientific publishing that makes publications accessible to all -- with the authors bearing the publication costs. The Public Library of Science (PLoS), cf. http://www.plos.org, and BioMed Central, cf. http://www.biomedcentral.com, are publishers following this approach.
The European Commission has recently published a report [11] which makes recommendations for improving access to publicly funded research. The Commission's view on the subject reads as follows: "Given the scarcity of public money to provide access to scientific publications, there is a strong interest in seeing that Europe has an effective and functioning system for scientific publication that speedily delivers results to a wide audience." The European Commission is calling for reactions to the report [11] and for contributions on other issues related to scientific publishing by 1st June 2006 at rtd-scientific-publication@cec.eu.int.
10 Increasingly used at universities for checking student essays for plagiarism.
11 Such as e.g. arXiv http://www.arxiv.org/ and HAL-INRIA http://hal.ccsd.cnrs.fr/.
12 The role of scientific libraries must be reconsidered in the Google age. This issue is beyond the scope of this paper.
13 Interestingly, Chinese was not cited at that time. What is nowadays the percentage of Web pages in Chinese?
14 Although a search engine specialized in the history of Biology would probably be expected to return only Web pages on the house of the biologist White.
15 Note that Google does not consider keywords in HTML meta elements because they are not displayed by browsers and have often been misused in an unethical manner to enhance the Google ranking of HTML Web pages. Other information in HTML meta elements is considered by Google, though.
16 Like 'The Open Directory Project' http://dmoz.org/, a human-compiled directory of Web sites, 'Zeal' http://www.zeal.com, a human-compiled directory of non-commercial Web sites, and Yahoo http://www.akamarketing.com/Yahoo-submitting-tips.html.
17 Like e.g. "smart" publication lists only displaying papers or paper references if an author's name or words that might occur in a title are given.
18 There is a rumor that, a few years ago, an official report on the controversial, government-funded German magnetically levitated train "Transrapid" was indexed by Google according to cost estimates that had been removed from the published text but remained recoverable in an "unsanitized" Word file published on the Web.
19 Like e.g. a paper's title on a first Web page, the name of the journal or proceedings where it has been published on another Web page.


[1] Tim Berners-Lee.
Information Management: A Proposal. CERN. March 1989, May 1990.
[2] Steve Lawrence.
Free online availability substantially increases a paper's impact. Nature, volume 411, page 521, May 2001.
[3] François Bry and Peer Kröger.
A Computational Biology Database Digest: Data, Data Analysis, and Data Management. Journal Distributed and Parallel Databases, volume 13, number 1, pages 7-42, January 2003.
[4] Sergey Brin and Lawrence Page.
The Anatomy of a Large-Scale Hypertextual Web Search Engine. Journal of Computer Networks and ISDN Systems, volume 30, number 1-7, pages 107-117, 1998.
[5] Google's Quality Guidelines.
[6] Antonio Gulli and Alessio Signorini.
The Indexable Web is more than 11.5 billion pages. Proceedings of the 14th International World Wide Web Conference (WWW 2005). May 10-14, 2005.
[7] Edward T. O'Neill, Brian F. Lavoie, Rick Bennett.
Trends in the Evolution of the Public Web 1998 - 2002. D-Lib Magazine, April 2003.
[8] Shay David and Trevor Pinch.
Six Degrees of Reputation: The Use and Abuse of Online reviews and Recommendation Systems. First Monday, volume 11, number 3, March 2006.
[9] Jim Giles.
Taking on the Cheats. Nature, volume 435, pages 258-259, 19 May 2005.
[10] Christian Collberg and Stephen Kobourov.
Self-Plagiarism in Computer Science. Communications of the ACM, volume 48, issue 4, pages 88-94, April 2005.
[11] Mathias Dewatripont, Victor Ginsburgh, Patrick Legros, Alexis Walckiers, Jean-Pierre Devroey, Marianne Dujardin, Françoise Vandooren, Pierre Dubois, Jérôme Foncel, Marc Ivaldi, and Marie-Dominique Heusse
Study on the Economic and Technical Evolution of Scientific Publication Market in Europe.
Commissioned by DG-Research, European Commission. January 2006.

Last modified: Fri Apr 28, 2006