I have been awarded a Shuttleworth Fellowship to change the world; my first reactions

The Shuttleworth Foundation has done me the honour of appointing me as a Fellow, starting today. The remit (http://www.shuttleworthfoundation.org/fellowship/ ) is:

The holy grail of every funder is sustainability, an idea and approach living long after the money has run out. That is why we fund people not projects. The only true way to sustainability is not a business plan but a champion, someone who will drive an idea through an ever changing landscape, to make a real difference in the world.

We are looking for social innovators who are helping to change the world for the better and are seeking support through an innovative social investment model.


My new entry is here: http://www.shuttleworthfoundation.org/fellows/current/peter-murray-rust/

This is incredible. I’ve had a week or two to adjust but I’m still finding new ideas, visions, people on a daily basis.  So this is a first reaction.

I am going to change the world for the better. Yes. Over the last few years when people have asked me what I want to do I reply “change the world”. It’s what we should all aspire to. And this is the most concentrated time of innovation in the history of the planet, which makes it much easier. In the past heroes such as Diderot had to rely on print to reach people – I can reach millions of people with a few keystrokes.

It’s this ability to create communities that makes us different from our predecessors. As exemplars I look to my own immediate circle of electronic communities: Wikipedia, Mozilla, Open Knowledge Foundation, Creative Commons, Open Rights Group, Blue Obelisk …

All started by people – often just one. And all self-sufficient without their founders. That’s my immediate model for sustainability. I don’t know exactly *how* it will happen, but I am certain it will. (Certainty is an essential ingredient of success.) So the goal is to build a community of vision and practice.

This year I have undertaken to liberate 100,000,000 FACTs from the scientific/technical/medical literature. FACTs belong to the world, not individuals and not corporations. I use uppercase to stress that they are not protectable as intellectual property (IP). FACTs save lives (think helicobacter and ulcers). FACTs help to create new materials. FACTs lead to better decision making (e.g. climate change). FACTs generate new information-based industries which generate new wealth (the 4 Billion USD invested in the human genome generated 700 Billion of downstream wealth). (I’ve blogged a lot about the Content Mine and I’ll be blogging a lot more, of course.)

Because it is freely available to everyone on the planet who can connect to the Internet.

But most of all I must thank the Shuttleworth Foundation. They have a wonderful vision and wonderful people. There’s a lot I am discovering.

But, simply, they put in the effort to make sure people succeed.

They have a wonderful infrastructure that I suspect few other funding bodies can emulate. I have a very real relationship with  Karien Bezuidenhout and Helen Turvey who run the Fellowship program.  We’ve spent a lot of time bouncing ideas around and I shall be meeting Helen in a few days in London. Karien and I will have virtual meetings twice a month! This can make all the difference to being focussed and setting achievable objectives.

And I know several of the Fellows already. Rufus Pollock (OKFN), Daniel Lombraña González (Crowdcrafting)  …

… and François Grey who will be my buddy / mentor. This is a wonderful idea. I’m hoping I can visit New York and run a workshop there with his Citizen Science community.

And then there is the community of the Fellowship – again this is a wonderful resource. Fellows come from all disciplines and experience and the cross-fertilisation will be massive. We meet virtually every week and we have 2 physical meetings a year. I’ll be doing a lot of listening.

It’s a huge responsibility, but that’s absolutely how it should be. I shall give it my best. I cannot know how it will work out in detail. I’ve a loose group of current collaborators and I’ll be talking with Helen about the best way of involving them. We’ve already plotted some activities.

Massive thanks to those who have helped with my application, acted as sounding boards and acted as referees.

Shuttleworth is the difference between *hoping* your ideas will take root and *knowing* they will.





The dramatic growth of BioMedCentral open access article processing charges

The average article processing charge for BioMedCentral journals requested from the University of Ottawa (uO) Library’s author’s fund increased 27% from 2010-11 to 2012-13. The 15% increase from 2011-12 to 2012-13 is 10 times the rate of inflation. 

The data indicates that this reflects increases in journal prices rather than changes in which journals uO authors publish in. For example:

Globalization and Health (a BMC journal)

  • 2010-11: uO paid an APC of $1,300 US. Assuming this reflects a BMC membership rate in effect at this time (15% discount), that’s still less than $1,500 US.
  • 2011-12: uO paid APCs at 2 different rates: $1,425 US and $1,715 US
  • 2012-13: uO paid APCs at $1,670 and $1,715 US
  • The BMC rate listed on BMC’s own website as of Feb. 27, 2014 is $2,155 US from: http://www.globalizationandhealth.com/manuscript

An increase in APC from $1,715 US to $2,155 US in the last year is about a 25% increase in the APC for this particular journal. Currency fluctuations could account for about one-tenth of this increase (see below for calculations), and the modest inflation rate would account for about a 1.5% increase. This still leaves more than a 20% increase in price above and beyond currency variations and inflation.
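The arithmetic in the paragraph above can be re-run as a small sketch. The dollar figures, the "about one-tenth" currency share and the 1.5% inflation figure are taken from the post; the function and variable names are illustrative only, and simple subtraction of percentages is an approximation rather than an exact decomposition.

```python
def pct_increase(old, new):
    """Percentage increase from old to new."""
    return (new - old) / old * 100.0

# APC for Globalization and Health: highest 2012-13 rate vs. the Feb. 2014 list price
apc_increase = pct_increase(1715, 2155)

# The post attributes about one-tenth of the increase to currency fluctuation
currency_share = apc_increase / 10.0

# The post's "modest inflation rate" contribution, in percentage points
inflation = 1.5

# What remains after subtracting both (an approximation, not compounding)
residual = apc_increase - currency_share - inflation

print(f"APC increase: {apc_increase:.1f}%")
print(f"Residual above currency and inflation: {residual:.1f}%")
```

Under these assumptions the residual stays above 20 percentage points, which matches the claim in the text.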

Currency variations UK pound sterling to USD, based on Bank of Canada daily and 10-year currency converter.

  • UK pound sterling to USD conversion rate:
  • Jan. 2011: 1.5586
  • Jan. 2012: 1.5654 (.0043 increase over 2011)
  • Jan. 2013: 1.6254 (.0383 increase over 2012)
  • as of Feb. 27, 2014: 1.6691 (.02688 increase over 2013)
  • Total increase in value of UK pound sterling in comparison with US dollar 2014 / 2011: 7%
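The year-over-year figures in this list are fractional changes in the conversion rate. A minimal sketch of how they can be recomputed, using only the four rates quoted above (the names are hypothetical, and the list above appears to truncate rather than round the first value):

```python
# UK pound sterling to USD conversion rates, as quoted in the list above
rates = [
    ("Jan. 2011", 1.5586),
    ("Jan. 2012", 1.5654),
    ("Jan. 2013", 1.6254),
    ("Feb. 27, 2014", 1.6691),
]

def relative_changes(series):
    """Return (label, fractional change versus the previous entry) pairs."""
    changes = []
    for (_, prev), (label, curr) in zip(series, series[1:]):
        changes.append((label, curr / prev - 1.0))
    return changes

yearly = relative_changes(rates)

# Total change from the first to the last quoted rate
total = rates[-1][1] / rates[0][1] - 1.0

for label, change in yearly:
    print(f"{label}: {change:.4f}")
print(f"Total 2014 / 2011: {total:.1%}")
```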

Public Library of Science (PLoS), by contrast, has kept prices for their journals at exactly the same rates during this time frame. PLoS’ achievement of a 23% surplus during this time frame indicates that this was done without financial sacrifice. While I continue to call on the not-for-profit PLoS to actually lower their prices to facilitate the transition to open access, the remarkable contrast between PLoS holding the line on prices while BMC raises their prices at rates far above inflation is worth noting.

Thanks to Jeanette Hatherill and the University of Ottawa Library for posting the Open Access publication rates in the uO institutional repository. This dataset contains the amounts paid for through the library’s author’s fund for open access article processing charges from 2010 – 2013. Watch for further calculations and release of my calculations spreadsheet as part of the open access article processing charges series.

This post also illustrates the value of open data. By posting this data for open access in the University of Ottawa’s institutional repository, uO is making it possible for me to conduct research like this that could be useful to uO’s own decision-making processes in future. Let’s hope this post inspires others to follow uO’s lead and share their data, too. 

This post is part of the Open access article processing charges research series

101 uses for Content Mining

It’s often said by detractors and obfuscators that “there is no demand for content mining”. It’s difficult to show demand for something that isn’t widely available and which people have been scared to use publicly. So this is an occasional post to show the very varied things that content mining can do.

It wouldn’t be difficult to make a list of 101 things that a book can be used for. Or television. Or a computer (remember when IBM told the world that it only needed 10 computers?) Content mining of the public Internet is no different.

I’m listing them in the order they come into my head, and varying them. The primary target will be scientific publications (open or closed – FACTs cannot be copyrighted) but the technology can be applied to government documents, catalogues, newspapers, etc. Since most people probably limit “content” to words in the text (e.g. in a search engine) I’ll try to enlarge the vision. I’ll put in brackets the scale of the problem.

  1. Which universities in SE Asia do scientists from Cambridge work with? (We get asked this sort of thing regularly by Vice-Chancellors.) By examining the list of authors of papers from Cambridge and the affiliations of their co-authors we can get a very good approximation. (Feasible now.)
  2. Which papers contain grayscale images which could be interpreted as gels? A polyacrylamide gel (http://en.wikipedia.org/wiki/Polyacrylamide_gel) is a universal method of identifying proteins and other biomolecules. A typical gel looks like the SDS-PAGE image on Wikipedia (CC-BY-SA). Literally millions of such gels are published each year and they are highly diagnostic for molecular biology. They are always grayscale and have vertical tracks, so very characteristic. (Feasibility – good summer student project in simple computer vision using histograms.)
  3. Find me papers in subjects which are (not) editorials, news, corrections, retractions, reviews, etc. Slightly journal/publisher-dependent but otherwise very simple.
  4. Find papers about chemistry in the German language. Highly tractable. Typical approach would be to find the 50 commonest words (e.g. “ein”, “das”,…) in a paper and show the frequency is very different from English (“one”, “the” …)
  5. Find references to papers by a given author. This is metadata and therefore FACTual. It is usually trivial to extract references and authors. More difficult, of course to disambiguate.
  6. Find uses of the term “Open Data” before 2006. Remarkably the term was almost unknown before 2006 when I started a Wikipedia article on it.
  7. Find papers where authors come from chemistry department(s) and a linguistics department.  Easyish (assuming the departments have reasonable names and you have some aliases (“Molecular Sciences”, “Biochemistry”)…)
  8. Find papers acknowledging support from the Wellcome Trust. (So we can check for OA compliance…).
  9. Find papers with supplemental data files. Journal-specific but easily scalable.
  10. Find papers with embedded mathematics. Lots of possible approaches. Equations are often whitespaced, and text contains non-ASCII characters (e.g. greeks, scripts, aleph, etc.) and heavy use of sub- and superscripts. A fun project for an enthusiast.
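As an illustration of item 4, here is a minimal sketch of the common-word approach. It is only one possible reading of the idea: instead of deriving the 50 commonest words from each paper, it counts hits against two short, hand-picked word lists (an assumption made for brevity), and all names are hypothetical.

```python
import re
from collections import Counter

# Small illustrative lists of very common words; a real system would use longer ones
GERMAN_COMMON = {"der", "die", "das", "und", "ein", "eine", "ist", "nicht",
                 "mit", "von", "zu", "den", "auf", "wird"}
ENGLISH_COMMON = {"the", "and", "of", "to", "a", "is", "that", "for",
                  "with", "on", "one", "this", "are", "by"}

def tokenize(text):
    """Lower-case the text and split it into runs of letters (Unicode-aware)."""
    return re.findall(r"[^\W\d_]+", text.lower())

def guess_language(text):
    """Return 'de', 'en' or 'unknown' from common-word counts."""
    counts = Counter(tokenize(text))
    german = sum(counts[w] for w in GERMAN_COMMON)
    english = sum(counts[w] for w in ENGLISH_COMMON)
    if german == english:
        return "unknown"
    return "de" if german > english else "en"

german_sample = ("Die Synthese der Verbindung wird mit einem Katalysator "
                 "durchgeführt und das Produkt ist nicht stabil.")
english_sample = ("The synthesis of the compound is carried out with a "
                  "catalyst and the product is not stable.")

print(guess_language(german_sample))
print(guess_language(english_sample))
```

Finding chemistry papers would still need a second, subject-specific filter; this sketch only covers the language step.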

So that’s just a start. I can probably get to 50 fairly easily but I’d love to have ideas from…



[The title may or may not allude to http://en.wikipedia.org/wiki/101_Uses_for_a_Dead_Cat ]

Content Mining Myths 1: “It’s too hard for me to do”; no it’s easy

One of the many myths about content mining is that it’s difficult and only experts can do it.

Quite the opposite – with the right tools anyone can do it. And in fact most of you do content-mining every day…

  • When you type a phrase into a search engine (Google, Bing)  you are using the mined content of the web. You phrase your question to try to get the most precise, most relevant answers. Agreed, it’s not easy to WRITE a search engine, but it is easy to use one. If we know what questions you want to ask the scientific literature then we can work out how to build the engine.
  • When you use software to examine photographs it can pick out faces. Again it’s not easy to write such software but it’s easy to use it. And that’s what we are doing for chemistry – recognising compounds and reactions in pictures. We’ll present this at the upcoming American Chemical Society meeting in Dallas next month so if you are there you’ll get an idea. It’s only 3 months old but we’ve come a long way.
  • When you search your mail for a name you are mining the content. Again it’s easy to do.

Because content-mining in science has been held back by restrictive practices there are lots of valuable tools waiting to be applied. That’s what we are doing. We expect progress to be rapid. Obviously we’ll appreciate direct help, but we’ll also appreciate general interest.

What do you want to be able to do? What FACTs do you want to extract (or for us to extract and publish)? It won’t all be possible, but a huge amount will be.

And when we have tens of thousands of scientists mining the literature and making the results public there will be a huge acceleration.


Impending flood? Hold Onto Your Family!

With the extreme weather we’ve witnessed all over the US this winter, some people may be planning new ways to stay safe in the event of a natural disaster. If we can’t learn to predict these extreme events (as some animals may be able to) we may take a moment to learn from some often overlooked creatures, in this case, Formica selysi ants.

A group of researchers in Switzerland studied this species of ants’ technique for surviving a flooding event. They found that these ants, which regularly inhabit flood plains in the Alps and the Pyrenees, are well-prepared and ready to act in the event of impending submersion. The ants quickly form a “collective structure” by physically grasping on to one another to create a floating platform and raft to safety when a flood comes. This technique keeps nest-mates together, protects the queen, and ensures the survival of the majority of the colony.

Predictably, the researchers observed that the ants place their queen towards the center of the rafts, in the most protected position. However, instead of likewise protecting their young, the worker ants use the buoyant properties of the brood by placing them at the bottom of the raft where they act as flotation devices. The young suffer little or no mortality from this placement and serve as vital support for the rest of the colony when incorporated into the raft in this fashion. Check out the ants in action in the video below (and on our Youtube channel).

Although we may not be able to literally grab onto each other and float above the water when threatened with a flood, the principle is what might be important. Lesson learned: be prepared and gather your family and friends close to tackle whatever challenge is approaching together.


Citation: Purcell J, Avril A, Jaffuel G, Bates S, Chapuisat M (2014) Ant Brood Function as Life Preservers during Floods. PLoS ONE 9(2): e89211. doi:10.1371/journal.pone.0089211

Image: Figure 1 from doi:10.1371/journal.pone.0089211


The post Impending flood? Hold Onto Your Family! appeared first on EveryONE.

Nature News reports SCIgen gibberish papers; can we rely on conventional peer-review? Or can machines help?

Richard van Noorden has an important report


Two science publishers have withdrawn more than 120 papers after a researcher in France identified them as computer-generated. According to Nature News, 16 fraudulent papers appeared in publications from Germany-based Springer, and more than 100 were published by the New York-based Institute of Electrical and Electronic Engineers (IEEE).

It’s not clear what the motive was – academic fraud? or a Sokal/Bohannon-like demo of the frailty of peer-review? But the immediate effect is to show that a large number of “peer-reviewed” scientific papers have flaws.

This should surprise no-one who understands the process of scientific publication. I will assert that, in principle, every published article has flaws. Most will be minor – typos in references or mislabelled diagrams or typos in tables or misdrawn chemical diagrams or countless other errors.

Consider a doctoral thesis – possibly  the most intensively peer-reviewed document that a scientist produces. The thesis is written knowing that failure may be absolute – a career could depend on it. It has taken months to prepare. Almost always the student has to revise it for “minor errors”. (My own thesis had a number and yet I have asked for it to be digitised at Oxford). Errors are ubiquitous.

There are roughly three absolute reviewers of scientific material:

  • The natural and physical world. Nature (not the journal) always wins. It is fair – God does not play dice – but neither does s/he tolerate errors. This is the ultimate arbiter. One of the structures in my thesis was “wrong”. I discovered later that it was in a subgroup (Fd3) of the reported space group (Fd3m). This wasn’t trivial – it included a rare sort of twinning (which has given me minor eponymity). This is how science progresses. Science is a series of snapshots.
  • The computer.  It doesn’t lie. If you don’t get the same answer as someone else then either you or they or both have to find out where the problem is. It’s interesting that most of these fake papers were in the area of Computer Science. Properly reported CS should be very difficult to fake. Unfortunately much of it is very badly reported.
  • Humans. Human judgment is variable and changes with time. A “good” paper now may be “bad” at a later stage and vice versa. An “exciting” one now may be shown to be uninteresting later or vice versa. Science often changes by paradigm shifts and many of those were rejected when first published. Moving continents? Ulcerating bacteria? Charged species in solution? These are examples of science that would have led to dismissal for lack of “impact”.

The rush for immediate impact is anti-scientific as is the rush for multiple publications.

I doubt this will change.

But one thing that can help to reduce noise, error, fraud, duplication etc is the use of machines.

Machines can detect fraud (I shall show how shortly). Machines can detect errors – we have already shown this. Machines can reproduce (or fail to reproduce) computational science.  This could and should be done.


The problem is that it is a lot of work to set up the proper apparatus. And publishers don’t like that (I except a few shining examples such as IUCr/Acta Crystallographica). It costs money to verify and check science. That eats into profits. And while publishers get paid for the number of papers they publish (and generally not the ones they reject) why bother?

Why do chemistry publishers not insist on machine readable spectra? It’s trivial.

Why do they not insist on machine readable chemical structures? That’s even more trivial.

Because it costs effort?

And worse – it means that the scientific literature becomes a semantic database. And that would never do, because it could replace the secondary databases that generate hundreds of millions of dollars of income.

I and my friends could have all the tools to create higher quality chemistry, less fraud, more value. And that goes for many other sciences.

Machines can help authors… I’ve tried that for over 10 years. No progress.

Will the culture of publication change in my lifetime??

That’s up to you.

New PLOS Open data policy

PLOS has announced some changes to their publishing policies, and these changes are great news.  The new PLOS policies will go a significant way towards encouraging open data and open source.  Although the announcement itself is somewhat vague on the subject of source code, the actual PLOS One Sharing Policy is excellent:

…if new software or a new algorithm is central to a PLOS paper, the authors must confirm that the software conforms to the Open Source Definition, have deposited the following three items in an open software archive, and included in the submission as Supporting Information:

  • The associated source code of the software described by the paper. This should, as far as possible, follow accepted community standards and be licensed under a suitable license such as BSD, LGPL, or MIT (see http://www.opensource.org/licenses/alphabetical for a full list). Dependency on commercial software such as Mathematica and MATLAB does not preclude a paper from consideration, although complete open source solutions are preferred.
  • Documentation for running and installing the software. For end-user applications, instructions for installing and using the software are prerequisite; for software libraries, instructions for using the application program interface are prerequisite.
  • A test dataset with associated control parameter settings. Where feasible, results from standard test sets should be included. Where possible, test data should not have any dependencies — for example, a database dump.

However, the one loophole is that they allow for code that runs on closed source platforms in “common use by the readership”  (e.g. MATLAB), although it must run without dependencies on proprietary or otherwise unobtainable ancillary software.  That “common use” loophole could potentially be a mile wide in some fields.  Is Gaussian a common use platform in computational chemistry and therefore exempt from this new policy?   If so, the policy is a bit toothless.  I’d like to see the limits and bounds of the “common use” loophole more clearly stated.

The announcement makes PLOS ONE a much more attractive place to send our next paper.

Centre for Research Communications is Moving!

The Centre for Research Communications is moving to new offices during the week of Monday 24th to Friday 28th February.

Therefore, we will have limited access to our e-mail and phone, during this period.

We will respond to your enquiry as soon as possible, after our move. Please accept our apologies for the inconvenience.

MDPI and Beall – further comments from a “brainwashed Brit”

After my recent post on MDPI there has been a flurry of comments on this blog and I have also received a few private mails. Some are accusatory either of me or other correspondents.

To clarify my position:

  • I have been aware of MDPI for ca 16 years and have no indication that they are other than a reputable scientific publisher. I have corresponded with them 2-3 times.
  • I wrote “I have no personal involvement with MDPI”. This was poorly phrased – I mean to say I have no financial interest in MDPI nor am I involved in any way in the running of the company.
  • A month ago I accepted an invitation to be on the editorial board of the journal Data. I approve of what Data is setting out to do and I intend to take an active interest – making comments and suggestions where appropriate. I do not approve of editorial boards who simply provide names.  I intended to announce my membership on this blog.
  • I have been invited to contribute an article to a special issue edited by Bjoern Brembs and I am continuing with that.

I have worked extensively on material in the 3 journals Molecules, Materials and Metabolites because it is well presented and I believe it to be honest science. This does not involve MDPI, although I have told them what I am doing.


I note that there are a great number of accusations about what various people have been doing, some implying fraud or near-criminal activity. I know nothing more of these (that is what the phrase “no personal involvement” was intended to address.) I do not intend to try to find out more about these. I shall not respond to them and may decline to post some of them.

  • I shall continue to mine the content from MDPI journals and publish the resulting science. I can do this with or without the cooperation of MDPI. I shall report the science objectively.
  • I shall continue to be an active member of the board of Data.


I remark that the scholarly publishing industry has a turnover of ca 10-15 Billion dollars. Profit margins are very high. I am not surprised that there are low quality journals. Elsevier’s “Chaos, Solitons & Fractals” is a case in point (see Wikipedia for objective analysis). How many libraries have bought that? What checks are there on quality? None.

I have argued for many years that Open Access needs a regulatory organ and been generally shouted down. The OA community is now reaping the harvest of its lack of care in standards – the (mis)label “Open Access” costs far more dollars than marginal publishers. Had the OA community created a system whereby MDPI or any other publisher could get formally certified they would not have to defend themselves.

No good can come from single people who set themselves up as self-appointed arbiters, be they Beall or Harnad.  Criticising single articles (as Retraction Watch does or the chemical blogosphere) is admirable – especially as the discussion is open and different points of view are accepted. However Beall writes:

This post is a good example of how Brits in particular and Western Europeans in general have been brainwashed into thinking that individuals should not make any assertions and that any statements, pronouncements, etc. must come from a committee, council, board, or the like. This suppression of individuality is emblematic of the intellectual decline of Western Europe. This suppression is laying the foundation for the erosion of individual rights in Europe and the forced imposition of groupthink throughout the continent.

This immediately shows Beall’s total lack of objectivity. He gave an indication earlier with a white paper effectively attacking Open Access as a capitalist plot (or an anti-capitalist one – I couldn’t work out which). My nationality is irrelevant. Beall’s language verges on the nationalist – the nationality of the proprietor of MDPI (Chinese) is irrelevant for me – the question is does s/he run and host an effective operation.

Murray-Rust’s statement “I have no personal involvement with MDPI” is not reflective of the facts. Indeed, he is listed as serving on the editorial board of one of MDPI’s many (empty) journals, the journal Data. See: http://www.mdpi.com/journal/data/editors (Peter, if you did not know that you were listed here, please let me know, because this is a common practice, adding people to editorial boards without their permission. Otherwise, please explain your statement that you lack involvement with MDPI.)

I have explained this above

It would be great if SPARC were to list predatory publishers and journals, but it and most OA organizations pretend that predatory publishers don’t exist because they are afraid to admit that their OA fantasies are … just fantasies. OASPA’s membership list functions as sort of a white list, so if you don’t like my list, use OASPA.

The word “fantasy” immediately removes any chance of rational discourse.

MDPI is becoming an increasingly controversial publisher. This controversy will rub off on authors who publish there, and in the long run, I think most will wish they had published in a higher quality venue. Authors should make decisions as individuals (while they still can) and do what’s best for themselves as researchers. I am saying that for most individual researchers, MDPI is not a good choice, and you ought to consider a better-quality venue.

“Controversial” is a subjective term and irrelevant. It is possible to whip up opinion against an organisation and, where the organisation depends on trust, this can be very difficult to refute. Beall has built a list of publishers of questionable ethics and practices. Initially I felt it was useful, though I disliked the word “predatory” as it applies to many closed access publishers – they just use different tactics. I now have no regard for Beall’s list which I consider consists of personal prejudices (some of them nationalist).

I shall not write more on this topic. I shall write on Data and I shall write on content extraction.



PLOS’ New Data Policy: Public Access to Data

Access to research results, immediately and without restriction, has always been at the heart of PLOS’ mission and the wider Open Access movement. However, without similar access to the data underlying the findings, the article can be of limited use. For this reason, PLOS has always required that authors make their data available to other academic researchers who wish to replicate, reanalyze, or build upon the findings published in our journals.

In an effort to increase access to this data, we are now revising our data-sharing policy for all PLOS journals: authors must make all data publicly available, without restriction, immediately upon publication of the article. Beginning March 3rd, 2014, all authors who submit to a PLOS journal will be asked to provide a Data Availability Statement, describing where and how others can access each dataset that underlies the findings. This Data Availability Statement will be published on the first page of each article.

What do we mean by data?

“Data are any and all of the digital materials that are collected and analyzed in the pursuit of scientific advances.” Examples could include spreadsheets of original measurements (of cells, of fluorescent intensity, of respiratory volume), large datasets such as next-generation sequence reads, verbatim responses from qualitative studies, software code, or even image files used to create figures. Data should be in the form in which it was originally collected, before summarizing, analyzing or reporting.

What do we mean by publicly available?

All data must be in one of three places:

  • the body of the manuscript; this may be appropriate for studies where the dataset is small enough to be presented in a table
  • in the supporting information; this may be appropriate for moderately-sized datasets that can be reported in large tables or as compressed files, which can then be downloaded
  • in a stable, public repository that provides an accession number or digital object identifier (DOI) for each dataset; there are many repositories that specialize in specific data types, and these are particularly suitable for very large datasets

Do we allow any exceptions?

Yes, but only in specific cases. We are aware that it is not ethical to make all datasets fully public, including private patient data, or specific information relating to endangered species. Some authors also obtain data from third parties and therefore do not have the right to make that dataset publicly available. In such cases, authors must state that “Data is available upon request”, and identify the person, group or committee to whom requests should be submitted. The authors themselves should not be the only point of contact for requesting data.

Where can I go for more information?

The revised data sharing policy, along with more information about the issues associated with public availability of data, can be reviewed in full at:



Image: Open Data stickers by Jonathan Gray

The post PLOS’ New Data Policy: Public Access to Data appeared first on EveryONE.

It’s a Mad, Mad, Mad, Mad, but Predictable World: Scaling the Patterns of Ancient Urban Growth

cities from space

With more than 7.1 billion people living across the globe, cities house more than 50% of the world’s population. The United Nations Population Fund projects that by 2030 more than 5 billion people will live in cities across the world. The Global Health Observatory, a program run by the World Health Organization, predicts that by 2050, 7 out of 10 people will live in cities, compared to 2 of 10 just 100 years ago.

Recently, researchers developed what is called “urban scaling theory” to mathematically explain how modern cities behave in predictable ways, despite their unprecedented growth. Recent work in urban scaling research considers cities “social reactors”. In other words, the bigger the city, the more people and more opportunity for social interaction.  Think for a moment about the social interactions that occur just on the block outside of your local coffee shop; now multiply those interactions by millions. Cities magnify the number of interactions, increasing both social and economic productivity and, ultimately, encouraging their own growth.

The authors of a recent PLOS ONE paper sought to determine whether ancient cities “behaved” in predictable patterns similar to their modern counterparts. To do so, they developed mathematical models and tested them on archaeological settlements across the Pre-Hispanic Basin of Mexico (BOM, approximated by the red square in the figure below). Based on their findings, they suggest that the principles of settlement organization, which dictate city growth, were very much the same then as they are now, and may be consistent over time.

To test their predictions, the researchers analyzed archaeological data from over 1,500 sites in the BOM, previously surveyed in the 1960s and 1970s by researchers from the University of Michigan and Penn State.

BOM Location

Using low-altitude aerial photographs and primary survey reports from the original surveyors, the researchers organized the following data from approximately 4,000 sites: the settled area, the average density of potsherds—broken pieces of ceramic material—within it, the count and total surface area of domestic architectural mounds, the settlement type, the estimated population, and the time period.

The researchers were interested in examining areas of the BOM that enabled social interaction between residents, so they excluded site types that did not allow social interaction, such as isolated ceremonial centers, quarries, and salt mounds. They then grouped the remaining 1,500 sites into both chronological groups and size groups. For chronological grouping, each site was assigned to one of four time periods: the Formative period (1150 B.C.E.–150 B.C.E.), the Classic period (150 B.C.E.–650 C.E.), the Toltec period (650–1200 C.E.), and the Aztec period (1200–1519 C.E.). By the Aztec period, the area had developed from amorphous rural settlements to booming metropolises comprising over 200,000 people.

BOM Population

For size grouping, settlements of more than 5,000 people were categorized separately from smaller settlements. In the figure above, panel B denotes settlements dating to the Formative period (1150 B.C.E.–150 B.C.E.), and panel C, settlements dating to the Aztec period (1200–1519 C.E.).

After separating the data into both chronological groups and size groups, the researchers applied their mathematical models and tested their predictions about urban growth in the settlements of the BOM. One aspect of city development assessed by the researchers was the evolution of defined networks of roads and canals in growing cities. Because roads act as conduits, directly influencing social interaction—much like the roads leading to the aforementioned coffee shop—growing cities develop increasingly defined networks to connect social hubs to one another.

Take, for example, the figure below, which displays a city at both an early stage (panel A) and a later stage (panel B) of growth:


Panel A shows the early, or Amorphous Settlement Model, displaying a small settlement easily accessible to the individual via walking, and thus negating the necessity for clearly defined networks of roads. Panel B, on the other hand, shows the Networked Settlement Model, an infrastructure-dense area where networks are clearly defined to accommodate the increased size of the city and density of the residents. Larger cities analyzed by the authors, like Teotihuacan of the Classic period and Tenochtitlán of the Aztec period, epitomize the Networked Settlement Model with its organized network of roads and canals. The findings from the BOM echo the earlier-stated notion that, like their modern counterparts, ancient cities may have acted as “social reactors”, in part by facilitating an increasingly defined network of roads, themselves directly influencing the ability of residents to socially interact.

Scientists use urban scaling theory to show that population and social phenomena follow distinct mathematical patterns over time. By developing mathematical models to predict measurable changes in city growth, these researchers applied the same patterns to ancient cities and concluded that the development of settlements over time in the BOM seems analogous to that observed in modern cities. The researchers predict that the same mathematical models could be adapted to estimate the population size of ancient cities, as well as to develop measures for calculating socio-economic output, such as the production of art and public monuments, based on the relationship between settlement size and division of labor. Although there is still much to be solved through the equations of urban scaling theory, the consistency of city growth over time has implications for both the past and the present.
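For readers who want to see what such a "mathematical pattern" looks like in practice, here is a minimal sketch. The post itself does not give the equations, so this assumes the power-law form in which urban scaling relations are usually written, Y = Y0 * N^beta, where N is population and Y is some measured property of a settlement. The function name `fit_power_law` and all of the numbers are invented for illustration; they are not the paper's data or results.

```python
import math

def fit_power_law(populations, outputs):
    """Fit Y = Y0 * N**beta by least squares on log-transformed data.

    Taking logs turns the power law into a straight line:
    log(Y) = log(Y0) + beta * log(N).
    """
    xs = [math.log(n) for n in populations]
    ys = [math.log(y) for y in outputs]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    # Slope of the regression line is the scaling exponent beta.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    # Intercept of the line gives log(Y0).
    y0 = math.exp(mean_y - beta * mean_x)
    return y0, beta

# Synthetic, made-up settlements: populations and a property generated
# from an assumed exponent of 0.8 (NOT a value taken from the paper).
populations = [100, 1_000, 10_000, 100_000]
settled_area = [2.0 * n ** 0.8 for n in populations]

y0, beta = fit_power_law(populations, settled_area)
print(f"Y0 = {y0:.3f}, beta = {beta:.3f}")
```

An exponent below 1 would mean the property grows more slowly than population, and an exponent above 1 would mean it grows faster. Once Y0 and beta are known, the same relation can be inverted to estimate N from a measured Y, which is the kind of population estimate the paragraph above describes.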

Citation: Ortman SG, Cabaniss AHF, Sturm JO, Bettencourt LMA (2014) The Pre-History of Urban Scaling. PLoS ONE 9(2): e87902. doi:10.1371/journal.pone.0087902

Image 1: Auroras Over North America as Seen From Space by the NASA Goddard Space Flight Center

Image 2: doi:10.1371/journal.pone.0087902

Image 3: doi:10.1371/journal.pone.0087902

Image 4: doi:10.1371/journal.pone.0087902

The post It’s a Mad, Mad, Mad, Mad, but Predictable World: Scaling the Patterns of Ancient Urban Growth appeared first on EveryONE.

Content Mining Myth Busting 0: “It doesn’t matter to me”

In the next few posts I shall address some common myths about Content Mining (TDM). Many are implicitly or explicitly put forward by Toll-Access Publishers (TAPublishers).

The most serious myth is that it’s not important.

Actually it’s important to everyone. The two major information successes of the first decade of this century were both content-mining:

  • Google has systematically mined the Open Web using machines and added its own semantics
  • Wikipedia has systematically mined the infosphere using humans and added its own semantics.

If you have ever used Google or ever used Wikipedia then you have used the results of content-mining.

Wikipedia is beyond criticism – if you are unhappy about it, get involved and change it. But what about Google?

Well Google doesn’t do science.

If I want to know what species was recorded in this place at that date; or what chemical reaction occurred under these conditions, then Google doesn’t help. You need a semantic scientific search engine.

Discipline-based Semantic content mining  is the most important development in applied information science. If you want to build the library of the future you should be doing this – not paying rent to third parties. If you want to do multidisciplinary research you need the results of content-mining.

If we were allowed to do it, then I wouldn’t be writing this blog post. As it is, the TAPublishers are fighting tooth-and-nail to stop us content-mining. People are doing it, but in secret, because if they do it in public they will be cut off or sued. It’s not surprising that we don’t yet have high visibility.

But that’s going to change. And change rapidly. We have literally billions of dollars of information locked up in the current scholarly literature. And 10000 papers come out each day. We need content mining to manage these – read them for us. Organize them. Let us search after we’ve read them. Do some of our routine thinking for us.

On our own terms for our own needs.

It can happen, just as Wikipedia happened.

So don’t turn away – believe that Content Mining matters – matters massively.


All our software is Open Source; our Data is Open and our standards are Open

Several commenters have asked whether the software we write is Open.




All our software is aggressively Open Source or Free, written with a primary purpose of making information universally free. I call it


Some years ago a number of us met in San Diego under the larger Blue Obelisk in Horton Plaza and decided to promote our software as



and came up with the mantra

ODOSOS = Open Data Open Standards Open Source

This has been very successful (see http://www.blueobelisk.org) and we are continuing to bring in new groups.

Our own group has produced:

OSCAR2 – data checking (Chris Waudby, Joe Townsend et al)

OSCAR4 – chemical entity recognition (Peter Corbett, David Jessop, Lezan Hawizy)

OPSIN – name to structure (Daniel Lowe)

CHEMICAL_TAGGER – chemical phrase interpretation (Lezan Hawizy, Nico Adams, Hannah Barjat)

EUCLID/CMLXOM/JUMBO – Chemical Markup Language (PM-R et al)

SVG PDF2SVG SVG2XML HTML – interpreting PDFs (PMR, Murray Jensen)

SVG2XML – more interpreting PDFs (PMR)

XMTML2STM + FooVisitors (phylo, chemistry) (PMR, Mark Williamson, Andy Howlett)

CRAWLER and REPO (PMR, Mark Williamson)

So yes, there’s lots to build on.






Open access and scientific research: towards new values (international conference call for papers)

A call for papers is now open for an international conference called Open access and scientific research: towards new values, Tunis, Nov. 27-28, 2014. French version here: Libre accès et recherche scientifique: vers de nouvelles valeurs

The deadline for extended abstracts (2 pages) is March 30, 2014.  I am honoured to be a part of the scientific committee for this conference, organized by the research unit “Digital Library and Heritage” of the Higher Institute of Documentation (ISD), Manouba University, Tunisia, in partnership with The National University Center for Scientific and Technical Documentation (CNUDST), Tunisia.