Interview with Frank Seeliger (TH Wildau) and Anna Kasprzik (ZBW)
We recently had a long talk with experts Anna Kasprzik (ZBW – Leibniz Information Centre for Economics) and Frank Seeliger (Technical University of Applied Sciences Wildau – TH Wildau) about the use of artificial intelligence in academic libraries. The occasion: Both of them were involved in two wide-ranging articles: “On the promising use of AI in libraries: Discussion stage of a white paper in progress – part 1” (German) and “part 2” (German).
In their working context, both of them have an intense connection and great interest in the use of AI in the context of infrastructure institutions and libraries. Dr Frank Seeliger is the director of the university library at the TH Wildau and has been jointly responsible for the part-time programme Master of Science in Library Computer Sciences (M.Sc.) at the Wildau Institute of Technology. Anna Kasprzik is the coordinator of the automation of subject indexing (AutoSE) at the ZBW.
This slightly shortened, three-part series has emerged from our spoken interview. These two articles are also part of the series:
Anna Kasprzik: I have a very clear opinion here and have already written several articles about it. For years, I have been fighting for the necessary resources and I would say that we have manoeuvred ourselves into a really good starting position by now, even if we are not out of the woods yet. The main issue for me is commitment – right up to the level of decision makers. I’ve developed an allergy to the “project” format. Decision makers often say things like, “Oh yes, we should also do something with AI. Let’s do a project, then a working service will develop from it and that’s it.” But it’s not that easy. Things that are developed as projects tend to disappear without a trace in most cases.
We also had a forerunner project at the ZBW. We deliberately raised it to the status of a long-term commitment together with the management. We realised that automation with machine learning methods is a long-term endeavour. This commitment was essential. It was an important change of strategy. We have a team of three people here and I coordinate the whole thing. There’s a doctoral position for a scientific employee who is carrying out applied research, i.e. research that is very much focused on practice. When we received this long-term commitment status, we started a pilot phase. In this pilot phase, we recruited an additional software architect. We therefore have three positions for this, which correspond to three roles and I regard all three of them as very important.
The ZBW has also purchased a lot of hardware because machine learning experiments require serious computing power. We have then started to develop the corresponding software infrastructure. This system is already productive, but will be continually developed based on the results of our in-house applied research. What I’m trying to say is this: the commitment is important and the resources must reflect this commitment.
Frank Seeliger: This is naturally the answer of a Leibniz institution that is well endowed with research professorships. Apart from some national state libraries and larger libraries, however, this is usually difficult to achieve. Most libraries have neither a corresponding research mandate nor the personnel resources to finance such projects on a long-term basis. Nevertheless, there are also technologies that smaller institutions need to invest in, such as cloud-based services or infrastructure as a service. But they need to commit to this, including beyond the project phases. Our Agenda 2025/30 anchors this as a long-term commitment within the context of the automation that is coming anyway. The coronavirus pandemic in particular gave this a boost, when people saw how well things can function even when they take place online. What matters is that people regard this as a task and seek out information about it accordingly. The mandate is to explore the technology deliberately. Only in this way can people at working or management level see not only the degree of investment required, but also what successes they can expect.
But it’s not only libraries that have begun, in the last ten years or so, to explore the topic of AI. The situation is comparable with small and medium-sized businesses or other public institutions, for instance those dealing with the Online Access Act. They too are exploring these kinds of algorithms, so there is solidarity to be found: libraries are not alone here. This is very important, because many of the measures, particularly those at the level of the German federal states, were not necessarily designed with libraries in mind with respect to the distribution of AI tasks or funding.
That’s why we also intended our publication (German) as a political paper: political in the sense of informing politicians and decision-makers about the financial possibilities, and about the framework we need in order to apply them. Only then can we test things, decide whether we want to use indexing or other tools such as language tools permanently in the library world, and network with other organisations.
The task for smaller libraries who cannot manage to have research groups is definitely to explore the technology and to develop their position for the next five to ten years. This requires such counterpoints to what is commonly covered by meta-search engines such as Wikipedia. Especially as libraries have a completely different lifespan than companies, in terms of their way of thinking and sustainability. Libraries are designed to last as long as the state or the university exists. Our lifecycles are therefore measured differently. And we need to position ourselves accordingly.
Anna Kasprzik: Yes and no. We are in touch with other institutions such as the German National Library. Our scientific employee and developer is working on the further development of the Finnish toolkit Annif with colleagues from the National Library of Finland, for example. This toolkit is also interesting for many other institutions to use directly. I think it’s very good to exchange ideas, also regarding our experiences with toolkits such as this one.
However, I discover time and again that there are limits to this when I advise other institutions; for example, just last week I advised some representatives from Swiss libraries. You can’t do everything for the other institutions. If they want to use these instruments, institutions have to train them on their own data. You can’t just train the models and then plant them one-to-one into other institutions. For sure, we can exchange ideas, give support and try to develop central hubs where at least structures or computing power resources are provided. However, nothing will be developed in this kind of hub that is an off-the-shelf solution for everyone. This is not how machine learning works.
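The point that trained models cannot simply be transplanted between institutions can be made concrete with a toy sketch. Everything below is invented for illustration (a real system would use a toolkit such as Annif): the same minimal "subject suggester" code, trained on two different institutions' catalogue data, produces different suggestions for the same title.

```python
from collections import Counter, defaultdict

class TinySubjectSuggester:
    """A deliberately minimal stand-in for a subject-indexing model:
    it learns which title words co-occur with which subject labels."""

    def __init__(self):
        self.word_subject_counts = defaultdict(Counter)

    def train(self, corpus):
        # corpus: list of (title, [subject labels]) pairs from the OWN catalogue
        for title, subjects in corpus:
            for word in title.lower().split():
                for subject in subjects:
                    self.word_subject_counts[word][subject] += 1

    def suggest(self, title):
        # score subjects by how often they co-occurred with the title's words
        scores = Counter()
        for word in title.lower().split():
            scores.update(self.word_subject_counts[word])
        return [subject for subject, _ in scores.most_common(3)]

# An economics library's data produces economics-flavoured suggestions ...
econ = TinySubjectSuggester()
econ.train([
    ("monetary policy and inflation", ["Economics", "Monetary policy"]),
    ("labour markets and inflation", ["Economics", "Labour"]),
])

# ... while the identical code trained on another institution's data
# interprets the very same query completely differently.
med = TinySubjectSuggester()
med.train([
    ("inflation of the lungs in ventilation therapy", ["Medicine", "Ventilation"]),
])

print(econ.suggest("inflation report"))
print(med.suggest("inflation report"))
```

The model is nothing but its training data in compressed form, which is why each institution has to train on its own holdings.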
Frank Seeliger: The library landscape in Germany is like a settlement and not like a skyscraper. In the past, there was a German library institute (DBI) that tried to bundle many matters in the academic libraries in Germany across all sectors. This kind of central unit no longer exists, merely several library groups relating to institutions and library associations relating to personnel. So a central library structure that could take on the topic of AI doesn’t exist. There was an RFID working group (German) (or also Special Interest Group RFID at the IFLA), and there should actually also be a working group for robots (German), but of course someone has to do it, usually alongside their actual job.
In any case, there is no central library infrastructure that could take up this kind of topic as a lobby organisation, such as Bitkom, and break it down into the individual companies. The route that we are pursuing is broadly based. This is related to the fact that we operate in very different ways in the different German federal states, owing to the relationship between national government and federal states. The latter have sovereignty in many areas, meaning that we have to work together on a project basis. It will be important to locate cooperation partners and not try to work alone, because it is simply too much. There is definitely not going to be a central contact point. The German Research Center for Artificial Intelligence (DFKI) does not have libraries on its radar either. There’s no one to call. Everything is going to run on a case-by-case and interest-related basis.
Frank Seeliger: That’s why there are library congresses where people can discuss issues. Someone gives a presentation about something they have done and then other people are interested: they get together, write applications for third-party funding or articles together, or try to organise a conference themselves. Such conferences already exist, and thus a certain structure of exchange has become established.
I am the conservative type. I read articles in library journals, listen to conference news or attend congresses. That’s where you have the informal exchange – you meet other people. Alongside social media, which is also important. But if you don’t reach people via the social media channels, then there is (hopefully soon to return) physical exchange on site via certain section days, for example. Next week we have another Section IV meeting of the German Library Association (DBV) in Dresden where 100 people will get together. The chances of finding colleagues who have similar issues or are dealing with a similar topic are high. Then you can exchange ideas – the traditional way.
Anna Kasprzik: But there are also smaller workshops for specialists. For example, the German National Library has been organising a specialist congress of the network for automated subject indexing (German) (FNMVE) for those who are interested in automated approaches to subject indexing.
I also enjoy networking via social media. You can also find most people who are active in the field on the internet, e.g. on Twitter or Mastodon. I started using Twitter in 2016 and deliberately developed my account by following people with an interest in semantic web technologies. These are individuals, but they represent an entire network. I can’t name individual institutions; what is relevant are individual community members.
Anna Kasprzik: It’s all Frank’s fault.
Frank Seeliger: Anna came here once. I had invited Mr Puppe in the context of a digitalisation project in which AI methods supported optical character recognition (OCR) and image identification of historical works. This happened exactly via the traditional route I’ve just described, i.e. via a symposium; that was how the first people were invited.
Then the need to position ourselves on this topic developed. I had spoken with a colleague from the Netherlands at a conference shortly before. He said that they had been too late with their AI white paper, meaning that politics had not taken them into account and libraries had not received any special funding for AI tools. That was the wake-up call for me and I thought, here in Germany there is also nothing I am aware of that is specifically for information institutions. I then researched who had publications on the topic. That’s how the network, which is still active, developed. We are working on the English translation at the moment.
Anna Kasprzik: For institutions who can, it’s important to develop long-term expertise. But I completely understand Frank’s point of view: it is valid to say that not every institution can afford this. So two aspects are important for me: one is to cluster expertise and resources at certain central institutions. The other is to develop communication structures across institutions or to share a cloud structure or something similar. To create a network in order to spread it around. To enable dissemination, i.e. the sharing of these experiences for reuse.
Frank Seeliger: Perhaps there is a third aspect: to reflect on the business process that you are responsible for so that you can identify whether it is suitable for an AI-supported automation, for example. To reflect on this yourself, but to encourage your colleagues to reflect on their own workflows too, as to whether routine tasks can be taken over by machines and thereby relieve them of some of the workload. For example, in our library association, the Kooperativer Bibliotheksverbund Berlin-Brandenburg (KOBV), we had the problem that we would have liked to set up a lab. Not only to play, but also to see together how we can technically support tasks that are really very close to real life. I don’t want to say that the project failed, but the problem was that first you needed the ideas: What can you actually tackle with AI? What requires a lot of time? Is it the indexing? Other work processes that are done over and over again like a routine with a high degree of similarity? We wanted the lab to look at exactly these processes and check if we could automate them, independently of what library management systems do or all the other tools with which we work.
It’s important to initiate the process of self-reflection on automation and digitalisation in order to identify fields of work. Some have expertise in AI, others in their own fields, and they have to come together. The path leads through one’s own reflection to entering into conversation and sounding out whether solutions can be found.
Frank Seeliger: Leadership is about bringing people together and giving impetus. The coronavirus pandemic and digitalisation have put a lot of pressure on many people. There is a saying by Angela Merkel: she once said that she only got around to thinking during the Christmas period. However you want to interpret that now. Out of habit, and because you want to clear the pile of work on your desk during working hours, it is often difficult to reflect on what you are doing and whether there isn’t already a tool that could help. It is then the task of the management level to look at these processes and, where appropriate, to say: yes, maybe this person could be helped here. Let’s organise a project and take a closer look.
Anna Kasprzik: Yes, that’s one of the tasks, but for me the role of management is above all to take the load off employees and clear a path for them. This brings another buzzword into play: agile working. It’s not only about giving impetus, but also about supporting people by giving them some leeway so that they can work independently. The agile manifesto, so to speak: creating space for experimentation and allowing for failure sometimes. Otherwise, nothing will come to fruition.
Frank Seeliger: We will soon be running a “Best of Failure” survey, because we want to ask what kind of error culture we really have; the topic is practically sacrosanct. This will also be the theme of the Wildau Library Symposium (German) from 13 to 14 September 2022, where we will explore this error culture more intensively. And rightly so: even in IT projects, you simply have to allow things to go wrong. Of course, they don’t have to be taken on as a permanent task if they don’t go well. But sometimes it’s good just to try, because you can’t predict whether a service will be accepted or not. What do we learn from these mistakes? We talk about them relatively little; mostly we talk about successful projects that go well and attract crazy amounts of funding. But the other side also has to come into focus, so that we can learn from it and utilise aspects of it for the next project.
Frank Seeliger: AI is not just a task for large institutions.
Anna Kasprzik: Exactly, AI concerns everyone. Even so, AI should not be pursued just for the sake of AI, but rather to develop innovative services that would otherwise not be possible.
Frank Seeliger: There are naturally other topics, no question about that. But you have to address it and sort out the various topics.
Anna Kasprzik: It’s important that we get the message across that automated approaches should not be regarded as a threat. By now this digital jungle exists anyway, so we need tools to find our way through it. AI therefore represents new potential and added value, not a threat that will be used to eliminate people’s jobs.
Frank Seeliger: We have also been asked: what is the added value of automation? Of course, you spend less time on routine processes that are very manual. This creates scope to explore new technologies, to do advanced training or to have more time for customers. And we need this scope to develop new services. You simply have to create that scope, also for agile project management, so that you don’t spend 100% of your time clearing some pile of work or other from your desk, but can instead use 20% for something new. AI can help give us this time.
Thank you for the interview, Anna and Frank.
Part 1 of the interview on “AI in Academic Libraries” is about areas of activity, the big players and the automation of indexing.
In part 2 of the interview on “AI in Academic Libraries” we explore interesting projects, the future of chatbots and the problem of discrimination through AI.
This might also interest you:
Dr Anna Kasprzik, coordinator of the automation of subject indexing (AutoSE) at the ZBW – Leibniz Information Centre for Economics. Anna’s main focus lies on the transfer of current research results from the areas of machine learning, semantic technologies, semantic web and knowledge graphs into productive operations of subject indexing of the ZBW. You can also find Anna on Twitter and Mastodon.
Portrait: Photographer: Carola Gruebner, ZBW©
Dr Frank Seeliger (German) has been the director of the university library at the Technical University of Applied Sciences Wildau since 2006 and has been jointly responsible for the part-time programme Master of Science in Library Computer Sciences (M.Sc.) at the Wildau Institute of Technology since 2015. One module explores AI. You can find Frank on ORCID.
Portrait: TH Wildau
Featured Image: Alina Constantin / Better Images of AI / Handmade A.I / Licensed by CC-BY 4.0
The post AI in Academic Libraries, Part 3: Prerequisites and Conditions for Successful Use first appeared on ZBW MediaTalk.
Interview with Frank Seeliger (TH Wildau) and Anna Kasprzik (ZBW)
We recently had an intense discussion with Anna Kasprzik (ZBW) and Frank Seeliger (Technical University of Applied Sciences Wildau – TH Wildau) on the use of artificial intelligence in academic libraries. Both of them were recently involved in two wide-ranging articles: “On the promising use of AI in libraries: Discussion stage of a white paper in progress – part 1” (German) and “part 2” (German).
Dr Anna Kasprzik coordinates the automation of subject indexing (AutoSE) at the ZBW – Leibniz Information Centre for Economics. Dr Frank Seeliger (German) is the director of the university library at the Technical University of Applied Sciences Wildau and is jointly responsible for the part-time programme Master of Science in Library Computer Sciences (M.Sc.) at the Wildau Institute of Technology.
This slightly shortened, three-part series has been drawn up from our spoken interview. These two articles are also part of it:
We will link it as soon as the text is online.
Anna Kasprzik: Of course, there are many interesting AI projects. Off the top of my head, the following two come to mind: The first one is interesting for you if you are interested in the issue of optical character recognition (OCR). Because, before you can even start to think about automated subject indexing, you have to create metadata, i.e. “food” for the machine. So to speak: segmenting digital texts into their structural fragments, extracting an abstract automatically. In order to do this, you run OCR on the scanned text. Qurator (German) is an interesting project in which machine learning methods are used as well. The Staatsbibliothek zu Berlin (Berlin State Library) and the German Research Center for Artificial Intelligence (DFKI) are involved, among others. This is interesting because at some point in the future it might give us the tools we need in order to be able to obtain the data input required for automated subject indexing.
The other project is the Open Research Knowledge Graph (ORKG) of the TIB Hannover. The Open Research Knowledge Graph is a way of representing scientific results no longer as a document, i.e. as a PDF, but rather in an entity-based way. Author, research topic or method – all nodes in one graph. This is the semantic level and one could use machine learning methods in order to populate it.
Frank Seeliger: Only one project: it is running at the ZBW and the TH Wildau and explores the development of a chatbot with new technologies. The idea of chatbots is actually relatively old. A machine conducts a dialogue with a human being. In the best case, the human being does not recognise that a machine is running in the background – the Turing Test. Things are not quite this advanced yet, but the issue we are all concerned with is that libraries are being consulted – in chat rooms, for example. Many libraries aim to offer a high level of service at the times when researchers and students work, i.e. round the clock. This can only take place if procedures are automated, via chatbots for example, so that difficult questions can be also answered outside the opening hours, at weekends and on public holidays.
I therefore hope, firstly, that the input we receive on chatbot development means it will become a high-quality standard service that offers fast orientation and gives information with excellent predictive quality about a library or its special services. This would create the starting point for other machines, such as mobile robots. Many people are investing in robots, playing around with them and trying out various things. The expectation is that users can go up to them and ask, “Where is book XY?” or “How do I find this and that?”, and that the robots can handle such questions usefully, orient themselves, and literally point to the answer. That’s one thing.
The second thing that I find very exciting is winning people over to AI at an early stage. Not just treating AI as a buzzword, but looking behind the scenes of this technology complex. We tried to offer a certificate course (German). However, demand was too low for us to run it, but we will try again. The German National Library provides a similar course that was well attended. I think it’s important to make a low-threshold offer across the board, i.e. for a one-person library or for small municipal libraries run on a communal basis, as well as for larger university libraries. People should get to grips with the subject matter and find their own way: where they can reuse something, where there are providers or cooperation partners. I find this kind of project very interesting and important for the world of libraries.
But this too can only be the starting point for many other offers of special workshops, on Annif for example or other topics that can be discussed at a level that non-informaticians can understand as well. It’s an offer to colleagues who are concerned with it, but not necessarily at an in-depth level. As with a car – they don’t manufacture the vehicle themselves, but want to be able to repair or fine-tune it sometimes. At this level, we definitely need more dialogue with the people who are going to have to work with it, for example as system administrators who set up or manage such projects. The offers must also be focused towards the management level – the people who are in charge of budgeting, i.e. those who sign third-party funding applications.
Frank Seeliger: The interesting perspective for me is that we can pursue the development of a chatbot together with other libraries. It is good when more than one library serves as the knowledge base in the background for the typical examples. This is not possible with locally specific information such as opening hours or spatial conditions, but many synergy effects are created nevertheless. We can pool resources and generate as large a quantity of data as possible, so that the quality of the automatically generated answers is simply better than if we were to set everything up individually. The output quality has a lot to do with the data quality. It is not simply true that more data means better information; other factors also play a role. But generally, small solutions tend to fail because of the small quantity of data.
Especially in view of the fact that a relatively high number of libraries are keen to invest in robot solutions that “walk” through the library outside the opening hours and offer services, like a robot librarian. If the service is used, it makes doubly good sense to offer it online, but also to make it available via a machine that rolls through the premises. This is important, because the personal approach from the library to its clients is a decisive, differentiating feature compared with the large commercial meta-search providers. Seeking dialogue and paying attention to the special requirements of the users: this is what makes the difference.
Anna Kasprzik: Even though I am not involved in the chatbot project at ZBW, I can think of three challenges. The first is that you need an incredible amount of training data. Getting hold of that much data is relatively difficult. Here at ZBW we have had a chat feature for a long time – without a bot. These chats have been recorded but first they had to be cleaned of all personal data. This was an immense amount of editorial work. That is the first challenge.
The second challenge: it’s a fact that relatively trivial questions, such as the opening hours, are easily answered. But as soon as things become more complex, i.e. when there are specialised questions, you need a knowledge graph behind the chatbot. And setting this up is relatively complex.
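The two-layer idea behind those first two challenges can be sketched in a few lines. Everything here is invented placeholder data, not the ZBW project's actual implementation: trivial questions are answered from a rule table, while specialised questions fall through to a miniature triple store standing in for the knowledge graph.

```python
# Layer 1: canned answers for trivial, high-frequency questions.
FAQ = {
    "opening hours": "We are open Monday to Friday, 9:00-20:00.",
}

# Layer 2: a miniature knowledge graph as (subject, predicate, object)
# triples. Real systems would use a proper graph store and vocabulary.
TRIPLES = [
    ("Keynes", "wrote", "The General Theory"),
    ("The General Theory", "shelfmark", "EC 310"),
]

def answer(question):
    q = question.lower()
    # 1. rule-based pass for trivial questions
    for key, reply in FAQ.items():
        if key in q:
            return reply
    # 2. graph lookup for specialised questions
    for s, p, o in TRIPLES:
        if s.lower() in q and p.lower() in q:
            return f"{s} {p} {o}."
    return "Sorry, I did not understand the question."

print(answer("What are your opening hours?"))
print(answer("Tell me what Keynes wrote."))
```

The sketch also hints at why the second challenge is the hard one: the FAQ table is trivial to maintain, but every specialised answer depends on someone having modelled the corresponding facts in the graph first.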
Which brings me to the third challenge: during the initial runs, the project team established that quite a few of the users had reservations and quickly thought, “It doesn’t understand me”. So there were reservations on both sides. We therefore have to be mindful of the quality aspect and also of the “trust” of the users.
Frank Seeliger: But interactions are also moving in the direction of speech, particularly among the younger generations now arriving at libraries as students. This generation communicates via voice messages: students speak with Siri or Alexa and are informal when talking to technologies. FIZ Karlsruhe attempted to handle search queries using Alexa. That went well in itself, but it failed because of the European General Data Protection Regulation (GDPR), information privacy and the fact that data was processed somewhere in the USA. Naturally, that is not acceptable.
That’s why it is good that libraries are doing their own thing – they have data sovereignty and can therefore ensure that the GDPR is maintained and that user data is treated carefully. But it would be a strategic mistake if libraries did not adapt to the corresponding dialogue. Very simply because a lot of these interactions no longer take place with writing and reading alone, but via speech. As far as apps and features are concerned, much is communicated via voice messages, and libraries need to adapt to this fact. It starts with chatbots, but the question is whether search engines will be able to cope with (voice) messages at some point and then filter out the actual question. Making a chatbot functional and usable in everyday life is only the first step. With spoken language, this then incorporates listening and understanding.
Anna Kasprzik: I’m not sure when the ZBW is planning to put its chatbot online; it could take one or two years. The real question is: when will such chatbots become viable solutions in libraries globally? This may take at least ten years or longer, without wanting to crush hopes too much.
Frank Seeliger: There are always unanticipated revivals popping up, for which a certain impetus is needed. For example, I was in the IT section of the International Federation of Library Associations and Institutions (IFLA) on statistics. We considered whether we could determine statistics clearly and globally, and depict them as a portfolio. Initially it didn’t work – it was limited to one continent: Latin America. Then the section received a huge surprise donation from the Bill and Melinda Gates Foundation and with it, the project IFLA Library Map of the World could be implemented.
It was therefore a very special impetus that led to something we would normally not have achieved with ten years’ work. And when this impetus exists through tenders, funding or third-party donors that accelerate exactly this kind of project, perhaps also from a long-term perspective, the whole thing takes on a new dynamic. If the development of chatbots in libraries continues to stagnate like this, libraries will not use them on a market-wide scale. There was a similar movement with contactless object recognition via radio waves (Radio-Frequency Identification, RFID). It started in 2001 in Siegburg, then Stuttgart and Munich. Now it is used in 2,000 to 3,000 libraries. I don’t see this impetus with chatbots at all. That’s why I don’t think that, in ten or 15 years, chatbots will be used in 10% to 20% of libraries. It’s an experimental field. Maybe some libraries will introduce them, but it will be a handful, perhaps a dozen. However, if a driving force emerges owing to external factors such as funding or a network initiative, the whole concept may receive new momentum.
Anna Kasprzik: That’s a very tricky question. Not many people are aware that potential difficulties almost always arise from the training data because training data is human data. These data sources contain our prejudices. In other words, whether the results may have a discriminating effect or not depends on the data itself and on the knowledge organisation systems that underpin it.
One movement that is gathering pace is known as decolonisation. People are taking a close look at the vocabularies they use: thesauri and ontologies. The problem has come up for us as well: since we also provide historical texts, terms that have racist connotations today appeared in the thesaurus. Naturally, we primarily incorporate terms that are considered politically correct, but these assessments can shift over time. The question is: what do you do with historical texts where such a word occurs in the title? The task is then to find ways to keep them as hidden elements of the thesaurus without displaying them in the interface.
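The "hidden elements" approach can be sketched concretely. This is loosely modelled on the SKOS distinction between preferred and hidden labels, with neutral placeholder strings instead of actual offensive terms: hidden labels stay searchable so historical titles remain findable, but the interface only ever shows the preferred label.

```python
# A toy thesaurus: each concept has one preferred label for display and
# may carry hidden labels (e.g. outdated historical terms) that remain
# searchable but are never shown to users. All labels are placeholders.
THESAURUS = {
    "concept:123": {
        "prefLabel": "current preferred term",
        "hiddenLabels": ["outdated historical term"],
    },
}

def find_concept(query):
    """Match a search term against preferred AND hidden labels."""
    q = query.lower()
    for concept_id, concept in THESAURUS.items():
        labels = [concept["prefLabel"]] + concept.get("hiddenLabels", [])
        if any(q == label.lower() for label in labels):
            return concept_id
    return None

def display_label(concept_id):
    """The interface only ever renders the preferred label."""
    return THESAURUS[concept_id]["prefLabel"]

# A search using the outdated term still finds the concept ...
cid = find_concept("outdated historical term")
# ... but the UI shows only the current preferred label.
print(display_label(cid))
```

This keeps the historical record searchable without reproducing the offending terminology in the catalogue's public face.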
There are knowledge organisation systems that are very old and developed in times very different from ours. We urgently need to restructure them. It is always a balancing act if you want to present texts from earlier periods with the structures that were in use at the time: I must neither falsify the historical context nor offend anyone who wants to search these texts and feel represented, or at least not discriminated against. This is a very difficult question, particularly in libraries. People often think this is not an issue for libraries, that it is only relevant in politics or similar. But on the contrary, libraries reflect the times in which they exist, and rightly so.
Frank Seeliger: Everything that you can use can also be misused. This applies to every object. For example, I was very impressed in Turkey. They are working with a big Koha approach (library software), meaning that more than 1,000 public libraries are using the open source solution Koha as their library management software. They therefore know, among other things, which book is most often borrowed in Turkey. We do not have this kind of information at all in Germany via the German Library Statistics (DBS, German). This doesn’t mean that this knowledge discredits the other books, that they are automatically “leftovers”. You can do a lot with knowledge. The bias that exists with AI is certainly the best known. But it is the same for all information: should monuments be pulled down or left standing? We need to find a path through the various moral phases that we live through as a society.
In my own studies, I specialised in pre-Columbian America. To name one example, the Aztecs never referred to themselves as Aztecs; they called themselves Mexi'ca. If you searched in library catalogues from before 1763, the term "Aztec" did not exist. Or take the Kerensky Offensive – search engines do not have much to offer on that. It was a military offensive that was only given that name afterwards; it used to be called something else. It is the same challenge: to cover both terms, even if the terminology has changed or it is no longer "en vogue" to work with a certain term.
Anna Kasprzik: This is also called concept drift and it is generally a big problem. It's why you always have to retrain the machines: concepts are continually developing, new ones emerge or old terms change their meaning. Even if there is no discrimination, terminology is constantly evolving.
Anna Kasprzik: The machine learning experts at the institution.
Frank Seeliger: The respective zeitgeist and its intended structure.
Thank you for the interview, Anna and Frank.
Part 1 of the interview on “AI in Academic Libraries” is about areas of activity, the big players and the automation of indexing.
Part 3 of the interview on “AI in Academic Libraries” focuses on prerequisites and conditions for successful use.
We will share the link here as soon as the post is published.
This text has been translated from German.
Dr Anna Kasprzik, coordinator of the automation of subject indexing (AutoSE) at the ZBW – Leibniz Information Centre for Economics. Anna’s main focus lies on the transfer of current research results from the areas of machine learning, semantic technologies, semantic web and knowledge graphs into productive operations of subject indexing of the ZBW. You can also find Anna on Twitter and Mastodon.
Portrait: Photographer: Carola Gruebner, ZBW©
Dr Frank Seeliger (German) has been the director of the university library at the Technical University of Applied Sciences Wildau since 2006 and has been jointly responsible for the part-time programme Master of Science in Library Computer Sciences (M.Sc.) at the Wildau Institute of Technology since 2015. One module explores AI. You can find Frank on ORCID.
Portrait: TH Wildau
Featured Image: Alina Constantin / Better Images of AI / Handmade A.I / Licensed by CC-BY 4.0
The post AI in Academic Libraries, Part 2: Interesting Projects, the Future of Chatbots and Discrimination Through AI first appeared on ZBW MediaTalk.
Interview with Frank Seeliger (TH Wildau) and Anna Kasprzik (ZBW)
We recently had an intense discussion with Anna Kasprzik (ZBW) and Frank Seeliger (Technical University of Applied Sciences Wildau) on the use of artificial intelligence (AI) in academic libraries. Both of them were also recently involved in two wide-ranging articles: “On the Promising Use of AI in Libraries: Discussion Stage of a White Paper in Progress – Part 1” (German) and “Part 2” (German). This slightly shortened, three-part series has been drawn up from our spoken interview. These two articles are also part of this series:
We will link them here as soon as the texts are online.
An interview with Dr Anna Kasprzik (ZBW – Leibniz Information Centre for Economics) and Dr Frank Seeliger (University Library of the Technical University of Applied Sciences Wildau).
Frank Seeliger: Time and again, reports crop up about how great the automation potential of different job profiles is. This also applies to libraries: In the case of the management of an institution, automation using AI is minimal, but for the specialists for media and information services (FaMI in German), it could be up to 50%.
In the course of automation and digitalisation, it’s largely about changing process chains and automating so that users can borrow or return media autonomously in the libraries – outside opening hours or during rush hour – essentially as an interaction between human and machine.
Even the display of availability in the catalogue is a consequence of automation and the digitalisation of library services: users can check from home whether a medium is available. Services in this area – those that let people reach a service outside the immediate vicinity and the opening hours, for example to ask a question or use a resource in the evening via remote access – are certainly increasing. This process continues and also includes internal procedures such as leave requests or budget planning. These processes run completely differently compared to 15 years ago.
One of the first areas of activity for libraries is automatic letter and number recognition, including for older works, cimelia and early printed books, and generally in the context of digitalisation projects. This is one area of library expertise: layout analysis, identification and recognition. The other is indexing. Many years ago, libraries worked almost exclusively with printed works, keywording them and indexing their content. Nowadays, systems also capture tables of contents and work with what are known as “component parts of a bibliographically independent work”, i.e. articles that are co-documented in discovery tools or search engines. The question is always: “How should we prepare this knowledge so that it can be found using completely different approaches?” Competitors such as Wikipedia and Google predetermine the speed to some extent. We try to keep up or move into niche fields where we have different expertise, another perspective. These are definitely the first areas of activity – in operations, search and indexing, and digitalisation – where AI is helping us to go further than before.
It has thereby been possible for many libraries to offer services at lower personnel cost even beyond the opening hours of public libraries (Open Level concept). Not round the clock, but for several more hours – even if no-one is in the building.
We need to make sure that we provide students with high-quality information at different places and at different times. This is why chatbots, for example (more on this in part 2 of this article series), are such an exciting development: students do not necessarily work when libraries are open or when our service hours apply, but rather in the evenings, at weekends or on public holidays. Libraries have the urgent task of providing them with sufficient, quality-checked information. We need to position ourselves where the modern technologies are.
Anna Kasprzik: Perhaps I’m biased because I’m working in the field but for me it’s very important to differentiate: I am specialised in the field of automation of subject indexing in academic libraries; the core task is to process and provide information intelligently. For me, this is the most interesting field. However, I sometimes get the impression that some libraries are falling into a trap: they want to do “something with AI” because it’s cool at the moment and then just end up dabbling in it.
But it’s really important to tackle the core tasks and thus prove that libraries can stay relevant. These days, core tasks such as subject indexing are impossible to imagine without automation. Previously this work was done intellectually by people, often even by people with doctorates. But because the tasks are changing and the quantity of digital publications is growing so rapidly, humans can only achieve a fraction of what is required. This is why we need to automate and successively find ways to combine humans and machines more intelligently. In machine learning, we speak of the “human in the loop”: the various ways in which humans and machines can work together to solve problems. We really need to focus on the core tasks and apply methods of artificial intelligence, not just do explorative projects that might be interesting in the short term but are not sustainably thought through.
Frank Seeliger: The challenge is that, even when you have a very narrow field that you are trying to research and describe, it’s difficult to stay up to date with all relevant articles. You need tools such as the Open Research Knowledge Graph (ORKG). With its help, content can be compared with the same methods and similar facts, without reading the entire article. Because this naturally requires time and energy. It’s impossible to read 20 scientific articles a day. But that’s how many are produced in some fields. That’s why you need to develop intelligent tools that help scientists to get a fast overview of which articles to prioritise for reading, absorbing and reflecting on.
But it goes even further. In the authors’ group of the “White Paper in progress” (German), which met for one year, we asked ourselves what the search of the future would look like: will we still search for keywords? We’re familiar with this from plagiarism detection software into which entire documents are entered. The software checks whether there is a match with other publications and whether non-cited text has been used without permission. But you can also turn the whole thing around and say: I have written something; have I forgotten a significant, current contribution in science? As a result, you get a semantic, ontology-based hint that there is already an article on the topic you have explored which you should reflect on and incorporate. This is a perspective for us, because we assume that today one can hardly keep on top of the literature, even with an interdisciplinary focus or when exploring a new field. It would also be exciting to find a way in via a graphical analysis that ensures that you have not forgotten anything important.
Frank Seeliger: We’ve had some very intensive disagreements about this and come to the conclusion that libraries will never have the staff power that the big corporations have, even if we merged into one single world library. Even then it would be questionable whether we could establish a parallel world (and whether we would even want to). After all, others cater for other target groups – although even in the case of Google Scholar, the target group is quite clearly defined.
Our expertise lies in the respective field that we have licenced, for which we have access. Every higher education institution has different points of focus for its own teaching and research. For this, it ensures very privileged, exclusive access which is used to reflect precisely on what is in the full text or is licenced and what can be accessed by going to the shelves. This is and remains the task.
Although it is also changing. How will things develop, for example, if a very high percentage of publications are published in Open Access and the data becomes freely accessible? There are semantic search engines that are experimenting with this. Examples are YEWNO at Bayerische Staatsbibliothek (Bavarian State Library) or iris.ai, a company that has a headquarters in Prague, among other places. They work a lot with Open Access literature and try to process it differently on a scientific level than before. So in this respect, tasks also change.
Libraries need to reposition themselves if they want to stay in the race. But it’s clear that our core task is first of all to process the material that we have licenced and for which we pay a lot of money in the best possible way. The aim must be that our users, i.e. students or researchers, find the information they need relatively quickly and not after the 30th hit.
One of the ways in which libraries are intrinsically different to the big players lies in how they deal with personal data. The relationship to personal data when using services is diametrically opposed to the offers of the big players, because values such as trustworthiness, transparency etc. play an enormously important role for the services of libraries.
Anna Kasprzik: They use Google relatively often. At the ZBW, we are actually currently analysing the routes via which users enter our research portal. It’s often Google hits. But I don’t see that as a problem because the research portal of a library is only one reuse scenario of metadata that libraries create. You can also make it available for reuse as Linked Open Data. And what’s more: Google uses a lot of this data, so it is already integrated into Google.
And to respond to the other question – we also discussed this in the paper, at least in an early draft. The fact that libraries are publicly funded means that they have a very different set of ethics when dealing with the personal data of users. And this has many advantages, because they don’t constantly try to milk users for data. Libraries simply want to provide the best-prepared information possible. This is a strong moral advantage, which we could utilise to our benefit. But libraries hardly advertise this advantage.
There is also an age-old disagreement about this (which has nothing to do with AI, however): many students or PhD candidates do not realise that, in their everyday work, they are using data that a library has prepared and made available for them. They call up a paper at the university and do not notice that the link has been made available via their library and that the library has paid for it. And then there are two factions: some say that users shouldn’t notice anything – it must all occur as smoothly as possible. Others believe that there should actually be a big fat notice stating “provided by your library” so that people can’t miss it.
Frank Seeliger: Making library work that is reused by third parties visible is a great challenge and must be properly championed. Otherwise, if it is no longer visible, people will start asking why they are giving money to libraries at all: the results are visible, but not who has financed them, and people don’t notice that they are actually commercial products.
Another aspect that we discussed was the issue of transparency and freedom from advertising. We organised a virtual Open Access Week (German) from November 2021 to March 2022. We made video recordings of each ninety-minute session. Then we asked ourselves: Should we use YouTube for publication or the non-commercial video portal of the TIB Leibniz Information Centre for Science and Technology and University Library (TIB AV Portal)? We made a clear-cut decision to use the TIB AV portal and they have accepted us there. We decided in favour of the portal precisely because there are no advertisements, no overlays and no pop-up windows. If we work with discovery tools, we try to advertise the fact that you really don’t get any advertising and reach your goal with your very first hit. Therefore, several aspects differentiate us significantly from commercial providers. We are having that discussion right now; it’s an important difference.
Anna Kasprzik: This is a fundamental issue for me. I say: “no”, or perhaps “yes and no”. What we are doing at the moment via our automation of subject indexing with machine learning methods is an attempt to imitate the intellectual subject indexing one-to-one, just the same way it has always been done. But for me this is only a way for us to get our foot in the door technologically. In the next few years, we will address this and start designing the interplay between human knowledge organisation expertise and machines in a more intelligent way – reorganise it completely. I can imagine that we will not necessarily need to do the intellectual subject indexing in advance in the same way that we are currently doing it. Instead, intelligent search engines can try to index content resources taking the context into account.
But even if they are able to do this from the context ad hoc, those engines require a certain amount of underlying semantic structuring, and this structuring needs to exist in advance. It will therefore always be necessary to prepare information so that the pattern recognition algorithms can access it in the first place. If you merely dive into the raw data, the result is chaos, because the available metadata is fuzzy. You need structuring that pulls the whole thing into sharper focus, even if it only partially accommodates the machine. There are completely different ways of interconnecting search queries and retrieval results. But intelligent search engines still have to have something up their sleeve, and that something is organised knowledge. This knowledge organisation requires human expertise as input at certain points. The question is: at which points?
Frank Seeliger: There is also the opposing view of TIB director Prof. Dr Sören Auer, who says that data collection is overvalued. Certainly also meant as a provocation or simply to test how far one can go. In the future, it may not be necessary to have as many colleagues working in the field of intellectual indexing.
For example, we hold 16,000 graduate theses in the library of the TH Wildau; the entire tables of contents are being scanned and made OCR-readable. The question is: can you classify them according to the Regensburger Verbundklassifikation (RVK, Regensburg Association Classification; a classification scheme for academic libraries), perhaps with the Annif tool? This means that I don’t have to look at each thesis and say, this one belongs in the field of engineering, etc., independently of the study courses in which they were written. Instead: here is the RVK graph, there are the tables of contents, and they are matched according to certain algorithms. This is a different approach from when I, as a subject specialist, look at every work and index it correspondingly – keywords, the Integrated Authority File (GND; a service facilitating the collaborative use and administration of authority data) and so on, running through all the procedures. I see this as a new way of mastering the masses, because a great deal is published and because we have taken over responsibilities that libraries did not use to cover, such as the indexing of articles, i.e. component parts of a bibliographically independent work, in addition to independent works themselves. It’s definitely a great help.
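The matching idea described here – classification graph on one side, scanned tables of contents on the other – can be sketched in a few lines. This is only an illustration of the principle, not Annif's actual algorithm; the RVK class codes and keyword sets below are invented for the example.

```python
# Hedged sketch: score each (hypothetical) RVK class by keyword overlap
# with a thesis' scanned table of contents and suggest the best match.
rvk_classes = {
    "ZL 3200": {"robotics", "automation", "control"},   # invented labels
    "QC 010": {"economics", "markets", "trade"},
}

def suggest_class(table_of_contents: str) -> str:
    """Return the class whose keyword set overlaps most with the ToC text."""
    words = set(table_of_contents.lower().split())
    return max(rvk_classes, key=lambda c: len(rvk_classes[c] & words))

toc = "Chapter 1 Robotics basics Chapter 2 Control and automation of plants"
print(suggest_class(toc))  # → "ZL 3200"
```

A production system would use trained models (TF-IDF, neural backends, ensembles) rather than raw keyword overlap, but the input/output shape – text in, ranked class suggestions out – is the same.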
However, I cannot imagine that humans will no longer intervene in such algorithms at all and provide the pre-structuring according to which they must act. Up to now, a lot of human intervention has been required to trim and optimise these systems so that results are indexed 99% correctly – that’s one objective. This requires control, pre-structuring and inspection of training data, for example with calligraphy, when you check whether a letter has been recognised correctly. Checking and handling by human beings is still necessary.
Anna Kasprzik: Exactly – I mentioned the concept earlier: the “human in the loop”, i.e. that people can be involved at various levels. These can start out very trivially: with the fact that training data or our knowledge organisation systems are generated by humans. Or the fact that you can use automatically generated keywords as suggestions – machine-assisted subject indexing.
There are also concepts such as online learning and active learning. Online learning means that the machine receives feedback relatively consistently from the indexer, as to how good its output was and based on that retraining takes place. Active learning is where the machine can interactively decide at certain points: I now need a person as an oracle for a partial decision. The machine initiates this, saying: “Human, I am pushing a few part-decisions that I need into the queue here – please work through them.” People and machines tend to toss the ball back and forth here, rather than doing it separately in two blocks.
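The active learning pattern described here – the machine deciding when to ask a human – can be sketched minimally as uncertainty sampling: predictions near 0.5 are too ambiguous to auto-index and go into a queue for the indexer. The classifier scores and document ids below are invented placeholders.

```python
# Minimal active-learning sketch: flag low-confidence predictions for a human.
def uncertainty(score: float) -> float:
    """Distance from a confident decision; 1.0 at score 0.5, 0.0 at 0 or 1."""
    return 1.0 - abs(score - 0.5) * 2

def queue_for_human(predictions: dict[str, float], threshold: float = 0.6) -> list[str]:
    """Return ids of documents the machine is too unsure about to auto-index."""
    return sorted(
        doc for doc, score in predictions.items()
        if uncertainty(score) >= threshold
    )

predictions = {"doc-1": 0.95, "doc-2": 0.52, "doc-3": 0.08, "doc-4": 0.45}
print(queue_for_human(predictions))  # doc-2 and doc-4 sit near 0.5 → ask a human
```

In the online-learning variant, the human's answers would then be fed back as new training examples, closing the loop the interview describes.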
Thank you for the interview, Anna and Frank.
In part 2 of the interview on “AI in Academic Libraries” we explore exciting projects regarding the future of chatbots and discrimination through AI.
Part 3 of the interview on “AI in Academic Libraries” focuses on prerequisites and conditions for successful use.
We’ll share the link here as soon as the post is published.
This text has been translated from German.
The post AI in Academic Libraries, Part 1: Areas of Activity, Big Players and the Automation of Indexing first appeared on ZBW MediaTalk.
An interview with Gunay Kazimzade (Weizenbaum Institute for the Networked Society – The German Internet Institute)
Typically, biases mirror all forms of discrimination in our society, such as political, cultural, financial, or sexual discrimination. These are manifested again in the data sets collected and in the structures and infrastructures around data, technology, and society, so that particular data points encode social norms and decision-making behaviour. AI systems trained on those data points exhibit prejudices in various domains and applications.
For instance, facial recognition systems built upon biased data tend to discriminate against people of colour in several computer vision applications. According to research from the MIT Media Lab, accuracy for white men and Black women differs dramatically in vision models. In 2018, Amazon “killed” its hiring system, which had started to eliminate female candidates for engineering and high-level positions. This outcome resulted from the company’s tradition of preferring male candidates to female ones for those particular positions. These examples make clear that AI systems are not objective; they map the human biases we have in society onto the technological level.
Bias is an unavoidable consequence of situated decision-making. Deciding who classifies data, and how, and which data points are included in the system is not new to libraries’ work. Libraries and archives are not just data storage, processing, and access providers. They are critical infrastructures committed to making information available and discoverable, with the express aim of eliminating discriminatory outcomes of those data points.
Imagine a situation where researchers approach the library asking for images to train a face recognition model. The quality and diversity of this data directly impact the results of the research and of any system built upon that data. Diversity in images (YouTube) was recently investigated in the “Gender Shades” study by Joy Buolamwini of the MIT Media Lab. The question here is: could library staff have identified demographic bias in the data sets before the Gender Shades study was published? Probably not.
The right mindset comes from awareness. Awareness means social responsibility and self-determination, framed by critical library skills and subject specialization. Relying on metadata alone is not sufficient to eliminate bias in data collections. Diversity in staffing and critical domain-specific skills and tools are crucial assets for analysing libraries’ digitised collections. Initial and continuous training and evaluation of library staff should be the primary strategy of libraries on the way to detecting, understanding and mitigating biases in library information systems.
Whether it is a developer, user, provider, or another stakeholder, the right mindset starts with the
Therefore, long-term strategy for library information systems management should include
Several research findings suggest ways of making recommendations fairer and of breaking out of the “filter bubbles” created by technology deployers. For recommendations, transparency and explainability are among the main techniques for approaching this problem. Developers should consider the explainability of the suggestions made by the algorithms and make the recommendations justifiable to the user of the system. It should be transparent to the user on which criteria a particular book recommendation was based, and whether it drew on gender, race, or other sensitive attributes. Library and digital infrastructure staff are the main actors in this technology deployment pipeline. They should be conscious of this and encourage decision-makers to deploy technology that includes specific features for explainability and transparency in library systems.
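The explainability idea above can be made concrete in a tiny sketch: each recommendation carries a human-readable reason stating which criteria produced it. The catalogue, titles and subject tags below are invented for illustration; a real system would draw these from the library's metadata.

```python
# Sketch: recommend unread titles and attach the reason for each suggestion.
def recommend(history: list[str], catalogue: dict[str, set[str]]) -> list[tuple[str, str]]:
    """Return (title, reason) pairs for titles sharing subjects with the history."""
    read = set(history)
    read_subjects = {s for t in read for s in catalogue.get(t, set())}
    recs = []
    for title, subjects in catalogue.items():
        if title in read:
            continue
        shared = subjects & read_subjects
        if shared:
            recs.append((title, f"shares subject(s) {sorted(shared)} with your borrowing history"))
    return recs

catalogue = {"Book A": {"ai"}, "Book B": {"ai", "ethics"}, "Book C": {"cooking"}}
for title, reason in recommend(["Book A"], catalogue):
    print(title, "-", reason)
```

The point is the second element of each pair: because the criterion is explicit, a user (or auditor) can immediately check that no sensitive attribute was involved.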
A first “check-up” should verify the quality of the data through mixed quantitative and qualitative experimental methods. In addition, there are several open-access methodologies and tools for fairness checks and bias detection/mitigation in several domains. For instance, AI Fairness 360 is an open-source toolkit that helps to examine, report, and mitigate discrimination and bias in machine learning models throughout the AI application lifecycle.
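One of the standard fairness checks such toolkits provide can be hand-rolled in a few lines: disparate impact, the ratio of favourable-outcome rates between an unprivileged and a privileged group. The sketch below uses invented data; AI Fairness 360 and similar libraries offer this metric (and many more) in production-ready form.

```python
# Minimal hand-rolled disparate impact check on invented outcome data.
def favourable_rate(outcomes: list[int]) -> float:
    """Share of favourable (1) decisions in a group."""
    return sum(outcomes) / len(outcomes)

def disparate_impact(unprivileged: list[int], privileged: list[int]) -> float:
    """Ratio of favourable rates; values well below 1.0 (commonly < 0.8)
    signal potential discrimination against the unprivileged group."""
    return favourable_rate(unprivileged) / favourable_rate(privileged)

# 1 = favourable decision (e.g. record marked "relevant"), 0 = not
group_a = [1, 0, 0, 0, 1, 0]  # unprivileged group: rate 2/6
group_b = [1, 1, 0, 1, 1, 0]  # privileged group: rate 4/6
print(round(disparate_impact(group_a, group_b), 2))  # → 0.5
```

A library could run such a check on, say, automatically assigned subject labels split by the language or origin of the works, before trusting the model's output.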
Another useful tool is “Datasheets for datasets”, intended to document the datasets used for training and evaluating machine learning models; this tool is very relevant in developing metadata for library and archive systems, which can be further used for model training.
Overall, everything starts with the right mindset and awareness on approaching the bias challenge in specific domains.
Further Readings
We were talking to:
Gunay Kazimzade is a Doctoral Researcher in Artificial Intelligence at the Weizenbaum Institute for the Networked Society in Berlin, where she is currently working with the research group “Criticality of AI-based Systems”. She is also a PhD candidate in Computer Science at the Technical University of Berlin. Her main research directions are gender and racial bias in AI, inclusivity in AI, and AI-enhanced education. She is a TEDx speaker, Presidential Award of Youth winner in Azerbaijan and AI Newcomer Award winner in Germany. Gunay Kazimzade can also be found on Google Scholar, ResearchGate and LinkedIn.
Portrait: Weizenbaum Institute©
The post Discrimination Through AI: To What Extent Libraries are Affected and how Staff can Find the Right Mindset first appeared on ZBW MediaTalk.