OLAC Record: Word Importance Dataset

OLAC Record
oai:lindat.mff.cuni.cz:11234/1-5520

Metadata

Title: Word Importance Dataset

Bibliographic Citation: http://hdl.handle.net/11234/1-5520

Creator: Osuský, Adam

Creator: Javorský, Dávid

Date (W3CDTF): 2024-07-15T14:35:54Z

Date Available: 2024-07-15T14:35:54Z

Description: This dataset comprises a corpus of 50 text contexts, each about 60 words long, drawn from five distinct domains. Multiple annotators evaluated each context, selecting and ranking its most important words (up to 10% of each text) by perceived significance. The annotators followed specific guidelines to keep word selection and ranking consistent. For further details, please refer to the cited source.

rankings_task.csv: This CSV describes the contexts to be annotated:
- id: A unique identifier for each task.
- content: The context to be ranked.

rankings_ranking.csv: This CSV holds the ranking information for the assignments. It contains four columns:
- id: A unique identifier for each ranking entry.
- score: The score assigned to the entry.
- word_order: A JSON value giving the positions of the words an annotator selected, in the order the annotator ranked them.
- assignment_id: A reference ID linking to the assignments.

rankings_assignment.csv: This CSV tracks the completion status of tasks by users. It contains four columns:
- id: A unique identifier for each assignment entry.
- is_completed: A binary indicator (1 for completed, 0 for not completed).
- task_id: A reference ID linking to the tasks.
- user_id: The identifier of the user who should complete the task (rank the words).

Known Issues: Each annotator was intended to rank each context only once. However, due to a bug in the deployment of the annotation tool, some entries may be duplicated. Users of this dataset should be aware of this issue and verify the uniqueness of the annotations where necessary.

This dataset is part of the work for a bachelor thesis: OSUSKÝ, Adam. Predicting Word Importance Using Pre-Trained Language Models. Bachelor thesis, supervisor Javorský, Dávid. Prague: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics, 2024.
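
The known-issues note in the description asks users to check that each annotator ranked each context only once. A minimal sketch of one way to do that is given here, using only the Python standard library. It is not part of the dataset or the thesis. The helper names (read_rows, unique_rankings) are hypothetical, the sample rows are invented toy data, and the code assumes that the CSV files carry a header row with the column names listed in the description and that word_order holds a JSON list. Keeping the ranking with the lowest id for a duplicated (user_id, task_id) pair is also an assumption; the record does not say which duplicate is canonical.

```python
import csv
import io
import json

# Invented toy rows that follow the column layout given in the description.
ASSIGNMENTS = """id,is_completed,task_id,user_id
1,1,10,100
2,1,10,100
3,1,11,100
"""

RANKINGS = """id,score,word_order,assignment_id
1,0.5,"[3, 7]",1
2,0.5,"[3, 7]",2
3,0.8,"[1]",3
"""


def read_rows(text):
    """Parse CSV text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def unique_rankings(assignments, rankings):
    """Keep one ranking per (user_id, task_id) pair.

    Assumption: when a pair has several rankings, the one with the
    lowest numeric id is kept.
    """
    by_assignment = {row["id"]: row for row in assignments}
    kept = {}
    for row in sorted(rankings, key=lambda r: int(r["id"])):
        assignment = by_assignment.get(row["assignment_id"])
        if assignment is None:
            continue  # ranking refers to an assignment that is not listed
        key = (assignment["user_id"], assignment["task_id"])
        if key not in kept:
            kept[key] = {
                "ranking_id": row["id"],
                "user_id": assignment["user_id"],
                "task_id": assignment["task_id"],
                "score": row["score"],
                "word_order": json.loads(row["word_order"]),
            }
    return list(kept.values())


for row in unique_rankings(read_rows(ASSIGNMENTS), read_rows(RANKINGS)):
    print(row["user_id"], row["task_id"], row["word_order"])
```

To run this against the real files, the toy strings could be replaced by the contents of rankings_assignment.csv and rankings_ranking.csv. Joining through assignment_id catches duplicates at either level: two assignments for the same user and task, or two rankings for the same assignment.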

Identifier (URI): http://hdl.handle.net/11234/1-5520

Language: English

Language (ISO639): eng

Publisher: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)

Rights: Creative Commons - Attribution 4.0 International (CC BY 4.0)

Rights: http://creativecommons.org/licenses/by/4.0/

Subject: word importance

ranking

importance ranking

Type: corpus

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University

Description: http://www.language-archives.org/archive/lindat.mff.cuni.cz

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:lindat.mff.cuni.cz:11234/1-5520

DateStamp: 2024-07-15

GetRecord: OAI-PMH request for simple DC format

Search Info
Citation: Osuský, Adam; Javorský, Dávid. 2024. Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL).
Terms: area_Europe country_GB dcmi_Text iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:lindat.mff.cuni.cz:11234/1-5520
Up-to-date as of: Mon Jun 16 1:08:32 EDT 2025
