
Introduction

Creating structured content means producing resources that must follow a well-defined format. The term is closely related to metadata.

For instance, writing this article is creating content. Writing it in an easy-to-parse language focused on structure (Asciidoctor), with explicit metadata (a header with the title, date, description, tags, etc.), and using external tools to transform the source into the publishing format (HTML or PDF in my case) is working with structured content.

For content publishers such as magazines, the benefit of structured content boils down to this:

Create once, publish everywhere.

In this post I would like to talk about the group work that goes into making big content systems. My experience comes from working at MeetIT, a company that is developing a new platform for young students to learn competitive programming, algorithms and data structures.

I will not explain what the platform is or how it looks. I will first explain the format that the content creators need to follow and the revision process, for everyone outside MeetIT. Then I will talk about common mistakes that are caught in the review process. Finally, I will show my workflow for creating this type of content.

Task Description

Here’s the official description provided inside MeetIT for the content format. A package is a folder containing all the data for a programming problem.

Package Format

A problem’s package is a folder following the specified format. The name of the zipped folder should be a 2 to 4 letter abbreviation of the task name. Inside the folder there should be the following files.

Inputs and Outputs

A folder named in, containing one plain text file for each test case. Test cases are divided into packages of tests. Each package has a number. Each test inside a package has a sub-name, which should be a single letter. The file containing a test that belongs to package X and has sub-name Y should be named inXY.txt. For instance, if I want to create the first package of tests, containing 3 test cases, they could be named in1a.txt, in1b.txt, in1c.txt.

A folder named out should contain one file with a model answer for each test case, with the same name but the prefix out instead of in. For the above example, it should contain the files out1a.txt, out1b.txt, out1c.txt.

The package with number 0 is treated as the example package and should contain exactly the tests described in the problem statement.

A package with only one test still needs a sub-name. For instance, if I decide to add a second package containing just one test, it could be named in2a.txt, but in2.txt would be incorrect.
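The naming rule can be checked mechanically. Here is a minimal sketch in C++ of such a check. The struct and function names are hypothetical, and it assumes that package numbers are plain decimal numbers and that sub-names are lowercase letters (the description only says "a single letter").

```cpp
#include <regex>
#include <string>

// Hypothetical helper type: the pieces encoded in a test file name.
struct TestName {
    int package;   // X in inXY.txt
    char subName;  // Y in inXY.txt
};

// Parses names such as "in1a.txt" (prefix "in") or "out1a.txt" (prefix "out").
// Returns false when the name does not follow the format.
bool parseTestName(const std::string& fileName, const std::string& prefix,
                   TestName& result) {
    // The name must start with the expected prefix.
    if (fileName.compare(0, prefix.size(), prefix) != 0) return false;
    // What remains must be a number, one lowercase letter and ".txt".
    // Assumption: at most 6 digits, so the number always fits in an int.
    static const std::regex rest(R"((\d{1,6})([a-z])\.txt)");
    const std::string tail = fileName.substr(prefix.size());
    std::smatch match;
    if (!std::regex_match(tail, match, rest)) return false;
    result.package = std::stoi(match[1].str());
    result.subName = match[2].str()[0];
    return true;
}
```

With this sketch, in1a.txt is accepted for the prefix in, while in2.txt (no sub-name) and 0b.txt (no prefix) are rejected.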

Config File

A text file named config.txt containing a JSON object. It should list the following fields in the following order, one per line, each in the format "name": "value". Lines should be separated by a ,.

  • Task’s title.

    For a task with name abc, it should be given as "title": "abc".

  • A global time limit (in milliseconds) adjusted to a model C++ solution.

    To set a time limit of 1 second on each test, you should write "time_limit": "1000". Notice that currently you can only set a single time limit that will apply to every single test.

  • Memory limit (in Megabytes).

    To set a memory limit of 128MB on each test, you should write "mem_limit": "128". Notice that currently you can only set a single memory limit that will apply to every single test.

  • Comments, containing plain text, explaining the purpose and structure of tests created for this problem.

    To add a comment, explaining that the first test group was handcrafted, you should write "comments": "The First test group was handcrafted.".

For the above example, the correct contents of the config file would be the following. Notice that the last line does not end with a ,.

{
"title": "abc",
"time_limit": "1000",
"mem_limit": "128",
"comments": "The First test group was handcrafted."
}
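If the config file is written by a small program instead of by hand, the comma placement cannot go wrong. The following C++ sketch builds the text shown above. The function names are hypothetical, and it assumes that the title and the comments are plain text without control characters.

```cpp
#include <string>

// Escapes the two characters that would break a JSON string literal.
// Assumption: the text contains no control characters such as newlines.
std::string escapeJson(const std::string& text) {
    std::string escaped;
    for (char c : text) {
        if (c == '"' || c == '\\') escaped += '\\';
        escaped += c;
    }
    return escaped;
}

// Builds the contents of config.txt with the four fields in the required
// order. Every value is written as a string, as in the example above.
std::string makeConfig(const std::string& title, int timeLimitMs,
                       int memLimitMb, const std::string& comments) {
    return std::string("{\n") +
           "\"title\": \"" + escapeJson(title) + "\",\n" +
           "\"time_limit\": \"" + std::to_string(timeLimitMs) + "\",\n" +
           "\"mem_limit\": \"" + std::to_string(memLimitMb) + "\",\n" +
           "\"comments\": \"" + escapeJson(comments) + "\"\n" +
           "}\n";
}
```

Calling makeConfig("abc", 1000, 128, "The First test group was handcrafted.") yields the file contents of the example, followed by a final newline.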

Problem Statement

A text file doc.md containing the problem statement in Markdown, following this sample. (The link was removed from this sentence.)

Model Solutions

A folder named sol containing the source code of all the model solutions. For each programming language, there can be four types of solutions. Each should have the name prefX.Y, where pref is the corresponding prefix for each type of solution, X is a solution number and Y is an appropriate file extension.

  • A model solution with prefix ok. Every solution of this type should pass all the test cases. There should be at least one solution of this type.

  • A solution with prefix tle that always produces a correct answer but does not pass all the tests because of the time limit. Every solution of this type should receive an OK or Time Limit Exceeded verdict on each test.

  • A solution with prefix mre that does not produce an answer for all the tests, but is always correct when it does. No solution of this type should ever receive a Wrong Answer verdict. This prefix is meant mainly for memory-inefficient solutions.

  • A wrong solution with prefix wa. Every solution of this type should score a non-maximum number of points and there should be at least one Wrong Answer verdict.

For instance, folder sol could contain the following text files.

  • ok1.cpp, ok2.cpp, ok1.py, which are correct solutions written in C++ and Python.

  • tle1.cpp, tle1.py, tle2.py, tle1.java, which are correct, but slow solutions written in C++, Python and Java.

  • mre1.py, mre2.py, which are solutions that exceed the memory limit written in Python.

  • wa1.cpp, wa1.java, which are wrong solutions, written in C++ and Java.
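The expectations attached to each prefix can be turned into a check that runs after evaluating a solution on every test. Below is a minimal C++ sketch. The Verdict values and the function names are hypothetical (the real judge may report verdicts differently), and only the rules stated above are encoded.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical verdict set; RuntimeError stands for crashes and
// memory limit problems.
enum class Verdict { Ok, WrongAnswer, TimeLimitExceeded, RuntimeError };

// Returns the leading letters of a solution file name: "tle" for "tle2.py".
std::string solutionPrefix(const std::string& fileName) {
    std::size_t end = 0;
    while (end < fileName.size() &&
           std::isalpha(static_cast<unsigned char>(fileName[end]))) {
        ++end;
    }
    return fileName.substr(0, end);
}

// Checks the verdicts of one solution against the rule for its prefix.
bool meetsExpectation(const std::string& prefix,
                      const std::vector<Verdict>& verdicts) {
    auto count = [&verdicts](Verdict v) {
        return std::count(verdicts.begin(), verdicts.end(), v);
    };
    const std::ptrdiff_t total = static_cast<std::ptrdiff_t>(verdicts.size());
    if (prefix == "ok") return count(Verdict::Ok) == total;
    if (prefix == "tle")  // only OK or Time Limit Exceeded
        return count(Verdict::Ok) + count(Verdict::TimeLimitExceeded) == total;
    if (prefix == "mre") return count(Verdict::WrongAnswer) == 0;
    if (prefix == "wa") return count(Verdict::WrongAnswer) >= 1;
    return false;  // unknown prefix
}
```

For example, a tle solution with verdicts OK and Time Limit Exceeded meets its expectation, while a wa solution that passes every test does not.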

Input Verifier

A text file named ver.X containing the source code of a program that verifies the correctness of a single input file. The extension X should be appropriate to the programming language. This program should read the input of the problem in the same format as model solutions, but instead of producing the correct result it should print to standard output the text OK if the test case is correct or a text describing an error if it is not. Every output that is different from OK, including no output, will be treated as an error. For example, if you want to create a task that reads one integer number with a value between 1 and 1000, you could create a C++ file ver.cpp with the following code.

#include <iostream>
using namespace std;

int main() {
  int a;
  cin >> a;
  // Exactly one of the three messages below is printed.
  if (a < 1) cout << "a is smaller than 1";
  if (a > 1000) cout << "a is bigger than 1000";
  if (a >= 1 && a <= 1000) cout << "OK";
}

Current Methodology

To develop content for the platform, we follow these steps.

  1. The course development team prepares a course. When the content of the course is decided, a list of packages needs to be prepared. The team will write very short descriptions for the problems.

  2. Using the descriptions, the lesson development team prepares the statements, hints and lessons for each problem.

  3. At this point, someone other than the author of the lesson reviews the text.

  4. The package creation team then assigns the statements to people in order to finish a package. That involves programming the official solution, programming wrong solutions, writing an input verifier, generating the testcases, etc. Everything that should be inside the package.

  5. In the end, another person from the package creation team reviews the final package.

Mistakes and Improvements

The reason I’m writing this article is mainly to share some of the problems that I’ve found during the review process.

Most problems arise from not following the format. For example, I have found packages where a sample file was named 0b.txt, when the documentation says that it should be called in0b.txt. Since we are programmers, it is very easy for us to write a short script that checks that all of the input files follow the name format. That is why most of these mistakes, although simple, should never reach the review process. Other mistakes are more subtle. For instance, consider the following input description from a problem statement.

The first line of the input contains a positive integer n.
The next n lines contain an integer between 1 and 1000, both inclusive.

Is it acceptable to write an input verifier that reads the integer n from the input, then reads n more integers and checks that they are in range? Actually, no. The input needs to follow the format exactly.

What if there are more than n integers in the input file? Someone solving the problem may ignore the first line of the input and write a loop that reads the integer on each line until reaching the end of the file. That has to work. What if the last line of the input does not end with a line end character, is that correct? What if a line contains something other than the integer, like whitespace? That should not be valid either, because the statement says that each line contains an integer. Following the description, it is reasonable to read characters until you encounter a line end character and parse them as an integer. It is also reasonable to read the whole input, split it at line ends and read each integer as a string.

Therefore, the input verifier has to follow the format strictly. Luckily, writing a good input verifier is not difficult. In the Workflow Suggestions section I show a simple library that makes the process almost trivial.
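To make "strictly" concrete, here is a hand-written C++ sketch of a verifier for the input description above. The function names are hypothetical. It also makes two assumptions that the statement does not spell out: lines end with a single '\n' character, and n has at most 9 digits.

```cpp
#include <cstddef>
#include <string>

// Reads an integer written with digits only (no sign, no leading zeros)
// that is followed immediately by '\n'. On success, stores the value and
// moves pos past the line end.
bool readLineInt(const std::string& data, std::size_t& pos, long long& value) {
    const std::size_t start = pos;
    value = 0;
    while (pos < data.size() && data[pos] >= '0' && data[pos] <= '9') {
        if (pos - start >= 9) return false;  // assumption: at most 9 digits
        value = value * 10 + (data[pos] - '0');
        ++pos;
    }
    if (pos == start) return false;                            // no digits
    if (pos - start > 1 && data[start] == '0') return false;   // leading zero
    if (pos == data.size() || data[pos] != '\n') return false; // extra text
    ++pos;
    return true;
}

// Returns "OK" or a description of the first problem found.
std::string validate(const std::string& data) {
    std::size_t pos = 0;
    long long n = 0;
    if (!readLineInt(data, pos, n) || n < 1)
        return "line 1 must contain only a positive integer n";
    for (long long i = 1; i <= n; ++i) {
        long long x = 0;
        const std::string line = "line " + std::to_string(i + 1);
        if (!readLineInt(data, pos, x))
            return line + " must contain only an integer";
        if (x < 1 || x > 1000) return line + " is out of range";
    }
    if (pos != data.size()) return "extra data after the last integer";
    return "OK";
}
```

In a real ver.cpp, main would read the whole standard input into a string and print the result of validate. The sketch rejects a missing final line end, trailing spaces and extra lines, which are exactly the cases discussed above.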

Even though working with structured content is great because you create something that is easy to use (in the sense that you can easily build applications and services on top of it), following the format may seem like a burden, a tedious process. As counterintuitive as it may seem, it actually makes things simpler.

Writing short scripts that catch the most common mistakes and automate your tasks is just as easy as writing tools that consume the content.

That is the reason why having a workforce that is capable of programming simple things is very powerful. That is the key to solving the mistakes that have to do with the format and not the quality of the content.
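As an illustration of such a script, the following C++ sketch checks that every input file has a matching output file and the other way around. The function name is hypothetical, and the sketch assumes that the file names of the in and out folders have already been collected into two lists.

```cpp
#include <set>
#include <string>
#include <vector>

// Given the file names found in in/ and out/, returns one message per
// problem: inputs with a wrong prefix, inputs without an output, and
// outputs without an input.
std::vector<std::string> findPairingProblems(
        const std::vector<std::string>& inNames,
        const std::vector<std::string>& outNames) {
    std::vector<std::string> problems;
    std::set<std::string> unmatched(outNames.begin(), outNames.end());
    for (const std::string& name : inNames) {
        if (name.compare(0, 2, "in") != 0) {
            problems.push_back(name + ": input name does not start with 'in'");
            continue;
        }
        // inXY.txt must be paired with outXY.txt.
        const std::string expected = "out" + name.substr(2);
        if (unmatched.erase(expected) == 0)
            problems.push_back(name + ": missing " + expected);
    }
    // Whatever is left in the set has no input (reported in sorted order).
    for (const std::string& name : unmatched)
        problems.push_back(name + ": output without a matching input");
    return problems;
}
```

An empty result means that the two folders are consistent. A sample file named 0b.txt, like the one mentioned earlier, is reported in the first pass.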

List of typical mistakes

In this section I will provide a reference of the most common problems encountered during the review process.

  • Statement

    • Spelling and grammar errors.

    • No time and memory limit inside the statement.

    • The sections should be (in this precise order): # the title, ## Constraints, ## Input, ## Output and ## Examples.

    • Inconsistent input tokens inside examples: input or Input, with or without a blank line between tests. Use input and output.

    • Incorrect use of LaTeX.

      • $...$ instead of $\ldots$.

      • Numbers left outside LaTeX.

      • $i-th$ instead of $i$-th.

  • Configuration File

    • Even though the time limit can be guessed from the complexity of the algorithm, it is sometimes set before the official solution (and the wrong solutions) are created, and it ends up too big or too small. The same applies to the memory limit, although it is not as important in general.

    • The comments section is usually left empty. There should be information about the different groups of testcases and which were handcrafted.

  • Binaries and code

    • Binaries end up inside the zip file.

    • Usage of bits/stdc++.h in official solutions, which slows compilation and is a non-standard header specific to GCC.

    • Poor formatting of the code. There are tools that format the code automatically.

  • Input verifier

    • Does not check every guarantee of the statement, for example that the given graph has no multi-edges (when the statement says there are none).

    • Does not check the end of file.

    • Is not strict enough.

  • Solutions

    • Only the model solution is provided (no wrong or time limit exceeded solutions).

    • The output generated by the official solution does not end with an end of line.

    • The problem admits several solutions but a custom checker is not implemented.

  • Testcases

    • For review purposes, it may be better to include the generator. The person in charge of reviewing the package shall remove the unneeded files.

    • Usually, there is no testcase of maximum input size (or even close to it).

    • Sometimes handcrafted corner cases are needed.
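Several items of this list can be detected automatically. As one example, the following C++ sketch checks the order of the sections of a statement. The function names are hypothetical, and the sketch assumes that the statement uses '\n' line ends and that no other line starts with "# " or "## ".

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Collects the level 1 and level 2 Markdown headings of a statement.
std::vector<std::string> mainHeadings(const std::string& markdown) {
    std::vector<std::string> result;
    std::istringstream lines(markdown);
    std::string line;
    while (std::getline(lines, line)) {
        if (line.compare(0, 2, "# ") == 0 || line.compare(0, 3, "## ") == 0)
            result.push_back(line);
    }
    return result;
}

// Checks that the headings are, in this order: the title, Constraints,
// Input, Output and Examples.
bool hasExpectedSections(const std::string& markdown) {
    const std::vector<std::string> found = mainHeadings(markdown);
    const std::vector<std::string> expected = {
        "## Constraints", "## Input", "## Output", "## Examples"};
    if (found.size() != expected.size() + 1) return false;
    // The first heading must be the title, written with a single '#'.
    if (found[0].compare(0, 2, "# ") != 0) return false;
    for (std::size_t i = 0; i < expected.size(); ++i) {
        if (found[i + 1] != expected[i]) return false;
    }
    return true;
}
```

The function would be called with the contents of doc.md. It only looks at headings, so it says nothing about the text inside each section.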

Workflow Suggestions

In this section I will share the workflow I use for creating packages. I hope it helps other people develop their own workflows.

Workflow File

If you take the package description and write everything you need to do in a short file, you won’t forget anything. It also helps if you plan to make multiple packages but spend some days in between doing other things: without such a file, when you come back, making a package feels like something new again.

My workflow file
___________________________________________________________________________
TYPICAL WORKFLOW:

1.- Select task and make it visible in azure dev ops.
2.- Create package folder (you can copy another package and modify it)
3.- Put the name of the folder inside the Makefile so that your commands work
4.- Write task statement and lesson
	Use codiMD for both and put links in drive's file
	[erased-url]
	Copy the task statement into doc.md
5.- Set time and memory values for package/config.txt file.
	These will be used when evaluating the solutions.

6.- Do in any order:
	6.1 - Write ideal solution -> package/sol/ok1.cpp
	6.2 - Write the other solutions -> package/sol/wa1.cpp for example
	6.3 - Write checker -> package/check.cpp
		Usually not necessary because the checker only checks that both files are identical.
	6.4 - Write input verifier -> package/inver.cpp
	6.5 - Write input generators inside gen/ -> gen/gen1.cpp for example
	6.6 - Fill gen/tests with information to generate the testcases
			on each line something like:
			gen1 3b 10 10 10 -> generate in3b.txt using gen1 with params 10 10 10
	6.7 - Write particular testcases (like examples and corner cases if any)
			inside gen/raw -> gen/raw/in1a.txt for example


At any time (doing it inside the folder with the makefile):
	Compile everything with 'make' or 'make binaries'
	Remove the input files with 'make inputClean'
		(do it if you have copied the folder from another package)
	Create or update your input files with 'make input'
	Check the input files with 'make verif'
	Create the expected outputs (uses ok1) with 'make output'
	Evaluate every code inside sol with 'make eval'
		(uses config.txt for the time and memory limits)

Once everything is tested:
	Create your zipped package with 'make zip' and send it for reviewing
	You can remove the binary files with 'make clean'

___________________________________________________________________________
NOTES:
You can put .h library files used in the generators inside include and
include them using #include "mylibrary.h", for example to use the testlib
library.


For generators, params can be accessed with argv[], the way adopted by
testlib. You can change gentests.sh so that they are given through
standard input using a pipe. Just replace
	'./bin/gen/$gen $args >"$t_out"'		with
	'echo $args | ./bin/gen/$gen >"$t_out"'


Package (and therefore zip) should contain:
	+in
		-input files inXY.txt
	+out
		-output files outXY.txt
	+sol
		-solutions
	check.cpp
	config.txt
	doc.md
	inver.cpp

Helpful Tools

I already talked about input verifiers. I am currently using testlib, an open source library meant for creating the types of packages we do. It lets you write simple input verifiers, checkers and testcase generators. I encourage everyone to use it. Why?

  • Because it is very helpful.

  • Because redeveloping something that already exists is a waste of human resources.

Here’s an example input verifier taken from their repository.

An example of an Input Verifier
/**
 * Validates that input contains the only integer between 1 and 100, inclusive.
 * Also validates that file ends with EOLN and EOF.
 */

#include "testlib.h"

using namespace std;

int main(int argc, char* argv[])
{
    registerValidation(argc, argv);
    // Reads an integer in the range [1, 100]
    inf.readInt(1, 100, "n");
    // Reads an end of line following the default system format ("\n" or "\r\n")
    inf.readEoln();
    // Read the end of file
    inf.readEof();

    // If any of the above calls encounter something different,
    // the program terminates and prints a helpful message.

    return 0;
}

Automation and Check Scripts

I created a Makefile to call the most common tasks. For instance, I can compile every source file inside the package by typing make binaries or simply make.

Task Automation Makefile
CXXFLAGS=-std=c++11 -Wall -O2 -I include

# Package directory
TARGET=tas9

GENERATORS=$(patsubst %.cpp,bin/%,$(wildcard $(TARGET)/gen/*.cpp))
SOLUTIONS=$(patsubst %.cpp,bin/%,$(wildcard $(TARGET)/sol/*.cpp))

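# These targets are commands, not files, so they always run when requested.
.PHONY: binaries input verif output eval zip inputClean clean seal unSeal

binaries: $(GENERATORS) $(SOLUTIONS) bin/$(TARGET)/inver bin/$(TARGET)/check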


bin/$(TARGET)/gen/%: $(TARGET)/gen/%.cpp
	@mkdir -p bin bin/$(TARGET)/gen
	g++ $(CXXFLAGS) $< -o $@

bin/%: %.cpp
	@mkdir -p bin/$(TARGET)/sol
	g++ $(CXXFLAGS) $< -o $@

input: binaries
	./0-Scripts/gentests.sh $(TARGET)

verif: binaries
	./0-Scripts/verify.sh $(TARGET)

output: binaries
	./0-Scripts/genouts.sh $(TARGET)

eval: binaries
	./0-Scripts/runtests.sh $(TARGET)

zip:
	@rm -f	"9-zips/$(TARGET).zip"
	zip -r	"9-zips/$(TARGET).zip"	\
			"$(TARGET)/check.cpp"	\
			"$(TARGET)/inver.cpp"	\
			"$(TARGET)/doc.md"		\
			"$(TARGET)/config.txt"	\
			"$(TARGET)/in"			\
			"$(TARGET)/out"			\
			"$(TARGET)/sol"


inputClean:
	rm -rf "$(TARGET)/in" "$(TARGET)/out"

clean:
	rm -rf bin

seal:
	chmod 444 "$(TARGET)/"*
	chmod 555 "$(TARGET)" "$(TARGET)/in" "$(TARGET)/out" "$(TARGET)/gen" "$(TARGET)/sol"

unSeal:
	chmod 664 "$(TARGET)/"*
	chmod 775 "$(TARGET)" "$(TARGET)/in" "$(TARGET)/out" "$(TARGET)/gen" "$(TARGET)/sol"

I write the complex tasks in separate scripts that are called from the Makefile. For example, this short script uses the correct solution to generate the expected output.

0-Scripts/genouts.sh, a short script.
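#!/bin/sh
# Regenerates the expected outputs of a package using the model solution ok1.
TARGET="$1"
# -f: do not fail when the out folder does not exist yet
rm -rf "$TARGET/out"
mkdir -p "$TARGET/out" || exit 1
for t_in in "$TARGET/in/"*.txt
do
	# in/inXY.txt -> out/outXY.txt
	t_out="$TARGET/out/out${t_in#$TARGET/in/in}"
	echo "generating $t_out..."
	./"bin/$TARGET/sol/ok1" < "$t_in" > "$t_out"
done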

I would like to point out that writing all these scripts may seem to take a lot of time, but it does not. First, I usually do things lazily, that is, when I need them. Hence, I do not need to do a lot of work just to start working. Second, compared to the amount of time it takes to produce a package, the amount of time required to write a simple script is negligible. (It takes some time the first time you do something like this, but you learn how to do it for all of your projects.) Third, if you are going to prepare at least three packages, the time invested really pays off.

Folder structure

As you might already have guessed, I work inside a terminal. My folder structure looks like this:

My folder structure
useredsa@Device001:~/Documents/meetIT/MVP$ ls
0-Scripts  bitw               farm     Makefile  svar  tas4  tas8
5-Lessons  bops               flsp     nocc      tas1  tas5  tas9
9-zips     cdig               include  sins      tas2  tas6  twob
bin        documentation.txt  lovc     sses      tas3  tas7  workflow.txt

I keep one folder per package and some general folders like 0-Scripts, 5-Lessons or 9-zips (where I store the compressed packages). I also have a folder where I keep the binaries, bin, and my Makefile.

There are two tips I would like to share. The first one is to change the permissions of your package folder once you finish working with it. I do this to avoid erasing anything by accident (even though the files eventually get uploaded to MeetIT’s servers). To disallow or allow modification, I simply type make seal or make unSeal, respectively.

The second one is to write a command to compress your files (usually into a zip file). Most people think the only way to create a zip file is to right click on a folder and compress it with everything inside. However, I usually do not want to compress everything inside a package folder. I sometimes have auxiliary files that must not end up in the package. Having a command that compresses just what is needed is very useful. Fixing any mistakes after the review process and generating the zip file again is also simpler.

The rule in my Makefile for compressing a package in a zip
zip:
	@rm -f	"9-zips/$(TARGET).zip"
	zip -r	"9-zips/$(TARGET).zip"	\
			"$(TARGET)/check.cpp"	\
			"$(TARGET)/inver.cpp"	\
			"$(TARGET)/doc.md"		\
			"$(TARGET)/config.txt"	\
			"$(TARGET)/in"			\
			"$(TARGET)/out"			\
			"$(TARGET)/sol"