CSE320/hw1-doc/README.md

1197 lines
59 KiB
Markdown
Raw Permalink Normal View History

2022-04-16 11:33:13 -04:00
# Homework 1 - CSE 320 - Spring 2022
#### Professor Eugene Stark
### **Due Date: Friday 02/18/2022 @ 11:59pm**
**Read the entire doc before you start**
## Introduction
In this assignment, you will implement functions for parsing JSON input
and building a data structure to represent its contents and for traversing
the data structure and producing JSON output.
You will use these functions to implement a command-line utility
(called `argo`)
which can validate JSON input and transform JSON input into JSON output
in a "canonical" form.
The goal of this homework is to familiarize yourself with C programming,
with a focus on input/output, bitwise manipulations, and the use of pointers.
For all assignments in this course, you **MUST NOT** put any of the functions
that you write into the `main.c` file. The file `main.c` **MUST ONLY** contain
`#include`s, local `#define`s and the `main` function (you may of course modify
the `main` function body). The reason for this restriction has to do with our
use of the Criterion library to test your code.
Beyond this, you may have as many or as few additional `.c` files in the `src`
directory as you wish. Also, you may declare as many or as few headers as you wish.
Note, however, that header and `.c` files distributed with the assignment base code
often contain a comment at the beginning which states that they are not to be
modified. **PLEASE** take note of these comments and do not modify any such files,
as they will be replaced by the original versions during grading.
> :scream: Array indexing (**'A[]'**) is not allowed in this assignment. You
> **MUST USE** pointer arithmetic instead. All necessary arrays are declared in
> the `global.h` header file. You **MUST USE** these arrays. **DO NOT** create
> your own arrays. We **WILL** check for this.
> :nerd: Reference for pointers: [https://beej.us/guide/bgc/html/#pointers](https://beej.us/guide/bgc/html/#pointers).
# Getting Started
Fetch base code for `hw1` as described in `hw0`. You can find it at this link:
[https://gitlab02.cs.stonybrook.edu/cse320/hw1](https://gitlab02.cs.stonybrook.edu/cse320/hw1).
**IMPORTANT: 'FETCH', DO NOT 'CLONE'.**
Both repos will probably have a file named `.gitlab-ci.yml` with different contents.
Simply merging these files will cause a merge conflict. To avoid this, we will
merge the repos using a flag so that the `.gitlab-ci.yml` found in the `hw1`
repo will replace the `hw0` version. To merge, use this command:
```
git merge -m "Merging HW1_CODE" HW1_CODE/master --strategy-option=theirs
```
> :scream: Based on past experience, many students will either ignore the above command or forget
> to use it. The result will be a **merge conflict**, which will be reported by git.
> Once a merge conflict has been reported, it is essential to correct it before committing
> (or to abort the merge without committing -- use `git merge --abort` and go back and try again),
> because git will have inserted markers into the files involved indicating the locations of the
> conflicts, and if you ignore this and commit anyway, you will end up with corrupted files.
> You should consider it important to read up at an early stage on merge conflicts with git and
> how to resolve them properly.
Here is the structure of the base code:
<pre>
.
├── .gitlab-ci.yml
└── hw1
├── .gitignore
├── hw1.sublime-project
├── include
│   ├── argo.h
│   ├── debug.h
│   └── global.h
├── lib
│   └── argo.a
├── Makefile
├── rsrc
│   ├── numbers.json
│   ├── package-lock.json
│   └── strings.json
├── src
│   ├── argo.c
│   ├── const.c
│   ├── main.c
│   └── validargs.c
├── test_output
│   └── .git_keep
└── tests
├── basecode_tests.c
└── rsrc
└── strings_-c.json
</pre>
- The `.gitlab-ci.yml` file is a file that specifies "continuous integration" testing
to be performed by the GitLab server each time you push a commit. Usually it will
be configured to check that your code builds and runs, and that any provided unit tests
are passed. You are free to change this file if you like.
> :scream: The CI testing is for your own information; it does not directly have
> anything to do with assignment grading or whether your commit has been properly
> pushed to the server. If some part of the testing fails, you will see the somewhat
> misleading message "commit failed" on the GitLab web interface.
> This does **not** mean that "your attempt to commit has failed" or that "your commit
> didn't get pushed to the server"; the very fact that the testing was triggered at
> all means that you successfully pushed a commit. Rather, it means that "the CI tests
> performed on a commit that you pushed did not succeed". The purpose of the tests are
> to alert you to possible problems with your code; if you see that testing has failed
> it is worth investigating why that has happened. However, the tests can sometimes
> fail for reasons that are not your fault; for example, the entire CI "runner" system
> may fail if someone submits code that fills up the system disk. You should definitely
> try to understand why the tests have failed if they do, but it is not necessary to be
> overly obsessive about them.
- The `hw1.sublime-project` file is a "project file" for use by the Sublime Text editor.
It is included to try to help Sublime understand the organization of the project so that
it can properly identify errors as you edit your code.
- The `Makefile` is a configuration file for the `make` build utility, which is what
you should use to compile your code. In brief, `make` or `make all` will compile
anything that needs to be, `make debug` does the same except that it compiles the code
with options suitable for debugging, and `make clean` removes files that resulted from
a previous compilation. These "targets" can be combined; for example, you would use
`make clean debug` to ensure a complete clean and rebuild of everything for debugging.
- The `include` directory contains C header files (with extension `.h`) that are used
by the code. Note that these files often contain `DO NOT MODIFY` instructions at the beginning.
You should observe these notices carefully where they appear.
- The `src` directory contains C source files (with extension `.c`).
- The `tests` directory contains C source code (and sometimes headers and other files)
that are used by the Criterion tests.
- The `rsrc` directory contains some samples of data files that you can use for
testing purposes.
- The `test_output` directory is a scratch directory where the Criterion tests can
put output files. You should not commit any files in this directory to your
`git` repository.
- The `lib` directory contains a library with binaries for my functions
`argo_read_value()` and `argo_write_value()`. As discussed below, by commenting out
the stubs for these functions in `argo.c` you can arrange for my versions to be
linked with your code, which may help you to get a jump start on understanding
some things.
## A Note about Program Output
What a program does and does not print is VERY important.
In the UNIX world stringing together programs with piping and scripting is
commonplace. Although combining programs in this way is extremely powerful, it
means that each program must not print extraneous output. For example, you would
expect `ls` to output a list of files in a directory and nothing else.
Similarly, your program must follow the specifications for normal operation.
One part of our grading of this assignment will be to check whether your program
produces EXACTLY the specified output. If your program produces output that deviates
from the specifications, even in a minor way, or if it produces extraneous output
that was not part of the specifications, it will adversely impact your grade
in a significant way, so pay close attention.
> :scream: Use the debug macro `debug` (described in the 320 reference document in the
> Piazza resources section) for any other program output or messages you many need
> while coding (e.g. debugging output).
# Part 1: Program Operation and Argument Validation
In this part of the assignment, you will write a function to validate the arguments
passed to your program via the command line. Your program will treat arguments
as follows:
- If no flags are provided, you will display the usage and return with an
`EXIT_FAILURE` return code.
- If the `-h` flag is provided, you will display the usage for the program and
exit with an `EXIT_SUCCESS` return code.
- If the `-v` flag is provided, then the program will read data from standard input
(`stdin`) and validate that it is syntactically correct JSON. If so, the program
exits with an `EXIT_SUCCESS` return code, otherwise the program exits with an
`EXIT_FAILURE` return code. In the latter case, the program will print to
standard error (`stderr`) an error message describing the error that was discovered.
No other output is produced.
- If the `-c` flag is provided, then the program performs the same function as
described for `-v`, but after validating the input, the program will also output
to standard output (`stdout`) a "canonicalized" version of the input.
"Canonicalized" means that the output is in a standard form in which possibilities
for variation have been eliminated. This is described in more detail below.
Unless `-p` has also been specified, then the produced output contains **no whitespace**
(except within strings that contain whitespace characters).
- If the `-p` flag is provided, then the `-c` flag must also have been provided.
In this case, newlines and spaces are used to format the canonicalized output
in a more human-friendly way. See below for the precise requirements on where
this whitespace must appear. The `INDENT` is an optional nonnegative integer argument
that specifies the number of additional spaces to be output at the beginning of a line
for each increase in indentation level. The format of this argument must be
the same as for a nonnegative integer number in the JSON specification.
If `-p` is provided without any `INDENT`, then a default value of 4 is used.
Note that the program reads data from `stdin` and writes transformed data
to `stdout`. Any other printout, such as diagnostic messages produced by the
program, are written to `stderr`. If the program runs without error, then it
will exit with the `EXIT_SUCCESS` status code; if any error occurs during the
execution of the program, then it will exit with the `EXIT_FAILURE` status code.
> :nerd: `EXIT_SUCCESS` and `EXIT_FAILURE` are macros defined in `<stdlib.h>` which
> represent success and failure return codes respectively.
> :nerd: `stdin`, `stdout`, and `stderr` are special I/O "streams", defined
> in `<stdio.h>`, which are automatically opened at the start of execution
> for all programs, do not need to be reopened, and (almost always) should not
> be closed.
The usage scenarios for this program are described by the following message,
which is printed by the program when it is invoked without any arguments:
<pre>
USAGE: bin/argo [-h] [-c|-v] [-p|-p INDENT]
-h Help: displays this help menu.
-v Validate: the program reads from standard input and checks whether
it is syntactically correct JSON. If there is any error, then a message
describing the error is printed to standard error before termination.
No other output is produced.
-c Canonicalize: once the input has been read and validated, it is
re-emitted to standard output in 'canonical form'. Unless -p has been
specified, the canonicalized output contains no whitespace (except within
strings that contain whitespace characters).
-p Pretty-print: This option is only permissible if -c has also been specified.
In that case, newlines and spaces are used to format the canonical output
in a more human-friendly way. For the precise requirements on where this
whitespace must appear, see the assignment handout.
The INDENT is an optional nonnegative integer argument that specifies the
number of additional spaces to be output at the beginning of a line for each
for each increase in indentation level. If no value is specified, then a
default value of 4 is used.
</pre>
The square brackets indicate that the enclosed argument is optional.
The `-c|-v` means that one of `-c` or `-v` may be specified.
The `-p|-p INDENT` means that `-p` may be specified alone, or with an optional
additional argument `INDENT`.
A valid invocation of the program implies that the following hold about
the command-line arguments:
- All "positional arguments" (`-h`, `-c`, or `-v`) come before any optional arguments
(`-p`).
The optional arguments (well, there is only one) may come in any order after the positional ones.
- If the `-h` flag is provided, it is the first positional argument after
the program name and any other arguments that follow are ignored.
- If the `-h` flag is *not* specified, then exactly one of `-v` or `-c`
must be specified.
- If `-p` is given, then it might or might not be followed by an `INDENT` argument.
If the `INDENT` argument is present, then it must represent a nonnegative integer
in the format allowed for integer numbers in the JSON specification.
For example, the following are a subset of the possible valid argument
combinations:
- `$ bin/argo -h ...`
- `$ bin/argo -v`
- `$ bin/argo -c -p`
- `$ bin/argo -c -p 8`
> :scream: The `...` means that all arguments, if any, are to be ignored; e.g.
> the usage `bin/argo -h -x -y BLAHBLAHBLAH -z` is equivalent to `bin/argo -h`.
Some examples of invalid combinations would be:
- `$ bin/argo -p 1 -c`
- `$ bin/argo -v -c`
- `$ bin/argo -v -p 5`
- `$ bin/argo -z 20`
> :scream: You may use only "raw" `argc` and `argv` for argument parsing and
> validation. Using any libraries that parse command line arguments (e.g.
> `getopt`) is prohibited.
> :scream: Any libraries that help you parse strings are prohibited as well
> (`string.h`, `ctype.h`, etc). The use of `atoi`, `scanf`, `fscanf`, `sscanf`,
> and similar functions is likewise prohibited. *This is intentional and
> will help you practice parsing strings and manipulating pointers.*
> :scream: You **MAY NOT** use dynamic memory allocation in this assignment
> (i.e. `malloc`, `realloc`, `calloc`, `mmap`, etc.). There is one function
> (`argo_append_char()`) provided for you that does the dynamic allocation
> required while accumulating the characters of a string or numeric literal.
> This function is in the file `const.c`, which you are not to modify.
> :nerd: Reference for command line arguments: [https://beej.us/guide/bgc/html/#command-line-arguments](https://beej.us/guide/bgc/html/#command-line-arguments).
**NOTE:** The `make` command compiles the `argo` executable into the `bin` folder.
All commands from here on are assumed to be run from the `hw1` directory.
### **Required** Validate Arguments Function
In `global.h`, you will find the following function prototype (function
declaration) already declared for you. You **MUST** implement this function
as part of the assignment.
```c
int validargs(int argc, char **argv);
```
The file `validargs.c` contains the following specification of the required behavior
of this function:
```c
/**
* @brief Validates command line arguments passed to the program.
* @details This function will validate all the arguments passed to the
* program, returning 0 if validation succeeds and -1 if validation fails.
* Upon successful return, the various options that were specified will be
* encoded in the global variable 'global_options', where it will be
* accessible elsewhere in the program. For details of the required
* encoding, see the assignment handout.
*
* @param argc The number of arguments passed to the program from the CLI.
* @param argv The argument strings passed to the program from the CLI.
* @return 0 if validation succeeds and -1 if validation fails.
* @modifies global variable "global_options" to contain an encoded representation
* of the selected program options.
*/
```
> :scream: This function must be implemented as specified as it will be tested
> and graded independently. **It should always return -- the USAGE macro should
> never be called from validargs.**
The `validargs` function should return -1 if there is any form of failure.
This includes, but is not limited to:
- Invalid number of arguments (too few or too many).
- Invalid ordering of arguments.
- A missing parameter to an option that requires one [doesn't apply to the
current assignment, since the parameter to `-p` is optional].
- Invalid parameter. A numeric parameter specfied with `-p` is invalid if
it does not conform to the format of a nonnegative integer as required by
the JSON specification.
The `global_options` variable of type `int` is used to record the mode
of operation (i.e. encode/decode) of the program and associated parameters.
This is done as follows:
- If the `-h` flag is specified, the most significant bit (bit 31) is 1.
- If the `-v` flag is specified, the second-most significant bit (bit 30)
is 1.
- If the `-c` flag is specified, the third-most significant bit (bit 29)
is 1.
- If the `-p` flag is specified, the fourth-most significant bit (bit 28)
is 1.
- The least significant byte (bits 7 - 0) records the number of spaces of
indentation per level specified with `-p`, or the default value (4)
if no value was specified with `-p`. If `-p` was not specified at all,
then this byte should be 0.
If `validargs` returns -1 indicating failure, your program must call
`USAGE(program_name, return_code)` and return `EXIT_FAILURE`.
**Once again, `validargs` must always return, and therefore it must not
call the `USAGE(program_name, return_code)` macro itself.
That should be done in `main`.**
If `validargs` sets the most-significant bit of `global_options` to 1
(i.e. the `-h` flag was passed), your program must call `USAGE(program_name, return_code)`
and return `EXIT_SUCCESS`.
> :nerd: The `USAGE(program_name, return_code)` macro is already defined for you
> in `argo.h`.
If validargs returns 0, then your program must read input data from `stdin`
and (depending on the options supplied) write output data to `stdout`.
Upon successful completion, your program should exit with exit status `EXIT_SUCCESS`;
otherwise, in case of an error it should exit with exit status `EXIT_FAILURE`.
Unless the program has been compiled for debugging (using `make debug`),
in a successful run that exits with `EXIT_SUCCESS` no other output may be produced
by the program. In an unsuccessful run in which the program exits with `EXIT_FAILURE`
the program should output to `stderr` a one-line diagnostic message that indicates
the reason for the failure. The program must not produce any other output than this
unless it has been compiled for debugging.
> :nerd: Remember `EXIT_SUCCESS` and `EXIT_FAILURE` are defined in `<stdlib.h>`.
> Also note, `EXIT_SUCCESS` is 0 and `EXIT_FAILURE` is 1.
### Example validargs Executions
The following are examples of the setting of `global_options` and the
other global variables for various command-line inputs.
Each input is a bash command that can be used to invoke the program.
- **Input:** `bin/argo -h`. **Setting:** `global_options=0x80000000`
(`help` bit is set, other bits clear).
- **Input:** `bin/argo -v `. **Setting:** `global_options=0x40000000`
(mode is "validate").
- **Input:** `bin/argo -c -p 2`. **Setting:** `global_options=0x30000002`
(mode is "canonicalize", "pretty-print" has been specified with
indentation increment 2).
- **Input:** `bin/argo -p 2 -c`. **Setting:** `global_options=0x0`.
This is an error case because the specified argument ordering is invalid
(`-p` is before `-c`). In this case `validargs` returns -1, leaving
`global_options` unset.
# Part 2: Overview of the JSON Specification
JSON ("JavaScript Object notation") is a standard format for data interchange
that is now commonly used in many areas of computing.
It was designed to be extremely simple to generate and parse and it in fact
achieves these goals: JSON syntax is about as simple as it gets for a
computer language that is actually used in the real world.
The syntax of JSON is defined by an
[ECMA standard](ECMA-404_2nd_edition_december_2017.pdf).
A summary that omits the scarier language from the standard document is given at
[www.json.org](https://www.json.org/json-en.html).
Most likely, you will only need to refer to this summary, but the full standard
document is here if you want to look at it.
In order to understand the JSON syntax specification, you need to be able to
read the "railroad diagrams" that are used to formally specify it.
These diagrams are actually a graphical version of a *context-free grammar*,
which is a standard tool used for formally specifying all kinds of computer
languages. Actually, the white box inset on the right contains the full
grammar; the railroad diagrams only describe the portion of the syntax that
has any significant complexity.
Each of the railroad diagrams defines the syntax of a particular
"syntactic category", which is a set of strings having a similar format.
Examples of syntactic categories for JSON are "object", "array",
"value", "number", *etc*.
The paths along the "railroad tracks" in the diagram for one syntactic category
indicate the possibilities for forming a string in that category from strings
in other categories.
For example, the first diagram says that a string in the category "object"
always has an initial curly bracket `{`. This may be followed immediately by
a closing curly bracket `}` (the top "track"), or between the brackets there
may be something more complicated (a list of "members" -- the lower "track").
By following the lower track, you find that there has to be "whitespace",
followed by a "string", followed by "whitespace", followed by a colon `:`,
followed by a "value". After the "value", it is possible to have the
closing curly bracket `}` or to loop back around and have another instance of
the same pattern that was just seen (a "member"). The path to loop back around
requires that a comma `,` appear before the next member, so this tells you
that the members in between the `{` and `}` are separated by commas.
The other diagrams are read similarly, and even if you have never seen these
before, with a little study they should be self-explanatory so I'm not going
to belabor the explanation further.
Something that was not initially clear to me from just looking at the diagrams
was what the syntax of `true`, `false`, and `null` is. These are shown with
double quotes in the inset box on the right, but in fact, the "token" `true`
simply consists of the four letters: `t`, `r`, `u`, `e` without any quotes.
This is spelled out better in the ECMA standard document.
The description of "character" in the inset box is also a bit mysterious
at first reading. A "character" is something that is permitted to occur within
a string literal. After staring at the description for awhile it becomes clear
that any Unicode code point except for (1) the "control characters"
whose code points range from U+0000 to U+001F, (2) the quote `"`,
and (3) the backslash '\' (they call it "reverse solidus"), may appear directly
representing themselves within a string literal.
In addition, "escape sequences" are permitted. An escape sequence starts
with a backslash `\`, which may be followed by one of the characters
`"`, '\\', '/', 'b', 'f', 'n', 'r', 't', or 'u'. After 'u' there are required
to appear exactly four hexadecimal digits, the letters of which may either
be in upper case or lower case. The meaning of `\"`, `\/`, `\b`, `\f`
`\n`, `\r`, and `\t` is as in a C string. The escape sequence `\/` represents
a single forward slash ("solidus") `/` (I do not know why this is in the
standard.) The meaning of `\uHHHH`, where `HHHH` are four hex digits is
the 16-bit Unicode code point from the "basic multilingual plane" whose
value is given by interpreting `HHHH` as a four-digit hexadecimal number.
Although a Unicode code point outside the basic multilingual plane may
occur directly in a string literal, representing such by an escape requires
the use of a "surrogate pair" as indicated in the ECMA standard document.
Don't worry about this technicality. For this assignment, your implementation
will not have to handle full Unicode and UTF-8-encoded input.
You may assume instead that the input to your program will come as a sequence of
8-bit bytes, each of which directly represents a Unicode code point in the
range U+0000 to U+00FF (the first 128 code points correspond to ASCII codes,
and the meaning of the next 128 code points is defined by the Unicode standard).
Note that this means that we are *not* using the usual UTF-8 encoding to
represent Unicode as a sequence of 8-bit bytes.
As you will see when you look at the definitions of the data structures you
are to use, internally your program will use the 32-bit `int`
(typedef'ed as `ARGO_CHAR`) to represent a character.
This is enough bits to represent any Unicode code point, so there will
be no problem in taking the input bytes that you read in and storing them
internally as Unicode code points. Due to the limitation of the input encoding,
for us a string literal will not be able to directly contain any Unicode
code point greater than U+00FF.
Nevertheless, you will still be able to use escape sequences within
a string literal to represent Unicode code points in the basic multilingual
plane (from U+0000 to U+FFFF), because the escape sequence allows you
to specify the code point directly as four hexadecimal digits.
Since we will also output JSON as a sequence of 8-bit bytes, it will be
necessary to render any Unicode code points greater than U+00FF occuring
in a string literal using escapes.
When reading a specification like this, it is helpful to have examples of
what is being defined. For this purpose, I have provided (in the `rsrc`
directory) some sample JSON files. These files all have the `.json`
extension. Some of these files are examples of what your program is supposed
to do when given other files as input. For example, the file `rsrc/numbers.json`
contains the following content.
```
{
"0": 0,
"2147483648": 2147483648,
"-2147483649": 2147483649,
"0.0": 0.0,
"1": 1,
"789": 789,
"1.0": 1.0,
"-1.0": -1.0,
"1e3": 1e3,
"1E3": 1E3,
"1e-3": 1e-3,
"1.234": 1.234,
"-1.234": -1.234,
"1.234e3": 1.234e3,
"1.234e-3": 1.234e-3
}
```
when your program is run as follows
```
$ bin/argo -c -p 2 < rsrc/numbers.json
```
it should produce the output in `rsrc/numbers_-c_-p_2.json`; namely
```
{
"0": 0,
"2147483648": 2147483648,
"-2147483649": 2147483649,
"0.0": 0.0,
"1": 1,
"789": 789,
"1.0": 0.1e1,
"-1.0": -0.1e1,
"1e3": 0.1e4,
"1E3": 0.1e4,
"1e-3": 0.1000000000000000e-2,
"1.234": 0.1233999999999999e1,
"-1.234": -0.1233999999999999e1,
"1.234e3": 0.1233999999999999e4,
"1.234e-3": 0.1234000000000000e-2
}
```
How this is supposed to happen is explained below.
# Part 3: Implementation
The header file `global.h` lists prototypes for functions you are
required to implement:
```c
ARGO_VALUE *argo_read_value(FILE *);
int argo_read_string(ARGO_STRING *s, FILE *);
int argo_read_number(ARGO_NUMBER *n, FILE *);
int argo_write_value(ARGO_VALUE *, FILE *);
int argo_write_string(ARGO_STRING *, FILE *);
int argo_write_number(ARGO_NUMBER *, FILE *);
int validargs(int argc, char **argv);
```
The `validargs()` function has already been discussed above.
The `argo_read_value()` function reads JSON input from the specified stream
and returns an `ARGO_VALUE` data structure (as described below).
The `argo_read_string()` function takes a pointer to an `ARGO_STRING`
structure (which will be a sub-structure of an `ARGO_VALUE` structure),
as well as a `FILE *` pointer, and it reads a JSON string literal
(starting and ending with a quote `"`) from the input stream and stores
the content of the string (without the quotes, after handling escapes)
in the specified `ARGO_STRING` object.
The `argo_read_number()` function works similarly, except it reads
a JSON numeric literal and uses it to initialize an `ARGO_NUMBER`
structure.
The `argo_write_value()` function takes an `ARGO_VALUE` data structure
and a `FILE *` pointer representing an output stream, and it writes
canonical JSON representing the specified value to the output stream.
The `argo_write_string()` function takes an `ARGO_STRING *` pointer
and a `FILE *` pointer and writes a JSON string literal to the output
stream (including quotes and escaping content that needs to be escaped).
The `argo_write_number()` function similarly takes an `ARGO_NUMBER *`
pointer and a `FILE *` pointer and it writes a JSON numeric literal
to the output stream.
> :scream: Even though your final application will only ever read JSON input
> from `stdin` and write JSON output to `stdout`, the interfaces of these
> functions are designed to accept arbitrary streams as parameters.
> **You must not ignore these parameters.** Also, you must not assume that
> these streams are "seekable" and consequently you may not use the functions
> `fseek()` or `ftell()` in your code.
Besides the general discussion below, more detailed specifications for the
required behavior of these functions are given in the comments preceding
the (non-functional) stubs in `argo.c`. Those specifications are mostly
not repeated here to avoid redundancy and possible inconsistencies between
this document and the specifications in `argo.c`.
Of course, you will also have to make modifications to the `main()` function,
so that after calling `validargs()` it makes the calls to
`argo_read_value()` and `argo_write_value()` to perform the functions required
of the complete application.
Since I want everybody to get the experience of designing and coding their
own implementation for this assignment, I have not spelled out any further
what other functions you will might to implement, but you will almost certainly
want to implement other functions. Note that the function interfaces
that have been specified, together with the problems that have to be solved
by these functions, give you clues about an implementation structure that
you might wish to consider. I will now discuss this briefly.
The `argo_read_value()` function is supposed to read bytes of data from a
stream and attempt to *parse* them as a JSON "value" (which could be
an object, array, string, number, or one of the basic tokens `true`,
`false` or `null`). The result of this parsing process is a data structure
that represents the structure of the JSON in a form that is useful for
further processing. The specification of the syntax has a recursive
structure (*e.g.* an object contains members, members contain elements, which
can themselves contain values, and so on. A simple way to parse a string
according to a recursive specification like this is via a so-called
*recursive descent* parser. Basically, the parser will have a function
for each of the syntactic categories that appear in the syntax specification
(`argo_read_value()` is one such function). Each of these functions will
be called at a point where what is expected on the input stream is a string
belonging to the syntactic category handled by that function.
The function will read one or more characters from the input stream and, based
on what it sees, it will recursively call one or more of the other parser
functions. For example, the function responsible for parsing an "object"
might check that the next character in the input is a curly brace `{`
and then call the function responsible for parsing a "member".
Each parsing function will return a data structure that represents what it
has parsed. To build this data structure, each parsing function will
typically need make use of the data structures returned by the functions
that it called recursively.
In general, each function in a recursive descent parser will need to examine
a certain amount of the input in order to determine what to do. This input
is called "look-ahead". One of the features of the JSON syntax that makes
it so easy to parse is that at most one character of look-ahead is ever
required in order to decide what to do next. For example, once we have
seen the `{` that starts an object, checking whether the next character is
a `}` or not is sufficient to tell whether we have to call functions
to parse members of the object, or whether the object is empty.
In implementing a parser like this, it generally simplifies the design
if you can "peek" at the look-ahead character without consuming it.
That way, when you call another function, it can assume that the input
stream is at the very start of what it is supposed to be trying to parse,
rather than having to keep track of what characters might already have
been been read by the caller.
You should use the `fgetc()` function from the C standard I/O library
to read each byte of data from the input stream. This function consumes
the byte of data from the input stream, but the standard I/O library
also provides a function `ungetc()` that allows you to "push back" a single
character of input. So you can achieve the effect of peeking one character
into the input stream by calling `fgetc()`, looking at the character returned,
and then using `ungetc()` to push it back into the stream if it is not
to be consumed immediately. In some cases, as you descend through recursive
calls, the same character might be examined and pushed back repeatedly.
The recursive structure also dictates a natural form for the implementation
of the output function `argo_write_value()`: you can have one function
for each meaningful entity (*e.g.* "object", "member", "number") in the
JSON specification and these functions will call each other recursively
in order to traverse the data structure and emit characters to the output
stream.
# Part 4: Data Structures
The `argo.h` header file gives C definitions for the data structures you are
produce as the return values from `argo_read_value()` and as the arguments
to `argo_write_value()`. These data structures are basically trees.
The `ARGO_VALUE` structure is the central definition, which spells out what
information is in a node of such a tree. As the same `ARGO_VALUE` structure
is used to represent all the types of JSON values ("object", "array", "number",
*etc.*) it has a `type` field to indicate specifically what type of object
each individual instance represents. The possible types are defined by the
`ARGO_VALUE_TYPE` enumeration. Each node also has a `content` field, which
is where the actual content of the node is stored. The `content` field
is defined using the C `union` type, which allows the same region of memory
to be used to store different types of things at different times.
Depending on what is in the `type` field, exactly one of the `object`, `array`,
`string`, `number`, or `basic` subfields of `content` will be valid.
Except for `ARGO_BASIC`, which just defines a set of possible values,
each of these has its own structure definition, which are given as
`ARGO_OBJECT`, `ARGO_ARRAY`, `ARGO_STRING`, and `ARGO_NUMBER`.
Besides the `type` and `content` fields, each `ARGO_VALUE` node contains
`next` and `prev` fields that point to other `ARGO_VALUES`. These fields
will be used to link each node with other "sibling" nodes into a list.
For example, a JSON "object" has a list of "members".
The JSON object will be represented by an `ARGO_VALUE` node having
`ARGO_OBJECT_TYPE` in its `type` field. The `content` field of this
object will therefore be used to hold an `ARGO_OBJECT` structure.
The `ARGO_OBJECT` structure has a single field: `member_list`, which
points to a "dummy" `ARGO_VALUE` structure used as the head of
a *circularly, doubly linked list* of members (more on this below).
Linked into this list will be `ARGO_VALUE` structures that represent
the members. The `next` and `prev` fields of these are used to chain
the members into a list: the `next` field points from a member to
the next member and the `prev` field points from a member to the previous
member. For `ARGO_VALUE` structures used to represent members of
an object, the `name` field will contain an `ARGO_STRING` structure
that represents the name of the member.
JSON arrays are represented similarly to JSON objects: the array as a
whole is represented by an `ARGO_VALUE` structure whose `type` field
contains `ARGO_ARRAY_TYPE`. The `content` field will therefore be used
to hold an `ARGO_ARRAY` structure, which has an `element_list` field that
points to a "dummy" `ARGO_VALUE` structure at the head of a list of elements,
similarly to what was just described for for object members.
However, array elements don't have names, so the `name` field of each
array element will just be `NULL`.
JSON strings are represented by the `ARGO_STRING` structure, which
has fields `capacity`, `length`, and `content`. These are used to
represent a dynamically growable string, similarly to the way
`ArrayList` is implemented in Java.
At any given time, the `content` field will either be `NULL`
(if the string is empty) or it will point to an array of `ARGO_CHAR`
elements, each of which represents a single Unicode code point.
The `capacity` field tells the total number of "slots" in this
array, whereas the `length` field tells how many of these are
actually used (*i.e.* it gives the current length of the string).
For this assignment, you don't have to actually be concerned with
the dynamic allocation -- that is performed by the function
`argo_append_char()` which has been implemented for you in `const.c`.
All you have to worry about is making sure that the fields
of and `ARGO_STRING` structure that you want to use have been
initialized to zero and then you can just call `argo_append_char()`
to build the string content.
A simple `for` loop using the `length` field as the upper limit
can then be used to traverse the `content` array of an `ARGO_STRING`
once it has been initialized.
The `ARGO_NUMBER` structure is used to represent a number.
One of its fields is a `string_value` field, which is an `ARGO_STRING`
used to hold the digits and other characters that make up the
textual representation of the number. During parsing, characters
are accumulated in this field using `argo_append_char()` in the
same way that characters are accumulated for a string value.
The remaining fields (`int_value`, `float_value`) are used to store
an internal representation (either integer or floating-point)
of the value of the number, as well as flags (`valid_string`,
`valid_int`, `valid_float`) that tell which of the other fields
contain valid information. Note that a JSON number that contains
a fractional part or an exponent part will generally not be representable
in integer format, so the `valid_int` field should be zero and there
will be no useful information in the `int_value` field. Also, if an
`ARGO_NUMBER` is created internally, without parsing it from an input
stream, then a printable representation has not yet been computed, so the
`valid_string` field will be zero and the `string_value` field
will represent an empty string.
To summarize, your `argo_read_value()` function will read bytes of
data from the specified input stream using `fgetc()`.
As it reads and parses the input, it will build up a tree of
`ARGO_VALUE` nodes to represent the structure of the JSON input.
The nodes of the resulting tree must satisfy the following requirements:
- A node with `ARGO_OBJECT_TYPE` in its `type` field represents
a JSON "object". The `content` field then contains an `ARGO_OBJECT`
structure whose `member_list` field points to an `ARGO_VALUE`
node that is the head of a circular, doubly linked list of members.
Each member has a pointer to its associated name (an `ARGO_STRING`)
stored in the `name` field.
- A node with `ARGO_ARRAY_TYPE` in its `type` field represents
a JSON "array". The `content` field then contains an `ARGO_ARRAY`
structure whose `element_list` field points to an `ARGO_VALUE`
node that is the head of a circular, doubly linked list of elements.
- A node with `ARGO_STRING_TYPE` in its `type` field represents
a JSON "string" (without the enclosing quotes that appear in JSON
source). The `content` field then contains an `ARGO_STRING`
that represents the string. The `length` field of the `ARGO_STRING`
gives the length of the string and the `content` field points to
an array of `ARGO_CHAR` values that are the content of the string.
- A node with `ARGO_NUMBER_TYPE` in its `type` field represents
a JSON "number". The `content` field then contains an `ARGO_NUMBER`
object that represents the number in various ways.
* If the `valid_string` field is nonzero, then the `string_value`
field will contain an `ARGO_STRING` that holds the characters that
make up a printable/parseable representation of the number.
* If the `valid_int` field is nonzero, then the `int_value`
field will contain the value of the number as a C `long`.
* If the `valid_float` field is nonzero, then the `float_value`
field will contain the value of the number as a C `double`.
If there is more than one representation of the number present,
then they are required to agree with each other (*i.e* represent
the same value).
- A node with `ARGO_BASIC_TYPE` in its `type` field will have
a `content` field having a value of type `ARGO_BASIC` in its `basic`
field. This value will be one of `ARGO_TRUE`, `ARGO_FALSE`,
or `ARGO_NULL`.
The `argo_read_string()` function will parse a JSON string literal
and store the string content into an `ARGO_STRING` object.
Characters in the input that are not control character and are not
one of the characters that must be escaped are simply appended directly
to the string content. However when a backslash `\` is encounted,
it is necessary to interpret it as the start of an *escape sequence*
that represents the character to be appended. These escape sequences
should be familiar, since they are essentially the same as those
used in Java as well as in C.
The `argo_read_number()` function will parse a JSON numeric literal
and store into an `ARGO_NUMBER` object not only the sequence of
characters that constitute the literal, but also the value of the
number, either in integer format, in floating point format, or both.
In order to do this, you have to actually process the various digits
one at a time and calculate (using integer and/or floating point
arithemetic) the value that is represented. You have to carry out
this conversion yourself; you are not allowed to use any library
functions to do it.
## Circular, Doubly Linked Lists
As already stated, object members and array elements are to be stored as
circular, doubly linked lists of `ARGO_VALUE` structures, using a "dummy"
structure as a sentinel. Even though the sentinel has the same type
(*i.e.* `ARGO_VALUE`) as the elements of the list, it does not itself represent
an element of the list. The only fields used in the sentinel are the
`next` field, which points to the first actual element of the list,
and the `prev` field, which points to the last actual element of the list.
The list is "circular" because starting at the sentinel and following
`next` pointers will eventually lead back to the sentinel again.
Similarly, starting at the sentinel and following `prev` pointers will
eventually lead back to the sentinel. An empty list is represented
by a sentinel whose `next` and `prev` pointers point back to the sentinel
itself. You can read more about this type of data structure by searching,
e.g. Wikipedia for "circularly doubly linked list". The advantage of
using the sentinel is that all insertions and deletions are performed
in exactly the same way, without any edge cases for the first or last
element of the list.
## Dynamic Storage
There are two types of dynamic storage used by this program.
One of these is for the content of an `ARGO_STRING`. As already indicated
above, this is handled for you by the function `argo_append_char()` and you
do not have to worry about how it happens.
The other dynamic storage is for `ARGO_VALUE` structures.
You need a source of such structures while you are building the tree
that represents JSON.
As you are prohibited from declaring your own arrays in this
assignment, you will have to use one that we have already declared for you.
In `global.h` an array `argo_value_storage` has been defined for you,
together with an associated counter `argo_next_value`. You **must** use
this array as the source of `ARGO_VALUE` structures for building your
JSON trees. Use the `argo_next_value` counter to keep track of the
index of the first unused element of this array.
When you need an `ARGO_VALUE` structure, get a pointer to the first unused
element of the `argo_value_storage` array and increment `argo_next_value`.
Be sure to pay attention to the total number `NUM_ARGO_VALUES` of
elements of this array -- if you run off the end you will corrupt other
memory and your program will have unpredictable behavior.
# Part 5: Canonical Output
Your `argo_write_value()` function is supposed to traverse a data structure
such as that returned by `argo_read_value()` and it is supposed to output
JSON to the output stream. First of all, the JSON that you output has to
conform to the JSON standard, so that it can be parsed again to produce
exactly the same internal data structure. Beyond that, the JSON is supposed
to be "canonical", which means that it has been output in a standard way
that does not leave any possibility for variation.
Your canonical JSON output must always satisfy the following conditions:
- An `ARGO_NUMBER` whose `valid_int` field is set is to be printed out as
an integer, without any fraction or exponent.
- An `ARGO_NUMBER` whose `valid_float` field is set is to be printed out
with an integer and fractional part, as in the JSON specification.
The fractional part should be normalized to lie in the interval `[0.1, 1.0)`,
so that there is always just a single `0` digit before the decimal point
and the first digit after the decimal point is always nonzero.
An exponent of 0 is to be omitted completely and for positive exponents
the `+` sign is to be omitted. Exponents always start with lower-case `e`,
rather than upper-case `E`.
- An `ARGO_STRING` is printed so that the following conditions are satisfied:
* Characters (other than backslash `\` and quote `"`) having Unicode control
points greater than U+001F and less than U+00FF are to appear directly
in the string literal as themselves. This includes forward slash `/`.
* Characters with Unicode code points greater than U+00FF
are to appear as escapes using `\u` and the appropriate hex digits,
which must be in lower case.
* Control characters that have special escapes (`\n`, `\t`, *etc.*) must
be printed using those special escapes, not using the generic escape `\u`
with hex digits.
If the pretty-print option has not been specified, then your canonical JSON
output must satisfy the following condition:
- There is no white space in the output, except for white space that occurs
within a string literal.
If the pretty print option has been specified, then your canonical JSON output
will include white space according to the following rules:
- A single newline is output after every `ARGO_VALUE` that is output at the
top-level (*i.e.* not as part of an object or array).
- A single newline is output after every '{', '[', and ',' (except those in string
literals).
- A single newline is output immediately after the last member of an object,
and immediately after the last element of an array.
- Each newline is followed by a number of spaces that depends on the indentation
level of the value currently being printed. The indentation level is maintained
as follows:
* The indentation level is 0 for a top-level JSON value.
* The indentation level is increased by one just after a `{` or `[` has
been printed to start the list of members of an object or elements of
an array.
The indentation level decreases by one just after the last member or
element has been printed, so that the closing `}` or `]` is at the
previous indentation level
* A single space is printed following each colon `:` that separates
the name of an object member from its value.
The number of spaces following a newline is equal to the current indentation
level times the `INDENT` argument given with `-p`, or the default of `4`
if `-p` was specified without an `INDENT` argument.
Note that canonicalization must be an "idempotent" operation, in the sense that
if canonical output previously produced is re-parsed and then re-output using
the same pretty-printing settings, then the new output should be identical
to the previous output.
# Part 6: Strategy of Attack
To make things a little easier for you in getting started on this assignment,
I have distributed with the basecode a library containing binary object
versions of my own implementations of `argo_read_value()` and `argo_write_value()`.
The `Makefile` has been constructed so that it will link your program against
the library I provided. As a result, if you comment out one or both of
these function in `argo.c`, my versions will be linked instead and you can
use them to work on the other parts of the assignment. Note that this library
will **not** be present during grading, so do not leave these functions
commented out or your code will not compile.
Note that the functions whose interfaces have been specified will likely
be unit-tested. This means that their behavior should be completely determined
by their specified interface, which includes their parameters, return values,
and global variables defined in `global.h` (which you may **not** modify).
There should be no implicit assumption that any other functions have been or
will be called or that any particular variables have been set to any particular
values, except for the global variables defined in `global.h`.
So, for example, you may (and should) assume that when `argo_write_object()`
is called, the `global_options` variable has been set according to the desired
program options, but you may **not** assume that before `argo_write_object()`
has been called that some other function was called previously.
My best guess as to the right attack strategy for this assignment is as follows:
First, work on the command-line argument processing (`validargs()`) and
make the changes to `main()` necessary to get the program to honor the command-line
arguments and perform the overall function that the application is supposed
to perform.
Next, start working on implementing `argo_write_value()`, using my version of
`argo_read_value()` as a source of data structures that you can use to increase
your understanding of pointers and the specific data structures that we are using
to represent JSON and, ultimately, as an aid to developing and testing your
implementation.
Finally, now that you have a clear understanding of the data structures you
are trying to produce work on implementing `argo_read_value()`, to parse
a stream of input bytes and produce such a data structure. I expect this part
of the assignment to be the most difficult.
Note that the code that I wrote for `argo_read_value()` and `argo_write_value()`
is only about 800 lines in length. If you find your own code growing
much larger than that, you need to step back and think smarter about trying
to simplify your code.
# Part 7: Running the Program
The `argo` program always reads from `stdin` and possibly writes to `stdout`.
If you want the program to take input from a file or produce output to
a file, you may run the program using **input and output redirection**,
which is implemented by the shell.
A simple example of a command that uses such redirection is the following:
```
$ bin/argo -c < rsrc/numbers.json > numbers.out
```
This will cause the input to the program to be redirected from the text file
`rsrc/numbers.json` and the output from the program to be redirected to the
file `numbers.out`.
The redirection is accomplished by the shell, which interprets the `<` symbol
to mean "input redirection from a file" and the `>` symbol to mean
"output redirection to a file". It is important to understand that redirection
is handled by the shell and that the `bin/argo` program never sees any
of the redirection arguments; in the above example it sees only `bin/argo -c`
and it just reads from `stdin` and writes to `stdout`.
Alternatively, the output from a command can be **piped**
to another program, without the use of a disk file.
This could be done, for example, by the following command:
```
$ bin/argo -c -p 2 < rsrc/package-lock.json | less
```
This sends the (rather lengthy) output to a program called `less`,
which display the first screenful of the output and then gives you the ability
to scan forward and backward to see different parts of it.
Type `h` at the `less` prompt to get help information on what you can do
with it. Type `q` at the prompt to exit `less`.
Programs that read from standard input and write to standard output are
often used as components in more complex "pipelines" that perform multiple
transformations on data.
For example, one way to test your implementation is by using one instance
of it to produce some output and testing to see if that output can be read by
another instance; *e.g.:
```
$ cat rsrc/package-lock.json | bin/argo -c | bin/argo -c -p 2 > p.out
```
Here `cat` (short for "concatenate") is a command that reads the files
specified as arguments, concatenates their contents, and prints the
concatenated content to `stdout`. In the above command, this output
is redirected through a pipe to become the input to `bin/argo -c`.
The output of `bin/argo -c` (which contains no whitespace) is then
sent to `bin/argo -c -p 2` for pretty printing. Finally, the pretty-printed
output is written to file `p.out`. Actually, the original input
file `rsrc/package-lock.json` is already canonical as defined here,
so in the end the file `p.out` should have exactly the same content
as `rsrc/package-lock.json`. One way to check this is to use the
`diff` comand (use `man diff` to read the manual page) to compare the
two files:
```
$ diff rsrc/package-lock.json p.out
$
```
If `diff` exits silently, the files are identical.
Another command that would be useful on output with no whitespace
is the `cmp` command, which performes a byte-by-byte comparison of two files
(even files that contain raw binary data):
```
$ cmp rsrc/package-lock.json p.out
```
If the files have identical content, `cmp` exits silently.
If one file is shorter than the other, but the content is otherwise identical,
`cmp` will report that it has reached `EOF` on the shorter file.
Finally, if the files disagree at some point, `cmp` will report the
offset of the first byte at which the files disagree.
If the `-l` flag is given, `cmp` will report all disagreements between the
two files.
## Unit Testing
Unit testing is a part of the development process in which small testable
sections of a program (units) are tested individually to ensure that they are
all functioning properly. This is a very common practice in industry and is
often a requested skill by companies hiring graduates.
> :nerd: Some developers consider testing to be so important that they use a
> work flow called **test driven development**. In TDD, requirements are turned into
> failing unit tests. The goal is then to write code to make these tests pass.
This semester, we will be using a C unit testing framework called
[Criterion](https://github.com/Snaipe/Criterion), which will give you some
exposure to unit testing. We have provided a basic set of test cases for this
assignment.
The provided tests are in the `tests/basecode_tests.c` file. These tests do the
following:
- `validargs_help_test` ensures that `validargs` sets the help bit
correctly when the `-h` flag is passed in.
- `validargs_validate_test` ensures that `validargs` sets the validate-mode bit
correctly when the `-v` flag is passed.
- `validargs_canonicalize_test` ensures that `validargs` sets the canonicalize-mode bit
correctly when the `-c` flag is passed in.
- `validargs_bits_test` ensures that `validargs` sets the decode-mode bit
correctly when the `-d` flag is passed in and that the value passed with `-b`
is correctly stored in the least-signficant byte of `global_options`.
- `validargs_error_test` ensures that `validargs` returns an error when the `-p`
flag is supplied with the `-v` flag.
- `help_system_test` uses the `system` syscall to execute your program through
Bash and checks to see that your program returns with `EXIT_SUCCESS`.
- `argo_basic_test` performs a basic test of the canonicalization mode of the program.
### Compiling and Running Tests
When you compile your program with `make`, an `argo_tests` executable will be
created in your `bin` directory alongside the `argo` executable. Running this
executable from the `hw1` directory with the command `bin/argo_tests` will run
the unit tests described above and print the test outputs to `stdout`. To obtain
more information about each test run, you can use the verbose print option:
`bin/argo_tests --verbose=0`.
The tests we have provided are very minimal and are meant as a starting point
for you to learn about Criterion, not to fully test your homework. You may write
your own additional tests in `tests/basecode_tests.c`, or in additional source
files in the `tests` directory. However, this is not required for this assignment.
Criterion documentation for writing your own tests can be
found [here](http://criterion.readthedocs.io/en/master/).
Note that grades are assigned based on the number of our own test cases
(not given to you in advance) that your program passes.
So you should work on the assignments in such a way that whatever you do submit
will function. Code that is completely broken will not score any points,
regardless of how voluminous it might be or how long you might have spent on it.
## Sample Input Files
In the `rsrc` directory I have placed a few JSON input files for you to try
your code on.
- `numbers.json`: A JSON file containing a single object with various
numbers as its members. This will exercise most (but probably not all)
of the interesting cases that come up in parsing and outputting numbers.
- `strings.json`: A JSON file containing a single array with various
strings as its elements. These are intended to exercise most (but again,
probably not all) of the cases involving escape sequences in strings.
- `package-lock.json`: This is a larger JSON file that I had lying around
which seemed to be a reasonable overall test.
# Hand-in instructions
**TEST YOUR PROGRAM VIGOROUSLY BEFORE SUBMISSION!**
Make sure that you have implemented all the required functions specifed in `const.h`.
Make sure that you have adhered to the restrictions (no array brackets, no prohibited
header files, no modifications to files that say "DO NOT MODIFY" at the beginning,
no functions other than `main()` in `main.c`) set out in this assignment document.
Make sure your directory tree looks basically like it did when you started
(there could possibly be additional files that you added, but the original organization
should be maintained) and that your homework compiles (you should be sure to try compiling
with both `make clean all` and `make clean debug` because there are certain errors that can
occur one way but not the other).
This homework's tag is: `hw1`
`$ git submit hw1`
> :nerd: When writing your program try to comment as much as possible. Try to
> stay consistent with your formatting. It is much easier for your TA and the
> professor to help you if we can figure out what your code does quickly!