1197 lines
59 KiB
Markdown
1197 lines
59 KiB
Markdown
|
# Homework 1 - CSE 320 - Spring 2022
|
||
|
#### Professor Eugene Stark
|
||
|
|
||
|
### **Due Date: Friday 02/18/2022 @ 11:59pm**
|
||
|
|
||
|
**Read the entire doc before you start**
|
||
|
|
||
|
## Introduction
|
||
|
|
||
|
In this assignment, you will implement functions for parsing JSON input
|
||
|
and building a data structure to represent its contents and for traversing
|
||
|
the data structure and producing JSON output.
|
||
|
You will use these functions to implement a command-line utility
|
||
|
(called `argo`)
|
||
|
which can validate JSON input and transform JSON input into JSON output
|
||
|
in a "canonical" form.
|
||
|
The goal of this homework is to familiarize yourself with C programming,
|
||
|
with a focus on input/output, bitwise manipulations, and the use of pointers.
|
||
|
|
||
|
For all assignments in this course, you **MUST NOT** put any of the functions
|
||
|
that you write into the `main.c` file. The file `main.c` **MUST ONLY** contain
|
||
|
`#include`s, local `#define`s and the `main` function (you may of course modify
|
||
|
the `main` function body). The reason for this restriction has to do with our
|
||
|
use of the Criterion library to test your code.
|
||
|
Beyond this, you may have as many or as few additional `.c` files in the `src`
|
||
|
directory as you wish. Also, you may declare as many or as few headers as you wish.
|
||
|
Note, however, that header and `.c` files distributed with the assignment base code
|
||
|
often contain a comment at the beginning which states that they are not to be
|
||
|
modified. **PLEASE** take note of these comments and do not modify any such files,
|
||
|
as they will be replaced by the original versions during grading.
|
||
|
|
||
|
> :scream: Array indexing (**'A[]'**) is not allowed in this assignment. You
|
||
|
> **MUST USE** pointer arithmetic instead. All necessary arrays are declared in
|
||
|
> the `global.h` header file. You **MUST USE** these arrays. **DO NOT** create
|
||
|
> your own arrays. We **WILL** check for this.
|
||
|
|
||
|
> :nerd: Reference for pointers: [https://beej.us/guide/bgc/html/#pointers](https://beej.us/guide/bgc/html/#pointers).
|
||
|
|
||
|
# Getting Started
|
||
|
|
||
|
Fetch base code for `hw1` as described in `hw0`. You can find it at this link:
|
||
|
[https://gitlab02.cs.stonybrook.edu/cse320/hw1](https://gitlab02.cs.stonybrook.edu/cse320/hw1).
|
||
|
**IMPORTANT: 'FETCH', DO NOT 'CLONE'.**
|
||
|
|
||
|
Both repos will probably have a file named `.gitlab-ci.yml` with different contents.
|
||
|
Simply merging these files will cause a merge conflict. To avoid this, we will
|
||
|
merge the repos using a flag so that the `.gitlab-ci.yml` found in the `hw1`
|
||
|
repo will replace the `hw0` version. To merge, use this command:
|
||
|
|
||
|
```
|
||
|
git merge -m "Merging HW1_CODE" HW1_CODE/master --strategy-option=theirs
|
||
|
```
|
||
|
|
||
|
> :scream: Based on past experience, many students will either ignore the above command or forget
|
||
|
> to use it. The result will be a **merge conflict**, which will be reported by git.
|
||
|
> Once a merge conflict has been reported, it is essential to correct it before committing
|
||
|
> (or to abort the merge without committing -- use `git merge --abort` and go back and try again),
|
||
|
> because git will have inserted markers into the files involved indicating the locations of the
|
||
|
> conflicts, and if you ignore this and commit anyway, you will end up with corrupted files.
|
||
|
> You should consider it important to read up at an early stage on merge conflicts with git and
|
||
|
> how to resolve them properly.
|
||
|
|
||
|
Here is the structure of the base code:
|
||
|
|
||
|
<pre>
|
||
|
.
|
||
|
├── .gitlab-ci.yml
|
||
|
└── hw1
|
||
|
├── .gitignore
|
||
|
├── hw1.sublime-project
|
||
|
├── include
|
||
|
│ ├── argo.h
|
||
|
│ ├── debug.h
|
||
|
│ └── global.h
|
||
|
├── lib
|
||
|
│ └── argo.a
|
||
|
├── Makefile
|
||
|
├── rsrc
|
||
|
│ ├── numbers.json
|
||
|
│ ├── package-lock.json
|
||
|
│ └── strings.json
|
||
|
├── src
|
||
|
│ ├── argo.c
|
||
|
│ ├── const.c
|
||
|
│ ├── main.c
|
||
|
│ └── validargs.c
|
||
|
├── test_output
|
||
|
│ └── .git_keep
|
||
|
└── tests
|
||
|
├── basecode_tests.c
|
||
|
└── rsrc
|
||
|
└── strings_-c.json
|
||
|
</pre>
|
||
|
|
||
|
- The `.gitlab-ci.yml` file is a file that specifies "continuous integration" testing
|
||
|
to be performed by the GitLab server each time you push a commit. Usually it will
|
||
|
be configured to check that your code builds and runs, and that any provided unit tests
|
||
|
are passed. You are free to change this file if you like.
|
||
|
|
||
|
> :scream: The CI testing is for your own information; it does not directly have
|
||
|
> anything to do with assignment grading or whether your commit has been properly
|
||
|
> pushed to the server. If some part of the testing fails, you will see the somewhat
|
||
|
> misleading message "commit failed" on the GitLab web interface.
|
||
|
> This does **not** mean that "your attempt to commit has failed" or that "your commit
|
||
|
> didn't get pushed to the server"; the very fact that the testing was triggered at
|
||
|
> all means that you successfully pushed a commit. Rather, it means that "the CI tests
|
||
|
> performed on a commit that you pushed did not succeed". The purpose of the tests are
|
||
|
> to alert you to possible problems with your code; if you see that testing has failed
|
||
|
> it is worth investigating why that has happened. However, the tests can sometimes
|
||
|
> fail for reasons that are not your fault; for example, the entire CI "runner" system
|
||
|
> may fail if someone submits code that fills up the system disk. You should definitely
|
||
|
> try to understand why the tests have failed if they do, but it is not necessary to be
|
||
|
> overly obsessive about them.
|
||
|
|
||
|
- The `hw1.sublime-project` file is a "project file" for use by the Sublime Text editor.
|
||
|
It is included to try to help Sublime understand the organization of the project so that
|
||
|
it can properly identify errors as you edit your code.
|
||
|
|
||
|
- The `Makefile` is a configuration file for the `make` build utility, which is what
|
||
|
you should use to compile your code. In brief, `make` or `make all` will compile
|
||
|
anything that needs to be, `make debug` does the same except that it compiles the code
|
||
|
with options suitable for debugging, and `make clean` removes files that resulted from
|
||
|
a previous compilation. These "targets" can be combined; for example, you would use
|
||
|
`make clean debug` to ensure a complete clean and rebuild of everything for debugging.
|
||
|
|
||
|
- The `include` directory contains C header files (with extension `.h`) that are used
|
||
|
by the code. Note that these files often contain `DO NOT MODIFY` instructions at the beginning.
|
||
|
You should observe these notices carefully where they appear.
|
||
|
|
||
|
- The `src` directory contains C source files (with extension `.c`).
|
||
|
|
||
|
- The `tests` directory contains C source code (and sometimes headers and other files)
|
||
|
that are used by the Criterion tests.
|
||
|
|
||
|
- The `rsrc` directory contains some samples of data files that you can use for
|
||
|
testing purposes.
|
||
|
|
||
|
- The `test_output` directory is a scratch directory where the Criterion tests can
|
||
|
put output files. You should not commit any files in this directory to your
|
||
|
`git` repository.
|
||
|
|
||
|
- The `lib` directory contains a library with binaries for my functions
|
||
|
`argo_read_value()` and `argo_write_value()`. As discussed below, by commenting out
|
||
|
the stubs for these functions in `argo.c` you can arrange for my versions to be
|
||
|
linked with your code, which may help you to get a jump start on understanding
|
||
|
some things.
|
||
|
|
||
|
## A Note about Program Output
|
||
|
|
||
|
What a program does and does not print is VERY important.
|
||
|
In the UNIX world stringing together programs with piping and scripting is
|
||
|
commonplace. Although combining programs in this way is extremely powerful, it
|
||
|
means that each program must not print extraneous output. For example, you would
|
||
|
expect `ls` to output a list of files in a directory and nothing else.
|
||
|
Similarly, your program must follow the specifications for normal operation.
|
||
|
One part of our grading of this assignment will be to check whether your program
|
||
|
produces EXACTLY the specified output. If your program produces output that deviates
|
||
|
from the specifications, even in a minor way, or if it produces extraneous output
|
||
|
that was not part of the specifications, it will adversely impact your grade
|
||
|
in a significant way, so pay close attention.
|
||
|
|
||
|
> :scream: Use the debug macro `debug` (described in the 320 reference document in the
|
||
|
> Piazza resources section) for any other program output or messages you many need
|
||
|
> while coding (e.g. debugging output).
|
||
|
|
||
|
# Part 1: Program Operation and Argument Validation
|
||
|
|
||
|
In this part of the assignment, you will write a function to validate the arguments
|
||
|
passed to your program via the command line. Your program will treat arguments
|
||
|
as follows:
|
||
|
|
||
|
- If no flags are provided, you will display the usage and return with an
|
||
|
`EXIT_FAILURE` return code.
|
||
|
|
||
|
- If the `-h` flag is provided, you will display the usage for the program and
|
||
|
exit with an `EXIT_SUCCESS` return code.
|
||
|
|
||
|
- If the `-v` flag is provided, then the program will read data from standard input
|
||
|
(`stdin`) and validate that it is syntactically correct JSON. If so, the program
|
||
|
exits with an `EXIT_SUCCESS` return code, otherwise the program exits with an
|
||
|
`EXIT_FAILURE` return code. In the latter case, the program will print to
|
||
|
standard error (`stderr`) an error message describing the error that was discovered.
|
||
|
No other output is produced.
|
||
|
|
||
|
- If the `-c` flag is provided, then the program performs the same function as
|
||
|
described for `-v`, but after validating the input, the program will also output
|
||
|
to standard output (`stdout`) a "canonicalized" version of the input.
|
||
|
"Canonicalized" means that the output is in a standard form in which possibilities
|
||
|
for variation have been eliminated. This is described in more detail below.
|
||
|
Unless `-p` has also been specified, then the produced output contains **no whitespace**
|
||
|
(except within strings that contain whitespace characters).
|
||
|
|
||
|
- If the `-p` flag is provided, then the `-c` flag must also have been provided.
|
||
|
In this case, newlines and spaces are used to format the canonicalized output
|
||
|
in a more human-friendly way. See below for the precise requirements on where
|
||
|
this whitespace must appear. The `INDENT` is an optional nonnegative integer argument
|
||
|
that specifies the number of additional spaces to be output at the beginning of a line
|
||
|
for each increase in indentation level. The format of this argument must be
|
||
|
the same as for a nonnegative integer number in the JSON specification.
|
||
|
If `-p` is provided without any `INDENT`, then a default value of 4 is used.
|
||
|
|
||
|
Note that the program reads data from `stdin` and writes transformed data
|
||
|
to `stdout`. Any other printout, such as diagnostic messages produced by the
|
||
|
program, are written to `stderr`. If the program runs without error, then it
|
||
|
will exit with the `EXIT_SUCCESS` status code; if any error occurs during the
|
||
|
execution of the program, then it will exit with the `EXIT_FAILURE` status code.
|
||
|
|
||
|
> :nerd: `EXIT_SUCCESS` and `EXIT_FAILURE` are macros defined in `<stdlib.h>` which
|
||
|
> represent success and failure return codes respectively.
|
||
|
|
||
|
> :nerd: `stdin`, `stdout`, and `stderr` are special I/O "streams", defined
|
||
|
> in `<stdio.h>`, which are automatically opened at the start of execution
|
||
|
> for all programs, do not need to be reopened, and (almost always) should not
|
||
|
> be closed.
|
||
|
|
||
|
The usage scenarios for this program are described by the following message,
|
||
|
which is printed by the program when it is invoked without any arguments:
|
||
|
|
||
|
<pre>
|
||
|
USAGE: bin/argo [-h] [-c|-v] [-p|-p INDENT]
|
||
|
-h Help: displays this help menu.
|
||
|
-v Validate: the program reads from standard input and checks whether
|
||
|
it is syntactically correct JSON. If there is any error, then a message
|
||
|
describing the error is printed to standard error before termination.
|
||
|
No other output is produced.
|
||
|
-c Canonicalize: once the input has been read and validated, it is
|
||
|
re-emitted to standard output in 'canonical form'. Unless -p has been
|
||
|
specified, the canonicalized output contains no whitespace (except within
|
||
|
strings that contain whitespace characters).
|
||
|
-p Pretty-print: This option is only permissible if -c has also been specified.
|
||
|
In that case, newlines and spaces are used to format the canonical output
|
||
|
in a more human-friendly way. For the precise requirements on where this
|
||
|
whitespace must appear, see the assignment handout.
|
||
|
The INDENT is an optional nonnegative integer argument that specifies the
|
||
|
number of additional spaces to be output at the beginning of a line for each
|
||
|
for each increase in indentation level. If no value is specified, then a
|
||
|
default value of 4 is used.
|
||
|
</pre>
|
||
|
|
||
|
The square brackets indicate that the enclosed argument is optional.
|
||
|
The `-c|-v` means that one of `-c` or `-v` may be specified.
|
||
|
The `-p|-p INDENT` means that `-p` may be specified alone, or with an optional
|
||
|
additional argument `INDENT`.
|
||
|
|
||
|
A valid invocation of the program implies that the following hold about
|
||
|
the command-line arguments:
|
||
|
|
||
|
- All "positional arguments" (`-h`, `-c`, or `-v`) come before any optional arguments
|
||
|
(`-p`).
|
||
|
The optional arguments (well, there is only one) may come in any order after the positional ones.
|
||
|
|
||
|
- If the `-h` flag is provided, it is the first positional argument after
|
||
|
the program name and any other arguments that follow are ignored.
|
||
|
|
||
|
- If the `-h` flag is *not* specified, then exactly one of `-v` or `-c`
|
||
|
must be specified.
|
||
|
|
||
|
- If `-p` is given, then it might or might not be followed by an `INDENT` argument.
|
||
|
If the `INDENT` argument is present, then it must represent a nonnegative integer
|
||
|
in the format allowed for integer numbers in the JSON specification.
|
||
|
|
||
|
For example, the following are a subset of the possible valid argument
|
||
|
combinations:
|
||
|
|
||
|
- `$ bin/argo -h ...`
|
||
|
- `$ bin/argo -v`
|
||
|
- `$ bin/argo -c -p`
|
||
|
- `$ bin/argo -c -p 8`
|
||
|
|
||
|
> :scream: The `...` means that all arguments, if any, are to be ignored; e.g.
|
||
|
> the usage `bin/argo -h -x -y BLAHBLAHBLAH -z` is equivalent to `bin/argo -h`.
|
||
|
|
||
|
Some examples of invalid combinations would be:
|
||
|
|
||
|
- `$ bin/argo -p 1 -c`
|
||
|
- `$ bin/argo -v -c`
|
||
|
- `$ bin/argo -v -p 5`
|
||
|
- `$ bin/argo -z 20`
|
||
|
|
||
|
> :scream: You may use only "raw" `argc` and `argv` for argument parsing and
|
||
|
> validation. Using any libraries that parse command line arguments (e.g.
|
||
|
> `getopt`) is prohibited.
|
||
|
|
||
|
> :scream: Any libraries that help you parse strings are prohibited as well
|
||
|
> (`string.h`, `ctype.h`, etc). The use of `atoi`, `scanf`, `fscanf`, `sscanf`,
|
||
|
> and similar functions is likewise prohibited. *This is intentional and
|
||
|
> will help you practice parsing strings and manipulating pointers.*
|
||
|
|
||
|
> :scream: You **MAY NOT** use dynamic memory allocation in this assignment
|
||
|
> (i.e. `malloc`, `realloc`, `calloc`, `mmap`, etc.). There is one function
|
||
|
> (`argo_append_char()`) provided for you that does the dynamic allocation
|
||
|
> required while accumulating the characters of a string or numeric literal.
|
||
|
> This function is in the file `const.c`, which you are not to modify.
|
||
|
|
||
|
> :nerd: Reference for command line arguments: [https://beej.us/guide/bgc/html/#command-line-arguments](https://beej.us/guide/bgc/html/#command-line-arguments).
|
||
|
|
||
|
**NOTE:** The `make` command compiles the `argo` executable into the `bin` folder.
|
||
|
All commands from here on are assumed to be run from the `hw1` directory.
|
||
|
|
||
|
### **Required** Validate Arguments Function
|
||
|
|
||
|
In `global.h`, you will find the following function prototype (function
|
||
|
declaration) already declared for you. You **MUST** implement this function
|
||
|
as part of the assignment.
|
||
|
|
||
|
```c
|
||
|
int validargs(int argc, char **argv);
|
||
|
```
|
||
|
|
||
|
The file `validargs.c` contains the following specification of the required behavior
|
||
|
of this function:
|
||
|
|
||
|
```c
|
||
|
/**
|
||
|
* @brief Validates command line arguments passed to the program.
|
||
|
* @details This function will validate all the arguments passed to the
|
||
|
* program, returning 0 if validation succeeds and -1 if validation fails.
|
||
|
* Upon successful return, the various options that were specified will be
|
||
|
* encoded in the global variable 'global_options', where it will be
|
||
|
* accessible elsewhere in the program. For details of the required
|
||
|
* encoding, see the assignment handout.
|
||
|
*
|
||
|
* @param argc The number of arguments passed to the program from the CLI.
|
||
|
* @param argv The argument strings passed to the program from the CLI.
|
||
|
* @return 0 if validation succeeds and -1 if validation fails.
|
||
|
* @modifies global variable "global_options" to contain an encoded representation
|
||
|
* of the selected program options.
|
||
|
*/
|
||
|
```
|
||
|
|
||
|
> :scream: This function must be implemented as specified as it will be tested
|
||
|
> and graded independently. **It should always return -- the USAGE macro should
|
||
|
> never be called from validargs.**
|
||
|
|
||
|
The `validargs` function should return -1 if there is any form of failure.
|
||
|
This includes, but is not limited to:
|
||
|
|
||
|
- Invalid number of arguments (too few or too many).
|
||
|
|
||
|
- Invalid ordering of arguments.
|
||
|
|
||
|
- A missing parameter to an option that requires one [doesn't apply to the
|
||
|
current assignment, since the parameter to `-p` is optional].
|
||
|
|
||
|
- Invalid parameter. A numeric parameter specfied with `-p` is invalid if
|
||
|
it does not conform to the format of a nonnegative integer as required by
|
||
|
the JSON specification.
|
||
|
|
||
|
The `global_options` variable of type `int` is used to record the mode
|
||
|
of operation (i.e. encode/decode) of the program and associated parameters.
|
||
|
This is done as follows:
|
||
|
|
||
|
- If the `-h` flag is specified, the most significant bit (bit 31) is 1.
|
||
|
|
||
|
- If the `-v` flag is specified, the second-most significant bit (bit 30)
|
||
|
is 1.
|
||
|
|
||
|
- If the `-c` flag is specified, the third-most significant bit (bit 29)
|
||
|
is 1.
|
||
|
|
||
|
- If the `-p` flag is specified, the fourth-most significant bit (bit 28)
|
||
|
is 1.
|
||
|
|
||
|
- The least significant byte (bits 7 - 0) records the number of spaces of
|
||
|
indentation per level specified with `-p`, or the default value (4)
|
||
|
if no value was specified with `-p`. If `-p` was not specified at all,
|
||
|
then this byte should be 0.
|
||
|
|
||
|
If `validargs` returns -1 indicating failure, your program must call
|
||
|
`USAGE(program_name, return_code)` and return `EXIT_FAILURE`.
|
||
|
**Once again, `validargs` must always return, and therefore it must not
|
||
|
call the `USAGE(program_name, return_code)` macro itself.
|
||
|
That should be done in `main`.**
|
||
|
|
||
|
If `validargs` sets the most-significant bit of `global_options` to 1
|
||
|
(i.e. the `-h` flag was passed), your program must call `USAGE(program_name, return_code)`
|
||
|
and return `EXIT_SUCCESS`.
|
||
|
|
||
|
> :nerd: The `USAGE(program_name, return_code)` macro is already defined for you
|
||
|
> in `argo.h`.
|
||
|
|
||
|
If validargs returns 0, then your program must read input data from `stdin`
|
||
|
and (depending on the options supplied) write output data to `stdout`.
|
||
|
Upon successful completion, your program should exit with exit status `EXIT_SUCCESS`;
|
||
|
otherwise, in case of an error it should exit with exit status `EXIT_FAILURE`.
|
||
|
|
||
|
Unless the program has been compiled for debugging (using `make debug`),
|
||
|
in a successful run that exits with `EXIT_SUCCESS` no other output may be produced
|
||
|
by the program. In an unsuccessful run in which the program exits with `EXIT_FAILURE`
|
||
|
the program should output to `stderr` a one-line diagnostic message that indicates
|
||
|
the reason for the failure. The program must not produce any other output than this
|
||
|
unless it has been compiled for debugging.
|
||
|
|
||
|
> :nerd: Remember `EXIT_SUCCESS` and `EXIT_FAILURE` are defined in `<stdlib.h>`.
|
||
|
> Also note, `EXIT_SUCCESS` is 0 and `EXIT_FAILURE` is 1.
|
||
|
|
||
|
### Example validargs Executions
|
||
|
|
||
|
The following are examples of the setting of `global_options` and the
|
||
|
other global variables for various command-line inputs.
|
||
|
Each input is a bash command that can be used to invoke the program.
|
||
|
|
||
|
- **Input:** `bin/argo -h`. **Setting:** `global_options=0x80000000`
|
||
|
(`help` bit is set, other bits clear).
|
||
|
|
||
|
- **Input:** `bin/argo -v `. **Setting:** `global_options=0x40000000`
|
||
|
(mode is "validate").
|
||
|
|
||
|
- **Input:** `bin/argo -c -p 2`. **Setting:** `global_options=0x30000002`
|
||
|
(mode is "canonicalize", "pretty-print" has been specified with
|
||
|
indentation increment 2).
|
||
|
|
||
|
- **Input:** `bin/argo -p 2 -c`. **Setting:** `global_options=0x0`.
|
||
|
This is an error case because the specified argument ordering is invalid
|
||
|
(`-p` is before `-c`). In this case `validargs` returns -1, leaving
|
||
|
`global_options` unset.
|
||
|
|
||
|
# Part 2: Overview of the JSON Specification
|
||
|
|
||
|
JSON ("JavaScript Object notation") is a standard format for data interchange
|
||
|
that is now commonly used in many areas of computing.
|
||
|
It was designed to be extremely simple to generate and parse and it in fact
|
||
|
achieves these goals: JSON syntax is about as simple as it gets for a
|
||
|
computer language that is actually used in the real world.
|
||
|
The syntax of JSON is defined by an
|
||
|
[ECMA standard](ECMA-404_2nd_edition_december_2017.pdf).
|
||
|
A summary that omits the scarier language from the standard document is given at
|
||
|
[www.json.org](https://www.json.org/json-en.html).
|
||
|
Most likely, you will only need to refer to this summary, but the full standard
|
||
|
document is here if you want to look at it.
|
||
|
|
||
|
In order to understand the JSON syntax specification, you need to be able to
|
||
|
read the "railroad diagrams" that are used to formally specify it.
|
||
|
These diagrams are actually a graphical version of a *context-free grammar*,
|
||
|
which is a standard tool used for formally specifying all kinds of computer
|
||
|
languages. Actually, the white box inset on the right contains the full
|
||
|
grammar; the railroad diagrams only describe the portion of the syntax that
|
||
|
has any significant complexity.
|
||
|
Each of the railroad diagrams defines the syntax of a particular
|
||
|
"syntactic category", which is a set of strings having a similar format.
|
||
|
Examples of syntactic categories for JSON are "object", "array",
|
||
|
"value", "number", *etc*.
|
||
|
The paths along the "railroad tracks" in the diagram for one syntactic category
|
||
|
indicate the possibilities for forming a string in that category from strings
|
||
|
in other categories.
|
||
|
For example, the first diagram says that a string in the category "object"
|
||
|
always has an initial curly bracket `{`. This may be followed immediately by
|
||
|
a closing curly bracket `}` (the top "track"), or between the brackets there
|
||
|
may be something more complicated (a list of "members" -- the lower "track").
|
||
|
By following the lower track, you find that there has to be "whitespace",
|
||
|
followed by a "string", followed by "whitespace", followed by a colon `:`,
|
||
|
followed by a "value". After the "value", it is possible to have the
|
||
|
closing curly bracket `}` or to loop back around and have another instance of
|
||
|
the same pattern that was just seen (a "member"). The path to loop back around
|
||
|
requires that a comma `,` appear before the next member, so this tells you
|
||
|
that the members in between the `{` and `}` are separated by commas.
|
||
|
The other diagrams are read similarly, and even if you have never seen these
|
||
|
before, with a little study they should be self-explanatory so I'm not going
|
||
|
to belabor the explanation further.
|
||
|
|
||
|
Something that was not initially clear to me from just looking at the diagrams
|
||
|
was what the syntax of `true`, `false`, and `null` is. These are shown with
|
||
|
double quotes in the inset box on the right, but in fact, the "token" `true`
|
||
|
simply consists of the four letters: `t`, `r`, `u`, `e` without any quotes.
|
||
|
This is spelled out better in the ECMA standard document.
|
||
|
|
||
|
The description of "character" in the inset box is also a bit mysterious
|
||
|
at first reading. A "character" is something that is permitted to occur within
|
||
|
a string literal. After staring at the description for awhile it becomes clear
|
||
|
that any Unicode code point except for (1) the "control characters"
|
||
|
whose code points range from U+0000 to U+001F, (2) the quote `"`,
|
||
|
and (3) the backslash '\' (they call it "reverse solidus"), may appear directly
|
||
|
representing themselves within a string literal.
|
||
|
In addition, "escape sequences" are permitted. An escape sequence starts
|
||
|
with a backslash `\`, which may be followed by one of the characters
|
||
|
`"`, '\\', '/', 'b', 'f', 'n', 'r', 't', or 'u'. After 'u' there are required
|
||
|
to appear exactly four hexadecimal digits, the letters of which may either
|
||
|
be in upper case or lower case. The meaning of `\"`, `\/`, `\b`, `\f`
|
||
|
`\n`, `\r`, and `\t` is as in a C string. The escape sequence `\/` represents
|
||
|
a single forward slash ("solidus") `/` (I do not know why this is in the
|
||
|
standard.) The meaning of `\uHHHH`, where `HHHH` are four hex digits is
|
||
|
the 16-bit Unicode code point from the "basic multilingual plane" whose
|
||
|
value is given by interpreting `HHHH` as a four-digit hexadecimal number.
|
||
|
|
||
|
Although a Unicode code point outside the basic multilingual plane may
|
||
|
occur directly in a string literal, representing such by an escape requires
|
||
|
the use of a "surrogate pair" as indicated in the ECMA standard document.
|
||
|
Don't worry about this technicality. For this assignment, your implementation
|
||
|
will not have to handle full Unicode and UTF-8-encoded input.
|
||
|
You may assume instead that the input to your program will come as a sequence of
|
||
|
8-bit bytes, each of which directly represents a Unicode code point in the
|
||
|
range U+0000 to U+00FF (the first 128 code points correspond to ASCII codes,
|
||
|
and the meaning of the next 128 code points is defined by the Unicode standard).
|
||
|
Note that this means that we are *not* using the usual UTF-8 encoding to
|
||
|
represent Unicode as a sequence of 8-bit bytes.
|
||
|
As you will see when you look at the definitions of the data structures you
|
||
|
are to use, internally your program will use the 32-bit `int`
|
||
|
(typedef'ed as `ARGO_CHAR`) to represent a character.
|
||
|
This is enough bits to represent any Unicode code point, so there will
|
||
|
be no problem in taking the input bytes that you read in and storing them
|
||
|
internally as Unicode code points. Due to the limitation of the input encoding,
|
||
|
for us a string literal will not be able to directly contain any Unicode
|
||
|
code point greater than U+00FF.
|
||
|
Nevertheless, you will still be able to use escape sequences within
|
||
|
a string literal to represent Unicode code points in the basic multilingual
|
||
|
plane (from U+0000 to U+FFFF), because the escape sequence allows you
|
||
|
to specify the code point directly as four hexadecimal digits.
|
||
|
Since we will also output JSON as a sequence of 8-bit bytes, it will be
|
||
|
necessary to render any Unicode code points greater than U+00FF occuring
|
||
|
in a string literal using escapes.
|
||
|
|
||
|
When reading a specification like this, it is helpful to have examples of
|
||
|
what is being defined. For this purpose, I have provided (in the `rsrc`
|
||
|
directory) some sample JSON files. These files all have the `.json`
|
||
|
extension. Some of these files are examples of what your program is supposed
|
||
|
to do when given other files as input. For example, the file `rsrc/numbers.json`
|
||
|
contains the following content.
|
||
|
|
||
|
```
|
||
|
{
|
||
|
"0": 0,
|
||
|
"2147483648": 2147483648,
|
||
|
"-2147483649": 2147483649,
|
||
|
"0.0": 0.0,
|
||
|
"1": 1,
|
||
|
"789": 789,
|
||
|
"1.0": 1.0,
|
||
|
"-1.0": -1.0,
|
||
|
"1e3": 1e3,
|
||
|
"1E3": 1E3,
|
||
|
"1e-3": 1e-3,
|
||
|
"1.234": 1.234,
|
||
|
"-1.234": -1.234,
|
||
|
"1.234e3": 1.234e3,
|
||
|
"1.234e-3": 1.234e-3
|
||
|
}
|
||
|
```
|
||
|
|
||
|
when your program is run as follows
|
||
|
|
||
|
```
|
||
|
$ bin/argo -c -p 2 < rsrc/numbers.json
|
||
|
```
|
||
|
|
||
|
it should produce the output in `rsrc/numbers_-c_-p_2.json`; namely
|
||
|
|
||
|
```
|
||
|
{
|
||
|
"0": 0,
|
||
|
"2147483648": 2147483648,
|
||
|
"-2147483649": 2147483649,
|
||
|
"0.0": 0.0,
|
||
|
"1": 1,
|
||
|
"789": 789,
|
||
|
"1.0": 0.1e1,
|
||
|
"-1.0": -0.1e1,
|
||
|
"1e3": 0.1e4,
|
||
|
"1E3": 0.1e4,
|
||
|
"1e-3": 0.1000000000000000e-2,
|
||
|
"1.234": 0.1233999999999999e1,
|
||
|
"-1.234": -0.1233999999999999e1,
|
||
|
"1.234e3": 0.1233999999999999e4,
|
||
|
"1.234e-3": 0.1234000000000000e-2
|
||
|
}
|
||
|
```
|
||
|
|
||
|
How this is supposed to happen is explained below.
|
||
|
|
||
|
# Part 3: Implementation
|
||
|
|
||
|
The header file `global.h` lists prototypes for functions you are
|
||
|
required to implement:
|
||
|
|
||
|
```c
|
||
|
ARGO_VALUE *argo_read_value(FILE *);
|
||
|
int argo_read_string(ARGO_STRING *s, FILE *);
|
||
|
int argo_read_number(ARGO_NUMBER *n, FILE *);
|
||
|
|
||
|
int argo_write_value(ARGO_VALUE *, FILE *);
|
||
|
int argo_write_string(ARGO_STRING *, FILE *);
|
||
|
int argo_write_number(ARGO_NUMBER *, FILE *);
|
||
|
|
||
|
int validargs(int argc, char **argv);
|
||
|
```
|
||
|
|
||
|
The `validargs()` function has already been discussed above.
|
||
|
The `argo_read_value()` function reads JSON input from the specified stream
|
||
|
and returns an `ARGO_VALUE` data structure (as described below).
|
||
|
The `argo_read_string()` function takes a pointer to an `ARGO_STRING`
|
||
|
structure (which will be a sub-structure of an `ARGO_VALUE` structure),
|
||
|
as well as a `FILE *` pointer, and it reads a JSON string literal
|
||
|
(starting and ending with a quote `"`) from the input stream and stores
|
||
|
the content of the string (without the quotes, after handling escapes)
|
||
|
in the specified `ARGO_STRING` object.
|
||
|
The `argo_read_number()` function works similarly, except it reads
|
||
|
a JSON numeric literal and uses it to initialize an `ARGO_NUMBER`
|
||
|
structure.
|
||
|
|
||
|
The `argo_write_value()` function takes an `ARGO_VALUE` data structure
|
||
|
and a `FILE *` pointer representing an output stream, and it writes
|
||
|
canonical JSON representing the specified value to the output stream.
|
||
|
The `argo_write_string()` function takes an `ARGO_STRING *` pointer
|
||
|
and a `FILE *` pointer and writes a JSON string literal to the output
|
||
|
stream (including quotes and escaping content that needs to be escaped).
|
||
|
The `argo_write_number()` function similarly takes an `ARGO_NUMBER *`
|
||
|
pointer and a `FILE *` pointer and it writes a JSON numeric literal
|
||
|
to the output stream.
|
||
|
|
||
|
> :scream: Even though your final application will only ever read JSON input
|
||
|
> from `stdin` and write JSON output to `stdout`, the interfaces of these
|
||
|
> functions are designed to accept arbitrary streams as parameters.
|
||
|
> **You must not ignore these parameters.** Also, you must not assume that
|
||
|
> these streams are "seekable" and consequently you may not use the functions
|
||
|
> `fseek()` or `ftell()` in your code.
|
||
|
|
||
|
Besides the general discussion below, more detailed specifications for the
|
||
|
required behavior of these functions are given in the comments preceding
|
||
|
the (non-functional) stubs in `argo.c`. Those specifications are mostly
|
||
|
not repeated here to avoid redundancy and possible inconsistencies between
|
||
|
this document and the specifications in `argo.c`.
|
||
|
|
||
|
Of course, you will also have to make modifications to the `main()` function,
|
||
|
so that after calling `validargs()` it makes the calls to
|
||
|
`argo_read_value()` and `argo_write_value()` to perform the functions required
|
||
|
of the complete application.
|
||
|
|
||
|
Since I want everybody to get the experience of designing and coding their
|
||
|
own implementation for this assignment, I have not spelled out any further
|
||
|
what other functions you will might to implement, but you will almost certainly
|
||
|
want to implement other functions. Note that the function interfaces
|
||
|
that have been specified, together with the problems that have to be solved
|
||
|
by these functions, give you clues about an implementation structure that
|
||
|
you might wish to consider. I will now discuss this briefly.
|
||
|
|
||
|
The `argo_read_value()` function is supposed to read bytes of data from a
|
||
|
stream and attempt to *parse* them as a JSON "value" (which could be
|
||
|
an object, array, string, number, or one of the basic tokens `true`,
|
||
|
`false` or `null`). The result of this parsing process is a data structure
|
||
|
that represents the structure of the JSON in a form that is useful for
|
||
|
further processing. The specification of the syntax has a recursive
|
||
|
structure (*e.g.* an object contains members, members contain elements, which
|
||
|
can themselves contain values, and so on. A simple way to parse a string
|
||
|
according to a recursive specification like this is via a so-called
|
||
|
*recursive descent* parser. Basically, the parser will have a function
|
||
|
for each of the syntactic categories that appear in the syntax specification
|
||
|
(`argo_read_value()` is one such function). Each of these functions will
|
||
|
be called at a point where what is expected on the input stream is a string
|
||
|
belonging to the syntactic category handled by that function.
|
||
|
The function will read one or more characters from the input stream and, based
|
||
|
on what it sees, it will recursively call one or more of the other parser
|
||
|
functions. For example, the function responsible for parsing an "object"
|
||
|
might check that the next character in the input is a curly brace `{`
|
||
|
and then call the function responsible for parsing a "member".
|
||
|
Each parsing function will return a data structure that represents what it
|
||
|
has parsed. To build this data structure, each parsing function will
|
||
|
typically need make use of the data structures returned by the functions
|
||
|
that it called recursively.
|
||
|
|
||
|
In general, each function in a recursive descent parser will need to examine
|
||
|
a certain amount of the input in order to determine what to do. This input
|
||
|
is called "look-ahead". One of the features of the JSON syntax that makes
|
||
|
it so easy to parse is that at most one character of look-ahead is ever
|
||
|
required in order to decide what to do next. For example, once we have
|
||
|
seen the `{` that starts an object, checking whether the next character is
|
||
|
a `}` or not is sufficient to tell whether we have to call functions
|
||
|
to parse members of the object, or whether the object is empty.
|
||
|
In implementing a parser like this, it generally simplifies the design
|
||
|
if you can "peek" at the look-ahead character without consuming it.
|
||
|
That way, when you call another function, it can assume that the input
|
||
|
stream is at the very start of what it is supposed to be trying to parse,
|
||
|
rather than having to keep track of what characters might already have
|
||
|
been been read by the caller.
|
||
|
You should use the `fgetc()` function from the C standard I/O library
|
||
|
to read each byte of data from the input stream. This function consumes
|
||
|
the byte of data from the input stream, but the standard I/O library
|
||
|
also provides a function `ungetc()` that allows you to "push back" a single
|
||
|
character of input. So you can achieve the effect of peeking one character
|
||
|
into the input stream by calling `fgetc()`, looking at the character returned,
|
||
|
and then using `ungetc()` to push it back into the stream if it is not
|
||
|
to be consumed immediately. In some cases, as you descend through recursive
|
||
|
calls, the same character might be examined and pushed back repeatedly.
|
||
|
|
||
|
The recursive structure also dictates a natural form for the implementation
|
||
|
of the output function `argo_write_value()`: you can have one function
|
||
|
for each meaningful entity (*e.g.* "object", "member", "number") in the
|
||
|
JSON specification and these functions will call each other recursively
|
||
|
in order to traverse the data structure and emit characters to the output
|
||
|
stream.
|
||
|
|
||
|
# Part 4: Data Structures
|
||
|
|
||
|
The `argo.h` header file gives C definitions for the data structures you are
|
||
|
produce as the return values from `argo_read_value()` and as the arguments
|
||
|
to `argo_write_value()`. These data structures are basically trees.
|
||
|
The `ARGO_VALUE` structure is the central definition, which spells out what
|
||
|
information is in a node of such a tree. As the same `ARGO_VALUE` structure
|
||
|
is used to represent all the types of JSON values ("object", "array", "number",
|
||
|
*etc.*) it has a `type` field to indicate specifically what type of object
|
||
|
each individual instance represents. The possible types are defined by the
|
||
|
`ARGO_VALUE_TYPE` enumeration. Each node also has a `content` field, which
|
||
|
is where the actual content of the node is stored. The `content` field
|
||
|
is defined using the C `union` type, which allows the same region of memory
|
||
|
to be used to store different types of things at different times.
|
||
|
Depending on what is in the `type` field, exactly one of the `object`, `array`,
|
||
|
`string`, `number`, or `basic` subfields of `content` will be valid.
|
||
|
Except for `ARGO_BASIC`, which just defines a set of possible values,
|
||
|
each of these has its own structure definition, which are given as
|
||
|
`ARGO_OBJECT`, `ARGO_ARRAY`, `ARGO_STRING`, and `ARGO_NUMBER`.
|
||
|
|
||
|
Besides the `type` and `content` fields, each `ARGO_VALUE` node contains
|
||
|
`next` and `prev` fields that point to other `ARGO_VALUES`. These fields
|
||
|
will be used to link each node with other "sibling" nodes into a list.
|
||
|
For example, a JSON "object" has a list of "members".
|
||
|
The JSON object will be represented by an `ARGO_VALUE` node having
|
||
|
`ARGO_OBJECT_TYPE` in its `type` field. The `content` field of this
|
||
|
object will therefore be used to hold an `ARGO_OBJECT` structure.
|
||
|
The `ARGO_OBJECT` structure has a single field: `member_list`, which
|
||
|
points to a "dummy" `ARGO_VALUE` structure used as the head of
|
||
|
a *circularly, doubly linked list* of members (more on this below).
|
||
|
Linked into this list will be `ARGO_VALUE` structures that represent
|
||
|
the members. The `next` and `prev` fields of these are used to chain
|
||
|
the members into a list: the `next` field points from a member to
|
||
|
the next member and the `prev` field points from a member to the previous
|
||
|
member. For `ARGO_VALUE` structures used to represent members of
|
||
|
an object, the `name` field will contain an `ARGO_STRING` structure
|
||
|
that represents the name of the member.
|
||
|
|
||
|
JSON arrays are represented similarly to JSON objects: the array as a
|
||
|
whole is represented by an `ARGO_VALUE` structure whose `type` field
|
||
|
contains `ARGO_ARRAY_TYPE`. The `content` field will therefore be used
|
||
|
to hold an `ARGO_ARRAY` structure, which has an `element_list` field that
|
||
|
points to a "dummy" `ARGO_VALUE` structure at the head of a list of elements,
|
||
|
similarly to what was just described for for object members.
|
||
|
However, array elements don't have names, so the `name` field of each
|
||
|
array element will just be `NULL`.
|
||
|
|
||
|
JSON strings are represented by the `ARGO_STRING` structure, which
|
||
|
has fields `capacity`, `length`, and `content`. These are used to
|
||
|
represent a dynamically growable string, similarly to the way
|
||
|
`ArrayList` is implemented in Java.
|
||
|
At any given time, the `content` field will either be `NULL`
|
||
|
(if the string is empty) or it will point to an array of `ARGO_CHAR`
|
||
|
elements, each of which represents a single Unicode code point.
|
||
|
The `capacity` field tells the total number of "slots" in this
|
||
|
array, whereas the `length` field tells how many of these are
|
||
|
actually used (*i.e.* it gives the current length of the string).
|
||
|
For this assignment, you don't have to actually be concerned with
|
||
|
the dynamic allocation -- that is performed by the function
|
||
|
`argo_append_char()` which has been implemented for you in `const.c`.
|
||
|
All you have to worry about is making sure that the fields
|
||
|
of and `ARGO_STRING` structure that you want to use have been
|
||
|
initialized to zero and then you can just call `argo_append_char()`
|
||
|
to build the string content.
|
||
|
A simple `for` loop using the `length` field as the upper limit
|
||
|
can then be used to traverse the `content` array of an `ARGO_STRING`
|
||
|
once it has been initialized.
|
||
|
|
||
|
The `ARGO_NUMBER` structure is used to represent a number.
|
||
|
One of its fields is a `string_value` field, which is an `ARGO_STRING`
|
||
|
used to hold the digits and other characters that make up the
|
||
|
textual representation of the number. During parsing, characters
|
||
|
are accumulated in this field using `argo_append_char()` in the
|
||
|
same way that characters are accumulated for a string value.
|
||
|
The remaining fields (`int_value`, `float_value`) are used to store
|
||
|
an internal representation (either integer or floating-point)
|
||
|
of the value of the number, as well as flags (`valid_string`,
|
||
|
`valid_int`, `valid_float`) that tell which of the other fields
|
||
|
contain valid information. Note that a JSON number that contains
|
||
|
a fractional part or an exponent part will generally not be representable
|
||
|
in integer format, so the `valid_int` field should be zero and there
|
||
|
will be no useful information in the `int_value` field. Also, if an
|
||
|
`ARGO_NUMBER` is created internally, without parsing it from an input
|
||
|
stream, then a printable representation has not yet been computed, so the
|
||
|
`valid_string` field will be zero and the `string_value` field
|
||
|
will represent an empty string.
|
||
|
|
||
|
To summarize, your `argo_read_value()` function will read bytes of
|
||
|
data from the specified input stream using `fgetc()`.
|
||
|
As it reads and parses the input, it will build up a tree of
|
||
|
`ARGO_VALUE` nodes to represent the structure of the JSON input.
|
||
|
The nodes of the resulting tree must satisfy the following requirements:
|
||
|
|
||
|
- A node with `ARGO_OBJECT_TYPE` in its `type` field represents
|
||
|
a JSON "object". The `content` field then contains an `ARGO_OBJECT`
|
||
|
structure whose `member_list` field points to an `ARGO_VALUE`
|
||
|
node that is the head of a circular, doubly linked list of members.
|
||
|
Each member has a pointer to its associated name (an `ARGO_STRING`)
|
||
|
stored in the `name` field.
|
||
|
|
||
|
- A node with `ARGO_ARRAY_TYPE` in its `type` field represents
|
||
|
a JSON "array". The `content` field then contains an `ARGO_ARRAY`
|
||
|
structure whose `element_list` field points to an `ARGO_VALUE`
|
||
|
node that is the head of a circular, doubly linked list of elements.
|
||
|
|
||
|
- A node with `ARGO_STRING_TYPE` in its `type` field represents
|
||
|
a JSON "string" (without the enclosing quotes that appear in JSON
|
||
|
source). The `content` field then contains an `ARGO_STRING`
|
||
|
that represents the string. The `length` field of the `ARGO_STRING`
|
||
|
gives the length of the string and the `content` field points to
|
||
|
an array of `ARGO_CHAR` values that are the content of the string.
|
||
|
|
||
|
- A node with `ARGO_NUMBER_TYPE` in its `type` field represents
|
||
|
a JSON "number". The `content` field then contains an `ARGO_NUMBER`
|
||
|
object that represents the number in various ways.
|
||
|
|
||
|
* If the `valid_string` field is nonzero, then the `string_value`
|
||
|
field will contain an `ARGO_STRING` that holds the characters that
|
||
|
make up a printable/parseable representation of the number.
|
||
|
|
||
|
* If the `valid_int` field is nonzero, then the `int_value`
|
||
|
field will contain the value of the number as a C `long`.
|
||
|
|
||
|
* If the `valid_float` field is nonzero, then the `float_value`
|
||
|
field will contain the value of the number as a C `double`.
|
||
|
|
||
|
If there is more than one representation of the number present,
|
||
|
then they are required to agree with each other (*i.e* represent
|
||
|
the same value).
|
||
|
|
||
|
- A node with `ARGO_BASIC_TYPE` in its `type` field will have
|
||
|
a `content` field having a value of type `ARGO_BASIC` in its `basic`
|
||
|
field. This value will be one of `ARGO_TRUE`, `ARGO_FALSE`,
|
||
|
or `ARGO_NULL`.
|
||
|
|
||
|
The `argo_read_string()` function will parse a JSON string literal
|
||
|
and store the string content into an `ARGO_STRING` object.
|
||
|
Characters in the input that are not control character and are not
|
||
|
one of the characters that must be escaped are simply appended directly
|
||
|
to the string content. However when a backslash `\` is encounted,
|
||
|
it is necessary to interpret it as the start of an *escape sequence*
|
||
|
that represents the character to be appended. These escape sequences
|
||
|
should be familiar, since they are essentially the same as those
|
||
|
used in Java as well as in C.
|
||
|
|
||
|
The `argo_read_number()` function will parse a JSON numeric literal
|
||
|
and store into an `ARGO_NUMBER` object not only the sequence of
|
||
|
characters that constitute the literal, but also the value of the
|
||
|
number, either in integer format, in floating point format, or both.
|
||
|
In order to do this, you have to actually process the various digits
|
||
|
one at a time and calculate (using integer and/or floating point
|
||
|
arithemetic) the value that is represented. You have to carry out
|
||
|
this conversion yourself; you are not allowed to use any library
|
||
|
functions to do it.
|
||
|
|
||
|
## Circular, Doubly Linked Lists
|
||
|
|
||
|
As already stated, object members and array elements are to be stored as
|
||
|
circular, doubly linked lists of `ARGO_VALUE` structures, using a "dummy"
|
||
|
structure as a sentinel. Even though the sentinel has the same type
|
||
|
(*i.e.* `ARGO_VALUE`) as the elements of the list, it does not itself represent
|
||
|
an element of the list. The only fields used in the sentinel are the
|
||
|
`next` field, which points to the first actual element of the list,
|
||
|
and the `prev` field, which points to the last actual element of the list.
|
||
|
The list is "circular" because starting at the sentinel and following
|
||
|
`next` pointers will eventually lead back to the sentinel again.
|
||
|
Similarly, starting at the sentinel and following `prev` pointers will
|
||
|
eventually lead back to the sentinel. An empty list is represented
|
||
|
by a sentinel whose `next` and `prev` pointers point back to the sentinel
|
||
|
itself. You can read more about this type of data structure by searching,
|
||
|
e.g. Wikipedia for "circularly doubly linked list". The advantage of
|
||
|
using the sentinel is that all insertions and deletions are performed
|
||
|
in exactly the same way, without any edge cases for the first or last
|
||
|
element of the list.
|
||
|
|
||
|
## Dynamic Storage
|
||
|
|
||
|
There are two types of dynamic storage used by this program.
|
||
|
One of these is for the content of an `ARGO_STRING`. As already indicated
|
||
|
above, this is handled for you by the function `argo_append_char()` and you
|
||
|
do not have to worry about how it happens.
|
||
|
The other dynamic storage is for `ARGO_VALUE` structures.
|
||
|
You need a source of such structures while you are building the tree
|
||
|
that represents JSON.
|
||
|
As you are prohibited from declaring your own arrays in this
|
||
|
assignment, you will have to use one that we have already declared for you.
|
||
|
In `global.h` an array `argo_value_storage` has been defined for you,
|
||
|
together with an associated counter `argo_next_value`. You **must** use
|
||
|
this array as the source of `ARGO_VALUE` structures for building your
|
||
|
JSON trees. Use the `argo_next_value` counter to keep track of the
|
||
|
index of the first unused element of this array.
|
||
|
When you need an `ARGO_VALUE` structure, get a pointer to the first unused
|
||
|
element of the `argo_value_storage` array and increment `argo_next_value`.
|
||
|
Be sure to pay attention to the total number `NUM_ARGO_VALUES` of
|
||
|
elements of this array -- if you run off the end you will corrupt other
|
||
|
memory and your program will have unpredictable behavior.
|
||
|
|
||
|
# Part 5: Canonical Output
|
||
|
|
||
|
Your `argo_write_value()` function is supposed to traverse a data structure
|
||
|
such as that returned by `argo_read_value()` and it is supposed to output
|
||
|
JSON to the output stream. First of all, the JSON that you output has to
|
||
|
conform to the JSON standard, so that it can be parsed again to produce
|
||
|
exactly the same internal data structure. Beyond that, the JSON is supposed
|
||
|
to be "canonical", which means that it has been output in a standard way
|
||
|
that does not leave any possibility for variation.
|
||
|
Your canonical JSON output must always satisfy the following conditions:
|
||
|
|
||
|
- An `ARGO_NUMBER` whose `valid_int` field is set is to be printed out as
|
||
|
an integer, without any fraction or exponent.
|
||
|
|
||
|
- An `ARGO_NUMBER` whose `valid_float` field is set is to be printed out
|
||
|
with an integer and fractional part, as in the JSON specification.
|
||
|
The fractional part should be normalized to lie in the interval `[0.1, 1.0)`,
|
||
|
so that there is always just a single `0` digit before the decimal point
|
||
|
and the first digit after the decimal point is always nonzero.
|
||
|
An exponent of 0 is to be omitted completely and for positive exponents
|
||
|
the `+` sign is to be omitted. Exponents always start with lower-case `e`,
|
||
|
rather than upper-case `E`.
|
||
|
|
||
|
- An `ARGO_STRING` is printed so that the following conditions are satisfied:
|
||
|
|
||
|
* Characters (other than backslash `\` and quote `"`) having Unicode control
|
||
|
points greater than U+001F and less than U+00FF are to appear directly
|
||
|
in the string literal as themselves. This includes forward slash `/`.
|
||
|
|
||
|
* Characters with Unicode code points greater than U+00FF
|
||
|
are to appear as escapes using `\u` and the appropriate hex digits,
|
||
|
which must be in lower case.
|
||
|
|
||
|
* Control characters that have special escapes (`\n`, `\t`, *etc.*) must
|
||
|
be printed using those special escapes, not using the generic escape `\u`
|
||
|
with hex digits.
|
||
|
|
||
|
|
||
|
If the pretty-print option has not been specified, then your canonical JSON
|
||
|
output must satisfy the following condition:
|
||
|
|
||
|
- There is no white space in the output, except for white space that occurs
|
||
|
within a string literal.
|
||
|
|
||
|
If the pretty print option has been specified, then your canonical JSON output
|
||
|
will include white space according to the following rules:
|
||
|
|
||
|
- A single newline is output after every `ARGO_VALUE` that is output at the
|
||
|
top-level (*i.e.* not as part of an object or array).
|
||
|
|
||
|
- A single newline is output after every '{', '[', and ',' (except those in string
|
||
|
literals).
|
||
|
|
||
|
- A single newline is output immediately after the last member of an object,
|
||
|
and immediately after the last element of an array.
|
||
|
|
||
|
- Each newline is followed by a number of spaces that depends on the indentation
|
||
|
level of the value currently being printed. The indentation level is maintained
|
||
|
as follows:
|
||
|
|
||
|
* The indentation level is 0 for a top-level JSON value.
|
||
|
|
||
|
* The indentation level is increased by one just after a `{` or `[` has
|
||
|
been printed to start the list of members of an object or elements of
|
||
|
an array.
|
||
|
The indentation level decreases by one just after the last member or
|
||
|
element has been printed, so that the closing `}` or `]` is at the
|
||
|
previous indentation level
|
||
|
|
||
|
* A single space is printed following each colon `:` that separates
|
||
|
the name of an object member from its value.
|
||
|
|
||
|
The number of spaces following a newline is equal to the current indentation
|
||
|
level times the `INDENT` argument given with `-p`, or the default of `4`
|
||
|
if `-p` was specified without an `INDENT` argument.
|
||
|
|
||
|
Note that canonicalization must be an "idempotent" operation, in the sense that
|
||
|
if canonical output previously produced is re-parsed and then re-output using
|
||
|
the same pretty-printing settings, then the new output should be identical
|
||
|
to the previous output.
|
||
|
|
||
|
# Part 6: Strategy of Attack
|
||
|
|
||
|
To make things a little easier for you in getting started on this assignment,
|
||
|
I have distributed with the basecode a library containing binary object
|
||
|
versions of my own implementations of `argo_read_value()` and `argo_write_value()`.
|
||
|
The `Makefile` has been constructed so that it will link your program against
|
||
|
the library I provided. As a result, if you comment out one or both of
|
||
|
these function in `argo.c`, my versions will be linked instead and you can
|
||
|
use them to work on the other parts of the assignment. Note that this library
|
||
|
will **not** be present during grading, so do not leave these functions
|
||
|
commented out or your code will not compile.
|
||
|
|
||
|
Note that the functions whose interfaces have been specified will likely
|
||
|
be unit-tested. This means that their behavior should be completely determined
|
||
|
by their specified interface, which includes their parameters, return values,
|
||
|
and global variables defined in `global.h` (which you may **not** modify).
|
||
|
There should be no implicit assumption that any other functions have been or
|
||
|
will be called or that any particular variables have been set to any particular
|
||
|
values, except for the global variables defined in `global.h`.
|
||
|
So, for example, you may (and should) assume that when `argo_write_object()`
|
||
|
is called, the `global_options` variable has been set according to the desired
|
||
|
program options, but you may **not** assume that before `argo_write_object()`
|
||
|
has been called that some other function was called previously.
|
||
|
|
||
|
My best guess as to the right attack strategy for this assignment is as follows:
|
||
|
First, work on the command-line argument processing (`validargs()`) and
|
||
|
make the changes to `main()` necessary to get the program to honor the command-line
|
||
|
arguments and perform the overall function that the application is supposed
|
||
|
to perform.
|
||
|
Next, start working on implementing `argo_write_value()`, using my version of
|
||
|
`argo_read_value()` as a source of data structures that you can use to increase
|
||
|
your understanding of pointers and the specific data structures that we are using
|
||
|
to represent JSON and, ultimately, as an aid to developing and testing your
|
||
|
implementation.
|
||
|
Finally, now that you have a clear understanding of the data structures you
|
||
|
are trying to produce work on implementing `argo_read_value()`, to parse
|
||
|
a stream of input bytes and produce such a data structure. I expect this part
|
||
|
of the assignment to be the most difficult.
|
||
|
|
||
|
Note that the code that I wrote for `argo_read_value()` and `argo_write_value()`
|
||
|
is only about 800 lines in length. If you find your own code growing
|
||
|
much larger than that, you need to step back and think smarter about trying
|
||
|
to simplify your code.
|
||
|
|
||
|
# Part 7: Running the Program
|
||
|
|
||
|
The `argo` program always reads from `stdin` and possibly writes to `stdout`.
|
||
|
If you want the program to take input from a file or produce output to
|
||
|
a file, you may run the program using **input and output redirection**,
|
||
|
which is implemented by the shell.
|
||
|
A simple example of a command that uses such redirection is the following:
|
||
|
|
||
|
```
|
||
|
$ bin/argo -c < rsrc/numbers.json > numbers.out
|
||
|
```
|
||
|
|
||
|
This will cause the input to the program to be redirected from the text file
|
||
|
`rsrc/numbers.json` and the output from the program to be redirected to the
|
||
|
file `numbers.out`.
|
||
|
The redirection is accomplished by the shell, which interprets the `<` symbol
|
||
|
to mean "input redirection from a file" and the `>` symbol to mean
|
||
|
"output redirection to a file". It is important to understand that redirection
|
||
|
is handled by the shell and that the `bin/argo` program never sees any
|
||
|
of the redirection arguments; in the above example it sees only `bin/argo -c`
|
||
|
and it just reads from `stdin` and writes to `stdout`.
|
||
|
|
||
|
Alternatively, the output from a command can be **piped**
|
||
|
to another program, without the use of a disk file.
|
||
|
This could be done, for example, by the following command:
|
||
|
|
||
|
```
|
||
|
$ bin/argo -c -p 2 < rsrc/package-lock.json | less
|
||
|
```
|
||
|
|
||
|
This sends the (rather lengthy) output to a program called `less`,
|
||
|
which display the first screenful of the output and then gives you the ability
|
||
|
to scan forward and backward to see different parts of it.
|
||
|
Type `h` at the `less` prompt to get help information on what you can do
|
||
|
with it. Type `q` at the prompt to exit `less`.
|
||
|
|
||
|
Programs that read from standard input and write to standard output are
|
||
|
often used as components in more complex "pipelines" that perform multiple
|
||
|
transformations on data.
|
||
|
|
||
|
For example, one way to test your implementation is by using one instance
|
||
|
of it to produce some output and testing to see if that output can be read by
|
||
|
another instance; *e.g.:
|
||
|
|
||
|
```
|
||
|
$ cat rsrc/package-lock.json | bin/argo -c | bin/argo -c -p 2 > p.out
|
||
|
```
|
||
|
|
||
|
Here `cat` (short for "concatenate") is a command that reads the files
|
||
|
specified as arguments, concatenates their contents, and prints the
|
||
|
concatenated content to `stdout`. In the above command, this output
|
||
|
is redirected through a pipe to become the input to `bin/argo -c`.
|
||
|
The output of `bin/argo -c` (which contains no whitespace) is then
|
||
|
sent to `bin/argo -c -p 2` for pretty printing. Finally, the pretty-printed
|
||
|
output is written to file `p.out`. Actually, the original input
|
||
|
file `rsrc/package-lock.json` is already canonical as defined here,
|
||
|
so in the end the file `p.out` should have exactly the same content
|
||
|
as `rsrc/package-lock.json`. One way to check this is to use the
|
||
|
`diff` comand (use `man diff` to read the manual page) to compare the
|
||
|
two files:
|
||
|
|
||
|
```
|
||
|
$ diff rsrc/package-lock.json p.out
|
||
|
$
|
||
|
```
|
||
|
|
||
|
If `diff` exits silently, the files are identical.
|
||
|
Another command that would be useful on output with no whitespace
|
||
|
is the `cmp` command, which performes a byte-by-byte comparison of two files
|
||
|
(even files that contain raw binary data):
|
||
|
|
||
|
```
|
||
|
$ cmp rsrc/package-lock.json p.out
|
||
|
```
|
||
|
|
||
|
If the files have identical content, `cmp` exits silently.
|
||
|
If one file is shorter than the other, but the content is otherwise identical,
|
||
|
`cmp` will report that it has reached `EOF` on the shorter file.
|
||
|
Finally, if the files disagree at some point, `cmp` will report the
|
||
|
offset of the first byte at which the files disagree.
|
||
|
If the `-l` flag is given, `cmp` will report all disagreements between the
|
||
|
two files.
|
||
|
|
||
|
## Unit Testing
|
||
|
|
||
|
Unit testing is a part of the development process in which small testable
|
||
|
sections of a program (units) are tested individually to ensure that they are
|
||
|
all functioning properly. This is a very common practice in industry and is
|
||
|
often a requested skill by companies hiring graduates.
|
||
|
|
||
|
> :nerd: Some developers consider testing to be so important that they use a
|
||
|
> work flow called **test driven development**. In TDD, requirements are turned into
|
||
|
> failing unit tests. The goal is then to write code to make these tests pass.
|
||
|
|
||
|
This semester, we will be using a C unit testing framework called
|
||
|
[Criterion](https://github.com/Snaipe/Criterion), which will give you some
|
||
|
exposure to unit testing. We have provided a basic set of test cases for this
|
||
|
assignment.
|
||
|
|
||
|
The provided tests are in the `tests/basecode_tests.c` file. These tests do the
|
||
|
following:
|
||
|
|
||
|
- `validargs_help_test` ensures that `validargs` sets the help bit
|
||
|
correctly when the `-h` flag is passed in.
|
||
|
|
||
|
- `validargs_validate_test` ensures that `validargs` sets the validate-mode bit
|
||
|
correctly when the `-v` flag is passed.
|
||
|
|
||
|
- `validargs_canonicalize_test` ensures that `validargs` sets the canonicalize-mode bit
|
||
|
correctly when the `-c` flag is passed in.
|
||
|
|
||
|
- `validargs_bits_test` ensures that `validargs` sets the decode-mode bit
|
||
|
correctly when the `-d` flag is passed in and that the value passed with `-b`
|
||
|
is correctly stored in the least-signficant byte of `global_options`.
|
||
|
|
||
|
- `validargs_error_test` ensures that `validargs` returns an error when the `-p`
|
||
|
flag is supplied with the `-v` flag.
|
||
|
|
||
|
- `help_system_test` uses the `system` syscall to execute your program through
|
||
|
Bash and checks to see that your program returns with `EXIT_SUCCESS`.
|
||
|
|
||
|
- `argo_basic_test` performs a basic test of the canonicalization mode of the program.
|
||
|
|
||
|
### Compiling and Running Tests
|
||
|
|
||
|
When you compile your program with `make`, an `argo_tests` executable will be
|
||
|
created in your `bin` directory alongside the `argo` executable. Running this
|
||
|
executable from the `hw1` directory with the command `bin/argo_tests` will run
|
||
|
the unit tests described above and print the test outputs to `stdout`. To obtain
|
||
|
more information about each test run, you can use the verbose print option:
|
||
|
`bin/argo_tests --verbose=0`.
|
||
|
|
||
|
The tests we have provided are very minimal and are meant as a starting point
|
||
|
for you to learn about Criterion, not to fully test your homework. You may write
|
||
|
your own additional tests in `tests/basecode_tests.c`, or in additional source
|
||
|
files in the `tests` directory. However, this is not required for this assignment.
|
||
|
Criterion documentation for writing your own tests can be
|
||
|
found [here](http://criterion.readthedocs.io/en/master/).
|
||
|
|
||
|
Note that grades are assigned based on the number of our own test cases
|
||
|
(not given to you in advance) that your program passes.
|
||
|
So you should work on the assignments in such a way that whatever you do submit
|
||
|
will function. Code that is completely broken will not score any points,
|
||
|
regardless of how voluminous it might be or how long you might have spent on it.
|
||
|
|
||
|
## Sample Input Files
|
||
|
|
||
|
In the `rsrc` directory I have placed a few JSON input files for you to try
|
||
|
your code on.
|
||
|
|
||
|
- `numbers.json`: A JSON file containing a single object with various
|
||
|
numbers as its members. This will exercise most (but probably not all)
|
||
|
of the interesting cases that come up in parsing and outputting numbers.
|
||
|
|
||
|
- `strings.json`: A JSON file containing a single array with various
|
||
|
strings as its elements. These are intended to exercise most (but again,
|
||
|
probably not all) of the cases involving escape sequences in strings.
|
||
|
|
||
|
- `package-lock.json`: This is a larger JSON file that I had lying around
|
||
|
which seemed to be a reasonable overall test.
|
||
|
|
||
|
# Hand-in instructions
|
||
|
|
||
|
**TEST YOUR PROGRAM VIGOROUSLY BEFORE SUBMISSION!**
|
||
|
|
||
|
Make sure that you have implemented all the required functions specifed in `const.h`.
|
||
|
|
||
|
Make sure that you have adhered to the restrictions (no array brackets, no prohibited
|
||
|
header files, no modifications to files that say "DO NOT MODIFY" at the beginning,
|
||
|
no functions other than `main()` in `main.c`) set out in this assignment document.
|
||
|
|
||
|
Make sure your directory tree looks basically like it did when you started
|
||
|
(there could possibly be additional files that you added, but the original organization
|
||
|
should be maintained) and that your homework compiles (you should be sure to try compiling
|
||
|
with both `make clean all` and `make clean debug` because there are certain errors that can
|
||
|
occur one way but not the other).
|
||
|
|
||
|
This homework's tag is: `hw1`
|
||
|
|
||
|
`$ git submit hw1`
|
||
|
|
||
|
> :nerd: When writing your program try to comment as much as possible. Try to
|
||
|
> stay consistent with your formatting. It is much easier for your TA and the
|
||
|
> professor to help you if we can figure out what your code does quickly!
|
||
|
|