Homework 1 - CSE 320 - Spring 2022
Professor Eugene Stark
Due Date: Friday 02/18/2022 @ 11:59pm
Read the entire doc before you start
Introduction
In this assignment, you will implement functions for parsing JSON input
and building a data structure to represent its contents and for traversing
the data structure and producing JSON output.
You will use these functions to implement a command-line utility
(called argo) which can validate JSON input and transform JSON input
into JSON output in a "canonical" form.
The goal of this homework is to familiarize yourself with C programming,
with a focus on input/output, bitwise manipulations, and the use of pointers.
For all assignments in this course, you MUST NOT put any of the functions
that you write into the main.c file. The file main.c MUST ONLY contain
#includes, local #defines and the main function (you may of course modify
the main function body). The reason for this restriction has to do with our
use of the Criterion library to test your code.
Beyond this, you may have as many or as few additional .c files in the src
directory as you wish. Also, you may declare as many or as few headers as you wish.
Note, however, that header and .c files distributed with the assignment base code
often contain a comment at the beginning which states that they are not to be
modified. PLEASE take note of these comments and do not modify any such files,
as they will be replaced by the original versions during grading.
😱 Array indexing ('A[]') is not allowed in this assignment. You MUST USE pointer arithmetic instead. All necessary arrays are declared in the
global.h
header file. You MUST USE these arrays. DO NOT create your own arrays. We WILL check for this.
:nerd: Reference for pointers: https://beej.us/guide/bgc/html/#pointers.
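To make the pointer-arithmetic requirement concrete, here is a minimal sketch of comparing two strings without array indexing and without string.h. The helper name str_equal is hypothetical and is not part of the base code; it is only one way such a comparison might be written.

```c
#include <stddef.h>

/* Hypothetical helper (not part of the base code): compares two
 * NUL-terminated strings using only pointer arithmetic.
 * Returns 1 if they are equal and 0 otherwise. */
int str_equal(const char *a, const char *b) {
    if (a == NULL || b == NULL)
        return 0;
    while (*a != '\0' && *a == *b) {
        a++;   /* advance both pointers instead of indexing */
        b++;
    }
    return *a == *b;   /* equal only if both reached '\0' together */
}
```

A call such as str_equal(*(argv + 1), "-h") could then stand in for an indexed comparison of a command-line argument.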
Getting Started
Fetch base code for hw1 as described in hw0. You can find it at this link:
https://gitlab02.cs.stonybrook.edu/cse320/hw1.
IMPORTANT: 'FETCH', DO NOT 'CLONE'.
Both repos will probably have a file named .gitlab-ci.yml with different contents.
Simply merging these files will cause a merge conflict. To avoid this, we will
merge the repos using a flag so that the .gitlab-ci.yml found in the hw1
repo will replace the hw0 version. To merge, use this command:
git merge -m "Merging HW1_CODE" HW1_CODE/master --strategy-option=theirs
😱 Based on past experience, many students will either ignore the above command or forget to use it. The result will be a merge conflict, which will be reported by git. Once a merge conflict has been reported, it is essential to correct it before committing (or to abort the merge without committing -- use
git merge --abort
and go back and try again), because git will have inserted markers into the files involved indicating the locations of the conflicts, and if you ignore this and commit anyway, you will end up with corrupted files. You should consider it important to read up at an early stage on merge conflicts with git and how to resolve them properly.
Here is the structure of the base code:
.
├── .gitlab-ci.yml
└── hw1
    ├── .gitignore
    ├── hw1.sublime-project
    ├── include
    │   ├── argo.h
    │   ├── debug.h
    │   └── global.h
    ├── lib
    │   └── argo.a
    ├── Makefile
    ├── rsrc
    │   ├── numbers.json
    │   ├── package-lock.json
    │   └── strings.json
    ├── src
    │   ├── argo.c
    │   ├── const.c
    │   ├── main.c
    │   └── validargs.c
    ├── test_output
    │   └── .git_keep
    └── tests
        ├── basecode_tests.c
        └── rsrc
            └── strings_-c.json
- The .gitlab-ci.yml file specifies "continuous integration" testing to be performed by the GitLab server each time you push a commit. Usually it will be configured to check that your code builds and runs, and that any provided unit tests are passed. You are free to change this file if you like.
😱 The CI testing is for your own information; it does not directly have anything to do with assignment grading or whether your commit has been properly pushed to the server. If some part of the testing fails, you will see the somewhat misleading message "commit failed" on the GitLab web interface. This does not mean that "your attempt to commit has failed" or that "your commit didn't get pushed to the server"; the very fact that the testing was triggered at all means that you successfully pushed a commit. Rather, it means that "the CI tests performed on a commit that you pushed did not succeed". The purpose of the tests is to alert you to possible problems with your code; if you see that testing has failed it is worth investigating why that has happened. However, the tests can sometimes fail for reasons that are not your fault; for example, the entire CI "runner" system may fail if someone submits code that fills up the system disk. You should definitely try to understand why the tests have failed if they do, but it is not necessary to be overly obsessive about them.
- The hw1.sublime-project file is a "project file" for use by the Sublime Text editor. It is included to try to help Sublime understand the organization of the project so that it can properly identify errors as you edit your code.
- The Makefile is a configuration file for the make build utility, which is what you should use to compile your code. In brief, make or make all will compile anything that needs to be, make debug does the same except that it compiles the code with options suitable for debugging, and make clean removes files that resulted from a previous compilation. These "targets" can be combined; for example, you would use make clean debug to ensure a complete clean and rebuild of everything for debugging.
- The include directory contains C header files (with extension .h) that are used by the code. Note that these files often contain DO NOT MODIFY instructions at the beginning. You should observe these notices carefully where they appear.
- The src directory contains C source files (with extension .c).
- The tests directory contains C source code (and sometimes headers and other files) that are used by the Criterion tests.
- The rsrc directory contains some samples of data files that you can use for testing purposes.
- The test_output directory is a scratch directory where the Criterion tests can put output files. You should not commit any files in this directory to your git repository.
- The lib directory contains a library with binaries for my functions argo_read_value() and argo_write_value(). As discussed below, by commenting out the stubs for these functions in argo.c you can arrange for my versions to be linked with your code, which may help you to get a jump start on understanding some things.
A Note about Program Output
What a program does and does not print is VERY important.
In the UNIX world stringing together programs with piping and scripting is
commonplace. Although combining programs in this way is extremely powerful, it
means that each program must not print extraneous output. For example, you would
expect ls
to output a list of files in a directory and nothing else.
Similarly, your program must follow the specifications for normal operation.
One part of our grading of this assignment will be to check whether your program
produces EXACTLY the specified output. If your program produces output that deviates
from the specifications, even in a minor way, or if it produces extraneous output
that was not part of the specifications, it will adversely impact your grade
in a significant way, so pay close attention.
😱 Use the debug macro debug (described in the 320 reference document in the Piazza resources section) for any other program output or messages you may need while coding (e.g. debugging output).
Part 1: Program Operation and Argument Validation
In this part of the assignment, you will write a function to validate the arguments passed to your program via the command line. Your program will treat arguments as follows:
- If no flags are provided, you will display the usage and return with an EXIT_FAILURE return code.
- If the -h flag is provided, you will display the usage for the program and exit with an EXIT_SUCCESS return code.
- If the -v flag is provided, then the program will read data from standard input (stdin) and validate that it is syntactically correct JSON. If so, the program exits with an EXIT_SUCCESS return code, otherwise the program exits with an EXIT_FAILURE return code. In the latter case, the program will print to standard error (stderr) an error message describing the error that was discovered. No other output is produced.
- If the -c flag is provided, then the program performs the same function as described for -v, but after validating the input, the program will also output to standard output (stdout) a "canonicalized" version of the input. "Canonicalized" means that the output is in a standard form in which possibilities for variation have been eliminated. This is described in more detail below. Unless -p has also been specified, the produced output contains no whitespace (except within strings that contain whitespace characters).
- If the -p flag is provided, then the -c flag must also have been provided. In this case, newlines and spaces are used to format the canonicalized output in a more human-friendly way. See below for the precise requirements on where this whitespace must appear. The INDENT is an optional nonnegative integer argument that specifies the number of additional spaces to be output at the beginning of a line for each increase in indentation level. The format of this argument must be the same as for a nonnegative integer number in the JSON specification. If -p is provided without any INDENT, then a default value of 4 is used.
Note that the program reads data from stdin and writes transformed data
to stdout. Any other printout, such as diagnostic messages produced by the
program, is written to stderr. If the program runs without error, then it
will exit with the EXIT_SUCCESS status code; if any error occurs during the
execution of the program, then it will exit with the EXIT_FAILURE status code.
:nerd: EXIT_SUCCESS and EXIT_FAILURE are macros defined in <stdlib.h> which represent success and failure return codes respectively.
:nerd: stdin, stdout, and stderr are special I/O "streams", defined in <stdio.h>, which are automatically opened at the start of execution for all programs, do not need to be reopened, and (almost always) should not be closed.
The usage scenarios for this program are described by the following message, which is printed by the program when it is invoked without any arguments:
USAGE: bin/argo [-h] [-c|-v] [-p|-p INDENT]
   -h       Help: displays this help menu.
   -v       Validate: the program reads from standard input and checks whether
            it is syntactically correct JSON. If there is any error, then a message
            describing the error is printed to standard error before termination.
            No other output is produced.
   -c       Canonicalize: once the input has been read and validated, it is
            re-emitted to standard output in 'canonical form'. Unless -p has been
            specified, the canonicalized output contains no whitespace (except within
            strings that contain whitespace characters).
   -p       Pretty-print: This option is only permissible if -c has also been specified.
            In that case, newlines and spaces are used to format the canonical output
            in a more human-friendly way. For the precise requirements on where this
            whitespace must appear, see the assignment handout.
            The INDENT is an optional nonnegative integer argument that specifies the
            number of additional spaces to be output at the beginning of a line for each
            for each increase in indentation level. If no value is specified, then a
            default value of 4 is used.
The square brackets indicate that the enclosed argument is optional.
The -c|-v
means that one of -c
or -v
may be specified.
The -p|-p INDENT
means that -p
may be specified alone, or with an optional
additional argument INDENT
.
A valid invocation of the program implies that the following hold about the command-line arguments:
- All "positional arguments" (-h, -c, or -v) come before any optional arguments (-p). The optional arguments (well, there is only one) may come in any order after the positional ones.
- If the -h flag is provided, it is the first positional argument after the program name and any other arguments that follow are ignored.
- If the -h flag is not specified, then exactly one of -v or -c must be specified.
- If -p is given, then it might or might not be followed by an INDENT argument. If the INDENT argument is present, then it must represent a nonnegative integer in the format allowed for integer numbers in the JSON specification.
For example, the following are a subset of the possible valid argument combinations:
$ bin/argo -h ...
$ bin/argo -v
$ bin/argo -c -p
$ bin/argo -c -p 8
😱 The ... means that all arguments, if any, are to be ignored; e.g. the usage bin/argo -h -x -y BLAHBLAHBLAH -z is equivalent to bin/argo -h.
Some examples of invalid combinations would be:
$ bin/argo -p 1 -c
$ bin/argo -v -c
$ bin/argo -v -p 5
$ bin/argo -z 20
😱 You may use only "raw" argc and argv for argument parsing and validation. Using any libraries that parse command line arguments (e.g. getopt) is prohibited.
😱 Any libraries that help you parse strings are prohibited as well (string.h, ctype.h, etc). The use of atoi, scanf, fscanf, sscanf, and similar functions is likewise prohibited. This is intentional and will help you practice parsing strings and manipulating pointers.
😱 You MAY NOT use dynamic memory allocation in this assignment (i.e. malloc, realloc, calloc, mmap, etc.). There is one function (argo_append_char()) provided for you that does the dynamic allocation required while accumulating the characters of a string or numeric literal. This function is in the file const.c, which you are not to modify.
:nerd: Reference for command line arguments: https://beej.us/guide/bgc/html/#command-line-arguments.
NOTE: The make
command compiles the argo
executable into the bin
folder.
All commands from here on are assumed to be run from the hw1
directory.
Required Validate Arguments Function
In global.h
, you will find the following function prototype (function
declaration) already declared for you. You MUST implement this function
as part of the assignment.
int validargs(int argc, char **argv);
The file validargs.c
contains the following specification of the required behavior
of this function:
/**
* @brief Validates command line arguments passed to the program.
* @details This function will validate all the arguments passed to the
* program, returning 0 if validation succeeds and -1 if validation fails.
* Upon successful return, the various options that were specified will be
* encoded in the global variable 'global_options', where it will be
* accessible elsewhere in the program. For details of the required
* encoding, see the assignment handout.
*
* @param argc The number of arguments passed to the program from the CLI.
* @param argv The argument strings passed to the program from the CLI.
* @return 0 if validation succeeds and -1 if validation fails.
* @modifies global variable "global_options" to contain an encoded representation
* of the selected program options.
*/
😱 This function must be implemented as specified as it will be tested and graded independently. It should always return -- the USAGE macro should never be called from validargs.
The validargs
function should return -1 if there is any form of failure.
This includes, but is not limited to:
- Invalid number of arguments (too few or too many).
- Invalid ordering of arguments.
- A missing parameter to an option that requires one [doesn't apply to the current assignment, since the parameter to -p is optional].
- Invalid parameter. A numeric parameter specified with -p is invalid if it does not conform to the format of a nonnegative integer as required by the JSON specification.
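The format check on the -p parameter might be sketched as follows. The helper name parse_indent is hypothetical. This sketch assumes that "nonnegative integer in JSON format" means either the single digit 0 or a nonzero digit followed by more digits, with no sign. Whether values that do not fit in the least significant byte of global_options should also be rejected is not stated here, so the sketch only guards against overflowing an int.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper: parses a nonnegative integer written in JSON
 * integer format ("0", or a nonzero digit followed by more digits).
 * Returns the value, or -1 if the string is malformed or too large
 * to fit in an int. */
int parse_indent(const char *s) {
    if (s == NULL || *s < '0' || *s > '9')
        return -1;                      /* empty or not a digit */
    if (*s == '0' && *(s + 1) != '\0')
        return -1;                      /* leading zeros are not allowed */
    int value = 0;
    for (; *s != '\0'; s++) {
        if (*s < '0' || *s > '9')
            return -1;                  /* stray non-digit character */
        int digit = *s - '0';
        if (value > (INT_MAX - digit) / 10)
            return -1;                  /* would overflow an int */
        value = value * 10 + digit;
    }
    return value;
}
```

Note that the sketch uses only pointer dereferences and comparisons against character constants, in keeping with the restrictions on string.h, ctype.h, and atoi.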
The global_options variable of type int is used to record the mode
of operation (i.e. validate/canonicalize) of the program and associated parameters.
This is done as follows:
- If the -h flag is specified, the most significant bit (bit 31) is 1.
- If the -v flag is specified, the second-most significant bit (bit 30) is 1.
- If the -c flag is specified, the third-most significant bit (bit 29) is 1.
- If the -p flag is specified, the fourth-most significant bit (bit 28) is 1.
- The least significant byte (bits 7 - 0) records the number of spaces of indentation per level specified with -p, or the default value (4) if no value was specified with -p. If -p was not specified at all, then this byte should be 0.
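The encoding described in the list above can be sketched with bit masks. The macro and function names below are hypothetical and are not taken from the base code. The sketch works on an unsigned int to avoid signed-overflow questions around bit 31; storing the result in the int variable global_options would need a conversion, which is an assumption about how you choose to handle it.

```c
/* Hypothetical mask names; the bit positions come from the list above. */
#define HELP_BIT         0x80000000u  /* bit 31: -h */
#define VALIDATE_BIT     0x40000000u  /* bit 30: -v */
#define CANONICALIZE_BIT 0x20000000u  /* bit 29: -c */
#define PRETTY_BIT       0x10000000u  /* bit 28: -p */
#define INDENT_MASK      0x000000ffu  /* bits 7-0: indentation */

/* Builds an options word from individual flags (each 0 or 1) and an
 * indentation value, which is only recorded when -p was given. */
unsigned int encode_options(int h, int v, int c, int p, int indent) {
    unsigned int opts = 0;
    if (h) opts |= HELP_BIT;
    if (v) opts |= VALIDATE_BIT;
    if (c) opts |= CANONICALIZE_BIT;
    if (p) opts |= PRETTY_BIT | ((unsigned int)indent & INDENT_MASK);
    return opts;
}

/* Extracts the indentation increment from an options word. */
int indent_of(unsigned int opts) {
    return (int)(opts & INDENT_MASK);
}
```

For instance, encode_options(0, 0, 1, 1, 2) sets bits 29 and 28 and stores 2 in the low byte, giving 0x30000002.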
If validargs
returns -1 indicating failure, your program must call
USAGE(program_name, return_code)
and return EXIT_FAILURE
.
Once again, validargs
must always return, and therefore it must not
call the USAGE(program_name, return_code)
macro itself.
That should be done in main
.
If validargs
sets the most-significant bit of global_options
to 1
(i.e. the -h
flag was passed), your program must call USAGE(program_name, return_code)
and return EXIT_SUCCESS
.
:nerd: The USAGE(program_name, return_code) macro is already defined for you in argo.h.
If validargs returns 0, then your program must read input data from stdin
and (depending on the options supplied) write output data to stdout
.
Upon successful completion, your program should exit with exit status EXIT_SUCCESS
;
otherwise, in case of an error it should exit with exit status EXIT_FAILURE
.
Unless the program has been compiled for debugging (using make debug
),
in a successful run that exits with EXIT_SUCCESS
no other output may be produced
by the program. In an unsuccessful run in which the program exits with EXIT_FAILURE
the program should output to stderr
a one-line diagnostic message that indicates
the reason for the failure. The program must not produce any other output than this
unless it has been compiled for debugging.
:nerd: Remember EXIT_SUCCESS and EXIT_FAILURE are defined in <stdlib.h>. Also note, EXIT_SUCCESS is 0 and EXIT_FAILURE is 1.
Example validargs Executions
The following are examples of the setting of global_options
and the
other global variables for various command-line inputs.
Each input is a bash command that can be used to invoke the program.
- Input: bin/argo -h. Setting: global_options=0x80000000 (help bit is set, other bits clear).
- Input: bin/argo -v. Setting: global_options=0x40000000 (mode is "validate").
- Input: bin/argo -c -p 2. Setting: global_options=0x30000002 (mode is "canonicalize", "pretty-print" has been specified with indentation increment 2).
- Input: bin/argo -p 2 -c. Setting: global_options=0x0. This is an error case because the specified argument ordering is invalid (-p is before -c). In this case validargs returns -1, leaving global_options unset.
Part 2: Overview of the JSON Specification
JSON ("JavaScript Object Notation") is a standard format for data interchange that is now commonly used in many areas of computing. It was designed to be extremely simple to generate and parse and it in fact achieves these goals: JSON syntax is about as simple as it gets for a computer language that is actually used in the real world. The syntax of JSON is defined by an ECMA standard. A summary that omits the scarier language from the standard document is given at www.json.org. Most likely, you will only need to refer to this summary, but the full standard document is here if you want to look at it.
In order to understand the JSON syntax specification, you need to be able to
read the "railroad diagrams" that are used to formally specify it.
These diagrams are actually a graphical version of a context-free grammar,
which is a standard tool used for formally specifying all kinds of computer
languages. Actually, the white box inset on the right contains the full
grammar; the railroad diagrams only describe the portion of the syntax that
has any significant complexity.
Each of the railroad diagrams defines the syntax of a particular
"syntactic category", which is a set of strings having a similar format.
Examples of syntactic categories for JSON are "object", "array",
"value", "number", etc.
The paths along the "railroad tracks" in the diagram for one syntactic category
indicate the possibilities for forming a string in that category from strings
in other categories.
For example, the first diagram says that a string in the category "object"
always has an initial curly bracket {
. This may be followed immediately by
a closing curly bracket }
(the top "track"), or between the brackets there
may be something more complicated (a list of "members" -- the lower "track").
By following the lower track, you find that there has to be "whitespace",
followed by a "string", followed by "whitespace", followed by a colon :
,
followed by a "value". After the "value", it is possible to have the
closing curly bracket }
or to loop back around and have another instance of
the same pattern that was just seen (a "member"). The path to loop back around
requires that a comma ,
appear before the next member, so this tells you
that the members in between the {
and }
are separated by commas.
The other diagrams are read similarly, and even if you have never seen these
before, with a little study they should be self-explanatory so I'm not going
to belabor the explanation further.
Something that was not initially clear to me from just looking at the diagrams
was what the syntax of true
, false
, and null
is. These are shown with
double quotes in the inset box on the right, but in fact, the "token" true
simply consists of the four letters: t
, r
, u
, e
without any quotes.
This is spelled out better in the ECMA standard document.
The description of "character" in the inset box is also a bit mysterious
at first reading. A "character" is something that is permitted to occur within
a string literal. After staring at the description for a while it becomes clear
that any Unicode code point except for (1) the "control characters"
whose code points range from U+0000 to U+001F, (2) the quote ",
and (3) the backslash \ (they call it "reverse solidus"), may appear directly
representing itself within a string literal.
In addition, "escape sequences" are permitted. An escape sequence starts
with a backslash \, which may be followed by one of the characters
", '\', '/', 'b', 'f', 'n', 'r', 't', or 'u'. After 'u' there are required
to appear exactly four hexadecimal digits, the letters of which may either
be in upper case or lower case. The meaning of \", \\, \b, \f,
\n, \r, and \t is as in a C string. The escape sequence \/ represents
a single forward slash ("solidus") /. (I do not know why this is in the
standard.) The meaning of \uHHHH, where HHHH are four hex digits, is
the 16-bit Unicode code point from the "basic multilingual plane" whose
value is given by interpreting HHHH as a four-digit hexadecimal number.
Although a Unicode code point outside the basic multilingual plane may
occur directly in a string literal, representing such by an escape requires
the use of a "surrogate pair" as indicated in the ECMA standard document.
Don't worry about this technicality. For this assignment, your implementation
will not have to handle full Unicode and UTF-8-encoded input.
You may assume instead that the input to your program will come as a sequence of
8-bit bytes, each of which directly represents a Unicode code point in the
range U+0000 to U+00FF (the first 128 code points correspond to ASCII codes,
and the meaning of the next 128 code points is defined by the Unicode standard).
Note that this means that we are not using the usual UTF-8 encoding to
represent Unicode as a sequence of 8-bit bytes.
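As an illustration of how the four hex digits of a \uHHHH escape determine a code point, here is a minimal sketch. Both helper names are hypothetical. For simplicity the sketch reads the digits from a string in memory; in the actual parser they would arrive one at a time from the input stream.

```c
/* Hypothetical helper: converts one hexadecimal digit (either case)
 * to its value 0-15, or returns -1 if c is not a hex digit. */
int hex_value(int c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Hypothetical helper: interprets the four characters at p as the HHHH
 * part of a \uHHHH escape. Returns the code point (0x0000-0xFFFF),
 * or -1 if any of the four characters is not a hex digit. */
int parse_hex4(const char *p) {
    int code = 0;
    for (const char *end = p + 4; p < end; p++) {
        int v = hex_value(*p);
        if (v < 0)
            return -1;            /* also catches a string that ends early */
        code = (code << 4) | v;   /* shift in one 4-bit digit */
    }
    return code;
}
```

Each digit contributes four bits, so four digits cover exactly the range U+0000 to U+FFFF.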
As you will see when you look at the definitions of the data structures you
are to use, internally your program will use the 32-bit int
(typedef'ed as ARGO_CHAR
) to represent a character.
This is enough bits to represent any Unicode code point, so there will
be no problem in taking the input bytes that you read in and storing them
internally as Unicode code points. Due to the limitation of the input encoding,
for us a string literal will not be able to directly contain any Unicode
code point greater than U+00FF.
Nevertheless, you will still be able to use escape sequences within
a string literal to represent Unicode code points in the basic multilingual
plane (from U+0000 to U+FFFF), because the escape sequence allows you
to specify the code point directly as four hexadecimal digits.
Since we will also output JSON as a sequence of 8-bit bytes, it will be
necessary to render any Unicode code points greater than U+00FF occurring
in a string literal using escapes.
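Rendering a code point as an escape is the reverse of reading one: each group of four bits becomes one hex digit. The sketch below uses a hypothetical helper name, and its choice of lowercase hex letters is an assumption; the required canonical form for output is specified elsewhere in this handout and in argo.c, and takes precedence.

```c
#include <stdio.h>

/* Hypothetical helper: writes code point c (0x0000-0xFFFF) to f as a
 * \uXXXX escape. Lowercase hex digits are an assumption here.
 * Returns 0 on success, -1 on a write error. */
int write_unicode_escape(int c, FILE *f) {
    if (fputc('\\', f) == EOF || fputc('u', f) == EOF)
        return -1;
    for (int shift = 12; shift >= 0; shift -= 4) {
        int digit = (c >> shift) & 0xf;            /* next 4-bit group */
        int ch = digit < 10 ? '0' + digit : 'a' + (digit - 10);
        if (fputc(ch, f) == EOF)
            return -1;
    }
    return 0;
}
```

The function writes to whatever stream it is given rather than assuming stdout, which matches the way the required function interfaces take a FILE * parameter.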
When reading a specification like this, it is helpful to have examples of
what is being defined. For this purpose, I have provided (in the rsrc
directory) some sample JSON files. These files all have the .json
extension. Some of these files are examples of what your program is supposed
to do when given other files as input. For example, the file rsrc/numbers.json
contains the following content.
{
"0": 0,
"2147483648": 2147483648,
"-2147483649": 2147483649,
"0.0": 0.0,
"1": 1,
"789": 789,
"1.0": 1.0,
"-1.0": -1.0,
"1e3": 1e3,
"1E3": 1E3,
"1e-3": 1e-3,
"1.234": 1.234,
"-1.234": -1.234,
"1.234e3": 1.234e3,
"1.234e-3": 1.234e-3
}
When your program is run as follows
$ bin/argo -c -p 2 < rsrc/numbers.json
it should produce the output in rsrc/numbers_-c_-p_2.json; namely
{
"0": 0,
"2147483648": 2147483648,
"-2147483649": 2147483649,
"0.0": 0.0,
"1": 1,
"789": 789,
"1.0": 0.1e1,
"-1.0": -0.1e1,
"1e3": 0.1e4,
"1E3": 0.1e4,
"1e-3": 0.1000000000000000e-2,
"1.234": 0.1233999999999999e1,
"-1.234": -0.1233999999999999e1,
"1.234e3": 0.1233999999999999e4,
"1.234e-3": 0.1234000000000000e-2
}
How this is supposed to happen is explained below.
Part 3: Implementation
The header file global.h
lists prototypes for functions you are
required to implement:
ARGO_VALUE *argo_read_value(FILE *);
int argo_read_string(ARGO_STRING *s, FILE *);
int argo_read_number(ARGO_NUMBER *n, FILE *);
int argo_write_value(ARGO_VALUE *, FILE *);
int argo_write_string(ARGO_STRING *, FILE *);
int argo_write_number(ARGO_NUMBER *, FILE *);
int validargs(int argc, char **argv);
The validargs()
function has already been discussed above.
The argo_read_value()
function reads JSON input from the specified stream
and returns an ARGO_VALUE
data structure (as described below).
The argo_read_string()
function takes a pointer to an ARGO_STRING
structure (which will be a sub-structure of an ARGO_VALUE
structure),
as well as a FILE *
pointer, and it reads a JSON string literal
(starting and ending with a quote "
) from the input stream and stores
the content of the string (without the quotes, after handling escapes)
in the specified ARGO_STRING
object.
The argo_read_number()
function works similarly, except it reads
a JSON numeric literal and uses it to initialize an ARGO_NUMBER
structure.
The argo_write_value()
function takes an ARGO_VALUE
data structure
and a FILE *
pointer representing an output stream, and it writes
canonical JSON representing the specified value to the output stream.
The argo_write_string()
function takes an ARGO_STRING *
pointer
and a FILE *
pointer and writes a JSON string literal to the output
stream (including quotes and escaping content that needs to be escaped).
The argo_write_number()
function similarly takes an ARGO_NUMBER *
pointer and a FILE *
pointer and it writes a JSON numeric literal
to the output stream.
😱 Even though your final application will only ever read JSON input from
stdin
and write JSON output tostdout
, the interfaces of these functions are designed to accept arbitrary streams as parameters. You must not ignore these parameters. Also, you must not assume that these streams are "seekable" and consequently you may not use the functionsfseek()
orftell()
in your code.
Besides the general discussion below, more detailed specifications for the
required behavior of these functions are given in the comments preceding
the (non-functional) stubs in argo.c
. Those specifications are mostly
not repeated here to avoid redundancy and possible inconsistencies between
this document and the specifications in argo.c
.
Of course, you will also have to make modifications to the main()
function,
so that after calling validargs()
it makes the calls to
argo_read_value()
and argo_write_value()
to perform the functions required
of the complete application.
Since I want everybody to get the experience of designing and coding their own implementation for this assignment, I have not spelled out any further what other functions you might want to implement, but you will almost certainly want to implement other functions. Note that the function interfaces that have been specified, together with the problems that have to be solved by these functions, give you clues about an implementation structure that you might wish to consider. I will now discuss this briefly.
The argo_read_value() function is supposed to read bytes of data from a
stream and attempt to parse them as a JSON "value" (which could be
an object, array, string, number, or one of the basic tokens true,
false, or null). The result of this parsing process is a data structure
that represents the structure of the JSON in a form that is useful for
further processing. The specification of the syntax has a recursive
structure (e.g. an object contains members, members contain elements, which
can themselves contain values, and so on). A simple way to parse a string
according to a recursive specification like this is via a so-called
recursive descent parser. Basically, the parser will have a function
for each of the syntactic categories that appear in the syntax specification
(argo_read_value()
is one such function). Each of these functions will
be called at a point where what is expected on the input stream is a string
belonging to the syntactic category handled by that function.
The function will read one or more characters from the input stream and, based
on what it sees, it will recursively call one or more of the other parser
functions. For example, the function responsible for parsing an "object"
might check that the next character in the input is a curly brace {
and then call the function responsible for parsing a "member".
Each parsing function will return a data structure that represents what it
has parsed. To build this data structure, each parsing function will
typically need to make use of the data structures returned by the functions
that it called recursively.
In general, each function in a recursive descent parser will need to examine
a certain amount of the input in order to determine what to do. This input
is called "look-ahead". One of the features of the JSON syntax that makes
it so easy to parse is that at most one character of look-ahead is ever
required in order to decide what to do next. For example, once we have
seen the {
that starts an object, checking whether the next character is
a }
or not is sufficient to tell whether we have to call functions
to parse members of the object, or whether the object is empty.
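The shape of a recursive descent function with one character of look-ahead can be seen in a deliberately tiny example. The grammar below is a toy (nested, comma-separated bracket lists) and is not JSON, and the function name is hypothetical; it returns a nesting depth instead of building a data structure, just to keep the sketch short.

```c
#include <stdio.h>

/* Toy grammar, not JSON:   list := '[' ( list ( ',' list )* )? ']'
 * Hypothetical function: parses one list from f and returns its nesting
 * depth (1 for "[]"), or -1 on a syntax error. */
int parse_list(FILE *f) {
    if (fgetc(f) != '[')
        return -1;
    int depth = 1;
    int c = fgetc(f);            /* one character of look-ahead */
    if (c == ']')
        return depth;            /* empty list */
    ungetc(c, f);                /* not ours: leave it for the callee */
    for (;;) {
        int inner = parse_list(f);   /* recursive call for a nested list */
        if (inner < 0)
            return -1;
        if (inner + 1 > depth)
            depth = inner + 1;
        c = fgetc(f);
        if (c == ']')
            return depth;
        if (c != ',')
            return -1;           /* expected ',' or ']' */
    }
}
```

The decision between "empty list" and "list with members" is made from a single character, which mirrors the { versus } decision described above for JSON objects.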
In implementing a parser like this, it generally simplifies the design
if you can "peek" at the look-ahead character without consuming it.
That way, when you call another function, it can assume that the input
stream is at the very start of what it is supposed to be trying to parse,
rather than having to keep track of what characters might already have
been read by the caller.
You should use the fgetc()
function from the C standard I/O library
to read each byte of data from the input stream. This function consumes
the byte of data from the input stream, but the standard I/O library
also provides a function ungetc()
that allows you to "push back" a single
character of input. So you can achieve the effect of peeking one character
into the input stream by calling fgetc()
, looking at the character returned,
and then using ungetc()
to push it back into the stream if it is not
to be consumed immediately. In some cases, as you descend through recursive
calls, the same character might be examined and pushed back repeatedly.
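The peeking technique just described fits in a few lines. The helper name peek_char is hypothetical; fgetc() and ungetc() are the standard I/O library functions named above.

```c
#include <stdio.h>

/* Hypothetical helper: returns the next byte of the stream without
 * consuming it, or EOF if no more input is available. */
int peek_char(FILE *f) {
    int c = fgetc(f);
    if (c != EOF)
        ungetc(c, f);   /* push the byte back so the next read sees it */
    return c;
}
```

Because only one character of push-back is guaranteed by the C standard, a helper like this should be followed by a real read before peeking further ahead.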
The recursive structure also dictates a natural form for the implementation
of the output function argo_write_value()
: you can have one function
for each meaningful entity (e.g. "object", "member", "number") in the
JSON specification and these functions will call each other recursively
in order to traverse the data structure and emit characters to the output
stream.
Part 4: Data Structures
The argo.h
header file gives C definitions for the data structures you are to
produce as the return values from argo_read_value()
and as the arguments
to argo_write_value()
. These data structures are basically trees.
The ARGO_VALUE
structure is the central definition, which spells out what
information is in a node of such a tree. As the same ARGO_VALUE
structure
is used to represent all the types of JSON values ("object", "array", "number",
etc.) it has a type
field to indicate specifically what type of object
each individual instance represents. The possible types are defined by the
ARGO_VALUE_TYPE
enumeration. Each node also has a content
field, which
is where the actual content of the node is stored. The content
field
is defined using the C union
type, which allows the same region of memory
to be used to store different types of things at different times.
Depending on what is in the type
field, exactly one of the object
, array
,
string
, number
, or basic
subfields of content
will be valid.
Except for ARGO_BASIC
, which just defines a set of possible values,
each of these has its own structure definition, which are given as
ARGO_OBJECT
, ARGO_ARRAY
, ARGO_STRING
, and ARGO_NUMBER
.
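If you have not used a union together with a type tag before, the following sketch shows the general idea. The DEMO_ types here are simplified stand-ins invented for illustration; the real definitions are the ones in argo.h, and they have more cases and more fields.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins; the real definitions are in argo.h. */
typedef enum { DEMO_NUMBER_TYPE, DEMO_BASIC_TYPE } DEMO_TYPE;
typedef enum { DEMO_NULL, DEMO_TRUE, DEMO_FALSE } DEMO_BASIC;

typedef struct demo_value {
    DEMO_TYPE type;            /* tells which union member is valid */
    union {
        long number;           /* valid when type == DEMO_NUMBER_TYPE */
        DEMO_BASIC basic;      /* valid when type == DEMO_BASIC_TYPE */
    } content;
} DEMO_VALUE;

/*
 * Example of dispatching on the type tag: treat nonzero numbers and
 * "true" as true, everything else as false.
 */
int demo_is_truthy(const DEMO_VALUE *v)
{
    assert(v != NULL);
    switch (v->type) {
    case DEMO_NUMBER_TYPE:
        return v->content.number != 0;
    case DEMO_BASIC_TYPE:
        return v->content.basic == DEMO_TRUE;
    }
    return 0;
}
```

The important point is that the code looks at the type field first and only then touches the matching member of content; reading any other member would give meaningless results.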
Besides the type
and content
fields, each ARGO_VALUE
node contains
next
and prev
fields that point to other ARGO_VALUES
. These fields
will be used to link each node with other "sibling" nodes into a list.
For example, a JSON "object" has a list of "members".
The JSON object will be represented by an ARGO_VALUE
node having
ARGO_OBJECT_TYPE
in its type
field. The content
field of this
object will therefore be used to hold an ARGO_OBJECT
structure.
The ARGO_OBJECT
structure has a single field: member_list
, which
points to a "dummy" ARGO_VALUE
structure used as the head of
a circularly, doubly linked list of members (more on this below).
Linked into this list will be ARGO_VALUE
structures that represent
the members. The next
and prev
fields of these are used to chain
the members into a list: the next
field points from a member to
the next member and the prev
field points from a member to the previous
member. For ARGO_VALUE
structures used to represent members of
an object, the name
field will contain an ARGO_STRING
structure
that represents the name of the member.
JSON arrays are represented similarly to JSON objects: the array as a
whole is represented by an ARGO_VALUE
structure whose type
field
contains ARGO_ARRAY_TYPE
. The content
field will therefore be used
to hold an ARGO_ARRAY
structure, which has an element_list
field that
points to a "dummy" ARGO_VALUE
structure at the head of a list of elements,
similarly to what was just described for object members.
However, array elements don't have names, so the name
field of each
array element will just be NULL
.
JSON strings are represented by the ARGO_STRING
structure, which
has fields capacity
, length
, and content
. These are used to
represent a dynamically growable string, similarly to the way
ArrayList
is implemented in Java.
At any given time, the content
field will either be NULL
(if the string is empty) or it will point to an array of ARGO_CHAR
elements, each of which represents a single Unicode code point.
The capacity
field tells the total number of "slots" in this
array, whereas the length
field tells how many of these are
actually used (i.e. it gives the current length of the string).
For this assignment, you don't have to actually be concerned with
the dynamic allocation -- that is performed by the function
argo_append_char()
which has been implemented for you in const.c
.
All you have to worry about is making sure that the fields
of an ARGO_STRING
structure that you want to use have been
initialized to zero and then you can just call argo_append_char()
to build the string content.
A simple for
loop using the length
field as the upper limit
can then be used to traverse the content
array of an ARGO_STRING
once it has been initialized.
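To make the capacity/length idea concrete, here is a minimal sketch of a growable string. DEMO_STRING, DEMO_CHAR, demo_append_char() and demo_count_char() are hypothetical stand-ins; the field types and the doubling policy are assumptions, and the real argo_append_char() may well differ in its signature and behavior.

```c
#include <assert.h>
#include <stdlib.h>

typedef int DEMO_CHAR;               /* stand-in for ARGO_CHAR */

typedef struct demo_string {
    size_t capacity;                 /* number of slots allocated */
    size_t length;                   /* number of slots in use */
    DEMO_CHAR *content;              /* NULL while nothing is allocated */
} DEMO_STRING;

/* Append c, growing the array when it is full. Returns 0, or -1 on failure. */
int demo_append_char(DEMO_STRING *s, DEMO_CHAR c)
{
    assert(s != NULL);
    if (s->length == s->capacity) {
        size_t ncap = s->capacity ? 2 * s->capacity : 8;
        DEMO_CHAR *p = realloc(s->content, ncap * sizeof(DEMO_CHAR));
        if (p == NULL)
            return -1;
        s->content = p;
        s->capacity = ncap;
    }
    *(s->content + s->length) = c;   /* pointer arithmetic, no brackets */
    s->length++;
    return 0;
}

/* Traverse the content with a simple loop bounded by length. */
size_t demo_count_char(const DEMO_STRING *s, DEMO_CHAR c)
{
    size_t n = 0;
    for (size_t i = 0; i < s->length; i++)
        if (*(s->content + i) == c)
            n++;
    return n;
}
```

A structure whose fields are all zero is a valid empty string for this sketch, which is why zero-initialization before the first append matters.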
The ARGO_NUMBER
structure is used to represent a number.
One of its fields is a string_value
field, which is an ARGO_STRING
used to hold the digits and other characters that make up the
textual representation of the number. During parsing, characters
are accumulated in this field using argo_append_char()
in the
same way that characters are accumulated for a string value.
The remaining fields (int_value
, float_value
) are used to store
an internal representation (either integer or floating-point)
of the value of the number, as well as flags (valid_string
,
valid_int
, valid_float
) that tell which of the other fields
contain valid information. Note that a JSON number that contains
a fractional part or an exponent part will generally not be representable
in integer format, so the valid_int
field should be zero and there
will be no useful information in the int_value
field. Also, if an
ARGO_NUMBER
is created internally, without parsing it from an input
stream, then a printable representation has not yet been computed, so the
valid_string
field will be zero and the string_value
field
will represent an empty string.
To summarize, your argo_read_value()
function will read bytes of
data from the specified input stream using fgetc()
.
As it reads and parses the input, it will build up a tree of
ARGO_VALUE
nodes to represent the structure of the JSON input.
The nodes of the resulting tree must satisfy the following requirements:
- A node with ARGO_OBJECT_TYPE in its type field represents a JSON "object". The content field then contains an ARGO_OBJECT structure whose member_list field points to an ARGO_VALUE node that is the head of a circular, doubly linked list of members. Each member has a pointer to its associated name (an ARGO_STRING) stored in the name field.
- A node with ARGO_ARRAY_TYPE in its type field represents a JSON "array". The content field then contains an ARGO_ARRAY structure whose element_list field points to an ARGO_VALUE node that is the head of a circular, doubly linked list of elements.
- A node with ARGO_STRING_TYPE in its type field represents a JSON "string" (without the enclosing quotes that appear in JSON source). The content field then contains an ARGO_STRING that represents the string. The length field of the ARGO_STRING gives the length of the string and the content field points to an array of ARGO_CHAR values that are the content of the string.
- A node with ARGO_NUMBER_TYPE in its type field represents a JSON "number". The content field then contains an ARGO_NUMBER object that represents the number in various ways.
  - If the valid_string field is nonzero, then the string_value field will contain an ARGO_STRING that holds the characters that make up a printable/parseable representation of the number.
  - If the valid_int field is nonzero, then the int_value field will contain the value of the number as a C long.
  - If the valid_float field is nonzero, then the float_value field will contain the value of the number as a C double.

  If there is more than one representation of the number present, then they are required to agree with each other (i.e. represent the same value).
- A node with ARGO_BASIC_TYPE in its type field will have a content field having a value of type ARGO_BASIC in its basic field. This value will be one of ARGO_TRUE, ARGO_FALSE, or ARGO_NULL.
The argo_read_string()
function will parse a JSON string literal
and store the string content into an ARGO_STRING
object.
Characters in the input that are not control characters and are not
one of the characters that must be escaped are simply appended directly
to the string content. However, when a backslash \
is encountered,
it is necessary to interpret it as the start of an escape sequence
that represents the character to be appended. These escape sequences
should be familiar, since they are essentially the same as those
used in Java as well as in C.
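For reference, the escapes defined by JSON are \", \\, \/, \b, \f, \n, \r, \t, and \u followed by four hex digits. A minimal sketch of how they might be decoded is shown below; the helper names demo_unescape, demo_hex_value and demo_hex4 are hypothetical, and demo_hex4 reads from a string only to keep the example self-contained.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Map the character following a backslash to the character it denotes.
 * Returns -1 if c does not start a simple escape (\u is handled separately).
 */
int demo_unescape(int c)
{
    switch (c) {
    case '"':  return '"';
    case '\\': return '\\';
    case '/':  return '/';
    case 'b':  return '\b';
    case 'f':  return '\f';
    case 'n':  return '\n';
    case 'r':  return '\r';
    case 't':  return '\t';
    default:   return -1;
    }
}

/* Value of a hex digit (either case), or -1 if c is not one. */
int demo_hex_value(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Combine four hex digits (as in \uXXXX) into a code point, or -1 on error. */
int demo_hex4(const char *digits)
{
    assert(digits != NULL);
    int value = 0;
    for (int i = 0; i < 4; i++) {
        int d = demo_hex_value(*(digits + i));
        if (d < 0)
            return -1;
        value = value * 16 + d;
    }
    return value;
}
```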
The argo_read_number()
function will parse a JSON numeric literal
and store into an ARGO_NUMBER
object not only the sequence of
characters that constitute the literal, but also the value of the
number, either in integer format, in floating point format, or both.
In order to do this, you have to actually process the various digits
one at a time and calculate (using integer and/or floating point
arithmetic) the value that is represented. You have to carry out
this conversion yourself; you are not allowed to use any library
functions to do it.
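The digit-by-digit calculation can be sketched roughly as follows. The function names are hypothetical, the sketch reads from a string cursor rather than a stream to stay self-contained, and it ignores signs, exponents and overflow, all of which a real implementation has to deal with.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Accumulate the value of a run of decimal digits starting at *pp.
 * Advances *pp past the digits and returns the number of digits read.
 */
int demo_read_digits(const char **pp, long *value)
{
    assert(pp != NULL && *pp != NULL && value != NULL);
    int n = 0;
    *value = 0;
    while (**pp >= '0' && **pp <= '9') {
        *value = *value * 10 + (**pp - '0');   /* shift left, add digit */
        (*pp)++;
        n++;
    }
    return n;
}

/*
 * Accumulate the value of the digits that follow a decimal point:
 * the first digit counts tenths, the second hundredths, and so on.
 */
double demo_read_fraction(const char **pp)
{
    assert(pp != NULL && *pp != NULL);
    double value = 0.0, scale = 0.1;
    while (**pp >= '0' && **pp <= '9') {
        value += (**pp - '0') * scale;
        scale /= 10.0;
        (*pp)++;
    }
    return value;
}
```

Because floating-point arithmetic is inexact, the fraction computed this way should be compared against an expected value using a small tolerance rather than exact equality.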
Circular, Doubly Linked Lists
As already stated, object members and array elements are to be stored as
circular, doubly linked lists of ARGO_VALUE
structures, using a "dummy"
structure as a sentinel. Even though the sentinel has the same type
(i.e. ARGO_VALUE
) as the elements of the list, it does not itself represent
an element of the list. The only fields used in the sentinel are the
next
field, which points to the first actual element of the list,
and the prev
field, which points to the last actual element of the list.
The list is "circular" because starting at the sentinel and following
next
pointers will eventually lead back to the sentinel again.
Similarly, starting at the sentinel and following prev
pointers will
eventually lead back to the sentinel. An empty list is represented
by a sentinel whose next
and prev
pointers point back to the sentinel
itself. You can read more about this type of data structure by searching,
e.g. Wikipedia for "circularly doubly linked list". The advantage of
using the sentinel is that all insertions and deletions are performed
in exactly the same way, without any edge cases for the first or last
element of the list.
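Here is a minimal sketch of such a list. DEMO_NODE and the demo_list_ functions are hypothetical stand-ins for illustration; in the assignment the nodes are ARGO_VALUE structures and the links are their next and prev fields.

```c
#include <assert.h>
#include <stddef.h>

typedef struct demo_node {
    int payload;                    /* stand-in for the real node content */
    struct demo_node *next;
    struct demo_node *prev;
} DEMO_NODE;

/* Make the sentinel represent an empty list. */
void demo_list_init(DEMO_NODE *sentinel)
{
    assert(sentinel != NULL);
    sentinel->next = sentinel;
    sentinel->prev = sentinel;
}

/* Insert node just before the sentinel, i.e. at the end of the list. */
void demo_list_append(DEMO_NODE *sentinel, DEMO_NODE *node)
{
    assert(sentinel != NULL && node != NULL);
    node->next = sentinel;
    node->prev = sentinel->prev;
    sentinel->prev->next = node;
    sentinel->prev = node;
}

/* Count elements by following next pointers until we are back at the sentinel. */
size_t demo_list_length(const DEMO_NODE *sentinel)
{
    size_t n = 0;
    for (const DEMO_NODE *p = sentinel->next; p != sentinel; p = p->next)
        n++;
    return n;
}
```

Notice that demo_list_append() contains no special case for an empty list: when the list is empty, sentinel->prev is the sentinel itself and the same four assignments still do the right thing.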
Dynamic Storage
There are two types of dynamic storage used by this program.
One of these is for the content of an ARGO_STRING
. As already indicated
above, this is handled for you by the function argo_append_char()
and you
do not have to worry about how it happens.
The other dynamic storage is for ARGO_VALUE
structures.
You need a source of such structures while you are building the tree
that represents JSON.
As you are prohibited from declaring your own arrays in this
assignment, you will have to use one that we have already declared for you.
In global.h
an array argo_value_storage
has been defined for you,
together with an associated counter argo_next_value
. You must use
this array as the source of ARGO_VALUE
structures for building your
JSON trees. Use the argo_next_value
counter to keep track of the
index of the first unused element of this array.
When you need an ARGO_VALUE
structure, get a pointer to the first unused
element of the argo_value_storage
array and increment argo_next_value
.
Be sure to pay attention to the total number NUM_ARGO_VALUES
of
elements of this array -- if you run off the end you will corrupt other
memory and your program will have unpredictable behavior.
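The allocation scheme just described might look roughly like this. Everything prefixed with demo_ or DEMO_ is a hypothetical stand-in: the sketch declares its own small array only so that it is self-contained, whereas in the assignment you must use argo_value_storage, argo_next_value and NUM_ARGO_VALUES as declared in global.h, and the type of the real counter may differ.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_NUM_VALUES 4                 /* stand-in for NUM_ARGO_VALUES */

typedef struct demo_value { int type; } DEMO_VALUE;

/* Stand-ins for argo_value_storage and argo_next_value from global.h. */
DEMO_VALUE demo_value_storage[DEMO_NUM_VALUES];
int demo_next_value = 0;

/* Hand out the first unused element, or NULL if the array is exhausted. */
DEMO_VALUE *demo_alloc_value(void)
{
    assert(demo_next_value >= 0);
    if (demo_next_value >= DEMO_NUM_VALUES)
        return NULL;                      /* never run off the end */
    return demo_value_storage + demo_next_value++;
}
```

Returning NULL when the array is full lets the caller report an error instead of silently corrupting memory.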
Part 5: Canonical Output
Your argo_write_value()
function is supposed to traverse a data structure
such as that returned by argo_read_value()
and it is supposed to output
JSON to the output stream. First of all, the JSON that you output has to
conform to the JSON standard, so that it can be parsed again to produce
exactly the same internal data structure. Beyond that, the JSON is supposed
to be "canonical", which means that it has been output in a standard way
that does not leave any possibility for variation.
Your canonical JSON output must always satisfy the following conditions:
- An ARGO_NUMBER whose valid_int field is set is to be printed out as an integer, without any fraction or exponent.
- An ARGO_NUMBER whose valid_float field is set is to be printed out with an integer and fractional part, as in the JSON specification. The fractional part should be normalized to lie in the interval [0.1, 1.0), so that there is always just a single 0 digit before the decimal point and the first digit after the decimal point is always nonzero. An exponent of 0 is to be omitted completely and for positive exponents the + sign is to be omitted. Exponents always start with lower-case e, rather than upper-case E.
- An ARGO_STRING is printed so that the following conditions are satisfied:
  - Characters (other than backslash \ and quote ") having Unicode code points greater than U+001F and less than U+00FF are to appear directly in the string literal as themselves. This includes forward slash /.
  - Characters with Unicode code points greater than U+00FF are to appear as escapes using \u and the appropriate hex digits, which must be in lower case.
  - Control characters that have special escapes (\n, \t, etc.) must be printed using those special escapes, not using the generic escape \u with hex digits.
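The string rules above might be sketched as follows. The function name demo_write_char is hypothetical. The sketch also makes several assumptions that the rules do not spell out: that code points do not exceed U+FFFF, that backslash and quote are written as \\ and \", that U+00FF itself and control characters without a special escape are written using \u, and that a directly printed character is written as a single byte.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write one code point of a string literal to out.
 * Assumes c <= 0xFFFF. Returns a negative value on output error.
 */
int demo_write_char(unsigned int c, FILE *out)
{
    assert(out != NULL);
    switch (c) {
    case '"':  return fputs("\\\"", out);
    case '\\': return fputs("\\\\", out);
    case '\b': return fputs("\\b", out);
    case '\f': return fputs("\\f", out);
    case '\n': return fputs("\\n", out);
    case '\r': return fputs("\\r", out);
    case '\t': return fputs("\\t", out);
    default:   break;
    }
    if (c > 0x1f && c < 0xff)
        return fputc((int)c, out);        /* printed as itself */
    /* Assumption: everything else uses \u with four lower-case hex digits. */
    return fprintf(out, "\\u%04x", c);
}
```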
If the pretty-print option has not been specified, then your canonical JSON output must satisfy the following condition:
- There is no white space in the output, except for white space that occurs within a string literal.
If the pretty print option has been specified, then your canonical JSON output will include white space according to the following rules:
- A single newline is output after every ARGO_VALUE that is output at the top-level (i.e. not as part of an object or array).
- A single newline is output after every '{', '[', and ',' (except those in string literals).
- A single newline is output immediately after the last member of an object, and immediately after the last element of an array.
- Each newline is followed by a number of spaces that depends on the indentation level of the value currently being printed. The indentation level is maintained as follows:
  - The indentation level is 0 for a top-level JSON value.
  - The indentation level is increased by one just after a { or [ has been printed to start the list of members of an object or elements of an array. The indentation level decreases by one just after the last member or element has been printed, so that the closing } or ] is at the previous indentation level.
- A single space is printed following each colon : that separates the name of an object member from its value.

The number of spaces following a newline is equal to the current indentation level times the INDENT argument given with -p, or the default of 4 if -p was specified without an INDENT argument.
Note that canonicalization must be an "idempotent" operation, in the sense that if canonical output previously produced is re-parsed and then re-output using the same pretty-printing settings, then the new output should be identical to the previous output.
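The indentation rule for pretty-printed output can be captured in one small helper, sketched below. The name demo_newline_indent and its parameters are hypothetical; how you track the current level and obtain the INDENT value is up to your design.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write a newline followed by level * indent spaces.
 * level is the current indentation level; indent is the number of
 * spaces per level (the INDENT argument given with -p).
 */
void demo_newline_indent(FILE *out, int level, int indent)
{
    assert(out != NULL && level >= 0 && indent >= 0);
    fputc('\n', out);
    for (int i = 0; i < level * indent; i++)
        fputc(' ', out);
}
```

For example, with an INDENT of 2, printing [ and then calling the helper at level 1, printing an element, and calling it again at level 0 before the closing ] puts the element on its own line indented by two spaces and the ] on the following line with no indentation.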
Part 6: Strategy of Attack
To make things a little easier for you in getting started on this assignment,
I have distributed with the basecode a library containing binary object
versions of my own implementations of argo_read_value()
and argo_write_value()
.
The Makefile
has been constructed so that it will link your program against
the library I provided. As a result, if you comment out one or both of
these functions in argo.c
, my versions will be linked instead and you can
use them to work on the other parts of the assignment. Note that this library
will not be present during grading, so do not leave these functions
commented out or your code will not compile.
Note that the functions whose interfaces have been specified will likely
be unit-tested. This means that their behavior should be completely determined
by their specified interface, which includes their parameters, return values,
and global variables defined in global.h
(which you may not modify).
There should be no implicit assumption that any other functions have been or
will be called or that any particular variables have been set to any particular
values, except for the global variables defined in global.h
.
So, for example, you may (and should) assume that when argo_write_object()
is called, the global_options
variable has been set according to the desired
program options, but you may not assume that, before argo_write_object()
was called, some other function had already been called.
My best guess as to the right attack strategy for this assignment is as follows:
First, work on the command-line argument processing (validargs()
) and
make the changes to main()
necessary to get the program to honor the command-line
arguments and perform the overall function that the application is supposed
to perform.
Next, start working on implementing argo_write_value()
, using my version of
argo_read_value()
as a source of data structures that you can use to increase
your understanding of pointers and the specific data structures that we are using
to represent JSON and, ultimately, as an aid to developing and testing your
implementation.
Finally, now that you have a clear understanding of the data structures you
are trying to produce, work on implementing argo_read_value()
, to parse
a stream of input bytes and produce such a data structure. I expect this part
of the assignment to be the most difficult.
Note that the code that I wrote for argo_read_value()
and argo_write_value()
is only about 800 lines in length. If you find your own code growing
much larger than that, you need to step back and think smarter about trying
to simplify your code.
Part 7: Running the Program
The argo
program always reads from stdin
and possibly writes to stdout
.
If you want the program to take input from a file or produce output to
a file, you may run the program using input and output redirection,
which is implemented by the shell.
A simple example of a command that uses such redirection is the following:
$ bin/argo -c < rsrc/numbers.json > numbers.out
This will cause the input to the program to be redirected from the text file
rsrc/numbers.json
and the output from the program to be redirected to the
file numbers.out
.
The redirection is accomplished by the shell, which interprets the <
symbol
to mean "input redirection from a file" and the >
symbol to mean
"output redirection to a file". It is important to understand that redirection
is handled by the shell and that the bin/argo
program never sees any
of the redirection arguments; in the above example it sees only bin/argo -c
and it just reads from stdin
and writes to stdout
.
Alternatively, the output from a command can be piped to another program, without the use of a disk file. This could be done, for example, by the following command:
$ bin/argo -c -p 2 < rsrc/package-lock.json | less
This sends the (rather lengthy) output to a program called less
,
which displays the first screenful of the output and then gives you the ability
to scan forward and backward to see different parts of it.
Type h
at the less
prompt to get help information on what you can do
with it. Type q
at the prompt to exit less
.
Programs that read from standard input and write to standard output are often used as components in more complex "pipelines" that perform multiple transformations on data.
For example, one way to test your implementation is by using one instance of it to produce some output and testing to see if that output can be read by another instance; e.g.:
$ cat rsrc/package-lock.json | bin/argo -c | bin/argo -c -p 2 > p.out
Here cat
(short for "concatenate") is a command that reads the files
specified as arguments, concatenates their contents, and prints the
concatenated content to stdout
. In the above command, this output
is redirected through a pipe to become the input to bin/argo -c
.
The output of bin/argo -c
(which contains no whitespace) is then
sent to bin/argo -c -p 2
for pretty printing. Finally, the pretty-printed
output is written to file p.out
. Actually, the original input
file rsrc/package-lock.json
is already canonical as defined here,
so in the end the file p.out
should have exactly the same content
as rsrc/package-lock.json
. One way to check this is to use the
diff
command (use man diff
to read the manual page) to compare the
two files:
$ diff rsrc/package-lock.json p.out
$
If diff
exits silently, the files are identical.
Another command that would be useful on output with no whitespace
is the cmp
command, which performs a byte-by-byte comparison of two files
(even files that contain raw binary data):
$ cmp rsrc/package-lock.json p.out
If the files have identical content, cmp
exits silently.
If one file is shorter than the other, but the content is otherwise identical,
cmp
will report that it has reached EOF
on the shorter file.
Finally, if the files disagree at some point, cmp
will report the
offset of the first byte at which the files disagree.
If the -l
flag is given, cmp
will report all disagreements between the
two files.
Unit Testing
Unit testing is a part of the development process in which small testable sections of a program (units) are tested individually to ensure that they are all functioning properly. This is a very common practice in industry and is often a requested skill by companies hiring graduates.
:nerd: Some developers consider testing to be so important that they use a work flow called test driven development. In TDD, requirements are turned into failing unit tests. The goal is then to write code to make these tests pass.
This semester, we will be using a C unit testing framework called Criterion, which will give you some exposure to unit testing. We have provided a basic set of test cases for this assignment.
The provided tests are in the tests/basecode_tests.c
file. These tests do the
following:
- validargs_help_test ensures that validargs sets the help bit correctly when the -h flag is passed in.
- validargs_validate_test ensures that validargs sets the validate-mode bit correctly when the -v flag is passed in.
- validargs_canonicalize_test ensures that validargs sets the canonicalize-mode bit correctly when the -c flag is passed in.
- validargs_bits_test ensures that validargs sets the decode-mode bit correctly when the -d flag is passed in and that the value passed with -b is correctly stored in the least-significant byte of global_options.
- validargs_error_test ensures that validargs returns an error when the -p flag is supplied with the -v flag.
- help_system_test uses the system library function to execute your program through Bash and checks to see that your program returns with EXIT_SUCCESS.
- argo_basic_test performs a basic test of the canonicalization mode of the program.
Compiling and Running Tests
When you compile your program with make
, an argo_tests
executable will be
created in your bin
directory alongside the argo
executable. Running this
executable from the hw1
directory with the command bin/argo_tests
will run
the unit tests described above and print the test outputs to stdout
. To obtain
more information about each test run, you can use the verbose print option:
bin/argo_tests --verbose=0
.
The tests we have provided are very minimal and are meant as a starting point
for you to learn about Criterion, not to fully test your homework. You may write
your own additional tests in tests/basecode_tests.c
, or in additional source
files in the tests
directory. However, this is not required for this assignment.
Criterion documentation for writing your own tests can be
found here.
Note that grades are assigned based on the number of our own test cases (not given to you in advance) that your program passes. So you should work on the assignments in such a way that whatever you do submit will function. Code that is completely broken will not score any points, regardless of how voluminous it might be or how long you might have spent on it.
Sample Input Files
In the rsrc
directory I have placed a few JSON input files for you to try
your code on.
- numbers.json: A JSON file containing a single object with various numbers as its members. This will exercise most (but probably not all) of the interesting cases that come up in parsing and outputting numbers.
- strings.json: A JSON file containing a single array with various strings as its elements. These are intended to exercise most (but again, probably not all) of the cases involving escape sequences in strings.
- package-lock.json: This is a larger JSON file that I had lying around which seemed to be a reasonable overall test.
Hand-in instructions
TEST YOUR PROGRAM VIGOROUSLY BEFORE SUBMISSION!
Make sure that you have implemented all the required functions specified in const.h
.
Make sure that you have adhered to the restrictions (no array brackets, no prohibited
header files, no modifications to files that say "DO NOT MODIFY" at the beginning,
no functions other than main()
in main.c
) set out in this assignment document.
Make sure your directory tree looks basically like it did when you started
(there could possibly be additional files that you added, but the original organization
should be maintained) and that your homework compiles (you should be sure to try compiling
with both make clean all
and make clean debug
because there are certain errors that can
occur one way but not the other).
This homework's tag is: hw1
$ git submit hw1
:nerd: When writing your program try to comment as much as possible. Try to stay consistent with your formatting. It is much easier for your TA and the professor to help you if we can figure out what your code does quickly!