A diagram of the system is available [in our compliance repo](https://github.com/GSA/us-notify-compliance/blob/main/diagrams/rendered/apps/application.boundary.png).
Notify is a Flask application running on [cloud.gov](https://cloud.gov), which also brokers access to a PostgreSQL database and Redis store.
In addition to the Flask app, Notify uses Celery to manage the task queue. Celery stores tasks in Redis.
## GitHub Repositories
Application, infrastructure, and compliance work is spread across several repositories:
We use Terraform to manage our infrastructure, providing consistent setups across the environments.
Our Terraform configurations manage components via cloud.gov. This means that the configurations should work out of the box if you are using a Cloud Foundry platform, but will not work for setups based on raw AWS.
Credentials for these services are created by running:
1. `cd terraform/development`
1. `./run.sh`
in both the API repository and the Admin repository.
This will append credentials to your `.env` file. You will need to manually clean up any prior runs from that file if you run that command again.
You can remove your development infrastructure by running `./run.sh -d`.
#### Resetting
`./reset.sh` can be used to re-import your development infrastructure information if you're on a new computer or a new working tree and the old Terraform state file was not transferred.
#### Offboarding
`./reset.sh -u USER_TO_OFFBOARD` can be used to import another user's development resources in order to clean them up (see the consolidated sketch after these steps). Steps for use:
1. Move your existing terraform state file aside temporarily, so it is not overwritten.
1. `./reset.sh -u USER_TO_OFFBOARD`
1. Answer no to the prompt about creating missing resources.
1. Run `./run.sh -u USER_TO_OFFBOARD -d` to fully remove the rest of that user's resources.
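Putting the offboarding steps together, a sketch of the full flow (assuming your state file lives at `terraform/development/terraform.tfstate`):

```sh
cd terraform/development
mv terraform.tfstate terraform.tfstate.mine   # set your own state aside
./reset.sh -u USER_TO_OFFBOARD                # answer "no" to creating missing resources
./run.sh -u USER_TO_OFFBOARD -d               # destroy the rest of that user's resources
mv terraform.tfstate.mine terraform.tfstate   # restore your own state
```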
### Cloud.gov
The cloud.gov environment is configured with Terraform. See [the `terraform` folder](../terraform/) to learn about that.
## AWS
In addition to services provisioned through cloud.gov, we have several services provisioned via [supplemental service brokers](https://github.com/GSA/usnotify-ssb) in AWS. Our AWS services are currently located in [several regions](https://github.com/GSA/usnotify-ssb#aws-accounts-and-regions-in-use) using Studio-controlled AWS accounts.
To send messages, we use Amazon Web Services SNS and SES. In addition, we use AWS Pinpoint to provision and manage phone numbers, short codes, and long codes for sending SMS.
In SNS, we have 3 topics for SMS receipts. These are not currently functional, so senders won't know the status of messages.
Through Pinpoint, the API needs at least one number so that the application itself can send SMS for authentication codes.
The API also has access to AWS S3 buckets for storing CSVs of messages and contact lists. It does not access a third S3 bucket that stores agency logos.
## New Relic
We are using [New Relic](https://one.newrelic.com/nr1-core?account=3389907) for application monitoring and error reporting. When requesting access to New Relic, ask to be added to the Benefits-Studio subaccount.
- [ ] Create the local `.env` file by copying `sample.env` and running `./run.sh` within the `terraform/development` folder (see [these docs](https://github.com/GSA/notifications-api/blob/main/docs/all.md#development))
- [ ] Run through [the local setup process](https://github.com/GSA/notifications-api/tree/main#local-setup)
- [ ] Review [the system diagram](https://github.com/GSA/us-notify-compliance/blob/main/diagrams/rendered/apps/application.boundary.png)
Upon completion, an admin should update 🔒[the permissions and access tracker](https://docs.google.com/spreadsheets/d/1Z8s82dbLHHxGC8fF2U1K6YhtZZYVaEdliOZRbKWW9L4/edit#gid=0).
These steps are required for new cloud.gov environments. Local development borrows SES & SNS infrastructure from the `notify-staging` cloud.gov space, so these steps are not required for new developers.
### Steps to do a clean prod deploy to cloud.gov
Steps for deploying production from scratch. These can be updated for a new cloud.gov environment by subbing out `prod` or `production` for your desired environment within the steps.
1. Find and replace instances in the repo of "testsender", "testreceiver", and "dispostable.com" with your origin and destination email addresses, which you verified in step 1 above.
TODO: create env vars for these origin and destination email addresses for the root service, and create new migrations to update postgres seed fixtures
1. Click the `Exit SMS Sandbox` button and submit the support request. This request should take at most a day to complete. Be sure to request a higher sending limit at the same time.
#### Request new phone numbers
1. Go to Pinpoint console for the same region you are using SNS in.
1. In the lefthand sidebar, go to `SMS and Voice` (bottom) and choose `Phone Numbers`
1. Under `Number Settings` choose `Request Phone Number`
1. Choose Toll-free number, tick SMS, untick Voice, choose `transactional`, hit next and then `request`
1. Select `Toll-free registrations` and `Create registration`
1. Select the number you just created and then `Register existing toll-free number`
1. Complete and submit the form. Approval usually takes about 2 weeks.
1. See the [run book](./run-book.md) for information on how to set those numbers.
Example answers for toll-free registration form

If you're using the `cf` CLI, you can run `cf logs notify-api-ENV` and/or `cf logs notify-admin-ENV` to stream logs in real time. Add `--recent` to get the last few logs, though logs often move pretty quickly.
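For example, to look at the staging API logs (substitute the app and environment you need):

```sh
cf logs notify-api-staging            # stream new log lines in real time
cf logs notify-api-staging --recent   # dump the most recently buffered logs
```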
For general log searching, [the cloud.gov Kibana instance](https://logs.fr.cloud.gov/) is powerful, though quite complex to get started. For shortcuts to errors, some team members have New Relic access.
The links below will open a filtered view with logs from both applications, which can then be filtered further. However, for the links to work, you need to paste them into the URL bar while _already_ logged into and viewing the Kibana page. If not, you'll just be redirected to the generic dashboard.
We're using [`pre-commit`](https://pre-commit.com/) to manage hooks in order to automate common tasks or easily-missed cleanup. It's installed as part of `make bootstrap` and is limited to this project's virtualenv.
To run the hooks in advance of a `git` operation, use `poetry run pre-commit run`. For running across the whole codebase (useful after adding a new hook), use `poetry run pre-commit run --all-files`.
The configuration is stored in `.pre-commit-config.yaml`. In that config, there are links to the repos from which the hooks are pulled, so hop through there if you want a detailed description of what each one is doing.
One of the pre-commit hooks we use is [`detect-secrets`](https://github.com/Yelp/detect-secrets), which checks for all sorts of things that might be committed accidentally but should not be. The project is already set up with a baseline file (`.ds.baseline`), and this should just work out of the box, but occasionally the hook will flag something new when you try to commit, or the file may need a refresh after a while. In either case, to get things back on track and update the `.ds.baseline` file, run these two commands:
```sh
detect-secrets scan --baseline .ds.baseline
detect-secrets audit .ds.baseline
```
The second command will walk you through all of the newly detected secrets and ask you to validate whether they are real secrets or false positives. Mark off each one as appropriate (they should all be false positives; if they're not, please stop and check in with the team!), then commit the updates to the `.ds.baseline` file and push them remotely so the project stays up to date.
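For example, once the audit is finished and everything is marked as a false positive, commit and push the refreshed baseline:

```sh
git add .ds.baseline
git commit -m "Refresh detect-secrets baseline"
git push
```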
We're using GitHub Actions. See [/.github](../.github/) for the configuration.
In addition to commit-triggered scans, the `daily_checks.yml` workflow runs the relevant dependency audits, static scan, and/or dynamic scans at 10am UTC each day. Developers will be notified of failures in daily scans by GitHub notifications.
### Nightly Scans
Within GitHub Actions, several scans take place every day to ensure security and compliance.
`daily_checks.yml` runs `pip-audit`, `bandit`, and `owasp` scans to ensure that any newly found vulnerabilities do not impact Notify. Failures should be addressed quickly, as they will also block the next attempted deploy.
#### [drift.yml](../.github/workflows/drift.yml)
`drift.yml` checks the deployed infrastructure against the expected configuration. A failure here is a flag to check audit logs for unexpected access and/or behavior and potentially destroy and re-deploy the application. Destruction and redeployment of all underlying infrastructure is an extreme remediation, and should only be attempted after ensuring that a good database backup is in hand.
## Manual testing
If you're checking out the system locally, you may want to create a user quickly.
This will run an interactive prompt to create a user, and then mark that user as active. _Use a real mobile number_ if you want to log in, as the SMS auth code will be sent to that number.
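A minimal sketch of the flow, assuming the interactive command is the `create-test-user` command defined in `app/commands.py` (check that file for the exact name and prompts):

```sh
# Prompts for details such as name, email address, mobile number, and
# password, then marks the new user as active.
poetry run flask command create-test-user
```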
Feature flagging is now implemented in the Admin application to allow conditional enabling of features. The current setup uses environment variables, which can be configured via the command line with Cloud Foundry (CF). These settings should be defined in each relevant .yml file and committed to source control.
To adjust a feature flag, update the corresponding environment variable and redeploy as needed. This setup provides flexibility for enabling or disabling features without modifying the core application code.
Specifics on the commands can be found in the [Admin Feature Flagging readme](https://github.com/GSA/notifications-admin/blob/main/docs/feature-flagging.md).
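As a sketch (the flag name here is hypothetical; see the readme above for the real variables and commands), toggling a flag in a deployed environment looks like:

```sh
cf set-env notify-admin-staging FEATURE_EXAMPLE_ENABLED "True"   # hypothetical flag name
cf restage notify-admin-staging --strategy rolling               # redeploy so the change takes effect
```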
The API has 3 deployment environments, all of which deploy to cloud.gov:
- Staging, which deploys from `main`
- Demo, which deploys from `production`
- Production, which deploys from `production`
Configurations for these are located in [the `deploy-config` folder](../deploy-config/). This setup is duplicated for the front end.
To trigger a new deploy, create a pull request from `main` to `production` in GitHub. This PR typically has release notes highlighting major and minor changes in the deployment. For help preparing this, [sorting closed pull requests by "recently updated"](https://github.com/GSA/notifications-api/pulls?q=is%3Apr+sort%3Aupdated-desc+is%3Aclosed) will show all PRs merged since the last production deploy.
Deployment to staging runs via the [base deployment action](../.github/workflows/deploy.yml) on GitHub, which pulls credentials from GitHub's secrets store in the staging environment.
Deployment to demo runs via the [demo deployment action](../.github/workflows/deploy-demo.yml) on GitHub, which pulls credentials from GitHub's secrets store in the demo environment.
Deployment to production runs via the [production deployment action](../.github/workflows/deploy-prod.yml) on GitHub, which pulls credentials from GitHub's secrets store in the production environment.
The [action that we use](https://github.com/18F/cg-deploy-action) deploys using [a rolling strategy](https://docs.cloudfoundry.org/devguide/deploy-apps/rolling-deploy.html), so all deployments should have zero downtime.
In the event that a deployment includes a Terraform change, that change will run before any code is deployed to the environment. Each environment has its own Terraform GitHub Action to handle that change.
Failures in any of these GitHub workflows will be surfaced in the pull request related to the code change; in the case of `checks.yml`, they actively prevent the PR from being merged. Failure in the Terraform workflow will not actively prevent the PR from being merged, but reviewers should not approve a PR with a failing Terraform plan.
## Egress Proxy
The API app runs in a [restricted egress space](https://cloud.gov/docs/management/space-egress/).
This allows direct communication to cloud.gov-brokered services, but
not to other APIs that we require.
As part of the deploy, we create an
[egress proxy application](https://github.com/GSA/cg-egress-proxy) that allows traffic out of our
application to a select list of allowed domains.
Update the allowed domains by updating `deploy-config/egress_proxy/notify-api-<env>.allow.acl`
and deploying an updated version of the application through the normal deploy process.
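As an illustration (the exact file format is an assumption; check the existing ACL files in `deploy-config/egress_proxy/` for the real syntax), an entry might look like:

```sh
# deploy-config/egress_proxy/notify-api-staging.allow.acl
# assumed format: one allowed destination domain per line
api.example.gov
```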
For an environment variable to make its way into the cloud.gov environment, it _must_ end up in the `manifest.yml` file. Based on the deployment approach described above, there are 2 ways for this to happen.
Because secrets are pulled from GitHub, they must be passed from our action to the deploy action and then placed into `manifest.yml`. This means that they should be in 4 places:
- [ ] The GitHub secrets store
- [ ] The deploy action in the `env` section using the format `${{ secrets.SECRET_NAME }}`
- [ ] The deploy action in the `push_arguments` section using the format `--var SECRET_NAME="$SECRET_NAME"`
- [ ] The manifest using the format `((SECRET_NAME))`
Public env vars make up the configuration in `deploy-config`. These are pulled in together by the `--vars-file` line in the deploy action. To add or update one, it should be in 2 places:
- [ ] The relevant YAML file in `deploy-config` using the format `var_name: value`
- [ ] The manifest using the format `((var_name))`
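A sketch of how the pieces line up, using the real `SECRET_KEY` secret and a hypothetical public `var_name` (the actual wiring lives in the deploy actions and `manifest.yml`):

```sh
# In the deploy action (conceptually):
#   env:            SECRET_KEY: ${{ secrets.SECRET_KEY }}   # from the GitHub secrets store
#   push_arguments: --var SECRET_KEY="$SECRET_KEY"
# manifest.yml then references ((SECRET_KEY)) and ((var_name)).
cf push --vars-file deploy-config/staging.yml --var SECRET_KEY="$SECRET_KEY"
```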
In addition to the environment variable management, there may be some [additional application initialization](https://docs.cloudfoundry.org/devguide/deploy-apps/deploy-app.html#profile) that needs to be accounted for. This can include the following:
- Setting environment variables that depend directly on the host environment the application runs in, as opposed to being managed by the `manifest.yml` file or a user-provided service.
- Running app initialization scripts that need host environment information directly, prior to starting the application itself.
These initialization steps are taken care of in the `.profile` file, which we use to set a couple of host environment-specific environment variables.
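A minimal hypothetical sketch of what such a `.profile` step can look like (the variable name is invented for illustration):

```sh
# .profile runs on the host before the app starts, so it can read
# values that are only knowable at runtime on that host.
export EXAMPLE_HOST_VAR="$(hostname)"   # hypothetical host-derived variable
```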
If this is the first time you have used Terraform in this repository, you will first have to hook your copy of Terraform up to our remote state. Follow [Retrieving existing bucket credentials](https://github.com/GSA/notifications-api/tree/main/terraform#retrieving-existing-bucket-credentials).
Check [Terraform troubleshooting](https://github.com/GSA/notifications-api/tree/main/terraform#troubleshooting) if you encounter problems.
Note that you'll have to do this for both the API and the Admin. Once this is complete we shouldn't have to do it again (unless we're setting up a new sandbox environment).
### Deploying to the sandbox
To deploy either the API or the Admin apps to the sandbox, the process is largely the same, but the Admin requires a bit of additional work.
#### Deploying the API to the sandbox
1. Make sure you are in the API project's root directory.
1. Authenticate with cloud.gov in the command line: `cf login -a api.fr.cloud.gov --sso`
1. Run `./scripts/deploy_to_sandbox.sh` from the project root directory.
At this point your cloud.gov target org and space will change to the `notify-sandbox` environment, and the application will be pushed for deployment.
The script does a few things to make sure the deployment flows smoothly with minimal work on your part:
* Sets the target org and space in cloud.gov for you.
* Creates a `requirements.txt` file for the Python dependencies so that the deployment picks up on the dependencies properly.
* Pushes the application with the correct environment variables set based on what is supplied by the `deploy-config/sandbox.yml` file.
1. Start a poetry shell as a shortcut to load `.env` file variables by running `poetry shell`. (You'll have to restart this any time you change the file.)
In Notify, several aspects of the system are loaded into the database via migration. This means that
application setup requires loading and overwriting historical data in order to arrive at the current
configuration.
[Here are notes](https://docs.google.com/document/d/1ZgiUtJFvRBKBxB1ehiry2Dup0Q5iIwbdCU5spuqUFTo/edit#)
about what is loaded into which tables, and some plans for how we might manage that in the future.
Flask does not seem to have a great way to squash migrations, but rather wants you to recreate them
from the DB structure. This means it's easy to recreate the tables, but hard to recreate the initial data.
## Data Model Diagram
A diagram of Notify's data model is available [in our compliance repo](https://github.com/GSA/us-notify-compliance/blob/main/diagrams/rendered/apps/data.logical.pdf).
## Migrations
Create a migration:
```
flask db migrate
```
Trim any auto-generated stuff down to what you want, and manually rename it to be in numerical order.
We should only have one migration branch.
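Flask-Migrate (Alembic under the hood) can confirm this:

```sh
flask db heads   # should report exactly one head revision
```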
Running migrations locally:
```
flask db upgrade
```
This should happen automatically on cloud.gov, but if you need to run a one-off migration for some reason:
```
cf run-task notifications-api-staging --command "flask db upgrade" --name db-upgrade
```
## Purging user data
There is a Flask command to wipe user-created data (users, services, etc.).
The command should stop itself if it's run in a production environment, but, you know, please don't run it.
# Commands for test loading the local dev database
All commands use the `-g` or `--generate` option to determine how many instances to load into the db; if not supplied, it defaults to 1. An example: `flask command add-test-users-to-db -g 6` will generate 6 random users and insert them into the db.
One-off messages and batch messages both upload a CSV, which is first stored in S3 and queued as a `Job`. When the job runs, it iterates
through the rows from `tasks.py:process_row`, running `tasks.py:save_sms` (email notifications branch off through `tasks.py:save_email`) to write to the db with `persist_notification` and begin the process of delivering the notification to the provider
through `provider_tasks.deliver_sms`. The exit point to the provider is in `send_to_providers.py:send_sms`.
_Most of the API endpoints in this repo are for internal use. These are all defined within top-level folders under `app/` and tend to have the structure `app/<feature>/rest.py`._
## Overview
Public APIs are intended for use by services and are all located under `app/v2/` to distinguish them from internal endpoints. Originally we did have a "v1" public API, where we tried to reuse / expose existing internal endpoints. The needs for public APIs are sufficiently different that we decided to separate them out. Any "v1" endpoints that remain are now purely internal and no longer exposed to services.
## Documenting APIs
New and existing APIs should be documented within [openapi.yml](./openapi.yml). Tools to help
Here are some pointers for how we write public API endpoints.
### Each endpoint should be in its own file in a feature folder
Example: `app/v2/inbound_sms/get_inbound_sms.py`
This helps keep the file size manageable but does mean a bit more work to register each endpoint if we have many that are related. Note that internal endpoints are grouped differently: in large `rest.py` files.
### Each group of endpoints should have an `__init__.py` file
Note that the error handling setup by `register_errors` (defined in [`app/v2/errors.py`](../app/v2/errors.py)) for public API endpoints is different from that for internal endpoints (defined in [`app/errors.py`](../app/errors.py)).
### Each endpoint should have an adapter in each API client
Example: [Ruby Client adapter to get template by ID](https://github.com/alphagov/notifications-ruby-client/blob/d82c85452753b97e8f0d0308c2262023d75d0412/lib/notifications/client.rb#L110-L115).
All our clients should fully support all of our public APIs.
Each adapter should be documented in each client ([example](https://github.com/alphagov/notifications-ruby-client/blob/d82c85452753b97e8f0d0308c2262023d75d0412/DOCUMENTATION.md#get-a-template-by-id)). We should also document each public API endpoint in our generic API docs ([example](https://github.com/alphagov/notifications-tech-docs/blob/2700f1164f9d644c87e4c72ad7223952288e8a83/source/documentation/_api_docs.md#send-a-text-message)). Note that internal endpoints are not documented anywhere.
### Each endpoint should specify the authentication it requires
This is done as part of registering the blueprint in `app/__init__.py` e.g.
```
post_letter.before_request(requires_auth)
application.register_blueprint(post_letter)
```
# API Usage
## Connecting to the API
To make life easier, the [UK API client libraries](https://www.notifications.service.gov.uk/documentation) are compatible with Notify and the [UK API Documentation](https://docs.notifications.service.gov.uk/rest-api.html) is applicable.
For a usage example, see [our Python demo](https://github.com/GSA/notify-python-demo).
An API key can be created at https://HOSTNAME/services/YOUR_SERVICE_ID/api/keys. This is the same API key that is referenced as `USER_API_TOKEN` below.
Internal-only [documentation for exploring the API using Postman](https://docs.google.com/document/d/1S5c-LxuQLhAtZQKKsECmsllVGmBe34Z195sbRVEzUgw/edit#heading=h.134fqdup8d3m)
For further details of the system and how it connects to supporting services, see the [application boundary diagram](https://github.com/GSA/us-notify-compliance/blob/main/diagrams/rendered/apps/application.boundary.png)
**NOTE: Please be mindful of sharing sensitive information here! If you're not sure of what to write, please ask the team first before writing anything here.**
Policies and Procedures needed before and during Notify.gov Operations. Many of these policies are taken from the Notify.gov System Security & Privacy Plan (SSPP).
Any changes to policies and procedures defined both here and in the SSPP must be kept in sync, and should be done collaboratively with the System ISSO and ISSM.
Operational alerts are posted to the [#pb-notify-alerts](https://gsa-tts.slack.com/archives/C04U9BGHUDB) Slack channel. Please join this channel and enable push notifications for all messages whenever you are on call.
[New Relic](https://one.newrelic.com/) is used for monitoring the application. The [New Relic dashboard](https://onenr.io/08wokrnrvwx) can be filtered by environment and by API, Admin, or both.
[Cloud.gov Logging](https://logs.fr.cloud.gov/) is used to view and search application and platform logs.
In addition to the application logs, there are several tables in the application that store useful information for audit logging purposes:
Our apps must be restaged whenever cloud.gov releases updates to buildpacks. Cloud.gov will send email notifications whenever buildpack updates affect a deployed app.
Restaging the apps rebuilds them with the new buildpack, enabling us to take advantage of whatever bugfixes or security updates are present in the new buildpack.
There are two GitHub Actions that automate this process. Each is run manually and must be run once for each environment, so that any changes can be tested in staging before being run in the demo and production environments.
When `notify-api-<env>`, `notify-admin-<env>`, `egress-proxy-notify-api-<env>`, and/or `egress-proxy-notify-admin-<env>` need to be restaged:
1. Navigate to [the Restage apps GitHub Action](https://github.com/GSA/notifications-api/actions/workflows/restage-apps.yml)
1. Click the `Run workflow` button to open a popup
1. Leave `Use workflow from` on its default of `Branch: main`
1. Select the environment you need to restage from the dropdown
1. Click `Run workflow` within the popup
1. Repeat for other environments
When `ssb-sms` and/or `ssb-smtp` need to be restaged:
1. Navigate to the [SSB Restage apps GitHub Action](https://github.com/GSA/usnotify-ssb/actions/workflows/restage-apps.yml)
1. Click the `Run workflow` button to open a popup
1. Leave `Use workflow from` on its default of `Branch: main`
1. Select the environment (either `staging` or `production`) you need to restage from the dropdown
1. Click `Run workflow` within the popup
1. Repeat for other environments
When `ssb-devel-sms` and/or `ssb-devel-smtp` need to be restaged:
1. Navigate to the [SSB Restage apps GitHub Action](https://github.com/GSA/usnotify-ssb/actions/workflows/restage-apps.yml)
1. Click the `Run workflow` button to open a popup
1. Leave `Use workflow from` on its default of `Branch: main`
1. Select the `development` environment from the dropdown
1. Click `Run workflow` within the popup
## <a name="deploying-to-production"></a> Deploying to Production
Deploying to production involves 3 steps that must be done in order, and can be done for just the API, just the Admin, or both at the same time:
1. Create a new pull request in GitHub that merges the `main` branch into the `production` branch; be sure to provide details about what is in the release!
1. Create a new release tag and generate release notes; publish it with `Set as a pre-release` checked at first, then update it to the latest release after the deploy is finished and successful.
1. Review and approve the pull request(s) for the production deployment.
Additionally, you may have to monitor the GitHub Actions as they take place to troubleshoot and/or re-run failed jobs.
This is done entirely in GitHub. First, go to the pull requests section of the API and/or Admin repository, then click on the `New pull request` button.
In the screen that appears, change the `base: main` target branch on the left side of the arrow to `base: production` instead. You want to merge all of the latest changes in `main` to the `production` branch. After you've made the switch, click on the `Create pull request` button.
On the main page of the repository, click on the small heading that says `Releases` on the right to get to the release listing page. Once there, click on the `Draft a new release` button.
You'll first have to choose a tag or create a new one: use the current date as the tag name, e.g., `9/9/2024`. Keep the target set to `main` and then click on the `Generate release notes` button.
Add a title in the format of `<current date> Production Deploy`, e.g., `9/9/2024 Production Deploy`.
Lastly, uncheck the `Set as the latest release` checkbox and check the `Set as a pre-release` checkbox instead.
Once everything is complete, click on the `Publish release` button and then link to the new release notes in the corresponding production deploy pull request.
When everything is good to go, two people will need to approve the pull request for merging into the `production` branch. Once they do, then merge the pull request.
At this point everything is mostly automatic. The deploy will update both the `demo` and `production` environments. Once the deploys are done and successful, go back into the pre-release release notes and switch the checkboxes to turn it into the latest release and save the change.
Sometimes a deploy will fail, and you will have to look at the GitHub Action deployment logs to see what the cause is. In many cases it will be an out-of-memory error caused by the two environments deploying at the same time. Once the successful deploy is finished, re-run the failed jobs in the other deployment action.
## <a name="smoke-testing"></a> Smoke-testing the App
To ensure that notifications are passing through the application properly, the following steps can be taken to ensure all parts are operating correctly:
1. Send yourself a password reset email. This will verify SES integration. The email can be deleted once received if you don't wish to change your password.
1. Log into the app. This will verify SNS integration for a one-off message.
1. Upload a CSV and schedule send for the soonest time after "Now". This will verify S3 connections as well as scheduler and worker processes are running properly.
1. Any PRs waiting for approval should be discussed during daily standup meetings.
### notifications-api & notifications-admin
1. Changes are deployed to the `staging` environment after a successful `checks.yml` run on the `main` branch. Branch Protections prevent pushing directly to `main`
1. Changes are deployed to the `demo` _and_ `production` environments after merging `main` into `production`. Branch Protections prevent pushing directly to `production`
### usnotify-ssb
1. Changes are deployed to `staging` and `production` environments after merging to the `main` branch. The `staging` deployment must be successful before `production` is attempted. Branch Protections prevent pushing directly to `main`
### ttsnotify-brokerpak-sms
1. A new release is created by pushing a tag to the repository on the `main` branch.
1. To include the new version in released SSB code, create a PR in the `usnotify-ssb` repo updating the version in use in `app-setup-sms.sh`
### datagov-brokerpak-smtp
1. To include new versions of the SMTP brokerpak in released SSB code, create a PR in the `usnotify-ssb` repo updating the version in use in `app-setup-smtp.sh`
### Vulnerability Mitigation Changes
US_Notify Administrators are responsible for ensuring that remediations for vulnerabilities are implemented. Response times vary based on the level of vulnerability as follows:
Notify.gov DNS records are maintained within [the GSA-TTS/dns repository](https://github.com/GSA-TTS/dns/blob/main/terraform/notify.gov.tf), and the domains and routes are managed directly in our Cloud.gov production space.
1. Create a new branch and update the [`notify.gov.tf`](https://github.com/GSA-TTS/dns/blob/main/terraform/notify.gov.tf) Terraform file to update, create, or remove DNS records within AWS Route 53.
1. Open a PR in the repository and verify that the plan output within CircleCI makes the changes that you expect.
1. Request a PR review from the `@tts-tech-operations` team within the GSA-TTS GitHub org.
1. Once the PR is approved and merged, verify that the apply step happened correctly within [CircleCI](https://app.circleci.com/pipelines/github/GSA-TTS/dns).
1. Sign in to the `cf` CLI in your terminal and target the `notify-production` space.
1. Create the new domain(s) with [`cf create-private-domain`](https://docs.cloudfoundry.org/devguide/deploy-apps/routes-domains.html#private-domains).
1. Map the routes needed to the new domain(s) with [`cf map-route`](https://docs.cloudfoundry.org/devguide/deploy-apps/routes-domains.html#map-route).
1. Update the service to account for the new domain(s): `cf update-service notify-admin-domain-production -c '{"domains": "example.gov,www.example.gov,..."}'` (make sure to list *all* domains that need to be accounted for, including any existing ones that you want to keep!).
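Putting those steps together, a sketch with placeholder values (`<ORG_NAME>` and `example.gov` are stand-ins):

```sh
cf target -s notify-production
cf create-private-domain <ORG_NAME> example.gov
cf map-route notify-admin-production example.gov
cf update-service notify-admin-domain-production -c '{"domains": "example.gov,www.example.gov"}'
```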
If you're removing existing domains:
1. Sign in to the `cf` CLI in your terminal and target the `notify-production` space.
1. Unmap the routes to the existing domain(s) with [`cf unmap-route`](https://docs.cloudfoundry.org/devguide/deploy-apps/routes-domains.html#unmap-route).
1. Delete the existing domain(s) with [`cf delete-private-domain`](https://docs.cloudfoundry.org/devguide/deploy-apps/routes-domains.html#private-domains).
1. Update the service to account for the deleted domain(s): `cf update-service notify-admin-domain-production -c '{"domains": "example.gov,www.example.gov,..."}'` (make sure to list *all* domains that need to be accounted for, including any existing ones that you want to keep!).
**Step 3: Redeploy or restage the Admin app:**
Restage or redeploy the `notify-admin-production` app. To restage, you can trigger the action in GitHub or run the command directly: `cf restage notify-admin-production --strategy rolling`.
Test that the changes took effect properly by going to the domain(s) that were adjusted and seeing if they resolve correctly and/or no longer resolve as expected. Note that this may take up to 72 hours, depending on how long it takes for the DNS changes to propagate.
To review the daily scan results and check for any new reported findings that need to be remediated, perform the following steps.
**For the API**
1. Go to the daily scan page: https://github.com/GSA/notifications-api/actions/workflows/daily_checks.yml
1. Click on the latest scan (it should have run on the current day and be at the top of the list)
1. Scroll to the bottom and download the two artifacts: `bandit-report` and `zap_scan` - these are zip files that contain the full scan reports
1. Click on the `pip-audit` job in the menu on the left of the screen
1. Click on the `Run pypa/gh-action-pip-audit` step (the version number may change over time as it gets updated)
1. Check that the output of the step doesn't show any new audit findings (the step and job will have failed if it did)
1. Click on the `static-scan` job in the menu on the left of the screen
1. Click on the `Run scan` step
1. Check that the output of the step doesn't show any new scan findings (note: the step and job may still show as successful even if something was found)
1. Click on the `dynamic-scan` job in the menu on the left of the screen
1. Click on the `Run OWASP API Scan` step
1. Check that the output of the step doesn't show any new scan findings (note: the step and job may still show as successful even if something was found)
Once you're done performing the steps above to gather all of the information, make a note of any new findings that need to be accounted for and remediated and create issues to track the work.
**For the Admin**
1. Go to the daily scan page: https://github.com/GSA/notifications-admin/actions/workflows/daily_checks.yml
1. Click on the latest scan (it should have run on the current day and be at the top of the list)
1. Scroll to the bottom and download the artifact: `zap_scan` - this is a zip file that contains the full scan reports
1. Click on the `dependency-audits` job in the menu on the left of the screen
1. Click on the `Run pypa/gh-action-pip-audit` step (the version number may change over time as it gets updated)
1. Check that the output of the step doesn't show any new audit findings (the step and job will have failed if it did)
1. Click on the `Run npm audit` step
1. Check that the output of the step doesn't show any new audit findings (the step and job will have failed if it did)
1. Click on the `static-scan` job in the menu on the left of the screen
1. Click on the `Run scan` step
1. Check that the output of the step doesn't show any new scan findings (note: the step and job may still show as successful even if something was found)
1. Click on the `dynamic-scan` job in the menu on the left of the screen
1. Click on the `Run OWASP Full Scan` step
1. Check that the output of the step doesn't show any new scan findings (note: the step and job may still show as successful even if something was found)
Once you're done performing the steps above to gather all of the information, make a note of any new findings that need to be accounted for and remediated and create issues to track the work.
There are a few different ways to handle rotating environment variable secrets, depending on what the secret is.
### Secret environment variables (set directly)
The `ADMIN_CLIENT_SECRET`, `DANGEROUS_SALT`, and `SECRET_KEY` environment variables are all generated random strings of characters. To make a new value for any of these environment variables, perform the following steps:
1. Start the API locally with the command `make run-procfile`
1. In a separate terminal tab, navigate to the API project and run `poetry run flask command generate-salt` (this command is found in the [`app/commands.py` file](https://github.com/GSA/notifications-api/blob/main/app/commands.py#L1030-L1037))
1. A random secret will appear in the tab, which you will use to update the value(s) in GitHub
Next, you'll need to go into GitHub for either the [API repo environment settings](https://github.com/GSA/notifications-api/settings/environments) or [Admin repo environment settings](https://github.com/GSA/notifications-admin/settings/environments). Once there you'll see a list of all of the environments; click into the one that you're looking to update and then find the corresponding environment variable that you need to update. Click on the pencil icon to the right of the environment variable name to edit the value, then paste in the value you generated with the previous steps.
**NOTE:** These values must match between the API and Admin environment variables per environment (meaning, if you change the Admin repo value for any of these values in any environment, the same variable for the API in the same environment must be changed to match it!).
The important thing is to use the same secret for Admin and API on each tier; i.e., you only generate three secrets per environment.
**NOTE:** You may also have to update these values for Dependabot as well! To do this, go into GitHub and then navigate through `Settings -> Secrets and variables -> Dependabot`, which will take you to a special page to manage environment variables specifically for Dependabot. This is more necessary in the Admin repo because of the E2E tests.
### E2E environment variables (set directly)
See the [end-to-end testing section](#end-to-end-testing).
### Service bindings for Cloud.gov-managed services
For any Cloud.gov service instance that you need to rotate credentials for, you need to run the following commands:
1. `cf unbind-service <APP NAME> <SERVICE NAME>`
1. `cf bind-service <APP NAME> <SERVICE NAME>`
Once you are done unbinding and re-binding all services you're looking to rotate credentials for, you need to restage or redeploy the application(s) for the changes to take effect. You can restage directly in the command line: `cf restage <APP NAME> --strategy rolling`
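For example, with a hypothetical `notify-api-db` service bound to the production API app:

```sh
cf unbind-service notify-api-production notify-api-db   # service name is hypothetical
cf bind-service notify-api-production notify-api-db     # re-binding issues fresh credentials
cf restage notify-api-production --strategy rolling     # restage so the app picks them up
```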
### Rotating New Relic API keys and licenses
To rotate New Relic API key, license key, and other credentials, you need access to New Relic. If you have access, sign in and then click on your name in the lower left. Click on `API keys` and you'll be taken to the management screen for all of the API keys. From there, perform these steps:
1. Create new versions of whichever key(s) you would like to rotate
1. Update the corresponding environment variable(s) in GitHub for both the [API repo environment settings](https://github.com/GSA/notifications-api/settings/environments) and the [Admin repo environment settings](https://github.com/GSA/notifications-admin/settings/environments)
1. Restage or redeploy the applications
1. Once you confirm the new key(s) in New Relic are working, delete the old keys on the API Key management screen
### Terraform state bucket key rotation
To rotate the Terraform state bucket key, run these commands in the `api/terraform/bootstrap` directory of the API repo:
```sh
# comment out prevent_destroy in terraform/bootstrap/main.tf
# update username to create in run.sh and teardown-creds.sh
$ ./run.sh plan -replace=cloudfoundry_service_key.bucket_creds
```
### Rotating login.gov certificates

1. Update the GitHub secrets for staging, demo, and production (contents of `key.pem` go in `LOGIN_PEM` and contents of `cert.crt` in `LOGIN_PUB`). **DO NOT RESTAGE YET**.
1. Use the same certificate for staging, demo, and production.
1. Log in to the login.gov partner app (https://portal.int.identitysandbox.gov).
1. Add the new certificate to the production version of Notify in the partner app (our partner app account has sandbox and production).
1. Make a Zendesk support request for login.gov to push the new version of Notify (https://zendesk.login.gov).
1. Do not delete the old certificate; you need things to keep working until you complete the transition.
1. When you receive an email from login.gov that the app has been pushed successfully, restage Notify on the staging tier.
1. If staging works, you can restage demo and production.
1. Delete the old certificate in the partner app and send another Zendesk request to push again. This is best practice but lower priority: certificates eventually expire anyway, and we have already changed the certificate in GitHub secrets, so the old cert is no longer relevant.
### Service Key Creation or Deletion Fails

<dl>
<dt>Problem:</dt>
<dd>Creating or deleting service keys is failing. SSB logs reference failing to verify certificate/certificate valid for <code>GUID A</code> but not for <code>GUID B</code>.</dd>
<dt>Solution:</dt>
<dd>Restage the SSB apps using the <a href="#restaging-apps">restage apps action</a>.</dd>
</dl>
### SNS Topic Subscriptions Don't Succeed
<dl>
<dt>Problem:</dt>
<dd>When deploying a new environment, a race condition prevents SNS topic subscriptions from being successfully verified on the AWS side</dd>
<dt>Solution:</dt>
<dd>Manually re-request subscription confirmation from the AWS Console.</dd>
</dl>
- Infrastructure Accounts and Application Platform Administrators must be approved by the System Owner (Amy) before creation, but people with `Administrator` role can actually do the creation and role assignments.
- At least one agency partner must act as the `User Manager` for their service, with permissions to manage their team according to their agency's policies and procedures.
- All users must utilize `.gov` email addresses.
- Users who leave the team or otherwise have role changes must have their accounts updated to reflect the new roles required (or disabled) within 14 days.
- SpaceDeployer credentials must be rotated within 14 days of anyone with SpaceDeveloper cloud.gov access leaving the team.
- A user report must be created annually (See AC-2(j)). `make cloudgov-user-report` can be used to create a full report of all cloud.gov users.
| Administrator | GitHub | Admin | PBS Fed | Approve & Merge PRs into main and production |
| Administrator | AWS | `NotifyAdministrators` IAM UserGroup | PBS Fed | Read audit logs, verify & fix any AWS service issues within Production AWS account |
| Administrator | Cloud.gov | `OrgManager` | PBS Fed | Manage cloud.gov roles and permissions. Access to production spaces |
| DevOps Engineer | Cloud.gov | `SpaceManager` | PBS Fed or Contractor | Access to non-production spaces |
| DevOps Engineer | AWS | `NotifyAdministrators` IAM UserGroup | PBS Fed or Contractor | Access to non-production AWS accounts to verify & fix any AWS issues in the lower environments |
| Cloud.gov Service Account | Cloud.gov | `OrgManager` and `SpaceDeveloper` | Creds stored in GitHub Environment secrets within api and admin app repos |
| SSB Deployment Account | AWS | `IAMFullAccess` | Creds stored in GitHub Environment secrets within usnotify-ssb repo |
| SSB Cloud.gov Service Account | Cloud.gov | `SpaceDeveloper` | Creds stored in GitHub Environment secrets within usnotify-ssb repo |
| SSB AWS Accounts | AWS | `sms_broker` or `smtp_broker` IAM role | Creds created and maintained by usnotify-ssb terraform |
- For the default phone number, to be used by Notify itself for OTP codes and the default from number for services, set the phone number as the `AWS_US_TOLL_FREE_NUMBER` ENV variable in the environment you are creating
- For service-specific phone numbers, set the phone number in the Service's `Text message senders` in the settings tab.
- +18447952263 - in use as default number. Notify's OTP messages and trial service messages are sent from this number (Also the number for the live service: Federal Test Service)
- +18447891134 - Montgomery County / Ride On
- +18888402596 - Norfolk / DHS
- +18555317292 - Washington State / DHS
- +18889046435 - State Department / Consular Affairs
For a full list of phone numbers in trial and production, team members can access a [tracking list here](https://docs.google.com/spreadsheets/d/1lq3Wi_up7EkcKvmwO3oTw30m7kVt1iXvdS3KAp0smh4/edit#gid=0).
Users and invited users are Federal, State, or Local government employees or contractors. Members of the general public are _not_ users of the system.
#### Note 2.
Field-level encryption is used on these fields.
Details on encryption schemes and algorithms can be found in [SC-28(1)](https://github.com/GSA/us-notify-compliance/blob/main/dist/system-security-plans/lato/sc-28.1.md)
#### Note 3.
Probably not PII; this is the phone number's country code.
Seven (7) days by default. Each service can be set with a custom policy via `ServiceDataRetention` by a Platform Admin. The `ServiceDataRetention` setting applies per service and per message type, and controls both entries in the `notifications` table and `csv` contact files uploaded to S3.
Data cleanup is controlled by several tasks in the `nightly_tasks.py` file, kicked off by Celery Beat.
Ask the user to provide the CSV file name: either the CSV file they uploaded, or the one that is autogenerated when they do a one-off send (visible in the UI).
Starting with the admin logs, search for this file name. When you find it, the log line should have the file name linked to the job_id and the csv file location. Save both of these.
In the api logs, search by job_id. Either you will see evidence of the job failing and retrying over and over (in which case search for a stack trace using timestamp), or you will ultimately get to a log line that links the job_id to a message_id. In this case, now search by message_id. You should be able to find the actual result from AWS, either success or failure, with hopefully some helpful info.
1. Either send a message and capture the csv file name, or get a csv file name from a user
2. Using the log tool at logs.fr.cloud.gov, use filters to limit what you're searching on (cf.app is 'notify-admin-production' for example) and then search with the csv file name in double quotes over the relevant time period (last 5 minutes if you just sent a message, or else whatever time the user sent at)
3. When you find the log line, you should also find the job_id and the s3 file location. Save these somewhere.
4. To get the csv file contents, you can run the command above. This command currently prints to the notify-api log, so after you run the command,
you need to search in notify-api-production for the last 5 minutes with the logs sorted by timestamp. The contents of the csv file unfortunately appear on separate lines so it's very important to sort by time.
5. If you want to see where the message actually failed, search with cf.app is notify-api-production using the job_id that you saved in step #3. If you get far enough, you might see one of the log lines has a message_id. If you see it, you can switch and search on that, which should tell you what happened in AWS (success or failure).
### Routes cannot be mapped to destinations in different spaces
During `cf push` you may see
```
For application 'notify-api-sandbox': Routes cannot be mapped to destinations in different spaces
```
:ghost: This indicates a ghost route squatting on a route you need to create. In the cloud.gov web interface, check for incomplete deployments. They might be holding on to a route. Delete them. Also, check the list of routes (from the CloudFoundry icon in the left sidebar) for routes without an associated app. If they look like a route your app would need to create, delete them.
### API request failed
After pushing the Admin app, you might see this in the logs
```
{"name": "app", "levelname": "ERROR", "message": "API unknown failed with status 503 message Request failed", "pathname": "/home/vcap/app/app/__init__.py", ...
This indicates that the Admin and API apps are unable to talk to each other because of either a missing route or a missing network policy. The apps require [container-to-container networking](https://cloud.gov/docs/management/container-to-container/) to communicate. Run `cf network-policies` to list the policies; you should see one connecting the API and Admin apps on port 61443. If not, you can create one manually:
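For example (app names here assume the sandbox apps from the error above; substitute your own):

```sh
cf add-network-policy notify-admin-sandbox notify-api-sandbox --protocol tcp --port 61443
```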