Setting Up PostgreSQL (EL7)
This scenario will go over the initial setup of a PostgreSQL cluster and give an overview of some important configuration options.
Let's get started!
Setting Up Package Repository
The first step to getting PostgreSQL running on a brand new system is to set up the repositories that contain the latest package versions. For this demo we use CentOS 7, but unfortunately its default repository contains Postgres version 9.2, which has been EOL (end of life) since September 2017 (https://www.postgresql.org/support/versioning).
Thankfully the PostgreSQL community provides an RPM repo that contains the latest version of not only Postgres but many other popular third-party related tools as well (pgadmin, pgbouncer, pgbackrest, pg_partman, etc.): https://www.postgresql.org/download/linux/redhat/
From there, select your desired version and the relevant RPM distro. In this case we want PostgreSQL 12 and CentOS 7 64-bit (x86_64). This will give you the relevant command to run to install the repo.
sudo yum install https://download.postgresql.org/pub/repos/yum/reporpms/EL-7-x86_64/pgdg-redhat-repo-latest.noarch.rpm
Confirm that this is ok and PostgreSQL is now ready to install!
Note there was a recent change regarding how the community repo is organized. Previously each major version had its own repo, but now there is one universal repo per distro. If you're managing existing systems, it's important to make note of this change for future updates.
Next, install the necessary package(s):
sudo yum install postgresql12-server postgresql12-contrib
If you just need the client programs (psql, pg_dump, pg_restore, etc.), the postgresql## package (e.g. postgresql12) can provide that without installing the whole server environment. The server package will pull in this dependency automatically.
The postgresql##-contrib package provides a suite of extra tools for Postgres. Even if you don't know whether you'll need them, it's recommended to install the package so they are readily available. Some popular examples are pg_stat_statements and dblink. More can be found here: https://www.postgresql.org/docs/current/contrib.html
Setting Up The Default PG Instance
With the packages installed, the first thing we need to do is initialize the PostgreSQL cluster. CentOS does not do this for you automatically, but it does provide a script (/usr/pgsql-##/bin/postgresql-##-setup) to set up the default instance for you, which will then be managed by systemd.
To run the script, execute the following statement:
sudo PGSETUP_INITDB_OPTIONS="--data-checksums" /usr/pgsql-12/bin/postgresql-12-setup initdb
Passing the initdb command to the script tells it to initialize the PostgreSQL cluster. We also pass an option to the initdb command in order to enable checksums on our cluster; this feature is highly recommended on all new clusters. Checksums on data pages help to detect corruption by the I/O system that would otherwise be silent. They can currently only be enabled at database initialization, but work is underway to allow enabling them on existing database clusters. https://www.postgresql.org/docs/current/app-initdb.html#APP-INITDB-DATA-CHECKSUMS
Next we enable the service via systemd
sudo systemctl enable postgresql-12
And now we start the database via systemd
sudo systemctl start postgresql-12
After starting a service it's always good to check its status
systemctl status postgresql-12
If you encounter any issues, you can check the syslog for errors related to systemd starting the service (/var/log/messages), or check the PostgreSQL logs (/var/lib/pgsql/##/data/log), where ## is the major version of PostgreSQL (e.g. 11, 12, 13).
Note that prior to Postgres version 10, the log directory was called pg_log.
By default, a postgres system user is created for you by the packages that were installed. Let's change to that user. We're going to do so in the second terminal, named postgres_terminal, and from now on any commands that need to be run by the postgres system user will be run from this terminal. Any other system commands prefixed with sudo will be run from the initial terminal.
sudo -iu postgres
Then, log into the database:
psql
The psql command line is an extremely powerful tool for interacting with PostgreSQL. The full list of commands is always available by typing \? or by referring to the documentation - https://www.postgresql.org/docs/current/app-psql.html
Probably one of the most important initial things to know in psql is how to see where the data directory for the database you're currently logged into is located. The current value for any setting in PostgreSQL can be seen by using the SHOW command. So let's check the data_directory setting:
SHOW data_directory;
PostgreSQL Host Based Authentication (pg_hba.conf)
The pg_hba.conf file, by default located in the data directory, is how connection authentication is managed in PostgreSQL. This covers the incoming connections only, not specific object access within the database itself (the latter is covered via the GRANTs system).
Secure connections via SSL and third-party access management via GSSAPI/Certificate/etc. is also managed via this file. The full capabilities of what can be done via this file can be found in the documentation - https://www.postgresql.org/docs/current/auth-pg-hba-conf.html
The file is always evaluated from the top down and as soon as a match is found, further evaluation is halted. So be careful with the ordering of entries to ensure the proper rules are evaluated in the desired order.
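As a hypothetical illustration of why ordering matters (the addresses and role name below are made up for this example), consider:

```
# pg_hba.conf is evaluated top-down; the first matching line wins.
# This broad rule matches app_user from the entire subnet...
host    all    app_user    192.168.1.0/24     md5
# ...so this stricter rule for a single host is never reached:
host    all    app_user    192.168.1.50/32    reject
```

If the single-host reject rule were intended to take effect, it would need to appear above the broader rule.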
It is recommended to avoid the "trust" authentication method whenever possible, since it allows unfettered access for any connection that matches. At minimum, always require password authentication (md5). To prevent having to manually enter passwords, use a .pgpass file to keep user credentials securely available to clients - https://www.postgresql.org/docs/current/libpq-pgpass.html
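A .pgpass file lives in the connecting user's home directory and must have 0600 permissions or libpq will ignore it. Each line has the form hostname:port:database:username:password, with * allowed as a wildcard. The entries below are illustrative only:

```
# hostname:port:database:username:password
192.168.1.100:5432:training:training:not-a-real-password
*:*:replication:replica_user:not-a-real-password
```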
The example system in this scenario enables peer authentication by default. This means that any system user that has a matching database role can log in without requiring a password. Since the cluster was created by the postgres system user, a postgres role exists in the database. You can see the existing roles via psql by using the \du meta-command:
\du
You can also see the current contents of the pg_hba.conf file from within the database if you are on at least PG10.
SELECT * FROM pg_hba_file_rules;
Note that this shows the actual file's contents, not what may be active within the database. Putting pg_hba.conf changes in place requires reloading the database, which we will do shortly.
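Since a syntax error anywhere in pg_hba.conf causes the reload of that file to be rejected, the same view can be used to spot problems before reloading:

```sql
-- Any row with a non-NULL error column describes a line PostgreSQL cannot parse
SELECT line_number, error
FROM pg_hba_file_rules
WHERE error IS NOT NULL;
```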
An example for adding an entry to allow a password protected connection for replication would be as follows:
host replication replica_user 192.168.1.201/32 md5
The first column controls the connection type, be it TCP, SSL, or local socket.
host means it will be a TCP/IP based connection
The second column controls the databases this rule will apply to.
replication refers to a special database that exists within PostgreSQL to allow streaming replication. Any database(s) can be listed here to control specific access. The special value
all matches all valid databases.
The third column controls which roles this rule will apply to.
replica_user refers to a role with that name that was given the special REPLICATION privilege for this purpose. Again the special
all value can be used here.
The fourth column specifies the client machine address(es) that this record matches. This can be a hostname, a CIDR formatted IP address range or certain special keywords covered in the documentation. If hostnames are being used, ensure reverse name resolution is working properly and performant. Using IP address ranges is recommended.
The fifth column refers to the authentication method. Here
md5 refers to a hashed password method. There are many different methods possible and the above documentation link goes over all of them. The more recent method of
scram-sha-256 is highly recommended over
md5 if supported by the client.
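As a sketch of moving to scram-sha-256 (the role name here is illustrative), note that existing md5 hashes are not converted automatically; passwords must be set again after changing password_encryption so that SCRAM hashes are stored:

```sql
-- Hash passwords with SCRAM from now on
ALTER SYSTEM SET password_encryption = 'scram-sha-256';
SELECT pg_reload_conf();
-- Re-set each role's password so a SCRAM hash is stored
\password replica_user
```

After that, the md5 entries in pg_hba.conf can be changed to scram-sha-256 and the configuration reloaded.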
Reloading the database can be done in two ways. While logged into the database, any superuser can call the pg_reload_conf() function:
SELECT * FROM pg_reload_conf();
Or from the system command line, any user with access to control PostgreSQL via systemd can issue a reload. So in our original root terminal, issue the reload:
sudo systemctl reload postgresql-12
If there are any errors encountered in the pg_hba.conf, the changes will not be applied. You can check the PostgreSQL logs for either a successful SIGHUP or any error messages
sudo bash -c "tail /var/lib/pgsql/12/data/log/postgresql-*.log"
Creating Roles and Databases
As we discussed in the previous step, the default setup for the cluster provided by EL7 based systems is to allow peer based authentication. Back in our training user terminal, let's see what happens if we try to log into the database using just the psql command with no additional options:
psql
The error that returns shows that if you do not tell psql which role you want to log in as (using the -U option), psql will default to trying to log in as a role that matches the current system user. In this case, training.
So let's create a training role from our postgres_terminal so that we can log in:
CREATE ROLE training WITH LOGIN SUPERUSER;
We've added two properties onto this role.
LOGIN is required if you want this role to be able to log in. And we're also just going to make this role a SUPERUSER as well so we can continue using it without having to switch back to the
postgres_terminal all the time to run SQL commands.
Now let's see what happens if we try to log in as training again with no options:
Now we're getting an error about a training database not existing. So if we don't provide a database name (using the -d option), psql will try to log into a database whose name matches the current system user.
psql -d postgres
\q
The peer authentication method only checks that the role name matches, so we can give another database name and log in that way. But to make it even easier, let's create a new database back in the postgres_terminal:
CREATE DATABASE training;
Let's try one last time back on our training terminal with no options:
psql
And it works! We're now taking advantage of the
peer authentication method to log into the database without a password.
We can also set the password for our own role (or other roles if you're a superuser) using the
\password command. It is recommended to use this method vs setting the password during role creation or using the ALTER ROLE command since this keeps the password from getting put in the clear into the history buffer or the logs.
This step was given in this manner since this is an often seen issue when people are first setting up PostgreSQL and cannot figure out why they can't log in as their normal system user. Going through this process hopefully helps you better understand how psql works in any environment.
postgresql.conf Part 1 - Making changes
The postgresql.conf file is the primary location of configuration settings in PostgreSQL. By default it is located at the top level of the data directory. This last section is going to go over the primary settings that you should be most concerned about when first setting up a PostgreSQL cluster. There are many, many more settings and their importance is determined by your particular use-cases.
Several of the settings in postgresql.conf require a restart while others only require a reload. One of the easiest ways to always be able to quickly see if a setting requires a restart or not is to check the
pg_settings catalog, particularly the
context column. A value of
postmaster in this column always indicates a restart is needed.
Now that we have our own user we can log into the database with, we'll continue running SQL commands there
SELECT name, setting, context FROM pg_settings WHERE name IN ('shared_buffers', 'work_mem', 'archive_mode', 'archive_command');
We can also see the current value of any setting here, similar to how we used the SHOW command earlier. The meanings of the context column can be found in the documentation - https://www.postgresql.org/docs/current/view-pg-settings.html. For the settings we checked above, we can see that archive_command and work_mem only require a reload, while archive_mode and shared_buffers require a restart. And work_mem is even more flexible: each individual role can change that setting for themselves for the duration of their session.
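For example, a session-level change to work_mem might look like this (the value is illustrative):

```sql
SET work_mem = '64MB';   -- applies to this session only
SHOW work_mem;
RESET work_mem;          -- revert to the server default
```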
Starting with PostgreSQL 9.4, you can also change
postgresql.conf settings from within the database without having to manually edit the file itself. This is done with the ALTER SYSTEM command. Any changes done by this command are written to a separate file called
postgresql.auto.conf, so if you make use of ALTER SYSTEM, always be sure to check both of these files when reviewing them manually. Settings in
postgresql.auto.conf always override those in
postgresql.conf. An example for enabling WAL archiving is given below. We'll go over these settings more in a future step.
ALTER SYSTEM SET archive_mode = 'on';
ALTER SYSTEM SET archive_command = '/bin/true';
Settings changed with ALTER SYSTEM that require a restart or reload still require that to be done. If you need to restart the database, the best way to do that in a production environment on EL7 is with systemd
\q
sudo systemctl restart postgresql-12
Let's move on to reviewing some important settings. We'll be briefly going over each setting along with a link to its documentation for more information.
postgresql.conf Part 2 - Connections & Memory
These last sections of this scenario are all explanation with no further commands to run. So to make it easier to read, you can enlarge this text section of the lesson by dragging the mid-bar to the right.
Some of these settings will require a restart of the database to change, so it is good to try and set these to your desired values when first setting up your clusters. The values given are what we would recommend for most common scenarios. EL7 does set some of our recommendations by default, but not all systems do.
listen_addresses - Sets the IP address(es) that PostgreSQL listens on. Defaults to localhost only for security. Recommend setting to the server's IP address; * means all interfaces. https://www.postgresql.org/docs/current/runtime-config-connection.html
max_connections - Controls the maximum number of connections allowed to the database. Affected by work_mem (see below). https://www.postgresql.org/docs/current/runtime-config-connection.html#GUC-MAX-CONNECTIONS
shared_buffers - Controls how much memory the server uses for shared memory buffers. 8GB is a good starting point, even with very high RAM availability. If it cannot be set that high, 25% of total memory is a good starting point (https://www.postgresql.org/docs/current/runtime-config-resource.html).
work_mem - Specifies the amount of memory to be used by internal sort operations and hash tables before writing to temporary disk files. 2-5MB is a good starting point. Note that individual queries can use multiple instances of this value, hence why it generally seems low. https://www.postgresql.org/docs/current/runtime-config-resource.html#GUC-WORK-MEM. The PostgreSQL Wiki has a great writeup on how all the memory settings relate to each other and how to better tune this setting - https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server
maintenance_work_mem - Specifies the maximum amount of memory to be used by maintenance operations such as vacuuming and index/constraint creation. 1GB is generally a good starting point and often the ideal setting in most situations. Can be adjusted on a per session basis to improve one-off maintenance task performance. https://www.postgresql.org/docs/current/runtime-config-resource.html#GUC-MAINTENANCE-WORK-MEM
effective_cache_size - Sets the query planner's assumption about the effective size of the disk cache that is available to a single query. 50% RAM is a good starting point, possibly higher if PostgreSQL is the only service running on the server. https://www.postgresql.org/docs/current/runtime-config-query.html#GUC-EFFECTIVE-CACHE-SIZE
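To see how these settings interact, a rough worst-case memory estimate can be sketched. The figures below are illustrative assumptions, not recommendations:

```shell
# Worst case ~= shared_buffers + max_connections * work_mem * work_mem allocations per query
SHARED_BUFFERS_MB=8192   # 8GB shared_buffers
MAX_CONNECTIONS=100
WORK_MEM_MB=4
NODES_PER_QUERY=3        # a complex query can allocate work_mem several times
echo "$(( SHARED_BUFFERS_MB + MAX_CONNECTIONS * WORK_MEM_MB * NODES_PER_QUERY )) MB"
```

With these assumptions the worst case comes to 9392 MB, which is why work_mem defaults low and is raised per session only where larger sorts are needed.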
postgresql.conf Part 3 - Replication & Write Ahead Log (WAL)
hot_standby - Setting for replicas to allow read-only queries to be run on the system. Recent versions of PG default to on. https://www.postgresql.org/docs/current/runtime-config-replication.html#GUC-HOT-STANDBY
wal_level - Sets the level of data included in the Write Ahead Log (WAL) files. Valid values for this setting have changed over major PG versions. It now defaults to replica, which immediately allows the WAL stream to be used for replication. For systems with no replicas, it can be set to minimal to reduce WAL traffic size. https://www.postgresql.org/docs/current/runtime-config-wal.html#GUC-WAL-LEVEL
archive_mode - Enables secondary archiving of WAL files by running
archive_command. Recommended to leave set to
on even if not archiving WAL. Allows future enabling of archiving without database restart. https://www.postgresql.org/docs/current/runtime-config-wal.html#GUC-ARCHIVE-MODE
archive_command - Local shell command that is executed to archive completed WAL files. Any command that returns a true state tells Postgres that it ran successfully and allows the primary database instance to clean up its own WAL stream as needed. If this command fails, the primary will keep all WAL generated since the failure. If not using archiving, or if it needs to be temporarily disabled, just set it to /bin/true. https://www.postgresql.org/docs/current/runtime-config-wal.html#GUC-ARCHIVE-COMMAND
archive_timeout - The archive_command is usually only run on completed WAL files. To limit how old unarchived data can be, this setting forces a switch to a new WAL file after the given number of seconds, triggering the archive_command for the previous, partial one. This is useful for low-write databases to ensure frequent durable storage of the WAL stream. All WAL files are still the same size despite early archiving, so do not set this too low, to avoid bloating WAL storage. Recommend a default value of 60 (1 minute). https://www.postgresql.org/docs/current/runtime-config-wal.html#GUC-ARCHIVE-TIMEOUT
max_wal_senders - Specifies the maximum number of concurrent connections from standby servers or streaming base backup clients. More recent versions of PG default to 10. It is recommended to always have this setting be at least 2 higher than your number of replicas to allow pg_basebackup to run effectively if needed. https://www.postgresql.org/docs/current/runtime-config-replication.html#GUC-MAX-WAL-SENDERS
wal_keep_segments - Specifies the minimum number of past log file segments kept in the pg_wal directory, in case a standby server needs to fetch them for streaming replication. Recommend a default value of 30 to start with whenever there is at least one replica. (As of version 13, this has been renamed to wal_keep_size and is specified in megabytes.)
max_replication_slots - Replication slots are covered in better detail in the
Replication part of this training than can be summarized here. Default value of 10 is fine in most cases. https://www.postgresql.org/docs/current/runtime-config-replication.html#GUC-MAX-REPLICATION-SLOTS
checkpoint_timeout - Maximum time between automatic WAL checkpoints. Checkpoints are points in the sequence of transactions at which it is guaranteed that the heap and index data files have been updated with all information written before that checkpoint. At checkpoint time, all dirty data pages are flushed to disk and a special checkpoint record is written to the log file. Depending on write rate, this setting may need to be adjusted to allow a smoother distribution of I/O when checkpointing occurs. For more information about checkpointing see https://www.postgresql.org/docs/current/wal-configuration.html
checkpoint_completion_target - Specifies the target of checkpoint completion, as a fraction of total time between checkpoints. Recommend setting value to
0.9. This will cause the checkpointing process to try to spread its I/O out over 90% of the time designated by
checkpoint_timeout. So a timeout setting of 10 minutes will cause the checkpointing process to try and use up to 9 minutes to complete its flushing process vs trying to complete all those writes as fast as possible. https://www.postgresql.org/docs/current/runtime-config-wal.html#GUC-CHECKPOINT-COMPLETION-TARGET
max_wal_size - Maximum size to let the total WAL storage to grow to between automatic WAL checkpoints. Hitting this value will force a checkpoint. WAL storage is allowed to grow beyond this size if necessary, but once write traffic goes down, it will reduce itself to this size or lower after the next checkpoint. Defaults to 1GB. https://www.postgresql.org/docs/current/runtime-config-wal.html#GUC-MAX-WAL-SIZE
min_wal_size - As long as WAL disk usage stays below this setting, old WAL files are always recycled for future use at a checkpoint, rather than removed. Recycling WAL files vs recreating them as needed can greatly reduce I/O requirements during the typical database load. Defaults to 80MB. https://www.postgresql.org/docs/current/runtime-config-wal.html#GUC-MIN-WAL-SIZE
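Pulling the WAL and replication recommendations above into one place, a starting postgresql.conf fragment for a primary with at least one replica might look like the sketch below (values are the starting points discussed above, not universal answers):

```
wal_level = replica
archive_mode = on
archive_command = '/bin/true'      # placeholder; replace with a real archiver such as pgbackrest
archive_timeout = 60
max_wal_senders = 10
wal_keep_segments = 30             # wal_keep_size (in MB) as of PG 13
checkpoint_completion_target = 0.9
max_wal_size = 1GB
min_wal_size = 80MB
```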
postgresql.conf Part 4 - Logging
logging_collector - Enables the logging collector background process that captures log messages sent to stderr and redirects them into log files. The RHEL provided instance enables this by default, but databases initialized directly with
initdb will typically not have this enabled by default. https://www.postgresql.org/docs/current/runtime-config-logging.html#GUC-LOGGING-COLLECTOR
log_filename - When
logging_collector is enabled, sets the filename format of the files it creates. Recommended value is
'postgresql-%Y-%m-%d_%H%M%S' to avoid log rotation from overwriting old files with matching names. RHEL defaults to day-of-the-week naming, which means when the same day comes again the next week, it is overwritten, giving you 7 days of log history by default. https://www.postgresql.org/docs/current/runtime-config-logging.html#GUC-LOG-FILENAME
log_min_duration_statement - Causes any queries that took longer than the given number of milliseconds to run to be logged, along with the time duration. Beware setting too low since that can greatly increase logging size. Recommend setting it to log queries that take longer than the expected maximum query time. https://www.postgresql.org/docs/current/runtime-config-logging.html#GUC-LOG-MIN-DURATION-STATEMENT. Along with the
auto_explain contrib module this is a very effective slow query logging method. https://www.postgresql.org/docs/current/auto-explain.html
log_connections - Causes each attempted connection to the server to be logged, as well as successful completion of client authentication. https://www.postgresql.org/docs/current/runtime-config-logging.html#GUC-LOG-CONNECTIONS
log_disconnections - Causes session terminations to be logged, plus the duration of the session. https://www.postgresql.org/docs/current/runtime-config-logging.html#GUC-LOG-DISCONNECTIONS
log_line_prefix - This is a printf-style string that is output at the beginning of each log line. Default is very minimal, so recommend reviewing to ensure critical session data is getting logged. Recommended value to get started is
'%m [%r] [%p]: [l-%l] user=%u,db=%d,e=%e ', which will cause entries similar to this:
2019-07-25 14:22:40.299 EDT [[local]] : [l-1] user=keith,db=keith,e=00000 LOG: duration: 1.505 ms statement: select * from pg_stat_activity;
log_temp_files - Logs any temporary files that exceed the given value in kilobytes. A good starting value for this is the
work_mem value to catch queries that exceed that and spill to disk for query operations. https://www.postgresql.org/docs/current/runtime-config-logging.html#GUC-LOG-TEMP-FILES
log_lock_waits - Controls whether a log message is produced when a session waits longer than deadlock_timeout to acquire a lock. Can potentially cause much larger log files, but very useful when trying to narrow down performance issues related to locking. https://www.postgresql.org/docs/current/runtime-config-logging.html#GUC-LOG-LOCK-WAITS
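Combining the logging recommendations above, a starting fragment might look like this (the thresholds are illustrative and should be tuned to your workload):

```
logging_collector = on
log_filename = 'postgresql-%Y-%m-%d_%H%M%S'
log_min_duration_statement = 5000   # ms; set near your expected maximum query time
log_connections = on
log_disconnections = on
log_line_prefix = '%m [%r] [%p]: [l-%l] user=%u,db=%d,e=%e '
log_temp_files = 2048               # kB; match to your work_mem setting
log_lock_waits = on
```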