Optimization of PostgreSQL ETL Processing Pipeline
I’m looking for suggestions on creating two different postgresql.conf configurations: one optimized for writing and one optimized for reading.
Context
I ETL dozens of very large (multi-GB) text files into PostgreSQL 11. The process is repeated twice a month for some datasets, and monthly for other datasets. Then I replicate the local database tables to a cloud-based server for subscribers to access.
Step 01: Local Processing Server
This is a 27-inch iMac with 64 GB of RAM and a 1 TB SSD.
I have an instance of PostgreSQL 11 with its data directory stored on the SSD. This is the "Local SSD Instance".
At this stage I take raw text files and insert the records into PostgreSQL (using either the fantastic pgLoader or my own custom Python code). Then there’s significant post-processing: indexing of fields; many UPDATES; cross-referencing data; and geocoding records.
This process is entirely hardware-bound, so I wrote what I believe is a write-optimized postgresql.conf file (included at the end of this post). My goal is to tune PostgreSQL for fast writes (minimal WAL, long checkpoint intervals, etc.). At this stage I don’t care about crash protection.
In addition, during the ETL process, I use table-creation options such as the following (a combined example is sketched after the list):
- low FILLFACTOR (“in heavily updated tables smaller fillfactors are appropriate.”)
- CREATE UNLOGGED TABLE
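For instance, combining these two options would look roughly like this (the table and columns are made up purely for illustration):

CREATE UNLOGGED TABLE staging_records (
    record_id   bigint,
    raw_line    text,
    geocoded    boolean DEFAULT false
) WITH (fillfactor = 70);  -- leave free space on each page for the heavy UPDATE phase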
Is there a way to run the entire PostgreSQL instance from RAM, perform all operations (raw data loading, indexing, field enrichment) there, and only write the data files back to the SSD once finished?
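For illustration, the kind of approach I have in mind is a RAM-disk data directory for the duration of the load, along these lines on macOS (sizes, paths, and port numbers are placeholders, and I have not verified that this is practical with only 64 GB of RAM):

# Create a ~40 GB RAM disk (ram:// sizes are in 512-byte blocks)
DEVICE=$(hdiutil attach -nomount ram://83886080)
diskutil erasevolume HFS+ "pgram" $DEVICE

# Initialize a temporary cluster on the RAM disk and run the ETL against it
initdb -D /Volumes/pgram/pgdata
pg_ctl -D /Volumes/pgram/pgdata -o "-p 5434" start

# ... load, index, and post-process on port 5434 ...

# When finished, copy the results back to the SSD-backed instance
pg_dump -p 5434 etl_db | psql -p 5432 etl_db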
Step 02: Transfer to Local Repository
Once a large table is loaded, indexed, and post-processed in Step 01 above, I replicate it to a second local PostgreSQL 11 instance whose data directory is stored on an external 16 TB Thunderbolt 3 disk, the "Local Thunderbolt 3 Instance". My iMac does not have enough SSD storage to hold all of the tables.
This is the command I use to transfer a table from "Local SSD Instance" to "Local Thunderbolt 3 Instance":
pg_dump postgresql://<Username>:<Password>@<Source_IP>:<Source_Port>/<Source_Database> -t '<Table_Name>_*' | psql postgresql://<Username>:<Password>@<Target_IP>:<Target_Port>/<Target_Database>
One problem here is that every index is recreated after the rows themselves are transferred, and the re-indexing takes a lot of time. Is there another way to transfer the tables from the "Local SSD Instance" to the "Local Thunderbolt 3 Instance" without rebuilding every index?
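Would a directory-format dump restored with parallel jobs be a step in the right direction? As I understand it, the indexes are still rebuilt, but several can be built at once. A rough sketch with placeholder names and job counts:

pg_dump -Fd -j 4 -f /tmp/<Table_Name>_dump -t '<Table_Name>_*' postgresql://<Username>:<Password>@<Source_IP>:<Source_Port>/<Source_Database>

pg_restore -j 4 -h <Target_IP> -p <Target_Port> -U <Username> -d <Target_Database> /tmp/<Table_Name>_dump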
To be clear, I don’t want to transfer tables directly from the "Local SSD Instance" to the remote server: I need a local repository of all tables, because there will be multiple cloud-based instances with different variations of the individual loaded tables.
I read somewhere that transferring a table to a different server effectively "compacts" the data, so that I don’t need to run VACUUM on the target table. Is this correct, or do I still need to run VACUUM after the indexes are rebuilt?
Step 03: Replicate Data to Cloud-Based Server
I have set up a DigitalOcean Droplet with 48 GB of memory and a 960 GB disk plus 1.66 TB of attached storage, running Ubuntu 18.04.2 x64. It runs PostgreSQL 11 as well.
While the data is being loaded, this server’s postgresql.conf also needs to be write-optimized, because I have to transfer over 100 tables totalling more than 1.5 TB of data as quickly as possible. In this step I take individual tables from the "Local Thunderbolt 3 Instance" and transfer them to the Droplet.
The table transfer command I use is:
pg_dump postgresql://<Username>:<Password>@<Source_IP>:<Source_Port>/<Source_Database> -t '<Table_Name>_*' | psql postgresql://<Username>:<Password>@<Target_IP>:<Target_Port>/<Target_Database>
As in Step 02, each index is recreated after the rows are transferred, and the re-indexing takes a lot of time. How can I speed up the transfer of tables to the remote server?
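Would it be better to stage a compressed dump on the external disk, copy it up, and then run a parallel restore on the Droplet itself, so the index builds are not competing with the network transfer? A rough sketch (paths, host names, and job counts are placeholders):

# On the iMac: dump to a compressed custom-format archive on the Thunderbolt disk
pg_dump -Fc -Z 5 -f /Volumes/TB3/dumps/<Table_Name>.dump -t '<Table_Name>_*' postgresql://<Username>:<Password>@<Source_IP>:<Source_Port>/<Source_Database>

# Copy the archive to the Droplet
rsync -av /Volumes/TB3/dumps/<Table_Name>.dump <User>@<Droplet_IP>:/var/tmp/

# On the Droplet: restore with parallel jobs
pg_restore -j 8 -d <Target_Database> /var/tmp/<Table_Name>.dump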
Step 04: Optimize Cloud-Based Server for Reading
Once all the tables have been transferred, I need to change the postgresql.conf file and use a read-optimized configuration.
The read-optimized configuration needs to provide the fastest possible response rates to PostgreSQL queries. Any suggestions?
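For reference, this is the kind of starting point I have in mind for the read-optimized file on the 48 GB Droplet; the values below are untested guesses on my part, not measured settings:

# Read-optimized postgresql.conf sketch (untested guesses for a 48 GB, SSD-backed server):
shared_buffers = 12GB                  # roughly 25% of RAM
effective_cache_size = 36GB            # roughly 75% of RAM, planner hint only
work_mem = 64MB                        # per sort/hash operation, per connection
maintenance_work_mem = 2GB
random_page_cost = 1.1                 # SSD-backed storage
effective_io_concurrency = 200
default_statistics_target = 500
max_parallel_workers_per_gather = 4
autovacuum = on                        # keep statistics fresh even on read-mostly data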
My Questions:
- How do I create a write-optimized postgresql.conf file?
- How do I create a read-optimized postgresql.conf file?
- How can I transfer tables to a different server without re-indexing them on the target?
Thanks.
# Write-optimized postgresql.conf file:
#
# WAL:
wal_level = minimal
max_wal_size = 10GB
wal_buffers = 16MB
archive_mode = off
max_wal_senders = 0
#
# Memory:
shared_buffers = 1280MB
temp_buffers = 800MB
work_mem = 400MB
maintenance_work_mem = 640MB
dynamic_shared_memory_type = posix
#
autovacuum = off
bonjour = off
checkpoint_completion_target = 0.9
default_statistics_target = 1000
effective_cache_size = 4GB
#
synchronous_commit = off
#
random_page_cost = 1
seq_page_cost = 1
Tags: ssd, performance, database, postgresql, replication