Use the W&B Python SDK to construct artifacts from W&B Runs. You can add files, directories, URIs, and files from parallel runs to artifacts. After you add a file to an artifact, save the artifact to the W&B Server or your own private server. Each artifact is associated with a run.
For information on how to track external files, such as files stored in Amazon S3, see the Track external files page.
Construct an artifact
Construct a W&B Artifact in three steps:
- Create an artifact Python object with wandb.Artifact()
- Add one or more files to the artifact
- Save your artifact to the W&B server
1. Create an artifact Python object with wandb.Artifact()
Initialize the wandb.Artifact() class to create an artifact object. Specify the following parameters:
- Name: The name of your artifact. The name should be unique, descriptive, and easy to remember.
- Type: The type of artifact. The type should be simple, descriptive, and correspond to a single step of your machine learning pipeline. Common artifact types include 'dataset' or 'model'.
Artifacts cannot share the same name, regardless of type. In other words, you cannot create an artifact named cats of type dataset and another artifact named cats of type model.
You can optionally provide a description and metadata when you initialize an artifact object. For more information on available attributes and parameters, see the wandb.Artifact Class definition in the Python SDK Reference Guide.
Copy and paste the following code snippet to create an artifact object. Replace the <name> and <type> placeholders with your own values:
import wandb
# Create an artifact object
artifact = wandb.Artifact(name="<name>", type="<type>")
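Optionally, include a description and metadata when you create the artifact. The following is a minimal sketch; the name, description, and metadata values are placeholders:
import wandb

# The name, description, and metadata values below are placeholders.
artifact = wandb.Artifact(
    name="cats-dataset",
    type="dataset",
    description="Labeled cat images for the classifier example",
    metadata={"num_images": 1200, "source": "internal-collection"},
)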
2. Add one or more files to the artifact
Add files, directories, external URI references (such as Amazon S3) and more to your artifact object.
To add a single file, use the artifact object’s Artifact.add_file() method:
artifact.add_file(local_path="path/to/file.txt", name="<name>")
To add a directory, use the Artifact.add_dir() method:
artifact.add_dir(local_path="path/to/directory", name="<name>")
See the next section, Add files to an artifact, for more information on how to add different file types to an artifact.
3. Save your artifact to the W&B server
Save your artifact to the W&B server. Use the run object’s wandb.Run.log_artifact() method to save the artifact.
with wandb.init(project="<project>", job_type="<job-type>") as run:
    run.log_artifact(artifact)
When to use wandb.Run.log_artifact() or Artifact.save()
- Use wandb.Run.log_artifact() to create a new artifact and associate it with a specific run.
- Use Artifact.save() to update an existing artifact without creating a new run, as sketched after this list.
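For example, a minimal sketch of updating an existing artifact's metadata through the Public API and persisting the change with Artifact.save(); the artifact path and metadata key are placeholders:
import wandb

# Fetch an existing artifact through the Public API; no new run is started.
api = wandb.Api()
artifact = api.artifact("<entity>/<project>/<artifact-name>:latest")

# Update a field and persist the change to the existing artifact.
artifact.metadata["notes"] = "updated without creating a new run"  # placeholder key/value
artifact.save()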
Putting this all together, the following code snippet demonstrates how to create a dataset artifact, add a file to the artifact, and save the artifact to W&B:
import wandb
artifact = wandb.Artifact(name="<name>", type="<type>")
artifact.add_file(local_path="path/to/file.txt", name="<name>")
artifact.add_dir(local_path="path/to/directory", name="<name>")
with wandb.init(project="<project>", job_type="<job-type>") as run:
    run.log_artifact(artifact)
Each time you log an artifact with the same name and type, W&B creates a new version of that artifact. For more information, see Create a new artifact version.
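As a rough sketch, logging under the same artifact name and type from two runs produces consecutive versions (the project and artifact names below are placeholders):
import wandb
from pathlib import Path

# Log under the same artifact name and type from two separate runs.
for i in range(2):
    Path("data.txt").write_text(f"iteration {i}\n")  # contents change between runs
    with wandb.init(project="<project>") as run:
        artifact = wandb.Artifact(name="example-dataset", type="dataset")
        artifact.add_file("data.txt")
        run.log_artifact(artifact)  # the first run creates v0, the second v1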
W&B performs wandb.Run.log_artifact() uploads asynchronously for performance. This can cause surprising behavior when logging artifacts in a loop. For example:
with wandb.init() as run:
    for i in range(10):
        a = wandb.Artifact(
            name="race",
            type="dataset",
            metadata={
                "index": i,
            },
        )
        # ... add files to artifact a ...
        run.log_artifact(a)
The artifact version v0 is NOT guaranteed to have an index of 0 in its metadata because artifacts may be logged in an arbitrary order.
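If the version order needs to match the loop order, one option (at the cost of upload parallelism) is to block on each artifact before logging the next. A sketch using Artifact.wait(), which blocks until the logged artifact is committed:
import wandb

with wandb.init() as run:
    for i in range(10):
        a = wandb.Artifact(name="race", type="dataset", metadata={"index": i})
        # ... add files to artifact a ...
        run.log_artifact(a)
        a.wait()  # block until this artifact is committed before logging the next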
Add files to an artifact
The following sections demonstrate how to add different types of objects to an artifact. Assume you have a directory with the following structure as you read through the examples:
root-directory
|-- hello.txt
|-- images/
|   |-- cat.png
|   |-- dog.png
|-- checkpoints/
|   |-- model.h5
|-- models/
|   |-- model.h5
Add a single file
Use wandb.Artifact.add_file() to add a single local file to an artifact. Provide the local path to the file as the local_path parameter:
import wandb
# Initialize an artifact object
artifact = wandb.Artifact(name="<name>", type="<type>")
# Add a single file
artifact.add_file(local_path="path/file.format")
For example, suppose you have a file called 'hello.txt' in your local working directory:
artifact.add_file("hello.txt")
The artifact now contains a single file, hello.txt.
Optionally, pass a different name to the name parameter to rename the file within the artifact object itself. Continuing the previous example:
artifact.add_file(
local_path="hello.txt",
name="new/path/hello_world.txt"
)
The artifact now also contains the file at new/path/hello_world.txt.
The following table shows how different API calls produce different artifact contents:
| API Call | Resulting artifact |
|---|---|
| artifact.new_file('hello.txt') | hello.txt |
| artifact.add_file('model.h5') | model.h5 |
| artifact.add_file('checkpoints/model.h5') | model.h5 |
| artifact.add_file('model.h5', name='models/mymodel.h5') | models/mymodel.h5 |
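The new_file() call in the first row creates a file directly inside the artifact and adds it in a single step, so there is no separate local file to pass in. A minimal sketch, continuing the artifact object from above:
# Create and write a file inside the artifact in one step.
with artifact.new_file("hello.txt", mode="w") as f:
    f.write("hello world")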
Add multiple files
Use the wandb.Artifact.add_dir() method to add multiple files from a local directory to an artifact. Provide the local path to the directory as the local_path parameter.
import wandb
# Initialize an artifact object
artifact = wandb.Artifact(name="<name>", type="<type>")
# Add a local directory to the artifact
artifact.add_dir(local_path="path/to/directory", name="optional-prefix")
The following table shows how different API calls produce different artifact contents:
| API Call | Resulting artifact |
|---|---|
| artifact.add_dir('images') | cat.png<br>dog.png |
| artifact.add_dir('images', name='images') | images/cat.png<br>images/dog.png |
Add a URI reference
Artifacts track checksums and other information for reproducibility if the URI has a scheme that the W&B library knows how to handle.
Add an external URI reference to an artifact with the wandb.Artifact.add_reference() method. Replace the 'uri' string with your own URI. Optionally pass the desired path within the artifact for the name parameter.
# Add a URI reference
artifact.add_reference(uri="uri", name="optional-name")
Artifacts currently support the following URI schemes:
- http(s)://: A path to a file accessible over HTTP. The artifact will track checksums in the form of ETags and size metadata if the HTTP server supports the ETag and Content-Length response headers.
- s3://: A path to an object or object prefix in S3. The artifact will track checksums and versioning information (if the bucket has object versioning enabled) for the referenced objects. Object prefixes are expanded to include the objects under the prefix, up to a maximum of 10,000 objects.
- gs://: A path to an object or object prefix in GCS. The artifact will track checksums and versioning information (if the bucket has object versioning enabled) for the referenced objects. Object prefixes are expanded to include the objects under the prefix, up to a maximum of 10,000 objects.
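For example, a sketch of referencing an S3 prefix; the bucket and prefix below are placeholders that match the table that follows:
import wandb

artifact = wandb.Artifact(name="animals", type="dataset")
# Nothing is uploaded to W&B; only checksums, sizes, and version metadata
# for the objects under the prefix are recorded.
artifact.add_reference(uri="s3://my-bucket/images")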
The following table shows how different API calls produce different artifact contents:
| API call | Resulting artifact contents |
|---|---|
| artifact.add_reference('s3://my-bucket/model.h5') | model.h5 |
| artifact.add_reference('s3://my-bucket/checkpoints/model.h5') | model.h5 |
| artifact.add_reference('s3://my-bucket/model.h5', name='models/mymodel.h5') | models/mymodel.h5 |
| artifact.add_reference('s3://my-bucket/images') | cat.png<br>dog.png |
| artifact.add_reference('s3://my-bucket/images', name='images') | images/cat.png<br>images/dog.png |
Add files to artifacts from parallel runs
For large datasets or distributed training, multiple parallel runs might need to contribute to a single artifact.
import wandb
import time
# This example uses Ray to run jobs in parallel
# for demonstration purposes.
import ray
ray.init()
artifact_type = "dataset"
artifact_name = "parallel-artifact"
table_name = "distributed_table"
parts_path = "parts"
num_parallel = 5
# Each batch of parallel writers should have its own
# unique group name.
group_name = "writer-group-{}".format(round(time.time()))
@ray.remote
def train(i):
    """
    Our writer job. Each writer adds one table to the artifact.
    """
    with wandb.init(group=group_name) as run:
        artifact = wandb.Artifact(name=artifact_name, type=artifact_type)

        # Add data to a wandb table.
        table = wandb.Table(columns=["a", "b", "c"], data=[[i, i * 2, 2**i]])

        # Add the table to a folder in the artifact
        artifact.add(table, "{}/table_{}".format(parts_path, i))

        # Upserting the artifact creates or appends data to the artifact
        run.upsert_artifact(artifact)

# Launch your runs in parallel
result_ids = [train.remote(i) for i in range(num_parallel)]

# Join on all the writers to make sure their files have
# been added before finishing the artifact.
ray.get(result_ids)

# Once all the writers are finished, finish the artifact
# to mark it ready.
with wandb.init(group=group_name) as run:
    artifact = wandb.Artifact(artifact_name, type=artifact_type)

    # Create a "PartitionedTable" pointing to the folder of tables
    # and add it to the artifact.
    artifact.add(wandb.data_types.PartitionedTable(parts_path), table_name)

    # finish_artifact finalizes the artifact, disallowing future "upserts"
    # to this version.
    run.finish_artifact(artifact)
The following code snippet shows how to use the W&B Public API to list the files in a run, including their names and URLs. Replace the <entity/project/run-id> placeholder with your own values:
from wandb.apis.public.files import Files
from wandb.apis.public.api import Api

# Initialize the Public API client
api = Api()

# Example run object
run = api.run("<entity/project/run-id>")

# Create a Files object to iterate over files in the run
files = Files(api.client, run)

# Iterate over files
for file in files:
    print(f"File Name: {file.name}")
    print(f"File URL: {file.url}")
    print(f"Path to file in the bucket: {file.direct_url}")
See the File Class for more information on available attributes and methods.