CLI

Usage

waybacktweets

Retrieves CDX data for archived tweets from the Wayback Machine, parses it, and saves the results.

USERNAME: The Twitter username without @.

waybacktweets [OPTIONS] USERNAME

Options

-c, --collapse <collapse>

Collapse results based on a field, or a substring of a field. For timestamp:XX, XX ranges from 1 to 14 and sets how many leading digits of the timestamp field are compared. A value of 4 or more is recommended, so that results are compared at least by year.

Options:

urlkey | digest | timestamp:XX

-f, --from <DATE>

Filter results to those captured on or after this date. Format: YYYYMMDD

-t, --to <DATE>

Filter results to those captured up to this date. Format: YYYYMMDD

-l, --limit <INTEGER>

Maximum number of results to return.

-o, --offset <INTEGER>

Number of results to skip. Combined with --limit, this allows for a simple way to page through the results.

-mt, --matchtype <matchtype>

Return results matching a certain prefix, a certain host, or all subdomains.

Options:

exact | prefix | host | domain

-v, --verbose

Shows the error log.

Arguments

USERNAME

Required argument
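
As a minimal sketch of how the options above combine, the following Python snippet assembles a command line and prints it. The helper `build_command`, the username `jack`, the dates, and the limit are placeholder choices made for illustration; only flags documented above are used, and the snippet does not run the tool itself.

```python
import shlex
from datetime import date


def build_command(username, start=None, end=None, collapse=None, limit=None):
    """Assemble a waybacktweets argument list from the documented options.

    This helper is hypothetical and not part of the package.
    """
    argv = ["waybacktweets"]
    if start is not None:
        # --from expects an 8-digit date: year, month, day.
        argv += ["--from", start.strftime("%Y%m%d")]
    if end is not None:
        argv += ["--to", end.strftime("%Y%m%d")]
    if collapse is not None:
        argv += ["--collapse", collapse]
    if limit is not None:
        argv += ["--limit", str(limit)]
    # USERNAME is the single required argument, given without "@".
    argv.append(username)
    return argv


# Placeholder values, for illustration only.
command = build_command(
    "jack",
    start=date(2015, 1, 1),
    end=date(2015, 12, 31),
    collapse="timestamp:8",
    limit=100,
)
print(shlex.join(command))
```

Building the arguments as a list and joining them with `shlex.join` keeps quoting correct if a value ever contains characters that are special to the shell.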

Collapsing

The Wayback Tweets command-line tool recommends three fields for “collapse”: urlkey, digest, and timestamp.

  • urlkey: (str) A canonical transformation of the URL you supplied, for example, org,eserver,tc)/. Such keys are useful for indexing.

  • digest: (str) The SHA1 hash digest of the content, excluding the headers. It’s usually a base-32-encoded string.

  • timestamp: (datetime) A 14-digit date-time representation in the YYYYMMDDhhmmss format. We recommend comparing by YYYYMMDD, that is, the first 8 digits (timestamp:8).

However, collapsing can be used with other options as well. The text below is extracted from the official Wayback CDX Server API (Beta) documentation.

Note

A new form of filtering is the option to “collapse” results based on a field, or a substring of a field. Collapsing is done on adjacent CDX lines where all captures after the first one that are duplicate are filtered out. This is useful for filtering out captures that are “too dense” or when looking for unique captures.

To use collapsing, add one or more collapse=field or collapse=field:N where N is the first N characters of field to test.
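
To illustrate the adjacent-line behaviour described in the note, here is a small Python sketch that mimics collapsing on a list of timestamps. The function `collapse_adjacent` and the sample timestamps are invented for illustration; the real filtering happens on the CDX server, not in client code like this.

```python
def collapse_adjacent(values, n=None):
    """Keep the first item of each run of adjacent items that share a key.

    The key is the whole value, or its first n characters when n is given,
    mirroring collapse=field and collapse=field:N.
    """
    kept = []
    previous_key = None
    for value in values:
        key = value if n is None else value[:n]
        if key != previous_key:
            kept.append(value)
        previous_key = key
    return kept


# Invented sample timestamps in YYYYMMDDhhmmss format.
timestamps = [
    "20200101093000",
    "20200101180000",
    "20200102070000",
    "20210315120000",
]

by_day = collapse_adjacent(timestamps, 8)   # like collapse=timestamp:8
by_year = collapse_adjacent(timestamps, 4)  # like collapse=timestamp:4
print(by_day)
print(by_year)
```

Because only adjacent lines are compared, a value that reappears later, after a different value in between, is kept again rather than filtered out.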

URL Match Scope

The CDX Server can return results matching a certain prefix, a certain host or all subdomains by using the matchType param.

By default, the waybacktweets package appends the pathname /status followed by the wildcard ‘*’ to the end of the URL, so that only tweets are retrieved. However, if a value is provided for the matchType parameter, the search is made from the URL twitter.com/<USERNAME>.
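
The URL selection described above could be sketched as follows. This is not the package's actual code: the function name `build_cdx_params` is hypothetical, and the exact URL strings are assumptions based on the description. The endpoint shown is the public Wayback CDX search endpoint, which this page does not itself state.

```python
from urllib.parse import urlencode

# Public Wayback CDX search endpoint (not stated in this page).
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_params(username, matchtype=None):
    """Pick the query URL depending on whether a matchType was supplied.

    Hypothetical helper: the URL shapes are inferred from the text above.
    """
    if matchtype is None:
        # No matchType: restrict the search to tweet URLs with a wildcard.
        return {"url": f"twitter.com/{username}/status/*"}
    # matchType given: search from the profile URL and let matchType set the scope.
    return {"url": f"twitter.com/{username}", "matchType": matchtype}


default_params = build_cdx_params("jack")
prefix_params = build_cdx_params("jack", "prefix")

# Keep "/" and "*" unescaped so the printed query stays readable.
print(f"{CDX_ENDPOINT}?{urlencode(default_params, safe='/*')}")
print(f"{CDX_ENDPOINT}?{urlencode(prefix_params, safe='/*')}")
```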

The text below is extracted from the official Wayback CDX Server API (Beta) documentation.

Note

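For example, if given the url: archive.org/about/ and:

  • matchType=exact (default if omitted) will return results matching exactly archive.org/about/

  • matchType=prefix will return results for all results under the path archive.org/about/

  • matchType=host will return results from host archive.org

  • matchType=domain will return results from host archive.org and all subhosts *.archive.org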

The matchType may also be set implicitly by using wildcard ‘*’ at end or beginning of the url:

  • If url ends in ‘/*’, e.g. url=archive.org/* the query is equivalent to url=archive.org/&matchType=prefix

  • If url starts with ‘*.’, e.g. url=*.archive.org/ the query is equivalent to url=archive.org/&matchType=domain

(Note: The domain mode is only available if the CDX is in SURT-order format.)
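
The two wildcard rules above can be expressed as a small Python sketch. The function `resolve_match_type` is invented for illustration and only restates the equivalences listed in the note; the CDX server applies these rules itself, so client code does not need this step.

```python
def resolve_match_type(url):
    """Translate a wildcard url into an explicit (url, matchType) pair.

    Returns None as the matchType when the url carries no wildcard,
    meaning nothing is set implicitly.
    """
    if url.endswith("/*"):
        # Trailing "/*": drop the "*" and use prefix matching.
        return url[:-1], "prefix"
    if url.startswith("*."):
        # Leading "*.": drop the "*." and use domain matching.
        return url[2:], "domain"
    return url, None


print(resolve_match_type("archive.org/*"))
print(resolve_match_type("*.archive.org/"))
print(resolve_match_type("archive.org/about/"))
```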