API
Request
Requests data from the Wayback Machine API.
- class waybacktweets.api.request.WaybackTweets(username, collapse=None, timestamp_from=None, timestamp_to=None, limit=None, resumption_key=None, matchtype=None)
Class responsible for requesting data from the Wayback CDX Server API.
- Parameters:
username (str) – The username associated with the tweets.
collapse (str, optional) – The field to collapse duplicate lines on.
timestamp_from (str, optional) – The timestamp to start retrieving tweets from.
timestamp_to (str, optional) – The timestamp to stop retrieving tweets at.
limit (int, optional) – The maximum number of results to return.
resumption_key (int, optional) – Key to continue the query from the end of the previous query.
matchtype (str, optional) – Results matching a certain prefix, a certain host or all subdomains.
- get()
Sends a GET request to the Internet Archive’s CDX API to retrieve archived tweets.
- Return type:
Optional[Dict[str, Any]]
- Returns:
The response from the CDX API in JSON format, if successful. Otherwise, None.
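To make the constructor arguments more concrete, the following is a minimal sketch of how they might map onto CDX Server query parameters. The helper name `build_cdx_params`, the URL pattern, and the mapping itself are assumptions for illustration and are not taken from the library's source; the query parameter names (`url`, `from`, `to`, `limit`, `collapse`, `matchType`, `resumeKey`, `showResumeKey`, `output`) are those of the Wayback CDX Server API.

```python
from typing import Any, Dict


def build_cdx_params(
    username,
    collapse=None,
    timestamp_from=None,
    timestamp_to=None,
    limit=None,
    resumption_key=None,
    matchtype=None,
) -> Dict[str, Any]:
    """Hypothetical helper: map the documented arguments to CDX query parameters."""
    # Assumed URL pattern for a user's archived tweets.
    params: Dict[str, Any] = {
        "url": f"https://twitter.com/{username}/status/*",
        "output": "json",
        "showResumeKey": "true",
    }
    optional = {
        "collapse": collapse,
        "from": timestamp_from,
        "to": timestamp_to,
        "limit": limit,
        "resumeKey": resumption_key,
        "matchType": matchtype,
    }
    # Only send the options that were actually provided.
    params.update({key: value for key, value in optional.items() if value is not None})
    return params


params = build_cdx_params("jack", timestamp_from="20150101", limit=10)
```

A dictionary like this could then be passed as the `params` argument of an HTTP GET call; options left as None are omitted from the query.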
Parse
Parses the returned data from the Wayback CDX Server API.
- class waybacktweets.api.parse.TweetsParser(archived_tweets_response, username, field_options)
This class is responsible for the overall parsing of archived tweets.
- Parameters:
archived_tweets_response (List[str]) – The archived tweets response returned by the Wayback CDX Server API.
username (str) – The username associated with the tweets.
field_options (List[str]) – The fields to be included in the parsed data. For more details on each option, visit Field Options.
- _add_field(key, value)
Appends a value to a list in the parsed data structure.
- Parameters:
key (str) – The key in the parsed data structure.
value (Any) – The value to be appended.
- Return type:
None
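The "list per key" structure described above can be sketched as follows. This is only an illustration of the pattern: the module-level `parsed_tweets` container, the `add_field` name, and the field names used in the example are assumptions, not the class's actual attributes.

```python
from collections import defaultdict
from typing import Any, DefaultDict, List

# Assumed container: one list of values per field name.
parsed_tweets: DefaultDict[str, List[Any]] = defaultdict(list)


def add_field(key: str, value: Any) -> None:
    """Append a value to the list stored under `key`, creating the list if needed."""
    parsed_tweets[key].append(value)


# Illustrative field names and values only.
add_field("archived_timestamp", "20200101000000")
add_field("archived_timestamp", "20200102000000")
add_field("archived_statuscode", "200")
```

Each call grows exactly one column, so the columns can end up with different lengths if a field is missing for some tweets.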
- _add_resumption_key()
Adds the resumption key from the last archived tweet response to the parsed tweets.
This method extracts the resumption key from the last item in the archived tweets response list and appends it to the ‘resumption_key’ field in the parsed tweets dictionary. It also prints the resumption key with instructions on how to use it with the ‘limit’ option for continuing the query from the end of the previous query.
- Raises:
ValueError – If the list of archived tweet responses is empty.
- _process_response(response)
Processes the archived tweet’s response and adds the relevant CDX data.
- Parameters:
response (List[str]) – The response from the archived tweet.
- Return type:
None
- parse(print_progress=False)
Parses the archived tweets CDX data and structures it.
- Parameters:
print_progress (bool) – A boolean indicating whether to print progress or not.
- Return type:
Dict[str, List[Any]]
- Returns:
The parsed tweets data.
- class waybacktweets.api.parse.TwitterEmbed(tweet_url)
This class is responsible for parsing tweets using the Twitter Publish service.
- Parameters:
tweet_url (str) – The URL of the tweet to be parsed.
- embed()
Parses the archived tweets when they are still available.
This function goes through each archived tweet and checks if it is still available. If the tweet is available, it extracts the necessary information and adds it to the respective lists. The function returns a tuple of three lists:
The first list contains the tweet texts.
The second list contains boolean values indicating whether each tweet is still available.
The third list contains the URLs of the tweets.
- Return type:
Optional[Tuple[List[str], List[bool], List[str]]]
- Returns:
A tuple of three lists containing the tweet texts, availability statuses, and URLs, respectively. If no tweets are available, returns None.
- class waybacktweets.api.parse.JsonParser(archived_tweet_url)
This class is responsible for parsing tweets when the mimetype is application/json.
Note: This class is in an experimental phase.
- Parameters:
archived_tweet_url (str) – The URL of the archived tweet to be parsed.
- parse()
Parses the archived tweets in JSON format.
- Return type:
str
- Returns:
The parsed tweet text.
Export
Exports the parsed archived tweets.
- class waybacktweets.api.export.TweetsExporter(data, username, field_options)
Class responsible for exporting parsed archived tweets.
- Parameters:
data (Dict[str, List[Any]]) – The parsed archived tweets data.
username (str) – The username associated with the tweets.
field_options (List[str]) – The fields to be included in the exported data. For more details on each option, visit Field Options.
- _create_dataframe()
Creates a DataFrame from the transposed data.
- Return type:
DataFrame
- Returns:
The DataFrame representation of the data.
- static _datetime_now()
Returns the current datetime, formatted as a string.
- Return type:
str
- Returns:
The current datetime.
- static _transpose_matrix(data, fill_value=None)
Transposes a matrix, filling in missing values with a specified fill value if needed.
- Parameters:
data (Dict[str, List[Any]]) – The matrix to be transposed.
fill_value (Optional[Any]) – The value to fill in missing values with.
- Return type:
List[List[Any]]
- Returns:
The transposed matrix.
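One common way to transpose columns of unequal length while padding the gaps is `itertools.zip_longest`. The sketch below shows that technique under the documented signature; it is an assumption about how such a helper could work, not the library's actual implementation, and the sample column names are made up.

```python
from itertools import zip_longest
from typing import Any, Dict, List, Optional


def transpose_matrix(
    data: Dict[str, List[Any]], fill_value: Optional[Any] = None
) -> List[List[Any]]:
    """Turn per-field columns into rows, padding short columns with fill_value."""
    return [list(row) for row in zip_longest(*data.values(), fillvalue=fill_value)]


# Illustrative data: the second column is one item shorter than the first.
columns = {"tweet": ["a", "b", "c"], "status": [200, 404]}
rows = transpose_matrix(columns, fill_value="")
# rows == [["a", 200], ["b", 404], ["c", ""]]
```

The resulting rows, together with the dictionary keys as column names, are the shape a tabular structure such as a DataFrame expects.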
- save_to_csv()
Saves the DataFrame to a CSV file.
- Return type:
None
- save_to_html()
Saves the DataFrame to an HTML file.
- Return type:
None
- save_to_json()
Saves the DataFrame to a JSON file.
- Return type:
None
Visualize
Generates an HTML file to visualize the parsed data.
- class waybacktweets.api.visualize.HTMLTweetsVisualizer(username, json_path, html_file_path=None)
Class responsible for generating an HTML file to visualize the parsed data.
- Parameters:
username (str) – The username associated with the tweets.
json_path (Union[str, List[str]]) – The path of the JSON file or the JSON data itself.
html_file_path (str, optional) – The path where the HTML file will be saved.
- static _json_loader(json_path)
Reads and loads JSON data from a specified file path or JSON string.
- Parameters:
json_path (Union[str, List[str]]) – The path of the JSON file or the JSON data itself.
- Return type:
List[Dict[str, Any]]
- Returns:
The content of the JSON file or data.
- generate()
Generates an HTML string that represents the parsed data.
- Return type:
str
- Returns:
The generated HTML string.
- save(html_content)
Saves the generated HTML string to a file.
- Parameters:
html_content (str) – The HTML string to be saved.
- Return type:
None
Utils
Utility functions for handling HTTP requests and manipulating URLs.
- waybacktweets.utils.utils.check_double_status(wayback_machine_url, original_tweet_url)
Checks whether a Wayback Machine URL contains two occurrences of “/status/” and the original tweet URL does not contain “twitter.com”.
- Parameters:
wayback_machine_url (str) – The Wayback Machine URL to check.
original_tweet_url (str) – The original tweet URL to check.
- Return type:
bool
- Returns:
True if the conditions are met, False otherwise.
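The two conditions can be written as a single boolean expression. The sketch below is a re-implementation from the description only, not the library's source; the `_sketch` suffix marks it as hypothetical, and reading "two occurrences" as "exactly two" is an assumption.

```python
def check_double_status_sketch(wayback_machine_url: str, original_tweet_url: str) -> bool:
    """Return True when both documented conditions hold."""
    # Assumption: "two occurrences" is read as exactly two.
    has_double_status = wayback_machine_url.count("/status/") == 2
    is_external = "twitter.com" not in original_tweet_url
    return has_double_status and is_external


# Illustrative URLs only.
archived = "https://web.archive.org/web/20200101000000/https://twitter.com/user/status/1/status/2"
result_external = check_double_status_sketch(archived, "https://t.co/abc")
result_twitter = check_double_status_sketch(archived, "https://twitter.com/user/status/1")
```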
- waybacktweets.utils.utils.check_pattern_tweet(tweet_url)
Extracts the URL from a tweet URL with patterns such as:
Reply: /status//
Link: /status///
Twimg: /status/https://pbs
- Parameters:
tweet_url (str) – The tweet URL to extract the URL from.
- Return type:
str
- Returns:
Only the extracted URL from a tweet.
- waybacktweets.utils.utils.check_url_scheme(url)
Corrects the URL scheme if it contains more than two slashes following the scheme.
This function uses a regular expression to find ‘http:’ or ‘https:’ followed by two or more slashes. It then replaces this with the scheme followed by exactly two slashes.
- Parameters:
url (str) – The URL to be corrected.
- Returns:
The corrected URL.
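The regular-expression substitution described above can be sketched as follows. The exact pattern is an assumption based on the description ('http:' or 'https:' followed by two or more slashes), not a copy of the library's code, and the function name carries a `_sketch` suffix to mark it as hypothetical.

```python
import re


def check_url_scheme_sketch(url: str) -> str:
    """Collapse any run of two or more slashes after the scheme to exactly two."""
    return re.sub(r"(https?:)/{2,}", r"\1//", url)


# Illustrative inputs only.
fixed = check_url_scheme_sketch("https:////twitter.com/user/status/1")
nested = check_url_scheme_sketch(
    "https://web.archive.org/web/2020/https:///twitter.com/user/status/1"
)
```

Because `re.sub` replaces every match, a Wayback Machine URL that embeds a second scheme has both occurrences normalized in one pass.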
- waybacktweets.utils.utils.clean_tweet_url(tweet_url, username)
Cleans a tweet URL by ensuring it is associated with the correct username.
- Parameters:
tweet_url (str) – The tweet URL to clean.
username (str) – The username to associate with the tweet URL.
- Return type:
str
- Returns:
The cleaned tweet URL.
- waybacktweets.utils.utils.clean_wayback_machine_url(wayback_machine_url, archived_timestamp, username)
Cleans a Wayback Machine URL by ensuring it is associated with the correct username and timestamp.
- Parameters:
wayback_machine_url (str) – The Wayback Machine URL to clean.
archived_timestamp (str) – The timestamp to associate with the Wayback Machine URL.
username (str) – The username to associate with the Wayback Machine URL.
- Return type:
str
- Returns:
The cleaned Wayback Machine URL.
- waybacktweets.utils.utils.delete_tweet_pathnames(tweet_url)
Removes any pathnames from a tweet URL.
- Parameters:
tweet_url (str) – The tweet URL to remove pathnames from.
- Return type:
str
- Returns:
The tweet URL without any pathnames.
- waybacktweets.utils.utils.get_response(url, params=None)
Sends a GET request to the specified URL and returns the response.
- Parameters:
url (str) – The URL to send the GET request to.
params (dict, optional) – The parameters to include in the GET request.
- Return type:
Tuple[Optional[Response], Optional[str], Optional[str]]
- Returns:
The response from the server.
- Raises:
ReadTimeoutError – If a read timeout occurs.
ConnectionError – If a connection error occurs.
HTTPError – If an HTTP error occurs.
EmptyResponseError – If the response is empty.
- waybacktweets.utils.utils.is_tweet_url(twitter_url)
Checks if the provided URL is a Twitter status URL.
This function checks if the provided URL contains “/status/” exactly once, which is a common pattern in Twitter status URLs.
- Parameters:
twitter_url (str) – The URL to check.
- Return type:
bool
- Returns:
True if the URL is a Twitter status URL, False otherwise.
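The "exactly once" check amounts to counting substring occurrences. A minimal sketch, written from the description rather than from the library's source (the `_sketch` name is hypothetical and the URLs are illustrative):

```python
def is_tweet_url_sketch(twitter_url: str) -> bool:
    """Return True when "/status/" appears exactly once in the URL."""
    return twitter_url.count("/status/") == 1


single = is_tweet_url_sketch("https://twitter.com/user/status/1234567890")
profile = is_tweet_url_sketch("https://twitter.com/user")
doubled = is_tweet_url_sketch("https://twitter.com/user/status/1/status/2")
```

Note that this test is purely textual: any URL containing one “/status/” segment passes, whatever its host.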
- waybacktweets.utils.utils.semicolon_parser(string)
Replaces semicolons in a string with %3B.
- Parameters:
string (str) – The string to replace semicolons in.
- Return type:
str
- Returns:
The string with semicolons replaced by %3B.
- waybacktweets.utils.utils.timestamp_parser(timestamp)
Parses a timestamp into a formatted string.
- Parameters:
timestamp (str) – The timestamp string to parse.
- Returns:
The parsed timestamp in the format “%Y/%m/%d %H:%M:%S”, or None if the timestamp could not be parsed.
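A conversion of this kind can be sketched with `datetime.strptime` and `strftime`. The input format is an assumption: Wayback Machine timestamps are normally 14 digits (YYYYMMDDhhmmss), but the formats the library actually accepts are not stated here. The `_sketch` name marks the function as hypothetical.

```python
from datetime import datetime
from typing import Optional


def timestamp_parser_sketch(timestamp: str) -> Optional[str]:
    """Convert a 14-digit Wayback timestamp to "%Y/%m/%d %H:%M:%S", or return None."""
    try:
        # Assumed input layout: YYYYMMDDhhmmss.
        parsed = datetime.strptime(timestamp, "%Y%m%d%H%M%S")
    except ValueError:
        return None
    return parsed.strftime("%Y/%m/%d %H:%M:%S")


formatted = timestamp_parser_sketch("20230615123045")
invalid = timestamp_parser_sketch("not-a-timestamp")
```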
Config
Configuration module.
Manages global configuration settings throughout the application.