Post by TheCracksOverhead on May 7, 2021 16:15:48 GMT
I'm not an expert, but I've used the Wayback Machine enough that I feel that I could write a short guide for it.
Here are a few tips before we begin.
- Tips -
• Bring archived links back to life.
If you're lucky, you can bring an uncaptured link that you see on an archived page back to life just by trying the live version of the link.
This is especially useful if you're looking for exes, mp3s, zips, etc.
Just because the live site doesn't have links to something anymore doesn't mean that it isn't still there at its original location.
• Try out different captures of the same page.
Websites can change owners, get redesigned, or completely change the information on their pages at any time.
Looking at different captures of the same page can show you something that was once there before being deleted.
• Read everything. Click on links. Don't be afraid to poke around.
If you've never explored a certain website before, look around and get to know it and its content.
There may be things that you see or read that could give you hints on where to find what you're looking for.
• Check the green redirect captures.
Sometimes websites like to make pages redirect to other pages.
Most of the time what you're looking for will be under a blue circle capture, but sometimes there will only be green redirect captures for a page.
It's a good idea to check these redirect captures, as they might actually take you to the page you need.
• Save stuff now, prevent headache later.
If something important is still up on a website, whether it be a page or even just individual files like images, zips, mp4s, etc., then you can make sure it isn't lost by using the Wayback Machine's Save Page Now feature: archive.org/web/
• Be prepared to go deep.
Oftentimes you'll see links that lead to a completely different website than the one you're looking at. This is where the rabbit holes start.
Eventually you could end up several layers deep and be overwhelmed by all of the pages you need to keep track of. This is why it's good practice to open links in new tabs.
Just be sure that you know how to keep all of your tabs organized too. I use the browser extension Tab Session Manager to help with this.
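Two of the tips above (trying the live version of a link, and using Save Page Now) come down to rewriting a url. Here's a rough Python sketch of both. The function names are made up for this example, the regular expression is only a guess at the usual shape of a capture url (a timestamp, sometimes followed by a modifier like "if_"), and the "/save/" link form is an assumption about how Save Page Now can be reached directly, so treat it as a starting point rather than anything official.

```python
import re

# Assumed shape of a Wayback capture url: host, "/web/", a numeric
# timestamp, and an optional modifier such as "if_" or "id_".
WAYBACK_PREFIX = re.compile(
    r"^(?:https?://)?web\.archive\.org/web/\d+(?:[a-z]+_)?/", re.IGNORECASE
)
SCHEME = re.compile(r"^[a-z][a-z0-9+.-]*://", re.IGNORECASE)


def live_url(archived_url: str) -> str:
    """Strip the Wayback prefix so the original link can be tried live."""
    original = WAYBACK_PREFIX.sub("", archived_url, count=1)
    if original == archived_url:
        return archived_url  # not a capture url, so leave it alone
    if not SCHEME.match(original):
        original = "http://" + original  # the capture url omitted the scheme
    return original


def save_page_now_url(target_url: str) -> str:
    """Build a Save Page Now link for a page or file that is still online.

    Assumption: opening web.archive.org/save/<url> requests a capture.
    """
    return "https://web.archive.org/save/" + target_url


print(live_url("https://web.archive.org/web/20040612083000/http://example.com/files/demo.zip"))
print(save_page_now_url("http://example.com/files/demo.zip"))
```

Nothing here touches the network. It only builds the links, so you still have to open them in a browser to see whether the file is really there.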
- YouTube -
• There's a url that everybody should know about if they want to instantly check if a YouTube video has been captured by the Wayback Machine or not:
web.archive.org/web/2oe_/http://wayback-fakeurl.archive.org/yt/VIDEOIDHERE
Just edit this link to have the video's ID at the end. The ID is the string of random-looking characters after the "v=" part of the url (up to the next "&", if there is one).
If it's archived, you will be presented with the video file itself, which is usually .mp4 or .flv. With .mp4 just right-click the video and click "Save video as..." For .flv it should automatically try to save the file.
• If a video's watch page is not archived, try making sure that the url you're putting in the Wayback Machine doesn't have any unnecessary garbage at the end of it, like "&list=playlistidhere".
Just "watch?v=videoidhere" should be enough. Alternatively, if the plain url doesn't show any results, try the opposite and add one of these extra parameters to the end of the url, like "&feature=related" for example, since the page may only have been captured under that longer url.
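If you check a lot of videos, the url editing described above can be scripted. This is a minimal Python sketch: it pulls the ID out of the "v=" parameter, builds the video-file link from the pattern given in this post, and builds a stripped-down watch url. The function names, the "https://" in front of the links, and the "www.youtube.com" host are assumptions made for the example, and the video ID is a made-up placeholder.

```python
from urllib.parse import urlparse, parse_qs

# Link pattern quoted in this post; the video ID goes at the end.
WAYBACK_YT = "https://web.archive.org/web/2oe_/http://wayback-fakeurl.archive.org/yt/"


def video_id(url: str) -> str:
    """Return the value of the "v=" parameter from a watch url."""
    query = parse_qs(urlparse(url).query)
    if "v" not in query:
        raise ValueError("no v= parameter in " + url)
    return query["v"][0]


def archived_video_url(url: str) -> str:
    """Link to check for a captured copy of the video file itself."""
    return WAYBACK_YT + video_id(url)


def clean_watch_url(url: str) -> str:
    """Watch url with everything except "v=" removed."""
    return "https://www.youtube.com/watch?v=" + video_id(url)


messy = "https://www.youtube.com/watch?v=abc123XYZ_-&list=PL123&feature=related"
print(archived_video_url(messy))
print(clean_watch_url(messy))
```

Paste the first printed link into your browser to look for the video file, and the second one into the Wayback Machine to look for the watch page.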
- Checking For Embedded Media -
This is for when you want to see whether the page you're exploring on the Wayback Machine has any media such as images, sound files, Flash files (swf), Shockwave files (dcr/dir), etc.
I don't know how to do this with Chrome because I don't use Chrome, but if you're using Firefox, go to Tools->Page Info (shortcut is Ctrl+I) and click on the Media tab.
This will give you a list of all the media that's in the current page. Just make sure the page has finished loading before doing this.
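If you aren't on Firefox, a browser-independent fallback is to save the page's HTML and scan it yourself. The sketch below is not the same thing as the Page Info window (it won't see media added by scripts or CSS, for one), and the tag list is just a guess at what old sites tend to use. It only relies on Python's built-in html.parser module; the class and function names are made up for this example.

```python
from html.parser import HTMLParser

# Tags whose "src" attribute usually points at a media file.
MEDIA_TAGS = {"img", "embed", "audio", "video", "source", "bgsound"}


class MediaLister(HTMLParser):
    """Collects media urls from the tags of an HTML document."""

    def __init__(self):
        super().__init__()
        self.media = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # tag and attribute names arrive lowercased
        if tag in MEDIA_TAGS and attrs.get("src"):
            self.media.append(attrs["src"])
        elif tag == "object" and attrs.get("data"):
            self.media.append(attrs["data"])
        elif tag == "param" and (attrs.get("name") or "").lower() in ("movie", "src"):
            # Old Flash/Shockwave embeds often name the file in a <param>.
            if attrs.get("value"):
                self.media.append(attrs["value"])


def list_media(html: str) -> list:
    """Return the media urls found in the given HTML, in document order."""
    parser = MediaLister()
    parser.feed(html)
    return parser.media


page = '<img src="logo.gif"><object><param name="movie" value="menu.swf"></object>'
print(list_media(page))
```

The urls come out exactly as they are written in the HTML, so relative ones still need to be joined with the page's address before you can look them up.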
- The File/Page Listing Goldmine -
This is the part of the Wayback Machine that everyone needs to know about.
Did you know that you can get a list of every single page and file that the Wayback Machine has captured for a particular website?
It's pretty amazing. Sometimes you can find stuff that you had no idea was even on a site!
There are two ways of doing this: the easy but limited way, and the advanced way.
• The Easy Way
This method will only give you up to 100,000 results, and sometimes certain links will be left out of the results for whatever reason.
It also takes some time for the results to load depending on how many there are, and it only shows the results for the current (sub)domain.
All you have to do is put a * at the end of the link that you're entering into the Wayback Machine. An example of this would be "example.com*" or "subdomain.example.com*".
You can even search inside of specific sections/folders of a site. Let's say you notice that an image's url is located at "example.com/images/12345.png".
If you enter in "example.com/images/*" then you will get a listing of all captured links inside of the "images" folder. This includes subfolders.
The results are paginated, so look at the bottom of the page and click the numbers to see more of the results.
You can also filter the results by using the "Filter results" text box. This is useful if you're looking for a specific kind of file or page name.
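For reference, here's a small Python sketch of the same idea. The first function builds a direct link to a listing, assuming the address the site uses for these searches has the form "web.archive.org/web/*/" followed by your prefix and a trailing "*". The second is only a local stand-in for the "Filter results" box that works on a list of urls you've already copied out; it is not how the site itself filters. Both function names are made up for this example.

```python
def listing_url(prefix: str) -> str:
    """Build a link to the listing of captured urls under a prefix.

    Assumption: listings live at web.archive.org/web/*/<prefix>*
    """
    return "https://web.archive.org/web/*/" + prefix.rstrip("*") + "*"


def filter_results(urls, text):
    """Keep only the urls that contain the given text (case-insensitive)."""
    text = text.lower()
    return [u for u in urls if text in u.lower()]


print(listing_url("example.com/images/"))

copied = ["example.com/images/12345.png", "example.com/images/readme.txt"]
print(filter_results(copied, ".png"))
```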
• The Advanced Way: Using the CDX Search
The CDX Search will allow you to get all available results instead of just up to 100,000.
You use it by editing a url to include the parameters that you want to search by. Most of it kind of goes over my head, so I only use it when I need to.
Here's a link to its documentation: github.com/internetarchive/wayback/blob/master/wayback-cdx-server/README.md
Here are some example links that I use:
web.archive.org/cdx/search/cdx?url=*.example.com&fl=original&collapse=urlkey&showNumPages=false&page=0 (Shows everything for the base domain and all subdomains.)
web.archive.org/cdx/search/cdx?url=example.com/*&fl=original&collapse=urlkey&showNumPages=false&page=0 (Shows everything for just the current (sub)domain.)
Take note of the "page" part of the url. The CDX Search is also paginated and this is how you switch between pages, starting at an index of 0.
"showNumPages" can be true or false, and when it's set to true it will give you a number that represents the total number of pages there are for your query.
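Flipping through the pages by hand gets old fast, so here's a minimal Python sketch of how the paging could be scripted. It only uses the parameters from the example links above ("url", "fl", "collapse", "showNumPages" and "page"). The function names are made up, and fetch_all assumes that a "showNumPages=true" request returns nothing but the page count and that each page is plain text with one url per line, so check that against the documentation before relying on it.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_url(pattern, page=0, count_pages=False):
    """Build a CDX Search url like the example links in this post."""
    params = {"url": pattern, "fl": "original", "collapse": "urlkey"}
    if count_pages:
        params["showNumPages"] = "true"  # ask for the page count only
    else:
        params["showNumPages"] = "false"
        params["page"] = str(page)  # pages start at index 0
    # Keep "*" and "/" readable instead of percent-encoding them.
    return CDX_ENDPOINT + "?" + urlencode(params, safe="*/")


def fetch_all(pattern):
    """Yield every result for a pattern, page by page (uses the network).

    Assumption: the count request returns a bare integer.
    """
    with urlopen(cdx_url(pattern, count_pages=True)) as resp:
        total = int(resp.read().decode().strip())
    for page in range(total):
        with urlopen(cdx_url(pattern, page=page)) as resp:
            for line in resp.read().decode("utf-8", "replace").splitlines():
                if line:
                    yield line


print(cdx_url("*.example.com"))
print(cdx_url("example.com/*", count_pages=True))
```

Only the two print lines run when you execute this as-is. Call fetch_all("example.com/*") yourself when you actually want the results, and be patient with big sites since every page is a separate request.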
I hope this helps!