[ACCEPTED] Clone file links upload

pullvideos

Member
YetiShare User
Dec 15, 2013
88
0
6
This gives users the ability to clone any file link from the site to their own account.
A very useful function that avoids users re-uploading the same files.
Thanks!
 

mfa9884569

New Member
YetiShare User
Dec 22, 2014
6
1
1
Re: Clone file links upload

It would be very good to have this function:

1. Detect the hash of the uploading file and check whether it already exists on the file server. If it was uploaded before, just point the new file entry at the existing stored file (see the sketch after this list). This would let an upload finish in a second or two even if the file is huge.
2. It saves a lot of server space. Keeping many copies of the same file is a big waste of storage.
3. Members can copy files from another member's share to their own account, so they keep their own copy in case the original file is deleted.
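Point 1 is basically a lookup keyed by the hash on the server side. Just as an illustration (not YetiShare's actual code, only a toy JavaScript sketch with an in-memory index):

Code:
// Toy sketch of hash-based deduplication, not YetiShare code.
// filesByHash maps a content hash to the physical path of the stored file.
const filesByHash = new Map();

// Called when a client announces the hash of a file it wants to upload.
// Returns true if the upload can be skipped because the content already exists.
function registerUpload(userId, fileName, hash, userFiles) {
  const existingPath = filesByHash.get(hash);
  if (existingPath) {
    // Content already stored: just create a new file record pointing at it.
    userFiles.push({ owner: userId, name: fileName, path: existingPath });
    return true; // tell the client there is no need to upload
  }
  return false; // unknown hash, the client must upload the file
}

// After a real upload finishes, remember where the content was stored.
function recordStoredFile(hash, physicalPath) {
  filesByHash.set(hash, physicalPath);
}

When the hash is already known, only a new file record is written and the upload is skipped entirely, which is where the storage and bandwidth savings come from.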
 

enricodias4654

Member
YetiShare User
Jan 13, 2015
411
1
16
Re: Clone file links upload

If you put links from the same site on the remote upload form, the script will clone the file link (but not the file).

The script already checks the md5 and doesn't allow two identical files to exist. But it would be a nice feature to duplicate files across the free space of several servers for redundancy. That would remove the need for RAID on the file servers and make it possible to balance the load.

Checking the md5 of a file before the upload would need software on the client side, and I'm not sure whether that is possible with JavaScript in the browser. Even then, uploads won't finish in one second: large files take many seconds just to calculate the md5.
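For what it's worth, reading a file piece by piece in the browser is possible with the plain File API (file.slice plus FileReader); a hashing library would sit on top of a loop like this. The '#fileInput' id is just a placeholder for an upload form field:

Code:
// Read a user-selected file in 2MB slices with the File API.
// '#fileInput' is a placeholder for an <input type="file"> element on the page.
const file = document.querySelector('#fileInput').files[0];
const chunkSize = 2 * 1024 * 1024;

function readChunk(offset) {
  if (offset >= file.size) {
    console.log('done reading', file.size, 'bytes');
    return;
  }
  const reader = new FileReader();
  reader.onload = function (e) {
    const bytes = e.target.result; // ArrayBuffer holding this slice
    // a hashing library would consume 'bytes' here
    readChunk(offset + chunkSize);
  };
  reader.readAsArrayBuffer(file.slice(offset, offset + chunkSize));
}

readChunk(0);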
 

ruslan5467

New Member
YetiShare User
YetiShare Supporter
Oct 16, 2015
26
2
3
Re: Clone file links upload

+1
VirusTotal also works like this:
1. The client side calculates the file hash BEFORE uploading and sends it to the server.
2. The server checks the hash.
3. a) If the file already exists, the server just copies the file link to the new owner and replies to the client that there is no need to upload the file; the upload is effectively done.
b) Otherwise the upload starts as normal.

Most recent file hosting software uses this algorithm for uploads: Chinese cloud systems, Russian ones, European ones and others. They save up to 40% of storage and incoming traffic.

This feature is useful and highly desirable.
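As a rough sketch, the client side of that handshake could look like the following. The /api/check-hash and /api/upload endpoints are made up for illustration, and computeHash() stands in for whatever md5 routine ends up being used (SparkMD5 comes up further down the thread):

Code:
// Sketch of the "hash first, upload only if unknown" handshake.
// computeHash(file) is assumed to return a Promise resolving to the file's md5;
// /api/check-hash and /api/upload are hypothetical endpoints.
async function uploadWithDedup(file, computeHash) {
  const hash = await computeHash(file);

  const check = await fetch('/api/check-hash', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ hash: hash, name: file.name, size: file.size }),
  });
  const result = await check.json();

  if (result.exists) {
    // The server already has this content and has linked it to the account.
    return result.fileUrl;
  }

  // Unknown hash: fall back to a normal upload.
  const form = new FormData();
  form.append('file', file);
  const upload = await fetch('/api/upload', { method: 'POST', body: form });
  return (await upload.json()).fileUrl;
}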
 

adam

Administrator
Staff member
Dec 5, 2009
2,046
108
63
Re: Clone file links upload

ruslan5467 said:
+1
VirusTotal also works like this:
1. The client side calculates the file hash BEFORE uploading and sends it to the server.
2. The server checks the hash.
3. a) If the file already exists, the server just copies the file link to the new owner and replies to the client that there is no need to upload the file; the upload is effectively done.
b) Otherwise the upload starts as normal.

Most recent file hosting software uses this algorithm for uploads: Chinese cloud systems, Russian ones, European ones and others. They save up to 40% of storage and incoming traffic.

This feature is useful and highly desirable.
It's entirely possible; the issue comes when calculating the hash. For large files it can take a huge amount of time if done in the client's browser. The script already won't store duplicate files, so storage isn't the benefit here, but bandwidth usage would be.

I've had a few people ask if there's anything that can be done here, so we'll take a look for the next release. It may be that we initially only do it for files smaller than, say, 100MB, with the option to increase that in the admin area. Or we could store a hash of the first 100MB and use that for the check, although there's a (slight) chance the first 100MB of a larger file would match a different file entirely.

Maybe a check on the filesize & 100MB hash would do it?
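Roughly, that check would be a fingerprint made of the file size plus the md5 of at most the first 100MB. A sketch using the browser File API and the SparkMD5 library that comes up later in the thread (readSlice() is just a small FileReader wrapper):

Code:
// Fingerprint a file by its size plus the md5 of at most its first 100MB.
// readSlice() wraps FileReader; SparkMD5 is the library mentioned later in the thread.
const PARTIAL_LIMIT = 100 * 1024 * 1024;

function readSlice(file, start, end) {
  return new Promise(function (resolve, reject) {
    const reader = new FileReader();
    reader.onload = function (e) { resolve(e.target.result); };
    reader.onerror = reject;
    reader.readAsArrayBuffer(file.slice(start, end));
  });
}

async function partialFingerprint(file) {
  const head = await readSlice(file, 0, Math.min(file.size, PARTIAL_LIMIT));
  return {
    size: file.size,                      // guards against files sharing a prefix
    md5: SparkMD5.ArrayBuffer.hash(head), // md5 of the first 100MB (or the whole file)
  };
}

The size field is what catches the (slight) chance of two different files sharing the same first 100MB, although two files that are the same size and only differ after the first 100MB would still collide.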
 

enricodias4654

Member
YetiShare User
Jan 13, 2015
411
1
16
Re: Clone file links upload

adam said:
ruslan5467 said:
+1
VirusTotal also works like this:
1. The client side calculates the file hash BEFORE uploading and sends it to the server.
2. The server checks the hash.
3. a) If the file already exists, the server just copies the file link to the new owner and replies to the client that there is no need to upload the file; the upload is effectively done.
b) Otherwise the upload starts as normal.

Most recent file hosting software uses this algorithm for uploads: Chinese cloud systems, Russian ones, European ones and others. They save up to 40% of storage and incoming traffic.

This feature is useful and highly desirable.
It's entirely possible; the issue comes when calculating the hash. For large files it can take a huge amount of time if done in the client's browser. The script already won't store duplicate files, so storage isn't the benefit here, but bandwidth usage would be.

I've had a few people ask if there's anything that can be done here, so we'll take a look for the next release. It may be that we initially only do it for files smaller than, say, 100MB, with the option to increase that in the admin area. Or we could store a hash of the first 100MB and use that for the check, although there's a (slight) chance the first 100MB of a larger file would match a different file entirely.

Maybe a check on the filesize & 100MB hash would do it?
It would take a while to calculate the md5 on the client, but it will always be faster than uploading the file. The script could check the md5 while another thread uploads the file, and if the md5 matches a file already on the server, the upload is cancelled. In my tests with the script I sent you via tickets, my 5-year-old laptop took 10 seconds to calculate the md5 of a 200MB file using 50% of the CPU. I added a setTimeout call to reduce the CPU usage.
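A sketch of that overlap: start the upload straight away, hash in parallel, and abort the upload if the server reports a match. The endpoints are the same hypothetical ones as in the earlier sketch, and computeHash() is again a stand-in:

Code:
// Start the upload immediately, hash in parallel, abort the upload if the
// server already has the content. Endpoints and computeHash() are stand-ins.
async function uploadWhileHashing(file, computeHash) {
  const controller = new AbortController();

  const form = new FormData();
  form.append('file', file);
  const uploadPromise = fetch('/api/upload', {
    method: 'POST',
    body: form,
    signal: controller.signal,
  }).catch(function (err) {
    if (err.name === 'AbortError') return null; // cancelled on purpose
    throw err;
  });

  // Hash while the upload is running, then ask the server about it.
  const hash = await computeHash(file);
  const check = await fetch('/api/check-hash', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ hash: hash, size: file.size }),
  });

  if ((await check.json()).exists) {
    controller.abort(); // duplicate found, stop wasting bandwidth
    return 'duplicate';
  }
  return uploadPromise; // unique file, let the upload finish
}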
 

adam

Administrator
Staff member
Dec 5, 2009
2,046
108
63
Agreed that this sounds like the way forward. 10 seconds for a 200MB file is actually quicker than I'd expected. Which library did you use for the test?
 

enricodias4654

Member
YetiShare User
Jan 13, 2015
411
1
16
SparkMD5 (https://github.com/satazor/SparkMD5) with a 50ms interval between chunks and a 2MB buffer. Without the 50ms interval the CPU goes to 100% and it takes only 5 seconds.
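I don't have the exact script to hand, but an incremental SparkMD5 loop along those lines looks roughly like this; the 2MB chunk size and the 50ms pause are the values above:

Code:
// Incremental md5 with SparkMD5: 2MB chunks with a 50ms pause between them
// so the hashing doesn't peg the CPU while the page stays responsive.
function md5File(file, done) {
  const chunkSize = 2 * 1024 * 1024;
  const spark = new SparkMD5.ArrayBuffer();
  const reader = new FileReader();
  let offset = 0;

  reader.onload = function (e) {
    spark.append(e.target.result);   // feed this chunk into the running hash
    offset += chunkSize;
    if (offset < file.size) {
      setTimeout(readNext, 50);      // throttle; remove the delay for full CPU speed
    } else {
      done(spark.end());             // hex md5 of the whole file
    }
  };

  function readNext() {
    reader.readAsArrayBuffer(file.slice(offset, offset + chunkSize));
  }

  readNext();
}

Called as md5File(someFile, function (hash) { ... }), it hands back the hex md5 once the last chunk has been appended.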
 

ruslan5467

New Member
YetiShare User
YetiShare Supporter
Oct 16, 2015
26
2
3
I've created a test file on Linux with the command:
Code:
dd if=/dev/urandom of=100mp.test bs=1024 count=102400
I chose random data to make sure the hashing isn't trivially easy on the CPU.
Then I uploaded it to VirusTotal twice.
The first time it took 6 seconds for "Calculating hash" and 53 seconds for the actual upload here:
https://www.virustotal.com/ru/file/1a5fcbf4235baa4f6d5e169de65319db40034e6ac901588c02a7603a2bba4726/analysis/

The second time hashing was slightly faster, 5.2 seconds (probably because of the operating system's disk cache), and there was nothing to upload.
So in fact I saved almost a minute.
I store an online video collection, so hashing every video file could cost a lot of CPU and time.
So probably your idea:
Maybe a check on the filesize & 100MB hash would do it?
could do the trick and also save time calculating hashes for big files. But I would make your method a bit more elaborate (sketched below):
  • Calculate the file size.
  • If the file size is < 100MB, hash the whole file and send the hash to the server side.
  • If the file size is > 100MB, hash the first 1MB, the last 1MB, and the first 1MB of every 100MB block.
  • Hash the concatenated hashes (to avoid a complex data structure on the server side) and send that to the server side.

This method could probably speed up hashing a big video collection to close to the HDD's linear read speed. It would also mean I no longer have to keep track of what I have already uploaded: the software would do that job for me and upload only the missing files.
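As a sketch of that partial-hash idea (again with SparkMD5 and the same FileReader-based readSlice() helper as in the earlier sketch; the 1MB and 100MB figures are the ones above):

Code:
// Partial hash: first 1MB, last 1MB and the first 1MB of every 100MB block,
// then an md5 over the concatenated per-slice hashes.
const MB = 1024 * 1024;

async function partialHash(file, readSlice) {
  if (file.size < 100 * MB) {
    // Small file: hash the whole thing.
    return SparkMD5.ArrayBuffer.hash(await readSlice(file, 0, file.size));
  }

  const sliceHashes = [];
  // First 1MB of every 100MB block (this covers the very first 1MB too).
  for (let offset = 0; offset < file.size; offset += 100 * MB) {
    const buf = await readSlice(file, offset, Math.min(offset + MB, file.size));
    sliceHashes.push(SparkMD5.ArrayBuffer.hash(buf));
  }
  // Last 1MB of the file.
  const tail = await readSlice(file, Math.max(0, file.size - MB), file.size);
  sliceHashes.push(SparkMD5.ArrayBuffer.hash(tail));

  // Hash of hashes, so the server only ever stores a single string.
  return SparkMD5.hash(sliceHashes.join(''));
}

Sending the file size along with this fingerprint, as suggested earlier, would still be worthwhile, since two different files could in principle share every sampled slice.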
 

enricodias4654

Member
YetiShare User
Jan 13, 2015
411
1
16
The easiest way is to calculate the hash of the entire file. It would also be easy to retrofit, since the script already has the md5 of every file.

The ISPs here offer 0.5Mbps for upload. There are users with 100Mbps download and a ridiculous 0.5Mbps upload, so in my case calculating the hash will always be faster than uploading the file.

In the future, calculating the hash while another thread does the upload would be the best solution.